Most organizations use databases to efficiently capture, store, organize, and read day-to-day operational data such as customer and transaction details. However, databases are not the optimal solution for every data processing task. This session will highlight when data lakes can offer a strong alternative to traditional databases and data warehouses. Data lakes are now a standard part of successful organizations’ information systems, primarily because they frequently offer greater flexibility than databases.
Data has a life cycle. If data is first written to files (such as computer log files), a data lake architecture makes the data in those files immediately readable using SQL. Because many people first encounter SQL when using databases, they assume that SQL requires the data to be in a database. With a data lake architecture, however, SQL can be used to get immediate answers from data held in collections of files. The data may be in its raw form, or transformed into file formats optimized for fast retrieval. Data lakes also support analysis using artificial intelligence and machine learning on the same data.
For this session, we will use an open-source technology to read data from log files stored on Backblaze B2 Cloud Storage. The open-source tool we will highlight is Trino, a powerful SQL query engine. We will use the publicly available Backblaze Drive Stats data as our sample data set. Drive Stats is a public, open data set containing over 9 years of daily metrics, including drive failures, on all hard drives in Backblaze’s cloud storage infrastructure. Currently, Drive Stats comprises over 300 million records, consuming 90 GB of storage in CSV format, with over 200,000 records, or 75 MB of data, added every day.
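To give a feel for the scale of the Drive Stats figures quoted above, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers stated in this description, treats each "over ..." figure as if it were exact, and assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes), so the results are approximations rather than measured values. The variable and function names are illustrative only.

```python
# Figures quoted in the session description. Each is stated as a lower
# bound ("over ..."), so everything derived here is approximate.
TOTAL_RECORDS = 300_000_000
TOTAL_BYTES = 90 * 10**9      # 90 GB, assuming decimal gigabytes
DAILY_RECORDS = 200_000
DAILY_BYTES = 75 * 10**6      # 75 MB, assuming decimal megabytes


def bytes_per_record(total_bytes: int, records: int) -> float:
    """Average CSV size of one record, in bytes."""
    return total_bytes / records


# Average row size across the whole data set versus the rows added each day.
historical_row_size = bytes_per_record(TOTAL_BYTES, TOTAL_RECORDS)
current_row_size = bytes_per_record(DAILY_BYTES, DAILY_RECORDS)

# Projected growth over one year at the current daily rate.
yearly_growth_gb = DAILY_BYTES * 365 / 10**9
yearly_records = DAILY_RECORDS * 365

print(f"Average bytes per record (whole data set): {historical_row_size:.0f}")
print(f"Average bytes per record (daily additions): {current_row_size:.0f}")
print(f"Approximate growth per year: {yearly_growth_gb:.1f} GB, "
      f"{yearly_records:,} records")
```

On these rounded inputs, the rows being added today come out larger on average than the historical average, and the data set grows by roughly a quarter of its current size each year. Both conclusions depend on the rounding of the quoted figures and should be read as rough estimates.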
In the presentation, we will share the data engineering experience we gained in working with the Drive Stats data, as well as the insights we gained from being able to run analytical queries on the entire data set for the first time. Attendees will be given access to the Backblaze Drive Stats data, publicly hosted on B2 Cloud Storage, for their own use. Using endpoint connections that we will share, attendees will be able to get their own test environments up and running and gain hands-on experience of the ease and power of getting business insights from a data lake.
Speaker
Greg Hamer
Data and Application Architecture Specialist @Backblaze
Greg is a specialist in data and application architecture. Greg's passion is explaining complex technology in easily understandable terms to diverse audiences. Greg has been a speaker at more than a dozen technical conferences on topics that include systems architecture, software development, programming, and data modeling. In addition, Greg was adjunct faculty in computer science at North Carolina State University. Greg has worked for over 20 years as a developer and a professional technical trainer, teaching developers and architects complex server-side technologies, software engineering, and cloud development. Greg is a certified AWS Champion Technical Trainer and, as a former AWS employee, regularly delivered training and presented at AWS technical summits. Greg holds 8 AWS technical certifications. Areas of expertise in AWS training include cloud architecture, databases, data analytics, data warehousing, microservice programming, and various additional areas of developer programming. Greg has also held certifications from Microsoft, Sybase, and Brightcove.