Optimal Data Storage Choices - Data Lakes vs Databases

 

Most organizations use databases to efficiently capture, store, organize, and read day-to-day operational data such as customer and transaction details. However, databases are not the optimal solution for every data processing task. This session will highlight when data lakes can offer a strong alternative to traditional databases and data warehouses. Data lakes are now a standard part of successful organizations’ information systems, primarily because they frequently offer greater flexibility than databases.

Data has a life cycle. If data is first written to files (such as computer log files), a data lake architecture makes the data in those files immediately readable using SQL. Because many people first encounter SQL when using databases, they assume that SQL requires the data to be in a database. However, with a data lake architecture, SQL can be used to get immediate answers from data in collections of files. The data may be in its raw form, or transformed into file formats optimized for fast retrieval. At the same time, data lakes support analysis using artificial intelligence and machine learning.

For this session, we are going to use an open-source technology to read data from log files stored on Backblaze B2 Cloud Storage. The open-source tool we will highlight is Trino, a powerful SQL query engine. We will use the publicly available Backblaze Drive Stats data as our sample data set. Drive Stats is a public, open data set containing over 9 years of daily metrics, including drive failures, on all hard drives in Backblaze’s cloud storage infrastructure. Currently, Drive Stats comprises over 300 million records, consuming 90 GB of storage in CSV format, with over 200,000 records, or 75 MB of data, added every day.

In the presentation, we will share the data engineering experience we gained in working with the Drive Stats data, as well as the insights we gained by running analytical queries on the entire data set for the first time. Attendees will be provided access to the Backblaze Drive Stats data for their own use, publicly hosted on B2 Cloud Storage. Using endpoint connections that we will share, attendees will be able to get their own test environments up and running and gain hands-on experience of the ease and power of getting business insights from a data lake.
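To give a feel for what "SQL over files" can look like, here is a minimal sketch in Python. It is not the session's actual material. It assumes a Trino server is already configured with a catalog that exposes the Drive Stats files as a table. The table name, catalog, schema, host, and user below are all hypothetical placeholders. The column names (date, model, failure) are those of the published Drive Stats CSV files, and the query further assumes the table exposes date as a DATE and failure as an integer.

```python
"""Sketch: a SQL query over Drive Stats files through Trino (names are hypothetical)."""


def build_failure_query(table: str, year: int) -> str:
    """Return SQL that counts drive days and failures per model for one year.

    Assumes `date` is exposed as a DATE column and `failure` as an integer.
    """
    # Only allow simple dotted identifiers, since the name is interpolated into SQL.
    if not table.replace(".", "").replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return (
        "SELECT model, COUNT(*) AS drive_days, SUM(failure) AS failures\n"
        f"FROM {table}\n"
        f'WHERE year("date") = {int(year)}\n'
        "GROUP BY model\n"
        "ORDER BY failures DESC\n"
        "LIMIT 10"
    )


def run_query(sql, host, port=8080, user="attendee",
              catalog="hive", schema="drivestats"):
    """Run the query on a Trino server and return all rows.

    Every connection value here is a placeholder; the real endpoint details
    are whatever your own Trino deployment uses.
    """
    # Third-party client, imported lazily: pip install trino
    from trino.dbapi import connect

    conn = connect(host=host, port=port, user=user,
                   catalog=catalog, schema=schema)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


if __name__ == "__main__":
    # Printing the SQL needs no server; run_query() needs a reachable Trino host.
    print(build_failure_query("drivestats", 2022))
```

The point of the sketch is that the client only ever sends SQL. Whether the rows come from raw CSV files or from a converted columnar format in object storage is a matter of how the Trino catalog is configured, not of the query itself.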


Speaker

Greg Hamer

Data and Application Architecture Specialist @Backblaze

Greg is a specialist in data and application architecture. Greg's passion is explaining complex technology in easily understandable terms to diverse audiences. Greg has been a speaker at more than a dozen technical conferences on topics that include systems architecture, software development, programming, and data modeling. In addition, Greg was adjunct faculty at North Carolina State University (computer science). Greg has worked for over 20 years as a developer and a professional technical trainer, teaching developers and architects about complex server-side technologies, software engineering, and cloud development. Greg is a certified AWS Champion Technical Trainer and, as a former AWS employee, regularly delivered training and presented at AWS technical summits. Greg holds 8 AWS technical certifications. Areas of expertise in AWS training include cloud architecture, databases, data analytics, data warehousing, microservice programming, and various additional areas of developer programming. Greg has also held certifications from Microsoft, Sybase, and Brightcove.

Session Sponsored By

Scale applications and distribute services globally with simple, S3-compatible cloud object storage.

Date

Monday Oct 24 / 05:25PM PDT (50 minutes)

From the same track

Session

Solutions Track Session 1

Details coming soon.

Session

Our journey into high performance and reliable document databases with RavenDB

Monday Oct 24 / 04:10PM PDT

When I started at Kobo, we needed to look beyond the relational and into document databases.

Trevor Hunter

Chief Technology Officer @Kobo Inc.

Session

Building Agile Data Architectures in Support of Digital Twins and Data Products

Monday Oct 24 / 02:55PM PDT

Agile software development and elastic cloud foundations have enabled on-demand expansion of compute functions from real-time processing to Machine Learning at scale but Data has been left behind.

Stuart Sim

Leader @Build by McKinsey

Session

Service abstractions to cloud service providers: A tale of trade-offs

Monday Oct 24 / 01:40PM PDT

 

Oscar Mullin

Sr. Tech Director and Head of Core Platform Services, Databases, Operational Excellence, and SRE @Mercado Libre

Session

Is Web3 Here to Stay?

Monday Oct 24 / 11:50AM PDT

You may be familiar with the current reputation of web3, but are you up to date on the advantages of distributed ledgers applied in the real world? Join us for a deeper dive into how companies are using this innovative technology today.

Richard Bair

VP of Software Engineering @Hedera

Session

Bringing green, sustainable software solutions into the enterprise

Monday Oct 24 / 10:35AM PDT

 

Adam Jordan

Distinguished Software Engineer @Shell