Algorithms Behind Modern Storage Systems

What You’ll Learn

1. Hear about storage solutions that are optimized for reads or for writes, and which of them better fits various databases.

2. Learn what B-trees and LSM-trees are, and what the benefits are of using one over the other.

3. Find out how to evaluate various storage systems to see which one better fits the problem at hand.

Abstract

In the world of Big Data, it’s important to know how storage systems work in order to pick the right tool for the job. The talk covers modern storage system approaches, storage internals, and evaluation techniques for choosing a database with the read, write, or memory overhead best suited to your data.

Question:

What's the focus of the work that you're doing today?

Answer:

I can tell you about Apache Cassandra and the patches I was working on recently. In Apache Cassandra, I've recently been working on transaction replication. Before that, I was working on SASI (an implementation of secondary indexes), on the commit log, and on various storage and consistency related things. I had a chance to work on most of the Apache Cassandra subsystems. In Cassandra (or in Apache projects in general) people often don’t specialize, meaning that you work not only on one small subsection but on the project as a whole. A few people specialize in complex subsystems such as compaction, but that is not very common. Usually, you get to work on the database as a whole. And this is pretty much what I was lucky enough to do.

Question:

What are you going to focus on in your talk?

Answer:

I'm going to focus on the distinction between the two storage types that I think are most prevalent at the moment: immutable and mutable storage. It seems that over the years, as storage systems evolved, the database community concentrated more on mutable storage (the storage that was more suitable for spinning disks, like B-trees). Right now, people tend to move to something that works slightly better for SSDs (meaning LSM-trees).

The main subject of the talk is to describe what B-trees (or B+-trees) are and what LSM-trees are, and then to give a rule of thumb for evaluating any paper that you read. There are algorithms that try to optimize for reads, others for writes, and others for storage overhead. I'm going to include several metrics you can use to find a good balance between these three things: read optimization, write optimization, and storage overhead.

I’ll evaluate the two storage systems that I describe throughout the talk and summarize what we've discussed.
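
As a rough illustration of the mutable/immutable distinction and the three overheads mentioned in this answer, here is a minimal Python sketch. It is not taken from the talk: the class names `InPlaceStore` and `AppendOnlyStore`, the `flush_threshold` parameter, and the counters are all hypothetical, and a sorted in-memory list stands in for a B-tree page while tuples stand in for immutable on-disk files.

```python
import bisect


class InPlaceStore:
    """Mutable storage stand-in: one sorted structure, updated in place."""

    def __init__(self):
        self.keys = []
        self.values = []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite the existing entry
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def stored_entries(self):
        return len(self.keys)


class AppendOnlyStore:
    """Immutable storage stand-in: writes are buffered in memory and then
    flushed as sorted runs that are never modified afterwards."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # assumed, tiny for the demo
        self.memtable = {}
        self.runs = []  # oldest run first; each run is a sorted tuple of pairs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.runs.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search newest to oldest: the first hit is the current value.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def stored_entries(self):
        return len(self.memtable) + sum(len(run) for run in self.runs)


mutable = InPlaceStore()
immutable = AppendOnlyStore(flush_threshold=2)
for store in (mutable, immutable):
    store.put("a", 1)
    store.put("b", 2)
    store.put("a", 3)  # update of an existing key
    store.put("c", 4)

print(mutable.get("a"), mutable.stored_entries())
print(immutable.get("a"), immutable.stored_entries(), len(immutable.runs))
```

Both stores return the latest value for `"a"`, but the in-place store holds three entries while the append-only store holds four spread over two runs. In this toy model, the extra entry is storage overhead and the extra run is read overhead, paid in exchange for writes that never modify existing data.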

Question:

So is this talk really all about LSM trees?

Answer:

Even though Cassandra is using LSM-trees, it doesn't mean that I'm going to bash B-tree storage. First of all, this is not the point of the talk, and second, you can't really say that LSM-trees are superior to B-trees or vice versa. They are used for different purposes, in different databases, maybe even at different times. So it's just going to be a summary to help people's understanding rather than an attempt to make them all fans of whatever was picked for Apache Cassandra.

Question:

Is this an academic talk or a practical talk?

Answer:

I will try to be more practical. I’ll include details on why people pick a certain block size, why compaction is important in LSM-trees, what sort of maintenance you should be aware of in B-trees, things like that. I will try my best to include as many practical details as possible without sacrificing precision. The main part of the talk describes what these data structures are, because knowing the tradeoffs without knowing how the structures actually work might be even less useful than the other way around. So I will try to keep the balance between the two.
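
To make the remark about compaction more concrete, here is a small Python sketch of the general idea: several sorted, immutable runs are merged into one, and only the newest value of each key is kept. This is an illustration under simplifying assumptions and not Cassandra's actual compaction code. The function name `compact` and the in-memory list representation of a run are hypothetical, and deletes (tombstones) are left out.

```python
import heapq


def compact(runs):
    """Merge sorted runs into one sorted run, keeping the newest value per key.

    `runs` is a list of runs ordered oldest first; each run is a list of
    (key, value) pairs sorted by key, with unique keys inside a run.
    """
    # Tag every entry with the negated age of its run, so that for equal
    # keys the entry from the newest run sorts first in the merge.
    tagged = (
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(runs)
    )
    result = []
    for key, _, value in heapq.merge(*tagged):
        if result and result[-1][0] == key:
            continue  # an older version of a key that was already emitted
        result.append((key, value))
    return result


older = [("a", 1), ("b", 2)]
newer = [("a", 3), ("c", 4)]
compacted = compact([older, newer])
print(compacted)
```

After compaction, a read has one run to search instead of two, and the shadowed value of `"a"` no longer takes up space. The cost is that the surviving entries are written out again.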

Question:

To better understand this discussion, is there anything that you would recommend to jumpstart the audience?

Answer:

There are three papers that I would recommend to anyone regardless of their job description, seniority, years of work in the industry, or the databases they are currently using. The first is "The Ubiquitous B-Tree" by Douglas Comer, which summarizes B-tree techniques, and the second is the "LSM paper", "The Log-Structured Merge-Tree (LSM-Tree)". As a summary, I’ll also talk about the third, the RUM Conjecture. These three papers would be ideal; you may not need to read all the details, but at least get the general idea. As a general overview, you can check out my ACM article on the subject, which covers some of the things I’ll be talking about: https://queue.acm.org/detail.cfm?id=3220266

Speaker: Oleksandr Petrov

Apache Cassandra Committer, Distributed Systems Engineer

Alex Petrov is an infrastructure engineer and Apache Cassandra committer. He is interested in storage, distributed systems, and algorithms.

Find Oleksandr Petrov at: @ifesdjeen

Track: Modern CS in the Real World

Location: Pacific LMNO

Duration: 2:55pm - 3:45pm

Day of week: Monday, 5 November

Level: Advanced

Persona: Backend Developer