Under most stateless services are stateful databases, caches, and systems which form the bedrock applications are built on. These stateful systems have to be extremely reliable as even the slightest latency or availability blip can easily cascade up to the application layer due to data in-cast and query fan-out. With careful consideration, we can design reliable inter-process-communication (IPC), software deployment, and data models to handle millions of requests per second over petabyte scale datastores at single-digit millisecond latency.
In this talk we will start with the architecture of Netflix's stateful caches and databases, including how we capacity plan, bulkhead, and deploy software to our global, full-active, data topology. Keeping petabytes of state across four AWS regions and twelve fault isolation zones consistent and cost-efficient is challenging, but it allows us to meet our millisecond service-level-objectives for both reads and write operations at the level of availability Netflix applications demand.
Then, we will cover how we make stateful systems fully idempotent for both reads and writes, and use this to build novel resilience techniques into our stateful IPC layers using our KeyValue and TimeSeries (event) use cases as concrete examples. Developers at Netflix may see simple read and write APIs, but under the hood we implement advanced load balancing, routing, hedging and other resilience techniques to make those APIs function properly every time. Finally, we will conclude with how we design these systems to, when they do fail, fail as gracefully as possible to limit the blast radius of those failures and contain the business impact to Netflix. For example, most Netflix developers would rather sacrifice properties like immediate strong consistency via caching or partial shedding in emergency situations rather than be unavailable.
Interview:
What's the focus of your work these days?
I focus on designing, implementing, operating, and scaling online data abstractions and storage engines. This work spans from high-level design and implementation of new atomic compare and swap protocols for our KeyValue service, to automating capacity planning optimal hardware to run our various storage engines, down to patching storage engines themselves to improve performance or scalability. On the side, I produce large quantities of memos.
What's the motivation for your talk at QCon San Francisco 2023?
To share stateful architectural and operational patterns as well as novel resilience techniques that get beyond five nines of availability.
How would you describe your main persona and target audience for this session?
A software developer, reliability engineer, or database operator who needs high availability from their caches and databases. Having a solid grasp of distributed systems and databases will enrich the talk, but is not required.
Is there anything specific that you'd like people to walk away with after watching your session?
With new knowledge and concrete advice on how to design, build and scale their stateful systems to be even more reliable than before - ranging from small changes with big benefits, to more complex system-wide changes.
Speaker
Joseph Lynch
Distributed Systems Engineer @Netflix Working on Online Datastores and Data Abstractions
Joey Lynch is a Principal Software Engineer for Netflix who focuses on building highly-reliable and high-leverage data abstractions on top of datastore infrastructure. He is a core contributor to Netflix’s datastore platform, which supports a polyglot data tier including Cassandra, Elasticsearch, CockroachDB, Zookeeper and more. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Having wrangled many large scale distributed systems over the years, he currently spends much of his time wrangling KeyValue and TimeSeries event stores.