Abstract

Under most stateless services are stateful databases, caches, and systems which form the bedrock applications are built on. These stateful systems have to be extremely reliable as even the slightest latency or availability blip can easily cascade up to the application layer due to data in-cast and query fan-out. With careful consideration, we can design reliable inter-process-communication (IPC), software deployment, and data models to handle millions of requests per second over petabyte scale datastores at single-digit millisecond latency.

In this talk we will start with the architecture of Netflix's stateful caches and databases, including how we capacity plan, bulkhead, and deploy software to our global, full-active, data topology. Keeping petabytes of state across four AWS regions and twelve fault isolation zones consistent and cost-efficient is challenging, but it allows us to meet our millisecond service-level-objectives for both reads and write operations at the level of availability Netflix applications demand.

Then, we will cover how we make stateful systems fully idempotent for both reads and writes, and use this to build novel resilience techniques into our stateful IPC layers using our KeyValue and TimeSeries (event) use cases as concrete examples. Developers at Netflix may see simple read and write APIs, but under the hood we implement advanced load balancing, routing, hedging and other resilience techniques to make those APIs function properly every time. Finally, we will conclude with how we design these systems to, when they do fail, fail as gracefully as possible to limit the blast radius of those failures and contain the business impact to Netflix. For example, most Netflix developers would rather sacrifice properties like immediate strong consistency via caching or partial shedding in emergency situations rather than be unavailable.

Interview:

What's the focus of your work these days?

I focus on designing, implementing, operating, and scaling online data abstractions and storage engines. This work spans from high-level design and implementation of new atomic compare and swap protocols for our KeyValue service, to automating capacity planning optimal hardware to run our various storage engines, down to patching storage engines themselves to improve performance or scalability. On the side, I produce large quantities of memos.

What's the motivation for your talk at QCon San Francisco 2023?

To share stateful architectural and operational patterns as well as novel resilience techniques that get beyond five nines of availability.

How would you describe your main persona and target audience for this session?

A software developer, reliability engineer, or database operator who needs high availability from their caches and databases. Having a solid grasp of distributed systems and databases will enrich the talk, but is not required.

Is there anything specific that you'd like people to walk away with after watching your session?

With new knowledge and concrete advice on how to design, build and scale their stateful systems to be even more reliable than before - ranging from small changes with big benefits, to more complex system-wide changes.

Speaker

Joseph Lynch

Principal Software Engineer @Netflix Building Highly-Reliable and High-Leverage Infrastructure Across Stateless and Stateful Services

Joseph Lynch is a Principal Software Engineer for Netflix who focuses on building highly-reliable and high-leverage infrastructure across our stateless and stateful services. He led the shift of the Netflix data tier to abstraction, driving resilience through a Data Gateway architecture. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Having wrangled many large scale distributed systems over the years, he currently spends much of his time building automated capacity management and resiliency features into the Netflix fleet.

How Netflix Ensures Highly-Reliable Online Stateful Systems

Abstract

Interview:

What's the focus of your work these days?

What's the motivation for your talk at QCon San Francisco 2023?

How would you describe your main persona and target audience for this session?

Is there anything specific that you'd like people to walk away with after watching your session?

Speaker

Joseph Lynch

Find Joseph Lynch at:

Speaker

Joseph Lynch

Date

Location

Track

Topics

Share

From the same track

How Do We Talk to Each Other? How Surfacing Communication Patterns in Organizations Can Help You Understand and Improve Your Resilience

Disaster Recovery Across a Million Pieces

Orchestrating Resilience: Building Modern Asynchronous Systems

Designing Fault-Tolerant Software with Control System Transparency

Unconference: Designing for Resilience

Follow QCon

Contact

Menu

Conferences around the World