How Netflix Ensures Highly-Reliable Online Stateful Systems

Under most stateless services are stateful databases, caches, and systems which form the bedrock applications are built on. These stateful systems have to be extremely reliable as even the slightest latency or availability blip can easily cascade up to the application layer due to data in-cast and query fan-out. With careful consideration, we can design reliable inter-process-communication (IPC), software deployment, and data models to handle millions of requests per second over petabyte scale datastores at single-digit millisecond latency.

In this talk we will start with the architecture of Netflix's stateful caches and databases, including how we capacity plan, bulkhead, and deploy software to our global, full-active, data topology. Keeping petabytes of state across four AWS regions and twelve fault isolation zones consistent and cost-efficient is challenging, but it allows us to meet our millisecond service-level-objectives for both reads and write operations at the level of availability Netflix applications demand.

Then, we will cover how we make stateful systems fully idempotent for both reads and writes, and use this to build novel resilience techniques into our stateful IPC layers using our KeyValue and TimeSeries (event) use cases as concrete examples. Developers at Netflix may see simple read and write APIs, but under the hood we implement advanced load balancing, routing, hedging and other resilience techniques to make those APIs function properly every time. Finally, we will conclude with how we design these systems to, when they do fail, fail as gracefully as possible to limit the blast radius of those failures and contain the business impact to Netflix. For example, most Netflix developers would rather sacrifice properties like immediate strong consistency via caching or partial shedding in emergency situations rather than be unavailable.

Interview:

What's the focus of your work these days?

I focus on designing, implementing, operating, and scaling online data abstractions and storage engines. This work spans from high-level design and implementation of new atomic compare and swap protocols for our KeyValue service, to automating capacity planning optimal hardware to run our various storage engines, down to patching storage engines themselves to improve performance or scalability. On the side, I produce large quantities of memos.

What's the motivation for your talk at QCon San Francisco 2023?

To share stateful architectural and operational patterns as well as novel resilience techniques that get beyond five nines of availability.

How would you describe your main persona and target audience for this session?

A software developer, reliability engineer, or database operator who needs high availability from their caches and databases. Having a solid grasp of distributed systems and databases will enrich the talk, but is not required.

Is there anything specific that you'd like people to walk away with after watching your session?

With new knowledge and concrete advice on how to design, build and scale their stateful systems to be even more reliable than before - ranging from small changes with big benefits, to more complex system-wide changes.


Speaker

Joseph Lynch

Distributed Systems Engineer @Netflix Working on Online Datastores and Data Abstractions

Joey Lynch is a Principal Software Engineer for Netflix who focuses on building highly-reliable and high-leverage data abstractions on top of datastore infrastructure. He is a core contributor to Netflix’s datastore platform, which supports a polyglot data tier including Cassandra, Elasticsearch, CockroachDB, Zookeeper and more. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Having wrangled many large scale distributed systems over the years, he currently spends much of his time wrangling KeyValue and TimeSeries event stores.

Read more
Find Joseph Lynch at:

Date

Tuesday Oct 3 / 02:45PM PDT ( 50 minutes )

Location

Ballroom BC

Topics

Database Distributed Systems Reliability Data Architecture

Share

From the same track

Session Resiliency

How Do We Talk to Each Other? How Surfacing Communication Patterns in Organizations Can Help You Understand and Improve Your Resilience

Tuesday Oct 3 / 01:35PM PDT

As a system increases in inevitable complexity, it becomes impossible for a single operator to have a clear, unambiguous understanding of what's happening in the system. Understanding the system requires a joint effort between teammates and technology.

Speaker image - Nora Jones

Nora Jones

Founder and CEO @jeli_io, Founder of Learning From Incidents (LFI) Online Community and Conference

Session Architecture

Disaster Recovery Across a Million Pieces

Tuesday Oct 3 / 10:35AM PDT

Data recovery is more than just backing up and restoring a data store. The goal of any disaster recovery effort is getting the system back to working as expected across all of its parts.

Speaker image - Michelle Brush

Michelle Brush

Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of "2 out of the 97 Things Every SRE Should Know"

Session Resiliency

Orchestrating Resilience: Building Modern Asynchronous Systems

Tuesday Oct 3 / 03:55PM PDT

Building asynchronous, event-driven systems can be daunting. Managing states, ensuring resilience, maintaining traceability, and handling a myriad of other challenges often require more effort than building the functionality itself.

Speaker image - Sai Pragna Etikyala

Sai Pragna Etikyala

Technical Lead @Twilio

Session Reliability

Designing Fault-Tolerant Software with Control System Transparency

Tuesday Oct 3 / 11:45AM PDT

Teams at NASA and JPL that create mission-critical software for spacecraft take a principled approach to fault tolerance. Let's see how those same principles, centered around a concept of transparency, can help us achieve reliability in pragmatic, modern software delivery settings.

Speaker image - Jon Moore

Jon Moore

Staff Software Engineer @Stripe with over 35 years of software engineering experience across both academia and industry

Session

Unconference: Designing for Resilience

Tuesday Oct 3 / 05:05PM PDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.