How Netflix Shapes our Fleet for Efficiency and Reliability

Abstract

Netflix runs on a complex multi-layer cloud architecture made up of thousands of services, caches, and databases. As hardware options, workload patterns, cost dynamics, and the Netflix product evolve, the cost-optimal hardware and configuration for running our services is constantly changing. In modern cloud computing, it is no longer sufficient to buy large quantities of a single instance shape and pack every workload onto it with large fixed buffers; this falls short on both efficiency and availability. Nor is it sufficient for platform teams to work 1:1 with every service team to optimize hardware selection; that approach does not scale.

This talk presents an alternative strategy, in which each workload is placed on price-optimal hardware by combining automated understanding of hardware performance with workload characterization. As workload patterns shift, we continuously re-evaluate and react for every cluster, ensuring business outcomes at minimal spend.
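To make the placement idea concrete, here is a minimal Python sketch that picks the cheapest instance shape able to cover a workload's characterized demand. The instance names, capacities, and prices are invented for illustration (not actual AWS data), and a real system would also model network, disk, and per-instance performance differences rather than just vCPU and memory.

    import math
    from dataclasses import dataclass

    @dataclass
    class InstanceType:
        name: str            # illustrative shape name, not a real AWS type
        vcpus: float
        memory_gib: float
        hourly_price: float  # USD/hour, invented for the example

    @dataclass
    class WorkloadNeeds:
        vcpus: float         # characterized demand, e.g. from utilization data
        memory_gib: float

    def cheapest_placement(needs, catalog):
        """Return (instance_type, count, hourly_cost) minimizing spend."""
        best = None
        for itype in catalog:
            # Instances needed to satisfy the binding resource dimension.
            count = max(
                math.ceil(needs.vcpus / itype.vcpus),
                math.ceil(needs.memory_gib / itype.memory_gib),
            )
            cost = count * itype.hourly_price
            if best is None or cost < best[2]:
                best = (itype, count, cost)
        return best

    catalog = [
        InstanceType("general-large", 4, 16, 0.20),
        InstanceType("compute-large", 8, 16, 0.30),
        InstanceType("memory-large", 4, 32, 0.28),
    ]
    itype, count, cost = cheapest_placement(
        WorkloadNeeds(vcpus=40, memory_gib=300), catalog)
    print(f"{count} x {itype.name} at ${cost:.2f}/hour")

Here the memory-heavy workload lands on the memory-optimized shape: 10 instances cover 300 GiB and 40 vCPUs at $2.80/hour, versus $3.80/hour for 19 general-purpose instances.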

We will start by showing how we automatically model capacity requirements, including key concepts like service buffer allocation based on business criticality. Then we will show how we marry this understanding of workload needs with a deep understanding of AWS hardware performance and pricing to place each workload on efficient hardware. Finally, we will walk through the continuously running optimization loop that monitors our fleet, detects changes, and re-shapes it to maintain business outcomes as load patterns constantly change.
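As a hedged illustration of buffer allocation by criticality, the sketch below sizes a cluster as peak demand plus a tier-dependent headroom fraction; the tier names and buffer percentages are assumptions for the example, not Netflix's actual policy.

    import math

    # Assumed buffer fractions per criticality tier (illustrative only).
    BUFFER_BY_TIER = {
        "tier0": 0.40,  # most critical: largest safety headroom
        "tier1": 0.25,
        "tier2": 0.10,  # best-effort: minimal headroom
    }

    def required_instances(peak_rps, per_instance_rps, tier):
        """Instances needed to serve peak demand plus the tier's buffer."""
        buffered = peak_rps * (1 + BUFFER_BY_TIER[tier])
        return math.ceil(buffered / per_instance_rps)

    # The same load needs 78 instances at tier0 but only 62 at tier2.
    print(required_instances(peak_rps=50_000, per_instance_rps=900, tier="tier0"))
    print(required_instances(peak_rps=50_000, per_instance_rps=900, tier="tier2"))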

Even with all this planning, our systems still face unexpected load shifts that exceed modeled bounds. To close, we will briefly cover how we manage traffic demand and compute supply to maintain availability, intelligently and rapidly injecting capacity into the right server groups to keep Netflix up and running and our customers happily streaming.
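To sketch what rapid capacity injection could look like, here is a toy proportional scale-up rule; the utilization target and step cap are hypothetical stand-ins, not the actual mechanism the talk describes.

    import math

    def emergency_scale(current, observed_util, target_util=0.60, max_step=0.50):
        """Return a new instance count when utilization exceeds the modeled bound.

        Scales proportionally toward the target utilization, capped at a
        max_step fractional increase per evaluation to avoid overshoot.
        """
        if observed_util <= target_util:
            return current
        desired = math.ceil(current * observed_util / target_util)
        return min(desired, math.ceil(current * (1 + max_step)))

    # A cluster at 90% utilization against a 60% target grows from 100 to 150.
    print(emergency_scale(current=100, observed_util=0.90))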


Speaker

Joseph Lynch

Principal Software Engineer @Netflix Building Highly-Reliable and High-Leverage Infrastructure Across Stateless and Stateful Services

Joseph Lynch is a Principal Software Engineer at Netflix who focuses on building highly-reliable and high-leverage infrastructure across our stateless and stateful services. He led the shift of the Netflix data tier to abstraction, driving resilience through a Data Gateway architecture. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Having wrangled many large-scale distributed systems over the years, he currently spends much of his time building automated capacity management and resiliency features into the Netflix fleet.


Speaker

Argha C

Staff Software Engineer @Netflix Building Highly Available, High Throughput Systems

Argha C is a Staff Software Engineer at Netflix who focuses on building highly available, high throughput systems. He has led multiple initiatives hardening resilience and reliability at the Netflix Edge. He enjoys building and operating systems at scale, and driving technical strategy for key business outcomes. Recently, he has been rethinking Netflix's approach to capacity management and efficiency, while advocating for lessons from the Edge to ensure fleetwide resiliency. 


From the same track

Session

Realtime and Batch Processing of GPU Workloads

SS&C Technologies runs $47 trillion of assets on our global private cloud. We have the primitives for infrastructure as well as platforms as a service such as Kubernetes, Kafka, NiFi, databases, etc.


Joseph Stein

Principal Architect of Research & Development @SS&C Technologies, Previous Apache Kafka Committer and PMC Member

Session

From ms to µs: OSS Valkey Architecture Patterns for Modern AI

As AI applications demand faster and more intelligent data access, traditional caching strategies are hitting performance and reliability limits. 


Dumanshu Goyal

Software Engineer @Airbnb - Leading Online Data Priorities, Previously @Google and @AWS

Session

One Platform to Serve Them All: Autoscaling Multi-Model LLM Serving

AI teams are moving away from hosted LLMs to self-hosted inference as fine-tuning drives model performance. The catch is scale: hundreds of variants create long-tail traffic, cold starts, and duplicated stacks.


Meryem Arik

Co-Founder and CEO @Doubleword (Previously TitanML), Recognized as a Technology Leader in Forbes 30 Under 30, Recovering Physicist

Session

Cost-Conscious Cloud: Designing Systems that Don't Break the Bank

Details coming soon.