How Netflix Shapes our Fleet for Efficiency and Reliability

Abstract

Netflix runs on a complex multi-layer cloud architecture made up of thousands of services, caches, and databases. As hardware options, workload patterns, cost dynamics, and Netflix products evolve, the cost-optimal hardware and configuration for running our services constantly change. In modern cloud computing, it is no longer sufficient, for either efficiency or availability, to buy large amounts of the same shape of computer and try to pack every workload onto it with large fixed buffers. Nor is it sufficient for platform teams to work 1:1 with every service team to optimize their hardware selection, because this does not scale.

This talk shows an alternative strategy, where each workload is placed on price-optimal hardware using automated understanding of hardware performance combined with workload characterization. Furthermore, as workload patterns shift, we can continuously re-evaluate and react for every cluster to ensure business outcomes for minimal spend.

We will start with understanding how we automatically model capacity requirements, including key concepts like service buffer allocation based on business criticality. Then we will show how we marry this understanding of workload needs with a deep understanding of AWS hardware performance and pricing to place each workload on efficient hardware. Finally, we will walk through the continuously running optimization loop, which monitors, detects changes, and re-shapes our fleet to maintain business outcomes as load patterns constantly change.
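As a rough illustration of the first two steps described above (sizing a buffer from business criticality, then choosing the cheapest hardware shape that covers the buffered demand), here is a minimal sketch. It is not the Netflix implementation: the criticality tiers, buffer percentages, shape names, core counts, and prices are all hypothetical values chosen for the example, and real capacity models consider far more than CPU cores.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """A hypothetical hardware shape; names and prices are made up."""
    name: str
    cpu_cores: int
    hourly_price: float


# Hypothetical buffer fractions: more critical services get more headroom.
BUFFER_BY_CRITICALITY = {
    "critical": 0.50,
    "standard": 0.25,
    "best_effort": 0.10,
}


def required_cores(peak_cores: float, criticality: str) -> float:
    """Peak demand plus a buffer that depends on business criticality."""
    return peak_cores * (1.0 + BUFFER_BY_CRITICALITY[criticality])


def cheapest_plan(peak_cores: float, criticality: str, shapes: list[Shape]):
    """Return (shape, instance_count, hourly_cost) with the lowest cost."""
    need = required_cores(peak_cores, criticality)
    best = None
    for shape in shapes:
        # Round up: we cannot run a fraction of an instance.
        count = math.ceil(need / shape.cpu_cores)
        cost = count * shape.hourly_price
        if best is None or cost < best[2]:
            best = (shape, count, cost)
    return best


# Hypothetical catalog used only for this example.
SHAPES = [
    Shape("small", cpu_cores=4, hourly_price=0.20),
    Shape("large", cpu_cores=16, hourly_price=0.70),
]
```

With this made-up catalog, the same 20-core peak lands on different shapes depending on criticality, which is the kind of per-workload decision the talk describes automating. Re-running `cheapest_plan` as the observed peak changes is a simplified stand-in for the continuous optimization loop.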

Even with all this planning, our systems still face unexpected load shifts that exceed modeled bounds. To close, we will briefly cover how we manage traffic demand and compute supply to maintain availability, intelligently and rapidly injecting capacity into the right server groups to keep Netflix up and running and our customers happily streaming.


Speaker

Joseph Lynch

Principal Software Engineer @Netflix Building Highly-Reliable and High-Leverage Infrastructure Across Stateless and Stateful Services

Joseph Lynch is a Principal Software Engineer for Netflix who focuses on building highly-reliable and high-leverage infrastructure across our stateless and stateful services. He led the shift of the Netflix data tier to abstraction, driving resilience through a Data Gateway architecture. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Having wrangled many large-scale distributed systems over the years, he currently spends much of his time building automated capacity management and resiliency features into the Netflix fleet.


Speaker

Argha C

Staff Software Engineer @Netflix Building Highly Available, High Throughput Systems

Argha C is a Staff Software Engineer at Netflix who focuses on building highly available, high-throughput systems. He has led multiple initiatives hardening resilience and reliability at the Netflix Edge. He enjoys building and operating systems at scale, and driving technical strategy for key business outcomes. Recently, he has been rethinking Netflix's approach to capacity management and efficiency, while advocating for lessons from the Edge to ensure fleetwide resiliency.
