Enhancing Reliability Using Service-Level Prioritized Load Shedding at Netflix

Abstract

How does Netflix maintain a seamless viewing experience for millions of users, especially during traffic spikes or when backend datastores are overloaded? Autoscaling can help during traffic spikes, but it costs money, takes a few minutes to kick in, and capacity may not always be available. Furthermore, if a downstream service or datastore is overloaded, autoscaling may exacerbate the problem. 

In this talk, we will cover service-level prioritized load shedding, our solution for prioritizing requests within a single application instance. It keeps requests that are critical to the user experience highly available, and it lets us dynamically repurpose capacity that normally serves non-critical traffic to serve critical traffic during times of duress. We will also discuss how we automated per-cluster tuning and validation of load shedding, enabling us to quickly deploy unique configurations to hundreds of clusters.
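As a rough illustration of the idea, and not the Netflix library itself, the sketch below shows one way a single instance might shed requests by priority. The priority buckets, the use of a utilization signal between 0.0 and 1.0, the threshold values, and the names `Priority`, `SheddingPolicy`, and `handle` are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict


class Priority(IntEnum):
    # Hypothetical priority buckets; a lower value means more important.
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2


@dataclass(frozen=True)
class SheddingPolicy:
    # Assumed per-priority thresholds: the instance utilization (0.0-1.0)
    # at or above which requests of that priority are rejected.
    shed_at: Dict[Priority, float]

    def should_shed(self, priority: Priority, utilization: float) -> bool:
        return utilization >= self.shed_at[priority]


# Illustrative values only; real thresholds would be tuned per cluster.
POLICY = SheddingPolicy(
    shed_at={
        Priority.BEST_EFFORT: 0.60,
        Priority.DEGRADED: 0.80,
        Priority.CRITICAL: 0.95,
    }
)


def handle(priority: Priority, utilization: float) -> int:
    """Return an HTTP-style status: 503 if the request is shed, else 200."""
    if POLICY.should_shed(priority, utilization):
        return 503
    return 200


if __name__ == "__main__":
    # At 65% utilization only the lowest-priority traffic is rejected,
    # which leaves the remaining capacity for more important requests.
    for p in Priority:
        print(p.name, handle(p, 0.65))
```

Because lower-priority requests are rejected first as utilization rises, the capacity they would have used stays available for critical requests, which is the repurposing effect the abstract describes.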

Key Takeaways

  1. Understand the evolution of load shedding techniques at Netflix and how service-level prioritization enhances user experience and reliability.
  2. Gain insights from real-world applications and testing scenarios that demonstrate the effectiveness of prioritized load shedding.
  3. Learn how platform engineering and service owners at Netflix effectively collaborated to build a generic prioritized load shedding library, efficiently roll it out, and automate tuning of load shedding thresholds.
  4. Learn how we continue to improve the experience for service owners by investing in making the load shedding capability more flexible and easier to operate.
     

Speaker

Anirudh Mendiratta

Staff Software Engineer, Playback Lifecycle @Netflix, Previously @Amazon Prime Video and @fuboTV

Anirudh Mendiratta is an engineer on the Playback Lifecycle team at Netflix who has been instrumental in launching live streaming at Netflix. Anirudh has 10+ years of experience as a distributed systems engineer, primarily in the video streaming domain. Prior to Netflix, he built a live video observability platform at Amazon Prime Video and built live video ingestion, storage and delivery systems at fuboTV. 

Together with Benjamin Fedorka, Anirudh pioneered service-level prioritized load shedding, allowing Netflix to handle live events at unprecedented scale.
 


Speaker

Benjamin Fedorka

Senior Software Engineer, Productivity Engineering - Java Platform @Netflix

Benjamin Fedorka is an engineer on Netflix’s Java Platform, focused on IPC ergonomics and resilience. With the launch of live programming, Benjamin has focused on helping teams prepare their clusters to entertain the world during these massive events. Benjamin has over ten years of engineering experience, primarily focused on platform engineering, and is passionate about enabling teams to solve new challenges instead of toiling on common problems.

Together with Anirudh Mendiratta, Benjamin pioneered service-level prioritized load shedding, allowing Netflix to handle live events at unprecedented scale.
 

