Presentation: Stranger Things: The Forces that Disrupt Netflix

Duration

5:25pm - 6:15pm

Key Takeaways

  • Hear about some of the production war stories from a Netflix engineer focused on the JVM.
  • Learn lessons and recommendations on protecting portions of a highly distributed system from unexpected failures.
  • Understand some of the approaches Netflix uses to provide better resiliency in distributed systems.

Abstract

Imagine a world where you do everything within your power to ensure the code you are pushing into production is as ready as possible to take traffic. You have thorough test coverage, you push out canaries, and you use push windows. You have truly operated your microservice in a top-notch way. And then all of a sudden...CPU spikes, GC churns, latency increases, you start spewing errors...enter sad Netflix customers.

This talk will explore ways in which the behaviors of other systems result in surprising issues with your service. It will lay out strategies that you can employ to protect your systems and enable your on-call staff to sleep better, knowing that things will go wrong but your system won’t fall over.

Key Takeaways:

  • Learn how microservice sharding can isolate the impact of outages and enable proper tuning for different use cases.
  • Understand the impact of fallbacks on your service and what makes a good fallback.
  • Understand why auto-loading/initializing/binding and package scanning are bad for operations.
  • Learn strategies for “canarying” data changes.

Interview

Question: 
QCon: What have you been up to since we talked to you last year?
Answer: 

Haley: Since we last spoke, we took Netflix global, so I was involved in that effort. The part of the system that I work in is heavily involved in picking the languages we return to customers, so that was a large effort. The follow-up to that is making things work better around the world. Outside of that, we are working on re-architecting some of our back-end systems that deliver metadata to our microservice tiers. 

I am going to talk about this a little bit in the talk this year at QConSF. It will be about some of the challenges that we have had with our data architecture within Netflix, and how we are trying to change the way that works.

Question: 
QCon: Is this talk focused on RPC for that metadata?
Answer: 

Haley: Not specifically. Part of the talk is about our current data architecture at Netflix. The entire Netflix catalogue is loaded onto every instance in Netflix, so all metadata requests are local calls, which are highly reliable, but the approach also has very unpredictable operational characteristics. If we get a large change in data, you get GC churn, which results in latency spikes on your box, and things start failing.

We’ve been experimenting with new mechanisms for remoting some of the data to another service, and many of the lessons that I’m discussing in my talk have been helpful in driving the architectural decisions for the new metadata tier.
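One pattern that fits these lessons, protecting the calling service when a remoted metadata dependency fails or slows down, is to wrap the remote call in a short timeout and fall back to the last value fetched successfully. A minimal sketch in Java; the class and method names are hypothetical, not Netflix's actual design:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical client for a remoted metadata tier. The remote call gets a
 * short timeout; on failure or slowness we fall back to the last value we
 * fetched successfully, so the calling service keeps serving traffic.
 */
public final class MetadataClient {

    private final ConcurrentHashMap<String, String> lastKnownGood = new ConcurrentHashMap<>();

    public String getMetadata(String titleId) {
        try {
            String fresh = fetchRemote(titleId).get(50, TimeUnit.MILLISECONDS);
            lastKnownGood.put(titleId, fresh);   // remember it for future fallbacks
            return fresh;
        } catch (Exception e) {
            // Fallback: stale-but-usable data beats failing the request outright.
            return lastKnownGood.getOrDefault(titleId, "{}");
        }
    }

    private CompletableFuture<String> fetchRemote(String titleId) {
        // Placeholder for the real remote call to the metadata service.
        return CompletableFuture.completedFuture("{\"titleId\":\"" + titleId + "\"}");
    }
}
```

Whether a stale value is an acceptable stand-in depends on the data being served; that trade-off is part of what the talk's takeaway about "what makes a good fallback" refers to.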

Question: 
QCon: Your talk is called “Stranger Things: The Forces that Disrupt Netflix.” Interesting title. Can you tell me about your talk?
Answer: 

Haley: I like to brand my talks with Netflix content because it is fun. The title is a good parallel, because, if you think about a distributed system, there are all these things happening underneath at all different levels even within your box (even within your own JVM). All these different things can just rear their ugly heads at the worst time. So, the talk is about several of those cases and things that you can do to change the blast radius when things do go wrong.

Question: 
QCon: You mentioned metadata as one of the things that has disrupted Netflix. Any other interesting stories you’ll discuss in your talk?
Answer: 

Haley: We had an outage that occurred a month or two ago that I’ll discuss. There is a service whose whole job is to take logging data from all of the devices. It’s meant to be a super lightweight, thin box that just accepts a firehose of data. We had a problem where one of the dependencies on that box started loading the metadata catalogue. So these boxes that are really thin and lightweight all of a sudden started loading 11 gigs of data into memory. Those boxes started running out of memory, and that whole farm became unhealthy. The unfortunate side effects rippled because we didn’t handle the failure properly.

Not only did it take down the logging farm, it also took down any other farms that were behind the same routing tier. I will use that story to discuss some of the lessons learned, like validating resource constraints on a box and the problems with automatically loading functionality from all JARs in an application.

In this case, that JAR didn’t even need to be on the box, but (because we have this huge dependency tree) it got sucked in and loaded. I am going to talk through several of the lessons learned there. I will also discuss some of the ways that we could do a better job of keeping that app up and preventing it from impacting other apps that are behind the same tier.

Question: 
QCon: What did you ultimately do to prevent things like that from happening in the future?
Answer: 

Haley: We are still working on that, but there are a few things. 

One is that we are pruning out the unnecessary dependencies, and we also worked with the team that loads the big data blob and had them put in a kill switch. So now we can guarantee that no matter what dependency gets pulled in, it will never load on this lightweight box. 
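A kill switch like that is typically just a dynamic configuration flag checked before the expensive load runs, so operators can disable the behavior per cluster without a redeploy. A minimal sketch in Java, with a plain system property standing in for the dynamic config; the class and property names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical guard around an expensive data load. When the kill switch is
 * set, the catalogue never loads, no matter which dependency drags this
 * class onto the box.
 */
public final class CatalogueLoader {

    private static final AtomicReference<Catalogue> CACHE = new AtomicReference<>();

    public static Catalogue getCatalogue() {
        // Kill switch: a real deployment would read this from a dynamic
        // property source so it can be flipped at runtime, per cluster.
        if (Boolean.getBoolean("catalogue.load.disabled")) {
            throw new IllegalStateException("Catalogue loading is disabled on this cluster");
        }
        return CACHE.updateAndGet(existing -> existing != null ? existing : loadFullCatalogue());
    }

    private static Catalogue loadFullCatalogue() {
        // Placeholder for the multi-gigabyte load described above.
        return new Catalogue();
    }

    /** Stand-in for the real metadata blob. */
    public static final class Catalogue { }

    private CatalogueLoader() { }
}
```

On the lightweight logging boxes the switch would be set to disabled, which gives the guarantee described above regardless of what the dependency tree pulls in.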

Additionally, we are considering changing some calls to be fire-and-forget, so that we can queue them up and process them offline. That would let us take the traffic and prevent the retry storm.
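Fire-and-forget in this sense means the handler acknowledges the event immediately and hands it to a bounded queue that a background worker drains; when the queue is full, events are shed instead of blocking callers and feeding a retry storm. A minimal sketch in Java; the class and method names are hypothetical, not the actual Netflix service:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical fire-and-forget ingestion: callers enqueue and return
 * immediately; a background worker processes events offline.
 */
public final class LogIngestor {

    // Bounded queue so a slow consumer cannot exhaust memory on this thin box.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    public LogIngestor() {
        Thread worker = new Thread(this::drain, "log-drain");
        worker.setDaemon(true);
        worker.start();
    }

    /** Accepts an event without blocking; returns false if shedding load. */
    public boolean accept(String event) {
        // offer() never blocks: if the queue is full we drop the event
        // instead of making the caller wait and retry.
        return queue.offer(event);
    }

    private void drain() {
        try {
            while (true) {
                process(queue.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(String event) {
        // Placeholder for the real offline processing (e.g., batching events
        // to a durable pipeline); kept trivial here.
        System.out.println("processed: " + event);
    }
}
```

The bounded queue is the important part: an unbounded one would just turn upstream pressure back into the out-of-memory problem the thin boxes hit in the first place.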

Question: 
QCon: Who is the primary persona you are talking to in this talk?
Answer: 

Haley: It is mainly architects and tech leads on the JVM. With that said, many of the lessons we learned are applicable to anybody operating a distributed system. These are the things that have ripple effects that are common in any distributed system. 

I am really talking about the failures in a distributed system, and how they tie into each other. This talk isn’t really about Netflix scale. I’m trying to make this one about applications of any size, and offer suggestions on how to protect yourself from some of the problems we ran into.

Speaker: Haley Tucker

Senior Software Engineer, Playback Features @Netflix

Haley Tucker works on the Playback Features team at Netflix, responsible for ensuring that customers receive the best possible viewing experience every time they click play. Her services play a key role in enabling Netflix to stream amazing content to 65M+ members on 1000+ devices. Prior to Netflix, Haley spent a few years building near-real-time command and control systems for Raytheon. She then moved into a consulting role where she built custom billing and payment solutions for cloud and telephony service providers by integrating Java applications with Oracle platforms. Haley enjoys applying new technologies to develop robust and maintainable systems, and the scale at Netflix has been a unique and exciting challenge. Haley received a BS in Computer Science from Texas A&M University.
