Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: Stranger Things: The Forces that Disrupt Netflix
Duration
Level:
- Intermediate
Persona:
- Architect
- Developer
- Developer, JVM
Key Takeaways
- Hear about some of the production war stories from a Netflix engineer focused on the JVM.
- Learn lessons and recommendations on protecting portions of a highly distributed system from unexpected failures.
- Understand some of the approaches Netflix uses to provide better resiliency in distributed systems.
Abstract
Imagine a world where you do everything within your power to ensure the code you are pushing into production is as ready as possible to take traffic. You have thorough test coverage, you push out canaries, and you use push windows. You have truly operated your microservice in a top-notch way. And then all of a sudden...CPU spikes, GC churns, latency increases, you start spewing errors...enter sad Netflix customers.
This talk will explore ways in which the behavior of other systems can cause surprise issues in your service. It will lay out strategies that you can employ to protect your systems and enable your on-call staff to sleep better knowing that things will go wrong, but your system won’t fall over.
Key Takeaways:
- Learn how microservice sharding can isolate the impact of outages and enable proper tuning for different use cases.
- Understand the impact of fallbacks on your service and what makes a good fallback.
- Learn why auto-loading/initializing/binding and package scanning are bad for operations.
- Strategies for “canarying” data changes.
Interview
Haley: Since we last spoke, we took Netflix global, so I was involved in that effort. The part of the system that I work in is heavily involved in picking the languages we return to customers, so that was a large effort. The follow-up to that is making things work better around the world. Outside of that, we are working on re-architecting some of our back-end systems that deliver metadata to our microservice tiers.
I am going to talk about this a little bit in the talk this year at QConSF. It will be about some of the challenges that we have had with our data architecture within Netflix, and how we are trying to change the way that works.
Haley: Not specifically. Part of the talk is about our current data architecture at Netflix. The entire Netflix catalogue is loaded onto every instance in Netflix. So all metadata requests are local calls, which are highly reliable, but this design also has very unpredictable operational characteristics. If we get a large change in data, you get GC churn, which results in latency spikes on your box, and things start failing.
We’ve been experimenting with new mechanisms for remoting some of the data to another service, and many of the lessons that I’m discussing in my talk have been helpful in driving the architectural decisions for the new metadata tier.
Haley: I like to brand my talks with Netflix content because it is fun. The title is a good parallel because, if you think about a distributed system, there are all these things happening underneath at all different levels, even within your box (even within your own JVM). All these different things can rear their ugly heads at the worst time. So, the talk is about several of those cases and things that you can do to reduce the blast radius when things do go wrong.
Haley: We had an outage that occurred a month or two ago that I’ll discuss. There is a service whose whole job is to take logging data from all of the devices. It’s meant to be a super lightweight, thin box that just accepts a firehose of data. We had a problem where one of the dependencies on that box started loading the metadata catalogue. So these boxes that are really thin and lightweight all of a sudden started loading 11 gigs of data into memory. Those boxes started running out of memory, and that whole farm became unhealthy. The unfortunate side effects rippled because we didn’t handle the failure properly.
Not only did it take down the logging farm, it also took down any other farms that were behind the same routing tier. I will use that story to discuss some of the lessons learned, like validating resource constraints on a box and the problems with automatically loading functionality for all JARs in an application.
In this case, that jar didn’t even need to be on the box, but (because we have this huge dependency tree) it got sucked in and loaded. I am going to talk through several of the lessons learned there. I will also discuss some of the ways that we could do a better job of keeping that app up and prevent it from impacting other apps that are behind the same tier.
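The lesson about validating resource constraints on a box can be illustrated with a small pre-flight check. This is only a sketch under stated assumptions: the class name `HeapGuard`, its methods, and the idea of capping a dataset at a fraction of the maximum heap are hypothetical and are not taken from Netflix's code. The only real API used is `Runtime.getRuntime().maxMemory()`.

```java
// Hypothetical sketch: HeapGuard and its thresholds are illustrative assumptions,
// not Netflix code. The idea is to fail fast before a large load begins.
final class HeapGuard {
    private HeapGuard() {}

    /** Returns true if estimatedBytes fits within maxFraction of maxHeapBytes. */
    static boolean fits(long estimatedBytes, long maxHeapBytes, double maxFraction) {
        if (estimatedBytes < 0 || maxHeapBytes <= 0 || maxFraction <= 0.0 || maxFraction > 1.0) {
            throw new IllegalArgumentException("invalid sizing arguments");
        }
        return estimatedBytes <= (long) (maxHeapBytes * maxFraction);
    }

    /** Throws before loading starts if this JVM's maximum heap is too small. */
    static void requireHeapFor(long estimatedBytes, double maxFraction) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        if (!fits(estimatedBytes, maxHeap, maxFraction)) {
            throw new IllegalStateException(
                "refusing to load " + estimatedBytes + " bytes; max heap is " + maxHeap);
        }
    }
}
```

A loader could call `HeapGuard.requireHeapFor(estimatedSize, 0.5)` before reading any data, so that a thin box rejects the load with a clear error instead of running out of memory later.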
Haley: We are still working on that, but there are a few things.
One is that we are pruning out the unnecessary dependencies, and we also worked with the team that loads the big data blob and had them put in a kill switch. So now we can guarantee that no matter what dependency gets pulled in, it will never load on this lightweight box.
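A kill switch of the kind described here might look roughly like the following. The interview does not say how the real switch is implemented, so everything in this sketch is an assumption: the class name `GuardedLoader`, the boolean flag passed to the constructor, and the lazy loading behavior are all hypothetical.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical sketch of a kill switch around an expensive load.
// None of these names come from Netflix code.
final class GuardedLoader<T> {
    private final AtomicBoolean enabled;
    private final Supplier<T> loader;
    private T value; // guarded by "this"

    GuardedLoader(boolean enabledAtStart, Supplier<T> loader) {
        this.enabled = new AtomicBoolean(enabledAtStart);
        this.loader = Objects.requireNonNull(loader);
    }

    /** Flips the switch off; later calls to get() will not trigger a load. */
    void disable() {
        enabled.set(false);
    }

    /** Loads at most once, and only while the switch is on. */
    synchronized Optional<T> get() {
        if (!enabled.get()) {
            return Optional.empty();
        }
        if (value == null) {
            value = Objects.requireNonNull(loader.get(), "loader returned null");
        }
        return Optional.of(value);
    }
}
```

In this sketch the flag could be read from configuration at startup, for example `Boolean.parseBoolean(System.getProperty("catalog.load.enabled", "false"))`, where the property name is made up for illustration. Defaulting to off means a box has to opt in to the load, so a dependency that is pulled in by accident cannot trigger it.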
Additionally, we are considering changes to some calls to be fire-and-forget, so that we can queue them up and process them offline. That would let us take the traffic and prevent the retry storm.
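The fire-and-forget idea can be sketched with a bounded in-memory queue: the request path only enqueues and returns, and a separate worker drains the queue later. This is an assumption-laden illustration, not the design Netflix is considering. The class name `FireAndForgetBuffer`, the choice to drop events when the buffer is full, and the drop counter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: accept events without blocking, process them offline.
// Dropping on overflow is an assumption made to keep memory use bounded.
final class FireAndForgetBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    FireAndForgetBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks. Returns false if the event was dropped because the buffer is full. */
    boolean accept(String event) {
        boolean queued = queue.offer(event);
        if (!queued) {
            dropped.incrementAndGet();
        }
        return queued;
    }

    /** Called by an offline worker; removes and returns up to max queued events. */
    List<String> drain(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Because `accept` returns immediately whether or not downstream processing is healthy, the caller can be acknowledged right away and has no reason to retry, which is the property the answer above is after. The bounded capacity keeps a slow consumer from turning into a memory problem on the box.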
Haley: It is mainly architects and tech leads on the JVM. With that said, many of the lessons we learned are applicable to anybody operating a distributed system. These are the things that have ripple effects that are common in any distributed system.
I am really talking about the failures in a distributed system, and how they tie into each other. This talk isn’t really about Netflix scale. I’m really trying to make this one about applications of any size, and offer suggestions on how to protect yourself from some of the problems we ran into.
Tracks
Monday Nov 7
-
Architectures You've Always Wondered About
You know the names. Now learn lessons from their architectures
-
Distributed Systems War Stories
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport.
-
Containers Everywhere
State of the art in Container deployment, management, scheduling
-
Art of Relevancy and Recommendations
Lessons on the adoption of practical, real-world machine learning practices. AI & Deep learning explored.
-
Next Generation Web Standards, Frameworks, and Techniques
JavaScript, HTML5, WASM, and more... innovations targeting the browser
-
Optimize You
Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.
Tuesday Nov 8
-
Next Generation Microservices
What will microservices look like in 3 years? What if we could start over?
-
Java: Are You Ready for This?
Real world lessons & prepping for JDK9. Reactive code in Java today, Performance/Optimization, Where Unsafe is heading, & JVM compile interface.
-
Big Data Meets the Cloud
Overviews and lessons learned from companies that have implemented their Big Data use-cases in the Cloud
-
Evolving DevOps
Lessons/stories on optimizing the deployment pipeline
-
Software Engineering Softskills
Great engineers do more than code. Learn their secrets and level up.
-
Modern CS in the Real World
Applied, practical, & real-world dive into industry adoption of modern CS ideas
Wednesday Nov 9
-
Architecting for Failure
Your system will fail. Take control before it takes you with it.
-
Stream Processing
Stream Processing, Near-Real Time Processing
-
Bare Metal Performance
Native languages, kernel bypass, tooling - make the most of your hardware
-
Culture as a Differentiator
The why and how for building successful engineering cultures
-
//TODO: Security <-- fix this
Building security from the start. Stories, lessons, and innovations advancing the field of software security.
-
UX Reimagined
Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.