Presentation: Stranger Things: The Forces that Disrupt Netflix

Duration

5:25pm - 6:15pm

Key Takeaways

  • Hear about some of the production war stories from a Netflix engineer focused on the JVM.
  • Learn lessons and recommendations on protecting portions of a highly distributed system from unexpected failures.
  • Understand some of the approaches Netflix uses to provide better resiliency in distributed systems.

Abstract

Imagine a world where you do everything within your power to ensure the code you are pushing into production is as ready as possible to take traffic. You have thorough test coverage, you push out canaries, and you use push windows. You have truly operated your microservice in a top-notch way. And then all of a sudden...CPU spikes, GC churns, latency increases, you start spewing errors...enter sad Netflix customers.

This talk will explore ways in which the behaviors of other systems result in surprising issues with your service. It will lay out strategies that you can employ to protect your systems and enable your on-call staff to sleep better, knowing that things will go wrong but your system won’t fall over.

Key Takeaways:

  • Learn how microservice sharding can isolate the impact of outages and enable proper tuning for different use cases.
  • Understand the impact of fallbacks on your service and what makes a good fallback.
  • Understand why auto-loading/initializing/binding and package scanning are bad for operations.
  • Learn strategies for “canarying” data changes.

Interview

Question: 
QCon: What have you been up to since we talked to you last year?
Answer: 

Haley: Since we last spoke, we took Netflix global, so I was involved in that effort. The part of the system that I work in is heavily involved in picking the languages we return to customers, so that was a large effort. The follow-up to that is making things work better around the world. Outside of that, we are working on re-architecting some of our back-end systems that deliver metadata to our microservice tiers. 

I am going to talk about this a little bit in the talk this year at QConSF. It will be about some of the challenges that we have had with our data architecture within Netflix, and how we are trying to change the way that works.

Question: 
QCon: Is this talk focused on RPC for that metadata?
Answer: 

Haley: Not specifically. Part of the talk is about our current data architecture at Netflix. The entire Netflix catalogue is loaded onto every instance in Netflix, so all metadata requests are local calls, which are highly reliable, but the approach also has very unpredictable operational characteristics. If we get a large change in data, you get GC churn, which results in latency spikes on your box, and things start failing.

We’ve been experimenting with new mechanisms for remoting some of the data to another service, and many of the lessons that I’m discussing in my talk have been helpful in driving the architectural decisions for the new metadata tier.
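One pattern that fits these lessons, protecting the calling service when a remoted metadata dependency fails or slows down, is to wrap the remote call in a short timeout and fall back to the last value fetched successfully. A minimal sketch in Java; the class and method names are hypothetical, not Netflix's actual design:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical client for a remoted metadata tier. The remote call gets a
 * short timeout; on failure or slowness we fall back to the last value we
 * fetched successfully, so the calling service keeps serving traffic.
 */
public final class MetadataClient {

    private final ConcurrentHashMap<String, String> lastKnownGood = new ConcurrentHashMap<>();

    public String getMetadata(String titleId) {
        try {
            String fresh = fetchRemote(titleId).get(50, TimeUnit.MILLISECONDS);
            lastKnownGood.put(titleId, fresh);   // remember it for future fallbacks
            return fresh;
        } catch (Exception e) {
            // Fallback: stale-but-usable data beats failing the request outright.
            return lastKnownGood.getOrDefault(titleId, "{}");
        }
    }

    private CompletableFuture<String> fetchRemote(String titleId) {
        // Placeholder for the real remote call to the metadata service.
        return CompletableFuture.completedFuture("{\"titleId\":\"" + titleId + "\"}");
    }
}
```

Whether a stale value is an acceptable stand-in depends on the data being served; that trade-off is part of what the talk's takeaway about "what makes a good fallback" refers to.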

Question: 
QCon: Your talk is called “Stranger Things: The Forces that Disrupt Netflix.” Interesting title. Can you tell me about your talk?
Answer: 

Haley: I like to brand my talks with Netflix content because it is fun. The title is a good parallel, because, if you think about a distributed system, there are all these things happening underneath at all different levels even within your box (even within your own JVM). All these different things can just rear their ugly heads at the worst time. So, the talk is about several of those cases and things that you can do to change the blast radius when things do go wrong.

Question: 
QCon: You mentioned metadata as one of the things that has disrupted Netflix. Any other interesting stories you’ll discuss in your talk?
Answer: 

Haley: We had an outage that occurred a month or two ago that I’ll discuss. There is a service whose whole job is to take logging data from all of the devices. It’s meant to be a super lightweight, thin box that just accepts a firehose of data. We had a problem where one of the dependencies on that box started loading the metadata catalogue. So these boxes that are really thin and lightweight all of a sudden started loading 11 gigs of data into memory. Those boxes started running out of memory, and that whole farm became unhealthy. The unfortunate side effects rippled because we didn’t handle the failure properly.

Not only did it take down the logging farm, it also took down any other farms that were behind the same routing tier. I will use that story to discuss some of the lessons learned, like validating resource constraints on a box and the problems with automatically loading functionality from all JARs in an application.

In this case, that JAR didn’t even need to be on the box, but (because we have this huge dependency tree) it got sucked in and loaded. I am going to talk through several of the lessons learned there. I will also discuss some of the ways that we could do a better job of keeping that app up and preventing it from impacting other apps that are behind the same tier.

Question: 
QCon: What did you ultimately do to prevent things like that from happening in the future?
Answer: 

Haley: We are still working on that, but there are a few things. 

One is that we are pruning out the unnecessary dependencies, and we also worked with the team that loads the big data blob and had them put in a kill switch. So now we can guarantee that no matter what dependency gets pulled in, it will never load on this lightweight box. 
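A kill switch like that is typically just a dynamic configuration flag checked before the expensive load runs, so operators can disable the behavior per cluster without a redeploy. A minimal sketch in Java, with a plain system property standing in for the dynamic config; the class and property names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical guard around an expensive data load. When the kill switch is
 * set, the catalogue never loads, no matter which dependency drags this
 * class onto the box.
 */
public final class CatalogueLoader {

    private static final AtomicReference<Catalogue> CACHE = new AtomicReference<>();

    public static Catalogue getCatalogue() {
        // Kill switch: a real deployment would read this from a dynamic
        // property source so it can be flipped at runtime, per cluster.
        if (Boolean.getBoolean("catalogue.load.disabled")) {
            throw new IllegalStateException("Catalogue loading is disabled on this cluster");
        }
        return CACHE.updateAndGet(existing -> existing != null ? existing : loadFullCatalogue());
    }

    private static Catalogue loadFullCatalogue() {
        // Placeholder for the multi-gigabyte load described above.
        return new Catalogue();
    }

    /** Stand-in for the real metadata blob. */
    public static final class Catalogue { }

    private CatalogueLoader() { }
}
```

On the lightweight logging boxes the switch would be set to disabled, which gives the guarantee described above regardless of what the dependency tree pulls in.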

Additionally, we are considering changing some calls to be fire-and-forget, so that we can queue them up and process them offline. That would let us take the traffic and prevent the retry storm.
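Fire-and-forget in this sense means the handler acknowledges the event immediately and hands it to a bounded queue that a background worker drains; when the queue is full, events are shed instead of blocking callers and feeding a retry storm. A minimal sketch in Java; the class and method names are hypothetical, not the actual Netflix service:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical fire-and-forget ingestion: callers enqueue and return
 * immediately; a background worker processes events offline.
 */
public final class LogIngestor {

    // Bounded queue so a slow consumer cannot exhaust memory on this thin box.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    public LogIngestor() {
        Thread worker = new Thread(this::drain, "log-drain");
        worker.setDaemon(true);
        worker.start();
    }

    /** Accepts an event without blocking; returns false if shedding load. */
    public boolean accept(String event) {
        // offer() never blocks: if the queue is full we drop the event
        // instead of making the caller wait and retry.
        return queue.offer(event);
    }

    private void drain() {
        try {
            while (true) {
                process(queue.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(String event) {
        // Placeholder for the real offline processing (e.g., batching events
        // to a durable pipeline); kept trivial here.
        System.out.println("processed: " + event);
    }
}
```

The bounded queue is the important part: an unbounded one would just turn upstream pressure back into the out-of-memory problem the thin boxes hit in the first place.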

Question: 
QCon: Who is the primary persona you are talking to in this talk?
Answer: 

Haley: It is mainly architects and tech leads on the JVM. With that said, many of the lessons we learned are applicable to anybody operating a distributed system. These are the things that have ripple effects that are common in any distributed system. 

I am really talking about the failures in a distributed system, and how they tie into each other. This talk isn’t really about Netflix scale. I’m trying to make this one about applications of any size, and offer suggestions on how to protect yourself from some of the problems we ran into.

Speaker: Haley Tucker

Senior Software Engineer, Playback Features @Netflix

Haley Tucker works on the Playback Features team at Netflix, responsible for ensuring that customers receive the best possible viewing experience every time they click play. Her services play a key role in enabling Netflix to stream amazing content to 65M+ members on 1000+ devices. Prior to Netflix, Haley spent a few years building near-real-time command and control systems for Raytheon. She then moved into a consulting role where she built custom billing and payment solutions for cloud and telephony service providers by integrating Java applications with Oracle platforms. Haley enjoys applying new technologies to develop robust and maintainable systems, and the scale at Netflix has been a unique and exciting challenge. Haley received a BS in Computer Science from Texas A&M University.
