Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: Automating Chaos Experiments In Production
Location:
- Pacific DEKJ
Duration:
Level:
- Intermediate
Persona:
- Architect
Key Takeaways
- Hear about Netflix’s motivation in creating a Chaos Automation Platform (ChAP).
- Understand techniques Netflix used to implement ChAP, and how it helps teams identify systemic weaknesses.
- Understand how to apply failure injection testing in a way that still protects customers and evolves the architecture.
Abstract
Imagine a world where you receive an alert about an outage that hasn’t happened yet. At Netflix, we are building a Chaos Automation Platform (ChAP) to realize this vision. ChAP runs experiments to test that microservices are resilient to failures in downstream dependencies. These experiments run in production. ChAP siphons off a fraction of real traffic, injects failures, and measures how these failures change system behavior.
ChAP focuses on a specific type of failure: a failed RPC call between microservices. Many types of failures at the level of an individual service can be modeled as an RPC failure or delay: a service that crashes, runs out of resources, or is highly loaded will appear to a client as either returning an error or increased latency.
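Because these service-level failures all surface to a client as an error or added latency on an RPC call, injecting them can be modeled as wrapping the call itself. Below is a minimal, hypothetical sketch of that idea; the function names and parameters are assumptions for illustration, since ChAP performs injection inside Netflix's RPC layer rather than with a wrapper like this.

```python
import random
import time

class RpcError(Exception):
    """Simulated downstream failure."""

def inject_failure(call, failure_rate=0.0, added_latency_s=0.0):
    """Wrap an RPC-style call so a fraction of invocations fail,
    and the rest are optionally delayed to model a slow dependency."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise RpcError("injected failure")   # model a crashed/erroring service
        if added_latency_s > 0:
            time.sleep(added_latency_s)          # model an overloaded service
        return call(*args, **kwargs)
    return wrapped
```

The same wrapper covers both failure modes the abstract describes: set `failure_rate` to model errors, or `added_latency_s` to model an overloaded dependency.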
This talk will cover the motivation behind ChAP, how we implemented it, and how Netflix service teams are using it to identify systemic weaknesses.
Interview
Ali: I’m a chaos engineer at Netflix. The Chaos Team’s goal is to leverage chaos to improve the reliability of services. We found value in consulting with other teams on failure injection, but that approach didn’t scale across the organization. We used the learnings from those individual engagements to build a platform (ChAP) that better leverages failure injection across Netflix.
Ali: ChAP is what we call our Chaos Automation Platform. It does failure testing by separating out a fraction of the traffic and running it through an experiment cluster and a control cluster. We then inject failure into the experiment cluster and compare the results against the control to find issues or potential problems with the system.
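The traffic split described above can be sketched as a deterministic bucketing function. This is a hypothetical illustration: the cluster names, hashing scheme, and 1% default are assumptions, and ChAP's real routing lives inside Netflix's edge and RPC infrastructure. Hashing the request id keeps a given customer in the same cluster for the whole experiment rather than failing them intermittently.

```python
import hashlib

def route_request(request_id, sample_fraction=0.01):
    """Bucket a request into the experiment, control, or normal cluster,
    sending equal small slices of traffic to experiment and control."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    cutoff = int(sample_fraction * 10_000)
    if bucket < cutoff:
        return "experiment"   # this slice receives injected failures
    if bucket < 2 * cutoff:
        return "control"      # identical slice with no injection
    return "production"       # untouched majority of traffic
```

Comparing metrics between the two equally sized slices, rather than against the whole fleet, is what makes a small behavioral difference detectable.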
Ali: We take a couple of precautions. First, we do some local testing internally with different devices connected to the production environment. We inject failure into just those devices and make sure everything is working correctly. Then we run a larger-scale experiment. If we do uncover a problem at that larger scale, we can quickly stop the experiment. We do real-time analysis on the metrics in order to mitigate impact as quickly as possible.
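The real-time analysis that stops an experiment can be reduced to a simple abort check comparing a success metric between the two clusters. This is a minimal sketch; the parameter names and the 5% threshold are assumptions, and Netflix has described comparing business-level signals such as stream starts per second between experiment and control.

```python
def experiment_unhealthy(control_success, experiment_success, tolerance=0.05):
    """Return True if the experiment's success metric has dropped more
    than `tolerance` relative to the control cluster."""
    if control_success <= 0:
        return True  # no healthy baseline: fail safe and abort
    drop = (control_success - experiment_success) / control_success
    return drop > tolerance
```

Evaluated continuously against live metrics, a True result would trigger an automatic shutdown of the experiment, which is how impact is bounded even when a real weakness is found.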
Ali: I want to increase awareness of failure injection and how we can use it to make complex systems more reliable. If there are more practitioners of chaos engineering, then together we can advance what is still a new field.
Ali: One of the main takeaways is the effectiveness of failure injection testing and how we can apply it in a much less impactful way.
Ali: When we talk about chaos, people often think of Chaos Monkey, where you are injecting chaos in a sense. With ChAP we are injecting failure in a much more controlled manner into a system that is chaotic by nature. We try to understand where we are injecting failure and how we are injecting it. I will talk about some examples and learnings we’ve had, and cover incremental steps we can take to get big wins using failure injection.
Ali: For one of our services, the expectation was that when the service is down, the impact on the user experience would be small and the UI would handle the failure gracefully. Instead of displaying the movies personalized to you, we would display a set of popular movies instead. This fallback would be a decent customer experience.
To test this assumption, we ran a large-scale experiment that injected latency into responses from the service. The upstream service is configured to detect an increase in latency and timeouts, and should short-circuit the call by returning fallbacks. However, instead of quickly triggering the fallback, the upstream service became overwhelmed and started falling over. To recover from this experiment, we had to perform a regional failover to another region. Even though the experiment had a catastrophic outcome, we learned a lot. We applied those learnings to tuning these systems across all of our services, and that had a huge impact on our availability in the long term.
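The short-circuiting behavior the upstream service was expected to have is the circuit breaker pattern, which Netflix's Hystrix library implements. Below is a minimal, hypothetical sketch of the pattern, not Netflix's implementation; the class, thresholds, and fallback are illustrative assumptions.

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures, return the fallback
    immediately instead of calling the struggling dependency, and retry
    only after `reset_after_s` seconds."""
    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()      # circuit open: skip the call entirely
            self.opened_at = None      # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0
        return result
```

The war story above is what happens when this mechanism is mistuned: if the breaker never trips, every slow call ties up resources in the upstream service until it falls over, which is exactly the systemic weakness the latency-injection experiment exposed.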
Tracks

Monday Nov 7
- Architectures You've Always Wondered About
  You know the names. Now learn lessons from their architectures.
- Distributed Systems War Stories
  “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport
- Containers Everywhere
  State of the art in container deployment, management, and scheduling.
- Art of Relevancy and Recommendations
  Lessons on the adoption of practical, real-world machine learning practices. AI & deep learning explored.
- Next Generation Web Standards, Frameworks, and Techniques
  JavaScript, HTML5, WASM, and more... innovations targeting the browser.
- Optimize You
  Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.

Tuesday Nov 8
- Next Generation Microservices
  What will microservices look like in 3 years? What if we could start over?
- Java: Are You Ready for This?
  Real-world lessons & prepping for JDK 9. Reactive code in Java today, performance/optimization, where Unsafe is heading, & the JVM compiler interface.
- Big Data Meets the Cloud
  Overviews and lessons learned from companies that have implemented their Big Data use cases in the cloud.
- Evolving DevOps
  Lessons/stories on optimizing the deployment pipeline.
- Software Engineering Softskills
  Great engineers do more than code. Learn their secrets and level up.
- Modern CS in the Real World
  Applied, practical, & real-world dive into industry adoption of modern CS ideas.

Wednesday Nov 9
- Architecting for Failure
  Your system will fail. Take control before it takes you with it.
- Stream Processing
  Stream processing, near-real-time processing.
- Bare Metal Performance
  Native languages, kernel bypass, tooling - make the most of your hardware.
- Culture as a Differentiator
  The why and how for building successful engineering cultures.
- //TODO: Security <-- fix this
  Building security from the start. Stories, lessons, and innovations advancing the field of software security.
- UX Reimagined
  Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.