Presentation: Automating Chaos Experiments In Production



1:40pm - 2:30pm



Key Takeaways

  • Hear about Netflix’s motivation in creating a Chaos Automation Platform (ChAP).
  • Understand techniques Netflix used to implement ChAP, and how it helps teams identify systemic weaknesses.
  • Understand how to apply failure injection testing in a way that still protects customers and evolves the architecture.


Imagine a world where you receive an alert about an outage that hasn’t happened yet. At Netflix, we are building a Chaos Automation Platform (ChAP) to realize this vision. ChAP runs experiments to test that microservices are resilient to failures in downstream dependencies. These experiments run in production. ChAP siphons off a fraction of real traffic, injects failures, and measures how these failures change system behavior.

ChAP focuses on a specific type of failure: a failed RPC call between microservices. Many types of failures at the level of an individual service can be modeled as an RPC failure or delay: a service that crashes, runs out of resources, or is highly loaded will appear to a client as either returning an error or increased latency.
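Both failure modes described above can be modeled with a thin wrapper around the RPC call that injects an error or extra latency for a sampled fraction of requests. The sketch below is illustrative only; the class and parameter names are not ChAP's actual API.

```python
import random
import time

class RpcFailureInjector:
    """Injects failure or latency into a fraction of RPC calls.

    Illustrative sketch: `error_rate`, `added_latency_s`, and
    `sample_rate` are hypothetical knobs, not ChAP's real interface.
    """

    def __init__(self, error_rate=0.0, added_latency_s=0.0, sample_rate=1.0):
        self.error_rate = error_rate            # fraction of sampled calls that fail
        self.added_latency_s = added_latency_s  # delay injected per sampled call
        self.sample_rate = sample_rate          # fraction of traffic in the experiment

    def call(self, rpc_fn, *args, **kwargs):
        # Only the sampled fraction of traffic participates in the experiment.
        if random.random() < self.sample_rate:
            if self.added_latency_s:
                # Model an overloaded or slow dependency.
                time.sleep(self.added_latency_s)
            if random.random() < self.error_rate:
                # Model a crashed or erroring dependency.
                raise ConnectionError("injected dependency failure")
        return rpc_fn(*args, **kwargs)
```

Either knob alone reproduces one of the two failure modes: a nonzero `error_rate` looks like a crashed service, nonzero `added_latency_s` looks like an overloaded one.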

This talk will cover the motivation behind ChAP, how we implemented it, and how Netflix service teams are using it to identify systemic weaknesses.


QCon: What is your role today and where did ChAP come from?

Ali: I’m a chaos engineer at Netflix. The Chaos Team’s goal is to use failure injection to improve the reliability of our services. We found value in consulting with individual teams on failure injection, but that approach didn’t scale across the organization. We used the learnings from those engagements to build a platform, ChAP, that applies failure injection more broadly across Netflix.

QCon: I am not familiar with ChAP. Can you tell me a bit about it?

Ali: ChAP is our Chaos Automation Platform. It does failure testing by siphoning off a fraction of traffic and routing it through an experiment cluster and a control cluster. We then inject failure into the experiment cluster and compare its results against the control to find issues or potential problems in the system.
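The traffic split described above can be sketched as a routing function that sends equal small slices of live traffic to the experiment and control clusters, so the two populations stay comparable, and everything else to normal production. The cluster names and the 2% default are illustrative assumptions, not ChAP's real configuration.

```python
import hashlib

def route_request(request_id, experiment_fraction=0.02):
    """Route a request to 'experiment', 'control', or 'production'.

    Hashing the request id gives a stable value in [0, 1), so the same
    request always routes the same way. Half of the experiment fraction
    gets failure injection; the matching half serves as the control.
    """
    h = int(hashlib.md5(str(request_id).encode()).hexdigest(), 16)
    r = (h % 10_000) / 10_000
    if r < experiment_fraction / 2:
        return "experiment"   # failure will be injected here
    elif r < experiment_fraction:
        return "control"      # identical traffic slice, no injection
    return "production"       # untouched majority of traffic
```

Keeping the experiment and control slices the same size means any metric difference between them can be attributed to the injected failure rather than to sampling bias.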

QCon: So when you break off a section of traffic like that, how do you protect the customer from experiencing degradation in their services?

Ali: We take a couple of precautions. First, we do some local testing internally with different devices connected to the production environment. We inject failure into just those devices and make sure everything works correctly. Then we run a larger-scale experiment. If we uncover a problem at that larger scale, we can stop the experiment quickly. We do real-time analysis on the metrics in order to mitigate impact as quickly as possible.
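The real-time safeguard described above amounts to a kill switch: if a key success metric in the experiment population drops too far below the control, the experiment is stopped. The sketch below is a simplified stand-in; the class name, threshold, and latching behavior are assumptions, not ChAP's actual implementation.

```python
class ExperimentGuard:
    """Stops an experiment when its success rate falls too far below the control.

    Illustrative sketch: real safeguards do continuous statistical
    analysis on many metrics, not a single threshold check.
    """

    def __init__(self, max_relative_drop=0.05):
        self.max_relative_drop = max_relative_drop
        self.running = True

    def check(self, control_success_rate, experiment_success_rate):
        # Compare the two populations; a large relative drop means the
        # injected failure is hurting real customers.
        if control_success_rate > 0:
            drop = (control_success_rate - experiment_success_rate) / control_success_rate
            if drop > self.max_relative_drop:
                self.running = False  # kill switch: route all traffic back
        return self.running
```

Once tripped, the guard stays off: a halted experiment should not restart itself even if the metrics recover.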

QCon: What’s the motivation for your talk?

Ali: I want to increase awareness about failure injection and how we can use it to make complex systems more reliable. Chaos engineering is a new field, and the more practitioners there are, the faster we can all advance it together.

QCon: What are your key takeaways for this talk?

Ali: One of the main takeaways is the effectiveness of failure injection testing, and how we can apply it in a way that minimizes impact on customers.

QCon: Are there small steps I can take to perform this kind of test? Or is it a big multi-region failover exercise where I have to deal with losing DNS and everything that comes with that?

Ali: When we talk about chaos, people often think of Chaos Monkey, where you are injecting chaos in a sense. With ChAP we are injecting failure in a much more controlled manner into a system that is chaotic by nature. We are deliberate about where and how we inject failure. I will talk through some examples, some lessons we’ve learned, and incremental steps teams can take to get big wins from failure injection.

QCon: How has ChAP helped Netflix make its architecture more resilient? Any interesting stories you’ll share?

Ali: For one of our services, the expectation was that if the service went down, the impact on the user experience would be small and the UI would handle the failure gracefully. Instead of displaying the movies personalized to you, we would display a set of popular movies. This fallback would still be a decent customer experience.

To test this assumption, we ran a large-scale experiment that injected latency into responses from the service. The upstream service is configured to detect an increase in latency and timeouts, and should short-circuit the call by returning fallbacks. Instead of quickly triggering the fallback, however, the upstream service became overwhelmed and started falling over. To recover from the experiment, we had to perform a failover to another region. Even though the experiment had a catastrophic outcome, we learned a lot. We applied those lessons to tuning these systems across all of our services, and it had a huge impact on our long-term availability.
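The intended behavior in the story above is the classic timeout-plus-fallback pattern: cap how long the caller waits for the personalization service, and serve unpersonalized popular titles on timeout or error. The sketch below shows the pattern in miniature; the function names and the static fallback list are hypothetical (at Netflix this role is played by dedicated resilience tooling, not a hand-rolled executor).

```python
import concurrent.futures

# Static, unpersonalized fallback content (illustrative placeholder titles).
POPULAR_TITLES = ["Popular Title A", "Popular Title B"]

def get_recommendations(fetch_personalized, timeout_s=0.2):
    """Call the personalization service with a hard timeout.

    On timeout or error, return the popular-titles fallback instead of
    letting latency pile up in the caller. Sketch only: a production
    implementation would also bound concurrency and trip a circuit
    breaker after repeated failures, which is what the experiment in
    the story showed was mis-tuned.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_personalized)
        try:
            return future.result(timeout=timeout_s)
        except Exception:
            # Timeout or dependency failure: degrade gracefully.
            return POPULAR_TITLES
```

The failure mode in the story corresponds to this timeout being too generous relative to the caller's capacity: requests queue up waiting on the slow dependency and exhaust the upstream service before the fallback ever fires.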

Speaker: Ali Basiri

Senior Software Engineer @Netflix


