Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: Automating Chaos Experiments In Production
Location:
- Pacific DEKJ
Duration:
Level:
- Intermediate
Persona:
- Architect
Key Takeaways
- Hear about Netflix’s motivation in creating a Chaos Automation Platform (ChAP).
- Understand techniques Netflix used to implement ChAP, and how it helps teams identify systemic weaknesses.
- Understand how to apply failure injection testing in a way that still protects customers and evolves the architecture.
Abstract
Imagine a world where you receive an alert about an outage that hasn’t happened yet. At Netflix, we are building a Chaos Automation Platform (ChAP) to realize this vision. ChAP runs experiments to test that microservices are resilient to failures in downstream dependencies. These experiments run in production. ChAP siphons off a fraction of real traffic, injects failures, and measures how these failures change system behavior.
ChAP focuses on a specific type of failure: a failed RPC call between microservices. Many types of failures at the level of an individual service can be modeled as an RPC failure or delay: a service that crashes, runs out of resources, or is highly loaded will appear to a client as either returning an error or increased latency.
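Because these service-level failures all surface to a client as an error or added latency on an RPC call, injecting them can be modeled as wrapping the call itself. Below is a minimal, hypothetical sketch of that idea; the function names and parameters are assumptions for illustration, since ChAP performs injection inside Netflix's RPC layer rather than with a wrapper like this.

```python
import random
import time

class RpcError(Exception):
    """Simulated downstream failure."""

def inject_failure(call, failure_rate=0.0, added_latency_s=0.0):
    """Wrap an RPC-style call so a fraction of invocations fail,
    and the rest are optionally delayed to model a slow dependency."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise RpcError("injected failure")   # model a crashed/erroring service
        if added_latency_s > 0:
            time.sleep(added_latency_s)          # model an overloaded service
        return call(*args, **kwargs)
    return wrapped
```

The same wrapper covers both failure modes the abstract describes: set `failure_rate` to model errors, or `added_latency_s` to model an overloaded dependency.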
This talk will cover the motivation behind ChAP, how we implemented it, and how Netflix service teams are using it to identify systemic weaknesses.
Interview
Ali: I’m a chaos engineer at Netflix. The Chaos Team’s goal is to leverage chaos to improve the reliability of services. We found value in consulting with other teams on failure injection, but that approach didn’t scale across the organization. We used the learnings from those individual engagements to build a platform (ChAP) that better leverages failure injection across Netflix.
Ali: ChAP is what we call our Chaos Automation Platform. It does failure testing by separating out a fraction of the traffic and running it through an experiment cluster and a control cluster. We then inject failure into the experiment cluster and compare the results against the control to find issues or potential problems with the system.
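The traffic split described above can be sketched as a deterministic bucketing function. This is a hypothetical illustration: the cluster names, hashing scheme, and 1% default are assumptions, and ChAP's real routing lives inside Netflix's edge and RPC infrastructure. Hashing the request id keeps a given customer in the same cluster for the whole experiment rather than failing them intermittently.

```python
import hashlib

def route_request(request_id, sample_fraction=0.01):
    """Bucket a request into the experiment, control, or normal cluster,
    sending equal small slices of traffic to experiment and control."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    cutoff = int(sample_fraction * 10_000)
    if bucket < cutoff:
        return "experiment"   # this slice receives injected failures
    if bucket < 2 * cutoff:
        return "control"      # identical slice with no injection
    return "production"       # untouched majority of traffic
```

Comparing metrics between the two equally sized slices, rather than against the whole fleet, is what makes a small behavioral difference detectable.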
Ali: We take a couple of precautions. First, we do some local testing internally with different devices connected to the production environment. We inject failure into just those devices and make sure everything is working correctly. Then we run a larger-scale experiment. If we do uncover a problem at that larger scale, we can quickly stop the experiment. We do real-time analysis on the metrics in order to mitigate impact as quickly as possible.
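The real-time analysis that stops an experiment can be reduced to a simple abort check comparing a success metric between the two clusters. This is a minimal sketch; the parameter names and the 5% threshold are assumptions, and Netflix has described comparing business-level signals such as stream starts per second between experiment and control.

```python
def experiment_unhealthy(control_success, experiment_success, tolerance=0.05):
    """Return True if the experiment's success metric has dropped more
    than `tolerance` relative to the control cluster."""
    if control_success <= 0:
        return True  # no healthy baseline: fail safe and abort
    drop = (control_success - experiment_success) / control_success
    return drop > tolerance
```

Evaluated continuously against live metrics, a True result would trigger an automatic shutdown of the experiment, which is how impact is bounded even when a real weakness is found.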
Ali: I want to increase awareness of failure injection and how we can use it to make complex systems more reliable. If there are more practitioners of chaos engineering, then together we can advance what is still a new field.
Ali: One of the main takeaways is the effectiveness of failure injection testing and how we can apply it in a much less impactful way.
Ali: When we talk about chaos, people often think of Chaos Monkey, where you are injecting chaos in a sense. With ChAP we are injecting failure in a much more controlled manner into a system that is chaotic by nature. We try to understand where we are injecting failure and how we are injecting it. I will talk about some examples and learnings we’ve had, and cover incremental steps we can take to get big wins using failure injection.
Ali: For one of our services, the expectation was that when the service is down, the impact on the user experience would be small and the UI would handle the failure gracefully. Instead of displaying the movies personalized to you, we would display a set of popular movies instead. This fallback would be a decent customer experience.
To test this assumption, we ran a large-scale experiment that injected latency into responses from the service. The upstream service is configured to detect an increase in latency and timeouts, and should short-circuit the call by returning fallbacks. However, instead of quickly triggering the fallback, the upstream service became overwhelmed and started falling over. To recover from this experiment, we had to perform a regional failover to another region. Even though the experiment had a catastrophic outcome, we learned a lot. We applied those learnings to tuning these systems across all of our services, and that had a huge impact on our availability in the long term.
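The short-circuiting behavior the upstream service was expected to have is the circuit breaker pattern, which Netflix's Hystrix library implements. Below is a minimal, hypothetical sketch of the pattern, not Netflix's implementation; the class, thresholds, and fallback are illustrative assumptions.

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures, return the fallback
    immediately instead of calling the struggling dependency, and retry
    only after `reset_after_s` seconds."""
    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()      # circuit open: skip the call entirely
            self.opened_at = None      # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0
        return result
```

The war story above is what happens when this mechanism is mistuned: if the breaker never trips, every slow call ties up resources in the upstream service until it falls over, which is exactly the systemic weakness the latency-injection experiment exposed.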
Tracks

Monday Nov 7
- Architectures You've Always Wondered About
  You know the names. Now learn lessons from their architectures.
- Distributed Systems War Stories
  “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport
- Containers Everywhere
  State of the art in container deployment, management, and scheduling.
- Art of Relevancy and Recommendations
  Lessons on the adoption of practical, real-world machine learning practices. AI & deep learning explored.
- Next Generation Web Standards, Frameworks, and Techniques
  JavaScript, HTML5, WASM, and more... innovations targeting the browser.
- Optimize You
  Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.

Tuesday Nov 8
- Next Generation Microservices
  What will microservices look like in 3 years? What if we could start over?
- Java: Are You Ready for This?
  Real-world lessons & prepping for JDK 9. Reactive code in Java today, performance/optimization, where Unsafe is heading, & the JVM compiler interface.
- Big Data Meets the Cloud
  Overviews and lessons learned from companies that have implemented their Big Data use cases in the cloud.
- Evolving DevOps
  Lessons/stories on optimizing the deployment pipeline.
- Software Engineering Softskills
  Great engineers do more than code. Learn their secrets and level up.
- Modern CS in the Real World
  Applied, practical, & real-world dive into industry adoption of modern CS ideas.

Wednesday Nov 9
- Architecting for Failure
  Your system will fail. Take control before it takes you with it.
- Stream Processing
  Stream processing, near-real-time processing.
- Bare Metal Performance
  Native languages, kernel bypass, tooling - make the most of your hardware.
- Culture as a Differentiator
  The why and how for building successful engineering cultures.
- //TODO: Security <-- fix this
  Building security from the start. Stories, lessons, and innovations advancing the field of software security.
- UX Reimagined
  Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.