Presentation: Applying Failure Testing Research @Netflix

Duration: 
1:40pm - 2:30pm

Key Takeaways

  • Hear recommendations on how to bring academia and industry together.
  • Understand how Netflix’s failure testing service (FIT) works and how it enabled us to build automatic failure testing.
  • Learn how to take a theoretical model and make it a reality, including the successes and pitfalls along the way.

Abstract

Industry and academia need each other. Far from the fires of production, researchers have the time to ask the big questions. Sometimes they get lucky, but detached from real-world constraints they risk irrelevance by inventing and solving imaginary problems. Industry has the customers, the data, and the problems that only come with scale. It wants answers to the big questions, but fears the risks of a bad investment. Despite this deep interdependence, collaborations between industry and academia are rare.

We present our experience: a fruitful industry/academic collaboration. We describe how a “big idea” -- lineage-driven fault injection -- evolved from a theoretical model into an automated failure testing service at Netflix. This collaboration required us to take risks, to accept defeats, and to constantly evolve our approach to “make it work”. We sketch the architecture of the automated failure testing service while providing intuition for why it works. Along the way, we will describe the challenges (expected as well as unexpected, technical as well as ideological) that arose, and how we overcame them.

Interview

Question: 
QCon: For people that don’t know about Gremlin, why don’t you tell us a bit about what you are doing at Gremlin?
Answer: 

Kolton: Gremlin provides failure as a service. At Netflix and Amazon, we had nice tooling that made it safe to run failure tests. This let us go out and be proactive in breaking things, so that we could learn how our systems failed.

Gremlin is meant to provide that to the rest of the world. Getting started with failure testing can be difficult, because you need a safety net and some of the control that comes with a tool purpose-built to inject failure safely. Gremlin is built to make doing the right things as easy as possible.

Question: 
QCon: So what does that mean (Failure as a Service I mean)? Is this a set of services you deploy on top of AWS and people register their environment with?
Answer: 

Kolton: We are cloud-agnostic. Linux is our target operating system right now, but it is a set of tools that you can deploy that cause these kinds of common bad behaviors. There is a command line interface, a service API, and a simple, powerful web UI. An important aspect of the system is the ability to coordinate with continuous integration/continuous deployment. The web UI is what really makes it safe and easy for people to understand what is happening and, most importantly, how to stop things if they start going awry.

Question: 
QCon: Are you focused on AMI, starting and stopping them in containers? Are you knocking out DNS? How wide is the scope that you are focusing on?
Answer: 

Kolton: Right now, we are focused on infrastructure-related failures. So things that happen on the box level or on the container level. It’s network-related events, resource constraints, things of that nature. You lose a dependency, or it gets slow, you have a noisy neighbor, a host goes away, things along those lines. A bit of a tease, though: the V2 that we are going to be working on in the spring is more like FIT, where it’s application-layer failure testing. It lets us do more intelligent and precise point cuts.

Question: 
QCon: You mention FIT. Can you summarize FIT?
Answer: 

Kolton: FIT is the next generation of the monkeys at Netflix. It provides application-level failure testing that allows us to break things at the request level. We can hit individual users or individual devices to understand how they behave. Then we slowly expand the ‘failure scope’ of that impact to hit more and more users. As we learn about the system and understand how a failure behaves on a small scale, we are able to ramp it up and learn about the large-scale impact.
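
As a rough illustration of the "failure scope" idea described above, the following Python sketch decides per request whether to inject a failure, first for explicitly named users and then for a growing percentage of everyone else. It is not FIT's actual implementation or API: the names FailureScope, bucket, should_fail, call_dependency and DependencyUnavailable are all hypothetical, and the hash-based bucketing is only an assumption about how a stable percentage ramp could be done.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScope:
    """Hypothetical description of which requests a failure test applies to."""
    user_ids: frozenset = frozenset()  # explicitly targeted users
    percent: float = 0.0               # share of all other users, 0 to 100


class DependencyUnavailable(Exception):
    """Raised in place of a real call when a failure is injected."""


def bucket(user_id: str) -> float:
    """Map a user id to a stable value in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return (int(digest[:8], 16) % 10000) / 100.0


def should_fail(scope: FailureScope, user_id: str) -> bool:
    """Decide, per request, whether this user falls inside the failure scope."""
    if user_id in scope.user_ids:
        return True
    return bucket(user_id) < scope.percent


def call_dependency(scope: FailureScope, user_id: str) -> str:
    """Stand-in for a request-level call that may have a failure injected."""
    if should_fail(scope, user_id):
        raise DependencyUnavailable(f"injected failure for {user_id}")
    return "ok"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    # Ramp the scope: one test user first, then 1%, then 10% of users.
    ramp = [
        FailureScope(user_ids=frozenset({"user-0"})),
        FailureScope(percent=1.0),
        FailureScope(percent=10.0),
    ]
    for scope in ramp:
        affected = sum(should_fail(scope, u) for u in users)
        print(scope, "affects", affected, "of", len(users), "users")
```

Because each user always hashes to the same bucket, anyone affected at a small percentage stays affected as the percentage grows, so each step of the ramp only adds users to the experiment.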

Question: 
QCon: So any tips that you could give someone else that is going down a similar type path?
Answer: 

Kolton: I think for the academic side of the house, be willing to write code. Be willing to sit down and get your hands dirty. Don’t discount those production issues. Things that really make a system robust and stable, especially in the failure testing world, are super important. 

On the engineering side, it is a question of time and effort. A lot of engineers are quite content to stay in their own world. Maybe they read some papers. Maybe they watch some talks. The hard part is reaching out to someone that you like or that you are interested in and just getting the conversation started. With Peter, I didn’t know his email. I had no one to introduce me. I saw his talk and then noticed that he was at Berkeley. I figured they had a pretty consistent user name and email policy. I blind-sent three emails and said ‘Hey dude, this sounds awesome. Can we talk?’ He got back to me and the conversation went from there.

Speaker: Kolton Andrus

Founder of Gremlin Inc, former Netflix

Kolton is the founder of Gremlin Inc - helping companies build more robust services. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services. He designed and built FIT: Netflix’s failure injection service. Prior to that, he improved the performance and reliability of the Amazon Retail website. At both companies he has served as a ‘Call Leader’, managing the resolution of company-wide incidents. Kolton is passionate about building resilient systems, primarily as it lets him break things for fun and profit.

Speaker: Peter Alvaro

Computer Science Assistant Professor @UniversityofCalifornia

Peter Alvaro is an Assistant Professor of Computer Science at the University of California Santa Cruz. His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems, in order to make them scalable, predictable and robust to the failures and nondeterminism endemic to large-scale distribution. Peter is the creator of the Dedalus language and co-creator of the Bloom language. While pursuing his PhD at UC Berkeley, Peter co-developed and taught Programming the Cloud, an undergraduate course that explored distributed systems concepts through the lens of software development. Prior to attending Berkeley, Peter worked as a Senior Software Engineer in the data analytics team at Ask.com. Peter's principal research interests are databases, distributed systems and programming languages.
