Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: Applying Failure Testing Research @Netflix
Location:
- Pacific DEKJ
Duration
Persona:
- Architect
- Developer
- General Software
Key Takeaways
- Hear recommendations on how to bring academia and industry together.
- Understand how Netflix’s failure testing service (FIT) works and how it enabled us to build automatic failure testing.
- Learn how to take a theoretical model and make it a reality, including the successes and pitfalls along the way.
Abstract
Industry and academia need each other. Far from the fires of production, researchers have the time to ask the big questions. Sometimes they get lucky, but, detached from real-world constraints, they risk irrelevance by inventing and solving imaginary problems. Industry has the customers, the data, and the problems that only come with scale. Companies want answers to the big questions, but fear the risks of a bad investment. Despite this deep interdependence, collaborations between industry and academia are rare.
We present our experience: a fruitful industry/academic collaboration. We describe how a “big idea” -- lineage-driven fault injection -- evolved from a theoretical model into an automated failure testing service at Netflix. This collaboration required us to take risks, to accept defeats, and to constantly evolve our approach to “make it work”. We sketch the architecture of the automated failure testing service while providing intuition for why it works. Along the way, we will describe the challenges (expected as well as unexpected, technical as well as ideological) that arose, and how we overcame them.
Interview
Kolton: Gremlin provides failure as a service. At Netflix and Amazon, we had nice tooling that made it safe to run failure tests. This let us go out and be proactive in breaking things, so that we could learn how our systems failed.
Gremlin is meant to provide that to the rest of the world. Getting started with failure testing can be difficult, because you need a safety net and some of the control that comes with a tool purpose-built to inject failure safely. Gremlin is built to make doing the right things as easy as possible.
Kolton: We are cloud agnostic. Linux is our target operating system right now, but it is a set of tools that you can deploy that cause these kinds of common bad behaviors. There is a command line interface, a service API, and a simple, powerful web UI. An important aspect of the system is the ability to coordinate with continuous integration/continuous deployment. The web UI is what really makes it safe and easy for people to understand what is happening and, most importantly, how to stop things if they start going awry.
Kolton: Right now, we are focused on infrastructure-related failures, so things that happen on the box level or on the container level. It’s network-related events, resource constraints, things of that nature. You lose a dependency, or it gets slow, you have a noisy neighbor, a host goes away, things along those lines. A bit of a tease though: the V2 that we are going to be working on in the Spring is more like FIT, where it’s application-layer failure testing. It lets us do more intelligent and precise point cuts.
Kolton: FIT is the next generation of the monkeys at Netflix. It provides application level failure testing that allows us to break things at the request level. We can hit individual users or individual devices to understand how they behave. Then we slowly expand the ‘failure scope’ of that impact to hit more and more users. As we learn about the system and understand how a failure behaves on a small scale, we are able to ramp it up and learn about the large scale impact.
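The idea of a request-level "failure scope" that starts with individual users and is gradually widened can be sketched roughly as follows. This is a minimal illustration only, not Netflix's FIT implementation or Gremlin's API: the names `FailureScope`, `InjectedFailure`, and `call_dependency`, and the hash-based bucketing used to pick a percentage of users, are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FailureScope:
    """Hypothetical description of which requests a failure applies to."""
    target_service: str                              # dependency whose calls should fail
    user_ids: set = field(default_factory=set)       # explicitly targeted users
    percent: float = 0.0                             # share of all other users, 0 to 100

    def matches(self, user_id: str) -> bool:
        if user_id in self.user_ids:
            return True
        # Stable hash, so a given user stays in (or out of) scope across
        # requests, and everyone in scope at 1% is still in scope at 10%.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 10_000
        return bucket < self.percent * 100


class InjectedFailure(Exception):
    """Raised in place of the real call when a request is in scope."""


def call_dependency(service, user_id, scope, real_call):
    """Call a dependency, failing it only for requests inside the scope."""
    if scope is not None and scope.target_service == service and scope.matches(user_id):
        raise InjectedFailure(f"injected failure calling {service} for {user_id}")
    return real_call()
```

Under these assumptions, an experiment would begin with something like `FailureScope("ratings", user_ids={"test-user"})`, and only after the small-scale behavior is understood would `percent` be raised step by step to reach more users.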
Kolton: I think for the academic side of the house, be willing to write code. Be willing to sit down and get your hands dirty. Don’t discount those production issues. Things that really make a system robust and stable, especially in the failure testing world, are super important.
On the engineering side, it is a question of time and effort. A lot of engineers are quite content to stay in their own world. Maybe they read some papers. Maybe they watch some talks. The hard part is reaching out to someone that you like or that you are interested in and just getting the conversation started. With Peter, I didn’t know his email. I had no one that introduced me. I saw his talk and then noticed that he was at Berkeley. I bet they have a pretty consistent user name and email policy. I blind sent three emails and said ‘Hey dude, this sounds awesome. Can we talk?’ He got back to me and the conversation went from there.
Tracks
Monday Nov 7
- Architectures You've Always Wondered About
  You know the names. Now learn lessons from their architectures
- Distributed Systems War Stories
  “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport.
- Containers Everywhere
  State of the art in Container deployment, management, scheduling
- Art of Relevancy and Recommendations
  Lessons on the adoption of practical, real-world machine learning practices. AI & Deep learning explored.
- Next Generation Web Standards, Frameworks, and Techniques
  JavaScript, HTML5, WASM, and more... innovations targeting the browser
- Optimize You
  Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.
Tuesday Nov 8
- Next Generation Microservices
  What will microservices look like in 3 years? What if we could start over?
- Java: Are You Ready for This?
  Real world lessons & prepping for JDK9. Reactive code in Java today, Performance/Optimization, Where Unsafe is heading, & JVM compile interface.
- Big Data Meets the Cloud
  Overviews and lessons learned from companies that have implemented their Big Data use-cases in the Cloud
- Evolving DevOps
  Lessons/stories on optimizing the deployment pipeline
- Software Engineering Softskills
  Great engineers do more than code. Learn their secrets and level up.
- Modern CS in the Real World
  Applied, practical, & real-world dive into industry adoption of modern CS ideas
Wednesday Nov 9
- Architecting for Failure
  Your system will fail. Take control before it takes you with it.
- Stream Processing
  Stream Processing, Near-Real Time Processing
- Bare Metal Performance
  Native languages, kernel bypass, tooling - make the most of your hardware
- Culture as a Differentiator
  The why and how for building successful engineering cultures
- //TODO: Security <-- fix this
  Building security from the start. Stories, lessons, and innovations advancing the field of software security.
- UX Reimagined
  Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.