Track:

Location:

Pacific DEKJ

Duration:

1:40pm - 2:30pm

Persona:

Architect
Developer
General Software

Key Takeaways

Hear recommendations on how to bring academia and industry together.
Understand how Netflix’s failure testing service (FIT) works and how it enabled us to build automatic failure testing.
Learn how to take a theoretical model and make it a reality, including the successes and pitfalls along the way.

Abstract

Industry and academia need each other. Far from the fires of production, researchers have the time to ask the big questions. Sometimes they get lucky, but detached from real-world constraints they risk irrelevance by inventing and solving imaginary problems. Industry has the customers, the data, and the problems that only come with scale. They want answers to the big questions, but fear the risks of a bad investment. Despite this deep interdependence, collaborations between industry and academia are rare.

We present our experience: a fruitful industry/academic collaboration. We describe how a “big idea” -- lineage-driven fault injection -- evolved from a theoretical model into an automated failure testing service at Netflix. This collaboration required us to take risks, to accept defeats, and to constantly evolve our approach to “make it work”. We sketch the architecture of the automated failure testing service while providing intuition for why it works. Along the way, we will describe the challenges (expected as well as unexpected, technical as well as ideological) that arose, and how we overcame them.

Interview

Question:

QCon: For people that don’t know about Gremlin, why don’t you tell us a bit about what you are doing at Gremlin?

Answer:

Kolton: Gremlin provides failure as a service. At Netflix and Amazon, we had nice tooling that made it safe to run failure tests. This let us go out and be proactive in breaking things, so that we could learn how our systems failed.

Gremlin is meant to provide that to the rest of the world. Getting started with failure testing can be difficult, because you need a safety net and some of the control that comes with a tool purpose-built to inject failure safely. Gremlin is built to make doing the right things as easy as possible.

Question:

QCon: So what does that mean (Failure as a Service I mean)? Is this a set of services you deploy on top of AWS and people register their environment with?

Answer:

Kolton: We are cloud agnostic. Linux is our target operating system right now, but it is a set of tools that you can deploy that cause these kinds of common bad behaviors. There is a command line interface, a service API, and a simple, powerful web UI. An important aspect of the system is the ability to coordinate with continuous integration/continuous deployment. The web UI is what really makes it safe and easy for people to understand what is happening and, most importantly, how to stop things if they start going awry.

Question:

QCon: Are you focused on AMI, starting and stopping them in containers? Are you knocking out DNS? How wide is the scope that you are focusing on?

Answer:

Kolton: Right now, we are focused on infrastructure-related failures. So things that happen on the box level or on the container level. It’s network-related events, resource constraints, things of that nature. You lose a dependency, or it gets slow, you have a noisy neighbor, a host goes away, things along those lines. A bit of a tease though, the V2 that we are going to be working on in the Spring is more like FIT, where it’s application-layer failure testing. It lets us do more intelligent and precise point cuts.

Question:

QCon: You mention FIT. Can you summarize FIT?

Answer:

Kolton: FIT is the next generation of the monkeys at Netflix. It provides application level failure testing that allows us to break things at the request level. We can hit individual users or individual devices to understand how they behave. Then we slowly expand the ‘failure scope’ of that impact to hit more and more users. As we learn about the system and understand how a failure behaves on a small scale, we are able to ramp it up and learn about the large scale impact.
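
The idea of starting with individual users and then widening the failure scope can be sketched in a few lines of Python. This is a minimal, hypothetical illustration and not FIT's actual implementation or API: the names `FailureScope`, `should_inject`, `call_dependency`, and `affected_users` are invented for this example, and hashing user IDs into stable buckets is an assumption about how a growing subset of users might be selected.

```python
import hashlib


class InjectedFailure(Exception):
    """Raised in place of a real dependency error when a failure is injected."""


class FailureScope:
    """Hypothetical failure scope: decides which requests see an injected failure."""

    def __init__(self, dependency, percent=0.0, user_ids=()):
        self.dependency = dependency    # name of the dependency to break
        self.percent = percent          # share of users affected, 0 to 100
        self.user_ids = set(user_ids)   # explicitly targeted users or devices

    def _bucket(self, user_id):
        # Hash the user ID into a stable bucket in [0, 10000), so a user who is
        # inside the scope at 1% is still inside it at 10%.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 10000

    def should_inject(self, user_id, dependency):
        if dependency != self.dependency:
            return False
        if user_id in self.user_ids:
            return True
        return self._bucket(user_id) < self.percent * 100


def call_dependency(scope, user_id, dependency, real_call):
    """Run real_call unless this request falls inside the failure scope."""
    if scope.should_inject(user_id, dependency):
        raise InjectedFailure(f"{dependency} failed for {user_id}")
    return real_call()


def affected_users(scope, users, dependency):
    """Return the users in `users` whose requests would see the failure."""
    return [u for u in users if scope.should_inject(u, dependency)]
```

In this sketch, an experiment would begin with `FailureScope("ratings", user_ids=["test-user"])`, then move to `percent=1.0`, then `percent=10.0`, checking how the system behaves at each step before widening the scope further. Because the buckets are stable, each wider scope contains every user from the narrower one.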

Question:

QCon: So any tips that you could give someone else that is going down a similar type path?

Answer:

Kolton: I think for the academic side of the house, be willing to write code. Be willing to sit down and get your hands dirty. Don’t discount those production issues. Things that really make a system robust and stable, especially in the failure testing world, are super important.

On the engineering side, it is a question of time and effort. A lot of engineers are quite content to stay in their own world. Maybe they read some papers. Maybe they watch some talks. The hard part is reaching out to someone that you like or that you are interested in and just getting the conversation started. With Peter, I didn’t know his email. I had no one that introduced me. I saw his talk and then noticed that he was at Berkeley. I bet they have a pretty consistent user name and email policy. I blind sent three emails and said ‘Hey dude, this sounds awesome. Can we talk?’ He got back to me and the conversation went from there.

Speaker: Kolton Andrus

Founder of Gremlin Inc, former Netflix

Kolton is the founder of Gremlin Inc - helping companies build more robust services. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services. He designed and built FIT: Netflix’s failure injection service. Prior to that, he improved the performance and reliability of the Amazon Retail website. At both companies he has served as a ‘Call Leader’, managing the resolution of company-wide incidents. Kolton is passionate about building resilient systems, primarily as it lets him break things for fun and profit.

Find Kolton Andrus at

Speaker page

@KoltonAndrus

Speaker: Peter Alvaro

Computer Science Assistant Professor @UniversityofCalifornia

Peter Alvaro is an Assistant Professor of Computer Science at the University of California Santa Cruz. His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems, in order to make them scalable, predictable and robust to the failures and nondeterminism endemic to large-scale distribution. Peter is the creator of the Dedalus language and co-creator of the Bloom language. While pursuing his PhD at UC Berkeley, Peter co-developed and taught Programming the Cloud, an undergraduate course that explored distributed systems concepts through the lens of software development. Prior to attending Berkeley, Peter worked as a Senior Software Engineer in the data analytics team at Ask.com. Peter's principal research interests are databases, distributed systems and programming languages.

Find Peter Alvaro at

Speaker page

http://people.ucsc.edu/~palvaro/

@palvaro

Similar Talks

Senior Software Engineer, Playback Features @Netflix

Haley Tucker

99.99% Availability via Smart Real-Time Alerting

Data Science Manager @Uber

Franziska Bell

Creating A Culture of Observability at Stripe

Observability Specialist @Stripe

Cory Watson

Migrating to a Fault Tolerant System with Spanner

Software Engineer @Google

Edwin Fuquen

Freeing the Whale: How to Fail at Scale

CTO @Buoyant

Oliver Gould

Automating Chaos Experiments In Production

Senior Software Engineer @Netflix

Ali Basiri

Architecting for Failure in a Containerized World

Principal Data Analysis Leader @Infolace

Tom Faulhaber

Scaling Quality On Quora Using Machine Learning

Engineering Manager @Quora

Nikhil Garg

Query Understanding: a Manifesto

Data Scientist, Author of "Faceted Search"

Daniel Tunkelang

Tracks

Monday Nov 7

Architectures You've Always Wondered About

You know the names. Now learn lessons from their architectures.

Distributed Systems War Stories

“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport.

Containers Everywhere

State of the art in container deployment, management, and scheduling.

Art of Relevancy and Recommendations

Lessons on the adoption of practical, real-world machine learning practices. AI & Deep learning explored.

Next Generation Web Standards, Frameworks, and Techniques

JavaScript, HTML5, WASM, and more... innovations targeting the browser.

Optimize You

Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.

Tuesday Nov 8

Next Generation Microservices

What will microservices look like in 3 years? What if we could start over?

Java: Are You Ready for This?

Real world lessons & prepping for JDK9. Reactive code in Java today, Performance/Optimization, Where Unsafe is heading, & JVM compile interface.

Big Data Meets the Cloud

Overviews and lessons learned from companies that have implemented their Big Data use-cases in the Cloud

Evolving DevOps

Lessons/stories on optimizing the deployment pipeline

Software Engineering Softskills

Great engineers do more than code. Learn their secrets and level up.

Modern CS in the Real World

Applied, practical, & real-world dive into industry adoption of modern CS ideas

Wednesday Nov 9

Architecting for Failure

Your system will fail. Take control before it takes you with it.

Stream Processing

Stream Processing, Near-Real Time Processing

Bare Metal Performance

Native languages, kernel bypass, tooling - make the most of your hardware

Culture as a Differentiator

The why and how for building successful engineering cultures

//TODO: Security <-- fix this

Building security from the start. Stories, lessons, and innovations advancing the field of software security.

UX Reimagined

Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.

Conference for Professional Software Developers

Presentation: Applying Failure Testing Research @Netflix
