Track: Architecting for Failure


Complex systems fail in spectacular ways. Failure isn’t a question of if, but when. Resilient and robust systems can endure and gracefully recover from failure. In this track we’ll hear from experts who have designed systems that continue operating in the face of adverse circumstances. Attendees will learn about the different types of failure one needs to consider when designing fault-tolerant systems, architectural patterns and approaches that did and didn't work, and takeaways that can be applied to their own systems.

Track Host:
Aysylu Greenberg
Software Engineer @Google
Aysylu Greenberg works at Google on a distributed build system. In her spare time, she ponders the design of systems that deal with inaccuracies, paints and sculpts.

Track Host Interview

QCon: Can you tell me a bit about yourself and your background?

Aysylu: I’m a software engineer at Google, where I have primarily worked on large-scale infrastructure projects. Currently, I work on Google Drive’s infrastructure.

QCon: Can you tell us about the talks in your track?

Aysylu: Michelle Brush, Director of Engineering at Cerner, will share how we should think about failure in our systems. She will talk about ways to model complex systems and discuss how architectural changes introduce new modes of failure and how to mitigate them. The talk will cover examples from various systems and how the conversation around failure was framed at the engineering and organizational levels.

The talk by Tom Faulhaber is about how to navigate the different container offerings and how to use them when architecting for failure. Containers have been a hot topic for a while, but with all the new technologies coming out, it’s difficult to navigate the offerings and understand what the right choices are. This talk will give attendees a better understanding of these options and how to use them when designing fault-tolerant systems.

Another talk is by Ali Basiri from Netflix, in which we’ll learn about automating chaos experiments in production. Chaos Engineering is a very exciting trend in how we approach building large-scale, fault-tolerant systems. The idea is that instead of assuming failures will happen and hoping our systems will recover, we cause failures to happen and then check whether the system recovers as expected. Basiri will specifically talk about the chaos automation platform they built at Netflix, which allows them to run experiments on a system in production and verify the system’s behavior.

Oliver Gould, the CTO at Buoyant, will discuss the early days of Twitter and how Twitter learned to embrace failure to build resilient systems. The talk will cover Finagle, a high-scale RPC library, its mechanisms for dealing with failures, and the naming scheme that allows for handling failure across system boundaries.

Edwin Fuquen, a fellow Googler working on developer infrastructure, will talk about architecture migrations in large-scale systems. Specifically, he will discuss migrating a key infrastructure system to a new architecture on top of Spanner, designing for recovery from different failure modes, and lessons learned during the design of the new system and its migration.

QCon: If this track came out exactly the way you want, what would people walk away with?

Aysylu: First of all, I would like the audience to understand the different types of failure.

Failures occur in different layers of the computational stack, starting from the low level (OS) and going all the way up to business and application logic. Then there is organizational resilience to consider: people might leave the company permanently or just go on vacation, and either way teams and systems have to stay resilient even when the experts who know the system best are not available. Finally, there are failures caused by unknown unknowns. These are the hardest type because, by their nature, we don’t know what to expect, so we might be unprepared to handle them.

Another thing I would like the audience to take away from this track is that failure is bound to happen. We should therefore design our systems with failure in mind from the start, considering the different failure modes and planning for graceful recovery from them.

Finally, I hope the audience will learn to embrace failure. Failures will happen, so we can focus our energy on understanding the different modes of failure and how best to recover from them. I hope that attendees will walk away with lots of practical considerations, and that they will take the lessons shared by the speakers and apply them in their day-to-day operations and development to build more fault-tolerant systems.

10:35am - 11:25am

by Tom Faulhaber
Principal Data Analysis Leader @Infolace

The container revolution is upon us and with it comes a new toolbox for building systems that are robust in the face of failures. Created in just the past few years, this new set of tools demands that we rethink our approach to architecting for failure. When we do, we will reap the benefits of architectural models that make it much simpler to reason about and handle failures of all types.

In this talk, we'll explore this toolbox and how the tools can be used to best effect. We will...

11:50am - 12:40pm

by Edwin Fuquen
Software Engineer @Google

Designing systems that take failure into account from the start is hard. Sometimes it’s very tempting to take shortcuts that will adversely affect your system in the long run. However, there are steps one can take to avoid these shortcuts and build a fault-tolerant system.

This talk is a case study in transitioning Guitar, an internal integration testing framework, to Spanner. Spanner is a database developed internally at Google that provides a fast, distributed data store for...

1:40pm - 2:30pm

by Ali Basiri
Senior Software Engineer @Netflix

Imagine a world where you receive an alert about an outage that hasn’t happened yet. At Netflix, we are building a Chaos Automation Platform (ChAP) to realize this vision. ChAP runs experiments to test that microservices are resilient to failures in downstream dependencies. These experiments run in production. ChAP siphons off a fraction of real traffic, injects failures, and measures how these failures change system behavior.

ChAP focuses on a specific type of failure: a failed RPC...
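
As a rough illustration of the kind of experiment described in this abstract, the sketch below diverts a small fraction of simulated requests into an experiment group, injects a downstream failure for that group only, and compares its success rate with an equally sized control group. It is not ChAP's implementation: every name here (`fetch_recommendations`, `resilient_handler`, `fragile_handler`, `run_experiment`) is hypothetical, and the "traffic" is a seeded random loop rather than real production requests.

```python
import random


class DependencyError(Exception):
    """Raised when the simulated downstream call fails."""


def fetch_recommendations(inject_failure):
    # Hypothetical downstream dependency; the flag stands in for a failed RPC.
    if inject_failure:
        raise DependencyError("injected failure")
    return ["title-1", "title-2"]


def resilient_handler(inject_failure):
    # Degrades gracefully: falls back to a static list if the dependency fails.
    try:
        return fetch_recommendations(inject_failure)
    except DependencyError:
        return ["popular-title"]


def fragile_handler(inject_failure):
    # No fallback: a dependency failure becomes a failed request.
    return fetch_recommendations(inject_failure)


def run_experiment(handler, n_requests=10_000, fraction=0.01, seed=0):
    """Return the success rate of the control and experiment groups."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    counts = {"control": [0, 0], "experiment": [0, 0]}  # [successes, total]
    for _ in range(n_requests):
        r = rng.random()
        if r < fraction:
            group, inject = "experiment", True
        elif r < 2 * fraction:
            group, inject = "control", False
        else:
            continue  # the rest of the traffic is left untouched
        counts[group][1] += 1
        try:
            handler(inject)
            counts[group][0] += 1
        except DependencyError:
            pass  # counted as a failed request
    return {g: (ok / total if total else None) for g, (ok, total) in counts.items()}


if __name__ == "__main__":
    print("resilient:", run_experiment(resilient_handler))
    print("fragile:  ", run_experiment(fragile_handler))
```

A handler with a fallback shows the same success rate in both groups, while one without a fallback shows a gap between control and experiment. That gap is the kind of signal such an experiment is meant to surface before a real outage does.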

2:55pm - 3:45pm

by Michelle Brush
Director of Engineering @Cerner

As software practitioners, we must move beyond a willful ignorance of the potential for failure in our architectures to a humble acceptance of our responsibility to not only monitor, but also mitigate that potential. To do this, we must understand the inherent risk of the systems we build. Only then can we decide how to become resilient, become vigilant, and adapt. This talk addresses the first step towards owning our potential for failure, which is identifying it. It does so by describing a...

4:10pm - 5:00pm

by Oliver Gould
CTO @Buoyant

Twitter was once known for its ever-present error page, the “Fail Whale.” Thousands of staff-years later, this iconic image has all but faded from memory. This transformation was only possible due to Twitter’s treatment of failure as something not just to be expected, but to be embraced.

In this talk, we discuss the technical insights that enabled Twitter to fail, safely and often. We will show how Finagle, the high-scale RPC library used at Twitter, Pinterest, SoundCloud, and other...

5:25pm - 6:15pm

Open Space



Monday Nov 7

Tuesday Nov 8

Wednesday Nov 9