Location:

Ballroom A

Day of week:

Wednesday

Complex systems fail in spectacular ways. Failure isn’t a question of if, but when. Resilient and robust systems can endure and gracefully recover from failure. In this track we’ll hear from experts who have designed systems that continue operating in the face of adverse circumstances. Attendees will learn about the different types of failure one needs to consider when designing fault-tolerant systems, architectural patterns and approaches that did and didn't work, and takeaways that can be applied to their own systems.

Track Host:

Aysylu Greenberg

Software Engineer @Google

Aysylu Greenberg works at Google on a distributed build system. In her spare time, she ponders the design of systems that deal with inaccuracies, paints and sculpts.

Trackhost Interview

Question:

QCon: Can you tell me a bit about yourself and your background?

Answer:

Aysylu: I’m a software engineer at Google, where I have primarily worked on large-scale infrastructure projects. Currently, I work on Google Drive’s infrastructure.

Question:

QCon: Can you tell us about the talks in your track?

Answer:

Aysylu: Michelle Brush, Director of Engineering at Cerner, will share with us how we should think about failure in our systems. She will talk about ways to model complex systems and discuss how architectural changes introduce new modes of failure and how to mitigate them. The talk will cover examples from various systems and how the discussion around failure was framed at the engineering and organizational levels.

The talk by Tom Faulhaber is about how to navigate the different container offerings and how one can use them when architecting for failure. Containers have been a hot topic for a while, but with all the new technologies coming out, it’s difficult to navigate the offerings and understand what the right choices are. This talk will give attendees a better understanding of these options and how to use them when designing fault-tolerant systems.

Another talk is by Ali Basiri from Netflix, in which we’ll learn about automating chaos experiments in production. Chaos Engineering is a very exciting trend in how we approach building fault-tolerant, large-scale systems. The idea is that instead of assuming failures will happen and hoping our systems will recover, we deliberately cause failures and then see whether the system recovers as expected. Basiri will specifically talk about the chaos automation platform they built at Netflix, which allows them to run experiments on a system in production and verify the system’s behavior.

Oliver Gould, the CTO at Buoyant, will discuss the early days of Twitter and how Twitter learned to embrace failure to build resilient systems. The talk will cover Finagle, a high-scale RPC library, its mechanisms for dealing with failures, and the naming scheme that allows for handling failure across system boundaries.

Edwin Fuquen, a fellow Googler working on developer infrastructure, will talk about architecture migrations in large-scale systems. Specifically, he will discuss migrating a key infrastructure system to a new architecture on top of Spanner, designing for recovery from different failure modes, and lessons learned during the design of the new system and its migration.

Question:

QCon: If this track came out exactly the way you want, what would people walk away with?

Answer:

Aysylu: First of all, I would like the audience to understand the different types of failure.

Failures occur in different layers of the computational stack, starting from the low level (OS) and going all the way up to business and application logic. Then there is organizational resilience to consider as well: people might leave the company permanently, or they might just go on vacation, and either way teams and systems have to keep working when the experts who know the system best are not available. Finally, there are failures that come from unknown unknowns. These are the hardest type because, by their nature, we don’t know what to expect, so we might be unprepared to handle them.

Another thing I would like the audience to walk away with is that failure is bound to happen. So we should design our systems with failure in mind from the start, by considering the different failure modes and planning for graceful recovery from them.

Finally, I hope the audience will learn to embrace failure. Failures will happen, so we can focus our energy on understanding the different modes of failure and how best to recover from them. I hope that attendees will walk away with lots of practical considerations, and that they will take the lessons shared by the speakers back to their day-to-day operations and development to build more fault-tolerant systems.

10:35am - 11:25am

by Tom Faulhaber
Principal Data Analysis Leader @Infolace

Architecting for Failure in a Containerized World

The container revolution is upon us and with it comes a new toolbox for building systems that are robust in the face of failures. Created in just the past few years, this new set of tools demands that we rethink our approach to architecting for failure. When we do, we will reap the benefits of architectural models that make it much simpler to reason about and handle failures of all types.

In this talk, we'll explore this toolbox and how the tools can be used to best effect. We will...

11:50am - 12:40pm

by Edwin Fuquen
Software Engineer @Google

Migrating to a Fault Tolerant System with Spanner

Designing systems that take failure into account from the start is hard. Sometimes it’s very tempting to take shortcuts that will adversely affect your system in the long run. However, there are steps one can take to avoid these shortcuts and build a fault-tolerant system.

This talk is a case study in transitioning Guitar, an internal integration testing framework, to Spanner. Spanner is a database developed internally at Google that provides a fast, distributed data store for...

1:40pm - 2:30pm

by Ali Basiri
Senior Software Engineer @Netflix

Automating Chaos Experiments In Production

Imagine a world where you receive an alert about an outage that hasn’t happened yet. At Netflix, we are building a Chaos Automation Platform (ChAP) to realize this vision. ChAP runs experiments to test that microservices are resilient to failures in downstream dependencies. These experiments run in production. ChAP siphons off a fraction of real traffic, injects failures, and measures how these failures change system behavior.

ChAP focuses on a specific type of failure: a failed RPC...
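
The experiment loop described in this abstract (sample a fraction of traffic, inject a failure, measure the effect) can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions and not ChAP itself: the names `downstream_call`, `handle_request`, and `run_experiment` are hypothetical, the "traffic" is a loop of simulated requests, and the use of an equally sized control group for comparison is an assumption made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Request and error counters for one traffic group."""
    requests: int = 0
    errors: int = 0

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def downstream_call(inject_failure: bool) -> str:
    """Hypothetical downstream dependency; raises when a failure is injected."""
    if inject_failure:
        raise ConnectionError("injected RPC failure")
    return "fresh-result"


def handle_request(inject_failure: bool, use_fallback: bool) -> bool:
    """Return True if the request succeeds from the caller's point of view."""
    try:
        downstream_call(inject_failure)
        return True
    except ConnectionError:
        # A resilient service degrades to a fallback instead of erroring.
        return use_fallback


def run_experiment(total: int, sample_fraction: float,
                   use_fallback: bool, seed: int = 0):
    """Route a small fraction of simulated traffic into two groups.

    The experiment group gets an injected failure; the control group
    (an assumption of this sketch) is left alone for comparison.
    """
    rng = random.Random(seed)
    control, experiment = GroupStats(), GroupStats()
    for _ in range(total):
        roll = rng.random()
        if roll < sample_fraction:
            group, inject = experiment, True
        elif roll < 2 * sample_fraction:
            group, inject = control, False
        else:
            continue  # ordinary traffic is not part of the experiment
        group.requests += 1
        if not handle_request(inject, use_fallback):
            group.errors += 1
    return control, experiment


if __name__ == "__main__":
    for fallback in (True, False):
        control, experiment = run_experiment(10_000, 0.01, use_fallback=fallback)
        print(f"fallback={fallback} "
              f"control={control.error_rate:.2f} "
              f"experiment={experiment.error_rate:.2f}")
```

In this toy setup, a service with a working fallback shows the same error rate in both groups, while a service without one shows errors only in the experiment group. That gap between the two groups is the signal such an experiment looks for.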

2:55pm - 3:45pm

by Michelle Brush
Director of Engineering @Cerner

Framing Our Potential for Failure

As software practitioners we must move beyond a willful ignorance of the potential for failure in our architectures to a humble acceptance of our responsibility to not only monitor, but also mitigate that potential. To do this, we must understand the inherent risk of the systems we build. Only then can we decide how to become resilient, become vigilant, and adapt. This talk addresses the first step towards owning our potential for failure: identifying it. It does so by describing a...

4:10pm - 5:00pm

by Oliver Gould
CTO @Buoyant

Freeing the Whale: How to Fail at Scale

Twitter was once known for its ever-present error page, the “Fail Whale.” Thousands of staff-years later, this iconic image has all but faded from memory. This transformation was only possible due to Twitter’s treatment of failure as something not just to be expected, but to be embraced.

In this talk, we discuss the technical insights that enabled Twitter to fail, safely and often. We will show how Finagle, the high-scale RPC library used at Twitter, Pinterest, SoundCloud, and other...

5:25pm - 6:15pm

Open Space

Architecting for Failure Open Space

Tracks

Monday Nov 7

Architectures You've Always Wondered About

You know the names. Now learn lessons from their architectures

Distributed Systems War Stories

“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport.

Containers Everywhere

State of the art in Container deployment, management, scheduling

Art of Relevancy and Recommendations

Lessons on the adoption of practical, real-world machine learning practices. AI & Deep learning explored.

Next Generation Web Standards, Frameworks, and Techniques

JavaScript, HTML5, WASM, and more... innovations targeting the browser

Optimize You

Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.

Tuesday Nov 8

Next Generation Microservices

What will microservices look like in 3 years? What if we could start over?

Java: Are You Ready for This?

Real world lessons & prepping for JDK9. Reactive code in Java today, Performance/Optimization, Where Unsafe is heading, & JVM compile interface.

Big Data Meets the Cloud

Overviews and lessons learned from companies that have implemented their Big Data use-cases in the Cloud

Evolving DevOps

Lessons/stories on optimizing the deployment pipeline

Software Engineering Softskills

Great engineers do more than code. Learn their secrets and level up.

Modern CS in the Real World

Applied, practical, & real-world dive into industry adoption of modern CS ideas

Wednesday Nov 9

Architecting for Failure

Your system will fail. Take control before it takes you with it.

Stream Processing

Stream Processing, Near-Real Time Processing

Bare Metal Performance

Native languages, kernel bypass, tooling - make the most of your hardware

Culture as a Differentiator

The why and how for building successful engineering cultures

//TODO: Security <-- fix this

Building security from the start. Stories, lessons, and innovations advancing the field of software security.

UX Reimagined

Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.

Conference for Professional Software Developers

Track: Architecting for Failure
