Track: Architecting for Failure
Complex systems fail in spectacular ways. Failure isn’t a question of if, but when. Resilient and robust systems can endure and gracefully recover from failure. In this track we’ll hear from experts who have designed systems that continue operating in the face of adverse circumstances. Attendees will learn about the different types of failure one needs to consider when designing fault-tolerant systems, architectural patterns and approaches that did and didn't work, and takeaways that can be applied to their own systems.
Aysylu: I’m a software engineer at Google, where I have primarily worked on large-scale infrastructure projects. Currently, I work on Google Drive’s infrastructure.
Aysylu: Michelle Brush, Director of Engineering at Cerner, will share with us how we should think about failure in our systems. She will talk about ways to model complex systems, discuss how architectural changes introduce new modes of failure and how to mitigate them. The talk will cover examples with various systems and how the discussion around failure was set at engineering and organizational layers.
The talk by Tom Faulhaber is about how to navigate the different container offerings, and how one can use them when architecting for failure. Containers have been a hot topic for a while, but with all the new technologies coming out, it’s difficult to navigate the offerings and understand which choices are right. This talk will give attendees a better understanding of these options and how to use them when designing fault-tolerant systems.
Another talk is by Ali Basiri from Netflix, in which we’ll learn about automating chaos experiments in production. Chaos Engineering is a very exciting trend in how we approach building fault-tolerant, large-scale systems. The idea is that instead of assuming failures will happen and hoping our systems will recover, we deliberately cause failures and then check whether the system recovers as expected. Basiri will specifically talk about the chaos automation platform they built at Netflix, which allows them to run experiments on a system in production and verify the system’s behavior.
Oliver Gould, the CTO at Buoyant, will discuss the early days of Twitter and how Twitter learned to embrace failure to build resilient systems. The talk will cover Finagle, a high-scale RPC library, its mechanisms for dealing with failures, and the naming scheme that allows for handling failure across system boundaries.
Edwin Fuquen, a fellow Googler working on developer infrastructure, will talk about architecture migrations in large-scale systems. Specifically, he will discuss migrating a key infrastructure system to a new architecture on top of Spanner, designing for recovery from different failure modes, and lessons learned during the design of the new system and its migration.
Aysylu: First of all, I would like the audience to understand the different types of failure.
Failures occur in different layers of the computational stack, from the low level (OS) all the way up to business and application logic. Then there is organizational resilience to consider as well: people might leave the company permanently or just go on vacation, but either way teams and systems have to be resilient even when the experts who know the system best are not available. Finally, there are failures that come from unknown unknowns. These are the hardest type because, by their nature, we don’t know what to expect and so may be unprepared to handle them.
Another thing I would like the audience to take away from this track is that failure is bound to happen. So we should design our systems with failure in mind from the start, considering the different failure modes and planning for graceful recovery from them.
Finally, I hope the audience will learn to embrace failure. Failures will happen, so we can focus our energy on understanding the different modes of failure and how best to recover from them. I hope that attendees will walk away with lots of practical considerations, and that they will take the lessons shared by the speakers and apply them in their day-to-day operations and development to build more fault-tolerant systems.
by Tom Faulhaber
Principal Data Analysis Leader @Infolace
The container revolution is upon us and with it comes a new toolbox for building systems that are robust in the face of failures. Created in just the past few years, this new set of tools demands that we rethink our approach to architecting for failure. When we do, we will reap the benefits of architectural models that make it much simpler to reason about and handle failures of all types.
In this talk, we'll explore this toolbox and how the tools can be used to best effect. We will...
by Edwin Fuquen
Software Engineer @Google
Designing systems that take failure into account from the start is hard. Sometimes it’s very tempting to take shortcuts that will adversely affect your system in the long run. However, there are steps one can take to avoid these shortcuts and build a fault-tolerant system.
This talk is a case study in transitioning Guitar, an internal integration testing framework, to Spanner. Spanner is a database developed internally at Google that provides a fast, distributed data store for...
by Ali Basiri
Senior Software Engineer @Netflix
Imagine a world where you receive an alert about an outage that hasn’t happened yet. At Netflix, we are building a Chaos Automation Platform (ChAP) to realize this vision. ChAP runs experiments to test that microservices are resilient to failures in downstream dependencies. These experiments run in production. ChAP siphons off a fraction of real traffic, injects failures, and measures how these failures change system behavior.
ChAP focuses on a specific type of failure: a failed RPC...
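The experiment loop described in this abstract (route a small fraction of traffic into an experiment group, inject a downstream failure there, and compare behavior against the rest) can be illustrated with a toy sketch. This is not ChAP's implementation or API. Every name below (`DownstreamError`, `call_dependency`, `handle_request`, `run_experiment`) is hypothetical, and the "service" is a local function standing in for a real RPC.

```python
import random


class DownstreamError(Exception):
    """Simulated failure of a downstream RPC (hypothetical)."""


def call_dependency(request_id):
    # Stand-in for a real RPC to a downstream service.
    return f"response-{request_id}"


def handle_request(request_id, inject_failure):
    """Serve one request, degrading to a fallback if the dependency fails."""
    try:
        if inject_failure:
            # The injected fault: the downstream call fails.
            raise DownstreamError("injected failure")
        return ("ok", call_dependency(request_id))
    except DownstreamError:
        # Graceful degradation: serve a default instead of erroring out.
        return ("fallback", "cached-default")


def run_experiment(num_requests, fraction, seed=0):
    """Send `fraction` of requests to the experiment group and tally outcomes."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    stats = {
        "control": {"ok": 0, "fallback": 0},
        "experiment": {"ok": 0, "fallback": 0},
    }
    for request_id in range(num_requests):
        in_experiment = rng.random() < fraction
        group = "experiment" if in_experiment else "control"
        status, _ = handle_request(request_id, inject_failure=in_experiment)
        stats[group][status] += 1
    return stats


if __name__ == "__main__":
    print(run_experiment(num_requests=1000, fraction=0.05))
```

In this sketch the service counts as resilient if every experiment-group request ends in `fallback` rather than an unhandled error, while the control group is unaffected. A real platform would compare business-level metrics between the two groups instead of simple counters.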
by Michelle Brush
Director of Engineering @Cerner
As software practitioners we must move beyond a willful ignorance of the potential for failure in our architectures to a humble acceptance of our responsibility to not only monitor, but also mitigate that potential. To do this, we must understand the inherent risk of the systems we build. Only then can we decide how to become resilient, become vigilant, and adapt. This talk addresses the first step towards owning our potential for failure, which is identifying it. It does so by describing a...
by Oliver Gould
Twitter was once known for its ever-present error page, the “Fail Whale.” Thousands of staff-years later, this iconic image has all but faded from memory. This transformation was only possible due to Twitter’s treatment of failure as something not just to be expected, but to be embraced.
In this talk, we discuss the technical insights that enabled Twitter to fail, safely and often. We will show how Finagle, the high-scale RPC library used at Twitter, Pinterest, SoundCloud, and other...
Monday Nov 7
Architectures You've Always Wondered About
You know the names. Now learn lessons from their architectures
Distributed Systems War Stories
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport.
State of the art in Container deployment, management, scheduling
Art of Relevancy and Recommendations
Lessons on the adoption of practical, real-world machine learning practices. AI & Deep learning explored.
Next Generation Web Standards, Frameworks, and Techniques
Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.
Tuesday Nov 8
Next Generation Microservices
What will microservices look like in 3 years? What if we could start over?
Java: Are You Ready for This?
Real world lessons & prepping for JDK9. Reactive code in Java today, Performance/Optimization, Where Unsafe is heading, & JVM compile interface.
Big Data Meets the Cloud
Overviews and lessons learned from companies that have implemented their Big Data use-cases in the Cloud
Lessons/stories on optimizing the deployment pipeline
Software Engineering Softskills
Great engineers do more than code. Learn their secrets and level up.
Modern CS in the Real World
Applied, practical, & real-world dive into industry adoption of modern CS ideas
Wednesday Nov 9
Architecting for Failure
Your system will fail. Take control before it takes you with it.
Stream Processing, Near-Real Time Processing
Bare Metal Performance
Native languages, kernel bypass, tooling - make the most of your hardware
Culture as a Differentiator
The why and how for building successful engineering cultures
//TODO: Security <-- fix this
Building security from the start. Stories, lessons, and innovations advancing the field of software security.
Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.