Bugs get triggered. Hardware fails. Networks are unreliable. Accidents and malicious attacks happen. Any of these events can cause a system to stop functioning, or put it into a state so degraded that its performance is no longer acceptable.
Critical systems must continue to function and meet SLAs in the presence of internal or external faults. In this track, we will delve into industry best practices as well as innovative approaches to designing resilient systems.
These approaches can be technical, such as new ways to route around degraded network links. Alternatively, they can be sociotechnical: the more a system depends on its operator, the more important it is that the operator has a clear, unambiguous understanding of the system’s state and how to intervene. Most systems require a mix of both.
The talks in this track cover each of these areas, giving attendees the tools they need to build resilient systems and empower operators.
From this track
Disaster Recovery Across a Million Pieces
Tuesday Oct 3 / 10:35AM PDT
Data recovery is more than just backing up and restoring a data store. The goal of any disaster recovery effort is getting the system back to working as expected across all of its parts.
Michelle Brush
Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of 2 of the "97 Things Every SRE Should Know"
Designing Fault-Tolerant Software with Control System Transparency
Tuesday Oct 3 / 11:45AM PDT
Teams at NASA and JPL that create mission-critical software for spacecraft take a principled approach to fault tolerance. Let's see how those same principles, centered around a concept of transparency, can help us achieve reliability in pragmatic, modern software delivery settings.
Jon Moore
Staff Software Engineer @Stripe with over 35 years of software engineering experience across both academia and industry
How Do We Talk to Each Other? How Surfacing Communication Patterns in Organizations Can Help You Understand and Improve Your Resilience
Tuesday Oct 3 / 01:35PM PDT
As a system inevitably grows in complexity, it becomes impossible for a single operator to have a clear, unambiguous understanding of what's happening in it. Understanding the system requires a joint effort between teammates and technology.
Nora Jones
Founder and CEO @jeli_io, Founder of Learning From Incidents (LFI) Online Community and Conference
How Netflix Ensures Highly-Reliable Online Stateful Systems
Tuesday Oct 3 / 02:45PM PDT
Beneath most stateless services are stateful databases, caches, and other systems that form the bedrock on which applications are built.
Joseph Lynch
Distributed Systems Engineer @Netflix Working on Online Datastores and Data Abstractions
Orchestrating Resilience: Building Modern Asynchronous Systems
Tuesday Oct 3 / 03:55PM PDT
Building asynchronous, event-driven systems can be daunting. Managing states, ensuring resilience, maintaining traceability, and handling a myriad of other challenges often require more effort than building the functionality itself.
Sai Pragna Etikyala
Technical Lead @Twilio
Unconference: Designing for Resilience
Tuesday Oct 3 / 05:05PM PDT
What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.