Designing for Resilience

Bugs get triggered. Hardware fails. Networks are unreliable. Accidents and malicious attacks happen. Any of these events can cause a system to stop functioning, or put it into a state so degraded that its performance is no longer acceptable.

Critical systems must continue to function and meet SLAs in the presence of internal or external faults. In this track, we will delve into industry best practices as well as innovative approaches to designing resilient systems.

These approaches can be technical, such as new ways to route around degraded network links. Alternatively, they can be sociotechnical: the more a system depends on its operator, the more important it is that the operator has a clear, unambiguous understanding of the system’s state and how to intervene. Most systems require a mix of both.

The sessions in this track cover each of these areas, giving attendees the tools they need to build resilient systems and empower operators.


From this track

Session Architecture

Disaster Recovery Across a Million Pieces

Tuesday Oct 3 / 10:35AM PDT

Data recovery is more than just backing up and restoring a data store. The goal of any disaster recovery effort is getting the system back to working as expected across all of its parts.

Michelle Brush

Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of "2 out of the 97 Things Every SRE Should Know"

Session Reliability

Designing Fault-Tolerant Software with Control System Transparency

Tuesday Oct 3 / 11:45AM PDT

Teams at NASA and JPL that create mission-critical software for spacecraft take a principled approach to fault tolerance. Let's see how those same principles, centered around a concept of transparency, can help us achieve reliability in pragmatic, modern software delivery settings.

Jon Moore

Staff Software Engineer @Stripe with over 35 years of software engineering experience across both academia and industry

Session Resiliency

How Do We Talk to Each Other? How Surfacing Communication Patterns in Organizations Can Help You Understand and Improve Your Resilience

Tuesday Oct 3 / 01:35PM PDT

As a system inevitably grows in complexity, it becomes impossible for a single operator to have a clear, unambiguous understanding of what's happening in the system. Understanding the system requires a joint effort between teammates and technology.

Nora Jones

Founder and CEO @jeli_io, Founder of Learning From Incidents (LFI) Online Community and Conference

Session Database

How Netflix Ensures Highly-Reliable Online Stateful Systems

Tuesday Oct 3 / 02:45PM PDT

Beneath most stateless services are stateful databases, caches, and other systems that form the bedrock on which applications are built.

Joseph Lynch

Distributed Systems Engineer @Netflix Working on Online Datastores and Data Abstractions

Session Resiliency

Orchestrating Resilience: Building Modern Asynchronous Systems

Tuesday Oct 3 / 03:55PM PDT

Building asynchronous, event-driven systems can be daunting. Managing states, ensuring resilience, maintaining traceability, and handling a myriad of other challenges often require more effort than building the functionality itself.

Sai Pragna Etikyala

Technical Lead @Twilio

Session

Unconference: Designing for Resilience

Tuesday Oct 3 / 05:05PM PDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.

Track Host

Javier Fernandez-Ivern

Staff Software Engineer @Netflix with over 20 years in Software Engineering

Javier Fernandez-Ivern is a member of the DRM & Manifest team at Netflix, where he is responsible for ensuring that customers always enjoy their favorite shows with the best video, audio, text, and other features available. His services fill a key role in enabling Netflix to stream amazing content to more than 230M members on thousands of devices worldwide. Prior to Netflix, Javier spent a few years working at a competitive programming startup before moving into a consulting role where he built web applications for a variety of clients. After trying out management at Capital One, he returned to his software engineering roots and joined Netflix. Javier enjoys developing and operating highly available services, and the scale at Netflix has been a unique and exciting challenge. Javier received an MS in Computer Science from Eastern Washington University.
