Disaster Recovery Across a Million Pieces

Data recovery is more than just backing up and restoring a data store. The goal of any disaster recovery effort is getting the system back to working as expected across all of its parts. Recovering data by itself only brings us back in time to a view of reality that might not reflect how each part sees the world. This could mean millions of different views of reality in large systems.

This talk covers challenges, patterns, and practices for disaster recovery actions in massively distributed systems. It focuses on two commonly used patterns for restoring the whole system to the same reality:

  • Rebuild the world
  • Restore & reconcile

We will discuss how these approaches were used in different systems, the challenges and tradeoffs experienced, and why sometimes the answer is "Why not both?" Finally, we’ll explore practices that help improve confidence and recovery time, reducing stress and ensuring things get back to working as fast as possible.  
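
To make the two patterns concrete, here is a minimal sketch of one possible reading of them. It is not taken from the talk: the `Component` class, the in-memory dictionaries, and both function names are hypothetical and stand in for real data stores, caches, and indexes. It assumes the data store has already been restored from backup and that each downstream part still holds its own, possibly divergent, view.

```python
from __future__ import annotations

from dataclasses import dataclass, field

_MISSING = object()  # sentinel for "key not present"


@dataclass
class Component:
    """Hypothetical downstream part (cache, index, ...) with its own view of the data."""
    name: str
    view: dict = field(default_factory=dict)


def rebuild_the_world(restored_store: dict, components: list[Component]) -> None:
    """Assumed reading of 'rebuild the world': discard every derived view
    and regenerate it wholesale from the restored store."""
    for component in components:
        component.view = dict(restored_store)


def restore_and_reconcile(restored_store: dict, components: list[Component]) -> dict:
    """Assumed reading of 'restore & reconcile': keep each view and repair only
    the keys that disagree with the restored store.

    Returns a mapping of component name to the sorted list of keys repaired.
    """
    repaired = {}
    for component in components:
        changed = []
        for key in sorted(set(restored_store) | set(component.view)):
            want = restored_store.get(key, _MISSING)
            have = component.view.get(key, _MISSING)
            if have == want:
                continue  # this part already agrees with the restored data
            if want is _MISSING:
                del component.view[key]  # entry written after the backup was taken
            else:
                component.view[key] = want
            changed.append(key)
        repaired[component.name] = changed
    return repaired
```

In this toy version, rebuilding is simple but touches every entry in every part, while reconciling touches only what differs but has to compare everything first. That is one way to picture the tradeoff the abstract refers to.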

Interview:

What's the focus of your work these days?

My work centers on improving the reliability of Google's infrastructure-as-a-service offering. Much of it is proactively identifying and mitigating areas of risk in the system, but my teams also run incident response and drive the learning-from-incidents process.

What's the motivation for your talk at QCon San Francisco 2023?

In all the companies I've worked for, there's been a moment where we needed to recover or repair some critical data, and the architectural decisions made early on either made that moment a lot easier or made it a lot harder than it needed to be. I wanted to give folks some tools for reasoning about their architecture's ability to respond to disasters so maybe it will fall on the easy side for them.

How would you describe your main persona and target audience for this session?

This talk is for senior engineers who might be making big architectural decisions and system engineers who might be involved in any disaster-level response.

Is there anything specific that you'd like people to walk away with after watching your session?

I would like folks to walk away with the understanding that disaster recovery is more challenging than just backing up your data stores. If you want your system to be recoverable in a disaster, you have to make sure the architecture will support it.


Speaker

Michelle Brush

Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of 2 out of the "97 Things Every SRE Should Know"

Michelle Brush is a math geek turned computer geek with over 20 years of software development experience. She has developed algorithms and data structures for pathfinding, search, compression, and data mining in embedded as well as distributed systems. In her current role as an Engineering Director, SRE for Google, she leads teams of SREs that ensure GCP's Compute Engine and Persistent Disk products are reliable. Previously, she served as the Director of HealtheIntent Architecture for Cerner Corporation, responsible for the data engineering platform for Cerner’s Population Health solutions. Prior to her time at Cerner, she was the lead engineer for Garmin's automotive routing algorithm. She is the author of 2 out of the 97 Things Every SRE Should Know.

Date

Tuesday Oct 3 / 10:35AM PDT (50 minutes)

Location

Ballroom BC

Topics

Architecture, Reliability, Disaster Recovery

From the same track

Session Database

How Netflix Ensures Highly-Reliable Online Stateful Systems

Tuesday Oct 3 / 02:45PM PDT

Under most stateless services are stateful databases, caches, and other systems that form the bedrock applications are built on.

Joseph Lynch

Distributed Systems Engineer @Netflix Working on Online Datastores and Data Abstractions

Session Resiliency

How Do We Talk to Each Other? How Surfacing Communication Patterns in Organizations Can Help You Understand and Improve Your Resilience

Tuesday Oct 3 / 01:35PM PDT

As a system increases in inevitable complexity, it becomes impossible for a single operator to have a clear, unambiguous understanding of what's happening in the system. Understanding the system requires a joint effort between teammates and technology.

Nora Jones

Founder and CEO @jeli_io, Founder of Learning From Incidents (LFI) Online Community and Conference

Session Resiliency

Orchestrating Resilience: Building Modern Asynchronous Systems

Tuesday Oct 3 / 03:55PM PDT

Building asynchronous, event-driven systems can be daunting. Managing states, ensuring resilience, maintaining traceability, and handling a myriad of other challenges often require more effort than building the functionality itself.

Sai Pragna Etikyala

Technical Lead @Twilio

Session Reliability

Designing Fault-Tolerant Software with Control System Transparency

Tuesday Oct 3 / 11:45AM PDT

Teams at NASA and JPL that create mission-critical software for spacecraft take a principled approach to fault tolerance. Let's see how those same principles, centered around a concept of transparency, can help us achieve reliability in pragmatic, modern software delivery settings.

Jon Moore

Staff Software Engineer @Stripe with over 35 years of software engineering experience across both academia and industry

Session

Unconference: Designing for Resilience

Tuesday Oct 3 / 05:05PM PDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.