Data recovery is more than just backing up and restoring a data store. The goal of any disaster recovery effort is to get the system working as expected across all of its parts. Recovering data by itself only takes us back in time to a view of reality that might not match how each part sees the world. In large systems, that could mean millions of different views of reality.
This talk covers challenges, patterns, and practices for disaster recovery actions in massively distributed systems. It focuses on two commonly used patterns for restoring the whole system to the same reality:
- Rebuild the world
- Restore & reconcile
We will discuss how these approaches were used in different systems, the challenges and tradeoffs experienced, and why sometimes the answer is "Why not both?" Finally, we’ll explore practices that help improve confidence and recovery time, reducing stress and ensuring things get back to working as fast as possible.
Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of 2 out of the "97 Things Every SRE Should Know"
Michelle Brush is a math geek turned computer geek with over 20 years of software development experience. She has developed algorithms and data structures for pathfinding, search, compression, and data mining in embedded as well as distributed systems. In her current role as an Engineering Director, SRE at Google, she leads teams of SREs that ensure GCP's Compute Engine and Persistent Disk products are reliable. Previously, she served as the Director of HealtheIntent Architecture for Cerner Corporation, responsible for the data engineering platform for Cerner’s Population Health solutions. Prior to her time at Cerner, she was the lead engineer for Garmin's automotive routing algorithm. She is the author of 2 out of the "97 Things Every SRE Should Know".