Week-Long Outage: Lifelong Lessons

Abstract

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like those we had seen from past upgrades. But what started as a routine upgrade became a week-long catastrophe that brought our platform to its knees. For six grueling days, we fought cluster instability while our Fortune 500 customers demanded answers we didn't have.

This talk shares the raw story of that struggle and how disaster became the greatest teacher. The experience highlighted how psychological safety, community support, exceptional leadership, and team character can often matter more than technical solutions. You'll take away six hard-won lessons that will better prepare you for when your next "routine" upgrade goes sideways.


Speaker

Molly Struve

Staff Site Reliability Engineer @Netflix

Molly Struve is a Staff Site Reliability Engineer at Netflix with a degree in Aerospace Engineering from MIT. She is passionate about building reliable and scalable software and teams. Her diverse experience includes leading globally distributed teams, architecting databases, and optimizing complex systems and processes. Every day, she strives to lead by example and empower those around her by sharing all that she has learned from her time in the industry. When she isn't wrangling incidents or servers, she can be found riding and jumping her show horses.

Date

Wednesday Nov 19 / 02:45PM PST (50 minutes)

Location

Ballroom BC

From the same track

Session

War Stories from the Front Lines of Production

Wednesday Nov 19 / 11:45AM PST

Details coming soon.

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session

The Incident that Shaped Our Engineering Culture

Wednesday Nov 19 / 10:35AM PST

Details coming soon.

Session

Rebuilding A System After a Security Breach

Wednesday Nov 19 / 01:35PM PST

Details coming soon.

Session

Postmortem of a Downtime: What Was Learned from A Big Mistake

Wednesday Nov 19 / 03:55PM PST

Details coming soon.