The Stories Behind the Incidents

This track takes you behind the curtain and into the heart of system meltdowns at some of the world's leading software companies. Learn directly from SREs about real-world, high-impact production failures at scale, including the immediate challenges of triage, diagnosis, and mitigation in complex distributed systems. From these stories, you’ll gain insights into the nature of real incidents and how skilled SREs recover from them.

You’ll learn how ambiguous, confusing, and uncertain incidents feel when you’re in the middle of them, and hear how engineers improvised innovative solutions to restore service. You’ll also learn how fundamentally unpredictable incidents are and, consequently, the importance of preparing to be surprised.


From this track

Session Incidents

The Human Toll of Incidents & Ways To Mitigate It

Wednesday Nov 19 / 10:35AM PST

Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and on the human response to the incidents.

Kyle Lexmond

Production Engineer @Meta, Previously @AWS and @Twitter

Session Incidents

When Incidents Refuse to End

Wednesday Nov 19 / 11:45AM PST

As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.

Vanessa Huerta Granda

Resilience Engineering Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis, Board Member for the Resilience in Software Foundation

Session Staff Plus Engineering

The Ironies of A^2 I^2

Wednesday Nov 19 / 01:35PM PST

In this talk, we'll explore some of the "ironies" of automation, and now artificial intelligence, in their interactions with software operators (i.e., you), especially during high-consequence, high-tempo situations (aka incidents).

J. Paul Reed

Staff Incident Operations Manager @Chime

Session Incident Response

Week-Long Outage: Lifelong Lessons

Wednesday Nov 19 / 02:45PM PST

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like those we had seen in past upgrades.

Molly Struve

Staff Site Reliability Engineer @Netflix

Session Incident Analysis

The Time it Wasn't DNS

Wednesday Nov 19 / 03:55PM PST

In January of 2023, the Microsoft Azure Wide Area Network experienced a global outage. If you were a Microsoft customer at the time, you were impacted by this outage.

Sean Klein

Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure

Track Host

Lorin Hochstein

Staff Software Engineer @Airbnb, Writes @surfingcomplexity.blog, Previously @Netflix and Member of the Resilience in Software Foundation

Lorin Hochstein is Staff Software Engineer, Reliability at Airbnb. He was previously Senior Staff Software Engineer at Coupang, Senior Software Engineer at Netflix, Senior Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California's Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.

Lorin has a B.Eng. in Computer Engineering from McGill University, an M.S. in Electrical Engineering from Boston University, and a PhD in Computer Science from the University of Maryland.
