Your organization likely has a process for incidents. You have alerting, runbooks, an on-call rotation, and some version of a post-incident review. And you probably still have the nagging sense that you keep learning the same lessons, or not learning them at all. This training is about why that happens, and what to do about it.
Most of what we know about incident response comes from after-the-fact accounts, sanitized for legal and reputational reasons, that tell you surface-level details about what happened but not what it actually felt like to be in the middle of it. This training is built around something that's almost never available: an incident you can examine from the inside, in full detail, without the filters.
This full-day class gives you access to an incident from start to finish, examined with complete transparency. Using an incident drill, a controlled scenario that gives us the kind of visibility that legal and reputational constraints typically make impossible, participants will work with actual chat transcripts, video recordings of responders in action, and analyst notes to understand how a real incident unfolds from the inside. This is not a tabletop exercise with a tidy resolution. It is as close as you can get to being in the room without being the one on call.
Morning: Incident Response in Practice
The morning puts you inside an active incident. By participating in a live drill and reading the analysis of a drill previously conducted by two experts, you will experience the dynamics that make incidents genuinely hard: multi-party coordination under pressure, information saturation, the hidden work that never makes it into the post-incident report, and the moment-to-moment decisions practitioners make with incomplete information and competing priorities. The goal is not to find the right answer. It is to develop a more accurate mental model of what skilled incident response actually looks like, and what it demands from the people doing it.
Afternoon: From Response to Learning
The afternoon shifts from response to analysis. Participants will work through a structured post-incident review of the morning's scenario, applying frameworks from Resilience Engineering and Human Factors research to move beyond surface-level root cause analysis toward a genuine understanding of how systemic conditions shaped the outcome. This section addresses the gap between how most teams currently conduct post-incident reviews and what actually produces lasting improvement — including why the most common conclusions ("human error," "we need more automation," "tighter processes") often make systems more brittle rather than more resilient.
The day closes with a moderated panel of four practitioners, each from a different company, who have been through significant incidents and their aftermath. The discussion covers what effective incident response looks like under real pressure, the nature of expertise in high-stakes situations, how to extract meaningful insights from complex events rather than comfortable narratives, and what organizations that genuinely improve after incidents do differently from those that don't.
What You'll Leave With:
- A sharper, more honest picture of what makes incident response succeed and what makes it harder.
- Practical frameworks for post-incident analysis that produces real learning rather than documentation.
- A clearer understanding of the systemic and human factors that shape how incidents unfold, and what that means for how you build, operate, and staff your systems.
- The rare experience of having examined exceptional incident response in real detail, discussed openly, without the filters that normally apply.
Who Should Attend:
Senior engineers, SREs, incident commanders, and engineering managers who regularly deal with production incidents — either as responders or as the people responsible for ensuring their organizations learn from them. Also relevant for architects and technical leads who want to understand how the systems they design behave under stress, and what that should mean for the decisions they make upstream.
Speaker
Courtney Nash
Internet Incident Librarian & Research Analyst, Previously @Verica, @Holloway, @Fastly, @O’Reilly Media, @Microsoft, & @Amazon
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Prowler, Verica, Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.