You are viewing content from a past/completed conference.

Rethinking Reliability: What You Can (and Can't) Learn From Incidents

Abstract

This talk presents research collected from the VOID, an open database of public incident reports. Containing over 2,000 reports from almost 700 organizations, the database allows for more structured review and research about software-related incident reporting. Key results from our research challenge standard industry practices for incident response and analysis, like tracking Mean Time To Resolve (MTTR) and using Root Cause Analysis (RCA) methodology. In particular, we demonstrate how unreliable MTTR can be, and how RCA can lead to environments where people are less likely to admit mistakes and speak up about things that could lead to future incidents. We propose alternate metrics (SLOs and cost of coordination data), practices (Near Miss analysis), and mindsets (humans are the solution, not the problem) to help organizations better learn from their incidents, and make their systems safer and more resilient.

Speaker

Courtney Nash

Co-founder @The VOID, Previously @Verica, @Holloway, @Fastly, @O’Reilly Media, @Microsoft, & @Amazon

Courtney Nash is the Co-founder of The VOID. Her research focuses on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she has held a variety of editorial, program management, research, and management roles at Verica, Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon.

From the same track

Session SRE

Did the Chaos Test Pass?

Wednesday Oct 26 / 11:50AM PDT

People used to ask me all the time how to figure out if their chaos test has “passed,” and I’d always say “well, that’s a loaded question.” To confirm that a chaos test “passed,” we need to do verification of hypotheses - sometimes you’re trying to prove some system behavior occurred in response

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Session SRE

The Endgame of SRE

Wednesday Oct 26 / 10:35AM PDT

The containers are deployed and the builds are green. YAML flows through the system, linted, reviewed, tested, and shipped with ease and regularity. Our intrepid SRE finds themself at a crossroads. The infrastructure is great but teams still struggle to maintain error budgets.

Amy Tobey

Senior Principal Engineer and SRE practice Leader @Equinix

Session SRE

The Eternal Sunshine of the Toil-Less Prod

Wednesday Oct 26 / 04:10PM PDT

One of the most important decisions in building an SRE practice is what kind of work should be assigned to the SRE team, and in what percentages.

Sasha Rosenbaum

Director of the Cloud Services Black Belt Team @RedHat

Session

[Panel] SRE: Is it Working?

Wednesday Oct 26 / 01:40PM PDT

How does SRE mature from a craft with a wide range of skills and levels of expertise to a mature discipline?

Courtney Nash

Co-founder @The VOID, Previously @Verica, @Holloway, @Fastly, @O’Reilly Media, @Microsoft, & @Amazon

Amy Tobey

Senior Principal Engineer and SRE practice Leader @Equinix

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Sasha Rosenbaum

Director of the Cloud Services Black Belt Team @RedHat
