The Human Toll of Incidents & Ways To Mitigate It

Abstract

Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and on how the people involved responded. Learn about actions that can help you while you respond to the next outage, as well as changes you can drive to make incident response more considerate of the humans involved.


Speaker

Kyle Lexmond

Kyle is an almost-SWE who learned about Site Reliability Engineering in passing conversation during university, changing the course of his career. Having worked at big names (Twitter, Amazon, Facebook) and small (CBSA, Kik), he enjoys working on building optimized and efficient systems that break less often after he touches them. He currently lives in Seattle with a partner and an adorable dog. (Yes, he has pictures.)

Date

Wednesday Nov 19 / 03:55PM PST (50 minutes)

Location

Ballroom BC

From the same track

Session

When Incidents Refuse to End

Wednesday Nov 19 / 11:45AM PST

As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session

The Ironies of AAII

Wednesday Nov 19 / 10:35AM PST

Details coming soon.

Paul Reed

Staff Incident Operations Manager @Chime

Session

Rebuilding A System After a Security Breach

Wednesday Nov 19 / 01:35PM PST

Details coming soon.

Session

Week-Long Outage: Lifelong Lessons

Wednesday Nov 19 / 02:45PM PST

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like we had seen from past upgrades.

Molly Struve

Staff Site Reliability Engineer @Netflix