You are viewing content from a past/completed conference.

When Incidents Refuse to End

Abstract

As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out. In this talk, we’ll dive into real examples of incidents that dragged on far longer than anyone expected, and unpack what they revealed about our systems, processes, and mental models.

We’ll explore what these situations taught us about coordination under pressure, shifting system behavior, and the limitations of our current practices for detection and response. We will also look at how a mindset of curiosity helped us make sense of the mess — not just to resolve the immediate situation, but to improve how we adapt, learn, and build stronger systems and teams.

Speaker

Vanessa Huerta Granda

Resilience Engineering Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis, Board Member for the Resilience in Software Foundation

Vanessa is an Engineering Manager at Enova, where she leads the Resilience Engineering team, which focuses on the company's production incident process, learning from incidents, and leading the on-call rotation of Incident Commanders. She previously worked as a Solutions Engineer at Jeli, helping companies make the most of their incidents. In 2021 she co-authored Howie: The Post-Incident Guide, an in-depth explanation of how tech organizations can learn from incidents.

She has led the Chicago Women in Technology Conference and is an admin of the Learning From Incidents community. She is passionate about continuous improvement, getting teams to talk to each other, and Diversity and Inclusion in Tech.

From the same track

Session Staff Plus Engineering

The Ironies of A^2 I^2

Wednesday Nov 19 / 01:35PM PST

In this talk, we'll explore some of the "ironies" of automation, and now artificial intelligence, in their interactions with software operators (i.e. you), especially during high-consequence, high-tempo situations (aka incidents).

J. Paul Reed

Staff Incident Operations Manager @Chime

Session Incident Analysis

The Time it Wasn't DNS

Wednesday Nov 19 / 03:55PM PST

In January of 2023, the Microsoft Azure Wide Area Network experienced a global outage. If you were a Microsoft customer at the time, you were impacted by this outage.

Sean Klein

Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure

Session Incident Response

Week-Long Outage: Lifelong Lessons

Wednesday Nov 19 / 02:45PM PST

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like we had seen from past upgrades.

Molly Struve

Staff Site Reliability Engineer @Netflix

Session Incidents

The Human Toll of Incidents & Ways To Mitigate It

Wednesday Nov 19 / 10:35AM PST

Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and the human response to the incidents.

Kyle Lexmond

Production Engineer @Meta, Previously @AWS and @Twitter
