How Did It Make Sense at the Time? Understanding Incidents As They Occurred, Not as They Are Remembered

When we encounter undesirable outcomes, there is a natural instinct to look back, find something that went wrong, and fix it. But looking back in this way doesn’t help us as much as we think, because in hindsight we already know what went wrong and what it took to fix it. To the responders in the incident, that knowledge came only after the hard part: detecting, diagnosing, and repairing the problem. In this talk we’ll use the question “how did it make sense at the time?” to flip perspectives both concretely and philosophically.

Our systems are built and operated by humans making decisions; what we see when we look at “failure” is people taking actions that made sense at the time even if they seem ridiculous afterwards. For example, when overloaded, a web service experiences catastrophic collapse which, in hindsight, might have been avoided by a different rate limiter setting. But this setting was configured — perhaps months ago — by a developer who intended the service to be robust!

How did the configuration, which later enabled sadness, make sense at the time? What might we change to help developers make better decisions in the future? In this talk, we’ll explore the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions you can take to weave this perspective into your software development and incident lifecycles to help teams be more resilient in the face of complexity.

Interview:

What's the focus of your work these days?

I’m the tech lead at Stripe on the API events team, focused on things like webhooks and notifications. I work collaboratively with other teams at Stripe, helping large projects row in the same direction. My focus really depends on the day, but it includes design, implementation, and operations: the whole software development lifecycle. I'm trying to deliver impact and success to Stripe users, my coworkers, and the company as a whole. I care about reliability a lot!

What would you say is the motivation behind your talk?

I’m really curious about the question of safety and reliability in complex socio-technical systems, like high-growth tech companies. Specifically, it's interesting to take one concept (in this case, ‘How did it make sense at the time?’, a question that takes a cognitive perspective on what happens when things go wrong) and share it with people. This gives them an arrow for their quiver or a tool for their toolkit. I'm really happy to be speaking with the other folks on the track; I think the effective SRE track has a high overlap, and is very interesting for the attendees and the community.

How would you describe the persona and level of the target audience for your session?

This is really interesting and something I tweeted: “Are you the persona and target of this talk? Who are you?” I think the target audience is anyone who cares a lot about reliability and figuring out how things go right, how they go wrong and learning from that. That generally tends to be people who are a little bit more senior, rather than new college grads. It's people who have had some experience, who've seen some things happen. 

Things fail, even though we have really smart people and really good technology. What's going on? This session is for people who are at the beginning of their exploration of the domain, which I imagine is why they come to the ‘Just Culture’ track in the first place. It's maybe not for folks who are full safety science nerds; if you’ve read every paper by Dekker, Woods and Cooke, this talk may be more of a refresher.

 

What would you like this persona to walk away with after your presentation?

There are two main things. One is some familiarity with the concepts. I’d love for the target persona to walk away with a better understanding of why it makes sense to ask “How did it make sense at the time?” In other words: why should I care?

The second thing I’d like them to walk away with is the knowledge of how to pull some threads to actually make change at their organizations, using the question both as a concept and as a tool:

  • Where would you ask? 
  • What are the options for weaving this into your post-incident activities?
  • How might you leverage this outside of the incident lifecycle? 

Speaker

Jacob Scott

Staff Software Engineer @stripe

Jacob is a technologist who is deeply curious about reliability in complex socio-technical (software) systems. He is currently a staff software engineer in the Platform & Ecosystem group at Stripe, focused on user-facing event systems. Outside of work, he might be found at a nearby park with his one-year-old daughter or pursuing his avocation of collecting employees-only tech swag. Do you have a Facebook “illuminati” hoodie you are willing to part with? DM him on Twitter!

 


Date

Tuesday Oct 25 / 04:10PM PDT (50 minutes)

Location

Ballroom BC

Topics

Engineering Culture, Complex Systems

From the same track

Session Engineering Culture

Generous, High Fidelity Communication Is the Key to a Safe, Effective Team

Tuesday Oct 25 / 10:35AM PDT

A team's ability to communicate effectively and disagree productively is directly related to its resilience towards incidents and interruptions.

Denise Yu

Engineering Manager and Rubyist, Previously Engineering Manager @GitHub

Session Engineering Culture

Recipes for Blameless Accountability

Tuesday Oct 25 / 02:55PM PDT

Building a culture of continuous improvement requires that teams value psychological safety, blamelessness, and admitting error. This can sometimes feel in conflict with an organization's desire to see accountability and ownership of the work.

Michelle Brush

Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of "2 out of the 97 Things Every SRE Should Know"

Session Engineering Culture

Reckoning with the Harm We Do: In Search of Restorative Just Culture in Software and Web Operations

Tuesday Oct 25 / 05:25PM PDT

“Psychological Safety” and “Blameless” postmortems are not enough. We’ve heard that we need a “Just Culture” but does that matter if your people are “stressed, exhausted, depleted, spent, drained”?

Jessica DeVita

Sr. Software Engineering Manager - SRE @Microsoft

Session

Panel: "Just" Engineering Culture

Tuesday Oct 25 / 11:50AM PDT

The hardest part of technology is rarely the tech itself. Systems are designed, used, and operated by people. People make mistakes, but they are also critical to keeping systems safe and reliable.

Denise Yu

Engineering Manager and Rubyist, Previously Engineering Manager @GitHub

Jacob Scott

Staff Software Engineer @stripe

Jessica DeVita

Sr. Software Engineering Manager - SRE @Microsoft

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session

Unconference: Engineering Culture

Tuesday Oct 25 / 01:40PM PDT

What is an unconference? At QCon SF, we’ll have unconferences in most of our tracks.

Shane Hastie

Global Delivery Lead for SoftEd and Lead Editor for Culture & Methods at InfoQ.com