Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like we had seen from past upgrades. But what started as a straightforward upgrade became a week-long catastrophe that brought our platform to its knees. For six grueling days, we fought cluster instability while our Fortune 500 customers demanded answers we didn't have.

This talk shares the raw story of struggle and how disaster became the greatest teacher. The experience highlighted how creating psychological safety, leveraging community support, exceptional leadership, and team character can often matter more than technical solutions. You'll take away six hard-won lessons that will better prepare you for when your next "routine" upgrade goes sideways.

Interview:

What is your session about, and why is it important for senior software developers?

This session shares six lifelong lessons from a week-long Elasticsearch outage at my former company in 2017. While the story is entertaining—involving a critical system failure, desperate debugging, and an eventual bug discovery—the real value is in the hard-won lessons that can save folks from similar disasters. The technical lessons (having rollback plans, doing performance testing, and being wary of bias) apply to changes of any size, not just major ones. The human lessons (widening your circle, having strong leadership support, and building resilient teams) are what ultimately determine how well your team can survive a crisis. Technical leaders are uniquely positioned to implement these practices and model the culture shifts needed to handle incidents effectively.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

As systems grow more complex and interconnected, incidents are inevitable—it's not a question of "if" but "when." We need leaders who embrace incidents as learning opportunities and create psychological safety for teams to be vulnerable, ask for help early, and grow from failures. The lessons in this talk—especially around leadership support during crises and building teams with strong character—are essential for creating resilient engineering cultures that can thrive amid inevitable disruptions.

What are the common challenges developers and architects face in this area?

The challenges tend to fall into two buckets: technical blind spots and cultural barriers. On the technical side, past success can create assumptions about future changes, and it's easy to overlook the full scope of what needs testing or planning. On the human side, there's often reluctance to ask for help early, particularly among experienced engineers who feel pressure to have all the answers. Teams also struggle with the gap between having plans on paper and actually practicing them under realistic conditions. These challenges are universal across the industry, which is why sharing stories about them matters.

What's one thing you hope attendees will implement immediately after your talk?

I hope leaders will commit to showing up supportively during incidents. That means leaning into being their team's cheerleader and defender, not its interrogator. When an incident happens, a leader should strive to remove external pressures and trust that the team will figure it out. My favorite saying is “People don't remember what you did, they remember how you made them feel.” The way leaders show up during a crisis shapes an engineering culture. Engineers watch how their leaders react, and early-career engineers especially need to see that asking for help is a strength, not a weakness. Leaders’ composure and trust during incidents build psychological safety that pays dividends long after the incident is resolved.

Week-Long Outage: Lifelong Lessons

Speaker

Molly Struve
