Inspired by the old saw, "Troubleshooting is like investigating a murder when you are also the murderer," this workshop will be the geekiest murder-mystery party you've ever attended. You will join BlubCo as a lead engineer, tasked with forensics on progressively larger systems experiencing failures ranging from intermittent issues to hard downtime.
Using systems thinking and incident response patterns, you will reverse engineer and troubleshoot misbehaving systems and resolve the incidents. Bring your laptop, pen, and paper; prepare to find the culprits. You will leave this workshop with new techniques for troubleshooting in production. You'll be better equipped to learn unfamiliar systems in a world where systems are getting ever less familiar.
Entropy and inherited systems have always been a challenging part of production environments:
- Rushing crowds out learning: time pressure on the job forces engineers to take shortcuts. Those shortcuts add up and create a cycle of technical debt, rollbacks, and organizational scar tissue.
- Troubleshooting isn't formally taught; engineers pick it up on the job or from mentors. Engineers must know how to reconstruct lost mental models from fragmented sources.
- Systems outlive their creators. The half-life of small teams can be as low as 12 months, while production systems can run for decades. Engineers who inherit those systems often have nobody to fall back on.
- Debugging is harder than writing the code in the first place. Organizations can build systems beyond their ability to operate them safely, and that gap turns into incidents.
Post-LLM problems for engineers:
- Learning no longer happens by accident. LLMs let engineers bypass the forcing functions that once made them understand running systems. Deep reading, testing, and building mental models are still required to find the source of misbehavior under pressure.
- Anyone can prompt from zero to a 50KLOC codebase. The ability to maintain, debug, and evolve that complexity remains an (augmented) human problem.
- Communication is changing. LLMs used poorly create florid junk and drown your team in memos, docs, and specs without the execution to back them up.
Across three escalating "Acts" in our play, incidents will start with single-service logic errors and build to cross-service infrastructure collapses. You will navigate through multiple layers of the stack and encounter failures from CDN to DNS to the database. By the end of the session, you will have moved beyond bug-squashing to a repeatable framework for incident response: how to build mental models, isolate variables, and maintain observability in an era of unprecedented system opacity.