Abstract
Resilient systems do not stay resilient by standing still. They survive by adapting. But adaptation comes with risk: under sustained pressure, organizations and systems can slowly drift toward failure while still appearing to operate normally. Using Rasmussen’s Dynamic Safety Model as a lens, this talk explores how AI is accelerating that drift at a pace many engineering organizations have not had to manage before.
Major technological upheavals are not new. The recent impact of AI reflects patterns we saw with cloud computing: early adoption created enormous leverage, architectural sprawl followed as systems rapidly evolved, bill shock forced a painful shift toward cost discipline, and platform maturity required new operational guardrails. AI is following a similar path, but faster. Token usage becomes capacity planning. Model selection becomes routing logic. Prompts, policies, and context become deployable artifacts. Agents become distributed actors with permissions and side effects. The architectural surface area is already extensive, and it continues to evolve.
But the more critical architectural shift is this: non-deterministic behavior is becoming part of the production control plane. The question is no longer simply, “Can AI do this task?” It is: “Can this system adapt across models, workloads, costs, latency, and uncertainty without silently drifting outside the limits of acceptable behavior?”
This is an inflection point for an industry that has spent decades building around deterministic interfaces, explicit contracts, repeatable tests, and observable distributed systems. AI does not replace those concerns; it complicates them. Production AI systems are still distributed systems, but now with probabilistic components, faster feedback loops, external model dependencies, and economic pressure embedded directly in the request path.
This session explores resilience engineering for production AI systems as LLMs and other AI components become embedded across engineering workflows and production architectures. We will look at the forces pushing systems toward the edge and the architecture patterns that help keep them inside a safe operating envelope: model gateways, cost-aware routing, QoS tiers, prompt and policy versioning, circuit breakers, fallback models, agent permissions, human-in-the-loop escalation, and observability for quality, cost, and behavior.
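To make a few of these patterns concrete, here is a minimal sketch of how cost-aware routing, a circuit breaker, a fallback model, and human escalation might fit together inside a model gateway. It is an illustration under stated assumptions, not the design presented in the talk: the names `ModelRoute`, `CircuitBreaker`, and `route`, the per-1k-token prices, and the thresholds are all hypothetical, and `call` stands in for whatever model client a real system would use.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; retries after `cooldown_s`."""
    max_failures: int = 3        # hypothetical threshold
    cooldown_s: float = 30.0     # hypothetical cooldown
    failures: int = 0
    opened_at: Optional[float] = None

    def allows(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a trial request through once the cooldown has elapsed.
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now


@dataclass
class ModelRoute:
    name: str
    cost_per_1k_tokens: float    # illustrative price, not a real vendor figure
    call: Callable[[str], str]   # stand-in for a real model client
    breaker: CircuitBreaker = field(default_factory=CircuitBreaker)


def route(
    prompt: str,
    routes: List[ModelRoute],
    max_cost_per_1k: float,
    now: Optional[float] = None,
) -> Tuple[str, Optional[str]]:
    """Try the cheapest healthy model within budget, then fall back upward in cost."""
    now = time.monotonic() if now is None else now
    affordable = sorted(
        (r for r in routes if r.cost_per_1k_tokens <= max_cost_per_1k),
        key=lambda r: r.cost_per_1k_tokens,
    )
    for r in affordable:
        if not r.breaker.allows(now):
            continue  # breaker is open: skip this model without calling it
        try:
            answer = r.call(prompt)
        except Exception:
            r.breaker.record(False, now)
            continue  # fall back to the next model in cost order
        r.breaker.record(True, now)
        return r.name, answer
    # No model is both affordable and healthy: hand the request to a person.
    return "human-escalation", None
```

In this sketch the budget cap plays the role of a QoS tier: a low-priority request with a small `max_cost_per_1k` never reaches the expensive model and escalates instead, while a high-priority request can fall back to it when the cheaper model's breaker is open.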
The goal is not to fear AI or romanticize the past. It is to recognize that when the boundaries move faster, resilience depends on our ability to sense, constrain, adapt, and correct before drift becomes failure with real consequences for people, systems, and businesses.
Speaker
Andrew Hatch
Engineering Leader and SRE Manager @Cisco ThousandEyes, With 25+ Years Building Software, Operations, SRE, and Platform Teams Across Australia, India, and the United States
Andrew Hatch is an engineering leader and SRE manager at Cisco ThousandEyes, with over 25 years in the technology industry across Australia, India, and the United States. He moved to the Bay Area in 2020 to join LinkedIn as an SRE Manager before taking up his current role at ThousandEyes. His work spans software engineering, consulting, operations, and building SRE and platform teams for large-scale systems. Andrew has previously spoken at SREcon on learning from complex systems and the realities of SRE management, and continues to explore how organisations can hire, lead, and learn more effectively in an AI-augmented world.