Resilient systems do not stay resilient by standing still. They survive by adapting. But adaptation comes with risk: under sustained pressure, organizations and systems can slowly drift toward failure while still appearing to operate normally. Using Rasmussen’s Dynamic Safety Model as a lens, this talk explores how AI is accelerating that drift at a pace many engineering organizations have not had to manage before.

Major technological upheavals are not new. The recent impact of AI reflects patterns we saw with cloud computing: early adoption created enormous leverage, architectural sprawl followed as systems rapidly evolved, bill shock forced a painful shift toward cost discipline, and platform maturity required new operational guardrails. AI is following a similar path, but faster. Token usage becomes capacity planning. Model selection becomes routing logic. Prompts, policies, and context become deployable artifacts. Agents become distributed actors with permissions and side effects. The architectural surface area is already extensive, and it continues to evolve.

But the more critical architectural shift is this: non-deterministic behavior is becoming part of the production control plane. The question is no longer simply, “Can AI do this task?” It is: “Can this system adapt across models, workloads, costs, latency, and uncertainty without silently drifting outside the limits of acceptable behavior?”

This is an inflection point for an industry that has spent decades building around deterministic interfaces, explicit contracts, repeatable tests, and observable distributed systems. AI does not replace those concerns; it complicates them. Production AI systems are still distributed systems, but now with probabilistic components, faster feedback loops, external model dependencies, and economic pressure embedded directly in the request path.

This session explores resilience engineering for production AI systems as AI and LLMs become embedded across engineering workflows and production architectures. We will look at the forces pushing systems toward the edge and the architecture patterns that help keep them inside a safe operating envelope: model gateways, cost-aware routing, QoS tiers, prompt and policy versioning, circuit breakers, fallback models, agent permissions, human-in-the-loop escalation, and observability for quality, cost, and behavior.
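
As a rough illustration of how a few of these patterns might compose, here is a minimal sketch of a model gateway that combines cost-aware routing, a per-model circuit breaker, and fallback to a cheaper model. Every name in it (`CircuitBreaker`, `ModelRoute`, `route`, the `cost_per_1k_tokens` field, and the budget parameter) is hypothetical and chosen for illustration. It is not taken from the talk or from any particular library, and it uses cost as a crude stand-in for model capability.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CircuitBreaker:
    """Opens after `max_failures` consecutive failures (hypothetical defaults)."""
    max_failures: int = 3
    reset_after: float = 30.0  # seconds to wait before allowing a trial call
    clock: Callable[[], float] = time.monotonic
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


@dataclass
class ModelRoute:
    name: str
    cost_per_1k_tokens: float  # assumed pricing unit, for illustration only
    call: Callable[[str], str]  # stand-in for a real model client
    breaker: CircuitBreaker = field(default_factory=CircuitBreaker)


def route(prompt: str, routes: list[ModelRoute], budget_per_1k: float) -> tuple[str, str]:
    """Try routes within budget, most expensive first, skipping open breakers."""
    affordable = [r for r in routes if r.cost_per_1k_tokens <= budget_per_1k]
    for r in sorted(affordable, key=lambda r: r.cost_per_1k_tokens, reverse=True):
        if not r.breaker.allow():
            continue  # breaker is open: fall through to the next model
        try:
            answer = r.call(prompt)
        except Exception:
            r.breaker.record(False)
            continue  # failure: fall back to the next model
        r.breaker.record(True)
        return r.name, answer
    # Nothing healthy within budget: surface it rather than degrade silently.
    raise RuntimeError("no healthy model within budget; escalate to a human")
```

In this sketch the final `RuntimeError` is where human-in-the-loop escalation would hook in, and the breaker state and chosen route name are the kind of signals an observability layer would record alongside quality and cost.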

The goal is not to fear AI or romanticize the past. It is to recognize that when the boundaries move faster, resilience depends on our ability to sense, constrain, adapt, and correct before drift becomes failure with real consequences for people, systems, and businesses.

Adapt or Drift: Resilience Engineering When AI Moves the Operating Point

Speaker

Andrew Hatch
