Progressive Failure Modes of Modern AI Serving Systems

Abstract

Inference platforms fail in layers. Most organizations focus on model quality while underestimating the systems engineering required to operate production AI workloads safely and reliably at scale.

Before GPU saturation even becomes a problem, teams often expose models directly to ungoverned traffic, lack concurrency controls, fail to measure system behavior, overload memory bandwidth, and eventually destroy latency guarantees and operational stability.

This talk walks through the progressive failure modes of modern AI serving systems and shows how to architect scalable inference infrastructure that remains observable, resilient, and performant under real production workloads. I will walk attendees through real code paths, production failure scenarios, debugging strategies, and architectural tradeoffs, showing both how these systems fail and how to fix them systematically.


Speaker

Abi Aryan

AI Infrastructure Engineer and Educator

Abi Aryan is an AI infrastructure engineer and educator specializing in scalable inference systems and production AI infrastructure. She spends her time helping enterprises design and optimize large-scale inference serving architectures, improve observability in production pipelines, and resolve performance bottlenecks across distributed GPU systems.

Outside of her startup work, Abi teaches distributed systems in a university HPC program, mentors AI Engineering Team Leads through her Maven course, and is currently writing a book on GPU Engineering. Her doctoral research explores the future of adaptive AI infrastructure.
