Most teams can now ship an AI prototype by calling a foundation-model API. The hard part is knowing whether that system works when real users, messy data, and business consequences arrive. In this talk, I’ll argue that production AI is won or lost in the harness around the model: traces, metrics, labels, test sets, judges, and the discipline to inspect failures directly. Drawing from “The Revenge of the Data Scientist,” I’ll show five common eval pitfalls (generic metrics, unverified judges, weak experimental design, bad labels, and over-automation) and explain how engineering teams can avoid them. The practical takeaway is simple: reliable AI is not a model-only problem. It is an engineering system, and the missing muscle is often data science.

Main Takeaways:

Production AI quality depends on a harness: tests, traces, metrics, labels, and experiments that tell you when the system is going off track.
Generic eval dashboards and off-the-shelf metrics rarely diagnose real application failures; teams need error analysis and domain-specific metrics.
LLM judges should be treated like classifiers: validated against human labels, tuned on development data, and reported with precision/recall rather than blind accuracy.
The fastest path to better AI systems is still to look at the data: read traces, involve domain experts, and design experiments around real production behavior.
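
To make the judge-as-classifier takeaway concrete, here is a minimal sketch of scoring an LLM judge against human labels. It is an illustration under stated assumptions, not the speaker's method: the `score_judge` helper, the `JudgeReport` type, and the ten-trace sample are all hypothetical, and `True` is assumed to mean "this output is a failure". In practice the labels would come from a held-out set that was not used to tune the judge prompt.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    precision: float
    recall: float
    accuracy: float


def score_judge(human_labels: list[bool], judge_labels: list[bool]) -> JudgeReport:
    """Compare judge verdicts to human labels (True = output is a failure)."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h and j for h, j in pairs)              # real failures the judge caught
    fp = sum((not h) and j for h, j in pairs)        # good outputs flagged as failures
    fn = sum(h and (not j) for h, j in pairs)        # real failures the judge missed
    tn = sum((not h) and (not j) for h, j in pairs)  # good outputs passed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return JudgeReport(precision, recall, accuracy)


# Hypothetical held-out sample: 10 traces, 2 of which humans marked as failures.
human = [True, True] + [False] * 8
# A judge that passes everything still looks fine on accuracy alone.
lenient_judge = [False] * 10

report = score_judge(human, lenient_judge)
print(report)  # JudgeReport(precision=0.0, recall=0.0, accuracy=0.8)
```

The lenient judge scores 80% accuracy while catching none of the real failures. Failures are usually the rare class, which is why precision and recall say more about a judge than accuracy does.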

From the same track

Session

Progressive Failure Modes of Modern AI Serving Systems

Inference platforms fail in layers. Most organizations focus on model quality while underestimating the systems engineering required to operate production AI workloads safely and reliably at scale.

Abi Aryan

AI Infrastructure Engineer and Educator

Session

Skills, Memory, or Fine-Tuning? The Engineering Loop Behind Self-Improving Agents

As agents become mainstream, everyone wants to improve theirs either by making fewer mistakes on existing tasks or by taking on harder ones. This usually happens once an agent is already deployed in production.

Abhinav Sinha

CEO @Lucidic AI, Previously @Stanford AI Lab, @Citadel and Susquehanna International Group, and @Apple

The Revenge of the Data Scientist: Why Reliable AI Needs Evals, Traces, and Metrics

Abstract

Speaker

Hamel Husain
