The Revenge of the Data Scientist: Why Reliable AI Needs Evals, Traces, and Metrics

Abstract

Most teams can now ship an AI prototype by calling a foundation-model API. The hard part is knowing whether that system works when real users, messy data, and business consequences arrive. In this talk, I’ll argue that production AI is won or lost in the harness around the model: traces, metrics, labels, test sets, judges, and the discipline to inspect failures directly. Drawing from “The Revenge of the Data Scientist,” I’ll show five common eval pitfalls — generic metrics, unverified judges, weak experimental design, bad labels, and over-automation — and explain how engineering teams can avoid them. The practical takeaway is simple: reliable AI is not a model-only problem. It is an engineering system, and the missing muscle is often data science.

Main Takeaways:

  1. Production AI quality depends on a harness: tests, traces, metrics, labels, and experiments that tell you when the system is going off track.
  2. Generic eval dashboards and off-the-shelf metrics rarely diagnose real application failures; teams need error analysis and domain-specific metrics.
  3. LLM judges should be treated like classifiers: validated against human labels, tuned on development data, and reported with precision and recall rather than accuracy alone.
  4. The fastest path to better AI systems is still to look at the data: read traces, involve domain experts, and design experiments around real production behavior.
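
To make the third takeaway concrete, here is a minimal sketch of scoring an LLM judge against human labels as if it were a binary classifier. The function name `evaluate_judge`, the "pass"/"fail" label strings, and the sample data are all hypothetical and chosen for illustration; they do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    accuracy: float
    precision: float
    recall: float


def evaluate_judge(human_labels, judge_labels, positive="fail"):
    """Compare judge verdicts to human labels, treating `positive` as the target class."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h == positive and j == positive)
    fp = sum(1 for h, j in pairs if h != positive and j == positive)
    fn = sum(1 for h, j in pairs if h == positive and j != positive)
    correct = sum(1 for h, j in pairs if h == j)
    # Guard against division by zero when the judge never flags a failure
    # or the sample contains no real failures.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return JudgeReport(correct / len(pairs), precision, recall)


# Made-up sample: 100 traces, 10 of which humans marked as failures.
# The hypothetical judge catches only 2 of them and raises no false alarms.
human = ["fail"] * 10 + ["pass"] * 90
judge = ["fail"] * 2 + ["pass"] * 98

report = evaluate_judge(human, judge)
print(f"accuracy={report.accuracy:.2f} "
      f"precision={report.precision:.2f} recall={report.recall:.2f}")
# accuracy=0.92 precision=1.00 recall=0.20
```

On this imbalanced sample the judge agrees with humans 92% of the time yet misses 8 of the 10 real failures, which is the gap that accuracy alone hides and recall exposes.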

Speaker

Hamel Husain

Machine Learning Engineer, 20+ Years in Applied AI, Machine Learning, and Data Science

Hamel Husain is a machine learning engineer with over 20 years of experience in applied AI, machine learning, and data science. He has worked at Airbnb and GitHub, including early LLM research used by OpenAI for code understanding, and has led and contributed to popular open-source machine-learning tools. He currently focuses on bringing data science back to AI by helping teams debug, analyze, and measure production systems through evals, teaching, consulting, and writing.
