Abstract
Most teams can now ship an AI prototype by calling a foundation-model API. The hard part is knowing whether that system works when real users, messy data, and business consequences arrive. In this talk, I’ll argue that production AI is won or lost in the harness around the model: traces, metrics, labels, test sets, judges, and the discipline to inspect failures directly. Drawing from “The Revenge of the Data Scientist,” I’ll show five common eval pitfalls (generic metrics, unverified judges, weak experimental design, bad labels, and over-automation) and explain how engineering teams can avoid them. The practical takeaway is simple: reliable AI is not a model-only problem. It is an engineering system, and the missing muscle is often data science.
Main Takeaways:
- Production AI quality depends on a harness: tests, traces, metrics, labels, and experiments that tell you when the system is going off track.
- Generic eval dashboards and off-the-shelf metrics rarely diagnose real application failures; teams need error analysis and domain-specific metrics.
- LLM judges should be treated like classifiers: validated against human labels, tuned on development data, and reported with precision/recall rather than blind accuracy.
- The fastest path to better AI systems is still to look at the data: read traces, involve domain experts, and design experiments around real production behavior.
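To make the judge-as-classifier point concrete, here is a minimal sketch in Python of scoring an LLM judge against human labels. The function name `evaluate_judge`, the `JudgeReport` type, and the ten labels are all hypothetical and exist only for illustration; the sketch assumes the labels come from a held-out split that was not used to tune the judge prompt.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    accuracy: float
    precision: float
    recall: float


def evaluate_judge(human_labels, judge_labels):
    """Score judge verdicts against human labels.

    True means "this trace is a failure" (the positive class).
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled trace")

    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)          # real failure, flagged
    fp = sum(1 for h, j in pairs if not h and j)      # fine trace, flagged
    fn = sum(1 for h, j in pairs if h and not j)      # real failure, missed
    tn = sum(1 for h, j in pairs if not h and not j)  # fine trace, passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return JudgeReport(accuracy, precision, recall)


# Made-up example: 10 traces, only 2 of which humans marked as failures.
human = [True, True, False, False, False, False, False, False, False, False]
judge = [True, False, True, False, False, False, False, False, False, False]

report = evaluate_judge(human, judge)
print(report)
```

On these illustrative labels the judge agrees with humans on 8 of 10 traces, yet it catches only one of the two real failures and one of its two flags is a false alarm. When failures are rare, accuracy alone can hide exactly that gap, which is why precision and recall are the more informative numbers to report.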
Speaker
Hamel Husain
Machine Learning Engineer, 20+ Years in Applied AI, Machine Learning, and Data Science
Hamel Husain is a machine learning engineer with over 20 years of experience in applied AI, machine learning, and data science. He has worked at Airbnb and GitHub, where his work included early LLM research used by OpenAI for code understanding, and he has led and contributed to popular open-source machine-learning tools. He currently focuses on bringing data science back to AI by helping teams debug, analyze, and measure production systems through evals, teaching, consulting, and writing.