No More Spray and Pray— Let's Talk About LLM Evals

The pace of AI development over the past year or so has been dizzying, with new models and techniques emerging weekly. Yet amid the hype, a sobering reality emerges: many of these advancements lack robust empirical evidence. Recent surveys reveal that testing and evaluation practices for generative AI features remain in their infancy across organizations. It’s time we start treating LLM systems like any other software, subjecting them to the same, if not a greater, level of rigorous testing.

In this talk, we explore the landscape of LLM system evaluation. Specifically, we will focus on the challenges of evaluating LLM systems, how to determine evaluation metrics, existing tools and techniques, and, of course, how to evaluate an LLM system. Finally, we will bring all these pieces together by walking through an end-to-end evaluation study of a real LLM system we’ve built.

Key Takeaways:
Attendees will leave with a better understanding of why evaluation matters for LLMs and with a practical framework for LLM evaluation that they can apply to their own projects.


Apoorva Joshi

AI Developer Advocate @MongoDB

Apoorva is a Data Scientist turned Developer Advocate, with over 6 years of experience applying Machine Learning to problems in Cybersecurity, including phishing detection, malware protection, and entity behavior analytics. As an AI Developer Advocate at MongoDB, she now helps developers be successful at building AI applications through written content and hands-on workshops.

From the same track


Recommender and Search Ranking Systems in Large Scale Real World Applications

Recommendation and search systems are two of the key applications of machine learning models in industry. Current state-of-the-art approaches have evolved from tree-based ensemble models to large deep learning models within the last few years.

Moumita Bhattacharya

Senior Research Scientist @Netflix


Verifiable and Navigable LLMs with Knowledge Graphs

Graphs, especially knowledge graphs, are powerful tools for structuring data into interconnected networks. The structured format of knowledge graphs enhances the performance of LLM-based systems by improving information retrieval and ensuring the use of reliable sources.

Leann Chen

AI Developer Advocate @Diffbot


Why Most Machine Learning Projects Fail to Reach Production and How to Beat the Odds

Despite the hype around AI, many ML projects fail, with only 15% of businesses' ML projects succeeding, according to McKinsey. Particularly with the significant investments in large language models and generative AI, only a small portion of companies have managed to realize their true value.

Wenjie Zi

Senior Machine Learning Engineer and Tech Lead @Grammarly