No More Spray and Pray— Let's Talk About LLM Evaluations

The pace of AI development over the past year or so has been dizzying, to say the least, with new models and techniques emerging weekly. Yet, amidst the hype, a sobering reality emerges: many of these advancements lack robust empirical evidence. Recent surveys reveal that testing and evaluation practices for generative AI features across organizations remain in their infancy. It’s time we start treating LLM systems like any other software, subjecting them to the same level of rigorous testing, if not a greater one.

In this talk, we explore the landscape of LLM system evaluation. Specifically, we will focus on challenges with evaluating LLM systems, determining metrics for evaluation, existing tools and techniques, and of course how to evaluate an LLM system. Finally, we will bring all these pieces together by walking through an end-to-end evaluation study of a real LLM system we’ve built.

Key Takeaways:
Attendees will leave with a better understanding of why evaluation matters for LLMs and with a practical framework for LLM evaluation that they can apply to their projects.
 

What is the focus of your work these days?

Each day looks different but you'll find me doing one of these: head down researching and writing code for tutorials covering different aspects of building AI applications, creating and delivering hands-on AI workshops for our customers, or advising customers on how to go about building AI applications for different use cases.

What’s the motivation for your talk?

During my several years as a data scientist, thoroughly evaluating ML models before shipping them was a big part of the process, especially when building models for customer-facing applications. With LLM-based applications, there has been a tendency to ship fast to get ahead in the race without spending the time and effort to test them, largely because there are no established best practices or guidelines for doing so. I wanted to change that.

How would you describe the persona and level of the target audience?

This session is mainly targeted toward anyone building customer-facing AI applications. 

What do you want this persona to walk away with from your presentation?

I want attendees to leave with a better understanding of why evaluation matters for LLMs and with a practical framework for LLM evaluation that they can apply to their projects.

What do you think is the next big disruption in software?

It's here and it's AI (I'm totally not biased!). We have yet to see the full potential of AI, but I am excited to see how it changes how we work, write software, etc.   


Speaker

Apoorva Joshi

Senior AI Developer Advocate @MongoDB, 6 Years of Experience as a Data Scientist in Cybersecurity, Active Member of Girls Who Code, Women in Cybersecurity (WiCyS) and AnitaB.org

Apoorva is a Data Scientist turned Developer Advocate, with over 6 years of experience applying Machine Learning to problems in Cybersecurity, including phishing detection, malware protection, and entity behavior analytics. As an AI Developer Advocate at MongoDB, she now helps developers be successful at building AI applications through written content and hands-on workshops.


From the same track

Session AI/ML

Recommender and Search Ranking Systems in Large Scale Real World Applications

Monday Nov 18 / 01:35PM PST

Recommendation and search systems are two of the key applications of machine learning models in industry. Current state-of-the-art approaches have evolved from tree-based ensemble models to large deep learning models within the last few years.

Moumita Bhattacharya

Senior Research Scientist @Netflix, Previously @Etsy, Specialized in Machine Learning, Deep Learning, Big Data, Scala, Tensorflow, and Python

Session Knowledge Graphs

Verifiable and Navigable LLMs with Knowledge Graphs

Monday Nov 18 / 10:35AM PST

Graphs, especially knowledge graphs, are powerful tools for structuring data into interconnected networks. The structured format of knowledge graphs enhances the performance of LLM-based systems by improving information retrieval and ensuring the use of reliable sources.

Leann Chen

AI Developer Advocate @Diffbot, Creator of AI and Knowledge Graph Content on YouTube, Passionate About Knowledge Graphs & Generative AI

Session AI/ML

Why Most Machine Learning Projects Fail to Reach Production and How to Beat the Odds

Monday Nov 18 / 02:45PM PST

Despite the hype around AI, many ML projects fail, with only 15% of businesses' ML projects succeeding, according to McKinsey. Particularly with the significant investments in large language models and generative AI, only a small portion of companies have managed to realize their true value.

Wenjie Zi

Senior Machine Learning Engineer and Tech Lead @Grammarly, Specializing in Natural Language Processing, 10+ Years of Industrial Experience in Artificial Intelligence Applications

Session AI/ML

Reinforcement Learning for User Retention in Large-Scale Recommendation Systems

Monday Nov 18 / 05:05PM PST

This talk explores the application of reinforcement learning (RL) in large-scale recommendation systems to optimize user retention, the true north star of effective recommendation engines.

Saurabh Gupta

Senior Engineering Leader @Meta, Veteran in the Video Recommendations Domain, Helping Scale Video Consumption

Gaurav Chakravorty

Uber TL @Meta, Previously Worked on Facebook Video Recommendations and Instagram Friending and Growth

Session

Unconference: AI and ML for Software Engineers

Monday Nov 18 / 03:55PM PST