No More Spray and Pray: Let's Talk About LLM Evaluations

The pace of AI development over the past year or so has been dizzying, to say the least, with new models and techniques emerging weekly. Yet amid the hype, a sobering reality remains: many of these advancements lack robust empirical evidence. Recent surveys reveal that testing and evaluation practices for generative AI features remain in their infancy across organizations. It’s time we start treating LLM systems like any other software, subjecting them to the same level of rigorous testing, if not a greater one.

In this talk, we explore the landscape of LLM system evaluation. Specifically, we will focus on challenges with evaluating LLM systems, determining metrics for evaluation, existing tools and techniques, and of course how to evaluate an LLM system. Finally, we will bring all these pieces together by walking through an end-to-end evaluation study of a real LLM system we’ve built.

Key Takeaways:
Attendees will leave with a better understanding of why evaluation matters for LLMs and with a practical framework for LLM evaluation that they can apply to their projects.
 


Speaker

Apoorva Joshi

Senior AI Developer Advocate @MongoDB, 6 Years of Experience as a Data Scientist in Cybersecurity, Active Member of Girls Who Code, Women in Cybersecurity (WiCyS) and AnitaB.org

Apoorva is a Data Scientist turned Developer Advocate, with over 6 years of experience applying Machine Learning to problems in Cybersecurity, including phishing detection, malware protection, and entity behavior analytics. As an AI Developer Advocate at MongoDB, she now helps developers be successful at building AI applications through written content and hands-on workshops.

From the same track

Session

Recommender and Search Ranking Systems in Large Scale Real World Applications

Monday Nov 18 / 01:35PM PST

Recommendation and search systems are two of the key applications of machine learning models in industry. Current state-of-the-art approaches have evolved from tree-based ensemble models to large deep learning models within the last few years.

Moumita Bhattacharya

Senior Research Scientist @Netflix, Previously @Etsy, Specialized in Machine Learning, Deep Learning, Big Data, Scala, Tensorflow, and Python

Session

Verifiable and Navigable LLMs with Knowledge Graphs

Monday Nov 18 / 10:35AM PST

Graphs, especially knowledge graphs, are powerful tools for structuring data into interconnected networks. The structured format of knowledge graphs enhances the performance of LLM-based systems by improving information retrieval and ensuring the use of reliable sources.

Leann Chen

AI Developer Advocate @Diffbot, Creator of AI and Knowledge Graph Content on YouTube, Passionate About Knowledge Graphs & Generative AI

Session

Why Most Machine Learning Projects Fail to Reach Production and How to Beat the Odds

Monday Nov 18 / 02:45PM PST

Despite the hype around AI, many ML projects fail, with only 15% of businesses' ML projects succeeding, according to McKinsey. Particularly with the significant investments in large language models and generative AI, only a small portion of companies have managed to realize their true value.

Wenjie Zi

Senior Machine Learning Engineer and Tech Lead @Grammarly, Specializing in Natural Language Processing, 10+ Years of Industrial Experience in Artificial Intelligence Applications

Session

Reinforcement Learning for User Retention in Large-Scale Recommendation Systems

Monday Nov 18 / 05:05PM PST

This talk explores the application of reinforcement learning (RL) in large-scale recommendation systems to optimize user retention, the true north star of effective recommendation engines.

Saurabh Gupta

Senior Engineering Leader @Meta, Veteran in the Video Recommendations Domain, Helping Scale Video Consumption

Gaurav Chakravorty

Uber TL @Meta, Previously Worked on Facebook Video Recommendations and Instagram Friending and Growth

Session

Unconference: AI and ML for Software Engineers

Monday Nov 18 / 03:55PM PST