The pace of development in AI over the past year or so has been dizzying, to say the least, with new models and techniques emerging weekly. Yet, amidst the hype, a sobering reality emerges: many of these advancements lack robust empirical evidence. Recent surveys reveal that testing and evaluation practices for generative AI features across organizations remain in their infancy. It’s time we start treating LLM systems like any other software, subjecting them to the same level of rigorous testing, if not greater.
In this talk, we explore the landscape of LLM system evaluation. Specifically, we will focus on the challenges of evaluating LLM systems, how to determine evaluation metrics, existing tools and techniques, and, of course, how to evaluate an LLM system. Finally, we will bring all these pieces together by walking through an end-to-end evaluation study of a real LLM system we’ve built.
Key Takeaways:
Attendees should leave this talk with a better understanding of why evaluation matters for LLMs and with a practical framework for LLM evaluation that they can apply to their projects.
What is the focus of your work these days?
Each day looks different, but you'll find me doing one of these: head down researching and writing code for tutorials covering different aspects of building AI applications, creating and delivering hands-on AI workshops for our customers, or advising customers on how to go about building AI applications for different use cases.
What’s the motivation for your talk?
In my several years as a data scientist, thoroughly evaluating ML models before shipping them was a big part of the process, especially when building models for customer-facing applications. With LLM-based applications, there has been a tendency to ship fast to get ahead in the race without spending the time and effort to test them, largely because there are no established best practices or guidelines for doing so. I wanted to change that.
How would you describe the persona and level of the target audience?
This session is mainly targeted toward anyone building customer-facing AI applications.
What do you want this persona to walk away with from your presentation?
I want attendees to leave with a better understanding of why evaluation matters for LLMs and with a practical framework for LLM evaluation that they can apply to their projects.
What do you think is the next big disruption in software?
It's here and it's AI (I'm totally not biased!). We have yet to see the full potential of AI, but I am excited to see how it changes how we work, write software, etc.
Speaker
Apoorva Joshi
Senior AI Developer Advocate @MongoDB, 6 Years of Experience as a Data Scientist in Cybersecurity, Active Member of Girls Who Code, Women in Cybersecurity (WiCyS) and AnitaB.org
Apoorva is a Data Scientist turned Developer Advocate, with over 6 years of experience applying Machine Learning to problems in Cybersecurity, including phishing detection, malware protection, and entity behavior analytics. As an AI Developer Advocate at MongoDB, she now helps developers be successful at building AI applications through written content and hands-on workshops.