The pace of development in AI over the past year or so has been dizzying, to say the least, with new models and techniques emerging weekly. Yet amidst the hype, a sobering reality emerges: many of these advancements lack robust empirical evidence. Recent surveys reveal that testing and evaluation practices for generative AI features across organizations remain in their infancy. It’s time we start treating LLM systems like any other software, subjecting them to the same level of rigorous testing, if not a greater one.
In this talk, we explore the landscape of LLM system evaluation. Specifically, we will focus on the challenges of evaluating LLM systems, determining metrics for evaluation, existing tools and techniques, and, of course, how to actually evaluate an LLM system. Finally, we will bring all these pieces together by walking through an end-to-end evaluation study of a real LLM system we’ve built.
Key Takeaways:
Attendees will leave with a better understanding of why evaluation matters for LLM systems, along with a practical framework for LLM evaluation that they can apply to their own projects.
Speaker
Apoorva Joshi
AI Developer Advocate @MongoDB
Apoorva is a Data Scientist turned Developer Advocate, with over 6 years of experience applying Machine Learning to problems in Cybersecurity, including phishing detection, malware protection, and entity behavior analytics. As an AI Developer Advocate at MongoDB, she now helps developers build AI applications successfully through written content and hands-on workshops.