Running LLMs requires significant computational power, which scales with model size and context length. We will discuss strategies for fitting models to various hardware configurations and share techniques for optimizing inference latency and throughput at Meta.
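To make that scaling concrete, here is a back-of-the-envelope sketch (an illustrative assumption, not Meta's internal sizing tooling) of how serving memory grows with parameter count and context length: weights scale with model size, while the KV cache scales with layers, context length, and batch size.

```python
# Hypothetical sizing helper (illustrative only): estimate GPU memory needed to
# serve a decoder-only transformer, to show how requirements scale with model
# size and context length.

def serving_memory_gb(
    n_params_b: float,        # model parameters, in billions
    n_layers: int,            # transformer layers
    n_kv_heads: int,          # KV heads (fewer than query heads with GQA)
    head_dim: int,            # dimension per attention head
    max_context: int,         # tokens kept in the KV cache per sequence
    batch_size: int,          # concurrent sequences
    bytes_per_weight: float = 2.0,  # fp16/bf16 weights; ~1.0 int8, ~0.5 int4
    bytes_per_kv: float = 2.0,      # fp16/bf16 KV cache
) -> dict:
    weights = n_params_b * 1e9 * bytes_per_weight
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * batch
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * max_context * batch_size * bytes_per_kv)
    gib = 1024 ** 3
    return {"weights_gb": weights / gib,
            "kv_cache_gb": kv_cache / gib,
            "total_gb": (weights + kv_cache) / gib}

# Example with assumed Llama-3-70B-like shapes (80 layers, 8 KV heads,
# head_dim 128) at 8K context and batch 8: the weights alone (~130 GiB in bf16)
# exceed a single 80 GB accelerator, so the model must be quantized or sharded
# across devices (e.g., tensor parallelism) to fit the hardware.
print(serving_memory_gb(70, 80, 8, 128, 8192, 8))
```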
As we transition from stand-alone LLMs to production-grade systems that support LLMs at global scale, we delve into our approach to building systems that accommodate dynamic user requests and widespread product adoption. This includes implementing caching strategies and addressing infrastructure latency, efficiency, and reliability issues within real data centers running a heterogeneous hardware fleet.
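As one example of a caching strategy, the sketch below (a generic illustration, not Meta's production design; the class and method names are made up) shows an exact-match response cache keyed by the normalized prompt and sampling parameters, with LRU eviction. In a real deployment this would sit alongside KV-cache reuse for shared prompt prefixes.

```python
# Minimal sketch of a response cache for LLM serving: exact-match lookup on
# (prompt, sampling params), evicting least-recently-used entries when full.

from collections import OrderedDict
import hashlib
import json

class ResponseCache:
    def __init__(self, max_entries: int = 10_000):
        self._entries: "OrderedDict[str, str]" = OrderedDict()
        self._max_entries = max_entries

    @staticmethod
    def _key(prompt: str, params: dict) -> str:
        # Sampling parameters are part of the key: the same prompt with a
        # different temperature or max_tokens must not reuse a cached answer.
        payload = json.dumps({"prompt": prompt.strip(), "params": params},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt: str, params: dict):
        key = self._key(prompt, params)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        return None

    def put(self, prompt: str, params: dict, completion: str) -> None:
        key = self._key(prompt, params)
        self._entries[key] = completion
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```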
Finally, we will present case studies that demonstrate how we balance model quality, latency, throughput, reliability, and cost in a complex and demanding environment.
Speaker
Ye (Charlotte) Qi
Senior Staff Engineer @Meta
Ye (Charlotte) Qi is a production engineer on the AI inference team at Meta.
She is one of the inference technical leads behind Meta's initial Meta.AI product launch and Llama 3 development. With over six years of experience at Meta, she has run large-scale online inference systems for both RecSys and LLM workloads across various organizations.
Charlotte enjoys working at the multidisciplinary intersection of infrastructure, machine learning, product development and DevOps, advancing end-to-end development from research to production. Her background spans the entire software stack, including hardware productionization, inference runtime optimizations, distributed system reliability, experiment management, and service operations.
Prior to joining Meta, Charlotte earned her Master's degree from Carnegie Mellon University, specializing in large-scale machine learning systems and neural machine translation.