One Platform to Serve Them All: Autoscaling Multi-Model LLM Serving

Abstract

AI teams are moving away from hosted LLM APIs to self-hosted inference as fine-tuning drives model performance. The catch is scale: hundreds of variants create long-tail traffic, cold starts, and duplicated stacks. This talk shows how one platform can autoscale multi-model serving with shared base weights, hot-swapped adapters, dynamic loading, and smart eviction. You will see how to hold P95 and P99 latencies while serving hundreds of fine-tuned models at near single-model cost. We focus on Kubernetes, Dynamo, vLLM, and SGLang as core technologies, and on the metrics that matter: tokens per second, model load latency, and cost per 1K tokens.
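
As a sketch of the shared-base-weights pattern the abstract describes, the snippet below serves several fine-tuned variants from a single copy of the base model using vLLM's multi-LoRA support. The base model name, adapter names, and adapter paths are illustrative assumptions, and exact flag names may vary across vLLM versions.

# Sketch: many fine-tunes, one set of base weights, via vLLM multi-LoRA.
# The model name and adapter paths below are hypothetical placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base weights stays resident; LoRA adapters are loaded
# dynamically, and at most max_loras adapters live on the GPU at a time.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed base model
    enable_lora=True,
    max_loras=4,        # adapters concurrently resident on the GPU
    max_lora_rank=16,
)

params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can target a different fine-tune; the adapter is hot-swapped
# instead of spinning up a dedicated model server per variant.
adapters = [("support-bot", "/adapters/support"), ("legal-bot", "/adapters/legal")]
for lora_id, (name, path) in enumerate(adapters, start=1):
    outputs = llm.generate(
        ["Summarize this ticket in one sentence."],
        params,
        lora_request=LoRARequest(name, lora_id, path),
    )
    print(name, outputs[0].outputs[0].text)

Because every variant shares the base weights and the same serving machinery, the marginal cost of an extra fine-tune is an adapter load rather than a full model replica, which is what makes the near single-model cost claim plausible.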


Speaker

Meryem Arik

Co-Founder and CEO @Doubleword (Previously TitanML), Recognized as a Technology Leader in Forbes 30 Under 30, Recovering Physicist

Meryem is the Co-founder and CEO of Doubleword (previously TitanML), a self-hosted AI inference platform empowering enterprise teams to deploy domain-specific or custom models in their private environment. An alumna of Oxford University, Meryem studied Theoretical Physics and Philosophy. She frequently speaks at leading conferences, including TEDx and QCon, sharing insights on inference technology and enterprise AI. Meryem has been recognized as a Forbes 30 Under 30 honoree for her contributions to the AI field.


From the same track

Session

How Netflix Shapes our Fleet for Efficiency and Reliability

Netflix runs on a complex multi-layer cloud architecture made up of thousands of services, caches, and databases. As hardware options, workload patterns, cost dynamics, and the Netflix product evolve, the cost-optimal hardware and configuration for running our services are constantly changing.


Joseph Lynch

Principal Software Engineer @Netflix Building Highly-Reliable and High-Leverage Infrastructure Across Stateless and Stateful Services


Argha C

Staff Software Engineer @Netflix Building Highly Available, High Throughput Systems

Session

Realtime and Batch Processing of GPU Workloads

SS&C Technologies runs $47 trillion of assets on our global private cloud. We provide infrastructure primitives as well as platform-as-a-service offerings such as Kubernetes, Kafka, NiFi, and databases.


Joseph Stein

Principal Architect of Research & Development @SS&C Technologies, Former Apache Kafka Committer and PMC Member

Session

From ms to µs: OSS Valkey Architecture Patterns for Modern AI

As AI applications demand faster and more intelligent data access, traditional caching strategies are hitting performance and reliability limits. 


Dumanshu Goyal

Software Engineer @Airbnb - Leading Online Data Priorities, Previously @Google and @AWS