Asynchronous Architecture Patterns To Scale ML and Other High Latency Workloads on Kubernetes

Many API backends can be scaled by deploying multiple instances, adding a load balancer in front of them, and pointing clients at the load balancer. Unfortunately, this simple plan doesn't work that well when API requests can take more than a few seconds, or when we have to deal with sudden bursts of requests. This is exactly what happens when deploying compute-heavy inference workloads on Kubernetes, specifically with "generative AI" like large language models and Stable Diffusion.

This is the kind of scenario where asynchronous architectures and message queues can save the day (or at least considerably improve it) by buffering requests and decoupling clients and servers (which become producers and consumers).
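
To make this concrete, here is a minimal sketch of what the producer side could look like with Benthos: an HTTP endpoint that, instead of processing requests synchronously, publishes them to a RabbitMQ queue. The address, credentials, and queue name are illustrative, and the "requests" queue is assumed to be declared elsewhere.

```yaml
# Minimal Benthos producer sketch: accept HTTP POSTs and buffer
# them in RabbitMQ instead of processing them synchronously.
# Hostname, credentials, and queue name are illustrative.
input:
  http_server:
    address: 0.0.0.0:8080
    path: /infer

output:
  amqp_0_9:
    urls: [ "amqp://guest:guest@rabbitmq:5672/" ]
    exchange: ""     # the default exchange routes by key
    key: requests    # messages end up in the "requests" queue
```

Clients get a quick acknowledgment as soon as their request is safely queued, and the actual work happens later, at whatever pace the consumers can sustain.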

In this hands-on workshop, we will implement, deploy, and scale an application using a large language model on Kubernetes. We will leverage open source components such as:

  • RabbitMQ and PostgreSQL to store requests and responses
  • Benthos to implement API servers, producers, and consumers without writing code (a consumer sketch follows this list)
  • Prometheus, Grafana, and KEDA for observability, dashboards, and autoscaling
  • Helm and Helmfile to automate deployment as much as possible
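
On the consumer side, a Benthos pipeline along these lines could pull requests from RabbitMQ, forward each one to an inference server over HTTP, and store the result in PostgreSQL. Again, this is a sketch: the hostnames, queue, DSN, and table are placeholders, and the "responses" table is assumed to exist.

```yaml
# Minimal Benthos consumer sketch: read requests from RabbitMQ,
# call an inference server, store the result in PostgreSQL.
# Hostnames, queue name, DSN, and table are placeholders.
input:
  amqp_0_9:
    urls: [ "amqp://guest:guest@rabbitmq:5672/" ]
    queue: requests

pipeline:
  processors:
    - http:
        url: http://inference-server:8000/generate
        verb: POST

output:
  sql_insert:
    driver: postgres
    dsn: postgres://benthos:benthos@postgresql:5432/results?sslmode=disable
    table: responses
    columns: [ payload ]
    args_mapping: root = [ content().string() ]
```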
     

This workshop is for:

  • Data scientists who have been asked to deploy their models on Kubernetes
  • Ops folks who have been asked to support their fellow data scientists
  • Everyone in between!
     

Key Takeaways

1. An understanding of the challenges associated with deploying and scaling "Gen AI" and similar compute-heavy workloads.

2. Best practices and tools (like Benthos, KEDA...) to implement asynchronous data pipelines and autoscaling on Kubernetes (an autoscaling sketch follows this list).

3. An open source repository with all the samples, code, and configurations used during the workshop.
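
As a preview of that second takeaway, here is what a KEDA ScaledObject could look like when scaling the consumers according to the depth of the RabbitMQ queue. Names, thresholds, and the inline credentials are illustrative; in practice, the broker address would typically come from a TriggerAuthentication or an environment variable.

```yaml
# Sketch of a KEDA ScaledObject scaling the "consumer" Deployment
# based on the number of messages waiting in the "requests" queue.
# Names, thresholds, and credentials are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-scaler
spec:
  scaleTargetRef:
    name: consumer        # the Deployment running our consumers
  minReplicaCount: 0      # scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        host: amqp://guest:guest@rabbitmq:5672/
        queueName: requests
        mode: QueueLength
        value: "5"        # aim for ~5 pending messages per replica
```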


Speaker

Jérôme Petazzoni

Founder @Tiny Shell Script LLC, Cofounder @Enix France

Jérôme was part of the team that built, scaled, and operated the dotCloud PaaS, before that company became Docker. He worked seven years at the container startup, where he wore countless hats and ran containers in production before it was cool. He loves to share what he knows, which led him to give hundreds of talks and demos on containers, Docker, and Kubernetes. He trained thousands of people to deploy their apps with confidence on these platforms, and continues to do so as an independent consultant. He values diversity, and strives to be a good ally, or at least a decent social justice sidekick. He also collects musical instruments and can arguably play the theme of Zelda on a dozen of them.


Level

Intermediate


Prerequisites

To make the most of this workshop:

  • You don't need to be a Kubernetes expert, but you should at least know how to create and scale a "Deployment" and expose it with a "Service" (see the example manifest after this list).

  • You don't need to know the ins and outs of Kubernetes manifests, but you should be comfortable editing YAML files.

  • You don't need to know RabbitMQ or PostgreSQL, but it will be useful to know (in generic terms) what queues, messages, tables, and rows are.
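
For reference, this is roughly the level of Kubernetes YAML you should be comfortable reading and editing: a minimal Deployment exposed by a Service. The names and image are placeholders.

```yaml
# A minimal Deployment exposed by a Service.
# Names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
```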

If you intend to deploy on a specific cloud provider, we will also provide you with Terraform configurations to create your cluster on that provider (please get in touch with us before the workshop to make sure your provider is in the list of 10+ that we support). Otherwise, we will provide you with a small Kubernetes cluster for the duration of the workshop.

You will need a computer with an SSH client and a web browser (nothing too fancy!), and if you want to use your own cluster, we recommend that you also have kubectl, helm, helmfile, and your provider's CLI tools installed on your machine.
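
If you haven't used Helmfile before: it is a declarative wrapper around Helm that describes a whole set of releases in a single file. A minimal helmfile.yaml for the kind of stack used in this workshop could look like the sketch below; the release names are illustrative, and chart versions and values are omitted for brevity.

```yaml
# Sketch of a helmfile.yaml deploying the workshop's building blocks.
# Release names are illustrative; versions and values are omitted.
repositories:
  - name: bitnami
    url: https://charts.bitnami.com/bitnami
  - name: kedacore
    url: https://kedacore.github.io/charts

releases:
  - name: rabbitmq
    chart: bitnami/rabbitmq
  - name: postgresql
    chart: bitnami/postgresql
  - name: keda
    chart: kedacore/keda
```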