Module 0: Setup
Review prerequisites, environment, format, setup, questions, etc.
Module 1: Defining Platform Engineering
In this module, we will establish a baseline for what Platform Engineering is, how it came into being, and the domains (in the Domain-Driven Design sense) of effective platforms. This lays the foundation for everything we build during the rest of the workshop. Exercises will be conceptual: teams will diagram a high-level platform, and a few will present their concepts to the group for discussion.
Module 2: How AI is reshaping Platforms
In this module, we will build on that baseline to understand how AI is changing Platform Engineering. We will focus on where AI fits into the platform, and on the new concepts and architectures we must consider for both hosting and training AI systems. We will also explore GPU scheduling using our lab modules. Exercises for this module will be conceptual: teams will work through concepts on a shared Miro/Mural board, and we will then discuss the answers provided by some of the teams.
Module 3: The GPU changes everything
The biggest shift in engineering platforms today is that we need GPUs, and GPUs add a great deal of complexity to how we think about those platforms. The ways GPUs are used, the ways they communicate with each other, their observability considerations, their scheduling and orchestration, and their autoscaling are all different from traditional workloads. How do we begin to build engineering platforms that serve both traditional requirements (e.g., web/RESTful APIs) and these new, complex computational requirements?
Exercises for this module will be hands-on.
- Teams will have access to a shared Kubernetes cluster; each team gets its own namespace with access to a single GPU node.
- Teams will deploy a simple machine learning job to it, to begin to understand the differences between deploying a web API and deploying a machine learning job.
- Teams will then examine monitoring and observability as the job executes. We’ll walk through how jobs can be monitored on a GPU-enabled ML platform.
The exercises conclude with a discussion of how autoscaling differs in the GPU-enabled world.
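The autoscaling contrast can be sketched in a few lines of code. This is a minimal illustration only: the function names, thresholds, and formulas below are workshop-style assumptions, not the API or behavior of any real autoscaler.

```python
import math

def web_replicas(current_replicas: int, cpu_utilization: float,
                 target_utilization: float = 0.6) -> int:
    """HPA-style scaling: scale replicas proportionally to observed load."""
    return max(1, math.ceil(current_replicas * cpu_utilization / target_utilization))

def gpu_workers(queued_jobs: int, running_jobs: int,
                jobs_per_gpu: int = 1, max_gpus: int = 8) -> int:
    """GPU scaling is usually queue-driven: a training job occupies a whole
    GPU until it finishes, so we scale on pending work, not on utilization."""
    demand = math.ceil((queued_jobs + running_jobs) / jobs_per_gpu)
    return min(max_gpus, max(0, demand))

# A web API at 90% CPU with 3 replicas scales out fractionally...
print(web_replicas(3, 0.9))   # -> 5
# ...while 5 queued + 3 running single-GPU jobs demand 8 whole GPUs.
print(gpu_workers(5, 3))      # -> 8
```

The point of the sketch: web autoscaling reacts to a continuous utilization signal, while GPU job scaling is driven by discrete queue depth and whole-device occupancy, which is why the two need different mechanisms on the platform.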
Module 4: Hands on, build our platform foundation
In this module, we’ll build and implement the core concepts of an engineering platform that is heavily focused on developer freedom and self-service. The underlying components that accomplish this are:
- Compliance at the Point of Change
- Observability Driven Development
- Continuous rollouts and delivery
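"Compliance at the point of change" can be illustrated with a small admission-style check that validates a change at submission time rather than auditing it after the fact. This is a conceptual sketch: the policy rules, required labels, and manifest shape are illustrative assumptions, not the workshop's actual policy set.

```python
# Hypothetical policy: every workload must carry ownership labels and
# must not use mutable image tags.
REQUIRED_LABELS = {"team", "cost-center"}

def check_compliance(manifest: dict) -> list[str]:
    """Return a list of violations; an empty list means the change may proceed."""
    violations = []
    labels = manifest.get("metadata", {}).get("labels", {})
    missing = REQUIRED_LABELS - labels.keys()
    if missing:
        violations.append(f"missing required labels: {sorted(missing)}")
    for container in manifest.get("spec", {}).get("containers", []):
        if container.get("image", "").endswith(":latest"):
            violations.append(f"container {container.get('name')} uses :latest tag")
    return violations

manifest = {
    "metadata": {"labels": {"team": "ml-platform"}},
    "spec": {"containers": [{"name": "api", "image": "registry.local/api:latest"}]},
}
print(check_compliance(manifest))  # two violations: missing label, :latest tag
```

In a real platform this kind of check typically runs as a pipeline gate or an admission webhook, so feedback reaches the developer at the moment of the change.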
Module 5: Observability
In this module, we’ll understand and implement Observability Driven Development on an engineering platform, including platforms that must handle GPUs, which have some unique observability characteristics.
- Observability driven development
- O11y in a GPU world
- Chaos Engineering
- Chaos engineering with gpus in the mix
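One GPU-specific observability signal worth previewing: unlike a web service, a "healthy" training job should keep its GPU busy, so sustained low utilization is itself an alert condition. The sketch below is illustrative, loosely modeled on DCGM-style utilization gauges; the metric shape and threshold are assumptions.

```python
def underutilized_gpus(samples: dict[str, list[float]],
                       threshold: float = 0.3) -> list[str]:
    """Return GPU ids whose mean utilization is below the threshold."""
    return sorted(
        gpu for gpu, utils in samples.items()
        if utils and sum(utils) / len(utils) < threshold
    )

samples = {
    "gpu-0": [0.95, 0.92, 0.97],   # busy training
    "gpu-1": [0.05, 0.00, 0.10],   # likely stalled, e.g. on data loading
}
print(underutilized_gpus(samples))  # -> ['gpu-1']
```

In production this query would typically live in the metrics backend rather than application code, but the inversion of the signal (low utilization = problem) is the key GPU-world difference.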
Module 6: Final module
In the final module, we will assemble the touchpoints and interfaces that development teams will use to interact with our new engineering platform.
- Self-service Teams api & Operator
- Platform CLI
- Starter kits
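To make the shape of these touchpoints concrete, here is a minimal sketch of how a platform CLI could translate a developer's self-service request into a declarative resource for a teams API/operator to reconcile. All command names, resource kinds, and fields here are hypothetical, not the workshop's actual interfaces.

```python
import argparse
import json

def build_environment_request(team: str, env: str, gpus: int) -> dict:
    """Render a declarative environment request for the platform operator."""
    return {
        "apiVersion": "platform.example.com/v1",   # hypothetical API group
        "kind": "Environment",
        "metadata": {"name": f"{team}-{env}"},
        "spec": {"team": team, "gpus": gpus},
    }

def main(argv=None):
    parser = argparse.ArgumentParser(prog="platform")
    sub = parser.add_subparsers(dest="command", required=True)
    create = sub.add_parser("create-env", help="request a new environment")
    create.add_argument("--team", required=True)
    create.add_argument("--env", default="dev")
    create.add_argument("--gpus", type=int, default=0)
    args = parser.parse_args(argv)
    # A real CLI would submit this to the teams API; here we just print it.
    print(json.dumps(build_environment_request(args.team, args.env, args.gpus)))

main(["create-env", "--team", "ml-platform", "--gpus", "2"])
```

The design point: the CLI stays a thin wrapper over a declarative API, so starter kits, GitOps pipelines, and the operator can all consume the same resource shape.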
Speaker

Bryan Oliver
Principal @Thoughtworks, Global Speaker, Co-Author of "Effective Platform Engineering" and "Designing Cloud Native Delivery Systems"
Bryan Oliver is the co-author of Effective Platform Engineering (Manning) and Designing Cloud Native Delivery Systems (O’Reilly). He has spent years in the platform engineering space at Thoughtworks. He is an international speaker, a co-author of the Thoughtworks Technology Radar, and a member of its Doppler committee. He has served on the program committees of multiple platform engineering conferences and events, and contributes to open source through the Kubernetes and Platform working groups. At Thoughtworks, he now focuses on building very large-scale engineering platforms for LLM training, using thousands of the latest GPUs such as GB200 and H100. This scale requires a platform engineering approach and architecture, but differs in many ways from our pre-GPU understanding of platforms.