Module 0: Setup
Review prerequisites, environment, format, setup, questions, etc.
Module 1: Defining Platform Engineering
In this module, we will establish a baseline for what Platform Engineering is, how it came into being, and the domains (in the Domain-Driven Design sense) of effective platforms. This lays the foundation for everything we build during the rest of the workshop. Exercises will be conceptual: teams will diagram a high-level platform, and a few will present their concepts to the group for discussion.
Module 2: How AI is reshaping Platforms
In this module, we will build on that baseline to understand how AI is changing Platform Engineering. We will focus on where AI fits into the platform, and on the new concepts and architectures we must consider for both hosting and training AI systems. We will also explore GPU scheduling using our lab modules. Exercises for this module will be conceptual: teams will work through concepts on a shared Miro/Mural board, and we will then discuss the answers provided by some of the teams.
Module 3: The GPU changes everything
The biggest shift in engineering platforms today is that we need GPUs, and GPUs add a great deal of complexity to how we think about those platforms. The ways GPUs are used, the ways they communicate with each other, their observability considerations, their scheduling and orchestration, and their autoscaling are all different from traditional workloads. How do we begin to build engineering platforms that serve both traditional requirements (e.g., web/RESTful APIs) and these new, complex computational requirements?
Exercises for this module will be hands-on.
- Teams will have access to a shared Kubernetes cluster; each team gets its own namespace with access to a single GPU node.
- Teams will deploy a simple machine learning job to it, to begin to understand the differences between deploying a web API and deploying a machine learning job.
- Teams will then examine monitoring and observability as the job executes. We’ll walk through how jobs can be monitored on a GPU-enabled ML platform.
The exercises conclude with a discussion of how autoscaling differs in the GPU-enabled world.
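The autoscaling contrast can be sketched in a few lines of code. This is a minimal illustration only: the function names, thresholds, and formulas below are workshop-style assumptions, not the API or behavior of any real autoscaler.

```python
import math

def web_replicas(current_replicas: int, cpu_utilization: float,
                 target_utilization: float = 0.6) -> int:
    """HPA-style scaling: scale replicas proportionally to observed load."""
    return max(1, math.ceil(current_replicas * cpu_utilization / target_utilization))

def gpu_workers(queued_jobs: int, running_jobs: int,
                jobs_per_gpu: int = 1, max_gpus: int = 8) -> int:
    """GPU scaling is usually queue-driven: a training job occupies a whole
    GPU until it finishes, so we scale on pending work, not on utilization."""
    demand = math.ceil((queued_jobs + running_jobs) / jobs_per_gpu)
    return min(max_gpus, max(0, demand))

# A web API at 90% CPU with 3 replicas scales out fractionally...
print(web_replicas(3, 0.9))   # -> 5
# ...while 5 queued + 3 running single-GPU jobs demand 8 whole GPUs.
print(gpu_workers(5, 3))      # -> 8
```

The point of the sketch: web autoscaling reacts to a continuous utilization signal, while GPU job scaling is driven by discrete queue depth and whole-device occupancy, which is why the two need different mechanisms on the platform.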
Module 4: Hands on, build our platform foundation
In this module, we’ll build and implement the core concepts of an engineering platform that is heavily focused on developer freedom and self-service. The underlying components that accomplish this are:
- Compliance at the Point of Change
- Observability Driven Development
- Continuous rollouts and delivery
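"Compliance at the point of change" can be illustrated with a small admission-style check that validates a change at submission time rather than auditing it after the fact. This is a conceptual sketch: the policy rules, required labels, and manifest shape are illustrative assumptions, not the workshop's actual policy set.

```python
# Hypothetical policy: every workload must carry ownership labels and
# must not use mutable image tags.
REQUIRED_LABELS = {"team", "cost-center"}

def check_compliance(manifest: dict) -> list[str]:
    """Return a list of violations; an empty list means the change may proceed."""
    violations = []
    labels = manifest.get("metadata", {}).get("labels", {})
    missing = REQUIRED_LABELS - labels.keys()
    if missing:
        violations.append(f"missing required labels: {sorted(missing)}")
    for container in manifest.get("spec", {}).get("containers", []):
        if container.get("image", "").endswith(":latest"):
            violations.append(f"container {container.get('name')} uses :latest tag")
    return violations

manifest = {
    "metadata": {"labels": {"team": "ml-platform"}},
    "spec": {"containers": [{"name": "api", "image": "registry.local/api:latest"}]},
}
print(check_compliance(manifest))  # two violations: missing label, :latest tag
```

In a real platform this kind of check typically runs as a pipeline gate or an admission webhook, so feedback reaches the developer at the moment of the change.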
Module 5: Observability
In this module, we’ll understand and implement Observability Driven Development on an engineering platform, including platforms that must handle GPUs, which have some unique observability characteristics.
- Observability driven development
- O11y in a GPU world
- Chaos Engineering
- Chaos engineering with gpus in the mix
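One GPU-specific observability signal worth previewing: unlike a web service, a "healthy" training job should keep its GPU busy, so sustained low utilization is itself an alert condition. The sketch below is illustrative, loosely modeled on DCGM-style utilization gauges; the metric shape and threshold are assumptions.

```python
def underutilized_gpus(samples: dict[str, list[float]],
                       threshold: float = 0.3) -> list[str]:
    """Return GPU ids whose mean utilization is below the threshold."""
    return sorted(
        gpu for gpu, utils in samples.items()
        if utils and sum(utils) / len(utils) < threshold
    )

samples = {
    "gpu-0": [0.95, 0.92, 0.97],   # busy training
    "gpu-1": [0.05, 0.00, 0.10],   # likely stalled, e.g. on data loading
}
print(underutilized_gpus(samples))  # -> ['gpu-1']
```

In production this query would typically live in the metrics backend rather than application code, but the inversion of the signal (low utilization = problem) is the key GPU-world difference.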
Module 6: Final module
In the final module, we will assemble the touchpoints and interfaces that development teams will use to interact with our new engineering platform.
- Self-service Teams api & Operator
- Platform CLI
- Starter kits
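To make the shape of these touchpoints concrete, here is a minimal sketch of how a platform CLI could translate a developer's self-service request into a declarative resource for a teams API/operator to reconcile. All command names, resource kinds, and fields here are hypothetical, not the workshop's actual interfaces.

```python
import argparse
import json

def build_environment_request(team: str, env: str, gpus: int) -> dict:
    """Render a declarative environment request for the platform operator."""
    return {
        "apiVersion": "platform.example.com/v1",   # hypothetical API group
        "kind": "Environment",
        "metadata": {"name": f"{team}-{env}"},
        "spec": {"team": team, "gpus": gpus},
    }

def main(argv=None):
    parser = argparse.ArgumentParser(prog="platform")
    sub = parser.add_subparsers(dest="command", required=True)
    create = sub.add_parser("create-env", help="request a new environment")
    create.add_argument("--team", required=True)
    create.add_argument("--env", default="dev")
    create.add_argument("--gpus", type=int, default=0)
    args = parser.parse_args(argv)
    # A real CLI would submit this to the teams API; here we just print it.
    print(json.dumps(build_environment_request(args.team, args.env, args.gpus)))

main(["create-env", "--team", "ml-platform", "--gpus", "2"])
```

The design point: the CLI stays a thin wrapper over a declarative API, so starter kits, GitOps pipelines, and the operator can all consume the same resource shape.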
Speaker

Bryan Oliver
Principal @Thoughtworks, Global Speaker, Co-Author of "Effective Platform Engineering" and "Designing Cloud Native Delivery Systems"
Bryan Oliver is the co-author of Effective Platform Engineering (Manning) and Designing Cloud Native Delivery Systems (O’Reilly). He has spent years in the platform engineering space at Thoughtworks. He is an international speaker, a co-author of the Thoughtworks Technology Radar, and a member of its Doppler committee. He has served on the program committees of multiple platform engineering conferences and events, and contributes to open source through the Kubernetes and Platform working groups. At Thoughtworks, he now focuses on building very large-scale engineering platforms for LLM training, using thousands of the latest GPUs such as GB200 and H100. This scale requires a platform engineering approach and architecture, but differs in many ways from our pre-GPU understanding of platforms.