Continuous Delivery for Foundational Platforms

Abstract

Platform teams frequently inherit systems that were never architected for their current scale, yet are so foundational that downtime can halt the business. Operating on these fragile foundations, teams face the daunting challenge of continuously shipping new features while scaling infrastructure significantly. Continuous delivery can feel risky in such critical scenarios—but avoiding it can stall progress, frustrate internal customers, and trap teams in endless rewrites that never materialize.

Drawing from his experiences leading foundational platform teams at AWS EC2 and Datadog, Ian Nowland will share practical strategies to safely implement continuous delivery, balancing reliability with innovation. Attendees will learn how to scale confidently, enhance developer productivity, and sustainably improve their platforms—even under immense pressure.

Interview:

What is your session about, and why is it important for senior software developers?

This session is about how platform teams can safely implement continuous delivery for foundational infrastructure. Systems like CI/CD, compute, networking, and service discovery are so critical you can’t afford to break them—yet they still need to evolve. These are often legacy systems that were never designed for today’s scale but now sit at the center of everything.

For senior developers—especially those who end up inheriting these systems—it’s a real trap: the pressure to innovate is high, but the blast radius is huge. I’ll share strategies we used at AWS and Datadog to keep delivering change safely, and why that’s essential to avoid stagnation, rewrites, and developer burnout.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

We’re entering a phase where AI is accelerating everything. Teams are racing to ship new features, and developers are using AI to generate code faster than ever. The bottleneck is no longer ideation or execution—it’s the platform standing in the way of safely and quickly getting that code into production.

In the past—like during the cloud migration—platform teams could respond by building a new “V2” platform tailored to emerging use cases. But this time is different. With AI-accelerated development, nearly every team wants to move faster. Supporting just a subset of use cases for the first couple of years isn’t enough. Foundational platforms need to incrementally evolve to deliver capabilities for all users, even as they carry foundational load.

That’s why continuous delivery for these systems has become critical. It’s about enabling safe, sustainable iteration without requiring a full rewrite followed by a years-long migration by your users. The goal is to build the internal tooling and processes that allow foundational platform teams to ship, test, and recover quickly while the business keeps moving.

What are the common challenges developers and architects face in this area?

The most common challenges I see are:

Staging environments don’t match production in either diversity or scale, which makes testing platform changes almost impossible.

Change becomes scary. Platform teams hesitate to ship—even small “quality of life” improvements—because one wrong move could bring everything down.

Grand rewrites stall out. The team starts building a “V2” but never cuts over, because the risk is too high.

Techniques like blue/green deploys, one-box testing, and traffic shadowing are well-established for stateless microservices—but often seem out of reach for foundational platforms. In this talk, I’ll cover how to bridge that gap, even when you’re working on critical, fragile systems.

What’s one thing you hope attendees will implement immediately after your talk?

Build a path to production that feels safe. That might mean introducing a shadowing mechanism. It might mean running a flaky staging use case behind a flag in production. Or it might just mean adding better observability during rollouts. But the goal is the same: get to a place where it’s safe to ship small changes continuously—even to your scariest systems.
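As a rough illustration of the shadowing idea mentioned above (not the speaker's implementation), the sketch below serves every request from the existing code path and mirrors a sample of requests to a candidate path, recording any disagreement. The names `ShadowRunner`, `sample_rate`, `old_lookup`, and `new_lookup` are hypothetical, and the candidate is called inline for simplicity; a real platform would more likely mirror traffic asynchronously so the shadow call adds no latency.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ShadowRunner:
    """Serve from `primary`; mirror a sample of requests to `candidate`."""
    primary: Callable[[Any], Any]
    candidate: Callable[[Any], Any]
    sample_rate: float = 0.1  # hypothetical knob: fraction of requests to shadow
    rng: random.Random = field(default_factory=random.Random)
    mismatches: List[Tuple[Any, Any, Any]] = field(default_factory=list)
    shadowed: int = 0

    def handle(self, request: Any) -> Any:
        result = self.primary(request)  # the caller only ever sees this value
        if self.rng.random() < self.sample_rate:
            self.shadowed += 1
            try:
                shadow = self.candidate(request)
            except Exception as exc:  # a candidate failure must not reach the caller
                shadow = exc
            if isinstance(shadow, Exception) or shadow != result:
                # Keep (request, primary result, candidate result) for later review.
                self.mismatches.append((request, result, shadow))
        return result


# Hypothetical old and new versions of a service-discovery style lookup.
def old_lookup(name: str) -> str:
    return name.lower()


def new_lookup(name: str) -> str:
    if name == "Bad":
        raise ValueError("boom")
    return name.strip().lower()


runner = ShadowRunner(old_lookup, new_lookup, sample_rate=1.0)
responses = [runner.handle(n) for n in ["Web", " api ", "Bad"]]
print(responses)
print(len(runner.mismatches), "mismatches out of", runner.shadowed, "shadowed requests")
```

Callers keep receiving the old behavior throughout, while the recorded mismatches show where the candidate diverges or fails under production-shaped input before any cutover.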

What makes QCon stand out as a conference for senior software professionals?

As someone who’s run large platform teams and now started a company in the space, I appreciate conferences where you can talk openly about failure modes—not just success stories. QCon consistently gets those conversations right.

What was one interesting thing that you learned from a previous QCon?

In 2019, I caught Bryan Cantrill’s talk, “No Moore Left to Give: Enterprise Computing after Moore’s Law.” He was one of the first to clearly articulate that the “free” gains we’ve relied on (faster chips, more efficient transistors, cheaper compute) were all slowing down. And while it wasn’t the sole focus of his talk, it was one of the first times I saw someone point to GPUs becoming essential for non-graphics (well, and non-blockchain) workloads, which feels prescient today.


Speaker

Ian Nowland

CEO @Junction Labs, Author of O'Reilly's Platform Engineering, Previously SVP Core Engineering at Datadog and Leader of AWS Nitro

Ian Nowland is the CEO and co-founder of Junction Labs, and co-author of O'Reilly’s Platform Engineering. With 25 years in software, Ian previously served as SVP of Core Engineering at Datadog during its hypergrowth phase, and spent eight formative years at AWS (2008–2016), where he led the creation and development of EMR and AWS Nitro, EC2’s virtualization platform.

From the same track

Session

Microservices Platforms: When Team Topologies Meets Microservices Patterns

Monday Nov 17 / 01:35PM PST

When many teams work on a large, complex application, the microservice architecture potentially enables them to work independently and deliver a continuous stream of changes.

Chris Richardson

Creator of microservices.io, Java Champion, & Core Microservices Thoughtleader

Session Resilience

Enhancing Reliability Using Service-Level Prioritized Load Shedding at Netflix

Monday Nov 17 / 05:05PM PST

How does Netflix maintain a seamless viewing experience for millions of users, especially during traffic spikes or when backend datastores are overloaded? Autoscaling can help during traffic spikes, but it costs money, takes a few minutes to kick in, and capacity may not always be available.

Anirudh Mendiratta

Staff Software Engineer, Playback Lifecycle @Netflix, Previously @Amazon Prime Video and @fuboTV

Benjamin Fedorka

Staff Software Engineer, Productivity Engineering @Netflix

Session

Platform Engineering: Lessons from the Rise and Fall of eBay Velocity

Monday Nov 17 / 03:55PM PST

Once a stock market darling and a pioneering hyperscaler in the 1990s and early 2000s, eBay has been in steady decline since the 2010s. A household name with a flat business, eBay has been unable to make substantive strides in its market reach or its engineering outcomes in the last 15 years.

Randy Shoup

SVP Engineering @Thrive Market, Previously @eBay, @Google, @Stitch Fix

Session

Beyond Line Charts: Why Some Diversity in Telemetry Visualization Is Long Overdue

Monday Nov 17 / 11:45AM PST

For decades, visualization of service metrics overwhelmingly converges to line charts. The time-centric nature of real-time telemetry further cemented this phenomenon via storage layouts and domain-specific query languages.

Yao Yue

Platform Engineer, Distributed System Aficionado, Cache Expert, and the Founder of IOP Systems

Session

Unconference: Modern Platform Engineering and Dev Enablement

Monday Nov 17 / 02:45PM PST