Presentation: Architecting for Failure in a Containerized World


10:35am - 11:25am



Key Takeaways

  • Understand how the use of containers can help architects rethink decomposing applications.
  • Consider a perspective of composing applications along lines more associated with failure and restoration modes.
  • Learn approaches to dealing with legacy applications and how containers can help you with the process of decomposing them.


The container revolution is upon us and with it comes a new toolbox for building systems that are robust in the face of failures. Created in just the past few years, this new set of tools demands that we rethink our approach to architecting for failure. When we do, we will reap the benefits of architectural models that make it much simpler to reason about and handle failures of all types.

In this talk, we'll explore this toolbox and how the tools can be used to best effect. We will consider not only Docker but failure and recovery models built into orchestration systems such as Mesos and Kubernetes that are often used with Docker. We will talk about how to recognize and recover from failures involving different parts of our application including data persistence and active user operations. We will look in detail at how failure and recovery work in a real application and understand the implications this has on system architecture.

This all comes together into an architectural framework that lets us model the various types of failures that we might see and be confident that our systems can recover from them.


QCon: What is your role today?

Tom: I am a consultant. I work mostly on productionizing big data workflows. A lot of what I do is around working with companies (Fortune 500 companies, global companies, startups) to deliver solutions at scale that use a lot of data.

QCon: Are you mostly talking about traditional big data involving a Hadoop stack, or are you talking about some of the streaming use cases around Apache Flink and Spark?

Tom: For what it’s worth, no one that I talk to still uses the full Hadoop stack. Everybody uses parts of it (in particular at the storage layer), so in that sense there is still a ton of Hadoop. Most of what I am seeing is Spark or ad hoc solutions that involve Scala and Akka, for example. Being able to deploy data applications that are not necessarily leveraging some of the big-name tools (or are often privately developed) is a powerful thing that people want to be able to do. Additionally, people want to be able to mix and match according to what suits their workload and their analytics stack.

QCon: Can you explain your talk title to me?

Tom: One of the things I find really interesting is how we think about building robust systems today. For example, containerization gives us a way to separate parts of our systems such that we can mock failure, improve monitoring and recovery, and improve resiliency. These are easy to implement. These tools let us extend some of the ideas we have had about things like microservices, replication across the network, and failover, and simplify them in some very important ways. These are exciting ideas, and when we combine containers with some of the orchestration ideas that are happening now, we get some really powerful ways to assemble systems. It makes the job of the architects and implementers a whole lot easier if they understand how to put systems together this way and leverage the ecosystem that is out there. So I want to talk about how to do that.

The talk is about how we think about the components of our overarching applications: things like computation, state, resiliency, and duplication. How do we separate parts of an application and put them into containers? Part of what this means is changing our mindset about the way we have done things in the past. Take logging, for example, or other cases where we have built these sorts of monolithic applications that have to carry a bunch of supporting services inside them. With containers, we can actually roll some of those services out of the application and compose them by putting containers together.

Since containers are so much lighter than VMs, and containers can happily co-exist within one VM, they let us build very simple components.

I am going to walk through what some of those components look like and how we develop an architecture that separates the salient points of the components. For example, some components need to have persistent state, while some components are pure compute. In an existing system, a lot of times we have to couple those together for performance reasons. With containers, we can now separate some of those things. The container can be very simple.

That means thinking about failure within a specific container then becomes very simple. What do we have to track? What do we need to have to restart? What sort of information do we need to be able to replay into that container, or in some cases, to put it to another place within our cluster? 
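The questions above (what to track, what to restart, what to replay) can be illustrated with a minimal sketch. It is not taken from the talk. It assumes a hypothetical workload, `Counter`, whose state can be rebuilt from an ordered event log, and a hypothetical `EventLog` that stands in for durable storage living outside the container.

```python
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Hypothetical container workload: keeps a running total per key."""
    totals: dict = field(default_factory=dict)

    def apply(self, event):
        key, amount = event
        self.totals[key] = self.totals.get(key, 0) + amount


class EventLog:
    """Stand-in (assumption) for durable storage outside the container."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self):
        return list(self._events)


def start_container(log):
    """Start or restart the workload by replaying every logged event."""
    worker = Counter()
    for event in log.replay():
        worker.apply(event)
    return worker


log = EventLog()
worker = start_container(log)
for event in [("a", 1), ("b", 5), ("a", 2)]:
    log.append(event)    # record first, so the event survives a crash
    worker.apply(event)

# Simulate a failure: the in-memory state is lost, possibly with its node.
del worker
recovered = start_container(log)
print(recovered.totals)
```

In this sketch the only thing that has to be tracked is the log, and a restart on the same node or on a different one looks identical to a first start.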

This lets us deal not only with failure but also with situations that are a lot like failure, such as when clusters become overloaded or nodes become overloaded hot spots. By thinking about those things as failure and recovery, we get a very simple, powerful model for moving forward and building complex applications out of simple parts. But it requires decomposing the application in the right way, and that’s mostly what I am going to talk about: how are we going to decompose our applications to do that?
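As a rough illustration of treating overload as just another kind of failure, here is a small sketch. It does not use the API of Mesos, Kubernetes, or any other orchestrator. The `Node` class, the `LOAD_LIMIT` threshold, and the `reschedule` function are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alive: bool = True
    load: float = 0.0  # fraction of capacity in use, 0.0 to 1.0


LOAD_LIMIT = 0.9  # assumed threshold; a real system would tune this


def is_failed(node):
    """An overloaded node is treated exactly like a dead one."""
    return (not node.alive) or node.load > LOAD_LIMIT


def reschedule(placement, nodes):
    """Move each container on a failed node to the least-loaded healthy node.

    Simplification: a node's load is not updated as containers move onto it.
    """
    healthy = [n for n in nodes.values() if not is_failed(n)]
    if not healthy:
        raise RuntimeError("no healthy node available")
    new_placement = {}
    for container, node_name in placement.items():
        if is_failed(nodes[node_name]):
            target = min(healthy, key=lambda n: n.load)
            new_placement[container] = target.name
        else:
            new_placement[container] = node_name
    return new_placement


nodes = {
    "n1": Node("n1", load=0.95),    # hot spot
    "n2": Node("n2", alive=False),  # crashed
    "n3": Node("n3", load=0.2),     # healthy
}
placement = {"web": "n1", "auth": "n2", "db": "n3"}
new_placement = reschedule(placement, nodes)
print(new_placement)
```

Because the hot spot and the crashed node go through the same `is_failed` check, there is a single recovery path to reason about rather than two.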

QCon: So as you decompose an application, are there portions of an application that are just not eligible or not recommended for containers?

Tom: Not in general but, in certain specific cases, yes. An issue with legacy components is that the amount of work to disassemble something into containers is either more than you want to do (because it is legacy and you don’t want to move it forward), or it’s just not worth it right now. But later, you’ll want to migrate it and will want to separate parts. One of the movements that’s going on now is this notion of gradually decomposing legacy applications into containers or into microservices. That’s a great direction and you can support that very robustly.

You will have a more complicated failure case the more complicated any given container is, just like with any other application or system, but you can work through steps to decompose things as it makes sense. You can have new green field applications architected for whatever you decide is correct. One of the interesting things to consider from there is what is the relationship between containers and microservices in the face of failure?

That’s a good question. Containers are a great way to implement microservices if your team is going heavily that way. But you don’t have to go all the way to microservices to get the benefits of decomposition and the ability to compose services of various granularities together. You may have some services that are very large and very complex and other services that are small and fit more of the microservice model. That hybrid approach is a perfectly reasonable way to proceed.

I want to give architects and developers the tools they need to understand how a given service, or application that’s packaged in containers, is going to behave in the case of failure. This talk is about how they can build architectures that are going to be easier to handle in the case of failure.

QCon: What are some of the gotchas when you start to decompose and containerize a legacy monolith?

Tom: I want to look at the gotchas specifically from the point of view of failure, recovery, and resiliency. In this case, when you think about microservices as an architect, you are typically thinking semantically. I want to keep some sets of operations grouped together in a single microservice and others in another microservice. But when you think about it from the point of view of failure and recovery, you want to decompose in a different way. 

You want to decompose in a way that maintains consistency and, in a perfect world, you want to be able to extract some microservices or containers that are completely stateless. They do compute operations. Maybe they do authentication or processing. The key is they don’t keep any local data or any state. They just answer questions. That matters because those are the things that are easiest to recover.

When you think about state that way, you can have microservices that are purely about state management. Failure and recovery then become very straightforward (though state management is never a simple process).
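A minimal sketch of that split, under stated assumptions: `StateStore` is a hypothetical stand-in for a dedicated state-management service, and `StatelessWorker` is a hypothetical pure-compute container that keeps nothing between requests. Neither name comes from the talk.

```python
class StateStore:
    """Hypothetical state-management service: the only place data lives."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class StatelessWorker:
    """Pure compute: holds no data of its own between requests."""

    def __init__(self, store):
        self._store = store

    def handle_deposit(self, account, amount):
        balance = self._store.get(account, 0) + amount
        self._store.put(account, balance)
        return balance


store = StateStore()
worker = StatelessWorker(store)
worker.handle_deposit("alice", 10)

# The worker container fails. Its replacement needs only a reference to the
# store, because there is no local state to rebuild.
worker = StatelessWorker(store)
balance = worker.handle_deposit("alice", 5)
print(balance)
```

Recovery of the worker is then just a restart, and the hard questions about durability and consistency are concentrated in the one component that owns the state.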

It’s a different way of thinking about decomposition from the purely semantic way. Similarly, one of the nice things about containers is how they handle cross-cutting concerns. We can begin to pull out what programming language theorists might call aspect-oriented things. You can begin to think that within our architecture, within our container bed, there are containers that are executing monitoring, logging, and other activities that cut across all of our services and can be reused across containers. The containers may be aware, but the things running in the container may actually be quite ignorant of what’s going on around them. That makes them much simpler and much easier to restart.

QCon: How would you describe the persona of the target audience of this talk?

Tom: I am mostly talking to architects here.

There are two questions that I want to answer in this talk. The first is, in a perfect world, how would I architect my system such that it is robust to failure, and it’s easy to reason about its robustness to failure?

The second part is that none of us live in a perfect world, so we need to think about more than that perfect green field opportunity. Given that reality, how do we now think about existing applications, and how they interact with this new world? Where can I get leverage from containerization? How much do I want to think about it? And how do I get some of these robustness promises without buying the whole farm on it?

QCon: What are your key takeaways for this talk?

Tom: The biggest thing that I want someone to walk away with is an understanding of how to think about decomposing a system and, in a very practical sense, a set of rules for taking systems apart and rebuilding them. Given that, how do I understand what the robustness of my system is, and how do I make those choices?

We have three steps, so to speak. What are the sets of tools I have available to me in this new containerized world? How do I leverage that set of tools to make perfectly robust applications, and what are the guidelines that go with that? 

Speaker: Tom Faulhaber

Principal Data Analysis Leader @Infolace

Tom Faulhaber is principal of Infolace, a San Francisco-based consultancy that helps clients from startups to global brands turn raw data into information and information into action. Throughout his career, Tom has developed systems for high-performance TCP/IP, large-scale scientific visualization, energy trading, and many more. In addition, Tom is a contributor to the Clojure language and is active in that community.
