Presentation: Freeing the Whale: How to Fail at Scale


4:10pm - 5:00pm



Key Takeaways

  • Ideas for separating concerns when developing applications
  • How instrumenting the communication processes correctly can prevent many problems
  • What you can do to make it easier to run a system in the face of failure


Twitter was once known for its ever-present error page, the “Fail Whale.” Thousands of staff-years later, this iconic image has all but faded from memory. This transformation was only possible due to Twitter’s treatment of failure as something not just to be expected, but to be embraced.

In this talk, we discuss the technical insights that enabled Twitter to fail, safely and often. We will show how Finagle, the high-scale RPC library used at Twitter, Pinterest, SoundCloud, and other companies, provides a uniform model for handling failure at the communications layer. We’ll describe Finagle’s multi-layer mechanism for handling failure (and its pernicious cousin, latency), including latency-aware load balancing, failure accrual, deadline propagation, retry budgets, and negative acknowledgement. Finally, we’ll describe Finagle’s unified model for naming, inspired by the concepts of symbolic naming and dynamic linking in operating systems, which allows it to extend failure handling across service cluster and datacenter boundaries. We will end with a roadmap for improvements upon this model and mechanisms for applying it to non-Finagle applications.


QCon: What is your role today?

Oliver: I am a co-founder of Buoyant, which started about 18ish months ago. I’m an ex-Twitter person. I left Twitter on a Friday and started Buoyant on a Monday to build open source projects and a proprietary platform on top of them.

Right now, almost all of my time is spent working on the open source piece, talking about the open source piece, working on gRPC integrations and the various migrations we are doing. The open source project is called linkerd. It’s a proxy that packages Finagle.

The goal is to bring Finagle’s operational value outside of Scala and Java, making lots of features available. We shouldn’t have to solve these in Node and Pasqual and Go and Python and every framework. Teams that operate them shouldn’t have to know how to operate all of these things differently. By doing this with the proxy layer, we can give a bunch of tooling to operators.

QCon: Is that specifically observability?

Oliver: One of the big benefits is that we can monitor things in a uniform way, but we can also do things like service discovery. Service discovery is a problem that I struggled with when we moved to Mesos at Twitter. Seeing other people on the mailing list dealing with these things, I understand how foreign it is and how you really end up binding your code into it if you don’t do it carefully. We want to be able to plug in a bunch of the things that make up a service mesh. You can plug a service into something, and then the communication management is part of the infrastructure and not something you have to deal with in every application.

QCon: Is it something like a daemon that’s collecting data and sending it off someplace? Or is Finagle instrumented to be able to expose what is happening inside the JVM?

Oliver: This effectively is an HTTP proxy. It’s not only HTTP; we can do other protocols. But you send your requests through this thing and it will deal with service discovery and routing and monitoring and staging and testing and debugging and all of the things that you end up having to solve over and over again. Finagle has all this great code, and our task is to put it into a small configuration that operators can understand without having to reason about Scala.

QCon: What will you be talking about at QCon SF?

Oliver: The goal for the talk is to really expose some of the lessons from Finagle and motivate why Finagle does what it does. Load balancing, for instance, is something I thought I understood before I worked on Finagle in depth. I had written some naïve load balancers on various projects. Going into this and actually operating it, I thought about the things that are going to wake me up. How do we improve load balancing so that I don’t have to wake up? How do I instrument retries and timeouts in a way that makes sense at scale?
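
To make the question about retries at scale concrete, here is a minimal sketch of a retry budget, one of the mechanisms named in the abstract. It is an illustration only, not Finagle's implementation: the names `RetryBudget`, `call_with_budget`, `retry_percent`, and `max_attempts` are hypothetical, and a production budget would also expire deposits over a time window. The idea is that every request deposits a fraction of a retry, so retries can never exceed a fixed percentage of normal traffic, even when a downstream service is failing.

```python
class RetryBudget:
    """Caps retries at a percentage of requests (hypothetical sketch).

    The balance is kept in integer units so that the arithmetic is exact:
    each request deposits `retry_percent` units and each retry costs 100.
    """

    COST = 100  # units withdrawn per retry

    def __init__(self, retry_percent=20, initial_retries=0):
        self.retry_percent = retry_percent
        self.balance = initial_retries * self.COST

    def record_request(self):
        # Every original request earns a fraction of a retry.
        self.balance += self.retry_percent

    def try_withdraw(self):
        # A retry is allowed only if enough budget has accumulated.
        if self.balance >= self.COST:
            self.balance -= self.COST
            return True
        return False


def call_with_budget(budget, send, max_attempts=3):
    """Calls send(), retrying on ConnectionError while the budget allows."""
    budget.record_request()
    attempt = 1
    while True:
        try:
            return send()
        except ConnectionError:
            # Give up when attempts are exhausted or the budget is empty.
            if attempt >= max_attempts or not budget.try_withdraw():
                raise
            attempt += 1
```

With a 20 percent budget, ten ordinary requests earn two retries. Once those are spent, further failures are returned to the caller immediately instead of multiplying load on a service that is already struggling.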

In my talk I will go through the lessons that we learned. Part of this is the orchestration layer, where we are going to assume that hardware is going to fail. We are going to assume that my process isn’t going to stay up reliably. Once we accept failure, we view our infrastructure differently: it is not going to stay up all the time, it is going to be fragile. What are the implications of that?

I will go into more depth around how we explored load balancing, which is a thing that everyone who has faced it will understand. There are some pretty surprising lessons from getting through that, and also for service discovery. I was on call for service discovery for years, and we made a lot of bad assumptions early on that took a while to fix.
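
As one illustration of moving past a naïve load balancer, the sketch below shows "power of two choices" selection by outstanding requests. It is a simplified example under stated assumptions, not Finagle's code: the `Endpoint` class and the `pick_p2c` function are hypothetical names, and a latency-aware balancer would rank endpoints by a latency estimate rather than a plain request count.

```python
import random


class Endpoint:
    """A backend replica and the number of requests in flight to it."""

    def __init__(self, name):
        self.name = name
        self.outstanding = 0


def pick_p2c(endpoints, rng=random):
    """Samples two distinct endpoints and returns the less loaded one."""
    if len(endpoints) == 1:
        return endpoints[0]
    a, b = rng.sample(endpoints, 2)
    return a if a.outstanding <= b.outstanding else b


def dispatch(endpoints, send):
    """Sends one request through the chosen endpoint, tracking load."""
    endpoint = pick_p2c(endpoints)
    endpoint.outstanding += 1
    try:
        return send(endpoint)
    finally:
        # Always release the slot, even if the request failed.
        endpoint.outstanding -= 1
```

Compared with round robin, this steers traffic away from a replica that has become slow, because its in-flight count grows while it is slow and it loses more of the pairwise comparisons.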

QCon: What are your key takeaways for this talk?

Oliver: Learn to understand separation of concerns: we want our applications to be separate from a bunch of the things that we end up binding our applications to. I will show many infrastructure problems and shine light on the ones we can solve at the communication layer. In my experience, we try to solve everything with deploy tools or orchestration tools, and there are a whole bunch of things we can solve if we instrument the communication processes correctly. I will give examples of the types of things we can solve here and the types of code you can build that make it easier to run a system in the face of failure.

QCon: What do you feel is the most disruptive technology in IT right now?

Oliver: I think the orchestration layer has absolutely changed everything, way more than Docker or containers for that matter. Everyone in the Java world has had containers forever. We have had jars and shipped them around; that’s effectively a container. The scheduling layer, things like Kubernetes and Mesos and Swarm and Nomad: there is now a cottage industry of these things. That’s changing everything.

Moving to the cloud is a big force behind it, but now that we have new high-level APIs for doing operations, we are seeing an explosion of operability features. Watching every company that is involved with Kubernetes in some way try to position itself is really interesting. Everybody is trying to figure out what the new stack looks like in this flexible world where we don’t have hosts that we maintain.

Speaker: Oliver Gould

CTO @Buoyant

Oliver is the CTO of Buoyant, where he leads open source development efforts. Prior to joining Buoyant, he was a staff infrastructure engineer at Twitter, where he was the technical lead of Observability, Traffic, and Configuration & Coordination teams. He is the creator of linkerd, a core contributor to Finagle, and a huge fan of dogs.

