Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: Freeing the Whale: How to Fail at Scale
Duration
Level:
- Intermediate
Persona:
- Architect
- Developer
- Developer, JVM
Key Takeaways
- Ideas for separating concerns when developing applications
- How instrumenting the communication processes correctly can prevent many problems
- What you can do to make it easier to run a system in the face of failure
Abstract
Twitter was once known for its ever-present error page, the “Fail Whale.” Thousands of staff-years later, this iconic image has all but faded from memory. This transformation was only possible due to Twitter’s treatment of failure as something not just to be expected, but to be embraced.
In this talk, we discuss the technical insights that enabled Twitter to fail, safely and often. We will show how Finagle, the high-scale RPC library used at Twitter, Pinterest, SoundCloud, and other companies, provides a uniform model for handling failure at the communications layer. We’ll describe Finagle’s multi-layer mechanism for handling failure (and its pernicious cousin, latency), including latency-aware load balancing, failure accrual, deadline propagation, retry budgets, and negative acknowledgement. Finally, we’ll describe Finagle’s unified model for naming, inspired by the concepts of symbolic naming and dynamic linking in operating systems, which allows it to extend failure handling across service cluster and datacenter boundaries. We will end with a roadmap for improvements upon this model and mechanisms for applying it to non-Finagle applications.
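As a taste of what this looks like in practice, here is a minimal sketch of a Finagle HTTP client configured with a few of the mechanisms named above: a latency-aware (peak-EWMA, power-of-two-choices) load balancer, a retry budget, and a request timeout. The destination, budget values, and timeout are illustrative rather than recommendations, and import paths vary slightly across Finagle versions (this sketch assumes a circa-2017 release); failure accrual and deadline propagation are default client behaviors, so they don't appear explicitly here.

```scala
import com.twitter.conversions.time._
import com.twitter.finagle.Http
import com.twitter.finagle.loadbalancer.Balancers
import com.twitter.finagle.service.RetryBudget

// A retry budget caps retries at a fraction of total traffic (plus a small
// floor), so retries cannot amplify an outage into a retry storm.
val budget = RetryBudget(ttl = 10.seconds, minRetriesPerSec = 5, percentCanRetry = 0.1)

val client = Http.client
  .withRetryBudget(budget)
  // Latency-aware balancing: sample two replicas at random and prefer the
  // one with the lower peak-decayed EWMA of latency.
  .withLoadBalancer(Balancers.p2cPeakEwma())
  // Bound how long any single request may take end to end.
  .withRequestTimeout(1.second)
  .newService("api.example.com:8080") // illustrative destination
```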
Interview
Oliver: I am a co-founder of Buoyant, which started about 18 months ago. I'm an ex-Twitter person; I left Twitter on a Friday and started Buoyant on a Monday to build open source projects and a proprietary platform on top of them.
Right now, almost all of my time is spent working on the open source piece, talking about the open source piece, and working on gRPC integrations and the various migrations we are doing. The open source project is called Linkerd; it's a proxy that packages Finagle.
The goal is to bring Finagle's operational value outside of Scala and Java, making lots of its features widely available. We shouldn't have to solve these problems again in Node and Go and Python and every other framework, and the teams that operate these services shouldn't have to know how to operate all of them differently. By doing this at the proxy layer, we can give a bunch of tooling to operators.
Oliver: One of the big benefits is that we can monitor things in a uniform way, but also do things like service discovery. Service discovery is a problem that I struggled with when we moved to Mesos at Twitter, and from seeing other people on the mailing list dealing with these things, I understand how foreign it is and how you really end up binding your code to it if you don't do it carefully. We want to be able to plug in a bunch of these things, which is what a service mesh is: you plug a service into it, and then communication management is part of the infrastructure and not something you have to deal with in every application.
Oliver: This is effectively an HTTP proxy, though it's not only HTTP; we can speak other protocols as well. You send your requests through this thing and it deals with service discovery and routing and monitoring and staging and testing and debugging and all of the things that you end up having to solve over and over again. Finagle has all this great code, and our task is to put it behind a small configuration that operators can understand without having to reason about Scala.
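To make "a small configuration that operators can understand" concrete, here is a rough sketch of what a Linkerd (1.x) configuration might look like, using file-based service discovery. The ports, namer, and dtab are illustrative and the exact keys vary by release.

```yaml
admin:
  port: 9990              # admin and metrics endpoint

namers:
- kind: io.l5d.fs         # file-based service discovery: one file per service
  rootDir: disco

routers:
- protocol: http
  dtab: |
    /svc => /#/io.l5d.fs; # resolve service names through the fs namer
  servers:
  - port: 4140            # applications send their HTTP calls here
    ip: 0.0.0.0
```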
Oliver: The goal for the talk is to really expose some of the lessons from Finagle and motivate why Finagle does what it does. Things like load balancing, for instance, I thought I understood before I worked on Finagle in depth. I had written some naïve load balancers on various projects, but once we actually had to operate this, I started thinking about the things that are going to wake me up at night. How do we improve load balancing so that I don't have to wake up? How do I instrument retries and timeouts in a way that makes sense at scale?
In my talk I will go through the lessons that we learned. Part of this is the orchestration layer, where we assume that hardware is going to fail and that my process isn't going to stay up reliably. Once we accept failure and view our infrastructure differently, accepting that it is not going to stay up all the time and is going to be fragile, what are the implications of that?
I will go into more depth around how we explored load balancing, which is something that everyone who has tried to tackle it will understand. There are some pretty surprising lessons in getting through that, and also for service discovery. I was on call for service discovery for years, and we made a lot of bad assumptions early on that took a while to fix.
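To give a flavor of the load-balancing lessons mentioned above, here is a toy sketch (not Finagle's actual implementation) of the "power of two choices" idea: rather than round-robin, sample two replicas at random and send the request to the less loaded one, which naturally steers traffic away from slow or failing nodes.

```scala
import scala.util.Random

// Toy model of a backend replica; `outstanding` would be tracked by the client
// as requests are dispatched and completed.
final class Replica(val address: String) {
  var outstanding: Int = 0
}

// Power-of-two-choices, least-loaded balancing: sample two replicas at random
// and prefer the one with fewer outstanding requests. Compared to naive
// round-robin, a slow replica accumulates outstanding requests and is chosen
// less often, instead of being hammered at a fixed rate.
final class P2CBalancer(replicas: IndexedSeq[Replica], rng: Random = new Random) {
  def pick(): Replica = {
    val a = replicas(rng.nextInt(replicas.size))
    val b = replicas(rng.nextInt(replicas.size))
    if (a.outstanding <= b.outstanding) a else b
  }
}
```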
Oliver: Learn to understand separation of concerns: we want our applications to be separate from a bunch of the things that we end up binding into our applications. I will show many infrastructure problems and shine a light on the ones we can solve at the communication layer. In my experience, we try to solve everything with deploy tools or orchestration tools, but there are a whole bunch of things we can solve if we instrument the communication processes correctly. I will give examples of the types of things we can solve here and the types of code you can build that make it easier to run a system in the face of failure.
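As one example of the kind of code you can write at the communication layer, here is a small hypothetical Finagle filter that handles a cross-cutting concern (falling back to a default response when a downstream call fails) without the application code knowing about it. The filter name and fallback value are illustrative, not part of Finagle.

```scala
import com.twitter.finagle.{Service, SimpleFilter}
import com.twitter.util.Future
import scala.util.control.NonFatal

// A cross-cutting concern expressed as a filter: if the downstream call fails,
// return a fallback response instead of surfacing the error to the caller.
class FallbackFilter[Req, Rep](fallback: Rep) extends SimpleFilter[Req, Rep] {
  def apply(req: Req, service: Service[Req, Rep]): Future[Rep] =
    service(req).handle { case NonFatal(_) => fallback }
}

// Usage: wrap any Service without touching application logic, e.g.
//   val resilient = new FallbackFilter(emptyResponse).andThen(client)
```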
Oliver: I think the orchestration layer has absolutely changed everything, way more than Docker or containers for that matter. Everyone in the Java world has had containers forever: we have built jars and shipped them around, and that's effectively a container. The scheduling layer, things like Kubernetes and Mesos and Swarm and Nomad, is where there is now a cottage industry. That's changing everything.
Moving to the cloud is a big force behind it, but now that we have new high-level APIs for doing operations, we are seeing an explosion of operability features. Watching every company involved with Kubernetes in some way try to position themselves is really interesting. Everybody is trying to figure out what the new stack looks like in this flexible world where we don't have hosts that we maintain.
Tracks
Monday Nov 7
- Architectures You've Always Wondered About
  You know the names. Now learn lessons from their architectures.
- Distributed Systems War Stories
  “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport
- Containers Everywhere
  State of the art in container deployment, management, and scheduling.
- Art of Relevancy and Recommendations
  Lessons on the adoption of practical, real-world machine learning practices. AI & deep learning explored.
- Next Generation Web Standards, Frameworks, and Techniques
  JavaScript, HTML5, WASM, and more... innovations targeting the browser.
- Optimize You
  Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.
Tuesday Nov 8
- Next Generation Microservices
  What will microservices look like in 3 years? What if we could start over?
- Java: Are You Ready for This?
  Real-world lessons & prepping for JDK 9. Reactive code in Java today, performance/optimization, where Unsafe is heading, & the JVM compiler interface.
- Big Data Meets the Cloud
  Overviews and lessons learned from companies that have implemented their Big Data use cases in the Cloud.
- Evolving DevOps
  Lessons/stories on optimizing the deployment pipeline.
- Software Engineering Softskills
  Great engineers do more than code. Learn their secrets and level up.
- Modern CS in the Real World
  Applied, practical, & real-world dive into industry adoption of modern CS ideas.
Wednesday Nov 9
- Architecting for Failure
  Your system will fail. Take control before it takes you with it.
- Stream Processing
  Stream processing, near-real-time processing.
- Bare Metal Performance
  Native languages, kernel bypass, tooling - make the most of your hardware.
- Culture as a Differentiator
  The why and how for building successful engineering cultures.
- //TODO: Security <-- fix this
  Building security from the start. Stories, lessons, and innovations advancing the field of software security.
- UX Reimagined
  Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.