Presentation: Freeing the Whale: How to Fail at Scale

Duration

4:10pm - 5:00pm

Key Takeaways

  • Ideas for separating concerns when developing applications
  • How instrumenting the communication processes correctly can prevent many problems
  • What you can do to make it easier to run a system in the face of failure

Abstract

Twitter was once known for its ever-present error page, the “Fail Whale.” Thousands of staff-years later, this iconic image has all but faded from memory. This transformation was only possible due to Twitter’s treatment of failure as something not just to be expected, but to be embraced.

In this talk, we discuss the technical insights that enabled Twitter to fail, safely and often. We will show how Finagle, the high-scale RPC library used at Twitter, Pinterest, SoundCloud, and other companies, provides a uniform model for handling failure at the communications layer. We’ll describe Finagle’s multi-layer mechanism for handling failure (and its pernicious cousin, latency), including latency-aware load balancing, failure accrual, deadline propagation, retry budgets, and negative acknowledgement. Finally, we’ll describe Finagle’s unified model for naming, inspired by the concepts of symbolic naming and dynamic linking in operating systems, which allows it to extend failure handling across service cluster and datacenter boundaries. We will end with a roadmap for improvements upon this model and mechanisms for applying it to non-Finagle applications.
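
For a concrete picture, here is a minimal sketch of what wiring up several of these mechanisms looks like on a Finagle HTTP client. The method names match recent Finagle releases, but the exact API varies by version, and the destination address and parameter values are purely illustrative:

    import com.twitter.finagle.Http
    import com.twitter.finagle.loadbalancer.Balancers
    import com.twitter.finagle.service.RetryBudget
    import com.twitter.util.Duration

    object FailureAwareClient {
      // Retry budget: a token bucket that lets retries consume at most
      // ~20% of live traffic (plus a small floor), so retries cannot
      // amplify a partial outage into a retry storm.
      val budget: RetryBudget = RetryBudget(
        ttl = Duration.fromSeconds(10),
        minRetriesPerSec = 10,
        percentCanRetry = 0.2)

      val client = Http.client
        .withRetryBudget(budget)
        // Latency-aware load balancing: pick two replicas at random and
        // send to the one with the lower peak-EWMA latency estimate.
        .withLoadBalancer(Balancers.p2cPeakEwma())
        // A request that outlives its deadline is a failure too.
        .withRequestTimeout(Duration.fromMilliseconds(300))
        .newService("inet!localhost:8080") // destination is illustrative
    }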

Interview

QCon: What is your role today?

Oliver: I'm a co-founder of Buoyant, which started about 18 months ago. I'm an ex-Twitter person: I left Twitter on a Friday and started Buoyant on a Monday to build open source projects and a proprietary platform on top of them.

Right now, almost all of my time is spent working on the open source piece, talking about the open source piece, and working on gRPC integrations and the various migrations we are doing. The open source project is called linkerd; it's a proxy that packages Finagle.

The goal is to bring Finagle's operational value outside of Scala and Java, making lots of its features available elsewhere. We shouldn't have to solve these problems over again in Node and Pasqual and Go and Python and every other framework, and the teams that operate these services shouldn't have to know how to operate all of them differently. By doing this at the proxy layer, we can give a bunch of tooling to operators.

QCon: Is that specifically observability?

Oliver: One of the big benefits is that we can monitor things in a uniform way, but we also do things like service discovery. Service discovery is a problem I struggled with when we moved to Mesos at Twitter, and from seeing other people on the mailing lists dealing with the same things, I understand how foreign it is and how you really end up binding your code to it if you don't do it carefully. We want to bundle a bunch of these things into a service mesh: you plug a service into it, and communication management becomes part of the infrastructure, not something you have to deal with in every application.

QCon: Is it something like a daemon that's collecting data and sending it off someplace? Or is Finagle instrumented to be able to expose what is happening inside the JVM?

Oliver: This is effectively an HTTP proxy. It's not only HTTP, we can speak other protocols, but you send your requests through this thing and it deals with service discovery and routing and monitoring and staging and testing and debugging, all of the things that you end up having to solve over and over again. Finagle has all this great code, and our task is to put it behind a small configuration that operators can understand without having to reason about Scala.
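
The routing piece deserves a concrete illustration. Finagle, and by extension linkerd, resolves logical service names through delegation tables ("dtabs"), the naming model mentioned in the abstract. The sketch below uses Finagle's Dtab API with hypothetical paths and addresses; the point is that rerouting is a table change, not a code change:

    import com.twitter.finagle.Dtab

    // A delegation table rewrites logical paths into concrete addresses.
    // Applications only ever address "/svc/users"; operators decide what
    // that name means at runtime. (Paths and addresses are hypothetical.)
    val production: Dtab = Dtab.read("/svc => /$/inet/10.0.0.1/8080")

    // Shifting one service's traffic (staging, failover, canarying) is a
    // dtab override rather than an application change:
    val usersOverride: Dtab = Dtab.read("/svc/users => /$/inet/10.0.1.1/8080")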

QCon: What will you be talking about at QCon SF?

Oliver: The goal of the talk is to really expose some of the lessons from Finagle and to motivate why Finagle does what it does. Take load balancing, for instance: I thought I understood it before I worked on Finagle in depth. I had written some naïve load balancers on various projects, but going into this and actually operating it, I started thinking about the things that are going to wake me up at night. How do we improve load balancing so that I don't have to wake up? How do I instrument retries and timeouts in a way that makes sense at scale?

In my talk I will go through the lessons that we learned. Part of this is the orchestration layer, where we assume that hardware is going to fail and that my process isn't going to stay up reliably. Once we accept failure and view our infrastructure differently, knowing it is not going to stay up all the time and is going to be fragile, what are the implications of that?

I will go into more depth on how we explored load balancing, which is something anyone who has had to face it will understand; there are some pretty surprising lessons in getting through that. The same goes for service discovery: I was on call for service discovery for years, and we made a lot of bad assumptions early on that took a while to fix.
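
To make the load-balancing lesson concrete, here is a small, self-contained sketch, not Finagle's implementation, of the "power of two choices" idea its default balancers build on: instead of round-robin or scanning every node, pick two replicas at random and send to the less loaded one, which stays O(1) while avoiding the herding that naïve schemes suffer:

    import java.util.concurrent.atomic.AtomicInteger
    import scala.util.Random

    // Illustrative "power of two choices" balancer (hypothetical code).
    final case class Node(address: String) {
      val load = new AtomicInteger(0) // requests currently in flight
    }

    final class P2CBalancer(nodes: IndexedSeq[Node]) {
      private val rng = new Random()

      // Pick two nodes at random (possibly the same; fine for a sketch)
      // and choose the one with fewer outstanding requests.
      def pick(): Node = {
        val a = nodes(rng.nextInt(nodes.size))
        val b = nodes(rng.nextInt(nodes.size))
        if (a.load.get <= b.load.get) a else b
      }

      // Bracket each call so `load` tracks in-flight work accurately.
      def issue[T](f: Node => T): T = {
        val n = pick()
        n.load.incrementAndGet()
        try f(n) finally n.load.decrementAndGet()
      }
    }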

QCon: What are your key takeaways for this talk?

Oliver: Learn to understand separation of concerns: we want our applications to be separate from a bunch of the things that we end up binding into them. I will show many infrastructure problems and shine a light on the ones we can solve at the communication layer. In my experience, we try to solve everything with deploy tools or orchestration tools, but there is a whole bunch of things we can solve if we instrument the communication processes correctly. I will give examples of the types of problems we can solve here and the types of code you can build that make it easier to run a system in the face of failure.
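
As one example of that kind of code, here is a hedged sketch of a retry budget, the same idea behind Finagle's RetryBudget though not its actual implementation: every normal request deposits a fraction of a token and every retry withdraws a whole one, so retries are capped at a percentage of live traffic and a failing backend cannot trigger a retry storm:

    import java.util.concurrent.atomic.AtomicLong

    // Illustrative retry budget (not Finagle's implementation).
    // With 10% deposits, retries are capped at ~10% of live traffic.
    final class RetryBudgetSketch(percentCanRetry: Double = 0.1) {
      private val Scale = 1000L // account in milli-tokens
      private val balance = new AtomicLong(0L)

      // Call on every normal request.
      def deposit(): Unit =
        balance.addAndGet((Scale * percentCanRetry).toLong)

      // Call before retrying; a false result means drop the retry.
      def tryWithdraw(): Boolean = {
        val after = balance.addAndGet(-Scale)
        if (after >= 0) true
        else { balance.addAndGet(Scale); false } // refund, deny the retry
      }
    }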

QCon: What do you feel is the most disruptive technology in IT right now?

Oliver: I think the orchestration layer has absolutely changed everything, way more than Docker or containers, for that matter. Everyone in the Java world has had containers forever: we built jars and shipped them around, and that's effectively a container. The scheduling layer, things like Kubernetes and Mesos and Swarm and Nomad, is what's new; there is now a cottage industry of these things. That's changing everything.

Moving to the cloud is a big force behind it, but now that we have new high-level APIs for doing operations, we are seeing an explosion of operability features. Watching every company that touches Kubernetes in some way try to position itself is really interesting. Everybody is trying to figure out what the new stack looks like in this flexible world where we don't have hosts that we maintain.

Speaker: Oliver Gould

CTO @Buoyant

Oliver is the CTO of Buoyant, where he leads open source development efforts. Prior to joining Buoyant, he was a staff infrastructure engineer at Twitter, where he was the technical lead of Observability, Traffic, and Configuration & Coordination teams. He is the creator of linkerd, a core contributor to Finagle, and a huge fan of dogs.
