You are viewing content from a past/completed QCon

Presentation: High Resolution Performance Telemetry at Scale

Track: Bare Knuckle Performance

Location: Pacific DEKJ

Duration: 10:35am - 11:25am

Day of week: Monday

This presentation is now available to view on InfoQ.com

What You’ll Learn

  1. Hear how Twitter uses Rezolus to monitor its systems and detect performance problems.
  2. Learn how to use Rezolus to pinpoint bottlenecks even in smaller systems.

Abstract

One of the most critical aspects of running large distributed systems is understanding and quantifying performance. Without telemetry, it is challenging to diagnose performance issues, plan for capacity needs, and tune for maximum efficiency. Even when we have telemetry, the resolution is often insufficient to capture the anomalies and bursty behaviors that are typical in microservice architectures.

In this talk, we explore the issues of resolution in performance monitoring, cover sources of performance telemetry including hardware performance counters and eBPF, and learn some tricks for getting high resolution telemetry without high costs.

Question: 

What is the work you’re doing today?

Answer: 

I work on a team that's focused on infrastructure optimization and performance. As part of that, we quantify workloads, measure the performance of systems, and come up with tuning changes that either increase performance so we can get more out of our systems, or let us reduce the resources allocated to specific services to cut costs. I do a lot of benchmarking and measuring of runtime performance. Luckily, here at Twitter I get to do this in an open source capacity, and I manage two open source projects. One is rpc-perf, a benchmarking tool for in-memory caches. The other, my newer project, is a systems performance telemetry agent called Rezolus. This talk is centered on the telemetry side of that work and how we use Rezolus at Twitter.

Question: 

You use the word telemetry. Why that word?

Answer: 

Telemetry just fits really nicely. Telemetry is basically the remote transmission of metrics, which are just numbers. So it's a way for us to grab metrics from systems across our fleet. In other words, we want to record the runtime performance characteristics of our systems so we can go back and look at how they have been performing and how that has changed over time. We use it for runtime performance diagnostics and for judging tuning changes, so we actually know whether we're moving the gauge up or down.

Question: 

What's the goal of the talk?

Answer: 

We're going to talk about the challenges of measuring performance at scale, especially in distributed microservice architectures. A web request takes hundreds of milliseconds, but a lot of traditional telemetry collection happens at much coarser time scales, largely because of the cost of aggregating and processing those time series. We found that the traditional resolution we had was insufficient to capture the actual performance anomalies, to the point where it was interfering with my ability to do tuning work. That got me thinking about ways we could capture bursty behavior without necessarily paying whatever it would cost to increase the resolution.

Question: 

How did you come up with a sampling rate that worked?

Answer: 

Essentially it comes down to summarization: thinking about the questions you're trying to answer with telemetry, and whether you actually need that high resolution as your end product or only need it to get to your end product. The talk is about how we did that. Essentially it's about using percentile metrics. You can sample at a really high rate, do some metrics processing on the fly, and then export summary metrics instead of exporting a secondly, 10-times-per-second, or 100-times-per-second time series. As long as you keep the sampling interval short, you can still have a minutely time series as your end product and get a hint of the sub-minute behavior.
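
As a rough illustration of the idea in this answer, the sketch below samples a simulated gauge 10 times per second and collapses each minute into a handful of summary values. It is not taken from Rezolus: the function names (`percentile`, `summarize`, `simulated_window`), the nearest-rank percentile method, the chosen percentiles, and the synthetic CPU data are all assumptions made for the example.

```python
import math
import random


def percentile(sorted_samples, p):
    """Nearest-rank percentile of an ascending list, for p in (0, 100]."""
    rank = max(1, math.ceil(p * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def summarize(samples, percentiles=(50, 90, 99, 100)):
    """Collapse one export window of raw samples into a few summary values."""
    ordered = sorted(samples)
    summary = {"mean": sum(ordered) / len(ordered)}
    for p in percentiles:
        summary[f"p{p}"] = percentile(ordered, p)
    return summary


def simulated_window(rate_hz=10, seconds=60, burst_seconds=3, seed=0):
    """Hypothetical CPU utilization (%): around 10%, with one short burst near 95%."""
    rng = random.Random(seed)
    samples = [10 + rng.uniform(-2, 2) for _ in range(rate_hz * seconds)]
    start = 20 * rate_hz  # the burst begins 20 seconds into the minute
    for i in range(start, start + burst_seconds * rate_hz):
        samples[i] = 95 + rng.uniform(-2, 2)
    return samples


if __name__ == "__main__":
    window = simulated_window()
    # One minute of 10 Hz samples is exported as five numbers instead of 600.
    for name, value in summarize(window).items():
        print(f"{name}: {value:.1f}")
```

In this synthetic window the mean and the median stay low, so a minutely average would hide the three-second burst, while the upper percentiles expose it. A production agent would more likely keep a streaming histogram than sort raw samples; sorting is used here only to keep the sketch short.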

Question: 

What do you want someone to leave your talk with?

Answer: 

I would like people to leave the talk with a deeper appreciation of the complexities of measuring performance. There are behaviors that we're just not aware of due to blind spots in how telemetry is collected today. There have been a lot of efforts to enable people to diagnose that in a more hands-on fashion. I think one of the core ideas here is that you can do it fleet-wide without necessarily spending X million dollars more to store really high resolution samples. I'd like to inspire people that this is possible, and maybe they would like to check out Rezolus as an open source project and contribute to it.

Question: 

How do you answer someone who says we don't operate at Twitter scale? Is this going to have important takeaways for me at normal scale?

Answer: 

Even in smaller environments, systems performance is very important. It can matter even more at really small shops, where you don't have the budget to just throw money at the problem and you need to squeeze out the most performance. You also might not have a team that can develop a sophisticated observability system. I think they would be able to leverage something like Rezolus to help capture runtime performance issues without having to fund a performance team. The tool dovetails nicely into the rest of the open source observability ecosystem, with Prometheus and tools like that. I think it could give people of different skill levels the ability to do runtime performance diagnostics. When I used to work at a small shop, one of the common things was the CTO coming and saying, "the website feels slow now," and then having to go figure out why. It would have been really nice to have visibility into what was happening.

Speaker: Brian Martin

Software Developer @Twitter

Brian is a Staff SRE at Twitter. He works on infrastructure optimization and performance. His work tuning high performance services led him to discover a need for better performance telemetry. He is the author and maintainer of Rezolus, Twitter's high resolution systems performance telemetry agent.

