QCon Munich 2020 has been canceled. See our current virtual and in-person events.

You are viewing content from a past/completed QCon -

Presentation: Monitoring and Tracing @Netflix Streaming Data Infrastructure

Track: Production Readiness: Building Resilient Systems

Location: Ballroom BC

Duration: 10:35am - 11:25am

Day of week: Wednesday

This presentation is now available to view on InfoQ.com

Watch video with transcript

What You’ll Learn

  1. Find out why observability is useful for a streaming data infrastructure.
  2. Hear how Netflix is monitoring and doing message tracing for their streaming data infrastructure which enables critical automations.
  3. Listen on the importance of building observability tools and how quickly they pay off the investment in developing them.


Netflix streaming data infrastructure transports trillions of events per day and supports hundreds of streaming processing jobs. The team behind it is small and there is no separate operations team. To efficiently manage and operate this huge infrastructure and reduce operational burden for everyone, we developed a set of tools that enables automated operations and mitigations. Our Kafka monitoring tools provide comprehensive signals and great insights into the health of our Kafka brokers and consumers, from which we derived ways to automate error handling that improves stability of brokers and stream processing jobs. For data streams that have high consistency requirements, instead of purely relying on aggregated counts that may be misleading, we trace individual events along their transporting path. Enabled by stream processing with minimal resources, tracing provides insight into end-to-end data loss, duplicates and latency at near real time and with high accuracy. These results helped us to further improve our service quality and validate design trade-offs.

The talk will give the design and implementation details of these dev/ops tools and highlight the critical roles they play in operating our data infrastructure. It will showcase how active and targeted tools development for operational use can quickly payoff with improved product quality and overall agility.


What is the work you're doing today?


My major responsibility at Netflix is to create a scalable and high quality streaming data infrastructure that handles trillions of messages per day. The first thing that comes to my mind is that we need to make sure the architecture is scalable for this high volume of streaming services. I have created a multi-cluster Kafka architecture that is fault tolerant and scalable as I have shown in the past QCon London conference. We spend a lot of time integrating this on our client side and create a higher level of abstraction that makes it easy for everybody to use. The second and equally important job that I'm doing daily is to create operational tools to increase the visibility and operability. We are a team of a little more than 10 people and we do everything from design, coding to deployment and operation. We need to make sure that we can operate this thing efficiently and spend most of our time not dealing with operations.


What are the goals of the talk?


There have been quite a few talks covering microservices from an operational perspective. But when I look at the past conferences there are very few talks about operating data infrastructure. In the past we have built years of experience of design and operating this data infrastructures. So I would like to fill the gap and share with the audience the tools and principles that have enabled us to efficiently operate this huge data infrastructure.


Will there be general lessons to learn if I am not a Kafka user?


Sure. Observability plays an important role in streaming data infrastructure. So regardless of the scale, observability plays a vital role there because it enables the critical automation and increases the transparency that earns trust from people that use this infrastructure. That's why we spend a lot of time to increase observability with targeted tools development. These tools help tremendously to improve efficiency and quality. Creating these tools is not trivial. In our case they are usually a multi quarter effort and requires smart thinking which I will cover extensively in my talk, but I find that once they are up and running the payoff is pretty quick. So I believe that investing in these tooling improves the overall agility of the team and the product for data infrastructure. That's the top point that I wish that everybody can take away.


 Are the tools available as open source?


Unfortunately no. We have a plan to open source them but obviously there are some high priority tasks that keep us very busy. But in general we want to open source them.


What do you want people to leave the talk with?


As I mentioned before, I want them to know that improving observability is important. People should spend effort and invest on this kind of tooling because it will quickly pay off. Also in my talk, there will be some details about Kafka and stream processing with Flink and how we utilize those tools to do message tracing and loss detection. That's something I want to share with the people, and hopefully they can take away some learning.

Speaker: Allen Wang

Architect & Engineer in Real Time Data Infrastructure Team @Netflix

Allen Wang is an architect and engineer in Real Time Data Infrastructure team at Netflix. He architected the multi-cluster Kafka infrastructure for Netflix in cloud environment and is heavily involved in developing the tools needed for operating the streaming data infrastructure. He is an open source contributor for Apache Kafka and NetflixOSS and a frequent speaker for Kafka.

Find Allen Wang at

Last Year's Tracks

  • Monday, 16 November

  • Distributed Systems for Developers

    Computer science in practice. An applied track that fuses together the human side of computer science with the technical choices that are made along the way

  • The Future of APIs

    Web-based API continue to evolve. The track provides the what, how, and why of future APIs, including GraphQL, Backend for Frontend, gRPC, & ReST

  • Resurgence of Functional Programming

    What was once a paradigm shift in how we thought of programming languages is now main stream in nearly all modern languages. Hear how software shops are infusing concepts like pure functions and immutablity into their architectures and design choices.

  • Social Responsibility: Implications of Building Modern Software

    Software has an ever increasing impact on individuals and society. Understanding these implications helps build software that works for all users

  • Non-Technical Skills for Technical Folks

    To be an effective engineer, requires more than great coding skills. Learn the subtle arts of the tech lead, including empathy, communication, and organization.

  • Clientside: From WASM to Browser Applications

    Dive into some of the technologies that can be leveraged to ultimately deliver a more impactful interaction between the user and client.

  • Tuesday, 17 November

  • Languages of Infra

    More than just Infrastructure as a Service, today we have libraries, languages, and platforms that help us define our infra. Languages of Infra explore languages and libraries being used today to build modern cloud native architectures.

  • Mechanical Sympathy: The Software/Hardware Divide

    Understanding the Hardware Makes You a Better Developer

  • Paths to Production: Deployment Pipelines as a Competitive Advantage

    Deployment pipelines allow us to push to production at ever increasing volume. Paths to production looks at how some of software's most well known shops continuous deliver code.

  • Java, The Platform

    Mobile, Micro, Modular: The platform continues to evolve and change. Discover how the platform continues to drive us forward.

  • Security for Engineers

    How to build secure, yet usable, systems from the engineer's perspective.

  • Modern Data Engineering

    The innovations necessary to build towards a fully automated decentralized data warehouse.

  • Wednesday, 18 November

  • Machine Learning for the Software Engineer

    AI and machine learning are more approachable than ever. Discover how ML, deep learning, and other modern approaches are being used in practice by Software Engineers.

  • Inclusion & Diversity in Tech

    The road map to an inclusive and diverse tech organization. *Diversity & Inclusion defined as the inclusion of all individuals in an within tech, regardless of gender, religion, ethnicity, race, age, sexual orientation, and physical or mental fitness.

  • Architectures You've Always Wondered About

    How do they do it? In QCon's marquee Architectures track, we learn what it takes to operate at large scale from well-known names in our industry. You will take away hard-earned architectural lessons on scalability, reliability, throughput, and performance.

  • Architecting for Confidence: Building Resilient Systems

    Your system will fail. Build systems with the confidence to know when they do and you won’t.

  • Remotely Productive: Remote Teams & Software

    More and more companies are moving to remote work. How do you build, work on, and lead teams remotely?

  • Operating Microservices

    Building and operating distributed systems is hard, and microservices are no different. Learn strategies for not just building a service but operating them at scale.