Speaker: Allen Wang

Architect & Engineer in Real Time Data Infrastructure Team @Netflix

Allen Wang is an architect and engineer on the Real Time Data Infrastructure team at Netflix. He architected Netflix's multi-cluster Kafka infrastructure in the cloud and is heavily involved in developing the tools needed to operate the streaming data infrastructure. He is an open source contributor to Apache Kafka and NetflixOSS and a frequent speaker on Kafka.

SESSION + Live Q&A

Monitoring and Tracing @Netflix Streaming Data Infrastructure

Netflix's streaming data infrastructure transports trillions of events per day and supports hundreds of stream processing jobs. The team behind it is small, and there is no separate operations team. To manage and operate this huge infrastructure efficiently and reduce the operational burden for everyone, we developed a set of tools that enable automated operations and mitigations. Our Kafka monitoring tools provide comprehensive signals and insight into the health of our Kafka brokers and consumers, from which we have derived ways to automate error handling that improve the stability of brokers and stream processing jobs. For data streams that have high consistency requirements, instead of relying purely on aggregated counts, which may be misleading, we trace individual events along their transport path. Enabled by stream processing with minimal resources, tracing provides insight into end-to-end data loss, duplicates and latency in near real time and with high accuracy. These results have helped us further improve our service quality and validate design trade-offs.
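
To make the tracing idea concrete, here is a minimal sketch in Python, not the implementation presented in the talk. It assumes each traced event carries a unique ID and is recorded at two checkpoints on its path. The names `TraceRecord`, `summarize_traces`, "producer" and "consumer" are hypothetical and chosen only for illustration. It shows how per-event records, unlike aggregated counts, can separate loss from duplication and yield per-event latency.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    event_id: str      # hypothetical unique ID attached to each traced event
    stage: str         # hypothetical checkpoint name, e.g. "producer" or "consumer"
    timestamp_ms: int  # time the event was observed at that checkpoint


def summarize_traces(records, source="producer", sink="consumer"):
    """Compare per-event records at the source and sink checkpoints."""
    sent = {}              # event_id -> first time seen at the source
    received = Counter()   # event_id -> number of times seen at the sink
    first_arrival = {}     # event_id -> earliest time seen at the sink

    for r in records:
        if r.stage == source:
            sent.setdefault(r.event_id, r.timestamp_ms)
        elif r.stage == sink:
            received[r.event_id] += 1
            prev = first_arrival.get(r.event_id)
            if prev is None or r.timestamp_ms < prev:
                first_arrival[r.event_id] = r.timestamp_ms

    # Sent but never seen at the sink: lost.
    lost = sorted(e for e in sent if received[e] == 0)
    # Seen more than once at the sink: duplicated.
    duplicated = sorted(e for e in sent if received[e] > 1)
    # Latency from the source to the first arrival at the sink.
    latency_ms = {e: first_arrival[e] - sent[e] for e in sent if e in first_arrival}
    return {"lost": lost, "duplicated": duplicated, "latency_ms": latency_ms}


records = [
    TraceRecord("a", "producer", 100),
    TraceRecord("b", "producer", 110),
    TraceRecord("c", "producer", 120),
    TraceRecord("a", "consumer", 150),
    TraceRecord("b", "consumer", 170),
    TraceRecord("b", "consumer", 175),  # second delivery of "b"
]
result = summarize_traces(records)
```

In this sample, three events are sent and three are received, so an aggregated count would report nothing wrong, while the per-event view shows that "c" was lost and "b" was delivered twice.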

The talk will cover the design and implementation details of these dev/ops tools and highlight the critical roles they play in operating our data infrastructure. It will showcase how active, targeted tool development for operational use can quickly pay off in improved product quality and overall agility.

Location

Ballroom BC

Track

Production Readiness: Building Resilient Systems

Topics

Interview Available, Monitoring, Resiliency

Slides

Slides are not available
