Efficient Incremental Processing with Netflix Maestro and Apache Iceberg

Incremental processing, an approach that processes only new or updated data in workflows, substantially reduces compute costs and execution time, which in turn means fewer potential failures and less need for manual intervention. However, enabling incremental processing on large-scale data pipelines and workflows presents significant challenges around scalability, ease of adoption, and user experience. In this talk, we will discuss how we are leveraging Apache Iceberg and Netflix Maestro to build an Incremental Processing Solution (IPS) that reduces compute costs and processing times while ensuring data accuracy and freshness. By combining Iceberg's metadata capabilities for snapshots and data files with Maestro's workflow orchestration, we can efficiently handle late-arriving data and backfills in a variety of scenarios beyond append-only mode.

We will share our experiences and insights into how this IPS has empowered our data engineering teams to build more reliable, efficient, and scalable data pipelines, unlocking new data processing patterns. Through real-world use cases, we will demonstrate how IPS has significantly improved resource utilization, reduced execution times, and simplified pipeline management, all while maintaining data integrity. Additionally, we will discuss the emerging incremental processing patterns that we have discovered, such as using captured change data for row-level filtering and leveraging range parameters in business logic, as well as the techniques, best practices, and lessons learned from our journey towards incremental processing at Netflix.


Speaker

Jun He

Staff Software Engineer @Netflix, Managing and Automating Large-Scale Data/ML Workflows, Previously @Airbnb and @Hulu

Jun He is a Staff Software Engineer in the Big Data Orchestration team at Netflix, where he leads the effort to build Netflix's workflow orchestrator, a.k.a. Maestro, to manage and automate large-scale Data/ML workflows. He has also contributed to multiple open source projects, such as Apache Iceberg. Prior to Netflix, he spent a few years building distributed systems and search infrastructure at Airbnb, where he was the main contributor to the design and implementation of the message bus and search pipeline.
 


Date

Tuesday Nov 19 / 03:55PM PST (50 minutes)

Location

Ballroom A

Topics

Data Pipelines, Apache Iceberg, Data Workflow, Data Accuracy

From the same track

Session Platform Engineering

Beyond Durability: Enhancing Database Resilience and Reducing the Entropy Using Write-Ahead Logging at Netflix

Tuesday Nov 19 / 10:35AM PST

In modern database systems, durability guarantees are crucial but often insufficient in scenarios involving extended system outages or data corruption.

Prudhviraj Karumanchi

Staff Software Engineer at Data Platform @Netflix, Building Large-Scale Distributed Storage Systems and Cloud Services, Previously @Oracle, @NetApp, and @EMC/Dell

Vidhya Arvind

Staff Software Engineer @Netflix Data Platform, Founding Member of Data Abstractions at Netflix, Previously @Box and @Verizon

Session Architecture

OpenSearch Cluster Topologies for Cost-Saving Autoscaling

Tuesday Nov 19 / 11:45AM PST

The indexing rates of many clusters follow some sort of fluctuating pattern, be it day/night, weekday/weekend, or any other duality in which the cluster shifts from being active to less active. In these cases, how does one scale the cluster?

Amitai Stern

Engineering Manager @Logz.io, Managing Observability Data Storage of Petabyte Scale, OpenSearch Leadership Committee Member and Contributor

Session

Stream and Batch Processing Convergence in Apache Flink

Tuesday Nov 19 / 02:45PM PST

The idea of executing streaming and batch jobs with one engine has been around for a while. People always say batch is a special case of streaming. Conceptually, it is.

Becket Qin

Principal Staff Software Engineer @LinkedIn

Session

Stream All the Things — Patterns of Effective Data Stream Processing

Tuesday Nov 19 / 01:35PM PST

Data streaming is a really difficult problem. Despite 10+ years of attempting to simplify it, teams building real-time data pipelines can spend up to 80% of their time optimizing it or fixing downstream output by handling bad data at the lake.

Adi Polak

Director, Advocacy and Developer Experience Engineering @Confluent

Session

Unconference: Shift-Left Data Architecture

Tuesday Nov 19 / 05:05PM PST