Incremental processing, an approach that processes only new or updated data in workflows, substantially reduces compute resource costs and execution time, leading to fewer potential failures and less need for manual intervention. However, enabling incremental processing on large-scale data pipelines and workflows presents significant challenges around scalability, ease of adoption, and user experience. In this talk, we will discuss how we are leveraging Apache Iceberg and Netflix Maestro to build an Incremental Processing Solution (IPS) that processes only new or changed data, reducing compute costs and processing times while ensuring data accuracy and freshness. By combining Iceberg's metadata capabilities for snapshots and data files with Maestro's workflow orchestration, we can efficiently handle late-arriving data and backfills in scenarios beyond append-only mode.
We will share our experiences and insights into how this IPS has empowered our data engineering teams to build more reliable, efficient, and scalable data pipelines, unlocking new data processing patterns. Through real-world use cases, we will demonstrate how IPS has significantly improved resource utilization, reduced execution times, and simplified pipeline management, all while maintaining data integrity. Additionally, we will discuss the emerging incremental processing patterns that we have discovered, such as using captured change data for row-level filtering and leveraging range parameters in business logic, as well as the techniques, best practices, and lessons learned from our journey towards incremental processing at Netflix.
Speaker
Jun He
Staff Software Engineer @Netflix, Managing and Automating Large-Scale Data/ML Workflows, Previously @Airbnb and @Hulu
Jun He is a Staff Software Engineer on the Big Data Orchestration team at Netflix, where he leads the effort to build Maestro, Netflix's workflow orchestrator, which manages and automates large-scale Data/ML workflows. He has also contributed to multiple open source projects, such as Apache Iceberg. Prior to Netflix, he spent a few years building distributed systems and search infrastructure at Airbnb, where he was the main contributor to the design and implementation of the message bus and search pipeline.