Incremental processing, an approach that processes only new or updated data in workflows, substantially reduces compute resource costs and execution time, leading to fewer potential failures and less need for manual intervention. However, enabling incremental processing on large-scale data pipelines and workflows presents significant challenges around scalability, ease of adoption, and user experience. In this talk, we will discuss how we are leveraging Apache Iceberg and Netflix Maestro to build an Incremental Processing Solution (IPS) that enables incremental processing of only new or changed data, reducing compute costs and processing times while ensuring data accuracy and freshness. By combining Iceberg's metadata capabilities for snapshots and data files with Maestro's workflow orchestration, we can efficiently handle late-arriving data and backfills in a variety of scenarios beyond append-only mode.
We will share our experiences and insights into how this IPS has empowered our data engineering teams to build more reliable, efficient, and scalable data pipelines, unlocking new data processing patterns. Through real-world use cases, we will demonstrate how IPS has significantly improved resource utilization, reduced execution times, and simplified pipeline management, all while maintaining data integrity. Additionally, we will discuss the emerging incremental processing patterns that we have discovered, such as using captured change data for row-level filtering and leveraging range parameters in business logic, as well as the techniques, best practices, and lessons learned from our journey towards incremental processing at Netflix.
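The snapshot-watermark idea behind these patterns can be sketched in plain Python. This is an illustrative stand-in only, assuming an append-only snapshot log; the `Snapshot` class and `incremental_rows` helper are hypothetical and are not Iceberg or Maestro APIs:

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    """A simplified stand-in for a table snapshot in the metadata log."""
    snapshot_id: int
    rows: list = field(default_factory=list)  # rows appended by this snapshot

def incremental_rows(snapshots, last_processed_id):
    """Return only the rows added after the stored watermark snapshot.

    A full recompute would scan every snapshot; an incremental run
    touches only snapshots newer than `last_processed_id`.
    """
    new_rows = []
    for snap in snapshots:
        if snap.snapshot_id > last_processed_id:
            new_rows.extend(snap.rows)
    return new_rows

# Simulated snapshot history: a full run reads all six rows, while the
# incremental run after watermark 2 reads only the two rows in snapshot 3.
history = [
    Snapshot(1, ["a", "b", "c"]),
    Snapshot(2, ["d"]),
    Snapshot(3, ["e", "f"]),
]
print(incremental_rows(history, last_processed_id=2))  # ['e', 'f']
```

In real Iceberg deployments the same range idea is exposed through the Spark reader's `start-snapshot-id` and `end-snapshot-id` read options; the orchestrator's job is to persist the watermark between runs, which is the role Maestro plays in the IPS described above.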
Interview:
What is the focus of your work?
I am the tech lead of the Big Data Orchestration team at Netflix. Our team builds multiple workflow and job orchestration services, such as Maestro. My work focuses on designing and building the Netflix workflow orchestrator, a robust and scalable platform that provides workflow as a service. It is widely used by thousands of Netflix internal users. With Netflix's scale, one of my primary responsibilities is to develop Maestro to support a wide variety of use cases while being able to scale up and out to automate hundreds of thousands of data and ML pipelines. Additionally, I work on integrating Maestro with other systems to offer new features for data practitioners and meet evolving business needs, such as efficient incremental processing support. More recently, I have been exploring how to expand workflow orchestration into the AI domain, for example, by supporting AI agentic workflows.
What’s the motivation for your talk?
Over the years, we have learned that incremental processing significantly improves resource utilization, reduces execution times, and simplifies pipeline management, all while maintaining data integrity. We have also identified several emerging incremental processing patterns, such as using captured change data for row-level filtering and leveraging range parameters in business logic. We want to share these insights, along with the techniques, best practices, and lessons learned from our journey towards incremental processing at Netflix.
Who is your talk for?
The audience is expected to have a basic understanding of data processing, data pipelines, and open table formats (e.g., Iceberg).
What do you want someone to walk away with from your presentation?
Someone will walk away with an understanding of how the Incremental Processing Solution (IPS) works and how it can empower data practitioners to build more reliable, efficient, and scalable data pipelines using new data processing patterns.
What do you think is the next big disruption in software?
I believe AI agent orchestration will be the next big disruption in the orchestration area.
Speaker
Jun He
Staff Software Engineer @Netflix, Managing and Automating Large-Scale Data/ML Workflows, Previously @Airbnb and @Hulu
Jun He is a Staff Software Engineer on the Big Data Orchestration team at Netflix, where he leads the effort to build Netflix's workflow orchestrator, a.k.a. Maestro, to manage and automate large-scale Data/ML workflows at Netflix. He has also contributed to multiple open source projects, such as Apache Iceberg. Prior to Netflix, he spent a few years building distributed systems and search infrastructure at Airbnb, where he was the main contributor to the design and build of the message bus and search pipeline.