Abstract

Incremental Data Processing is an emerging style of data processing gathering attention recently that has the potential to deliver orders of magnitude speed and efficiency over traditional batch processing on data lakes and data warehouses. Back in 2016, Uber created the Apache Hudi project to move away from bulk data ingestion and ETL towards this incremental model. Apache Hudi has since been successfully deployed at very large data lakes to achieve similar goals.

In this talk, we aim to present an introduction to incremental data processing, contrasting it with the two prevalent processing models of today - batch and stream data processing. We will then discuss how Hudi is designed as a specialized storage layer with features like indexing, change streams, incremental queries, and log merges to power incremental data processing on top. We will share real-world experiences gained from operating Uber’s critical data warehousing pipelines on the new incremental model leveraging Apache Hudi, along with results from the OSS community.

The talk then delves deeper into some open, hard challenges around out-of-order data, event-time semantics, and late-arriving data, that are being solved in the Apache Hudi community. Specifically, we will discuss the missing pieces towards realizing a unified Kappa architecture using Apache Flink or Apache Spark SQL, where updates/deletes are cascading across the data lake while being able to seamlessly rebuild/backfill tables when processing logic changes. We will also discuss cost and efficiency tradeoffs to be made along the way, in balancing the need for faster incremental updates against great query performance for analytics and ML/data science.

Interview:

What's the focus of your work these days?

Sudha: I work as a distributed systems engineer @onehouse and am a PMC member of Apache Hudi. Hudi, already solves incremental data processing problems to make data lakes more real-time and efficient. With the strong support of stream processing frameworks such as Apache Spark, Apache Flink, and Kafka Connect and Hudi’s core design choices like record-level metadata and incremental querying capabilities users are able to consistently chain the ingested data into downstream data pipelines. Most recent features added in Hudi 0.13.0 around this include change data capture, which empowers cdc style queries with before and after images of changed data, and advanced indexing techniques like consistent hashing index. My job focuses on streamlining and improving the integration with the streaming engines to further improve incremental processing capabilities and to enable expanding CDC-style querying support to more querying engines for all Hudi tables.

Saketh: I work on the data transformation platform team @Uber, which is responsible for simplifying the process for data producers throughout the company to create and author ETL solutions for their business requirements. For the last year, we have been focused on introducing and leveraging Apache Hudi for derived datasets and ETLs to consume/process data in an incremental fashion as opposed to the traditional batch processing which is commonly used for data warehouse models. By doing so, we’ve seen significant improvements around compute/resource utilization, as well as data freshness in downstream datasets.

What's the motivation for your talk at QCon San Francisco 2023?

Incremental Data Processing is an emerging style of data processing gathering attention recently that has the potential to deliver orders of magnitude speed and efficiency over traditional batch processing on data lakes and data warehouses. The motivation for this talk is to present the work done around incremental data processing by leveraging Apache Hudi and how it compares to the traditional batch processing techniques for data warehousing pipelines.

How would you describe your main persona and target audience for this session?

This talk is aimed to be understood by intermediate to advanced Data Engineering Communities.

Is there anything specific that you'd like people to walk away with after watching your session?

We expect the audience to walk away with a deeper understanding of the incremental approach to data processing and how their orgs/companies could tap into speed and efficiency over traditional batch processing models by adapting incremental processing in their data lakes/warehouses.

Speaker

Saketh Chintapalli

Software Engineer @Uber, Bringing Incremental Data Processing to Data Warehouse Models

Saketh Chintapalli is a Software Engineer on the Data Transformation Platform team at Uber. His work primarily lies in data platform engineering, specifically in building reliable tooling and infrastructure for efficient data processing and pipelines, varying from batch to real-time workflows.

Speaker

Bhavani Sudha Saktheeswaran

Distributed Systems Engineer @Onehouse, Apache Hudi PMC, Ex-Moveworks, Ex-Uber, Ex-Linkedin

Bhavani Sudha Saktheeswaran is a distributed systems engineer at Onehouse and a PMC member of the Apache Hudi project. She is passionate about engaging with and driving the Hudi community. Previously she worked as a software engineer at Moveworks, where she led the data lake efforts using Hudi to power data analytics and other data platform services. At Uber, she was part of the Presto team and has been a key contributor to the early Presto integrations of Hudi. She has rich experience with real-time data systems and distributed data systems through her work at Uber’s marketplace team, Data teams, and Linkedin’s Voldemort team