Incremental Data Processing with Apache Hudi

Incremental data processing is an emerging style of data processing that has recently gained attention and has the potential to deliver orders-of-magnitude gains in speed and efficiency over traditional batch processing on data lakes and data warehouses. Back in 2016, Uber created the Apache Hudi project to move away from bulk data ingestion and ETL towards this incremental model. Apache Hudi has since been successfully deployed on very large data lakes to achieve similar goals.

In this talk, we present an introduction to incremental data processing, contrasting it with the two prevalent processing models of today: batch and stream processing. We will then discuss how Hudi is designed as a specialized storage layer with features like indexing, change streams, incremental queries, and log merges to power incremental data processing on top. We will share real-world experiences gained from operating Uber’s critical data warehousing pipelines on the new incremental model using Apache Hudi, along with results from the OSS community.
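The contrast between the batch and incremental models can be sketched in a few lines of plain Python. This is only an illustration and does not use Hudi's API: the `Record` type, the integer `commit_time` marker, and the helper names are all hypothetical stand-ins for the idea that a storage layer tracks when each record last changed, so a downstream job can pull only what changed since its last checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    fare_cents: int
    commit_time: int  # hypothetical marker: when this version was written


def transform(rec):
    """Derive one downstream row from one source record."""
    return {"trip_id": rec.key, "fare_dollars": rec.fare_cents / 100}


def batch_run(source):
    """Batch model: rescan and recompute every record on every run."""
    return {rec.key: transform(rec) for rec in source.values()}


def incremental_run(source, derived, checkpoint):
    """Incremental model: process only records committed after `checkpoint`.

    Upserts the changed rows into `derived` in place and returns
    (number of records processed, new checkpoint).
    """
    changed = [r for r in source.values() if r.commit_time > checkpoint]
    for rec in changed:
        derived[rec.key] = transform(rec)
    new_checkpoint = max((r.commit_time for r in changed), default=checkpoint)
    return len(changed), new_checkpoint


# Source table after a first commit (commit_time = 1).
source = {
    "t1": Record("t1", 1000, 1),
    "t2": Record("t2", 2500, 1),
    "t3": Record("t3", 700, 1),
}
derived = {}
processed, checkpoint = incremental_run(source, derived, checkpoint=0)

# A second commit updates one record and inserts another.
source["t2"] = Record("t2", 3000, 2)
source["t4"] = Record("t4", 500, 2)
processed, checkpoint = incremental_run(source, derived, checkpoint)

# Only the two changed records were touched, yet the derived table
# matches what a full batch recomputation would produce.
assert processed == 2
assert derived == batch_run(source)
```

The sketch covers only inserts and updates keyed by record; deletes, late-arriving data, and aggregations that need the previous value of a changed record are exactly the harder cases the talk goes on to discuss.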

The talk then delves deeper into some open, hard challenges around out-of-order data, event-time semantics, and late-arriving data that are being solved in the Apache Hudi community. Specifically, we will discuss the missing pieces towards realizing a unified Kappa architecture using Apache Flink or Apache Spark SQL, where updates and deletes cascade across the data lake while tables can still be seamlessly rebuilt or backfilled when processing logic changes. We will also discuss the cost and efficiency tradeoffs to be made along the way in balancing the need for faster incremental updates against great query performance for analytics and ML/data science.
 

What's the focus of your work these days?

Sudha: I work as a distributed systems engineer @Onehouse and am a PMC member of Apache Hudi. Hudi already solves incremental data processing problems to make data lakes more real-time and efficient. With the strong support of stream processing frameworks such as Apache Spark, Apache Flink, and Kafka Connect, and Hudi’s core design choices like record-level metadata and incremental querying capabilities, users are able to consistently chain the ingested data into downstream data pipelines. The most recent features added in Hudi 0.13.0 in this area include change data capture, which enables CDC-style queries with before and after images of changed data, and advanced indexing techniques like the consistent hashing index. My job focuses on streamlining and improving the integration with streaming engines to further improve incremental processing capabilities and to extend CDC-style querying support to more query engines for all Hudi tables.

Saketh: I work on the data transformation platform team @Uber, which is responsible for simplifying the process for data producers throughout the company to create and author ETL solutions for their business requirements. For the last year, we have been focused on introducing and leveraging Apache Hudi for derived datasets and ETLs to consume/process data in an incremental fashion as opposed to the traditional batch processing which is commonly used for data warehouse models. By doing so, we’ve seen significant improvements around compute/resource utilization, as well as data freshness in downstream datasets.

What's the motivation for your talk at QCon San Francisco 2023?

Incremental data processing is an emerging style of data processing that has recently gained attention and has the potential to deliver orders-of-magnitude gains in speed and efficiency over traditional batch processing on data lakes and data warehouses. The motivation for this talk is to present the work done on incremental data processing with Apache Hudi and how it compares to traditional batch processing techniques for data warehousing pipelines.

How would you describe your main persona and target audience for this session?

This talk is aimed at intermediate to advanced data engineers.

Is there anything specific that you'd like people to walk away with after watching your session?

We expect the audience to walk away with a deeper understanding of the incremental approach to data processing and how their organizations could gain speed and efficiency over traditional batch processing models by adopting incremental processing in their data lakes and warehouses.


Speaker

Saketh Chintapalli

Software Engineer @Uber, Bringing Incremental Data Processing to Data Warehouse Models

Saketh Chintapalli is a Software Engineer on the Data Transformation Platform team at Uber. His work primarily lies in data platform engineering, specifically in building reliable tooling and infrastructure for efficient data processing and pipelines, varying from batch to real-time workflows.

 


Speaker

Bhavani Sudha Saktheeswaran

Distributed Systems Engineer @Onehouse, Apache Hudi PMC, Ex-Moveworks, Ex-Uber, Ex-LinkedIn

Bhavani Sudha Saktheeswaran is a distributed systems engineer at Onehouse and a PMC member of the Apache Hudi project. She is passionate about engaging with and driving the Hudi community. Previously, she worked as a software engineer at Moveworks, where she led the data lake efforts using Hudi to power data analytics and other data platform services. At Uber, she was part of the Presto team and was a key contributor to the early Presto integrations of Hudi. She has rich experience with real-time and distributed data systems through her work on Uber’s marketplace and data teams and LinkedIn’s Voldemort team.


Date

Monday Oct 2 / 03:55PM PDT (50 minutes)

Location

Ballroom A

Topics

Data Lakes, Platform Engineering, Big Data, Incremental Processing, Apache Hudi, Data Engineering, Data Architectures, Batch Processing, Stream Processing

From the same track

Session Stream Processing

Streaming Databases: Embracing the Convergence of Stream Processing and Databases

Monday Oct 2 / 01:35PM PDT

Streaming databases have gained significant attention in recent years. From its name, it is evident that a streaming database combines the power of stream processing and databases.

Yingjun Wu

Founder and CEO @RisingWave Labs, Previously Engineer @AWS Redshift & Researcher @IBM Research Almaden

Session Graph Databases

LIquid: A Large-Scale Relational Graph Database

Monday Oct 2 / 10:35AM PDT

We describe LIquid, the graph database built to host LinkedIn.

Scott Meyer

Distinguished Software Engineer @LinkedIn, Creator of the Graph Database, LIquid, Metaweb/freebase Alum

Session Distributed Systems

Redesigning OLTP for a New Order of Magnitude

Monday Oct 2 / 02:45PM PDT

The world is becoming more transactional. From colocation and server rental to serverless and usage-based billing. From coal to clean energy and smart meters that arbitrage solar prices 1440 times a month instead of monthly. Not to mention FedNow or the tsunami of instant payments.

Joran Greef

Founder and CEO @TigerBeetle

Session Architecture

Sleeping at Scale - Delivering 10k Timers per Second per Node with Rust, Tokio, Kafka, and Scylla

Monday Oct 2 / 05:05PM PDT

As a part of OneSignal’s no-code Journeys system, we knew that we would need a way to store billions of timers.

Lily Mara

Engineering Manager @OneSignal, Author of "Refactoring to Rust"

Hunter Laine

Software Engineer @OneSignal

Session Data

PRQL: A Simple, Powerful, Pipelined SQL Replacement

Monday Oct 2 / 11:45AM PDT

Most databases use SQL as the interface to access relational data. Because of that, we associate SQL to be the language of relational algebra. But its affinity with the English language and unclear and inconsistent semantics leave a lot of space for improvements.

Aljaž Mur Eržen

Compiler Developer @EdgeDB & PRQL Maintainer