At DoorDash, real-time events are an important data source for gaining insight into our business, but building a system capable of handling billions of real-time events is challenging. Events are generated by our services and user devices and need to be processed and transported to different destinations to help us make data-driven decisions on the platform.
Historically, DoorDash had a few data pipelines that pulled data from our legacy monolithic web application and ingested it into Snowflake. These pipelines incurred high data latency, significant cost, and operational overhead. Two years ago, we started the journey of creating a real-time event processing system named Iguazu to replace the legacy data pipelines and address the following event processing needs we anticipated as data volume grew with the business:
- Data ingestion from heterogeneous data sources, including mobile and web
- Easy access for streaming data consumers through higher-level abstractions
- High data quality with end-to-end schema enforcement
- Scalable, fault-tolerant, and easy to operate for a small team
To meet those objectives, we decided to create a new event processing system with a strategy of leveraging open source frameworks that could be customized and better integrated with DoorDash’s infrastructure. Stream processing platforms like Apache Kafka and Apache Flink have matured in the past few years and become easy to adopt. By leveraging those excellent building blocks and applying best engineering practices and a platform mindset, we built the system from the ground up and scaled it to process hundreds of billions of events per day with a 99.99% delivery rate and only minutes of latency to Snowflake and other OLAP data warehouses.
The talk will discuss the design of the system in detail, including its major components: event producing, event processing with Flink and streaming SQL, event format and schema validation, Snowflake integration, and the self-serve platform. It will also showcase how we have adapted and customized open source frameworks to address our needs, and how we solved challenges such as schema updates and service/resource orchestration in an infrastructure-as-code environment.
Can you tell me what the focus of your work is these days?
Currently I'm focused on creating and improving real-time event processing at DoorDash. What we have been doing is creating an architecture and frameworks that make event processing easy, scalable, and reliable for event publishers and consumers. We created a unified event publishing framework that makes event publishing easy for both internal microservices and mobile clients.
For consumers, we have created a stream processing framework built on top of Apache Flink that supports different abstractions for different kinds of consumers. That means you can either write Flink code with customized stream processing logic or use a declarative approach where you create your stream processing application with Flink SQL or YAML configurations.
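As a rough illustration of the declarative approach, a consumer in this style could be defined with a few lines of Flink SQL. The table and field names below are hypothetical examples, not DoorDash's actual schemas:

```sql
-- Hypothetical declarative job: count orders per store over
-- 5-minute tumbling windows, reading from a Kafka-backed source table.
INSERT INTO store_order_counts
SELECT
  store_id,
  TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
  COUNT(*) AS order_count
FROM order_events
GROUP BY store_id, TUMBLE(event_time, INTERVAL '5' MINUTE);
```

The appeal of this abstraction is that a consumer team only writes the query; the platform supplies the source/sink table definitions, deployment, and scaling.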
The platform also has built-in integrations with our data warehouse, Snowflake, and other data stores, and provides end-to-end schema enforcement. For the past two years, I have been pretty busy creating the basic building blocks that enable such a system. Nowadays, we are starting to build more product solutions like sessionization, which leverage the system to facilitate deeper user behavior analysis at scale. That's just one example of the interesting and challenging areas we are expanding into.
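To make the idea of end-to-end schema enforcement concrete, here is a minimal Python sketch of publish-time validation. The event type, fields, and `validate` helper are purely illustrative assumptions, not Iguazu's actual API; in a real system the schema would typically be a registered protobuf or Avro definition shared by producers and consumers.

```python
from dataclasses import dataclass, fields

# Hypothetical event schema; in practice this would be a schema
# registered centrally and enforced on both publish and consume paths.
@dataclass
class OrderEvent:
    order_id: str
    status: str

def validate(payload: dict, schema=OrderEvent):
    """Reject events whose fields do not match the registered schema
    before they are published downstream."""
    expected = {f.name for f in fields(schema)}
    if set(payload) != expected:
        raise ValueError(f"schema mismatch: {set(payload) ^ expected}")
    return schema(**payload)

# A conforming event passes; a malformed one raises before publishing.
event = validate({"order_id": "o-123", "status": "delivered"})
```

Enforcing the schema at the producer keeps bad records out of the pipeline entirely, rather than surfacing them as failures in the warehouse.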
What is the motivation behind your session?
At the end of 2019, after a few years of contributing to Netflix's Keystone pipeline, I left Netflix and joined DoorDash. I was given the opportunity to build a real-time data platform from scratch, and I really enjoyed the journey. After two years of work, I felt it was in good shape and had reached a very decent scale. Given both my Keystone experience and what I have been doing at DoorDash, I feel there is a consistent pattern to building a successful real-time event processing system, and I thought it would be a great idea to share that with the rest of the software community. That's my major motivation.
How would you describe the persona and level of your target audience?
I think my talk will fit any software engineer who is interested in learning the fundamentals of event processing and building a scalable, fault-tolerant system. Whether you work at a small company or a large one, and whether you have little or extensive experience with the subject, I will strive to balance my talk so everyone can learn something from it.
Is there anything specific that you would like these folks to walk away with after seeing your session?
As I mentioned before, there are some principles of building real-time data platforms that have passed the test of time. I want the audience to walk away with these patterns and maybe some concrete ideas of how to implement them. For example, how do you build appropriate abstractions for event publishing and event consuming? How do you decouple services to make systems scalable? And how do you isolate failures to make your system fault tolerant? In addition, I will cover the open source frameworks that we have used extensively in the system. I also hope the audience will walk away with ideas about how to evaluate open source frameworks, research their sweet spots, and extend them to your needs.
Software Engineer @DoorDash, previously Lead for real-time data infrastructure team @Netflix
Allen Wang is currently a tech lead on the data platform team at DoorDash. He is the architect of the Iguazu event processing system and a founding member of the real-time streaming platform team.
Prior to joining DoorDash, he was a lead on the real-time data infrastructure team at Netflix, where he created the Kafka infrastructure for Netflix’s Keystone data pipeline and was a main contributor in shaping the Kafka ecosystem at Netflix.
He is a contributor to Apache Kafka and NetflixOSS, and a two-time QCon speaker.