Abstract
Data streaming is a really difficult problem. Despite 10+ years of attempting to simplify it, teams building real-time data pipelines can spend up to 80% of their time optimizing it or fixing downstream output by handling bad data at the lake. All we want is a service that will be reliable, handle all kinds of data, connect with all kinds of systems, be easy to manage, and scale up and down as our systems change.
Oh, it should also have super low latency and result in good data. Is it too much to ask?
In this presentation, we’ll discuss the basic challenges of data streaming and introduce a few design and architecture patterns, such as DLQ, used to tackle these challenges.
We will then explore how to implement these patterns using Apache Flink and discuss the challenges that real-time AI applications bring to our infra. Difficult problems are difficult, and we offer no silver bullets. Still, we will share pragmatic solutions that have helped many organizations build fast, scalable, and manageable data streaming pipelines.
Speaker
Adi Polak
Director, Advocacy and Developer Experience Engineering @Confluent, Author of "Scaling Machine Learning with Spark" and "High Performance Spark 2nd Edition"
Adi is an experienced Software Engineer and people manager. She has worked with data and machine learning for operations and analytics for over a decade. As a data practitioner, she developed algorithms to solve real-world problems using machine learning techniques while leveraging expertise in distributed large-scale systems to build machine learning and data streaming pipelines. As a manager, Adi builds high-performance teams focused on trust, excellence, and ownership.
Adi has taught thousands of students how to scale machine learning systems and is the author of the successful book Scaling Machine Learning with Spark and High Performance Spark 2nd Edition.