
Presentation: Taming Large State: Lessons From Building Stream Processing

Track: Modern Data Architectures

Location: Ballroom BC

Time: 1:40pm - 2:30pm

Day of week: Tuesday



What You’ll Learn

  1. Find out what it takes to convert a batch job into a stream processing application.
  2. Learn some of the fundamental constructs used in a streaming application.
  3. Hear about the challenges Netflix faced in converting a batch job into a streaming application.


Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered, and written out in real time with an ease matching that of batch processing. However, the real challenge of matching the prowess of batch ETL remains in doing joins, maintaining state, and dynamically pausing or resting the data.
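To make the ETL point concrete, here is a minimal, engine-agnostic sketch (my own illustration, not Flink API code): the same extract/transform/filter/load stages as a batch job, but composed as generators so each record flows to the sink as soon as it arrives rather than after the whole batch completes.

```python
def extract(source):
    # Extract: pull raw records from an unbounded source (any iterable here).
    for raw in source:
        yield raw

def transform(records):
    # Transform: parse/normalize each record as it flows through.
    for rec in records:
        yield {"user": rec["user"], "event": rec["event"].lower()}

def keep_plays(records):
    # Filter: keep only the event type we care about.
    return (r for r in records if r["event"] == "play")

def run_pipeline(source, sink):
    # Load: write each surviving record out as soon as it is produced.
    for rec in keep_plays(transform(extract(source))):
        sink.append(rec)
```

In a real engine each stage would run as a parallel operator over a partitioned stream; the composition idea is the same.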

At Netflix, micro-services serve and record many different kinds of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when the company wants to combine the events coming from one high-traffic micro-service with those from another. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations.

Historically, Netflix has done this joining of large-volume datasets in batch. Recently, the company asked: if the data is being generated in real time, why can't it be processed downstream in real time? Why wait a full day for information from an event that was generated a few minutes ago?

This talk describes how we solved a complex join of two high-volume event streams at Netflix using Flink. You'll learn about:

  • Managing out-of-order events and processing late-arriving data
  • Exploring keyed state for maintaining large state
  • Fault tolerance in a stateful application
  • Strategies for failure recovery
  • Schema evolution in a stateful real-time application
  • Data validation: batch vs. streaming
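The first two bullets can be sketched with a toy join buffer (a hedged illustration of the general technique, not the actual Netflix/Flink implementation; the class name and lateness constant are invented): each side of the join is buffered in keyed state, matches are emitted as the other side arrives, and a watermark plus an allowed-lateness bound decides when buffered events are evicted and when arriving events count as too late.

```python
from collections import defaultdict

ALLOWED_LATENESS = 10  # illustrative: seconds of event-time skew we tolerate

class KeyedJoinBuffer:
    """Buffers partial events per key until the matching half arrives,
    evicting entries once the watermark passes their timestamp plus lateness."""

    def __init__(self):
        self.left = defaultdict(list)    # key -> buffered (ts, payload) from left
        self.right = defaultdict(list)   # key -> buffered (ts, payload) from right
        self.watermark = float("-inf")

    def on_left(self, key, ts, payload):
        if ts <= self.watermark - ALLOWED_LATENESS:
            return []                    # too late: drop (or divert to a side output)
        joined = [(payload, r) for (_rts, r) in self.right[key]]
        self.left[key].append((ts, payload))
        return joined

    def on_right(self, key, ts, payload):
        if ts <= self.watermark - ALLOWED_LATENESS:
            return []
        joined = [(l, payload) for (_lts, l) in self.left[key]]
        self.right[key].append((ts, payload))
        return joined

    def advance_watermark(self, wm):
        # Watermarks only move forward; evict state older than the horizon.
        self.watermark = max(self.watermark, wm)
        horizon = self.watermark - ALLOWED_LATENESS
        for buf in (self.left, self.right):
            for key in list(buf):
                buf[key] = [(ts, p) for (ts, p) in buf[key] if ts > horizon]
                if not buf[key]:
                    del buf[key]
```

In Flink the buffers would be keyed `MapState`/`ListState` and the eviction would be driven by event-time timers, but the trade-off is the same: allowed lateness bounds both how much state you hold and how much late data you can still join.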

Tell me a bit about the team that you're all on and the problems you're solving in Netflix.


We are part of Netflix Data Science and Engineering; specifically, our team supports the creation and maintenance of data products used for personalization. We process data at huge scale, making it available for training machine learning models. We've been experimenting with creating low-latency datasets to increase the frequency at which the models can be trained and to provide faster feedback to the models.


Tell me a bit about the talk that you're going to do.


The core problem we're trying to address is that, while stream processing has exploded over the past three or four years, it has typically been associated with low-latency event processing. It's not as commonly used for the stateful aggregation of data that batch is typically used for. That's the core data problem we're trying to address: it is possible to do stateful processing and produce low-latency data that is usually produced through batch. Doing it in streaming is what we're trying to solve.
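As a toy illustration of that contrast (my own sketch, not code from the talk): a batch job computes an aggregate once over a complete dataset, while a streaming job maintains the same aggregate as keyed state and updates it one event at a time, so results are available with low latency.

```python
from collections import defaultdict

def batch_counts(events):
    # Batch style: wait until the full dataset is available, aggregate once.
    counts = defaultdict(int)
    for user, _payload in events:
        counts[user] += 1
    return dict(counts)

class StreamingCounts:
    """Streaming style: the same aggregate kept as keyed state
    and updated per event, emitting updated results immediately."""

    def __init__(self):
        self.state = defaultdict(int)   # keyed state: user -> running count

    def on_event(self, user, _payload):
        self.state[user] += 1
        return user, self.state[user]   # emit the updated aggregate downstream
```

Run over the same events, both produce identical final counts; the streaming version just produces them continuously instead of at the end of the day.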


The first sentence of your abstract mentions Apache Flink. Is this a Flink talk or is a data architecture talk?


I would say it's largely a data architecture talk, because a lot of the constructs and challenges we faced were a result of taking a really large batch dataset that required holding a lot of state and doing it in streaming, with all the problems that come with that: fault tolerance, data recovery, data reprocessing. It's about how you think about processing this in streaming. However, we solved it in Flink, so some of the things we've used are unique; the features we leveraged to solve these problems are features of Flink. But I'm sure other engines also have similar features.
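For flavor, fault tolerance in stateful stream processing generally follows the pattern below: periodically snapshot the operator state together with a source offset, and on failure restore the last snapshot and replay the source from that offset. This is a deliberately simplified single-operator sketch (Flink's actual mechanism, distributed checkpointing with barriers, is considerably more involved):

```python
import copy

class CheckpointingOperator:
    """Toy stateful operator that snapshots its keyed state and, on failure,
    restores the last snapshot and resumes from the recorded source offset."""

    def __init__(self):
        self.state = {}          # keyed state: key -> count
        self.snapshots = []      # list of (offset, deep copy of state)

    def process(self, offset, key):
        self.state[key] = self.state.get(key, 0) + 1

    def checkpoint(self, offset):
        # Snapshot state consistently with the source position.
        self.snapshots.append((offset, copy.deepcopy(self.state)))

    def recover(self):
        # Restore the latest snapshot; caller replays the source from here.
        offset, state = self.snapshots[-1]
        self.state = copy.deepcopy(state)
        return offset + 1
```

Because the state and the offset are snapshotted together, replaying from the returned offset reproduces exactly the state a failure-free run would have had.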


Are you talking to a data engineer or are you talking to an architect?


This talk would appeal to a wide variety of audiences. As an architect, it would give you good insight into how to think about creating a system that can handle this level of scale. As a data engineer, it will give you enough points to consider what kinds of jobs are good candidates for streaming, because not everything needs that kind of investment. One thing to note is that we will cover a lot of technical depth, so prior knowledge of what it takes to build large-scale distributed systems would be good to have.


What do you want someone to leave the talk with?


At a high level, what it entails to convert a large batch job into a low-latency stream processing job, and the different constructs involved in creating a streaming application. We will also talk about the challenges we encountered along the way and how we went about solving them.


If I'm dealing with batch ETL jobs at moderate scale, is that enough background, or should I be pretty familiar with a streaming tool like Flink?


Some basic knowledge of how stream processing engines interact with queues like Kafka, and some basic terminology: what it means to do data processing, what it means to handle duplicates. Knowing that terminology and those constructs is enough. I don't think anyone needs to know the specific features or constructs of Flink or Spark, just a broader knowledge of what stream data processing is.
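Handling duplicates is one of those constructs worth seeing concretely: replay after a recovery means the same event can arrive twice, so a stateful operator remembers which event IDs it has already processed. A minimal sketch (in Flink this set would live in keyed state; the class here is illustrative):

```python
class Deduplicator:
    """Keyed-state dedup: remember event IDs already seen so that replayed
    or duplicate events (e.g. after a recovery) are emitted only once."""

    def __init__(self):
        self.seen = set()        # would be keyed ValueState/MapState in Flink

    def on_event(self, event_id, payload):
        if event_id in self.seen:
            return None          # duplicate: drop
        self.seen.add(event_id)
        return payload           # first occurrence: emit downstream
```

In production this state also has to be bounded, typically with a time-to-live, which is one reason dedup at scale is harder than the sketch suggests.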

Speaker: Sonali Sharma

Data Engineering and Analytics @Netflix

Sonali Sharma is a data engineer on the data personalization team at Netflix, which, among other things, delivers recommendations made for each user. The team is responsible for the data that goes into training and scoring the various machine learning models that power the Netflix home page. They have been working on moving some of the company's core datasets from being processed in a once-a-day batch ETL to being processed in near real time using Apache Flink. A UC Berkeley graduate, Sonali has worked on a variety of problems involving big data. Previously, she worked on the mail monetization and data insights engineering team at Yahoo, where she focused on building data-driven products for large-scale unstructured data extraction, recommendation systems, and audience insights for targeting, using technologies like Spark, the Hadoop ecosystem (Pig, Hive, MapReduce), Solr, Druid, and Elasticsearch.


Speaker: Shriya Arora

Senior Software Engineer @Netflix

Shriya works on the Personalization Data Engineering team at Netflix. The team is responsible for all the data products used for training and scoring the various machine learning models that power the Netflix homepage. On the team, she has been focusing on moving some of the core datasets from being processed in a once-a-day batch ETL to being processed in near real time, for both technical and business wins. Before Netflix, she was at Walmart Labs, where she helped build and architect the new generation of item setup, moving from batch processing to real-time item processing.

