Presentation: Fundamentals of Stream Processing with Apache Beam

Duration: 10:35am - 11:25am


Key Takeaways

  • Understand the distinction between event time versus processing time in stream processing.
  • See how the what/where/when/how model clarifies and simplifies the otherwise complex task of building robust stream processing pipelines.
  • Learn about the Apache Beam (incubating) project and how it makes this model concrete and portable.

Abstract

Apache Beam (unified Batch and strEAM processing!) is a new Apache incubator project. Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.

Come learn the fundamentals of out-of-order stream processing, and how Beam’s powerful tools for reasoning about time greatly simplify this complex task.

Beam provides a model that allows developers to focus on the four important questions that must be answered by any stream processing pipeline:

  • What results are being calculated?
  • Where in event time are they calculated?
  • When in processing time are they materialized?
  • How do refinements of results relate?

By cleanly separating these questions from runtime characteristics, Beam programs become portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Apache Flink, Apache Spark, et al.).

Interview

Question: 
QCon: What is your role today?
Answer: 

Frances: I am the tech lead of the Google Cloud Dataflow SDK Team. Cloud Dataflow is based on Google’s internal data processing technologies that evolved out of MapReduce.  We’ve also taken many of those ideas and helped found the Apache Beam (incubating) project. 

Tyler: I am in charge of our internal data processing pipeline systems for streaming, and now also the teams working on internal batch processing in Seattle. But my focus has traditionally been streaming.

Question: 
QCon: Can you give me an example of some of the core concepts for internal data processing?
Answer: 

Frances: I think these concepts are easiest to understand using a good example. Let’s say we are working for a mobile gaming company that wants to analyze logs about its mobile games. We’ve just launched a game where you are doing some sort of frantic task on your mobile phone to earn points for a team --  crushing candy, killing zombies, popping bubbles, whatever. As you score points, these scores are sent back to our system for analysis. 

We just want to figure out what the top-scoring team is. But since this game works on mobile phones, we have to deal with things like flaky networks and phones that are offline. If Tyler is playing this game on a flight in airplane mode, his scores aren’t sent to our system until his flight lands, coming in 8 hours behind the rest of his team.

This is exactly why stream processing and real-time computations get hard -- distributed systems are... distributed. Data arrives out of order, and that complicates things.

Tyler: The basic thing revolves around how you think about stream processing, how you think about dealing with unbounded data. We have a cute “what/where/when/how” idiom for thinking about it that breaks it up into what transformations you are doing, where in event time you are windowing, when in processing time you materialize your outputs (which includes reasoning about input completeness), and how your outputs accumulate over time. It goes into those various concepts, tries to explain why they are relevant, and where they came from. We also show in code how this all maps out.
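To make that concrete, here is a minimal sketch of how those four questions map onto pipeline code in the Beam Java SDK. It assumes a hypothetical unbounded input collection, gameEvents, of (team, points) pairs from the mobile-gaming example; the specific transforms are just one way of filling in the four answers.

    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // gameEvents: an unbounded PCollection<KV<String, Integer>> of (team, points)
    // pairs read from some streaming source (assumed to exist for this sketch).
    PCollection<KV<String, Integer>> teamScores = gameEvents
        // Where in event time: group scores into fixed one-hour windows based on
        // when each score actually happened, not when it arrived.
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1)))
            // When in processing time: emit a result once the watermark indicates
            // the hour is complete.
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            // How refinements relate: with a single firing per window, each pane
            // stands alone, so discard rather than accumulate.
            .discardingFiredPanes())
        // What is being computed: a per-team sum of points.
        .apply(Sum.integersPerKey());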

Question: 
QCon: What type of recommendations are you going to talk about?
Answer: 

Frances: It comes down to having a model that gives you the right abstractions for reasoning about these concepts. The key is the distinction between event time and processing time when you are dealing with unbounded, real-time systems. Are you trying to do things relative to when the event happened? Or relative to when you are able to process the event?

There are use cases for both. We give you a very simple framework that lets you specify which of those you need, and then it becomes very easy to express these sorts of complex use cases. For example: I want to process data in hourly buckets based on when it occurred, I want updates when late data comes in at this rate, and I want speculative answers at that rate.
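As a rough illustration, that kind of specification might look something like the following Beam windowing strategy (a sketch only; the one-minute early-firing interval, per-record late firings, and one-day allowed lateness are example values, not anything prescribed here):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    // Hourly buckets in event time, with speculative early results and
    // updates as late data arrives.
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // Speculative answers: fire an early, partial result roughly once a
            // minute while the window is still filling up.
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            // Updates for late data: fire again for each late record that shows up.
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Keep accepting late data for up to a day before dropping it.
        .withAllowedLateness(Duration.standardDays(1))
        // Each new firing refines the previous one, so accumulate across panes.
        .accumulatingFiredPanes();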

Question: 
QCon: Who should come to your talk?
Answer: 

Tyler: Our talk is relevant to anybody that is interested in stream processing. My general experience has been that stream processing at first blush is more complex than a lot of people realize. They often don’t quite know what they are getting into when they start doing stream processing, and that can make it difficult. There are a lot of subtleties that you need to understand. On the flipside, it’s really quite tractable once you understand the core concepts and how to think about them. But if you don’t learn the basics first, you’re gonna have a harder time. 

Core developers find it very interesting because they just want to build pipelines and they want to understand how to do it right. Architects find it interesting because they want to know which sorts of technologies they should be choosing. They want to know what kinds of questions they need to be asking when they are designing larger scale systems. 

Question: 
QCon: Do you go into what you can do using Beam versus what you might do with Spark and Flink and different things like that? How does Beam come into this picture?
Answer: 

Frances: Beam is exactly the set of abstractions that we are teaching you. Beam is at its core a programming model revolving around this “what, where, when, how” framework. It’s an intuitive, easy way to reason about complex use cases.

The other key idea behind Beam is that once you invest in learning the set of abstractions and write your programs against these SDKs, you want to be able to run them in different places. Beam pipelines can run on Cloud Dataflow, Apache Spark, and Apache Flink. Support for Apache Gearpump and Apache Apex is coming too.
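As a sketch of what that portability looks like in practice, the runner is typically chosen through pipeline options at launch time rather than in the pipeline logic itself. The class name below is hypothetical, and it assumes the chosen runner’s dependency is available on the classpath:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class TeamScoresMain {
      public static void main(String[] args) {
        // Pick the runner on the command line, e.g.
        //   --runner=DirectRunner     (local testing)
        //   --runner=FlinkRunner      (Apache Flink)
        //   --runner=SparkRunner      (Apache Spark)
        //   --runner=DataflowRunner   (Google Cloud Dataflow)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        // ... apply the same transforms (windowing, triggering, Sum, etc.)
        // regardless of which runner will execute them ...

        pipeline.run();
      }
    }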

Tyler: We will talk about the current state of support from the various runners and how that is evolving. We’ll also talk about the broader ecosystem of stream processors out there and what things look like.

Question: 
QCon: What do you feel is the most disruptive tech in IT right now?
Answer: 

Tyler: I think cloud computing is probably the most disruptive thing. It’s a huge shift for companies to move from writing all of their own infrastructure and having to manage all of that, to letting experts manage the mundane parts, making their lives easier and letting them focus on the applications they are building. To me, it is just a further extension of things we’ve had for years with compilers and the like: building one more layer of abstraction that lets you do more with less.

Frances: For me it’s stream processing. I come from a long background of batch processing, and what Tyler and others have taught me over the years is that stream processing just subsumes it all. People used to think that stream processing had to be lossy, had to be not quite correct, and that you needed your batch pipelines and your Lambda architecture to actually make stuff work for real. That’s not true. Batch processing just falls out as a subset of stream processing. Anything you think of as a nightly batch job isn’t really one; it is a continuous streaming job that is windowed daily.

I think we can really get to a good place if we merge these two into a unified model and then let the system set all the knobs for you. It could be that the best way to run one of these jobs is on a continuous system, or maybe it’s to spin up a cluster, run it for an hour, and then tear it down for the other 23, like the traditional batch cron job. But the important thing is to separate what users are reasoning about from the execution details. You don’t want to be configuring clusters or choosing between batch and streaming. Let the system make that choice and do it smartly, while you focus on your data and logic.

Speaker: Frances Perry

Engineer @ Google & Founder/Committer on Apache Beam

Frances Perry is a software engineer who likes to make big data processing easy, intuitive, and efficient. After many years working on Google’s internal data processing stack, she joined the Cloud Dataflow team to make this technology available to external cloud customers. She led the early work on Dataflow’s unified batch/streaming programming model and is now a committer on Apache Beam (incubating).


Speaker: Tyler Akidau

Engineer @ Google & Founder/Committer on Apache Beam

Tyler Akidau is a Staff Software Engineer at Google. The current tech lead for internal streaming data processing systems (e.g. MillWheel), he’s spent six years working on massive-scale streaming data processing systems. He passionately believes in streaming data processing as the more general model of large-scale computation.
