Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: Fundamentals of Stream Processing with Apache Beam
Level:
- Intermediate
Persona:
- Architect
- Data Scientist
Key Takeaways
- Understand the distinction between event time and processing time in stream processing.
- See how the what/where/when/how model clarifies and simplifies the otherwise complex task of building robust stream processing pipelines.
- Learn about the Apache Beam (incubating) project and how it makes this model concrete and portable.
Abstract
Apache Beam (unified Batch and strEAM processing!) is a new Apache incubator project. Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.
Come learn the fundamentals of out-of-order stream processing, and how Beam’s powerful tools for reasoning about time greatly simplify this complex task.
Beam provides a model that allows developers to focus on the four important questions that must be answered by any stream processing pipeline:
- What results are being calculated?
- Where in event time are they calculated?
- When in processing time are they materialized?
- How do refinements of results relate?
By cleanly separating these questions from runtime characteristics, Beam programs become portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Apache Flink, Apache Spark, et al.).
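To make the four questions concrete, here is a minimal sketch in Beam's Java SDK; it assumes an `input` PCollection of (team, score) pairs whose elements already carry event-time timestamps (e.g., assigned by the source):

    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Assumed: `input` is a PCollection<KV<String, Integer>> of
    // (team, score) pairs with event-time timestamps attached.
    PCollection<KV<String, Integer>> teamTotals = input
        // Where in event time: fixed one-hour windows over when each
        // score actually occurred.
        .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardHours(1)))
            // When in processing time: speculative results every minute,
            // an on-time result when the watermark says the window's
            // input is complete, and an update for each late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(1))
            // How refinements relate: each firing accumulates everything
            // seen so far, so later results refine earlier ones.
            .accumulatingFiredPanes())
        // What is being computed: the per-team sum of scores.
        .apply(Sum.integersPerKey());

Note how the questions stay independent: swapping accumulatingFiredPanes() for discardingFiredPanes() changes only the answer to the "how" question, leaving the other three untouched.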
Interview
Frances: I am the tech lead of the Google Cloud Dataflow SDK Team. Cloud Dataflow is based on Google’s internal data processing technologies that evolved out of MapReduce. We’ve also taken many of those ideas and helped found the Apache Beam (incubating) project.
Tyler: I am in charge of our internal data processing pipeline systems for streaming, and now also of the teams working on internal batch processing in Seattle. But my focus has traditionally been streaming.
Frances: I think these concepts are easiest to understand using a good example. Let’s say we are working for a mobile gaming company that wants to analyze logs about its mobile games. We’ve just launched a game where you are doing some sort of frantic task on your mobile phone to earn points for a team -- crushing candy, killing zombies, popping bubbles, whatever. As you score points, these scores are sent back to our system for analysis.
We just want to figure out what the top-scoring team is. But since this game runs on mobile phones, we have to deal with things like flaky networks and phones that are offline. If Tyler is playing this game on a flight in airplane mode, his scores aren't sent to our system until his flight lands, coming in 8 hours behind the rest of his team's.
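A hedged sketch of that scenario using Beam's TestStream testing utility (team name and point values invented for illustration): Tyler's score keeps its original event-time timestamp even though it is processed eight hours later, after the watermark has already moved past noon.

    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.testing.TestStream;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TimestampedValue;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    Instant noon = Instant.parse("2016-11-01T12:00:00Z");

    TestStream<KV<String, Integer>> scores =
        TestStream.create(KvCoder.of(StringUtf8Coder.of(), VarIntCoder.of()))
            // The rest of the team plays at noon; their scores arrive
            // promptly.
            .addElements(TimestampedValue.of(KV.of("zombies", 4), noon))
            // The watermark moves past noon: as far as the system can
            // tell, noon's data is complete.
            .advanceWatermarkTo(noon.plus(Duration.standardHours(1)))
            // Eight hours of processing time pass while Tyler is in the
            // air...
            .advanceProcessingTime(Duration.standardHours(8))
            // ...then his flight lands and his noon score finally shows
            // up: late data, still stamped with the event time at which
            // it occurred.
            .addElements(TimestampedValue.of(KV.of("zombies", 9), noon))
            .advanceWatermarkToInfinity();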
This is exactly why stream processing and real-time computation get hard: distributed systems are... distributed. Things arrive out of order, and that complicates everything.
Tyler: The talk revolves around how you think about stream processing -- how you think about dealing with unbounded data. We have a cute “what/where/when/how” idiom for thinking about it that breaks it up into what transformations you are doing, where in event time you are windowing, when in processing time you materialize your outputs (which includes reasoning about input completeness), and how your outputs accumulate over time. The talk goes into those concepts, explains why they are relevant and where they came from, and shows in code how it all maps out.
Frances: It comes down to having a model that gives you the right abstractions for reasoning about these concepts. The key is the distinction between event time and processing time when you are dealing with unbounded, real-time systems. Are you trying to do things relative to when the event happened? Or relative to when you are able to process the event?
There are use cases for both. We give you a very simple framework that lets you specify which of those you need, and then it becomes very easy to express complex use cases like: "I want to process data in hourly buckets based on when it occurred, I want updates at this rate when late data comes in, and I want speculative answers at that rate."
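In code, that choice between the two time domains is just a different windowing and triggering specification. A minimal sketch, reusing the assumed `input` collection from the earlier example:

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    // Relative to when the event happened: hourly buckets in event time.
    input.apply(Window.<KV<String, Integer>>into(
        FixedWindows.of(Duration.standardHours(1))));

    // Relative to when we can process it: one global window, emitting
    // whatever has arrived after each hour of processing time.
    input.apply(Window.<KV<String, Integer>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardHours(1))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());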
Tyler: Our talk is relevant to anybody that is interested in stream processing. My general experience has been that stream processing at first blush is more complex than a lot of people realize. They often don’t quite know what they are getting into when they start doing stream processing, and that can make it difficult. There are a lot of subtleties that you need to understand. On the flipside, it’s really quite tractable once you understand the core concepts and how to think about them. But if you don’t learn the basics first, you’re gonna have a harder time.
Core developers find it very interesting because they just want to build pipelines and they want to understand how to do it right. Architects find it interesting because they want to know which sorts of technologies they should be choosing. They want to know what kinds of questions they need to be asking when they are designing larger scale systems.
Frances: Beam is exactly the set of abstractions that we are teaching you. Beam is at its core a programming model revolving around this “what, where, when, how” framework. It’s an intuitive, easy way to reason about complex use cases.
The other key idea behind Beam is that once you invest in learning this set of abstractions and write your programs against these SDKs, you want to be able to run them in different places. Beam pipelines can run on Cloud Dataflow, Apache Spark, and Apache Flink, with Apache Gearpump and Apache Apex runners on the way.
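That portability shows up in code as nothing more than a pipeline option. A minimal sketch (the class name is hypothetical; the transforms would be the ones from the earlier examples):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class PortableScores {
      public static void main(String[] args) {
        // The runner is chosen on the command line, not in the code:
        //   --runner=DataflowRunner  (Google Cloud Dataflow)
        //   --runner=FlinkRunner     (Apache Flink)
        //   --runner=SparkRunner     (Apache Spark)
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);
        // ... apply the same windowing and summing transforms here ...
        p.run();
      }
    }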
Tyler: We will talk about the current state of support across the various runners and how that is evolving. We will also talk about the broader ecosystem of stream processors out there and what the landscape looks like.
Tyler: I think cloud computing is probably the most disruptive thing. It's a huge shift for companies to move from writing and managing all of their own infrastructure to letting experts manage the mundane parts, making their lives easier and freeing them to focus on the applications they are building. To me, it is just a further extension of what we've had for years with compilers and the like: building one more layer of abstraction that lets you do more with less.
Frances: For me it's stream processing. I come from a long background of batch processing, and what Tyler and others have taught me over the years is that stream processing subsumes it all. People used to think that stream processing had to be lossy, had to be not quite correct, and that you needed your batch pipelines and your Lambda Architecture to actually make stuff work for real. That's not true. Batch processing just falls out as a subset of stream processing. Anything you think of as a nightly batch job isn't really one; it is a continuous streaming job that is windowed daily. I think we can really get to a good place if we merge these two into a unified model and then let the system set all the knobs for you. It could be that the best way to run one of these jobs is on a continuous system, or maybe it's to spin up a cluster, run it for an hour, and then tear it down for the other 23, like the traditional batch cron job. But the important thing is to separate what users are reasoning about from the execution details. You don't want to be configuring clusters or choosing between batch and streaming. Let the system make that choice and do it smartly, while you focus on your data and logic.
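Her "windowed daily" observation maps directly onto the model: in this view, a nightly batch job reduces to a one-line windowing choice. A sketch, again assuming the `input` collection from the earlier examples:

    // A "nightly batch job" in the unified model: the same streaming
    // pipeline, windowed into one-day event-time windows; the default
    // trigger emits one result per window once its input is complete.
    input.apply(Window.<KV<String, Integer>>into(
        FixedWindows.of(Duration.standardDays(1))));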