Presentation: Samza in LinkedIn: How LinkedIn Processes Billions of Events Every Day in Real-time
- Bayview A/B
We are enjoying something of a renaissance in data infrastructure. The old workhorses like MySQL and Oracle still exist, but they are complemented by new specialized distributed data systems like Cassandra, Redis, Druid, and Hadoop. At the same time, what we consider data has changed too: user activity, monitoring, logging, and other event data are becoming first-class citizens for data-driven companies. Taking full advantage of all these systems and the relevant data creates a massive data integration problem. This problem is important to solve, as these specialized systems are not very useful in the absence of a complete and reliable data flow.
One of the most powerful ways of solving this data integration problem is by restructuring your digital business logic around a centralized firehose of immutable events.
Once your data is captured in real time and available as real-time subscriptions, you can start to compute new data sets in real time off these feeds. This style of stream processing is seen as something of a niche today, but the model is extremely powerful and general. Much of what people compute offline in systems like Hadoop can also be done in real time as data arrives, using a stream-processing model. On top of these real-time data feeds, we can run continual processing and transformations to derive new data feeds (which are themselves logs) and publish these in the same way. We have open sourced our stream processing layer, Apache Samza, which does this.
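The idea of deriving one feed from another can be illustrated with a minimal sketch in plain Python. It does not use the Samza or Kafka APIs; the `Log` class, the `derive` function, the event fields (`type`, `user`), and the feed names are all hypothetical, chosen only to show a transformation reading an append-only log from an offset and publishing its results to a second log.

```python
from collections import Counter
from typing import Callable, List, Optional


class Log:
    """Append-only, in-memory event feed (hypothetical stand-in for a real log)."""

    def __init__(self) -> None:
        self._events: List[dict] = []

    def append(self, event: dict) -> int:
        # Store a copy so later mutations by the caller cannot alter the log.
        self._events.append(dict(event))
        return len(self._events) - 1  # offset of the newly appended event

    def read_from(self, offset: int) -> List[dict]:
        # Return copies of every event at or after the given offset.
        return [dict(e) for e in self._events[offset:]]


def derive(source: Log, sink: Log,
           transform: Callable[[dict], Optional[dict]],
           offset: int = 0) -> int:
    """Apply transform to each unread source event and publish results to sink.

    Returns the next offset to read from, so the caller can resume later.
    """
    events = source.read_from(offset)
    for event in events:
        result = transform(event)
        if result is not None:  # None means "emit nothing for this event"
            sink.append(result)
    return offset + len(events)


def make_view_counter() -> Callable[[dict], Optional[dict]]:
    """Build a stateful transform that keeps a running page-view count per user."""
    counts: Counter = Counter()

    def transform(event: dict) -> Optional[dict]:
        if event.get("type") != "page_view":
            return None
        counts[event["user"]] += 1
        return {"user": event["user"], "views": counts[event["user"]]}

    return transform


# Hypothetical activity feed and a feed derived from it.
activity = Log()
view_counts = Log()
for e in [
    {"type": "page_view", "user": "alice"},
    {"type": "click", "user": "bob"},
    {"type": "page_view", "user": "alice"},
]:
    activity.append(e)

next_offset = derive(activity, view_counts, make_view_counter())
```

Because the derived feed is itself a log, another transformation could read `view_counts` in exactly the same way. A production system would also need durable storage, partitioning, and fault-tolerant state, none of which this sketch attempts.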
In this talk, I will share our experience of successfully building LinkedIn’s data pipeline infrastructure around Kafka and Samza. These lessons are hugely relevant to anyone building a data-driven company.
Neha Narkhede
Covering innovative topics
Monday, 3 November
Architectures You've Always Wondered about
The newest and biggest Internet architectures
Real World Functional
Putting functional programming concepts to work in the real world.
The Future of Mobile
The future of mobile and performance improvements
Continuous Delivery: From Heroics to Becoming Invisible
Continuous Delivery philosophies, cultures, hiccups, and best practices.
Unleashing the Power of Streaming Data
This track explores a variety of use-cases, platforms, and techniques for processing and analyzing stream data from the companies deploying them at scale!
Sponsored Solutions Track I
Tuesday, 4 November
Engineering for Product Success
Architectures that make products more successful
Reactive Service Architecture
Reactive, Responsive, Fault Tolerant and More.
Modern CS In the Real World
How modern CS tackles problems in the real world.
Applied Machine Learning and Data Science
Understand your big big data!
Deploying at Scale
Containerizing Applications, Discovering Services, and Deploying to the Grid.
Sponsored Solutions Track II
Wednesday, 5 November
Beyond Hadoop
Emerging Big Data Frameworks and Technology
Scalable Microservice Architectures
This track addresses the ways companies with hundreds of fine-grained web-services (e.g. Netflix, LinkedIn) manage complexity!
Java at the Cutting Edge
The latest and greatest in the Java ecosystem
Engineering culture
Successes and failures in creating an engineering culture.
Next gen HTML5 and JS
How Web Components, the Future of CSS, and more are changing the web.
Sponsored Solutions Track III