Track: Beyond Hadoop


For many years, Hadoop, the open-source combination of Map-Reduce libraries and the Hadoop Distributed File System (HDFS), has been the start and end of any conversation about processing large volumes of data. That has changed. Last year, we revealed that the Hadoop community was reducing its focus on the Map-Reduce paradigm in favor of a more flexible distributed system management paradigm known as YARN. YARN can support several data processing frameworks running side by side as applications (e.g. Map-Reduce, Storm, Tez), as well as many other types of frameworks, including those that support infrastructure beyond the purview of big data (e.g. web servers). In addition, several new file formats have emerged, and HDFS is starting to take a back seat to new file systems based on SSDs and memory. In short, the Big Data world we now live in has expanded beyond the borders of Hadoop -- it now includes several interactive-speed OLAP engines, multiple machine learning platforms, a couple of columnar file formats, and many alternatives for both streaming and graph processing. Increasingly, sentences that begin with Hadoop end with Spark. How are companies leveraging these new Big Data technologies? Come to this track to learn more.

Track Host:
Jeff Magnusson
Director of Data Platform at Stitch Fix
As Director of Data Platform at Stitch Fix, Jeff Magnusson leads the team responsible for building a robust and scalable algorithms platform that blends art and science by leveraging machines together with expert human judgment to generate innovative recommendations and insights. Prior to joining Stitch Fix, Jeff managed the Data Platform Architecture team at Netflix, where he helped design and open source much of the Hadoop-based data and analytics platform that Netflix uses for batch computation in the AWS cloud. Jeff holds a PhD from the University of Florida, specializing in database system implementation. @jeffmagnusson
10:35am - 11:25am

by Matei Zaharia
CTO and founder of Databricks

While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have grown quickly. Users soon needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in...
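For a sense of the unified model the talk describes, here is a minimal PySpark word count; the same engine also serves interactive queries, iterative algorithms, and stream processing. The master setting and file paths below are hypothetical illustrations, not taken from the talk.

```python
from pyspark import SparkContext

# hypothetical local deployment and input/output paths
sc = SparkContext("local[2]", "WordCount")

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())   # one record per word
            .map(lambda word: (word, 1))          # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))     # sum counts per word

counts.saveAsTextFile("word_counts")
sc.stop()
```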

11:50am - 12:40pm

by Richard Kasperowski
QCon Open Space Facilitator

Open Space

Join Jeff Magnusson, our speakers, and other attendees as we discuss more flexible distributed system management paradigms like YARN and new file systems based on SSDs and memory. The Big Data world we now live in has expanded beyond the borders of Hadoop -- it now includes several interactive-speed OLAP engines, multiple machine learning platforms, a couple of columnar file formats, and many alternatives for both streaming and graph processing. Increasingly, sentences that begin with Hadoop...

1:40pm - 2:30pm

by Eugene Mandel
Jawbone Data Science

At Jawbone, the Data Science team correlated step and workout data for hundreds of thousands of UP wearers with publicly available external datasets in order to understand how various factors affect physical activity.

In this talk we will highlight the challenges of combining internal and external datasets: knowing how the data was generated and its limitations, understanding the domain logic and, most importantly, addressing data errors and outliers.

We will also compare two...
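As a rough sketch of the kind of dataset blending and outlier handling the abstract mentions (using pandas; the file names, columns, and thresholds are hypothetical, not Jawbone's actual pipeline):

```python
import pandas as pd

# hypothetical internal dataset: daily step counts per user and city
steps = pd.read_csv("daily_steps.csv")
# hypothetical external dataset: daily weather by city
weather = pd.read_csv("weather.csv")

# blend internal and external data on shared keys
merged = steps.merge(weather, on=["city", "date"], how="inner")

# drop physically implausible readings before analysis
merged = merged[(merged["steps"] >= 0) & (merged["steps"] < 100000)]

# flag extreme outliers with a simple z-score test
z = (merged["steps"] - merged["steps"].mean()) / merged["steps"].std()
clean = merged[z.abs() < 3]
```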

2:55pm - 3:45pm

by Lin Qiao
Engineering Manager at LinkedIn

Traditionally, a Big Data system has been defined by the sheer volume of data it handles and the processing power behind it. Nowadays, it also means ingesting and integrating data at high velocity and with high quality. Recent innovation has largely focused on the first part of the problem.

In reality, the second part -- fast, reliable ingestion and integration -- often becomes a major pain point before developers ever get to the processing problems. With...
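One way to picture ingestion with inline quality checks is a consumer loop that validates records before loading them downstream. This is a generic sketch using the kafka-python client; the topic, broker, field names, and validation rules are all hypothetical, not LinkedIn's pipeline.

```python
import json
from kafka import KafkaConsumer  # kafka-python client

# hypothetical topic and broker address
consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")

def is_valid(record):
    # minimal quality gate: required fields present and well-typed
    return isinstance(record.get("user_id"), int) and "timestamp" in record

for message in consumer:
    try:
        record = json.loads(message.value)
    except ValueError:
        continue  # skip malformed payloads rather than poisoning downstream data
    if is_valid(record):
        print(record)  # stand-in for loading into a warehouse or stream processor
```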

4:10pm - 5:00pm

by Julien Le Dem
Tech Lead at Twitter, Pig Committer, Co-author of Apache Parquet

Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; it must also be possible to query it quickly and efficiently. For interoperability, row-based encodings (CSV, Thrift, Avro) combined with a general-purpose compression algorithm to reduce storage cost (GZip, LZO, Snappy) are very common, but they are not efficient to query.

As discussed extensively in the database literature, a columnar layout with statistics on optionally sorted data...
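To make the row-versus-column trade-off concrete, here is a small sketch using pyarrow (the table contents and file name are hypothetical): a columnar file compresses each column independently, and a query that touches one column can skip the rest, whereas a row-based encoding must scan whole records.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical event table
table = pa.table({
    "user_id": [1, 2, 3],
    "url": ["/a", "/b", "/a"],
    "latency_ms": [12.5, 48.0, 7.2],
})

# columnar layout with per-column Snappy compression
pq.write_table(table, "events.parquet", compression="snappy")

# a query touching one column reads only that column's pages
urls = pq.read_table("events.parquet", columns=["url"])
```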

5:25pm - 6:15pm

by Gian Merlino
Engineer at Metamarkets

Hybrid batch/real-time architectures (sometimes called “lambda architectures”) are a powerful pattern for building robust, production-quality, up-to-the-minute data analytics systems.

We’ll discuss why you may want to go hybrid, the sorts of challenges that can arise when building production data systems, and effective techniques for making them easier to deploy and manage. As an example, we’ll look at the data system at Metamarkets, which uses Hadoop, Storm, Kafka, and Druid to ingest...
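In outline, the hybrid pattern splits each query at a cutoff: history is answered from the batch layer, and the recent tail from the real-time layer. A minimal sketch, assuming hypothetical batch_view and realtime_view objects with a shared query interface (not Metamarkets' actual implementation):

```python
def query(batch_view, realtime_view, metric, start, end, batch_cutoff):
    """Serve history from the batch layer and the tail from the real-time layer.

    batch_cutoff is the latest timestamp fully covered by a batch run;
    everything after it is answered by the streaming layer.
    """
    results = {}
    if start < batch_cutoff:
        # fully recomputed, authoritative results
        results.update(batch_view.query(metric, start, min(end, batch_cutoff)))
    if end > batch_cutoff:
        # approximate, up-to-the-minute results awaiting the next batch run
        results.update(realtime_view.query(metric, max(start, batch_cutoff), end))
    return results
```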
