Location:

Seacliff C/D

Day of week:

Wednesday

For many years now, Hadoop, the open-source combination of Map-Reduce libraries and the Hadoop Distributed File System (HDFS), has been the start and end of any conversation involving processing of large volumes of data. That has changed. Last year, we revealed that the Hadoop community was reducing its focus on the Map-Reduce paradigm in favor of a more flexible distributed system management paradigm known as YARN. YARN can support several data processing frameworks running side by side as applications (e.g. Map-Reduce, Storm, Tez) as well as many other types of frameworks, including those that support infrastructure beyond the purview of big data (e.g. web servers). In addition, several new file formats have emerged, and HDFS is starting to take a back seat to new file systems based on SSD and memory! In short, the Big Data world we now live in has expanded beyond the borders of Hadoop -- it now includes several interactive-speed OLAP engines, multiple machine learning platforms, a couple of columnar file formats, and many alternatives for both streaming and graph processing. Increasingly, sentences that begin with Hadoop end with Spark. How are companies leveraging these new Big Data technologies? Come to this track to learn more.

Track Host:

Jeff Magnusson

Director of Data Platform at Stitch Fix

As Director of Data Platform at Stitch Fix, Jeff Magnusson leads the team responsible for building a robust and scalable algorithms platform that blends art and science by leveraging machines together with expert-human resources to generate innovative recommendations and insights. Prior to joining Stitch Fix, Jeff managed the Data Platform Architecture team at Netflix, where he helped design and open source much of the Hadoop-based data and analytics platform Netflix uses for batch computation in the AWS cloud. Jeff holds a PhD from the University of Florida, specializing in database system implementation. @jeffmagnusson

10:35am - 11:25am

by Matei Zaharia
CTO and founder of Databricks

Unified Big Data Processing with Apache Spark

While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have grown quickly. Users soon needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in...

11:50am - 12:40pm

by Richard Kasperowski
QCon Open Space Facilitator

Open Space

Beyond Hadoop Open Space

Join Jeff Magnusson, our speakers, and other attendees as we discuss more flexible distributed system management paradigms like YARN and new file systems based on SSD and memory! The Big Data world we now live in has expanded beyond the borders of Hadoop -- it now includes several interactive-speed OLAP engines, multiple machine learning platforms, a couple of columnar file formats, and many alternatives for both streaming and graph processing. Increasingly, sentences that begin with Hadoop...

1:40pm - 2:30pm

by Eugene Mandel
Jawbone Data Science

Better Together: Using Spark and Redshift to Combine Your Data with Public Datasets

At Jawbone, the Data Science team correlated step and workout data for hundreds of thousands of UP wearers with publicly available external datasets in order to understand how various factors affect physical activity.

In this talk we will highlight the challenges of combining internal and external datasets: knowing how the data was generated and its limitations, understanding the domain logic and, most importantly, addressing data errors and outliers.

We will also compare two...

2:55pm - 3:45pm

by Lin Qiao
Engineering Manager at LinkedIn

Gobblin: A Framework for Solving Big Data Ingestion Problem

Traditionally, a Big Data system is about the sheer volume of the datasets it handles and the large processing power behind it. Nowadays, it also means large-scale data ingestion and integration with high velocity and high quality. The first part of the big data problem has been the focus lately, with innovations to tackle these challenges.

In reality, the latter part of the problem often starts to cause major pain points before developers get to solve the next problems. With...

4:10pm - 5:00pm

by Julien Le Dem
Tech Lead at Twitter, Pig Committer, Co-author of Apache Parquet

Efficient Data Storage for Analytics with Parquet 2.0

Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; it is important for a format to be queried quickly and efficiently. For interoperability, row based encodings (CSV, Thrift, Avro) combined with a general purpose compression algorithm to reduce storage cost (GZip, LZO, Snappy) are very common but are not efficient to query.

As discussed extensively in the database literature, a columnar layout with statistics on optionally sorted data...

5:25pm - 6:15pm

by Gian Merlino
Engineer at Metamarkets

Lambda Architectures in Practice

Hybrid batch/real-time architectures (sometimes called “lambda architectures”) are a powerful pattern for building robust, production-quality, up-to-the-minute data analytics systems.

We’ll discuss why you may want to go hybrid, the sorts of challenges that can arise when building production data systems, and effective techniques for making them easier to deploy and manage. We’ll take the data system at Metamarkets as an example, which uses Hadoop, Storm, Kafka, and Druid to ingest...

Tracks

Covering innovative topics

Monday, 3 November

Architectures You've Always Wondered about

The newest and biggest Internet architectures

Real World Functional

Putting functional programming concepts to work in the real world.

The Future of Mobile

The future of mobile and performance improvements

Continuous Delivery: From Heroics to Becoming Invisible

Continuous Delivery philosophies, cultures, hiccups, and best practices.

Unleashing the Power of Streaming Data

This track explores a variety of use-cases, platforms, and techniques for processing and analyzing stream data from the companies deploying them at scale!

Sponsored Solutions Track I

Tuesday, 4 November

Engineering for Product Success

Architectures that make products more successful

Reactive Service Architecture

Reactive, Responsive, Fault Tolerant and More.

Modern CS In the Real World

How modern CS tackles problems in the real world.

Applied Machine Learning and Data Science

Understand your big big data!

Deploying at Scale

Containerizing Applications, Discovering Services, and Deploying to the Grid.

Sponsored Solutions Track II

Wednesday, 5 November

Beyond Hadoop

Emerging Big Data Frameworks and Technology

Scalable Microservice Architectures

This track addresses the ways companies with hundreds of fine-grained web-services (e.g. Netflix, LinkedIn) manage complexity!

Java at the Cutting Edge

The latest and greatest in the Java ecosystem

Engineering culture

Successes and failures in creating an engineering culture.

Next gen HTML5 and JS

How Web Components, the Future of CSS, and more are changing the web.

Sponsored Solutions Track III

Conference for Professional Software Developers

Track: Beyond Hadoop
