Hadoop: Beyond Map-Reduce

Hadoop, the open-source combination of Map-Reduce libraries and the Hadoop Distributed File System (HDFS), has long been an essential tool for enterprises and startups alike. Data scientists use Hadoop to execute statistical and analytical functions on large volumes of data. Data Infrastructure and Search engineers use Hadoop to generate ready-to-load indexes for custom search engines, databases, and NoSQL systems. Data Warehouse engineers and analysts use Hadoop as a key integration point for all data flowing into MPP databases like Teradata and reporting solutions like MicroStrategy. Whatever the use case, a Hadoop installation often finds several users in any business.

Beyond Map-Reduce and HDFS, Hadoop acts as an ecosystem for several useful tools and languages, including Apache Pig and Apache Hive. Until recently, all of these higher-level tools were forced to use Hadoop's Map-Reduce framework to do their work; everything at the higher level boiled down to one or more Map-Reduce jobs. This was not only inefficient, it also limited what framework developers and application developers could do on Hadoop. With the advent of YARN last year, things have changed dramatically. In recent months, several new frameworks have been introduced to leverage the power of YARN, including Tez, Samza, and REEF. Come learn about these and other exciting changes from the framework developers themselves!

Day of week:

Wednesday

Location:

Grand Ballroom A

Host:

Sid Anand

Presentations

REEF: Retainable Evaluator Execution Framework

4:05pm - 4:55pm

By: Rusty Sears
Senior Scientist, Microsoft Research

With the introduction of the YARN resource manager, it is now possible for Hadoop clusters to mix and match applications written for multiple computational frameworks. YARN achieves this by providing containers with an extremely low-level API (essentially a working directory and a command line) and expecting computational frameworks such as MapReduce to handle fault tolerance, communication, and the other trappings of scalable computations.

REEF is an Apache 2.0-licensed framework that bridges this gap by providing retainable hardware resources with lifetimes that are decoupled from those of computational tasks. This allows us to support high-performance iterative graph processing and machine learning algorithms, as well as sessions, which allow users to temporarily reserve a set of machines, instantiate a computational framework, and then run extremely low-latency ad-hoc queries and jobs atop the reserved machines.

Unlike existing approaches, REEF also aims for composability of jobs across computational models, providing significant performance and usability gains, even with legacy code. Finally, REEF provides a common set of mechanisms required by most scalable computational frameworks, including configuration management, scalable data-movement primitives, fault-handling primitives, and support for advanced scheduling policies such as preemption and elastic job sizing.

In addition to providing first-class support for Java-based applications and the Hadoop ecosystem, REEF provides a set of interoperability primitives that allow it to leverage systems written in native code and C#. This talk will cover REEF's core features and present examples of computational frameworks, including interactive sessions, iterative graph processing, bulk synchronous computations, Hive queries, and, of course, MapReduce.

Hadoop & Big Data Open Space

2:50pm - 3:40pm

Big Data Platform as a Service at Netflix

1:35pm - 2:25pm

By: Jeff Magnusson
Manager, Data Platform Architecture at Netflix

Netflix is well known for being a heavily data-driven company, leveraging billions of hours of subscriber viewing data to power its recommendation algorithms and refine the customer experience. Hadoop has been instrumental in unlocking the potential to process these vast quantities of data. However, without tooling and services to facilitate data discoverability and usability, the potential of big data would be difficult to fully realize at Netflix.

This talk will take a deep dive into key services of Netflix’s “data platform as a service” architecture, including RESTful services that: provide comprehensive metadata management across data sources (Franklin); enable visualization and caching of results of Hadoop jobs (Sting); and visualize the execution plans produced by languages such as Pig and Hive (Lipstick). The presentation will show how these services can be employed in concert to solve various use cases at Netflix, and will include implementation details, demos, and our open source roadmap for these projects.

Apache Giraph: Scalable Graph Processing on YARN

5:20pm - 6:10pm

By: Eli Reisman
Software Engineer at Etsy

Apache Giraph performs offline batch processing of very large graph datasets on top of a Hadoop cluster. Giraph replaces iterative MapReduce-style solutions with Bulk Synchronous Parallel graph processing using in-memory or disk-based data sets, loosely following the model of Google's Pregel. Robust, efficient, and fast, Giraph is now used in production to process massive graphs for companies like Facebook. Giraph's recent port to a pure YARN platform offers increased performance, fine-grained resource control, and scalability that Giraph atop Hadoop MRv1 cannot match, while paving the way for ports to other platforms like Apache Mesos. Come hear what's on the roadmap for Giraph as we explore the new possibilities YARN offers.
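To make the Bulk Synchronous Parallel, vertex-centric model mentioned above more concrete, here is a minimal single-process Python sketch. It does not use Giraph's API; the function name `bsp_max_value` and the sample graph are hypothetical, chosen only for illustration. In each superstep, every vertex that received messages updates its value and sends messages along its out-edges, and the computation stops once no messages are in flight.

```python
from collections import defaultdict


def bsp_max_value(values, edges):
    """Propagate the maximum vertex value through a graph, Pregel-style.

    values: dict mapping vertex id -> initial numeric value
    edges:  dict mapping vertex id -> list of neighbor vertex ids
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox = defaultdict(list)
    for vertex, value in values.items():
        for neighbor in edges.get(vertex, []):
            inbox[neighbor].append(value)
    supersteps = 1

    # Later supersteps: a vertex is woken only if it received messages.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbor in edges.get(vertex, []):
                    outbox[neighbor].append(best)
            # Otherwise the vertex votes to halt by sending nothing.
        # Barrier: messages sent in this superstep are read only in the next.
        inbox = outbox
        supersteps += 1
    return values, supersteps


# Hypothetical four-vertex chain: a - b - c - d
graph_edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = bsp_max_value({"a": 3, "b": 6, "c": 2, "d": 1}, graph_edges)
print(final_values, steps)
```

In a distributed setting the vertices would be partitioned across workers and the barrier would be a cluster-wide synchronization point; this sketch keeps everything in one process to show only the control flow.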

Samza: Real-time Stream Processing at LinkedIn

11:45am - 12:35pm

By: Chris Riccomini
Staff Software Engineer, LinkedIn

Apache Samza is a distributed stream processing framework. Samza provides a familiar, easy-to-use, MapReduce-style API that allows developers to process messages and events in real time. The framework integrates with Apache Kafka for its messaging layer and with Apache Hadoop YARN to manage fault tolerance, processor isolation, resource management, and security. Samza also manages processor state, and will recover to a consistent snapshot when failures occur. This talk will cover Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap.
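The idea of a per-message callback combined with managed state and snapshot recovery can be sketched in a few lines of Python. This is an illustrative toy, not Samza's API: the class `PageViewCounter`, the function `run`, and the `checkpoint_every` parameter are all hypothetical, and the "stream" is an in-memory list rather than a Kafka topic.

```python
import copy


class PageViewCounter:
    """Hypothetical stream task: counts page views per user."""

    def __init__(self):
        self.counts = {}  # task-local state

    def process(self, message, emit):
        """Called once per incoming message; `emit` sends an output message."""
        user = message["user"]
        self.counts[user] = self.counts.get(user, 0) + 1
        emit((user, self.counts[user]))


def run(task, messages, start_offset=0, checkpoint_every=2):
    """Feed messages to the task, snapshotting state and offset periodically."""
    output = []
    snapshot = (start_offset, copy.deepcopy(task.counts))
    for offset in range(start_offset, len(messages)):
        task.process(messages[offset], output.append)
        if (offset + 1) % checkpoint_every == 0:
            # The snapshot pairs the state with the next offset to read.
            snapshot = (offset + 1, copy.deepcopy(task.counts))
    return output, snapshot


stream = [{"user": "ann"}, {"user": "bob"}, {"user": "ann"},
          {"user": "ann"}, {"user": "bob"}]

# First run handles three messages and then "fails" before the next checkpoint.
failed_task = PageViewCounter()
_, (resume_offset, saved_counts) = run(failed_task, stream[:3])

# Recovery: restore the last snapshot and replay from its offset.
recovered = PageViewCounter()
recovered.counts = copy.deepcopy(saved_counts)
replayed_output, _ = run(recovered, stream, start_offset=resume_offset)
print(recovered.counts)
```

Because the snapshot stores the state together with the offset it corresponds to, replaying from that offset yields the same final counts as an uninterrupted run, although messages after the snapshot are processed (and emitted) a second time.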

Apache Tez: Accelerating Hadoop Query Processing

10:30am - 11:20am

By: Bikas Saha, Arun Murthy
Apache Tez Committer at the Apache Software Foundation; Lead of the MapReduce project in Apache Hadoop

Apache Tez is a general-purpose data processing framework written on top of YARN. Tez aims to provide high performance and efficiency out of the box across the spectrum of low latency queries and heavy-weight batch processing. Query plans produced by high-level languages like Hive and Pig can be elegantly translated via Tez's dataflow graph description API.
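As a rough illustration of what expressing a query plan as a dataflow graph means, the following Python sketch runs a small graph of processing vertices in dependency order. It is not Tez's API; `run_dag`, the vertex names, and the sample data are hypothetical. The plan shown (scan, filter, then join and aggregate) is the kind that would otherwise be split into a chain of separate Map-Reduce jobs.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, source_data):
    """Execute a small dataflow graph in dependency order.

    vertices:    dict mapping vertex name -> function(list of inputs) -> output
    edges:       list of (upstream, downstream) pairs
    source_data: dict mapping root vertex name -> its input dataset
    """
    predecessors = {name: [] for name in vertices}
    for upstream, downstream in edges:
        predecessors[downstream].append(upstream)

    results = {}
    for name in TopologicalSorter(predecessors).static_order():
        if predecessors[name]:
            # Inputs arrive in the order the edges were declared.
            inputs = [results[p] for p in predecessors[name]]
        else:
            inputs = [source_data[name]]
        results[name] = vertices[name](inputs)
    return results


def total_by_region(inputs):
    """Join filtered orders with the region table and sum amounts per region."""
    filtered_orders, region_of = inputs
    totals = {}
    for user, amount in filtered_orders:
        region = region_of[user]
        totals[region] = totals.get(region, 0) + amount
    return totals


# Hypothetical sample data
orders = [("ann", 30), ("bob", 5), ("ann", 12)]
regions = {"ann": "east", "bob": "west"}

vertices = {
    "scan_orders": lambda inputs: inputs[0],
    "scan_regions": lambda inputs: inputs[0],
    "filter_large": lambda inputs: [row for row in inputs[0] if row[1] >= 10],
    "join_and_sum": total_by_region,
}
edges = [
    ("scan_orders", "filter_large"),
    ("filter_large", "join_and_sum"),
    ("scan_regions", "join_and_sum"),
]
results = run_dag(vertices, edges, {"scan_orders": orders, "scan_regions": regions})
print(results["join_and_sum"])
```

Here intermediate results are passed between vertices in memory; a real engine would decide per edge how data moves between tasks, which this sketch does not model.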

Adding new types of storage and data transfer technologies is facilitated via a flexible task construction model. A modular execution engine enables advanced optimization strategies to be plugged in at runtime for optimal execution. Early investments in Hive on Tez have shown remarkable improvements in performance. The talk will provide details about the design of Tez, present use cases highlighting its features, and share some initial results obtained by Hive on Tez.

MapReduce has been the workhorse for Hadoop, but its monolithic structure has slowed innovation. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic framework for data processing that benefits the entire Hadoop query ecosystem.