Apache Tez: Accelerating Hadoop Query Processing

Location:

Grand Ballroom A

Track:

Time:

Wednesday, 10:30am - 11:20am

Abstract:

Apache Tez is a general-purpose data processing framework written on top of YARN. Tez aims to provide high performance and efficiency out of the box across the spectrum from low-latency queries to heavyweight batch processing. Query plans produced by high-level languages like Hive and Pig can be elegantly translated via Tez's dataflow graph description API.

Adding new types of storage and data transfer technologies is facilitated by a flexible task construction model. A modular execution engine enables advanced optimization strategies to be plugged in at runtime. Early investments in Hive on Tez have shown remarkable improvements in performance. The talk will provide details about the design of Tez, present use cases highlighting its features, and share some initial results obtained by Hive on Tez.

MapReduce has been the workhorse for Hadoop, but its monolithic structure has made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic framework for data processing that benefits the entire Hadoop query ecosystem.

Bikas Saha

Bikas has been working on Apache Hadoop for over a year and is a committer on the project. He has been a key contributor in making Hadoop run natively on Windows and has focused on YARN and the Hadoop compute stack. Prior to Hadoop, he worked extensively on the Dryad distributed data processing framework, which runs on some of the world's largest clusters as part of the Microsoft Bing infrastructure. @bikassaha

Arun Murthy

Arun is the lead of the MapReduce project in Apache Hadoop and has been a full-time contributor to Apache Hadoop since its inception in 2006. He is a long-time committer and member of the Apache Hadoop PMC and jointly holds the current world sorting record using Apache Hadoop. Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!. In essence, he was responsible for running Apache Hadoop's MapReduce as a service for Yahoo!. Twitter: @acmurthy