Presentation: "Facebook's Petabyte Scale Data Warehouse using Hive and Hadoop"

Time: Friday 10:10 - 11:10

Location: Metropolitan Ballroom

Abstract: Hive is an open source, petabyte-scale data warehousing framework built on top of Hadoop that enables scalable analytics on large data sets using SQL and some language extensions. Scalable analysis of large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis and business intelligence applications used by analysts across the company, a number of Facebook products are also based on analytics. These products range from simple reporting applications, like Insights for the Facebook Ad Network, to more advanced ones, such as Facebook's Lexicon product. As a result, a flexible infrastructure that caters to the needs of these diverse applications and users, and that also scales cost-effectively with the ever-increasing amounts of data being generated on Facebook, is critical. Hive fills that need and brings the power of Hadoop to users who are familiar with SQL. It is flexible enough to understand different data formats (including custom formats), and it also allows users to embed custom map/reduce logic or functions within a SQL-like query. It is powerful enough to support many different kinds of analytics applications. In this presentation we will talk in more detail about Hive, the motivations behind it, and how it is used at Facebook to analyze and manage 400TB of compressed data (2.5PB uncompressed) in our Hadoop cluster.

Ashish Thusoo, Facebook

Ashish Thusoo has been with Facebook for the last couple of years and, in his most recent role, manages the Facebook data infrastructure team. He started the Hive project at Facebook along with Joydeep and serves as the project lead for Hive at Apache. He is also part of the Hadoop PMC at Apache and has presented Hive at a number of conferences, forums and panels.

Ashish has deep expertise in data processing and parallel processing technologies, and in the infrastructure and applications built on them. In the past he worked at Oracle in the areas of Parallel Query Execution and XML Databases. At Oracle he built many core data warehousing and query processing features and was recognized as one of the leaders in the Parallel Execution team. These features are regularly used in most Oracle-based data warehouses.

When not tinkering with new ideas and technologies, Ashish loves spending time with his family and listening to music.

Namit Jain, Facebook

Namit Jain has been with the data infrastructure group at Facebook for more than a year. He is one of the early engineers on Hive and one of its committers. He has presented Hive at a number of conferences, including Hadoop Summit 2009 and VLDB 2009.

Before that, Namit was at Oracle for over 10 years in the database and application server groups. He has worked on streaming technologies, XML, replication, queuing and related products, both inside and outside the database.