
Kumar Palaniappan, NetApp

Kumar Palaniappan is a technologist with almost 19 years of experience in distributed-systems development and deployment, focusing on high-performance computing, service-oriented architectures, and cloud computing. Kumar is an Enterprise Architect at NetApp.

Presentation: "NetApp Case Study"

Time: Friday 15:35 - 16:35

Location: Franciscan I & II

Abstract:

NetApp is a fast-growing leader in storage technology. Its devices phone home, sending unstructured auto-support log and configuration data back to centralized data centers. This data is used to provide timely support, to inform sales, and to plan product improvements. To enable this, the data needs to be collected, organized, and analyzed. Data volumes are growing 40% per year and currently amount to 5 TB of compressed data per week. NetApp was previously storing flat files on disk volumes and keeping summary data in relational databases. It is now working with Think Big Analytics to incrementally deploy Hadoop, HBase, Flume, and Solr for managing auto-support data. Key requirements include:
* Query data in seconds, with data available within 5 minutes of event occurrence.
* Execute complex ad hoc queries to investigate issues and plan accordingly.
* Build models to predict support issues and capacity limits, so action can be taken before problems arise.
* Build models to identify cross-sell opportunities.

In this session we look at the design, and the lessons learned in building a system to:
* Collect 1,000 messages of 20 MB compressed per minute. This uses a fan-out configuration for Flume that reuses existing Perl parsers, writes large data sets into HDFS, updates HBase tables with current status, and creates support cases for high-priority issues; Java MapReduce jobs process the data downstream (see the HBase status-update sketch after this list).
* Store 2 PB of incoming support events by 2015.
* Provide low-latency access at scale to support information and configuration changes in HBase, within 5 minutes of event arrival.
* Support complex ad hoc queries that join multiple data sets, using custom user-defined functions (UDFs) to correlate JSON data; a hedged UDF sketch follows this list. These queries benefit from partitioning and indexing in Hive and can query tens of terabytes of data.
* Operate efficiently at scale.
* Integrate with a data warehouse in Oracle.
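
The ingest path in the first bullet fans out in Flume: one branch lands raw batches in HDFS for downstream MapReduce, while another keeps a current-status table in HBase up to date so that a device's latest state is a single-row read. The following is a minimal sketch of that HBase update using the standard HBase client API; the table name asup_current_status, the column family s, and the device-serial row key are illustrative assumptions, not NetApp's actual schema.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AutoSupportStatusWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("asup_current_status"))) {
                // Row key is the device serial number, so the latest status for a
                // device is always a single-row read.
                Put put = new Put(Bytes.toBytes("device-12345"));
                put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("last_event_ts"),
                              Bytes.toBytes(System.currentTimeMillis()));
                put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("status"),
                              Bytes.toBytes("HEALTHY"));
                table.put(put);
            }
        }
    }

Because each new auto-support message simply overwrites the device's status row, keeping this table fresh is cheap, which is what makes the 5-minute latency target reachable.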
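
For the ad hoc query bullet, correlating fields that are buried in JSON payloads typically calls for a custom Hive UDF. Below is a deliberately simplified sketch; the class name ExtractJsonField and the naive string-based lookup are illustrative only, and a production version would use a proper JSON parser (or Hive's built-in get_json_object).

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class ExtractJsonField extends UDF {
        // Returns the string value of a top-level JSON field, or null if absent.
        // Simplified: assumes string-valued fields and no escaped quotes.
        public Text evaluate(Text json, Text field) {
            if (json == null || field == null) {
                return null;
            }
            String key = "\"" + field.toString() + "\":\"";
            String s = json.toString();
            int start = s.indexOf(key);
            if (start < 0) {
                return null;
            }
            start += key.length();
            int end = s.indexOf('"', start);
            return end < 0 ? null : new Text(s.substring(start, end));
        }
    }

Once the jar is added, the function is registered with CREATE TEMPORARY FUNCTION and can then be used inside the partitioned, indexed Hive queries described above, for example to join auto-support events to configuration records on a serial number extracted from the JSON payload.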