Scott Fleming, Engineering Practice Manager, Think Big Analytics

Scott is the Engineering Practice Manager at Think Big Analytics, where he is responsible for practice development, client project management, and technology best practices. He brings over 15 years of experience as a hands-on enterprise architect and has worked with clients globally across many industries. Previously, he was a Principal Architect at Optaros, a Solutions Architect at Cell Exchange, an Engineering Manager at Excelon, and a Principal Consultant at C-Bridge Internet Solutions. He holds a B.S. in Engineering Management from the University of Vermont.

Presentation: "NetApp Case Study"

Track: Big Data and NoSQL

Time: Friday 15:35 - 16:35

Location: Franciscan I & II

Abstract:

NetApp is a fast-growing leader in storage technology. Its devices phone home, sending unstructured auto-support log and configuration data back to centralized data centers. This data is used to provide timely support, to sell more effectively, and to plan product improvements. To make this possible, the data must be collected, organized, and analyzed. Data volumes are growing 40% per year and currently stand at 5 TB of compressed data per week. NetApp previously stored flat files on disk volumes and kept summary data in relational databases. NetApp is now working with Think Big Analytics to incrementally deploy Hadoop, HBase, Flume, and Solr to manage auto-support data. Key requirements include:
* Query data in seconds, within 5 minutes of an event occurring.
* Execute complex ad hoc queries to investigate issues and plan accordingly.
* Build models to predict support issues and capacity limits to take action before issues arise.
* Build models for cross-sell opportunities.

In this session we look at the design and lessons learned to:
* Collect 1000 messages of 20 MB compressed per minute. This uses a fan-out configuration for Flume, reuses existing Perl parsers, writes large data sets into HDFS, updates HBase tables with current status, and creates cases for high-priority issues. Java MapReduce jobs then process the data downstream.
* Store 2 PB of incoming support events by 2015.
* Provide low latency access to support information and configuration changes in HBase at scale within 5 minutes of event arrival.
* Support complex ad hoc queries that join multiple data sets, using custom User Defined Functions (UDFs) to correlate JSON data. These queries benefit from partitioning and indexing in Hive and can query tens of terabytes of data.
* Operate efficiently at scale.
* Integrate with a data warehouse in Oracle.
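
As a rough check on the storage figures above, the growth numbers in the abstract (5 TB of compressed data per week, growing 40% per year) can be turned into a back-of-the-envelope capacity projection. The sketch below is illustrative only and not part of the NetApp design: the function name and parameters are hypothetical, and it assumes the weekly rate steps up once per year, 52 weeks per year, and no replication or retention policy.

```python
def projected_volume_tb(weekly_tb=5.0, annual_growth=0.40, years=4, weeks_per_year=52):
    """Return the cumulative compressed volume (in TB) at the end of each year.

    Assumption: the weekly ingest rate is constant within a year and
    increases by `annual_growth` at each year boundary.
    """
    totals = []
    cumulative = 0.0
    rate = weekly_tb
    for _ in range(years):
        cumulative += rate * weeks_per_year  # data ingested during this year
        totals.append(cumulative)
        rate *= 1.0 + annual_growth          # step up the rate for the next year
    return totals


if __name__ == "__main__":
    for year, total in enumerate(projected_volume_tb(), start=1):
        print(f"after year {year}: {total:,.0f} TB")
```

Under these assumptions the cumulative total comes to a little over 1.8 PB after four years, which is the same order of magnitude as the 2 PB target quoted above.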