Presentation: Efficient Data Storage for Analytics with Parquet 2.0
Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; the data must also be kept in a format that can be queried quickly and efficiently. For interoperability, row-based encodings (CSV, Thrift, Avro) combined with a general-purpose compression algorithm (GZip, LZO, Snappy) to reduce storage cost are very common, but they are not efficient to query.
As discussed extensively in the database literature, a columnar layout with statistics on optionally sorted data provides vertical and horizontal partitioning, thus keeping IO to a minimum. Understanding modern CPU architecture is critical to designing the fast, data-specific encodings enabled by a columnar layout (dictionary, bit-packing, prefix coding), which provide great compression for a fraction of the cost of general-purpose algorithms. The 2.0 release of Parquet brings new features that enable faster query execution.
We’ll dissect and explain the design choices made to achieve all three goals: interoperability, space efficiency, and query efficiency.
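To make the encodings named in the abstract more concrete, here is a minimal Python sketch of dictionary encoding followed by bit-packing on a single low-cardinality column. The function names (`dictionary_encode`, `bit_pack`, `bit_unpack`) and the sample column are hypothetical, chosen only for illustration; this is not Parquet's on-disk layout, which uses a hybrid of run-length encoding and bit-packing for dictionary indices.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with an index into a list of distinct values."""
    dictionary: List[str] = []
    index_of: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def bit_width(max_value: int) -> int:
    """Smallest number of bits able to represent max_value (at least 1)."""
    return max(1, max_value.bit_length())


def bit_pack(indices: List[int], width: int) -> bytes:
    """Pack each index into `width` bits, little-endian, with no padding
    between values (illustrative layout, not Parquet's)."""
    acc = 0
    for position, index in enumerate(indices):
        acc |= index << (position * width)
    n_bytes = (len(indices) * width + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def bit_unpack(packed: bytes, width: int, count: int) -> List[int]:
    """Inverse of bit_pack: recover `count` indices of `width` bits each."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << width) - 1
    return [(acc >> (position * width)) & mask for position in range(count)]


# Hypothetical column of country codes with only three distinct values.
column = ["US", "FR", "US", "US", "DE", "FR", "US", "DE"]

dictionary, indices = dictionary_encode(column)
width = bit_width(len(dictionary) - 1)
packed = bit_pack(indices, width)

# Round trip: unpack the indices and look each one up in the dictionary.
decoded = [dictionary[i] for i in bit_unpack(packed, width, len(indices))]
```

With three distinct values, each index fits in 2 bits, so the eight values pack into 2 bytes plus the small dictionary. Because the values of one column are stored together, this kind of type-specific encoding is cheap to apply and to decode, which is the advantage over running a general-purpose compressor across whole rows.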