Presentation: Efficient Data Storage for Analytics with Parquet 2.0

Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; the storage format must also allow that data to be queried quickly and efficiently. For interoperability, row-based encodings (CSV, Thrift, Avro) are very common, usually combined with a general-purpose compression algorithm (GZip, LZO, Snappy) to reduce storage cost, but they are not efficient to query.

As discussed extensively in the database literature, a columnar layout with statistics on optionally sorted data provides both vertical and horizontal partitioning, keeping I/O to a minimum. Understanding modern CPU architecture is critical to designing the fast, data-specific encodings that a columnar layout enables (dictionary, bit-packing, prefix coding), which provide great compression for a fraction of the cost of general-purpose algorithms. The 2.0 release of Parquet brings new features that enable faster query execution.
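
To make two of the encodings named above concrete, here is a minimal sketch of dictionary encoding followed by bit-packing on a single column. It is an illustration only and does not reproduce Parquet's actual on-disk layout; the function names (dictionary_encode, bit_pack, bit_unpack) and the sample column are assumptions made for this example.

```python
from typing import List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with a small integer index into a dictionary."""
    dictionary: List[str] = []
    index_of = {}
    indices: List[int] = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def bit_width(max_value: int) -> int:
    """Number of bits needed to represent max_value (at least 1)."""
    return max(1, max_value.bit_length())


def bit_pack(indices: List[int], width: int) -> bytes:
    """Pack each index into exactly `width` bits, back to back."""
    acc = 0
    for pos, idx in enumerate(indices):
        acc |= idx << (pos * width)
    n_bytes = (len(indices) * width + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def bit_unpack(packed: bytes, width: int, count: int) -> List[int]:
    """Inverse of bit_pack: recover `count` indices of `width` bits each."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << width) - 1
    return [(acc >> (pos * width)) & mask for pos in range(count)]


# Hypothetical low-cardinality column, the case where this pays off most.
column = ["US", "FR", "US", "US", "DE", "FR", "US", "DE"]

dictionary, indices = dictionary_encode(column)
width = bit_width(len(dictionary) - 1)   # 3 distinct values fit in 2 bits
packed = bit_pack(indices, width)        # 8 values * 2 bits = 2 bytes

# Decoding reverses both steps and recovers the original column.
decoded = [dictionary[i] for i in bit_unpack(packed, width, len(indices))]
```

In this sketch the eight two-character strings shrink to a three-entry dictionary plus two bytes of packed indices, and decoding needs only shifts and masks, which is the kind of CPU-friendly work the abstract alludes to.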

We’ll dissect and explain the design choices made to achieve all three goals: interoperability, space efficiency, and query efficiency.
