Presentation: Not Exactly! Fast Queries via Approximation Algorithms

Many exact queries require computation and storage that scale linearly or super-linearly in the data. There exist many classes of problems for which exact results are not necessary.

We describe the roles of various approximation algorithms that allow Druid, a distributed datastore, to increase query speeds and minimize data volume while maintaining rigorous error bounds on the results. 

Calculating the cardinality exactly of 1 billion unique integer-valued IDs requires 4GB of storage. Algorithms such as HyperLogLog enable estimates within 2% for cardinalities exceeding 1 billion using only 1.5KB of storage. Data stored in Druid typically undergoes a timestamp truncation to a desired level of granularity (e.g. 1 hour) and then a GroupBy/aggregation step that summarizes data. Certain “distributive” statistics (count, sum, min, max) can be summarized and computed exactly.

For non-distributive statistics such as cardinality and quantiles, we store lossy representations of the data. These sketches enable fast, approximate calculations of the desired statistics.

We describe our use of and modifications to HyperLogLog for cardinality estimation, streaming parallel decision trees for quantile estimation and histogram building, and approximate top-k algorithms.

Fangjin Yang Elsewhere

Nelson Ray Elsewhere

Tracks

Covering innovative topics

Monday, 3 November

Tuesday, 4 November

Wednesday, 5 November

Conference for Professional Software Developers