Presentation: Not Exactly! Fast Queries via Approximation Algorithms
Many exact queries require computation and storage that scale linearly or super-linearly in the data. There exist many classes of problems for which exact results are not necessary.
We describe the roles of various approximation algorithms that allow Druid, a distributed datastore, to increase query speeds and minimize data volume while maintaining rigorous error bounds on the results.
Calculating the cardinality exactly of 1 billion unique integer-valued IDs requires 4GB of storage. Algorithms such as HyperLogLog enable estimates within 2% for cardinalities exceeding 1 billion using only 1.5KB of storage. Data stored in Druid typically undergoes a timestamp truncation to a desired level of granularity (e.g. 1 hour) and then a GroupBy/aggregation step that summarizes data. Certain “distributive” statistics (count, sum, min, max) can be summarized and computed exactly.
For non-distributive statistics such as cardinality and quantiles, we store lossy representations of the data. These sketches enable fast, approximate calculations of the desired statistics.
We describe our use of and modifications to HyperLogLog for cardinality estimation, streaming parallel decision trees for quantile estimation and histogram building, and approximate top-k algorithms.
Fangjin Yang Elsewhere
Similar Talks
Nelson Ray Elsewhere
Similar Talks
Tracks
Covering innovative topics
Monday, 3 November
-
Architectures You've Always Wondered about
The newest and biggest Internet architectures
-
Real World Functional
Putting functional programming concepts to work in the real world.
-
The Future of Mobile
The future of mobile and performance improvements
-
Continuous Delivery: From Heroics to Becoming Invisible
Continuous Delivery philosophies, cultures, hiccups, and best practices.
-
Unleashing the Power of Streaming Data
This track explores a variety of use-cases, platforms, and techniques for processing and analyzing stream data from the companies deploying them at scale!
-
Sponsored Solutions Track I
Tuesday, 4 November
-
Engineering for Product Success
Architectures that make products more successful
-
Reactive Service Architecture
Reactive, Responsive, Fault Tolerant and More.
-
Modern CS In the Real World
How modern CS tackles problems in the real world.
-
Applied Machine Learning and Data Science
Understand your big big data!
-
Deploying at Scale
Containerizing Applications, Discovering Services, and Deploying to the Grid.
-
Sponsored Solutions Track II
Wednesday, 5 November
-
Beyond Hadoop
Emerging Big Data Frameworks and Technology
-
Scalable Microservice Architectures
This track addresses the ways companies with hundreds of fine-grained web-services (e.g. Netflix, LinkedIn) manage complexity!
-
Java at the Cutting Edge
The latest and greatest in the Java ecosystem
-
Engineering culture
Successes and failures in creating an engineering culture.
-
Next gen HTML5 and JS
How Web Components, the Future of CSS, and more are changing the web.
-
Sponsored Solutions Track III