Track:

Many exact queries require computation and storage that scale linearly or super-linearly in the data. There exist many classes of problems for which exact results are not necessary.

We describe the roles of various approximation algorithms that allow Druid, a distributed datastore, to increase query speeds and minimize data volume while maintaining rigorous error bounds on the results.

Calculating the cardinality exactly of 1 billion unique integer-valued IDs requires 4GB of storage. Algorithms such as HyperLogLog enable estimates within 2% for cardinalities exceeding 1 billion using only 1.5KB of storage. Data stored in Druid typically undergoes a timestamp truncation to a desired level of granularity (e.g. 1 hour) and then a GroupBy/aggregation step that summarizes data. Certain “distributive” statistics (count, sum, min, max) can be summarized and computed exactly.

For non-distributive statistics such as cardinality and quantiles, we store lossy representations of the data. These sketches enable fast, approximate calculations of the desired statistics.

We describe our use of and modifications to HyperLogLog for cardinality estimation, streaming parallel decision trees for quantile estimation and histogram building, and approximate top-k algorithms.

Fangjin Yang Elsewhere

Engineering Manager at LinkedIn

Lin Qiao

Unified Big Data Processing with Apache Spark

CTO and founder of Databricks

Matei Zaharia

Nelson Ray Elsewhere

Engineering Manager at LinkedIn

Lin Qiao

Unified Big Data Processing with Apache Spark

CTO and founder of Databricks

Matei Zaharia

Tracks

Covering innovative topics

Monday, 3 November

Architectures You've Always Wondered about

The newest and biggest Internet architectures
Real World Functional

Putting functional programming concepts to work in the real world.
The Future of Mobile

The future of mobile and performance improvements
Continuous Delivery: From Heroics to Becoming Invisible

Continuous Delivery philosophies, cultures, hiccups, and best practices.
Unleashing the Power of Streaming Data

This track explores a variety of use-cases, platforms, and techniques for processing and analyzing stream data from the companies deploying them at scale!
Sponsored Solutions Track I

Tuesday, 4 November

Engineering for Product Success

Architectures that make products more successful
Reactive Service Architecture

Reactive, Responsive, Fault Tolerant and More.
Modern CS In the Real World

How modern CS tackles problems in the real world.
Applied Machine Learning and Data Science

Understand your big big data!
Deploying at Scale

Containerizing Applications, Discovering Services, and Deploying to the Grid.
Sponsored Solutions Track II

Wednesday, 5 November

Beyond Hadoop

Emerging Big Data Frameworks and Technology
Scalable Microservice Architectures

This track addresses the ways companies with hundreds of fine-grained web-services (e.g. Netflix, LinkedIn) manage complexity!
Java at the Cutting Edge

The latest and greatest in the Java ecosystem
Engineering culture

Successes and failures in creating an engineering culture.
Next gen HTML5 and JS

How Web Components, the Future of CSS, and more are changing the web.
Sponsored Solutions Track III

Tracks or Schedule

Find Fangjin Yang at

Fangjin Yang Elsewhere

Similar Talks

Find Nelson Ray at

Nelson Ray Elsewhere

Similar Talks

Tracks

Covering innovative topics

Monday, 3 November

Tuesday, 4 November

Wednesday, 5 November

Conference for Professional Software Developers

Follow QCon

Contact

Menu

QCons around the World

Presentation: Not Exactly! Fast Queries via Approximation Algorithms

Find Fangjin Yang at

Fangjin Yang Elsewhere

Similar Talks

Find Nelson Ray at

Nelson Ray Elsewhere

Similar Talks

Tracks

Covering innovative topics

Monday, 3 November

Tuesday, 4 November

Wednesday, 5 November

Conference for Professional Software Developers

Follow QCon

Contact

Menu

QCons around the World