BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)

Motivation Support interactive SQL-like aggregate queries over massive sets of data

Feature
Most queries compute a global aggregate over the whole table.
blinkdb> SELECT AVG(jobtime) FROM very_big_log
Typical aggregates: AVG, COUNT, SUM, STDEV, PERCENTILE, etc.

Feature
Filtering and grouping rely on a limited set of clauses: WHERE filters and GROUP BY.
blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'

Query Execution on Samples
Scanning 100 TB on 1000 machines:
- Hard disks: ½ - 1 hour
- Memory: 1 - 5 minutes
- ?: 1 second

Query Execution on Samples
What is the average buffering ratio in the table?

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

Exact answer: 0.2325

Query Execution on Samples
What is the average buffering ratio in the table? Answer it from a uniform sample of the 12-row table instead of the full data.

Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Estimate from the sample: 0.19 +/- 0.05 (exact answer: 0.2325)

Query Execution on Samples
What is the average buffering ratio in the table? A larger uniform sample of the 12-row table narrows the error bar.

Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2

Estimate from the sample: 0.22 +/- 0.02 (with sampling rate 1/4: 0.19 +/- 0.05; exact answer: 0.2325)
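The estimate-plus-error-bar idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `estimate_mean` and `uniform_sample` are hypothetical, the interval uses a plain normal approximation with z = 1.96, and no finite-population correction is applied, so the numbers it prints are not meant to reproduce the slide's error bars.

```python
import math
import random
import statistics

# Buffering ratios from the 12-row example table, in ID order 1..12.
BUFF_RATIO = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]


def estimate_mean(sample, z=1.96):
    """Return (mean, half_width) of an approximate confidence interval.

    Assumption: normal approximation; z=1.96 is roughly a 95% interval.
    """
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, z * std_err


def uniform_sample(values, rate, seed=0):
    """Draw a uniform sample without replacement at the given rate."""
    k = max(2, round(len(values) * rate))
    return random.Random(seed).sample(values, k)


exact = statistics.fmean(BUFF_RATIO)
for rate in (0.25, 0.5):
    mean, half = estimate_mean(uniform_sample(BUFF_RATIO, rate))
    print(f"rate={rate}: {mean:.3f} +/- {half:.3f} (exact {exact:.4f})")
```

Calling `estimate_mean([0.13, 0.25, 0.19])` on the three sampled values shown for rate 1/4 gives a mean of 0.19, matching the point estimate on the slide.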

Speed/Accuracy Trade-off
[Figure: error versus execution time (sample size). Marked points: interactive queries at 2 sec, and the time to execute on the entire dataset at 30 mins. A "Pre-Existing Noise" level is marked on the error axis.]

Sampling Vs. No Sampling
[Figure: query response time (seconds) versus fraction of full data.]

Fraction of full data  Response time (s)  Error bar
1                      1020
10^-1                  103                0.02%
10^-2                  18                 0.07%
10^-3                  13                 1.1%
10^-4                  10                 3.4%
10^-5                  8                  11%

Annotation: the first step is about 10x faster, as response time is dominated by I/O.

What is BlinkDB?
A framework built on Shark and Spark that:
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

BlinkDB
- Background
- System Overview
- Sample Creation
- BlinkDB Runtime
- Implementation & Evaluation

Background
One common assumption is that future queries will be similar to historical queries. The meaning of "similarity" can differ. The choice of model for past workloads is one of the key differences between BlinkDB and prior work.

Workload Taxonomy

System overview
BlinkDB extends the Apache Hive framework by adding two major components to it:
(1) an offline sampling module that creates and maintains samples over time
(2) a run-time sample selection module that creates an Error-Latency Profile (ELP) for queries

Supported queries
Standard SQL aggregate queries involving COUNT, AVG, SUM and QUANTILE. Queries involving these operations can be annotated with either an error bound or a time constraint. Nested or joined queries are not supported yet, but this is not a hindrance.

It would also be straightforward to extend BlinkDB to deal with foreign key joins between two sampled tables (or a self join on one sampled table) where both tables have a stratified sample on the set of columns used for joins.

Sample Creation
Why are stratified samples useful? Samples carry storage costs, so we can only build a limited number of them.

Stratified Samples
When is a uniform sample useful, and when is it not? For a query that filters or groups on a rare subset of the data, a uniform sample may not contain any members of the subset at all, leading to a missing row in the final output of the query.

Stratified Samples for a single query

This problem has been studied before. Briefly, since error decreases at a decreasing rate as sample size increases, the best choice simply assigns an equal sample size to each group. In addition, the assignment of sample sizes is deterministic.
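A minimal sketch of the equal-size-per-group idea, assuming a per-group cap: groups smaller than the cap are kept whole, larger groups are sampled down to the cap. The function name `stratified_sample`, the parameter `cap`, and the extra "Galena" row are hypothetical and only serve the illustration; this is not BlinkDB's actual sampling code.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows per distinct value of `key`.

    Returns (row, sampling_rate) pairs; the per-group rate is recorded so
    that later aggregates can reweight each group.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for members in groups.values():
        chosen = members if len(members) <= cap else rng.sample(members, cap)
        rate = len(chosen) / len(members)
        sample.extend((row, rate) for row in chosen)
    return sample


# The 12 rows of the example table, plus one hypothetical rare group.
rows = [{"city": "NYC", "ratio": r}
        for r in (0.78, 0.13, 0.19, 0.11, 0.18, 0.15, 0.19)]
rows += [{"city": "Berkeley", "ratio": r}
         for r in (0.25, 0.09, 0.13, 0.49, 0.10)]
rows.append({"city": "Galena", "ratio": 0.30})  # hypothetical, not in the slides

sample = stratified_sample(rows, key="city", cap=2)
for row, rate in sample:
    print(row["city"], row["ratio"], f"rate={rate:.3f}")
```

Unlike a small uniform sample, every city is guaranteed to appear in the output, including the single-row group.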

[16] S. Lohr. Sampling: design and analysis. Thomson, 2009. K=

Optimizing a set of stratified samples for all queries sharing a QCS (query column set)
n will change from query to query.

Column selection optimization
In practice, we set M = K = 100000.

can also be useful by partially covering q j

The size of this optimization problem increases exponentially with the number of columns in T, which looks worrying. However, it is possible to solve these problems in practice by applying some simple optimizations, like considering only column sets that actually occurred in the past queries, or eliminating column sets that are unrealistically large.
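The two pruning rules mentioned here can be illustrated with a short sketch. Everything named below is an assumption made for the example: the function `candidate_column_sets`, the `max_columns` threshold, and the column names are hypothetical, and the sketch covers only candidate generation, not the optimization itself.

```python
def candidate_column_sets(past_query_column_sets, max_columns=3):
    """Return the column sets worth considering for stratification.

    Rule 1: only column sets that actually occurred in past queries.
    Rule 2: drop column sets with more than `max_columns` columns.
    """
    seen = {frozenset(qcs) for qcs in past_query_column_sets}
    return {qcs for qcs in seen if 0 < len(qcs) <= max_columns}


# Hypothetical table columns and past query column sets.
table_columns = ["city", "os", "browser", "genre", "dt", "url"]
past = [
    ("city",),
    ("city", "os"),
    ("os", "city"),                 # same set as above, different order
    ("genre", "dt", "url", "os"),   # dropped: more than max_columns columns
    ("dt",),
]

candidates = candidate_column_sets(past)
all_subsets = 2 ** len(table_columns) - 1  # non-empty subsets of the columns
print(len(candidates), "candidates instead of", all_subsets)
```

With six columns the unpruned search space already has 63 non-empty subsets, while the pruned candidate list grows only with the number of distinct column sets in the workload.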

BlinkDB Runtime

Predict mainly based on:
1. For all standard SQL aggregates, the variance is proportional to 1/n, and thus the standard deviation (or the statistical error) is proportional to 1/√n.
2. BlinkDB simply predicts n by assuming that latency scales linearly with input size, as is commonly observed with the majority of I/O-bound queries in parallel distributed execution environments.
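Both scaling rules translate directly into arithmetic. The sketch below assumes a trial run on a small sample of `n0` rows that observed error `err0` and latency `t0`; the function names are hypothetical and the code only illustrates the two proportionalities, not BlinkDB's actual profile construction.

```python
import math


def rows_for_error(n0, err0, target_err):
    """Rows needed for target_err, given err0 on n0 rows (error ~ 1/sqrt(n))."""
    return math.ceil(n0 * (err0 / target_err) ** 2)


def latency_for_rows(n0, t0, n):
    """Predicted latency for n rows, assuming latency is linear in input size."""
    return t0 * n / n0


def rows_for_latency(n0, t0, budget):
    """Largest n that fits in the time budget under the linear latency model."""
    return math.floor(n0 * budget / t0)


# Hypothetical trial run: 1000 rows, 10% error, 0.5 seconds.
n = rows_for_error(n0=1000, err0=0.10, target_err=0.05)
print(n)                                      # rows for a 5% error target
print(latency_for_rows(1000, 0.5, n))         # predicted seconds for that n
print(rows_for_latency(1000, 0.5, budget=2))  # rows that fit in 2 seconds
```

Halving the error requires four times as many rows, which under the linear latency model also means four times the response time.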

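Bias correction
A stratified sample over-represents rare groups, so BlinkDB tracks the sampling rate of every group and uses it to make the stratified sample behave like a uniform sample when computing aggregates.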
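One common way to do this kind of correction is to weight every sampled row by the inverse of its group's sampling rate. The sketch below assumes that approach; the function names and the choice of sampled rows are hypothetical, and the group sizes (7 NYC rows, 5 Berkeley rows) come from the 12-row example table.

```python
def weighted_count(sample):
    """Estimate the full-table row count from (value, sampling_rate) pairs."""
    return sum(1.0 / rate for _, rate in sample)


def weighted_avg(sample):
    """Estimate the full-table average, weighting each row by 1/rate."""
    weighted_sum = sum(value / rate for value, rate in sample)
    return weighted_sum / weighted_count(sample)


# Two rows kept per city: NYC at rate 2/7, Berkeley at rate 2/5.
stratified = [
    (0.13, 2 / 7), (0.19, 2 / 7),   # NYC rows
    (0.25, 2 / 5), (0.09, 2 / 5),   # Berkeley rows
]

naive = sum(v for v, _ in stratified) / len(stratified)
print(f"naive average:     {naive:.4f}")
print(f"corrected average: {weighted_avg(stratified):.4f}")
print(f"estimated rows:    {weighted_count(stratified):.1f}")
```

When every row has the same sampling rate the weights cancel and `weighted_avg` reduces to the plain sample mean, which is the uniform-sample case.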

Implementation
- Enables queries with response time and error bounds.
- Creates or updates the set of random and multi-dimensional samples.
- Rewrites the query and iteratively assigns it an appropriately sized uniform or stratified sample.
- Modifies all pre-existing aggregation functions with statistical closed forms to return error bars and confidence intervals in addition to their result.

Sample refresh
Analysis based on multiple queries over the same sample can be inaccurate: running more queries on an unchanged, biased sample does not help the answers converge. BlinkDB therefore periodically (typically daily) resamples from the original data, to avoid correlation among the answers to queries that use the same sample.

Query Execution on Samples (recap)
The uniform sample at rate 1/4 (IDs 2, 6, 8 of the 12-row table) estimates the average buffering ratio as 0.19, against an exact answer of 0.2325.

Time cost for sample creation
Uniform samples are generally created in a few hundred seconds. Creating stratified samples on a set of columns takes anywhere between 5 and 30 minutes, depending on the number of unique values to stratify on, which decides the number of reducers and the amount of data shuffled.

Evaluation workloads and sample storage cost
The chosen QCSs change with the storage budget.

Response time improvement by sample

Error by different samples

Error Convergence

Time and error bound

Scaling Up
Two classes of queries:
- Highly selective queries: queries that operate on only a small fraction of the input data and consist of one or more highly selective WHERE clauses. Example: average among x=2.
- Bulk queries: queries that are intended to crunch huge amounts of data. Example: average among all the data.

Conclusion
BlinkDB is a parallel, sampling-based approximate query engine that provides support for ad-hoc queries with error and response time constraints.
Two key ideas:
(i) a multi-dimensional sampling strategy that builds and maintains a variety of samples
(ii) a run-time dynamic sample selection strategy that uses parts of a sample to estimate query selectivity and chooses the best samples for satisfying query constraints
Answers a "range" of queries within 2 seconds on 17 TB of data with 90-98% accuracy.
