Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

1 Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica

2 Our Goal Support interactive SQL-like aggregate queries over massive sets of data

3 Our Goal Support interactive SQL-like aggregate queries over massive sets of data blinkdb> SELECT AVG(jobtime) FROM very_big_log AVG, COUNT, SUM, STDEV, PERCENTILE etc.

4 Our Goal Support interactive SQL-like aggregate queries over massive sets of data blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop' FILTERS, GROUP BY clauses

5 Our Goal Support interactive SQL-like aggregate queries over massive sets of data blinkdb> SELECT AVG(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop' JOINS, Nested Queries etc.

6 Our Goal Support interactive SQL-like aggregate queries over massive sets of data blinkdb> SELECT my_function(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop' ML Primitives, User Defined Functions

7 Query Execution on Samples [Diagram: 100 TB on 1000 machines. Hard Disks: ½ - 1 Hour; Memory: 1 - 5 Minutes; ?: 1 second]

8 Query Execution on Samples
What is the average buffering ratio in the table? 0.2325
ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

9 Query Execution on Samples
What is the average buffering ratio in the table? Full table: 0.2325. Uniform Sample: 0.19
ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

10 Query Execution on Samples
What is the average buffering ratio in the table? Full table: 0.2325. Uniform Sample: 0.19 +/- 0.05
ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

11 Query Execution on Samples
What is the average buffering ratio in the table? Full table: 0.2325. Uniform Sample at rate 1/4: 0.19 +/- 0.05. Uniform Sample at rate 1/2: 0.22 +/- 0.02
ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2
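
The error bars on these slides can be illustrated with a small sketch. It assumes a plain normal-approximation interval (z times the sample standard deviation over the square root of the sample size); the slides do not say which formula produced their "+/-" values, so the function name `approx_mean` and the choice of z are illustrative assumptions only.

```python
import math
import random
import statistics

# Buffering ratios from the 12-row table on the slides.
BUFF_RATIO = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

def approx_mean(sample, z=1.96):
    """Return (estimate, half_width) for the mean of a uniform sample.

    half_width is z * s / sqrt(n), a normal-approximation error bar
    (z=1.96 is roughly 95% confidence). It ignores the finite-population
    correction and small-sample effects, so it is only a rough guide
    for samples this tiny.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return mean, half_width

rng = random.Random(0)  # fixed seed, only so the run is repeatable
exact = statistics.fmean(BUFF_RATIO)
for n in (3, 6):
    est, err = approx_mean(rng.sample(BUFF_RATIO, n))
    print(f"n={n}: {est:.3f} +/- {err:.3f} (exact {exact:.4f})")
```

Larger samples tend to give narrower intervals, which is the trade-off the next slides describe.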

12 Speed/Accuracy Trade-off [Figure: Error vs. Execution Time. Marked points: Interactive Queries (5 sec) and Time to Execute on Entire Dataset (30 mins)]

13 Speed/Accuracy Trade-off [Figure: Error vs. Execution Time. Marked points: Interactive Queries (5 sec) and Time to Execute on Entire Dataset (30 mins). Marked error level: Pre-Existing Noise]

14 Sampling Vs. No Sampling [Figure: Query Response Time (Seconds) vs. Fraction of full data (1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5). Bar labels as extracted: 103 1020 1813 108. Annotation: 10x, as response time is dominated by I/O]

15 Sampling Vs. No Sampling [Figure: Query Response Time (Seconds) vs. Fraction of full data (1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5). Bar labels as extracted: 103 1020 1813 108. Error Bars: (0.02%) (0.07%) (1.1%) (3.4%) (11%)]

16 Video Quality Diagnosis Full data: Latency: 772.34 sec (17TB input) Sample: Latency: 1.78 sec (1.7GB input) Top 10 worst performers identical! 440x faster!

17 What is BlinkDB? A data analysis (warehouse) system that … -creates and maintains a variety of random and stratified samples from underlying data -returns fast, approximate answers with error bars by executing queries on samples of data -is compatible with Apache Hive, AMP Lab’s Shark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)

18 What is BlinkDB? A data analysis (warehouse) system that … -creates and maintains a variety of random and stratified samples from underlying data -returns fast, approximate answers with error bars by executing queries on samples of data -is compatible with Apache Hive, AMP Lab’s Shark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)

19 Samples
ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
Stratified Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        2/7
8   NYC       0.25        2/7
6   Berkeley  0.09        2/5
12  Berkeley  0.49        2/5
Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/3
8   NYC       0.25        1/3
6   Berkeley  0.09        1/3
11  NYC       0.19        1/3

20 Uniform Samples
Create a 1% random sample A_sample from A
blinkdb> create table A_sample as select * from A samplewith 0.01;
1. SELECT A.* FROM A WHERE rand() < 0.01 ORDER BY rand()
2. Keep track of per row scaling information
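
A minimal sketch of the two steps on this slide, assuming each row is kept independently with probability equal to the sampling rate and that the "per row scaling information" is simply the inverse of that rate. The function name `uniform_sample` and the in-memory list standing in for table A are hypothetical.

```python
import random

def uniform_sample(rows, rate, seed=None):
    """Keep each row independently with probability `rate`.

    Each kept row is paired with a scale factor of 1/rate: the number of
    original rows it stands for when aggregates are scaled back up.
    """
    rng = random.Random(seed)
    return [(row, 1.0 / rate) for row in rows if rng.random() < rate]

rows = list(range(100_000))            # stand-in for table A
sample = uniform_sample(rows, rate=0.01, seed=42)
est_count = sum(scale for _, scale in sample)
print(len(sample), "rows sampled; estimated COUNT(*) =", round(est_count))
```

Summing the scale factors recovers an estimate of the original row count, which is why the scaling information has to travel with each sampled row.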

21 Stratified Samples
Create a 1% stratified sample A_sample from A biased on col_A
blinkdb> create table A_sample as select * from A samplewith 0.01 stratify on (col_A);
1. SELECT A.* from A JOIN (SELECT K, logic(count(*)) AS ratio FROM A GROUP BY K) USING K WHERE rand() < ratio;
2. Keep track of per row scaling information
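
The slide leaves `logic(count(*))` unspecified. The sketch below assumes one simple possibility: keep at most a fixed number of rows per stratum, so rare groups are sampled at a higher rate. With a cap of 2 on the 12-row example table this yields the 2/7 and 2/5 rates shown on the Samples slide. The names `stratified_sample`, `weighted_avg`, and `cap` are illustrative, not BlinkDB identifiers.

```python
import random
from collections import defaultdict

# (id, city, buff_ratio) rows from the table on the slides.
TABLE = [(1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
         (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
         (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
         (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10)]

def stratified_sample(rows, key, cap, seed=None):
    """Draw up to `cap` rows per stratum; attach each row's sampling rate."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    out = []
    for group in strata.values():
        k = min(cap, len(group))
        rate = k / len(group)            # 2/7 for NYC, 2/5 for Berkeley here
        out.extend((row, rate) for row in rng.sample(group, k))
    return out

def weighted_avg(sample, value):
    """Scale each row by 1/rate so large strata are not under-counted."""
    total = sum(value(row) / rate for row, rate in sample)
    weight = sum(1.0 / rate for _, rate in sample)
    return total / weight

sample = stratified_sample(TABLE, key=lambda r: r[1], cap=2, seed=1)
print(weighted_avg(sample, value=lambda r: r[2]))
```

The weights sum back to the table size (12 here), which is the role the per-row scaling information plays in stratified samples.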

22 What is BlinkDB? A data analysis (warehouse) system that … -creates and maintains a variety of random and stratified samples from underlying data -returns fast, approximate answers with error bars by executing queries on samples of data -is compatible with Apache Hive, AMP Lab’s Shark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)

23 Approximate Answers blinkdb> select count(1) from A_sample where event = "foo"; 12810132 +/- 3423 (99% Confidence) Also supports: sum(), avg(), stdev(), var()
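
As a rough illustration of how a count with an error bar could be derived from a sample, the sketch below assumes rows were kept independently at a known rate, so the number of matching sampled rows is binomial. This is a textbook closed form, not necessarily the one BlinkDB uses, and the input numbers are hypothetical rather than taken from the slide.

```python
import math

def approx_count(matches_in_sample, rate, z=2.576):
    """Estimate a COUNT from a Bernoulli sample taken at `rate`.

    matches_in_sample ~ Binomial(true_count, rate), so true_count is
    estimated by m / rate with standard error sqrt(m * (1 - rate)) / rate.
    z=2.576 corresponds to roughly 99% confidence.
    """
    m = matches_in_sample
    estimate = m / rate
    half_width = z * math.sqrt(m * (1.0 - rate)) / rate
    return estimate, half_width

# Hypothetical input: 128,000 matching rows found in a 1% sample.
est, err = approx_count(matches_in_sample=128_000, rate=0.01)
print(f"{est:.0f} +/- {err:.0f} (99% confidence)")
```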

24 What is BlinkDB? A data analysis (warehouse) system that … -creates and maintains a variety of random and stratified samples from underlying data -returns fast, approximate answers with error bars by executing queries on samples of data -is compatible with Apache Hive, AMP Lab’s Shark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)

25 BlinkDB Architecture [Diagram: Command-line Shell, Thrift/JDBC; Execution Driver with SQL Parser, Query Optimizer, Physical Plan, SerDes, UDFs; Meta store; Hadoop/Spark/Presto; Hadoop Storage (e.g., HDFS, HBase, Presto)]

26 BlinkDB alpha-0.2.0 1. Released and available at http://blinkdb.org 2. Allows you to create random and stratified samples on native tables and materialized views 3. Adds approximate aggregate functions with statistical closed forms to HiveQL: approx_avg(), approx_sum(), approx_count() etc.

27 Feature Roadmap 1.Integrating BlinkDB with Facebook’s Presto and Shark as an experimental feature 2.Automatic Sample Management 3.More Hive Aggregates, UDAF Support 4.Runtime Correctness Tests

28 Automatic Sample Management SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS 234.23 ± 15.32 Goal: The API should abstract the details of creating, deleting and managing samples from the user

29 Automatic Sample Management SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS 239.46 ± 4.96 (WITHIN 1 SECONDS gave 234.23 ± 15.32) Goal: The API should abstract the details of creating, deleting and managing samples from the user

30 Automatic Sample Management SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0% Goal: The API should abstract the details of creating, deleting and managing samples from the user
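
To show how an error and confidence target might translate into a sample size, here is a sketch under simple assumptions: the aggregate is a mean, the error bound is absolute, and a normal approximation applies. The slides do not state how ERROR is interpreted, so `rows_needed` and its inputs are purely illustrative.

```python
import math
from statistics import NormalDist

def rows_needed(stdev, abs_error, confidence=0.95):
    """Smallest n with z * stdev / sqrt(n) <= abs_error."""
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return math.ceil((z * stdev / abs_error) ** 2)

# Hypothetical numbers: session times with a standard deviation of 120.
for eps in (15.0, 5.0):
    print(f"+/- {eps}: about {rows_needed(120.0, eps)} rows")
```

Because the required size grows with the inverse square of the error, tightening the bound by 3x needs roughly 9x the rows, which is why a system would want samples of several sizes available.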

31 Automatic Sample Management [Diagram: TABLE, Sampling Module, Original Data] Offline-sampling: Creates an optimal set of samples on native tables and materialized views based on query history and workload characteristics

32 Automatic Sample Management [Diagram: TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data] Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in-memory.

33 Automatic Sample Management [Diagram: HiveQL/SQL Query (SELECT foo (*) FROM TABLE WITHIN 2), Query Plan, Sample Selection; TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data]

34 Automatic Sample Management [Diagram: HiveQL/SQL Query (SELECT foo (*) FROM TABLE WITHIN 2), Query Plan, Sample Selection; TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data] Online sample selection to pick best sample(s) based on query latency and accuracy requirements

35 Automatic Sample Management [Diagram: HiveQL/SQL Query (SELECT foo (*) FROM TABLE WITHIN 2), Query Plan, Sample Selection, New Query Plan; Hive/Shark/Presto; TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data; Error Bars & Confidence Intervals; Result: 182.23 ± 5.56 (95% confidence)] Parallel query execution on multiple samples striped across multiple machines.

36 More Aggregates/ UDAF Generalized Error Estimators -Statistical Bootstrap -Applicable to complex and nested queries, UDAFs, joins etc.

37 More Aggregates/ UDAF Generalized Error Estimators -Statistical Bootstrap -Applicable to complex and nested queries, UDFs, joins etc. [Diagram labels: Sample, A, R, A1, A2, …, A100, B, ±ε]
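
A minimal sketch of the statistical bootstrap named on this slide, assuming a percentile interval over 100 resamples (the count of 100 echoes the A100 label in the diagram; whether the talk used that many is not stated). The function name `bootstrap_error` is hypothetical, and the data are synthetic.

```python
import random
import statistics

def bootstrap_error(sample, estimator, resamples=100, confidence=0.95, seed=None):
    """Half-width of a percentile bootstrap interval for `estimator`.

    Draws `resamples` resamples (with replacement, same size as `sample`),
    applies `estimator` to each, and reads the interval off the empirical
    quantiles of those estimates.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        estimator(rng.choices(sample, k=n)) for _ in range(resamples)
    )
    lo = estimates[int((1.0 - confidence) / 2.0 * (resamples - 1))]
    hi = estimates[int((1.0 + confidence) / 2.0 * (resamples - 1))]
    return (hi - lo) / 2.0

rng = random.Random(7)
sample = [rng.expovariate(1.0) for _ in range(500)]   # synthetic data
print("median:", statistics.median(sample),
      "+/-", bootstrap_error(sample, statistics.median, seed=7))
```

The estimator is passed in as an ordinary function, which is what makes the approach applicable to aggregates that have no closed-form error formula.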

38 Runtime Diagnostics 1. Given a query, how do you know if it can be approximated at runtime? -Depends on the query, data distribution, and sample size 2. Need for runtime diagnosis tests -Check whether error improves as sample size increases -Need to be extremely fast!
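
The check "whether error improves as sample size increases" can be sketched as follows. This is a deliberately crude stand-in, not BlinkDB's diagnostic: it assumes the aggregate is a mean and compares the standard error at growing subsample sizes against the ideal 1/sqrt(n) rate. The function name, the sizes, and the tolerance are all assumptions.

```python
import math
import random
import statistics

def error_shrinks(sample, sizes=(100, 400, 1600), tolerance=0.5, seed=None):
    """Crude check: does the standard error of the mean fall as n grows?

    For each pair of consecutive sizes, the observed error is compared with
    the previous error times the ideal factor sqrt(n_small / n_large); the
    check fails if it exceeds that by more than `tolerance` (relative).
    """
    rng = random.Random(seed)
    errors = []
    for n in sizes:
        sub = rng.sample(sample, n)
        errors.append(statistics.stdev(sub) / math.sqrt(n))
    for i in range(len(sizes) - 1):
        ideal = math.sqrt(sizes[i] / sizes[i + 1])
        if errors[i + 1] > errors[i] * ideal * (1.0 + tolerance):
            return False
    return True

rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # synthetic, well-behaved
print(error_shrinks(data, seed=3))
```

Only a few small subsamples are evaluated, in keeping with the slide's requirement that such tests be fast.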

39 Getting Started 1. BlinkDB alpha-0.2.0 released and available at http://blinkdb.org 2. Takes just 5-10 minutes to run it locally or to spin up an EC2 cluster 3. Hands-on Exercises in the afternoon!

40 Summary 1. Approximate queries are an important means of achieving interactivity in processing large datasets without significantly affecting the quality of results 2. BlinkDB: -returns approximate answers with error bars by executing queries on small samples of data -supports existing Hive/Shark/Presto queries 3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2013 (http://bit.ly/blinkdb-2) papers Thanks!

