Dynamic Sample Selection for Approximate Query Processing. Presented by Srikanth Vadada, University of Texas at Arlington, CSE 6339, Fall 2010, 23rd Sep 2010.

1 Dynamic Sample Selection for Approximate Query Processing
Brian Babcock, Surajit Chaudhuri, Gautam Das, ACM SIGMOD 2003
Presented by Srikanth Vadada, University of Texas at Arlington, CSE 6339, Fall 2010, 23rd Sep 2010

2 Three Key Terms for the Topic
- Dynamic Selection
- Biased Sample
- Approximate Query Processing (AQP)
Goal: dynamically construct an appropriate biased sample for approximate query processing.

3 Why Approximate Query Processing (AQP)?
- Rapid strides in data collection and management technologies
  - Result in very large databases
- Effective data analysis methods are the subject of ongoing research
  - Analysis queries require aggregation or summarization
  - Expensive running times
- Requirements of analysis systems (decision support systems)
  - Short query response time
  - Exactness of query results is less important
AQP techniques are the solution.

4 Why Sampling Techniques?
- Data analysis queries
- OLAP: materialized views of data cubes
- Building indexes
- Physical data design: uses pre-processing time and space
  - Effective when the query workload is known in advance
  - Expensive to build indexes for all possible queries
AQP and physical database design methods are complementary.

5 Approximate Query Processing (Related Work)
- Online Aggregation
  Hellerstein J., Haas P., Wang H. Online Aggregation, ACM SIGMOD 1997
  - Approximate answers are produced during the early stages of query processing
  - Answers are gradually refined until all the data is processed
Advantages:
  - No pre-processing required
  - Allows progressive refinement of answers at runtime
Disadvantages:
  - Requires random disk access (slow)
  - Requires changes to the query processor code

6 Approximate Query Processing (Related Work)
- Join Synopses: a sampling-based method
  Acharya S., Gibbons P. B., Poosala V., Ramaswamy S. Join Synopses for Approximate Query Answering, ACM SIGMOD 1999
  - Join queries: primary key / foreign key joins
  - Pre-computation: join of fact tables with dimension tables
Disadvantages:
  - Does not extend to queries that involve non-foreign-key joins

7 Approximate Query Processing (Related Work)
- Icicles: a weighted sampling method
  Ganti V., Lee M. L., Ramakrishnan R. ICICLES: Self-tuning Samples for Approximate Query Answering, VLDB 2000
  - Sampling is weighted by how frequently each tuple is accessed by the queries in the workload
Disadvantages:
  - Addresses only the low-selectivity problem

8 Dynamic Sample Selection
- “Appropriately biased” samples
  - Give accurate approximate results for most queries
  - The appropriately biased sample varies from query to query
- Previous sampling methods vs Dynamic Sample Selection
  - Previous sampling: a single sample with a fixed bias
  - Dynamic Sample Selection: an individually tailored sample for each query
  - Creation of subsamples is done offline (pre-processing)
  - Assembly into an overall sample is done online (runtime)

9 Effectiveness of Biased Sampling - Example
- Database consists of 100 product tuples
  - Product = “Stereo”: 90 tuples
  - Product = “TV”: 10 tuples
- Sampling 10 tuples in 2 ways
  - 1st sample: 10% of the tuples chosen uniformly, each with weight 10
  - 2nd sample: 0% of the “Stereo” tuples and 100% of the “TV” tuples, each with weight 1
- Query: count of “TV” tuples. Which sample always gives the correct answer?
  - 2nd sample: always gives the exact answer
  - 1st sample: exact only if exactly 1 of the TV tuples is chosen
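The example above can be checked with a small simulation. This is only a sketch under stated assumptions: the tuple layout, the function names and the fixed seed are illustrative and do not come from the presentation or the paper.

```python
import random

def estimate_tv_count(sample):
    """Sum the weights of sampled tuples whose product is 'TV'."""
    return sum(weight for product, weight in sample if product == "TV")

# 100 product tuples: 90 stereos and 10 TVs, as in the slide.
database = ["Stereo"] * 90 + ["TV"] * 10

def uniform_sample(rng):
    # 1st way: pick 10 of the 100 tuples uniformly; each stands for 10 tuples.
    return [(p, 10) for p in rng.sample(database, 10)]

def biased_sample():
    # 2nd way: keep every TV tuple and no stereo tuple; each stands for itself.
    return [(p, 1) for p in database if p == "TV"]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
uniform_estimates = [estimate_tv_count(uniform_sample(rng)) for _ in range(1000)]
biased_estimate = estimate_tv_count(biased_sample())

# Fraction of uniform trials that happened to hit the exact answer (10).
exact_fraction = sum(e == 10 for e in uniform_estimates) / len(uniform_estimates)
```

The biased sample returns 10 on every run, while the uniform sample returns a multiple of 10 that matches the true count only in the trials where exactly one TV tuple was drawn.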

10 Static vs Dynamic Sampling Selection
[Figure: DATA and SAMPLE shown side by side for Standard Sampling and Dynamic Sampling]

11 Dynamic Sample Selection Architecture (1 / 3)
Pre-Processing Phase
- Standard sampling methods do not take advantage of extra disk space
- DSS uses it effectively by creating a large sample containing a family of differently biased subsamples
- Step 1: examine the data distribution to create a set of biased samples; this results in overlapping strata
- Step 2: samples are created with potentially different sampling rates for each stratum
- Generate metadata describing the characteristics of each sample
[Figure: pre-processing phase, with components Query Workload, Data, Select Strata, Build Sample, Sample Data, Metadata]
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

12 Dynamic Sample Selection Architecture (2 / 3)
Runtime Phase
- When queries are issued at runtime, DSS rewrites them to run against the sample tables
- The appropriate sample tables to use are determined by comparing the query with the metadata
- The algorithms for choosing which samples to build during pre-processing and which samples to use for query processing are not described
[Figure: runtime phase, with components Query, Metadata, Sample Data, Choose Samples, Rewrite Query]
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

13 Dynamic Sample Selection Architecture (3 / 3)
- Policies for sample selection
  - The choice of sample is guided by the syntax of the incoming query
- Examples
  - A separate sample for each table; choose the sample based on the FROM clause
  - A separate sample for each pre-specified aggregate expression; choose the sample based on the SELECT clause

14 Motivation - Small Group Sampling (1 / 2)
- Sampling for answering aggregation queries with “group-bys”
  Ex: SELECT DEPTID, COUNT(*) FROM EMP_DEPT GROUP BY DEPTID
- Uniform sampling gives each group a weight proportional to the number of tuples falling in that group
- Skew in the data distribution results in large groups being oversampled (and small groups undersampled)
- A heuristic method is needed for building samples that satisfy most group-by queries
- Uniform sampling is good at providing estimates for larger groups

15 Motivation - Small Group Sampling (2 / 2)
- Small groups are the problem
  - Since the groups are small, take advantage of that
  - Scan all tuples contributing to small groups
  - Need to identify the small groups
- Small Group Sampling method
  - Overall sample: built by uniform sampling over the data
  - Small group tables: hold the rows from small groups
  - No downsampling for small group tables: 100% of their rows are taken
[Figure: Large Groups vs Small Groups]

16 Small Group Sampling - Example
Example: aggregation queries with “group-bys”
  select Age, Income, count(*) from Employee_Tbl group by Age, Income

Age   Income (in 1000s)   Designation
30    60                  Developer
35    70                  Developer
25    60                  Developer
25    70                  Lead
30    70                  Lead
25    70                  Developer
25    70                  Developer
30    60                  Developer
30    60                  Developer
40    70                  Developer
35    100                 Manager
25    60                  Developer
25    70                  Developer
35    100                 Developer
30    60                  Developer

17 Small Group Sampling - Illustration (1/3)
- Overall sample: perform uniform sampling on large groups
- Small group tables: one or more sample tables for smaller groups
Pre-Processing Phase:
1. Create an overall sample s_overall
2. Build the “Age” histogram (column index 0)
- r: base sampling rate, determines the size of the overall sample (e.g., 30%)
- t: small group fraction, the maximum size of each small group table (e.g., 20%)
[Figure: “Age” histogram with the small group table s_age and s_overall]

18 Small Group Sampling - Illustration (2/3)
Pre-Processing Phase: “Income” histogram (column index 1)
[Figure: “Income” histogram with the tables s_income and s_overall; the value 011 appears as a label]
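The pre-processing phase illustrated on slides 17 and 18 can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the rule used here for picking small groups (take the rarest values of a column while they cover at most a fraction t of the rows) is one plausible reading of the parameter t, and the helper names `small_values` and `preprocess` are hypothetical. The rows are the Age and Income columns of the table on slide 16.

```python
import random
from collections import Counter

COLUMNS = ("Age", "Income")  # column index 0 and 1, as on the slides
ROWS = [(30, 60), (35, 70), (25, 60), (25, 70), (30, 70),
        (25, 70), (25, 70), (30, 60), (30, 60), (40, 70),
        (35, 100), (25, 60), (25, 70), (35, 100), (30, 60)]

def small_values(rows, col, t):
    """Rarest values of one column that together cover at most t of the rows."""
    counts = Counter(row[col] for row in rows)
    budget, chosen = t * len(rows), set()
    for value, freq in sorted(counts.items(), key=lambda kv: kv[1]):
        if freq > budget:
            break
        chosen.add(value)
        budget -= freq
    return chosen

def preprocess(rows, r=0.30, t=0.20, seed=0):
    small = [small_values(rows, c, t) for c in range(len(COLUMNS))]
    # Bit c of the bitmask is set when the row falls in a small group of column c.
    tagged = [row + (sum(1 << c for c in range(len(COLUMNS)) if row[c] in small[c]),)
              for row in rows]
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    s_overall = rng.sample(tagged, int(r * len(tagged)))  # uniform sample at rate r
    # Small group tables keep 100% of the rows of their small groups.
    tables = {f"s_{name.lower()}": [row for row in tagged if row[-1] & (1 << c)]
              for c, name in enumerate(COLUMNS)}
    return s_overall, tables, small

s_overall, small_tables, small = preprocess(ROWS)
```

Under these assumptions, Age = 40 and Income = 100 are the only small-group values, so s_age holds one row and s_income holds two, each carrying its bitmask as a third field.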

19 Small Group Sampling - Illustration (3/3)
Runtime Phase:
Sample tables: s_age (column index 0, bit 001), s_income (column index 1, bit 010), s_overall
Query issued:
  SELECT Age, Income, count(*) FROM Employee_Tbl GROUP BY Age, Income
Rewritten query:
  SELECT Age, Income, count(*) FROM s_age
  GROUP BY Age, Income
  UNION ALL
  SELECT Age, Income, count(*) FROM s_income
  WHERE (Bitmask & 1) = 0 /* 1 = 2^0, i.e. 001; e.g. 010 & 001 = 000 (row kept), 011 & 001 = 001 (row skipped) */
  GROUP BY Age, Income
  UNION ALL
  SELECT Age, Income, count(*) * (100.0/30) FROM s_overall
  WHERE (Bitmask & 3) = 0 /* 3 = 2^0 + 2^1, i.e. 011; e.g. 001 & 011 = 001, 011 & 011 = 011, 010 & 011 = 010 (all skipped) */
  GROUP BY Age, Income
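The rewritten query can be tried out with an in-memory SQLite database. The table contents below are hypothetical (a hand-picked s_overall of four rows standing in for a 30% sample, plus small group tables for Age = 40 and Income = 100), so this is a sketch of the rewriting idea only. The bitmask tests are parenthesized and the WHERE clauses placed before GROUP BY so that the statement is valid SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("s_age", "s_income", "s_overall"):
    conn.execute(f"CREATE TABLE {table} (Age INTEGER, Income INTEGER, Bitmask INTEGER)")

# Hypothetical sample tables; bit 0 marks a small Age group, bit 1 a small Income group.
conn.executemany("INSERT INTO s_age VALUES (?, ?, ?)", [(40, 70, 1)])
conn.executemany("INSERT INTO s_income VALUES (?, ?, ?)", [(35, 100, 2), (35, 100, 2)])
conn.executemany("INSERT INTO s_overall VALUES (?, ?, ?)",
                 [(30, 60, 0), (25, 70, 0), (35, 100, 2), (25, 60, 0)])

REWRITTEN = """
SELECT Age, Income, COUNT(*) FROM s_age
GROUP BY Age, Income
UNION ALL
SELECT Age, Income, COUNT(*) FROM s_income
WHERE (Bitmask & 1) = 0          -- skip rows already counted via s_age
GROUP BY Age, Income
UNION ALL
SELECT Age, Income, COUNT(*) * (100.0 / 30) FROM s_overall
WHERE (Bitmask & 3) = 0          -- skip rows already counted via s_age or s_income
GROUP BY Age, Income
"""

# One (Age, Income, estimated count) tuple per group, sorted for readability.
answer = sorted(conn.execute(REWRITTEN).fetchall())
```

Groups served by a small group table come back with exact counts, groups served by s_overall come back scaled by 100/30, and the (35, 100) row that also landed in s_overall is filtered out by the bitmask so it is not counted twice.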

20 Accuracy Metrics (1 / 2)
- As many groups as possible should be preserved in the approximate answer
- The error in the aggregate value for each group should be small
Let $Q$ be an aggregation query and $G = \{g_1, g_2, g_3, \dots, g_n\}$ the set of $n$ groups in the answer to $Q$, where $x_i$ is the aggregate value for group $g_i$.
Let $A$ be an approximate answer to $Q$ and $G' = \{g_{i_1}, g_{i_2}, g_{i_3}, \dots, g_{i_m}\}$ the set of $m$ groups in $A$, where $x'_{i_j}$ is the aggregate value for group $g_{i_j}$.

21 Accuracy Metrics (2 / 2)
- Percentage of groups from Q missed by A
- Average relative error on Q of A
- Average squared relative error on Q of A
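The formulas for these three metrics did not survive in the transcript, so the following is one plausible formalization, not a reproduction of the slide. It assumes that every group missed by A contributes a relative error of 1 (100%), that all exact aggregate values are nonzero, and it uses the hypothetical key name `SqRelErr` for the squared variant.

```python
def accuracy_metrics(exact, approx):
    """Compare an approximate group-by answer with the exact one.

    exact, approx: dicts mapping a group key to its aggregate value.
    Groups that appear only in approx are ignored.
    """
    n = len(exact)
    found = [g for g in exact if g in approx]
    missed = n - len(found)
    # Relative error of each group present in both answers.
    rel = [abs(exact[g] - approx[g]) / abs(exact[g]) for g in found]
    return {
        "PctGroups": 100.0 * missed / n,
        "RelErr": (missed + sum(rel)) / n,
        "SqRelErr": (missed + sum(e * e for e in rel)) / n,
    }

# Made-up example: two of four groups are missed, the other two are off by 10%.
metrics = accuracy_metrics({"A": 100, "B": 50, "C": 10, "D": 2},
                           {"A": 110, "B": 45})
```

Charging a full unit of error for each missed group keeps an answer from looking accurate simply because it dropped the groups it could not estimate.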

22 Experimental Results (1 / 2)
- TPC-H database, COUNT and SUM queries, number of columns in all tables = 245
- RelErr and PctGroups increased for both Uniform Sampling and Small Group Sampling
- The increase was more pronounced for Uniform Sampling
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

23 Experimental Results (2 / 2)
- Uniform sampling outperforms small group sampling at low skew
- Small group sampling does better at moderate to high skew
- Speedup decreases as the number of grouping columns increases
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

24 Conclusions
- Dynamic Sample Selection improves on previous AQP methods
- It productively utilizes additional disk space
- Small Group Sampling targets aggregate queries with group-bys
- Small Group Sampling outperforms other techniques

25 References
- Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.
- http://crystal.uta.edu/~cse6339/Fall08DBIR.htm
- http://crystal.uta.edu/~cse6339/Fall09DBIR.htm

26 Q & A
Questions?

