University of Texas at Arlington. Presented by Srikanth Vadada. Fall 2010, CSE 6339, 23rd Sep 2010. Dynamic Sample Selection for Approximate Query Processing.

Presentation transcript:

Dynamic Sample Selection for Approximate Query Processing
Brian Babcock, Surajit Chaudhuri, Gautam Das, ACM SIGMOD 2003
Presented by Srikanth Vadada, University of Texas at Arlington
Fall 2010, CSE 6339, 23rd Sep 2010

3 Key Terms for the Topic
• Dynamic Selection
• Biased Sample
• Approximate Query Processing (AQP)
Goal: Dynamically construct an appropriate biased sample for approximate query processing.

Why Approximate Query Processing (AQP)?
• Rapid strides in data collection and management technologies
  - Resulting in very large databases
• Effective data analysis methods are a subject of ongoing research
  - Analysis queries require aggregation or summarization
  - Expensive running times
• Requirements of analysis systems (decision support systems)
  - Short query response time
  - Exactness of query results is less important
AQP techniques are the solution.

Why Sampling Techniques?
• Data analysis queries
  - OLAP: materialized views of data cubes
  - Building indexes
• Physical data design: uses preprocessing time and space
  - Effective when the query workload is known in advance
  - Expensive to build indexes for all possible queries
AQP and physical database design methods are complementary.

University of Texas at Arlington Approximate Query Processing (Related Work)  Online Aggregation Hellerstein J., Haas P., Wang H. Online Aggregation, CM SIGMOD 1997  Approximate answers are produced during early stages  Gradual refinement until data is processed Advantages: No pre-processing required Allows progressive refinement of answers at runtime Disadvantages : Require random disk access (slow). Requires query processor code change.

University of Texas at Arlington Approximate Query Processing (Related Work)  Join Synopses - Sampling based method Acharya S., Gibbons P. B., Poosala V., Ramaswamy S. Join Synopses for Approximate Query Answering, ACM SIGMOD 1999  Join Queries – Primary Key Joins  Pre Computation - Join of Fact Tables with Dimension Tables Disadvantages: Does not extend to queries that involve non-foreign key joins

University of Texas at Arlington Approximate Query Processing (Related Work)  Icicles - Weighted Sampling method Ganti V., Lee M. L., Ramakrishnan R. ICICLES: Self-tuning Samples for Approximate Query Answering, VLDB,  Frequency of Tuple Access by Queries in Workload Disadvantages: Addresses only low selectivity problem

University of Texas at Arlington Dynamic Sample Selection  “Appropriate Biased” Samples  Give accurate approximate results for most queries  Appropriate Biased Sample varies from Query to Query  Previous Sampling Methods vs Dynamic Sample Selection  Previous sampling - Single Sample with fixed bias  Individual Tailored Sample for each query  Creation of subsamples is done offline - Preprocessing  Assembly into an overall sample is done online - Runtime

Effectiveness of Biased Sampling: Example
• Database consists of 100 Product tuples
  - Product = “Stereo”: 90 tuples
  - Product = “TV”: 10 tuples
• Sampling 10 tuples in 2 ways
  - 1st sample: 10% of the tuples chosen uniformly, each with weight 10
  - 2nd sample: 0% of the “Stereo” tuples and 100% of the “TV” tuples, each with weight 1
• Query: count of “TV” tuples. Which sample always gives a correct answer?
  - 2nd sample: always gives the exact answer
  - 1st sample: only if exactly 1 of the TV tuples is chosen
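The arithmetic behind this example can be checked with a short sketch. It assumes the uniform sample is exactly 10 tuples drawn without replacement, which the slide does not state explicitly. The names `uniform_estimate` and `prob_tv_in_uniform_sample` are illustrative only.

```python
from math import comb

TOTAL_TUPLES = 100   # 90 "Stereo" + 10 "TV", as on the slide
TV_TUPLES = 10
SAMPLE_SIZE = 10     # assumption: a fixed-size sample drawn without replacement


def uniform_estimate(tv_in_sample, weight=10):
    """COUNT estimate from the uniform sample: each sampled tuple stands for 10."""
    return tv_in_sample * weight


def prob_tv_in_uniform_sample(k):
    """Hypergeometric probability that exactly k TV tuples land in the sample."""
    return (comb(TV_TUPLES, k)
            * comb(TOTAL_TUPLES - TV_TUPLES, SAMPLE_SIZE - k)
            / comb(TOTAL_TUPLES, SAMPLE_SIZE))


# Biased sample: all 10 TV tuples, each with weight 1, so the count is exact.
biased_estimate = TV_TUPLES * 1

# Uniform sample: the estimate is exact only when exactly one TV tuple is drawn.
p_exact = prob_tv_in_uniform_sample(1)

print("biased estimate:", biased_estimate)
print("uniform estimate with 1 TV tuple sampled:", uniform_estimate(1))
print("probability the uniform estimate is exact:", round(p_exact, 3))
```

Under these assumptions the uniform sample is exact less than half the time, while the biased sample is exact by construction for this particular query.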

Static vs Dynamic Sample Selection
[Figure: two panels, "Standard Sampling" and "Dynamic Sampling", each showing data and the samples drawn from it]

Dynamic Sample Selection Architecture (1/3)
Pre-Processing Phase
• Standard sampling methods do not take advantage of extra disk space.
• DSS uses it effectively by creating a large sample containing a family of differently biased subsamples.
• Step 1: Examine the data distribution to create a set of biased samples. This results in overlapping strata.
• Step 2: Samples are created with potentially different sampling rates for each stratum.
• Generate metadata describing the characteristics of each sample.
[Figure: Query Workload and Data feed Select Strata, then Build Sample, producing Sample Data and Metadata]
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Dynamic Sample Selection Architecture (2/3)
Runtime Phase
• When queries are issued at runtime, DSS rewrites them to run against the sample tables.
• The appropriate sample tables to use are determined by comparing the query with the metadata.
• Algorithms for choosing which samples to build in pre-processing and which samples to use for query processing are not described.
[Figure: Query, Metadata and Sample Data feed Choose Samples, then Rewrite Query]
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Dynamic Sample Selection Architecture (3/3)
• Policies for Sample Selection
  - The choice of sample is guided by the syntax of the incoming query.
Examples:
  - A separate sample for each table, with the sample chosen based on the FROM clause
  - A separate sample for each pre-specified aggregate expression, with the sample chosen based on the SELECT clause

Motivation: Small Group Sampling (1/2)
• Sampling for answering aggregation queries with “group-bys”
  Ex: SELECT DEPTID, COUNT(*) FROM EMP_DEPT GROUP BY DEPTID
  - Uniform sampling gives each group a weight proportional to the number of tuples falling in the group
  - Skew in the data distribution results in oversampling of the large groups, leaving small groups under-represented
• Heuristic method for building samples that satisfy most group-by queries
  - Uniform sampling provides good estimates for the larger groups
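Why small groups suffer under uniform sampling can be made concrete with a one-line calculation. The sketch below assumes Bernoulli sampling, where each row is kept independently with probability `rate`. That is a simplifying assumption, not something the slides specify, and `miss_probability` is an illustrative name.

```python
def miss_probability(group_size, rate):
    """Chance that a uniform (Bernoulli) sample keeps no row at all from a group.

    Each of the group's rows is dropped independently with probability 1 - rate,
    so the whole group disappears with probability (1 - rate) ** group_size.
    """
    return (1.0 - rate) ** group_size


# With a 1% sample, tiny groups are almost always missing from the answer,
# while groups with hundreds of rows are almost always represented.
for size in (1, 5, 50, 500):
    print(f"group of {size:>3} rows: missed with probability "
          f"{miss_probability(size, rate=0.01):.4f}")
```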

Motivation: Small Group Sampling (2/2)
• Small groups are the problem.
• Since the groups are small, take advantage of that
  - Scan all tuples contributing to small groups
  - Need to identify the small groups
• Small Group Sampling method
  - Overall sampling using uniform sampling: the overall sample
  - Rows from small groups: the small group tables
  - No downsampling for small group tables: 100% of the rows are taken
[Figure: large groups vs small groups]

Small Group Sampling: Example
• Example: aggregation queries with “group-bys”
  select Age, Income, count(*) from Employee_Tbl group by Age, Income

Age  Income (in 1000s)  Designation
30   60                 Developer
35   70                 Developer
25   60                 Developer
25   70                 Lead
30   70                 Lead
25   70                 Developer
25   70                 Developer
30   60                 Developer
30   60                 Developer
40   70                 Developer
35   100                Manager
25   60                 Developer
25   70                 Developer
35   100                Developer
30   60                 Developer

Small Group Sampling: Illustration (1/3)
• Overall sample: uniform sampling, which serves the large groups.
• Small group tables: one or more sample tables for the smaller groups.
Pre-Processing Phase:
1. Create an overall sample s_overall.
2. Build the “Age” histogram (Column Index: 0).
Parameters:
  - r: base sampling rate, determines the size of the overall sample (e.g., 30%)
  - t: small group fraction, the maximum size of each small group table (e.g., 20%)
[Figure: Age histogram, small group table s_age, and s_overall]

Small Group Sampling: Illustration (2/3)
Pre-Processing Phase: “Income” histogram (Column Index: 1)
[Figure: Income histogram, small group table s_income, and s_overall, with bitmask label 011]
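The pre-processing steps illustrated above can be sketched as follows. This is a minimal sketch rather than the paper's algorithm. It assumes a value is "rare" when it and all less frequent values of the same column together fill at most a fraction t of the table. It also assumes the overall sample is a Bernoulli sample at rate r, and that bit i of the bitmask marks a rare value in column index i. The helper names `rare_values` and `preprocess` are hypothetical.

```python
import random
from collections import Counter

# Rows from the example slide: (age, income_in_thousands, designation).
EMPLOYEES = [
    (30, 60, "Developer"), (35, 70, "Developer"), (25, 60, "Developer"),
    (25, 70, "Lead"), (30, 70, "Lead"), (25, 70, "Developer"),
    (25, 70, "Developer"), (30, 60, "Developer"), (30, 60, "Developer"),
    (40, 70, "Developer"), (35, 100, "Manager"), (25, 60, "Developer"),
    (25, 70, "Developer"), (35, 100, "Developer"), (30, 60, "Developer"),
]


def rare_values(rows, col, t):
    """Least frequent values of column `col` that together fill at most t of the table."""
    budget = t * len(rows)
    counts = Counter(row[col] for row in rows)
    rare, used = set(), 0
    # Visit the least frequent values first; ties are broken by value for determinism.
    for value, count in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        if used + count > budget:
            break
        rare.add(value)
        used += count
    return rare


def preprocess(rows, group_cols, r, t, seed=0):
    """Build one small group table per column plus a uniform overall sample.

    Every stored row gets a trailing bitmask: bit `col` is set when the row's
    value in column `col` is rare (assumed encoding, see the lead-in).
    """
    rng = random.Random(seed)
    rare = {col: rare_values(rows, col, t) for col in group_cols}
    small_tables = {col: [] for col in group_cols}
    overall = []
    for row in rows:
        mask = 0
        for col in group_cols:
            if row[col] in rare[col]:
                mask |= 1 << col
        tagged = row + (mask,)
        for col in group_cols:
            if mask & (1 << col):
                small_tables[col].append(tagged)   # small groups are kept at 100%
        if rng.random() < r:
            overall.append(tagged)                 # uniform sample at rate r
    return small_tables, overall


small_tables, s_overall = preprocess(EMPLOYEES, group_cols=[0, 1], r=0.3, t=0.2)
print("s_age:", small_tables[0])
print("s_income:", small_tables[1])
print("s_overall size:", len(s_overall))
```

With t = 20% on the 15 example rows, only Age = 40 and Income = 100 qualify as rare under this rule, so s_age holds one row and s_income holds two.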

Small Group Sampling: Illustration (3/3)
Runtime Phase:
Query issued:
    SELECT Age, Income, count(*) FROM Employee_tbl GROUP BY Age, Income
Rewritten query:
    SELECT Age, Income, count(*) FROM s_age
    GROUP BY Age, Income
    UNION ALL
    SELECT Age, Income, count(*) FROM s_income
    WHERE (Bitmask & 1) = 0   /* 1 = 001: skip rows already counted in s_age (e.g., 010 & 001 = 000; 011 & 001 = 001) */
    GROUP BY Age, Income
    UNION ALL
    SELECT Age, Income, count(*) * (100.0/30) FROM s_overall
    WHERE (Bitmask & 3) = 0   /* 3 = 011: skip rows held in any small group table (e.g., 001 & 011 = 001; 010 & 011 = 010; 011 & 011 = 011) */
    GROUP BY Age, Income
[Figure: s_age, s_income and s_overall tables with their column indexes]
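The effect of the rewritten query can be traced in a few lines of Python. The sample tables below are hand-built and hypothetical: the overall sample is assumed to be a 25% sample rather than the slide's 30%, purely to keep the scaling arithmetic exact. Each row carries a bitmask in which bit 0 marks a rare Age and bit 1 a rare Income.

```python
from collections import Counter

AGE_BIT = 1      # 001: row is in the Age small group table
INCOME_BIT = 2   # 010: row is in the Income small group table


def answer_group_by_count(s_age, s_income, s_overall, rate):
    """Combine the three sample tables the way the UNION ALL query does.

    Rows are (age, income, bitmask). Small group tables are unsampled, so
    their rows count once; overall-sample rows are scaled up by 1 / rate.
    The bitmask filters ensure that no row is counted twice.
    """
    estimates = Counter()
    for age, income, mask in s_age:
        estimates[(age, income)] += 1.0
    for age, income, mask in s_income:
        if (mask & AGE_BIT) == 0:                  # not already counted via s_age
            estimates[(age, income)] += 1.0
    for age, income, mask in s_overall:
        if (mask & (AGE_BIT | INCOME_BIT)) == 0:   # not in any small group table
            estimates[(age, income)] += 1.0 / rate
    return dict(estimates)


# Hypothetical sample tables for the Employee example.
s_age = [(40, 70, 1)]
s_income = [(35, 100, 2), (35, 100, 2)]
s_overall = [(30, 60, 0), (25, 70, 0), (25, 60, 0), (35, 100, 2)]

result = answer_group_by_count(s_age, s_income, s_overall, rate=0.25)
for group, estimate in sorted(result.items()):
    print(group, estimate)
```

The (35, 100) row that happens to fall in the overall sample is filtered out by its bitmask, so that group's count comes only from s_income and stays exact.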

Accuracy Metrics (1/2)
• As many groups as possible should be preserved in the approximate answer.
• The error in the aggregate value for each group should be small.
Let $Q$ be an aggregation query.
Let $G = \{g_1, g_2, g_3, \dots, g_n\}$ be the set of $n$ groups in the answer to $Q$, and let $x_i$ be the aggregate value for group $g_i$.
Let $A$ be an approximate answer to $Q$.
Let $G' = \{g_{i_1}, g_{i_2}, g_{i_3}, \dots, g_{i_m}\}$ be the set of $m$ groups in $A$, and let $x'_{i_j}$ be the aggregate value for group $g_{i_j}$.

Accuracy Metrics (2/2)
• Percentage of groups from Q missed by A
• Average relative error on Q of A
• Average squared relative error on Q of A
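A minimal sketch of how these three metrics could be computed. It assumes that a group missing from A is charged a relative error of 1 (100%) in both error averages, and that both averages are taken over all n groups of Q. We understand the SIGMOD 2003 paper to define them this way, but the exact formulas are not present in this transcript, so treat that as an assumption. The function name `accuracy_metrics` is illustrative.

```python
def accuracy_metrics(exact, approx):
    """Return (pct_groups_missed, avg_rel_err, avg_sq_rel_err).

    `exact` maps each group of Q to its true aggregate value (assumed non-zero);
    `approx` maps each group that appears in A to its estimated value.
    Assumption: a group missing from A contributes an error of 1 to both averages.
    """
    n = len(exact)
    present = [g for g in exact if g in approx]
    missed = n - len(present)

    rel_errors = [abs(exact[g] - approx[g]) / abs(exact[g]) for g in present]

    pct_groups = 100.0 * missed / n
    rel_err = (missed + sum(rel_errors)) / n
    sq_rel_err = (missed + sum(e * e for e in rel_errors)) / n
    return pct_groups, rel_err, sq_rel_err


# Hypothetical answers: group "d" is missed and group "b" is off by 25%.
exact = {"a": 10, "b": 20, "c": 5, "d": 1}
approx = {"a": 10, "b": 15, "c": 5}
print(accuracy_metrics(exact, approx))
```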

Experimental Results (1/2)
• TPC-H database, COUNT and SUM queries, number of columns in all tables = 245
• RelErr and PctGroups increased for both Uniform Sampling and Small Group Sampling
• The increase was more pronounced for Uniform Sampling
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Experimental Results (2/2)
• Uniform Sampling outperforms Small Group Sampling at low skew
• Small Group Sampling does better at moderate to high skew
• Speedup decreases as the number of grouping columns increases
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Conclusions
• Dynamic Sample Selection improves on previous AQP methods
• It productively utilizes additional disk space
• Small Group Sampling targets aggregate queries with group-bys
• Small Group Sampling outperforms other techniques

References
Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Q & A
Questions?