Presented By Anirban Maiti Chandrashekar Vijayarenu

Slides:



Advertisements
Similar presentations
Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.
Advertisements

Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHARI RAJEEV MOTWANI VIVEK NARASAYYA PRESENTED BY, JEEVAN KUMAR GOGINENI SARANYA GOTTIPATI.
Fast Algorithms For Hierarchical Range Histogram Constructions
Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes Xiaohui Yu 1, Calisto Zuzarte 2, Ken Sevcik 1 1 University of Toronto.
Linked Bernoulli Synopses Sampling Along Foreign Keys Rainer Gemulla, Philipp Rösch, Wolfgang Lehner Technische Universität Dresden Faculty of Computer.
Brian Babcock Surajit Chaudhuri Gautam Das at the 2003 ACM SIGMOD International Conference By Shashank Kamble Gnanoba.
Harikrishnan Karunakaran Sulabha Balan CSE  Introduction  Icicles  Icicle Maintenance  Icicle-Based Estimators  Quality & Performance  Conclusion.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes.
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
Optimal Workload-Based Weighted Wavelet Synopsis
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : Surajid Chaudhuri Gautam Das Vivek Narasayya Presented by :Sayed.
Karl Schnaitter and Neoklis Polyzotis (UC Santa Cruz) Serge Abiteboul (INRIA and University of Paris 11) Tova Milo (University of Tel Aviv) Automatic Index.
Mutual Information Mathematical Biology Seminar
Chapter 6: Database Evolution Title: AutoAdmin “What-if” Index Analysis Utility Authors: Surajit Chaudhuri, Vivek Narasayya ACM SIGMOD 1998.
1 Algorithms for massive data sets Lecture 3 (March 2, 2003) Synopses, Samples & Sketches.
Cost-Based Plan Selection Choosing an Order for Joins Chapter 16.5 and16.6 by:- Vikas Vittal Rao ID: 124/227 Chiu Luk ID: 210.
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
O VERCOMING L IMITATIONS OF S AMPLING FOR A GGREGATION Q UERIES Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.
Chap 20-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 20 Sampling: Additional Topics in Sampling Statistics for Business.
Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Ronda Hilton.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.
Physical Database Design I, Ch. Eick 1 Physical Database Design I About 25% of Chapter 20 Simple queries:= no joins, no complex aggregate functions Focus.
C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009.
Lecture 4. Sampling is the process of selecting a small number of elements from a larger defined target group of elements such that the information gathered.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.
Join Synopses for Approximate Query Answering Swarup Acharya, Philip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy By Vladimir Gamaley.
Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University.
Lecture 15- Parallel Databases (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch
GSLPI: a Cost-based Query Progress Indicator
Ayyat IT Group Murad Faridi Roll NO#2492 Muhammad Waqas Roll NO#2803 Salman Raza Roll NO#2473 Junaid Pervaiz Roll NO#2468 Instructor :- “ Madam Sana Saeed”
Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha ( ) Deepak Anand ( ) By:
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented By: Vivek Tanneeru.
Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy.
What is OLAP?.
HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries Xiaohui Yu University of Toronto Joint work with Nick Koudas.
1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Proceedings of the.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
University of Texas at Arlington Presented By Srikanth Vadada Fall CSE rd Sep 2010 Dynamic Sample Selection for Approximate Query Processing.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339.
Dense-Region Based Compact Data Cube
Sampling From Populations
BlinkDB.
BlinkDB.
A paper on Join Synopses for Approximate Query Answering
Ripple Joins for Online Aggregation
Anthony Okorodudu CSE ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan.
Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang
Overcoming Limitations of Sampling for Aggregation Queries
Chapter 15 QUERY EXECUTION.
Spatial Online Sampling and Aggregation
Load Shedding Techniques for Data Stream Systems
Reading Report 6 Yin Chen 5 Mar 2004
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
Data Transformations targeted at minimizing experimental variance
Presented by: Mariam John CSE /14/2006
Presentation transcript:

Presented By Anirban Maiti Chandrashekar Vijayarenu Dynamic Sample Selection for Approximate Query Processing Brain Babcock (Stanford Univ) Surajit Chaudhuri (Microsoft Research) Gautam Das (Microsoft Research) Presented By Anirban Maiti Chandrashekar Vijayarenu

Related Work Research on OLAP query processing > Approximate query processing (AQP) Star Schema for a Warehouse Implementing these schema involves building indexes on commonly-queried attributes (derived from known query workload). Pros: Speeds up specific queries (when the workload is known in advance) Cons: Expensive to build indexes to cover all possible queries. Fails to perform for ad-hoc analysis queries. We need AQP technology along with physical database design technology. Why Approximate Query Processing?

Related Studies – Online Aggregation Hellerstein J., Haas P., Wang H. Online Aggregation, CM SIGMOD 1997. Approximate answers are produced during early stages of query processing. Then gradually refined until all data has been processed. Advantages: No pre-processing required. Allows progressive refinement of answers at runtime. Disadvantages: Require random disk access (slow). Requires query processor code change. SO, Fails to have online approach to AQP. Pre-processing is required. Offline Sampling based approach were of interest for AQP systems. Why Sampling based approach for Approximate Query Processing? (Online sampling approach)

Icicles – Low Selectivity Problem Recognizing the low selectivity problem, designing a biased sample that is based on known workload information was attempted by Ganti, Lee & Ramakrishnan (2000) Tuples that have been accessed by many queries in the workload were assigned greater probabilities of being selected into the sample. Disadvantage : Focuses only on the low selectivity problem (Offline sampling approach)

Disadvantage: Focuses only on the data variance problem. Outlier Indexing Chaudhuri S., Das G., Datar M., Motwani R., Narasayya V. Overcoming Limitations of Sampling for Aggregation Queries, ICDE 2001 Offline sampling-based approximations for aggregate queries when the attribute being aggregated has a skewed distribution. Disadvantage: Focuses only on the data variance problem.

Join Synopsis – AQUA @ Bell Labs Developed for certain types of join queries (particularly primary-key joins) involved pre-computing the join of samples of fact tables with dimension tables so that at runtime queries only need to be executed against single sample tables Disadvantage: Does not extend to queries that involve non-foreign key joins.

Congressional Sampling – AQUA @ Bell Labs Group by queries with aggregation Approach: Stratifies the database by considering the set of queries involving all possible combinations of grouping columns, and produces a weighted sample that balances the approximation errors of these queries. Disadvantages: Naïve scheme. But does not minimize the error for any of the well-known error metrics. Pre-processing time α 2g , g = potential group-by column

STRAT – Stratified Sampling The authors here formulated the problem of pre-computing a sample as an optimization problem, whose goal is to minimize the error for the given workload. Introduced a generalized model of the workload (“lifted workload”) Sample Selection from lifted workload - stratified sampling Benefits of this systematic approach are demonstrated by theoretical results. Disadvantage: ??

Standard Sampling AQP - Architecture Might give good answers when an attribute has a skewed data distribution. Might be a good fit for commonly used queries. Can we use this pre-processed sample(uniform/non-uniform) for any kind of query?

Dynamic Sample Selection Attempts to strike a middle ground between pre-computed and online sampling Scanning or storing significantly larger amounts of data during pre-processing Access a small amount of stored data at runtime accept greater disk usage for summary structures than other sampling based AQA methods in order to increase accuracy in query responses while holding query response time constant

Dynamic Sample Selection - Architecture Pre-Processing Phase: Query Workload Sample Data Select Strata Build Sample Data Meta- Data

Dynamic Sample Selection - Architecture Pre-Processing Phase:

Dynamic Sample Selection - Architecture Runtime Phase: Query Sample Data Choose Samples Rewrite Query Meta- Data

Motivation – Small Group Sampling Example : Aggregation queries with “group-bys” select Age, Income, count(*) from T group by Age, Income Uniform sampling ? Congressional sampling? Outlier Indexing? Small group sampling technique makes use of Dynamic Sample selection Architecture Treat small and large groups differently Age Income Designation 30 60 Developer 35 70 25 Lead 40 100 Manager It is the small groups that are the problem case for uniform sampling. Small group sampling technique makes use of Dynamic Sample selection Architecture. A disadvantage of this approach was that the primary emphasis was on the data variance problem, and while the authors did propose a hybrid solution for both the data variance as well as the low selectivity problem, the proposed solution was heuristical and therefore, suboptimal.

Small Group Sampling Approach: Pre-Processing Phase: Overall sample – perform uniform sampling on large groups. Small group tables - one or more sample tables for smaller groups. Pre-Processing Phase: Create a overall sample s_overall 2. “Age” Histogram ( Column Index: 0) r : Base Sampling rate, determines the size of Overall Sample (eg, 30%) t : small group fraction, max size of each small group table (eg, 20%) s_overall Age Income Designation Bitmask 25 60 Developer 000 70 Lead 30 35 100 001 Age Income Designation Bitmask 25 60 Developer 000 70 Lead 30 35 100 s_age Age Income Designation Bitmask 35 100 Developer 001 Manager 40 70 Small group table

Pre-processing Phase “Income” Histogram(Column Index: 1) s_overall Age Income Designation Bitmask 25 60 Developer 000 70 Lead 30 35 100 011 s_income Age Income Designation Bitmask 35 100 Manager 010 Developer 011 011

Runtime Phase s_overall s_age (Column Index: 0 - 001) SELECT Age, Income, count(*) FROM Employee_tbl GROUP BY Age, Income SELECT Age, Income, count(*) FROM s_age GROUP BY Age, Income UNION ALL SELECT Age, Income, count(*) FROM s_income GROUP BY Age, Income WHERE Bitmask & 1 = 0 /* ie, 001 . (eg, 010 & 001 = 000 ; 011 & 001 = 1)*/ SELECT Age, Income, count(*) * (100/30) FROM s_overall GROUP BY Age, Income WHERE Bitmask & 3 = 0 /* 3 = 20 + 21 ie, 011 (eg, 001 & 011 = 1; 011 & 011 = 1; 010 & 011 = 1)*/ Age Income Designation Bitmask 35 100 Developer 011 Manager 40 70 001 Age Income Designation Bitmask 25 60 Developer 000 70 Lead 30 35 100 011 s_income (Column Index: 1 - 010) Age Income Designation Bitmask 35 100 Manager 011 Developer

Relative Error – TPC-H Benchmark Database

Groups Missing – TPC-H Benchmark Database

Accuracy Metrics Accuracy Criteria Q = Aggregation Query. Include as many possible groups to calculate approximate answer. Error in the aggregate value for each group should be small. Q = Aggregation Query. Let G = {g1, g2, g3, … gn} be the set of n groups in the answer to Q xi = aggregate value for group gi. A= Approximate Answer to Q G’ = {gi1, gi2, gi3, … gin} be the set of m groups in A x’i1 = aggregate value for group gij.

Accuracy Metrics Percentage of groups from Q missed in A: 2. Average Relative Error 3. Average Squared Relative Error:

Expected Average Squared Relative Errors Q = Count Query over over a database with N tuples. Let G = {g1 . . . gn} be the set of n groups in the answer; pi = fraction of the N tuples that belong to group gi. Let C denote the set of grouping columns in the query Q Let [vC,gi Є L(C)] denote the indicator function that equals 1 when the value for grouping column C in group gi is one of the common values L(C) and 0 otherwise. Au, an approximate answer for Q produced using uniform random sampling at sampling rate s/N Asg, an approximate answer for Q produced using small group sampling with an overall sample generated at sampling rate s’/N.

Expected Average Squared Relative Errors Let Eu = E[SqRelErr(Q,Au)] and Esg = E[SqRelErr(Q,Asg)] denote the expected values of the average squared relative error on Q of Au and of Asg,respectively.

Thank You.