A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. By: Surajit Chaudhuri, Gautam Das, Vivek Narasayya. Presented by: Sayed Muchallil.

Presentation transcript:

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries
By: Surajit Chaudhuri, Gautam Das, Vivek Narasayya
Presented by: Sayed Muchallil, September 21st, 2010

CONTENTS
1. INTRODUCTION
2. ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING
3. FIXED WORKLOAD
4. STRATIFIED SAMPLING
5. SOLUTION
6. SUMMARY

Pre-computed samples
• Can give approximate answers very efficiently.
• A workload is used to make sure that the errors are acceptable.

Previous Studies
• Solutions are difficult to evaluate theoretically.
• Do not formally deal with uncertainty in the expected workload.
• Ignore the variance in the data distribution.

Sample
Table R (columns: Product ID, Revenue)
• Only 50% of the records in R can be used as the sample.
• Query: “SELECT SUM(Revenue) FROM R”
• The exact answer on R is 1030.

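Sample (cont.)
Sample tables S1 and S2 (columns: Product ID, Revenue)
• The answer to the query using sample table S1 is 40.
• The answer to the query using sample table S2 is 2020.
• How are these answers obtained? The sum over the sample is scaled up by the inverse of the sampling fraction (here, by 2).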

Sample (cont.)
• A large variance in the aggregate column can lead to large relative errors.
• Relative error = |y - y’| / y, where y is the exact answer and y’ is the approximate answer.
• Relative error for S1 = |1030 – 40| / 1030 ≈ 0.96
• Relative error for S2 = |1030 – 2020| / 1030 ≈ 0.96
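The arithmetic on these slides can be replayed with a short sketch. The table contents did not survive in the transcript, so the revenue values below are hypothetical, chosen only so that they reproduce the totals quoted above (1030, 40 and 2020); the function names are illustrative as well.

```python
# Hypothetical contents of R (ProductID -> Revenue); the real table is not in
# the transcript, these values merely reproduce the quoted totals.
R = {1: 10, 2: 10, 3: 10, 4: 1000}
S1 = {1: 10, 2: 10}      # one possible 50% sample
S2 = {3: 10, 4: 1000}    # another possible 50% sample


def scaled_sum(sample, sampling_fraction):
    """Estimate SUM over the full table by scaling the sample sum."""
    return sum(sample.values()) / sampling_fraction


def relative_error(exact, approx):
    """Relative error |y - y'| / y."""
    return abs(exact - approx) / exact


exact = sum(R.values())
est_s1 = scaled_sum(S1, 0.5)
est_s2 = scaled_sum(S2, 0.5)

print(exact, est_s1, est_s2)
print(relative_error(exact, est_s1), relative_error(exact, est_s2))
```

With these assumed values both samples miss by the same amount (990), which is why the two relative errors coincide: a single outlier in the aggregate column dominates the result whether or not it is sampled.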

What’s New?
• The goal is to pick a sample that minimizes the error.
• If the actual workload is identical to the given (fixed) workload, the error will be smaller.
• The approach works for queries that are identical or similar to those in the given workload.

Sampling
Two ways of selecting samples:
– Randomized
– Deterministic
A workload W is a set of pairs of queries and their weights:
– $W = \{\langle Q_1, w_1 \rangle, \langle Q_2, w_2 \rangle, \ldots, \langle Q_q, w_q \rangle\}$
– $\sum_i w_i = 1$

Architecture for Approximate Query Processing

Architecture (cont.)
• Offline component: selects a sample of records from relation R.
• Online component: rewrites an incoming query to use the sample (what does “rewrite” mean?) and reports the answer with an error estimate.

Architecture (cont.)
• A new method for automatically lifting a given workload.
• It is unrealistic to assume that the incoming queries will be identical to the given workload.
• The key: the ability to compute a probability distribution $p_W$ over incoming queries.

Error Metrics
• Relative error: $|y - y'| / y$
• Squared error: $SE(Q) = (|y - y'| / y)^2$
• Squared error for a GROUP BY query with $g$ groups: $SE(Q) = \frac{1}{g} \sum_i \left( (y_i - y_i') / y_i \right)^2$
• Given a probability distribution of queries $p_W$:
  – Mean squared error: $MSE(p_W) = \sum_Q p_W(Q) \cdot SE(Q)$
  – Root mean squared error: $RMSE(p_W) = \sqrt{MSE(p_W)}$
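As a minimal sketch, the error metrics on this slide translate directly into a few functions. The function names and the representation of the query distribution as a list of (probability, squared error) pairs are assumptions made for illustration.

```python
import math


def squared_error(y, y_approx):
    """SE(Q) for a query with a single exact answer y and estimate y'."""
    return ((y - y_approx) / y) ** 2


def squared_error_group_by(ys, ys_approx):
    """SE(Q) for a GROUP BY query: mean squared relative error over g groups."""
    g = len(ys)
    return sum(((y - ya) / y) ** 2 for y, ya in zip(ys, ys_approx)) / g


def mse(distribution):
    """MSE over a query distribution given as (probability, SE) pairs."""
    return sum(p * se for p, se in distribution)


def rmse(distribution):
    """Root mean squared error of the same distribution."""
    return math.sqrt(mse(distribution))


# Two equally likely queries: one answered with 20% relative error, one exactly.
dist = [(0.5, squared_error(100, 80)), (0.5, squared_error(50, 50))]
print(mse(dist), rmse(dist))
```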

Fixed Workload
• A special case: the incoming queries are identical to the given workload.
• Problem: FIXEDSAMP
  Input: R, W, k
  Output: a sample of k records (with appropriate additional columns) such that MSE(W) is minimized.

Fundamental Regions
• Relation R contains 9 records.
• W consists of 2 queries, Q1 and Q2, each selecting the records whose C values lie within a given range.
• These queries divide relation R into 4 fundamental regions.

Fundamental Regions (cont.)

• Fundamental regions: a partitioning of the records in R into a minimum number of regions $R_1, R_2, \ldots, R_r$ such that, for any region $R_j$, each query in W selects either all records in $R_j$ or none.
• The number of fundamental regions r is at most $\min(2^{|W|}, n)$, where n is the number of records in R.
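One way to sketch the definition: two records belong to the same fundamental region exactly when every workload query treats them the same way, so grouping records by their tuple of per-query outcomes yields the regions. The nine C values and the two range predicates below are hypothetical (the original bounds are not in the transcript), and this brute-force grouping is an illustration rather than the method used in the paper.

```python
from collections import defaultdict


def fundamental_regions(records, queries):
    """Group records by which workload queries select them.

    Each distinct signature (one boolean per query) is one fundamental region.
    """
    regions = defaultdict(list)
    for rec in records:
        signature = tuple(bool(q(rec)) for q in queries)
        regions[signature].append(rec["id"])
    return dict(regions)


# Hypothetical relation with 9 records and a single column C.
records = [{"id": i, "C": c}
           for i, c in enumerate([10, 20, 30, 40, 50, 60, 70, 80, 90], start=1)]

# Hypothetical workload: two overlapping range queries on C.
q1 = lambda r: 30 <= r["C"] <= 60
q2 = lambda r: 50 <= r["C"] <= 80

regions = fundamental_regions(records, [q1, q2])
for signature, ids in regions.items():
    print(signature, ids)
```

With two queries there can be at most $2^2 = 4$ signatures, which matches the bound $\min(2^{|W|}, n)$; records selected by no query form a region of their own.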

FIXEDSAMP Solution
• Step 1: Identify the fundamental regions in R. Two cases arise:
  – r <= k
  – r > k
• Step 2: Pick the sample records.
• Step 3: Assign values to the additional columns.

LIFTING WORKLOAD TO QUERY DISTRIBUTION
• For a query Q’ that is not identical to any workload query, $p_W(Q')$ is high if Q’ is similar to queries in the workload and low otherwise.
• Q’ and Q are similar if the sets of records they select overlap significantly.

LIFTED WORKLOAD
• $p_{\{Q\}}(R')$ is the probability of occurrence of any query that selects exactly the set of records R’.
• For any given record inside (resp. outside) $R_Q$, the set of records selected by Q, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.

LIFTED WORKLOAD (Cont.)

• δ → 1 and γ → 0: incoming queries are identical to workload queries.
• δ → 1 and γ → ½: incoming queries are supersets of workload queries.
• δ → ½ and γ → 0: incoming queries are subsets of workload queries.
• δ → ½ and γ → ½: incoming queries are unrestricted.
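The role of δ and γ can be sketched as follows, under the assumption that each record is selected independently: a record inside $R_Q$ with probability δ, a record outside it with probability γ. The function names and the representation of records as integers 0..n-1 are illustrative choices.

```python
import random


def lifted_probability(n, r_q, r_prime, delta, gamma):
    """Probability of the record set r_prime under the lifted workload of Q.

    Assumes independent per-record selection: delta inside r_q, gamma outside.
    """
    p = 1.0
    for rec in range(n):
        pick = delta if rec in r_q else gamma
        p *= pick if rec in r_prime else (1.0 - pick)
    return p


def sample_lifted_query(n, r_q, delta, gamma, rng):
    """Draw the record set of one incoming query from the same model."""
    return {rec for rec in range(n)
            if rng.random() < (delta if rec in r_q else gamma)}


r_q = {2, 3}               # records selected by workload query Q
rng = random.Random(0)

# delta = 1, gamma = 0: every incoming query selects exactly R_Q.
print(sample_lifted_query(4, r_q, 1.0, 0.0, rng))
# delta = gamma = 1/2: all 16 subsets of 4 records are equally likely.
print(lifted_probability(4, r_q, {0, 3}, 0.5, 0.5))
```

The four limiting cases on the slide correspond to pushing δ toward 1 (records of Q are always kept) or ½, and γ toward 0 (outside records are never added) or ½.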

RATIONALE FOR STRATIFIED SAMPLING
• A population is partitioned into multiple strata, and samples are selected uniformly from each stratum.

STRATIFIED SAMPLING
• A stratified sampling scheme partitions R into r strata containing $n_1, \ldots, n_r$ records (where $\sum_j n_j = n$), with $k_1, \ldots, k_r$ records uniformly sampled from each stratum (where $\sum_j k_j = k$).
• Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN (3,4);
• $POP_Q$ is the population of query Q.
• $POP_{Q1}$ = {0,0,1,1}, which has non-zero variance.
• It can be divided into two strata, {0,0} and {1,1}, each with zero variance.
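The benefit of stratifying this population can be checked numerically. The sketch below uses the population {0,0,1,1} and the two strata from the slide; the estimator function is an illustrative assumption that scales each stratum's sample sum by $n_j / k_j$.

```python
from statistics import pvariance

pop_q1 = [0, 0, 1, 1]            # population of Q1, one value per record of R
strata = [[0, 0], [1, 1]]        # the stratification from the slide

overall_variance = pvariance(pop_q1)
within_variances = [pvariance(s) for s in strata]


def stratified_estimate(strata, samples):
    """Scale each stratum's sample sum by n_j / k_j and add the results."""
    return sum(len(stratum) / len(sample) * sum(sample)
               for stratum, sample in zip(strata, samples))


# Any one-record sample per stratum is [0] and [1], so the estimate is exact.
estimate = stratified_estimate(strata, [[0], [1]])
print(overall_variance, within_variances, estimate)
```

Because each stratum has zero variance, a single record per stratum already answers the COUNT query exactly, whereas a uniform sample of two records from the whole population could return 0, 2 or 4.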

SOLUTION FOR SINGLE-TABLE SELECTION QUERIES WITH AGGREGATION
• Stratification
  – How many strata?
  – How many records in each stratum?
• Allocation
  – Determines how to divide k among the strata
• Sampling
  – Forms the final sample of k records

SOLUTION FOR COUNT AGGREGATE
• Stratification (Lemma 1)
  – r is not known; divide R into fundamental regions and treat them as strata.
• Allocation (Lemma 2)
  – $MSE(p_W) = \sum_i w_i \, MSE(p_{\{Q_i\}})$
  – That is, $MSE(p_W)$ can be expressed as a weighted sum of the MSE of each query in the workload.

SOLUTION FOR COUNT AGGREGATE (Cont.)
• For any $Q \in W$, we express $MSE(p_{\{Q\}})$ as a function of the $k_j$’s.
• Lemma 3: $ApproxMSE(p_{\{Q\}})$ = [formula missing from the transcript]

SOLUTION FOR COUNT AGGREGATE (Cont.)
• Since we have an (approximate) formula for $MSE(p_{\{Q\}})$, we can express $MSE(p_W)$ as a function of the variables $k_j$.
• Corollary 1: $MSE(p_W) = \sum_j (\alpha_j / k_j)$, where each $\alpha_j$ is a function of $n_1, \ldots, n_r$, δ, and γ.
• $\alpha_j$ captures the “importance” of a region; it is positively correlated with $n_j$ as well as with the frequency of queries in the workload that access $R_j$.
• Now we can minimize $MSE(p_W)$.

SOLUTION FOR COUNT AGGREGATE (Cont.)
• Lemma 4: $\sum_j (\alpha_j / k_j)$ is minimized subject to $\sum_j k_j = k$ if $k_j = k \cdot \sqrt{\alpha_j} / \sum_i \sqrt{\alpha_i}$.
• This provides a closed-form and computationally inexpensive solution to the allocation problem, since $\alpha_j$ depends only on δ, γ, and the number of tuples in each fundamental region.
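The closed form in Lemma 4 is easy to sketch in code. The $\alpha_j$ values below are hypothetical placeholders (computing them from $n_j$, δ and γ is not shown on the slides), and the result is a real-valued allocation; rounding to integers is a separate step that this sketch does not cover.

```python
import math


def allocate(alphas, k):
    """Lemma 4 allocation: k_j proportional to sqrt(alpha_j), summing to k."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]


def objective(alphas, ks):
    """The quantity being minimized: sum_j alpha_j / k_j."""
    return sum(a / kj for a, kj in zip(alphas, ks))


alphas = [1.0, 4.0, 9.0]   # hypothetical importance weights of three regions
k = 12                     # total sample size

ks = allocate(alphas, k)
uniform = [k / len(alphas)] * len(alphas)

print(ks)                                            # allocation per region
print(objective(alphas, ks), objective(alphas, uniform))
```

For these assumed weights the allocation is 2, 4 and 6 records, giving an objective of 3.0 against 3.5 for an even split, which illustrates why more important regions receive more of the sample.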

SOLUTION FOR SUM AGGREGATE
• Stratification
  – Bucketing technique
  – Divide fundamental regions with large variance into a set of finer regions.
  – Treat each region as a stratum.
• Allocation
  – $y_j$ (resp. $Y_j$) is the average (resp. sum) of the aggregate column values of all records in region $R_j$.

SOLUTION FOR SUM AGGREGATE (Cont.)
• Each value in region $R_j$ can be approximated as $y_j$.
• This yields an approximate formula for $MSE(p_{\{Q\}})$ for a SUM query Q in W.

Pragmatic Issues
• Identifying fundamental regions
• Handling a large number of fundamental regions
• Obtaining an integer solution
• Obtaining unbiased error

STRAT ALGORITHM

IMPLEMENTATION AND EXPERIMENTAL RESULTS
• The experiments compare the STRAT method with other methods:
  – USAMP: uniform random sampling
  – WSAMP: weighted sampling
  – OTLIDX: outlier indexing combined with weighted sampling
  – CONG: congressional sampling

COUNT AGGREGATE

SUM AGGREGATE

COUNT AGGREGATE

THANK YOU