A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. Surajit Chaudhuri, Gautam Das, Vivek Narasayya. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data.


1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. Surajit Chaudhuri, Gautam Das, Vivek Narasayya. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. Presented by Rebecca M. Atchley

2 Motivation
- Decision support applications (OLAP and data mining) for analyzing large databases have become very popular
- Most queries against these large databases are aggregation queries
- Such queries are expensive and resource intensive
- Approximate answers, delivered accurately and efficiently, benefit the scalability of the application and are usually "good enough"

3 Approaches to Approximation/Data Reduction
This paper's approach uses pre-computed samples of the data instead of the complete data.
Sampling approaches:
- Weighted sampling
- Congressional sampling
- On-the-fly sampling (error-prone when selections, GROUP BYs, and joins are used)
- Workload-based sampling: a deterministic solution exists for identical workloads; "similar" workloads are handled as an optimization problem (minimizing the error)

4 Approaches to Approximation/Data Reduction (contd.)
- Histograms
- Wavelets

5 Drawbacks of previous work
- Lack of rigorous problem formulations leads to solutions that are difficult to evaluate theoretically
- Does not deal with uncertainty in incoming queries that are "similar" but not identical; assumes a fixed workload
- Ignores the variance in the data distribution of the aggregated columns

6 Attacking "similar" workloads
- Stratified sampling
- Minimize the error in the estimation of aggregates

7 Architecture for AQP
Supported queries: selections, foreign-key joins, and GROUP BY, with aggregation functions such as COUNT, SUM, and AVG.

8 Architecture for AQP

9 Inputs: a database and a workload W
- An offline component selects the sample.
- An online component:
  - rewrites an incoming query to use the sample to answer the query approximately;
  - reports the answer with an estimate of the error in the answer.
- ScaleFactor: as in previous work, each record in the sample contains an additional column, ScaleFactor. "The value of the aggregate column of each record in the sample is first scaled up by multiplying with the ScaleFactor, and then aggregated."
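The ScaleFactor mechanism quoted above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's implementation; the record layout and column names are assumptions.

```python
# Sketch of answering aggregates from a sample whose records carry a
# ScaleFactor column (record layout is illustrative, not the paper's).

def approximate_count(sample):
    """Estimate COUNT(*): each sample record stands for ScaleFactor records."""
    return sum(rec["ScaleFactor"] for rec in sample)

def approximate_sum(sample, agg_column):
    """Estimate SUM(agg_column): each sampled value is first scaled up by
    its ScaleFactor, then the scaled values are aggregated."""
    return sum(rec[agg_column] * rec["ScaleFactor"] for rec in sample)

# A 2-record sample standing in for a 100-record relation (50 records each):
sample = [{"Revenue": 10, "ScaleFactor": 50},
          {"Revenue": 30, "ScaleFactor": 50}]
print(approximate_count(sample))           # 100
print(approximate_sum(sample, "Revenue"))  # (10 + 30) * 50 = 2000
```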

10 Architecture for AQP
A workload W is specified as a set of query-weight pairs: W = {⟨Q_1, w_1⟩, …, ⟨Q_q, w_q⟩}
- Weight w_i indicates the importance of query Q_i in the workload.
- Without loss of generality, assume the weights are normalized: Σ_i w_i = 1

11 Architecture for AQP
If the correct answer for query Q is y and the approximate answer is y':
- Relative error: E(Q) = |y - y'| / y
- Squared error: SE(Q) = (|y - y'| / y)^2
If the correct answer for the i-th group is y_i and the approximate answer is y_i':
- Squared error in answering a GROUP BY query Q with g groups: SE(Q) = (1/g) Σ_i ((y_i - y_i') / y_i)^2
Given a probability distribution p_w of queries:
- Mean squared error for the distribution: MSE(p_w) = Σ_Q p_w(Q) * SE(Q), where p_w(Q) is the probability of query Q
- Root mean squared error (L_2 metric): RMSE(p_w) = sqrt(MSE(p_w))
Other error metrics:
- L_1 metric: the mean error over all queries in the workload
- L_∞ metric: the maximum error over all queries
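The error metrics on this slide translate directly into code. A minimal Python sketch of the definitions (the function names are mine, not the paper's):

```python
import math

def relative_error(y, y_approx):
    """E(Q) = |y - y'| / y."""
    return abs(y - y_approx) / y

def squared_error(y, y_approx):
    """SE(Q) = (|y - y'| / y)^2."""
    return relative_error(y, y_approx) ** 2

def groupby_squared_error(exact, approx):
    """SE(Q) for a GROUP BY query: the mean of the per-group squared errors."""
    g = len(exact)
    return sum(squared_error(y, ya) for y, ya in zip(exact, approx)) / g

def mean_squared_error(weighted_errors):
    """MSE(p_w) = sum_Q p_w(Q) * SE(Q), over (probability, SE) pairs."""
    return sum(p * se for p, se in weighted_errors)

def root_mean_squared_error(weighted_errors):
    """L2 metric: the square root of MSE(p_w)."""
    return math.sqrt(mean_squared_error(weighted_errors))

# Exact answer 100, approximate answer 90 -> relative error 0.1, SE 0.01
print(relative_error(100, 90), squared_error(100, 90))
```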

12 Special Case: Fixed Workload
Fundamental regions: for a given relation R and workload W, consider partitioning the records of R into a minimum number of regions R_1, R_2, …, R_r such that for any region R_j, each query in W selects either all records in R_j or none of them.
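One way to compute fundamental regions is to group records by the subset of workload queries that select them. A Python sketch under that interpretation (the toy relation and predicates are illustrative):

```python
from collections import defaultdict

def fundamental_regions(records, predicates):
    """Partition records by their signature: which workload queries select
    them. Records sharing a signature form one fundamental region, so each
    query selects either all records of a region or none of them."""
    regions = defaultdict(list)
    for rec in records:
        signature = tuple(pred(rec) for pred in predicates)
        regions[signature].append(rec)
    return list(regions.values())

# Toy relation and a two-query workload (predicates are illustrative):
R = [{"ProductID": i} for i in range(1, 7)]
W = [lambda r: r["ProductID"] <= 3,       # Q1 selects ProductID 1-3
     lambda r: r["ProductID"] in (3, 4)]  # Q2 selects ProductID 3-4
print(len(fundamental_regions(R, W)))     # 4 regions: {1,2}, {3}, {4}, {5,6}
```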

13 FIXEDSAMP
Solution: a deterministic algorithm called FIXED.
Step 1: Identify all fundamental regions (r of them). With sample size k, this yields Case A (r ≤ k) or Case B (r > k).
Case A (r ≤ k):
- Pick sample records: select exactly one record from each important fundamental region.
- Assign appropriate values to the additional columns of the sample records.
Case B (r > k):
- Pick sample records: select k regions, then pick one record from each selected region. The heuristic is to select the top k regions.
- Assign values to the additional columns. This is an optimization problem, solved by partial differentiation and solving the resulting linear equations with the Gauss-Seidel method.
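Case A can be sketched simply for COUNT queries: keep one representative per region, scaled by the region's size. This is a simplification of the paper's assignment of additional-column values, shown only to convey the idea:

```python
def fixed_case_a(regions):
    """Case A (r <= k): keep exactly one representative record per
    fundamental region. For COUNT queries, setting the representative's
    ScaleFactor to the region's size reproduces each region's contribution
    exactly (a simplification of the paper's column-value assignment)."""
    sample = []
    for region in regions:
        rep = dict(region[0])              # one record from the region
        rep["ScaleFactor"] = len(region)   # stands in for the whole region
        sample.append(rep)
    return sample

# Two fundamental regions of sizes 2 and 1 (records are illustrative):
regions = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
print([r["ScaleFactor"] for r in fixed_case_a(regions)])  # [2, 1]
```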

14 Disadvantage
- A per-query (probabilistic) error guarantee is not possible.
- If an incoming query is not identical to a query in the given workload, FIXED can produce unpredictable errors.

15 Rationale for Stratified Sampling
- Stratified sampling is a well-known generalization of uniform sampling: the population is partitioned into multiple strata and samples are selected uniformly from each stratum, with "important" strata contributing relatively more samples.
- An effective stratification scheme should keep the expected variance over the queries within each stratum small, and allocate more samples to strata with larger expected variances.
- Goal: minimize the MSE of the lifted workload.
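The allocation intuition above (more samples to strata with larger expected variance) is essentially Neyman allocation. A hedged Python sketch of that idea, not the paper's STRAT algorithm:

```python
import random
import statistics

def stratified_sample(strata, k):
    """Allocate k samples across strata in proportion to size * stddev
    (Neyman allocation): strata whose values vary more get more samples.
    Then sample uniformly, without replacement, within each stratum."""
    weights = [len(s) * statistics.pstdev(s) for s in strata]
    total = sum(weights)
    samples = []
    for stratum, w in zip(strata, weights):
        if total > 0:
            k_j = max(1, round(k * w / total))  # every stratum gets >= 1
        else:
            k_j = max(1, k // len(strata))      # all strata are constant
        samples.append(random.sample(stratum, min(k_j, len(stratum))))
    return samples

# A constant stratum and a high-variance stratum; k = 10 total samples:
strata = [[1.0] * 10, [float(v) for v in range(100)]]
print([len(s) for s in stratified_sample(strata, 10)])  # [1, 10]
```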

16 Lifting Workload to Query Distributions
- The scheme should be resilient to incoming queries that are "similar" to, but not identical with, the workload queries.
- "Similarity" is not based on syntax: two queries are similar if the sets of records they return have significant overlap.
- Each record is assigned a probability that an incoming query will select it.

17 The Non-Special Case with a Lifted Workload
Problem SAMP. Focus on single-relation queries with SUM or COUNT aggregation, with W consisting of one query Q on relation R.
- Input: R, p_w (a probability distribution function specified by W), and k.
- Output: a sample of k records (with the appropriate additional columns) such that MSE(p_w) is minimized.
Lifted workload: for a given W, define a lifted workload p_w, i.e., a probability distribution of incoming queries, which is:
- high for a query Q' similar to queries in W;
- low for a dissimilar query.
Instead of mapping queries to probabilities, p_{Q} maps subsets of R to probabilities. The objective is to define the distribution p_{Q}.

18 The Non-Special Case with a Lifted Workload
Lifted workload (cont.)
- Two parameters, δ (½ ≤ δ ≤ 1) and γ (0 ≤ γ ≤ ½), define the degree to which the workload "influences" the query distribution.
- For any given record inside (resp. outside) R_Q, the parameter δ (resp. γ) is the probability that an incoming query will select this record.
- p_{Q}(R') is the probability of occurrence of a query that selects exactly the set of records R'.
- n_1, n_2, n_3, and n_4 are the counts of records in the four regions induced by R_Q and R' (inside/outside R_Q, crossed with selected/not selected by R').
- When R' overlaps R_Q closely (records agree both inside and outside R_Q), p_{Q}(R') is high; when the overlap is small, p_{Q}(R') is low.
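Assuming the product-form lifting the slide describes (each record inside R_Q selected independently with probability δ, each record outside with probability γ), p_{Q}(R') can be computed as below. The formula is my reconstruction from the slide's description, not a verbatim copy of the paper:

```python
def lifted_probability(R, R_Q, R_prime, delta, gamma):
    """p_{Q}(R'): probability that an incoming query selects exactly R'.
    Each record inside R_Q is selected independently with probability
    delta; each record outside R_Q with probability gamma (product-form
    independence model assumed).
    n2 / n1: records of R_Q that are / are not selected by R';
    n4 / n3: records outside R_Q that are / are not selected by R'."""
    n2 = len(R_Q & R_prime)
    n1 = len(R_Q - R_prime)
    n4 = len(R_prime - R_Q)
    n3 = len(R - R_Q - R_prime)
    return (delta ** n2) * ((1 - delta) ** n1) * \
           (gamma ** n4) * ((1 - gamma) ** n3)

R = set(range(1, 7))   # a six-record relation (illustrative)
R_Q = {1, 2, 3}        # the records selected by the workload query
# With delta near 1 and gamma near 0, R' = R_Q is the most likely query:
print(lifted_probability(R, R_Q, {1, 2, 3}, 0.9, 0.1))
```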

19 Example
A relation R with columns ProductID and Revenue (records with ProductID 1 through 4; the table of values appears on the slide and is not preserved in the transcript).
Query Q1: SELECT COUNT(*) FROM R WHERE ProductID IN (3, 4);
Population POP_Q1 = {0, 0, 1, 1}
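The slide's population vector can be reproduced in code; the Revenue values are omitted because the transcript does not preserve the table:

```python
# The slide's example as code: POP_Q1 is the per-record 0/1 membership
# vector of the COUNT query over a 4-record relation (ProductID 1..4).
R = [{"ProductID": i} for i in range(1, 5)]
pop_q1 = [1 if rec["ProductID"] in (3, 4) else 0 for rec in R]
print(pop_q1)  # [0, 0, 1, 1]
```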

20 Solution STRAT for single-table selection queries with aggregation
1. Stratification
   - How many strata are required during partitioning?
   - How many records should each stratum have?
2. Allocation
   - Determine the number of samples required from each stratum.
3. Sampling

21 Allocation
After stratification into:
- the r fundamental regions, for a COUNT query;
- h finer subdivisions of each fundamental region, for a SUM query (since each region may have large internal variance in the aggregate column, we cannot use the same stratification as in the COUNT case),
how do we distribute the k sample records across the r (for COUNT) or h*r (for SUM) strata?

22 The Allocation Step for COUNT Query
We want to minimize the error over the queries in p_w. Let k_1, …, k_r be unknown variables such that Σ_j k_j = k. From Equation (2) on an earlier slide, MSE(p_w) can be expressed as a weighted sum of the MSE of each query in the workload:
Lemma 2: MSE(p_w) = Σ_i w_i MSE(p_{Q_i})

23 The Allocation Step for COUNT Query (Cont.)
For any Q ∈ W, we express MSE(p_{Q}) as a function of the k_j's.
Lemma 3: for a COUNT query Q in W, ApproxMSE(p_{Q}) is given by a closed-form expression in the k_j's, and MSE(p_{Q}) is approximated by ApproxMSE(p_{Q}). (The formulas appear as images on the slide and are not preserved in the transcript.)

24 The Allocation Step for COUNT Query (Cont.)
Since we have an (approximate) formula for MSE(p_{Q}), we can express MSE(p_w) as a function of the variables k_j.
Corollary 1: MSE(p_w) = Σ_j (α_j / k_j), where each α_j is a function of n_1, …, n_r, δ, and γ.
α_j captures the "importance" of a region: it is positively correlated with n_j as well as with the frequency of workload queries that access R_j.
Now we can minimize MSE(p_w).

25 The Allocation Step for COUNT Query (Cont.)
Lemma 4: Σ_j (α_j / k_j) is minimized subject to Σ_j k_j = k when k_j = k * (sqrt(α_j) / Σ_i sqrt(α_i)).
This provides a closed-form and computationally inexpensive solution to the allocation problem, since α_j depends only on δ, γ, and the number of tuples in each fundamental region.
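Lemma 4's closed-form allocation is straightforward to implement:

```python
import math

def allocate(alphas, k):
    """Lemma 4: k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i) minimizes
    sum_j alpha_j / k_j subject to sum_j k_j = k."""
    denom = sum(math.sqrt(a) for a in alphas)
    return [k * math.sqrt(a) / denom for a in alphas]

# Region importances 1, 4, 16 -> sqrt weights 1 : 2 : 4 -> 10/20/40 of 70:
print(allocate([1.0, 4.0, 16.0], 70))  # [10.0, 20.0, 40.0]
```

Note that the allocation is proportional to sqrt(α_j), not to α_j itself: doubling a region's importance increases its share by only sqrt(2).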

26 Allocation Step for SUM Query
Like COUNT, we set up an optimization problem, now with h*r unknowns k_1, …, k_{h*r}. Unlike COUNT, the specific values of the aggregate column in each region (as well as the variance of values within each region) influence MSE(p_{Q}).
Let y_j (resp. Y_j) be the average (resp. sum) of the aggregate-column values of all records in region R_j. Since the variance within each region is small, each value within the region can be approximated as simply y_j. This lets us express MSE(p_{Q}) as a function of the k_j's for a SUM query Q in W.

27 Allocation Step for SUM Query
As with COUNT, MSE(p_w) for SUM has the functional form Σ_j (α_j / k_j), where α_j depends on the same kinds of parameters n_1, …, n_{h*r}, δ, and γ (see Corollary 1). The same minimization procedure as in Lemma 4 can be used.

29 Experimental Results
- FIXED: solution for FIXEDSAMP (fixed workload, identical queries)
- STRAT: solution for SAMP (workloads of single-table selection queries with aggregation)
Previous work:
- USAMP: uniform random sampling
- WSAMP: weighted sampling
- OTLIDX: outlier indexing combined with weighted sampling
- CONG: congressional sampling

Experimental Results (result charts shown on the slides)

38 Pragmatic Issues
- Identifying fundamental regions
- Handling a large number of fundamental regions
- Obtaining integer solutions
- Obtaining an unbiased estimator

39 Extensions for more General Queries
- GROUP BY queries
- JOIN queries
- Other extensions: a mix of COUNT and SUM queries

40 Conclusion
The solutions FIXED and STRAT handle the problems of data variance, heterogeneous query mixes, GROUP BY, and foreign-key joins.
Questions?

41 Thank you!