A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. By: Surajit Chaudhuri, Gautam Das, Vivek Narasayya. Presented by: Sayed Muchallil.


1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries
By: Surajit Chaudhuri, Gautam Das, Vivek Narasayya
Presented by: Sayed Muchallil, September 21st, 2010

2 CONTENTS
1. INTRODUCTION
2. ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING
3. FIXED WORKLOAD
4. STRATIFIED SAMPLING
5. SOLUTION
6. SUMMARY

3 Pre-computed samples
• Can give approximate answers very efficiently.
• A workload is used to make sure that the errors are acceptable.

4 Previous Studies
• Their solutions are difficult to evaluate theoretically.
• They do not formally deal with uncertainty in the expected workload.
• They ignore the variance in the data distribution.

5 Sample
Table R
Product ID   Revenue
1            10
2            10
3            10
4            1000
• Only 50% of the records in R can be used as the sample.
• Query: “SELECT SUM(Revenue) FROM R”
• The exact answer is 1030.

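6 Sample (cont.)
Sample Table S1 (rows not preserved in the transcript)
Sample Table S2
Product ID   Revenue
1            10
4            1000
• The answer to the query using sample table S1 is 40.
• The answer to the query using sample table S2 is 2020.
• How are these answers obtained? The SUM over the sample is scaled up by the inverse of the sampling fraction (here 1 / 0.5 = 2), so S2 gives 2 × (10 + 1000) = 2020.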

7 Sample (cont.)
• Large variance in the aggregate column can lead to large relative errors.
• Relative error = |y − y'| / y, where y is the exact answer and y' is the approximate answer.
• Relative error for S1 = |1030 − 40| / 1030 ≈ 0.96
• Relative error for S2 = |1030 − 2020| / 1030 ≈ 0.96
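The arithmetic on slides 5 to 7 can be reproduced with a short sketch. It assumes that the two unlisted records of R each have revenue 10 (consistent with the total of 1030) and that S1 holds two such records (consistent with its answer of 40); the function names are illustrative only.

```python
# Table R: Product ID -> Revenue. Products 2 and 3 are ASSUMED to have
# revenue 10, which is consistent with the exact answer of 1030.
R = {1: 10, 2: 10, 3: 10, 4: 1000}

def estimate_sum(sample_values, population_size):
    """Scale the sample SUM by n / k to estimate the population SUM."""
    scale = population_size / len(sample_values)
    return scale * sum(sample_values)

def relative_error(exact, approx):
    """Relative error |y - y'| / y from slide 7."""
    return abs(exact - approx) / exact

exact = sum(R.values())          # exact answer over all of R
s1 = [10, 10]                    # ASSUMED contents of S1 (two low-revenue records)
s2 = [R[1], R[4]]                # S2: products 1 and 4

est1 = estimate_sum(s1, len(R))  # 2 * (10 + 10)
est2 = estimate_sum(s2, len(R))  # 2 * (10 + 1000)

print(est1, relative_error(exact, est1))
print(est2, relative_error(exact, est2))
```

Both samples miss the exact answer by 990, one because it excludes the outlier record and the other because it over-represents it.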

8 What’s New?
• The goal is to pick a sample that minimizes the error.
• If the actual workload is identical to the given (fixed) workload, the error will be smaller.
• The approach works for queries that are identical or similar to those in the given workload.

9 Sampling
Two ways of selecting samples:
– Randomized
– Deterministic
A workload W is a set of pairs of queries and their weights:
– W = {<Q_1, w_1>, <Q_2, w_2>, …, <Q_q, w_q>}
– Σ_i w_i = 1

10 Architecture for Approximate Query Processing

11 Architecture (cont.)
• Offline component
– Selects a sample of records from relation R.
• Online component
– Rewrites an incoming query to use the sample. What does “rewrites” mean?
– Reports the answer with an error estimate.

12 Architecture (cont.)
• A new method for automatically lifting a given workload.
• It is unrealistic to assume that the incoming queries will be identical to the given workload.
• The key: the ability to compute a probability distribution p_W over incoming queries.

13 Error Metrics
• Relative error: |y − y'| / y
• Squared error: SE(Q) = (|y − y'| / y)^2
• Squared error for a GROUP BY query with g groups: SE(Q) = (1/g) Σ_i ((y_i − y_i') / y_i)^2
• Given a probability distribution of queries p_W:
– Mean squared error for the distribution: MSE(p_W) = Σ_Q p_W(Q) · SE(Q)
– Root mean squared error: RMSE(p_W) = √MSE(p_W)
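The metrics on this slide translate almost directly into code. The following is a minimal sketch, not the authors' implementation; the function names and the two-query distribution at the end are assumptions chosen for illustration.

```python
import math

def squared_error(y, y_est):
    """SE(Q) = (|y - y'| / y)^2 for a single aggregate value."""
    return (abs(y - y_est) / y) ** 2

def squared_error_group_by(ys, y_ests):
    """SE(Q) for a GROUP BY query: mean squared relative error over g groups."""
    g = len(ys)
    return sum(((y - ye) / y) ** 2 for y, ye in zip(ys, y_ests)) / g

def mse(distribution):
    """MSE(p_W) = sum over queries of p_W(Q) * SE(Q).

    `distribution` is a list of (probability, squared_error) pairs.
    """
    return sum(p * se for p, se in distribution)

def rmse(distribution):
    """RMSE(p_W) = sqrt(MSE(p_W))."""
    return math.sqrt(mse(distribution))

# Hypothetical distribution: two equally likely queries whose estimates
# are 40 and 2020 against an exact answer of 1030.
dist = [(0.5, squared_error(1030, 40)), (0.5, squared_error(1030, 2020))]
print(rmse(dist))
```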

14 Fixed Workload
• A special case: the incoming queries are identical to the given workload.
• Problem: FIXEDSAMP
– Input: R, W, k
– Output: a sample of k records (with appropriate additional columns) such that MSE(W) is minimized.

15 Fundamental Regions
• Relation R contains 9 records.
• W consists of 2 queries:
– Q1 selects records with C values between 10 and 50.
– Q2 selects records with C values between 40 and 70.
• These queries divide relation R into 4 fundamental regions.

16 Fundamental Regions (cont.)

17 Fundamental regions are obtained by partitioning the records in R into a minimum number of regions R_1, R_2, …, R_r such that, for any region R_j, each query in W selects either all records in R_j or none. The total number of fundamental regions is at most min(2^|W|, n), where n is the number of records in R.
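One way to picture this definition is to group records by the set of workload queries that select them. The sketch below does that for the two range queries of slide 15; the nine C values are hypothetical, since the slide's figure is not part of the transcript, and the ranges are assumed to be inclusive.

```python
from collections import defaultdict

def fundamental_regions(records, queries):
    """Group records by which queries select them.

    Records with the same signature (one boolean per query) are selected
    by exactly the same queries, so they fall into the same region.
    """
    regions = defaultdict(list)
    for rec in records:
        signature = tuple(q(rec) for q in queries)
        regions[signature].append(rec)
    return dict(regions)

# Hypothetical C values for the 9 records of R.
c_values = [5, 15, 25, 35, 45, 55, 65, 75, 85]

def q1(c):
    return 10 <= c <= 50   # Q1: C between 10 and 50

def q2(c):
    return 40 <= c <= 70   # Q2: C between 40 and 70

regions = fundamental_regions(c_values, [q1, q2])
for signature, members in regions.items():
    print(signature, members)
```

With two queries there are at most 2^2 = 4 signatures, which matches the four regions on slide 15: Q1 only, Q2 only, both, and neither.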

18 FIXEDSAMP Solution
• Step 1. Identify the fundamental regions in R. Two cases:
– r <= k
– r > k
• Step 2. Pick sample records.
• Step 3. Assign values to the additional columns.

19 LIFTING WORKLOAD TO QUERY DISTRIBUTION
• For a query Q' that is not identical to a workload query, p_W(Q') is high if Q' is similar to queries in the workload and low otherwise.
• Q' and Q are similar if the sets of records they select overlap significantly.

20 LIFTED WORKLOAD
• p_{Q}(R') is the probability of occurrence of any query that selects exactly the set of records R'.
• For any given record inside (resp. outside) R_Q, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.

21 LIFTED WORKLOAD (Cont.)

22 LIFTED WORKLOAD (Cont.)
• δ → 1 and γ → 0: incoming queries are identical to workload queries.
• δ → 1 and γ → ½: incoming queries are supersets of workload queries.
• δ → ½ and γ → 0: incoming queries are subsets of workload queries.
• δ → ½ and γ → ½: incoming queries are unrestricted.
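The role of δ and γ can be illustrated with a small sketch. It assumes that each record is selected independently of the others, with probability δ if it lies inside R_Q and γ if it lies outside; the formula on slide 21 is not preserved in the transcript, so this product form is an assumption rather than a quotation, and all names are illustrative.

```python
def lifted_probability(r_prime, r_q, relation, delta, gamma):
    """Probability that an incoming query selects exactly r_prime.

    ASSUMPTION: records are selected independently, with probability
    delta inside R_Q and gamma outside R_Q.
    """
    r_prime, r_q, relation = set(r_prime), set(r_q), set(relation)
    inside_selected = len(r_prime & r_q)
    inside_missed = len(r_q - r_prime)
    outside_selected = len(r_prime - r_q)
    outside_missed = len(relation - r_q - r_prime)
    return (delta ** inside_selected
            * (1 - delta) ** inside_missed
            * gamma ** outside_selected
            * (1 - gamma) ** outside_missed)

relation = {1, 2, 3, 4}
r_q = {3, 4}  # records selected by a workload query Q

# delta -> 1, gamma -> 0: only R_Q itself has non-zero probability.
print(lifted_probability({3, 4}, r_q, relation, 1.0, 0.0))
print(lifted_probability({3}, r_q, relation, 1.0, 0.0))

# delta = gamma = 1/2: every subset of the relation is equally likely.
print(lifted_probability({1, 3}, r_q, relation, 0.5, 0.5))
```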

23 RATIONALE FOR STRATIFIED SAMPLING
• A population is partitioned into multiple strata, and samples are selected uniformly from each stratum.

24 STRATIFIED SAMPLING
• A stratified sampling scheme partitions R into r strata containing n_1, …, n_r records (where Σ_j n_j = n), with k_1, …, k_r records uniformly sampled from each stratum (where Σ_j k_j = k).
• Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN (3, 4);
• POP_Q is the population of query Q.
• POP_Q1 = {0, 0, 1, 1}, which has non-zero variance.
• It is divided into two strata, {0, 0} and {1, 1}, each with zero variance.
Table R
Product ID   Revenue
1            10
2            10
3            10
4            1000
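The benefit of stratifying POP_Q1 can be checked by enumerating every possible sample. The sketch below compares uniform sampling of k = 2 records with a stratified scheme that takes one record from each stratum; the one-record-per-stratum allocation is an assumption made for the illustration.

```python
from itertools import combinations, product

pop = [0, 0, 1, 1]   # POP_Q1: 1 if the record satisfies Q1, else 0
n, k = len(pop), 2

# Uniform sampling: scale the sample count by n / k.
uniform_estimates = sorted({(n / k) * sum(s) for s in combinations(pop, k)})

# Stratified sampling: one record per stratum, each scaled by the
# stratum size (n_j / k_j with k_j = 1).
strata = [[0, 0], [1, 1]]
stratified_estimates = sorted({
    sum(len(stratum) * pick for stratum, pick in zip(strata, picks))
    for picks in product(*strata)
})

print(uniform_estimates)     # possible COUNT estimates under uniform sampling
print(stratified_estimates)  # possible COUNT estimates under stratification
```

Because each stratum has zero variance, every stratified sample yields the exact COUNT of 2, while uniform samples can return 0, 2, or 4.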

25 SOLUTION FOR SINGLE-TABLE SELECTION QUERIES WITH AGGREGATION
• Stratification
– How many strata?
– How many records in each stratum?
• Allocation
– Determines how to divide k among the strata.
• Sampling
– Forms the final sample of k records.

26 SOLUTION FOR COUNT AGGREGATE
• Stratification (Lemma 1)
– r is not known in advance; divide R into fundamental regions and treat them as strata.
• Allocation (Lemma 2)
– MSE(p_W) = Σ_i w_i · MSE(p_{Q_i})
– MSE(p_W) can be expressed as a weighted sum of the MSE of each query in the workload.

27 SOLUTION FOR COUNT AGGREGATE (Cont.)
• For any Q ∈ W, we express MSE(p_{Q}) as a function of the k_j’s.
• Lemma 3: ApproxMSE(p_{Q}) = (formula not preserved in the transcript)

28 SOLUTION FOR COUNT AGGREGATE (Cont.)
• Since we have an (approximate) formula for MSE(p_{Q}), we can express MSE(p_W) as a function of the k_j variables.
• Corollary 1: MSE(p_W) = Σ_j (α_j / k_j), where each α_j is a function of n_1, …, n_r, δ, and γ.
• α_j captures the “importance” of a region; it is positively correlated with n_j as well as with the frequency of queries in the workload that access R_j.
• Now we can minimize MSE(p_W).

29 SOLUTION FOR COUNT AGGREGATE (Cont.)
• Lemma 4: Σ_j (α_j / k_j) is minimized subject to Σ_j k_j = k if k_j = k · (√α_j / Σ_i √α_i).
• This provides a closed-form and computationally inexpensive solution to the allocation problem, since α_j depends only on δ, γ, and the number of tuples in each fundamental region.
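Lemma 4 is easy to try out numerically. The sketch below applies the closed-form allocation to a set of hypothetical α_j values (how the α_j are derived from n_j, δ, and γ is not shown here) and compares the resulting objective with an equal split of k; the allocation is left fractional.

```python
import math

def allocate(alphas, k):
    """Lemma 4: k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

def objective(alphas, ks):
    """The quantity being minimized: sum_j alpha_j / k_j."""
    return sum(a / kj for a, kj in zip(alphas, ks))

alphas = [1.0, 4.0, 9.0]   # HYPOTHETICAL importance values for three regions
k = 12

ks = allocate(alphas, k)
print(ks)
print(objective(alphas, ks))         # objective under the Lemma 4 allocation
print(objective(alphas, [4, 4, 4]))  # objective under an equal split
```

Regions with larger α_j receive more sample records, but only in proportion to √α_j rather than α_j itself.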

30 SOLUTION FOR SUM AGGREGATE
• Stratification
– Bucketing technique.
– Divide fundamental regions with large variance into a set of finer regions.
– Treat each region as a stratum.
• Allocation
– y_j (Y_j) is the average (sum) of the aggregate column values of all records in region R_j.

31 SOLUTION FOR SUM AGGREGATE (Cont.)
• Each value in the region can be approximated as y_j.
• This gives an approximate formula for MSE(p_{Q}) for a SUM query Q in W.

32 Pragmatic Issues
• Identifying fundamental regions
• Handling a large number of fundamental regions
• Obtaining an integer solution
• Obtaining unbiased error

33 STRAT ALGORITHM

34 IMPLEMENTATION AND EXPERIMENTAL RESULTS
• The experiments compare the STRAT method with other methods:
– USAMP: uniform random sampling
– WSAMP: weighted sampling
– OTLIDX: outlier indexing combined with weighted sampling
– CONG: Congressional sampling

35 COUNT AGGREGATE

36 SUM AGGREGATE

37 COUNT AGGREGATE

38 THANK YOU

