A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.


1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth Sivaram Vallath

2 Why Approximate Query Answering?
Most applications analyzing large databases are OLAP and data mining applications.
Most of their queries are aggregation queries over these large databases, and are therefore expensive and resource intensive.
Approximate answers that are delivered accurately and efficiently benefit the scalability of the application.

3 Approaches to Approximation/Data Reduction
Use pre-computed samples of the data instead of the complete data.
Sampling:
  Weighted sampling
  Congressional sampling
  On-the-fly sampling (error prone when selections, GROUP BYs and joins are used)
  Workload (deterministic solution for identical workloads)

4 Approaches to Approximation/Data Reduction (contd.)
“Similar” workloads are treated as an optimization problem (minimizing the error).
Histograms
Wavelets

5 Attacking “similar” workloads
Stratified sampling
Minimize the error in the estimation of aggregates

6 Drawbacks of previous studies
Lack of a rigorous problem formulation leads to solutions that are difficult to evaluate theoretically.
They do not deal with uncertainty in incoming queries that are “similar” but not identical.
They ignore the variance in the data distribution of the aggregated columns.

7 Architecture for AQP
Queries with selections, foreign-key joins and GROUP BY, containing aggregation functions such as COUNT, SUM, AVG.

8 Architecture for AQP

9 Offline Component: building of the sample
Online Component:
1. Rewrites an incoming query to use the sample to answer the query approximately
2. Reports the answer with error estimates
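The online rewrite step can be illustrated with a minimal sketch. It assumes each sample row carries a scale-factor column saying how many base rows it stands for; the column name `scale_factor`, the helper names, and the sample rows are hypothetical and not taken from the slides. Error estimation is left out.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]

def approx_sum(sample: List[Row], predicate: Callable[[Row], bool],
               column: str, scale_column: str = "scale_factor") -> float:
    """Approximate SUM(column) by scaling each selected sample row."""
    return sum(row[column] * row[scale_column] for row in sample if predicate(row))

def approx_count(sample: List[Row], predicate: Callable[[Row], bool],
                 scale_column: str = "scale_factor") -> float:
    """Approximate COUNT(*) as the sum of scale factors of selected sample rows."""
    return sum(row[scale_column] for row in sample if predicate(row))

# Hypothetical two-row sample standing in for a four-row base table.
sample = [
    {"product_id": 1, "revenue": 10.0, "scale_factor": 3.0},
    {"product_id": 4, "revenue": 1000.0, "scale_factor": 1.0},
]

total_revenue = approx_sum(sample, lambda row: True, "revenue")
row_count = approx_count(sample, lambda row: True)
```

The rewritten query runs only against the sample, so its cost depends on the sample size rather than on the size of the base relation.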

10 Pre-computing Samples for Fixed Workload
Fundamental Regions: For a given relation R and workload W, consider partitioning the records in R into a minimum number of regions R_1, R_2, …, R_r such that for any region R_j, each query in W selects either all records in R_j or none.
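One way to compute such a partition is to group records by the set of workload queries that select them: two records with the same selection pattern can never be separated by any query in W. The sketch below assumes queries are modelled as Python predicates over dictionary records; the function name and the toy data are illustrative.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, int]
Query = Callable[[Record], bool]

def fundamental_regions(records: Sequence[Record],
                        workload: Sequence[Query]) -> Dict[Tuple[bool, ...], List[Record]]:
    """Group records by their selection signature across the workload."""
    regions: Dict[Tuple[bool, ...], List[Record]] = defaultdict(list)
    for record in records:
        # signature[i] is True when query i selects this record
        signature = tuple(query(record) for query in workload)
        regions[signature].append(record)
    return dict(regions)

# Toy relation and a two-query workload (both hypothetical).
records = [{"id": i} for i in range(1, 5)]
workload = [lambda r: r["id"] <= 2, lambda r: r["id"] >= 2]

regions = fundamental_regions(records, workload)
```

Here the four records fall into three regions: id 1 (first query only), id 2 (both queries), and ids 3 and 4 (second query only).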

11 Fixed Workload
1. Identify all fundamental regions; let r be their number. After step 1 there are two cases, Case A (r ≤ k) and Case B (r > k), where k is the sample size.
Case A (r ≤ k):
2. Pick sample records: select the sample by picking exactly one record from each important fundamental region.
3. Assign appropriate values to the additional columns in the sample records.
Case B (r > k):
2. Pick sample records: select k regions and then pick one record from each of the selected regions. The heuristic is to select the top k regions.
3. Assign values to the additional columns. This is an optimization problem; it is solved by setting the partial derivatives to zero and solving the resulting linear equations with the Gauss-Seidel method.
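Case A can be sketched as follows, under the assumption that the additional columns store each region's record count and the sum of the aggregated column. The column names `region_count` and `region_sum` are hypothetical, and the sketch keeps every region rather than modelling which regions are important. Because each workload query selects a whole region or none of it, evaluating the query on the one representative record decides the entire region.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Record = Dict[str, float]
Query = Callable[[Record], bool]

def build_fixed_sample(records: Sequence[Record], workload: Sequence[Query],
                       column: str) -> List[Record]:
    """Keep one representative per fundamental region, plus region totals."""
    regions: Dict[tuple, List[Record]] = defaultdict(list)
    for record in records:
        regions[tuple(query(record) for query in workload)].append(record)
    sample = []
    for members in regions.values():
        representative = dict(members[0])  # copy so the base record is untouched
        representative["region_count"] = len(members)
        representative["region_sum"] = sum(m[column] for m in members)
        sample.append(representative)
    return sample

def answer_sum(sample: Sequence[Record], query: Query) -> float:
    """Answer SUM(column) for a workload query from the sample alone."""
    return sum(row["region_sum"] for row in sample if query(row))

records = [{"id": 1, "revenue": 10}, {"id": 2, "revenue": 10},
           {"id": 3, "revenue": 10}, {"id": 4, "revenue": 1000}]
q1 = lambda r: r["id"] <= 2
q2 = lambda r: r["id"] >= 2
sample = build_fixed_sample(records, [q1, q2], "revenue")
```

For queries that belong to the workload the answers are exact; for any other query the representative record may not reflect its whole region, which is where the unpredictable errors of a fixed-workload sample come from.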

12 Disadvantage
A per-query (probabilistic) error guarantee is not possible.
If an incoming query is not identical to a query in the given workload, FIXED can result in unpredictable errors.

13 Lifting Workload to Query Distributions
The sample should be resilient to situations where incoming queries are “similar” but not identical to the queries in the workload.
Similarity is not based on syntax: two queries are similar if the records they return overlap significantly.
Each record has an associated probability that an incoming query will select it.

14 Rationale for Stratified Sampling
An effective stratification scheme should keep the expected variance, over all queries, small within each stratum, and should allocate more samples to strata with larger expected variances.
The goal is to minimize the MSE of the lifted workload.

15 Example
ProductID  Revenue
1          10
2          10
3          10
4          1000
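The four-row table can be used to see why stratification helps. The sketch below assumes a query asking for the total revenue and a sample of two records; neither is stated on the slide. It enumerates every possible uniform sample of size two and compares it with a stratified design that puts products 1 to 3 in one stratum and product 4 in its own.

```python
from itertools import combinations, product
from typing import List, Sequence

revenue = {1: 10, 2: 10, 3: 10, 4: 1000}
true_sum = sum(revenue.values())  # 1030

def uniform_estimates(sample_size: int) -> List[float]:
    """SUM estimates from every uniform sample without replacement."""
    n = len(revenue)
    return [n / sample_size * sum(revenue[p] for p in chosen)
            for chosen in combinations(revenue, sample_size)]

def stratified_estimates(strata: Sequence[Sequence[int]]) -> List[float]:
    """SUM estimates when one record is drawn per stratum and scaled by stratum size."""
    return [sum(len(stratum) * revenue[p] for stratum, p in zip(strata, picks))
            for picks in product(*strata)]

def mse(estimates: Sequence[float]) -> float:
    """Mean squared error of the estimates against the true sum."""
    return sum((e - true_sum) ** 2 for e in estimates) / len(estimates)

uniform_mse = mse(uniform_estimates(2))                     # 980100.0
stratified_mse = mse(stratified_estimates([[1, 2, 3], [4]]))  # 0.0
```

A uniform sample either misses product 4 (estimate 40) or includes it (estimate 2020), so every possible estimate is off by 990. The stratified sample always returns 1030, because each stratum has zero internal variance.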

16 Solution for single-table selection queries with Aggregation
1. Stratification
   1. How many strata are required during partitioning?
   2. How many records should each stratum have?
2. Allocation
   1. Determine the number of samples required across each stratum
3. Sampling
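The slide does not give the allocation formula. As a stand-in, the sketch below uses classical Neyman allocation, which assigns samples in proportion to stratum size times stratum standard deviation. This is a textbook technique swapped in for illustration, not the workload-based allocation of the presented approach, and the function name and data are hypothetical. The result is fractional; turning it into integer sample counts is a separate step.

```python
import statistics
from typing import List, Sequence

def neyman_allocation(strata: Sequence[Sequence[float]], k: int) -> List[float]:
    """Split k samples across strata proportionally to size * standard deviation."""
    weights = [len(values) * statistics.pstdev(values) for values in strata]
    total = sum(weights)
    if total == 0:
        # Every stratum is constant: fall back to allocation proportional to size.
        sizes = [len(values) for values in strata]
        return [k * size / sum(sizes) for size in sizes]
    return [k * weight / total for weight in weights]

# Hypothetical strata of aggregate-column values: low variance vs. high variance.
strata = [[10, 12, 14, 16], [100, 500, 900]]
allocation = neyman_allocation(strata, k=10)
```

With these values almost the whole budget goes to the second stratum, which matches the rationale that strata with larger variance deserve more samples.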

17 Pragmatic Issues
Identifying Fundamental Regions
Handling Large Number of Fundamental Regions
Obtaining Integer Solutions
Obtaining an Unbiased Estimator

18 Extensions for more General Queries
GROUP BY Queries
JOIN Queries
Other Extensions
Mix of COUNT and SUM queries

19 Conclusion
The solutions FIXED and STRAT handle the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.

