Reading Report 6 Yin Chen 5 Mar 2004


1 Reading Report 6 Yin Chen 5 Mar 2004
Reference: Surajit Chaudhuri, Gautam Das, Vivek Narasayya. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. Microsoft Research.

2 Problem Overview Decision support applications such as On-Line Analytical Processing (OLAP) and data mining for analyzing LARGE databases have become popular, but they are expensive and resource intensive. This work uses precomputed samples of the data instead of the complete data to answer queries, giving approximate answers efficiently. Three drawbacks of previous works: they lack a formal problem formulation and are thus difficult to evaluate theoretically; they do NOT deal with uncertainty in the incoming queries; and they ignore the variance in the data distribution of the aggregated column(s).

3 Related work Some works based on randomized techniques assume a fixed workload and do NOT cope with uncertainty. Each record is tagged with a frequency; an expected number of k records are selected in the sample, where the probability of selecting a record t with frequency ft is k*(ft/Σufu). Records that are accessed more frequently have a greater chance of being included in the sample, BUT the quality can be poor. E.g., consider a set of queries where a few queries reference large partitions and most queries reference very small partitions. Under the weighted sampling scheme most records will come from the large partitions. Thus, with high probability, there will be no records selected from many of the small partitions, causing large error. An IMPROVEMENT: collect the outliers of the data (i.e., the records that contribute to high variance) into a separate index, while the remaining data is sampled using a weighted sampling technique. Queries are answered by running them against both the outlier index and the weighted sample, and an estimated answer is composed from both results. Some other works use on-the-fly sampling, but that can be expensive.
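The weighted sampling step and the outlier-index improvement described above can be sketched as follows. This is a minimal illustration: records are plain Python values, `value` extracts the aggregate column, and the distance-from-mean `threshold` test is a simplification of the variance-based outlier criterion in the paper.

```python
import random

def weighted_sample(records, freqs, k):
    """Select an expected k records: record i is included with
    probability k * freqs[i] / sum(freqs), capped at 1."""
    total = sum(freqs)
    return [r for r, f in zip(records, freqs)
            if random.random() < min(1.0, k * f / total)]

def split_outliers(records, value, threshold):
    """Separate 'outlier' records (aggregate value far from the mean)
    from the rest; outliers go into a separate exact index, and only
    the rest is sampled (simplified outlier test)."""
    vals = [value(r) for r in records]
    mean = sum(vals) / len(vals)
    outliers = [r for r in records if abs(value(r) - mean) > threshold]
    rest = [r for r in records if abs(value(r) - mean) <= threshold]
    return outliers, rest
```

A query would then be run exactly over the outlier index and approximately over the weighted sample, and the two partial answers summed.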

4 Architecture
Workload: A workload W is specified as a set of pairs of queries and their corresponding weights, i.e., W = {<Q1,w1>, …, <Qq,wq>}. Weight wi indicates the importance of query Qi in the workload. Without loss of generality, assume the weights are normalized, i.e., Σiwi = 1.
Architecture: Inputs are a database and a workload W. There are 2 components: an offline component for selecting a sample, and an online component that rewrites an incoming query to use the sample to answer the query approximately, and reports the answer with an estimate of the error in the answer.
ScaleFactor: Each record in the sample contains an additional column, ScaleFactor. The value of the aggregated column of each record in the sample is first scaled up by multiplying with the ScaleFactor, and then aggregated.
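How the online component uses ScaleFactor can be sketched as follows (a minimal illustration, assuming sample rows are dicts with a hypothetical column name `c`; real query rewriting happens in SQL):

```python
def approx_answer(sample, predicate, agg, column=None):
    """Answer an aggregate query approximately from a sample whose
    rows carry a ScaleFactor column: each sampled row stands in for
    ScaleFactor rows of the original relation."""
    rows = [r for r in sample if predicate(r)]
    if agg == "COUNT":
        return sum(r["ScaleFactor"] for r in rows)
    if agg == "SUM":
        return sum(r[column] * r["ScaleFactor"] for r in rows)
    raise ValueError("unsupported aggregate: " + agg)

# A uniform 2-of-4 sample: each sample row represents 2 original rows.
sample = [{"c": 10, "ScaleFactor": 2.0}, {"c": 30, "ScaleFactor": 2.0}]
approx_answer(sample, lambda r: r["c"] > 0, "COUNT")  # estimates 4 rows
```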

5 Architecture (Cont.) Error metrics: used to determine the quality of an approximate answer to an aggregation query. Suppose the correct answer for a query Q is y while the approximate answer is y'.
Relative error: E(Q) = |y - y'| / y
Squared error: SE(Q) = (|y - y'| / y)²
Squared error in answering a GROUP BY query Q, where the correct answer for the ith of g groups is yi and the approximate answer is yi': SE(Q) = (1/g) Σi ((yi - yi') / yi)²
Given a probability distribution pw of queries, the mean squared error for the distribution: MSE(pw) = ΣQ pw(Q)*SE(Q), where pw(Q) is the probability of query Q.
Root mean squared error: RMSE(pw) = square root of MSE(pw)
Other error metrics: the L1 metric, the mean error over all queries in the workload; the L∞ metric, the maximum error over all queries.
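The error metrics above translate directly into code; a small sketch:

```python
import math

def relative_error(y, y_approx):
    """E(Q) = |y - y'| / y."""
    return abs(y - y_approx) / y

def squared_error(y, y_approx):
    """SE(Q) = (|y - y'| / y)**2."""
    return relative_error(y, y_approx) ** 2

def groupby_squared_error(exact, approx):
    """SE(Q) = (1/g) * sum_i ((y_i - y_i') / y_i)**2 over g groups."""
    g = len(exact)
    return sum(((y - ya) / y) ** 2 for y, ya in zip(exact, approx)) / g

def rmse(se_list, probs):
    """MSE(p_W) = sum_Q p_W(Q) * SE(Q); RMSE is its square root."""
    return math.sqrt(sum(p * se for p, se in zip(probs, se_list)))
```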

6 The Special Case of a Fixed Workload
Overview: Here we provide a solution for the special case of a fixed workload, i.e., when the incoming queries are identical to the given workload. We use an effective deterministic scheme rather than the conventional randomization scheme.
Problem Formulation. Problem: FIXEDSAMP. Input: R, W, k. Output: a sample of k records (with appropriate additional columns) such that MSE(W) is minimized. MSE(W) = MSE(pw) (mean squared error), where a query Q has a probability of occurrence of 1 if Q ∈ W and 0 otherwise.
Fundamental Regions: For a given relation R and workload W, consider partitioning the records in R into a minimum number of regions F = {R1, R2, …, Rr} such that for any region Rj, each query in W selects either all records in Rj or none. These regions are the fundamental regions of R induced by W. E.g., consider a relation R (with aggregate column C) containing nine records (with C values 10, 20, …, 90). Let W consist of two queries, Q1 (which selects records with C values between 10 and 50) and Q2 (which selects records with C values between 40 and 70). These two queries induce a partition of R into four fundamental regions, R1, R2, R3, R4. In general the total number of fundamental regions r depends on R and W and is upper-bounded by min(2^|W|, n), where n is the number of records in R.
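Fundamental regions can be computed by grouping records by the exact set of workload queries that select them; a sketch using the slide's nine-record example (queries modeled as predicates over the C value):

```python
def fundamental_regions(records, queries):
    """Partition records so that two records share a region exactly
    when the same subset of workload queries selects both."""
    regions = {}
    for rec in records:
        key = frozenset(i for i, q in enumerate(queries) if q(rec))
        regions.setdefault(key, []).append(rec)
    return list(regions.values())

# The slide's example: C values 10..90; Q1 selects 10..50, Q2 selects 40..70.
recs = [10, 20, 30, 40, 50, 60, 70, 80, 90]
qs = [lambda c: 10 <= c <= 50, lambda c: 40 <= c <= 70]
fundamental_regions(recs, qs)
# four regions: {10,20,30} (Q1 only), {40,50} (both),
#               {60,70} (Q2 only), {80,90} (neither)
```

Since each record maps to one of at most 2^|W| query subsets, and there are only n records, this also makes the min(2^|W|, n) upper bound on r visible.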

7 The Special Case of a Fixed Workload (Cont.)
Solution for FIXEDSAMP: a deterministic algorithm called FIXED.
Step 1 (Identify Fundamental Regions): Let r be the number of fundamental regions.
Case A: r ≤ k (the selected sample can answer queries WITHOUT any error).
Step 2A (Pick Sample Records): Pick exactly one record from each fundamental region.
Step 3A (Assign Values to Additional Columns): The idea is that each sample record can be used to "summarize" all records from the corresponding fundamental region, WITHOUT incurring any error. For a workload consisting of ONLY COUNT queries, store the count of the number of records in that fundamental region in a SINGLE additional column in the sample records (called RegionCount). For a workload consisting of ONLY SUM queries, store the sum of the values in the aggregate column for records in that fundamental region in the SINGLE additional column (called AggSum). For a workload containing a MIX of COUNT and SUM queries, we need BOTH the RegionCount and AggSum columns, which can also answer AVG queries.
Case B: r > k (select a sample that tries to minimize the errors in queries).
Step 2B (Pick Sample Records): Sort all r regions by their importance, select the TOP k, and then pick one record from each of the selected regions. The importance of region Rj is defined as fj * nj², where fj is the sum of the weights of all queries in W that select the region and nj is the number of records in the region. fj measures the weights of queries that are affected by Rj, while nj² measures the effect on the (squared) error of not including Rj.
Step 3B (Assign Values to Additional Columns): Note that the extra column values of a sample record are NOT required to characterize the corresponding fundamental region; all we care about is that they contain appropriate values so that the error for the workload is minimized. To assign values to the RegionCount and AggSum columns of the k selected sample records, express MSE(W) as a quadratic function of 2*k unknowns, {RC1,…,RCk} and {AS1,…,ASk}, partially differentiate with respect to each variable, and set each result to zero. This gives rise to 2*k simultaneous (sparse) linear equations, which can be solved using an iterative technique.
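Step 2B's importance ranking can be sketched as follows (a minimal illustration; the region representation as `(records, f_j)` pairs is an assumption of this sketch, not the paper's data structure):

```python
def pick_top_regions(regions, k):
    """Step 2B of FIXED: rank fundamental regions by importance
    f_j * n_j**2 and take one record from each of the top k.
    regions: list of (records, f_j) pairs, where f_j is the summed
    weight of the workload queries that select the region."""
    ranked = sorted(regions,
                    key=lambda rf: rf[1] * len(rf[0]) ** 2,
                    reverse=True)
    return [records[0] for records, _ in ranked[:k]]

# Three regions with importances 0.5*3**2=4.5, 0.5*1**2=0.5, 0.2*2**2=0.8:
pick_top_regions([([1, 2, 3], 0.5), ([4], 0.5), ([5, 6], 0.2)], 2)
# keeps the first and third regions
```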

8 The Non-Special Case with a Lifted Workload
Overview: Here we provide a solution for the non-special case, in which incoming queries are "similar" but NOT identical to the workload queries. The problem focuses on SINGLE-TABLE selection queries with aggregation containing either the SUM or COUNT aggregate. A workload W consists of exactly ONE query Q on relation R.
Problem Formulation. Problem: SAMP. Input: R, pw (a probability distribution function specified by W), and k. Output: a sample of k records (with the appropriate additional column(s)) such that MSE(pw) is minimized.
Lifted workload: For a given W, define a lifted workload pw, i.e., a probability distribution of incoming queries. Intuitively, for any query Q' (not necessarily in W), pw(Q') should be related to the amount of similarity (dissimilarity) of Q' to the workload: high if Q' is similar to queries in the workload, and low otherwise. We say that two queries Q' and Q are similar if the records selected by Q' and Q have significant overlap. The objective is to define the distribution p{Q}. Since for the purposes of lifting we are only concerned with the set of records selected by a query and NOT the query itself, p{Q} maps subsets of R to probabilities instead of mapping queries to probabilities.

9 The Non-Special Case with a Lifted Workload
Lifted workload (Cont.): Assume two parameters δ (½ ≤ δ ≤ 1) and γ (0 ≤ γ ≤ ½). These parameters define the degree to which the workload "influences" the query distribution. For any given record inside (resp. outside) RQ, the parameter δ (resp. γ) represents the probability that an incoming query will select this record. For all R' ⊆ R, p{Q}(R') is the probability of occurrence of any query that selects exactly the set of records R'. RQ is the set of records selected by Q; n1, n2, n3, and n4 are the counts of records in the regions RQ − R', RQ ∩ R', R' − RQ, and R − (RQ ∪ R'), respectively, so that p{Q}(R') = δ^n2 (1−δ)^n1 γ^n3 (1−γ)^n4. When n2 or n4 is large (i.e., the overlap is large), p{Q}(R') is high (i.e., queries that select R' are likely to occur). When n1 or n3 is large (i.e., the overlap is small), p{Q}(R') is low (i.e., queries that select R' are unlikely to occur).
Setting the parameters δ and γ:
δ → 1 and γ → 0: incoming queries are identical to workload queries
δ → 1 and γ → ½: incoming queries are supersets of workload queries
δ → ½ and γ → 0: incoming queries are subsets of workload queries
δ → ½ and γ → ½: incoming queries are unrestricted
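Under this lifting model each record is included in an incoming query independently (with probability δ inside RQ, γ outside), so p{Q}(R') is a product over the four region counts; a sketch:

```python
def lifted_prob(R, RQ, Rprime, delta, gamma):
    """p_{Q}(R'): probability that an incoming query selects exactly
    the record set R'. Records in RQ are selected independently with
    probability delta; records outside RQ with probability gamma."""
    n1 = len(RQ - Rprime)       # in RQ but not selected
    n2 = len(RQ & Rprime)       # in RQ and selected (overlap)
    n3 = len(Rprime - RQ)       # outside RQ but selected
    n4 = len(R - RQ - Rprime)   # outside RQ and not selected
    return (delta ** n2) * ((1 - delta) ** n1) \
         * (gamma ** n3) * ((1 - gamma) ** n4)

# With delta -> 1 and gamma -> 0, only R' == RQ has nonzero probability,
# matching the "identical to workload queries" setting above.
```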

10 The Non-Special Case with a Lifted Workload
Stratified sampling: Stratified sampling is a well-known generalization of uniform sampling in which a population is partitioned into multiple strata and samples are selected uniformly from each stratum, with "important" strata contributing relatively more samples. Define the population of a query Q (denoted POPQ) on a relation R as a set of size |R| that contains, for each record, the value of the aggregated column if the record is selected by Q, or 0 otherwise. A stratified sampling scheme partitions R into r strata containing n1,…,nr records (where Σnj = n), with k1,…,kr records uniformly sampled from each stratum (where Σkj = k). The scheme also associates a ScaleFactor with each record in the sample. Queries are answered by executing them on the sample instead of R. For a COUNT query, the ScaleFactor entries of the selected records are summed, while for a SUM(y) query the expression y*ScaleFactor is summed. If we also wish to return an error guarantee with each query, then instead of ScaleFactor we have to keep track of nj and kj individually for each stratum.
Solution for STRAT: 3 steps.
Step one (stratification): determine (a) how many strata r to partition relation R into, and (b) the records of R that belong to each stratum. At the end of step one, we have r strata R1,…,Rr containing n1,…,nr records such that Σnj = n.
Step two (allocation): determine how to divide k (the number of records available for the sample) into integers k1,…,kr across the r strata such that Σkj = k.
Step three (sampling): uniformly sample kj records from stratum Rj to form the final sample of k records.
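Steps two and three above, with the per-stratum ScaleFactor nj/kj, can be sketched as follows (a minimal illustration; the allocation is taken as given here, since computing it is the subject of the next slides):

```python
import random

def stratified_sample(strata, alloc):
    """Given strata (lists of records) and an allocation k_1..k_r,
    uniformly sample k_j records from stratum j and attach
    ScaleFactor = n_j / k_j to each sampled record."""
    sample = []
    for stratum, kj in zip(strata, alloc):
        nj = len(stratum)
        for rec in random.sample(stratum, kj):
            sample.append((rec, nj / kj))  # (record, ScaleFactor)
    return sample
```

Summing the ScaleFactor entries of all sampled records estimates COUNT(*) over the whole relation; by construction that estimate is exact for the full table, and approximate for selections that cut across strata.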

11 The Non-Special Case with a Lifted Workload
Solution for STRAT (Cont.): Solution for the COUNT Aggregate.
Stratification Step. Lemma 1: Consider a relation R with n records and a workload W of COUNT queries. In the limit as n tends to infinity, the fundamental regions F = {R1, R2, …, Rr} represent an optimal stratification.
Allocation Step. (1) Express MSE(pw) as a weighted sum of the MSE of each query in the workload. Lemma 2: MSE(pw) = Σi wi MSE(p{Qi}). (2) For any Q ∈ W, express MSE(p{Q}) as a function of the kj's. Lemma 3: for a COUNT query Q in W, ApproxMSE(p{Q}) closely approximates MSE(p{Q}) (the formula is not reproduced in this transcript). Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pw) as a function of the variables kj. Corollary 1: MSE(pw) = Σj(αj / kj), where each αj is a function of n1,…,nr, δ, and γ. αj captures the "importance" of a region; it is positively correlated with nj as well as with the frequency of queries in the workload that access Rj. (3) Minimize MSE(pw). Lemma 4: Σj(αj / kj) is minimized subject to Σjkj = k when kj = k*(√αj / Σi√αi). Lemma 4 provides a closed-form and computationally inexpensive solution to the allocation problem, since αj depends only on δ, γ, and the number of tuples in each fundamental region.
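Lemma 4's closed-form allocation is one line of code; a sketch (rounding the resulting kj to integers summing to k is omitted for simplicity):

```python
import math

def allocate(alphas, k):
    """Lemma 4: minimize sum_j alpha_j / k_j subject to sum_j k_j = k,
    giving k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

# alphas = [1, 4] -> sqrt ratios 1:2 -> allocation [10, 20] of k = 30.
# Sanity check: this beats an equal split, since
# 1/10 + 4/20 = 0.3 < 1/15 + 4/15 ≈ 0.333.
```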

12 The Non-Special Case with a Lifted Workload
Solution for STRAT (Cont.): Solution for the SUM Aggregate.
Stratification Step. Since each stratum may have LARGE internal variance in the values of the aggregate column, we CANNOT use the same stratification as in the COUNT case (i.e., strata = fundamental regions). Instead, divide fundamental regions with large variance into a set of finer regions, each of which has significantly lower internal variance, and treat these finer regions as the strata. Within a new stratum the aggregate column values of records are CLOSE to one another. We borrow from the statistics literature an approximation of the optimal Neyman allocation technique for minimizing variance and use it to divide each fundamental region into h finer regions, generating a total of h*r regions, which become the strata (h was set to 6).
Allocation Step. (1) As with COUNT, the problem is expressed as an optimization problem, now with h*r unknowns k1,…,kh*r. (2) Unlike COUNT, the specific values of the aggregate column in each region influence MSE(p{Q}). Let yj (Yj) be the average (sum) of the aggregate column values of all records in region Rj. Since the variance within each region is SMALL (due to stratification), we can assume that each value within the region is approximated by yj, and thus express MSE(p{Q}) as a function of the kj's for a SUM query Q in W. As with COUNT, MSE(pw) for SUM has the functional form Σj(αj / kj), where αj depends on the parameters n1,…,nh*r, δ, and γ (see Corollary 1). (3) The minimization step is the same as in Lemma 4.
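The idea of splitting a high-variance fundamental region into h low-variance finer strata can be sketched as follows. This is a deliberate simplification: it cuts the sorted aggregate values into h equal-sized buckets, standing in for the paper's approximation of Neyman allocation, which chooses the boundaries to minimize within-stratum variance.

```python
def split_region(region_values, h=6):
    """Split one fundamental region into h finer strata of records
    with similar aggregate values (equal-sized buckets over the
    sorted values; a simplified stand-in for the Neyman-style split)."""
    vals = sorted(region_values)
    n = len(vals)
    return [vals[i * n // h:(i + 1) * n // h] for i in range(h)]
```

Each bucket groups nearby values, so within a finer stratum every value is close to the stratum average yj, which is what the allocation step assumes.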

