
February 14, 2006, CS6392 - DB Exploration. Congressional Samples for Approximate Answering of Group-By Queries. Swarup Acharya, Phillip B. Gibbons, Viswanath.


1 Congressional Samples for Approximate Answering of Group-By Queries
Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala
Presented by: Muhammed Z. Miah
February 14, 2006, CS6392 - DB Exploration

2 Introduction
Limitations of uniform sampling:
- Presence of skewed data in aggregate values
- Effect of low selectivity in selection queries
- Presence of small groups in group-by queries
Biased sampling for group-by queries (precomputed)
Congressional samples: a hybrid union of biased and uniform samples

3 Aqua System (Architecture)

4 Problems with Group-By Queries
Decision support queries routinely segment the data into groups. For example, a group-by query on the U.S. census database could be used to determine the per capita income per state.
However, there can be a huge discrepancy in the sizes of different groups; e.g., the state of California has nearly 70 times the population of Wyoming. As a result, a uniform random sample of the relation will contain far fewer tuples from the smaller groups. This leads to poor accuracy for answers on those groups, because accuracy depends heavily on the number of sample tuples that belong to the group.
For a uniform sample, the standard error is inversely proportional to √n, where n is the number of sample tuples.
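To make the effect concrete, here is a minimal Python sketch that computes how many tuples each group can expect in a uniform sample and the resulting 1/√n error factor. The group names, group sizes, and sample size are hypothetical illustration values (chosen only to give a 70:1 size ratio), not census figures.

```python
import math

def expected_group_sample_sizes(group_sizes, sample_size):
    """Expected number of sample tuples per group under uniform sampling."""
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}

def relative_error_factor(n):
    """Standard error scales as 1/sqrt(n); an empty group sample gives inf."""
    return math.inf if n <= 0 else 1.0 / math.sqrt(n)

# Hypothetical relation: one group is 70 times larger than the other.
group_sizes = {"large": 700_000, "small": 10_000}

# A 1% uniform sample of the whole relation.
expected = expected_group_sample_sizes(group_sizes, sample_size=7_100)
factors = {g: relative_error_factor(n) for g, n in expected.items()}

for g in group_sizes:
    print(g, expected[g], round(factors[g], 4))
```

Under these assumptions the small group's error factor is √70 times that of the large group, which is the imbalance the slide describes.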

5 Solution (Congressional Sampling)
Congressional samples are a hybrid union of uniform and biased samples.
Consider the US Congress, which is a hybrid of the House and the Senate. The House has representatives from each state in proportion to its population. The Senate has an equal number of representatives from each state. The same scheme is applied to the groups of a relation:
- House sample: a uniform random sample of the relation, so each group is represented in proportion to its size.
- Senate sample: an equal number of tuples from each group. The available sample space X is divided equally among the groups, and a uniform random sample is taken within each group.
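The two allocations can be compared with a short Python sketch. The group sizes and the space budget X are hypothetical, and the `hybrid` function is only one plausible way to combine the two (take the larger allocation per group, then rescale to fit X); it is an assumption for illustration, not a statement of the authors' exact procedure.

```python
def house(group_sizes, X):
    """House: space proportional to group size (what a uniform sample gives in expectation)."""
    total = sum(group_sizes.values())
    return {g: X * n / total for g, n in group_sizes.items()}

def senate(group_sizes, X):
    """Senate: the same amount of space for every group."""
    m = len(group_sizes)
    return {g: X / m for g in group_sizes}

def hybrid(group_sizes, X):
    """Assumed combination: per-group maximum of House and Senate, scaled down to X."""
    h, s = house(group_sizes, X), senate(group_sizes, X)
    best = {g: max(h[g], s[g]) for g in group_sizes}
    total = sum(best.values())
    return {g: X * v / total for g, v in best.items()}

# Hypothetical groups and budget.
group_sizes = {"A": 700, "B": 200, "C": 100}
X = 100

house_alloc = house(group_sizes, X)
senate_alloc = senate(group_sizes, X)
hybrid_alloc = hybrid(group_sizes, X)

for g in group_sizes:
    print(g, round(house_alloc[g], 2), round(senate_alloc[g], 2), round(hybrid_alloc[g], 2))
```

In this toy setting the smallest group gets more space than under House, and the largest group gets more than under Senate, while the total stays within X.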

6 Solution (Congressional Sampling)
Define a strategy S1 as follows: divide the available sample space X equally among the groups, and take a uniform random sample within each group.
Congressional approach: consider the entire set of possible group-by queries over a relation R. Let $\mathcal{G}$ be the set of non-empty groups under the grouping G. The grouping G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h under any other grouping $T \subseteq G$ is the union of one or more groups g from $\mathcal{G}$.
Constructing Congress:
1. Apply S1 on each $T \subseteq G$.
2. Let $\mathcal{T}$ be the set of non-empty groups under the grouping T, and let $m_T = |\mathcal{T}|$ be the number of such groups.
3. By S1, each of the non-empty groups in $\mathcal{T}$ should get a uniform random sample of $X/m_T$ tuples from the group.

7 Solution (Congressional Sampling)
Constructing Congress (continued):
4. Thus, for each subgroup g (a group at the finest partitioning, $\mathcal{G}$) of a group h under the grouping T, the expected space allocated to g is simply
$$S_{g,T} = \frac{X}{m_T} \cdot \frac{n_g}{n_h},$$
where $m_T$ is the number of non-empty groups under T, and $n_g$ and $n_h$ are the number of tuples in g and h respectively.
5. Then, for each group g, take the maximum over all T of $S_{g,T}$ as the sample size for g, and scale it down to limit the space used to X. The final formula is:
$$\mathrm{SampleSize}(g) = X \cdot \frac{\max_{T \subseteq G} S_{g,T}}{\sum_{j \in \mathcal{G}} \max_{T \subseteq G} S_{j,T}}$$
6. For each group g in $\mathcal{G}$, select a uniform random sample of size SampleSize(g).
Thus we have a stratified, biased sample in which each group at the finest partitioning is its own stratum. Congress essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples.
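The construction can be sketched in Python as follows. This is a minimal illustration, assuming the finest groups are given as a dictionary from attribute-value tuples to tuple counts; the function name `congress_allocation`, the two-attribute example, and its counts are hypothetical. Groupings T are enumerated as subsets of attribute positions, including the empty subset (no group-by).

```python
from collections import Counter
from itertools import combinations

def congress_allocation(finest_counts, num_attrs, X):
    """Return the sample size for each finest group g.

    finest_counts maps a tuple of grouping-attribute values (a finest group g)
    to n_g, the number of tuples in g.
    """
    best = {g: 0.0 for g in finest_counts}
    for k in range(num_attrs + 1):
        for T in combinations(range(num_attrs), k):
            # n_h for every non-empty group h under grouping T.
            n_h = Counter()
            for g, n_g in finest_counts.items():
                n_h[tuple(g[i] for i in T)] += n_g
            m_T = len(n_h)
            for g, n_g in finest_counts.items():
                h = tuple(g[i] for i in T)
                s_gT = (X / m_T) * (n_g / n_h[h])  # expected space for g under T
                best[g] = max(best[g], s_gT)
    # Scale the per-group maxima down so the total space is X.
    total = sum(best.values())
    return {g: X * s / total for g, s in best.items()}

# Hypothetical relation with grouping attributes (A, B).
finest_counts = {
    ("a1", "b1"): 3000,
    ("a1", "b2"): 3000,
    ("a1", "b3"): 1500,
    ("a2", "b3"): 2500,
}
alloc = congress_allocation(finest_counts, num_attrs=2, X=100)

for g, size in alloc.items():
    print(g, round(size, 2))
```

Enumerating every subset of attributes is exponential in the number of grouping attributes, so this direct form is only practical as an illustration for a handful of attributes.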

8 Rewriting
Query rewriting involves two key steps: a) scaling up the aggregate expressions and b) deriving error bounds on the estimate.
For each tuple, let its scale factor, ScaleFactor, be the inverse sampling rate for its stratum. All the sample tuples belonging to a group will have the same ScaleFactor. Thus the key step in scaling is to efficiently associate each tuple with its corresponding ScaleFactor. There are two approaches to doing this:
a) store the ScaleFactor (SF) with each tuple in the sample relation: Integrated
b) use a separate table to store the ScaleFactors for the groups: Normalized, Key-normalized, Nested-integrated
Each approach has its pros and cons.
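The scaling step can be illustrated with a small Python sketch, which covers only step a) and not the error bounds. It assumes sample tuples are dictionaries; the field names (`region`, `stratum`, `value`, `sf`), the helper `scaled_aggregates`, and the data are all hypothetical. The Integrated variant reads the scale factor from the tuple itself, while the Normalized variant looks it up in a separate table keyed by stratum.

```python
from collections import defaultdict

def scaled_aggregates(rows, group_key, value_key, scale_factor_of):
    """Estimate SUM, COUNT and AVG per group by weighting each sample tuple by its ScaleFactor."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0.0})
    for row in rows:
        sf = scale_factor_of(row)
        acc = totals[row[group_key]]
        acc["sum"] += row[value_key] * sf   # each sample tuple stands for sf tuples
        acc["count"] += sf
    return {
        g: {"sum": a["sum"], "count": a["count"], "avg": a["sum"] / a["count"]}
        for g, a in totals.items()
    }

# Hypothetical sample: stratum s1 was sampled at rate 1/10, stratum s2 at rate 1/2.
sample = [
    {"region": "west", "stratum": "s1", "value": 5.0, "sf": 10.0},
    {"region": "west", "stratum": "s1", "value": 7.0, "sf": 10.0},
    {"region": "west", "stratum": "s2", "value": 1.0, "sf": 2.0},
    {"region": "west", "stratum": "s2", "value": 3.0, "sf": 2.0},
    {"region": "west", "stratum": "s2", "value": 2.0, "sf": 2.0},
]

# Integrated: the scale factor is stored with each sample tuple.
integrated = scaled_aggregates(sample, "region", "value", lambda r: r["sf"])

# Normalized: the scale factor lives in a separate per-stratum table.
sf_table = {"s1": 10.0, "s2": 2.0}
normalized = scaled_aggregates(sample, "region", "value", lambda r: sf_table[r["stratum"]])

print(integrated["west"])
```

Both variants produce the same estimates; they differ in where the scale factor is stored, which affects storage and join cost rather than the answer.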

9 Computation and Maintenance
One-pass algorithm
[AGP99b] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. Technical report, Bell Laboratories, Murray Hill, New Jersey, November 1999.

10 Experiments
Testbed: Aqua, with Oracle (v7)
Accuracy of sample allocation strategies
Performance for different query sets:
- Queries with no group-bys, three group-bys, two group-bys
Effect of sample size:
- Error drops as more space is allocated to store the samples
- Congress: error drops rapidly with increasing sample size, and accuracy is high even for arbitrary group-bys
Performance of rewriting strategies

11 Extensions
- Generalization to multiple criteria
- Generalization to other queries

12 Related Work
- Online aggregation
- Histograms
- Wavelets
- Biased sampling (stratified sampling)

13 Conclusions
- Congressional samples are effective for group-by queries with arbitrary group-bys (including none).
- The new strategies were validated experimentally, both in their ability to produce accurate estimates for group-by queries and in their execution efficiency.

14 THANK YOU
Happy Valentines

