Presentation is loading. Please wait.

Presentation is loading. Please wait.

Histograms for Selectivity Estimation

Similar presentations


Presentation on theme: "Histograms for Selectivity Estimation"— Presentation transcript:

1 Histograms for Selectivity Estimation
Speaker: Ho Wai Shing

2 Contents Introduction: What is a histogram? How to use a histogram?
A taxonomy of single-dimensional histograms Some experimental results Some approaches for multi-dimensional histograms Conclusions Future Work

3 Introduction Many modules of a DB require selectivity estimation (estimating the query result size) e.g., query optimizer -- determine the nesting in indexed nested loop join user interface -- return a rough answer to the users

4 Introduction We need to store some statistics of the database to estimate the selectivity Histogram is one of the most common statistics to be stored in practice. Quite accurate, needs reasonably small space.

5 What is a Histogram? Histograms approximate the frequency distribution of an attribute (or a set of attributes) group attribute values into "buckets" approximate the actual frequencies by the statistical information stored in each bucket.

6 Histogram: Example Consider the following distribution:
This is an "equi-width" histogram: (1)  3

7 Histogram: the problem
a pair of dual problem [1]: Given a data distribution, a limit B on the length of H, and an error metric E(), find the histogram H that minimizes E(H). Given the data distribution, a limit  on the error, and an error metric E(), find the histogram H of smallest length for which E(H) is at most .

8 Two Goals for Histograms
for selections (exact or range queries) for joins focus on histograms for selections in this talk

9 Taxonomy of Histogram based on the paper by Poosala et al. [2] in SIGMOD'96 on single-dimensional histograms proposed a generalized histogram-generating algorithm different decisions in each step results in different histograms

10 Generalized Histogram Generating Algorithm
consider the data distribution as a two column table T(value, frequency) create a third attribute a3 (sort parameter) based on the first two attributes, sort the table according to a3. specify a subclass of histogram create a 4th attribute a4 (source parameter) partition T into B buckets s.t. it satisfies some constraints on a4.

11 Example: Equi-Width Histograms
a3 = value all histograms are possible a4 = value constraint: every bucket should contain the same number of data values

12 Example: End-Biased Equi-Depth Histograms
a3 = value all but one buckets must be singletons a4 = frequency constraint: all buckets should have the same total frequency counts

13 Taxonomy Dimensions: partition classes -- serial, end-biased
a3 -- value (V), frequency (F), area (A) a4 -- spread (S), frequency (F), cum. freq (C), area (A) constraints -- equi-sum, v-optimal, max-diff, compressed, spline-based

14 Constraints Equi-sum: each bucket should have the same sum of a4
V-Optimal: divide the buckets so that the variance of the overall frequency approximation is minimized Spline-based: the cumulative freq. satisfies a piece-wise linear approximation.

15 Constraints (cont.) Max-diff: bucket boundaries are at top-(B-1) adjacent a4 differences. Compressed (comp.): top-n entries with the highest a3 values are stored exactly, others are stored using equi-sum.

16 Taxonomy

17 Equi-Width Histograms
discussed in Kooi's thesis (1980) [3] denoted by "Equi-Sum(V, S)" in the taxonomy mergeable buckets must have contiguous values merge criteria is about the spread based on equi-sum

18 Equi-Depth Histograms
proposed by Piatetsky-Shapiro and Connell in SIGMOD'84 [4] denoted by "Equi-Sum(V, F)" in the taxonomy mergeable buckets must have contiguous values merge criteria is about the frequencies based on equi-sum

19 V-Optimal(F, F) Histograms
proposed by Ioannidis and Christodoulakis in 1993 [5] mergeable buckets must have contiguous frequencies merge criteria is to minimize sum-squared error on frequencies within a bucket

20 V-Optimal(V, F) Histograms
proposed Poosala et al. in 1996 [2] mergeable buckets must have contiguous values merge criteria is to minimize sum-squared error on frequencies

21 Max-Diff(V, F) Histograms
proposed Poosala et al. in 1996 [2] mergeable buckets must have contiguous values merge criteria is to minimize sum-squared error on frequencies

22 Compressed(V, F) Histograms
proposed Poosala et al. in 1996 [2] mergeable buckets must have contiguous values merge criteria is equi-depth except the more frequent n values.

23 Summary Data Distribution Max-Diff(V, F) equi-width V-optimal(F, F)
V-optimal(V, F) Compressed(V, F) equi-depth

24 Estimation Example (4) = 2 (actual value) equi-width: (4)  4
equi-depth: (4)  2.8 V-optimal(F,F): (4)  9(0)+4(1)+1(0) = 4 V-optimal(V,F): (4)  2.4(1)+9(0)+2.7(0) =2.4 Max-Diff(V,F): (4)  2.4 Compressed(V,F): (4)  2.2

25 Experimental Results 100000 tuples 200 attribute values
2000 samples for construction

26 Experimental Results cusp_max value distribution
random value & freq. relation frequencies fit to Zipf distribution

27 Other Experiment Parameters
Skew of frequency Skew of data value distribution Sample size in construction Accuracy vs. Storage Data Distributions (freq., values, correlations) Queries

28 Conclusions Histograms are useful in estimating the selectivity of a query Different techniques to use histogram for approximating the data exist v-optimal or MaxDiff histograms can have good accuracy for 1-D case

29 Future Work The methods presented can't solve the n-D histogram problem completely Try to apply SF-Tree to store and retrieve the buckets in multi-dimensional histogram efficiently.

30 References [1] H.V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik, Torsten Suel, Optimal Histograms with Quality Guarantees, VLDB’98 [2] Viswanath Poosala, Yannis Ioannidis, Peter Haas, Eugene Shekita, Improved Histograms for Selectivity Estimation of Range Predicates, SIGMOD’96 [3] R. P. Kooi, The Optimization of Queries in Relational Databases, PhD Thesis, Case Western Reserver University, 1980

31 References [4] M. Muralikrishna and D. DeWitt, Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries, SIGMOD’88


Download ppt "Histograms for Selectivity Estimation"

Similar presentations


Ads by Google