
1 Histograms for Selectivity Estimation, Part II: Global Optimization of Histograms
Speaker: Ho Wai Shing

2 Contents
- Supplements to the previous talk
- Introduction to histograms for multi-dimensional data
- Global optimization of histograms
- Experiment results
- Conclusion

3 Summary
Histograms approximate the frequency distribution of an attribute (or a set of attributes):
- group attribute values into "buckets"
- approximate the actual frequencies by the statistical information stored in each bucket
A taxonomy of histograms was discussed.
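
A minimal sketch of the bucketing idea, using a simple equi-width histogram; all names here are illustrative rather than taken from the talk:

```python
from collections import Counter

def equi_width_histogram(values, num_buckets):
    """Group values into equal-width buckets; store total frequency per bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0
    freq = Counter(values)
    buckets = [0] * num_buckets
    for v, f in freq.items():
        # Map each distinct value to its bucket; clamp hi into the last bucket.
        i = min(int((v - lo) / width), num_buckets - 1)
        buckets[i] += f
    return lo, width, buckets

lo, width, buckets = equi_width_histogram([1, 1, 2, 5, 8, 8, 8, 9], 4)
print(buckets)  # [3, 0, 1, 4]: per-bucket total frequencies
```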

4 Summary of the 1-D Histogram Taxonomy

5 [Taxonomy diagram, placing the histogram classes equi-width, equi-depth, V-optimal(F, F), V-optimal(V, F), Max-Diff(V, F), and Compressed(V, F) relative to the data distribution]

6 Estimation Procedures
- Find the histogram buckets that can contain the query range
- Estimate the counts by studying the overlapped portion of the query and the buckets, under one of: the continuous values assumption, the point value assumption, or the uniform spread assumption

7 Uniform Frequency Assumptions
[Diagrams of one bucket spanning the value range 29-45 under each assumption: continuous values (area = 16), point value (freq = 16), uniform spread (freq = 4 per assumed value)]
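
A hedged sketch of estimating a range query inside one bucket under the uniform spread assumption: the bucket's d distinct values are assumed to sit at equal spreads across the bucket, each carrying an equal share of the total frequency. Function and parameter names are assumptions for illustration:

```python
def estimate_in_bucket(lo, hi, distinct, total, a, b):
    """Estimate the count for a < X <= b restricted to the bucket [lo, hi]."""
    if distinct == 0:
        return 0.0
    spread = (hi - lo) / distinct      # assumed gap between consecutive values
    per_value = total / distinct       # assumed frequency of each value
    # Assumed value positions: lo, lo + spread, ..., hi - spread.
    count = sum(1 for k in range(distinct) if a < lo + k * spread <= b)
    return count * per_value

# Bucket over [29, 45] with 4 distinct values and total frequency 16:
print(estimate_in_bucket(29, 45, 4, 16, 30, 45))  # 12.0: 33, 37, 41 qualify, 4 each
```

The exact placement of the assumed values varies between formulations; the point is that both the positions and the frequencies are interpolated from the bucket's stored statistics.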

8 Uniform Frequency Assumptions
- Experiment results show that the uniform spread assumption gives the best estimation
- The uniform spread assumption is even better against "non-uniform spread" data than the continuous values assumption (strange!): e.g., if the spreads are large, many queries should return 0.

9 Experiments
Datasets, in the form of (value, frequency) pairs:
- distribution of data values: uniform, zipf_incr, zipf_dec, etc.
- frequencies: Zipf, with different skew factors
- different value-frequency correlations: positive, negative, random

10 Experiments
The data value distribution: [figure]

11 Experiments
Queries have the form a < X ≤ b, of 5 types:
- a = -1, b in the range of values
- a = -1, b = one of the values that appear in the data
- random a and b, s.t. selectivity in [0, 0.2]
- random a and b, s.t. selectivity in [0.8, 1]
- random a and b, s.t. selectivity in [0, 0.2] ∪ [0.8, 1]

12 Experiments
Histograms:
- all histograms described in the taxonomy
- the histograms are of the same storage size (not the same number of buckets)
- built from 2000 samples (except the trivial, the precise equi-depth, and the P² histograms)
- built through scanning the data once

13 Experiments
Setup: cusp_max value distribution, random value & frequency correlation, z = 1.
Result: sort parameter = V usually means better accuracy.

14 Experiments
- errors based on the v-optimal(V, A) histogram
- increasing the sample size can give better results

15 End of Part I

16 Part II: Global Optimization of Histograms (GOH)

17 Histograms for n-D data
- The histograms discussed previously are on a single attribute (1-D data)
- Two main approaches for n-D data: use n 1-D histograms, or use an n-D histogram

18 Histograms for n-D data
Using n 1-D histograms:
- needs the "attribute value independence assumption"
- can use all the 1-D histogram techniques
- already gives quite good accuracy
- representative: GOH [Jagadish et al. SIGMOD'01]
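
A minimal sketch of how the attribute value independence assumption is used: the selectivity of a conjunctive predicate is approximated as the product of the per-attribute selectivities taken from the n 1-D histograms. `selectivity_1d` is a hypothetical per-histogram estimator, not an API from the talk:

```python
def combined_selectivity(histograms, predicates, selectivity_1d):
    """Approximate selectivity of a conjunction, assuming independent attributes."""
    sel = 1.0
    for hist, pred in zip(histograms, predicates):
        sel *= selectivity_1d(hist, pred)  # fraction of tuples matching pred
    return sel

# e.g., sel(A1 <= 10) = 0.3 and sel(A2 <= 5) = 0.2 combine to 0.06,
# which is exact only if A1 and A2 really are independent.
```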

19 Histograms for n-D data
Using an n-D histogram:
- doesn't need the "avia" (attribute value independence assumption), which is usually untrue
- but a "good" partition of the n-D space into buckets is difficult to compute, store, and maintain
- representatives: MHIST [Poosala & Ioannidis VLDB'97], H-Tree [Muralikrishna & DeWitt SIGMOD'88]

20 Global Optimization of Histograms (GOH)
Given a space budget of B buckets, find an optimal assignment of buckets among the dimensions that minimizes the error, i.e., give more buckets to attributes that are used frequently or have skewed distributions.

21 GOH -- example
e.g., if A1's frequency distribution is nearly uniform while A2's is highly skewed, and we have 4 buckets, an A1:1, A2:3 assignment is better than an A1:2, A2:2 assignment.

22 Computing GOH
- Exhaustive Algorithm (GOHEA)
- Based on Dynamic Programming (GOHDP)
- Greedy Approach (GOHGA)
- Greedy Approach with Remedy (GOHGR)

23 GOHEA
For every possible bucket assignment, calculate the error metric and keep the minimum. Clearly too inefficient.

24 GOHDP
Define:
- E(b, k) = the minimum error of using b buckets to store the first k histograms
- error(b, k) = the error of using b buckets to store the k-th histogram
Observation: E(b, k) = min over x of ( E(b-x, k-1) + error(x, k) ), where x is the number of buckets given to the k-th histogram.
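
A sketch of the recurrence as a dynamic program, assuming the per-histogram error table error[k][x] (error of the k-th histogram built with x buckets) has already been computed, e.g. by the v-optimal DP of [Jagadish et al. VLDB'98]:

```python
def gohdp(error, B, M):
    """Minimum total error of storing M histograms within B buckets.

    error[k][x]: error of histogram k (0-based) built with x buckets.
    """
    INF = float("inf")
    # E[k][b] = minimum error of the first k histograms using b buckets.
    E = [[INF] * (B + 1) for _ in range(M + 1)]
    E[0][0] = 0.0
    for k in range(1, M + 1):
        for b in range(k, B + 1):           # every histogram needs >= 1 bucket
            for x in range(1, b - k + 2):   # buckets given to histogram k
                cand = E[k - 1][b - x] + error[k - 1][x]
                if cand < E[k][b]:
                    E[k][b] = cand
    return E[M][B]
```

Filling the table this way is O(B²M), matching the analysis on the next slide; recording the minimizing x per entry would also recover the actual allocation.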

25 GOHDP
- calculating all error(b, k): O(N²BM)
- another DP-based algorithm can calculate all error(b, k) for a given k in O(N²B) [Jagadish et al. VLDB'98]
- filling the E(b, k) table: O(B²M), since there are O(BM) entries to be filled and O(B) computations for each entry

26 GOHDP
Note that if we knew the allocation beforehand, we would only need O(N²B) to construct the histograms. GOHDP is still inefficient if M is large.

27 GOHGA
The greedy approach is O(N²B), i.e., nearly no penalty compared with direct construction (using the same number of buckets on all attributes).
Define the marginal gain: m_k(i, j) = error(i, k) - error(j, k), i.e., the reduction in error if we use j buckets instead of i buckets for the k-th histogram.

28 GOHGA
1. assign 1 bucket to each dimension
2. allocate the remaining buckets one by one, each to the dimension with the greatest marginal gain from the new bucket
3. repeat until all buckets are assigned
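
A sketch of these steps, again assuming a precomputed error[k][b] table (the paper instead extends each histogram incrementally, which is where the O(N²) per step on the next slide comes from):

```python
def gohga(error, B, M):
    """Greedy bucket allocation: returns a list of bucket counts per dimension."""
    alloc = [1] * M                           # step 1: one bucket each
    for _ in range(B - M):                    # steps 2-3: hand out the rest
        # Marginal gain m_k(alloc[k], alloc[k] + 1) for each dimension k.
        gains = [error[k][alloc[k]] - error[k][alloc[k] + 1] for k in range(M)]
        best = max(range(M), key=lambda k: gains[k])
        alloc[best] += 1
    return alloc
```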

29 GOHGA
- O(B) steps
- O(N²) per marginal gain calculation, since we can get error(b, k) incrementally from b = 1 in O(N²b)
- overall O(N²B)
- but sometimes GOHGA does not return the optimum

30 GOHGR
- greedy: look ahead 1 step to see which allocation has the greatest marginal gain
- remedy: look ahead 2 steps to see if it can find a better allocation
- e.g., with m_1(3,4) = 30, m_1(4,5) = 130, m_2(3,4) = 40, m_2(4,5) = 40, plain greedy gives both remaining buckets to dimension 2 (gain 40 + 40 = 80), but the look-ahead sees that dimension 1 would gain 30 + 130 = 160.
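
The slide's numbers, worked through in a small snippet (a sketch of the comparison only; the paper's exact remedy rule may differ in detail):

```python
# m1, m2: marginal gains for dimensions 1 and 2, keyed by (i, j) bucket counts.
m1 = {(3, 4): 30, (4, 5): 130}
m2 = {(3, 4): 40, (4, 5): 40}

greedy_gain = m2[(3, 4)] + m2[(4, 5)]      # greedy picks dim 2 twice: 80
lookahead_gain = m1[(3, 4)] + m1[(4, 5)]   # concentrating on dim 1: 160
assert lookahead_gain > greedy_gain        # the remedy favours dimension 1
```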

31 Experiments
Aims:
- to show that GOH really achieves a smaller error by allocating more buckets to more skewed data
- to show that GOH is efficient to compute

32 Experiments
Types:
- abs/rel errors of different attributes
- abs/rel errors for different bucket budgets
- abs/rel errors for different distribution skews
- running time

33 Experiments
Dataset (synthetic data):
- 5 attributes, 500000 tuples, 500 values per attribute
- frequencies follow a Zipf distribution, with random association between frequencies and values
- the 5 attributes have z = 0, 0.01, 0.1, 1, 2
- 10000 queries of the form X ≤ a

34 Experiments [results figure]

35 [results figure, continued]

36
- dataset (TPC-D data with skew), as a more realistic dataset: gives similar results to the synthetic data
- dataset (2 attributes): to evaluate the gain of GOH due to the skew difference between attributes

37 Experiments
Skew settings of the 2-attribute datasets:
- TG3: z = 0, 2
- TG4: z = 0.02, 1.8
- TG5: z = 1.8, 2

38 Experiments [results figure]

39 Conclusions
- GOH has smaller errors because it assigns more buckets to skewed or frequently used distributions
- Nearly no time penalty in building GOH when using GOHGR

40 Future Work
- The methods presented can't solve the n-D histogram problem completely
- Try to apply the SF-Tree to store and retrieve the buckets of a multi-dimensional histogram efficiently

41 References
[Jagadish et al. SIGMOD'01] H. V. Jagadish, Hui Jin, Beng Chin Ooi, Kian-Lee Tan, Global Optimization of Histograms, SIGMOD'01
[Jagadish et al. VLDB'98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik, Torsten Suel, Optimal Histograms with Quality Guarantees, VLDB'98
[Poosala et al. SIGMOD'96] Viswanath Poosala, Yannis Ioannidis, Peter Haas, Eugene Shekita, Improved Histograms for Selectivity Estimation of Range Predicates, SIGMOD'96
[Poosala & Ioannidis VLDB'97] Viswanath Poosala and Yannis Ioannidis, Selectivity Estimation Without the Attribute Value Independence Assumption, VLDB'97

42 References
[Muralikrishna & DeWitt SIGMOD'88] M. Muralikrishna and D. DeWitt, Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries, SIGMOD'88

