Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes Xiaohui Yu 1, Calisto Zuzarte 2, Ken Sevcik 1 1 University of Toronto.

1 Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes
Xiaohui Yu (1), Calisto Zuzarte (2), Ken Sevcik (1)
(1) University of Toronto, (2) IBM Toronto Lab

2 Distinct value combinations (November 3, 2005, CIKM)
    Country   City        Hotel Name
    Germany   Bremen      Hilton
    Germany   Bremen      Best Western
    Germany   Frankfurt   InterCity
    Canada    Toronto     Four Seasons
    Canada    Toronto     Intercontinental
- 3 distinct value combinations: COLSCARD (COlumn Set CARDinality) = 3
- The problem: estimating COLSCARD for a given set of attributes
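As a minimal sketch of what COLSCARD counts, the Python below computes it exactly for the example table. It assumes the slide's count of 3 refers to the attribute set (Country, City); the helper name `colscard` and the `COLUMNS` tuple are illustrative and do not come from the talk.

```python
# Rows copied from the example table on the slide.
ROWS = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]
COLUMNS = ("Country", "City", "Hotel Name")


def colscard(rows, attrs):
    """Exact number of distinct value combinations over the given attributes."""
    idx = [COLUMNS.index(a) for a in attrs]
    # A set of projected tuples keeps one entry per distinct combination.
    return len({tuple(row[i] for i in idx) for row in rows})


# Assumption: the slide's "3" is the count for (Country, City).
print(colscard(ROWS, ("Country", "City")))  # 3
```

Computing the exact value this way needs a full scan of the data, which is what an estimator based on existing statistics avoids.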

3 Motivation
- Cardinality estimation for query optimization, e.g.,
    - Estimating the size of [expression not preserved in the transcript]
    - Estimating the size of the aggregation
- Approximate query answering, e.g., COUNT queries
    SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold
    FROM sales
    GROUP BY sales_date, sales_person

4 Roadmap
- Related work
- Estimation with known marginal distributions
    - Upper/lower bounds
    - An estimator
- Estimation with histograms
- Experimental results
- Conclusions

5 Related work
- Previous work has focused on the single-attribute case: [HÖT88], [HÖT89], [HNSS'95], [HS'98], [CCMN'00].
- A sampling approach is used, and estimation through sampling is difficult [CCMN'00].
- No existing statistical information is exploited.

6 Our solution
- Considering multiple attributes
- Utilizing existing statistics on individual attributes
    - Readily available in most database systems
    - Does not require access to the data
- Granularity of statistics
    - Exact marginal frequency distributions
    - Approximate distributions: histograms, etc.

7 Estimation with known marginals
- Number of distinct values in attribute A_i, and its frequency vector [notation not preserved in the transcript]
    Country   City        Hotel Name
    Germany   Bremen      Hilton
    Germany   Bremen      Best Western
    Germany   Frankfurt   InterCity
    Canada    Toronto     Four Seasons
    Canada    Toronto     Intercontinental

8 The naïve estimator
- COLSCARD = number of possible value combinations = \prod_i d_i
- d_i: the number of distinct values in attribute A_i
- Sanity bound: COLSCARD cannot be greater than the table size
- The problem: some value combinations with low occurrence probabilities may not appear in the table!
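The naïve estimator with its sanity bound can be sketched in a few lines of Python. The function name `naive_colscard` and the table sizes in the usage lines are hypothetical; the distinct counts 3, 2, 2 are the ones used in the deck's later example.

```python
from math import prod  # available in Python 3.8+


def naive_colscard(distinct_counts, table_size):
    """Product of per-attribute distinct counts, capped at the table size."""
    if table_size < 0 or any(d < 0 for d in distinct_counts):
        raise ValueError("counts and table size must be non-negative")
    # Sanity bound: a table cannot hold more distinct combinations than rows.
    return min(prod(distinct_counts), table_size)


print(naive_colscard([3, 2, 2], table_size=1000))  # 12
print(naive_colscard([3, 2, 2], table_size=10))    # 10 (capped by table size)
```

The cap only helps when the product exceeds the table size; it does nothing about combinations that are possible but too unlikely to occur.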

9 Upper/lower bounds
- Trivial bounds
    - Upper bound: the naïve estimator
    - Lower bound: [formula not preserved in the transcript]
- Tighter bounds? In the case of two attributes, tighter bounds are available.

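10 Tighter bounds
[Figure: a two-attribute example over A1 and A2 with values a, b, c, d, e, f and a value/frequency table per attribute. The table contents and the value of N are not preserved in the transcript; the bracketed pair [2, 3] survives without context.]
- Naïve bounds: 3, 9
- Lower bound: [value not preserved in the transcript]
- Upper bound = 5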

11 Expected number of combinations
- Assumptions
    1. The data distributions of individual columns are independent.
    2. The occurrence of each combination in the table is independent.
- Each element of f represents the frequency of a specific value combination. [The definition of f is not preserved in the transcript.]
- An estimate of the probability of occurrence: [formula not preserved in the transcript]

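12 Estimator
Let p_i denote the estimated occurrence probability of the i-th combination (formulas reconstructed from the slide text; the originals are not preserved in the transcript).
- The probability of the i-th combination not appearing in a particular tuple is 1 - p_i.
- The probability of the i-th combination not appearing in the table (of size N) is (1 - p_i)^N.
- The expected number of value combinations is \sum_i [1 - (1 - p_i)^N].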
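The estimator described on this slide can be sketched in Python under the two independence assumptions of the previous slide. This is an illustrative reading, not the authors' implementation: the function name `expected_colscard` is hypothetical, the probability of a combination is assumed to be the product of the marginal relative frequencies, and the marginals in the usage example are made up.

```python
from itertools import product
from math import prod  # available in Python 3.8+


def expected_colscard(marginals, table_size):
    """Expected number of distinct combinations in a table of `table_size` rows.

    `marginals` holds one frequency vector (counts) per attribute; each
    vector is assumed to sum to `table_size`.
    """
    total = 0.0
    # Enumerate every possible value combination across the attributes.
    for combo in product(*marginals):
        # Assumption 1: columns are independent, so the probability of a
        # combination is the product of the marginal relative frequencies.
        p = prod(f / table_size for f in combo)
        # Assumption 2: tuples are independent, so the combination is
        # absent from all rows with probability (1 - p) ** table_size.
        total += 1.0 - (1.0 - p) ** table_size
    return total


# Hypothetical marginals for three attributes with 3, 2 and 2 distinct values.
estimate = expected_colscard([[5, 3, 2], [6, 4], [9, 1]], table_size=10)
print(round(estimate, 2))
```

The loop visits all \prod_i d_i combinations, so this direct form is only practical when that product is small.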

13 Example revisited
- Estimate the COLSCARD for attribute set (A_1, A_2, A_3), given [frequency vectors not preserved in the transcript]
- New estimate: 5.94
- Naïve estimate: 3*2*2 = 12

14 Roadmap
- Related work
- Estimation with known marginal distributions
    - Upper/lower bounds
    - An estimator
- Estimation with histograms
- Experimental results
- Conclusions

15 Estimation with histograms
- Histograms exist on individual attributes
- Two classes of histograms
    - Partition-based
    - End-biased
- Marginals can be (approximately) reconstructed from histograms
- Optimal histograms in each class?

16 Optimal histograms
- Minimizing the error incurred by histograms: ERR = |EST_hist - EST_exact|
- Partition-based histograms: a dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used.

17 Optimal end-biased histograms
- An end-biased histogram with B buckets stores
    - The exact frequencies of B-1 attribute values
    - The average of the remaining values
- Which B-1 values to store exactly?
- Most widely used end-biased histograms store the frequencies of the most frequent values
    - Not always optimal for COLSCARD estimation!

18 Example
- Attributes (A1, A2), N=10
- Choose 1 frequency to store exactly
[Table: error for each index of the frequency stored; values not preserved in the transcript.]

19 Optimal end-biased histograms
- Exhaustive search takes time proportional to [expression not preserved in the transcript]
- We prove that the optimal choice is one of the following:
    - Most frequent values
    - Least frequent values
    - A combination of most frequent and least frequent values
- Only need to search both ends: cost is linear in B, independent of d_j!
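The two-ended search can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' algorithm: all function names are hypothetical, the other attributes' marginals are taken as exact, and the error is measured with the independence-based expected-count estimator. The code only enumerates the candidate sets the slide names; it does not prove that one of them is optimal.

```python
from itertools import product
from math import prod  # available in Python 3.8+


def expected_colscard(marginals, n):
    """Expected distinct combinations under the independence assumptions."""
    return sum(
        1.0 - (1.0 - prod(f / n for f in combo)) ** n
        for combo in product(*marginals)
    )


def end_biased(freqs, keep):
    """Marginal reconstructed from an end-biased histogram.

    Frequencies at the indices in `keep` are stored exactly; every other
    value is replaced by the average of the remaining frequencies.
    """
    rest = [i for i in range(len(freqs)) if i not in keep]
    avg = sum(freqs[i] for i in rest) / len(rest) if rest else 0.0
    return [freqs[i] if i in keep else avg for i in range(len(freqs))]


def best_end_biased(freqs, others, n, buckets):
    """Try 'top most frequent + low least frequent' for each split of B-1.

    Returns (error, top, low, kept_indices) for the best candidate found.
    Only B candidates are evaluated, regardless of the number of values.
    """
    order = sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)
    exact = expected_colscard([freqs] + others, n)
    k = min(buckets - 1, len(freqs))  # number of exactly stored frequencies
    best = None
    for top in range(k + 1):
        low = k - top
        keep = set(order[:top]) | set(order[len(order) - low:])
        est = expected_colscard([end_biased(freqs, keep)] + others, n)
        err = abs(est - exact)  # ERR = |EST_hist - EST_exact|
        if best is None or err < best[0]:
            best = (err, top, low, sorted(keep))
    return best


# Hypothetical two-attribute input with N = 10 and B = 2 buckets.
print(best_end_biased([6, 3, 1], [[5, 5]], n=10, buckets=2))
```

Each candidate still costs one evaluation of the estimator, so the saving is in the number of candidate sets, not in the cost of a single evaluation.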

20 Roadmap
- Related work
- Estimation with known marginal distributions
    - Upper/lower bounds
    - An estimator
- Estimation with histograms
- Experimental results
- Conclusions

21 Experiments – Data sets
- Synthetic data
    - Skew: Zipfian parameter z = 0 (uniform) to 4 (highly skewed)
    - Number of tuples: 10K to 1M
- Real data
    - Cover Type: 581,012 tuples, 10 attributes
    - Census Income: 32,561 tuples, 14 attributes
- Error measure: ratio error, ERR = max{true/est - 1, est/true - 1}
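The ratio error defined on this slide translates directly into Python. The function name `ratio_error` and the sample values are illustrative; the positivity check is an added assumption so that both divisions are defined.

```python
def ratio_error(true_value, estimate):
    """ERR = max{true/est - 1, est/true - 1}; 0 means a perfect estimate."""
    if true_value <= 0 or estimate <= 0:
        raise ValueError("both values must be positive")
    return max(true_value / estimate - 1.0, estimate / true_value - 1.0)


print(ratio_error(100, 50))   # 1.0: underestimating by a factor of 2
print(ratio_error(100, 200))  # 1.0: overestimating by a factor of 2
```

The measure is symmetric: over- and underestimating by the same factor give the same error.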

22 Effect of data skew
[Plot not preserved in the transcript; parameters: N = 100K, d_i = 1K]

23 Effect of number of tuples

24 Results on real data
[Plots not preserved in the transcript; two panels, labeled "45 pairs" and "91 pairs"]

25 Accuracy of end-biased histograms
Results on the "capital-gain" attribute of the Census Income data set

26 Conclusions
- Utilizing existing knowledge maintained in database systems
- Proposed upper/lower bounds as well as an estimator
- Considered two cases
    - Exact marginal frequencies
    - Histograms: optimal histograms
- Experimental results show the effectiveness of the proposed method

