# Selectivity Estimation for Boolean Queries

Zhiyuan Chen (speaker), Flip Korn, Nick Koudas, S. Muthukrishnan. PODS 2000, 5/17/00, AT&T Labs.

## Presentation transcript

### Slide 2: Motivation (1)

Boolean queries on substring predicates are ubiquitous:

- Information retrieval.
- Bibliographic search.
- Web searching: 40 million queries per day at AltaVista.
- RDB queries.

Example documents:

- Document 1: "Peanut butter lover’s club..."
- Document 2: "… peanut stock..."
- Document 3: "… butter, the natural choice …"
- …
- Document 100: "..."

How many documents contain the substring "peanut" but not "butter"?

### Slide 3: Motivation (2)

- Use estimates for:
  - Query optimization.
    - Best filtering order.
  - Interactive query refinement.
    - It is hard to write a query that has 1 to 20 answers.
    - The ranking approach does not always work and is expensive.
    - Estimate -> refine query -> ... -> exact answer.
- Computing exact answers is expensive: it takes either super-linear space or linear time.

### Slide 4: Outline

- Problem definition
- Related work
- Our approach
- Experiments
- Conclusions

### Slide 5: Problem Definition

- A substring predicate $\alpha(s)$ is true iff string $s$ contains $\alpha$ as a substring.
- Boolean queries: substring predicates combined with AND, OR, and NOT.
- For a string set $S$ and a Boolean query $q$, the selectivity $P(q)$ is the fraction of strings in $S$ that satisfy $q$.

Example documents:

- Document 1: "Peanut butter lover’s club..."
- Document 2: "… peanut stock..."
- …
- Document 100: "..."

How many documents contain the substring "peanut" but not "butter"?
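The definition above can be made concrete with a brute-force sketch that scans every string. This is not the estimator from the talk, only the exact quantity it approximates. The helper names (`contains`, `AND`, `OR`, `NOT`, `selectivity`) and the four-document collection are made up for illustration, and matching is case-sensitive on lowercase text.

```python
from typing import Callable, Sequence

Query = Callable[[str], bool]

def contains(sub: str) -> Query:
    """Substring predicate: true iff the string contains `sub`."""
    return lambda s: sub in s

def AND(*qs: Query) -> Query:
    return lambda s: all(q(s) for q in qs)

def OR(*qs: Query) -> Query:
    return lambda s: any(q(s) for q in qs)

def NOT(q: Query) -> Query:
    return lambda s: not q(s)

def selectivity(strings: Sequence[str], q: Query) -> float:
    """Fraction of strings that satisfy q (exact, linear scan, non-empty input)."""
    return sum(1 for s in strings if q(s)) / len(strings)

# Hypothetical toy collection, not the data from the talk.
docs = [
    "peanut butter lover's club",
    "peanut stock",
    "butter, the natural choice",
    "sandwich of the day",
]

# "peanut but not butter": only the second document qualifies.
query = AND(contains("peanut"), NOT(contains("butter")))
print(selectivity(docs, query))
```

The linear scan is exactly the cost the talk wants to avoid, which is why an estimate from a small precomputed summary is attractive.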

### Slide 6: Related Work (1)

- Histograms: not suitable for strings, because the selectivities of adjacent substrings often differ a lot.
- End-biased histograms (IP95):
  - The structure of substring dependence is not used.
  - If $\alpha$ is pruned, then for any $\beta$, Count($\alpha\beta$) = Count($\beta\alpha$) = default?

### Slide 7: Related Work (2)

Existing work on substring queries:

- Conjunction-only queries: KVI96, WVI97, JNS99, JKNS99.
- Preprocessing: build a compact data structure, the pruned suffix tree.
- Correlation between predicates is stored explicitly; otherwise, an independence assumption is made on the substring predicates.
- Pruned case:
  - Parse the query into subqueries and estimate each subquery.
  - Combine the estimates with a probabilistic formula: the independence assumption, or maximal overlap (conditioning on overlaps).

### Slide 8: Use Previous Approach?

- Store the correlation between substring predicates? No: it takes exponential ($2^{2^m}$) space, where $m$ is the number of suffixes.
- Not store the correlation? No: correlation is important. Under the independence assumption, $P(\text{peanut} \land \text{butter}) = P(\text{peanut}) \cdot P(\text{butter}) = 2/100 \cdot 2/100 = 0.04\%$, which is 25 times smaller than the true selectivity of $1/100$ (only Document 1 contains both substrings).

Example documents:

- Document 1: "Peanut butter lover’s club..."
- Document 2: "… peanut stock..."
- Document 3: "… butter, the natural choice …"
- …
- Document 100: "..."

### Slide 9: Set-Oriented Approach - Store Correlation Implicitly

- Base set: the set of IDs of the strings that contain the substring. Here, BaseSet(peanut) = {1, 2} and BaseSet(butter) = {1, 3}.
- $\text{BaseSet}(\text{peanut} \land \text{butter}) = \text{BaseSet}(\text{peanut}) \cap \text{BaseSet}(\text{butter}) = \{1\}$.
- Base sets can be huge in the worst case: O(number of strings).

Example documents:

- Document 1: "Peanut butter lover’s club..."
- Document 2: "… peanut stock..."
- Document 3: "… butter, the natural choice …"
- …
- Document 100: "..."
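A minimal sketch of base sets using Python sets, assuming a three-document collection that mirrors the first three documents on the slide. The function name `base_set` and the case-insensitive matching are choices made for this illustration.

```python
from typing import Dict, Set

def base_set(docs: Dict[int, str], sub: str) -> Set[int]:
    """IDs of the documents whose text contains `sub` (case-insensitive)."""
    return {doc_id for doc_id, text in docs.items() if sub in text.lower()}

# Illustrative documents modeled on the slide's example.
docs = {
    1: "Peanut butter lover's club",
    2: "peanut stock",
    3: "butter, the natural choice",
}

peanut = base_set(docs, "peanut")    # documents 1 and 2
butter = base_set(docs, "butter")    # documents 1 and 3

both = peanut & butter               # conjunction  -> set intersection
either = peanut | butter             # disjunction  -> set union
peanut_not_butter = peanut - butter  # AND NOT      -> set difference

print(both, either, peanut_not_butter)
```

Boolean operators map directly onto set operations, so no correlation needs to be stored separately. The drawback is the one named on the slide: each base set can be as large as the collection.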

### Slide 10: Set-Hashing Approach

- A Monte Carlo technique (Cohen94, Broder98).
- Given two sets $A$ and $B$:
  - Generate a fixed-length signature for each set.
  - Estimate $|A \cap B|$ by manipulating the signatures.
- Use set inclusion-exclusion for unions and complements.

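### Slide 11: Signature Generation

- Universe = {1, 2, 3, 4, 5}, A = {1, 2, 3}, B = {5, 2, 3}.
- Generate signatures of length 3: randomly permute the universe 3 times and, for each permutation, pick the first element that belongs to the set.

| Permutation | First element in A = {1, 2, 3} | First element in B = {5, 2, 3} |
| --- | --- | --- |
| 1 2 3 5 4 | 1 | 2 |
| 3 5 2 4 1 | 3 | 3 |
| 4 2 1 5 3 | 2 | 2 |

Signature of A = (1, 3, 2); signature of B = (2, 3, 2).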
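The slide's example can be replayed in a few lines. This sketch assumes the three permutations are applied in the order that yields the signatures 1 3 2 for A and 2 3 2 for B; the function name `signature` is illustrative.

```python
from typing import Iterable, List, Sequence, Set

def signature(s: Set[int], permutations: Iterable[Sequence[int]]) -> List[int]:
    """For each permutation of the universe, keep the first element that is in s."""
    return [next(x for x in perm if x in s) for perm in permutations]

# The three permutations of the universe {1, 2, 3, 4, 5} from the slide
# (their order here is an assumption).
perms = [
    (1, 2, 3, 5, 4),
    (3, 5, 2, 4, 1),
    (4, 2, 1, 5, 3),
]

A = {1, 2, 3}
B = {5, 2, 3}

sig_a = signature(A, perms)
sig_b = signature(B, perms)
print(sig_a, sig_b)  # [1, 3, 2] [2, 3, 2]
```

Each signature has a fixed length (the number of permutations) no matter how large the underlying set is, which is what keeps the stored summary small.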

### Slide 12: Reconstruction

- A's signature: $S_A = (1, 3, 2)$. B's signature: $S_B = (2, 3, 2)$.
- Definition: $r = (\text{number of pair-wise matches of } S_A \text{ and } S_B) / |S_A| = 2/3$.
- Theorem: $|A \cap B|$ is estimated as $\frac{r}{1 + r}\,(|A| + |B|)$.
- Example: $\frac{2/3}{1 + 2/3} \cdot (3 + 3) = 2.4$.
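A short sketch of the reconstruction step. The formula follows from two facts: with random permutations, the match ratio $r$ estimates $|A \cap B| / |A \cup B|$, and $|A \cup B| = |A| + |B| - |A \cap B|$; solving for $|A \cap B|$ gives $\frac{r}{1+r}(|A| + |B|)$. The function names are illustrative.

```python
from typing import Sequence

def match_ratio(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """r: fraction of signature positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def estimate_intersection(sig_a: Sequence[int], sig_b: Sequence[int],
                          size_a: int, size_b: int) -> float:
    """Estimate |A ∩ B| as r / (1 + r) * (|A| + |B|)."""
    r = match_ratio(sig_a, sig_b)
    return r / (1 + r) * (size_a + size_b)

# Signatures and set sizes from the slide: |A| = |B| = 3.
r = match_ratio([1, 3, 2], [2, 3, 2])
estimate = estimate_intersection([1, 3, 2], [2, 3, 2], 3, 3)
print(r, estimate)  # r is 2/3; the estimate is about 2.4
```

For these sets the true intersection {2, 3} has size 2, so a signature of length 3 gives only a rough estimate. Longer signatures reduce the variance.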

### Slide 13: Implementation Issues

- Approximate permutations:
  - Use a set of independent hash functions.
  - Pick the minimal hash images as signature components: $\text{Sig}(A) = \min\{h(x) \mid x \in A\}$.
- Signature of unions, e.g. $\text{Sig}(\text{peanut} \lor \text{butter})$: take the pair-wise minimum of Sig(peanut) and Sig(butter).
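A sketch of both points, assuming seeded BLAKE2b digests as a stand-in for the "set of independent hash functions". The talk does not say which hash family was used, so `h`, `signature`, and `union_signature` are illustrative names and choices.

```python
import hashlib
from typing import Iterable, List

def h(seed: int, x: object) -> int:
    """Hash function number `seed`, applied to element x (64-bit value)."""
    data = f"{seed}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(s: Iterable[object], k: int) -> List[int]:
    """Component i is the minimum image of the (non-empty) set under hash function i."""
    elements = list(s)
    return [min(h(i, x) for x in elements) for i in range(k)]

def union_signature(sig_a: List[int], sig_b: List[int]) -> List[int]:
    """Signature of A ∪ B: the pair-wise minimum of the two signatures."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

A = {1, 2, 3}
B = {5, 2, 3}
k = 16

sig_union = union_signature(signature(A, k), signature(B, k))
# The pair-wise minimum equals the signature computed directly on A ∪ B.
print(sig_union == signature(A | B, k))
```

The union rule is exact, because the minimum over $A \cup B$ is the smaller of the two per-set minima. This lets signatures of disjunctions be derived from stored signatures without touching the data.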

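### Slide 14: Algorithm Outline - No Pruning

- With negations (writing $p$, $b$, $s$ for peanut, butter, sandwich):

$$
\begin{aligned}
|(\text{peanut} \lor \text{butter}) \land \lnot \text{sandwich}|
&= |(p \land \lnot s) \lor (b \land \lnot s)| && \text{(convert to DNF)} \\
&= |p \land \lnot s| + |b \land \lnot s| - |p \land b \land \lnot s| && \text{(eliminate disjunction by set inclusion-exclusion)} \\
&= |p| - |p \land s| + |b| - |b \land s| - |p \land b| + |p \land b \land s| && \text{(eliminate negations)}
\end{aligned}
$$

  Comment: works fine with short queries.

- Without negations:
  - Convert to CNF, e.g. $(\text{peanut} \lor \text{butter}) \land \text{sandwich}$.
  - Estimate using $\text{Sig}(\text{peanut} \lor \text{butter})$ and $\text{Sig}(\text{sandwich})$.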
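The two elimination steps can be sketched as a pair of small functions. Everything here is illustrative: a DNF term is represented as a pair (positive predicates, negated predicates), and `count_conj` is a hypothetical callback that returns the count of a pure conjunction. In this sketch it uses exact toy base sets; in the approach described in the talk, that role would be played by signature-based estimates.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Tuple

Term = Tuple[FrozenSet[str], FrozenSet[str]]  # (positive, negated) predicates

def subsets(items: Iterable[str]):
    items = list(items)
    for k in range(len(items) + 1):
        yield from combinations(items, k)

def count_term(pos: FrozenSet[str], neg: FrozenSet[str],
               count_conj: Callable[[FrozenSet[str]], float]) -> float:
    """Eliminate negations: |P ∧ ¬N1 ∧ ...| as a signed sum of pure conjunctions."""
    if pos & neg:  # x ∧ ¬x is unsatisfiable
        return 0
    return sum((-1) ** len(t) * count_conj(pos | frozenset(t)) for t in subsets(neg))

def count_dnf(terms: List[Term],
              count_conj: Callable[[FrozenSet[str]], float]) -> float:
    """Eliminate disjunctions by inclusion-exclusion over the DNF terms."""
    total = 0
    for k in range(1, len(terms) + 1):
        for group in combinations(terms, k):
            pos = frozenset().union(*(p for p, _ in group))
            neg = frozenset().union(*(n for _, n in group))
            total += (-1) ** (k + 1) * count_term(pos, neg, count_conj)
    return total

# Hypothetical base sets over six documents.
universe = {1, 2, 3, 4, 5, 6}
base = {"p": {1, 2, 5}, "b": {1, 3, 5}, "s": {4, 5}}

def count_conj(preds: FrozenSet[str]) -> int:
    ids = set(universe)
    for name in preds:
        ids &= base[name]
    return len(ids)

# (p ∨ b) ∧ ¬s in DNF: (p ∧ ¬s) ∨ (b ∧ ¬s)
terms = [(frozenset({"p"}), frozenset({"s"})), (frozenset({"b"}), frozenset({"s"}))]
result = count_dnf(terms, count_conj)
print(result)  # 3 - 1 + 3 - 1 - 2 + 1 = 3
```

The number of conjunctions grows exponentially with the number of terms and negations, which matches the slide's comment that this works fine for short queries.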

### Slide 15: Pruned Suffix Tree Case

- Parse the query into subqueries that only have predicates in the suffix tree. For example, abc is parsed into ab and bc.
- Use signatures to estimate each subquery.
- Combine the estimates using a probabilistic formula: maximal overlap parsing and conditioning on the overlap.

E.g.

$$
\begin{aligned}
P(\lnot(abc \land 12) \land 23)
&= P(23) - P(abc \land 12 \land 23) \\
&= P(23) - P(ab \land 12 \land 23) \cdot P(bc \land 12 \land 23 \mid ab \land 12 \land 23) \\
&\approx P(23) - P(ab \land 12 \land 23) \cdot P(bc \land 12 \land 23 \mid b \land 12 \land 23) \\
&= P(23) - P(ab \land 12 \land 23) \cdot \frac{P(bc \land 12 \land 23)}{P(b \land 12 \land 23)}
\end{aligned}
$$
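A simplified sketch of maximal overlap parsing for a single substring predicate, without the extra conjuncts (12 and 23) from the slide's example. It assumes a hypothetical table `sel` that maps every retained substring to its selectivity, and assumes that every substring of a retained substring is also retained. The selectivity values are made up.

```python
from typing import Dict, List, Tuple

def maximal_overlap_parse(query: str, known: Dict[str, float]) -> List[Tuple[int, int]]:
    """Split `query` into maximal known pieces, returned as (start, end) spans."""
    pieces: List[Tuple[int, int]] = []
    prev_end = 0
    for start in range(len(query)):
        ends = [e for e in range(start + 1, len(query) + 1) if query[start:e] in known]
        if not ends:
            raise ValueError(f"no known substring starts at position {start}")
        end = max(ends)
        if end > prev_end:  # skip pieces contained in the previous one
            pieces.append((start, end))
            prev_end = end
    return pieces

def mo_estimate(query: str, sel: Dict[str, float]) -> float:
    """Multiply piece selectivities, conditioning each piece on its overlap
    with the previous piece: P(abc) ≈ P(ab) * P(bc) / P(b)."""
    estimate = 1.0
    prev_end = None
    for start, end in maximal_overlap_parse(query, sel):
        estimate *= sel[query[start:end]]
        if prev_end is not None and start < prev_end:
            estimate /= sel[query[start:prev_end]]
        prev_end = end
    return estimate

# Hypothetical selectivities; "abc" itself has been pruned.
sel = {"a": 0.5, "b": 0.4, "c": 0.3, "ab": 0.2, "bc": 0.1}

pieces = maximal_overlap_parse("abc", sel)
estimate = mo_estimate("abc", sel)
print(pieces, estimate)  # pieces "ab" and "bc"; 0.2 * 0.1 / 0.4
```

In the slide's example each factor is additionally conjoined with the other predicates of the query, and those conjunctions are estimated from signatures rather than looked up in a table.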

### Slide 16: Complexity

- Theorem:
  - Preprocessing (building the tree and signatures): O(signature length × database size) time and space.
  - Online estimate: $O(2^{O(L)})$ time, where $L$ is the query length.
- Online time depends only on the query length.
- $L$ is small in real life.
  - Below 1 millisecond in experiments.

### Slide 17: Experiments - Setup

- Data set: real AT&T data (service description).
  - 130 K strings, 2.5 MB.
- Queries:
  - Templates:
    - T1: (A  B)  (C  D)
    - T2: (A  B)  (C  D)  (E  F)  (G  H)
  - With a certain probability of negation.
  - Positive and negative queries.
- Compared with the independence assumption.
- Run on an Intel PC, 350 MHz, 128 MB RAM.
  - 1 minute preprocessing, < 1 millisecond estimation time.

### Slide 18: PST - Positive Queries

[Figure: average absolute relative error for templates T1: (A  B)  (C  D) and T2: (A  B)  (C  D)  (E  F)  (G  H), with probability of negation = 5%.]

### Slide 19: PST - Negative Queries

[Figure: average root-mean-square error (count) for templates T1: (A  B)  (C  D) and T2: (A  B)  (C  D)  (E  F)  (G  H), with probability of negation = 5%.]

### Slide 20: Conclusions

- Contributions:
  - A novel problem.
  - A novel approach: store correlation implicitly and generate it as needed by set hashing.
  - Far superior to the independence assumption:
    - 1.0% space, < 1 ms.
    - 4 times more accurate for positive queries, and many orders of magnitude more accurate for negative queries.
- Ongoing and future work:
  - Twig estimation for XML documents.
  - Regular expressions, position constraints.
