
1 Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany
ACM SIGIR '05

2 Efficient & Self-tuning Incremental Query Expansions for Top-k Query Processing 2 An Initial Example… Robust Track ’04, hard query no. 363 (Aquaint news corpus) “transportation tunnel disasters” transportation tunnel disasters transit highway train truck metro “rail car” car … tube underground “Mont Blanc” … catastrophe accident fire flood earthquake “land slide” … 0.9 0.8 0.7 0.6 0.5 0.1 1.0 0.9 0.7 0.6 0.5 0.9 0.8 0.7 1.0 d1d1 d2d2 Expansion terms from relevance feedback, thesaurus lookups, Google top-10 snippets, etc. Term similarities, e.g., Robertson&Sparck-Jones, concept similarities, or other correlation measures Increased robustness Count only the best match per document and expansion set Increased efficiency Top-k-style query evaluations Open scans on new terms only on demand No threshold tuning

3 Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions

4 Computational Model
- Vector space model over a Cartesian product space D_1 × … × D_m, with a data set D ⊆ D_1 × … × D_m ⊆ ℝ^m
- Precomputed local scores s(t_i, d) ∈ D_i for all d ∈ D, e.g., TF·IDF variants or probabilistic models (Okapi BM25), typically normalized to s(t_i, d) ∈ [0,1]
- Monotone score aggregation aggr: D_1 × … × D_m → ℝ⁺, e.g., sum, max, product (sum over log s_ij), cosine (L2 norm)
- Partial-match queries (aka "andish"): non-conjunctive query evaluation, so weak local matches can be compensated
- Access model: inverted index over a large text corpus, with inverted lists sorted by decreasing local score
  - Inexpensive sequential accesses to per-term lists: getNextItem()
  - More expensive random accesses: getItemBy(docid)
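To make the access model concrete, here is a minimal Python sketch of such a score-sorted inverted list; only the two access methods getNextItem()/getItemBy(docid) are named on the slide, everything else (class layout, in-memory storage) is an illustrative assumption, not the paper's implementation:

```python
class InvertedList:
    """Per-term index list, sorted by decreasing local score s(t, d)."""

    def __init__(self, postings):
        # postings: iterable of (docid, score) pairs for one term
        self.postings = sorted(postings, key=lambda p: p[1], reverse=True)
        self.by_doc = dict(self.postings)    # docid -> score, for random access
        self.pos = 0

    def get_next_item(self):
        """Inexpensive sequential access: next (docid, score) in score order."""
        if self.pos == len(self.postings):
            return None
        self.pos += 1
        return self.postings[self.pos - 1]

    def get_item_by(self, docid):
        """More expensive random access: this term's score for a given docid."""
        return self.by_doc.get(docid, 0.0)
```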

5 No-Random-Access (NRA) Algorithm [Fagin et al., PODS '02]
Corpus d_1, …, d_n; query q = (t_1, t_2, t_3). The inverted index keeps, per term, the local scores s(t_1, d_1) = 0.7, …, s(t_m, d_1) = 0.2 as lists sorted by decreasing score:
- t_1: d78 (0.9), d23 (0.8), d10 (0.8), …
- t_2: d64 (0.8), d23 (0.6), d10 (0.6), …
- t_3: d10 (0.7), d78 (0.5), d64 (0.4), …
A naive join-then-sort evaluation takes between O(mn) and O(mn²) runtime. NRA instead scans all lists round-robin and maintains a [worstscore(d), bestscore(d)] interval per candidate:
- Scan depth 1: 1. d78 [0.9, 2.4]; 2. d64 [0.8, 2.4]; 3. d10 [0.7, 2.4]
- Scan depth 2: 1. d78 [1.4, 2.0]; 2. d23 [1.4, 1.9]; 3. d64 [0.8, 2.1]; 4. d10 [0.7, 2.1]
- Scan depth 3: 1. d10 [2.1, 2.1]; 2. d78 [1.4, 2.0]; 3. d23 [1.4, 1.8]; 4. d64 [1.2, 2.0]
With k = 1, d10's worstscore (2.1) now dominates every other candidate's bestscore → STOP!

NRA(q, L):
  scan all lists L_i (i = 1..m) in parallel  // e.g., round-robin
    (d, s(t_i, d)) = L_i.getNextItem()
    E(d) = E(d) ∪ {i}
    high_i = s(t_i, d)
    worstscore(d) = ∑_{ν ∈ E(d)} s(t_ν, d)
    bestscore(d) = worstscore(d) + ∑_{ν ∉ E(d)} high_ν
    if worstscore(d) > min-k then
      add d to top-k
      min-k = min{worstscore(d') | d' ∈ top-k}
    else if bestscore(d) > min-k then
      candidates = candidates ∪ {d}
    if max{bestscore(d') | d' ∈ candidates} ≤ min-k then return top-k
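Below is a compact Python transcription of the NRA pseudocode, a sketch assuming the InvertedList interface from above; for clarity it recomputes the top-k set each round, which a real implementation would maintain incrementally. It follows the slide's presentation of Fagin et al.'s algorithm, not a reference implementation:

```python
import heapq

def nra(lists, k):
    """No-Random-Access top-k over m score-sorted inverted lists."""
    m = len(lists)
    high = [1.0] * m             # high_i: last (= smallest) score seen in list i
    seen = {}                    # docid -> {i: s(t_i, d)} for the dims seen so far
    exhausted = [False] * m
    while True:
        for i, lst in enumerate(lists):          # one round-robin scan step
            if exhausted[i]:
                continue
            item = lst.get_next_item()
            if item is None:
                exhausted[i], high[i] = True, 0.0
                continue
            d, s = item
            high[i] = s
            seen.setdefault(d, {})[i] = s

        def worstscore(d):                       # sum over evaluated dims E(d)
            return sum(seen[d].values())

        def bestscore(d):                        # plus high_i for unseen dims
            return worstscore(d) + sum(high[i] for i in range(m)
                                       if i not in seen[d])

        top_k = heapq.nlargest(k, seen, key=worstscore)
        min_k = worstscore(top_k[-1]) if len(top_k) == k else 0.0
        # stop once no candidate outside the top-k can still overtake min-k
        if len(top_k) == k and all(bestscore(d) <= min_k
                                   for d in seen if d not in top_k):
            return [(d, worstscore(d)) for d in top_k]
        if all(exhausted):                       # fewer than k docs overall
            return [(d, worstscore(d)) for d in top_k]
```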

6 Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions

7 Dynamic & Self-tuning Query Expansions
- Incrementally merge the inverted lists L_{i1} … L_{im'} in descending order of local scores
- Dynamically add lists to the set of active expansions exp(t_i) according to the combined term similarities and local scores
- Best-match score aggregation: for an expanded query (t_1, t_2, ~t_3), the virtual index list ~t_3 is incrementally merged from the expansion lists t_{3,1} … t_{3,m'}, and each document counts only its best match within the expansion set
Benefits:
- Increased retrieval robustness & fewer topic drifts
- Increased efficiency through fewer active expansions
- No threshold tuning required!
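As a concrete illustration of best-match aggregation, here is a minimal sketch; the function, its arguments, and the numbers in the usage comment are hypothetical, in the spirit of the "tunnel" example from slide 2:

```python
def best_match_score(local_scores, expansion_sims):
    """Best-match aggregation for one expansion set exp(~t): take the maximum
    sim(~t, t) * s(t, d) over the expansion terms, not their sum, so a
    document matching many weak expansion terms gains no artificial boost."""
    return max(sim * local_scores.get(term, 0.0)
               for term, sim in expansion_sims.items())

# Hypothetical numbers: d matches 'tube' (s = 0.9, sim = 1.0) and
# 'underground' (s = 0.6, sim = 0.9); the contribution of ~tunnel is
# max(1.0 * 0.9, 0.9 * 0.6) = 0.9, not 0.9 + 0.54.
```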

8 Incremental Merge Operator
Example expansion: ~t = {t_1, t_2, t_3} with sim(t, t_1) = 1.0, sim(t, t_2) = 0.9, sim(t, t_3) = 0.5.
Source index lists (docid, local score):
- t_1: d78 (0.9), d23 (0.8), d10 (0.8), d1 (0.4), d88 (0.3), …
- t_2: d64 (0.8), d23 (0.8), d10 (0.7), d12 (0.2), d78 (0.1), …
- t_3: d11 (0.9), d78 (0.9), d64 (0.7), d99 (0.7), d34 (0.6), …
Weighting each list by its similarity (initial high-scores 0.9, 0.72, 0.45), the merged virtual list ~t emits: d78 (0.9), d23 (0.8), d10 (0.8), d64 (0.72), d23 (0.72), d10 (0.63), d11 (0.45), d78 (0.45), d1 (0.4), …
- Expansion terms come from relevance feedback, thesaurus lookups, …
- Expansion similarities come from correlation measures, large corpus statistics, …
- Index list meta data (e.g., histograms) steers the merge
- The Incremental Merge is iteratively triggered by the top-k operator's getNextItem()
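A Python sketch of the operator, assuming the InvertedList interface from above: a priority queue over the expansion lists yields the next item of the virtual list ~t in descending order of sim(t, t_j) · s(t_j, d). For brevity this sketch opens all expansion lists upfront, whereas the actual operator adds list j to the active expansions only once its weighted high-score sim(t, t_j) · high_j becomes competitive:

```python
import heapq

class IncrementalMerge:
    """Virtual index list ~t over expansion lists, merged lazily on demand."""

    def __init__(self, lists, sims):
        # lists[j]: InvertedList of expansion term t_j; sims[j] = sim(t, t_j).
        # Each source list is score-sorted and its sim factor is constant, so
        # the weighted scores per list are descending too; a heap over the
        # list heads therefore always yields the globally next item of ~t.
        self.lists, self.sims, self.heap = lists, sims, []
        for j in range(len(lists)):       # simplification: open all lists now
            self._refill(j)

    def _refill(self, j):
        item = self.lists[j].get_next_item()
        if item is not None:
            d, s = item
            heapq.heappush(self.heap, (-self.sims[j] * s, d, j))

    def get_next_item(self):
        """Next (docid, sim-weighted score) of ~t in descending score order."""
        if not self.heap:
            return None
        neg_score, d, j = heapq.heappop(self.heap)
        self._refill(j)                   # pull the successor from list j
        return d, -neg_score
```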

9 Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions

10–12 Probabilistic Candidate Pruning [Theobald, Schenkel, Weikum, VLDB '04]
For each physical index list L_i:
- Treat each s(t_i, d) ∈ [0,1] as a random variable S_i and consider the probability that S_i exceeds the score gap a candidate still needs to reach the top-k
- Approximate the local score distribution by an equi-width histogram with n buckets
For a virtual index list ~L_i = L_{i1} ∪ … ∪ L_{im'}:
- Consider the max-distribution P[max_j {sim(t_i, t_{ij}) · S_{ij}} ≤ x]
- Alternatively, construct a meta histogram for the active expansions
For all d in the candidate queue:
- Consider the convolution over the score distributions of the not-yet-seen dimensions to estimate d's aggregated score
- Drop d from the candidates if P[worstscore(d) + ∑_{i ∉ E(d)} S_i > min-k] ≤ ε
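A sketch of the pruning test with equi-width histograms, using NumPy for the convolution. The bucket count, all names, and the omission of conditioning on the already-observed score bounds (and of the max-distribution step that a virtual list's histogram would first go through) are simplifications of the VLDB '04 method:

```python
import numpy as np

def convolved_sum_dist(histograms):
    """Distribution of the sum of independent per-list scores S_i; each
    histogram is a length-n probability vector over equi-width buckets
    of [0,1], so bucket b of the result covers a score of roughly b/n."""
    dist = np.array([1.0])               # point mass at score 0
    for h in histograms:
        dist = np.convolve(dist, h)      # convolution = distribution of the sum
    return dist

def can_prune(worstscore_d, unseen_histograms, min_k, eps, n=100):
    """Drop d if P[worstscore(d) + sum of unseen S_i > min-k] <= eps."""
    dist = convolved_sum_dist(unseen_histograms)
    gap = min_k - worstscore_d           # score mass d still needs
    if gap <= 0.0:
        return False                     # d can already make the top-k
    cutoff = int(np.ceil(gap * n))       # first bucket strictly above the gap
    tail = dist[cutoff:].sum() if cutoff < len(dist) else 0.0
    return tail <= eps
```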

13 Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions

14 Incremental Merge for Multidimensional Predicates
Example: q = (undersea "fiber optic cable"), with phrase expansions sim("fiber optic cable", "fiber optic cable") = 1.0 and sim("fiber optic cable", "fiber optics") = 0.8.
- A nested top-k operator per subquery condition iteratively prefetches & joins candidate items via getNextItem()
- It provides [worstscore(d), bestscore(d)] guarantees to the superordinate top-k operator
- It propagates candidates in descending order of their bestscore(d) values to preserve monotonicity
- The top-level top-k operator performs phrase tests against a term-to-position index only for the most promising items (random IO; cf. expensive predicates & minimal probes [Chang & Hwang, SIGMOD '02])
- A single threshold condition suffices for algorithm termination (candidate pruning at the top-level queue only)
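The phrase test itself reduces to random accesses against the term-to-position index. A minimal sketch; the nested-dictionary index layout is an assumption for illustration, not the paper's storage format:

```python
def phrase_match(pos_index, docid, phrase):
    """Random-access phrase test: True iff the terms of `phrase` occur at
    consecutive positions in docid. pos_index[term][docid] is assumed to be
    a list of word offsets of that term within that document."""
    offsets = [set(pos_index.get(t, {}).get(docid, [])) for t in phrase]
    if not all(offsets):
        return False                 # some phrase term is missing entirely
    # the phrase starts at p iff term i occurs at position p + i for every i
    return any(all(p + i in offsets[i] for i in range(1, len(phrase)))
               for p in offsets[0])
```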

15 Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions

16 Experiments – Aquaint with Fixed Expansions
- Aquaint corpus of English news articles (528,155 docs)
- 50 "hard" queries from the TREC 2004 Robust track
- WordNet expansions using a simple form of WSD
- Okapi BM25 model for local scores, Dice coefficients as term similarities
- Fixed expansion technique (synonyms + first-order hyponyms) with m ≤ 118

Results (SA/RA = sequential/random accesses; rPrec = precision relative to the unpruned run):

Method                        max m   # SA         # RA      CPU sec   max KB   P@10    MAP     rPrec
Title-only
  Join&Sort                     4      2,305,637    –         –         –        –       –       –
  NRA baseline                  4      1,439,815    0         9.4       432      0.252   0.092   1.000
Static Expansions
  Join&Sort                   118     20,582,764    –         –         –        –       –       –
  NRA + Phrases               118     18,258,834    210,531   245.0     37,355   0.286   0.105   1.000
  NRA + Phrases (pruning)     118      3,622,686    49,783    79.6      5,895    0.238   0.086   0.541
Dynamic Expansions
  Incr. Merge                 118      7,908,666    53,050    159.1     17,393   0.310   0.118   1.000
  Incr. Merge (pruning)       118      5,908,017    48,622    79.4      13,424   0.298   0.110   0.786

17 Experiments – Aquaint with Fixed Expansions cont'd
[Figure: Probabilistic pruning performance, Incremental Merge vs. top-k with static expansions; epsilon controls the pruning aggressiveness.]

18 Experiments – Aquaint with Large Expansions
- Aggressive expansion technique (synonyms + hyponyms + hypernyms) with 36 ≤ m ≤ 876
[Figure: Query expansion performance, Incremental Merge vs. top-k with static expansions; theta controls the expansion size.]

19 Conclusions & Current Work
- Increased efficiency: Incremental Merge vs. Join-then-Sort and top-k with static expansions; very good precision/runtime ratio for probabilistic pruning
- Increased retrieval robustness: largely avoids topic drifts; fine-grained semantic similarities modeled via Incremental Merge & nested top-k operators
- Scalability (see paper): large expansions (up to 876 terms per query) on Aquaint; experiments on the Terabyte collection
- Efficient support for XML-IR (INEX benchmark): inverted lists for combined tag-term pairs (e.g., sec=mining) efficiently support the child-or-descendant axis (e.g., //article//sec=mining) and vague content-and-structure queries (VCAS, e.g., //article//~sec=~mining); Incremental Merge over DataGuide-like XPath locators (VLDB '05, Trondheim)

20 Thank you!

