« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.

1 « Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » (Proceedings of the 30th Annual International ACM SIGIR Conference, Amsterdam, 2007). A. Ntoulas, J. Cho. Paper presentation: Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

2 Paper Outline: Introduction; Architecture; Optimal size of pruned index; Pruning policies; Experimental evaluation; Conclusions

3 Introduction. Observation: approximately 80% of users examine at most the first 3 batches of results. That is, 80% of users typically view at most 30 to 60 results for every query that they issue to a search engine. Contribution: a new answer computation algorithm that guarantees that the top-matching pages (according to the search engine's ranking metric) are always placed at the top of the search results, even though the first batch of answers is computed from the pruned index most of the time.

4 Two-tier Index Architecture

5 Architecture. Definition 1 (Correctness indicator function): Given a query q, the p-index I_P returns the answer A together with a correctness indicator function C. C is set to 1 if A is guaranteed to be identical (i.e. same results in the same order) to the result computed from the full index I_F. If it is possible that A is different, C is set to 0. Question 1: How can we compute the correctness indicator function C? Question 2: How should we prune the index so as to realize the maximum cost saving?
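The query flow implied by Definition 1 can be sketched as follows. This is only an illustration of the two-tier dispatch, not the paper's implementation: the function names, the `Answer` type, and the toy dictionaries standing in for the two indexes are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

Answer = List[str]  # ranked list of document ids (hypothetical representation)


def answer_query(
    query: str,
    pruned_lookup: Callable[[str], Tuple[Answer, int]],
    full_lookup: Callable[[str], Answer],
) -> Tuple[Answer, str]:
    """Try the p-index first; fall back to the full index when C == 0."""
    answer, c = pruned_lookup(query)  # first tier returns (A, C)
    if c == 1:
        # A is guaranteed identical to the full-index result.
        return answer, "p-index"
    # No guarantee: recompute the answer from the full index.
    return full_lookup(query), "full index"


# Toy stand-ins for the two tiers (made-up data, for illustration only).
FULL = {"inverted index": ["d3", "d1", "d7"], "pruning": ["d2", "d5"]}
PRUNED = {"inverted index": (["d3", "d1", "d7"], 1), "pruning": (["d2"], 0)}


def toy_pruned_lookup(q: str) -> Tuple[Answer, int]:
    return PRUNED.get(q, ([], 0))


def toy_full_lookup(q: str) -> Answer:
    return FULL.get(q, [])
```

With the toy data, the query "inverted index" is served by the first tier alone, while "pruning" falls through to the full index because its indicator is 0.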

6 Pruned Index Size. Observation: the cost of the two-tier architecture depends on two important parameters: the size of the p-index and the fraction of the queries that can be handled by the first-tier index alone. Theorem 1: The cost of handling the query load Q is minimal when the size of the p-index, s, satisfies df(s)/ds = 1, where s is the size of the p-index as a fraction of the full index and f(s) is the fraction of queries that a p-index of size s can answer.
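One way to see where the condition in Theorem 1 comes from is the short derivation below. It rests on a simplified cost model that is an assumption of this note, not a quotation from the paper: every query first visits the p-index at a cost proportional to its relative size s, and the fraction 1 - f(s) of queries that cannot be answered there additionally visits the full index at a normalized cost of 1.

```latex
\begin{align*}
  \mathrm{cost}(s) &= \underbrace{s}_{\text{p-index, all queries}}
      + \underbrace{\bigl(1 - f(s)\bigr)}_{\text{full index, fallback queries}} \\
  \frac{d\,\mathrm{cost}(s)}{ds} &= 1 - \frac{df(s)}{ds} = 0
  \quad\Longrightarrow\quad \frac{df(s)}{ds} = 1
\end{align*}
```

Under this model the stationary point is a minimum when f is concave, that is, when each additional unit of index size answers fewer extra queries than the previous one.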

7 Pruning Policies. 1) Term (keyword) locality: the search engine will be able to answer a significant fraction of user queries even if it can handle only a few popular keywords (possibly those that constitute the majority of the query load). 2) Document locality: as long as search engines can compute the first few top-k answers correctly, users often will not notice that the search engine has not actually computed the correct answer for the remaining results (unless the users explicitly request them). This is what actual commercial search engines do!

8 Pruning Policies: ranking function assumptions. The exact ranking function that search engines employ is a closely guarded secret. Query-dependent relevance: captures how relevant each document is to the query (cosine distance metric). Query-independent document quality: measures the overall "quality" of a document D independent of the particular query issued by the user (e.g. PageRank, HITS). The paper adopts as its ranking function a linear combination of these two factors (the ranking function should be monotonic).

9 Pruning Policies: term (horizontal) pruning. Problem 2 (optimal keyword pruning): Given the query load Q and a goal index size s · |I_F| for the pruned index, select the inverted lists I_P = {I(t_1), ..., I(t_h)} such that |I_P| ≤ s · |I_F| and the fraction of queries that I_P can answer (expressed by f(s)) is maximized. Theorem 2: The problem of calculating the optimal keyword pruning is NP-hard (the proof relates it to the knapsack or bin-packing problem). Therefore the paper implements a greedy policy that keeps the items with the maximum benefit per unit cost.
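A minimal sketch of a greedy "benefit per unit cost" selection is given below. The slide does not define benefit or cost precisely, so the sketch assumes that the benefit of a term is the number of queries in the load that mention it and that its cost is the length of its inverted list; the function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Tuple


def greedy_keyword_pruning(
    list_sizes: Dict[str, int],      # term -> size of its inverted list (> 0)
    queries: List[FrozenSet[str]],   # query load, each query a set of terms
    s: float,                        # target p-index size as a fraction of |I_F|
) -> Tuple[Set[str], float]:
    """Keep whole inverted lists greedily; return (kept terms, f(s))."""
    budget = s * sum(list_sizes.values())
    # Assumed benefit: how many queries mention the term.
    benefit = Counter(t for q in queries for t in q if t in list_sizes)
    # Highest benefit per unit cost first; ties broken by term name.
    ranked = sorted(list_sizes, key=lambda t: (-benefit[t] / list_sizes[t], t))
    kept: Set[str] = set()
    used = 0
    for term in ranked:
        if used + list_sizes[term] <= budget:
            kept.add(term)
            used += list_sizes[term]
    # A query is answerable from the p-index only if all its terms were kept.
    answered = sum(1 for q in queries if q <= kept)
    f_s = answered / len(queries) if queries else 0.0
    return kept, f_s


# Toy example (made-up numbers).
SIZES = {"the": 1000, "index": 100, "pruning": 50, "sigir": 20}
LOAD = [
    frozenset({"index"}),
    frozenset({"index", "pruning"}),
    frozenset({"pruning"}),
    frozenset({"the", "index"}),
    frozenset({"sigir"}),
]
kept_terms, fraction = greedy_keyword_pruning(SIZES, LOAD, s=0.15)
```

In the toy example the long list for "the" does not fit into the 15% budget, so the one query that needs it must go to the full index while the other four are answerable from the pruned index.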

10 Pruning Policies: document (vertical) pruning. Global and local document pruning algorithms. Neither can provide the paper's basic correctness guarantee (Theorem 3).

11 Combined policy: extended keyword-specific document pruning. For every inverted list, the paper picks two threshold values. This policy, when combined with a correctly selected monotonic ranking function, provides the paper's correctness guarantee (Theorem 4).
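The slide does not say what the two thresholds apply to. The sketch below assumes that one threshold bounds the query-independent quality pr(D) and the other bounds the term relevance tr(D, t), and that a posting is kept when it exceeds either one. It further restricts itself to single-term queries and to the additive ranking function r = pr + tr, so it should be read as an illustration of why monotonicity makes a correctness check possible, not as the paper's algorithm. All names and numbers are hypothetical.

```python
from typing import List, Tuple

Posting = Tuple[str, float, float]  # (doc id, pr(D), tr(D, t))


def prune_list(postings: List[Posting], tau_p: float, tau_t: float) -> List[Posting]:
    """Keep a posting if it exceeds either threshold (assumed policy)."""
    return [p for p in postings if p[1] > tau_p or p[2] > tau_t]


def top_k_with_indicator(
    pruned: List[Posting], tau_p: float, tau_t: float, k: int
) -> Tuple[List[str], int]:
    """Top-k for a single-term query from the pruned list, plus indicator C."""
    ranked = sorted(pruned, key=lambda p: (-(p[1] + p[2]), p[0]))
    top = ranked[:k]
    # Every dropped posting has pr <= tau_p and tr <= tau_t, so by
    # monotonicity its score is at most tau_p + tau_t.
    bound = tau_p + tau_t
    guaranteed = len(top) == k and all(p[1] + p[2] > bound for p in top)
    return [p[0] for p in top], 1 if guaranteed else 0


# Toy inverted list for one term (made-up scores).
POSTINGS = [
    ("a", 0.9, 0.2),
    ("b", 0.2, 0.8),
    ("c", 0.3, 0.3),
    ("d", 0.1, 0.2),
    ("e", 0.6, 0.7),
]
pruned = prune_list(POSTINGS, tau_p=0.5, tau_t=0.4)
answer, c = top_k_with_indicator(pruned, tau_p=0.5, tau_t=0.4, k=2)
```

The check is conservative: when the pruned list holds fewer than k postings, or when a returned score does not clear the bound, the indicator is 0 and the query would be forwarded to the full index.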

12 Experimental setup. Dataset: 130M pages crawled from the Web (in March 2004); the seed is the ODP homepage. Total uncompressed size of the web pages: ~2TB. Full inverted index size: ~1.2TB. Query set available: ~450M queries issued to the web site www.looksmart.com, of which only a fraction (~5%) was processed (the average number of terms per query is 2). Selected ranking function: r(D, q) = pr_norm(D) + tr_norm(D, q).

13 Term vs document pruning performance. 1) ~73% of the queries can be answered using 30% of the original index. Furthermore, using keyword pruning only, the optimal index size is s = 0.17. 2) For all index sizes larger than 40%, the authors guarantee the correct answer for about 70% of the queries. The optimal index size here (document pruning) is s = 0.20. 3) For p-index sizes 20% then keyword-pruning performs much better

14 Combination. 1) First apply term pruning and subsequently document pruning. 2) For p-index sizes smaller than 50%, the combination does relatively well.

15 Conclusions. The authors provided a framework for new pruning techniques and answer computation algorithms that guarantee that the top-matching pages are always placed at the top of the search results in the correct order. A term-pruned index can guarantee 73% of the queries with a size of 30% of the full index. A document-pruned index can guarantee 68% of the queries with the same size. The combination of the two pruning algorithms guarantees 60% of the queries with an index size of 16%.

16 Any questions? Thank you for your attention!

