
Federated text retrieval from uncooperative overlapped collections Milad Shokouhi, RMIT University, Melbourne, Australia Justin Zobel, RMIT University,


1 Federated text retrieval from uncooperative overlapped collections
Milad Shokouhi, RMIT University, Melbourne, Australia
Justin Zobel, RMIT University, Melbourne, Australia
SIGIR 2007 (Collection representation in distributed IR)
2009-03-13
Presented by JongHeum Yeon, IDS Lab., Seoul National University

2 Abstract
Copyright © 2008 by CEBT
• Federated information retrieval (FIR)
  – A query is sent to multiple collections
  – A central broker merges the results and ranks them
• Duplicated documents in collections
  – The final results can potentially contain a high number of duplicates
• The authors propose a method for estimating the rate of overlap among collections based on sampling
• Using the estimated overlap statistics, they propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results
[Figure: user, broker, and collections]

3 Federated Information Retrieval (FIR)
• The query is sent simultaneously to several collections
• Each collection evaluates the query and returns its results to the broker
• Advantages
  – No need to access the index of the collections
  – Searches the latest version of the documents without crawling and indexing
• The broker selects the collections that are most likely to return relevant documents
  – Collection selection problem
  – Collection representation problem
  – Result merging problem
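The broker flow described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method from the paper: the `federated_search` function, the callable-per-collection interface, and the assumption that document identifiers and scores are directly comparable across collections are all hypothetical simplifications (real uncooperative collections generally do not offer comparable scores).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a collection is any callable that maps a query
# string to a ranked list of (doc_id, score) pairs.
Result = Tuple[str, float]
Collection = Callable[[str], List[Result]]


def federated_search(
    query: str,
    collections: Dict[str, Collection],
    selected: List[str],
    top_k: int = 10,
) -> List[Result]:
    """Send the query to the selected collections and merge the results."""
    merged: Dict[str, float] = {}
    for name in selected:
        for doc_id, score in collections[name](query):
            # Assumption: the same doc_id means the same document, so a
            # duplicate is collapsed by keeping its best score.
            if doc_id not in merged or score > merged[doc_id]:
                merged[doc_id] = score
    # Rank by descending score; break ties by doc_id for determinism.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

Note that a collection returning only duplicates still costs a query here, which is the waste that overlap-aware collection selection tries to avoid.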

4 Collection Selection Problem
• FIR techniques assume that the degree of overlap among collections is either none or negligible
• However, many collections have a significant degree of overlap
  – Bibliographic databases
  – News resources
• Selecting collections that are likely to return the same results introduces duplicate documents into the final results, which
  – Wastes costly resources
  – Degrades search effectiveness
• The authors propose
  – A method that estimates the degree of overlap among collections by sampling from each collection using random queries
  – Two collection selection techniques that use the estimated overlap statistics to maximize the number of unique relevant documents in the final results

5 Related Work
• Cooperative collection selection techniques
  – Collections provide the broker with their index statistics and other useful information
  – CORI, GlOSS, CVV
• Uncooperative collection selection techniques
  – Collections do not provide their index statistics to the broker
  – The broker samples documents from each collection
  – ReDDE uses the sampled documents to
    – Estimate the number of relevant documents in collections
    – Rank collections according to the number of highly ranked sampled documents

6 Overlap Estimation
• The method uses the documents downloaded by query-based sampling to estimate the rate of overlap, and does not require any additional information
• Subset of sample documents
• Size of m
• The probability of any given document from m1 being available in m2
• Expected number of documents
[Figure: collections C1 and C2, samples S1 and S2, and overlap K]

7 Overlap Estimation (cont’d)
• P(i) follows a binomial distribution

8 Overlap Estimation (cont’d)
• Binomial theorem
• Expected number of documents in m1 ∩ m2
  – The number of overlap documents is independent of the collection size
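The formulas on the overlap estimation slides did not survive in this transcript, so the following Python sketch shows only one possible reading of the binomial step. It assumes that m1 and m2 count the sampled documents from each collection that fall inside the K overlapping documents, and that each of the m1 documents appears among the m2 documents independently with probability m2 / K. The function names and this probability model are assumptions made for illustration, not the authors' stated equations.

```python
from math import comb


def overlap_pmf(i: int, m1: int, m2: int, k: int) -> float:
    """P(i): probability that exactly i of the m1 documents are also in m2.

    Assumed model: Binomial(m1, p) with p = m2 / k.
    """
    if k <= 0 or not 0 <= m2 <= k:
        raise ValueError("need k > 0 and 0 <= m2 <= k")
    p = m2 / k
    return comb(m1, i) * p**i * (1 - p) ** (m1 - i)


def expected_shared(m1: int, m2: int, k: int) -> float:
    """Mean of the assumed binomial: m1 * p = m1 * m2 / k."""
    return m1 * m2 / k


def mean_from_pmf(m1: int, m2: int, k: int) -> float:
    """Recompute the mean by summing i * P(i), as a consistency check."""
    return sum(i * overlap_pmf(i, m1, m2, k) for i in range(m1 + 1))
```

Under this model the expectation is written purely in terms of m1, m2 and K, with no explicit collection-size term.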

9 The ‘RELAX’ Selection Method
• Graph G = {(u, v) | vertices u and v are collections; an edge indicates overlapping documents between its two vertices}
• Output: a final merged document list with minimized duplicates

10 The ‘RELAX’ Selection Method (cont’d)
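Only the graph definition of the Relax method survives in this transcript; the selection steps themselves are not shown. The Python sketch below therefore covers just the data structure: collections as vertices and estimated overlap as edge weights. The `build_overlap_graph` name, the input format, and the `min_overlap` cutoff are hypothetical choices made for illustration.

```python
from typing import Dict, Tuple


def build_overlap_graph(
    overlap_estimates: Dict[Tuple[str, str], float],
    min_overlap: float = 0.0,
) -> Dict[str, Dict[str, float]]:
    """Build an undirected weighted graph as a nested adjacency dict.

    overlap_estimates maps a pair of collection names to the estimated
    number of documents they share (assumed to come from sampling).
    """
    graph: Dict[str, Dict[str, float]] = {}
    for (u, v), shared in overlap_estimates.items():
        # Every collection becomes a vertex, even if it has no edges.
        graph.setdefault(u, {})
        graph.setdefault(v, {})
        # Skip self-pairs and pairs with no meaningful overlap.
        if u == v or shared <= min_overlap:
            continue
        graph[u][v] = shared
        graph[v][u] = shared
    return graph
```

A selection method can then inspect the neighbours of a candidate collection to see how much of its content is likely to duplicate collections that were already chosen.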

11 Overlap Filtering for ReDDE
• F-ReDDE
  1. The overlaps among collections are estimated as described for the Relax selection method
  2. Collections are ranked using a resource selection algorithm such as ReDDE
  3. Each collection is compared with the previously selected collections. It is removed from the list if it has a high overlap (greater than γ) with any of the previously selected collections. The authors empirically choose γ = 30% and leave methods for finding the optimum value as future work
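The three F-ReDDE steps above can be sketched as a simple filter over an already ranked list. This is a minimal Python illustration under stated assumptions: the ranking (step 2) and the pairwise overlap estimates (step 1) are taken as given inputs, overlap is expressed as a fraction between 0 and 1, and the `filter_overlapping` name and the `frozenset` keying are hypothetical. How the overlap fraction is normalized is not specified in the transcript.

```python
from typing import Dict, FrozenSet, List


def filter_overlapping(
    ranked: List[str],
    overlap: Dict[FrozenSet[str], float],
    gamma: float = 0.30,
) -> List[str]:
    """Drop collections that overlap too much with earlier selections.

    ranked:  collection names, best first (e.g. from a ReDDE-style ranker).
    overlap: estimated overlap fraction per unordered pair of names;
             missing pairs are assumed to have no overlap.
    gamma:   threshold from the slide (30%).
    """
    selected: List[str] = []
    for candidate in ranked:
        # Compare only against collections that were actually kept.
        too_similar = any(
            overlap.get(frozenset((candidate, kept)), 0.0) > gamma
            for kept in selected
        )
        if not too_similar:
            selected.append(candidate)
    return selected
```

For example, with the ranking a, b, c and overlaps a-b = 0.5, b-c = 0.4, a-c = 0.1, collection b is removed because of a, while c is kept because it is compared only against a.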

12 Testbeds
• The authors create three new testbeds with overlapping collections, based on the documents available in the TREC GOV dataset
• Qprobed-280
  – 360 most frequent queries in a search engine in the .gov domain
  – A random number of documents (between 5000 and 20000) are downloaded as a collection
  – Generates 280 collections with an average size of 12194 documents
• Qprobed-300
  – Every twentieth collection is merged into a single large collection
• Sliding-115
  – Uses a sliding window of 30 000 documents
  – Generates 112 collections

13 Testbeds (cont’d)
• Qprobed-280
  – 74492 collection pairs < 10% overlap
  – 79 pairs < 90%
  – 1.1% of collection pairs > 50% overlap
• Qprobed-300
  – 1.9% of collection pairs > 50% overlap
• Sliding-115
  – 2.5% of collection pairs > 50% overlap

14 Results
• The initial estimated values for D(i, j) suggested that the degree of overlap among collections is usually overestimated
  – Document retrieval models are biased towards returning some popular documents for many queries
  – Samples produced by query-based sampling are not random

15 Results (cont’d)

16 Results (cont’d)

17 Conclusion & Discussion
• Pros
  – Proposes an efficient algorithm for handling duplicates
• Cons
  – The experiments show improved performance, but would it hold in a practical environment?

