Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University


1 Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University lsi@cs.cmu.edu callan@cs.cmu.edu

2 © 2003, Luo Si and Jamie Callan
Abstract
Task: Distributed information retrieval in uncooperative environments.
Contributions:
– Sample-Resample method to estimate DB size.
– ReDDE (relevant document distribution estimation) resource selection algorithm, which directly estimates the distribution of relevant documents among databases.
– Modified ReDDE algorithm for better retrieval performance.

3 What is Distributed Information Retrieval (Federated Search)?
[Figure: search engines Engine 1, Engine 2, Engine 3, Engine 4, ..., Engine n, annotated with (1) Resource Representation, (2) Resource Selection, and (4) Results Merging.]
Four steps:
(1) Find out what each DB contains
(2) Decide which DBs to search
(3) Search selected DBs
(4) Merge results returned by DBs

4 Previous Work: Resource Representation
Resource Representation (Content Representation):
Query-Based Sampling (Callan, et al., 1999)
– Submits randomly generated queries and analyzes the returned docs
– Needs no cooperation from individual DBs
Resource Representation (Database Size Estimation):
Capture-Recapture Model (Liu and Yu, 1999)

              #In_Samp1   #Out_Samp1
#In_Samp2         a            b
#Out_Samp2        c            d

Total Num: [formula not preserved in the transcript]
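The "Total Num" formula did not survive in the transcript. As a rough illustration only, the sketch below uses the classic Lincoln-Petersen capture-recapture estimator applied to the cells a, b, c of the table above. It is an assumption that this matches the form shown on the slide, and the function name is hypothetical.

```python
def capture_recapture_size(a: int, b: int, c: int) -> float:
    """Lincoln-Petersen estimate of the total number of docs in a DB.

    a: docs that appear in both samples
    b: docs in sample 2 but not in sample 1
    c: docs in sample 1 but not in sample 2
    (Assumed reading of the 2x2 table; d, the docs in neither sample,
    is the unknown that the estimator fills in.)
    """
    if a == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    size_sample_1 = a + c
    size_sample_2 = a + b
    # The overlap fraction a / size_sample_2 estimates size_sample_1 / N.
    return size_sample_1 * size_sample_2 / a


if __name__ == "__main__":
    # 100 docs in sample 1, 50 in sample 2, 10 in both.
    print(capture_recapture_size(a=10, b=40, c=90))
```

The estimate is only as good as the overlap count a, which is why small samples with little overlap need many transactions to become reliable.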

5 Previous Work: Resource Selection & Results Merging
Resource Selection:
gGlOSS (Gravano, et al., 1995)
– Represents DBs and queries as vectors and calculates their similarities
CORI (Callan, et al., 1995)
– A Bayesian inference network model. Has been shown effective on different testbeds
Results Merging:
CORI results merging algorithm (Callan, et al., 1995)
– Linear heuristic model with fixed parameters
Semi-Supervised Learning algorithm (Si and Callan, 2002)
– Linear model whose parameters are learned from training data

6 Previous Work: Thoughts
Thoughts:
– The original capture-recapture method incurs a very large cost to obtain relatively accurate DB size estimates.
– Most resource selection algorithms have not been studied in environments with a skewed DB size distribution.
– They do not directly optimize the number of relevant docs contained in the selected DBs, which is the goal of resource selection.
– The goals of resource selection and retrieval are inconsistent (high recall and high precision, respectively).

7 Experimental Data
Testbeds:
– Trec123_100col: 100 DBs. Organized by source and publication date. DB sizes and the distribution of relevant documents are rather uniform.
– Trec123_AP_WSJ_60col (Relevant): 62 DBs. 60 from above, 2 built by merging the AP and WSJ DBs. DB sizes are skewed and the large DBs have many more relevant docs.
– Trec123_FR_DOE_81col (Non-Relevant): 83 DBs. 81 from above, 2 built by merging the FR and DOE DBs. DB sizes are skewed and the large DBs have few relevant docs.
– Trec4_kmeans: 100 DBs. Organized by topic. DB sizes and the distribution of relevant documents are moderately skewed.
– Trec123_10col: 10 DBs. Each DB is built by merging 10 DBs from Trec123_100col in a round-robin way. DB sizes are large.

8 A New Approach to DB Size Estimation: The Sample-Resample Algorithm
The Idea:
– Assume: the search engine reports the number of docs that match a one-term query.
– Strategy: compute the df of a term in the sampled docs, obtain the df of the same term in the whole DB from the search engine, and scale the number of sampled docs to get the DB size.
Quantities in the DB size estimate:
– df of the term in the docs sampled from the j-th DB
– Number of docs sampled from the j-th DB
– df of the term in the whole j-th DB
– DB size estimate
Centralized sample DB: built from all the sampled docs
Centralized complete DB: a hypothetical DB built from all the docs in all DBs
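The scaling step described in the strategy can be sketched as follows. This is a minimal reading of the slide, assuming the term occurs in the sample in the same proportion as in the whole DB; the function names are hypothetical, and averaging over several probe terms is one plausible way to combine the resample queries, not a detail confirmed by the slide.

```python
from statistics import mean


def sample_resample_size(df_sample: int, n_sample: int, df_db: int) -> float:
    """Estimate the size of one DB from a single resample term.

    df_sample: df of the term in the docs sampled from the DB
    n_sample:  number of docs sampled from the DB
    df_db:     df of the term in the whole DB, as reported by the engine
    Assumes df_sample / n_sample is close to df_db / size.
    """
    if df_sample <= 0:
        raise ValueError("the term must occur in the sampled docs")
    return df_db * n_sample / df_sample


def average_size_estimate(probes, n_sample: int) -> float:
    """Average the estimates over several (df_sample, df_db) probe terms."""
    return mean(sample_resample_size(s, n_sample, d) for s, d in probes)


if __name__ == "__main__":
    # A term found in 30 of 300 sampled docs and in 1,000 docs of the DB.
    print(sample_resample_size(df_sample=30, n_sample=300, df_db=1000))
    print(average_size_estimate([(30, 1000), (25, 1000)], n_sample=300))
```

Each probe costs one query to the DB, which is what keeps the communication cost low once the sample already exists.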

9 Experimental Results: DB Size Estimation

Method                        Trec123-100Col (Avg AER)   Trec123-10Col (Avg AER)
Original Cap-Recap (Top 1)    0.729                      0.943
Cap-Recap (Top 20)            0.377                      0.849
Sample-Resample               0.232                      0.299

Measure: absolute error ratio (AER), computed from the estimated DB size and the actual DB size.
Methods were allowed the same number of transactions with a DB:
– Capture-Recapture: about 385 queries (transactions).
– Sample-Resample: 80 queries and 300 docs to build the sample, plus 5 resample queries, for 385 transactions.
Original Cap-Recap (Top 1) selects only the top 1 doc to build the sample. More experiments are in the paper.
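The AER formula itself is not preserved in the transcript. A natural reading of "absolute error ratio" with the labels "Estimated DB Size" and "Actual DB Size" is the absolute difference divided by the actual size; the sketch below assumes that definition, and the function names are hypothetical.

```python
def absolute_error_ratio(estimated_size: float, actual_size: float) -> float:
    """Assumed AER: |estimated - actual| / actual."""
    if actual_size <= 0:
        raise ValueError("actual size must be positive")
    return abs(estimated_size - actual_size) / actual_size


def average_aer(pairs) -> float:
    """Mean AER over (estimated, actual) pairs, one pair per DB."""
    ratios = [absolute_error_ratio(est, act) for est, act in pairs]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # One DB underestimated by 20%, another overestimated by 50%.
    print(average_aer([(800, 1000), (1500, 1000)]))
```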

10 A New Approach to Resource Selection: The ReDDE Algorithm
The goal of resource selection:
– Select the (few) DBs that have the most relevant documents
Common strategy:
– Pick DBs that are the “most similar” to the query
» But similarity measures don’t always normalize well for DB size
Desired strategy:
– Rank DBs by the number of relevant documents they contain
» It hasn’t been clear how to do this
An approximation of the desired strategy:
– Rank DBs by the percentage of relevant documents they contain
» This can be estimated a little more easily…
» …but we need to make some assumptions

11 The ReDDE Algorithm: Estimating the Distribution of Relevant Documents
Quantities used in the formulas:
– Estimated DB size
– Number of docs sampled from the j-th DB
– Number of docs sampled from the DB that contains d_j
– Estimated number of docs in the DB that contains d_j
Assumption: “Everything at the top is (equally) relevant”
Normalize, to eliminate the constant C_q.
[Figure: a document ranking on the centralized sample DB, CSDB (Rank), is mapped to a ranking on the centralized complete DB, CCDB (Rank), by scaling each sampled document by its DB size.]
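The slide's formulas are not preserved, so the following is only a sketch of how the annotations above could fit together, under stated assumptions: each sampled document stands for (estimated DB size / number of docs sampled from that DB) documents in the complete DB, every document whose estimated complete-DB rank falls within ratio times the total estimated size counts as equally relevant, and the constant C_q cancels when normalizing. The function name, argument names, and data layout are hypothetical.

```python
def redde_dist_rel(sample_ranking, est_size, n_sampled, ratio):
    """Estimate each DB's share of relevant documents for one query.

    sample_ranking: DB ids of the sampled docs, in the order the query
                    ranks them on the centralized sample DB
    est_size:       {db: estimated DB size}
    n_sampled:      {db: number of docs sampled from that DB}
    ratio:          fraction of the complete DB treated as "the top"
    """
    # Each sampled doc represents this many docs of its DB.
    scale = {db: est_size[db] / n_sampled[db] for db in est_size}
    threshold = ratio * sum(est_size.values())

    rel = {db: 0.0 for db in est_size}
    central_rank = 0.0  # estimated number of complete-DB docs ranked ahead
    for db in sample_ranking:
        if central_rank >= threshold:
            break  # below "the top": assumed non-relevant
        rel[db] += scale[db]  # C_q omitted; it cancels in the normalization
        central_rank += scale[db]

    total = sum(rel.values())
    return {db: (value / total if total else 0.0) for db, value in rel.items()}


if __name__ == "__main__":
    dist = redde_dist_rel(
        sample_ranking=["B", "A", "A", "B"],
        est_size={"A": 1000, "B": 100},
        n_sampled={"A": 10, "B": 10},
        ratio=0.1,
    )
    print(dist)
```

In this toy run the large DB "A" receives most of the estimated mass from a single highly ranked sampled document, which is the DB-size scaling the figure refers to.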

12 Experimental Results: Resource Selection
Measure: percentage of relevant docs included by the evaluated ranking, compared with the relevance-based (best) ranking.
[Figure: four panels. Trec123-100col (100 DBs); Trec4-kmeans (100 DBs); Relevant (2 large, 60 small DBs; the large DBs are relevant); Non-Relevant (2 large, 81 small DBs; the large DBs are non-relevant).]
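The measure can be sketched as a ratio over the top k DBs of the two rankings. The exact formula is not preserved in the transcript, so this assumes the usual form: relevant docs covered by the first k DBs of the evaluated ranking, divided by those covered by the first k DBs of the relevance-based ranking. Names are hypothetical.

```python
def recall_ratio(evaluated, rel_counts, k):
    """Fraction of the best achievable relevant docs covered at cutoff k.

    evaluated:  DB ids in the order produced by the algorithm under test
    rel_counts: {db: number of relevant docs in that DB}
    k:          number of DBs selected
    """
    # The best ranking orders DBs by how many relevant docs they hold.
    best = sorted(rel_counts, key=rel_counts.get, reverse=True)
    covered = sum(rel_counts[db] for db in evaluated[:k])
    best_covered = sum(rel_counts[db] for db in best[:k])
    return covered / best_covered if best_covered else 0.0


if __name__ == "__main__":
    counts = {"A": 10, "B": 5, "C": 0}
    for k in (1, 2, 3):
        print(k, recall_ratio(["B", "C", "A"], counts, k))
```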

13 Modified ReDDE for Retrieval Performance
Document Retrieval
The ReDDE algorithm has a parameter (“ratio”) that tunes the algorithm for “high Precision” or “high Recall”:
– High Precision focuses attention at the top of the rankings
– High Recall focuses attention on retrieving more relevant documents
Usually high Precision is the goal in interactive environments
– But for some databases the data is sparse, so high Precision settings yield (inaccurate) estimates of zero relevant documents in a DB.
Solution: Modified ReDDE with two ratios
– Use the high Precision setting if possible: rank all the DBs that have large values under the smaller ratio r1: DistRel_r1j >= backoff_Thres
– Else use the high Recall setting: rank the remaining DBs by their values under the larger ratio r2: DistRel_r2j
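The two-ratio back-off can be sketched as a two-stage ranking. This assumes that DBs passing the threshold are placed ahead of all others and ordered by their smaller-ratio score, with the rest ordered by their larger-ratio score; the slide does not spell out tie-breaking, and the function and argument names are hypothetical.

```python
def modified_redde_ranking(dist_rel_r1, dist_rel_r2, backoff_thres):
    """Rank DBs with a high-precision score, backing off to high recall.

    dist_rel_r1:   {db: DistRel computed with the smaller ratio}
    dist_rel_r2:   {db: DistRel computed with the larger ratio}
    backoff_thres: minimum DistRel_r1 value considered reliable
    """
    confident = [db for db, v in dist_rel_r1.items() if v >= backoff_thres]
    confident.sort(key=lambda db: dist_rel_r1[db], reverse=True)

    chosen = set(confident)
    rest = [db for db in dist_rel_r2 if db not in chosen]
    rest.sort(key=lambda db: dist_rel_r2[db], reverse=True)
    return confident + rest


if __name__ == "__main__":
    r1 = {"A": 0.6, "B": 0.4, "C": 0.0, "D": 0.0}
    r2 = {"A": 0.4, "B": 0.3, "C": 0.1, "D": 0.2}
    print(modified_redde_ranking(r1, r2, backoff_thres=0.1))
```

In the example, "C" and "D" both score zero under the smaller ratio, so the larger ratio is what separates them.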

14 Experimental Results: Retrieval Performance

Document Rank   Trec123-100col                 Trec123-2ldb-60col
                CORI     Modified ReDDE        CORI     Modified ReDDE
5               0.3760   0.4120 (+9.6%)        0.3720   0.4480 (+20.4%)
10              0.3660   0.3720 (+1.6%)        0.3680   0.4200 (+14.1%)
15              0.3453   0.3640 (+5.4%)        0.3440   0.3853 (+12.0%)
20              0.3140   0.3350 (+6.7%)        0.3240   0.3740 (+15.4%)
30              0.2873   0.2930 (+2.0%)        0.2853   0.3487 (+22.2%)
100             0.1750   0.1920 (+9.7%)        0.1692   0.2476 (+46.3%)

Precision at different document ranks using the CORI and Modified ReDDE resource selection algorithms. Results were averaged over 50 queries. 3 DBs were selected.

15 Conclusion and Future Work
Conclusions:
– The Sample-Resample algorithm gives relatively accurate DB size estimates at low communication cost.
– Database size is an important factor for resource selection algorithms, especially in environments where the distribution of relevant documents is skewed.
– ReDDE performs better than, or at least as well as, CORI in different environments.
– Modified ReDDE results in better retrieval performance.
Future work:
– Adjust the parameters of the ReDDE algorithm automatically.

