The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.


1 The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University lsi@cs.cmu.edu callan@cs.cmu.edu

2 Abstract
Task: Evaluate the performance of different resource selection algorithms in environments with different DB size distributions.
–Extend the CORI resource selection algorithm
–Extend the KL-divergence algorithm by using DB sizes as priors
Experiments were done on four different testbeds with different characteristics to show that ReDDE and extended KL divergence are more robust.
© 2003, Luo Si and Jamie Callan

3 Previous Work: Resource Representation
Resource Representation (Content Representation): Query-Based Sampling (Callan, et al., 1999)
–Submit randomly generated queries and analyze the returned docs
–Does not need cooperation from individual DBs
Resource Representation (Database Size Estimation): Sample-Resample (Si and Callan, 2003)
–Assume: The search engine indicates the number of docs that match a one-term query
–Strategy: Estimate the df of a query term in the sampled docs and in the whole collection; scale the number of sampled docs to get the DB size
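The sample-resample strategy on this slide can be sketched as a simple proportion. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and averaging over several probe terms is an assumption that the slide does not state.

```python
from statistics import mean


def estimate_db_size(df_in_sample: int, sample_size: int, df_in_db: int) -> float:
    """Scale the sample size by the ratio of reported df to sampled df.

    df_in_sample: number of sampled docs that contain the probe term
    sample_size:  number of docs obtained by query-based sampling
    df_in_db:     number of matching docs reported by the search engine
    """
    if df_in_sample <= 0:
        raise ValueError("the probe term must occur in the sample")
    # If the sample is representative: df_in_sample / sample_size ~ df_in_db / db_size
    return df_in_db * sample_size / df_in_sample


def estimate_db_size_multi(probes, sample_size: int) -> float:
    """Average the estimate over several (df_in_sample, df_in_db) probes (assumption)."""
    return mean(estimate_db_size(s, sample_size, d) for s, d in probes)


if __name__ == "__main__":
    print(estimate_db_size(df_in_sample=20, sample_size=300, df_in_db=4000))
```

A term that matches 20 of 300 sampled docs and 4000 docs in the full DB suggests a DB of roughly 60,000 docs.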

4 Previous Work: Resource Selection & Results Merging
Resource Selection:
gGlOSS (Gravano, et al., 1995)
–Represent DBs and queries as vectors and calculate the similarities
Kullback-Leibler (KL) divergence (Xu and Croft, 1999)
–Calculate the KL divergence between the word frequency distributions of the query and the DB
CORI (Callan, et al., 1995)
–A Bayesian inference network model; it has been shown to be effective on different testbeds
Results Merging:
CORI results merging algorithm (Callan, et al., 1995)
Semi-Supervised Learning algorithm (Si and Callan, 2002)

5 Resource Selection Algorithms that Normalize DB Size: The Old Version of CORI algorithm
The CORI algorithm is a Bayesian inference network that uses an adaptation of the Okapi formula to rank resources.
The belief of DB i according to the query term r_k is determined from the following quantities:
–Doc frequency (df) of the term in DB i
–Length of DB i (sampled)
–Avg (sampled) DB length
–Num of DBs
–DB frequency (the number of DBs that contain the term)
–The constants df_base and df_factor
[Equation not preserved in the transcript]
The belief of DB i for the query is the sum of the beliefs for all terms.
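The equation itself did not survive in the transcript. The sketch below uses the commonly published form of the CORI belief, which matches the quantities labeled on the slide. The default values (df_base = 50, df_factor = 150, default belief 0.4) are the usual CORI settings and are an assumption here, as are the function and parameter names.

```python
import math


def cori_term_belief(df, cw, avg_cw, num_dbs, cf,
                     df_base=50.0, df_factor=150.0, default_belief=0.4):
    """Belief of one DB for one query term (commonly published CORI form).

    df:      doc frequency of the term in the (sampled) DB
    cw:      length of the (sampled) DB
    avg_cw:  average (sampled) DB length
    num_dbs: number of DBs
    cf:      DB frequency, i.e. number of DBs that contain the term (must be > 0)
    """
    # Okapi-like tf component, normalized by relative DB length
    t = df / (df + df_base + df_factor * cw / avg_cw)
    # idf-like component over DBs
    i = math.log((num_dbs + 0.5) / cf) / math.log(num_dbs + 1.0)
    return default_belief + (1.0 - default_belief) * t * i


def cori_db_belief(term_stats, avg_cw, num_dbs, cw):
    """Sum the per-term beliefs, as stated on the slide.

    term_stats: iterable of (df, cf) pairs, one per query term
    """
    return sum(cori_term_belief(df, cw, avg_cw, num_dbs, cf) for df, cf in term_stats)


if __name__ == "__main__":
    print(cori_term_belief(df=50, cw=1000, avg_cw=1000, num_dbs=100, cf=10))
```

Because cw and avg_cw come from samples of similar size, this form carries little information about the true DB size, which is the issue the extended versions address.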

6 Resource Selection Algorithms that Normalize DB Size: The Extended Version of CORI algorithm
Three issues are addressed to incorporate the DB size factor:
–Df is scaled to estimate the actual df in the DB (scaling factor: Estimated DB Size / DB Sample Size)
–DB length is scaled
–df_base and df_factor are scaled
CORI_ext1 addresses the first two points; CORI_ext2 addresses all three points.
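The first two scaling steps (the CORI_ext1 part) can be sketched as follows. The helper names are hypothetical. The transcript does not preserve how df_base and df_factor are scaled in CORI_ext2, so that step is deliberately left out rather than guessed.

```python
def scale_to_db(sample_stat: float, est_db_size: float, sample_size: float) -> float:
    """Scale a statistic measured on the sample up to the estimated full DB."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_stat * est_db_size / sample_size


def cori_ext1_inputs(df_sample, cw_sample, est_db_size, sample_size):
    """Return (scaled df, scaled DB length) for use in the CORI belief.

    Only the first two points from the slide are covered; scaling of
    df_base and df_factor (CORI_ext2) is not reproduced here.
    """
    df_scaled = scale_to_db(df_sample, est_db_size, sample_size)
    cw_scaled = scale_to_db(cw_sample, est_db_size, sample_size)
    return df_scaled, cw_scaled


if __name__ == "__main__":
    # 12 of 300 sampled docs contain the term; the DB is estimated at 60,000 docs
    print(cori_ext1_inputs(df_sample=12, cw_sample=90000,
                           est_db_size=60000, sample_size=300))
```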

7 Resource Selection Algorithms that Normalize DB Size: The Old and Extended Versions of KL-divergence algorithm
In the language model framework, the KL-divergence algorithm calculates the conditional probability of a DB given the query:
$P(C_i \mid Q) = \frac{P(Q \mid C_i)\, P(C_i)}{P(Q)}$, where $P(Q)$ is a DB-independent constant.
–In the original KL-divergence algorithm, $P(C_i)$ is a uniform distribution
–In the extended KL-divergence algorithm, $P(C_i)$ is set according to DB size
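The difference between the uniform and size-based priors can be sketched with a smoothed query log-likelihood. Several details are assumptions rather than content from the slide: the prior is taken to be proportional to the estimated DB size, the smoothing is linear interpolation with a hypothetical weight lam, and the background model is pooled from the DB samples.

```python
import math
from collections import Counter


def kl_scores(query_terms, db_samples, est_sizes=None, lam=0.5):
    """Return log P(Q | C_i) + log P(C_i) for each DB.

    db_samples: dict of DB name -> Counter of term counts in its sample
    est_sizes:  dict of DB name -> estimated size; None means a uniform prior
    """
    # Background model pooled over all samples (assumption)
    global_counts = Counter()
    for counts in db_samples.values():
        global_counts.update(counts)
    global_total = sum(global_counts.values())
    size_total = sum(est_sizes.values()) if est_sizes else None

    scores = {}
    for name, counts in db_samples.items():
        db_total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            p_global = global_counts[term] / global_total
            if p_global == 0.0:
                continue  # term unseen everywhere: carries no ranking information
            p_db = counts.get(term, 0) / db_total
            score += math.log(lam * p_db + (1.0 - lam) * p_global)
        if est_sizes:
            score += math.log(est_sizes[name] / size_total)  # size-based prior
        else:
            score += math.log(1.0 / len(db_samples))         # uniform prior
        scores[name] = score
    return scores


if __name__ == "__main__":
    samples = {"small": Counter(apple=2, pie=2), "big": Counter(apple=2, pie=2)}
    print(kl_scores(["apple"], samples))
    print(kl_scores(["apple"], samples, {"small": 1000, "big": 100000}))
```

With identical sample statistics the uniform prior cannot separate the two DBs, while the size-based prior ranks the larger DB first.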

8 Resource Selection Algorithms that Normalize DB Size: The ReDDE Algorithm
The goal of resource selection:
–Select the (few) DBs that have the most relevant documents
Common strategy:
–Pick DBs that are the “most similar” to the query
»But similarity measures don’t always normalize well for DB size
Optimal strategy:
–Rank DBs by the number of relevant documents they contain
»It hasn’t been clear how to do this
An approximation of the optimal strategy:
–Rank DBs by the percentage of relevant documents they contain
»This can be estimated a little more easily…
»…but we need to make some assumptions

9 The ReDDE Algorithm: Estimating the Distribution of Relevant Documents
Quantities labeled in the equations (the equations themselves are not preserved in the transcript):
–Estimated DB size
–Number of docs sampled from the j-th DB
–Number of docs sampled from the DB that contains d_j
–Estimated number of docs in the DB that contains d_j
Assumption: “Everything at the top is (equally) relevant”
Normalize, to eliminate the constant C_q.
[Figure: documents a, b, c in the CSDB ranking are expanded into the CCDB ranking (a a b b b …) by scaling by DB size]
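The estimation step described on this slide can be sketched as follows. This is a hedged reading of the surviving labels, not the authors' code: each sampled doc is taken to stand for (estimated DB size / sample size) docs in the full collection, and docs are counted as relevant while the implied CCDB rank is below a fraction `ratio` of the total estimated size. The parameter `ratio` and all names are assumptions.

```python
def redde_distribution(csdb_ranking, est_sizes, sample_sizes, ratio=0.003):
    """Estimate each DB's share of the relevant documents.

    csdb_ranking: source DB name of each sampled doc, in ranked order for the query
    est_sizes:    dict of DB name -> estimated DB size
    sample_sizes: dict of DB name -> number of docs sampled from that DB
    ratio:        fraction of the whole collection treated as "the top" (assumption)
    """
    threshold = ratio * sum(est_sizes.values())
    rel = {db: 0.0 for db in est_sizes}
    ccdb_rank = 0.0  # estimated number of full-collection docs ranked ahead
    for db in csdb_ranking:
        if ccdb_rank >= threshold:
            break  # below the top: treated as non-relevant
        scale = est_sizes[db] / sample_sizes[db]  # scale by DB size
        rel[db] += scale   # everything at the top is (equally) relevant
        ccdb_rank += scale
    norm = sum(rel.values())
    # Normalizing removes the query-dependent constant
    return {db: (v / norm if norm else 0.0) for db, v in rel.items()}


if __name__ == "__main__":
    dist = redde_distribution(
        ["B", "A", "B", "A", "A", "B"],
        est_sizes={"A": 1000, "B": 200},
        sample_sizes={"A": 10, "B": 10},
        ratio=0.25,
    )
    print(sorted(dist.items(), key=lambda kv: kv[1], reverse=True))
```

DBs are then ranked by their estimated share, so a large DB with a few highly ranked sampled docs can outrank a small DB with more of them.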

10 Experimental Data
Testbeds:
Trec123_100col: 100 DBs. Organized by source and publication date. DB sizes and the distribution of relevant documents are rather uniform.
Trec123_AP_WSJ_60col (Relevant): 62 DBs. 60 from above, 2 built by merging the AP and WSJ DBs. DB sizes are skewed and the large DBs have many more relevant docs.
Trec123_FR_DOE_81col (Non-Relevant): 83 DBs. 81 from above, 2 built by merging the FR and DOE DBs. DB sizes are skewed and the large DBs have few relevant docs.
Trec4_kmeans: 100 DBs. Organized by topic. DB sizes and the distribution of relevant documents are moderately skewed.
Trec123_10col: 10 DBs. Each DB is built by merging 10 DBs from Trec123_100col in a round-robin way. DB sizes are large.

11 Experimental Results: Resource Selection
Measure: The number of relevant docs included by the evaluated ranking, as a percentage of the number included by the relevance-based (best) ranking.
[Figure: four result panels, one per testbed: Trec123-100col (100 DBs); Trec4-kmeans (100 DBs); Trec123_FR_DOE_81col (2 large, 81 small DBs; large are non-relevant); Trec123_AP_WSJ_60col (2 large, 60 small DBs; large are relevant)]
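The measure on this slide can be sketched as a ratio of relevant-document counts over the top n DBs of the two rankings. The function name and the dictionary input are hypothetical; the best ranking is assumed to order DBs by their true number of relevant docs.

```python
def relative_recall_at_n(evaluated_ranking, rel_counts, n):
    """Relevant docs covered by the top n DBs of the evaluated ranking,
    relative to the top n DBs of the relevance-based (best) ranking.

    evaluated_ranking: list of DB names produced by a selection algorithm
    rel_counts:        dict of DB name -> number of relevant docs for the query
    """
    best_ranking = sorted(rel_counts, key=rel_counts.get, reverse=True)
    covered = sum(rel_counts[db] for db in evaluated_ranking[:n])
    ideal = sum(rel_counts[db] for db in best_ranking[:n])
    return covered / ideal if ideal else 0.0


if __name__ == "__main__":
    rel = {"a": 10, "b": 5, "c": 0}
    for n in (1, 2, 3):
        print(n, relative_recall_at_n(["b", "c", "a"], rel, n))
```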

12 Conclusion and Future Work
Conclusions:
–Database size plays an important role in resource selection algorithms, especially in environments with a skewed distribution of relevant documents
–The extended KL-divergence and ReDDE algorithms tend to be the most robust of the algorithms investigated in the paper
–In some cases, the performance of ReDDE decreases as more DBs are selected, which may be due to the parameter setting
Future work:
–Adjust the parameters of the ReDDE algorithm automatically

