The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.



© 2003, Luo Si and Jamie Callan

Abstract
Task: Evaluate the performance of different resource selection algorithms in environments with different DB size distributions.
– Extend the CORI resource selection algorithm
– Extend the KL-divergence algorithm by using DB sizes as priors
– Experiments on four testbeds with different characteristics show that ReDDE and the extended KL-divergence algorithm are more robust

Previous Work: Resource Representation
Content representation: Query-Based Sampling (Callan, et al., 1999)
– Submit randomly generated queries and analyze the returned docs
– Does not need cooperation from individual DBs
Database size estimation: Sample-Resample (Si and Callan, 2003)
– Assume: the search engine reports the number of docs that match a one-term query
– Strategy: estimate the df of a query term in the sampled docs and in the whole collection; scale the number of sampled docs to get the DB size
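The Sample-Resample strategy above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a term's share of documents is roughly the same in the sample and in the full DB, so the sample size is scaled by the ratio of the two document frequencies. Averaging over several probe terms, and the names `sample_resample_size` and `probes`, are assumptions made for the example.

```python
from statistics import mean


def sample_resample_size(probes, num_sampled_docs):
    """Estimate a database's size from resampled one-term queries.

    probes: list of (df_in_sample, df_in_db) pairs, one per probe term, where
            df_in_db is the match count reported by the search engine.
    num_sampled_docs: number of documents obtained by query-based sampling.
    """
    estimates = [
        # df_sample / num_sampled_docs ~= df_db / db_size, solved for db_size
        df_db * num_sampled_docs / df_sample
        for df_sample, df_db in probes
        if df_sample > 0  # a term absent from the sample gives no estimate
    ]
    if not estimates:
        raise ValueError("no probe term occurs in the sample")
    # Assumption: average the per-term estimates to reduce variance.
    return mean(estimates)
```

For example, with 300 sampled docs and two probe terms whose (sample df, reported df) pairs are (3, 3000) and (6, 9000), the per-term estimates are 300,000 and 450,000 docs, and their mean is 375,000.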

Previous Work: Resource Selection & Results Merging
Resource Selection:
gGlOSS (Gravano, et al., 1995)
– Represent DBs and queries as vectors and calculate their similarities
Kullback-Leibler (KL) divergence (Xu and Croft, 1999)
– Calculate the KL divergence between the word frequency distributions of the query and the DB
CORI (Callan, et al., 1995)
– A Bayesian inference network model; has been shown to be effective on different testbeds
Results Merging:
CORI results merging algorithm (Callan, et al., 1995)
Semi-Supervised Learning algorithm (Si and Callan, 2002)

Resource Selection Algorithms that Normalize DB Size: The Old Version of the CORI Algorithm
The CORI algorithm is a Bayesian inference network that adapts the Okapi formula to rank resources.
The belief of DB i according to query term r_k is determined by the following terms (equation image not preserved):
– Doc frequency
– Length of DB i (sampled)
– Average (sampled) DB length
– Number of DBs
– DB frequency
– df_base and df_factor
The belief of DB i for the query is the sum of the beliefs for all terms.
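Since the formula itself did not survive in the transcript, the sketch below uses the commonly published form of the CORI belief function, which matches the terms labeled on the slide. Treat it as an assumption about what the slide showed: the default values (df_base = 50, df_factor = 150, default belief 0.4) come from the published CORI description, not from this transcript, and the function and parameter names are hypothetical.

```python
import math


def cori_term_belief(df, db_len, avg_db_len, num_dbs, db_freq,
                     df_base=50.0, df_factor=150.0, default_belief=0.4):
    """Belief of one DB for one query term (commonly published CORI form).

    df:         doc frequency of the term in the (sampled) DB
    db_len:     length of the (sampled) DB
    avg_db_len: average (sampled) DB length
    num_dbs:    number of DBs
    db_freq:    number of DBs that contain the term
    """
    if db_freq == 0:
        return default_belief  # term seen in no DB: fall back to the default
    # Okapi-style df component, normalized by relative DB length.
    t = df / (df + df_base + df_factor * db_len / avg_db_len)
    # Inverse DB frequency component.
    i = math.log((num_dbs + 0.5) / db_freq) / math.log(num_dbs + 1.0)
    return default_belief + (1.0 - default_belief) * t * i


def cori_db_belief(query_terms, term_df, db_len, avg_db_len, num_dbs, db_freqs):
    """Sum the per-term beliefs, as the slide describes."""
    return sum(
        cori_term_belief(term_df.get(term, 0), db_len, avg_db_len,
                         num_dbs, db_freqs.get(term, 0))
        for term in query_terms
    )
```

Because `db_len` and `avg_db_len` are measured on samples of similar size, the length ratio barely reflects the true DB sizes, which is the weakness the extended versions target.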

Resource Selection Algorithms that Normalize DB Size: The Extended Version of the CORI Algorithm
Three issues are addressed to incorporate the DB size factor:
– df is scaled to estimate the actual df in the DB (formula terms: estimated DB size, DB sample size)
– DB length is scaled
– df_base and df_factor are scaled
CORI_ext1 addresses the first two points; CORI_ext2 addresses all three.
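A minimal sketch of the first two points, assuming the scaling is a simple multiplication by the ratio of the estimated DB size to the DB sample size (the two terms labeled on the slide). The transcript does not say how df_base and df_factor are scaled, so that step is left out. The function name is hypothetical.

```python
def scale_to_db(sample_value, est_db_size, sample_size):
    """Scale a statistic measured on the sample up to the full database.

    Assumption: the statistic (df or DB length) grows linearly with the
    number of documents, so it is multiplied by est_db_size / sample_size.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_value * est_db_size / sample_size
```

For instance, a term with df 12 in a 300-doc sample of a DB estimated at 30,000 docs gets a scaled df of 1200. The same call can be applied to the sampled DB length.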

Resource Selection Algorithms that Normalize DB Size: The Old and Extended Versions of the KL-divergence Algorithm
Within the language model framework, the KL-divergence algorithm calculates the conditional probability of a DB given the query (equation image not preserved; it includes a DB-independent constant).
– In the original KL-divergence algorithm, P(C_i) is a uniform distribution
– In the extended KL-divergence algorithm, P(C_i) is set according to DB size
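The difference between the two versions can be sketched by ranking DBs on log P(Q | C_i) + log P(C_i), which drops the DB-independent constant. The details here are assumptions for illustration: the prior is taken to be proportional to the estimated DB size, and the query likelihood uses linear interpolation with a global language model and a hypothetical weight `lam`. The transcript only states that P(C_i) is uniform in the original and size-based in the extension.

```python
import math


def size_prior(est_db_sizes):
    """Assumed prior: P(C_i) proportional to the estimated DB size."""
    total = sum(est_db_sizes.values())
    return {db: size / total for db, size in est_db_sizes.items()}


def log_score(query_terms, db_term_probs, global_term_probs, prior, lam=0.5):
    """log P(Q | C_i) + log P(C_i), up to a DB-independent constant.

    db_term_probs:     term -> P(term | DB language model)
    global_term_probs: term -> P(term | global model); must be > 0 for
                       every query term so the logarithm is defined
    prior:             P(C_i); pass the same value for every DB to get
                       the original (uniform-prior) ranking
    """
    score = math.log(prior)
    for term in query_terms:
        # Assumed smoothing: interpolate the DB model with the global model.
        p = lam * db_term_probs.get(term, 0.0) + (1.0 - lam) * global_term_probs[term]
        score += math.log(p)
    return score
```

With a uniform prior, two DBs with identical language models tie; with the size prior, the larger one ranks first.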

Resource Selection Algorithms that Normalize DB Size: The ReDDE Algorithm
The goal of resource selection:
– Select the (few) DBs that have the most relevant documents
Common strategy:
– Pick DBs that are the “most similar” to the query
  » But similarity measures don’t always normalize well for DB size
Optimal strategy:
– Rank DBs by the number of relevant documents they contain
  » It hasn’t been clear how to do this
An approximation of the optimal strategy:
– Rank DBs by the percentage of relevant documents they contain
  » This can be estimated a little more easily…
  » …but we need to make some assumptions

The ReDDE Algorithm: Estimating the Distribution of Relevant Documents
Terms labeled in the formulas (equation images not preserved):
– Estimated DB size
– Number of docs sampled from the j-th DB
– Number of docs sampled from the DB that contains d_j
– Estimated number of docs in the DB that contains d_j
Assumption: “Everything at the top is (equally) relevant”
Normalize, to eliminate the constant C_q.
[Figure: CSDB (Rank) mapped to CCDB (Rank); scale by DB size]
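One way to read the terms above is the following sketch. It assumes the query is run against a single index of all sampled docs, that each sampled doc stands for (estimated DB size / number of docs sampled) docs of its DB, and that every doc whose estimated rank in the complete collection falls within a top fraction `ratio` is equally relevant. The names and the default `ratio` are assumptions for illustration, not values taken from the slides.

```python
from collections import defaultdict


def redde_scores(ranked_sample_dbs, est_db_size, sample_size, ratio=0.003):
    """Estimate each DB's share of relevant docs from a sample ranking.

    ranked_sample_dbs: DB id of each sampled doc, in rank order (best first),
                       from a query run over the pooled sampled docs.
    est_db_size:       DB id -> estimated number of docs in that DB
    sample_size:       DB id -> number of docs sampled from that DB
    ratio:             assumed fraction of the complete collection treated
                       as "the top", where everything is equally relevant
    """
    threshold = ratio * sum(est_db_size.values())
    scores = defaultdict(float)
    docs_above = 0.0  # estimated rank of the current doc in the complete collection
    for db in ranked_sample_dbs:
        if docs_above >= threshold:
            break  # below the top: counted as not relevant
        scale = est_db_size[db] / sample_size[db]  # scale by DB size
        scores[db] += scale
        docs_above += scale
    # Normalizing removes the unknown per-query constant.
    norm = sum(scores.values())
    return {db: (scores[db] / norm if norm else 0.0) for db in est_db_size}
```

DBs are then ranked by the returned shares. The choice of `ratio` is the kind of parameter setting that can affect how the ranking behaves as more DBs are selected.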

Experimental Data
Testbeds:
– Trec123_100col: 100 DBs. Organized by source and publication date. DB sizes and the distribution of relevant documents are rather uniform.
– Trec123_AP_WSJ_60col (Relevant): 62 DBs. 60 from above, 2 built by merging the AP and WSJ DBs. DB sizes are skewed and the large DBs have many more relevant docs.
– Trec123_FR_DOE_81col (Non-Relevant): 83 DBs. 81 from above, 2 built by merging the FR and DOE DBs. DB sizes are skewed and the large DBs have few relevant docs.
– Trec4_kmeans: 100 DBs. Organized by topic. DB sizes and the distribution of relevant documents are moderately skewed.
– Trec123_10col: 10 DBs. Each DB is built by merging 10 DBs from Trec123_100col in a round-robin way. DB sizes are large.

Experimental Results: Resource Selection
Measure: the percentage of relevant docs included by the evaluated ranking, compared with the relevance-based (best) ranking.
Figure panels (plots not preserved):
– Trec123_100col (100 DBs)
– Trec4-kmeans (100 DBs)
– Trec123_FR_DOE_81col (2 large, 81 small DBs): large are non-relevant
– Trec123_AP_WSJ_60col (2 large, 60 small DBs): large are relevant
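The measure can be sketched as a ratio of relevant-doc counts in the top k DBs of the two rankings. This assumes the usual recall-style form, with the best ranking obtained by sorting DBs by their number of relevant docs. The function name and input layout are hypothetical.

```python
def selection_recall(evaluated_ranking, rel_counts, k):
    """Relevant docs covered by the top-k evaluated DBs, relative to the best top-k.

    evaluated_ranking: DB ids in the order produced by the algorithm under test
    rel_counts:        DB id -> number of relevant docs in that DB
    k:                 number of DBs selected
    """
    # Best ranking: DBs sorted by how many relevant docs they hold.
    best = sorted(rel_counts.values(), reverse=True)[:k]
    found = sum(rel_counts.get(db, 0) for db in evaluated_ranking[:k])
    denom = sum(best)
    return found / denom if denom else 0.0
```

A value of 1.0 at a given k means the evaluated ranking covers as many relevant docs as the relevance-based ranking does with the same number of DBs.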

Conclusion and Future Work
Conclusions:
– Database size plays an important role in resource selection algorithms, especially in environments where the distribution of relevant documents is skewed
– The extended KL-divergence and ReDDE algorithms tend to be the most robust of the algorithms investigated in the paper
– In some cases, the performance of ReDDE decreases as more DBs are selected, which may be due to the parameter setting
Future work:
– Adjust the parameters of the ReDDE algorithm automatically