Relevant Document Distribution Estimation Method for Resource Selection
Luo Si and Jamie Callan
School of Computer Science, Carnegie Mellon University

Slide 2: Abstract

Task: distributed information retrieval in uncooperative environments.

Contributions:
– A Sample-Resample method to estimate database (DB) size.
– The ReDDE (Relevant Document Distribution Estimation) resource selection algorithm, which directly estimates the distribution of relevant documents among databases.
– A Modified ReDDE algorithm for better retrieval performance.

Slide 3: What is Distributed Information Retrieval (Federated Search)?

[Diagram: a broker in front of Engine 1, Engine 2, Engine 3, Engine 4, ..., Engine n, connected through (1) resource representation, (2) resource selection, and (4) results merging.]

Four steps:
(1) Resource representation: find out what each DB contains.
(2) Resource selection: decide which DBs to search.
(3) Retrieval: search the selected DBs.
(4) Results merging: merge the results returned by the DBs.
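The four steps can be read as a single broker loop. The following is a minimal Python sketch of that pipeline, not the authors' implementation; the scoring and merging placeholders, the function names, and the data structures are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, float]  # (doc_id, score returned by a search engine)

def federated_search(
    search_fns: Dict[str, Callable[[str, int], List[Result]]],   # db name -> search function
    db_representations: Dict[str, Dict[str, int]],   # db name -> sampled term -> doc frequency
    query: str,
    n_select: int = 3,
) -> List[Tuple[str, str, float]]:
    """Toy broker: (1) representations are built offline (e.g., by query-based
    sampling), (2) select a few DBs, (3) search them, (4) merge the results."""
    terms = query.lower().split()

    # (2) Resource selection: a crude stand-in score -- the summed sampled
    #     document frequencies of the query terms (CORI or ReDDE would go here).
    db_scores = {db: float(sum(rep.get(t, 0) for t in terms))
                 for db, rep in db_representations.items()}
    selected = sorted(db_scores, key=db_scores.get, reverse=True)[:n_select]

    # (3) Search the selected DBs.
    per_db = {db: search_fns[db](query, 100) for db in selected}

    # (4) Results merging: weight each document score by its DB's selection score.
    merged = [(db, doc_id, score * (1.0 + db_scores[db]))
              for db, results in per_db.items()
              for doc_id, score in results]
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged
```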

Slide 4: Previous Work – Resource Representation

Resource representation (content representation):
– Query-Based Sampling (Callan, et al., 1999): submit randomly generated queries and analyze the returned documents; needs no cooperation from individual DBs.

Resource representation (database size estimation):
– Capture-Recapture model (Liu and Yu, 1999), based on the overlap between two independent document samples:

                    #In_Samp1   #Out_Samp1
      #In_Samp2         a            b
      #Out_Samp2        c            d

  Total number of documents: N = a + b + c + d (the quantity to be estimated).
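The estimator itself is not spelled out on the slide; the sketch below uses the standard Lincoln-Petersen form, which is the usual reading of the capture-recapture model above. The function name and the set-based interface are assumptions.

```python
def capture_recapture_size(sample1_ids: set, sample2_ids: set) -> float:
    """Estimate total collection size N from two independent document samples.

    With uniform, independent samples, the overlap fraction a / |sample2|
    approximates |sample1| / N, which gives the Lincoln-Petersen estimate
        N_hat = |sample1| * |sample2| / a,
    where a is the number of documents captured in both samples.
    """
    a = len(sample1_ids & sample2_ids)
    if a == 0:
        raise ValueError("No overlap between the samples; draw larger samples.")
    return len(sample1_ids) * len(sample2_ids) / a

# Example: two samples of 200 docs each with 10 docs in common -> N_hat = 4000.
```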

Slide 5: Previous Work – Resource Selection & Results Merging

Resource selection:
– gGlOSS (Gravano, et al., 1995): represent DBs and queries as vectors and compute their similarities.
– CORI (Callan, et al., 1995): a Bayesian inference network model; has been shown to be effective on different testbeds.

Results merging:
– CORI results merging algorithm (Callan, et al., 1995): a linear heuristic model with fixed parameters (a sketch follows below).
– Semi-Supervised Learning algorithm (Si and Callan, 2002): a linear model whose parameters are learned from training data.
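The CORI merging heuristic is usually cited as a fixed linear combination of the normalized document score and the normalized score of the database it came from. The sketch below follows that commonly cited form; the min-max normalization of document scores within each DB is a simplifying assumption (CORI proper normalizes by each collection's minimum and maximum possible scores).

```python
def cori_merge(per_db_results, db_scores):
    """Merge per-DB result lists with the CORI-style linear heuristic:
        merged_score = (D_norm + 0.4 * D_norm * C_norm) / 1.4
    D_norm: document score, min-max normalized within its DB (assumption).
    C_norm: DB selection score, min-max normalized across all DBs.
    """
    c_min, c_max = min(db_scores.values()), max(db_scores.values())
    merged = []
    for db, results in per_db_results.items():
        c_norm = (db_scores[db] - c_min) / ((c_max - c_min) or 1.0)
        d_values = [score for _, score in results]
        d_min, d_max = min(d_values), max(d_values)
        for doc_id, d in results:
            d_norm = (d - d_min) / ((d_max - d_min) or 1.0)
            merged.append((doc_id, (d_norm + 0.4 * d_norm * c_norm) / 1.4))
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged
```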

Slide 6: Previous Work – Thoughts

– The original capture-recapture method incurs a very large cost to obtain relatively accurate DB size estimates.
– Most resource selection algorithms have not been studied in environments with skewed DB size distributions.
– They do not directly optimize the number of relevant documents contained in the selected DBs, which is the actual goal of resource selection.
– There is an inconsistency between the goals of resource selection (high recall) and retrieval (high precision).

Slide 7: Experimental Data

Testbeds:
– Trec123_100col: 100 DBs, organized by source and publication date. DB sizes and the distribution of relevant documents are rather uniform.
– Trec123_AP_WSJ_60col ("Relevant"): 62 DBs; 60 from the above plus 2 built by merging the AP and WSJ DBs. DB sizes are skewed and the large DBs contain many more relevant documents.
– Trec123_FR_DOE_81col ("Non-Relevant"): 83 DBs; 81 from the above plus 2 built by merging the FR and DOE DBs. DB sizes are skewed and the large DBs contain few relevant documents.
– Trec4_kmeans: 100 DBs, organized by topic. DB sizes and the distribution of relevant documents are moderately skewed.
– Trec123_10col: 10 DBs, each built by merging 10 DBs of Trec123_100col in a round-robin fashion. DB sizes are large.

Slide 8: A New Approach to DB Size Estimation – The Sample-Resample Algorithm

The idea:
– Assumption: the search engine indicates the number of documents that match a one-term query.
– Strategy: estimate the document frequency (df) of a term in the sampled documents, obtain its df in the whole DB from the engine, and scale the number of sampled documents to estimate the DB size.

Definitions:
– Centralized sample DB (CSDB): built from all the sampled documents.
– Centralized complete DB (CCDB): the imaginary union of all documents in all DBs.

If sampling were uniform, the proportion of sampled documents containing a term q should match the proportion in the full DB:

    df_samp,j(q) / N_samp,j  ≈  df_j(q) / N_j

so the size of the j-th DB is estimated as

    N̂_j = df_j(q) * N_samp,j / df_samp,j(q)

where df_samp,j(q) is the df of q in the documents sampled from the j-th DB, N_samp,j is the number of documents sampled from the j-th DB, and df_j(q) is the df of q in the whole j-th DB as reported by the engine. In practice the estimate is averaged over several resample queries. A sketch follows below.
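A minimal sketch of the sample-resample estimate, assuming the engine reports the number of documents matching a one-term query; the function and parameter names are illustrative.

```python
def sample_resample_size(resample_terms, df_in_sample, df_in_db, n_sampled_docs):
    """Estimate DB size by scaling the sample.

    resample_terms: a few terms drawn from the sampled documents.
    df_in_sample:   term -> document frequency within the sampled documents.
    df_in_db:       term -> document frequency reported by the engine for the whole DB.
    n_sampled_docs: number of documents sampled from this DB.

    If sampling were uniform, df_sample / n_sampled ~= df_db / N,
    so N_hat = df_db * n_sampled / df_sample; average over several terms.
    """
    estimates = [
        df_in_db[t] * n_sampled_docs / df_in_sample[t]
        for t in resample_terms
        if df_in_sample.get(t, 0) > 0 and t in df_in_db
    ]
    if not estimates:
        raise ValueError("Resample terms must occur in the sampled documents.")
    return sum(estimates) / len(estimates)

# Example: a term occurs in 6 of 300 sampled docs and the engine reports 2,400
# matching docs -> one estimate of the DB size is 2400 * 300 / 6 = 120,000.
```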

Slide 9: Experimental Results – DB Size Estimation

Setup: all methods were allowed the same number of transactions with a DB.
– Capture-Recapture: about 385 queries (transactions).
– Sample-Resample: 80 queries and 300 downloaded documents for the sample, plus 5 resample queries = 385 transactions.

Measure: absolute error ratio (AER),

    AER = |Estimated DB Size - Actual DB Size| / Actual DB Size,

averaged over the DBs of each testbed (Trec123-100Col and Trec123-10Col). The methods compared were Original Cap-Recap (Top 1), Cap-Recap (Top 20), and Sample-Resample; Sample-Resample produced lower average AER than both capture-recapture variants at the same transaction budget. Original Cap-Recap (Top 1) only selects the top 1 document per query to build the sample; more experiments are in the paper.
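The measure is simple enough to state as code; a trivial helper, with hypothetical dictionary inputs keyed by DB name:

```python
def average_absolute_error_ratio(estimated_sizes, actual_sizes):
    """Mean of |estimated - actual| / actual over all databases."""
    ratios = [abs(estimated_sizes[db] - actual_sizes[db]) / actual_sizes[db]
              for db in actual_sizes]
    return sum(ratios) / len(ratios)
```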

Slide 10: A New Approach to Resource Selection – The ReDDE Algorithm

The goal of resource selection:
– Select the (few) DBs that contain the most relevant documents.

Common strategy:
– Pick the DBs that are "most similar" to the query.
  » But similarity measures don't always normalize well for DB size.

Desired strategy:
– Rank DBs by the number of relevant documents they contain.
  » It hasn't been clear how to do this.

An approximation of the desired strategy:
– Rank DBs by the percentage of relevant documents they contain.
  » This can be estimated a little more easily... but we need to make some assumptions.

Slide 11: The ReDDE Algorithm – Estimating the Distribution of Relevant Documents

The expected number of relevant documents in the j-th DB is estimated from the documents sampled from it:

    Rel_q(j) = sum over sampled docs d_i from the j-th DB of  P(rel | d_i) * N̂_j / N_samp,j

where N̂_j is the estimated DB size and N_samp,j is the number of documents sampled from the j-th DB, so each sampled document stands for N̂_j / N_samp,j documents of its DB.

"Everything at the top is (equally) relevant": P(rel | d_i) = C_q if d_i would rank above a fixed fraction (the "ratio") of the centralized complete DB (CCDB), and 0 otherwise.

The CCDB rank of a sampled document is estimated from its rank in the centralized sample DB (CSDB) by scaling each higher-ranked sampled document by the size of the DB it came from:

    Rank_CCDB(d_i) ≈ sum over d_k ranked above d_i in the CSDB of  N̂_db(d_k) / N_samp,db(d_k)

where db(d_k) is the DB that contains d_k. Finally, the estimates are normalized across DBs, which eliminates the constant C_q.
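A minimal sketch of the estimate as reconstructed above: rank the centralized sample DB for the query, map each sampled document's rank onto the imaginary centralized complete DB by scaling with the estimated DB sizes, count the documents that fall above the ratio threshold, and normalize. Variable names, the default ratio value, and the form of the input ranking are illustrative assumptions.

```python
def redde_scores(
    csdb_ranking,    # list of (db_name, doc_id) from the centralized sample DB, best first
    est_db_size,     # db_name -> estimated DB size (e.g., from sample-resample)
    n_sampled,       # db_name -> number of documents sampled from that DB
    ratio=0.003,     # fraction of the centralized complete DB assumed relevant (illustrative)
):
    """Estimate the distribution of relevant documents across DBs (ReDDE-style)."""
    total_size = sum(est_db_size.values())
    threshold = ratio * total_size

    rel = {db: 0.0 for db in est_db_size}
    ccdb_rank = 0.0                        # estimated rank in the centralized complete DB
    for db, _doc in csdb_ranking:
        scale = est_db_size[db] / n_sampled[db]
        if ccdb_rank < threshold:
            # "Everything at the top is (equally) relevant": this sampled doc
            # stands for `scale` documents of its DB near the top of the CCDB.
            rel[db] += scale
        ccdb_rank += scale                 # documents estimated to rank above the next one

    total_rel = sum(rel.values()) or 1.0   # normalize so the constant C_q cancels
    return {db: r / total_rel for db, r in rel.items()}
```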

Slide 12: Experimental Results – Resource Selection

Measure: the percentage of relevant documents covered by the top-ranked DBs of the evaluated ranking, compared with the relevance-based ("best") ranking, i.e. the ranking of DBs by their true number of relevant documents.

[Figure: recall curves for the evaluated rankings versus the best ranking on four testbeds: Trec123-100col (100 DBs), Trec4-kmeans (100 DBs), the "Relevant" testbed (2 large and 60 small DBs, where the large DBs hold most of the relevant documents), and the "Non-Relevant" testbed (2 large and 81 small DBs, where the large DBs are mostly non-relevant).]
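The measure can be written as a recall ratio at each rank cutoff k: relevant documents held by the top-k DBs of the evaluated ranking, divided by those held by the top-k DBs of the relevance-based ranking. A small helper with hypothetical inputs:

```python
def resource_selection_recall(evaluated_ranking, rel_counts, k):
    """R_k = rel docs in top-k DBs of the evaluated ranking /
             rel docs in top-k DBs of the relevance-based (oracle) ranking.

    evaluated_ranking: list of DB names, best first.
    rel_counts:        db_name -> true number of relevant documents.
    """
    best_ranking = sorted(rel_counts, key=rel_counts.get, reverse=True)
    got = sum(rel_counts[db] for db in evaluated_ranking[:k])
    best = sum(rel_counts[db] for db in best_ranking[:k])
    return got / best if best else 0.0
```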

Slide 13: Modified ReDDE for Retrieval Performance (Document Retrieval)

The ReDDE algorithm has a parameter ("ratio") that tunes it for high precision or high recall:
– A high-precision setting focuses attention at the very top of the rankings.
– A high-recall setting focuses attention on retrieving more relevant documents.

Usually high precision is the goal in interactive environments. But for some databases the data is sparse, so high-precision settings yield (inaccurate) estimates of zero relevant documents in a DB.

Solution: Modified ReDDE with two ratios (a sketch follows below).
– Use the high-precision setting where possible: rank first all DBs whose estimate with the smaller ratio is large enough, i.e. DistRel_r1,j >= backoff_threshold, ordered by those values.
– Otherwise back off to the high-recall setting: rank the remaining DBs by their estimates with the larger ratio, DistRel_r2,j.
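A sketch of the two-ratio backoff as described on this slide: compute the estimates with both ratios, rank first the DBs whose high-precision estimate clears the backoff threshold, then append the remaining DBs ordered by the high-recall estimate. The ratio and threshold values are placeholders, and `redde_scores` refers to the sketch after the ReDDE slide above.

```python
def modified_redde_ranking(
    csdb_ranking, est_db_size, n_sampled,
    ratio_precision=0.0005,   # smaller ratio: high-precision setting (placeholder value)
    ratio_recall=0.003,       # larger ratio: high-recall setting (placeholder value)
    backoff_threshold=0.1,    # minimum trusted high-precision estimate (placeholder value)
):
    """Rank DBs with the high-precision estimate where it is non-sparse,
    then back off to the high-recall estimate for the rest."""
    dist_r1 = redde_scores(csdb_ranking, est_db_size, n_sampled, ratio_precision)
    dist_r2 = redde_scores(csdb_ranking, est_db_size, n_sampled, ratio_recall)

    confident = [db for db in est_db_size if dist_r1[db] >= backoff_threshold]
    backoff = [db for db in est_db_size if dist_r1[db] < backoff_threshold]

    confident.sort(key=lambda db: dist_r1[db], reverse=True)
    backoff.sort(key=lambda db: dist_r2[db], reverse=True)
    return confident + backoff
```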

Slide 14: Experimental Results – Retrieval Performance

Precision at different document ranks using the CORI and Modified ReDDE resource selection algorithms; results were averaged over 50 queries, and 3 DBs were selected per query. At six increasing document-rank cutoffs, the relative improvements of Modified ReDDE over CORI were:

    Trec123-100col:      +9.6%, +1.6%, +5.4%, +6.7%, +2.0%, +9.7%
    Trec123-2ldb-60col:  +20.4%, +14.1%, +12.0%, +15.4%, +22.2%, +46.3%

Slide 15: Conclusion and Future Work

Conclusions:
– The Sample-Resample algorithm gives relatively accurate DB size estimates at a low communication cost.
– Database size is an important factor for resource selection algorithms, especially in environments with a skewed distribution of relevant documents.
– ReDDE performs at least as well as, and often better than, CORI in different environments.
– Modified ReDDE results in better retrieval performance.

Future work: adjust the parameters of the ReDDE algorithm automatically.