Federated text retrieval from uncooperative overlapped collections. Milad Shokouhi, RMIT University, Melbourne, Australia; Justin Zobel, RMIT University, Melbourne, Australia.

Federated text retrieval from uncooperative overlapped collections. Milad Shokouhi and Justin Zobel, RMIT University, Melbourne, Australia. SIGIR 2007 (Collection representation in distributed IR). Presented by JongHeum Yeon, IDS Lab., Seoul National University.

Copyright  2008 by CEBT

Abstract
- Federated information retrieval (FIR): the query is sent to multiple collections, and a central broker merges the results and ranks them.
- Collections may contain duplicated documents, so the final results can contain a high number of duplicates.
- The authors propose a method for estimating the rate of overlap among collections, based on sampling.
- Using the estimated overlap statistics, they propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results.
[Slide figure: User – Broker – Collections]

Federated Information Retrieval (FIR)
- The query is sent simultaneously to several collections.
- Each collection evaluates the query and returns its results to the broker.
- Advantages:
  – No need to access the indexes of the collections.
  – Search runs over the latest version of the documents, without crawling and indexing.
- The broker selects the collections that are most likely to return relevant documents. This raises three problems:
  – Collection selection
  – Collection representation
  – Result merging
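The broker loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `coll.search` interface and the naive score-based merge are my assumptions (real brokers must also normalise the incompatible scores returned by different engines).

```python
# Minimal sketch of a federated-search broker: forward the query to each
# selected collection, then merge the returned (doc_id, score) lists into
# a single ranking. Interfaces here are hypothetical.

def broker_search(query, collections, k=10):
    """Send `query` to every collection and merge the results by score."""
    merged = []
    for coll in collections:
        # Each collection evaluates the query with its own search engine
        # and returns a list of (doc_id, score) pairs.
        merged.extend(coll.search(query))
    # Naive score-based merge; production systems normalise scores first.
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:k]
```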

Collection Selection Problem
- Most FIR techniques assume that the degree of overlap among collections is either none or negligible.
- However, many collections have a significant degree of overlap:
  – Bibliographic databases
  – News resources
- Selecting collections that are likely to return the same results introduces duplicate documents into the final results, which:
  – Wastes costly resources
  – Degrades search effectiveness
- The authors propose:
  – A method that estimates the degree of overlap among collections by sampling from each collection using random queries
  – Two collection selection techniques that use the estimated overlap statistics to maximize the number of unique relevant documents in the final results

Related Work
- Cooperative collection selection techniques:
  – Collections provide the broker with their index statistics and other useful information.
  – Examples: CORI, GlOSS, CVV
- Uncooperative collection selection techniques:
  – Collections do not provide their index statistics to the broker; instead, the broker samples documents from each collection.
  – ReDDE uses the sampled documents to estimate the number of relevant documents in each collection, and ranks collections according to the number of their highly ranked sampled documents.
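The ReDDE idea mentioned above can be sketched in simplified form. All names here are mine, and this omits a key part of the real algorithm (scores are also scaled by the estimated ratio of collection size to sample size); it only shows the core counting step.

```python
from collections import Counter

# Simplified ReDDE-style resource ranking: run the query against a
# centralized index built from all sampled documents, then score each
# collection by how many of its sampled documents appear among the
# top-ranked results. (The full algorithm also weights these counts by
# estimated collection sizes.)

def redde_rank(query, sample_index, top_n=50):
    """`sample_index.search(query)` returns ranked (doc_id, collection_id) pairs."""
    top_docs = sample_index.search(query)[:top_n]
    counts = Counter(coll_id for _doc_id, coll_id in top_docs)
    # Collections with more highly ranked sampled documents come first.
    return [coll for coll, _count in counts.most_common()]
```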

Overlap Estimation
- Uses the documents already downloaded by query-based sampling to estimate the rate of overlap; no additional information is required.
- From the samples S1 and S2 (drawn from collections C1 and C2), take subsets m1 and m2 of the sampled documents, each of size m.
- Estimate the probability of any given document from m1 being available in m2, and from it the expected number of common documents.
[Slide figure: collections C1 and C2 with samples S1 and S2; K denotes the expected number of common documents]

Overlap Estimation (cont'd)
- P(i), the probability that exactly i documents from m1 also appear in m2, follows a binomial distribution.

Overlap Estimation (cont'd)
- Using the binomial theorem, the expected number of documents in m1 ∩ m2 can be computed in closed form.
- The expected number of overlapping documents is independent of the collection sizes.
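The formulas missing from the overlap-estimation slides can be reconstructed from standard properties of the binomial distribution that the text alludes to. The symbol p (the probability that a single document from m1 is available in m2) and the independence assumption are my notation, not copied from the paper:

```latex
% Probability that exactly i of the m documents in m_1 also occur in m_2,
% assuming each matches independently with probability p:
P(i) = \binom{m}{i} \, p^{i} (1 - p)^{m - i}

% Expected size of the intersection is the binomial mean, obtainable
% via the binomial theorem:
\mathbb{E}\!\left[\,|m_1 \cap m_2|\,\right] = \sum_{i=0}^{m} i \, P(i) = m\,p
```

Note how the result depends only on the subset size m and the match probability p, which is consistent with the slide's claim that the expected number of overlapping documents is independent of the collection sizes.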

The 'RELAX' Selection Method
- Build a graph G in which each vertex is a collection and each edge (u, v) indicates overlapping documents between collections u and v.
- Output: a final merged document list that minimizes duplicates.
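The overlap graph just described can be built as follows. This is only an illustrative sketch: the function names are mine, and the per-pair overlap values would come from the sampling-based estimator of the previous slides, not from this code.

```python
# Build the overlap graph G used by the RELAX method: vertices are
# collections, and an edge (u, v) carries the estimated number of
# duplicate documents shared by u and v. The `estimated_overlap`
# callable stands in for the paper's sampling-based estimates.

def build_overlap_graph(collections, estimated_overlap):
    """Return a {u: {v: overlap}} adjacency map for pairs with non-zero overlap."""
    graph = {c: {} for c in collections}
    for u in collections:
        for v in collections:
            if u == v:
                continue
            d = estimated_overlap(u, v)
            if d > 0:
                graph[u][v] = d
    return graph
```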

The 'RELAX' Selection Method (cont'd)
[Slide figure: algorithm details, not captured in the transcript]

Overlap Filtering for ReDDE (F-ReDDE)
1. The overlaps among collections are estimated as described for the RELAX selection method.
2. Collections are ranked using a resource selection algorithm such as ReDDE.
3. Each collection is compared with the previously selected collections, and is removed from the list if it has a high overlap (greater than γ) with any of them.
- The authors empirically choose γ = 30% and leave methods for finding the optimal value as future work.
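The three steps above can be sketched as a greedy filter. The function and parameter names are mine; `overlap_ratio` stands in for the sampling-based overlap estimates, and the input list is assumed to be already ranked by a resource selection algorithm such as ReDDE.

```python
# Sketch of the F-ReDDE overlap filter: collections are taken in
# resource-selection order, and a collection is dropped when its
# estimated overlap with any already-selected collection exceeds the
# threshold gamma (30% in the paper's experiments).

def filter_by_overlap(ranked_collections, overlap_ratio, gamma=0.30):
    """`overlap_ratio(a, b)` returns the estimated overlap fraction in [0, 1]."""
    selected = []
    for coll in ranked_collections:
        if all(overlap_ratio(coll, prev) <= gamma for prev in selected):
            selected.append(coll)
    return selected
```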

Testbeds
- The authors create three new testbeds with overlapping collections, based on the documents available in the TREC GOV dataset.
- Qprobed:
  – The most frequent queries in a search engine are probed against the .gov documents.
  – A random number of documents (between 5000 and 20000) is downloaded as a collection.
  – 280 collections are generated, with an average size of … documents.
- Qprobed-300:
  – Every twentieth collection is merged into a single large collection.
- Sliding-115:
  – Collections are generated using a sliding window of … documents.
  – 112 collections are generated.

Testbeds (cont'd)
- Qprobed: … of collection pairs have < 10% overlap; 79 pairs have < 90% overlap; 1.1% of collection pairs have > 50% overlap.
- Qprobed-300: …% of collection pairs have > 50% overlap.
- Sliding-115: …% of collection pairs have > 50% overlap.

Results
- The initial estimated values of D(i, j) suggested that the degree of overlap among collections is usually overestimated, for two reasons:
  – Document retrieval models are biased towards returning certain popular documents for many queries.
  – As a result, the samples produced by query-based sampling are not random.

Results (cont'd)
[Slide figures: experimental results, not captured in the transcript]

Conclusion & Discussion
- Pros:
  – Proposes an efficient algorithm for handling duplicates.
- Cons:
  – The experiments show improved performance, but would the approach work as well in a practical environment?