DISTRIBUTED INFORMATION RETRIEVAL 2003. 07. 23 Lee Won Hee.

2 Abstract
- A multi-database model of distributed information retrieval
- In this model, full-text information retrieval consists of:
  - Discovering database contents
  - Ranking databases by their expected ability to satisfy the query
  - Searching a small number of databases
  - Merging the results returned by the different databases
- This paper presents algorithms for each task

3 Introduction
- Multi-database model of distributed information retrieval
  - Reflects the distributed location and control of information in a wide-area computer network
1) Resource description
  - The contents of each text database must be described
2) Resource selection
  - Given an information need and a set of resource descriptions, a decision must be made about which database(s) to search
3) Results merging
  - The ranked lists returned by each database are integrated into a single, coherent ranked list

4 Multi-database Testbeds
- Marcus, 1983
  - Addressed resource description and selection in the EXPERT CONIT system
- The creation of the TREC corpora
  - Text collections created by the U.S. National Institute of Standards and Technology (NIST) for its TREC conferences
  - Sufficiently large and varied
  - Can be divided into smaller databases
- [Table: summary statistics for three distributed IR testbeds]

5 Resource Description
- Unigram language model
  - Gravano et al., 1994; Gravano and Garcia-Molina, 1995; Callan et al.
  - Represents each database by a description consisting of the words that occur in the database and their frequencies of occurrence
  - Compact, and can be obtained automatically by examining the documents in a database or the document indexes
  - Can be extended easily to include phrases, proper names, and other text features
- Resource description based on terms and frequencies
  - A small fraction of the size of the original text database
- Resource descriptions of this kind can be learned with a technique called query-based sampling
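A minimal sketch of a unigram resource description, assuming a database is available as a list of document strings. The function name `describe_resource`, the tokenizer, and the choice to store both document frequency and collection term frequency are illustrative assumptions, not details from the slides.

```python
import re
from collections import Counter


def describe_resource(documents):
    """Build a unigram description: the words in a database and their frequencies.

    `documents` is a list of strings. The lowercase alphanumeric tokenizer
    is a simplification chosen for this sketch.
    """
    df = Counter()   # number of documents containing each term
    ctf = Counter()  # total occurrences of each term in the database
    for text in documents:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        ctf.update(tokens)
        df.update(set(tokens))
    return {"num_docs": len(documents), "df": df, "ctf": ctf}


if __name__ == "__main__":
    description = describe_resource(["apple banana apple", "banana cherry"])
    print(description["df"].most_common())
    print(description["ctf"].most_common())
```

The description grows with the vocabulary rather than with the text, which is why it stays a small fraction of the size of the database it describes.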

6 Resource Selection (1/4)
- Distributed information retrieval system
- Resource selection
  - The process of selecting the databases to search for a given query
  - Collections are treated analogously to documents in a database
  - The CORI database selection algorithm is used

7 Resource Selection (2/4)
- The CORI algorithm (Callan et al., 1995)
  - df : the number of documents in R_i containing r_k
  - cw : the number of indexing terms in resource R_i
  - avg_cw : the average number of indexing terms in each resource
  - C : the number of resources
  - cf : the number of resources containing term r_k
  - b : the minimum belief component (usually 0.4)
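The formula that these symbols belong to is not preserved in the transcript. The sketch below assumes the commonly published form of the CORI belief, T = df / (df + 50 + 150 * cw / avg_cw), I = log((C + 0.5) / cf) / log(C + 1.0), and p(r_k | R_i) = b + (1 - b) * T * I. The constants 50 and 150 and the function name `cori_belief` are assumptions that do not appear on the slide.

```python
import math


def cori_belief(df, cw, avg_cw, C, cf, b=0.4):
    """Belief p(r_k | R_i) that resource R_i satisfies query term r_k.

    Assumed form (constants 50 and 150 are not from the slide):
        T = df / (df + 50 + 150 * cw / avg_cw)
        I = log((C + 0.5) / cf) / log(C + 1.0)
        p = b + (1 - b) * T * I
    """
    if df == 0:
        # The term does not occur in this resource: only the minimum belief remains.
        return b
    T = df / (df + 50 + 150 * cw / avg_cw)
    I = math.log((C + 0.5) / cf) / math.log(C + 1.0)
    return b + (1 - b) * T * I


if __name__ == "__main__":
    # A term found in 200 documents of an average-sized resource,
    # and present in 3 of 100 resources.
    print(cori_belief(df=200, cw=10_000, avg_cw=10_000, C=100, cf=3))
```

T plays the role of tf and I the role of idf, which is the sense in which collections are treated like documents.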

8 Resource Selection (3/4)
- INQUERY query operators (Turtle, 1990; Turtle and Croft, 1991)
  - Can be used for ranking both databases and documents
  - p_j : p(r_j | R_i)
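The operator table from this slide is not preserved. As a hedged illustration, the sketch below uses the commonly cited definitions of the INQUERY belief operators (mean for sum, weighted mean for wsum, product for and, and so on), where each p_j is a belief p(r_j | R_i). The Python function names are illustrative.

```python
import math


def belief_sum(ps):
    """#sum: mean of the term beliefs."""
    return sum(ps) / len(ps)


def belief_wsum(ps, ws):
    """#wsum: weighted mean of the term beliefs."""
    return sum(w * p for w, p in zip(ws, ps)) / sum(ws)


def belief_not(p):
    """#not: complement of a belief."""
    return 1.0 - p


def belief_or(ps):
    """#or: one minus the product of the complements."""
    return 1.0 - math.prod(1.0 - p for p in ps)


def belief_and(ps):
    """#and: product of the term beliefs."""
    return math.prod(ps)


if __name__ == "__main__":
    beliefs = [0.4, 0.8]
    print(belief_sum(beliefs), belief_and(beliefs), belief_or(beliefs))
```

Because the operators only combine beliefs, the same code ranks a database (beliefs from a resource description) or a document (beliefs from a document index).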

9 Resource Selection (4/4)
- Effectiveness of a resource ranking algorithm
  - Compares a given database ranking at rank n to a desired database ranking at rank n
  - rg_i : the number of relevant documents in the i-th ranked database under the given ranking
  - rd_i : the number of relevant documents in the i-th ranked database under a desired ranking, in which databases are ordered by the number of relevant documents they contain
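The metric itself is missing from the transcript. A sketch, assuming it is the usual ratio R(n) = (sum of rg_i for i = 1..n) / (sum of rd_i for i = 1..n). The function name `ranking_recall` and the input format are illustrative.

```python
def ranking_recall(given_ranking, relevant_counts, n):
    """Assumed metric R(n) = sum(rg_i, i<=n) / sum(rd_i, i<=n).

    given_ranking   : database names in the order produced by the algorithm
    relevant_counts : dict mapping database name -> number of relevant documents
    """
    # Desired ranking: databases ordered by how many relevant documents they hold.
    desired = sorted(relevant_counts.values(), reverse=True)
    found = sum(relevant_counts.get(db, 0) for db in given_ranking[:n])
    best = sum(desired[:n])
    return found / best if best else 0.0


if __name__ == "__main__":
    counts = {"news": 10, "patents": 5, "recipes": 0}
    ranking = ["recipes", "news", "patents"]
    for n in (1, 2, 3):
        print(n, ranking_recall(ranking, counts, n))
```

A value of 1.0 at rank n means the first n databases chosen hold as many relevant documents as the best possible choice of n databases.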

10 Merging Document Rankings (1/2)
- After a set of databases is searched
  - The ranked results from each database must be merged into a single ranking
  - Difficult when individual databases are not cooperative
    - Each database's scores are based on different corpus statistics, representations, and/or retrieval algorithms
- Results merging techniques
  - Cooperative approach
    - Use of global idf or the same ranking algorithm
    - Recomputing document scores at the search client
  - Non-cooperative approach
    - Estimate normalized document scores: a combination of the score of the database and the score of the document

11 Merging Document Rankings (2/2)
- Estimating normalized document scores
  - N : the number of resources searched
  - D'' : the normalized document score, the product of the unnormalized document score D and a database-specific weight
  - R_i : the score of database R_i
  - Avg_R : the average database score
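The merging formula is missing from the transcript. One formulation consistent with the symbols N, D, R_i, and Avg_R weights each document score by its database: D'' = D * (1 + N * (R_i - Avg_R) / Avg_R). Treat that formula, the function name `merge_results`, and the input format as assumptions of this sketch rather than as the slide's exact content.

```python
def merge_results(results, db_scores):
    """Merge per-database ranked lists into a single ranking.

    results   : dict mapping database name -> list of (doc_id, score D)
    db_scores : dict mapping database name -> database score R_i
    Assumed weighting: D'' = D * (1 + N * (R_i - Avg_R) / Avg_R)
    """
    n = len(db_scores)                        # N: number of resources searched
    avg_r = sum(db_scores.values()) / n       # Avg_R: average database score
    merged = []
    for db, docs in results.items():
        weight = 1 + n * (db_scores[db] - avg_r) / avg_r
        for doc_id, score in docs:
            merged.append((doc_id, db, score * weight))
    # Highest normalized score first.
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged


if __name__ == "__main__":
    results = {"A": [("a1", 0.5)], "B": [("b1", 0.7)]}
    db_scores = {"A": 0.6, "B": 0.4}
    for doc_id, db, score in merge_results(results, db_scores):
        print(doc_id, db, round(score, 3))
```

Under this weighting a document from a highly ranked database can overtake a document with a higher raw score from a poorly ranked one, without any database having to share its corpus statistics.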

12 Acquiring Resource Descriptions (1/2)
- Query-based sampling (Callan et al., 1999; Callan & Connell, 2001)
  - Does not require the cooperation of the databases
  - Queries the database with random one-word queries
  - The initial query is selected from a large dictionary of terms
  - Subsequent queries are drawn from the documents sampled from the database

13 Acquiring Resource Descriptions (2/2)
- Query-based sampling algorithm
  1. Select an initial query term
  2. Run a one-term query on the database
  3. Retrieve the top N documents returned by the database
  4. Update the resource description based on the characteristics of the retrieved documents
     - Extract words and frequencies from the top N documents returned by the database
     - Add the words and their frequencies to the learned resource description
  5. If a stopping criterion has not yet been reached,
     - Select a new query term
     - Go to Step 2
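The five steps can be sketched as a loop over a search interface. Everything concrete here is an assumption made for illustration: `make_toy_search` stands in for a remote database, the stopping criterion is a cap on distinct documents sampled (`max_docs`), and each document is counted only once.

```python
import random
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def make_toy_search(corpus):
    """Hypothetical stand-in for an uncooperative database's search interface."""
    def search(term, top_n):
        hits = [(tokenize(doc).count(term), doc_id) for doc_id, doc in corpus.items()]
        hits = [(tf, doc_id) for tf, doc_id in hits if tf > 0]
        hits.sort(key=lambda h: (-h[0], h[1]))  # highest term frequency first
        return [(doc_id, corpus[doc_id]) for _, doc_id in hits[:top_n]]
    return search


def query_based_sample(search, seed_terms, top_n=4, max_docs=300,
                       max_queries=1000, rng=None):
    rng = rng or random.Random(0)
    description = Counter()  # learned resource description: term -> frequency
    seen, tried = set(), set()
    term = rng.choice(sorted(seed_terms))               # Step 1
    for _ in range(max_queries):
        tried.add(term)
        for doc_id, text in search(term, top_n):        # Steps 2 and 3
            if doc_id not in seen:
                seen.add(doc_id)
                description.update(tokenize(text))      # Step 4
        if len(seen) >= max_docs:                       # Step 5: assumed criterion
            break
        # New query terms come from the sampled documents; fall back to seeds.
        candidates = sorted(set(description) - tried) or sorted(set(seed_terms) - tried)
        if not candidates:
            break
        term = rng.choice(candidates)
    return description, seen


if __name__ == "__main__":
    corpus = {1: "apple banana", 2: "banana cherry", 3: "cherry date"}
    description, seen = query_based_sample(make_toy_search(corpus), ["banana"],
                                           top_n=2, max_docs=3)
    print(sorted(seen), dict(description))
```

The sampler only needs the ability to run queries and read the returned documents, which is what makes the technique usable against databases that do not cooperate.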

14 Accuracy of Unigram Language Models (1/3)
- Test corpora for query-based sampling experiments
- Ctf ratio
  - How well the learned vocabulary matches the actual vocabulary
  - V' : the learned vocabulary
  - V : the actual vocabulary
  - ctf_i : the number of times term i occurs in the database
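The ctf ratio formula is not preserved in the transcript. A sketch, assuming the usual definition: the sum of ctf_i over terms in V' divided by the sum of ctf_i over terms in V, with ctf_i always taken from the actual database. The function name `ctf_ratio` is illustrative.

```python
def ctf_ratio(learned_vocab, actual_ctf):
    """Share of the database's term occurrences covered by the learned vocabulary.

    learned_vocab : iterable of terms in V'
    actual_ctf    : dict mapping each term in V to ctf_i in the actual database
    """
    total = sum(actual_ctf.values())
    covered = sum(actual_ctf.get(term, 0) for term in set(learned_vocab))
    return covered / total if total else 0.0


if __name__ == "__main__":
    actual = {"the": 80, "cat": 15, "zygote": 5}
    # Missing one rare word out of three still covers most term occurrences.
    print(ctf_ratio({"the", "cat"}, actual))
```

Weighting by frequency means that missing many rare words costs little, while missing a frequent word is penalized heavily.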

15 Accuracy of Unigram Language Models (2/3)
- Spearman rank correlation coefficient
  - How well the learned term frequencies indicate the frequency of each term in the database
  - The rank correlation coefficient
    - 1 : the two orderings are identical
    - 0 : they are uncorrelated
    - -1 : they are in reverse order
  - d_i : the rank difference of common term i
  - n : the number of terms
  - f_k : the number of ties in the k-th group of ties in the learned resource description
  - g_m : the number of ties in the m-th group of ties in the actual resource description
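The coefficient's formula is missing from the transcript. The sketch below assumes the standard tie-corrected form that uses exactly these symbols: the numerator is 1 - 6 / (n^3 - n) * (sum d_i^2 + (1/12) sum (f_k^3 - f_k) + (1/12) sum (g_m^3 - g_m)), and the denominator is sqrt(1 - sum (f_k^3 - f_k) / (n^3 - n)) * sqrt(1 - sum (g_m^3 - g_m) / (n^3 - n)). Function names and the average-rank handling of ties are assumptions of this sketch.

```python
import math
from collections import Counter


def average_ranks(freqs, terms):
    """Rank terms by descending frequency; tied terms share their average rank."""
    ordered = sorted(terms, key=lambda t: -freqs[t])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and freqs[ordered[j + 1]] == freqs[ordered[i]]:
            j += 1
        for term in ordered[i:j + 1]:
            ranks[term] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def tie_correction(freqs, terms):
    """Sum of (size^3 - size) over groups of terms with equal frequency."""
    sizes = Counter(freqs[t] for t in terms).values()
    return sum(s ** 3 - s for s in sizes)


def spearman(learned, actual):
    """Tie-corrected Spearman coefficient over the terms common to both descriptions.

    Raises ZeroDivisionError if every common term is tied in one description.
    """
    terms = sorted(set(learned) & set(actual))
    n = len(terms)
    if n < 2:
        raise ValueError("need at least two common terms")
    learned_ranks = average_ranks(learned, terms)
    actual_ranks = average_ranks(actual, terms)
    d2 = sum((learned_ranks[t] - actual_ranks[t]) ** 2 for t in terms)
    f = tie_correction(learned, terms)
    g = tie_correction(actual, terms)
    denom = n ** 3 - n
    numerator = 1 - 6 / denom * (d2 + f / 12 + g / 12)
    return numerator / (math.sqrt(1 - f / denom) * math.sqrt(1 - g / denom))


if __name__ == "__main__":
    learned = {"a": 3, "b": 2, "c": 1}
    print(spearman(learned, {"a": 30, "b": 20, "c": 10}))
    print(spearman(learned, {"a": 10, "b": 20, "c": 30}))
```

Only the ordering of terms matters, so a small sample can score well even though its raw frequencies are far smaller than those in the full database.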

16 Accuracy of Unigram Language Models (3/3)
- Experiment

17 Accuracy of Resource Rankings
- Experiment

18 Accuracy of Document Rankings
- Experiment

19 Summary and Conclusions
- Techniques for acquiring descriptions of resources controlled by uncooperative parties
- Using resource descriptions to rank text databases by their likelihood of satisfying a query
- Merging the document rankings returned by different text databases
- The major remaining weakness
  - The algorithm for merging document rankings produced by different databases
  - The computational cost of parsing and reranking the documents
- Many of the traditional IR tools, such as relevance feedback, have yet to be applied to multi-database environments