1 DISTRIBUTED INFORMATION RETRIEVAL 2003. 07. 23 Lee Won Hee

2 Abstract
 A multi-database model of distributed information retrieval
 Full-text information retrieval in this model consists of:
 - Discovering database contents
 - Ranking databases by their expected ability to satisfy the query
 - Searching a small number of databases
 - Merging the results returned by the different databases
 This paper presents algorithms for each task

3 Introduction
 Multi-database model of distributed information retrieval
 - Reflects the distributed location and control of information in a wide-area computer network
 1) Resource description
 - The contents of each text database must be described
 2) Resource selection
 - Given an information need and a set of resource descriptions, a decision must be made about which database(s) to search
 3) Result merging
 - Integrating the ranked lists returned by the searches of the individual databases into a single, coherent ranked list

4 Multi-database Testbeds
 Marcus, 1983
 - Addressed resource description and selection in the EXPERT CONIT system
 The creation of the TREC corpora
 - Text collections created by the U.S. National Institute of Standards and Technology (NIST) for its TREC conferences
 - Sufficiently large and varied that they could be divided into smaller databases
 Summary statistics for three distributed IR testbeds

5 Resource Description
 Unigram language model (Gravano et al., 1994; Gravano and Garcia-Molina, 1995; Callan et al., 1995)
 - Represents each database by a description consisting of the words that occur in the database and their frequencies of occurrence
 - Compact, and can be obtained automatically by examining the documents in a database or the document indexes
 - Can be extended easily to include phrases, proper names, and other text features
 A resource description based on terms and frequencies is a small fraction of the size of the original text database
 Such descriptions can also be acquired from uncooperative databases with a technique called query-based sampling (described later)
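A resource description of this kind is essentially a term-to-frequency table. The following minimal sketch (hypothetical function names, naive regex tokenization) shows how such a description, with collection term frequencies (ctf) and document frequencies (df), might be built from a set of documents:

```python
from collections import Counter
import re

def build_resource_description(documents):
    """Build a unigram resource description: each term mapped to its
    total frequency (ctf) and the number of documents containing it (df)."""
    ctf = Counter()   # collection term frequency
    df = Counter()    # document frequency
    for doc in documents:
        terms = re.findall(r"[a-z]+", doc.lower())   # naive tokenizer
        ctf.update(terms)
        df.update(set(terms))
    return {"ctf": dict(ctf), "df": dict(df), "num_docs": len(documents)}

# Example: describe a tiny two-document "database"
docs = ["Distributed information retrieval ranks databases.",
        "Databases are ranked and searched, then results are merged."]
description = build_resource_description(docs)
print(description["df"]["databases"])   # 2: the term occurs in both documents
```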

6 Resource Selection (1/4)
 Resource selection in a distributed information retrieval system
 - The process of selecting the databases to search for a given query
 - Collections are treated analogously to documents in a database
 - The CORI database selection algorithm is used

7 Resource Selection (2/4)
 The CORI algorithm (Callan et al., 1995) scores resource Ri for query term rk by the belief
 - T = df / (df + 50 + 150 * cw / avg_cw)
 - I = log((C + 0.5) / cf) / log(C + 1.0)
 - p(rk|Ri) = b + (1 - b) * T * I
 where
 - df : the number of documents in Ri containing rk
 - cw : the number of indexing terms in resource Ri
 - avg_cw : the average number of indexing terms in each resource
 - C : the number of resources
 - cf : the number of resources containing term rk
 - b : the minimum belief component (usually 0.4)
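As a rough illustration, the belief formula above can be written directly in code. The constants 50 and 150 are the defaults usually reported for CORI; the function name and the example numbers are illustrative assumptions:

```python
import math

def cori_belief(df, cf, cw, avg_cw, num_resources, b=0.4):
    """Belief p(rk|Ri) that database Ri satisfies query term rk (CORI sketch).
    df: documents in Ri containing rk; cf: resources containing rk;
    cw: indexing terms in Ri; avg_cw: average cw over resources;
    b: minimum belief component."""
    T = df / (df + 50 + 150 * cw / avg_cw)          # term-frequency component
    I = math.log((num_resources + 0.5) / cf) / math.log(num_resources + 1.0)
    return b + (1 - b) * T * I

# A database holding 120 matching documents, in a testbed of 100 databases
print(round(cori_belief(df=120, cf=10, cw=50000, avg_cw=40000, num_resources=100), 3))
```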

8 Resource Selection (3/4)
 INQUERY query operators (Turtle, 1990; Turtle and Croft, 1991)
 - Can be used for ranking both databases and documents
 - Operators combine the per-term beliefs p_j = p(r_j|R_i); for example, the #sum operator scores a database by the average of the p_j
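Building on the cori_belief() sketch above, a database ranking for a multi-term query can be produced by averaging the per-term beliefs, in the spirit of INQUERY's #sum operator. The db_stats layout and the way cf is derived are assumptions made for the example:

```python
# Builds on cori_belief() from the previous sketch.
def rank_databases(query_terms, db_stats, avg_cw, num_resources):
    """Rank databases by the mean CORI belief over the query terms,
    in the spirit of an INQUERY #sum over p(rk|Ri).
    db_stats: {db_name: {"cw": int, "df": {term: doc_count}}} (assumed layout)."""
    # cf: number of databases whose description lists the term (at least 1)
    cf = {t: max(1, sum(1 for s in db_stats.values() if s["df"].get(t, 0) > 0))
          for t in query_terms}
    scores = {}
    for name, s in db_stats.items():
        beliefs = [cori_belief(s["df"].get(t, 0), cf[t], s["cw"], avg_cw, num_resources)
                   for t in query_terms]
        scores[name] = sum(beliefs) / len(beliefs)   # #sum: average of term beliefs
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```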

9 Resource Selection (4/4)
 Effectiveness of a resource ranking algorithm
 - Compares a given database ranking at rank n to a desired database ranking at rank n:
 - R_n = (sum of rg_i for i = 1..n) / (sum of rd_i for i = 1..n)
 - rg_i : the number of relevant documents in the i-th ranked database under the given ranking
 - rd_i : the number of relevant documents in the i-th ranked database under a desired ranking in which databases are ordered by the number of relevant documents they contain
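A small sketch of this measure, here called R_n, using hypothetical relevant-document counts per database:

```python
def ranking_effectiveness(given, desired, n):
    """R_n: relevant documents accumulated by the top-n databases of the given
    ranking, divided by the same count under the desired ranking.
    given/desired: relevant-document counts per database, in rank order."""
    return sum(given[:n]) / sum(desired[:n])

given   = [30, 0, 25, 10]    # rg_i: relevant docs per database, given ranking
desired = [40, 30, 25, 10]   # rd_i: relevant docs per database, desired ranking
print(ranking_effectiveness(given, desired, n=2))   # (30+0)/(40+30) ≈ 0.43
```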

10 Merging Document Rankings (1/2)
 After a set of databases is searched
 - The ranked results from each database must be merged into a single ranking
 - Difficult when the individual databases are not cooperative
 - Each database may use different corpus statistics, representations, and/or retrieval algorithms, so document scores are not directly comparable
 Result merging techniques
 - Cooperative approaches: use global idf or the same ranking algorithm in every database, or recompute document scores at the search client
 - Uncooperative approach: estimate normalized document scores as a combination of the score of the database and the score of the document

11 Merging Document Rankings (2/2)
 Estimated normalized document score
 - D'' = D * w, where w = 1 + N * (R_i - avg_R) / avg_R
 - N : the number of resources searched
 - D'' : the product of the unnormalized document score D and the weight w derived from the database score
 - R_i : the score of database R_i
 - avg_R : the average database score
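A minimal sketch of the merge, assuming the weight w = 1 + N * (R_i - avg_R) / avg_R reconstructed above; the input layout (per-database lists of (doc_id, score) pairs) is an assumption for illustration:

```python
def merge_rankings(results, db_scores):
    """Merge per-database rankings by weighting each raw document score D with
    w = 1 + N * (R_i - avg_R) / avg_R.
    results: {db_name: [(doc_id, score), ...]}; db_scores: {db_name: R_i}."""
    n = len(db_scores)
    avg_r = sum(db_scores.values()) / n
    merged = []
    for db, ranking in results.items():
        w = 1 + n * (db_scores[db] - avg_r) / avg_r
        merged.extend((doc_id, score * w) for doc_id, score in ranking)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```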

12 Acquiring Resource Descriptions (1/2)
 Query-based sampling (Callan et al., 1999; Callan & Connell, 2001)
 - Does not require the cooperation of the databases
 - Queries the database with single-term queries and builds the description from the documents returned
 - The initial query term is selected from a large dictionary of terms
 - Subsequent query terms are selected from the documents sampled from the database

13 Acquiring Resource Descriptions (2/2)
 Query-based sampling algorithm
 1. Select an initial query term
 2. Run a one-term query on the database
 3. Retrieve the top N documents returned by the database
 4. Update the resource description based on the characteristics of the retrieved documents
 - Extract words and their frequencies from the top N documents
 - Add the words and their frequencies to the learned resource description
 5. If a stopping criterion has not yet been reached:
 - Select a new query term
 - Go to Step 2
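A sketch of this sampling loop. The search(term, top_n) interface, the seed dictionary, and the specific stopping criteria (a vocabulary-size cap plus a query budget) are assumptions for illustration; the original experiments used criteria such as a fixed number of sampled documents:

```python
import random
from collections import Counter

def query_based_sampling(search, seed_terms, max_terms=300, top_n=4):
    """Sketch of query-based sampling. search(term, top_n) is an assumed
    interface returning document texts from the uncooperative database."""
    description = Counter()           # learned term -> frequency
    sampled_vocab = list(seed_terms)  # candidate query terms
    queries_sent = 0
    while len(description) < max_terms and sampled_vocab:
        term = random.choice(sampled_vocab)          # 1. select a query term
        docs = search(term, top_n)                   # 2-3. run query, take top N docs
        for doc in docs:                             # 4. update the description
            words = doc.lower().split()
            description.update(words)
            sampled_vocab.extend(words)              # later queries come from samples
        queries_sent += 1
        if queries_sent > 100:                       # 5. extra stopping criterion
            break
    return description
```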

14 Accuracy of Unigram Language Models (1/3)
 Test corpora for query-based sampling experiments
 Ctf ratio
 - Measures how well the learned vocabulary covers the actual vocabulary:
 - ctf ratio = (sum of ctf_i over terms i in V') / (sum of ctf_i over terms i in V)
 - V' : the learned vocabulary
 - V : the actual vocabulary
 - ctf_i : the number of times term i occurs in the database
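The ctf ratio is straightforward to compute once the actual database statistics are available for evaluation; the dictionaries below are illustrative only:

```python
def ctf_ratio(learned_ctf, actual_ctf):
    """Proportion of term occurrences in the database covered by the learned
    vocabulary V': sum of actual ctf_i over learned terms, divided by the
    sum over the full vocabulary V."""
    covered = sum(actual_ctf[t] for t in learned_ctf if t in actual_ctf)
    total = sum(actual_ctf.values())
    return covered / total

actual  = {"retrieval": 40, "database": 35, "merge": 15, "rank": 10}
learned = {"retrieval": 6, "database": 4}   # frequencies from sampled documents
print(ctf_ratio(learned, actual))           # (40 + 35) / 100 = 0.75
```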

15 Accuracy of Unigram Language Models (2/3)
 Spearman rank correlation coefficient
 - Measures how well the learned term frequencies indicate the frequency of each term in the database
 - The coefficient is 1 when the two orderings are identical, 0 when they are uncorrelated, and -1 when they are in reverse order
 - The version used corrects for tied ranks:
 - d_i : the rank difference of common term i
 - n : the number of terms
 - f_k : the number of ties in the k-th group of ties in the learned resource description
 - g_m : the number of ties in the m-th group of ties in the actual resource description
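Rather than reproducing the tie-corrected formula, a practical way to obtain the same kind of measurement is scipy's spearmanr, which also handles tied ranks; restricting the comparison to the vocabulary common to both descriptions is an assumption made here:

```python
from scipy.stats import spearmanr

def frequency_rank_correlation(learned_ctf, actual_ctf):
    """Spearman rank correlation between learned and actual term frequencies,
    computed over the terms common to both descriptions (ties handled by scipy)."""
    common = sorted(set(learned_ctf) & set(actual_ctf))
    learned = [learned_ctf[t] for t in common]
    actual = [actual_ctf[t] for t in common]
    rho, _ = spearmanr(learned, actual)
    return rho

learned = {"retrieval": 6, "database": 4, "merge": 1}
actual  = {"retrieval": 40, "database": 35, "merge": 15, "rank": 10}
print(frequency_rank_correlation(learned, actual))   # 1.0: same term ordering
```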

16 Accuracy of Unigram Language Models (3/3)
 Experiment

17 Accuracy of Resource Rankings
 Experiment

18 Accuracy of Document Rankings
 Experiment

19 Summary and Conclusions
 Techniques for acquiring descriptions of resources controlled by uncooperative parties
 Using resource descriptions to rank text databases by their likelihood of satisfying a query
 Merging the document rankings returned by different text databases
 The major remaining weakness
 - The algorithm for merging document rankings produced by different databases
 - More accurate merging, e.g. by parsing and re-ranking the returned documents at the search client, carries a high computational cost
 Many of the traditional IR tools, such as relevance feedback, have yet to be applied to multi-database environments

