Distributed Information Retrieval Jamie Callan Carnegie Mellon University


1 Distributed Information Retrieval
Jamie Callan
Carnegie Mellon University
callan@cs.cmu.edu

2 Multi-Database Solutions: Distributed Information Retrieval
© 2002 Jamie Callan
[Figure: an information need is routed ("?") to Engine 1, Engine 2, Engine 3, Engine 4, ..., Engine n]
Common scenarios:
  –Multiple partitions, single service
  –Independent engines, single organization
  –Independent engines, affiliated organizations
  –Independent engines, unaffiliated organizations
Defining dimensions:
  –Cooperative vs. uncooperative engines
  –Centralized vs. decentralized solutions

3 Multi-Database Solutions
Browsing model
  –Manual selection, no support for results merging, etc.
Web-search (single database) model
Distributed information retrieval
  –Automatic or interactive DB selection
  –Support for results merging
Peer-to-peer systems
  –DB self-selection, mostly based on filename matching
  –No support for results merging

4 Distributed IR: The Issues Usually Addressed
Site description: Contents, search engine, services, etc.
Resource ranking: Ranking resources by how likely they are to contain the desired content
Resource selection: Selecting the best subset from a ranked list
Searching: Interoperability
Result merging: Merging a set of document rankings that come from
  –Different underlying corpus statistics
  –Different search engines
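The resource selection step can be illustrated with a minimal sketch. It assumes the resource ranking step has already produced (name, score) pairs; the function name `select_resources`, the cutoff `k`, and the threshold `min_score` are hypothetical and not from the slides.

```python
from typing import List, Tuple


def select_resources(
    ranked: List[Tuple[str, float]], k: int = 3, min_score: float = 0.0
) -> List[str]:
    """Keep at most k resources, best first, dropping any below min_score.

    `ranked` holds (resource name, score) pairs from a resource ranking step.
    Both k and min_score are illustrative knobs, not values from the slides.
    """
    # Sort defensively in case the caller passes an unsorted list.
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ordered[:k] if score >= min_score]


if __name__ == "__main__":
    scores = [("news", 0.2), ("patents", 0.9), ("web", 0.5), ("legal", 0.1)]
    print(select_resources(scores, k=2))
```

A fixed top-k cutoff is only one possible policy; a score threshold or a cost budget per query would fit the same interface.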

5 Resource Selection
Resource Descriptions
  –Characterization of a given database
  –Typical solution: word histograms
  –Example: Query-based sampling to learn a unigram language model
Resource Ranking and Selection
  –Based on comparing the resource descriptions on a per-query basis
  –Current techniques are ad hoc
    »E.g., treat collections like big documents
Language models are one way of describing and selecting resources
  –By comparing query language models one might be able to produce good resource descriptions
  –Has been done (Si et al.) for the case where the search engine is the same across databases; performance was better than CORI
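A rough sketch of these two ideas, under stated assumptions: query-based sampling is approximated by probing a database with one-term queries and counting the words in the returned documents, and resources are ranked by a smoothed unigram query likelihood (each collection treated like one big document). This is neither CORI nor the Si et al. method; the names `query_based_sample`, `query_log_likelihood`, and `rank_resources`, the mixing weight `lam`, and the `search` callable are all hypothetical.

```python
import math
import random
from collections import Counter
from typing import Callable, Dict, List, Optional


def query_based_sample(
    search: Callable[[str], List[str]],
    seed_term: str,
    n_queries: int = 20,
    rng: Optional[random.Random] = None,
) -> Counter:
    """Estimate unigram counts for a database by probing it with one-term queries.

    `search` is an assumed interface: it takes a term and returns document texts.
    """
    rng = rng or random.Random(0)
    counts: Counter = Counter()
    seen = set()
    term = seed_term
    for _ in range(n_queries):
        for doc in search(term):
            if doc not in seen:  # count each sampled document once
                seen.add(doc)
                counts.update(doc.lower().split())
        if not counts:  # the seed retrieved nothing, so there is nothing to learn from
            break
        term = rng.choice(sorted(counts))  # next probe comes from the learned vocabulary
    return counts


def query_log_likelihood(
    query: str, counts: Counter, background: Counter, lam: float = 0.5
) -> float:
    """Log-likelihood of the query under a resource model mixed with a background model."""
    total = sum(counts.values())
    bg_total = sum(background.values())
    score = 0.0
    for term in query.lower().split():
        p_db = counts[term] / total if total else 0.0
        # Add-one smoothing keeps the background probability above zero.
        p_bg = (background[term] + 1) / (bg_total + len(background) + 1)
        score += math.log(lam * p_db + (1 - lam) * p_bg)
    return score


def rank_resources(query: str, descriptions: Dict[str, Counter]) -> List[str]:
    """Order resource names by how likely each description is to generate the query."""
    background: Counter = Counter()
    for counts in descriptions.values():
        background.update(counts)
    return sorted(
        descriptions,
        key=lambda name: query_log_likelihood(query, descriptions[name], background),
        reverse=True,
    )
```

Because the sampler only needs a search interface, the same sketch applies to uncooperative databases that expose nothing but query access.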

6 Merging Results
General problem: Multiple ranked lists of documents
  –Meta-search: Single DB or several DBs with overlapping content
  –Distributed IR: Multiple DBs with (more or less) disjoint contents
Solutions:
  –Rerank at client
  –Ad hoc
  –Semi-supervised learning of normalizing functions
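One way to picture a learned normalizing function is a per-database linear map from local scores to a shared reference scale, fitted by least squares on a few documents whose score is known on both scales. The sketch below assumes such training pairs are available (how they are obtained is not specified here) and is not presented as the method the slides refer to; `fit_linear` and `merge_results` are hypothetical names.

```python
from typing import Dict, List, Tuple

Pair = Tuple[float, float]  # (database-specific score, reference score)


def fit_linear(pairs: List[Pair]) -> Tuple[float, float]:
    """Least-squares fit of reference ~ a * local + b for one database."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:  # all local scores equal: fall back to a constant
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / var_x
    return a, mean_y - a * mean_x


def merge_results(
    results: Dict[str, List[Tuple[str, float]]],
    training: Dict[str, List[Pair]],
) -> List[Tuple[str, float]]:
    """Map each database's scores onto the reference scale, then merge into one ranking."""
    merged: List[Tuple[str, float]] = []
    for db, ranking in results.items():
        a, b = fit_linear(training[db])
        merged.extend((doc, a * score + b) for doc, score in ranking)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    results = {
        "A": [("a1", 10.0), ("a2", 5.0)],  # engine A scores on a 0-10 scale
        "B": [("b1", 0.9), ("b2", 0.2)],   # engine B scores on a 0-1 scale
    }
    training = {
        "A": [(10.0, 1.0), (0.0, 0.0)],
        "B": [(1.0, 1.0), (0.0, 0.0)],
    }
    for doc, score in merge_results(results, training):
        print(doc, round(score, 3))
```

Without some normalization of this kind, raw scores from engines with different corpus statistics or scoring functions are not comparable, which is why simple interleaving by raw score tends to favor whichever engine uses the largest scale.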

7 Distributed IR: State of the Art
Acquiring database descriptions
  –Good techniques for cooperative and uncooperative environments
Automatic resource selection
  –Good techniques for large numbers of databases
  –Current techniques very close to “single database” accuracy ... in research environments
  –Theory says they can be better than “single DB” accuracy ... but we don’t know how to do it yet
Merging results from multiple databases
  –Good techniques for cooperative and uncooperative environments

8 Distributed IR: What Lies Ahead
Language modeling
  –So far it’s as good as (but not better than) ad hoc techniques
Query expansion
  –So far it doesn’t help for resource selection (!)
Multilingual / cross-lingual environments
Database summarization
  –“ACLU Search” is a mystery if you don’t know about the ACLU
Automatic categorization of databases into classification hierarchies
  –An automatic “Invisible Web” site
Decentralization
  –Most of the current solutions are centralized
  –Peer-to-peer is a possible solution

