"name": "Dublin Core Metadata in HTML and XML Distributed Information Retrieval
Resource Selection Component
Resource selection has two jobs:
- Identify a small set of databases in the distributed information retrieval system that contain documents relevant to the query.
- After the databases are selected, produce a ranked list.
The process is based on one of these algorithms:
- CORI
- KL Divergence
- Relevant Document Distribution Estimation (ReDDE)
Which is the best?
- ReDDE is reported to be the best of these algorithms for resource selection.
- It estimates the distribution of relevant documents across the databases for each user query and ranks the databases according to that distribution.
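The ReDDE idea on this slide can be sketched roughly as follows. This is a simplified illustration, not the published algorithm in full: it assumes the query has already been run against a centralised sample database, that each sampled document stands for `db_size / sample_size` documents of its source database, and that the top `ratio` fraction of the estimated complete collection is treated as relevant. The function name, the parameter names and the value of `ratio` are all assumptions made for this example.

```python
from collections import defaultdict


def redde_rank(central_ranking, db_sizes, sample_sizes, ratio=0.01):
    """Rank databases by their estimated number of relevant documents.

    central_ranking: database names, one per sampled document, in the order
        the centralised sample database ranked those documents for the query.
    db_sizes: estimated number of documents held by each database.
    sample_sizes: number of sampled documents held for each database.
    ratio: assumed fraction of the whole collection treated as relevant.
    """
    threshold = ratio * sum(db_sizes.values())
    estimate = defaultdict(float)
    docs_above = 0.0  # estimated rank in a complete centralised database

    for db in central_ranking:
        if docs_above >= threshold:
            break
        # Each sampled document represents this many documents of its database.
        scale = db_sizes[db] / sample_sizes[db]
        estimate[db] += scale
        docs_above += scale

    # Databases with more estimated relevant documents come first.
    return sorted(db_sizes, key=lambda db: estimate[db], reverse=True)


db_sizes = {"A": 1000, "B": 1000, "C": 1000}
sample_sizes = {"A": 100, "B": 100, "C": 100}
ranking = redde_rank(["B", "B", "A", "C", "C"], db_sizes, sample_sizes)
print(ranking)
```

In this toy run, database B contributes the two highest-ranked sampled documents, so it receives the largest estimate and is ranked first.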
Result Merging
- The results from the selected resources are compiled into a single result list.
- Any duplicated resources are removed.
Problems
- Different databases use different selection algorithms, so their results are difficult to merge.
- Solution: use standard selection algorithms.
More problems
- Current merging methods take place at the client end, isolated from the DIR system.
- Current methods are not very good:
  - Round robin: takes results from each database in turn, starting with the first database it hits, without taking relevance into account.
  - Raw merge: orders results by their document scores.
- Solution: place the merging component near the selection component.
Semi Supervised Learning (SSL) model: a result merging method
- Aim: produce a ranked list similar to that of a centralised information retrieval system.
- Achieved by running a centralised sample database in parallel with the distributed databases.
- The centralised sample database is built by using query-based sampling to build resource descriptions.
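A minimal sketch of the two baseline merging methods named on this slide, to show why neither is very good. It assumes each database returns a list of `(doc_id, score)` pairs already sorted by that database's own score; the function names and the sample data are made up for illustration.

```python
from itertools import zip_longest


def round_robin_merge(ranked_lists):
    """Take one result from each database in turn, ignoring the scores."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        merged.extend(doc for doc in tier if doc is not None)
    return merged


def raw_score_merge(ranked_lists):
    """Sort all results by database-specific score as if scores were comparable."""
    all_docs = [doc for results in ranked_lists for doc in results]
    return sorted(all_docs, key=lambda doc: doc[1], reverse=True)


db1 = [("a1", 0.9), ("a2", 0.8)]                 # scores on a 0..1 scale
db2 = [("b1", 12.0), ("b2", 7.5), ("b3", 3.1)]   # scores on a different scale

print([doc_id for doc_id, _ in round_robin_merge([db1, db2])])
print([doc_id for doc_id, _ in raw_score_merge([db1, db2])])
```

Round robin interleaves the lists regardless of how relevant each database is, while the raw merge lets the database with the larger score scale dominate the top of the list.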
[Diagram: Semi Supervised Learning Model. Query entry leads to resource selection. The query goes to the distributed databases, which return individual ranked lists with database-specific scores. The query is also sent to the centralised sample database (resource descriptions of documents held on all databases, obtained by querying), which returns a ranked list of documents from the central database with database-independent scores. Merging results combines the document rankings into a merged list, ranked by relevance, of document links.]
Semi Supervised Learning Model
How distributed information retrieval works in more detail:
- A user enters a query.
- The query is used to rank the collection of databases, from which a set of databases is selected.
- The query is then broadcast to all the selected databases, each of which returns a ranked list of its matches with document IDs and scores.
- The document IDs and scores are passed to the merging algorithm.
- The query is also broadcast to the centralised sample database running in parallel, and its ranked list of document IDs and scores is also passed to the merging algorithm.
- The ranked list provided by the central database influences how the results from the distributed databases are merged.
SSL
- The SSL algorithm models result merging as the task of transforming sets of database-specific document scores into a single set of database-independent document scores, using the documents acquired by query-based sampling as training data.
- It uses a regression algorithm to do this.
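The score transformation described above can be sketched as follows. The slide only says that a regression algorithm is used; this example assumes a simple least-squares line fitted per database, trained on the documents that appear both in that database's result list and in the centralised sample database's result list. `fit_line`, `ssl_merge` and the data layout are hypothetical names chosen for the sketch.

```python
def fit_line(pairs):
    """Least-squares fit of y = a * x + b for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def ssl_merge(db_results, central_scores):
    """Map database-specific scores to database-independent ones, then merge.

    db_results: {db_name: {doc_id: database-specific score}}
    central_scores: {doc_id: score from the centralised sample database}
    """
    merged = {}
    for results in db_results.values():
        # Training data: documents scored by both this database and the central one.
        overlap = [(score, central_scores[doc_id])
                   for doc_id, score in results.items()
                   if doc_id in central_scores]
        if len({x for x, _ in overlap}) < 2:
            continue  # too little training data; a real system needs a fallback
        a, b = fit_line(overlap)
        for doc_id, score in results.items():
            merged[doc_id] = a * score + b
    return sorted(merged, key=merged.get, reverse=True)


db_results = {
    "A": {"a1": 10.0, "a2": 8.0, "a3": 6.0},
    "B": {"b1": 0.9, "b2": 0.45, "b3": 0.1},
}
central_scores = {"a1": 0.5, "a3": 0.3, "b1": 0.8, "b3": 0.0}
print(ssl_merge(db_results, central_scores))
```

Documents a2 and b2 never appear in the centralised sample database, yet they still receive comparable scores because each database's fitted line is applied to all of its results.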
ISI Web of Knowledge
ISI products are registered trademarks and service marks used under license.
- An incredible wealth of content: ISI-Derwent + Partners = depth and diversity.
- Engineered to work as a single resource.
- Uniquely integrated like no other platform.
What makes the Web of Knowledge so unique?
CrossSearch:
- 9,000+ international journals
- 100,000+ meetings, symposia, and reports
- 11.3 million patented inventions
Our research interests involve the development of plant species that will actually assist in the clean-up of polluted soils.
We can choose to explore our results using the CrossSearch results summary list as a base.
We can also filter results by specific database. This is especially helpful in identifying particular information, such as patent data, within the results list.
Other Examples
Emerge
- Emerge is software built for information retrieval of scientific data.
- It makes use of Dublin Core and the Z39.50 search protocol.
- It has an XML-based translation engine that can perform metadata mapping and query translation.
Harvest
- Collects information from:
  - the internet and intranets, using HTTP and FTP
  - local files, such as data on hard disks, CD-ROMs and file servers
- Makes the information searchable through a web interface.
- Supports a wide range of formats.
- Uses the Summary Object Interchange Format (SOIF) for metadata mapping.
[Diagram: Harvest architecture showing Provider 1, Provider 2, Provider 3, Gatherer, Broker and Client.]
- Gatherer: collects the information available at a provider.
- Broker: collects, stores and manages the information for clients to query.
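The Gatherer and Broker roles can be illustrated with a toy sketch. This is not Harvest's actual code or API: the classes, the dictionary record (standing in for a SOIF summary) and the in-memory documents (standing in for HTTP, FTP or local files) are all assumptions made for the example.

```python
class Gatherer:
    """Collects the information available at one provider."""

    def __init__(self, provider_name, documents):
        self.provider_name = provider_name
        self.documents = documents  # {doc_id: text}

    def gather(self):
        # One summary record per document (a stand-in for a SOIF record).
        return [{"provider": self.provider_name, "id": doc_id, "text": text}
                for doc_id, text in self.documents.items()]


class Broker:
    """Collects, stores and manages gathered records for clients to query."""

    def __init__(self):
        self.index = {}  # term -> set of (provider, doc_id)

    def collect(self, gatherer):
        for record in gatherer.gather():
            for term in record["text"].lower().split():
                entry = (record["provider"], record["id"])
                self.index.setdefault(term, set()).add(entry)

    def query(self, term):
        return sorted(self.index.get(term.lower(), set()))


broker = Broker()
broker.collect(Gatherer("Provider 1", {"d1": "Dublin Core metadata"}))
broker.collect(Gatherer("Provider 2", {"d2": "metadata mapping"}))
print(broker.query("metadata"))
```

The client only talks to the broker, which answers from its own index rather than contacting the providers at query time.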
Other Examples
SETI@Home
- SETI@Home is a screensaver program used to aid the search for extraterrestrial life.
- It uses the CPU power of client computers to process data packets.
- User information is kept to track the data being processed.
Conclusion
- Distributed computing concepts help information retrieval systems.
- Distributed IR depends on centralised IR and tries to emulate it.
Current State of Distributed Search
GRUB
- A screensaver program that uses your bandwidth and CPU power.
- Produces the most up-to-date indexes.
- Has not gained a wide level of support.
P2P search
- Well known from Napster and Kazaa.
- More dynamic than Google: users can upload whatever they want and make it searchable, whereas Google operates in a controlled environment.
- Not considered in the commercial field, which does not see the benefits.