1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.

1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1

2 Course Administration CS 490 and CS 790 Independent Research Projects Next semester, the National Science Digital Library (NSDL) will have several projects that are suitable for independent research. Topics will include information discovery with mixed content and mixed metadata. If you are interested, send email to wya@cs.cornell.edu.

3 Basic Architecture 1: Single Homogeneous Collection Documents and indexes are held on a single computer system (may be several computers). The user interface and search methods are selected for the specific service. Examples: Medline (medical information) Cornell University library catalog Index Documents

4 Basic Architecture 2: Several Similar Collections -- One Computer System Several more or less similar collections are held on a single computer system. Each collection is indexed separately using the same software, procedures, algorithms, etc. (but tuned for each collection, e.g., stoplists). The user interface is the same (or very similar) for each service. Examples: OCLC's FirstSearch

5 Distributed Architecture 1: Standard Search Protocols Find x Strict adherence to standards allows any user interface to search any conforming search service.

6 Distributed Architecture 1: Standard Search Protocols Example: Z 39.50 Family of Standards for Searching Library Catalogs Content: Anglo American Cataloging Rules Structure of Content: MARC Encoding Rules: Base Encoding Rules (character sets, separators, etc.) Message Passing Protocol: Z 39.50 Query Format: Bib 1 (Boolean), Type 102 (full text) In addition, there are the underlying network standards, e.g. the Internet suite of protocols.

7 Z39.50 principles Servers store a set of databases with searchable indexes Interactions are based on a session The client opens a connection with the server(s), carries out a sequence of interactions and then closes the connection. During the course of the session, both the server and the client remember the state of their interaction.

8 State Z39.50 The server carries out the search and builds a results set Server saves the results set. Subsequent message from the client can reference the result set. Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching entire database.

9 Standard Search Protocols Example: Z 39.50 Family of Standards for Searching Library Catalogs The Z 39.50 family of standards has proved successful in a tightly knit community, where: There is a strong tradition of standardization, with many professionally trained people. The categories of material change gradually, allowing a slow-moving standardization process. The standardization approach has failed where these two criteria are not met. Historic note: WAIS was based on an early version of Z39.50.

10 Distributed Architecture 2: Broadcast Search (a.k.a. Federated Search) Find x Interface Service An interface server broadcasts a query to each collection, combines the results and returns them to the user. Examples: Dienst (digital library protocol), Web metasearch services

11 Distributed Architecture 2: Broadcast Search Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet). Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z 39.50, http, etc.).

12 Distributed Architecture 2: Broadcast Search Problems with Broadcast Search Performance: If any collection does not respond, the Interface Server waits for a time out. Recall: If any collection does not respond, documents in that collection are not found. Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections. Broadcast searching is as bad as its weakest link! Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.

13 Standardization: Function Versus Cost of Acceptance Function Cost of acceptance Many adopters Few adopters

14 Example: Textual Mark-up Function Cost of acceptance SGML ASCII HTML XML

15 Distributed Architecture 3: Centralized Search Services Find x Batch indexing: Metadata about all items is accumulated in a central system. Real-time searching: The user (a) searches the central system, and (b) retrieves items from collections. Examples: Union catalogs, Web search services Search Service retrieve search

16 Distributed Architecture 3: Centralized Search Services Gathering by Web Crawling Entirely automatic, low cost. Highly efficient at gathering very large amounts of material. but... Can only gather openly accessible materials. Cannot gather material in databases unless explicit URLs are known. Cannot easily make use of metadata provided by collections. Examples: Web search services.

17 Distributed Architecture 3: Centralized Search Services Harvesting Each collection makes a copy of its metadata available from a sever associated with the collection. A search service harvests metadata from all collections on a regular cycle and builds a central search system. Advantages... Can index material from databases without explicit URLs. Allows authentication and selection of material. but... Requires that collections have metadata and support harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).

18 Open Archives Initiative Protocol for Metadata Harvesting Low-barrier protocol for exposing structured information (metadata) from cooperating repositories Provides opportunity for building comprehensive service network http://www.openarchives.org/

19 Discovery Current Awareness Preservation Service Providers Data Providers Metadata harvesting OAI-PMH: A simple two party model for sharing structured information

20 OAI-PMH Key technical features Simple HTTP encoding Built on of established XML standards Multiple metadata formats, but Dublin Core required Repository partitioning (sets) Selective harvesting (sets and dates) Clean partition between core and implementation-specific extensions –Multiple item-level metadata –Collection level metadata

21 OAI Verbs Identify – repository characteristics ListMetadataFormats – DC required ListSets – repository partitioning ListRecords – (selectively) harvest metadata ListIdentifiers – (selectively) harvest metadata identifiers GetRecord – known item retrieval

1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.

Similar presentations

Presentation on theme: "1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.

Similar presentations

Presentation on theme: "1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1."— Presentation transcript:

Similar presentations

About project

Feedback