1 CS 430 / INFO 430 Information Retrieval Lecture 13 Architecture of Information Retrieval Systems.

2 Course Administration: Next Week. Midterm examination on Wednesday evening. No office hours or lecture on Thursday.

3 Course Administration: Midterm Examination. See the Examinations page on the Web site. On Wednesday October 11, 7:30-9:00 p.m., Phillips 203, instead of the discussion class. The topics to be examined are all lectures and discussion class readings before the midterm break. See the Web site for a sample paper. Laptops may be used to store course materials and your notes, but for no other purposes. Hand calculators are allowed. No communication. No other electronic devices.

4 Follow up on Discussion 6: Crawl and Stop. The crawler runs until K pages have been downloaded. An ideal crawl would download the K most highly ranked pages, R. An actual crawl downloads M pages from R, where M < K. The ratio M/K is a measure of the performance of the crawler. With random crawling, the probability that a given downloaded page is in R is K/T, where T is the total number of pages. Since K pages are downloaded, the expected number of pages that are from R is K x K/T, giving an expected performance of K/T.
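
The arithmetic on this slide can be checked with a short sketch. The function names and the simulation set-up (pages numbered 0 to T-1, with pages 0 to K-1 standing in for R) are assumptions made for illustration, not part of the lecture.

```python
import random


def expected_random_performance(k, t):
    """Expected M/K for a random crawl: (K x K/T) / K, which simplifies to K/T."""
    expected_m = k * k / t
    return expected_m / k


def simulate_random_crawl(k, t, seed=0):
    """Simulate one random crawl of K pages out of T and return M/K.

    Assumption for illustration: pages 0..K-1 are the top-ranked set R.
    """
    rng = random.Random(seed)
    downloaded = rng.sample(range(t), k)            # K distinct pages, chosen uniformly
    m = sum(1 for page in downloaded if page < k)   # downloaded pages that fall in R
    return m / k


if __name__ == "__main__":
    print("expected:", expected_random_performance(100, 10_000))
    print("one simulated crawl:", simulate_random_crawl(100, 10_000))
```

Averaging `simulate_random_crawl` over many seeds should approach the expected value K/T.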

5 Follow up on Discussion 6: Optimal refresh frequency, assuming page changes follow a Poisson process and change frequencies are static (i.e., do not change over time). The graph of optimal refresh frequency does not increase monotonically with change frequency. The optimal policy is to refresh a page less often if it changes too often.
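
A minimal sketch of why the curve is not monotonic, under the slide's Poisson assumption. If a page changes at rate λ and is refreshed at evenly spaced intervals, f times per unit time, the chance that the copy is still current t time units after a refresh is e^(-λt); averaging over one interval of length 1/f gives (1 - e^(-λ/f)) / (λ/f). The function names are illustrative, and the sketch only compares the benefit of one extra refresh across pages. It does not solve the full optimal allocation problem.

```python
import math


def expected_freshness(change_rate, refresh_rate):
    """Time-averaged probability that the local copy is up to date.

    change_rate: Poisson change rate (lambda), changes per unit time.
    refresh_rate: evenly spaced refreshes per unit time (f).
    """
    if change_rate == 0:
        return 1.0                      # a page that never changes is always fresh
    r = change_rate / refresh_rate
    return -math.expm1(-r) / r          # (1 - e^-r) / r, computed stably


def marginal_gain(change_rate, refresh_rate, extra=1.0):
    """Freshness gained by spending `extra` more refreshes per unit time on a page."""
    return (expected_freshness(change_rate, refresh_rate + extra)
            - expected_freshness(change_rate, refresh_rate))


if __name__ == "__main__":
    for lam in (0.01, 1.0, 100.0):
        print(lam, marginal_gain(lam, 1.0))
```

In this sketch an extra refresh buys little for a page that almost never changes and little for a page that changes far faster than it can be refreshed. The largest gain is in between, which is consistent with the slide's conclusion.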

6 Notation. [Diagram legend. Symbols: Docs (documents), Catalog (file of catalog records), Index (searchable file index), user interface, human action, automatic process, physical objects, UI (user interface service).]

7 Single Homogeneous Collection: Full Text Indexing. Documents and indexes are held on a single computer system (may be several computers). Information retrieval uses a full text index, which may be tuned to the specific corpus. Examples: SMART, Lucene. [Diagram labels: Docs, Build index, Index, Search]
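
As a rough illustration of the "build index" and "search" steps, here is a minimal in-memory inverted index. It is a teaching sketch, not how SMART or Lucene are implemented; the stoplist, tokenizer, and function names are assumptions chosen for the example.

```python
import re
from collections import defaultdict

# Illustrative stoplist; a real system would tune this to the corpus.
STOPLIST = {"the", "a", "of", "and", "is"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stoplist terms."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPLIST]


def build_index(docs):
    """Build step: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search(index, query):
    """Search step: return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    docs = {
        "d1": "Architecture of information retrieval systems",
        "d2": "Web crawling and information gathering",
    }
    print(search(build_index(docs), "information retrieval"))
```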

8 Lucene: Apache Lucene. High-performance, full-featured text search engine library. Written entirely in Java. Suitable for nearly any application that requires full-text search, especially cross-platform. An open source project available for free download.

9 Single Homogeneous Collection: Use of Catalog Records. Documents may be digital or physical objects, e.g., books. Documents are described by catalog records generated manually (or sometimes automatically). Information retrieval uses an index of catalog records. Example: Library catalog. [Diagram labels: Docs, Create catalog, Catalog, Build index, Index, Search]

10 Several Similar Collections: One Computer System. Several more or less similar collections are held on a single computer system. Each collection is indexed separately using the same software, procedures, algorithms, etc. (but tuned for each collection, e.g., different stoplists). Example: PubMed. [Diagram labels: three Docs/Index pairs, Build indexes, Search]

11 Distributed Architecture: Standard Search Protocols. Strict adherence to standards allows any user interface to search any conforming search service. A user interface is configured so that the user can search several different indexes, one at a time. [Diagram labels: User interface 1, User interface 2, Index 1, Index 2]

12 Standards for Searching. Standard Query Languages. Example: Common Query Language. Protocols for Distributed Searching. Example: SRW: Search/Retrieve Web Service, http://www.loc.gov/standards/sru/srw/ Example: Z39.50. Clifford A. Lynch, "The Z39.50 Information Retrieval Standard, Part I: A Strategic View of Its Past, Present and Future", D-Lib Magazine, April 1997. http://www.dlib.org/dlib/april97/04lynch.html

13 Standard Search Protocols. Example: Z39.50 Family of Standards for Searching Library Catalogs. The Z39.50 family of standards has proved successful in a tightly knit community, where: (1) there is a strong tradition of standardization, with many professionally trained people; (2) the categories of material change gradually, allowing a slow-moving standardization process. The standardization approach has failed where these two criteria are not met. Historical note: WAIS was based on an early version of Z39.50.

14 Z39.50: Principles. Servers store a set of databases with searchable indexes. Interactions are based on a session. The client opens a connection with the server(s), carries out a sequence of interactions and then closes the connection. During the course of the session, both the server and the client remember the state of their interaction.

15 Z39.50: State. The server carries out the search and builds a result set. The server saves the result set. Subsequent messages from the client can reference the result set. Thus the client can narrow a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.
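
The idea of server-side session state can be sketched with a toy class. This is not the Z39.50 wire protocol or any real Z39.50 library; `ToySession` and its methods are hypothetical names, and substring matching stands in for a real indexed search.

```python
class ToySession:
    """Hypothetical server-side state for one client session (illustration only)."""

    def __init__(self, database):
        self.database = database      # list of record strings
        self.result_sets = {}         # result set name -> list of record positions
        self.full_scans = 0           # how many times the whole database was searched

    def search(self, name, term):
        """Search the whole database and save the hits as a named result set."""
        self.full_scans += 1
        hits = [i for i, rec in enumerate(self.database) if term.lower() in rec.lower()]
        self.result_sets[name] = hits
        return len(hits)

    def refine(self, source, name, term):
        """Narrow a saved result set without touching the rest of the database."""
        hits = [i for i in self.result_sets[source]
                if term.lower() in self.database[i].lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name, start, count):
        """Return `count` records from a saved result set, beginning at `start`."""
        positions = self.result_sets[name][start:start + count]
        return [self.database[i] for i in positions]


if __name__ == "__main__":
    session = ToySession(["Information retrieval systems",
                          "Retrieval of library catalogs",
                          "Web crawling"])
    session.search("A", "retrieval")
    session.refine("A", "B", "library")
    print(session.present("B", 0, 1), "full scans:", session.full_scans)
```

Because the result set lives on the server for the duration of the session, the refine and present steps operate only on saved record positions.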

16 Z39.50 Family of Standards for Searching Library Catalogs. Content: Anglo American Cataloging Rules. Structure of Content: MARC. Encoding Rules: Basic Encoding Rules (character sets, separators, etc.). Message Passing Protocol: Z39.50. Query Format: Bib 1 (Boolean), Type 102 (full text). In addition, there are the underlying network standards, e.g., the Internet suite of protocols.

17 Distributed Architecture: Meta-search (Broadcast Search). A user interface service broadcasts a query to several indexes and merges the results. Can be used with full text or catalogs. Example: Dienst. [Diagram labels: Search, User interface service (UI), Searches, Index 1, Index 2, Index n]

18 Distributed Architecture: Broadcast Search. Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet). Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z39.50, HTTP).

19 Distributed Architecture: Broadcast Search. Problems with Broadcast Search. Performance: If any collection does not respond, the interface service waits for a timeout. Recall: If any collection does not respond, documents in that collection are not found. Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections. Broadcast searching is only as good as its weakest link! Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.
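
These three problems show up even in a small sketch of an interface service. The code below is an assumption-laden illustration: the collections are stub functions (`library_a`, `library_b`, `slow_library` are hypothetical), each returns `(doc_id, score)` pairs, and duplicates are merged by keeping the highest score, which is only one of several possible policies.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def broadcast_search(collections, query, timeout_s=0.2):
    """Send the query to every collection in parallel, then merge whatever arrived.

    Returns (ranked, missing): merged (doc_id, score) pairs, best first, and
    the names of collections that failed or did not answer before the timeout.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(collections)))
    futures = {pool.submit(fn, query): name for name, fn in collections.items()}
    done, not_done = wait(futures, timeout=timeout_s)   # performance: bounded by the timeout
    pool.shutdown(wait=False)                           # do not block on stragglers

    merged = {}
    missing = [futures[f] for f in not_done]            # recall: these documents are lost
    for fut in done:
        if fut.exception() is not None:
            missing.append(futures[fut])
            continue
        for doc_id, score in fut.result():
            # Duplicates: keep the best score. Scores from different collections
            # are not really comparable, which is the ranking problem noted above.
            merged[doc_id] = max(score, merged.get(doc_id, float("-inf")))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked, sorted(missing)


# Stub collections, for illustration only.
def library_a(query):
    return [("d1", 0.9), ("d2", 0.4)]


def library_b(query):
    return [("d2", 0.7), ("d3", 0.1)]


def slow_library(query):
    time.sleep(1.0)                                     # simulates an unresponsive server
    return [("d4", 1.0)]


if __name__ == "__main__":
    stubs = {"a": library_a, "b": library_b, "slow": slow_library}
    print(broadcast_search(stubs, "information retrieval"))
```

With the stub collections, the slow server's document never reaches the merged list, and the caller still pays the full timeout before seeing any results.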

20 Union Catalog. Catalog records from several libraries are merged into a single union catalog. Information retrieval uses an index of the records in the union catalog. Example: National Science Digital Library. [Diagram labels: Docs (several collections), Create catalog records, Union Catalog, Build index, Index to Union Catalog, Search]

21 Use of Union Catalogs. Batch indexing: Metadata about all items is accumulated in a central system. Real-time searching: The user (a) searches the central index, (b) retrieves catalog records, (c) retrieves documents from collections. [Diagram labels: Search, Index to Union Catalog, Union Catalog, Retrieve, Docs]

22 Building Union Catalogs: Harvesting. Each collection makes a copy of its metadata (catalog records) available from a server associated with the collection. A search service harvests metadata from all collections on a regular cycle and builds a central search system. Advantages: Can index material from databases without explicit URLs. Allows authentication and selection of material. But: Requires that collections have metadata and support a harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).
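
As a small illustration of what a harvesting request looks like, the sketch below builds OAI-PMH `ListRecords` request URLs. The verb and the `metadataPrefix`, `from`, and `resumptionToken` arguments are defined by the OAI-PMH specification; the base URL `https://example.org/oai` and the function name are placeholders, and no network request is made.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     resumption_token=None):
    """Build the URL for one OAI-PMH ListRecords request.

    A harvester would fetch this URL, parse the XML response, and repeat with
    the returned resumptionToken until the repository reports no more records.
    """
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # with the verb only, and it continues an earlier partial response.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date   # harvest only records changed since this date
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder repository address, for illustration only.
    print(list_records_url("https://example.org/oai", from_date="2006-10-01"))
```

Passing a `from` date on each cycle is one way a search service could limit a regular harvest to records that changed since the previous one.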

23 Open Archives Initiative Protocol for Metadata Harvesting. See: http://www.openarchives.org/ Herbert Van de Sompel and Carl Lagoze, "The Santa Fe Convention of the Open Archives Initiative." D-Lib Magazine, 6(2), 2000. http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html

24 Web Searching: Architecture. Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.) The central index is implemented as a single system on a very large number of computers. Examples: Google, Yahoo! [Diagram labels: Web servers with Web pages, Crawl, Web pages retrieved by crawler, Build index, Index to all Web pages, Search]

25 Use of Web Search Service. Batch indexing: Each Web page is brought to the central location and indexed. Real-time searching: The user (a) searches the central index, (b) retrieves documents (Web pages) from original location. [Diagram labels: Search, Index to all Web pages, Retrieve, Docs on Web servers]

26 Web Searching: Building the Index. Documents are Web pages. Each document is: (1) identified by Web crawling; (2) copied to a central location; (3) indexed and added to the central index. After indexing the documents may be discarded, but a copy may be retained, for use by the user interface. Web searching is the topic of Lectures 15-18 and Discussion Classes 6, 7, and 8.
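
The three steps can be sketched against a toy, in-memory "Web". Everything here is an assumption for illustration: `web` maps a URL to a `(text, links)` pair instead of fetching over HTTP, the crawl is breadth-first, and the function name is made up.

```python
from collections import deque


def crawl_and_index(web, seeds, max_pages):
    """Crawl a toy in-memory Web breadth-first and build a central index.

    web: dict mapping URL -> (page text, list of outgoing link URLs).
    Returns (index, cache): term -> set of URLs, and URL -> retained page copy.
    """
    index = {}
    cache = {}
    queue = deque(seeds)
    seen = set(seeds)
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        page = web.get(url)             # step 1: the page is identified by crawling
        if page is None:
            continue                    # dead link: nothing to copy or index
        text, links = page
        cache[url] = text               # step 2: copy retained at the central location
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)   # step 3: add to the central index
        for link in links:
            if link not in seen:        # queue each newly discovered URL once
                seen.add(link)
                queue.append(link)
    return index, cache


if __name__ == "__main__":
    toy_web = {
        "a": ("information retrieval", ["b", "c"]),
        "b": ("web crawling", ["a"]),
        "c": ("retrieval systems", []),
    }
    index, cache = crawl_and_index(toy_web, ["a"], max_pages=10)
    print(sorted(index["retrieval"]), sorted(cache))
```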

27 Web Crawling. Advantages of Web crawling: Entirely automatic, low cost. Highly efficient at gathering very large amounts of material. But: Can only gather openly accessible materials. Cannot gather material in databases unless explicit URLs are known. Cannot easily make use of metadata provided by collections.

28 Standardization: Function Versus Cost of Acceptance. [Figure: axes labeled Function and Cost of acceptance; regions labeled Many adopters and Few adopters]

29 Example: Textual Mark-up. [Figure: SGML, ASCII, HTML and XML plotted on axes labeled Function and Cost of acceptance]
