ISP 433/533 Week 8 IR in libraries
Goal
  Universal Access to Information
  Vannevar Bush, 1945 article: Memex
    –A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.
History
  1970s - commercial retrieval systems
    –Search remote databases to provide reference services
  1980s - online public access catalog (OPAC), full text files
    –Provide online access to end users
  1990s - digital library programs, WWW
  2000s - ?
Bibliographic Databases
  Chemical Abstracts (CA), Engineering Index, MEDLINE, PsycINFO, etc.
  Manually selected, indexed, abstracted and entered into system
  Record format depends on field
    –Controlled vocabulary
Database Vendors
  DIALOG, LEXIS-NEXIS, OCLC, Wilson, etc.
  Provide a common search interface
  Search on multiple bibliographic databases
    –Cross-database search
  Mostly Boolean retrieval
  Cater to professional search intermediaries, e.g. reference librarians
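A minimal sketch of what "Boolean retrieval" means in practice: each term maps to the set of records containing it, and AND / OR / NOT become set operations. The records, function names, and whitespace tokenization are assumptions made for illustration; they do not describe how DIALOG or any other vendor implements search.

```python
from collections import defaultdict

# Hypothetical toy records; not taken from any real bibliographic database.
RECORDS = {
    1: "digital library architecture",
    2: "boolean retrieval in online catalogs",
    3: "digital catalogs and boolean search",
}


def build_index(records):
    """Map each lowercase term to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for term in text.lower().split():
            index[term].add(rec_id)
    return index


def boolean_and(index, a, b):
    """Records containing both terms."""
    return index.get(a, set()) & index.get(b, set())


def boolean_or(index, a, b):
    """Records containing either term."""
    return index.get(a, set()) | index.get(b, set())


def boolean_not(index, a, b):
    """Records containing term a but not term b."""
    return index.get(a, set()) - index.get(b, set())


INDEX = build_index(RECORDS)
print(boolean_and(INDEX, "digital", "boolean"))
print(boolean_or(INDEX, "digital", "boolean"))
print(boolean_not(INDEX, "digital", "boolean"))
```

A record either matches the query or it does not; there is no ranking, which is one reason these systems catered to trained search intermediaries.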
OPACs
  Provide patrons access to library holdings
  Author, title, call number, subject heading, keywords
  Machine-Readable Cataloging (MARC)
  Boolean search
  Web interface to legacy systems
  OPACs at Albany
Digital Libraries
  A DL is a collection of information that is both digitized and organized - Lesk
How do DLs differ?
  Vs. traditional bibliographic databases and OPACs
    –Extension and superset
    –Provide both metadata and data
    –New technology
  Vs. WWW
    –Organization
    –Tightly controlled, with a targeted customer set
Vs. Traditional Library
  Physical objects
    –You have it, I can’t have it
    –Travel to access
    –Expensive to maintain
    –Anything else?
  TL doesn’t collect “Grey Literature”
    –technical reports, government reports, unedited proceedings, etc.
Converting to Digital Format
  Scanning
    –basically “photographing” a page
  Optical Character Recognition (OCR)
    –generally when scanning, additional s/w deduces semantic content from the photographed page (“guesses the words”)
  Keying
    –retyping it all back in...
  All too time-consuming and $$$!
  Best to avoid conversion altogether if possible
Better Way
  Publishing with a DL in mind
  Publishing in electronic form
  What format?

                  Archival    Original          Intermediate   Presentation
  Past            TIFF        -                 -              GIF, JPEG
  Present/Future  XML, RTF    Word, TeX/LaTeX   RTF            PS, PDF, HTML
DL Architecture
  A Framework for Distributed Digital Object Services
    –Kahn/Wilensky Framework (KWF)
  digital objects (DOs)
    –a unit of exchange for the DL with a particular data structure and characteristics
  repository
    –the place where DOs live
  handles
    –a unique, persistent name for a DO
Kahn/Wilensky Framework
Digital Objects
  Typed data:
    –E.g. type: computer-science-tech-report, bit-sequence…
    –with metadata: author, institution, series, etc.
  Composite DOs:
    –a DO with data of type digital-object
    –composite DOs can be used to collect similar works together: a composite DO that contains a DO for each work of Shakespeare...
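One way to picture typed and composite DOs is as a small recursive data structure. This is only a sketch: the class name, field names, and the "example.lib/..." handles are hypothetical, and KWF itself does not prescribe a Python representation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class DigitalObject:
    handle: str   # unique, persistent name (hypothetical values below)
    type: str     # e.g. "bit-sequence" or "digital-object"
    data: Union[bytes, List["DigitalObject"]]
    metadata: dict = field(default_factory=dict)

    def is_composite(self) -> bool:
        """A composite DO is one whose data is itself of type digital-object."""
        return self.type == "digital-object"

    def leaves(self) -> Iterator["DigitalObject"]:
        """Yield every non-composite DO reachable from this one."""
        if not self.is_composite():
            yield self
            return
        for child in self.data:
            yield from child.leaves()


# Illustrative objects only; the handles and byte contents are made up.
hamlet = DigitalObject("example.lib/hamlet", "bit-sequence",
                       b"...", {"author": "Shakespeare"})
macbeth = DigitalObject("example.lib/macbeth", "bit-sequence",
                        b"...", {"author": "Shakespeare"})
works = DigitalObject("example.lib/shakespeare", "digital-object",
                      [hamlet, macbeth], {"series": "Collected works"})

print([do.handle for do in works.leaves()])
```

Because a composite DO is just a DO whose data is other DOs, collections can nest to any depth without a separate "collection" concept.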
Handles
  Handles can be thought of as a Uniform Resource Name (URN) implementation
  contains info about the handle system
    –persistence
    –location independence
  Handles are of the general form:
    GlobalAuthority.LocalAuthority/LocallyUniqueString
  or, for example:
    NASA.LaRC/tm
  Possible project: evaluate various URN implementations (e.g. Handle, PURL, DOI)
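The general form GlobalAuthority.LocalAuthority/LocallyUniqueString can be split into its three parts with a few lines of code. The sketch below assumes the first "." separates the two authorities and the first "/" starts the local string; the names `Handle` and `parse_handle` are hypothetical and are not part of the Handle System software.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    global_authority: str
    local_authority: str
    local_name: str


def parse_handle(text: str) -> Handle:
    """Split 'GlobalAuthority.LocalAuthority/LocallyUniqueString' into parts."""
    authority, slash, local_name = text.partition("/")
    if not slash or not local_name:
        raise ValueError(f"missing '/LocallyUniqueString' in {text!r}")
    global_auth, dot, local_auth = authority.partition(".")
    if not dot or not global_auth or not local_auth:
        raise ValueError(f"expected 'Global.Local' authority in {text!r}")
    return Handle(global_auth, local_auth, local_name)


print(parse_handle("NASA.LaRC/tm"))
```

Note that nothing in the parsed handle says where the object is stored, which is what location independence means here: resolution to a repository is a separate step.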
Repository Access Protocol (RAP)
  “Protocol” may be misleading; it's really just the skeleton for a protocol
  RAP is designed to be simple
    –repositories themselves should be simple
  KWF defines 3 basic operation classes:
    –ACCESS_DO
    –DEPOSIT_DO
    –ACCESS_REF: returns a reference to the repository server; this is the catch-all operation for all meta-services...
    –More operations were defined in implementations
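To show how small the RAP skeleton is, here is an in-memory sketch with one method per operation class. Everything beyond the three operation names is an assumption: the `Repository` class, its method signatures, and in particular what `access_ref` returns are invented for illustration and are not taken from the KWF paper.

```python
class Repository:
    """Toy in-memory repository exposing the three RAP operation classes."""

    def __init__(self, name):
        self.name = name
        self._objects = {}  # handle -> stored digital object

    def deposit_do(self, handle, digital_object):
        """DEPOSIT_DO: store an object under its handle."""
        if handle in self._objects:
            raise KeyError(f"handle already deposited: {handle}")
        self._objects[handle] = digital_object

    def access_do(self, handle):
        """ACCESS_DO: return the object named by the handle."""
        return self._objects[handle]

    def access_ref(self):
        """ACCESS_REF: describe the repository itself (assumed shape)."""
        return {"repository": self.name, "holdings": len(self._objects)}


# Illustrative use; the repository name and stored bytes are made up.
repo = Repository("example-repo")
repo.deposit_do("NASA.LaRC/tm", b"report bytes")
print(repo.access_do("NASA.LaRC/tm"))
print(repo.access_ref())
```

Search, access control, and other services are deliberately absent; in the KWF view they are layered on top of a simple repository rather than built into it.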
DL Points
  The underlying architecture should be separate from the content stored in the library
  Names and identifiers are the basic building block for the digital library
  Digital library objects are more than collections of bits
  Users want intellectual works, not digital objects
5S Model
  Streams
  Structures
  Spaces
  Scenarios
  Societies
Many, many research projects
  Multilingual
  Multimedia
  Structured documents
  Distributed collections/federated search
  User interface
  Institution: creation, access, and use
Commercial DL
  Journal Storage Project
    –started as a University of Michigan project funded by the Andrew W. Mellon Foundation, now a commercial organization
  Roughly 100 journals
    –mostly humanities, social science, math, economics
  Only WWW access
    –keeps a list of “allowed” IP names / addresses
  Provides only images for the pages
  OCR done, but the results are used for searching and not displayed to the user
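Access control by "allowed" IP addresses can be sketched with Python's standard `ipaddress` module. The network ranges below are reserved documentation addresses used as placeholders, and `is_allowed` is a hypothetical helper; this is not a description of how the Journal Storage Project actually implements its check.

```python
import ipaddress

# Placeholder subscriber networks (reserved documentation ranges).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if the client address falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("192.0.2.17"))
print(is_allowed("203.0.113.5"))
```

The check identifies the institution's network, not the individual reader, so anyone on a subscribing campus gets access without logging in.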