Old Dominion University Department of Computer Science


1 Introduction to Digital Libraries Week 10: The Open Archives Initiative
Old Dominion University Department of Computer Science CS 695 Fall 2003 Michael L. Nelson 10/28/03 several slides borrowed from Van de Sompel, Liu, & Warner

2 The Rise and Fall of Distributed Searching
Wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
  Davis & Lagoze, JASIS 51(3), pp
  Powell & French, Proc 5th ACM DL, pp
Distributed searching of N nodes is still viable, but only for small values of N
  NCSTRL: N > 100; bad
  NTRS/NIX: N <= 20; ok (but could be better)

3 The Rise and Fall of Distributed Searching
Other problems of distributed searching (from STARTS)
  source-metadata problem: how do you know which nodes to search?
  query-language problem: syntax varies and drifts over time between the various nodes
  rank-merging problem: how do you meaningfully merge multiple result sets?
Temptations:
  centralize all functions: “everything will be done at X”
  standardize on a single product: “everyone will use system Y”

4 Universal Preprint Service
Demonstrated at Santa Fe NM, October 21-22, 1999
D-Lib Magazine, 6(2) 2000 (2 articles)
UPS was soon renamed the Open Archives Initiative (OAI)
Based on NCSTRL+ software, it is a cross-archive DL that provides services on a collection of content harvested from multiple archives
NCSTRL+ is a modified version of Dienst
  support for “clustering”
  support for “buckets”

5 UPS Participants totals ca. July 1999

6 project metadata formats the arXiv CogPrints NACA NCSTRL NDLTD RePEc
internal Refer RFC1807 MARC ReDIF

7 Getting metadata out of archives
project metadata extraction
  not all archives support metadata extraction
  some archives have undocumented metadata extraction procedures
  not all archives support rich criteria for extraction
    single dump concept only
  intellectual property and use rights not always clear

8 Metadata has problems with:
project metadata quality
  record duplication
  crucial missing fields
  internal errors
  ambiguous references to people and places, publications

9 re-creation of archives
project re-creation of archives
  creation of archives for ReDIF-ed metadata using intelligent digital objects: “buckets”
  (diagram labels: RePEc, arXiv, NCSTRL)

10 project creation of end-user service
NCSTRL+ digital library service
  indexing buckets in archives by requesting their metadata
  enhanced user-interface
NCSTRL+ search results point at buckets
  buckets auto-display
  buckets provide link to full-text in native archive

11 Data and Service Providers
Data Providers
  publishing into an archive
  providing methods for metadata “harvesting”
  provide non-technical context for sharing information also
Service Providers
  harvest metadata from providers
  implement user interface to data
Even if provided by the same DL, these are distinct functions

12 Data and Service Providers
(diagram labels: one Provider with an input interface and a native end-user interface; one Provider with an input interface, a native end-user interface, and a native harvesting interface)
No machine based way to extract metadata…
Machine and user interfaces for extracting metadata….

13 Data and Service Providers
(diagram labels: Implementor with a native end-user interface; two Providers, each with a native harvesting interface, an input interface, and a native end-user interface)
Input and harvesting interfaces optional
Native end-user interface optional (e.g., RePEc)

14 Self-Describing Archives
Much of the learning about the constituent UPS archives occurred out of band…
Given an unknown archive, we should be able to algorithmically determine the archive’s metadata...
Where possible, the harvesting interface should provide the same criteria as the end-user interface
(diagram labels: Provider with a native harvesting interface, an input interface, and a native end-user interface)

15 Data and Service Providers
Recommended criteria for metadata extraction:
  subject classification
  accession date
  publication date
Criteria for archive description
  metadata formats employed
  contact information for archive
  publication type scheme
  identifier scheme
  subject classification scheme

16 Result… OAI
The OAI was the result of the demonstration and discussion during the Santa Fe meeting
Lots of churn regarding what the OAI was
OAI harvesting protocol
  originally a subset of the Dienst (NCSTRL) protocol and originally called the “Santa Fe Convention”
  originally defined an OAI-specific metadata format

17 OAI Protocol for Metadata Harvesting
OAI metadata format dropped in favor of unqualified Dublin Core
  other formats possible, but DC is required as lowest common denominator
No longer dependent on Dienst
  defined independently (though still easily map-able)

18 OAI as a “Dumb Archive”
SODA DL model originally used a separate protocol & implementation for the “dumb archive”
  development ceased in favor of the OAI metadata harvesting protocol
OAI divides the world into “service providers” (DLs) and “data providers” (archives)
OAI does not require smart objects, but does create a “dumb archive” layer
  note that OAI does not define an archive implementation, but rather just a standard way of exposing an archive’s contents

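19 (columns: Santa Fe convention | OAI-PMH v.1.0/1.1 | OAI-PMH v.2.0)
nature: experimental | experimental | stable
verbs: Dienst | OAI-PMH | OAI-PMH
requests: HTTP GET/POST (all three)
responses: XML (all three)
transport: HTTP (all three)
metadata: OAMS | unqualified Dublin Core | unqualified Dublin Core
about: eprints | document like objects | resources
model: metadata harvesting (all three)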

20 Overview of OAI Verbs
Verb: Function
  Identify: description of archive
  ListMetadataFormats: metadata formats supported by archive
  ListSets: sets defined by archive
  ListIdentifiers: OAI unique ids contained in archive
  ListRecords: listing of N records
  GetRecord: listing of a single record
verb groups: archival metadata; harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
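As a rough illustration of how the six verbs become requests, the sketch below builds HTTP GET URLs with the verb and its arguments as query parameters. The function name `build_request` and the base URL are hypothetical and not taken from the slides; this is a minimal sketch, not a complete harvester.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs from the overview slide.
VERBS = ("Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord")

# Hypothetical archive address, used only for illustration.
BASE_URL = "http://archive.example.org/oai"


def build_request(base_url, verb, **arguments):
    """Return the HTTP GET URL for one OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    # The verb goes first; arguments left as None are simply omitted.
    params = {"verb": verb}
    params.update({k: v for k, v in arguments.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "GetRecord",
                    identifier="oai:example:1", metadataPrefix="oai_dc"))
```

Note that `urlencode` percent-encodes the colons in the identifier, which is what a harvester sending a GET request would need to do anyway.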

21 Identify
1.1
  Arguments: none
  Errors: none
2.0
  Arguments: none
  Errors: badArgument

22 ListMetadataFormats
1.1
  Arguments: identifier (OPTIONAL)
  Errors: id does not exist
2.0
  Arguments: identifier (OPTIONAL)
  Errors: badArgument, noMetadataFormats, idDoesNotExist

23 ListSets
1.1
  Arguments: resumptionToken (EXCLUSIVE)
  Errors: no set hierarchy
2.0
  Arguments: resumptionToken (EXCLUSIVE)
  Errors: badArgument, badResumptionToken, noSetHierarchy

24 ListIdentifiers
1.1
  Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE)
  Errors: no records match
2.0
  Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED)
  Errors: badArgument, cannotDisseminateFormat, badResumptionToken, noSetHierarchy, noRecordsMatch

25 ListRecords
1.1
  Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED)
  Errors: no records match, metadata format cannot be disseminated
2.0
  Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED)
  Errors: noRecordsMatch, cannotDisseminateFormat, badResumptionToken, noSetHierarchy, badArgument

26 GetRecord
1.1
  Arguments: identifier (REQUIRED), metadataPrefix (REQUIRED)
  Errors: id does not exist, metadata format cannot be disseminated
2.0
  Arguments: identifier (REQUIRED), metadataPrefix (REQUIRED)
  Errors: badArgument, cannotDisseminateFormat, idDoesNotExist

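27 Argument Summary
(arguments: metadataPrefix, from, until, set, resumptionToken, identifier; v.2.0)
  Identify: no arguments
  ListMetadataFormats: identifier optional
  ListSets: resumptionToken exclusive
  ListIdentifiers: metadataPrefix required; from, until, set optional; resumptionToken exclusive
  ListRecords: metadataPrefix required; from, until, set optional; resumptionToken exclusive
  GetRecord: metadataPrefix required; identifier required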

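28 Error Summary
  Identify: BA
  ListMetadataFormats: BA, NMF, IDDNE
  ListSets: BA, BRT, NSH
  ListIdentifiers: BA, BRT, CDF, NRM, NSH
  ListRecords: BA, BRT, CDF, NRM, NSH
  GetRecord: BA, CDF, IDDNE
(BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy)
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification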
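The per-verb argument rules (required, optional, exclusive) and the badVerb rule lend themselves to a table-driven check. The sketch below is one possible way a repository might screen incoming v.2.0 requests; `RULES` and `check_request` are hypothetical names, and the check covers only badVerb and badArgument, not the errors that depend on the archive's contents. Because it takes a plain dict, it also does not detect repeated arguments.

```python
# Argument rules per verb for OAI-PMH 2.0, transcribed from the per-verb slides.
_LIST_RULE = {"required": {"metadataPrefix"},
              "optional": {"from", "until", "set"},
              "exclusive": "resumptionToken"}

RULES = {
    "Identify": {"required": set(), "optional": set(), "exclusive": None},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"},
                            "exclusive": None},
    "ListSets": {"required": set(), "optional": set(),
                 "exclusive": "resumptionToken"},
    "ListIdentifiers": _LIST_RULE,
    "ListRecords": _LIST_RULE,
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set(), "exclusive": None},
}


def check_request(params):
    """Return None if verb and arguments are legal, else an error code."""
    verb = params.get("verb")
    if verb not in RULES:
        return "badVerb"
    rule = RULES[verb]
    args = set(params) - {"verb"}
    if rule["exclusive"] in args:
        # An exclusive argument must be the only argument besides the verb.
        return None if args == {rule["exclusive"]} else "badArgument"
    if not rule["required"] <= args:
        return "badArgument"          # a required argument is missing
    if args - rule["required"] - rule["optional"]:
        return "badArgument"          # an argument the verb does not accept
    return None


print(check_request({"verb": "ShowMe"}))
print(check_request({"verb": "ListRecords", "resumptionToken": "AXad31"}))
```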

29 Flow Control
ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of:
  resumptionToken – an opaque, archive-defined data string that when passed back to the archive allows the response to begin where it left off
    each archive defines its own resumptionToken syntax; it may have visible semantics or not
  503 http status code – “retry after”
    up to the harvester to understand this code and respect it, and up to the archive to enforce it

30 resumptionToken scenario: harvesting 2770 records in 3 separate 1000 record “chunks”
(diagram labels: harvester; archive backed by an RDBMS)
  harvester: ListRecords
  archive: Records , resumptionToken=AXad31
  harvester: ListRecords, resumptionToken=AXad31
  archive: Records , resumptionToken=pQ22-x
  harvester: ListRecords, resumptionToken=pQ22-x
  archive: Records
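The 2770-record scenario can be sketched as a loop that keeps reissuing ListRecords with the last resumptionToken until none is returned. The in-memory archive below (`make_archive`) is a hypothetical stand-in for a real repository, and its `tok-<offset>` token syntax is invented for the example; a real harvester must treat the token as opaque.

```python
def make_archive(total=2770, chunk=1000):
    """Hypothetical in-memory archive answering ListRecords in chunks."""
    records = [f"record-{i}" for i in range(1, total + 1)]

    def list_records(resumption_token=None):
        # This toy archive encodes an offset in its token; harvesters
        # never look inside it.
        start = int(resumption_token.split("-")[1]) if resumption_token else 0
        end = min(start + chunk, total)
        next_token = f"tok-{end}" if end < total else None
        return records[start:end], next_token

    return list_records


def harvest(list_records):
    """Follow resumptionTokens until the archive reports a complete list."""
    harvested, requests, token = [], 0, None
    while True:
        batch, token = list_records(token)
        requests += 1
        harvested.extend(batch)
        if token is None:
            return harvested, requests


records, requests = harvest(make_archive())
print(len(records), "records in", requests, "requests")
```

A production harvester would also honor a 503 response and its retry-after hint between requests, which this sketch leaves out.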

31 302 Load Balancing
Interactive users on main DL machine should not be impacted by metadata harvesting
  don’t take deliveries through the front door
  not part of the protocol; defined outside the protocol
(diagram: harvester sends its request to the OAI Server at naca.larc.nasa.gov/oai/; if load > 0.05, redirect request with HTTP Status Code 302 to the OAI Server at buckets.dsi.internet2.edu/naca/oai/, which returns the response)
<?xml version="1.0" encoding="UTF-8"?> <ListIdentifiers> </ListIdentifiers>

32 OAI Demos Data providers
not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool

33 what’s new in OAI-PMH v.2.0

34 general changes

35 protocol vs periphery
clear distinction between protocol and periphery
  fixed protocol document
  extensible implementation guidelines: e.g. sample metadata formats, description containers, about containers
  allows for OAI guidelines and community guidelines

36 OAI-PMH vs HTTP
clear separation of OAI-PMH and HTTP
OAI-PMH error handling
  all OK at HTTP level? => 200 OK
  something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
http codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events
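One way a harvester might apply this split is to check the HTTP status first and only then look inside the XML for OAI-PMH error elements. The sketch below uses the OAI-PMH 2.0 namespace; the function name `oai_errors` and the sample response (including its date and request URL) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Made-up sample response body, modeled on a badVerb error.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-10-28T08:55:46Z</responseDate>
  <request>http://archive.example.org/oai</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""


def oai_errors(http_status, body):
    """Return a list of (code, message) pairs for OAI-PMH level errors."""
    if http_status != 200:
        # Transport-level problem (e.g. 302, 503): not an OAI-PMH event.
        raise RuntimeError(f"HTTP level problem: status {http_status}")
    root = ET.fromstring(body)
    return [(err.get("code"), (err.text or "").strip())
            for err in root.findall(f"{OAI_NS}error")]


print(oai_errors(200, SAMPLE))
```

An empty list means the request succeeded at both levels; a non-empty list carries protocol errors such as badVerb or noRecordsMatch.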

37 OAI-PMH Data Model
resource
item: all available metadata about David
  item = identifier
records: Dublin Core metadata, MARC, SPECTRUM
  record = identifier + metadata format + datestamp
set-membership is item-level property
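The item/record distinction can be made concrete with two small data classes: an item is known by its identifier and carries set membership, while each record is the item's metadata in one format with a datestamp. This is only an illustrative sketch; the class and field names are assumptions, not part of the protocol, and the identifier and dates in the usage lines are invented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the metadata)."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    """item = identifier; set membership is an item-level property."""
    identifier: str
    set_specs: list = field(default_factory=list)
    # Maps metadataPrefix -> (datestamp, metadata) for each available format.
    formats: dict = field(default_factory=dict)

    def record(self, metadata_prefix):
        """Return the record for one metadata format of this item."""
        datestamp, metadata = self.formats[metadata_prefix]
        return Record(self.identifier, metadata_prefix, datestamp, metadata)


# Invented example item with two metadata formats.
item = Item("oai:example:david", set_specs=["sculpture"],
            formats={"oai_dc": ("2003-10-28", "<dc>...</dc>"),
                     "marc": ("2003-09-01", "<marc>...</marc>")})
print(item.record("oai_dc"))
```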

38 other general changes
better definitions of harvester, repository, item, unique identifier, record, set, selective harvesting
oai_dc schema builds on DCMI XML Schema for unqualified Dublin Core
usage of must, must not etc. as in RFC2119
wording on response compression

39 other general changes
all protocol responses can be validated with a single XML Schema
  easier for data providers
  no redundancy in type definitions
  SOAP-ready
  clean for error handling

40 response no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate> T08:55:46Z</responseDate>
  <request verb="GetRecord"… …>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/ </identifier>
        <datestamp> </datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata> ….. </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
note no http encoding of the OAI-PMH request

41 response with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate> T08:55:46Z</responseDate>
  <request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
with errors, only the correct attributes are echoed in <request>

42 corrections

43 dates/times
all dates/times are UTC, encoded in ISO8601, Z-notation
 T20:30:00Z

44 resumptionToken
idempotency of resumptionToken: return same incomplete list when rT is reissued
  while no changes occur in the repo: strict
  while changes occur in the repo: all items with unchanged datestamp
new, optional attributes for the resumptionToken:
  expirationDate
  completeListSize
  cursor

45 noRecordsMatch 1.x - if no records match, an empty list was returned

46 noRecordsMatch 2.0 - if no records match, the error condition noRecordsMatch is returned -- not an empty list

47 new functionality

48 harvesting granularity
mandatory support of YYYY-MM-DD
optional support of YYYY-MM-DDThh:mm:ssZ
other granularities considered, but ultimately rejected
granularity of from and until must be the same
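A repository could enforce these rules with a small check on the from and until arguments. The sketch below is an assumption-laden illustration: the function names are hypothetical, the patterns check only the shape of the value (not whether the date is a real calendar date), and reporting a mismatch as badArgument is this sketch's choice.

```python
import re
from datetime import datetime, timedelta, timezone

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"
_PATTERNS = {
    DAY: re.compile(r"\d{4}-\d{2}-\d{2}"),
    SECONDS: re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"),
}


def granularity(value):
    """Return the granularity a datestamp is written in, or None."""
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None


def check_range(from_=None, until=None, repository_granularity=DAY):
    """Return None if from/until are acceptable, else 'badArgument'."""
    grains = [granularity(v) for v in (from_, until) if v is not None]
    if None in grains:
        return "badArgument"      # not one of the two supported forms
    if len(set(grains)) > 1:
        return "badArgument"      # from and until differ in granularity
    if SECONDS in grains and repository_granularity == DAY:
        return "badArgument"      # finer than this repository supports
    return None


def utc_datestamp(moment):
    """Format a timezone-aware datetime in UTC with Z-notation."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 3:30 pm at UTC-5 is 20:30 UTC.
local = datetime(2003, 10, 28, 15, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(utc_datestamp(local))
print(check_range("2003-01-01", "2003-10-28T00:00:00Z", SECONDS))
```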

49 Identify more expressive
<repositoryName>Library of Congress 1</repositoryName>
<baseURL>
<protocolVersion>2.0</protocolVersion>
<deletedRecord>transient</deletedRecord>
<earliestDatestamp> T00:00:00Z</earliestDatestamp>
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<compression>deflate</compression>

50 header contains set membership of item
<record>
  <header>
    <identifier>oai:arXiv:cs/ </identifier>
    <datestamp> </datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata> ….. </metadata>
</record>
eliminates the need for the “double harvest” 1.x required to get all records and all set information

51 ListIdentifiers returns headers
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate> T08:55:46Z</responseDate>
  <request verb="…" …>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/ </identifier>
      <datestamp> </datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/ </identifier>
      <datestamp> </datestamp>
      <setSpec>physic:exp</setSpec>
    ……

52 ListIdentifiers mandates metadataPrefix as argument
verb=ListIdentifiers &metadataPrefix=olac &from= &until= &set=Perseus:collection:PersInfo

53 ListIdentifiers
the changes to ListIdentifiers are subtle, and reflect a change in the OAI-PMH data model
Could have been named “ListHeaders” or reduced to an option for ListRecords
“ListIdentifiers” kept for lexicographical consistency

54 metadataPrefix
character set for metadataPrefix and setSpec extended to URL-safe characters
A-Z a-z 0-9 _ ! ‘ $ ( ) *

55 in the periphery

56 provenance
introduction of provenance container to facilitate tracing of harvesting history
<about>
  <provenance>
    <originDescription>
      <baseURL>
      <identifier>oai:r1:plog/ </identifier>
      <datestamp> T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate> T12:01:30Z</harvestDate>
      … … …
    </originDescription>
  </provenance>
</about>

57 friends
introduction of friends container to facilitate discovery of repositories
<description>
  <friends>
    <baseURL>
    <baseURL>
    <baseURL>
    <baseURL>
  </friends>
</description>

58 branding
introduction of branding container for DPs to suggest rendering & association hints
<branding xmlns=" xmlns:xsi=" xsi:schemaLocation="
  <collectionIcon>
    <url>
    <link>
    <title>MySite(tm)</title>
    <width>88</width>
    <height>31</height>
  </collectionIcon>
  <metadataRendering metadataNamespace=" mimeType="text/xsl">
  metadataNamespace=" mimeType="text/css">
</branding>

59 revision of oai-identifier
<description>
  <oai-identifier xmlns=" xmlns:xsi=" xsi:schemaLocation="
    <scheme>oai</scheme>
    <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
    <delimiter>:</delimiter>
    <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
  </oai-identifier>
</description>
domain based repository names
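The sample identifier splits on the declared delimiter into a scheme, a repository identifier, and a local identifier. A minimal sketch of that split follows; `parse_oai_identifier` is a hypothetical helper, not something defined by the guidelines.

```python
def parse_oai_identifier(identifier, delimiter=":"):
    """Split an identifier into (scheme, repositoryIdentifier, local id)."""
    # Split at most twice so the local part may itself contain the delimiter.
    parts = identifier.split(delimiter, 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a three-part identifier: {identifier!r}")
    scheme, repository_identifier, local_identifier = parts
    if scheme != "oai":
        raise ValueError(f"unexpected scheme: {scheme!r}")
    return scheme, repository_identifier, local_identifier


print(parse_oai_identifier("oai:oai-stuff.foo.org:5324"))
```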

60 OAI-PMH musings

61 OAI Observation: Front-End Only
No input/registry mechanism
  OAI harvesting protocol is always a front-end for something else
    filesystem, Dienst, RDBMS, LDAP, etc.
  convenient for pre-existing DLs, but does not address “new” DLs
    e.g., “we want to do OAI”
Bounds the scope of OAI
  responsibilities and domain of OAI are still being discussed
  tension between functionality and simplicity

62 OAI Observation: No T&C
No terms & conditions provisions in protocol
  assumes all metadata has uniform access rights
  how to restrict metadata to certain hosts?
introducing T&C would increase the scope of application, but at the expense of simplicity
  how expensive do we want to make a “just-a-front-end protocol”?
  maybe T&C is a good application for sets?

63 OAI Observation: No T&C
Possible to use multiple OAI servers in a DMZ-like configuration…
(diagram labels: OAI requests from trusted hosts; OAI requests from arbitrary hosts; Public OAI Server; Private OAI Server; Source database)
could even use a separate copy of the database…

64 OAI Observation: No T&C
Possible to use OAI harvesting protocol in closed, restricted systems
(diagram labels: OAI 1, OAI 2, OAI 3, OAI 4)
all OAI requests originate from these 4 DLs

65 OAI Observation: Monolithic
An OAI server has no protocol-defined concept of “other” OAI servers
  backups, mirrors, etc. have to be resolved outside of the scope of OAI
  scope vs. complexity again
fully connected graph of DLs harvesting from each other is unnecessary
  cf. web crawlers vs. “gatherers” in U of Colorado’s Harvest System
3rd party harvesting interfaces raise more T&C and data coherency issues

66 OAI Observation: Data Coherency
In the interest of OAI implementer simplicity, several issues are left for the service provider to interpret
  what is an update vs. addition?
    in the NACA OAI interface, they are reported as the same and it’s up to the harvesting system to figure it out
  deletions?
    it is currently optional for OAI systems to mark records as deleted or not… still left to the harvester to interpret

67 OAI Observation: Harvest Model
Frequency of harvests
  all-at-once harvests?
    initial harvest
    resolving data coherency
  frequent incremental harvests?
    far more efficient for both service and data providers
Webcrawling vs. digital library models
  webcrawlers: little to no a priori information about target
  DLs: frequent harvesting of a small number of known targets
Realization: we know very little about harvesting behavior…
  are we optimizing for all-at-once, when incremental will be more common?

68 Interesting Services
DP9
  gateway to expose repository contents in HTML suitable for web crawlers
Celestial
  OAI “cache”, also 1.1 -> 2.0 converter
Static (mini-) repositories
  XML files, based on OLAC work
OpenURL metadata format registries
  record = metadata format

69 DP9 Architecture see Liu et al., JCDL 2002; http://dlib.cs.odu.edu/dp9
Slide from Liu

70 DP9 Formatting
Format of URLs
  &prefix=oai_dc
HTML Meta tags
  Some crawlers (such as Inktomi) use the HTML meta tags to index Web pages; DP9 also maps Dublin Core metadata to corresponding HTML meta tags.
  For pages that are designed exclusively for robots navigation, a noindex robots meta tag is used
X-FORWARDED-FOR header to distinguish between different users coming in via a proxy
Slide from Liu
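As a rough sketch of the meta-tag idea, the function below turns Dublin Core element/value pairs into HTML meta tags and optionally appends a noindex robots tag. The `DC.<element>` naming follows the common convention for embedding Dublin Core in HTML and is an assumption here; the slides do not say which names DP9 actually emits, and `dc_to_meta_tags` is a hypothetical helper.

```python
from html import escape


def dc_to_meta_tags(dc_fields, navigation_only=False):
    """Render (element, value) Dublin Core pairs as HTML meta tags."""
    tags = [
        f'<meta name="DC.{escape(name, quote=True)}" '
        f'content="{escape(value, quote=True)}">'
        for name, value in dc_fields
    ]
    if navigation_only:
        # Page exists only for robot navigation: ask crawlers not to index it.
        tags.append('<meta name="robots" content="noindex">')
    return "\n".join(tags)


print(dc_to_meta_tags([("title", 'A "quoted" title'), ("creator", "Nelson, M.")]))
```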

71 Celestial
Developed by Brody @ Southampton
  designed to complement DP9
  see Liu, Brody, et al., D-Lib Magazine 8(11)
Where DP9 is a non-caching proxy, Celestial caches the metadata records
  can off-load work from individual archives, higher availability
  can harvest 1.1, 2.0; exports in 2.0

72 “Static” Repositories
Premise: a repository does not wish to have an executing program on its site, so it has a “static” XML file with some of the OAI-PMH responses in place
  accessed through a proxy
  could be a low functionality node, or the XML file could be produced by a process and moved outside a firewall
Based on OLAC work by Bird & Simons

73 OpenURL Metadata Registry
Registry of metadata formats for OpenURL

74 Additional Readings
presentations
publications

