OAI Overview Michael L. Nelson Old Dominion University Norfolk Virginia, USA Bioinformatics Seminar ODU CS 791/891.


1 OAI Overview Michael L. Nelson Old Dominion University Norfolk Virginia, USA mln@cs.odu.edu http://www.cs.odu.edu/~mln/ Bioinformatics Seminar ODU CS 791/891 Feb 3 2003

2 The Rise and Fall of Distributed Searching wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice –Davis & Lagoze, JASIS 51(3), pp. 273-80 –Powell & French, Proc. 5th ACM DL, pp. 264-265 distributed searching of N nodes still viable, but only for small values of N NCSTRL: N > 100; bad NTRS/NIX: N <= 20; ok (but could be better)

3 The Rise and Fall of Distributed Searching Other problems of distributed searching (from STARTS) –source-metadata problem how do you know which nodes to search? –query-language problem syntax varies and drifts over time between the various nodes –rank-merging problem how do you meaningfully merge multiple result sets? Temptations: –centralize all functions “everything will be done at X” –standardize on a single product “everyone will use system Y”

4 Universal Preprint Service A cross-archive DL that provides services on a collection of metadata harvested from multiple archives –based on NCSTRL+; a modified version of Dienst support for “clustering” support for “buckets” Demonstrated at Santa Fe NM, October 21-22, 1999 –http://ups.cs.odu.edu/ –D-Lib Magazine, 6(2) 2000 (2 articles) http://www.dlib.org/dlib/february00/02contents.html –UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

5 Data Providers –publishing into an archive –providing methods for metadata “harvesting” provide non-technical context for sharing information also Service Providers –harvest metadata from providers –implement user interface to data Self-describing archives –Much of the learning about the constituent UPS archives occurred out of band… Data and Service Providers Even if these are done by the same DL, these are distinct roles

6 Metadata Harvesting Move away from distributed searching Extract metadata from various sources Build services on local copies of metadata –data remains at remote repositories [Figure: a user searches for “cfd applications” against a local copy of metadata that was harvested offline from each node; each node is independently maintained; all searching, browsing, etc. is performed on the local metadata; individual nodes can still support direct user interaction]

7 Result… OAI http://www.openarchives.org/ The OAI was the result of the demonstration and discussion during the Santa Fe meeting Initial focus was on federating collections of scholarly e-print materials… …however, interest grew and the scope and application of OAI expanded to become a generic bulk metadata transport protocol Note: –OAI is only about metadata -- not full text! –OAI is neutral with respect to the nature of the metadata or the resources the metadata describes read: commercial publishers have an interest in OAI too...

8
            Santa Fe convention    OAI-PMH v.1.0/1.1          OAI-PMH v.2.0
nature      experimental           experimental               stable
verbs       Dienst                 OAI-PMH                    OAI-PMH
requests    HTTP GET/POST          HTTP GET/POST              HTTP GET/POST
responses   XML                    XML                        XML
transport   HTTP                   HTTP                       HTTP
metadata    OAMS                   unqualified Dublin Core    unqualified Dublin Core
about       eprints                document like objects      resources
model       metadata harvesting    metadata harvesting        metadata harvesting

9 Dublin Core Dublin Core Metadata Initiative –http://www.dublincore.org/ –from 1994-1995, recognizing the need for simple, interoperable metadata for resource discovery –good overview of metadata & DC: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html –15 elements (qualifiers possible)

10 Overview of OAI Verbs
Verb                   Function
Identify               description of archive
ListMetadataFormats    metadata formats supported by archive
ListSets               sets defined by archive
ListIdentifiers        OAI unique ids contained in archive
ListRecords            listing of N records
GetRecord              listing of a single record
archival metadata harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
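A minimal sketch of how a harvester might turn one of the six verbs and its arguments into an HTTP GET request. The helper name build_request and the base URL http://example.org/oai are hypothetical and used only for illustration; the verb names come from the slide.

```python
from urllib.parse import urlencode

# The six verbs listed on the slide.
OAI_VERBS = ("Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord")


def build_request(base_url, verb, **args):
    """Return a GET URL for an OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"badVerb: {verb!r}")
    params = {"verb": verb}
    # Skip arguments the caller left as None.
    params.update({k: v for k, v in args.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    base = "http://example.org/oai"  # assumed base URL, not from the slides
    print(build_request(base, "Identify"))
    # "from" is a Python keyword, so it has to be passed through a dict.
    print(build_request(base, "ListRecords", metadataPrefix="oai_dc",
                        **{"from": "2003-01-01"}))
```

Rejecting unknown verbs on the client side mirrors the badVerb error an archive would return, but it is only a convenience here; the archive remains the authority.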

11 Argument Summary
Verb                   metadataPrefix   from       until      set        resumptionToken   identifier
Identify               -                -          -          -          -                 -
ListMetadataFormats    -                -          -          -          -                 optional
ListSets               -                -          -          -          exclusive         -
ListIdentifiers        required         optional   optional   optional   exclusive         -
ListRecords            required         optional   optional   optional   exclusive         -
GetRecord              required         -          -          -          -                 required
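The argument rules can be sketched as a small lookup table plus a checker. This assumes the OAI-PMH 2.0 rules (metadataPrefix required for ListIdentifiers, ListRecords and GetRecord; resumptionToken exclusive, meaning it must be the only argument besides the verb). The names RULES and check_arguments are hypothetical.

```python
# verb -> (required args, optional args, accepts resumptionToken)
# Assumed from the OAI-PMH 2.0 argument rules.
RULES = {
    "Identify":            (set(), set(), False),
    "ListMetadataFormats": (set(), {"identifier"}, False),
    "ListSets":            (set(), set(), True),
    "ListIdentifiers":     ({"metadataPrefix"}, {"from", "until", "set"}, True),
    "ListRecords":         ({"metadataPrefix"}, {"from", "until", "set"}, True),
    "GetRecord":           ({"identifier", "metadataPrefix"}, set(), False),
}


def check_arguments(verb, args):
    """Return None if the request is legal, else an OAI-PMH error code."""
    if verb not in RULES:
        return "badVerb"
    required, optional, accepts_token = RULES[verb]
    names = set(args)
    if "resumptionToken" in names:
        # Exclusive: the token may not be combined with anything else.
        if accepts_token and names == {"resumptionToken"}:
            return None
        return "badArgument"
    if not required <= names:
        return "badArgument"      # a required argument is missing
    if names - required - optional:
        return "badArgument"      # an argument this verb does not take
    return None
```

For example, check_arguments("ListRecords", {"resumptionToken": "AXad31", "set": "physics"}) yields "badArgument", because the token was combined with another argument.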

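12 Error Summary
Identify               BA
ListMetadataFormats    BA  NMF  IDDNE
ListSets               BA  BRT  NSH
ListIdentifiers        BA  BRT  CDF  NRM  NSH
ListRecords            BA  BRT  CDF  NRM  NSH
GetRecord              BA  CDF  IDDNE
Key: BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification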

13 Flow Control ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of: –resumptionToken – an opaque, archive-defined data string that when passed back to the archive allows the response to begin where it left off each archive defines its own resumptionToken syntax; it may have visible semantics or not –503 HTTP status code – “retry after” up to the harvester to understand this code and respect it, and up to the archive to enforce it
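A minimal sketch of a harvester that respects the 503 “retry after” signal. To stay self-contained it takes the HTTP call as a parameter: fetch is a hypothetical callable returning (status, headers, body). It also assumes Retry-After carries a number of seconds (HTTP allows a date form as well, which is not handled here) and falls back to an arbitrary 60 seconds if the header is missing.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) until it stops answering 503, waiting as instructed.

    fetch is a stand-in for a real HTTP client and must return a
    (status, headers, body) tuple; it is an assumption of this sketch.
    """
    for _ in range(max_attempts):
        status, headers, body = fetch(url)
        if status != 503:
            return status, body
        # Assumes the seconds form of Retry-After; 60 is an arbitrary default.
        delay = int(headers.get("Retry-After", "60"))
        sleep(delay)
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

Passing sleep as a parameter keeps the function testable without real waiting; a production harvester would more likely call time.sleep directly.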

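14 resumptionToken
scenario: harvesting 277 records in 3 separate 100 record “chunks” (harvester on one side, RDBMS-backed archive on the other)
harvester -> archive: ListRecords
archive -> harvester: Records 1-100, resumptionToken=AXad31
harvester -> archive: ListRecords, resumptionToken=AXad31
archive -> harvester: Records 101-200, resumptionToken=pQ22-x
harvester -> archive: ListRecords, resumptionToken=pQ22-x
archive -> harvester: Records 201-277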
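The 277-record scenario can be sketched with a toy in-memory archive standing in for the RDBMS. In this sketch the token is simply the next offset written as a string; that is an assumption made for readability, since real tokens are opaque and archive-defined. The names make_archive and harvest are hypothetical.

```python
def make_archive(records, chunk=100):
    """Return a toy ListRecords function over an in-memory list."""
    def list_records(resumption_token=None):
        # Toy token format: the offset of the next record, as a string.
        start = int(resumption_token) if resumption_token else 0
        end = start + chunk
        next_token = str(end) if end < len(records) else None
        return records[start:end], next_token
    return list_records


def harvest(list_records):
    """Follow resumptionTokens until the archive stops issuing them."""
    harvested, token, requests = [], None, 0
    while True:
        batch, token = list_records(token)
        harvested.extend(batch)
        requests += 1
        if token is None:
            return harvested, requests


if __name__ == "__main__":
    archive = make_archive([f"record-{i}" for i in range(1, 278)])
    records, requests = harvest(archive)
    print(len(records), requests)
```

The harvester never inspects the token; it only hands it back, which is what lets each archive choose its own syntax.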

15 OAI Links & Demos Data providers –not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool http://purl.org/net/oai_explorer ~100 registered data providers –http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl –many being used for internal purposes; not registered Service providers –Arc, the first known SP harvesting from OAI data providers http://arc.cs.odu.edu/ ~20 registered service providers –http://www.openarchives.org/service_provider/oai_sp.htm –several more known to be in testing or creation

16 Field of Dreams It should be easy to be a data provider, even if it makes more work for the service provider. –if enough data providers exist, the service providers will come (DPs >> SPs) Open-source / freely available tools –“drop-in” data providers: industrial strength: http://www.eprints.org/ personal size: http://kepler.cs.odu.edu/ –tools to make your existing DL a data provider: http://www.openarchives.org/tools/tools.htm also: OAI-implementers mailing list / mail archive! –service providers: only bits and pieces currently publicly available...

17 OAI Observation: Front-End Only No input/registry mechanism –OAI harvesting protocol is always a front-end for something else filesystem, Dienst, RDBMS, LDAP, etc. –convenient for pre-existing DLs, but does not address “new” DLs e.g., “we want to do OAI” Bounds the scope of OAI –responsibilities and domain of OAI are still being discussed –tension between functionality and simplicity

18 OAI Observation: No T&C No terms & conditions provisions in protocol –assumes all metadata has uniform access rights how to restrict metadata to certain hosts? –introducing T&C would increase the scope of application, but at the expense of simplicity how expensive do we want to make a “just-a-front-end protocol” ? maybe T&C is a good application for sets?

19 OAI Observation: No T&C Possible to use multiple OAI servers in a DMZ-like configuration… [Figure: a Public OAI Server receives OAI requests from arbitrary hosts and a Private OAI Server receives OAI requests from trusted hosts; both sit in front of the source database] could even use a separate copy of the database…

20 OAI Observation: No T&C Possible to use OAI harvesting protocol in closed, restricted systems [Figure: four DLs labeled OAI 1, OAI 2, OAI 3, OAI 4] all OAI requests originate from these 4 DLs

21 OAI Observation: Monolithic An OAI server has no protocol-defined concept of “other” OAI servers –backups, mirrors, etc. have to be resolved outside of the scope of OAI scope vs. complexity again –fully connected graph of DLs harvesting from each other is unnecessary cf. web crawlers vs. “gatherers” in U of Colorado’s Harvest System –3rd-party harvesting interfaces raise more T&C and data coherency issues

22 302 Load Balancing Interactive users on main DL machine should not be impacted by metadata harvesting –don’t take deliveries through the front door –not part of the protocol; defined outside the protocol [Figure: the harvester sends http://blah/oai/?verb=ListIdentifiers to the OAI Server at naca.larc.nasa.gov/oai/; if load > 0.05 that server redirects the request with HTTP Status Code 302, and the harvester repeats http://blah/oai/?verb=ListIdentifiers against the OAI Server at buckets.dsi.internet2.edu/naca/oai/]
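The redirect decision can be sketched as a pure function. The 0.05 threshold and the mirror host come from the slide; the function name route, the http:// scheme on the mirror, and the way load is supplied are assumptions of this sketch.

```python
# Mirror location taken from the slide; the http:// scheme is assumed.
MIRROR = "http://buckets.dsi.internet2.edu/naca/oai/"


def route(query_string, load, threshold=0.05, mirror=MIRROR):
    """Decide whether to serve a harvest request or redirect it.

    Returns (status, headers): 302 with a Location header when the
    machine is busy, otherwise 200 with no extra headers.
    """
    if load > threshold:
        return 302, {"Location": f"{mirror}?{query_string}"}
    return 200, {}


if __name__ == "__main__":
    print(route("verb=ListIdentifiers", load=0.40))
    print(route("verb=ListIdentifiers", load=0.01))
```

On a Unix host the load figure could come from something like os.getloadavg(), but how the original server measured load is not stated in the slides.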

23 OAI Observation: Data Coherency In the interest of OAI implementer simplicity, several issues are left for the service provider to interpret –what is an update vs. addition? in the NACA OAI interface, they are reported as the same and it's up to the harvesting system to figure it out –deletions? it is currently optional for OAI systems to mark records as deleted or not… –still left to the harvester to interpret

24 OAI Observation: Harvest Model Frequency of harvests –all-at-once harvests? initial harvest resolving data coherency –frequent incremental harvests? far more efficient for both service and data providers Webcrawling vs. digital library models –webcrawlers: little to no a priori information about target –DLs: frequent harvesting of a small number of known targets Realization: we know very little about how harvesting behavior… –are we optimizing for all-at-once, when incremental will be more common?
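A minimal sketch of the difference between an all-at-once harvest and an incremental one, using the from argument as a lower bound on datestamps. The record tuples, identifiers and function names are hypothetical; dates are ISO strings, which compare correctly as plain strings.

```python
def incremental_request_args(last_harvest_date, metadata_prefix="oai_dc"):
    """Arguments for the next harvest; no 'from' means all-at-once."""
    args = {"metadataPrefix": metadata_prefix}
    if last_harvest_date is not None:
        args["from"] = last_harvest_date
    return args


def select_records(records, from_date=None):
    """Toy data-provider side: keep (identifier, datestamp) pairs whose
    datestamp is on or after from_date."""
    return [r for r in records if from_date is None or r[1] >= from_date]


if __name__ == "__main__":
    repository = [
        ("oai:example.org:1", "2003-01-10"),
        ("oai:example.org:2", "2003-02-01"),
        ("oai:example.org:3", "2003-02-03"),
    ]
    print(len(select_records(repository)))                # initial harvest
    print(len(select_records(repository, "2003-02-01")))  # incremental
```

Because an update and an addition both surface as a record with a newer datestamp, this sketch cannot tell them apart either, which is the data coherency issue raised on the previous slide.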

25 Other Uses For the OAI-PMH Assumptions: –Traditional DLs / SPs will continue on their present path of increasing sophistication citation indexing, search results viz, personalization, recommendations, subject-based filtering, etc. –growth rates remain the same (5x DPs as SPs) Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state –Future opportunities are possible by creatively interpreting the OAI-PMH data model

26 OAI-PMH Data Model [Figure: resource = David; item = all available metadata about David; records = Dublin Core metadata, MARC metadata, SPECTRUM metadata] item = identifier record = identifier + metadata format + datestamp set-membership is item-level property
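The data model can be sketched with two small classes: an item is keyed by its identifier and carries set membership, while each record adds a metadata format and a datestamp. The class and field names, the identifier oai:example.org:david, the set name and the prefix strings are hypothetical illustrations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    # record = identifier + metadata format + datestamp
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    # item = identifier; set membership is an item-level property
    identifier: str
    sets: set = field(default_factory=set)
    records: dict = field(default_factory=dict)  # metadataPrefix -> Record

    def add_record(self, metadata_prefix, datestamp, metadata):
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record


if __name__ == "__main__":
    david = Item("oai:example.org:david", sets={"sculpture"})
    david.add_record("oai_dc", "2003-02-03", "<dc>...</dc>")
    david.add_record("oai_marc", "2003-01-15", "<marc>...</marc>")
    print(sorted(david.records))
```

Keeping sets on the item rather than on each record means every metadata format of the same item is harvested under the same set, matching the item-level rule on the slide.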

27 Typical Values repository –collection of publications resource –scholarly publication item –all metadata (DC + MARC) record –a single metadata format datestamp –last update / addition of a record metadata format –bibliographic metadata format set –originating institution or subject categories

28 Repositories… Stretching the idea of a repository a bit: –contextually sensitive repositories “personalization for harvesters” communication between strangers, or communication between friends? –OAI-PMH for individual complex objects? OAI-PMH without MySQL?! –Fedora, Multi-valent documents, buckets –tar, jar, zip, etc. files

29 Resource What if resource were: –computer system status uptime, who, w, df, ps, etc. –or generalized “system” status e.g., sports league standings –people personnel databases authority files for authors

30 Item What if item were: –software union of versions + formats –all forms of metadata administrative + structural citations, annotations, reviews, etc. –data e.g., newsfeeds and other XML expressible content –metadataPrefixes or sets could be defined to be different versions

31 Record What if record were: –specific software instantiations / updates –access / retrieval logs for DLs (or computer systems) –push / pull model inversion put a harvester on the client behind a firewall, the client contacts a DP and receives “instructions” on how to submit the desired document (e.g., send email to a specified address)

32 Datestamp semantics of datestamp are strongly influenced by the choice of resource / item / record / metadataPrefix, but it could be used to: –signify change of set membership (e.g., workflow: item moves from “submitted” to “approved”) –change datestamp to reflect access to the DP e.g., in conjunction with metadataPrefixes of “accessed” or “mirrored”

33 metadataPrefix what if metadataPrefix were: –instructions for extracting / archiving / scraping the resource verb=ListRecords&metadataPrefix=extract_TIFFs –code fragments to run locally (harvested from a trusted source!) –XSLT for other metadataPrefixes branding container is at the repository-level, this could be record- or item-level

34 Set sets are already used for tunneling OAI-PMH extensions (see Suleman & Fox, D-Lib 7(12)) other uses: –in aggregators, automatically create 1 set per baseURL –have “hidden” sets (or metadataPrefix) that have administrative or community-specific values (or triggers) set=accessed>1000&from=2001-01-01 set=harvestMeWithTheseARGS&until=2002-05-05&metadataPrefix=oai_marc

35 Interesting Services DP9 –gateway to expose repository contents in HTML suitable for web crawlers Celestial –OAI “cache”, also 1.1 -> 2.0 converter Static (mini-) repositories –XML files, based on OLAC work OpenURL metadata format registries –record = metadata format

