1 Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) Alon Kadury.

1 1 Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) Alon Kadury

2 2 Content: Reminders, History, OAI overview, Technical introduction, Conclusions, Demonstrations, Resources

3 3 Definition- A Digital Library is a: 1. Collection of digital objects 2. Collection of knowledge structures 3. Collection of library services 4. Domain/Focus/Topic 5. Quality Control 6. Preservation/Persistence

4 4 Types of DLs Single Digital Library (SDL) – also Stand-alone, Self-contained Federated Digital Library (FDL) – also confederated, distributed Harvested Digital Library (HDL)

5 5 Single Digital Library (SDL) A regular DL Self-contained material: – purchased – scanned/digitized Usually localized

6 6 Federated Digital Library (FDL) Contains many autonomous libraries Usually heterogeneous repositories Connected via network Forms a virtual distributed library Transparent user interface The major problem is interoperability.

7 7 Harvested Digital Library (HDL) Does not contain data, just metadata Objects harvested into summaries Regular DL characteristics: – fine granularity – rich library services – high quality control – annotated

8 8 History As the Web evolved, the number of Web sites and search engines increased. A similar process happened with e-prints and digital libraries. The growth in the number of DLs led to the development of the OAI-PMH protocol, as we are about to see.

9 9 History - Problems The development of e-prints and digital libraries led to several problems, such as: Many user interfaces - Each DL offered its own Web interface for depositing articles and for end-user searches. The result: it was difficult for end users to work across archives without learning multiple different interfaces.

10 10 History - Problems Different query syntaxes - The result: it was difficult for users to keep track of the search syntax of each SDL, and difficult to create an FDL that could query many SDLs. Many metadata formats - SDL metadata could be kept in any format the SDL wanted. The result: a hard time for the FDLs, which had to know the formats of each SDL they were harvesting.

11 11 History – Possible solutions The problems led researchers to recognise the need for a single search interface to all archives - the Universal Pre-print Service (UPS). Two possible approaches to building the UPS were considered:

12 12 History – Solution 1 Cross-searching multiple archives: In this approach a client sends requests to several servers and then combines the data. The client and the servers work with a known and agreed protocol (for example Z39.50). However, studies showed that this is not the preferred approach for distributed searching across large numbers of nodes, mainly because of problems such as knowing which collections to search and performance issues.

13 13 History – Solution 2 Harvesting metadata into a ‘Central Server’: This approach harvests the metadata and stores it in a central server, on which searches are made. The idea was demonstrated in a convention held at Santa Fe NM, October 21-22, 1999. UPS was soon renamed the Open Archives Initiative (OAI): http://www.openarchives.org/ More reading: http://www.dlib.org/dlib/february00/02contents.html

14 14 OAI overview- definitions Let's start with a few definitions: Interoperability Open Archives Initiative (OAI) Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

15 15 OAI overview- definitions What is Interoperability? Interoperability refers to the ability of two or more systems to interact with one another and exchange data according to a prescribed method in order to achieve predictable results.

16 16 OAI overview- definitions In order to exchange data we need to agree on things like: – requests format – results format – transport protocols (HTTP vs FTP vs….) – Metadata formats (DC vs MARC vs…) – Usage rights (who can do what with the records) We need someone to organize it and “set the rules”.

17 17 OAI overview- definitions Who will organize it? Open Archives Initiative - “The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (http://www.openarchives.org/organization/index.html)

18 18 OAI overview- definitions What will the interoperability standard be called? Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

19 19 OAI overview- Key players When talking about OAI-PMH we see three main players: 1. Data Providers 2. Service Providers 3. The protocol (OAI-PMH)

20 20 OAI overview- Data Provider Data Provider: – Handles deposit/publishing of resources in archive. – Expose metadata about resources in archive (using the OAI-PMH protocol\interface). – Data Providers may support any metadata format, but must support the metadata format Dublin Core (DC). – Offer free access to the archives (at least the metadata). – A network accessible server, able to process OAI-PMH requests correctly is often called a Repository.

21 21 OAI overview - Service Provider Service Provider: – Harvest metadata from data providers and use it to offer single user-interface across all harvested metadata. – May enrich metadata. – Offer (value-added) services on the basis of the metadata. – Client application issuing OAI-PMH requests is often referred to as a Harvester.

22 22 OAI overview - Providers
                     Service Provider                        Data Provider
End user interface   Has user interface                      Might have user interface
Contains             Metadata only                           Items & metadata
OAI interface        ?                                       must
Offers data from     Harvests metadata from data providers   Its own resources

23 23 OAI overview - Providers [Diagram: Data Providers, each with an input interface, a native harvesting interface and an optional native end-user interface (e.g., RePEc); a Service Provider with a native harvesting interface and a native end-user interface.]

24 24 OAI overview - Providers Data providers Service providers Harvesting based on OAI-PMH

25 25 OAI overview - Model Web Layer 1 SDL Layer 2 OAI-PMH Layer 4 Layer 3 Service Provider - FDL\HDL Web interfaces

26 26 Technical introduction Since the Santa Fe convention, the protocol has gone through several versions. Version 2.0 is the latest and is considered stable. The technical introduction refers to this version.

27 27 Tech ’ - protocol versions
            Santa Fe convention   OAI-PMH v.1.0/1.1         OAI-PMH v.2.0
nature      experimental          experimental              stable
verbs       Dienst                OAI-PMH                   OAI-PMH
requests    HTTP GET/POST         HTTP GET/POST             HTTP GET/POST
responses   XML                   XML                       XML
transport   HTTP                  HTTP                      HTTP
metadata    OAMS                  unqualified Dublin Core   unqualified Dublin Core
about       eprints               document like objects     resources
model       metadata harvesting   metadata harvesting       metadata harvesting

28 28 Tech ’ - request & response The requests of the protocol are HTTP based. The response contents of the protocol are XML based. Question: why? Answer: – Simple protocol based on existing standards which allows rapid development & effortless implementation. – Systems can be deployed in variety of configurations. – Low barrier interoperability specification. – Internet/Firewall friendly.

29 29 Tech ’ - request & response There are six request types, which are called verbs. The request type and additional information are passed as parameters using the HTTP POST or GET method. [Diagram: the Harvester (Service Provider) sends requests (based on HTTP) to the Repository (Data Provider), which returns metadata (encoded in XML).]
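
To make the request format concrete, here is a minimal Python sketch of how a harvester might assemble an HTTP GET request. The helper name build_request_url is hypothetical, and the base URL is the illustrative archive.org/oai-script address used on the later slides; nothing here is taken from a real harvester implementation.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs named in this presentation.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}


def build_request_url(base_url, verb, **arguments):
    """Return the HTTP GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # The verb and any additional arguments travel as ordinary query parameters.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


url = build_request_url("http://archive.org/oai-script", "ListRecords",
                        metadataPrefix="oai_dc", set="biology")
print(url)
```

Because from is a reserved word in Python, a from argument would have to be passed by unpacking a dictionary, for example build_request_url(base, "ListRecords", **{"metadataPrefix": "oai_dc", "from": "2002-12-01"}).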

30 30 Let's see a demonstration of how we can create an FDL, and then we will look behind the scenes. Demo

31 31 Tech ’ – more definition [Diagram: a Service Provider (Harvester) sends requests to several Data Providers (Repositories): e-prints, Images, OPAC, Museum, Archive.] Requests: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord. Responses: general information, metadata formats, set structure, record identifier, metadata.

32 32 Tech ’– Request Types Six different request types 1. Identify 2. ListMetadataFormats 3. ListSets 4. ListIdentifiers 5. ListRecords 6. GetRecord Harvester does not have to use all types. Repository must implement all request types fully (all required and optional arguments for each of the requests).

33 33 Tech ’ - Request Type: Identify Function: retrieve description and general information about an archive. Example: archive.org/oai-script?verb=Identify Parameters: none. Errors / exceptions: badArgument, e.g. archive.org/oai-script?verb=Identify&set=biology

34 34 Tech ’ - Request Type: Identify Response format
Element             Example                               #
repositoryName      My Archive                            1
baseURL             http://archive.org/oai                1
protocolVersion     2.0                                   1
earliestDatestamp   1999-01-01                            1
deletedRecord       no, transient, persistent             1
granularity         YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ      1
adminEmail          oai-admin@archive.org                 +
compression         deflate, compress, …                  *
description         oai-identifier, eprints, friends, …   *
(# is the number of occurrences: 1 = exactly once, + = one or more, * = zero or more)
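
As an illustration of the Identify response format, the following Python sketch parses a hand-written sample response with the standard library. The sample XML is an assumption assembled from the example values on this slide, not output from a real repository, and parse_identify is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample built from the example values on the slide.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T12:00:00Z</responseDate>
  <request verb="Identify">http://archive.org/oai</request>
  <Identify>
    <repositoryName>My Archive</repositoryName>
    <baseURL>http://archive.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>oai-admin@archive.org</adminEmail>
    <earliestDatestamp>1999-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Collect the children of <Identify> into a dict of lists."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    info = {}
    for child in identify:
        name = child.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
        # Lists are used because some elements (e.g. adminEmail) may repeat.
        info.setdefault(name, []).append(child.text)
    return info


info = parse_identify(SAMPLE_IDENTIFY)
print(info["repositoryName"], info["protocolVersion"])
```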

35 35 Tech ’ - Request Type: Identify Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=Identify

36 36 Tech ’ - Request Type: ListMetadataFormats Function: retrieve available metadata formats from archive. Remember that each archive must implement at least DC. Example: archive.org/oai-script?verb=ListMetadataFormats Parameters: identifier (optional). Errors / exceptions: badArgument; idDoesNotExist, e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier; noMetadataFormats

37 37 Tech ’ - Request Type: ListMetadataFormats Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=ListMetadataFormats

38 38 Tech ’ - Request Type: ListSets Q: What are sets? A: Sets are logical partitions of a repository. Q: Why use sets? A: Sets are intended to enable selective harvesting. Data providers don’t have to define sets. Sets are not strictly hierarchical.

39 39 Tech ’ - Request Type: ListSets Function: retrieve set structure of a repository. Example: archive.org/oai-script?verb=ListSets Parameters: resumptionToken (exclusive). Errors / exceptions: badArgument; badResumptionToken, e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token; noSetHierarchy

40 40 Tech ’ - Request Type: ListSets Response in XML format http://theses.lub.lu.se/oai-service/xerxes/?verb=ListSets

41 41 Tech ’ - Request Type: ListIdentifiers Function: abbreviated form of ListRecords, retrieving only headers. Example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01 Parameters: from (optional), until (optional), metadataPrefix (required), set (optional), resumptionToken (exclusive). Errors / exceptions: badArgument, e.g. …&from=2002-12-01-13:45:00; badResumptionToken; cannotDisseminateFormat; noRecordsMatch; noSetHierarchy

42 42 Tech ’ - Request Type: ListIdentifiers Response in XML format http://theses.lub.lu.se/oai-service/xerxes/?verb=ListIdentifiers&metadataPrefix=oai_dc

43 43 Tech ’ - Request Type: ListRecords Function: harvest records from a repository. Example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology Parameters: from (optional), until (optional), metadataPrefix (required), set (optional), resumptionToken (exclusive). Errors / exceptions: badArgument; badResumptionToken; cannotDisseminateFormat; noRecordsMatch; noSetHierarchy
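
The parameter rules shared by ListIdentifiers and ListRecords (metadataPrefix required, resumptionToken exclusive) can be sketched as a small check on the repository side. This is only an illustration under assumptions: the BadArgument exception class and the check_list_arguments function are hypothetical names standing in for the protocol's badArgument error condition.

```python
class BadArgument(Exception):
    """Hypothetical stand-in for the protocol's badArgument error condition."""


# Arguments accepted by ListIdentifiers and ListRecords, besides the verb.
ALLOWED = {"from", "until", "metadataPrefix", "set", "resumptionToken"}


def check_list_arguments(args):
    """Raise BadArgument if the argument combination breaks the slide's rules."""
    unknown = set(args) - ALLOWED
    if unknown:
        raise BadArgument(f"illegal argument(s): {sorted(unknown)}")
    if "resumptionToken" in args:
        # "Exclusive" means no other argument may accompany the token.
        if len(args) > 1:
            raise BadArgument("resumptionToken must be the only argument")
        return
    if "metadataPrefix" not in args:
        raise BadArgument("metadataPrefix is required")


# A valid request passes silently; an invalid one raises BadArgument.
check_list_arguments({"metadataPrefix": "oai_dc", "set": "biology"})
try:
    check_list_arguments({"resumptionToken": "anyID1", "set": "biology"})
except BadArgument as error:
    print("badArgument:", error)
```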

44 44 Tech ’ - Request Type: GetRecord Function: retrieve individual metadata record from a repository. Example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc Parameters: identifier (required), metadataPrefix (required). Errors / exceptions: badArgument; cannotDisseminateFormat; idDoesNotExist
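
The error conditions listed for each verb come back inside the XML response as error elements carrying a code attribute. The sketch below shows one way a harvester might detect them; the sample response is hand-written for illustration and find_errors is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample of an error response (illustrative only).
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T12:00:00Z</responseDate>
  <request>http://archive.org/oai-script</request>
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""


def find_errors(xml_text):
    """Return a list of (code, message) pairs; empty if the response is fine."""
    root = ET.fromstring(xml_text)
    return [(error.get("code"), error.text)
            for error in root.findall("oai:error", OAI_NS)]


errors = find_errors(SAMPLE_ERROR)
print(errors)
```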

45 45 Tech ’ - Records, items & DC or setting the record straight [Diagram: a resource (David); the item (= identifier) holds all available metadata about David; the records are Dublin Core metadata, MARC metadata and SPECTRUM metadata.]

46 46 Tech ’ - Records, items & DC A record consists of:
1. Header (mandatory): identifier (1), datestamp (1), setSpec elements (*), status attribute for deleted item (?)
2. Metadata (mandatory): XML encoded metadata with root tag and namespace; repositories must support Dublin Core
3. About (optional): rights statements, provenance statements

47 47 Tech ’ - Records, items & DC OAI-PMH supports dissemination of multiple metadata formats from a repository. Properties of metadata formats: id string to specify the format ( metadataPrefix ) metadata schema URL (XML schema to test validity) XML namespace URI (global identifier for metadata format) Repositories must be able to disseminate unqualified DC. Arbitrary metadata formats can be defined and transported via the OAI-PMH. Returned metadata must comply with XML namespace specification.

48 48 Tech ’ - Records, items & DC As mentioned before, the minimum standard is unqualified Dublin Core (http://dublincore.org/). The Dublin Core Metadata Element Set contains 15 elements. All elements are optional. All elements may be repeated. The Dublin Core Metadata Element Set: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
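
To tie together the record structure (header plus metadata) and the repeatable Dublin Core elements, here is a minimal Python sketch that parses one record. The sample record, its identifier and its field values are invented for illustration, and parse_record is a hypothetical helper name; the namespaces are the ones used by OAI-PMH 2.0, oai_dc and Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample record: header plus unqualified Dublin Core metadata.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:archive.org:1</identifier>
    <datestamp>2002-12-01</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example title</dc:title>
      <dc:creator>First Author</dc:creator>
      <dc:creator>Second Author</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Split one record into header fields and Dublin Core values."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    result = {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        "deleted": header.get("status") == "deleted",
        "dc": {},
    }
    dc_root = record.find("oai:metadata/oai_dc:dc", NS)
    if dc_root is not None:
        for element in dc_root:
            name = element.tag.split("}", 1)[1]  # e.g. "title", "creator"
            # Every DC element is optional and repeatable, so keep lists.
            result["dc"].setdefault(name, []).append(element.text)
    return result


parsed = parse_record(SAMPLE_RECORD)
print(parsed["identifier"], parsed["dc"]["creator"])
```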

49 49 Tech ’ - Records, items & DC Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=GetRecord&identifier=oai:CiteSeerPSU:1&metadataPrefix=oai_dc

50 50 Tech ’ - Flow control Some requests can generate a very long response (for example, think about asking CiteSeer or the Library of Congress to list ALL their records using the ListRecords verb). To avoid long responses that would overload the server, a flow control mechanism was added to the protocol. Splitting long responses into shorter ones is solely the server's responsibility; the client has no control over the length of the responses.

51 51 Tech ’ - Flow control The flow control mechanism is referred to as the “resumption token”: the server splits a long response into shorter ones and attaches to the end of each response a token that the client passes on the next request to get the next part.

52 52 Tech ’ - Flow control Exchange between the Harvester (Service Provider) and the Repository (Data Provider):
Harvester: “want to have all your records” archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc
Repository: “have 267, but give you only 100” 100 records + resumptionToken “anyID1”
Harvester: “want more of this” archive.org/oai?verb=ListRecords&resumptionToken=anyID1
Repository: “have 267, give you another 100” 100 records + resumptionToken “anyID2”
Harvester: “want more of this” archive.org/oai?verb=ListRecords&resumptionToken=anyID2
Repository: “have 267, give you my last 67” 67 records + resumptionToken “”
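
The exchange above can be sketched as a harvesting loop in Python. To keep the example runnable without a network, a fake in-memory repository stands in for the real server; fake_list_records, harvest_all and the numeric token format are all assumptions made for illustration (real tokens are opaque strings chosen by the repository).

```python
def fake_list_records(resumption_token=None, total=267, page_size=100):
    """Hypothetical stand-in for a repository answering ListRecords.

    Returns (records, next_token); an empty token means the list is complete.
    """
    start = int(resumption_token) if resumption_token else 0
    end = min(start + page_size, total)
    records = [f"record-{i}" for i in range(start, end)]
    next_token = str(end) if end < total else ""
    return records, next_token


def harvest_all(list_records):
    """Keep requesting with the latest token until an empty token arrives."""
    harvested = []
    token = None  # the first request carries no resumptionToken
    while True:
        records, token = list_records(token)
        harvested.extend(records)
        if not token:
            break
    return harvested


all_records = harvest_all(fake_list_records)
print(len(all_records))  # 267, received as 100 + 100 + 67
```

Note that the loop never chooses the chunk size itself: as the slides point out, only the server decides how to split the response.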

53 53 Conclusions and future use We saw that the increasing number of digital libraries caused problems for the different DL types: – FDLs and HDLs had to overcome various obstacles in order to federate or harvest data from SDLs, for example different metadata formats and different query formats. – Users had to learn the different user interface that each SDL offered.

54 54 Conclusions and future use When looking at the OAI-PMH it seemed that putting the protocol in use will eliminate those problems. Service providers can lower the number of different user interfaces the user needs to handle and federating or harvesting would be much easier using a common standard. However…

55 55 Conclusions and future use When putting the protocol to use in a digital libraries environment, the lack of strict rules may cause new problems or make the old ones reappear in another way. Let's take CiteSeer as an example. It contains 723140 records and its metadata is around 1 GB in size. If one wanted to harvest CiteSeer efficiently for records dealing with a specific topic, how could it be done?

56 56 Conclusions and future use Since searching within the metadata is done on the harvester side, the harvester cannot ask CiteSeer to give it only the records dealing with "network computation", for example. Remember the sets? Could they be used to harvest only part of the information instead of handling a gigabyte of data? The answer is no, since CiteSeer contains only one set.
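
Because the protocol offers no topic query, the filtering has to happen after harvesting. A minimal sketch of such harvester-side filtering follows; it assumes records have already been harvested and reduced to dictionaries of Dublin Core values, and both the sample records and the function names are invented for illustration.

```python
def matches_topic(record, phrase):
    """True if the phrase occurs in the title, subject or description."""
    phrase = phrase.lower()
    for field in ("title", "subject", "description"):
        for value in record.get(field, []):
            if phrase in value.lower():
                return True
    return False


def filter_by_topic(records, phrase):
    """Keep only the harvested records that mention the phrase."""
    return [record for record in records if matches_topic(record, phrase)]


# Invented sample of already-harvested Dublin Core records.
harvested = [
    {"title": ["Network computation on grids"]},
    {"title": ["Protein folding"], "subject": ["biology"]},
]
selected = filter_by_topic(harvested, "network computation")
print(len(selected))  # 1
```

The cost the slide describes is visible here: every record must be downloaded before filter_by_topic can discard it.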

57 57 Conclusions and future use Dublin Core may also set too low a barrier, which leads more and more SDLs to support not only DC but also their own metadata formats (CiteSeer, for example, supports two formats). Nevertheless, OAI-PMH is increasingly becoming a standard in digital libraries and is making a large contribution to DLs; from the looks of it, it’s here to stay.

58 58 What's next Riddle: – Improving harvesting and creation of HDLs. – Composition of HDLs.

59 59 What's next Web Layer 1 SDL Layer 2 OAI-PMH Layer 4 CHDL Layer 3 HDL Layer 5 Web interfaces

60 60 Demonstration Independent queries. Repositories explorer: http://re.cs.uct.ac.za/ OAIster (FDL): http://oaister.umdl.umich.edu/o/oaister/ Scirus (FDL): http://www.scirus.com/srsapp/ Riddle demo: http://riddle.dynalias.com:20055/riddle.html

61 61 Resources OAI official site: http://www.openarchives.org/ Protocol specification: http://www.openarchives.org/OAI/openarchivesprotocol.html General mailing list: http://www.openarchives.org/mailman/listinfo/OAI-general/ Implementers mailing list: http://www.openarchives.org/mailman/listinfo/OAI-implementers/ Presentation which this presentation was based on: http://www.oaforum.org/otherfiles/lisb_tutorial.ppt Z39.50: http://www.loc.gov/z3950/agency/

62 62 Questions

63 63 The end

