The NSDL, OAI and Your Metadata Core Infrastructure Metadata Repository (“union catalog”) Naomi Dushay Cornell University.


1 The NSDL, OAI and Your Metadata Core Infrastructure Metadata Repository (“union catalog”) Naomi Dushay Cornell University

2 OAI in the NSDL Infrastructure
[Diagram. Labels: Your collection’s metadata; Your collection’s OAI server; NSDL Metadata Repository (MR), holding your collection’s metadata, scrubbed & normalized; NSDL MR OAI server; NSDL Search Service (http://nsdl.org); NSDL Archive Service; other OAI Services.]

3 The Metadata Repository
- Designed to be scalable
- Based on an automated harvest/expose model, with OAI at each end
- A notion of “normalized” metadata, with Qualified Dublin Core as its base

4 Why do we normalize metadata?
- Improve services (e.g. search results, or UI display)
- Improve metadata quality, when possible
- Enhance predictability of data for reharvesting services

5 How do we normalize metadata?
- Perform “safe” transforms to “smarten up” metadata
- XSL stylesheets: from your XML metadata to our normalized XML metadata
- Principles:
  - Do no harm (don’t lose information)
  - Add information, when possible
    - Indicate schemes for valid values
  - Remove meaningless text
    - “…”, “not available”, “-”
    - Empty elements
  - Correct wrong information
    - “text/pdf” → “application/pdf”
  - Remove characters that impede functionality or display
    - Encoding fixes (e.g. “&”, double XML encodings, bad UTF-8 …)
    - Scrub URLs
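A minimal Python sketch of two of the principles above: removing meaningless text and correcting a known-wrong value. The slides say the real transforms are XSL stylesheets, so this is only an illustration; the function name, the placeholder list, and the correction table are assumptions.

```python
# Illustrative only: the NSDL transforms are XSL stylesheets, not Python.
# PLACEHOLDERS and CORRECTIONS are assumed example tables.
PLACEHOLDERS = {"...", "…", "not available", "-"}
CORRECTIONS = {"text/pdf": "application/pdf"}


def normalize_values(values):
    """Drop empty or placeholder values and apply known corrections."""
    cleaned = []
    for value in values:
        text = value.strip()
        if not text or text.lower() in PLACEHOLDERS:
            continue  # remove meaningless text and empty elements
        cleaned.append(CORRECTIONS.get(text, text))
    return cleaned
```

Values that match neither table pass through unchanged, which keeps the transform on the "do no harm" side.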

6 Automated MR ingest process
- Your collection info and harvesting info are registered
- OAI validation: can we run our harvester on your OAI server? (see handout)
- OAI harvest of your metadata (nsdl_dc if available; oai_dc if not...)
- XML schema validation of all of your metadata
- UTF-8 encoding validation; bad UTF-8 characters are made into harmless ones
- Normalized nsdl_dc is created
- Your metadata, “raw” and normalized, is loaded into the MR tables and made available to the NSDL’s MR OAI server
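The UTF-8 step can be sketched in Python as follows. The slides do not say which "harmless" character the MR substitutes; this sketch assumes the Unicode replacement character U+FFFD, and the function name is hypothetical.

```python
from typing import Tuple


def scrub_utf8(raw: bytes) -> Tuple[str, bool]:
    """Decode bytes as UTF-8; report validity and replace bad sequences."""
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Assumption: U+FFFD stands in for each undecodable sequence.
        return raw.decode("utf-8", errors="replace"), False
```

Returning the validity flag alongside the text lets a caller both load the scrubbed record and notify the collection that its encoding needs attention.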

7 Automated MR ingest process
[Diagram. Labels: NSDL Collection Registration; “raw” or “native” metadata; Validation; Notify collection of problems (may need to halt processing); Metadata Repository; Your collection’s OAI server; NSDL MR OAI server; OAI Harvest; Normalize; Validation; normalized metadata.]

8 OAI-PMH: Key points
- OAI-PMH requests are embedded in HTTP
  - it’s a web service, not a flat file
  - XML, not HTML
- multiple metadata formats are allowed
  - OAI ≠ simple DC only!
  - Each metadata format MUST have a valid XML schema
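To make "embedded in HTTP" concrete, here is a small Python sketch that builds an OAI-PMH GET request URL from a verb and its arguments. The helper name and the example base URL are assumptions; the verb and argument names come from OAI-PMH itself.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, arguments=None):
    """Build an OAI-PMH GET request URL from a verb and its arguments."""
    params = {"verb": verb}
    params.update(arguments or {})
    # urlencode applies URL encoding to every argument value
    return base_url + "?" + urlencode(params)
```

For example, `oai_request_url("http://example.org/oai", "ListRecords", {"metadataPrefix": "nsdl_dc"})` asks a (hypothetical) server for records in a format other than simple DC. Arguments are passed as a dict because `from` is a reserved word in Python.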

9 Metadata Formats and Schemas
Simple Dublin Core, OAI flavor
  XML namespace: http://www.openarchives.org/OAI/2.0/oai_dc/
  XML Schema location: http://www.openarchives.org/OAI/2.0/oai_dc.xsd
  OAI metadataPrefix: oai_dc
Qualified Dublin Core, latest NSDL flavor
  XML namespace: http://ns.nsdl.org/nsdl_dc_v1.02/
  XML Schema location: http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.02.xsd
  OAI metadataPrefix: (As you like; we use “nsdl_dc”)
Your format
  XML namespace: (An appropriate URI)
  XML Schema location: (URL for an XML schema)
  OAI metadataPrefix: (As you like)

10 MR ingest requires: compliant OAI 2.0 server
- Correctly implements OAI-PMH; queries to all verbs respond correctly
- Every OAI response must be (deeply) XML schema valid
- Encodes properly in proper places
  - XML encoding
  - URL encoding
  - UTF-8 encoding
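The three encodings apply at different layers, which a short Python sketch can show. The function names and sample values are assumptions chosen for illustration; the library calls are from the Python standard library.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def xml_text(value):
    """XML encoding: escape &, < and > in element content."""
    return escape(value)


def url_argument(value):
    """URL encoding: percent-encode a value placed in a request URL."""
    return quote(value, safe="")


def utf8_bytes(document):
    """UTF-8 encoding: serialize the finished response as UTF-8 bytes."""
    return document.encode("utf-8")
```

A title such as `Rocks & Minerals` needs XML encoding inside a response, while an identifier containing `:` or `/` needs URL encoding only when it travels as a request argument.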

11 OAI 2.0 – Identify
- baseURL
- email address
- protocol version
- description for OAI identifier syntax, especially if adhering to oai-identifier syntax described in Implementation Guidelines

12 OAI 2.0 – ListMetadataFormats
- correct XML namespace for each format
- a valid XML schema for each format
  - targetNamespace MUST match XML namespace above
- super easy out: use oai_dc
- easy out: use nsdl_dc
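One way to check the targetNamespace rule is to compare what a ListMetadataFormats response declares with what the schema document itself says. The Python sketch below does this against inline sample documents, which are made up for illustration (a real check would fetch both over HTTP); the function names are assumptions.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Made-up sample documents standing in for real HTTP responses.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

SAMPLE_SCHEMA = """<schema xmlns="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"/>"""


def declared_namespaces(list_formats_xml):
    """Map each metadataPrefix to its declared metadataNamespace."""
    root = ET.fromstring(list_formats_xml)
    result = {}
    for fmt in root.iter(OAI + "metadataFormat"):
        prefix = fmt.findtext(OAI + "metadataPrefix")
        result[prefix] = fmt.findtext(OAI + "metadataNamespace")
    return result


def target_namespace(schema_xml):
    """Return the targetNamespace attribute of an XML schema document."""
    return ET.fromstring(schema_xml).get("targetNamespace")


def namespace_matches(list_formats_xml, prefix, schema_xml):
    """True when the declared namespace equals the schema's targetNamespace."""
    declared = declared_namespaces(list_formats_xml).get(prefix)
    return declared is not None and declared == target_namespace(schema_xml)
```

The comparison is an exact string match, so a missing trailing slash counts as a mismatch.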

13 OAI 2.0 – ListSets
- super easy out: if all your metadata is NSDL relevant, don’t use sets for our sake.
- if you want the NSDL to harvest only SOME of your OAI server’s metadata, then use sets.
  - We will harvest only the sets you specify … but our default is to harvest all of them.
- super easy setSpec strings: use only alphanumeric characters
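The "super easy" setSpec advice amounts to a one-line check, sketched here in Python. The function name is an assumption. OAI-PMH itself permits more than this (for example, colon-separated hierarchies), so the check is deliberately stricter than the protocol.

```python
import re

# Stricter than OAI-PMH requires: letters and digits only.
SIMPLE_SETSPEC = re.compile(r"[A-Za-z0-9]+")


def is_simple_setspec(set_spec):
    """True when a setSpec uses only alphanumeric characters."""
    return SIMPLE_SETSPEC.fullmatch(set_spec) is not None
```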

14 OAI 2.0 – ListRecords
- Every metadata record served must (deeply) validate to its indicated XML schema
- If used, resumptionTokens must be implemented properly
  - resumptionToken (RT) is an exclusive argument
  - Last response has an empty RT
- Selective Harvesting works properly
  - “from” and “until” arguments do limit the results appropriately
  - “set” arguments do limit the results appropriately, if implemented
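From the harvester's side, the two resumptionToken rules look roughly like the Python loop below. This is a sketch, not the NSDL harvester: `fetch` is a caller-supplied function (an assumption) that takes a dict of request arguments and returns the response XML as a string, and OAI error responses are not handled.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_records(fetch, metadata_prefix, set_spec=None):
    """Yield every <record> element from a ListRecords harvest."""
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        arguments["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(arguments))
        for record in root.iter(OAI + "record"):
            yield record
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # Exclusive argument: send the token with the verb and nothing else.
        arguments = {"verb": "ListRecords",
                     "resumptionToken": token.text.strip()}
```

A server that returns a non-empty token on its last page, or that expects `metadataPrefix` to be repeated alongside the token, would break a loop written this way.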

15 Common Points of Confusion - 1
about the metadata vs. about the resource
- identifiers: OAI vs. DC
  - record/header/identifier vs. record/metadata/../dc:identifier
- dates: OAI vs. DC
  - record/header/datestamp vs. record/metadata/../dc:date
- OAI about containers are about the metadata
  - rights: OAI about vs. DC
  - record/about/../(dc:rights?) vs. record/metadata/../dc:rights
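The distinction is easier to see by pulling both kinds of value out of one record. The Python sketch below uses a made-up sample record; the function name and the sample values are assumptions, while the element paths follow the slide.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up record: header values describe the metadata record,
# dc: values describe the resource.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:rec-42</identifier>
    <datestamp>2004-06-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>http://example.org/resources/42.pdf</dc:identifier>
      <dc:date>1998-03-15</dc:date>
    </oai_dc:dc>
  </metadata>
</record>"""


def split_record(record_xml):
    """Separate values about the metadata from values about the resource."""
    record = ET.fromstring(record_xml)
    metadata = record.find("oai:metadata", NS)
    return {
        "about_metadata": {
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        },
        "about_resource": {
            "identifiers": [e.text for e in metadata.findall(".//dc:identifier", NS)],
            "dates": [e.text for e in metadata.findall(".//dc:date", NS)],
        },
    }
```

In the sample, the datestamp says when the metadata record last changed, while dc:date says something about the PDF it describes; the two need not agree.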

16 OAI identifiers
- Must uniquely identify individual metadata records at your site for OAI harvest and OAI reharvest
- Must stay the same for your metadata records
  - metadata is updated; OAI identifier unchanged

17 Common Points of Confusion - 2
Dates
- format confusion
  - OAI dates must be encoded as ISO8601 and must be in UTC (≈ GMT)
  - OAI-PMH allows YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ.
  - DC date encoding: “Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and follows the YYYY-MM-DD format.”
- (All OAI-PMH responses)
  - Time when OAI server responds to a request
  - OAI-PMH sez: ‘must be the time and date of the response in UTC. This is encoded using the "Complete date plus hours, minutes, and seconds" variant of ISO8601. This format is YYYY-MM-DDThh:mm:ssZ.’
- (OAI-PMH / )
  - “from” and “until” arguments in OAI requests
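A short Python sketch of both halves of the date rules: producing a UTC timestamp in the YYYY-MM-DDThh:mm:ssZ form, and checking that a string has one of the two shapes OAI-PMH allows. The function names are assumptions, and the regular expression checks shape only, not whether the date exists on the calendar.

```python
import re
from datetime import datetime, timezone

# Shape check only: YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ.
OAI_DATESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?")


def utc_datestamp(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def is_oai_datestamp(text):
    """True when text has one of the two shapes OAI-PMH allows."""
    return OAI_DATESTAMP.fullmatch(text) is not None


def response_time_now():
    """Current time in the form required for the time of a response."""
    return utc_datestamp(datetime.now(timezone.utc))
```

Converting to UTC before formatting matters: a local time late in the evening can fall on the next calendar day in UTC, which changes what a `from` or `until` argument selects.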

18 When a Collection Deletes Records
- if not indicated in OAI server
  - incremental harvest for MR never shows update; MR copy never deleted!
- if indicated in OAI server transiently
  - reharvested soon enough
  - not reharvested soon enough: incremental harvest for MR never shows update; MR copy never deleted!
- if indicated in OAI server persistently
  - MR finds delete on incremental harvest

19 Deleted Records – Our Solution
“Full reharvest”
1. Mark all the site’s records in MR “deleted”
2. Harvest all metadata records for the collection
3. As we ingest each newly retrieved record into the MR, if we over-write an old record, “un-delete” it.
- Expensive
  - network bandwidth
  - processing time
- Okay for small collections (under ~15,000)
- Okay for metadata that changes infrequently
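The three "full reharvest" steps can be sketched in Python against an in-memory stand-in for the repository. The data shapes are assumptions of this sketch, not the MR's real tables: `repository` maps an OAI identifier to a dict with `metadata` and `deleted` keys, and `harvested` is an iterable of (identifier, metadata) pairs.

```python
def full_reharvest(repository, harvested):
    """Apply the three full-reharvest steps to an in-memory repository."""
    # Step 1: mark all of the site's records as deleted.
    for entry in repository.values():
        entry["deleted"] = True
    # Steps 2 and 3: ingest each harvested record; overwriting an old
    # record (or adding a new one) leaves it un-deleted.
    for identifier, metadata in harvested:
        repository[identifier] = {"metadata": metadata, "deleted": False}
    return repository
```

Records the collection no longer serves are never overwritten, so they keep the deleted mark from step 1. The cost noted on the slide is visible here too: every record is touched twice, and every record is transferred again.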

20 In an ideal world, we’d like
- nsdl_dc
  - Information about nsdl_dc, example records and its XML schemas is in the NSDL Metadata Primer.
- Persistent deleted records
- OAI identifier syntax, per OAI Implementation Guidelines

