Introduction to Digital Libraries Week 10: Metadata Harvesting

Old Dominion University, Department of Computer Science
CS 751/851, Spring 2010
Michael L. Nelson
03/15/10
Several slides borrowed from Van de Sompel, Liu, Warner & Harrison

The Rise and Fall of Distributed Searching
- wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
  - Davis & Lagoze, JASIS 51(3)
  - Powell & French, Proc. 5th ACM DL
- distributed searching of N nodes is still viable, but only for small values of N (<= 10)
  - NCSTRL: N > 100; bad

The Rise and Fall of Distributed Searching
Other problems of distributed searching (from STARTS):
- source-metadata problem: how do you know which nodes to search?
- query-language problem: syntax varies and drifts over time between the various nodes
- rank-merging problem: how do you meaningfully merge multiple result sets?
Temptations:
- centralize all functions: “everything will be done at X”
- standardize on a single product: “everyone will use system Y”

Universal Preprint Service
- Demonstrated at Santa Fe NM, October 21-22, 1999
  - D-Lib Magazine, 6(2) 2000 (2 articles)
  - UPS was soon renamed the Open Archives Initiative (OAI)
- Based on NCSTRL+ software, it is a cross-archive DL that provides services on a collection of content harvested from multiple archives
- NCSTRL+ is a modified version of Dienst
  - support for “clustering”
  - support for “buckets”

UPS Participants totals ca. July 1999

UPS: Metadata Formats
repository: arXiv, CogPrints, NACA, NCSTRL, NDLTD, RePEc
format: internal, Refer, RFC1807, MARC, ReDIF

UPS: Metadata Extraction
Getting metadata out of archives:
- not all archives support metadata extraction
- some archives have undocumented metadata extraction procedures
- not all archives support rich criteria for extraction
  - single dump concept only
- intellectual property and use rights are not always clear

Data and Service Providers
Data Providers
- publishing into an archive
- providing methods for metadata “harvesting”
- also provide non-technical context for sharing information
Service Providers
- harvest metadata from providers
- implement user interface to data
Even if provided by the same DL, these are distinct functions

Result… OAI
- The OAI was the result of the demonstration and discussion during the Santa Fe meeting
- Lots of churn regarding what the OAI was
- OAI-PMH
  - originally a subset of the Dienst (NCSTRL) protocol and originally called the “Santa Fe Convention”
  - originally defined an OAI-specific metadata format

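           Santa Fe convention  OAI-PMH v.1.0/1.1        OAI-PMH v.2.0
nature     experimental         experimental             stable
verbs      Dienst               OAI-PMH                  OAI-PMH
requests   HTTP GET/POST        HTTP GET/POST            HTTP GET/POST
responses  XML                  XML                      XML
transport  HTTP                 HTTP                     HTTP
metadata   OAMS                 unqualified Dublin Core  unqualified Dublin Core
about      eprints              document like objects    resources
model      metadata harvesting  metadata harvesting      metadata harvesting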

Open Archives Initiative
- The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)
- Archive defined as a “collection of stuff”, not the archivist’s definition of “archive”. “Repository” used in most OAI documents.
- Needed a TLA…

OAI-PMH Actors
data providers / repositories:
“A repository is a network accessible server that can process the 6 OAI-PMH requests in the manner described in [the OAI-PMH document]. A repository is managed by a data provider to expose metadata to harvesters.”
service providers / harvesters:
“A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”

Data Providers / Service Providers
[Figure: data providers (repositories) and service providers (harvesters)]

OAI-PMH Data Model
- resource
- item = identifier: all available metadata about David
- set-membership is an item-level property
- record = identifier + metadata format + datestamp
- records: Dublin Core metadata, MARC, SPECTRUM

OAI-PMH characteristics: Typical Repository
entity  property        value           description
Resource                URL             PDF, PS, XML, HTML or other file
Item    identifier      OAI Identifier  DNS-based name of metadata about resource
Item    set membership  LCSH            Library of Congress Subject Heading
Record  metadataPrefix  oai_dc          bibliographic metadata in Dublin Core
Record  datestamp                       modification date of DC record
Record  metadataPrefix  oai_marc        bibliographic metadata in MARC
Record  datestamp                       modification date of MARC record

Overview of OAI-PMH Verbs
Verb                 Function
Identify             description of repository
ListMetadataFormats  metadata formats supported by repository
ListSets             sets defined by repository
ListIdentifiers      OAI unique ids contained in repository
ListRecords          listing of N records
GetRecord            listing of a single record
- Identify, ListMetadataFormats, ListSets: metadata about the repository
- ListIdentifiers, ListRecords, GetRecord: harvesting verbs
- most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
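As an illustration of how the six verbs and their arguments turn into HTTP GET requests, here is a minimal sketch. The base URL `http://repo.example.org/oai` and the helper name `build_request` are assumptions made for this example; they do not come from the slides.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}

def build_request(base_url, verb, **args):
    """Return the HTTP GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"badVerb: {verb!r} is not an OAI-PMH verb")
    params = {"verb": verb}
    for name, value in args.items():
        if value is not None:
            # 'from' is a Python keyword, so callers pass from_ instead.
            params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"

BASE = "http://repo.example.org/oai"  # hypothetical baseURL
print(build_request(BASE, "Identify"))
print(build_request(BASE, "ListRecords", metadataPrefix="oai_dc", from_="2010-03-01"))
```

Note that `urlencode` percent-encodes reserved characters, so an identifier such as `oai:odd.oa.org:z1x2y3` is sent with its colons encoded.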

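Argument Summary
Arguments: metadataPrefix, from, until, set, resumptionToken, identifier
Identify:             (no arguments)
ListMetadataFormats:  identifier (optional)
ListSets:             resumptionToken (exclusive)
ListIdentifiers:      metadataPrefix (required); from, until, set (optional); resumptionToken (exclusive)
ListRecords:          metadataPrefix (required); from, until, set (optional); resumptionToken (exclusive)
GetRecord:            identifier (required); metadataPrefix (required)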

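Error Summary
Abbreviations: BA = badArgument, NMF = noMetadataFormats, IDDNE = idDoesNotExist, BRT = badResumptionToken, NSH = noSetHierarchy, CDF = cannotDisseminateFormat, NRM = noRecordsMatch
Identify:             BA
ListMetadataFormats:  BA, NMF, IDDNE
ListSets:             BA, BRT, NSH
ListIdentifiers:      BA, BRT, NSH, CDF, NRM
ListRecords:          BA, BRT, NSH, CDF, NRM
GetRecord:            BA, IDDNE, CDF
- Generate badVerb on any input not matching the 6 defined verbs
- this is an inversion of the table in section 3.6 of the OAI-PMH specification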

response: no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate> T08:55:46Z</responseDate>
  <request verb="GetRecord"… …>…</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/ </identifier>
        <datestamp> </datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata> ….. </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
note: no HTTP encoding of the OAI-PMH request

response with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate> T08:55:46Z</responseDate>
  <request>…</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
note: with errors, only the correct attributes are echoed in <request>
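A harvester has to look for `<error>` elements before treating a response as data. The following is a minimal sketch of that check using the standard-library XML parser. The `OAIError` class, the `check_errors` helper, and the sample document (including its date and URL) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses, in ElementTree's {uri}tag notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

class OAIError(Exception):
    """Raised when a response carries an <error> element (hypothetical class)."""
    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code

def check_errors(xml_text):
    """Parse a response and raise OAIError for the first <error>, if any."""
    root = ET.fromstring(xml_text)
    errors = root.findall(f"{OAI_NS}error")
    if errors:
        first = errors[0]
        raise OAIError(first.get("code"), (first.text or "").strip())
    return root

# Invented sample response, modelled on the badVerb example.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-03-15T08:55:46Z</responseDate>
  <request>http://repo.example.org/oai</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

try:
    check_errors(SAMPLE)
except OAIError as exc:
    print("repository reported:", exc.code)
```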

harvesting granularity
- mandatory support of YYYY-MM-DD
- optional support of YYYY-MM-DDThh:mm:ssZ
  - other granularities considered, but ultimately rejected
- granularity of from and until must be the same
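The granularity rules can be checked before a request is sent. This is a sketch under the assumption that the harvester already knows the repository's finest granularity (from the Identify response). The function names are hypothetical, and the regular expressions check only the shape of a datestamp, not whether it is a valid calendar date.

```python
import re

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"
_PATTERNS = {
    DAY: re.compile(r"\d{4}-\d{2}-\d{2}"),
    SECONDS: re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"),
}

def granularity_of(stamp):
    """Return which of the two legal granularities a datestamp uses."""
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(stamp):
            return name
    raise ValueError(f"badArgument: unsupported datestamp {stamp!r}")

def check_range(from_=None, until=None, repository_granularity=DAY):
    """Validate from/until against each other and against the repository."""
    used = {granularity_of(s) for s in (from_, until) if s is not None}
    if len(used) > 1:
        raise ValueError("badArgument: from and until use different granularities")
    if SECONDS in used and repository_granularity != SECONDS:
        raise ValueError("badArgument: repository only supports day granularity")
    return True

print(check_range("2010-03-01", "2010-03-15"))
```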

header contains set membership of item
<record>
  <header>
    <identifier>oai:arXiv:cs/ </identifier>
    <datestamp> </datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata> ….. </metadata>
</record>
- eliminates the need for the “double harvest” that 1.x required to get all records and all set information

ListIdentifiers returns headers
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate> T08:55:46Z</responseDate>
  <request verb="…" …>…</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/ </identifier>
      <datestamp> </datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/ </identifier>
      <datestamp> </datestamp>
      <setSpec>physic:exp</setSpec>
    </header>
    ……
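Because ListIdentifiers returns complete headers, a harvester can read identifier, datestamp and set membership in one pass. A minimal parsing sketch follows. The function name and the sample identifiers and dates are invented for illustration; the `status="deleted"` header attribute is the one OAI-PMH 2.0 uses for deleted records.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def parse_headers(xml_text):
    """Return one dict per <header> found in a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    headers = []
    for h in root.iterfind(".//oai:header", NS):
        headers.append({
            "identifier": h.findtext("oai:identifier", default="", namespaces=NS).strip(),
            "datestamp": h.findtext("oai:datestamp", default="", namespaces=NS).strip(),
            "sets": [s.text.strip() for s in h.findall("oai:setSpec", NS) if s.text],
            "deleted": h.get("status") == "deleted",
        })
    return headers

# Invented sample data; identifiers and dates are placeholders.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:0001</identifier>
      <datestamp>2010-03-01</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:0002</identifier>
      <datestamp>2010-03-02</datestamp>
      <setSpec>physic:exp</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

for header in parse_headers(SAMPLE):
    print(header["identifier"], header["sets"], header["deleted"])
```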

Flow Control
ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of:
- resumptionToken: an opaque, archive-defined data string that, when passed back to the archive, allows the response to begin where it left off
  - each archive defines its own resumptionToken syntax; it may have visible semantics or not
- 503 HTTP status code: “retry after”
  - up to the harvester to understand this code and respect it, and up to the archive to enforce it
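On the harvester side, respecting a 503 means reading the HTTP Retry-After header, which may carry either a number of seconds or an HTTP date. The sketch below shows one way to turn that header value into a wait time. The function name `retry_delay` is hypothetical, and the actual HTTP request and sleep are left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_delay(retry_after, now=None):
    """Seconds to wait, given the value of a Retry-After header."""
    now = now or datetime.now(timezone.utc)
    value = retry_after.strip()
    if value.isdigit():
        # Form 1: delay in seconds.
        return int(value)
    # Form 2: an HTTP date such as "Mon, 15 Mar 2010 09:00:30 GMT".
    when = parsedate_to_datetime(value)
    return max(0, int((when - now).total_seconds()))

print(retry_delay("120"))
```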

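resumptionToken scenario: harvesting 2770 records in 3 separate 1000 record “chunks”
(harvester talking to a repository backed by an RDBMS)
harvester -> repository: ListRecords
repository -> harvester: Records, resumptionToken=AXad31
harvester -> repository: ListRecords, resumptionToken=AXad31
repository -> harvester: Records, resumptionToken=pQ22-x
harvester -> repository: ListRecords, resumptionToken=pQ22-x
repository -> harvester: Records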
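The exchange above can be simulated without a network. In this sketch the repository is a plain Python function and the token is simply the next offset. Real repositories choose their own opaque token syntax, and all names here are invented for illustration.

```python
def make_repository(total=2770, chunk=1000):
    """Return a fake list_records(token) callable standing in for a repository."""
    records = [f"record-{i}" for i in range(total)]

    def list_records(resumption_token=None):
        start = int(resumption_token) if resumption_token else 0
        end = min(start + chunk, total)
        # An empty token signals the end of the complete list.
        next_token = str(end) if end < total else ""
        return records[start:end], next_token

    return list_records

def harvest(list_records):
    """Keep issuing requests until an empty resumptionToken comes back."""
    harvested, requests, token = [], 0, None
    while True:
        batch, token = list_records(token)
        harvested.extend(batch)
        requests += 1
        if not token:
            break
    return harvested, requests

records, requests = harvest(make_repository())
print(len(records), "records in", requests, "requests")
```

With the defaults this makes three requests, returning 1000, 1000 and 770 records.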

State in resumptionTokens
- HTTP is stateless
- resumptionTokens allow state information to be passed back to the repository to create a complete list from a sequence of incomplete lists
- EITHER: all state in resumptionToken
- OR: cache result set in repository
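For the "all state in resumptionToken" option, one possible approach is to serialize the original request arguments plus a cursor into the token itself. This sketch uses JSON and URL-safe base64. The encoding is an assumption for illustration only, since the token syntax is entirely up to each repository.

```python
import base64
import json

def encode_token(metadata_prefix, cursor, from_=None, until=None, set_spec=None):
    """Pack the request state into an opaque, URL-safe token."""
    state = {"metadataPrefix": metadata_prefix, "cursor": cursor,
             "from": from_, "until": until, "set": set_spec}
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")

def decode_token(token):
    """Recover the state, or raise ValueError for a token we did not issue."""
    try:
        return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    except (ValueError, UnicodeError) as exc:
        # A repository would answer with a badResumptionToken error here.
        raise ValueError("badResumptionToken") from exc

token = encode_token("oai_dc", cursor=1000, from_="2010-03-01", set_spec="cs")
print(decode_token(token)["cursor"])
```

Because nothing is cached on the server, such a token never needs an expirationDate, but the repository has to re-run the query on every request.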

resumptionToken attributes (1)
- expirationDate: likely to be useful when cache clean-up schedule is known
  - do not specify expirationDate if all state is in the resumptionToken
- badResumptionToken error to be used if resumptionToken has expired
  - may also be used if the request cannot be completed for some other reason, e.g. if repository changes cause the incomplete list to have no records
  - issue badRT’s judiciously; it can invalidate a lot of effort by a lot of harvesters

resumptionToken attributes (2)
- completeListSize and cursor optionally provide information about the size of the complete list and the number of records disseminated so far
- not (currently) widely used
- use consistently if used
- designed for status monitoring
- caveat harvester: completeListSize may be approximate and may be revised

resumptionToken
The only defined use of resumptionToken is as follows:
- a repository must include a resumptionToken element as part of each response that includes an incomplete list;
- in order to retrieve the next portion of the complete list, the next request must use the value of that resumptionToken element as the value of the resumptionToken argument of the request;
- the response containing the incomplete list that completes the list must include an empty resumptionToken element.

Idempotency of “List” Requests (1)
- Purpose is to allow harvesters to recover from lost responses or crashes without starting a large harvest from scratch
- Recover by re-issuing request using resumptionToken from previous request
- IMPLICATION: repository must accept both the most recent resumptionToken issued and the previous one

Idempotency of “List” Requests (2)
- response to a re-issued request must contain all unchanged records
- any changed records will get new datestamps after the time of the initial request
  - changes will be picked up by a subsequent harvest if not included
- [no experience yet with incomplete responses to ListSets or ListMetadataFormats requests]

OAI-PMH 2.0 Registration
- 700+ repositories registered
- ??? unregistered repositories
- DP:SP ~= 5:1
- unregistered because:
  - testing / development
  - not for public harvesting
  - public, but “low-profile”
  - never got around to it…
  - ???

Registration is Nice… …But Not Required
- OAI-PMH is (becoming) the “http” for digital libraries
  - there is no central registry of http servers
  - remember the NCSA “What’s New” page? (ca. 1994)
- There will never be “registration support” in OAI-PMH
  - registries are a type of service provider, built on top of OAI-PMH
  - registration will be an integral part of community building
- Some examples: UIUC, Celestial, Cornell

<friends>
A light-weight, data-provider driven way to communicate the existence of “others”, e.g. …
<description>
  <friends …namespace stuff… >
    <baseURL>…</baseURL>
    <baseURL>…</baseURL>
    <baseURL>…</baseURL>
    <baseURL>…</baseURL>
  </friends>
</description>

Aggregators
aggregators allow for:
- scalability for OAI-PMH
- load balancing
- community building
- discovery
[Figure: service providers (harvesters), aggregator, data providers (repositories)]

Aggregators
Frequently interchangeable terms:
- aggregators: likely to be community / institutionally focused
- caches: store a copy, less likely to be community-oriented
- proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols
  - Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03
To learn more about aggregators, caches & proxies:

<provenance> & datestamps
- Reminder: datestamps are local to the repository, so a re-exporting service must use new local datestamps
- Such services should use the <provenance> container to preserve the original datestamps and other information

Identifiers are Local
- Identifiers are local to the repository
- Unless you absolutely did not change the metadata and the identifier corresponds to a recognized URI scheme, use a new identifier upon re-exporting
- use the <provenance> container to preserve the harvesting history

Derived from the same item?
3 different ways to determine if records share provenance from the same item:
- both records have the same identifier, and the baseURL in the request elements of the OAI-PMH responses which include the records are the same;
- both records have the same identifier, and that identifier belongs to some recognized URI scheme;
- the provenance containers of both records have the same entries for both the identifier and baseURL.
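The three tests can be written down as a small predicate. This is only a sketch: the `Record` type is invented, the list of "recognized" URI schemes is an assumption (the slides do not enumerate one), and the third test is implemented literally as equality of the (baseURL, identifier) entries taken from the two provenance containers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    identifier: str
    base_url: str            # baseURL from the <request> element of the response
    provenance: tuple = ()   # (baseURL, identifier) pairs from <provenance>

# Assumption for illustration; which schemes count as "recognized" is a policy choice.
ASSUMED_SCHEMES = ("http", "https", "urn", "info")

def same_item(a, b, recognized_schemes=ASSUMED_SCHEMES):
    """Apply the three provenance tests in order."""
    if a.identifier == b.identifier:
        if a.base_url == b.base_url:
            return True                      # test 1: same identifier, same baseURL
        scheme = a.identifier.partition(":")[0].lower()
        if scheme in recognized_schemes:
            return True                      # test 2: identifier in a recognized scheme
    # test 3: identical, non-empty provenance entries
    return bool(a.provenance) and a.provenance == b.provenance

a = Record("oai:odd.oa.org:z1x2y3", "http://odd.example/oai")
b = Record("oai:odd.oa.org:z1x2y3", "http://mirror.example/oai")
print(same_item(a, a), same_item(a, b))
```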

<provenance> example (1)
Consider a request from crosswalker.oa.org:
&identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt
and the following response from odd.oa.org:
<responseDate> T08:55:46.1</responseDate>
<request verb="GetRecord" metadataPrefix="odd_fmt" identifier="oai:odd.oa.org:z1x2y3">…</request>
<GetRecord ...namespace stuff… >
  <record>
    <header>
      <identifier>oai:odd.oa.org:z1x2y3</identifier>
      <datestamp> T06:05:04Z</datestamp>
    </header>
    <metadata> …metadata record in odd_fmt… </metadata>
  </record>
</GetRecord>

Imagine that crosswalker.oa.org cross-walks harvested metadata from odd_fmt into oai_marc and then re-exposes the metadata with new identifiers. A request from getmarc.oa.org: &identifier=oai:cw.oa.org:z1x2y3 &metadataPrefix=oai_marc might then yield the following response from crosswalker.oa.org:

<record>
  <header>
    <identifier>oai:cw.oa.org:z1x2y3</identifier>
    <datestamp> T01:15:43Z</datestamp>
  </header>
  <metadata> ...metadata record in oai_marc... </metadata>
  <about>
    <provenance …namespace stuff… >
      <originDescription harvestDate=" T08:55:46Z" altered="true">
        <baseURL>…</baseURL>
        <identifier>oai:odd.oa.org:z1x2y3</identifier>
        <datestamp> T06:05:04Z</datestamp>
        <metadataNamespace>…</metadataNamespace>
      </originDescription>
    </provenance>
  </about>
</record>

This oai_marc record is then re-exposed by getmarc.oa.org with the same identifier oai:cw.oa.org:z1x2y3 (because the record has not been altered). The associated <provenance> container might be:

<record>
  <header>
    <identifier>oai:cw.oa.org:z1x2y3</identifier>
    <datestamp> T01:46:11Z</datestamp>
  </header>
  <metadata> ...metadata record in oai_marc... </metadata>
  <about>
    <provenance …namespace stuff… >
      <originDescription harvestDate=" T01:23:45" altered="false">
        <baseURL>…</baseURL>
        <datestamp> T01:15:43Z</datestamp>
        <metadataNamespace>…</metadataNamespace>
        <originDescription harvestDate=" T08:55:46Z" altered="true">
          <baseURL>…</baseURL>
          <identifier>oai:odd.oa.org:z1x2y3</identifier>
          <datestamp> T06:05:04Z</datestamp>
          <metadataNamespace>…</metadataNamespace>
        </originDescription>
      </originDescription>
    </provenance>
  </about>
</record>

Listen to the Repository
- Check Identify’s <granularity> element if you wish to use datestamps finer than YYYY-MM-DD
- If you harvest with sets, remember that “:” indicates hierarchy
  - harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d”
- Check for and handle non-200 HTTP status codes; 503, 302 and 4xx in particular
- Empty resumptionToken => end of complete list
- Ask for compressed responses if the repository supports them
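The set-hierarchy rule comes down to a prefix test on colon-separated setSpec values. A minimal sketch, with a hypothetical helper name:

```python
def in_set(set_spec, requested):
    """True if a record's setSpec falls under the requested set."""
    # Match the set itself or any descendant; "ab" is not a child of "a".
    return set_spec == requested or set_spec.startswith(requested + ":")

for spec in ["a", "a:b", "a:b:c", "a:d", "ab", "b:a"]:
    print(spec, in_set(spec, "a"))
```

Appending the colon before the prefix test is what keeps an unrelated set such as "ab" from being treated as part of "a".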

Harvesting Everything
- Issue an Identify request to find protocol version, finest datestamp granularity supported, if compression is supported…
- Issue a ListMetadataFormats request to obtain a list of all metadataPrefixes supported.
- Harvest using a ListRecords request for each metadataPrefix supported. Knowledge of the datestamp granularity allows for less overlap in incremental harvesting if granularities finer than a day are supported.
- Set structure can be inferred from the setSpec elements in the header blocks of each record returned (consistency checks are possible).
- Items may be reconstructed from the constituent records.
- Provenance and other information in <about> blocks may be re-assembled at the item level if it is the same for all metadata formats harvested. However, this information may be supplied differently for different metadata formats and may thus need to be stored separately for each metadata format.
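The last steps (inferring the set structure and reconstructing items from their records) can be sketched as two small functions. The input is assumed to be tuples already extracted from ListRecords responses; the tuple layout, function names and sample data are all invented for illustration.

```python
from collections import defaultdict

def rebuild_items(records):
    """Group (identifier, metadataPrefix, setSpecs, metadata) tuples by item."""
    items = defaultdict(lambda: {"sets": set(), "formats": {}})
    for identifier, prefix, set_specs, metadata in records:
        item = items[identifier]
        item["sets"].update(set_specs)       # set membership is an item-level property
        item["formats"][prefix] = metadata   # one record per metadata format
    return dict(items)

def infer_set_tree(items):
    """Every setSpec seen, plus all of its ancestors in the ':' hierarchy."""
    tree = set()
    for item in items.values():
        for spec in item["sets"]:
            parts = spec.split(":")
            for depth in range(1, len(parts) + 1):
                tree.add(":".join(parts[:depth]))
    return sorted(tree)

harvested = [
    ("oai:x:1", "oai_dc", ["cs", "math:logic"], "<dc/>"),
    ("oai:x:1", "oai_marc", ["cs"], "<marc/>"),
    ("oai:x:2", "oai_dc", ["math"], "<dc/>"),
]
items = rebuild_items(harvested)
print(sorted(items["oai:x:1"]["formats"]), infer_set_tree(items))
```

The example item reports different setSpecs in its two formats, which is the kind of inconsistency the "consistency checks" mentioned above would flag; the sketch simply takes the union.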

Observation: Front-End Only
- No input/registry mechanism
- OAI-PMH is always a front-end for something else
  - filesystem, Dienst, RDBMS, LDAP, etc.
  - convenient for pre-existing DLs, but does not address “new” DLs, e.g., “we want to do OAI-PMH”
- Bounds the scope of OAI-PMH
  - responsibilities and domain of OAI-PMH are still being discussed
  - tension between functionality and simplicity

1 Repository, 2 baseURLs
Possible to use multiple OAI-PMH interfaces in a DMZ-like configuration…
- private repo interface: OAI-PMH requests from trusted hosts
- public repo interface: OAI-PMH requests from arbitrary hosts
- source database: could even use a separate copy of the database…

Closed Harvesting
Possible to use OAI-PMH in closed, restricted systems
[Figure: four DLs (OAI 1, OAI 2, OAI 3, OAI 4); all OAI requests originate from these 4 DLs]

Observation: Data Coherency
- In the interest of implementer simplicity, several issues are left for the service provider to interpret
- what is an update vs. addition?
  - in the NACA interface, they are reported as the same and it’s up to the harvesting system to figure it out
- deletions?
  - it is currently optional for systems to mark records as deleted or not… still left to the harvester to interpret

Interesting Services
- DP9: gateway to expose repository contents in HTML suitable for web crawlers
- Celestial: OAI “cache”, also 1.1 -> 2.0 converter
- Static (mini-) repositories: XML files, based on OLAC work

DP9 Architecture see Liu et al., JCDL 2002; http://dlib.cs.odu.edu/dp9
Slide from Liu

DP9 Formatting
Format of URLs
- &prefix=oai_dc
HTML Meta tags
- Some crawlers (such as Inktomi) use the HTML meta tags to index Web pages; DP9 also maps Dublin Core metadata to corresponding HTML meta tags.
- For pages that are designed exclusively for robot navigation, a noindex robots meta tag is used
- X-FORWARDED-FOR header to distinguish between different users coming in via a proxy
Slide from Liu

Celestial
- Developed by Brody @ Southampton
- designed to complement DP9
  - see Liu, Brody, et al., D-Lib Magazine 8(11)
- Where DP9 is a non-caching proxy, Celestial caches the metadata records
  - can off-load work from individual archives, higher availability
  - can harvest 1.1, 2.0; exports in 2.0

“Static” Repositories
- Premise: a repository does not wish to have an executing program on its site, so it has a “static” XML file with some of the OAI-PMH responses in place
  - accessed through a proxy
  - could be a low functionality node, or the XML file could be produced by a process and moved outside a firewall
- Based on OLAC work by Bird & Simons

OAI Demos
Data providers
- not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool

OAI-PMH & The Deep Web?
- How many of the digital resources described by OAI-PMH records are indexed by search engines?
- We did a study of this in 2006…

Indexed by SEs?

Web, Deep Web & OAI-PMH
A. Resources that have been indexed by search engines using crawling
B. Resources that have been indexed by search engines using OAIster, sitemaps and other techniques
C. Resources that are not accessible on the surface Web
D. Resources that are accessible on the surface Web and have not yet been found by crawling

Repository Corpus
- From four registries, we discovered 776 repositories
- Harvested 9.9M records from 475 of the 776 repositories
- 406 of the 475 returned records with at least one DC.Identifier field
- From these 406, we harvested 5.6M DC.Identifiers, 4M of which were http or https URIs, and 3.2M of those were unique

Repository Corpus

Sampling
We sampled 1000 records from:
1. all unique DC Identifiers
2. 5 bins based on size (table 1)
3. 10 "representative" repositories
For #1:
- Y: 65%, G: 44%, M: 7%
- Y ∪ G ∪ M = 79%
- Y ∩ G = 30%

By Size

Individual Repositories

Web, Deep Web & OAI-PMH
- A + B = 2.6M resource identifiers
- C + D = 700K resource identifiers
- hard to solve for A, since SEs don't reliably report backlinks (see McCown & Nelson, JCDL 2007); hard to solve for C or D without your own crawling
A. Resources that have been indexed by search engines using crawling
B. Resources that have been indexed by search engines using OAIster, sitemaps and other techniques
C. Resources that are not accessible on the surface Web
D. Resources that are accessible on the surface Web and have not yet been found by crawling
