A New Model for Web Resource Harvesting


A New Model for Web Resource Harvesting
Michael L. Nelson, Old Dominion University
joint work with: Herbert Van de Sompel, Xiaoming Liu, Carl Lagoze, Simeon Warner, Terry Harrison
This work supported in part by the Andrew Mellon Foundation & Library of Congress

A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005

My Research Interests
Digital Objects and Repositories
o Interaction between them: roles, responsibilities, architecture, scalability
o Bumper sticker: Free the Object from the Tyranny of the Repository
Digital Preservation
o Shared infrastructure models, automation, large-scale best effort strategies
o Bumper sticker: We Need Fewer Heroes
User / System Co-Evolution
o Discerning intent from large-scale access patterns
o Bumper sticker: Freedom From Choice

Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research

WWW and DL: Separated at Birth
[Figure: DL and WWW, from 1994 to today]
More on DL/WWW, from the NSF Post-DL Workshop:
WWW
o The Good: XML, BitTorrent, Web Services
o The Bad: RSS
o The Ugly: Semantic Web
DL
o The Good: OAIS, DOI, OAI-PMH
o The Bad: Dublin Core
o The Ugly: SRU/W
The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.

Web Robots
[Figure: a web server holding doc1, doc2, …, doc100, each with its own last-modified date, and a robot crawling it.]
Questions facing the robot:
o what documents have been modified since a given date?
o what is this file?
o what are its relationships to other files?
o how often does it change?

A More Efficient Way
[Figure: with mod_oai, the robot asks the server "what documents have been modified since a given date?"; the server holds doc1, doc2, …, doc100, each with its own last-modified date.]

A Very Brief OAI-PMH History
Universal Preprint Service
o A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
  - not distributed searching
o Demonstrated at Santa Fe NM, October 21-22
  - D-Lib Magazine, 6(2) 2000 (2 articles)
o UPS was soon renamed the Open Archives Initiative (OAI)
The OAI has authored, among other things, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
o in use around the world, 600+ known instances
  - registration not required; many unknown instances
o used by Google ca. late

Data Providers / Repositories
“A repository is a network accessible server that can process the 6 OAI-PMH requests … A repository is managed by a data provider to expose metadata to harvesters.”
Service Providers / Harvesters
“A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”

Aggregators
[Figure: an aggregator sits between data providers (repositories) and service providers (harvesters).]
Aggregators allow for:
o scalability for OAI-PMH
o load balancing
o community building
o discovery

OAI-PMH data model
o resource
o item: entry point to all records pertaining to the resource
  - OAI-PMH identifier
  - OAI-PMH sets
o records: metadata pertaining to the resource (Dublin Core metadata, MARCXML metadata)
  - OAI-PMH identifier
  - metadataPrefix
  - datestamp

Overview of OAI-PMH Verbs

Verb                   Function
Identify               description of repository
ListMetadataFormats    metadata formats supported by repository
ListSets               sets defined by repository
ListIdentifiers        OAI unique ids contained in repository
ListRecords            listing of N records
GetRecord              listing of a single record

o Identify, ListMetadataFormats, ListSets: metadata about the repository
o ListIdentifiers, ListRecords, GetRecord: harvesting verbs
o most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
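To make the role of the verb arguments and the resumption token concrete, here is a minimal Python sketch of a harvesting loop. The base URL and the `fetch` callable are hypothetical stand-ins (a real harvester would issue an HTTP GET and parse the XML response); only the verb names and the `resumptionToken` argument come from the protocol.

```python
from typing import Callable, Iterator, List, Optional, Tuple
from urllib.parse import urlencode

# One page of results plus the resumptionToken for the next page (None when done).
Page = Tuple[List[str], Optional[str]]


def build_request(base_url: str, verb: str, **args: str) -> str:
    """Return an OAI-PMH request URL: the verb plus its arguments as a query string."""
    return base_url + "?" + urlencode({"verb": verb, **args})


def harvest(base_url: str, verb: str, fetch: Callable[[str], Page], **args: str) -> Iterator[str]:
    """Yield every item of a list verb, following resumption tokens until exhausted."""
    url = build_request(base_url, verb, **args)
    while True:
        items, token = fetch(url)  # hypothetical transport: returns (items, next token)
        yield from items
        if not token:
            break
        # A follow-up request carries only the verb and the resumption token.
        url = build_request(base_url, verb, resumptionToken=token)
```

Passing `fetch` in as a parameter keeps the flow-control logic separate from the network and XML handling, so the loop can be exercised with canned pages.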

Resource Harvesting: Use cases
Discovery: use content itself in the creation of services
o search engines that make full-text searchable
o citation indexing systems that extract references from the full-text content
o browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections
Preservation:
o periodically transfer digital content from a data repository to one or more trusted digital repositories
o trusted digital repositories need a mechanism to automatically synchronize with the originating data repository
Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner

Existing OAI-PMH based approaches
Typical scenario:
1. An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository.
2. The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource.
3. A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location.
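Step 2 of this scenario can be sketched in Python as follows. The record and its identifiers are invented for illustration; only the `oai_dc` and Dublin Core namespace URIs are the standard ones. Note that the function can only return candidate locations: nothing in the record says which URL, if any, is the resource itself.

```python
import xml.etree.ElementTree as ET
from typing import List

DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical oai_dc record; the title and identifiers are made up.
RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example paper</dc:title>
  <dc:identifier>http://repo.example.org/123/</dc:identifier>
  <dc:identifier>http://repo.example.org/123/paper.pdf</dc:identifier>
  <dc:identifier>doi:10.9999/example</dc:identifier>
</oai_dc:dc>"""


def candidate_locators(record_xml: str) -> List[str]:
    """Return the dc:identifier values that look like HTTP URLs."""
    root = ET.fromstring(record_xml)
    values = [(el.text or "").strip() for el in root.iter(DC + "identifier")]
    return [v for v in values if v.startswith(("http://", "https://"))]
```

For the sample record this yields two URLs, and the harvester still has to dereference both to learn which one is a splash page.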

Existing OAI-PMH based approaches: Issue 1
Locating the resource based on information provided in dc.identifier
o dc.identifier is used to convey a variety of identifiers, sometimes simultaneously: URL, DOI, bibliographic citation, …
  - not expressive enough to distinguish between identifier and locator
  - several dereferencing attempts required
o URI provided in dc.identifier is commonly that of a bibliographic “splash page”
  - How to know it is a bibliographic “splash page”, not the resource?
  - If it is a bibliographic “splash page”, where is the resource?

Existing OAI-PMH based approaches: Issue 2
Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting:
o Datestamp of DC record does not necessarily change when resource changes

                      no DC datestamp change    DC datestamp change
no resource update    OK                        unnecessary resource download
resource update       missed resource update    OK
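The four cells of this table can be written as a small Python function, which makes the two failure modes explicit. The function name and argument names are illustrative only.

```python
def harvest_outcome(dc_datestamp_changed: bool, resource_updated: bool) -> str:
    """Classify what a datestamp-triggered harvester does in each cell of the table."""
    if dc_datestamp_changed == resource_updated:
        return "OK"  # the trigger and the actual resource state agree
    if dc_datestamp_changed:
        return "unnecessary resource download"  # metadata changed, resource did not
    return "missed resource update"  # resource changed, metadata did not
```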

Existing OAI-PMH based approaches: Conventions
o Cannot really address issue 2 (datestamps) with metadata conventions
o Issue 1 (identifier & locator of the resource) is currently addressed with a range of conventions
  - First dc.identifier is locator of the resource (what if the resource is not digital?)
  - Use of dc.format and/or dc.relation to convey locator

Existing OAI-PMH based approaches: Conventions
Example Dublin Core record:
o A Simple Parallel-Plate Resonator Technique for Microwave. Characterization of Thin Resistive Films
o Vorobiev, A.
o ING-INF/01 Elettronica
o A parallel-plate resonator method is proposed for non-destructive characterisation of resistive films used in microwave integrated circuits. A slot made in one...
o Microwave engineering Europe 2002
o Documento relativo ad una Conferenza o altro Evento
o PeerReviewed
o pdf
Annotations on the record's identifiers: locator of resource; splash page

Existing OAI-PMH based approaches: Conventions
[Example record with annotations: locator of resource; splash page]

Existing OAI-PMH based approaches: Other attempts
o dc.identifier leads to splash page & splash page contains special purpose XHTML link to resource(s)
  - What if there is no splash page?
  - How does a harvester recognize this situation?
o OA-X: protocol extension
  - OK in local context
  - Strategic problem to generalize
  - How to consolidate with OAI-PMH data model
o Qualified Dublin Core
  - Could bring expressiveness to distinguish between locator & identifier
  - But what about the datestamp issue?

Proposed OAI-PMH based approach
o Use metadata formats that were specifically created for representation of digital objects:
  - Complex Object Formats as OAI-PMH metadata formats
  - MPEG-21 DIDL, METS, …

OAI-PMH data model
o resource
o item: OAI-PMH identifier = entry point to all records pertaining to the resource
o records: metadata pertaining to the resource
  - Dublin Core metadata (simple)
  - MARCXML metadata (more expressive)
  - MPEG-21 DIDL (highly expressive)
  - METS (highly expressive)

Complex Object Formats: characteristics
Representation of a digital object by means of a wrapper XML document.
Represented resource can be:
o simple digital object (consisting of a single datastream)
o compound digital object (consisting of multiple datastreams)
Unambiguous approach to convey identifiers of the digital object and its constituent datastreams.
Include datastream:
o By-Value: embedding of base64-encoded datastream
o By-Reference: embedding network location of the datastream
o not mutually exclusive; equivalent
Include a variety of secondary information
o By-Value
o By-Reference
o Descriptive metadata, rights information, technical metadata, …
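The By-Value / By-Reference distinction can be illustrated with a short Python sketch that writes a wrapper XML document. The element and attribute names (`object`, `datastream`, `mimeType`, `encoding`, `ref`) are hypothetical and deliberately simplified; they are not the MPEG-21 DIDL or METS vocabulary.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Dict


def wrap_object(identifier: str, by_value: Dict[str, bytes], by_reference: Dict[str, str]) -> str:
    """Serialize a digital object as a wrapper XML document.

    by_value maps a MIME type to raw bytes, embedded base64-encoded.
    by_reference maps a MIME type to the network location of the datastream.
    All element and attribute names here are hypothetical.
    """
    obj = ET.Element("object", {"identifier": identifier})
    for mime, data in by_value.items():
        ds = ET.SubElement(obj, "datastream", {"mimeType": mime, "encoding": "base64"})
        ds.text = base64.b64encode(data).decode("ascii")
    for mime, url in by_reference.items():
        ET.SubElement(obj, "datastream", {"mimeType": mime, "ref": url})
    return ET.tostring(obj, encoding="unicode")
```

A compound object simply carries more than one `datastream` child, and the two inclusion styles can be mixed within the same wrapper.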

[Example complex object record: A Simple Parallel-Plate Resonator Technique for Microwave. Characterization of Thin Resistive Films; Vorobiev, A.; application/pdf; …]

Complex Object Formats & OAI-PMH
Resource represented via XML wrapper => OAI-PMH
o Uniform solution for simple & compound objects
o Unambiguous expression of locator of datastream
o Disambiguation between locators & identifiers
o OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes
o OAI-PMH semantics apply: “about” containers, set membership

OAI-PMH based approach using Complex Object Format
Typical scenario:
1. An OAI-PMH harvester checks for support of a locally understood complex object format using the ListMetadataFormats verb.
2. The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected.
3. A parser at the end of the harvesting application analyzes each harvested complex object record:
   - The parser extracts the bitstreams that were delivered By-Value.
   - The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference.
4. A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations.
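Step 3 of the scenario might look like the following Python sketch. The record format is a hypothetical, simplified wrapper (element and attribute names are invented for illustration, not taken from MPEG-21 DIDL or METS); the point is only that By-Value bitstreams are decoded in place while By-Reference locations are queued for the out-of-band process of step 4.

```python
import base64
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical complex object record; names are not from any real format.
RECORD = """<object identifier="http://example.org/obj/1">
  <datastream mimeType="text/plain" encoding="base64">aGVsbG8=</datastream>
  <datastream mimeType="application/pdf" ref="http://example.org/obj/1/paper.pdf"/>
</object>"""


def parse_record(record_xml: str) -> Tuple[List[bytes], List[str]]:
    """Split a record into By-Value bitstreams and By-Reference network locations."""
    embedded: List[bytes] = []
    to_fetch: List[str] = []
    for ds in ET.fromstring(record_xml).iter("datastream"):
        ref = ds.get("ref")
        if ref is not None:
            to_fetch.append(ref)  # collected later, outside the OAI-PMH exchange
        elif ds.get("encoding") == "base64":
            embedded.append(base64.b64decode(ds.text or ""))
    return embedded, to_fetch
```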

Complex Object Formats & OAI-PMH: OAIS archive export/ingest

Complex Object Formats & OAI-PMH: issues
o Which Complex Object Format(s)
o How to Profile Complex Object Format(s) for OAI-PMH Harvesting
o Large records
o Making resources re-harvestable
o Because the resource is represented as metadata, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline?
Tools:
o Software library to write compliant complex objects
o Integration of this library with repository systems (Fedora, DSpace, eprints.org, …)

mod_oai approach
Goal: integrate OAI-PMH functionality into the web server itself…
mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
o written in C
o respects values in .htaccess, httpd.conf
compile mod_oai on
baseURL is now
o Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
  - verb=ListIdentifiers & metadataPrefix=oai_dc & from= & set=mime:video:mpeg
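A request like the one above can be assembled with Python's standard library. In this sketch the base URL and the `from` date are hypothetical placeholders (the slide's own values are not preserved); the argument names and the `mime:video:mpeg` set come from the slide.

```python
from typing import Optional
from urllib.parse import urlencode


def list_identifiers_url(base_url: str, mime_set: str, from_date: Optional[str] = None) -> str:
    """Build a ListIdentifiers request restricted to one MIME-type set."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date  # only resources modified on or after this date
    params["set"] = mime_set
    # safe=":" keeps the colons in the set name readable; percent-encoding
    # them would be an equivalent way to write the same request.
    return base_url + "?" + urlencode(params, safe=":")


# Hypothetical server and date, for illustration only.
url = list_identifiers_url("http://www.example.org/modoai", "mime:video:mpeg", "2005-01-01")
```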

OAI-PMH data model in mod_oai
o resource
o item: OAI-PMH identifier = entry point to all records pertaining to the resource
  - OAI-PMH sets: MIME type
o records: metadata pertaining to the resource
  - HTTP header metadata
  - Dublin Core metadata
  - MPEG-21 DIDL

OAI-PMH concepts: typical repository

OAI-PMH Entity           value             description
Resource                 URL               PDF, PS, XML, HTML or other file
Item   identifier        OAI Identifier    DNS-based name of metadata about resource
       set membership    LCSH              Library of Congress Subject Heading
Record metadataPrefix    oai_dc            bibliographic metadata in Dublin Core
       datestamp                           modification date of DC record
Record metadataPrefix    oai_marc          bibliographic metadata in MARC
       datestamp                           modification date of MARC record

OAI-PMH concepts: mod_oai

OAI-PMH Entity           value          description
Resource                 URL            HTML, GIF, PDF or other web file
Item   identifier        URL            same URL as the resource
       set membership    MIME type      MIME type of the resource
Record metadataPrefix    http_header    the http headers that would have been returned via HTTP GET/HEAD
       datestamp                        modification date of resource
Record metadataPrefix    oai_dc         a subset of http_header in DC
       datestamp                        modification date of resource
Record metadataPrefix    oai_didl       MPEG-21 DIDL: base64 encoded resource + http_header metadata
       datestamp                        modification date of resource
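The mapping in this table is mechanical enough to sketch in Python. This is not mod_oai's code (which is written in C); it is an illustration under assumptions: the function name is invented, the `mime:type:subtype` set spelling follows the `set=mime:video:mpeg` example used elsewhere in the talk, and the fallback MIME type is a choice made here.

```python
import mimetypes
from datetime import datetime, timezone
from typing import Dict


def describe_resource(url: str, mtime: float) -> Dict[str, str]:
    """Derive identifier, set and datestamp for a web resource, mod_oai style."""
    mime, _ = mimetypes.guess_type(url)
    mime = mime or "application/octet-stream"  # assumed fallback for unknown types
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return {
        "identifier": url,                        # same URL as the resource
        "set": "mime:" + mime.replace("/", ":"),  # MIME type of the resource
        "datestamp": stamp.strftime("%Y-%m-%dT%H:%M:%SZ"),  # modification date
    }
```

Unlike the typical repository, every value here is derived from the resource itself, which is why the datestamp tracks changes to the resource rather than to a separate metadata record.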

Resource Discovery: ListIdentifiers
The harvester:
o issues a ListIdentifiers request and finds the URLs of updated resources
o does HTTP GETs for the updates only
o can get URLs of resources with specified MIME types
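On the harvester side, finding the URLs of updated resources amounts to reading the headers of a ListIdentifiers response. The Python sketch below uses the standard OAI-PMH namespace and header elements, but the response itself is a trimmed, invented example (a real response also carries `responseDate` and `request` elements, and possibly a resumption token).

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, abbreviated ListIdentifiers response.
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.example.org/a.html</identifier>
      <datestamp>2005-04-01T12:00:00Z</datestamp>
      <setSpec>mime:text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.example.org/b.mpeg</identifier>
      <datestamp>2005-04-02T08:30:00Z</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def updated_urls(response_xml: str, wanted_set: Optional[str] = None) -> List[str]:
    """Return the identifiers (URLs, in mod_oai) listed in a ListIdentifiers response."""
    urls = []
    for header in ET.fromstring(response_xml).iter(OAI + "header"):
        sets = [s.text for s in header.findall(OAI + "setSpec")]
        if wanted_set is None or wanted_set in sets:
            urls.append(header.findtext(OAI + "identifier"))
    return urls
```

Because the identifier is the resource's own URL, the returned list can be handed directly to an ordinary HTTP client.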

Preservation: ListRecords
The harvester:
o issues a ListRecords request and gets updates as MPEG-21 DIDL documents (HTTP headers, resource By-Value or By-Reference)
o can get resources with specified MIME types

[Figure: performance of mod_oai and wget]
For more detail: “mod_oai: An Apache Module for Metadata Harvesting”

Issues and Future Work
For a given server, there is a set of URLs, U, and a set of files, F
o Apache maps U → F
o mod_oai maps F → U
Neither function is 1-1 nor onto
o We can easily check if a single u maps to F, but given F we cannot (easily) generate U
Short-term issues:
o dynamic files
  - exporting unprocessed server-side files would be a security hole
o IndexIgnore
  - httpd will “hide” valid URLs
o File permissions
  - httpd will advertise files it cannot read
Long-term issues:
o Alias, Location
  - files can be covered up by the httpd
o UserDir
  - interactions between the httpd and the filesystem

IndexIgnore & File Permissions

Alias: Covering Up Files
httpd.conf:
    Alias /A /usr/local/web/htdocs/B
    Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs
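The effect of these two Alias lines can be reproduced with a small Python model. It is a simplification under stated assumptions: the DocumentRoot is assumed to be /usr/local/web/htdocs (the slide does not say), Alias matching is reduced to a plain path-prefix test, and `naive_file_to_url` stands for any F → U mapping that looks only at the filesystem.

```python
import posixpath

DOCROOT = "/usr/local/web/htdocs"  # assumed DocumentRoot, not given in the slide
ALIASES = {
    "/A": "/usr/local/web/htdocs/B",
    "/B": "/usr/local/web/htdocs/A",
}


def url_to_file(url_path: str) -> str:
    """Model the server's U -> F mapping: aliases first, then the DocumentRoot."""
    for prefix, target in ALIASES.items():
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return posixpath.join(DOCROOT, url_path.lstrip("/"))


def naive_file_to_url(file_path: str) -> str:
    """Model an F -> U mapping that only looks at the filesystem layout."""
    return "/" + posixpath.relpath(file_path, DOCROOT)


# A file under htdocs/A is advertised as /A/index.html,
# but that URL is served from htdocs/B.
advertised = naive_file_to_url("/usr/local/web/htdocs/A/index.html")
served_from = url_to_file(advertised)
```

The round trip F → U → F does not return the starting file, which is the sense in which a module that walks the filesystem cannot simply invert the server's URL mapping.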

UserDir: “Just in Time” mounting of directories
    whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
    liu_x/ mln/
    whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
    /home/tharriso/
    whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
    liu_x/ mln/ tharriso/
    whiskey.cs.odu.edu:/ftp/WWW/conf%

Looking Further Down the Road for mod_oai
“Reverse” the method of URL discovery
o cannot look to the files;
o listen to incoming requests and build a list of valid URLs
  - could be seeded with files at start
  - also the method for handling server processed files / URLs
Plug-ins for descriptive metadata
o DC tags in HTML
o MS Office formats, PDF
o Tags from JPEG, TIFF, MP3, etc.
Additional metadata in the DIDL
o technical metadata from JHOVE
o estimated change rate
  - cf. Cho & Garcia-Molina, ACM TOIT 28(4)
http log access as separate metadata formats
  - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)

Expanding OAI-PMH / Complex Object Access
OAI-PMH / CO access for:
o blogs
o message boards
o native file systems
  - e.g. Mac OS X “Spotlight”
More aggressive use of OAI-PMH / CO for preservation
o recently funded NSF DIGARCH program
o use for preservation:
  - Usenet
  - Multicasting

OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting
Better web harvesting can be achieved through:
o OAI-PMH: structured access to updates
o Complex object formats: modeled representation of digital objects
Use cases:
o Preservation (ListRecords)
o Web crawling (ListIdentifiers)
mod_oai: reference implementation
o Better performance than wget
o static files only; dynamic files in the future
o not a replacement for DSpace, Fedora, eprints.org, etc.
More info:

Datestamps and Etags
Procedure
o 16 harvests over 1 month of 465,374 .dk domains
o 5,543,470 possible downloads
o 5,182,034 successful downloads
o 599,143 changes
Datestamp and Etag Example
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL

Errors in Datestamps and Etags Indicating Change

                   Etags     Datestamps
missed change      0.087%    0.30%
redundant crawl    32%       10.7%

o % of pages without Etags
o 0.07% of pages without Datestamps
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL

http_header

Complex Object Formats & OAI-PMH: existing implementations
LANL Repository
o Local storage of Terabytes of scholarly assets
o Assets stored as MPEG-21 DIDL documents
o DIDL documents made accessible to downstream applications via the OAI-PMH
Mirroring of American Physical Society collection at LANL
o Maps APS document model to MPEG-21 DIDL Transfer Profile
o Exposes MPEG-21 DIDL documents through OAI-PMH infrastructure
o Includes digests/signatures
DSpace & Fedora plug-ins
o Maps DSpace/Fedora document model to MPEG-21 DIDL Transfer Profile
o Exposes MPEG-21 DIDL documents through OAI-PMH infrastructure
mod_oai

mod_oai: OAI-PMH concepts

concept               mod_oai implementation
OAI-PMH Identifier    URL of resource
set                   MIME type of resource
datestamp             change time of resource
deleted records       “no” deleted records