
1 OAI-PMH for Resource Harvesting Tutorial, OAI4, October 20th 2005, CERN, Geneva, Switzerland
A New Model for Web Resource Harvesting
Michael Nelson, Computer Science Department, Old Dominion University
Herbert Van de Sompel, Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory
This work supported in part by the Andrew Mellon Foundation & Library of Congress

2 Outline
(0) The Problem
(1) mod_oai
(2) Future Research

3 WWW and DL: Separated at Birth
1994: DL and WWW
WWW today
o The Good: XML, BitTorrent, Web Services
o The Bad: RSS
o The Ugly: Semantic Web
DL today
o The Good: OAIS, DOI, OAI-PMH
o The Bad: Dublin Core
o The Ugly: SRU/W
The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.

4 Web Robots
Robot to www.getty.edu: what documents have been modified since 2003-11-15?
o doc1; last mod 2003-03-12
o doc2; last mod 2002-07-19
o doc100; last mod 2003-09-11
o …
Further questions for the robot:
o what is this file?
o what are its relationships to other files?
o how often does it change?
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

5 A More Efficient Way
Harvester to www.getty.edu with mod_oai: what documents have been modified since 2003-11-15?
o doc1; last mod 2003-03-12
o doc2; last mod 2002-07-19
o doc100; last mod 2003-09-11
o …

6 Outline
(0) The Problem
(1) mod_oai
(2) Future Research

7 mod_oai approach
Goal: integrate OAI-PMH functionality into the web server itself…
mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
o written in C
o respects values in .htaccess, httpd.conf
compile mod_oai on http://www.foo.edu/
o baseURL is now http://www.foo.edu/modoai
Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
- http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
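The request on the slide above can also be assembled programmatically. The following is a minimal Python sketch, assuming the example base URL http://www.foo.edu/modoai from the slide; the helper name list_identifiers_url is illustrative and is not part of mod_oai.

```python
from urllib.parse import urlencode


def list_identifiers_url(base_url, metadata_prefix, from_date=None,
                         until_date=None, set_spec=None):
    """Build an OAI-PMH ListIdentifiers request URL (illustrative helper)."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    # Optional selective-harvesting arguments defined by OAI-PMH.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    # Leave ':' unescaped so the set name stays readable in the URL.
    return base_url + "?" + urlencode(params, safe=":")


url = list_identifiers_url("http://www.foo.edu/modoai", "oai_dc",
                           from_date="2004-09-15",
                           set_spec="mime:video:mpeg")
print(url)
```

Leaving out from_date and set_spec yields an unrestricted ListIdentifiers request, which asks for every resource the server exposes.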

8 OAI-PMH data model in mod_oai
o resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
o item: OAI-PMH identifier = entry point to all records pertaining to the resource
o records: metadata pertaining to the resource
  - Dublin Core metadata
  - MPEG-21 DIDL
  - HTTP header metadata
o OAI-PMH sets: MIME type

9 OAI-PMH concepts: typical repository
OAI-PMH Entity           value            description
Resource                 URL              PDF, PS, XML, HTML or other file
Item    identifier       OAI Identifier   DNS-based name of metadata about resource
        set membership   LCSH             Library of Congress Subject Heading
Record  metadataPrefix   oai_dc           bibliographic metadata in Dublin Core
        datestamp        2004-10-18       modification date of DC record
Record  metadataPrefix   oai_marc         bibliographic metadata in MARC
        datestamp        2004-07-31       modification date of MARC record

10 OAI-PMH concepts: mod_oai
OAI-PMH Entity           value         description
Resource                 URL           HTML, GIF, PDF or other web file
Item    identifier       URL           same URL as the resource
        set membership   MIME type     MIME type of the resource
Record  metadataPrefix   http_header   the http headers that would have been returned via HTTP GET/HEAD
        datestamp        2004-07-31    modification date of resource
Record  metadataPrefix   oai_dc        a subset of http_header in DC
        datestamp        2004-07-31    modification date of resource
Record  metadataPrefix   oai_didl      MPEG-21 DIDL: base64 encoded resource + http_header metadata
        datestamp        2004-07-31    modification date of resource
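The mapping in the table above can be sketched in a few lines of Python. This is only an illustration of the data model, not mod_oai's C code: the function names, the dictionary layout and the example URL are assumptions, and the set name format follows the mime:video:mpeg example used in the sample request on the mod_oai approach slide.

```python
from datetime import datetime, timezone


def mime_to_set_spec(mime_type):
    """Turn a MIME type such as 'video/mpeg' into the set name 'mime:video:mpeg'."""
    return "mime:" + mime_type.replace("/", ":")


def describe_item(url, mime_type, mtime_epoch):
    """Simplified view of how one web file appears as an OAI-PMH item."""
    # The datestamp of every record is the modification date of the resource.
    datestamp = datetime.fromtimestamp(
        mtime_epoch, tz=timezone.utc).strftime("%Y-%m-%d")
    return {
        "identifier": url,                    # same URL as the resource
        "set": mime_to_set_spec(mime_type),   # set membership = MIME type
        "datestamp": datestamp,
        "metadataPrefixes": ["http_header", "oai_dc", "oai_didl"],
    }


# Hypothetical file last modified on 2004-07-31 (UTC).
item = describe_item("http://www.foo.edu/talk.mpg", "video/mpeg", 1091232000)
print(item)
```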

11 Resource Discovery: ListIdentifiers
harvester
o issues a ListIdentifiers,
o finds URLs of updated resources
o does HTTP GETs for updates only
o can get URLs of resources with specified MIME types
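On the harvester side, the discovery step amounts to reading the headers out of a ListIdentifiers response. The sketch below parses a hand-written sample response that follows the OAI-PMH 2.0 response layout; the URLs and dates in the sample are made up, and a real harvester would also have to follow resumptionToken elements, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response; the URLs and dates are hypothetical.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2005-10-20T12:00:00Z</responseDate>
  <request verb="ListIdentifiers">http://www.foo.edu/modoai</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/a.html</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>mime:text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/b.pdf</identifier>
      <datestamp>2004-10-01</datestamp>
      <setSpec>mime:application:pdf</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def updated_resources(response_xml):
    """Return (URL, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    return [
        (header.findtext("oai:identifier", namespaces=OAI_NS),
         header.findtext("oai:datestamp", namespaces=OAI_NS))
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]


to_fetch = updated_resources(SAMPLE_RESPONSE)
for resource_url, datestamp in to_fetch:
    # A harvester would now issue an HTTP GET for each URL.
    print(resource_url, datestamp)
```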

12 Preservation: ListRecords
harvester
o issues a ListRecords,
o gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
o can get resources with specified MIME types

13 performance of mod_oai and wget on www.cs.odu.edu

14 Readings
Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland. mod_oai: An Apache Module for Metadata Harvesting. http://arxiv.org/abs/cs.DL/0503069

15 Outline
(0) The Problem
(1) mod_oai
(2) Future Research

16 Issues and Future Work
For a given server, there is a set of URLs, U, and a set of files, F
o Apache maps U → F
o mod_oai maps F → U
Neither function is 1-1 nor onto
o We can easily check if a single u maps to F, but given F we cannot (easily) generate U
Short-term issues:
o dynamic files
  - exporting unprocessed server-side files would be a security hole
o IndexIgnore
  - httpd will “hide” valid URLs
o File permissions
  - httpd will advertise files it cannot read
Long-term issues:
o Alias, Location
  - files can be covered up by the httpd
o UserDir
  - interactions between the httpd and the filesystem
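A toy example makes the "neither 1-1 nor onto" point concrete. The paths below are invented for illustration and do not come from any real server; the sketch only checks the two properties on a small URL-to-file table.

```python
# Hypothetical URL -> file mapping for one small server (made-up paths).
url_to_file = {
    "/": "/htdocs/index.html",
    "/index.html": "/htdocs/index.html",  # two URLs reach the same file
    "/paper.pdf": "/htdocs/paper.pdf",
}
# Files actually present on disk; .htpasswd is never served under any URL.
files_on_disk = {"/htdocs/index.html", "/htdocs/paper.pdf", "/htdocs/.htpasswd"}


def is_one_to_one(mapping):
    """True if no two keys share the same value."""
    values = list(mapping.values())
    return len(values) == len(set(values))


def is_onto(mapping, codomain):
    """True if every element of the codomain is the image of some key."""
    return set(codomain) <= set(mapping.values())


def urls_for_file(mapping, file_path):
    """Invert the mapping for one file: zero, one or many URLs may come back."""
    return sorted(u for u, f in mapping.items() if f == file_path)


print(is_one_to_one(url_to_file))
print(is_onto(url_to_file, files_on_disk))
print(urls_for_file(url_to_file, "/htdocs/index.html"))
print(urls_for_file(url_to_file, "/htdocs/.htpasswd"))
```

Starting from the files alone, a module has no table like url_to_file to invert, which is why generating U from F is the hard direction.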

17 IndexIgnore & File Permissions

18 Alias: Covering Up Files
httpd.conf:
  Alias /A /usr/local/web/htdocs/B
  Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs
  http://server/A
  http://server/B
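The effect of the two Alias lines can be traced with a small Python model. This is a deliberately simplified prefix match, not Apache's actual resolution logic, and it assumes that the DocumentRoot is /usr/local/web/htdocs, which the slide implies but does not state.

```python
# Assumption: the server's DocumentRoot, implied but not stated on the slide.
DOCUMENT_ROOT = "/usr/local/web/htdocs"

# The two directives from the slide, as (URL prefix, filesystem target) pairs.
ALIASES = [
    ("/A", "/usr/local/web/htdocs/B"),
    ("/B", "/usr/local/web/htdocs/A"),
]


def url_path_to_file(url_path):
    """Simplified model of the U -> F mapping with Alias directives."""
    for prefix, target in ALIASES:
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOCUMENT_ROOT + url_path


def naive_file_to_url_path(file_path):
    """What a filesystem walk would guess if it ignored Alias directives."""
    return file_path[len(DOCUMENT_ROOT):]


file_a = DOCUMENT_ROOT + "/A"
guessed = naive_file_to_url_path(file_a)   # the walk guesses URL path /A
served = url_path_to_file(guessed)         # but /A is answered from .../B
print(guessed, "->", served)
```

The guessed URL for file A ends up serving file B, so a module that derives URLs from a directory walk would advertise the wrong content unless it also consults the Alias table.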

19 UserDir: “Just in Time” mounting of directories
  whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
  liu_x/ mln/
  whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
  /home/tharriso/
  whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
  liu_x/ mln/ tharriso/
  whiskey.cs.odu.edu:/ftp/WWW/conf%

20 Looking Further Down the Road for mod_oai
“Reverse” the method of URL discovery
o cannot look to the files;
o listen to incoming requests and build a list of valid URLs
  - could be seeded with files at start
  - also the method for handling server processed files / URLs
Plug-ins for descriptive metadata
o DC tags in HTML
o MS Office formats, PDF
o Tags from JPEG, TIFF, MP3, etc.
Additional metadata in the DIDL
o technical metadata from JHOVE
o estimated change rate
  - cf. Cho & Garcia-Molina, ACM TOIT 28(4)
http log access as separate metadata formats
  - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
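The "listen to incoming requests" idea can be approximated offline by scanning an access log. The sketch below is a stand-in for what a module would do inside the server: it assumes Common Log Format lines, uses invented sample entries, and treats any request answered with status 200 as evidence of a valid URL.

```python
import re

# Matches the request and status fields of a Common Log Format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')

# Invented sample entries for illustration.
SAMPLE_LOG = [
    '192.0.2.1 - - [20/Oct/2005:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.1 - - [20/Oct/2005:10:00:05 +0200] "GET /missing.html HTTP/1.1" 404 209',
    '192.0.2.7 - - [20/Oct/2005:10:01:00 +0200] "GET /cgi-bin/search?q=oai HTTP/1.1" 200 512',
    '192.0.2.7 - - [20/Oct/2005:10:02:00 +0200] "HEAD /index.html HTTP/1.1" 200 -',
]


def valid_urls(log_lines, seed=()):
    """Collect URL paths that were served successfully, starting from a seed."""
    urls = set(seed)  # e.g. URLs derived from files at start-up
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "200":
            urls.add(match.group("path"))
    return urls


known = valid_urls(SAMPLE_LOG, seed=["/paper.pdf"])
print(sorted(known))
```

The dynamically generated /cgi-bin/search?q=oai URL is picked up even though no file with that name exists, which is the case the slide mentions for server processed files.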

21 Expanding OAI-PMH / Complex Object Access
OAI-PMH / CO access for:
o blogs
o message boards
o native file systems
  - e.g. Mac OS X “Spotlight”
More aggressive use of OAI-PMH / CO for preservation
o recently funded NSF DIGARCH program
o use for preservation:
  - Usenet
  - Email
  - Multicasting

22 OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting
Better web harvesting can be achieved through:
o OAI-PMH: structured access to updates
o Complex object formats: modeled representation of digital objects
Use cases:
o Preservation (ListRecords)
o Web crawling (ListIdentifiers)
mod_oai: reference implementation
o Better performance than wget
o static files only; dynamic files in the future
o not a replacement for DSpace, Fedora, eprints.org, etc.
More info:
o http://www.modoai.org/
o http://whiskey.cs.odu.edu/

23 Datestamps and Etags
Procedure
o 16 harvests over 1 month of 465,374 .dk domains
o 5,543,470 possible downloads
o 5,182,034 successful downloads
o 599,143 changes
Datestamp and Etag Example
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf

24 Errors in Datestamps and Etags Indicating Change
                  Etags    Datestamps
missed change     0.087%   0.30%
redundant crawl   32%      10.7%
40.1% of pages without Etags
0.07% of pages without Datestamps
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf
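The two error types in the table can be read as the two ways a change indicator can disagree with the actual content. The sketch below encodes that reading; it is an interpretation, not the cited study's method. In particular, dividing by the total number of observations is an assumption, since the slide does not say how the percentages were normalised, and the sample observations are invented.

```python
def classify(indicator_changed, content_changed):
    """Compare what an Etag or datestamp claims with what actually happened."""
    if content_changed and not indicator_changed:
        return "missed change"      # indicator says unchanged, content differs
    if indicator_changed and not content_changed:
        return "redundant crawl"    # indicator says changed, content identical
    return "correct"


def error_rates(observations):
    """Fraction of observations in each error class (assumed normalisation)."""
    labels = [classify(indicator, content) for indicator, content in observations]
    total = len(labels)
    return {name: labels.count(name) / total
            for name in ("missed change", "redundant crawl")}


# Invented observations: (indicator_changed, content_changed) per download.
observations = [(True, True), (False, False), (False, True), (True, False)]
rates = error_rates(observations)
print(rates)
```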

25 http_header

