Mod_oai: Metadata Harvesting for Everyone Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango


1 mod_oai: Metadata Harvesting for Everyone Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango {mln,aelango}@cs.odu.edu {herbertv,liu_x}@lanl.gov DLF 2004 Fall Forum Baltimore MD October 25-27, 2004 mod_oai is sponsored by the Andrew Mellon Foundation

2 Outline mod_oai –crawling vs. harvesting –complex objects & OAI-PMH –how mod_oai works –scenarios –demos More information –http://www.modoai.org/ –http://www.openarchives.org/

3 www.getty.edu doc1; last mod 2003-03-12 doc2; last mod 2002-07-19 doc3; last mod 2003-11-29 doc4; last mod 2002-10-03 doc100; last mod 2003-09-113 … what documents have been modified since 2003-11-15? Inefficient Web Crawlers robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

4 www.getty.edu with OAI-PMH doc1; last mod 2003-03-12 doc2; last mod 2002-07-19 doc3; last mod 2003-11-29 doc4; last mod 2002-10-03 doc100; last mod 2003-09-113 … what documents have been modified since 2003-11-15? A More Efficient Way…

5 mod_oai Goal: integrate OAI-PMH functionality into the web server itself… mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server –written in C –respects values in .htaccess, httpd.conf Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets) www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
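
A selective-harvesting request like the one on slide 5 can be assembled from its parts. The following is a minimal sketch, not part of mod_oai: the helper name `build_request` is hypothetical, and the base URL is the slide's placeholder host.

```python
from urllib.parse import urlencode

def build_request(base_url, verb, args):
    """Return an OAI-PMH request URL (hypothetical helper, not a mod_oai API)."""
    params = {"verb": verb}
    # Drop arguments the caller left unset (e.g. no 'until' bound).
    params.update({k: v for k, v in args.items() if v is not None})
    # safe=":" leaves the colon of hierarchical set names such as video:mpeg unescaped.
    return base_url + "?" + urlencode(params, safe=":")

url = build_request(
    "http://www.foo.edu/modoai",          # placeholder host from the slide
    "ListIdentifiers",
    {
        "metadataPrefix": "oai_dc",
        "from": "2004-09-15",             # only items changed on or after this date
        "until": None,                    # open-ended upper bound
        "set": "video:mpeg",              # mod_oai sets are MIME types
    },
)
print(url)
```

The arguments are passed as a dictionary because `from` is a reserved word in Python and cannot be used as a keyword argument.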

6 OAI-PMH data model resource item Dublin Core metadata MARCXML metadata MPEG-21 DIDL records OAI-PMH identifier = entry point to all records pertaining to the resource METS metadata pertaining to the resource modeled representation of the resource simple model more expressive model complex model complex model

7 OAI-PMH and complex models OAI-PMH record == modeled representation of the resource Can be selectively harvested via OAI-PMH ~ datestamp, set Resource can be: –simple object (1 file) –compound object (multiple files) OAI-PMH records can contain: –Typical metadata –Actual resource(s) By-Value – base64 encoded By-Reference – http address of resource both –Identifiers of metadata and resource(s), unambiguously mapped to the identified data –A variety of secondary information

8 Complex Objects & OAI-PMH LANL Repository –OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs APS/LANL/LoC Mirroring –OAI-PMH transfer of APS content represented in application neutral format (DIDLs) LANL DSpace Plug-in –Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure

9 How mod_oai works Install on an Apache 2.0 server –compile & edit httpd.conf http://www.foo.edu/ now has an OAI-PMH baseURL of: http://www.foo.edu/modoai

10 OAI-PMH characteristics: Typical Repository
OAI-PMH Entity | value | description
Resource | URL | PDF, PS, XML, HTML or other file
Item identifier | OAI Identifier | DNS-based name of metadata about resource
Item set membership | LCSH | Library of Congress Subject Heading
Record metadataPrefix | oai_dc | bibliographic metadata in Dublin Core
Record datestamp | 2004-10-18 | modification date of DC record
Record metadataPrefix | oai_marc | bibliographic metadata in MARC
Record datestamp | 2004-07-31 | modification date of MARC record

11 resource DC, HTTP, DIDL Modeled Representations item Dublin Core metadata HTTP headers DIDL: base64 or urls + HTTP headers records OAI Identifier == URL of Resource OAI-PMH Data Model in mod_oai http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf Set membership == MIME type

12 OAI-PMH characteristics: mod_oai
OAI-PMH Entity | value | description
Resource | URL | HTML, GIF, PDF or other web file
Item identifier | URL | same URL as the resource
Item set membership | MIME type | MIME type of the resource
Record metadataPrefix | http_header | the http headers that would have been returned via HTTP GET/HEAD
Record datestamp | 2004-07-31 | modification date of resource
Record metadataPrefix | oai_dc | a subset of http_header in DC
Record datestamp | 2004-07-31 | modification date of resource
Record metadataPrefix | oai_didl | MPEG-21 DIDL: base64 encoded resource + http_header metadata
Record datestamp | 2004-07-31 | modification date of resource

13 OAI-PMH Concepts
concept | mod_oai interpretation
OAI Identifier | URL of resource
set | MIME type of resource
datestamp | change time of resource
deleted records | “no” deleted records
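
The mapping from web-server concepts to OAI-PMH concepts can be illustrated in a few lines. This is a sketch of the idea only, not how the C module is written: `describe_resource` is a hypothetical name, the MIME type is guessed from the file extension rather than taken from the server configuration, and the `type:subtype` set name is inferred from the `set=video:mpeg` example on slide 5.

```python
import mimetypes
from datetime import datetime, timezone

def describe_resource(url, mtime_epoch):
    """Map a web resource to OAI-PMH header fields (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(url)
    mime = mime or "application/octet-stream"   # assumed fallback for unknown types
    return {
        "identifier": url,                       # OAI identifier == URL of resource
        "setSpec": mime.replace("/", ":"),       # set == MIME type, e.g. video:mpeg
        "datestamp": datetime.fromtimestamp(     # datestamp == change time of resource
            mtime_epoch, tz=timezone.utc
        ).strftime("%Y-%m-%d"),
    }

# The timestamp below is an illustrative value, not the real file's change time.
example_mtime = datetime(2004, 7, 31, tzinfo=timezone.utc).timestamp()
header = describe_resource(
    "http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf",
    example_mtime,
)
print(header)
```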

14 http_header

15 Use Cases Regular Web Crawling –use ListIdentifiers to discover URLs –add new URLs to the list of URLs to be crawled Harvesting Resources w/ OAI-PMH –use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP

16 Regular Crawling: ListIdentifiers harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates
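
The crawling use case amounts to reading the identifiers out of a ListIdentifiers response and fetching only those URLs. The sketch below parses a hand-written, simplified response (a real one also carries `responseDate` and `request` elements and may include a `resumptionToken`); the URLs and dates are made up for illustration, and `updated_urls` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by OAI-PMH 2.0 for protocol responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written, simplified sample; not actual mod_oai output.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/index.html</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/talk.pdf</identifier>
      <datestamp>2004-10-02</datestamp>
      <setSpec>application:pdf</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""

def updated_urls(xml_text):
    """Return the identifiers (URLs, under mod_oai) listed in a response."""
    root = ET.fromstring(xml_text)
    return [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]

to_fetch = updated_urls(SAMPLE_RESPONSE)
# A crawler would now issue an HTTP GET for each URL in to_fetch.
print(to_fetch)
```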

17 Resource Harvesting: ListRecords harvester issues a ListRecords, and gets the updates in DIDLs (http headers + by-value or by-ref resources)
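
The difference between by-value and by-reference resources can be sketched as follows. The DIDL fragment is hand-written and heavily simplified; the namespace URI and the `encoding` / `ref` attribute names reflect a general understanding of MPEG-21 DIDL and are assumptions here, not taken from mod_oai output. `extract_resources` is a hypothetical name.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed DIDL namespace; the version emitted by mod_oai may differ.
DIDL_NS = {"didl": "urn:mpeg:mpeg21:2002:02-DIDL-NS"}

# Hand-written sample: one resource carried by value, one by reference.
SAMPLE_DIDL = """
<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="text/plain" encoding="base64">aGVsbG8gd29ybGQ=</didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="application/pdf" ref="http://www.foo.edu/report.pdf"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>
"""

def extract_resources(didl_text):
    """Return (mime type, kind, payload) for every Resource element."""
    results = []
    root = ET.fromstring(didl_text)
    for res in root.iterfind(".//didl:Resource", DIDL_NS):
        mime = res.get("mimeType")
        if res.get("ref") is not None:
            # By-reference: the harvester still has to fetch this URL itself.
            results.append((mime, "by-reference", res.get("ref")))
        else:
            # By-value: the bytes travel inside the record, base64 encoded.
            results.append((mime, "by-value", base64.b64decode(res.text.strip())))
    return results

resources = extract_resources(SAMPLE_DIDL)
print(resources)
```

By-value delivery saves a round trip per resource at the cost of larger responses; by-reference keeps the record small and leaves the download to the harvester.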

18 Demo Repository Explorer –http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai –http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai Direct URLs –http://whiskey.cs.odu.edu/modoai?verb=Identify –http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats –http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc –http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header –http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl

19 Datestamps and Etags Procedure –16 harvests over 1 month of 465,374 .dk domains –5,543,470 possible downloads –5,182,034 successful downloads –599,143 changes Datestamp and Etag Example L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf

20 Errors in Datestamps and Etags Indicating Change
error | Etags | Datestamps
missed change | 0.087% | 0.30%
redundant crawl | 32% | 10.7%
40.1% of pages without Etags; 0.07% of pages without Datestamps
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf
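
The two error rows can be read as the two ways a change indicator (an Etag or a datestamp) can disagree with the actual content. The sketch below encodes that reading; it is an interpretation of the slide, not code from the cited study, and treating a missing indicator as "changed" is an assumption made here for illustration.

```python
def indicator_says_changed(previous, current):
    """Compare two Etag or datestamp values from successive harvests."""
    if previous is None or current is None:
        return True   # assumption: without an indicator, re-download to be safe
    return previous != current

def classify(indicator_changed, content_changed):
    """Label one page visit using the slide's two error categories."""
    if content_changed and not indicator_changed:
        return "missed change"      # indicator stayed the same, content did not
    if indicator_changed and not content_changed:
        return "redundant crawl"    # page downloaded again for nothing
    return "correct"

# Illustrative values only.
same_etag = indicator_says_changed('"abc123"', '"abc123"')
new_date = indicator_says_changed("2004-07-31", "2004-08-02")
print(classify(same_etag, content_changed=True))
print(classify(new_date, content_changed=False))
```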

21 mod_oai… is: –a simple way to more efficiently harvest web pages –a possible impact on robots.txt –fully OAI-PMH compliant (works with existing harvesters) is not: –yet suitable for dynamic files –a replacement for DSpace, Fedora, eprints.org, or other digital libraries / repositories / CMS

