A New Model for Web Resource Harvesting. Michael L. Nelson, Old Dominion University. Joint work with: Herbert Van de Sompel, Xiaoming Liu, Carl Lagoze, Simeon.


1 A New Model for Web Resource Harvesting
Michael L. Nelson, Old Dominion University
Joint work with: Herbert Van de Sompel, Xiaoming Liu, Carl Lagoze, Simeon Warner, Terry Harrison
This work was supported in part by the Andrew Mellon Foundation and the Library of Congress.

2 A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005
My Research Interests
Digital Objects and Repositories
o Interaction between them: roles, responsibilities, architecture, scalability
o Bumper sticker: Free the Object from the Tyranny of the Repository
Digital Preservation
o Shared infrastructure models, automation, large-scale best effort strategies
o Bumper sticker: We Need Fewer Heroes
User / System Co-Evolution
o Discerning intent from large-scale access patterns
o Bumper sticker: Freedom From Choice

3 Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research

4 WWW and DL: Separated at Birth
[Figure: DL and WWW, 1994 to today]
WWW
o The Good: XML, BitTorrent, Web Services
o The Bad: RSS
o The Ugly: Semantic Web
DL
o The Good: OAIS, DOI, OAI-PMH
o The Bad: Dublin Core
o The Ugly: SRU/W
The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.
More on DL/WWW, from the NSF Post-DL Workshop:
o http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html
o http://www.sis.pitt.edu/~dlwkshop/paper_lagoze.html

5 Web Robots
[Figure: a web robot and www.getty.edu; robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG]
www.getty.edu:
o doc1; last mod 2003-03-12
o doc2; last mod 2002-07-19
o doc100; last mod 2003-09-11
o …
Questions a robot faces:
o what documents have been modified since 2003-11-15?
o what is this file?
o what are its relationships to other files?
o how often does it change?

6 A More Efficient Way
Query: what documents have been modified since 2003-11-15?
www.getty.edu with mod_oai:
o doc1; last mod 2003-03-12
o doc2; last mod 2002-07-19
o doc100; last mod 2003-09-11
o …

7 Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research

8 A Very Brief OAI-PMH History
Universal Preprint Service (UPS)
o A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
  - not distributed searching
o Demonstrated at Santa Fe, NM, October 21-22, 1999
  - D-Lib Magazine, 6(2) 2000 (2 articles)
  - http://www.dlib.org/dlib/february00/02contents.html
o UPS was soon renamed the Open Archives Initiative (OAI): http://www.openarchives.org/
The OAI has authored, among other things, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
o in use around the world, 600+ known instances
  - registration not required; many unknown instances
  - http://gita.grainger.uiuc.edu/registry/
  - http://celestial.eprints.org/cgi-bin/status
o used by Google ca. late 2004
  - http://www.nla.gov.au/digicoll/oai/

9 Data Providers / Repositories
“A repository is a network accessible server that can process the 6 OAI-PMH requests … A repository is managed by a data provider to expose metadata to harvesters.”
Service Providers / Harvesters
“A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”

10 Aggregators
[Figure: data providers (repositories), an aggregator, and service providers (harvesters)]
Aggregators allow for:
o scalability for OAI-PMH
o load balancing
o community building
o discovery

11 OAI-PMH data model
o resource
o item: entry point to all records pertaining to the resource
  - OAI-PMH identifier
  - OAI-PMH sets
o records: metadata pertaining to the resource (e.g., Dublin Core metadata, MARCXML metadata)
  - OAI-PMH identifier
  - metadataPrefix
  - datestamp

12 Overview of OAI-PMH Verbs
Verb                 Function
Identify             description of repository
ListMetadataFormats  metadata formats supported by repository
ListSets             sets defined by repository
ListIdentifiers      OAI unique ids contained in repository
ListRecords          listing of N records
GetRecord            listing of a single record
Identify, ListMetadataFormats and ListSets return metadata about the repository; ListIdentifiers, ListRecords and GetRecord are the harvesting verbs.
Most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control).
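The resumption token is the flow-control argument mentioned on this slide: a list response may be incomplete and carry a token that the harvester sends back to get the next part. The harvester-side loop can be sketched in Python as below. This is a minimal sketch, not a full client; `fetch_page` and the stand-in `fake_fetch` are hypothetical names introduced only for illustration.

```python
def harvest_all(fetch_page):
    """Collect every record from a paged OAI-PMH list response.

    fetch_page is a hypothetical callable: given a resumption token
    (None for the first request) it returns (records, next_token),
    where next_token is None or "" once the list is complete.
    """
    records = []
    token = None
    while True:
        page, token = fetch_page(token)
        records.extend(page)
        if not token:  # an empty or missing token ends the list
            break
    return records


# Stand-in for a repository that serves five records in pages of two.
_PAGES = {
    None: (["r1", "r2"], "t1"),
    "t1": (["r3", "r4"], "t2"),
    "t2": (["r5"], ""),
}


def fake_fetch(token):
    """Return the canned page that corresponds to the given token."""
    return _PAGES[token]
```

In a real harvester, `fetch_page` would issue the HTTP request and parse the response; that part is left out here on purpose.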

13 Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research

14 Resource Harvesting: Use cases
Discovery: use content itself in the creation of services
o search engines that make full-text searchable
o citation indexing systems that extract references from the full-text content
o browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections
Preservation:
o periodically transfer digital content from a data repository to one or more trusted digital repositories
o trusted digital repositories need a mechanism to automatically synchronize with the originating data repository
Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html

15 Existing OAI-PMH based approaches
Typical scenario:
1. An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository.
2. The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource.
3. A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location.
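Step 2 of this scenario can be sketched in Python as follows. The sketch assumes an oai_dc record is already in hand as a string; the sample record, its identifier values, and the function name `candidate_locators` are made up for illustration. Note that it only guesses: any dc:identifier that looks like an HTTP URL is treated as a possible locator, and nothing tells the harvester whether that URL is the resource or a splash page.

```python
import xml.etree.ElementTree as ET

# Namespace of the unqualified Dublin Core elements used by oai_dc.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def candidate_locators(dc_record_xml):
    """Return the dc:identifier values that look like HTTP(S) URLs."""
    root = ET.fromstring(dc_record_xml)
    identifiers = [
        el.text.strip() for el in root.iter(DC_NS + "identifier") if el.text
    ]
    return [i for i in identifiers if i.startswith(("http://", "https://"))]


# Hypothetical record: one DOI-style identifier and one URL.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:identifier>doi:10.9999/example</dc:identifier>
  <dc:identifier>http://repository.example.org/archive/00000014/</dc:identifier>
</oai_dc:dc>"""
```

Step 3 would then dereference each candidate out-of-band, for example with an ordinary HTTP client.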

16 Existing OAI-PMH based approaches: Issue 1
Locating the resource based on information provided in dc.identifier
o dc.identifier is used to convey a variety of identifiers, sometimes simultaneously: URL, DOI, bibliographic citation, …
  - Not expressive enough to distinguish between identifier and locator.
  - Several dereferencing attempts required
o URI provided in dc.identifier is commonly that of a bibliographic “splash page”
  - How to know it is a bibliographic “splash page”, not the resource?
  - If it is a bibliographic “splash page”, where is the resource?

17 Existing OAI-PMH based approaches: Issue 2
Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting:
o Datestamp of DC record does not necessarily change when resource changes
                    no DC datestamp change   DC datestamp change
no resource update  OK                       unnecessary resource download
resource update     missed resource update   OK
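The four cells of this table can be written out as a small Python function, which makes the two failure modes explicit. This is only an illustration of the table; the function name `harvest_outcome` is made up here.

```python
def harvest_outcome(resource_changed, dc_datestamp_changed):
    """Classify what a datestamp-driven harvester does in each case.

    The harvester re-downloads the resource only when the datestamp of
    the Dublin Core record changes, regardless of the resource itself.
    """
    if dc_datestamp_changed:
        # Harvester downloads; that is wasted work if nothing changed.
        return "OK" if resource_changed else "unnecessary resource download"
    # Harvester skips; that loses an update if the resource did change.
    return "missed resource update" if resource_changed else "OK"
```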

18 Existing OAI-PMH based approaches: Conventions
o Cannot really address issue 2 (datestamps) with metadata conventions
o Issue 1 (identifier & locator of the resource) is currently addressed with a range of conventions
  - First dc.identifier is locator of the resource
    - what if the resource is not digital?
  - Use of dc.format and/or dc.relation to convey locator

19 Existing OAI-PMH based approaches: Conventions
[Example Dublin Core record; XML element names lost in extraction. Field values in order:]
o A Simple Parallel-Plate Resonator Technique for Microwave. Characterization of Thin Resistive Films
o Vorobiev, A.
o ING-INF/01 Elettronica
o A parallel-plate resonator method is proposed for non-destructive characterisation of resistive films used in microwave integrated circuits. A slot made in one...
o Microwave engineering Europe
o 2002
o Documento relativo ad una Conferenza o altro Evento
o PeerReviewed
o http://amsacta.cib.unibo.it/archive/00000014/ (splash page)
o pdf
o http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf (locator of resource)

20 Existing OAI-PMH based approaches: Conventions
[Dublin Core record excerpt; XML element names lost in extraction]
…
http://amsacta.cib.unibo.it/archive/00000014/ (splash page)
http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf (locator of resource)
…

21 Existing OAI-PMH based approaches: Conventions
[Dublin Core record excerpt; XML element names lost in extraction]
…
http://amsacta.cib.unibo.it/archive/00000014/ (splash page)
http://resolver.unibo.it/00000014/
http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf (locator of resource)
…

22 Existing OAI-PMH based approaches: Other attempts
o dc.identifier leads to splash page & splash page contains special purpose XHTML link to resource(s)
  - What if there is no splash page?
  - How does a harvester recognize this situation?
o OA-X: protocol extension
  - OK in local context
  - Strategic problem to generalize
  - How to consolidate with OAI-PMH data model
o Qualified Dublin Core
  - Could bring expressiveness to distinguish between locator & identifier
  - But what about the datestamp issue?

23 Proposed OAI-PMH based approach
o Use metadata formats that were specifically created for representation of digital objects:
  - Complex Object Formats as OAI-PMH metadata formats
  - MPEG-21 DIDL, METS, ...

24 OAI-PMH data model
o resource
o item: OAI-PMH identifier = entry point to all records pertaining to the resource
o records: metadata pertaining to the resource
  - Dublin Core metadata (simple)
  - MARCXML metadata (more expressive)
  - METS (highly expressive)
  - MPEG-21 DIDL (highly expressive)

25 Complex Object Formats: characteristics
Representation of a digital object by means of a wrapper XML document.
Represented resource can be:
o simple digital object (consisting of a single datastream)
o compound digital object (consisting of multiple datastreams)
Unambiguous approach to convey identifiers of the digital object and its constituent datastreams.
Include datastream:
o By-Value: embedding of base64-encoded datastream
o By-Reference: embedding network location of the datastream
o not mutually exclusive; equivalent
Include a variety of secondary information
o By-Value
o By-Reference
o Descriptive metadata, rights information, technical metadata, …

26 [Example complex object record; XML markup lost in extraction. Values in order:]
o http://amsacta.cib.unibo.it/archive/00000014/
o A Simple Parallel-Plate Resonator Technique for Microwave. Characterization of Thin Resistive Films
o Vorobiev, A.
o http://amsacta.cib.unibo.it/archive/00000014/
o application/pdf
o …

27 Complex Object Formats & OAI-PMH
o Resource represented via XML wrapper => OAI-PMH
o Uniform solution for simple & compound objects
o Unambiguous expression of locator of datastream
o Disambiguation between locators & identifiers
o OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes
o OAI-PMH semantics apply: “about” containers, set membership

28 OAI-PMH based approach using Complex Object Format
Typical scenario:
1. An OAI-PMH harvester checks for support of a locally understood complex object format using the ListMetadataFormats verb.
2. The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected.
3. A parser at the end of the harvesting application analyzes each harvested complex object record:
   - The parser extracts the bitstreams that were delivered By-Value.
   - The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference.
4. A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations.
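Step 3 of this scenario can be sketched in Python. The wrapper format below is deliberately NOT MPEG-21 DIDL or METS: the element name `datastream` and the attribute `ref` are hypothetical stand-ins, chosen only to show the By-Value / By-Reference split without guessing at the real schemas.

```python
import base64
import xml.etree.ElementTree as ET


def split_datastreams(wrapper_xml):
    """Separate By-Value bitstreams from By-Reference locations.

    Assumes a simplified, made-up wrapper in which each <datastream>
    either has a 'ref' attribute (By-Reference) or base64 text content
    (By-Value).
    """
    root = ET.fromstring(wrapper_xml)
    by_value, by_reference = [], []
    for ds in root.iter("datastream"):
        ref = ds.get("ref")
        if ref is not None:
            by_reference.append(ref)  # to be fetched out-of-band (step 4)
        elif ds.text and ds.text.strip():
            by_value.append(base64.b64decode(ds.text.strip()))
    return by_value, by_reference


# Hypothetical wrapper: one embedded datastream, one referenced datastream.
SAMPLE_WRAPPER = """<object>
  <datastream mimeType="text/plain">aGVsbG8=</datastream>
  <datastream mimeType="application/pdf" ref="http://repository.example.org/doc.pdf"/>
</object>"""
```

A parser for a real complex object format would follow the same two branches, but against that format's actual elements.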

29 Complex Object Formats & OAI-PMH: OAIS archive export/ingest

30 Complex Object Formats & OAI-PMH: issues
o Which Complex Object Format(s)
o How to Profile Complex Object Format(s) for OAI-PMH Harvesting
o Large records
o Making resources re-harvestable
o Because the resource is represented as metadata, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline?
Tools:
o Software library to write compliant complex objects
o Integration of this library with repository systems (Fedora, DSpace, eprints.org, ….)

31 Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research

32 mod_oai approach
Goal: integrate OAI-PMH functionality into the web server itself…
mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
o written in C
o respects values in .htaccess, httpd.conf
compile mod_oai on http://www.foo.edu/
baseURL is now http://www.foo.edu/modoai
o Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
  - http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
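The example request on this slide can be assembled programmatically. The Python sketch below assumes that a MIME type such as video/mpeg maps to the set name mime:video:mpeg, which is inferred from the single example above rather than taken from mod_oai documentation; the function names are hypothetical.

```python
from urllib.parse import urlencode


def mime_to_set(mime_type):
    """Turn 'video/mpeg' into 'mime:video:mpeg' (mapping assumed from the slide)."""
    major, minor = mime_type.split("/", 1)
    return f"mime:{major}:{minor}"


def list_identifiers_url(base_url, since, mime_type, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request restricted by date and MIME-type set."""
    query = [
        ("verb", "ListIdentifiers"),
        ("metadataPrefix", metadata_prefix),
        ("from", since),
        ("set", mime_to_set(mime_type)),
    ]
    # Keep ':' readable in the set name instead of percent-encoding it.
    return base_url + "?" + urlencode(query, safe=":")
```

For instance, `list_identifiers_url("http://www.foo.edu/modoai", "2004-09-15", "video/mpeg")` reproduces the URL shown on the slide.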

33 OAI-PMH data model in mod_oai
o resource: e.g. http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
o item: OAI-PMH identifier = entry point to all records pertaining to the resource
  - OAI-PMH sets: MIME type
o records: metadata pertaining to the resource
  - HTTP header metadata
  - Dublin Core metadata
  - MPEG-21 DIDL

34 OAI-PMH concepts: typical repository
OAI-PMH Entity          value           description
Resource                URL             PDF, PS, XML, HTML or other file
Item    identifier      OAI Identifier  DNS-based name of metadata about resource
Item    set membership  LCSH            Library of Congress Subject Heading
Record  metadataPrefix  oai_dc          bibliographic metadata in Dublin Core
Record  datestamp       2004-10-18      modification date of DC record
Record  metadataPrefix  oai_marc        bibliographic metadata in MARC
Record  datestamp       2004-07-31      modification date of MARC record

35 OAI-PMH concepts: mod_oai
OAI-PMH Entity          value        description
Resource                URL          HTML, GIF, PDF or other web file
Item    identifier      URL          same URL as the resource
Item    set membership  MIME type    MIME type of the resource
Record  metadataPrefix  http_header  the http headers that would have been returned via HTTP GET/HEAD
Record  datestamp       2004-07-31   modification date of resource
Record  metadataPrefix  oai_dc       a subset of http_header in DC
Record  datestamp       2004-07-31   modification date of resource
Record  metadataPrefix  oai_didl     MPEG-21 DIDL: base64 encoded resource + http_header metadata
Record  datestamp       2004-07-31   modification date of resource

36 Resource Discovery: ListIdentifiers
The harvester:
o issues a ListIdentifiers request and finds URLs of updated resources
o does HTTP GETs for updates only
o can get URLs of resources with specified MIME types
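The first step, reading the URLs of updated resources out of a ListIdentifiers response, can be sketched in Python. The element names follow the OAI-PMH 2.0 response format; the sample response and its URLs are made up, and since mod_oai uses the resource URL as the OAI-PMH identifier, each identifier can be handed straight to an HTTP GET afterwards.

```python
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def updated_urls(response_xml):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    results = []
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        datestamp = header.findtext(OAI_NS + "datestamp")
        if identifier:
            results.append((identifier.strip(), (datestamp or "").strip()))
    return results


# Hypothetical response body (trimmed to the parts used here).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/a.html</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>mime:text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/b.mpg</identifier>
      <datestamp>2004-10-01</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""
```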

37 Preservation: ListRecords
The harvester:
o issues a ListRecords request
o gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
o can get resources with specified MIME types

38 Performance of mod_oai and wget on www.cs.odu.edu
[Figure: performance comparison; plot not preserved]
For more detail: “mod_oai: An Apache Module for Metadata Harvesting”, http://arxiv.org/abs/cs.DL/0503069

39 Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research

40 Issues and Future Work
For a given server, there is a set of URLs, U, and a set of files, F
o Apache maps U -> F
o mod_oai maps F -> U
Neither function is 1-1 nor onto
o We can easily check if a single u maps to F, but given F we cannot (easily) generate U
Short-term issues:
o dynamic files
  - exporting unprocessed server-side files would be a security hole
o IndexIgnore
  - httpd will “hide” valid URLs
o File permissions
  - httpd will advertise files it cannot read
Long-term issues:
o Alias, Location
  - files can be covered up by the httpd
o UserDir
  - interactions between the httpd and the filesystem
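What "neither 1-1 nor onto" means for the map U -> F can be shown with a toy example in Python. The URLs and file paths below are invented: two URLs resolve to the same file (so the map is not 1-1), and one file on disk is reachable from no URL (so it is not onto).

```python
def is_one_to_one(mapping):
    """True if no two keys map to the same value."""
    values = list(mapping.values())
    return len(values) == len(set(values))


def is_onto(mapping, codomain):
    """True if every element of the codomain is the image of some key."""
    return set(codomain) <= set(mapping.values())


# Hypothetical server: '/' and '/index.html' serve the same file.
URL_TO_FILE = {
    "/": "/htdocs/index.html",
    "/index.html": "/htdocs/index.html",
    "/paper.pdf": "/htdocs/paper.pdf",
}

# Files on disk; the last one is never served under any URL.
FILES = ["/htdocs/index.html", "/htdocs/paper.pdf", "/htdocs/private/notes.txt"]
```

This is why walking the filesystem, as a first approximation of F -> U, can both miss valid URLs and advertise files that no URL actually serves.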

41 IndexIgnore & File Permissions

42 Alias: Covering Up Files
httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
The files “A” and “B” will be different from the URLs http://server/A and http://server/B.
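The effect of the two Alias lines can be simulated in Python. This is a simplified model of prefix matching, not Apache's actual resolution logic, and the function names are made up. It shows that guessing a URL from a file path (by stripping the document root) points back at the wrong file once the aliases are in place.

```python
DOC_ROOT = "/usr/local/web/htdocs"

# The two Alias directives from the slide, as URL prefix -> filesystem path.
ALIASES = {
    "/A": "/usr/local/web/htdocs/B",
    "/B": "/usr/local/web/htdocs/A",
}


def url_to_file(url_path):
    """Resolve a URL path to a file path, applying aliases first (simplified)."""
    for prefix, target in ALIASES.items():
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOC_ROOT + url_path


def naive_file_to_url(file_path):
    """Guess the URL path by stripping the document root (ignores aliases)."""
    return file_path[len(DOC_ROOT):]
```

Here the file `/usr/local/web/htdocs/A` would naively be advertised as `/A`, yet a request for `/A` is served from `/usr/local/web/htdocs/B`.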

43 UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf%

44 Looking Further Down the Road for mod_oai
“Reverse” the method of URL discovery
o cannot look to the files;
o listen to incoming requests and build a list of valid URLs
  - could be seeded with files at start
  - also the method for handling server processed files / URLs
Plug-ins for descriptive metadata
o DC tags in HTML
o MS Office formats, PDF
o Tags from JPEG, TIFF, MP3, etc.
Additional metadata in the DIDL
o technical metadata from JHOVE
o estimated change rate
  - cf. Cho & Garcia-Molina, ACM TOIT 28(4)
http log access as separate metadata formats
  - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)

45 Expanding OAI-PMH / Complex Object Access
OAI-PMH / CO access for:
o blogs
o message boards
o native file systems
  - e.g. Mac OS X “Spotlight”
More aggressive use of OAI-PMH / CO for preservation
o recently funded NSF DIGARCH program
o use for preservation:
  - Usenet
  - Email
  - Multicasting

46 OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting
Better web harvesting can be achieved through:
o OAI-PMH: structured access to updates
o Complex object formats: modeled representation of digital objects
Use cases:
o Preservation (ListRecords)
o Web crawling (ListIdentifiers)
mod_oai: reference implementation
o Better performance than wget
o static files only; dynamic files in the future
o not a replacement for DSpace, Fedora, eprints.org, etc.
More info:
o http://www.modoai.org/
o http://whiskey.cs.odu.edu/

47 Datestamps and Etags
Procedure
o 16 harvests over 1 month of 465,374 .dk domains
o 5,543,470 possible downloads
o 5,182,034 successful downloads
o 599,143 changes
[Figure: Datestamp and Etag Example]
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004, http://www.netarchive.dk/website/publications/Etags-2004.pdf

48 Errors in Datestamps and Etags Indicating Change
                 Etags   Datestamps
missed change    0.087%  0.30%
redundant crawl  32%     10.7%
o 40.1% of pages without Etags
o 0.07% of pages without Datestamps
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004, http://www.netarchive.dk/website/publications/Etags-2004.pdf

49 http_header

50 Complex Object Formats & OAI-PMH: existing implementations
LANL Repository
o Local storage of Terabytes of scholarly assets
o Assets stored as MPEG-21 DIDL documents
o DIDL documents made accessible to downstream applications via the OAI-PMH
Mirroring of American Physical Society collection at LANL
o Maps APS document model to MPEG-21 DIDL Transfer Profile
o Exposes MPEG-21 DIDL documents through OAI-PMH infrastructure
o Includes digests/signatures
DSpace & Fedora plug-ins
o Maps DSpace/Fedora document model to MPEG-21 DIDL Transfer Profile
o Exposes MPEG-21 DIDL documents through OAI-PMH infrastructure
mod_oai

51 mod_oai: OAI-PMH concepts
concept             mod_oai implementation
OAI-PMH Identifier  URL of resource
set                 MIME type of resource
datestamp           change time of resource
deleted records     “no” deleted records

