ECDL 2005, September 18-23, 2005, Vienna, Austria. File-based storage of Digital Objects: XMLtapes & Internet Archive ARC files. Xiaoming Liu, Luda Balakireva, Herbert Van de Sompel.


1 ECDL 2005, September 18-23, 2005, Vienna, Austria. File-based storage of Digital Objects: XMLtapes & Internet Archive ARC files. Xiaoming Liu, Luda Balakireva, Herbert Van de Sompel. RESEARCH LIBRARY
File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files
Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1)
(1) Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory
(2) University Library, Ghent University
liu_x@lanl.gov, ludab@lanl.gov, patrick.hochstenbach@ugent.be, herbertv@lanl.gov

2 Disclaimer
The term Digital Object (DO) will be used as in Kahn/Wilensky:
o Compound object
o Multiple datastreams of different MIME types
o Secondary information pertaining to object and datastreams
o Identifiers for object (and datastreams)
This roughly corresponds to OAIS Content Information.

                          Type             MIME             identifier
Digital Object            scholarly paper  N/A              DOI
Constituent Datastream 1  metadata record  application/xml  PMID
Constituent Datastream 2  fulltext file    application/pdf  –

3 XML-based representation of DOs
Growing interest in XML-based representation of DOs in Digital Library architectures:
o Platform-independence
o Industry-support
o Longevity, potential migration paths
o Processing tools, validation capabilities
XML-based Compound Object formats:
o ISO/IEC 21000-2 MPEG-21 DID & DIDL
o METS
o IMS/CP
o CCSDS XFDU
Typical functionality:
o By-Value (base64) and/or By-Reference provision of constituent datastreams
o By-Value and/or By-Reference provision of secondary information
o Provision of identifiers

4 Storing XML-based representations of DOs
Existing approaches:
o storage of the XML-representations as individual files in a file system:
  -Poor access performance
  -Poor backup performance
o storage of the XML-representations in (SQL, XML, object) databases:
  -Long term? Data are dependent on the underlying system
o storage of the XML-representations by concatenating many such documents into a single file such as tar or zip:
  -Not XML aware, hence, no use of off-the-shelf XML tools
  -Increasing storage space (base64-encoding of the constituent datastreams)

5 aDORe XMLtape/ARCfile solution
Part of LANL aDORe repository effort:
o Standards-based, modular repository architecture
  -Distributed architecture
  -Protocol-based interactions between modules
  -Usable to create interoperable federations of heterogeneous repositories
o Actual implementation of the architecture at LANL
o Components of aDORe software will be released
Inspired by Internet Archive ARC file approach:
o File-based mechanism to store datastreams resulting from Web-crawling
o Concatenation of multiple datastreams into a single file
o Metadata as separators between datastreams
o But not suited to storing XML-based representations of DOs:
  -Metadata capabilities very limited & crawling related
  -Lose power of XML processing tools

6 aDORe XMLtape/ARCfile solution
Two interconnected file-based storage mechanisms:
o XMLtapes: File storage of XML-based representations of Digital Objects
o ARCfiles: File storage of constituent datastreams of Digital Objects
The ARC files are interconnected with one or more XMLtapes during the ingestion process
A protocol-based access mechanism is introduced:
o XMLtape is exposed as an autonomous OAI-PMH repository
o ARCfile is exposed as an OpenURL Resolver
Write once - Read many:
o Files remain stable
o Protocol-based access mechanism remains stable
o Indexing mechanisms can change as technologies evolve
Storage approach is independent from the compound object format used to represent DOs as XML
o aDORe uses MPEG-21 DIDL

7 ISO/IEC 21000-2: MPEG-21 DID & DIDL
[Diagram. Nodes: Digital Item; Digital Item Declaration; DIDL document; MPEG-21 Abstract Model; MPEG-21 DIDL. Edge labels: has declaration; has XML serialization; based on.]

8 Representing DOs using MPEG-21 DID
[Figure: a Digital Object packaged as a sample DIDL document.]

9 aDORe XMLtape
An XML file that concatenates the XML-based representations of multiple DOs
Structure is defined by an XML Schema:
o http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtape.xsd
o tape-level administrative section:
  -Open-ended content
  -Plug-in for processing-related information, indication of related ARCfiles: http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtapeBasics.xsd
o concatenation of records, each of which consists of:
  -record-level administrative section
    -identifier and datestamp of the contained record
    -other record-level administrative information
  -a record (can be from any XML Namespace). DIDL in case of aDORe: http://purl.lanl.gov/aDORe/schemas/2005-08/DIDL.xsd
An XMLtape is a valid and well-formed XML file
Independent from chosen XML-based Compound Object Format

10 aDORe XMLtape
sample XMLtape (most element markup did not survive; fragments shown as found):
    <ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/" ...
    ...
    oai:aps.org:PhysRevA.71.040101
    2005-03-29T04:31:22Z
    ...
    ...
    aDORe
    ta:tape

11 aDORe XMLtape index
[Diagram: each XMLtape record carries an identifier and a datestamp of ingestion; an identifier/datestamp index points to the records in the XMLtape.]
Indexing: Can be achieved with a variety of technologies
Current implementation: Berkeley DB Java Edition
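The identifier/datestamp index can be illustrated with a minimal sketch. It is not the aDORe implementation: a plain Python dict stands in for Berkeley DB Java Edition, and the element names `ta:record`, `ta:identifier` and `ta:date`, as well as the sample identifiers, are assumptions made for illustration because the transcript does not preserve the real markup.

```python
import re

# Hypothetical tape content. The element names and identifiers are
# assumptions for illustration; they are not taken from the XMLtape schema.
SAMPLE_TAPE = (
    b'<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">'
    b'<ta:record><ta:identifier>oai:example:1</ta:identifier>'
    b'<ta:date>2005-03-29T04:31:22Z</ta:date><doc>first</doc></ta:record>'
    b'<ta:record><ta:identifier>oai:example:2</ta:identifier>'
    b'<ta:date>2005-03-30T10:00:00Z</ta:date><doc>second</doc></ta:record>'
    b'</ta:tape>'
)

RECORD = re.compile(rb"<ta:record>.*?</ta:record>", re.DOTALL)
IDENTIFIER = re.compile(rb"<ta:identifier>(.*?)</ta:identifier>")
DATE = re.compile(rb"<ta:date>(.*?)</ta:date>")


def build_index(tape: bytes) -> dict:
    """Map each record identifier to (datestamp, byte offset, byte length)."""
    index = {}
    for match in RECORD.finditer(tape):
        chunk = match.group(0)
        identifier = IDENTIFIER.search(chunk).group(1).decode("utf-8")
        datestamp = DATE.search(chunk).group(1).decode("utf-8")
        index[identifier] = (datestamp, match.start(), len(chunk))
    return index


def fetch(tape: bytes, index: dict, identifier: str) -> bytes:
    """Return the raw bytes of one record using only its index entry."""
    _, offset, length = index[identifier]
    return tape[offset:offset + length]
```

Because the index stores only offsets into an unchanging file, it can be thrown away and rebuilt with another technology without touching the tape, which is the property the write-once, read-many design relies on.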

12 aDORe XMLtape as OAI-PMH repository
[Diagram: an OAI-PMH request is resolved through the identifier/datestamp index to an XMLtape record, and the DIDL document is returned.]
OAI-PMH identifier = identifier from […]
OAI-PMH datestamp = datetime from […]
OAI-PMH response = content of […]
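A harvester would address an XMLtape through ordinary OAI-PMH requests. The sketch below only builds request URLs for the standard `GetRecord` and `ListRecords` verbs. The base URL and the `didl` metadata prefix are assumptions for illustration, since the slides do not state either.

```python
from urllib.parse import urlencode

# Hypothetical endpoint of one XMLtape exposed as an OAI-PMH repository.
BASE_URL = "http://example.org/xmltape/oai"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request for a single record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


def list_records_url(base_url: str, metadata_prefix: str,
                     from_date: str = None, until_date: str = None) -> str:
    """Build an OAI-PMH ListRecords request, optionally bounded by datestamp."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


# The identifier is the one shown in the sample XMLtape; "didl" is assumed.
example = get_record_url(BASE_URL, "oai:aps.org:PhysRevA.71.040101", "didl")
```

The `from` and `until` arguments are where the datestamp half of the index would come into play, since they select records by ingestion date rather than by identifier.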

13 Internet Archive ARCfile
Concatenation of binary files
Designed and used by the Internet Archive (Wayback Machine)
o > 400 TB web data
Under revision by the International Internet Preservation Consortium (IIPC): WARC file format
o Input from LANL to facilitate non-Web-crawling use case
The ARC file format is structured as follows:
o file header that provides administrative information about the ARC file itself
o a sequence of document records, consisting of:
  -a header line containing some, mainly crawl-related, metadata:
    -URI of the crawled document
    -timestamp of acquisition of the data
    -size of the data block
  -a response to a protocol request such as an HTTP GET

14 Internet Archive ARC file
sample ARC file:
    filedesc://IA-001102.arc 0 19960923142103 text/plain 76
    1 0 Alexa Internet
    URL IP-address Archive-date Content-type Archive-length

    http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
    HTTP/1.0 200 Document follows
    Date: Mon, 04 Nov 1996 14:21:06 GMT
    Server: NCSA/1.4.1
    Content-type: text/html
    Last-modified: Sat,10 Aug 1996 22:33:11 GMT
    Content-length: 30

    Hello World!!!
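The record header line in the sample follows the field order announced in the file header: URL, IP-address, Archive-date, Content-type, Archive-length. A minimal parsing sketch, assuming space-separated fields and a 14-digit timestamp as in the sample (the `ArcHeader` class and function name are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int  # size in bytes of the data block that follows


def parse_arc_header(line: str) -> ArcHeader:
    """Parse one ARC record header line with five space-separated fields."""
    url, ip_address, date, content_type, length = line.strip().split(" ")
    return ArcHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        archive_length=int(length),
    )


header = parse_arc_header(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 "
    "19961104142103 text/html 202"
)
```

The `archive_length` field is what lets a reader skip over a data block without inspecting it, so binary payloads need no escaping.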

15 Internet Archive ARC file in aDORe
sample aDORe ARC file:
    filedesc://singletape.arc 0.0.0.0 20050922142103 text/plain 76
    1 0 Internet Archive
    URL IP-address Archive-date Content-type Archive-length

    info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a 0.0.0.0 20050907221344 application/pdf 415025
    %PDF-1.3
    %âãÏÓ
    290 0 obj
    <<
    /Linearized 1
    /O 295
    /H [ 3642 1057 ]
    /L 415025
    …

16 Internet Archive ARC file index
[Diagram: an index keyed on the URL field of each ARC record header (URL IP-address Archive-date Content-type Archive-length) points to the corresponding datastream in the ARC file.]
Indexing: Can be achieved with a variety of technologies
Current implementation in aDORe: Heritrix toolkit
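The URL-keyed index can be sketched as a single pass over the file that reads each header line and then jumps ahead by Archive-length bytes. This is a simplified illustration, not the Heritrix indexer: it omits the file header block, assumes each data block is followed by exactly one newline, and the function names and the second identifier are hypothetical.

```python
def append_record(arc: bytearray, url: str, content_type: str,
                  payload: bytes, date: str = "20050907221344") -> None:
    """Append one simplified ARC-style record: header line, payload, newline."""
    header = f"{url} 0.0.0.0 {date} {content_type} {len(payload)}\n"
    arc += header.encode("ascii") + payload + b"\n"


def build_index(arc: bytes) -> dict:
    """Map each record URL to (offset of its data block, data block length)."""
    index = {}
    pos = 0
    while pos < len(arc):
        eol = arc.index(b"\n", pos)
        fields = arc[pos:eol].decode("ascii").split(" ")
        url, length = fields[0], int(fields[-1])
        start = eol + 1
        index[url] = (start, length)
        pos = start + length + 1  # skip the payload and its trailing newline
    return index


def read_datastream(arc: bytes, index: dict, url: str) -> bytes:
    """Return one datastream by slicing the file at the indexed position."""
    start, length = index[url]
    return bytes(arc[start:start + length])


arc = bytearray()
append_record(arc, "info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a",
              "application/pdf", b"%PDF-1.3\n\x00binary bytes")
append_record(arc, "info:example/ds/2", "text/plain", b"Hello World!!!")
index = build_index(arc)
```

Since the scan advances by the declared length, newlines inside a payload are never mistaken for record boundaries.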

17 ARC file as OpenURL Resolver
[Diagram: an OpenURL request is resolved through the URL index to a datastream in the ARC file, and the datastream is returned.]
Referent Identifier = datastream identifier = URL from ARC record header
Resolver Identifier = identifier of ARC file

18 Associating an XMLtape with ARC Files (1)
A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID)
The resulting package (e.g. DIDL document) is stored in an XMLtape
Constituent datastreams of the Digital Object are provided By-Reference:
o Using the ref attribute of the Resource element in MPEG-21 DID
o The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework:
    baseURL(ARCfile OpenURL Resolver)?
      url_ver = Z39.88-2004
      & rft_id = Datastream Identifier
      & res_id = ARCfile identifier

19 Associating an XMLtape with ARC Files (1)
Extract from DIDL:
    ……
    info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b
    <didl:Resource mimeType="application/pdf"
      ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?url_ver=Z39.88-2004&amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2&amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
    ……
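The By-Reference link is an OpenURL with three keys: `url_ver`, `rft_id` (the datastream identifier) and `res_id` (the ARCfile identifier). A minimal sketch of building and reading such a link follows, using the values from the DIDL extract. The function names are hypothetical, and the percent-encoding applied by `urlencode` is a choice of this sketch; the slide shows the values unencoded.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

RESOLVER = "http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver"


def datastream_openurl(resolver_base: str, datastream_id: str,
                       arcfile_id: str) -> str:
    """Build the OpenURL that locates one datastream in one ARC file."""
    query = urlencode({
        "url_ver": "Z39.88-2004",
        "rft_id": datastream_id,   # Referent Identifier
        "res_id": arcfile_id,      # Resolver Identifier
    })
    return f"{resolver_base}?{query}"


def parse_datastream_openurl(url: str) -> tuple:
    """Return (datastream identifier, ARCfile identifier) from an OpenURL."""
    params = parse_qs(urlsplit(url).query)
    return params["rft_id"][0], params["res_id"][0]


link = datastream_openurl(
    RESOLVER,
    "info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b",
    "info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2",
)
```

A resolver would use `res_id` to select the ARC file and `rft_id` as the key into that file's URL index.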

20 Associating an XMLtape with ARC Files (2)
An XMLtape is associated with its corresponding ARCfiles through a plug-in for the XMLtape-level administrative section.

21 Associating an XMLtape with ARC Files (2)
XMLtape header (element markup did not survive; values shown as found):
    info:lanl-repo/xmltape/singlescitape
    info:lanl-repo/arc/singlescitape
    gov.lanl.xmltape.SingleTapeWriter
    2005-09-07T22:13:39Z
    …

22 [Diagram. Components: Agent; Identifier Locator; XMLtape with a DIDLDocument-id index and a creation datetime index; ARC file with a datastream-id index. Labeled flows: DIDLDocument-id or content-id; list of (baseURL, DIDLDocument-id); DIDL document; ref; OpenURL; datastream-id; datastream.]

23 aDORe XMLtape/ARCfile environment

24 Implementation
XMLtapes:
o Berkeley DB Java Edition
o OCLC OAICat
ARCfiles:
o Heritrix
o OCLC OpenURL software
XMLtape Registry:
o MySQL db
o OCLC OAICat
ARCfile Registry:
o MySQL db
o OCLC OAICat

25 Performance indicators
System:
o Model: Dell 2650 2U rack-mount server
o CPU: dual 2.8 GHz Intel Xeon processors
o RAM: 5 GB
o Disks: 10k RPM SCSI disks
XMLtape:
o 1786 MB, 201872 DIDL records
o download 100 consecutive DIDL records (787 KB) => 0.18 second
o download static file of same size => 0.09 second
ARCfile:
o 272 MB, 4910 files
o download a sample PDF file (312 KB) => 0.24 second
o download static file of same size => 0.036 second
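The timings can be turned into overhead ratios relative to the static-file baseline. The short calculation below uses only the figures reported on the slide; the variable and function names are illustrative.

```python
def overhead(measured_s: float, baseline_s: float) -> float:
    """Ratio of a measured download time to the static-file baseline."""
    return measured_s / baseline_s


# Figures reported on the slide.
xmltape_ratio = overhead(0.18, 0.09)    # 100 DIDL records vs. static file
arcfile_ratio = overhead(0.24, 0.036)   # sample PDF vs. static file

# Average time per DIDL record in the 100-record download, in milliseconds.
per_record_ms = 0.18 / 100 * 1000

print(f"XMLtape overhead: {xmltape_ratio:.1f}x")
print(f"ARCfile overhead: {arcfile_ratio:.1f}x")
print(f"Per DIDL record:  {per_record_ms:.1f} ms")
```

On these numbers the XMLtape path costs about twice the static-file download and the ARCfile path a little under seven times, with each DIDL record taking under 2 ms on average.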

26 Software
ARC files:
o Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/
o NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the internet for future generations. http://www.netarchive.dk/
o Many other tools: http://archive-access.sourceforge.net
XMLtapes:
o Perl tool, XML::Tape (LANL & Ghent University), http://search.cpan.org/~hochsten/XML-Tape/
Combined aDORe XMLtape/ARCfile environment:
o Java tool (LANL), soon to be released on SourceForge

27 Conclusion
The file-based approach is inherently simple, and reduces dependency on database systems.
The autonomy of the indexes allows retaining the files over time, while the indexes can be recreated using other techniques as technologies evolve.
The protocol-based nature of the access increases flexibility in light of evolving technologies, as it introduces another layer of abstraction.
The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features:
o Off-the-shelf XML tools can be used to parse/validate an XMLtape
o All DO metadata can be stored in XML-based compound object format
Presentation available via http://public.lanl.gov/herbertv/
Install TSCC codec for avi movies

