A Modular, Standards-based Digital Object Repository


1 A Modular, Standards-based Digital Object Repository
aDORe: A Modular, Standards-based Digital Object Repository
Herbert Van de Sompel
Digital Library Research & Prototyping Team
Research Library, Los Alamos National Laboratory

2 aDORe repository architecture : an overview

3 context
Initial motivation:
- Undo tight integration between data and application
- Uniform approach for ingesting, storing, and disseminating LANL RL data collections
Bigger picture:
- Allow for multiple, parallel applications on top of stored content
- Create an environment that provides guarantees regarding long-term accessibility of stored content

4 context Core characteristics of the aDORe architecture:
- Standards-based: XML, XML Schema, MPEG-21 Digital Item Declaration, MPEG-21 Digital Item Identification, MPEG-21 Digital Item Processing, OAI-PMH, NISO OpenURL Framework for Context-Sensitive Services, Internet Archive ARC file format, OAIS concepts
- Component-based, modular design
- Interaction between components is protocol-based
- Dynamic attachment of dissemination methods to stored content
- future-proofness
- ability to use off-the-shelf software
- ability to replace components while maintaining stability
- increase interoperability with information environment at large scale

5 core aDORe modules
Ingestion process:
- Representing Digital Objects using MPEG-21 DID
- Identification
- Storing Digital Objects: XMLtape & ARC files
Autonomous OAI-PMH Repositories
Repository Index
Identifier Locator
OAI-PMH Federator
OpenURL Gateway

6 overview of the aDORe architecture
[Diagram: aDORe architecture with components numbered 1 to 7. Labels: Pre-Ingest, Ingest, OAI-PMH repositories (TechReport, A&I, FTXT), Repo Index, Identifier Resolver, OAI-PMH and OpenURL interfaces, MPEG-21 DIP Engine, Registry of transformations, Profile/Behavior Registry, DID, DID with DIM, publisher, application, CNRI handle, JAVA, C.]

7 Pre-Ingest: data input from information provider
Data feeds from third parties:
- Delivered in various ways (http, ftp, OAI-PMH, ..)
- Many different formats
- Typically contain many assets in a single feed
- Assets are typically 'complex', i.e. they consist of multiple datastreams

8 Pre-Ingest: sample Digital Object
Entity | Type | MIME | identifier
Digital Object | scholarly paper | N/A | DOI
Constituent Datastream 1 | metadata record | application/xml | PMID
Constituent Datastream 2 | fulltext file | application/pdf |

9 Ingest: representing Digital Objects using MPEG-21 DID
- Ingest process creates a Package per Digital Object
- The Package is an XML document compliant with the MPEG-21 Digital Item Declaration Language ~ DIDL document
- The DIDL document is the OAIS Archival Information Package in aDORe
- A new DIDL document is created when a new version of a previously ingested Digital Object is ingested
- The DIDL document typically contains:
  - By-Value: metadata (Digital Object & Ingest-related)
  - By-Reference: other constituent datastreams of the Digital Object

10 MPEG-21 DID - 1. Data Model abstract definitions + W3C XML Schema
DID entities + DIDL XML representation:
a container | didl:Container
an item | didl:Item
a component | didl:Component
a resource | didl:Resource
a descriptor | didl:Descriptor
remarks:
- we defined a DIDL profile for LANL repository
- we define a profile 'per collection'
- all profiles are fully DIDL compliant

11 MPEG-21 DID - Data Model + XML representation

12 MPEG-21 DID - Descriptors
- secondary information pertaining to entities
- MPEG-21 defined uses:
  - identification information: MPEG-21 Part 3 (DII)
  - rights information: MPEG-21 Part 5 (REL) / Part 4 (IPMP)
  - processing information: MPEG-21 Part 10 (DIP)
- community/application specific uses: cf. use of Descriptors in LANL profile

13 MPEG-21 DID - Descriptors - identifiers
<didl:Item>
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <dii:Identifier xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS">urn:isbn:</dii:Identifier>
    </didl:Statement>
  </didl:Descriptor>
</didl:Item>
Callout: MPEG-21 dii:Identifier

14 MPEG-21 DID - Descriptors - rights
<didl:Item>
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <r:license xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS">
        <!-- optionally, specific rights can be added here.-->
        <r:otherInfo>
          <dc:rights xmlns:dc=" Copyright2003; American Physical Society</dc:rights>
        </r:otherInfo>
      </r:license>
    </didl:Statement>
  </didl:Descriptor>
</didl:Item>
Callout: MPEG-21 r:license

15 MPEG-21 DID - Descriptors - processing information
Content (callout: MPEG-21 dip:ObjectType):
<didl:Component>
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <dip:ObjectType xmlns:dip="urn:mpeg:mpeg21:2002:01-DIP-NS">urn:foobar:Argument</dip:ObjectType>
    </didl:Statement>
  </didl:Descriptor>
</didl:Component>
Processing Item (callout: MPEG-21 dip:Argument):
<didl:Item>
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <dip:Argument xmlns:dip="urn:mpeg:mpeg21:2002:01-DIP-NS">urn:foobar:Argument</dip:Argument>
    </didl:Statement>
  </didl:Descriptor>
  <didl:Resource> function PlayTrack() { } </didl:Resource>
</didl:Item>

16 profiling MPEG-21 DIDL for the aDORe architecture
question:
- how to map datastreams of compound objects to the DID data model: local choices
- how to use Descriptors to meet the design goals of the repository and its associated applications: core aDORe characteristics
- how to convey a variety of non-core secondary information: local choices

17 Construction of DIDL documents in aDORe
- Each Digital Object is mapped to a top-level DIDL Item element. Constituent datastreams are provided in child elements of this top-level Item. An identifier of the Digital Object is expected at this level.
- A constituent datastream of a Digital Object is provided in a Component/Resource construct.
  - If identifier => Component/Resource construct is embedded in a sub-Item of the top-level Item
  - If no identifier => Component/Resource construct is child of top-level Item
- Top-level Item is embedded in a Container element (transformations of DIDL documents)
- The top-level Item and its parent Container element are then embedded in the DIDL root element => DIDL XML document == OAIS AIP that represents the Digital Object.
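Sketch of the construction rules above, in Python with the standard library only. The DII namespace is the one shown on the slides; the DIDL namespace URI, the helper names, and the use of mimeType/ref attributes on Resource for by-reference datastreams are assumptions made for illustration, not the aDORe implementation.

```python
import xml.etree.ElementTree as ET

# DII namespace as shown on the slides; the DIDL namespace is an assumption.
DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"
DII_NS = "urn:mpeg:mpeg21:2002:01-DII-NS"
ET.register_namespace("didl", DIDL_NS)
ET.register_namespace("dii", DII_NS)


def q(ns, tag):
    """Return a namespace-qualified tag name in ElementTree notation."""
    return f"{{{ns}}}{tag}"


def add_identifier(item, identifier):
    """Attach a Descriptor/Statement/dii:Identifier to an Item (hypothetical helper)."""
    desc = ET.SubElement(item, q(DIDL_NS, "Descriptor"))
    stmt = ET.SubElement(desc, q(DIDL_NS, "Statement"),
                         mimeType="text/xml; charset=UTF-8")
    ET.SubElement(stmt, q(DII_NS, "Identifier")).text = identifier


def add_datastream(parent, mime_type, ref):
    """Attach a by-reference datastream as a Component/Resource construct."""
    comp = ET.SubElement(parent, q(DIDL_NS, "Component"))
    ET.SubElement(comp, q(DIDL_NS, "Resource"), mimeType=mime_type, ref=ref)


def build_didl(object_id, datastreams):
    """Build DIDL > Container > top-level Item for one Digital Object.

    datastreams: list of (identifier or None, mime_type, ref).
    """
    root = ET.Element(q(DIDL_NS, "DIDL"))
    container = ET.SubElement(root, q(DIDL_NS, "Container"))
    top = ET.SubElement(container, q(DIDL_NS, "Item"))
    add_identifier(top, object_id)
    for ds_id, mime_type, ref in datastreams:
        if ds_id is not None:
            # Datastream has its own identifier: wrap it in a sub-Item.
            sub = ET.SubElement(top, q(DIDL_NS, "Item"))
            add_identifier(sub, ds_id)
            add_datastream(sub, mime_type, ref)
        else:
            # No identifier: Component is a direct child of the top-level Item.
            add_datastream(top, mime_type, ref)
    return root
```

Calling ET.tostring(build_didl(...), encoding="unicode") serializes the result; the package-level attributes discussed on the following slides are left out of this sketch.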

18 Pre-Ingest: sample Digital Object
Entity | Type | MIME | identifier
Digital Object | scholarly paper | N/A | DOI
Constituent Datastream 1 | metadata record | application/xml | PMID
Constituent Datastream 2 | fulltext file | application/pdf |

19 Ingest: representing Digital Objects using MPEG-21 DID
[Diagram: the Package, shown as a <Container>, wrapping the Digital Object]

20 Ingest: Identification (core)
[Diagram: <Container>]
- Package Identifiers: @DIDid
- Content Identifiers: MPEG-21 DII

21 Ingest: DIDL Creation Dates (core)
[Diagram: <Container>]
- @DIDcreated T15:42:16Z

22 Ingest: Formats (core)
[Diagram: <Container>]
- dc.format - info:lanl-repo/pro/DID
- dc.format - info:lanl-repo/pro/pub
- dc.format info:lanl-repo/fmt/1
- dc.format info:lanl-repo/fmt/456

23 ‘Formats’ as placeholder for dynamic behaviors
stored DID:
<didl:Descriptor>
  <didl:Statement>
    <dc:format> info:lanl-repo/fmt/1 </dc:format>
  </didl:Statement>
</didl:Descriptor>
disseminated DID (dynamic insertion of behaviors, via the Profile/Behavior Registry):
Content Item (callout: MPEG-21 dip:ObjectType):
<didl:Item>
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <dip:ObjectType xmlns:dip="urn:mpeg:mpeg21:2002:01-DIP-NS">urn:foobar:Argument</dip:ObjectType>
    </didl:Statement>
  </didl:Descriptor>
</didl:Item>
Processing Item (callout: MPEG-21 dip:Argument):
<didl:Item>
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <dip:Argument xmlns:dip="urn:mpeg:mpeg21:2002:01-DIP-NS">urn:foobar:Argument</dip:Argument>
    </didl:Statement>
  </didl:Descriptor>
  <didl:Resource> function PlayTrack() { } </didl:Resource>
</didl:Item>

24 Ingest: Digests (core)
[Diagram: <Container>, with three callouts each labeled W3C XML Signature]

25 Ingest: Bitstream Creation Dates (local)
[Diagram: <Container>]
- dc.created T12:05:33Z
- dc.created T14:22:54Z

26 Ingest: Collection Membership (local)
[Diagram: <Container>]
- dcterms.isPartOf info:sid/library.lanl.gov:Elsevier

27 Ingest: Rights Information (local)
[Diagram: <Container>]
- dc.rights - textual statement

28 Ingest: Storing DIDL documents in XMLtapes & ARC files
File-based storage approach combines:
- XMLtapes: Valid XML file that concatenates multiple DIDL documents (all metadata & identifiers)
- Internet Archive ARC files: File that concatenates multiple bitstreams
- Connection XMLtapes & ARC files: Pointers from DIDL documents into ARC files

29 XMLTape: sequential storage of DIDs
XMLTape:
- XML wrapper for batch of DIDs
- index based on byte offset and byte count in XML file
DID content:
- inline XML (typically including descriptive metadata)
- secondary information
- pointers to bitstreams in ARC files
[Diagram: an XMLTape holding a sequence of DIDs, each with a DID-identifier and a datestamp of creation]

30 ARC files: sequential storage of bitstreams
ARC file:
- Internet Archive file format
- index (arc identifier) based on byte offset and byte count in ARC file
- content: bitstreams
[Diagram: an ARC file holding a sequence of resources]

31 XMLtapes & ARC files
[Diagram: an XMLtape holding DIDs next to an ARC file holding resources; pointers are OpenURLs]
XMLtape Index:
- DID-id 1 (Byte offset 1, Byte Count 1)
- DID-id 2 (Byte offset 2, Byte Count 2)
- DID-id 3 (Byte offset 3, Byte Count 3)
- DID-created 1; DID-created 2; DID-id 8
ARC Index:
- arc id 1 (Byte offset 1, Byte Count 1)
- arc id 2 (Byte offset 2, Byte Count 2)
- arc id 3 (Byte offset 3, Byte Count 3)
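Sketch of the byte-offset/byte-count index idea, in Python. The `<tape>` wrapper element, the function names and the in-memory index are assumptions for illustration; the slides do not describe the actual XMLtape layout beyond "XML wrapper" and "byte offset and byte count".

```python
import io


def write_tape(records):
    """Concatenate (did_id, xml_text) records into one wrapper document.

    Returns (tape_bytes, index), where index maps did_id to
    (byte offset, byte count) of that record inside tape_bytes.
    """
    buf = io.BytesIO()
    buf.write(b"<tape>\n")  # hypothetical wrapper element
    index = {}
    for did_id, xml_text in records:
        data = xml_text.encode("utf-8")
        # Offsets and counts are in bytes, not characters.
        index[did_id] = (buf.tell(), len(data))
        buf.write(data)
        buf.write(b"\n")
    buf.write(b"</tape>\n")
    return buf.getvalue(), index


def read_record(tape, index, did_id):
    """Fetch one record by seeking to its offset and reading its byte count."""
    offset, count = index[did_id]
    stream = io.BytesIO(tape)
    stream.seek(offset)
    return stream.read(count).decode("utf-8")
```

The same lookup pattern would apply to an ARC index keyed by arc identifier; only the payload (bitstreams instead of XML) differs.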

32 overview of the aDORe architecture
[Diagram: aDORe architecture overview, repeated from slide 6]

33 Autonomous OAI-PMH Repositories
- OAI-PMH identifier = @DIDid
- OAI-PMH datestamp = @DIDcreated
- OAI-PMH response = DIDs
- OAI-PMH sets:
  - collection = dcterms.isPartOf
  - profile ~ Digital Format Identifier = dc.format
[Diagram: Ingest writes XMLtapes (or other); each collection is exposed as its own OAI-PMH repository: techReport at baseURL(1), A&I at baseURL(2), FTXT at baseURL(3)]

34 overview of the aDORe architecture
[Diagram: aDORe architecture overview, repeated from slide 6]

35 Repository Index: Registry of Autonomous OAI-PMH repositories
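- STEP 1: ListIdentifiers (OAI-PMH) against the Repo Index at baseURL(index) returns the registered repositories: baseURL(1), baseURL(2), baseURL(3)
- STEP 2: ListRecords (OAI-PMH) against a repository, e.g. techReport at baseURL(1), returns a list of DIDs
[Diagram: the Repository Index exposes the base URLs of the techReport and A&I repositories]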
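Sketch of the two-step harvest as request URLs, in Python. ListIdentifiers, ListRecords and the metadataPrefix argument are standard OAI-PMH; the `didl` prefix value, the function names and any base URLs passed in are assumptions.

```python
from urllib.parse import urlencode


def oai_request(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


def harvest_plan(index_base_url, repository_base_urls, metadata_prefix="didl"):
    """Return (step 1 URL, list of step 2 URLs).

    Step 1 asks the Repository Index which repositories exist.
    Step 2 asks each repository for its records (DIDs).
    The metadata_prefix default is a hypothetical value.
    """
    step1 = oai_request(index_base_url, "ListIdentifiers",
                        metadataPrefix=metadata_prefix)
    step2 = [
        oai_request(url, "ListRecords", metadataPrefix=metadata_prefix)
        for url in repository_base_urls
    ]
    return step1, step2
```

In practice the list passed as repository_base_urls would come from parsing the step 1 response; that parsing is omitted here.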

36 overview of the aDORe architecture
[Diagram: aDORe architecture overview, repeated from slide 6]

37 Identifier Locator: Locating DIDL documents, Digital Objects, constituent datastreams
Identifier Locator table (columns: identifier, datestamp, repository):
- DID-id 1 -> baseURL(1) & DID-id 1
- Content-id 1 -> baseURL(2) & DID-id x
- Content-id 2 -> baseURL(x) & DID-id y
- ... -> baseURL(9) & DID-id p
Lookup: a DID-id or content-id goes in; baseURL & DID-id come out.
[Diagram: the Identifier Locator monitors the Repo Index at baseURL(index) and the repositories it lists, e.g. techReport and A&I at baseURL(2)]

38 Identifier Locator
Columns: Identifier | Repository Location (baseURL, protocol) | Repository Id | extension (XML ID)
Entries as transcribed: info:lanl-repo/i/UUID1 baseURL1 OAI-PMH info:lanl-repo/opac/LANLb UUID2 info:lanl-repo/tr/LA-9870 UUID3
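Sketch of the locator as a lookup table, in Python. It only illustrates the mapping described on the slides (a package or content identifier resolves to a repository baseURL, a protocol and a DID-id); the class, field and method names are hypothetical.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    base_url: str   # baseURL of the repository holding the DIDL document
    protocol: str   # access protocol, e.g. "OAI-PMH"
    did_id: str     # identifier of the DIDL document (package identifier)


class IdentifierLocator:
    """Maps DID-ids and content-ids to the repository that holds them."""

    def __init__(self):
        self._table = {}

    def register(self, identifier, base_url, did_id, protocol="OAI-PMH"):
        """Record where an identifier lives; a later entry replaces an earlier one."""
        self._table[identifier] = Location(base_url, protocol, did_id)

    def locate(self, identifier) -> Optional[Location]:
        """Return the Location for an identifier, or None if it is unknown."""
        return self._table.get(identifier)
```

A DID-id is registered pointing at itself, while a content-id points at the DID-id of the package that contains it, which mirrors the "baseURL & DID-id" answers in the table above.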

39 overview of the aDORe architecture
[Diagram: aDORe architecture overview, repeated from slide 6]

40 OAI-PMH Federator: Retrieve (batches of) OAIS DIPs
- Exposes OAIS Package level access ~ DIDL documents & transforms
- Key = Package Identifier
- OAI-PMH sets: baseURL (set = baseURL(1), set = baseURL(2), set = baseURL(3)), Collection, Format
- Output formats: DID, DID with PI, METS, SCORM, …
[Diagram: the OAI-PMH Federator sits in front of the techReport, A&I and FTXT repositories and works with the MPEG-21 DIP Engine, the Registry of transformations and the Profile/Behavior Registry]

41 DIM Inserter: dynamic insertion of behaviors

42 OpenURL Resolver: Retrieve OAIS Result Sets
- Exposes OAIS Result Set level access: Digital Object, contained datastreams & services
- OpenURL request carries: Requester, ServiceType, Referent
- Key = Content Identifier or Package Identifier
- Output: DID with PI, transformed content
[Diagram: the OpenURL Resolver sits in front of the techReport, A&I and FTXT repositories (reached over OAI-PMH) and works with the MPEG-21 DIP Engine, the Registry of transformations and the Profile/Behavior Registry]

43 OpenURL-based disseminations
disseminate DIDs, contained datastreams and transforms thereof
Example request parameters:
- rfr_id=info:sid/library.lanl.gov
- url_ver=Z
- rft_id=info:lanl-repo/biosis/PREV
- svc_id=info:lanl-repo/svc/tomods.marc
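Sketch of assembling such a dissemination request, in Python. The parameter names (url_ver, rft_id, svc_id, rfr_id) are the ones on the slide; the function name and the resolver base URL are assumptions, and url_ver is left to the caller because its value is cut off in the transcript.

```python
from urllib.parse import urlencode


def dissemination_url(resolver_base, url_ver, rft_id, svc_id, rfr_id):
    """Build an OpenURL asking for service svc_id applied to referent rft_id.

    resolver_base is a hypothetical address for the OpenURL Resolver.
    rfr_id identifies the referrer (the requesting application).
    """
    query = urlencode({
        "url_ver": url_ver,
        "rft_id": rft_id,   # what is requested: content or package identifier
        "svc_id": svc_id,   # which dissemination (transformation) to apply
        "rfr_id": rfr_id,
    })
    return f"{resolver_base}?{query}"
```

urlencode percent-encodes the colons and slashes in the info: URIs, so the identifiers survive transport as single parameter values.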

44 OAI-PMH Federator & OpenURL Resolver
aDORe front-end | Interface standard | identifier | OAIS Access Type | # items in response
OAI-PMH Federator | OAI-PMH | Package Identifier | OAIS DIP | 1 or more
OpenURL Resolver | NISO OpenURL | Content Identifier, Package Identifier (with XML ID fragment) | Result Set | 1

45 aDORe architecture : papers
- Using MPEG-21 DIDL to Represent Complex Digital Objects in LANL
- Using MPEG-21 DIP and NISO OpenURL for the Dynamic Dissemination of Complex Digital Objects in LANL
- The multi-faceted use of the OAI-PMH in the LANL Repository
- aDORe: a modular, standards-based Digital Object Repository
arXiv:cs.DL/

46 aDORe architecture : conclusions
aDORe & scale:
- Modular nature, storage of DIDL documents in Autonomous OAI-PMH Repositories, storage of bitstreams in ARC files
- Dynamic binding of behaviors
- Create new DIDL document in case of updates
- First large-scale use of MPEG-21 technologies
aDORe & standards:
- Use off-the-shelf software
- Migration to other implementations without major disruptions
- When new-generation standards emerge, probably/hopefully migration tools will be available
aDORe & protocols:
- Distributed implementation (cf. Federation of Institutional Repositories)
- Novel use of OpenURL: contextual capabilities, generic DL front-end

47 aDORe architecture : conclusions
- In production since 08/2004
- Currently 30,000,000 DIDL documents
- Various downstream applications harvesting from aDORe (search engines, de-duplication component)
- New version ~ Summer 2005

48 Dynamic de-duplication of bibliographic information

49 LANL De-duplication Problem
LANL Research Library locally hosts a large data collection:
- A&I databases: ISI Citation Databases, Inspec, BIOSIS, Engineering Index, …
- Full-text collections: Elsevier, Wiley, APS, IOP, …
Duplicates in LANL data collection:
- amongst bibliographic records
- between bibliographic records and citations
- amongst citations
De-duplication need:
- join records from several databases that describe the same work
- find works that cite a given work

50
Database | Bibliographic Items | Citation Items
Biosis | 13,947,365 | -
Inspec | 7,510,299 | -
Engineering Index | 5,241,479 | -
ISI Science | 25,453,618 | 414,983,407
ISI Arts & Humanities | 3,012,800 | 20,856,114
ISI Social Sciences | 3,738,926 | 53,915,890
Total | 58,904,487 | 489,755,411
Annual Growth | ~ 2,500,000 | ~ 26,000,000

51
Variant forms of citations as found in the data:
DAVIS BJ ANN NY ACAD SCI
DAVIS BJ ANN NY ACAD SCI
DAVIS BJ ANN NY ACAD SCI
DAVIS BJ ANN NY ACAD SCI
DAVIS BJ ANN NEY YORK ACAD SC 1964 ___
DAVIS BJ ANN NY ACAD SCI
CLARK BJ ANN N Y ACAD SCI
DALLNER BJ ANN NY ACAD SCI
DAVIES BJ ANNALS NY ACAD SCI

52 Current LANL De-duplication Approach
Strategy:
- Batch processing
- Bibliographic key matching
- Complex heuristics
Issues:
- Extensive processing time
- Scalability problem in light of growing data collection
- Revision of heuristics requires reprocessing of collection
Explore alternative:
- On-the-fly de-duplication
- De-duplication approach that is appropriate for citation matching
- Flexibility regarding revision of matching approach

53 Netrics Software
Netrics in the literature:
- C. Lee Giles, Steve Lawrence, Kurt D. Bollacker. CiteSeer: an automatic citation indexing system. International Conference on Digital Libraries. Proceedings of the third ACM conference on Digital libraries, Pittsburgh, Pennsylvania. Pages: 89 – 98. DOI /
- C. Lee Giles, Steve Lawrence, Kurt D. Bollacker. Autonomous Citation Matching. International Conference on Autonomous Agents. Proceedings of the third annual conference on Autonomous Agents, Seattle, Washington. Pages: 392 – 393. DOI /
- Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. Symposium on Discrete Algorithms. Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, Austin, Texas. Pages: 311 – 321.
- Various papers at

54 Netrics Software
Netrics properties:
- Forgiving with respect to errors in dataset
- Forgiving with respect to errors in query
- Compares strings like humans do
- Response can be optimized for specific datasets: machine-learning module
- Performance scales well with growing dataset
- RAM-based index

55 De-duplication component: database setup
bibliographic database:
- key (indexed): aulast auinit – stitle – year – volume – issue – spage – epage
- value: identifiers of bibliographic records with given key
- example: DAVIS BJ - ANN NY ACAD SCI – 121 – A2 – || info:lanl-repo/biosis/PREV
citation database:
- key (indexed): aulast auinit – stitle – year – volume – spage
- value: identifiers of bibliographic records in which citation is found
- example: DAVIS BJ - ANN NY ACAD SCI – || info:lanl-repo/isi/A #10 ; info:lanl-repo/isi/A #3
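Sketch of the two key layouts and the key-to-identifiers table, in Python. The field order follows the slide; the " - " separator, the upper-casing, and the class and function names are assumptions, and a plain dictionary stands in for the Netrics database.

```python
from collections import defaultdict

SEP = " - "  # assumed separator between key fields


def _norm(value):
    """Normalize one field: stringify, trim, upper-case (an assumption)."""
    return str(value).strip().upper()


def bib_key(aulast, auinit, stitle, year="", volume="", issue="", spage="", epage=""):
    """Bibliographic key: aulast auinit, stitle, year, volume, issue, spage, epage."""
    author = f"{aulast} {auinit}"
    return SEP.join(_norm(v) for v in (author, stitle, year, volume, issue, spage, epage))


def cit_key(aulast, auinit, stitle, year="", volume="", spage=""):
    """Citation key: aulast auinit, stitle, year, volume, spage."""
    author = f"{aulast} {auinit}"
    return SEP.join(_norm(v) for v in (author, stitle, year, volume, spage))


class KeyTable:
    """Maps a key to the identifiers of the bibliographic records filed under it."""

    def __init__(self):
        self._ids = defaultdict(list)

    def add(self, key, record_id):
        """File record_id under key, ignoring exact repeats."""
        if record_id not in self._ids[key]:
            self._ids[key].append(record_id)

    def lookup(self, key):
        """Return the identifiers filed under key (empty list if none)."""
        return list(self._ids.get(key, []))
```

One KeyTable would hold bibliographic keys and a second one citation keys, the latter listing the records in which each citation was found.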

56 De-duplication component: database setup
IN OUT bib key list (matching bib key, bib id) cit key bib id bibliographic citation IN OUT bib key list (citing bib id) cit key

57 Query: OTT HR – PHYS REV LETT – 1983 – 50 - 1595
Response: a list of keys, each with a likelihood; the client application decides on the cut-off point
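Sketch of the query/response pattern, in Python. Netrics itself is proprietary and its API is not shown on the slides, so difflib.SequenceMatcher from the standard library is swapped in as a stand-in similarity measure; the function names and cut-off values are assumptions.

```python
from difflib import SequenceMatcher


def rank_keys(query, keys):
    """Score every stored key against the query, best match first.

    Returns a list of (likelihood, key) pairs with likelihood in [0, 1].
    SequenceMatcher.ratio() is a stand-in for the Netrics similarity score.
    """
    scored = [(SequenceMatcher(None, query, key).ratio(), key) for key in keys]
    return sorted(scored, reverse=True)


def matches(query, keys, cutoff):
    """Client-side decision: keep only keys scoring at or above the cut-off."""
    return [key for score, key in rank_keys(query, keys) if score >= cutoff]
```

Because the cut-off is applied by the caller, the matching policy can be revised without reprocessing the stored keys, which is the flexibility the alternative approach is after.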

58 De-duplication component: populating the database
[Diagram: the OAI-PMH Federator exposes the ISI 1, ISI 2 and BIOSIS repositories; a Netrics Harvester pulls MPEG-21 DID XML documents from it over OAI-PMH and loads the bibliographic and citation databases]

59 Repository crawling

60

61 Repository crawling
- seed list: identifiers of bib records
- seed list: (Open)URLs pointing at bib records in LANL repository
[Diagram: a Crawler follows the seed list; via OpenURL, the DIP Engine renders records from ISI 1 and ISI 2 as XHTML pages carrying biblio info and citations (citation 1, citation 2, citation 3; bib 13, bib 26); Nutch Search indexes the pages]

62
[Diagram: PageRank over linked XHTML pages (biblio info, citation 4); nodes labeled 1, 5, 2, 3]

