Digital Curation: Curation Micro-services approach to building repositories. Mark Phillips, UNT Libraries, November 8, 2010.



Digital Curation
“Digital curation is the set of policies and practices focused on maintaining and adding value to trusted digital content for use now and into the indefinite future. Curation encompasses preservation and access, and can be applied to the humanities, social sciences, and sciences.”

Preservation is the goal. It is like the finish line in a relay: responsibility is handed off between various actors, with the overarching goal that the content is viable at a later point in time.

Focus on doing the smartest thing you can do right now.

Smartest does not equal “best”

Like many areas, “the perfect is the enemy of the good”.

Digital Curation, Digital Stewardship, Digital Preservation

Access, Preservation, Near-term, Long-term, Maintain, Add value, Cross discipline

Curation Micro-Services
– California Digital Library
– Methodology for building infrastructure to support curation and preservation
– Thinking about the components and interactions in a repository as a set of smaller services
– Loosely coupled services
– Reaction to large monolithic repository systems

Unix philosophy for system design
– Output of one service is the input for another yet to be created service
– Swap out pieces as needed
– Focus on simple tools that do one thing
– Often referred to as “building blocks” or “Legos”
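As a rough illustration of the “output of one service is the input for another” idea, the sketch below chains small single-purpose functions. The function names and the record layout are hypothetical and are not taken from CDL or UNT code.

```python
import hashlib
from functools import reduce


def pipeline(*services):
    """Compose services so that each one receives the previous one's output."""
    def run(record):
        return reduce(lambda current, service: service(current), services, record)
    return run


def add_checksum(record):
    """Hypothetical fixity step: attach an MD5 digest of the payload."""
    return {**record, "md5": hashlib.md5(record["payload"]).hexdigest()}


def add_size(record):
    """Hypothetical characterization step: attach the payload size in bytes."""
    return {**record, "size": len(record["payload"])}


# Either step can be swapped out or reordered without touching the other.
process = pipeline(add_checksum, add_size)
result = process({"name": "README.txt", "payload": b"hello"})
```

Because each step only depends on the shape of the record it receives, a step can be replaced later without rewriting its neighbours.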

At this time it isn’t exactly clear what is and what isn’t a Curation Micro-Service

Kind of sounds like a Web service, or any other service for that matter

This really hasn’t been “answered” in the community

CDL “services”
– Identity Service
– Storage Service
– Fixity Service
– Replication Service
– Inventory Service
– Characterization Service
– Ingest Service
– Index Service
– Search Service
– Transformation Service
– Notification Service
– Annotation Service

Some example components: Anvl, Namaste, Pairtree, BagIt, CAN, D-Flat, ReDD, Checkm, Cutie, ERC

Pairtree
– Filesystem hierarchy for holding objects
– Identifier strings mapped to object directory
– Two characters at a time
    – abcd -> ab/cd/
    – abcdefg -> ab/cd/ef/g/
    – 12-986xy4 -> 12/-9/86/xy/4/
– Object folder at the end of the mapping
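The two-characters-at-a-time mapping can be sketched in a few lines of Python. This is a minimal sketch, not a full Pairtree implementation: it assumes the identifier contains only filesystem-safe characters and skips the character-cleaning step that the Pairtree specification also defines. The function names are illustrative.

```python
def pairtree_path(identifier, root="pairtree_root"):
    """Map an identifier to its Pairtree directory path, two characters at a time."""
    if not identifier:
        raise ValueError("identifier must not be empty")
    # Split into two-character pieces; the last piece may be a single character.
    pieces = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join([root] + pieces) + "/"


def object_directory(identifier, root="pairtree_root"):
    """Return the object folder placed at the end of the mapped path.

    Assumption: the object folder is named after the identifier itself,
    as in the coda1gel storage example later in these slides.
    """
    return pairtree_path(identifier, root) + identifier + "/"
```

For example, `pairtree_path("abcdefg")` yields `pairtree_root/ab/cd/ef/g/`, with the odd trailing character getting a directory of its own.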

Full Example
current_directory/
|   pairtree_version0_1        [which version of pairtree]
|    ( This directory conforms to Pairtree Version 0.1. Updated spec: )
|    ( )
|
|   pairtree_prefix
|    ( )
|
\--- pairtree_root/
     |--- aa/
     |    |--- cd/
     |    |    |--- foo/
     |    |    |    |   README.txt
     |    |    |    |   thumbnail.gif
     |    |    ...
     |    |--- ab/ ...
     |    |--- af/ ...
     |    |--- ag/ ...
     |    ...
     |--- ab/
     \--- zz/ ...
          | ...

Namaste
– NAMe AS TExt
– File naming convention
– Primitive directory-level metadata tags exposed directly via filenames
– Answers the following question: “What kind of directory is this?”
– Examples
    – 0=bagit_0.96
    – 0=untl_sip_1.0
    – 0=untl_aip_1.0
    – 0=untl_acp_1.0
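Reading a Namaste tag is mostly a matter of splitting a filename on the first “=”. The sketch below shows one way to answer “what kind of directory is this?” from a directory listing. It assumes tags have the form `<number>=<value>` and that tag 0 carries the directory type, as in the examples above; the helper names are illustrative.

```python
def parse_namaste(filename):
    """Return (tag_number, value) for a Namaste-style filename, else None."""
    tag, separator, value = filename.partition("=")
    if separator and tag.isdigit() and value:
        return int(tag), value
    return None


def directory_type(filenames):
    """Return the value of the first tag 0 found in a directory listing, or None."""
    for name in filenames:
        parsed = parse_namaste(name)
        if parsed is not None and parsed[0] == 0:
            return parsed[1]
    return None
```

In practice the listing would come from something like `os.listdir(path)`, so a script can branch on the result before touching any other file in the directory.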

Building a repository
– UNT Libraries
– Two separate systems with similar components
– Access system = Aubrey
– Preservation system = Coda
– Built as a set of “services”

UNT and micro-services
– Modular
– Build out in stages
    – Presentation System
    – Preservation System
    – Other services as we need them
– Replace in the future as needed
– Easy to implement, easy to discard

Identity Service
– Archival Resource Keys (ARK) for identifiers
– Number Server for minting names for objects
– Implemented as a Web service
– Query a URL and get a new unique name
    – metapth12604
– Append that to UNT’s NAAN
    – ark:/67531/metapth12604
– Currently 5 name spaces for identifiers
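The mint-then-append flow can be sketched as follows. The real Number Server is a Web service whose URL and response format are not given in these slides, so the `NumberServer` class here is a hypothetical in-memory stand-in; only the NAAN `67531` and the name `metapth12604` come from the slide.

```python
import itertools


class NumberServer:
    """In-memory stand-in for a name-minting Web service (hypothetical)."""

    def __init__(self, namespace, start=1):
        self.namespace = namespace
        self._counter = itertools.count(start)

    def mint(self):
        """Return the next unique name in this namespace."""
        return "{}{}".format(self.namespace, next(self._counter))


def make_ark(naan, name):
    """Build an ARK string from a Name Assigning Authority Number and a name."""
    return "ark:/{}/{}".format(naan, name)


server = NumberServer("metapth", start=12604)
name = server.mint()
ark = make_ark("67531", name)
```

Keeping minting separate from ARK construction means each namespace can have its own counter while sharing the same NAAN.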

Vocabulary Service
– Simple system for providing canonical versions of names
– Unique identifiers for each vocabulary term
– Provided as Linked Data in RDF/XML
– Other serializations
    – Legacy XML format
    – JSON
    – Python object
– Easy to integrate into code
– Promotes reuse of vocabularies

Storage Service
– Provide a consistent way of requesting an item
– Use HTTP for communication
– Read only currently
– Makes use of public specifications
    – CAN
    – Pairtree
    – BagIt
– Exposed with Apache

Storage Service Example
For a known identifier, and a known storage service: coda1gel on coda
|-- 0=can_0.10
|-- admin
|-- can-info.txt
|-- log
`-- store
    |-- pairtree_index
    |-- pairtree_prefix
    |-- pairtree_root
    |   `-- co
    |       `-- da
    |           |-- 1g
    |           |   |-- el
    |           |   |   `-- coda1gel
    |           |   |       |-- 0=untl_aip_1.0
    |           |   |       |-- bag-info.txt
    |           |   |       |-- bagit.txt
    |           |   |       |-- coda_directives.py
    |           |   |       |-- data

Storage Service Example
– Proxy for abstracting “which node”
– We expected to never have all of our data in one place
– Shifts the problem from infrastructure/storage to a software problem

Storage service
– Coda repository application has a list of active content nodes
– Coda queries each content node for the desired object (HTTP HEAD requests)
– Primary and secondary content nodes are usable for increased fault tolerance
– Coda streams content to the end user to allow very large files to be transferred
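The “ask each content node with a HEAD request” lookup might look like the sketch below. The node base URLs and the `store/pairtree_root/...` URL layout are assumptions based on the directory listing in the storage example, not Coda’s documented interface. The `head` function is passed in so the lookup logic can be exercised without a network.

```python
import urllib.error
import urllib.request


def http_head(url, timeout=5):
    """Return the status code of a HEAD request, or None if the node is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except OSError:
        # Covers URLError, timeouts, and refused connections.
        return None


def object_url(node, identifier):
    """Build the (assumed) URL of an object on a content node via its Pairtree path."""
    pieces = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    parts = [node.rstrip("/"), "store", "pairtree_root"] + pieces + [identifier]
    return "/".join(parts) + "/"


def locate(identifier, nodes, head=http_head):
    """Return the URL on the first node that answers 200 for the object, else None."""
    for node in nodes:
        url = object_url(node, identifier)
        if head(url) == 200:
            return url
    return None
```

Listing primary nodes before secondary ones in `nodes` gives a simple form of the fault tolerance described above: an unreachable node returns `None` and the loop moves on to the next.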

Replication Service
– Software neutral content replication
– Master nodes in Library server room
– Secondary nodes at Library Annex server room
– Coda instance at each location
– Different number of content nodes
    – 6 vs 3 currently
– Different content node sizes
    – 9TB vs 25TB
– Need to balance content across content nodes

Replication Service
– Series of conventions for making content available for replication
– Three requirements
    – Provide a list of objects you want to replicate
    – Point to a manifest defining all files of an object
    – Provide a way to validate an object when replicated
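The third requirement, validating an object once it has been replicated, can be sketched as a checksum comparison against the manifest. The sketch assumes a BagIt-style manifest with one `<md5> <relative path>` pair per line; the slides do not say which manifest format or digest the replication service actually uses.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse 'checksum path' lines into a {path: checksum} mapping."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            checksum, path = line.split(None, 1)
            entries[path.strip()] = checksum.lower()
    return entries


def validate_object(object_dir, manifest_text):
    """Return a list of problems; an empty list means the copy matches the manifest."""
    problems = []
    for rel_path, expected in parse_manifest(manifest_text).items():
        full_path = os.path.join(object_dir, rel_path)
        if not os.path.isfile(full_path):
            problems.append("missing: " + rel_path)
        elif file_md5(full_path) != expected:
            problems.append("checksum mismatch: " + rel_path)
    return problems
```

Returning a list of problems rather than a single boolean makes it easier to log exactly what failed when a replication event is recorded.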

Replication Service – Coda Implementation
– RESTful replication service
– Components
    – Replication queue: queue of objects to replicate
    – Collector: adds objects to the replication queue
    – Harvester: queries the queue for objects to harvest
    – Coda Metadata Store
– As content is replicated, a validation and replication event is logged centrally.

Event Service
– Based on the PREMIS Event Model
– RESTful interface for creating new events
– Provides an interface for creating and maintaining PREMIS Agents
– Collects and provides access to events important to the lifecycle of the object
– Currently set up to capture ingest, replication, fixityCheck and virusCheck events
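An event record of the kind this service collects could be assembled as in the sketch below. The field names follow PREMIS semantic units, but the JSON payload shape, the agent identifier, and the idea of posting JSON at all are assumptions; the slides do not describe the service’s actual wire format.

```python
import json
import uuid
from datetime import datetime, timezone

# The four event types the slide says are currently captured.
EVENT_TYPES = {"ingest", "replication", "fixityCheck", "virusCheck"}


def make_event(event_type, object_id, agent_id, outcome, detail=""):
    """Build a PREMIS-style event record as a plain dictionary."""
    if event_type not in EVENT_TYPES:
        raise ValueError("unsupported event type: " + event_type)
    return {
        "eventIdentifier": str(uuid.uuid4()),
        "eventType": event_type,
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventDetail": detail,
        "eventOutcome": outcome,
        "linkingAgentIdentifier": agent_id,
        "linkingObjectIdentifier": object_id,
    }


# "hypothetical-agent" is a placeholder, not a real UNT agent identifier.
event = make_event("fixityCheck", "ark:/67531/metapth12604", "hypothetical-agent", "success")
payload = json.dumps(event)
```

A client would then send `payload` to the event service’s creation endpoint, whatever form that takes in the real system.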

Ingest Service
– A more complex workflow for accessioning content into the repository
– Uses BagIt as a packaging container
– Validation of content at each network or disk hop
– Sanity check after atomic moves
– Folder-based workflow with Python management scripts

Folder Workflow
pth_dropbox/
|-- 0.Staging/
|-- 1.ToAIP/
|-- 2.ToAIP-Error/
|-- 3.ToACP/
|-- 4.ToACP-Error/
|-- 5.ToArchive/
|-- 6.ToAubrey/
|-- 7.ToAubrey-Sorted/
|-- 8.ToAubrey-Sorted-Error/
|-- dropbox_config.py
|-- makeACP.py -> /home/digitalprojects/coda/makeACP.py
|-- makeACPSort.py -> /home/digitalprojects/coda/makeACPSort.py
|-- makeAIP.py -> /home/digitalprojects/coda/makeAIP.py
`-- moveToLibDigiArch_coda-005.sh

1.ToAIP
– Objects start in this directory, typically using rsync from local machines
– Full validation of Bag
– Check that Bag is a Submission Information Package (SIP)
– Check for coda_directives.py for processing instructions
– Request identifier from Number Server
– Create METS document from supplied files
– Create PREMIS record, JHOVE stream, File stream
– Move to 3.ToACP on success or 2.ToAIP-Error on failure
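The “move on success, move to the error folder on failure” pattern behind each stage can be sketched as below. This is not the makeAIP.py script itself: the `process` callable stands in for the validation and METS/PREMIS steps listed above, and `run_stage` is an illustrative name.

```python
import os
import shutil


def run_stage(dropbox, source, success, error, process):
    """Process every bag in one stage folder and move it according to the outcome.

    Returns a {bag_name: destination_folder} mapping.
    """
    results = {}
    source_dir = os.path.join(dropbox, source)
    for name in sorted(os.listdir(source_dir)):
        bag = os.path.join(source_dir, name)
        if not os.path.isdir(bag):
            continue  # Only directories are treated as bags in this sketch.
        try:
            process(bag)
            target = success
        except Exception:
            # Any failure in processing routes the bag to the error folder.
            target = error
        shutil.move(bag, os.path.join(dropbox, target, name))
        results[name] = target
    return results
```

A call such as `run_stage("pth_dropbox", "1.ToAIP", "3.ToACP", "2.ToAIP-Error", process)` would mirror the first stage, and the same function could drive the later stages with different folder names.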

3.ToACP
– Check that Bag is an Archival Information Package (AIP)
– Check for coda_directives.py for processing instructions
– Process METS structure and create Web derivatives based on current practice
– Move AIP to 5.ToArchive on success
– Move AIP to 4.ToACP-Error on failure
– Move ACP to 6.ToAubrey on success

5.ToArchive/6.ToAubrey
– Run bash script to rsync contents of 5.ToArchive over to current archival dropbox
– Run makeACPSort.py to sort contents of 6.ToAubrey into odd and even folders, upload to appropriate content node on delivery system
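The slides do not say how makeACPSort.py decides between the odd and even folders. One plausible rule, shown here purely as an assumption, is to use the parity of the number at the end of the object name (for example `metapth12604`).

```python
import re


def parity_folder(name):
    """Return "even" or "odd" based on the trailing number in an object name."""
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError("no trailing number in name: " + name)
    return "even" if int(match.group(1)) % 2 == 0 else "odd"


def sort_by_parity(names):
    """Group object names into even and odd lists, preserving input order."""
    folders = {"even": [], "odd": []}
    for name in names:
        folders[parity_folder(name)].append(name)
    return folders
```

Splitting on parity spreads sequentially minted objects roughly evenly across two delivery nodes without needing any lookup table.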

Ingest Service
– The Archival Information Package (AIP) is ingested into Coda in a very similar fashion, with the following steps
    – Verify Bag
    – Check bag is AIP
    – Assign Coda identifier
– The Access Content Package (ACP) is moved to the Aubrey content delivery platform and made available in the following systems
    –
    –

Current statistics for UNT systems
– Coda
    – 27,552,721 files
    – 139,062 objects
    – 42.3 TB in use / 120 TB capacity
– Aubrey
    – texashistory
        – 125,721 objects
        – 114,847 “live”
        – 1,248,416 “fileSets”
    – digital.library
        – 38,755 objects
        – 38,451 “live”
        – 2,253,031 “fileSets”

Questions?