Archival Prototypes and Lessons Learned
Mike Smorul, UMIACS


Research Objectives
Development of tools and technologies for:
  - Automated Distributed Ingestion: flexible platform for Producer-Archive interactions
  - Management of Preservation Processes: monitoring, integrity auditing, and preservation services
Evaluation and demonstration of tools on widely different collections.

Research Lessons
Packaging
  - Package Considerations
  - PAWN Tool
  - Lessons Learned
Integrity Checking
  - Hashing 101
  - Integrity Considerations
  - ACE Tool
Other Tools

Package Considerations
Package formats have a wide variety of options
  - Simple manifests
  - Structure information relating data and metadata
Does a package need to physically include data?
  - Zip, Tar, proprietary containers
How does a package track references to data?
Organization of data in a package format
Is integrity information included?
Other information in a package
  - Format, embedded metadata, remote data
Scaling of package format
  - Is everything in one data file?
  - Can you link multiple package files?
  - Package formats are usually limited by file count, not file size

What is PAWN?
Software that provides an ingestion framework
Distributed and secure ingestion of digital objects into an archive
Handles the process
  - From package assembly
  - To archival storage
Simple, customizable interface for end users
Flexible interface for archive publication

What does it look like?

PAWN Evolution
V1: initially 100% METS based
  - Internally stored administrative data and package data in METS-accessible form
  - Entire package hierarchy could be exported as METS
  - Packages represented as a single METS file
V2: METS for packaging only
  - Administrative data stored in DB
  - Packages submitted as METS; files stored on disk in directories alongside XML METS files
  - Packages represented as multiple, linked METS files
V3: METS as plugin
  - Packages submitted natively, stored on disk in custom format
  - Packages could be exported with METS description

Why the changes?
Issues inherent in any package format.
Mapping administrative information into a package file is restrictive.
Accepting XML packages leaves too much uncertainty.
  - What convention should be used for metadata?
  - Which checksums are required?
  - METS profiles help, but are not machine parsable
Performance when updating
  - Each change required rewriting the affected METS file.
Not good at tracking state information
  - Locking files, tracking log information
Defined 'formats' may not be interoperable.
  - Most packaging is dependent on community convention for format use.

Detecting Changes
Digests are fingerprints of files
  - Effectively unique mapping of file content to a fixed-length string.
  - Not reversible: the file cannot be recreated from its hash.
Choose carefully
  - Tradeoffs between security, longevity, and speed
  - MD5: not a NIST standard
  - SHA-1: weakness demonstrated
  - For now, SHA-256, SHA-384, or SHA-512
Follow current NIST recommendations
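
As an illustration of the digest idea above, here is a minimal Python sketch that fingerprints a file with SHA-256 using the standard hashlib module. The function name file_digest and its parameters are hypothetical and are not taken from PAWN or ACE; the file is read in chunks so large archival files need not fit in memory.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`.

    `algorithm` is any name accepted by hashlib.new(), for example
    "sha256", "sha384", or "sha512".
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Keeping the algorithm as a parameter makes it easier to move to a newer hash when recommendations change; the algorithm name would then need to be stored alongside each digest.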

When to digest or check
Ask data suppliers to supply digests
  - Check digests on data receipt.
Check digests after every media move
  - Good way to detect lost files
Periodically check disk-based files
Unexpected applications will modify files
  - Opening a file in MS Office changes the file!
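
The checks above (on receipt, after a media move, and periodically) all come down to comparing stored digests against the files actually on disk. A minimal sketch, assuming the expected digests are kept as a mapping from relative path to SHA-256 hex digest; the names sha256_of and audit_directory and the report layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def audit_directory(root, expected):
    """Compare files under `root` with `expected` (relative path -> hex digest).

    Returns a dict with three sorted lists: files that match, files whose
    content changed, and files that are listed but no longer present.
    """
    root = Path(root)
    report = {"ok": [], "changed": [], "missing": []}
    for rel_path, digest in sorted(expected.items()):
        target = root / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)   # lost file
        elif sha256_of(target) != digest:
            report["changed"].append(rel_path)   # content was modified
        else:
            report["ok"].append(rel_path)
    return report
```

Running the same audit immediately after a media move and again on a schedule covers both the lost-file case and the silent-modification case with one routine.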

Digesting in PAWN
Digests checked multiple times
  1. Client creates a digest or uses an existing one
  2. Digest and data are sent to the receiving server, where they are verified
  3. Audit of data and digest can be requested manually
  4. Digest passed on to final archival destination
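
The four steps above can be sketched as a small in-memory pipeline. This is not PAWN's actual API: the Submission class, the function names, the use of SHA-256, and the choice to re-verify at the archival step are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


class DigestMismatch(Exception):
    """Raised when data no longer matches the digest that travels with it."""


@dataclass(frozen=True)
class Submission:
    name: str
    data: bytes
    digest: str  # hex SHA-256 (assumed algorithm)


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def client_prepare(name, data, existing_digest=None):
    # Step 1: create a digest, or reuse one the producer already has.
    return Submission(name, data, existing_digest or sha256_hex(data))


def audit(submission):
    # Step 3: an on-demand check that data and digest still agree.
    return sha256_hex(submission.data) == submission.digest


def server_receive(submission):
    # Step 2: the receiving server verifies the data against its digest.
    if not audit(submission):
        raise DigestMismatch(submission.name)
    return submission


def archive_store(submission, store):
    # Step 4: the digest is passed on with the data; re-checking here
    # is an assumption of this sketch, not something the slide states.
    if not audit(submission):
        raise DigestMismatch(submission.name)
    store[submission.name] = (submission.data, submission.digest)
```

Because the digest travels with the data, corruption introduced at any hop is caught at the next check rather than discovered later in the archive.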

Audit Control Environment (ACE)
Two-part integrity service
  - Auditing: local service to periodically check files
  - Hash Integrity: remote, auditable service to secure your hash
Based on an Integrity Token issued to track files

ACE: Basic Methodology
Three-tiered cryptographic information. Each tier is periodically audited separately according to policies set by managers.
  - Integrity Token (IT): 1 IT/object, ~1KB
  - Cryptographic Summary Information (CSI): 1 CSI/time window, 1 CSI/(n) objects, ~100MB/year
  - Witness: 1 Witness/week, ~2-3KB/year
Ratios shown in the diagram between tiers: k:1 and l:1
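
The slide does not say how the three tiers are linked. One common way to let many per-object tokens roll up into a single summary value is a Merkle (hash) tree, and the sketch below assumes that approach purely for illustration. All names (build_round, verify_token, witness) are hypothetical, and hashing the concatenated summaries to form a witness is a simplification.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def build_round(object_digests):
    """Aggregate one time window of object digests.

    Returns (csi, tokens): the Merkle root standing in for the summary
    value, and one proof per object standing in for its integrity token.
    Each proof is a list of (sibling_hash, sibling_is_left) pairs.
    """
    if not object_digests:
        raise ValueError("a round needs at least one object digest")
    level = list(object_digests)
    tokens = [[] for _ in level]
    positions = list(range(len(level)))  # index of each object's ancestor
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        for obj, pos in enumerate(positions):
            sibling = pos ^ 1
            tokens[obj].append((level[sibling], sibling < pos))
            positions[obj] = pos // 2
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0], tokens


def verify_token(object_digest, token, csi):
    """Recompute the path from an object digest up to the summary value."""
    node = object_digest
    for sibling, sibling_is_left in token:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == csi


def witness(csis):
    """Condense a period's summary values into one small value."""
    return h(b"".join(csis))
```

Under this assumption, auditing an object only needs its current digest, its token, and the summary value for its window, while the witness gives a very small value that can be published or stored independently.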

Other Accomplishments
FOCUS: a scalable and secure registry for persistent information and services applied to formats.
SRB Replication Monitor: 3rd-party replication in a data grid environment.
Web Archiving: methodology for archiving and searching web content over time.

More Information
Project wiki:
Papers, etc: /Lab/Papers