Download presentation
Presentation is loading. Please wait.
1
Archival Prototypes and Lessons Learned Mike Smorul UMIACS
2
Research Objectives Development of tools and technologies for: Automated Distributed Ingestion – flexible platform for Producer-Archive Interactions Automated Distributed Ingestion – flexible platform for Producer-Archive Interactions Management of Preservation Processes – Monitoring, Integrity Auditing, and Preservation Services. Management of Preservation Processes – Monitoring, Integrity Auditing, and Preservation Services. Evaluation and demonstration of tools on widely different collections.
3
Research lessons Packaging Package Considerations Package Considerations PAWN Tool PAWN Tool Lessons Learned Lessons Learned Integrity Checking Hashing 101 Hashing 101 Integrity Considerations Integrity Considerations ACE Tool ACE Tool Other Tools
4
Package considerations Package formats have a wide variety of options Simple manifests Simple manifests Structure information relating data and metadata Structure information relating data and metadata Does a package need to physically include data Zip, Tar, proprietary containers Zip, Tar, proprietary containers How does a package track references to data Organization of data in a package format Is integrity information included Other information in a package Format, Embedded metadata, remote data Format, Embedded metadata, remote data Scaling of package format Is everything in one data file? Is everything in one data file? Can you link multiple package files Can you link multiple package files Package formats are usually limited by file count, not file size Package formats are usually limited by file count, not file size
5
What is PAWN? Software that provides an ingestion framework Distributed and secure ingestion of digital objects into an archive. Handles the process From package assembly From package assembly To archival storage To archival storage Simple, customizable interface for end- users Flexible interface for archive publication
6
What does it look like?
7
PAWN Evolution V1, Initially 100% METS based, Internally stored administrative data, package data in METS- accessible form. Internally stored administrative data, package data in METS- accessible form. Entire package hierarchy could be exported as METS Entire package hierarchy could be exported as METS Packages represented as single METS file Packages represented as single METS file V2, METS for packaging only Administrative data stored in DB Administrative data stored in DB Packages submitted as METS, files stored on disk in directories along side xml METS files Packages submitted as METS, files stored on disk in directories along side xml METS files Packages represented as multiple, linked METS files Packages represented as multiple, linked METS files V3, METS as plugin Packages submitted natively, stored on disk in custom format Packages submitted natively, stored on disk in custom format Packages could be exported with METS description Packages could be exported with METS description
8
Why the changes? Issues inherent in any package format. Mapping administrative information into a package file is restrictive. Accepting XML packages leaves too much uncertainty. What convention should be used for metadata? What convention should be used for metadata? What are required checksums? What are required checksums? METS profiles help, but not machine parsable METS profiles help, but not machine parsable Performance when updating Each change required re-writing affected METS file. Each change required re-writing affected METS file. Not good at tracking state information Locking files, tracking log information Locking files, tracking log information Defined ‘formats’ may not be interoperable. Most packaging dependant on community convention for format use. Most packaging dependant on community convention for format use.
9
Detecting Changes Digests are fingerprints of files Unique mapping of file content to a fixed-length string. Unique mapping of file content to a fixed-length string. Not reversible, cannot recreate file from hash. Not reversible, cannot recreate file from hash. Choose carefully Tradeoffs between security, longevity, and speed Tradeoffs between security, longevity, and speed MD5 – not a NIST standard MD5 – not a NIST standard SHA-1 – weakness demonstrated SHA-1 – weakness demonstrated For now, SHA-256,384,512 For now, SHA-256,384,512 Follow current NIST recommendations http://csrc.nist.gov/groups/ST/hash/index.html http://csrc.nist.gov/groups/ST/hash/index.html
10
When to digest or check Ask data supplies to supply digests Check digests on data receipt. Check digests on data receipt. Check digests after every media move Good way to detect lost files Good way to detect lost files Periodically check disk-based files Unexpected applications will modify files Opening a file in MS Office changes the file! Opening a file in MS Office changes the file!
11
Digesting in PAWN Digests checked multiple times 1. Client creates or uses existing digest 2. Digest and data is sent to receiving server, where it is verified 3. Audit of data and digest can be requested manually 4. Digest passed on to final archival destination
12
Auditing Control Environment Two part integrity service Auditing – local service to periodically check files Auditing – local service to periodically check files Hash Integrity – Remote, auditable service to secure your hash Hash Integrity – Remote, auditable service to secure your hash Based on an Integrity Token issued to track files
13
ACE – Basic Methodology Three-tiered Cryptographic Information. Each tier is periodically audited separately according to policies set by managers. Integrity Token Witness Cryptographic Summary Information 1 IT/object ~1KB 1 CSI/time window 1 CSI / (n) objects ~100MB/year 1 Witness/week ~2-3KB/year k:1l:1
14
Other Accomplishments FOCUS – a scalable, and secure registry for persistent information and services applied to formats. SRB Replication Monitor – 3 rd party replication in a data grid environment Web Archiving – Methodology for archiving and searching web content over time.
15
More Information Project wiki: http://adaptwiki.umiacs.umd.edu http://adaptwiki.umiacs.umd.edu http://adaptwiki.umiacs.umd.edu Papers, etc: http://adaptwiki.umiacs.umd.edu/twiki/bin/view /Lab/Papers Papers, etc: http://adaptwiki.umiacs.umd.edu/twiki/bin/view /Lab/Papers http://adaptwiki.umiacs.umd.edu/twiki/bin/view /Lab/Papers http://adaptwiki.umiacs.umd.edu/twiki/bin/view /Lab/Papers
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.