Tools and Services for the Long Term Preservation and Access of Digital Archives. Joseph JaJa, Mike Smorul, and Sangchul Song. Institute for Advanced Computer Studies, Department of Electrical and Computer Engineering, University of Maryland, College Park.
Research Objectives: Development of flexible, platform-independent, modular tools and technologies. Automated, distributed, and secure ingestion software. Management of preservation processes: monitoring, integrity auditing, and preservation services. Access technologies in support of search, discovery, and delivery of archived objects within a temporal context. Evaluation and demonstration on widely different collections.
Flexible, Layered, OAIS-Compliant Architecture (diagram): Ingestion Workflow (PAWN); Data Management; Metadata Management (administrative, descriptive, preservation); Search and Access; Monitoring and Preservation Services; Storage Infrastructure. Open standards, platform-independent components.
PAWN (Producer Archive Workflow Network): Software that provides a flexible and customizable ingestion framework. Handles the process in a reliable and secure fashion, from package assembly to archival storage. Simple interface for end users. Flexible interface for archive managers. Designed for use in multiple contexts.
Sample End-to-End Workflow: On ingest, filtering processes run to perform basic validation (ensure the file is a valid image file and virus free) and EXIF metadata extraction. All approved files are automatically pushed to archival storage. An archivist may log in and handle rejected or non-archived items. A manual process can be invoked to force-move files to archival storage.
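The workflow above can be sketched as a small filter pipeline. This is only an illustration under stated assumptions, not PAWN's actual code or API: `Submission`, `run_ingest`, `force_archive`, the signature table, and the placeholder virus check are all hypothetical names invented here, and EXIF extraction is left out because it would need a third-party library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical magic-number table; a real validator would inspect the whole file.
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")


@dataclass
class Submission:
    name: str
    data: bytes
    status: str = "pending"  # pending -> archived | rejected
    notes: List[str] = field(default_factory=list)


# A filter returns a rejection reason, or None when the item passes.
Filter = Callable[[Submission], Optional[str]]


def is_valid_image(item: Submission) -> Optional[str]:
    if item.data.startswith(IMAGE_SIGNATURES):
        return None
    return "not a recognized image format"


def is_virus_free(item: Submission) -> Optional[str]:
    # Placeholder standing in for a call to a real virus scanner (assumption).
    if b"EICAR" in item.data:
        return "failed virus scan"
    return None


def run_ingest(items: List[Submission], filters: List[Filter],
               archive: Dict[str, bytes]) -> None:
    """Run every filter; push approved items to the archive, mark the rest rejected."""
    for item in items:
        reasons = [r for f in filters if (r := f(item)) is not None]
        if reasons:
            item.status = "rejected"
            item.notes.extend(reasons)
        else:
            archive[item.name] = item.data
            item.status = "archived"


def force_archive(item: Submission, archive: Dict[str, bytes]) -> None:
    """Manual override: an archivist moves a rejected item into archival storage."""
    archive[item.name] = item.data
    item.status = "archived"
    item.notes.append("manually forced by archivist")
```

Keeping each check as an independent function that returns a reason string makes it possible to range from fully automated to fully manual handling simply by changing the list of filters passed to `run_ingest`.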
PAWN Summary: Flexible environment to handle ingestion between many producers and an archive. Very little effort for producers to push their data or for archives to pull data into the archive. Granular workflow definition, from fully automated to completely manual. Easy to include new standards (metadata, packaging, …). Tested in a number of environments.
ACE (Auditing Control Environment): Software to protect the integrity of digital assets in the long term. Threats and challenges: hardware/media degradation; security breaches and malicious alterations; infrequent access to most data; evolution of cryptographic schemes. Underpinnings are based on rigorous cryptographic techniques. Scalable, cost-effective, and able to interoperate with any archiving architecture.
ACE Basic Methodology: Builds on cryptographic hashing by introducing additional layers of trust through layers of cryptographic summary information. Is not confined to the local processes of the archive, and assumes a third party that is not fully trusted. An independent party can assert the correctness of any object in the future based on the archive’s information and publicly available information.
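One common way to layer cryptographic summaries is a Merkle hash tree, sketched below. The slides do not say which construction ACE uses, so this is an assumption chosen for illustration: object digests are combined pairwise up to a single root that could be published, and a short proof lets an independent party check one object against that root. All function names are hypothetical, and the sketch omits hardening such as separate hashing domains for leaves and interior nodes.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(objects: List[bytes]) -> List[List[bytes]]:
    """Return every tree level, from the object digests up to the single root."""
    if not objects:
        raise ValueError("need at least one object")
    levels = [[h(obj) for obj in objects]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]  # duplicate the last node on odd-sized levels
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def proof_for(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the flag is True when the sibling is on the right."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        index //= 2
    return proof


def verify(obj: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from an object to the root using only the proof."""
    node = h(obj)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root
```

In this sketch the archive keeps the objects and their proofs, while only the root needs to be made publicly available; a verifier who trusts the published root does not have to trust the archive's local records.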
ACE Summary: Software to track the availability and integrity of the archive’s data holdings. Auditing: a local service that periodically verifies the integrity of files. Hash integrity: a remote, auditable service that secures the hashes. Extensively tested; the main bottleneck is network and I/O bandwidth. Tested with Chronopolis: 3 collections, 5+ million files, 12.2 TB total. High performance and scalable: a single manager can audit over 6 million files a day. Version 1.2 is publicly available.
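The local auditing step can be illustrated with a minimal digest manifest. This is a sketch under assumptions, not ACE's implementation: SHA-256 and the function names `register` and `audit` are choices made here for the example.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register(root: Path) -> Dict[str, str]:
    """Record a digest for every file under root, keyed by relative path."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: Dict[str, str]) -> Dict[str, str]:
    """Re-hash each registered file and report 'ok', 'corrupt', or 'missing'."""
    report = {}
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file():
            report[rel] = "missing"
        elif file_digest(path) != expected:
            report[rel] = "corrupt"
        else:
            report[rel] = "ok"
    return report
```

Every audit pass re-reads each file in full, which is consistent with the slide's remark that network and I/O bandwidth are the main bottleneck. For scale, 6 million files a day works out to roughly 69 files per second sustained.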
Tracking and Replication Monitoring: Portal that provides an overview of the status of all the collections in the archive. Enforces policies regarding availability and replication. Tracks files at master locations and periodically copies new files to replica sites. Logs actions on a collection and errors during any processing. Currently incorporated with ACE.
Chronopolis Workflow: 1. Data is ingested at SDSC. 2. The replication monitor places copies at UMD and NCAR. 3. The ACE Audit Manager (AM) installation at each site audits local copies. 4. Digests from all three sites are compared in the AM to ensure valid replication. 5. The AM provides information for data providers to ensure we are preserving their data.
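Step 4, comparing digests across sites, might look like the sketch below. The function name and the shape of the input (a per-site mapping from file name to digest) are assumptions made for illustration; the slides do not describe how the AM represents or compares digests.

```python
from typing import Dict, Optional


def find_mismatches(
    site_digests: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return files whose digest is missing at some site or differs between sites.

    site_digests maps a site name to {file name: digest} for that site's copy.
    The result maps each problem file to the digest seen at every site
    (None where the site has no copy).
    """
    all_files = set().union(*(d.keys() for d in site_digests.values()))
    mismatches = {}
    for name in sorted(all_files):
        per_site = {site: digests.get(name) for site, digests in site_digests.items()}
        if len(set(per_site.values())) > 1:
            mismatches[name] = per_site
    return mismatches
```

The sketch only reports disagreement; deciding which copy is authoritative (for example, checking each against an independently secured hash) is left out.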
Chronopolis Collections: Over 5.5 million files, 20+ TB. Audit time is about 1 week.
Access Technologies for Long Term Archives: Search and information discovery within a temporal context. Content exploration to enable knowledge discovery. Indexing structure based on advanced multiversion B-trees. Testing and validation on web archives of significant scale.
Scalable Technology for Information Discovery of Web Archives: Allows discovery through a combination of words and time spans. Handles temporal queries efficiently rather than by “search and then filter”, for example: “Retrieve documents containing September 11 which were written before 2001”. Returned web links are ranked according to a temporal-based scoring function. Allows the possibility of coalescing similar versions of a web page.
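The idea of answering a word-plus-time-span query directly from the index, instead of searching first and filtering by date afterwards, can be sketched as follows. This is not the multiversion B-tree structure the slides refer to: as a much simpler stand-in, each word keeps a list of (timestamp, URL) postings sorted by time, and binary search selects the time span. The class name, the use of plain years as timestamps, and whitespace tokenization are all assumptions; ranking and version coalescing are not shown.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple


class TemporalIndex:
    def __init__(self) -> None:
        # word -> list of (timestamp, url), kept sorted by timestamp
        self._postings: Dict[str, List[Tuple[int, str]]] = defaultdict(list)

    def add(self, url: str, timestamp: int, text: str) -> None:
        """Index one captured version of a page."""
        for word in set(text.lower().split()):
            bisect.insort(self._postings[word], (timestamp, url))

    def search(self, words: List[str], start: int, end: int) -> List[Tuple[int, str]]:
        """Versions containing every word with start <= timestamp < end, oldest first."""
        result = None
        for word in words:
            postings = self._postings.get(word.lower(), [])
            # A 1-tuple sorts before any 2-tuple with the same first element,
            # so these bounds select exactly the half-open span [start, end).
            lo = bisect.bisect_left(postings, (start,))
            hi = bisect.bisect_left(postings, (end,))
            hits = set(postings[lo:hi])
            result = hits if result is None else result & hits
        return sorted(result or [])
```

With this structure, the slide's example query becomes `index.search(["September", "11"], 0, 2001)`, and only postings inside the requested time span are ever touched.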
Conclusion: Focus has been on platform- and architecture-independent tools and services that are specifically tailored to handle core issues in long term archiving. Empirical testing and evaluation using a wide variety of collections and different infrastructures. Released tools to manage distributed ingestion, monitoring, and integrity of archived objects.