Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.


Tools and Services for the Long Term Preservation and Access of Digital Archives
Joseph JaJa, Mike Smorul, and Sangchul Song
Institute for Advanced Computer Studies, Department of Electrical and Computer Engineering
University of Maryland, College Park

Research Objectives
Development of flexible, platform-independent, modular tools and technologies.
Automated, distributed, and secure ingestion software.
Management of preservation processes: monitoring, integrity auditing, and preservation services.
Access technologies in support of search, discovery, and delivery of archived objects within a temporal context.
Evaluation and demonstration on widely different collections.

Flexible, Layered, OAIS-Compliant Architecture
[Architecture diagram. Labeled components: Ingestion Workflow (PAWN); Data Management; Metadata Management (Administrative, Descriptive, Preservation); Search and Access; Monitoring and Preservation Services; Storage Infrastructure. Caption: Open Standards, Platform Independent Components.]

PAWN – Producer Archive Workflow Network
Software that provides a flexible and customizable ingestion framework.
Handles the process in a reliable and secure fashion, from package assembly to archival storage.
Simple interface for end users.
Flexible interface for archive managers.
Designed for use in multiple contexts.

Package Workflow Overview
1. Create the Producer-Archive Agreement and the client package template. Configure automated processes and required signatures.
2. Create a package based on the template.
3. Optionally, review submitted items.
4. Invoke publishing processes.

Sample End-to-End Workflow
On ingest, filtering processes run to perform basic validation: ensure that each file is a valid image file and virus free, and extract EXIF metadata.
All approved files are automatically pushed to archival storage.
An archivist may log in and handle rejected or non-archived items.
A manual process can be invoked to force-move files to archival storage.
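The filtering step in this workflow can be sketched as a chain of checks whose outcome routes each file either to archival storage or to the archivist's review queue. This is only an illustration, not PAWN's code: the names `looks_like_image`, `passes_virus_scan`, and `run_ingest` are hypothetical, the image check inspects only a few well-known magic-byte prefixes, and the virus scan is a placeholder standing in for a real scanner.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Magic-byte prefixes for a few common image formats (illustrative subset).
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")


def looks_like_image(data: bytes) -> bool:
    """Cheap validity check: does the file start with a known image signature?"""
    return data.startswith(IMAGE_SIGNATURES)


def passes_virus_scan(data: bytes) -> bool:
    """Placeholder for a real scanner (assumption): rejects a marker string only."""
    return b"EICAR" not in data


# Each filter is a (rejection reason, check) pair; checks run in order.
FILTERS: List[Tuple[str, Callable[[bytes], bool]]] = [
    ("not a valid image", looks_like_image),
    ("failed virus scan", passes_virus_scan),
]


@dataclass
class IngestResult:
    approved: List[str] = field(default_factory=list)              # pushed to archival storage
    rejected: List[Tuple[str, str]] = field(default_factory=list)  # (name, reason) for the archivist


def run_ingest(items: Dict[str, bytes]) -> IngestResult:
    """Run every filter on every submitted file and route it accordingly."""
    result = IngestResult()
    for name, data in items.items():
        failure = next((reason for reason, check in FILTERS if not check(data)), None)
        if failure is None:
            result.approved.append(name)
        else:
            result.rejected.append((name, failure))
    return result
```

Keeping the filters in a plain list mirrors the idea of a configurable workflow: an archive could add or remove checks without touching the routing logic.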

PAWN Summary
Flexible environment to handle ingestion between many producers and an archive.
Very little effort is needed for producers to push their data or for archives to pull data into the archive.
Granular workflow definition, ranging from fully automated to completely manual.
Easy to include new standards (metadata, packaging, …).
Tested in a number of environments.

ACE – Auditing Control Environment
Software to protect the integrity of digital assets in the long term against threats and challenges such as:
Hardware/media degradation.
Security breaches and malicious alterations.
Infrequent access to most data.
Evolution of cryptographic schemes.
Underpinnings are based on rigorous cryptographic techniques.
Scalable, cost-effective, and able to interoperate with any archiving architecture.

ACE – Basic Methodology
Builds on cryptographic hashing by introducing additional layers of trust, in the form of layers of cryptographic summary information.
Is not confined to the local processes of the archive, and assumes a third party that is not fully trusted.
An independent party can assert the correctness of any object in the future, based on the archive's information and publicly available information.
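One common way to layer cryptographic summaries is a hash tree (Merkle tree), in which many object hashes are aggregated into a single root value that can be published. The slides do not specify ACE's exact construction, so the sketch below is an assumption chosen for illustration: SHA-256, duplication of the last node on odd-sized levels, and the helper names `build_levels`, `proof_for`, and `verify` are all hypothetical.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _padded(level: List[bytes]) -> List[bytes]:
    """Duplicate the last node when a level has an odd number of nodes."""
    return level + [level[-1]] if len(level) % 2 else level


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, from the leaf hashes up to the single root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = _padded(levels[-1])
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def proof_for(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes needed to recompute the root from leaf `index`.

    Each entry is (sibling_hash, sibling_is_left).
    """
    proof = []
    for level in levels[:-1]:
        level = _padded(level)
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from a leaf hash and its proof, then compare."""
    node = leaf
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

Under this scheme the archive keeps each object's hash and proof, while only the root needs to be made public. Anyone holding the object, the proof, and the published root can then check the object without trusting the archive's own records.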

ACE Audit Manager

ACE Summary
Software to track the availability and integrity of the archive's data holdings.
Auditing: a local service that periodically verifies the integrity of files.
Hash integrity: a remote, auditable service that secures the hashes.
Extensively tested; the main bottleneck is network and I/O bandwidth.
Tested on Chronopolis: 3 collections, 5+ million files, 12.2 TB total.
High performance and scalable: a single manager can audit over 6 million files a day.
Version 1.2 is publicly available.
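The local auditing service described here amounts to recomputing each file's digest and comparing it with the digest recorded at registration. The following is a minimal sketch under assumptions of my own (SHA-256, a plain dictionary as the digest registry, and the hypothetical names `file_digest` and `audit`); it is not the ACE Audit Manager's implementation.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(registered: Dict[Path, str]) -> Dict[str, List[Path]]:
    """Compare each file against its registered digest.

    `registered` maps a file path to the hex digest recorded at ingest.
    """
    report: Dict[str, List[Path]] = {"intact": [], "corrupt": [], "missing": []}
    for path, expected in registered.items():
        if not path.is_file():
            report["missing"].append(path)
        elif file_digest(path) == expected:
            report["intact"].append(path)
        else:
            report["corrupt"].append(path)
    return report
```

Because every byte of every file has to be read, a loop like this is bound by disk and network throughput rather than by hashing, which is consistent with the bottleneck noted on the slide.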

Tracking and Replication Monitoring
Portal that provides an overview of the status of all the collections in the archive.
Enforces policies regarding availability and replication.
Tracks files at master locations and periodically copies new files to replica sites.
Logs actions on a collection and errors during any processing.
Currently integrated with ACE.
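The tracking step, finding files present at the master location but not yet at a replica, reduces to a set difference per site. The sketch below assumes each site can report the set of file names it holds; `pending_copies` and the site names in the usage note are hypothetical.

```python
from typing import Dict, Set


def pending_copies(master: Set[str], replicas: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each replica site, list the master files it does not hold yet.

    Sites that are fully up to date are omitted from the result.
    """
    plan = {}
    for site, held in replicas.items():
        missing = master - held
        if missing:
            plan[site] = missing
    return plan
```

For example, `pending_copies({"a", "b", "c"}, {"site1": {"a", "b", "c"}, "site2": {"a"}})` yields a plan that copies `b` and `c` to `site2` and nothing to `site1`. Running this periodically and logging each resulting copy gives the behaviour the slide describes.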

Replication Monitor

Chronopolis Workflow
1. Data is ingested at SDSC.
2. The replication monitor places copies at UMD and NCAR.
3. ACE Audit Manager (AM) installations at each site audit their local copies.
4. Digests from all three sites are compared in the AM to ensure valid replication.
5. The AM provides information that lets data providers confirm that their data is being preserved.
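Step 4, comparing digests across the three sites, can be sketched as follows. The slides say only that the digests are compared; treating the majority value as the reference, and the function name `compare_sites`, are assumptions made for this example. With three sites a single divergent copy is always outvoted, but a tie (for example, only two sites reporting different values) would be ambiguous and is not handled here.

```python
from collections import Counter
from typing import Dict, List, Optional


def compare_sites(site_digests: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return {file name: [sites whose copy is missing or disagrees with the majority]}.

    `site_digests` maps a site name to that site's {file name: hex digest} table.
    Files on which all sites agree are omitted from the result.
    """
    all_files = set().union(*(table.keys() for table in site_digests.values()))
    problems: Dict[str, List[str]] = {}
    for name in sorted(all_files):
        seen: Dict[str, Optional[str]] = {
            site: table.get(name) for site, table in site_digests.items()
        }
        counts = Counter(value for value in seen.values() if value is not None)
        majority, _ = counts.most_common(1)[0]
        suspect = sorted(site for site, value in seen.items() if value != majority)
        if suspect:
            problems[name] = suspect
    return problems
```

A report in this shape is also a natural basis for step 5, since it tells a data provider exactly which files are confirmed at every site and which need repair.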

Chronopolis Collections
Over 5.5 million files, 20+ TB.
Audit time is ~1 week.
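As a rough back-of-the-envelope check, these figures imply a sustained audit rate. The calculation below assumes decimal terabytes (10^12 bytes) and takes the stated values at face value, so the results are approximate lower bounds, since the slide gives "over 5.5 million" files and "20+" TB.

```python
SECONDS_PER_DAY = 86_400


def implied_rates(files: int, total_bytes: float, days: float):
    """Average files per second and bytes per second over the audit window."""
    seconds = days * SECONDS_PER_DAY
    return files / seconds, total_bytes / seconds


files_per_s, bytes_per_s = implied_rates(5_500_000, 20e12, 7)
print(f"{files_per_s:.1f} files/s, {bytes_per_s / 1e6:.1f} MB/s")
# prints "9.1 files/s, 33.1 MB/s"
```

At roughly 33 MB/s sustained, the week-long audit is governed by data volume rather than file count, which fits the earlier remark that network and I/O bandwidth are the main bottleneck.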

Access Technologies for Long Term Archives
Search and information discovery within a temporal context.
Content exploration to enable knowledge discovery.
Indexing structure based on advanced multiversion B-trees.
Testing and validation on web archives of significant scale.

Scalable Technology for Information Discovery of Web Archives
Allows discovery through a combination of words and time spans.
Handles temporal queries efficiently, rather than by "search and then filter". Example: "Retrieve documents containing September 11 which were written before 2001."
Returned web links are ranked according to a temporal-based scoring function.
Allows the possibility of coalescing similar versions of a web page.
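To illustrate a query that combines words with a time span, the sketch below keeps, for each word, a list of (date, URL) postings sorted by date and uses binary search to slice out the requested span directly instead of filtering a full result list afterwards. This is a deliberately simplified stand-in: it is not a multiversion B-tree, it does no ranking or version coalescing, and the `TemporalIndex` class and example URLs are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import date
from typing import Dict, List, Optional, Set, Tuple

Posting = Tuple[date, str]  # (date the version was written, URL)


class TemporalIndex:
    def __init__(self) -> None:
        # word -> postings kept sorted by (date, URL)
        self._postings: Dict[str, List[Posting]] = defaultdict(list)

    def add(self, url: str, written: date, text: str) -> None:
        """Index one version of a page under every distinct word it contains."""
        for word in set(text.lower().split()):
            bisect.insort(self._postings[word], (written, url))

    def query(self, terms: str, start: date, end: date) -> List[Posting]:
        """Versions containing all terms, written in the half-open span [start, end)."""
        result: Optional[Set[Posting]] = None
        for term in terms.lower().split():
            postings = self._postings.get(term, [])
            lo = bisect.bisect_left(postings, (start, ""))
            hi = bisect.bisect_left(postings, (end, ""))
            hits = set(postings[lo:hi])
            result = hits if result is None else result & hits
        return sorted(result or [])
```

The slide's example query then becomes `index.query("September 11", date.min, date(2001, 1, 1))`, which touches only postings dated before 2001 for each term.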

Conclusion
The focus has been on platform- and architecture-independent tools and services that are specifically tailored to handle core issues in long term archiving.
Empirical testing and evaluation using a wide variety of collections and different infrastructures.
Released tools to manage distributed ingestion, monitoring, and integrity of archived objects.