Building Preservation Environments from Federated Data Grids Reagan W. Moore San Diego Supercomputer Center Storage.

Presentation transcript:

Building Preservation Environments from Federated Data Grids Reagan W. Moore San Diego Supercomputer Center Storage Resource Broker

Topics
- Preservation environments
- Authenticity
- Integrity
- Digital library technology
- Metadata management
- Data grid technology
- Technology evolution management

Preservation
- Archival processes through which a digital entity is extracted from its creation environment and migrated into a preservation environment, while maintaining authenticity and integrity information.
- The extraction process requires insertion of support infrastructure underneath the digital material.
- The goal is infrastructure independence: the ability to use any commercial storage system, database, or access mechanism.

Preservation Communities
- InterPARES (diplomatics): preservation of records
- NARA: preservation of records from federal agencies
- State archives: preservation of submitted “collections”
- Continuum model: preservation of active data and records

Digital Libraries
- Support the community vocabulary
  - Discovery and browse using community-relevant terms
- Support the community data format
  - Maintain information on the data format of each item
- Support the community access services
  - Provide services that manipulate and display the community data format

Preservation Mandates
- Diplomatics: authenticity, integrity
- NARA: infrastructure independence, scalability
- State archives: automation of archival processes

InterPARES - Diplomatics
Authenticity - maintain links to metadata for:
- Date record is made
- Date record is transmitted
- Date record is received
- Date record is set aside (i.e. filed)
- Name of author (person or organization issuing the record)
- Name of addressee (person or organization for whom the record is intended)
- Name of writer (entity responsible for the articulation of the record’s content)
- Name of originator (electronic address from which record is sent)
- Name of recipient(s) (person or organization to whom the record is sent)
- Name of creator (entity in whose archival fonds the record exists)
- Name of action or matter (the activity for which the record is created)
- Name of documentary form (e.g. report, memo)
- Identification of digital components
- Identification of attachments (e.g. digital signature)
- Archival bond (e.g. classification code)

InterPARES - Diplomatics
Integrity - maintain links to metadata for:
- Name(s) of the handling office / officer
- Name of office of primary responsibility for keeping the record
- Annotations or comments
- Actions carried out on the record
- Technical modifications due to transformative migration
- Validation

Preservation Approach
Provide mechanisms to:
- Create archival context for the content
  - Context is preservation metadata (provenance, administrative, descriptive, structural, behavioral)
  - Content is the submitted digital entity
- Assert integrity - the consistency between the context and the content
  - Track operations done on material and update context
- Assert authenticity - that the material represents the original site
  - Track the chain of custody
- Manage technology evolution (encoding standard, storage repository, information repository, access methods)

Data Grids
- Manage shared collections that are distributed in space
  - Location of item, access controls, checksums
- Implement infrastructure independence
  - Standard operations for interacting with storage repositories
- Implement presentation independence
  - Standard APIs to support porting of user interfaces

Preservation Environment
- Digital library infrastructure that supports
  - Preservation metadata
  - Arrangement and description of items
  - Access mechanisms
- Data grid infrastructure that supports
  - Shared collections that are migrated forward in time
  - Management of technology evolution
  - Administrative metadata providing status of records

Infrastructure Independence
Naming conventions provided by storage systems:
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date, …)
  - Access constraints
- Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Grids Provide a Level of Indirection for Each Naming Convention
Data is organized as a shared collection:
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date, …)
  - Access constraints
- Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Collection
- Data Access Methods (C library, Unix, Web Browser)
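
The indirection between logical and physical names can be illustrated with a minimal sketch. The `LogicalCatalog` and `Replica` classes, the resource names, and the paths below are all hypothetical illustrations and are not part of the SRB or MCAT API; the point is only that a replica can move to new storage while the logical file name stays fixed.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # logical resource name (hypothetical), e.g. "tape-archive"
    physical_path: str  # path as the storage system knows it


@dataclass
class LogicalCatalog:
    """Hypothetical stand-in for a data grid catalog (not the MCAT API)."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path):
        # One logical name may map to several physical replicas.
        self.entries.setdefault(logical_name, []).append(
            Replica(resource, physical_path))

    def resolve(self, logical_name):
        """Return all physical replicas behind one logical name."""
        return list(self.entries[logical_name])

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        """Point a replica at new storage; the logical name is unchanged."""
        for replica in self.entries[logical_name]:
            if replica.resource == old_resource:
                replica.resource = new_resource
                replica.physical_path = new_path


catalog = LogicalCatalog()
catalog.register("/home/collection/report.pdf", "unix-fs", "/data/vol1/a1b2.dat")
catalog.migrate("/home/collection/report.pdf", "unix-fs",
                "tape-archive", "/hpss/x/a1b2.dat")
print(catalog.resolve("/home/collection/report.pdf"))
```

Users and applications keep referring to `/home/collection/report.pdf` throughout; only the catalog entry changes when the storage technology does.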

Data Grids
Provide two levels of indirection:
- Low-level API used to interact with storage repositories
  - Standard operations for manipulating files in a storage system
  - Standard operations for manipulating a catalog stored in a database
- High-level API used to support user interfaces
  - Three basic APIs - “C” library calls, Unix shell commands, Java class library
  - Other interfaces are ported on top of the basic APIs
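
A minimal sketch of the low-level idea, assuming a hypothetical `StorageDriver` interface with only two operations (the slides do not list the actual SRB driver operations): client code is written once against the common operations and works with any backend that implements them.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical common operation set; not the SRB driver interface."""

    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, name: str) -> bytes: ...


class FileSystemDriver(StorageDriver):
    """Driver for an ordinary file system directory."""

    def __init__(self, root: str):
        self.root = root

    def write(self, name, data):
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(data)

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()


class InMemoryDriver(StorageDriver):
    """Stands in for an archive or a database blob store."""

    def __init__(self):
        self.blobs = {}

    def write(self, name, data):
        self.blobs[name] = bytes(data)

    def read(self, name):
        return self.blobs[name]


def copy_between(src: StorageDriver, dst: StorageDriver, name: str) -> None:
    """Client code sees only the common operations, never the backend."""
    dst.write(name, src.read(name))


with tempfile.TemporaryDirectory() as tmp:
    fs, archive = FileSystemDriver(tmp), InMemoryDriver()
    fs.write("record.txt", b"archival content")
    copy_between(fs, archive, "record.txt")

print(archive.read("record.txt"))
```

Supporting a new storage technology then means writing one more driver class, which is the sense in which the collection becomes independent of the infrastructure.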

Storage Resource Broker 3.3
[Figure: layered architecture of the Storage Resource Broker. Recoverable labels:]
- Application
- Access interfaces: C Library, Java; Unix Shell; NT Browser, Kepler Actors; OAI, WSDL, (WSRF); HTTP, DSpace, OpenDAP, GridFTP; Linux I/O; C++ DLL / Python, Perl, Windows
- Services: Federation Management; Consistency & Metadata Management / Authorization, Authentication, Audit; Logical Name Space; Latency Management; Data Transport; Metadata Transport
- Storage Repository Abstraction: Archives (Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS); File Systems (Unix, NT, Mac OSX); ORB; Databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix)
- Database Abstraction: Databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix)

Standard Data Access Operations
Common set of operations for interacting with every type of storage repository:
- Remote operations
  - Unix file system
  - Latency management
  - Procedures
  - Transformations
  - Third party transfer
  - Filtering
  - Queries
- Collective operations
  - Replication
  - Fault tolerance
  - Load leveling
[Figure labels: User Application; Archive at SDSC; Archive at NARA; Archive at U Md]

Building a Distributed Collection
Data Grid: common naming convention and set of attributes for describing digital entities
- Logical name space
  - Location independent identifier
  - Persistent identifier
- Collection owned data
  - Authenticity metadata
  - Access controls
  - Audit trails
  - Checksums
  - Descriptive metadata
- Inter-realm authentication
  - Single sign-on system
[Figure labels: User Application; Archive at SDSC; Archive at NARA; Archive at U Md]

Federated Server Architecture
[Figure: request flow through federated SRB servers. Recoverable labels:]
- Components: Application, SRB servers, SRB agents, MCAT, resources R1 and R2
- Request: Read, by logical name or attribute condition
- Catalog steps: 1. Logical-to-physical mapping; 2. Identification of replicas; 3. Access and audit control
- Other labels: Peer-to-peer brokering; Server(s) spawning; Data access; Parallel data access; 5/6

Managing Access
- Authenticate users independently of storage systems
  - Preservation environment owns the data
- Authorize data access independently of storage system
  - ACLs on both data and metadata
- Maintain audit trails of all accesses
  - Both read and write
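
As a rough sketch of access control and auditing held by the preservation environment rather than by the storage system, the hypothetical `AccessManager` below keeps its own ACL table and appends an audit entry for every access attempt. The class, the field names, and the choice to log denied attempts are assumptions made for illustration, not SRB behavior.

```python
from datetime import datetime, timezone


class AccessManager:
    """Hypothetical sketch: ACLs and audit trail owned by the data grid."""

    def __init__(self):
        self.acl = {}          # (logical_name, user) -> set of permissions
        self.audit_trail = []  # one entry per attempted access

    def grant(self, logical_name, user, permission):
        self.acl.setdefault((logical_name, user), set()).add(permission)

    def check(self, logical_name, user, permission):
        allowed = permission in self.acl.get((logical_name, user), set())
        # Record reads and writes alike, whether or not they were allowed
        # (logging denials is an assumption of this sketch).
        self.audit_trail.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "object": logical_name,
            "action": permission,
            "allowed": allowed,
        })
        return allowed


manager = AccessManager()
manager.grant("/archive/memo.txt", "alice", "read")
print(manager.check("/archive/memo.txt", "alice", "read"))
print(manager.check("/archive/memo.txt", "alice", "write"))
print(len(manager.audit_trail))
```

Because the permissions are keyed by logical name and grid user, they survive a migration of the file to a different storage system with different local accounts.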

Collection-owned Data
- Store data at remote storage system under data-grid ID
- Access data through data grid servers
- Track all operations on data and update state information
- User authenticates to a data grid server
- Access controls are checked for permissions
- Data grid servers authenticate messages from other servers
- Remote server authenticates to remote storage system
- Multiple authentication mechanisms: GSI / challenge-response / tickets

Provide Context for Data
- Properties of files
  - Provenance - source
  - Descriptive attributes
  - Structure
- Organize properties as metadata in a collection hierarchy
- Define operations on file properties
- Manage state information - location, replicas, containers
- Separate context management from content management
- Maintain consistency of context as operations are done on content

Database Operations
Standard interface to support the operations required to manage a catalog that resides in a database:
- Schema extension - user defined attributes
- Snowflake table creation
- SQL generation
- Import and export of XML files
- Bulk metadata load and unload
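
Schema extension and SQL generation can be sketched with the standard-library `sqlite3` module. The table layout, the function names, and the attribute names below are hypothetical and do not reproduce the MCAT schema; the sketch stores each user-defined attribute as a row, so no table alteration is needed, and generates a parameterized query from a dictionary of conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE);
    CREATE TABLE user_attributes (
        object_id INTEGER REFERENCES objects(id),
        name TEXT,
        value TEXT
    );
""")


def add_attribute(logical_name, name, value):
    """Schema extension without ALTER TABLE: one row per attribute."""
    conn.execute("INSERT OR IGNORE INTO objects (logical_name) VALUES (?)",
                 (logical_name,))
    conn.execute(
        "INSERT INTO user_attributes "
        "SELECT id, ?, ? FROM objects WHERE logical_name = ?",
        (name, value, logical_name),
    )


def build_query(conditions):
    """Generate SQL for objects whose attributes match all conditions."""
    sql = "SELECT o.logical_name FROM objects o"
    params = []
    for i, (name, value) in enumerate(conditions.items()):
        # Table aliases come from the loop index; values are bound parameters.
        sql += (f" JOIN user_attributes a{i} ON a{i}.object_id = o.id"
                f" AND a{i}.name = ? AND a{i}.value = ?")
        params += [name, value]
    return sql + " ORDER BY o.logical_name", params


add_attribute("/coll/memo1", "documentary_form", "memo")
add_attribute("/coll/memo1", "author", "Agency A")
add_attribute("/coll/report1", "documentary_form", "report")

sql, params = build_query({"documentary_form": "memo", "author": "Agency A"})
print([row[0] for row in conn.execute(sql, params)])
```

Generating the SQL from attribute conditions keeps user interfaces independent of the particular database vendor and table layout behind the catalog.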

National Archives and Records Administration - Research Prototype Persistent Archive
Federation of three independent data grids (figure labels: NARA, U Md, SDSC, MCAT)
- Principal copy stored at NARA with complete metadata catalog
- Replicated copy at U Md for improved access, load balancing, and disaster recovery
- Deep archive at SDSC: no user access, but complete copy
- Demonstrate preservation environment
  - Authenticity
  - Integrity
  - Management of technology evolution
  - Mitigation of risk of data loss
  - Replication of data
  - Federation of catalogs
  - Management of preservation metadata
  - Scalability
- EAP collection: 350,000 files, 1.2 TB in size

Preservation Requirements
- Maintain authenticity and integrity of electronic records
  - Authenticity - assertion of provenance of data
  - Integrity - assertion of invariance of bits
- Manage risk of data loss
  - Media corruption / System failures / Operational errors / Natural disaster / Malicious users
- Manage technology obsolescence
  - Support migration of collection to new systems
  - Bulk data operations
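
The integrity requirement (invariance of bits) combined with replication can be sketched as follows. This is an illustrative sketch only: the use of SHA-256 is an assumption (the slides mention checksums but name no algorithm), and the function names and site keys are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an assumption; the slides do not name an algorithm.
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(replicas: dict, expected: str) -> list:
    """Check each replica against the recorded checksum and overwrite
    corrupted copies from an intact one. Returns the repaired sites."""
    good = [site for site, data in replicas.items()
            if checksum(data) == expected]
    if not good:
        raise ValueError("no intact replica left")
    repaired = []
    for site in replicas:
        if site not in good:
            replicas[site] = replicas[good[0]]
            repaired.append(site)
    return repaired


original = b"electronic record, bit-for-bit"
recorded = checksum(original)  # kept as preservation metadata at ingest

replicas = {
    "NARA": original,
    "U Md": original,
    "SDSC": b"electronic record, bit-rot",  # simulated media corruption
}
print(verify_and_repair(replicas, recorded))
```

A checksum alone only detects corruption; it is the presence of at least one intact replica elsewhere that makes repair possible, which is why integrity checking and replication appear together in the requirements.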

Federation
Access controls and consistency constraints on cross registration of digital entities between two data grids:
- Data Grid (Data Collection A)
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Grid (Data Collection B)
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Grid Zones
Choose how name spaces will be shared:
- Cross register storage resources
  - May the other data grid write to my storage?
- Cross register user names
  - Users are authenticated by their home zone
- Cross register files
  - Can replicate files into another data grid
- Cross register metadata
  - Can build a copy of the metadata catalog
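
Two of the cross-registration choices above, user names and files, can be sketched with a hypothetical `Zone` class. Nothing here reflects the actual SRB federation protocol; plain-text secrets in particular are a simplification standing in for the real authentication mechanisms.

```python
class Zone:
    """Hypothetical model of one independent data grid (zone)."""

    def __init__(self, name):
        self.name = name
        self.secrets = {}       # local users only: user -> secret
        self.remote_users = {}  # cross-registered user -> home Zone
        self.files = {}         # logical name -> content

    def add_user(self, user, secret):
        self.secrets[user] = secret

    def cross_register_user(self, user, home_zone):
        # Only the user name is shared; the secret stays in the home zone.
        self.remote_users[user] = home_zone

    def authenticate(self, user, secret):
        if user in self.secrets:
            return self.secrets[user] == secret
        home = self.remote_users.get(user)
        # A cross-registered user is authenticated by the home zone.
        return home is not None and home.authenticate(user, secret)

    def replicate_to(self, other, logical_name):
        """Cross-register a file by replicating it into another zone."""
        other.files[logical_name] = self.files[logical_name]


nara, umd = Zone("NARA"), Zone("U Md")
nara.add_user("archivist", "s3cret")
umd.cross_register_user("archivist", nara)
nara.files["/nara/eap/record1"] = b"record bytes"
nara.replicate_to(umd, "/nara/eap/record1")

print(umd.authenticate("archivist", "s3cret"))
print("/nara/eap/record1" in umd.files)
```

Each zone remains independently administered: the second zone never stores the remote user's credentials, and it holds its own copy of the replicated file.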

Federation Environments
[Figure: federation environments, arranged by replication constraints, consistency constraints, and access constraints. Recoverable labels:]
- Group headings: Peer-to-Peer Data Grids; Replication Data Grids; Hierarchical Data Grids
- Environment labels: Occasional Interchange; Free Floating; Resource Interaction; User and Data Replica; Nomadic; Snow Flake; Master Slave; Replicated Data; Replicated Catalog; Deep Archive
- Constraint labels: Partial User-ID Sharing; Partial Resource Sharing; No Metadata Synch; Hierarchical Zone Organization; One Shared User-ID; System Managed Replication; Connection From Any Zone; Complete Resource Sharing; System Set Access Controls; System Controlled Complete Synch; Complete User-ID Sharing; System Controlled Partial Synch; No Resource Sharing; Super Administrator Zone Control; No User-ID Sharing

Examples of Extensibility
- Storage Repository Driver evolution
  - Initially supported Unix file system
  - Added archival access - UniTree, HPSS
  - Added FTP/HTTP
  - Added database blob access
  - Added database table interface
  - Added Windows file system
  - Added project archives - Dcache, Castor, ADS
  - Added Object Ring Buffer, Datascope
  - Adding GridFTP version 3.3
- Database management evolution
  - Postgres
  - DB2
  - Oracle
  - Informix
  - Sybase
  - mySQL (most difficult port - no locks, no views, limited SQL)

Examples of Extensibility
- The 3 fundamental APIs are C library, shell commands, Java
  - Other access mechanisms are ported on top of these interfaces
- API evolution
  - Initial access through C library, Unix shell command
  - Added inQ Windows browser (C++ library)
  - Added mySRB Web browser (C library and shell commands)
  - Added Java (Jargon)
  - Added Perl/Python load libraries (shell command)
  - Added WSDL (Java)
  - Added OAI-PMH, OpenDAP, DSpace digital library (Java)
  - Added Kepler actors for dataflow access (Java)
  - Adding GridFTP version 3.3 (C library)

Sites Using the SRB

Preservation Strategies
- Emulation
  - Migrate the display application onto new operating systems
  - Equivalent to forcing use of candlelight to look at 16th century documents
- Transformative migration
  - Migrate the encoding format to the new standard
  - Migration period is expected to be 5-10 years
- Persistent object
  - Characterize the encoding format
  - Migrate the characterization forward in time

Persistent Objects
[Figure labels: Display Applications; Digital Entities]
- Characterize standard manipulation operations
- Characterize encoding format - data structure

Preservation
- Archival processes through which a digital entity is extracted from its creation environment and migrated to a preservation environment, while maintaining authenticity and integrity information.
- The extraction process requires insertion of support infrastructure underneath the digital material, characterization of the authenticity and integrity, characterization of the digital encoding format, and characterization of the display operations.
- The goal is infrastructure independence: the ability to use any commercial storage system, database, or access mechanism.

For More Information Reagan W. Moore San Diego Supercomputer Center