Replicated & Distributed Storage Technologies : “Impact on Social Science Data Archive Policies” IASSIST 2010 Ithaca, New York Jonathan Crabtree June.

Slides:



Advertisements
Similar presentations
What is HathiTrust and How Can it Make a Difference? Sourcing and Scaling brought to the collective collection.
Advertisements

A Community Approach to Preservation: Experiences with Social Science Data ASIST Summit 2010 Jonathan Crabtree April 9, 2010.
Digital Preservation A Matter of Trust. Context * As of March 5, 2011.
Digital Preservation and Trusted Digital Repositories Priscilla Caplan Florida Center for Library Automation ALA 2005 Chicago IL.
1 Overview for DAP Business Units Digital Archives Problem Statement Records are all material "regardless of physical form, created or received in connection.
Institutional Repositories It’s not Just the Technology New England Archivists Boston College March 11, 2006 Eliot Wilczek University Records Manager Tufts.
Dr Gordon Russell, Napier University Unit Data Dictionary 1 Data Dictionary Unit 5.3.
PREMIS in Thought: Data Center for LC Digital Holdings Ardys Kozbial, Arwen Hutt, David Minor February 11, 2008.
SCIDIP-ES Components Oct ,Brussels. Basic Preservation Strategies Often stated as: “Emulate or Migrate” OAIS concepts change these to: Add Representation.
Sustainable Preservation Services for Archivists through Distributed Custody Caryn Wojcik State of Michigan Records Management Services.
Trustworthy Repository Criteria, Virtual Organizations, and Infrastructure MacKenzie Smith, MIT Libraries NDIIPP Meeting, July 2010.
AN OPEN-SOURCE SYSTEM FOR AUTOMATIC POLICY-BASED COLLABORATIVE ARCHIVAL REPLICATION Using the SafeArchive System The SafeArchive System coordinates six.
TRAC / TDR ICPSR Trustworthy Digital Repositories.
Extracting and Ingesting DDI Metadata and Digital Objects from a Data Archive into the iRODS extension of the NARA TPAP Using the OAI-PMH J. Ward, A. de.
Micah Altman MIT Libraries Jonathan Crabtree Odum Institute UNC Chapel Hill Prepared for Aligning Digital Preservation across Nations Amsterdam 2013 Website:
Title Subtitle. Building Relationships: “A Foundation for Digital Archives” Digital Object Repository Systems in Digital Libraries (DORSDL) September.
A Community Approach to Preservation: “Experiences with Social Science Data” Community Approaches to Digital Preservation 2009 Jonathan Crabtree February.
Introduction to Database Management  Department of Computer Science Northern Illinois University January 2001.
Developing a Records & Information Retention & Disposition Program:
Chronopolis: Preserving Our Digital Heritage David Minor UC San Diego San Diego Supercomputer Center.
Depositing and Disseminating Digital Resources Alan Morrison Collections Manager AHDS Subject Centre for Literature, Linguistics and Languages.
Chapter 1 Introduction to Databases
Data-PASS Shared Catalog Micah Altman & Jonathan Crabtree 1 Micah Altman Harvard University Archival Director, Henry A. Murray Research Archive Associate.
DCC Conference, Glasgow November, Digital Archive Policies and Trusted Digital Repositories MacKenzie Smith, MIT Libraries Reagan Moore, San Diego.
Database System Development Lifecycle © Pearson Education Limited 1995, 2005.
Keeping your Archive Safe (and on TRAC) with SafeArchive and LOCKSS Thu-Mai Christian [Slides] Micah Altman Jonathan Crabtree [Project Directors]
San Diego Supercomputer CenterUniversity of California, San Diego Preservation Research Roadmap Reagan W. Moore San Diego Supercomputer Center
Information Management and Distributed Data Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,
Introduction: Databases and Database Users
Rule-Based Data Management Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar {moore, schroede, mwan, {moore, schroede, mwan,
Digital Preservation: Lessons learned through national action Digital Preservation Interoperability Framework Workshop April 2010.
Bryan Beecher University of Michigan Director, Computing & Network Services E: W:
Richard MarcianoChien-Yi Hou Caryn Wojcik University of University of State of Michigan North Carolina North Carolina Records Management ServicesSALT DCAPE.
Electronic Records Management: A Checklist for Success Jesse Wilkins April 15, 2009.
Trustworthy Repositories, Organizations & Infrastructure Micah Altman, Institute for Quantitative Social Science, Harvard University Jonathan Crabtree,
Production Data Grids SRB - iRODS Storage Resource Broker Reagan W. Moore
DigCCurr Professional Institute: Curation Practices for the Digital Object Lifecycle Digital Curation Program Development Nancy Y McGovern Research Assistant.
1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.
Libraries, Archives, and Digital Preservation: The Reality of What We Must Do Leslie Johnston Acting Director, National Digital Information Infrastructure.
HathiTrust’s Past, Present and Future. Short- and Long-term Functional Objectives Short-term Page turner mechanism (and Mobile!) Branding (overall initiative;
Micah Altman Associate Director, Harvard-MIT Data Center Institute for Quantitative Social Science, Harvard University Bryan Beecher Director of Computing.
The Canadian Information Network for Research in the Social Sciences and Humanities Tim Au Yeung and Mary Westell Libraries.
Rule-Based Preservation Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,
Report on Preservation of ETDs: The LOCKSS Prototype The work of Kamini Santhanagopalan Virginia Tech Graduate Student in Computer Science Reported at.
Data-PASS Partners: Plans for Moving Forward Marc Maynard The Roper Center for Public Opinion Research University of Connecticut July 2010.
Millman—Nov 04—1 An Update on Digital Libraries David Millman Director of Research & Development Academic Information Systems Columbia University
11 Jan Preserving Peer Replicas By Rate-Limited Sampled Voting Peer to Peer Seminar Prof. Dr.-Ing. Gerhard Weikum Presentation by: Renata Dividino.
National Science Foundation Cooperative Agreement: OCI Reagan Moore, PI Mary Whitton, Project Manager.
©MIT LKTR Workshop, Digital Archive Policies and Trusted Digital Repositories MacKenzie Smith, MIT Libraries Reagan Moore, San Diego Supercomputer.
Aligning Digital Preservation Policies with Community Standards Nancy McGovern Digital Preservation Officer.
Managing Access at the University of Oregon : a Case Study of Scholars’ Bank by Carol Hixson Head, Metadata and Digital Library Services
National Archives and Records Administration1 Integrated Rules Ordered Data System (“IRODS”) Technology Research: Digital Preservation Technology in a.
Information Resource Stewardship A suggested approach for managing the critical information assets of the organization.
Rights Management for Shared Collections Storage Resource Broker Reagan W. Moore
Open Science and Research – Services for Research Data Management © 2014 OKM ATT 2014–2017 initiative Licenced under.
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
Data Management and Digital Preservation Carly Dearborn, MSIS Digital Preservation & Electronic Records Archivist
Introduction: Databases and Database Systems Lecture # 1 June 19,2012 National University of Computer and Emerging Sciences.
Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data. Written By: R. Moore, A. Rajasekar,
Outline Types of Databases and Database Applications Basic Definitions
Trustworthiness of Preservation Systems
An Overview of Data-PASS Shared Catalog
Policy-Based Data Management integrated Rule Oriented Data System
Chapter 2 Database Environment Pearson Education © 2009.
Sophia Lafferty-hess | research data manager
Database Environment Transparencies
Institutional Repositories
Digital Preservation and Trusted Digital Repositories
Robin Dale RLG OAIS Functionality Robin Dale RLG
Nancy Y. McGovern Digital Preservation Officer, ICPSR IASSIST 2007
Presentation transcript:

Replicated & Distributed Storage Technologies : “Impact on Social Science Data Archive Policies” IASSIST 2010 Ithaca, New York Jonathan Crabtree June 2, 2010

The Odum Institute Oldest Institute or Center at UNC-CH Founded 1924 Mission: teaching, research, & service for social sciences Cross-disciplinary focus

The Partners ICPSR Odum Institute Roper Center Henry A. Murray Research Archive Harvard IQSS National Archives and Records Administration

One of the key functions of social science data archives is to preserve historic and import data used in social science research.

How can we promise Preservation There are many definitions of preservation and many key components to policies that support preservation of social science data. “Social science archives should consistently update and evaluate policies to ensure they meet the goals of their organizations” Green, Ann, Stuart Macdonald, and Robin Rice. "Policy-Making for Research Data in Repositories: A Guide." Edinburgh, UK: EDINA and University Data Library, University of Edinburgh, 2009.

Data Replication “Storage alone will not solve the problem of digital preservation. Academic materials have many enemies beyond natural bit rot: ideologies, governments, corporations, and inadequate budgets. It is essential that sound storage and administration practices are complemented with the institution of communities acting together to thwart attacks that are too strong or too extrinsic for such practices to protect against.” Maniatis, Petros, Mema Roussopoulos, T.J. Giuli, David S. H. Rosenthal, and Mary Baker. "The LOCKSS Peer-to-Peer Digital Preservation System.” ACM Transactions on Computer Systems 23, no. 1 (2005): p41.

Distributed Replication and Storage Projects Policy-Based Replication and Auditing –Data-PASS project –LOC funded prototype –Currently IMLS funded project –LOCKSS PLN foundation –Schema based auditing Rules-Based Distributed Storage –NARA/Odum/UNC Chapel Hill SILS project –NARA funded –iRODS grid based foundation –Rules based policy enforcement

Policy-Based Replication and Auditing Data-PASS Syndicated Storage Technology “SSP”

Multi-Archival: Syndicated Storage Platform

Preservation Failures Technical –Media failure: storage conditions, media characteristics –Format obsolescence –Preservation infrastructure software failure –Storage infrastructure software failure –Storage infrastructure hardware failure External Threats to Institutions –Third party attacks –Institutional funding –Change in legal regimes

Replication as Part of a Multi-Institutional Preservation Strategies There are potential single points of failure in both technology, organization and legal regimes: Diversify your portfolio: multiple software systems, hardware, organization Find diverse partners – diverse business models, legal regimes Preservation is impossible to demonstrate conclusively: Consider organizational credentials No organization is absolutely certain to be reliable Consider the trust relationships across institutions

Data-PASS Requirements for SSP Policy Driven –Institutional policy creates formal replication commitments –Replication commitments are described in metadata, using schema –Metadata drives Configuration of replication network Auditing of replication network Asymmetric Commitments –Partners vary in storage commitments to replication –Partners vary in size of holdings being replicated –Partners vary in what holdings of other partners they replicate Completeness –Complete public holdings of each partner –Retain previous version of holdings –Include metadata, data, documentation, legal agreements Restoration guarantees –Restore groups of versioned content to owning archive –Institutional failure restoration – support transfer of entire holdings of a designated archive to another partner Trust & Verification –Each partner is trusted to hold the public content of other, not to disseminate improperly –Each partner trusts replication broker to add units to be harvested –No partner is trusted to have “super-user” rights to delete (or directly manipulate) replication storage owned by another partner –Legal agreements reinforce trust model –Schema based auditing used to verify replication guarantees are met by the network

Syndicated Storage Platform (SSP) Start with LOCKSS Lots of Copies Keep Stuff Safe But used in a closed network –Private LOCKSS Network (PLN) –A few of them out there Educopia Institute/MetaArchive perhaps the best known Biggest selling point was independence of each node in the PLN

PLNs Other differences between traditional PLN and our needs –Our content isn’t harvestable via HTTP In our case we use OAI-PMH –Our PLN nodes are different sizes –Our trust model requirement prevents a centralized authority controlling the network

SSP Commitment Schema Network level: –Identification: name; description; contact; access point URI –Capabilities: protocol version; number of replicates maintained; replication frequency; versioning/deletion support –Human readable documentation: restrictions on content that may be placed in the network; services guaranteed by the network; Virtual Organization policies relating to network maintenance Host level –Identification: name; description; contact; access point URI –Capabilities: protocol version; storage available –Human readable terms of use: Documentation of hardware, software and operating personnel in support of TRAC criteria Archival unit level –Identification: name; description; contact; access point URI –Attributes: update frequency, plugin required for harvesting, storage required –Terms of use: Required statement of content compliance with network terms. ; Dissemination terms and conditions TRAC Integration –A number of elements comprise documentation showing how the replication system itself supports relevant TRAC criteria –Other elements that may be use to include text, or reference external text that documents evidence of compliance with TRAC criteria. –Specific TRAC criteria are identified implicitly, can be explicitly identified with attributes –Schema documentation describes each elements relevance to TRAC, and mapping to particular TRAC criteria

Current Efforts

IMLS Project Goals Move from prototype to production Adapt to more generic uses Examine scalability issues Bulk recovery to home repositories Work toward a fully automated update system Rework the interface to LOCKSS cache Work with the community to develop standard PLN auditing

Rules-Based Distributed Storage Rules-Based policy enforcement iRODS grid based technology OAI-PMH harvesting from Odum Dataverse network

Using approach modeled on MIT Pledge project Step 1 = define policy areas Step 2 = create policy declaration statements for each policy area; state the requirements for operation, not technical specifics Step 3 = each entity in a policy statement is defined in language descriptions: humans and machine-readable references Step 4 = deontic statements: logical statements define actors, actions, and constraints that enforce a policy statement. Step 5 = Write iRODS rules for each statement Wolfe, Robert PLEDGE policy list. MIT Libraries.

Policy Areas Organization, Environment, and Legal Policies Community and Usability Policies Process and Procedure Policies Technology and Infrastructure Policies Wolfe, Robert PLEDGE policy list. MIT Libraries.

Initial Rules Developed Organization, Environment, and Legal Policies –Defined dataset succession plan –Defined access policies –Log access for accountability –Reference TRAC criteria Community and Usability Policies –Require a deposit agreement Process and Procedure Policies –Defined iCAT to DDI discovery crosswalk –Store dataset’s DDI metadata as object –Defined persistent identifiers –Defined UNF’s and Checksums –Provide reporting of preservation network Technology and Infrastructure Policies –Defined number of replication copies –Defined geographic location for the copies –Provide authentication policy –Provide versioning –Provide control for deletion/replacement –Defined replica validation frequency via UNF’s and Checksums

Summary Replication ameliorates institutional risks to preservation Strengthen preservation through institutional diversification Data-PASS requires policy based, auditable, asymmetric replication commitments Formalize policies in schema or rules Build trust models Data-PASS approach to preservation combines Trust Models, Institutional Collaborations and Digital Replication Infrastructures

Contact Information Website: Jonathan Crabtree