Challenges in Long-Term Data Stewardship
NASA/IEEE Conference on Mass Storage Systems and Technologies - April 13, 2004

Presentation transcript:

Challenges in Long-Term Data Stewardship
Ruth Duerr
University of Colorado at Boulder
Boulder, CO (303)

I’d like to thank my coauthors: Mark A. Parsons, Melinda Marquis, Rudy Dichtl, Teresa Mullins Thanks also to Jenny Jenkins and Wendy Thoreaux for their review and input

Presentation Overview
A Brief History of Scientific Data Stewardship
Scientific Data Stewardship Defined
Differences Between Digital Preservation in the Library and Science Data Contexts
Q&A
Data and Metadata Challenges
Scientific Stewardship Related Challenges
Q&A

A Brief History of Science Data Stewardship
The distant past
  • Historically, scientific data were recorded in notebooks, logs, or maps
  • With luck, a library or archive would collect and preserve these
  • Finding and accessing data were difficult

A Brief History of Science Data Stewardship (cont.)
In recent centuries
  • The establishment and growth of academic and public libraries improved the situation
    o Librarians became data stewards, developing cataloging, indexing, preservation, and access schemes
    o Data were still analog

A Brief History of Science Data Stewardship (cont.)
World Data Centers (WDC)
  • Established during the International Geophysical Year
  • Focus on preservation and distribution of raw data
  • Organized by discipline
  • Data were still analog

A Brief History of Science Data Stewardship (cont.)
After the 1960s
  • Discipline-specific data centers proliferate
    o Federal government
      - A total of 9 national data centers were established in the US
      - Sponsored by NOAA, NASA, USGS, DOE
      - Focus on archival and distribution of data
    o Local and state governments
    o Universities
    o Commercial entities

A Brief History of Science Data Stewardship (cont.)
1990s Earth Observing System (EOS)
  • A large system of remote sensing instruments and data systems
  • Distributed Active Archive Centers (DAACs)
    o 8 discipline-specific centers designated by NASA
    o Typically co-located with established data centers
    o Focus on archive and distribution during the most active part of the data life cycle
  • Provides a web-based interface to simultaneously search and access data from all the DAACs as well as data centers scattered around the world

A Brief History of Science Data Stewardship (cont.)
What is important to note about EOS and the DAACs is that they were arguably the functional beginning of a new data management model:
  • Geographically distributed data archival
  • Centralized search and order

A Brief History of Science Data Stewardship (cont.)
Trends that continue today
  • Centralized access to decentralized data
  • Inadequate planning for long-term data preservation

A Brief History of Science Data Stewardship (cont.)
Role of the World Wide Web in decentralizing data storage
  • Search engines
    o Theoretically assist with locating data
    o Rarely provide sufficient information about the utility of the data
  • Data reliability and integrity are not verifiable
  • The content of the web is ephemeral
  • Users expect ready access to data

A Brief History of Science Data Stewardship (cont.)
Role of private records management companies
  • Cost/benefit analysis
  • International policies regarding access to data

Long-Term Stewardship Defined
Within the data management field the phrase “long-term” is typically defined as:
“A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository.”
OAIS Reference Model

Long-Term Stewardship Defined (cont.)
Notions of stewardship are less well defined
Three relevant definitions:
“the person or group that manages the development, approval, and use of data within a specified functional area, ensuring that it can be used to satisfy data requirements throughout the organization”
DOD Directive M.1

Long-Term Stewardship Defined (cont.)
Three relevant definitions (continued)
Long-term archiving needs to be a “continuing program for preservation and responsive supply of reliable and comprehensive data, products, and information … for use in building new knowledge to guide public policy and business decisions”
Global Change Science Requirements for Long-Term Archiving

Long-Term Stewardship Defined (cont.)
Three relevant definitions (continued)
“maintaining the scientific integrity and long term utility of scientific records”
NOAA/NESDIS, 2003

Long-Term Stewardship Defined (cont.)
These definitions associate the notion of science stewardship with two concepts
  • Data preservation
  • Access or use in the future

Why preserve data?
To ensure its utility for users in the future. Some examples include:
  • To allow combination with historical data to assess change over time
  • To allow future development of new or improved products
  • For use of data in ways that were not originally anticipated
  • To permit replication of scientific results

Preservation in the Historical Library Context
Library patrons expect to experience the material preserved
Library patrons do not expect to be able to transform the accessed materials
Library patrons are typically less concerned with how the original object was created
Libraries are therefore concerned more with issues such as whether and how to preserve the “look and feel” of an object

Preservation in the Science Data Context
Users expect to be able to manipulate the data retrieved
Users even expect to receive data that has been transformed during the process of extracting it from the archive
Scientists also need to understand how the data were created
Data archives are therefore more concerned with preserving the bits and their meaning as well as information about how the data were created

Information About the Data that Must be Preserved
“Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics)
Instrument/sensor calibration data and method
Processing algorithms and their scientific basis, including complete description of any sample or mapping algorithm used in the creation of the product (e.g., contained in peer reviewed papers, in some cases supplemented by thematic information introducing the data set or product to scientists unfamiliar with it)
Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product”
Global Change Science Requirements

Information About the Data that Must be Preserved (cont.)
“Processing history including version of processing source code corresponding to versions of the data set or derived product
Quality assessment information
Validation record, including identification of validation data sets
Data structure and format, with definition of all parameters and fields
In the case of earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record”
Global Change Science Requirements

Information About the Data that Must be Preserved (cont.)
“A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set
Information received back from users of the data set or product”
Global Change Science Requirements

Break for Questions

Presentation Overview
A Brief History of Scientific Data Stewardship
Scientific Data Stewardship Defined
Differences Between Digital Preservation in the Library and Science Data Contexts
Q&A
Data and Metadata Challenges
Scientific Stewardship Related Challenges
Q&A

Data and Metadata Challenges
Standards
Preservation vs Access
Separation Issues
Data Security and Integrity
Long-Term Preservation and Technology Refresh
Size Does Count!

Standards
The EOS Core System (ECS) experience
  • Community-based standards work best
  • Standards profiles that support a particular community may be needed
Plethora of types of standards - for example
  • Data format
  • Metadata types, content, and format
  • Documentation format and content

Format Standards
The challenges of preserving information stored in proprietary formats are well known
Increasingly this is an issue for ancillary information about the data as well as for upper level data products
Even non-proprietary standards change over time

Format Standards (continued)
Proposed solutions
  • Digital format archives
  • Archival in a technology-independent representation (e.g., Universal Data Format)
  • Keep the archive simple and format the data for the user on the fly

The OAIS Reference Model
A CCSDS and ISO standard that describes data preservation concepts such as:
  • Responsibilities of an archive
  • Functional model describing how to preserve information and make it available to users
  • Information model describing what ancillary information is needed to ensure that future users understand and can use the information preserved
  • A common set of terminology that can be used to describe the above

OAIS Archive Responsibilities
Negotiate with information providers to receive and obtain sufficient rights to appropriate information to ensure long-term preservation
Designate a community which should be able to understand the information preserved
Ensure that the information is independently understandable to that community
Document procedures and policies regarding data preservation and access
Make the information available

OAIS Functional Model
[Figure: OAIS functional model. Labels inside the OAIS archive: Ingest, Archive, Data Mgmt, Administration, Preservation Planning, Access. External entities: Producer, Consumer, Management.]

OAIS Information Model
[Figure: OAIS information model. Labels: Content Information, Preservation Description Information, Packaging Information, Package 1, Descriptive Information About Package 1.]

OAIS Information Model - Content Info.
Data Object - the information to be preserved
Representation Information - allows a user to understand the data
  • Structure (e.g., flat binary file, ASCII table, netCDF file, HDF, etc.)
  • Content (e.g., a table of station IDs, dates, latitude, longitude, incidence angle, brightness temperature)

OAIS Info. Model - Preservation Description
Provenance - documents the history of the object
Reference - documents object identifiers and their generation mechanisms
Fixity - documents methods used to ensure there are no undocumented changes
Context - the relationship of the object to its environment
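The pieces of the information model named on the preceding slides can be pictured as a small data structure. The following is only an illustrative sketch: the class names, field names, and the `missing_pdi` helper are hypothetical and are not defined by the OAIS Reference Model, which deliberately stays above the implementation level.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContentInformation:
    data_object: bytes   # the information to be preserved
    structure: str       # representation information: physical layout
    content: str         # representation information: meaning of the fields


@dataclass
class PreservationDescription:
    provenance: List[str] = field(default_factory=list)  # history of the object
    reference: str = ""  # persistent identifier
    fixity: str = ""     # e.g. a checksum plus the algorithm that produced it
    context: str = ""    # relationship of the object to its environment


@dataclass
class InformationPackage:
    content: ContentInformation
    pdi: PreservationDescription
    packaging: str = ""  # how the pieces are bound together
    descriptive: Dict[str, str] = field(default_factory=dict)  # supports discovery


def missing_pdi(package: InformationPackage) -> List[str]:
    """Return the names of the PDI categories that were left empty."""
    pdi = package.pdi
    checks = {
        "provenance": pdi.provenance,
        "reference": pdi.reference,
        "fixity": pdi.fixity,
        "context": pdi.context,
    }
    return [name for name, value in checks.items() if not value]


# Hypothetical package with provenance and reference but no fixity or context.
package = InformationPackage(
    content=ContentInformation(
        data_object=b"\x00\x01",
        structure="flat binary file",
        content="station IDs and brightness temperatures",
    ),
    pdi=PreservationDescription(
        provenance=["received from producer"],
        reference="example-id-0001",
    ),
)
print(missing_pdi(package))
```

A check like `missing_pdi` is one way an archive might flag packages whose preservation description is incomplete at ingest.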

OAIS Preservation Description - Provenance
Information about the pedigree/history of the data
  • Where did it come from and where has it been since?
  • Who created it?
  • How was it created; what algorithms, algorithm versions, ancillary and calibration data sets were used?
  • What other data were used to validate these data?
  • What changes have taken place since these data were originally created?

OAIS Preservation Description - Reference
Persistent, unambiguous identifiers
Aliases commonly in use
A description of the rules (if any) for creating the identifier

OAIS Preservation Description - Fixity
Authentication information
  • Descriptions of the mechanisms used to ensure that the data has not been changed in an undocumented way
  • Authentication keys
  • Fixity information is uncommon

OAIS Preservation Description - Context
Information context
  • Why were the data created?
  • How do these data relate to other data?

Beyond the OAIS Reference Model
The OAIS Reference Model was not intended to be a design or implementation level standard
The document discusses a wide variety of implementation level standards that could be developed
Several organizations have either defined their own preservation metadata format or are working on doing so

The RLG/OCLC Metadata Framework
The Online Computer Library Center and the Research Libraries Group sponsored development of a preservation metadata framework for digital objects based on the OAIS model
The framework
  • defines schema elements for preservation metadata
  • does not specify implementation level details
  • allows expansion of lowest level elements

Other Preservation Metadata Activities
The OCLC PREMIS subgroup is working on defining a set of core attributes and implementation strategies
The Dublin Core Metadata Initiative’s Preservation Working Group is working on a charter which will include investigation of the need for domain-specific preservation metadata schemas
Recently the CODATA group has started to look at whether there is a need to define a preservation metadata schema for science data

Content Standard for Digital Geospatial Metadata
Established by the Federal Geographic Data Committee
All federally funded programs that involve geospatial data are required to adhere to this standard
Purpose is to allow users to find geospatial data, assess its utility for their purposes, and access the data
Has some overlap with the OAIS reference model

ISO
More or less the international version of the FGDC standard
A “cross-walk” between the FGDC and ISO standards exists
Is a content standard, not an implementation standard

XML
Consensus seems to be building that whatever the schema, XML should be the implementation standard for metadata
  • ISO Technical Committee 211 is developing a UML implementation standard for ISO that will include an associated XML schema
  • NRC report on “Government Data Centers: Meeting Increased Demands” also recommends XML

Preservation vs Access
Science data users want data in easy to use forms
  • May wish to receive the data in a specified format
  • May wish to obtain only a particular subset of the data
  • May wish to have the data re-gridded or re-projected

Preservation vs Access - Implications
Science data archives may need interfaces supporting many different data access formats, grid types and projections
Access formats are likely to change over time

Preservation vs Access - Strategies
Separate preservation and access storage
Storage as a simple technology-independent stream of bytes, with adequate “representation information” and “format on the fly” access capabilities
Storage in a database
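The second strategy can be sketched in a few lines: the archive keeps only a raw byte stream plus a description of its layout, and produces a user-facing format at request time. This is a minimal sketch under stated assumptions; the record layout, the `REPRESENTATION_INFO` dictionary, and the function names are hypothetical and are not taken from any particular archive.

```python
import csv
import io
import struct

# Hypothetical representation information: byte order plus an ordered list of
# (field name, struct format code) pairs describing one fixed-length record.
REPRESENTATION_INFO = {
    "byte_order": "<",  # little-endian, no padding
    "fields": [
        ("station_id", "i"),              # 4-byte signed integer
        ("latitude", "f"),                # 4-byte float
        ("longitude", "f"),
        ("brightness_temperature", "f"),
    ],
}


def decode_records(raw, rep_info):
    """Interpret a raw byte stream using its representation information."""
    layout = rep_info["byte_order"] + "".join(code for _, code in rep_info["fields"])
    names = [name for name, _ in rep_info["fields"]]
    record = struct.Struct(layout)
    return [dict(zip(names, values)) for values in record.iter_unpack(raw)]


def format_as_csv(records):
    """Format decoded records as CSV text, one possible access format."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# The "archive" holds only these bytes; CSV is generated when a user asks for it.
archived_bytes = struct.pack("<ifff", 7, 64.5, -147.75, 250.25)
print(format_as_csv(decode_records(archived_bytes, REPRESENTATION_INFO)))
```

The design point is that a new access format means writing a new formatter, while the preserved bytes and their representation information stay untouched.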

Separation Issues
Storing data and their associated metadata separately increases the risk that they will become detached
  • May impede utility
  • May result in misuse
Separation can occur even if simple techniques such as ‘tar’ are used
Embedding the metadata within the data can solve this but raises other issues

Separation Issues (continued)
The situation is exacerbated when the data and metadata start out geographically separated
  • Even preservation of the data can be at risk in this situation

Separation Issues - Brokered Products
NSIDC is often tasked with creating metadata and advertising products held elsewhere
When users request data they are referred to the external site holding the data
Simply maintaining the links to these external sites is a challenge

Separation Issues - The CAPS Example
NSIDC, collaborating with the International Permafrost Association, released a CD titled Circumpolar Active-Layer Permafrost System in 1998
  • A major milestone of the Global Geocryological Data (GGD) system
  • The CD held 56 data sets and references to about 100 more held at other “nodes” of the GGD system

Separation Issues - The CAPS Example (cont.)
Unfortunately funding for the GGD stopped in 1998
In 2002 a new initiative started; creating an updated version of the CD was high on the to-do list
Dozens of the original “brokered” products are no longer readily available

Data Security and Integrity
Ensuring the integrity of the data involves at least three components
  • The data must demonstrate scientific integrity
  • The data must not have been altered since creation
  • Adequate preservation practices exist

Data Security and Integrity - Scientific Integrity
Notions of scientific integrity are rooted in the concept of the scientific method
  • Experiments must be repeatable
  • Results should be published in peer-reviewed literature
  • Data and information used must be specifically acknowledged and accessible

Acknowledging Data and Information
Traditionally data have been published in journals or monographs that could be specifically cited
Currently methods vary by author
  • Simple acknowledgement of the data source in the paper
    o Often difficult to trace, especially over time
    o Often imprecise
    o Sometimes does not acknowledge the true data source

Acknowledging Data and Information (cont.)
Currently methods vary by author (cont.)
  • Citation of an article published by the data provider that describes the data set and its collection
    o May not exist in the peer-reviewed literature
    o May only describe a portion of the data set
    o May not be relevant to this new application of the data
    o May not allow readers to acquire the data, and even if it does, the information may degrade over time

Acknowledging Data and Information (cont.)
Currently methods vary by author (cont.)
  • Use of data citations
    o What is a data citation?
      - Typically the “author” is the data provider or person who invested intellectual effort into creating the data set
      - Typically the “publisher” is the archive that distributed the data
      - The publication date is used to distinguish different versions of related data sets
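The elements of a data citation listed above can be assembled mechanically. The sketch below is illustrative only: the `DataCitation` class, its field order, and the punctuation style are assumptions rather than a citation format prescribed by the presentation, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    author: str            # data provider or creator of the data set
    publication_date: str  # distinguishes versions of related data sets
    title: str             # name of the data set
    publisher: str         # archive that distributed the data

    def format(self) -> str:
        """Join the elements with periods, avoiding doubled punctuation."""
        parts = [self.author, self.publication_date, self.title, self.publisher]
        return ". ".join(part.strip().rstrip(".") for part in parts) + "."


# Entirely hypothetical example values.
citation = DataCitation(
    author="Example, A.",
    publication_date="2004",
    title="Hypothetical Sea Ice Data Set, Version 2",
    publisher="Example Data Archive",
)
print(citation.format())
```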

Acknowledging Data and Information (cont.)
Currently methods vary by author (cont.)
  • Use of data citations (continued)
    o Publisher information may degrade over time
With the rise of “electronic journals” the concept of including the data within the publication has been informally discussed
  • The electronic journal becomes a science data archive with all the attendant challenges

Ensuring the Data Received was as Expected
The “fixity” issue from the OAIS reference model
Often this is described as a problem that is solved - not true!
  • Using message digest algorithms such as MD5 to ensure that the data sent is the data received
  • Then using digital signature technologies to ensure that the data came from a reputable source

Ensuring the Data Received was as Expected
Issues
- Resources required
- Algorithm/mechanism stability over time
- Based on the reputation of the data source
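
The message digest check described for fixity can be sketched as follows. This is an illustrative sketch only, assuming the sender publishes a hex digest alongside the data; the function names `compute_digest` and `verify_fixity` are hypothetical. The algorithm is a parameter because of the stability issue noted above: a digest that is adequate today may need to be replaced later. The sketch covers only the "data sent is the data received" step, not the digital signature step.

```python
import hashlib
import hmac


def compute_digest(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest of data using a named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_fixity(data: bytes, expected_hex: str, algorithm: str = "md5") -> bool:
    """Check that the received bytes match the digest supplied by the sender."""
    actual = compute_digest(data, algorithm)
    # compare_digest avoids leaking where the two strings first differ
    return hmac.compare_digest(actual, expected_hex.lower())


# Hypothetical transfer: the sender computes a digest, the receiver re-checks it.
payload = b"example granule contents"
sent_digest = compute_digest(payload)

print(verify_fixity(payload, sent_digest))            # unchanged data passes
print(verify_fixity(payload + b"\x00", sent_digest))  # altered data fails

# The same check with a different algorithm, chosen by name.
sha_digest = compute_digest(payload, "sha256")
print(verify_fixity(payload, sha_digest, "sha256"))
```

A matching digest shows only that the bytes are unchanged relative to the digest supplied; whether that digest came from a reputable source is a separate question, which is where signatures and the reputation of the data source come in.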

Trusting the Data Source
The user must be able to trust that the preservation practices of the source are adequate. For example:
- Archive media are routinely verified and refreshed
- Facilities are secure
- Processes to verify and ensure the fixity of the data are operational
- Adequate mechanisms exist to ensure data can be recovered in case of emergency
- Disaster recovery plans and procedures are in place

Trusting the Data Source (continued)
The RLG/OCLC Working Group on Digital Archive Attributes suggests that processes for certifying digital repositories be put in place
It has been suggested that folks with administrative access to data and metadata be subject to “strong proofs of identity”

Long-Term Preservation & Technology Refresh
“digital objects require constant and perpetual maintenance, and they depend on elaborate systems of hardware, software, data and information models, and standards that are upgraded or replaced every few years”
NSF and Library of Congress, August 2003

Long-Term Preservation & Technology Refresh
Three proposed solutions
- Normalization - converting to a few “technology independent” standard formats on ingest
- Migration - transferring data to new technologies before the old become obsolete
- Emulation - recreating the original environment on current technologies

Size Does Count!
Generally there are many more small data sets than large data sets
Most of the collection-level metadata creation resources are needed for these small data sets
Collection-level metadata needs are more or less independent of data set size
How can automated metadata generation tools mitigate these resource needs?

Scientific Stewardship Challenges
Maintaining science understanding over time
Decisions, Decisions, Decisions - Deciding what data to acquire and retain
Upfront planning

Maintaining Science Understanding Over Time
Scientists need to be involved with the data
Maintain data integrity over time
Avoid misapplications of data
Address known limitations of data
Include information on data harmonization and improvements

Decisions, Decisions, Decisions
Deciding what data to retain - the problem
- Impractical to retain all data for all time
- Effective business models for the costs and benefits of a long-term data archive do not exist
- History shows us that many data sets have unanticipated future applications
- In order for results to be reproducible, the data used must remain accessible

Decisions, Decisions, Decisions
Preservation Options
- Preserve all levels of the data for all time
- Preserve the lowest level of the data along with the algorithms to create higher-level products
- Preserve only the processed products
- Preserve only products that have been requested

Upfront Planning
The best time to start thinking about data stewardship is at the very beginning
- Doing otherwise puts the data at risk
- Doing so can increase the quality and availability of not only the metadata but also the data

An Example - The CLP Experience
NSIDC was involved from the start
NSIDC management folks were in the field
- Interviews and follow-up with the investigators
- Manual and automated QC of the data collected each day
Resulted in better QC documentation and higher-quality data

Summary
Preservation of digital data presents many challenges
- Some of these are exacerbated when data is distributed
Technology can be used to mitigate many of these challenges; however, people, especially scientists, need to be involved to maintain the scientific integrity of these data over time

Break for Questions

For More Information
About NSIDC in general
About data management or archiving at NSIDC
(303)