Data Stewardship and Data Provenance Activities May 10-11, 2011 Steve Kempler and Greg Leptoukh
Data Stewardship
- Data storage, preservation, and integrity
- Ease of use
- Interoperability
- Quality information
NASA Earth science data systems are authoritative sources of science data and recognized as national assets. They are systems of record. Therefore, data stewardship serves a vital role.
Science data stewardship is the protection of science data records, their integrity, long term utility, and other actions that maximize the return on investment.
Science data stewardship includes areas such as:
- Metadata, Availability
- Documentation and user presentation
- Attribution and accountability
From Berrick, dsds.nasa.gov/day1/D1_LessonsLearned_Berrick.ppt
Data Stewardship
Ruth Duerr, NSIDC, provides a very comprehensive discussion of data stewardship in her presentation, Challenges in Long-Term Data Stewardship: http://storageconference.org/2004/Presentations/Tutorials/05-Duerrfinal.pdf
Duerr: The distant past:
- Historically, scientific data were recorded in notebooks, logs, or maps
- With luck, a library or archive would collect and preserve these
- Finding and accessing data were difficult
And we can vouch for this!...
Unfortunately… Trends that continue today:
- Inadequate planning for long-term data preservation
Rebuilding and Organizing 1960’s Era Datasets to 2010’s Data Stewardship Expectations
AGU Poster by John F Moses, et al
- Nimbus satellites were flown in near-polar sun-synchronous orbits from 1964 to 1978.
- Original observations have been retained in the form of film and in digitized data on magnetic tape.
- Nimbus datasets are being recovered from their original media and organized with documentation for access through EOSDIS, at the GES DISC.
- These historical data supplement current space-based observations for understanding critical geophysical parameters.
- In addition to recovering valuable datasets for scientific usage, this effort has demonstrated the significance of preparing for the preservation of datasets for future generations.
- This presentation illustrates methods required to establish the utility of the Nimbus datasets for consideration in future Earth science research studies.
- Lessons highlight what should be considered in preserving today’s data collections.
Rebuilding and Organizing 1960’s Era Datasets: Project Highlights
Nimbus Data Recovered From 1960’s Media
- 70 mm film
- 7 track tapes (some restored to 9 track tapes) contain Level 2 product consisting of calibrated and geolocated swath data
- 8-1/2” x 11” film positives with lat-lon grid lines
- 35 mm film on 100-foot reels of gridded and ungridded pictures
John Bordynuik Inc. uses refurbished tape drives and 28 head readers to recover almost all data (98%) from tapes and stores on disk in TAP (tape emulation format)
Data were examined and validated using Nimbus documentation
- README includes C code snippets for decoding 36 bit words
- JAVA software created to check file formats and content
- Geolocation was checked by mapping to Equal Area Grid Cylindrical Grid
- IDL software was created to display images
Constructed an inventory of TAP files recovered from 7 and 9 track tapes
Created software to examine and ingest orbit files
Created Web portal to organize data, documentation and provenance information
Scanned User Guides and Data Catalog Documentation
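The README's C snippets for decoding 36 bit words are not reproduced in this transcript. As a rough illustration of the idea only, the sketch below splits a byte string into consecutive 36-bit words. It assumes the words are packed back to back, most significant bit first; the function name `unpack_36bit_words` is hypothetical, and the real TAP record framing and word layout of the Nimbus files may well differ.

```python
def unpack_36bit_words(packed: bytes) -> list[int]:
    """Split a byte string into consecutive 36-bit words.

    Assumption (not from the Nimbus documentation): words are packed
    contiguously, most significant bit first, with no record framing.
    Trailing bits that do not fill a whole word are discarded.
    """
    total_bits = len(packed) * 8
    n_words = total_bits // 36
    # Treat the whole buffer as one big unsigned integer.
    value = int.from_bytes(packed, "big")
    # Drop leftover bits at the end that cannot form a full word.
    value >>= total_bits - n_words * 36
    mask = (1 << 36) - 1
    words = []
    for i in range(n_words):
        # Word 0 sits in the highest-order bits.
        shift = (n_words - 1 - i) * 36
        words.append((value >> shift) & mask)
    return words


# Two 36-bit words occupy exactly 9 bytes.
sample = ((0o123456701234 << 36) | 0o777777777777).to_bytes(9, "big")
for word in unpack_36bit_words(sample):
    print(f"{word:012o}")  # 36-bit words are conventionally shown in octal
```

Whether a word should then be read as a signed value, a float, or packed sub-fields depends on the instrument documentation, so the sketch stops at unsigned integers.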
Rebuilding and Organizing 1960’s Era Datasets: Achievements
- Nimbus II-IV dataset ingested into the archive and distribution system with metadata that meets EOSDIS standards for searchable Level 2 granules
- Searchable through ECHO-WIST and Global Change Master Directory
- Documentation includes README files, Users Guides, Data Catalogs
- Provenance information, inventory, and quality information have been placed on the public website
- Nimbus dataset is available from Goddard Earth Science Data and Information Service Center (GES DISC) public website
- Website: http://disc.sci.gsfc.nasa.gov/nimbus
Recovery statistics:
- Nimbus II HRIR Dataset: 1740 tapes (7 track), 1703 TAP files recovered, 1678 read, 2470 files ingested
- Nimbus II MRIR Dataset: 8 tapes (9 track), 1771 TAP files recovered, 1685 read, 1616 files ingested
- Nimbus III HRIR Dataset: 1015 tapes (7 track), 951 TAP files recovered, 865 read, 1101 files ingested
- Nimbus IV Ch 67 THIR Dataset: 1032 tapes, 964 files recovered, 778 read, 1240 files ingested
- Nimbus IV Ch 115 THIR Dataset: 1293 tapes, 1077 files recovered, 813 read, 1268 files ingested
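To put the recovery counts above side by side, the short sketch below computes what share of the recovered files could be read for each dataset. The counts are copied from the slide; the dictionary layout and the function name `read_fraction` are illustrative choices, not part of the project's software.

```python
# (files recovered, files read) per dataset, copied from the slide.
inventory = {
    "Nimbus II HRIR": (1703, 1678),
    "Nimbus II MRIR": (1771, 1685),
    "Nimbus III HRIR": (951, 865),
    "Nimbus IV Ch 67 THIR": (964, 778),
    "Nimbus IV Ch 115 THIR": (1077, 813),
}


def read_fraction(recovered: int, read: int) -> float:
    """Fraction of recovered files that could subsequently be read."""
    return read / recovered


for name, (recovered, read) in inventory.items():
    pct = 100 * read_fraction(recovered, read)
    print(f"{name}: {pct:.1f}% of recovered files read")
```

The share ranges from roughly 98.5% for Nimbus II HRIR down to roughly 75.5% for Nimbus IV Ch 115 THIR.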
Rebuilding and Organizing 1960’s Era Datasets: Lessons Learned
- Preserve data catalogs in table or database digital forms. A lot of effort goes into scanning and converting original paper data catalogs to quantitative information for comparison to recovered inventory
- Include production source code so that future users can definitively understand how metadata was generated and for what purpose it can be used (e.g., in improving geo-location or calibration)
- Collect and maintain names of contributors and key published papers (provenance information)
- Data migration to new media
  - Distinguish format problems from loss of data associated with deteriorating media
  - Keep mapping from original media to restored media for provenance documentation
  - Even after successful ingest, make available bit quality information from recovery at the granule level
Data Stewardship Challenges (from Duerr)
- Scientists need to be involved with the data
  - Maintain data integrity over time
  - Avoid misapplications of data
  - Address known limitations of data
  - Include information on data harmonization and improvements
- Deciding what data to retain - the problem
  - Impractical to retain all data for all time
  - Effective business models for cost/benefits of long-term data archive do not exist
  - History shows us that many data sets have unanticipated future applications
- The best time to start thinking about data stewardship is at the very beginning
  - Doing otherwise puts the data at risk
  - Doing so can increase the quality and availability of not only the metadata but also the data
Preservation in the Science Data Context (from Duerr)
- Users expect to be able to manipulate the data retrieved
- Users even expect to receive data that has been transformed during the process of extracting it from the archive
- Scientists also need to understand how the data were created
- Data archives are therefore more concerned with preserving the bits and their meaning as well as information about how the data were created
Preserving Data through Data Provenance
Open Archival Information System (OAIS) Provenance Description: Information about the pedigree/history of the data
- Where did it come from and where has it been since?
- Who created it?
- How was it created; what algorithms, algorithm versions, ancillary and calibration data sets were used?
- What other data were used to validate these data?
- What changes have taken place since these data were originally created?
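One way to make the provenance questions above concrete is to give each one a slot in a record and report which slots are still empty. The sketch below is only an illustration: the class `ProvenanceRecord`, its field names, and the example values are hypothetical and are not defined by OAIS or by any EOSDIS schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ProvenanceRecord:
    """Hypothetical record with one slot per provenance question."""
    origin: str = ""                                              # where did it come from?
    custody_history: list[str] = field(default_factory=list)      # where has it been since?
    creator: str = ""                                             # who created it?
    algorithms: dict[str, str] = field(default_factory=dict)      # algorithm name -> version
    ancillary_datasets: list[str] = field(default_factory=list)
    calibration_datasets: list[str] = field(default_factory=list)
    validation_datasets: list[str] = field(default_factory=list)  # data used to validate
    changes: list[str] = field(default_factory=list)              # changes since creation

    def unanswered(self) -> list[str]:
        """Names of the provenance questions that have no answer yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Made-up example values, for illustration only.
record = ProvenanceRecord(origin="original 7 track tape", creator="instrument team")
print(record.unanswered())
```

Listing the empty slots at ingest time gives a simple checklist of what still has to be collected from papers, code, or the people involved.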
Preserving Data through Data Provenance
What is really important to preserve? What should users care about?
- One school of thought is that everything needs to be stored, including the minute details of the computing environment during data processing, e.g., OS version, FORTRAN version, ..., even the temperature in the room where the processing was done.
- Another approach is to develop benchmarks, or a “golden” set of dataset characteristics, that the restored data can be regressed against.
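The second approach can be sketched as a small regression check: compute a few summary characteristics of the restored data and compare them with stored "golden" values. Everything specific here is an assumption made for illustration: the choice of characteristics (count, min, max, mean), the function names, and the default tolerance of 0.05%, which simply borrows the figure floated on the Discussion slide.

```python
import math
import statistics


def characteristics(values: list[float]) -> dict[str, float]:
    """Summary characteristics of a dataset (an illustrative choice)."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
    }


def regress_against_golden(restored: dict[str, float],
                           golden: dict[str, float],
                           rel_tol: float = 0.0005) -> list[str]:
    """Return the characteristics that differ from golden by more than rel_tol.

    rel_tol=0.0005 corresponds to 0.05%, a hypothetical threshold.
    """
    failures = []
    for name, expected in golden.items():
        if not math.isclose(restored[name], expected, rel_tol=rel_tol):
            failures.append(name)
    return failures


# Made-up numbers standing in for a golden set and a restored granule.
golden = characteristics([200.0, 250.0, 300.0])
restored = characteristics([200.0, 250.0, 303.0])
print(regress_against_golden(restored, golden))  # characteristics outside tolerance
```

A check like this says nothing about why a characteristic drifted; it only flags where the restored data no longer match the benchmark.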
Collecting and Delivering Data Provenance
Where to find the knowledge about data?
- It is scattered in scientific papers, the actual code, unwritten assumptions, folklore, etc.
- Assess sensitivity of the results to variations in processing algorithms/steps…
- Work closely with scientists to guarantee science quality
How to deliver provenance?
- Deliver to users together with the data
- Present to users in a convenient, easy-to-read fashion
- Provide recommendations for different data usage (applications vs. climate studies)
Data from multiple sensors: harmonization
It is not sufficient just to have the data and their provenance from different sensors in one place
Before data can be compared and fused, many items need to be harmonized:
- Data: format, grid, spatial and temporal resolution
- Metadata: standard fields, units, scales, quality?
- Provenance: what to do with it?
Are these quality flags compatible?
Product A    Product B
Good         3
Bad          2
Ugly         1
(none)       0
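The quality-flag question can be made explicit in code: flags from two products are comparable only once each has a documented mapping to a shared scale. The sketch below is hypothetical throughout. The shared levels, both mapping tables, and the assumption that Product B's 3 is its best value are invented for illustration; the slide does not say how the flags relate, and Product B's 0 is deliberately left unmapped to show what happens when no equivalent is documented.

```python
# Hypothetical mappings onto a shared scale; the slide defines none of these.
PRODUCT_A_MAP = {"Good": "best", "Bad": "intermediate", "Ugly": "worst"}
PRODUCT_B_MAP = {3: "best", 2: "intermediate", 1: "worst"}  # 0 has no documented equivalent


def harmonize(flag, mapping: dict) -> str:
    """Translate a product-specific flag to the shared scale, or fail loudly."""
    try:
        return mapping[flag]
    except KeyError:
        raise ValueError(f"no documented equivalent for flag {flag!r}") from None


# Flags are compared only after both have been translated.
print(harmonize("Good", PRODUCT_A_MAP) == harmonize(3, PRODUCT_B_MAP))

try:
    harmonize(0, PRODUCT_B_MAP)
except ValueError as err:
    print(err)
```

Raising an error for an unmapped flag, rather than guessing, keeps an undocumented assumption from silently entering a fused product.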
How to work with multi-sensor data?
- Capture and classify the details of measurement technique, data collection and processing
- Identify and spell out similarities and differences
- Assess importance of these differences
- Deliver all this information in such a way that a user can easily see and understand the details
- Present recommendations to guide the data usage and avoid apples-to-oranges comparison and fusion
Discussion
- How much stewardship is too much stewardship (i.e., effort = money)?
- What is an avenue for establishing a framework for provenance and its compatibility across all data?
- How do we get data providers to think about data provenance?
- What is the requirement for “goodness” of data restoration? Within 0.05%? …
- Even by asking this question, we are forcing ourselves to define various aspects of data quality up front.