NetCDF and Scientific Data Durability Russ Rew, UCAR Unidata ESIP Federation Summer Meeting 2009-07-08.

Slides:



Advertisements
Similar presentations
A Draft Standard for the CF Metadata Conventions Cheryl Craig and Russ Rew UCAR.
Advertisements

Reading HDF family of formats via NetCDF-Java / CDM
Recent Work in Progress
ESCI/CMIP5 Tools - Jeudi 2 octobre CMIP5 Tools Earth System Grid-NetCDF4- CMOR2.0-Gridspec-Hyrax …
Streaming NetCDF John Caron July What does NetCDF do for you? Data Storage: machine-, OS-, compiler-independent Standard API (Application Programming.
THREDDS, CDM, OPeNDAP, netCDF and Related Conventions John Caron Unidata/UCAR Sep 2007.
The Future of NetCDF Russ Rew UCAR Unidata Program Center Acknowledgments: John Caron, Ed Hartnett, NASA’s Earth Science Technology Office, National Science.
NetCDF An Effective Way to Store and Retrieve Scientific Datasets Jianwei Li 02/11/2002.
NetCDF Ed Hartnett Unidata/UCAR
Introduction to NetCDF Russ Rew, UCAR Unidata ICTP Advanced School on High Performance and Grid Computing 13 April 2011.
Status of netCDF-3, netCDF-4, and CF Conventions Russ Rew Community Standards for Unstructured Grids Workshop, Boulder
Show of Hands... How many traveled to be here? University/Gov't/Industry How many use netCDF? Primary programming language for netCDF? Other data formats.
Ensuring Long Term Access to Remotely Sensed HDF4 Data with Layout Maps Mike Folks, The HDF Group Ruth Duerr, NSIDC 1.
Developing a NetCDF-4 Interface to HDF5 Data
1 Writing NetCDF Files: Formats, Models, Conventions, and Best Practices Russ Rew, UCAR Unidata June 28, 2007.
OPeNDAP and the Data Access Protocol (DAP) Original version by Dave Fulker.
Data Formats: Using Self-describing Data Formats Curt Tilmes NASA Version 1.0 February 2013 Section: Local Data Management Copyright 2013 Curt Tilmes.
Introduction to NetCDF4 MuQun Yang The HDF Group 11/6/2007HDF and HDF-EOS Workshop XI, Landover, MD.
NetCDF and HDF5 Ed Hartnett, Unidata/UCAR, Unidata Mission: To provide the data services, tools, and cyberinfrastructure leadership that advance.
Developing a NetCDF-4 Interface to HDF5 Data Russ Rew (PI), UCAR Unidata Mike Folk (Co-PI), NCSA/UIUC Ed Hartnett, UCAR Unidata Quincey Kozial, NCSA/UIUC.
The HDF Group ESIP Summer Meeting HDF OPeNDAP update Kent Yang The HDF Group 1 July 8 – 11, 2014.
1 Russ Rew, Ed Hartnett, John Caron UCAR Unidata Program Center Mike Folk, Robert McGrath, Quincey Kozial NCSA and The HDF Group, Inc. Final Project Review,
NetCDF for Developers and Data Providers Russ Rew, UCAR Unidata ICTP Advanced School on High Performance and Grid Computing 14 April 2011.
Unidata’s TDS Workshop TDS Overview – Part II October 2012.
HDF5 A new file format & software for high performance scientific data management.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Unidata’s Common Data Model John Caron Unidata/UCAR Nov 2006.
Mid-Course Review: NetCDF in the Current Proposal Period Russ Rew
Accomplishments and Remaining Challenges: THREDDS Data Server and Common Data Model Ethan Davis Unidata Policy Committee Meeting May 2011.
The netCDF-4 data model and format Russ Rew, UCAR Unidata NetCDF Workshop 25 October 2012.
Chapter 9 Object-Oriented Software Development F Software Development Process F Analyze Relationships Among Objects F Class Development F Class Design.
October 15, 2008HDF and HDF-EOS Workshop XII1 What will be new in HDF5?
Integrating netCDF and OPeNDAP (The DrNO Project) Dr. Dennis Heimbigner Unidata Go-ESSP Workshop Seattle, WA, Sept
DAP4 James Gallagher & Ethan Davis OPeNDAP and Unidata.
Advanced Utilities Extending ncgen to support the netCDF-4 Data Model Dr. Dennis Heimbigner Unidata netCDF Workshop August 3-4, 2009.
Recent developments with the THREDDS Data Server (TDS) and related Tools: covering TDS, NCML, WCS, forecast aggregation and not including stuff covered.
NetCDF Data Model Issues Russ Rew, UCAR Unidata NetCDF 2010 Workshop
Unidata’s Common Data Model and the THREDDS Data Server John Caron Unidata/UCAR, Boulder CO Jan 6, 2006 ESIP Winter 2006.
Unidata’s TDS Workshop TDS Overview – Part I July 2011.
The HDF Group Data Interoperability The HDF Group Staff Sep , 2010HDF/HDF-EOS Workshop XIV1.
Unidata’s Common Data Model and NetCDF Java Library API Overview John Caron Unidata/UCAR Nov 2008.
The HDF Group Introduction to netCDF-4 Elena Pourmal The HDF Group 110/17/2015.
NetCDF-4: Software Implementing an Enhanced Data Model for the Geosciences Russ Rew, Ed Hartnett, and John Caron UCAR Unidata Program, Boulder
Data File Formats: netCDF by Tom Whittaker University of Wisconsin-Madison SSEC/CIMSS 2009 MUG Meeting June, 2009.
Advances in the NetCDF Data Model, Format, and Software Russ Rew Coauthors: John Caron, Ed Hartnett, Dennis Heimbigner UCAR Unidata December 2010.
11/8/2007HDF and HDF-EOS Workshop XI, Landover, MD1 Software to access HDF5 Datasets via OPeNDAP MuQun Yang, Hyo-Kyung Lee The HDF Group.
E-Science Data Information and Knowledge Transformation BinX – A Tool for Binary File Access eDIKT project team Ted Wen
A Draft Standard for the CF Metadata Conventions Russ Rew, Unidata GO-ESSP 2009 Workshop
Unidata Technologies Relevant to GO-ESSP: An Update Russ Rew
CF 2.0 Coming Soon? (Climate and Forecast Conventions for netCDF) Ethan Davis ESO Developing Standards - ESIP Summer Mtg 14 July 2015.
Developing Conventions for netCDF-4 Russ Rew, UCAR Unidata June 11, 2007 GO-ESSP.
Development of a CF Conventions API Russ Rew GO-ESSP Workshop, LLNL
NetCDF: Data Model, Programming Interfaces, Conventions and Format Adapted from Presentations by Russ Rew Unidata Program Center University Corporation.
Update on Unidata Technologies for Data Access Russ Rew
Utilities for netCDF-4 Dr. Dennis Heimbigner Unidata Advanced netCDF Workshop July 25, 2011.
Unidata Infrastructure for Data Services Russ Rew GO-ESSP Workshop, LLNL
Libcf – A CF Convention Library for NetCDF Ed Hartnett Unidata Program Center Boulder Colorado June 11, 2007.
NetCDF Data Model Details Russ Rew, UCAR Unidata NetCDF 2009 Workshop
Copyright © 2010 The HDF Group. All Rights Reserved1 Data Storage and I/O in HDF5.
NetCDF-Java version 2.2 Common Data Model John Caron Unidata/UCAR Dec 10, 2004.
Can Data be Organized for Science and Reuse?
Moving from HDF4 to HDF5/netCDF-4
SRNWP Interoperability Workshop
Type Checking Generalizes the concept of operands and operators to include subprograms and assignments Type checking is the activity of ensuring that the.
NetCDF 3.6: What’s New Russ Rew
Plans for an Enhanced NetCDF-4 Interface to HDF5 Data
Efficiently serving HDF5 via OPeNDAP
Unidata & NetCDF BoF Scientific File Formats
Storage Basic recommendations:
Status for Endeavor 6: Improved Scientific Data Access Infrastructure
Presentation transcript:

NetCDF and Scientific Data Durability Russ Rew, UCAR Unidata ESIP Federation Summer Meeting

For preserving data, is format obsolescence a non-issue? Why do formats (and their access software) change? Are format changes consistent with data preservation and stewardship? How can the evolution of formats support preservation and stewardship? What principles should guide format developers in support of data durability? What are the most important threats to netCDF data in archives?

Why do open formats and their supporting software libraries change? To better represent data semantics  Capturing intent of data providers  Exploiting metadata advances, new conventions To improve performance, avoid obsolescence  Compression, caching, chunking, indexing, …  Parallel file systems To enhance interoperability  Replacing specialized formats with more general formats To fix mistakes  32-bit offsets for data in files  ASCII characters for all metadata To respond to users’ needs

How do formats change? Simple formats don’t change, they’re defined once and frozen forever Some formats change infrequently and usually incompatibly Complex formats (and their software) may evolve in lots of small increments ASCII GRIB 1 GRIB netCDF

NetCDF: not just a format A standard format for platform-independent data (NASA ESDS-RFC-011) CF-netCDF is being proposed as a formal OGC binary encoding standard A data model for multidimensional and structured scientific data A set of application programming interfaces (C, Java, Fortran, C++, …) for data access A reference implementation for the APIs But netCDF is also

How has netCDF changed? Software FormatsFeatures 2009: 4.1 netCDF-4 64-bit offset classic OPeNDAP client support, integration/inclusion of udunits and libcf, improved HDF5 and HDF4 support 2008: 4.0 netCDF-4 64-bit offset classic Enhanced data model, expanded APIs, HDF5 storage layer, compression, chunking, parallel I/O, Unicode names, … 2006: bit offset classic NcML, 64-bit offset format 2001: 3.5classicnew Java API, Fortran-90 API 1998: 3.4classic Java API, limited large file support, performance enhancements 1997: 3.3classicC, F77 “version 3” type-safe APIs 1996: 2.4classicC++ API, optimizations, format spec published 1989: 1.0classicC, F77 APIs

Ways to deal with format changes Use only published standards for archives  Format standardization is slow GRIB1 (1985) to GRIB2 (2001)  Impractical if many intermediate versions (e.g CF Conventions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, …) Convert archived data periodically  Upgrading older formats is costly, risky  Migrating to a more general format may break older access software Save data access software versions with data  Requires data archives to become software version control repositories  Imposes often unnecessary burden on data access Rely on a commitment to compatibility by format developers, maintainers, and responsible organization

Compatibility commitment For scientific data, preserving access to data for future generations should be sacrosanct Strong commitment is needed to ensure practical access to old data by new programs Careful library evolution can ensure data and API compatibility An example public commitment presented at American Meteorological Society annual meeting, January 2006 …

For future access to archives, netCDF development will continue to ensure the compatibility of: Data access: netCDF software will provide both read and write access to all earlier forms of netCDF data. Programming interfaces: C, Fortran, and Java programs using documented netCDF APIs from previous versions will continue to work after recompiling and relinking (if needed). Future versions: netCDF will continue to support both data access compatibility and API compatibility in future releases. Declaration of Compatibility

Aspects of compatibility Costs  Effort to support older interfaces and formats  Comprehensive compatibility testing with every software release Benefits  Data in archives don’t have to change  Client program sources don’t have to change  Software can access archived data without being aware of format version Implemented by compatibly evolving data model  Add or grow abstractions, instead of replacing them  Ensure previous data model is included in enhanced data model

Classic netCDF data model A file has variables, dimensions, and attributes. Variables also have attributes. Variables may share dimensions, indicating a common grid. One dimension may be of unlimited length. Dimension name: String length: int isUnlimited( ) Attribute name: String type: DataType values: 1D array Variable name: String shape: Dimension[ ] type: DataType array: read( ), … File location: Filename create( ), open( ), … Variables and attributes have one of six primitive data types. DataType PrimitiveType char byte short int float double

Enhanced netCDF data model, for netCDF-4 A file has a top-level unnamed group. Each group may contain one or more named subgroups, user-defined types, variables, dimensions, and attributes. Variables also have attributes. Variables may share dimensions, indicating a common grid. One or more dimensions may be of unlimited length. Dimension name: String length: int isUnlimited( ) Attribute name: String type: DataType values: 1D array Variable name: String shape: Dimension[ ] type: DataType array: read( ), … Group name: String File location: Filename create( ), open( ), … Variables and attributes have one of twelve primitive data types or one of four user-defined types. DataType PrimitiveType char byte short int int64 float double unsigned byte unsigned short unsigned int unsigned int64 string UserDefinedType typename: String Compound VariableLength Enum Opaque

NetCDF-4 classic-model: a transitional format netCDF-3 netCDF-4 classic model netCDF-4 Compatible with existing applications Simplest data model and API Not compatible with some existing applications Enhanced data model and API, more complex, powerful Uses classic API for compatibility Uses netCDF-4/HDF5 storage for compression, chunking, performance To use, just recompile, relink

Other ways netCDF supports data durability CF Conventions add earth-science specific semantics to low-level data model, without changing format Java netCDF reads multiple data formats through an abstract Common Data Model interface HDF4, HDF5, HDF-EOS, GRIB1, GRIB2, BUFR, GEMPAK, GINI, DMSP, NEXRAD, … NcML wrappers support efficient addition of new metadata, virtual aggregations “history” attribute for provenance automatically maintained by utilities like NCO

Concluding remarks Format obsolescence need not be an issue for data durability Evolve data models by extension, not by incompatible modification Preserve previous programming interfaces Support previous format variants transparently Avoid gratuitous invention of new formats Data preservation and stewardship requires much more than dealing with format evolution  Economic failures  Organizational failures  Operator or administrative errors  Hardware problems  Software errors