File Formats, Conventions, and Data Level Interoperability ESDSWG New Orleans, Oct 20, 2010 Joe Glassy, Chris Lynnes ESDSWG Tech Infusion
Introduction & overview Outline of objectives: – Discuss the role of standard, self-describing file formats in data-level interoperability – Summarize common file formats in use, their properties, and benefits (“data life cycle economics”) – Discuss criteria for choosing a file format and matching it to the needs of consumers and producers – Discuss the critical role of conventions: any file format needs good recipes to make it interoperable! – Examples: NASA Measures F/T, SMAP, AIRS, Aura
Role(s) Of File Formats in Interoperability File formats represent versatile “packages” for multi-dimensional science data and metadata. Offer self-describing “well-known structures” to codify desired, common conventions and practices. Offer well-documented reference cases to encapsulate specific data models. Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability Enhance Mission-to-Mission continuity
Why (and how) are file formats important? Standard formats – Come with thorough documentation – Provide good reference implementations Common formats – More datasets in a format → more tools that read that format Canonical structures and names → general-purpose handlers for coordinates, etc. → smarter tools
A generic work flow… Consider user community needs and culture, fit within architecture, institutional policies & preferences Choose a standard file format (or sub-variant) Design a convention-enabled, specific internal layout with metadata interfaces Prototype: Implement in prototype, evaluate Implement in production context Integrate within discovery and catalog environments (Catalog interoperability…)
Examples of standard file formats HDF5 – a file format on its own, as well as a broad foundation for others netCDF v4 (stable at v4.1.1, newest : v4.1.2-beta1) – v4 Classic (widespread adoption, some limitations…) – v4 Enhanced (support Groups, User-defined, variable length types, and more) netCDF v3 Classic (legacy+, tools+, but limited) HDFEOS2, HDFEOS5 – EOS Terra, Aqua, Aura… HDF4 – legacy, extensive use by MODIS Terra, Aqua Many other domain-specific, less generic formats abound… (need transform tools to/from HDF?)
Some selection criteria… Do the file format’s capabilities support the required functionality? What is the breadth of acceptance and adoption within the larger community? (And/or, does institutional policy dictate a specific format?) What is the presence and quality of documentation (reference, examples, and especially tutorials), API software, and community support? What does it contribute to investment and data life-cycle economics? What is the level of standardization? How adaptable is the format to widely used conventions like CF 1.x, or other accepted convention(s)?
Internal Layout / Design (once format is chosen & adopted…) Define & refine high-level organization/structure: /DATA, /METADATA Distinguish ‘data’ from ‘metadata’, core structure vs. ‘attributes’ – Dimensions, coordinate variables, projection attributes – Missing_data, _FillValue vs. internal fill value – Units, gain, offset, min, max, range, etc. Prototype it! – Leverage scripting environments (Python h5py, PyTables, etc.) – Panoply and HDFView are also quick and useful for prototyping and feedback
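Before committing a layout to an HDF5 library, the "data vs. attributes" split above can be drafted and checked in plain Python. The following is only a minimal sketch: the `Variable` class, the `REQUIRED_ATTRS` policy, and the `quality` variable are hypothetical illustrations, not part of the Freeze/Thaw product definition.

```python
from dataclasses import dataclass, field

# Attributes every data variable must carry in this sketch.
# This is an assumed project policy, not a rule stated in the slides.
REQUIRED_ATTRS = ("units", "long_name", "_FillValue")


@dataclass
class Variable:
    name: str
    dims: tuple  # dimension names, e.g. ("y", "x")
    attrs: dict = field(default_factory=dict)


def missing_attrs(var):
    """Return the required attribute names that `var` does not define."""
    return [a for a in REQUIRED_ATTRS if a not in var.attrs]


# Hypothetical draft layout: one fully described variable, one incomplete.
draft = [
    Variable("FT_SSMI", ("y", "x"),
             {"units": "1", "long_name": "freeze/thaw state", "_FillValue": 255}),
    Variable("quality", ("y", "x"), {"long_name": "quality flag"}),
]

report = {v.name: missing_attrs(v) for v in draft}
for name, gaps in report.items():
    print(name, "OK" if not gaps else "missing: " + ", ".join(gaps))
```

A check like this can run while the layout is still being debated, well before any file is written.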
Using “Groups” HDF5 (and NetCDF v4-Enhanced) support full use of groups, e.g. /DATA vs. /METADATA, etc. Groups are useful for partitioning out functionally related sets of data or attributes; the hierarchical view mimics a file system. Facilitates appropriate information hiding: highlights needed info and shields the rest (principle of least privilege…). Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs.
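The file-system analogy can be illustrated without any HDF5 library. In this rough sketch, nested dictionaries stand in for groups and the `walk` helper (a hypothetical name) prints the kind of slash-separated paths a group-aware tool would show. The member names under /DATA and /METADATA are placeholders chosen for illustration.

```python
def walk(group, prefix=""):
    """Yield (path, value) for every leaf, using HDF5-style '/' paths."""
    for name, child in group.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            # A nested dict plays the role of a sub-group.
            yield from walk(child, path)
        else:
            yield path, child


# Hypothetical layout: groups as dicts, datasets/attributes as leaves.
layout = {
    "DATA": {"FT_SSMI": "uint8 grid", "quality": "uint8 grid"},
    "METADATA": {"title": "text", "history": "text"},
}

paths = [p for p, _ in walk(layout)]
print(paths)
```

A reader interested only in the science data can stay under /DATA and ignore everything else, which is the information-hiding benefit the slide describes.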
Example(s) of File Formats In Action HDF5 – NASA Measures – NASA Measures Freeze/Thaw (soon available at NSIDC) – http://measures.ntsg.umt.edu/sample_2007_day180.zip Aqua AIRS Level 2 (from earlier talk): – http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/2010/285/AIRS.2010.10.12.090.L2.RetStd.v220.127.116.11.G10286064818.hdf Aura TES (TES-Aura_L3-CH4_r0000002135_F01_05.he5)
Example: NASA Measures Freeze/Thaw, Daily in HDF5 Metadata Block: Attributes
Example: NASA Measures Daily Freeze/Thaw in HDF5 Data Variable (FT_SSMI) and Attributes
CF Conventions & file formats: --how they contribute to interoperability. CF v1.4.x -- the term “CF” is now broader than just climate-forecasting! Standard Name Table -- a step towards wider adoption of names, controlled vocabularies, units terminology CF v1.4.x provides tool-makers with helpful “lingua franca” guidance. Within a file-format, adopting conventions like CF promotes common layout, names, semantics, for dataset-to-dataset compatibility -- a key to wider data level interoperability.
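One concrete way tools use this guidance: a CF file declares the convention it follows in a global `Conventions` attribute (for example `CF-1.4`), and a reader can branch on it. The sketch below assumes the global attributes have already been read into a plain dictionary. The `cf_version` helper is a hypothetical name, not a library function.

```python
import re


def cf_version(global_attrs):
    """Return (major, minor) from a 'CF-x.y' token in Conventions, else None."""
    value = global_attrs.get("Conventions", "")
    match = re.search(r"\bCF-(\d+)\.(\d+)", value)
    return (int(match.group(1)), int(match.group(2))) if match else None


declared = cf_version({"Conventions": "CF-1.4", "title": "sample granule"})
undeclared = cf_version({"title": "sample granule"})
print(declared, undeclared)
```

A tool could fall back to format-specific heuristics when the result is `None`, and to the generic CF handlers otherwise.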
Attributes vs. Metadata? One man’s ceiling is another man’s floor… Collection level vs. Data Set vs. Granule level Structural vs. science-content Swath vs. grid vs. point Commonly used attributes: – Conventions attribute, communicates which convention was used – Basic globals: title, history, institution, source, references – Coordinate variables, axis, formula_terms – Units, _FillValue, missing_data, valid_range – Short_name, long_name, other provenance – (gain, offset / scale_factor, add_offset), etc.
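The scaling and fill attributes in that list are what let a generic tool recover physical values from packed integers. As a minimal sketch, the function below follows the CF rule `unpacked = packed * scale_factor + add_offset` and treats values equal to the fill value as missing. Products that use a differently defined gain/offset pair would need a different formula. The `unpack` name and the sample numbers are illustrative only.

```python
def unpack(packed, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Convert packed integers to physical values; fill values become None."""
    return [
        None if v == fill_value else v * scale_factor + add_offset
        for v in packed
    ]


# Made-up packed values; -9999 plays the role of _FillValue.
raw = [0, 100, 250, -9999]
values = unpack(raw, scale_factor=0.5, add_offset=10.0, fill_value=-9999)
print(values)
```

Because the recipe lives in the attributes rather than in each reader's code, any convention-aware tool can apply it without dataset-specific knowledge.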
Challenges? (just a few remain…) Evolution, bifurcation, and asymmetric support can result in occasional user confusion: – HDF5 v1.8.x vs. v1.6.x families? – NetCDF v4 Enhanced vs. NetCDF v4 Classic vs. v3? – HDFEOS5 vs. HDFEOS2? GUI tool and API support both tend to vary by platform (Linux, Mac, Win7) and sub-flavor… Multi-library dependency stacks beg for a fully bundled, version-matched end-to-end install package! The conventions community (CF v1.4.x) and metadata standards communities are also in motion (but that’s good too…)
Resources: File format related Tools Panoply: http://www.giss.nasa.gov/tools/panoply/ HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/ OpenDAP : http://opendap.org IDV : http://www.unidata.ucar.edu/software/idv/ McIDAS : http://www.unidata.ucar.edu/software/mcidas/ Python : – h5py : http://code.google.com/p/h5py/, http://h5py.alfven.org/ – PyTables: http://www.pytables.org/moin Perl : PDL-IO-HDF5, and Biohdf? Many others: HEG, MTD, HDFEOS plug-in for HDFview, HDFLook, (ncdump, h5dump, and cousins), GRADS, Matlab, binary APIs
A provisional DOI, UUID Strategy What we used for NASA Measures Freeze/Thaw, daily (v2), just delivered: – DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing – UUID recipe: seedString = www.our.url/GranuleName/Datetime8601Stamp import uuid; granule_id = uuid.uuid5(uuid.NAMESPACE_URL, seedString) (uuid5 requires a namespace argument; uuid.NAMESPACE_URL is an assumption here, chosen because the seed is URL-like)
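The UUID recipe can be fleshed out as a small runnable sketch. Python's `uuid.uuid5` takes a namespace plus a name, so a namespace has to be chosen: `uuid.NAMESPACE_URL` is assumed here because the seed string is URL-like. The timestamp format, the `granule_uuid` helper, and the sample granule name are likewise assumptions for illustration; the slides do not specify them.

```python
import uuid
from datetime import datetime, timezone


def granule_uuid(base_url, granule_name, stamp):
    """Derive a repeatable version-5 UUID from URL/granule/ISO 8601 stamp."""
    # Assumed ISO 8601 layout for the "Datetime8601Stamp" part of the seed.
    iso_stamp = stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
    seed = f"{base_url}/{granule_name}/{iso_stamp}"
    return uuid.uuid5(uuid.NAMESPACE_URL, seed)


stamp = datetime(2007, 6, 29, tzinfo=timezone.utc)  # day 180 of 2007
gid = granule_uuid("www.our.url", "sample_2007_day180", stamp)
print(gid)
```

Because version-5 UUIDs are hashes of their inputs, regenerating the identifier for the same granule always yields the same value, while any change to the granule name or timestamp yields a different one.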