What is NetCDF? And what are its plans for world domination?
John Caron, Unidata, October 2012

NetCDF is a…
- Abstract Data Model: describes data objects and what methods you can use on them
- API: the interface to the Data Model for a specific programming language
- Software library: implements the API (C, Java, others)
- File format: stores the data model objects; the persistence layer (netCDF-3, netCDF-4)

NetCDF is a… File format
- Stores scientific data; the persistence layer (netCDF-3, netCDF-4)
- Portable: machine-, OS-, and application-independent
- Random access: fast subsetting
- Simple: self-describing, user-accessible "flat file"
- Documented: NASA ESE standard; one-page BNF grammar (netCDF-3)

NetCDF-3 file format
- Header
- Non-record variables, e.g. float var1(z, y, x): each stored contiguously, in row-major order (Variable 1, Variable 2, Variable 3, …)
- Record variables, e.g. float rvar1, rvar2, rvar3 with an unlimited outermost dimension: stored interleaved record by record, so Record 0 holds rvar1(0, z, y, x), rvar2(0, z, y, x), rvar3(0, z, y, x); Record 1 holds rvar1(1, z, y, x), …; and so on without limit

NetCDF-4 file format
- Built on HDF5
- Much more complicated than netCDF-3
- Storage efficiency: compression; chunking (can optimize for common I/O patterns)
- Multiple unlimited dimensions
- Variable-length data

Row vs. column storage
- A traditional RDBMS is a row store: all fields for one row in a table are stored together
- NetCDF-3 is a column store: all data for one variable is stored together
- NetCDF-4 allows both row and column storage: row = compound type, column = regular variable
- Recent commercial RDBMSs with column-oriented storage show better performance in some cases
- NetCDF-3 record variables are like a compound type

NetCDF is a… Software library
- Reference library in C, with Fortran, C++, Perl, Python, Matlab, … interfaces
- Independent implementation in Java; others?
- Open source, active community, supported
- Contrast GRIB and BUFR: no reference library, no user group; implementations are fragmented, not interoperable, and difficult to use

NetCDF is a… API
The Application Programming Interface (API) is the interface to the Data Model for a specific programming language.
- Clean separation of concerns
- Information hiding: the user never does a seek()
- Stable, backwards compatible
- Easily understood: no surprises
- The interface has a small "surface area"
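For example, here is a minimal NetCDF-Java sketch of that API style; the file name, variable name, and index ranges are placeholders, and the library handles all byte-level layout, so the user never calls seek().

```java
import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class ReadSubset {
  public static void main(String[] args) throws Exception {
    // "example.nc" and "temp" are placeholder names for this sketch
    NetcdfFile ncfile = NetcdfFile.open("example.nc");
    try {
      Variable temp = ncfile.findVariable("temp");
      // Read temp(0, 0:179, 0:359) by index-space section; no offsets, no seek()
      Array subset = temp.read("0,0:179,0:359");
      System.out.println("read " + subset.getSize() + " values");
    } finally {
      ncfile.close();
    }
  }
}
```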

NetCDF is a… Data Model
An Abstract Data Model describes data objects and what methods you can use on them.

NetCDF-3 data model
- Multidimensional arrays of primitive values: byte, char, short, int, float, double
- Key/value attributes
- Shared dimensions
- Fortran77
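As a sketch of how those data model objects appear in the NetCDF-Java API (a rough, hand-written analogue of ncdump -h, not the library's own utility):

```java
import ucar.nc2.Attribute;
import ucar.nc2.Dimension;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class DumpHeader {
  public static void main(String[] args) throws Exception {
    NetcdfFile ncfile = NetcdfFile.open(args[0]);
    try {
      for (Dimension dim : ncfile.getDimensions())      // shared dimensions
        System.out.println("dim " + dim.getShortName() + " = " + dim.getLength());
      for (Variable var : ncfile.getVariables()) {      // multidimensional arrays
        System.out.println("var " + var.getNameAndDimensions());
        for (Attribute att : var.getAttributes())       // key/value attributes
          System.out.println("  att " + att);
      }
    } finally {
      ncfile.close();
    }
  }
}
```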

NetCDF-4 Data Model

NetCDF, HDF5, OPeNDAP data models (diagram comparing the NetCDF classic, NetCDF extended, OPeNDAP, and HDF5 data models, with shared dimensions highlighted)

NetCDF-Java Library (aka Common Data Model): Status Update

C library architecture (diagram): Application → API → Dispatch → format layers (NetCDF-3, NetCDF-4, HDF5, HDF4, OPeNDAP, …)

NetCDF-Java / CDM architecture (layered, bottom to top):
- Local files and remote datasets
- NetcdfFile: I/O Service Providers for NetCDF-3, NetCDF-4, HDF4, GRIB, NIDS, GINI, Nexrad, DMSP, …; remote access via OPeNDAP, cdmremote, NcML, and THREDDS catalog.xml
- NetcdfDataset: CoordSystem Builder
- Scientific Feature Types: Datatype Adapter
- Application

CDM file formats

Coordinate System UML

Conventions
- CF Conventions (preferred): dataVariable:coordinates = "lat lon alt time";
- Also understood: COARDS, NCAR-CSM, ATD-Radar, Zebra, GEIF, IRIDL, NUWG, AWIPS, WRF, M3IO, IFPS, ADAS/ARPS, MADIS, Epic, RAF-Nimbus, NSSL National Reflectivity Mosaic, FslWindProfiler, Modis Satellite, Avhrr Satellite, Cosmic, …
- Or write your own CoordSysBuilder Java class
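A minimal NetCDF-Java sketch of what the convention parsing buys you, assuming a CF-compliant file with a data variable named "temp" (both the file and the variable name are hypothetical): opening through NetcdfDataset runs the registered CoordSysBuilders and attaches coordinate systems to the variable.

```java
import ucar.nc2.dataset.CoordinateSystem;
import ucar.nc2.dataset.NetcdfDataset;
import ucar.nc2.dataset.VariableDS;

public class ShowCoordSystems {
  public static void main(String[] args) throws Exception {
    // openDataset() applies the convention parsers (CoordSysBuilder classes)
    NetcdfDataset ds = NetcdfDataset.openDataset(args[0]);
    try {
      // assumes the default "enhanced" mode, so variables are wrapped as VariableDS
      VariableDS v = (VariableDS) ds.findVariable("temp");
      for (CoordinateSystem cs : v.getCoordinateSystems())
        System.out.println(v.getShortName() + " has coordinate system " + cs.getName());
    } finally {
      ds.close();
    }
  }
}
```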

Projections: albers_conical_equal_area (sphere and ellipse), azimuthal_equidistant, lambert_azimuthal_equal_area, lambert_conformal_conic (sphere and ellipse), lambert_cylindrical_equal_area (sphere and ellipse), mcidas_area, mercator, METEOSAT 8 (ellipse), orthographic, rotated_pole, rotated_latlon_grib, stereographic (including polar; sphere and ellipse), transverse_mercator (sphere and ellipse), UTM (ellipse), vertical_perspective

Vertical transforms (CF): atmosphere_sigma_coordinate, atmosphere_hybrid_sigma_pressure_coordinate, atmosphere_hybrid_height_coordinate, atmosphere_ln_pressure_coordinate, ocean_s_coordinate, ocean_sigma_coordinate, ocean_s_coordinate_g1, ocean_s_coordinate_g2, existing3DField

NetCDF "Index Space" data access (OPeNDAP URL):
http://motherlode.ucar.edu:8080/thredds/dodsC/NAM_CONUS_80km_20081028_1200.grib1.ascii?Precipitable_water[5][5:1:30][0:1:77]

NetCDF "Coordinate Space" data access (NetCDF Subset Service URL):
http://motherlode.ucar.edu:8080/thredds/ncss/grid/NAM_CONUS_80km_20081028_1200.grib1?var=Precipitable_water&time=2008-10-28T12:00:00Z&north=40&south=22&west=-110&east=-80
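The same index-space request can also be made through the NetCDF-Java API rather than a hand-built URL; a sketch (the 2008 motherlode dataset is just the example from the slide and may no longer be online):

```java
import ucar.ma2.Array;
import ucar.nc2.Variable;
import ucar.nc2.dataset.NetcdfDataset;

public class OpendapSubset {
  public static void main(String[] args) throws Exception {
    String url = "http://motherlode.ucar.edu:8080/thredds/dodsC/"
        + "NAM_CONUS_80km_20081028_1200.grib1";
    NetcdfDataset ds = NetcdfDataset.openDataset(url);  // speaks OPeNDAP under the hood
    try {
      Variable pw = ds.findVariable("Precipitable_water");
      Array subset = pw.read("5,5:30,0:77");  // same index ranges as the URL above
      System.out.println("read " + subset.getSize() + " values");
    } finally {
      ds.close();
    }
  }
}
```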

Scientific Feature Types
- Classification of earth science data into broad categories
- Take advantage of the regularities found in the data, for performance
- Scale to large, multi-file collections
- Support subsetting in space and time

What's in a file? Layers of abstraction: to the OS, a file is a bag of bytes; as a netCDF file, it is multidimensional arrays; through feature types, it is profiles, swaths, or radar data.

Gridded Data
- Grid: multidimensional grid, separable coordinates
- Radial: a connected set of radials using polar coordinates, collected into sweeps
- Swath: a two-dimensional grid with track and cross-track coordinates
- Unstructured grids: finite element models, coastal modeling (under development)

Point Data
- point: a single data point (having no implied coordinate relationship to other points)
- timeSeries: a series of data points at the same spatial location, with monotonically increasing times
- trajectory: a series of data points along a path through space, with monotonically increasing times
- profile: an ordered set of data points along a vertical line at a fixed horizontal position and fixed time
- timeSeriesProfile: a series of profile features at the same horizontal position, with monotonically increasing times
- trajectoryProfile: a series of profile features located at points ordered along a trajectory

Discrete Sampling Convention (CF 1.6)
- Encoding standard for netCDF classic files
- Challenge: represent ragged arrays efficiently
- Classifies data according to the connectedness of time/space coordinates
- Defines netCDF data structures that represent features
- Makes it easy and efficient to store collections of features in one file, read a single feature back, and subset the collection by space and time (a minimal reading sketch follows)
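A minimal NetCDF-Java sketch of opening such a file through the point-feature (DSG) API; the file name is a placeholder, and since the collection-iteration calls vary between library versions, this only shows discovering the feature type.

```java
import java.util.Formatter;
import ucar.nc2.constants.FeatureType;
import ucar.nc2.ft.FeatureDataset;
import ucar.nc2.ft.FeatureDatasetFactoryManager;

public class OpenPointData {
  public static void main(String[] args) throws Exception {
    Formatter errlog = new Formatter();
    // "stations.nc" is a placeholder for a CF-1.6 discrete sampling file
    FeatureDataset fd = FeatureDatasetFactoryManager.open(
        FeatureType.ANY_POINT, "stations.nc", null, errlog);
    if (fd == null) {
      System.out.println("not a point dataset: " + errlog);
      return;
    }
    try {
      System.out.println("feature type = " + fd.getFeatureType()); // e.g. STATION, PROFILE
    } finally {
      fd.close();
    }
  }
}
```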

(Diagram: rectangular array vs. ragged array)

NetCDF Markup Language (NcML)
- XML representation of netCDF metadata (like ncdump -h)
- Create new netCDF files (like ncgen)
- Modify ("fix") existing datasets without rewriting them
- Create virtual datasets as aggregations of multiple existing files
- Integrated with the TDS
NcML is another component of the netCDF-Java library: an XML representation of netCDF metadata that can be used to create new netCDF files, modify an existing dataset without rewriting it, and even aggregate a set of existing datasets.
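As a sketch of how NcML is used from NetCDF-Java (the .ncml file name is hypothetical): NetcdfDataset opens an NcML document directly and presents the modified or aggregated view as if it were a single netCDF file.

```java
import ucar.nc2.dataset.NetcdfDataset;

public class OpenNcml {
  public static void main(String[] args) throws Exception {
    // "aggregation.ncml" stands in for an NcML document that, e.g., adds or
    // fixes attributes, renames variables, or aggregates a set of files
    NetcdfDataset ds = NetcdfDataset.openDataset("aggregation.ncml");
    try {
      System.out.println(ds);  // prints the CDL-style header of the virtual dataset
    } finally {
      ds.close();
    }
  }
}
```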

THREDDS Data Server (diagram): remote-access clients connect to the THREDDS server, running in a servlet container, through catalog.xml and the NCSS, OPeNDAP, HTTPServer, and cdmremote services; the server uses the NetCDF-Java library and configCatalog.xml to serve its datasets (e.g. IDD data on motherlode.ucar.edu).

Remote Access (OPeNDAP 2.0, cdmremote, NetCDF Subset Service, cdmrFeature)
- OPeNDAP 2.0: index-space access; can't transport the full netCDF extended data model; will be replaced by DAP4 next year
- cdmremote: full data model, index-space access
- NetCDF Subset Service (NCSS): coordinate-space access to gridded data; delivers netCDF files (also csv, xml, maybe JSON); now writes netCDF-4 (alpha test) using the C library via JNI
- cdmrFeature service: coordinate-space access to point data; feature-type API
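Because NCSS is plain HTTP and returns an ordinary netCDF file, a client needs nothing beyond an HTTP library; a sketch using the example request from the earlier slide (the 2008 motherlode dataset may no longer be online):

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class NcssFetch {
  public static void main(String[] args) throws Exception {
    // Coordinate-space subset request; the response body is a netCDF file
    String url = "http://motherlode.ucar.edu:8080/thredds/ncss/grid/"
        + "NAM_CONUS_80km_20081028_1200.grib1"
        + "?var=Precipitable_water&time=2008-10-28T12:00:00Z"
        + "&north=40&south=22&west=-110&east=-80";
    try (InputStream in = new URL(url).openStream()) {
      Files.copy(in, Paths.get("subset.nc"), StandardCopyOption.REPLACE_EXISTING);
    }
    System.out.println("wrote subset.nc");
  }
}
```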

ncstream serialization (diagram): a sequence of messages, with an index at the end.

NetCDF-Java / CDM architecture (the layered diagram shown earlier): local files and remote datasets → NetcdfFile (I/O Service Providers: NetCDF-3, NetCDF-4, HDF4, GRIB, NIDS, GINI, Nexrad, DMSP, …; OPeNDAP, cdmremote, NcML, THREDDS catalog.xml) → NetcdfDataset (CoordSystem Builder) → Scientific Feature Types (Datatype Adapter) → Application.

CFSR timeseries data at NCDC
- Climate Forecast System Reanalysis, 1979-2009 (31 years, 372 months)
- Total 5.6 TB, 56K files
- GRIB2 data

GRIB collection indexing (diagram): the TDS creates an index file (name.gbx9) for each GRIB file, roughly 1000x smaller than the data, plus a collection index (collectionName.ncx) holding the CDM metadata, again roughly 1000x smaller.

GRIB time partitioning (diagram): the GRIB files and their gbx9/ncx indexes are grouped into time partitions (Jan 1983, Feb 1983, Mar 1983, …), and the TDS builds a partition index (collection.ncx) over them.

What have we got?
- Fast indexing lets you find the subsets you want in under a second
- Time partitioning should scale up, as long as your data is time-partitioned
- No pixie dust: you still have to read the data!
- GRIB2 stores compressed horizontal slices: you must decompress an entire slice to get one value
- Experimenting with storing the data in netCDF-4, chunked so that a timeseries at a single point can be read efficiently

Big Data

Bigger Data
- CMIP5 at the IPSL Climate Modelling Centre: 300K netCDF files, 300 TB
- Sequential read of 300 TB at 100 MB/sec: 3 x 10^6 sec = 833 hours = 35 days
- How to get that down to 8 hours? Divide the work into 100 jobs and run them in parallel

Required: parallel I/O systems
- Shared nothing, commodity disks
- Fault tolerant, replicated data (3x)
- The Google File System with map/reduce; Hadoop is the open-source implementation, in wide industry use
- Cost: about $3000 per node TCO per year, i.e. about $300K per year for a 100-node cluster; cost will continue to fall
- Not sure whether you should rent or buy

Parallel file I/O (diagrams): the Google File System and Hadoop.

Required: send user programs to the server
- Need a query / computation language that is easily parallelized, that scientists can and will use, and that is powerful enough to accomplish hard tasks
- What language? We are not going to retrofit existing Fortran code; remember, this is post-processing, not model runs
- Not Fortran, C, or Java (too low level); some subset of Python?

Send user programs to the server
- Probably a Domain Specific Language (DSL), made up for this specific purpose
- But make it familiar! It could look like a subset of an existing language

Existing candidates
- SciDB just proposed ArrayQL: "ArrayQL currently comprises two parts: an array algebra, meant to provide a precise semantics of operations on arrays; and a user-level language, for defining and querying arrays. The user-level language is modeled on SQL, but with extensions to support array dimensions."
- Google Earth Engine is developing a DSL

Required: parallelizable high-level language
- "Scientific Data Management in the Coming Decade", Jim Gray (2005)
- Now: file-at-a-time processing in Fortran
- Need: set-at-a-time processing in a high-level query language (HLQL)
- A declarative language like SQL (vs. procedural): define the dataset subset to work against, define the computation, and let the system figure out how to do it

NetCDF "Index Space" data access (OPeNDAP URL):
http://motherlode.ucar.edu:8080/thredds/dodsC/NAM_CONUS_80km_20081028_1200.grib1.ascii?Precipitable_water[5][5:1:30][0:1:77]

NetCDF "Coordinate Space" data access (NetCDF Subset Service URL):
http://motherlode.ucar.edu:8080/thredds/ncss/grid/NAM_CONUS_80km_20081028_1200.grib1?var=Precipitable_water&time=2008-10-28T12:00:00Z&north=40&south=22&west=-110&east=-80

"Coordinate Space" data access (NCSS URL):
http://motherlode.ucar.edu:8080/thredds/ncss/grid/NAM_CONUS_80km_20081028_1200.grib1?var=Precipitable_water&time=2008-10-28T12:00:00Z&north=40&south=22&west=-110&east=-80

The same request as "fake SQL":
SELECT Precipitable_water
FROM NAM_CONUS_80km_20081028_1200.grib1
WHERE time=2008-10-28T12:00:00Z
AND space=[north=40, south=22, west=-110, east=-80]

More elaborate:
DATASET cfsr FROM CFSR-HPR-TS9
  WHERE month=April AND year >= 1990
  AND space=[north=40, south=22, west=-110, east=-80]
  AS Grid
SELECT precip=Precipitable_water, rh=Relative_Humidity, T=Temperature FROM cfsr
CALC DailyAvg(Correlation(precip, rh) / Avg(T))
RETURN AS Grid

APL example:
DATASET cfsr FROM CFSR-HPR-TS9
  WHERE month=April AND year >= 1990
  AND space=[north=40, south=22, west=-110, east=-80]
  AS Grid
CALCDEF myCalc(X, Y, DATA) { X ← 3 3⍴÷⍳9 ⋄ Y ← DATA[⍋DATA] }
SELECT precip=Precipitable_water, rh=Relative_Humidity, T=Temperature FROM cfsr
CALC myCalc(precip, rh, T)
RETURN AS Grid

Summary: Big Data Post-Processing
- Need a parallel I/O system: shared nothing, commodity disks, replicated data
- Need a parallel processing system: Hadoop, based on GFS (map/reduce)
- Need to send the computation to the server
- Need a parallelizable query / computation language: possibly declarative, expressive and powerful, probably a new domain-specific language
- Need to capture common queries