Complex Scientific Analytics in Earth Science at Extreme Scale
John Caron
University Corporation for Atmospheric Research, Boulder, CO
Oct 6, 2010
Who Are You?
Unidata is an NSF-funded engineering group
– NSF Division of Atmospheric and Geospace Sciences (AGS)
– Core constituency is synoptic meteorology
– Builds tools to obtain and analyze earth science data for research and education
– Atmosphere, oceans, hydrology, climate, biosphere
Unidata is the developer of the netCDF format
I am a Software Engineer, not a Scientist
– Java-NetCDF library
– THREDDS Data Server
– Common Data Model
Previous results for Geoscience
Geoscience has no single very large project. Rather, it is diverse, heterogeneous, and highly distributed, with a diversity of data formats and access methods (hindering wide data use).
Data is stored non-hierarchically and fully distributed across a large number of independent sites. Virtually all scientists want to own and control their data.
Data is append-only: written once and never updated.
Geoscience data can be represented (at a low level) as n-dimensional arrays. These are not stored in databases but in scientific file formats such as HDF and netCDF.
Data formats are usually chosen by data producers, so data is archived optimally for writing and storage, not for retrieval and analysis. Data is often stored with insufficient metadata for non-expert users.
Geoscience data comes from a large variety of sources: ground observatories, mobile stations, sensor networks, aerial observers, simulation models, etc. The data for a given location may have different resolutions, sample rates, perspectives, or coordinate systems, and therefore must be transformed, regridded, aligned, and otherwise unified before it can be analyzed.
National Center for Atmospheric Research (NCAR) Data Archives
NCAR Mass Storage: 8 PB
– Growing at 2.5 PB/year; by 2012: 5 PB/year
– Tape silo with 100 TB disk cache
NCAR Research Data Archive: 55 TB
– High-quality observations and model output
– 55 TB / 600 datasets ≈ 100 GB / dataset
NASA ESDIS (Earth Science Data and Information System)
EOSDIS Metrics (Oct. 1, 2008 to Sept. 30, 2009):
– Unique data sets: > 4,000
– Distinct users of EOSDIS data and services: > 910K
– Web site visits: > 1M
– Average archive growth: 1.8 TB/day
– Total archive volume: 4.2 PB
– End user distribution products: > 254M
– End user average distribution volume: 6.7 TB/day
Satellite observational data:
– 4.2 PB / 4,000 datasets ≈ 1 TB / dataset
– 4.2 PB / 175 million files ≈ 24 MB / file
CLASS (NOAA) data volumes
CLASS currently holds about 30 PB of data
Projected to grow to 100 PB by 2015 and 160 PB by 2020
Dataset Size / Heterogeneity
Climate Model Intercomparison (PCMDI) project for IPCC AR4 (2006/7): 35 TB, ~20 climate models
– 35 TB / 78,000 files ≈ 450 MB / file
– Stored in netCDF with CF Conventions
NASA's Global Change Master Directory (GCMD) holds more than 25,000 Earth science data set and service descriptions.
Earth Science Data Archive Current Practices
Raw data is processed into an archive or exchange format
– General purpose: netCDF / HDF
– Special purpose: e.g. in meteorology, WMO's GRIB and BUFR
Interest is in this archive (not raw) data
A dataset is a collection of homogeneous files
– Common metadata, single schema (approximately)
– Granule = single file, partitioned by time
– Effectively append-only
Near-real-time archives allow file appending
A rolling archive keeps, e.g., the most recent 30 days
Very diverse
– Big mandated archives: NOAA, NASA
– Many others: DOE, USGS, EPA, NCAR, universities, etc.
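The archive behavior described above — append-only granules partitioned by time, with a rolling window that keeps only recent data — can be sketched in a few lines of pure Python. The class and method names here are illustrative, not from any Unidata or archive-center API.

```python
from datetime import date, timedelta

class RollingArchive:
    """Append-only archive of time-partitioned granules (one file per day),
    keeping only the most recent `window_days` of data."""

    def __init__(self, window_days=30):
        self.window_days = window_days
        self.granules = {}  # date -> granule (here, just a filename)

    def append(self, day, filename):
        # Granules are written once and never updated.
        if day in self.granules:
            raise ValueError("granule for %s already exists (append-only)" % day)
        self.granules[day] = filename
        self._expire(newest=day)

    def _expire(self, newest):
        # Keep exactly the window: newest day and the window_days-1 before it.
        cutoff = newest - timedelta(days=self.window_days - 1)
        for d in [d for d in self.granules if d < cutoff]:
            del self.granules[d]

archive = RollingArchive(window_days=30)
for i in range(40):
    d = date(2010, 1, 1) + timedelta(days=i)
    archive.append(d, "obs_%s.nc" % d.isoformat())
print(len(archive.granules))  # 30 — older granules have been expired
```

A near-real-time archive would differ only in allowing appends to the newest granule; the older granules stay immutable either way.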
Earth Science Data Archive Current Practices
Search metadata may be put into an RDBMS; the data itself is not (with some exceptions)
Data may be online, or near-line in tertiary storage
Most data is transferred as files in a batch service
– Place order, get data later
– May have a subsetting / aggregation service
– May have a file format translation service (hard)
– May have a regridding service (very hard)
Starting to develop online web services
– Open Geospatial Consortium (OGC) protocols / ISO-191xx data models
– Community standard protocols, e.g. OPeNDAP in the ocean and atmospheric sciences
– Synchronous, assumes data is online
Processing
– Some standard operators: statistics, regridding
– Algebra / calculus to create derived fields
General Purpose Scientific Data File Formats in Earth Science
NetCDF (Unidata) / HDF (NCSA)
– Persistent Fortran 77 / 90 arrays
– Arbitrary key/value attributes
Multidimensional arrays
– Regular / rectangular (think Fortran)
– Ragged (a bit of a poor cousin)
– Tiled / compressed (for performance)
Language API bindings
– Efficient strided array subsetting
– Procedural, file-at-a-time
– Some higher-level tools for processing sets of files
Machine / OS / language independent
– Solved the syntactic problem of data access
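The "efficient strided array subsetting" these APIs offer boils down to index arithmetic: a (start, stop, stride) range per dimension, mapped onto row-major storage. A minimal pure-Python sketch of that arithmetic (illustrative only — real netCDF/HDF libraries do this in C, reading only the selected bytes):

```python
def strided_subset(shape, ranges):
    """Given an n-D array shape and per-dimension (start, stop, stride)
    ranges, return the flat (row-major) offsets of the selected elements.
    This is the index arithmetic behind netCDF-style strided access."""
    # Row-major strides: how far one step in each dimension moves the flat index.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]

    def recurse(dim, base):
        if dim == len(shape):
            yield base
            return
        start, stop, step = ranges[dim]
        for idx in range(start, stop, step):
            yield from recurse(dim + 1, base + idx * strides[dim])

    return list(recurse(0, 0))

# Every other column of the first two rows of a 4x6 array:
offsets = strided_subset((4, 6), [(0, 2, 1), (0, 6, 2)])
print(offsets)  # [0, 2, 4, 6, 8, 10]
```

Because the selected offsets are computed up front, a library can translate them into a small number of contiguous reads instead of loading the whole array — the key to subsetting large files efficiently.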
Data Semantics are hard
Semantics are typically carried in key/value attributes stored in the files
Datasets define attribute conventions
– Human-readable documents (e.g. CF Conventions)
– Sometimes with a software API (e.g. HDF-EOS)
– Sometimes you just have to know what it means
Sometimes there are no semantics in the file at all
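To make the attribute-convention idea concrete: CF-aware software identifies coordinate variables not by name but by recognizing conventional attribute values. A small sketch, with made-up variable names but real CF attribute conventions (`degrees_north`/`degrees_east` units, "<unit> since <datetime>" time units):

```python
# Variable attributes as a netCDF file might carry them (CF Conventions).
# The variable names and values are illustrative, not from a real dataset.
attrs = {
    "T":    {"standard_name": "air_temperature", "units": "K"},
    "lat":  {"standard_name": "latitude", "units": "degrees_north"},
    "lon":  {"standard_name": "longitude", "units": "degrees_east"},
    "time": {"units": "hours since 2010-01-01 00:00:00"},
}

def find_coordinates(variables):
    """Identify coordinate variables the way CF-aware software does:
    by conventional attribute values, not by variable name."""
    coords = {}
    for name, a in variables.items():
        units = a.get("units", "")
        if units == "degrees_north":
            coords["latitude"] = name
        elif units == "degrees_east":
            coords["longitude"] = name
        elif " since " in units:   # CF time units: "<unit> since <datetime>"
            coords["time"] = name
    return coords

print(find_coordinates(attrs))  # {'latitude': 'lat', 'longitude': 'lon', 'time': 'time'}
```

When a file carries no such attributes, this kind of inference fails — which is exactly the "sometimes you just have to know what it means" case above.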
What to do about heterogeneity?
1. Rewrite into a common form / database
– Caveat: must save the original data
2. Leave the data in the original file formats
– Develop decoders to a common data model
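Option 2 amounts to defining one interface and writing a decoder per native format behind it. A minimal sketch of that pattern in Python — the interface, class names, and stand-in GRIB record layout are all hypothetical, not the actual Common Data Model API:

```python
from abc import ABC, abstractmethod

class CommonDataModelDecoder(ABC):
    """One common interface; each native file format gets its own decoder.
    Client code never sees the format-specific layout."""

    @abstractmethod
    def variables(self):
        """Names of the variables in the file."""

    @abstractmethod
    def read(self, varname):
        """Return the variable's data as an n-D array (nested lists here)."""

class GribDecoder(CommonDataModelDecoder):
    """Hypothetical decoder for GRIB; a netCDF or BUFR decoder would
    implement the same two methods over its own native structures."""

    def __init__(self, records):
        self._records = records   # stand-in for parsed GRIB records

    def variables(self):
        return sorted(self._records)

    def read(self, varname):
        return self._records[varname]

# Client code works only against the common interface:
ds = GribDecoder({"TMP": [[280.1, 280.4], [279.8, 280.0]]})
print(ds.variables())   # ['TMP']
```

The caveat from option 1 disappears here: the original files are untouched, and a decoder bug can be fixed without re-archiving anything.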
Unidata's Approach
Virtual Dataset
– Collection of many files
– Hide the details of the file partitioning
– Provide remote access
Efficient subsetting in space/time
– Let user programs work in coordinate space
– Handle the mapping to array indices
Define a small set of scientific feature types
– A dataset is a collection of objects, not arrays
– Necessary to abstract the details of array storage
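The first two ideas — a virtual dataset hiding file partitioning, and letting users work in coordinate space — can be sketched together: build one logical time axis over many time-partitioned files, then map a requested time value back to (file, index-within-file). Names and structure here are illustrative, not the actual CDM aggregation API:

```python
import bisect

class VirtualDataset:
    """Many time-partitioned files presented as one logical time axis.
    The user asks in coordinate space (a time value); the class maps that
    to the granule and array index that hold it."""

    def __init__(self, files):
        # files: list of (filename, [time coordinates in that file]),
        # in increasing time order across and within files
        self.times = []   # logical, concatenated time axis
        self.index = []   # parallel list of (filename, local index)
        for fname, coords in files:
            for i, t in enumerate(coords):
                self.times.append(t)
                self.index.append((fname, i))

    def locate(self, t):
        """Map a time coordinate to (filename, index within that file)."""
        i = bisect.bisect_left(self.times, t)
        if i == len(self.times) or self.times[i] != t:
            raise KeyError("no data at time %r" % t)
        return self.index[i]

vds = VirtualDataset([
    ("jan01.nc", [0, 6, 12, 18]),    # hours since a hypothetical epoch
    ("jan02.nc", [24, 30, 36, 42]),
])
print(vds.locate(30))  # ('jan02.nc', 1)
```

User programs only ever mention time values; which physical file holds a given time step is an implementation detail they never see, which is the point of the abstraction.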
Unidata's Common Data Model (layered architecture diagram):
– Scientific Feature Types (objects — how the user sees the data): Grid, Point, Radial, Trajectory, Swath, Station, Profile
– Coordinate Systems (georeferencing, topology)
– Data Access (storage format, multidimensional arrays): netCDF-3, HDF5, OPeNDAP, BUFR, GRIB1, GRIB2, NEXRAD, NIDS, McIDAS, GEMPAK, GINI, DMSP, HDF4, HDF-EOS, DORADE, GTOPO, ASCII
Geoscience Data Summary
Many important geoscience datasets
– 10,000 (?)
– Unique metadata / semantics
Stored in append-only file collections
– Time-partitioned
– Optional metadata indexing
Three levels
– Storage format: multidimensional arrays
– Coordinate systems: space/time georeferencing, topology
– Objects: forecast model run, radar sweep, satellite swath image, vertical profile of atmosphere, time series of surface observations, collection of lightning strikes, autonomous underwater vehicle (AUV) trajectories, etc.