Complex Scientific Analytics in Earth Science at Extreme Scale

John Caron
University Corporation for Atmospheric Research
Boulder, CO
Oct 6, 2010

Who Are You?

Unidata is an NSF-funded engineering group
– NSF Division of Atmospheric and Geospace Sciences (AGS)
– Core constituency is synoptic meteorology
– Builds tools to obtain and analyze earth science data for research and education
– Atmosphere, oceans, hydrology, climate, biosphere
Unidata is the developer of the netCDF format
I am a software engineer, not a scientist
– NetCDF-Java library
– THREDDS Data Server
– Common Data Model

Previous results for Geoscience

– Geoscience has no very large projects. It is diverse, heterogeneous, and highly distributed, with a diversity of data formats and access methods (hindering wide data use).
– Data is stored non-hierarchically and fully distributed across a large number of independent sites.
– Virtually all scientists want to own and control their data.
– Append-only: written once and never updated.
– Geoscience data can be represented (at a low level) as n-dimensional arrays. These are not stored in databases but in scientific file formats such as HDF and netCDF.
– Data formats are usually chosen by data producers, so data is archived optimally for writing and storage, not for retrieval and analysis.
– Data is often stored with insufficient metadata for non-expert users.
– Geoscience data can come from a large variety of sources: ground observatories, mobile stations, sensor networks, aerial observers, simulation models, etc.
– The data for a given location may have different resolutions, sample rates, perspectives, or coordinate systems, and therefore must be transformed, regridded, aligned, and otherwise unified before it can be analyzed.

National Center for Atmospheric Research (NCAR) Data Archives

NCAR Mass Storage: 8 PB
– Growing at 2.5 PB/year
– By 2012: 5 PB/year
– Tape silo with 100 TB disk cache
NCAR Research Data Archive: 55 TB
– High-quality observations and model output
– 55 TB / 600 datasets ≈ 100 GB / dataset

NASA ESDIS: Earth Science Data and Information System

EOSDIS metrics (Oct. 1, 2008 to Sept. 30, 2009):
– Unique data sets: > 4,000
– Distinct users of EOSDIS data and services: > 910K
– Web site visits: > 1M
– Average archive growth: 1.8 TB/day
– Total archive volume: 4.2 PB
– End user distribution products: > 254M
– End user average distribution volume: 6.7 TB/day
Satellite observational data:
– 4.2 PB / 4,000 datasets ≈ 1 TB / dataset
– 4.2 PB / 175 million files ≈ 24 MB / file

CLASS (NOAA) data volumes

– CLASS currently holds about 30 PB of data
– Projected to grow to 100 PB by 2015 and 160 PB by 2020

Dataset Size / Heterogeneity

Climate Model Intercomparison (PCMDI) project for IPCC AR4 (2006/7): 35 TB, ~20 climate models
– 35 TB / 78,000 files ≈ 450 MB / file
– Stored in netCDF with CF Conventions
NASA's Global Change Master Directory (GCMD) holds more than 25,000 Earth science data set and service descriptions.

Earth Science Data Archive Current Practices

Raw data is processed into an archive or exchange format
– General purpose: netCDF / HDF
– Special purpose: e.g., in meteorology, WMO's GRIB and BUFR
Interest is in this archive data, not the raw data
A dataset is a collection of homogeneous files
– Common metadata, single schema (approximately)
– Granule = single file, partitioned by time
– Effectively append-only
Near-real-time archives allow file appending
Rolling archives keep, e.g., the most recent 30 days
Very diverse
– Big mandated archives: NOAA, NASA
– Many others: DOE, USGS, EPA, NCAR, universities, etc.

Earth Science Data Archive Current Practices (continued)

Search metadata may be put into an RDBMS; the data itself is not (with some exceptions)
Data may be online, or nearline in tertiary storage
Most data is transferred as files in a batch service
– Place an order, get the data later
– May have a subsetting / aggregation service
– May have a file format translation service (hard)
– May have a regridding service (very hard)
Starting to develop online web services
– Open Geospatial Consortium (OGC) protocols / ISO-191xx data models
– Community standard protocols, e.g., OPeNDAP in ocean and atmospheric sciences
– Synchronous, assumes data is online
Processing
– Some standard operators: statistics, regridding
– Algebra / calculus to create derived fields
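As a rough illustration of the OPeNDAP-style access mentioned above, the sketch below uses the netCDF-Java library to open a remote dataset and request only a subset, so the server transfers just the requested section. The URL and the variable name "sst" are hypothetical placeholders, not a real service.

    import ucar.ma2.Array;
    import ucar.nc2.Variable;
    import ucar.nc2.dataset.NetcdfDataset;

    public class RemoteSubset {
      public static void main(String[] args) throws Exception {
        // Hypothetical OPeNDAP endpoint; replace with a real THREDDS/OPeNDAP URL.
        String url = "http://example.org/thredds/dodsC/some/dataset.nc";
        NetcdfDataset ncd = NetcdfDataset.openDataset(url);
        try {
          Variable sst = ncd.findVariable("sst");         // hypothetical variable name
          // Section spec "start:end[:stride]" per dimension; only this window is read.
          Array subset = sst.read("0,100:200,300:400");   // time 0, a y/x window
          System.out.println("read " + subset.getSize() + " values");
        } finally {
          ncd.close();
        }
      }
    }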

General Purpose Scientific Data File Formats in Earth Science

NetCDF (Unidata) / HDF (NCSA)
– Persistent Fortran 77 / 90 arrays
– Arbitrary key/value attributes
Multidimensional arrays
– Regular / rectangular (think Fortran)
– Ragged (a bit of a poor cousin)
– Tiled / compressed (performance)
Language API bindings
– Efficient strided array subsetting
– Procedural, file-at-a-time
– Some higher-level tools for processing sets of files
Machine / OS / language independent
– Solved the syntactical problem of data access
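A minimal sketch of this kind of API, using the netCDF-Java library: open a file, locate a variable, and read an efficient strided subset. The file name example.nc, the variable name temperature, and its dimension sizes are assumptions for illustration only.

    import ucar.ma2.Array;
    import ucar.nc2.NetcdfFile;
    import ucar.nc2.Variable;

    public class StridedRead {
      public static void main(String[] args) throws Exception {
        NetcdfFile ncfile = NetcdfFile.open("example.nc");   // hypothetical file
        try {
          Variable temp = ncfile.findVariable("temperature"); // hypothetical variable
          System.out.println("shape = " + java.util.Arrays.toString(temp.getShape()));

          // Strided subsetting: every other point in y and x at the first time step.
          // Only the requested elements are read from disk.
          Array data = temp.read("0,0:179:2,0:359:2");
          System.out.println("read " + data.getSize() + " values");
        } finally {
          ncfile.close();
        }
      }
    }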

Data Semantics are hard

Semantics are typically stored in key/value attributes within the files
Datasets define attribute conventions
– Human-readable documents (e.g., the CF Conventions)
– Sometimes with a software API (e.g., HDF-EOS)
– Sometimes you just have to know what it means
Sometimes there are no semantics in the file at all
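In practice a program can only discover these semantics by inspecting the attributes themselves. A hedged sketch with netCDF-Java, assuming a hypothetical file example.nc and CF-style attribute names (Conventions, standard_name, units); files following other conventions may use none of these.

    import ucar.nc2.Attribute;
    import ucar.nc2.NetcdfFile;
    import ucar.nc2.Variable;

    public class InspectSemantics {
      public static void main(String[] args) throws Exception {
        NetcdfFile ncfile = NetcdfFile.open("example.nc");   // hypothetical file
        try {
          // Which attribute convention (if any) does this file claim to follow?
          Attribute conv = ncfile.findGlobalAttribute("Conventions");
          System.out.println("Conventions = "
              + (conv == null ? "none declared" : conv.getStringValue()));

          // CF-style semantics live in per-variable attributes.
          for (Variable v : ncfile.getVariables()) {
            Attribute stdName = v.findAttribute("standard_name");
            Attribute units = v.findAttribute("units");
            System.out.printf("%s: standard_name=%s units=%s%n",
                v.getFullName(),
                stdName == null ? "?" : stdName.getStringValue(),
                units == null ? "?" : units.getStringValue());
          }
        } finally {
          ncfile.close();
        }
      }
    }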

What to do about heterogeneity?

1. Rewrite into a common form / database
– Caveat: must save the original data
2. Leave the data in the original file formats
– Develop decoders to a common data model

Unidata's Approach

Virtual Dataset
– Collection of many files
– Hide the details of the file partitioning
– Provide remote access
Efficient subsetting in space/time
– Let user programs work in coordinate space
– Handle the mapping to array indices
Define a small set of scientific feature types
– A dataset is a collection of objects, not arrays
– Necessary to abstract the details of array storage
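A sketch of coordinate-space subsetting with the netCDF-Java grid API: the program asks for a latitude/longitude location and the library maps it to array indices. The dataset location example.ncml and the grid name temperature are hypothetical, as are the assumptions that the grid has time and vertical dimensions.

    import ucar.ma2.Array;
    import ucar.nc2.dt.GridCoordSystem;
    import ucar.nc2.dt.GridDatatype;
    import ucar.nc2.dt.grid.GridDataset;

    public class CoordinateSpaceRead {
      public static void main(String[] args) throws Exception {
        // Hypothetical location; an NcML file could describe a virtual dataset
        // aggregated from many physical files.
        GridDataset gds = GridDataset.open("example.ncml");
        try {
          GridDatatype grid = gds.findGridDatatype("temperature"); // hypothetical grid
          GridCoordSystem gcs = grid.getCoordinateSystem();

          // Ask in coordinate space (lat/lon); the library finds the array indices.
          int[] xy = gcs.findXYindexFromLatLon(40.0, -105.0, null);

          // Read the value at that grid cell for the first time and level.
          Array value = grid.readDataSlice(0, 0, xy[1], xy[0]);
          System.out.println("value = " + value);
        } finally {
          gds.close();
        }
      }
    }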

Unidata's Common Data Model

Three layers:
– Scientific Feature Types (Grid, Point, Radial, Trajectory, Swath, StationProfile): objects; how the user sees the data
– Coordinate Systems: georeferencing, topology
– Data Access (netCDF-3, HDF5, OPeNDAP, BUFR, GRIB1, GRIB2, NEXRAD, NIDS, McIDAS, GEMPAK, GINI, DMSP, HDF4, HDF-EOS, DORADE, GTOPO, ASCII): storage format, multidimensional arrays
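As a rough illustration of the feature-type layer, the netCDF-Java feature API can probe a dataset and report which scientific feature type it maps to. The location example.nc is a placeholder; it could equally be an OPeNDAP URL.

    import java.util.Formatter;
    import ucar.nc2.constants.FeatureType;
    import ucar.nc2.ft.FeatureDataset;
    import ucar.nc2.ft.FeatureDatasetFactoryManager;

    public class FeatureTypeProbe {
      public static void main(String[] args) throws Exception {
        Formatter errlog = new Formatter();
        // Hypothetical location; ask for ANY feature type and let the CDM decide.
        FeatureDataset fd = FeatureDatasetFactoryManager.open(
            FeatureType.ANY, "example.nc", null, errlog);
        if (fd == null) {
          System.out.println("not a recognized feature type: " + errlog);
          return;
        }
        try {
          // Reports the feature type (GRID, POINT, RADIAL, TRAJECTORY, SWATH, ...).
          System.out.println("feature type = " + fd.getFeatureType());
        } finally {
          fd.close();
        }
      }
    }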

Geoscience Data Summary

Many important geoscience datasets: ~10,000 (?)
– Unique metadata / semantics
Stored in append-only file collections
– Time partitioned
– Optional metadata indexing
Three levels
– Storage format: multidimensional arrays
– Coordinate systems: space / time georeferencing, topology
– Objects: forecast model run, radar sweep, satellite swath image, vertical profile of the atmosphere, time series of surface observations, collection of lightning strikes, autonomous underwater vehicle (AUV) trajectories, etc.