National Center for Atmospheric Research

Presentation transcript:

A New Parallel Python Tool for the Standardization of Earth System Model Data
Kevin Paul, Sheri Mickelson, John M. Dennis
National Center for Atmospheric Research
Application Scalability And Performance (ASAP), I/O & Workflow Applications (IOWA)

Big Data in Earth System Modeling
- NCAR's Community Earth System Model (CESM): massively parallel (MPI-based), with higher-resolution simulations ... "Big Data"!
- Coupled Model Intercomparison Project:
  - CMIP5 (2010-2013): 20 different institutions from around the world!
    - CESM: 2.5 PB generated -> 170 TB published
    - "First to produce, last to publish!"
  - CMIP6 (2017-2020): expect up to 25 PB generated from CESM alone!

CESM Publication Workflow
CESM Run -> Time Slices -> Time-Series Generation -> Time Series -> Standardization -> Publication
- Time-Slice Format (synoptic, or "history-file"): all variables from the model in every file, spanning a short time.
- Time-Series Format (single-field): each variable from the model in a single file, spanning a long time. Better for analysis!
- Standardization: produce the data requested by the MIP (standard variable names, standard dimension names, etc.).
- Publication: obtaining a DOI, "tagging" data, publishing the data location, etc.
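The time-slice to time-series reorganization can be sketched in miniature with toy in-memory data (hypothetical values; the real tools operate on NetCDF files):

```python
# Toy illustration of the time-slice -> time-series reorganization.
# Each time-slice "file" holds every variable for one time step;
# each time-series "file" holds one variable for every time step.
# (Hypothetical in-memory data; PyReshaper does this for NetCDF files.)
slices = [
    {"T": 1.0, "U": 2.0},  # time step 0: all variables
    {"T": 1.1, "U": 2.1},  # time step 1: all variables
]
series = {name: [s[name] for s in slices] for name in slices[0]}
# series == {"T": [1.0, 1.1], "U": [2.0, 2.1]}
```

The same total data is written either way; the layout simply determines whether an analysis that reads one variable over many years must open every file or just one.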

CESM Publication Workflow (Pre-CMIP6)
- Both time-series generation and standardization were performed using the NetCDF Operators (NCO).
- Each application ran in serial!

CESM Publication Workflow (CMIP6)
- Pre-CMIP6: all performed using the NetCDF Operators (NCO), each application run in serial.
- CMIP6: all performed using parallel (MPI) Python tools:
  - Time-Series Generation -> PyReshaper (12x speedup)
  - Standardization -> PyConform

MIP Data Standards
Every MIP requests specific data:
- File formats (e.g., NetCDF4)
- Names of files and the directory structure
- File attributes (e.g., institution, MIP name, ...)
- Names of dimensions (e.g., lat, lon, ...)
- Names of variables (e.g., psl, ta, tas, ...)
- Dimensions of variables
- Variable data types (e.g., float, double, ...)
- Attributes of variables (e.g., units, ...)
- Ranges of time (e.g., 2006 to 2100)
That is a lot of information, it is very complicated, and it can be very different from the model output!
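As a concrete illustration, the standard for a single requested variable could be captured as a small record like the one below. The field names here are hypothetical, chosen only to mirror the bullet list above; the actual MIP data request uses its own, much richer schema.

```python
# Hypothetical record for one MIP-requested variable (illustrative only;
# the real data request is far richer and uses its own schema).
tas_standard = {
    "name": "tas",
    "type": "float",
    "dimensions": ("time", "lat", "lon"),
    "attributes": {"units": "K", "standard_name": "air_temperature"},
    "time_range": (2006, 2100),
}
```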

The Limitation of Standards
- MIP standards say "what" you need; they don't say "how" to get it.
- Suppose "v" is requested by the MIP. What is its data?
  - Easy case: v = v, or v = V
  - Also pretty easy: v = X
  - Harder: v = X1 + somefunction(X2 - X3)
- We also need a set of rules, or "definitions", which require special knowledge about the model itself!
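Such "definitions" amount to a mapping from each requested MIP variable to an expression over raw model output. A minimal sketch, using toy values and two commonly cited CESM-to-CMIP mappings (TREFHT for tas; PRECC + PRECL for pr) with unit conversions omitted:

```python
# Hypothetical "definitions" mapping from MIP variable names to expressions
# over raw CESM output fields; toy values, unit conversions omitted.
definitions = {
    "tas": lambda m: m["TREFHT"],              # easy case: a rename
    "pr":  lambda m: m["PRECC"] + m["PRECL"],  # harder: a computed sum
}
model_output = {"TREFHT": 288.15, "PRECC": 1.0, "PRECL": 2.0}
standardized = {name: fn(model_output) for name, fn in definitions.items()}
# standardized == {"tas": 288.15, "pr": 3.0}
```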

MIP Data Standardization
The standardization process:
1. Get the MIP-specific "standards"
2. Get the model-specific "definitions"
3. Combine! "standards" + "definitions" = "standardization"
4. Perform the operations needed by the "standardization"
PyConform splits this into two utilities:
- Specification: creates a "standardization file"
- Computation: parallel graph-based computation
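The "combine" step can be pictured as pairing each requested variable with its definition, and flagging any variable the scientists have not yet defined. A minimal sketch with hypothetical data structures:

```python
# Sketch of the "combine" step: every MIP-requested variable must have a
# model-specific definition before standardization can proceed.
# (Hypothetical dict-based structures, not PyConform's internal format.)
standards = {"tas": {"units": "K"}, "pr": {"units": "kg m-2 s-1"}}
definitions = {"tas": "TREFHT", "pr": "PRECC + PRECL"}

missing = [v for v in standards if v not in definitions]
assert not missing, f"no definition for: {missing}"

standardization = {
    v: {"standard": standards[v], "definition": definitions[v]}
    for v in standards
}
```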

PyConform: Specification
- Get the MIP-specific "standards":
  - CMOR (from PCMDI/LLNL): no clear separation of "specification" from "computation"
  - DreqPy (from the UK Met Office): a database-like collection of MIP standards
- Get the model-specific "definitions":
  - Requires input from the scientists! Slow to produce, but there may be no other way.
- Combine!
  - A serial Python command-line tool creates a single file, used as input to the Computation utility.
Example definitions.txt:
  x = X1 + X2
  y = X1 - X2
  z = Z
  u = function1(U,V)
  v = function2(U,V)
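A definitions file of "name = expression" lines is simple to parse. The sketch below is illustrative only; PyConform's actual file format and parser may differ:

```python
# Minimal parser for definitions-file lines of the form "name = expression"
# (illustrative; PyConform's actual file format may differ).
def parse_definitions(text):
    defs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, expr = line.split("=", 1)
        defs[name.strip()] = expr.strip()
    return defs

defs = parse_definitions("x = X1 + X2\ny = X1 - X2\nu = function1(U,V)")
# defs == {"x": "X1 + X2", "y": "X1 - X2", "u": "function1(U,V)"}
```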

PyConform: Computation
Each requested variable becomes a small data-flow graph. For "x = X1 + X2":
- Read: X1[i] and X2[i]
- Validate: > minimum, < maximum, dimensions = [j], et cetera
- Evaluate: (X1 + X2)[i]
- Map: i -> j
- Write: x[j] to the output file
Likewise, "y = X1 - X2" reads X1[i] and X2[i], evaluates (X1 - X2)[i], maps i -> j, and writes y[j] to the file.
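The graph idea can be sketched with a toy node class: each node pulls results from its input nodes and applies its function, mirroring the Read -> Evaluate flow above. This is an assumption-laden miniature, not PyConform's actual implementation:

```python
# Minimal sketch of a data-flow graph node (not PyConform's real classes):
# each node pulls from its inputs and applies a function.
class Node:
    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs

    def __call__(self):
        # Recursively evaluate upstream nodes, then apply this node's function.
        return self.func(*(node() for node in self.inputs))

read_x1 = Node(lambda: [1.0, 2.0])  # "Read: X1" (toy data)
read_x2 = Node(lambda: [0.5, 1.5])  # "Read: X2" (toy data)
add = Node(lambda a, b: [p + q for p, q in zip(a, b)], read_x1, read_x2)
# add() == [1.5, 3.5]   ("Evaluate: X1 + X2")
```

Because each variable's graph is independent, the graphs can be distributed across MPI ranks and evaluated in parallel.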

Evaluate Nodes
(Almost) anything can be done! All data passed along the "Data Flow" has:
- Units -> allows for implicit conversion!
- Dimensions -> allows for implicit transposition!
Supported operations:
- Basic element-wise arithmetic, e.g., +, -, *, /, **
- Element-wise math functions, e.g., sqrt(x)
- User-defined functions:
  - Subclass from PyConform's Function class
  - Inputs: numbers, strings, variable names, ...
  - Many already written for testing, e.g., vinth2p(T,...), invert_dim(T,"dim")
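A user-defined function might look like the sketch below. The base-class name Function matches the slide, but its signature here is an assumption, and WindSpeed is a hypothetical example, not one of the functions the slide lists:

```python
import math

# Illustrative stand-in for PyConform's Function base class;
# the class name comes from the slide, the signature is an assumption.
class Function:
    def __call__(self, *args):
        raise NotImplementedError

class WindSpeed(Function):
    """Hypothetical user-defined function: wind speed from U and V."""
    def __call__(self, u, v):
        # Element-wise magnitude sqrt(u**2 + v**2)
        return [math.hypot(a, b) for a, b in zip(u, v)]

# WindSpeed()([3.0], [4.0]) == [5.0]
```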

PyConform: Testing

CESM Case Name             CMIP5 Table  Input Size  Output Size  Serial Runtime  Parallel Runtime (16 Procs)  Speedup
b40.rcp4_5.1deg.006        Amon         84 GB       62 GB        4313 sec        113 sec                      38x
b40.20th.track1.1deg.012                135 GB      102 GB       7157 sec        456 sec                      16x
                           3hr          540 GB      506 GB       22267 sec       645 sec                      34x

Super-linear speedups! (NOTE: ... serial performance much better)
Much more testing to be done:
- Achieving bit-for-bit agreement is time-consuming.
- Coming up with the "definitions" can be hard!
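"Super-linear" here means the measured speedup exceeds the number of processes used. For the first row of the table, a quick check:

```python
# Speedup = serial runtime / parallel runtime; "super-linear" means the
# speedup exceeds the number of processes used (16 here).
serial_sec, parallel_sec, nprocs = 4313, 113, 16  # first table row
speedup = serial_sec / parallel_sec               # roughly 38x
assert speedup > nprocs                           # super-linear
```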

Thank you! Questions?