National Center for Atmospheric Research


1 National Center for Atmospheric Research
A New Parallel Python Tool for the Standardization of Earth System Model Data
Kevin Paul, Sheri Mickelson, John M. Dennis
National Center for Atmospheric Research
Application Scalability And Performance (ASAP)
I/O & Workflow Applications (IOWA)

2 Big Data in Earth System Modeling
NCAR’s Community Earth System Model (CESM):
  Massively parallel (MPI-based)
  Higher-resolution simulations … “Big Data”!
Coupled Model Intercomparison Project:
  CMIP5 ( ): 20 different institutions from around the world!
    CESM: 2.5 PB generated → 170 TB published
    “First to produce, last to publish!”
  CMIP6 ( ): EXPECT: Up to 25 PB generated from CESM alone!
A New Parallel Python Tool for Standardization of ESM Data

3 CESM Publication Workflow
Workflow: CESM Run → (Time Slices) → Time-Series Generation → (Time Series) → Standardization → Publication
Time-Slice Format (synoptic or history-file):
  All variables from the model in every file, spanning a short time
Time-Series Format (single-field):
  Each variable from the model in a single file, spanning a long time
  Better for analysis!
Standardization: Produce the data requested by the MIP
  Standard variable names, standard dimension names, etc.
Publication: Obtaining a DOI, “tagging” data, publishing the data location, etc.
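The difference between the two formats can be sketched in a few lines of plain Python. This is only an illustration of the regrouping idea, not PyReshaper's implementation; the variable names (T, PS), the values, and the dictionary layout are assumptions made for the example.

```python
from collections import defaultdict

def slices_to_series(time_slices):
    """Regroup per-time-step records into one series per variable."""
    series = defaultdict(lambda: {"time": [], "values": []})
    for time_slice in time_slices:
        for name, value in time_slice["fields"].items():
            series[name]["time"].append(time_slice["time"])
            series[name]["values"].append(value)
    return dict(series)

# Hypothetical time-slice output: every variable appears in every slice.
time_slices = [
    {"time": 0, "fields": {"T": 280.0, "PS": 101300.0}},
    {"time": 1, "fields": {"T": 281.5, "PS": 101250.0}},
    {"time": 2, "fields": {"T": 279.8, "PS": 101400.0}},
]

# Time-series layout: one entry per variable, spanning all time steps.
time_series = slices_to_series(time_slices)
print(time_series["T"])
```

Each variable ends up in its own container covering the whole time range, which is the layout the slide calls better for analysis.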

4 CESM Publication Workflow
Workflow: CESM Run → (Time Slices) → Time-Series Generation [NCO] → (Time Series) → Standardization [NCO] → Publication
Pre-CMIP6: All performed using the NetCDF Operators (NCO)
  App performed in serial!

5 CESM Publication Workflow
Workflow: CESM Run → (Time Slices) → Time-Series Generation [PyReshaper] → (Time Series) → Standardization [PyConform] → Publication
Pre-CMIP6: All performed using the NetCDF Operators (NCO)
  App performed in serial!
CMIP6: All performed using parallel (MPI) Python tools
  Time-Series Generation → PyReshaper (12x speedup)
  Standardization → PyConform

6 MIP Data Standards
Every MIP requests specific data:
  File formats (e.g., NetCDF4)
  Names of files and directory structure
  File attributes (e.g., institution, MIP name, …)
  Names of dimensions (e.g., lat, lon, …)
  Names of variables (e.g., psl, ta, tas, …)
  Dimensions of variables
  Variable data types (e.g., float, double, …)
  Attributes of variables (e.g., units, …)
  Ranges of time (e.g., 2006 to 2100)
Lots of information! Very complicated!
Could be very different from model output!

7 The Limitation of Standards
MIP Standards say “what” you need
MIP Standards don’t say “how” to get it
Suppose “v” is requested by the MIP. What is its data? v = ?
  Easy case… v = v or v = V
  Also, pretty easy… v = X
  Harder… v = X1 + somefunction(X2 - X3)
Also need a set of rules, or “definitions”, which require special knowledge about the model itself!
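The three cases on this slide can be written as one rule per requested variable. The sketch below is illustrative only: the field values are invented, the output names (v_easy, v_other, v_hard) are hypothetical, and math.sqrt merely stands in for the unspecified "somefunction".

```python
import math

# Hypothetical model output fields (names follow the slide, values are invented).
model_output = {
    "V": [1.0, 2.0],
    "X": [3.0, 4.0],
    "X1": [1.0, 1.0],
    "X2": [10.0, 17.0],
    "X3": [1.0, 1.0],
}

def elementwise(func, *fields):
    """Apply func to corresponding elements of equally long fields."""
    return [func(*values) for values in zip(*fields)]

# One "definition" per requested variable: a rule that needs model knowledge.
definitions = {
    "v_easy": lambda m: m["V"],        # v = V: only the name differs
    "v_other": lambda m: m["X"],       # v = X: a differently named field
    "v_hard": lambda m: elementwise(   # v = X1 + somefunction(X2 - X3)
        lambda x1, x2, x3: x1 + math.sqrt(x2 - x3), m["X1"], m["X2"], m["X3"]
    ),
}

results = {name: rule(model_output) for name, rule in definitions.items()}
print(results["v_hard"])
```

The standard alone only names "v"; the rule on the right-hand side is what has to come from someone who knows the model.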

8 MIP Data Standardization
Standardization Process:
  1. Get the MIP-specific “standards”
  2. Get the model-specific “definitions”
  3. Combine! “standards” + “definitions” = “standardization”
  4. Perform the operations needed by the “standardization”
Phases: Specification, then Computation
PyConform: 2 utilities
  Specification: creates a “standardization file”
  Computation: parallel graph-based computation
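The "combine" step can be pictured as merging two dictionaries keyed by the requested variable name. The sketch below is an assumption about what such a merged record might look like; the field names (units, dimensions, datatype, definition) and the JSON output are illustrative, not PyConform's actual standardization file format.

```python
import json

# Hypothetical MIP "standards": what is requested for each variable.
standards = {
    "x": {"units": "K", "dimensions": ["time", "lat", "lon"], "datatype": "float"},
    "z": {"units": "Pa", "dimensions": ["time", "lat", "lon"], "datatype": "float"},
}

# Hypothetical model "definitions": how each requested variable is computed.
definitions = {"x": "X1 + X2", "z": "Z"}

def combine(standards, definitions):
    """Attach a definition to every requested variable, or fail loudly."""
    missing = sorted(set(standards) - set(definitions))
    if missing:
        raise KeyError(f"no definition for requested variables: {missing}")
    return {
        name: dict(spec, definition=definitions[name])
        for name, spec in standards.items()
    }

standardization = combine(standards, definitions)
print(json.dumps(standardization, indent=2))
```

Failing when a requested variable has no definition reflects the point of the previous slide: the standard cannot be satisfied without model-specific input.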

9 PyConform: Specification
Get the MIP-specific “standards”:
  CMOR (from PCMDI/LLNL)
    No clear separation of “specification” from “computation”
  DreqPy (from UK Met Office)
    Database-like collection of MIP standards
Get the model-specific “definitions”:
  Requires input from the scientists!
  Slow to produce, but no other way?
Combine!
  Serial Python command-line tool
  Creates a single file used as input in…
Inputs shown in the diagram: DreqPy and definitions.txt, which contains:
  x = X1 + X2
  y = X1 - X2
  z = Z
  u = function1(U,V)
  v = function2(U,V)
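Definitions written as "name = expression" lines, like those shown for definitions.txt, can be split into a target, the input variables, and the functions they call. The following sketch uses Python's standard ast module to do that; it is one possible way to read such a file and makes no claim about how PyConform parses it.

```python
import ast

# The definition lines shown on the slide.
DEFINITIONS_TXT = """\
x = X1 + X2
y = X1 - X2
z = Z
u = function1(U,V)
v = function2(U,V)
"""

def parse_definitions(text):
    """Map each target to its expression, input variables, and called functions."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        target, expression = (part.strip() for part in line.split("=", 1))
        tree = ast.parse(expression, mode="eval")
        # Names used in call position are functions, not input variables.
        called = {
            node.func.id
            for node in ast.walk(tree)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
        parsed[target] = {
            "expression": expression,
            "functions": sorted(called),
            "inputs": sorted(names - called),
        }
    return parsed

parsed = parse_definitions(DEFINITIONS_TXT)
for target, info in parsed.items():
    print(target, info["inputs"], info["functions"])
```

Knowing the inputs of each definition is what allows a later step to decide which model fields must be read for each output variable.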

10 PyConform: Computation
[Diagram: data-flow graph for two definitions that write to one file]
“x = X1 + X2”
  Read: X1[i]
  Read: X2[i]
  Evaluate: (X1+X2)[i]
  Map: i→j
  Validate: > minimum, < maximum, dimensions = [j], et cetera
“y = X1 - X2”
  Read: X1[i]
  Read: X2[i]
  Evaluate: (X1-X2)[i]
  Map: i→j
  Validate: > minimum, < maximum, dimensions = [j], et cetera
Write: x[j], y[j] → File
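The node types named on this slide can be sketched as a small pull-based graph in which the final write step asks each chain for its data. Everything here is an assumption made for illustration: the class layout, the ordering Read → Evaluate → Map → Validate → Write, the validation bounds, and the use of plain dictionaries in place of NetCDF files. The sketch is serial and does not show how PyConform distributes work across MPI processes.

```python
import operator

class Read:
    """Source node: reads one variable, indexed by dimension i."""
    def __init__(self, source, name):
        self.source, self.name = source, name

    def pull(self):
        return {"dims": ("i",), "data": list(self.source[self.name])}

class Evaluate:
    """Applies an element-wise function to the data pulled from its inputs."""
    def __init__(self, func, *inputs):
        self.func, self.inputs = func, inputs

    def pull(self):
        pulled = [node.pull() for node in self.inputs]
        columns = zip(*(p["data"] for p in pulled))
        return {"dims": pulled[0]["dims"], "data": [self.func(*c) for c in columns]}

class Map:
    """Renames dimensions, e.g. i -> j."""
    def __init__(self, upstream, renames):
        self.upstream, self.renames = upstream, renames

    def pull(self):
        p = self.upstream.pull()
        dims = tuple(self.renames.get(d, d) for d in p["dims"])
        return {"dims": dims, "data": p["data"]}

class Validate:
    """Checks dimensions and value range before data is written."""
    def __init__(self, upstream, minimum, maximum, dims):
        self.upstream, self.minimum, self.maximum = upstream, minimum, maximum
        self.dims = dims

    def pull(self):
        p = self.upstream.pull()
        if p["dims"] != self.dims:
            raise ValueError(f"expected dims {self.dims}, got {p['dims']}")
        if not all(self.minimum < v < self.maximum for v in p["data"]):
            raise ValueError("value outside (minimum, maximum)")
        return p

def write(output, **variables):
    """Sink: pulls each variable through its chain and stores the result."""
    for name, node in variables.items():
        output[name] = node.pull()["data"]

def chain(func, source):
    """Build Read -> Evaluate -> Map -> Validate for func(X1, X2)."""
    evaluated = Evaluate(func, Read(source, "X1"), Read(source, "X2"))
    mapped = Map(evaluated, {"i": "j"})
    return Validate(mapped, minimum=-100.0, maximum=100.0, dims=("j",))

# Hypothetical input "file" holding the two model fields.
input_file = {"X1": [1.0, 2.0, 3.0], "X2": [0.5, 0.5, 0.5]}
output_file = {}
write(output_file, x=chain(operator.add, input_file), y=chain(operator.sub, input_file))
print(output_file)
```

Because each output variable has its own chain, the chains are independent of one another, which is what makes this kind of graph a natural candidate for parallel execution.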

11 Evaluate Nodes: (Almost) anything can be done!
All data passed along the “Data Flow” has…
  Units → allows for implicit conversion!
  Dimensions → allows for implicit transposition!
Basic element-wise arithmetic, e.g., +, -, *, /, **
Element-wise math functions, e.g., sqrt(x)
User-defined functions:
  Subclass from PyConform’s Function class
  Input: numbers, strings, variable-names, …
  Many already written for testing: e.g., vinth2p(T,…), invert_dim(T,"dim")
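Carrying units and dimension names with the data is what makes implicit conversion and transposition possible. The sketch below illustrates the idea with a small hypothetical Field class and a hand-written table of length units; it does not use or reproduce PyConform's own classes, and the data values are invented.

```python
# Hypothetical conversion factors to metres (illustrative, lengths only).
TO_METERS = {"m": 1.0, "cm": 0.01, "km": 1000.0}

class Field:
    """2-D data (nested lists) that carries its units and dimension names."""
    def __init__(self, data, units, dims):
        self.data, self.units, self.dims = data, units, tuple(dims)

    def to_units(self, units):
        """Return a copy expressed in the requested units."""
        factor = TO_METERS[self.units] / TO_METERS[units]
        return Field([[v * factor for v in row] for row in self.data], units, self.dims)

    def to_dims(self, dims):
        """Return a copy whose dimension order matches dims."""
        dims = tuple(dims)
        if dims == self.dims:
            return self
        if dims != self.dims[::-1]:
            raise ValueError(f"cannot reorder {self.dims} into {dims}")
        transposed = [list(column) for column in zip(*self.data)]
        return Field(transposed, self.units, dims)

    def __add__(self, other):
        # Implicit conversion and transposition: align the right operand first.
        other = other.to_units(self.units).to_dims(self.dims)
        summed = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.data, other.data)
        ]
        return Field(summed, self.units, self.dims)

a = Field([[1.0, 2.0], [3.0, 4.0]], "m", ("lat", "lon"))
b = Field([[1.0, 3.0], [2.0, 4.0]], "km", ("lon", "lat"))

total = a + b  # b is converted to metres and transposed to (lat, lon)
print(total.units, total.dims, total.data)
```

The caller writes only a + b; the unit and dimension metadata decide what has to happen to the right operand before the element-wise sum.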

12 PyConform: Testing
CESM Case Name           | CMIP5 Table | Input Dataset Size | Output Dataset Size | Original Serial Runtime | PyConform Parallel Runtime (16 Procs) | Speedup
b40.rcp4_5.1deg.006      | Amon        | 84 GB              | 62 GB               | 4313 sec                | 113 sec                               | 38x
b40.20th.track1.1deg.012 |             | 135 GB             | 102 GB              | 7157 sec                | 456 sec                               | 16x
                         | 3hr         | 540 GB             | 506 GB              | 22267 sec               | 645 sec                               | 34x
Super-linear speedups!
  NOTE: …Serial performance much better
Much more testing to be done!
  Achieving bit-for-bit agreement is time-consuming
  Coming up with the “definitions” can be hard!
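The speedup column can be checked from the two runtime columns: speedup is the original serial runtime divided by the parallel runtime, and dividing that by the 16 processes gives a parallel efficiency, where a value above 1 corresponds to a super-linear speedup. The short sketch below does this arithmetic for the three rows; the efficiency figure is an added illustration rather than something reported on the slide, and the baseline is the original serial workflow rather than PyConform on one process.

```python
PROCESSES = 16

# (original serial runtime, PyConform parallel runtime) in seconds, per table row.
runtimes = [(4313, 113), (7157, 456), (22267, 645)]

results = []
for serial, parallel in runtimes:
    speedup = serial / parallel
    efficiency = speedup / PROCESSES  # > 1 means super-linear on 16 processes
    results.append((speedup, efficiency))
    print(f"{serial:>6} s / {parallel:>4} s = {speedup:5.1f}x, efficiency {efficiency:.2f}")
```

By this calculation the first and third rows are well above 16x, while the second row sits close to 16x, that is, roughly linear on 16 processes.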

13 Thank you! Questions?

