Slide 1/28: Parallel implementation
Hendrik Tolman, The WAVEWATCH III Team + friends
Marine Modeling and Analysis Branch, NOAA / NWS / NCEP / EMC
WW Winter School 2013 (Version 1.3, Feb. 2013)

Slide 2/28: Outline
Covered in this lecture:
- Compiling the code.
- Running the code.
- Optimizing parallel model implementations.
- Parallel implementation of individual grids (ww3_shel).
- Additional options in ww3_multi.
- Hybrid parallelization.
- Profiling.
- Memory use (+ IO).
- Considerations and pitfalls.
- Future ...

Slide 3/28: Compiling
Parallel implementation of WAVEWATCH III:
- Uses MPI only, but the code (switches) allows for other types of parallel architectures.
  - MPI was the standard before, and is even more so now.
  - A shmem (Cray) style parallel implementation has been considered, but there has never been a need to act on this.
- Not all codes use or can benefit from parallel implementation:
  - The actual wave model codes ww3_shel, ww3_multi, ... run much more efficiently in parallel.
  - The initialization code ww3_strt may need it in some cases for memory reasons (rarely if ever used at NCEP).

Slide 4/28: Compiling
Make sure MPI is implemented correctly:
- The large majority of issues with MPI have been in the MPI implementation itself:
  - Even on our production IBM Power series machines.
  - Hope you have good IT support.

Slide 5/28: Compiling
Compile in several steps:
- Set switches for serial code (SHRD switch).
- Compile serial codes from scratch:
  - Call w3_new to force a complete compile of all routines (not strictly necessary).
  - Call w3_make with or without program names to get the base set of serial codes.
- Set switches for parallel code (DIST and MPI switches).
- Compile parallel codes:
  - Compile selected codes only: w3_make ww3_shel ww3_multi ...
  - This will automatically recompile all subroutines used.
- All of this is done automatically in make_MPI.

Slide 6/28: Running parallel
- Running code in parallel depends largely on the hardware and software of your computer. Generally there is a parallel operations environment:
  - "poe" on IBM systems.
  - "mpirun" or similar on Linux systems.
  - Sometimes the environment needs to be started separately, e.g., mpdboot on the UMD cluster.
  - Check with your system documentation, or with IT support if you have it.
- There are some examples in the test scripts, particularly for ww3_multi and the real-world test cases mww3_test_NN and mww3_case_NN.
- Pitfall: many duplicate output lines mean you are running a serial code in a parallel environment ...

Slide 7/28: Optimization
Three things to consider while optimizing the implementation:
- General code optimization (no further discussion):
  - Compiler options.
  - Switches (2nd order versus 3rd order propagation, etc.).
  - Spectral resolution.
  - Time stepping.
- MPI optimization (no further discussion):
  - Often overlooked, but can be very important on Linux systems.
- Application optimization (see below):
  - Cannot do too much with ww3_shel, but the techniques used there will be shown.
  - Many additional options in ww3_multi.
References:
- Tolman, H. L., 2002: Parallel Computing, 28.
- Tolman, H. L., 2003: MMAB Tech. Note 228, 27 pp.

Slide 8/28: ww3_shel
[Figure: discretization in physical space and in spectral space (directions and frequencies).]
The prognostic variable is the spectral wave energy density as a function of spatial and spectral coordinates and of time.

Slide 9/28: ww3_shel
- Propagation:
  - By definition linear; nonlinear corrections possible.
  - Covers all dimensions.
- Physics:
  - Wave growth and decay due to external factors:
    - wind-wave interactions,
    - wave-wave interactions,
    - dissipation.
  - Local in physical space.
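For reference (not spelled out on the slide), this propagation/physics split corresponds to the terms of the spectral balance equation integrated by WAVEWATCH III; schematically, for the action density spectrum $N$,

$$\frac{\partial N}{\partial t}
 + \nabla_x \cdot (\dot{\mathbf{x}}\, N)
 + \frac{\partial}{\partial k}\left(\dot{k}\, N\right)
 + \frac{\partial}{\partial \theta}\left(\dot{\theta}\, N\right)
 = \frac{S}{\sigma},$$

where the left-hand side holds the (linear) propagation terms in physical and spectral space and $S$ collects the source terms (wind input, nonlinear wave-wave interactions, dissipation). See the WAVEWATCH III manual for the exact form used.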

Slide 10/28: ww3_shel
- Time splitting / fractional steps.
- Separate treatment of:
  - physics (local),
  - local propagation effects (change of direction or frequency),
  - spatial propagation.
- Each step consecutively operates on small subsets of the data.
- Entire model in core; memory requirements are less than twice those of storing a single state.
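A minimal sketch of such a fractional-step loop is given below. The program and routine names are invented for illustration and are not WAVEWATCH III source code; the sketch only shows how the three sub-steps are applied consecutively within each global time step.

    ! Schematic fractional-step (time-splitting) loop; illustrative only.
    program fractional_steps
       implicit none
       integer, parameter :: nt = 4          ! number of global time steps
       integer            :: it

       do it = 1, nt
          call step_source_terms ()          ! physics: local in physical space
          call step_intra_spectral ()        ! change of direction / frequency: local
          call step_spatial_propagation ()   ! advection in physical space
       end do

    contains
       subroutine step_source_terms ()
          print *, 'integrate source terms (wind input, wave-wave interactions, dissipation)'
       end subroutine
       subroutine step_intra_spectral ()
          print *, 'propagate in spectral space (directions, frequencies)'
       end subroutine
       subroutine step_spatial_propagation ()
          print *, 'propagate in physical space'
       end subroutine
    end program fractional_steps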

Slide 11/28: ww3_shel
The physics involved suggests that grid points, rather than spectral components, should be distributed over the processors, particularly given the time splitting and source term integration techniques used in WAVEWATCH III.
[Figure: the two parallelization layouts, "blocking" and "scattering".]

Slide 12/28: ww3_shel
- Blocking:
  - Only data at block boundaries needed.
  - Total amount of data communicated is a function of the number of processes.
  - Algorithm depends on the actual propagation scheme.
- Scattering:
  - Full data transpose needed.
  - Total amount of data communicated is nearly constant.
  - Algorithm independent of the propagation scheme.
  - Load balancing easier.

Slide 13/28: ww3_shel
For WAVEWATCH III (ww3_shel) the scattering method is used because of:
- Compatibility with previous versions.
- Maximum flexibility and transparency of the code (future physics and numerics developments).
- Feasibility, based on estimates of the amount of communication needed.
MPI is used for portability.
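To make the scattering approach concrete, the sketch below spreads sea points round-robin over MPI processes so that each process holds roughly NSEA/NAPROC spectra for the local physics step. The mapping, the names, and the sea point count (borrowed from the test case on slide 15) are illustrative assumptions, not WAVEWATCH III code.

    ! Round-robin distribution of sea points over MPI processes ("scattering");
    ! illustrative only.
    program scatter_points
       use mpi
       implicit none
       integer, parameter :: nsea = 30030      ! sea points (slide 15 test case)
       integer :: naproc, iaproc, ierr, jsea, isea, nloc

       call MPI_Init ( ierr )
       call MPI_Comm_size ( MPI_COMM_WORLD, naproc, ierr )
       call MPI_Comm_rank ( MPI_COMM_WORLD, iaproc, ierr )

       ! Local sea point jsea = 1..nloc maps to global sea point isea.
       nloc = ( nsea - iaproc - 1 ) / naproc + 1
       do jsea = 1, nloc
          isea = iaproc + 1 + ( jsea - 1 ) * naproc   ! round-robin mapping
          ! ... integrate source terms for global sea point isea ...
       end do

       if ( iaproc == 0 ) print *, 'sea points per process (approx.):', nsea / naproc
       call MPI_Finalize ( ierr )
    end program scatter_points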

Slide 14/28: ww3_shel
- Standard optimization techniques:
  - Non-blocking communication.
  - Persistent communication, mainly for transparency of the code.
- Buffering of the data during the transpose:
  - Avoids idle CPU cycles during communication operations.
  - Communication is "hidden" behind calculations.
[Figure: during the transpose, processors scatter/gather their native data into a buffer on the target processor, where a 1-D array holding a given spectral component for all sea points is converted to the corresponding 2-D spatial wave field and propagation is calculated.]
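A minimal sketch of the persistent, non-blocking communication pattern named above: the exchange is started before, and completed after, a block of local work so it can be hidden behind calculations. The ring exchange, buffer size, and tag are assumptions for illustration; this is not the WAVEWATCH III transpose itself.

    ! Persistent, non-blocking communication overlapped with computation;
    ! illustrative only.
    program persistent_comm
       use mpi
       implicit none
       integer, parameter :: n = 1000
       real    :: sendbuf(n), recvbuf(n)
       integer :: rank, nproc, ierr, dest, src
       integer :: req(2), stat(MPI_STATUS_SIZE, 2)

       call MPI_Init ( ierr )
       call MPI_Comm_rank ( MPI_COMM_WORLD, rank, ierr )
       call MPI_Comm_size ( MPI_COMM_WORLD, nproc, ierr )
       dest = mod ( rank + 1, nproc )           ! simple ring exchange
       src  = mod ( rank - 1 + nproc, nproc )

       ! Set up persistent requests once; cheap to reuse every time step.
       call MPI_Send_init ( sendbuf, n, MPI_REAL, dest, 0, MPI_COMM_WORLD, req(1), ierr )
       call MPI_Recv_init ( recvbuf, n, MPI_REAL, src,  0, MPI_COMM_WORLD, req(2), ierr )

       sendbuf = real ( rank )
       call MPI_Startall ( 2, req, ierr )       ! start the communication
       ! ... do local propagation work here, "hiding" the communication ...
       call MPI_Waitall ( 2, req, stat, ierr )  ! complete before using recvbuf

       call MPI_Request_free ( req(1), ierr )
       call MPI_Request_free ( req(2), ierr )
       call MPI_Finalize ( ierr )
    end program persistent_comm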

Slide 15/28: ww3_shel
- Performance test case:
  - Global model version with 288 x 157 spatial grid points (30,030 sea points) and 600 spectral components.
  - 12 hour hindcast, 3 day forecast.
- Resources:
  - IBM RS/6000 SP with Winterhawk nodes with 2 processors and 500 MB of memory, and an NFS or GPFS file system.
  - AIX version 4.3, xlf version 6.
- Not always applicable to Linux ...
- MPI setup impact ...

Slide 16/28: NFS
[Figure: scaling results on the NFS file system. Red: calculation only. Blue: model without output. Green: full model. Dashed and dotted lines: efficiency required to match a Cray C90 with 1 or 12 CPUs.]

Slide 17/28: GPFS
[Figure: scaling results on the GPFS file system. Red: calculation only. Blue: model without output. Dashed and dotted lines: efficiency required to match a Cray C90 with 1 or 12 CPUs.]

Slide 18/28: ww3_multi
For the mosaic approach there are other optimization options:
- Splitting a grid into overlapping domains:
  - Better local CFL time steps.
  - Hybrid domain decomposition / data transpose.
  - Presently being researched at NCEP.
- Running grids of the same rank side-by-side on parts of the communicator:
  - Localizes communications.
  - Amdahl's law generally favors running grids side-by-side on a smaller number of processors over running them in sequence on a larger number of processors.

Slide 19/28: ww3_multi
Splitting the communicator:
- Example from mww3_test_03, running three overlapping low resolution grids with an overlay of three overlapping high resolution grids.
  - The example runs low1-3 serially on the entire communicator.
  - The example runs hgh1-3 side-by-side on fractions of the communicator.
  - Output can also go to dedicated processors.
From the mww3_test_03 ww3_multi input file:
  $
  'low1' 'no' 'no' 'no' 'no' 'no' 'no' 'no' F
  'low2' 'no' 'no' 'no' 'no' 'no' 'no' 'no' F
  'low3' 'no' 'no' 'no' 'no' 'no' 'no' 'no' F
  $
  'hgh1' 'no' 'no' 'no' 'no' 'no' 'no' 'no' F
  'hgh2' 'no' 'no' 'no' 'no' 'no' 'no' 'no' F
  'hgh3' 'no' 'no' 'no' 'no' 'no' 'no' 'no' F
  $
Note: identical fractions = non-overlapping communicators.
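The mechanism behind running grids side-by-side on parts of the communicator can be illustrated with MPI_Comm_split. The sketch below is a generic example that splits MPI_COMM_WORLD into three equal sub-communicators; the number of groups and the split by thirds are assumptions, and none of this is taken from ww3_multi.

    ! Splitting a communicator so groups of processes run different grids
    ! side-by-side; illustrative only.
    program split_comm
       use mpi
       implicit none
       integer :: world_rank, world_size, grid_comm, grid_rank, color, ierr

       call MPI_Init ( ierr )
       call MPI_Comm_rank ( MPI_COMM_WORLD, world_rank, ierr )
       call MPI_Comm_size ( MPI_COMM_WORLD, world_size, ierr )

       ! Assign each process to one of three "grids" by thirds of the
       ! communicator (the fractions are the assumption here).
       color = ( 3 * world_rank ) / world_size          ! 0, 1 or 2

       call MPI_Comm_split ( MPI_COMM_WORLD, color, world_rank, grid_comm, ierr )
       call MPI_Comm_rank  ( grid_comm, grid_rank, ierr )

       print '(a,i4,a,i2,a,i4)', 'world rank ', world_rank, &
             ' works on grid ', color + 1, ' as local rank ', grid_rank

       call MPI_Comm_free ( grid_comm, ierr )
       call MPI_Finalize ( ierr )
    end program split_comm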

Slide 20/28: ww3_multi
[Log file excerpts: two WAVEWATCH III version 4.08 multi-grid model driver log files (2013/01/04), one from a run with and one from a run without a dedicated output processor. Each shows the "Group information" table (grids and their part of the communicator) and the "Resource assignment (processes)" table (columns: grid, comp., grd, pnt, trk, rst, bpt, prt) for the low and hgh grids.]

Slide 21/28: ww3_multi
Grid-level profiling:
- Compile under MPI with the MPRF switch on.
- Run a short piece of the model, generating profiling data sets.
- Run the GrADS script profile.gs to visualize.
Example of NCEP's original multi-grid wave model on the next slide:
- 8 grids.
- 360 processors.
- Dedicated I/O processors.

Slide 22/28: ww3_multi
[Figure: grid-level profile of the NCEP "multi_1" global model on the IBM, ca. 2008, showing processor assignment over time for the global, regional, and coastal grids, the coupling, and the output.]

Slide 23/28: Memory use + IO
When running a grid on NAPROC processors, each processor stores:
- NSEA/NAPROC spectra. (scaling)
- Output fields:
  - Version 4.08 and earlier: full fields. (not scaling)
  - Version 4.09 and later:
    - Sparse output fields (NSEA/NAPROC). (scaling)
    - Full output fields (NSEA) on one processor only, and only the fields selected for output. (not scaling)
- Other outputs:
  - Gathered on one processor, with some buffering to limit local memory use. ("scaling")
- Work arrays, interpolation tables, ... (not scaling)
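As a back-of-the-envelope illustration of the scaling term, the sketch below estimates the per-process memory taken by the local spectra alone, using the test case numbers from slide 15 and assuming 4-byte reals and 64 processes; the formula and numbers are assumptions for illustration, not taken from the code.

    ! Rough per-process memory estimate for the locally stored spectra;
    ! illustrative only (assumes 4-byte reals, ignores work arrays and outputs).
    program memory_estimate
       implicit none
       integer, parameter :: nsea   = 30030     ! sea points (slide 15 test case)
       integer, parameter :: nspec  = 600       ! spectral components
       integer, parameter :: naproc = 64        ! assumed number of processes
       integer            :: nsealoc
       real               :: mbytes

       nsealoc = ( nsea + naproc - 1 ) / naproc          ! ceiling(nsea/naproc)
       mbytes  = real ( nsealoc ) * nspec * 4.0 / 1.0e6  ! 4 bytes per value

       print '(a,i6,a,f8.2,a)', 'local spectra: ', nsealoc, &
             ' points, approx. ', mbytes, ' MB per process'
    end program memory_estimate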

Slide 24/28: Memory use + IO
IO considerations:
- IO server type IOSTYP in ww3_shel and ww3_multi:
  - 0: No IO server process; direct-access writes into the restart file. Requires a full parallel file system.
  - 1: No IO server process; each output is done by a single process.
  - 2: Single dedicated IO process for all output.
  - 3: Dedicated IO process for each separate IO type.
- In ww3_multi, all point output can go to a dedicated processor.
[Figure: processes 1 ... naproc each hold sparse spectra and sparse fields; the full fields are held either on one of these processes or on a dedicated IO process.]
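A generic sketch of the dedicated IO process idea behind IOSTYP = 2: compute ranks send their local data to a single rank that does all the writing. The file name, tag, and buffer size are made up for illustration; this shows the pattern only, not the ww3_shel/ww3_multi implementation.

    ! Dedicated-IO-process pattern: the last rank gathers and writes,
    ! all other ranks compute and send; illustrative only.
    program io_server_pattern
       use mpi
       implicit none
       integer, parameter :: n = 100
       real    :: local(n), buffer(n)
       integer :: rank, nproc, io_rank, ip, ierr
       integer :: stat(MPI_STATUS_SIZE)

       call MPI_Init ( ierr )
       call MPI_Comm_rank ( MPI_COMM_WORLD, rank, ierr )
       call MPI_Comm_size ( MPI_COMM_WORLD, nproc, ierr )
       io_rank = nproc - 1                       ! dedicate the last rank to IO

       if ( rank == io_rank ) then
          open ( 10, file = 'output.dat', form = 'unformatted' )
          do ip = 0, nproc - 2                   ! receive from all compute ranks
             call MPI_Recv ( buffer, n, MPI_REAL, ip, 1, MPI_COMM_WORLD, stat, ierr )
             write ( 10 ) buffer
          end do
          close ( 10 )
       else
          local = real ( rank )                  ! stand-in for computed fields
          call MPI_Send ( local, n, MPI_REAL, io_rank, 1, MPI_COMM_WORLD, ierr )
          ! ... compute ranks continue with the next time step ...
       end if

       call MPI_Finalize ( ierr )
    end program io_server_pattern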

Slide 25/28: Memory use + IO
IO considerations:
- Use the IO server to manage memory use as well as for faster IO.
  - Combining this with smart placement on nodes (e.g., fewer processes on the node that does IO) leaves much flexibility for loading large grids efficiently.
- Use overlapping grids:
  - Each grid has much smaller full field arrays.
  - Stitch the results together later with ww3_gint.

Slide 26/28: ww3_multi
Considerations and pitfalls:
- Intra-node and across-node communications are very different:
  - Keeping a grid on a single node may be important.
- Scaling on different systems is very different:
  - IBM SP versus Linux.
  - Impact of the file system.
  - Optimization of MPI.
  - The data transpose consists of many small messages; MPI needs to be tuned for this ...
- For operational models, dimension for the worst case:
  - Profile without ice.
  - Consider the smallest grids with the largest storms when assessing load balancing.

Slide 27/28: Optimization: future
Working on the following:
- Hybrid domain decomposition / data transpose parallelization using ww3_multi and automatic grid splitting code (under development).
- Providing some assessment of optimization for both climate (low-res, high-speed) and deterministic (high-res, high-speed) implementations.
Stay tuned!

Slide 28/28: The end
End of lecture.