CMAQ 5.2.1 PARALLEL PERFORMANCE WITH MPI AND OpenMP
George Delic, Ph.D., HiPERiSM Consulting, LLC

Introduction

This presentation reports on the implementation of the parallel sparse matrix solver, FSparse, in the Chemistry Transport Model (CTM) in CMAQ [1]. This release is v6.3 and is a major redesign. It is applicable in the CMAQ version that uses either the Rosenbrock (ROS3) or SMV Gear (GEAR) [2] algorithm in the CTM.

Results

Figure 1: Fraction of wall clock time (percent) by science process for the FSparse GEAR version of CMAQ for NP=1 and OpenMP thread counts of 8, 12, and 16.

Figure 2: OpenMP speedup with 8 threads for the GEAR solver in the thread parallel version over the standard U.S. EPA model using the Intel compiler for various combinations of nodes.

Figure 3: For the 8 thread FSparse version of CMAQ, the fraction of total time (percent) spent in MPI message passing, the OpenMP computation region, and the scalar computation region with the Intel compiler for various combinations of nodes.

Key Observations

MPI communication
- MPI communication time tends to compete with, or exceed, computation time.
- The MPI communication fraction grows with the process count, and MPI parallel efficiency decreases correspondingly.

Computation
- CHEM is by far the dominant science process in CMAQ computation time (Fig. 1).
- Scalar computation time dominates, or competes with, OpenMP computation time (Fig. 3).
- The scalar computation fraction (of total time) dominates whenever the MPI communication fraction exceeds 12%.
- The OpenMP computation fraction dominates when the MPI communication fraction is below 12%.

Hybrid parallel algorithm
- Thread speedup reaches 1.42 with 8 threads (Fig. 2).
- Hybrid MPI+OpenMP algorithms offer more on-node compute intensity as the number of available threads rises (in Fig. 4, note the 1.5 times performance boost with 16 versus 8 threads).

Figure 4: Parallel thread speedup over the standard U.S. EPA model using the Intel compiler in 288 calls to CHEM with the GEAR solver for 8, 12, and 16 threads, for NP=1 MPI process in the 24 hour episode.

Test Bed Environment

The hardware systems chosen were the platforms at HiPERiSM Consulting, LLC. This heterogeneous cluster consists of 128 cores across two nodes with dual Intel E5v3 CPUs (16 cores each) and eight nodes on an HP blade server populated with dual E5640 CPUs (4 cores each). This study used the Intel Parallel Studio® suite (release 17.6) and the Portland CDK (release 18.1).

Profile of CMAQ processes

For a profile of where time is consumed, Fig. 1 compares the fraction of total wall clock time expended in the dominant science processes in CMAQ for one MPI process. The CHEM process is the Gear (GEAR) version of the CTM. The EPA version is compared with the FSparse threaded version for 8, 12, and 16 OpenMP threads, as identified in the legend. As the fraction of time in CHEM decreases, the fraction of time in the other science processes increases.

FSparse CMAQ thread speedup

Fig. 2 shows the speedup of the OpenMP version over the EPA release, which peaks at 1.42. The horizontal axis shows the number of MPI processes, and the multiple values are for differing combinations of participating nodes of the cluster. Variability for any fixed value of NP is determined by how many processes are resident on the two fastest nodes versus the number on the eight blade nodes. Grid cells are partitioned into blocks of size 50, and these blocks are distributed to the threads of a thread team in the OpenMP version (which is limited to 8 threads per MPI process).
Computation vs communication

Intel trace tools allow instrumentation that measures the fraction of run time expended in MPI message passing versus computation. For the latter, trace instrumentation also separates the scalar and thread parallel computation fractions. Fig. 3 shows this summary for the FSparse version of CMAQ. More than 20% of the time is spent in MPI message passing at larger NP values. The exceptions are cases when MPI processes reside on the fastest nodes, where MPI communication is predominantly via on-node memory rather than the interconnect fabric.

Conclusions

This report has described an analysis of CMAQ 5.2.1 behavior in the standard U.S. EPA release and a new thread parallel version of CMAQ suitable for the Rosenbrock and Gear solvers. In this version (v6.3), subroutines common to both algorithms have been developed and tested in the Gear case. The new FSparse version of CMAQ offers layers of parallelism not available in the standard U.S. EPA release and is portable across multi- and many-core hardware and compilers that support thread parallelism.

Numerical issues

Concentration values for 11 selected species (O3, NOx, etc.) were compared for the entire domain and all 24 time steps. An example of such a comparison is the difference in predicted concentrations between the EPA and FSparse versions. To within the tolerances required in GEAR (ATOL=1.0e-09, RTOL=1.0e-03), agreement was observed.

References

[1] Delic, G., 2016: presentation at the Annual CMAS meeting (http://www.cmasecenter.org).
[2] Jacobson, M. and Turco, R.P. (1994), Atmos. Environ. 28, 273-284.
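As a concrete reading of the tolerance statement under Numerical issues above, the following minimal C sketch applies a combined absolute/relative test of the form |a - b| <= ATOL + RTOL * max(|a|, |b|) to paired concentration predictions. The form of the test, the array names, and the sample values are assumptions for illustration, not the comparison procedure actually used in the study.

/*
 * Illustration only: a combined absolute/relative tolerance test of the
 * kind suggested by the GEAR settings (ATOL = 1.0e-9, RTOL = 1.0e-3).
 * Two concentration arrays (EPA release vs. FSparse version) agree at a
 * grid cell when |a - b| <= ATOL + RTOL * max(|a|, |b|).
 * Array names, lengths, and values are hypothetical.
 */
#include <math.h>
#include <stdio.h>

#define ATOL 1.0e-9
#define RTOL 1.0e-3

/* Returns the number of cells where the two predictions disagree. */
static int count_disagreements(const double *epa, const double *fsparse, int n)
{
    int bad = 0;
    for (int i = 0; i < n; i++) {
        double tol = ATOL + RTOL * fmax(fabs(epa[i]), fabs(fsparse[i]));
        if (fabs(epa[i] - fsparse[i]) > tol)
            bad++;
    }
    return bad;
}

int main(void)
{
    /* Tiny hypothetical example: three O3 concentrations (ppmV). */
    double epa[]     = { 0.0412, 0.07350, 0.0088 };
    double fsparse[] = { 0.0412, 0.07348, 0.0088 };

    int bad = count_disagreements(epa, fsparse, 3);
    printf("%d of 3 cells outside tolerance\n", bad);
    return 0;
}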