
1 Optimizing a 2D Discontinuous Galerkin dynamical core for both CPU and GPU execution
Pranay Reddy Kommera*, Dr. Ram Nair**, Dr. Richard Loft**, Raghu Raj Prasanna Kumar**
* Department of Electrical and Computer Engineering, University of Wyoming
** Computational and Information Systems Lab, National Center for Atmospheric Research

2 Outline
1. Introduction
2. Overview of the Code
3. Methodology
4. Results
5. Conclusion & Future Work

3 Introduction
o Discontinuous Galerkin (DG) methods are becoming increasingly popular for building global atmospheric models, in both hydrostatic and nonhydrostatic (NH) settings.
o Why are DG methods popular?
  - Geometric flexibility.
  - High-order accuracy.
  - Low interprocessor communication.
  - High parallel efficiency.
o What model is used?
  - A 2D (x-z) prototype of a nonhydrostatic Discontinuous Galerkin (NH-DG) model, run at high grid resolution, has been modified to support parallelism.
o Why a 2D NH-DG model?
  - It has application in high-order weather and climate modeling.
  - It points toward 3D parallelism in the full 3D model.
Nair, R. D., Bao, L., & Hall, D. (n.d.). A Time-Split Discontinuous Galerkin Non-Hydrostatic Model in HOMME Dynamical Core. Speech presented at ICOSAHOM, Salt Lake City, Utah.

4 Introduction Cont.
  - Element Structure
  - Physical Problem: Inertia-Gravity Wave Test
[Figures: the element structure and the inertia-gravity wave test configuration]
Nair, R. D., Bao, L., & Hall, D. (n.d.). A Time-Split Discontinuous Galerkin Non-Hydrostatic Model in HOMME Dynamical Core. Speech presented at ICOSAHOM, Salt Lake City, Utah.
Bao, L., Klöfkorn, R., & Nair, R. D. (2015). Horizontally Explicit and Vertically Implicit (HEVI) Time Discretization Scheme for a Discontinuous Galerkin Nonhydrostatic Model. Monthly Weather Review, 143(3), 972-990. doi:10.1175/mwr-d-14-00083.1

5 Overview of the code
o We solve the fully compressible Euler equations with the DG method, written in flux form:
  ∂U/∂t + ∇·F(U) = S(U)   ---(1)
  where U is the state vector, F is the flux vector, and S is the source term.
o Routines of interest:
  - nh_flux_maker
  - dg_nhsys_rhs
  - Recover_state_vars
o The code implements an RK3 method to update the variables (a sketch follows below).
[Diagram: RK3 update cycling through nh_flux_maker, dg_nhsys_rhs, and Recover_state_vars]
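The RK3 update can be pictured with a short Fortran sketch. This is a hypothetical driver, not the model's actual code: the stage coefficients are those of the standard three-stage SSP Runge-Kutta scheme, compute_rhs is a stand-in whose split across nh_flux_maker, dg_nhsys_rhs, and Recover_state_vars is only inferred from the routine names, and the real argument lists are not shown.

    module rk3_driver_sketch
      implicit none
      integer, parameter :: dp = kind(1.0d0)
    contains

      subroutine advance_rk3(u, dt)
        ! One full RK3 step for a packed state vector u (illustrative only).
        real(dp), intent(inout) :: u(:)
        real(dp), intent(in)    :: dt
        real(dp) :: u1(size(u)), u2(size(u)), rhs(size(u))

        call compute_rhs(u,  rhs)                               ! stage 1
        u1 = u + dt * rhs
        call compute_rhs(u1, rhs)                               ! stage 2
        u2 = 0.75_dp * u + 0.25_dp * (u1 + dt * rhs)
        call compute_rhs(u2, rhs)                               ! stage 3
        u  = u / 3.0_dp + (2.0_dp / 3.0_dp) * (u2 + dt * rhs)
      end subroutine advance_rk3

      subroutine compute_rhs(u, rhs)
        ! Stand-in for the model's RHS evaluation; in the real code this work
        ! is split across the three routines of interest named on the slide.
        real(dp), intent(in)  :: u(:)
        real(dp), intent(out) :: rhs(:)
        rhs = 0.0_dp
      end subroutine compute_rhs

    end module rk3_driver_sketch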

6 Yellowstone Supercomputer
o Caldera
  - 30 IBM x360 M4 dual-socket nodes.
  - Two 8-core 2.6-GHz Intel Xeon E5-2670 (Sandy Bridge) processors per node.
  - 16 GPU nodes, with 2 NVIDIA K20x GPUs per node.
o Compilers
  - Intel 16.0.2
  - PGI 16.5
o Points to know
  - Fortran code
  - Double precision
[Image: http://laramielive.com/6-uw-projects-chosen-for-2nd-cycle-of-supercomputer-use/]

7 Approach
Serial Implementation → Optimized Serial Code → OpenACC GPU & CPU Implementation / OpenMP Implementation → Multi-Node Multi-GPU Implementation

8 Optimizing Serial Code
o Two important factors in the design of a scientific code:
  - Readability
  - Performance
o Coding style tradeoffs
  - Derived data types (DDT)
    Pros: all the variables are packed into a group; easy to understand.
    Cons: vectorization is hampered by derived data types.
  - Introduction of local variables
    Pros: makes derived data types readable.
    Cons: increases memory usage and the memory footprint.

9 Optimizing Serial Code Cont.
o Optimization 1 - Arrays
  - Convert derived data types into plain arrays (a sketch of the payoff follows below).

    ! Before: derived data type
    type, public :: state_t
      sequence
      real(kind=double) :: var1(nx,nx)
      …
      real(kind=double) :: var10(nx,nx)
    end type
    type (state_t) :: state(nex,nez)

    ! After: plain 4D arrays
    real(kind=double) :: var1(nx,nx,nex,nez)
    …
    real(kind=double) :: var10(nx,nx,nex,nez)

  - The Intel and PGI compilers vectorize loops over plain arrays more readily than loops over derived data types.
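A minimal sketch of why the flattened layout helps, with illustrative names (var1, var2, dt) and a made-up update: the innermost loop now runs over the fastest-varying index of a contiguous array, which Intel and PGI vectorize readily, whereas the equivalent loop over state(ie,ke)%var1(i,j) components often is not vectorized.

    subroutine update_flat(var1, var2, dt, nx, nex, nez)
      ! Sketch only: unit-stride access in the innermost loop over plain arrays.
      implicit none
      integer, parameter :: double = kind(1.0d0)
      integer, intent(in) :: nx, nex, nez
      real(kind=double), intent(in)    :: dt
      real(kind=double), intent(in)    :: var2(nx,nx,nex,nez)
      real(kind=double), intent(inout) :: var1(nx,nx,nex,nez)
      integer :: i, j, ie, ke

      do ke = 1, nez
        do ie = 1, nex
          do j = 1, nx
            do i = 1, nx                       ! contiguous in memory -> SIMD
              var1(i,j,ie,ke) = var1(i,j,ie,ke) + dt * var2(i,j,ie,ke)
            end do
          end do
        end do
      end do
    end subroutine update_flat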

10 Optimizing Serial Code Cont.
o Optimization 2 - Temporary variables
  - Clever usage of temporary variables (a loop-level sketch follows below).
  - Eliminated unnecessary temporary variables:

    ! Before                         ! After
    a = b * c                        d = (b * c) * e
    d = a * e

    ! Before                         ! After
    volume = state%volume            mass = state%volume * state%rho
    rho    = state%rho
    mass   = volume * rho

  - Added a few temporary variables to remove repeated compute-intensive operations (here, divisions):

    ! Before                         ! After
    a = b/c                          t = 1/c
    d = e/c                          a = b * t
    f = g/c                          d = e * t
                                     f = g * t
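The reciprocal trick matters most inside hot loops. A hedged sketch with illustrative names: the division is paid once, and the loop body uses only multiplies.

    subroutine divide_by_c(a, b, c, n)
      ! Sketch: hoist 1/c out of the loop and multiply instead of dividing.
      implicit none
      integer, parameter :: double = kind(1.0d0)
      integer, intent(in) :: n
      real(kind=double), intent(in)  :: b(n), c
      real(kind=double), intent(out) :: a(n)
      real(kind=double) :: t
      integer :: i

      t = 1.0_double / c          ! one division
      do i = 1, n
        a(i) = b(i) * t           ! multiplies only inside the loop
      end do
    end subroutine divide_by_c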

11 Optimizing Serial Code Cont.
[Bar chart: timings for the serial optimization steps; values shown: 509.84, 566.25, 383.32, 341.01, 367.37, 320.48, 343.85, 265.38]

12 GPU Architecture
o Two important concepts in GPU programming:
  - Thread hierarchy: grid → thread blocks (gangs) → threads (workers and vectors); a warp is 32 threads.
  - Memory hierarchy: global memory, shared memory, register memory, constant memory, L1/L2 cache.
[Diagram: memory hierarchy ordered by memory size and access time]
https://www.olcf.ornl.gov/support/system-user-guides/accelerated-computing-guide/

13 OpenACC GPU Implementation
o Open Accelerators (OpenACC) is a directive-based programming model targeting accelerators such as NVIDIA and AMD Radeon GPUs.
o Parallel and data directives are used to express parallelism across threads and to manage data transfers between the various memories.
o GPU Implementation 1 - var(nx,nx,nex,nez): threads are mapped to elements, but the element indices are the slowest-varying dimensions, so neighbouring threads access widely separated memory.
o GPU Implementation 2 - var(nex,nez,nx,nx): the element index varies fastest, so neighbouring threads access neighbouring memory.
o GPU Implementation 2 resulted in a 33% reduction in time with respect to Implementation 1 (a hedged loop sketch follows below).
[Diagram: thread-to-memory access patterns for the two layouts]
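A hedged sketch of the Implementation-2 loop structure. The directive clauses, array names, and loop body are illustrative, not the model's exact code; the one load-bearing detail is that the element index ie is both the first (fastest-varying) array dimension and the innermost collapsed loop, so consecutive GPU threads touch consecutive memory.

    subroutine apply_source_acc(var, src, dt, nx, nex, nez)
      ! Sketch of the var(nex,nez,nx,nx) layout with threads mapped to elements.
      implicit none
      integer, parameter :: double = kind(1.0d0)
      integer, intent(in) :: nx, nex, nez
      real(kind=double), intent(in)    :: dt
      real(kind=double), intent(in)    :: src(nex,nez,nx,nx)
      real(kind=double), intent(inout) :: var(nex,nez,nx,nx)
      integer :: i, j, ie, ke

      ! The copy clauses keep this sketch self-contained; the real code would
      ! keep the arrays resident on the device with an enclosing data region.
      !$acc parallel loop collapse(4) copyin(src) copy(var)
      do j = 1, nx
        do i = 1, nx
          do ke = 1, nez
            do ie = 1, nex                 ! fastest index -> coalesced access
              var(ie,ke,i,j) = var(ie,ke,i,j) + dt * src(ie,ke,i,j)
            end do
          end do
        end do
      end do
      !$acc end parallel loop
    end subroutine apply_source_acc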

14 Other Implementations
o Multicore implementations on CPUs
  - Both OpenACC and OpenMP are used to develop parallel models on CPUs.
  - With minimal code modifications, the OpenACC GPU code is converted into OpenACC CPU code.
  - OpenMP directives were added to obtain a parallel OpenMP implementation.
o Multi-GPU implementation
  - The MPI and OpenACC programming models are combined to execute the 2D NH-DG model on multiple nodes.
  - The exchange of data between nodes is performed by the CPU using MPI.
  - The data required for the exchange are transferred from GPU to CPU on each iteration (sketch below).
[Diagram: GPU 1 ↔ CPU ↔ GPU 2]
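A sketch of the exchange pattern just described, with hypothetical buffer and neighbour names: halo data is updated on the host, exchanged with MPI, and updated back on the device. It assumes the buffers already live in an enclosing OpenACC data region.

    subroutine exchange_halo(sendbuf, recvbuf, n, left, right, comm)
      ! Sketch: GPU -> CPU, MPI exchange on the CPU, CPU -> GPU.
      use mpi
      implicit none
      integer, parameter :: double = kind(1.0d0)
      integer, intent(in) :: n, left, right, comm
      real(kind=double), intent(inout) :: sendbuf(n), recvbuf(n)
      integer :: ierr

      !$acc update host(sendbuf)                            ! device -> host
      call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                        recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                        comm, MPI_STATUS_IGNORE, ierr)
      !$acc update device(recvbuf)                          ! host -> device
    end subroutine exchange_halo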

15 Results
[Chart values: 13.32, 9.45, 10.23]
  - A speedup of ~1.4x can be achieved for the original code just by switching to the Intel compiler.

16 Results
[Chart values: 794.12, 383.86, 261.28, 235.19, 225.77, 167.91, 158.56]

17 Conclusion & Future Work
o Conclusion
  - A single Kepler K20x outperforms a dual-socket Sandy Bridge Xeon node.
  - Demonstrated performance portability with OpenACC.
  - OpenACC and OpenMP have comparable CPU performance in the one-thread-per-core case.
  - However, OpenACC does not support hyperthreading.
  - The serial performance of the PGI compiler is significantly slower than Intel's.
o Future Work
  - Use GPUDirect for GPU-to-GPU communication to improve scaling (a possible shape is sketched below).
  - Further optimize the load and store transactions on the GPUs.
  - Benchmark contemporary systems such as Knights Landing, Pascal, and Broadwell.
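For the GPUDirect item under future work, one possible shape is sketched here, assuming a CUDA-aware MPI build and buffers already present on the device: host_data hands device addresses straight to MPI, removing the staging copies through the CPU shown earlier.

    subroutine exchange_halo_gpudirect(sendbuf, recvbuf, n, left, right, comm)
      ! Sketch: pass device pointers to a CUDA-aware MPI; no host staging copies.
      use mpi
      implicit none
      integer, parameter :: double = kind(1.0d0)
      integer, intent(in) :: n, left, right, comm
      real(kind=double), intent(inout) :: sendbuf(n), recvbuf(n)
      integer :: ierr

      !$acc host_data use_device(sendbuf, recvbuf)
      call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                        recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                        comm, MPI_STATUS_IGNORE, ierr)
      !$acc end host_data
    end subroutine exchange_halo_gpudirect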

18 References
o Bao, L., Klöfkorn, R., & Nair, R. D. (2015). Horizontally Explicit and Vertically Implicit (HEVI) Time Discretization Scheme for a Discontinuous Galerkin Nonhydrostatic Model. Monthly Weather Review, 143(3), 972-990. doi:10.1175/mwr-d-14-00083.1
o Nair, R. D., Bao, L., & Hall, D. (n.d.). A Time-Split Discontinuous Galerkin Non-Hydrostatic Model in HOMME Dynamical Core. Speech presented at ICOSAHOM, Salt Lake City, Utah.

19 THANK YOU

20 Extra
  - Example parallel directives (a filled-in instance follows below):

    !$acc parallel num_gangs(x) num_workers(y) vector_length(z)
    !$acc loop gang worker vector collapse(n) private(var list)

  - Example data directives:

    !$acc data copyin(var list) copyout(var list)
    !$acc end data

[Table: time for each routine per iteration]
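A concrete instance of the directive templates above, with placeholder tuning values and array names that are not the model's actual settings:

    subroutine axpy_acc(y, x, a, n)
      ! Sketch: data region plus a tuned parallel loop, mirroring the templates above.
      implicit none
      integer, parameter :: double = kind(1.0d0)
      integer, intent(in) :: n
      real(kind=double), intent(in)    :: a, x(n)
      real(kind=double), intent(inout) :: y(n)
      integer :: i

      !$acc data copyin(x) copy(y)
      !$acc parallel num_gangs(256) vector_length(128)
      !$acc loop gang vector
      do i = 1, n
        y(i) = y(i) + a * x(i)
      end do
      !$acc end parallel
      !$acc end data
    end subroutine axpy_acc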

21 Code
[Code listing: local variables in dg_nhsys_rhs]

22 Timings
[Table: timing statistics for 6000 and 24000 elements]

23 Results
[Chart values: 13.32, 18.11]

24 Results
[Chart values: 10.11, 3.27, 31.26, 2.99]

