Optimizing a 2D Discontinuous Galerkin dynamical core for both CPU and GPU execution
Pranay Reddy Kommera*, Dr. Ram Nair**, Dr. Richard Loft**, Raghu Raj Prasanna Kumar**
* Department of Electrical and Computer Engineering, University of Wyoming
** Computational and Information Systems Lab, National Center for Atmospheric Research
Outline
1. Introduction
2. Overview of the Code
3. Methodology
4. Results
5. Conclusion & Future Work
Introduction
o Discontinuous Galerkin (DG) methods are becoming increasingly popular for developing global atmospheric models, in both hydrostatic and nonhydrostatic (NH) settings.
o Why are DG methods popular?
  - Geometric flexibility.
  - High-order accuracy.
  - Low interprocessor communication.
  - High parallel efficiency.
o What model is used?
  - A 2D (x-z) prototype of a nonhydrostatic Discontinuous Galerkin (NH-DG) model, run at high grid resolution, has been modified to support parallelism.
o Why a 2D NH-DG model?
  - It has applications in high-order weather and climate modeling.
  - It promises 3D parallelism in the full 3D model.
Nair, R. D., Bao, L., & Hall, D. (n.d.). A Time-Split Discontinuous Galerkin Non-Hydrostatic Model in HOMME Dynamical Core. Talk presented at ICOSAHOM, Salt Lake City, Utah.
Introduction Cont.
[Figures: element structure; physical problem: the inertia-gravity wave test]
Nair, R. D., Bao, L., & Hall, D. (n.d.). A Time-Split Discontinuous Galerkin Non-Hydrostatic Model in HOMME Dynamical Core. Talk presented at ICOSAHOM, Salt Lake City, Utah.
Bao, L., Klöfkorn, R., & Nair, R. D. (2015). Horizontally Explicit and Vertically Implicit (HEVI) Time Discretization Scheme for a Discontinuous Galerkin Nonhydrostatic Model. Monthly Weather Review, 143(3), 972-990. doi:10.1175/mwr-d-14-00083.1
Overview of the code
o We solve the fully compressible Euler equations with the DG method, written in flux form as
  ∂U/∂t + ∇·F(U) = S(U)   ---(1)
  where U is the state vector, F is the flux vector, and S is the source term.
o Routines of interest:
  - nh_flux_maker
  - dg_nhsys_rhs
  - Recover_state_vars
o The model implements an RK3 method to update the state variables.
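The slide says an RK3 method updates the variables. As a hedged illustration (in C rather than the model's Fortran), here is one step of the three-stage SSP-RK3 scheme in Shu-Osher form, a common choice for DG time stepping; the `rhs` operator is a toy stand-in for dg_nhsys_rhs, and whether the model uses exactly these coefficients is an assumption.

```c
/* Hypothetical right-hand-side operator; in the real model this role is
 * played by dg_nhsys_rhs acting on the full state. Toy ODE: dU/dt = -U. */
static double rhs(double u) { return -u; }

/* One step of three-stage SSP-RK3 (Shu-Osher form), sketched for a
 * scalar state; the model advances the whole state vector the same way. */
double ssp_rk3_step(double u, double dt) {
    double u1 = u + dt * rhs(u);                          /* stage 1 */
    double u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1));    /* stage 2 */
    return (1.0 / 3.0) * u + (2.0 / 3.0) * (u2 + dt * rhs(u2));
}
```

For dU/dt = -U this reproduces the exact decay exp(-dt) to third order in dt.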
Yellowstone Supercomputer
o Caldera
  - 30 nodes: IBM x360 M4, dual-socket.
  - Two 8-core 2.6-GHz Intel Xeon E5-2670 (Sandy Bridge) processors per node.
  - 16 GPU nodes: 2 NVIDIA K20x GPUs per node.
o Compilers
  - Intel 16.0.2
  - PGI 16.5
o Points to know
  - Fortran code
  - Double precision
http://laramielive.com/6-uw-projects-chosen-for-2nd-cycle-of-supercomputer-use/
Approach
Serial Implementation → Optimized Serial Code → OpenACC GPU & CPU Implementation / OpenMP Implementation → Multi-Node Multi-GPU Implementation
Optimizing Serial Code
o Two important factors in the design of a scientific code:
  - Readability
  - Performance
o Coding-style tradeoffs
  - Derived data types (DDT)
    Pros: all related variables are packed into one group; easy to understand.
    Cons: vectorization suffers when derived data types are used.
  - Introduction of local variables
    Pros: makes code that uses derived data types more readable.
    Cons: increases memory usage and the memory requirement.
Optimizing Serial Code Cont.
o Optimization 1 - Arrays
  - Convert derived data types into plain arrays.

Before:
type, public :: state_t
  sequence
  real(kind=double) :: var1(nx,nx)
  ...
  real(kind=double) :: var10(nx,nx)
end type
type (state_t) :: state(nex,nez)

After:
real(kind=double) :: var1(nx,nx,nex,nez)
...
real(kind=double) :: var10(nx,nx,nex,nez)

  - Both the Intel and PGI compilers vectorize plain arrays more readily than derived data types.
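The derived-type-to-array conversion above is the classic array-of-structs to struct-of-arrays rewrite. A minimal C sketch of the two layouts (dimension names mirror the Fortran nx, nex, nez; the sizes and variable names are illustrative, not the model's):

```c
#include <stddef.h>

#define NX  4   /* nodal points per element (illustrative) */
#define NEX 3   /* elements in x (illustrative) */
#define NEZ 2   /* elements in z (illustrative) */

/* "Before": array of derived types -- each element carries its own small
 * blocks, so the same field of different elements is strided in memory. */
typedef struct { double var1[NX][NX]; double var2[NX][NX]; } state_t;
state_t state_aos[NEX][NEZ];

/* "After": one contiguous array per variable (struct of arrays);
 * unit-stride sweeps over this layout vectorize readily. */
double var1[NEX][NEZ][NX][NX];

void fill_var1(double v) {
    double *p = &var1[0][0][0][0];
    for (size_t i = 0; i < (size_t)NEX * NEZ * NX * NX; ++i)
        p[i] = v;
}

double sum_var1_soa(void) {
    /* One flat, unit-stride loop over all of var1. */
    const double *p = &var1[0][0][0][0];
    double s = 0.0;
    for (size_t i = 0; i < (size_t)NEX * NEZ * NX * NX; ++i)
        s += p[i];
    return s;
}
```

In the struct-of-arrays form the compiler sees one contiguous buffer per variable, which is what lets the vectorizer generate packed loads and stores.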
Optimizing Serial Code Cont.
o Optimization 2 - Temporary variables
  - Use temporary variables judiciously.
  - Eliminated unnecessary temporary variables:
    a = b * c; d = a * e  →  d = (b * c) * e
    volume = state%volume; rho = state%rho; mass = volume * rho  →  mass = state%volume * state%rho
  - Added a few temporary variables to remove repeated compute-intensive operations (here, division):
    a = b/c; d = e/c; f = g/c  →  t = 1/c; a = b * t; d = e * t; f = g * t
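The division rewrite in the last example can be sketched in C. The point is that one division plus several multiplies is usually much cheaper than several divisions by the same value (function and variable names are illustrative, not the model's):

```c
/* "Before": three divisions by the same divisor c. */
void scale_before(double b, double e, double g, double c, double out[3]) {
    out[0] = b / c;
    out[1] = e / c;
    out[2] = g / c;
}

/* "After": hoist the reciprocal into a temporary, then multiply. */
void scale_after(double b, double e, double g, double c, double out[3]) {
    double t = 1.0 / c;   /* one division ... */
    out[0] = b * t;       /* ... then cheap multiplies */
    out[1] = e * t;
    out[2] = g * t;
}
```

Note the two versions can differ in the last bits because floating-point division and multiply-by-reciprocal round differently; for this code the authors evidently judged that acceptable.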
Optimizing Serial Code Cont.
[Chart: timings across the serial optimizations — 509.84, 566.25, 383.32, 341.01, 367.37, 320.48, 343.85, 265.38]
GPU Architecture
o Two important concepts in GPU programming:
  - Thread hierarchy
    Grid → Thread blocks (gangs) → Threads (workers and vectors); a warp is 32 threads.
  - Memory hierarchy
    Global memory, shared memory, register memory, constant memory, L1/L2 cache.
    [Diagram: these memories ranked by size and access time]
https://www.olcf.ornl.gov/support/system-user-guides/accelerated-computing-guide/
OpenACC GPU Implementation
o Open Accelerators (OpenACC) is a directive-based programming model targeting accelerators such as NVIDIA GPUs and AMD Radeon GPUs.
o Parallel and data directives are used to implement parallelism across threads and data transfers across the various memories.
o GPU Implementation 1 – var(nx,nx,nex,nez): each thread's element data is strided in memory.
o GPU Implementation 2 – var(nex,nez,nx,nx): the element indices vary fastest, so consecutive threads (mapped one per element) access consecutive memory locations.
o GPU Implementation 2 resulted in a 33% reduction in time w.r.t. Implementation 1.
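The layout change behind Implementation 2 can be sketched in C. C is row-major, so the element indices are placed last to be fastest-varying, mirroring the column-major Fortran var(nex,nez,nx,nx); the directive use follows the slide's parallel/loop pattern, and the sizes are illustrative. Without an OpenACC compiler the pragma is simply ignored and the loop runs serially:

```c
#define NX  4    /* nodal points per element (illustrative) */
#define NEX 64   /* elements in x (illustrative) */
#define NEZ 32   /* elements in z (illustrative) */

/* Layout 2: element indices fastest-varying, so threads mapped one per
 * element read consecutive addresses -- coalesced global-memory access. */
double var_l2[NX][NX][NEX][NEZ];

void fill_l2(double v) {
    double *p = &var_l2[0][0][0][0];
    for (long i = 0; i < (long)NX * NX * NEX * NEZ; ++i)
        p[i] = v;
}

void scale_elements(double alpha) {
    /* One gang/worker/vector-parallel sweep over elements; for a fixed
     * (i, j), neighboring (ie, je) iterations touch neighboring memory. */
    #pragma acc parallel loop collapse(2) copy(var_l2)
    for (int ie = 0; ie < NEX; ++ie)
        for (int je = 0; je < NEZ; ++je)
            for (int i = 0; i < NX; ++i)
                for (int j = 0; j < NX; ++j)
                    var_l2[i][j][ie][je] *= alpha;
}
```

With the original var(nx,nx,nex,nez) layout the same thread mapping would make each warp gather values NX*NX apart, which is the uncoalesced pattern Implementation 2 removes.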
Other Implementations
o Multicore implementations on CPUs
  - Both OpenACC and OpenMP are used to build parallel models on CPUs.
  - With minimal code modifications, the OpenACC GPU code was converted into OpenACC CPU code.
  - OpenMP directives were added to obtain a parallel OpenMP implementation.
o Multi-GPU implementation
  - The MPI and OpenACC programming models are combined to run the 2D NH-DG model on multiple nodes.
  - The exchange of data between nodes is performed by the CPU using MPI; the data to be exchanged are transferred from GPU to CPU on each iteration.
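The CPU-side exchange implies gathering boundary (halo) data into contiguous buffers before the MPI calls. A hedged C sketch of the packing step only; the actual MPI and OpenACC update calls are omitted, and the names and sizes here are illustrative, not the model's:

```c
#define NEZ 8   /* elements along the shared edge (illustrative) */
#define NX  4   /* nodal points per element edge (illustrative) */

/* Copy the easternmost column of elements of a field into a flat send
 * buffer; the receiving rank would unpack into its western halo. */
void pack_east_edge(double field[][NEZ][NX], int nex, double buf[NEZ * NX]) {
    for (int je = 0; je < NEZ; ++je)
        for (int i = 0; i < NX; ++i)
            buf[je * NX + i] = field[nex - 1][je][i];
}
```

In the real code the buffer would be moved off the GPU before the send and the received buffer moved back, which is exactly the per-iteration GPU-to-CPU traffic the slide describes (and what GPUDirect, mentioned under Future Work, would avoid).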
Results
[Chart: original-code timings — 13.32, 9.45, 10.23]
A speedup of ~1.4x can be achieved for the original code just by using the Intel compiler.
Results
[Chart: timings across implementations — 794.12, 383.86, 261.28, 235.19, 225.77, 167.91, 158.56]
Conclusion & Future Work
o Conclusion
  - A single Kepler K20x outperforms a dual-socket Sandy Bridge Xeon node.
  - Demonstrated performance portability with OpenACC.
  - OpenACC and OpenMP have comparable CPU performance in the one-thread-per-core case; however, OpenACC does not support hyperthreading.
  - The serial performance of the PGI compiler is significantly slower than Intel's.
o Future Work
  - Use GPUDirect for GPU-to-GPU communication to improve scaling.
  - Further optimize the load and store transactions on GPUs.
  - Benchmark contemporary systems such as Knights Landing, Pascal, and Broadwell.
References
o Bao, L., Klöfkorn, R., & Nair, R. D. (2015). Horizontally Explicit and Vertically Implicit (HEVI) Time Discretization Scheme for a Discontinuous Galerkin Nonhydrostatic Model. Monthly Weather Review, 143(3), 972-990. doi:10.1175/mwr-d-14-00083.1
o Nair, R. D., Bao, L., & Hall, D. (n.d.). A Time-Split Discontinuous Galerkin Non-Hydrostatic Model in HOMME Dynamical Core. Talk presented at ICOSAHOM, Salt Lake City, Utah.
THANK YOU
Extra
Example parallel directives:
!$acc parallel num_gangs(x) num_workers(y) vector_length(z)
!$acc loop gang worker vector collapse(n) private(var list)
Example data directives:
!$acc data copyin(var list) copyout(var list)
!$acc end data
[Chart: time for each routine per iteration]
Code
[Listing: local variables in dg_nhsys_rhs]
Timings
[Table: timing statistics for 6000 and 24000 elements]
Results
[Chart: timings — 13.32, 18.11]
Results
[Chart: timings — 10.11, 3.27, 31.26, 2.99]