Exploiting NVIDIA GPUs with OpenMP


1 Exploiting NVIDIA GPUs with OpenMP
Productive, Portable, High-Level GPU Acceleration

2 OpenMP: Parallelization
A productive, high-level, directive-based API for parallelizing an application. Portable across multiple architectures and platforms.

template <class T>
void OMP45Stream<T>::triad()
{
  unsigned int array_size = this->array_size;
  T *a = this->a;
  T *b = this->b;
  T *c = this->c;

  #pragma omp parallel for
  for (int i = 0; i < array_size; i++)
    a[i] = b[i] + startScalar * c[i];
}

3 OpenMP: GPU Offloading
A productive, high-level, directive-based API for parallelizing an application. Portable across multiple architectures and platforms.

template <class T>
void OMP45Stream<T>::triad()
{
  unsigned int len = this->array_size;
  T *a = this->a;
  T *b = this->b;
  T *c = this->c;

  #pragma omp target teams distribute parallel for \
      map(tofrom: a[:len]) map(to: b[:len], c[:len])
  for (int i = 0; i < len; i++)
    a[i] = b[i] + startScalar * c[i];
}

GPU offloading with a single line!

4 Performance Comparison: Stream Benchmark
Test specs: 2x POWER8 @ 4 GHz (8 cores, 8 threads each) with 1 NVIDIA Pascal P100 GPU.
Compiler options: -O3 -qhot -qsmp=omp -qoffload (where applicable)

The kernel is parallelized with a simple directive. With a small extension to that directive, the kernel is offloaded to the GPU and accelerated further.

5 Time to Accelerate an Application! LULESH
A representative scientific computing application [1]:
- Numerical algorithms
- Data motion
- Programming styles

6 Generic Acceleration Effort
Find a hotspot.

for (int i = 0; i < n; ++i) {
  int indx = idxlist[i];
  double dtf = 1.0e+20;
  double dtdvov = 1.0e+20;
  if (vdov[indx] != double(0.)) {
    dtf = ss[indx] * ss[indx];
    dtdvov = dvovmax / (FABS(vdov[indx]) + double(1.e-20));
    if (vdov[indx] < double(0.)) {
      dtf = dtf + qqc2 * arealg[indx] * arealg[indx]
                       * vdov[indx] * vdov[indx];
    }
    dtf = SQRT(dtf);
    dtf = arealg[indx] / dtf;
    dtcourant = dtf < dtcourant ? dtf : dtcourant;
    dthydro = dtdvov < dthydro ? dtdvov : dthydro;
  }
}

[1] Sample reduction code found in LULESH

7 Generic Acceleration Effort
Find a hotspot. Do a quick CPU acceleration first, using something like OpenMP.

#pragma omp parallel for reduction(min: dtcourant, dthydro)
for (int i = 0; i < n; ++i) {
  int indx = idxlist[i];
  double dtf = 1.0e+20;
  double dtdvov = 1.0e+20;
  if (vdov[indx] != double(0.)) {
    dtf = ss[indx] * ss[indx];
    dtdvov = dvovmax / (FABS(vdov[indx]) + double(1.e-20));
    if (vdov[indx] < double(0.)) {
      dtf = dtf + qqc2 * arealg[indx] * arealg[indx]
                       * vdov[indx] * vdov[indx];
    }
    dtf = SQRT(dtf);
    dtf = arealg[indx] / dtf;
    dtcourant = dtf < dtcourant ? dtf : dtcourant;
    dthydro = dtdvov < dthydro ? dtdvov : dthydro;
  }
}

[1] Sample reduction code found in LULESH

8 Generic Acceleration Effort
Find a hotspot. Do a quick CPU acceleration first, using something like OpenMP. If accelerators such as GPUs are available, use them; the main approaches are CUDA and OpenMP. Since we already have OpenMP code, why not reuse it?

#pragma omp target teams distribute parallel for \
    map(to: arealg[:n], vdov[:n], ss[:n], idxlist[:n]) \
    map(tofrom: dtcourant, dthydro) \
    reduction(min: dtcourant, dthydro)
for (int i = 0; i < n; ++i) {
  int indx = idxlist[i];
  double dtf = 1.0e+20;
  double dtdvov = 1.0e+20;
  if (vdov[indx] != double(0.)) {
    dtf = ss[indx] * ss[indx];
    dtdvov = dvovmax / (FABS(vdov[indx]) + double(1.e-20));
    if (vdov[indx] < double(0.)) {
      dtf = dtf + qqc2 * arealg[indx] * arealg[indx]
                       * vdov[indx] * vdov[indx];
    }
    dtf = SQRT(dtf);
    dtf = arealg[indx] / dtf;
    dtcourant = dtf < dtcourant ? dtf : dtcourant;
    dthydro = dtdvov < dthydro ? dtdvov : dthydro;
  }
}

[1] Sample reduction code found in LULESH

9 Performance Results – End to End
Test specs: 2x POWER8 @ 4 GHz (8 cores, 8 threads each) with 1 NVIDIA Pascal P100 GPU.
Compiler options: -O3 -qhot -qsmp=omp -qoffload (where applicable)

(Chart: end-to-end speedup results.)

10 References [1] I. Karlin, J. Keasler, R. Neely. LULESH 2.0 Updates and Changes. August 2013, pages 1-9

11 Exploiting NVIDIA GPUs with OpenMP
Productive, Portable, High-Level GPU Acceleration

12 OpenMP Offloading (With XL Compilers) vs. CUDA
There are two separate goals for application acceleration: productivity and portability versus absolute performance. The two models are not mutually exclusive. With the XL compilers, OpenMP and CUDA can be used together:
- Take advantage of existing CUDA libraries and CUDA code in your OpenMP applications (see the sketch below)
- Quickly extend existing CUDA applications
CUDA tools are also compatible with OpenMP programs: cuda-gdb and other debuggers, nvprof/nvvp, cuda-memcheck, etc.
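As an illustration of the first point, here is a minimal sketch (not part of the original deck) that hands OpenMP-mapped buffers to an existing CUDA library, cuBLAS. The function name DaxpyOnGpu is made up for the example; it assumes the default device and uses the use_device_ptr clause that is covered in more detail later in the deck.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical sketch: y = alpha*x + y on the GPU via cuBLAS,
// with the buffers managed by OpenMP maps rather than cudaMalloc.
void DaxpyOnGpu(double *x, double *y, int n, double alpha)
{
  cublasHandle_t handle;
  cublasCreate(&handle);

  // Map x and y to the device; inside the region, use_device_ptr makes
  // x and y refer to their device copies so cuBLAS can consume them.
  #pragma omp target data map(to: x[:n]) map(tofrom: y[:n]) use_device_ptr(x, y)
  {
    cublasDaxpy(handle, n, &alpha, x, 1, y, 1);
    cudaDeviceSynchronize();  // make sure the kernel is done before y is copied back
  }

  cublasDestroy(handle);
}

The same pattern applies to any CUDA library call that accepts raw device pointers.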

13 Programming Languages and Compilers
Key features of each model:

CUDA:
- Direct access to the GPU instruction set
- When leveraging NVIDIA GPUs, generally achieves the best performance
- Compilers: nvcc, XL Fortran, PGI CUDA Fortran; host compilers: GCC, XL

OpenMP:
- High-level directives for heterogeneous CPU + NVIDIA GPU systems
- Platform/accelerator portable, with fallback execution for safety
- Compilers: IBM XL, LLVM/Clang, GCC
- Interoperable with CUDA (when using the XL compilers)

OpenACC:
- High-level directives for heterogeneous CPU + NVIDIA GPU systems
- Directive-based parallelization for the accelerator device
- Compilers: PGI, Cray, GCC
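To make the "direct access" point concrete, here is a minimal CUDA sketch (not from the original deck) of the same triad operation shown with OpenMP on slides 2-3. The kernel name, block size, and explicit memory management are illustrative assumptions.

#include <cuda_runtime.h>

// Hypothetical CUDA version of the triad loop: a[i] = b[i] + scalar * c[i].
// The programmer writes the kernel, manages device memory, and configures
// the launch explicitly; OpenMP derives all of this from one directive.
__global__ void triad_kernel(double *a, const double *b, const double *c,
                             double scalar, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = b[i] + scalar * c[i];
}

void triad(double *a, const double *b, const double *c, double scalar, int n)
{
  double *d_a, *d_b, *d_c;
  size_t bytes = n * sizeof(double);
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);
  cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_c, c, bytes, cudaMemcpyHostToDevice);

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  triad_kernel<<<blocks, threads>>>(d_a, d_b, d_c, scalar, n);

  cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
}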

14 End-to-End Performance
- Comparable or better performance in 5/8* comparable kernels
- Worse performance in 3/8 comparable kernels; some gaps identified in the slower kernels
- 8* kernels are not directly comparable
- Performance of OpenMP on XL is improving over time

Test specs: 2x POWER8 @ 4 GHz (8 cores, 8 threads each) with 1 NVIDIA Pascal P100 GPU.
Compiler options: -O3 -qhot -qsmp=omp -qoffload (where applicable)

* When considering comparable kernels in the OpenMP version; the CUDA version has a different number of kernels.

15 Performance of Comparable Kernels (45x45x45)
Kernel                                      OpenMP (us)   CUDA (us)   Speedup (x)
.*CalcTimeConstraints.*                          24            42         1.75
.*CalcLagrange.*                                 28            30         1.07
.*CalcMonotonicQRegionForElem.*                  90           115         1.28
.*CalcPositionAndVelocityForNodes.*              67            62         0.93
.*CalcAccelerationForNodes.*                     34            31         0.91
.*CalcKinematicsForElems.*                      190           130         0.68
.*ApplyAccelerationBoundaryConditions.*           4.6           2.4        0.52
.*CalcMonotonicQGradient.*                      197           100         0.51

(Speedup = CUDA time / OpenMP time.) Keep in mind the design trade-offs between OpenMP and CUDA.

16 Comparing Kernels - CalcTimeConstraintsForElem
Metric                            OpenMP                                     CUDA
Lines of code (calling overhead)  1: directive (very small)                  4: kernel launch + definition (small)
Lines of code (kernel)            24                                         100+
Implementation time               Low                                        Medium/High
Performance (over serial)         Excellent
Portability                       Portable with any C/C++/Fortran            NVIDIA hardware only; CUDA and
                                  compiler, any platform                     CUDA Fortran compilers

17 Kernel Performance (Detailed) 45x45x45
The following kernels (OpenMP - GPU vs. CUDA) were not directly comparable, either because they did not have 1:1 mappings or because making them equivalent would require significant refactoring:
.*IntegrateStressForElem.*
.*CalcFBHourglassForceForElem.*
.*CalcHourglassControlForElem.*_[12]
.*EvalEOSForElems.*
.*InitStressTermsForElems.*_[12]
.*CalcForceForNodes.*
.*CalcVolumeForceForElems.*
.*ApplyMaterialPropertiesAndUpdateVol.*
.*AddNodeForcesFromElems.*

18 Further Demonstration of OpenMP Usage
Tutorial for Common OpenMP Usage Patterns

19 More Effective OpenMP
Memory transfers will often be the greatest performance hindrance, initially.
- Share device/GPU data between target regions with directives/functions: target data, target (enter|exit) data, omp_target_alloc (see the sketch below)
- Share device memory with CUDA: use_device_ptr, is_device_ptr
- Update values to/from the device without destroying maps/reference counts: target update
Overlapping execution can also improve performance:
- Utilizing multiple GPUs with the device clause
- Asynchronous offloading with the nowait clause
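The following slides cover target data, target update, is_device_ptr, use_device_ptr, the device clause, and nowait, but not target (enter|exit) data or omp_target_alloc. Here is a minimal sketch of those two (not from the original deck); the Init/Step/Finalize structure and the names are assumptions for illustration.

#include <omp.h>

// Unstructured mapping: keep a buffer resident on the GPU across many calls
// without wrapping everything in one target data scope.
void Init(double *field, int n)
{
  #pragma omp target enter data map(to: field[:n])   // copy to GPU, stays mapped
}

void Step(double *field, int n)
{
  // field is already present on the device; no copies happen here.
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; ++i)
    field[i] = field[i] * 0.5;
}

void Finalize(double *field, int n)
{
  #pragma omp target exit data map(from: field[:n])  // copy back and unmap
}

// omp_target_alloc: raw device memory, handed to a region with is_device_ptr.
void ScratchExample(int n)
{
  int dev = omp_get_default_device();
  double *scratch = (double *) omp_target_alloc(n * sizeof(double), dev);

  #pragma omp target teams distribute parallel for is_device_ptr(scratch)
  for (int i = 0; i < n; ++i)
    scratch[i] = i;

  omp_target_free(scratch, dev);
}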

20 More Effective OpenMP: target data
target data specifies a scope in which the mapping applies; target (enter|exit) data is the unscoped (unstructured) form.

void DoSomething(int *x, int *y, int *z, int increment, int m, int n)
{
  #pragma omp target data map(to: x[:n], y[:n]) map(from: z[:n])
  {
    #pragma omp target map(to: x[:n]) firstprivate(increment)
    for (int i = 0; i < m; ++i) {
      x[i] = x[i] + increment;
    }

    #pragma omp target map(to: x[:n], y[:n]) map(from: z[:n])
    for (int i = 0; i < n; ++i) {
      z[i] = x[i] + y[i];
    }
  }
}

(Diagram: x and y are copied to the GPU and z is allocated there when the target data region is entered; z is copied back and the device buffers are freed when the region ends.)

Example: Sharing device data between GPU target regions with target data

21 More Effective OpenMP: target update
target update can update values to/from the host without modifying the mapping status.

void DoSomething(int *x, int *y, int *z, int increment, int m, int n)
{
  int err = 0;
  #pragma omp target data \
      map(to: x[:n], y[:n]) map(from: z[:n]) map(tofrom: err)
  {
    #pragma omp target map(to: x[:n]) firstprivate(increment)
    for (int i = 0; i < m; ++i) {
      x[i] = x[i] + increment;
      if (x[i] < 0) { err = 1; }
    }

    #pragma omp target update from(err)
    if (err) { HandleError(err); }

    ... // Remaining target regions
  }
}

(Diagram: x, y, z, and err are mapped to the GPU; target update copies only err back to the CPU for the check while everything else stays on the device.)

Example: Using target update to check GPU data without changing the mapping for the data

22 More Effective OpenMP: is_device_ptr
Use is_device_ptr to pass device-allocated memory (e.g. from cudaMalloc) to an OpenMP target region. It is also useful for passing unified-memory pointers around (avoids the mapping rules).

// cuda_alloc.cu
int *AllocAndInitialize(int init, int length)
{
  int *d_data;
  cudaMalloc(&d_data, length * sizeof(*d_data));
  InitKernel<<<nBlk, nThd>>>(d_data, init, length);  // Set all elements to init
  return d_data;
}

// omp_kernel.cc
void DoSomething()
{
  const int length = 1024;
  int *devMemFromCuda = AllocAndInitialize(5, length);

  #pragma omp target is_device_ptr(devMemFromCuda)
  for (int i = 0; i < length; ++i) {
    devMemFromCuda[i] = devMemFromCuda[i] * 2;
  }
}

(Diagram: d_data is allocated on the GPU by cudaMalloc; the OpenMP target region uses that same device pointer directly.)

Example: Sharing device memory between OpenMP and CUDA with is_device_ptr

23 More Effective OpenMP: use_device_ptr
Use use_device_ptr to pass mapped device memory from OpenMP to another programming model (e.g. CUDA).

// cuda_launch_kernel.cu
void LaunchCudaIncrement(int *data, int length)
{
  IncrementKernel<<<nBlk, nThd>>>(data, length);
}

// omp_map_and_call_cuda.cc
void DoSomething()
{
  const int len = 1024;
  int data[len] = {0,};

  #pragma omp target data map(data[:len]) use_device_ptr(data)
  {
    LaunchCudaIncrement(data, len);   // receives the device copy of data

    #pragma omp target map(data[:len])
    for (int i = 0; i < len; ++i) {
      data[i] = data[i] * 2;
    }
  }
}

(Diagram: data is copied to the GPU at the start of the target data region; both the CUDA kernel and the OpenMP target region operate on the device copy, which is copied back at the end.)

Example: Sharing device memory between OpenMP and CUDA with use_device_ptr

24 More Effective OpenMP: Using Multiple GPUs
The device clause specifies which device to execute on and can be evaluated at runtime. Use it in combination with omp_get_num_devices() to query the machine's device count.

void DoSomething()
{
  const int len = 1024;
  int data[len] = {0,};
  int numDevices = omp_get_num_devices();

  #pragma omp target map(data[:len]) device(0 % numDevices)
  for (int i = 0; i < len/2; ++i) {
    data[i] = data[i] * 2;
  }

  #pragma omp target map(data[:len]) device(1 % numDevices)
  for (int i = len/2; i < len; ++i) {
    data[i] = data[i] * 2;
  }
}

Example: Launching work to multiple devices (GPUs)

25 More Effective OpenMP: Using Multiple GPUs (Continued)
The device clause can be used with parallel sections to launch work to multiple GPUs in parallel.

void DoSomething()
{
  const int len = 1024;
  int data[len] = {0,};
  int numDevices = omp_get_num_devices();

  #pragma omp parallel sections
  {
    #pragma omp section
    #pragma omp target map(data[:len]) device(0)
    for (int i = 0; i < len/2; ++i) {
      data[i] = data[i] * 2;
    }

    #pragma omp section
    #pragma omp target map(data[:len]) device(1)
    for (int i = len/2; i < len; ++i) {
      data[i] = data[i] * 2;
    }
  }
}

Example: Launching work to multiple devices (GPUs) from two host threads


26 More Effective OpenMP: Asynchronous Offloading
By default, there is an implicit barrier at the end of a target region (similar to cudaDeviceSynchronize). The nowait clause removes this implicit barrier so the host can continue while the target region runs.

void DoSomething(int *data, int len)
{
  #pragma omp target map(data[:len]) nowait
  for (int i = 0; i < len/2; ++i) {
    data[i] = data[i] * 2;
  }

  #pragma omp target map(data[:len]) nowait
  for (int i = len/2; i < len; ++i) {
    data[i] = data[i] * 2;
  }

  #pragma omp taskwait  // wait for both asynchronous target regions to complete
}

(Diagram: target region 1 and target region 2 are dispatched back to back; their execution and completion overlap.)

Example: Asynchronous offloading with the nowait clause
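Building on the device clause and nowait, here is a minimal sketch (not from the original deck) that generalizes the earlier two-GPU example to however many devices omp_get_num_devices() reports. It assumes at least one offload device is present and that the chunks are independent.

#include <omp.h>

void DoubleOnAllDevices(int *data, int len)
{
  int numDevices = omp_get_num_devices();   // assumed >= 1 for this sketch
  int chunk = (len + numDevices - 1) / numDevices;

  // Dispatch one asynchronous target region per device, each handling a chunk.
  for (int d = 0; d < numDevices; ++d) {
    int begin = d * chunk;
    int end   = (begin + chunk < len) ? begin + chunk : len;

    #pragma omp target map(tofrom: data[begin:end - begin]) device(d) nowait
    for (int i = begin; i < end; ++i)
      data[i] = data[i] * 2;
  }

  #pragma omp taskwait  // wait for all asynchronous target regions to finish
}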

27 Using the CUDA Toolkit with OpenMP Programs Compiled by XL Compilers

28 Profiling With NVPROF and NVVP (Continued)
Same invocation as normal: nvvp ./lulesh2.0

(Screenshot: NVVP timeline of the offloaded LULESH run.)

