Programming GPUs using Directives Alan Gray EPCC The University of Edinburgh

Acknowledgement
Alistair Hart (Cray) and Mark Bull (EPCC) contributed material to this lecture.

Overview
– Introduction to GPU directives
– OpenACC
– New OpenMP 4.0 accelerator support

Accelerator Directives
Language extensions, e.g. CUDA or OpenCL, allow programmers to interface with the GPU
– This gives control to the programmer, but is often tricky and time consuming, and results in complex/non-portable code
An alternative approach is to let the compiler automatically accelerate code sections on the GPU (including decomposition, data transfer, etc.)
There must be a mechanism to provide the compiler with hints about which sections should be accelerated, the parallelism, data usage, etc.
Directives provide this mechanism
– Special syntax which is understood by accelerator compilers and ignored (treated as code comments) by non-accelerator compilers
– The same source code can be compiled for a CPU/GPU combination or for the CPU only
– c.f. OpenMP

Accelerator Directives
Several variants emerged, including:
– PGI Accelerator compiler for C/Fortran
– Using directives, translates to CUDA (future releases to PTX)
– NVIDIA GPUs only
– CAPS HMPP compiler
– Using directives, translates to CUDA or OpenCL
– NVIDIA or AMD GPUs
– Pathscale compiler: supports NVIDIA GPUs through the HMPP programming model
– Cray compiler
– Cray systems only
All of the above are commercial products
The directives differ between products, but are similar

OpenACC
The OpenACC standard was announced in November 2011
– By CAPS, Cray, NVIDIA and PGI
Applies only to NVIDIA GPUs (so far)
Examples later

OpenMP Accelerator Directives
All OpenACC partners, plus Intel and AMD (and several other organisations including EPCC), formed a subcommittee of the OpenMP committee to look at extending the OpenMP directive standard to support accelerators.
More comprehensive and with a wider remit than OpenACC
Accelerator support is now included in OpenMP 4.0

OpenACC
We will now illustrate accelerator directives using OpenACC
For the definitive guide, with the full list of available directives, clauses, options etc., see the OpenACC specification

OpenACC Directives
With directives inserted, the compiler will attempt to compile the key kernels for execution on the GPU, and will manage the necessary data transfers automatically.
Directive format:
– C: #pragma acc ...
– Fortran: !$acc ...
These are ignored by non-accelerator compilers

Accelerator Parallel Construct
The programmer specifies which regions of code should be offloaded to the accelerator with the parallel construct.

C:
#pragma acc parallel
{
  ...code region...
}

Fortran:
!$acc parallel
  ...code region...
!$acc end parallel

Note: this directive is usually not sufficient on its own – it needs (at least) to be combined with the loop directive (next slide)

Accelerator Loop Construct
The loop construct is applied immediately before a loop, specifying that it should be parallelised on the accelerator.

C:
#pragma acc loop
for(...){
  ...loop body...
}

Fortran:
!$acc loop
do ...
  ...loop body...
end do

Accelerator Loop Construct
The loop construct must be used inside a parallel construct.

C:
#pragma acc parallel
{
  ...
  #pragma acc loop
  for(...){
    ...loop body...
  }
  ...
}

Fortran:
!$acc parallel
  ...
  !$acc loop
  do ...
    ...loop body...
  end do
  !$acc end loop
  ...
!$acc end parallel

Multiple loop constructs may be used within a single parallel construct.

Accelerator Loop Construct
The parallel loop construct is shorthand for the combination of a parallel and a (single) loop construct.

C:
#pragma acc parallel loop
for(...){
  ...loop body...
}

Fortran:
!$acc parallel loop
do ...
  ...loop body...
end do
!$acc end parallel loop

Parallel Loop Example

!$acc parallel loop
do i=1, N
  output(i)=2.*input(i)
end do
!$acc end parallel loop

The compiler automatically offloads the loop to the GPU (using default values for the parallel decomposition), and performs the necessary data transfers.
Use of parallel loop may be sufficient, on its own, to get code running on the GPU, but further directives and clauses exist to give the programmer more control, to improve performance and to enable more complex cases.

Tuning Clauses for loop construct
Clauses can be added to loop (or parallel loop) directives
gang, worker, vector
– Target a specific loop at a specific level of the hardware
– gang ↔ CUDA threadblock (scheduled on a single SM)
– worker ↔ CUDA warp of threads (32 on Fermi)
– vector ↔ CUDA threads in a warp executing in SIMT lockstep
– You can specify more than one: !$acc loop gang worker vector schedules the loop over all levels of the hardware
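As a sketch (not from the slides), the gang and vector clauses might be applied to a nested loop as follows; the arrays a and b and the size N are assumed for illustration:

  /* Outer loop split across gangs (CUDA threadblocks),
     inner loop across the vector lanes (CUDA threads) of each gang. */
  #pragma acc parallel loop gang
  for (int i = 0; i < N; i++) {
    #pragma acc loop vector
    for (int j = 0; j < N; j++) {
      b[i*N + j] = 2.0f * a[i*N + j];
    }
  }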

Tuning Clauses for parallel construct
To be added to parallel (or parallel loop) directives
num_gangs, num_workers, vector_length
– Tune the amount of parallelism used
– Equivalent to setting the number of threads per block and the number of blocks in CUDA
The easiest way to set the number of threads per block is to specify the vector_length(NTHREADS) clause
– NTHREADS must be one of: 1, 64, 128 (default), 256, 512, 1024
– NTHREADS > 32 is automatically decomposed into warps of length 32
– You don't need to specify the number of threadblocks
E.g. to specify 128 threads per block:
#pragma acc parallel vector_length(128)
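A minimal sketch combining these tuning clauses on a loop; the values (1024 gangs, 256 threads per block) and the arrays x and y are illustrative assumptions:

  /* Request 1024 threadblocks of 256 threads each for this loop. */
  #pragma acc parallel loop num_gangs(1024) vector_length(256)
  for (int i = 0; i < N; i++) {
    y[i] = 2.0f * x[i] + y[i];
  }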

Other clauses
Further clauses for loop (or parallel loop) directives
seq: loop is executed sequentially
independent: compiler hint that the loop iterations are independent
if(logical): executes on the GPU if .TRUE. at runtime, otherwise on the CPU
reduction: as in OpenMP
cache: specified data is held in a software-managed data cache, e.g. explicit blocking to shared memory on NVIDIA GPUs
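A short sketch using two of these clauses together; the threshold and the arrays x and y are assumptions for illustration:

  /* Sum reduction on the GPU, but only if the problem is large enough
     to be worth offloading; for small N the loop runs on the CPU. */
  double sum = 0.0;
  #pragma acc parallel loop reduction(+:sum) if(N > 10000)
  for (int i = 0; i < N; i++) {
    sum += x[i] * y[i];
  }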

Data Management

!$acc parallel loop
do i=1, N
  output(i)=2.*input(i)
end do
!$acc end parallel loop

write(*,*) "finished 1st region"

!$acc parallel loop
do i=1, N
  output(i)=i*output(i)
end do
!$acc end parallel loop

Consider this case with 2 loops. The output array will be unnecessarily copied from/to the device between the regions.
Host/device data copies are very expensive.

Accelerator Data Construct
Allows more efficient memory management

C:
#pragma acc data
{
  ...code region...
}

Fortran:
!$acc data
  ...code region...
!$acc end data

Accelerator Data Construct

!$acc data copyin(input) copyout(output)
!$acc parallel loop
do i=1, N
  output(i)=2.*input(i)
end do
!$acc end parallel loop

write(*,*) "finished first region"

!$acc parallel loop
do i=1, N
  output(i)=i*output(i)
end do
!$acc end parallel loop
!$acc end data

The output array is no longer unnecessarily transferred between host and device between the kernel calls.

Data clauses
Applied to data and parallel [loop] regions
copy, copyin, copyout
– Copy data "in" to the GPU at the start of the region and/or "out" to the CPU at the end
– copy means copyin and copyout
– Supply a list of arrays or array sections (using ":" notation)
– N.B. Fortran uses start:end; C/C++ uses start:length
– e.g. first N elements of an array: Fortran 1:N; C/C++ 0:N
create
– Do not copy at all – useful for temporary arrays
– Host copy still exists
private, firstprivate
– As per OpenMP
– Scalars are private by default
present
– Specify that data is already present on the GPU, so copying should be avoided (example later)
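A sketch of these clauses in C, assuming arrays input, output and a temporary tmp, each of length N (note the C/C++ start:length section syntax):

  /* input is only copied in, output is only copied out, and tmp is a
     device-only temporary that is never transferred. */
  #pragma acc data copyin(input[0:N]) copyout(output[0:N]) create(tmp[0:N])
  {
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
      tmp[i] = 2.0f * input[i];

    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
      output[i] = (float)i * tmp[i];
  }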

Sharing GPU data between subroutines

PROGRAM main
  INTEGER :: a(N)
  ...
  !$acc data copy(a)
  !$acc parallel loop
  DO i = 1,N
    a(i) = i
  ENDDO
  !$acc end parallel loop
  CALL double_array(a)
  !$acc end data
  ...
END PROGRAM main

SUBROUTINE double_array(b)
  INTEGER :: b(N)
  !$acc parallel loop present(b)
  DO i = 1,N
    b(i) = 2*b(i)
  ENDDO
  !$acc end parallel loop
END SUBROUTINE double_array

The present data clause allows the programmer to specify that the data is already on the device, and so should not be copied again.

Update directive
!$acc update [host|device]
– Copies specified arrays (or slices) within a data region
– Useful if you only need to send a small subset of the data to/from the GPU
– e.g. halo exchange for a domain-decomposed parallel code
– or sending a few array elements to the CPU for printing/debugging
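For example, a sketch of updating only a small slice inside a data region (the array u and the slice size are illustrative assumptions):

  /* The full array stays resident on the GPU for the whole region... */
  #pragma acc data copy(u[0:N])
  {
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
      u[i] = 2.0f * u[i];

    /* ...but only the first 4 elements are copied back, e.g. for debugging. */
    #pragma acc update host(u[0:4])
  }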

Kernels directive
The kernels directive exists as an alternative to the parallel directive.
kernels emerged from the PGI model, while parallel emerged from the Cray model
– Both are in the standard as a compromise
parallel and kernels regions look very similar
– Both define a region to be accelerated
– Different levels of obligation for the compiler

Kernels directive
parallel
– Prescriptive (like the OpenMP programming model)
– Uses a single accelerator kernel to accelerate the region
– The compiler will accelerate the region even if this leads to incorrect results
kernels
– Descriptive (like the PGI Accelerator programming model)
– Uses one or more accelerator kernels to accelerate the region
– The compiler may accelerate the region
– It may not if it is not sure that the loop iterations are independent
Within a kernels region, the compiler will automatically try to offload all it can to the GPU (even if there are no loop directives).
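A minimal sketch of a kernels region containing two loops (arrays a, b, c and size N assumed); the compiler is free to generate a separate accelerator kernel for each loop, or to decline if it cannot prove the iterations independent:

  #pragma acc kernels
  {
    /* The compiler analyses each loop and decides how to offload it. */
    for (int i = 0; i < N; i++)
      b[i] = 2.0f * a[i];

    for (int i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  }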

See the documentation for the full list of directives and clauses.
Runtime library routines are available to, e.g.
– Retrieve information about the GPU hardware environment
– Specify which device to use
– Explicitly initialise the accelerator (helps with benchmarking)
Environment variables
– e.g. can be set to specify which device to use
There are still a number of limitations with the model and current implementations
– Meaning that the feasibility of use depends on the complexity of the code
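A rough sketch of the runtime API (routines declared in openacc.h); the device-selection logic here is an assumption for illustration:

  #include <openacc.h>
  #include <stdio.h>

  int main(void)
  {
    /* Query how many NVIDIA GPUs are visible to the runtime. */
    int ndev = acc_get_num_devices(acc_device_nvidia);
    printf("found %d NVIDIA device(s)\n", ndev);

    /* Choose device 0 and initialise it explicitly, so that the start-up
       cost is paid here rather than inside a timed kernel. */
    acc_set_device_num(0, acc_device_nvidia);
    acc_init(acc_device_nvidia);
    return 0;
  }

Alternatively, the ACC_DEVICE_NUM environment variable can be set to select the device without changing the code.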

OpenMP 4 Accelerator support
Similar to, but not the same as, OpenACC directives.
Support for more than just loops
Less reliance on the compiler to parallelise and map code to threads
Not GPU specific
– Suitable for Xeon Phi or DSPs, for example
Fully integrated into the rest of OpenMP

Host-centric model with one host device and multiple target devices of the same type.
device: a logical execution engine with local storage.
device data environment: a data environment associated with a target data or target region.
target constructs control how data and code are offloaded to a device.
Data is mapped from a host data environment to a device data environment.

target regions
Code inside a target region is executed on the device.
It executes sequentially by default.
Can include other OpenMP directives to make it run in parallel.
Clauses control data movement.
Can specify which device to use.

#pragma omp target map(to:B,C), map(tofrom:sum)
#pragma omp parallel for reduction(+:sum)
for (int i=0; i<N; i++){
  sum += B[i] + C[i];
}

The target data construct just moves data and does not execute code (c.f. #pragma acc data in OpenACC).
– Can have multiple target regions inside a target data region
– Allows data to persist on the device between target regions
The target update construct updates data during a target data region.
declare target compiles a version of a function/subroutine that can be called on the device.
Target regions are blocking: the encountering thread waits for them to complete.
– Asynchronous behaviour can be achieved by using target regions inside tasks (with dependencies if required).
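A hedged sketch of a target data region holding two target regions, with a target update in between; the arrays A, B, C and size N are assumed:

  /* B and C are copied to the device once; A is copied back at the end. */
  #pragma omp target data map(to:B[0:N],C[0:N]) map(from:A[0:N])
  {
    #pragma omp target
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      A[i] = B[i] + C[i];

    /* Refresh B on the device if the host has changed it. */
    #pragma omp target update to(B[0:N])

    #pragma omp target
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      A[i] += B[i] * C[i];
  }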

What about GPUs?
Executing a target region on a GPU can only use one multiprocessor
– The synchronisation required for OpenMP is not possible between multiprocessors
– Not much use!
The teams construct creates multiple master threads which can execute in parallel, and spawn parallel regions, but cannot synchronise or communicate with each other.
The distribute construct spreads the iterations of a parallel loop across teams.
– The only schedule option is static (with an optional chunksize).

Example

#pragma omp target teams distribute parallel for \
        map(to:B,C), map(tofrom:sum) reduction(+:sum)
for (int i=0; i<N; i++){
  sum += B[i] + C[i];
}

Executes the loop on the device and distributes the iterations across multiprocessors and across threads within each multiprocessor.

Summary
A directives-based approach to programming GPUs is potentially much simpler and more productive than direct use of language extensions
– More portable
– Higher level: the compiler automates much of the work
– Less flexible and possibly poorer performance
– Limitations may pose challenges for complex codes
The OpenACC standard emerged in 2011
– We went through the main concepts with examples
OpenMP 4 now incorporates accelerators
– More comprehensive and with wider participation than OpenACC