Debugging PGI Compilers for Heterogeneous Supercomputing

Common OpenACC Errors
- acc parallel or loop independent errors (not a parallel loop)
- data bounds errors (not enough data moved to the device)
- stale data on device or host (missing update)
- present error (missing data clause somewhere)
- roundoff error (differences in float arithmetic, host vs device)
- roundoff error for summation (parallel accumulation)
- async errors (missing wait)
- compiler error (ask for help)
- other runtime error (need debugger or other help)
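
The "stale data on device or host (missing update)" case is the one that most often surprises new OpenACC users. The sketch below is not from the slides (the program and variable names are made up); it shows a host-side assignment inside a data region that the device never sees unless an update directive is added:

  program stale_data
    implicit none
    integer, parameter :: n = 1024
    real :: x(n), y(n)
    integer :: i
    x = 1.0
    y = 0.0
    !$acc data copyin(x) copyout(y)
    x = 2.0                   ! host-only change; the device copy of x is now stale
    !$acc update device(x)    ! omit this line and the loop below still sees x = 1.0
    !$acc parallel loop
    do i = 1, n
       y(i) = 2.0 * x(i)
    end do
    !$acc end data
    print *, y(1)             ! 4.0 with the update, 2.0 without it
  end program stale_data

The symmetric error (stale host data) happens when the device writes an array and the host reads it before an !$acc update host or the end of the data region.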

DEBUGGING PGI CUDA FORTRAN AND OPENACC ON GPUS WITH ALLINEA DDT
Sebastien Deldon (PGI), Beau Paisley (Allinea)

TALK HIGHLIGHTS
- Brief CUDA Fortran overview
- Brief OpenACC overview
- CUDA Fortran/OpenACC debug info generation
- Allinea DDT overview and features
- CUDA Fortran & OpenACC live debugging demos

3 WAYS TO PROGRAM ACCELERATORS

CUDA FORTRAN

Device Code:

  attributes(global) subroutine mm_kernel( A, B, C, N, M, L )
    real :: A(N,M), B(M,L), C(N,L), Cij
    integer, value :: N, M, L
    integer :: i, j, kb, k, tx, ty
    real, shared :: Asub(16,16), Bsub(16,16)
    tx = threadidx%x
    ty = threadidx%y
    i = (blockidx%x-1) * 16 + tx
    j = (blockidx%y-1) * 16 + ty
    Cij = 0.0
    do kb = 1, M, 16
      Asub(tx,ty) = A(i,kb+tx-1)
      Bsub(tx,ty) = B(kb+ty-1,j)
      call syncthreads()
      do k = 1,16
        Cij = Cij + Asub(tx,k) * Bsub(k,ty)
      enddo
      call syncthreads()
    enddo
    C(i,j) = Cij
  end subroutine mm_kernel

Host Code:

  real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
  ...
  allocate (Adev(N,M), Bdev(M,L), Cdev(N,L))
  Adev = A(1:N,1:M)
  Bdev = B(1:M,1:L)
  call mm_kernel<<<dim3(N/16,L/16,1),dim3(16,16,1)>>>( Adev, Bdev, Cdev, N, M, L )
  C(1:N,1:L) = Cdev
  deallocate ( Adev, Bdev, Cdev )
  ...

!$CUF KERNEL DIRECTIVES

CUF kernel directive version:

  module madd_device_module
    use cudafor
  contains
    subroutine madd_dev(a,b,c,sum,n1,n2)
      real, dimension(:,:), device :: a,b,c
      real :: sum
      integer :: n1,n2
      type(dim3) :: grid, block
      !$cuf kernel do (2) <<<(*,*),(32,4)>>>
      do j = 1,n2
        do i = 1,n1
          a(i,j) = b(i,j) + c(i,j)
          sum = sum + a(i,j)
        enddo
      enddo
    end subroutine
  end module

Equivalent hand-written CUDA kernels:

  module madd_device_module
    use cudafor
    implicit none
  contains
    attributes(global) subroutine madd_kernel(a,b,c,blocksum,n1,n2)
      real, dimension(:,:) :: a,b,c
      real, dimension(:) :: blocksum
      integer, value :: n1,n2
      integer :: i,j,tindex,tneighbor,bindex
      real :: mysum
      real, shared :: bsum(256)
      ! Do this thread's work
      mysum = 0.0
      do j = threadidx%y + (blockidx%y-1)*blockdim%y, n2, blockdim%y*griddim%y
        do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x
          a(i,j) = b(i,j) + c(i,j)
          mysum = mysum + a(i,j)  ! accumulates partial sum per thread
        enddo
      enddo
      ! Now add up all partial sums for the whole thread block
      ! Compute this thread's linear index in the thread block
      ! We assume 256 threads in the thread block
      tindex = threadidx%x + (threadidx%y-1)*blockdim%x
      ! Store this thread's partial sum in the shared memory block
      bsum(tindex) = mysum
      call syncthreads()
      ! Accumulate all the partial sums for this thread block to a single value
      tneighbor = 128
      do while( tneighbor >= 1 )
        if( tindex <= tneighbor ) &
          bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
        tneighbor = tneighbor / 2
        call syncthreads()
      enddo
      ! Store this thread block's partial sum
      bindex = blockidx%x + (blockidx%y-1)*griddim%x
      if( tindex == 1 ) blocksum(bindex) = bsum(1)
    end subroutine

    attributes(global) subroutine madd_sum_kernel(blocksum,dsum,nb)
      real, dimension(:) :: blocksum
      real :: dsum
      integer, value :: nb
      real, shared :: bsum(256)
      integer :: tindex,tneighbor,i
      ! Again, we assume 256 threads in the thread block
      ! Accumulate a partial sum for each thread
      tindex = threadidx%x
      bsum(tindex) = 0.0
      do i = tindex, nb, 256
        bsum(tindex) = bsum(tindex) + blocksum(i)
      enddo
      call syncthreads()
      ! Same block-level reduction as in the previous kernel
      tneighbor = 128
      do while( tneighbor >= 1 )
        if( tindex <= tneighbor ) &
          bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
        tneighbor = tneighbor / 2
        call syncthreads()
      enddo
      if( tindex == 1 ) dsum = bsum(1)
    end subroutine

    subroutine madd_dev(a,b,c,dsum,n1,n2)
      real, dimension(:,:), device :: a,b,c
      real, device :: dsum
      real, dimension(:), allocatable, device :: blocksum
      integer :: n1,n2,nb,r
      type(dim3) :: grid, block
      block = dim3(32,8,1)                   ! 256 threads per block
      grid  = dim3((n1+31)/32,(n2+7)/8,1)
      nb    = grid%x * grid%y
      allocate(blocksum(nb))
      call madd_kernel<<<grid,block>>>(a,b,c,blocksum,n1,n2)
      call madd_sum_kernel<<<1,256>>>(blocksum,dsum,nb)
      r = cudaThreadSynchronize() ! don't deallocate too early
      deallocate(blocksum)
    end subroutine
  end module

OPENACC MEMBERS

OPENACC

  ...
  #pragma acc data copy(b[0:n][0:m]) \
                   create(a[0:n][0:m])
  {
    for (iter = 1; iter <= p; ++iter) {
      #pragma acc kernels
      {
        for (i = 1; i < n-1; ++i) {
          for (j = 1; j < m-1; ++j) {
            a[i][j] = w0*b[i][j] +
                      w1*(b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]) +
                      w2*(b[i-1][j-1] + b[i-1][j+1] + b[i+1][j-1] + b[i+1][j+1]);
          }
        }
        for (i = 1; i < n-1; ++i)
          for (j = 1; j < m-1; ++j)
            b[i][j] = a[i][j];
      }
    }
  }
  ...

(Diagram: B lives in host memory; A and B are allocated in accelerator memory, where successive smoothed iterates S1(B) ... Sp(B) are computed.)
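
Since the rest of the talk is Fortran-centric, here is a rough OpenACC Fortran rendering of the same data-region/kernels pattern. This is a sketch, not from the slides; the subroutine name and the 1-based loop bounds are assumptions:

  subroutine smooth(a, b, w0, w1, w2, n, m, p)
    implicit none
    integer, intent(in) :: n, m, p
    real, intent(in)    :: w0, w1, w2
    real :: a(n,m), b(n,m)
    integer :: i, j, iter
    !$acc data copy(b) create(a)
    do iter = 1, p
       !$acc kernels
       do j = 2, m-1
          do i = 2, n-1
             a(i,j) = w0*b(i,j) + &
                      w1*(b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)) + &
                      w2*(b(i-1,j-1) + b(i-1,j+1) + b(i+1,j-1) + b(i+1,j+1))
          end do
       end do
       do j = 2, m-1
          do i = 2, n-1
             b(i,j) = a(i,j)
          end do
       end do
       !$acc end kernels
    end do
    !$acc end data
  end subroutine smooth

The data construct keeps a and b resident on the device across all p iterations, so only the final b is copied back to the host.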

HOW DOES THE PGI ACCELERATOR COMPILER WORK?

(Compilation flow) C/C++/Fortran OpenACC or CUDA Fortran code goes through the PGI Accelerator compiler, which emits x86 assembly for the host and CUDA C for the device; nvcc from the NVIDIA SDK turns the CUDA C into GPU assembly, and the PGI Accelerator linker combines both into a unified CPU/GPU binary.

NATIVE LLVM CODE GENERATION TO ENABLE DEBUGGING

(Compilation flow) The same source goes through the PGI Accelerator compiler, which now emits x86 assembly for the host and NVVM IR for the device; libnvvm from the NVIDIA SDK turns the NVVM IR into GPU assembly, and the PGI Accelerator linker again produces a unified CPU/GPU binary.

ENABLING DEVICE-SIDE DEBUGGING
- PGI Accelerator native NVVM IR/libnvvm code generator
- Generate debug info using NVVM IR debug metadata
  — Source line correlation
  — Global/local variables
- Debug info for CUDA predefined variables (threadIdx, ...)
- Debug info for Fortran-specific features

CUDA FORTRAN DEBUGGING STATUS
- CUDA Fortran debugging features in PGI 14.1 and later
- One-to-one mapping for source line correlation
- Set and run to breakpoints in CUDA Fortran kernels
- Step through kernel code
- Examine kernel local variables, global variables in device/shared memories, and predefined variables

CUDA FORTRAN DEBUGGING LIMITATIONS
- Lower optimization level when invoking libnvvm — code generation may change
- Array bounds debug information only for constant array bounds
- !$CUF directive support available with PGI 14.4
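
To make the array-bounds limitation concrete, here is a minimal CUDA Fortran sketch (hypothetical names, not from the slides). The fixed-size shared array has constant bounds that debug information can describe; the assumed-shape dummy argument's bounds are only known at run time:

  attributes(global) subroutine bounds_example(x, n)
    implicit none
    real :: x(:)               ! assumed-shape: bounds known only at run time
    integer, value :: n
    real, shared :: tmp(32)    ! constant bounds: debug info can describe 1..32
    integer :: i
    i = threadidx%x
    if (i <= 32 .and. i <= n) then
       tmp(i) = 2.0 * x(i)
       x(i) = tmp(i)
    end if
  end subroutine bounds_example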

OpenACC debug challenges

Original Source Code:

  void MatrixMultiplication( float * restrict a, float * restrict b,
                             float * restrict c, int m, int n, int p)
  {
    int i, j, k;
    #pragma acc data copy(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
    {
      #pragma acc kernels loop independent, gang, vector(8)
      for (i=0; i<m; i++) {
        #pragma acc loop gang, vector(8)
        for (j=0; j<n; j++) {
          #pragma acc loop seq
          for (k=0; k<p; k++)
            a[i*n+j] += b[i*p+k]*c[k*n+j];
        }
      }
    }
  }

  % pgcc -g -O0 -ta=nvidia -acc mmul.c -o mmul
  % ddt ...

Simplified Pseudo Kernel Code (as generated by the compiler):

  extern "C" __global__ __launch_bounds__(64)
  void MatrixMultiplication_20_gpu( float* const __restrict _c,
                                    float* const __restrict _b,
                                    float* _a, int _n, int _p)
  {
    int _i, _j, _k;
    _i = threadIdx.y + blockIdx.y*8;
    _j = threadIdx.x + blockIdx.x*8;
    for (_k=0; _k<_p; _k++)
      _a[_i*_n+_j] += _b[_i*_p+_k]*_c[_k*_n+_j];
  }

OPENACC DEBUG CHALLENGES
- Source line correlation
- Variable correlation
- Variables that are no longer referenced in the generated code
- Do we expose compiler-created variables?
- How to deal with significantly restructured loops?

Simplified Pseudo Kernel Code:

  extern "C" __global__ __launch_bounds__(64)
  void MatrixMultiplication_20_gpu( float* const __restrict _c,
                                    float* const __restrict _b,
                                    float* _a, int _n, int _p)
  {
    int _i, _j, _k;
    _i = threadIdx.y + blockIdx.y*8;
    _j = threadIdx.x + blockIdx.x*8;
    for (_k=0; _k<_p; _k++)
      _a[_i*_n+_j] += _b[_i*_p+_k]*_c[_k*_n+_j];
  }

OPENACC DEBUG STATUS
- Available in PGI 14.4
- Source line correlation
- Debug support for variables turned into kernel parameters
- !$CUF directives debug support

OPENACC DEBUG LIMITATIONS
- Same as for CUDA Fortran debugging
- No support for source variables that are not referenced by the generated kernel
- No support for common block variables passed as kernel parameters
- Limited support for acc routines (see the sketch below)
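
For context on the last bullet, an acc routine is a procedure compiled for the device so it can be called from inside an accelerated region. A minimal OpenACC Fortran sketch (hypothetical module and routine names, not from the slides):

  module ops
  contains
    subroutine scale_elem(v)
      !$acc routine seq          ! compile this routine for the device as well
      real, intent(inout) :: v
      v = 2.0 * v
    end subroutine scale_elem
  end module ops

  subroutine scale_all(x, n)
    use ops
    implicit none
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    integer :: i
    !$acc parallel loop copy(x)
    do i = 1, n
       call scale_elem(x(i))     ! device-side call into the acc routine
    end do
  end subroutine scale_all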

ABOUT ALLINEA DDT
- Graphical debugger designed for:
  — C/C++, Fortran, UPC, CUDA, CUDA Fortran
  — Multithreaded code (single address space)
  — Multiprocess code (interdependent or independent processes)
  — Accelerated codes (GPUs, Intel Xeon Phi)
  — Any mix of the above
- Slash your time to debug:
  — Reproduces and triggers your bugs instantly
  — Helps you quickly understand where issues come from
  — Helps you fix them as swiftly as possible

LET’S SEE DDT IN ACTION

CUDA FORTRAN DEBUG DEMO

  attributes(global) subroutine transposeNoBankConflicts(odata, idata)
    implicit none
    real, intent(out) :: odata(ny,nx)
    real, intent(in)  :: idata(nx,ny)
    real, shared :: tile(TILE_DIM+1, TILE_DIM)
    integer :: x, y, j
    x = (blockIdx%x-1) * TILE_DIM + threadIdx%x
    y = (blockIdx%y-1) * TILE_DIM + threadIdx%y
    do j = 0, TILE_DIM-1, BLOCK_ROWS
      tile(threadIdx%x, threadIdx%y+j) = idata(x,y+j)
    end do
    call syncthreads()
    x = (blockIdx%y-1) * TILE_DIM + threadIdx%x
    y = (blockIdx%x-1) * TILE_DIM + threadIdx%y
    do j = 0, TILE_DIM-1, BLOCK_ROWS
      odata(x,y+j) = tile(threadIdx%y+j, threadIdx%x)
    end do
  end subroutine transposeNoBankConflicts

  % pgfortran -g -O0 -Mcuda mtrans.cuf -o mtrans
  % ddt ...

(Diagram: idata in global memory is staged through the shared-memory tile and written to odata in global memory.)

  devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-fortran/

OpenACC debug demo

Original Source Code:

  void MatrixMultiplication( float * restrict a, float * restrict b,
                             float * restrict c, int m, int n, int p)
  {
    int i, j, k;
    #pragma acc data copy(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
    {
      #pragma acc kernels loop independent, gang, vector(8)
      for (i=0; i<m; i++) {
        #pragma acc loop gang, vector(8)
        for (j=0; j<n; j++) {
          #pragma acc loop seq
          for (k=0; k<p; k++)
            a[i*n+j] += b[i*p+k]*c[k*n+j];
        }
      }
    }
  }

  % pgcc -g -O0 -ta=nvidia -acc mmul.c -o mm
  % ddt ...

Simplified Pseudo Kernel Code:

  extern "C" __global__ __launch_bounds__(64)
  void MatrixMultiplication_20_gpu( float* const __restrict _c,
                                    float* const __restrict _b,
                                    float* _a, int _n, int _p)
  {
    int _i, _j, _k;
    _i = threadIdx.y + blockIdx.y*8;
    _j = threadIdx.x + blockIdx.x*8;
    for (_k=0; _k<_p; _k++)
      _a[_i*_n+_j] += _b[_i*_p+_k]*_c[_k*_n+_j];
  }

COPYRIGHT NOTICE © Contents copyright 2014, NVIDIA Corporation. This material may not be reproduced in any manner without the expressed written permission of NVIDIA. PGFORTRAN, PGF95, PGI Accelerator and PGI Unified Binary are trademarks, and PGI, PGCC, PGC++, PGI Visual Fortran, PVF, PGI CDK, Cluster Development Kit, PGPROF, PGDBG, and The Portland Group are registered trademarks of NVIDIA Corporation. Other brands and names are the property of their respective owners.

BACKUP SLIDES

Debugging CUDA Fortran with Allinea DDT
- Set and run to breakpoints in CUDA Fortran kernels
- View CUDA Fortran kernel source code
- Drill into CUDA thread blocks to examine local variables
- Evaluate data in device shared/global memories
- Track execution stacks for CUDA threads/blocks

View arrays in device shared/global memory

Inspect values in CUDA Fortran multidimensional arrays

Visualize values in CUDA Fortran multidimensional arrays

- View C source code
- Track CUDA thread/block execution stacks in the OpenACC kernel
- Set and run to breakpoints in the OpenACC parallel region
- Examine OpenACC kernel local variables in device memory

Inspect values in OpenACC multidimensional arrays

TRAP ERRORS WHERE THEY OCCUR