Killdevil: Running CUDA programs on the cluster


1 Killdevil: Running CUDA programs on the cluster

2 Requesting permission
https://onyen.unc.edu/cgi-bin/unc_id/services

3 Compiling CUDA programs
module load cuda
Run script: compile.sh
– nvcc -o MatrixMul -I/usr/local/cuda/include/ -L/usr/local/lib64 -L/usr/local/cuda/lib64 MatrixMul.cu

4 Running CUDA programs
ssh killdevil.unc.edu
module load cuda
Run script: submitjob.sh
– bsub -q gpu -a gpuexcl_t -n 1 -o MYGPUJOB.o%J
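Before submitting a long job, it can help to confirm that the node the scheduler picked actually exposes a GPU. The following is a minimal sketch, not part of the original slides; the file name devicecheck.cu is hypothetical, and the program uses only the standard CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties.

```cuda
// devicecheck.cu (hypothetical file name): list the CUDA devices visible to this process.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("CUDA devices visible: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) {
            continue;  // skip a device whose properties cannot be read
        }
        std::printf("  device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }

    // A non-zero exit code makes a GPU-less node easy to spot in the job output file.
    return count > 0 ? 0 : 1;
}
```

It should compile with nvcc in the same way as MatrixMul.cu, for example nvcc -o devicecheck devicecheck.cu, after module load cuda.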

5 CUDA SDK
https://developer.nvidia.com/cuda-downloads
– Download the SDK for your OS
Windows: requires Visual Studio to compile the samples
Linux: requires gcc

6 CUDA : Threads

7 Recap Kernel program is executed by a grid of threads
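As an illustration of this recap, here is a minimal sketch of a kernel and its launch. The kernel name square and the size of 8 threads are assumptions chosen for the example. The grid here contains a single block, so the block-local threadIdx.x is enough to identify each thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread in the grid executes this same function body.
__global__ void square(int *out) {
    int i = threadIdx.x;  // one block only, so the local index is unique
    out[i] = i * i;
}

int main() {
    const int n = 8;
    int host[n];
    int *dev = NULL;

    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Launch a grid of 1 block containing n threads.
    square<<<1, n>>>(dev);

    // The copy waits for the kernel and also reports any launch error.
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < n; ++i) {
        std::printf("out[%d] = %d\n", i, host[i]);
    }
    return 0;
}
```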

8 Thread Organization
Organized in a two-level hierarchy
– A grid is composed of blocks; gridDim gives the number of blocks in the grid
– A block is composed of threads; blockDim gives the number of threads in the block
Each block gets a unique ID: blockIdx
Each thread gets an ID that is unique within its block: threadIdx

9 Thread Organization
Each block has an equal number of threads
– blockDim.x, blockDim.y, blockDim.z
threadIdx is always local to the block
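The two-level hierarchy can be observed directly from inside a kernel. The sketch below is an illustration under assumed sizes (a 3 x 2 grid of 4 x 2 x 2 blocks) with a hypothetical kernel name, recordBlockSize. Thread (0, 0, 0) of every block records that block's thread count, which shows both that threadIdx restarts in each block and that all blocks have the same size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void recordBlockSize(int *threadsPerBlock) {
    // Flatten the block's 2D position in the grid into a single index.
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;

    // threadIdx is local to the block, so every block has a thread (0, 0, 0).
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
        threadsPerBlock[blockId] = blockDim.x * blockDim.y * blockDim.z;
    }
}

int main() {
    const dim3 grid(3, 2);      // gridDim  = (3, 2, 1): 6 blocks
    const dim3 block(4, 2, 2);  // blockDim = (4, 2, 2): 16 threads per block
    const int numBlocks = grid.x * grid.y * grid.z;

    int host[6];
    int *dev = NULL;
    if (cudaMalloc(&dev, numBlocks * sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    recordBlockSize<<<grid, block>>>(dev);

    cudaError_t err = cudaMemcpy(host, dev, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int b = 0; b < numBlocks; ++b) {
        std::printf("block %d has %d threads\n", b, host[b]);
    }
    return 0;
}
```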

10 1D Example
Grid = 128 blocks
Block = 32 threads
– blockDim.x in the kernel returns 32
Total threads = 128 x 32 = 4096
– Each thread has a unique global ID: blockIdx.x * blockDim.x + threadIdx.x
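The 1D example above can be checked with a short program. This is a sketch with a hypothetical kernel name, writeGlobalId: each of the 4096 threads writes its global ID into the array slot with the same index, and the host then verifies that every slot was written exactly as expected.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void writeGlobalId(int *out) {
    // Block offset plus the block-local thread index gives a unique global ID.
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    out[id] = id;
}

int main() {
    const int blocks = 128;
    const int threadsPerBlock = 32;
    const int total = blocks * threadsPerBlock;  // 4096 threads in the grid

    std::vector<int> host(total, -1);
    int *dev = NULL;
    if (cudaMalloc(&dev, total * sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    // Fill with 0xFF bytes (-1 as int) so an unwritten slot would be detectable.
    cudaMemset(dev, 0xFF, total * sizeof(int));

    writeGlobalId<<<blocks, threadsPerBlock>>>(dev);

    cudaError_t err = cudaMemcpy(host.data(), dev, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < total; ++i) {
        if (host[i] != i) ++mismatches;
    }
    std::printf("%d threads launched, %d mismatches\n", total, mismatches);
    return mismatches == 0 ? 0 : 1;
}
```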

11 Multi-Dimension Example

12 Things to Note
Blocks are organized into 3D arrays of threads
– Use 1D, 2D, or 3D depending on your problem
– Vector sum: 1D; matrix multiplication: 2D
All blocks in a grid have the same dimensions
– i.e., all blocks have an equal number of threads in each dimension
The total size of a block is limited to 512 threads on devices of compute capability 1.x (the limit is 1024 on compute capability 2.0 and later)
– With a 512-thread limit, blockDim can be (512, 1, 1), (8, 16, 2), or (16, 16, 2)
– But not (32, 32, 1): 32 x 32 x 1 = 1024 threads, which exceeds 512
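Because the block-size limit depends on the device, it is safer to query it than to hard-code it. The sketch below assumes device 0 is the GPU of interest and tests the four block shapes from this slide against the limits reported by cudaGetDeviceProperties (maxThreadsPerBlock and the per-dimension maxThreadsDim).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumption: device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s: at most %d threads per block\n", prop.name, prop.maxThreadsPerBlock);

    // Block shapes taken from the slide.
    const dim3 candidates[] = {
        dim3(512, 1, 1), dim3(8, 16, 2), dim3(16, 16, 2), dim3(32, 32, 1)
    };
    const int numCandidates = sizeof(candidates) / sizeof(candidates[0]);

    for (int i = 0; i < numCandidates; ++i) {
        const dim3 b = candidates[i];
        const unsigned total = b.x * b.y * b.z;

        // A shape is valid only if the total and each dimension are within limits.
        const bool fits = total <= (unsigned)prop.maxThreadsPerBlock &&
                          b.x <= (unsigned)prop.maxThreadsDim[0] &&
                          b.y <= (unsigned)prop.maxThreadsDim[1] &&
                          b.z <= (unsigned)prop.maxThreadsDim[2];

        std::printf("(%u, %u, %u) = %u threads: %s\n",
                    b.x, b.y, b.z, total, fits ? "allowed" : "too large");
    }
    return 0;
}
```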

13 Using blockIdx and threadIdx
[Figure: a width x width matrix whose elements are labeled with (column, row) coordinates, from (0, 0) to (width-1, width-1)]
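To show how blockIdx and threadIdx map onto matrix coordinates, here is a sketch of a multi-block matrix multiplication. It is an illustration rather than the MatrixMul.cu used in the course: the kernel name matMul, the 64 x 64 size, and the 16 x 16 block shape are all assumptions. Each thread computes one output element, with x selecting the column and y selecting the row.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void matMul(const float *a, const float *b, float *c, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x indexes columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y indexes rows
    if (row < width && col < width) {                 // guard for partial blocks
        float sum = 0.0f;
        for (int k = 0; k < width; ++k) {
            sum += a[row * width + k] * b[k * width + col];
        }
        c[row * width + col] = sum;
    }
}

int main() {
    const int width = 64;
    const int n = width * width;
    const size_t bytes = n * sizeof(float);

    // A holds small integers; B is the identity, so the product must equal A.
    std::vector<float> a(n), b(n, 0.0f), c(n, 0.0f);
    for (int i = 0; i < n; ++i) a[i] = (float)(i % 7);
    for (int i = 0; i < width; ++i) b[i * width + i] = 1.0f;

    float *dA = NULL, *dB = NULL, *dC = NULL;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(16, 16);  // 256 threads per block
    const dim3 grid((width + block.x - 1) / block.x,
                    (width + block.y - 1) / block.y);  // enough blocks to cover the matrix
    matMul<<<grid, block>>>(dA, dB, dC, width);

    cudaError_t err = cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (c[i] != a[i]) ++mismatches;
    }
    std::printf("%d mismatches\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Rounding the grid size up and guarding with row < width && col < width lets the same kernel handle widths that are not a multiple of the block size.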

14 Matrix-Multiplication with larger size

15 Simple example

16 Updated kernel code

17 Block scheduling on device

18 Thread Assignment

19

20 QUESTIONS?

