Killdevil: Running CUDA programs on cluster

Requesting permission: https://onyen.unc.edu/cgi-bin/unc_id/services



Compiling CUDA programs
– Load the CUDA environment: module load cuda
– Run the compile script compile.sh, which contains:
  nvcc -o MatrixMul -I/usr/local/cuda/include/ -L/usr/local/lib64 -L/usr/local/cuda/lib64 MatrixMul.cu

Running CUDA programs
– Log in to the cluster: ssh killdevil.unc.edu
– Load the CUDA environment: module load cuda
– Run the job submission script submitjob.sh, which contains:
  bsub -q gpu -a gpuexcl_t -n 1 -o MYGPUJOB.o%J

CUDA SDK
– Download the SDK that matches your operating system
– Windows: requires Visual Studio to compile the samples
– Linux: requires gcc

CUDA : Threads

Recap: a kernel program is executed by a grid of threads.

Thread Organization
Threads are organized in a two-level hierarchy:
– A grid is composed of blocks. gridDim: the number of blocks in the grid
– A block is composed of threads. blockDim: the number of threads in the block
Each block gets a unique ID: blockIdx
Each thread gets an ID within its block: threadIdx

Thread Organization
– Every block has the same number of threads, given by blockDim.x, blockDim.y, blockDim.z
– threadIdx is always local to the block
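As a minimal sketch of the hierarchy described above, the kernel below has a single thread copy gridDim and blockDim into an output array so the host can inspect them. The names reportDims and queryDims are hypothetical helpers made up for illustration; only the built-in variables come from CUDA itself.

```cuda
#include <cuda_runtime.h>

// Every thread sees the same gridDim and blockDim, so one writer is enough.
__global__ void reportDims(unsigned int *out)
{
    if (blockIdx.x == 0 && blockIdx.y == 0 &&
        threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
        out[0] = gridDim.x;   // blocks per grid in x
        out[1] = gridDim.y;   // blocks per grid in y
        out[2] = blockDim.x;  // threads per block in x
        out[3] = blockDim.y;  // threads per block in y
        out[4] = blockDim.z;  // threads per block in z
    }
}

// Hypothetical host helper: launches the kernel with the given shapes and
// copies the five recorded values back. Returns false on any CUDA error.
bool queryDims(dim3 grid, dim3 block, unsigned int out[5])
{
    unsigned int *dOut = nullptr;
    size_t bytes = 5 * sizeof(unsigned int);
    bool ok = cudaMalloc(&dOut, bytes) == cudaSuccess;
    if (ok) {
        reportDims<<<grid, block>>>(dOut);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dOut);
    return ok;
}
```

Calling queryDims(dim3(4, 2), dim3(8, 4, 2), values) launches a 4 x 2 grid of blocks with 8 x 4 x 2 threads each, and the values read back should mirror exactly those launch parameters.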

1D Example
– Grid = 128 blocks
– Block = 32 threads; blockDim.x in the kernel returns 32
– Total threads = 128 x 32 = 4096
– Each thread has a unique global ID: blockIdx.x * blockDim.x + threadIdx.x
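The 1D indexing formula can be sketched with a vector sum. This is an illustrative example, not code from the slides: vecAdd and vecAddOnGpu are hypothetical names, and the block size of 32 is chosen only to match the numbers above, so 4096 elements give 128 blocks.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// One thread per element; the global ID selects the element to process.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may have threads past the end
        c[i] = a[i] + b[i];
}

// Hypothetical host helper: computes c = a + b on the GPU.
// Returns false on any CUDA error.
bool vecAddOnGpu(const std::vector<float> &a, const std::vector<float> &b,
                 std::vector<float> &c)
{
    assert(a.size() == b.size());
    int n = static_cast<int>(a.size());
    c.assign(n, 0.0f);
    if (n == 0) return true;

    size_t bytes = n * sizeof(float);
    const int threadsPerBlock = 32;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 4096 -> 128

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ok;
}
```

The bounds check in the kernel is a design choice that lets the same code handle vector lengths that are not a multiple of the block size.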

Multi-Dimension Example

Things to Note
– Blocks are organized as 3D arrays of threads
  – Use 1D, 2D, or 3D depending on your problem
  – Vector sum: 1D; matrix multiplication: 2D
– All blocks in a grid have the same dimensions, i.e. all blocks have an equal number of threads in each dimension
– The total size of a block is limited to 512 threads (the limit on compute capability 1.x devices; newer GPUs allow 1024)
  – blockDim can be (512, 1, 1), (8, 16, 2), (16, 16, 2)
  – But not (32, 32, 1): total threads = 32 x 32 x 1 = 1024, which exceeds 512
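Because the block size limit depends on the device, it can be safer to query it than to hard-code 512. The sketch below uses the CUDA runtime call cudaGetDeviceProperties; the function name blockShapeFits is hypothetical, and checking device 0 is an assumption.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns true if a block of shape (x, y, z) respects
// both the per-dimension limits and the total thread limit of device 0.
bool blockShapeFits(unsigned int x, unsigned int y, unsigned int z)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    if (x == 0 || y == 0 || z == 0) return false;

    // Each dimension has its own upper bound.
    if (x > static_cast<unsigned int>(prop.maxThreadsDim[0]) ||
        y > static_cast<unsigned int>(prop.maxThreadsDim[1]) ||
        z > static_cast<unsigned int>(prop.maxThreadsDim[2]))
        return false;

    // The product of all three dimensions is bounded as well.
    unsigned long long total = 1ULL * x * y * z;
    return total <= static_cast<unsigned long long>(prop.maxThreadsPerBlock);
}
```

On a device with a 512-thread limit, blockShapeFits(32, 32, 1) would return false, matching the note above; on devices that allow 1024 threads per block it would return true.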

USING blockIdx AND threadIdx
[Figure: a width x width grid of elements labeled (column, row), from (0, 0) to (width-1, width-1)]
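A minimal sketch of how blockIdx and threadIdx combine into 2D coordinates: each thread computes its (column, row) position and writes the row-major linear index of that element. The names fillCoords and fillOnGpu and the 16 x 16 block shape are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element of a width x width array.
__global__ void fillCoords(int *out, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < width)
        out[row * width + col] = row * width + col;  // row-major position
}

// Hypothetical host helper: fills a width x width array on the GPU.
// Elements start at -1 so any element no thread reached stays visible.
bool fillOnGpu(int width, std::vector<int> &out)
{
    if (width <= 0) return false;
    size_t count = static_cast<size_t>(width) * width;
    size_t bytes = count * sizeof(int);
    out.assign(count, -1);

    dim3 block(16, 16);  // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,
              (width + block.y - 1) / block.y);

    int *dOut = nullptr;
    bool ok = cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemset(dOut, 0xFF, bytes) == cudaSuccess;  // all bytes 0xFF = -1
    if (ok) {
        fillCoords<<<grid, block>>>(dOut, width);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dOut);
    return ok;
}
```

If every element ends up equal to its own linear index, the blocks covered the array exactly once, with the bounds check discarding the threads that fall outside it.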

Matrix-Multiplication with larger size

Simple example

Updated kernel code
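The kernel listing for this slide is not part of the transcript. The following is a sketch of what a multi-block matrix multiplication kernel can look like, not the original code: it assumes square, row-major matrices, and the names matMul and matMulOnGpu and the 16 x 16 block shape are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Each thread computes one element of P = M * N.
__global__ void matMul(const float *m, const float *n, float *p, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += m[row * width + k] * n[k * width + col];
        p[row * width + col] = sum;
    }
}

// Hypothetical host helper: multiplies two width x width matrices on the GPU.
// Returns false on any CUDA error.
bool matMulOnGpu(const std::vector<float> &m, const std::vector<float> &n,
                 std::vector<float> &p, int width)
{
    assert(width > 0);
    size_t count = static_cast<size_t>(width) * width;
    assert(m.size() == count && n.size() == count);
    size_t bytes = count * sizeof(float);
    p.assign(count, 0.0f);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (width + block.y - 1) / block.y);

    float *dM = nullptr, *dN = nullptr, *dP = nullptr;
    bool ok = cudaMalloc(&dM, bytes) == cudaSuccess
           && cudaMalloc(&dN, bytes) == cudaSuccess
           && cudaMalloc(&dP, bytes) == cudaSuccess
           && cudaMemcpy(dM, m.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dN, n.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        matMul<<<grid, block>>>(dM, dN, dP, width);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(p.data(), dP, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    return ok;
}
```

Using several blocks removes the size restriction of a single-block kernel: the matrix width is no longer bounded by the number of threads one block can hold.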

Block scheduling on device

Thread Assignment

QUESTIONS?