Introduction to CUDA CAP 4730 Spring 2012 Tushar Athawale

Resources: the CUDA Programming Guide, and Programming Massively Parallel Processors: A Hands-on Approach by David Kirk and Wen-mei W. Hwu.

Motivation: Process independent tasks of an application in parallel. How could we parallelize a line-drawing routine? a) Divide the line into parts and assign each part to a different processor. b) What if we assign one processor per scanline? (Each processor knows its own y coordinate.)

Squaring Array Elements. Consider an array of 8 elements sitting in host (CPU) memory:

void host_square(float* h_A) {
    for (int i = 0; i < 8; i++)
        h_A[i] = h_A[i] * h_A[i];
}

Squaring Array Elements. Now the array sits in device (GPU) memory (also called global memory). Spawn the threads; each thread automatically gets a number. The programmer controls how much work each thread does, e.g. in our current example: 4 threads - each thread squares 2 elements; 8 threads - each thread squares 1 element. A sketch of the 4-thread variant follows.
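
As a minimal sketch of the 4-thread case (the kernel name and indexing scheme here are illustrative, not from the original slides), each thread squares two adjacent elements:

__global__ void device_square_2each(float* d_A) {
    int base = threadIdx.x * 2;                  // thread t handles elements 2t and 2t+1
    d_A[base]     = d_A[base]     * d_A[base];
    d_A[base + 1] = d_A[base + 1] * d_A[base + 1];
}
// launched with 1 block of 4 threads: device_square_2each<<<1, 4>>>(d_A);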

Squaring Array Elements. Suppose the programmer spawns 8 threads, where each thread squares 1 element. Each thread automatically gets a number - (0,0,0), (1,0,0), (2,0,0), ..., (7,0,0) - held in built-in registers corresponding to each thread. These built-in values are accessed as threadIdx.x, threadIdx.y, threadIdx.z.

Squaring Array Elements. Write the program for only 1 thread; the following kernel is then executed for each thread in parallel:

__global__ void device_square(float* d_A) {
    int myid = threadIdx.x;              // this thread's index
    d_A[myid] = d_A[myid] * d_A[myid];
}
// launched as device_square<<<1, 8>>>(d_A); for the 8-thread case

Squaring Array Elements. Block-level parallelism: suppose the programmer decides to view the array of 8 elements as 2 blocks of 4 threads. Block IDs are stored in the built-in values blockIdx.x, blockIdx.y, blockIdx.z. Block (0,0,0) holds thread IDs (0,0,0) .. (3,0,0), and block (1,0,0) again holds thread IDs (0,0,0) .. (3,0,0).

Squaring Array Elements. Here each thread knows its own block ID and thread ID:

const int BLOCK_SIZE = 4;

__global__ void device_square(float* d_A) {
    int myid = blockIdx.x * BLOCK_SIZE + threadIdx.x;  // equivalently: blockIdx.x * blockDim.x + threadIdx.x
    d_A[myid] = d_A[myid] * d_A[myid];
}

General flow of a .cu file:
1. Allocate host memory (malloc) for array h_A
2. Initialize that array
3. Allocate device memory (cudaMalloc) for array d_A
4. Transfer data from host to device memory (cudaMemcpy)
5. Specify the kernel execution configuration (this is very important: depending on it, blocks and threads automatically get assigned numbers)
6. Call the kernel
7. Transfer the result from device to host memory (cudaMemcpy)
8. Deallocate host (free) and device (cudaFree) memories
A complete minimal example follows.
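
Put together for the running 8-element squaring example, a minimal sketch (error checking omitted; this exact file is not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void device_square(float* d_A) {
    int myid = blockIdx.x * blockDim.x + threadIdx.x;
    d_A[myid] = d_A[myid] * d_A[myid];
}

int main(void) {
    const int N = 8;
    size_t size = N * sizeof(float);

    float* h_A = (float*)malloc(size);                   // 1. allocate host memory
    for (int i = 0; i < N; i++) h_A[i] = (float)i;       // 2. initialize the array

    float* d_A;
    cudaMalloc((void**)&d_A, size);                      // 3. allocate device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // 4. host -> device copy

    device_square<<<2, 4>>>(d_A);                        // 5, 6. configure and call kernel

    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);  // 7. device -> host copy
    for (int i = 0; i < N; i++) printf("%f\n", h_A[i]);

    free(h_A);                                           // 8. deallocate
    cudaFree(d_A);
    return 0;
}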

GPGPU. What is GPGPU? General-purpose computing on GPUs. Why GPGPU? Massively parallel computing power: hundreds of cores, thousands of concurrent threads, and inexpensive hardware.

CPU vs. GPU [two comparison figures, © NVIDIA Corporation 2009]

GPGPU: How? CUDA, OpenCL, DirectCompute.

CUDA ‘Compute Unified Device Architecture’ Heterogeneous serial-parallel computing Scalable programming model C for CUDA – extension to C

C for CUDA. Both serial (CPU) and parallel (GPU) code. A kernel is a function that executes on the GPU:

__global__ void matrix_mul(...){ }
matrix_mul<<<dimGrid, dimBlock>>>(...)

Array squaring (our example of 8 elements): device_square<<<2, 4>>>(d_A)
The file has extension '.cu'.

C for CUDA:

__global__ void matrix_mul(...) {
    // ... GPU (parallel) code
}

int main() {
    // ... CPU (serial) code
    matrix_mul<<<dimGrid, dimBlock>>>(...);
}

Compiling. Use nvcc to compile .cu files:

nvcc -o runme kernel.cu

Use the -c option to generate object files:

nvcc -c kernel.cu
g++ -c main.cpp
g++ -o runme *.o -lcudart

(When linking CUDA objects with g++, the CUDA runtime library, -lcudart, must also be linked, possibly with its library path, e.g. -L/usr/local/cuda/lib64.)

Programming Model. SIMT (Single Instruction, Multiple Threads): threads run in groups of 32 called warps, and every thread in a warp executes the same instruction at a time.

Programming Model. A single kernel is executed by many threads. Threads are grouped into 'blocks', and a kernel launch creates a 'grid' of thread blocks.
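
For example, a 2D execution configuration can be given with dim3 (the sizes and kernel arguments here are illustrative assumptions):

dim3 dimBlock(16, 16);   // 256 threads per block
dim3 dimGrid(4, 4);      // 16 blocks in the grid
matrix_mul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);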

Programming Model [figure, © NVIDIA Corporation]

Programming Model. All threads within a block can share data through 'shared memory' and synchronize using '__syncthreads()'. Threads and blocks have unique IDs, available through special variables.
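
As a small sketch of both features (this reversal kernel is illustrative, not from the slides), threads in one block stage data in shared memory and synchronize before reading it back:

__global__ void reverse_in_block(float* d_A) {
    __shared__ float s[8];   // shared memory: visible to all threads in the block
    int t = threadIdx.x;
    s[t] = d_A[t];           // each of the 8 threads loads one element
    __syncthreads();         // wait until every thread in the block has loaded
    d_A[t] = s[7 - t];       // now safe to read elements loaded by other threads
}
// launched as reverse_in_block<<<1, 8>>>(d_A);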

Transparent Scalability Hardware is free to schedule thread blocks on any processor © NVIDIA Corporation 2009

Control Flow Divergence (figure courtesy Fung et al., MICRO '07). When threads within a warp take different branches, the warp executes each branch path serially, with the threads not on that path disabled.
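
A minimal sketch of code that diverges (illustrative, not from the slides): half of each 32-thread warp takes each branch, so the two paths run one after the other:

__global__ void divergent(float* d_A) {
    int i = threadIdx.x;
    if ((i % 32) < 16)
        d_A[i] = d_A[i] * 2.0f;   // first half of the warp executes this...
    else
        d_A[i] = d_A[i] + 1.0f;   // ...then the second half executes this
}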

Memory Model. Host (CPU) and device (GPU) have separate memory spaces. The host manages memory on the device, using functions to allocate/set/copy/free device memory that are similar to their C counterparts, as sketched below.
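
A sketch of the correspondence (assuming a byte count, size, is already defined):

float* h_A = (float*)malloc(size);                   // C:    malloc
float* d_A; cudaMalloc((void**)&d_A, size);          // CUDA: cudaMalloc
memset(h_A, 0, size);                                // C:    memset
cudaMemset(d_A, 0, size);                            // CUDA: cudaMemset
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // copy host -> device
free(h_A);                                           // C:    free
cudaFree(d_A);                                       // CUDA: cudaFree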

Memory Model. Types of device memory:
- Registers - read/write, per-thread
- Local memory - read/write, per-thread
- Shared memory - read/write, per-block
- Global memory - read/write, across grids
- Constant memory - read-only, across grids
- Texture memory - read-only, across grids
Declaration examples are sketched below.
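
How the main types appear in CUDA source, as a sketch (texture memory, which uses a separate binding API, is omitted; the names here are illustrative, and a block of at most 64 threads is assumed):

__constant__ float c_coeff[4];     // constant memory: read-only from kernels
__device__   float g_table[64];    // global memory: read/write across the grid

__global__ void memory_demo(float* d_A) {
    __shared__ float s_tile[64];   // shared memory: read/write per block
    float r = d_A[threadIdx.x];    // r lives in a register: private per thread
    s_tile[threadIdx.x] = r * c_coeff[0];
    __syncthreads();
    g_table[threadIdx.x] = s_tile[threadIdx.x];
}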

Memory Model [figure: registers and local memory per thread, shared memory per block; global, constant, and texture memory accessible across blocks and from the host; © NVIDIA Corporation]

Thank you http://developer.download.nvidia.com/presentations/2009/SIGGRAPH/Alternative_Rendering_Pipelines.mp4