Introduction to CUDA (1 of n*)


1 Introduction to CUDA (1 of n*)
Joseph Kider, University of Pennsylvania, CIS, Spring 2011 (* where n is 2 or 3)

2 Agenda GPU architecture review CUDA
First of two or three dedicated classes

3 Acknowledgements Many slides are from
Kayvon Fatahalian's "From Shader Code to a Teraflop: How GPU Shader Cores Work" and David Kirk and Wen-mei Hwu's UIUC course

4 GPU Architecture Review
GPUs are: parallel, multithreaded, many-core. GPUs have: tremendous computational horsepower and high memory bandwidth.

5 GPU Architecture Review
GPUs are specialized for compute-intensive, highly parallel computation (graphics!). Transistors are devoted to processing, not to data caching and flow control.

6 GPU Architecture Review
Transistor Usage Image from:

7–17 Slides from: http://bps10.idav.ucdavis.edu (image-only slides)

18 Threading Hardware in G80

19 Sources Slides from ECE 498 AL: Programming Massively Parallel Processors (Wen-mei Hwu) and John Nickolls, NVIDIA

20 Fixed-function pipeline
3D API Commands 3D API: OpenGL or Direct3D 3D Application Or Game CPU-GPU Boundary (AGP/PCIe) Data Stream Command & GPU Vertex Index Stream Pixel Location Stream Assembled Primitives Pixel Updates GPU Front End Primitive Assembly Rasterization and Interpolation Raster Operations Frame Buffer Pre-transformed Vertices Transformed Vertices Pre-transformed Fragments Transformed Fragments Programmable Vertex Processor Programmable Fragment Processor

21 Programmable pipeline
3D API Commands 3D API: OpenGL or Direct3D 3D Application Or Game CPU-GPU Boundary (AGP/PCIe) Data Stream Command & GPU Vertex Index Stream Pixel Location Stream Assembled Primitives Pixel Updates GPU Front End Primitive Assembly Rasterization and Interpolation Raster Operations Frame Buffer Pre-transformed Vertices Transformed Vertices Pre-transformed Fragments Transformed Fragments Programmable Vertex Processor Programmable Fragment Processor

22 Unified Programmable pipeline
3D API Commands 3D API: OpenGL or Direct3D 3D Application Or Game CPU-GPU Boundary (AGP/PCIe) Data Stream Command & GPU Vertex Index Stream Pixel Location Stream Assembled Primitives Pixel Updates GPU Front End Primitive Assembly Rasterization and Interpolation Raster Operations Frame Buffer Pre-transformed Vertices Transformed Vertices Pre-transformed Fragments Transformed Fragments Unified Vertex, Fragment, Geometry Processor

23 General Diagram (6800/NV40) TMU = texture memory unit

24 TurboCache Uses PCI-Express bandwidth to render directly to system memory Card needs less memory Performance boost while lowering cost TurboCache Manager dynamically allocates from main memory Local memory used to cache data and to deliver peak performance when needed

25 NV40 Vertex Processor TMU = texture memory unit An NV40 vertex processor is able to execute one vector operation (up to four FP32 components), one scalar FP32 operation, and make one access to the texture per clock cycle

26 NV40 Fragment Processors
Early termination from mini z buffer and z buffer checks; resulting sets of 4 pixels (quads) passed on to fragment units Each quad processor has two ALUs, each capable of executing 2 operations: Maximum: four operations per clock without texture selection OR Maximum: one texture selection and two operations per clock 4 temporary registers (FP32) (or 8 FP16) per pixel

27 Why NV40 series was better
Massive parallelism: 16 pixel pipes, each with one texture unit (in special cases it can produce 32 pixels per clock), and 6 vertex shader units. Scalability: lower-end products have fewer pixel pipes and fewer vertex shader units. Computation power: 222 million transistors, fabricated by IBM on a 0.13-micron process; a more capable general-purpose computing engine than any graphics chip before it; can process data internally with 32 bits of floating-point precision per color channel (red, green, blue, alpha) with no performance penalty. First to comply with Microsoft's DirectX 9 spec. Dynamic branching in pixel shaders.

28 Dynamic Branching Helps detect if pixel needs shading
Instruction flow handled in groups of pixels Specify branch granularity (the number of consecutive pixels that take the same branch) Better distribution of blocks of pixels between the different quad engines

29 General Diagram (7800/G70) The G70 (7800) increased the count to 8 vertex processing units and 6 quad processors

30 General Diagram (6800/NV40) General Diagram (7800/G70)
The G70 (7800) increased the count to 8 vertex processing units and 6 quad processors

31

32 GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs Dynamic clock scaling can run as slow as 16 MHz This is true for the engine, memory, and pixel clocks Heavier use of clock gating than the desktop version Runs at voltages lower than any other mobile performance part Regardless, you won’t get much battery-based runtime for a 3D game

33 GeForce 7800 GTX Parallelism
8 Vertex Engines; Z-Cull; Triangle Setup/Raster; 24 Pixel Shaders; Shader Instruction Dispatch; Fragment Crossbar; 16 Raster Operation Pipelines; 4 Memory Partitions

34 G80 – Graphics Mode The future of GPUs is programmable processing
So, build the architecture around the processor. (Diagram: G80 in graphics mode: host and input assembler feed vertex, geometry, and pixel thread issue; setup/raster/ZCull; arrays of SPs with L1/TF units and a thread processor; L2 caches and frame buffer partitions.)

35 G80 CUDA mode – A Device Example
Processors execute computing threads. New operating mode/HW interface for computing. (Diagram: host, input assembler, and thread execution manager feed the processor array; parallel data caches and texture units; load/store access to global memory.)

36 Why Use the GPU for Computing ?
The GPU has evolved into a very flexible and powerful processor: it's programmable using high-level languages, it supports 32-bit floating point precision, and it offers lots of GFLOPS. GPU in every PC and workstation. (Chart legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800)

37 What is Behind such an Evolution?
The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control. The fast-growing video game industry exerts strong economic pressure that forces constant innovation. (Diagram: the CPU devotes more transistors to control and cache; the GPU devotes them to ALUs; both are backed by DRAM.)

38 What is (Historical) GPGPU ?
General-purpose computation using the GPU and graphics API in applications other than 3D graphics; the GPU accelerates the critical path of the application. Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput; fine-grain SIMD parallelism; low-latency floating point (FP) computation. Applications (see GPGPU.org): game effects (FX), physics, image processing; physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting. Mark Harris, of NVIDIA, runs the gpgpu.org website.

39 Previous GPGPU Constraints
Dealing with the graphics API: working with the corner cases of the graphics API. Addressing modes: limited texture size/dimension. Shader capabilities: limited outputs. Instruction sets: lack of integer & bit ops. Communication limited: between pixels; scatter a[i] = p. (Diagram: per-thread, per-shader, per-context model: input registers, texture, and constants feed a fragment program with temp registers; output registers write to FB memory.)

40 An Example of Physical Reality Behind CUDA
CPU (host) GPU w/ local DRAM (device)

41 Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads. All threads run the same code (SPMD). Each thread has an ID that it uses to compute memory addresses and make control decisions: float x = input[threadID]; float y = func(x); output[threadID] = y;
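As a concrete illustration, here is a minimal sketch of such a kernel in CUDA C; the kernel name, the stand-in for func, and the single-block launch assumption are ours, not from the slides.

    __global__ void applyFunc(const float* input, float* output)
    {
        // With a single-block launch, threadIdx.x plays the role of threadID above.
        int threadID = threadIdx.x;
        float x = input[threadID];
        float y = 2.0f * x + 1.0f;   // stand-in for func(x)
        output[threadID] = y;
    }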

42 Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization. Threads in different blocks cannot cooperate. (Diagram: Thread Block 0 through Thread Block N-1 each run the same per-thread code over their own range of threadIDs.)

43 Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks. All threads share data memory space. A thread block is a batch of threads that can cooperate with each other by: synchronizing their execution (for hazard-free shared memory accesses) and efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. (Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2D array of blocks, and each block is an array of threads.) Courtesy: NVIDIA

44 Block and Thread IDs Threads and blocks have IDs
So each thread can decide what data to work on. Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data, e.g. image processing and solving PDEs on volumes. Courtesy: NVIDIA
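A sketch of how a thread might use its block and thread IDs for 2D image addressing; the kernel and its parameters are hypothetical, chosen only to mirror the image-processing example above.

    __global__ void brighten(float* image, int width, int height)
    {
        // 2D block and thread IDs map naturally onto pixel coordinates.
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            image[row * width + col] += 0.1f;   // each thread handles one pixel
    }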

45 CUDA Device Memory Space Overview
Each thread can: R/W per-thread registers; R/W per-thread local memory; R/W per-block shared memory; R/W per-grid global memory; read only per-grid constant memory; read only per-grid texture memory. Global, constant, and texture memory spaces are persistent across kernels called by the same application. The host can R/W global, constant, and texture memories.

46 Global, Constant, and Texture Memories (Long Latency Accesses)
Global memory: main means of communicating R/W data between host and device; contents visible to all threads. Texture and constant memories: constants initialized by the host. Courtesy: NVIDIA

47 Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on Block ID: 1D or 2D Thread ID: 1D, 2D, or 3D Simplifies memory addressing when processing multidimensional data Image processing Solving PDEs on volumes

48 CUDA Memory Model Overview
Global memory: main means of communicating R/W data between host and device; contents visible to all threads; long-latency access. We will focus on global memory for now; constant and texture memory will come later.

49 Parallel Computing on a GPU
8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications Available in laptops, desktops, and clusters GPU parallelism is doubling every year Programming model scales transparently Programmable in C with CUDA tools Multithreaded SPMD model uses application data parallelism and thread parallelism GeForce 8800 Tesla D870 Tesla S870

50 Single-Program Multiple-Data (SPMD)
A CUDA program is an integrated CPU + GPU application C program: serial C code executes on the CPU, and parallel kernel C code executes on GPU thread blocks. (Diagram: CPU serial code, then Grid 0: KernelA<<< nBlk, nTid >>>(args); more CPU serial code, then Grid 1: KernelB<<< nBlk, nTid >>>(args).)
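A sketch of how this CPU/GPU interleaving looks in one CUDA C source file; KernelA, KernelB, nBlk, and nTid come from the slide, while the kernel bodies and the run() wrapper are placeholders of ours.

    __global__ void KernelA(float* data) { data[threadIdx.x] *= 2.0f; }   // placeholder body
    __global__ void KernelB(float* data) { data[threadIdx.x] += 1.0f; }   // placeholder body

    void run(float* d_data, int nBlk, int nTid)
    {
        // Serial C code executes on the CPU ...
        KernelA<<<nBlk, nTid>>>(d_data);   // Grid 0: parallel kernel executes on the GPU
        // ... more serial CPU code ...
        KernelB<<<nBlk, nTid>>>(d_data);   // Grid 1: another parallel kernel
    }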

51 Grids and Blocks A kernel is executed as a grid of thread blocks
All threads share global memory space A thread block is a batch of threads that can cooperate with each other by: Synchronizing their execution using barrier Efficiently sharing data through a low latency shared memory Two threads from two different blocks cannot cooperate

52 CUDA Thread Block Programmer declares (Thread) Block:
Block size 1 to 512 concurrent threads Block shape 1D, 2D, or 3D Block dimensions in threads All threads in a Block execute the same thread program Threads share data and synchronize while doing their share of the work Threads have thread id numbers within Block Thread program uses thread id to select work and address shared data CUDA Thread Block Thread Id #: … m Thread program Courtesy: John Nickolls, NVIDIA

53 GeForce-8 Series HW Overview
(Diagram: the Streaming Processor Array is built from Texture Processor Clusters; each TPC pairs two Streaming Multiprocessors with a TEX unit; each SM has an instruction L1, a data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs.)

54 CUDA Processor Terminology
SPA Streaming Processor Array (variable across GeForce 8-series, 8 in GeForce8800) TPC Texture Processor Cluster (2 SM + TEX) SM Streaming Multiprocessor (8 SP) Multi-threaded processor core Fundamental processing unit for CUDA thread block SP Streaming Processor Scalar ALU for a single CUDA thread

55 Streaming Multiprocessor (SM)
8 Streaming Processors (SP); 2 Super Function Units (SFU). Multi-threaded instruction dispatch: 1 to 512 threads active; shared instruction fetch per 32 threads; covers latency of texture/memory loads. 20+ GFLOPS. 16 KB shared memory. Texture and global memory access.

56 G80 Thread Computing Pipeline
Processors execute computing threads. Alternative operating mode specifically for computing. The future of GPUs is programmable processing, so build the architecture around the processor. The GPU generates thread grids based on kernel calls. (Diagram: the graphics-mode pipeline of host, input assembler, vertex/geometry/pixel thread issue, SP/TF/L1 clusters, L2, and frame buffer, shown alongside the computing-mode view of host, input assembler, thread execution manager, parallel data caches, texture units, and load/store to global memory.)

57 Thread Life Cycle in HW Grid is launched on the SPA
Grid is launched on the SPA. Thread Blocks are serially distributed to all the SMs (potentially more than one Thread Block per SM). Each SM launches Warps of Threads: 2 levels of parallelism. The SM schedules and executes Warps that are ready to run. As Warps and Thread Blocks complete, resources are freed, so the SPA can distribute more Thread Blocks.

58 SM Executes Blocks Threads are assigned to SMs in Block granularity
Threads are assigned to SMs at Block granularity: up to 8 Blocks per SM, as resources allow. An SM in G80 can take up to 768 threads (e.g., 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc.). Threads run concurrently; the SM assigns/maintains thread IDs and manages/schedules thread execution.

59 Thread Scheduling/Execution
Each Thread Block is divided into 32-thread Warps. This is an implementation decision, not part of the CUDA programming model. Warps are the scheduling units in an SM. If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in the SM? Each Block is divided into 256/32 = 8 Warps, so there are 8 * 3 = 24 Warps. At any point in time, only one of the 24 Warps will be selected for instruction fetch and execution.

60 SM Warp Scheduling SM hardware implements zero-overhead Warp scheduling. Warps whose next instruction has its operands ready for consumption are eligible for execution; eligible Warps are selected for execution by a prioritized scheduling policy; all threads in a Warp execute the same instruction when selected. 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G80. If one global memory access is needed for every 4 instructions, a minimum of 13 Warps is needed to fully tolerate 200-cycle memory latency (4 instructions * 4 cycles = 16 cycles of work per Warp; 200 / 16 = 12.5, rounded up to 13). (Diagram: the multithreaded Warp scheduler interleaves instructions from different Warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96.)

61 SM Instruction Buffer – Warp Scheduling
Fetch one warp instruction/cycle from the instruction L1 cache into any instruction buffer slot. Issue one "ready-to-go" warp instruction/cycle from any warp-instruction buffer slot; operand scoreboarding is used to prevent hazards. Issue selection is based on round-robin/age of warp. The SM broadcasts the same instruction to the 32 Threads of a Warp.

62 Scoreboarding All register operands of all instructions in the Instruction Buffer are scoreboarded. An instruction becomes ready after the needed values are deposited; this prevents hazards, and cleared instructions are eligible for issue. Decoupled Memory/Processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue, which allows Memory/Processor ops to proceed in the shadow of other waiting Memory/Processor ops.

63 Granularity Considerations
For Matrix Multiplication, should I use 4X4, 8X8, 16X16 or 32X32 tiles? For 4X4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM! There are 8 warps, but each warp is only half full. For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, each SM can only take up to 8 Blocks, so only 512 threads will go into each SM! There are 16 warps available for scheduling in each SM; each warp spans four slices in the y dimension. For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule. There are 24 warps available for scheduling in each SM; each warp spans two slices in the y dimension. For 32X32, we have 1024 threads per Block. Not even one can fit into an SM!

64 Memory Hardware in G80

65 CUDA Device Memory Space: Review
Each thread can: R/W per-thread registers; R/W per-thread local memory; R/W per-block shared memory; R/W per-grid global memory; read only per-grid constant memory; read only per-grid texture memory. Global, constant, and texture memory spaces are persistent across kernels called by the same application. The host can R/W global, constant, and texture memories.

66 Parallel Memory Sharing
Local Memory: per-thread Private per thread Auto variables, register spill Shared Memory: per-Block Shared by threads of the same block Inter-thread communication Global Memory: per-application Shared by all threads Inter-Grid communication Thread Local Memory Block Shared Memory Grid 0 . . . Global Memory Sequential Grids in Time Grid 1 . . .

67 SM Memory Architecture
Threads in a block share data & results in Memory and Shared Memory; they synchronize at barrier instructions. Per-Block Shared Memory allocation keeps data close to the processor and minimizes trips to global Memory. Shared Memory is dynamically allocated to blocks and is one of the limiting resources. Courtesy: John Nickolls, NVIDIA

68 SM Register File Register File (RF) TEX pipe can also read/write RF
Register File (RF): 32 KB (8K entries) for each SM in G80. The TEX pipe can also read/write the RF (2 SMs share 1 TEX); the Load/Store pipe can also read/write the RF.

69 Programmer View of Register File
There are 8192 registers in each SM in G80. This is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all blocks assigned to the SM; once assigned to a block, a register is NOT accessible by threads in other blocks. Each thread in the same block only accesses registers assigned to itself. (Diagram: the same register file partitioned among 4 blocks vs. 3 blocks, depending on per-thread register use.)

70 Matrix Multiplication Example
If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM? Each block requires 10*256 = 2560 registers; 8192 = 3 * 2560 + change, so three blocks can run on an SM as far as registers are concerned. How about if each thread increases its use of registers by 1? Each Block now requires 11*256 = 2816 registers; 8192 < 2816 * 3, so only two Blocks can run on an SM, a 1/3 reduction of parallelism!

71 More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers/programmers One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each This allows for finer grain threading than traditional CPU threading models. The compiler can tradeoff between instruction-level parallelism and thread level parallelism

72 Let’s program this thing!

73 GPU Computing History 2001/2002 – researchers see the GPU as a data-parallel coprocessor; the GPGPU field is born. 2007 – NVIDIA releases CUDA (Compute Unified Device Architecture); GPGPU shifts to GPU Computing. 2008 – Khronos releases the OpenCL specification.

74 CUDA Abstractions A hierarchy of thread groups Shared memories
Barrier synchronization

75 CUDA Terminology Host – typically the CPU
Code written in ANSI C Device – typically the GPU (data-parallel) Code written in extended ANSI C Host and device have separate memories CUDA Program Contains both host and device code

76 CUDA Terminology Kernel – data-parallel function
Invoking a kernel creates lightweight threads on the device Threads are generated and scheduled with hardware Does a kernel remind you of a shader in OpenGL?

77 CUDA Kernels Executed N times in parallel by N different CUDA threads
Declaration Specifier Thread ID threadIdx is a 3-component vector Execution Configuration
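These captions refer to a code figure that is not included in the transcript; below is a sketch of the kind of kernel they describe (a vector addition, with names of our choosing; A, B, and C are assumed to be device pointers).

    // Declaration specifier: __global__ marks a data-parallel kernel.
    __global__ void VecAdd(const float* A, const float* B, float* C)
    {
        int i = threadIdx.x;          // Thread ID: threadIdx is a 3-component vector
        C[i] = A[i] + B[i];           // executed by N threads in parallel
    }

    void launch(const float* A, const float* B, float* C, int N)
    {
        VecAdd<<<1, N>>>(A, B, C);    // execution configuration: 1 block of N threads
    }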

78 CUDA Program Execution
Image from:

79 Thread Hierarchies Grid – one or more thread blocks
1D or 2D Block – array of threads 1D, 2D, or 3D Each block in a grid has the same number of threads Each thread in a block can Synchronize Access shared memory

80 Thread Hierarchies Image from:

81 Thread Hierarchies Block – 1D, 2D, or 3D
Example: Index into vector, matrix, volume

82 Thread Hierarchies Thread ID: Scalar thread identifier
Thread Index: threadIdx. 1D: Thread ID == Thread Index. 2D with size (Dx, Dy): Thread ID of index (x, y) == x + y Dx. 3D with size (Dx, Dy, Dz): Thread ID of index (x, y, z) == x + y Dx + z Dx Dy
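In code, the same flattening can be written inside a kernel as follows (a sketch; blockDim holds (Dx, Dy, Dz)).

    // Flattened thread ID for a 3D block of size (Dx, Dy, Dz),
    // matching the formulas above.
    __device__ int flatThreadId()
    {
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }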

83 Thread Hierarchies 2D Index 1 Thread Block 2D Block

84 Thread Hierarchies Thread Block Group of threads
G80 and GT200: Up to 512 threads Fermi: Up to 1024 threads Reside on same processor core Share memory of that core

85 Thread Hierarchies Thread Block Group of threads
G80 and GT200: Up to 512 threads Fermi: Up to 1024 threads Reside on same processor core Share memory of that core Image from:

86 Thread Hierarchies Block Index: blockIdx Dimension: blockDim 1D or 2D

87 Thread Hierarchies 16x16 Threads per block 2D Thread Block

88 Thread Hierarchies Example: N = 32 i = [0, 1] * 16 + [0, 15]
16x16 threads per block (independent of N) threadIdx ([0, 15], [0, 15]) 2x2 thread blocks in grid blockIdx ([0, 1], [0, 1]) blockDim = 16 i = [0, 1] * 16 + [0, 15]
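A sketch of this index computation inside a kernel; the kernel itself and the array name A are illustrative assumptions.

    __global__ void addOne(float* A, int N)
    {
        // blockDim.x = blockDim.y = 16; a 2x2 grid of blocks covers N = 32.
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // [0, 1] * 16 + [0, 15]
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            A[j * N + i] += 1.0f;
    }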

89 Thread Hierarchies Thread blocks execute independently
In any order: parallel or series Scheduled in any order by any number of cores Allows code to scale with core count

90 Thread Hierarchies Scheduled by the CUDA runtime
Image from:

91 Thread Hierarchies Threads in a block
Share (limited) low-latency memory. Synchronize execution to coordinate memory accesses: __syncthreads() is a barrier (threads in the block wait until all threads reach it) and is lightweight. Image from:
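A small sketch of block-level cooperation through shared memory and __syncthreads(); the kernel is hypothetical and assumes a single block of at most 256 threads.

    __global__ void reverseBlock(float* data)
    {
        __shared__ float tile[256];          // low-latency memory shared by the block
        int t = threadIdx.x;
        tile[t] = data[t];
        __syncthreads();                     // barrier: wait until every thread has written
        data[t] = tile[blockDim.x - 1 - t];  // safe to read another thread's element now
    }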

92 CUDA Memory Transfers Image from:

93 CUDA Memory Transfers Host can transfer to/from device Global memory
Constant memory Image from:

94 CUDA Memory Transfers cudaMalloc() cudaFree()
cudaMalloc() allocates global memory on the device; cudaFree() frees memory. Image from:
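A minimal allocation sketch (the pointer name and size are illustrative, not from the slides).

    float* d_data = NULL;                 // pointer to device (global) memory
    size_t size = 256 * sizeof(float);    // size in bytes
    cudaMalloc((void**)&d_data, size);    // allocate global memory on the device
    // ... use d_data in kernels ...
    cudaFree(d_data);                     // free it when no longer needed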

95 CUDA Memory Transfers Code from:

96 CUDA Memory Transfers Pointer to device memory
Code from:

97 CUDA Memory Transfers Size in bytes
Code from:

98 CUDA Memory Transfers cudaMemcpy()
Host to host Host to device Device to host Device to device Host Device Global Memory Does this remind you of VBOs in OpenGL?

99 CUDA Memory Transfers cudaMemcpy() Memory transfer Host to host
Host to device Device to host Device to device Host Device Global Memory

100 CUDA Memory Transfers cudaMemcpy() Memory transfer Host to host
Host to device Device to host Device to device Host Device Global Memory

101 CUDA Memory Transfers cudaMemcpy() Memory transfer Host to host
Host to device Device to host Device to device Host Device Global Memory

102 CUDA Memory Transfers cudaMemcpy()
Host to host Host to device Device to host Device to device Host Device Global Memory Note: plain cudaMemcpy() calls block the host; asynchronous transfers use cudaMemcpyAsync()

103 CUDA Memory Transfers Host to device Host Device Global Memory
Code from:

104 CUDA Memory Transfers Destination (device) Source (host) Host Device
Global Memory Code from:

105 CUDA Memory Transfers Host Device Global Memory
Code from:
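A sketch of the two copy directions shown on slides 103 through 105; h_data, d_data, and size are carried over from the allocation sketch above, i.e. they are our assumptions.

    // Host to device: destination is device memory, source is host memory.
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    // ... launch kernels that read/write d_data ...
    // Device to host: read results back into host memory.
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);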

106 Matrix Multiply P = M * N Assume M and N are square for simplicity
Is this data-parallel? Image from:

107 Matrix Multiply 1,000 x 1,000 matrix 1,000,000 dot products
Each requires 1,000 multiplies and 1,000 adds

108 Matrix Multiply: CPU Implementation
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}
Code from:

109 Matrix Multiply: CUDA Skeleton
Code from:

110 Matrix Multiply: CUDA Skeleton
Code from:

111 Matrix Multiply: CUDA Skeleton
Code from:

112 Matrix Multiply Step 1 Add CUDA memory transfers to the skeleton

113 Matrix Multiply: Data Transfer
Allocate input Code from:

114 Matrix Multiply: Data Transfer
Allocate output Code from:

115 Matrix Multiply: Data Transfer
Code from:

116 Matrix Multiply: Data Transfer
Read back from device Code from:

117 Matrix Multiply: Data Transfer
Does this remind you of GPGPU with GLSL? Code from:
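The code figures for slides 113 through 117 are missing; below is a sketch of the host-side data transfer they describe, using the Md/Nd/Pd naming that appears on slide 125. The exact code is our reconstruction, not the original.

    void MatrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *Md, *Nd, *Pd;

        // Allocate and load the inputs M and N on the device.
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate the output P on the device.
        cudaMalloc((void**)&Pd, size);

        // (Kernel launch goes here; see the kernel and launch sketches below.)

        // Read the result back from the device, then free device memory.
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }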

118 Matrix Multiply Step 2 Implement the kernel in CUDA C

119 Matrix Multiply: CUDA Kernel
Accessing a matrix, so using a 2D block Code from:

120 Matrix Multiply: CUDA Kernel
Each kernel computes one output Code from:

121 Matrix Multiply: CUDA Kernel
Where did the two outer for loops in the CPU implementation go? Code from:

122 Matrix Multiply: CUDA Kernel
No locks or synchronization, why? Code from:
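A sketch of the single-block kernel these slides describe: each thread computes one element of Pd, the two outer loops of the CPU version are replaced by the thread indices, and no synchronization is needed because each thread writes a distinct output element. This is our reconstruction of the missing code figure.

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
    {
        int row = threadIdx.y;   // replaces the outer i loop
        int col = threadIdx.x;   // replaces the outer j loop

        float sum = 0;
        for (int k = 0; k < width; ++k)
            sum += Md[row * width + k] * Nd[k * width + col];

        Pd[row * width + col] = sum;
    }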

123 Matrix Multiply Step 3 Invoke the kernel in CUDA C

124 Matrix Multiply: Invoke Kernel
One block with width by width threads Code from:
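A sketch of the corresponding launch (our reconstruction); note the single block of width x width threads, which is why the matrix size is limited by the maximum threads per block, as the next slide points out.

    dim3 dimGrid(1, 1);                 // one block in the grid
    dim3 dimBlock(width, width);        // width x width threads in the block
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);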

125 Matrix Multiply One Block of threads compute matrix Pd Each thread
One Block of threads computes matrix Pd; each thread computes one element of Pd. Each thread loads a row of matrix Md and a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements. The compute to off-chip memory access ratio is close to 1:1 (not very high). The size of the matrix is limited by the number of threads allowed in a thread block. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign. Slide from:

126 Matrix Multiply What is the major performance problem with our implementation? What is the major limitation?


Download ppt "Introduction to CUDA (1 of n*)"

Similar presentations


Ads by Google