Intermediate GPGPU Programming in CUDA
Supada Laosooksathit
NVIDIA Hardware Architecture
- Terminologies
- Host memory
- Global memory
- Shared memory
- SMs
Recall: 5 Steps for CUDA Programming
1. Initialize device
2. Allocate device memory
3. Copy data to device memory
4. Execute kernel
5. Copy data back from device memory
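The five steps can be sketched as a minimal host program. This is a hedged sketch (a vector add of N floats, error checking omitted; the kernel body is the one shown later in these slides):

```cuda
// Minimal sketch of the five CUDA steps, assuming a vecAdd kernel.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void vecAdd(float *A, float *B, float *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(void)
{
    const int N = 256;
    size_t size = N * sizeof(float);
    float *h_A = (float *)malloc(size), *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    cudaSetDevice(0);                                    // 1. initialize device

    float *d_A, *d_B, *d_C;                              // 2. allocate device memory
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // 3. copy data to device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(d_A, d_B, d_C);                     // 4. execute kernel

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // 5. copy data back

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```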
Initialize Device Calls
- To select the device associated with the host thread: cudaSetDevice(device)
  - This function must be called before any __global__ function; otherwise device 0 is automatically selected.
- To get the number of devices: cudaGetDeviceCount(&deviceCount)
- To retrieve a device's properties: cudaGetDeviceProperties(&deviceProp, device)
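A minimal sketch of the device-selection calls above:

```cuda
// Enumerate CUDA devices, then select one for this host thread.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);   // number of CUDA-capable devices
    printf("%d device(s) found\n", deviceCount);

    // Must precede any __global__ call; otherwise device 0 is used.
    cudaSetDevice(0);
    return 0;
}
```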
Hello World Example Allocate host and device memory
Hello World Example Host code
Hello World Example Kernel code
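The slides above showed this example on screen; as a hedged reconstruction, a "Hello World" kernel that prints its block and thread IDs (matching the demo below) might look like this. Device-side printf requires compute capability 2.0 or higher:

```cuda
// Sketch of a Hello World kernel printing block and thread IDs.
#include <cstdio>

__global__ void hello(void)
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();          // 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait so device output is flushed before exit
    return 0;
}
```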
To Try CUDA Programming
- SSH to 138.47.102.111
- Set environment variables in .bashrc in your home directory:
  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
- Compile the following directories:
  NVIDIA_GPU_Computing_SDK/shared/
  NVIDIA_GPU_Computing_SDK/C/common/
- The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/
Demo
- Hello World: print out block and thread IDs
- Vector Add: C = A + B
- Additional demos from the SDK sample directories
CUDA Language Concepts
- CUDA programming model
- CUDA memory model
Some Terminologies
- Device = GPU = set of streaming multiprocessors
- Streaming Multiprocessor (SM) = set of processors & shared memory
- Kernel = GPU program
- Grid = array of thread blocks that execute a kernel
- Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
CUDA Programming Model
- Parallel code (kernel) is launched and executed on a device by many threads
- Threads are grouped into thread blocks
- Parallel code is written for a thread

```cuda
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
```
Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- A thread block is a group of threads that can:
  - Synchronize their execution
  - Communicate via a low-latency shared memory
- Grid = all thread blocks for a given launch
IDs and Dimensions
- Threads
  - 3D IDs, unique within a block
  - Two threads from two different blocks cannot cooperate
- Blocks
  - 2D or 3D IDs (depending on the hardware), unique within a grid
- Dimensions are set at launch time and can be unique for each launch
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2D array of thread blocks, e.g., Block (0, 0) through Block (2, 1), and each block, e.g., Block (1, 1), is an array of threads, e.g., Thread (0, 0) through Thread (4, 2).]
CUDA Memory Model
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W global, constant, and texture memories.
[Figure: a grid of blocks; each block has its own shared memory, each thread its own registers and local memory; global, constant, and texture memories are per-grid and accessible from the host.]
Host memory
Device DRAM
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
- Texture and constant memories
  - Constants initialized by host
  - Contents visible to all threads
CUDA Global Memory Allocation
- cudaMalloc(pointer, memsize)
  - Allocates object in the device global memory
  - pointer = address of a pointer to the allocated object
  - memsize = size of allocated object in bytes
- cudaFree(pointer)
  - Frees object from device global memory
CUDA Host-Device Data Transfer
- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters:
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type of transfer: host to host, host to device, device to host, device to device
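A minimal sketch combining allocation and transfer. Note that cudaMemcpy takes the destination pointer first:

```cuda
// Allocate device global memory, copy host data in, copy it back, free.
#include <cuda_runtime.h>

int main(void)
{
    const size_t memsize = 64 * sizeof(float);
    float h_data[64] = {0};

    float *d_data;
    cudaMalloc((void **)&d_data, memsize);                       // device global memory

    cudaMemcpy(d_data, h_data, memsize, cudaMemcpyHostToDevice); // host -> device
    cudaMemcpy(h_data, d_data, memsize, cudaMemcpyDeviceToHost); // device -> host

    cudaFree(d_data);                                            // release global memory
    return 0;
}
```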
CUDA Function Declarations

                                 Executed on the:   Only callable from the:
  __device__ float DeviceFunc()  device             device
  __global__ void  KernelFunc()  device             host
  __host__   float HostFunc()    host               host

__global__ defines a kernel function and must return void.
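A small sketch showing all three qualifiers together (the function names here are illustrative, not from the slides):

```cuda
// The three CUDA function qualifiers in one file.
#include <cuda_runtime.h>

__device__ float square(float x)        // runs on device, callable from device
{
    return x * x;
}

__global__ void squareAll(float *data)  // kernel: runs on device, called from host
{
    int i = threadIdx.x;
    data[i] = square(data[i]);          // __global__ code may call __device__ code
}

__host__ float twice(float x)           // runs on host, callable from host
{
    return 2.0f * x;
}
```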
CUDA Function Call Restrictions
- __device__ functions cannot have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes, Streams >>>(...);
- DimGrid = dimension and size of the grid
- DimBlock = dimension and size of each block
- SharedMemBytes = number of bytes of dynamically allocated shared memory (optional)
- Streams = the associated stream (optional)
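A sketch of an execution configuration using dim3 for the grid and block dimensions plus an explicit shared-memory size (the stream argument is omitted, so the default stream is used; KernelFunc is a placeholder):

```cuda
// Launching a kernel with a 2D grid, 2D blocks, and dynamic shared memory.
__global__ void KernelFunc(float *data) { /* ... */ }

void launch(float *d_data)
{
    dim3 dimGrid(4, 2);           // 4 x 2 = 8 blocks in the grid
    dim3 dimBlock(16, 16);        // 16 x 16 = 256 threads per block
    size_t sharedMemBytes = 256 * sizeof(float);   // dynamic shared memory

    KernelFunc<<<dimGrid, dimBlock, sharedMemBytes>>>(d_data);
}
```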
NVIDIA Hardware Architecture
- Terminologies
- Host memory
- Global memory
- Shared memory
- SMs
NVIDIA Hardware Architecture: Terminologies
- Compute capability
- Threads in a block are grouped into warps of 32 threads
- Warps execute on the cores of an SM
Specifications of a Device

                      Compute Capability 1.3   Compute Capability 2.0
  Warp size           32                       32
  Max threads/block   512                      1024
  Max blocks/grid     65535                    65535
  Shared memory       16 KB/SM                 48 KB/SM

For more details:
- deviceQuery in CUDA SDK
- Appendix F in Programming Guide 4.0
Demo: deviceQuery — shows the hardware specifications in detail
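A minimal deviceQuery-style sketch that prints the properties from the table above via cudaGetDeviceProperties:

```cuda
// Query and print the key limits of device 0.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (x): %d\n", prop.maxGridSize[0]);
    printf("Shared mem/block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```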
Memory Optimizations
- Reduce the time of memory transfer between host and device
  - Use asynchronous memory transfer (CUDA streams)
  - Use zero copy
- Reduce the number of transactions between on-chip and off-chip memory
  - Memory coalescing
  - Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer
- Regular memory transfer is synchronous: the host waits for the copy to complete before proceeding
Reduce Time of Host-Device Memory Transfer
- CUDA streams allow overlapping kernel execution with memory copies
CUDA Streams Example
CUDA Streams Example (cont.)
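The example on the slides above is not reproduced here; as a hedged sketch, overlapping copy and compute with two streams might look like this. Overlap requires page-locked host memory (cudaMallocHost) and asynchronous copies:

```cuda
// Split the work across two streams so the copy of one chunk can
// overlap the kernel execution of the other.
#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20, chunk = N / 2;
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, N * sizeof(float));  // page-locked host memory
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int s = 0; s < 2; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();   // wait for both streams to finish

    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```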
GPU Timers
- CUDA events
  - An API that records timestamps using the GPU clock
  - Accurate for timing kernel executions
- CUDA timer calls
  - Libraries implemented in the CUDA SDK
CUDA Events Example
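A minimal sketch of timing a kernel with CUDA events:

```cuda
// Time a kernel launch with cudaEvent timestamps.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel(float *d) { d[threadIdx.x] += 1.0f; }

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);        // record on the default stream
    kernel<<<1, 256>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // wait until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```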
Demo simpleStreams
Reduce Time of Host-Device Memory Transfer
- Zero copy
  - Allows device pointers to access page-locked host memory directly
  - Page-locked host memory is allocated by cudaHostAlloc()
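A hedged sketch of zero copy: map page-locked host memory into the device address space so the kernel accesses it directly, with no cudaMemcpy:

```cuda
// Zero copy: the kernel reads and writes mapped host memory directly.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped pinned memory

    float *h_data, *d_ptr;
    cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ptr, h_data, 0);

    scale<<<(N + 255) / 256, 256>>>(d_ptr, N);  // accesses host memory over PCIe
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
    return 0;
}
```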
Demo Zero copy
Reduce Number of On-chip and Off-chip Memory Transactions
- When threads in a warp access global memory, the accesses can be coalesced
- Memory coalescing: a single transaction copies a contiguous segment of words for the whole warp
Memory Coalescing
- Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)
Memory Coalescing Memory addresses are aligned in the same segment but the accesses are not sequential
Memory Coalescing Memory addresses are not aligned in the same segment
Shared Memory
- 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
- Helps with utilizing memory coalescing
- Bank conflicts may occur when two or more threads in a warp access the same bank
- In compute capability 1.x, no broadcast
- In compute capability 2.x, the same data is broadcast to all threads that request it
Bank Conflicts
[Figure: left, each thread accesses a distinct bank (no bank conflict); right, pairs of threads access the same bank (2-way bank conflict).]
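A common trick to avoid bank conflicts is padding a shared-memory array by one column, so that column-wise accesses by a warp fall in different banks. A hedged sketch using a transpose-style tile (assuming the matrix width is a multiple of TILE):

```cuda
// Padding a shared-memory tile to avoid bank conflicts on column access.
#define TILE 32

__global__ void transposeTile(float *out, const float *in, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 pad: columns hit distinct banks

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // transposed block offsets
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y]; // conflict-free read
}
```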
Matrix Multiplication Example
Matrix Multiplication Example
- Reduce accesses to global memory:
  - A is read only (B.width/BLOCK_SIZE) times from global memory
  - B is read only (A.height/BLOCK_SIZE) times from global memory
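The shared-memory version shown on the slides follows the tiled scheme from the CUDA Programming Guide; a hedged sketch of the kernel (assuming square matrices with width a multiple of BLOCK_SIZE):

```cuda
// Tiled matrix multiply: each block stages BLOCK_SIZE x BLOCK_SIZE tiles
// of A and B in shared memory, cutting global-memory reads as noted above.
#define BLOCK_SIZE 16

__global__ void matMul(const float *A, const float *B, float *C, int width)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * width + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * width + col];
        __syncthreads();                      // tile fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // done with this tile
    }
    C[row * width + col] = sum;
}
```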
Demo: Matrix Multiplication
- With and without shared memory
- Different block sizes
Control Flow
- if, switch, do, for, while
- Branch divergence in a warp
  - Threads in a warp take different execution paths
  - The different execution paths are serialized
  - Increases the number of instructions executed by that warp
Branch Divergence
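A hedged sketch contrasting a divergent branch with a warp-uniform one (kernel names are illustrative):

```cuda
// In divergent(), odd and even threads of the same warp take different
// paths, so the warp executes both paths serially. In uniform(), the
// condition is constant across each 32-thread warp, so no divergence.
__global__ void divergent(float *d)
{
    if (threadIdx.x % 2 == 0)          // splits every warp into two paths
        d[threadIdx.x] *= 2.0f;
    else
        d[threadIdx.x] += 1.0f;
}

__global__ void uniform(float *d)
{
    if ((threadIdx.x / 32) % 2 == 0)   // whole warps take the same path
        d[threadIdx.x] *= 2.0f;
    else
        d[threadIdx.x] += 1.0f;
}
```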
Summary
- 5 steps for CUDA programming
- NVIDIA hardware architecture
  - Memory hierarchy: global memory, shared memory, register file
  - Specifications of a device: block, warp, thread, SM
Summary (cont.)
- Memory optimization
  - Reduce overhead due to host-device memory transfer with CUDA streams and zero copy
  - Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
  - Try to avoid bank conflicts in shared memory
- Control flow
  - Try to avoid branch divergence in a warp
References
- http://docs.nvidia.com/cuda/cuda-c-programming-guide/
- http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
- http://www.developer.nvidia.com/cuda-toolkit