Download presentation
Presentation is loading. Please wait.
Published byReynard Doyle Modified over 9 years ago
1
1 100M CUDA GPUs Oil & GasFinanceMedicalBiophysicsNumericsAudioVideoImaging Heterogeneous Computing CPUCPU GPUGPU Joy Lee Senior SW Engineer, Development & Technology
2
Optimization
3
3 3 Steps to Port your C/C++ code to CUDA Step 1: Single Thread port your C/C++ code to single thread CUDA kernel, and make sure output result correct. Focus on data movement between device & host memory Step 2: Single Block Port single thread kernel into single block kernel, and make sure output result correct. Focus on parallelizing with thread index Step 3: Multi Blocks & Threads Port single block kernel into multi blocks kernel, and make sure output result correct. Focus on fixing 2 layers index system, determine the best index utilization
4
4 3 Steps to optimize your CUDA kernels Step 1: setup timers to measure the kernel time use CUDA Event to measure kernel executing time use clock() in kernel to measure executing time weight per part in detail Step 2: kernel part & bottleneck division analyze your kernel, divide it into multi parts determine the bottleneck in each part use profiler to help determine bottlenecks Step 3: parts optimization optimize each part one by one, from the most time consuming part make sure the output correct after optimizing make sure the kernel executing time become shorter after optimizing
5
5 Bottlenecks Division I PCIE bound Suffer too much cudaMemcpy between host and device memory Memory bound Suffer global memory (device memory) bandwidth, or non-coalesced memory access pattern Computing bound Suffer computing power limit (Flops) Branch bound Suffer too many branch conditions Thread bound Suffer too few threads Register bound Suffer too few available registers (conjugated with thread bound)
6
6 Bottlenecks Division II PCIE bound Try keeping all data in device memory as long as possible Try using CUDA Stream to asynchronous data movement Memory bound Try using Texture, shared memory, constant memory, or cache (after Fermi) to reduce directly global memory I/O Computing bound Try reduce the operations in your algorithm Try use intrinsic functions Try trigger on –fast_math compiler options After trying all possible ways, you face to the hardware limit, which means this part is almost optimized already, please change to faster card
7
7 Bottlenecks Division III Branch bound Reduce the number of branches, especially the diverged branches. Thread bound Use compiler option –Xptxas –v to watch the used register amount per thread, and the used smem (shared memory) amount per block If the total register amount per block is over the spec, please try --maxrregcount to set the maximum register usage per thread, but this will generate local memory (DRAM) usage, which will be performance drawback. Register bound Note: the number of variables declared in kernel is not equal to the register using amount, the compiler will optimize it to smaller register amount, and drop some not used variables Try reduce the variables in your algorithm Try change the computing order, this will make the lifetime of some variables shorter
8
8 Warp & Branches I Warp = 32 threads SP : MUL & MAD SFU : sin(x), cos(x)… 1 block divides into multi-warps SM could execute 1 warp in 4 clocks
9
9 Warp & Branches II Branch will make warp diverged, each part will be executed in time order More diverged branches will be slower Non-diverged 2-fold diverged 3-fold diverged
10
10 Warp & Branches III Fat & slim diverged warp If there are some common instructions in diverged warp, move it out of branch will save some executing time Generally speaking, make the branch as slim as possible will save time Such as data load/save, common instructions,…etc will make the diverged warp fatter 1: common instruction 2 3 1: com 2 3 Slim diverged warp Fat diverged warp
11
11 Estimate the computing throughput Computing weight Isolate the computing parts & measure their percentage in kernel through GPU clock() Kernel executing time Measure the kernel executing time, and calculate the computing time in these kernel parts Used computing throughput in Flops Count total arithmetic operations, and divide by this executing time
12
12 Example of intrinsic functions __mul24(x, y) faster than 32 bits product computes the product of the 24 least significant bits of the integer parameters __sinf(x), __cosf(x), __expf(x) very fast, single precision less precision, but still ok __sincosf(x,sptr,cptr)
13
13 Memory system Thread Scope Register: on die, fastest, default Local memory: DRAM, non cached, slow (400~600 clocks) Block Scope Shared memory: on die, fast (4~6 clocks), Qualifier __shared__ Grid Scope Global memory: DRAM, non cached, slow (400~600 clocks) Constant memory: on die, small (total 64KB), Qualifier __constant__ Texture memory: read only, DRAM+cache, fast if cache hit Cache (Fermi only): R/W cache
14
14 Count memory bandwidth (exercise) Memory access weight Isolate the memory access parts & measure their percentage in kernel thru GPU clock() Kernel executing time Measure the kernel executing time, and calculate the memory access time in kernel Used memory bandwidth Count total memory access bytes, and divide by the access time
15
15 Coalesced global memory I/O Threads in ½ warp shares the same memory controller If the memory access pattern in ½ warp is dense localized in memory, this will lead to good performance, cause it will form a single transaction. We call this coalesced I/O if they diverge to different memory segments, the performance will drop due to multi transactions. We call this non-coalesced I/O How many threads in warp shared the same memory controller may differ from hardware spec.
16
16 Wrap kernel as standard C/C++ functions This can compile to kernels into standard object files or Library Link to other languages: Java, Fortran, MATLAB, … Not necessary to rewrite all non-C code into CUDA, we can call kernels from any other languages
17
17 Example: Wrap Kernel to C __global__ void ker_xxx(int* a, int* b){//some CUDA kernel … } extern “C”{//export as standard C format void xxx(int* a, int* b); }; void xxx(int* a, int* b){//wrap the kernel into C function … ker_xxx >>(a,b); … }
18
18 Multi-GPU operations Before CUDA 4.0 or Non-tesla Cards One CUDA context can control only one GPU hardware, to send/receive data, and launch kernels (which means one CPU thread can control one GPU, since one CPU thread own one CUDA context) We can use MPI, openMP, pthreads to create multi CPU threads, then use cudaSetDevice() to assign each CPU thread to each GPU Data communications: copy data in global memory back to system memory, and transfer data through MPI, openMP, pthreads protocols. CUDA 4.0 UVA: universal virtual addressing (all GPU & CPU can see data from each other)
19
19 Hardware (SPA, Streaming Processors Array) TPC
20
20 Hardware (TPC, Texture/Processors Cluster)
21
21 Hardware (SM, Streaming Multiprocessor) Warp = 32 threads SP : MUL & MAD SFU : sin(x), cos(x)… 1 block divides into multi-warps SM could execute 1 warp in 4 clocks
22
22 SPA, Streaming Processors Array Double Precision Special Function Unit (SFU) TP Array Shared Memory
23
23 How to use so many cores? 240 SP thread processors 30 DP thread processors Full scalar processor IEEE 754 double precision floating point Double Precision Special Function Unit (SFU) TP Array Shared Memory Thread Processor (TP) FP/Int Multi-banked Register File SpcOps ALUs Thread Processor Array (TPA)
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.