1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 22, 2011 Branching.ppt Control Flow These notes will introduce scheduling control-flow.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 22, 2011 Branching.ppt Control Flow These notes will introduce scheduling control-flow statements in kernel code.
2 Streaming Multiprocessors: GPU execution resources are organized into streaming multiprocessors (SMs). Each SM has compute resources such as a register file and an instruction scheduler. A number of blocks are assigned to each SM for execution. There is a limit on the number of threads that an SM can simultaneously track and schedule, and this limits the number and size of the blocks that can be assigned to each SM.
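The per-SM limits described on this slide can be read at run time. The following is a minimal sketch, assuming the CUDA runtime API is available; `printSmLimits` is a hypothetical helper name, not something from the slides.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the slides): prints the scheduling-related
// limits of one device. Returns false if the properties cannot be read,
// for example when no GPU is present.
bool printSmLimits(int device) {
    assert(device >= 0);
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    printf("%s\n", prop.name);
    printf("  SMs:                 %d\n", prop.multiProcessorCount);
    printf("  warp size:           %d\n", prop.warpSize);
    printf("  max threads / block: %d\n", prop.maxThreadsPerBlock);
    printf("  max threads / SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("  registers / block:   %d\n", prop.regsPerBlock);
    return true;
}
```

Calling `printSmLimits(0)` reports the first device; the values differ from one GPU generation to the next.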
3 Streaming Multiprocessors (continued): The Tesla C2050 (Fermi) has 14 SMs, each with 32 cores, so 448 cores in total. Apparently Fermi was originally intended to have 512 cores (16 SMs), but that ran too hot. The GeForce GTX 480 (March 2010) has 15 SMs (480 cores).
4 Thread Scheduling: Once a block is assigned to an SM, it is divided into 32-thread units called warps. The size of a warp could change between implementations. One warp is actually executed in hardware at a time (some documents describe a half-warp, a 16-thread unit, as what actually executes simultaneously). Execution in an SM starts with the first warp in the first block.
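The division of a block into warps can be made visible with a small kernel. This is a sketch under assumptions: a 1-D block, and the names `warpsPerBlock`, `recordWarpAndLane`, and `runRecordWarpAndLane` are hypothetical. It uses the built-in `warpSize` variable rather than a hard-coded 32, since the slide notes that the warp size could change.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Number of warps needed for a block of threadsPerBlock threads.
// Rounds up: a partially filled last warp still occupies a whole warp.
__host__ __device__ inline int warpsPerBlock(int threadsPerBlock, int warpSz) {
    return (threadsPerBlock + warpSz - 1) / warpSz;
}

// Each thread records which warp of its block it belongs to and its
// position (lane) within that warp. Assumes a 1-D block.
__global__ void recordWarpAndLane(int *warpOut, int *laneOut) {
    int tid = threadIdx.x;
    warpOut[tid] = tid / warpSize;
    laneOut[tid] = tid % warpSize;
}

// Hypothetical host wrapper: launches one block and copies the results back.
// Returns false on any CUDA error.
bool runRecordWarpAndLane(int threadsPerBlock, int *warpHost, int *laneHost) {
    assert(threadsPerBlock > 0);
    size_t bytes = threadsPerBlock * sizeof(int);
    int *warpDev = nullptr, *laneDev = nullptr;
    if (cudaMalloc(&warpDev, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&laneDev, bytes) != cudaSuccess) { cudaFree(warpDev); return false; }
    recordWarpAndLane<<<1, threadsPerBlock>>>(warpDev, laneDev);
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(warpHost, warpDev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(laneHost, laneDev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(warpDev);
    cudaFree(laneDev);
    return ok;
}
```

With a 96-thread block and 32-thread warps, threads 0 to 31 land in warp 0, threads 32 to 63 in warp 1, and threads 64 to 95 in warp 2.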
5 For a program without control instructions (no if statements, etc.), the same instruction is executed simultaneously for each thread in the warp.
6 Control-flow instructions: When there is a divergent path within a warp, the instructions on one path are executed first and then the instructions on the other path, so the two paths are serialized. Different warps are considered separately: it would be possible for one warp to execute one path and another warp to execute the other path at the same time.
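A kernel in which neighbouring threads take different branches illustrates the divergent case. This is only a sketch: the kernel and wrapper names are hypothetical, and the arithmetic in each branch is arbitrary. Because even and odd threads share a warp, every warp has to run both branches one after the other.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Even and odd threads sit in the same warp, so the condition differs
// within every warp and the two paths are serialized.
__global__ void divergentWithinWarp(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        out[i] = in[i] * 2;     // path A: even threads
    } else {
        out[i] = in[i] + 100;   // path B: odd threads
    }
}

// Hypothetical host wrapper. Returns false on any CUDA error.
bool runDivergentWithinWarp(const int *inHost, int *outHost, int n) {
    assert(n > 0);
    size_t bytes = n * sizeof(int);
    int *inDev = nullptr, *outDev = nullptr;
    if (cudaMalloc(&inDev, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&outDev, bytes) != cudaSuccess) { cudaFree(inDev); return false; }
    bool ok = cudaMemcpy(inDev, inHost, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 64;
        int blocks = (n + threads - 1) / threads;
        divergentWithinWarp<<<blocks, threads>>>(inDev, outDev, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(outHost, outDev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(inDev);
    cudaFree(outDev);
    return ok;
}
```

The results are correct either way; divergence affects only how long the warp takes, not what it computes.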
7 Maximum performance: Ideally, have no control-flow statements. If control-flow statements are necessary, the programmer might be able to arrange for each warp to execute just one path. Example: if (threadID < 16) /* do this */ ; if (threadID < 32) /* do this */ ; if (threadID < 48) /* do this */ ; Need to test/check.
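One way to arrange this is to branch on the warp index, so that every thread of a warp agrees on the condition. The sketch below is an assumption-laden variant of the slide's example, not the author's code: it places the boundaries at multiples of `warpSize` (the slide's boundaries at multiples of 16 match half-warps), uses an else-if chain so each thread takes exactly one path, and the stored values 10, 20, and 30 are placeholders for real work.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Branch on the warp index: all threads in a warp evaluate the condition
// the same way, so each warp follows a single path. Assumes a 1-D block.
__global__ void oneBranchPerWarp(int *out) {
    int tid = threadIdx.x;
    int warp = tid / warpSize;
    if (warp == 0) {
        out[tid] = 10;   // work for the first warp
    } else if (warp == 1) {
        out[tid] = 20;   // work for the second warp
    } else {
        out[tid] = 30;   // work for all remaining warps
    }
}

// Hypothetical host wrapper: one block of threadsPerBlock threads.
// Returns false on any CUDA error.
bool runOneBranchPerWarp(int threadsPerBlock, int *outHost) {
    assert(threadsPerBlock > 0);
    size_t bytes = threadsPerBlock * sizeof(int);
    int *outDev = nullptr;
    if (cudaMalloc(&outDev, bytes) != cudaSuccess) return false;
    oneBranchPerWarp<<<1, threadsPerBlock>>>(outDev);
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(outHost, outDev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(outDev);
    return ok;
}
```

As the slide says, this kind of arrangement should be tested on the target hardware rather than assumed to help.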
8 Compiler loop unrolling: Sometimes the compiler unrolls loops, and then there are no divergent paths. Example: for (i = 0; i < 4; i++) a[i] = 0; becomes a[0] = 0; a[1] = 0; a[2] = 0; a[3] = 0;
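The compiler can also be asked to unroll a loop explicitly with `#pragma unroll`, which nvcc supports for loops with a known trip count. The following is a minimal sketch with hypothetical names; whether a loop without the pragma is unrolled remains the compiler's decision.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread zeroes four consecutive elements. The trip count is a
// compile-time constant, so the loop can be fully unrolled.
__global__ void zeroFourPerThread(float *a) {
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        a[base + i] = 0.0f;
    }
}

// Hypothetical host wrapper: aHost holds 4 * threads floats and is
// overwritten in place. Returns false on any CUDA error.
bool runZeroFourPerThread(float *aHost, int threads) {
    assert(threads > 0);
    size_t bytes = 4 * threads * sizeof(float);
    float *aDev = nullptr;
    if (cudaMalloc(&aDev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(aDev, aHost, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        zeroFourPerThread<<<1, threads>>>(aDev);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(aHost, aDev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(aDev);
    return ok;
}
```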
9 Branch predication instructions: The compiler can also use branch predication instructions to eliminate divergent paths. A branch predication instruction is a machine instruction that combines a Boolean condition (predicate) with an operation such as addition. Example: ADD R1, R2, R3 where CC == zero, etc.
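A conditional with a very short body is the kind of code a compiler may turn into predicated instructions instead of a branch. The kernel below is a sketch with hypothetical names; whether predication is actually used is up to the compiler and is not guaranteed. The generated machine code can be inspected with `cuobjdump -sass` to see what was emitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Replaces negative values with zero. The inner if has a one-instruction
// body, making it a candidate for predication (a compiler decision).
__global__ void clampNegativesToZero(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        if (v < 0.0f) v = 0.0f;
        data[i] = v;
    }
}

// Hypothetical host wrapper: clamps dataHost in place.
// Returns false on any CUDA error.
bool runClampNegativesToZero(float *dataHost, int n) {
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *dataDev = nullptr;
    if (cudaMalloc(&dataDev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dataDev, dataHost, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        clampNegativesToZero<<<blocks, threads>>>(dataDev, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(dataHost, dataDev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dataDev);
    return ok;
}
```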
11 Notes on 2-D and 3-D addressing: For a 2-D address $(x, y)$ with block sizes $D_x$ and $D_y$, unique global thread ID = $x + y D_x$. For a 3-D address $(x, y, z)$ with block sizes $D_x$, $D_y$, and $D_z$, unique global thread ID = $x + y D_x + z D_x D_y$.
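The formulas can be checked with a small kernel. This sketch applies them within a single block, taking $D_x$ as `blockDim.x` and $D_y$ as `blockDim.y`; that mapping onto CUDA's built-in variables is an assumption about the slide's symbols, and the function names are hypothetical. The 2-D case is the same formula with z = 0.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Flattened thread ID for index (x, y, z) in a block of size Dx by Dy by Dz:
// x + y*Dx + z*Dx*Dy. Dz is not needed by the formula.
__host__ __device__ inline int flatThreadId(int x, int y, int z, int Dx, int Dy) {
    return x + y * Dx + z * Dx * Dy;
}

// Each thread stores its coordinates, packed as x + 10*y + 100*z, at the
// position given by its flattened ID (packing assumes dimensions below 10).
__global__ void recordCoords(int *out) {
    int id = flatThreadId(threadIdx.x, threadIdx.y, threadIdx.z,
                          blockDim.x, blockDim.y);
    out[id] = threadIdx.x + 10 * threadIdx.y + 100 * threadIdx.z;
}

// Hypothetical host wrapper: one block of Dx by Dy by Dz threads;
// outHost must hold Dx*Dy*Dz ints. Returns false on any CUDA error.
bool runRecordCoords(int Dx, int Dy, int Dz, int *outHost) {
    assert(Dx > 0 && Dy > 0 && Dz > 0);
    size_t bytes = (size_t)Dx * Dy * Dz * sizeof(int);
    int *outDev = nullptr;
    if (cudaMalloc(&outDev, bytes) != cudaSuccess) return false;
    recordCoords<<<1, dim3(Dx, Dy, Dz)>>>(outDev);
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(outHost, outDev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(outDev);
    return ok;
}
```

For a 4 by 3 by 2 block, thread (1, 2, 1) gets ID 1 + 2·4 + 1·4·3 = 21.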