Published by Milo Dedman. Modified over 2 years ago.

1
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

More on Performance Considerations

2
Objective

Highlight additional factors that affect CUDA performance:
– Data prefetching
– Instruction mix
– Thread granularity

3
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 408, University of Illinois, Urbana-Champaign

Tiled Matrix Multiply Example

Loop {
    Load current tile to shared memory
    syncthreads()
    Compute current tile
    syncthreads()
}

Loading the Md tile into shared memory has two parts:
1) Load the Md tile from global memory into registers
2) Store the register contents in shared memory

Similarly, loading the Nd tile into shared memory has two parts.

NOTE: There are no independent instructions between the two parts, that is, global memory latency cannot be hidden. Warps that are loading tiles have to wait a long time before they can start computing the product matrix tile.
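
The loop on this slide can be sketched as a CUDA kernel roughly as follows. This is a minimal illustration, not the course's reference code: the kernel name `MatrixMulTiled`, the choice `TILE_WIDTH = 16`, the use of managed memory, and the test data in `main` are all assumptions made here, and `Width` is assumed to be a multiple of `TILE_WIDTH`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size

// Baseline tiled kernel: every global load is immediately followed by the
// dependent store to shared memory, so no independent work overlaps the load.
__global__ void MatrixMulTiled(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0.0f;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Load current tile to shared memory (global -> register -> shared).
        Mds[ty][tx] = Md[Row * Width + m * TILE_WIDTH + tx];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();

        // Compute current tile.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();  // do not overwrite the tile while others still read it
    }
    Pd[Row * Width + Col] = Pvalue;
}

int main()
{
    const int Width = 64;  // assumed multiple of TILE_WIDTH
    const size_t bytes = Width * Width * sizeof(float);
    float *M, *N, *P;
    cudaMallocManaged(&M, bytes);
    cudaMallocManaged(&N, bytes);
    cudaMallocManaged(&P, bytes);
    // M is all ones and N[r][c] = c % 4, so P[r][c] should equal Width * (c % 4).
    for (int i = 0; i < Width * Width; ++i) { M[i] = 1.0f; N[i] = float(i % 4); }

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulTiled<<<grid, block>>>(M, N, P, Width);
    cudaDeviceSynchronize();

    int errors = 0;
    for (int i = 0; i < Width * Width; ++i)
        if (P[i] != Width * float(i % 4)) ++errors;
    printf("mismatches: %d\n", errors);

    cudaFree(M); cudaFree(N); cudaFree(P);
    return errors != 0;
}
```

In each assignment to `Mds` and `Nds`, the store to shared memory depends directly on the global load, which is the two-part sequence the slide describes.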

4
Prefetching: Tiled Matrix Multiply Example

Load first tile from global memory into registers
Loop {
    Deposit current tile to shared memory
    syncthreads()
    Load next tile from global memory into registers
    Compute current tile
    syncthreads()
}

– Before entering the loop, load the first tile into registers.
– Once in the loop, move the loaded data into shared memory.
– Once all threads pass the barrier, load the next tile from global memory into registers.
– Threads compute elements of the current Pd tile from shared memory in the dot-product loop.
– The barrier at the end of the loop keeps any thread from depositing the next tile while other threads are still reading the current one.

This provides many independent instructions between the load and the deposit.
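
A possible CUDA rendering of this prefetching loop is sketched below. The kernel name `MatrixMulPrefetch`, the register names `mReg` and `nReg`, `TILE_WIDTH = 16`, and the test harness are assumptions introduced for illustration. `Width` is assumed to be a multiple of `TILE_WIDTH`. Because the sketch uses a single shared-memory buffer, it needs a barrier at the end of each iteration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size

__global__ void MatrixMulPrefetch(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    int numTiles = Width / TILE_WIDTH;

    // Load the first tile from global memory into registers.
    float mReg = Md[Row * Width + tx];
    float nReg = Nd[ty * Width + Col];

    float Pvalue = 0.0f;
    for (int m = 0; m < numTiles; ++m) {
        // Deposit the current tile from registers into shared memory.
        Mds[ty][tx] = mReg;
        Nds[ty][tx] = nReg;
        __syncthreads();

        // Issue the loads for the next tile; the dot-product loop below does
        // not depend on them, so it can overlap with the memory latency.
        if (m + 1 < numTiles) {
            mReg = Md[Row * Width + (m + 1) * TILE_WIDTH + tx];
            nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + Col];
        }

        // Compute the current tile from shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];

        // Single shared buffer: wait until every thread has finished reading
        // before the next iteration overwrites the tile.
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}

int main()
{
    const int Width = 64;  // assumed multiple of TILE_WIDTH
    const size_t bytes = Width * Width * sizeof(float);
    float *M, *N, *P;
    cudaMallocManaged(&M, bytes);
    cudaMallocManaged(&N, bytes);
    cudaMallocManaged(&P, bytes);
    // M is all ones and N[r][c] = c % 4, so P[r][c] should equal Width * (c % 4).
    for (int i = 0; i < Width * Width; ++i) { M[i] = 1.0f; N[i] = float(i % 4); }

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulPrefetch<<<grid, block>>>(M, N, P, Width);
    cudaDeviceSynchronize();

    int errors = 0;
    for (int i = 0; i < Width * Width; ++i)
        if (P[i] != Width * float(i % 4)) ++errors;
    printf("mismatches: %d\n", errors);

    cudaFree(M); cudaFree(N); cudaFree(P);
    return errors != 0;
}
```

The price is two extra registers per thread, which may reduce the number of blocks that fit on an SM. Whether prefetching pays off therefore depends on the kernel's overall register use.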

5
[Figure: tiled multiplication of Md and Nd into Pd, showing the Pdsub tile, block indices bx and by, thread indices tx and ty, and the dimensions TILE_WIDTH and WIDTH]

Prefetch
– Deposit blue tile from registers into shared memory
– Syncthreads
– Load orange tile into registers
– Compute on blue tile
– Deposit orange tile into shared memory
– ….

6
Instruction Mix Considerations

The processor core has limited instruction processing bandwidth.
All instructions consume similar instruction processing bandwidth:
– Floating-point calculation instructions
– Load instructions
– Branch instructions
– Address arithmetic instructions
– Etc.

7
Instruction Mix Considerations I

A loop incurs instructions to:
– Update the loop counter
– Perform a conditional branch at the end of each iteration
– Perform address arithmetic when the loop counter is used to index matrices

These instructions compete against floating-point calculation instructions for the limited instruction processing bandwidth.

8
Instruction Mix Considerations II

    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];

Each iteration of the kernel loop above executes:
– 2 floating-point arithmetic instructions
– 1 loop branch instruction
– 2 address arithmetic instructions
– 1 loop counter increment instruction

Only 2 of these 6 instructions, or 1/3, are floating-point, which limits the achievable performance to no more than 1/3 of the peak instruction processing bandwidth.

9
Instruction Mix Considerations III

Loop unrolling can help.

Original loop:

    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];

Unrolled:

    Pvalue += Ms[ty][k] * Ns[k][tx] + … + Ms[ty][k+15] * Ns[k+15][tx];

The unrolled statement needs no loop counter update and no branch, and with constant indices the address arithmetic can be folded into instruction offsets.
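
In CUDA C the unrolling does not have to be written out by hand. One option is the `#pragma unroll` directive on a loop whose trip count is a compile-time constant, as in the sketch below. The kernel name `MatrixMulUnrolled`, `BLOCK_SIZE = 16`, and the test harness are assumptions made for illustration. Many compiler versions may unroll such a short loop on their own, so any benefit should be measured rather than assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed tile size; must be a compile-time constant

__global__ void MatrixMulUnrolled(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * BLOCK_SIZE + ty;
    int Col = blockIdx.x * BLOCK_SIZE + tx;

    float Pvalue = 0.0f;
    for (int m = 0; m < Width / BLOCK_SIZE; ++m) {
        Ms[ty][tx] = Md[Row * Width + m * BLOCK_SIZE + tx];
        Ns[ty][tx] = Nd[(m * BLOCK_SIZE + ty) * Width + Col];
        __syncthreads();

        // Ask the compiler to fully unroll the dot-product loop, removing the
        // counter update and branch from each of the BLOCK_SIZE iterations.
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}

int main()
{
    const int Width = 64;  // assumed multiple of BLOCK_SIZE
    const size_t bytes = Width * Width * sizeof(float);
    float *M, *N, *P;
    cudaMallocManaged(&M, bytes);
    cudaMallocManaged(&N, bytes);
    cudaMallocManaged(&P, bytes);
    // M is all ones and N[r][c] = c % 4, so P[r][c] should equal Width * (c % 4).
    for (int i = 0; i < Width * Width; ++i) { M[i] = 1.0f; N[i] = float(i % 4); }

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(Width / BLOCK_SIZE, Width / BLOCK_SIZE);
    MatrixMulUnrolled<<<grid, block>>>(M, N, P, Width);
    cudaDeviceSynchronize();

    int errors = 0;
    for (int i = 0; i < Width * Width; ++i)
        if (P[i] != Width * float(i % 4)) ++errors;
    printf("mismatches: %d\n", errors);

    cudaFree(M); cudaFree(N); cudaFree(P);
    return errors != 0;
}
```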

10
Granularity Considerations

For matrix multiplication, should I use 4X4, 8X8, 16X16 or 32X32 tiles?

For 4X4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
– There are 8 warps, but each warp is only half full.

For 8X8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
– There are 16 warps available for scheduling in each SM.
– Each warp spans four slices in the y dimension.

11
Granularity Considerations (continued)

For matrix multiplication, should I use 4X4, 8X8, 16X16 or 32X32 tiles?

For 16X16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity unless other resource considerations overrule.
– There are 24 warps available for scheduling in each SM.
– Each warp spans two slices in the y dimension.

For 32X32, we have 1024 threads per block. Not even one can fit into an SM!
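
The arithmetic on these two slides can be reproduced with a few lines of host code. The sketch below hard-codes the limits quoted on the slides (768 threads and 8 blocks per SM) and assumes a warp size of 32, which is what the "half full" remark for 16-thread blocks implies. The constant names are invented here, and other resource limits such as registers and shared memory are ignored.

```cuda
#include <algorithm>
#include <cstdio>

// Per-SM limits as quoted on the slides; the warp size is an assumption.
constexpr int kMaxThreadsPerSM = 768;
constexpr int kMaxBlocksPerSM  = 8;
constexpr int kWarpSize        = 32;

int main()
{
    const int tileWidths[] = {4, 8, 16, 32};
    for (int w : tileWidths) {
        int threadsPerBlock = w * w;
        // Blocks allowed by the thread limit, then capped by the block limit.
        int blocksByThreads = kMaxThreadsPerSM / threadsPerBlock;
        int blocksPerSM     = std::min(blocksByThreads, kMaxBlocksPerSM);
        // A partially filled warp still occupies a whole warp slot.
        int warpsPerBlock   = (threadsPerBlock + kWarpSize - 1) / kWarpSize;

        printf("%2dX%-2d tiles: %4d threads/block, %d blocks/SM, "
               "%3d threads/SM, %2d warps/SM\n",
               w, w, threadsPerBlock, blocksPerSM,
               blocksPerSM * threadsPerBlock, blocksPerSM * warpsPerBlock);
    }
    return 0;
}
```

For the four tile sizes this yields 128, 512, 768 and 0 resident threads per SM, and 8, 16, 24 and 0 warps, matching the figures on the slides.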

12
Thread Granularity: Matrix Multiply Example

The tiled algorithm uses one thread to compute one element of Pd.
– Each element is the dot product between one row of Md and one column of Nd.
– Multiple threads redundantly load each Md row.
– Two Pd elements in adjacent tiles use the same Md row.

The same Md row is redundantly loaded by the two thread blocks assigned to generate two horizontally adjacent Pd tiles.

Merge the two thread blocks into one:
– Each thread calculates two Pd elements.
– Both dot products use the same Mds row but different Nds columns.
– This reduces global memory accesses by ¼.
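
A sketch of the merged kernel is given below. Each block produces two horizontally adjacent Pd tiles, so per tile step it loads one Md tile and two Nd tiles (three tile loads) where two separate blocks would have loaded four, which is the ¼ reduction. The kernel name `MatrixMulTwoTiles`, the array names `Nds0` and `Nds1`, `TILE_WIDTH = 16`, and the test harness are assumptions made for illustration. `Width` is assumed to be a multiple of `2 * TILE_WIDTH`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size

__global__ void MatrixMulTwoTiles(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds0[TILE_WIDTH][TILE_WIDTH];  // Nd tile for the left Pd tile
    __shared__ float Nds1[TILE_WIDTH][TILE_WIDTH];  // Nd tile for the right Pd tile

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row  = blockIdx.y * TILE_WIDTH + ty;
    int Col0 = blockIdx.x * 2 * TILE_WIDTH + tx;  // column in the left tile
    int Col1 = Col0 + TILE_WIDTH;                 // column in the right tile

    float P0 = 0.0f, P1 = 0.0f;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // One Md tile is loaded once and shared by both dot products.
        Mds[ty][tx]  = Md[Row * Width + m * TILE_WIDTH + tx];
        Nds0[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col0];
        Nds1[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col1];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            float a = Mds[ty][k];  // same Mds row for both Pd elements
            P0 += a * Nds0[k][tx];
            P1 += a * Nds1[k][tx];
        }
        __syncthreads();
    }
    Pd[Row * Width + Col0] = P0;
    Pd[Row * Width + Col1] = P1;
}

int main()
{
    const int Width = 64;  // assumed multiple of 2 * TILE_WIDTH
    const size_t bytes = Width * Width * sizeof(float);
    float *M, *N, *P;
    cudaMallocManaged(&M, bytes);
    cudaMallocManaged(&N, bytes);
    cudaMallocManaged(&P, bytes);
    // M is all ones and N[r][c] = c % 4, so P[r][c] should equal Width * (c % 4).
    for (int i = 0; i < Width * Width; ++i) { M[i] = 1.0f; N[i] = float(i % 4); }

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(Width / (2 * TILE_WIDTH), Width / TILE_WIDTH);  // half as many blocks in x
    MatrixMulTwoTiles<<<grid, block>>>(M, N, P, Width);
    cudaDeviceSynchronize();

    int errors = 0;
    for (int i = 0; i < Width * Width; ++i)
        if (P[i] != Width * float(i % 4)) ++errors;
    printf("mismatches: %d\n", errors);

    cudaFree(M); cudaFree(N); cudaFree(P);
    return errors != 0;
}
```

The merged block uses more registers and shared memory per block, so fewer blocks may fit on each SM. The saving in global memory traffic has to be weighed against that loss of parallelism.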
