Published by Lionel Gallagher. Modified over 9 years ago.
1
CS 395: CUDA Lecture 5
Memory coalescing (from ECE 408 at the University of Illinois)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
2
Memory Layout of a Matrix in C
[Figure: a 4×4 matrix M with elements M 0,0 through M 3,3, shown next to its linearized layout in memory. In the linear layout the elements are grouped by their second subscript, with the first subscript varying fastest: M 0,0 M 1,0 M 2,0 M 3,0 M 0,1 M 1,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3.]
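The layout on this slide can be illustrated with a small C sketch. It assumes the usual C convention that a two-dimensional array is stored row by row, so the element in row `row` and column `col` of a matrix with `width` columns sits at offset `row*width + col`. The helper names `linear_index`, `row_of`, and `col_of` are hypothetical and do not come from the lecture.

```c
#include <assert.h>

/* Offset of element (row, col) in a row-major matrix with `width` columns.
   linear_index is an illustrative helper, not part of the lecture code. */
int linear_index(int row, int col, int width)
{
    assert(row >= 0 && col >= 0 && col < width);
    return row * width + col;
}

/* Inverse mapping: recover the row and the column from a linear offset. */
int row_of(int offset, int width) { return offset / width; }
int col_of(int offset, int width) { return offset % width; }
```

For a 4×4 matrix, `linear_index(0, 3, 4)` is 3 and `linear_index(1, 0, 4)` is 4, so the last element of one row and the first element of the next row are neighbors in memory, while vertically adjacent elements are `width` elements apart.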
3
Memory Coalescing
When accessing global memory, peak performance is achieved when all threads in a half-warp access contiguous memory locations.
[Figure: matrices Md and Nd, each WIDTH × WIDTH, with the access directions of Thread 1 and Thread 2 marked. The two panels are captioned "Not coalesced" and "coalesced", respectively.]
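One way to see why contiguous accesses matter is to count how many aligned memory segments a group of threads touches. The sketch below is a simplified host-side model, not the hardware's actual rules: the function name `segments_touched`, the 64-byte segment size, and the 16-thread half-warp used in the usage note are assumptions chosen for illustration, and real coalescing behavior depends on the GPU generation.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model (an assumption, not the hardware specification):
   thread t reads element base + t*stride of an array whose items are
   elem_size bytes wide. Returns how many distinct aligned segments of
   segment_bytes bytes the group of threads touches. */
int segments_touched(size_t base, size_t stride, int n_threads,
                     size_t elem_size, size_t segment_bytes)
{
    assert(n_threads > 0 && n_threads <= 64);
    assert(elem_size > 0 && segment_bytes > 0);

    size_t seen[64];   /* segment ids encountered so far */
    int n_seen = 0;

    for (int t = 0; t < n_threads; ++t) {
        size_t byte_addr = (base + (size_t)t * stride) * elem_size;
        size_t seg = byte_addr / segment_bytes;
        int found = 0;
        for (int i = 0; i < n_seen; ++i) {
            if (seen[i] == seg) { found = 1; break; }
        }
        if (!found)
            seen[n_seen++] = seg;
    }
    return n_seen;
}
```

Under these assumptions, 16 threads reading consecutive floats from an aligned base touch a single segment (`segments_touched(0, 1, 16, 4, 64)` is 1), whereas 16 threads reading floats that are 1024 elements apart touch 16 separate segments. Starting the consecutive reads one element past the aligned base (`base = 1`) already spills into a second segment.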
4
Memory Layout of a Matrix in C
[Figure: the 4×4 matrix M and its linearized layout, annotated with the access direction in kernel code and with the elements that threads T1, T2, T3, and T4 access in Time Period 1, Time Period 2, …]
5
Memory Layout of a Matrix in C
[Figure: the 4×4 matrix M and its linearized layout, annotated with the access direction in kernel code and with the elements that threads T1, T2, T3, and T4 access in Time Period 1, Time Period 2, …]
6
Review: see the access patterns in both matrix multiplication (MM) kernels from earlier.
7
Review: Matrix Multiplication Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

    Pd[Row*Width+Col] = Pvalue;
}
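The access pattern of this kernel can be read off its two index expressions. The following host-side C sketch copies `Row*Width+k` and `k*Width+Col` from the inner loop and measures how far apart the reads of neighboring threads are at a fixed `k`. The function names (`md_read`, `nd_read`, `stride_in_x`, `stride_in_y`) are hypothetical helpers introduced here, and the sketch only models addresses; it says nothing about timing.

```c
#include <assert.h>

/* Index expressions copied from the kernel's inner loop.
   Each takes (row, col, k, width) so both share one signature. */
int md_read(int row, int col, int k, int width)
{
    (void)col;               /* Md's index does not depend on Col */
    return row * width + k;
}

int nd_read(int row, int col, int k, int width)
{
    (void)row;               /* Nd's index does not depend on Row */
    return k * width + col;
}

/* Element distance between the reads of two threads whose Col differs by 1. */
int stride_in_x(int (*read)(int, int, int, int), int k, int width)
{
    assert(width > 0);
    return read(0, 1, k, width) - read(0, 0, k, width);
}

/* Element distance between the reads of two threads whose Row differs by 1. */
int stride_in_y(int (*read)(int, int, int, int), int k, int width)
{
    assert(width > 0);
    return read(1, 0, k, width) - read(0, 0, k, width);
}
```

With `Width = 64`, threads that are neighbors in x read adjacent `Nd` elements (`stride_in_x(nd_read, 5, 64)` is 1) and the same `Md` element (stride 0), while threads that are neighbors in y read `Md` elements a full row apart (`stride_in_y(md_read, 5, 64)` is 64).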
8
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        // Wait until every thread in the block has loaded its elements
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        // Wait until every thread is done with the tiles before they are overwritten
        __syncthreads();
    }
    Pd[Row*Width+Col] = Pvalue;
}
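The two global-memory loads in the tiled kernel can be checked in the same way on the host. The C sketch below copies their index expressions and tests whether the threads `tx = 0 .. TILE_WIDTH-1` with a common `ty` read consecutive elements. It assumes `TILE_WIDTH` is 16 and that `Width` is a multiple of `TILE_WIDTH`; the helper names are hypothetical, and the check models addresses only.

```c
#include <assert.h>

#define TILE_WIDTH 16   /* assumed tile size for this sketch */

/* Global index read into Mds[ty][tx] during phase m of the tiled kernel. */
int md_tile_index(int by, int ty, int tx, int m, int width)
{
    int row = by * TILE_WIDTH + ty;
    return row * width + (m * TILE_WIDTH + tx);
}

/* Global index read into Nds[ty][tx] during phase m of the tiled kernel. */
int nd_tile_index(int bx, int ty, int tx, int m, int width)
{
    int col = bx * TILE_WIDTH + tx;
    return col + (m * TILE_WIDTH + ty) * width;
}

/* Returns 1 if threads tx = 0..TILE_WIDTH-1 with the same ty read
   consecutive elements, and 0 otherwise. `b` is the block index that
   index_fn expects (by for Md, bx for Nd). */
int row_of_threads_is_contiguous(int (*index_fn)(int, int, int, int, int),
                                 int b, int ty, int m, int width)
{
    assert(width > 0 && width % TILE_WIDTH == 0);
    for (int tx = 1; tx < TILE_WIDTH; ++tx) {
        int step = index_fn(b, ty, tx, m, width)
                 - index_fn(b, ty, tx - 1, m, width);
        if (step != 1)
            return 0;
    }
    return 1;
}
```

Both `row_of_threads_is_contiguous(md_tile_index, 2, 3, 1, 64)` and `row_of_threads_is_contiguous(nd_tile_index, 2, 3, 1, 64)` return 1: in the tiled kernel, `tx` enters both load addresses with unit stride, so a row of threads reads consecutive elements of `Md` as well as of `Nd`.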