
1 Weekly Report: Matrix Multiplication. Ph.D. student: Leo Lee. Date: Oct. 16, 2009

2 Outline
–Matrix multiplication
–Implementation
–Experiments
–Work plan

3 Matrix Multiplication (© David Kirk/NVIDIA and Wen-mei W. Hwu, Taiwan, June 30-July 2, 2008)
–A: M*N
–B: N*P
–C = A*B: M*P
[diagram: matrices A, B, and C with side lengths M, N, and P]
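For reference, the element-wise definition behind these dimensions (standard linear algebra, not spelled out on the slide): with A of size M*N and B of size N*P,

    C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj},   for 0 <= i < M, 0 <= j < P,

so C has size M*P and each element is the dot product of a row of A with a column of B.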

4 Matrix Multiplication

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* A, float* B, float* C, int hA, int wA, int wB)
{
    for (int i = 0; i < hA; ++i) {
        for (int j = 0; j < wB; ++j) {
            double sum = 0;
            // Dot product of row i of A and column j of B
            for (int k = 0; k < wA; ++k) {
                double a = A[i * wA + k];
                double b = B[k * wB + j];
                sum += a * b;
            }
            C[i * wB + j] = sum;
        }
    }
}

5 Implementation_1
One thread calculates one element of C:
dim3 grid(1, 1);
dim3 thread(WC, HC);

__global__ void matrixMul_low(float* C, float* A, float* B, int wA, int wB)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    float Csub = 0;
    for (int k = 0; k < wA; ++k)
    {
        Csub += A[ty * wA + k] * B[k * wB + tx];
    }
    C[ty * wB + tx] = Csub;
}
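For context, a minimal host-side driver that would launch this kernel (a sketch; the wrapper and buffer names are assumptions, only the grid/thread configuration comes from the slide):

#include <cuda_runtime.h>

// Launch matrixMul_low for an HA x WA matrix A and a WA x WB matrix B,
// assuming WC == WB and HC == HA as in the slide's configuration.
void launchMatrixMulLow(const float* hostA, const float* hostB,
                        float* hostC, int HA, int WA, int WB)
{
    float *dA, *dB, *dC;
    size_t sizeA = (size_t)HA * WA * sizeof(float);
    size_t sizeB = (size_t)WA * WB * sizeof(float);
    size_t sizeC = (size_t)HA * WB * sizeof(float);

    cudaMalloc((void**)&dA, sizeA);
    cudaMalloc((void**)&dB, sizeB);
    cudaMalloc((void**)&dC, sizeC);

    cudaMemcpy(dA, hostA, sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hostB, sizeB, cudaMemcpyHostToDevice);

    dim3 grid(1, 1);       // a single block, as on the slide
    dim3 thread(WB, HA);   // one thread per element of C
    matrixMul_low<<<grid, thread>>>(dC, dA, dB, WA, WB);

    cudaMemcpy(hostC, dC, sizeC, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

Note that a single block caps WB*HA at the per-block thread limit, which is exactly the size restriction discussed on the next analysis slide.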

6 Experiments_1 (10000 runs) [results figure]

7 Experiments_1 [results figure]

8 Brief analysis
–Less efficient than the CPU; data transfer occupies most of the time.
–Each thread loads a row of matrix A, loads a column of matrix B, and performs one multiply and one add for each pair of A and B elements, so the compute to off-chip memory access ratio is close to 1:1 (not very high; see the worked count below).
–The matrix size is limited by the number of threads allowed in a thread block (1*2*2 is not ok?).
–Try to increase the compute to off-chip memory access ratio!
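To make that ratio concrete (a worked count, consistent with the slide's claim): in each iteration of the kernel's inner loop a thread issues two off-chip loads (one element of A, one of B) and performs two floating-point operations (one multiply, one add), so

    compute : off-chip access = 2 : 2 = 1 : 1,

i.e. every arithmetic operation pays for one global memory access.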

9 Implementation_2: Tiled Multiply
–Each block computes one square sub-matrix Csub of C, of size BLOCK_SIZE x BLOCK_SIZE.
–Each thread computes one element of Csub.
–Assume that the dimensions of A and B are multiples of BLOCK_SIZE.
[diagram: A, B, and C partitioned into TILE_WIDTH x TILE_WIDTH tiles, with block indices bx, by and thread indices tx, ty]

10 Implementation_2
dim3 thread(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(WC / thread.x, HC / thread.y);

In the kernel function (AS and BS are shorthands for the shared arrays, e.g. #define AS(i, j) As[i][j]; a and b are the base offsets of the current A and B tiles, advanced by an outer loop over tiles that the slide omits):

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load the matrices from device memory to shared memory
AS(ty, tx) = A[a + ty * wA + tx];
BS(ty, tx) = B[b + ty * wB + tx];
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Each thread accumulates one element of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Csub += AS(ty, k) * BS(k, tx);
}
// Synchronize before the next pair of tiles overwrites shared memory
__syncthreads();

// After all tiles: write the result back to device memory
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;

11 Experiments_2 [figure: improvement from tiling]

12 Experiments_2 (10000 runs) [results figure]

13 Experiments_2 [results figure]

14 Experiments_2 [results figure]

15 Improvement of the GPU over the CPU [results figure]

16 Experiments_2 [results figure]


21 Timing results (in ms):

WA,HA,WB     | GPU compute | GPU total | CPU compute | CPU total
16,16,16     | 45          | 24678     | 15          | 78
32,32,32     | 60          | 27250     | 62          | 203
48,80,128    | 225         | 26625     | 861         | 1203
128,256,512  | 4249        | 35531     | 45829       | 49328
512,512,512  | 27441       | 70359     | 364232      | 382062
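Reading off the last row: for 512,512,512 the kernel-only speedup is 364232 / 27441 ≈ 13.3x, but including transfers it drops to 382062 / 70359 ≈ 5.4x, which is why data transfer dominates the analysis on the next slide.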

22 Brief analysis
–Using shared memory increases the compute to off-chip memory access ratio: 256 accesses per input tile, (16+16)*16*16 computations (interpreted below).
–Data transfer still occupies much time; try coalesced accesses.
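One way to read those numbers, assuming BLOCK_SIZE = 16: per tile phase a block loads 16*16 = 256 elements of each input matrix into shared memory (512 global accesses in total) and then performs (16+16)*16*16 = 8192 operations out of shared memory, so the compute to off-chip access ratio rises from about 1:1 to roughly 8192 : 512 = 16 : 1.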

23 Implementation_3
Transpose matrix B, as sketched below:
–Then reading B follows the same pattern as reading A;
–C[i, j] = ∑_k A[i, k] * B[j, k] (with the transposed B).
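A minimal sketch of this variant (the kernel name and the pre-transposed buffer Bt are assumptions; the slide only gives the formula): B is transposed once on the host, after which both inputs are traversed row by row.

__global__ void matrixMul_transposed(float* C, float* A, float* Bt,
                                     int wA, int wB)
{
    // Same one-thread-per-element layout as Implementation_1.
    int tx = threadIdx.x;   // column of C
    int ty = threadIdx.y;   // row of C
    float Csub = 0;
    for (int k = 0; k < wA; ++k)
    {
        // Row ty of A and row tx of Bt (i.e. column tx of B)
        // are both walked with unit stride.
        Csub += A[ty * wA + k] * Bt[tx * wA + k];
    }
    C[ty * wB + tx] = Csub;
}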

24 Experiments_3 [figure: coalesced accesses vs. Implementation_2]

25 Brief analysis
No big change:
–Review the code;
–Try a new method.

26 Work plan
–Further experiments on matrix multiplication
–Learn reduction

