Slide 1: Fast matrix multiplication with CUDA
School of Electrical Engineering and Computer Science, University of Central Florida

Slide 2: Overview
- Platform
  - GeForce 8800 GT, 512 MB
  - Core: G92; shader frequency: 1.5 GHz; memory frequency: 900 MHz
- Performance
  - Tuned for 4k x 4k matrices: 192 GFlops
- Revisiting the tiled version
- Using large tiles
  - Base algorithm
  - Optimized algorithm
- Tools and tips

Slide 3: The tiled version
- Tile size: 16 x 16; 256 threads / block
- 14 regs, 2076 bytes smem / block
- Occupancy: 2/3
(figure: the block's 256 threads T0-T255 arranged as 16 rows of 16, with T0-T15 forming the first row)
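
As a reference for the next few slides, here is a minimal sketch of the tiled kernel being discussed. The kernel source is not on the slides; names such as As, Bs, W, and the launch geometry are assumptions, but the inner product matches the Psub line quoted on slide 6.

    // Hypothetical reconstruction: launched with dim3 block(16, 16) and
    // grid(W/16, W/16); each block computes one 16x16 tile of C.
    // Assumes W is a multiple of 16.
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int W)
    {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];

        int tx = threadIdx.x, ty = threadIdx.y;
        int row = blockIdx.y * 16 + ty;
        int col = blockIdx.x * 16 + tx;

        float Psub = 0.0f;
        for (int m = 0; m < W / 16; m++) {
            // One element of each tile per thread.
            As[ty][tx] = A[row * W + (m * 16 + tx)];
            Bs[ty][tx] = B[(m * 16 + ty) * W + col];
            __syncthreads();
            for (int k = 0; k < 16; k++)
                Psub += As[ty][k] * Bs[k][tx];  // the MAD quoted on slide 6
            __syncthreads();
        }
        C[row * W + col] = Psub;
    }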

Slide 4: The tiled version - memory access
- Every half warp accesses contiguous memory locations.
- Memory accesses are fully coalesced.
(figure: half warps T0-T15 and T16-T31 mapping onto consecutive addresses)
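
Concretely, in the global loads of the sketch above, the 16 threads of a half warp differ only in tx, so their addresses are 16 consecutive words (names as in the sketch):

    As[ty][tx] = A[row * W + (m * 16 + tx)];
    // tx = 0..15 within a half warp -> 16 consecutive, aligned 4-byte words,
    // i.e. one fully coalesced 64-byte transaction.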

Slide 5: The tiled version - bank conflicts
- No bank conflicts.
(figure: the 16 shared-memory banks; a same-address read by a half warp is served by broadcast)
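
Both shared-memory reads in the sketch's inner product are conflict-free:

    Psub += As[ty][k] * Bs[k][tx];
    // As[ty][k]: every thread of the half warp reads the same word
    //            -> served by a broadcast, no conflict.
    // Bs[k][tx]: consecutive tx -> 16 consecutive words, one per bank.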

Slide 6: The tiled version - bottlenecks
- If memory bandwidth and the ALUs were both fully used:
  - 14.4 Gfloat/s, 168 GMAD/s
  - i.e. 11.67 MAD/float are needed to keep the ALUs busy
- With 16 x 16 tiles:
  - Total W^3/8 loads, only 8 MAD/float
  - Too many loads! Solution: large tiles.
- Psub += As[ty][k] * Bs[k][tx]
  - Extra instructions for shared-memory offset calculation.
- 77 GFlops (4k x 4k)
(figure: each loaded element of A is reused TWidth times, each element of B THeight times)

decuda output of the inner loop, showing the extra add.b32 offset computations interleaved with the MADs:

    mov.b32 $r12, s[$ofs4+0x0000]
    mov.b32 $r7, s[$ofs4+0x0040]
    mad.rn.f32 $r11, s[$ofs1+0x000c], $r11, $r13
    add.b32 $ofs4, $ofs3, 0x c
    mad.rn.f32 $r13, s[$ofs1+0x0010], $r12, $r11
    mov.b32 $r12, s[$ofs4+0x0000]
    mov.b32 $r11, s[$ofs4+0x0040]
    mad.rn.f32 $r7, s[$ofs1+0x0014], $r7, $r13
    add.b32 $ofs4, $ofs3, 0x c
    mad.rn.f32 $r13, s[$ofs1+0x0018], $r12, $r7
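
A sketch of where these numbers plausibly come from, combining the slide-2 clocks with the 8800 GT's published specs (112 stream processors, 256-bit GDDR3 bus); the derivation itself is not on the slide:

    memory: 900 MHz x 2 (DDR) x 256 bit / 8  = 57.6 GB/s = 14.4 Gfloat/s
    ALUs:   112 SPs x 1.5 GHz x 1 MAD/cycle  = 168 GMAD/s
    ratio:  168 / 14.4 ~ 11.67 MADs needed per loaded float
    tiled:  2 loads and 16 MADs per thread per iteration -> 8 MAD/float,
            and 2 x (W/16) x W^2 = W^3/8 loads in total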

Slide 7: Using large tiles
- Each thread:
  - 17 loads / iteration
  - W/16 iterations
  - Total ~W^3/15 loads, 15 MAD/load
(figure: a 16 x 256 tile of C; the 16 x 16 sub-tile of A is stored in shared memory, the values of B in registers; 256 threads, 16 Psubs/thread)
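
The counts follow from the 16 x 256 tile shape; as a sketch of the bookkeeping (not spelled out on the slide):

    per thread, per iteration: 1 load from A + 16 loads from B = 17 loads,
                               16 Psubs x 16 = 256 MADs -> 256/17 ~ 15 MAD/load
    in total: (W^2/16) threads x (W/16) iterations x 17 loads
              = 17 W^3 / 256 ~ W^3/15 loads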

Slide 8: Using large tiles - algorithm
- For each sub-tile in A & B:
  - Read the sub-tile of A into shared memory: 1 number / thread.
  - For each of the 16 numbers in B:
    - Read one number from B into a register.
    - Perform one MAD for each Psub.
- To remove the extra instructions for offset calculation, we want the sub-tile of A stored in column-major format in shared memory.
  - But ...
(figure: the A, B, and C tiles with threads T0-T255)

Slide 9: Using large tiles - algorithm
- Solution 1:
  - Transpose A to column-major format first.
- Solution 2:
  - Read A in row-major format and write it to shared memory in column-major format.
  - Bank conflicts when writing to shared memory!
(figures: threads T0-T15 writing Shared A; with solution 2 the column-major writes from a half warp pile into the same banks B0-B15)

Slide 10: Using large tiles - algorithm
- Solution 3:
  - Pad Shared A with one empty row.
  - No bank conflicts, and no need to transpose A.
  - 164 GFlops (4k x 4k).
(figure: the padded Shared A; successive writes from a half warp now fall into banks B1, B2, ..., B0 instead of a single bank)
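
A minimal sketch of the padding trick, using the ashare array from the code on the next slide (the declaration itself is not shown on the slides; the slide describes the padding as one empty row of the transposed tile):

    // 16 x 16 sub-tile of A with one element of padding per row: the store
    // ashare[tx][ty] from a half warp (consecutive tx, same ty) now has a
    // stride of 17 words, so the 16 stores land in 16 different banks.
    __shared__ float ashare[16][17];
    ...
    ashare[tx][ty] = A[0];   // transposed store, conflict-free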

Slide 11: Using large tiles - code

    for (int i = 0; i < MATRIX_WIDTH/16; i++) {
        ashare[tx][ty] = A[0];
        __syncthreads();
        #pragma unroll   // 150 GFlops (4k x 4k) without unroll
        for (int k = 0; k < 16; k++) {
            b = B[k * MATRIX_WIDTH];
            comp16(b, &ashare[k][0], c);
        }
        A += 16;
        B += 16 * MATRIX_WIDTH;
        __syncthreads();
    }
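
comp16 itself does not appear on the slides; a plausible sketch, consistent with how it is called here (one value of B, one column of the shared A tile, and the thread's 16 partial sums), would be:

    // Hypothetical reconstruction of comp16: 16 MADs, one per partial sum.
    __device__ void comp16(float b, const float *a, float *c)
    {
        #pragma unroll
        for (int i = 0; i < 16; i++)
            c[i] += a[i] * b;   // c[] lives in registers: 16 Psubs per thread
    }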

Slide 12: Using large tiles - optimized

    do {
        ashare[tx][ty] = a;
        __syncthreads();
        a = A[0];                        // prefetch this thread's element of
                                         // the next A sub-tile

        // consume the 4 B values fetched previously, prefetch the next 4
        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        b[0] = B[4 * MATRIX_WIDTH];  b[1] = B[5 * MATRIX_WIDTH];
        b[2] = B[6 * MATRIX_WIDTH];  b[3] = B[7 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i++)
            comp16(bb[i], &ashare[i][0], c);

        ...                              // (elided on the slide: the same
                                         // pattern with ashare[i + 4])

        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        b[0] = B[12 * MATRIX_WIDTH]; b[1] = B[13 * MATRIX_WIDTH];
        b[2] = B[14 * MATRIX_WIDTH]; b[3] = B[15 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i++)
            comp16(bb[i], &ashare[i + 8][0], c);

        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        A += 16;  B += 16 * MATRIX_WIDTH;
        b[0] = B[0 * MATRIX_WIDTH];  b[1] = B[1 * MATRIX_WIDTH];
        b[2] = B[2 * MATRIX_WIDTH];  b[3] = B[3 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i++)
            comp16(bb[i], &ashare[i + 12][0], c);

        __syncthreads();
    } while (A < Alast);
    ...                                  // last iteration

Slide 13: Using large tiles - performance

    Kernel                  Matrix                             Occupancy  Gflops*
    16x16 tile              1k x 1k                            2/3        81
                            2k x 2k                            2/3        81
                            4k x 4k                            2/3        77
    16x256 tile, base       1k x 1k                            -          172
                            2k x 2k                            -          161
                            4k x 4k                            -          164
    16x256 tile, optimized  1k x 1k (-maxrregcount 32)         -          176
                            2k x 2k (-maxrregcount 32)         -          185
                            4k x 4k (no -maxrregcount 32!      -          192
                                     otherwise lmem is used)
    cublas                  1k x 1k                            2/3        111
                            2k x 2k                            2/3        114
                            4k x 4k                            2/3        112

    * Execution time is measured as the computation time on the GPU.

Slide 14: Using large tiles - performance 2
- Gflops (comp): excluding CPU-GPU data transfer time.
- Gflops (total): including CPU-GPU data transfer time.
(table: Gflops (comp) and Gflops (total) for the 16x16 tile, 16x256 base, 16x256 optimized, and cublas kernels at 1k x 1k, 2k x 2k, and 4k x 4k, on both the 8800GT (G92) and the 8800GTX (G80); the numeric entries did not survive transcription)

Slide 15: Tools - CUDA GPU Occupancy Calculator
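
As a worked example of what the calculator reports, the slide-3 occupancy can be checked by hand, assuming the G80/G92 per-SM limits (8192 registers, 16 KB shared memory, 768 threads):

    registers: 14 regs x 256 threads = 3584 per block -> floor(8192/3584) = 2 blocks/SM
    smem:      2076 bytes per block -> well under 16 KB, not the limiter
    threads:   2 blocks x 256 threads = 512 of 768 -> occupancy = 2/3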

Slide 16: Tools - decuda
- Developed by Wladimir J. van der Laan, a PhD candidate at the Institute of Mathematics and Computing Science of the University of Groningen.

Slide 17: Tools - CUDA Visual Profiler
- GPU time, CPU time, occupancy
- Profiler counters:
  - gld_incoherent: number of non-coalesced global memory loads
  - gld_coherent: number of coalesced global memory loads
  - gst_incoherent: number of non-coalesced global memory stores
  - gst_coherent: number of coalesced global memory stores
  - local_load: number of local memory loads
  - local_store: number of local memory stores
  - branch: number of branch events (instruction and/or sync stack)
  - divergent_branch: number of divergent branches within a warp
  - instructions: number of dynamic instructions (in fetch)
  - warp_serialize: number of threads in a warp that serialize based on address (GRF or constant)
  - cta_launched: number of CTAs launched on the PM TPC

Slide 18: Tips
- Get usage of reg, smem, cmem, and lmem:

    nvcc -m32 -o data/matrix_kernel.cubin -cubin matrix_kernel.cu \
         --compiler-options -fno-strict-aliasing \
         -I. -I/usr/local/cuda/include -I../../common/inc \
         -DUNIX -O3 --ptxas-options=-v

- Compile with -maxrregcount to cap register usage per thread.
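
For example, the optimized kernel in the slide-13 table was built with a register cap for the 1k and 2k cases; a command along the lines of the one above would add the flag as written on the slide:

    nvcc -maxrregcount 32 --ptxas-options=-v -cubin matrix_kernel.cu
    # --ptxas-options=-v then prints the per-kernel usage, e.g. a line of the
    # form "ptxas info : Used NN registers, NNNN+NN bytes smem"; if the cap
    # forces spills, lmem usage shows up there as well.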

Slide 19: References
- NVIDIA CUDA samples: http://www.nvidia.com/object/cuda_sample_linear_algebra.html
  - Simple CUBLAS
  - Matrix Multiplication
  - Matrix Transpose
- NVIDIA forum: http://forums.nvidia.com/index.php?showtopic=47689&st=0

