
Slide 1: Fast Matrix Multiplication with CUDA
School of Electrical Engineering and Computer Science, University of Central Florida

Slide 2: Overview (CDA6938, University of Central Florida)
Platform
– GeForce 8800 GT, 512 MB
– Core: G92; shader frequency: 1.5 GHz; memory frequency: 900 MHz
Performance
– Tuned for 4k x 4k matrices: 192 GFLOPS
Revisiting the tiled version
Using large tiles
– Base algorithm
– Optimized algorithm
Tools and tips

Slide 3: The Tiled Version
– Tile size: 16 x 16
– 256 threads / block
– 14 registers, 2076 B shared memory / block
– Occupancy: 2/3
(Figure: threads T0, T1, ..., T255 mapped onto the 16 x 16 tile.)

Slide 4: The Tiled Version - Memory Access
Every half-warp accesses contiguous memory locations, so the global memory accesses are fully coalesced.
(Figure: half-warps T0-T15, T16-T31, ..., T240-T255 reading consecutive addresses.)

Slide 5: The Tiled Version - Bank Conflicts
No bank conflicts: within a half-warp, threads either hit distinct banks or read the same word, which the 16-bank shared memory serves as a single broadcast.
(Figure: 16 banks; one shared-memory word broadcast to all threads of a half-warp.)

Slide 6: The Tiled Version - Bottlenecks
If the memory bandwidth and the ALUs were both fully used:
– 14.4 G floats/s loaded, 168 G MADs/s
– i.e. 11.67 MADs must be performed per float loaded
With 16 x 16 tiles:
– Total W^3/8 loads, only 8 MADs per float
– Too many loads!
Solution: larger tiles. (Each element of As is reused TWidth times and each element of Bs is reused THeight times, so larger tiles mean more reuse per load.)
The inner product Psub += As[ty][k] * Bs[k][tx] also costs extra instructions for offset calculation, as the disassembled inner loop shows. Result: 77 GFLOPS (4k x 4k).

mov.b32 $r12, s[$ofs4+0x0000]
mov.b32 $r7, s[$ofs4+0x0040]
mad.rn.f32 $r11, s[$ofs1+0x000c], $r11, $r13
add.b32 $ofs4, $ofs3, 0x0000019c
mad.rn.f32 $r13, s[$ofs1+0x0010], $r12, $r11
mov.b32 $r12, s[$ofs4+0x0000]
mov.b32 $r11, s[$ofs4+0x0040]
mad.rn.f32 $r7, s[$ofs1+0x0014], $r7, $r13
add.b32 $ofs4, $ofs3, 0x0000021c
mad.rn.f32 $r13, s[$ofs1+0x0018], $r12, $r7

Slide 7: Using Large Tiles
Each thread:
– 17 loads / iteration
– W/16 iterations
– Total W^3/15 loads, i.e. 15 MADs per load
(Figure: tile dimensions 16 and 256; the A sub-tile is stored in shared memory, the B values in registers; 256 threads per block, 16 Psubs per thread.)

Slide 8: Using Large Tiles - Algorithm
For each pair of sub-tiles in A and B:
– Read the sub-tile of A into shared memory, one element per thread.
– For each of the 16 values from B:
  – Read one value from B into a register.
  – Perform one MAD for each Psub.
To remove the extra offset-calculation instructions, we want the A sub-tile stored in column-major order in shared memory. But ...
(Figure: tiles of A, B, and C; threads T0, T1, T2, ..., T255.)

Slide 9: Using Large Tiles - Algorithm (continued)
Solution 1:
– Transpose A to column-major order first.
Solution 2:
– Read A in row-major order, but write it to shared memory in column-major order.
– Bank conflicts on the shared-memory writes!
(Figure: threads T0-T15 writing a column of Shared A, with banks B0-B15.)

Slide 10: Using Large Tiles - Algorithm (continued)
Solution 3:
– Pad Shared A with one empty row.
– No bank conflicts, and no need to transpose A.
– 164 GFLOPS (4k x 4k).
(Figure: padded Shared A; successive elements of a column now fall in banks B0, B1, B2, ..., rotating through all 16 banks.)

Slide 11: Using Large Tiles - Code

for (int i = 0; i < MATRIX_WIDTH / 16; i++) {
    ashare[tx][ty] = A[0];
    __syncthreads();

    #pragma unroll  // 150 GFlops (4k x 4k) without unroll
    for (int k = 0; k < 16; k++) {
        b = B[k * MATRIX_WIDTH];
        comp16(b, &ashare[k][0], c);
    }

    A += 16;
    B += 16 * MATRIX_WIDTH;
    __syncthreads();
}

Slide 12: Using Large Tiles - Optimized

do {
    ashare[tx][ty] = a;
    __syncthreads();
    a = A[0];

    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    b[0] = B[4 * MATRIX_WIDTH];
    b[1] = B[5 * MATRIX_WIDTH];
    b[2] = B[6 * MATRIX_WIDTH];
    b[3] = B[7 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i][0], c);

    ...  // one more group of the same shape

    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    b[0] = B[12 * MATRIX_WIDTH];
    b[1] = B[13 * MATRIX_WIDTH];
    b[2] = B[14 * MATRIX_WIDTH];
    b[3] = B[15 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i + 8][0], c);

    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    A += 16;
    B += 16 * MATRIX_WIDTH;
    b[0] = B[0 * MATRIX_WIDTH];
    b[1] = B[1 * MATRIX_WIDTH];
    b[2] = B[2 * MATRIX_WIDTH];
    b[3] = B[3 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i + 12][0], c);

    __syncthreads();
} while (A < Alast);
...  // last iteration

Slide 13: Using Large Tiles - Performance

Kernel                 Matrix   regs  smem  occupancy  GFLOPS*
16x16 tile             1k x 1k  14    2076  2/3        81
16x16 tile             2k x 2k  14    2076  2/3        81
16x16 tile             4k x 4k  14    2076  2/3        77
16x256 tile base       1k x 1k  26    1116  1/3        172
16x256 tile base       2k x 2k  27    1116  1/3        161
16x256 tile base       4k x 4k  29    1116  1/3        164
16x256 tile optimized  1k x 1k  29    1116  1/3        176  (-maxrregcount 32)
16x256 tile optimized  2k x 2k  30    1116  1/3        185  (-maxrregcount 32)
16x256 tile optimized  4k x 4k  32    1116  1/3        192  (no -maxrregcount 32! otherwise lmem is used)
cublas                 1k x 1k  -     -     2/3        111
cublas                 2k x 2k  -     -     2/3        114
cublas                 4k x 4k  -     -     2/3        112

* Execution time is measured as the computation time on the GPU.

Slide 14: Using Large Tiles - Performance 2
GFLOPS (comp): excluding CPU-GPU data transfer time.
GFLOPS (total): including CPU-GPU data transfer time.

                                8800GT (G92)    8800GTX (G80)
Kernel                 Matrix   comp   total    comp   total
16x16 tile             1k x 1k  81     64       85     64
16x16 tile             2k x 2k  81     71       84     74
16x16 tile             4k x 4k  77     73       65     61
16x256 tile base       1k x 1k  172    112      185    113
16x256 tile base       2k x 2k  161    130      176    139
16x256 tile base       4k x 4k  164    146      178    157
16x256 tile optimized  1k x 1k  176    114      192    116
16x256 tile optimized  2k x 2k  185    145      193    149
16x256 tile optimized  4k x 4k  192    168      192    168
cublas                 1k x 1k  111    83       115    82
cublas                 2k x 2k  114    97       118    100
cublas                 4k x 4k  112    104      117    108

Slide 15: Tools - CUDA GPU Occupancy Calculator
(Screenshot of the occupancy calculator spreadsheet.)

Slide 16: Tools - decuda
Developed by Wladimir J. van der Laan, a PhD candidate at the Institute of Mathematics and Computing Science, University of Groningen.
http://www.cs.rug.nl/~wladimir/decuda/

Slide 17: Tools - CUDA Visual Profiler
http://forums.nvidia.com/index.php?showtopic=57443
– GPU time, CPU time, occupancy
– Profiler counters:
  – gld_incoherent: number of non-coalesced global memory loads
  – gld_coherent: number of coalesced global memory loads
  – gst_incoherent: number of non-coalesced global memory stores
  – gst_coherent: number of coalesced global memory stores
  – local_load: number of local memory loads
  – local_store: number of local memory stores
  – branch: number of branch events (instruction and/or sync stack)
  – divergent_branch: number of divergent branches within a warp
  – instructions: number of dynamic instructions (in fetch)
  – warp_serialize: number of threads in a warp serialized based on address (GRF or constant)
  – cta_launched: number of CTAs launched on the PM TPC

Slide 18: Tips
Get the usage of registers, smem, cmem, and lmem with --ptxas-options=-v:
nvcc -m32 -o data/matrix_kernel.cubin -cubin matrix_kernel.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -DUNIX -O3 --ptxas-options=-v
Limit register usage by compiling with -maxrregcount.

Slide 19: References
NVIDIA CUDA samples:
– http://www.nvidia.com/object/cuda_sample_linear_algebra.html
– Simple CUBLAS
– Matrix Multiplication
– Matrix Transpose
NVIDIA forum:
– http://forums.nvidia.com/index.php?showtopic=47689&st=0
