Accelerating SYMV kernel on GPUs Ahmad M Ahmad, AMCS Division, KAUST


1 Accelerating SYMV kernel on GPUs Ahmad M Ahmad, AMCS Division, KAUST ahmad.ahmad@kaust.edu.sa

2 Agenda ◦ Motivation ◦ GPU Technology ◦ GPU Optimization Issues ◦ MAGMA SYMV Kernel ◦ The New SYMV Kernel ◦ Performance Results ◦ What Helped Us? ◦ Future Work

3 Motivation GPUs are invading the HPC community. ◦ Many cores (~512) on a single GPU card. ◦ Best suited for massively (embarrassingly) parallel problems. ◦ Unlike CPUs, dedicate more silicon to floating-point operations. ◦ Unlike CPUs, consume much less power. Three of the top 5 supercomputers are heterogeneous (CPUs + GPUs). ◦ The world's biggest supercomputer to be built will have 18,000 GPUs. Getting high performance out of a GPU, however, is quite a challenge.

4 GPU Technology (Fermi) SM L2-Cache DRAM

5 GPU Technology (Fermi) For each SM ◦ 32 cores. ◦ 64 KB L1 cache/shared memory. ◦ 16 LD/ST units. ◦ 4 SFUs. ◦ 32,768 32-bit registers.

6 GPU Technology (Fermi) Fermi is the first GPU in the world with a complete memory hierarchy ◦ (registers, L1 cache/SHMEM, L2 cache, DRAM). Fermi is also the first GPU with ECC support. Fermi theoretical peak performance: ◦ ~1 Tflop/s (single precision) ◦ ~500 Gflop/s (double precision)

7 GPU Technology Why is it tough? Let’s take a look at the programming model… A user program is organized as a grid of computation blocks. Each block runs on one SM and has dedicated local memory. Blocks share the L2 cache and global memory.

8 GPU Technology Why is it tough? Let’s take a look at the programming model… A single computation block is divided into threads, arranged as a 1D, 2D, or 3D array. This is commonly known as a thread block. Threads are executed in warps (groups of 32).

9 GPU Optimization Issues General ◦ Load balancing between computation blocks. ◦ Data caching for reused data. ◦ Data prefetching (to mask memory latency). ◦ Avoiding the SLOW global memory as much as possible. ◦ Coalesced memory access (per warp). GPU specific ◦ Avoid shared memory bank conflicts. ◦ Avoid divergent branches (within the same warp). ◦ Avoid using many registers per thread (max 63 on Fermi). ◦ Use SM resources wisely to increase occupancy (since one SM can host more than one computation block simultaneously).

10 The SYMV Kernel A Level-2 BLAS kernel ◦ Computes: Y = α × A × X + β × Y  A is a symmetric matrix (S-D-C-Z precisions)  X and Y are vectors  α and β are scalars ◦ Only the lower or upper triangle of A is referenced. ◦ Matrix-vector multiplication involves data reuse only in the vector X. ◦ No data reuse can be exploited for the elements of A (except through symmetry).

11 MAGMA SYMV Kernel (SC’11 paper) Main ideas ◦ The matrix is divided into 64×64 sub-matrices. ◦ Each computation block is responsible for one horizontal row of sub-matrices. ◦ A computation block starts with the diagonal sub-matrix of its assigned row. ◦ Non-diagonal sub-matrices are processed twice  Once as the non-transposed sub-matrix.  Once as the transposed sub-matrix, to exploit symmetry. ◦ Recursive blocking  Used to save shared memory.  Each sub-matrix is processed in 32×32 chunks. ◦ Pointer redirecting  Used to handle matrix dimensions that are not multiples of 64.

12 MAGMA SYMV Kernel [diagram: partial products reduced through SHMEM/registers within a block; contributions computed by other blocks reduced through GLMEM; partial results spilled to GLMEM for other blocks]

13 Main Ideas of our Design Same 64×64 block size as MAGMA. Diagonal blocks are isolated from non-diagonal ones. Each computation block is responsible for one vertical column of sub-matrices, offering better locality for the column-major format. No recursive blocking ◦ Fermi has enough shared memory (up to 48 KB). ◦ Allows more efficient data prefetching (in diagonal sub-matrices). Shared memory usage is restricted to the reduction operation only ◦ On Fermi, SHMEM latency is high (compared to previous GPUs). ◦ In MAGMA, SHMEM is used for the reduction as well as for storing partial results. ◦ In the new design, partial results are accumulated in registers first and spilled once to shared memory for the reduction.

14 The New SYMV Kernel [diagram: partial products reduced through SHMEM/registers within a block; contributions computed by other blocks reduced through GLMEM; partial results spilled to GLMEM for other blocks]

15 Experiments The new kernel ◦ was written in CUDA C ver 4.0. ◦ was integrated into MAGMA/BLAS for testing. ◦ is, so far, designed for matrix dimensions that are multiples of 64. We plan to use either pointer redirecting (as in MAGMA) or padding (easier to implement, for a fast release). ◦ was tested on a Fermi (Tesla C2070) GPU with 6 GB of memory.

16 Performance Results “cont.”

17 Performance Results

18 What Helped Us? PAPI CUDA Component ◦ Extracts performance counters during kernel execution. ◦ Really easy to use (even for a first-time user)! ◦ Mainly used to identify where improvements are possible.  Shared memory bank conflicts.  Global memory misses (load/store).  Divergent branches.  Local memory usage.

19 What Helped Us? “cont.” NVIDIA compute profiler ◦ Extracts information unavailable or hard to get through the PAPI CUDA component.  Registers per thread.  GPU time.  Occupancy analysis.  Kernel memory bandwidth.

20 Future Work The distribution of work among computation blocks is not balanced. Balancing the load may lead to further improvement, but locality will not be exploited. A 1D block-cyclic assignment is intended. [diagram: block-cyclic assignment of the triangular grid of sub-matrix blocks to computation blocks]

21 Credits Rajib Nath (University of California, San Diego) ◦ Fruitful discussion about the design of the MAGMA SYMV kernel. ◦ Guidelines for possible improvements. Heike Jagode (UTK) ◦ Guidelines on the installation and usage of PAPI.

22 Thank You Questions?

