Download presentation

Presentation is loading. Please wait.

1
Linear Algebra on GPUs Vasily Volkov

2
GPU Architecture Features SIMD architecture – Don’t be confused by scalar ISA which is only a program model We use vector thread program model instead – Similar to SSE/Cell SPE/vector units, not superscalar units – Native vector length is 32; can simulate larger (thread blocks) Multithreading using register windows – Context switch never flushes registers to memory – If more threads than can fit, some don’t start until others finish Huge register files, small high-latency caches – Fundamental difference with CPUs, similar to vector processors – Cache is to save memory bandwidth, not reduce latency – Use register blocking, not cache blocking Large startup costs (≈5 s) – Not good for fine-grain computations – Use global barriers instead? (≈1.5 s)

3
Relation to Other Multicores ProcessorCore2 SSE, 2.66GHzCell SPEsG80/G84 Gflop/s/core21.3 Gflop/s25.6 Gflop/s19-23 Gflop/s Vector length4432 # of cores2-484-16 All offer both multithreading and SIMD capabilities Use CUDA to program all of them?

4
Pointer Chase Benchmark, 8800GTX 8-word cache line L1: 8 x 5kB, each is 20-way associative L2: 6 x 32kB, each is 4-way associative 512kB memory pages TLB: 16 entries, fully associative run k=A[k] in a loop in a scalar thread latency bound larger latency at cache miss Reveals cache sizes 8800GTX/8800GTS/8600GTS/FX5600: Different number of similar caches

5
GPU Memory System (8800GTX)

6
Matrix-Matrix multiply, C = C + AB 64x16 blocks in C, rank-4 updates – Ensures that our code is compute-bound Each thread computes one block in C – ijk/jik form; other choices produce race condition in C Keep A’s and C’s block in vector registers – Similarly done on IBM 3090 VF and Cray X1 Keep B’s block in local memory – Others keep it in scalar registers (n/a on GPU); or caches (slow) Use compiler options to enforce tight register budget – To hide memory latencies better by multithreading Use prefetching to hide latencies even better – Now, performance is not bound by latency and bandwidth in reading blocks in A and B! – Bound by instruction issue and local memory ops (230 Gflop/s)

7
Performance of SGEMM CUBLAS 1.1: keeps square blocks in A and B in local memory uses long vectors (poor instruction mix) exposing too much of data parallelism may cost you Our SGEMM is now in CUBLAS 2.0 beta

8
SGEMM, 8800GTX, k = 1024 Constant work per vector thread (function of k) Optimized version does better load balancing by computing partial sums

9
Panel Factorization CPU: runtime on Core2 Duo 2.66GHz, Intel MKL 10.0 (includes CPU-GPU transfer!) GPU: estimated for 8800GTX as

10
Design of Matrix Factorizations Right-looking scheme = most parallelism = best on 16-core GPUs Crout scheme = least bandwidth = best on 4-core GPU and if using CUBLAS 1.1 Left-looking = half of work in triangular solve = limited parallelism = inefficient 2-level blocking – Both levels are right-looking + premature exit from finer level to keep – Up to 6% speedup only, at large matrices (n≈10,000) Block size on GPU is 64 (same as in matrix multiply) – Autotuning in QR (up to 7% speedup) Row-major layout on GPU in LU decomposition – Since gathers with large stride are inefficient – Requires transposition at every CPU-GPU transfer – >2x speedup! Panel factorization on CPU overlapped with BLAS3 on GPU (use lookahead) Multiply by inverse (GEMM) instead of triangular solve (TRSM) – TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s if matrix is 64x64 – Parallelism in TRSM is not embarrassing enough

11
Test Platform GPU – Core2 Duo 2.67GHz + GeForce 8800 GTX – Core2 Duo 2.67GHz + two GeForce 8800 GTX CPU – Core2 Duo 2.67GHz – Core2 Quad 2.4GHz

12
Summary of Performance

13
Speedups vs. CPU

14
Summary of Speedups 8800GTX Gflop/s Core2 DuoCore2 Quad Gflop/sspeedupGflop/sspeedup Cholesky18323.0 7.4 34.9 5.5 LU17922.7 7.7 59.2 3.0 QR19222.5 8.3 44.8 4.3 SGEMM20625.9 8.0 69.8 3.0

15
Speedup using 2 GPUs Using column-cyclic layout

16
Breakdown of runtime (LU)

17
What if omit one optimization in LU?

18
Other Work Done Tridiagonal eigenvalue solver (bisection) – Most work: factorize A– i I = LDL T, count signs in D (compute bound) – Done for many i in parallel — traditionally vectorized – If need more parallelism — do multisection instead of bisection But it increases total flop count – Rest is difficult to parallelize, does not worth it – Our solution: Run vectorized loops on the GPU, rest (least work) on the CPU Autotune to decide optimal redundancy and when involve CPU Use features of IEEE arithmetic to save another 15-30% of runtime Up to 156x faster than LAPACK on Pentium 4 Tridiagonal eigenvector solver (inverse iteration) – Most work: Solve (A– i I)x k+1 =x k for fixed i (bandwidth bound) – Factorize A– i I = LDL T once, keep D only. Reconstruct L on need Reconstruction is overlapped with memory access; still bandwidth bound – Don’t pivot — recompute using safe code if fails (do it on CPU) – Up to 120x faster than LAPACK on Core2 Duo so far – More complicated when eigenvalues are clustered Stencil computation (7-point on 3D grid) – Blocks in registers and local memory – Bandwidth-bound, runs at up to 66% of pin-bandwidth

19
Future Work Analysis of architecture – Find best parallels in past architectures to reuse methods – Catching up with newer GPUs – More micro-benchmarking to get better performance models More scientific kernels – CUFFT is ≈50Gflop/s, can do better (e.g. by not doing sin/cos in the inner loop) More LAPACK – Two-sided factorizations used in eigensolvers and SVD LAPACK does 50% of work is in BLAS1/BLAS2 Mostly BLAS3 algorithm is known, but has requires more flops if eigenvectors are needed May use divide-and-conquer instead – MRRR (improved inverse iteration algorithm, also rich in parallelism) – Non-symmetric eigensolvers such as QR iterations currently fine-grained, can do better? – Iterative refinement for eigenvalue problem? ScaLAPACK (distributed memory LAPACK) – One-sided factorizations on a GPU cluster

Similar presentations

© 2020 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google