Linear Algebra on GPUs Vasily Volkov. GPU Architecture Features SIMD architecture – Don’t be confused by scalar ISA which is only a program model We use.
Published byModified over 4 years ago
Presentation on theme: "Linear Algebra on GPUs Vasily Volkov. GPU Architecture Features SIMD architecture – Don’t be confused by scalar ISA which is only a program model We use."— Presentation transcript:
GPU Architecture Features SIMD architecture – Don’t be confused by scalar ISA which is only a program model We use vector thread program model instead – Similar to SSE/Cell SPE/vector units, not superscalar units – Native vector length is 32; can simulate larger (thread blocks) Multithreading using register windows – Context switch never flushes registers to memory – If more threads than can fit, some don’t start until others finish Huge register files, small high-latency caches – Fundamental difference with CPUs, similar to vector processors – Cache is to save memory bandwidth, not reduce latency – Use register blocking, not cache blocking Large startup costs (≈5 s) – Not good for fine-grain computations – Use global barriers instead? (≈1.5 s)
Relation to Other Multicores ProcessorCore2 SSE, 2.66GHzCell SPEsG80/G84 Gflop/s/core21.3 Gflop/s25.6 Gflop/s19-23 Gflop/s Vector length4432 # of cores2-484-16 All offer both multithreading and SIMD capabilities Use CUDA to program all of them?
Pointer Chase Benchmark, 8800GTX 8-word cache line L1: 8 x 5kB, each is 20-way associative L2: 6 x 32kB, each is 4-way associative 512kB memory pages TLB: 16 entries, fully associative run k=A[k] in a loop in a scalar thread latency bound larger latency at cache miss Reveals cache sizes 8800GTX/8800GTS/8600GTS/FX5600: Different number of similar caches
Matrix-Matrix multiply, C = C + AB 64x16 blocks in C, rank-4 updates – Ensures that our code is compute-bound Each thread computes one block in C – ijk/jik form; other choices produce race condition in C Keep A’s and C’s block in vector registers – Similarly done on IBM 3090 VF and Cray X1 Keep B’s block in local memory – Others keep it in scalar registers (n/a on GPU); or caches (slow) Use compiler options to enforce tight register budget – To hide memory latencies better by multithreading Use prefetching to hide latencies even better – Now, performance is not bound by latency and bandwidth in reading blocks in A and B! – Bound by instruction issue and local memory ops (230 Gflop/s)
Performance of SGEMM CUBLAS 1.1: keeps square blocks in A and B in local memory uses long vectors (poor instruction mix) exposing too much of data parallelism may cost you Our SGEMM is now in CUBLAS 2.0 beta
SGEMM, 8800GTX, k = 1024 Constant work per vector thread (function of k) Optimized version does better load balancing by computing partial sums
Panel Factorization CPU: runtime on Core2 Duo 2.66GHz, Intel MKL 10.0 (includes CPU-GPU transfer!) GPU: estimated for 8800GTX as
Design of Matrix Factorizations Right-looking scheme = most parallelism = best on 16-core GPUs Crout scheme = least bandwidth = best on 4-core GPU and if using CUBLAS 1.1 Left-looking = half of work in triangular solve = limited parallelism = inefficient 2-level blocking – Both levels are right-looking + premature exit from finer level to keep – Up to 6% speedup only, at large matrices (n≈10,000) Block size on GPU is 64 (same as in matrix multiply) – Autotuning in QR (up to 7% speedup) Row-major layout on GPU in LU decomposition – Since gathers with large stride are inefficient – Requires transposition at every CPU-GPU transfer – >2x speedup! Panel factorization on CPU overlapped with BLAS3 on GPU (use lookahead) Multiply by inverse (GEMM) instead of triangular solve (TRSM) – TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s if matrix is 64x64 – Parallelism in TRSM is not embarrassing enough
Test Platform GPU – Core2 Duo 2.67GHz + GeForce 8800 GTX – Core2 Duo 2.67GHz + two GeForce 8800 GTX CPU – Core2 Duo 2.67GHz – Core2 Quad 2.4GHz
Other Work Done Tridiagonal eigenvalue solver (bisection) – Most work: factorize A– i I = LDL T, count signs in D (compute bound) – Done for many i in parallel — traditionally vectorized – If need more parallelism — do multisection instead of bisection But it increases total flop count – Rest is difficult to parallelize, does not worth it – Our solution: Run vectorized loops on the GPU, rest (least work) on the CPU Autotune to decide optimal redundancy and when involve CPU Use features of IEEE arithmetic to save another 15-30% of runtime Up to 156x faster than LAPACK on Pentium 4 Tridiagonal eigenvector solver (inverse iteration) – Most work: Solve (A– i I)x k+1 =x k for fixed i (bandwidth bound) – Factorize A– i I = LDL T once, keep D only. Reconstruct L on need Reconstruction is overlapped with memory access; still bandwidth bound – Don’t pivot — recompute using safe code if fails (do it on CPU) – Up to 120x faster than LAPACK on Core2 Duo so far – More complicated when eigenvalues are clustered Stencil computation (7-point on 3D grid) – Blocks in registers and local memory – Bandwidth-bound, runs at up to 66% of pin-bandwidth
Future Work Analysis of architecture – Find best parallels in past architectures to reuse methods – Catching up with newer GPUs – More micro-benchmarking to get better performance models More scientific kernels – CUFFT is ≈50Gflop/s, can do better (e.g. by not doing sin/cos in the inner loop) More LAPACK – Two-sided factorizations used in eigensolvers and SVD LAPACK does 50% of work is in BLAS1/BLAS2 Mostly BLAS3 algorithm is known, but has requires more flops if eigenvectors are needed May use divide-and-conquer instead – MRRR (improved inverse iteration algorithm, also rich in parallelism) – Non-symmetric eigensolvers such as QR iterations currently fine-grained, can do better? – Iterative refinement for eigenvalue problem? ScaLAPACK (distributed memory LAPACK) – One-sided factorizations on a GPU cluster