
Slide 1: Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Stanford University, August 30, 2004

Slide 2: Motivation: Harness GPU Performance
[Chart: relative peak FLOPS and memory bandwidth of a Pentium 4 3.4GHz, GeForce 6800 Ultra, and Radeon X800 XT PE]

Slide 3: Streaming Computation on GPUs
GPUs accelerate streaming numerical algorithms:
- Data parallelism
- High ratio of arithmetic to data access
- Little data reuse
[Diagram: input elements flow through a kernel function (shader) to output elements]

Slide 4: Streaming Computation on GPUs
- Level 1 BLAS operations: Buck et al. [2004]
- Fluid solvers: Kruger & Westermann [2003], Bolz et al. [2003]
- Image processing: Apple Corp. [2004], McCormick et al. [2004]
- Segmentation: Sherbondy et al. [2003]
- Database operations: Govindaraju et al. [2004]
- Data clustering: Hall et al. [2004]

Slide 5: Dense Matrix Multiplication
- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access
[Diagram: A * B = C]
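The "high ratio" claim can be made concrete: multiplying two n x n matrices performs 2n^3 floating-point operations while touching only 3n^2 matrix elements, so potential reuse grows with n. A minimal sketch (the function name is mine, not from the talk):

```python
def arithmetic_intensity(n: int) -> float:
    """Flops per matrix element touched, for an n x n dense matrix multiply."""
    flops = 2 * n ** 3        # n^3 multiply-adds = 2n^3 floating-point ops
    elements = 3 * n ** 2     # read A and B, write C
    return flops / elements   # grows as 2n/3

print(arithmetic_intensity(1024))  # ~682.7 for the 1024x1024 matrices measured later
```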

Slide 6: Dense Matrix Multiplication
- Widely used computational kernel
- Building block for the LAPACK library

Slide 7: Matrix Multiplication on GPUs
- Larsen & McAllister [2001]
- Moravansky [2003]
- Hall et al. [2003]
Prior work offers limited analysis of performance.

Slide 8: Overview
- GPU implementations
- Results
- Analysis: why GPUs are slow
- Ways to make GPUs better

Slide 9: CPU-Based Approaches
High-performance matrix multiplication algorithms are cache-aware. They partition the computation into submatrix multiplications:
- Load input submatrices into cache
- Multiply submatrices
- Store the output submatrix to memory
[Diagram: blocked A * B = C]
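The blocking scheme above can be sketched in plain Python (illustrative only; real cache-aware kernels such as ATLAS's sgemm tune the block size b to the cache and vectorize the inner loops):

```python
def blocked_matmul(A, B, n, b):
    """Multiply n x n matrices A and B (lists of lists) with b x b blocking."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):                  # output submatrix row
        for j0 in range(0, n, b):              # output submatrix column
            for k0 in range(0, n, b):          # submatrices of A and B held "in cache"
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s            # store output submatrix element
    return C
```

Any block size yields the same product; only the memory traffic pattern changes.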

Slide 10: Method 1: Column Packed (CP)
Larsen & McAllister [SC2001], Moravansky [2003]
- 4 elements stored per texel (x, y, z, w)
- 4x4-matrix by 4-vector multiplications
[Diagram: A * B = C]
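One inner step of the column-packed method can be sketched as follows (a pure-Python stand-in for the shader's four vec4 multiply-adds; the function and parameter names are mine):

```python
def cp_step(a_block, b_texel, acc):
    """Accumulate a 4x4 block of A times one packed 4-vector texel of B.

    a_block: 4 rows of 4 floats (four texels of A);
    b_texel, acc: 4 floats (one packed column texel each).
    """
    return [acc[r] + sum(a_block[r][k] * b_texel[k] for k in range(4))
            for r in range(4)]
```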

Slide 11: Method 2: Submatrix Packed (SP)
Hall et al. [2003]
- 2x2 submatrix stored per texel (x, y, z, w)
- 2x2 by 2x2 submatrix multiplications
[Diagram: A * B = C]
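The submatrix-packed inner step can be sketched the same way (my naming; a texel's (x, y, z, w) components hold the 2x2 submatrix [[x, y], [z, w]], and one step accumulates the product of two packed texels):

```python
def sp_step(a, b, acc):
    """Accumulate the 2x2 product of packed texels a and b into packed acc."""
    ax, ay, az, aw = a      # a encodes [[ax, ay], [az, aw]]
    bx, by, bz, bw = b
    cx, cy, cz, cw = acc
    return (cx + ax * bx + ay * bz,
            cy + ax * by + ay * bw,
            cz + az * bx + aw * bz,
            cw + az * by + aw * bw)
```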

Slide 12: Ineffective Alternative Approaches
- Varied the mapping into texture memory
- Altered rasterization order with geometry (a single quad was most effective)
- Utilized multiple outputs
- Varied the amount of loop unrolling (column packed: unroll maximally; submatrix packed: unroll 128 times)

Slide 13: Performance Results
CPU: Pentium 4 3GHz, 512KB L2 cache
- 12 GFLOPS peak compute
- 44.1 GB/sec cache bandwidth
- Using the sgemm routine from the ATLAS package
GPUs: NVIDIA GeForce 5900 Ultra and GeForce 6800 Ultra; ATI Radeon 9800 XT and Radeon X800 XT PE (prerelease, 500MHz memory / 500MHz core clock)

Slide 14: Previous Generation GPUs
[Chart: GFLOPS and bandwidth (GB/sec) for multiplication of 1024x1024 matrices on the P4 3GHz, 5900 Ultra, and 9800 XT]

Slide 15: Current Generation GPUs
[Chart: GFLOPS and bandwidth (GB/sec) for multiplication of 1024x1024 matrices on the P4 3GHz, 6800 Ultra, and X800 XT PE]

Slide 16: Fragment Processor Data Paths
[Diagram: data flows from the L2 cache through the texture unit and L1 texture cache to the fragment processor, then out to the frame buffer]

Slide 17: GPU Microbenchmarks: Peak Arithmetic Rate
[Chart: peak GFLOPS for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

Slide 18: GPU Microbenchmarks: Observed Bandwidth
[Chart: cache and sequential bandwidth (GB/sec) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

Slide 19: Fragment Processor Data Paths
[Diagram: the data path from the L2 cache through the texture unit and L1 texture cache to the fragment processor, annotated]
- High bandwidth from L2 into the L1 texture cache (sized for texture filtering)
- Low bandwidth from the texture unit into the fragment processor (1 float/clock)
- The fragment processor executes 1 4-wide MAD/clock
- The fragment processor consumes data at 8X the rate the texture unit provides it!
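One plausible accounting for the 8X figure (my arithmetic; the slide states only the ratio): a 4-wide MAD can consume two fresh 4-float operands each clock, while the texture path delivers a single float per clock:

```python
floats_needed = 2 * 4       # two vec4 inputs to one 4-wide MAD per clock
floats_delivered = 1        # texture unit: 1 float per clock
print(floats_needed // floats_delivered)  # 8
```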

Slide 20: Datapaths Designed for Shading
- Texture units filter (reduce) data: an 8-to-1 reduction in the amount of data (4 components per clock, 8-bit components)
- Shaders use interpolated values & constants: a 2-to-1 ratio of compute to bandwidth

Slide 21: Compute and Bandwidth Efficiency
[Chart: percentage of peak compute and bandwidth achieved on the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]
GPU algorithms are severely bandwidth limited!

Slide 22: Minimize Texture Fetches
- Block in the shader register file
- Would need 8x8 submatrices to run at peak rates
- Limited to 4x4 submatrices by the number of available outputs
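The 8x8 figure follows from a simple count (my sketch; the symbols n and b are mine): with b x b output blocking, a shader fetches 2nb values to perform 2nb^2 flops, so the arithmetic done per fetched value equals b. Matching the 8X compute-to-fetch gap therefore needs b = 8:

```python
def flops_per_fetch(n: int, b: int) -> float:
    """Flops per fetched value for one b x b output block of an n x n multiply."""
    fetches = 2 * n * b       # b rows of A plus b columns of B
    flops = 2 * n * b * b     # 2n flops for each of the b*b outputs
    return flops / fetches    # simplifies to b

print(flops_per_fetch(1024, 8))   # 8.0: 8x8 blocks match the 8X gap
print(flops_per_fetch(1024, 4))   # 4.0: the 4x4 blocks actually achievable
```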

Slide 23: Improvement 1: Widen Datapath
- The fragment processor receives cached data more quickly
- Expect performance to improve linearly with the increase in bandwidth
- Need ~4X improvement to achieve peak performance
- But the L2 may no longer be able to fill the L1

Slide 24: Improvement 2: Larger Scratch Space
- Reduces texture bandwidth requirements: performance increases linearly with the dimension of the submatrices
- Increases the amount of per-pixel state: requires a large number of registers and a large number of output values; storage increases as the square of the submatrix dimension
- Requires 16X the space of the SP method for peak performance

Slide 25: Summary
GPU algorithms for matrix-matrix multiplication run inefficiently:
- The best algorithms achieve below 20% of peak performance
- They saturate the data path between the texture and fragment processor units
- Cache-aware software blocking strategies do not improve performance: hardware limits prevent them from exploiting data reuse

Slide 26: Summary
Hardware changes are required to improve efficiency:
- Widen the path between the texture unit and the register file
- Output a larger number of values from shaders
Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms.

Slide 27: Acknowledgements
Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, and Steve Morein. Support from ATI, NVIDIA, DARPA, IBM, SONY, a Rambus Stanford Graduate Fellowship, and a Stanford School of Engineering Fellowship.

Slide 28: Questions?
