Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan Stanford University August 30, 2004


1 Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication. Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan. Stanford University, August 30, 2004. [Title graphic: C = A * B]

2 Motivation: Harness GPU Performance [Chart: peak FLOPS and memory bandwidth of the 6800 Ultra and X800 XT PE relative to a P4 3.4 GHz]

3 Streaming Computation on GPUs  GPUs accelerate streaming numerical algorithms  Data parallelism  High ratio of arithmetic to data access  Little data reuse [Diagram: input elements pass through a kernel function (shader) to produce output elements]

4 Streaming Computation on GPUs  Level 1 BLAS operations Buck et al. [2004]  Fluid solvers Krüger & Westermann [2003] Bolz et al. [2003]  Image processing Apple Corp. [2004] McCormick et al. [2004]  Segmentation Sherbondy et al. [2003]  Database operations Govindaraju et al. [2004]  Data Clustering Hall et al. [2004]

5 Dense Matrix Multiplication (C = A * B)  Abundant data parallelism  Regular data access (no branching)  High ratio of computation to data access
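For reference, the computation on these slides is the standard triple loop; a minimal Python sketch (an illustration, not code from the talk):

```python
def matmul(A, B):
    """Naive dense matrix multiply: C = A * B.

    Each of the n*n output elements is an independent dot
    product (the abundant data parallelism the slide notes),
    the access pattern is fully regular, and the inner loop
    performs 2n flops while reading 2n input values.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C
```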

6 Dense Matrix Multiplication  Widely used computational kernel  Building block for LAPACK library

7 Matrix Multiplication on GPUs  Larsen & McAllister [2001]  Moravansky [2003]  Hall et al. [2003]  Limited analysis of performance

8 Overview  GPU Implementations  Results  Analysis: Why GPUs are slow  Ways to Make GPUs Better

9 CPU-Based Approaches  High-performance matrix multiplication algorithms are cache-aware  Partition computation into submatrix multiplications  Load input submatrices into cache  Multiply submatrices  Store output submatrix to memory [Diagram: C = A * B]
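The blocking strategy above can be sketched as follows (a minimal illustration of the partition-load-multiply-store pattern, not the actual ATLAS implementation):

```python
def blocked_matmul(A, B, b):
    """Cache-aware multiply: partition C = A * B into b-by-b
    submatrix multiplications so each submatrix loaded into
    cache is reused b times before being evicted."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):          # output submatrix row
        for j0 in range(0, n, b):      # output submatrix column
            for k0 in range(0, n, b):  # one submatrix multiply
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s    # store output submatrix
    return C
```

The block size b is chosen so three b-by-b submatrices fit in cache, which is exactly the knob the register-file blocking on slide 22 lacks on these GPUs.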

10 Method 1: Column Packed (CP) Larsen & McAllister [SC2001], Moravansky [2003]  4 elements stored per texel (x, y, z, w)  4x4 matrix by 4-vector multiplications [Diagram: C = A * B]
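The column-packed layout can be sketched on the CPU (an illustration of the data layout only; the real method runs as a fragment shader and packs A the same way, which is simplified away here, and `pack_columns`/`cp_matmul` are hypothetical names):

```python
def pack_columns(M):
    """Pack an n x n matrix so each 'texel' holds 4 consecutive
    elements of one column in its x, y, z, w components
    (n is assumed divisible by 4)."""
    n = len(M)
    return [[tuple(M[4 * t + c][j] for c in range(4))
             for t in range(n // 4)]
            for j in range(n)]

def cp_matmul(A, B):
    """Column-packed multiply: each 'fragment' produces one
    4-element texel of C by accumulating 4x4-matrix-by-4-vector
    products, one B texel fetch per step."""
    n = len(A)
    Bp = pack_columns(B)
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):                 # one fragment per C texel
        for t in range(n // 4):
            acc = [0.0, 0.0, 0.0, 0.0]
            for tp in range(n // 4):
                bv = Bp[j][tp]         # one texture fetch from B
                for c in range(4):     # 4x4 block of A ...
                    for r in range(4): # ... times the 4-vector
                        acc[r] += A[4 * t + r][4 * tp + c] * bv[c]
            for r in range(4):
                C[4 * t + r][j] = acc[r]
    return C
```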

11 Method 2: Submatrix Packed (SP) Hall et al. [2003]  2x2 submatrix stored per texel (x, y, z, w)  2x2 by 2x2 submatrix multiplications [Diagram: C = A * B]
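A corresponding sketch of the submatrix-packed scheme, again in plain Python rather than the original shader, with each texel's x/y/z/w components holding one 2x2 block (`sp_matmul` is a hypothetical name):

```python
def sp_matmul(A, B):
    """Submatrix-packed multiply: each texel is a 2x2 block
    (x y / z w), and each C block accumulates 2x2-by-2x2
    submatrix products."""
    n = len(A)
    h = n // 2  # texels per row/column

    def blk(M, i, j):
        # Fetch the 2x2 block at block coordinates (i, j)
        # as (x, y, z, w) = (top-left, top-right,
        #                    bottom-left, bottom-right).
        return (M[2*i][2*j], M[2*i][2*j+1],
                M[2*i+1][2*j], M[2*i+1][2*j+1])

    C = [[0.0] * n for _ in range(n)]
    for i in range(h):
        for j in range(h):
            x = y = z = w = 0.0
            for k in range(h):
                ax, ay, az, aw = blk(A, i, k)  # one A texel
                bx, by, bz, bw = blk(B, k, j)  # one B texel
                # one 2x2-by-2x2 multiply-accumulate
                x += ax * bx + ay * bz
                y += ax * by + ay * bw
                z += az * bx + aw * bz
                w += az * by + aw * bw
            C[2*i][2*j], C[2*i][2*j+1] = x, y
            C[2*i+1][2*j], C[2*i+1][2*j+1] = z, w
    return C
```

Relative to column packing, each pair of fetched texels here feeds eight multiply-adds instead of four, which is why SP has the better compute-to-fetch ratio.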

12 Alternative Approaches Ineffective  Varied mapping into texture memory  Altered rasterization order with geometry  Single quad most effective  Utilized multiple outputs  Varied amount of loop unrolling  Column packed: unroll maximally  Submatrix packed: unroll 128 times

13 Performance Results  Pentium 4 3 GHz CPU, 512KB L2 cache  12 GFLOPS peak compute  44.1 GB/sec cache BW  Using sgemm routine from ATLAS package  NVIDIA  GeForce 5900 Ultra  GeForce 6800 Ultra  ATI  Radeon 9800 XT  Radeon X800 XT PE (prerelease, 500 MHz mem / 500 MHz core clock)

14 Previous Generation GPUs [Chart: GFLOPS and bandwidth (GB/sec) achieved multiplying 1024x1024 matrices on the P4 3 GHz, 5900 Ultra, and 9800 XT]

15 Current Generation GPUs [Chart: GFLOPS and bandwidth (GB/sec) achieved multiplying 1024x1024 matrices on the P4 3 GHz, 6800 Ultra, and X800 XT PE]

16 Fragment Processor Data Paths [Diagram: data flows from the L2 cache through the L1 texture cache and texture unit into the fragment processor, then out to the frame buffer]

17 GPU Microbenchmarks: Peak Arithmetic Rate [Chart: GFLOPS (0 to 70) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

18 GPU Microbenchmarks: Observed Bandwidth [Chart: cache BW and sequential BW in GB/sec (0 to 30) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

19 Fragment Processor Data Paths [Diagram: the path from L2 through the L1 texture cache to the texture unit is high bandwidth (sized for texture filtering); the path from the texture unit to the fragment processor is low bandwidth (1 float/clock), while the processor executes 1 4-wide MAD/clock] Fragment processor consumes data at 8X the rate texture provides it!
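The 8X figure follows from simple per-clock arithmetic; the clock-level rates are the slide's, while the worst-case assumption that both MAD inputs are fresh texture data is mine:

```python
# Per-clock budget from the slide: one 4-wide MAD issued,
# one float delivered from the texture unit.
mad_width = 4        # components per 4-wide MAD
fresh_operands = 2   # a and b of a*b + c come from texture;
                     # c is the in-register accumulator
floats_consumed = mad_width * fresh_operands   # 8 floats/clock demanded
floats_delivered = 1                           # texture -> processor path
print(floats_consumed // floats_delivered)     # prints 8: the 8X imbalance
```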

20 Datapaths Designed for Shading [Diagram annotations: 8-to-1 reduction in amount of data; 4 components per clock]  8-bit components  2-to-1 ratio of compute to bandwidth  Texture units filter (reduce) data  Shaders use interpolated values & constants

21 Compute and Bandwidth Efficiency [Chart: percentage of peak compute and bandwidth (0 to 100) achieved on the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3 GHz] GPU algorithms are severely bandwidth limited!

22 Minimize Texture Fetches  Block in shader register file  Would need 8x8 submatrices to run at peak rates  Limited to 4x4 submatrices by available outputs
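The 8x8 requirement can be checked with a back-of-the-envelope model (my arithmetic, built on the slide's per-clock rates): blocking with b x b submatrices fetches two b x b input blocks per step while performing b**3 multiply-adds, so the MAD-to-fetched-float ratio is b/2, and matching 4 component MADs per clock against 1 texture float per clock needs b >= 8:

```python
def mads_per_float(b):
    """b x b register blocking: b**3 multiply-adds per pair of
    fetched b x b input blocks (2 * b * b floats)."""
    return b**3 / (2 * b * b)   # simplifies to b / 2

# Peak requires 4 component MADs per clock against 1 texture
# float per clock (slide 19's rates).
needed = 4
for b in (2, 4, 8):
    print(b, mads_per_float(b), mads_per_float(b) >= needed)
```

With b = 4, the largest block the available outputs allow, the ratio is only 2, half of what peak demands; b = 8 reaches the needed 4.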

23 Improvement 1: Widen Datapath  Fragment processor receives cached data more quickly  Expect performance to improve linearly with increase in bandwidth  Need ~4X improvement to achieve peak perf  But L2 may no longer be able to fill L1

24 Improvement 2: Larger Scratch Space  Requires large number of registers  Needs large number of output values  Reduces texture bandwidth requirements  Performance increases linearly with dimension of submatrices  Increases amount of per-pixel state  Storage increases as square of dimension of submatrices  Requires 16X space of SP method for peak perf

25 Summary  GPU algorithms for matrix-matrix multiplication run inefficiently  Best algorithms achieve below 20% of peak performance  Saturate data path between texture and FP units  Cache-aware software blocking strategies do not improve performance  Cannot exploit data reuse  Hardware limits algorithm efficiency

26 Summary  Hardware changes required to improve efficiency  Widen path between texture and register file  Output large number of values from shaders  Improved efficiency would make GPUs powerful platform for broader class of numerical algorithms

27 Acknowledgements  Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein  Support from ATI, NVIDIA, DARPA, IBM, SONY  Rambus Stanford Graduate Fellowship  Stanford School of Engineering Fellowship

28 Questions?

