1 Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan, Stanford University, August 30, 2004 [Title graphic: C = A * B]

2 Motivation: Harness GPU Performance [Chart: peak FLOPS, memory bandwidth, and relative performance of P4 3.4GHz, 6800 Ultra, and X800 XT PE]

3 Streaming Computation on GPUs
 GPUs accelerate streaming numerical algorithms
 Data parallelism
 High ratio of arithmetic to data access
 Little data reuse
[Diagram: input elements -> kernel function (shader) -> output elements]
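The streaming model on this slide can be sketched as a data-parallel map: one kernel function applied independently to every input element, with no data reuse between elements. This is an illustrative pure-Python sketch (real shaders run per-fragment in hardware; the `run_kernel` and `kernel_fn` names are mine):

```python
# Streaming computation as a data-parallel map: each output element
# depends only on its corresponding input element.
def run_kernel(kernel, input_elements):
    return [kernel(x) for x in input_elements]  # each element is independent

# A hypothetical arithmetic-heavy kernel with no memory reuse.
kernel_fn = lambda x: 2.0 * x + 1.0

print(run_kernel(kernel_fn, [0.0, 1.0, 2.0]))  # [1.0, 3.0, 5.0]
```

Because every element is independent, the GPU can process many fragments in parallel, which is exactly the workload shape the slide says GPUs accelerate.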

4 Streaming Computation on GPUs
 Level 1 BLAS operations: Buck et al. [2004]
 Fluid solvers: Krüger & Westermann [2003], Bolz et al. [2003]
 Image processing: Apple Corp. [2004], McCormick et al. [2004]
 Segmentation: Sherbondy et al. [2003]
 Database operations: Govindaraju et al. [2004]
 Data clustering: Hall et al. [2004]

5 Dense Matrix Multiplication
 Abundant data parallelism
 Regular data access (no branching)
 High ratio of computation to data access
[Diagram: C = A * B]
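The "high ratio of computation to data access" claim can be checked with standard flop counting (my own back-of-envelope, not a figure from the talk): multiplying two n x n matrices takes about 2*n^3 flops but touches only 3*n^2 matrix elements, so the ratio grows linearly with n.

```python
# Arithmetic intensity of dense n x n matmul: roughly 2*n**3 flops
# over 3*n**2 elements (A, B, and C), so flops per element ~ 2n/3.
def flops_per_element(n):
    return (2 * n**3) / (3 * n**2)

print(flops_per_element(1024))  # roughly 683 flops per element touched
```

This is why matrix multiplication is, in principle, compute-bound rather than bandwidth-bound: there is enough arithmetic per element to hide memory traffic, provided the hardware can exploit the reuse.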

6 Dense Matrix Multiplication
 Widely used computational kernel
 Building block for the LAPACK library

7 Matrix Multiplication on GPUs
 Larsen & McAllister [2001]
 Moravansky [2003]
 Hall et al. [2003]
 Limited analysis of performance

8 Overview
 GPU implementations
 Results
 Analysis: why GPUs are slow
 Ways to make GPUs better

9 CPU-Based Approaches
 High-performance matrix multiplication algorithms are cache aware
 Partition computation into submatrix multiplications
   Load input submatrices into cache
   Multiply submatrices
   Store output submatrix to memory
[Diagram: C = A * B]
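The blocking scheme on this slide can be sketched as follows: partition C = A * B into b x b submatrix multiplications, so each input block is reused b times once it is resident in cache. A minimal pure-Python sketch (illustrative only; real cache-aware codes like ATLAS tune block sizes and use vectorized inner kernels):

```python
# Cache-aware blocked matrix multiply: C = A * B, n x n matrices,
# b x b blocks (b must divide n for this simple sketch).
def blocked_matmul(A, B, n, b):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):            # loop over output submatrices
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):    # accumulate A-block * B-block
                for i in range(i0, i0 + b):
                    for j in range(j0, j0 + b):
                        s = C[i][j]
                        for k in range(k0, k0 + b):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

The three outer loops walk over blocks; the three inner loops do a small submatrix multiply whose operands fit in cache, which is the data reuse the talk later shows GPUs cannot exploit.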

10 Method 1: Column Packed (CP) Larsen & McAllister [SC2001], Moravansky [2003]
 4 elements stored per texel (x, y, z, w)
 4x4 matrix by 4-vector multiplications
[Diagram: C = A * B]
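A rough model of the column-packed layout, as I read the slide: each texel's four components hold four consecutive elements of a matrix column, so the shader's inner step is a 4x4-matrix-by-4-vector product. A sketch of the packing only (the `pack_columns` name and list-of-tuples representation are mine, standing in for texture memory):

```python
# Column-packed layout: texels[j][t] holds elements
# M[4t], M[4t+1], M[4t+2], M[4t+3] of column j as one (x, y, z, w) texel.
def pack_columns(M, n):
    assert n % 4 == 0
    return [[tuple(M[4 * t + c][j] for c in range(4))
             for t in range(n // 4)]
            for j in range(n)]
```

One texture fetch then delivers four elements of a column at once, which is what lets the shader batch its arithmetic into 4-wide vector operations.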

11 Method 2: Submatrix Packed (SP) Hall et al. [2003]
 2x2 submatrix stored per texel (x, y, z, w)
 2x2 by 2x2 submatrix multiplications
[Diagram: C = A * B]
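A rough model of the submatrix-packed layout from the slide: each texel's (x, y, z, w) holds a 2x2 submatrix, and the product is accumulated from 2x2-by-2x2 submatrix multiplications. A sketch of that inner step (the component ordering, top-left/top-right/bottom-left/bottom-right, is my assumption):

```python
# Multiply two 2x2 submatrices packed as (x, y, z, w) =
# (top-left, top-right, bottom-left, bottom-right).
def mul2x2(p, q):
    return (p[0] * q[0] + p[1] * q[2],
            p[0] * q[1] + p[1] * q[3],
            p[2] * q[0] + p[3] * q[2],
            p[2] * q[1] + p[3] * q[3])

print(mul2x2((1, 2, 3, 4), (5, 6, 7, 8)))  # (19, 22, 43, 50)
```

Packing a 2x2 block per texel halves the number of texture fetches per flop relative to fetching scalars, which is why SP fares better than CP in the results that follow.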

12 Alternative Approaches (Ineffective)
 Varied mapping into texture memory
 Altered rasterization order with geometry
   Single quad most effective
 Utilized multiple outputs
 Varied amount of loop unrolling
   Column packed: unroll maximally
   Submatrix packed: unroll 128 times

13 Performance Results
 Pentium 4 3GHz CPU, 512KB L2 cache
   12 GFLOPS peak compute
   44.1 GB/sec cache BW
   Using sgemm routine from the ATLAS package
 NVIDIA: GeForce 5900 Ultra, GeForce 6800 Ultra
 ATI: Radeon 9800 XT, Radeon X800 XT PE (prerelease, 500MHz mem / 500MHz core clock)

14 Previous Generation GPUs [Chart: GFLOPS and bandwidth (GB/sec) for multiplication of 1024x1024 matrices on P4 3GHz, 5900 Ultra, and 9800 XT]

15 Current Generation GPUs [Chart: GFLOPS and bandwidth (GB/sec) for multiplication of 1024x1024 matrices on P4 3GHz, 6800 Ultra, and X800 XT PE]

16 Fragment Processor Data Paths [Diagram: from L2 -> L1 texture cache -> texture unit -> fragment processor -> to frame buffer]

17 GPU Microbenchmarks: Peak Arithmetic Rate [Chart: GFLOPS for 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

18 GPU Microbenchmarks: Observed Bandwidth [Chart: cache BW and sequential BW (GB/sec) for 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

19 Fragment Processor Data Paths
[Diagram: from L2 -> L1 texture cache -> texture unit (high bandwidth, for texture filtering) -> fragment processor (1 4-wide MAD/clock) -> to frame buffer; the texture-to-processor path is low bandwidth (1 float/clock)]
 The fragment processor consumes data at 8X the rate the texture unit provides it!
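The 8X figure on this slide follows from the diagram's numbers; here is my back-of-envelope reading (an assumption about how the slide counts operands, not a statement from the paper): one 4-wide MAD per clock can consume two 4-wide streamed inputs, i.e. 8 floats/clock, while the texture path delivers only 1 float/clock.

```python
# Sanity check of the 8x consume-vs-deliver gap between the fragment
# processor and the texture-to-processor data path.
floats_consumed_per_clock = 2 * 4   # two 4-wide MAD stream operands per clock
floats_delivered_per_clock = 1      # texture path: 1 float/clock
ratio = floats_consumed_per_clock / floats_delivered_per_clock
print(ratio)  # 8.0
```

This mismatch is the root cause identified by the talk: the arithmetic units sit idle waiting on the texture path.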

20 Datapaths Designed for Shading
[Diagram: from L2 -> L1 texture cache -> texture unit -> fragment processor -> to frame buffer; 8-to-1 reduction in amount of data; 4 components per clock, 8-bit components]
 2-to-1 ratio of compute to bandwidth
 Texture units filter (reduce) data
 Shaders use interpolated values & constants

21 Compute and Bandwidth Efficiency [Chart: percentage of peak compute and bandwidth for 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz] GPU algorithms are severely bandwidth limited!

22 Minimize Texture Fetches
 Block in shader register file
 Would need 8x8 submatrices to run at peak rates
 Limited to 4x4 submatrices by available outputs

23 Improvement 1: Widen Datapath
 Fragment processor receives cached data more quickly
 Expect performance to improve linearly with increase in bandwidth
   Need ~4X improvement to achieve peak perf
 But L2 may no longer be able to fill L1

24 Improvement 2: Larger Scratch Space
 Requires large number of registers
 Needs large number of output values
 Reduces texture bandwidth requirements
   Performance increases linearly with dimension of submatrices
 Increases amount of per-pixel state
   Storage increases as square of dimension of submatrices
   Requires 16X space of SP method for peak perf
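The 16X figure on this slide follows from the quadratic scaling it states, combined with slide 22's numbers: the SP method holds a 2x2 submatrix per pixel, while peak rates would need 8x8. A quick check (my reconstruction of the arithmetic, not a quote from the talk):

```python
# Per-pixel scratch storage grows as the square of the submatrix
# dimension: going from SP's 2x2 blocks to the 8x8 blocks needed for
# peak rates multiplies storage by (8/2)**2 = 16.
sp_dim, peak_dim = 2, 8
storage_ratio = (peak_dim ** 2) / (sp_dim ** 2)
print(storage_ratio)  # 16.0
```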

25 Summary
 GPU algorithms for matrix-matrix multiplication run inefficiently
   Best algorithms achieve below 20% of peak performance
   Saturate data path between texture and FP units
 Cache-aware software blocking strategies do not improve performance
   Cannot exploit data reuse
   Hardware limits algorithm efficiency

26 Summary
 Hardware changes are required to improve efficiency
   Widen the path between texture and register file
   Output a larger number of values from shaders
 Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms

27 Acknowledgements
 Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein
 Support from ATI, NVIDIA, DARPA, IBM, SONY
 Rambus Stanford Graduate Fellowship
 Stanford School of Engineering Fellowship

28 Questions?

