Published by Jace Harkless. Modified over 2 years ago.

1
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Stanford University
August 30, 2004

2
Motivation: Harness GPU Performance
[Chart: relative performance (peak FLOPS and memory bandwidth) of P4 3.4GHz, 6800 Ultra, and X800 XT PE]

3
Streaming Computation on GPUs
[Diagram: input elements, kernel function (shader), output elements]
GPUs accelerate streaming numerical algorithms:
- Data parallelism
- High ratio of arithmetic to data access
- Little data reuse
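To make the streaming model concrete, here is a minimal sketch in plain Python (not shader code). The names `stream_map` and `saxpy_kernel` are hypothetical and chosen only for illustration. The point is the structure: the kernel sees one tuple of input elements at a time and has no access to its neighbors, which is what makes the computation data parallel.

```python
def stream_map(kernel, *input_streams):
    """Apply `kernel` independently to each tuple of input elements.

    No element depends on another, so a GPU could evaluate every
    iteration of this loop in parallel.
    """
    return [kernel(*elems) for elems in zip(*input_streams)]


def saxpy_kernel(alpha):
    """Build a kernel computing alpha * x + y for one pair of elements."""
    def kernel(x, y):
        return alpha * x + y
    return kernel


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]
result = stream_map(saxpy_kernel(2.0), xs, ys)
```

Note that SAXPY reads two values for every multiply-add, so it illustrates the data-parallel structure rather than a high arithmetic-to-data ratio.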

4
Streaming Computation on GPUs
- Level 1 BLAS operations: Buck et al. [2004]
- Fluid solvers: Krüger & Westermann [2003], Bolz et al. [2003]
- Image processing: Apple Corp. [2004], McCormick et al. [2004]
- Segmentation: Sherbondy et al. [2003]
- Database operations: Govindaraju et al. [2004]
- Data clustering: Hall et al. [2004]

5
Dense Matrix Multiplication
C = A * B
- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access
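The "high ratio of computation to data access" can be quantified with a short calculation. The sketch below assumes square n x n matrices and ideal reuse, meaning each element of A, B, and C crosses the memory interface exactly once. That is a best case, not something the slides claim any implementation achieves. The function name is hypothetical.

```python
def matmul_arithmetic_intensity(n):
    """Floating-point operations per matrix element moved, under ideal reuse."""
    flops = 2 * n ** 3      # n^3 multiplies plus n^3 adds
    elements = 3 * n ** 2   # read A, read B, write C, each exactly once
    return flops / elements


# For the 1024x1024 matrices used in the results slides:
intensity_1024 = matmul_arithmetic_intensity(1024)   # equals 2 * 1024 / 3
```

The ratio grows linearly with n, which is why the problem looks like a good fit for hardware with far more arithmetic throughput than memory bandwidth.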

6
Dense Matrix Multiplication
- Widely used computational kernel
- Building block for LAPACK library

7
Matrix Multiplication on GPUs
- Larsen & McAllister [2001]
- Moravánszky [2003]
- Hall et al. [2003]
Limited analysis of performance

8
Overview
- GPU Implementations
- Results
- Analysis: Why GPUs are slow
- Ways to Make GPUs Better

9
CPU-Based Approaches
C = A * B
High performance matrix multiplication algorithms are cache aware
Partition computation into submatrix multiplications:
- Load input submatrices into cache
- Multiply submatrices
- Store output submatrix to memory
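The partitioning described on this slide can be sketched as a blocked (tiled) loop nest. This is a minimal pure-Python illustration of the idea, not the ATLAS implementation. The function name and the `block` parameter are hypothetical, and in practice the tile size would be tuned so that three tiles fit in cache.

```python
def blocked_matmul(A, B, block=2):
    """Multiply A (n x k) by B (k x m) one block x block tile at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):            # tile row of C
        for j0 in range(0, m, block):        # tile column of C
            for p0 in range(0, k, block):    # matching tiles of A and B
                # Multiply one tile of A by one tile of B and accumulate
                # into the current tile of C. On a CPU these three tiles
                # are the working set that should stay cache resident.
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + block, m)):
                            C[i][j] += a * B[p][j]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = blocked_matmul(A, B, block=2)
```

The result is the same as an unblocked triple loop. Only the order of the multiply-adds changes, so that each loaded tile is reused many times before it is evicted.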

10
Method 1: Column Packed (CP)
Larsen & McAllister [SC2001], Moravánszky [2003]
C = A * B
- 4 elements stored per texel (x, y, z, w)
- 4x4 matrix by 4-vector multiplications
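A CPU-side emulation may help show how the column-packed layout leads to 4x4 matrix by 4-vector products. This is a sketch under assumptions, not the shader from the talk. It assumes each texel holds 4 consecutive elements of one matrix column and that all dimensions are multiples of 4. The names `pack_columns`, `cp_fragment`, and `cp_matmul` are hypothetical.

```python
def pack_columns(M):
    """Pack 4 consecutive elements of each column into one 4-tuple 'texel'."""
    rows, cols = len(M), len(M[0])
    assert rows % 4 == 0, "sketch assumes the row count is a multiple of 4"
    return [[tuple(M[4 * tb + c][j] for c in range(4)) for j in range(cols)]
            for tb in range(rows // 4)]


def cp_fragment(At, Bt, ti, j):
    """Compute one output texel: 4 consecutive elements of column j of C."""
    acc = [0.0] * 4
    for pb in range(len(Bt)):            # step through B in 4-row blocks
        b = Bt[pb][j]                    # one texel of B: a 4-vector
        for c in range(4):               # 4 texels of A form a 4x4 block
            a_col = At[ti][4 * pb + c]
            for r in range(4):           # emulates one 4-wide multiply-add
                acc[r] += a_col[r] * b[c]
    return tuple(acc)


def cp_matmul(A, B):
    """Return C = A * B in the same column-packed texel layout."""
    At, Bt = pack_columns(A), pack_columns(B)
    return [[cp_fragment(At, Bt, ti, j) for j in range(len(B[0]))]
            for ti in range(len(At))]
```

In this emulation each inner step reads 4 texels of A and 1 texel of B and performs four 4-wide multiply-adds, so the texel from B is reused four times while each texel from A is used once.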

11
Method 2: Submatrix Packed (SP)
Hall et al. [2003]
C = A * B
- 2x2 submatrix stored per texel (x, y, z, w)
- 2x2 by 2x2 submatrix multiplications
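The submatrix-packed layout can be emulated on the CPU in the same spirit. This sketch assumes the texel components map to the 2x2 block as (x, y) = top row and (z, w) = bottom row, which is an illustrative choice and not confirmed by the slide. It also assumes all dimensions are even. The function names are hypothetical.

```python
def pack_2x2(M):
    """Pack each 2x2 submatrix into one (x, y, z, w) 'texel'."""
    rows, cols = len(M), len(M[0])
    assert rows % 2 == 0 and cols % 2 == 0, "sketch assumes even dimensions"
    return [[(M[2 * ti][2 * tj], M[2 * ti][2 * tj + 1],
              M[2 * ti + 1][2 * tj], M[2 * ti + 1][2 * tj + 1])
             for tj in range(cols // 2)]
            for ti in range(rows // 2)]


def sp_fragment(At, Bt, ti, tj):
    """Compute one output texel: a 2x2 block of C."""
    x = y = z = w = 0.0
    for p in range(len(Bt)):
        ax, ay, az, aw = At[ti][p]       # 2x2 block of A
        bx, by, bz, bw = Bt[p][tj]       # 2x2 block of B
        # 2x2 by 2x2 product, accumulated into the output block
        x += ax * bx + ay * bz
        y += ax * by + ay * bw
        z += az * bx + aw * bz
        w += az * by + aw * bw
    return (x, y, z, w)


def sp_matmul(A, B):
    """Return C = A * B in the same 2x2-packed texel layout."""
    At, Bt = pack_2x2(A), pack_2x2(B)
    return [[sp_fragment(At, Bt, ti, tj) for tj in range(len(Bt[0]))]
            for ti in range(len(At))]
```

Each inner step here fetches 2 texels (8 floats) and performs 8 scalar multiply-adds, so both fetched texels are reused twice.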

12
Alternative Approaches Ineffective
- Varied mapping into texture memory
- Altered rasterization order with geometry
  - Single quad most effective
- Utilized multiple outputs
- Varied amount of loop unrolling
  - Column packed: unroll maximally
  - Submatrix packed: unroll 128 times

13
Performance Results
- Pentium 4 3GHz CPU, 512KB L2 cache
  - 12 GFLOPS peak compute
  - 44.1GB/sec cache BW
  - Using sgemm routine from ATLAS package
- NVIDIA
  - GeForce 5900 Ultra
  - GeForce 6800 Ultra
- ATI
  - Radeon 9800 XT
  - Radeon X800 XT PE (prerelease, 500MHz mem / 500MHz core clock)

14
Previous Generation GPUs
Multiplication of 1024x1024 Matrices
[Chart: GFLOPS and bandwidth (GB/sec) for P4 3GHz, 5900 Ultra, and 9800 XT]

15
Current Generation GPUs
Multiplication of 1024x1024 Matrices
[Chart: GFLOPS and bandwidth (GB/sec) for P4 3GHz, 6800 Ultra, and X800 XT PE]

16
Fragment Processor Data Paths
[Diagram labels: From L2, L1 Texture Cache, Texture Unit, Fragment Processor, To Frame Buffer]

17
GPU Microbenchmarks
Peak Arithmetic Rate
[Chart: GFLOPS for 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

18
GPU Microbenchmarks
Observed Bandwidth
[Chart: cache bandwidth and sequential bandwidth (GB/sec) for 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

19
Fragment Processor Data Paths
[Diagram labels: From L2, L1 Texture Cache, Texture Unit, Fragment Processor, To Frame Buffer]
- High bandwidth (texture filtering)
- Low bandwidth (1 float/clock)
- 1 4-wide MAD/clock
Fragment processor consumes data at 8X the rate the texture unit provides it!
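The slide does not spell out how the 8X figure is obtained. One plausible reading, sketched below, counts two fresh input operands per multiply-add lane (the accumulator is assumed to stay in a register). The two-operand count is an assumption made for illustration. The MAD width and the texture rate are taken from the slide.

```python
MAD_WIDTH = 4                  # from the slide: one 4-wide MAD per clock
NEW_OPERANDS_PER_LANE = 2      # assumption: two fresh inputs per multiply-add
TEXTURE_FLOATS_PER_CLOCK = 1   # from the slide: 1 float/clock

# Floats the fragment processor could consume each clock at peak rate
floats_consumed_per_clock = MAD_WIDTH * NEW_OPERANDS_PER_LANE

# How far the demand exceeds what the texture unit delivers
demand_to_supply = floats_consumed_per_clock / TEXTURE_FLOATS_PER_CLOCK
```

Under these assumptions the ratio comes out to 8, matching the slide, and it shows why a shader with no operand reuse cannot approach the peak arithmetic rate.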

20
Datapaths Designed for Shading
[Diagram labels: From L2, L1 Texture Cache, Texture Unit, Fragment Processor, To Frame Buffer]
- 8 to 1 reduction in amount of data
- 4 components per clock
- 8 bit components
- 2 to 1 ratio of compute to bandwidth
- Texture units filter (reduce) data
- Shaders use interpolated values & constants

21
Compute and Bandwidth Efficiency
[Chart: percentage of peak compute and bandwidth for 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]
GPU algorithms are severely bandwidth limited!

22
Minimize Texture Fetches
- Block in shader register file
- Would need 8x8 submatrices to run at peak rates
- Limited to 4x4 submatrices by available outputs
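The 8x8 and 4x4 figures are consistent with a simple counting model, sketched below. The model is an assumption, not taken from the talk: a shader that keeps a b x b block of C in registers fetches one b-element column of A and one b-element row of B per step (2b floats) and performs b*b multiply-adds. The output limit assumes 4 render targets of 4 components each. The rates of 4 MADs/clock and 1 float/clock come from the data-path slide.

```python
import math


def mads_per_float_fetched(b):
    """Multiply-adds performed per float fetched for a b x b register block."""
    floats_fetched = 2 * b     # one column of A plus one row of B
    mads = b * b               # rank-1 update of the b x b block of C
    return mads / floats_fetched


def min_block_for_peak(mads_per_clock=4, floats_per_clock=1):
    """Smallest b whose reuse covers the compute-to-bandwidth gap."""
    needed = mads_per_clock / floats_per_clock
    b = 1
    while mads_per_float_fetched(b) < needed:
        b += 1
    return b


# Assumption: 4 render targets x 4 components = 16 output values per fragment
RENDER_TARGETS = 4
COMPONENTS_PER_TARGET = 4
max_block_from_outputs = math.isqrt(RENDER_TARGETS * COMPONENTS_PER_TARGET)

required_block = min_block_for_peak()
```

Under these assumptions `required_block` is 8 and `max_block_from_outputs` is 4, which matches the gap the slide describes between what peak rate needs and what the available outputs allow.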

23
Improvement 1: Widen Datapath
- Fragment processor receives cached data more quickly
- Expect performance to improve linearly with increase in bandwidth
- Need ~4X improvement to achieve peak perf
- But L2 may no longer be able to fill L1

24
Improvement 2: Larger Scratch Space
- Requires large number of registers
- Needs large number of output values
- Reduces texture bandwidth requirements
  - Performance increases linearly with dimension of submatrices
- Increases amount of per-pixel state
  - Storage increases as square of dimension of submatrices
  - Requires 16X space of SP method for peak perf
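The linear and quadratic scaling on this slide can be illustrated with a small model. It rests on assumptions that are not stated in the talk: the shader stays bandwidth bound, a b x b block fetches 2b floats per b*b multiply-adds, and the SP method corresponds to b = 2. The function name is hypothetical.

```python
def scratch_space_model(b, sp_block=2):
    """Return (speedup over SP, per-pixel storage relative to SP).

    Assumes a bandwidth-bound shader, so speed is proportional to the
    number of multiply-adds performed per float fetched.
    """
    fetches_per_mad = (2 * b) / (b * b)              # falls as 1 / b
    sp_fetches_per_mad = (2 * sp_block) / (sp_block * sp_block)
    speedup = sp_fetches_per_mad / fetches_per_mad   # grows linearly in b
    storage = (b * b) / (sp_block * sp_block)        # grows as b squared
    return speedup, storage


speedup_8x8, storage_8x8 = scratch_space_model(8)
```

For 8x8 blocks the model gives a 4X speedup at 16X the per-pixel storage of SP, in line with the ~4X and 16X figures quoted on the slides.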

25
Summary
- GPU algorithms for matrix-matrix multiplication run inefficiently
  - Best algorithms achieve below 20% of peak performance
  - Saturate data path between texture and FP units
- Cache-aware software blocking strategies do not improve performance
  - Cannot exploit data reuse
  - Hardware limits algorithm efficiency

26
Summary
- Hardware changes required to improve efficiency
  - Widen path between texture and register file
  - Output large number of values from shaders
- Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms

27
Acknowledgements
- Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein
- Support from ATI, NVIDIA, DARPA, IBM, SONY
- Rambus Stanford Graduate Fellowship
- Stanford School of Engineering Fellowship

28
Questions?
