1
**Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication**

Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Stanford University
August 30, 2004

2
**Motivation: Harness GPU Performance**

[Chart: relative performance (peak FLOPS and memory bandwidth) for P4 3.4 GHz, 6800 Ultra, and X800 XT PE]

3
**Streaming Computation on GPUs**

GPUs accelerate streaming numerical algorithms
- Data parallelism
- High ratio of arithmetic to data access
- Little data reuse

[Diagram: Input Elements -> Kernel function (shader) -> Output Elements]

4
**Streaming Computation on GPUs**

- Level 1 BLAS operations: Buck et al. [2004]
- Fluid solvers: Kruger & Westermann [2003], Boltz et al. [2003]
- Image processing: Apple Corp. [2004], McCormick et al. [2004]
- Segmentation: Sherbondy et al. [2003]
- Database operations: Govindaraju et al. [2004]
- Data clustering: Hall et al. [2004]

5
**Dense Matrix Multiplication**

[Figure: C = A * B]
- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access

6
**Dense Matrix Multiplication**

- Widely used computational kernel
- Building block for LAPACK library

7
**Matrix Multiplication on GPUs**

- Larsen & McAllister [2001]
- Moravansky [2003]
- Hall et al. [2003]

Limited analysis of performance

8
**Overview**

- GPU Implementations
- Results
- Analysis: Why GPUs are slow
- Ways to Make GPUs Better

9
**CPU-Based Approaches**

[Figure: C = A * B]

High performance matrix multiplication algorithms are cache aware
- Partition computation into submatrix multiplications
- Load input submatrices into cache
- Multiply submatrices
- Store output submatrix to memory

Speaker notes: Cache hierarchies are designed to improve performance when data is reused frequently. Any quality CPU implementation will be aware of cache sizes to ensure most data accesses are serviced out of cache. Point: blocking strategies minimize the number of times data is fetched from memory. Since the required math operations are cubic in the dimension of the block while the data set is only quadratic, processor units are kept busier as block size increases.
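The blocking steps on this slide can be sketched in a few lines of pure Python. This is only an illustration of the partition / multiply / store pattern, not the ATLAS implementation used later in the talk; the function name `blocked_matmul` and the `block` parameter are hypothetical, and Python lists stand in for cache-resident tiles.

```python
def blocked_matmul(a, b, block=2):
    """Multiply square matrices a and b (lists of lists) one tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):          # rows of the output tile
        for j0 in range(0, n, block):      # columns of the output tile
            for k0 in range(0, n, block):  # pair of input tiles to combine
                # On a CPU, the three block x block tiles touched below are
                # meant to stay in cache for the whole inner computation.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + block, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Each tile pair contributes on the order of `block**3` multiply-adds while touching only on the order of `block**2` values, which is the cubic-versus-quadratic argument in the speaker notes.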

10
**Method 1: Column Packed (CP)**

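[Figure: C = A * B, with each texel holding four matrix elements in its x, y, z, w components]
- 4 elements stored per texel
- 4x4 matrix by 4-vector multiplications
- Larsen & McAllister [SC2001], Moravansky [2003]

Speaker notes: This is the naive way of doing things. The choice of column packing is arbitrary; row packing would work as well.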
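A CPU-side sketch may help make the column-packed layout concrete. It assumes that each texel holds four vertically adjacent elements of one matrix column and that one loop iteration fetches a 4x4 block of A (four texels) plus one texel of B. The helper names `pack_columns` and `cp_output_texel` are hypothetical, and the code simulates the access pattern in Python rather than reproducing the actual shader.

```python
def pack_columns(m):
    """Pack 4 consecutive elements of each column into one 4-component texel."""
    n = len(m)  # assumed to be a multiple of 4
    return [[tuple(m[4 * tr + s][c] for s in range(4)) for c in range(n)]
            for tr in range(n // 4)]


def cp_output_texel(a_tex, b_tex, tr, j):
    """Compute output texel (tr, j): rows 4*tr .. 4*tr+3 of column j of C.

    Also returns the number of texel fetches and 4-wide MADs performed.
    """
    n = len(a_tex[0])
    acc = [0.0, 0.0, 0.0, 0.0]
    fetches = 0
    mads = 0
    for kb in range(n // 4):
        bv = b_tex[kb][j]                # 4 elements of column j of B
        fetches += 1
        for t in range(4):
            av = a_tex[tr][4 * kb + t]   # one column of a 4x4 block of A
            fetches += 1
            # One 4-wide MAD: acc += av * bv[t] (scalar broadcast)
            acc = [acc[s] + av[s] * bv[t] for s in range(4)]
            mads += 1
    return acc, fetches, mads
```

Under these assumptions every 4x4-by-4-vector step costs 5 texel fetches for 4 MADs.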

11
**Method 2: Submatrix Packed (SP)**

[Figure: C = A * B, with each texel holding a 2x2 submatrix in its x, y, z, w components]
- 2x2 submatrix stored per texel
- 2x2 by 2x2 submatrix multiplications
- Hall et al. [2003]

Speaker notes: Picked an unroll size of 128 on NVIDIA hardware. Better reuse of data: all fetched data is used twice, as opposed to the 5 fetches and 4 MADs pattern of column packed. Think about how pixels in each row and column get reused here.
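The submatrix-packed scheme can be sketched the same way. The sketch assumes a texel stores a 2x2 submatrix row-major as (x, y, z, w), and it uses one possible swizzle pattern for the 2x2 by 2x2 product: `acc += A.xxzz * B.xyxy` followed by `acc += A.yyww * B.zwzw`. The names `pack_submatrices` and `sp_output_texel` are hypothetical, and this is a CPU simulation rather than the original shader code.

```python
def pack_submatrices(m):
    """Pack each 2x2 submatrix into one texel as (x, y, z, w), row-major."""
    n = len(m)  # assumed to be even
    return [[(m[2 * r][2 * c], m[2 * r][2 * c + 1],
              m[2 * r + 1][2 * c], m[2 * r + 1][2 * c + 1])
             for c in range(n // 2)] for r in range(n // 2)]


def sp_output_texel(a_tex, b_tex, i, j):
    """Compute output texel (i, j) and count texel fetches and 4-wide MADs."""
    acc = [0.0, 0.0, 0.0, 0.0]
    fetches = 0
    mads = 0
    for k in range(len(a_tex[0])):
        ax, ay, az, aw = a_tex[i][k]
        bx, by, bz, bw = b_tex[k][j]
        fetches += 2
        # First 4-wide MAD: acc += A.xxzz * B.xyxy
        acc = [acc[0] + ax * bx, acc[1] + ax * by,
               acc[2] + az * bx, acc[3] + az * by]
        # Second 4-wide MAD: acc += A.yyww * B.zwzw
        acc = [acc[0] + ay * bz, acc[1] + ay * bw,
               acc[2] + aw * bz, acc[3] + aw * bw]
        mads += 2
    return acc, fetches, mads
```

Here each step costs 2 texel fetches for 2 MADs, and every fetched component feeds two multiplies, which is the "all data used twice" remark in the notes.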

12
**Alternative Approaches Ineffective**

- Varied mapping into texture memory
- Altered rasterization order with geometry
  - Single quad most effective
- Utilized multiple outputs
- Varied amount of loop unrolling
  - Column packed: unroll maximally
  - Submatrix packed: unroll 128 times

13
**Performance Results**

- Pentium 4 3 GHz CPU, 512 KB L2 cache
  - 12 GFLOPS peak compute
  - 44.1 GB/sec cache BW
  - Using sgemm routine from ATLAS package
- NVIDIA
  - GeForce 5900 Ultra
  - GeForce 6800 Ultra
- ATI
  - Radeon 9800 XT
  - Radeon X800 XT PE (prerelease, 500 MHz mem / 500 MHz core clock)

14
**Multiplication of 1024x1024 Matrices**

Previous Generation GPUs

[Chart: multiplication of 1024x1024 matrices; GFLOPS and bandwidth (GB/sec) for P4 3 GHz, 5900 Ultra, and 9800 XT]

15
**Multiplication of 1024x1024 Matrices**

Current Generation GPUs

[Chart: multiplication of 1024x1024 matrices; GFLOPS and bandwidth (GB/sec) for P4 3 GHz, 6800 Ultra, and X800 XT PE]

16
**Fragment Processor Data Paths**

[Diagram: From L2 -> L1 Texture Cache -> Texture Unit -> Fragment Processor -> To Frame Buffer]

17
**GPU Microbenchmarks**

Peak Arithmetic Rate

[Chart: peak arithmetic rate in GFLOPS for 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

Speaker notes: Verbally note that results do NOT include transfer to and from the cards or time to compile shader programs, etc. Note computation precision. Note table shows 1024x1024 performance. Need to describe how bandwidth is computed (total fetches). Also lead in to the fact that although GPUs win, we are about to talk about efficiency.

18
**GPU Microbenchmarks**

Observed Bandwidth

[Chart: observed bandwidth in GB/sec (Cache BW and Seq BW) for 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

19
**Fragment Processor Data Paths**

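[Diagram: From L2 -> L1 Texture Cache -> Texture Unit -> Fragment Processor -> To Frame Buffer]
- L1 texture cache to texture unit: high bandwidth (texture filtering)
- Texture unit to fragment processor: low bandwidth (1 float/clock)
- Fragment processor: 1 4-wide MAD/clock

Fragment processor consumes data at 8X the rate the texture unit provides it!

Speaker notes: Make the comparison to P4 ratios verbally, although no slide text is dedicated to it.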
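One plausible reading of the 8X figure is sketched below. It assumes that a 4-wide MAD needs two freshly fetched 4-component operands per clock while the accumulator stays in a register, and it takes the 1 float/clock supply rate from the slide. The function name and parameters are hypothetical.

```python
def demand_to_supply_ratio(mad_width=4, fetched_operands_per_mad=2,
                           supplied_floats_per_clock=1):
    """Floats the fragment processor could consume per clock, divided by
    the floats the texture path delivers per clock."""
    demanded = mad_width * fetched_operands_per_mad  # 4 lanes x 2 operands
    return demanded / supplied_floats_per_clock


print(demand_to_supply_ratio())  # 8.0 under the assumptions above
```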

20
**Datapaths Designed for Shading**

[Diagram: From L2 -> L1 Texture Cache -> Texture Unit -> Fragment Processor -> To Frame Buffer]
- Texture units filter (reduce) data
  - 8 bit components
  - 8 to 1 reduction in amount of data
  - 4 components per clock
- Shaders use interpolated values & constants
  - 2 to 1 ratio of compute to bandwidth

21
**Compute and Bandwidth Efficiency**

[Chart: percentage of peak achieved for compute and bandwidth on 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3 GHz]

GPU algorithms are severely bandwidth limited!

Speaker notes: Emphasize that this is NOT off-chip bandwidth. This is effective cache bandwidth, so no algorithm would be able to read data from texture very much faster.

22
**Minimize Texture Fetches**

- Block in shader register file
- Would need 8x8 submatrices to run at peak rates
- Limited to 4x4 submatrices by available outputs

Speaker notes: The 8x8 figure assumes output is asynchronous with processing. 12x12 is the correct answer assuming this is NOT the case.
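The 8x8 figure can be reproduced with a simple model, offered here as a sketch rather than the authors' derivation. It assumes one n x n block step fetches an n x n tile of A and of B (2n² floats) and performs n³ scalar multiply-adds packed 4 wide, against a texture path delivering 1 float per clock. It does not attempt the 12x12 case from the speaker notes. The function names are hypothetical.

```python
def floats_per_mad(n, mad_width=4):
    """Floats fetched per 4-wide MAD when blocking with n x n submatrices."""
    fetched = 2 * n * n         # one n x n tile of A and one of B
    mads = n ** 3 / mad_width   # n^3 scalar multiply-adds, packed 4 wide
    return fetched / mads       # simplifies to 8 / n when mad_width is 4


def min_block_dim(supplied_floats_per_clock=1.0, mad_width=4):
    """Smallest n whose fetch demand fits the supplied floats per clock."""
    n = 1
    while floats_per_mad(n, mad_width) > supplied_floats_per_clock:
        n += 1
    return n
```

In this model 2x2 blocks need 4 floats per MAD, 4x4 blocks need 2, and the demand first drops to 1 float per MAD at n = 8.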

23
**Improvement 1: Widen Datapath**

- Fragment processor receives cached data more quickly
- Expect performance to improve linearly with increase in bandwidth
- Need ~4X improvement to achieve peak perf
- But L2 may no longer be able to fill L1

24
**Improvement 2: Larger Scratch Space**

- Requires large number of registers
- Needs large number of output values
- Reduces texture bandwidth requirements
  - Performance increases linearly with dimension of submatrices
- Increases amount of per-pixel state
  - Storage increases as square of dimension of submatrices
  - Requires 16X space of SP method for peak perf
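The linear-versus-square trade-off on this slide can be checked with a little arithmetic. The sketch assumes the submatrix-packed (SP) method is the 2x2 baseline and that peak performance needs 8x8 submatrices, as stated on the earlier "Minimize Texture Fetches" slide. The function names are hypothetical.

```python
def state_vs_sp(n, sp_dim=2):
    """Per-pixel storage for n x n submatrices relative to the SP method."""
    return (n * n) / (sp_dim * sp_dim)   # storage grows as n squared


def fetch_reduction_vs_sp(n, sp_dim=2):
    """Reduction in texture fetches per MAD relative to the SP method."""
    return n / sp_dim                    # benefit grows only linearly in n


print(state_vs_sp(8), fetch_reduction_vs_sp(8))  # 16.0 4.0
```

Moving from 2x2 to 8x8 submatrices therefore buys a 4X cut in fetch demand at a cost of 16X the per-pixel state.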

25
**Summary**

GPU algorithms for matrix-matrix multiplication run inefficiently
- Best algorithms achieve below 20% of peak performance
- Saturate data path between texture and FP units

Cache-aware software blocking strategies do not improve performance
- Cannot exploit data reuse
- Hardware limits algorithm efficiency

Speaker notes: Compare to P4 again.

26
**Summary**

Hardware changes required to improve efficiency
- Widen path between texture and register file
- Output large number of values from shaders

Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms

27
**Acknowledgements**

- Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein
- Support from ATI, NVIDIA, DARPA, IBM, SONY
- Rambus Stanford Graduate Fellowship
- Stanford School of Engineering Fellowship

28
**Questions?**
