Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Stanford University
August 30, 2004

Motivation: Harness GPU Performance
[Chart: performance of the GeForce 6800 Ultra and Radeon X800 XT PE relative to a P4 3.4GHz, in peak FLOPS and memory bandwidth]

Streaming Computation on GPUs
GPUs accelerate streaming numerical algorithms: a kernel function (shader) maps input elements to output elements. These workloads feature data parallelism, a high ratio of arithmetic to data access, and little data reuse.
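As a concrete illustration of this streaming model (not code from the talk), here is a minimal C sketch in which a kernel function is applied independently to every input element; the function names and the 4-wide element type are hypothetical:

```c
#include <stddef.h>

/* A 4-component element, analogous to an RGBA texel. */
typedef struct { float x, y, z, w; } float4;

/* Kernel function ("shader"): computes one output element from one input
 * element; no element depends on any other, which is the data parallelism
 * a GPU exploits. */
static float4 kernel(float4 in) {
    float4 out = {
        in.x * in.x + 1.0f, in.y * in.y + 1.0f,
        in.z * in.z + 1.0f, in.w * in.w + 1.0f
    };
    return out;
}

/* Streaming execution over the input elements: each iteration is
 * independent, so a GPU runs many instances of the kernel at once. */
void run_stream(const float4 *in, float4 *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = kernel(in[i]);
}
```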

Streaming Computation on GPUs
- Level 1 BLAS operations: Buck et al. [2004]
- Fluid solvers: Krüger & Westermann [2003], Bolz et al. [2003]
- Image processing: Apple Corp. [2004], McCormick et al. [2004]
- Segmentation: Sherbondy et al. [2003]
- Database operations: Govindaraju et al. [2004]
- Data clustering: Hall et al. [2004]

Dense Matrix Multiplication (C = A * B)
- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access
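To quantify the last bullet (standard operation counting, not numbers from the talk): multiplying two n x n matrices performs 2n^3 arithmetic operations (n^3 multiplies and n^3 adds) on only 3n^2 matrix elements, so the arithmetic intensity grows linearly with n:

```latex
\frac{\text{ops}}{\text{elements}} = \frac{2n^3}{3n^2} = \frac{2n}{3},
\qquad \text{e.g. } n = 1024: \ \frac{2 \cdot 1024}{3} \approx 683 \ \text{ops per element.}
```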

Dense Matrix Multiplication
- Widely used computational kernel
- Building block for the LAPACK library

Matrix Multiplication on GPUs
- Larsen & McAllister [2001]
- Moravánszky [2003]
- Hall et al. [2003]
These prior efforts include only limited analysis of performance.

Overview
- GPU implementations
- Results
- Analysis: why GPUs are slow
- Ways to make GPUs better

CPU-Based Approaches (C = A * B)
High-performance matrix multiplication algorithms are cache aware. Memory hierarchies are designed to improve performance when data is reused frequently, so any quality CPU implementation accounts for cache sizes to ensure that most data accesses are serviced out of cache. Blocking strategies minimize the number of times data is fetched from memory: the required math operations are cubic in the dimension of the block while the data set is only quadratic, so the processor's functional units are kept busier as the block size increases.
- Partition the computation into submatrix multiplications
- Load input submatrices into cache
- Multiply submatrices
- Store the output submatrix to memory
A blocked implementation is sketched below.
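A minimal C sketch of the blocking strategy just described (an illustration, not the ATLAS code used for the measurements later in the talk); the block size BS is an assumed tuning parameter chosen to fit the cache:

```c
#include <stddef.h>

#define N  1024  /* matrix dimension (assumed a multiple of BS) */
#define BS 64    /* block size, chosen so three BSxBS blocks fit in cache */

/* C += A * B for N x N row-major matrices, computed block by block so
 * that each submatrix is loaded from memory once per block multiply. */
void matmul_blocked(const float *A, const float *B, float *C) {
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t jj = 0; jj < N; jj += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                /* multiply the (ii,kk) block of A by the (kk,jj) block
                 * of B, accumulating into the (ii,jj) block of C */
                for (size_t i = ii; i < ii + BS; ++i)
                    for (size_t k = kk; k < kk + BS; ++k) {
                        float a = A[i * N + k];
                        for (size_t j = jj; j < jj + BS; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```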

Method 1: Column Packed (CP)
Four column elements of the matrix are stored per texel (one in each of x, y, z, w). This is the naive mapping, and packing by column is arbitrary; packing by row works equally well. The computation is expressed as 4x4-matrix-by-4-vector multiplications. Larsen & McAllister [SC2001], Moravánszky [2003]. A sketch of the per-fragment computation follows.
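Here is a C sketch of what the CP fragment program computes for one output texel (my reconstruction of the scheme, not shader code from the talk). Each inner iteration fetches four texels of A and one texel of B and issues four 4-wide MADs, the 5-fetch/4-MAD pattern referred to on the next slide. The names and the exact texture layout are assumptions:

```c
#include <stddef.h>

typedef struct { float x, y, z, w; } float4;

#define N  1024     /* matrix dimension */
#define TH (N / 4)  /* texture height in texels: 4 column elements per texel */

/* texA[t][k] holds rows 4t..4t+3 of column k of A; texB likewise for B.
 * One "fragment" computes rows 4i..4i+3 of column j of C. */
float4 cp_fragment(const float4 texA[TH][N], const float4 texB[TH][N],
                   size_t i, size_t j) {
    float4 c = {0, 0, 0, 0};
    for (size_t t = 0; t < TH; ++t) {
        float4 b  = texB[t][j];          /* 1 fetch from B */
        float4 a0 = texA[i][4 * t + 0];  /* 4 fetches from A */
        float4 a1 = texA[i][4 * t + 1];
        float4 a2 = texA[i][4 * t + 2];
        float4 a3 = texA[i][4 * t + 3];
        /* 4 four-wide MADs: c += a0*b.x + a1*b.y + a2*b.z + a3*b.w */
        c.x += a0.x * b.x + a1.x * b.y + a2.x * b.z + a3.x * b.w;
        c.y += a0.y * b.x + a1.y * b.y + a2.y * b.z + a3.y * b.w;
        c.z += a0.z * b.x + a1.z * b.y + a2.z * b.z + a3.z * b.w;
        c.w += a0.w * b.x + a1.w * b.y + a2.w * b.z + a3.w * b.w;
    }
    return c;
}
```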

Method 2: Submatrix Packed (SP)
A 2x2 submatrix is stored per texel, and the computation is expressed as 2x2-by-2x2 submatrix multiplications (Hall et al. [2003]). This gives better reuse of data: every fetched value is used twice, as opposed to the 5-fetch/4-MAD pattern of column packed. (Consider how the texels in each row of A and each column of B get reused across fragments.) We picked an unroll size of 128 on NVIDIA hardware. A matching sketch follows.
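A C sketch of the SP inner loop, under the same assumed conventions as the CP sketch above (again a reconstruction, not the talk's shader code). Each iteration fetches one texel of A and one texel of B: two fetches feeding eight scalar MADs, so each fetched value is used twice.

```c
#include <stddef.h>

/* Here a float4 packs a 2x2 block: (x y; z w). */
typedef struct { float x, y, z, w; } float4;

#define N  1024     /* matrix dimension */
#define NB (N / 2)  /* matrix dimension in 2x2 blocks */

/* texA[i][k] holds the 2x2 block of A at block-row i, block-column k.
 * One "fragment" computes the 2x2 block of C at block position (i, j). */
float4 sp_fragment(const float4 texA[NB][NB], const float4 texB[NB][NB],
                   size_t i, size_t j) {
    float4 c = {0, 0, 0, 0};
    for (size_t k = 0; k < NB; ++k) {
        float4 a = texA[i][k];  /* 1 fetch from A */
        float4 b = texB[k][j];  /* 1 fetch from B */
        /* 2x2 block multiply-accumulate: 8 MADs on 2 fetched texels,
         * so every fetched value is used twice. */
        c.x += a.x * b.x + a.y * b.z;
        c.y += a.x * b.y + a.y * b.w;
        c.z += a.z * b.x + a.w * b.z;
        c.w += a.z * b.y + a.w * b.w;
    }
    return c;
}
```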

Alternative Approaches (Ineffective)
- Varied mapping into texture memory
- Altered rasterization order with geometry (a single quad was most effective)
- Utilized multiple outputs
- Varied amount of loop unrolling (column packed: unroll maximally; submatrix packed: unroll 128 times)

Performance Results
CPU: Pentium 4 3GHz, 512KB L2 cache; 12 GFLOPS peak compute; 44.1 GB/sec cache bandwidth; using the sgemm routine from the ATLAS package.
GPUs: NVIDIA GeForce 5900 Ultra and GeForce 6800 Ultra; ATI Radeon 9800 XT and Radeon X800 XT PE (prerelease, 500MHz memory / 500MHz core clock).

Previous Generation GPUs: Multiplication of 1024x1024 Matrices
[Chart: achieved GFLOPS (left axis, 0-12) and bandwidth in GB/sec (right axis, 0-30) for the P4 3GHz, 5900 Ultra, and 9800 XT]

Current Generation GPUs: Multiplication of 1024x1024 Matrices
[Chart: achieved GFLOPS (left axis, 0-12) and bandwidth in GB/sec (right axis, 0-30) for the P4 3GHz, 6800 Ultra, and X800 XT PE]

Fragment Processor Data Paths
[Diagram: data flows from the L2 cache into the L1 texture cache, through the texture unit to the fragment processor, and out to the frame buffer]

GPU Microbenchmarks: Peak Arithmetic Rate
[Chart: peak GFLOPS (0-70) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
(Speaker notes: results do NOT include transfer to and from the cards or time to compile shader programs; note the computation precision; the table shows 1024x1024 performance; describe how bandwidth is computed (total fetches); although the GPUs win, we are about to talk about efficiency.)

GPU Microbenchmarks: Observed Bandwidth
[Chart: cache bandwidth and sequential bandwidth in GB/sec (0-30) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

Fragment Processor Data Paths
[Diagram: the path from the L1 texture cache through the texture unit is low bandwidth (1 float/clock), while the fragment processor executes 1 four-wide MAD/clock and the texture-filtering path is high bandwidth]
The fragment processor consumes data at 8X the rate the texture unit provides it! (Compare to the P4's ratios.)
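A back-of-the-envelope reading of that 8X figure (my arithmetic, consistent with the slide's numbers, not stated in the talk): one four-wide MAD per clock can consume two fresh four-wide operands, i.e. eight floats, while the texture path delivers one float per clock:

```latex
\underbrace{1 \ \text{MAD/clock} \times 2 \ \text{operands} \times 4 \ \text{floats}}_{8 \ \text{floats/clock demanded}}
\quad \text{vs.} \quad
\underbrace{1 \ \text{float/clock}}_{\text{texture path}}
\;\Rightarrow\; 8\times \text{ mismatch.}
```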

Datapaths Designed for Shading
[Diagram: the texture unit sits between the L1 texture cache and the fragment processor; it performs an 8-to-1 reduction in the amount of data, delivering 4 components per clock of 8-bit components, for a 2-to-1 ratio of compute to bandwidth]
- Texture units filter (reduce) data
- Shaders use interpolated values & constants

Compute and Bandwidth Efficiency
[Chart: percentage of peak compute and peak bandwidth achieved by the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]
Note that this is NOT off-chip bandwidth; it is effective cache bandwidth, so no algorithm could read data from texture much faster. GPU algorithms are severely bandwidth limited!

Minimize Texture Fetches
Block in the shader register file. We would need 8x8 submatrices to run at peak rates, but are limited to 4x4 submatrices by the available outputs. (The 8x8 figure assumes output is asynchronous with processing; 12x12 is the correct answer if it is not.) The blocking arithmetic is worked out below.
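The derivation behind the 8x8 figure (my reconstruction from the slide's numbers): with b x b submatrix blocks, each block-pair costs 2b^2 fetched floats and performs b^3 multiply-adds, and matching the hardware's 4 MADs/clock against its 1 fetched float/clock requires b = 8:

```latex
\frac{\text{MADs}}{\text{floats fetched}} = \frac{b^3}{2b^2} = \frac{b}{2}
\;\geq\; \frac{4 \ \text{MADs/clock}}{1 \ \text{float/clock}}
\;\Rightarrow\; b \geq 8.
```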

Improvement 1: Widen Datapath
- Fragment processor receives cached data more quickly
- Expect performance to improve linearly with the increase in bandwidth
- Need ~4X improvement to achieve peak performance
- But the L2 may no longer be able to fill the L1

Improvement 2: Larger Scratch Space
- Requires a large number of registers and a large number of output values
- Reduces texture bandwidth requirements: performance increases linearly with the dimension of the submatrices
- Increases the amount of per-pixel state: storage increases as the square of the dimension of the submatrices
- Requires 16X the space of the SP method for peak performance
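The 16X follows from the quadratic growth just noted (my arithmetic, combining the slide's numbers): peak rates need the 8x8 blocks derived earlier, while SP uses 2x2 blocks, so per-pixel storage grows by

```latex
\left(\frac{8}{2}\right)^{2} = 16\times .
```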

Summary
GPU algorithms for matrix-matrix multiplication run inefficiently:
- The best algorithms achieve below 20% of peak performance
- They saturate the data path between the texture and fragment processor units
- Cache-aware software blocking strategies do not improve performance; they cannot exploit data reuse
- Hardware limits algorithm efficiency

Summary
Hardware changes are required to improve efficiency:
- Widen the path between the texture unit and the register file
- Output a large number of values from shaders
Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms.

Acknowledgements
Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, and Steve Morein. Support from ATI, NVIDIA, DARPA, IBM, and SONY; Rambus Stanford Graduate Fellowship; Stanford School of Engineering Fellowship.

Questions?