
Efficient Sparse Matrix-Vector Multiplication on GPUs using the CSR Storage Format
Joseph L. Greathouse, Mayank Daga
AMD Research
11/20/2014

NOTE: These notes are a general script for what I planned to say during the talk. This is not a transcript, and it tapers off near the end when I felt that a script was no longer needed.

Good morning everyone. As Kirk said, my name is Joseph Greathouse from AMD Research, and I'm here to describe a new method of performing sparse matrix-vector multiplication on GPUs without needing to change the matrix storage format from the traditional CSR.

This Talk in One Slide
- Demonstrate how to save space and time by using the CSR format for GPU-based SpMV
- Optimized GPU-based SpMV algorithm that increases memory coalescing by using the LDS
- 14.7x faster than other CSR-based algorithms
- 2.3x faster than using other matrix formats

Before I dive into the details, I'd like to give a one-slide overview of what this talk will be about. First, I'm going to demonstrate how to perform GPU-based sparse matrix-vector multiplication without needing to change your matrix storage structure from the traditional CSR format. This means that you don't need to waste time or space transforming the matrix whenever you want to perform SpMV on a GPU. To do this, I'm going to describe a new SpMV algorithm that uses the CSR storage format and removes the bottlenecks that were believed to prevent optimal performance on GPUs. In particular, by significantly increasing memory coalescing and using local scratchpads for scatter-gather accesses, our algorithm offers an almost 15x performance improvement over previous algorithms that use the CSR format. In addition, it is over 2 times faster than existing state-of-the-art algorithms that use more complex storage formats.

Background
I'm going to start the talk by walking through a good deal of background information on this problem. A lot of you in the audience may be familiar with many or all of these concepts. Nonetheless, I hope you'll follow along, because the algorithm I'm going to present later is not extraordinarily complex. In fact, I think that it falls out very simply from the background information I'm going to present. So keep paying attention, because this background information is going to directly inform our new optimized SpMV algorithm.

Sparse Matrix-Vector Multiplication
[Figure: an example sparse matrix (non-zero values 1.0 through 9.0, zeros shown as dashes) multiplied by a dense input vector (1 through 5) to produce a dense output vector (22, 28, 18, 39).]
Traditionally Bandwidth Limited

To start, let's talk about sparse matrix-vector multiplication. On the left, you'll see a sparse matrix. Most of the values are zero (denoted by dashes). In the middle, we have a dense vector, which we want to multiply by the matrix. Each entry in the matrix is multiplied by the entry in the vector that corresponds to its column, and the results from the same row are added together to produce one entry of the dense output vector. This process continues for every non-zero value in the matrix, until we have filled the output vector. You may have noticed that entries in the input vector were used multiple times, while the values in the matrix were only used once. This streaming access of the matrix, and the small number of mathematical operations used to produce the output, mean that this is considered a memory-bandwidth-bound kernel. Its performance is primarily constrained by how fast you can load the matrix from memory.

Compressed Sparse Row (CSR)
[Figure: the example matrix stored as three CSR arrays: row delimiters (the offset where each row's non-zeros begin, plus an end marker), columns (the column index of each non-zero), and values (the non-zero values 1.0 through 9.0).]
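For readers following along outside the talk, here is a minimal sketch of what the three CSR arrays look like in code, together with a plain serial SpMV loop that walks them. The struct layout and names are illustrative only, not the data structures used in the paper.

    /* Minimal CSR sketch (illustrative names) and a serial CPU SpMV loop. */
    typedef struct {
        int    num_rows;
        int   *row_delimiters;   /* length num_rows + 1: start of each row's non-zeros */
        int   *columns;          /* length nnz: column index of each non-zero */
        float *values;           /* length nnz: the non-zero values */
    } csr_matrix;

    void spmv_csr_serial(const csr_matrix *A, const float *x, float *y)
    {
        for (int row = 0; row < A->num_rows; ++row) {
            float sum = 0.0f;
            /* Row 'row' owns values[row_delimiters[row] .. row_delimiters[row+1]) */
            for (int i = A->row_delimiters[row]; i < A->row_delimiters[row + 1]; ++i)
                sum += A->values[i] * x[A->columns[i]];
            y[row] = sum;
        }
    }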

AMD GPU Memory System
[Figure: the GPU memory hierarchy: per-CU L1 caches (64 bytes per clock of L1 bandwidth per CU) alongside shared instruction (I$) and scalar (K$) caches, L2 cache partitions (64 bytes per clock of L2 bandwidth per partition), and multiple 64-bit dual-channel memory controllers.]
Uncoalesced accesses are much less efficient!

AMD GPU Compute Unit
[Figure: one compute unit, containing a 64 KB Local Data Share, vector registers (4x 64 KB), vector units (4x SIMD-16), and a 16 KB L1 cache.]
Must do parallel loads to maximize bandwidth

Local Data Share (LDS)
- Scratchpad (software-addressed) on-chip cache, 64 KB per CU
- Statically divided between all workgroups in a CU
- Highly ported to allow scatter/gather (uncoalesced) accesses
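As an aside for readers unfamiliar with OpenCL, the fragment below is a minimal illustration (not taken from the talk) of declaring LDS storage with __local and synchronizing with a barrier. The 256-entry array, the workgroup size of 256, and the kernel name are assumptions for the example.

    /* Minimal OpenCL C illustration of the LDS: __local storage is shared by a
       workgroup and supports fast scatter/gather accesses. */
    __kernel void lds_example(__global const float *in, __global float *out)
    {
        __local float scratch[256];                  /* allocated out of the 64 KB LDS */
        int lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];         /* coalesced load from global memory */
        barrier(CLK_LOCAL_MEM_FENCE);                /* make all LDS writes visible */
        out[get_global_id(0)] = scratch[255 - lid];  /* "uncoalesced" read, but cheap in LDS */
    }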

CSR-Scalar
One thread (work-item) per matrix row
[Figure: the rows are split among Workgroups #1-#3, each work-item walking its own row.]
Problem: uncoalesced accesses
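A minimal OpenCL C sketch of the CSR-scalar idea follows; it is illustrative rather than the exact kernel benchmarked in the paper. The comment marks where the uncoalesced accesses come from.

    /* Sketch of a CSR-scalar kernel: one work-item per matrix row. */
    __kernel void spmv_csr_scalar(__global const int   *row_delimiters,
                                  __global const int   *columns,
                                  __global const float *values,
                                  __global const float *x,
                                  __global float       *y,
                                  int num_rows)
    {
        int row = get_global_id(0);
        if (row >= num_rows)
            return;
        float sum = 0.0f;
        /* Neighboring work-items walk different rows, so in any given cycle they
           load from widely separated addresses: uncoalesced accesses. */
        for (int i = row_delimiters[row]; i < row_delimiters[row + 1]; ++i)
            sum += values[i] * x[columns[i]];
        y[row] = sum;
    }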

CSR-Vector
One workgroup (multiple work-items) per matrix row
[Figure: the rows are assigned to Workgroups #1 through #12, one row per workgroup.]
Good: coalesced accesses
Bad: poor parallelism
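The sketch below illustrates the CSR-vector idea in OpenCL C: one workgroup per row, coalesced loads of that row's non-zeros, and a tree reduction through the LDS. The workgroup size of 64 and the kernel name are assumptions for the example, not the specific implementation that was measured.

    /* Sketch of a CSR-vector-style kernel: one workgroup per row, with the
       partial sums reduced through the LDS. */
    #define WG_SIZE 64

    __kernel void spmv_csr_vector(__global const int   *row_delimiters,
                                  __global const int   *columns,
                                  __global const float *values,
                                  __global const float *x,
                                  __global float       *y)
    {
        __local float partial[WG_SIZE];
        int row = get_group_id(0);      /* one row per workgroup */
        int lid = get_local_id(0);

        float sum = 0.0f;
        /* Consecutive work-items read consecutive non-zeros: coalesced accesses. */
        for (int i = row_delimiters[row] + lid; i < row_delimiters[row + 1]; i += WG_SIZE)
            sum += values[i] * x[columns[i]];

        partial[lid] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);
        /* Tree reduction of the partial sums inside the LDS. */
        for (int stride = WG_SIZE / 2; stride > 0; stride >>= 1) {
            if (lid < stride)
                partial[lid] += partial[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            y[row] = partial[0];
    }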

Making GPU SpMV Fast
- Neither CSR-Scalar nor CSR-Vector offers ideal performance, yet CSR works well on CPUs and is widely used
- Traditional solution: a new matrix storage format plus new algorithms
  - ELLPACK/ITPACK, BCSR, DIA, BDIA, SELL, SBELL, ELL+COO Hybrid, and many others
  - Auto-tuning frameworks like clSpMV try to find the best format for each matrix
  - These can require more storage space or dynamic transformation overhead
- Can we find a good GPU algorithm that does not modify the original CSR data?

CSR-Adaptive

Increasing Bandwidth Efficiency
CSR-Vector coalesces long rows by streaming them into the LDS

CSR-Vector Using the LDS
[Figure: the products of a long row are streamed into the Local Data Share and summed there; because the LDS supports fast uncoalesced accesses, the whole reduction gets fast accesses everywhere.]

Increasing Bandwidth Efficiency
How can we get good bandwidth for the other, "short" rows? Load these into the LDS.

CSR-Stream: Stream the Matrix Into the LDS
[Figure: the matrix is split into blocks of consecutive rows (Blocks 1-4); each block's non-zeros are streamed into the Local Data Share, giving fast accesses everywhere.]
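A simplified OpenCL C sketch of the CSR-Stream idea is shown below: the workgroup streams every non-zero product of its row block into the LDS with coalesced loads, then each work-item serially reduces one short row out of the scratchpad. The row_blocks array, the buffer sizes, and all names are illustrative; the paper's kernel also supports a parallel reduction path.

    /* Simplified sketch of CSR-Stream. Each workgroup handles one row block,
       which is built so that its non-zeros fit in the LDS buffer. */
    #define WG_SIZE  64
    #define LDS_NNZ  256    /* assumed per-workgroup non-zero budget */

    __kernel void spmv_csr_stream(__global const int   *row_delimiters,
                                  __global const int   *columns,
                                  __global const float *values,
                                  __global const float *x,
                                  __global float       *y,
                                  __global const int   *row_blocks)
    {
        __local float streamed[LDS_NNZ];
        int lid       = get_local_id(0);
        int first_row = row_blocks[get_group_id(0)];
        int last_row  = row_blocks[get_group_id(0) + 1];    /* one past the end */
        int first_nz  = row_delimiters[first_row];
        int num_nz    = row_delimiters[last_row] - first_nz;

        /* Coalesced streaming of every non-zero product in this row block. */
        for (int i = lid; i < num_nz; i += WG_SIZE)
            streamed[i] = values[first_nz + i] * x[columns[first_nz + i]];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Each work-item reduces one (short) row entirely out of the LDS. */
        for (int row = first_row + lid; row < last_row; row += WG_SIZE) {
            float sum = 0.0f;
            for (int i = row_delimiters[row]; i < row_delimiters[row + 1]; ++i)
                sum += streamed[i - first_nz];
            y[row] = sum;
        }
    }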

Finding Blocks for CSR-Stream
Done once on the CPU; the CSR structure is unchanged.
[Figure: assuming 8 LDS entries per workgroup, the example row delimiters (3, 6, 8, 10, 14, 16, 17, 19, 20, 22, 24, 32) are grouped into row blocks (3, 6, 11, 12), splitting the matrix into Blocks 1-4.]
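The host-side row-block generation can be written as a single greedy pass over the row delimiters, sketched below. The per-block non-zero budget and the function name are assumptions for illustration, not the paper's exact code.

    /* Host-side sketch of row-block generation: close a block whenever adding
       the next row would exceed the per-block non-zero budget. */
    #define NNZ_PER_BLOCK 256

    int make_row_blocks(const int *row_delimiters, int num_rows, int *row_blocks)
    {
        int count = 0;
        int block_start = 0;
        row_blocks[count++] = 0;
        for (int row = 1; row <= num_rows; ++row) {
            int nnz = row_delimiters[row] - row_delimiters[block_start];
            if (nnz > NNZ_PER_BLOCK) {
                if (row - 1 == block_start) {
                    /* A single row larger than the budget gets its own block;
                       the GPU handles it with the CSR-Vector path. */
                    row_blocks[count++] = row;
                    block_start = row;
                } else {
                    /* Close the block before this row and re-examine the row
                       against the freshly started block. */
                    row_blocks[count++] = row - 1;
                    block_start = row - 1;
                    --row;
                }
            }
        }
        if (row_blocks[count - 1] != num_rows)
            row_blocks[count++] = num_rows;
        return count;    /* count boundaries describe count - 1 row blocks */
    }

In this sketch, the GPU kernel would then be launched with one workgroup per row block, i.e. count - 1 workgroups.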

What About Very Long Rows?
[Figure: CSR-Adaptive combines the two techniques: row blocks that contain several short rows use CSR-Stream, while a block containing a single very long row falls back to CSR-Vector, with both paths working out of the Local Data Share.]

CSR-Adaptive
On the CPU (once per matrix): create the row blocks.
On the GPU (each workgroup is assigned one row block): read the row block entries and check whether the block holds a single row.
- No (multiple short rows) → CSR-Stream: vector-load multiple rows into the LDS, then perform the reduction out of the LDS (serially or in parallel).
- Yes (one long row) → CSR-Vector: vector-load the row into the LDS, then perform a parallel reduction out of the LDS.
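Putting the pieces together, the per-workgroup decision can be sketched as follows in OpenCL C. The two branches correspond to the CSR-Stream and CSR-Vector sketches shown earlier; the kernel name and structure are illustrative, not the paper's released code.

    /* Sketch of the per-workgroup dispatch in CSR-Adaptive. */
    __kernel void spmv_csr_adaptive(__global const int   *row_delimiters,
                                    __global const int   *columns,
                                    __global const float *values,
                                    __global const float *x,
                                    __global float       *y,
                                    __global const int   *row_blocks)
    {
        int first_row = row_blocks[get_group_id(0)];
        int last_row  = row_blocks[get_group_id(0) + 1];

        if (last_row - first_row > 1) {
            /* Several short rows: stream them all into the LDS (CSR-Stream). */
            /* ... CSR-Stream path, as in the earlier sketch ... */
        } else {
            /* A single long row: the whole workgroup cooperates on it and
               reduces in parallel out of the LDS (CSR-Vector). */
            /* ... CSR-Vector path, as in the earlier sketch ... */
        }
    }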

Experiments

Experimental Setup
- CPU: AMD A10-7850K (3.7 GHz), 32 GB dual-channel DDR3-2133
- GPU: AMD FirePro™ W9100
  - 44 parallel compute units (CUs)
  - 930 MHz core clock rate (5.2 single-precision TFLOPs)
  - 16 GDDR5 channels at 1250 MHz (320 GB/s)
- CentOS 6.4 (kernel 2.6.32-358.23.2)
- AMD FirePro driver version 14.20 beta
- AMD APP SDK v2.9

SpMV Setup
- SHOC Benchmark Suite (Aug. 2013 version): CSR-Vector (details in paper: CSR-Scalar, ELLPACK)
- ViennaCL v1.5.2: CSR-Scalar, COO, ELLPACK, ELLPACK+COO HYB, and 4-value and 8-value padded versions of CSR-Scalar; best of these for each matrix: "ViennaCL Best"
- clSpMV v0.1: CSR-Scalar, CSR-Vector, BCSR, COO, DIA, BDIA, ELLPACK, SELL, BELL, SBELL; best of these for each matrix: "clSpMV Best"; details in paper: clSpMV "Cocktail" format
- CSR-Adaptive
- 20 matrices from the Univ. of Florida Sparse Matrix Collection

Data Structure Generation Times
[Figure: time to build each format's data structures.] Note: some matrices could not be held in ELLPACK or BCSR.

Performance in Single-Precision GFLOPs
[Figure: single-precision SpMV performance; CSR-Adaptive is roughly equal to the best alternative on 7 of 20 matrices, and other algorithms are better on 4 of 20.]
The arrows on this slide point to the only benchmarks where the previous CSR-Vector algorithm matches the performance of CSR-Adaptive.

Performance in Single-Precision GFLOPs
[Figure: single-precision SpMV performance; CSR-Adaptive performs better on 9 of 20 matrices.]

Performance Benefit of CSR-Adaptive

Conclusion
- CSR-Adaptive is up to 14.7x faster than previous GPU-based SpMV algorithms
- Requires little work beyond generating the traditional CSR data structure
- Less than 0.1% extra storage compared to CSR
- The algorithm is not complex: easy to integrate into libraries and existing SW
AMD sample code available soon – ask us! An independently implemented version is in ViennaCL 1.6.1.

Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, FirePro and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

AMD GPU Microarchitecture
- Many highly threaded vector processors: up to 112,640 work-items in flight
- Highly parallel memory system (16 channels, 320 GB/s)