Joseph L. Greathouse, Mayank Daga AMD Research 11/20/2014


Efficient Sparse Matrix-Vector Multiplication on GPUs using the CSR Storage Format
Joseph L. Greathouse, Mayank Daga, AMD Research, 11/20/2014

NOTE: These notes are a general script for what I planned to say during the talk. This is not a transcript, and it tapers off near the end when I felt that a script was no longer needed.

Good morning everyone. As Kirk said, my name is Joseph Greathouse from AMD Research, and I'm here to describe a new method of performing sparse matrix-vector multiplication on GPUs without needing to change the matrix storage format from the traditional CSR.

This Talk in One Slide
- Demonstrate how to save space and time by using the CSR format for GPU-based SpMV
- Optimized GPU-based SpMV algorithm that increases memory coalescing by using LDS
- 14.7x faster than other CSR-based algorithms
- 2.3x faster than using other matrix formats

Before I dive into the details, I'd like to give a one-slide overview of what this talk will be about. First, I'm going to demonstrate how to perform GPU-based sparse matrix-vector multiplication without needing to change your matrix storage structure from the traditional CSR format. This means that you don't need to waste time or space transforming the matrix whenever you want to perform SpMV on a GPU. To do this, I'm going to describe a new SpMV algorithm that uses the CSR storage format and that removes the bottlenecks that were believed to prevent optimal performance on GPUs. In particular, by significantly increasing memory coalescing and using local scratchpads for scatter-gather accesses, our algorithm offers an almost 15x performance improvement over previous algorithms that use the CSR format. In addition, it's over 2 times faster than existing state-of-the-art algorithms that use more complex storage formats.

Background I’m going to start the talk by walking through a good deal of background information on this problem. A lot of you in the audience may be familiar with many or all of these concepts. Nonetheless, I hope you’ll follow along, because the algorithm I’m going to present later is not extraordinarily complex. In fact, I think that it falls out very simply from the background information I’m going to present. So keep paying attention, because this background information is going to directly inform our new optimized SpMV algorithm.

Sparse Matrix-Vector Multiplication
[Figure: a 5x5 sparse matrix with nonzero values 1.0 through 9.0 (zeros shown as dashes) is multiplied by the dense vector (1, 2, 3, 4, 5) to produce a dense output vector.]
Traditionally Bandwidth Limited

To start, let's talk about "sparse matrix vector multiplication." On the left, you'll see a sparse matrix. Most of the values are zero (denoted by dashes). In the middle, we have a dense vector, which we want to multiply by the matrix. Each entry in the matrix is multiplied by the entry in the vector that corresponds to its column *click click click*, and the results from the same row are added together to produce one of the dense vector outputs. This process is continued for every non-zero value in the matrix, until we have filled the output vector. You may have noticed that entries in the input vector were used multiple times, while the values in the matrix were only used once. This streaming access of the matrix, and the small number of mathematical operations used to produce the output *click*, mean that this is considered to be a memory bandwidth-bound kernel. Its performance is primarily constrained by how fast you can load the matrix from memory.

Compressed Sparse Row (CSR)
[Figure: the same 5x5 matrix stored as three arrays: a values array holding the nonzeros 1.0 through 9.0 in row-major order, a columns array holding each nonzero's column index, and a row delimiters array marking where each row begins in the other two arrays.]
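To make the layout concrete, here is a minimal C sketch of the three CSR arrays and a serial SpMV over them. The 3x3 matrix, the array contents, and the function name are illustrative, not the slide's exact example:

```c
/* CSR layout for a small illustrative matrix:
 *
 *   | 1.0  -   2.0 |     values:         1.0 2.0 3.0 4.0 5.0
 *   |  -  3.0   -  |     columns:          0   2   1   0   2
 *   | 4.0  -   5.0 |     row delimiters:   0   2   3   5
 */
static const double values[]  = {1.0, 2.0, 3.0, 4.0, 5.0};
static const int columns[]    = {0, 2, 1, 0, 2};
static const int row_delims[] = {0, 2, 3, 5};  /* n_rows + 1 entries */

/* Serial CSR SpMV: y = A * x. Each row's nonzeros live in the
 * contiguous slice [row_delims[row], row_delims[row+1]) of the
 * values and columns arrays. */
void spmv_csr(int n_rows, const int *delims, const int *cols,
              const double *vals, const double *x, double *y)
{
    for (int row = 0; row < n_rows; row++) {
        double sum = 0.0;
        for (int i = delims[row]; i < delims[row + 1]; i++)
            sum += vals[i] * x[cols[i]];
        y[row] = sum;
    }
}
```

Note that the matrix arrays are streamed front to back exactly once, while x is gathered repeatedly through cols[i]; this is the access pattern the following slides set out to optimize.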

AMD GPU Memory System
[Diagram: each compute unit has its own L1 cache (64 bytes per clock of L1 bandwidth per CU) and shares instruction and constant caches; L2 cache partitions provide 64 bytes per clock of L2 bandwidth per partition, backed by multiple 64-bit dual-channel memory controllers.]
Uncoalesced accesses are much less efficient!

AMD GPU Compute Unit
[Diagram: each compute unit contains a 64KB Local Data Share, 4x 64KB vector register files, four SIMD-16 vector units, and a 16KB L1 cache.]
Must do parallel loads to maximize bandwidth

Local Data Share (LDS) Memory
- Scratchpad (software-addressed) on-chip cache, 64KB per compute unit
- Statically divided between all workgroups in a CU
- Highly ported to allow scatter/gather (uncoalesced) accesses

CSR-Scalar
One thread/work-item per matrix row
[Diagram: Workgroups #1, #2, and #3 each cover a band of consecutive matrix rows.]
Bad: uncoalesced accesses

CSR-Vector
One workgroup (multiple threads) per matrix row
[Diagram: Workgroups #1 through #12, one per matrix row.]
Good: coalesced accesses
Bad: poor parallelism
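A serial C emulation may make the CSR-Vector access pattern clearer. WG_SIZE and the explicit lane loop stand in for a real workgroup; this is a sketch of the scheme, not the talk's actual kernel:

```c
#define WG_SIZE 4  /* illustrative workgroup width */

/* CSR-Vector sketch: WG_SIZE "lanes" cooperate on a single row.
 * Lane L reads nonzeros L, L+WG_SIZE, L+2*WG_SIZE, ... so on each
 * step the lanes touch consecutive addresses (coalesced). Each lane
 * keeps a private partial sum, and a final reduction combines them. */
double csr_vector_row(int row, const int *delims, const int *cols,
                      const double *vals, const double *x)
{
    double partial[WG_SIZE] = {0.0};
    int start = delims[row], end = delims[row + 1];
    for (int lane = 0; lane < WG_SIZE; lane++)        /* emulated lanes */
        for (int i = start + lane; i < end; i += WG_SIZE)
            partial[lane] += vals[i] * x[cols[i]];
    double sum = 0.0;
    for (int lane = 0; lane < WG_SIZE; lane++)        /* reduction */
        sum += partial[lane];
    return sum;
}
```

The poor-parallelism problem is visible here too: a row with fewer than WG_SIZE nonzeros leaves most lanes idle.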

Making GPU SpMV Fast
Neither CSR-Scalar nor CSR-Vector offers ideal performance, yet CSR works well on CPUs and is widely used.
Traditional solution: new matrix storage format + new algorithms
- ELLPACK/ITPACK, BCSR, DIA, BDIA, SELL, SBELL, ELL+COO Hybrid, many others
- Auto-tuning frameworks like clSpMV try to find the best format for each matrix
- Can require more storage space or dynamic transformation overhead
Can we find a good GPU algorithm that does not modify the original CSR data?

CSR-Adaptive

Increasing Bandwidth Efficiency CSR-Vector coalesces long rows by streaming into the LDS

CSR-Vector Using the LDS
[Diagram: a long row's nonzeros are streamed into the Local Data Share, then combined with a tree of additions.]
Fast uncoalesced accesses!
Fast accesses everywhere

Increasing Bandwidth Efficiency How can we get good bandwidth for other “short” rows? Load these into the LDS

CSR-Stream: Stream Matrix Into the LDS
[Diagram: the matrix's nonzeros, partitioned into Blocks 1-4, are streamed into the Local Data Share.]
Fast accesses everywhere
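A serial C emulation of the CSR-Stream step (LDS_ENTRIES and the function name are illustrative; a real kernel would perform both the streaming and the per-row reductions with parallel work-items):

```c
#define LDS_ENTRIES 8  /* illustrative LDS budget per workgroup */

/* CSR-Stream sketch: first stream every nonzero product in the row
 * block into a scratch array (standing in for the LDS) with one
 * contiguous, coalesced pass, then reduce the scratch contents row
 * by row out of fast local memory. */
void csr_stream_block(int first_row, int last_row, const int *delims,
                      const int *cols, const double *vals,
                      const double *x, double *y)
{
    double lds[LDS_ENTRIES];
    int base = delims[first_row];
    for (int i = base; i < delims[last_row]; i++)      /* coalesced stream */
        lds[i - base] = vals[i] * x[cols[i]];
    for (int row = first_row; row < last_row; row++) { /* per-row reduce */
        double sum = 0.0;
        for (int i = delims[row]; i < delims[row + 1]; i++)
            sum += lds[i - base];
        y[row] = sum;
    }
}
```

The key point is that all loads from global memory happen in the first loop, in order; only the cheap LDS accesses follow the irregular row structure.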

Finding Blocks for CSR-Stream
Done once on the CPU; the CSR structure is unchanged.
Assuming 8 LDS entries per workgroup:
row blocks: 0 3 6 11 12
row delimiters: 0 3 6 8 10 14 16 17 19 20 22 24 32
[Diagram: the columns and values arrays, partitioned into Blocks 1-4 at the row-block boundaries.]
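The once-per-matrix CPU pass can be sketched in a few lines of C. This is a greedy sketch under the slide's assumption of 8 LDS entries per workgroup, using the slide's row delimiters with an implicit leading 0; the function name is illustrative:

```c
/* Pack consecutive rows into row blocks whose nonzeros fit in the
 * LDS budget. A row that alone exceeds the budget gets a block of
 * its own (later handled by the CSR-Vector path). Returns the
 * number of boundaries written to row_blocks. */
int create_row_blocks(int n_rows, const int *row_delims,
                      int lds_entries, int *row_blocks)
{
    int n = 0, nnz_in_block = 0;
    row_blocks[n++] = 0;
    for (int row = 0; row < n_rows; row++) {
        int row_nnz = row_delims[row + 1] - row_delims[row];
        if (nnz_in_block > 0 && nnz_in_block + row_nnz > lds_entries) {
            row_blocks[n++] = row;  /* close the block before this row */
            nnz_in_block = 0;
        }
        nnz_in_block += row_nnz;
    }
    row_blocks[n++] = n_rows;       /* final boundary */
    return n;
}
```

On the slide's 12-row example this produces the boundaries 0, 3, 6, 11, 12: three blocks of short rows with 8 nonzeros each, plus one block holding the single 8-nonzero row.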

What about Very Long Rows?
[Diagram: CSR-Adaptive sends row blocks containing multiple short rows (Blocks 1-3) through CSR-Stream, and a block holding a single very long row (Block 4) through CSR-Vector; both paths go through the Local Data Share.]

CSR-Adaptive
On CPU: Create row blocks (once per matrix).
On GPU (each workgroup assigned one row block): Read the row block entries. Single row?
- No: CSR-Stream. Vector-load multiple rows into the LDS, then perform the reduction out of the LDS (serially or in parallel).
- Yes: CSR-Vector. Vector-load the row into the LDS, then perform a parallel reduction out of the LDS.
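The whole flow can be emulated serially in C. Both branches compute the same values here; on the GPU they differ in how the loads and reductions are parallelized, and the names and structure below are an illustrative sketch rather than the talk's actual kernel:

```c
/* CSR-Adaptive sketch: each "workgroup" takes one row block. A block
 * holding a single row is a long row and takes the CSR-Vector path;
 * otherwise the block's rows take the CSR-Stream path. */
void csr_adaptive(int n_blocks, const int *row_blocks, const int *delims,
                  const int *cols, const double *vals,
                  const double *x, double *y)
{
    for (int b = 0; b < n_blocks; b++) {
        int first = row_blocks[b], last = row_blocks[b + 1];
        if (last - first == 1) {
            /* CSR-Vector path: all lanes cooperate on one long row */
            double sum = 0.0;
            for (int i = delims[first]; i < delims[first + 1]; i++)
                sum += vals[i] * x[cols[i]];
            y[first] = sum;
        } else {
            /* CSR-Stream path: stream the block's nonzeros, reduce per row */
            for (int row = first; row < last; row++) {
                double sum = 0.0;
                for (int i = delims[row]; i < delims[row + 1]; i++)
                    sum += vals[i] * x[cols[i]];
                y[row] = sum;
            }
        }
    }
}
```

Because the dispatch decision reads only the precomputed row-block boundaries, the GPU kernel needs no per-row branching beyond this one test per workgroup.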

Experiments

Experimental Setup
CPU: AMD A10-7850K (3.7 GHz), 32 GB dual-channel DDR3-2133
GPU: AMD FirePro™ W9100
- 44 parallel compute units (CUs)
- 930 MHz core clock rate (5.2 single-precision TFLOPs)
- 16 GDDR5 channels at 1250 MHz (320 GB/s)
Software: CentOS 6.4 (kernel 2.6.32-358.23.2), AMD FirePro Driver version 14.20 beta, AMD APP SDK v2.9

SpMV Setup
- SHOC Benchmark Suite (Aug. 2013 version): CSR-Vector (details in paper: CSR-Scalar, ELLPACK)
- ViennaCL v1.5.2: CSR-Scalar, COO, ELLPACK, ELLPACK+COO HYB, and 4-value and 8-value padded versions of CSR-Scalar; best of these for each matrix: "ViennaCL Best"
- clSpMV v0.1: CSR-Scalar, CSR-Vector, BCSR, COO, DIA, BDIA, ELLPACK, SELL, BELL, SBELL; best of these for each matrix: "clSpMV Best" (details in paper: clSpMV "Cocktail" format)
- CSR-Adaptive
20 matrices from Univ. of Florida Sparse Matrix Collection

Data Structure Generation Times
Note: Some matrices could not be held in ELLPACK or BCSR

Performance in Single-Precision GFLOPs
[Chart annotations: Roughly Equal Performance (7/20); Others Better (4/20)]
The arrows on this slide point to the only benchmarks where the previous CSR-Vector algorithm matches the performance of CSR-Adaptive.

Performance in Single-Precision GFLOPs
[Chart annotation: CSR-Adaptive Performance Higher (9/20)]

Performance Benefit of CSR-Adaptive

Conclusion
- CSR-Adaptive up to 14.7x faster than previous GPU-based SpMV algorithms
- Requires little work beyond generating the traditional CSR data structure
- Less than 0.1% extra storage compared to CSR
- Algorithm is not complex: easy to integrate into libraries and existing SW
AMD Sample Code Available Soon – Ask Us!
Independently Implemented Version in ViennaCL 1.6.1

Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, FirePro and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

AMD GPU Microarchitecture
- Many highly-threaded vector processors: up to 112,640 work-items in flight
- Highly parallel memory system (16 channels, 320 GB/s)