STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES MAYANK DAGA AND JOSEPH L. GREATHOUSE AMD RESEARCH ADVANCED MICRO DEVICES, INC.

Slides:



Advertisements
Similar presentations
ATI Stream Computing OpenCL™ Histogram Optimization Illustration Marc Romankewicz April 5, 2010.
Advertisements

ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration Micah Villmow May 30, 2008.
ATI Stream ™ Physics Neal Robison Director of ISV Relations, AMD Graphics Products Group Game Developers Conference March 26, 2009.
OpenCL™ - Parallel computing for CPUs and GPUs Benedict R. Gaster AMD Products Group Lee Howes Office of the CTO.
Cooperative Boosting: Needy versus Greedy Power Management INDRANI PAUL 1,2, SRILATHA MANNE 1, MANISH ARORA 1,3, W. LLOYD BIRCHER 1, SUDHAKAR YALAMANCHILI.
EVOLUTION OF MULTIMEDIA & DISPLAY MAZEN SALLOUM 26 FEB 2015.
Instructor Notes This lecture discusses three important optimizations The performance impact of mapping threads to data on the GPU is subtle but extremely.
Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering,
OpenFOAM on a GPU-based Heterogeneous Cluster
GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Coordinated Energy Management in Heterogeneous Processors INDRANI PAUL 1,2, VIGNESH RAVI 1, SRILATHA MANNE 1, MANISH ARORA 1,3, SUDHAKAR YALAMANCHILI 2.
Panel Discussion: The Future of I/O From a CPU Architecture Perspective #OFADevWorkshop Brad Benton AMD, Inc.
To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,
Data Intensive Computing on Heterogeneous Platforms Norm Rubin Fellow GPG graphics products group AMD HPEC 2009.
HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU †, SOORAJ PUTHOOR †, BRADFORD M BECKMANN †, MARK.
AMD platform security processor
OpenCL Introduction A TECHNICAL REVIEW LU OCT
Challenges Bit-vector approach Conclusion & Future Work A subsequence of a string of symbols is derived from the original string by deleting some elements.
Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.
Copyright 2011, Atmel December, 2011 Atmel ARM-based Flash Microcontrollers 1 1.
OpenCL Introduction AN EXAMPLE FOR OPENCL LU OCT
1| AMD FirePro™ / Creo 2.0 Launch Event | April 2012 | Confidential – NDA Required AMD FIREPRO ™ / CREO 2.0 Sales Deck April 2012.
Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
A GPU Implementation of Inclusion-based Points-to Analysis Mario Méndez-Lojo (AMD) Martin Burtscher (Texas State University, USA) Keshav Pingali (U.T.
Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K.
© Banff and Buchan College 2007 DH2T 34 Computer Architecture 1 LO2 Lesson One Memory.
Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.
NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS Spring 2011.
Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Memory: Relocation.
Parallelization and Characterization of Pattern Matching using GPUs Author: Giorgos Vasiliadis 、 Michalis Polychronakis 、 Sotiris Ioannidis Publisher:
Hardware Acceleration Using GPUs M Anirudh Guide: Prof. Sachin Patkar VLSI Consortium April 4, 2008.
Empirical Quantification of Opportunities for Content Adaptation in Web Servers Michael Gopshtein and Dror Feitelson School of Engineering and Computer.
ATI Stream Computing ATI Radeon™ HD 2900 Series GPU Hardware Overview Micah Villmow May 30, 2008.
Joseph L. GreathousE, Mayank Daga AMD Research 11/20/2014
C O N F I D E N T I A LC O N F I D E N T I A L ATI FireGL ™ Workstation Graphics from AMD April 2008 AMD Graphics Product Group.
Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *,
Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,
FAULTSIM: A FAST, CONFIGURABLE MEMORY-RESILIENCE SIMULATOR DAVID A. ROBERTS, AMD RESEARCH PRASHANT J. NAIR, GEORGIA INSTITUTE OF TECHNOLOGY
Sunpyo Hong, Hyesoon Kim
SIMULATION OF EXASCALE NODES THROUGH RUNTIME HARDWARE MONITORING JOSEPH L. GREATHOUSE, ALEXANDER LYASHEVSKY, MITESH MESWANI, NUWAN JAYASENA, MICHAEL IGNATOWSKI.
SYNCHRONIZATION USING REMOTE-SCOPE PROMOTION MARC S. ORR †§, SHUAI CHE §, AYSE YILMAZER §, BRADFORD M. BECKMANN §, MARK D. HILL †§, DAVID A. WOOD †§ †
Heterogeneous Computing using openCL lecture 4 F21DP Distributed and Parallel Technology Sven-Bodo Scholz.
IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS BO SU † JOSEPH L. GREATHOUSE ‡ JUNLI GU ‡ MICHAEL BOYER ‡ LI SHEN † ZHIYING.
CLSPARSE: A VENDOR-OPTIMIZED OPEN-SOURCE SPARSE BLAS LIBRARY JOSEPH L. GREATHOUSE, KENT KNOX, JAKUB POŁA*, KIRAN VARAGANTI, MAYANK DAGA *UNIV. OF WROCŁAW.
PPEP: ONLINE PERFORMANCE, POWER, AND ENERGY PREDICTION FRAMEWORK BO SU † JUNLI GU ‡ LI SHEN † WEI HUANG ‡ JOSEPH L. GREATHOUSE ‡ ZHIYING WANG † † NUDT.
From Source Code to Packages and even whole distributions By Cool Person From openSUSE.
µC-States: Fine-grained GPU Datapath Power Management
Joseph L. GreathousE, Mayank Daga AMD Research 11/20/2014
CS427 Multicore Architecture and Parallel Computing
ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration
Measuring and Modeling On-Chip Interconnect Power on Real Hardware
BLIS optimized for EPYCTM Processors
Richard Dorrance Literature Review: 1/11/13
The Small batch (and Other) solutions in Mantle API
Heterogeneous System coherence for Integrated CPU-GPU Systems
Blake A. Hechtman†§, Shuai Che†, Derek R. Hower†, Yingying Tian†Ϯ,
In-depth on the memory system
SOC Runtime Gregory Stoner.
libflame optimizations with BLIS
Interference from GPU System Service Requests
Simulation of exascale nodes through runtime hardware monitoring
Interference from GPU System Service Requests
Machine Learning for Performance and Power Modeling of Heterogeneous Systems Joseph L. Greathouse, Gabriel H. Loh Advanced Micro Devices, Inc.
RegMutex: Inter-Warp GPU Register Time-Sharing
Machine Learning for Performance and Power Modeling of Heterogeneous Systems Joseph L. Greathouse, Gabriel H. Loh Advanced Micro Devices, Inc.
Advanced Micro Devices, Inc.
Jason Stewart (AMD) | Rolando Caloca O. (Epic Games) | 21 March 2018
Presentation transcript:

STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES MAYANK DAGA AND JOSEPH L. GREATHOUSE AMD RESEARCH ADVANCED MICRO DEVICES, INC.

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |2 THIS TALK IN ONE SLIDE Demonstrate how to save space and time by using the CSR format for GPU-based SpMV Dynamic GPU-based SpMV algorithm that is efficient for regular and irregular matrices 2x faster than CSR-Adaptive 36% faster than CSR5 at 10x less overheads

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA | SPARSE MATRIX-VECTOR MULTIPLICATION (SPMV) Traditionally Bandwidth Limited × = *1+ 2*3 + 3*5

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |4 Performance of SpMV is dominated by how the sparse matrix is stored in memory

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |5 COMPRESSED SPARSE ROW (CSR) row delimiters: columns: values:

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |6 CSR-SCALAR  One thread/work-item per matrix row Workgroup #1 Workgroup #2 Workgroup #3 Uncoalesced accesses

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |7 CSR-VECTOR  One workgroup (multiple threads) per matrix row Workgroup #1 Workgroup #2 Workgroup # Good: coalesced accesses Bad: poor parallelism

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |8 MAKING SPMV FAST ON GPUs  Neither CSR-scalar nor CSR-vector offer ideal performance  Traditional solution: new matrix storage format + new algorithms ‒More than fifty new storage formats have been proposed  Can we find a good GPU algorithm that does not modify the original CSR data? rewrite softwarespace and time overhead

CSR-Adaptive Automatically adapts to different problems in different domains

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |10 INCREASING BANDWIDTH EFFICIENCY CSR-vector coalesces long rows by streaming into the LDS

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |11 INCREASING BANDWIDTH EFFICIENCY Load these into the LDS  How can we get good bandwidth for other “short” rows?

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |12 CSR-STREAM: STREAM MATRIX INTO THE LDS Block 1Block 2 Block 3Block 4 Local Data Share Fast accesses everywhere

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |13  How can we get good bandwidth for “medium-sized” rows? Block 1Block 2 Block 3Block 4 CSR-Stream CSR-Vector ✗ Improved reduction technique Efficient use of memory subsystem

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |14 CSR-ADAPTIVE: SHORT AND MEDIUM ROWS CSR-StreamCSR-Vector Block 1Block 2 Block 3Block 4 Short rowsMedium-sized rows

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |15 WHAT ABOUT VERY LONG ROWS?  CSR-Vector: One workgroup can become a performance bottle-neck Block 1Block 2 Block 4 Block 3 CSR-Vector ✗ CSR-VectorL Block 4Block 5 Multiple workgroups compute the long row

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |16 CSR-VECTORL (LONG ROWS) Block 1Block 2Block 4 Local Data Share Block 3Block 5 Local Data Share Global Memory + atomic

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |17 OPTIMIZED CSR-STREAM: LOGARITHMIC REDUCTION 4 rows assigned to a workgroup with 8 threads – 2 threads / row

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |18 CSR-ADAPTIVE Block 1 Block 2 Block 4Block 3Block 5 CSR-StreamCSR-VectorCSR-VectorL CSR-Adaptive Short rows Medium-sized rows Long rows A complete SpMV solution

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |19 EXPERIMENTAL SETUP  CPU: AMD A K ‒32 GB DDR SDRAM  GPU: AMD FirePro TM W9100 ‒44 parallel 64-wide compute units (CUs) – (2816 ALUs) ‒5.2 single-precision TFLOPs / 2.6 double-precision TFLOPS ‒320 GB/s  32 different input matrices used in previous works  AMD APP SDK v2.9  AMD FirePro driver v14.20 Beta on Ubuntu

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |20 CSR-ADAPTIVE OPTIMIZATIONS AMD FIREPRO W9100 GPU 2x 9x9x

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |21 CSR-ADAPTIVE VS CSR5 AMD FIREPRO W9100 GPU (SINGLE PRECISION) 28% Lose out on 1/32 matrices 2x

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |22 OVERHEAD: CSR-ADAPTIVE VS CSR5

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |23 BANDWIDTH: CSR-ADAPTIVE

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |24 CONCLUSION Optimized GPU-based SpMV algorithm that beats the competition by 2x Operates within 15% of GPU’s achievable b/w Minimal overhead for data structure generation Shipping now as part of AMD’s clSPARSE library

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |25 DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD FirePro and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |27 FINDING BLOCKS FOR CSR-ADAPTIVE row delimiters: values: row blocks: CSR Structure Unchanged DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup columns: Block 1 Block 2 Block 4Block 3Block 5

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |28 A DECENT SPMV FOR THE CPU IS SIMPLE 1. for (i = 0; i < nRows; i++) { 2. temp = 0; 3. for (j = rowDs[i]; j < rowDs[i+1]; j++) { 4. temp += vals[j] * vec[cols[j]]; 5. } 6. output[i] = temp; 7. } #pragma omp parallel for Only seven lines of code provides decent SpMV performance on the CPU

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |29 CSR-VECTOR USING THE LDS Local Data Share Fast uncoalesced accesses! Fast accesses everywhere

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |30 CSR-VECTOR USING THE LDS Local Data Share Block 1Block 2 Block 3Block 4

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |31 CSR-ADAPTIVE VS CSR5 (BEST IN ACADEMIA) AMD FIREPRO W9100 GPU (DOUBLE PRECISION)

| STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES | DECEMBER 17, 2015 | 2015 HIPC | BENGALURU, INDIA |32 LOCAL DATA SHARE (LDS) MEMORY  Scratchpad (software-addressed) on-chip cache  Statically divided between all workgroups in a CU  Highly ported to allow scatter/gather (uncoalesced) accesses Local Data Share (64KB)