Irregular Applications – Sparse Matrix-Vector Multiplication

Irregular Applications – Sparse Matrix-Vector Multiplication. CS/EE 217 GPU Architecture and Parallel Programming, Lecture 20

Sparse Matrix-Vector Multiplication. Many slides from a presentation by: Kai He (1), Sheldon Tan (1), Esteban Tlelo-Cuautle (2), Hai Wang (3) and He Tang (3). (1) University of California, Riverside; (2) National Institute of Astrophysics, Optics and Electronics (INAOE), Mexico; (3) UESTC, China

Outline: Background; Review of existing methods of Sparse Matrix-Vector (SpMV) multiplication on GPU; SegSpMV method on GPU; Experimental results; Conclusion

Sparse Matrices: matrices where most elements are zero. They come up frequently in science, engineering, modeling, and computer science problems, for example solvers for linear systems of equations, or graphs (e.g., a representation of web connectivity). What impact/drawbacks do they have on computations?

Sparse Matrix Formats: CSR (Compressed Sparse Row), COO (Coordinate), DIA (Diagonal), ELL (ELLPACK), HYB (Hybrid). The formats span a spectrum from structured to unstructured matrices: DIA and ELL suit structured matrices, COO and CSR handle unstructured ones, and HYB is a hybrid of ELL and COO.

Compressed Sparse Row (CSR). Example: a 6x6 sparse matrix stored as three arrays (1-indexed, as on the slide); Row Ptr[i] and Row Ptr[i+1] delimit the entries of row i within Col Idx and Values.
Row Ptr: 1 4 6 7 9 11 13
Col Idx: 1 2 5 1 2 3 4 6 1 5 4 6
Values:  2 1 3 1 4 2 6 3 9 8 7 4
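
For reference (not on the original slide), the same example written as the 0-indexed C arrays that the kernels later in this lecture operate on:

int   row_ptr[]   = {0, 3, 5, 6, 8, 10, 12};              // num_rows = 6
int   col_index[] = {0, 1, 4,  0, 1,  2,  3, 5,  0, 4,  3, 5};
float data[]      = {2, 1, 3,  1, 4,  2,  6, 3,  9, 8,  7, 4};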

Compressed Sparse Column (CSC). The same matrix stored column by column:
Col Ptr: 1 4 6 7 9 11 13
Row Idx: 1 2 5 1 2 3 4 6 1 5 4 6
Values:  2 1 9 1 4 2 6 7 3 8 3 4

Characteristics of SpMV: memory bound (the FLOP-to-byte ratio is very low); generally irregular and unstructured, unlike dense matrix operations.
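
A rough back-of-the-envelope estimate (not from the original slides) of why SpMV is memory bound: with 4-byte single-precision values and 4-byte indices, each nonzero in CSR costs one multiply and one add (2 flops) but needs at least 12 bytes of traffic (the matrix value, its column index, and the gathered vector entry), so the arithmetic intensity is at most about 2/12, roughly 0.17 flop per byte, far below what a GPU needs to be compute bound.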

Objectives of an SpMV implementation on GPU: expose sufficient parallelism (develop thousands of independent threads); minimize execution-path divergence (SIMD utilization); minimize memory-access divergence (memory coalescing).

Computing Sparse Matrix-Vector Multiplication. Rows are laid out in sequence; for each nonzero element: Middle[elemid] = val[elemid] * vec[col[elemid]]

Sequential loop to implement SpMV:

for (int row = 0; row < num_rows; row++) {
    float dot = 0;
    int row_start = row_ptr[row];
    int row_end   = row_ptr[row + 1];
    for (int elem = row_start; elem < row_end; elem++) {
        dot += data[elem] * x[col_index[elem]];
    }
    y[row] = dot;
}

Discussion: How do we assign threads to work? How will this parallelize? Think of memory access patterns and of control divergence.

Algorithm: gbmv_s. One thread per row. (Figure: threads 0, 1, 2, ... each assigned to one row of the matrix.)

gbmv_s kernel:

__global__ void SpMV_CSR(int num_rows, float *data, int *col_index, int *row_ptr, float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0;
        int row_start = row_ptr[row];
        int row_end   = row_ptr[row + 1];
        for (int elem = row_start; elem < row_end; elem++) {
            dot += data[elem] * x[col_index[elem]];
        }
        y[row] = dot;
    }
}

Issues? Poor coalescing: adjacent threads process different rows, so their accesses to data[] and col_index[] are far apart in memory.
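
A possible host-side launch for this kernel (a sketch; the block size and the d_* device pointer names are assumptions, not from the slides):

int block_size = 256;                                        // assumed; any multiple of the warp size works
int num_blocks = (num_rows + block_size - 1) / block_size;   // one thread per row
SpMV_CSR<<<num_blocks, block_size>>>(num_rows, d_data, d_col_index, d_row_ptr, d_x, d_y);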

Algorithm: gbmv_w. One warp per row. Partial memory coalescing: reads of the row's values and column indices can be coalesced, but vec[col[elemid]] cannot, because col[elemid] can be arbitrary. How do you size the warps? (Figure: warps 0 through 4, each assigned to one row.)
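
A minimal sketch of the one-warp-per-row idea (not the lecture's code; names are assumed, blockDim.x is assumed to be a multiple of 32, and __shfl_down_sync needs CUDA 9 or later): lanes stride through one row so reads of data[] and col_index[] are coalesced, while the gather x[col_index[...]] stays uncoalesced, matching the point above.

__global__ void SpMV_CSR_warp(int num_rows, const float *data, const int *col_index,
                              const int *row_ptr, const float *x, float *y)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
    int lane    = threadIdx.x % 32;
    if (warp_id < num_rows) {
        float sum = 0;
        // consecutive lanes read consecutive elements of the row (coalesced)
        for (int elem = row_ptr[warp_id] + lane; elem < row_ptr[warp_id + 1]; elem += 32)
            sum += data[elem] * x[col_index[elem]];
        // warp-level reduction of the 32 partial sums
        for (int offset = 16; offset > 0; offset /= 2)
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        if (lane == 0)
            y[warp_id] = sum;
    }
}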

Vector Expansion. Build an expanded vector with vec_expanded[elemid] = vec[col[elemid]], so each nonzero's vector operand sits next to it in memory. The expansion operation is done on the CPU. (Figure: the Col Index array gathers entries of the original vector into the expanded vector.)
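
The expansion itself is a simple CPU-side gather; a minimal sketch, assuming nnz is the total number of nonzeros and the array names from the CSR slides:

for (int e = 0; e < nnz; e++)
    vec_expanded[e] = vec[col_index[e]];   // one vector operand per nonzero, in CSR order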

Algorithm: PS (Product & Summation). One thread per non-zero element; memory coalescing. Product kernel: each thread multiplies one value of the CSR matrix by the corresponding entry of the expanded vector. (Figure: threads 0 through 11 mapped one-to-one to the matrix values and the expanded vector.)
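
A minimal sketch of the product step under these assumptions (one thread per nonzero; array names assumed, not the paper's code):

__global__ void PS_product(int nnz, const float *data, const float *vec_expanded, float *product)
{
    int elem = blockIdx.x * blockDim.x + threadIdx.x;
    if (elem < nnz)
        product[elem] = data[elem] * vec_expanded[elem];  // all loads and the store are coalesced
}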

Algorithm: PS (Product & Summation). Summation kernel: the per-element products, held in shared memory, are summed per row to form the multiplication result. (Figure: threads 0, 3, 7, 8 and 10 each accumulate the products of one row.)

Algorithm: PS (Product & Summation). Drawbacks: we need to read the row pointer array to get row_begin and row_end for the summation; the number of additions per thread differs (load imbalance); and there may not be enough shared memory when a row has too many nonzeros.

Algorithm: segSpMV. Divide the matrix in CSR format and the expanded vector into small segments of constant length; segments without enough elements are padded with zeros. (Figure: with segment length 2, the example splits into segments 0 through 6.)

segSpMV Kernel. Each thread multiplies the values of one segment of the CSR matrix by the matching entries of the expanded vector, keeping the products in shared memory. (Figure: threads 0 through 3, matrix in CSR format, expanded vector, products in shared memory.)

segSpMV Kernel (cont.). Each thread then sums the products of its segment in shared memory and writes one intermediate result. (Figure: threads 0 through 2 reducing shared-memory products into intermediate results.)
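
A minimal sketch of the kernel idea described on the last two slides (not the paper's code; the segment length, array names and zero-padding are assumptions): phase 1 computes the element-wise products into shared memory with coalesced global loads; phase 2 lets each thread sum its own segment.

#define SEG_LEN 8   // assumed per-matrix segment length (the experiments use 4 to 64)

__global__ void segSpMV_kernel(int padded_len, const float *data_padded,
                               const float *vec_expanded, float *intermediate)
{
    extern __shared__ float product[];               // blockDim.x * SEG_LEN floats
    int block_base = blockIdx.x * blockDim.x * SEG_LEN;

    // Phase 1: coalesced products into shared memory.
    for (int i = threadIdx.x; i < blockDim.x * SEG_LEN; i += blockDim.x) {
        int g = block_base + i;
        product[i] = (g < padded_len) ? data_padded[g] * vec_expanded[g] : 0.0f;
    }
    __syncthreads();

    // Phase 2: each thread reduces one segment of SEG_LEN products.
    float sum = 0.0f;
    for (int i = 0; i < SEG_LEN; i++)
        sum += product[threadIdx.x * SEG_LEN + i];

    int seg = blockIdx.x * blockDim.x + threadIdx.x;
    if (seg * SEG_LEN < padded_len)
        intermediate[seg] = sum;
}

The kernel would be launched with blockDim.x * SEG_LEN * sizeof(float) bytes of dynamic shared memory per block.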

Final Summation. The final summation is done on the CPU; there are very few additions in this step if the segment length is chosen carefully. (Figure: intermediate per-segment results combined into the final multiplication result.)
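
A minimal sketch of this CPU step, assuming each row was padded to a whole number of segments and that seg_start[row] / seg_end[row] (assumed names, precomputed during segmentation) record which segments belong to each row:

for (int row = 0; row < num_rows; row++) {
    float sum = 0.0f;
    for (int s = seg_start[row]; s < seg_end[row]; s++)   // very few segments per row
        sum += intermediate[s];
    y[row] = sum;
}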

Algorithm: segSpMV. Advantages: enjoys fully coalesced memory access; less synchronization overhead, because all segments have the same length; all intermediate results can be stored in shared memory. Overall flow: read the matrix in CSR format and the original vector (CPU), vector expansion (CPU), segSpMV kernel (GPU), final summation (CPU).

Experiment Results. Test matrices (blank Seg_length entries presumably share the value of the row above, from merged cells in the original table):

Matrix            size      nnz       nz per row  Seg_length
mc2depi           525825    2100225   3.9942      4
mac_econ_fwd500   206500    1273389   6.1665      8
scircuit          170998    958936    5.6079
webbase-1M        1000005   6105536   6.1055
cop20k_A          121192    1362087   11.2391     16
cant              62451     2034917   32.5842     32
consph            83334     3046907   36.5626
pwtk              217918    5926171   27.1945
qcd5_4            49152     1916928   39
shipsec1          140874    3977139   28.2319
pdb1HYS           36417     2190591   60.153      64
rma10             46835     2374001   50.6886

Results of Short Segment (ms)

Speedup of SpMV

Results of Long Segment (ms)

Reference
[0] This paper: Kai He, Sheldon Tan, Esteban Tlelo-Cuautle, Hai Wang and He Tang, "A New Segmentation-Based GPU Accelerated Sparse Matrix Vector Multiplication", IEEE MWSCAS 2014.
[1] Wang BoXiong David and Mu Shuai, "Taming Irregular EDA Applications on GPUs", ICCAD 2009.
[2] Nathan Bell and Michael Garland, "Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", NVIDIA Research.
[3] M. Naumov, "Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU", NVIDIA Technical Report NVR-2011-001, 2011.

Another Alternative: ELL. The ELL representation is an alternative to CSR: start with CSR, pad each row to the length of the longest row, then transpose it (store it column-major), so that the i-th element of every row is contiguous in memory.

SpMV_ELL kernel:

__global__ void SpMV_ELL(int num_rows, float *data, int *col_index, int num_elem, float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0;
        for (int i = 0; i < num_elem; i++) {
            dot += data[row + i * num_rows] * x[col_index[row + i * num_rows]];
        }
        y[row] = dot;
    }
}

Example/Intuition. Let's look at thread 0. First iteration: 3 * col[0]. Second iteration: 1 * col[2]. Third iteration? Is this the correct result? What do you think of this implementation?
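
A minimal sketch (assumed names, not from the lecture) of building the padded, column-major ELL arrays on the CPU from CSR, i.e. the layout the SpMV_ELL kernel above expects:

#include <stdlib.h>   /* calloc */

int max_row = 0;                                   // longest row length = num_elem in the kernel
for (int r = 0; r < num_rows; r++)
    if (row_ptr[r + 1] - row_ptr[r] > max_row)
        max_row = row_ptr[r + 1] - row_ptr[r];

float *ell_data = (float *)calloc((size_t)num_rows * max_row, sizeof(float));
int   *ell_col  = (int   *)calloc((size_t)num_rows * max_row, sizeof(int));

for (int r = 0; r < num_rows; r++)
    for (int e = row_ptr[r], i = 0; e < row_ptr[r + 1]; e++, i++) {
        ell_data[r + i * num_rows] = data[e];      // column-major: element i of every row
        ell_col [r + i * num_rows] = col_index[e]; // sits contiguously across rows
    }
// padded slots keep value 0 and column 0, so they contribute 0 * x[0] = 0 in the kernel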

Problems? What if one (or a few) rows are dense? That means a lot of padding: big storage/memory overhead, and shared-memory overhead if you try to use it; ELL can actually become slower than CSR. What can we do? In comes COO (Coordinate format).

Main idea (see 10.4 in the book):

for (int i = 0; i < num_elem; i++)
    y[row_index[i]] += data[i] * x[col_index[i]];

Start with ELL, but store not only the column index of each element but also its row index. This now allows us to reorder elements; see the code above. What can we gain from doing that?

Main idea (cont'd). Convert CSR to ELL. As you convert, if there is a long row, remove some elements from it and keep them in COO format. Send the ELL data (missing the COO elements) to the GPU for the SpMV_ELL kernel to chew on. Once the results are back, the CPU adds in the contributions from the missing COO elements, which should be very few.
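
A minimal sketch of this hybrid split plus the CPU clean-up step (assumed names; `threshold` is a hypothetical row-length cap, and the ELL arrays are sized num_rows * threshold):

int k = 0;                                         // number of COO leftovers
for (int r = 0; r < num_rows; r++)
    for (int e = row_ptr[r], i = 0; e < row_ptr[r + 1]; e++, i++) {
        if (i < threshold) {                       // fits in the padded ELL arrays
            ell_data[r + i * num_rows] = data[e];
            ell_col [r + i * num_rows] = col_index[e];
        } else {                                   // overflow of a long row goes to COO
            coo_val[k] = data[e];
            coo_row[k] = r;
            coo_col[k] = col_index[e];
            k++;
        }
    }

// ... launch SpMV_ELL with num_elem = threshold, copy y back, then finish on the CPU:
for (int i = 0; i < k; i++)
    y[coo_row[i]] += coo_val[i] * x[coo_col[i]];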

Other optimizations? Sort rows by length (see 10.5 in the book): keep track of the original row order, create different kernels/grids for rows of similar size, and reshuffle y at the end back to its original order.
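
A minimal sketch (assumed names, C++ host code) of sorting rows by length while remembering the original order, and of the final reshuffle of y:

#include <algorithm>
#include <numeric>
#include <vector>

std::vector<int> perm(num_rows);
std::iota(perm.begin(), perm.end(), 0);            // perm[i] = i: remember original positions
std::sort(perm.begin(), perm.end(), [&](int a, int b) {
    return row_ptr[a + 1] - row_ptr[a] > row_ptr[b + 1] - row_ptr[b];   // longest rows first
});
// ... build the permuted matrix, run one kernel/grid per group of similar-length rows,
// producing y_perm in permuted order, then scatter back:
for (int i = 0; i < num_rows; i++)
    y[perm[i]] = y_perm[i];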

Questions? Please check Chapter 10 in the book.