Sparse Matrix Dense Vector Multiplication by Pedro A. Escallon Parallel Processing Class Florida Institute of Technology April 2002

The Problem
Improve the speed of sparse matrix - dense vector multiplication using MPI on a Beowulf parallel computer.

What To Improve
- Current algorithms use excessive indirect addressing.
- Current optimizations depend on the structure of the matrix (the distribution of the nonzero elements).

Sparse Matrix Representations
- Coordinate format
- Compressed Sparse Row (CSR)
- Compressed Sparse Column (CSC)
- Modified Sparse Row (MSR)

Compressed Sparse Row (CSR)
Example matrix:

        | 0    A01  A02  0   |
    A = | 0    A11  0    A13 |
        | A20  0    0    0   |

CSR arrays (row starts, column indices, values):

    rS  = [0, 2, 4, 5]
    ndx = [1, 2, 1, 3, 0]
    val = [A01, A02, A11, A13, A20]

CSR Code

    /* y += A*x for an m-row CSR matrix; note the indirect
       addressing through ndx on every access to x */
    void sparseMul(int m, double *val, int *ndx, int *rS,
                   double *x, double *y)
    {
        int i, j;
        for (i = 0; i < m; i++)                  /* each row */
            for (j = rS[i]; j < rS[i + 1]; j++)  /* each nonzero in row i */
                y[i] += (*val++) * x[*ndx++];
    }
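As a quick check, here is a minimal driver (not part of the original slides) that runs sparseMul on the example matrix from the previous slide, with made-up values A01=1, A02=2, A11=3, A13=4, A20=5:

    #include <stdio.h>

    /* assumes the sparseMul definition above is in scope */
    int main(void)
    {
        double val[] = {1.0, 2.0, 3.0, 4.0, 5.0};  /* A01 A02 A11 A13 A20 */
        int    ndx[] = {1, 2, 1, 3, 0};
        int    rS[]  = {0, 2, 4, 5};
        double x[]   = {1.0, 1.0, 1.0, 1.0};
        double y[3]  = {0.0, 0.0, 0.0};

        sparseMul(3, val, ndx, rS, x, y);
        printf("y = [%g %g %g]\n", y[0], y[1], y[2]);  /* y = [3 7 5] */
        return 0;
    }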

Goals
- Eliminate indirect addressing
- Remove the dependency on the distribution of the nonzero elements
- Further compress the matrix storage
- Most of all, speed up the operation

Proposed Solution
The same example matrix

        | 0    A01  A02  0   |
    A = | 0    A11  0    A13 |
        | A20  0    0    0   |

is stored as a single stream of {rCol, val} pairs:

    {0,0} {1,A01} {2,A02} {-1,0} {1,A11} {3,A13} {-2,A20}

A nonpositive rCol marks the start of a new row: its absolute value is the row index, and its val holds that row's column-0 entry (0 when that entry is zero). Positive rCol values are ordinary column indices, so no separate index arrays are needed.

Data Structure

    /* one stream element: {rCol, val} */
    typedef struct {
        int    rCol;   /* column index, or row marker when <= 0 */
        double val;
    } dSparS_t;
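The slides do not show how the stream is built; the following converter is a sketch under the encoding inferred above (the function name csrToStream and the exact handling of column 0 are assumptions):

    /* Sketch: convert CSR arrays into the proposed {rCol,val} stream.
       Each row i opens with {-i, A[i][0]} (val 0 when the column-0
       entry is zero), followed by {c, A[i][c]} for the columns c > 0. */
    int csrToStream(int m, const double *val, const int *ndx,
                    const int *rS, dSparS_t *out)
    {
        int n = 0;
        for (int i = 0; i < m; i++) {
            int j = rS[i];
            out[n].rCol = -i;                 /* row marker */
            if (j < rS[i + 1] && ndx[j] == 0)
                out[n++].val = val[j++];      /* nonzero at column 0 */
            else
                out[n++].val = 0.0;           /* padding zero */
            for (; j < rS[i + 1]; j++) {      /* remaining nonzeros */
                out[n].rCol = ndx[j];
                out[n++].val = val[j];
            }
        }
        return n;                             /* stream length */
    }

On the example matrix this yields exactly the stream shown above: {0,0} {1,A01} {2,A02} {-1,0} {1,A11} {3,A13} {-2,A20}.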

Process
The stream A of hdr.size elements is divided among the p processes:

    local_size = hdr.size / p
    residual   = hdr.size % p     (residual < p)

[Diagram: A split into equal local_size blocks for processes 0 .. p-1, plus the residual elements.]

Scatter
[Diagram: the root scatters consecutive local_size-element blocks of A into the local_A buffer of each process 0 .. p-1.]
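A sketch of how this scatter step could look in MPI (the derived datatype construction is an assumption; the slides only name A, local_A, and local_size):

    #include <mpi.h>
    #include <stddef.h>

    /* Sketch: scatter the element stream in local_size blocks. */
    void scatterMatrix(dSparS_t *A, dSparS_t *local_A, int local_size)
    {
        MPI_Datatype dspars_type;
        int          lens[2]  = {1, 1};
        MPI_Aint     disps[2] = {offsetof(dSparS_t, rCol),
                                 offsetof(dSparS_t, val)};
        MPI_Datatype types[2] = {MPI_INT, MPI_DOUBLE};

        /* describe dSparS_t to MPI so structured elements move in one call */
        MPI_Type_create_struct(2, lens, disps, types, &dspars_type);
        MPI_Type_commit(&dspars_type);

        MPI_Scatter(A, local_size, dspars_type,
                    local_A, local_size, dspars_type,
                    0, MPI_COMM_WORLD);

        MPI_Type_free(&dspars_type);
    }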

Multiplication Code

    /* multiply the local stream slice by X; a nonpositive rCol
       starts a new row (see the Proposed Solution slide) */
    if ((index = local_A[0].rCol) > 0)
        local_Y[0].val = local_A[0].val * X[index];
    else
        local_Y[0].val = local_A[0].val * X[0];
    local_Y[0].rCol = -1;

    k = 1; h = 0;
    while (k < local_size) {
        /* accumulate the remaining nonzeros of the current row
           (bounds check first, so local_A[k] is never read past the end) */
        while ((k < local_size) && (0 < (index = local_A[k].rCol)))
            local_Y[h].val += local_A[k++].val * X[index];
        if (k < local_size) {                 /* row marker reached */
            local_Y[h++].rCol = -index - 1;   /* tag finished sum with its row */
            local_Y[h].val = local_A[k++].val * X[0];
        }
    }
    local_Y[h].rCol = local_Y[h - 1].rCol + 1;  /* tag the last row */
    h++;
    while (h < stride)
        local_Y[h++].rCol = -1;               /* pad unused slots */

Multiplication
[Diagram: local_Y = local_A * X on each process; X is the domain and local_Y the range. local_A holds local_size elements, local_Y holds stride elements.]

Algorithm
Trace of the stream elements through the multiplication loop:

    local_A element   X entry   Y.val update            Y.rCol written
    {r0, v00}         X[0]      = v00 * X[0]            -
    {c01, v01}        X[c01]    += v01 * X[c01]         -
    {c02, v02}        X[c02]    += v02 * X[c02]         -r1 - 1 (when the next marker is read)
    {r1, v10}         X[0]      = v10 * X[0]            -
    {c11, v11}        X[c11]    += v11 * X[c11]         -

Gather
[Diagram: the stride-element local_Y buffers from processes 0 .. p-1 are gathered into gatherBuffer at the root; a row that straddles a process boundary arrives as a split element whose partial sums must still be combined. The residual elements are handled separately.]
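The gather is the mirror image of the scatter; a sketch reusing the datatype from the scatter sketch above (gatherBuffer and stride are the slide's names):

    /* Sketch: collect every rank's stride-element local_Y at the root. */
    MPI_Gather(local_Y, stride, dspars_type,
               gatherBuffer, stride, dspars_type,
               0, MPI_COMM_WORLD);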

Consolidation of Split Rows
[Diagram: at the root, the partial sums in gatherBuffer are added (+=) into the result vector Y, merging rows that were split across processes, along with the residual elements.]
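A sketch of this consolidation at the root, assuming (as in the multiplication code) that each local_Y entry is tagged with the row it belongs to and that padding slots carry rCol = -1:

    /* Sketch: merge partial sums; rows split across ranks add up here.
       Assumes Y was zero-initialized before the merge. */
    for (int i = 0; i < p * stride; i++) {
        int row = gatherBuffer[i].rCol;
        if (row >= 0)                        /* skip padding slots */
            Y[row] += gatherBuffer[i].val;
    }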

Results (vavasis3)
vavasis3.rua - total nonzero values: 1,683,902 - p = 10
[Table: broadcast, scatter, gather, and computation times for each of the 10 processes.]

Results (vavasis3)
vavasis3.rua - total nonzero values: 1,683,902 - p = 8
[Table: broadcast, scatter, gather, and computation times for each of the 8 processes.]
vavasis3.rua - total nonzero values: 1,683,902 - p = 1
[Table: broadcast, scatter, gather, and computation times for the single process.]

Results (vavasis3)
vavasis3.rua - total nonzero values: 1,683,902 - p = 4
[Table: broadcast, scatter, gather, and computation times for each of the 4 processes.]
vavasis3.rua - total nonzero values: 1,683,902 - p = 2
[Table: broadcast, scatter, gather, and computation times for each of the 2 processes.]

Results (vavasis3)
vavasis3.rua - calculated results
[Table columns: P, Computation, Speedup, E_p, Gather, C_p.]

Results (bayer02)
bayer02.rua - total nonzero values: 63,679 - p = 10
[Table: broadcast, scatter, gather, and computation times for each of the 10 processes.]

Results (bayer02)
bayer02.rua - total nonzero values: 63,679 - p = 8
[Table: broadcast, scatter, gather, and computation times for each of the 8 processes.]
bayer02.rua - total nonzero values: 63,679 - p = 1
[Table: broadcast, scatter, gather, and computation times for the single process.]

Results (bayer02)
bayer02.rua - total nonzero values: 63,679 - p = 4
[Table: broadcast, scatter, gather, and computation times for each of the 4 processes.]
bayer02.rua - total nonzero values: 63,679 - p = 2
[Table: broadcast, scatter, gather, and computation times for each of the 2 processes.]

Results (bayer02)
bayer02.rua - calculated results
[Table columns: P, Computation, Speedup, E_p, Gather, C_p.]

Conclusions
- The proposed representation speeds up the matrix computation.
- The handling of data mismatches (split rows) before the gather should be improved.
- There appears to be a communication penalty for moving structured data.

Bibliography
- Eun-Jin Im, "Optimizing the Performance of Sparse Matrix-Vector Multiplication", dissertation.
- Yousef Saad, "Iterative Methods for Sparse Linear Systems".
- Iain S. Duff, "Users' Guide for the Harwell-Boeing Sparse Matrix Collection".