Sparse Matrix Dense Vector Multiplication by Pedro A. Escallon Parallel Processing Class Florida Institute of Technology April 2002.


1 Sparse Matrix Dense Vector Multiplication by Pedro A. Escallon Parallel Processing Class Florida Institute of Technology April 2002

2 The Problem Improve the speed of sparse-matrix dense-vector multiplication using MPI on a Beowulf parallel computer.

3 What To Improve
Current algorithms use excessive indirect addressing.
Current optimizations depend on the structure of the matrix (the distribution of its nonzero elements).

4 Sparse Matrix Representations
Coordinate format (COO)
Compressed Sparse Row (CSR)
Compressed Sparse Column (CSC)
Modified Sparse Row (MSR)

5 Compressed Sparse Row (CSR)

        | 0    A01  A02  0   |
    A = | 0    A11  0    A13 |
        | A20  0    0    0   |

    rS  = [ 0    2    4    5 ]
    ndx = [ 1    2    1    3    0 ]
    val = [ A01  A02  A11  A13  A20 ]

6 CSR Code

void sparseMul(int m, double *val, int *ndx, int *rS,
               double *x, double *y)
{
    int i, j;
    for (i = 0; i < m; i++) {
        for (j = rS[i]; j < rS[i + 1]; j++) {
            y[i] += (*val++) * x[*ndx++];
        }
    }
}
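The CSR routine can be exercised on the 3x4 example matrix of slide 5. The sketch below uses an equivalent array-indexed form of the loop; the numeric values (A01 = 1, A02 = 2, and so on) and the helper names are illustrative, not from the slides:

```c
#include <assert.h>

/* CSR y = A*x, array-indexed variant of the slide's pointer-bumping
   loop: rS holds row starts, ndx column indices, val nonzeros. */
static void csr_mul(int m, const double *val, const int *ndx,
                    const int *rS, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = rS[i]; j < rS[i + 1]; j++)
            y[i] += val[j] * x[ndx[j]];   /* indirect access into x */
    }
}

/* The slide-5 example with illustrative values A01=1, A02=2,
   A11=3, A13=4, A20=5, multiplied by x = (1,1,1,1). */
static void csr_demo(double *y)
{
    const int    rS[]  = {0, 2, 4, 5};
    const int    ndx[] = {1, 2, 1, 3, 0};
    const double val[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    const double x[]   = {1.0, 1.0, 1.0, 1.0};
    csr_mul(3, val, ndx, rS, x, y);
}
```

Note the two indirect reads per nonzero (ndx[j], then x[ndx[j]]); this is the addressing pattern the talk sets out to reduce.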

7 Goals Eliminate indirect addressing Remove the dependency on the distribution of the nonzero elements Further compress the matrix storage Most of all, to speed up the operation

8 Proposed Solution

        | 0    A01  A02  0   |
    A = | 0    A11  0    A13 |
        | A20  0    0    0   |

    {0,0} {1,A01} {2,A02} {-1,0} {1,A11} {3,A13} {-2,A20}

9 Data Structure

typedef struct {
    int    rCol;
    double val;
} dSparS_t;          /* one {rCol, val} pair */
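For illustration, the slide-8 example matrix can be packed into this structure as follows. The helper name encode_pairs is introduced here, and the row-start convention (column 0 stored explicitly with rCol equal to the negated row index) is inferred from the slide's example stream, so treat this as a sketch:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int rCol; double val; } dSparS_t;

/* Encode a dense m x n matrix into the {rCol,val} stream of slide 8:
   each row r contributes a row-start pair {-r, A[r][0]} (column 0 is
   stored even when zero), followed by one pair {c, A[r][c]} per
   nonzero in columns c >= 1.  Returns the number of pairs written.
   This matches {0,0}{1,A01}{2,A02}{-1,0}{1,A11}{3,A13}{-2,A20}. */
static size_t encode_pairs(int m, int n, const double *A, dSparS_t *out)
{
    size_t k = 0;
    for (int r = 0; r < m; r++) {
        out[k].rCol = -r;             /* row start: negated row index */
        out[k].val  = A[r * n];       /* column 0, zero or not        */
        k++;
        for (int c = 1; c < n; c++) {
            if (A[r * n + c] != 0.0) {
                out[k].rCol = c;
                out[k].val  = A[r * n + c];
                k++;
            }
        }
    }
    return k;
}
```

With this convention a consumer never needs a separate index array: the sign of rCol distinguishes row starts from in-row column indices.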

10 Process
[Figure: the hdr.size pairs of A are divided among the p processes:
local_size = hdr.size / p, residual = hdr.size % p, with residual < p.]
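The block-size computation the slide describes is a simple integer division and remainder; a minimal sketch (partition is a hypothetical helper name):

```c
#include <assert.h>

/* Block partitioning from slide 10: each of the p processes gets
   local_size = hdr.size / p pairs; the residual = hdr.size % p
   leftover pairs (always fewer than p) are handled separately. */
static void partition(int total, int p, int *local_size, int *residual)
{
    *local_size = total / p;
    *residual   = total % p;
}
```

For the vavasis3 matrix (1,683,902 pairs) on p = 10 this gives local_size = 168,390 and residual = 2.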

11 Scatter
[Figure: the A pair stream is scattered in blocks of local_size elements to local_A on processes 0 .. p-1.]
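The slides do not show the scatter call itself. One common MPI_Scatterv-style layout gives each rank local_size pairs plus one extra to the first residual ranks; the sketch below computes those counts and displacements (scatter_layout is a name introduced here, and note the slides instead appear to leave the residual to be handled at gather time):

```c
#include <assert.h>

/* Counts and displacements (in pair units) for an MPI_Scatterv-style
   distribution of `total` pairs over p ranks: every rank receives
   total / p pairs and the first total % p ranks one extra.  This is
   one common realization of the scatter in slide 11, not necessarily
   the talk's exact scheme. */
static void scatter_layout(int total, int p, int *counts, int *displs)
{
    int local_size = total / p;
    int residual   = total % p;
    int off = 0;
    for (int r = 0; r < p; r++) {
        counts[r] = local_size + (r < residual ? 1 : 0);
        displs[r] = off;          /* offset of rank r's block */
        off += counts[r];
    }
}
```

These arrays would be passed to MPI_Scatterv together with a committed derived datatype describing the dSparS_t struct.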

12 Multiplication Code

/* first pair of the local block */
if ((index = local_A[0].rCol) > 0)
    local_Y[0].val = local_A[0].val * X[index];
else
    local_Y[0].val = local_A[0].val * X[0];
local_Y[0].rCol = -1;

k = 1; h = 0;
while (k < local_size) {
    /* accumulate pairs belonging to the current row */
    while ((k < local_size) && (0 < (index = local_A[k].rCol)))
        local_Y[h].val += local_A[k++].val * X[index];
    /* a non-positive rCol starts the next row */
    if (k < local_size) {
        local_Y[h++].rCol = -index - 1;
        local_Y[h].val = local_A[k++].val * X[0];
    }
}
local_Y[h].rCol = local_Y[h - 1].rCol + 1;
h++;
/* pad the rest of the output block */
while (h < stride)
    local_Y[h++].rCol = -1;
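The loop above can be rendered serially, without the packed local_Y bookkeeping, as the following sketch. The row-start convention (a pair with rCol <= 0 opens row -rCol, and its value multiplies X[0]) is inferred from the code and the slide-8 example; pair_mul is a name introduced here:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int rCol; double val; } dSparS_t;

/* Serial rendering of the slide-12 multiplication: a pair with
   rCol <= 0 opens row -rCol and contributes val * x[0]; a pair with
   rCol > 0 adds val * x[rCol] to the current row.  The output row is
   addressed directly via a running index, not through an index array. */
static void pair_mul(const dSparS_t *A, size_t n_pairs,
                     const double *x, double *y)
{
    int row = 0;
    for (size_t k = 0; k < n_pairs; k++) {
        if (A[k].rCol <= 0) {              /* row start */
            row = -A[k].rCol;
            y[row] = A[k].val * x[0];
        } else {                           /* in-row nonzero */
            y[row] += A[k].val * x[A[k].rCol];
        }
    }
}
```

Compared with CSR, the row-start array and its indirect lookups on the output side are gone; only the access into x remains indexed.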

13 Multiplication
[Figure: local_A (local_size pairs) * X = local_Y (stride pairs); the domain in local_A maps onto the range in local_Y.]

14 Algorithm
[Figure: step-by-step trace showing, for each local_A pair, the X entry read and the Y.val / Y.rCol update: a row-start pair {r, v} sets Y.val = v * X[0]; an in-row pair {c, v} adds v * X[c]; reaching the next row start r stores the marker -r-1 in Y.rCol.]

15 Gather
[Figure: the local_Y blocks (stride elements each) from processes 0 .. p-1 are gathered into gatherBuffer; split elements at block boundaries and the residual range are handled afterwards.]

16 Consolidation of Split Rows
[Figure: partial sums of rows split across adjacent blocks in gatherBuffer are added together (+=) while copying into Y (nCols entries); the residual elements are folded in as well.]
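The consolidation step can be sketched serially: when a row is split across two adjacent blocks, both blocks hold a partial sum for it, and the two partials must be added while copying into Y. The flagging scheme below (a split[] array marking entries that continue the previous one) is an illustration, not the slides' actual mechanism:

```c
#include <assert.h>

/* Fold split-row partials while compacting into out: parts[i] is a
   gathered partial sum, and split[i] != 0 marks an entry that
   continues the row of the previous entry, so it is added into it
   instead of starting a new output row.  Returns the number of
   consolidated rows.  Illustrative sketch of slide 16. */
static int consolidate(const double *parts, const int *split,
                       int n, double *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0 && split[i])
            out[m - 1] += parts[i];   /* fold partial into previous row */
        else
            out[m++] = parts[i];      /* start a new row */
    }
    return m;
}
```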

17 Results (vavasis3)

vavasis3.rua - Total non-zero values: 1,683,902 - p = 10

         Broadcast   Scatter     Gather      Computation
P0       0.103930    2.380285    0.096051    0.012123
P1       0.107588    0.457140    0.012000    0.011504
P2       0.107667    0.706087    0.012022    0.011642
P3       0.103155    0.951814    0.011971    0.011560
P4       0.107644    1.206376    0.012210    0.011536
P5       0.109243    1.452563    0.012032    0.011506
P6       0.108477    1.702571    0.012044    0.011506
P7       0.109446    1.948481    0.012004    0.011658
P8       0.055822    2.208924    0.012079    0.011540
P9       0.059023    2.459900    0.012009    0.011438

18 Results (vavasis3)

vavasis3.rua - Total non-zero values: 1,683,902 - p = 8

         Broadcast   Scatter     Gather      Computation
P0       0.089478    2.264316    0.121741    0.014860
P1       0.093083    0.569091    1.711789    0.014105
P2       0.093217    0.866460    1.429352    0.014227
P3       0.091012    1.160591    1.146954    0.014457
P4       0.081719    1.462335    0.865520    0.014365
P5       0.085375    1.756941    0.582353    0.014341
P6       0.085418    2.055651    0.299847    0.014362
P7       0.089087    2.350998    0.017813    0.014728

vavasis3.rua - Total non-zero values: 1,683,902 - p = 1

         Broadcast   Scatter     Gather      Computation
P0       0.000002    1.412774    0.033015    0.112132

19 Results (vavasis3)

vavasis3.rua - Total non-zero values: 1,683,902 - p = 4

         Broadcast   Scatter     Gather      Computation
P0       0.051980    3.026846    0.217574    0.028587
P1       0.055605    1.725272    1.027928    0.028258
P2       0.055703    2.319343    0.451021    0.028141
P3       0.056422    3.212518    0.018073    0.027988

vavasis3.rua - Total non-zero values: 1,683,902 - p = 2

         Broadcast   Scatter     Gather      Computation
P0       0.233200    5.810814    0.426097    0.056334
P1       0.236864    6.521328    0.032125    0.055866

20 Results (vavasis3)

vavasis3.rua - Calculated Results

P    Computation   Speedup    E_p        Gather     C_p
1    0.112132      -          -          0.033015   1.294430
2    0.056334      1.990485   0.995243   0.426097   8.563763
4    0.028587      3.922482   0.980621   1.027928   36.957883
8    0.014860      7.545895   0.943237   1.711789   116.194415
10   0.012123      9.249526   0.924953   0.096051   8.923039
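The Speedup and E_p columns follow the usual definitions S_p = T_1 / T_p and E_p = S_p / p over the computation times, and the C_p column appears to match (Computation + Gather) / Computation. A sketch of those formulas, checked against the p = 2 row (the interpretation of C_p is inferred from the numbers, not stated on the slides):

```c
#include <assert.h>

/* Parallel-performance metrics as they appear to be used in the
   calculated-results tables (C_p interpretation inferred). */
static double speedup(double t1, double tp)            { return t1 / tp; }
static double efficiency(double t1, double tp, int p)  { return t1 / (tp * p); }
static double comm_ratio(double tp, double gather)     { return (tp + gather) / tp; }
```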

21 Results (bayer02)

bayer02.rua - Total non-zero values: 63,679 - p = 10

         Broadcast   Scatter     Gather      Computation
P0       0.046136    0.093143    0.011733    0.000926
P1       0.048824    0.018207    0.001567    0.000423
P2       0.048627    0.027146    0.002054    0.000456
P3       0.044416    0.034386    0.002440    0.000445
P4       0.048214    0.046365    0.002457    0.000397
P5       0.048481    0.053511    0.001978    0.000425
P6       0.045666    0.063204    0.002015    0.000467
P7       0.048173    0.070167    0.002440    0.000419
P8       0.033947    0.088532    0.002323    0.000395
P9       0.032110    0.097866    0.001959    0.000479

22 Results (bayer02)

bayer02.rua - Total non-zero values: 63,679 - p = 8

         Broadcast   Scatter     Gather      Computation
P0       0.040159    0.103422    0.011810    0.001020
P1       0.042743    0.023353    0.001728    0.000549
P2       0.042709    0.035670    0.001777    0.000607
P3       0.039322    0.047141    0.001738    0.000599
P4       0.041584    0.064024    0.001724    0.000702
P5       0.039229    0.075528    0.001725    0.000568
P6       0.037206    0.089757    0.001733    0.000565
P7       0.039912    0.101267    0.002111    0.000541

bayer02.rua - Total non-zero values: 63,679 - p = 1

         Broadcast   Scatter     Gather      Computation
P0       0.000003    0.063824    0.010975    0.006090

23 Results (bayer02)

bayer02.rua - Total non-zero values: 63,679 - p = 4

         Broadcast   Scatter     Gather      Computation
P0       0.049680    0.096930    0.018308    0.001888
P1       0.052379    0.048924    0.003765    0.001555
P2       0.051944    0.076405    0.003609    0.001561
P3       0.046413    0.101871    0.003636    0.001528

bayer02.rua - Total non-zero values: 63,679 - p = 2

         Broadcast   Scatter     Gather      Computation
P0       0.025494    0.520611    0.008192    0.003445
P1       0.028157    0.504081    0.032848    0.003121

24 Results (bayer02)

bayer02.rua - Calculated Results

P    Computation   Speedup    E_p        Gather     C_p
1    0.006090      -          -          0.010975   2.802135
2    0.003445      1.767779   0.883890   0.032848   10.534978
4    0.001888      3.225636   0.806409   0.018308   10.697034
8    0.001020      5.970588   0.746324   0.011810   12.578431
10   0.000926      6.576674   0.657667   0.011733   13.670626

25 Conclusions
The proposed representation speeds up the matrix computation.
The handling of split rows before the gather should be improved.
There appears to be a communication penalty for moving structured data.

26 Bibliography
Eun-Jin Im, "Optimizing the Performance of Sparse Matrix-Vector Multiplication," Ph.D. dissertation.
Yousef Saad, "Iterative Methods for Sparse Linear Systems."
Iain S. Duff, "Users' Guide for the Harwell-Boeing Sparse Matrix Collection."
