1 Optimal Algorithm Selection of Parallel Sparse Matrix-Vector Multiplication Is Important
Makoto Kudoh*1, Hisayasu Kuroda*1, Takahiro Katagiri*2, Yasumasa Kanada*1
*1 The University of Tokyo  *2 PRESTO, Japan Science and Technology Corporation

2 Introduction
Sparse matrix-vector multiplication (SpMxV): y = Ax, where A is a sparse matrix and x is a dense vector
A basic computational kernel of scientific computations, e.g. iterative solvers for linear systems and eigenvalue problems
Large-scale SpMxV problems call for parallel sparse matrix-vector multiplication

3 Calculation of Parallel Sparse Matrix-Vector Multiplication
[Figure: example matrix A in compressed sparse row format (rowptr, colind, value) with the row blocks of A, x, and y assigned to PE0-PE3]
Row block distribution; the local part of A is stored in compressed sparse row (CSR) format
Two-phase computation: vector data communication among PE0-PE3, then local computation of y = Ax on each processor (a sketch of the local kernel follows)
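The local phase on each processor is an ordinary CSR matrix-vector product over its row block. Below is a minimal sketch, assuming the rowptr/colind/value arrays shown above and that the required entries of x have already been communicated; the function and argument names are illustrative, not the routine's actual interface.

```c
/* Plain CSR kernel for one processor's row block (illustrative names).
 * x must already contain every element referenced by colind. */
void spmv_csr_local(int nrows, const int *rowptr, const int *colind,
                    const double *value, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += value[j] * x[colind[j]];   /* indirect access to x */
        y[i] = sum;
    }
}
```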

4 Optimization of Parallel SpMxV
Poor performance compared with dense matrix-vector multiplication:
increased memory references to the matrix data caused by indirect access
irregular memory access pattern to the vector x
Many optimization algorithms for SpMxV have been proposed, BUT their effect depends highly on the non-zero structure of the matrix and the machine's architecture
Optimal algorithm selection is therefore important

5 Related Work
Library approach: PSPARSLIB, PETSc, ILIB, etc.
use a fixed optimization algorithm; work on parallel systems
Compiler approach: SPARSITY, sparse compilers, etc.
generate code optimized for the matrix and machine; do not work on parallel systems

6 The Purpose of Our Work
Our program compares:
the performance of the best algorithm for the given matrix and machine
the performance of a fixed algorithm used for all matrices and machines
Our program:
includes several algorithms for local computation and data communication
measures the performance of each algorithm exhaustively
selects the best algorithm for the matrix and machine
Algorithm selection time is not a concern here

7 Optimization algorithms of our program
Algorithms implemented in our routine:
Local computation: Register Blocking, Diagonal Blocking, Unrolling
Data communication: Allgather Communication, Range-Limited Communication, Minimum Data Size Communication

8 Register Blocking (Local Computation 1/3)
[Figure: original matrix = blocked matrix + remaining matrix]
Extract small dense blocks to form a blocked matrix; leftover entries form the remaining matrix
Reduces the number of load instructions
Increases temporal locality of accesses to the source vector
Register blocking with block size m x n is abbreviated Rmxn; sizes tried (see the sketch below):
R1x2, R1x3, R1x4, R2x1, R2x2, R2x3, R2x4, R3x1, R3x2, R3x3, R3x4, R4x1, R4x2, R4x3, R4x4
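A minimal sketch of what an R2x2 kernel might look like, assuming the blocked part is stored in a block-CSR layout with four values per 2x2 block and that y has already been initialized (for example by the pass over the remaining matrix); brow_ptr, bcol_ind, and bval are illustrative names, not the paper's data structures.

```c
/* R2x2 register-blocked kernel over the extracted blocked matrix.
 * Each 2x2 block contributes to two rows of y kept in registers,
 * and each loaded pair (x0, x1) of the source vector is reused twice. */
void spmv_r2x2(int nbrows, const int *brow_ptr, const int *bcol_ind,
               const double *bval, const double *x, double *y)
{
    for (int bi = 0; bi < nbrows; bi++) {
        double y0 = 0.0, y1 = 0.0;
        for (int b = brow_ptr[bi]; b < brow_ptr[bi + 1]; b++) {
            const double *v = &bval[4 * b];
            double x0 = x[bcol_ind[b]];
            double x1 = x[bcol_ind[b] + 1];
            y0 += v[0] * x0 + v[1] * x1;
            y1 += v[2] * x0 + v[3] * x1;
        }
        y[2 * bi]     += y0;
        y[2 * bi + 1] += y1;
    }
}
```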

9 Diagonal Blocking (Local Computation 2/3)
+ Original matrix Blocked matrix Remaining matrix For matrices with dense non-zero structure around diagonal part Block diagonal part and treat it as a dense band matrix Reduce the number of load instruction Optimize the access of register and cache Abbreviate size n Diagonal Blocking to Dn D3,D5,D7,D9,D11,D13,D15,D17,D19
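One possible shape of a Dn kernel, assuming the diagonal block is stored as a dense band of half-width h (so n = 2h + 1) with band[i*n + k] holding A[i][i - h + k], zero-padded where the band has no entry; the layout and names are assumptions for illustration, and the remaining matrix would be handled by a separate CSR pass.

```c
/* Dense band kernel for the diagonally blocked part (band width n = 2*h + 1).
 * Access to both the band and x is contiguous, unlike the CSR kernel. */
void spmv_band(int nrows, int h, const double *band,
               const double *x, double *y)
{
    int n = 2 * h + 1;
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        int lo = (i - h < 0) ? h - i : 0;                       /* top edge    */
        int hi = (i + h >= nrows) ? nrows - 1 - i + h : n - 1;  /* bottom edge */
        for (int k = lo; k <= hi; k++)
            sum += band[i * n + k] * x[i - h + k];
        y[i] += sum;
    }
}
```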

10 Unrolling (Local Computation 3/3)
Just unroll the inner loop Reduce the loop overhead Exploit instruction level parallelism Abbreviate unrolling level n to Un U1,U2,U3,U4,U5,U6,U7,U8
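For example, a U4 version of the CSR inner loop might look like the following sketch (illustrative names; the cleanup loop handles rows whose non-zero count is not a multiple of four).

```c
/* CSR kernel with the inner loop unrolled by four (U4). */
void spmv_csr_u4(int nrows, const int *rowptr, const int *colind,
                 const double *value, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        int j = rowptr[i], end = rowptr[i + 1];
        for (; j + 3 < end; j += 4)          /* four products per iteration */
            sum += value[j]     * x[colind[j]]
                 + value[j + 1] * x[colind[j + 1]]
                 + value[j + 2] * x[colind[j + 2]]
                 + value[j + 3] * x[colind[j + 3]];
        for (; j < end; j++)                 /* remainder */
            sum += value[j] * x[colind[j]];
        y[i] = sum;
    }
}
```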

11 Allgather Communication (data communication 1/3)
PE0 PE1 PE2 PE3 Each processor sends all vector data to all other processors Easy to implement (with MPI_Allgather) The communication data size is very large
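A minimal sketch of this scheme, using MPI_Allgatherv rather than MPI_Allgather so that row blocks of unequal size are handled; counts and displs describe the row-block distribution, and all names are illustrative.

```c
#include <mpi.h>

/* Every process contributes its local block of x and receives the whole
 * vector; simple, but the total data moved is the full vector per process. */
void gather_full_vector(double *x_local, int n_local, double *x_full,
                        int *counts, int *displs, MPI_Comm comm)
{
    MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                   x_full, counts, displs, MPI_DOUBLE, comm);
}
```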

12 Range-limited Communication (data communication 2/3)
PE0 vector PE1 Send vector Send only minimum contiguous required block Not communicate between unnecessary processors Small overhead CPU time, since data rearrangement is unnecessary Communication data size is not minimum on most matrices
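A sketch of how such an exchange could be coded, assuming precomputed arrays giving, for each peer, the contiguous slice [lo, hi) of the local block to send and the slice of the full vector to receive; the index arrays and the use of MPI_Sendrecv are assumptions for illustration (the routine itself tries the Send-Recv, Isend-Irecv, and Irecv-Isend orderings described on a later slide).

```c
#include <mpi.h>

/* Exchange one contiguous slice of the vector with every other process.
 * Received data lands directly at its place in x_full, so no pack/unpack
 * step is needed; zero-length slices are legal and simply match up. */
void exchange_ranges(double *x_full, double *x_local, int nprocs, int rank,
                     const int *send_lo, const int *send_hi,
                     const int *recv_lo, const int *recv_hi, MPI_Comm comm)
{
    for (int p = 0; p < nprocs; p++) {
        if (p == rank) continue;
        MPI_Sendrecv(&x_local[send_lo[p]], send_hi[p] - send_lo[p], MPI_DOUBLE,
                     p, 0,
                     &x_full[recv_lo[p]], recv_hi[p] - recv_lo[p], MPI_DOUBLE,
                     p, 0, comm, MPI_STATUS_IGNORE);
    }
}
```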

13 Minimum Data Size Communication (Data Communication 3/3)
[Figure: PE0 packs the required elements into a buffer, sends it, and PE1 unpacks]
Communicate only the required elements
Requires 'pack' and 'unpack' operations before and after communication
The communication data size is minimal
The 'pack' and 'unpack' operations add a small CPU overhead
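A minimal sketch of the pack and unpack steps, assuming precomputed index lists of the exact elements each peer needs; the names are illustrative, and the actual send/receive of the packed buffers would use one of the MPI variants described on the next slide.

```c
/* Gather only the referenced elements of the local block into a send buffer. */
static void pack(const double *x_local, const int *send_idx, int n,
                 double *sendbuf)
{
    for (int k = 0; k < n; k++)
        sendbuf[k] = x_local[send_idx[k]];
}

/* Scatter received elements back to their positions in the full vector. */
static void unpack(const double *recvbuf, const int *recv_idx, int n,
                   double *x_full)
{
    for (int k = 0; k < n; k++)
        x_full[recv_idx[k]] = recvbuf[k];
}
```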

14 Implementation of Communication
Use MPI library 3 implementations for 1 to 1 communication Send-Recv Isend-Irecv Irecv-Isend 3 implementations for range-limited and minimum data size communication Allgather SendRecv-range, IsendIrecv-range, IrecvIsend-range SendRecv-min, IsendIrecv-min, Irecv-Isend-min
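As an illustration of how these variants differ, here is a sketch of the Irecv-Isend ordering, in which all receives are posted before any send so that incoming messages can be matched immediately; the buffer and count arrays are illustrative, and the Send-Recv and Isend-Irecv variants would differ only in the order and blocking of the calls.

```c
#include <stdlib.h>
#include <mpi.h>

/* Irecv-Isend exchange: post every receive first, then all sends,
 * then wait for the whole batch to complete. */
void irecv_isend_exchange(double **recvbuf, const int *rcount,
                          double **sendbuf, const int *scount,
                          int nprocs, int rank, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(2 * nprocs * sizeof *reqs);
    int nreq = 0;
    for (int p = 0; p < nprocs; p++)
        if (p != rank && rcount[p] > 0)
            MPI_Irecv(recvbuf[p], rcount[p], MPI_DOUBLE, p, 0, comm,
                      &reqs[nreq++]);
    for (int p = 0; p < nprocs; p++)
        if (p != rank && scount[p] > 0)
            MPI_Isend(sendbuf[p], scount[p], MPI_DOUBLE, p, 0, comm,
                      &reqs[nreq++]);
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```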

15 Methodology of Selecting the Optimal Algorithm
Select at runtime: the characteristics of the matrix cannot be detected until runtime
The time of local computation and data communication is measured independently, but the combination of the individually fastest parts is not necessarily the fastest overall
Therefore: measure the time of each data communication algorithm and select the best; then combine each local computation algorithm with that best data communication, measure, and select the best combination (a timing sketch follows)
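A minimal sketch of the timing loop behind this selection, under the assumption that each candidate can be invoked through a single run(c) callback; the routine would be called once over the communication candidates and then once over the local kernels combined with the chosen communication. NTRIAL and all names are illustrative.

```c
#include <mpi.h>

#define NTRIAL 5   /* illustrative repeat count */

/* Time each candidate a few times, keep the per-candidate minimum,
 * reduce to the slowest process (all must finish), and return the index
 * of the overall fastest candidate. */
static int select_best(int ncand, void (*run)(int cand), MPI_Comm comm)
{
    int best = 0;
    double best_t = 1e300;
    for (int c = 0; c < ncand; c++) {
        double t = 1e300;
        for (int r = 0; r < NTRIAL; r++) {
            MPI_Barrier(comm);
            double t0 = MPI_Wtime();
            run(c);
            double dt = MPI_Wtime() - t0;
            if (dt < t) t = dt;
        }
        MPI_Allreduce(MPI_IN_PLACE, &t, 1, MPI_DOUBLE, MPI_MAX, comm);
        if (t < best_t) { best_t = t; best = c; }
    }
    return best;
}
```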

16 Numerical Experiment
Default fixed algorithms
Experimental environment and test matrices
Results

17 Default Fixed Algorithms
Local computation: U1 and R2x2; data communication: Allgather and IrecvIsend-min

No.  Local computation  Data communication
1    U1                 Allgather
2    R2x2               Allgather
3    U1                 IrecvIsend-min
4    R2x2               IrecvIsend-min

18 Experimental Environment
Language: C; communication library: MPI (MPICH 1.2.1)

Name                     Processor               # of PEs  Network              Compiler                Compiler Version  Compiler Option
PC-Cluster               PentiumIII 800 MHz      8         100 base-T Ethernet  GCC                     2.95.2            -O3
SUN Enterprise 3500      Ultra Sparc II 336 MHz  -         SMP                  WorkShop Compilers      5.0               -xO5
COMPAQ AlphaServer GS80  Alpha                   -         SMP                  Compaq C                -                 -fast
SGI2100                  MIPS R                  -         DSM                  MIPSpro C               7.30              -64 -O3
HITACHI HA8000-ex880     Intel Itanium 800 MHz   -         SMP                  Intel Itanium Compiler  5.0.1             -

19 Test Matrices
From Tim Davis' matrix collection
[Figure: non-zero patterns of ct20stif, cfd1, and gearbox]

No.  Name      Explanation                                Dimension  Non-zeros
1    3dtube    3-D pressure tube                          45,330     3,213,618
2    cfd1      Symmetric pressure matrix                  70,656     1,828,364
3    crystk03  FEM crystal vibration                      24,696     1,751,178
4    venkat01  Unstructured 2D Euler solver               62,424     1,717,792
5    bcsstk35  Automobile seat frame and body attachment  30,237     1,450,163
6    cfd2                                                 123,440    3,087,898
7    ct20stif  Stiffness matrix                           52,329     2,698,463
8    nasasrb   Shuttle rocket booster                     54,870     2,677,324
9    raefsky3  Fluid structure interaction turbulence     21,200     1,488,768
10   pwtk      Pressurized wind tunnel                    217,918    11,634,424
11   gearbox   Aircraft flap actuator                     153,746    9,080,404

20 Result of Matrix No. 2 (cfd1)
[Figure: Comm-time (msec) and Local-time (msec) of each candidate algorithm on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP; the selected communication algorithms include IrecvIsend-min, IrecvIsend-range, and IsendIrecv-range, and the selected local algorithms vary among register blocking, diagonal blocking, and unrolling variants from machine to machine]

21 Result of Matrix No. 7 (ct20stif)
[Figure: Comm-time (msec) and Local-time (msec) of each candidate algorithm on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP; the selected communication algorithms include SendRecv-min, IsendIrecv-min, and IrecvIsend-min, and the selected local algorithms are mostly register blocking (e.g. R3x3) and diagonal blocking variants]

22 Result of Matrix No. 11 (gearbox)
[Figure: Comm-time (msec) and Local-time (msec) of each candidate algorithm on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP; the selected communication algorithms include IsendIrecv-min and SendRecv-min, and the selected local algorithm is mostly R3x3 with some diagonal blocking variants]

23 Summary of Experiment
Summary of speed-up over the fixed default algorithms:

                      def 1  def 2  def 3  def 4
PC-cluster            8.16   7.90   1.32   1.05
Sun Enterprise 3500   2.82   3.07   1.35   1.58
COMPAQ                3.56   3.10   1.59   1.44
SGI                   3.73   3.33   1.61   1.36
Hitachi               2.51   1.81   2.03   1.39

The best algorithm depends highly on the characteristics of the matrix and the machine
A speed-up of at least 1.05 was obtained compared with the fixed default algorithms

24 Conclusion and Future Work
Conclusion:
compared the performance of the best algorithm with that of typical fixed algorithms
obtained meaningful speed-ups by selecting the best algorithm
selecting the optimal algorithm according to the characteristics of the matrix and machine is important
Future work:
create a low-overhead method of selecting the algorithm; currently, selection takes hundreds of times as long as a single SpMxV

