
1 The 9th International Workshop on Parallel Matrix Algorithms and Applications (PMAA16)
Exploring Vectorization Possibilities on the Intel Xeon Phi for Solving Tridiagonal Systems
I. E. Venetis, A. Nakos, A. Kouris, E. Gallopoulos
HPCLab, Computer Engineering & Informatics Dept., University of Patras, Greece

2 Goal: Solve general tridiagonal systems
Assume irreducibility

3 Applicability
Important kernel in many algorithms and applications
Several algorithms convert a dense matrix to tridiagonal form, then work with the tridiagonal matrix
Interesting case: solve many systems of the form (A − ω·B)x = y for different values of ω
R. B. Sidje and K. Burrage, "QRT: A QR-Based Tridiagonalization Algorithm for Nonsymmetric Matrices," SIMAX, 2005.

4 Parallel tridiagonal solvers
Extensive prior work:
Recursive doubling (RD): Stone, Egecioglu et al., Davidson et al.
Cyclic reduction (CR) and "Parallel CR": Hockney and Golub, Heller, Amodio et al., Arbenz, Hegland, Gander, Goddeke et al., ...
Divide-and-conquer / partitioning: Sameh, Wang, Johnsson, Wright, Lopez et al., Dongarra, Arbenz et al., Polizzi et al., Hwu et al.

5 Parallel tridiagonal solvers
RD and CR are mostly appropriate for matrices that can be factorized with diagonal pivoting, e.g. diagonally dominant or symmetric positive definite matrices
Most parallel tridiagonal solvers are of the RD or CR type and do not deal with pivoting
Notable exception: P. Arbenz and M. Hegland, "The stable parallel solution of narrow banded linear systems," in Proc. 8th SIAM Conf. Parallel Proc., 1997.

6 Tridiagonal solvers using partitioning
Basic assumption: the tridiagonal subsystems must be nonsingular
In general, there is no guarantee that they will be!
Example: A is non-singular, but the diagonal blocks A1,1 and A2,2 are singular
Example from: "A direct tridiagonal solver based on Givens rotations for GPU architectures," I.E. Venetis et al., Parallel Computing, 2015.

7 Typically not handled
ScaLAPACK: pdgbsv(n, nrhs, dl, d, du, ..., info);
IBM Parallel ESSL: pdgtsv(n, nrhs, dl, d, du, ..., info);
If info > 0, the results are unpredictable:
  1 ≤ info ≤ p: singular submatrix (a pivot is 0)
  info > p: singular interaction matrix (a pivot is 0 or too small)

8 Spike factorization
Solves banded systems
Adapted here for tridiagonal systems
A. Sameh & D. Kuck, "On stable parallel linear system solvers," J. ACM, 1978.

9 First implementation for a coprocessor
By the Illinois IMPACT group for NVIDIA GPUs in 2012
The implementation was adopted as routine "gtsv" in the NVIDIA cuSPARSE library

10 g-Spike
Goal: an algorithm that is robust enough to handle singular diagonal blocks while remaining competitive in speed with other parallel tridiagonal solvers
Initially implemented for NVIDIA GPUs using CUDA: "A direct tridiagonal solver based on Givens rotations for GPU architectures," I.E. Venetis et al., Parallel Computing, 2015.
First attempt to port to the Intel Xeon Phi: "A general tridiagonal solver for coprocessors: Adapting g-Spike for the Intel Xeon Phi," I.E. Venetis et al., Int'l Conference on Parallel Computing (ParCo 2015), Edinburgh, UK, 2015.

11 g-Spike robustness
QR factorization via Givens rotations
The last element becomes 0 if a subsystem is singular: easy detection
The element can easily be restored after solving the non-singular subsystem
Add an extra column (the purple column in the original figure) to make the subsystem non-singular
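As a minimal illustration of the underlying operation (a generic C sketch, not the exact g-Spike kernel), a Givens rotation that annihilates an entry and is then applied to two rows can be written as:

```c
#include <math.h>

/* Compute a Givens rotation (c, s) such that applying it to the pair (a, b)
 * produces (r, 0); the hypot-based form avoids overflow. Generic sketch,
 * not the exact kernel used in g-Spike. */
static void givens(double a, double b, double *c, double *s)
{
    if (b == 0.0) {            /* nothing to annihilate */
        *c = 1.0;
        *s = 0.0;
    } else {
        double r = hypot(a, b);
        *c = a / r;
        *s = b / r;
    }
}

/* Apply the rotation to two rows of the matrix (and to the RHS, if stored
 * alongside): row_k's entry in the pivot column becomes 0 up to rounding. */
static void apply_givens(double *row_i, double *row_k, int ncols,
                         double c, double s)
{
    for (int j = 0; j < ncols; ++j) {
        double t1 = row_i[j], t2 = row_k[j];
        row_i[j] =  c * t1 + s * t2;
        row_k[j] = -s * t1 + c * t2;
    }
}
```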

12 1st level parallelization: OpenMP
Assign partitions to hardware threads on the cores
No dependencies among partitions (see the sketch below)
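A minimal sketch of this first level, assuming a hypothetical per-partition routine factor_partition(); because the partitions are independent, a plain parallel loop over them suffices:

```c
#include <omp.h>

/* Hypothetical per-partition kernel: factors and solves one independent
 * tridiagonal block of size "len" starting at global index "first". */
void factor_partition(double *dl, double *d, double *du, double *b,
                      int first, int len);

void solve_all_partitions(double *dl, double *d, double *du, double *b,
                          int n, int num_partitions)
{
    int len = n / num_partitions;   /* assume n divisible for simplicity */

    /* Partitions are independent, so no synchronization is needed inside
     * the loop; each OpenMP thread works on its own contiguous block. */
    #pragma omp parallel for schedule(static)
    for (int p = 0; p < num_partitions; ++p)
        factor_partition(dl, d, du, b, p * len, len);
}
```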

13 2nd level parallelization: Vectorization
Reorganization of elements in g-Spike
Originally introduced for coalesced memory accesses on the GPU
We kept it for cache exploitation
Vectorized with scatter/gather instructions

14 2nd level parallelization: Vectorization
Vectorization of the actual computations
Had not been done in the initial port to the Xeon Phi
Data dependencies exist between consecutive steps (step m−2 feeds step m−1)
This obstacle is present in other solvers as well, not only in g-Spike
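To see the obstacle, consider a generic forward-elimination sweep over one partition (a sketch, not the exact g-Spike recurrence): each step reads the result of the previous one, so the loop carries a dependence and cannot be vectorized along its natural direction.

```c
/* Generic forward sweep: d[i] and b[i] are updated using values produced
 * at step i-1, so iteration i cannot start before iteration i-1 finishes,
 * which defeats straightforward vectorization along i. */
for (int i = 1; i < m; ++i) {
    double w = l[i] / d[i - 1];   /* needs d[i-1] from the previous step */
    d[i] -= w * u[i - 1];
    b[i] -= w * b[i - 1];         /* needs b[i-1] from the previous step */
}
```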

15 New approach to vectorization
The Intel Xeon Phi has 512-bit wide SIMD registers, i.e. 8 double-precision elements fit in a register
Subdivide every partition into 8 sub-partitions
Handle them in a manner similar to the first-level partitions to parallelize the code, but through vectorization this time

16 First attempt
Access corresponding elements of the sub-partitions with a stride and vectorize with gather/scatter instructions
Load 8 corresponding elements of d, l, u (one per sub-partition of the same partition) into a SIMD register, then the next 8 elements, and so on
Performance proved very bad: 3x slower
Abandoned this approach
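A sketch of what this strided access could look like, written here with AVX-512 gather intrinsics for concreteness (the names sub_len and base are assumptions for the illustration, not identifiers from the original code):

```c
#include <immintrin.h>

/* Gather the i-th element of each of the 8 sub-partitions into one
 * 512-bit register. "sub_len" is the length of a sub-partition, so the
 * wanted elements sit sub_len doubles apart in memory. */
static inline __m512d gather_across_subpartitions(const double *base,
                                                  int i, int sub_len)
{
    /* Element offsets of the i-th entry of sub-partitions 0..7. */
    __m256i idx = _mm256_setr_epi32(i + 0 * sub_len, i + 1 * sub_len,
                                    i + 2 * sub_len, i + 3 * sub_len,
                                    i + 4 * sub_len, i + 5 * sub_len,
                                    i + 6 * sub_len, i + 7 * sub_len);
    /* Scale of 8 converts element offsets to byte offsets for doubles. */
    return _mm512_i32gather_pd(idx, base, 8);
}
```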

17 Second attempt
Reorganize the elements within each partition
Same logic as with the global reorganization: bring together the elements on which the same operations are performed
Layout: 1st elements of all sub-partitions of the 1st partition, then their 2nd elements, ..., then their m-th elements; continue with the 2nd, 3rd, etc. partitions
Better results than the previous attempt
But worse than not using reorganization at all: the cost of the reorganization is too high
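A sketch of this interleaved layout under the stated assumptions (8 sub-partitions of length sub_len per partition; names are illustrative): after the copy, the 8 elements needed at each step are contiguous and can be read with a single vector load instead of a gather.

```c
/* Copy one partition of length 8 * sub_len from the natural layout
 * (sub-partition after sub-partition) into an interleaved layout where
 * slot i holds the i-th element of all 8 sub-partitions back to back. */
static void interleave_partition(const double *src, double *dst, int sub_len)
{
    for (int i = 0; i < sub_len; ++i)          /* step within a sub-partition */
        for (int s = 0; s < 8; ++s)            /* sub-partition index */
            dst[i * 8 + s] = src[s * sub_len + i];
}
```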

18 Third attempt: Vectorize over multiple RHS
An obvious way to exploit vectorization
The same operations have to be applied to all right-hand sides
(Figure: the tridiagonal system with entries l, d, u applied to the solution vectors, equated to a block of 8 right-hand-side columns b·,0 ... b·,7, so each row of the solve processes 8 RHS values at once)
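A sketch of this idea with AVX-512 intrinsics, using a Thomas-style forward sweep purely as a stand-in for the per-partition solve (g-Spike itself uses Givens rotations): the 8 right-hand sides are stored contiguously per row, so every scalar coefficient is broadcast and applied to all 8 RHS at once.

```c
#include <immintrin.h>

/* Forward elimination for one partition with 8 RHS per row.
 * b is stored row-major: b[i*8 + j] is the i-th equation of RHS j.
 * The recurrence along i remains sequential, but each step now performs
 * 8 useful updates, one per right-hand side. */
static void forward_sweep_8rhs(const double *l, double *d, const double *u,
                               double *b, int m)
{
    for (int i = 1; i < m; ++i) {
        double w = l[i] / d[i - 1];
        d[i] -= w * u[i - 1];

        __m512d wv    = _mm512_set1_pd(w);             /* broadcast multiplier */
        __m512d bprev = _mm512_loadu_pd(&b[(i - 1) * 8]);
        __m512d bcur  = _mm512_loadu_pd(&b[i * 8]);
        /* b[i] -= w * b[i-1] for all 8 RHS at once */
        bcur = _mm512_fnmadd_pd(wv, bprev, bcur);
        _mm512_storeu_pd(&b[i * 8], bcur);
    }
}
```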

19 Experimental environment
Host: Intel Core i CPU @ 3.40 GHz, 16 GB memory
Coprocessor: Intel Xeon Phi 3120A, 6 GB memory
Size of the system to be solved is 2^20, except if otherwise noted

20 Test matrix collection
Matrix number*   MATLAB instructions to create the matrix
1    tridiag(n,L,D,U), with L,D,U sampled from U(−1,1)
3    gallery('lesp', n)
18   tridiag(-1*ones(n-1,1), 4*ones(n,1), -1*ones(n-1,1))
20   tridiag(-ones(n-1,1), 4*ones(n,1), U), with U sampled from U(−1,1)
* Taken from: "A direct tridiagonal solver based on Givens rotations for GPU architectures," I.E. Venetis et al., Parallel Computing, 2015.

21 Numerical stability comparison
RER2 for each solver:
Matrix | κ2(A)    | g-Spike with vect. (block size 16) | g-Spike without vect. | g-Spike CUDA | MATLAB
1      | 9.32e+03 | 1.59e-14                           | 1.54e-14              | 2.64e-14     | 1.10e-14
3      | 2.84e+03 | 2.05e-16                           | 1.90e-16              | 2.10e-16     | 1.52e-16
18     | 3.00e+00 | 2.20e-16                           | 2.03e-16              | 1.88e-16     | 1.27e-16
20     | 2.46e+00 | 1.81e-16                           | 1.77e-16              | 1.70e-16     | 1.17e-16

22 g-Spike with/without vectorization
Always using the number of threads that gives the highest performance
(Figure: execution time chart; annotated configurations use 224, 128, 64 and 32 threads)

23 Speedup (without vectorization, N=2^20)

24 Speedup (without vectorization, N=2^26)

25 g-Spike vs. ScaLAPACK (Intel MKL)
Always using the number of threads/processes that gives the highest performance
(Figure: execution time chart; g-Spike uses 224 threads, ScaLAPACK is annotated with 8, 16, 32, 64 and 128 processes)

26 g-Spike vs. LAPACK (Intel MKL)
Always using the number of threads that gives the highest performance
(Figure: execution time chart; LAPACK is annotated with 4, 8 and 16 threads)

27 Multiple RHS (N=2^20, 224 threads)

28 Multiple RHS (N=2^20, 224 threads)

29 Multiple RHS (N=2^22, 224 threads)

30 Multiple RHS (N=2^22, 224 threads)

31 Conclusions – Possible extensions
Good numerical stability with low computational cost
Currently the fastest solver for general tridiagonal systems on the Intel Xeon Phi
High data transfer cost for a single application of the algorithm; should be amortized over multiple invocations
Possible extensions:
  Use square-root-free Givens rotations
  Improve the detection of singularity: there are cases where g-Spike will not detect the singularity of a sub-matrix due to finite precision
  Extend to banded systems

32 THANK YOU! QUESTIONS?

