Exploring Vectorization Possibilities on the Intel Xeon Phi for Solving Tridiagonal Systems
I. E. Venetis, A. Nakos, A. Kouris, E. Gallopoulos
HPCLab, Computer Engineering & Informatics Dept., University of Patras, Greece
The 9th International Workshop on Parallel Matrix Algorithms and Applications (PMAA16)

Goal
Solve general tridiagonal systems.
Assume irreducibility.

Applicability
An important kernel in many algorithms and applications. Several algorithms convert a dense matrix to tridiagonal form and then deal with the tridiagonal matrix.
Interesting case: solve many systems of the form (A − ω·B)x = y for different values of ω.
R. B. Sidje and K. Burrage, "QRT: A QR-Based Tridiagonalization Algorithm for Nonsymmetric Matrices," SIMAX, 2005.

Parallel tridiagonal solvers
Extensive prior work:
Recursive doubling (RD): Stone, Egecioglu et al., Davidson et al.
Cyclic reduction (CR) and "parallel CR": Hockney and Golub, Heller, Amodio et al., Arbenz, Hegland, Gander, Goddeke et al., …
Divide-and-conquer / partitioning: Sameh, Wang, Johnsson, Wright, Lopez et al., Dongarra, Arbenz et al., Polizzi et al., Hwu et al.

Parallel tridiagonal solvers
RD and CR are mostly appropriate for matrices factorizable by diagonal pivoting, e.g. diagonally dominant or symmetric positive definite matrices.
Most parallel tridiagonal solvers are of the RD or CR type and do not deal with pivoting.
Notable exception: P. Arbenz and M. Hegland, "The stable parallel solution of narrow banded linear systems," in Proc. 8th SIAM Conf. Parallel Proc., 1997.

Tridiagonal solvers using partitioning
Basic assumption: the tridiagonal subsystems must be nonsingular. In general, there is no guarantee that they will be!
A can be non-singular while the diagonal blocks A_{1,1} and A_{2,2} are singular.
Example from: "A direct tridiagonal solver based on Givens rotations for GPU architectures", I. E. Venetis et al., Parallel Computing, 2015.
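To make the pitfall concrete, here is a minimal example in the same spirit (constructed for this transcript, not the matrix from the cited paper): a nonsingular tridiagonal matrix whose two diagonal partition blocks are both singular.

```latex
A =
\begin{pmatrix}
1 & 1 &   &   \\
1 & 1 & 1 &   \\
  & 1 & 1 & 1 \\
  &   & 1 & 1
\end{pmatrix},
\qquad
A_{1,1} = A_{2,2} =
\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.
```

Here det A = −1, so A is nonsingular, yet both 2×2 diagonal blocks have determinant 0; a partitioning solver that factors A_{1,1} and A_{2,2} with plain LU breaks down on this system.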

Typically not handled
ScaLAPACK: pdgbsv(n, nrhs, dl, d, du, ..., info)
IBM Parallel ESSL: pdgtsv(n, nrhs, dl, d, du, ..., info)
If info > 0, the results are unpredictable:
1 ≤ info ≤ p: singular submatrix (a pivot is 0)
info > p: singular interaction matrix (a pivot is 0 or too small)

Spike factorization
Solves banded systems; adapted here to tridiagonal systems.
A. Sameh and D. Kuck, "On stable parallel linear system solvers", J. ACM, 1978.
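As a reminder of the structure (notation added for this transcript, not taken from the slides): the partitioned matrix is written as A = D·S, where D collects the diagonal blocks and S carries the "spikes" that couple neighbouring partitions.

```latex
A = D\,S, \qquad
D = \operatorname{diag}(A_1, \dots, A_p), \qquad
S = D^{-1}A =
\begin{pmatrix}
I_1 & V_1 &        &        \\
W_2 & I_2 & V_2    &        \\
    & W_3 & I_3    & \ddots \\
    &     & \ddots & \ddots
\end{pmatrix}
```

with V_j = A_j^{-1} B_j and W_j = A_j^{-1} C_j, where B_j and C_j hold the off-diagonal entries coupling partition j to partitions j+1 and j−1. Solving A x = f then splits into independent solves with the A_j plus a small reduced system built from the tips of the spikes.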

First implementation for a coprocessor
By the Illinois IMPACT group for NVIDIA GPUs in 2012.
The implementation was adopted as the routine "gtsv" in the NVIDIA cuSPARSE library.

g-Spike
Goal: an algorithm robust enough to handle singular diagonal blocks while remaining competitive in speed with other parallel tridiagonal solvers.
Initially implemented for NVIDIA GPUs using CUDA: "A direct tridiagonal solver based on Givens rotations for GPU architectures", I. E. Venetis et al., Parallel Computing, 2015.
First attempt to port to the Intel Xeon Phi: "A general tridiagonal solver for coprocessors: Adapting g-Spike for the Intel Xeon Phi", I. E. Venetis et al., Int'l Conference on Parallel Computing (ParCo 2015), Edinburgh, UK, 2015.

g-Spike robustness
QR factorization via Givens rotations.
The last element becomes 0 if the subsystem is singular: easy detection.
A column (purple in the slide figure) is added to make the subsystem non-singular; the element can easily be restored after solving the non-singular subsystem.
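For reference, a minimal sketch of how a single Givens rotation is constructed (standard textbook form, not code from g-Spike): it zeroes the second component of a 2-vector, which is how the subdiagonal entries are eliminated one by one.

```c
#include <math.h>

/* Compute c, s such that [ c  s; -s  c ] * [a; b] = [r; 0]. */
static void givens(double a, double b, double *c, double *s)
{
    if (b == 0.0) {          /* nothing to eliminate */
        *c = 1.0;
        *s = 0.0;
        return;
    }
    double r = hypot(a, b);  /* sqrt(a*a + b*b), overflow-safe */
    *c = a / r;
    *s = b / r;
}
```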

1st level parallelization: OpenMP
Assign partitions to hardware threads in the cores.
No dependencies among partitions.
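A minimal sketch of this level (the partition kernel name and its arguments are hypothetical, not the authors' code): since partitions are independent, a plain parallel loop over them is enough.

```c
#include <omp.h>

/* Hypothetical per-partition kernel: Givens QR of the local tridiagonal
 * block starting at row `first` with `len` rows. */
void factor_partition(double *dl, double *d, double *du, double *rhs,
                      int first, int len);

void factor_all_partitions(double *dl, double *d, double *du, double *rhs,
                           int n, int num_partitions)
{
    int part_size = n / num_partitions;   /* assume n divisible, for brevity */

    #pragma omp parallel for schedule(static)
    for (int p = 0; p < num_partitions; ++p)
        factor_partition(dl, d, du, rhs, p * part_size, part_size);
}
```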

2nd level parallelization: Vectorization
Reorganization of elements in g-Spike: originally introduced for coalesced memory accesses on the GPU; we kept it for cache exploitation.
The reorganization is vectorized with scatter/gather instructions.
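A small sketch of the kind of index mapping such a reorganization uses (an assumed coalesced-style layout shown for illustration, not necessarily the exact mapping in g-Spike): the j-th elements of all partitions are stored contiguously, so lanes that touch "the same position in every partition" access consecutive memory.

```c
#include <stddef.h>

/* Interleaved layout: element j of partition p lives at j * num_partitions + p. */
static inline size_t interleaved_index(size_t j, size_t p, size_t num_partitions)
{
    return j * num_partitions + p;
}

/* Gather the original (partition-contiguous) layout into the interleaved one. */
static void reorganize(const double *src, double *dst,
                       size_t part_size, size_t num_partitions)
{
    for (size_t p = 0; p < num_partitions; ++p)
        for (size_t j = 0; j < part_size; ++j)
            dst[interleaved_index(j, p, num_partitions)] = src[p * part_size + j];
}
```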

2nd level parallelization: Vectorization
Vectorization of the actual computations had not been done in the initial port to the Xeon Phi.
Data dependencies exist between consecutive elimination steps (the slide figure illustrates steps m−2 and m−1).
This obstacle is present in other solvers as well, not only in g-Spike.
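To see why, consider the elimination recurrence for the diagonal, written here for a plain LU-style sweep for brevity (the Givens sweep in g-Spike has the same serial pattern, since the rotation for row i uses the already-rotated row i−1):

```latex
\tilde d_i \;=\; d_i \;-\; \frac{\ell_i}{\tilde d_{i-1}}\, u_{i-1},
\qquad i = 2,\dots,m.
```

Each \tilde d_i needs the already-updated \tilde d_{i-1}, so the loop over i is inherently sequential within a (sub-)partition, which motivates the sub-partitioning scheme on the next slide.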

New approach to vectorization
The Intel Xeon Phi has 512-bit wide SIMD registers, so 8 double-precision elements fit in a register.
Subdivide every partition into 8 sub-partitions and handle them in the same manner as the first-level partitions, but now in parallel through vectorization.
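A structural sketch of the idea (assumed arrays and offsets, not the authors' code): each SIMD lane works on its own sub-partition, and all eight sub-partitions advance one Givens elimination step per iteration of the outer loop. The memory layout this inner loop sees is exactly what the next few slides experiment with.

```c
#include <math.h>

/* One partition = 8 sub-partitions of length sub_len, eliminated in lockstep.
 * first[v] is the (hypothetical) index of the first row of sub-partition v.
 * dl/d/du are the three diagonals, du2 holds the fill-in second superdiagonal
 * created by the rotations, rhs is a single right-hand side. */
void factor_subpartitions(double *dl, double *d, double *du, double *du2,
                          double *rhs, const int first[8], int sub_len)
{
    for (int step = 1; step < sub_len; ++step) {
        #pragma omp simd
        for (int v = 0; v < 8; ++v) {          /* one SIMD lane per sub-partition */
            int i = first[v] + step;
            /* Irreducibility (dl[i] != 0) guarantees r > 0. */
            double r = sqrt(d[i-1] * d[i-1] + dl[i] * dl[i]);
            double c = d[i-1] / r, s = dl[i] / r;

            /* Rotate rows i-1 and i so that dl[i] becomes 0. */
            double t  = du[i-1];
            d[i-1]    = r;
            du[i-1]   = c * t + s * d[i];
            du2[i-1]  = s * du[i];             /* fill-in above the superdiagonal */
            d[i]      = -s * t + c * d[i];
            du[i]     = c * du[i];
            dl[i]     = 0.0;

            double b  = rhs[i-1];
            rhs[i-1]  = c * b + s * rhs[i];
            rhs[i]    = -s * b + c * rhs[i];
        }
    }
}
```

The eight lanes never touch the same rows, so marking the inner loop with `#pragma omp simd` is legitimate; whether the resulting loads are gathers or unit-stride accesses depends on the data layout discussed next.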

First attempt
Access corresponding elements of the sub-partitions with a stride and vectorize with gather/scatter instructions.
(Slide figure: 8 corresponding elements of the sub-partitions of one partition are loaded into a SIMD register, then the next 8 corresponding elements, and so on.)
Performance proved very bad: about 3x slower. This approach was abandoned.

Second attempt
Reorganize elements within each partition, with the same logic as the global reorganization: bring together the elements on which the same operations are performed; see the sketch below.
(Slide figure: within the 1st partition, first the 1st elements of all sub-partitions, then the 2nd elements, …, then the m-th elements; continue likewise with the 2nd, 3rd, etc. partitions.)
Better results than the previous attempt, but worse than using no reorganization at all: the cost of the reorganization is too high.
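For concreteness, a sketch of the intra-partition layout the slide describes (index arithmetic assumed for this transcript): within a partition, the k-th elements of the 8 sub-partitions sit next to each other, so each elimination step reads and writes one contiguous group of 8 doubles.

```c
#include <stddef.h>

/* Within one partition: element k of sub-partition v (v = 0..7) is stored
 * at offset k * 8 + v from the start of the partition's buffer. */
static inline size_t intra_partition_index(size_t k, size_t v)
{
    return k * 8 + v;
}
```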

Third attempt: Vectorize multiple RHS
An obvious way to exploit vectorization: the same operations have to be applied to all right-hand sides.
(Slide figure: the tridiagonal system with 8 right-hand-side columns b_{i,0}, …, b_{i,7} per row.)
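A minimal sketch of why this vectorizes well (the 8-wide row-major RHS layout is an assumption made here for illustration): the rotation coefficients are computed once per row, and the inner loop over the 8 right-hand sides is unit-stride and dependence-free.

```c
/* Apply a Givens rotation (c, s) of rows i-1 and i to all 8 right-hand
 * sides at once.  b is assumed row-major, 8 doubles per row. */
static void rotate_rhs8(double (*b)[8], int i, double c, double s)
{
    #pragma omp simd
    for (int k = 0; k < 8; ++k) {
        double t  = b[i-1][k];
        b[i-1][k] =  c * t + s * b[i][k];
        b[i][k]   = -s * t + c * b[i][k];
    }
}
```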

Experimental environment
Host: Intel Core i7-3770 CPU, 4 cores / 8 threads @ 3.40 GHz, 16 GB memory.
Coprocessor: Intel Xeon Phi 3120A, 57 cores @ 1.1 GHz, 6 GB memory.
The size of the system to be solved is N = 2^20, except where otherwise noted.

Test matrix collection
Matrix number* | MATLAB instructions to create the matrix
1  | tridiag(n, L, D, U), with L, D, U sampled from U(−1, 1)
3  | gallery('lesp', n)
18 | tridiag(-1*ones(n-1, 1), 4*ones(n, 1), -1*ones(n-1, 1))
20 | tridiag(-ones(n-1, 1), 4*ones(n, 1), U), with U sampled from U(−1, 1)
* Taken from: "A direct tridiagonal solver based on Givens rotations for GPU architectures", I. E. Venetis et al., Parallel Computing, 2015.

Numerical stability comparison (values are RER2)
Matrix number | κ2(A)    | g-Spike with vect. (block size 16) | g-Spike without vect. | g-Spike CUDA | MATLAB
1             | 9.32e+03 | 1.59e-14 | 1.54e-14 | 2.64e-14 | 1.10e-14
3             | 2.84e+03 | 2.05e-16 | 1.90e-16 | 2.10e-16 | 1.52e-16
18            | 3.00e+00 | 2.20e-16 | 2.03e-16 | 1.88e-16 | 1.27e-16
20            | 2.46e+00 | 1.81e-16 | 1.77e-16 | 1.70e-16 | 1.17e-16

g-Spike with/without vectorization
We always use the number of threads that gives the highest performance.
(Slide chart; the configurations shown use 32, 64, 128 and 224 threads.)

Speedup (without vectorization, N = 2^20)
(Slide chart.)

Speedup (without vectorization, N = 2^26)
(Slide chart.)

g-Spike vs. ScaLAPACK (Intel MKL)
We always use the number of threads/processes that gives the highest performance.
(Slide chart: g-Spike with 224 threads; ScaLAPACK with 8, 16, 32, 64 and 128 processes.)

g-Spike vs. LAPACK (Intel MKL)
We always use the number of threads that gives the highest performance.
(Slide chart: LAPACK with 4, 8 and 16 threads.)

Multiple RHS (N = 2^20, 224 threads)
(Slide chart.)

Multiple RHS (N = 2^20, 224 threads)
(Slide chart.)

Multiple RHS (N = 2^22, 224 threads)
(Slide chart.)

Multiple RHS (N = 2^22, 224 threads)
(Slide chart.)

Conclusions and possible extensions
Good numerical stability with low computational cost.
Currently the fastest solver for general tridiagonal systems on the Intel Xeon Phi.
High data transfer cost for a single application of the algorithm; it should be amortized over multiple invocations.
Possible extensions:
Use square-root-free Givens rotations.
Improve the detection of singularity: there are cases where g-Spike will not detect singularity of a sub-matrix due to finite precision.
Extend to banded systems.

THANK YOU! QUESTIONS?