Performance Analysis of Parallel Sparse LU Factorization on CPUs and GPUs. Nano-scale Integrated Circuit and System Lab., Department of Electronic Engineering, Tsinghua University.


Performance Analysis of Parallel Sparse LU Factorization on CPUs and GPUs
Ling Ren
Nano-scale Integrated Circuit and System Lab., EE Department, Tsinghua University

Outline
- Background & Related works
- Algorithm
- GPU implementation
- Experiments
- Performance analysis

Background
- Our motivation: the SPICE circuit simulator
- Bottleneck: the sparse LU solver
- Benchmark matrices from the University of Florida Sparse Matrix Collection

Related works
- Dense LU: very efficient on GPUs [Volkov2008, Tomov2010]
- Sparse LU: [SuperLU1999, Pardiso2002] exploit supernodes (dense blocks); [Christen2007] maps these dense blocks onto the GPU
- But extremely sparse matrices, such as circuit matrices, contain no supernodes
- [KLU2010]: targets circuit matrices, works without supernodes, uses the G/P left-looking algorithm [G/P 1988], but is sequential only

Outline: Background & Related works, Algorithm, GPU implementation, Experiments, Performance analysis

Algorithm – left-looking
- Each column is updated by all the columns on its left
- Each update is a vector multiply-and-add (MAD)
- Reads and writes outnumber the arithmetic operations, so each update is memory-bound
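A sequential sketch of this left-looking update (not the authors' code; the CSC arrays Up/Ui for the pattern of U and Lp/Li/Lx for the strictly lower part of L, the dense work array x, and the two helper routines are hypothetical placeholders):

// For every column k, subtract the contribution of every already-finished column j
// with U(j,k) != 0, traversed in ascending j; each subtraction is one vector MAD
// that reads and writes the dense work array x, so memory traffic dominates.
for (int k = 0; k < n; ++k) {
    scatter_column_of_A(k, x);                    // x = A(:,k)
    for (int p = Up[k]; p < Up[k + 1]; ++p) {     // every j < k with U(j,k) != 0
        int j = Ui[p];
        double ujk = x[j];                        // equals U(j,k): already final at this point
        for (int q = Lp[j]; q < Lp[j + 1]; ++q)   // vector MAD: x -= U(j,k) * L(:,j)
            x[Li[q]] -= ujk * Lx[q];
    }
    finalize_column(k, x);                        // divide by the pivot, gather x into L and U
}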

Algorithm description – EGraph
- The nonzero structure of U determines the dependencies between columns
- (Figure: (a) the upper triangular matrix U with its nonzeros; (b) the corresponding EGraph)

Algorithm analysis – parallelism
- Divide the columns into levels; columns in the same level are independent (a levelization sketch follows below)
- Two execution modes: cluster mode for independent columns, pipeline mode for overlapped factorization of dependent columns
- (Figure: columns 1-4 factorized in an overlapped fashion by thread 1 and thread 2 in pipeline mode)
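A host-side sketch of this levelization (hypothetical CSC arrays Up/Ui holding the pattern of U; the authors' implementation may differ):

#include <algorithm>
#include <vector>

// level[k] = length of the longest EGraph path ending at column k.  An edge j -> k
// exists iff U(j,k) != 0 with j < k, so a single pass in ascending column order is
// already a topological traversal.  Columns with equal level are independent
// (cluster mode); the remaining deep levels are handled in pipeline mode.
std::vector<int> levelize(int n, const int *Up, const int *Ui) {
    std::vector<int> level(n, 0);
    for (int k = 0; k < n; ++k)
        for (int p = Up[k]; p < Up[k + 1]; ++p) {
            int j = Ui[p];                        // column k depends on column j
            if (j < k) level[k] = std::max(level[k], level[j] + 1);
        }
    return level;                                 // bucket columns by level to form the clusters
}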

Algorithm analysis – timing order
- Pipeline parallelism, subject to the timing (dependency) order between columns

Outline: Background & Related works, Algorithm, GPU implementation, Experiments, Performance analysis

Algorithm analysis – timing order
- Pipeline parallelism, subject to the timing (dependency) order between columns

GPU implementation – avoiding deadlock
- Traditionally, some warps are inactive at the beginning (limited resources, etc.) and are activated only when other active warps finish
- But in sparse LU the warps wait on one another, so all warps must be active from the beginning
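Because warps in pipeline mode spin-wait on columns owned by other warps, a warp that never becomes resident would stall the whole kernel. One way to guarantee that every launched warp is active from the start is to cap the grid at the occupancy limit; a sketch using the modern CUDA occupancy API (lu_pipeline_kernel, THREADS_PER_BLOCK, and the device-pointer arguments are hypothetical, and this API postdates the hardware used in the talk):

// Launch no more blocks than can be simultaneously resident on the device, so that
// every warp that spin-waits on another warp's column is guaranteed to be scheduled.
int device = 0, blocksPerSM = 0;
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, lu_pipeline_kernel,
                                              THREADS_PER_BLOCK, 0 /* dynamic smem */);
int grid = blocksPerSM * prop.multiProcessorCount;
lu_pipeline_kernel<<<grid, THREADS_PER_BLOCK>>>(d_Lp, d_Li, d_Lx, d_Up, d_Ui, d_Ux, d_col_done);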

GPU implementation – data locality
- Coalesced memory accesses
- Achieved by sorting the nonzeros (see the sketch below)
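The effect of that sorting can be pictured with a small host-side sketch (hypothetical CSC arrays, not the authors' preprocessing code): once each column's nonzeros are stored contiguously in ascending row order, lane t of a warp reads entry p0 + t and the warp's accesses coalesce.

#include <algorithm>
#include <numeric>
#include <vector>

// Reorder the nonzeros inside each column of a CSC factor so that row indices are
// ascending and stored contiguously; consecutive lanes then read consecutive addresses.
void sort_column_nonzeros(int n, const int *Lp, int *Li, double *Lx) {
    for (int k = 0; k < n; ++k) {
        int p0 = Lp[k], len = Lp[k + 1] - p0;
        std::vector<int> perm(len);
        std::iota(perm.begin(), perm.end(), 0);
        std::sort(perm.begin(), perm.end(),
                  [&](int a, int b) { return Li[p0 + a] < Li[p0 + b]; });
        std::vector<int>    ri(len);
        std::vector<double> rv(len);
        for (int t = 0; t < len; ++t) { ri[t] = Li[p0 + perm[t]]; rv[t] = Lx[p0 + perm[t]]; }
        std::copy(ri.begin(), ri.end(), Li + p0);
        std::copy(rv.begin(), rv.end(), Lx + p0);
    }
}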

GPU implementation – work partitioning
- Two levels of parallelism: between vector MADs and within a vector MAD
- One virtual group processes one column; the number of virtual groups equals the number of concurrently factored columns
- Choosing the virtual group size:
  - Too large -> idle threads and fewer virtual groups
  - Too small -> SIMD threads diverge
- Choice: one virtual group = one warp (a kernel sketch follows below)
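A simplified CUDA sketch of this partitioning, assuming one warp per virtual group (all names are hypothetical, the U(j,k) multipliers are assumed to be already available in Ux, and the real kernel additionally handles the pivot division and pipeline-mode waiting):

// Cluster-mode update: each warp (one "virtual group") owns one column k; its 32 lanes
// share every vector MAD, striding over the nonzeros of the updating column L(:,j).
__global__ void column_update(const int *Up, const int *Ui, const double *Ux,
                              const int *Lp, const int *Li, const double *Lx,
                              const int *cols, int ncols, int n, double *work)
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    int lane = threadIdx.x & 31;
    if (warp >= ncols) return;                    // fewer columns than launched warps
    int k = cols[warp];                           // column assigned to this virtual group
    double *x = work + (size_t)warp * n;          // dense work array, preloaded with A(:,k)
    for (int p = Up[k]; p < Up[k + 1]; ++p) {     // every j with U(j,k) != 0
        int j = Ui[p];
        double ujk = Ux[p];                       // multiplier U(j,k)
        for (int q = Lp[j] + lane; q < Lp[j + 1]; q += 32)
            x[Li[q]] -= ujk * Lx[q];              // coalesced: consecutive lanes, consecutive entries
    }
}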

Outline: Background & Related works, Algorithm, GPU implementation, Experiments, Performance analysis

Experiments – setup
- CPUs: 2x Xeon E5405; 2x Xeon X5680
- GPUs: AMD Radeon 5870; NVIDIA GTX580
- Test matrices: University of Florida Sparse Matrix Collection

Experiments – bandwidth vs. flops
- GTX580 vs. X5680

Experiments – performance
- Group A: fewer than 200M flops
- Group B: more than 200M flops
- Group C: matrices that produce denormal numbers, which are slow on CPUs

Outline: Background & Related works, Algorithm, GPU implementation, Experiments, Performance analysis

Performance analysis – concurrent columns
- Do more concurrent columns give higher performance?
- No: beyond a point the extra columns contribute operations that are not yet executable, because they still wait on unfinished columns

Performance analysis – on CPUs
- Cache size and speed matter
- More cores -> higher performance

Performance analysis – on GPUs
- The cache is too small to matter much at present: only about 10% performance loss with no caching at all
- Dominant factors: peak memory bandwidth, and enough active warps to actually reach that peak

Performance analysis – on GPUs
- Radeon 5870 vs. GTX580

Thank you!
Ling Ren, EE Department, Tsinghua University

References
- [Volkov2008] V. Volkov and J. Demmel, “Benchmarking GPUs to tune dense linear algebra,” SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, 2008.
- [Tomov2010] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid GPU accelerated many-core systems,” Parallel Computing, vol. 36, pp. 232-240, June 2010.
- [Christen2007] M. Christen, O. Schenk, and H. Burkhart, “General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform,” 2007.
- [SuperLU1999] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, “A supernodal approach to sparse partial pivoting,” SIAM J. Matrix Analysis and Applications, vol. 20, no. 3, pp. 720-755, 1999.
- [Pardiso2002] O. Schenk and K. Gartner, “Solving unsymmetric sparse systems of linear equations with PARDISO,” Computational Science - ICCS 2002, vol. 2330, pp. 355-363, 2002.

References (continued)
- [G/P 1988] J. R. Gilbert and T. Peierls, “Sparse partial pivoting in time proportional to arithmetic operations,” SIAM J. Sci. Statist. Comput., vol. 9, pp. 862-874, 1988.
- [KLU2010] T. A. Davis and E. Palamadai Natarajan, “Algorithm 907: KLU, a direct sparse solver for circuit simulation problems,” ACM Trans. Math. Softw., vol. 37, pp. 36:1-36:17, September 2010.
- [Florida] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” to appear in ACM Transactions on Mathematical Software.
- [MC64] I. S. Duff and J. Koster, “The design and use of algorithms for permuting large entries to the diagonal of sparse matrices,” SIAM J. Matrix Anal. and Applics., no. 4, pp. 889-901.
- [AMD] P. R. Amestoy, T. A. Davis, and I. S. Duff, “Algorithm 837: AMD, an approximate minimum degree ordering algorithm,” ACM Trans. Math. Softw., vol. 30, pp. 381-388, September 2004.

Algorithm – terminology
- Elimination Graph (EGraph) definition: there is an edge from j to k iff U(j, k) != 0
- In the following, node = column
- Level definition: the length of the longest path from any source node to the node in question; source nodes have no incoming edges

GPU implementation – workflow (figure)

GPU implementation – preprocessing (performed only once, on the CPU)
- MC64 to ensure numerical stability [MC64]
- Approximate Minimum Degree ordering to reduce fill-ins [AMD]
- Pre-factorization (a numeric factorization with partial pivoting) to determine the symbolic structure of L and U
- Sorting the nonzeros of L and U

GPU implementation – data formats
- Data format for intermediate results: dense arrays vs. CSC (Compressed Sparse Column)
- CSC columns can be put in shared memory, but using too much shared memory reduces the number of active warps
- Dense arrays outperform the CSC format by about 2.5x (a scatter sketch follows below)
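A minimal device-side sketch of why the dense intermediate wins (hypothetical CSC arrays Ap/Ai/Ax for the input matrix): the scatter and every subsequent MAD become direct array accesses, whereas a CSC intermediate would need a search per entry.

// Scatter A(:,k) into the dense intermediate column x (length n, kept in global memory,
// since holding it in shared memory would cut the number of active warps).
__device__ void scatter_column(const int *Ap, const int *Ai, const double *Ax,
                               int k, double *x)
{
    for (int p = Ap[k]; p < Ap[k + 1]; ++p)
        x[Ai[p]] = Ax[p];      // O(1) per entry; a CSC intermediate would need a lookup here
}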
