Scalability Issues in Sparse Factorization and Triangular Solution
Sherry Li, Lawrence Berkeley National Laboratory
Sparse Days, CERFACS, June 23-24, 2008

Overview
– Basic algorithms
– Partitioning, processor assignment at different phases
– Scalability issues
– Current & future work

Sparse GE
Scalar algorithm: 3 nested loops
– Can re-arrange the loops to get different variants: left-looking, right-looking, ...

    for i = 1 to n
        column_scale ( A(:,i) )
        for k = i+1 to n s.t. A(i,k) != 0
            for j = i+1 to n s.t. A(j,i) != 0
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

– Typical fill ratio: 10x for 2D problems, 30-50x for 3D problems
– Finding the fill-ins is equivalent to computing the transitive closure of G(A)
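A minimal NumPy sketch of the triple loop above, run densely for clarity: the sparsity tests (s.t. A(i,k) != 0) and pivoting are omitted, and the function name is illustrative rather than SuperLU code.

    import numpy as np

    def right_looking_ge(A):
        """Dense sketch of the scalar right-looking kernel: after the loop,
        the strict lower triangle holds L and the upper triangle holds U."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for i in range(n):
            # column_scale( A(:,i) ): divide the subdiagonal part of column i by the pivot
            A[i+1:, i] /= A[i, i]
            # Schur complement update: A(j,k) -= A(j,i) * A(i,k)
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return A

    A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
    LU = right_looking_ge(A)
    L = np.tril(LU, -1) + np.eye(3)
    U = np.triu(LU)
    assert np.allclose(L @ U, A)

In a sparse code the same update touches only nonzero entries, which is where the fill-in (and the fill ratios quoted above) comes from.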

Major stages
1. Order equations & variables to preserve sparsity
   – NP-hard, so heuristics are used
2. Symbolic factorization
   – Identify supernodes, set up data structures and allocate memory for L & U
3. Numerical factorization – usually dominates total time
   – How to pivot?
4. Triangular solutions – usually less than 5% of total time

SuperLU_MT:
1. Sparsity ordering
2. Factorization: partial pivoting, symbolic fact., numerical fact. (BLAS 2.5)
3. Solve

SuperLU_DIST:
1. Static pivoting
2. Sparsity ordering
3. Symbolic fact.
4. Numerical fact. (BLAS 3)
5. Solve

SuperLU_DIST steps:
– Static numerical pivoting: improve diagonal dominance
  - Currently uses MC64 (HSL, serial)
  - Being parallelized [J. Riedy]: auction algorithm
– Ordering to preserve sparsity
  - Can use parallel graph partitioning: ParMetis, Scotch
– Symbolic factorization: determine the pattern of {L\U}
  - Parallelized [L. Grigori et al.]
– Numerics: parallelized
  - Factorization: usually dominates total time
  - Triangular solutions
  - Iterative refinement: triangular solution + SpMV (sketched below)
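A minimal sketch of one refinement step, assuming a factorization object with a solve() method; SciPy's splu (a wrapper around sequential SuperLU) stands in for the distributed solver, so the names here are illustrative rather than the SuperLU_DIST API.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import splu

    def refine(A, b, x, lu, steps=2):
        """Each refinement step costs one SpMV (residual) plus one pair of
        triangular solutions, exactly the two kernels listed on the slide."""
        for _ in range(steps):
            r = b - A @ x          # SpMV
            dx = lu.solve(r)       # L and U triangular solutions
            x = x + dx
        return x

    A = csr_matrix(np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]]))
    b = np.ones(3)
    lu = splu(A.tocsc())
    x = refine(A, b, lu.solve(b), lu)
    print(np.linalg.norm(b - A @ x))   # residual norm after refinement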

Supernode: dense blocks in {L\U}
Good for high performance:
– Enable use of BLAS 3
– Reduce inefficient indirect addressing (scatter/gather)
– Reduce the time of the graph algorithms by traversing a coarser graph
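A toy illustration of why supernodes enable BLAS 3 (made-up block sizes, not SuperLU's data structures): once a set of columns shares one nonzero structure, their contribution to a destination block is a single dense GEMM instead of many scattered column updates.

    import numpy as np

    L_panel = np.random.rand(6, 3)   # rows of L below the diagonal block, one supernode wide
    U_panel = np.random.rand(3, 4)   # matching rows of U to the right of the diagonal block
    A_block = np.random.rand(6, 4)   # destination block in the trailing submatrix

    # Schur complement update as one BLAS 3 call (GEMM) per block pair
    A_block -= L_panel @ U_panel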

Matrix partitioning at different stages
– Distributed input A (user interface)
  - 1-D block partition (distributed CRS format)
– Parallel symbolic factorization
  - Tied to an ND ordering
  - Distribution using the separator tree, 1-D within separators
– Numeric phases
  - 2-D block cyclic distribution
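A hypothetical sketch of the 1-D block-row split of the distributed CRS input; the function and the use of SciPy's CSR type are illustrative, not the SuperLU_DIST user interface.

    import numpy as np
    from scipy.sparse import random as sparse_random

    def block_row_partition(A_csr, nprocs):
        """Give process p a contiguous block of rows (and the matching slice
        of the CSR arrays); row counts differ by at most one."""
        n = A_csr.shape[0]
        counts = [n // nprocs + (1 if p < n % nprocs else 0) for p in range(nprocs)]
        starts = np.concatenate(([0], np.cumsum(counts)))
        return [A_csr[starts[p]:starts[p + 1], :] for p in range(nprocs)]

    A = sparse_random(10, 10, density=0.3, format="csr")
    local_blocks = block_row_partition(A, 4)
    print([blk.shape for blk in local_blocks])   # [(3, 10), (3, 10), (2, 10), (2, 10)]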

Parallel symbolic factorization
– Tree-based partitioning / assignment
– Use graph partitioning to reorder / partition the matrix
  - ParMetis on the graph of A + A'
– 'Arrow-head', two-level partitioning
  - Separator tree: subtree-to-subprocessor mapping
  - Within separators: 1-D block cyclic distribution
– Disadvantage: works only with an ND ordering and a binary tree
[Figure: binary separator tree with process groups {0,1,2,3} at the root, {0,1} and {2,3} below, and leaves owned by P0-P3, mapped onto the arrow-head matrix blocks]
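A small illustrative sketch of the subtree-to-subprocessor mapping, assuming a binary separator tree with made-up node names (as in the 4-process figure): the root separator is shared by all processes and each level of recursion halves the process group.

    def assign(tree, node, procs, mapping):
        """Recursively map each separator-tree node to a group of processes."""
        mapping[node] = procs
        left, right = tree.get(node, (None, None))
        if left is not None:
            half = max(1, len(procs) // 2)
            assign(tree, left, procs[:half], mapping)
            assign(tree, right, procs[half:] or procs[:half], mapping)
        return mapping

    # Root separator -> two child separators -> four leaf domains
    tree = {"root": ("s01", "s23"), "s01": ("d0", "d1"), "s23": ("d2", "d3")}
    print(assign(tree, "root", [0, 1, 2, 3], {}))
    # {'root': [0, 1, 2, 3], 's01': [0, 1], 'd0': [0], 'd1': [1], 's23': [2, 3], 'd2': [2], 'd3': [3]}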

Memory results of parallel symbolic factorization: maximum per-processor memory
[Table: for the accelerator matrix dds15 (Omega3P) and the fusion matrix matrix181 (M3D-C1), at P = 8 and P = 256: LU fill (millions); symbolic-factorization memory, sequential vs. parallel, with the ratio Seq./Par.; entire-solver memory, old vs. new. Numeric entries were not preserved in the transcript.]

Runtime of parallel symbolic factorization, IBM Power5
[Table: for matrix181 and dds15, at P = 8 and P = 256: symbolic-factorization time, sequential vs. parallel, and entire-solver time, old vs. new. Numeric entries were not preserved in the transcript.]

Numeric phases: 2-D partition by supernodes
– Find supernode boundaries from the columns of L
  - Not to exceed MAXSUPER (~50)
– Apply the same partition to the rows of U
– Diagonal blocks are square, full, of order <= MAXSUPER; off-diagonal blocks are rectangular and not necessarily full
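A minimal sketch of capping the block sizes at MAXSUPER; supernode_sizes is a made-up input, not SuperLU's internal representation.

    MAXSUPER = 50   # upper bound on supernode width, as on the slide

    def block_boundaries(supernode_sizes, maxsuper=MAXSUPER):
        """Split detected supernodes so that no block exceeds maxsuper columns."""
        boundaries, col = [0], 0
        for size in supernode_sizes:
            while size > maxsuper:        # chop oversized supernodes
                col += maxsuper
                boundaries.append(col)
                size -= maxsuper
            col += size
            boundaries.append(col)
        return boundaries

    print(block_boundaries([30, 120, 10]))   # [0, 30, 80, 130, 150, 160]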

Processor assignment in 2-D
– 2-D block cyclic layout on the process mesh
– One step of look-ahead to overlap communication & computation
– Scales to 1000s of processors
– Disadvantage: inflexible
– Communication is restricted to:
  - row communicators
  - column communicators
[Figure: process mesh and its block-cyclic mapping onto the matrix, with the ACTIVE trailing submatrix highlighted]
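A minimal sketch of the 2-D block-cyclic owner map; the 2 x 3 process mesh shape is an assumption for illustration.

    Pr, Pc = 2, 3   # process mesh shape (assumed)

    def owner(I, J):
        """Block (I, J) of the supernodal partition is owned by mesh position
        (I mod Pr, J mod Pc); return the linear rank of that process."""
        return (I % Pr) * Pc + (J % Pc)

    nblocks = 5
    for I in range(nblocks):
        print([owner(I, J) for J in range(nblocks)])

Block row I only ever involves the processes in mesh row I mod Pr, and block column J only those in mesh column J mod Pc, which is why row and column communicators suffice, and also why the layout is hard to change once fixed.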

Block dependency graph – DAG
– Based on the nonzero structure of L+U
– Each diagonal block has edges directed to the blocks below it in the same column (L part) and to the blocks to its right in the same row (U part)
– Each pair of blocks L(r,k) and U(k,c) has edges directed to block (r,c) for the Schur complement update
– Elimination proceeds from sources to sinks
– Over the iteration space for k = 1 : N, the DAGs and submatrices become smaller
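A small sketch that enumerates the edges described above from a made-up boolean block-sparsity pattern; it is meant to make the dependency structure concrete, not to mirror SuperLU_DIST's internal data structures.

    def block_dag_edges(S):
        """S[i][j] is truthy if block (i, j) of L+U is nonzero."""
        N = len(S)
        edges = []
        for k in range(N):
            for r in range(k + 1, N):
                if S[r][k]:
                    edges.append(((k, k), (r, k)))   # diagonal block -> L block below
                if S[k][r]:
                    edges.append(((k, k), (k, r)))   # diagonal block -> U block to the right
            for r in range(k + 1, N):
                for c in range(k + 1, N):
                    if S[r][k] and S[k][c]:
                        edges.append(((r, k), (r, c)))   # L(r,k) -> Schur update target
                        edges.append(((k, c), (r, c)))   # U(k,c) -> Schur update target
        return edges

    S = [[1, 1, 0, 1],
         [1, 1, 1, 0],
         [0, 1, 1, 1],
         [1, 0, 1, 1]]
    print(len(block_dag_edges(S)))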

Triangular solution
– Higher level of dependency
– Lower arithmetic intensity (flops per byte of DRAM access or communication)
[Figure: triangular solution on the 2-D process mesh]
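A dense-block sketch of forward substitution that makes the dependency explicit: block K of the solution needs every earlier block it couples to before it can be computed, and each update does only a little arithmetic per byte moved.

    import numpy as np

    def block_lower_solve(L_blocks, b_blocks):
        """Blocked solve of L y = b with dense blocks (for brevity)."""
        y = []
        for K, bK in enumerate(b_blocks):
            r = bK.copy()
            for J in range(K):                      # depends on all earlier blocks
                r -= L_blocks[K][J] @ y[J]          # small GEMV-like update: low flops per byte
            y.append(np.linalg.solve(L_blocks[K][K], r))
        return np.concatenate(y)

    n, nb = 6, 2
    L = np.tril(np.random.rand(n, n)) + n * np.eye(n)
    b = np.random.rand(n)
    L_blocks = [[L[i:i+nb, j:j+nb] for j in range(0, n, nb)] for i in range(0, n, nb)]
    b_blocks = [b[i:i+nb] for i in range(0, n, nb)]
    assert np.allclose(block_lower_solve(L_blocks, b_blocks), np.linalg.solve(L, b))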

Examples
Sparsity-preserving ordering: MMD applied to the structure of A'+A

Name        Code                       N          |A| / N    Fill-ratio
dds15       Accelerator (Omega3P)      834,...
matrix181   Fusion (M3D-C1)            589,...
stomach     3D finite diff.            213,...
twotone     Nonlinear anal. circuit    120,...
pre2        Circuit in freq. domain    659,...
(N values are truncated and the |A|/N and fill-ratio entries were not preserved in the transcript.)

Load imbalance
LB = avg-flops / max-flops (LB = 1 means perfect balance)
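The metric in one line, with made-up per-process flop counts for a 4-process run.

    flops = [9.0e9, 7.5e9, 8.2e9, 12.1e9]           # flops performed by each process (made up)
    LB = (sum(flops) / len(flops)) / max(flops)
    print(f"LB = {LB:.2f}")                          # 1.0 is perfect balance; here ~0.76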

Communication

Current and future work
– LUsim: simulation-based performance model [P. Cicotti et al.]
  - Micro-benchmarks to calibrate memory access time, BLAS speed, and network speed
  - Memory system simulator for each processor
  - Block dependency graph
– Better partitioning to improve load balance
– Better scheduling to reduce processor idle time