Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu

Motivation
Running time of an algorithm is the sum of 3 terms:
– # flops * time_per_flop
– # words moved / bandwidth
– # messages * latency
The last two terms are communication, and there are exponentially growing gaps between the costs:
– time_per_flop << 1/network bandwidth << network latency, improving 59%/year vs 26%/year vs 15%/year
– time_per_flop << 1/memory bandwidth << memory latency, improving 59%/year vs 23%/year vs 5.5%/year
Goal: reorganize linear algebra to avoid communication
– Not just hiding communication (speedup ≤ 2x)
– Arbitrary speedups possible
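A back-of-the-envelope sketch of the 3-term model and the growing gaps, in Python. The annual improvement rates come from the slide above; the year-0 hardware numbers and the flop/word/message counts below are made-up illustrative values, not measurements.

```python
# T = flops * time_per_flop + words / bandwidth + messages * latency
flop_rate, bandwidth, latency = 1e9, 1e8, 1e-6   # flops/s, words/s, s (assumed)

def runtime(flops, words, messages, years=0):
    t_flop = 1.0 / (flop_rate * 1.59 ** years)   # time_per_flop improves 59%/year
    t_word = 1.0 / (bandwidth * 1.26 ** years)   # 1/bandwidth improves 26%/year
    t_msg  = latency / 1.15 ** years             # latency improves 15%/year
    return flops * t_flop + words * t_word + messages * t_msg

# The same algorithm, today and a decade from now: the flop term shrinks
# much faster than the two communication terms, which come to dominate.
for years in (0, 10):
    print(years, runtime(flops=1e9, words=1e7, messages=1e3, years=years))
```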

Outline
– Motivation
– Avoiding Communication in Dense Linear Algebra
– Avoiding Communication in Sparse Linear Algebra

Collaborators (so far)
– UC Berkeley: Kathy Yelick, Ming Gu; Mark Hoemmen, Marghoob Mohiyuddin, Kaushik Datta, George Petropoulos, Sam Williams, the BeBOP group; Lenny Oliker, John Shalf
– CU Denver: Julien Langou
– INRIA: Laura Grigori, Hua Xiang
– Much related work; complete references are in the technical reports

Why all our problems are solved for dense linear algebra – in theory
– Thm (D., Dumitriu, Holtz, Kleinberg) (Numer. Math. 2007): Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+η)) ops, for any η > 0. Current record: ω ≈ 2.38.
– Thm (D., Dumitriu, Holtz) (Numer. Math. 2008): Given any stable matmul running in O(n^(ω+η)) ops, it is possible to do backward stable dense linear algebra in O(n^(ω+η)) ops: GEPP, QR; rank-revealing QR (randomized); (generalized) Schur decomposition, SVD (randomized)
– Also reduces communication to O(n^(ω+η))
– But constants?

Summary (1) – Avoiding Communication in Dense Linear Algebra
QR decomposition of an m x n matrix, m >> n:
– Parallel implementation: conventional uses O(n log p) messages; "new" uses O(log p) messages – optimal
– Serial implementation with fast memory of size F: conventional does O(mn/F) moves of data from slow to fast memory (mn/F = how many times larger the matrix is than fast memory); "new" does O(1) moves of data – optimal
– Lots of speedup possible (measured and modeled); price: some redundant computation
Extends to the general rectangular (and square) case:
– Optimal, a la Hong/Kung, with respect to bandwidth and latency, modulo polylog(P) factors, for both parallel and sequential
– Extends to other architectures (e.g. multicore)
– Extends to LU decomposition

Minimizing Communication in Parallel QR
[Figure: TSQR reduction tree – W split into block rows W0..W3; local factors R00, R10, R20, R30 are combined pairwise into R01 and R11, and finally into R02.]
– QR decomposition of an m x n matrix W, m >> n
– TSQR = "Tall Skinny QR"; P processors, block row layout
– Usual parallel algorithm: compute a Householder vector for each column; number of messages ≈ n log P
– Communication-avoiding algorithm: a reduction operation with QR as the operator; number of messages ≈ log P

TSQR in more detail Q is represented implicitly as a product (tree of factors)

Minimizing Communication in TSQR
[Figure: three reduction trees over W = [W0; W1; W2; W3] – Parallel: binary tree (R00, R10, R20, R30 -> R01, R11 -> R02); Sequential: flat tree (R00 -> R01 -> R02 -> R03); Dual core: hybrid tree.]
– Can choose the reduction tree dynamically
– Multicore / multisocket / multirack / multisite / out-of-core: ?
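The parallel reduction tree above can be sketched in a few lines. This is a sequential simulation for illustration, assuming numpy's QR as the local factorization and a simple binary tree; it is not the actual MPI implementation, and it returns only the final R, leaving Q implicit.

```python
import numpy as np

def tsqr(W_blocks):
    """TSQR on a tall-skinny matrix given in block row layout.

    Each element of W_blocks is one processor's block of rows.  Only the
    small n x n R factors would travel up the tree; the reduction is
    simulated sequentially here.  Returns the final R (Q stays implicit
    as the tree of local Q factors, which this sketch discards).
    """
    Rs = [np.linalg.qr(Wi)[1] for Wi in W_blocks]        # local QRs: no communication
    while len(Rs) > 1:                                   # binary reduction tree: log P levels
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((4000, 50))                      # m >> n
R_tree = tsqr(np.array_split(W, 4))                      # 4 "processors"
R_ref = np.linalg.qr(W)[1]
print(np.allclose(np.abs(R_tree), np.abs(R_ref)))        # True: R agrees up to row signs
```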

Performance of TSQR vs Sca/LAPACK
Parallel:
– Pentium III cluster, Dolphin interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
– BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
– Both use Elmroth-Gustavson locally – enabled by TSQR
Sequential:
– Out-of-core on a PowerPC laptop: as little as 2x slowdown vs (predicted) infinite DRAM
See the UC Berkeley EECS tech report (being revised to add optimality results).

QR for General Matrices
– CAQR – Communication-Avoiding QR for general A: use TSQR for panel factorizations, then apply to the rest of the matrix (see the sketch below)
Cost of parallel CAQR vs ScaLAPACK's PDGEQRF, for an n x n matrix on a P^(1/2) x P^(1/2) processor grid with block size b:
– Flops: (4/3)n^3/P + (3/4)n^2 b log P / P^(1/2) vs (4/3)n^3/P
– Bandwidth: (3/4)n^2 log P / P^(1/2) vs the same
– Latency: 2.5 n log P / b vs 1.5 n log P
Close to optimal (modulo log P factors):
– Assume O(n^2/P) memory per processor and an O(n^3) algorithm; choose b = n / (P^(1/2) log^2 P) (near its upper bound)
– Bandwidth lower bound: Ω(n^2/P^(1/2)) – just a log(P) factor smaller
– Latency lower bound: Ω(P^(1/2)) – just a polylog(P) factor smaller
– Extension of Irony/Toledo/Tiskin (2004)
Sequential CAQR: up to O(n^(1/2)) times less bandwidth
Implementation: Julien Langou's summer project
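Below is a toy of the CAQR organization only: panel factorization followed by a trailing-matrix update. It uses numpy's dense QR of each panel in place of TSQR and forms the panel's full orthogonal factor explicitly, so it reproduces the structure and the numerics but not the communication pattern; a real CAQR applies the implicit tree of Householder factors instead.

```python
import numpy as np

def caqr_sketch(A, b):
    """Blocked QR, panel by panel: factor the current panel (TSQR's role in
    CAQR), then apply its orthogonal factor to the trailing matrix.
    Returns the n x n R factor; Q is discarded to keep the sketch short."""
    A = A.astype(float).copy()
    m, n = A.shape
    for j in range(0, n, b):
        jb = min(b, n - j)
        # Panel factorization: in CAQR this is TSQR over one processor column.
        Qp, _ = np.linalg.qr(A[j:, j:j + jb], mode='complete')
        # Update: zero the panel below its diagonal and transform the
        # trailing matrix A[j:, j+jb:] by Qp^T.
        A[j:, j:] = Qp.T @ A[j:, j:]
    return np.triu(A[:n, :])

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 120))
R = caqr_sketch(A, b=32)
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1])))   # True, up to row signs
```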

Modeled Speedups of CAQR vs ScaLAPACK
– Petascale: up to 22.9x
– IBM Power 5: up to 9.7x
– "Grid": up to 11x
(Petascale machine modeled with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.)

TSLU: LU factorization of a tall skinny matrix. First, try the obvious generalization of TSQR (replace the local QR factorizations in the reduction tree with LU).

Growth factor for TSLU-based factorization
– Unstable for large P and large matrices
– When P = # rows, TSLU is equivalent to parallel pivoting
(Plot courtesy of H. Xiang.)

Making TSLU Stable
– At each node in the tree, TSLU selects b pivot rows from the 2b candidates proposed by its 2 child nodes
– At each node, do LU on the 2b original rows selected by the child nodes, not on the U factors from the child nodes
– When TSLU is done, permute the b selected rows to the top of the original matrix, then redo b steps of LU without pivoting (a pivot-selection sketch follows)
CALU – Communication-Avoiding LU for general A:
– Use TSLU for panel factorizations, apply to the rest of the matrix
– Cost: redundant panel factorizations
– Benefit: stable in practice (but not the same pivot choice as GEPP); b times fewer messages overall – faster
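The tournament that selects the panel's pivot rows can be illustrated with plain numpy; GEPP is written out by hand so the example is self-contained. This sketches only the pivot selection described above (leaf candidates, then pairwise playoffs on original rows of A), not the rest of CALU, and the b-column panel shape is an assumption of the toy setup.

```python
import numpy as np

def gepp_pivot_rows(C, b):
    """Indices (into C's rows) of the first b pivot rows GEPP would choose."""
    C = C.astype(float).copy()
    idx = np.arange(C.shape[0])
    for j in range(b):
        p = j + np.argmax(np.abs(C[j:, j]))                 # partial pivoting in column j
        C[[j, p]] = C[[p, j]]; idx[[j, p]] = idx[[p, j]]
        C[j + 1:, j + 1:] -= np.outer(C[j + 1:, j] / C[j, j], C[j, j + 1:])
    return idx[:b]

def tslu_pivots(A, num_procs, b):
    """Tournament pivoting for a tall-skinny panel A in block row layout.

    Leaves: each 'processor' proposes b candidate rows from its own block.
    Tree nodes: GEPP on the 2b stacked *original* rows of A (never on U
    factors) keeps b winners, until one set of b global pivots remains.
    """
    cands = [I[gepp_pivot_rows(A[I], b)]                    # local candidates, no communication
             for I in np.array_split(np.arange(A.shape[0]), num_procs)]
    while len(cands) > 1:                                   # binary tournament tree
        nxt = []
        for i in range(0, len(cands), 2):
            pair = np.concatenate(cands[i:i + 2])           # up to 2b candidate row ids
            nxt.append(pair[gepp_pivot_rows(A[pair], b)])
        cands = nxt
    return cands[0]

rng = np.random.default_rng(2)
A = rng.standard_normal((4096, 8))                          # tall skinny panel, b = 8
print(tslu_pivots(A, num_procs=4, b=8))                     # global indices of the b pivot rows
```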

Growth factor for the better CALU approach
– Like threshold pivoting with a worst-case threshold of 0.33, so |L| <= 3
– Testing shows about the same residual as GEPP

Performance vs ScaLAPACK
TSLU:
– IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
– Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
CALU:
– IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
– Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
Optimality analysis is analogous to QR; see INRIA Tech Report 6523 (2008).

Speedup prediction for a Petascale machine – up to 81x faster
(Petascale machine modeled with P = 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.)

Related Work and Contributions for Dense Linear Algebra
Related work (on QR):
– Pothen & Raghavan (1989)
– Rabani & Toledo (2001)
– Dunha, Becker & Patterson (2002)
– Gunter & van de Geijn (2005)
Our contributions:
– QR: 2D parallel, efficient 1D parallel QR
– Extensions to LU
– Communication lower bounds

Summary (2) – Avoiding Communication in Sparse Linear Algebra
Take k steps of a Krylov subspace method (GMRES, CG, Lanczos, Arnoldi); assume the matrix is "well-partitioned," with modest surface-to-volume ratio.
– Parallel implementation: conventional uses O(k log p) messages; "new" uses O(log p) messages – optimal
– Serial implementation: conventional does O(k) moves of data from slow to fast memory; "new" does O(1) moves of data – optimal
– Can incorporate some preconditioners: hierarchical, semiseparable matrices, …
– Lots of speedup possible (modeled and measured); price: some redundant computation

Locally Dependent Entries for [x, Ax, ..., A^k x], A tridiagonal, 2 processors
[Figure sequence: for k = 1, 2, 3, 4, ..., 8, the entries of x, Ax, ..., A^k x owned by Proc 1 and Proc 2 that depend only on locally owned data, and hence can be computed without communication. At k = 8 this gives 8-fold reuse of A.]

Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
[Figure: the entries near the processor boundary that depend on the other processor's data.]
– One message suffices to get the data needed to compute the remotely dependent entries, not k = 8 messages
– Minimizes the number of messages = latency cost
– Price: redundant work ∝ "surface/volume ratio"

Fewer Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
[Figure: an alternative assignment of the boundary work that reduces the redundant work by half.]

Remotely Dependent Entries for [x, Ax, A^2 x, A^3 x], A irregular, multiple processors

Sequential [x, Ax, ..., A^4 x], with a memory hierarchy
– One read of the matrix from slow memory, not k = 4 reads
– Minimizes the words moved = bandwidth cost
– No redundant work
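The pictures above can be turned into a tiny runnable example. Assuming a constant-coefficient tridiagonal A (the 1D Laplacian stencil) split between two "processors", each side grabs k ghost entries in a single exchange (here just a slice of the global vector) and then computes its rows of Ax, ..., A^k x with no further communication, at the cost of some redundant work in the ghost zone.

```python
import numpy as np

def tridiag_apply(x):
    """y = A x for the tridiagonal 1D Laplacian stencil (2 on the diagonal,
    -1 on the off-diagonals), with zero values assumed outside the vector."""
    y = 2.0 * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

def local_matrix_powers(x, lo, hi, k, n):
    """Rows lo:hi of [Ax, A^2 x, ..., A^k x] from one ghost-zone exchange.

    The slice x[glo:ghi] stands in for the single message that fetches the
    k boundary entries from each neighbor; everything after that is local."""
    glo, ghi = max(lo - k, 0), min(hi + k, n)
    xloc = x[glo:ghi].copy()                       # the one communication step
    out = []
    for _ in range(k):
        xloc = tridiag_apply(xloc)                 # redundant flops near the ghost edges
        out.append(xloc[lo - glo: hi - glo].copy())
    return out

n, k = 16, 4
x = np.arange(n, dtype=float)
left = local_matrix_powers(x, 0, n // 2, k, n)     # "Proc 1"
right = local_matrix_powers(x, n // 2, n, k, n)    # "Proc 2"

ref, y = [], x.copy()                              # global computation for comparison
for _ in range(k):
    y = tridiag_apply(y)
    ref.append(y)
print(all(np.allclose(np.concatenate([a, b]), r) for a, b, r in zip(left, right, ref)))
```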

Performance Results
Measured:
– Sequential/out-of-core speedup up to 3x
Modeled:
– Sequential/multicore speedup up to 2.5x
– Parallel/Petascale speedup up to 6.9x
– Parallel/Grid speedup up to 22x
See bebop.cs.berkeley.edu/#pubs

Optimizing Communication Complexity of Sparse Solvers
Example: GMRES for Ax = b on a "2D mesh"
– x lives on an n-by-n mesh
– Partitioned on a p^(1/2)-by-p^(1/2) processor grid
– A has a "5-point stencil" (Laplacian): (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
– Example: 18-by-18 mesh on a 3-by-3 grid
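To make the operator concrete, here is a minimal dense apply of the 5-point stencil on the n-by-n mesh. The slide only says "linear_combination", so the standard Laplacian weights (4 on the center, -1 on each neighbor) and zero values outside the mesh are assumptions of this sketch.

```python
import numpy as np

def apply_5pt(X):
    """(A x)(i, j) = 4 x(i,j) - x(i-1,j) - x(i+1,j) - x(i,j-1) - x(i,j+1),
    with zero boundary values assumed outside the n-by-n mesh X."""
    Y = 4.0 * X
    Y[1:, :] -= X[:-1, :]
    Y[:-1, :] -= X[1:, :]
    Y[:, 1:] -= X[:, :-1]
    Y[:, :-1] -= X[:, 1:]
    return Y

X = np.random.default_rng(3).standard_normal((18, 18))   # the slide's 18-by-18 mesh
print(apply_5pt(X).shape)                                # (18, 18)
```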

Minimizing Communication of GMRES
What is the cost = (#flops, #words, #messages) of k steps of standard GMRES?
GMRES, ver. 1:
  for i = 1 to k
    w = A * v(i-1)
    MGS(w, v(0), ..., v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H
– Cost(A * v) = k * (9n^2/p, 4n/p^(1/2), 4)
– Cost(MGS) = k^2/2 * (4n^2/p, log p, log p)
– Total cost ~ Cost(A * v) + Cost(MGS)
– Can we reduce the latency?
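For reference, "GMRES, ver. 1" above fits in a few lines of numpy: one matrix-vector product and one modified Gram-Schmidt sweep per step, then a small least-squares solve with H. No communication-avoiding ideas yet; the well-conditioned dense test matrix is an arbitrary choice for this sketch.

```python
import numpy as np

def gmres_ver1(A, b, k):
    """k steps of standard GMRES (Arnoldi with MGS), starting from x0 = 0."""
    n = b.shape[0]
    V = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    beta = np.linalg.norm(b)
    V[:, 0] = b / beta
    for i in range(k):
        w = A @ V[:, i]                     # w = A * v(i): one round of messages per step
        for j in range(i + 1):              # MGS(w, v(0), ..., v(i)): more messages per step
            H[j, i] = V[:, j] @ w
            w -= H[j, i] * V[:, j]
        H[i + 1, i] = np.linalg.norm(w)
        V[:, i + 1] = w / H[i + 1, i]       # update v(i+1), H
    e1 = np.zeros(k + 1); e1[0] = beta      # solve the (k+1) x k LSQ problem with H
    y, *_ = np.linalg.lstsq(H, e1, rcond=None)
    return V[:, :k] @ y

rng = np.random.default_rng(4)
A = np.eye(500) + 0.05 * rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = gmres_ver1(A, b, k=30)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # relative residual shrinks with k
```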

Minimizing Communication of GMRES
Cost(GMRES, ver. 1) = Cost(A*v) + Cost(MGS)
  = (9kn^2/p, 4kn/p^(1/2), 4k) + (2k^2 n^2/p, k^2 log p / 2, k^2 log p / 2)
How much latency cost from A*v can you avoid? Almost all.
GMRES, ver. 2:
  W = [ v, Av, A^2 v, ..., A^k v ]
  [Q, R] = MGS(W)
  Build H from R, solve LSQ problem
– Cost(W) = (~same, ~same, 8)
– Latency cost independent of k – optimal
– Cost(MGS) unchanged
– Can we reduce the latency more?

Minimizing Communication of GMRES
Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
  = (9kn^2/p, 4kn/p^(1/2), 8) + (2k^2 n^2/p, k^2 log p / 2, k^2 log p / 2)
How much latency cost from MGS can you avoid? Almost all.
GMRES, ver. 3:
  W = [ v, Av, A^2 v, ..., A^k v ]
  [Q, R] = TSQR(W)   ... "Tall Skinny QR"
  Build H from R, solve LSQ problem
[Figure: TSQR tree W = [W1; W2; W3; W4] -> R1, R2, R3, R4 -> R12, R34 -> R1234]
– Cost(TSQR) = (~same, ~same, log p)
– Latency cost independent of k – optimal
– Oops – W is built by the power method, so precision is lost!

Minimizing Communication of GMRES
Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
  = (9kn^2/p, 4kn/p^(1/2), 8) + (2k^2 n^2/p, k^2 log p / 2, log p)
– Latency cost independent of k, just log p – optimal
– Oops – W is built by the power method, so precision is lost. What to do? Use a different polynomial basis:
  not the monomial basis W = [v, Av, A^2 v, ...], but instead the
  Newton basis W_N = [v, (A - θ_1 I)v, (A - θ_2 I)(A - θ_1 I)v, ...], or the
  Chebyshev basis W_C = [v, T_1(A)v, T_2(A)v, ...]
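A compact end-to-end check of the "ver. 3" kernel structure: build W with the matrix powers kernel in the Newton basis, orthogonalize it once (numpy's QR standing in for TSQR), and recover the Hessenberg matrix from R and the change-of-basis matrix, i.e. "build H from R" without any MGS. The dense test matrix and the choice of shifts (exact eigenvalues here; in practice one would use Leja-ordered Ritz values) are illustrative assumptions of this sketch.

```python
import numpy as np

def newton_basis(A, v, shifts):
    """W = [v, (A - th_1 I)v, (A - th_2 I)(A - th_1 I)v, ...]."""
    W = np.empty((A.shape[0], len(shifts) + 1))
    W[:, 0] = v
    for j, theta in enumerate(shifts):
        W[:, j + 1] = A @ W[:, j] - theta * W[:, j]
    return W

def change_of_basis(shifts):
    """(k+1) x k matrix B with A @ W[:, :k] = W @ B for the Newton basis."""
    k = len(shifts)
    B = np.zeros((k + 1, k))
    for j, theta in enumerate(shifts):
        B[j, j] = theta
        B[j + 1, j] = 1.0
    return B

rng = np.random.default_rng(5)
n, k = 200, 5
A = rng.standard_normal((n, n)) / np.sqrt(n)
v = rng.standard_normal(n); v /= np.linalg.norm(v)

shifts = np.sort(np.linalg.eigvals(A).real)[:k]   # toy shift choice
W = newton_basis(A, v, shifts)                    # one matrix powers kernel invocation
Q, R = np.linalg.qr(W)                            # stand-in for TSQR(W)
Hbar = R @ change_of_basis(shifts) @ np.linalg.inv(R[:k, :k])   # "build H from R"

# The Arnoldi relation A Q_k = Q_{k+1} Hbar holds, with no MGS ever run:
print(np.linalg.norm(A @ Q[:, :k] - Q @ Hbar))    # ~ machine precision
```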

Related Work and Contributions for Sparse Linear Algebra
Related work:
– s-step GMRES: De Sturler (1991), Bai, Hu & Reichel (1991), Joubert et al. (1992), Erhel (1995)
– s-step CG: Van Rosendale (1983), Chronopoulos & Gear (1989), Toledo (1995)
– Matrix powers kernel: Pfeifer (1963), Hong & Kung (1981), Leiserson, Rao & Toledo (1993), Toledo (1995), Douglas, Hu, Kowarschik, Rüde, Weiss (2000), Strout, Carter & Ferrante (2001)
Our contributions:
– Unified approach to serial and parallel, use of TSQR
– Optimizing the serial case via TSP
– Unified approach to stability
– Incorporating preconditioning

Summary and Conclusions (1/2)
– It is possible to minimize the communication complexity of much of dense and sparse linear algebra: practical speedups, approaching the theoretical lower bounds
– Optimal asymptotic-complexity algorithms for dense linear algebra also lower communication
– Hardware trends mean the time has come to do this
– Lots of prior work (see pubs) – and some new

Summary and Conclusions (2/2)
Many open problems:
– Automatic tuning – build and optimize complicated data structures, communication patterns, and code automatically: bebop.cs.berkeley.edu
– Extend the optimality proofs to general architectures
– Dense eigenvalue problems – SBR or spectral D&C?
– Sparse direct solvers – CALU or SuperLU?
– Which preconditioners work?
– Why stop at linear algebra?