Minimizing Communication in Linear Algebra. James Demmel, 15 June 2010. www.cs.berkeley.edu/~demmel

Presentation transcript:

1 Minimizing Communication in Linear Algebra. James Demmel, 15 June 2010. www.cs.berkeley.edu/~demmel

2 Outline
What is “communication” and why is it important to avoid?
“Direct” Linear Algebra
  - Lower bounds on how much data must be moved to solve linear algebra problems like Ax=b, Ax = λx, etc.
  - Algorithms that attain these lower bounds
  - Not in standard libraries like Sca/LAPACK (yet!)
  - Large speed-ups possible
“Iterative” Linear Algebra (Krylov Subspace Methods)
  - Ditto
Extensions, open problems

3 Collaborators
Grey Ballard, UCB EECS; Jack Dongarra, UTK; Ioana Dumitriu, U. Washington; Laura Grigori, INRIA; Ming Gu, UCB Math; Mark Hoemmen, Sandia NL; Olga Holtz, UCB Math & TU Berlin; Julien Langou, U. Colorado Denver; Marghoob Mohiyuddin, UCB EECS; Oded Schwartz, TU Berlin; Hua Xiang, INRIA; Kathy Yelick, UCB EECS & NERSC; BeBOP group at Berkeley
Thanks to Intel, Microsoft, UC Discovery, NSF, DOE, …

4 Motivation (1/2)
Algorithms have two costs:
  1. Arithmetic (FLOPS)
  2. Communication: moving data between
     - levels of a memory hierarchy (sequential case)
     - processors over a network (parallel case)
[Figure: a CPU/cache/DRAM memory hierarchy, and several CPU+DRAM nodes connected by a network]

5 Motivation (2/2)
Running time of an algorithm is the sum of 3 terms:
  - # flops * time_per_flop
  - # words moved / bandwidth      (communication)
  - # messages * latency           (communication)
Time_per_flop << 1/bandwidth << latency, and the gaps are growing exponentially with time.

Annual improvements:
  Time_per_flop: 59%
  Network: bandwidth 26%, latency 15%
  DRAM:    bandwidth 23%, latency 5%

Goal: reorganize linear algebra to avoid communication
  - Between all memory hierarchy levels: L1, L2, DRAM, network, etc.
  - Not just hiding communication (speedup ≤ 2x); arbitrary speedups possible
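As a rough illustration of this cost model (not from the slides), here is a minimal sketch; the machine parameters passed as defaults are hypothetical placeholders, not measurements.

```python
# Minimal sketch of the three-term cost model (hypothetical machine parameters).
def model_time(n_flops, n_words, n_messages,
               time_per_flop=1e-10,   # gamma: seconds per flop (assumed)
               inv_bandwidth=1e-9,    # 1/bandwidth: seconds per word moved (assumed)
               latency=1e-6):         # alpha: seconds per message (assumed)
    return (n_flops * time_per_flop
            + n_words * inv_bandwidth
            + n_messages * latency)

# Example: classical O(n^3) matmul with n = 1000, moving ~3*n^2 words in one message.
n = 1000
print(model_time(2 * n**3, 3 * n**2, 1))
```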

6 Direct linear algebra: Prior Work on Matmul
Assume the classical n^3 algorithm (i.e. not Strassen-like).
Sequential case, with fast memory of size M:
  - Lower bound on #words moved to/from slow memory = Ω(n^3 / M^{1/2})   [Hong, Kung, 81]
  - Attained using “blocked” algorithms
Parallel case on P processors (let NNZ be total memory needed; assume load balanced):
  - Lower bound on #words communicated = Ω(n^3 / (P · NNZ)^{1/2})   [Irony, Tiskin, Toledo, 04]

  NNZ                       | Lower bound on #words | Attained by
  3n^2 (“2D alg”)           | Ω(n^2 / P^{1/2})      | [Cannon, 69]
  3n^2 P^{1/3} (“3D alg”)   | Ω(n^2 / P^{2/3})      | [Johnson, 93]
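A minimal sketch (not from the slides) of the “blocked” sequential matmul that attains the Ω(n^3 / M^{1/2}) bound when the block size b is chosen so that three b-by-b blocks fit in fast memory; the block size below is an assumption for illustration.

```python
import numpy as np

def blocked_matmul(A, B, b=64):
    """Classical O(n^3) matmul computed block by block.

    With b ~ sqrt(M/3), each b-by-b block of A, B, and C fits in fast
    memory, which is how blocked algorithms attain the Omega(n^3 / sqrt(M))
    lower bound on words moved."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```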

7 Lower bound for all “direct” linear algebra
Let M = “fast” memory size per processor. Then
  #words_moved by at least one processor = Ω(#flops / M^{1/2})
  #messages_sent by at least one processor = Ω(#flops / M^{3/2})
Holds for:
  - BLAS, LU, QR, eig, SVD, tensor contractions, …
  - Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
  - Dense and sparse matrices (where #flops << n^3)
  - Sequential and parallel algorithms
  - Some graph-theoretic algorithms (e.g. Floyd-Warshall)
See [BDHS09].
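One short reasoning step behind the second bound, added here for clarity (it is the standard argument, not text from the slides): a single message carries at most M words, since it must fit in fast memory, so the message bound follows from the word bound.

```latex
\#\text{messages} \;\ge\; \frac{\#\text{words\_moved}}{\text{largest message size}}
                  \;\ge\; \frac{\Omega(\#\text{flops}/M^{1/2})}{M}
                  \;=\; \Omega\!\left(\frac{\#\text{flops}}{M^{3/2}}\right)
```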

8 Can we attain these lower bounds?
Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? Mostly not.
If not, are there other algorithms that do? Yes.
Goals for algorithms:
  - Minimize #words_moved = Ω(#flops / M^{1/2})
  - Minimize #messages = Ω(#flops / M^{3/2})
    - Need new data structures: (recursive) blocked
  - Minimize for multiple memory hierarchy levels
    - Cache-oblivious algorithms would be simplest
  - Fewest flops when matrix fits in fastest memory
    - Cache-oblivious algorithms don’t always attain this
Only a few sparse algorithms so far (e.g. Cholesky).

9 Summary of dense sequential algorithms attaining communication lower bounds
  - Algorithms shown minimizing #messages use a (recursive) block layout; not possible with columnwise or rowwise layouts
  - Many references (see reports), only some shown, plus ours
  - In the original slide, cache-oblivious algorithms are underlined, green entries are ours, and “?” is unknown/future work

  Algorithm          | 2 Levels of Memory (#words moved and #messages)                                     | Multiple Levels of Memory (#words moved and #messages)
  BLAS-3             | Usual blocked or recursive algorithms                                               | Usual blocked algorithms (nested), or recursive [Gustavson,97]
  Cholesky           | LAPACK (with b = M^{1/2}); [Gustavson,97]; [BDHS09]                                 | [Gustavson,97]; [Ahmed,Pingali,00]; [BDHS09] (same)
  LU with pivoting   | LAPACK (rarely); [Toledo,97]; [GDX 08] (not partial pivoting)                       | [Toledo,97]; [GDX 08]?
  QR, rank-revealing | LAPACK (rarely); [Elmroth,Gustavson,98]; [DGHL08]; [Frens,Wise,03] (but 3x flops)   | [Elmroth,Gustavson,98]; [DGHL08]?; [Frens,Wise,03]; [DGHL08]?
  Eig, SVD           | Not LAPACK; [BDD10] (randomized, but more flops)                                    | [BDD10]

10 Summary of dense 2D parallel algorithms attaining communication lower bounds
  - Assume n x n matrices on P processors, memory per processor = O(n^2 / P)
  - ScaLAPACK assumes the best block size b is chosen
  - Many references (see reports); green entries in the original slide are ours
  - Recall lower bounds: #words_moved = Ω(n^2 / P^{1/2}) and #messages = Ω(P^{1/2})

  Algorithm       | Reference    | Factor over lower bound, #words_moved | Factor over lower bound, #messages
  Matrix multiply | [Cannon, 69] | 1                                     | 1
  Cholesky        | ScaLAPACK    | log P                                 | log P
  LU              | [GDX08]      | log P                                 | log P
                  | ScaLAPACK    | log P                                 | (N / P^{1/2}) · log P
  QR              | [DGHL08]     | log P                                 | log^3 P
                  | ScaLAPACK    | log P                                 | (N / P^{1/2}) · log P
  Sym Eig, SVD    | [BDD10]      | log P                                 | log^3 P
                  | ScaLAPACK    | log P                                 | N / P^{1/2}
  Nonsym Eig      | [BDD10]      | log P                                 | log^3 P
                  | ScaLAPACK    | P^{1/2} · log P                       | N · log P

11 QR of a Tall, Skinny matrix is the bottleneck; use TSQR instead:
W = [W0; W1; W2; W3]
  = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]
[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]
[R01; R11] = Q02·R02

Minimizing Communication in TSQR
[Figure: three reduction trees on W = [W0; W1; W2; W3]: a parallel binary tree, a sequential flat tree, and a hybrid “dual core” tree]
Can choose the reduction tree dynamically.
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
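A minimal sketch (not the slides’ implementation) of TSQR with a flat, sequential reduction tree on four row blocks; the block count and the use of numpy’s QR in place of the local factorizations are assumptions.

```python
import numpy as np

def tsqr_flat(W, nblocks=4):
    """TSQR with a flat (sequential) reduction tree.

    Factor each row block independently, then repeatedly stack the small
    R factors and re-factor; only the b-by-b R factors travel up the tree.
    Returns R; the Q factor is left implicit."""
    blocks = np.array_split(W, nblocks, axis=0)
    R = np.linalg.qr(blocks[0], mode='r')
    for Wi in blocks[1:]:
        Ri = np.linalg.qr(Wi, mode='r')
        R = np.linalg.qr(np.vstack([R, Ri]), mode='r')
    return R

W = np.random.rand(10000, 50)
R = tsqr_flat(W)
# R matches the R factor of a direct QR up to the signs of its rows.
R_ref = np.linalg.qr(W, mode='r')
assert np.allclose(np.abs(R), np.abs(R_ref), atol=1e-8)
```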

13 TSQR Performance Results
Parallel:
  - Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  - Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  - BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  - Tesla (early results): up to 2.5x speedup (vs LAPACK on Nehalem); better: 10x on Fermi with standard algorithm and better SGEMV
  - Grid: 4x on 4 cities (Dongarra et al)
  - Cloud (early result): up and running using Nexus
Sequential:
  - Out-of-core on a PowerPC laptop: as little as 2x slowdown vs (predicted) infinite DRAM; LAPACK with virtual memory never finished
Credits: Jim Demmel, Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

14 QR Factorization, Intel 16 cores, Tall Skinny Matrices
[Plot: Gflop/s vs N = #rows for several column counts, on a quad-socket / quad-core Intel Xeon. See LAPACK Working Note 222. Source: Jack Dongarra]

Modeled Speedups of CAQR vs ScaLAPACK (on n x n matrices; uses TSQR for panel factorization)
  - Petascale: up to 22.9x
  - IBM Power 5: up to 9.7x
  - “Grid”: up to 11x
Petascale machine modeled with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.

16 Rank Revealing CAQR Decompositions
Randomized URV:
  - [V,R’] = caqr(randn(n,n))
  - [U,R] = caqr(A·V^T)
  - A = U·R·V reveals the rank of A with high probability
  - Applies to A = A1^{±1} · A2^{±1} · A3^{±1} · · ·
  - Used for eigensolver/SVD
QR with column pivoting:
  - “Tournament pivoting” to pick the b best columns from 2 groups of b each
  - Only works on a single matrix
Both cases: #words_moved = O(m·n^2 / sqrt(M)), or #passes_over_data = O(n / sqrt(M))
Other CA decompositions with pivoting:
  - LU (tournament pivoting, not partial pivoting, but stable!)
  - Cholesky with diagonal pivoting
  - LU with complete pivoting
  - LDL^T ?
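A minimal sketch of the randomized URV idea above (assumptions: ordinary Householder QR stands in for CAQR, and the test matrix is illustrative).

```python
import numpy as np

def randomized_urv(A):
    """Randomized URV: A = U R V, rank-revealing with high probability.

    np.linalg.qr stands in for the communication-avoiding CAQR of the
    slides; the structure of the computation is the same."""
    n = A.shape[0]
    V, _ = np.linalg.qr(np.random.randn(n, n))   # [V, R'] = qr(randn(n,n)); keep V
    U, R = np.linalg.qr(A @ V.T)                 # [U, R]  = qr(A·V^T)
    return U, R, V                               # A = U @ R @ V, since V is orthogonal

# A matrix of numerical rank ~5:
A = np.random.rand(100, 100) @ np.random.rand(100, 5) @ np.random.rand(5, 100)
U, R, V = randomized_urv(A)
print(np.abs(np.diag(R))[:8])   # magnitudes drop sharply after the numerical rank
assert np.allclose(U @ R @ V, A, atol=1e-8)
```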

17 Back to LU: Using a similar idea for TSLU as for TSQR: use a reduction tree to do “Tournament Pivoting”
W (n x b) = [W1; W2; W3; W4], with each Wi = Pi·Li·Ui
  - Choose b pivot rows of W1, call them W1’; likewise W2’, W3’, W4’
[W1’; W2’] = P12·L12·U12, choose b pivot rows, call them W12’
[W3’; W4’] = P34·L34·U34, choose b pivot rows, call them W34’
[W12’; W34’] = P1234·L1234·U1234, choose b pivot rows
Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting).
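A minimal sketch of this pivot-selection tree (assumptions: scipy’s partial-pivoting LU stands in for the per-block pivot selection, the tree is binary over 4 blocks, and the names are illustrative).

```python
import numpy as np
from scipy.linalg import lu

def choose_pivot_rows(row_idx, W, b):
    """Select b candidate pivot rows (global indices) from the given rows of W,
    using LU with partial pivoting on that block."""
    P, L, U = lu(W[row_idx, :])                    # W[row_idx] = P @ L @ U
    order = P.T @ np.arange(len(row_idx))          # pivot order as an index map
    return [row_idx[int(i)] for i in order[:b]]

def tournament_pivoting(W, b, nblocks=4):
    """Tournament selection of b pivot rows for the tall skinny panel W."""
    groups = [list(g) for g in np.array_split(np.arange(W.shape[0]), nblocks)]
    cands = [choose_pivot_rows(g, W, b) for g in groups]   # leaves of the tree
    while len(cands) > 1:                                  # pairwise reduction
        merged = []
        for i in range(0, len(cands), 2):
            pair = cands[i] + (cands[i+1] if i + 1 < len(cands) else [])
            merged.append(choose_pivot_rows(pair, W, b))
        cands = merged
    return cands[0]   # move these rows to the top, then do LU without pivoting

W = np.random.rand(1024, 8)
print(tournament_pivoting(W, b=8))
```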

18 Making TSLU Numerically Stable
Details matter:
  - Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on computed rows of U
  - Only tournament pivoting is stable
Thm: the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A.

19 What about algorithms like Strassen?
Results for the sequential case only.
Restatement of the result so far for O(n^3) matmul:
  #words_moved = Ω(n^3 / M^{3/2 - 1})
Lower bound for Strassen’s method:
  #words_moved = Ω(n^ω / M^{ω/2 - 1}), where ω = log2 7
  - Proof very different than before!
  - Attained by the usual recursive (cache-oblivious) implementation
  - Also attained by new algorithms for LU, QR, eig, SVD that use Strassen, but not clear how to extend the lower bound
  - Also attained by fast matmul, LU, QR, eig, SVD for any ω
    - Can be made numerically stable
    - True for any matrix multiplication algorithm that will ever be invented
  - Conjecture: same lower bound as above
What about parallelism?

20 Direct Linear Algebra: summary and future work
Communication lower bounds on #words_moved and #messages:
  - BLAS, LU, Cholesky, QR, eig, SVD, tensor contractions, …
  - Some whole programs (compositions of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
  - Dense and sparse matrices (where #flops << n^3)
  - Sequential and parallel algorithms
  - Some graph-theoretic algorithms (e.g. Floyd-Warshall)
Algorithms that attain these lower bounds:
  - Nearly all dense problems (some open problems, e.g. LDL’)
  - A few sparse problems
  - Speedups in theory and practice
  - Extensions to Strassen-like algorithms
Future work:
  - Lots to implement, autotune
  - Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA)
  - Few algorithms in the sparse case (just Cholesky)
  - 3D algorithms (only for matmul so far), may be important for scaling

21 Avoiding Communication in Iterative Linear Algebra
k steps of a typical iterative solver for sparse Ax=b or Ax=λx:
  - Does k SpMVs with a starting vector
  - Finds the “best” solution among all linear combinations of these k+1 vectors
  - Many such “Krylov Subspace Methods”: Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
Goal: minimize communication in Krylov Subspace Methods
  - Assume the matrix is “well-partitioned,” with modest surface-to-volume ratio
  - Parallel implementation: conventional is O(k log p) messages, because of k calls to SpMV; new is O(log p) messages, which is optimal
  - Serial implementation: conventional is O(k) moves of data from slow to fast memory; new is O(1) moves of data, which is optimal
  - Lots of speedup possible (modeled and measured); price: some redundant computation
Much prior work (see the PhD thesis by Mark Hoemmen and bebop.cs.berkeley.edu):
  - CG: [van Rosendale, 83], [Chronopoulos and Gear, 89]
  - GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94]

Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors. [Figure: the entries of x, Ax, …, A^8 x owned by Proc 1 and Proc 2; the highlighted entries can be computed without communication]

Locally Dependent Entries for [x, Ax, A^2 x], A tridiagonal, 2 processors: can be computed without communication.

Locally Dependent Entries for [x, Ax, …, A^3 x], A tridiagonal, 2 processors: can be computed without communication.

Locally Dependent Entries for [x, Ax, …, A^4 x], A tridiagonal, 2 processors: can be computed without communication.

Locally Dependent Entries for [x, Ax, …, A^8 x], A tridiagonal, 2 processors: can be computed without communication; k=8-fold reuse of A.

Remotely Dependent Entries for [x, Ax, …, A^8 x], A tridiagonal, 2 processors. [Figure: the few entries near the processor boundary that depend on the other processor’s data]
One message fetches the data needed to compute the remotely dependent entries, not k=8 messages: this minimizes the number of messages (latency cost). Price: redundant work proportional to the surface/volume ratio.
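A minimal sketch of this matrix powers kernel for a tridiagonal A on two simulated “processors” (assumptions: the one-time fetch of a halo of width k plays the role of the single message, and the function and variable names are illustrative).

```python
import numpy as np
from scipy.sparse import diags

def matrix_powers_local(A, x, my_rows, k, halo):
    """Compute the owned entries of [Ax, A^2 x, ..., A^k x] from one up-front
    'message': the owned entries of x plus a halo of width k on each side.
    Entries inside the halo are recomputed redundantly instead of communicated."""
    lo = max(my_rows.start - halo, 0)
    hi = min(my_rows.stop + halo, A.shape[0])
    A_loc = A[lo:hi, lo:hi].toarray()      # local block plus ghost rows/cols
    v = x[lo:hi].copy()                    # the single communication step
    out = []
    for _ in range(k):
        v = A_loc @ v                      # each step shrinks the valid halo by 1
        out.append(v[my_rows.start - lo : my_rows.stop - lo].copy())
    return out

n, k = 16, 4
A = diags([np.ones(n-1), 2*np.ones(n), np.ones(n-1)], [-1, 0, 1], format='csr')
x = np.random.rand(n)
left  = matrix_powers_local(A, x, slice(0, n//2), k, halo=k)
right = matrix_powers_local(A, x, slice(n//2, n), k, halo=k)
ref = x.copy()
for j in range(k):
    ref = A @ ref
    assert np.allclose(np.concatenate([left[j], right[j]]), ref)
```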

Remotely Dependent Entries for [x, Ax, A^2 x, A^3 x], A irregular, multiple processors.

Sequential [x, Ax, …, A^4 x], with memory hierarchy: one read of the matrix from slow memory, not k=4. Minimizes words moved (bandwidth cost). No redundant work.

30 Minimizing Communication of GMRES to solve Ax=b
GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax-b||_2
Cost of k steps of standard GMRES vs new GMRES:

Standard GMRES:
  for i = 1 to k
     w = A · v(i-1)
     MGS(w, v(0), …, v(i-1))     (modified Gram-Schmidt)
     update v(i), H
  endfor
  solve LSQ problem with H
  Sequential: #words_moved = O(k·nnz) from SpMV + O(k^2·n) from MGS
  Parallel: #messages = O(k) from SpMV + O(k^2·log p) from MGS

Communication-avoiding GMRES:
  W = [v, Av, A^2 v, …, A^k v]
  [Q,R] = TSQR(W)   … “Tall Skinny QR”
  build H from R, solve LSQ problem
  Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
  Parallel: #messages = O(1) from computing W + O(log p) from TSQR

Oops: W is from the power method, so precision is lost!
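A minimal sketch of the communication-avoiding step (not the slides’ code): build the k-step Krylov basis W in one pass, then orthogonalize it with a single QR in place of k rounds of MGS; plain numpy QR stands in for TSQR, and the test matrix is an assumption.

```python
import numpy as np

def ca_gmres_basis(A, v, k):
    """Build W = [v, Av, ..., A^k v] (monomial basis), then factor W = QR once.

    The single QR (TSQR in the real algorithm) replaces k rounds of modified
    Gram-Schmidt. The monomial basis is numerically fragile for larger k,
    which is exactly the instability the Newton/Chebyshev bases address."""
    W = np.empty((A.shape[0], k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]
    Q, R = np.linalg.qr(W)
    return Q, R

n, k = 1000, 8
A = np.diag(2*np.ones(n)) + np.diag(-np.ones(n-1), 1) + np.diag(-np.ones(n-1), -1)
Q, R = ca_gmres_basis(A, np.random.rand(n), k)
print(np.linalg.cond(R))   # grows rapidly with k: the "precision lost" problem
```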


32 How to make CA-GMRES Stable?
Use a different polynomial basis for the Krylov subspace: not the monomial basis W = [v, Av, A^2 v, …], but instead [v, p_1(A)v, p_2(A)v, …]. Possible choices (a sketch of the Newton basis follows below):
  - Newton Basis: W_N = [v, (A - θ_1 I)v, (A - θ_2 I)(A - θ_1 I)v, …]
    - Shifts θ_i chosen as approximate eigenvalues from Arnoldi
    - Uses the same Krylov subspace, so “free”
  - Chebyshev Basis: W_C = [v, T_1(A)v, T_2(A)v, …]
    - T_i(z) chosen to be small in a region of the complex plane containing the large eigenvalues
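A minimal sketch of the Newton-basis construction (assumptions: the shifts are simply passed in rather than computed from an Arnoldi run, the columns are normalized, and the example matrix and shift values are illustrative).

```python
import numpy as np

def newton_basis(A, v, shifts):
    """Newton-basis Krylov vectors: [v, (A-θ1 I)v, (A-θ2 I)(A-θ1 I)v, ...].

    In CA-GMRES the shifts θ_i would be approximate eigenvalues (Ritz values
    from a short Arnoldi run); here they are supplied directly. Compared with
    the monomial basis, the columns stay far better conditioned."""
    n = A.shape[0]
    W = np.empty((n, len(shifts) + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j, theta in enumerate(shifts):
        w = A @ W[:, j] - theta * W[:, j]
        W[:, j + 1] = w / np.linalg.norm(w)   # scaling also helps conditioning
    return W

n, k = 1000, 8
A = np.diag(2*np.ones(n)) + np.diag(-np.ones(n-1), 1) + np.diag(-np.ones(n-1), -1)
v = np.random.rand(n)
# Hypothetical shifts spread across A's spectrum, which lies in (0, 4):
shifts = np.linspace(0.1, 3.9, k)
print(np.linalg.cond(newton_basis(A, v, shifts)))   # typically much smaller than monomial
```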

33 [Convergence plot:] The “monomial” basis [Ax, …, A^k x] fails to converge; the Newton polynomial basis does converge.

34 Speedups of GMRES on an 8-core Intel Clovertown; requires co-tuning kernels [MHDY09]. [Performance plot]

35 Communication Avoiding Iterative Linear Algebra: Future Work
  - Lots of algorithms to implement and autotune
    - Make widely available via OSKI, Trilinos, PETSc, Python, …
    - Job available…
  - Extensions to other Krylov subspace methods
    - So far just Lanczos/CG, Arnoldi/GMRES, BiCG
    - Need BiCGStab, CGS, QMR, …
  - Add preconditioning to solve MAx = Mb
    - New kernel [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …]
    - Diagonal M easy; block diagonal M harder
    - (AM)^k quickly becomes dense, but with low-rank off-diagonal blocks
    - Extends to hierarchical, semiseparable M
    - Theory exists, no implementations yet

36 For more information
See papers at bebop.cs.berkeley.edu
See slides for a week-long short course: Gene Golub SIAM Summer School in Numerical Linear Algebra

37 Summary
Don’t Communic…
Time to redesign all dense and sparse linear algebra