Presentation on theme: "Minimizing Communication in Linear Algebra"— Presentation transcript:

1 Minimizing Communication in Linear Algebra
James Demmel 9 Nov 2009 CS267 Lecture 2

2 Outline
Background – 2 Big Events
Why avoiding communication is important
Communication-optimal direct linear algebra
Autotuning of sparse matrix codes
Communication-optimal iterative linear algebra

3 Collaborators Grey Ballard, UCB EECS Ioana Dumitriu, U. Washington
Laura Grigori, INRIA Ming Gu, UCB Math Mark Hoemmen, UCB EECS Olga Holtz, UCB Math & TU Berlin Julien Langou, U. Colorado Denver Marghoob Mohiyuddin, UCB EECS Oded Schwartz , TU Berlin Hua Xiang, INRIA Sam Williams, LBL Vasily Volkov, UCB EECS Kathy Yelick, UCB EECS & NERSC BeBOP group, bebop.cs.berkeley.edu ParLab, parlab.eecs.berkeley.edu Thanks to Intel, Microsoft, UC Discovery, NSF, DOE, …

4 2 Big Events
“Multicore revolution”, requiring all software (where performance matters!) to change
   ParLab – Industry/State funded center with 15 faculty, ~50 grad students (parlab.eecs.berkeley.edu)
   Rare synergy between commercial and scientific computing
Establishment of a new graduate program in Computational Science and Engineering (CSE)
   117 Faculty from 22 Departments
   60 existing courses, more being developed
   CS267 – Applications of Parallel Computers
   3 Day Parallel “Boot Camp” in August: parlab.eecs.berkeley.edu/2009bootcamp

5 Parallelism Revolution is Happening Now
Chip density continuing to increase ~2x every 2 years
Clock speed is not
Number of processor cores may double instead
There is little or no more hidden parallelism (ILP) to be found
Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

6 The “7 Dwarfs” of High Performance Computing
Phil Colella (LBL) identified 7 kernels out of which most large-scale simulation and data-analysis programs are composed:
Dense Linear Algebra
   Ex: Solve Ax=b or Ax = λx where A is a dense matrix
Sparse Linear Algebra
   Ex: Solve Ax=b or Ax = λx where A is a sparse matrix (mostly zero)
Operations on Structured Grids
   Ex: Anew(i,j) = 4*A(i,j) – A(i-1,j) – A(i+1,j) – A(i,j-1) – A(i,j+1)
Operations on Unstructured Grids
   Ex: Similar, but the list of neighbors varies from entry to entry
Spectral Methods
   Ex: Fast Fourier Transform (FFT)
N-Body (aka Particle Methods)
   Ex: Compute electrostatic forces using the Fast Multipole Method
Monte Carlo (aka MapReduce)
   Ex: Many independent simulations using different inputs

7 Motif/Dwarf: Common Computational Methods (Red Hot → Blue Cool)
Some people might subdivide these categories, some might combine them; if you are trying to stretch a computer architecture, you can use subcategories as well. Graph Traversal = Probabilistic Models, N-Body = Particle Methods, …

8 2 Big Events
“Multicore revolution”, requiring all software (where performance matters!) to change
   ParLab – Industry/State funded center with 15 faculty, ~50 grad students (parlab.eecs.berkeley.edu)
   Rare synergy between commercial and scientific computing
Establishment of a new graduate program in Computational Science and Engineering (CSE) at UCB
   117 Faculty from 22 Departments
   60 existing courses, more being developed
   CS267 – Applications of Parallel Computers
   3 Day Parallel “Boot Camp” in August: parlab.eecs.berkeley.edu/2009bootcamp

9 Why avoiding communication is important (1/2)
Algorithms have two costs:
1. Arithmetic (FLOPS)
2. Communication: moving data between levels of a memory hierarchy (sequential case) or between processors over a network (parallel case)
[Figure: CPU–Cache–DRAM for the sequential case; CPU–DRAM nodes connected by a network for the parallel case]

10 Why avoiding communication is important (2/2)
Running time of an algorithm is the sum of 3 terms:
   #flops * time_per_flop
   #words_moved / bandwidth
   #messages * latency            (the last two terms are communication)
time_per_flop << 1/bandwidth << latency, and the gaps are growing exponentially with time
Annual improvements:
   time_per_flop: 59%
   Network: bandwidth 26%, latency 15%
   DRAM: bandwidth 23%, latency 5%
Goal: reorganize linear algebra to avoid communication
   Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
   Not just hiding communication (speedup ≤ 2x)
   Arbitrary speedups possible
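A minimal sketch of this three-term cost model, using illustrative (not measured) values for time_per_flop, inverse bandwidth, and latency, and the naive vs. blocked matmul traffic counts from the next two slides:

```python
# A minimal sketch of the three-term model; the hardware constants are
# illustrative assumptions, not measurements of any particular machine.
def model_time(n_flops, n_words, n_messages,
               time_per_flop=1e-10,   # seconds per flop (assumed)
               inv_bandwidth=1e-9,    # seconds per word moved (assumed)
               latency=1e-6):         # seconds per message (assumed)
    """Running time = #flops*time_per_flop + #words/bandwidth + #messages*latency."""
    return (n_flops * time_per_flop
            + n_words * inv_bandwidth
            + n_messages * latency)

# Example: sequential matmul traffic for n = 4096, fast memory M = 2**20 words
# (word-at-a-time transfers are assumed for the naive code).
n, M = 4096, 2**20
flops = 2.0 * n**3
b = (M / 3) ** 0.5                       # block size so 3 b*b blocks fit in fast memory
naive = model_time(flops, n_words=float(n)**3, n_messages=float(n)**3)
blocked = model_time(flops, n_words=4.0 * n**3 / b, n_messages=4.0 * (n / b)**3)
print(f"naive: {naive:.1f} s   blocked: {blocked:.1f} s")
```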

11 Naïve Sequential Matrix Multiply: C = C + A*B
for i = 1 to n
   {read row i of A into fast memory}           … n² reads
   for j = 1 to n
      {read C(i,j) into fast memory}            … n² reads
      {read column j of B into fast memory}     … n³ reads
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}        … n² writes
n³ + O(n²) reads/writes altogether
[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

12 Less Communication with Blocked Matrix Multiply
Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size:
   … Break A(n×n), B(n×n), C(n×n) into b×b blocks labeled A(i,j), etc.
   … b chosen so 3 b×b blocks fit in cache
   for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
      C(i,j) = C(i,j) + A(i,k)·B(k,j)     … b×b matmul, 4b² reads/writes
(n/b)³ · 4b² = 4n³/b reads/writes altogether
Minimized when 3b² = cache size = M, yielding O(n³/M^(1/2)) reads/writes
What if we had more levels of memory (L1, L2, cache, etc.)?
   Would need 3 more nested loops per level
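A Python sketch of the blocked algorithm above (assuming square matrices and that b divides n):

```python
import numpy as np

def blocked_matmul(A, B, b):
    """C = C + A*B computed one b x b block operation at a time, so only three
    b x b blocks need to be in fast memory at once (3*b*b <= M).
    Sketch only: assumes square matrices and that b divides n."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # each block operation reads/writes about 4*b*b words
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A, B = np.random.randn(256, 256), np.random.randn(256, 256)
print(np.allclose(blocked_matmul(A, B, b=32), A @ B))
```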

13 Blocked vs Cache-Oblivious Algorithms
Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size:
   … Break A(n×n), B(n×n), C(n×n) into b×b blocks labeled A(i,j), etc.
   … b chosen so 3 b×b blocks fit in cache
   for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
      C(i,j) = C(i,j) + A(i,k)·B(k,j)     … b×b matmul
   … another level of memory would need 3 more loops

Cache-oblivious Matmul C = A·B is independent of cache size:
   Function C = MM(A,B)
      If A and B are 1×1
         C = A · B
      else
         … Break A(n×n), B(n×n), C(n×n) into (n/2)×(n/2) blocks labeled A(i,j), etc.
         for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
            C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )     … n/2 × n/2 matmul
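A Python sketch of the cache-oblivious recursion (assuming n is a power of two; a practical version would stop the recursion at a small base case rather than 1×1):

```python
import numpy as np

def mm_oblivious(A, B, C):
    """Cache-oblivious C += A*B: recursively split all three matrices into
    (n/2) x (n/2) quadrants; no cache-size parameter appears anywhere.
    Sketch only: assumes n is a power of two."""
    n = A.shape[0]
    if n == 1:
        C[0, 0] += A[0, 0] * B[0, 0]
        return
    h = n // 2
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                mm_oblivious(A[i*h:(i+1)*h, k*h:(k+1)*h],
                             B[k*h:(k+1)*h, j*h:(j+1)*h],
                             C[i*h:(i+1)*h, j*h:(j+1)*h])

A, B = np.random.randn(64, 64), np.random.randn(64, 64)
C = np.zeros((64, 64))
mm_oblivious(A, B, C)
print(np.allclose(C, A @ B))
```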

14 Direct linear algebra: Prior Work on Matmul
Assume an n³ algorithm (i.e. not Strassen-like)
Sequential case, with fast memory of size M:
   Lower bound on #words moved to/from slow memory = Ω(n³ / M^(1/2))   [Hong, Kung, 81]
   Attained using blocked or cache-oblivious algorithms
Parallel case on P processors (let NNZ be the total memory needed; assume load balanced):
   Lower bound on #words communicated = Ω(n³ / (P·NNZ)^(1/2))   [Irony, Tiskin, Toledo, 04]
   NNZ (name of alg)          Lower bound on #words     Attained by
   3n² (“2D alg”)             Ω(n² / P^(1/2))           [Cannon, 69]
   3n² P^(1/3) (“3D alg”)     Ω(n² / P^(2/3))           [Johnson, 93]

15 New lower bound for all “direct” linear algebra
Let M = “fast” memory size per processor
#words_moved by at least one processor = Ω(#flops / M^(1/2))
#messages_sent by at least one processor = Ω(#flops / M^(3/2))
Holds for:
   BLAS, LU, QR, eig, SVD, …
   Some whole programs (sequences of these operations, no matter how they are interleaved, e.g. computing A^k)
   Dense and sparse matrices (where #flops << n³)
   Sequential and parallel algorithms
   Some graph-theoretic algorithms (e.g. Floyd-Warshall)
See [BDHS09] – proof sketch if time!

16 Can we attain these lower bounds?
Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? Mostly not.
If not, are there other algorithms that do? Yes.
Goals for algorithms:
   Minimize #words = Ω(#flops / M^(1/2))
   Minimize #messages = Ω(#flops / M^(3/2)) – need new data structures
   Minimize for multiple memory hierarchy levels – cache-oblivious algorithms would be simplest
   Fewest flops when the matrix fits in fastest memory – cache-oblivious algorithms don’t always attain this
Only a few sparse algorithms so far

17 Minimizing #messages requires new data structures
Need to load/store a whole rectangular subblock of the matrix with one “message”
Incompatible with conventional columnwise (rowwise) storage
   Ex: Rows (columns) not in contiguous memory locations
Blocked storage: store as a matrix of b×b blocks, each block stored contiguously
   OK for one level of memory hierarchy, what if more?
Recursive blocked storage: store each block using subblocks
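A numpy sketch of converting row-major storage to blocked storage (the function name and the b-divides-n restriction are illustrative assumptions; recursive blocked storage would apply the same reshaping inside each block):

```python
import numpy as np

def to_blocked(A, b):
    """Copy a row-major n x n matrix into blocked storage: an (n/b, n/b) array
    of b x b blocks, each block contiguous in memory, so a whole block can be
    moved with one 'message'. Sketch only: assumes b divides n."""
    n = A.shape[0]
    blocks = A.reshape(n // b, b, n // b, b).swapaxes(1, 2)
    return np.ascontiguousarray(blocks)   # blocks[i, j] is the contiguous (i,j) block

A = np.arange(36.0).reshape(6, 6)
print(to_blocked(A, 3)[0, 1])             # the top-right 3x3 block of A
```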

18 Summary of dense sequential algorithms attaining communication lower bounds
Algorithms shown minimizing #messages use (recursive) block layout
Many references (see reports), only some shown, plus ours
Cache-oblivious are underlined, green are ours, ? is unknown/future work
Columns: Algorithm | 2 levels of memory (#words moved and #messages) | Multiple levels of memory (#words moved and #messages)
BLAS-3: usual blocked or recursive algorithms | usual blocked algorithms (nested), or recursive [Gustavson,97]
Cholesky: LAPACK (with b = M^(1/2)), [Gustavson 97], [BDHS09] | [Gustavson,97], [Ahmed,Pingali,00] (←same)
LU with pivoting: LAPACK (rarely), [Toledo,97], [GDX 08]; [GDX 08] not partial pivoting | [Toledo, 97], [GDX 08]?
QR: LAPACK (rarely), [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03] but 3x flops | [Elmroth, Gustavson,98], [DGHL08]?, [Frens,Wise,03], [DGHL08]?
Eig, SVD: not LAPACK; [BDD09] randomized, but more flops | [BDD09]
Questions: Was [Gustavson,97] not cache oblivious? [Frigo, Leiserson, Prokop, Ramachandran,99]

19 Summary of dense 2D parallel algorithms attaining communication lower bounds
Assume n×n matrices on P processors, memory per processor = O(n²/P); ScaLAPACK assumes the best block size b is chosen
Many references (see reports), green are ours
Recall lower bounds: #words_moved = Ω(n² / P^(1/2)) and #messages = Ω(P^(1/2))
Columns: Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix multiply | [Cannon, 69] | 1
Cholesky | ScaLAPACK | log P
LU | [GDX08] | (N / P^(1/2)) · log P
QR | [DGHL08] | log³ P
Sym Eig, SVD | [BDD09] | N / P^(1/2)
Nonsym Eig | | P^(1/2) · log P | N · log P

20 QR of a tall, skinny matrix is bottleneck; Use TSQR instead:
Q is represented implicitly as a product (tree of factors) [DGHL08]

21 Minimizing Communication in TSQR
W = [W0; W1; W2; W3], a tall-skinny matrix split into 4 row blocks
[Figure: reduction trees for TSQR]
   Parallel: QR each Wi → R00, R10, R20, R30; combine pairs → R01, R11; combine → R02 (binary tree)
   Sequential: QR W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03 (flat tree)
   Dual Core: a hybrid of the two trees
Do algebra on board: Eq(1) in TSQR paper
Note: Q left as implicit tree
Multicore / Multisocket / Multirack / Multisite / Out-of-core: can choose the reduction tree dynamically
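A small numpy sketch of the TSQR reduction, with a list of row blocks standing in for processors; unlike the real algorithm, which keeps the Q factors implicitly as a tree, this sketch returns only R:

```python
import numpy as np

def tsqr_r(W, n_blocks=4):
    """TSQR reduction sketch: local QR on each row block (the 'processors'),
    then QR of stacked R factors pairwise up a binary tree. Returns only the
    final R; the real algorithm keeps the Q factors implicitly at each node."""
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, n_blocks)]
    while len(Rs) > 1:                                  # pairwise reduction tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.randn(10000, 50)                          # tall and skinny
R_tree, R_ref = tsqr_r(W), np.linalg.qr(W, mode='r')
print(np.allclose(np.abs(R_tree), np.abs(R_ref)))       # R is unique up to row signs
```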

22 Performance of TSQR vs Sca/LAPACK
Parallel:
   Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
   Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
   BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
   Both use Elmroth-Gustavson locally – enabled by TSQR
Sequential: out-of-core (OOC) on a PowerPC laptop
   As little as 2x slowdown vs (predicted) infinite DRAM; LAPACK with virtual memory never finished
See UC Berkeley EECS Tech Report
General CAQR = Communication-Avoiding QR in progress

23 Modeled Speedups of CAQR vs ScaLAPACK
Petascale: up to 22.9x; IBM Power 5: up to 9.7x; “Grid”: up to 11x
Petascale machine modeled with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s

24 Randomized Rank-Revealing QR
Needed for eigenvalue and SVD algorithms
func [Q,R,V] = RRQR(A)
   [V,R1] = qr(randn(n,n))   … V random with Haar measure
   [Q,R]  = qr(A · Vᵀ)       … so A = Q · R · V
Theorem (DDH). Provided that σ(r+1) << σ(r) ≈ σ(1), RRQR produces a rank-revealing decomposition with high probability (when r = O(n), the probability is 1 – O(n^(-c))).
Can apply to A = A1^(±1) · A2^(±1) · … · Ak^(±1)
   Do RRQR, followed by a series of (k-1) QR / RQ to propagate the unitary/orthogonal matrix to the front
Analogous to [Stewart, 94]
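A numpy sketch of the RRQR recipe on this slide (the test matrix with decaying singular values is an illustrative assumption):

```python
import numpy as np

def rrqr(A):
    """Randomized RRQR as on the slide: V is a Haar-distributed random
    orthogonal matrix (Q factor of a Gaussian matrix); ordinary QR of A*V^T
    then gives A = Q*R*V, with the trailing diagonal of R revealing the rank."""
    n = A.shape[1]
    V, _ = np.linalg.qr(np.random.randn(n, n))
    Q, R = np.linalg.qr(A @ V.T)
    return Q, R, V

A = np.random.randn(200, 200) @ np.diag(np.logspace(0, -12, 200))  # decaying singular values
Q, R, V = rrqr(A)
print(np.allclose(Q @ R @ V, A), abs(R[99, 99]) / abs(R[0, 0]))
```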

25 What about o(n3) algorithms, like Strassen?
Thm (DDHK07): Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+η)) ops, for any η > 0
   Current record: ω ≈ 2.38
Thm (DDH08): Given any stable matmul running in O(n^(ω+η)) ops, it is possible to do backward stable dense linear algebra in O(n^(ω+η)) ops
   Also reduces communication to O(n^(ω+η))
   But can do much better (work in progress…)

26 Direct Linear Algebra: summary and future work
Communication lower bounds on #words_moved and #messages:
   BLAS, LU, Cholesky, QR, eig, SVD, …
   Some whole programs (compositions of these operations, no matter how they are interleaved, e.g. computing A^k)
   Dense and sparse matrices (where #flops << n³)
   Sequential and parallel algorithms
   Some graph-theoretic algorithms (e.g. Floyd-Warshall)
Algorithms that attain these lower bounds:
   Nearly all dense problems (some open problems)
   A few sparse problems
   Speedups in theory and practice
Future work:
   Lots to implement, tune
   Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA)
   Few algorithms in the sparse case
   Strassen-like algorithms
   3D algorithms (only for Matmul so far), may be important for scaling

27 How hard is hand-tuning matmul, anyway?
Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09 Students given “blocked” code to start with Still hard to get close to vendor tuned performance (ACML) For more discussion, see

28 How hard is hand-tuning matmul, anyway?

29 What part of the Matmul Search Space Looks Like
[Figure: a 2-D slice of the 3-D register-tile search space; axes are the number of rows and the number of columns in the register block. The dark blue region was pruned.]
(Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)

30 Autotuning Matmul with ATLAS (n = 500)
Source: Jack Dongarra ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor. ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.

31 Automatic Performance Tuning
Goal: let the machine do the hard work of writing fast code
What does tuning of Matmul, dense BLAS, FFTs, signal processing, etc. have in common?
   Can do the tuning off-line: once per architecture and algorithm
   Can take as much time as necessary (hours, a week…)
   At run-time, algorithm choice may depend on only a few parameters (matrix dimensions, size of FFT, etc.)
Can’t always do tuning off-line
   Algorithm and implementation may strongly depend on data only known at run-time
   Ex: Sparse matrix nonzero pattern determines both the best data structure and the best implementation of sparse matrix-vector multiplication (SpMV)
   Part of the search for the best algorithm must be done (very quickly!) at run-time

32 Source: Accelerator Cavity Design Problem (from Ko via Husbands)

33 Linear Programming Matrix
First 20,000 columns of 4k x 1M matrix, rail4284

34 A Sparse Matrix You Use Every Day
You probably use a sparse matrix every day, albeit indirectly. Here, I show a million x million submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase.

35 SpMV with Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)·x(j)
   for each row i
      for k = ptr[i] to ptr[i+1]-1 do
         y[i] = y[i] + val[k]*x[ind[k]]
A one-slide crash course on a basic sparse matrix data structure (compressed sparse row, or CSR, format) and a basic kernel: matrix-vector multiply.
In CSR: store each non-zero value, but also an extra index for each non-zero → less storage than dense, but a higher “rate” of storage per non-zero.
Sparse matrix-vector multiply (SpMV): no reuse of A, only of the vectors x and y, taken to be dense.
SpMV with CSR storage: indirect/irregular accesses to x! So what you’d like to do is:
   Unroll the k loop – but you need to know how many non-zeros per row
   Hoist y[i] – not a problem, absent aliasing
   Eliminate ind[k] – it’s an extra load, but you need to know the non-zero pattern
   Reuse elements of x – but you need to know the non-zero pattern
Only 2 flops per 2 mem_refs: limited by getting data from memory
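A plain-Python sketch of CSR SpMV, mirroring the kernel above (the array names val/ind/ptr follow the slide):

```python
import numpy as np

def spmv_csr(val, ind, ptr, x):
    """y = A*x with A stored in CSR: val holds the nonzeros row by row, ind the
    column index of each nonzero, and ptr[i]:ptr[i+1] delimits row i. Note the
    indirect access x[ind[k]], the irregularity that makes SpMV hard to tune."""
    n = len(ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(ptr[i], ptr[i + 1]):
            y[i] += val[k] * x[ind[k]]
    return y

# Tiny example: the 2x2 matrix [[4, 0], [1, 3]]
val = np.array([4.0, 1.0, 3.0])
ind = np.array([0, 0, 1])
ptr = np.array([0, 1, 3])
print(spmv_csr(val, ind, ptr, np.array([1.0, 2.0])))    # -> [4. 7.]
```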

36 Example: The Difficulty of Tuning
nnz = 1.5M, kernel: SpMV
Source: NASA structural analysis problem (FEM discretization). Matrix “raefsky”, provided by NASA via the University of Florida Sparse Matrix Collection.

37 Example: The Difficulty of Tuning
nnz = 1.5M, kernel: SpMV
Source: NASA structural analysis problem (FEM discretization). Matrix “raefsky”, provided by NASA via the University of Florida Sparse Matrix Collection.
8x8 dense substructure: exploit this to limit #mem_refs

38 Speedups on Itanium 2: The Need for Search
[Figure: SpMV Mflop/s for all register block sizes that divide 8x8; the reference (1x1) and best (4x2) points are marked]
Experiment: try all block sizes that divide 8x8 (16 implementations in all).
Platform: 900 MHz Itanium 2, 3.6 Gflop/s peak speed, Intel v7.0 compiler.
Good speedups (4x), but at an unexpected block size (4x2, not 8x8).
Figure taken from Im, Yelick, Vuduc, IJHPCA paper, to appear.

39 Register Profile: Itanium 2
[Figure: register-blocking performance profile on Itanium 2; best 1190 Mflop/s vs. 190 Mflop/s for the reference (unblocked) code]

40 Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles on four platforms]
Power3 (17% of peak): best 252 Mflop/s vs. 122 Mflop/s reference
Power4 (16% of peak): best 820 Mflop/s vs. 459 Mflop/s reference
Itanium 1 (8% of peak): best 247 Mflop/s vs. 107 Mflop/s reference
Itanium 2: best 1.2 Gflop/s vs. 190 Mflop/s reference

41 Register Profiles: Sun and Intel x86
[Figure: register-blocking profiles on four platforms]
Ultra 2i (11% of peak): best 72 Mflop/s vs. 35 Mflop/s reference
Ultra 3 (5% of peak): best 90 Mflop/s vs. 50 Mflop/s reference
Pentium III (21% of peak): best 108 Mflop/s vs. 42 Mflop/s reference
Pentium III-M (15% of peak): best 122 Mflop/s vs. 58 Mflop/s reference

42 Another example of tuning challenges
More complicated non-zero structure in general.
An enlarged submatrix of “ex11”, from an FEM fluid flow application. Original matrix: n = 16614, nnz = 1.1M. Source: UF Sparse Matrix Collection.

43 Zoom in to top corner More complicated non-zero structure in general
More complicated non-zero structure in general.
An enlarged submatrix of “ex11”, from an FEM fluid flow application. Original matrix: n = 16614, nnz = 1.1M. Source: UF Sparse Matrix Collection.

44 3x3 blocks look natural, but…
More complicated non-zero structure in general Example: 3x3 blocking Logical grid of 3x3 cells But would lead to lots of “fill-in” Suppose we know we want to use 3x3 blocking. SPARSITY lays a logical 3x3 grid on the matrix…

45 Extra Work Can Improve Efficiency!
More complicated non-zero structure in general
Example: 3x3 blocking
   Logical grid of 3x3 cells
   Fill in explicit zeros
   Unroll 3x3 block multiplies
   “Fill ratio” = 1.5
On Pentium III: 1.5x speedup! The actual Mflop rate is 1.5² = 2.25x higher.
The main point is that there can be a considerable pay-off for a judicious choice of fill (r x c), but allowing for fill makes the implementation space even more complicated. For this matrix on a Pentium III, we observed a 1.5x speedup, even after filling in an additional 50% explicit zeros. Two effects: (1) filling in zeros, but eliminating integer indices (overhead); (2) the quality of the r x c code produced by the compiler may be much better for particular r’s and c’s. In this particular example, the overall data structure size stays the same, but 3x3 code is 2x faster than 1x1 code for a dense matrix stored in sparse format.

46 Extra Work Can Improve Efficiency!
More complicated non-zero structure in general
Example: 3x3 blocking
   Logical grid of 3x3 cells
   Fill in explicit zeros
   Unroll 3x3 block multiplies
   “Fill ratio” = 1.5
On Pentium III: 1.5x speedup! The actual Mflop rate is 1.5² = 2.25x higher.
In general: sample the matrix to choose the r x c block structure using a heuristic (see the sketch below).
The main point is that there can be a considerable pay-off for a judicious choice of fill (r x c), but allowing for fill makes the implementation space even more complicated. For this matrix on a Pentium III, we observed a 1.5x speedup, even after filling in an additional 50% explicit zeros. Two effects: (1) filling in zeros, but eliminating integer indices (overhead); (2) the quality of the r x c code produced by the compiler may be much better for particular r’s and c’s. In this particular example, the overall data structure size stays the same, but 3x3 code is 2x faster than 1x1 code for a dense matrix stored in sparse format.
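A rough sketch of such a sampling heuristic, under assumed choices of sampling fraction and estimator (the real SPARSITY/OSKI heuristic also folds in a measured Mflop rate for each r x c kernel):

```python
import numpy as np
from scipy import sparse

def estimate_fill_ratio(A, r, c, sample_frac=0.1, seed=0):
    """Estimate the fill ratio (stored entries after r x c blocking divided by
    true nonzeros) by sampling a fraction of the block rows. The sampling
    fraction and this simple estimator are illustrative assumptions."""
    A = sparse.csr_matrix(A)
    n_block_rows = A.shape[0] // r
    rng = np.random.default_rng(seed)
    rows = rng.choice(n_block_rows,
                      size=max(1, int(sample_frac * n_block_rows)),
                      replace=False)
    true_nnz, blocked_nnz = 0, 0
    for br in rows:
        strip = A[br * r:(br + 1) * r, :].tocoo()
        true_nnz += strip.nnz
        blocked_nnz += r * c * len(set(strip.col // c))   # aligned c-wide blocks touched
    return blocked_nnz / max(true_nnz, 1)
```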

47 Source: Accelerator Cavity Design Problem (from Ko via Husbands)

48 Post-RCM Reordering

49 100x100 Submatrix Along Diagonal

50 “Microscopic” Effect of RCM Reordering
Before: Green + Red After: Green + Blue

51 “Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + Red After: Green + Blue Speedups from 1.6x to 3.3x

52 Summary of SpMV Performance Optimizations
Sequential SpMV:
   Register blocking (RB): up to 4x over CSR
   Symmetry: 2.8x over CSR, 2.6x over RB
   Reordering to create dense structure + splitting: 2x over CSR
   Variable block splitting: 2.1x over CSR, 1.8x over RB
   Diagonals: 2x over CSR
   Cache blocking: 2.8x over CSR
   Multiple vectors (SpMM): 7x over CSR
   And combinations…
Multicore SpMV:
   NUMA, affinity, SW prefetch, cache/LS/TLB blocking: 4x over naïve Pthreads
Sequential sparse triangular solve:
   Hybrid sparse/dense data structure: 1.8x over CSR
Higher-level kernels:
   A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
   More later…
All optimizations (being) automated in OSKI (bebop.cs.berkeley.edu); being adopted by Cray; Python interface in progress
Items in bold: see Vuduc’s 455-page dissertation; other items: see collaborations with BeBOPpers
2.1x for variable block splitting is just for the FEM matrices with 2 block sizes

53 Avoiding Communication in Iterative Linear Algebra
k steps of a typical iterative solver for sparse Ax=b or Ax=λx:
   Does k SpMVs with a starting vector
   Finds the “best” solution among all linear combinations of these k+1 vectors
   Many such “Krylov Subspace Methods”: Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
Goal: minimize communication in Krylov Subspace Methods
   Assume the matrix is “well-partitioned,” with modest surface-to-volume ratio
   Parallel implementation: conventional O(k log p) messages, because of k calls to SpMV; new O(log p) messages – optimal
   Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data – optimal
Lots of speedup possible (modeled and measured); price: some redundant computation
Much prior work (see bebop.cs.berkeley.edu):
   CG: [van Rosendale, 83], [Chronopoulos and Gear, 89]
   GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94]

54 Minimizing Communication of GMRES to solve Ax=b
GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax-b||₂
Cost of k steps of standard GMRES vs new GMRES:

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)
      MGS(w, v(0),…,v(i-1))
      update v(i), H
   endfor
   solve LSQ problem with H
   Sequential: #words_moved = O(k·nnz) from SpMV + O(k²·n) from MGS
   Parallel: #messages = O(k) from SpMV + O(k² · log p) from MGS

Communication-avoiding GMRES:
   W = [v, Av, A²v, …, A^k v]
   [Q,R] = TSQR(W)   … “Tall Skinny QR”
   Build H from R, solve LSQ problem
   Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
   Parallel: #messages = O(1) from computing W + O(log p) from TSQR

Oops – W from power method, precision lost!
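A numpy/scipy sketch of the communication-avoiding step: build W with k SpMVs (the matrix powers kernel would do this with O(1) communication), then do one QR of the tall-skinny W (TSQR in the parallel code) instead of k steps of MGS. As the last line of the slide warns, this monomial-basis version loses precision for large k; the 1-D Laplacian test matrix is an illustrative assumption.

```python
import numpy as np
from scipy import sparse

def ca_basis_qr(A, v, k):
    """Build the Krylov basis W = [v, Av, ..., A^k v] with k SpMVs, then
    orthogonalize the whole tall-skinny W with one QR. This is the numerically
    fragile monomial-basis version flagged on the slide."""
    n = v.shape[0]
    W = np.empty((n, k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]
    Q, R = np.linalg.qr(W)          # one (TS)QR replaces k steps of MGS
    return Q, R

A = sparse.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(1000, 1000)).tocsr()
Q, R = ca_basis_qr(A, np.random.randn(1000), k=8)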

55 “Monomial” basis [Ax,…,A^k x]
fails to converge. A different polynomial basis does converge.

56 Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]

57 Communication Avoiding Iterative Linear Algebra: Future Work
Lots of algorithms to implement, autotune
   Make widely available via OSKI, Trilinos, PETSc, Python, …
Extensions to other Krylov subspace methods
   So far just Lanczos/CG and Arnoldi/GMRES
   BiCGStab, CGS, QMR, …
Add preconditioning
   Block diagonal are easiest
   Hierarchical, semiseparable, …
   Works if interactions between distant points can be “compressed”
Regarding preconditioners that “compress” the interaction between distant points, the fast multipole method is a prime example, as well as of the last bullet. Which other algorithms permit such communication avoidance? Regarding the NSF proposal, I think we should leave out the sparse direct solvers, which would be too much. The iterative stuff is also probably too much, but we can talk about it.

58 Summary Don’t Communic…

59 1/3

60 2/3

61 3/3

62 Extra Slides

63 Proof of Communication Lower Bound on C = A·B (1/5)
Proof from Irony/Toledo/Tiskin (2004); original proof, then a generalization
Think of the instruction stream being executed
   Looks like “… add, load, multiply, store, load, add, …”
   Each load/store moves a word between fast and slow memory
We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n³ flops, possibly reordered (assuming addition is commutative/associative), and assuming that at most M words can be stored in fast memory
Outline:
   Break the instruction stream into segments, each with M loads and stores
   Somehow bound the maximum number of flops that can be done in each segment, call it F
   Then F · #segments ≥ T = total flops = 2n³, so #segments ≥ T / F
   So #loads & stores = M · #segments ≥ M · T / F

64 Illustrating Segments, for M=3
[Figure: instruction-stream timeline of loads, stores, and flops, broken into Segment 1, Segment 2, Segment 3, …, each containing M = 3 loads/stores]

65 Proof of Communication Lower Bound on C = A·B (1/5)
Proof from Irony/Toledo/Tiskin (2004); original proof, then a generalization
Think of the instruction stream being executed
   Looks like “… add, load, multiply, store, load, add, …”
   Each load/store moves a word between fast and slow memory
We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n³ flops, possibly reordered (assuming addition is commutative/associative), and assuming that at most M words can be stored in fast memory
Outline:
   Break the instruction stream into segments, each with M loads and stores
   Somehow bound the maximum number of flops that can be done in each segment, call it F
   Then F · #segments ≥ T = total flops = 2n³, so #segments ≥ T / F
   So #loads & stores = M · #segments ≥ M · T / F

66 Proof of Communication Lower Bound on C = A·B (2/5)
Given a segment of the instruction stream with M loads & stores, how many adds & multiplies (F) can we do?
   At most 2M entries of C, 2M entries of A and/or 2M entries of B can be accessed
Use geometry: represent the n³ multiplications by an n x n x n cube
   One n x n face represents A; each 1 x 1 subsquare represents one A(i,k)
   One n x n face represents B; each 1 x 1 subsquare represents one B(k,j)
   One n x n face represents C; each 1 x 1 subsquare represents one C(i,j)
   Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
      May be added directly to C(i,j), or to a temporary accumulator

67 Proof of Communication Lower Bound on C = A·B (3/5)
[Figure: the n x n x n cube with the “A face”, “B face”, and “C face” labeled along the i, j, k axes; the highlighted 1x1x1 subcube represents C(1,1) += A(1,3)·B(3,1)]
If we have at most 2M “A squares”, 2M “B squares”, and 2M “C squares” on the faces, how many cubes can we have?

68 Proof of Communication Lower Bound on C = A·B (4/5)
[Figure: a 3D set of unit cubes and its three 2D shadows (the “A shadow”, “B shadow”, “C shadow”) along the i, j, k axes]
(i,k) is in the “A shadow” if (i,j,k) is in the 3D set
(j,k) is in the “B shadow” if (i,j,k) is in the 3D set
(i,j) is in the “C shadow” if (i,j,k) is in the 3D set
Thm (Loomis & Whitney, 1949): #cubes in 3D set = Volume of 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
#cubes in a black box with side lengths x, y and z = Volume of the black box = x·y·z = (xz · zy · yx)^(1/2) = (#A□s · #B□s · #C□s)^(1/2)

69 Proof of Communication Lower Bound on C = A·B (5/5)
Consider one “segment” of instructions with M loads and stores
Can be at most 2M entries of A, B, C available in one segment
Volume of the set of cubes representing possible multiply/adds in one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
#segments ≥ 2n³ / F
#loads & stores = M · #segments ≥ M · 2n³ / F = n³ / (2M)^(1/2)
Parallel case: apply the reasoning to one processor out of P
   #adds and muls ≥ 2n³ / P (at least one proc does this)
   M = n² / P (each processor gets an equal fraction of the matrix)
   #“loads & stores” = #words moved from or to other procs ≥ M · (2n³/P) / F = M · (2n³/P) / (2M)^(3/2) = n² / (2P)^(1/2)

70 How to generalize this lower bound (1/4)
It doesn’t depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j))
It doesn’t depend on C and A not overlapping (or C and B, or A and B)
   Some reorderings may change the answer; we still get a lower bound
It doesn’t depend on doing n³ multiply/adds
It doesn’t depend on C(i,j) just being a sum of products C(i,j) = Σ_k A(i,k)*B(k,j)
General model:
   For all (i,j) ∈ S
      Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

71 How to generalize this lower bound (2/4)
For all (i,j) ∈ S
   Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )   (P)
It does assume the presence of an operand generating a load and/or store; how could this not happen?
   Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
   Ex: Compute C(m) = A * B.^m (Matlab notation) for m = 1 to t
   Can move t^(1/2) times fewer words than Ω(#flops / M^(1/2))
We might generate a result during a segment, use it, and discard it, without generating any memory traffic
   Turns out QR, eig, SVD all may do this
   Need a different analysis for them later…

72 How to generalize this lower bound (3/4)
For all (i,j) ∈ S
   Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )   (P)
Need to distinguish sources and destinations of each operand in fast memory during a segment:
Possible sources:
   S1: Already in fast memory at start of segment, or read; at most 2M
   S2: Created during segment; no bound without more information
Possible destinations:
   D1: Left in fast memory at end of segment, or written; at most 2M
   D2: Discarded; no bound without more information
Need to assume no S2/D2 arguments; at most 4M of the others

73 How to generalize this lower bound (4/4)
For all (i,j) ∈ S
   Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )   (P)
Theorem: To evaluate (P) with memory of size M, where
   f_ij and g_ijk are “nontrivial” functions of their arguments,
   G is the total number of g_ijk’s, and
   there are no S2/D2 arguments,
requires at least G / (8M)^(1/2) – M slow memory references
   Simpler: #words_moved = Ω(#flops / M^(1/2))
Corollary: To evaluate (P) requires at least G / (8^(1/2) · M^(3/2)) – 1 messages (simpler: Ω(#flops / M^(3/2)))
   Proof: the maximum message size is M

74 Some Corollaries of this lower bound (1/2)
For all (i,j) ∈ S
   Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )   (P)
#words_moved = Ω(#flops / M^(1/2))
Theorem applies to dense or sparse, parallel or sequential:
   MatMul, including AᵀA or A²
   Triangular solve C = A⁻¹·X
      C(i,j) = ( X(i,j) – Σ_{k<i} A(i,k)·C(k,j) ) / A(i,i)   … A lower triangular
      C plays the double role of b and c in model (P)
   LU factorization (any pivoting, LU or “ILU”)
      L(i,j) = ( A(i,j) – Σ_{k<j} L(i,k)·U(k,j) ) / U(j,j);  U(i,j) = A(i,j) – Σ_{k<i} L(i,k)·U(k,j)
      L (and U) play the double role of c and a (c and b) in model (P)
   Cholesky (any diagonal pivoting, C or “IC”)
      L(i,j) = ( A(i,j) – Σ_{k<j} L(i,k)·Lᵀ(k,j) ) / L(j,j);  L(j,j) = ( A(j,j) – Σ_{k<j} L(j,k)·Lᵀ(k,j) )^(1/2)
      L (and Lᵀ) plays the triple role of a, b and c

75 Some Corollaries of this lower bound (2/2)
For all (i,j) ∈ S
   Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )   (P)
#words_moved = Ω(#flops / M^(1/2))
Applies to “simplified” operations
   Ex: Compute the determinant of A(i,j) = func(i,j) using LU, where each func(i,j) is called just once; so no inputs and 1 output
   Still get Ω(#flops / M^(1/2) – 2n²) words_moved, by “imposing” loads/stores
Applies to compositions of operations
   Ex: Compute A^k by repeated squaring, only input A, output A^k
   Still get Ω(#flops / M^(1/2) – n² log k) by “imposing” loads/stores
   Holds for any interleaving of operations
Applies to some graph algorithms
   Ex: All-Pairs-Shortest-Path by Floyd-Warshall: g_ijk = “+” and f_ij = “min”
   Get Ω(F / M^(1/2)) words_moved, where F ∈ {n⁴, n³ log n, n³}

76 Lower bounds for Orthogonal Transformations (1/4)
Needed for QR, eig, SVD, …
Analyze blocked Householder transformations: Π_{j=1 to b} (I – 2 u_j u_jᵀ) = I – U T Uᵀ, where U = [u_1, …, u_b]
   Treat Givens as 2x2 Householder
Model details and assumptions:
   Write (I – U T Uᵀ)A = A – U(T Uᵀ A) = A – UZ, where Z = T(Uᵀ A)
   Only count operations in all the A – UZ operations
      “Generically” a large fraction of the work
   Assume “forward progress”: each successive Householder transformation leaves previously created zeros zero; OK for
      QR decomposition
      Reductions to condensed forms (Hessenberg, tri/bidiagonal); possible exception: bulge chasing in the banded case
      One sweep of (block) Hessenberg QR iteration

77 Lower bounds for Orthogonal Transformations (2/4)
Perform many A – UZ, where Z = T(Uᵀ A)
First challenge to applying the theorem: need to collect all the A – UZ into one big set to which model (P) applies
   Write them all as { A(i,j) = A(i,j) – Σ_k U(i,k) Z(k,j) }, where k = index of the k-th transformation; k is not necessarily the index of the column of A it comes from
Second challenge: all the Z(k,j) may be S2/D2
   Recall: S2/D2 means computed on the fly and discarded
   Ex: A(n x 2n) = Q(n x n) · R(n x 2n), where 2n² = M, so A fits in cache
      Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms
      There are n²(n-1)/2 = Ω(M^(3/2)) nonzero Z(k,j), not O(M)
   Still only do O(M^(3/2)) flops during the segment, but we can’t use Loomis-Whitney to prove it!
Need a new idea…

78 Lower bounds for Orthogonal Transformations (3/4)
Dealing with Z(k,j) being S2/D2:
How to bound #flops in A(i,j) = A(i,j) – Σ_k U(i,k) Z(k,j)?
   Neither A nor U is S2/D2
   A is either turned into R or U, which are output
   So at most 2M of each during a segment
#flops ≤ (#U(i,k)) · (#columns A(:,j) present)
       ≤ (#U(i,k)) · (#A(i,j) / min #nonzeros per column of A)
       ≤ h · O(M) / r,   where h = O(M) … how small can r be?
To store h ≡ #U(i,k) Householder vector entries in r rows, there can be at most r in the first column, r-1 in the second, etc. (to maintain “forward progress”), so r(r-1)/2 ≥ h, so r ≥ h^(1/2)
#flops ≤ h · O(M) / r ≤ O(M) · h^(1/2) = O(M^(3/2)), as desired

79 Lower bounds for Orthogonal Transformations (4/4)
Theorem: #words_moved by QR decomposition using (blocked) Householder transformations = Ω(#flops / M^(1/2))
Theorem: #words_moved by reduction to Hessenberg form, tridiagonal form, bidiagonal form, or one sweep of QR iteration (or block versions of any of these) = Ω(#flops / M^(1/2))
   Assuming forward progress (created zeros remain zero)
   Model: merge left and right orthogonal transformations:
      A(i,j) = A(i,j) – Σ_{kL} U_L(i,kL) · Z_L(kL,j) – Σ_{kR} Z_R(i,kR) · U_R(j,kR)

80 Multiple Levels of Memory
Summary of dense sequential Cholesky algorithms attaining communication lower bounds
Algorithms shown minimizing #messages assume (recursive) block layout
Many references (see reports), only some shown, plus ours; the oldest reference may or may not include analysis
Cache-oblivious are underlined, green are ours, ? is unknown/future work
b = block size, M = fast memory size, n = dimension
Columns: Algorithm | 2 levels of memory (#words moved and #messages) | Multiple levels of memory (#words moved and #messages)
Cholesky: LAPACK (with b = M^(1/2)); [Gustavson, 97] recursive, assumed good MatMul available; [BDHS09] | [BDHS09] like [Gustavson,97] & [Ahmed,Pingali,00], uses recursive Cholesky, TRSM and Matmul (←same); has analysis
Questions: Was [Gustavson,97] not cache oblivious? [Frigo, Leiserson, Prokop, Ramachandran,99]

81 Summary of dense sequential QR algorithms attaining communication lower bounds
Algorithms shown minimizing #messages assume (recursive) block layout
Many references (see reports), only some shown, plus ours; the oldest reference may or may not include analysis
Cache-oblivious are underlined, green are ours, ? is unknown/future work
b = block size, M = fast memory size, n = dimension
Columns: Algorithm | 2 levels of memory (#words moved and #messages) | Multiple levels of memory (#words moved and #messages)
QR: [Elmroth, Gustavson,98] recursive, may do more flops; LAPACK (only for n ≥ M and b = M^(1/2), or n² ≤ M); [DGHL08]; [Frens,Wise,03] explicit Q only, 3x more flops; implicit Q, but represented differently, ~same flop count | [DGHL08]? Can we get the same flop count?
Questions: Was [Gustavson,97] not cache oblivious? [Frigo, Leiserson, Prokop, Ramachandran,99]

82 Multiple Levels of Memory
Summary of dense sequential LU algorithms attaining communication lower bounds
Algorithms shown minimizing #messages assume (recursive) block layout
Many references (see reports), only some shown, plus ours; the oldest reference may or may not include analysis
Cache-oblivious are underlined, green are ours, ? is unknown/future work
b = block size, M = fast memory size, n = dimension
Columns: Algorithm | 2 levels of memory (#words moved and #messages) | Multiple levels of memory (#words moved and #messages)
LU with pivoting: [Toledo, 97] recursive, assumes good TRSM and Matmul available; LAPACK (only for n ≥ M and b = M^(1/2), or n² ≤ M); [GDX08] different than partial pivoting, still stable | [GDX08]?
Questions: Was [Gustavson,97] not cache oblivious? [Frigo, Leiserson, Prokop, Ramachandran,99]

83 Same idea for LU, almost [GDX08]
First idea: use the same reduction tree as in QR, with LU at each node
   Numerically unstable!
Need a different pivoting scheme:
   At each node in the tree, TSLU selects b pivot rows from the 2b candidates from its 2 child nodes
   At each node, do LU on the 2b original rows selected by the child nodes, not on the U factors from the child nodes
   When TSLU is done, permute the b selected rows to the top of the original matrix, then redo b steps of LU without pivoting
Thm: Same results as partial pivoting on a different input matrix whose entries have the same magnitudes as the original
CALU – Communication-Avoiding LU for general A
   Use TSLU for panel factorizations, apply to the rest of the matrix
   Cost: redundant panel factorizations
   Benefit: stable in practice, but not the same pivot choice as GEPP; b times fewer messages overall – faster
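A scipy sketch of the TSLU pivot-selection tournament only (not the full CALU factorization); the block count and the assumption that every block has at least b rows are illustrative:

```python
import numpy as np
from scipy.linalg import lu

def tslu_pivot_rows(W, b, n_blocks=4):
    """Tournament pivoting sketch: each leaf block selects b candidate pivot
    rows with ordinary GEPP on its own rows; pairs of nodes then rerun GEPP on
    their combined 2b original candidate rows until b winners remain. Assumes
    every block has at least b rows; returns indices of the selected rows."""
    def select(row_idx):
        P, _, _ = lu(W[row_idx, :])                 # A = P L U, P a permutation matrix
        first_pivots = np.argmax(P, axis=0)[:b]     # positions of the first b pivot rows
        return row_idx[first_pivots]

    groups = [select(idx) for idx in np.array_split(np.arange(W.shape[0]), n_blocks)]
    while len(groups) > 1:                          # pairwise tournament up the tree
        groups = [select(np.concatenate(groups[i:i+2]))
                  for i in range(0, len(groups), 2)]
    return groups[0]

W = np.random.randn(10000, 8)                       # a tall-skinny panel
print(tslu_pivot_rows(W, b=8))
```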

84 TSLU: LU factorization of a tall skinny matrix
First try the obvious generalization of TSQR:

85 Growth factor for TSLU based factorization
Unstable for large P and large matrices. When P = # rows, TSLU is equivalent to parallel pivoting. Courtesy of H. Xiang

86 Growth factor for better CALU approach
Like threshold pivoting with worst case threshold = .33 , so |L| <= 3 Testing shows about same residual as GEPP

87 Performance vs ScaLAPACK
TSLU (tall-skinny):
   IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
   Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
CALU (square):
   Up to 2.29x faster (64 procs, 1000 x 1000)
   Up to 1.81x faster (64 procs, 1000 x 1000)
See INRIA Tech Report 6523 (2008), paper at SC08

88 CALU speedup prediction for a Petascale machine - up to 81x faster
Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 GB/s.

89 Eigenvalue problems and SVD
Uses spectral divide and conquer; the idea goes back to Bulgakov, Godunov, Malyshev
Roughly: if A^(2^k) = QR, the leading columns of Q converge to span the invariant subspace for eigenvalues of A outside the unit circle, so Q*AQ becomes block upper triangular
Repeated squaring done implicitly
[Bai, D., Gu, 97]: Reduced just to Matmul, QR, and Generalized QR with pivoting; eliminated the need for exp(A), other improvements
[DDH,07]: Introduced randomized rank-revealing QR
   [Q’,R’] = qr(randn(n,n)), [Q,R] = qr(A·Q’)
   Just need Matmul and QR
[BDD,09]: Shows that randomized GQR is also rank revealing; shows the limits of divide-and-conquer, depending on the pseudospectrum; cache-oblivious, using [Frens,Wise,03]

90 Implicit Repeated Squaring
[Equation slide: the QR-based iteration is repeated until R_j converges; the invariant is derived for each j from the 1st step and from the 2nd/3rd steps, followed by the proof. The equations are not captured in this transcript.]

91 Sparse Linear Algebra
Thm (George 73, Hoffman/Martin/Rose 73, Eisenstat/Schultz/Sherman 76): Cholesky on an nˢ x nˢ sparse matrix whose graph is an s-dimensional nearest-neighbor grid does Θ(n^(3(s-1))) flops; attainable by nested dissection ordering
   Proof: most work is in dense Cholesky on the final submatrix of size Θ(n^(s-1))
Corollary [BDHS09,G09]: Sequential sparse Cholesky on such a matrix moves Ω(n^(3(s-1)) / M^(1/2)) words, and parallel sparse Cholesky moves Ω(n^(2(s-1)) / (P log n)^(1/2)) words on P procs
   Assumes a “2D” parallel algorithm, i.e. minimal memory
Thm [G09]: Attainable in the parallel case by nested dissection, e.g. with PSPASES
Parallel Cholesky (s=2): flops_per_proc = n^3/P, memory_per_proc = n^2 log n / P, so flops_per_proc / sqrt(memory_per_proc) = n^2 / sqrt(P log n)

92 Avoiding Communication in Iterative Linear Algebra
k steps of a typical iterative solver for sparse Ax=b or Ax=λx:
   Does k SpMVs with a starting vector (e.g. with b, if solving Ax=b)
   Finds the “best” solution among all linear combinations of these k+1 vectors
   Many such “Krylov Subspace Methods”: Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
Goal: minimize communication in Krylov Subspace Methods
   Assume the matrix is “well-partitioned,” with modest surface-to-volume ratio
   Parallel implementation: conventional O(k log p) messages, because of k calls to SpMV; new O(log p) messages – optimal
   Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data – optimal
Lots of speedup possible (modeled and measured); price: some redundant computation

93 Can be computed without communication
Locally Dependent Entries for [x,Ax], A tridiagonal, 2 processors
[Figure: entries of x, Ax, A²x, …, A⁸x split between Proc 1 and Proc 2; the highlighted locally dependent entries can be computed without communication]

94 Can be computed without communication
Locally Dependent Entries for [x,Ax,A²x], A tridiagonal, 2 processors
[Figure: entries of x, Ax, A²x, …, A⁸x split between Proc 1 and Proc 2; the highlighted locally dependent entries can be computed without communication]

95 Can be computed without communication
Locally Dependent Entries for [x,Ax,…,A³x], A tridiagonal, 2 processors
[Figure: entries of x, Ax, A²x, …, A⁸x split between Proc 1 and Proc 2; the highlighted locally dependent entries can be computed without communication]

96 Can be computed without communication
Locally Dependent Entries for [x,Ax,…,A⁴x], A tridiagonal, 2 processors
[Figure: entries of x, Ax, A²x, …, A⁸x split between Proc 1 and Proc 2; the highlighted locally dependent entries can be computed without communication]

97 Can be computed without communication k=8 fold reuse of A
Locally Dependent Entries for [x,Ax,…,A⁸x], A tridiagonal, 2 processors
[Figure: entries of x, Ax, A²x, …, A⁸x split between Proc 1 and Proc 2; the highlighted locally dependent entries can be computed without communication, giving k=8-fold reuse of A]

98 Remotely Dependent Entries for
[x,Ax,…,A⁸x], A tridiagonal, 2 processors
[Figure: entries of x, Ax, A²x, …, A⁸x split between Proc 1 and Proc 2, with the remotely dependent entries highlighted]
One message to get the data needed to compute the remotely dependent entries, not k=8 messages
Minimizes number of messages = latency cost
Price: redundant work proportional to the surface/volume ratio
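A numpy sketch of this matrix powers kernel for a constant-coefficient tridiagonal A (an illustrative assumption); the single message per neighbor is represented by simply passing in the k ghost values:

```python
import numpy as np

def local_matrix_powers(x_local, left_ghost, right_ghost, k, diag=2.0, off=-1.0):
    """Matrix powers kernel sketch for a constant-coefficient tridiagonal A:
    after ONE exchange of k boundary values with each neighbor, a processor
    computes its rows of x, Ax, ..., A^k x with no further messages, at the
    price of redundantly recomputing entries near its boundary."""
    assert len(left_ghost) == k and len(right_ghost) == k
    v = np.concatenate([left_ghost, x_local, right_ghost])   # extended local vector
    out = [x_local.copy()]
    for _ in range(k):
        w = np.empty_like(v)
        w[1:-1] = off * v[:-2] + diag * v[1:-1] + off * v[2:]
        w[0] = w[-1] = np.nan        # entries whose neighbors we no longer have
        out.append(w[k:-k].copy())   # the owned rows stay exact for up to k steps
        v = w
    return out                        # [x, Ax, ..., A^k x] restricted to owned rows

# Check against an explicit global tridiagonal matrix: this "processor" owns the
# middle n rows, and its neighbors supply k ghost values on each side.
n, k, N = 20, 4, 60
A = (np.diag(2.0 * np.ones(N)) + np.diag(-1.0 * np.ones(N - 1), 1)
     + np.diag(-1.0 * np.ones(N - 1), -1))
x = np.random.randn(N)
mine = local_matrix_powers(x[n:2*n], x[n-k:n], x[2*n:2*n+k], k)
print(np.allclose(mine[k], (np.linalg.matrix_power(A, k) @ x)[n:2*n]))
```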

99 Remotely Dependent Entries for [x,Ax,A2x,A3x],
A irregular, multiple processors Red needed for k=1 Green also needed for k=2 Blue also needed for k=3

100 Sequential [x,Ax,…,A4x], with memory hierarchy
One read of the matrix from slow memory, not k=4 reads
Minimizes words moved = bandwidth cost
No redundant work

101 Performance results on 8-Core Clovertown

