
1 Communication Avoiding Algorithms for Dense Linear Algebra: LU, QR, and Cholesky Decompositions and Sparse Cholesky Decompositions
Jim Demmel and Oded Schwartz
CS294, Lecture #5, Fall 2011: Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294
Based on: Bootcamp 2011, CS267, Summer School 2010, Laura Grigori's slides, and many papers.

2 Outline for today
- Recall communication lower bounds
- What a "matching upper bound", i.e. a CA algorithm, might have to do
- Summary of what is known so far
- Case studies of CA dense algorithms:
  - Case study I: matrix multiplication (last time)
  - Case study II: LU, QR
  - Case study III: Cholesky decomposition
- Case studies of CA sparse algorithms:
  - Sparse Cholesky decomposition

3 Communication lower bounds
Applies to algorithms satisfying the technical conditions discussed last time ("smells like 3 nested loops").
For each processor:
- M = size of the fast memory of that processor
- G = #operations (multiplies) performed by that processor
#words_moved to/from fast memory = Ω( max( G / M^(1/2), #inputs + #outputs ) )
#messages to/from fast memory ≥ #words_moved / max_message_size = Ω( G / M^(3/2) )

4 Summary of dense sequential algorithms attaining communication lower bounds
- Algorithms shown minimizing #messages use a (recursive) block layout; this is not possible with columnwise or rowwise layouts.
- Many references exist (see the reports); only some are shown, plus ours.
- On the slide, cache-oblivious algorithms are underlined, our results are in green, and "?" marks unknown / future work.
Algorithm | Two levels of memory (#words moved; #messages) | Multiple levels of memory (#words moved; #messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked (nested) or recursive algorithms [Gustavson 97]
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson 97], [BDHS09]; [Gustavson 97], [Ahmed, Pingali 00], [BDHS09] | (same)
LU with pivoting | LAPACK (sometimes), [Toledo 97], [GDX 08]; [GDX 08] (not partial pivoting) | [Toledo 97], [GDX 08]?
QR (rank-revealing) | LAPACK (sometimes), [Elmroth, Gustavson 98], [DGHL08]; [Frens, Wise 03] (but 3x flops?), [DGHL08] | [Elmroth, Gustavson 98], [DGHL08]?; [Frens, Wise 03], [DGHL08]?
Eig, SVD | Not LAPACK; [BDD11] (randomized, but more flops), [BDK11] | [BDD11, BDK11?]

5 Summary of dense parallel algorithms attaining communication lower bounds
Assume n x n matrices on P processors; minimum memory per processor: M = O(n^2 / P).
Recall the lower bounds:
#words_moved = Ω( (n^3 / P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
#messages = Ω( (n^3 / P) / M^(3/2) ) = Ω( P^(1/2) )
Table (filled in on the next slides): Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages, with rows Matrix Multiply, Cholesky, LU, QR, Sym Eig & SVD, Nonsym Eig.

6 Summary of dense parallel algorithms attaining communication lower bounds (conventional approach)
Assume n x n matrices on P processors; minimum memory per processor: M = O(n^2 / P).
Recall the lower bounds:
#words_moved = Ω( (n^3 / P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
#messages = Ω( (n^3 / P) / M^(3/2) ) = Ω( P^(1/2) )
Algorithm | Reference | Factor exceeding lower bound for #words_moved
Matrix Multiply | [Cannon, 69] | 1
Cholesky | ScaLAPACK | log P
LU | ScaLAPACK | log P
QR | ScaLAPACK | log P
Sym Eig, SVD | ScaLAPACK | log P
Nonsym Eig | ScaLAPACK | P^(1/2) log P

7 Summary of dense parallel algorithms attaining communication lower bounds (conventional approach)
Assume n x n matrices on P processors; minimum memory per processor: M = O(n^2 / P).
Recall the lower bounds:
#words_moved = Ω( (n^3 / P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
#messages = Ω( (n^3 / P) / M^(3/2) ) = Ω( P^(1/2) )
Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P | n log P / P^(1/2)
QR | ScaLAPACK | log P | n log P / P^(1/2)
Sym Eig, SVD | ScaLAPACK | log P | n / P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) log P | n log P

8 Summary of dense parallel algorithms attaining communication lower bounds (better)
Assume n x n matrices on P processors; minimum memory per processor: M = O(n^2 / P).
Recall the lower bounds:
#words_moved = Ω( (n^3 / P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
#messages = Ω( (n^3 / P) / M^(3/2) ) = Ω( P^(1/2) )
Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P |
LU | [GDX10] | log P |
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P

9 Can we do even better?
Assume n x n matrices on P processors. Why keep just one copy of the data, i.e. M = O(n^2 / P) per processor? Increasing M reduces the lower bounds:
#words_moved = Ω( (n^3 / P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
#messages = Ω( (n^3 / P) / M^(3/2) ) = Ω( P^(1/2) )
Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P |
LU | [GDX10] | log P |
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P

10 Recall Sequential One-sided Factorizations (LU, QR)
Classical approach:
  for i = 1 to n
    update column i
    update trailing matrix
  #words_moved = O(n^3)
Blocked approach (LAPACK):
  for i = 1 to n/b
    update block i of b columns
    update trailing matrix
  #words_moved = O(n^3 / M^(1/3))
Recursive approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))
None of these approaches minimizes #messages or addresses parallelism; we need another idea.
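To make the recursive approach concrete, here is a minimal NumPy sketch of the column-halving recursion for unpivoted LU. This is our own illustration, not the slides' code: no pivoting, no explicit blocking, and the helper name recursive_lu is ours.

```python
import numpy as np

def recursive_lu(A):
    """In-place unpivoted LU via the recursive column-halving scheme.
    A is overwritten with L (unit lower, below the diagonal) and U (upper)."""
    n = A.shape[1]
    if n == 1:
        # Base case: one column -- scale the below-diagonal entries by the pivot.
        A[1:, 0] /= A[0, 0]
        return A
    k = n // 2
    recursive_lu(A[:, :k])                            # factor the left half
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])       # U12 = L11^{-1} A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]                # trailing update: A22 -= L21 U12
    recursive_lu(A[k:, k:])                           # factor the right half
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 6)) + 6 * np.eye(6)   # shifted diagonal so unpivoted LU is safe here
    LU = recursive_lu(A.copy())
    L = np.tril(LU, -1) + np.eye(6)
    U = np.triu(LU)
    print(np.allclose(L @ U, A))
```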

11 Minimizing latency in the sequential case requires new data structures
- To minimize latency, we need to load/store a whole rectangular subblock of the matrix with one "message".
- This is incompatible with conventional columnwise (or rowwise) storage, e.g., rows (columns) are not in contiguous memory locations.
- Blocked storage: store the matrix as a matrix of b x b blocks, each block stored contiguously. OK for one level of the memory hierarchy; what if there are more?
- Recursive blocked storage: store each block using subblocks. Also known as "space filling curves" or "Morton ordering".
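A small sketch (our own illustration, not from the slides) of how a recursive blocked / Morton "Z-order" layout maps a 2D block index to a 1D storage position, by interleaving the bits of the block row and column indices:

```python
def morton_index(bi, bj, levels):
    """Interleave the bits of block-row bi and block-column bj to get the
    Z-order (Morton) position of block (bi, bj) in a 2^levels x 2^levels grid."""
    z = 0
    for bit in range(levels):
        z |= ((bi >> bit) & 1) << (2 * bit + 1)   # row bit goes to an odd position
        z |= ((bj >> bit) & 1) << (2 * bit)       # column bit goes to an even position
    return z

# Blocks of a 4x4 grid of b x b tiles, listed in storage order:
order = sorted(((morton_index(i, j, 2), (i, j)) for i in range(4) for j in range(4)))
print([blk for _, blk in order])
# -> [(0,0), (0,1), (1,0), (1,1), (0,2), (0,3), (1,2), (1,3), ...]
```

Each 2x2 group of blocks lands contiguously, and the same property holds recursively at every level, which is exactly what lets a whole subblock be fetched with one large message.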

12 Different Data Layouts for Parallel GE
(Figure: six ways of assigning matrix blocks to processors 0-3.)
1) 1D column blocked layout: bad load balance; P0 is idle after the first n/4 steps.
2) 1D column cyclic layout: load balanced, but can't easily use BLAS2 or BLAS3.
3) 1D column block cyclic layout: can trade load balance against BLAS2/3 performance by choosing b, but factorization of a block column is a bottleneck.
4) Block skewed layout: complicated addressing; may not want full parallelism in each column and row.
5) 2D row and column blocked layout: bad load balance; P0 is idle after the first n/2 steps.
6) 2D row and column block cyclic layout: the winner!
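For reference, the owner computation behind "the winner" (layout 6) can be sketched in a few lines. This assumes the usual ScaLAPACK-style convention that block (I, J) lives on process (I mod Pr, J mod Pc); the function name is illustrative.

```python
def owner(I, J, Pr, Pc):
    """Process coordinates owning global block (I, J) in a 2D block cyclic layout."""
    return (I % Pr, J % Pc)

# 2 x 2 process grid, 4 x 4 grid of blocks: ownership is interleaved in both
# dimensions, so work stays balanced as the active part of the matrix shrinks.
for I in range(4):
    print([owner(I, J, 2, 2) for J in range(4)])
```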

13 Distributed GE with a 2D Block Cyclic Layout
(Figure.)

14 Matrix multiply of green = green - blue * pink
(Figure, continuing the distributed GE example.)

15 Where is the communication bottleneck?
In both the sequential and the parallel case: the panel factorization.
Can we retain partial pivoting? No.
"Proof" for the parallel case: each choice of pivot requires a reduction (to find the maximum entry) per column, so #messages = Ω(n) or Ω(n log p), versus the goal of O(p^(1/2)).
Solution: first for QR, then we show LU.

16 TSQR: QR of a Tall, Skinny matrix
Split W into row blocks and factor each: W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30] = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30].
Stack pairs of R factors and factor again: [R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11].
One more level: [R01; R11] = Q02·R02.
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
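A minimal serial NumPy simulation of the binary-tree TSQR reduction above, returning only the final R factor; the Q factors stay implicit, as in the slide's output list. This is an illustration of the reduction pattern, not a distributed implementation, and the function name is ours.

```python
import numpy as np

def tsqr_R(W, num_blocks=4):
    """Binary-tree TSQR on a tall, skinny W: QR each row block, then repeatedly
    stack pairs of R factors and re-factor until a single R remains."""
    blocks = np.array_split(W, num_blocks, axis=0)
    Rs = [np.linalg.qr(B)[1] for B in blocks]          # leaves: R00, R10, R20, R30
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]  # combine pairs: R01, R11, then R02
              for i in range(0, len(Rs), 2)]
    return Rs[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((1000, 8))
    R_tree = tsqr_R(W)
    R_ref = np.linalg.qr(W)[1]
    # R is unique up to the signs of its rows for full-rank W, so compare |R|.
    print(np.allclose(np.abs(R_tree), np.abs(R_ref)))
```

The same pairwise-combine pattern drives TSLU on the following slides, with GEPP row selection taking the place of the local QR.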

17 TSQR: An Architecture-Dependent Algorithm
W = [W0; W1; W2; W3]; the reduction tree over the intermediate R factors can be chosen to match the machine:
- Parallel: binary tree (R00, R10, R20, R30 → R01, R11 → R02).
- Sequential: flat tree (fold in one block at a time: R00 → R01 → R02 → R03).
- Dual core: a hybrid of the two.
The reduction tree can be chosen dynamically. Multicore / multisocket / multirack / multisite / out-of-core: ?

18 Back to LU: using a similar idea for TSLU as for TSQR: use a reduction tree to do "tournament pivoting"
W (n x b) = [W1; W2; W3; W4]:
- Factor each block: W1 = P1·L1·U1, W2 = P2·L2·U2, W3 = P3·L3·U3, W4 = P4·L4·U4; choose b pivot rows of each Wi, call them Wi'.
- Factor the stacked candidates: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, call them W12' and W34'.
- Factor [W12'; W34'] = P1234·L1234·U1234 and choose b pivot rows.
- Go back to W and use these b pivot rows: move them to the top, then do LU without pivoting.
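A serial NumPy sketch of the tournament-pivoting selection just described: each leaf proposes b pivot rows via GE with partial pivoting, and pairs of candidate sets are merged up a binary tree. This only illustrates the row selection, it is not CALU, and the helper names are ours.

```python
import numpy as np

def gepp_pivot_rows(A, b):
    """Return the indices of the b rows that GE with partial pivoting picks as
    pivot rows of A (first b columns), by running the elimination."""
    A = A.astype(float).copy()
    rows = np.arange(A.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))       # largest |entry| in column k, rows k..m-1
        A[[k, p]] = A[[p, k]]
        rows[[k, p]] = rows[[p, k]]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])   # eliminate below the pivot
    return rows[:b]

def tournament_pivot_rows(W, b, leaves=4):
    """Tournament pivoting on a tall n x b panel W: b candidate rows per leaf
    block, then merge pairs of candidate sets up a binary tree."""
    blocks = np.array_split(np.arange(W.shape[0]), leaves)
    cand = [idx[gepp_pivot_rows(W[idx], b)] for idx in blocks]    # b winners per leaf
    while len(cand) > 1:
        merged = []
        for i in range(0, len(cand), 2):
            idx = np.concatenate(cand[i:i + 2])
            merged.append(idx[gepp_pivot_rows(W[idx], b)])        # GEPP on 2b candidates
        cand = merged
    return cand[0]   # b pivot rows of W: move them to the top, then LU without pivoting

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 4))
    print(tournament_pivot_rows(W, b=4))
```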

19 Minimizing Communication in TSLU
The same reduction trees as for TSQR apply to W = [W1; W2; W3; W4], with LU at each node:
- Parallel: binary tree.
- Sequential: flat tree.
- Dual core: a hybrid of the two.
We can choose the reduction tree dynamically, as before. Multicore / multisocket / multirack / multisite / out-of-core: ?

20 Making TSLU Numerically Stable
Stability goal: make ||A - PLU|| very small: O(machine_epsilon · ||A||). Details matter:
- Going up the tree, we could do LU either on the original rows of W (tournament pivoting) or on the computed rows of U.
- Only tournament pivoting is stable.
Thm: the new scheme is as stable as partial pivoting (PP) in the following sense: we get the same results as PP applied to a different input matrix whose entries are blocks taken from the input A.
CALU – Communication Avoiding LU for general A:
- Use TSLU for the panel factorizations; apply the usual updates to the rest of the matrix.
- Cost: redundant panel factorizations.
  - Extra flops for one reduction step: n/b panels, (n·b^2) flops per panel.
  - For the sequential algorithm this is n^2·b; with b = M^(1/2) the extra work is n^2·M^(1/2) flops, which is < n^3 since M < n^2.
  - For the parallel 2D algorithm there are log P reduction steps, n/b panels, n·b^2 / P^(1/2) flops per panel, and b = n/(log^2(P)·P^(1/2)), so the extra work is (n/b)·(n·b^2)·log P / P^(1/2) = n^2·b·log P / P^(1/2) = n^3/(P·log P) < n^3/P.
- Benefit:
  - Stable in practice, but not the same pivot choice as GEPP.
  - One reduction operation per panel: reduces latency to the minimum.
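Writing the cost bullets out as display math (a restatement of the slide's arithmetic; assumes an amsmath-style environment):

```latex
% Sequential case: redundant panel work
\[
\underbrace{\frac{n}{b}}_{\#\text{panels}} \times \underbrace{O(n b^{2})}_{\text{flops per panel}}
  = O(n^{2} b) = O(n^{2} M^{1/2}) \quad (b = M^{1/2}),
  \qquad \text{which is } < n^{3} \text{ since } M < n^{2}.
\]
% Parallel 2D case: log P reduction steps
\[
\frac{n}{b}\cdot\frac{n b^{2}}{P^{1/2}}\cdot\log P
  = \frac{n^{2} b \log P}{P^{1/2}}
  = \frac{n^{3}}{P\,\log P} \quad \Bigl(b = \tfrac{n}{\log^{2}P\; P^{1/2}}\Bigr),
  \qquad \text{which is } < \frac{n^{3}}{P}.
\]
```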

21 Sketch of TSLU Stability Proof
Consider A = [A11 A12; A21 A22; A31 A32], with the TSLU of the first block column done by the block permutation shown on the slide (assume w.l.o.g. that A21 already contains the desired pivot rows).
The resulting LU decomposition (first n/2 steps) produces blocks L'11, L'21, L'31, U'11, U'12 and Schur complements A^s_22, A^s_32 (block equations shown on the slide).
Claim: A^s_32 can be obtained by GEPP on a larger matrix G formed from blocks of A.
GEPP applied to G pivots no rows, and
L31 · L21^(-1) · A21 · U'11^(-1) · U'12 + A^s_32
= L31 · U21 · U'11^(-1) · U'12 + A^s_32
= A31 · U'11^(-1) · U'12 + A^s_32
= L'31 · U'12 + A^s_32
= A32.
It remains to show that A^s_32 is the same as above.

22 Stability of LU using TSLU: CALU (1/2)
Worst-case analysis:
- Pivot growth factor from one panel factorization with b columns, TSLU:
  - Proven: 2^(bH), where H = 1 + height of the reduction tree.
  - Attained: 2^b, same as GEPP.
- Pivot growth factor on an m x n matrix, TSLU:
  - Proven: 2^(nH-1).
  - Attained: 2^(n-1), same as GEPP.
- |L| can be as large as 2^((b-2)H-b+1), but CALU is still stable.
- There are examples where GEPP is exponentially less stable than CALU, and vice versa, so neither is always better than the other.
- Discovered sparse matrices with O(2^n) pivot growth.

23 Stability of LU using TSLU: CALU (2/2)
Empirical testing:
- Both random matrices and "special ones".
- Both binary-tree (BCALU) and flat-tree (FCALU) variants.
- 3 metrics: ||PA-LU||/||A||, and the normwise and componentwise backward errors.
See [D., Grigori, Xiang, 2010] for details.

24 Making TSLU Stable
- Break the n x b panel into P^(1/2) submatrices of size n/P^(1/2) x b each.
- Think of each submatrix as assigned to a leaf of a binary tree.
- At each leaf, run GE with partial pivoting (GEPP) to identify b "good" pivot rows.
- At each internal tree node, TSLU selects b pivot rows from the 2b candidates coming from its 2 child nodes, by running GEPP on the 2b original rows selected by the child nodes.
- When TSLU is done, permute the b selected rows to the top of the original matrix and redo b steps of LU without pivoting.
Thm: same results as GEPP on a different input matrix whose entries have the same magnitudes as the original.
CALU – Communication Avoiding LU for general A:
- Use TSLU for the panel factorizations; apply the updates to the rest of the matrix.
- Cost: redundant panel factorizations.
- Benefit: stable in practice (but not the same pivot choice as GEPP); one reduction operation per panel.

25 Performance vs ScaLAPACK
TSLU:
- IBM Power 5: up to 4.37x faster (16 procs, 1M x 150).
- Cray XT4: up to 5.52x faster (8 procs, 1M x 150).
CALU:
- IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000).
- Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000).
See INRIA Tech Report 6523 (2008) and the paper at SC08.

26 Exascale Machine Parameters (source: DOE Exascale Workshop)
- 2^20 ≈ 1,000,000 nodes
- 1024 cores/node (a billion cores!)
- 100 GB/sec interconnect bandwidth
- 400 GB/sec DRAM bandwidth
- 1 microsec interconnect latency
- 50 nanosec memory latency
- 32 Petabytes of memory
- 1/2 GB total L1 on a node
- 20 Megawatts !?

27 Exascale predicted speedups for Gaussian Elimination: CA-LU vs ScaLAPACK-LU
(Figure: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory per processor).)

28 CALU speedup prediction for a petascale machine: up to 81x faster
Petascale machine with P = 8192 processors, each at 500 GFlops/s, with a bandwidth of 4 GB/s. (Figure.)

29 Case Study III: Cholesky Decomposition

30 LU and Cholesky decompositions
LU decomposition of A: A = P·L·U, where U is upper triangular, L is lower triangular, and P is a permutation matrix.
Cholesky decomposition of A: if A is a real symmetric and positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we restrict its diagonal elements to be ≥ 0).
Are there efficient parallel and sequential algorithms?
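A quick NumPy/SciPy illustration of the two factorizations just defined (library calls only, nothing communication-avoiding; assumes SciPy is available):

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))

# LU with partial pivoting: A = P L U
A = B.copy()
P, L, U = sla.lu(A)
print(np.allclose(P @ L @ U, A))

# Cholesky: for symmetric positive definite A, A = L L^T with L lower triangular
SPD = B @ B.T + 5 * np.eye(5)
L = np.linalg.cholesky(SPD)          # returns the lower-triangular factor
print(np.allclose(L @ L.T, SPD))
```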

31 Application of the generalized form to Cholesky decomposition (1)
Generalized form: for all (i,j) in S,
C(i,j) = f_ij( g_(i,j,k1)(A(i,k1), B(k1,j)), g_(i,j,k2)(A(i,k2), B(k2,j)), …, for k1, k2, … in S_ij; other arguments ).

32 By the Geometric Embedding theorem [Ballard, Demmel, Holtz, S. 2011a]
Follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49].
The communication cost (bandwidth) of any classic algorithm for Cholesky decomposition (for dense matrices) is:
BW = Ω(#flops / M^(1/2)) = Ω(n^3 / M^(1/2)) sequentially, and Ω(n^3 / (P·M^(1/2))) when the work is divided among P processors.

33 Some Sequential Classical Cholesky Algorithms
Algorithm | Bandwidth | Latency | Cache oblivious?
Lower bound | Ω(n^3/M^(1/2)) | Ω(n^3/M^(3/2)) |
Naive (left or right looking) | Θ(n^3) | Θ(n^3/M) | yes
LAPACK (with the right block size!) | O(n^3/M^(1/2)) | O(n^3/M) with a column-major layout, O(n^3/M^(3/2)) with contiguous blocks | no
Rectangular Recursive [Toledo 97] | O(n^3/M^(1/2) + n^2 log n) | Ω(n^3/M) with a column-major layout, Ω(n^2) with contiguous blocks | yes
Square Recursive [Ahmed, Pingali 00] | O(n^3/M^(1/2)) | O(n^3/M) with a column-major layout, O(n^3/M^(3/2)) with contiguous blocks | yes

34 Some Parallel Classical Cholesky Algorithms
Algorithm | Bandwidth | Latency
Lower bound (general M) | Ω(n^3/(P·M^(1/2))) | Ω(n^3/(P·M^(3/2)))
Lower bound (2D layout: M = O(n^2/P)) | Ω(n^2/P^(1/2)) | Ω(P^(1/2))
ScaLAPACK (general b) | O((n^2/P^(1/2) + nb) log P) | O((n/b) log P)
ScaLAPACK (choosing b = n/P^(1/2)) | O((n^2/P^(1/2)) log P) | O(P^(1/2) log P)

35 Upper bounds – Supporting Data Structures
(Figure.) Top (column-major): Full, Old Packed, Rectangular Full Packed. Bottom (block-contiguous): Blocked, Recursive Format, Recursive Full Packed [Frigo, Leiserson, Prokop, Ramachandran 99; Ahmed, Pingali 00; Elmroth, Gustavson, Jonsson, Kagstrom 04].

36 Upper bounds – Supporting Data Structures
Rules of thumb:
- Don't mix column/row-major with blocked storage: decide which one fits your algorithm.
- Converting between the two once in your algorithm is OK for dense instances, e.g., changing the DS at the beginning of the algorithm.

37 The Naïve Algorithms
(Figure: loop index patterns of the left-looking and right-looking variants.)
Naïve left-looking and right-looking Cholesky.
Bandwidth: Θ(n^3). Latency: Θ(n^3/M). Assuming a column-major DS.
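A minimal unblocked right-looking Cholesky in Python, to make "naïve" concrete: every step touches the whole trailing matrix, which is why the words moved grow as Θ(n^3) when the fast memory is small. This is our own sketch, not the slide's code.

```python
import numpy as np

def cholesky_right_looking(A):
    """Unblocked right-looking Cholesky: overwrite the lower triangle of A with L."""
    A = A.copy()
    n = A.shape[0]
    for j in range(n):
        A[j, j] = np.sqrt(A[j, j])
        A[j+1:, j] /= A[j, j]                               # scale column j
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j+1:, j])   # rank-1 update of the trailing matrix
    return np.tril(A)

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
SPD = B @ B.T + 6 * np.eye(6)
L = cholesky_right_looking(SPD)
print(np.allclose(L @ L.T, SPD))
```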

38 Rectangular Recursive [Toledo 97]
The m x n matrix is split into block columns of width n/2, with blocks A11, A12 / A21, A22 / A31, A32 and corresponding factor blocks L11, 0 / L21, L22 / L31, L32 (figure). Steps:
- Base case (a single column): L := A / (A11)^(1/2).
- Recurse on the left rectangular half.
- Update: [A22; A32] := [A22; A32] - [L21; L31]·L21^T.
- Recurse on the right rectangular half; L12 := 0.

39 Rectangular Recursive [Toledo 97] (continued)
Toledo's algorithm guarantees stability for LU.
Bandwidth recurrence: BW(m, n) = 2m if n = 1; otherwise BW(m, n) = BW(m, n/2) + BW_mm(m - n/2, n/2, n/2) + BW(m - n/2, n/2).
Bandwidth: O(n^3/M^(1/2) + n^2 log n): optimal bandwidth (for almost all M).

40 Rectangular Recursive [Toledo 97]: is it latency optimal?
If we use a column-major DS, then L = Ω(n^3/M), which is a factor of M^(1/2) away from the optimum Ω(n^3/M^(3/2)). This is due to the matrix multiply.
Open problem: can we prove a latency lower bound for all classic matrix-multiply algorithms that use a column-major DS?
Note: we don't need such a lower bound here, as the matrix-multiply algorithm used is known.

41 Rectangular Recursive [Toledo 97]: is it latency optimal? (continued)
If we use a contiguous-blocks DS, then L = Ω(n^2), due to the forward column updates.
The optimum is Ω(n^3/M^(3/2)); note that n^3/M^(3/2) ≤ n^2 for n ≤ M^(2/3). Outside this range the algorithm is not latency optimal.

42 Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]
Split A into n/2-sized quadrants [A11 A12; A21 A22] (figure); then:
- Recurse on A11 (giving L11).
- L21 = RTRSM(A21, L11'), where RTRSM is a recursive triangular solver.
- A22 = A22 - L21·L21'.
- Recurse on the updated A22; set A12 = 0.

43 Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09] (continued)
Bandwidth recurrence: BW(n) = 3n^2 if 3n^2 ≤ M; otherwise BW(n) = 2·BW(n/2) + O(n^3/M^(1/2) + n^2).

44 Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09] (continued)
Uses recursive matrix multiplication [Frigo, Leiserson, Prokop, Ramachandran 99] and a recursive RTRSM.
Bandwidth: O(n^3/M^(1/2)). Latency: O(n^3/M^(3/2)). Assuming a recursive-blocks DS.
Optimal bandwidth, optimal latency, and cache oblivious; but not stable for LU.
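A compact NumPy/SciPy sketch of the square recursive scheme, as our own illustration: the recursive matrix multiply and recursive RTRSM of the actual algorithm are replaced by library calls here, so this shows the recursion structure rather than the communication behavior.

```python
import numpy as np
from scipy.linalg import solve_triangular

def rec_cholesky(A):
    """Square recursive Cholesky: returns lower-triangular L with A = L L^T."""
    n = A.shape[0]
    if n == 1:
        return np.sqrt(A)
    k = n // 2
    L11 = rec_cholesky(A[:k, :k])
    # L21 satisfies L21 L11^T = A21: a triangular solve with the computed L11
    L21 = solve_triangular(L11, A[k:, :k].T, lower=True).T
    # Schur complement update, then recurse on the trailing block
    L22 = rec_cholesky(A[k:, k:] - L21 @ L21.T)
    return np.block([[L11, np.zeros((k, n - k))], [L21, L22]])

rng = np.random.default_rng(0)
B = rng.standard_normal((7, 7))
SPD = B @ B.T + 7 * np.eye(7)
L = rec_cholesky(SPD)
print(np.allclose(L @ L.T, SPD))
```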

45 Parallel Algorithm
(Figure: Pr x Pc processor grid with a 2D block cyclic layout.)
1. Find the Cholesky factor locally on a diagonal block; broadcast it downwards.
2. Parallel triangular solve; broadcast to the right.
3. Parallel symmetric rank-b update.

46 Parallel Algorithm (continued)
(Figure: the Pr x Pc processor grid with the 2D block cyclic layout, as on the previous slide.)
Algorithm | Bandwidth | Latency
Lower bound (general M) | Ω(n^3/(P·M^(1/2))) | Ω(n^3/(P·M^(3/2)))
Lower bound (2D layout: M = O(n^2/P)) | Ω(n^2/P^(1/2)) | Ω(P^(1/2))
ScaLAPACK (general b) | O((n^2/P^(1/2) + nb) log P); the nb term is low order, as b ≤ n/P^(1/2) | O((n/b) log P)
ScaLAPACK (choosing b = n/P^(1/2)) | O((n^2/P^(1/2)) log P) | O(P^(1/2) log P)
Optimal (up to a log P factor) only for the 2D case; a new 2.5D algorithm extends this to any memory size M.

47 Summary of algorithms for dense Cholesky decomposition
The lower bounds for Cholesky decomposition are tight.
Sequential: bandwidth Θ(n^3/M^(1/2)); latency Θ(n^3/M^(3/2)).
Parallel: the bandwidth and latency lower bounds are tight up to a Θ(log P) factor.

48 Sparse Algorithms: Cholesky Decomposition
Based on [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet '00].

49 Lower bounds for sparse direct methods
The geometric-embedding bound for sparse 3-nested-loops-like algorithms may be loose: BW = Ω(#flops / M^(1/2)), L = Ω(#flops / M^(3/2)).
It may be smaller than the NNZ (number of non-zeros) in the input/output. For example, computing the product of two (block) diagonal matrices may involve no communication in parallel.

50 Lower bounds for sparse matrix multiplication
Compute C = AB, with A, B, C of size n x n, each having dn nonzeros uniformly distributed; #flops is at most d^2·n.
Lower bound on communication in the sequential case: BW = Ω(dn), L = Ω(dn/M).
Two types of algorithms, depending on how their communication pattern is designed:
- Algorithms that take the nonzero structure of the matrices into account. Example: Buluc, Gilbert: BW = L = Ω(d^2·n).
- Algorithms oblivious to the nonzero structure of the matrices. Example: sparse Cannon (Buluc, Gilbert) for the parallel case: BW = Ω(dn/p^(1/2)), #messages = Ω(p^(1/2)).
[L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet]

51 Lower bounds for sparse Cholesky factorization
Consider A of size k^s x k^s resulting from a finite difference operator on a regular grid of dimension s ≥ 2 with k^s nodes. Its Cholesky factor L contains a dense lower triangular matrix of size k^(s-1) x k^(s-1). Then
BW = Ω(W / M^(1/2)) and L = Ω(W / M^(3/2)),
where W = k^(3(s-1)) / 2 for a sequential algorithm and W = k^(3(s-1)) / (2p) for a parallel algorithm.
This result applies more generally to any matrix A whose graph G = (V, E) has the following property for some s: every set of vertices W ⊂ V with n/3 ≤ |W| ≤ 2n/3 is adjacent to at least s vertices in V − W. Then the Cholesky factor of A contains a dense s x s submatrix.

52 Optimal algorithms for sparse Cholesky factorization
PSPASES with an optimal layout can attain the lower bound in parallel for 2D/3D regular grids:
- Uses nested dissection to reorder the matrix.
- Distributes the matrix using the subtree-to-subcube algorithm.
- The factorization of every dense multifrontal matrix is performed using an optimal parallel dense Cholesky factorization.
The sequential multifrontal algorithm attains the lower bound: the factorization of every dense multifrontal matrix is performed using an optimal dense Cholesky factorization.

53 Open Problems / Work in Progress (1/2)
- Rank-Revealing CAQR (RRQR): AP = QR, with P a permutation chosen to put the "most independent" columns first. Does LAPACK geqp3 move O(n^3) words, or fewer? Better: Tournament Pivoting (TP) on blocks of columns.
- CALU with more reliable pivoting (Gu, Grigori, Khabou et al.): replace tournament pivoting of the panel (TSLU) with a rank-revealing QR of the panel.
- LU with complete pivoting (GECP): A = Pr·L·U·Pc^T, where Pr and Pc are permutations putting the "most independent" rows and columns first. Idea: TP on column blocks to choose columns, then TSLU on those columns to choose rows.
- LU of A = A^T with symmetric pivoting: A = P·L·L^T·P^T, with P a permutation and L lower triangular. Bunch-Kaufman, rook pivoting: O(n^3) words moved? CA approach: choose a block with one step of CA-GECP, permute rows and columns to the top left, pick a symmetric subset with local Bunch-Kaufman.
- A = P·L·B·L^T·P^T with symmetric pivoting (Toledo et al.): L triangular, P a permutation, B banded. B tridiagonal: due to Aasen. CA approach: B block tridiagonal; stability still in question.

54 Open Problems / Work in Progress (2/2)
- Heterogeneity (Ballard, Gearhart): how to assign work to attain the lower bounds when flop rates / bandwidths / latencies vary?
- Using extra memory in the distributed-memory case (Solomonik): use all available memory to attain the improved lower bounds; just matmul and CALU so far.
- Using extra memory in the shared-memory case.
- Sparse matrices …
- Combining sparse and dense algorithms for better communication performance, e.g., the above [Grigori, David, Peleg, Peyronnet '09] and our retuned version of [Yuster, Zwick '04].
- Eliminate the log P factor?
- Prove latency lower bounds for specific DS use, e.g., can we show that any classic matrix multiply algorithm that uses a column-major DS has to be a factor of M^(1/2) away from optimality?


Download ppt "Communication Avoiding Algorithms for Dense Linear Algebra: LU, QR, and Cholesky decompositions and Sparse Cholesky decompositions Jim Demmel and Oded."

Similar presentations


Ads by Google