
1 Communication-Avoiding Algorithms: 1) Strassen-Like Algorithms 2) Hardware Implications Jim Demmel

2 Outline
- Strassen-like algorithms
  - Communication lower bounds and optimal algorithms
  - Open problem: extend beyond matmul
- Hardware implications
  - Network contention
    - What network topologies are necessary to attain the lower bounds?
    - Open problem: extend the analysis beyond the simple topologies analyzed so far
  - How does hardware need to scale so that arithmetic is the bottleneck?
  - Non-volatile memories
    - On some current and emerging technologies (FLASH, 3DXPoint, …) writes can be much more expensive than reads
    - "Write-avoiding" algorithms: attain new, even lower bounds for writes than for reads

3 Graph Expansion and Communication Costs of Fast Matrix Multiplication. SPAA'11: Symposium on Parallelism in Algorithms and Architectures, Best Paper Award; also in Journal of the ACM.

4 Communication Lower Bounds
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b], [Scott, Holtz, Schwartz 2015], [Scott 2015]
Goal: proving that your algorithm/implementation is as good as it gets.

5 Recall: Strassen's Fast Matrix Multiplication [Strassen 69]
Compute a 2 x 2 matrix multiplication using only 7 multiplications (instead of 8), then apply recursively (block-wise), with C = A·B and each n x n matrix partitioned into four n/2 x n/2 blocks:
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
Recurrence: T(n) = 7 T(n/2) + Θ(n²), so T(n) = Θ(n^{log₂ 7}).
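For concreteness, a minimal recursive sketch of these formulas in Python/NumPy (assuming n is a power of 2; the base-case cutover size is an illustrative choice, not part of the slide):

```python
import numpy as np

def strassen(A, B, base=64):
    """Multiply square matrices A and B (n a power of 2) via Strassen's recursion."""
    n = A.shape[0]
    if n <= base:                       # fall back to the classical product for small blocks
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = strassen(A11 + A22, B11 + B22, base)
    M2 = strassen(A21 + A22, B11, base)
    M3 = strassen(A11, B12 - B22, base)
    M4 = strassen(A22, B21 - B11, base)
    M5 = strassen(A11 + A12, B22, base)
    M6 = strassen(A21 - A11, B11 + B12, base)
    M7 = strassen(A12 - A22, B21 + B22, base)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7       # C11
    C[:h, h:] = M3 + M5                 # C12
    C[h:, :h] = M2 + M4                 # C21
    C[h:, h:] = M1 - M2 + M3 + M6       # C22
    return C

# Quick check against the classical product
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```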

6 Strassen-like algorithms
Compute an n₀ x n₀ matrix multiplication using only n₀^{ω₀} multiplications (instead of n₀³), then apply recursively (block-wise):
Recurrence: T(n) = n₀^{ω₀} T(n/n₀) + Θ(n²), so T(n) = Θ(n^{ω₀}).
Subsequently…
ω₀ ≈ 2.81 [Strassen 69], [Strassen-Winograd 71]
2.79 [Pan 78]
2.78 [Bini 79]
2.55 [Schönhage 81]
2.50 [Pan, Romani; Coppersmith, Winograd 84]
2.48 [Strassen 87]
2.38 [Coppersmith, Winograd 90]
2.38 [Cohn, Kleinberg, Szegedy, Umans 05] (group-theoretic approach)
2.3730 [Stothers 10]
2.3728642 [Vassilevska Williams 12]
2.3728639 [Le Gall 14]
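Written out, the recurrence and its solution (a standard divide-and-conquer count, with Strassen's own parameters as the example):

```latex
T(n) = n_0^{\omega_0}\, T(n/n_0) + \Theta(n^2)
\;\Longrightarrow\;
T(n) = \Theta\!\left(n^{\omega_0}\right) \quad (\omega_0 > 2);
\qquad \text{Strassen: } n_0 = 2,\ n_0^{\omega_0} = 7,\ \omega_0 = \log_2 7 \approx 2.81 .
```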

7 Complexity of classical Matmul vs (parallel) "Strassen-like"
Flops: O(n³/p) vs O(n^ω/p)
Communication lower bound on #words: Ω((n³/p)/M^{1/2}) = Ω(M(n/M^{1/2})³/p) vs Ω(M(n/M^{1/2})^ω/p)
Communication lower bound on #messages: Ω((n³/p)/M^{3/2}) = Ω((n/M^{1/2})³/p) vs Ω((n/M^{1/2})^ω/p)
All attainable as M increases past O(n²/p), up to a limit: M can increase by a factor of up to p^{1/3} vs p^{1-2/ω}
  #words as low as Ω(n²/p^{2/3}) vs Ω(n²/p^{2/ω})
  #messages as low as Ω(1) vs Ω(1)
Best Paper Prize, SPAA'11, Ballard, D., Holtz, Schwartz; PhD thesis, Scott, 2015; SPAA'15, Scott, Holtz, Schwartz
How well does parallel Strassen work in practice?
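Before turning to practice, a small numeric sketch (the parameter values are purely hypothetical) that evaluates the per-processor word bound M(n/M^{1/2})^ω/p above for the classical and Strassen exponents:

```python
from math import log2

def words_lower_bound(n, p, M, omega):
    """Per-processor bandwidth lower bound (up to constants): M * (n / sqrt(M))**omega / p."""
    return M * (n / M**0.5) ** omega / p

n, p = 2**15, 2**10                  # hypothetical problem size and processor count
M = 3 * n**2 / p                     # local memory just large enough for a 2D layout
for name, omega in [("classical", 3.0), ("Strassen", log2(7))]:
    print(f"{name:10s} omega={omega:.3f}  words >= {words_lower_bound(n, p, M, omega):.3e}")
```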

8 Intuition for the (sequential) result (not a proof!)
Recall the classical matmul proof:
  What is the most "useful work" you can do with O(M) words? Multiply two M^{1/2} x M^{1/2} matrices, doing M^{3/2} flops (the proof used Loomis-Whitney).
  Divide the instruction sequence into segments, each containing M loads and stores, so O(M) words are available per segment.
  Need n³/M^{3/2} = (n/M^{1/2})³ segments, so O(M·(n/M^{1/2})³) loads/stores.
Intuitive extension to Strassen-like:
  What is the most "useful work" you can do with O(M) words? Multiply two M^{1/2} x M^{1/2} matrices, doing M^{ω/2} flops.
  Divide the instruction sequence into segments, each containing M loads and stores, so O(M) words are available per segment.
  Need n^ω/M^{ω/2} = (n/M^{1/2})^ω segments, so O(M·(n/M^{1/2})^ω) loads/stores.
Open question: Is there an HBL-based proof of the first claim?
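For reference, the counting behind the segment argument written out (ω stands for 3 in the classical case or ω₀ in the Strassen-like case):

```latex
\#\text{segments} \;\ge\; \frac{n^{\omega}}{M^{\omega/2}} \;=\; \left(\frac{n}{M^{1/2}}\right)^{\omega},
\qquad
\#\text{loads/stores} \;\ge\; M \cdot \#\text{segments}
\;=\; \Omega\!\left(M \left(\frac{n}{M^{1/2}}\right)^{\omega}\right)
\;=\; \Omega\!\left(\frac{n^{\omega}}{M^{\omega/2-1}}\right).
```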

9 The Computation Directed Acyclic Graph
[Figure: the computation DAG on vertex set V; legend: input/output, intermediate value, dependency; R_S and W_S mark the reads and writes associated with a vertex set S.]
How can we estimate R_S and W_S? By bounding the expansion of the graph!

10 Expansion [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]
Let G = (V, E) be a graph. Small-set expansion: [definition in the slide figure].
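The formula itself is in the slide image; a commonly used form of small-set edge expansion (the exact normalization in the paper may differ) is

```latex
h_s(G) \;=\; \min_{\substack{S \subseteq V \\ |S| \le s}} \frac{\bigl|E(S,\, V \setminus S)\bigr|}{|S|},
```

i.e., the smallest ratio of boundary edges to set size over all sufficiently small vertex sets S.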

11 What is the Computation Graph of Strassen? Can we Compute its Expansion?

12 The DAG of Strassen, n = 2
[Figure: the DAG for one level of Strassen, with an encoding layer Enc₁A on the blocks of A, an encoding layer Enc₁B on the blocks of B, the 7 multiplication nodes, and a decoding layer Dec₁C producing the blocks of C.]
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

13 The DAG of Strassen, n = 4
One recursive level: each vertex splits into four, and the leaves multiply blocks.
[Figure: the n = 4 DAG built from copies of Enc₁A, Enc₁B, and Dec₁C.]

14 The DAG of Strassen: further recursive steps
[Figure: after lg n levels, Enc_{lg n}A and Enc_{lg n}B each take n² inputs to n^{ω₀} multiplication nodes, and Dec_{lg n}C maps them back to the n² outputs, where ω₀ = lg 7.]
Recursive construction. Given Dec_i C, construct Dec_{i+1} C:
1. Duplicate it 4 times.
2. Connect the copies with a cross-layer of Dec₁C.

15 Is Strassen's Graph a Good Expander?
[Figure: the expansion for n-by-n matrices, for M^{1/2}-by-M^{1/2} matrices, and for M^{1/2}-by-M^{1/2} sub-matrices (or other small subsets); the values are in the slide figure.]
Lower bound argument:
  Break the instruction stream of n^{ω₀} flops into segments S_i of M^{ω₀/2} flops each.
  Each segment does M^{ω₀/2} · h = Ω(M) loads/stores.
  #Loads/stores = Ω(M) · #segments = Ω(M (n/M^{1/2})^{ω₀}).

16 Technical Assumptions for the Proof
- No redundant computation
  - Keeps the CDAG fixed
  - Recomputing data instead of writing the first copy / rereading it later raises the possibility of defeating the lower bound
  - See Sec. 3.2 in Ballard/D./Holtz/Schwartz 2011
- "Strassen-like": a base algorithm (e.g., 2x2 matmul) repeated recursively
  - Includes all fast matmuls since Strassen (AFAIK)
- Dec₁C must be connected
  - Needed to bound the expansion
  - Eliminates classical matmul: each of its 4 outputs depends on only 2 of the 8 inputs, all disjoint

17 Subsequently… more bounds, for example:
- For rectangular fast matrix multiplication algorithms [MedAlg'12].
- For fast numerical linear algebra [EECS Tech Report '12], e.g., solving linear systems, least squares, and eigenproblems, with the same arithmetic and communication costs, numerically stably.
- How much extra memory is useful, and how far we can have perfect strong scaling [SPAA'12b].
- Generalizations to varying the algorithm at each recursive layer, the rectangular case, disconnected Dec₁C, … [Scott, PhD Thesis, 2015].
Next: a new parallel algorithm…

18 Optimal Parallel Algorithms
Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication. SPAA'12: Symposium on Parallelism in Algorithms and Architectures; Supercomputing'12.

19 Communication cost lower bounds for matrix multiplication
[Table on slide: sequential and distributed lower bounds for classic (cubic), Strassen's, and Strassen-like algorithms; the formulas are in the figure.]
Algorithms attaining these bounds? [Cannon 69, McColl Tiskin 99, Solomonik Demmel '11] — next.

20 Sequential and new 2D and 2.5D parallel Strassen-like algorithms
Sequential and hierarchy cases: attained by the natural recursive implementation. Also LU, QR, … (black-box use of fast matrix multiplication).
[Ballard, D., Holtz, Schwartz, Rom 2011]:
  New 2D parallel Strassen-like algorithm: attains the lower bound.
  New 2.5D parallel Strassen-like algorithm: c^{ω₀/2-1} communication speedup over the 2D implementation (c · 3n² = M·P).
[Ballard, D., Holtz, Schwartz, 2011b]: This is as good as it gets.

21 Communication-Avoiding Parallel Strassen (CAPS)
BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.
CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step.
In practice, how best to interleave BFS and DFS is a "tuning parameter".
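A schematic sketch of this BFS/DFS decision as a schedule generator (the memory accounting and constants below are simplified assumptions, not the actual CAPS bookkeeping):

```python
def caps_schedule(n, p, mem_words, base=64):
    """Return the list of BFS/DFS choices made down Strassen's recursion tree.

    Simplified model: a BFS step is taken when p >= 7 and roughly 7/4 of the
    current 2D working set (~3*(n/2)^2 words) fits in memory per processor;
    otherwise a DFS step is taken. Constants are illustrative only.
    """
    schedule = []
    while n > base:
        bfs_mem = 7 / 4 * 3 * (n / 2) ** 2 / p    # rough per-processor memory for a BFS step
        if p >= 7 and bfs_mem <= mem_words:
            schedule.append("BFS")
            p //= 7                                # each of the 7 products gets p/7 processors
        else:
            schedule.append("DFS")                 # all p processors handle the 7 products in turn
        n //= 2
    return schedule

print(caps_schedule(n=2**15, p=7**3, mem_words=2**26))
```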

22 Strong scaling of Matmul on Hopper (n = 94080)
[Figure: strong-scaling plot.] G. Ballard, D., O. Holtz, B. Lipshitz, O. Schwartz, "Communication-Avoiding Parallel Strassen", bebop.cs.berkeley.edu, Supercomputing'12.

23 Outline
- Strassen-like algorithms
  - Communication lower bounds and optimal algorithms
  - Open problem: extend beyond matmul
- Hardware implications
  - Network contention
    - What network topologies are necessary to attain the lower bounds?
    - Open problem: extend the analysis beyond the simple topologies analyzed so far
  - How does hardware need to scale so that arithmetic is the bottleneck?
  - Non-volatile memories
    - On some current and emerging technologies (FLASH, 3DXPoint, …) writes can be much more expensive than reads
    - "Write-avoiding" algorithms: attain new, even lower bounds for writes than for reads

24–25 Network-dependent Communication Lower Bounds
Most existing bounds are per-processor (they ignore topology / assume an all-to-all network), i.e., of the form: "There exists a processor that communicates at least W_proc words."
Example: matrix multiplication [ITT04, BDHS11] — memory-dependent and memory-independent lower bounds for classical (O(n³)) and Strassen's (O(n^{≈2.81})) algorithms.
Can a network link become the performance bottleneck? "There exists a link that communicates at least W_link words."
Goal: combine existing per-processor bounds with network parameters to obtain contention bounds.
[Figure: processors, each with a CPU and local memory M, connected by a network.]

26 From "per-processor" W_proc to "per-link" (contention) W_link
Notation:
  N — size of the input/output
  G_Net — inter-processor network
  P — number of processors (the vertices of G_Net)
  d — degree of G_Net (number of neighbors of each processor)
  h_t — small-set edge expansion of G_Net
  M — size of the local memories
Theorem (d-regular G_Net): [contention bound stated on the slide].
[Figure: processors with local memories connected by G_Net.]

27 Proof sketch
[Figure: the processors grouped into small sets, each set viewed as a single processor with combined memory M'.]
There exists a link that communicates at least W_link words.

28 Applications
Converting "per-processor" communication-cost lower bounds into topology-dependent contention lower bounds: plug in the "per-processor" lower bound, then plug in the network parameters.

29 Applications (continued)
Only consider sets that attain the small-set edge expansion of G_Net.

30 Applications
First application: problems on a torus interconnect. Example machines:
  BlueGene/Q — 5D torus
  BlueGene/P — 3D torus
  Intel on-chip multicore — 1D torus (ring)
[Figure: a grid of processors with local memories; image source: https://safe.epcc.ed.ac.uk/diracwiki/index.php/File:Edinburgh.jpg]

31–32 The small-set edge expansion of tori
Recall the definition of small-set expansion. Claim [Bollobás-Leader '91]: a D-dimensional torus G has [expansion bound stated on the slide], where the hidden constant may depend on D.
For a simpler proof (with larger hidden constants): use Loomis-Whitney (1949).

33–34 D-dimensional torus: W_link vs. W_proc lower bounds
[Table of bounds on the slide.]
When is W_link > W_proc, i.e., when is the previous lower bound unattainable?

35–36 Link contention vs. per-processor bounds
Example: strong scaling with Strassen's matrix multiplication (ω₀ = log₂ 7).
[Figure: strong-scaling regions for torus dimensions D₁ and D₂.]
Only when the per-processor memory-dependent bound dominates is perfect strong scaling in runtime possible. A higher torus dimension allows longer perfect strong scaling.

37 Conclusions for Network-Dependent Lower Bounds
We obtained contention lower bounds: bounds that are network-topology dependent.
These often dominate the per-processor bounds, e.g., for dense linear algebra on low-dimensional tori.
They have immediate implications for energy and power use:
- Higher contention cost results in longer runs.
- The energy cost of communication cannot be "hidden" behind the flop cost.

38 Work in progress & open problems
- Are all these bounds attainable?
- Is h_t(G_Net) a good enough characterization of a network?
- Apply to other algorithms (e.g., sparse linear algebra, …)
- Other networks? (Fat trees, hypercubes, …)

39 Implications for sequential architectural scaling
Requirements so that "most" time is spent doing arithmetic on n x n dense matrices, n² > M:
  Time = γ n³ + β n³ / M^{1/2} + α n³ / M^{3/2}
When does arithmetic dominate communication?
  γ n³ ≥ β n³ / M^{1/2} + α n³ / M^{3/2}, i.e., γ M^{3/2} ≥ β M + α
That is, the time to multiply the 2 largest locally storable square matrices (of dimension ~M^{1/2}) exceeds the time to communicate the whole fast memory (of size M) in one message. True for all n — just a constraint on the hardware.
Strassen-like matmul:
  Time = γ n^ω + β n^ω / M^{ω/2-1} + α n^ω / M^{ω/2}
  Arithmetic dominates when γ M^{ω/2} ≥ β M + α — same story as above.
Strassen-like algorithms do fewer flops and less communication but are more demanding on the hardware. If ω₀ → 2, it is all about communication.
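A small sketch (with purely hypothetical machine parameters γ, β, α) that checks the balance condition γ M^{ω/2} ≥ β M + α for both exponents:

```python
from math import log2

def arithmetic_dominates(gamma, beta, alpha, M, omega):
    """True if the flop time on the largest locally storable blocks, gamma * M**(omega/2),
    covers the time to move the whole fast memory: beta * M + alpha."""
    return gamma * M ** (omega / 2) >= beta * M + alpha

# Hypothetical machine parameters: seconds per flop, per word moved, per message
gamma, beta, alpha = 1e-11, 1e-9, 1e-6
for M in [2**16, 2**20, 2**24]:                     # fast-memory sizes in words
    for name, omega in [("classical", 3.0), ("Strassen", log2(7))]:
        ok = arithmetic_dominates(gamma, beta, alpha, M, omega)
        print(f"M=2^{int(log2(M)):2d} {name:9s}: arithmetic dominates = {ok}")
```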

40 EXTRA SLIDES

41 Communication cost lower bounds for matrix multiplication
[Table on slide: sequential and distributed bounds for classic (cubic, exponent log₂ 8), Strassen's (exponent log₂ 7), and Strassen-like (exponent ω₀) algorithms.]
[Ballard, Demmel, Holtz, S. 2011b]: sequential and parallel; novel graph expansion proof.

42 Expansion [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]
Let G = (V, E) be a graph, and for S ⊆ V write S̄ = V \ S. Small-set expansion: [definition in the slide figure].
Let A be the normalized adjacency matrix of a regular undirected graph, with eigenvalues λ₁ = 1 ≥ λ₂ ≥ … ≥ λₙ and spectral gap 1 - max{λ₂, |λₙ|}.
Thm [Alon-Milman 84, Dodziuk 84, Alon 86]: the spectral gap bounds the edge expansion (Cheeger-type inequality; exact statement on the slide).

43 The Computation Directed Acyclic Graph
Communication cost is (small-sets) graph expansion.
[Figure: the computation DAG on vertex set V; legend: input/output, intermediate value, dependency; R_S and W_S mark the reads and writes of a vertex set S.]

44 The Expansion of the Computation Graph
Methods for analyzing the expansion of a recursively constructed graph:
- Combinatorial: estimate the edge/vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
- Spectral: compute the edge expansion via the spectral gap (in the spirit of the zig-zag analysis [Reingold, Vadhan, Wigderson 00]).
Main technical challenges: two types of vertices (with/without recursion); the graph is not regular.

45–46 Estimating the edge expansion — combinatorially
Dec₁C is a consistency gadget: a "mixed" copy pays ≥ 1/12 of its edges. The fraction of S vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).
[Figures: vertices classified as "in S", "not in S", or "mixed" within the recursive Enc_{lg n}A, Enc_{lg n}B, Dec_{lg n}C structure (ω₀ = lg 7), and the partition into segments S₁, S₂, S₃, …, S_k.]

47 Is Strassen's Graph a Good Expander?
[Figure: segments S₁, …, S₅; the expansion for n-by-n matrices, for M^{1/2}-by-M^{1/2} matrices, and for M^{1/2}-by-M^{1/2} sub-matrices (or other small subsets) is given on the slide.]
Summing up: the partition argument.

48 The partitioning argument
For a given run (Algorithm, Machine, Input):
1. Consider the computation DAG G = (V, E): V = the set of computations and inputs, E = the dependencies.
2. Partition G into segments S of Θ(M^{ω/2}) vertices (corresponding to adjacency in time/location).
3. Show that every S has ≥ 3M vertices with incoming/outgoing edges, hence performs ≥ M reads/writes.
4. The total communication bandwidth is BW = (BW of one segment) × #segments = Ω(M) × O(n^ω)/Θ(M^{ω/2}) = Ω(n^ω / M^{ω/2-1}).
[Figure: the instruction stream over time partitioned into segments S₁, S₂, S₃ of reads, writes, and flops.]

49 Existing Communication-Avoiding Sequential Implementations for Strassen-like Algorithms

50 Communication-Avoiding Parallel Algorithm for Strassen's Matrix Multiplication (CAPS)

51 Previous Distributed Parallel Strassen-Based Algorithms
2D-Strassen [Luo & Drake 95]: run classical 2D across processors (same high communication cost as classical 2D), then run Strassen locally (can't use Strassen on the full matrix size).
Strassen-2D [Luo & Drake 95; Grayson, Shah, & van de Geijn 95]: run Strassen across processors (this part can be done without communication), then run classical 2D (communication costs grow exponentially with the number of Strassen steps).
Neither is communication optimal.

52 Previous Distributed Parallel Strassen-Based Algorithms (continued)
Existing Strassen-based parallel algorithms [Luo & Drake 95; Grayson, Shah, & van de Geijn 95] are not communication optimal.
Our new algorithm is communication optimal, is faster (asymptotically and in practice), and generalizes to other algorithms.
Good algorithms have to respect the expansion. What does this mean?

53 Asymptotic Analysis for Memory Scaling
[Table on slide: columns Algorithm, Flops, Local memory requirement, BW, L; rows for 2D classic (Cannon/SUMMA), 3D classic, 2.5D classic [Demmel, Solomonik 11], Strassen-based algorithms [Grayson, Shah, van de Geijn 95], [Nguyen, Lavallee, Bui, Trung 05], and the new algorithms; the entries are in the figure.]

54 The Fastest Matrix Multiplication in the West — Intrepid (IBM BlueGene/P), 196 cores (49 nodes), n = 21952 [BDHLS SPAA'12]
[Figure: performance plot.]


61 Communication-Avoiding Parallel Strassen
At each level of the recursion tree, choose either breadth-first or depth-first traversal.
BFS step: runs all 7 sub-problems in parallel, each on P/7 processors; needs 7/4 as much extra memory; requires communication, but all-BFS (if possible) minimizes communication.
DFS step: runs all 7 sub-problems sequentially, each on all P processors; needs 1/4 as much extra memory; no immediate communication; increases bandwidth by a factor of 7/4 and latency by a factor of 7.

62 The Algorithm: Communication-Avoiding Parallel Strassen (CAPS)
At each level of the recursion tree, choose either breadth-first or depth-first traversal:
BFS step: runs all 7 sub-problems in parallel, each on P/7 processors.
DFS step: runs all 7 sub-problems sequentially, each on all P processors.
CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step.

63 Implementation Details
- Dynamic interleaving of BFS and DFS
- Data layout
- Local shared-memory Strassen
- Local classical multiplication for very small matrices
- Running on P = m·7^k processors
- Hiding communication
- Dealing with P not a power of 7


65 The Fastest Matrix Multiplication in the West — Franklin (Cray XT4), Strong Scaling, n = 94080 [BDHLS SPAA'12]
24%–184% faster than previous algorithms.
[Figure: strong-scaling plot comparing ours [2012], Strassen-based [1995], and ScaLAPACK classical [2011].]

66 The Fastest Matrix Multiplication in the West [DEFKLSS, SuperComputing'12]
Speedup example: rectangular matrix multiplication, 64 x k x 64, benchmarked on a 32-core shared-memory Intel machine.
[Figure: performance plot comparing Intel's Math Kernel Library and our algorithm.]
"…Highest Performance and Scalability across Past, Present & Future Processors…" — Intel's Math Kernel Library, from http://software.intel.com/en-us/intel-mkl

67 The Fastest Matrix Multiplication in the West [DEFKLSS, SuperComputing'12]
Speedup example: rectangular matrix multiplication, dimensions 192 x 6291456 x 192, benchmarked on Hopper (Cray XE6), Lawrence Berkeley National Laboratory.
[Figure: strong-scaling plot showing our algorithm, ScaLAPACK, and machine peak.]

68 The Importance of Being Oblivious: Locating a Tuning Opportunity with a Cache- and Network-Oblivious Algorithm
Speedup example: rectangular matrix multiplication, dimensions 192 x k x 192, 6144 cores, benchmarked on Hopper (Cray XE6), Lawrence Berkeley National Laboratory.
[Figure: plot comparing BFS/DFS applied to classical matmul, ScaLAPACK, and ScaLAPACK with B rotated.]

69 The Importance of Being Oblivious (continued)
Rectangular matrix multiplication, dimensions 192 x 6291456 x 192, benchmarked on Hopper (Cray XE6), Lawrence Berkeley National Laboratory.
[Figure: performance plot.]


71 Communication-Avoiding Algorithms Save Energy: n-body, MatMul, … [Demmel, Gearhart, Lipshitz, S. '12]
[Figure: energy plot over number of processors and memory size.]


84 The Fastest Matrix Multiplication in the West — Franklin (Cray XT4), Strong Scaling, n = 94080 [BDHLS SPAA'12]
24%–184% faster than previous algorithms.
[Figure: the same strong-scaling plot as slide 65 (ours [2012], Strassen-based [1995], ScaLAPACK classical [2011]), annotated where performance drops.]

85 Communication costs for matrix multiplication — memory-independent bounds [Ballard, Demmel, Holtz, Lipshitz, S. 2012b]
[Table on slide: sequential and distributed bounds for classic (cubic), Strassen's, and Strassen-like algorithms; the formulas are in the figure.] [McColl Tiskin 99, Solomonik & Demmel '11]

86 Strong Scaling of Matrix Multiplication and Memory-Independent Lower Bounds
[Figure: plot indicating the perfect strong-scaling range.]

87 Implications for sequential architectural scaling
Requirements so that "most" time is spent doing arithmetic on n x n dense matrices, n² > M: the time to multiply the 2 largest locally storable square matrices (i.e., of dimension ~M^{1/2}) exceeds the time to communicate the whole fast memory (of size M).
CA matmul algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic             | γ M^{3/2} ≥ β M               | γ M^{3/2} ≥ α
Strassen-like       | γ M^{ω₀/2} ≥ β M              | γ M^{ω₀/2} ≥ α
Strassen-like algorithms do fewer flops and less communication but are more demanding on the hardware. If ω₀ → 2, it is all about communication.

