Communication-Avoiding Algorithms: 1) Strassen-Like Algorithms 2) Hardware Implications Jim Demmel.

Presentation transcript:

Communication-Avoiding Algorithms: 1) Strassen-Like Algorithms 2) Hardware Implications Jim Demmel

Outline
- Strassen-like algorithms
  - Communication lower bounds and optimal algorithms
  - Open problem: extend beyond matmul
- Hardware implications
  - Network contention
    - What network topologies are necessary to attain the lower bounds?
    - Open problem: extend the analysis beyond the simple topologies analyzed so far
  - How does hardware need to scale so that arithmetic is the bottleneck?
  - Non-volatile memories
    - On some current and emerging technologies (FLASH, 3DXPoint, …), writes can be much more expensive than reads
    - "Write-avoiding" algorithms: attain new, even lower bounds for writes than for reads

Graph Expansion and Communication Costs of Fast Matrix Multiplication. SPAA'11: Symposium on Parallelism in Algorithms and Architectures, Best Paper Award; also in the Journal of the ACM.

Communication Lower Bounds: proving that your algorithm/implementation is as good as it gets. Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b], [Scott, Holtz, Schwartz 2015], [Scott 2015]

Recall: Strassen's Fast Matrix Multiplication [Strassen 69]. Compute a 2 x 2 matrix multiplication using only 7 multiplications (instead of 8), and apply recursively (block-wise):

M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

Here A, B, and C are split into n/2 x n/2 blocks: [C11 C12; C21 C22] = [A11 A12; A21 A22] · [B11 B12; B21 B22].

T(n) = 7·T(n/2) + Θ(n^2), so T(n) = Θ(n^(log2 7)).
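To make the recursion concrete, here is a minimal NumPy sketch of the scheme above; the power-of-two size restriction and the LEAF cutoff are simplifying assumptions for illustration, not part of the original algorithm description.

import numpy as np

LEAF = 64  # hypothetical cutoff below which we fall back to classical matmul

def strassen(A, B):
    # Assumes square matrices whose size is LEAF times a power of 2.
    n = A.shape[0]
    if n <= LEAF:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 recursive products listed above
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7   # C11
    C[:h, h:] = M3 + M5             # C12
    C[h:, :h] = M2 + M4             # C21
    C[h:, h:] = M1 - M2 + M3 + M6   # C22
    return C

# e.g. strassen(np.random.rand(256, 256), np.random.rand(256, 256))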

Strassen-like algorithms. Compute an n0 x n0 matrix multiplication using only n0^ω0 multiplications (instead of n0^3), and apply recursively (block-wise):

T(n) = n0^ω0 · T(n/n0) + Θ(n^2), so T(n) = Θ(n^ω0).

Subsequently, the exponent ω0 has been improved:
ω0 ≈ 2.81 [Strassen 69], [Strassen-Winograd 71]
2.79 [Pan 78]
2.78 [Bini 79]
2.55 [Schönhage 81]
2.50 [Pan, Romani; Coppersmith, Winograd 84]
2.48 [Strassen 87]
2.38 [Coppersmith, Winograd 90]
2.38 [Cohn, Kleinberg, Szegedy, Umans 05] (group-theoretic approach), [Stothers 10], [Vassilevska Williams 12], [Le Gall 14]

Complexity of classical matmul vs. (parallel) "Strassen-like":
- Flops: O(n^3/p) vs. O(n^w/p)
- Communication lower bound on #words: Ω((n^3/p)/M^(1/2)) = Ω(M·(n/M^(1/2))^3/p) vs. Ω(M·(n/M^(1/2))^w/p)
- Communication lower bound on #messages: Ω((n^3/p)/M^(3/2)) = Ω((n/M^(1/2))^3/p) vs. Ω((n/M^(1/2))^w/p)
- All attainable as M increases past O(n^2/p), up to a limit: M can increase by a factor of up to p^(1/3) vs. p^(1-2/w)
  - #words as low as Ω(n^2/p^(2/3)) vs. Ω(n^2/p^(2/w))
  - #messages as low as Ω(1) vs. Ω(1)
(Best Paper Prize, SPAA'11, Ballard, D., Holtz, Schwartz; PhD thesis 2015, Scott; SPAA'15, Scott, Holtz, Schwartz.)
How well does parallel Strassen work in practice?

Intuition for the (sequential) result (not a proof!). Recall the classical matmul proof:
- What is the most "useful work" you can do with O(M) words? Multiply two M^(1/2) x M^(1/2) matrices, doing M^(3/2) flops (the proof used Loomis-Whitney).
- Divide the instruction sequence into segments, each containing M loads and stores, so O(M) words are available per segment.
- Need n^3/M^(3/2) = (n/M^(1/2))^3 segments, so O(M·(n/M^(1/2))^3) loads/stores.
Intuitive extension to Strassen-like:
- What is the most "useful work" you can do with O(M) words? Multiply two M^(1/2) x M^(1/2) matrices, doing M^(w/2) flops.
- Divide the instruction sequence into segments, each containing M loads and stores, so O(M) words are available per segment.
- Need n^w/M^(w/2) = (n/M^(1/2))^w segments, so O(M·(n/M^(1/2))^w) loads/stores.
Open question: Is there an HBL-based proof of the first claim?
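A back-of-the-envelope Python check of this segment count, with hypothetical n and M and all constants dropped; it only illustrates how the word count scales for w = 3 versus w = log2 7.

from math import log2

def words_moved(n, M, w):
    # #segments ~ n^w / M^(w/2); each segment moves ~M words
    segments = (n / M ** 0.5) ** w
    return M * segments

n, M = 2 ** 14, 2 ** 20              # hypothetical matrix dimension and fast memory size
print(words_moved(n, M, 3))          # classical matmul, w = 3
print(words_moved(n, M, log2(7)))    # Strassen, w = log2(7) ~ 2.81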

The Computation Directed Acyclic Graph. [Figure: CDAG with vertex set V; legend: input/output vertices, intermediate values, dependency edges; a segment S with its read set R_S and write set W_S.] How can we estimate R_S and W_S? By bounding the expansion of the graph!

Expansion [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]: let G = (V,E) be a graph and consider its small-sets edge expansion (one standard form of the definition is sketched below).
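For reference, one common way to write the small-sets edge expansion (a sketch of the standard definition; the paper's exact normalization, e.g. dividing by the degree, may differ):

% Small-set edge expansion: the worst ratio of boundary edges to set size
% over all vertex subsets of size at most s.
h_s(G) \;=\; \min_{\substack{S \subseteq V \\ |S| \le s}} \frac{|E(S,\, V \setminus S)|}{|S|}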

What is the computation graph of Strassen? Can we compute its expansion?

The DAG of Strassen, n = 2. [Figure: an encoding part Enc_1 A on the blocks A11, A12, A21, A22, an encoding part Enc_1 B on the blocks of B, and a decoding part Dec_1 C producing C11, C12, C21, C22.]

M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
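As a small illustration, the one-level (n = 2) CDAG implied by these formulas can be written down explicitly; the dictionaries below are just labels for this sketch and are derived directly from the equations above.

# Which A and B blocks each product reads, and which products each C block combines.
ENC_A = {"M1": ["A11", "A22"], "M2": ["A21", "A22"], "M3": ["A11"],
         "M4": ["A22"], "M5": ["A11", "A12"], "M6": ["A21", "A11"],
         "M7": ["A12", "A22"]}
ENC_B = {"M1": ["B11", "B22"], "M2": ["B11"], "M3": ["B12", "B22"],
         "M4": ["B21", "B11"], "M5": ["B22"], "M6": ["B11", "B12"],
         "M7": ["B21", "B22"]}
DEC_C = {"C11": ["M1", "M4", "M5", "M7"], "C12": ["M3", "M5"],
         "C21": ["M2", "M4"], "C22": ["M1", "M2", "M3", "M6"]}

# Dependency edges of the CDAG: input blocks -> products -> output blocks.
edges = ([(a, m) for m, blocks in ENC_A.items() for a in blocks]
         + [(b, m) for m, blocks in ENC_B.items() for b in blocks]
         + [(m, c) for c, prods in DEC_C.items() for m in prods])
print(len(edges))  # 36 dependency edges at a single level of recursion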

The DAG of Strassen, n = 4. One recursive level: each vertex splits into four, and the seven block products are multiplied recursively. [Figure: Enc_1 A, Enc_1 B, and Dec_1 C at the outer level, connected through the seven block multiplications, each of which is itself a copy of the n = 2 DAG.]

The DAG of Strassen: further recursive steps. [Figure: Enc_{lg n} A and Enc_{lg n} B on n^2 inputs each, feeding n^ω0 products (ω0 = lg 7), decoded by Dec_{lg n} C into n^2 outputs, built over lg n levels.] Recursive construction: given Dec_i C, construct Dec_{i+1} C by (1) duplicating it 4 times and (2) connecting the copies with a cross-layer of Dec_1 C.

Is Strassen's graph a good expander? The expansion is analyzed for n-by-n matrices, for M^(1/2)-by-M^(1/2) matrices, and for M^(1/2)-by-M^(1/2) sub-matrices (or other small subsets). Lower bound argument:
- Break the instruction stream of n^ω0 flops into segments S_i of M^(ω0/2) flops each.
- Each segment does M^(ω0/2) · h = Ω(M) loads/stores.
- #Loads/stores = Ω(M) · #segments = Ω(M·(n/M^(1/2))^ω0).

Technical assumptions for the proof:
- No redundant computation. Keeps the CDAG fixed; recomputation raises the possibility of defeating the lower bound by recomputing data instead of writing a first copy and rereading it later (see Sec. 3.2 in Ballard/D./Holtz/Schwartz 2011).
- "Strassen-like": a base algorithm (e.g., 2x2 matmul) repeated recursively. Includes all fast matmuls since Strassen (AFAIK).
- Dec_1 C must be connected. Needed to bound the expansion; this eliminates classical matmul, where each of the 4 outputs depends on only 2 of the 8 inputs, all disjoint.

Subsequently, more bounds, for example:
- For rectangular fast matrix multiplication algorithms [MedAlg'12].
- For fast numerical linear algebra [EECS Tech Report '12]: e.g., solving linear systems, least squares, eigenproblems, ... with the same arithmetic and communication costs, and numerically stably.
- How much extra memory is useful, and how far we can have perfect strong scaling [SPAA'12b].
- Generalizations to a varying algorithm at each recursive layer, the rectangular case, disconnected Dec_1 C, ... [Scott, PhD thesis, 2015].
And a new parallel algorithm...

Optimal Parallel Algorithms: Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication. SPAA'12: Symposium on Parallelism in Algorithms and Architectures; also Supercomputing'12.

Communication cost lower bounds for matrix multiplication, sequential and distributed, for the classic (cubic, exponent log2 8) algorithm, for Strassen's (exponent log2 7), and for Strassen-like algorithms (exponent ω0). Algorithms attaining these bounds? [Cannon 69, McColl & Tiskin 99, Solomonik & Demmel '11]. Next.

Sequential and new 2D and 2.5D parallel Strassen-like algorithms.
- Sequential and hierarchy cases: attained by the natural recursive implementation; also LU, QR, ... (black-box use of fast matrix multiplication) [Ballard, D., Holtz, Schwartz, Rom 2011].
- New 2D parallel Strassen-like algorithm: attains the lower bound.
- New 2.5D parallel Strassen-like algorithm: c^(ω0/2-1) parallel communication speedup over the 2D implementation (where c·3n^2 = M·P).
[Ballard, D., Holtz, Schwartz, 2011b]: this is as good as it gets.
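For intuition, a tiny Python computation of the predicted c^(ω0/2-1) communication speedup for a few hypothetical replication factors c:

from math import log2

w0 = log2(7)                      # Strassen's exponent
for c in (1, 2, 4, 8, 16):        # hypothetical memory replication factors
    print(c, round(c ** (w0 / 2 - 1), 3))   # predicted communication speedup over 2D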

Communication Avoiding Parallel Strassen (CAPS). At each step, choose a BFS step or a DFS step:
- BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
- DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.
CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step.
In practice, how best to interleave BFS and DFS is a "tuning parameter" (a sketch of the decision rule follows below).
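The following Python sketch only traces the sequence of BFS/DFS choices produced by the rule just quoted; the memory model (a 7/4 growth per BFS step, a fixed per-processor budget, and the leaf size) is a simplified assumption for illustration and is not the CAPS implementation.

def caps_schedule(n, P, mem_words_per_proc, leaf=64):
    # Trace the decisions of the rule: BFS if enough memory and P >= 7, else DFS.
    steps = []
    words_per_proc = 3.0 * n * n / P      # A, B, C spread over P processors
    while n > leaf:
        if P >= 7 and 7 / 4 * words_per_proc <= mem_words_per_proc:
            steps.append("BFS")           # 7 subproblems in parallel, on P/7 procs each
            P //= 7
            words_per_proc *= 7 / 4       # BFS needs 7/4 as much memory
        else:
            steps.append("DFS")           # 7 subproblems one after another, on all P procs
        n //= 2
    return steps

print(caps_schedule(n=2 ** 13, P=7 ** 3, mem_words_per_proc=2 ** 22))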

[Plot: strong scaling of matmul on Hopper (n = 94080).] G. Ballard, D., O. Holtz, B. Lipshitz, O. Schwartz, "Communication-Avoiding Parallel Strassen", bebop.cs.berkeley.edu, Supercomputing'12.

Outline
- Strassen-like algorithms
  - Communication lower bounds and optimal algorithms
  - Open problem: extend beyond matmul
- Hardware implications
  - Network contention
    - What network topologies are necessary to attain the lower bounds?
    - Open problem: extend the analysis beyond the simple topologies analyzed so far
  - How does hardware need to scale so that arithmetic is the bottleneck?
  - Non-volatile memories
    - On some current and emerging technologies (FLASH, 3DXPoint, …), writes can be much more expensive than reads
    - "Write-avoiding" algorithms: attain new, even lower bounds for writes than for reads

Network-dependent communication lower bounds. Most existing bounds are per-processor (they ignore topology / assume an all-to-all network), i.e., of the form: "there exists a processor that communicates at least W_proc words." Example: matrix multiplication [ITT04, BDHS11], with memory-dependent and memory-independent lower bounds for the classical (O(n^3)) and Strassen's (O(n^2.81)) algorithms. [Figure: processors, each with a CPU and local memory M, connected by a network.]

Network-dependent communication lower bounds (continued). Most existing bounds are per-processor, i.e., "there exists a processor that communicates at least W_proc words" (e.g., matrix multiplication [ITT04, BDHS11], memory-dependent and memory-independent bounds, classical and Strassen's). Can a network link become the performance bottleneck? That is: "there exists a link that communicates at least W_link words." Goal: combine existing per-processor bounds with network parameters to obtain contention bounds.

From "per-processor" W_proc to "per-link" (contention) W_link. Notation:
- N: size of the input/output
- G_Net: inter-processor network
- P: number of processors (the vertices of G_Net)
- d: degree of G_Net (number of neighbors of each processor)
- h_t: small-set edge expansion of G_Net
- M: size of the local memories
Theorem (d-regular G_Net): [formula relating W_link to W_proc and the network parameters].

Proof sketch. [Figure: a grid of processors (CPU + local memory M), with a subset of processors highlighted.] Conclusion: there exists a link that communicates at least W_link words.

Applications: converting "per-processor" communication-cost lower bounds into topology-dependent contention lower bounds, by plugging in a per-processor lower bound and the network parameters.

Applications (continued): converting "per-processor" communication-cost lower bounds into topology-dependent contention lower bounds; it suffices to consider sets that attain the small-set edge expansion of G_Net.

First application: problems on a torus interconnect. Example machines: BlueGene/Q (5D torus), BlueGene/P (3D torus), Intel on-chip multicore (1D torus, i.e., a ring). [Figure: a grid of processors, each with a CPU and local memory M.]

The small-set edge expansion of tori. Recall the definition of small-set expansion. Claim [Bollobás-Leader '91]: a D-dimensional torus G has small-set edge expansion h_t(G) = Θ(t^(-1/D)), where the hidden constant may depend on D. For a simpler proof (with larger hidden constants): use Loomis-Whitney (1949).

D-dimensional torus: W_link vs. W_proc lower bounds.

D-dimensional torus: W_link vs. W_proc lower bounds. When is W_link > W_proc, i.e., when is the previous (per-processor) lower bound unattainable?

Link contention vs. per-processor bounds. Example: strong scaling with Strassen's matrix multiplication (ω0 = log2 7). Only when the per-processor memory-dependent bound dominates is perfect strong scaling in runtime possible; a higher torus dimension allows longer perfect strong scaling. [Plot: bounds vs. number of processors, for torus dimensions D1 and D2.]

Conclusions for network-dependent lower bounds:
- We obtained contention lower bounds: bounds that are network-topology dependent.
- These often dominate the per-processor bounds, e.g., for dense linear algebra on low-dimensional tori.
- They have immediate implications for energy and power use: higher contention cost results in longer runs, and the energy cost of communication cannot be "hidden" behind the flops cost.

Work in progress and open problems:
- Are all these bounds attainable?
- Is h_t(G_Net) a good enough characterization of a network?
- Apply to other algorithms (e.g., sparse linear algebra, ...).
- Other networks? (Fat-trees, hypercubes, ...)

Implications for sequential architectural scaling. Requirements so that "most" time is spent doing arithmetic on n x n dense matrices, n^2 > M:
Time = γ·n^3 + β·n^3/M^(1/2) + α·n^3/M^(3/2).
When does arithmetic dominate communication? γ·n^3 ≥ β·n^3/M^(1/2) + α·n^3/M^(3/2), i.e., γ·M^(3/2) ≥ β·M + α: the time to multiply the 2 largest locally storable square matrices (i.e., of dimension ~M^(1/2)) exceeds the time to communicate the whole fast memory (of size M) in one message. True for all n; it is just a constraint on the hardware.
Strassen-like matmul: Time = γ·n^ω0 + β·n^ω0/M^(ω0/2-1) + α·n^ω0/M^(ω0/2), and arithmetic dominates when γ·M^(ω0/2) ≥ β·M + α, the same story as above.
Strassen-like algorithms do fewer flops and less communication but are more demanding on the hardware. If ω0 → 2, it is all about communication.
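A quick numeric check of these balance conditions, with made-up hardware parameters (γ, β, α below are purely illustrative):

from math import log2

gamma = 1e-11    # seconds per flop (hypothetical)
beta = 1e-9      # seconds per word moved (hypothetical)
alpha = 1e-6     # seconds per message (hypothetical)

def arithmetic_dominates(M, w):
    # gamma * M^(w/2) >= beta * M + alpha, per the condition above
    return gamma * M ** (w / 2) >= beta * M + alpha

for M in (2 ** 16, 2 ** 20, 2 ** 24, 2 ** 28):    # fast memory sizes, in words
    print(M, arithmetic_dominates(M, 3), arithmetic_dominates(M, log2(7)))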

EXTRA SLIDES

Communication cost lower bounds for matrix multiplication, sequential and distributed, for the classic (cubic, exponent log2 8) algorithm, for Strassen's (exponent log2 7), and for Strassen-like algorithms (exponent ω0). [Ballard, Demmel, Holtz, S. 2011b]: sequential and parallel bounds, via a novel graph-expansion proof.

Expansion [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]. Let G = (V,E) be a graph and write S̄ = V\S. Let A be the normalized adjacency matrix of a regular undirected graph, with eigenvalues λ1 = 1 ≥ λ2 ≥ ... ≥ λn, and let λ = 1 - max{λ2, |λn|}. Thm [Alon-Milman 84, Dodziuk 84, Alon 86]: the spectral gap λ bounds the edge expansion (a Cheeger-type inequality; a standard form is sketched below). Small-sets expansion is defined as before.
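A standard form of the Cheeger-type relation referenced here (a sketch under one common normalization; the slide's exact statement may differ):

% Edge expansion of a d-regular graph, normalized by the degree, is bounded
% below by half the spectral gap of the normalized adjacency matrix.
h(G) \;=\; \min_{\substack{S \subseteq V \\ |S| \le |V|/2}} \frac{|E(S,\, V \setminus S)|}{d\,|S|}
\;\;\ge\;\; \frac{\lambda}{2},
\qquad \lambda \;=\; 1 - \max\{\lambda_2,\, |\lambda_n|\}.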

The Computation Directed Acyclic Graph: communication cost is graph (small-sets) expansion. [Figure: CDAG with vertex set V; legend: input/output vertices, intermediate values, dependency edges; a segment S with its read set R_S and write set W_S.]

The expansion of the computation graph. Methods for analyzing the expansion of a recursively constructed graph:
- Combinatorial: estimate the edge / vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
- Spectral: compute the edge expansion via the spectral gap (in the spirit of the zig-zag analysis [Reingold, Vadhan, Wigderson 00]).
Main technical challenges: two types of vertices (with/without recursion), and the graph is not regular.

Estimating the edge expansion combinatorially. Dec_1 C is a consistency gadget: a mixed set pays at least 1/12 of its edges, and the fraction of S vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly). [Figure: vertices in S, not in S, and mixed, across Enc_{lg n} A, Enc_{lg n} B, and Dec_{lg n} C (ω0 = lg 7); subsets S_1, ..., S_k.]

Is Strassen's graph a good expander? The expansion is computed for n-by-n matrices, for M^(1/2)-by-M^(1/2) matrices, and for M^(1/2)-by-M^(1/2) sub-matrices (or other small subsets); summing up over the segments S_1, S_2, ... gives the partition argument.

The partitioning argument. For a given run (algorithm, machine, input):
1. Consider the computation DAG G = (V, E): V = the set of computations and inputs, E = the dependencies.
2. Partition G into segments S of Θ(M^(ω/2)) vertices (corresponding to adjacency in time / location).
3. Show that every segment S has ≥ 3M vertices with incoming / outgoing edges, so it performs ≥ M reads/writes.
4. The total communication bandwidth is then BW = (BW of one segment) · (#segments) = Ω(M) · O(n^ω) / Θ(M^(ω/2)) = Ω(n^ω / M^(ω/2-1)).
[Figure: the instruction stream over time, partitioned into segments S_1, S_2, S_3, ..., each with reads, writes, and flops, and fast memory of size M.]

Existing communication-avoiding sequential implementations for Strassen-like algorithms.

Communication Avoiding Parallel Algorithm for Strassen's Matrix Multiplication (CAPS).

Previous distributed-parallel Strassen-based algorithms.
- 2D-Strassen [Luo & Drake 95]: run classical 2D between processors (same, high, communication cost as classical 2D), then run Strassen locally; can't use Strassen on the full matrix size.
- Strassen-2D [Luo & Drake 95; Grayson, Shah, & van de Geijn 95]: run Strassen between processors (this part can be done without communication), then run classical 2D; communication costs grow exponentially with the number of Strassen steps.
Neither is communication optimal.

Previous distributed-parallel Strassen-based algorithms [Luo & Drake 95; Grayson, Shah, & van de Geijn 95] are not communication optimal. Our new algorithm is communication optimal, is faster (asymptotically and in practice), and generalizes to other algorithms. Good algorithms have to respect the expansion. What does this mean?

Asymptotic analysis for memory scaling. [Table: columns Algorithm | Flops | Local memory requirement | BW | L; rows: 2D classic (Cannon/SUMMA), 3D classic, 2.5D classic [Demmel & Solomonik 11], [Grayson, Shah, van de Geijn 95], [Nguyen, Lavalallee, Bui, Trung 05], and the new algorithms.]

The Fastest Matrix Multiplication in the West. [Plot: Intrepid (IBM BlueGene/P), 196 cores (49 nodes), n = .] [BDHLS SPAA'12]

Communication Avoiding Parallel Strassen: at each level of the recursion tree, choose either breadth-first or depth-first traversal.
- BFS: runs all 7 sub-problems in parallel, each on P/7 processors; needs 7/4 as much extra memory; requires communication, but all-BFS (if possible) minimizes communication.
- DFS: runs all 7 sub-problems sequentially, each on all P processors; needs 1/4 as much extra memory; no immediate communication, but increases bandwidth by a factor of 7/4 and latency by a factor of 7.

The Algorithm: Communication Avoiding Parallel Strassen (CAPS). At each level of the recursion tree, choose either breadth-first traversal (run all 7 sub-problems in parallel, each on P/7 processors) or depth-first traversal (run all 7 sub-problems sequentially, each on all P processors): if EnoughMemory and P ≥ 7 then BFS step, else DFS step.

Implementation details:
- Dynamic interleaving of BFS and DFS
- Data layout
- Local shared-memory Strassen
- Local classical multiplication for very small matrices
- Running on P = m·7^k processors
- Hiding communication
- Dealing with P not a power of 7

The Fastest Matrix Multiplication in the West. [Plot: Franklin (Cray XT4), strong scaling, n = ; legend: Ours [2012], Strassen-based [1995], ScaLAPACK, Classical [2011].] 24%-184% faster than previous algorithms. [BDHLS SPAA'12]

The Fastest Matrix Multiplication in the West [DEFKLSS, Supercomputing'12]. Speedup example: rectangular matrix multiplication, 64 x k x 64, benchmarked on a 32-core shared-memory Intel machine. [Plot: our algorithm vs. Intel's Math Kernel Library ("…Highest Performance and Scalability across Past, Present & Future Processors…", from Intel's Math Kernel Library).]

The Fastest Matrix Multiplication in the West [DEFKLSS, Supercomputing'12]. Speedup example: rectangular matrix multiplication, dimensions 192 x x 192, benchmarked on Hopper (Cray XE6), Lawrence Berkeley National Laboratory. [Strong-scaling plot; legend: Our Algorithm, Machine Peak, ScaLAPACK.]

The Importance of Being Oblivious: locating a tuning opportunity with a cache- and network-oblivious algorithm. Speedup example: rectangular matrix multiplication, dim 192 x k x 192, benchmarked on Hopper (Cray XE6), Lawrence Berkeley National Laboratory. [Plot vs. number of cores; legend: BFS/DFS applied to classical MatMul, ScaLAPACK, ScaLAPACK with B rotated.]

The Importance of Being Oblivious: locating a tuning opportunity with a cache- and network-oblivious algorithm. Rectangular matrix multiplication, dimensions 192 x x 192, benchmarked on Hopper (Cray XE6), Lawrence Berkeley National Laboratory.

Communication-Avoiding Algorithms Save Energy: n-body, MatMul, … [Demmel, Gearhart, Lipshitz, S. '12]. [Plot: memory size vs. number of processors.]

The Fastest Matrix Multiplication in the West. [Plot: Franklin (Cray XT4), strong scaling, n = ; legend: Ours [2012], Strassen-based [1995], ScaLAPACK, Classical [2011]; annotation: performance drops.] 24%-184% faster than previous algorithms. [BDHLS SPAA'12]

Memory-independent communication cost bounds for matrix multiplication [Ballard, Demmel, Holtz, Lipshitz, S. 2012b]; see also [McColl & Tiskin 99, Solomonik & Demmel '11]. Sequential and distributed cases, for classic (cubic), Strassen's, and Strassen-like algorithms; in the distributed case the #words bounds are Ω(n^2/P^(2/3)) for classic and Ω(n^2/P^(2/ω0)) for Strassen-like, as quoted earlier.

Strong Scaling of Matrix Multiplication and Memory-Independent Lower Bounds. [Plot: performance vs. number of processors, with the perfect strong scaling range marked.]

Implications for sequential architectural scaling. Requirements so that "most" time is spent doing arithmetic on n x n dense matrices, n^2 > M: the time to multiply the 2 largest locally storable square matrices (i.e., of dimension ~M^(1/2)) must exceed the time to communicate the whole fast memory (of size M). Strassen-like algorithms do fewer flops and less communication but are more demanding on the hardware; if ω0 → 2, it is all about communication.

CA matrix multiplication algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic | γ·M^(3/2) ≥ β·M | γ·M^(3/2) ≥ α
Strassen-like | γ·M^(ω0/2) ≥ β·M | γ·M^(ω0/2) ≥ α