1 A Framework for Distributed Tensor Computations. Martin Schatz, Bryan Marker, Robert van de Geijn (The University of Texas at Austin); Tze Meng Low (Carnegie Mellon University); Tamara G. Kolda (Sandia National Laboratories: Livermore)

2 Envisioned workflow 1. A new architecture comes out. 2. Scientists specify what they want computed on the new architecture to (computer) scientists. 3. (Computer) scientists provide an efficient library for the computation on the new architecture. 4. Scientists do science. Formality is key!

3 Goals Formally describe the distribution of tensor data on processing grids. Identify patterns in collective communications to utilize specialized implementations when possible. Provide a systematic approach to creating algorithms and implementations for problems. Achieve high performance.

4 Outline Description of parallel matrix-matrix multiplication. Quick overview of tensors and tensor contractions. A notation for distributing/redistributing tensors. A method for deriving algorithms.

5 Data Distribution Approach "Cyclically wrap" each mode of the tensor on the grid. Assign elements of the tensor to processes based on the assigned indices. When restricted to 2-D objects on 2-D grids, the ideas correspond to the theory of the Elemental library. 1 1 Martin D. Schatz, Jack Poulson, and Robert van de Geijn. Parallel matrix multiplication: 2D and 3D. FLAME Working Note #62, TR-12-13, The University of Texas at Austin, Department of Computer Sciences, June 2012.

6 Assumptions Assume a computing grid arranged as an order-N object. Elements of tensors are wrapped elemental-cyclically on the grid. For this example, we assume an order-2 tensor (a matrix) on an order-2 grid. A sketch of the wrapping follows.
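To make the wrapping concrete, here is a minimal Python sketch (numpy assumed; the names `owner` and `local_part` are illustrative, not from the paper's software) of elemental-cyclic assignment for a matrix on a 2 x 2 grid. The pairing of matrix indices with grid modes follows the Example 1 figures below.

```python
import numpy as np

GRID = (2, 2)  # an order-2 grid: 2 x 2 processes

def owner(i, j, grid=GRID):
    # Elemental-cyclic wrapping: the first index wraps over grid mode 0
    # and the second over grid mode 1 (a reading of the deck's figures).
    return (i % grid[0], j % grid[1])

def local_part(A, p, grid=GRID):
    # The submatrix stored at grid coordinates p: every grid[0]-th row
    # and every grid[1]-th column, starting from p.
    return A[p[0]::grid[0], p[1]::grid[1]]

A = np.arange(16).reshape(4, 4)
assert owner(3, 2) == (1, 0)
assert local_part(A, (1, 0)).shape == (2, 2)
assert A[3, 2] in local_part(A, owner(3, 2))
```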

8 Data distribution notation: The Basics Assign a distribution scheme to each mode of the object: how the indices of columns (mode 0) are distributed, and how the indices of rows (mode 1) are distributed.

9 Data distribution notation: The Basics Assign a distribution scheme to each mode of the object: indices of columns (mode 0) are distributed based on mode 0 of the grid, and indices of rows (mode 1) are distributed based on mode 1 of the grid. The tuple assigned to each mode is referred to as the "mode distribution".
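The mode-distribution idea generalizes to an order-N grid. The following hypothetical `tensor_owner` sketch encodes one plausible reading of the wrapping rule; the linearization order inside a mode distribution (first listed grid mode varies fastest) is an assumption, not something the transcript pins down.

```python
from math import prod

def tensor_owner(idx, mode_dists, grid_shape):
    # mode_dists[m] is the "mode distribution" of tensor mode m: the tuple
    # of grid modes over which its indices are wrapped.
    coords = [0] * len(grid_shape)
    for m, grid_modes in enumerate(mode_dists):
        group = prod(grid_shape[g] for g in grid_modes)  # combined extent
        r = idx[m] % group                               # wrapped position
        for g in grid_modes:                             # unflatten
            coords[g] = r % grid_shape[g]
            r //= grid_shape[g]
    return tuple(coords)

# Matrix on a 2 x 3 grid: mode 0 over grid mode 0, mode 1 over grid mode 1.
assert tensor_owner((5, 4), ((0,), (1,)), (2, 3)) == (1, 1)
# An empty mode distribution wraps nothing: that tensor mode is replicated
# across the grid modes no tensor mode uses (coordinate 0 stands in here).
assert tensor_owner((5, 4), ((0,), ()), (2, 3)) == (1, 0)
```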

10 Example 1 Distribute indices of columns based on mode 0 of the grid; distribute indices of rows based on mode 1 of the grid.

23 Notes Distributions wrap elements on a logical view of the grid – Allows multiple grid modes to be used in symbols (for example, by viewing several grid modes as one). An empty mode distribution represents replication.

24 Notes We use boldface lowercase Roman letters to refer to mode distributions. Elements of mode distributions are denoted with subscripts. Concatenation of mode distributions has its own notation.

25 Elemental Notation The distributions of Elemental can be viewed in terms of the notation defined here.

26 Parallel matrix multiplication Heuristic: – Avoid communicating the "large" matrix – Leads to "Stationary" A, B, and C algorithm variants Stationary C algorithm: a sketch follows.
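As a concrete, hedged illustration of the stationary-C pattern, here is a Python/mpi4py sketch. It assumes a block rather than elemental-cyclic distribution so that plain concatenation reassembles the gathered pieces, an r x c grid with exact divisibility, and unblocked gathers; a real implementation would proceed by panels.

```python
from mpi4py import MPI
import numpy as np

def stationary_c(A_loc, B_loc, C_loc, row_comm, col_comm):
    # Gather A's pieces along my process row and B's along my process
    # column, then multiply locally. C is never communicated.
    A_row = np.concatenate(row_comm.allgather(A_loc), axis=1)  # A[MC, * ]
    B_col = np.concatenate(col_comm.allgather(B_loc), axis=0)  # B[* , MR]
    C_loc += A_row @ B_col                                     # local gemm
    return C_loc

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    r, c = 2, 2                                    # assumed 2 x 2 grid
    me = comm.Get_rank()
    row_comm = comm.Split(color=me // c, key=me % c)   # my process row
    col_comm = comm.Split(color=me % c,  key=me // c)  # my process column
    m = n = k = 4
    A_loc = np.full((m // r, k // c), 1.0)
    B_loc = np.full((k // r, n // c), 1.0)
    C_loc = np.zeros((m // r, n // c))
    C_loc = stationary_c(A_loc, B_loc, C_loc, row_comm, col_comm)
    assert np.allclose(C_loc, k)  # C = A @ B with all-ones inputs
```

Run with, e.g., `mpiexec -n 4 python stationary_c.py` for the 2 x 2 grid assumed above.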

30 Outline Description of parallel matrix-matrix multiplication. Quick overview of tensors and tensor contractions. A notation for distributing/redistributing tensors. A method for deriving algorithms.

31 Tensors and tensor contraction Tensor – An order-m (m-mode) operator – Each mode is associated with a feature of the application – Modes have fixed length (dimension)

32 Notation Tensors are written in capital script letters. Elements of tensors are written in lowercase Greek letters. An element's location in the tensor is given as subscripts.

33 Tensor contractions Einstein notation 1 implicitly sums over modes shared by the inputs. Transpose corresponds to an interchange of modes. An arbitrary number of modes may be involved (any of which can be summed over). The sketch below illustrates. 1 A. Einstein. Die Grundlage der allgemeinen Relativitätstheorie. Annalen der Physik, 354:769–822, 1916.
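The following numpy sketch (not the paper's code) illustrates each of these points with `numpy.einsum`, which implements Einstein notation directly.

```python
import numpy as np

# Indices shared by the inputs and absent from the output are summed.
A = np.random.rand(4, 5)
B = np.random.rand(5, 6)
C = np.einsum('ak,kb->ab', A, B)    # matrix multiply: sum over k
assert np.allclose(C, A @ B)

# Transpose corresponds to an interchange of modes.
D = np.einsum('ab->ba', C)
assert np.allclose(D, C.T)

# Arbitrarily many modes, any subset of which may be contracted.
T = np.random.rand(3, 4, 5)
U = np.random.rand(5, 4, 6)
V = np.einsum('ijk,kjm->im', T, U)  # sums over modes j and k
```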

36 Tensor contractions Example: contractions arising in the third-order Møller-Plesset method 1 from computational chemistry. 1 R. J. Bartlett. Many-body perturbation theory and coupled cluster theory for electron correlation in molecules. Annual Review of Physical Chemistry, 32(1):359–401, 1981.

37 Tensor contraction as MMmult Through permutation of the data, a contraction can be arranged so that a matrix-matrix multiplication (MMmult) performs it. This results in an algorithm of the form permute, multiply, permute. It requires a large rearrangement of data – The cost of this operation is magnified in distributed-memory environments. A sketch follows.
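A hedged sketch of the matrix-mapping idea with illustrative shapes: permute and reshape the operands so the contracted modes line up, then call a single matrix multiply. The transpose/copy is exactly the "large rearrangement of data" mentioned above.

```python
import numpy as np

T = np.random.rand(3, 4, 5)   # T[a, b, c]
U = np.random.rand(5, 4, 6)   # U[c, b, d]

# Group the contracted modes (b, c) of each operand, in the same order.
Tm = T.reshape(3, 4 * 5)                                           # [a, (b,c)]
Um = np.ascontiguousarray(U.transpose(1, 0, 2)).reshape(4 * 5, 6)  # [(b,c), d]
V = Tm @ Um                   # V[a, d] = sum_{b,c} T[a,b,c] U[c,b,d]

assert np.allclose(V, np.einsum('abc,cbd->ad', T, U))
```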

38 Outline Description of parallel matrix-matrix multiplication. Quick overview of tensors and tensor contractions. A notation for distributing/redistributing tensors. A method for deriving algorithms.

39 Tensor distribution notation We've already seen the notation for order-2 tensors on order-2 grids. What about a higher-order tensor? – More modes to assign distribution symbols to – Ex.: an order-4 tensor What about a higher-order grid? – More grid modes to choose from when creating distribution symbols – Ex.: mode distributions may only contain elements from {0,1,2} when computing on an order-3 grid (a validity check is sketched below)
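A small sketch of the validity check these constraints imply (the function name `valid_distribution` is hypothetical; the no-reuse condition is my reading of the constraint that each grid mode can wrap at most one tensor mode):

```python
def valid_distribution(mode_dists, grid_order):
    # Symbols must come from the grid's modes, and no grid mode may
    # appear in two different mode distributions.
    used = [g for dist in mode_dists for g in dist]
    in_range = all(0 <= g < grid_order for g in used)
    no_reuse = len(used) == len(set(used))
    return in_range and no_reuse

# An order-4 tensor on an order-3 grid: symbols drawn from {0, 1, 2}.
assert valid_distribution(((0,), (1, 2), (), ()), grid_order=3)
assert not valid_distribution(((0,), (0,), (), ()), grid_order=3)  # reuses 0
assert not valid_distribution(((3,), (), (), ()), grid_order=3)    # no mode 3
```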

40 Redistributions: Allgather Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.

41 Allgather in action (figure showing the data before and after the allgather)

45 Redistributions: Allgather An allgather within a grid mode performs the following redistribution of data: the gathered grid mode is removed from the mode distribution, and the data becomes replicated over it (a sketch of the rule follows).
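A tiny sketch of my reading of this rule (the helper name is hypothetical). The matrix analogue in Elemental is the allgather that takes A[VC, *] to A[MC, *]: the trailing grid mode of the row distribution is gathered away.

```python
def allgather_rule(mode_dist):
    # Allgather within grid mode g, where g is the trailing entry of a
    # mode distribution, removes g from that distribution; the data is
    # afterwards replicated over g.
    assert mode_dist, "mode is already fully replicated; nothing to gather"
    new_dist, g = mode_dist[:-1], mode_dist[-1]
    return new_dist, g  # resulting distribution and the grid mode used

assert allgather_rule((0, 1)) == ((0,), 1)   # e.g. (0,1) -> (0,)
```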

46 Redistribution rules Communication within the grid modes specified by the mode distributions can perform the corresponding redistributions – Ex.: the allgather rule above.

47 Outline Description of parallel matrix-matrix multiplication. Quick overview of tensors and tensor contractions. A notation for distributing/redistributing tensors. A method for deriving algorithms.

48 Algorithm choices For matrix operations, "Stationary" variants are useful – Is extending these ideas to tensors also useful? There are potentially other "families" of algorithms to choose from – For now, we focus only on those we know how to encode.

49 Deriving Algorithms: Stationary Assumed order-4 grid. The distributions are chosen subject to the following constraints: – Avoid communicating the stationary tensor – Distribute modes similarly during local computation – Do not reuse modes of the grid – The output has no duplication (a reasonable choice) – Apply the rules of reduction redistribution A sketch of the reduction step follows.
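Under stated assumptions (mpi4py available; block distribution; the processes of one grid mode hold partial sums of the same logical block; sizes divide evenly), here is a sketch of the reduction redistribution realized as a reduce-scatter, which sums the contributions and appends the grid mode to the output's distribution:

```python
from mpi4py import MPI
import numpy as np

def reduce_scatter_mode(partial, comm):
    # Sum the partial contributions held across this communicator and
    # leave each process with an equal slice of the summed result.
    p = comm.Get_size()
    send = np.ascontiguousarray(partial.reshape(p, -1))  # p equal slices
    recv = np.empty_like(send[0])
    comm.Reduce_scatter_block(send, recv, op=MPI.SUM)    # sum, then scatter
    return recv                                          # my slice of the sum
```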

59 Quick Note Blocking the described algorithms should be straightforward (as has been done for matrix operations).

60 Analyzing algorithms Communication costs are obtained from: Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.
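For reference, the allgather estimate from that paper, which the analyses on the next slides rely on, where p is the number of participating processes, n the total data gathered, α the per-message latency, and β the per-item transfer cost:

```latex
% Allgather cost estimate (Chan et al., 2007):
T_{\mathrm{allgather}}(p, n) \approx \lceil \log_2 p \rceil \, \alpha + \frac{p-1}{p} \, n \, \beta
```

The all-to-all costs used below are modeled analogously in that paper.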

61 Analyzing Stationary algorithm (assumed order-4 grid) Redistribute the first input: – All-to-all in modes (2,3) – Allgather in modes (1,2) Redistribute the second input: – All-to-all in modes (0,1) – Allgather in modes (3,0) Local tensor contraction

63 Analyzing Matrix-mapping approach Permute. Local tensor contraction. Permute.

65 Picking the "best" algorithm Compare the Stationary algorithm and the matrix-multiply based algorithm in terms of the collectives involved and the processes participating.

66 How this all fits together We formalized aspects of distributed tensor computation – Rules defining valid data distributions – Rules specifying how collectives affect distributions This gives a mechanical way to go from a problem specification to an implementation. If other knowledge can be formalized, the search space is reduced.

67 Acknowledgements Tamara G. Kolda – Sandia National Laboratories: Livermore Robert van de Geijn Bryan Marker Devin Matthews Tze Meng Low The FLAME team

68 Thank you This work has been funded by the following – Sandia National Laboratories: Sandia Graduate Fellowship – NSF CCF-1320112: SHF: Small: From Matrix Computations to Tensor Computations – NSF ACI-1148125/1340293 (supplement): Collaborative Research: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences – Argonne National Laboratory for access to computing resources

