1 A Framework for Distributed Tensor Computations. Martin Schatz, Bryan Marker, Robert van de Geijn (The University of Texas at Austin); Tze Meng Low (Carnegie Mellon University); Tamara G. Kolda (Sandia National Laboratories: Livermore)

2 Envisioned workflow 1. A new architecture comes out. 2. Scientists specify what they want computed on the new architecture to (computer) scientists. 3. (Computer) scientists provide an efficient library for the computation on the new architecture. 4. Scientists do science. Formality is key!

3 Goals Formally describe the distribution of tensor data on processing grids. Identify patterns in collective communications to utilize specialized implementations when possible. Provide a systematic approach to creating algorithms and implementations for problems. Achieve high performance.

4 Outline Description of parallel matrix-matrix multiplication. Quick overview of tensors and tensor contractions. A notation for distributing/redistributing tensors. A method for deriving algorithms.

5 Data Distribution Approach "Cyclically wrap" each mode of the tensor on the grid. Assign elements of the tensor to processes based on the assigned indices. When restricted to 2-D objects on 2-D grids, the ideas correspond to the theory of the Elemental library. 1 1 Martin D. Schatz, Jack Poulson, and Robert van de Geijn. Parallel matrix multiplication: 2D and 3D. FLAME Working Note #62, TR-12-13, The University of Texas at Austin, Department of Computer Sciences, June 2012.

6 Assumptions Assume a computing grid arranged as an order-N object. Elements of tensors are wrapped elemental-cyclically on the grid. For this example, we assume an order-2 tensor (a matrix) on an order-2 grid. A sketch of the wrapping follows.
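To make the wrapping concrete, here is a minimal Python sketch (numpy assumed; the names `owner` and `local_part` are illustrative, not from the paper's software) of elemental-cyclic assignment for a matrix on a 2 x 2 grid. The pairing of matrix indices with grid modes follows the Example 1 figures below.

```python
import numpy as np

GRID = (2, 2)  # an order-2 grid: 2 x 2 processes

def owner(i, j, grid=GRID):
    # Elemental-cyclic wrapping: the first index wraps over grid mode 0
    # and the second over grid mode 1 (a reading of the deck's figures).
    return (i % grid[0], j % grid[1])

def local_part(A, p, grid=GRID):
    # The submatrix stored at grid coordinates p: every grid[0]-th row
    # and every grid[1]-th column, starting from p.
    return A[p[0]::grid[0], p[1]::grid[1]]

A = np.arange(16).reshape(4, 4)
assert owner(3, 2) == (1, 0)
assert local_part(A, (1, 0)).shape == (2, 2)
assert A[3, 2] in local_part(A, owner(3, 2))
```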

8 Data distribution notation: The Basics Assign a distribution scheme to each mode of the object: how the indices of columns (mode 0) are distributed, and how the indices of rows (mode 1) are distributed.

9 Data distribution notation: The Basics Assign a distribution scheme to each mode of the object: indices of columns (mode 0) are distributed based on mode 0 of the grid, and indices of rows (mode 1) are distributed based on mode 1 of the grid. The tuple assigned to each mode is referred to as the "mode distribution".
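The mode-distribution idea generalizes to an order-N grid. The following hypothetical `tensor_owner` sketch encodes one plausible reading of the wrapping rule; the linearization order inside a mode distribution (first listed grid mode varies fastest) is an assumption, not something the transcript pins down.

```python
from math import prod

def tensor_owner(idx, mode_dists, grid_shape):
    # mode_dists[m] is the "mode distribution" of tensor mode m: the tuple
    # of grid modes over which its indices are wrapped.
    coords = [0] * len(grid_shape)
    for m, grid_modes in enumerate(mode_dists):
        group = prod(grid_shape[g] for g in grid_modes)  # combined extent
        r = idx[m] % group                               # wrapped position
        for g in grid_modes:                             # unflatten
            coords[g] = r % grid_shape[g]
            r //= grid_shape[g]
    return tuple(coords)

# Matrix on a 2 x 3 grid: mode 0 over grid mode 0, mode 1 over grid mode 1.
assert tensor_owner((5, 4), ((0,), (1,)), (2, 3)) == (1, 1)
# An empty mode distribution wraps nothing: that tensor mode is replicated
# across the grid modes no tensor mode uses (coordinate 0 stands in here).
assert tensor_owner((5, 4), ((0,), ()), (2, 3)) == (1, 0)
```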

10 Example 1 Distribute indices of columns based on mode 0 of the grid; distribute indices of rows based on mode 1 of the grid.

23 Notes Distributions wrap elements on a logical view of the grid – Allows multiple grid modes to be used in symbols (for example, by viewing several grid modes as one). An empty mode distribution represents replication.

24 Notes We use boldface lowercase Roman letters to refer to mode distributions. Elements of mode distributions are denoted with subscripts. Concatenation of mode distributions has its own notation.

25 Elemental Notation The distributions of Elemental can be viewed in terms of the notation defined here.

26 Parallel matrix multiplication Heuristic: – Avoid communicating the "large" matrix – Leads to "Stationary" A, B, and C algorithm variants Stationary C algorithm: a sketch follows.
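As a concrete, hedged illustration of the stationary-C pattern, here is a Python/mpi4py sketch. It assumes a block rather than elemental-cyclic distribution so that plain concatenation reassembles the gathered pieces, an r x c grid with exact divisibility, and unblocked gathers; a real implementation would proceed by panels.

```python
from mpi4py import MPI
import numpy as np

def stationary_c(A_loc, B_loc, C_loc, row_comm, col_comm):
    # Gather A's pieces along my process row and B's along my process
    # column, then multiply locally. C is never communicated.
    A_row = np.concatenate(row_comm.allgather(A_loc), axis=1)  # A[MC, * ]
    B_col = np.concatenate(col_comm.allgather(B_loc), axis=0)  # B[* , MR]
    C_loc += A_row @ B_col                                     # local gemm
    return C_loc

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    r, c = 2, 2                                    # assumed 2 x 2 grid
    me = comm.Get_rank()
    row_comm = comm.Split(color=me // c, key=me % c)   # my process row
    col_comm = comm.Split(color=me % c,  key=me // c)  # my process column
    m = n = k = 4
    A_loc = np.full((m // r, k // c), 1.0)
    B_loc = np.full((k // r, n // c), 1.0)
    C_loc = np.zeros((m // r, n // c))
    C_loc = stationary_c(A_loc, B_loc, C_loc, row_comm, col_comm)
    assert np.allclose(C_loc, k)  # C = A @ B with all-ones inputs
```

Run with, e.g., `mpiexec -n 4 python stationary_c.py` for the 2 x 2 grid assumed above.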

30 Outline Description of parallel matrix-matrix multiplication. Quick overview of tensors and tensor contractions. A notation for distributing/redistributing tensors. A method for deriving algorithms.

31 Tensors and tensor contraction Tensor – An order-m (m-mode) operator – Each mode is associated with a feature of the application – Modes have fixed length (dimension)

32 Notation Tensors are written in capital script letters. Elements of tensors are written in lowercase Greek letters. An element's location in the tensor is given as subscripts.

33 Tensor contractions Einstein notation 1 implicitly sums over modes shared by the inputs. Transpose corresponds to an interchange of modes. An arbitrary number of modes may be involved (any of which can be summed over). The sketch below illustrates. 1 A. Einstein. Die Grundlage der allgemeinen Relativitätstheorie. Annalen der Physik, 354:769–822, 1916.
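The following numpy sketch (not the paper's code) illustrates each of these points with `numpy.einsum`, which implements Einstein notation directly.

```python
import numpy as np

# Indices shared by the inputs and absent from the output are summed.
A = np.random.rand(4, 5)
B = np.random.rand(5, 6)
C = np.einsum('ak,kb->ab', A, B)    # matrix multiply: sum over k
assert np.allclose(C, A @ B)

# Transpose corresponds to an interchange of modes.
D = np.einsum('ab->ba', C)
assert np.allclose(D, C.T)

# Arbitrarily many modes, any subset of which may be contracted.
T = np.random.rand(3, 4, 5)
U = np.random.rand(5, 4, 6)
V = np.einsum('ijk,kjm->im', T, U)  # sums over modes j and k
```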

36 Tensor contractions Example: contractions arising in the third-order Møller-Plesset method 1 from computational chemistry. 1 R. J. Bartlett. Many-body perturbation theory and coupled cluster theory for electron correlation in molecules. Annual Review of Physical Chemistry, 32(1):359–401, 1981.

37 Tensor contraction as MMmult Through permutation of the data, a contraction can be arranged so that a matrix-matrix multiplication (MMmult) performs it. This results in an algorithm of the form permute, multiply, permute. It requires a large rearrangement of data – The cost of this operation is magnified in distributed-memory environments. A sketch follows.
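A hedged sketch of the matrix-mapping idea with illustrative shapes: permute and reshape the operands so the contracted modes line up, then call a single matrix multiply. The transpose/copy is exactly the "large rearrangement of data" mentioned above.

```python
import numpy as np

T = np.random.rand(3, 4, 5)   # T[a, b, c]
U = np.random.rand(5, 4, 6)   # U[c, b, d]

# Group the contracted modes (b, c) of each operand, in the same order.
Tm = T.reshape(3, 4 * 5)                                           # [a, (b,c)]
Um = np.ascontiguousarray(U.transpose(1, 0, 2)).reshape(4 * 5, 6)  # [(b,c), d]
V = Tm @ Um                   # V[a, d] = sum_{b,c} T[a,b,c] U[c,b,d]

assert np.allclose(V, np.einsum('abc,cbd->ad', T, U))
```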

38 Outline Description of parallel matrix-matrix multiplication. Quick overview of tensors and tensor contractions. A notation for distributing/redistributing tensors. A method for deriving algorithms.

39 Tensor distribution notation We've already seen the notation for order-2 tensors on order-2 grids. What about a higher-order tensor? – More modes to assign distribution symbols to – Ex.: an order-4 tensor What about a higher-order grid? – More grid modes to choose from when creating distribution symbols – Ex.: mode distributions may only contain elements from {0,1,2} when computing on an order-3 grid (a validity check is sketched below)
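A small sketch of the validity check these constraints imply (the function name `valid_distribution` is hypothetical; the no-reuse condition is my reading of the constraint that each grid mode can wrap at most one tensor mode):

```python
def valid_distribution(mode_dists, grid_order):
    # Symbols must come from the grid's modes, and no grid mode may
    # appear in two different mode distributions.
    used = [g for dist in mode_dists for g in dist]
    in_range = all(0 <= g < grid_order for g in used)
    no_reuse = len(used) == len(set(used))
    return in_range and no_reuse

# An order-4 tensor on an order-3 grid: symbols drawn from {0, 1, 2}.
assert valid_distribution(((0,), (1, 2), (), ()), grid_order=3)
assert not valid_distribution(((0,), (0,), (), ()), grid_order=3)  # reuses 0
assert not valid_distribution(((3,), (), (), ()), grid_order=3)    # no mode 3
```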

40 Redistributions: Allgather Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.

41 Allgather in action (figure showing the data before and after the allgather)

45 Redistributions: Allgather An allgather within a grid mode performs the following redistribution of data: the gathered grid mode is removed from the mode distribution, and the data becomes replicated over it (a sketch of the rule follows).
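A tiny sketch of my reading of this rule (the helper name is hypothetical). The matrix analogue in Elemental is the allgather that takes A[VC, *] to A[MC, *]: the trailing grid mode of the row distribution is gathered away.

```python
def allgather_rule(mode_dist):
    # Allgather within grid mode g, where g is the trailing entry of a
    # mode distribution, removes g from that distribution; the data is
    # afterwards replicated over g.
    assert mode_dist, "mode is already fully replicated; nothing to gather"
    new_dist, g = mode_dist[:-1], mode_dist[-1]
    return new_dist, g  # resulting distribution and the grid mode used

assert allgather_rule((0, 1)) == ((0,), 1)   # e.g. (0,1) -> (0,)
```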

46 Redistribution rules Communication within the grid modes specified by the mode distributions can perform the corresponding redistributions – Ex.: the allgather rule above.

47 Outline Description of parallel matrix-matrix multiplication. Quick overview of tensors and tensor contractions. A notation for distributing/redistributing tensors. A method for deriving algorithms.

48 Algorithm choices For matrix operations, "Stationary" variants are useful – Is extending these ideas to tensors also useful? There are potentially other "families" of algorithms to choose from – For now, we focus only on those we know how to encode.

49 Deriving Algorithms: Stationary Assumed order-4 grid. The distributions are chosen subject to the following constraints: – Avoid communicating the stationary tensor – Distribute modes similarly during local computation – Do not reuse modes of the grid – The output has no duplication (a reasonable choice) – Apply the rules of reduction redistribution A sketch of the reduction step follows.
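Under stated assumptions (mpi4py available; block distribution; the processes of one grid mode hold partial sums of the same logical block; sizes divide evenly), here is a sketch of the reduction redistribution realized as a reduce-scatter, which sums the contributions and appends the grid mode to the output's distribution:

```python
from mpi4py import MPI
import numpy as np

def reduce_scatter_mode(partial, comm):
    # Sum the partial contributions held across this communicator and
    # leave each process with an equal slice of the summed result.
    p = comm.Get_size()
    send = np.ascontiguousarray(partial.reshape(p, -1))  # p equal slices
    recv = np.empty_like(send[0])
    comm.Reduce_scatter_block(send, recv, op=MPI.SUM)    # sum, then scatter
    return recv                                          # my slice of the sum
```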

59 Quick Note Blocking the described algorithms should be straightforward (as has been done for matrix operations).

60 Analyzing algorithms Communication costs are obtained from: Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.
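For reference, the allgather estimate from that paper, which the analyses on the next slides rely on, where p is the number of participating processes, n the total data gathered, α the per-message latency, and β the per-item transfer cost:

```latex
% Allgather cost estimate (Chan et al., 2007):
T_{\mathrm{allgather}}(p, n) \approx \lceil \log_2 p \rceil \, \alpha + \frac{p-1}{p} \, n \, \beta
```

The all-to-all costs used below are modeled analogously in that paper.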

61 Analyzing Stationary algorithm (assumed order-4 grid) Redistribute the first input: – All-to-all in modes (2,3) – Allgather in modes (1,2) Redistribute the second input: – All-to-all in modes (0,1) – Allgather in modes (3,0) Local tensor contraction

63 Analyzing Matrix-mapping approach Permute. Local tensor contraction. Permute.

65 Picking the "best" algorithm Compare the Stationary algorithm and the matrix-multiply based algorithm in terms of the collectives involved and the processes participating.

66 How this all fits together We formalized aspects of distributed tensor computation – Rules defining valid data distributions – Rules specifying how collectives affect distributions This gives a mechanical way to go from a problem specification to an implementation. If other knowledge can be formalized, the search space is reduced.

67 Acknowledgements Tamara G. Kolda – Sandia National Laboratories: Livermore Robert van de Geijn Bryan Marker Devin Matthews Tze Meng Low The FLAME team

68 Thank you This work has been funded by the following – Sandia National Laboratories: Sandia Graduate Fellowship – NSF CCF-1320112: SHF: Small: From Matrix Computations to Tensor Computations – NSF ACI-1148125/1340293 (supplement): Collaborative Research: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences – Argonne National Laboratory for access to computing resources

