A Framework for Distributed Tensor Computations
Martin Schatz, Bryan Marker, Robert van de Geijn – The University of Texas at Austin
Tze Meng Low – Carnegie Mellon University
Tamara G. Kolda – Sandia National Laboratories: Livermore

Envisioned workflow
1. A new architecture comes out
2. Scientists specify what they want computed on the new architecture to (computer) scientists
3. (Computer) scientists provide an efficient library for the computation on the new architecture
4. Scientists do science
Formality is key!

Goals
Formally describe distribution of tensor data on processing grids
Identify patterns in collective communications to utilize specialized implementations when possible
Provide systematic approach to creating algorithms and implementations for problems
Achieve high performance

Outline
Description of parallel matrix-matrix multiplication
Quick overview of tensors and tensor contractions
A notation for distributing/redistributing tensors
A method for deriving algorithms

Data Distribution Approach
“Cyclically wrap” each mode of the tensor on the grid
Assign elements of the tensor to processes based on the assigned indices
When restricted to 2-D objects on 2-D grids, the ideas correspond to the theory of the Elemental library [1]
[1] Martin D. Schatz, Jack Poulson, and Robert van de Geijn. Parallel matrix multiplication: 2D and 3D. FLAME Working Note #62, TR-12-13, The University of Texas at Austin, Department of Computer Sciences, June 2012.
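As a minimal sketch of elemental cyclic wrapping (not the authors' library; the grid shape and matrix size are made up for illustration), the owner of an element is obtained by taking its index modulo the grid dimension in every mode:

```python
def owner(index, grid_shape):
    """Grid coordinates of the process that owns a tensor element under
    elemental cyclic wrapping: an index i in a mode wrapped over a grid
    mode of dimension d lands on grid coordinate i mod d."""
    return tuple(i % d for i, d in zip(index, grid_shape))

# Example: a 4 x 6 matrix wrapped onto a 2 x 3 grid, one grid mode per
# tensor mode (which tensor mode pairs with which grid mode is a choice).
grid = (2, 3)
for i in range(4):
    print([owner((i, j), grid) for j in range(6)])
```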

Assumptions
Assume a computing grid arranged as an order-N object
Elements of tensors are wrapped elemental-cyclically on the grid
For this example, we assume an order-2 tensor (a matrix) on an order-2 grid

Data distribution notation: The Basics
Assign a distribution scheme to each mode of the object
– Callouts in the figure indicate how the indices of columns (mode 0) and the indices of rows (mode 1) are distributed, each based on a mode of the grid
The tuple assigned to each mode is referred to as the “mode distribution”

Example 1
Distribute indices of columns based on mode 0 of the grid
Distribute indices of rows based on mode 1 of the grid
(Figures: step-by-step elemental cyclic wrapping of the matrix elements onto the grid.)
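To make Example 1 concrete, here is a small sketch (made-up 6 x 6 matrix and 2 x 3 grid; the pairing of tensor modes with grid modes is chosen for illustration) of the strided block each process ends up owning:

```python
import numpy as np

# Under elemental cyclic wrapping, the process at grid coordinates (p, q)
# of a d0 x d1 grid owns the strided slice A[p::d0, q::d1] when one tensor
# mode is wrapped over grid mode 0 and the other over grid mode 1.
A = np.arange(36).reshape(6, 6)
d0, d1 = 2, 3
for p in range(d0):
    for q in range(d1):
        print(f"process ({p},{q}) owns:\n{A[p::d0, q::d1]}")
```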

Notes
Distributions wrap elements on a logical view of the grid
– Allows for multiple grid modes to be used in symbols (a symbol combining grid modes views the grid with those modes merged)
– An empty mode distribution represents replication

Notes
We use boldface lowercase Roman letters to refer to mode distributions
Elements of mode distributions are denoted with subscripts
A notation is also defined for concatenation of mode distributions

Elemental Notation
The distributions of Elemental can be viewed in terms of the defined notation

Parallel Matrix Multiplication
Heuristic
– Avoid communicating the “large” matrix
– Leads to “Stationary” A, B, and C algorithm variants
Stationary C algorithm:
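As a rough serial emulation of the stationary C idea (not the authors' code; shapes and block size are made up for illustration): C is never moved, while the panels of A and B are the pieces that would be communicated within grid rows and columns in a distributed run.

```python
import numpy as np

m = n = k = 4
b = 2                                    # algorithmic block size
A, B = np.random.rand(m, k), np.random.rand(k, n)
C = np.zeros((m, n))

for k0 in range(0, k, b):                # march over the summed dimension
    A_panel = A[:, k0:k0 + b]            # would be allgathered within grid rows
    B_panel = B[k0:k0 + b, :]            # would be allgathered within grid columns
    C += A_panel @ B_panel               # local rank-b update; C stays in place

assert np.allclose(C, A @ B)
```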

Outline
Description of parallel matrix-matrix multiplication
Quick overview of tensors and tensor contractions
A notation for distributing/redistributing tensors
A method for deriving algorithms

Tensors and tensor contraction
Tensor
– An order-m (m-mode) operator
Each mode is associated with a feature of the application
– Modes have a fixed length (dimension)

Notation
Tensors are written in capital script letters
Elements of tensors in lowercase Greek letters
An element’s location in the tensor is given as subscripts

Tensor contractions
Einstein notation [1] implicitly sums over modes shared by the inputs
Transpose corresponds to an interchange of modes
An arbitrary number of modes may be involved (any of which can be summed)
[1] A. Einstein. Die Grundlage der allgemeinen Relativitätstheorie. Annalen der Physik, 354:769–822, 1916.
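For readers less familiar with the summation convention, the same contractions can be written directly with numpy.einsum (sizes are arbitrary; this is only an illustration, not code from the talk):

```python
import numpy as np

A, B = np.random.rand(3, 4), np.random.rand(4, 5)
C = np.einsum('ik,kj->ij', A, B)          # c_{ij} = sum_k a_{ik} b_{kj}
At = np.einsum('ij->ji', A)               # transpose = interchange of modes

# An order-4 contraction summing over two shared modes.
T = np.random.rand(3, 3, 4, 4)
W = np.random.rand(4, 4, 5, 5)
Z = np.einsum('abcd,cdef->abef', T, W)    # z_{abef} = sum_{cd} t_{abcd} w_{cdef}
print(C.shape, At.shape, Z.shape)
```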

Tensor contractions
Third-order Møller-Plesset method [1] from computational chemistry
[1] R. J. Bartlett. Many-body perturbation theory and coupled cluster theory for electron correlation in molecules. Annual Review of Physical Chemistry, 32(1):359–401, 1981.

Tensor contraction as MMmult
Through permutation of the data, the operands can be arranged so that a matrix-matrix multiplication can be performed
Results in an algorithm of the form: permute, multiply, permute
Requires a large rearrangement of data
– The cost of this operation is magnified in distributed-memory environments
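A small sketch of this matrix-mapping approach (index names and sizes are made up; the up-front permutation is the large rearrangement of data the slide refers to):

```python
import numpy as np

# Computes z_{abef} = sum_{cd} t_{cadb} w_{cdef} by permuting, reshaping,
# and calling one GEMM.
T = np.random.rand(4, 3, 4, 3)           # modes ordered (c, a, d, b)
W = np.random.rand(4, 4, 5, 5)           # modes ordered (c, d, e, f)

Tp = T.transpose(1, 3, 0, 2)             # permute to (a, b, c, d): the costly rearrangement
Tm = Tp.reshape(3 * 3, 4 * 4)            # free modes (a,b) become rows, summed modes (c,d) columns
Wm = W.reshape(4 * 4, 5 * 5)             # summed modes (c,d) rows, free modes (e,f) columns
Z = (Tm @ Wm).reshape(3, 3, 5, 5)        # one GEMM, then restore the tensor shape

assert np.allclose(Z, np.einsum('cadb,cdef->abef', T, W))
```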

Outline
Description of parallel matrix-matrix multiplication
Quick overview of tensors and tensor contractions
A notation for distributing/redistributing tensors
A method for deriving algorithms

Tensor distribution notation
We’ve already seen the notation for order-2 tensors on order-2 grids
What if the tensor has higher order?
– More modes to assign distribution symbols to
– Ex. an order-4 tensor
What if the grid has higher order?
– More grid modes to choose from when creating distribution symbols
– Ex. mode distributions may only contain elements from {0, 1, 2} when computing on an order-3 grid
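A hedged sketch of how this might generalize: each tensor mode is assigned a tuple of grid modes (its mode distribution), and ownership is still obtained by cyclic wrapping. The linearization order used when several grid modes are combined, and the reading of an empty tuple as "not split", are assumptions of this sketch rather than definitions from the talk.

```python
def owner_coords(index, dist, grid_shape):
    """Grid coordinates of the process owning `index` when tensor mode k is
    wrapped over the grid modes listed in dist[k]."""
    coords = [0] * len(grid_shape)        # unused grid modes reported as 0 here;
    for i, grid_modes in zip(index, dist):  # a real distribution would replicate over them
        for g in grid_modes:              # peel off one grid mode at a time
            coords[g] = i % grid_shape[g]
            i //= grid_shape[g]
    return tuple(coords)

# An order-4 tensor on an order-3 grid: tensor mode 0 wrapped over the
# combined grid modes (0, 2), tensor mode 1 over grid mode 1, and tensor
# modes 2 and 3 not split (every process holds their full index ranges).
grid_shape = (2, 3, 2)
dist = ((0, 2), (1,), (), ())
print(owner_coords((5, 4, 3, 7), dist, grid_shape))
```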

Redistributions: Allgather
Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.

Allgather in action
(Figures: the distribution of the data before and after the allgather.)

Redistributions: Allgather
An allgather within a grid mode performs the following redistribution of data: the pieces wrapped over that grid mode are collected, so the gathered grid mode is removed from the corresponding mode distribution
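A hedged mpi4py sketch of the effect (assumes an MPI launch with an even number of processes, at least two, e.g. `mpiexec -n 4`; the grid shape and vector length are made up): after the allgather within one grid mode, the data is no longer split along that mode.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
d0, d1 = 2, comm.size // 2                 # assumed d0 x d1 process grid
p, q = divmod(comm.rank, d1)               # this process's grid coordinates

# A vector wrapped elementally over grid mode 1: process (p, q) holds the
# entries q, q + d1, q + 2*d1, ...
x = np.arange(4 * d1, dtype='d')
local = x[q::d1]

# Allgather within grid mode 1, i.e. among the processes sharing grid row p.
row_comm = comm.Split(color=p, key=q)
pieces = row_comm.allgather(local)

# Interleaving the gathered pieces recovers the full vector; the data is now
# replicated over grid mode 1, so that grid mode drops out of the distribution.
full = np.empty_like(x)
for r, piece in enumerate(pieces):
    full[r::d1] = piece
assert np.allclose(full, x)
```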

Redistribution rules
Communication within the grid modes specified by a mode distribution can perform redistributions of the data along those modes; each collective has a corresponding rule describing how it changes the distribution symbols

Outline
Description of parallel matrix-matrix multiplication
Quick overview of tensors and tensor contractions
A notation for distributing/redistributing tensors
A method for deriving algorithms

Algorithm choices
For matrix operations, “Stationary” variants are useful
– Is extending these ideas to tensors also useful?
Potentially other “families” of algorithms to choose from
– Only focusing on those we know how to encode for now

Deriving Algorithms: Stationary
Assumed order-4 grid
Avoid communicating the stationary operand
Distribute modes similarly during local computation
Do not reuse modes of the grid

Deriving Algorithms: Stationary (continued)
Assumed order-4 grid
Avoid communicating the stationary operand
Distribute modes similarly during local computation
Do not reuse modes of the grid
The output does not have duplication (a reasonable choice)
Apply the rules of reduction redistribution
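As a purely hypothetical walk-through of these heuristics (the contraction, index names, and distributions below are assumptions for illustration, not the example worked in the talk), consider keeping C stationary for c_{abij} = sum_{ef} a_{abef} b_{efij} on an order-4 grid:

```python
# C is not communicated, so give it the natural distribution on the grid.
C_dist = {'a': (0,), 'b': (1,), 'i': (2,), 'j': (3,)}

# Modes shared with C are distributed the same way during the local
# computation; grid modes may not be reused, so the contracted modes e and f
# are left undistributed on the inputs (each process holds their full range).
A_dist = {'a': C_dist['a'], 'b': C_dist['b'], 'e': (), 'f': ()}
B_dist = {'e': (), 'f': (), 'i': C_dist['i'], 'j': C_dist['j']}

print(A_dist)   # {'a': (0,), 'b': (1,), 'e': (), 'f': ()}
print(B_dist)   # {'e': (), 'f': (), 'i': (2,), 'j': (3,)}
```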

Quick Note
Blocking the described algorithms should be straightforward (this has been done for matrix operations)

Analyzing algorithms
Communication costs are obtained from: Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.
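For orientation, here are rough alpha-beta estimates of the kind used in such analyses (generic textbook-style formulas, not necessarily the exact expressions from the cited paper or the talk):

```python
from math import ceil, log2

def allgather_cost(p, n_local, alpha, beta):
    """Each of p processes contributes n_local items and receives the rest:
    roughly log2(p) message startups plus (p-1)*n_local items moved."""
    return ceil(log2(p)) * alpha + (p - 1) * n_local * beta

def alltoall_cost(p, n_local, alpha, beta):
    """Each process holds n_local items and keeps only 1/p of them; a simple
    pairwise-exchange estimate uses p-1 startups per process."""
    return (p - 1) * alpha + (p - 1) / p * n_local * beta

# Example: redistributing a block over the processes of two grid modes.
print(allgather_cost(p=4, n_local=250_000, alpha=1e-6, beta=1e-9))
print(alltoall_cost(p=4, n_local=1_000_000, alpha=1e-6, beta=1e-9))
```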

Analyzing the Stationary algorithm (order-4 grid)
Redistribute one input
– All-to-all within grid modes (2, 3)
– Allgather within grid modes (1, 2)
Redistribute the other input
– All-to-all within grid modes (0, 1)
– Allgather within grid modes (3, 0)
Local tensor contraction

Analyzing the Matrix-mapping approach
Permute
Local tensor contraction
Permute

Picking the “best” algorithm
Stationary algorithm vs. matrix-multiply based algorithm
(Figure: comparison of the collectives involved and the processes participating in each.)

How this all fits together
Formalized aspects of distributed tensor computation
– Rules defining valid data distributions
– Rules specifying how collectives affect distributions
This gives a mechanical way to go from a problem specification to an implementation
If other knowledge can be formalized, the search space is reduced

Acknowledgements
Tamara G. Kolda – Sandia National Laboratories: Livermore
Robert van de Geijn
Bryan Marker
Devin Matthews
Tze Meng Low
The FLAME team

Thank you
This work has been funded by the following:
– Sandia National Laboratories: Sandia Graduate Fellowship
– NSF CCF: SHF: Small: From Matrix Computations to Tensor Computations
– NSF ACI / (supplement): Collaborative Research: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences
– Argonne National Laboratories for access to computing resources