1 High-Performance Grid Computing and Research Networking
Algorithms on a Grid of Processors
Presented by Xing Hang
Instructor: S. Masoud Sadjadi (sadjadi At cs Dot fiu Dot edu)

2 Acknowledgements
The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
- Henri Casanova, Principles of High Performance Computing

3 2-D Torus Topology
- We've looked at a ring, but for some applications it's convenient to look at a 2-D grid topology.
- A 2-D grid with "wrap-around" is called a 2-D torus.
- Advanced parallel linear algebra libraries/languages allow one to combine arbitrary data distribution strategies with arbitrary topologies (ScaLAPACK, HPF):
  - 1-D block to a ring
  - 2-D block to a 2-D grid, cyclic or non-cyclic (more on this later)
- We can go through all the algorithms we saw on a ring and make them work on a √p x √p grid.
- In practice, for many linear algebra kernels, a 2-D block-cyclic distribution on a 2-D grid seems to work best in most situations:
  - we've seen that blocks are good for locality
  - we've seen that cyclic is good for load balancing

4 Semantics of a Parallel Linear Algebra Routine?
- Centralized: when calling a function (e.g., LU),
  - the input data is available on a single "master" machine,
  - the input data must then be distributed among the workers,
  - the output data must be undistributed and returned to the "master" machine.
  - More natural/easy for the user.
  - Allows the library to make data distribution decisions transparently to the user.
  - Prohibitively expensive if one does sequences of operations (and one almost always does so).
- Distributed: when calling a function (e.g., LU),
  - assume that the input is already distributed,
  - leave the output distributed.
  - May lead to having to "redistribute" data in between calls so that distributions match, which is harder for the user and may be costly as well. For instance, one may want to change the block size between calls, or go from a non-cyclic to a cyclic distribution.
- Most current software adopts the distributed approach:
  - more work for the user,
  - more flexibility and control.

5 Matrix-Matrix Multiply
- Many people have thought of doing a matrix multiply on a 2-D torus.
- Assume that we have three matrices A, B, and C, of size N x N.
- Assume that we have p processors, so that p = q² is a perfect square and our processor grid is q x q.
- We're looking at a 2-D block distribution, but not cyclic (again, that would obfuscate the code too much).
- We're going to look at three "classic" algorithms: Cannon, Fox, Snyder.
[Figure: 4x4 block decompositions of A, B, and C]

6 Cannon's Algorithm (1969)
- Very simple (comes from systolic arrays).
- Starts with a data redistribution for matrices A and B; the goal is to have only neighbor-to-neighbor communications:
  - A is circularly shifted/rotated "horizontally" so that its diagonal is on the first column of processors.
  - B is circularly shifted/rotated "vertically" so that its diagonal is on the first row of processors.
- This is called preskewing.
[Figure: A with each block row i rotated left by i, B with each block column j rotated up by j, and the untouched blocks of C]

7 Cannon's Algorithm
- Preskewing of A and B
- For k = 1 to q, in parallel:
  - Local: C = C + A*B
  - Vertical shift of B
  - Horizontal shift of A
- Postskewing of A and B
- Of course, computation and communication could be done in an overlapped fashion locally at each processor.
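Below is a minimal serial simulation of Cannon's algorithm in Python/NumPy, intended only to illustrate the preskewing and shift pattern described above. The function name cannon_simulated, the block-list data layout, and the shift directions are my own assumptions; a real implementation would use message passing (e.g., MPI) instead of array copies.

```python
import numpy as np

def cannon_simulated(A, B, q):
    """Serial simulation of Cannon's algorithm on a q x q grid.
    A and B are q x q lists of equally sized square NumPy blocks."""
    m = A[0][0].shape[0]
    C = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    # Preskewing: block row i of A rotated left by i, block column j of B rotated up by j
    Aloc = [[A[i][(i + j) % q] for j in range(q)] for i in range(q)]
    Bloc = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        # Local block multiply on every "processor" (i, j)
        for i in range(q):
            for j in range(q):
                C[i][j] += Aloc[i][j] @ Bloc[i][j]
        # Horizontal shift of A (left by one), vertical shift of B (up by one)
        Aloc = [[Aloc[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bloc = [[Bloc[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C  # postskewing omitted: it only restores A and B, not C

# Quick check against a plain NumPy multiply
q, m = 4, 3
A = [[np.random.rand(m, m) for _ in range(q)] for _ in range(q)]
B = [[np.random.rand(m, m) for _ in range(q)] for _ in range(q)]
assert np.allclose(np.block(cannon_simulated(A, B, q)), np.block(A) @ np.block(B))
```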

8 Execution Steps
[Figure: local computation on proc (0,0); horizontal shift of A and vertical shift of B; local computation on proc (0,0) with the shifted blocks]

9 Fox's Algorithm (1987)
- Originally developed for Caltech's Hypercube.
- Uses broadcasts and is also called the broadcast-multiply-roll algorithm:
  - broadcasts the (generalized) diagonals of matrix A,
  - uses a vertical shift of matrix B.
- No preskewing step.
[Figure: first, second, and third generalized diagonals of A]

10 Execution Steps
[Figure: initial state; broadcast of A's 1st diagonal along processor rows; local computation]

11 Execution Steps
[Figure: vertical shift of B; broadcast of A's 2nd diagonal along processor rows; local computation]

12 Fox's Algorithm
- // No initial data movement
- For k = 1 to q, in parallel:
  - Broadcast A's k-th (generalized) diagonal
  - Local: C = C + A*B
  - Vertical shift of B
- // No final data movement
- Note that there is an additional array to store the incoming diagonal block.
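As a companion to the pseudocode, here is a hedged serial sketch of Fox's broadcast-multiply-roll pattern in Python/NumPy. The name fox_simulated and the block-list layout are my own conventions (the same ones used in the Cannon sketch above), and the row broadcast is simulated by simply indexing the diagonal block.

```python
import numpy as np

def fox_simulated(A, B, q):
    """Serial simulation of Fox's algorithm on a q x q grid.
    A and B are q x q lists of equally sized square NumPy blocks."""
    m = A[0][0].shape[0]
    C = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    Bloc = [row[:] for row in B]  # B is shifted; A is only broadcast
    for k in range(q):
        for i in range(q):
            # Broadcast of A's k-th generalized diagonal along processor row i:
            # every processor (i, j) uses block A[i][(i + k) % q]
            Adiag = A[i][(i + k) % q]
            for j in range(q):
                C[i][j] += Adiag @ Bloc[i][j]
        # Vertical shift of B (up by one)
        Bloc = [[Bloc[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C

q, m = 4, 3
A = [[np.random.rand(m, m) for _ in range(q)] for _ in range(q)]
B = [[np.random.rand(m, m) for _ in range(q)] for _ in range(q)]
assert np.allclose(np.block(fox_simulated(A, B, q)), np.block(A) @ np.block(B))
```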

13 Snyder's Algorithm (1992)
- More complex than Cannon's or Fox's.
- First transposes matrix B.
- Uses reduction operations (sums) on the rows of matrix C.
- Shifts matrix B.

14 Execution Steps
[Figure: initial state; transpose of B; local computation]

15 Execution Steps
[Figure: shift of B; global sum on the rows of C; local computation]

16 Execution Steps
[Figure: shift of B; global sum on the rows of C; local computation]
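The step ordering shown in these slides (transpose, local multiply, row-wise sum, shift) can be mimicked serially. The sketch below is my own reconstruction, under the assumption that the k-th row reduction produces block C[i][(i + k) mod q] and that the transposed B is shifted vertically by one between steps; treat the indexing as illustrative rather than as Snyder's exact formulation.

```python
import numpy as np

def snyder_simulated(A, B, q):
    """Serial simulation of a Snyder-style transpose/reduce/shift scheme."""
    m = A[0][0].shape[0]
    C = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    # Transpose B at the block level: "processor" (i, j) now holds B[j][i]
    Bt = [[B[j][i] for j in range(q)] for i in range(q)]
    for k in range(q):
        # Local computation on every processor
        P = [[A[i][j] @ Bt[i][j] for j in range(q)] for i in range(q)]
        # Global sum on the rows of the processor grid; the result of the
        # k-th reduction is the block C[i][(i + k) % q]
        for i in range(q):
            C[i][(i + k) % q] = sum(P[i][j] for j in range(q))
        # Vertical shift of the (transposed) B, up by one
        Bt = [[Bt[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C

q, m = 4, 3
A = [[np.random.rand(m, m) for _ in range(q)] for _ in range(q)]
B = [[np.random.rand(m, m) for _ in range(q)] for _ in range(q)]
assert np.allclose(np.block(snyder_simulated(A, B, q)), np.block(A) @ np.block(B))
```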

17 Complexity Analysis
- Very cumbersome.
- Two models:
  - 4-port model: every processor can communicate with its 4 neighbors in one step. Can match underlying architectures like the Intel Paragon.
  - 1-port model: only one single communication at a time for each processor.
- Both models assume bi-directional communication.

18 One-Port Results
[Figure: running-time expressions for Cannon's, Fox's, and Snyder's algorithms under the one-port model]

19 Complexity Results
- m in these expressions is the block size.
- Expressions for the 4-port model are MUCH more complicated.
- Remember that this is all for non-cyclic distributions: formulae and code become very complicated for a full-fledged implementation (nothing divides anything, nothing's a perfect square, etc.).
- Performance analysis of real code is known to be hard; it is done in a few restricted cases.
- An interesting approach is to use simulation. Done in ScaLAPACK (Scalable Linear Algebra PACKage), for instance. Essentially: you have written a code so complex that you just run a simulation of it to figure out how fast it goes in different cases.

20 So What?
- Are we stuck with these rather cumbersome algorithms?
- Fortunately, there is a much simpler algorithm that's not as clever but is about as good in practice anyway.
- That's the one you'll implement in your programming assignment.

21 The Outer-Product Algorithm
- Remember the sequential matrix multiply:
  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C_ij = C_ij + A_ik * B_kj
- The first two loops are completely parallel, but the third one isn't; i.e., in shared memory it would require a mutex to protect the writing of the shared variable C_ij.
- One solution: view the algorithm as "n sequential steps" (see the sketch after this slide):
  for k = 1 to n      // done in sequence
    for i = 1 to n    // done in parallel
      for j = 1 to n  // done in parallel
        C_ij = C_ij + A_ik * B_kj
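A quick NumPy check of this loop reordering: each iteration of the outer k loop is a rank-1 update (the outer product of A's k-th column with B's k-th row), which is where the algorithm's name comes from. This is an illustrative serial snippet of my own, not code from the slides.

```python
import numpy as np

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))
for k in range(n):
    # Step k: add the outer product of A's k-th column and B's k-th row.
    # All n*n element updates inside this step are independent of one another.
    C += np.outer(A[:, k], B[k, :])
assert np.allclose(C, A @ B)
```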

22 The Outer-Product Algorithm
  for k = 1 to n      // done in sequence
    for i = 1 to n    // done in parallel
      for j = 1 to n  // done in parallel
        C_ij = C_ij + A_ik * B_kj
- During the k-th step, the processor who "owns" C_ij needs A_ik and B_kj.
- Therefore, at the k-th step, the k-th column of A and the k-th row of B must be broadcast over all processors.
- Let us assume a 2-D block distribution.

23 2-D Block Distribution
[Figure: 4x4 block decompositions of A, B, and C, with A's k-th column and B's k-th row highlighted]
- At each step, p - q processors receive a piece of the k-th column of A and p - q processors receive a piece of the k-th row of B (p = q² processors).
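To make the data movement concrete, here is a hedged serial sketch of the blocked outer-product algorithm, using the same block-list conventions and self-chosen names as the earlier sketches: at block step k, the blocks of A's k-th block column and of B's k-th block row are the only remote data each "processor" needs.

```python
import numpy as np

def outer_product_simulated(A, B, q):
    """Serial simulation of the blocked outer-product algorithm on a q x q grid."""
    m = A[0][0].shape[0]
    C = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    for k in range(q):
        # In a real run: processors in grid column k broadcast their A block
        # along their processor row, and processors in grid row k broadcast
        # their B block along their processor column.
        for i in range(q):
            for j in range(q):
                C[i][j] += A[i][k] @ B[k][j]
    return C

q, m = 4, 3
A = [[np.random.rand(m, m) for _ in range(q)] for _ in range(q)]
B = [[np.random.rand(m, m) for _ in range(q)] for _ in range(q)]
assert np.allclose(np.block(outer_product_simulated(A, B, q)),
                   np.block(A) @ np.block(B))
```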

24 Outer-Product Algorithm
- Once everybody has received a piece of column k of A and of row k of B, everybody can add to the C_ij's they are responsible for.
- And this is repeated n times.
- In your programming assignment:
  - Implement the outer-product algorithm.
  - Do the "theoretical" performance analysis with assumptions similar to the ones we have used in class so far.

25 Further Optimizations
- Send blocks of rows/columns to avoid too many small transfers. What is the optimal granularity?
- Overlap communication and computation by using asynchronous communication. How much can be gained?
- This is a simple and effective algorithm that is not too cumbersome.

26 Cyclic 2-D Distributions
- What if I want to run on 6 processors? It's not a perfect square.
- In practice, one makes distributions cyclic to accommodate various numbers of processors.
- How do we do this in 2-D? I.e., how do we do a 2-D block-cyclic distribution?

27 The 2-D Block-Cyclic Distribution
- Goal: try to have all the advantages of both the horizontal and the vertical 1-D block-cyclic distributions.
- Works whichever way the computation "progresses": left-to-right, top-to-bottom, wavefront, etc.
- Consider a number of processors p = r * c, arranged in an r x c grid.
- Consider a 2-D matrix of size N x N.
- Consider a block size b (which divides N).

28 The 2-D Block-Cyclic Distribution
[Figure: an N x N matrix partitioned into b x b blocks, with processors P0..P5 arranged as a 2 x 3 grid]

29 The 2-D Block-Cyclic Distribution
[Figure: the top-left b x b blocks of the matrix assigned to the 2 x 3 processor grid P0..P5]

30 The 2-D Block-Cyclic Distribution
[Figure: the full 2-D block-cyclic assignment of all b x b blocks to the 2 x 3 processor grid, wrapping around in both dimensions]
- Slight load imbalance; becomes negligible with many blocks.
- Index computations had better be implemented in separate functions (see the sketch below).
- Also: functions that tell a process who its neighbors are.
- Overall, this requires a whole infrastructure, but many think you can't go wrong with this distribution.
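As an illustration of the kind of index-computation helpers the bullet above refers to, here is a small Python sketch. The function names and the convention (the owner of a block is obtained by taking the block coordinates modulo the grid shape, as in ScaLAPACK-style layouts) are my own assumptions, not code from the course.

```python
def block_cyclic_owner(i, j, b, r, c):
    """Grid coordinates (proc_row, proc_col) of the processor owning
    global matrix element (i, j) under a 2-D block-cyclic distribution
    with block size b on an r x c processor grid."""
    return (i // b) % r, (j // b) % c

def global_to_local(i, b, nprocs):
    """Local index, on the owning processor, of global index i along one
    dimension (block size b, nprocs processors along that dimension)."""
    block, offset = i // b, i % b
    return (block // nprocs) * b + offset

# Example: element (7, 10), block size 2, on a 2 x 3 grid
print(block_cyclic_owner(7, 10, 2, 2, 3))                   # -> (1, 2)
print(global_to_local(7, 2, 2), global_to_local(10, 2, 3))  # -> 3 2
```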