CSc 8530 Matrix Multiplication and Transpose By Jaman Bhola

Outline
- Matrix multiplication – one processor
- Matrix multiplication – parallel
  - Algorithm
  - Example
  - Analysis
- Matrix transpose
  - Algorithm
  - Example
  - Analysis

Matrix multiplication – 1 processor

Using one processor – O(n^3) time.

Algorithm:
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            t = 0;
            for (k = 0; k < n; k++) {
                t = t + a[i][k] * b[k][j];
            }
            c[i][j] = t;
        }
    }

Matrix multiplication – parallel

Using a hypercube: the algorithm given in the book assumes the multiplication of two n x n matrices where n is a power of 2, which makes a hypercube network possible. It needs N = n^3 = 2^(3q) processors, where n = 2^q is the size of the matrix.

With N processors, each processor occupies a vertex of the hypercube. Each processor P_r has a given position (i, j, k), where r = i·n^2 + j·n + k for 0 <= i, j, k <= n-1.
If r is represented in binary by r_{3q-1} r_{3q-2} ... r_{2q} r_{2q-1} ... r_q r_{q-1} ... r_0, then the binary representations of i, j and k are r_{3q-1} ... r_{2q}, r_{2q-1} ... r_q and r_{q-1} ... r_0 respectively.
With this numbering, processors whose indices differ in exactly one binary digit are neighbours in the hypercube.

Moreover, all processors that agree in one or two of the coordinates i, j, k form a (lower-dimensional) hypercube. Example: building a hypercube for q = 1, we have N = n^3 = 2^(3q) = 8 processors, and for P_r with r = i·n^2 + j·n + k we get:

P_0: (i,j,k) = (0,0,0), r = 0·4 + 0·2 + 0 = 0
P_1: (i,j,k) = (0,0,1), r = 0·4 + 0·2 + 1 = 1
P_2: (i,j,k) = (0,1,0), r = 0·4 + 1·2 + 0 = 2
P_3: (i,j,k) = (0,1,1), r = 0·4 + 1·2 + 1 = 3
P_4: (i,j,k) = (1,0,0), r = 1·4 + 0·2 + 0 = 4
P_5: (i,j,k) = (1,0,1), r = 1·4 + 0·2 + 1 = 5
P_6: (i,j,k) = (1,1,0), r = 1·4 + 1·2 + 0 = 6
P_7: (i,j,k) = (1,1,1), r = 1·4 + 1·2 + 1 = 7
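To illustrate the indexing scheme, here is a small C sketch (my own illustration, not from the slides) that recovers (i, j, k) from a processor index r by bit manipulation; because n = 2^q, the coordinates are just three q-bit fields of r:

    #include <stdio.h>

    /* Decode a processor index r = i*n^2 + j*n + k, with n = 2^q, into (i, j, k). */
    int main(void) {
        int q = 1, n = 1 << q, N = n * n * n;
        for (int r = 0; r < N; r++) {
            int k = r & (n - 1);              /* low q bits    */
            int j = (r >> q) & (n - 1);       /* middle q bits */
            int i = (r >> (2 * q)) & (n - 1); /* high q bits   */
            printf("P_%d: i=%d j=%d k=%d\n", r, i, j, k);
        }
        return 0;
    }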

Processor Layout

Each processor P_r has three registers: A_r, B_r and C_r.
(Figure: processor P_0 shown with its registers A, B and C.)
The following is a step-by-step description of the algorithm.

Step 1: The elements of A and B are distributed to the n^3 processors so that the processor in position (i,j,k) will contain a_ji and b_ik.
(1.1) Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where 1 <= i <= n-1. This results in A(i,j,k) = a_jk and B(i,j,k) = b_jk for 0 <= i <= n-1.
(1.2) Copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where 0 <= k <= n-1. This results in A(i,j,k) = a_ji for 0 <= k <= n-1.
(1.3) Copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where 0 <= j <= n-1. This results in B(i,j,k) = b_ik for 0 <= j <= n-1.

Step 2: Each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) · B(i,j,k). Thus C(i,j,k) = a_ji · b_ik for 0 <= i, j, k <= n-1.
Step 3: The sum C(0,j,k) = Σ_i C(i,j,k), taken over 0 <= i <= n-1, is computed for 0 <= j, k <= n-1.

The algorithm:

Step 1:
(1.1) for m = 3q-1 downto 2q do
        for all r ∈ N(r_m = 0) do in parallel
          (i)  A_{r(m)} = A_r
          (ii) B_{r(m)} = B_r
        end for
      end for
(1.2) for m = q-1 downto 0 do
        for all r ∈ N(r_m = r_{2q+m}) do in parallel
          A_{r(m)} = A_r
        end for
      end for

Here r(m) denotes the index obtained from r by complementing bit r_m, and N(condition) is the set of processor indices satisfying the condition.

(1.3) for m = 2q-1 downto q do
        for all r ∈ N(r_m = r_{q+m}) do in parallel
          B_{r(m)} = B_r
        end for
      end for

Step 2:
for r = 0 to N-1 do in parallel
  C_r = A_r · B_r
end for

Step 3:
for m = 2q to 3q-1 do
  for all r ∈ N(r_m = 0) do in parallel
    C_r = C_r + C_{r(m)}
  end for
end for

An example using a 2 x 2 matrix. This example requires n^3 = 8 processors. The matrices are

A = | 1  2 |        B = | -1  -2 |
    | 3  4 |            | -3  -4 |

(Figure: snapshots of the hypercube showing the (A_r, B_r) register contents of the eight processors – pairs such as (1, -1), (2, -2), (3, -3), (4, -4) – during the distribution steps 1.1–1.3 of the example.)
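To make the data movement concrete, here is a small serial C sketch (my own illustration, not from the slides) that simulates the N = 8 processor registers as arrays and follows steps 1.1–1.3, 2 and 3 literally; r(m) is simulated by flipping bit m of r. Run on the 2 x 2 example above, it leaves C = A·B in the i = 0 layer.

    #include <stdio.h>

    #define Q 1                        /* n = 2^Q                      */
    #define NDIM (1 << Q)              /* matrix size n                 */
    #define NPROC (NDIM * NDIM * NDIM) /* N = n^3 = 2^(3q) processors   */

    int A[NPROC], B[NPROC], C[NPROC];

    int flip(int r, int m) { return r ^ (1 << m); }  /* r(m): complement bit m */
    int bit(int r, int m)  { return (r >> m) & 1; }

    int main(void) {
        int a[2][2] = {{1, 2}, {3, 4}};
        int b[2][2] = {{-1, -2}, {-3, -4}};
        int q = Q, n = NDIM, N = NPROC;

        /* initial distribution: layer i = 0 holds A(0,j,k) = a[j][k], B(0,j,k) = b[j][k] */
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                A[j*n + k] = a[j][k];
                B[j*n + k] = b[j][k];
            }

        /* Step (1.1): copy A and B along the i dimension (bits 2q .. 3q-1) */
        for (int m = 3*q - 1; m >= 2*q; m--)
            for (int r = 0; r < N; r++)
                if (bit(r, m) == 0) { A[flip(r, m)] = A[r]; B[flip(r, m)] = B[r]; }

        /* Step (1.2): spread A along the k dimension from processors with k = i */
        for (int m = q - 1; m >= 0; m--)
            for (int r = 0; r < N; r++)
                if (bit(r, m) == bit(r, 2*q + m)) A[flip(r, m)] = A[r];

        /* Step (1.3): spread B along the j dimension from processors with j = i */
        for (int m = 2*q - 1; m >= q; m--)
            for (int r = 0; r < N; r++)
                if (bit(r, m) == bit(r, q + m)) B[flip(r, m)] = B[r];

        /* Step 2: local products */
        for (int r = 0; r < N; r++) C[r] = A[r] * B[r];

        /* Step 3: reduce along the i dimension; the sums land in layer i = 0 */
        for (int m = 2*q; m <= 3*q - 1; m++)
            for (int r = 0; r < N; r++)
                if (bit(r, m) == 0) C[r] = C[r] + C[flip(r, m)];

        /* layer i = 0 now holds C = A x B (expected: -7 -10 / -15 -22) */
        for (int j = 0; j < n; j++, printf("\n"))
            for (int k = 0; k < n; k++) printf("%4d", C[j*n + k]);
        return 0;
    }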

Analysis of algorithm

If the layout of the processors is viewed as an n x n x n array, it consists of n layers, each an n x n array of processors. Initially the first layer (i = 0) holds a distinct element of matrix A in each A register and a distinct element of matrix B in each B register; this initial placement is a constant-time operation.
Step 1.1: In each iteration every layer that already holds the data sends it to the layer n/2, then n/4, etc. positions away, so O(log n) iterations suffice to copy the data from layer 0 to layers 1 through n-1.

Steps 1.2 and 1.3: Each processor in column i of layer i sends its data to the processors in its row, and similarly each processor in row i sends its data to the processors in its column. Each of these steps consists of q = log n constant-time iterations.
Step 2 requires constant time.
Step 3 consists of q constant-time iterations.
Overall, the algorithm requires O(log n) time, but its cost is O(n^3 log n), which is not optimal.
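As a quick worked cost check (cost = number of processors × parallel time), making explicit the reasoning behind "not optimal":

    t(n) = O(\log n), \qquad p(n) = n^{3}, \qquad c(n) = p(n)\cdot t(n) = O(n^{3}\log n)

Since the sequential algorithm runs in O(n^3) time, the cost exceeds it by a factor of log n, hence the algorithm is not cost-optimal.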

A faster algorithm – Quinn

For all P_m, where 1 <= m <= p:
    for i = m to n step p do
      for j = 1 to n do
        t = 0
        for k = 1 to n do
          t = t + a[i][k] * b[k][j]
        c[i][j] = t

Time: O(n^3/p + p); maximum number of processors: n^2.

An actual implementation

Each process first obtains its processor id (0 <= id < p). The if statement below makes sure the whole matrix is covered when n is not a multiple of p: process 0 absorbs the leftover rows.

    chunksize = (int)(n / p);           /* rows per process                 */
    if ((chunksize * p) < n) {          /* n not evenly divisible by p      */
        int differ = n - (chunksize * p);  /* leftover rows, given to process 0 */
        if (id == 0) {
            lower = 0;
            upper = chunksize + differ;
        } else {
            lower = id * chunksize + differ;
            upper = (id + 1) * chunksize + differ;
        }
    }

    else {
        lower = id * chunksize;
        upper = (id + 1) * chunksize;
    }

    /* each process computes its own strip of rows of the result */
    for (i = lower; i < upper; i++) {
        for (j = 0; j < n; j++) {
            total = 0;
            for (k = 0; k < n; k++) {
                total = total + mat1[i][k] * mat2[k][j];
            }
            mat3[i][j] = total;
        }
    }
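For context, here is a minimal sketch of how this row-striped kernel might sit inside an MPI program. This framing is my assumption – the slides do not show the surrounding code; the names mat1, mat2, mat3, id and p follow the slide, and n is assumed divisible by p to keep the sketch short.

    #include <mpi.h>
    #include <stdio.h>

    #define N 8                          /* matrix size; assumed divisible by p */

    int main(int argc, char *argv[]) {
        int id, p, n = N;
        static double mat1[N][N], mat2[N][N], mat3[N][N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);   /* "get the processor id" */
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (id == 0)                          /* root fills A and B with sample data */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    mat1[i][j] = i + j;
                    mat2[i][j] = i - j;
                }

        /* broadcasting both matrices keeps the sketch short;
           A could instead be scattered row strip by row strip */
        MPI_Bcast(mat1, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(mat2, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        int chunksize = n / p;
        int lower = id * chunksize;
        int upper = (id + 1) * chunksize;

        for (int i = lower; i < upper; i++)
            for (int j = 0; j < n; j++) {
                double total = 0.0;
                for (int k = 0; k < n; k++)
                    total += mat1[i][k] * mat2[k][j];
                mat3[i][j] = total;
            }

        /* gather the row strips of C back on process 0 */
        if (id == 0)
            MPI_Gather(MPI_IN_PLACE, chunksize * n, MPI_DOUBLE,
                       mat3, chunksize * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        else
            MPI_Gather(&mat3[lower][0], chunksize * n, MPI_DOUBLE,
                       NULL, 0, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }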

Another faster algorithm – Gupta & Sadayappan

The 3-D diagonal algorithm is a three-phase algorithm. The concept: a hypercube of p processors is viewed as a 3-D mesh of size ∛p x ∛p x ∛p. Matrices A and B are partitioned into p^(2/3) blocks, ∛p along each dimension (so each block holds n^2/p^(2/3) elements). Initially A and B are mapped onto the 2-D plane x = y, and the 2-D plane y = j is responsible for calculating the outer product of A_{*,j} (the set of block columns stored at processors p_{j,j,*}) and B_{j,*} (the corresponding set of block rows of B).

Phase 1: Point-to-point communication of B_{k,i} by p_{i,i,k} to p_{i,k,k}.
Phase 2: One-to-all broadcasts of the blocks of A along the x-direction and of the newly acquired (phase 1) blocks of B along the z-direction; i.e. processor p_{i,i,k} broadcasts A_{k,i} to p_{*,i,k}, and each processor of the form p_{i,k,k} broadcasts B_{k,i} to p_{i,k,*}.
At the end of phase 2 every processor p_{i,j,k} has the blocks A_{k,j} and B_{j,i}, and each processor now calculates the product of its pair of blocks.

Phase 3: After the computation there is a reduction by addition in the y-direction, which produces the final matrix C.

Algorithm Analysis

Phase 1: Passing messages of size n^2/p^(2/3) requires log(∛p) · (t_s + t_w · n^2/p^(2/3)) time, where t_s is the time it takes to start up message sending and t_w is the time it takes to send a word from one processor to its neighbour.
Phase 2 takes twice as much time as phase 1.
Phase 3 can be completed in the same amount of time as phase 1.
Overall, the algorithm needs (4/3)·log p message start-ups and (4/3)·log p · n^2/p^(2/3) words of communication; with a communication cost of t_s·a + t_w·b for a start-ups and b words, this gives t_s · (4/3)·log p + t_w · (4/3)·log p · n^2/p^(2/3).

Some added conditions:
1. p <= n^3.
2. The overall space used is 2n^2·∛p.
The above description is for a one-port hypercube architecture, in which a processor can use at most one communication link at a time to send and receive data. On a multi-port architecture, where a processor can use all of its communication ports simultaneously, the algorithm is faster, reducing the above communication time by a factor of log(∛p).

The algorithm

Initial distribution: processor p_{i,i,k} contains A_{k,i} and B_{k,i}.

Program of processor p_{i,j,k}:
if (i = j) then
  Send B_{k,i} to p_{i,k,k}                                     (phase 1)
  Broadcast A_{k,i} to all processors p_{*,i,k}                  (phase 2)
endif
if (j = k) then
  Broadcast the received B_{j,i} to all processors p_{i,j,*}     (phase 2)
endif
Receive A_{k,j} from p_{j,j,k} and B_{j,i} from p_{i,j,j}
Calculate I_{k,i} = A_{k,j} x B_{j,i}
Send I_{k,i} to p_{i,i,k}                                        (phase 3)

if (i = j) then
  for l = 0 to ∛p - 1
    Receive I_{k,i} from p_{i,l,k}
    C_{k,i} = C_{k,i} + I_{k,i}
  endfor
endif

Here I is an intermediate (partial-product) matrix, and l runs over the ∛p processors in the y-direction.

Matrix Transposition

The same concept is used here as in matrix multiplication. The number of processors used is N = n^2 = 2^(2q), and processor P_r occupies position (i,j), where r = i·n + j and 0 <= i, j <= n-1. Initially processor P_r holds element a_ij of matrix A, where r = i·n + j. Upon termination, processor P_s holds element a_ij, where s = j·n + i.

If r is represented in binary by r_{2q-1} r_{2q-2} ... r_q r_{q-1} ... r_1 r_0, then the binary representations of i and j are r_{2q-1} ... r_q and r_{q-1} ... r_0 respectively.
Similarly, s is represented by s_{2q-1} s_{2q-2} ... s_q s_{q-1} ... s_1 s_0, and the binary representations of j and i are s_{2q-1} ... s_q and s_{q-1} ... s_0 respectively.
Thus r_{2q-1} r_{2q-2} ... r_q = s_{q-1} ... s_1 s_0 and r_{q-1} r_{q-2} ... r_0 = s_{2q-1} s_{2q-2} ... s_q; transposition simply exchanges the two q-bit halves of the binary index.
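A tiny C sketch (my own illustration, not from the slides) of this index relation: for n = 2^q, the index s of the processor that ends up with a_ij is obtained from r = i·n + j by swapping the two q-bit halves of r.

    #include <stdio.h>

    int main(void) {
        int q = 2, n = 1 << q, N = n * n;
        for (int r = 0; r < N; r++) {
            int i = r >> q;              /* high q bits of r */
            int j = r & (n - 1);         /* low q bits of r  */
            int s = (j << q) | i;        /* s = j*n + i      */
            printf("a(%d,%d): held by P_%d initially, by P_%d after transposition\n",
                   i, j, r, s);
        }
        return 0;
    }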

The algorithm

Requirements: each processor P_u has two registers, A_u and B_u. The index of P_u is written u = u_{2q-1} u_{2q-2} ... u_q u_{q-1} ... u_1 u_0, matching the representation of r above.

for m = 2q-1 downto q do
  for u = 0 to N-1 do in parallel
    (1) if u_m ≠ u_{m-q} then B_{u(m)} = A_u endif
    (2) if u_m = u_{m-q} then A_{u(m-q)} = B_u endif
  end for
end for

(As before, u(m) denotes the index obtained from u by complementing bit u_m.)

Explanation of algorithm

The algorithm achieves the transpose of A recursively. Divide the matrix into four n/2 x n/2 submatrices. In iteration 1, when m = 2q-1, the elements of the top-right submatrix are swapped with those of the bottom-left submatrix; the other two submatrices are not touched. The same exchange is then applied recursively inside each submatrix until all of the elements have been swapped.

Example. We want the transpose of the following matrix:

A = | a b c d |
    | e f g h |
    | i j k l |
    | m n o p |

We use 16 processors, indexed in row-major order: processor P_r holds the element in row i, column j, where r = 4i + j (so P_0 ... P_3 hold the first row, P_4 ... P_7 the second row, and so on).

Drawing a hypercube for this: (figure – a 4-dimensional hypercube whose 16 vertices are the processors P_0 through P_15, with links between processors whose indices differ in exactly one bit).

Processor 0 – binary 0000 – holds a_00, which is the value a.
Processor 1 – binary 0001 – holds a_01, which is the value b.
Processor 2 – binary 0010 – holds a_02, which is the value c.
And so on.
In the first iteration m = 2q-1; with q = 2 in this example, m = 3.
Step 1: Each P_u with u_3 ≠ u_1 sends its element A_u to P_{u(3)}, which stores the value in B_{u(3)}; i.e. processors 2, 3, 6 and 7 send to processors 10, 11, 14 and 15 respectively, and processors 8, 9, 12 and 13 send to processors 0, 1, 4 and 5 respectively.

Step 2: Each processor that received data in step 1 now sends the data from B_u to P_{u(1)}, to be stored in A_{u(1)}; i.e.
processors 0, 1, 4, 5 send to 2, 3, 6, 7 respectively;
processors 10, 11, 14, 15 send to 8, 9, 12, 13 respectively.
By the end of the first iteration our matrix A looks like:

A = | a b i j |
    | e f m n |
    | c d k l |
    | g h o p |

In the second iteration, when m = q = 2:
Step 1: Each P_u with u_2 ≠ u_0 sends A_u to P_{u(2)}, which stores it in B_{u(2)}. This is a simultaneous transfer:
processor 4 to processor 0, processor 1 to processor 5,
processor 6 to processor 2, processor 3 to processor 7,
processor 12 to processor 8, processor 9 to processor 13,
processor 14 to processor 10, processor 11 to processor 15.

Step 2: For u_2 = u_0, each P_u sends B_u to P_{u(0)}, where it is stored in A_{u(0)}; this swaps the element in the top-right corner processor with that in the bottom-left corner for each of the 2 x 2 submatrices:
processor 0 to processor 1, processor 5 to processor 4,
processor 2 to processor 3, processor 7 to processor 6,
processor 8 to processor 9, processor 13 to processor 12,
processor 10 to processor 11, processor 15 to processor 14.

After the second iteration we have the following transposed matrix:

A = | a e i m |
    | b f j n |
    | c g k o |
    | d h l p |
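To check the data movement, here is a small serial C sketch (my own illustration, not from the slides) that simulates the 16 processor registers as arrays and follows the two steps of each iteration literally; run on the 4 x 4 example above, it prints the transpose.

    #include <stdio.h>

    #define Q 2                     /* n = 2^Q = 4                */
    #define NDIM (1 << Q)
    #define NPROC (NDIM * NDIM)     /* N = n^2 = 16 processors     */

    char A[NPROC], B[NPROC];

    int flip(int u, int m) { return u ^ (1 << m); }  /* u(m): complement bit m */
    int bit(int u, int m)  { return (u >> m) & 1; }

    int main(void) {
        int q = Q, n = NDIM, N = NPROC;

        /* initially P_u holds a_ij with u = i*n + j; here a_ij is the letter 'a' + u */
        for (int u = 0; u < N; u++) A[u] = (char)('a' + u);

        for (int m = 2*q - 1; m >= q; m--) {
            /* step (1): processors with u_m != u_{m-q} send A across dimension m */
            for (int u = 0; u < N; u++)
                if (bit(u, m) != bit(u, m - q)) B[flip(u, m)] = A[u];
            /* step (2): processors with u_m == u_{m-q} forward B across dimension m-q */
            for (int u = 0; u < N; u++)
                if (bit(u, m) == bit(u, m - q)) A[flip(u, m - q)] = B[u];
        }

        /* P_u now holds a_ji: print the transposed matrix (a e i m / b f j n / ...) */
        for (int i = 0; i < n; i++, printf("\n"))
            for (int j = 0; j < n; j++) printf(" %c", A[i*n + j]);
        return 0;
    }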

Algorithm Analysis

The algorithm takes q constant-time iterations, giving t(n) = O(log n), but it uses n^2 processors. Therefore the cost is O(n^2 log n), which is not optimal.

Bibliography

Akl, S.G., Parallel Computation: Models and Methods, Prentice Hall, 1997.
Drake, J.B. and Luo, Q., "A Scalable Parallel Strassen's Matrix Multiplication Algorithm for Distributed-Memory Computers," Proceedings of the 1995 ACM Symposium on Applied Computing, February 1995.
Gupta, H. and Sadayappan, P., "Communication Efficient Matrix Multiplication on Hypercubes," Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994.
Quinn, M.J., Parallel Computing – Theory and Practice, McGraw-Hill, 1997.