CSc 8530 Matrix Multiplication and Transpose By Jaman Bhola
Outline
Matrix multiplication – one processor
Matrix multiplication – parallel: algorithm, example, analysis
Matrix transpose: algorithm, example, analysis
Matrix multiplication – 1 processor
Using one processor takes $O(n^3)$ time.
Algorithm:
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            t = 0;
            for (k = 0; k < n; k++) {
                t = t + a[i][k] * b[k][j];
            }
            c[i][j] = t;
        }
    }
Matrix multiplication – parallel
Using a hypercube: the algorithm given in the book multiplies two $n \times n$ matrices where n is a power of 2, $n = 2^q$, so that the processors map onto a hypercube network.
It needs $N = n^3 = 2^{3q}$ processors, where $n = 2^q$ is the size of the matrix.
With N processors, each processor occupies one vertex of the hypercube.
Each processor $P_r$ has a given position (i,j,k), where $r = i n^2 + j n + k$ for $0 \le i, j, k \le n-1$.
If r is represented in binary by $r_{3q-1} r_{3q-2} \ldots r_{2q}\, r_{2q-1} \ldots r_q\, r_{q-1} \ldots r_0$, then the binary representations of i, j and k are $r_{3q-1} r_{3q-2} \ldots r_{2q}$, $r_{2q-1} \ldots r_q$ and $r_{q-1} \ldots r_0$ respectively.
This places the processors so that the indices of two directly connected processors differ in exactly one binary digit.
Also, all processors that agree in one or two of the coordinates i, j, k form a hypercube of their own.
Example: build a hypercube for q = 1. Then n = 2 and $N = n^3 = 2^{3q} = 8$ processors, and for $P_r$, where $r = i n^2 + j n + k = 4i + 2j + k$, we get:
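i j k   P_r   r = decimal = binary
0 0 0   P_0   r = 0 = 000
0 0 1   P_1   r = 1 = 001
0 1 0   P_2   r = 2 = 010
0 1 1   P_3   r = 3 = 011
1 0 0   P_4   r = 4 = 100
1 0 1   P_5   r = 5 = 101
1 1 0   P_6   r = 6 = 110
1 1 1   P_7   r = 7 = 111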
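The position formula and the one-bit neighbour relation can be sketched in C as follows. The helper names `proc_index`, `proc_coords` and `flip_bit` are illustrative assumptions, not names from the source.

```c
/* Index r = i*n^2 + j*n + k of the processor at position (i, j, k), with n = 2^q.
   Because n is a power of 2, the three coordinates are just three q-bit fields of r. */
unsigned proc_index(unsigned i, unsigned j, unsigned k, unsigned q)
{
    return (i << (2 * q)) | (j << q) | k;
}

/* Recover (i, j, k) from r by slicing its 3q-bit binary representation. */
void proc_coords(unsigned r, unsigned q, unsigned *i, unsigned *j, unsigned *k)
{
    unsigned mask = (1u << q) - 1u;
    *i = (r >> (2 * q)) & mask;
    *j = (r >> q) & mask;
    *k = r & mask;
}

/* Index obtained from r by complementing bit m: the neighbour of P_r
   across dimension m of the hypercube. */
unsigned flip_bit(unsigned r, unsigned m)
{
    return r ^ (1u << m);
}
```

For q = 1, `proc_index(1, 0, 1, 1)` gives 5, matching the row for P_5, and `flip_bit(5, 1)` gives 7, one of the three neighbours of P_5.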
Processor Layout
Each processor $P_r$ has three registers: $A_r$, $B_r$ and $C_r$.
[Figure: processor P_0 with its registers A, B and C.]
The following is the step-by-step description of the algorithm.
Step 1: The elements of A and B are distributed to the $n^3$ processors so that the processor in position (i,j,k) contains $a_{ji}$ and $b_{ik}$. Initially, A(0,j,k) = $a_{jk}$ and B(0,j,k) = $b_{jk}$.
(1.1) Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where $1 \le i \le n-1$. This results in A(i,j,k) = $a_{jk}$ and B(i,j,k) = $b_{jk}$ for $0 \le i \le n-1$.
(1.2) Copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where $0 \le k \le n-1$. This results in A(i,j,k) = $a_{ji}$ for $0 \le k \le n-1$.
(1.3) Copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where $0 \le j \le n-1$. This results in B(i,j,k) = $b_{ik}$ for $0 \le j \le n-1$.
Step 2: Each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) * B(i,j,k). Thus C(i,j,k) = $a_{ji} b_{ik}$ for $0 \le i, j, k \le n-1$.
Step 3: The sum $C(0,j,k) = \sum_{i=0}^{n-1} C(i,j,k)$ is computed for $0 \le j, k \le n-1$.
The algorithm:
Notation: $r^{(m)}$ is the index obtained from r by complementing bit m, $r_m$ is bit m of r, and N(condition) is the set of indices r, $0 \le r \le N-1$, that satisfy the condition.
Step 1:
(1.1) for m = 3q-1 downto 2q do
          for all r ∈ N(r_m = 0) do in parallel
              (i)  A_{r^(m)} = A_r
              (ii) B_{r^(m)} = B_r
          end for
      end for
(1.2) for m = q-1 downto 0 do
          for all r ∈ N(r_m = r_{2q+m}) do in parallel
              A_{r^(m)} = A_r
          end for
      end for
(1.3) for m = 2q-1 downto q do
          for all r ∈ N(r_m = r_{q+m}) do in parallel
              B_{r^(m)} = B_r
          end for
      end for
Step 2:
      for r = 0 to N-1 do in parallel
          C_r = A_r * B_r
      end for
Step 3:
      for m = 2q to 3q-1 do
          for all r ∈ N(r_m = 0) do in parallel
              C_r = C_r + C_{r^(m)}
          end for
      end for
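The three steps can be tried out with a sequential simulation in C. This is a minimal sketch under stated assumptions: the arrays `A`, `B` and `C` stand in for the registers of the N processors, each "do in parallel" loop is replaced by an ordinary loop (safe here because senders and receivers in one iteration are disjoint), Step 1.3 runs over bits 2q-1 down to q, and the function name `hypercube_matmul` is hypothetical.

```c
#include <stdlib.h>

#define BIT(r, m) (((r) >> (m)) & 1)

/* Simulates the N = n^3 = 2^(3q) processors one after another.
   a, b, c are n x n matrices in row-major order (n = 2^q).
   Returns 0 on success, -1 if memory allocation fails. */
int hypercube_matmul(int q, const double *a, const double *b, double *c)
{
    int n = 1 << q, N = 1 << (3 * q), r, m;
    double *A = calloc((size_t)N, sizeof *A);
    double *B = calloc((size_t)N, sizeof *B);
    double *C = calloc((size_t)N, sizeof *C);
    if (!A || !B || !C) { free(A); free(B); free(C); return -1; }

    /* Initial load: layer i = 0 holds A(0,j,k) = a_jk and B(0,j,k) = b_jk. */
    for (r = 0; r < n * n; r++) { A[r] = a[r]; B[r] = b[r]; }

    /* Step 1.1: copy layer 0 to all layers by doubling across the i bits. */
    for (m = 3 * q - 1; m >= 2 * q; m--)
        for (r = 0; r < N; r++)
            if (BIT(r, m) == 0) {
                A[r ^ (1 << m)] = A[r];
                B[r ^ (1 << m)] = B[r];
            }

    /* Step 1.2: spread A(i,j,i) across the k bits, giving A(i,j,k) = a_ji. */
    for (m = q - 1; m >= 0; m--)
        for (r = 0; r < N; r++)
            if (BIT(r, m) == BIT(r, 2 * q + m))
                A[r ^ (1 << m)] = A[r];

    /* Step 1.3: spread B(i,i,k) across the j bits, giving B(i,j,k) = b_ik. */
    for (m = 2 * q - 1; m >= q; m--)
        for (r = 0; r < N; r++)
            if (BIT(r, m) == BIT(r, q + m))
                B[r ^ (1 << m)] = B[r];

    /* Step 2: one multiplication per processor. */
    for (r = 0; r < N; r++)
        C[r] = A[r] * B[r];

    /* Step 3: sum over the i bits so that layer 0 ends up with the totals. */
    for (m = 2 * q; m <= 3 * q - 1; m++)
        for (r = 0; r < N; r++)
            if (BIT(r, m) == 0)
                C[r] += C[r ^ (1 << m)];

    /* C(0,j,k) = c_jk sits in processors 0 .. n^2 - 1. */
    for (r = 0; r < n * n; r++) c[r] = C[r];

    free(A); free(B); free(C);
    return 0;
}
```

Calling it with q = 1 on two 2 x 2 matrices exercises all eight simulated processors; larger q only changes the number of loop iterations, not the structure.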
An example using 2 x 2 matrices. This example requires $n^3 = 8$ processors.
The matrices are $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} -1 & -2 \\ -3 & -4 \end{pmatrix}$.
[Figure: hypercube diagrams showing the (A, B) register pair of each of the eight processors after the initial load and as Step 1 proceeds.]
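As a check on the example, the expected result can be worked out by hand. This assumes the matrices are A = (1 2; 3 4) and B = (-1 -2; -3 -4), which is what the register values in the figure suggest. The processor in position (i,j,k) forms $a_{ji} b_{ik}$ in Step 2, and Step 3 adds the contributions of layers i = 0 and i = 1:

```latex
\begin{align*}
c_{00} &= a_{00}b_{00} + a_{01}b_{10} = (1)(-1) + (2)(-3) = -7 \\
c_{01} &= a_{00}b_{01} + a_{01}b_{11} = (1)(-2) + (2)(-4) = -10 \\
c_{10} &= a_{10}b_{00} + a_{11}b_{10} = (3)(-1) + (4)(-3) = -15 \\
c_{11} &= a_{10}b_{01} + a_{11}b_{11} = (3)(-2) + (4)(-4) = -22
\end{align*}
```

In each line the first product is computed in layer i = 0 and the second in layer i = 1.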
Analysis of algorithm
If the layout of the processors is viewed as an n x n x n array, it consists of n layers, each an n x n array of processors.
Initially, each processor in layer 0 has a distinct element of matrix A in its A register and a distinct element of matrix B in its B register. Loading them is a constant-time operation.
Step 1.1: The number of layers holding the data doubles in every iteration, so copying the data from layer 0 to layers 1 through n-1 takes O(log n) time.
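Steps 1.2 and 1.3: Each processor in column i of layer i sends its data to the processors in its row. Similarly, each processor in row i of layer i sends its data to the processors in its column. Each of these steps requires q = log n constant-time iterations.
Step 2 requires constant time.
Step 3 requires q = log n constant-time iterations.
Overall, the algorithm requires O(log n) time.
But its cost is $O(n^3 \log n)$, which is not optimal, since the one-processor algorithm takes $O(n^3)$ time.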
A faster algorithm - Quinn
For all P_m, where 1 <= m <= p, do in parallel:
    for i = m to n step p do
        for j = 1 to n do
            t = 0
            for k = 1 to n do
                t = t + a[i][k] * b[k][j]
            c[i][j] = t
Time: $O(n^3/p + p)$. Maximum number of processors: $n^2$.
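The row-interleaved loop can be sketched in C as below, with the p processors simulated one after another. The function names are hypothetical, and the 1-based indices of the pseudocode are mapped onto 0-based row-major arrays.

```c
/* Work done by processor P_m (1 <= m <= p): rows m, m+p, m+2p, ...
   a, b, c are n x n matrices stored row-major. Indices i, j, k are 1-based
   as in the pseudocode and shifted by one when touching memory. */
void row_interleaved_worker(int m, int p, int n,
                            const double *a, const double *b, double *c)
{
    int i, j, k;
    for (i = m; i <= n; i += p)
        for (j = 1; j <= n; j++) {
            double t = 0.0;
            for (k = 1; k <= n; k++)
                t += a[(i - 1) * n + (k - 1)] * b[(k - 1) * n + (j - 1)];
            c[(i - 1) * n + (j - 1)] = t;
        }
}

/* Sequential stand-in for "for all P_m do in parallel". The workers write
   disjoint rows of c, so the order in which they run does not matter. */
void row_interleaved_matmul(int p, int n,
                            const double *a, const double *b, double *c)
{
    int m;
    for (m = 1; m <= p; m++)
        row_interleaved_worker(m, p, n, a, b, c);
}
```

In this row-wise sketch, a worker with m > n finds no row to process, so choosing p larger than n only adds idle workers.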
An actual implementation
First get the processor id. The row bounds below make sure that every row of the matrix is computed even when n is not a multiple of p: processor 0 takes the leftover rows.
    chunksize = n / p;                     /* integer division */
    differ = n - (chunksize * p);          /* leftover rows, 0 when p divides n */
    if (id == 0) {
        lower = 0;
        upper = chunksize + differ;        /* processor 0 absorbs the leftover rows */
    } else {
        lower = id * chunksize + differ;
        upper = (id + 1) * chunksize + differ;
    }
    /* each processor computes rows lower .. upper-1 of the product */
    for (i = lower; i < upper; i++) {
        for (j = 0; j < n; j++) {
            total = 0;
            for (k = 0; k < n; k++) {
                total = total + mat1[i][k] * mat2[k][j];
            }
            mat3[i][j] = total;
        }
    }
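A common alternative to giving all leftover rows to one processor is to spread them out, one extra row each to the first n mod p processors. This balanced split is not what the slide does; it is offered only as a sketch for comparison, and the name `balanced_row_range` is hypothetical.

```c
/* Half-open row range [*lower, *upper) for processor id (0 <= id < p)
   when n rows are shared among p processors.
   The first n % p processors get one extra row each. */
void balanced_row_range(int id, int p, int n, int *lower, int *upper)
{
    int chunksize = n / p;      /* rows every processor gets */
    int extra = n % p;          /* rows left over */
    int before = id < extra ? id : extra;   /* extra rows given to lower ids */

    *lower = id * chunksize + before;
    *upper = *lower + chunksize + (id < extra ? 1 : 0);
}
```

With n = 10 and p = 4 the ranges are [0,3), [3,6), [6,8) and [8,10), so no processor has more than one row above the minimum.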
Another faster algorithm – Gupta & Sadayappan
The 3-D Diagonal algorithm is a three-phase algorithm.
The concept: a hypercube of p processors is viewed as a 3-D mesh of size $\sqrt[3]{p} \times \sqrt[3]{p} \times \sqrt[3]{p}$.
Matrices A and B are each partitioned into $p^{2/3}$ blocks, with $\sqrt[3]{p}$ blocks along each dimension.
Initially, A and B are assumed to be mapped onto the 2-D plane x = y. The 2-D plane y = j is responsible for calculating the outer product of $A_{*,j}$ (the set of columns of A stored at processors $p_{j,j,*}$) and $B_{j,*}$ (the corresponding set of rows of B).
Phase 1: Point-to-point communication of $B_{k,i}$ from $p_{i,i,k}$ to $p_{i,k,k}$.
Phase 2: One-to-all broadcasts of the blocks of A along the x-direction and of the newly acquired blocks of B (from Phase 1) along the z-direction; i.e. processor $p_{i,i,k}$ broadcasts $A_{k,i}$ to $p_{*,i,k}$, and every processor of the form $p_{i,k,k}$ broadcasts $B_{k,i}$ to $p_{i,k,*}$.
At the end of Phase 2, every processor $p_{i,j,k}$ has blocks $A_{k,j}$ and $B_{j,i}$.
Each processor now calculates the product of its pair of blocks of A and B.
Phase 3: After computation, there is reduction by addition in the y-direction providing the final matrix C.
Algorithm Analysis
Phase 1: Passing messages of size $n^2 / p^{2/3}$ requires $\log(\sqrt[3]{p}) \, (t_s + t_w \, n^2 / p^{2/3})$ time, where $t_s$ is the start-up time for sending a message and $t_w$ is the time it takes to send a word from one processor to its neighbor.
Phase 2 takes twice as much time as Phase 1.
Phase 3 can be completed in the same amount of time as Phase 1.
Overall, the algorithm takes $\left(\tfrac{4}{3}\log p,\; \tfrac{n^2}{p^{2/3}} \cdot \tfrac{4}{3}\log p\right)$, where a pair $(a, b)$ stands for a communication time of $t_s a + t_w b$.
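The overall figure follows by adding the three phases. The symbols $T_1$, $T_2$, $T_3$ and $T$ for the phase times and the total are introduced here for illustration only and do not appear in the source:

```latex
\begin{align*}
T_1 &= \log\sqrt[3]{p}\,\Bigl(t_s + t_w\,\frac{n^2}{p^{2/3}}\Bigr)
     = \tfrac{1}{3}\log p\,\Bigl(t_s + t_w\,\frac{n^2}{p^{2/3}}\Bigr) \\
T_2 &= 2\,T_1, \qquad T_3 = T_1 \\
T   &= T_1 + T_2 + T_3 = 4\,T_1
     = t_s \cdot \tfrac{4}{3}\log p \;+\; t_w \cdot \frac{n^2}{p^{2/3}} \cdot \tfrac{4}{3}\log p
\end{align*}
```

Reading off the coefficients of $t_s$ and $t_w$ gives the pair quoted for the whole algorithm.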
Some added conditions are:
1. $p \le n^3$
2. Overall space used is $2 n^2 \sqrt[3]{p}$
The above description is for a one-port hypercube architecture, in which a processor can use at most one communication link at a time to send and receive data.
On a multi-port architecture, in which a processor can use all of its communication ports simultaneously, the algorithm is faster: the above time is reduced by a factor of $\log \sqrt[3]{p}$.
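The algorithm
Initial distribution: processor $p_{i,i,k}$ contains $A_{k,i}$ and $B_{k,i}$.
Program of processor $p_{i,j,k}$:
    if (i = j) then
        Send B_{k,i} to p_{i,k,k}
        Broadcast A_{k,i} to all processors p_{*,i,k}
    endif
    if (j = k) then
        Receive B_{j,i} from p_{i,i,j}
        Broadcast B_{j,i} to all processors p_{i,j,*}
    endif
    Receive A_{k,j} from p_{j,j,k} and B_{j,i} from p_{i,j,j}
    Calculate I_{k,i} = A_{k,j} x B_{j,i}
    Send I_{k,i} to p_{i,i,k}
    if (i = j) then
        for l = 0 to $\sqrt[3]{p}$ - 1
            Receive I_{k,i} from p_{i,l,k}
            C_{k,i} = C_{k,i} + I_{k,i}
        endfor
    endif
I is an intermediate matrix.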
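The data movement of the three phases can be followed in a small sequential simulation. This is a sketch under a strong simplifying assumption: the mesh side s equals n, so every block is a single number and p = n^3. The macro `P3`, the function name `diagonal3d_matmul` and the register arrays `Ar` and `Br` are hypothetical, and messages are modelled as plain array copies.

```c
#include <stdlib.h>

/* Flat index of mesh processor p(x,y,z) on an s x s x s mesh. */
#define P3(x, y, z, s) ((((x) * (s)) + (y)) * (s) + (z))

/* Assumption: s = n, so blocks A_{k,i} and B_{k,i} are scalars.
   a, b, c are s x s row-major. Returns 0 on success, -1 on allocation failure. */
int diagonal3d_matmul(int s, const double *a, const double *b, double *c)
{
    int i, j, k, x, z;
    size_t count = (size_t)s * s * s;
    double *Ar = calloc(count, sizeof *Ar);
    double *Br = calloc(count, sizeof *Br);
    if (!Ar || !Br) { free(Ar); free(Br); return -1; }

    /* Initial distribution: p(i,i,k) holds A_{k,i} and B_{k,i}. */
    for (i = 0; i < s; i++)
        for (k = 0; k < s; k++) {
            Ar[P3(i, i, k, s)] = a[k * s + i];
            Br[P3(i, i, k, s)] = b[k * s + i];
        }

    /* Phase 1: p(i,i,k) sends B_{k,i} to p(i,k,k). */
    for (i = 0; i < s; i++)
        for (k = 0; k < s; k++)
            Br[P3(i, k, k, s)] = Br[P3(i, i, k, s)];

    /* Phase 2: p(i,i,k) broadcasts A_{k,i} along x to p(*,i,k);
       p(i,k,k) broadcasts B_{k,i} along z to p(i,k,*). */
    for (i = 0; i < s; i++)
        for (k = 0; k < s; k++) {
            for (x = 0; x < s; x++)
                Ar[P3(x, i, k, s)] = Ar[P3(i, i, k, s)];
            for (z = 0; z < s; z++)
                Br[P3(i, k, z, s)] = Br[P3(i, k, k, s)];
        }

    /* Local product at p(i,j,k), then Phase 3: reduction along y.
       The sum over j lands on p(i,i,k) and equals C_{k,i}. */
    for (i = 0; i < s; i++)
        for (k = 0; k < s; k++) {
            double sum = 0.0;
            for (j = 0; j < s; j++)
                sum += Ar[P3(i, j, k, s)] * Br[P3(i, j, k, s)];
            c[k * s + i] = sum;
        }

    free(Ar); free(Br);
    return 0;
}
```

After Phase 2 the simulated processor p(i,j,k) holds A_{k,j} and B_{j,i}, which is the state the slides describe; with real blocks the scalar multiply would become a block matrix product.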
Matrix Transposition
The same concept is used here as in matrix multiplication.
The number of processors used is $N = n^2 = 2^{2q}$, and processor $P_r$ occupies position (i,j), where $r = in + j$ and $0 \le i, j \le n-1$.
Initially, processor $P_r$ holds element $a_{ij}$ of matrix A, where $r = in + j$.
Upon termination, processor $P_s$ holds element $a_{ij}$, where $s = jn + i$.
If r is represented in binary by $r_{2q-1} r_{2q-2} \ldots r_q\, r_{q-1} \ldots r_1 r_0$, then the binary representations of i and j are $r_{2q-1} r_{2q-2} \ldots r_q$ and $r_{q-1} \ldots r_1 r_0$ respectively.
Likewise, s is represented by $s_{2q-1} s_{2q-2} \ldots s_q\, s_{q-1} \ldots s_1 s_0$, and the binary representations of j and i are $s_{2q-1} s_{2q-2} \ldots s_q$ and $s_{q-1} \ldots s_1 s_0$ respectively.
Thus $r_{2q-1} r_{2q-2} \ldots r_q = s_{q-1} \ldots s_1 s_0$ and $r_{q-1} r_{q-2} \ldots r_0 = s_{2q-1} s_{2q-2} \ldots s_q$.
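In other words, the destination index s is r with its upper and lower q bits exchanged. A small C sketch of that mapping follows; the function name `transpose_index` is hypothetical.

```c
/* Destination index s = j*n + i for the element that starts in processor
   r = i*n + j, with n = 2^q: the high q bits and the low q bits of r
   trade places. */
unsigned transpose_index(unsigned r, unsigned q)
{
    unsigned mask = (1u << q) - 1u;
    unsigned i = (r >> q) & mask;   /* row: high q bits of r */
    unsigned j = r & mask;          /* column: low q bits of r */
    return (j << q) | i;
}
```

For q = 2, the element in processor 2 (row 0, column 2) must end up in processor 8 (row 2, column 0), and applying the mapping twice returns the original index.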
The algorithm
First, the requirements for the algorithm: each processor $P_u$ needs two registers, $A_u$ and $B_u$.
The index of $P_u$ is written in binary as $u = u_{2q-1} u_{2q-2} \ldots u_q\, u_{q-1} \ldots u_1 u_0$, matching that of r.
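Here $u^{(m)}$ is the index obtained from u by complementing bit m, and $u_m$ is bit m of u.
for m = 2q-1 downto q do
    for u = 0 to N-1 do in parallel
        (1) if u_m ≠ u_{m-q} then B_{u^(m)} = A_u endif
        (2) if u_m = u_{m-q} then A_{u^(m-q)} = B_u endif
    endfor
endfor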
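The loop can be simulated sequentially in C. This is a minimal sketch: the arrays `A` and `B` stand in for the registers of the N processors, each "do in parallel" step becomes an ordinary loop (safe because step (1) only reads A and writes B, and step (2) only reads B and writes A), and the function name `hypercube_transpose` is hypothetical.

```c
#include <stdlib.h>

/* Transposes in place the n x n matrix stored row-major in A (n = 2^q),
   following the two-step exchange for m = 2q-1 down to q.
   Returns 0 on success, -1 if memory allocation fails. */
int hypercube_transpose(int q, int *A)
{
    int N = 1 << (2 * q), m, u;
    int *B = calloc((size_t)N, sizeof *B);   /* the B registers */
    if (!B) return -1;

    for (m = 2 * q - 1; m >= q; m--) {
        /* (1) processors with u_m != u_{m-q} send A_u across dimension m. */
        for (u = 0; u < N; u++)
            if (((u >> m) & 1) != ((u >> (m - q)) & 1))
                B[u ^ (1 << m)] = A[u];

        /* (2) the receivers (u_m == u_{m-q}) forward B_u across dimension m-q. */
        for (u = 0; u < N; u++)
            if (((u >> m) & 1) == ((u >> (m - q)) & 1))
                A[u ^ (1 << (m - q))] = B[u];
    }

    free(B);
    return 0;
}
```

Each element that has to move travels two hops per iteration, first across bit m and then across bit m-q, which together flip both bits and so exchange them. Elements whose two bits already agree stay where they are.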
Explanation of algorithm
The algorithm follows a recursive idea to achieve the transpose of A.
Divide the matrix into four submatrices of size n/2 x n/2.
In iteration 1, when m = 2q-1, swap the elements of the top-right submatrix with those of the bottom-left submatrix. The other two submatrices are not touched.
Now do the same recursively within each submatrix until all of the elements are swapped.
Example. We want the transpose of the following matrix:
$A = \begin{pmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{pmatrix}$
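We use 16 processors with the following indices (r = 4i + j in binary, for the processor at row i, column j):
0000 0001 0010 0011
0100 0101 0110 0111
1000 1001 1010 1011
1100 1101 1110 1111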
Drawing a hypercube for this:
[Figure: four-dimensional hypercube connecting the 16 processors.]
Processor 0 (binary 0000) holds $a_{00}$, which is the value a.
Processor 1 (binary 0001) holds $a_{01}$, which is the value b.
Processor 2 (binary 0010) holds $a_{02}$, which is the value c.
And so on.
The first iteration has m = 2q-1; since q = 2 in this example, m = 3.
Step 1: Each $P_u$ with $u_3 \ne u_1$ sends its element $A_u$ to $P_{u^{(3)}}$ (u with bit 3 complemented), which stores the value in $B_{u^{(3)}}$; i.e. processors 2, 3, 6 and 7 send to processors 10, 11, 14 and 15 respectively, and processors 8, 9, 12 and 13 send to processors 0, 1, 4 and 5 respectively.
Step 2: Each processor that received data in Step 1 now sends the data from $B_u$ to $P_{u^{(1)}}$ (u with bit 1 complemented), where it is stored in $A_{u^{(1)}}$; i.e.
Processors 0, 1, 4, 5 send to 2, 3, 6, 7 respectively.
Processors 10, 11, 14, 15 send to 8, 9, 12, 13 respectively.
By the end of the first iteration our matrix A will look like:
$A = \begin{pmatrix} a & b & i & j \\ e & f & m & n \\ c & d & k & l \\ g & h & o & p \end{pmatrix}$
In the second iteration, m = q = 2.
Step 1: Each $P_u$ with $u_2 \ne u_0$ sends $A_u$ to $P_{u^{(2)}}$ (u with bit 2 complemented), which stores it in $B_{u^{(2)}}$. The transfers are simultaneous:
processor 4 to processor 0
processor 1 to processor 5
processor 6 to processor 2
processor 3 to processor 7
processor 12 to processor 8
processor 9 to processor 13
processor 14 to processor 10
processor 11 to processor 15
Step 2: Each $P_u$ with $u_2 = u_0$ sends $B_u$ to $P_{u^{(0)}}$ (u with bit 0 complemented), where it is stored in $A_{u^{(0)}}$. This swaps the element in the top-right corner with the one in the bottom-left corner of each 2 x 2 submatrix:
processor 0 to processor 1
processor 5 to processor 4
processor 2 to processor 3
processor 7 to processor 6
processor 8 to processor 9
processor 13 to processor 12
processor 10 to processor 11
processor 15 to processor 14
After the second iteration, we have the following transposed matrix:
$A = \begin{pmatrix} a & e & i & m \\ b & f & j & n \\ c & g & k & o \\ d & h & l & p \end{pmatrix}$
Algorithm Analysis
The algorithm takes q constant-time iterations, giving t(n) = O(log n).
But it uses $n^2$ processors.
Therefore the cost is $O(n^2 \log n)$, which is not optimal, since a single processor can transpose the matrix in $O(n^2)$ time.
Bibliography
Akl, Parallel Computation: Models and Methods, Prentice Hall.
Drake, J.B. and Luo, Q., A Scalable Parallel Strassen's Matrix Multiplication Algorithm for Distributed-Memory Computers, Proceedings of the 1995 ACM Symposium on Applied Computing, February 1995.
Gupta, H. and Sadayappan, P., Communication Efficient Matrix Multiplication on Hypercubes, Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994.
Quinn, M.J., Parallel Computing: Theory and Practice, McGraw Hill, 1997.