1
Design of parallel algorithms
Matrix operations
J. Porras
2
Matrix x vector
Sequential approach
MAT_VECT(A, x, y)
  for (i = 0; i < n; i++) {
    y[i] = 0;
    for (j = 0; j < n; j++) {
      y[i] = y[i] + A[i][j] * x[j];
    }
  }
Work = $n^2$
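The loop nest on this slide can be tried out directly. The following is a minimal Python sketch; the function name `mat_vect` and the list-of-rows storage are assumptions made for illustration, not something the slides prescribe.

```python
def mat_vect(A, x):
    """Sequential y = A x for an n x n matrix stored as a list of rows."""
    n = len(x)
    y = [0] * n
    for i in range(n):          # one output element per matrix row
        for j in range(n):      # n multiply-adds per row, n^2 in total
            y[i] += A[i][j] * x[j]
    return y


A = [[1, 2], [3, 4]]
x = [5, 6]
print(mat_vect(A, x))  # [17, 39]
```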
3
Parallelization of matrix operations
Matrix x vector
Three ways to implement
–rowwise striping
–columnwise striping
–checkerboarding
DRAW each of these approaches!
4
Rowwise striping
The n x n matrix is distributed over n processors (one row each)
The n x 1 vector is distributed over n processors (one element each)
All processors need the whole vector, so an all-to-all broadcast is required
6
Rowwise striping
All-to-all broadcast requires $\Theta(n)$ time
One row takes $\Theta(n)$ time for the multiplications
Rows are calculated in parallel, thus the total time is $\Theta(n)$ and the processor-time product is $\Theta(n^2)$, equal to the sequential work
–Algorithm is cost-optimal
7
Block striping
Assume that p < n and the matrix is partitioned by using block striping
Each processor contains n/p rows of the matrix and n/p elements of the vector
All processors require the whole vector, thus an all-to-all broadcast is required (message size n/p)
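Block striping can be simulated sequentially to see which data each processor touches. This is only a sketch: the name `block_striped_mat_vect` is hypothetical, the "processors" are loop iterations, and it assumes p divides n.

```python
def block_striped_mat_vect(A, x, p):
    """Simulate rowwise block striping on p processes (assumes p divides n)."""
    n = len(x)
    assert n % p == 0, "this sketch assumes p divides n"
    b = n // p                                            # rows and vector elements per process
    local_rows = [A[r * b:(r + 1) * b] for r in range(p)]
    local_x = [x[r * b:(r + 1) * b] for r in range(p)]

    # All-to-all broadcast: every process ends up with all p pieces of x.
    full_x = [v for piece in local_x for v in piece]

    y = []
    for r in range(p):                                    # each iteration stands for one process
        for row in local_rows[r]:                         # n/p local rows, n multiply-adds each
            y.append(sum(a * v for a, v in zip(row, full_x)))
    return y


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, 2, 1]
print(block_striped_mat_vect(A, x, 2))
```

Calling it with p = n gives the one-row-per-processor case of plain rowwise striping.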
8
Block striping in hypercube
All-to-all broadcast in a hypercube with an n/p-sized message takes $t_s \log p + t_w (n/p)(p-1)$
If p is considered large enough, this is approximately $t_s \log p + t_w n$
Multiplication requires $n^2/p$ time (n/p rows to multiply with the vector)
9
Block striping in hypercube
Parallel execution time $T_P = n^2/p + t_s \log p + t_w n$
Cost $pT_P = n^2 + t_s p \log p + t_w n p$
Algorithm is cost-optimal if $p = O(n)$
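The two formulas on this slide are easy to evaluate for concrete parameters. The sketch below assumes base-2 logarithms and a unit cost per multiply-add; the function names and the sample values of `t_s` and `t_w` are illustrative assumptions.

```python
from math import log2


def striped_hypercube_time(n, p, t_s, t_w):
    """T_P = n^2/p + t_s log p + t_w n (large-p approximation from the slide)."""
    return n * n / p + t_s * log2(p) + t_w * n


def striped_hypercube_cost(n, p, t_s, t_w):
    """Cost p*T_P = n^2 + t_s p log p + t_w n p."""
    return p * striped_hypercube_time(n, p, t_s, t_w)


n = 1024
for p in (4, 32, 256, 1024):
    cost = striped_hypercube_cost(n, p, t_s=25.0, t_w=2.0)
    print(p, cost / (n * n))   # ratio of parallel cost to sequential work
```

The printed ratio stays bounded as long as p grows no faster than n, which is the cost-optimality condition stated above.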
10
Block striping in mesh
All-to-all broadcast in a mesh with wraparounds takes $2 t_s (\sqrt{p} - 1) + t_w (n/p)(p-1)$
Parallel execution requires $T_P = n^2/p + 2 t_s (\sqrt{p} - 1) + t_w n$
11
Scalability of block striping
Overhead ($T_0 = pT_P - W$): $T_0 = t_s p \log p + t_w n p$
Isoefficiency ($W = K T_0$) for hypercube, one term at a time:
$W = K t_s p \log p$
$W = K t_w n p$
Since $W = n^2$, the second equation gives $n = K t_w p$ and therefore $W = K^2 t_w^2 p^2$
12
Scalability of block striping
Because $p = O(n)$: $n = \Omega(p)$, $n^2 = \Omega(p^2)$, $W = \Omega(p^2)$
This equation gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain fixed efficiency
13
Scalability of block striping
Isoefficiency in hypercube is $\Theta(p^2)$
A similar analysis for the mesh architecture gives the same value, $\Theta(p^2)$
Thus with striped partitioning, a hypercube is no more scalable than a mesh
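The $\Theta(p^2)$ isoefficiency can be checked numerically: if n grows linearly with p, then W = n^2 grows as p^2 and the efficiency should not drop on either architecture. This is a sketch under assumed parameter values, with base-2 logarithms and hypothetical function names.

```python
from math import log2, sqrt


def t_hypercube(n, p, t_s, t_w):
    """Block-striped matrix-vector time on a hypercube."""
    return n * n / p + t_s * log2(p) + t_w * n


def t_mesh(n, p, t_s, t_w):
    """Block-striped matrix-vector time on a mesh with wraparounds."""
    return n * n / p + 2 * t_s * (sqrt(p) - 1) + t_w * n


def efficiency(t_par, n, p, t_s, t_w):
    """E = W / (p * T_P) with W = n^2."""
    return n * n / (p * t_par(n, p, t_s, t_w))


# Grow n linearly with p, so that W = n^2 grows as p^2.
for p in (4, 16, 64, 256):
    n = 64 * p
    print(p,
          round(efficiency(t_hypercube, n, p, 10.0, 1.0), 4),
          round(efficiency(t_mesh, n, p, 10.0, 1.0), 4))
```

With the startup cost set to zero, the efficiency is exactly constant under this scaling, since only the $t_w n p$ overhead term remains.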
14
Checkerboard
The n x n matrix is partitioned over $n^2$ processors (one element per processor)
The n x 1 vector is located in the last column (or on the diagonal)
The vector is distributed to the corresponding processors
Multiplications are calculated in parallel and the results are collected with single-node accumulation into the last processor of each row
17
Checkerboard
Three communication steps are required
–One-to-one communication to send the vector onto the diagonal
–One-to-all broadcast to distribute the elements of the vector
–Single-node accumulation to sum the partial results
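The three communication steps can be traced in a small sequential simulation with one matrix element per process. The function name `checkerboard_mat_vect` and the grid-of-lists representation are assumptions for illustration; real message passing is replaced by plain assignments.

```python
def checkerboard_mat_vect(A, x):
    """Simulate fine-grain checkerboard y = A x on an n x n process grid."""
    n = len(x)
    # vec[i][j] is the vector element held by process (i, j); None means nothing held yet.
    vec = [[None] * n for _ in range(n)]
    for i in range(n):
        vec[i][n - 1] = x[i]                 # x starts in the last column

    # Step 1: one-to-one communication, (i, n-1) sends x[i] to the diagonal process (i, i).
    for i in range(n):
        vec[i][i] = vec[i][n - 1]

    # Step 2: one-to-all broadcast down each column from the diagonal process (j, j).
    for j in range(n):
        for i in range(n):
            vec[i][j] = vec[j][j]

    # Local work: every process multiplies its matrix element by its vector element.
    prod = [[A[i][j] * vec[i][j] for j in range(n)] for i in range(n)]

    # Step 3: single-node accumulation along each row.
    return [sum(prod[i]) for i in range(n)]


print(checkerboard_mat_vect([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```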
18
Checkerboard
Mesh requires $\Theta(n)$ time for all the operations with store-and-forward (SF) routing, and hypercube $\Theta(\log n)$
Multiplication happens in constant time
Parallel execution time is $\Theta(n)$ in the mesh and $\Theta(\log n)$ in the hypercube architecture
Cost is $\Theta(n^3)$ for the mesh and $\Theta(n^2 \log n)$ for the hypercube
Algorithms are not cost-optimal
19
Checkerboard, $p < n^2$
Can cost-optimality be achieved if the granularity is increased?
Consider a two-dimensional mesh of p processors in which each processor stores an $(n/\sqrt{p}) \times (n/\sqrt{p})$ block of the matrix
Similarly, the vector is distributed in pieces of $n/\sqrt{p}$ elements
20
Checkerboard, $p < n^2$
Vector elements are sent to the diagonal
Vector elements are distributed to the other processors
Each processor performs $n^2/p$ multiplications and adds the products locally into $n/\sqrt{p}$ partial sums
Partial sums are collected with single-node accumulation
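The coarse-grain version can be sketched the same way, with each process owning a block. The name `block_checkerboard_mat_vect` and the parameter `q` (side of the process grid, so p = q*q) are assumptions, and the sketch requires q to divide n.

```python
def block_checkerboard_mat_vect(A, x, q):
    """Simulate checkerboard y = A x on a q x q process grid (p = q*q, q divides n)."""
    n = len(x)
    assert n % q == 0, "this sketch assumes q divides n"
    b = n // q                                   # block side n / sqrt(p)
    y = [0] * n
    for i in range(q):                           # process row
        for j in range(q):                       # process column
            # After the column broadcast, process (i, j) holds vector piece j.
            x_piece = x[j * b:(j + 1) * b]
            for r in range(b):                   # b local rows, b*b = n^2/p multiplications
                local_row = A[i * b + r][j * b:(j + 1) * b]
                partial = sum(a * v for a, v in zip(local_row, x_piece))
                # Adding into y models the accumulation along process row i.
                y[i * b + r] += partial
    return y


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, 2, 1]
print(block_checkerboard_mat_vect(A, x, 2))
```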
21
Scalability of checkerboard, $p < n^2$
Assume that the processors are connected in a two-dimensional $\sqrt{p} \times \sqrt{p}$ cut-through routing mesh (no wraparounds)
Sending to the diagonal takes $t_s + t_w n/\sqrt{p} + t_h \sqrt{p}$
One-to-all broadcast in columns takes $(t_s + t_w n/\sqrt{p}) \log(\sqrt{p}) + t_h \sqrt{p}$
22
Scalability of checkerboard, $p < n^2$
Single-node accumulation takes $(t_s + t_w n/\sqrt{p}) \log(\sqrt{p}) + t_h \sqrt{p}$
Multiplications in each processor take $n^2/p$
Thus $T_P \approx n^2/p + t_s \log p + (t_w n/\sqrt{p}) \log p + 3 t_h \sqrt{p}$
$T_0 = pT_P - W$ gives for the overhead:
$T_0 = t_s p \log p + t_w n \sqrt{p} \log p + 3 t_h p^{3/2}$
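The step from $T_P$ to $T_0$ can be verified numerically. The sketch below assumes base-2 logarithms; the function names and the sample parameter values are illustrative only.

```python
from math import log2, sqrt


def checkerboard_time(n, p, t_s, t_w, t_h):
    """T_P = n^2/p + t_s log p + (t_w n / sqrt(p)) log p + 3 t_h sqrt(p)."""
    return (n * n / p + t_s * log2(p)
            + (t_w * n / sqrt(p)) * log2(p) + 3 * t_h * sqrt(p))


def checkerboard_overhead(n, p, t_s, t_w, t_h):
    """T_0 = t_s p log p + t_w n sqrt(p) log p + 3 t_h p^(3/2)."""
    return (t_s * p * log2(p) + t_w * n * sqrt(p) * log2(p)
            + 3 * t_h * p ** 1.5)


n, p = 1024, 64
params = (n, p, 25.0, 2.0, 1.0)          # assumed t_s, t_w, t_h
from_definition = p * checkerboard_time(*params) - n * n
closed_form = checkerboard_overhead(*params)
print(from_definition, closed_form)      # the two values should agree up to rounding
```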
23
Scalability of checkerboard, $p < n^2$
Isoefficiency for $t_s$: $W = K t_s p \log p$
Isoefficiency for $t_w$:
$W = n^2 = K t_w n \sqrt{p} \log p$
$n = K t_w \sqrt{p} \log p$
$n^2 = K^2 t_w^2 p \log^2 p$
$W = K^2 t_w^2 p \log^2 p$
Isoefficiency for $t_h$: $W = 3 K t_h p^{3/2}$
24
Scalability of checkerboard, $p < n^2$
If $p = O(n^2)$: $n^2 = \Omega(p)$, $W = \Omega(p)$
The $t_w$ and $t_h$ terms dominate the $t_s$ term
25
Scalability of checkerboard, $p < n^2$
Concentrate on the $t_h$ term, $\Theta(p^{3/2})$, and the $t_w$ term, $\Theta(p \log^2 p)$
Because $p^{3/2} > p \log^2 p$ only for $p > 65536$, either of the terms could dominate
Assume that the term $\Theta(p \log^2 p)$ dominates
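The crossover point can be checked by evaluating both growth terms, ignoring the constant factors. This sketch assumes the logarithm is base 2, which is what makes the two terms meet at p = 65536; the function names are hypothetical.

```python
from math import log2


def th_growth(p):
    """Growth of the t_h isoefficiency term: p^(3/2)."""
    return p ** 1.5


def tw_growth(p):
    """Growth of the t_w isoefficiency term: p log^2 p."""
    return p * log2(p) ** 2


for k in (4, 8, 12, 16, 20):
    p = 2 ** k
    print(p, th_growth(p), tw_growth(p))
```

For machine sizes between a handful of processors and 65536, the $p \log^2 p$ term is the larger one, which is why the slides continue with it.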
26
Scalability of checkerboard, $p < n^2$
The maximum number of processors that can be used cost-optimally for the problem size W is determined by
$p \log^2 p = O(n^2)$
$\log p + 2 \log \log p = O(\log n)$
$\log p = O(\log n)$
27
Scalability of checkerboard, $p < n^2$
Substitute $\log n$ for $\log p$:
$p \log^2 n = O(n^2)$
$p = O(n^2 / \log^2 n)$
This gives the upper limit for the number of processors that can be used cost-optimally
28
SF and CT
Parallel execution takes $n^2/p + 2 t_s \sqrt{p} + 3 t_w n$ time on a p-processor mesh with SF routing (isoefficiency $\Theta(p^2)$ due to $t_w$)
CT routing performs much better
Note that this is true for cases with several elements per processor
How about the fine-grain case?
29
Striped and checkerboard
Comparison shows that checkerboard is faster than the striped approach with the same number of processors
If p > n, the striped approach is not available
How about the effect of architecture? Scalability? Isoefficiency?
30
Sequential matrix multiplication
Procedure MAT_MULT(A, B, C)
  for i := 0 to n-1 do
    for j := 0 to n-1 do
      C[i,j] := 0;
      for k := 0 to n-1 do
        C[i,j] := C[i,j] + A[i,k] * B[k,j]
$n^3$ work (Strassen's algorithm has better complexity)
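A runnable counterpart of MAT_MULT, as a minimal Python sketch. The function name `mat_mult` and returning C instead of filling an output parameter are choices made for this example.

```python
def mat_mult(A, B):
    """Sequential C = A B for n x n matrices stored as lists of rows."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):           # n multiply-adds per element, n^3 in total
                C[i][j] += A[i][k] * B[k][j]
    return C


print(mat_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```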
31
Block approach
A, B and C are viewed as q x q arrays of (n/q) x (n/q) submatrices
Procedure BLOCK_MAT_MULT(A, B, C)
  for i := 0 to q-1 do
    for j := 0 to q-1 do
      Initialize all elements of C_{i,j} to zero
      for k := 0 to q-1 do
        C_{i,j} := C_{i,j} + A_{i,k} * B_{k,j}
Same complexity $n^3$
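The block formulation performs the same multiply-adds, only grouped by submatrix. A sketch in Python, where `block_mat_mult` is a hypothetical name and q is assumed to divide n:

```python
def block_mat_mult(A, B, q):
    """C = A B computed block by block on a q x q grid of (n/q) x (n/q) submatrices."""
    n = len(A)
    assert n % q == 0, "this sketch assumes q divides n"
    b = n // q
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for k in range(q):
                # C_ij := C_ij + A_ik * B_kj, written out on the block's index ranges
                for r in range(i * b, (i + 1) * b):
                    for c in range(j * b, (j + 1) * b):
                        for t in range(k * b, (k + 1) * b):
                            C[r][c] += A[r][t] * B[t][c]
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(block_mat_mult(A, B, 2))
```

The innermost three loops run $(n/q)^3$ times for each of the $q^3$ block triples, which gives the same $n^3$ total as the unblocked version.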
32
Simple parallel approach
Matrices A and B are partitioned into p blocks of size $(n/\sqrt{p}) \times (n/\sqrt{p})$
Map onto a $\sqrt{p} \times \sqrt{p}$ mesh
Processors $P_{0,0} \ldots P_{\sqrt{p}-1,\sqrt{p}-1}$
$P_{i,j}$ stores $A_{i,j}$ and $B_{i,j}$ and computes $C_{i,j}$
$C_{i,j}$ requires $A_{i,k}$ and $B_{k,j}$
A needs to communicate within rows
B communicates within columns
33
Performance on hypercube
Requires 2 broadcasts (rows and columns), message size $n^2/p$
$t_c = 2(t_s \log(\sqrt{p}) + t_w (n^2/p)(\sqrt{p} - 1))$
$t_m = \sqrt{p} \, (n/\sqrt{p})^3 = n^3/p$
$T_P = n^3/p + t_s \log p + 2 t_w n^2/\sqrt{p}$, for $p \gg 1$
34
Performance on mesh
Store-and-forward routing
$t_c = 2(t_s \sqrt{p} + t_w n^2/\sqrt{p})$
$t_m = \sqrt{p} \, (n/\sqrt{p})^3 = n^3/p$
$T_P = n^3/p + 2 t_s \sqrt{p} + 2 t_w n^2/\sqrt{p}$
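Putting the hypercube and mesh formulas side by side shows that they differ only in the startup term. The sketch assumes base-2 logarithms; the function names and the sample values of `t_s` and `t_w` are illustrative assumptions.

```python
from math import log2, sqrt


def simple_hypercube_time(n, p, t_s, t_w):
    """T_P = n^3/p + t_s log p + 2 t_w n^2 / sqrt(p)."""
    return n ** 3 / p + t_s * log2(p) + 2 * t_w * n * n / sqrt(p)


def simple_mesh_sf_time(n, p, t_s, t_w):
    """T_P = n^3/p + 2 t_s sqrt(p) + 2 t_w n^2 / sqrt(p) (store-and-forward mesh)."""
    return n ** 3 / p + 2 * t_s * sqrt(p) + 2 * t_w * n * n / sqrt(p)


n = 256
for p in (4, 16, 64, 256):
    print(p, simple_hypercube_time(n, p, 10.0, 1.0), simple_mesh_sf_time(n, p, 10.0, 1.0))
```

The gap between the two columns is $t_s (2\sqrt{p} - \log p)$, so it matters only when the startup time is large relative to the computation.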
35
Cannon's algorithm
Partition into blocks as usual
Processors $P_{0,0}$ to $P_{\sqrt{p}-1,\sqrt{p}-1}$
$P_{i,j}$ contains $A_{i,j}$ and $B_{i,j}$
Rotate the blocks!
A blocks to the left
B blocks upwards
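The rotation scheme can be simulated sequentially. The sketch below uses one matrix element per process (block size 1), so a block product is a scalar product; with larger blocks that line becomes a block multiplication. The initial alignment step is taken from the standard formulation of Cannon's algorithm and is not spelled out on this slide; the name `cannon_mat_mult` is hypothetical.

```python
def cannon_mat_mult(A, B):
    """Simulate Cannon's algorithm on a q x q grid with one element per process."""
    q = len(A)
    # Initial alignment (standard Cannon step, assumed here):
    # row i of A is shifted left by i, column j of B is shifted up by j.
    a = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Every process multiplies the two items it currently holds.
        for i in range(q):
            for j in range(q):
                C[i][j] += a[i][j] * b[i][j]
        # Rotate A one step to the left and B one step upwards (with wraparound).
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C


print(cannon_mat_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each process only ever holds one item of A and one of B at a time, which is the memory advantage of the rotation over broadcasting whole rows and columns.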
39
Fox's algorithm
Partition into blocks as usual
$P_{i,j}$ contains $A_{i,j}$ and $B_{i,j}$
Uses one-to-all broadcasts
$\sqrt{p}$ iterations
(1) broadcast the selected block of A along its row
(2) multiply by B
(3) send B upwards
(4) select $A_{i,(j+1) \bmod \sqrt{p}}$
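The four steps can be traced in a sequential simulation. The sketch assumes one element per process (block size 1) and that the first selected block in row i is the diagonal one, as in the usual description of Fox's algorithm; the name `fox_mat_mult` is hypothetical.

```python
def fox_mat_mult(A, B):
    """Simulate Fox's algorithm on a q x q grid with one element per process."""
    q = len(A)
    b = [row[:] for row in B]                # B items currently held by each process
    C = [[0] * q for _ in range(q)]
    for s in range(q):
        for i in range(q):
            k = (i + s) % q                  # (4) column of the selected A item in row i
            a_selected = A[i][k]             # (1) broadcast along process row i
            for j in range(q):
                C[i][j] += a_selected * b[i][j]   # (2) multiply by the local B item
        # (3) send B upwards with wraparound
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C


print(fox_mat_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```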
42
DNS
Dekel, Nassimi and Sahni
$n^3$ processors available
Use a 3D structure
$P_{i,j,k}$ solves A[i,k] x B[k,j]
C[i,j] = $P_{i,j,0} + \ldots + P_{i,j,n-1}$
$\Theta(\log n)$ time
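The DNS idea, one product per process followed by a logarithmic-depth sum, can be sketched as follows. The name `dns_mat_mult` is hypothetical, the data movement that places A and B on the processes is skipped, and the accumulation is written as a pairwise tree reduction to make the number of rounds visible.

```python
def dns_mat_mult(A, B):
    """Simulate DNS with n^3 processes; returns C and the number of reduction rounds."""
    n = len(A)
    # prod[i][j][k] is the single product computed by process P(i, j, k).
    prod = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)] for i in range(n)]

    # Single-node accumulation along k as a pairwise tree: about log2(n) rounds.
    rounds = 0
    stride = 1
    while stride < n:
        for i in range(n):
            for j in range(n):
                for k in range(0, n - stride, 2 * stride):
                    prod[i][j][k] += prod[i][j][k + stride]
        stride *= 2
        rounds += 1

    C = [[prod[i][j][0] for j in range(n)] for i in range(n)]
    return C, rounds


C, rounds = dns_mat_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, rounds)  # [[19, 22], [43, 50]] 1
```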
43
DNS for hypercube
The 3D structure is mapped onto a hypercube with $n^3 = 2^{3d}$ processors
Processor $P_{i,j,0}$ contains A[i,j] and B[i,j]
3 steps
(1) move A and B to the correct plane
(2) replicate on each plane
(3) single-node accumulation
46
DNS with fewer than $n^3$ processors
Processors $p = q^3$, q < n
Partition the matrices into (n/q) x (n/q) blocks
Each matrix then consists of q x q submatrices
Since $1 \le q \le n$, p ranges from 1 to $n^3$