Presentation on theme: "Design of parallel algorithms Matrix operations J. Porras." — Presentation transcript:

1 Design of parallel algorithms Matrix operations J. Porras

2 Matrix x vector
Sequential approach:
MAT_VECT(A, x, y)
  for (i = 0; i < n; i++) {
    y[i] = 0;
    for (j = 0; j < n; j++) {
      y[i] = y[i] + A[i][j] * x[j];
    }
  }
Work = n²

3 Parallelization of matrix operations
Matrix x vector: three ways to implement
–rowwise striping
–columnwise striping
–checkerboarding
DRAW each of these approaches!

4 Rowwise striping
–The n x n matrix is distributed among n processors (one row each)
–The n x 1 vector is distributed among n processors (one element each)
–All processors need the whole vector, so an all-to-all broadcast is required
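As a concrete illustration (not from the slides), a minimal MPI sketch of the rowwise-striped multiply, assuming p = n processes, each holding one row of A and one element of x; all names are illustrative:

#include <mpi.h>
#include <stdlib.h>

/* Each process owns one row of A and one element of x; after an
 * all-to-all broadcast (MPI_Allgather) every process holds the full
 * vector and computes its single element of y. */
void matvec_rowwise(const double *my_row, double my_x, double *my_y,
                    int n, MPI_Comm comm)
{
    double *x = malloc(n * sizeof(double));

    /* all-to-all broadcast of the vector elements */
    MPI_Allgather(&my_x, 1, MPI_DOUBLE, x, 1, MPI_DOUBLE, comm);

    double sum = 0.0;
    for (int j = 0; j < n; j++)          /* Theta(n) local multiply-adds */
        sum += my_row[j] * x[j];
    *my_y = sum;

    free(x);
}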

5 (figure)

6 Rowwise striping
–All-to-all broadcast requires Θ(n) time
–One row takes Θ(n) time for multiplications
–Rows are calculated in parallel, thus the total time is Θ(n) and the work Θ(n²)
–Algorithm is cost-optimal

7 Block striping
–Assume that p < n and the matrix is partitioned by using block striping
–Each processor contains n/p rows and n/p elements of the vector
–All processors require the whole vector, thus an all-to-all broadcast is required (message size n/p)

8 Block striping in hypercube
–All-to-all broadcast in a hypercube with n/p-sized messages takes t_s log p + t_w (n/p)(p-1)
–If p is large enough, this is approximately t_s log p + t_w n
–Multiplication requires n²/p time (n/p rows to multiply with the vector)

9 Block striping in hypercube
–Parallel execution time T_P = n²/p + t_s log p + t_w n
–Cost pT_P = n² + t_s p log p + t_w np
–Algorithm is cost-optimal if p = O(n)
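Spelling out the cost-optimality claim (a worked step, not on the original slide):

\[
pT_P \;=\; n^2 + t_s\,p\log p + t_w\,np .
\]

With p = O(n) the overhead terms are O(n log n) and O(n²) respectively, so pT_P = Θ(n²) = Θ(W) and the algorithm is cost-optimal.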

10 Block striping in mesh
–All-to-all broadcast in a mesh with wraparound takes 2t_s(√p - 1) + t_w (n/p)(p-1)
–Parallel execution requires T_P = n²/p + 2t_s(√p - 1) + t_w n

11 Scalability of block striping
–Overhead (T_0 = pT_P - W): T_0 = t_s p log p + t_w np
–Isoefficiency (W = KT_0) on the hypercube:
–For the t_s term: W = K t_s p log p
–For the t_w term: W = K t_w np; since W = n², this gives W = K² t_w² p²
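The t_w balancing step written out in full (not on the original slide):

\[
W = n^2 = K\,t_w\,n\,p \;\Rightarrow\; n = K\,t_w\,p \;\Rightarrow\; W = n^2 = K^2\,t_w^2\,p^2 .
\]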

12 Scalability of block striping
–Because p = O(n), n = Ω(p), n² = Ω(p²), and so W = Ω(p²)
–This gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain fixed efficiency

13 Scalability of block striping
–Isoefficiency on the hypercube is Θ(p²)
–Similar analysis for the mesh architecture gives the same value, Θ(p²)
–Thus with striped partitioning, the algorithm is no more scalable on a hypercube than on a mesh

14 Checkerboard
–The n x n matrix is partitioned among n² processors (one element per processor)
–The n x 1 vector is located in the last column (or on the diagonal)
–The vector is distributed to the corresponding processors
–Multiplications are calculated in parallel and the results are collected with a single-node accumulation into the last processor of each row

15 (figure)

16 (figure)

17 Checkerboard
Three communication steps are required:
–One-to-one communication to send the vector onto the diagonal
–One-to-all broadcast to distribute the elements of the vector along the columns
–Single-node accumulation along the rows to sum the partial results
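A minimal MPI sketch of these three steps for the fine-grained case p = n² (not from the slides; the grid layout, sub-communicators and variable names are assumptions):

#include <mpi.h>

/* Sketch: the three communication steps of the fine-grained checkerboard
 * matrix-vector multiply, assuming p = n*n processes laid out as an n x n
 * grid with rank = row*n + col, and that process (i, n-1) initially holds
 * x[i].  All names are illustrative. */
double checkerboard_matvec(double a_ij, double x_i_if_last_col,
                           int row, int col, int n, MPI_Comm grid)
{
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(grid, row, col, &row_comm);   /* processes in the same row    */
    MPI_Comm_split(grid, col, row, &col_comm);   /* processes in the same column */

    /* (1) one-to-one: (i, n-1) sends x[i] to the diagonal process (i, i) */
    double xj = 0.0;
    if (col == n - 1 && row == n - 1) {
        xj = x_i_if_last_col;                    /* corner process is already on the diagonal */
    } else {
        if (col == n - 1)
            MPI_Send(&x_i_if_last_col, 1, MPI_DOUBLE, row, 0, row_comm);
        if (col == row)
            MPI_Recv(&xj, 1, MPI_DOUBLE, n - 1, 0, row_comm, MPI_STATUS_IGNORE);
    }

    /* (2) one-to-all broadcast of x[j] down column j, rooted at the diagonal (j, j) */
    MPI_Bcast(&xj, 1, MPI_DOUBLE, col, col_comm);

    /* (3) single-node accumulation: sum a_ij * x[j] across the row into the last column */
    double partial = a_ij * xj, y_i = 0.0;
    MPI_Reduce(&partial, &y_i, 1, MPI_DOUBLE, MPI_SUM, n - 1, row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    return y_i;   /* valid only on the last column, which then holds y */
}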

18 Checkerboard
–The mesh requires Θ(n) time for all the communication operations (SF routing) and the hypercube Θ(log n)
–Multiplication happens in constant time
–Parallel execution time is Θ(n) on the mesh and Θ(log n) on the hypercube
–Cost is Θ(n³) for the mesh and Θ(n² log n) for the hypercube
–The algorithms are not cost-optimal
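A quick worked check of these cost figures (not on the original slide):

\[
p\,T_P^{\mathrm{mesh}} = n^2 \cdot \Theta(n) = \Theta(n^3),
\qquad
p\,T_P^{\mathrm{hypercube}} = n^2 \cdot \Theta(\log n) = \Theta(n^2\log n),
\]

both of which grow faster than the sequential work W = Θ(n²), hence the lack of cost-optimality.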

19 Checkerboard, p < n²
–Can cost-optimality be achieved by increasing the granularity?
–Consider a two-dimensional mesh of p processors in which each processor stores an (n/√p) x (n/√p) block of the matrix
–Similarly, each processor stores n/√p elements of the vector

20 Checkerboard, p < n²
–Vector elements are sent to the diagonal
–Vector elements are then distributed to the other processors in each column
–Each processor performs n²/p multiplications for its n/√p partial sums
–Partial sums are collected with a single-node accumulation

21 Scalability of checkerboard, p < n²
–Assume that the processors are connected in a two-dimensional √p x √p mesh with cut-through routing (no wraparound)
–Sending the vector to the diagonal takes t_s + t_w n/√p + t_h√p
–One-to-all broadcast in the columns takes (t_s + t_w n/√p) log(√p) + t_h√p

22 Scalability of checkerboard, p < n²
–Single-node accumulation takes (t_s + t_w n/√p) log(√p) + t_h√p
–The multiplications in each processor take n²/p time
–Thus T_P = n²/p + t_s log p + (t_w n/√p) log p + 3t_h√p
–T_0 = pT_P - W gives for the overhead: T_0 = t_s p log p + t_w n√p log p + 3t_h p^(3/2)

23 Scalability of checkerboard, p < n²
–Isoefficiency for t_s: W = K t_s p log p
–Isoefficiency for t_w: W = n² = K t_w n√p log p, so n = K t_w√p log p and W = n² = K² t_w² p log² p
–Isoefficiency for t_h: W = 3 K t_h p^(3/2)

24 Scalability of checkerboard, p < n²
–Since p = O(n²), we have n² = Ω(p) and thus W = Ω(p)
–The t_w and t_h terms dominate the t_s term

25 Scalability of checkerboard, p < n²
–Concentrate on the t_h term, Θ(p^(3/2)), and the t_w term, Θ(p log² p)
–Because p^(3/2) > p log² p only for p > 65536, either of the terms could dominate
–Assume that the Θ(p log² p) term dominates

26 Scalability of checkerboard, p < n²
–The maximum number of processors that can be used cost-optimally for a problem size W is determined by p log² p = O(n²)
–Taking logarithms: log p + 2 log log p = O(log n), hence log p = O(log n)

27 Scalability of checkerboard, p < n²
–Substituting log n for log p: p log² n = O(n²), so p = O(n² / log² n)
–This gives the upper limit for the number of processors that can be used cost-optimally

28 SF and CT routing
–Parallel execution takes n²/p + 2t_s√p + 3t_w n time on a p-processor mesh with SF routing (isoefficiency Θ(p²) due to the t_w term)
–CT routing performs much better
–Note that this holds for the coarse-grain case with several elements per processor
–How about the fine-grain case?

29 Striped vs. checkerboard
–Comparison shows that the checkerboard approach is faster than the striped approach with the same number of processors
–If p > n, the striped approach is not available
–How about the effect of the architecture? Scalability? Isoefficiency?

30 Sequential matrix multiplication
procedure MAT_MULT(A, B, C)
  for i := 0 to n-1 do
    for j := 0 to n-1 do
      C[i,j] := 0;
      for k := 0 to n-1 do
        C[i,j] := C[i,j] + A[i,k] * B[k,j]
Work = n³ (Strassen's algorithm has lower asymptotic complexity)

31 Block approach
Matrices are viewed as q x q arrays of (n/q) x (n/q) submatrices
procedure BLOCK_MAT_MULT(A, B, C)
  for i := 0 to q-1 do
    for j := 0 to q-1 do
      Initialize C_{i,j} to zero
      for k := 0 to q-1 do
        C_{i,j} := C_{i,j} + A_{i,k} * B_{k,j}
Same complexity, n³
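A runnable C version of the same blocked loop structure (a sketch, assuming row-major n x n arrays stored as 1-D double arrays and n divisible by q):

#include <stddef.h>

/* Sketch: blocked matrix multiply corresponding to BLOCK_MAT_MULT,
 * assuming row-major n x n arrays and n divisible by q (b = n/q). */
void block_mat_mult(const double *A, const double *B, double *C,
                    size_t n, size_t q)
{
    size_t b = n / q;                               /* block edge length */
    for (size_t i = 0; i < q; i++)
        for (size_t j = 0; j < q; j++) {
            /* initialize block C_{i,j} to zero */
            for (size_t ii = i*b; ii < (i+1)*b; ii++)
                for (size_t jj = j*b; jj < (j+1)*b; jj++)
                    C[ii*n + jj] = 0.0;
            /* C_{i,j} := C_{i,j} + A_{i,k} * B_{k,j} for k = 0..q-1 */
            for (size_t k = 0; k < q; k++)
                for (size_t ii = i*b; ii < (i+1)*b; ii++)
                    for (size_t jj = j*b; jj < (j+1)*b; jj++) {
                        double sum = 0.0;
                        for (size_t kk = k*b; kk < (k+1)*b; kk++)
                            sum += A[ii*n + kk] * B[kk*n + jj];
                        C[ii*n + jj] += sum;
                    }
        }
}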

32 Simple parallel approach
–Matrices A and B are partitioned into p blocks of size (n/√p) x (n/√p)
–Map the blocks onto a √p x √p mesh of processors P_{0,0} ... P_{√p-1,√p-1}
–P_{i,j} stores A_{i,j} and B_{i,j} and computes C_{i,j}
–C_{i,j} requires all A_{i,k} and B_{k,j}
–Blocks of A need to be communicated within rows, blocks of B within columns

33 Performance on a hypercube
–Requires two all-to-all broadcasts (within rows and within columns), message size n²/p
–Communication: t_c = 2(t_s log(√p) + t_w (n²/p)(√p - 1))
–Computation: t_m = √p (n/√p)³ = n³/p
–T_P = n³/p + t_s log p + 2t_w n²/√p, for p » 1

34 Performance on a mesh
–Store-and-forward routing
–Communication: t_c = 2(t_s√p + t_w n²/√p)
–Computation: t_m = √p (n/√p)³ = n³/p
–T_P = n³/p + 2t_s√p + 2t_w n²/√p

35 Cannon's algorithm
–Partition into blocks as usual, processors P_{0,0} ... P_{√p-1,√p-1}
–P_{i,j} contains A_{i,j} and B_{i,j}
–Rotate the blocks: A blocks to the left, B blocks upwards
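For reference, a minimal serial C simulation of Cannon's rotation schedule (not from the slides; data movement is replaced by index arithmetic, with q = √p and n assumed divisible by q):

#include <stddef.h>

/* Sketch: serial simulation of Cannon's algorithm on row-major n x n
 * matrices, q blocks per dimension, b = n/q.  Instead of physically
 * rotating blocks, the index the rotation would deliver to processor (i,j)
 * at step s is computed directly: after the initial alignment and s
 * single-step shifts, (i,j) holds A_{i,(i+j+s) mod q} and B_{(i+j+s) mod q, j}. */
void cannon_simulated(const double *A, const double *B, double *C,
                      size_t n, size_t q)
{
    size_t b = n / q;
    for (size_t idx = 0; idx < n * n; idx++)
        C[idx] = 0.0;

    for (size_t s = 0; s < q; s++)            /* q rotation steps */
        for (size_t i = 0; i < q; i++)        /* loop over the processor grid */
            for (size_t j = 0; j < q; j++) {
                size_t k = (i + j + s) % q;   /* block brought in by the rotation */
                /* C_{i,j} += A_{i,k} * B_{k,j} (one block multiply) */
                for (size_t ii = 0; ii < b; ii++)
                    for (size_t jj = 0; jj < b; jj++) {
                        double sum = 0.0;
                        for (size_t kk = 0; kk < b; kk++)
                            sum += A[(i*b + ii)*n + (k*b + kk)]
                                 * B[(k*b + kk)*n + (j*b + jj)];
                        C[(i*b + ii)*n + (j*b + jj)] += sum;
                    }
            }
}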

36 (figure)

37 (figure)

38 (figure)

39 Fox's algorithm
–Partition into blocks as usual, P_{i,j} contains A_{i,j} and B_{i,j}
–Uses one-to-all broadcasts, √p iterations; in each iteration:
–(1) broadcast the selected block of A to its row (initially the diagonal block A_{i,i})
–(2) multiply it by the resident block of B
–(3) send the B block upwards
–(4) select A_{i,(j+1) mod √p}, i.e. the block to the right of the one just broadcast
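A matching serial simulation of Fox's schedule (again not from the slides); note that the block index k does not depend on the column j, which is exactly what the one-to-all row broadcast provides:

#include <stddef.h>

/* Sketch: serial simulation of Fox's algorithm.  At iteration s, every
 * processor in row i works with the broadcast block A_{i,(i+s) mod q},
 * while its resident B block, shifted upwards s times, is
 * B_{(i+s) mod q, j}.  Data movement is replaced by index arithmetic. */
void fox_simulated(const double *A, const double *B, double *C,
                   size_t n, size_t q)
{
    size_t b = n / q;                                /* n assumed divisible by q */
    for (size_t idx = 0; idx < n * n; idx++)
        C[idx] = 0.0;

    for (size_t s = 0; s < q; s++)                   /* sqrt(p) iterations */
        for (size_t i = 0; i < q; i++)
            for (size_t j = 0; j < q; j++) {
                size_t k = (i + s) % q;              /* broadcast A block / shifted B block */
                for (size_t ii = 0; ii < b; ii++)
                    for (size_t jj = 0; jj < b; jj++) {
                        double sum = 0.0;
                        for (size_t kk = 0; kk < b; kk++)
                            sum += A[(i*b + ii)*n + (k*b + kk)]
                                 * B[(k*b + kk)*n + (j*b + jj)];
                        C[(i*b + ii)*n + (j*b + jj)] += sum;
                    }
            }
}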

40 (figure)

41 (figure)

42 DNS algorithm (Dekel, Nassimi and Sahni)
–n³ processors available, use a 3D structure
–P_{i,j,k} computes the product A[i,k] x B[k,j]
–C[i,j] is the sum of the partial products held by P_{i,j,0} ... P_{i,j,n-1}
–Θ(log n) time
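A serial sketch that makes the DNS decomposition explicit (illustrative only, not from the slides): the n³ partial products that the n³ processors would hold, followed by the accumulation over the third dimension:

#include <stdlib.h>

/* Sketch: stage 1 fills an n x n x n array of partial products, one per
 * virtual processor P(i,j,k); stage 2 is the accumulation over the k
 * dimension, which the parallel algorithm performs in Theta(log n) time. */
void dns_decomposition(const double *A, const double *B, double *C, size_t n)
{
    double *partial = malloc(n * n * n * sizeof(double));

    /* stage 1: P(i,j,k) computes A[i][k] * B[k][j] */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                partial[(i*n + j)*n + k] = A[i*n + k] * B[k*n + j];

    /* stage 2: single-node accumulation along k gives C[i][j] */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += partial[(i*n + j)*n + k];
            C[i*n + j] = acc;
        }

    free(partial);
}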

43 DNS on a hypercube
–The 3D structure is mapped onto a hypercube with n³ = 2^(3d) processors
–Processor P_{i,j,0} initially contains A[i,j] and B[i,j]
–Three steps: (1) move the A and B elements to the correct plane, (2) replicate them on each plane, (3) single-node accumulation

44 (figure)

45 (figure)

46 DNS with fewer than n³ processors
–Use p = q³ processors, q < n
–Partition the matrices into (n/q) x (n/q) blocks, so each matrix is a q x q arrangement of submatrices
–Since 1 <= q <= n, p can range from 1 to n³

