
1 Algorithmic Techniques on a Ring of Processors

2 Logical Processor Topology
When writing a (distributed-memory) parallel application, one typically organizes processors in a logical topology:
- Linear array
- Ring
- Bi-directional ring
- 2-D grid
- 2-D torus
- One-level tree
- Fully connected graph
- Arbitrary graph
We're going to talk about a simple ring:
- a natural choice to partition regular data like matrices
- we will come up with algorithms and performance estimates
Some of these algorithms could be done better on other topologies (bi-directional rings, for instance), but the point is to see how to design and reason about parallel algorithms.

3 Communication on the Ring
- Each processor is identified by a rank: RANK()
- There is a way to find the total number of processors: NUMPROCS()
- Each processor can send a message to its successor, SEND(addr, L), and receive one from its predecessor, RECV(addr, L)
- We're looking only at SPMD programs
[figure: ring P_0 -> P_1 -> P_2 -> ... -> P_{p-1} -> P_0]

4 Cost of communication
It is actually difficult to precisely model the cost of communication, or the way in which communication loads the processor. We will be using a simple model:
Time = β + L·τ
- β: start-up cost (latency)
- L: message size
- τ: inverse of the bandwidth
We assume that if a message of length L is sent from P_0 to P_q, then the communication cost is q(β + L·τ). There are many assumptions in our model, some not very realistic, but we'll discuss them later.

5 Broadcast
We want to write a program that has P_k send the same message of length L to all other processors: Broadcast(k, addr, L). On the ring, we just send to the next processor, and so on, with no parallel communications whatsoever. This is of course not the way one should implement a broadcast in practice: MPI uses some type of tree topology.

6 Broadcast
Broadcast(k, addr, L)
  q = RANK()
  p = NUMPROCS()
  if (q == k)
    SEND(addr, L)
  else if (q == (k-1) mod p)
    RECV(addr, L)
  else
    RECV(addr, L)
    SEND(addr, L)
  endif
Assumes a blocking receive; the send may be non-blocking.
The broadcast time is (p-1)(β + Lτ).
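For concreteness, here is how the ring broadcast above might look in Python with mpi4py (an assumption: mpi4py and an MPI runtime are available; the function name and payload are illustrative, not from the slides). It can be run with, e.g., mpirun -n 4 python ring_bcast.py.

    # A minimal sketch of the ring broadcast, assuming mpi4py is installed.
    from mpi4py import MPI

    def ring_broadcast(comm, msg, root):
        q = comm.Get_rank()
        p = comm.Get_size()
        succ = (q + 1) % p
        pred = (q - 1) % p
        if q == root:
            comm.send(msg, dest=succ)          # the root only sends
        elif q == (root - 1) % p:
            msg = comm.recv(source=pred)       # the last processor only receives
        else:
            msg = comm.recv(source=pred)       # everyone else receives then forwards
            comm.send(msg, dest=succ)
        return msg

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        data = "hello ring" if comm.Get_rank() == 0 else None
        print(comm.Get_rank(), ring_broadcast(comm, data, root=0))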

7 Scatter
P_k stores the message destined to P_q at address addr[q], including a message for itself at addr[k]. The principle is to pipeline communication by starting with the message destined to P_{k-1}, the most distant processor.

8 Scatter
  q = RANK()
  p = NUMPROCS()
  if (q == k)
    for i = 1 to p-1
      SEND(addr[(k+p-i) mod p], L)
    addr ← addr[k]
  else
    RECV(tempR, L)
    for i = 1 to (k-1-q) mod p
      tempS ↔ tempR
      SEND(tempS, L) || RECV(tempR, L)
    addr ← tempR
Swapping of the send buffer and the receive buffer (pointers only).
Sending and receiving happen in parallel, with a non-blocking send.
Same execution time as the broadcast: (p-1)(β + Lτ).

9 All-to-all
  q = RANK()
  p = NUMPROCS()
  addr[q] ← my_addr
  for i = 1 to p-1
    SEND(addr[(q-i+1) mod p], L) || RECV(addr[(q-i) mod p], L)
Same execution time as the scatter: (p-1)(β + Lτ).

10 A faster broadcast?
How can one accelerate the broadcast? So far we've seen (p-1)(β + Lτ). One can cut the message into many small pieces, say r pieces (assume L is divisible by r); the root processor then simply sends r messages. The performance is as follows:
- consider the last processor to get the last piece of the message
- it takes p-1 steps for the first piece to arrive, which costs (p-1)(β + Lτ/r)
- then the remaining r-1 pieces arrive one after another, which costs (r-1)(β + Lτ/r)
- for a total of (p - 2 + r)(β + Lτ/r)

11 A faster broadcast?
The question is: what value of r minimizes (p - 2 + r)(β + Lτ/r)? One can view this expression as (c + ar)(d + b/r), with four constants a, b, c, d. The non-constant part is then ad·r + cb/r, which is minimized for r = sqrt(cb/(ad)). Here this gives
r_opt = sqrt(L(p-2)τ / β)
with optimal time (sqrt((p-2)β) + sqrt(Lτ))², which tends to Lτ when L is large, independently of p.
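A quick numerical sanity check of this analysis is easy to script. The Python sketch below uses made-up platform parameters; only the formulas come from the slides.

    import math

    def bcast_time(p, L, beta, tau, r):
        # pipelined broadcast time with the message cut into r pieces
        return (p - 2 + r) * (beta + L * tau / r)

    p, L, beta, tau = 16, 1e6, 1e-4, 1e-9   # hypothetical platform parameters
    r_opt = math.sqrt(L * (p - 2) * tau / beta)
    print("r_opt ~", r_opt)
    print("plain broadcast   :", (p - 1) * (beta + L * tau))
    print("pipelined at r_opt:", bcast_time(p, L, beta, tau, r_opt))
    print("lower bound L*tau :", L * tau)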

12 Matrix-Vector product
y = A x:
  for i = 0 to n-1      /* compute a dot-product */
    y[i] = 0
    for j = 0 to n-1
      y[i] = y[i] + a[i,j] * x[j]
Just distribute the dot-product computations among the processors. Let n be the size of the matrix and p the number of processors; assume that p divides n and let r = n/p. Each processor needs r rows of the matrix.

13 Matrix-vector product
What about the distribution of vector x? It could be replicated across all processors, and then all computations would be independent. But since each processor computes only a piece of y, it is more elegant to have x distributed like A, with each processor owning r components of the vector. This is typically what would be done in real code so that data is distributed across processors: for a vector it may be more efficient to fully duplicate it, but in general you don't want to do that for matrices or other large data structures.
Each processor has in its memory:
- r rows of matrix A, in an array a[r][n]
- r components of vector x, in an array my_x[r]

14 Global vs. Local indices
Having only a piece of the overall data structure is common:
- it makes it possible to partition the workload
- it makes it possible to run larger problems by aggregating distributed memory
Typically, when writing code like this, one has:
- a global index (I, J) that references an element of the matrix
- a local index (i, j) that references an element of the local array storing a piece of the matrix
Translation between global and local indices: think of the algorithm in terms of global indices, implement it in terms of local indices.
[figure: a 2 x 3 processor grid P_0..P_5 with blocks of size Mblock x Nblock; e.g., global A[5][7] is local a[1][3] on P_4, and in general a[i,j] = A[Mblock*floor(rank/3) + i][Nblock*(rank mod 3) + j]]
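As a small illustration, here is a Python sketch of the global/local translation for the 1-D row-block distribution used by the matrix-vector algorithm that follows (each processor owns r = n/p consecutive rows); the helper names are mine, not from the slides.

    def global_to_local(I, r):
        """Global row I -> (owning rank, local row index)."""
        return I // r, I % r

    def local_to_global(rank, i, r):
        """(rank, local row i) -> global row index."""
        return rank * r + i

    n, p = 8, 4
    r = n // p
    assert global_to_local(5, r) == (2, 1)      # A[5][*] lives on P2 as a[1][*]
    assert local_to_global(2, 1, r) == 5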

15 Principle of the Algorithm
[figure: initial data distribution for n = 8, p = 4, r = 2; P_q holds rows 2q and 2q+1 of A (all 8 columns) and components x_2q, x_2q+1 of x]

16-19 Principle of the Algorithm (steps 1 to 4)
[figures: at step 1, each P_q multiplies its two diagonal columns (columns 2q and 2q+1 of its rows) by its own piece (x_2q, x_2q+1); at each subsequent step, every processor sends its current piece of x to its successor, receives the piece held by its predecessor, and multiplies the two columns of its rows that correspond to the piece it now holds]

20 Principle of the Algorithm
[figure: final state, after p steps each piece of x is back on its original owner]
The final exchange of vector x is not strictly necessary, but one may want to have it distributed at the end of the computation the way it was distributed at the beginning.

21 Algorithm
Mat_vec(in A, in x, out y)
  q ← RANK()
  p ← NUMPROCS()
  tempS ← x            /* my piece of the vector (r elements) */
  for step = 0 to p-1
    SEND(tempS, r) || RECV(tempR, r) ||
    for i = 0 to r-1
      for j = 0 to r-1
        y[i] ← y[i] + a[i, ((q - step) mod p) * r + j] * tempS[j]
    tempS ← tempR
Uses two buffers (tempS for sending and tempR for receiving).
Computation and communication occur in parallel.
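To check the index arithmetic, here is a sequential numpy simulation of the ring algorithm above (a sketch, not a parallel implementation; numpy is assumed).

    import numpy as np

    def ring_matvec(A, x, p):
        n = A.shape[0]
        r = n // p                   # assume p divides n
        y = np.zeros(n)
        pieces = [x[q*r:(q+1)*r].copy() for q in range(p)]   # my_x on each processor
        for step in range(p):
            for q in range(p):
                # at this step, processor q holds the piece originally owned
                # by processor (q - step) mod p
                owner = (q - step) % p
                cols = slice(owner*r, (owner+1)*r)
                y[q*r:(q+1)*r] += A[q*r:(q+1)*r, cols] @ pieces[q]
            # every processor sends its piece to its successor on the ring
            pieces = [pieces[(q - 1) % p] for q in range(p)]
        return y

    A, x = np.random.rand(8, 8), np.random.rand(8)
    assert np.allclose(ring_matvec(A, x, 4), A @ x)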

22 Performance
There are p identical steps. During each step, each processor performs three concurrent activities: computing, receiving, and sending. Each step goes as fast as the slowest of the three:
- Computation: r² T_comp
- Communication: β + r T_comm
(T_comp and T_comm are the times for an individual computation and an individual transfer.)
T(p) = p × max(r² T_comp, β + r T_comm)
For fixed p, when n gets large the computation time dominates and efficiency tends to 1.

23 Performance (2)
Note that an algorithm that initially broadcasts the entire vector to all processors and then has every processor compute independently would run in time
(p-1)(β + n T_comm) + p r² T_comp
which:
- has the same asymptotic performance
- is a simpler algorithm
- wastes only a tiny bit of memory
- is arguably much less elegant
It is important to think of simple solutions and see what works best given the expected matrix sizes, etc.

24 An Image Processing Application
We have seen a few parallel applications with different ways of partitioning the work: matrix-matrix and matrix-vector multiply, sharks and fishes, numerical methods, ...
We're going to look in depth at another type of parallel application and see what the performance trade-offs are. We're still working on the ring topology. The application model is representative of several image processing applications.

25 The sequential application
A generic algorithmic framework that can be used for: distance from contour, computation of an optimal trajectory, and others.
Let P be an n x n grid, where each point is a pixel. A point p not on the edge has 8 neighbors: NW, N, NE, W, E, SW, S, SE.

26 Principle of the algorithm
Sweep through the grid, back and forth:
- first from the top-left corner to the bottom-right corner (FW)
- then back from the bottom-right corner to the top-left corner (BW)
FW pass: p ← FW_update(p, W, NW, N, NE)
BW pass: p ← BW_update(p, E, SE, S, SW)
[figure: the stencil; the FW update of p uses its W, NW, N, NE neighbors, the BW update uses E, SE, S, SW]

27 Why is this useful?
Distance from contour:
- let P be a binary image of an object F, with a pixel value of zero if the pixel belongs to F, and of ∞ otherwise
- we want to replace each pixel value by the pixel's distance to F's complement, according to some metric
- this can be done in two passes:
  FW: p ← min(p, W+t1, NW+t2, N+t1, NE+t2)
  BW: p ← min(p, E+t1, SE+t2, S+t1, SW+t2)
- t1 = 1, t2 = ∞: Manhattan distance
- t1 = 3, t2 = 4: good approximation of the Euclidean distance
- once one has this distance, it is easy to compute things such as surface, contour length, etc.
Computation of an optimal trajectory:
- each pixel has a "cost" value
- goal: compute minimal-cost trajectories from one pixel to all others
- a bit more complicated, with non-trivial updates (Bitz and Kung, 1988)
- O(n²) passes in the worst case (< n in practice)
...
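As a concrete illustration of the two-pass idea, here is a Python sketch with the Manhattan weights t1 = 1, t2 = ∞ (so only the N/S/E/W terms of the stencil remain). One assumption in my reading: pixels of F are initialized to 0, and every pixel ends up with its distance to the nearest zero pixel.

    import numpy as np

    def distance_from_contour(mask):
        """mask[i, j] == True where the pixel belongs to the object F."""
        n, m = mask.shape
        INF = 10**9
        d = np.where(mask, 0, INF).astype(np.int64)
        # FW pass: top-left to bottom-right, stencil reduced to {N, W}
        for i in range(n):
            for j in range(m):
                if i > 0: d[i, j] = min(d[i, j], d[i-1, j] + 1)   # N + t1
                if j > 0: d[i, j] = min(d[i, j], d[i, j-1] + 1)   # W + t1
        # BW pass: bottom-right to top-left, stencil reduced to {S, E}
        for i in range(n - 1, -1, -1):
            for j in range(m - 1, -1, -1):
                if i < n - 1: d[i, j] = min(d[i, j], d[i+1, j] + 1)   # S + t1
                if j < m - 1: d[i, j] = min(d[i, j], d[i, j+1] + 1)   # E + t1
        return d

    mask = np.zeros((5, 7), dtype=bool); mask[2, 3] = True   # a single object pixel
    print(distance_from_contour(mask))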

28 Parallelization?
Stencil applications are common and many people have looked at parallelizing them. Here the stencil is interesting because it is asymmetric and leads to a "wavefront" computation. We want to do this on a ring of processors. The usual trade-offs apply:
- load-balance the work among processors
- don't pay too much for communication
- get all processors to start computing early

29 A Greedy Algorithm
Processors send pixels to their neighbors as soon as they are computed:
- very small start-up time
- good load balancing
Say we have p = n processors and each line i of the image is assigned to processor P_i. In a FW phase, as soon as P_i computes a pixel, it must send it to P_{i+1}. Given the shape of the stencil, a processor needs two values from its predecessor before it can start computing on a line.

30-35 Execution Steps
[sequence of figures: with one image line per processor, the wavefront fills in one anti-diagonal per time step; pixel (i, j) is computed at step 2i + j, so P_0 computes row 0 at steps 0, 1, 2, ..., P_1 starts row 1 at step 2, P_2 starts row 2 at step 4, and so on]

36 Execution Steps
[figure: steps 0 through 9 of the wavefront across P_0 ... P_4]
At "step" 2i+j, processor P_i does:
- receive pixel (i-1, j+1) from P_{i-1}
- compute pixel (i, j)
- send pixel (i, j) to P_{i+1}
Note the similarity to a systolic network.

37 Performance?
Assume that sends are non-blocking and receives are blocking. For each row, each processor follows the sequence: get a pixel, compute, then send a pixel || get a pixel.
[timeline figure: C = compute, S = send, receive, or send || receive; P_i starts after a delay of 2i x (T_comp + β + T_comm) and spends n x (T_comp + β + T_comm) per row]
T_comm = time to send a pixel to a neighbor
T_comp = time to compute a pixel

38 Performance
Processor P_{p-1} is the last one to finish. It finishes at time
T = (3n-2) T_comp + (3n-3)(β + T_comm)
Therefore we have O(n) complexity, and we would have stopped here in the land of PRAMs. The problem is the 3nβ term: in practice β is orders of magnitude larger than T_comm, and short messages are known to be a bad idea on most platforms. So we have:
- a small start-up time
- reasonably good load balancing
- expensive communications

39 What if p < n?
This is the realistic case. When p < n, one must partition the data (we assume that p divides n). One could give the first n/p lines to P_0, the next n/p lines to P_1, etc., but then the last processor would start computing very late because of the stencil shape. A better way is to interleave image lines between processors, the classical load-balancing technique we mentioned for sharks and fishes.
[timeline figure: with interleaved lines, P_0 starts its second line only after receiving from P_{p-1}; the wrap-around delay is 2p x (T_comp + β + T_comm) and each line still takes n x (T_comp + β + T_comm)]

40 Condition for no idle time
Processor P_0 finishes computing its first line at time T_0 = n x (T_comp + β + T_comm). Processor P_0 receives the data from P_{p-1} needed for its second line at time T_p = 2p x (T_comp + β + T_comm). If T_p > T_0 we have idle time, so we need T_p ≤ T_0, i.e., n ≥ 2p.
If n > 2p, then P_0 must store pixels received from P_{p-1} until it can start computing on them. There is a trade-off between idle time and memory consumption, with the perfect balance exactly when n = 2p. This notion of finishing to receive data right when the next computation should start is a common way to obtain "good" schedules (we'll see it again when we talk about Divisible Load Scheduling).
We still have the same problem of expensive communications.

41 Idea #1: cheaper communications
To get rid of most of the network latencies we need to send longer messages, so we let each processor compute k consecutive pixels at each step. We initiate the process by having P_0 compute some number of pixels, l0:
- P_0 starts by computing l0 pixels without any communication
- P_0 sends these l0 pixels to P_1 and then computes its next k pixels
- P_1 can start computing l0 - 1 pixels in parallel with P_0's computation of its next k pixels
- when P_1 is done computing, it sends l0 - 1 pixels to P_2; P_2 can then start computing its first l0 - 2 pixels, and so on
When one reaches the end of a line, one just starts the next line, in the interleaved pattern we saw before. At each step, except the first and perhaps the last (depending on whether k divides n - l0), each processor computes k pixels.

42-44 Execution steps
[figures: first line of the image; P_0 computes l0 pixels at step 0 and then k pixels per step; P_1 starts with l0 - 1 pixels at step 1, P_2 with l0 - 2 pixels at step 2, and so on, each then computing k pixels per step]

45 Execution steps
[figure: the pipeline continuing from the first line into the second line]
The condition for no idle time is n ≥ (k+1)p (we'll prove it later).

46 Execution steps
[figure: all four processors in steady state, computing k pixels per step]
The larger k, the cheaper the communications (fewer start-ups); but the larger k, the longer downstream processors wait before they can start computing. Hence there is an optimal k.

47 Idea #2: fewer communications
To do fewer communications, one can assign blocks of r lines to each processor (to increase locality):
- no communications between lines within a block
- block allocation is interleaved (block cyclic)
Example: p = 4, n = 36, r = 3
            P0         P1         P2         P3
  lines     0,1,2      3,4,5      6,7,8      9,10,11
            12,13,14   15,16,17   18,19,20   21,22,23
            24,25,26   27,28,29   30,31,32   33,34,35
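The mapping from an image line to its owner in this block-cyclic allocation is a one-liner; the Python sketch below reproduces the table above.

    def owner(line, p, r):
        # block-cyclic: blocks of r consecutive lines, dealt out round-robin
        return (line // r) % p

    p, r = 4, 3
    assert [owner(l, p, r) for l in range(0, 12, 3)] == [0, 1, 2, 3]   # first blocks
    assert owner(12, p, r) == 0 and owner(23, p, r) == 3               # second blocks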

48 Execution Steps
[figure: execution steps for n = 44, p = 4, r = 3, shown both for k = 4 and for k = 13]

49 Execution Steps
[same figure as the previous slide: with k = 13 the no-idle-time condition is violated and processors incur IDLE time]

50 Condition for no idle time?
We will see that it is n ≥ p(r+k). Since we've reduced communication, we have increased the start-up delay. We now have two trade-offs:
- large k and large r: cheap communication
- small k and small r: low start-up delay
- small k and large r? large k and small r?
We need a thorough performance analysis in order to determine the optimal values of k and r. This is what people who design parallel algorithms do to tune performance.

51 Performance Analysis
Let us assume that p x r divides n. The sequential time is n² T_comp.
As before, the algorithm can be seen as a succession of stages in which each processor sends/receives data and computes, in parallel. At each stage a processor receives k pixels from its predecessor, computes r sub-lines of k pixels, and sends the last k pixels to its successor. Therefore:
- communication cost per stage: β + k T_comm
- computation cost per stage: r k T_comp
- total cost: T_stage = β + k T_comm + r k T_comp

52 Performance Analysis
First thing to do: figure out at which stage s_q processor q, 0 ≤ q ≤ p-1, starts computing, and how many pixels l_q it computes in its first chunk.
- P_0 starts computing l_0 = l0 pixels at stage s_0 = 0
- P_1 starts computing l_1 = (l0 - r) mod k pixels at stage s_1 = 1 + ⌈(r - l0)/k⌉
Examples:
- r = 3, l0 = 12, k = 13: l_1 = (12 - 3) mod 13 = 9 and s_1 = 1 + ⌈(3 - 12)/13⌉ = 1
- r = 3, l0 = 2, k = 4: l_1 = (2 - 3) mod 4 = 3 and s_1 = 1 + ⌈(3 - 2)/4⌉ = 2
More generally: l_q = (l0 - qr) mod k and s_q = q + ⌈(qr - l0)/k⌉
- r = 3, l0 = 2, k = 4: l_2 = (2 - 2*3) mod 4 = 4 (there is a subtlety here) and s_2 = 2 + ⌈(2*3 - 2)/4⌉ = 3
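These formulas are easy to get wrong, so here is a small Python sketch that reproduces the three examples above. Mapping a result of 0 to k is my reading of "there is a subtlety here".

    import math

    def first_chunk(q, l0, r, k):
        lq = (l0 - q * r) % k
        if lq == 0:
            lq = k                      # a full chunk of k pixels (the "subtlety")
        sq = q + math.ceil((q * r - l0) / k)
        return lq, sq

    assert first_chunk(1, 12, 3, 13) == (9, 1)
    assert first_chunk(1, 2, 3, 4) == (3, 2)
    assert first_chunk(2, 2, 3, 4) == (4, 3)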

53 Performance Analysis
Now we just need to count the total number of stages, S_p. P_{p-1} is the last processor to complete. After computing its first "chunk", P_{p-1} has ⌈(n²/(pr) - l_{p-1} + r - 1)/k⌉ chunks left to compute. Therefore:
S_p = s_{p-1} + ⌈(n²/(pr) - l_{p-1} + r - 1)/k⌉
T// = T_stage x S_p

54 Performance Analysis
Our analysis is valid only if there is no idle time (we don't really care about modeling the cases with idle time anyway). P_0 receives its first pixels from P_{p-1} at stage t_p. At that point, it has already computed l0 + k(t_p - 1) pixels of its first line (l0 at first, and then k at each stage). The number of pixels left to compute in the first line, plus the l_p pixels it can compute in the first line of its second block using the pixels sent by P_{p-1}, must be greater than or equal to k. (The argument is similar for the first algorithm we looked at, just a bit more complex.) This condition can be written as
n - (l0 + k(t_p - 1)) + l_p ≥ k
which is equivalent to n ≥ p(r + k).

55 Performance Analysis
Neglecting constant terms, one obtains
T// = (β + k T_comm + r k T_comp) x ((p - 1)(1 + r/k) + n²/(p r k))
provided that r ≤ n/p and 1 ≤ k ≤ n/p - r.
Given n, p, and r, one can compute the k_opt(r) that minimizes T//; one then plugs that value into T// for different values of r to find the best r. Voila :)
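Rather than deriving k_opt in closed form, one can simply evaluate T// numerically. The Python sketch below searches over all feasible (r, k) pairs; the platform parameters are made up for illustration.

    def t_par(n, p, r, k, beta, t_comm, t_comp):
        stage = beta + k * t_comm + r * k * t_comp
        stages = (p - 1) * (1 + r / k) + n**2 / (p * r * k)
        return stage * stages

    n, p = 1024, 16
    beta, t_comm, t_comp = 1e-4, 1e-8, 1e-9     # hypothetical values
    best = min(
        ((t_par(n, p, r, k, beta, t_comm, t_comp), r, k)
         for r in range(1, n // p + 1)
         for k in range(1, n // p - r + 1)),    # constraints r <= n/p, k <= n/p - r
        default=None)
    print("best (time, r, k):", best)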

56 Lessons Learned
- It's often a good idea to start thinking of the problem in a systolic-array fashion (as many processors as elements) and then think of a data distribution when fewer processors are available.
- Communication costs can be reduced by delaying communications and "bundling" them into a single, longer message.
- Communications can be reduced by "blocking".
- Better load balancing is obtained with an interleaved or "cyclic" distribution.
- Many algorithms are best implemented with a "block cyclic" data distribution.
- Performance analysis is difficult, although it is easy to find big-O estimates.
- Choosing the best bundling and blocking factors is non-trivial and completely problem-dependent, although there are some rules of thumb.
- This must all be put in perspective with the hardware (e.g., cache size).
- Good parallel computing is hard.

57 Solving Linear Systems of Equations
The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974]. Gaussian Elimination is perhaps the most well-known method. It is based on the fact that the solution of a linear system is invariant under scaling and under row additions:
- one can multiply a row of the matrix by a constant, as long as one multiplies the corresponding element of the right-hand side by the same constant
- one can add one row of the matrix to another, as long as one adds the corresponding elements of the right-hand side
Idea: scale and add equations so as to transform matrix A into an upper triangular matrix.
[figure: an upper triangular system; equation n-i has i unknowns]

58 Gaussian Elimination
[ 1  1  1 ] [x1]   [ 0 ]
[ 1 -2  2 ] [x2] = [ 4 ]
[ 1  2 -1 ] [x3]   [ 2 ]
Subtract row 1 from rows 2 and 3:
[ 1  1  1 ] [x1]   [ 0 ]
[ 0 -3  1 ] [x2] = [ 4 ]
[ 0  1 -2 ] [x3]   [ 2 ]
Multiply row 3 by 3 and add row 2:
[ 1  1  1 ] [x1]   [ 0 ]
[ 0 -3  1 ] [x2] = [ 4 ]
[ 0  0 -5 ] [x3]   [ 10 ]
Solving the equations in reverse order (backsolving):
-5 x3 = 10          =>  x3 = -2
-3 x2 + x3 = 4      =>  x2 = -2
x1 + x2 + x3 = 0    =>  x1 = 4

59 Gaussian Elimination
The algorithm goes through the matrix from the top-left corner to the bottom-right corner. The i-th step eliminates the non-zero sub-diagonal elements in column i by subtracting the i-th row scaled by a_ji / a_ii from row j, for j = i+1, ..., n.
[figure: at step i, the top-left part of the matrix holds values already computed, row i is the pivot row, the elements of column i below the diagonal are to be zeroed, and the bottom-right submatrix holds values yet to be updated]

60 Sequential Gaussian Elimination
A simple sequential algorithm:
  // for each column i, zero it out below the diagonal
  // by adding multiples of row i to later rows
  for i = 1 to n-1
    // for each row j below row i
    for j = i+1 to n
      // add a multiple of row i to row j
      for k = i to n
        A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Several "tricks" do not change the spirit of the algorithm but make the implementation easier and/or more efficient:
- the right-hand side is typically kept in column n+1 of the matrix, and one then speaks of an augmented matrix
- the A(j,i)/A(i,i) factor is computed once, outside of the innermost loop
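Here is a direct numpy transcription of this loop nest (augmented matrix, factor computed outside the inner loop, no pivoting), applied to the 3 x 3 example from slide 58; a sketch, not production code.

    import numpy as np

    def gaussian_elimination(Ab):
        A = Ab.astype(float)                      # augmented matrix, n x (n+1)
        n = A.shape[0]
        for i in range(n - 1):                    # for each column i
            for j in range(i + 1, n):             # for each row j below row i
                factor = A[j, i] / A[i, i]        # computed once per row
                A[j, i:] -= factor * A[i, i:]
        # backsolving on the triangularized augmented matrix
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (A[i, n] - A[i, i+1:n] @ x[i+1:]) / A[i, i]
        return x

    Ab = np.array([[1., 1., 1., 0.],
                   [1., -2., 2., 4.],
                   [1., 2., -1., 2.]])
    print(gaussian_elimination(Ab))               # expect [4, -2, -2]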

61 Pivoting: Motivation
A few pathological cases: division by small numbers leads to round-off error in computer arithmetic. Consider the following system:
0.0001 x1 + x2 = 1.000
       x1 - x2 = 1.000
The exact solution is x1 = 1.9998... and x2 = 0.99980...
Say we round off to 4 significant digits. Eliminating x1 without pivoting means subtracting 10^4 times row 1 from row 2:
[ 10^-4   1  | 1 ]      [ 10^-4      1      | 1        ]
[   1    -1  | 1 ]  ->  [   0    -1 - 10^4  | 1 - 10^4 ]
and -1 - 10^4 = -10,001 = -0.10001 E+5 is rounded to -0.1000 E+5: a round-off error that makes the computed solution inaccurate.

62 Partial Pivoting
One can just swap rows, and the final solution is closer to the real solution. (Magical.) Numerical stability is an entire field.
Partial pivoting: for numerical stability, one does not take the rows in order, but picks, among rows i to n, the one with the largest element in column i. This row is swapped with row i (along with the corresponding elements of the right-hand side) before the subtractions; in practice the swap is not done physically, one keeps an indirection array instead.
Total pivoting: look for the greatest element ANYWHERE in the remaining matrix, then swap both columns and rows.
On the previous example, swapping the two rows gives
[   1    -1  | 1 ]      [ 1      -1      | 1        ]
[ 10^-4   1  | 1 ]  ->  [ 0   1 + 10^-4  | 1 - 10^-4 ]
and with the same rounding the computed solution is good: x1 ≈ 2, x2 ≈ 1.

63 Parallel Gaussian Elimination
Assume that we have one processor per matrix element (as in a PRAM or a systolic array). Each step then consists of:
- Reduction: find the max a_ji in the column (the pivot)
- Broadcast: the max a_ji is needed to compute the scaling factors
- Compute: independent computation of the scaling factors
- Broadcasts: every update needs the scaling factor and the element from the pivot row
- Compute: independent update computations
[figure: the reduction, broadcast, and compute phases laid out on the element grid]

64 Parallel Gaussian Elimination
Once one understands the algorithm assuming one processor per element, one can decide on a data distribution when there are fewer processors:
- one column per processor: removes the reduction and some broadcasts
- one column block per processor: increases locality when one doesn't have as many processors as columns
- one MUST use a cyclic distribution, since the matrix is traversed from top-left to bottom-right
- good approach: pick a block size and allocate column blocks to processors in an interleaved manner: a 1-D block cyclic distribution
- better approach with many processors: also partition the rows into blocks, to obtain a 2-D block cyclic distribution
The 2-D block cyclic distribution is sort of the panacea of dense linear algebra, as it allows for good locality and good load balancing, at the cost of more complicated code.

65 LU Factorization
Gaussian Elimination is simple, but what if we have to solve many Ax = b systems for different values of b? This happens a LOT in real applications. Another method is the LU factorization: say we can rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix. Computing the factorization costs O(n³), and then Ax = b is written L U x = b:
- solve L y = b: O(n²)
- solve U x = y: O(n²)
Triangular system solves are easy: in L y = b equation i has i unknowns, and in U x = y equation n-i has i unknowns.

66 LU Factorization: Principle
It works just like Gaussian Elimination, but instead of zeroing out elements one "saves" the scaling coefficients in their place. Magically, A = L x U! (It should be done with pivoting as well.)
Starting from
A = [ 1  2 -1 ]
    [ 4  3  1 ]
    [ 2  2  3 ]
Gaussian elimination with the scaling factors saved in the zeroed positions (shown in parentheses) proceeds as
[ 1    2   -1 ]      [ 1    2   -1 ]      [ 1    2    -1 ]
[(4)  -5    5 ]  ->  [(4)  -5    5 ]  ->  [(4)  -5     5 ]
[ 2    2    3 ]      [(2)  -2    5 ]      [(2) (2/5)   3 ]
which gives
L = [ 1   0   0 ]     U = [ 1  2 -1 ]
    [ 4   1   0 ]         [ 0 -5  5 ]
    [ 2  2/5  1 ]         [ 0  0  3 ]

67 LU Factorization
We're going to look at the simplest possible version:
- No pivoting: pivoting just creates a bunch of indirections that are easy but make the code look complicated.
- No blocking: this is not what one should do on a modern machine (i.e., one with a cache), but adding blocking transforms a 5-line algorithm into several pages of code (just go look at the LAPACK code and see how complicated everything looks).
Very often the principle is simple but the code is extremely complex, just because of optimizations and numerical stability. The ScaLAPACK 2-D block-cyclic LU factorization code is layered on top of many libraries; if you were to write it as a one-level procedure that uses MPI directly, it would be many, many pages: it deals with rectangular blocks and all the horrible cases in which nothing divides anything, prime numbers, nothing's a perfect square, etc.

68 Sequential Algorithm
The zeroed elements are replaced, in place, by the scaling factors:
LU-sequential(A, n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      a_ik ← -a_ik / a_kk
    for j = k+1 to n-1
      // Task T_kj: update of column j
      for i = k+1 to n-1
        a_ij ← a_ij + a_ik * a_kj
  }
}

69 Sequential Algorithm
LU-sequential(A, n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      a_ik ← -a_ik / a_kk
    for j = k+1 to n-1
      // Task T_kj: update of column j
      for i = k+1 to n-1
        a_ij ← a_ij + a_ik * a_kj
  }
}
[figure: at step k, column k is prepared below the diagonal and the trailing submatrix (rows i > k, columns j > k) is updated]
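Here is a numpy sketch of this sequential LU. Instead of the in-place "-a_ik/a_kk" convention of the pseudo-code it stores the positive multipliers in a separate L, which matches the worked example of slide 66.

    import numpy as np

    def lu_no_pivot(A):
        A = A.astype(float)
        n = A.shape[0]
        L, U = np.eye(n), A.copy()
        for k in range(n - 1):                               # prepare column k
            L[k+1:, k] = U[k+1:, k] / U[k, k]                # scaling factors
            U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:])    # update trailing rows
        return L, U

    A = np.array([[1., 2., -1.],
                  [4., 3., 1.],
                  [2., 2., 3.]])
    L, U = lu_no_pivot(A)
    assert np.allclose(L @ U, A)
    print(L)        # [[1, 0, 0], [4, 1, 0], [2, 0.4, 1]]
    print(U)        # [[1, 2, -1], [0, -5, 5], [0, 0, 3]]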

70 Parallel LU on a ring
Since the algorithm operates on columns from left to right, we should distribute columns to processors. At each step, the processor that owns column k does the "prepare" task and then broadcasts the bottom part of column k to all others; the other processors can then update. Assume there is a function alloc(k) that returns the rank of the processor that owns column k. We will write everything in terms of global indices, so as to avoid annoying index arithmetic.

71 LU-broadcast algorithm
LU-broadcast(A, n) {
  q ← RANK()
  p ← NUMPROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← a_ik ← -a_ik / a_kk
    broadcast(alloc(k), buffer, n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          a_ij ← a_ij + buffer[i-k-1] * a_kj
  }
}

72 Dealing with local indices
Assume that p divides n. Each processor stores r = n/p columns, and its local indices go from 0 to r-1. After step k, only columns with index greater than k are used. Simple idea: use a local index l that everyone initializes to 0; at step k, processor alloc(k) increments its local index so that next time it points to its next local column.

73 LU-broadcast algorithm (local indices)
  double a[n][r];        // rows 0..n-1, local columns 0..r-1
  q ← RANK()
  p ← NUMPROCS()
  l ← 0
  for k = 0 to n-2 {
    if (alloc(k) == q) {
      for i = k+1 to n-1
        buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
      l ← l+1
    }
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
      for i = k+1 to n-1
        // note that this loop is simpler
        a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
  }
We have replaced the a_ij matrix elements by arrays (it is typically a good idea to first write the algorithm with global indices and then move to local indices).

74 Load-Balancing
How should the columns be distributed? There are fewer and fewer active columns as the execution proceeds, which calls for a cyclic distribution. More subtly, the amount of computation to be done isn't proportional to the data size: the last column is updated n-1 times, while the first column only once. Columns of higher index require more computation, which again calls for a cyclic distribution (processor p-1 may have a bit more work, but asymptotically it's insignificant).
The performance analysis is a bit complex; up to lower-order terms one gets:
- n β + (1/2) n² T_comm for the communications
- (1/2) n² T_comp for the column preparations
- (1/3) (n³/p) T_comp + O(n²) for the updates

75 Pipelining on the Ring
So far, the algorithm we've seen uses a simple broadcast. Nothing was specific to being on a ring of processors, and it's portable: in fact you could write raw MPI that looks just like our pseudo-code and have a very limited, inefficient LU factorization that works only for certain numbers of processors. But it's not efficient: the n-1 communication steps are not overlapped with computations. It turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation. It almost looks like inserting the source code of the ring broadcast we saw at the very beginning throughout the LU code.

76 LU-pipeline algorithm
  double a[n][r];
  q ← RANK()
  p ← NUMPROCS()
  l ← 0
  for k = 0 to n-2 {
    if (k == q mod p) {
      // Prep(k)
      for i = k+1 to n-1
        buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
      l ← l+1
      SEND(buffer, n-k-1)
    } else {
      RECV(buffer, n-k-1)
      if (q ≠ (k-1) mod p)
        SEND(buffer, n-k-1)
    }
    for j = l to r-1
      // Update(k, j)
      for i = k+1 to n-1
        a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
  }

77 [schedule figure: the first four steps (k = 0..3) on p = 4 processors with n = 16 columns; each column of the figure shows one processor's sequence of Prep(k), Send(k), Recv(k), and Update(k, j) operations]
Some communication occurs in parallel with computation; a processor sends out the pivot data as soon as it receives it.

78 How can we do better?
In the previous algorithm, a processor does all its updates before doing a Prep() computation that then leads to a communication. But some of these updates can be done later. Idea: send out the pivot column as soon as possible.
Example, in the previous algorithm:
  P1: Receive(0), Send(0)
  P1: Update(0,1), Update(0,5), Update(0,9), Update(0,13)
  P1: Prep(1)
  P1: Send(1)
  ...
In the new algorithm:
  P1: Receive(0), Send(0)
  P1: Update(0,1)
  P1: Prep(1)
  P1: Send(1)
  P1: Update(0,5), Update(0,9), Update(0,13)
  ...

79 [schedule figure: the same first four steps, now with the reordering described on the previous slide]
Some communication occurs in parallel with computation; a processor sends out data as soon as it receives it.

80 LU-look-ahead algorithm
  q ← RANK()
  p ← NUMPROCS()
  l ← 0
  for k = 0 to n-2 {
    if (k == q mod p) {
      Prep(k)
      Send(buffer, n-k-1)
      for all j == q mod p, j > k: Update(k-1, j)   // updates deferred at step k-1 (for k > 0)
      for all j == q mod p, j > k: Update(k, j)
    } else {
      Recv(buffer, n-k-1)
      if (q ≠ (k-1) mod p) Send(buffer, n-k-1)
      if (q == (k+1) mod p)
        Update(k, k+1)                              // update only the next pivot column now
      else
        for all j == q mod p, j > k: Update(k, j)
    }
  }

81 Further improving performance
One can also use local overlap of communication and computation: multi-threading, a good non-blocking MPI implementation, etc. There is much more to be said about parallel LU factorization: many research articles, many libraries available.

82 Matrix-multiply on a grid/torus

83 2-D Torus topology
We've looked at a ring, but for some applications it's convenient to use a 2-D grid topology. A 2-D grid with "wrap-around" is called a 2-D torus. Advanced parallel linear algebra libraries/languages allow combining arbitrary data distribution strategies with arbitrary topologies (ScaLAPACK, HPF):
- 1-D block cyclic on a ring
- 2-D block cyclic on a 2-D grid
- 2-D block non-cyclic on a ring
- etc.
In practice, for many linear algebra kernels, a 2-D block-cyclic distribution on a 2-D grid seems to work best in most situations:
- we've seen that blocks are good for locality
- we've seen that cyclic is good for load balancing

84 Semantics of a parallel linear algebra routine?
Centralized: when calling a function (e.g., LU),
- the input data is available on a single "master" machine
- the input data must then be distributed among the workers
- the output data must be un-distributed and returned to the "master" machine
This is more natural/easy for the user and allows the library to make data distribution decisions transparently, but it is prohibitively expensive if one does sequences of operations (and one almost always does).
Distributed: when calling a function (e.g., LU),
- assume that the input is already distributed
- leave the output distributed
This may require "redistributing" data in between calls so that the distributions match, which is harder for the user and may be costly as well; for instance one may want to change the block size between calls, or go from a non-cyclic to a cyclic distribution.
Most current software adopts the distributed approach: more work for the user, but more flexibility and control.

85 Matrix-matrix multiply
Many people have thought about how to do a matrix multiply on a 2-D torus. Assume that we have three N x N matrices A, B, and C, and p processors arranged so that p = q² is a perfect square and the processor grid is q x q. We're looking at a block distribution (but not a cyclic distribution: again, that would obfuscate the code too much). We're going to look at three algorithms: Cannon, Fox, and Snyder.
[figure: A, B, and C each partitioned into q x q = 4 x 4 blocks, with block (i, j) placed on processor (i, j)]

86 Cannon's Algorithm (1969)
Very simple (it comes from systolic arrays). It starts with a data redistribution of matrices A and B; the goal is to have only neighbor-to-neighbor communications:
- A is circularly shifted/rotated "horizontally" so that its diagonal ends up on the first column of processors (row i is shifted left by i)
- B is circularly shifted/rotated "vertically" so that its diagonal ends up on the first row of processors (column j is shifted up by j)
This is called preskewing.
[figure: after preskewing, processor (i, j) holds A_{i,(i+j) mod q} and B_{(i+j) mod q, j}]

87 Cannon's Algorithm
Preskewing of A and B
for k = 1 to q, in parallel on all processors:
  Local C = C + A*B
  Vertical shift of B
  Horizontal shift of A
Postskewing of A and B
Of course, the computation and the communications could be done in an overlapped fashion locally at each processor.
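Here is a numpy simulation of Cannon's algorithm, with the blocks kept in q x q Python lists and list rotations standing in for the shifts (a sketch to check the skewing arithmetic, not a distributed implementation; numpy is assumed).

    import numpy as np

    def cannon(A, B, q):
        N = A.shape[0]
        m = N // q                                   # block size (q divides N)
        blk = lambda M, i, j: M[i*m:(i+1)*m, j*m:(j+1)*m].copy()
        a = [[blk(A, i, j) for j in range(q)] for i in range(q)]
        b = [[blk(B, i, j) for j in range(q)] for i in range(q)]
        c = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
        # preskewing: row i of A shifted left by i, column j of B shifted up by j
        a = [a[i][i:] + a[i][:i] for i in range(q)]
        b = [[b[(i + j) % q][j] for j in range(q)] for i in range(q)]
        for _ in range(q):
            for i in range(q):
                for j in range(q):
                    c[i][j] += a[i][j] @ b[i][j]     # local block multiply
            a = [row[1:] + row[:1] for row in a]     # horizontal shift of A
            b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # vertical shift of B
        return np.block(c)

    N, q = 8, 4
    A, B = np.random.rand(N, N), np.random.rand(N, N)
    assert np.allclose(cannon(A, B, q), A @ B)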

88 Execution Steps...
[figures: the local computation on processor (0,0) with the preskewed blocks, then the horizontal shift of A and vertical shift of B, then the next local computation on processor (0,0)]

89 Fox's Algorithm (1987)
Originally developed for Caltech's Hypercube. It uses broadcasts and is also called the broadcast-multiply-roll algorithm:
- it broadcasts the diagonals of matrix A (first diagonal, second diagonal, third diagonal, ...)
- it uses a vertical shift of matrix B
- there is no preskewing step

90 Execution Steps...
[figures: initial state; broadcast of A's 1st (block) diagonal along the processor rows; local computation with the current blocks of B]

91 Execution Steps...
[figures: vertical shift of B; broadcast of A's 2nd diagonal; local computation]

92 Fox's Algorithm
// No initial data movement
for k = 1 to q, in parallel on all processors:
  Broadcast A's k-th diagonal along the processor rows
  Local C = C + A*B
  Vertical shift of B
// No final data movement
Note that an additional array is needed to store the incoming diagonal block.
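The same kind of simulation works for Fox's broadcast-multiply-roll: the row broadcast of the k-th diagonal block and the vertical roll of B are done on Python lists (again a sketch; numpy is assumed).

    import numpy as np

    def fox(A, B, q):
        N = A.shape[0]
        m = N // q
        blk = lambda M, i, j: M[i*m:(i+1)*m, j*m:(j+1)*m].copy()
        b = [[blk(B, i, j) for j in range(q)] for i in range(q)]
        c = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
        for k in range(q):
            for i in range(q):
                d = blk(A, i, (i + k) % q)    # k-th diagonal block, broadcast on row i
                for j in range(q):
                    c[i][j] += d @ b[i][j]    # local multiply with the current B block
            b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]   # roll B up
        return np.block(c)

    N, q = 8, 4
    A, B = np.random.rand(N, N), np.random.rand(N, N)
    assert np.allclose(fox(A, B, q), A @ B)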

93 Snyder's Algorithm (1992)
More complex than Cannon's or Fox's:
- it first transposes matrix B
- it uses reduction operations (sums) across the rows of matrix C
- it shifts matrix B

94 Execution Steps...
[figures: initial state; transpose of B; local computation]

95 Execution Steps...
[figures: shift of B; global sum on the rows of C; local computation]

96 Execution Steps...
[figures: second shift of B; global sum on the rows of C; local computation]

97 Complexity Analysis
The analysis is somewhat cumbersome. Two models are used:
- 4-port model: every processor can communicate with its 4 neighbors in one step (matches underlying architectures like the Intel Paragon)
- 1-port model: each processor can do only a single communication at a time
Both models are assumed bi-directional.

98 One-port results
[table: the running-time formulas of Cannon's, Fox's, and Snyder's algorithms under the one-port model (not reproduced here)]

99 Complexity Results
In these expressions, m is the block size. The expressions for the 4-port model are MUCH more complicated. Remember that this is all for non-cyclic distributions; the formulae and the code become very complicated for a full-fledged implementation (nothing divides anything, nothing's a perfect square, etc.).
Performance analysis of real code is known to be hard, and it is done only in a few restricted cases. An interesting approach is to use simulation, as done in ScaLAPACK for instance: essentially, you have written a code so complex that you just run a simulation of it to figure out how fast it goes in different cases.

