Affine Partitioning for Parallelism & Locality Amy Lim Stanford University


1 Affine Partitioning for Parallelism & Locality
Amy Lim, Stanford University
http://suif.stanford.edu/

2 Useful Transforms for Parallelism & Locality

INTERCHANGE
  FOR i                        FOR j
    FOR j              ==>       FOR i
      A[i,j] =                     A[i,j] =

REVERSAL
  FOR i = 1 TO n       ==>   FOR i = n DOWNTO 1
    A[i] =                     A[i] =

SKEWING
  FOR i = 1 TO n               FOR i = 1 TO n
    FOR j = 1 TO n     ==>       FOR k = i+1 TO i+n
      A[i,j] =                     A[i,k-i] =

FUSION/FISSION
  FOR i = 1 TO n               FOR i = 1 TO n
    A[i] =             <=>       A[i] =
  FOR i = 1 TO n                 B[i] =
    B[i] =

REINDEXING
  FOR i = 1 TO n               A[1] = B[0]
    A[i] = B[i-1]      ==>     FOR i = 1 TO n-1
    C[i] = A[i+1]                A[i+1] = B[i]
                                 C[i] = A[i+1]
                               C[n] = A[n+1]

Traditional approach: is it legal & desirable to apply one transform?
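A small sketch (my illustration, not from the slides; the array contents and size n are assumed) checking that two of the listed transforms, interchange and skewing, really do preserve the result of a simple loop nest:

```python
# Verify that interchange and skewing preserve the semantics of a loop nest
# whose iterations are independent. A, n, and the update are illustrative.
n = 8

def init():
    return [[i * n + j for j in range(n + 1)] for i in range(n + 1)]

def original():
    A = init()
    for i in range(1, n + 1):              # FOR i = 1 TO n
        for j in range(1, n + 1):          #   FOR j = 1 TO n
            A[i][j] = A[i][j] + 1          #     A[i,j] = ...
    return A

def interchanged():
    A = init()
    for j in range(1, n + 1):              # loops swapped: legal here because
        for i in range(1, n + 1):          # the iterations are independent
            A[i][j] = A[i][j] + 1
    return A

def skewed():
    A = init()
    for i in range(1, n + 1):                  # FOR i = 1 TO n
        for k in range(i + 1, i + n + 1):      #   FOR k = i+1 TO i+n
            A[i][k - i] = A[i][k - i] + 1      #     A[i,k-i] = ... (same set
    return A                                   #     of iterations, renamed)

assert original() == interchanged() == skewed()
```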

3 Question: How to Combine the Transformations?

Affine mappings [Lim & Lam, POPL '97; ICS '99]
- Domain: arbitrary loop nesting, affine loop indices; instructions optimized separately
- Unifies:
  - permutation
  - skewing
  - reversal
  - fusion
  - fission
  - statement reordering
- Supports blocking across all (non-perfectly nested) loops
- Optimal: maximum degree of parallelism & minimum degree of synchronization
- Minimizes communication by aligning the computation and pipelining
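The unification is easiest to see on a 2-deep nest, where each classical transform is a linear map on the iteration vector (i, j), so combining transforms is just composing the maps. A sketch (my illustration; the matrices are the standard unimodular forms of these transforms, not taken from the slides):

```python
# Each transform on a 2-deep nest is a matrix acting on (i, j); combining
# transforms is matrix multiplication, i.e. one mapping instead of two passes.

def apply(M, v):                       # M is 2x2, v = (i, j)
    return (M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1])

def compose(M, N):                     # matrix product M @ N
    return [[sum(M[r][k] * N[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

INTERCHANGE = [[0, 1], [1, 0]]         # (i, j) -> (j, i)
REVERSAL    = [[-1, 0], [0, 1]]        # (i, j) -> (-i, j)
SKEW        = [[1, 0], [1, 1]]         # (i, j) -> (i, i + j)

assert apply(INTERCHANGE, (2, 5)) == (5, 2)
assert apply(SKEW, (2, 5)) == (2, 7)

# Applying the composed mapping equals applying the transforms in sequence:
both = compose(SKEW, INTERCHANGE)
assert apply(both, (2, 5)) == apply(SKEW, apply(INTERCHANGE, (2, 5)))
```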

4 Loop Transforms: Cholesky Factorization Example

      DO 1 J = 0, N
        I0 = MAX(-M, -J)
        DO 2 I = I0, -1
          DO 3 JJ = I0 - I, -1
            DO 3 L = 0, NMAT
 3            A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
          DO 2 L = 0, NMAT
 2          A(L,I,J) = A(L,I,J) * A(L,0,I+J)
        DO 4 L = 0, NMAT
 4        EPSS(L) = EPS * A(L,0,J)
        DO 5 JJ = I0, -1
          DO 5 L = 0, NMAT
 5          A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
        DO 1 L = 0, NMAT
 1        A(L,0,J) = 1. / SQRT(ABS(EPSS(L) + A(L,0,J)))
      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT
 8          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 JJ = 1, MIN(M, N-K)
            DO 7 L = 0, NMAT
 7            B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
        DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
 9          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 JJ = 1, MIN(M, K)
            DO 6 L = 0, NMAT
 6            B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

5 Results for Optimizing Perfect Nests

Speedup on a Digital Turbolaser with eight 300 MHz 21164 processors.

6 Optimizing Arbitrary Loop Nesting Using Affine Partitions

(Same Cholesky code as slide 4; the slide annotates the arrays A, B, and EPSS, each partitioned along the L dimension.)

7 Results with Affine Partitioning + Blocking

8 A Simple Example

FOR i = 1 TO n DO
  FOR j = 1 TO n DO
    A[i,j] = A[i,j] + B[i-1,j];   (S1)
    B[i,j] = A[i,j-1] * B[i,j];   (S2)

(Figure: the (i, j) iteration space, showing instances of S1 and S2 and the dependences between them.)

9 Best Parallelization Scheme

SPMD code (let p be the processor's ID number):

if (1-n <= p <= n) then
  if (1 <= p) then
    B[p,1] = A[p,0] * B[p,1];                      (S2)
  for i1 = max(1,1+p) to min(n,n-1+p) do
    A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];        (S1)
    B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];      (S2)
  if (p <= 0) then
    A[n+p,n] = A[n+p,n] + B[n+p-1,n];              (S1)

The solution can be expressed as affine partitions:
  S1: execute iteration (i, j) on processor i - j.
  S2: execute iteration (i, j) on processor i - j + 1.
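A sketch checking this schedule (the loop bounds and peeled statements are from the slide; the array contents and n are my assumed test data): running the SPMD code once per processor ID p reproduces the sequential result, and no cross-processor communication is needed because every dependence stays on one processor.

```python
# Compare the sequential loop nest of slide 8 against the SPMD schedule,
# where processor p owns S1 iterations with i-j = p and S2 iterations
# with i-j+1 = p. Test data is illustrative.
n = 6

def init():
    A = [[(i + 2 * j) % 7 + 1 for j in range(n + 1)] for i in range(n + 1)]
    B = [[(3 * i + j) % 5 + 1 for j in range(n + 1)] for i in range(n + 1)]
    return A, B

# Sequential reference
A, B = init()
for i in range(1, n + 1):
    for j in range(1, n + 1):
        A[i][j] = A[i][j] + B[i - 1][j]            # S1
        B[i][j] = A[i][j - 1] * B[i][j]            # S2

# SPMD version: processors can run in any order (here, sequentially)
A2, B2 = init()
for p in range(1 - n, n + 1):
    if 1 <= p:
        B2[p][1] = A2[p][0] * B2[p][1]                             # peeled S2
    for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
        A2[i1][i1 - p] = A2[i1][i1 - p] + B2[i1 - 1][i1 - p]       # S1
        B2[i1][i1 - p + 1] = A2[i1][i1 - p] * B2[i1][i1 - p + 1]   # S2
    if p <= 0:
        A2[n + p][n] = A2[n + p][n] + B2[n + p - 1][n]             # peeled S1

assert (A, B) == (A2, B2)
```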

10 Maximum Parallelism & No Communication

Let
  F_xj be an access to array x in statement j,
  i_j be an iteration index for statement j,
  B_j i_j >= 0 represent the loop bound constraints for statement j.

Find C_j, which maps an instance of statement j to a processor, such that

  for all i_j, i_k with B_j i_j >= 0 and B_k i_k >= 0:
    F_xj(i_j) = F_xk(i_k)  =>  C_j(i_j) = C_k(i_k)

with the objective of maximizing the rank of C_j.

(Figure: accesses F1(i1), F2(i2) map loop iterations to array elements; partitions C1(i1), C2(i2) map them to processor IDs.)
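A brute-force sketch of this condition (my illustration) for the example of slide 8: S1 writes A[i,j] and reads B[i-1,j], S2 writes B[i,j] and reads A[i,j-1], and the candidate partitions C1(i,j) = i-j, C2(i,j) = i-j+1 must agree whenever two instances touch the same array element.

```python
# Enumerate all instance pairs in a small iteration space and check
# F_xj(i_j) = F_xk(i_k) => C_j(i_j) = C_k(i_k) for the slide-8 example.
n = 10

C1 = lambda i, j: i - j          # candidate partition for S1
C2 = lambda i, j: i - j + 1      # candidate partition for S2

for i1 in range(1, n + 1):
    for j1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            for j2 in range(1, n + 1):
                # S1 writes A[i1,j1]; S2 reads A[i2,j2-1]
                if (i1, j1) == (i2, j2 - 1):
                    assert C1(i1, j1) == C2(i2, j2)
                # S2 writes B[i2,j2]; S1 reads B[i1-1,j1]
                if (i2, j2) == (i1 - 1, j1):
                    assert C2(i2, j2) == C1(i1, j1)
```

Since C1 and C2 are nonconstant (rank 1), every iteration is not forced onto one processor: this is the communication-free parallelism the formulation looks for.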

11 Algorithm

  for all i_j, i_k with B_j i_j >= 0 and B_k i_k >= 0:
    F_xj(i_j) = F_xk(i_k)  =>  C_j(i_j) = C_k(i_k)

- Rewrite the partition constraints as systems of linear equations:
  - use the affine form of Farkas' Lemma to rewrite the constraints as systems of linear inequalities in C and the Farkas multipliers
  - use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers, yielding systems of linear equations AC = 0
- Find solutions using linear algebra techniques:
  - the null space of the matrix A is a solution C of maximum rank.
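A stdlib-only sketch of the final step (my illustration; the slides do not give this system explicitly). For the slide-8 example, writing C1(i,j) = a1*i + b1*j + c1 and C2(i,j) = a2*i + b2*j + c2, the partition constraints C1(i,j) = C2(i,j+1) and C2(i,j) = C1(i+1,j) reduce, coefficient by coefficient, to a system A x = 0 on x = (a1, b1, c1, a2, b2, c2), whose null space contains the slide-9 solution:

```python
# Exact-arithmetic rank computation: the null space of A has dimension
# 6 - rank(A), and contains C1 = i - j, C2 = i - j + 1.
from fractions import Fraction

A = [[ 1, 0,  0, -1,  0,  0],   # a1 - a2 = 0
     [ 0, 1,  0,  0, -1,  0],   # b1 - b2 = 0
     [ 0, 0,  1,  0, -1, -1],   # c1 - b2 - c2 = 0
     [-1, 0, -1,  0,  0,  1]]   # c2 - a1 - c1 = 0

def rank(M):
    """Row-reduce over the rationals and count pivots."""
    M = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        M[r] = [v / M[r][c] for v in M[r]]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                M[i] = [u - M[i][c] * w for u, w in zip(M[i], M[r])]
        r += 1
    return r

# 6 unknowns, rank 4: a 2-dimensional null space (the solution, plus the
# trivial freedom of shifting both partitions by the same constant).
assert rank(A) == 4

# The slide-9 solution C1 = i - j, C2 = i - j + 1 lies in the null space:
x = [1, -1, 0, 1, -1, 1]
assert all(sum(a * v for a, v in zip(row, x)) == 0 for row in A)
```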

12 Pipelining: Alternating Direction Integration Example

Requires transposing data:
  DO J = 1 TO N        (parallel)
    DO I = 1 TO N
      A(I,J) = f(A(I,J), A(I-1,J))
  DO J = 1 TO N
    DO I = 1 TO N      (parallel)
      A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:
  DO J = 1 TO N        (parallel)
    DO I = 1 TO N
      A(I,J) = f(A(I,J), A(I-1,J))
  DO J = 1 TO N        (pipelined)
    DO I = 1 TO N
      A(I,J) = g(A(I,J), A(I,J-1))
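A simulation sketch of why pipelining is legal here (my illustration; f, g, and the data are assumed stand-ins): executing each sweep as a wavefront, where all iterations with the same time stage t = i + j run together, gives the same result as the sequential loops, because both dependences A(I-1,J) and A(I,J-1) come from stage t-1.

```python
# Compare sequential ADI sweeps against a wavefront (pipelined) schedule.
n = 6
f = lambda a, b: a + 2 * b          # illustrative stand-ins for f and g
g = lambda a, b: a * b + 1

def init():
    return [[(i * 3 + j) % 7 for j in range(n + 1)] for i in range(n + 1)]

# Sequential reference
A = init()
for j in range(1, n + 1):
    for i in range(1, n + 1):
        A[i][j] = f(A[i][j], A[i - 1][j])
for j in range(1, n + 1):
    for i in range(1, n + 1):
        A[i][j] = g(A[i][j], A[i][j - 1])

# Wavefront schedule: stage t executes every (i, j) with i + j == t.
A2 = init()
for sweep in (f, g):
    dep = (lambda i, j: A2[i - 1][j]) if sweep is f else (lambda i, j: A2[i][j - 1])
    for t in range(2, 2 * n + 1):
        for i in range(max(1, t - n), min(n, t - 1) + 1):
            j = t - i
            A2[i][j] = sweep(A2[i][j], dep(i, j))

assert A == A2
```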

13 Finding the Maximum Degree of Pipelining

Let
  F_xj be an access to array x in statement j,
  i_j be an iteration index for statement j,
  B_j i_j >= 0 represent the loop bound constraints for statement j.

Find T_j, which maps an instance of statement j to a time stage, such that

  for all i_j, i_k with B_j i_j >= 0 and B_k i_k >= 0:
    (i_j precedes i_k) and (F_xj(i_j) = F_xk(i_k))  =>  T_j(i_j) <= T_k(i_k) lexicographically

with the objective of maximizing the rank of T_j.

(Figure: accesses F1(i1), F2(i2) map loop iterations to array elements; mappings T1(i1), T2(i2) map them to time stages.)
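A brute-force sketch of this condition (my illustration) for the single nest A(I,J) = g(A(I,J), A(I,J-1)) from the previous slide, with the candidate time mapping T(i,j) = i + j: every dependent pair of instances must get increasing time stages, while a mapping that runs the j-loop backwards fails.

```python
# Check the time-mapping condition over a small iteration space.
n = 8
T = lambda i, j: i + j              # candidate one-dimensional time mapping

for i in range(1, n + 1):
    for j in range(1, n + 1):
        # (i, j) reads A[i][j-1], written earlier by instance (i, j-1)
        if j - 1 >= 1:
            assert T(i, j - 1) < T(i, j)

# A reversed mapping violates the condition, so it is not a legal schedule:
bad = lambda i, j: -j
assert bad(1, 1) >= bad(1, 2)       # the source does not precede the sink
```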

14 Key Insight

- Choice in the time mapping => (pipelined) parallelism
- Degrees of parallelism = rank(T) - 1

15 Putting It All Together

- Find maximum outer-loop parallelism with minimum synchronization
  - Divide into strongly connected components
  - Apply the processor-mapping algorithm (no communication) to the program
  - If no parallelism is found:
    - apply the time-mapping algorithm to find pipelining
    - if no pipelining is found (an outer sequential loop), repeat the process on the inner loops
- Minimize communication
  - Use a greedy method to order communicating pairs
  - Try to find communication-free, or neighborhood-only, communication by solving similar equations
- Aggregate computations on consecutive data to improve spatial locality
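The first step above can be sketched as follows (my stdlib-only illustration on a hypothetical 4-statement dependence graph, not from the slides): split the statements into strongly connected components, which the mapping algorithms then handle one at a time.

```python
# Kosaraju's algorithm on a small statement dependence graph:
# S1 -> S2, S2 <-> S3 (a dependence cycle), S3 -> S4.
edges = {0: [1], 1: [2], 2: [1, 3], 3: []}

def sccs(edges):
    """Return strongly connected components in topological order."""
    order, seen = [], set()

    def dfs(u):                          # first pass: record finish order
        seen.add(u)
        for v in edges[u]:
            if v not in seen:
                dfs(v)
        order.append(u)

    for u in edges:
        if u not in seen:
            dfs(u)

    rev = {u: [] for u in edges}         # second pass: reversed graph
    for u in edges:
        for v in edges[u]:
            rev[v].append(u)

    comps, assigned = [], set()
    for u in reversed(order):
        if u in assigned:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x in assigned:
                continue
            assigned.add(x)
            comp.add(x)
            stack.extend(rev[x])
        comps.append(comp)
    return comps

# Statements 1 and 2 form a cycle, so they must stay in one component:
assert sccs(edges) == [{0}, {1, 2}, {3}]
```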

16 Use of Affine Partitioning in Locality Optimization

- Promotes array contraction
  - finds independent threads and shortens the live ranges of variables
- Supports blocking of imperfectly nested loops
  - finds the largest fully permutable loop nest via affine partitioning
  - a fully permutable loop nest is blockable
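A sketch of the last point (my illustration; the data, tile size, and update are assumed): once a loop nest is fully permutable, it can be blocked into tiles without changing the result.

```python
# Block a fully permutable 2-deep nest into Bsz x Bsz tiles and check that
# the result is unchanged. The reduction here carries no loop dependence,
# so any iteration order -- including the tiled one -- is legal.
n, Bsz = 12, 4

def init():
    return [[(i * 5 + j) % 9 for j in range(n)] for i in range(n)]

def untiled():
    A, s = init(), 0
    for i in range(n):
        for j in range(n):
            s += A[i][j] * (i + j)
    return s

def tiled():
    A, s = init(), 0
    for ii in range(0, n, Bsz):                      # iterate over tiles
        for jj in range(0, n, Bsz):
            for i in range(ii, min(ii + Bsz, n)):    # iterate inside a tile
                for j in range(jj, min(jj + Bsz, n)):
                    s += A[i][j] * (i + j)
    return s

assert untiled() == tiled()
```

Blocking pays off when the tile's working set fits in cache, which is why finding the largest fully permutable nest matters.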

