Recurrence Chain Partitioning of Non-Uniform Dependences. Yijun Yu, Erik H. D'Hollander. Aug 15-18, Montreal, Canada.


1 Recurrence Chain Partitioning of Non-Uniform Dependences. Yijun Yu, Erik H. D'Hollander

2 Overview
1. Dependence and Parallelism
2. Non-Uniform Loop Dependences
3. Recurrence Chains Partitioning
4. Related work
5. Implementations
6. Experiment Results
7. Summary

3 1. Background: Dependence vs. Parallelism
Sequential program:
    DO I = 1,3
      A(I) = A(I-1)
    ENDDO
Execution trace, with shared memory A(1..3) after each step (initially 1 2 3; A(0) = 0 throughout):
    A(1) = A(0)    0 2 3
    A(2) = A(1)    0 0 3
    A(3) = A(2)    0 0 0
Parallel program:
    DOALL I = 1,3
      A(I) = A(I-1)
    ENDDO
Execution trace when the iterations run in the order 2, 1, 3 (same initial memory):
    A(2) = A(1)    1 1 3
    A(1) = A(0)    0 1 3
    A(3) = A(2)    0 1 1
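The two traces on this slide can be replayed with a small simulation. This is only an illustrative sketch: the helper name `run` and the list representation of A(0..3) are assumptions made here, not part of the talk.

```python
def run(order, memory):
    """Execute A(I) = A(I-1) for each I in `order` on a copy of `memory`.

    `memory` holds A(0..3); the returned trace lists A(1..3) after each step.
    """
    a = list(memory)
    trace = []
    for i in order:
        a[i] = a[i - 1]
        trace.append(a[1:])
    return trace


initial = [0, 1, 2, 3]                 # A(0)=0, A(1..3) = 1 2 3
seq_trace = run([1, 2, 3], initial)    # sequential DO order
par_trace = run([2, 1, 3], initial)    # one possible DOALL interleaving

print(seq_trace)
print(par_trace)
```

The sequential order ends with A(1..3) = 0 0 0, while the interleaving 2, 1, 3 ends with 0 1 1, which is why the loop cannot simply be turned into a DOALL.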

4 The CFD application @ WTCM
Computational Fluid Dynamics (CFD)
Navier-Stokes equations
Successive Over-Relaxation (SOR)
Temperature
3D geometry + 1D time

5 The visualized uniform dependences and transformations for the 4D loop
[Figure panels: Before transformation; After transformation]
A 3-D unimodular transformation is found after visualizing the 4D loop nest, which has 177 array references per iteration at run time. Here we use a regular shape. The transformation makes it possible to speed up the program by around N^2/6 times, where N is the diameter of the geometry. (Yu, Parco99)

6 2. Non-uniform dependences
Uniform loop dependences: dependent iterations are a uniform distance apart in the iteration space. A set of distance vectors can predict the dependences and indicate the affine index loop transformation that reveals the maximal loop parallelism.
Non-uniform dependences: irregular; they can be caused by complex subscripts, compile-time unknowns, etc. But they are not rare: in SPECfp95 benchmarks 46% nested loops and 12.8% of the coupled subscripts
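To make the distinction concrete, the sketch below enumerates dependence distances for two small one-dimensional loops. Both loops and the helper `dependence_distances` are hypothetical examples chosen here, and the helper assumes each array element is written by at most one iteration.

```python
def dependence_distances(write_index, read_index, n):
    """Distinct distances j - i where iteration j reads the element that
    iteration i writes, for iterations 1..n (assumes one writer per element)."""
    writer_of = {write_index(i): i for i in range(1, n + 1)}
    distances = set()
    for j in range(1, n + 1):
        i = writer_of.get(read_index(j))
        if i is not None and i != j:
            distances.add(j - i)
    return sorted(distances)


# Hypothetical uniform loop:      A(I)   = A(I-1)
uniform = dependence_distances(lambda i: i, lambda j: j - 1, 10)
# Hypothetical non-uniform loop:  A(2*I) = A(I)
non_uniform = dependence_distances(lambda i: 2 * i, lambda j: j, 10)

print(uniform)       # a single distance
print(non_uniform)   # the distance changes with the iteration
```

In the first loop one distance vector describes every dependence; in the second the distance equals the writing iteration's index, so no fixed set of distance vectors captures it.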

7 Non-uniform dependences: tip of the iceberg

8 Irregular dependence
Dependences have non-uniform distance.
Parallelism analysis: 200 iterations over 15 data flow steps. Speedup: 13.3
Problem: how to exploit it?
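The speedup figure is the number of iterations divided by the number of data flow steps, that is 200 / 15 ≈ 13.3. As a rough sketch of how such step counts can be obtained, the code below assigns each iteration a data flow level one higher than its predecessor's. The loop `A(2*I) = A(I)` and the helper `dataflow_steps` are assumptions used for illustration; they are not the loop analysed on the slide.

```python
def dataflow_steps(n, predecessor):
    """Number of data flow steps for iterations 1..n, where predecessor(i)
    is the earlier iteration that i depends on, or None if there is none."""
    level = {}
    for i in range(1, n + 1):          # predecessors are visited before i
        p = predecessor(i)
        level[i] = level[p] + 1 if p in level else 1
    return max(level.values())


# Hypothetical loop A(2*I) = A(I): iteration j reads what iteration j/2 wrote.
steps = dataflow_steps(200, lambda j: j // 2 if j % 2 == 0 else None)
speedup = 200 / steps

slide_speedup = round(200 / 15, 1)     # the figures quoted on the slide
print(steps, speedup, slide_speedup)
```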

9 3. Recurrence Chain Partitioning: Research objectives
If DO loops fail to reveal the optimal parallelism for irregular dependences, can one use WHILE loops?
WHEN can one apply WHILE loops?
HOW to construct WHILE loops?
WHAT to do when one cannot apply WHILE loops?
HOW MUCH can be achieved, for evaluation purposes?

10 3.1 How to generate code?
    DOALL I = INIT(I)
      WHILE !TERMINATE(I) DO
        S(I)
        I = NEXT(I)
      END DO
    ENDDOALL
INIT(I) = ?  TERMINATE(I) = ?  NEXT(I) = ?

11 3.2 Solving recurrence equations in the unified iteration space
Dependence equations: $iA + a = jB + b$
Recurrence equations: $j = iT + t$, or $i = (j - t)T^{-1} = jT^{-1} - tT^{-1}$, where $T = AB^{-1}$ and $t = (a - b)B^{-1}$
A recurrence chain is a sequence of dependent iterations such that $i_{k+1} = i_k T + t$, or $i_{k+1} = (i_k - t)T^{-1}$
Initial iterations: $i_0 = \{\, i \mid \text{there is no } j \text{ such that } iA + a = jB + b \text{ or } iB + b = jA + a \,\}$
The dependence distance $d_k = i_{k+1} - i_k$ is variable: $d_{k+1} = d_k T$, or $d_k = d_{k+1} T^{-1}$
$d$ is not constant; it grows exponentially with base $\alpha = \max(1/|T|, |T|)$, where $|T|$ is the determinant of $T$, so the dependence chain length is $O(\log_{\alpha} L)$, where $L$ is the diameter of the iteration space
When $|T|$ is negative, one can cut the recurrence chain to 2 iterations by lexicographical ordering
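A one-dimensional instance makes the growth of the distances easy to see. The sketch below follows a single chain for the scalar recurrence with T = 2 and t = 1; these values, the space size L = 1000 and the helper `chain` are assumptions chosen for illustration.

```python
import math


def chain(i0, T, t, size):
    """Follow i_{k+1} = i_k * T + t from i0 while staying inside 1..size."""
    points = [i0]
    while 1 <= points[-1] * T + t <= size:
        points.append(points[-1] * T + t)
    return points


T, t, L = 2, 1, 1000                     # hypothetical scalar recurrence
points = chain(1, T, t, L)
distances = [b - a for a, b in zip(points, points[1:])]

# Each distance is the previous one multiplied by T: d_{k+1} = d_k * T.
growth_ok = all(d2 == d1 * T for d1, d2 in zip(distances, distances[1:]))
# The chain length stays within log_T(L) + 1.
bound_ok = len(points) <= math.log(L, T) + 1

print(points)
print(distances)
```

Because the distance doubles at every step, a space of diameter 1000 holds at most nine iterations of this chain, in line with the logarithmic bound.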

12 3.3 Generate code?
    DOALL I = i_0
      WHILE (I is in Iteration Space) DO
        S(I)
        I = IT+t  or  I = (I-t)T^-1
      ENDDO
    ENDDOALL
Problem: how to tell which index update respects the dependence order?

13 [Figure: the iteration space divided into regions R1 to R4, showing independent and cyclic iterations, integer and non-integer solutions, and chains i0, i1, i2, ... running from the initial set through the intermediate set to the final set]

14 3.3 Generate code!
    DOALL I in P1
      IF (IT+t < I) THEN
        T = T^-1; t = -tT
      ENDIF
      WHILE (I is in Iteration Space) DO
        S(I)
        I = IT+t
      ENDDO
    ENDDOALL
(T is replaced by its inverse first, so the new t equals -tT^-1, the offset of the inverse map.)
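The generated loop structure can be mimicked in a few lines for a one-dimensional space. This is a sketch under stated assumptions, not the authors' implementation: the function `rec_partition`, the scalar map with T = 1/2 and t = -1/2, and the space 1..20 are all chosen here for illustration, and the map is assumed to have no fixed point inside the space.

```python
from fractions import Fraction


def in_space(x, n):
    """True if x is an integer point of the iteration space 1..n."""
    return x.denominator == 1 and 1 <= x <= n


def rec_partition(T, t, n):
    """Split 1..n into recurrence chains of the map i -> i*T + t."""
    T, t = Fraction(T), Fraction(t)
    inv_T, inv_t = 1 / T, -t / T        # inverse map: i -> i*T^-1 - t*T^-1
    chains = []
    for start in range(1, n + 1):       # plays the role of the DOALL
        i = Fraction(start)
        # Pick the update that moves forward in execution order.
        if i * T + t < i:
            fwd, bwd = (inv_T, inv_t), (T, t)
        else:
            fwd, bwd = (T, t), (inv_T, inv_t)
        # Only iterations without a predecessor in the space start a chain.
        if in_space(i * bwd[0] + bwd[1], n):
            continue
        chain = []
        while in_space(i, n):           # the WHILE loop of the slide
            chain.append(int(i))
            i = i * fwd[0] + fwd[1]
        chains.append(chain)
    return chains


chains = rec_partition(Fraction(1, 2), Fraction(-1, 2), 20)
covered = sorted(x for c in chains for x in c)
print(chains)
```

Every iteration of 1..20 lands in exactly one chain, so the chains can run in parallel while each chain is walked sequentially.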

15 4. Related work: Strength of REC (1) Scalability
LEN = length of the chain
In comparison, unique-set oriented methods have to deal with LEN = 2, 3, … differently …
In REC, the WHILE loops adjust their steps automatically …

16 4. Related work: Strength of REC (2) Outermost loop parallelism
Set-oriented:
    DOALL I in P1
      S(I)
    DOALL I in P2
      S(I)
    …
    DOALL I in Pn
      D(I)
Recurrence Chain:
    DOALL I in P1
      IF (I > IT+t) T = T^-1; t = -tT
      WHILE (I in IS) DO
        S(I)
        I = IT+t
      ENDDO
    ENDDOALL

17 4. Related work: Shortcoming and alternatives
Restriction on the number of dependence equations.
Fall back to the following algorithms:
A recursive 3-sets partitioning (3P), similar to unique-sets partitioning but more accurate: it can reuse the calculations for P1, P2, P3.
PDM and other uniformization techniques.
PDM is light-weight and can be applied first, followed by 3P.

19-21 [Figures; legend labels: sat, den, partly, fully]

22 4. Implementations
Front end: source-to-source transformations
    PDM/PL in FPT
    Set-oriented algorithms in FPT
    XML/XSLT OC
Back end: Intel Fortran compiler + OpenMP directives
Experiments on an EPICMP 4-CPU server

23 5. Results
5.1 Yu, ICPP00
    DO I1=1,N1
      DO I2=1,N2
        a(3*I1+1,2*I1+I2-1) = a(I1+3,I2+1)
      ENDDO
    ENDDO
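For this loop the recurrence map can be worked out by hand or with a few lines of exact arithmetic. The sketch below assumes the row-vector convention i = (I1, I2) with subscripts written as iA + a for the write and jB + b for the read; the helper names and the bounds N1 = N2 = 30 are assumptions made for illustration.

```python
from fractions import Fraction


def inverse2(m):
    """Inverse of a 2x2 matrix using exact rational arithmetic."""
    (p, q), (r, s) = m
    det = Fraction(p * s - q * r)
    return [[s / det, -q / det], [-r / det, p / det]]


def vec_mat(v, m):
    """Row vector times 2x2 matrix."""
    return [v[0] * m[0][0] + v[1] * m[1][0],
            v[0] * m[0][1] + v[1] * m[1][1]]


def mat_mat(x, y):
    return [vec_mat(row, y) for row in x]


# Write a(3*I1+1, 2*I1+I2-1): subscripts = i*A + a
A, a = [[3, 2], [0, 1]], [1, -1]
# Read  a(I1+3, I2+1):        subscripts = j*B + b
B, b = [[1, 0], [0, 1]], [3, 1]

B_inv = inverse2(B)
T = mat_mat(A, B_inv)                                  # T = A * B^-1
t = vec_mat([a[0] - b[0], a[1] - b[1]], B_inv)         # t = (a - b) * B^-1


def chain(i0, n1, n2):
    """Follow i -> i*T + t from i0 while inside [1..n1] x [1..n2]."""
    points, i = [], list(i0)
    while 1 <= i[0] <= n1 and 1 <= i[1] <= n2:
        points.append((int(i[0]), int(i[1])))
        step = vec_mat(i, T)
        i = [step[0] + t[0], step[1] + t[1]]
    return points


points = chain((2, 1), 30, 30)
print(T, t, points)
```

Iteration (2, 1) writes a(7, 4), which iteration (4, 3) reads, and the gaps between successive chain points grow from (2, 2) to (6, 6) to (18, 18), so the distances are clearly non-uniform.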

24 5.1 Non-full-rank PDM [Figure; axis labels: j1, j2, i2]

26 5.2 Ju 1997's example
    DO I=1,N
      DO J=1,N
        a(2*I+3,J+1) = …
        … = a(I+2*J+1,I+J+3)
      ENDDO
    ENDDO
det(PDM) = 2

27 UNIQUE vs REC partitioning [Figure; region labels: 1, 3, 2]

28 Ju's Example: Comparison
We corrected the loop bounds flaw in Ju's 97 paper, and 5 unique sets were derived for this case when N = 12.
But theoretically O(2^(log_2 N)) = O(N) UNIQUE sets are needed.
In REC partitioning, just one set P1 needs to be calculated for the initial i_0.

30 5.3 Chen 96's Example
    DO I=1,N
      DO J=1,I
        DO K=J,I
          ... = a(I+2*K+5,4*K-J)
        ENDDO
        a(I-J,I+J) = ...
      ENDDO
    ENDDO

31 Chen's Example: A special case
It is a non-perfectly nested loop.
First convert it into the unified iteration space.
Then symbolically calculate P1, P2, P3 and find P2 = empty.
Therefore the recurrence chains are at most 1 iteration long, regardless of the loop bounds.
Both REC and three-region partitioning lead to the same optimal solution.

33 5.4 Cholesky kernel (I,K,J,L)
After loop fusion:
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
        DO 1 J = 0,I0
C$DOISV
        DO 1 L = 0,NMAT
          IF (K.LE.N) THEN
            IF (J.EQ.0) THEN
 8            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 7            B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
            ENDIF
          ELSE
            IF (J.EQ.0) THEN
 9            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 6            B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
            ENDIF
          ENDIF
 1    CONTINUE
Before loop fusion:
C     THE ORIGINAL KERNEL
      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT
 8          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 J = 1, MIN (M, N-K)
            DO 7 L = 0, NMAT
 7            B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
        DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
 9          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 J = 1, MIN (M, K)
            DO 6 L = 0, NMAT
 6            B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

35 Recursive three-region partitioning, after loop fusion

36 6. Summary
Recurrence Chain partitioning is scalable to any size of the iteration space.
REC partitioning reveals outermost parallelism, with no synchronization between partitioned regions.
The limitation of REC partitioning and its compensation: we provide fall-back alternatives if REC cannot be applied:
(1) PDM + minimal distance (always applicable)
(2) Recursive three-region partitioning (applicable for constant loop bounds and, in some cases such as Chen's example, for any loop bounds)
[Diagram labels: PDM, 3R, REC]
