1 Optimizing Compilers for Modern Architectures Creating Coarse-grained Parallelism for Loop Nests Chapter 6, Sections 6.3 through 6.9

2 Optimizing Compilers for Modern Architectures Last time: single-loop methods —Privatization —Loop distribution —Alignment —Loop fusion

3 Optimizing Compilers for Modern Architectures Loop Interchange Moves dependence-free loops to the outermost level Theorem —In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that loop contains only '=' entries Vectorization, by contrast, moves loops to the innermost level

4 Optimizing Compilers for Modern Architectures Loop Interchange
DO I = 1, N
  DO J = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
OK for vectorization; problematic for parallelization

5 Optimizing Compilers for Modern Architectures Loop Interchange
PARALLEL DO J = 1, N
  DO I = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO

6 Optimizing Compilers for Modern Architectures Loop Interchange Working with the direction matrix:
1. Move a loop whose column contains only '=' entries into the outermost position, parallelize it, and remove its column from the matrix
2. Move the loop with the most '<' entries into the next outermost position and sequentialize it; eliminate its column and any rows representing dependences it carries
3. Repeat from step 1

7 Optimizing Compilers for Modern Architectures Loop Interchange
while L is not empty
    while there exist columns in M with all "=" entries
        success := true;
        l := loop with an all-"=" column;
        remove l from L; parallelize l; eliminate the column for l from M;
    end;
    if L is not empty then begin
        select_loop_and_interchange(L);
        l := outermost loop; remove l from L; sequentialize l;
        remove the column corresponding to l from M;
        remove all rows corresponding to dependences carried by l from M;
    end
end

8 Optimizing Compilers for Modern Architectures Loop Selection Generate the most parallelism with adequate granularity —Key is to select the proper loops to run in parallel Informal parallel code generation strategy: while there are loops that can be run in parallel, move them to the outermost position and parallelize them; then select a sequential loop, run it sequentially, and see what new parallelism has been revealed

9 Optimizing Compilers for Modern Architectures Loop Selection
DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
    ENDDO
  ENDDO
ENDDO
Direction matrix:
= < <
< = >
< = <

10 Optimizing Compilers for Modern Architectures Loop Selection
DO I = 2, N+1
  DO J = 2, M+1
    PARALLEL DO K = 1, L
      A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
    END PARALLEL DO
  ENDDO
ENDDO

11 Optimizing Compilers for Modern Architectures Loop Selection Is it possible to derive a selection heuristic that always produces optimal code? —Probably not: loop selection is an NP-complete problem Assume the simple approach of selecting the loop with the most '<' directions, so as to eliminate the maximum number of rows from the direction matrix —Applied to the direction matrix shown on this slide, that approach fails (a small constructed illustration follows below):
< < = = < = < = = < = < = = = = < = = = = <
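To illustrate how the greedy rule can fail (this matrix is constructed here for illustration and is not the one on the slide), take six dependences over loops L1, L2, L3:
L1 L2 L3
<  <  =
<  <  =
=  <  =
<  =  <
<  =  <
=  =  <
L1 has the most '<' entries (four), so the greedy rule sequentializes L1 first; the remaining rows (=,<,=) and (=,=,<) then force L2 and L3 to be sequentialized as well, leaving no loop parallel. Sequentializing only L2 and L3 and interchanging them outward covers every row, leaving L1 parallel, so the greedy choice is strictly worse.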

12 Optimizing Compilers for Modern Architectures Loop Selection Favor the selection of loops that must be sequentialized before parallelism can be uncovered —If there exists a loop that can legally be moved to the outermost position and there is a dependence for which that loop has the only '<' direction, sequentialize that loop —If there are several such loops, they will all need to be sequentialized at some point in the process anyway

13 Optimizing Compilers for Modern Architectures Loop Selection To show that loop selection is NP-complete —Represent each loop by a bit vector over the dependences (1 where the loop has a '<' direction), as in the matrix below —The problem of loop selection then corresponds to finding a minimal basis among the loops, with "logical or" as the combining operation —This is the minimum set cover problem, which is known to be NP-complete Loop selection is therefore best done by a heuristic
1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1

14 Optimizing Compilers for Modern Architectures Loop Selection Example of the principles involved in heuristic loop selection:
DO I = 2, N
  DO J = 2, M
    DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO
Direction matrix:
= < =
< = <
= < <

15 Optimizing Compilers for Modern Architectures Loop Selection
DO J = 2, M
  DO I = 2, N
    PARALLEL DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO

16 Optimizing Compilers for Modern Architectures Loop Reversal
DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO
Direction matrix:
= < >
< = >

17 Optimizing Compilers for Modern Architectures Loop Reversal
DO K = L, 1, -1
  PARALLEL DO I = 2, N+1
    PARALLEL DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  END PARALLEL DO
ENDDO
Reversal increases the range of options available to loop selection heuristics
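As a worked check on why reversal helps here: both dependences have distance -1 in K (the K column is all '>'), so no loop can be parallelized or moved outward as written. Reversing K negates the K distances, turning the direction matrix into
= < <
< = <
after which the reversed K loop can legally be interchanged to the outermost position, where it carries both dependences and leaves the I and J loops parallel.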

18 Optimizing Compilers for Modern Architectures Loop Skewing
DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
      B(I, J, K+1) = B(I, J, K) + A(I, J, K)
    ENDDO
  ENDDO
ENDDO
Direction matrix:
= < =
< = =
= = <
= = =

19 Optimizing Compilers for Modern Architectures Loop Skewing Skewing with k = K + I + J yields:
DO I = 2, N+1
  DO J = 2, M+1
    DO k = I+J+1, I+J+L
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    ENDDO
  ENDDO
ENDDO
Direction matrix:
= < <
< = <
= = <
= = =
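Where the new k column comes from: with k = K + I + J, a dependence whose distance vector is (dI, dJ, dK) in the original (I, J, K) space has distance dI + dJ + dK in k. The four dependences above have distances (0,1,0), (1,0,0), (0,0,1), and (0,0,0), so their k distances are 1, 1, 1, and 0, which is why every loop-carried dependence now has '<' in the k column while the I and J columns are unchanged.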

20 Optimizing Compilers for Modern Architectures Loop Skewing
DO k = 5, N+M+L+2
  PARALLEL DO I = MAX(2, k-M-L-1), MIN(N+1, k-3)
    PARALLEL DO J = MAX(2, k-I-L), MIN(M+1, k-I-1)
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    END PARALLEL DO
  END PARALLEL DO
ENDDO

21 Optimizing Compilers for Modern Architectures Loop Skewing Skewing transforms a loop into one that can be interchanged to the outermost position without changing the meaning of the program It can be used to transform the skewed loop in such a way that, after outward interchange, it carries all dependences formerly carried by the loop with respect to which it was skewed

22 Optimizing Compilers for Modern Architectures Loop Skewing Selection heuristics:
1. Parallelize the outermost loop if possible
2. Sequentialize at most one outer loop to find parallelism in the next loop
3. If steps 1 and 2 fail, try skewing
4. If step 3 fails, sequentialize the loop that can be moved to the outermost position and that covers the most other loops

23 Optimizing Compilers for Modern Architectures Unimodular Transformations Definition —A transformation represented by a matrix T is unimodular if —T is square, —all the elements of T are integers, and —the absolute value of the determinant of T is 1 Any composition of unimodular transformations is unimodular
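As a concrete illustration (standard 2x2 examples on an iteration vector (I, J), not taken from the slide), loop interchange corresponds to the permutation matrix
0 1
1 0
reversal of the J loop to
1 0
0 -1
and skewing J by I to
1 0
1 1
Each is square and integral with determinant of absolute value 1, hence unimodular, and a sequence of such transformations composes by matrix multiplication into another unimodular matrix.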

24 Optimizing Compilers for Modern Architectures Profitability-Based Methods Motivation —Need a minimum granularity and low synchronization cost —Many alternatives exist for parallel code generation Static performance estimation function —Need not be accurate —Only needs to be good at selecting the better of two alternatives Key considerations —Cost of memory references —Sufficiency of granularity

25 Optimizing Compilers for Modern Architectures Profitability-Based Methods Pick the best of all possible permutations and parallelizations? —Impractical: the total number of alternatives is exponential in the number of loops in a nest, and many of the loop upper bounds are unknown at compile time Instead, consider only a subset of the possible code arrangements, based on properties of the cost function

26 Optimizing Compilers for Modern Architectures Profitability-Based Methods Subdivide all the references in the loop body into reference groups Determine whether subsequent accesses to the same reference are —Loop invariant: cost = 1 —Unit stride: cost = number of iterations / cache line size —Non-unit stride: cost = number of iterations Assign the loop a cost equal to the sum of the reference costs times the aggregate number of times the loop would be executed if it were innermost in the nest

27 Optimizing Compilers for Modern Architectures Profitability-Based Methods
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO
With K innermost, the reference costs are C = 1, A = N, B = N/L (L = cache line size in elements)
—Innermost K loop: N^3(1 + 1/L) + N^2
—Innermost J loop: 2N^3 + N^2
—Innermost I loop: 2N^3/L + N^2
Order the loops from innermost to outermost by increasing loop cost —The desired loop order cannot always be achieved
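To spell out the first figure (assuming Fortran column-major storage, with L the cache line size in elements, per the model on the previous slide): with K innermost, C(I, J) is invariant in K (cost 1), A(I, K) varies its column index, so successive accesses are non-unit stride (cost N), and B(K, J) runs down a column, so accesses are unit stride (cost N/L). One execution of the K loop therefore costs 1 + N + N/L, and the K loop is reached N^2 times, giving N^2(1 + N + N/L) = N^3(1 + 1/L) + N^2. The J-innermost and I-innermost figures follow in the same way.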

28 Optimizing Compilers for Modern Architectures Multilevel Loop Fusion Commonly used for imperfect loop nests Used after maximal loop distribution

29 Optimizing Compilers for Modern Architectures Multilevel Loop Fusion Decision making needs look-ahead Heuristic: Fuse with the loop that cannot be fused with one of its successors

30 Optimizing Compilers for Modern Architectures Parallel Code Generation
procedure Parallelize(l, Dl);
    ParallelizeNest(l, success);
    if not success then begin
        if l can be distributed then begin
            distribute l into loop nests l1, l2, ..., ln;
            for i := 1 to n do begin
                Parallelize(li, Di);
            end
            Merge({l1, l2, ..., ln});
        end
        else if l cannot be distributed then begin

31 Optimizing Compilers for Modern Architectures Parallel Code Generation
            for each outer loop l0 nested in l do begin
                let D0 be the set of dependences between statements in l0,
                    less dependences carried by l;
                Parallelize(l0, D0);
            end
            let S be the set of outer loops and statements left in l;
            if ||S|| > 1 then Merge(S);
        end
    end
end Parallelize

32 Optimizing Compilers for Modern Architectures Erlebacher
DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
ENDDO

33 Optimizing Compilers for Modern Architectures Erlebacher
PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO

34 Optimizing Compilers for Modern Architectures Erlebacher
PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
    TOT(I, J) = 0.0
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO

35 Optimizing Compilers for Modern Architectures Strip Mining Converts available parallelism into a form more suitable for the hardware
DO I = 1, N
  A(I) = A(I) + B(I)
ENDDO
becomes
k = CEIL(N / P)
PARALLEL DO I = 1, N, k
  DO i = I, MIN(I + k - 1, N)
    A(i) = A(i) + B(i)
  ENDDO
END PARALLEL DO

36 Optimizing Compilers for Modern Architectures Strip Mining
DO I = 1, N
  DO J = 2, I
    A(I, J) = A(I, J-1) + B(I)
  ENDDO
ENDDO
Choose a smaller unit size to allow a more balanced load distribution (the work in each I iteration grows with I); a sketch follows below
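A minimal sketch of this idea (the block size k = 4 and the helper index i0 are illustrative choices, not from the slide; the PARALLEL DO spelling follows the earlier slides): strip-mining the parallel I loop with a block size much smaller than N/P mixes cheap (small I) and expensive (large I) rows in every processor's share.
      k = 4                            ! illustrative block size, much smaller than N/P
      PARALLEL DO I = 1, N, k
        DO i0 = I, MIN(I + k - 1, N)   ! walk one strip of rows
          DO J = 2, i0
            A(i0, J) = A(i0, J - 1) + B(i0)
          ENDDO
        ENDDO
      END PARALLEL DO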

37 Optimizing Compilers for Modern Architectures Pipeline Parallelism Fortran command DOACROSS Useful where loop parallelization is not available High synchronization costs
DO I = 2, N-1
  DO J = 2, N-1
    A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
  ENDDO
ENDDO

38 Optimizing Compilers for Modern Architectures Pipeline Parallelism
DOACROSS I = 2, N-1
  POST (EV(1))
  DO J = 2, N-1
    WAIT (EV(J-1))
    A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
    POST (EV(J))
  ENDDO
ENDDO

39 Optimizing Compilers for Modern Architectures Pipeline Parallelism

40 Optimizing Compilers for Modern Architectures Pipeline Parallelism
DOACROSS I = 2, N-1
  POST (EV(1))
  K = 0
  DO J = 2, N-1, 2
    K = K + 1
    WAIT (EV(K))
    DO j = J, MIN(J+1, N-1)
      A(I, j) = .25 * (A(I-1, j) + A(I, j-1) + A(I+1, j) + A(I, j+1))
    ENDDO
    POST (EV(K+1))
  ENDDO
ENDDO

41 Optimizing Compilers for Modern Architectures Pipeline Parallelism

42 Optimizing Compilers for Modern Architectures Scheduling Parallel Work Parallel execution is slower than serial execution if the overhead of scheduling and synchronizing the parallel iterations exceeds the work they contain Bakery-counter scheduling —Each processor takes its next iteration from a shared counter —High synchronization overhead

43 Optimizing Compilers for Modern Architectures Guided Self-Scheduling Minimize synchronization overhead —Schedules groups of iterations, unlike the bakery-counter method —Going from large to small chunks of work —Keep all processors busy at all times Iterations dispensed at each scheduling step: the next free processor takes ceil(R/P) iterations, where R is the number of iterations remaining and P is the number of processors Alternatively we can have GSS(k), which guarantees that all blocks handed out are of size k or greater
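A small worked example (N = 100 iterations and P = 4 processors are illustrative values, not from the slides): GSS hands out chunks of 25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, and 1 iterations, each chunk being the ceiling of the remaining work divided by P. A sketch that enumerates the chunk sizes:
      INTEGER N, P, R, CHUNK
      N = 100                      ! illustrative iteration count
      P = 4                        ! illustrative processor count
      R = N
   10 IF (R .GT. 0) THEN
         CHUNK = (R + P - 1) / P   ! ceiling(R / P) in integer arithmetic
         PRINT *, 'dispatch chunk of', CHUNK, 'iterations'
         R = R - CHUNK
         GO TO 10
      END IF
Under GSS(k) the dispatch would instead be MAX(k, (R + P - 1) / P), so that every block handed out has size at least k.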

44 Optimizing Compilers for Modern Architectures Guided Self-Scheduling GSS(1)

