CMPUT 680 - Compiler Design and Optimization. CMPUT680 - Winter 2006. Topic B: Loop Restructuring. José Nelson Amaral


1 CMPUT680 - Winter 2006. Topic B: Loop Restructuring. José Nelson Amaral. http://www.cs.ualberta.ca/~amaral/courses/680

2 Reading
Wolfe, Michael, High Performance Compilers for Parallel Computing, Addison-Wesley, 1996, Chapter 9.
Allen, Randy and Kennedy, Ken, Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002, Chapter 8.

3 Unswitching
Remove loop-independent conditionals from a loop. In the example, the test on T[i] does not depend on the inner loop index j, so it can be moved out of the j loop.

Before Unswitching:
    for i=1 to N do
        for j=2 to N do
            if T[i] > 0 then
                A[i,j] = A[i, j-1]*T[i] + B[i]
            else
                A[i,j] = 0.0
            endif
        endfor
    endfor

After Unswitching:
    for i=1 to N do
        if T[i] > 0 then
            for j=2 to N do
                A[i,j] = A[i, j-1]*T[i] + B[i]
            endfor
        else
            for j=2 to N do
                A[i,j] = 0.0
            endfor
        endif
    endfor

4 Unswitching
Constraints: The conditional tested must be completely independent of the loop.
Legality: It is always legal.
Advantages: Reduces the frequency of execution of the conditional statement.
Disadvantages: Loop structure is more complex. Code size expansion. Might prevent data reuse.
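To make the transformation concrete, here is a minimal Python sketch of the two loop nests from the unswitching example. The helper `make_inputs` and its test data are assumptions for illustration and do not come from the slides; index 0 of every list is padding so that the indices match the 1-based pseudocode.

```python
def before_unswitching(T, B, A, N):
    """Original nest: the test on T[i] runs once per (i, j) pair."""
    for i in range(1, N + 1):
        for j in range(2, N + 1):
            if T[i] > 0:
                A[i][j] = A[i][j - 1] * T[i] + B[i]
            else:
                A[i][j] = 0.0
    return A


def after_unswitching(T, B, A, N):
    """Unswitched nest: the test on T[i] runs once per value of i."""
    for i in range(1, N + 1):
        if T[i] > 0:
            for j in range(2, N + 1):
                A[i][j] = A[i][j - 1] * T[i] + B[i]
        else:
            for j in range(2, N + 1):
                A[i][j] = 0.0
    return A


def make_inputs(N):
    """Hypothetical test data; T alternates in sign so both branches run."""
    T = [0.0] + [(-1.0) ** i * i for i in range(1, N + 1)]
    B = [0.0] + [float(i) for i in range(1, N + 1)]
    A = [[1.0] * (N + 1) for _ in range(N + 1)]
    return T, B, A
```

Both functions should produce the same array; the unswitched version evaluates the condition N times instead of N*(N-1) times, at the cost of duplicating the inner loop.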

5 Loop Peeling
Remove the first (last) iteration of the loop into separate code.

Before Peeling:
    for i=1 to N do
        A[i] = (X+Y)*B[i]
    endfor

After Peeling:
    if N >= 1 then
        A[1] = (X+Y)*B[1]
        for j=2 to N do
            A[j] = (X+Y)*B[j]
        endfor
    endif

6 Loop Peeling
Constraints: If the compiler does not know that the trip count is always positive, the peeled code must be protected by a zero-trip test.
Advantages: Used to enable loop fusion or remove conditionals on the index variable from inside the loop. Allows execution of loop invariant code only in the first iteration.
Disadvantage: Code size expansion.
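The peeling example and its zero-trip test can be sketched in Python as follows. This is an illustrative rendering only; the function names are made up for this sketch, and index 0 is padding to keep the slide's 1-based indices.

```python
def before_peeling(B, X, Y, N):
    A = [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = (X + Y) * B[i]
    return A


def after_peeling(B, X, Y, N):
    A = [0.0] * (N + 1)
    if N >= 1:                        # zero-trip test guards the peeled iteration
        A[1] = (X + Y) * B[1]         # peeled first iteration
        for j in range(2, N + 1):     # remaining iterations
            A[j] = (X + Y) * B[j]
    return A
```

Without the `N >= 1` guard, the peeled statement would write A[1] even when the original loop executes zero times.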

7 Index Set Splitting
Divides the index set into two portions.

Before Set Splitting:
    for i=1 to 100 do
        A[i] = B[i] + C[i]
        if i > 10 then
            D[i] = A[i] + A[i-10]
        endif
    endfor

After Set Splitting:
    for i=1 to 10 do
        A[i] = B[i] + C[i]
    endfor
    for i=11 to 100 do
        A[i] = B[i] + C[i]
        D[i] = A[i] + A[i-10]
    endfor

8 Index Set Splitting
Advantages: Used to enable loop fusion. Removes conditionals that test the index variable from inside the loop.
Disadvantage: Code size expansion.
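A short Python sketch of the index set splitting example, under the assumption that all arrays hold 100 elements plus a padding slot at index 0. The function names are illustrative.

```python
def before_splitting(B, C):
    A = [0.0] * 101
    D = [0.0] * 101
    for i in range(1, 101):
        A[i] = B[i] + C[i]
        if i > 10:                    # tested on all 100 iterations
            D[i] = A[i] + A[i - 10]
    return A, D


def after_splitting(B, C):
    A = [0.0] * 101
    D = [0.0] * 101
    for i in range(1, 11):            # i = 1..10: the condition is always false
        A[i] = B[i] + C[i]
    for i in range(11, 101):          # i = 11..100: the condition is always true
        A[i] = B[i] + C[i]
        D[i] = A[i] + A[i - 10]
    return A, D
```

The split point (10) comes directly from the condition `i > 10`, so neither of the resulting loops needs the test.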

9 Scalar Expansion
In the following loop, the scalar variable T creates:
(1) a flow dependence from the first to the second assignment;
(2) a loop-carried anti-dependence from the second to the first assignment.

    for i=1 to N do
        T = A[i] + B[i]
        C[i] = T + 1/T
    endfor

This anti-dependence can prevent some loop transformations.

10 Scalar Expansion
Breaks anti-dependence relations by expanding, or promoting, a scalar into an array.

Before Scalar Expansion:
    for i=1 to N do
        T = A[i] + B[i]
        C[i] = T + 1/T
    endfor

After Scalar Expansion:
    if N >= 1 then
        allocate Tx(1:N)
        for i=1 to N do
            Tx[i] = A[i] + B[i]
            C[i] = Tx[i] + 1/Tx[i]
        endfor
        T = Tx[N]
    endif

11 Scalar Expansion
Constraints: The loop must be countable and the scalar must have no upward exposed uses. Flow dependences for the scalar in the loop must be loop independent. If the scalar is live on loop exit, the last value assigned in the array must be copied into the scalar upon exit.
Advantages: Eliminates anti-dependences and output dependences.
Disadvantage: In nested loops the size of the array might be prohibitive.
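The following Python sketch mirrors the scalar expansion example. The third function is an assumption added for illustration: it uses loop distribution as one example of a transformation that the anti-dependence on T would otherwise block. Index 0 is padding, and the inputs are assumed to keep A[i] + B[i] nonzero.

```python
def before_expansion(A, B, N):
    C = [0.0] * (N + 1)
    T = None
    for i in range(1, N + 1):
        T = A[i] + B[i]               # every iteration overwrites the same T
        C[i] = T + 1 / T
    return C, T


def after_expansion(A, B, N):
    C = [0.0] * (N + 1)
    T = None
    if N >= 1:
        Tx = [0.0] * (N + 1)          # one element of Tx per iteration
        for i in range(1, N + 1):
            Tx[i] = A[i] + B[i]
            C[i] = Tx[i] + 1 / Tx[i]
        T = Tx[N]                     # copy-out because T may be live on exit
    return C, T


def expanded_then_distributed(A, B, N):
    """With Tx in place of T, the two statements can run in separate loops."""
    C = [0.0] * (N + 1)
    T = None
    if N >= 1:
        Tx = [0.0] * (N + 1)
        for i in range(1, N + 1):
            Tx[i] = A[i] + B[i]
        for i in range(1, N + 1):
            C[i] = Tx[i] + 1 / Tx[i]
        T = Tx[N]
    return C, T
```

Distributing the unexpanded loop would leave every C[i] computed from the final value of T, which is why the expansion has to come first.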

12 Loop Fusion
Takes two adjacent loops and generates a single loop.

Before Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (3) endfor
    (4) for i=1 to N do
    (5)     C[i] = A[i] / 2
    (6) endfor
    (7) for i=1 to N do
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

After Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (5)     C[i] = A[i] / 2
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

But, is this fusion legal?

13 Loop Fusion

Before Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (3) endfor
    (4) for i=1 to N do
    (5)     C[i] = A[i] / 2
    (6) endfor
    (7) for i=1 to N do
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

Assume N=4. Each array is shown as elements 1 to 4, followed by element 5 after the bar:
    B = 1 3 5 7 | 0
After the first loop:
    A = 2 4 6 8 | 0
After the second loop:
    C = 1 2 3 4 | 1
After the third loop:
    D = .5 .3 .25 1

14 Loop Fusion

After Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (5)     C[i] = A[i] / 2
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

Assume N=4. Arrays before the fused loop runs (elements 1 to 4, then element 5 after the bar):
    B = 1 3 5 7 | 0
    A = 0 0 0 0 | 0
    C = 0 0 0 0 | 0
    D = 0 0 0 0
The first iteration sets A[1] = 2 and C[1] = 1, but statement (8) then reads C[2] before the fused loop has written it.

15 Loop Fusion
To be legal, a loop fusion must preserve all the dependence relations of the original loops.

Before Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (3) endfor
    (4) for i=1 to N do
    (5)     C[i] = A[i] / 2
    (6) endfor
    (7) for i=1 to N do
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

The original loops have the flow dependences:
    S2 δ^f S5
    S5 δ^f S8

After Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (5)     C[i] = A[i] / 2
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

What are the dependences in the fused loop?

16 Loop Fusion
The original loops have the flow dependences:
    S2 δ^f S5
    S5 δ^f S8

After Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (5)     C[i] = A[i] / 2
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

In the fused loop, the dependences are:
    S2 δ^f S5
    S8 δ^a S5

Fusion reversed the dependence between S5 and S8: the flow dependence became an anti-dependence. Thus it is illegal.

17 Loop Fusion
Takes two adjacent loops and generates a single loop.

Before Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (3) endfor
    (4) for i=1 to N do
    (5)     C[i] = A[i] / 2
    (6) endfor
    (7) for i=1 to N do
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

After Loop Fusion:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (5)     C[i] = A[i] / 2
    (6) endfor
    (7) for i=1 to N do
    (8)     D[i] = 1 / C[i+1]
    (9) endfor

This is a legal fusion! Only the first two loops are fused, so every execution of S5 still completes before S8 reads C[i+1].
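The difference between the illegal and the legal fusion can be checked with a small Python sketch. Two assumptions are made here that are not in the slides: the helper `make_arrays` is hypothetical, and C starts out holding the stale value 1.0 rather than 0 so that the illegal version runs to completion instead of dividing by zero. Index 0 is padding, and element N+1 exists because S8 reads C[i+1].

```python
def make_arrays(N, stale):
    """Allocate A, C, D; C holds hypothetical stale values before any loop runs."""
    A = [0.0] * (N + 2)
    C = [stale] * (N + 2)
    D = [0.0] * (N + 2)
    return A, C, D


def three_loops(B, N, stale=1.0):
    A, C, D = make_arrays(N, stale)
    for i in range(1, N + 1):
        A[i] = B[i] + 1          # S2
    for i in range(1, N + 1):
        C[i] = A[i] / 2          # S5
    for i in range(1, N + 1):
        D[i] = 1 / C[i + 1]      # S8 reads values S5 has already written
    return D[1:N + 1]


def fuse_all_three(B, N, stale=1.0):
    A, C, D = make_arrays(N, stale)
    for i in range(1, N + 1):
        A[i] = B[i] + 1          # S2
        C[i] = A[i] / 2          # S5
        D[i] = 1 / C[i + 1]      # S8 now reads C[i+1] before S5 writes it
    return D[1:N + 1]


def fuse_first_two(B, N, stale=1.0):
    A, C, D = make_arrays(N, stale)
    for i in range(1, N + 1):
        A[i] = B[i] + 1          # S2
        C[i] = A[i] / 2          # S5
    for i in range(1, N + 1):
        D[i] = 1 / C[i + 1]      # S8 still runs after every S5
    return D[1:N + 1]
```

With B = 1, 3, 5, 7 and N = 4, `three_loops` and `fuse_first_two` agree, while `fuse_all_three` computes every D[i] from the stale contents of C.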

18 Loop Fusion
Initially only data independent loops would be fused. Now we try to fuse data dependent loops to increase data locality and benefit from caches.
Loop fusion increases the size of the loop, which reduces instruction temporal locality (a problem only in machines with tiny instruction caches).
Larger loop bodies enable more effective scalar optimizations (common subexpression elimination and instruction scheduling).

19 Loop Fusion (Complications)
To be fused, two loops must be compatible, i.e.:
(1) they iterate the same number of times;
(2) they are adjacent or can be reordered to become adjacent;
(3) the compiler must be able to use the same induction variable in both loops.
Compilers use other transformations to make loops meet the conditions above.

20 Loop Fusion (Another Example)

Original loops (different trip counts, 99 and 98):
    (1) for i=1 to 99 do
    (2)     A[i] = B[i] + 1
    (3) endfor
    (4) for i=1 to 98 do
    (5)     C[i] = A[i+1] * 2
    (6) endfor

After peeling the first iteration of the first loop (both loops now iterate 98 times):
    (2) A[1] = B[1] + 1
    (1) for i=2 to 99 do
    (2)     A[i] = B[i] + 1
    (3) endfor
    (4) for i=1 to 98 do
    (5)     C[i] = A[i+1] * 2
    (6) endfor

After normalizing both loops to the induction variable ib and fusing them:
    (1) i = 1
    (2) A[i] = B[i] + 1
        for ib=0 to 97 do
    (1)     i = ib+2
    (2)     A[i] = B[i] + 1
    (4)     i = ib+1
    (5)     C[i] = A[i+1] * 2
    (6) endfor
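As a quick check of this example, the Python sketch below compares the original pair of loops with the peeled, normalized, and fused version. The function names are illustrative, and arrays are assumed to have a padding slot at index 0.

```python
def two_loops(B):
    A = [0.0] * 101
    C = [0.0] * 101
    for i in range(1, 100):           # i = 1..99
        A[i] = B[i] + 1
    for i in range(1, 99):            # i = 1..98
        C[i] = A[i + 1] * 2
    return A, C


def peeled_normalized_fused(B):
    A = [0.0] * 101
    C = [0.0] * 101
    i = 1
    A[i] = B[i] + 1                   # peeled first iteration of the first loop
    for ib in range(0, 98):           # ib = 0..97, shared induction variable
        i = ib + 2
        A[i] = B[i] + 1               # writes A[ib+2]
        i = ib + 1
        C[i] = A[i + 1] * 2           # reads A[ib+2], written just above
    return A, C
```

After peeling, the value A[i+1] that statement (5) needs is produced in the same fused iteration, so the flow dependence stays loop independent.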

21 Loop Fission (or Loop Distribution)
Breaks a loop into two or more smaller loops.

Original Loop:
    (1) for i=1 to N do
    (2)     A[i] = A[i] + B[i-1]
    (3)     B[i] = C[i-1]*X + Z
    (4)     C[i] = 1/B[i]
    (5)     D[i] = sqrt(C[i])
    (6) endfor

Dependence Graph (nodes S2, S3, S4, S5; edge labels are dependence distances):
    S3 -> S4 (0)
    S4 -> S5 (0)
    S4 -> S3 (1)
    S3 -> S2 (1)

22 Loop Fission (or Loop Distribution)
Breaks a loop into two or more smaller loops.

Original Loop:
    (1) for i=1 to N do
    (2)     A[i] = A[i] + B[i-1]
    (3)     B[i] = C[i-1]*X + Z
    (4)     C[i] = 1/B[i]
    (5)     D[i] = sqrt(C[i])
    (6) endfor

After Loop Fission:
    (1) for ib=0 to N-1 do
    (3)     B[ib+1] = C[ib]*X + Z
    (4)     C[ib+1] = 1/B[ib+1]
    (6) endfor
    (1) for ib=0 to N-1 do
    (2)     A[ib+1] = A[ib+1] + B[ib]
    (6) endfor
    (1) for ib=0 to N-1 do
    (5)     D[ib+1] = sqrt(C[ib+1])
    (6) endfor
    (1) i = N+1
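A Python rendering of the fission example, offered as a sketch. The function names are illustrative; lists are indexed 0..N because the loop reads B[0] and C[0], and the inputs are assumed to keep B[i] nonzero and C[i] non-negative.

```python
import math


def original_loop(A, B, C, X, Z, N):
    A, B, C, D = A[:], B[:], C[:], [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = A[i] + B[i - 1]        # S2
        B[i] = C[i - 1] * X + Z       # S3
        C[i] = 1 / B[i]               # S4
        D[i] = math.sqrt(C[i])        # S5
    return A, B, C, D


def after_fission(A, B, C, X, Z, N):
    A, B, C, D = A[:], B[:], C[:], [0.0] * (N + 1)
    for ib in range(0, N):            # S3 and S4 form a cycle and stay together
        B[ib + 1] = C[ib] * X + Z
        C[ib + 1] = 1 / B[ib + 1]
    for ib in range(0, N):            # S2 reads B[ib], written by the first loop
        A[ib + 1] = A[ib + 1] + B[ib]
    for ib in range(0, N):            # S5 reads C[ib+1], written by the first loop
        D[ib + 1] = math.sqrt(C[ib + 1])
    return A, B, C, D
```

The loop holding S3 and S4 must come first, because both S2 and S5 consume values that it produces.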

23 Loop Fission (or Loop Distribution)
All statements that form a strongly connected component in the original loop dependence graph must remain in the same loop after fission.
When finding strongly connected components for loop fission, the compiler can ignore loop-carried anti-dependences and output dependences for scalars that are expanded as part of loop fission.

24 Loop Fission (or Loop Distribution)
To find a legal order of the loops after fission, we compute the acyclic condensation of the dependence graph.

Dependence Graph (edge labels are dependence distances):
    S3 -> S4 (0)
    S4 -> S5 (0)
    S4 -> S3 (1)
    S3 -> S2 (1)

Acyclic Condensation:
    {S3, S4} -> S2
    {S3, S4} -> S5
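One way to compute the acyclic condensation and a legal loop order is sketched below in Python. This is not a production algorithm: it uses plain reachability, which is adequate for a handful of statements, where a compiler would more likely use a linear-time strongly connected components algorithm. The names `SLIDE_GRAPH`, `reachable`, and `condensation_order` are made up for this sketch; the edges are those of the fission example.

```python
# Edges point from the source of a dependence to its sink.
SLIDE_GRAPH = {"S3": ["S4", "S2"], "S4": ["S3", "S5"]}


def reachable(graph, start):
    """Return the set of nodes reachable from start (including start)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def condensation_order(graph):
    """Group statements into SCCs and list the groups in a legal loop order."""
    nodes = sorted(set(graph) | {s for succs in graph.values() for s in succs})
    reach = {n: reachable(graph, n) for n in nodes}
    # Two statements share a component when each can reach the other.
    comp_of = {n: frozenset(m for m in nodes if m in reach[n] and n in reach[m])
               for n in nodes}
    remaining = set(comp_of.values())
    order = []
    while remaining:
        # A component is ready once no other remaining component reaches it.
        ready = [c for c in remaining
                 if not any(c & reach[n]
                            for d in remaining if d != c for n in d)]
        ready.sort(key=sorted)
        order.extend(sorted(c) for c in ready)
        remaining -= set(ready)
    return order
```

For `SLIDE_GRAPH` the first group is S3 together with S4, followed by S2 and S5 (which are independent of each other), matching the loop order used in the fission example.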

25 Loop Fission (or Loop Distribution)
Uses of loop fission:
- it can improve cache use in machines with very small caches;
- it can be required for other transformations, such as loop interchanging.

26 Loop Reversal
Run a loop backward. All dependence directions are reversed. It is only legal for loops that have no loop-carried dependences. Can be used to allow fusion.

Original loops:
    (1) for i=1 to N do
    (2)     A[i] = B[i] + 1
    (3)     C[i] = A[i]/2
    (4) endfor
    (5) for i=1 to N do
    (6)     D[i] = 1/C[i+1]
    (7) endfor

After reversing both loops:
    (1) for i=N downto 1 do
    (2)     A[i] = B[i] + 1
    (3)     C[i] = A[i]/2
    (4) endfor
    (5) for i=N downto 1 do
    (6)     D[i] = 1/C[i+1]
    (7) endfor

After fusing the reversed loops:
    (1) for i=N downto 1 do
    (2)     A[i] = B[i] + 1
    (3)     C[i] = A[i]/2
    (6)     D[i] = 1/C[i+1]
    (7) endfor
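The effect of reversal on fusion can be checked with the Python sketch below. As an assumption for illustration, C initially holds the stale value 1.0 (so that reads of unwritten elements do not divide by zero), and the helper `fresh` is hypothetical. Index 0 is padding and element N+1 exists because D[i] reads C[i+1].

```python
def fresh(N, stale):
    """Allocate A, C, D; C holds hypothetical stale values before the loops run."""
    return [0.0] * (N + 2), [stale] * (N + 2), [0.0] * (N + 2)


def original_loops(B, N, stale=1.0):
    A, C, D = fresh(N, stale)
    for i in range(1, N + 1):
        A[i] = B[i] + 1
        C[i] = A[i] / 2
    for i in range(1, N + 1):
        D[i] = 1 / C[i + 1]
    return D[1:N + 1]


def fused_forward(B, N, stale=1.0):
    A, C, D = fresh(N, stale)
    for i in range(1, N + 1):         # i ascending: C[i+1] is not written yet
        A[i] = B[i] + 1
        C[i] = A[i] / 2
        D[i] = 1 / C[i + 1]
    return D[1:N + 1]


def reversed_and_fused(B, N, stale=1.0):
    A, C, D = fresh(N, stale)
    for i in range(N, 0, -1):         # i = N downto 1: C[i+1] was written last iteration
        A[i] = B[i] + 1
        C[i] = A[i] / 2
        D[i] = 1 / C[i + 1]
    return D[1:N + 1]
```

Running the loop backward means iteration i+1 executes before iteration i, so the write to C[i+1] again precedes the read, as it did in the original loops.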

27 Loop Interchanging
Reverses the nesting order of nested loops.
If the outer loop iterates many times, and the inner loop iterates only a few times, interchanging reduces the startup cost of the original inner loop.
Interchanging can change the spatial locality of memory references.

28 Loop Interchanging

Before Interchanging:
    (1) for j=2 to M do
    (2)     for i=1 to N do
    (3)         A[i,j] = A[i,j-1] + B[i,j]
    (4)     endfor
    (5) endfor

After Interchanging:
    (1) for i=1 to N do
    (2)     for j=2 to M do
    (3)         A[i,j] = A[i,j-1] + B[i,j]
    (4)     endfor
    (5) endfor
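A minimal Python sketch of this interchange, with illustrative function names and row 0 / column 0 used as padding. The only dependence is carried by the j loop (A[i,j] depends on A[i,j-1] with the same i), so each element still sees its left neighbour already updated after the loops are swapped.

```python
def j_outer(A, B, N, M):
    A = [row[:] for row in A]         # work on a copy
    for j in range(2, M + 1):
        for i in range(1, N + 1):
            A[i][j] = A[i][j - 1] + B[i][j]
    return A


def i_outer(A, B, N, M):
    A = [row[:] for row in A]         # work on a copy
    for i in range(1, N + 1):
        for j in range(2, M + 1):     # inner loop now walks along one row
            A[i][j] = A[i][j - 1] + B[i][j]
    return A
```

In a row-major layout such as C arrays, the `i_outer` version touches consecutive elements of a row in its inner loop, which is the spatial-locality effect the previous slide mentions; in a column-major layout the preference is reversed.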

29 Other Loop Restructuring
Loop Skewing: Unnormalize iteration vectors to change the shape of the iteration space to allow loop interchanging.
Strip Mining: Decompose a single loop into two nested loops (the inner loop computes a strip of the data computed by the original loop). Used for vector processors.
Loop Tiling: The loop space is divided into tiles, with the tile boundaries parallel to the iteration space axes.
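The slides give no code for these transformations, so the following Python sketch is purely illustrative. It strip-mines a simple loop (the body A[i] = B[i] + 1 is borrowed from the fusion example) and lists the visit order of a tiled two-dimensional iteration space. The parameters `strip` and `tile` are hypothetical sizes chosen for the example.

```python
def plain_loop(B, N):
    A = [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = B[i] + 1
    return A


def strip_mined_loop(B, N, strip=4):
    """strip is a hypothetical strip length, e.g. a vector register length."""
    A = [0.0] * (N + 1)
    for start in range(1, N + 1, strip):      # outer loop walks over strips
        stop = min(start + strip, N + 1)      # the last strip may be short
        for i in range(start, stop):          # inner loop handles one strip
            A[i] = B[i] + 1
    return A


def tiled_order(N, M, tile=2):
    """Return the (i, j) visit order of an N x M iteration space cut into tiles."""
    order = []
    for ti in range(1, N + 1, tile):          # tile origin along i
        for tj in range(1, M + 1, tile):      # tile origin along j
            for i in range(ti, min(ti + tile, N + 1)):
                for j in range(tj, min(tj + tile, M + 1)):
                    order.append((i, j))
    return order
```

Both versions visit exactly the same iterations as the untransformed loops; only the order (for tiling) or the loop structure (for strip mining) changes.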

