
1 Optimizing Compilers for Modern Architectures. Dependence: Theory and Practice. Allen and Kennedy, Chapter 2, pp. 49-70.

2 Loop-carried and Loop-independent Dependences
If statement S2 in a loop depends on statement S1, the dependence can occur in one of two ways:
1. S1 and S2 execute on different iterations: this is called a loop-carried dependence.
2. S1 and S2 execute on the same iteration: this is called a loop-independent dependence.

3 Loop-carried Dependence
Definition 2.11. Statement S2 has a loop-carried dependence on statement S1 if and only if S1 references location M on iteration i, S2 references M on iteration j, and d(i,j) > 0 (that is, D(i,j) contains a "<" as its leftmost non-"=" component).
Example:
      DO I = 1, N
S1      A(I+1) = F(I)
S2      F(I+1) = A(I)
      ENDDO

4 Loop-carried Dependence
The level of a loop-carried dependence is the index of the leftmost non-"=" component of D(i,j) for the dependence. For instance:
      DO I = 1, 10
        DO J = 1, 10
          DO K = 1, 10
S1          A(I, J, K+1) = A(I, J, K)
          ENDDO
        ENDDO
      ENDDO
The direction vector for S1 is (=, =, <), so the level of the dependence is 3.
A level-k dependence between S1 and S2 is denoted by S1 δk S2.
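To make these definitions concrete, here is a small Python sketch (not from the book; the function names are my own) that derives the direction vector and the dependence level from a given distance vector d(i,j):

def direction_vector(distance):
    """Map each distance component to '<' (positive), '=' (zero), or '>' (negative)."""
    return ['<' if d > 0 else '=' if d == 0 else '>' for d in distance]

def dependence_level(distance):
    """1-based index of the leftmost non-'=' component; None if loop-independent."""
    for k, d in enumerate(distance, start=1):
        if d != 0:
            return k
    return None   # all components '=': the dependence is loop-independent

# For A(I, J, K+1) = A(I, J, K) the distance vector is (0, 0, 1):
print(direction_vector((0, 0, 1)))    # ['=', '=', '<']
print(dependence_level((0, 0, 1)))    # 3, i.e. a level-3 dependence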

5 Loop-carried Transformations
Theorem 2.4. Any reordering transformation that does not alter the relative order of any loops in the nest and preserves the iteration order of the level-k loop preserves all level-k dependences.
Proof:
- D(i, j) has a "<" in the kth position and "=" in positions 1 through k-1.
- Therefore the source and sink of the dependence are in the same iteration of loops 1 through k-1.
- Therefore the sense of the dependence cannot be changed by a reordering of iterations of those loops.
As a result of the theorem, powerful transformations can be applied.
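The following runnable Python sketch (my own illustration, not the book's) spot-checks Theorem 2.4 on the A(I+1,J) = A(I,J) + B nest that appears again later in these slides: its only dependence is carried at level 1, so reversing the iteration order of the inner (level-2) J loop cannot change the result.

def run(n, m, b, reverse_inner):
    # a has rows 0..n+1 and columns 0..m; row i starts out holding the value i.
    a = [[float(i) for _ in range(m + 1)] for i in range(n + 2)]
    j_order = range(m, 0, -1) if reverse_inner else range(1, m + 1)
    for i in range(1, n + 1):
        for j in j_order:
            a[i + 1][j] = a[i][j] + b          # level-1 carried dependence only
    return a

# Same answer whichever way the J loop runs, as Theorem 2.4 predicts.
assert run(10, 5, 2.0, reverse_inner=False) == run(10, 5, 2.0, reverse_inner=True)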

6 Loop-carried Transformations
Example:
      DO I = 1, 10
S1      A(I+1) = F(I)
S2      F(I+1) = A(I)
      ENDDO
can be transformed to:
      DO I = 1, 10
S1      F(I+1) = A(I)
S2      A(I+1) = F(I)
      ENDDO

7 Loop-independent Dependences
Definition 2.14. Statement S2 has a loop-independent dependence on statement S1 if and only if there exist two iteration vectors i and j such that:
1) Statement S1 refers to memory location M on iteration i, S2 refers to M on iteration j, and i = j.
2) There is a control flow path from S1 to S2 within the iteration.
Example:
      DO I = 1, 10
S1      A(I) = ...
S2      ... = A(I)
      ENDDO

8 Loop-independent Dependences
A more complicated example:
      DO I = 1, 9
S1      A(I) = ...
S2      ... = A(10-I)
      ENDDO
No common loop is necessary. For instance:
      DO I = 1, 10
S1      A(I) = ...
      ENDDO
      DO I = 1, 10
S2      ... = A(20-I)
      ENDDO

9 Loop-independent Dependences
Theorem 2.5. If there is a loop-independent dependence from S1 to S2, any reordering transformation that does not move statement instances between iterations and preserves the relative order of S1 and S2 in the loop body preserves that dependence.
"S2 depends on S1 with a loop-independent dependence" is denoted by S1 δ∞ S2.
Note that the direction vector has entries that are all "=" for loop-independent dependences.

10 Loop-carried and Loop-independent Dependences
Loop-independent and loop-carried dependences partition all possible data dependences.
Note that if S1 δ S2, then S1 executes before S2. This can happen only if:
- the source iteration vector is lexicographically less than the sink iteration vector, so that d(i,j) > 0, or
- the iteration vectors are equal and S1 occurs before S2 textually
...precisely the criteria for loop-carried and loop-independent dependences.

11 Simple Dependence Testing
Theorem 2.7. Let a and b be iteration vectors within the iteration space of the following loop nest:
      DO i1 = L1, U1, S1
        DO i2 = L2, U2, S2
          ...
          DO in = Ln, Un, Sn
S1          A(f1(i1,...,in),...,fm(i1,...,in)) = ...
S2          ... = A(g1(i1,...,in),...,gm(i1,...,in))
          ENDDO
          ...
        ENDDO
      ENDDO

12 Simple Dependence Testing
      DO i1 = L1, U1, S1
        DO i2 = L2, U2, S2
          ...
          DO in = Ln, Un, Sn
S1          A(f1(i1,...,in),...,fm(i1,...,in)) = ...
S2          ... = A(g1(i1,...,in),...,gm(i1,...,in))
          ENDDO
          ...
        ENDDO
      ENDDO
A dependence exists from S1 to S2 if and only if there exist values of α and β such that (1) α is lexicographically less than or equal to β, and (2) the following system of dependence equations is satisfied: fi(α) = gi(β) for all i, 1 ≤ i ≤ m.
This is a direct application of the Loop Dependence Theorem.
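A brute-force Python sketch of this test (illustrative only; it enumerates the whole iteration space, which a real dependence tester would never do). The subscript functions f and g are passed in as ordinary callables, and Python's tuple comparison supplies the lexicographic order:

from itertools import product

def depends(bounds, f, g):
    """bounds: list of (low, high) per loop; f, g: iteration vector -> subscript tuple."""
    space = list(product(*[range(lo, hi + 1) for lo, hi in bounds]))
    for alpha in space:
        for beta in space:
            # tuple <= is lexicographic; check every subscript equation f_i(alpha) = g_i(beta)
            if alpha <= beta and f(alpha) == g(beta):
                return True, alpha, beta
    return False, None, None

# DO I = 1, N:  S1  A(I+1) = ...   /   S2  ... = A(I)
print(depends([(1, 10)], lambda it: (it[0] + 1,), lambda it: (it[0],)))
# (True, (1,), (2,)): the value written to A(2) on iteration 1 is read on iteration 2.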

13 Simple Dependence Testing: Delta Notation
Delta notation represents the index values at the source and the sink of a dependence.
Example:
      DO I = 1, N
S       A(I+1) = A(I) + B
      ENDDO
The iteration at the source is denoted by I0 and the iteration at the sink by I0 + ΔI.
Forming an equality gives us: I0 + 1 = I0 + ΔI.
Solving this gives ΔI = 1, so there is a carried dependence with distance vector (1) and direction vector (<).

14 Simple Dependence Testing: Delta Notation
Example:
      DO I = 1, 100
        DO J = 1, 100
          DO K = 1, 100
            A(I+1,J,K) = A(I,J,K+1) + B
          ENDDO
        ENDDO
      ENDDO
Dependence equations: I0 + 1 = I0 + ΔI; J0 = J0 + ΔJ; K0 = K0 + ΔK + 1
Solutions: ΔI = 1; ΔJ = 0; ΔK = -1
Corresponding direction vector: (<, =, >)
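A small sketch (mine, not the book's) of the delta calculation for the simple case used on these slides, where every subscript has the form index + constant: solving i0 + c_src = (i0 + Δi) + c_sink gives Δi = c_src - c_sink.

def delta_and_direction(src_offsets, sink_offsets):
    """Per-dimension distance and direction for subscripts of the form index + constant."""
    deltas = [s - t for s, t in zip(src_offsets, sink_offsets)]
    dirs = ['<' if d > 0 else '=' if d == 0 else '>' for d in deltas]
    return deltas, dirs

# A(I+1, J, K) = A(I, J, K+1) + B: source offsets (1, 0, 0), sink offsets (0, 0, 1)
print(delta_and_direction((1, 0, 0), (0, 0, 1)))   # ([1, 0, -1], ['<', '=', '>'])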

15 Simple Dependence Testing: Delta Notation
If a loop index does not appear in the subscripts, its distance is unconstrained and its direction is "*".
Example:
      DO I = 1, 100
        DO J = 1, 100
          A(I+1) = A(I) + B(J)
        ENDDO
      ENDDO
The direction vector for the dependence is (<, *).

16 Simple Dependence Testing: Delta Notation
"*" denotes the union of all three directions.
Example:
      DO J = 1, 100
        DO I = 1, 100
          A(I+1) = A(I) + B(J)
        ENDDO
      ENDDO
The direction vector for the dependence is (*, <), i.e. the union of (<, <), (=, <), and (>, <).
Note: (>, <) cannot be the direction of a dependence from source to sink; a ">" in the leftmost non-"=" position means the dependence actually runs in the opposite direction, with direction vector (<, >).
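A small helper (my own sketch) that expands a "*" entry into its three concrete directions and drops the combinations that cannot be dependences in the stated source-to-sink direction, i.e. those whose leftmost non-"=" entry is ">":

from itertools import product

def expand(direction_vector):
    """Expand '*' entries and keep only plausible source-to-sink direction vectors."""
    expanded = product(*[['<', '=', '>'] if d == '*' else [d] for d in direction_vector])
    def plausible(dv):
        for d in dv:
            if d != '=':
                return d == '<'     # leftmost non-'=' must be '<'
        return True                 # all '=': a loop-independent dependence is fine
    return [dv for dv in expanded if plausible(dv)]

print(expand(('*', '<')))   # [('<', '<'), ('=', '<')]; (>, <) is a dependence the other way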

17 Parallelization and Vectorization
Theorem 2.8. It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.
We want to convert loops like:
      DO I = 1, N
        X(I) = X(I) + C
      ENDDO
to
      X(1:N) = X(1:N) + C
(Fortran 77 to Fortran 90). However:
      DO I = 1, N
        X(I+1) = X(I) + C
      ENDDO
is not equivalent to
      X(2:N+1) = X(1:N) + C
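A quick numeric check of the "however" case (my own sketch, using Python lists and slice assignment as a stand-in for Fortran 90 array notation): the vector form reads every X(I) before any write, while the sequential loop propagates values along the carried dependence.

n, c = 5, 1.0
x_seq = [0.0] * (n + 2); x_seq[1] = 1.0            # index 0 unused; X(1) = 1, rest 0
x_vec = list(x_seq)

for i in range(1, n + 1):                          # DO I = 1, N: X(I+1) = X(I) + C
    x_seq[i + 1] = x_seq[i] + c

x_vec[2:n + 2] = [v + c for v in x_vec[1:n + 1]]   # X(2:N+1) = X(1:N) + C, reads before writes

print(x_seq[1:])   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]: the value propagates
print(x_vec[1:])   # [1.0, 2.0, 1.0, 1.0, 1.0, 1.0]: a different answer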

18 Loop Distribution
Can statements in loops that carry dependences be vectorized?
      DO I = 1, N
S1      A(I+1) = B(I) + C
S2      D(I) = A(I) + E
      ENDDO
Dependence: S1 δ1 S2. This can be converted to:
S1    A(2:N+1) = B(1:N) + C
S2    D(1:N) = A(1:N) + E

19 Loop Distribution
      DO I = 1, N
S1      A(I+1) = B(I) + C
S2      D(I) = A(I) + E
      ENDDO
is transformed to:
      DO I = 1, N
S1      A(I+1) = B(I) + C
      ENDDO
      DO I = 1, N
S2      D(I) = A(I) + E
      ENDDO
which leads to:
S1    A(2:N+1) = B(1:N) + C
S2    D(1:N) = A(1:N) + E
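A numeric spot-check (my own) that this distribution is legal: the only dependence runs forward from S1 to S2, so computing all of S1's iterations before any of S2's yields the same arrays.

n = 6
b, c, e = [float(i) for i in range(n + 1)], 10.0, 100.0

def fused():
    a, d = [0.0] * (n + 2), [0.0] * (n + 1)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + c          # S1
        d[i] = a[i] + e              # S2 reads the A value written one iteration earlier
    return a, d

def distributed():
    a, d = [0.0] * (n + 2), [0.0] * (n + 1)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + c          # S1 loop
    for i in range(1, n + 1):
        d[i] = a[i] + e              # S2 loop: every A(I) it reads is already final
    return a, d

print(fused() == distributed())      # True: the forward dependence is preserved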

20 Loop Distribution
Loop distribution fails if there is a cycle of dependences:
      DO I = 1, N
S1      A(I+1) = B(I) + C
S2      B(I+1) = A(I) + E
      ENDDO
Here S1 δ1 S2 and S2 δ1 S1.
What about:
      DO I = 1, N
S1      B(I) = A(I) + E
S2      A(I+1) = B(I) + C
      ENDDO
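The same kind of check (mine) for the cyclic case shows why distribution fails here: splitting the loop lets S1 read B values that S2 has not yet produced, so the S2 δ1 S1 dependence is violated.

n = 6

def fused():
    a, b = [1.0] * (n + 2), [1.0] * (n + 2)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + 10.0       # S1
        b[i + 1] = a[i] + 100.0      # S2
    return a, b

def distributed():
    a, b = [1.0] * (n + 2), [1.0] * (n + 2)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + 10.0       # S1 loop runs to completion first
    for i in range(1, n + 1):
        b[i + 1] = a[i] + 100.0      # S2 loop
    return a, b

print(fused() == distributed())      # False: S1 read stale B(I) values in the split version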

21 Simple Vectorization Algorithm
procedure vectorize(L, D)
  // L is the maximal loop nest containing the statements.
  // D is the dependence graph for statements in L.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected regions in the
    dependence graph D restricted to L (Tarjan);
  construct Lp from L by reducing each Si to a single node and compute Dp,
    the dependence graph naturally induced on Lp by D;
  let {p1, p2, ..., pm} be the m nodes of Lp numbered in an order consistent
    with Dp (use topological sort);
  for i = 1 to m do begin
    if pi is a dependence cycle then
      generate a DO-loop around the statements in pi;
    else
      directly rewrite pi in Fortran 90, vectorizing it with respect to every
        loop containing it;
  end
end vectorize
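Below is a compact Python rendering of this driver (my own reading of the pseudocode, not code from the book). It uses Tarjan's algorithm for the strongly-connected regions; Tarjan emits the regions in reverse topological order of the condensation, so reversing its output gives an ordering consistent with the dependences. The example graph at the bottom is the cyclic loop from the previous slide with an extra statement S3 D(I) = A(I) + E of my own, added to show both outcomes.

def tarjan_sccs(graph):
    """graph: {node: [successors]}; returns SCCs in reverse topological order."""
    index, low, stack, on_stack, sccs, counter = {}, {}, [], set(), [], [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:                 # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

def vectorize(graph):
    for region in reversed(tarjan_sccs(graph)):            # topological order
        cyclic = len(region) > 1 or region[0] in graph.get(region[0], [])
        if cyclic:
            print("generate a DO loop around:", sorted(region))
        else:
            print("generate a vector statement for:", region[0])

# S1 A(I+1)=B(I)+C ; S2 B(I+1)=A(I)+E ; S3 D(I)=A(I)+E  (S3 is my added statement)
vectorize({"S1": ["S2", "S3"], "S2": ["S1"], "S3": []})
# -> generate a DO loop around: ['S1', 'S2']
# -> generate a vector statement for: S3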

22 Problems With Simple Vectorization
      DO I = 1, N
        DO J = 1, M
S1        A(I+1,J) = A(I,J) + B
        ENDDO
      ENDDO
There is a dependence from S1 to itself with distance vector d(i,j) = (1,0).
Key observation: since the dependence is carried at level 1, we can manipulate the other loop! The nest can be converted to:
      DO I = 1, N
S1      A(I+1,1:M) = A(I,1:M) + B
      ENDDO
The simple algorithm does not capitalize on such opportunities.

23 Advanced Vectorization Algorithm
procedure codegen(R, k, D)
  // R is the region for which we must generate code.
  // k is the minimum nesting level of possible parallel loops.
  // D is the dependence graph among statements in R.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected regions in the
    dependence graph D restricted to R;
  construct Rp from R by reducing each Si to a single node and compute Dp,
    the dependence graph naturally induced on Rp by D;
  let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order consistent
    with Dp (use topological sort to do the numbering);
  for i = 1 to m do begin
    if pi is cyclic then begin
      generate a level-k DO statement;
      let Di be the dependence graph consisting of all dependence edges in D
        that are at level k+1 or greater and are internal to pi;
      codegen(pi, k+1, Di);
      generate the level-k ENDDO statement;
    end
    else
      generate a vector statement for pi in r(pi)-k+1 dimensions, where r(pi)
        is the number of loops containing pi;
  end
end codegen
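A recursive Python sketch of codegen (again my own illustration, reusing the tarjan_sccs helper from the sketch after the simple algorithm above). Edges carry their dependence level; None marks a loop-independent edge, which is never stripped. The statements, levels, and nesting depths come from the example analysed on the following slides, plus one assumed level-1 edge from S4 back to S3 (an antidependence on A) that is needed to reproduce the {S2, S3, S4} grouping shown there but is not spelled out on the slides.

def codegen(region, k, edges, nesting, indent=""):
    # Dependence graph restricted to this region, built from the edge list.
    graph = {s: [t for (u, t, lvl) in edges if u == s and t in region] for s in region}
    for pi in reversed(tarjan_sccs(graph)):                # topological order
        cyclic = len(pi) > 1 or pi[0] in graph[pi[0]]
        if cyclic:
            print(f"{indent}DO  (level-{k} loop)")
            # Keep only edges internal to pi that are loop-independent or at level k+1 or greater.
            inner = [(u, t, lvl) for (u, t, lvl) in edges
                     if u in pi and t in pi and (lvl is None or lvl >= k + 1)]
            codegen(pi, k + 1, inner, nesting, indent + "  ")
            print(f"{indent}ENDDO")
        else:
            s = pi[0]
            dims = nesting[s] - k + 1                      # r(pi) - k + 1 vector dimensions
            kind = f"vector statement in {dims} dimension(s)" if dims > 0 else "scalar statement"
            print(f"{indent}{s}: {kind}")

nesting = {"S1": 1, "S2": 2, "S3": 3, "S4": 2}             # loops surrounding each statement
edges = [("S2", "S3", None), ("S3", "S2", 2), ("S3", "S4", None),
         ("S4", "S1", 1), ("S4", "S3", 1)]                 # (source, sink, carrying level)
codegen(["S1", "S2", "S3", "S4"], 1, edges, nesting)
# The printed structure matches the generated code on the last slide of this example:
# a level-1 DO around a level-2 DO (scalar S2, 1-D vector S3), then vector S4, then vector S1.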

24 Advanced Vectorization Algorithm
      DO I = 1, 100
S1      X(I) = Y(I) + 10
        DO J = 1, 100
S2        B(J) = A(J,N)
          DO K = 1, 100
S3          A(J+1,K) = B(J) + C(J,K)
          ENDDO
S4        Y(I+J) = A(J+1,N)
        ENDDO
      ENDDO

25 Advanced Vectorization Algorithm
      DO I = 1, 100
S1      X(I) = Y(I) + 10
        DO J = 1, 100
S2        B(J) = A(J,N)
          DO K = 1, 100
S3          A(J+1,K) = B(J) + C(J,K)
          ENDDO
S4        Y(I+J) = A(J+1,N)
        ENDDO
      ENDDO
Applying the simple dependence testing procedure: there is a true dependence from S4 to S1.
I0 + J = I0 + ΔI, so ΔI = J. Since J is always positive, the direction is "<".

26 Advanced Vectorization Algorithm
      DO I = 1, 100
S1      X(I) = Y(I) + 10
        DO J = 1, 100
S2        B(J) = A(J,N)
          DO K = 1, 100
S3          A(J+1,K) = B(J) + C(J,K)
          ENDDO
S4        Y(I+J) = A(J+1,N)
        ENDDO
      ENDDO
S2 and S3: dependence via B(J). I does not occur in either subscript, so its direction entry is "*".
We get J0 = J0 + ΔJ, so ΔJ = 0, and the direction vector is (*, =).

27 Advanced Vectorization Algorithm
      DO I = 1, 100
S1      X(I) = Y(I) + 10
        DO J = 1, 100
S2        B(J) = A(J,N)
          DO K = 1, 100
S3          A(J+1,K) = B(J) + C(J,K)
          ENDDO
S4        Y(I+J) = A(J+1,N)
        ENDDO
      ENDDO
codegen is called at the outermost level; S1 will be vectorized:
      DO I = 1, 100
        codegen({S2, S3, S4}, 2)
      ENDDO
      X(1:100) = Y(1:100) + 10

28 Advanced Vectorization Algorithm
codegen({S2, S3, S4}, 2): the level-1 dependences are stripped off.
      DO I = 1, 100
        DO J = 1, 100
          codegen({S2, S3}, 3)
        ENDDO
S4      Y(I+1:I+100) = A(2:101,N)
      ENDDO
      X(1:100) = Y(1:100) + 10

29 Advanced Vectorization Algorithm
codegen({S2, S3}, 3): the level-2 dependences are stripped off.
Original loop nest:
      DO I = 1, 100
S1      X(I) = Y(I) + 10
        DO J = 1, 100
S2        B(J) = A(J,N)
          DO K = 1, 100
S3          A(J+1,K) = B(J) + C(J,K)
          ENDDO
S4        Y(I+J) = A(J+1,N)
        ENDDO
      ENDDO
Final generated code:
      DO I = 1, 100
        DO J = 1, 100
          B(J) = A(J,N)
          A(J+1,1:100) = B(J) + C(J,1:100)
        ENDDO
        Y(I+1:I+100) = A(2:101,N)
      ENDDO
      X(1:100) = Y(1:100) + 10

