
1 Enhancing Fine-Grained Parallelism
Chapter 5 of Allen and Kennedy, Optimizing Compilers for Modern Architectures

2 Fine-Grained Parallelism
Techniques to enhance fine-grained parallelism:
—Loop Interchange
—Scalar Expansion
—Scalar Renaming
—Array Renaming

3 Prelude: A Long Time Ago...

  procedure codegen(R, k, D);
    // R is the region for which we must generate code.
    // k is the minimum nesting level of possible parallel loops.
    // D is the dependence graph among statements in R.
    find the set {S1, S2, ..., Sm} of maximal strongly-connected regions
      in the dependence graph D restricted to R;
    construct Rp from R by reducing each Si to a single node, and compute Dp,
      the dependence graph naturally induced on Rp by D;
    let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order consistent
      with Dp (use topological sort to do the numbering);
    for i = 1 to m do begin
      if pi is cyclic then begin          // "We fail here" (slide annotation)
        generate a level-k DO statement;
        let Di be the dependence graph consisting of all dependence edges in D
          that are at level k+1 or greater and are internal to pi;
        codegen(pi, k+1, Di);
        generate the level-k ENDDO statement;
      end
      else
        generate a vector statement for pi in r(pi)-k+1 dimensions,
          where r(pi) is the number of loops containing pi;
    end

4 Prelude: A Long Time Ago...
—codegen tries to find parallelism using the transformations of loop distribution and statement reordering.
—If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops.
—Goal in Chapter 5: explore other transformations to exploit parallelism.

5 Motivational Example

  DO J = 1, M
    DO I = 1, N
      T = 0.0
      DO K = 1, L
        T = T + A(I,K) * B(K,J)
      ENDDO
      C(I,J) = T
    ENDDO
  ENDDO

codegen will not uncover any vector operations here. However, by scalar expansion, we can get:

  DO J = 1, M
    DO I = 1, N
      T$(I) = 0.0
      DO K = 1, L
        T$(I) = T$(I) + A(I,K) * B(K,J)
      ENDDO
      C(I,J) = T$(I)
    ENDDO
  ENDDO

6 Motivational Example

  DO J = 1, M
    DO I = 1, N
      T$(I) = 0.0
      DO K = 1, L
        T$(I) = T$(I) + A(I,K) * B(K,J)
      ENDDO
      C(I,J) = T$(I)
    ENDDO
  ENDDO

7 Motivational Example II
Loop distribution gives us:

  DO J = 1, M
    DO I = 1, N
      T$(I) = 0.0
    ENDDO
    DO I = 1, N
      DO K = 1, L
        T$(I) = T$(I) + A(I,K) * B(K,J)
      ENDDO
    ENDDO
    DO I = 1, N
      C(I,J) = T$(I)
    ENDDO
  ENDDO

8 Motivational Example III
Finally, interchanging the I and K loops, we get:

  DO J = 1, M
    T$(1:N) = 0.0
    DO K = 1, L
      T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
    ENDDO
    C(1:N,J) = T$(1:N)
  ENDDO

A couple of new transformations used:
—Loop interchange
—Scalar expansion

9 Loop Interchange

  DO I = 1, N
    DO J = 1, M
S     A(I,J+1) = A(I,J) + B     ! DV: (=, <)
    ENDDO
  ENDDO

Applying loop interchange:

  DO J = 1, M
    DO I = 1, N
S     A(I,J+1) = A(I,J) + B     ! DV: (<, =)
    ENDDO
  ENDDO

leads to:

  DO J = 1, M
S   A(1:N,J+1) = A(1:N,J) + B
  ENDDO

10 Loop Interchange
Loop interchange is a reordering transformation. Why?
—Think of statements as being parameterized by the corresponding iteration vector.
—Loop interchange merely changes the execution order of these statement instances.
—It does not create new instances or delete existing instances.

  DO J = 1, M
    DO I = 1, N
S     ...
    ENDDO
  ENDDO

If interchanged, S(2,1) will execute before S(1,2).
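
A minimal runnable sketch (mine, not the slides') making this concrete: with N = M = 2, print the instances S(J,I) in both execution orders. The same four instances appear in both listings; only their order changes, and S(2,1) does come before S(1,2) after interchange.

  PROGRAM ITERORDER
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 2, M = 2
    INTEGER :: I, J
    PRINT *, 'Original J-outer order:'        ! S(1,1) S(1,2) S(2,1) S(2,2)
    DO J = 1, M
      DO I = 1, N
        PRINT *, 'S(', J, ',', I, ')'
      ENDDO
    ENDDO
    PRINT *, 'Interchanged I-outer order:'    ! S(1,1) S(2,1) S(1,2) S(2,2)
    DO I = 1, N
      DO J = 1, M
        PRINT *, 'S(', J, ',', I, ')'
      ENDDO
    ENDDO
  END PROGRAM ITERORDER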

11 Loop Interchange: Safety
Safety: not all loop interchanges are safe.

  DO J = 1, M
    DO I = 1, N
      A(I,J+1) = A(I+1,J) + B
    ENDDO
  ENDDO

Direction vector: (<, >). If we interchange the loops, we violate the dependence.

12 Loop Interchange: Safety
A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

13 Loop Interchange: Safety
A dependence is interchange-sensitive if it is carried by the same loop after interchange. That is, an interchange-sensitive dependence moves with its original carrier loop to the new level.
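
A hedged example of my own (not from the slides): in the nest below, the statement depends on itself with direction vector (<, =), carried by the I loop at level 1. After interchanging I and J the vector becomes (=, <), carried by level 2, which is again the I loop, now innermost. The dependence moved with its carrier, so it is interchange-sensitive.

  PROGRAM SENSITIVE
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 4, M = 4
    REAL :: A(N+1,M), B
    INTEGER :: I, J
    A = 0.0
    B = 1.0
    DO I = 1, N
      DO J = 1, M
        A(I+1,J) = A(I,J) + B   ! writes A(I+1,J), reads A(I,J): DV (<, =)
      ENDDO
    ENDDO
    PRINT *, A(N+1,1)           ! N*B = 4.0: the I recurrence ran sequentially
  END PROGRAM SENSITIVE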

14 Loop Interchange: Safety
Theorem 5.1. Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i,j).

The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row.

15 Loop Interchange: Safety

  DO I = 1, N
    DO J = 1, M
      DO K = 1, L
        A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
      ENDDO
    ENDDO
  ENDDO

The direction matrix for the loop nest is:

  < < =
  < = >

Theorem 5.2. A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row. (Follows from Theorem 5.1 and Theorem 2.3.)

16 Loop Interchange: Profitability
Profitability depends on architecture.

  DO I = 1, N
    DO J = 1, M
      DO K = 1, L
S       A(I+1,J+1,K) = A(I,J,K) + B
      ENDDO
    ENDDO
  ENDDO

For SIMD machines with a large number of functional units:

  DO I = 1, N
S   A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
  ENDDO

Not suitable for vector register machines.

17 Loop Interchange: Profitability
For vector machines, we want to vectorize loops with stride-one memory access. Since Fortran stores arrays in column-major order, it is most useful to vectorize the I-loop. Thus, transform to:

  DO J = 1, M
    DO K = 1, L
S     A(2:N+1,J+1,K) = A(1:N,J,K) + B
    ENDDO
  ENDDO
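
To see why the I direction is stride-one, here is a minimal sketch (mine, not the slides') that prints column-major element offsets for a hypothetical A(N,M,L) array; the offset formula is the standard Fortran layout rule. Varying I moves one element at a time, while varying J jumps N elements.

  PROGRAM COLMAJOR
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 4, M = 3, L = 2
    INTEGER :: I, J
    DO I = 1, 3
      PRINT *, 'A(', I, ',1,1) at element offset', OFS(I, 1, 1)  ! 0, 1, 2: stride 1
    ENDDO
    DO J = 1, 3
      PRINT *, 'A(1,', J, ',1) at element offset', OFS(1, J, 1)  ! 0, 4, 8: stride N
    ENDDO
  CONTAINS
    INTEGER FUNCTION OFS(I, J, K)   ! column-major offset of A(I,J,K) in A(N,M,L)
      INTEGER, INTENT(IN) :: I, J, K
      OFS = (I-1) + (J-1)*N + (K-1)*N*M
    END FUNCTION OFS
  END PROGRAM COLMAJOR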

18 Loop Interchange: Profitability
MIMD machines with vector execution units: we want to cut down synchronization costs. Hence, shift the K-loop to the outermost level:

  PARALLEL DO K = 1, L
    DO J = 1, M
      A(2:N+1,J+1,K) = A(1:N,J,K) + B
    ENDDO
  END PARALLEL DO

19 Loop Shifting
Motivation: identify loops which can be moved, and move them to "optimal" nesting levels.

Theorem 5.3. In a perfect loop nest, if loops at level i, i+1, ..., i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.

Proof sketch: each row of the direction matrix either has its leftmost non-"=" entry (a "<") in a column left of i, which the shift leaves in place, or has "=" in all of columns i through i+n. In both cases no ">" can become the leftmost non-"=" entry of any row, so the permutation is legal by Theorem 5.2; and the shifted columns can never become the leftmost non-"=" entry of a row, so those loops carry no dependence in their new position.

20 Loop Shifting

  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
S       A(I,J) = A(I,J) + B(I,K)*C(K,J)
      ENDDO
    ENDDO
  ENDDO

S has true, anti, and output dependences on itself, so codegen will fail: a recurrence exists at the innermost level. Use loop shifting to move the K-loop to the outermost level:

  DO K = 1, N
    DO I = 1, N
      DO J = 1, N
S       A(I,J) = A(I,J) + B(I,K)*C(K,J)
      ENDDO
    ENDDO
  ENDDO

21 Loop Shifting

  DO K = 1, N
    DO I = 1, N
      DO J = 1, N
S       A(I,J) = A(I,J) + B(I,K)*C(K,J)
      ENDDO
    ENDDO
  ENDDO

codegen vectorizes this to:

  DO K = 1, N
    FORALL J = 1, N
      A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)
    END FORALL
  ENDDO

22 Loop Shifting
Change the body of codegen:

  if pi is cyclic then
    if k is the deepest loop in pi then
      try_recurrence_breaking(pi, D, k)
    else begin
      select_loop_and_interchange(pi, D, k);
      generate a level-k DO statement;
      let Di be the dependence graph consisting of all dependence edges in D
        that are at level k+1 or greater and are internal to pi;
      codegen(pi, k+1, Di);
      generate the level-k ENDDO statement;
    end

23 Loop Shifting

  procedure select_loop_and_interchange(πi, D, k)
    if the outermost carried dependence in πi is at level p > k then
      shift loops at level k, k+1, ..., p-1 inside the level-p loop,
        making it into the level-k loop;
    return;
  end select_loop_and_interchange

24 Loop Selection
Consider:

  DO I = 1, N
    DO J = 1, M
S     A(I+1,J+1) = A(I,J) + A(I+1,J)
    ENDDO
  ENDDO

Direction matrix:

  < <
  = <

The loop shifting algorithm will fail to uncover vector loops; however, interchanging the loops can lead to:

  DO J = 1, M
    A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)
  ENDDO

We need a more general algorithm.

25 Loop Selection
Loop selection: select a loop at nesting level p ≥ k that can be safely moved outward to level k, and shift the loops at level k, k+1, ..., p-1 inside it.

26 Loop Selection
Heuristics for selecting the loop level:
—If the level-k loop carries no dependence, then let p be the smallest integer such that the level-p loop carries a dependence (the loop-shifting heuristic).
—If the level-k loop carries a dependence, let p be the outermost loop that can be safely shifted outward to position k and that carries a dependence d whose direction vector contains an "=" in every position but the p-th. If no such loop exists, let p = k.

(Slide figure: the direction vector of d has the form (=, =, ..., =, <, =, ..., =), with the "<" in position p, labeled "Loop p".)

In the example on slide 24, the second rule selects p = 2: the J-loop carries the dependence with direction vector (=, <).

27 Scalar Expansion

  DO I = 1, N
S1  T = A(I)
S2  A(I) = B(I)
S3  B(I) = T
  ENDDO

Scalar expansion:

  DO I = 1, N
S1  T$(I) = A(I)
S2  A(I) = B(I)
S3  B(I) = T$(I)
  ENDDO
  T = T$(N)

leads to:

S1  T$(1:N) = A(1:N)
S2  A(1:N) = B(1:N)
S3  B(1:N) = T$(1:N)
    T = T$(N)

28 Scalar Expansion
However, expansion is not always profitable. Consider:

  DO I = 1, N
    T = T + A(I) + A(I+1)
    A(I) = T
  ENDDO

Scalar expansion gives us:

  T$(0) = T
  DO I = 1, N
S1  T$(I) = T$(I-1) + A(I) + A(I+1)
S2  A(I) = T$(I)
  ENDDO
  T = T$(N)

Here the loop-carried true dependence of S1 on itself reflects a reuse of values, not merely of a memory location, so expansion cannot break the recurrence.

29 Scalar Expansion: Safety
Scalar expansion is always safe. When is it profitable?
—Naïve approach: expand all scalars, vectorize, shrink all unnecessary expansions.
—However, we want to predict when expansion is profitable.
Dependences due to reuse of memory locations vs. reuse of values:
—Dependences due to reuse of values must be preserved.
—Dependences due to reuse of memory locations can be deleted by expansion.

30 Scalar Expansion: Covering Definitions
A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X. In both loops below, S1 is a covering definition:

  DO I = 1, 100
S1  T = X(I)
S2  Y(I) = T
  ENDDO

  DO I = 1, 100
    IF (A(I) .GT. 0) THEN
S1    T = X(I)
S2    Y(I) = T
    ENDIF
  ENDDO

31 Scalar Expansion: Covering Definitions
A covering definition does not always exist:

  DO I = 1, 100
    IF (A(I) .GT. 0) THEN
S1    T = X(I)
    ENDIF
S2  Y(I) = T
  ENDDO

In SSA terms: there is no covering definition for a variable T if the SSA edge out of the first assignment to T goes to a φ-function later in the loop which merges its value with values from another control-flow path through the loop.

32 Scalar Expansion: Covering Definitions
We will consider a collection of covering definitions. There is a collection C of covering definitions for T in a loop if either:
—there exists no φ-function at the beginning of the loop that merges versions of T from outside the loop with versions defined in the loop, or
—the φ-function within the loop has no SSA edge to any φ-function, including itself.

33 Scalar Expansion: Covering Definitions
Remember the loop which had no covering definition:

  DO I = 1, 100
    IF (A(I) .GT. 0) THEN
S1    T = X(I)
    ENDIF
S2  Y(I) = T
  ENDDO

To form a collection of covering definitions, we can insert dummy assignments:

  DO I = 1, 100
    IF (A(I) .GT. 0) THEN
S1    T = X(I)
    ELSE
S2    T = T
    ENDIF
S3  Y(I) = T
  ENDDO

34 Scalar Expansion: Covering Definitions
Algorithm to insert dummy assignments and compute the collection C of covering definitions:
—Central idea: look for parallel paths to a φ-function following the first assignment, until no more exist.

35 Scalar Expansion: Covering Definitions
Detailed algorithm:
—Let S0 be the φ-function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack.
—Let S1 be the first definition of T in the loop. Add S1 to C.
—If the SSA successor of S1 is a φ-function S2 that is not equal to S0, then push S2 onto the stack and mark it.
—While the stack is non-empty:
  —pop the φ-function S from the stack;
  —add all SSA predecessors of S that are not φ-functions to C;
  —if there is an SSA edge from S0 into S, insert the assignment T = T as the last statement along that edge and add it to C;
  —for each unmarked φ-function S3 (other than S0) that is an SSA predecessor of S, mark S3 and push it onto the stack;
  —for each unmarked φ-function S4 that can be reached from S by a single SSA edge and that is not predominated by S in the control flow graph, mark S4 and push it onto the stack.

36 Scalar Expansion: Covering Definitions
Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop:
—Create an array T$ of appropriate length.
—For each S in the covering-definition collection C, replace the T on the left-hand side by T$(I).
—For every other definition of T, and every use of T in the loop body reachable by SSA edges that do not pass through S0 (the φ-function at the beginning of the loop), replace T by T$(I).
—For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).
—If S0 is not null, insert T$(0) = T before the loop.
—If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, where U is the loop upper bound.

37 Scalar Expansion: Covering Definitions
Original loop:

  DO I = 1, 100
    IF (A(I) .GT. 0) THEN
S1    T = X(I)
    ENDIF
S2  Y(I) = T
  ENDDO

After inserting covering definitions:

  DO I = 1, 100
    IF (A(I) .GT. 0) THEN
S1    T = X(I)
    ELSE
S2    T = T
    ENDIF
S3  Y(I) = T
  ENDDO

After scalar expansion:

  T$(0) = T
  DO I = 1, 100
    IF (A(I) .GT. 0) THEN
S1    T$(I) = X(I)
    ELSE
      T$(I) = T$(I-1)
    ENDIF
S2  Y(I) = T$(I)
  ENDDO

38 Deletable Dependences
Uses of T before covering definitions are expanded as T$(I-1); all other uses are expanded as T$(I). The deletable dependences are then:
—backward carried antidependences
—backward carried output dependences
—forward carried output dependences
—loop-independent antidependences into the covering definition
—loop-carried true dependences from a covering definition

39 Scalar Expansion

  procedure try_recurrence_breaking(πi, D, k)
    if k is the deepest loop in πi then begin
      remove deletable edges in πi;
      find the set {SC1, SC2, ..., SCn} of maximal strongly-connected
        regions in D restricted to πi;
      if there are vector statements among the SCi then
        expand scalars indicated by deletable edges;
      codegen(πi, k, D restricted to πi);
    end
  end try_recurrence_breaking

40 Scalar Expansion: Drawbacks
Expansion increases memory requirements. Solutions:
—Expand in a single loop
—Strip mine the loop before expansion (see the sketch below)
—Forward substitution:

  DO I = 1, N
    T = A(I) + A(I+1)
    A(I) = T + B(I)
  ENDDO

becomes

  DO I = 1, N
    A(I) = A(I) + A(I+1) + B(I)
  ENDDO
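
A sketch (my own, not the slides') of the strip-mining option applied to the loop above: T is expanded only to the strip length, so the extra storage is bounded at 64 elements while each strip still vectorizes. It assumes declarations REAL T$(64) and INTEGER II, IU; the strip length 64 is an arbitrary choice, and the T$ name follows the slides' convention (a compiler-internal name, not standard Fortran).

  DO II = 1, N, 64
    IU = MIN(II+63, N)
    T$(1:IU-II+1) = A(II:IU) + A(II+1:IU+1)   ! expansion local to this strip
    A(II:IU) = T$(1:IU-II+1) + B(II:IU)       ! sequential strip order preserves the antidependence
  ENDDO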

41 Scalar Renaming

  DO I = 1, 100
S1  T = A(I) + B(I)
S2  C(I) = T + T
S3  T = D(I) - B(I)
S4  A(I+1) = T * T
  ENDDO

Renaming scalar T:

  DO I = 1, 100
S1  T1 = A(I) + B(I)
S2  C(I) = T1 + T1
S3  T2 = D(I) - B(I)
S4  A(I+1) = T2 * T2
  ENDDO

42 Scalar Renaming
This will lead to:

S3  T2$(1:100) = D(1:100) - B(1:100)
S4  A(2:101) = T2$(1:100) * T2$(1:100)
S1  T1$(1:100) = A(1:100) + B(1:100)
S2  C(1:100) = T1$(1:100) + T1$(1:100)
    T = T2$(100)

43 Scalar Renaming
The renaming algorithm partitions all definitions and uses into equivalence classes, each of which can occupy a different memory location. Using the definition-use graph:
—pick a definition;
—add all uses that the definition reaches to the equivalence class;
—add all definitions that reach any of those uses;
—... until a fixed point is reached.
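
The fixed point described above simply computes the connected components of the definition-use graph. Here is a minimal runnable sketch (mine, not from the slides) using union-find; the node numbering and edge list are hypothetical, taken from the slide-41 loop (nodes 1 and 3 are the definitions in S1 and S3, nodes 2 and 4 the uses in S2 and S4):

  PROGRAM PARTITION
    IMPLICIT NONE
    INTEGER, PARAMETER :: NN = 4, NE = 2
    INTEGER :: PARENT(NN), EDGES(2,NE), I, E
    EDGES = RESHAPE((/ 1, 2, 3, 4 /), (/ 2, NE /))  ! def-use edges: S1->S2, S3->S4
    PARENT = (/ (I, I = 1, NN) /)                   ! each node starts in its own class
    DO E = 1, NE
      PARENT(FINDROOT(EDGES(1,E))) = FINDROOT(EDGES(2,E))  ! merge def's class with use's
    ENDDO
    DO I = 1, NN
      PRINT *, 'node', I, 'is in class', FINDROOT(I)  ! {1,2} -> T1, {3,4} -> T2
    ENDDO
  CONTAINS
    INTEGER FUNCTION FINDROOT(X)   ! follow parent links to the class representative
      INTEGER, INTENT(IN) :: X
      INTEGER :: R
      R = X
      DO WHILE (PARENT(R) /= R)
        R = PARENT(R)
      ENDDO
      FINDROOT = R
    END FUNCTION FINDROOT
  END PROGRAM PARTITION

Each resulting class then receives its own scalar name (T1, T2), as on slide 41.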

44 Scalar Renaming: Profitability
—Scalar renaming will break recurrences in which a loop-independent output dependence or antidependence is a critical element of a cycle.
—Scalar renaming is relatively cheap to apply.
—It is usually done by compilers when calculating live ranges for register allocation.

45 Array Renaming

  DO I = 1, N
S1  A(I) = A(I-1) + X
S2  Y(I) = A(I) + Z
S3  A(I) = B(I) + C
  ENDDO

Dependences:
—S1 δ∞ S2 (loop-independent true dependence)
—S2 δ∞^(-1) S3 (loop-independent antidependence)
—S3 δ1 S1 (loop-carried true dependence)
—S1 δ∞^o S3 (loop-independent output dependence)

Rename A(I) to A$(I):

  DO I = 1, N
S1  A$(I) = A(I-1) + X
S2  Y(I) = A$(I) + Z
S3  A(I) = B(I) + C
  ENDDO

Dependences remaining: S1 δ∞ S2 and S3 δ1 S1.
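
With the antidependence and the output dependence gone, no cycle remains, so codegen's own tools (distribution and statement reordering) can vectorize the loop. A sketch of the result, my own derivation rather than the slides':

S3  A(1:N) = B(1:N) + C
S1  A$(1:N) = A(0:N-1) + X
S2  Y(1:N) = A$(1:N) + Z

S3 runs first so that S1's read of A(I-1) sees the value S3 wrote on the previous iteration of the original loop; A(0) keeps its value from before the loop, just as in the original.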

46 Array Renaming: Profitability
Examining the dependence graph and determining the minimum set of critical edges whose removal breaks a recurrence is NP-complete!
Solution: determine the edges that are removed by array renaming and analyze the effects on the dependence graph.
procedure array_partition:
—assumes no control flow in the loop body
—identifies collections of references to arrays which refer to the same value
—identifies deletable output dependences and antidependences
Use this procedure to generate code:
—minimize the amount of copying back to the "original" array at the beginning and the end of the loop

