Enhancing Fine- Grained Parallelism Chapter 5 of Allen and Kennedy Mirit & Haim.

Enhancing Fine- Grained Parallelism Chapter 5 of Allen and Kennedy Mirit & Haim

Overview Motivation Loop Interchange Scalar Expansion Scalar and Array Renaming

Motivational Example Matrix Multiplication: DO J = 1, M DO I = 1, N T = 0.0 DO K = 1,L T = T + A(I,K) * B(K,J) ENDDO C(I,J) = T ENDDO Codegen: tries to find parallelism using transformations of loop distribution and statement reordering codegen will not uncover any vector operations because of the use of the temporary scalar T which creates several dependences carried by each of the loops.

Motivational Example I Replacing the scalar T with the vector T$ will eliminate most of the dependences: DO J = 1, M DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO

Distribution of the I -loop gives us: DO J = 1, M DO I = 1, N T$(I) = 0.0 ENDDO DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO DO I = 1, N C(I,J) = T$(I) ENDDO Motivational Example II T$(1:N) = 0.0C(1:N,J) = T$(1:N)

Motivational Example III Interchanging I and K loops, we get: DO J = 1, M T$(1:N) = 0.0 DO I = 1,N DO K = 1,L DO K = 1,L DO I = 1,N T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(1:N,J) = T$(1:N) ENDDO

Motivational Example IV Finally, Vectorizing the I -Loop: DO J = 1, M T$(1:N) = 0.0 DO K = 1,L DO I = 1,N T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J) ENDDO C(1:N,J) = T$(1:N) ENDDO

Motivational Example - Summary We used two new transformations: Loop interchange (replacing I,K loops) Scalar Expansion (T => T$)

Loop Interchange Loop Interchange - Switching the nesting order of two loops DO I = 1, N DO J = 1, M S A(I,J+1) = A(I,J) + B ENDDO Applying loop interchange: DO J = 1, M DO I = 1, N S A(I,J+1) = A(I,J) + B ENDDO leads to: DO J = 1, M S A(1:N,J+1) = A(1:N,J) + B ENDDO Impossible to vectorize since there is a cyclic dependence from S to itself, carried by the J-Loop.

Loop Interchange: Safety Safety: not all loop interchanges are safe. S(I,J) – the instance of statement S with the parameters (iteration numbers) I and J DO J = 1, M DO I = 1, N S: A(I,J+1) = A(I+1,J) ENDDO S(1,1) S(1,3) S(1,2) S(1,4) S(2,1) S(2,3) S(2,2) S(2,4) S(3,1) S(3,3) S(3,2) S(3,4) S(4,1) S(4,3) S(4,2) S4,4) J = 1 J = 2 J = 3 J = 4 I = 1I = 2I = 3I = 4 A(2,2) is assigned at S(2,1) and used at S(1,2).

Loop Interchange: Safety A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

Distance Vector - Reminder Consider a dependence in a loop nest of n loops Statement S 1 on iteration i is the source of the dependence Statement S 2 on iteration j is the sink of the dependence The distance vector is a vector of length n d(i,j) such that: d(i,j) k = j k – i k

Direction Vector - Reminder Suppose that there is a dependence from statement S 1 on iteration i of a loop nest of n loops and statement S 2 on iteration j, then the dependence direction vector is D(i,j) is defined as a vector of length n such that “ 0 “=” if d(i,j) k = 0 “>” if d(i,j) k < 0 D(i,j) k =

Direction Vector - Reminder DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO (I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=) (I+1,J+1,K) - (I,J+1,K+1) = (1,0,-1) => ( )

Direction Vector - Reminder We say that the I’th loop carries a dependence with direction-vector d, if the leftmost non-”=“ in d is in the I’th place. (<,<,=) ( ) It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.

Loop Interchange: Safety Theorem 5.1 Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops is determined by applying the same permutation to the elements of D(i,j). DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO (I+1,J+1,K) - (I,J,K) = (1,1,0) => ( (<,=,<)

Loop Interchange: Safety The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row. DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO (1,1,0) => (<,<,=) (1,0,-1) => ( ) < < =

Loop Interchange: Safety Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non- "=" direction in any row. i j k < < = j i k < < = = j k i < = < = > <

Loop Interchange : profitability In general, our goal is to enhance vectorization by increasing the number of innermost loops that do not carry any dependence: DO I = 1, N DO J = 1, M S A(I,J+1) = A(I,J) + A(I-1,J) + B ENDDO Applying loop interchange: DO J = 1, M DO I = 1, N S A(I,J+1) = A(I,J) + A(I-1,J) + B ENDDO leads to: DO J = 1, M S A(1:N,J+1) = A(1:N,J) + A(0:N-1,J) + B ENDDO I J = < < J I < = <

Loop Interchange: profitability Profitability depends on architecture DO I = 1, N DO J = 1, M DO K = 1, L S A(I+1,J+1,K) = A(I,J,K) + B ENDDO For SIMD machines with large number of FU’s: DO I = 1, N S A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B ENDDO i j k < < =

Loop Interchange: profitability Since Fortran stores in column-major order we would like to vectorize the I -loop Thus, after Interchanging the I -loop to be the inner loop, we can transform to: DO J = 1, M DO K = 1, L S A(2:N+1,J+1,K) = A(1:N,J,K) + B ENDDO j k i < = < A(I,1:M,1:L)

Loop Interchange: profitability MIMD machines with vector execution units want to cut down synchronization costs, therefore, shifting the K-loop to outermost level will get: PARALLEL DO K = 1, L DO J = 1, M A(2:N+1,J+1,K) = A(1:N,J,K) + B ENDDO END PARALLEL DO k j i = < <

Loop Shifting Motivation: Identify loops which can be moved and move them to “optimal” nesting levels. Trying all the permutations is not practical. Theorem 5.3 In a perfect loop nest, if loops at level i, i+1,...,i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position. < = < = = = = < < = = = = = < < = = = = = < < = = = < = = = < = = = = = <

Loop Shifting Heuristic – move the loops that carry no dependence to the innermost position: DO I = 1, N DO J = 1, N DO K = 1, N A(I,J,k) = A(I,J,k+1) ENDDO DO K= 1, N DO I = 1, N DO J = 1, N A(I,J,k) = A(I,J,k+1) ENDDO I J K = = < K I J < = = A(1:N,1:N,k) = A(1:N,1:N,k+1)

Loop Shifting The heuristic fail in: DO I = 1, N DO J = 1, M A(I+1,J+1) = A(I,J) + A(I+1,J) ENDDO But we still want to get: DO J = 1, M A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J) ENDDO I J < = < J I < < =

Loop Shifting A recursive heuristec: If the outermost loop carries no dependence, then let the outermost loop that carries a dependence to be the outermost loop. (the former heuristic.) If the outermost loop carries a dependence, find the outermost loop L that:  L can be safely shifted to the outermost position  L carries a dependence d whose direction vector contains an "=" in every position but the L th. And move L to the outermost position.

Loop Shifting < < = = = = < < = < = = = = = < < < = = = = < < < = = = = = = < < = = < < = = = = = = < < = = < = < < = < = = = = = < =

Scalar Expansion DO I = 1, N T = A(I) A(I) = B(I) B(I) = T ENDDO Scalar Expansion : DO I = 1, N T$(I) = A(I) A(I) = B(I) B(I) = T$(I) ENDDO T = T$(N) leads to: T$(1:N) = A(1:N) A(1:N) = B(1:N) B(1:N) = T$(1:N) T = T$(N)

Scalar Expansion However, not always profitable. Consider: DO I = 1, N T = T + A(I) + A(I-1) A(I) = T ENDDO Scalar expansion gives us: T$(0) = T DO I = 1, N T$(I) = T$(I-1) + A(I) + A(I-1) A(I) = T$(I) ENDDO T = T$(N)

Scalar Expansion – How to Naïve approach: Expand all scalars, vectorize, shrink all unnecessary expansions. Dependences due to reuse of memory location vs. reuse of values: Dependences due to reuse of values must be preserved Dependences due to reuse of memory location can be deleted by expansion We will try to find the removable dependence

Reminder - SSA X= =X

Covering definition A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X. DO I = 1, 100 S 1 T = X(I) S 2 Y(I) = T ENDDO DO I = 1, 100 IF (A(I).GT. 0) THEN S 1 T = X(I) S 2 Y(I) = T ENDIF ENDDO T=X(I) Y(I)=T

Covering definition A covering definition does not always exist: DO I = 1, 100 IF (A(I).GT. 0) THEN S 2 T = X(I) ENDIF S 3 Y(I) = T ENDDO In SSA terms: There does not exist a covering definition for a variable T if the edge out of the first assignment to T goes to a  - function later in the loop which merges its values with those for another control flow path through the loop T=X(I) Y(I)=T

Covering definition Expanded definition of covering definition: A collection C of covering definitions for T in a loop in which every control path has a covering definition. DO I = 1, 100 IF (A(I).GT. 0) THEN S1 T = X(I) ELSE S2 T = -X(I) ENDIF S3 Y(I) = T ENDDO T=X(I) T=-X(I) Y(I)=T

Find Covering definitions DO I = 1, 100 IF (A(I).GT. 0) THEN S 1 T = X(I) ENDIF S 2 Y(I) = T ENDDO To form a collection of covering definitions, we can insert dummy assignments: DO I = 1, 100 IF (A(I).GT. 0) THEN S 1 T = X(I) ELSE S 2 T = T ENDIF S 3 Y(I) = T ENDDO

Algorithm to insert dummy assignments and compute the collection, C, of covering definitions: Central idea: Look for parallel paths to a  - function following the first assignment, and add dummy assignments. Find Covering definition T=X(I) Y(I)=T Detailed algorithm: Let S 0 be the  -function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack. Let S 1 be the first definition of T in the loop. Add S 1 to C. If the SSA successor of S 1 is a  -function S 2 that is not equal to S 0, then push S 2 onto the stack and mark it; While the stack is non-empty, pop the  -function S from the stack; add all SSA predecessors that are not  -functions to C; if there is an SSA edge from S 0 into S, then insert the assignment T = T as the last statement along that edge and add it to C; for each unmarked  -function S 3 (other than S 0 ) that is an SSA predecessor of S, mark S 3 and push it onto the stack; for each unmarked  -function S 4 that can be reached from S by a single SSA edge and which is not predominated by S in the control flow graph mark S 4 and push it onto the stack. T=X(I) Y(I)=T T=T DO I = 1, 100 IF (A(I).GT. 0) THEN T = X(I) ELSE T = T ENDIF Y(I) = T ENDDO

Scalar Expansion Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop: Create an array T$ of appropriate length For each S in the covering definition collection C, replace the T on the left-hand side by T$(I). For every other definition of T and every use of T in the loop body reachable by SSA edges that do not pass through S 0, the  -function at the beginning of the loop, replace T by T$(I). For every use prior to a covering definition (direct successors of S 0 in the SSA graph), replace T by T$(I-1). If S 0 is not null, then insert T$(0) = T before the loop. If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, were U is the loop upper bound. T$(0) = T DO I = 1, 100 IF (A(I).GT. 0) THEN S 1 T$(I) = X(I) ELSE T$(I) = T$(I-1) ENDIF S 2 Y(I) = T$(I) ENDDO

Deletable dependence: Given the covering Definitions, we can identify deletable dependence: Backward carried antidependences:  Writing to T[i] after reading from T[i-1] at the previous iteration Backward carried output dependences  Writing to T[i] after writing to T[i-1] at the previous iteration Forward carried output dependences  Writing to T[i] and writing to T[i+1] at the next iteration Loop-independent antidependences into the covering definition  Writing to T[i] after reading from T[i-1] at the same iteration Loop-carried true dependences from a covering definition  reading from T[i] after writing to T[i-1] at the previous iteration

Scalar Expansion: Drawbacks Expansion increases memory requirements Solutions: Expand in a single loop Strip mine loop before expansion: Forward substitution: DO I = 1, N T = A(I) + A(I+1) A(I) = T + B(I) ENDDO DO I = 1, N,64 DO J = 0,63 T = A(I+J) +A(I+J+1) A(I) = T + B(I+J) ENDDO DO I = 1, N A(I) = A(I) + A(I+1) + B(I) ENDDO DO I = 1, N T = A(I) + A(I+1) A(I) = T + B(I) ENDDO

Scalar Renaming S 3 T2$(1:100) = D(1:100) - B(1:100) S 4 A(2:101) = T2$(1:100) * T2$(1:100) S 1 T1$(1:100) = A(1:100) + B(1:100) S 2 C(1:100) = T1$(1:100) + T1$(1:100) T = T2$(100) DO I = 1, 100 S1 T1 = A(I) + B(I) S2 C(I) = T1 + T1 S3 T2 = D(I) - B(I) S4 A(I+1) = T2 * T2 ENDDO DO I = 1, 100 S1 T = A(I) + B(I) S2 C(I) = T + T S3 T = D(I) - B(I) S4 A(I+1) = T * T ENDDO

Scalar Renaming Renaming algorithm partitions all definitions and uses into equivalent classes using the definition-use graph:  Pick definition  Add all uses that the definition reaches to the equivalence class  Add all definitions that reach any of the uses… ..until fixed point is reached

Scalar Renaming: Profitability Scalar renaming will remove a loop- independent output dependence or antidependence. (Writing to the same scalar at the same iteration or writing to the scalar after reading from it.) Relatively cheap to use scalar renaming. Usually done by compilers when calculating live ranges for register allocation.

Array Renaming DO I = 1, N S 1 A(I) = A(I-1) + X S 2 Y(I) = A(I) + Z S 3 A(I) = B(I) + C ENDDO Rename A(I) to A$(I): DO I = 1, N S 1 A$(I) = A(I-1) + X S 2 Y(I) = A$(I) + Z S 3 A(I) = B(I) + C ENDDO A(1:N) =B(1:N) +C A$(1:N)=A(0:N-1)+X Y(1:N) =A$(1:N) +Z

Array Renaming: Profitability determine edges that are removed by array renaming and analyze effects on dependence graph procedure array_partition:  Assumes no control flow in loop body  identifies collections of references to arrays which refer to the same value  identifies deletable output dependences and antidependences Use this procedure to generate code Minimize amount of copying back to the “original” array at the beginning and the end

Enhancing Fine- Grained Parallelism Chapter 5 of Allen and Kennedy Mirit & Haim.

Similar presentations

Presentation on theme: "Enhancing Fine- Grained Parallelism Chapter 5 of Allen and Kennedy Mirit & Haim."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Enhancing Fine- Grained Parallelism Chapter 5 of Allen and Kennedy Mirit & Haim.

Similar presentations

Presentation on theme: "Enhancing Fine- Grained Parallelism Chapter 5 of Allen and Kennedy Mirit & Haim."— Presentation transcript:

Similar presentations

About project

Feedback