# Improving Register Usage: Chapter 8, Section 8.5 to End. Omer Yehezkely.

Agenda  Last Lecture at a glance  Loop Interchange for Register Reuse  Loop Fusion for Register Reuse  Putting it All Together  Complex Loop Nests  Summary

Last lecture at a glance (1)

- Assumption 1: Most compilers can handle register allocation for scalars (using a graph-coloring algorithm). However, they do not know how to handle vectors (array elements).
- Assumption 2: We are dealing with RISC processors. All CPU operations need their data in registers (except for load and store operations).
- Assumption 3: Memory hierarchy. Accessing a register is much faster than a cache hit, which is much faster than a cache miss that goes to main memory, which is much faster than accessing virtual memory (the swap file).

Last lecture at a glance (2) Therefore our strategy will be: Do some transformation that will “expose” vector entries as scalars, and then let the good old compiler do the register allocation. We will benefit from avoiding unnecessary Load / Store operations.

Last lecture at a glance (3)

Example: (Scalar Replacement)

Original code:

    DO I = 1, N
      DO J = 1, M
        A(I) = A(I) + B(J)
      ENDDO
    ENDDO

After scalar replacement:

    DO I = 1, N
      T = A(I)
      DO J = 1, M
        T = T + B(J)
      ENDDO
      A(I) = T
    ENDDO
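A minimal Python sketch of the same example, used here only to make the saving measurable (the slides themselves use Fortran). The `CountingArray` wrapper is a hypothetical helper, not a library class; it counts every element read as a load and every element write as a store. Indices are 0-based, unlike the Fortran code.

```python
class CountingArray:
    """List wrapper that counts element reads (loads) and writes (stores)."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0
        self.stores = 0

    def __getitem__(self, i):
        self.loads += 1
        return self.values[i]

    def __setitem__(self, i, v):
        self.stores += 1
        self.values[i] = v


def original(a, b, n, m):
    # A(I) is loaded and stored on every inner iteration.
    for i in range(n):
        for j in range(m):
            a[i] = a[i] + b[j]


def scalar_replaced(a, b, n, m):
    # A(I) is held in the scalar t for the whole inner loop.
    for i in range(n):
        t = a[i]
        for j in range(m):
            t = t + b[j]
        a[i] = t


def run(kernel, n=4, m=5):
    a = CountingArray(range(n))
    b = CountingArray(range(m))
    kernel(a, b, n, m)
    return a.values, a.loads, a.stores


orig_vals, orig_loads, orig_stores = run(original)
new_vals, new_loads, new_stores = run(scalar_replaced)
```

With N = 4 and M = 5 the original touches `A` N*M times for loads and stores each, while the scalar-replaced version touches it only N times each; the loads of `B` are unchanged.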

Last lecture at a glance (4)

Dependences to consider:

- True dependence: A(I) = … followed by … = A(I)
- Output dependence: A(I) = … followed by A(I) = …
- Antidependence: … = A(I) followed by A(I) = …
- Input dependence: … = A(I) followed by … = A(I)

Last lecture at a glance (5)

We should also consider loop-carried and loop-independent dependences. In general, the more dependences the merrier, because more dependences usually mean more opportunities for register reuse. We will use the dependences to decide if and how to “expose” the vectors as scalars.

Last lecture at a glance (6)

We saw:

- Scalar Replacement (see the first example): this is the actual “exposure”.
- Unroll and Jam: unrolling loops in order to bring dependences that are carried by an outer loop into the inner loop. This can benefit register reuse if we apply Scalar Replacement afterwards.

Last lecture at a glance (7)

Example: (Unroll and Jam)

Original Code

    DO I = 1, N*2
      DO J = 1, M
        A(I) = A(I) + B(J)
      ENDDO
    ENDDO

Unroll and Jam

    DO I = 1, N*2, 2
      DO J = 1, M
        A(I) = A(I) + B(J)
        A(I+1) = A(I+1) + B(J)
      ENDDO
    ENDDO

Scalar Replacement

    DO I = 1, N*2, 2
      s0 = A(I)
      s1 = A(I+1)
      DO J = 1, M
        t = B(J)
        s0 = s0 + t
        s1 = s1 + t
      ENDDO
      A(I) = s0
      A(I+1) = s1
    ENDDO

Agenda  Last Lecture at a glance  Loop Interchange for Register Reuse  Loop Fusion for Register Reuse  Putting it All Together  Complex Loop Nests  Summary

Loop Interchange (1)

Loop nesting is not always optimal with regard to register reuse. For example, on CPUs with no vector engines, the following code (matrix initialization):

    DO I = 2, N
      A(1:M, I) = A(1:M, I-1)
    ENDDO

will be converted into:

    DO I = 2, N
      DO J = 1, M
        A(J, I) = A(J, I-1)
      ENDDO
    ENDDO

Loop Interchange (2)

Which will be implemented in the following way:

    DO I = 2, N
      DO J = 1, M
        R1 = A(J, I-1)
        A(J, I) = R1
      ENDDO
    ENDDO

This is not too clever, since it performs (N-1)*M Load operations and (N-1)*M Store operations. If we change the order of the loops we can get a better implementation.

Loop Interchange (3)

Original Code

    DO I = 2, N
      DO J = 1, M
        A(J, I) = A(J, I-1)
      ENDDO
    ENDDO

Loop Interchange

    DO J = 1, M
      DO I = 2, N
        A(J, I) = A(J, I-1)
      ENDDO
    ENDDO

Scalar Replacement

    DO J = 1, M
      R1 = A(J, 1)
      DO I = 2, N
        A(J, I) = R1
      ENDDO
    ENDDO

This implementation still requires (N-1)*M Store operations (we can’t escape that), but it requires only M Load operations, which can make the running time considerably shorter.
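The load counts quoted above can be checked with a small Python simulation. This is only a sketch: the function names and the explicit `loads` counter are illustrative assumptions, and `a[j][i]` stands in for the Fortran `A(J, I)` with index 0 left unused so the subscripts line up.

```python
def make_matrix(m, n):
    # a[j][i] mirrors Fortran A(J, I); row 0 and column 0 are unused padding.
    return [[float(j * 10 + i) for i in range(n + 1)] for j in range(m + 1)]


def original(a, m, n):
    loads = 0
    for i in range(2, n + 1):
        for j in range(1, m + 1):
            r1 = a[j][i - 1]  # one load per inner iteration
            loads += 1
            a[j][i] = r1
    return loads


def interchanged_scalar_replaced(a, m, n):
    loads = 0
    for j in range(1, m + 1):
        r1 = a[j][1]  # one load per row, reused for the whole I loop
        loads += 1
        for i in range(2, n + 1):
            a[j][i] = r1
    return loads


M, N = 3, 5
a1, a2 = make_matrix(M, N), make_matrix(M, N)
loads_original = original(a1, M, N)
loads_interchanged = interchanged_scalar_replaced(a2, M, N)
```

Both versions leave the matrix in the same state; only the number of loads differs, (N-1)*M against M.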

Loop Interchange (4)

Considerations for Loop Interchange

- The basic idea is to get the loop that carries the most dependences to the innermost position.
- Register reuse across an outer loop usually cannot be achieved because of limited register resources.
- We use the conventional direction matrix for the loop nest.

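Loop Interchange (5)

Example:

    DO J = 1, N
      DO K = 1, N
        DO I = 1, 256
          A(I, J, K) = A(I, J-1, K) + A(I, J-1, K-1) + A(I, J, K-1)
        ENDDO
      ENDDO
    ENDDO

There are 3 true dependences, which result in the following direction matrix (columns in loop order J, K, I; reconstructed from the subscripts, since the slide's figure was lost):

    A(I, J-1, K):     <  =  =
    A(I, J-1, K-1):   <  <  =
    A(I, J, K-1):     =  <  =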

Loop Interchange (6)

Example (cont.): If we select the J loop to be the innermost we get:

    DO K = 1, N
      DO I = 1, 256
        DO J = 1, N
          A(I, J, K) = A(I, J-1, K) + &
                       A(I, J-1, K-1) + A(I, J, K-1)
        ENDDO
      ENDDO
    ENDDO

After scalar replacement:

    DO K = 1, N
      DO I = 1, 256
        R1 = A(I, 0, K)
        DO J = 1, N
          R1 = R1 + A(I, J-1, K-1) + &
               A(I, J, K-1)
          A(I, J, K) = R1
        ENDDO
      ENDDO
    ENDDO

We saved a Load operation in each iteration. It is possible to interchange the 2 outer loops and get further optimization.

Loop Interchange (7)

Loop Interchange Algorithm:

1. Form the direction matrix for the loop nest and use it to identify the loops, other than the scalarization loop, that can legally be moved to the innermost position.
2. For each such loop L, let count(L) be the number of rows of the direction matrix that have “<” in the position corresponding to L and “=” in every other position.
3. Pick the loop L that maximizes the product of count(L) and the iteration count of loop L.

Some assumptions need to be made when the bounds of a loop are unknown at compile time. Loop interchange should also be weighed against cache efficiency (next chapter).

Loop Interchange (8)

Example: [Figure: direction matrix for a nest of four loops with 100, 65, 150, and 1,000 iterations.]

- 100 * 2 = 200
- 65 * 3 = 195
- 150 * 1 = 150
- 1,000 * 0 = 0

The outermost loop (100 * 2 = 200) should become the innermost loop.
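Steps 2 and 3 of the selection rule can be sketched in a few lines of Python. The direction matrix below is hypothetical: the figure on the slide was lost, so the rows were chosen only so that count(L) comes out as 2, 3, 1, and 0, matching the products listed above. The legality test of step 1 is not implemented; the `candidates` argument is assumed to already contain only the loops that may legally move inward.

```python
def pick_innermost(direction_matrix, iteration_counts, candidates):
    """Return (loop index, score) maximizing count(L) * iterations(L)."""
    best_loop, best_score = None, -1
    for loop in candidates:
        # count(L): rows with "<" in column L and "=" everywhere else.
        count = sum(
            1
            for row in direction_matrix
            if row[loop] == "<"
            and all(d == "=" for k, d in enumerate(row) if k != loop)
        )
        score = count * iteration_counts[loop]
        if score > best_score:
            best_loop, best_score = loop, score
    return best_loop, best_score


# Hypothetical direction matrix (one row per dependence, outermost loop first).
matrix = [
    ("<", "=", "=", "="),
    ("<", "=", "=", "="),
    ("=", "<", "=", "="),
    ("=", "<", "=", "="),
    ("=", "<", "=", "="),
    ("=", "=", "<", "="),
]
iterations = [100, 65, 150, 1000]
choice = pick_innermost(matrix, iterations, candidates=[0, 1, 2, 3])
```

Here `choice` names loop 0 (the outermost loop) with a score of 200, which is the conclusion stated on the slide.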

Agenda  Last Lecture at a glance  Loop Interchange for Register Reuse  Loop Fusion for Register Reuse  Putting it All Together  Complex Loop Nests  Summary

Loop Fusion (1)

Example: On CPUs with no vector engines the following code:

    A(1:N) = C(1:N) + D(1:N)
    B(1:N) = C(1:N) - D(1:N)

will be transformed into:

    DO I = 1, N
      A(I) = C(I) + D(I)
    ENDDO
    DO I = 1, N
      B(I) = C(I) - D(I)
    ENDDO

Loop Fusion (2)

Using Loop Fusion (chapter 6) we get:

    DO I = 1, N
      A(I) = C(I) + D(I)
      B(I) = C(I) - D(I)
    ENDDO

Using Scalar Replacement we can save the time spent fetching C(I) and D(I) a second time:

    DO I = 1, N
      R1 = C(I)
      R2 = D(I)
      A(I) = R1 + R2
      B(I) = R1 - R2
    ENDDO

Loop Fusion (3)

Profitable Loop Fusion for Register Reuse

Just because a loop fusion is safe does not mean it is profitable. There are 2 cases where the fusion may be profitable:

- The fusion results in a loop-independent dependence (as we just saw).
- The fusion results in a forward loop-carried dependence.

Loop Fusion (4)

Example: (forward loop-carried dependence)

    DO J = 1, N
      DO I = 1, M
        A(I,J) = C(I,J) + D(I,J)
      ENDDO
      DO I = 1, M
        B(I,J) = A(I,J-1) - E(I,J)
      ENDDO
    ENDDO

Fusion:

    DO J = 1, N
      DO I = 1, M
        A(I,J) = C(I,J) + D(I,J)
        B(I,J) = A(I,J-1) - E(I,J)
      ENDDO
    ENDDO

Loop Fusion (5)

Fusion:

    DO J = 1, N
      DO I = 1, M
        A(I,J) = C(I,J) + D(I,J)
        B(I,J) = A(I,J-1) - E(I,J)
      ENDDO
    ENDDO

Loop Interchange:

    DO I = 1, M
      DO J = 1, N
        A(I,J) = C(I,J) + D(I,J)
        B(I,J) = A(I,J-1) - E(I,J)
      ENDDO
    ENDDO

Statement Order Reversing:

    DO I = 1, M
      DO J = 1, N
        B(I,J) = A(I,J-1) - E(I,J)
        A(I,J) = C(I,J) + D(I,J)
      ENDDO
    ENDDO

Scalar Replacement:

    DO I = 1, M
      R1 = A(I, 0)
      DO J = 1, N
        B(I,J) = R1 - E(I,J)
        R1 = C(I,J) + D(I,J)
        A(I,J) = R1
      ENDDO
    ENDDO

Loop Fusion (6)

Loop Alignment for Fusion

Reminder: Blocking dependences cause problems for loop fusion.

    DO I = 1, M
      DO J = 1, N
        A(J,I) = B(J,I) + 1.0
      ENDDO
      DO J = 1, N
        C(J,I) = A(J+1,I) + 2.0
      ENDDO
    ENDDO

We cannot simply fuse the two inner loops, because the fused loop would read A(J+1,I) one iteration before it is written, introducing a backward-carried antidependence.

Loop Fusion (7)

We can overcome this problem by aligning the loops:

    DO I = 1, M
      DO J = 0, N-1
        A(J+1,I) = B(J+1,I) + 1.0
      ENDDO
      DO J = 1, N
        C(J,I) = A(J+1,I) + 2.0
      ENDDO
    ENDDO

We can now fuse the two loops on their common iteration range, peeling a single iteration from the beginning of the first loop and one iteration from the end of the second loop.

Loop Fusion (8)

Hence we get:

    DO I = 1, M
      A(1,I) = B(1,I) + 1.0
      DO J = 1, N-1
        A(J+1,I) = B(J+1,I) + 1.0
        C(J,I) = A(J+1,I) + 2.0
      ENDDO
      C(N,I) = A(N+1,I) + 2.0
    ENDDO

Scalar Replacement

    DO I = 1, M
      A(1,I) = B(1,I) + 1.0
      DO J = 1, N-1
        R1 = B(J+1,I) + 1.0
        A(J+1,I) = R1
        C(J,I) = R1 + 2.0
      ENDDO
      C(N,I) = A(N+1,I) + 2.0
    ENDDO
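A small Python simulation can confirm that the aligned, fused, and scalar-replaced loop computes the same values as the two separate loops, and that fusing without alignment does not. This is a sketch under assumptions: the array contents are arbitrary test data, `a[j][i]` models the Fortran `A(J,I)`, and index 0 is padding so the subscripts match.

```python
def make_arrays(m, n):
    # Row n+1 of A exists because the second loop reads A(N+1, I).
    b = [[float(j + 10 * i) for i in range(m + 1)] for j in range(n + 2)]
    a = [[-1.0] * (m + 1) for _ in range(n + 2)]
    c = [[0.0] * (m + 1) for _ in range(n + 2)]
    return a, b, c


def separate_loops(m, n):
    a, b, c = make_arrays(m, n)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[j][i] = b[j][i] + 1.0
        for j in range(1, n + 1):
            c[j][i] = a[j + 1][i] + 2.0
    return a, c


def naive_fused(m, n):
    # Fusion without alignment: A(J+1, I) is read before it is written.
    a, b, c = make_arrays(m, n)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[j][i] = b[j][i] + 1.0
            c[j][i] = a[j + 1][i] + 2.0
    return a, c


def aligned_fused(m, n):
    # Peeled first iteration, fused common range, peeled last iteration.
    a, b, c = make_arrays(m, n)
    for i in range(1, m + 1):
        a[1][i] = b[1][i] + 1.0
        for j in range(1, n):
            r1 = b[j + 1][i] + 1.0
            a[j + 1][i] = r1
            c[j][i] = r1 + 2.0
        c[n][i] = a[n + 1][i] + 2.0
    return a, c


reference = separate_loops(3, 6)
naive = naive_fused(3, 6)
aligned = aligned_fused(3, 6)
```

The naive version still fills `A` correctly but produces wrong values in `C`, which is exactly the effect of the backward-carried antidependence.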

Loop Fusion (9)

Definition: Let δ be a dependence between loops. The Alignment Threshold of δ is defined as follows:

- If δ is loop independent after merging, threshold(δ) = 0.
- If δ is forward carried after merging, threshold(δ) is the negative of the resulting dependence threshold.
- If δ is fusion preventing, threshold(δ) is the threshold of the merged dependence.

Aligning by the largest threshold allows fusion.

Loop Fusion (10)

Example:

    DO I = 1, N
      A(I) = B(I) + 1.0
    ENDDO
    DO I = 1, N
      C(I) = A(I+1) + A(I-1)
    ENDDO

We have 2 dependences:

1. Forward carried with a threshold of 1 because of the reference A(I-1), giving an alignment threshold of -1.
2. Backward carried with a threshold of 1 because of the reference A(I+1), giving an alignment threshold of +1.

Loop Fusion (11)

Since (+1) > (-1), we should align by the larger alignment threshold, (+1). And so we get:

    DO I = 0, N-1
      A(I+1) = B(I+1) + 1.0
    ENDDO
    DO I = 1, N
      C(I) = A(I+1) + A(I-1)
    ENDDO

From here we can proceed to fuse the loops and then “Scalar Replace” A(I+1).
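For the simple case in this example, where the first loop writes `A(I + w)` and the second reads `A(I + r)` with constant offsets, the alignment threshold can be computed directly. The Python sketch below is an illustration under that assumption only; the function names are hypothetical, and general subscripts would need a real dependence test.

```python
def classify(write_offset, read_offset):
    """Classify the dependence that naive merging would create.

    The first loop writes A(I + write_offset); the second reads
    A(I + read_offset). Returns (kind, alignment threshold).
    """
    # Iterations between the write and the read of the same element.
    distance = write_offset - read_offset
    if distance == 0:
        return "loop-independent", 0
    if distance > 0:
        # Written first, read `distance` iterations later.
        return "forward-carried", -distance
    # Read before it is written: fusion preventing.
    return "fusion-preventing", -distance


def alignment_shift(write_offset, read_offsets):
    # Align by the largest alignment threshold.
    return max(classify(write_offset, r)[1] for r in read_offsets)


def aligned_first_loop_bounds(lo, hi, shift):
    # The first loop then runs over lo-shift .. hi-shift and
    # writes A(I + shift) instead of A(I).
    return lo - shift, hi - shift


shift = alignment_shift(0, [+1, -1])          # references A(I+1) and A(I-1)
bounds = aligned_first_loop_bounds(1, 10, shift)  # N = 10
```

For the slide's example this gives thresholds of +1 and -1, a shift of 1, and first-loop bounds of 0 to N-1.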

Loop Fusion (12)

Fusion Mechanics

Assuming we have a collection of aligned loops, how do we fuse them?

1. Sort the lower bounds of the loops into a nondecreasing sequence {L_1, L_2, …, L_n} and sort the upper bounds of the loops into a nondecreasing sequence {H_1, H_2, …, H_n}.
2. Produce a sequence of fusion loops with lower bounds L_1, L_2, …, L_(n-1) and respective upper bounds L_2 - 1, L_3 - 1, …, L_n - 1.
3. Produce the central fused loop with a lower bound of L_n and an upper bound of H_1.
4. Produce a sequence of fusion loops with lower bounds H_1 + 1, H_2 + 1, …, H_(n-1) + 1 and respective upper bounds H_2, H_3, …, H_n.
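The four steps can be sketched in Python as a function that turns a list of loop bounds into the ranges of the generated loops, together with the indices of the original loop bodies that execute in each range. This is an illustrative sketch, not a compiler implementation: the function name is hypothetical, bounds are assumed to be known integers, and empty ranges are simply dropped.

```python
def fusion_ranges(bounds):
    """bounds: list of inclusive (low, high) pairs, one per aligned loop.

    Returns a list of ((low, high), [indices of loops active there]).
    """
    lows = sorted(lo for lo, _ in bounds)      # L_1 <= ... <= L_n
    highs = sorted(hi for _, hi in bounds)     # H_1 <= ... <= H_n
    n = len(bounds)
    if lows[-1] > highs[0]:
        raise ValueError("the loops have no common iteration range")

    pre = [(lows[k], lows[k + 1] - 1) for k in range(n - 1)]     # step 2
    central = [(lows[-1], highs[0])]                             # step 3
    post = [(highs[k] + 1, highs[k + 1]) for k in range(n - 1)]  # step 4

    result = []
    for lo, hi in pre + central + post:
        if lo > hi:
            continue  # two bounds coincided, nothing to generate
        active = [k for k, (l, h) in enumerate(bounds) if l <= lo and hi <= h]
        result.append(((lo, hi), active))
    return result


# The aligned loops of the previous example with N = 10:
# first loop runs 0..9, second loop runs 1..10.
plan = fusion_ranges([(0, 9), (1, 10)])
```

For these two loops the plan is a one-iteration loop for the first body, a central loop over 1..9 containing both bodies, and a one-iteration loop for the second body.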

Loop Fusion (13)

Example: [Figure: the iteration ranges of Loop 1, Loop 2, and Loop 3 after alignment; each color represents a fusion loop.]

Loop Fusion (14)

The Weighted Fusion Problem

The last thing to do is to form the collections of loops to be fused. We need to do it in a profitable manner.

Example

    L1  DO I = 1, 1000
          A(I) = B(I) + X(I)
        ENDDO
    L2  DO I = 1, 1000
          C(I) = A(I) + Y(I)
        ENDDO
    S   Z = FOO(A(1:1000))
    L3  DO I = 1, 500
          A(I) = C(I) + Z
        ENDDO

[Figure: fusion graph over L1, L2, S, and L3 with edge weights 1,000, 500, and 1,000.]

Loop Fusion (15)

Definition

A mixed-directed graph is a graph G = (V, E = E_d ∪ E_u) where (V, E_d) forms a directed graph, (V, E_u) forms an undirected graph, and E_d and E_u are disjoint.

- G is acyclic if (V, E_d) is acyclic.
- w is a successor or predecessor of v if it is such in (V, E_d).
- w is a neighbor of v if it is such in (V, E_u).

Loop Fusion (16)

Problem Definition

Let G be an acyclic mixed-directed graph, W a weight function on E, B a set of bad vertices, and E_b a set of bad edges. The weighted loop fusion problem is the problem of finding vertex sets {V_1, V_2, …, V_n} such that:

- {V_1, V_2, …, V_n} partitions V.
- Each vertex set V_i either contains no bad vertices, or consists of a single bad vertex.
- Given two vertices v and w in V_i, there is no path from v to w (in E_d) that leaves V_i.
- Given v and w in V_i, there is no bad edge between v and w.
- The induced graph on the vertex sets is acyclic.

The target: to maximize the total weight of the edges between vertices in the same vertex set.

Loop Fusion (17)

Unfortunately, the Weighted Fusion Problem is NP-hard, so we have to resort to heuristic algorithms. A fast and simple one is the Fast Greedy algorithm for Weighted Fusion, which was developed by Kennedy.

The Algorithm

1. Initialize all the quantities and compute the initial successor, predecessor, and neighbor sets.
2. Topologically sort the vertices of the directed acyclic graph.

Continued…

Loop Fusion (18)

The Algorithm (continued)

3. Process the vertices in V to compute, for each vertex v, the set pathFrom[v], which contains all vertices that can be reached by a path from v, and the set badPathFrom[v], a subset of pathFrom[v] that contains the vertices that can be reached from v by a path that contains a bad vertex or a bad edge.
4. Invert the sets pathFrom and badPathFrom to produce the sets pathTo[v] and badPathTo[v] for each vertex v in the graph. The set pathTo[v] contains the vertices from which there is a path to v; the set badPathTo[v] contains the vertices from which v can be reached via a bad path.

Continued…

Loop Fusion (19)

5. Insert each of the edges into a priority queue edgeHeap, ordered by weight.
6. While edgeHeap is nonempty, select and remove the heaviest edge (v,w) from it. If w is in badPathFrom[v] then do not fuse; repeat step 6. Otherwise do the following:
   - Collapse v, w, and every vertex on a directed path between them into a single composite vertex.
   - After each collapse, adjust the sets pathFrom, badPathFrom, pathTo, and badPathTo to reflect the new graph. That is, the composite vertex will now be reached from every vertex that reached a vertex in the composite, and it will reach any vertex that is reached by a vertex in the composite.
   - After each vertex collapse, recompute the successor, predecessor, and neighbor sets for the composite vertex, and recompute the weights between the composite vertex and other vertices as appropriate.

The running time of the algorithm is O(EV + V^2).
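The following Python sketch shows the greedy idea on the L1, L2, S, L3 example. It is a simplified illustration and not the fast algorithm described in these steps: instead of incrementally updating pathFrom and badPathFrom it recomputes reachability after every collapse, so it does not achieve the O(EV + V^2) bound. The function name, the edge set, and the weights attached to the edges touching S are assumptions made for the example; bad edges are only checked along directed paths and directly between the two groups being fused.

```python
def greedy_weighted_fusion(vertices, directed, undirected, weight,
                           bad_vertices=(), bad_edges=()):
    """Simplified greedy fusion; returns the fused vertex sets."""
    group = {v: frozenset([v]) for v in vertices}
    bad_vertices = set(bad_vertices)
    bad_edges = {frozenset(e) for e in bad_edges}
    all_edges = list(directed) + list(undirected)

    def reachability():
        # Directed successors between the current composite vertices.
        succ = {}
        for v, w in directed:
            gv, gw = group[v], group[w]
            if gv != gw:
                is_bad = frozenset((v, w)) in bad_edges
                succ.setdefault(gv, []).append((gw, is_bad))
        path_from, bad_path_from = {}, {}
        for start in set(group.values()):
            seen = set()
            stack = [(start, bool(start & bad_vertices))]
            while stack:
                node, tainted = stack.pop()
                for nxt, edge_bad in succ.get(node, []):
                    state = (nxt, tainted or edge_bad
                             or bool(nxt & bad_vertices))
                    if state not in seen:
                        seen.add(state)
                        stack.append(state)
            path_from[start] = {g for g, _ in seen}
            bad_path_from[start] = {g for g, t in seen if t}
        return path_from, bad_path_from

    while True:
        path_from, bad_path_from = reachability()
        # Total weight between each pair of composite vertices.
        totals = {}
        for v, w in all_edges:
            gv, gw = group[v], group[w]
            if gv != gw:
                key = frozenset((gv, gw))
                totals[key] = totals.get(key, 0) + weight[(v, w)]
        fused = False
        for key in sorted(totals, key=totals.get, reverse=True):
            a, b = tuple(key)
            src, dst = (b, a) if a in path_from[b] else (a, b)
            if (a | b) & bad_vertices:
                continue  # a bad vertex must stay alone
            if any(frozenset((x, y)) in bad_edges for x in a for y in b):
                continue  # a bad edge joins the two groups
            if dst in bad_path_from[src]:
                continue  # some path between them is bad
            between = {g for g in path_from[src] if dst in path_from[g]}
            merged = frozenset().union(src, dst, *between)
            for v in merged:
                group[v] = merged
            fused = True
            break
        if not fused:
            return sorted(set(group.values()), key=sorted)


# Assumed graph for the example: S is a bad vertex (it is not a loop).
directed = [("L1", "L2"), ("L2", "L3"), ("L1", "S"), ("S", "L3")]
weights = {("L1", "L2"): 1000, ("L2", "L3"): 500,
           ("L1", "S"): 0, ("S", "L3"): 0}
partition = greedy_weighted_fusion(["L1", "L2", "S", "L3"], directed, [],
                                   weights, bad_vertices=["S"])
```

On this assumed graph the sketch fuses L1 with L2 and then refuses to add L3, because L3 is reachable from the composite only through a path that passes through the bad vertex S.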

Loop Fusion (20)

[Figure: the fusion graph of the previous example, with vertices L1, L2, S, and L3 and edge weights 1,000, 500, and 1,000.]

In the previous example the greedy algorithm will fuse L1 and L2, which is the optimal solution.

Loop Fusion (21)

However, the algorithm is not optimal. Consider the following example:

[Figure: fusion graph on the vertices a, b, c, d, e, and f, one of which is marked as a bad vertex, with weighted edges.]

Loop Fusion (22)

Since the edge (a,f) is the heaviest, the greedy algorithm will fuse the vertices a, b, c, d, and f together:

[Figure: the same graph with a, b, c, d, and f fused.]

The weight of this solution is 16.

Loop Fusion (23)

However, fusing c, d, e, f and a, b produces a better result:

[Figure: the same graph with c, d, e, f fused into one vertex set and a, b into another.]

The weight of this solution is 23.

Loop Fusion (24)

Multilevel Loop Fusion

When dealing with nests of multiple loops, the strategy is simple: first align and fuse the outermost loops, then recursively repeat the process for the bodies of the resulting loops.

Starting with the inner loops is inefficient at best: we will not be able to fuse all of them, and if we insist on fusing them we might generate wrong code, because the outer loops might still need alignment, which would change the references in the inner loops.

Agenda  Last Lecture at a glance  Loop Interchange for Register Reuse  Loop Fusion for Register Reuse  Putting it All Together  Complex Loop Nests  Summary

Putting It All Together (1)

In which order should the transformations be applied? The recommended order is as follows:

1. Loop Interchange.
2. Loop Alignment and Fusion.
3. Unroll and Jam.
4. Scalar Replacement.

But why?

Putting It All Together (2)

1. Loop Interchange: Fusion might interfere with loop interchange, therefore interchange should be done first.
2. Loop Alignment and Fusion: This can achieve extra reuse across loops.
3. Unroll and Jam: This can achieve outer-loop reuse when, after interchange is finished, there are dependences carried by loops other than the innermost one.
4. Scalar Replacement: As we already noted, this is the actual “exposure”, so it must be the last transformation.

Agenda  Last Lecture at a glance  Loop Interchange for Register Reuse  Loop Fusion for Register Reuse  Putting it All Together  Complex Loop Nests  Summary

Complex Loop Nests (1)

Loops with If Statements

Consider the following example:

    DO I = 1, N
      IF (M(I).LT.0) THEN
        A(I) = B(I) + C
      ENDIF
      D(I) = A(I) + E
    ENDDO

Scalar Replacement

    DO I = 1, N
      IF (M(I).LT.0) THEN
        a0 = B(I) + C
        A(I) = a0
      ENDIF
      D(I) = a0 + E
    ENDDO

Error: a0 may not be initialized (when the IF branch is not taken).

Complex Loop Nests (2)

We can overcome this problem in the following way:

    DO I = 1, N
      IF (M(I).LT.0) THEN
        a0 = B(I) + C
        A(I) = a0
      ELSE
        a0 = A(I)
      ENDIF
      D(I) = a0 + E
    ENDDO

Note: We didn’t increase the running time, because A(I) is still loaded at most once per iteration.
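The corrected version can be checked against the original with a short Python simulation. This is a sketch with arbitrary test data; the function names are illustrative, and indices are 0-based rather than 1-based as in the Fortran code.

```python
def original(m, a, b, c, e):
    a = list(a)
    d = [0.0] * len(a)
    for i in range(len(a)):
        if m[i] < 0:
            a[i] = b[i] + c
        d[i] = a[i] + e
    return a, d


def scalar_replaced(m, a, b, c, e):
    a = list(a)
    d = [0.0] * len(a)
    for i in range(len(a)):
        if m[i] < 0:
            a0 = b[i] + c
            a[i] = a0
        else:
            a0 = a[i]  # initialization inserted on the ELSE path
        d[i] = a0 + e
    return a, d


mask = [-1, 2, -3, 0, 5]
a_in = [1.0, 2.0, 3.0, 4.0, 5.0]
b_in = [10.0, 20.0, 30.0, 40.0, 50.0]
expected = original(mask, a_in, b_in, 0.5, 0.25)
actual = scalar_replaced(mask, a_in, b_in, 0.5, 0.25)
```

Every path through the loop body now assigns `a0` before it is used, so the two versions agree for any mask.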

Complex Loop Nests (3)

Given a control flow graph of the loop, and assuming that each If statement has a (possibly empty) Else branch:

- We insert an initialization at the beginning of block b if the variable is used in b but not initialized on any path to b.
- We insert an initialization at the end of block b if the variable has not been initialized on any path to the block, it is live on exit from the block, and it is used at some successor of the block (as done in the example).

Complex Loop Nests (4)

Triangular Unroll and Jam

Consider the following example:

    DO I = 2, 99
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
      ENDDO
    ENDDO

Naïve Unroll and Jam

    DO I = 2, 99, 2
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
        A(I+1,J) = A(I+1,I+1) + A(J,J)
      ENDDO
    ENDDO

Error: We miss an assignment, namely the one to A(I+1,I).

We can solve the problem by applying Unroll and Jam step by step and using the loop fusion mechanics.

Complex Loop Nests (5)

Original Code

    DO I = 2, 99
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
      ENDDO
    ENDDO

Unroll

    DO I = 2, 99, 2
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
      ENDDO
      DO J = 1, I
        A(I+1,J) = A(I+1,I+1) + A(J,J)
      ENDDO
    ENDDO

Jam (Fusion)

    DO I = 2, 99, 2
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
        A(I+1,J) = A(I+1,I+1) + A(J,J)
      ENDDO
      A(I+1,I) = A(I+1,I+1) + A(I,I)
    ENDDO

Scalar Replacement

    DO I = 2, 99, 2
      tI = A(I,I)
      tI1 = A(I+1,I+1)
      DO J = 1, I-1
        tJ = A(J,J)
        A(I,J) = tI + tJ
        A(I+1,J) = tI1 + tJ
      ENDDO
      A(I+1,I) = tI1 + tI
    ENDDO
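The difference between the naive and the step-by-step versions can be demonstrated with a small Python simulation. It is a sketch under assumptions: the matrix contents are arbitrary test data, `a[i][j]` models the Fortran `A(I,J)`, and row and column 0 are unused padding.

```python
def make_matrix(size=100):
    return [[float(i * 101 + j) for j in range(size + 1)]
            for i in range(size + 1)]


def original(a):
    for i in range(2, 100):            # DO I = 2, 99
        for j in range(1, i):          # DO J = 1, I-1
            a[i][j] = a[i][i] + a[j][j]


def naive_unroll_and_jam(a):
    for i in range(2, 100, 2):
        for j in range(1, i):
            a[i][j] = a[i][i] + a[j][j]
            a[i + 1][j] = a[i + 1][i + 1] + a[j][j]
        # The iteration J = I of row I+1 is never executed.


def unroll_and_jam_scalar_replaced(a):
    for i in range(2, 100, 2):
        t_i = a[i][i]
        t_i1 = a[i + 1][i + 1]
        for j in range(1, i):
            t_j = a[j][j]
            a[i][j] = t_i + t_j
            a[i + 1][j] = t_i1 + t_j
        a[i + 1][i] = t_i1 + t_i       # peeled iteration J = I


ref, naive, fixed = make_matrix(), make_matrix(), make_matrix()
original(ref)
naive_unroll_and_jam(naive)
unroll_and_jam_scalar_replaced(fixed)
```

The scalar temporaries are safe here because the loop nest never writes a diagonal element, so `A(I,I)`, `A(I+1,I+1)`, and `A(J,J)` keep their initial values.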

Complex Loop Nests (6) Note: It is also possible to Unroll using a factor bigger than 2, using the same techniques.

Complex Loop Nests (7)

Trapezoidal Unroll and Jam

The same technique can be used for general trapezoidal loops, for example (a part of a convolution code):

    DO I = 0, N
      DO J = I, I+N2
        F3(I) = F3(I) + F1(J)*W(I-J)
      ENDDO
      F3(I) = F3(I)*DT
    ENDDO

Unroll

    DO I = 0, N, 2
      DO J = I, I+N2
        F3(I) = F3(I) + F1(J)*W(I-J)
      ENDDO
      F3(I) = F3(I)*DT
      DO J = I+1, I+N2+1
        F3(I+1) = F3(I+1) + F1(J)*W(I-J+1)
      ENDDO
      F3(I+1) = F3(I+1)*DT
    ENDDO

Complex Loop Nests (8)

Unroll

    DO I = 0, N, 2
      DO J = I, I+N2
        F3(I) = F3(I) + F1(J)*W(I-J)
      ENDDO
      F3(I) = F3(I)*DT
      DO J = I+1, I+N2+1
        F3(I+1) = F3(I+1) + F1(J)*W(I-J+1)
      ENDDO
      F3(I+1) = F3(I+1)*DT
    ENDDO

Jam (Fusion)

    DO I = 0, N, 2
      F3(I) = F3(I) + F1(I)*W(0)
      DO J = I+1, I+N2
        F3(I) = F3(I) + F1(J)*W(I-J)
        F3(I+1) = F3(I+1) + F1(J)*W(I-J+1)
      ENDDO
      F3(I+1) = F3(I+1) + F1(I+N2+1)*W(-N2)
      F3(I) = F3(I)*DT
      F3(I+1) = F3(I+1)*DT
    ENDDO

The fused loop starts at J = I+1 because the iteration J = I of the first loop has been peeled off in front of it. Applying Scalar Replacement gave a speedup of 2.22 on a MIPS M120…

Agenda  Last Lecture at a glance  Loop Interchange for Register Reuse  Loop Fusion for Register Reuse  Putting it All Together  Complex Loop Nests  Summary

Summary (1)

In this lecture we covered:

1. Loop Interchange: this gives us more dependences in the innermost loop, which we can utilize for more register reuse.
2. Loop Fusion and Alignment: bring uses together so they can share registers.
3. Complex Loops: how to overcome some of the problems in real-world programs.
