Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.

Improving Locality through Loop Transformations COMP 512 Rice University Houston, Texas Fall 2003
Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.

The Opportunities Compilers have always focused on loops
Higher execution counts Repeated, related operations Much of real work takes place in loops (linear algebra) Several effects to attack Overhead Decrease control-structure cost per iteration Locality Spatial locality  use of co-resident data Temporal locality  reuse of same data Parallelism Move loop w/independent iterations to inner (outer) position Remember Fortran H ||’ism * COMP 512, Fall 2003

Eliminating Overhead Loop unrolling (the oldest trick in the book)
To reduce overhead, replicate the loop body Sources of Improvement Less overhead per useful operation Longer basic blocks for local optimization do i = 1 to 100 by 1 a(i) = a(i) + b(i) end do i = 1 to 100 by 4 a(i) = a(i) + b(i) a(i+1) = a(i+1) + b(i+1) a(i+2) = a(i+2) + b(i+2) a(i+3) = a(i+3) + b(i+3) end becomes (unroll by 4) COMP 512, Fall 2003

Eliminating Overhead Loop unrolling with unknown bounds
Generate guard loops While loop needs an explicit update Used in this form in the BLAS & in BitBlt i = 1 do while (i+3 < n) a(i) = a(i) + b(i) a(i+1) = a(i+1) + b(i+1) a(i+2) = a(i+2) + b(i+2) a(i+3) = a(i+3) + b(i+3) i = i + 4 end do while (i < n) a(i) = a(i) + b(i) i = i + 1 do i = 1 to n by 1 a(i) = a(i) + b(i) end becomes (unroll by 4) COMP 512, Fall 2003

Eliminating Overhead One other use for loop unrolling
Eliminate copies at the end of a loop More complex cases Multiple cross-iteration copy cycles  use LCM of cycle lengths Result has been rediscovered many times [Ken’s thesis] t1 = b(0) do i = 1 to 100 t2 = b(i) a(i) = a(i) + t1 + t2 t1 = t2 end becomes (unroll + rename) do i = 1 to 100 by 2 t1 = b(i+1) a(i+1) = a(i+1) + t2 + t1 COMP 512, Fall 2003

Eliminating Overhead Loop unswitching
Hoist invariant control-flow out of loop nest Replicate the loop & specialize it No tests, branches in loop body Longer segments of straight-line code Does this happen in real code? If so, its worth doing do i = 1 to 100 a(i) = a(i) + b(i) if (expression) then d(i) = 0 end becomes (unswitch) else * COMP 512, Fall 2003 See also: Cytron, Lowry, & Zadeck in 13th POPL (1986)

Eliminating Overhead & Helping Locality
Loop fusion Two loops over same iteration space  one loop Advantages Fewer total operations (statically & dynamically ) Longer basic blocks for local optimization & scheduling Can convert inter-loop reuse to intra-loop reuse a(i) will be found in the cache For big enough arrays, a(i) will not b in the cache do i = 1 to n c(i) = a(i) + b(i) end do j = 1 to n d(j) = a(j) * e(j) becomes (fuse) d(i) = a(i) * e(i) This is safe if it does not change the values used or defined by any statement in either loop * COMP 512, Fall 2003

Increasing Overhead & Helping Locality
Loop distribution (or fission) Single loop with independent statements  multiple loops Advantages Enables other transformations (e.g., vectorization) Each resulting loop has a smaller cache footprint More reuse hits in the cache } Reads b & c Writes a Reads e & f Writes d Reads h & k Writes g { Reads b, c, e, f, h, & k Writes a, d, & g It is safe if all the statements that form a cycle in the dependence graph end up in the same loop do i = 1 to n a(i) = b(i) + c(i) end d(i) = e(i) * f(i) g(i) = h(i) - k(i) becomes (fission) * COMP 512, Fall 2003

Reordering Loops for Locality
Loop interchange Swap inner & outer loops to rearrange iteration space Effect Improves reuse by using more elements per cache line Goal is to get as much reuse into inner loop as possible After interchange, direction of Iteration is changed cache line Runs down cache line In Fortran’s column-major order, a(4,4) would lay out as cache line As little as 1 used element per line do i = 1 to 50 do j = 1 to 100 a(i,j) = b(i,j) * c(i,j) end becomes (interchange) In row-major order, the opposite loop ordering causes the same effects * COMP 512, Fall 2003

Loop permutation Interchange is degenerate case Two perfectly nested loops More general problem is called permutation Safety Permutation is safe iff no data dependences are reversed The flow of data from definitions to uses is preserved Effects Change order of access & order of computation Move accesses closer in time  increase temporal locality Move computations farther apart  cover pipeline latencies COMP 512, Fall 2003

Strip mining Splits a loop into two loops Effects (may slow down the code) Used to match loop bounds to vector length Used as prelude to loop permutation (for tiling) This is always safe do j = 1 to 100 do ii = 1 to 50 by 8 do i = ii to min(ii+7,50) a(i,j) = b(i,j) * c(i,j) end do j = 1 to 100 do i = 1 to 50 a(i,j) = b(i,j) * c(i,j) end becomes (strip mine) COMP 512, Fall 2003

Loop tiling (or blocking) Combination of strip mining and interchange Effects Reduces volume of data between reuses Works on one “tile” at a time (tile size is tk by tj) Choice of tile size is crucial Strip mine & interchange Strip mine & interchange do kk = 1 to n by tk do jj =1 to n by tj do i = 1 to m do k = kk to min(kk+tk-1,n) do j = jj to min(jj+tj-1,n) c(j,i) = c(j,i) + a(k,i) * b(j,k) end do i = 1 to m do k = 1 to n do j = 1 to n c(j,i) = c(j,i) + a(k,i) * b(j,k) end becomes (tiling) Interchange must be safe COMP 512, Fall 2003

Rewriting Loops for Better Register Allocation
Scalar Replacement Allocators never keep c(i) in a register We can trick the allocator by rewriting the references The plan Locate patterns of consistent reuse Make loads and stores use temporary scalar variable Replace references with temporary’s name May need copies at end of loop to keep reused values straight If reuse spans more than one iteration, need to “pipeline” it COMP 512, Fall 2003

Scalar Replacement Effects Decreases number of loads and stores Keeps reused values in names that can be allocated to registers In essence, this exposes the reuse of a(i) to subsequent passes do i = 1 to n do j = 1 to n a(i) = a(i) + b(j) end t = a(i) t = t + b(j) a(i) = t becomes (scalar replacement) Almost any register allocator can get t into a register COMP 512, Fall 2003

Balancing Memory Bound Loops
Balance is the ratio of memory accesses to flops Machine balance is M = Loop balance is L = Loops run better if they are balanced or compute bound If L > M , the loop is memory bound If L = M , the loop is balanced If L < M , the loop is compute bound Making memory bound loops into balanced or compute-bound loops Need more reuse to decrease number of accesses Combine scalar replacement, unrolling, and fusion accesses/cycle flops/cycle accesses/iteration flops/iteration COMP 512, Fall 2003

Example of Loop Balance do i = 1 to n do j = 1 to n a(i) = a(i) + b(j) end Original loop nest t = a(i) t = t + b(j) a(i) = t After scalar replacement L = 3 accesses/iteration 1 flop/iteration 1 access/iteration As memory accesses cost more, this gets better ! COMP 512, Fall 2003

Balancing Memory Loops
Unroll and Jam Unroll the outer loop Fuse the resulting inner loops Effect Increases reuse in the inner loop* Decreases overhead, too Combine with scalar replacement for full benefits do i = 1 to n do j = 1 to n a(i) = a(i) + b(j) end do i = 1 to n by 2 do j = 1 to n a(i) = a(i) b(j) a(i+1) = a(i+1) + b(j) end becomes (unroll & jam) * COMP 512, Fall 2003

Adding Scalar Replacement Rewrites all three cases of reuse do i = 1 to n by 2 t1 = a(i) t2 = a(i+1) do j = 1 to n t3 = b(j) t1 = t1 + t3 t2 = t2 + t3 end a(i) = t1 a(i+1) = t2 do i = 1 to n by 2 do j = 1 to n a(i) = a(i) b(j) a(i+1) = a(i+1) + b(j) end becomes (scalar replacement) 1 access 2 flops Impact Original loop had L = Scalar replacement got L down to Unroll & jam + scalar replacement produced L = 3 accesses/iteration 1 flop/iteration 1 access/iteration 2 flop/iteration captures the reuse COMP 512, Fall 2003

The Big Picture Scalar replacement + unroll & jam helps Factors of 2 to 6 for matrix multiply (matrix300) What does it take to do this in a real compiler? Data dependence analysis (see COMP 515) Method to discover consistent reuse patterns (see Carr) Way to choose the unroll amount Constrained by available registers Need heuristics to predict allocator’s behavior COMP 512, Fall 2003

Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.

Similar presentations

Presentation on theme: "Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.

Similar presentations

Presentation on theme: "Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved."— Presentation transcript:

Similar presentations

About project

Feedback