Presentation is loading. Please wait.

Presentation is loading. Please wait.

Architecture-dependent optimizations Functional units, delay slots and dependency analysis.

Similar presentations

Presentation on theme: "Architecture-dependent optimizations Functional units, delay slots and dependency analysis."— Presentation transcript:

1 Architecture-dependent optimizations Functional units, delay slots and dependency analysis

2 RISC architectures The pipeline structure of modern architectures requires careful instruction scheduling. If instruction I 1 creates a value, there may be a latency that has to elapse before another instruction I 2 can use this value If an instruction awaits the result of a previous computation, the pipeline may have to stall until the result becomes available A branch instruction is time-consuming and affects the contents of the instruction cache. Execution cannot start at destination before one or more cycles have elapsed.

3 Instruction scheduling Purpose: minimize stalls and delays, fill delay slots with useful computations, minimize execution time of basic block. Tool: dependency analysis. Uncover legal reorderings of instructions, available parallelism in basic blocks and beyond Applications: Filling delay slots is important for all programs Dependency analysis is critical for reordering of loop computations on vector processors and others

4 Dependence relations A data dependence is a constraint that arises from the flow of data between statements. Violating a data dependence by reordering may lead to incorrect results. If S1 sets a value that S2 uses, this is flow dependence or true dependence between S1 and S2 If S1 uses some variable’s value and S2 sets it, there is an antidependence between them. If both S1 and S2 set the value of some variable, there is an output dependence between them. If both S1 and S2 read the value of some variable there is an input dependence between then. This does not impose an ordering.

5 The dependence DAG of a basic block There is an edge in the dependence dag if: I 1 writes a register or location that I 2 uses: I 1 f  I 2 I 1 uses a register or location that I 2 changes: I 1 a  I 2 I 1 and I 2 write to the same register or location: I 1 o  I 2 I 1 and I 2 exhibit a structural hazard: a load followed by a store cannot be interchanged unless the addresses are known to be distinct: X := A[I]; A[J] := Y; -- cannot interchange, X might get Y  if there is an edge between I 1 and I 2, I 2 must not start executing until I 1 has executed for some number of cycles.

6 Example 1: R3 = [R15] 2: R4 = [R15 + 4] 3: R2 = R3 – R4 -- needs R4: stall one cycle 4: R5 = [R12] 5: R12 = R12 + 4 6: R6 = R3 * R5 7: [R15+4] = R3 8: R5 = R6 + 2 12 3 6 8 5 4 7

7 Contention for resources Functional unit is pipelined, consists of multiple resources. Instructions through the pipeline may conflict on use of resources. Eg: floating-point unit on MIPS: A: Mantissa add E: Exception test M: Multiplier first stage N: Multiplier second stage R: Adder Round S: Operand shift U: unpack Add uses successively U, S and A, A and R, R and S. (4 cycles) Mul uses U, M, M, M, M, M and A, R. (7 cycles) Conflict depends on relative starting time of two instructions. Edges in dependency graph are labelled with latencies (>= 1).

8 Branch scheduling Important use of dependency graph: fill delay slots (branch takes two cycles to reach destination) R2 = [R1] R2 = [R1] R3 = [R1+4] R3 = [R1+4] R4 = R2 + R3 (stall) R5 = R2 -1 R5 = R2 -1 goto L1 goto L1 R4 = R2 + R3 nop

9 Conditional jumps and delay slots Instruction in delay slot is executed while jump is in progress. What if jump is not taken? Need mechanism to annull instruction. Branch prediction: assume target is known, fill delay slot with first instruction in target block If both destinations start with same instruction, ideal choice for delay slot Good heuristics for loops: assume that a backwards conditional jump is usually taken. Move first instruction in loop to delay slot for branch at end Call instruction has delay slot: fill with parameter push

10 A greedy algorithm: list scheduling Finding optimal schedule for DAG is NP-complete Simple algorithm is O (N 2 ) at worst, usually linear Roots of DAG are instructions without predecessors First pass: from leaves to roots: compute latest possible starting time for each instruction to end of block For leaf: execution time of instruction For inner node: maximum delay imposed by successors E.g. if I n is followed by I m, I m can start at T – 4, and there is a latency of 2 between I n and I m, then I n must start by T – 6.

11 List scheduling: second pass Second pass: from roots to leaves: schedule instructions with the greatest slack (farthest from block end) and that can start as early as possible from now. At each step: D1: candidates with the largest remaining delay D2: candidates with the earliest possible starting time (computed from starting time of their predecessors) Choose from D1 if unique, else from D2 if unique, else use heuristics: Choose earliest starting time, or Choose instruction that uses least used pipeline, or Choose instruction that frees register.

12 Procedure integration: inlining Calls make optimizations harder. There is a large payoff to local optimizations over large basic blocks: inlining subprogram bodies is often very effective: It exposes the values of the actuals in the body It creates larger basic blocks It saves the cost of the call Can be done at the tree level or at the RTL level. In both cases it can enable other optimizations. Possible disadvantages: code size increase, debugging is harder

13 Inlining as a tree transformation Treat body of subprogram as a generic unit Each inlined body needs its own local variables Global references are captured at the point of definition Inlining works like instantiation: replace formals with actuals, complete analysis and expansion of inserted body Replace multiple return statements where needed Introduce temporary to hold return value of function

14 Name capture: recognize global entities function memo (x : integer) return integer is local : integer := x **2; begin Saved := Saved + local + x; -- Saved is global return Saved; end memo; … Val := memo (15); Becomes declare local : integer := 15 ** 2; -- each inlining has its own result : integer; -- maybe superfluous if context is assignment begin Saved := Saved + local + 15l; -- Saved is the same entity in all inlinings result := Saved; end; Val := result;

15 Handling return statements Subprogram needs a label to serve a single exit point. In a function: identify target of result, or create temporary for it; replace return with assignment to target, followed by goto to exit label In a procedure: replace return with goto to exit label Optimizations if body of function is single return statement and context is assignment, can replace right-hand side with expression If procedure has no return statement, exit label is superfluous

16 Parameter passing If actual is an expression, it is evaluated once: create temporary in block and replace formal with temporary Val2 := memo (x * f (z)); Becomes: declare C1 : integer := x * f (z); local : integer := C1 ** 2; begin Saved := Saved + local + C1; Val2 := Saved; -- context is assignment end;

17 Parameter passing: variables An in-out parameter is a location, cannot create a temporary for it: must use a renaming declaration. procedure incr (x : in out integer) is begin x := x + 1; end; … incr (a (i)); Becomes declare c1 : integer renames a (i); begin c1 := c1 + 1; end;

18 Context includes more than global names semantics of inlined call must be identical to original program If constraint checks are not suppressed in the body, they must not be suppressed in the inlined block, even if suppressed at the point of call. Status of constraint checks is part of closure of body to inline, must be applied when analyzing inlined block

19 Specialized inlining: loop unrolling Create successive copies of body of loop: saves tests, makes bigger basic block, increases instruction level parallelism for j in 1.. N loop loop_body; end loop; Becomes for k in 1.. N / r loop -- unroll r times loop_ body loop_body [j -> j +1] -- replace loop variable for each unrolling … loop_ body [j -> j +r -1] end; for k in N / r +1.. N loop loop_body end loop; -- leftover iterations

Download ppt "Architecture-dependent optimizations Functional units, delay slots and dependency analysis."

Similar presentations

Ads by Google