Instruction Level Parallelism (ILP)
- Pipelining is one form of concurrency
- With several pipelines, there is an opportunity to execute several (pipelined) instructions concurrently
- With more sophisticated control units, there is the possibility of issuing several instructions in the same cycle:
  - Order of execution determined by the compiler: static scheduling (superscalar processors)
  - Order of execution determined at run time: dynamic scheduling (out-of-order execution processors)
1/17/2019 CSE 471 ILP
ILP improves (reduces) CPI
- Recall, for a single pipeline: CPI = 1 (ideal CPI) + CPI contributed by stalls
- Allowing n instruction issues per cycle: CPI_n = 1/n (ideal CPI) + CPI contributed by stalls
- ILP will increase structural hazards
- ILP makes reducing the other stalls even more important: a "bubble" now costs more than the loss of a single instruction issue
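The formulas above can be checked numerically. This is only an illustration; the stall rate of 0.5 cycles per instruction is made up:

```python
def cpi(issue_width, stall_cpi):
    """CPI_n = 1/n (ideal CPI) + CPI contributed by stalls."""
    return 1.0 / issue_width + stall_cpi

# Single pipeline, 0.5 stall cycles per instruction on average.
single = cpi(1, 0.5)   # 1.0 + 0.5 = 1.5
# 4-wide issue with the same stall CPI.
wide = cpi(4, 0.5)     # 0.25 + 0.5 = 0.75

print(single, wide, single / wide)
```

The speedup is only 2x, not the 4x the wider issue could ideally deliver, because the stall term is unchanged: with n-wide issue the ideal term shrinks, so the stalls dominate, which is exactly why a bubble costs more.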
Where can we optimize? (control)
- The unit of code where one can find ILP is the basic block: a contiguous block of instructions with a single entry point and a single exit point
- Basic blocks are small (branches occur about 30% of the time)
- ILP can be increased (CPI due to control stalls decreased) by:
  - Reducing the number of branches: loop unrolling (compiler)
  - Branch prediction: static (compiler) or dynamic (hardware)
  - Code movement, based on trace scheduling (compiler) and/or other methods (not covered in this course)
  - Predication (compiler and hardware): see later
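Dynamic (hardware) branch prediction can be sketched with the textbook 2-bit saturating counter; this is a generic illustration, not a mechanism from these slides, and the outcome sequence is made up to mimic a loop branch:

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self):
        self.state = 2  # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at 0 and 3 so one anomalous outcome cannot flip the prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch: taken 8 times, not taken once at loop exit, then taken again.
outcomes = [True] * 8 + [False] + [True] * 8
correct = 0
for taken in outcomes:
    if p.predict() == taken:
        correct += 1
    p.update(taken)

print(correct, "/", len(outcomes))  # 16 / 17: only the loop exit mispredicts
```

The 2-bit hysteresis is what keeps the single not-taken exit from causing two mispredictions on the next loop entry, which a 1-bit predictor would suffer.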
Where can we optimize? (data)
- CPI contributed by data hazards can be decreased by:
  - Compiler optimizations: load scheduling, dependence analysis, software pipelining, trace scheduling
  - Hardware (run-time) techniques: forwarding (we know all about that!) and register renaming
Compiler optimizations (a sample)
- The processor (number of pipelines, latencies of the various units, number of registers) is exposed to the compiler
- Loop unrolling (see next slide) to increase the potential for ILP
- Avoid dependencies (data, name, control) at the instruction level to increase ILP:
  - Data dependence translates into RAW: load scheduling and code reorganization
  - Name dependence (register allocation): anti-dependence (WAR) and output dependence (WAW)
  - Control dependence = branches: predicated execution
Loop unrolling
- Replicate the body of the loop and correspondingly reduce the number of iterations (cf. software pipelining)
- Pros:
  - Decreases loop overhead (branches, counter updates)
  - Allows better scheduling (longer basic blocks, hence better opportunities to find ILP and to hide the latency of "long" operations)
- Cons:
  - Increases register pressure
  - Increases code length (I-cache occupancy)
  - Requires a prologue or epilogue (beginning or end of loop) when the trip count is not a multiple of the unrolling factor
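A sketch of the transformation, written out by hand in Python (a compiler would do this on machine code; the function and accumulator names are ours). The body is unrolled 4x, and the trailing loop is the epilogue that handles a trip count that is not a multiple of 4:

```python
def sum_unrolled(a):
    """Sum a list with the loop body unrolled 4x."""
    n = len(a)
    # Four independent accumulators break the RAW chain a single sum would have,
    # exposing ILP inside the unrolled body (at the cost of register pressure).
    s0 = s1 = s2 = s3 = 0
    i = 0
    while i + 4 <= n:        # unrolled body: 4 iterations' work per loop test
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    while i < n:             # epilogue: the 0-3 leftover elements
        total += a[i]
        i += 1
    return total
```

Note both cons from the slide show up directly: four accumulators instead of one (register pressure) and a second loop (code length, epilogue).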
Data dependencies: RAW
- Instruction j depends on instruction i, which precedes it in sequential program order, if the result of i is a source operand of j
- Transitivity: instruction j depends on i if j depends on k and k depends on i
- Dependence is a program property; hazards (RAW in this case) and their (partial) removal are a property of the pipeline organization
- Code optimization goals:
  - Maintain the dependence but avoid the hazard (e.g., scheduling)
  - Eliminate the dependence by modifying the code (e.g., register allocation)
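RAW dependence detection can be sketched as a scan for each source operand's most recent earlier writer. The three-instruction program and its register names are hypothetical, chosen only to show a chain that is also transitive:

```python
# Instructions as (dest, [sources]); list index = sequential program order.
prog = [
    ("R1", ["R2", "R3"]),   # i0: R1 <- R2 op R3
    ("R4", ["R1", "R5"]),   # i1: reads R1 -> RAW on i0
    ("R6", ["R4", "R7"]),   # i2: reads R4 -> RAW on i1 (depends on i0 transitively)
]

def raw_pairs(prog):
    """Return (producer, consumer) index pairs: the consumer reads the
    most recent earlier write of one of its source registers."""
    pairs = set()
    for j, (_, srcs) in enumerate(prog):
        for src in srcs:
            for i in range(j - 1, -1, -1):   # nearest earlier writer wins
                if prog[i][0] == src:
                    pairs.add((i, j))
                    break
    return sorted(pairs)

print(raw_pairs(prog))  # [(0, 1), (1, 2)]
```

The direct pairs (0,1) and (1,2) give the transitive dependence of i2 on i0 for free; this is exactly the program property the hardware or compiler must respect regardless of pipeline organization.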
Name dependencies: WAR and WAW
- Anti-dependence (WAR):
  Instruction i: R1 <- R2 (long operator) R3
  Instruction j: R2 <- R4 (short operator) R5
  This is a WAR hazard if instruction j finishes first
- Output dependence (WAW):
  Instruction i: R1 <- R2 (long operator) R3
  Instruction j: R1 <- R4 (short operator) R5
  This is a WAW hazard if instruction j finishes first
- In both cases there is no true dependence, only a "naming" problem (it would not happen if we had enough registers)
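Because these are only naming problems, giving every write a fresh register makes them disappear. A minimal renaming sketch, with made-up physical register names P1, P2 (real renamers work on decoded instructions with a free list and a map table, but the idea is the same):

```python
def rename(prog, free_regs):
    """Rewrite each destination to a fresh physical register; sources
    follow the latest mapping. WAR/WAW name conflicts disappear."""
    mapping = {}            # architectural -> current physical register
    free = list(free_regs)
    out = []
    for dest, srcs in prog:
        new_srcs = [mapping.get(s, s) for s in srcs]  # read latest version
        new_dest = free.pop(0)                        # fresh name per write
        mapping[dest] = new_dest
        out.append((new_dest, new_srcs))
    return out

# The WAR example from this slide: j writes R2, which i still reads.
prog = [
    ("R1", ["R2", "R3"]),   # i: R1 <- R2 (long op) R3
    ("R2", ["R4", "R5"]),   # j: R2 <- R4 (short op) R5  -- WAR on R2
]
renamed = rename(prog, ["P1", "P2"])
# j's write now targets P2, not R2, so j may safely finish before i reads R2.
```

After renaming, the only dependences left are true (RAW) ones, which is why out-of-order hardware renames before it schedules.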
Control dependencies
- Branches restrict the scheduling of instructions
- Speculation (i.e., executing an instruction that might not be needed) must be:
  - Safe: it raises no additional exceptions
  - Legal: the end result is the same as without speculation
- Speculation can be implemented by:
  - The compiler (code motion)
  - The hardware (branch prediction)
  - Both (branch prediction, conditional operations)
Static vs. dynamic scheduling
- Assumptions (for now): 1 instruction issued per cycle; several pipelines with a common IF and ID
- Static scheduling (optimized by the compiler): when there is a stall (hazard), no further instructions are issued (examples: Alpha 21064 and 21164)
- Dynamic scheduling (enforced by the hardware): instructions following the one that stalls can "issue" if there is no structural hazard or dependence; "issue" may take on different meanings, as we'll see
Dynamic scheduling
- Implies the possibility of:
  - Out-of-order issue (and hence out-of-order execution)
  - Out-of-order completion (also possible, but less frequent, with static scheduling)
  - Imprecise exceptions
- Example (different pipes for add/sub and divide):
  R1 = R2 / R3 (long latency)
  R2 = R1 + R5 (stalls because of the RAW on R1)
  R6 = R7 - R8 (can be issued before the add, and executed and completed before the other two)
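The example above can be played out in a toy dataflow model: each instruction starts as soon as the producers of its sources are done, ignoring structural limits and name dependences (as if renamed). The latencies are made up:

```python
# (dest, sources, latency); the slide's divide / add / sub sequence.
prog = [
    ("R1", ["R2", "R3"], 10),  # divide: long latency
    ("R2", ["R1", "R5"], 1),   # add: RAW on R1, must wait for the divide
    ("R6", ["R7", "R8"], 1),   # sub: independent, bypasses the stalled add
]

def completion_times(prog):
    """Completion cycle of each instruction when it may start as soon as
    its source operands' producers have finished (start of time = 0)."""
    done = {}                  # register -> cycle its latest value is ready
    finish = []
    for dest, srcs, lat in prog:
        start = max([done[s] for s in srcs if s in done], default=0)
        done[dest] = start + lat
        finish.append(done[dest])
    return finish

print(completion_times(prog))  # [10, 11, 1]
```

The subtract completes at cycle 1, long before the divide (10) and the add (11): out-of-order completion, and the reason exceptions become imprecise if the subtract were to fault after younger-in-time but older-in-program-order work is already done.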
Conceptual execution on a processor with ILP
- Instruction fetch and branch prediction: corresponds to IF in the simple pipeline (we have seen that)
- Instruction decode, dependence check, dispatch, issue: corresponds (with many variations) to ID; although instructions are issued (i.e., assigned to functional units), they might not execute right away (they may wait for the instructions they depend on to forward their results)
- Instruction execution: corresponds to EX and/or MEM (with various latencies)
- Instruction commit: corresponds to WB, but is more complex because of speculative and out-of-order completion