Loop-Level Parallelism
- Analysis at the source level
- Dependences across iterations

      for (i=1000; i>0; i=i-1)
          x[i] = x[i] + s;          /* no loop-carried dependence */

      for (i=1; i<=100; i=i+1) {
          x[i+1] = x[i] + z[i];     /* loop-carried dependence */
          y[i+1] = y[i] + x[i+1];
      }
Loop-Carried Dependences

      for (i=1; i<=100; i=i+1) {
          x[i] = x[i] + y[i];
          y[i+1] = w[i] + z[i];     /* loop-carried, but non-circular */
      }

- Non-circular dependences can be eliminated by transforming the loop:

      x[1] = x[1] + y[1];
      for (i=1; i<=99; i=i+1) {
          y[i+1] = w[i] + z[i];
          x[i+1] = x[i+1] + y[i+1];
      }
      y[101] = w[100] + z[100];
Compiler Support for ILP
- Dependence analysis
  - Finding dependences is important for:
    - Good scheduling of code
    - Determining loop-level parallelism
    - Eliminating name dependences
  - Complexity
    - Simple for scalar variable references
    - Complex for pointers and array references
- Software pipelining
- Trace scheduling
Loop-Level Parallelism
- Primary focus of dependence analysis
- Determine all dependences and find cycles

      for (i=1; i<=100; i=i+1) {
          x[i] = y[i] + z[i];       /* no loop-carried dependence */
          w[i] = x[i] + v[i];
      }

      for (i=1; i<=100; i=i+1) {
          x[i+1] = x[i] + z[i];     /* loop-carried, recurrent, circular dependence */
      }

      for (i=1; i<=100; i=i+1) {
          x[i] = x[i] + y[i];
          y[i+1] = w[i] + z[i];     /* loop-carried, non-circular: removable */
      }

      /* the previous loop, transformed to remove the dependence: */
      x[1] = x[1] + y[1];
      for (i=1; i<=99; i=i+1) {
          y[i+1] = w[i] + z[i];
          x[i+1] = x[i+1] + y[i+1];
      }
      y[101] = w[100] + z[100];
Dependence Analysis Algorithms
- Assume array indexes are affine (a*i + b)
- GCD test: for two affine array indexes a*i + b and c*i + d,
  if a loop-carried dependence exists, then GCD(c, a) must divide (d - b)

      x[8*i] = x[4*i + 2] + 3;
      /* GCD(8, 4) = 4 does not divide 2 - 0, so no dependence exists */

- General graph cycle determination is NP-complete
- a, b, c, and d may not be known at compile time
Example: Eliminating Name Dependences by Renaming

      for (I=1; I<=100; I=I+1) {
          Y[I] = X[I] / c;      /* name dependences on X[I] and Y[I] */
          X[I] = X[I] + c;
          Z[I] = Y[I] + c;
          Y[I] = c - Y[I];
      }

- After renaming (T and X1 are new arrays):

      for (I=1; I<=100; I=I+1) {
          T[I] = X[I] / c;
          X1[I] = X[I] + c;
          Z[I] = T[I] + c;
          Y[I] = c - T[I];
      }
Software Pipelining
- Reorganizes a loop so that each iteration of the new loop is assembled from instructions chosen from different iterations of the original loop
- Requires start-up and finish-up code outside the pipelined loop

  [Figure: iterations 0-3 of the original loop overlapped in time; a software-pipelined iteration draws one stage from each of several original iterations, bracketed by start-up and finish-up code.]
Example
- Body of the original loop, for iterations i, i+1, i+2:

      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4

- Original loop:

      Loop: LD   F0, 0(R1)
            ADDD F4, F0, F2
            SD   0(R1), F4
            SUBI R1, R1, #8
            BNEZ R1, Loop

- Software-pipelined loop (each pass stores for iteration i, adds for i+1, loads for i+2):

      Loop: SD   16(R1), F4
            ADDD F4, F0, F2
            LD   F0, 0(R1)
            SUBI R1, R1, #8
            BNEZ R1, Loop
Trace Scheduling
- Find ILP across conditional branches
- Two-step process:
  - Trace selection
    - Find a trace (a likely sequence of basic blocks)
    - Use loop unrolling to generate long traces
    - Use static branch prediction for other conditional branches
  - Trace compaction
    - Squeeze the trace into a small number of wide instructions
    - Preserve data and control dependences
Trace Selection
- Source fragment (flow graph: A[I] = A[I] + B[I]; test A[I] = 0?, then B[I] = ... on the T path, X on the F path, joining at C[I] = ...):

      A[I] = A[I] + B[I]:
          LW   R4, 0(R1)
          LW   R5, 0(R2)
          ADD  R4, R4, R5
          SW   0(R1), R4

- Trace along the predicted path:

          LW   R4, 0(R1)
          LW   R5, 0(R2)
          ADD  R4, R4, R5
          SW   0(R1), R4
          BNEZ R4, else     ; A[I] = 0?
          ....
          SW   0(R2), ...   ; B[I] = ... (on-trace)
          J    join
    else: ....              ; X (off-trace)
    join: ....
          SW   0(R3), ...   ; C[I] = ...
Summary of Compiler Techniques
- All try to avoid dependence stalls:
  - Loop unrolling: reduces loop overhead
  - Software pipelining: reduces dependence stalls within a single loop body
  - Trace scheduling: reduces the impact of other branches
- Compilers use a mix of the three
- All techniques depend on prediction accuracy
Hardware-Based ILP Techniques
- Limitation of static techniques: the ability to predict branch behavior at compile time
- Hardware-based schemes:
  - Conditional or predicated instructions (extend the ISA)
  - Speculation
    - Static: hardware support for compiler speculation
    - Dynamic: use branch prediction to guide the speculation process
Predicated Instructions
- Condition evaluation is part of the instruction execution
  - If true, the instruction executes normally
  - If false, the instruction is replaced by a no-op
- Example: conditional move for if (A == 0) S = T;

      Branch version:          Predicated version:
          BNEZ R1, L               CMOVZ R2, R3, R1
          MOV  R2, R3
      L:  ...
Limitations of Predicated Instructions
- Annulled conditional instructions still take execution time
- Useful only when the condition can be evaluated early
- Data dependences may not allow separating the conditional instruction from the branch
- Limited when control flow involves more than a simple alternative sequence
  - Moving an instruction across multiple branches requires predicating it on multiple conditions
- Conditional instructions may be more expensive than their unconditional counterparts
Compiler Speculation
- Conditional instructions enable limited speculative computation
- If the compiler can predict branches accurately, it can use speculation for:
  - Improving scheduling (eliminating stalls)
  - Increasing the issue rate (IPC)
- Challenge: maintaining exception behavior
  - Resumable vs. terminating exceptions
- Methods for aggressive speculation:
  - Hardware-software cooperation
  - Poison bits
  - Renaming
Hardware-Software Cooperation
- Hardware and operating system handle exceptions:
  - Resumable exceptions are processed as usual
  - Terminating exceptions make the speculative instruction return an undefined value
- Correct programs never fail (but what about incorrect ones?)

      Original:                      Speculated:
          LW   R1, 0(R3)                 LW   R1, 0(R3)
          BNEZ R1, L1                    LW   R14, 0(R2)   ; speculative load
          LW   R1, 0(R2)                 BEQZ R1, L3
          J    L2                        ADDI R14, R1, #4
      L1: ADDI R1, R1, #4            L3: SW   0(R3), R14
      L2: SW   0(R3), R1

- Stores cannot be speculative (only register renaming, as with R14 above)
Poison Bits
- Less change to the exception behavior
  - Incorrect programs will still cause exceptions when speculation is used
- Poison bits: one for each register and one for each instruction
  - Destination register's bit: set when a speculative instruction produces a terminating exception
  - Instruction bit: set for speculative instructions
  - The exception is raised when a non-speculative instruction reads a poisoned register
- No memory poison bits, so stores cannot be speculative
Renaming
- Renaming and buffering in hardware (as in Tomasulo's algorithm)
- Boosted instructions (moved across a branch):
  - Execute before the controlling branch
  - Commit or abort once the branch is resolved
  - Other boosted instructions can use their temporary results

      LW   R1, 0(R3)
      LW   R1, 0(R2)     ; boosted: result buffered until the branch resolves
      BEQZ R1, L3        ; tests the first load's value
      ADDI R1, R1, #4
  L3: SW   0(R3), R1
Dynamic Speculation
- Combines dynamic branch prediction, speculation, and dynamic scheduling: dataflow execution
- Advantages:
  - Memory-reference disambiguation
  - Hardware-based branch prediction is better than static prediction
  - Precise exception model, even for speculated instructions
  - No need for compensation or bookkeeping code
  - Portability across hardware platforms
- Disadvantage: complex hardware
Implementation of Dynamic Speculation
- Extend Tomasulo's algorithm:
  - Execute and bypass results out of order
  - Commit in order
- Reorder buffer
  - Additional virtual registers (can be operands)
  - Stores speculative results (before commit)
  - Integrates the function of the store and load buffers
  - Fields: instruction type, destination, and value
- Easy to undo speculated instructions on mispredicted branches or exceptions
- Precise exceptions