Speculative Execution | CS510 Computer Architectures | Lecture 11

Trace Scheduling
Parallelism across IF branches vs. LOOP branches
– Trace scheduling works when the behavior of the branches is fairly predictable at compile time
Two steps:
– Trace Selection: find a likely sequence of basic blocks (a trace) that forms a long, statically predicted sequence of straight-line code
– Trace Compaction: squeeze the trace into a few VLIW instructions
Need bookkeeping code in case the prediction is wrong
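
The trace selection step can be sketched in a few lines of Python. This is a simplified illustration, not the full algorithm used by a trace-scheduling compiler: it only grows the trace forward from an entry block by following the most probable successor. The function name `select_trace`, the block names, and the probabilities are all assumptions made for the example.

```python
def select_trace(cfg, entry):
    """Greedy forward trace selection (simplified sketch).

    cfg maps a basic block name to a list of (successor, probability)
    pairs, where the probability is a static (compile-time) prediction.
    """
    trace, seen, block = [], set(), entry
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        successors = cfg.get(block, [])
        # Follow the edge the compiler predicts as most likely.
        block = max(successors, key=lambda s: s[1])[0] if successors else None
    return trace


# Hypothetical control-flow graph for an if-then-else.
example_cfg = {
    "entry": [("then", 0.9), ("else", 0.1)],
    "then": [("join", 1.0)],
    "else": [("join", 1.0)],
    "join": [],
}
example_trace = select_trace(example_cfg, "entry")
```

The block left off the trace ("else" here) is where bookkeeping code would be needed if compaction moves instructions across the branch.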

Trace Scheduling
[Figure: flowchart of the code fragment. A[i] = A[i]+B[i]; branch on A[i]=0; T path: B[i]=; F path: X; join: C[i]=]
– Select the trace through the T path if the true branch is taken more frequently
Trace compaction by speculation
– Move the code associated with B and C to make VLIW word(s) before the branch
– The moved code may cause exceptions when executed; speculation should not introduce any new exception*
* See the kinds of exceptions on page 179

HW Support for More ILP: Conditional Instructions
Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
– If the condition is false, the instruction neither stores a result nor causes an exception*
– Expanded ISAs of Alpha, MIPS, and SPARC have conditional move; PA-RISC can annul any following instruction
Drawbacks of conditional instructions
– Still takes a clock even if "annulled"
– Stall if the condition is evaluated late
– Complex conditions reduce effectiveness; the condition becomes known late in the pipeline
* See the kinds of exceptions on page 179

HW Support for More ILP: Conditional Instructions
Example: replacing a branch with a conditional move
        BNEZ  R1,L
        MOV   R2,R3
    L:
becomes
        CMOVZ R2,R3,R1

Two-issue superscalar: each cycle combines one memory reference and one ALU (or branch) operation

    First instruction slot     Second instruction slot
    LW   R1,40(R2)             ADD  R3,R4,R5
                               ADD  R6,R3,R7
    BEQZ R10,L
    LW   R8,20(R10)
    LW   R9,0(R8)

– The slot highlighted in green on the slide is wasted
– The instructions highlighted in red on the slide have a data dependence

With a conditional load (LWC):

    First instruction slot     Second instruction slot
    LW   R1,40(R2)             ADD  R3,R4,R5
    LWC  R8,20(R10),R10        ADD  R6,R3,R7
    BEQZ R10,L
    LW   R9,0(R8)

– LWC executes the load only when [R10] ≠ 0, i.e., LWC is the same as LW unless its 3rd operand is 0
– LWC must have no effect if the condition is not satisfied: it can neither write the result nor cause any exception
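
A small Python model may help make the semantics of the two conditional instructions concrete. It is only a sketch: registers are plain integers, memory is a dictionary, and a missing dictionary key stands in for a memory fault. The function names are chosen for this example.

```python
def branch_version(r1, r2, r3):
    """BNEZ R1,L ; MOV R2,R3 ; L:  (returns the final value of R2)."""
    if r1 != 0:
        return r2      # branch taken, MOV skipped
    return r3          # fall through, R2 <- R3


def cmovz(r1, r2, r3):
    """CMOVZ R2,R3,R1: R2 <- R3 only if R1 == 0 (returns final R2)."""
    return r3 if r1 == 0 else r2


def lwc(memory, old_dest, address, cond):
    """LWC dest,address,cond: behaves like LW unless cond is 0.

    When cond is 0 the load is annulled: the destination keeps its old
    value and no fault is raised, even for an invalid address.
    """
    if cond == 0:
        return old_dest
    return memory[address]   # KeyError here models a real load fault
```

For example, `lwc({}, 7, 20, 0)` returns 7 without touching memory, which is the property that lets the compiler hoist the load above `BEQZ R10,L`.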

HW Support for More ILP: Speculation
Speculation
– Allow an instruction that depends on a branch (predicted to be taken) to issue, with no consequences (including exceptions) if the branch is not actually taken ("HW undo")
– Allows the execution of an instruction before the processor knows that the instruction should execute (i.e., it avoids a control dependence stall)
Often combined with dynamic scheduling (Tomasulo)
– Separate speculative bypassing of results from real bypassing of results
– When an instruction is no longer speculative, write its results (instruction commit)
– Execute out of order but commit in order

Compiler Speculation with HW Support: (1) HW-SW Cooperation for Speculation
HW undo for misprediction
– simply handle all resumable exceptions when the exception occurs
– simply return an undefined value for any exception that would cause termination

Source code:
    if (A==0) A = B; else A = A + 4;

Compiled code:
        LW    R1,0(R3)     ; load A
        BNEZ  R1,L1        ; test A
        LW    R1,0(R2)     ; if clause
        J     L2           ; skip else
    L1: ADD   R1,R1,4      ; else clause
    L2: SW    0(R3),R1     ; store A

Compiled code using compiler-based speculation*:
        LW    R1,0(R3)     ; load A
        LW    R14,0(R2)    ; speculative load B
        BEQZ  R1,L3        ; other branch of the if
        ADD   R14,R1,4     ; the else clause
    L3: SW    0(R3),R14    ; nonspeculative store

* Assume the then clause is almost always executed.
– Register renaming: the speculative version needs an extra register (R14)

Compiler Speculation with HW Support: (2) Speculation with Poison Bits
Speculation with poison bits
– allows compiler speculation with less change to the exception behavior
– a poison bit is added to every register
– another bit is added to every instruction to indicate whether the instruction is speculative

        LW    R1,0(R3)     ; load A
        LW*   R14,0(R2)    ; speculative load B
        BEQZ  R1,L3        ; other branch of the if
        ADD   R14,R1,4     ; the else clause
    L3: SW    0(R3),R14    ; nonspeculative store

If the speculative LW* generates a terminating exception, the poison bit of R14 is set. When the nonspeculative SW instruction executes, it raises an exception if the poison bit for R14 is on.
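
The poison-bit behavior of this code sequence can be imitated with a short Python sketch. Several things here are assumptions for illustration: memory is a dictionary, a missing address stands in for a terminating exception, and a normal (nonspeculative) register write is assumed to clear the destination's poison bit. The names `RegisterFile`, `speculative_load`, `store`, and `run_speculative` are hypothetical.

```python
class RegisterFile:
    def __init__(self):
        self.value = {}
        self.poison = set()   # names of registers whose poison bit is on


def speculative_load(regs, memory, dest, address):
    """LW*: on a fault, set the poison bit instead of raising."""
    if address in memory:
        regs.value[dest] = memory[address]
        regs.poison.discard(dest)
    else:
        regs.value[dest] = 0          # undefined value
        regs.poison.add(dest)


def store(regs, memory, address, src):
    """Nonspeculative SW: faults if the source register is poisoned."""
    if src in regs.poison:
        raise RuntimeError("deferred exception: " + src + " is poisoned")
    memory[address] = regs.value[src]


def run_speculative(memory, addr_a, addr_b):
    """if (A==0) A = B; else A = A + 4;  with a speculative load of B."""
    regs = RegisterFile()
    regs.value["R1"] = memory[addr_a]                 # LW   R1,0(R3)
    speculative_load(regs, memory, "R14", addr_b)     # LW*  R14,0(R2)
    if regs.value["R1"] != 0:                         # BEQZ R1,L3 not taken
        regs.value["R14"] = regs.value["R1"] + 4      # ADD  R14,R1,4
        regs.poison.discard("R14")                    # assumed: normal write clears poison
    store(regs, memory, addr_a, "R14")                # L3: SW 0(R3),R14
    return memory[addr_a]
```

If B is unreadable but the else clause runs, the poisoned value is overwritten and no exception is seen; the exception surfaces only when the store would actually use the bad value.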

Compiler Speculation with HW Support
The main disadvantages of the two previous schemes
– the need to introduce copies to deal with register renaming
– the possibility of exhausting the registers
Speculative Instructions with Renaming (Boosting)
– flag the instructions that are moved past branches as speculative
– provide renaming and buffering in the HW

Compiler Speculation with HW Support: (3) Speculative Instructions with Renaming
– The extra register is no longer necessary
– The result of the boosted instruction is not written into R1 until after the branch
– Other boosted instructions could use the results of the boosted load

        LW    R1,0(R3)     ; load A
        LW+   R1,0(R2)     ; boosted load B
        BEQZ  R1,L3        ; other branch of the if
        ADD   R1,R1,4      ; the else clause
    L3: SW    0(R3),R1     ; nonspeculative store

If the branch goes the predicted way, the boosted result is written to R1; otherwise it is never written to R1.

Hardware-based Speculation
Combines three ideas
– dynamic branch prediction
– speculation to allow the execution of instructions before the control dependences are resolved
– dynamic scheduling to deal with the scheduling of different combinations of basic blocks
Advantages
– dynamic runtime disambiguation of memory addresses
– hardware-based branch prediction
– a completely precise exception model
– does not require compensation or bookkeeping code
– does not require different code sequences to achieve good performance on different implementations

HW-based Speculation
Need a HW buffer for the results of uncommitted instructions: the reorder buffer
– The reorder buffer can be an operand source
– Once an instruction commits, its result is found in the register
– 3 fields: instruction type, destination, value
– Results are tagged with the reorder buffer number instead of the reservation station number
– Instructions commit in order
– As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure: datapath with FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder, and a load path from memory (LD)]

4 Steps of Speculative Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number to the reservation station.
2. Execution: operate on operands (EX)
   When both operands are in the reservation station, execute; if an operand is not ready, watch the CDB for its result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.
4. Commit: update the register with the reorder buffer result
   When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer.
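
The interplay of steps 3 and 4 (results arrive out of order, registers are updated in order) can be sketched with a toy reorder buffer in Python. This is a minimal model under stated assumptions: it tracks only the destination and value fields, ignores reservation stations and the CDB, and the class and method names are invented for the example.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: out-of-order completion, in-order commit."""

    def __init__(self):
        self.entries = deque()   # oldest instruction at the left
        self.registers = {}      # committed (architectural) state

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        # Result is buffered here, not written to the register yet.
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed

    def flush(self):
        """Undo all uncommitted work, e.g. after a mispredicted branch."""
        self.entries.clear()


rob = ReorderBuffer()
first = rob.issue("F0")
second = rob.issue("F2")
rob.write_result(second, 2.0)        # younger instruction finishes first
blocked = rob.commit()               # nothing commits: head is not done
rob.write_result(first, 1.0)
retired = rob.commit()               # both commit, in program order
```

Because registers change only at commit, `flush` is enough to discard speculative results without any further repair.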

Limits to ILP
Conflicting studies of the amount of parallelism available in the late 1980s and early 1990s. Different assumptions about:
– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication

Limits to ILP
HW model for ultimate issue performance; MIPS compilers
1. Register renaming: infinite virtual registers, so all WAW and WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted
   => machine with perfect speculation and an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known, and a store can be moved before a load provided the addresses are not equal
1 cycle latency for all instructions

Limitations on Window Size and Maximum Issue Count
Window: the set of instructions examined for simultaneous execution
– To determine whether n instructions have any register dependences among them takes
  $(2n-2) + (2n-4) + \cdots + 2 = n^2 - n$ comparisons
  2000 instructions: about 4 million comparisons
  50 instructions: 2450 comparisons
– Current technology: window sizes of 4 to 32, which require up to about 900 comparisons
Multiple issue lengthens the clock cycle
– Multiple-issue processors typically have clock cycles that are 1.5 to 3 times longer
– They typically have CPIs that are 2 to 3 times lower
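
The comparison counts quoted on this slide can be checked numerically. The sketch below sums the series 2 + 4 + ... + (2n - 2) directly and compares it with the closed form n^2 - n; the function name is chosen for this example.

```python
def window_comparisons(n):
    """Register-dependence comparisons for a window of n instructions.

    Sums the series 2 + 4 + ... + (2n - 2) term by term.
    """
    return sum(2 * i for i in range(1, n))


for size in (4, 32, 50, 2000):
    # The closed form n^2 - n agrees with the summed series.
    assert window_comparisons(size) == size * size - size
    print(size, window_comparisons(size))
```

The quadratic growth is why window sizes stay small in practice: going from 50 to 2000 instructions multiplies the comparison count by more than a thousand.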

More Realistic HW: Branch Impact
Window of 2000 instructions and maximum issue of 64 instructions per clock cycle

Selective History Predictor
[Figure: block diagram of a selective predictor that combines two predictors through a selector]
– Non-correlating predictor: 8192 x 2 bits, indexed by branch address
– Correlating predictor: 2048 x 4 x 2 bits, indexed by branch address and 2 bits of global history
– Selector: 8K x 2 bits; states 11 and 10 choose the non-correlator, states 01 and 00 choose the correlator
– 2-bit prediction states: 11 and 10 predict Taken; 01 and 00 predict Not Taken
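
A rough Python sketch of such a selective predictor follows, using the table sizes from the slide. Some details are assumptions made for illustration only: tables are indexed with a simple modulo of the branch address, all counters start at 01, and the selector is trained only when the two component predictors disagree. The class and function names are hypothetical.

```python
def bump(counter, up):
    """Saturating 2-bit counter update (range 0..3)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class SelectivePredictor:
    def __init__(self):
        self.simple = [1] * 8192                      # non-correlating, 8192 x 2 bits
        self.corr = [[1] * 4 for _ in range(2048)]    # correlating, 2048 x 4 x 2 bits
        self.selector = [1] * 8192                    # >= 2 chooses the non-correlator
        self.history = 0                              # last 2 branch outcomes

    def predict(self, pc):
        simple_taken = self.simple[pc % 8192] >= 2
        corr_taken = self.corr[pc % 2048][self.history] >= 2
        use_simple = self.selector[pc % 8192] >= 2
        return simple_taken if use_simple else corr_taken

    def update(self, pc, taken):
        i, j = pc % 8192, pc % 2048
        simple_ok = (self.simple[i] >= 2) == taken
        corr_ok = (self.corr[j][self.history] >= 2) == taken
        if simple_ok != corr_ok:
            # Move the selector toward whichever predictor was right.
            self.selector[i] = bump(self.selector[i], simple_ok)
        self.simple[i] = bump(self.simple[i], taken)
        self.corr[j][self.history] = bump(self.corr[j][self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & 3


def count_correct(predictor, pc, outcomes):
    """Predict, then train, on each outcome; return the number of hits."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct
```

On a strictly alternating branch (taken, not taken, taken, ...) the plain 2-bit counter keeps mispredicting, while the correlating table learns the pattern from the global history, so the selector settles on the correlator after a short warm-up.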

More Realistic HW: Alias Impact
2000 instruction window, 64 instruction issue, 8K 2-level prediction, 256 renaming registers

Instruction issues per cycle, by program and alias analysis model:

    Alias analysis model      gcc  espresso   li  fpppp  doduc  tomcatv
    Perfect                    10        15   12     49     16       45
    Global/stack perfect +      7         7    9     49     16       45
    Inspection #                4         5    4      4      6        5
    None *                      3         5    3      3      4        4

* All memory accesses are assumed to conflict
+ Ongoing research
# Most commercial compilers

Realistic HW for 90s: Window Impact
Realistic HW in the 90s: perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as the window

Fallacies and Pitfalls
Fallacy: Processors with lower CPIs will always be faster.
– Sophisticated pipelines typically have slower clock rates than processors with simple pipelines
– Example:
  IBM Power-2 (low CPI): two FP and two load-store units, 71.5 MHz (slower clock rate)
  DEC Alpha 21064 (high CPI): dual-issue with one load-store and one FP, 200 MHz (faster clock rate)
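
The fallacy follows from the usual relation CPU time = instruction count x CPI / clock rate. The sketch below plugs in the two clock rates from the slide; the CPI values (0.5 and 1.0) and the instruction count are hypothetical numbers picked only to illustrate the point, not measurements of either processor.

```python
def cpu_time(instructions, cpi, clock_hz):
    """CPU time in seconds = instruction count * CPI / clock rate."""
    return instructions * cpi / clock_hz


INSTRUCTIONS = 1_000_000_000          # hypothetical program length

# Clock rates are from the slide; the CPIs are assumed for illustration.
power2_time = cpu_time(INSTRUCTIONS, cpi=0.5, clock_hz=71.5e6)
alpha_time = cpu_time(INSTRUCTIONS, cpi=1.0, clock_hz=200e6)

print("Power-2:", round(power2_time, 2), "s")
print("Alpha:  ", round(alpha_time, 2), "s")
```

Under these assumed CPIs the machine with twice the CPI still finishes first, because its clock is about 2.8 times faster.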

Brainiac vs. Speed Demon
6-scalar IBM Power-2 @ 71.5 MHz (5-stage pipe) vs. 2-scalar Alpha @ 200 MHz (7-stage pipe)

Recent High Performance Processors

                     Year        Initial                              Issue capability
                     shipped     clock rate  Issue                    Maxi-  Load-  Integer               SPEC (measure
    Processor        in systems  (MHz)       structure  Scheduling    mum    store  ALU      FP   Branch  or estimate)
    IBM Power-2      1994        67          Dynamic    Static        6      2      2        2    2       95 int, 270 FP
    Intel Pentium    1994        66          Dynamic    Static        2      2      2        1    1       65 int, 65 FP
    DEC Alpha 21164  1995        300         Static     Static        4      2      2        2    1       330 int, 500 FP
    Sun Ultra-       1995        167         Dynamic    Static        4      1      1        1    1       275 int, 305 FP
    Intel P6         1995        150         Dynamic    Dynamic       3      1      2        1    1       >200 int
    PowerPC 620      1995        133         Dynamic    Dynamic       4      1      1        1    2       25 int, 300 FP
    MIPS R10000      1996        200         Dynamic    Dynamic       4      1      2        2    1       300 int, 600 FP
    HP 8000          1996        200         Dynamic    Static        4      2      2        2    1       >360 int, >550 FP