March 2, 2011. CS152, Spring 2011. CS 152 Computer Architecture and Engineering, Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction. Krste Asanovic.


1 CS 152 Computer Architecture and Engineering
Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction
March 2, 2011 (CS152, Spring 2011)
Krste Asanovic, Electrical Engineering and Computer Sciences, University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

2 Last time in Lecture 10
– Pipelining is complicated by multiple and/or variable-latency functional units
– Out-of-order and/or pipelined execution requires tracking of dependencies: RAW, WAR, WAW
– Dynamic issue logic can support out-of-order execution to improve performance
  – Last time, looked at a simple scoreboard to track out-of-order completion
– Hardware register renaming can further improve performance by removing hazards

3 Register Renaming
– Decode does register renaming and adds instructions to the issue-stage instruction reorder buffer (ROB)
  => renaming makes WAR or WAW hazards impossible
– Any instruction in the ROB whose RAW hazards have been satisfied can be issued
  => out-of-order or dataflow execution
[Figure: pipeline with IF, ID, and Issue stages feeding the ALU, Mem, Fadd, and Fmul units, followed by WB]

4 Renaming Structures
– Instruction template (i.e., tag t) is allocated by the Decode stage, which also associates the tag with a register in the regfile
– When an instruction completes, its tag is deallocated
– Replacing the tag by its value is an expensive operation
[Figure: renaming table & regfile feeding a reorder buffer with slots t1, t2, ..., tn and fields Ins#, use, exec, op, p1, src1, p2, src2; the buffer issues to the Load Unit, FUs, and Store Unit]

5 Reorder Buffer Management
Instruction slot is a candidate for execution when:
– It holds a valid instruction (“use” bit is set)
– It has not already started execution (“exec” bit is clear)
– Both operands are available (p1 and p2 are set)
Destination registers are renamed to the instruction’s slot tag.
ROB is managed circularly:
– “exec” bit is set when the instruction begins execution
– When an instruction completes, its “use” bit is marked free
– ptr2 is incremented only if the “use” bit is marked free
[Figure: ROB slots t1, t2, ..., tn with fields Ins#, use, exec, op, p1, src1, p2, src2; ptr1 = next available, ptr2 = next to deallocate]
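The slot rules above can be sketched in a few lines of Python. This is only an illustrative model, not the lecture's hardware design: the class and method names (`Slot`, `ReorderBuffer`, `allocate`, `candidates`, `start`, `complete`) and the explicit `count` field are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    use: bool = False    # slot holds a valid instruction
    exec_: bool = False  # instruction has started execution
    op: str = ""
    p1: bool = False     # operand 1 is present
    p2: bool = False     # operand 2 is present


class ReorderBuffer:
    """Hypothetical circular ROB with the use/exec/p1/p2 bits from the slide."""

    def __init__(self, n):
        self.slots = [Slot() for _ in range(n)]
        self.ptr1 = 0   # next available slot
        self.ptr2 = 0   # next slot to deallocate
        self.count = 0  # slots between ptr2 and ptr1 (assumed bookkeeping)

    def allocate(self, op, p1, p2):
        if self.count == len(self.slots):
            return None  # buffer full, decode must stall
        tag = self.ptr1
        self.slots[tag] = Slot(use=True, exec_=False, op=op, p1=p1, p2=p2)
        self.ptr1 = (self.ptr1 + 1) % len(self.slots)
        self.count += 1
        return tag

    def candidates(self):
        # valid, not yet started, both operands present
        return [t for t, s in enumerate(self.slots)
                if s.use and not s.exec_ and s.p1 and s.p2]

    def start(self, tag):
        self.slots[tag].exec_ = True

    def complete(self, tag):
        self.slots[tag].use = False
        # ptr2 only moves past slots whose "use" bit has been freed
        while self.count > 0 and not self.slots[self.ptr2].use:
            self.ptr2 = (self.ptr2 + 1) % len(self.slots)
            self.count -= 1


rob = ReorderBuffer(4)
ld = rob.allocate("LD", p1=True, p2=True)
mul = rob.allocate("MUL", p1=False, p2=True)  # still waiting on operand 1
ready = rob.candidates()                      # only the LD slot qualifies
```

Note that a slot freed out of order stays counted until every older slot is also free, which is what makes the buffer circular rather than a free list.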

6 Renaming & Out-of-order Issue: An example
1 LD F2, 34(R2)
2 LD F4, 45(R3)
3 MULTD F6, F4, F2
4 SUBD F8, F2, F2
5 DIVD F4, F2, F8
6 ADDD F10, F6, F4
– When are tags in sources replaced by data? Whenever an FU produces data
– When can a name be reused? Whenever an instruction completes
[Figure: animated renaming table (F1 to F8, each holding data or a tag ti, with presence bit p) and reorder buffer (slots t1 to t5; fields Ins#, use, exec, op, p1, src1, p2, src2) stepping through the sequence above]
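As a rough illustration of what decode does to the six-instruction sequence above, the following sketch renames each destination to a fresh tag and rewrites sources to the tag of their latest producer. It is a simplification under stated assumptions: every instruction gets its own tag (`t1` to `t6`) and tags are never recycled or replaced by data, whereas the slide's reorder buffer reuses slots as instructions complete. The function name `rename` and the tuple layout are hypothetical.

```python
def rename(program):
    """Rename destinations to tags; sources read the latest producer's tag."""
    table = {}    # architectural register -> tag of its latest in-flight producer
    renamed = []
    for n, (op, dest, srcs) in enumerate(program, start=1):
        tag = f"t{n}"
        # a source with no in-flight producer keeps its register name
        new_srcs = [table.get(s, s) for s in srcs]
        table[dest] = tag
        renamed.append((tag, op, new_srcs))
    return renamed, table


program = [
    ("LD",    "F2",  ["R2"]),
    ("LD",    "F4",  ["R3"]),
    ("MULTD", "F6",  ["F4", "F2"]),
    ("SUBD",  "F8",  ["F2", "F2"]),
    ("DIVD",  "F4",  ["F2", "F8"]),
    ("ADDD",  "F10", ["F6", "F4"]),
]
renamed, table = rename(program)
for entry in renamed:
    print(entry)
```

After renaming, MULTD reads the first F4 through tag `t2` while ADDD reads the second F4 through tag `t5`, so the WAR and WAW hazards on F4 no longer constrain issue order.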

7 IBM 360/91 Floating-Point Unit (R. M. Tomasulo, 1967)
– Distribute instruction templates by functional units
– Common bus ensures that data is made available immediately to all the instructions waiting for it
– Match tag; if equal, copy value & set presence “p”
[Figure: instructions, load buffers 1 to 6 (from memory), floating-point regfile, numbered instruction templates at the Adder and Mult units, and store buffers (to memory), each entry holding p and tag/data fields, all connected by the common bus]

8 Effectiveness?
Renaming and out-of-order execution was first implemented in 1969 in the IBM 360/91 but did not show up in the subsequent models until the mid-nineties. Why?
Reasons:
1. Effective on a very small class of programs
2. Memory latency a much bigger problem
3. Exceptions not precise!
One more problem needed to be solved: control hazards!

9 Precise Interrupts
It must appear as if an interrupt is taken between two instructions (say I_i and I_{i+1}):
– the effect of all instructions up to and including I_i is totally complete
– no effect of any instruction after I_i has taken place
The interrupt handler either aborts the program or restarts it at I_{i+1}.

10 Effect on Interrupts: Out-of-order Completion
I1 DIVD f6, f6, f4
I2 LD f2, 45(r3)
I3 MULTD f0, f2, f4
I4 DIVD f8, f6, f2
I5 SUBD f10, f0, f6
I6 ADDD f6, f8, f2
out-of-order comp: 1 2 2 3 1 4 3 5 5 4 6 6
Consider interrupts (figure annotations: restore f2, restore f10)
Precise interrupts are difficult to implement at high speed: we want to start execution of later instructions before exception checks have finished on earlier instructions.

11 Exception Handling (In-Order Five-Stage Pipeline)
– Hold exception flags in pipeline until commit point (M stage)
– Exceptions in earlier pipe stages override later exceptions
– Inject external interrupts at commit point (override others)
– If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
[Figure: five-stage pipeline (PC, Inst. Mem, Decode, Execute, Data Mem, Writeback) carrying Exc and PC fields through the D, E, and M stages to the commit point. Exception sources: PC address exceptions, illegal opcode, overflow, data address exceptions, asynchronous interrupts. Outputs: Cause, EPC, Select Handler PC, Kill F Stage, Kill D Stage, Kill E Stage, Kill Writeback]

12 Phases of Instruction Execution
– Fetch: Instruction bits retrieved from cache.
– Decode: Instructions dispatched to appropriate issue-stage buffer.
– Execute: Instructions and operands issued to execution units. When execution completes, all results and exception flags are available.
– Commit: Instruction irrevocably updates architectural state (aka “graduation”).
[Figure: PC, I-cache, Fetch Buffer, Decode/Rename, Issue Buffer, Functional Units, Result Buffer, Commit, Architectural State]

13 In-Order Commit for Precise Exceptions
– Instructions fetched and decoded into instruction reorder buffer in-order
– Execution is out-of-order (=> out-of-order completion)
– Commit (write-back to architectural state, i.e., regfile & memory) is in-order
Temporary storage needed to hold results before commit (shadow registers and store buffers)
[Figure: Fetch and Decode (in-order), Reorder Buffer, Execute (out-of-order), Commit (in-order); on "Exception?" kill in-flight instructions and inject handler PC]

14 Extensions for Precise Exceptions
– add pd, dest, data, and cause fields in the instruction template
– commit instructions to reg file and memory in program order => buffers can be maintained circularly
– on exception, clear reorder buffer by resetting ptr1 = ptr2 (stores must wait for commit before updating memory)
[Figure: reorder buffer with fields Inst#, use, exec, op, p1, src1, p2, src2, pd, dest, data, cause; ptr1 = next available, ptr2 = next to commit]
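A minimal sketch of in-order commit with the extra dest, data, and cause fields follows. It is an assumption-laden model rather than the lecture's design: the reorder buffer is a Python `deque` in program order, entries are dictionaries with hypothetical keys `done`, `dest`, `data`, and `cause`, and clearing the deque stands in for resetting ptr1 = ptr2.

```python
from collections import deque


def run_commit(rob, regfile):
    """Commit completed instructions from the head; flush on an exception.

    Returns the cause of the exception taken, or None if none was taken.
    """
    while rob and rob[0]["done"]:
        head = rob[0]
        if head["cause"] is not None:
            cause = head["cause"]
            rob.clear()  # stands in for resetting ptr1 = ptr2
            return cause
        regfile[head["dest"]] = head["data"]  # architectural state updated in order
        rob.popleft()
    return None


regfile = {}
rob = deque([
    {"done": True,  "dest": "f2", "data": 1.0,  "cause": None},
    {"done": False, "dest": "f0", "data": None, "cause": None},
    {"done": True,  "dest": "f6", "data": 3.0,  "cause": None},  # finished early
])
first = run_commit(rob, regfile)   # commits f2, then stalls behind f0

rob[0].update(done=True, cause="overflow")
second = run_commit(rob, regfile)  # exception reaches the head: flush
```

The third instruction finished before the second, yet its result never reaches the register file, which is the property that makes the exception precise.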

15 Rollback and Renaming
Register file does not contain renaming tags any more. How does the decode stage find the tag of a source register? Search the “dest” field in the reorder buffer.
[Figure: register file (now holds only committed state) and reorder buffer with slots t1, t2, ..., tn and fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; the buffer feeds the Load Unit, FUs, and Store Unit and commits into the register file]

16 Renaming Table
Renaming table is a cache to speed up register name lookup. It needs to be cleared after each exception taken. When else are valid bits cleared? Control transfers.
[Figure: rename table holding a tag t and valid bit v for each register r1, r2, ..., placed beside the register file; reorder buffer with slots t1, t2, ..., tn and fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; Load Unit, FUs, Store Unit; Commit path]

17 Control Flow Penalty
– Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!
– How much work is lost if pipeline doesn’t follow correct instruction flow? ~ Loop length x pipeline width
[Figure: PC, I-cache, Fetch Buffer, Decode, Issue Buffer, Func. Units, Result Buffer, Commit, Arch. State; "Next fetch started" marked at Fetch and "Branch executed" marked at Execute]

18 MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1) Is the preceding instruction a taken branch?
2) If so, what is the target address?
Instruction   Taken known?          Target known?
J             After Inst. Decode    After Inst. Decode
JR            After Inst. Decode    After Reg. Fetch
BEQZ/BNEZ     After Reg. Fetch*     After Inst. Decode
* Assuming zero detect on register read

19 Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750MHz, 2000):
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute
   Remainder of execute pipeline (+ another 6 stages)
[Figure annotations on the pipeline: Branch Target Address Known; Branch Direction & Jump Register Target Known]

20 Reducing Control Flow Penalty
Software solutions
– Eliminate branches: loop unrolling (increases the run length)
– Reduce resolution time: instruction scheduling (compute the branch condition as early as possible; of limited value)
Hardware solutions
– Find something else to do: delay slots (replaces pipeline bubbles with useful work; requires software cooperation)
– Speculate: branch prediction (speculative execution of instructions beyond the branch)

21 Branch Prediction
Motivation:
– Branch penalties limit performance of deeply pipelined processors
– Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
– Prediction structures: branch history tables, branch target buffers, etc.
– Mispredict recovery mechanisms:
  – Keep result computation separate from commit
  – Kill instructions following branch in pipeline
  – Restore state to state following branch

22 Static Branch Prediction
Overall probability a branch is taken is ~60-70%, but it depends on direction [Figure: JZ, backward 90%, forward 50%]
– ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
– ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate

23 Dynamic Branch Prediction: learning based on past behavior
– Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at the next execution
– Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution)

24 Branch Prediction Bits
– Assume 2 BP bits per instruction
– Change the prediction after two consecutive mistakes!
BP state: (predict take/¬take) x (last prediction right/wrong)
[Figure: state diagram over the four states take/right, take/wrong, ¬take/right, ¬take/wrong, with transitions labeled taken and ¬taken]
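One way to read the four-state scheme above is sketched here. The transition taken after the second consecutive mistake is an assumption: the sketch lands in the "right" state of the new direction, so that a single later mistake does not immediately flip the prediction back. The function names `step` and `mispredictions` are hypothetical.

```python
def step(state, taken):
    """Advance the 2-bit predictor state.

    state is (predict_taken, last_prediction_right).
    """
    predict, last_right = state
    if taken == predict:
        return (predict, True)       # correct prediction
    if last_right:
        return (predict, False)      # first mistake: keep the prediction
    # second consecutive mistake: flip the prediction (assumed to enter the
    # "right" state of the new direction)
    return (not predict, True)


def mispredictions(outcomes, state=(True, True)):
    """Count wrong predictions over a sequence of branch outcomes."""
    wrong = 0
    for taken in outcomes:
        if state[0] != taken:
            wrong += 1
        state = step(state, taken)
    return wrong


# A loop branch taken three times and then falling through, executed twice.
loop_pattern = [True, True, True, False] * 2
print(mispredictions(loop_pattern))
```

With this scheme the loop-exit mistake does not disturb the "taken" prediction for the next trip through the loop, which a single prediction bit would get wrong twice per loop execution.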

25 Branch History Table
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
[Figure: k bits of the Fetch PC (above the two low-order 00 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, producing Taken/¬Taken?; in parallel the I-Cache supplies the instruction (opcode, offset), which gives Branch? and Target PC = PC + offset]
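The indexing in the figure can be sketched as below. Two things are assumptions rather than content from the slide: each entry is modeled as a 2-bit saturating counter (a common alternative to the right/wrong state encoding on the previous slide), and counters start at 2 (weakly taken). The class name `BHT` and its methods are hypothetical.

```python
class BHT:
    """Untagged 2^k-entry branch history table indexed by PC bits."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        # assumed encoding: 0..3 saturating counter, >= 2 means predict taken
        self.counters = [2] * (1 << k)

    def index(self, pc):
        # drop the two low-order bits, which are always 00 for aligned
        # 4-byte instructions, then keep k bits
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)


bht = BHT(k=12)  # 4K entries, as on the slide
bht.update(0x1000, taken=False)
bht.update(0x1000, taken=False)
print(bht.predict(0x1000))
```

Because the table is untagged, two branches whose PCs differ by a multiple of 4 x 2^k bytes share an entry and can disturb each other's prediction.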

26 Exploiting Spatial Correlation (Yeh and Patt, 1992)
if (x[i] < 7) then y += 1;
if (x[i] < 5) then c -= 4;
If first condition false, second condition also false.
History register, H, records the direction of the last N branches executed by the processor.

27 Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)
[Figure: k bits of the Fetch PC (above the low-order 00 bits) index the BHT; a 2-bit global branch history shift register, which shifts in the Taken/¬Taken result of each branch, selects one of the four sets to produce Taken/¬Taken?]
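The structure in the figure can be modeled roughly as follows. This is a sketch under assumptions, not the Pentium Pro's actual implementation: entries are 2-bit saturating counters initialized to 1 (weakly not taken), and the class name `TwoLevel` is hypothetical.

```python
class TwoLevel:
    """Global-history predictor: history selects one of several BHT sets."""

    def __init__(self, k, hist_bits=2):
        self.k_mask = (1 << k) - 1
        self.h_mask = (1 << hist_bits) - 1
        self.history = 0  # global branch history shift register
        # one 2^k-entry table of 2-bit counters per history value
        self.tables = [[1] * (1 << k) for _ in range(1 << hist_bits)]

    def predict(self, pc):
        return self.tables[self.history][(pc >> 2) & self.k_mask] >= 2

    def update(self, pc, taken):
        table = self.tables[self.history]
        i = (pc >> 2) & self.k_mask
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        # shift the resolved direction into the history register
        self.history = ((self.history << 1) | int(taken)) & self.h_mask


# A branch that strictly alternates taken / not taken.
predictor = TwoLevel(k=4)
results = []
for n in range(20):
    taken = (n % 2 == 0)
    results.append(predictor.predict(0x40) == taken)
    predictor.update(0x40, taken)
print(results[-10:])
```

A single 2-bit counter cannot track a strictly alternating branch, but here each history value trains its own counter, so the pattern becomes predictable after a short warm-up.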

28 Speculating Both Directions
An alternative to branch prediction is to execute both directions of a branch speculatively
– resource requirement is proportional to the number of concurrent speculative executions
– only half the resources engage in useful work when both directions of a branch are executed speculatively
– branch prediction takes fewer resources than speculative execution of both paths
With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction

29 Limitations of BHTs
Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.
UltraSPARC-III fetch pipeline:
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute
   Remainder of execute pipeline (+ another 6 stages)
[Figure annotations on the pipeline: Correctly predicted taken branch penalty; Jump Register penalty]

30 CS152 Administrivia

31 Branch Target Buffer
BP bits are stored with the predicted target address.
IF stage: If (BP=taken) then nPC=target else nPC=PC+4
later: check prediction, if wrong then kill the instruction and update BTB & BPb, else update BPb
[Figure: k bits of the PC index both IMEM and a 2^k-entry Branch Target Buffer; each entry holds BPb and a predicted target, producing BP and target]

32 Address Collisions
Assume a 128-entry BTB whose entry holds BPb = take, target = 236.
Instruction memory: 132: Jump 100; 1028: Add .....
What will be fetched after the instruction at 1028?
BTB prediction = 236
Correct target = 1032 => kill PC=236 and fetch PC=1032
Is this a common occurrence? Can we avoid these bubbles?
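The collision can be checked with a little arithmetic. The slide does not say how the index is formed; the sketch below assumes the untagged BTB is indexed by the low seven bits of the byte address (PC mod 128), under which 132 and 1028 map to the same entry because 1028 - 132 = 896 = 7 x 128. The names `BTB_ENTRIES` and `btb_index` are hypothetical.

```python
BTB_ENTRIES = 128


def btb_index(pc):
    # assumption: index is the PC modulo the number of entries, with no tag
    return pc % BTB_ENTRIES


# The jump at 132 installs its target (236) in the BTB.
btb = {btb_index(132): 236}

# Later the Add at 1028 is fetched and finds the jump's entry.
predicted = btb.get(btb_index(1028), 1028 + 4)
correct = 1028 + 4

print(btb_index(132), btb_index(1028), predicted, correct)
```

Since the Add is not a control instruction, the fetch from 236 must be killed and fetch restarted at 1032, which is the bubble the slide asks about.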

33 BTB is only for Control Instructions
– BTB contains useful information for branch and jump instructions only => do not update it for other instructions
– For all other instructions the next PC is PC+4!
How to achieve this effect without decoding the instruction?

34 Branch Target Buffer (BTB)
– Keep both the branch PC and target PC in the BTB
– PC+4 is fetched if match fails
– Only taken branches and jumps held in BTB
– Next PC determined before branch fetched and decoded
[Figure: k bits of the PC index a 2^k-entry direct-mapped BTB (can also be associative) in parallel with the I-Cache; each entry holds Entry PC, Valid, and predicted target PC; a valid entry whose Entry PC equals the PC signals match and supplies the target]
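A tagged, direct-mapped BTB along these lines might look like the sketch below. The details are assumptions: the index drops the two low PC bits, the full branch PC is stored as the tag, and an entry is removed when its branch resolves not taken. The class name `BTB` and its methods are hypothetical.

```python
class BTB:
    """Direct-mapped BTB that stores the branch PC alongside the target."""

    def __init__(self, k):
        self.k = k
        self.entries = [None] * (1 << k)  # each entry: (branch_pc, target_pc)

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)

    def next_pc(self, pc):
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]  # match: redirect fetch to the predicted target
        return pc + 4        # match fails: fall through

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.entries[i] = (pc, target)  # only taken branches/jumps are held
        elif self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None


btb = BTB(k=7)
btb.update(132, taken=True, target=236)
print(btb.next_pc(132), btb.next_pc(644))
```

Address 644 maps to the same entry as 132 in this model, but the stored branch PC does not match, so fetch simply continues at PC+4 instead of being misdirected.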

35 Combining BTB and BHT
– BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR)
– BHT can hold many more entries and is more accurate
– BHT in later pipeline stage corrects when BTB misses a predicted taken branch
– BTB/BHT only updated after branch resolves in E stage
[Figure: pipeline stages A (PC Generation/Mux), P (Instruction Fetch Stage 1), F (Instruction Fetch Stage 2), B (Branch Address Calc/Begin Decode), I (Complete Decode), J (Steer Instructions to Functional units), R (Register File Read), E (Integer Execute), with the BTB attached at an earlier stage than the BHT]

36 Uses of Jump Register (JR)
How well does BTB work for each of these cases?
– Switch statements (jump to address of matching case): BTB works well if same case used repeatedly
– Dynamic function call (jump to run-time function address): BTB works well if same function usually called (e.g., in C++ programming, when objects have same type in virtual function call)
– Subroutine returns (jump to return address): BTB works well if usually return to the same place => often one function called from many distinct call sites!

37 Subroutine Return Stack
Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }
– Push call address when function call executed
– Pop return address when subroutine return decoded
[Figure: stack of k entries (typically k=8-16) holding &fb(), &fc(), &fd()]
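A return stack of this kind can be sketched as a small bounded stack. The details are assumptions: the value pushed is the call's PC plus 4 (a MIPS pipeline with a branch delay slot would use PC plus 8), the oldest entry is dropped on overflow, and an empty stack yields no prediction. The class name `ReturnStack` and its methods are hypothetical.

```python
class ReturnStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, k=8):
        self.k = k
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.k:
            self.stack.pop(0)           # overflow: forget the oldest entry
        self.stack.append(call_pc + 4)  # assumed return address: call PC + 4

    def on_return(self):
        # predicted return target, or None if the stack has been emptied
        return self.stack.pop() if self.stack else None


ras = ReturnStack(k=8)
for call_site in (100, 200, 300):  # fa calls fb, fb calls fc, fc calls fd
    ras.on_call(call_site)
predictions = [ras.on_return() for _ in range(4)]
print(predictions)
```

Unlike a BTB entry for the return instruction, the stack gives the right target even when one function is called from many distinct call sites, as long as nesting depth stays within k.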

38 Mispredict Recovery
In-order execution machines:
– Assume no instruction issued after branch can write-back before branch resolves
– Kill all instructions in pipeline behind mispredicted branch
Out-of-order execution?
– Multiple instructions following branch in program order can complete before branch resolves

39 In-Order Commit for Precise Exceptions
– Instructions fetched and decoded into instruction reorder buffer in-order
– Execution is out-of-order (=> out-of-order completion)
– Commit (write-back to architectural state, i.e., regfile & memory) is in-order
Temporary storage needed in ROB to hold results before commit
[Figure: Fetch and Decode (in-order), Reorder Buffer, Execute (out-of-order), Commit (in-order); on "Exception?" kill in-flight instructions and inject handler PC]

40 Branch Misprediction in Pipeline
– Can have multiple unresolved branches in ROB
– Can resolve branches out-of-order by killing all the instructions in ROB that follow a mispredicted branch
[Figure: PC with Branch Prediction, Fetch, Decode, Reorder Buffer, Execute (Complete), Commit; Branch Resolution sends Kill signals to the earlier stages and the reorder buffer and injects the correct PC]

41 Recovering ROB/Renaming Table
Take snapshot of register rename table at each predicted branch, recover earlier snapshot if branch mispredicted
[Figure: rename table (tag t and valid bit v per register r1, r2, ...) with a stack of rename snapshots, beside the register file; reorder buffer with slots t1, t2, ..., tn and fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; Ptr2 = next to commit, Ptr1 = next available, with "rollback next available" on a mispredict; Load Unit, FUs, Store Unit; Commit path]
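The snapshot idea can be sketched as follows. This models only the rename table, not the ROB pointer rollback, and the details are assumptions: snapshots are full copies kept oldest first, a mispredict also discards every younger snapshot, and a correctly predicted branch just releases its own snapshot. The class name `RenameTable` and its methods are hypothetical.

```python
class RenameTable:
    """Rename table with one saved copy per unresolved predicted branch."""

    def __init__(self):
        self.map = {}        # architectural register -> tag
        self.snapshots = []  # (branch_tag, saved_map), oldest first

    def rename_dest(self, reg, tag):
        self.map[reg] = tag

    def lookup(self, reg):
        # a register with no valid tag is read from the register file
        return self.map.get(reg, reg)

    def at_branch(self, branch_tag):
        self.snapshots.append((branch_tag, dict(self.map)))

    def resolve(self, branch_tag, mispredicted):
        idx = next(i for i, (tag, _) in enumerate(self.snapshots)
                   if tag == branch_tag)
        if mispredicted:
            self.map = self.snapshots[idx][1]  # recover the earlier mapping
            del self.snapshots[idx:]           # younger branches are killed too
        else:
            del self.snapshots[idx]


rt = RenameTable()
rt.rename_dest("r1", "t1")
rt.at_branch("b1")
rt.rename_dest("r1", "t2")  # speculative work after b1
rt.rename_dest("r2", "t3")
rt.at_branch("b2")
rt.rename_dest("r2", "t4")
rt.resolve("b1", mispredicted=True)
print(rt.lookup("r1"), rt.lookup("r2"))
```

After the rollback, r1 maps to the tag it had when b1 was predicted and r2 has no in-flight producer again, so decode can resume on the correct path without waiting for the ROB to drain.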

42 “Data-in-ROB” Design (HP PA8000, Pentium Pro, Core2Duo, Nehalem)
– On dispatch into ROB, ready sources can be in regfile or in ROB dest (copied into src1/src2 if ready before dispatch)
– On completion, write to dest field and broadcast to src fields
– On issue, read from ROB src fields
[Figure: register file (holds only committed state) and reorder buffer with slots t1, t2, ..., tn and fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; Load Unit, FUs, Store Unit; Commit path]

43 Acknowledgements
These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
MIT material derived from course 6.823
UCB material derived from course CS252

