1 Branch Prediction: Static and Dynamic Branch Prediction Techniques

2 Control Flow Penalty: Why Branch Prediction?
[Pipeline diagram: PC → Fetch (I-cache, fetch buffer) → Decode (issue buffer) → Execute (functional units, result buffer) → Commit (architectural state); the next fetch is started long before the branch is executed.]
Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution! The work lost when the pipeline makes a wrong prediction is roughly the length of this fetch-to-execute loop times the pipeline width.

3 Branch Penalties in a Superscalar Are Extensive

4 Reducing Control Flow Penalty
Software solutions:
- Minimize branches: loop unrolling increases the run length between branches
Hardware solutions:
- Find something else to do: delay slots
- Speculate: dynamic branch prediction and speculative execution of instructions beyond the branch

5 Branch Prediction
Motivation: branch penalties limit the performance of deeply pipelined processors, and are much worse for superscalar processors. Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly.
Required hardware support:
- Dynamic prediction hardware: branch history tables, branch target buffers, etc.
- Mispredict recovery mechanisms: keep computation results separate from commit, kill the instructions following the branch, and restore state to the point just after the branch

6 Static Branch Prediction (review)
The overall probability that a branch is taken is ~60-70%, but backward branches (e.g., a JZ at the bottom of a loop) are taken ~90% of the time, while forward branches are taken only ~50% of the time.
- The ISA can attach preferred-direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (preferred not taken).
- The ISA can allow an arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate.
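
The backward-taken/forward-not-taken heuristic implied by these numbers can be sketched in a few lines of C; this is an illustration only (the function and parameter names are invented here), not something the slides specify.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static BTFN heuristic: backward branches (loop branches) are predicted
 * taken, forward branches predicted not taken. Real ISAs instead encode the
 * preferred direction (or a hint bit) in the instruction itself. */
static bool predict_static(uint64_t branch_pc, uint64_t target_pc)
{
    return target_pc < branch_pc;   /* backward branch => predict taken */
}
```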

7 Branch Prediction Needs
Target address generation:
- Get register: PC, link register, GP register
- Calculate: +/- offset, auto increment/decrement
=> Target speculation
Condition resolution:
- Get register: condition code register, count register, other register
- Compare registers
=> Condition speculation

8 Target address generation takes time

9 Condition resolution takes time

10 Solution: Branch speculation

11 Branch Prediction Schemes
1. 2-bit branch-prediction buffer
2. Branch target buffer
3. Correlating branch-prediction buffer
4. Tournament branch predictor
5. Integrated instruction fetch units
6. Return address predictors (for subroutines; Pentium, Core Duo)
7. Predicated execution (Itanium)

12 Dynamic Branch Prediction: Learning Based on Past Behavior
The branch predictor takes an incoming stream of branch addresses, produces a fast outgoing stream of predictions, and accepts correction information returned from the pipeline.
[Diagram: incoming branches {address} → branch predictor (with history information) → predictions {address, value}; corrections {address, value} fed back from the pipeline.]

13 Branch History Table (BHT)
- Table of predictors: each branch is given its own predictor.
- The BHT is a table of predictors (1 bit or more each), indexed by the PC address of the branch.
- Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop (the average loop runs ~9 iterations before exit): once at the end of the loop, when it exits, and once the first time through the loop, when it predicts exit instead of looping (see the sketch below). Because of this, most schemes use at least 2-bit predictors.
- Performance = f(accuracy, cost of misprediction); a misprediction flushes the reorder buffer.
- In the fetch stage of a branch: use the predictor to make a prediction.
- When the branch completes: update the corresponding predictor.
[Diagram: the branch PC indexes the table of predictors 0..7.]
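
To make the two-mispredictions-per-loop argument concrete, here is a minimal, self-contained sketch (the names and the 9-iteration count are illustrative only) of a single 1-bit predictor watching the backward branch of a loop that is entered twice:

```c
#include <stdbool.h>
#include <stdio.h>

/* A 1-bit predictor remembers only the last outcome. For a 9-iteration loop
 * it mispredicts on loop exit, and again on the first iteration the next
 * time the loop is entered. */
int main(void)
{
    bool predictor = false;                  /* last outcome: initially "not taken" */
    int mispredicts = 0;

    for (int run = 0; run < 2; run++) {
        for (int iter = 0; iter < 9; iter++) {
            bool actual = (iter < 8);        /* loop branch: taken 8 times, not taken on exit */
            if (predictor != actual)
                mispredicts++;
            predictor = actual;              /* 1-bit update: remember the last outcome */
        }
    }
    printf("mispredictions: %d\n", mispredicts);   /* prints 4: two per pass through the loop */
    return 0;
}
```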

14 Branch History Table Organization
Target PC calculation takes time. A 4K-entry BHT with 2 bits per entry gives ~80-90% correct predictions.
[Diagram: k low-order bits of the fetch PC index a 2^k-entry BHT (2 bits/entry), which supplies the taken/not-taken prediction; the opcode and offset of the instruction fetched from the I-cache are used to detect whether it is a branch and to compute the target PC.]
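
A sketch of how such a 2^k-entry table might be indexed by low-order PC bits; the table size and names are illustrative (k = 12 gives the 4K entries mentioned above):

```c
#include <stdint.h>

#define BHT_BITS 12                        /* k = 12 -> 4K entries */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];              /* 2 bits of prediction state per entry */

/* Index the BHT with the low-order bits of the fetch PC, skipping the
 * byte-offset bits of a 4-byte instruction, so that nearby branches map to
 * different entries. */
static unsigned bht_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) & (BHT_SIZE - 1));
}
```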

15 2-Bit Dynamic Branch Prediction (more accurate than 1-bit)
Better solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions. This adds hysteresis to the decision-making process.
[State diagram: two predict-taken states and two predict-not-taken states; taken (T) outcomes move toward the strongly-taken state, not-taken (NT) outcomes move toward the strongly-not-taken state. Green states mean go/taken, red states mean stop/not taken.]
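
A minimal sketch of the 2-bit saturating-counter scheme; the particular encoding (0-1 = not taken, 2-3 = taken) is an assumption, since the slide only gives the four-state diagram:

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter: 0,1 => predict not taken; 2,3 => predict taken.
 * Two consecutive mispredictions are needed to flip the prediction, which is
 * the hysteresis described above. */
static bool predict_2bit(uint8_t counter)
{
    return counter >= 2;
}

static uint8_t update_2bit(uint8_t counter, bool taken)
{
    if (taken)
        return (counter < 3) ? counter + 1 : 3;   /* saturate at strongly taken */
    return (counter > 0) ? counter - 1 : 0;       /* saturate at strongly not taken */
}
```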

16 BTB: Branch Target Address Available at the Same Time as the Prediction
Branch Target Buffer (BTB): the address of the branch indexes the table to get the prediction AND the branch target address (if taken).
- If the PC of the fetched instruction matches a branch PC stored in the BTB: the instruction is a branch, and the predicted PC is used as the next PC.
- If not: the branch is not predicted; proceed normally (next PC = PC + 4).
- Only predicted-taken branches and jumps are held in the BTB.
- The next PC is determined before the branch is fetched and decoded.
- Later: check the prediction; if it was wrong, kill the instructions and update the prediction state.
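
A sketch of the BTB lookup just described; the entry layout and sizes are assumptions, but the behaviour follows the slide: a hit means "this is a predicted-taken branch, redirect fetch to the stored target", a miss means next PC = PC + 4.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_SIZE 512                        /* illustrative size */

struct btb_entry {
    bool     valid;
    uint64_t branch_pc;                     /* tag: PC of the branch */
    uint64_t target_pc;                     /* predicted target if taken */
};

static struct btb_entry btb[BTB_SIZE];

/* Look up the fetch PC. On a hit, fetch is redirected to the stored target
 * before the instruction is even decoded; on a miss, fall through to PC + 4. */
static uint64_t next_fetch_pc(uint64_t pc)
{
    struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
    if (e->valid && e->branch_pc == pc)
        return e->target_pc;                /* predicted-taken branch or jump */
    return pc + 4;                          /* not in the BTB */
}
```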

17 BTB Contains Only Branch and Jump Instructions
The BTB holds information for branch and jump instructions only; it is not updated for other instructions. For all other instructions the next PC is PC + 4, and this is achieved without decoding the instruction.

18 Combining BTB and BHT
- BTB entries are considerably more expensive than BHT entries, but the BTB redirects fetch earlier in the pipeline and can accelerate indirect branches (JR).
- The BHT can hold many more entries and is more accurate.
[Pipeline diagram: A = PC generation/mux (BTB consulted here), P = instruction fetch stage 1, F = instruction fetch stage 2, B = branch address calc/begin decode (BHT consulted here), I = complete decode, J = steer instructions to functional units, R = register file read, E = integer execute.]
- The BHT, sitting in a later pipeline stage, corrects fetch when the BTB misses a predicted-taken branch (see the sketch below).
- The BTB and BHT are only updated after the branch resolves in the E stage.
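
The late correction can be sketched as follows; this is only an interpretation of the slide's override idea, and the function, its parameters, and the exact policy are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* At the B stage the instruction is known to be a branch, its target has been
 * computed, and the BHT prediction (bht_taken) is available. btb_next_pc is
 * what the BTB chose back at PC generation. */
static uint64_t redirect_after_decode(uint64_t branch_pc, uint64_t btb_next_pc,
                                      bool bht_taken, uint64_t decoded_target)
{
    if (bht_taken && btb_next_pc == branch_pc + 4)
        return decoded_target;      /* BTB missed a predicted-taken branch: re-steer from decode */
    if (!bht_taken && btb_next_pc != branch_pc + 4)
        return branch_pc + 4;       /* BTB predicted taken but the BHT disagrees */
    return btb_next_pc;             /* predictions agree: keep the BTB's path */
}
```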

19 Subroutine Return Stack
A small stack accelerates subroutine returns and is more accurate for them than a BTB.
- Push the return address when a function call executes.
- Pop the return address when a subroutine return is decoded.
[Diagram: stack holding return addresses &nexta, &nextb, &nextc; k entries, typically k = 8-16.]
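
A sketch of the return address stack's push/pop behaviour; the wrap-around overflow policy and the names are assumptions, while the 16-entry depth matches the typical size quoted above:

```c
#include <stdint.h>

#define RAS_DEPTH 16                        /* k entries, typically 8-16 */

static uint64_t ras[RAS_DEPTH];
static unsigned ras_top;

/* Push the return address when a call executes. */
static void ras_push(uint64_t return_pc)
{
    ras[ras_top] = return_pc;
    ras_top = (ras_top + 1) % RAS_DEPTH;    /* overflow simply wraps (one possible policy) */
}

/* Pop the predicted return target when a return is decoded. */
static uint64_t ras_pop(void)
{
    ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
    return ras[ras_top];
}
```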

20 Mispredict Recovery
In-order execution machines:
- Instructions issued after the branch cannot write back before the branch resolves.
- All instructions in the pipeline behind a mispredicted branch are killed.

21 Predicated Execution
Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP.
- If the condition is false, neither store the result nor cause an exception.
- The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction.
- IA-64: 64 1-bit condition fields can be selected, so any instruction can be conditionally executed.
- This transformation is called "if-conversion" (illustrated in the sketch below).
Drawbacks of conditional instructions:
- They still take a clock cycle even if "annulled".
- They stall if the condition is evaluated late.
- Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline.
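
A source-level illustration of if-conversion (the function names are invented; at the machine level this becomes a conditional-move or predicated instruction rather than C code):

```c
/* Original control flow: a branch the predictor might mispredict. */
int select_branchy(int x, int a, int b, int c)
{
    if (x)
        a = b + c;          /* "A = B op C" from the slide, with op = + */
    return a;
}

/* After if-conversion: no branch. The result is computed unconditionally and
 * then selected; compilers typically lower the select to a conditional move.
 * Note it still costs a cycle even when x is false ("annulled"). */
int select_predicated(int x, int a, int b, int c)
{
    int t = b + c;
    return x ? t : a;
}
```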

22 Accuracy vs. Size (SPEC89)

23 Dynamic Branch Prediction Summary
- Prediction is becoming an important part of scalar execution.
- Branch history table: 2 bits per entry for loop accuracy.
- Correlation: recently executed branches are correlated with the next branch.
- Tournament predictor: give more resources to competing schemes and pick between them.
- Branch target buffer: include the branch target address along with the prediction.
- Predicated execution can reduce the number of branches and the number of mispredicted branches.
- Return address stack for prediction of indirect jumps (subroutine returns).

