1 Branch Prediction: Static and Dynamic Branch Prediction Techniques

2 Control Flow Penalty: Why Branch Prediction?
[Pipeline diagram: PC → Fetch (I-cache, fetch buffer) → Decode (issue buffer) → Execute (functional units, result buffer) → Commit (architectural state); the next fetch is started long before the branch is executed.]
Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution! The work lost when the pipeline makes a wrong prediction is roughly the length of this fetch-to-execute loop times the pipeline width.

3 Branch Penalties in a Superscalar Are Extensive

4 Reducing Control Flow Penalty
Software solutions:
- Minimize branches: loop unrolling increases the run length between branches
Hardware solutions:
- Find something else to do: delay slots
- Speculate: dynamic branch prediction and speculative execution of instructions beyond the branch

5 Branch Prediction
Motivation: branch penalties limit the performance of deeply pipelined processors, and are much worse for superscalar processors. Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly.
Required hardware support:
- Dynamic prediction hardware: branch history tables, branch target buffers, etc.
- Mispredict recovery mechanisms: keep computation results separate from commit, kill the instructions following the branch, and restore state to the point just after the branch

6 Static Branch Prediction (review)
The overall probability that a branch is taken is ~60-70%, but backward branches (e.g., a JZ at the bottom of a loop) are taken ~90% of the time, while forward branches are taken only ~50% of the time.
- The ISA can attach preferred-direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (preferred not taken).
- The ISA can allow an arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate.
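
The backward-taken/forward-not-taken heuristic implied by these numbers can be sketched in a few lines of C; this is an illustration only (the function and parameter names are invented here), not something the slides specify.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static BTFN heuristic: backward branches (loop branches) are predicted
 * taken, forward branches predicted not taken. Real ISAs instead encode the
 * preferred direction (or a hint bit) in the instruction itself. */
static bool predict_static(uint64_t branch_pc, uint64_t target_pc)
{
    return target_pc < branch_pc;   /* backward branch => predict taken */
}
```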

7 Branch Prediction Needs
Target address generation:
- Get register: PC, link register, GP register
- Calculate: +/- offset, auto increment/decrement
=> Target speculation
Condition resolution:
- Get register: condition code register, count register, other register
- Compare registers
=> Condition speculation

8 Target address generation takes time

9 Condition resolution takes time

10 Solution: Branch speculation

11 Branch Prediction Schemes
1. 2-bit branch-prediction buffer
2. Branch target buffer
3. Correlating branch-prediction buffer
4. Tournament branch predictor
5. Integrated instruction fetch units
6. Return address predictors (for subroutines; Pentium, Core Duo)
7. Predicated execution (Itanium)

12 Dynamic Branch Prediction: Learning Based on Past Behavior
The branch predictor takes an incoming stream of branch addresses, produces a fast outgoing stream of predictions, and accepts correction information returned from the pipeline.
[Diagram: incoming branches {address} → branch predictor (with history information) → predictions {address, value}; corrections {address, value} fed back from the pipeline.]

13 Branch History Table (BHT)
- Table of predictors: each branch is given its own predictor.
- The BHT is a table of predictors (1 bit or more each), indexed by the PC address of the branch.
- Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop (the average loop runs ~9 iterations before exit): once at the end of the loop, when it exits, and once the first time through the loop, when it predicts exit instead of looping (see the sketch below). Because of this, most schemes use at least 2-bit predictors.
- Performance = f(accuracy, cost of misprediction); a misprediction flushes the reorder buffer.
- In the fetch stage of a branch: use the predictor to make a prediction.
- When the branch completes: update the corresponding predictor.
[Diagram: the branch PC indexes the table of predictors 0..7.]
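
To make the two-mispredictions-per-loop argument concrete, here is a minimal, self-contained sketch (the names and the 9-iteration count are illustrative only) of a single 1-bit predictor watching the backward branch of a loop that is entered twice:

```c
#include <stdbool.h>
#include <stdio.h>

/* A 1-bit predictor remembers only the last outcome. For a 9-iteration loop
 * it mispredicts on loop exit, and again on the first iteration the next
 * time the loop is entered. */
int main(void)
{
    bool predictor = false;                  /* last outcome: initially "not taken" */
    int mispredicts = 0;

    for (int run = 0; run < 2; run++) {
        for (int iter = 0; iter < 9; iter++) {
            bool actual = (iter < 8);        /* loop branch: taken 8 times, not taken on exit */
            if (predictor != actual)
                mispredicts++;
            predictor = actual;              /* 1-bit update: remember the last outcome */
        }
    }
    printf("mispredictions: %d\n", mispredicts);   /* prints 4: two per pass through the loop */
    return 0;
}
```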

14 Branch History Table Organization
Target PC calculation takes time. A 4K-entry BHT with 2 bits per entry gives ~80-90% correct predictions.
[Diagram: k low-order bits of the fetch PC index a 2^k-entry BHT (2 bits/entry), which supplies the taken/not-taken prediction; the opcode and offset of the instruction fetched from the I-cache are used to detect whether it is a branch and to compute the target PC.]
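
A sketch of how such a 2^k-entry table might be indexed by low-order PC bits; the table size and names are illustrative (k = 12 gives the 4K entries mentioned above):

```c
#include <stdint.h>

#define BHT_BITS 12                        /* k = 12 -> 4K entries */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];              /* 2 bits of prediction state per entry */

/* Index the BHT with the low-order bits of the fetch PC, skipping the
 * byte-offset bits of a 4-byte instruction, so that nearby branches map to
 * different entries. */
static unsigned bht_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) & (BHT_SIZE - 1));
}
```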

15 2-Bit Dynamic Branch Prediction (more accurate than 1-bit)
Better solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions. This adds hysteresis to the decision-making process.
[State diagram: two predict-taken states and two predict-not-taken states; taken (T) outcomes move toward the strongly-taken state, not-taken (NT) outcomes move toward the strongly-not-taken state. Green states mean go/taken, red states mean stop/not taken.]
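
A minimal sketch of the 2-bit saturating-counter scheme; the particular encoding (0-1 = not taken, 2-3 = taken) is an assumption, since the slide only gives the four-state diagram:

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter: 0,1 => predict not taken; 2,3 => predict taken.
 * Two consecutive mispredictions are needed to flip the prediction, which is
 * the hysteresis described above. */
static bool predict_2bit(uint8_t counter)
{
    return counter >= 2;
}

static uint8_t update_2bit(uint8_t counter, bool taken)
{
    if (taken)
        return (counter < 3) ? counter + 1 : 3;   /* saturate at strongly taken */
    return (counter > 0) ? counter - 1 : 0;       /* saturate at strongly not taken */
}
```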

16 BTB: Branch Target Address Available at the Same Time as the Prediction
Branch Target Buffer (BTB): the address of the branch indexes the table to get the prediction AND the branch target address (if taken).
- If the PC of the fetched instruction matches a branch PC stored in the BTB: the instruction is a branch, and the predicted PC is used as the next PC.
- If not: the branch is not predicted; proceed normally (next PC = PC + 4).
- Only predicted-taken branches and jumps are held in the BTB.
- The next PC is determined before the branch is fetched and decoded.
- Later: check the prediction; if it was wrong, kill the instructions and update the prediction state.
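
A sketch of the BTB lookup just described; the entry layout and sizes are assumptions, but the behaviour follows the slide: a hit means "this is a predicted-taken branch, redirect fetch to the stored target", a miss means next PC = PC + 4.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_SIZE 512                        /* illustrative size */

struct btb_entry {
    bool     valid;
    uint64_t branch_pc;                     /* tag: PC of the branch */
    uint64_t target_pc;                     /* predicted target if taken */
};

static struct btb_entry btb[BTB_SIZE];

/* Look up the fetch PC. On a hit, fetch is redirected to the stored target
 * before the instruction is even decoded; on a miss, fall through to PC + 4. */
static uint64_t next_fetch_pc(uint64_t pc)
{
    struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
    if (e->valid && e->branch_pc == pc)
        return e->target_pc;                /* predicted-taken branch or jump */
    return pc + 4;                          /* not in the BTB */
}
```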

17 BTB Contains Only Branch and Jump Instructions
The BTB holds information for branch and jump instructions only; it is not updated for other instructions. For all other instructions the next PC is PC + 4, and this is achieved without decoding the instruction.

18 Combining BTB and BHT
- BTB entries are considerably more expensive than BHT entries, but the BTB redirects fetch earlier in the pipeline and can accelerate indirect branches (JR).
- The BHT can hold many more entries and is more accurate.
[Pipeline diagram: A = PC generation/mux (BTB consulted here), P = instruction fetch stage 1, F = instruction fetch stage 2, B = branch address calc/begin decode (BHT consulted here), I = complete decode, J = steer instructions to functional units, R = register file read, E = integer execute.]
- The BHT, sitting in a later pipeline stage, corrects fetch when the BTB misses a predicted-taken branch (see the sketch below).
- The BTB and BHT are only updated after the branch resolves in the E stage.
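
The late correction can be sketched as follows; this is only an interpretation of the slide's override idea, and the function, its parameters, and the exact policy are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* At the B stage the instruction is known to be a branch, its target has been
 * computed, and the BHT prediction (bht_taken) is available. btb_next_pc is
 * what the BTB chose back at PC generation. */
static uint64_t redirect_after_decode(uint64_t branch_pc, uint64_t btb_next_pc,
                                      bool bht_taken, uint64_t decoded_target)
{
    if (bht_taken && btb_next_pc == branch_pc + 4)
        return decoded_target;      /* BTB missed a predicted-taken branch: re-steer from decode */
    if (!bht_taken && btb_next_pc != branch_pc + 4)
        return branch_pc + 4;       /* BTB predicted taken but the BHT disagrees */
    return btb_next_pc;             /* predictions agree: keep the BTB's path */
}
```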

19 Subroutine Return Stack
A small stack accelerates subroutine returns and is more accurate for them than a BTB.
- Push the return address when a function call executes.
- Pop the return address when a subroutine return is decoded.
[Diagram: stack holding return addresses &nexta, &nextb, &nextc; k entries, typically k = 8-16.]
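
A sketch of the return address stack's push/pop behaviour; the wrap-around overflow policy and the names are assumptions, while the 16-entry depth matches the typical size quoted above:

```c
#include <stdint.h>

#define RAS_DEPTH 16                        /* k entries, typically 8-16 */

static uint64_t ras[RAS_DEPTH];
static unsigned ras_top;

/* Push the return address when a call executes. */
static void ras_push(uint64_t return_pc)
{
    ras[ras_top] = return_pc;
    ras_top = (ras_top + 1) % RAS_DEPTH;    /* overflow simply wraps (one possible policy) */
}

/* Pop the predicted return target when a return is decoded. */
static uint64_t ras_pop(void)
{
    ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
    return ras[ras_top];
}
```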

20 Mispredict Recovery
In-order execution machines:
- Instructions issued after the branch cannot write back before the branch resolves.
- All instructions in the pipeline behind a mispredicted branch are killed.

21 Predicated Execution
Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP.
- If the condition is false, neither store the result nor cause an exception.
- The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction.
- IA-64: 64 1-bit condition fields can be selected, so any instruction can be conditionally executed.
- This transformation is called "if-conversion" (illustrated in the sketch below).
Drawbacks of conditional instructions:
- They still take a clock cycle even if "annulled".
- They stall if the condition is evaluated late.
- Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline.
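
A source-level illustration of if-conversion (the function names are invented; at the machine level this becomes a conditional-move or predicated instruction rather than C code):

```c
/* Original control flow: a branch the predictor might mispredict. */
int select_branchy(int x, int a, int b, int c)
{
    if (x)
        a = b + c;          /* "A = B op C" from the slide, with op = + */
    return a;
}

/* After if-conversion: no branch. The result is computed unconditionally and
 * then selected; compilers typically lower the select to a conditional move.
 * Note it still costs a cycle even when x is false ("annulled"). */
int select_predicated(int x, int a, int b, int c)
{
    int t = b + c;
    return x ? t : a;
}
```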

22 Accuracy vs. Size (SPEC89)

23 Dynamic Branch Prediction Summary
- Prediction is becoming an important part of scalar execution.
- Branch history table: 2 bits per entry for loop accuracy.
- Correlation: recently executed branches are correlated with the next branch.
- Tournament predictor: give more resources to competing schemes and pick between them.
- Branch target buffer: include the branch target address along with the prediction.
- Predicated execution can reduce the number of branches and the number of mispredicted branches.
- Return address stack for prediction of indirect jumps (subroutine returns).

