Branch Prediction High-Performance Computer Architecture Joe Crop Oregon State University School of Electrical Engineering and Computer Science.

1 Branch Prediction High-Performance Computer Architecture Joe Crop Oregon State University School of Electrical Engineering and Computer Science

2 Chapter 2: A Five Stage RISC Pipeline: Control Hazard
        beq r1,r3,label
        and r2,r3,r5
        or  r6,r1,r7
        add r8,r1,r9
label:  xor r10,r1,r11
[Pipeline diagram: each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]

3 Branch Penalty Impact
If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then new CPI = 1 + 0.3 × 3 = 1.9!
Two-part solution:
– Determine whether the branch is taken sooner, AND
– Compute the taken-branch address earlier
MIPS branch tests if register = 0 or ≠ 0
– beqz R4, name
MIPS solution:
– Move the zero test to the ID/RF stage
– Add an adder to calculate the new PC in the ID/RF stage
– 1 clock cycle penalty for a branch versus 3

4 Modified MIPS Datapath
[Figure: five-stage datapath with stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB registers. Labeled blocks: Instruction Memory, Register File, Sign Extend, Zero?, two Adders, ALU, Data Memory, and MUXes selecting Next PC and WB Data.]

5 Branch Resolved in ID Stage
        beq r1,r3,label
        and r2,r3,r5
label:  xor r10,r1,r11
        …
        …
[Pipeline diagram: each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]

6 Branch Prediction
Predict Branch Not Taken
– Execute successor instructions in sequence.
– “Squash” instructions in the pipeline if the branch is actually taken.
– 47% of MIPS branches are not taken on average.
– PC+4 is already calculated, so use it to get the next instruction.
Predict Branch Taken
– 53% of MIPS branches are taken on average.
– But the branch target address has not been calculated yet:
  MIPS still incurs a 1-cycle branch penalty.
  Other machines: branch target known before outcome.
Delay Branch Technique

7 Delay Branches
This technique relies on software to make the delay slots valid and useful. The n instructions after the branch are executed regardless of whether the branch is taken.
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken
The sequential successors form a branch delay of length n.
One delay slot allows the branch decision and the branch target address to be resolved in a 5-stage pipeline. MIPS uses this.

8 Performance Effect of Branch Penalty
Let
    $p_b$ = the probability that an instruction is a branch
    $p_t$ = the probability that a branch is taken
    $b$ = the branch penalty
    CPI = the average number of cycles per instruction.
Then
    $\mathrm{CPI} = (1 - p_b) + p_b\,[\,p_t(1 + b) + (1 - p_t)\,]$
    $\mathrm{CPI} = 1 + b\,p_t\,p_b$
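The two forms of the CPI equation can be checked numerically. The following is a minimal sketch; the function name `cpi_with_branch_penalty` is an illustrative assumption, not something from the slides.

```python
def cpi_with_branch_penalty(p_b, p_t, b):
    """Average CPI when only taken branches pay a penalty of b cycles."""
    # Long form: non-branches cost 1 cycle, taken branches 1 + b, others 1.
    long_form = (1 - p_b) + p_b * (p_t * (1 + b) + (1 - p_t))
    # Simplified form from the slide.
    short_form = 1 + b * p_t * p_b
    assert abs(long_form - short_form) < 1e-12
    return short_form


# Stalling on every branch corresponds to p_t = 1:
# 30% branches with a 3-cycle stall give the CPI of 1.9 quoted earlier.
print(round(cpi_with_branch_penalty(0.3, 1.0, 3), 2))
```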

9 Delay Branch Technique

10 Delay Branch Technique (1): “From before”
        A := B + C
        If B > C Then Goto Next
        Delay Slot
        ...
Next:
becomes
        If B > C Then Goto Next
        A := B + C
        ....
Next:

11 Delay Branch Technique (2): “From target”
Next:   X := Y * Z
        ...
        B := A + C
        If B > C Then Goto Next
        Delay Slot
becomes
        X := Y * Z
Next:   ...
        ...
        B := A + C
        If B > C Then Goto Next
        X := Y * Z
Must be OK to execute when not taken. May need to duplicate.

12 Delay Branch Technique (3): “From fall through”
        B := A + C
        If B > C Then Goto Next
        Delay Slot
        X := Y * Z
        ...
Next:
becomes
        B := A + C
        If B > C Then Goto Next
        X := Y * Z
        ...
Next:
Must be OK to execute when taken.

13 Delay Branch Technique (cont.)
The performance of delay branches can be modeled by the following equation:
    $\mathrm{CPI} = 1 + b\,p_b\,p_{nop}$
where $p_{nop}$ is the fraction of the $b$ delay slots filled with nops. Thus, if $f_i$ is the probability that delay slot $i$ is filled with a useful instruction, then
    $p_{nop} = 1 - (f_1 + f_2 + \dots + f_b)/b$
Example: suppose $b = 4$, $f_1 = 0.6$, $f_2 = 0.1$, $f_3 = f_4 = 0$, $p_b = 0.2$. Then $p_{nop} = 1 - 0.7/4 = 0.825$ and
    $\mathrm{CPI} = 1 + 4 \times 0.2 \times 0.825 = 1.66$
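The delay-slot model is easy to evaluate for other fill rates. This is a small sketch under the slide's definitions; the names `p_nop` and `cpi_delayed` are chosen here for illustration.

```python
def p_nop(fill_probs):
    """Fraction of delay slots holding nops, given f_1..f_b."""
    b = len(fill_probs)
    return 1 - sum(fill_probs) / b


def cpi_delayed(p_b, fill_probs):
    """CPI = 1 + b * p_b * p_nop for b delay slots."""
    b = len(fill_probs)
    return 1 + b * p_b * p_nop(fill_probs)


# Slide example: b = 4, f = (0.6, 0.1, 0, 0), p_b = 0.2.
print(round(p_nop([0.6, 0.1, 0.0, 0.0]), 3))
print(round(cpi_delayed(0.2, [0.6, 0.1, 0.0, 0.0]), 2))
```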

14 Delay Branch Technique (cont.)
The concept of squashing or annulling can be used in conjunction with delay branches.
        X := Y * Z
Next:   ...
        …
        B := A + C
        If B > C Then Goto Next
        X := Y * Z    => This instruction is nullified
bne,a rs,rt,label
    a bit    Branch outcome    Delay inst. executed?
    -        taken             yes
    -        not taken         yes
    a        taken             yes
    a        not taken         no (annulled)

15 Delay Branch Technique (cont.)
For processors with this capability, the performance can be modeled as
    $\mathrm{CPI} = 1 + b\,p_b\,[\,p_{nop}(1 - p_{null}) + p_{null}\,]$
where $p_{null} = (1 - p_t)$ for nullify-on-branch-not-taken.
Suppose $b = 4$, $f_1 = 0.8$, $f_2 = 0.3$, $f_3 = 0.1$, $f_4 = 0$, $p_b = 0.2$, $p_{null} = 0.35$. Then $p_{nop} = 1 - 1.2/4 = 0.7$ and
    $\mathrm{CPI} = 1 + 4 \times 0.2 \times (0.7 \times 0.65 + 0.35) = 1.644$
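The annulling model can be evaluated the same way. The sketch below assumes the slide's formula as written; `cpi_annulling` is an illustrative name.

```python
def cpi_annulling(p_b, fill_probs, p_null):
    """CPI = 1 + b * p_b * (p_nop * (1 - p_null) + p_null)."""
    b = len(fill_probs)
    p_nop = 1 - sum(fill_probs) / b
    # A slot is wasted if it is annulled, or if it is not annulled but holds a nop.
    wasted = p_nop * (1 - p_null) + p_null
    return 1 + b * p_b * wasted


# Slide example: b = 4, f = (0.8, 0.3, 0.1, 0), p_b = 0.2, p_null = 0.35.
print(round(cpi_annulling(0.2, [0.8, 0.3, 0.1, 0.0], 0.35), 3))
```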

16 Delayed Branch Performance
Compiler effectiveness for a single branch delay slot:
– Fills about 60% of branch delay slots.
– About 80% of instructions executed in branch delay slots are useful in computation.
– About 50% (60% x 80%) of slots are usefully filled.

17 Evaluating Branch Alternatives
Suppose conditional & unconditional branches = 14% of instructions, and 65% of them change the PC.
Prediction scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
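The CPI column can be reproduced from the two percentages. The sketch below is one reading of the table, not the author's stated method: it assumes that stall, predict-taken, and delayed-branch pay their penalty on every branch, while predict-not-taken pays only on the 65% of branches that change the PC.

```python
BRANCH_FREQ = 0.14   # conditional + unconditional branches
CHANGE_PC = 0.65     # fraction of branches that change the PC


def cpi(penalty, fraction_paying=1.0):
    """Base CPI of 1 plus the average branch penalty per instruction."""
    return 1 + BRANCH_FREQ * fraction_paying * penalty


schemes = {
    "Stall pipeline": cpi(3),
    "Predict taken": cpi(1),
    "Predict not taken": cpi(1, CHANGE_PC),  # assumed: penalty only when PC changes
    "Delayed branch": cpi(0.5),
}

for name, value in schemes.items():
    print(f"{name:18s} CPI = {value:.2f}")
```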

18 Reducing Branch Penalty
Branch penalty in dynamically scheduled processors: wasted cycles due to pipeline flushing on mispredicted branches.
Reduce branch penalty:
1. Predict branch/jump instructions AND branch direction (taken or not taken)
2. Predict branch/jump target address (for taken branches)
3. Speculatively execute instructions along the predicted path

19 What to Use and What to Predict
Available info:
– Current predicted PC
– Past branch history (direction and target)
What to predict:
– Conditional branch inst: branch direction and target address
– Jump inst: target address
– Procedure call/return: target address
May need instructions pre-decoded.
[Figure labels: PC, IM, Predictors, pred_PC, pred info, feedback, PC & Inst.]

20 Mis-prediction Detections and Feedbacks
Detections:
At the end of decoding
– Target address is known at decoding and does not match
– Flush the fetch stage
At commit (most cases)
– Wrong branch direction, or target address does not match
– Flush the whole pipeline
Feedbacks:
– Any time a mis-prediction is detected
– At a branch’s commit (at EXE: called speculative update)
[Figure labels: FETCH, RENAME, SCHD, REB/ROB, COMMIT, WB, EXE, predictors.]

21 Branch Direction Prediction
Predict branch direction: taken or not taken (T/NT)
        BNE R1, R2, L1
        …                (not taken)
L1:     …                (taken)
Static prediction: compilers decide the direction
Dynamic prediction: hardware decides the direction using dynamic information
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. and more …

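22 Predictor for a Single Branch
[Figure, general form: (1) access the predictor state with the PC, (2) predict output T/NT, (3) feedback T/NT.]
[Figure, 1-bit prediction: two states, 1 (predict taken) and 0; outcome T leads to state 1 and outcome NT leads to state 0.]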

23 Branch History Table of 1-bit Predictor
– The BHT is also called the Branch Prediction Buffer in the textbook
– A single 1-bit predictor can be used, but accuracy is low
– BHT: use a table of simple predictors, indexed by bits from the PC
– Similar to a direct-mapped cache
– More entries mean more cost, but fewer conflicts and higher accuracy
– A BHT can contain complex predictors
[Figure: k bits of the branch address index a table of $2^k$ entries, which outputs the prediction.]
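A table of 1-bit predictors indexed by PC bits can be sketched in a few lines. This is an illustrative model only: the class name `OneBitBHT`, the 4-byte instruction alignment, and the not-taken initial state are assumptions, not details from the slides.

```python
class OneBitBHT:
    """Branch history table of 1-bit predictors, indexed like a direct-mapped cache."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.table = [False] * (1 << k)  # False = predict not taken (assumed initial state)

    def _index(self, pc):
        # Assumes 4-byte aligned instructions, so the low 2 PC bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        # A 1-bit predictor simply remembers the last outcome.
        self.table[self._index(pc)] = taken


bht = OneBitBHT(k=4)
bht.update(0x1000, True)
# With no tags, 0x1040 maps to the same entry as 0x1000 and inherits its prediction.
print(bht.predict(0x1040), bht.predict(0x1004))
```

Because the table has no tags, two branches whose index bits match share an entry. That is the conflict the slide refers to, and more entries make it less likely.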

24 1-bit BHT Weakness
Example: in a loop, a 1-bit BHT causes 2 mispredictions per pass.
Consider a loop of 9 iterations before exit:
    for (…){
        for (i=0; i<9; i++)
            a[i] = a[i] * 2.0;
    }
– End-of-loop case: the loop exits instead of looping as before
– First iteration on the next pass through the code: the predictor predicts exit instead of looping
– Only 80% accuracy even if the branch loops 90% of the time
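The 80% figure can be reproduced with a short simulation. This sketch assumes the loop branch is taken 9 times and then falls through once on each pass; `one_bit_accuracy` is an illustrative name.

```python
def one_bit_accuracy(outcomes, initial=True):
    """Accuracy of a single 1-bit predictor that predicts the last outcome."""
    prediction = initial
    correct = 0
    for taken in outcomes:
        correct += (prediction == taken)
        prediction = taken  # remember only the most recent outcome
    return correct / len(outcomes)


# Assumed branch behavior: taken 9 times, then not taken once, for 100 passes.
loop_branch = ([True] * 9 + [False]) * 100
print(f"{one_bit_accuracy(loop_branch):.1%}")
```

In steady state each pass mispredicts twice: once on the exit and once on re-entry. The result is therefore close to 80% rather than the 90% taken rate.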

25 2-bit Saturating Counter
Solution: a 2-bit scheme that changes the prediction only if it mispredicts twice (Figure 3.7, p. 249).
– Gray: stop, not taken
– Blue: go, taken
– Adds hysteresis to the decision-making process
[State diagram: states 11 and 10 predict taken; states 01 and 00 predict not taken; transitions are labeled T and NT.]
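A 2-bit saturating counter is small enough to model directly. The following is a sketch, assuming the plain saturating-counter encoding (increment on taken, decrement on not taken, predict taken in states 10 and 11); the class name is illustrative.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""

    def __init__(self, state=3):
        self.state = state  # 3 = binary 11, strongly taken (assumed initial state)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(counter, outcomes):
    correct = 0
    for taken in outcomes:
        correct += (counter.predict() == taken)
        counter.update(taken)
    return correct / len(outcomes)


# Same assumed loop branch: taken 9 times, then not taken once, for 100 passes.
loop_branch = ([True] * 9 + [False]) * 100
print(f"{accuracy(TwoBitCounter(), loop_branch):.1%}")
```

The single not-taken outcome at loop exit only moves the counter from 11 to 10, so the first iteration of the next pass is still predicted taken. That leaves one misprediction per pass.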

26 Correlating Branches
Code example showing the potential:
        if (d==0) d=1;
        if (d==1) …
Assembly code:
        BNEZ   R1, L1
        DADDIU R1,R0,#1
L1:     DADDIU R3,R1,#-1
        BNEZ   R3, L2
L2:     …
Observation: if the first BNEZ is not taken, then d is set to 1, R3 becomes 0, and the second BNEZ is not taken either.

27 Chapter 3 - Exploiting ILP: (1, 1) Predictor
(1,1) predictor: last branch, 1-bit prediction.
We use a pair of bits, where the first bit is the prediction if the last branch in the program was not taken, and the second bit is the prediction if the last branch was taken.
Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
NT/NT             Not taken                             Not taken
NT/T              Not taken                             Taken
T/NT              Taken                                 Not taken
T/T               Taken                                 Taken

28 (1, 1) Predictor: Example
Consider the following code, assuming d is assigned to R1.
        if (d==0) d=1;
        if (d==1)

        bnez R1,L1       ; branch b1 (d!=0)
        addi R1,R0,#1    ; d==0, so d=1
L1:     subi R3,R1,#1
        bnez R3,L2       ; branch b2 (d!=1)
        ...
L2:
Suppose d alternates between 2 and 0, and the (1, 1) predictor is initialized to not taken. Brackets mark the prediction used (bold in the original slide). The only mispredictions are on the first iteration, when d=2, because b1 was not correlated with the previous prediction of b2.
d=?   b1 pred     b1 action   new b1 pred   b2 pred     b2 action   new b2 pred
2     [NT]/NT     T           T/NT          NT/[NT]     T           NT/T
0     T/[NT]      NT          T/NT          [NT]/T      NT          NT/T
2     [T]/NT      T           T/NT          NT/[T]      T           NT/T
0     T/[NT]      NT          T/NT          [NT]/T      NT          NT/T
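The table can be replayed in code. This is a sketch under stated assumptions: each branch has its own pair of 1-bit predictions, the pair is selected by the outcome of the most recently executed branch, and the initial "last branch" outcome is taken to be not taken. The function name `simulate_1_1` is illustrative.

```python
def simulate_1_1(d_values):
    """Return the number of mispredictions (b1 plus b2) for each value of d."""
    # pred[name][0] is used when the last branch was not taken, pred[name][1] when taken.
    pred = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # assumed initial global history
    mispredictions = []
    for d in d_values:
        b1_taken = (d != 0)       # bnez R1,L1
        if not b1_taken:
            d = 1                 # fall through: d = 1
        b2_taken = (d != 1)       # bnez R3,L2 with R3 = d - 1
        wrong = 0
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last_taken)
            if pred[name][slot] != taken:
                wrong += 1
            pred[name][slot] = taken  # 1-bit update of the selected entry
            last_taken = taken
        mispredictions.append(wrong)
    return mispredictions


print(simulate_1_1([2, 0, 2, 0]))
```

Both branches miss on the first value of d while the predictor warms up; after that, the outcome of the previous branch selects the correct entry every time.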

29 (1, 1) Predictor: Example
If we had used a 1-bit predictor, all the branches would have been mispredicted!
d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2     NT        T           T             NT        T           T
0     T         NT          NT            T         NT          NT
2     NT        T           T             NT        T           T
0     T         NT          NT            T         NT          NT
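The same trace with plain 1-bit predictors shows the contrast. A minimal sketch, assuming one 1-bit predictor per branch initialized to not taken; `simulate_one_bit` is an illustrative name.

```python
def simulate_one_bit(d_values):
    """Return one boolean per executed branch: True if it was predicted correctly."""
    pred = {"b1": False, "b2": False}  # assumed initial state: not taken
    results = []
    for d in d_values:
        b1_taken = (d != 0)
        if not b1_taken:
            d = 1
        b2_taken = (d != 1)
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            results.append(pred[name] == taken)
            pred[name] = taken  # remember only the last outcome
    return results


print(simulate_one_bit([2, 0, 2, 0]))
```

Each branch alternates between taken and not taken, so a predictor that repeats the last outcome is wrong every time.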

30 (m, n) Predictor
In general, an (m,n) predictor uses the behavior of the last m branches (kept in a shift register) to choose from $2^m$ branch predictors, each of which is an n-bit predictor for a single branch.
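The general scheme can be sketched as a table of counters with $2^m$ columns per branch entry. This is an illustrative model, not a specific hardware design: the class name, the PC indexing (4-byte aligned instructions), and the zero-initialized counters are assumptions.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m n-bit counters."""

    def __init__(self, m, n, index_bits):
        self.m, self.n = m, n
        self.history = 0                      # shift register of the last m outcomes
        self.hist_mask = (1 << m) - 1
        self.index_mask = (1 << index_bits) - 1
        self.max_count = (1 << n) - 1
        # One row per branch-address index, 2**m counters per row.
        self.counters = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def _row(self, pc):
        return self.counters[(pc >> 2) & self.index_mask]

    def predict(self, pc):
        # Predict taken when the counter is in its upper half.
        return self._row(pc)[self.history] > self.max_count // 2

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.history]
        row[self.history] = min(self.max_count, count + 1) if taken else max(0, count - 1)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask

    def size_bits(self):
        return (1 << self.m) * self.n * len(self.counters)


predictor = CorrelatingPredictor(m=2, n=2, index_bits=10)
print(predictor.size_bits())  # 2**2 counters x 2 bits x 1024 entries
```

Setting m = 0 reduces this to an ordinary table of n-bit counters, which makes the two designs easy to compare at equal storage.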

31 Performance of (2, 2) Predictor
– Improvement is most noticeable in integer benchmarks.
– The (m,n) predictor outperforms the 2-bit predictor, even with unlimited entries!

32 Tournament Predictors
Uses multiple predictors, usually one based on local information and one based on global information.
– Local predictors are better for some branches
– Global predictors are better at utilizing correlation
A selector, usually a 2-bit saturating counter, is used to choose among the predictors.
[State diagram: selector states 11, 10, 01, 00. Each transition is labeled n/m, where n refers to the left predictor and m to the right predictor; 0 means incorrect and 1 means correct.]
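The selector logic can be sketched as one more saturating counter. In this illustrative model the encoding is an assumption: states 0 and 1 pick the local predictor, states 2 and 3 pick the global one, and the counter moves only when exactly one predictor was correct.

```python
class TournamentSelector:
    """2-bit saturating counter choosing between a local and a global prediction."""

    def __init__(self):
        self.counter = 0  # assumed encoding: 0-1 choose local, 2-3 choose global

    def choose(self, local_pred, global_pred):
        return global_pred if self.counter >= 2 else local_pred

    def update(self, local_correct, global_correct):
        # Move toward whichever predictor was right when they disagree in correctness.
        if global_correct and not local_correct:
            self.counter = min(3, self.counter + 1)
        elif local_correct and not global_correct:
            self.counter = max(0, self.counter - 1)
        # Both right or both wrong: no information, leave the counter alone.


selector = TournamentSelector()
print(selector.choose(local_pred=True, global_pred=False))   # local is trusted initially
selector.update(local_correct=False, global_correct=True)
selector.update(local_correct=False, global_correct=True)
print(selector.choose(local_pred=True, global_pred=False))   # now global is trusted
```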

33 Example: Alpha 21264 Branch Predictor
The 21264 uses the most sophisticated branch predictor.
[Figure labels: last 10 outcomes of this branch; 3-bit saturating counter; 2-bit predictor; 2-bit saturating counter; last 12 outcomes of all the branches.]

34 Tournament Predictor in Alpha 21264
The local predictor is a 2-level predictor:
– Top level: a local history table of 1024 10-bit entries (1K × 10 bits); each 10-bit entry holds the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
– Next level: the selected entry from the local history table is used to index a table of 1K 3-bit saturating counters (1K × 3 bits), which provide the local prediction.
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180K transistors)
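The storage total is simple arithmetic, sketched below. The slide does not name the two 4K × 2 terms; labeling them as the global predictor and the choice predictor is an assumption based on the tournament structure described earlier.

```python
K = 1024

components = {
    "global predictor, 4K x 2 bits (assumed label)": 4 * K * 2,
    "choice predictor, 4K x 2 bits (assumed label)": 4 * K * 2,
    "local history table, 1K x 10 bits": 1 * K * 10,
    "local counters, 1K x 3 bits": 1 * K * 3,
}

total_bits = sum(components.values())
for name, bits in components.items():
    print(f"{name}: {bits // K}K bits")
print(f"total: {total_bits // K}K bits")
```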

35 % of Predictions from Local Predictor in Tournament Prediction Scheme
Benchmark    % from local predictor
nasa7        98%
matrix300    100%
tomcatv      94%
doduc        90%
spice        55%
fpppp        76%
gcc          72%
espresso     63%
eqntott      37%
li           69%

36 Accuracy of Branch Prediction (fig 3.40)
[Bar chart: prediction accuracy (0% to 100%) for profile-based, 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, and gcc.]
Profile: branch profile from last execution (static in that it is encoded in the instruction, but based on a profile).

37 Accuracy v. Size (SPEC89)
[Plot: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three schemes: local 2-bit counters, correlating (2,2) scheme, and tournament.]

38 Power Consumption BlueRISC’s Compiler-driven Power-Aware Branch Prediction Comparison with 512 entry BTAC bimodal (patent-pending) Copyright 2007 CAM & BlueRISC

39 Pitfall: Sometimes dumber is better
Alpha 21264 uses a tournament predictor (29 Kbits).
Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits).
On SPEC95 benchmarks, the 21264 outperforms:
– 21264 avg. 11.5 mispredictions per 1000 instructions
– 21164 avg. 16.5 mispredictions per 1000 instructions
Reversed for transaction processing (TP)!
– 21264 avg. 17 mispredictions per 1000 instructions
– 21164 avg. 15 mispredictions per 1000 instructions
TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor in the 21264).
What about power?
– Large predictors give some increase in prediction rate, but at a large power cost

40 Branch Target Buffer
The BTB acts as a cache for branch target addresses (BTAs). This eliminates the cycles otherwise wasted per branch calculating the BTA.

41 BTB (cont.)
The BTA and the outcome of the branch are known by the end of the ID stage …but not relayed until the EX stage.

42 BTB (cont.)
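A BTB lookup can be sketched as a small tagged, direct-mapped cache keyed by the branch's PC. This is an illustrative model: the class name, the table size, the 4-byte alignment, and the policy of dropping an entry when its branch is not taken are all assumptions rather than details from the slides.

```python
class BranchTargetBuffer:
    """Direct-mapped cache from branch PC to predicted target address."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot holds (branch_pc, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def lookup(self, pc):
        """Return the predicted target during fetch, or None on a miss."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def update(self, pc, target, taken):
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None  # assumed policy: forget branches that fall through


btb = BranchTargetBuffer()
print(btb.lookup(100))             # miss: fetch continues sequentially
btb.update(100, 500, taken=True)
print(btb.lookup(100))             # hit: fetch can redirect to 500 immediately
```

On a hit the target is available in the fetch stage, before the branch has even been decoded, which is where the saved cycles come from.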

43 Return Address Prediction
The BTB and BPB do a good job of predicting how future behavior will repeat. However, the subroutine call/return paradigm makes correct prediction difficult. The BTB then contains the following after the second subroutine is called:
    Inst. Addr.   Target Addr.
    100           500
    520           104
    112           500
When we return from subr, we get a hit on a valid entry in the BTB (Inst. Addr. = 520) and predict that we will return to address 104. However, this is not correct. The next instruction should be 116!

44 Subroutine Return Stack
In order to detect such mispredictions, a subroutine return stack can be used to augment the BTB.
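A return stack can be sketched as a bounded LIFO of return addresses. This is an illustrative model using the call sites from the earlier example (calls at 100 and 112); the class name, the depth, the 4-byte instruction size, and the drop-oldest overflow policy are assumptions.

```python
class ReturnAddressStack:
    """Small hardware-style stack that predicts return targets."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, inst_size=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed overflow policy: drop the oldest entry
        self.stack.append(call_pc + inst_size)

    def on_return(self):
        """Predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(100)
print(ras.on_return())   # first call returns to 104
ras.on_call(112)
print(ras.on_return())   # second call returns to 116, where a BTB entry would say 104
```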

45 Performance of SRS (SPEC 95)

46 Pentium 4’s Branch Predictor
“Unveiling the Intel Branch Predictors”
– Pentium 4
– http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1597026

47 Neural Branch Predictors
“Towards a High Performance Neural Branch Predictor”
– http://webspace.ulbsibiu.ro/lucian.vintan/html/USA.pdf
– The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth
– Used in IA-64 simulators

48 Core 2’s Branch Predictor?
TAGE: Tagged Geometric

49 TAGE Performance

50 To Learn More

