Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computer Architecture 2011 – Branch Prediction 1 Computer Architecture Advanced Branch Prediction Lihu Rappoport and Adi Yoaz.

Similar presentations


Presentation on theme: "Computer Architecture 2011 – Branch Prediction 1 Computer Architecture Advanced Branch Prediction Lihu Rappoport and Adi Yoaz."— Presentation transcript:

1 Computer Architecture 2011 – Branch Prediction 1 Computer Architecture Advanced Branch Prediction Lihu Rappoport and Adi Yoaz

2 Computer Architecture 2011 – Branch Prediction 2 Introduction  Need to predict:  Conditional branch direction (taken or no taken)  Actual direction is known only after execution  Wrong direction prediction causes a full flush  All taken branch (conditional taken or unconditional) targets  Target of direct branches known at decode  Target of indirect branches known at execution  Branch type  Conditional, uncond. direct, uncond. indirect, call, return  Target: minimize branch misprediction rate for a given predictor size

3 Computer Architecture 2011 – Branch Prediction 3 What/Who/When We Predict/Fix FetchDecodeExecute Target Array  Branch type  conditional  unconditional direct  unconditional indirect  call  return  Branch target Indirect Target Array  Predict indirect target  override TA target Return Stack Buffer  Predict return target Cond. Branch Predictor  Predict conditional T/NT  Fix TA miss  Fix wrong direct target on TA miss try to fix Fix TA miss Fix Wrong prediction Dec Flush Exe Flush

4 Computer Architecture 2011 – Branch Prediction 4 Branches and Performance  MPI : misprediction-per-instruction: # of incorrectly predicted branches MPI = total # of instructions  MPI correlates well with performance. For example:  MPI = 1% (1 out of 100 instructions @1 out of 20 branches)  IPC=2 (IPC is the average number of instructions per cycle),  flush penalty of 10 cycles  We get:  MPI = 1%  flush in every 100 instructions  flush in every 50 cycles (since IPC=2),  10 cycles flush penalty every 50 cycles  20% in performance

5 Computer Architecture 2011 – Branch Prediction 5 Target Array  TA is accessed using the branch address (branch IP)  Implemented as an n-way set associative cache  Tags usually partial  Save space  Can get false hits  Few branches aliased to the same entry  No correctness only performance  TA predicts the following  Indication that instruction is a branch  Predicted target  Branch type  Unconditional: take target  Conditional: predict direction  TA allocated / updated at execution Branch IP tag target predicted target hit / miss (indicates a branch) type predicted type

6 Computer Architecture 2011 – Branch Prediction 6 Conditional Branch Direction Prediction

7 Computer Architecture 2011 – Branch Prediction 7 One-Bit Predictor  Problem: 1-bit predictor has a double mistake in loops Branch Outcome 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 Prediction? 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 branch IP Prediction (at fetch): previous branch outcome counter array / cache Update (at execution) Update bit with branch outcome

8 Computer Architecture 2011 – Branch Prediction 8 Bimodal (2-bit) Predictor  A 2-bit counter avoids the double mistake in glitches  Need “more evidence” to change prediction  2 bits encode one of 4 states  00 – strong NT, 01 – weakly NT, 10 – weakly taken, 11 – strong taken  Initial state: weakly-taken (most branches are taken)  Update  Branch was actually taken: increment counter (saturate at 11)  Branch was actually not-taken: decrement counter (saturate at 00)  Predict according to m.s.bit of counter (0 – NT, 1 – taken)  Does not predict well branches with patterns like 010101… 0 SNT taken not-taken taken not-taken not- taken 01 WNT 10 WT 1 ST Predict takenPredict not-taken

9 Computer Architecture 2011 – Branch Prediction 9 l.s. bits of branch IP Bimodal Predictor (cont.) Prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome

10 Computer Architecture 2011 – Branch Prediction 10 Bimodal Predictor - example  Br1 prediction  Pattern:1 01 0 1 0  counter:2 3 2 32 3  Prediction:TTTTT T  Br2 prediction  Pattern:01 0 1 0 1  counter:2 1 2 12 1  Prediction:T nTT nTT nT  Br3 prediction  Pattern:11 1 1 1 0  counter:2 3 3 33 3  Prediction:T TT TT T Code:  Loop: ….  br1: if (n/2) {  ……. }  br2: if ((n+1)/2) {  ……. }  n--  br3: JNZ n, Loop

11 Computer Architecture 2011 – Branch Prediction 11 2-Level Prediction: Local Predictor  Save the history of each branch in a Branch History Register (BHR):  A shift-register updated by branch outcome  Saves the last n outcomes of the branch  Used as a pointer to an array of bits specifying direction per history  Example: assume n=6  Assume the pattern 000100010001...  At the steady-state, the following patterns are repeated in the BHR: 000100010001... 000100 001000 010001 100010  Following 000100, 010001, 100010 the jump is not taken  Following 001000 the jump is taken BHR 0 2 n -1 n

12 Computer Architecture 2011 – Branch Prediction 12 Local Predictor (cont.)  There could be glitches from the pattern  Use 2-bit saturating counters instead of 1 bit to record outcome:  Too long BHRs are not good:  Past history may be no longer relevant  Warm-Up is longer  Counter array becomes too big Update History with branch outcome prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome history BHR

13 Computer Architecture 2011 – Branch Prediction 13 Local Predictor: private counter arrays Branch IP taghistory prediction = msb of counter 2-bit-sat counter arrays Update counter with branch outcome Update History with branch outcome history cache Predictor size: #BHRs × (tag_size + history_size + 2 × 2 history_size ) Example: #BHRs = 1024; tag_size=8; history_size=6  size=1024 × (8 + 6 + 2×2 6 ) = 142Kbit Holding BHRs and counter arrays for many branches:

14 Computer Architecture 2011 – Branch Prediction 14 Local Predictor: shared counter arrays  Using a single counter array shared by all BHR’s  All BHR’s index the same array  Branches with similar patterns interfere with each other prediction = msb of counter Branch IP 2-bit-sat counter array taghistory history cache Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size Example: #BHRs = 1024; tag_size=8; history_size=6  size=1024 × (8 + 6) + 2×2 6 = 14.1Kbit

15 Computer Architecture 2011 – Branch Prediction 15 Local Predictor: lselect  lselect reduces inter-branch-interference in the counter array prediction = msb of counter Branch IP 2-bit-sat counter array taghistory history cache h h+m m l.s.bits of IP Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size + m

16 Computer Architecture 2011 – Branch Prediction 16 Local Predictor: lshare lshare reduces inter-branch-interference in the counter array: maps common patterns in different branches to different counters h h h l.s.bits of IP history cache taghistory prediction = msb of counter Branch IP 2-bit-sat counter array Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size

17 Computer Architecture 2011 – Branch Prediction 17  The behavior of some branches is highly correlated with the behavior of other branches: if (x < 1)... if (x > 1)...  Using a Global History Register (GHR), the prediction of the second if may be based on the direction of the first if  For other branches the history interference might be destructive Global Predictor

18 Computer Architecture 2011 – Branch Prediction 18 Global Predictor (cont.) Update History with branch outcome prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome history GHR The predictor size: history_size + 2*2 history_size Example: history_size = 12  size = 8 K Bits

19 Computer Architecture 2011 – Branch Prediction 19 gshare combines the global history information with the branch IP Global Predictor: Gshare prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome Branch IP history GHR Update History with branch outcome

20 Computer Architecture 2011 – Branch Prediction 20 Chooser  The chooser may also be indexed by the GHR +1 if Bimodal / Local correct and Global wrong -1 if Bimodal / Local wrong and Global correct Bimodal / Local Global Branch IP Prediction Chooser array (an array of 2-bit sat. counters) GHR A chooser selects between 2 predictor that predict the same branch: Use the predictor that was more correct in the past

21 Computer Architecture 2011 – Branch Prediction 21 Speculative History Updates  Deep pipeline  many cycles between fetch and branch resolution  If history is updated only at resolution  Local: future occurrences of the same branch may see stale history  Global: future occurrences of all branches may see stale history  History is speculatively updated according to the prediction  History must be corrected if the branch is mispredicted  Speculative updates are done in a special field to enable recovery  Speculative History Update  Speculative history updated assuming previous predictions are correct  Speculation bit set to indicate that speculative history is used  Counter array is not updated speculatively  Prediction can change only on a misprediction (state 01→10 or 10→01)  On branch resolution  Update the real history and reset speculative histories if mispredicted

22 Computer Architecture 2011 – Branch Prediction 22 Return Stack Buffer  A return instruction is a special case of an indirect branch:  Each times it jumps to a different target  The target is determined by the location of the corresponding call instruction  The idea:  Hold a small stack of targets  When the target array predicts a call  Push the address of the instruction which follows the call into the stack  When the target array predicts a return  Pop a target from the stack and use it as the return address

23 Computer Architecture 2011 – Branch Prediction 23 Branch Prediction in commercial Processors

24 Computer Architecture 2011 – Branch Prediction 24  386 / 486  All branches are statically predicted Not Taken  Pentium  IP based, 2-bit saturating counters (Lee-Smith)  BTB miss - statically predicted Not Taken Older Processors

25 Computer Architecture 2011 – Branch Prediction 25 Intel Pentium III  2-level, local histories, per-set counters  4-way set associative: 512 entries in 128 sets IP Tag Hist 1001 Pred= msb of counter 9 0 15 Way 0Way 1 Target Way 2 Way 3 9 4 32 counters 128 sets PTV 211 32 LRR 2 Per-Set Branch Type 00- cond 01- ret 10- call 11- uncond Return Stack Buffer

26 Computer Architecture 2011 – Branch Prediction 26 Alpha 264 - LG Chooser 1024 3 Counters 4 ways 256 Histories IP In each entry: 6 bit tag + 10 bit History 4096 2 Counters GHR 12 4096 Counters Global Local Chooser 2  New entry on the Local stage is allocated on a global stage miss-prediction  Chooser state-machines: 2 bit each:  one bit saves last time global correct/wrong,  and the other bit saves for the local correct/wrong  Chooses Local only if local was correct and global was wrong

27 Computer Architecture 2011 – Branch Prediction 27 Pentium® M  Combines 3 predictors  Bimodal, Global and Loop predictor  Loop predictor analyzes branches to see if they have loop behavior  Moving in one direction (taken or NT) a fixed number of times  Ended with a single movement in the opposite direction

28 Computer Architecture 2011 – Branch Prediction 28 Pentium® M – Indirect Branch Predictor  Indirect branch targets is data dependent  Can have many targets: e.g., a case statement  Can still have only a single target at run time  Resolved at execution  high misprediction penalty  Used in object-oriented code (C++, Java)  becomes a growing source of branch mispredictions  A dedicated indirect branch target predictor (iTA)  Chooses targets based on a global history (similar to global predictor)  Initially indirect branch is allocated only in the target array (TA)  If target is mispredicted  allocate an iTA entry corresponding to the global history leading to this instance of the indirect branch  Data-dependent indirect branches allocate as many targets as needed  Monotonic indirect branches are still predicted by the TA

29 Computer Architecture 2011 – Branch Prediction 29 Indirect branch target prediction (cont)  Prediction from the iTA is used if  TA indicates an indirect branch  iTA hits for the current global history (XORed with branch address) Target Array Indirect Target Predictor Branch IP Predicted Target M X U hit indirect branch hit Target HIT Global history Target

30 Computer Architecture 2011 – Branch Prediction 30 Backup

31 Computer Architecture 2011 – Branch Prediction 31 BHT - Branch History Table 2-level 8,192-entry global predictor: 16 Entry BTC - Branch Target Cache  Supplies the first 16 bytes of target instructions to the decoders when the branches are predicted.  Organized as 16 entries of 16 bytes.  Avoids a bubble for correct predictions.  No Target Address Buffer: Address ALUs calculate target addresses on-the-fly during decode 16 Entry RAS - Return Address Stack  Caches the return addresses 2-bit-sat counter array 13 bit global history GHR AMD-K6

32 Computer Architecture 2011 – Branch Prediction 32  256 X 3- bit branch history table (BHT)  Instead of 2-bit-sat counters, stores the results of the last three iterations of each branch  The prediction is based on a majority vote of the three bits  Offers a similar level of hysteresis and accuracy as 2-bit-sat, but easier to update (shift results vs. read-modify-write)  The BHT is only updated as branch instructions are retired  prevents corrupting the history information with speculative executions of the branch HP PA-8000


Download ppt "Computer Architecture 2011 – Branch Prediction 1 Computer Architecture Advanced Branch Prediction Lihu Rappoport and Adi Yoaz."

Similar presentations


Ads by Google