Computer Architecture 2011 – Branch Prediction 1 Computer Architecture Advanced Branch Prediction Lihu Rappoport and Adi Yoaz.

Slides:



Advertisements
Similar presentations
Branch prediction Titov Alexander MDSP November, 2009.
Advertisements

Dynamic History-Length Fitting: A third level of adaptivity for branch prediction Toni Juan Sanji Sanjeevan Juan J. Navarro Department of Computer Architecture.
Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
Dynamic Branch Prediction
Branch prediction Avi Mendelson, 4/ MAMAS – Computer Architecture Branch Prediction Dr. Avi Mendelson Some of the slides were taken from: Dr. Lihu.
CPE 731 Advanced Computer Architecture ILP: Part II – Branch Prediction Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
EECC551 - Shaaban #1 lec # 5 Spring Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with.
EECC551 - Shaaban #1 lec # 5 Fall Static Conditional Branch Prediction Branch prediction schemes can be classified into static and dynamic.
EECC551 - Shaaban #1 lec # 5 Fall Static Conditional Branch Prediction Branch prediction schemes can be classified into static (at compilation.
EECS 470 Branch Prediction Lecture 6 Coverage: Chapter 3.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Oct. 8, 2003 Topic: Instruction-Level Parallelism (Dynamic Branch Prediction)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 7, 2002 Topic: Instruction-Level Parallelism (Dynamic Branch Prediction)
EECC551 - Shaaban #1 lec # 5 Fall Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )
OOOE & Exception © Avi Mendelson 05/ MAMAS – Computer Architecture Out Of Order Execution cont. Lecture #8-9 Dr. Avi Mendelson Alex Gontmakher Some.
Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.
Goal: Reduce the Penalty of Control Hazards
Branch Target Buffers BPB: Tag + Prediction
EECC551 - Shaaban #1 lec # 5 Winter Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with.
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
1 COMP 740: Computer Architecture and Implementation Montek Singh Thu, Feb 19, 2009 Topic: Instruction-Level Parallelism III (Dynamic Branch Prediction)
Branch pred. CSE 471 Autumn 011 Branch statistics Branches occur every 4-6 instructions (16-25%) in integer programs; somewhat less frequently in scientific.
So far we have dealt with control hazards in instruction pipelines by:
Dynamic Branch Prediction
CIS 429/529 Winter 2007 Branch Prediction.1 Branch Prediction, Multiple Issue.
Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )
EECC551 - Shaaban #1 lec # 5 Fall Static Conditional Branch Prediction Branch prediction schemes can be classified into static and dynamic.
Branch Prediction CSE 4711 Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications;
Evaluation of Dynamic Branch Prediction Schemes in a MIPS Pipeline Debajit Bhattacharya Ali JavadiAbhari ELE 475 Final Project 9 th May, 2012.
Computer Architecture 2012 – advanced branch prediction 1 Computer Architecture Advanced Branch Prediction By Dan Tsafrir, 21/5/2012 Presentation based.
1 Dynamic Branch Prediction. 2 Why do we want to predict branches? MIPS based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle. –Delayed.
Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.
Korea UniversityG. Lee CRE652 Processor Architecture Dynamic Branch Prediction.
Computer Structure Advanced Branch Prediction
Computer Architecture 2015 – Advanced Branch Prediction 1 Computer Architecture Advanced Branch Prediction By Yoav Etsion and Dan Tsafrir Presentation.
CS 6290 Branch Prediction. Control Dependencies Branches are very frequent –Approx. 20% of all instructions Can not wait until we know where it goes –Long.
Yiorgos Makris Professor Department of Electrical Engineering University of Texas at Dallas EE (CE) 6304 Computer Architecture Lecture #13 (10/28/15) Course.
Dynamic Branch Prediction
CSL718 : Pipelined Processors
CS203 – Advanced Computer Architecture
Computer Structure Advanced Branch Prediction
Dynamic Branch Prediction
Computer Architecture Advanced Branch Prediction
CS5100 Advanced Computer Architecture Advanced Branch Prediction
COSC3330 Computer Architecture Lecture 15. Branch Prediction
Samira Khan University of Virginia Dec 4, 2017
CMSC 611: Advanced Computer Architecture
Module 3: Branch Prediction
TIME C1 C2 C3 C4 C5 C6 C7 C8 C9 I1 branch decode exec mem wb bubble
So far we have dealt with control hazards in instruction pipelines by:
CPE 631: Branch Prediction
Branch statistics Branches occur every 4-6 instructions (16-25%) in integer programs; somewhat less frequently in scientific ones Unconditional branches.
Dynamic Branch Prediction
Pipelining and control flow
So far we have dealt with control hazards in instruction pipelines by:
Lecture 10: Branch Prediction and Instruction Delivery
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
Pipelining: dynamic branch prediction Prof. Eric Rotenberg
Adapted from the slides of Prof
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
Computer Structure Advanced Branch Prediction
CPE 631 Lecture 12: Branch Prediction
Presentation transcript:

Computer Architecture 2011 – Branch Prediction 1 Computer Architecture Advanced Branch Prediction Lihu Rappoport and Adi Yoaz

Computer Architecture 2011 – Branch Prediction 2 Introduction  Need to predict:  Conditional branch direction (taken or no taken)  Actual direction is known only after execution  Wrong direction prediction causes a full flush  All taken branch (conditional taken or unconditional) targets  Target of direct branches known at decode  Target of indirect branches known at execution  Branch type  Conditional, uncond. direct, uncond. indirect, call, return  Target: minimize branch misprediction rate for a given predictor size

Computer Architecture 2011 – Branch Prediction 3 What/Who/When We Predict/Fix FetchDecodeExecute Target Array  Branch type  conditional  unconditional direct  unconditional indirect  call  return  Branch target Indirect Target Array  Predict indirect target  override TA target Return Stack Buffer  Predict return target Cond. Branch Predictor  Predict conditional T/NT  Fix TA miss  Fix wrong direct target on TA miss try to fix Fix TA miss Fix Wrong prediction Dec Flush Exe Flush

Computer Architecture 2011 – Branch Prediction 4 Branches and Performance  MPI : misprediction-per-instruction: # of incorrectly predicted branches MPI = total # of instructions  MPI correlates well with performance. For example:  MPI = 1% (1 out of 100 out of 20 branches)  IPC=2 (IPC is the average number of instructions per cycle),  flush penalty of 10 cycles  We get:  MPI = 1%  flush in every 100 instructions  flush in every 50 cycles (since IPC=2),  10 cycles flush penalty every 50 cycles  20% in performance

Computer Architecture 2011 – Branch Prediction 5 Target Array  TA is accessed using the branch address (branch IP)  Implemented as an n-way set associative cache  Tags usually partial  Save space  Can get false hits  Few branches aliased to the same entry  No correctness only performance  TA predicts the following  Indication that instruction is a branch  Predicted target  Branch type  Unconditional: take target  Conditional: predict direction  TA allocated / updated at execution Branch IP tag target predicted target hit / miss (indicates a branch) type predicted type

Computer Architecture 2011 – Branch Prediction 6 Conditional Branch Direction Prediction

Computer Architecture 2011 – Branch Prediction 7 One-Bit Predictor  Problem: 1-bit predictor has a double mistake in loops Branch Outcome Prediction? branch IP Prediction (at fetch): previous branch outcome counter array / cache Update (at execution) Update bit with branch outcome

Computer Architecture 2011 – Branch Prediction 8 Bimodal (2-bit) Predictor  A 2-bit counter avoids the double mistake in glitches  Need “more evidence” to change prediction  2 bits encode one of 4 states  00 – strong NT, 01 – weakly NT, 10 – weakly taken, 11 – strong taken  Initial state: weakly-taken (most branches are taken)  Update  Branch was actually taken: increment counter (saturate at 11)  Branch was actually not-taken: decrement counter (saturate at 00)  Predict according to m.s.bit of counter (0 – NT, 1 – taken)  Does not predict well branches with patterns like … 0 SNT taken not-taken taken not-taken not- taken 01 WNT 10 WT 1 ST Predict takenPredict not-taken

Computer Architecture 2011 – Branch Prediction 9 l.s. bits of branch IP Bimodal Predictor (cont.) Prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome

Computer Architecture 2011 – Branch Prediction 10 Bimodal Predictor - example  Br1 prediction  Pattern:  counter:  Prediction:TTTTT T  Br2 prediction  Pattern:  counter:  Prediction:T nTT nTT nT  Br3 prediction  Pattern:  counter:  Prediction:T TT TT T Code:  Loop: ….  br1: if (n/2) {  ……. }  br2: if ((n+1)/2) {  ……. }  n--  br3: JNZ n, Loop

Computer Architecture 2011 – Branch Prediction 11 2-Level Prediction: Local Predictor  Save the history of each branch in a Branch History Register (BHR):  A shift-register updated by branch outcome  Saves the last n outcomes of the branch  Used as a pointer to an array of bits specifying direction per history  Example: assume n=6  Assume the pattern  At the steady-state, the following patterns are repeated in the BHR:  Following , , the jump is not taken  Following the jump is taken BHR 0 2 n -1 n

Computer Architecture 2011 – Branch Prediction 12 Local Predictor (cont.)  There could be glitches from the pattern  Use 2-bit saturating counters instead of 1 bit to record outcome:  Too long BHRs are not good:  Past history may be no longer relevant  Warm-Up is longer  Counter array becomes too big Update History with branch outcome prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome history BHR

Computer Architecture 2011 – Branch Prediction 13 Local Predictor: private counter arrays Branch IP taghistory prediction = msb of counter 2-bit-sat counter arrays Update counter with branch outcome Update History with branch outcome history cache Predictor size: #BHRs × (tag_size + history_size + 2 × 2 history_size ) Example: #BHRs = 1024; tag_size=8; history_size=6  size=1024 × ( ×2 6 ) = 142Kbit Holding BHRs and counter arrays for many branches:

Computer Architecture 2011 – Branch Prediction 14 Local Predictor: shared counter arrays  Using a single counter array shared by all BHR’s  All BHR’s index the same array  Branches with similar patterns interfere with each other prediction = msb of counter Branch IP 2-bit-sat counter array taghistory history cache Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size Example: #BHRs = 1024; tag_size=8; history_size=6  size=1024 × (8 + 6) + 2×2 6 = 14.1Kbit

Computer Architecture 2011 – Branch Prediction 15 Local Predictor: lselect  lselect reduces inter-branch-interference in the counter array prediction = msb of counter Branch IP 2-bit-sat counter array taghistory history cache h h+m m l.s.bits of IP Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size + m

Computer Architecture 2011 – Branch Prediction 16 Local Predictor: lshare lshare reduces inter-branch-interference in the counter array: maps common patterns in different branches to different counters h h h l.s.bits of IP history cache taghistory prediction = msb of counter Branch IP 2-bit-sat counter array Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size

Computer Architecture 2011 – Branch Prediction 17  The behavior of some branches is highly correlated with the behavior of other branches: if (x < 1)... if (x > 1)...  Using a Global History Register (GHR), the prediction of the second if may be based on the direction of the first if  For other branches the history interference might be destructive Global Predictor

Computer Architecture 2011 – Branch Prediction 18 Global Predictor (cont.) Update History with branch outcome prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome history GHR The predictor size: history_size + 2*2 history_size Example: history_size = 12  size = 8 K Bits

Computer Architecture 2011 – Branch Prediction 19 gshare combines the global history information with the branch IP Global Predictor: Gshare prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome Branch IP history GHR Update History with branch outcome

Computer Architecture 2011 – Branch Prediction 20 Chooser  The chooser may also be indexed by the GHR +1 if Bimodal / Local correct and Global wrong -1 if Bimodal / Local wrong and Global correct Bimodal / Local Global Branch IP Prediction Chooser array (an array of 2-bit sat. counters) GHR A chooser selects between 2 predictor that predict the same branch: Use the predictor that was more correct in the past

Computer Architecture 2011 – Branch Prediction 21 Speculative History Updates  Deep pipeline  many cycles between fetch and branch resolution  If history is updated only at resolution  Local: future occurrences of the same branch may see stale history  Global: future occurrences of all branches may see stale history  History is speculatively updated according to the prediction  History must be corrected if the branch is mispredicted  Speculative updates are done in a special field to enable recovery  Speculative History Update  Speculative history updated assuming previous predictions are correct  Speculation bit set to indicate that speculative history is used  Counter array is not updated speculatively  Prediction can change only on a misprediction (state 01→10 or 10→01)  On branch resolution  Update the real history and reset speculative histories if mispredicted

Computer Architecture 2011 – Branch Prediction 22 Return Stack Buffer  A return instruction is a special case of an indirect branch:  Each times it jumps to a different target  The target is determined by the location of the corresponding call instruction  The idea:  Hold a small stack of targets  When the target array predicts a call  Push the address of the instruction which follows the call into the stack  When the target array predicts a return  Pop a target from the stack and use it as the return address

Computer Architecture 2011 – Branch Prediction 23 Branch Prediction in commercial Processors

Computer Architecture 2011 – Branch Prediction 24  386 / 486  All branches are statically predicted Not Taken  Pentium  IP based, 2-bit saturating counters (Lee-Smith)  BTB miss - statically predicted Not Taken Older Processors

Computer Architecture 2011 – Branch Prediction 25 Intel Pentium III  2-level, local histories, per-set counters  4-way set associative: 512 entries in 128 sets IP Tag Hist 1001 Pred= msb of counter Way 0Way 1 Target Way 2 Way counters 128 sets PTV LRR 2 Per-Set Branch Type 00- cond 01- ret 10- call 11- uncond Return Stack Buffer

Computer Architecture 2011 – Branch Prediction 26 Alpha LG Chooser Counters 4 ways 256 Histories IP In each entry: 6 bit tag + 10 bit History Counters GHR Counters Global Local Chooser 2  New entry on the Local stage is allocated on a global stage miss-prediction  Chooser state-machines: 2 bit each:  one bit saves last time global correct/wrong,  and the other bit saves for the local correct/wrong  Chooses Local only if local was correct and global was wrong

Computer Architecture 2011 – Branch Prediction 27 Pentium® M  Combines 3 predictors  Bimodal, Global and Loop predictor  Loop predictor analyzes branches to see if they have loop behavior  Moving in one direction (taken or NT) a fixed number of times  Ended with a single movement in the opposite direction

Computer Architecture 2011 – Branch Prediction 28 Pentium® M – Indirect Branch Predictor  Indirect branch targets is data dependent  Can have many targets: e.g., a case statement  Can still have only a single target at run time  Resolved at execution  high misprediction penalty  Used in object-oriented code (C++, Java)  becomes a growing source of branch mispredictions  A dedicated indirect branch target predictor (iTA)  Chooses targets based on a global history (similar to global predictor)  Initially indirect branch is allocated only in the target array (TA)  If target is mispredicted  allocate an iTA entry corresponding to the global history leading to this instance of the indirect branch  Data-dependent indirect branches allocate as many targets as needed  Monotonic indirect branches are still predicted by the TA

Computer Architecture 2011 – Branch Prediction 29 Indirect branch target prediction (cont)  Prediction from the iTA is used if  TA indicates an indirect branch  iTA hits for the current global history (XORed with branch address) Target Array Indirect Target Predictor Branch IP Predicted Target M X U hit indirect branch hit Target HIT Global history Target

Computer Architecture 2011 – Branch Prediction 30 Backup

Computer Architecture 2011 – Branch Prediction 31 BHT - Branch History Table 2-level 8,192-entry global predictor: 16 Entry BTC - Branch Target Cache  Supplies the first 16 bytes of target instructions to the decoders when the branches are predicted.  Organized as 16 entries of 16 bytes.  Avoids a bubble for correct predictions.  No Target Address Buffer: Address ALUs calculate target addresses on-the-fly during decode 16 Entry RAS - Return Address Stack  Caches the return addresses 2-bit-sat counter array 13 bit global history GHR AMD-K6

Computer Architecture 2011 – Branch Prediction 32  256 X 3- bit branch history table (BHT)  Instead of 2-bit-sat counters, stores the results of the last three iterations of each branch  The prediction is based on a majority vote of the three bits  Offers a similar level of hysteresis and accuracy as 2-bit-sat, but easier to update (shift results vs. read-modify-write)  The BHT is only updated as branch instructions are retired  prevents corrupting the history information with speculative executions of the branch HP PA-8000