Branch Prediction High-Performance Computer Architecture Joe Crop Oregon State University School of Electrical Engineering and Computer Science.


Chapter 2: A Five Stage RISC Pipeline 2 Control Hazard
    beq r1,r3,label
    and r2,r3,r5
    or r6,r1,r7
    add r8,r1,r9
label: xor r10,r1,r11
[Figure: pipeline diagram; each instruction row shows the stages Ifetch, Reg, ALU, DMem, Reg]

Chapter 2: A Five Stage RISC Pipeline 3 Branch Penalty Impact
If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles => new CPI = 1 + 0.3 × 3 = 1.9!
Two-part solution:
–Determine whether the branch is taken or not sooner, AND
–Compute the taken-branch address earlier
MIPS branch tests if register = 0 or ≠ 0
–beqz R4, name
MIPS solution:
–Move the zero test to the ID/RF stage
–Add an adder to calculate the new PC in the ID/RF stage
–1 clock cycle penalty for a branch versus 3

Chapter 2: A Five Stage RISC Pipeline 4 Modified MIPS Datapath
[Figure: five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, and blocks labeled Instruction Memory, Register File, Sign Extend, Zero?, ALU, Adder, Data Memory, MUX, Next PC, Next SEQ PC, WB Data]

Chapter 2: A Five Stage RISC Pipeline 5 Branch Resolved in ID Stage
    beq r1,r3,label
    and r2,r3,r5
label: xor r10,r1,r11
    …
    …
[Figure: pipeline diagram; each instruction row shows the stages Ifetch, Reg, ALU, DMem, Reg]

Chapter 2: A Five Stage RISC Pipeline 6 Branch Prediction
Predict Branch Not Taken
–Execute successor instructions in sequence.
–“Squash” instructions in the pipeline if the branch is actually taken.
–47% of MIPS branches are not taken on average.
–PC+4 is already calculated, so use it to get the next instruction.
Predict Branch Taken
–53% of MIPS branches are taken on average.
–But the branch target address has not been calculated yet, so MIPS still incurs a 1-cycle branch penalty.
–Other machines: branch target known before outcome.
Delay Branch Technique

Chapter 2: A Five Stage RISC Pipeline 7 Delay Branches
This technique relies on software to fill the delay slots with valid, useful instructions. The n instructions after the branch are executed regardless of whether the branch is taken.
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
Branch delay of length n.
A 1-slot delay allows the branch decision and the branch target address to be resolved in the 5-stage pipeline. MIPS uses this.

Chapter 2: A Five Stage RISC Pipeline 8 Performance Effect of Branch Penalty
Let
$p_b$ = the probability that an instruction is a branch
$p_t$ = the probability that a branch is taken
$b$ = the branch penalty
CPI = the average number of cycles per instruction.
Then
$\mathrm{CPI} = (1 - p_b) + p_b\,[\,p_t (1 + b) + (1 - p_t)\,]$
$\mathrm{CPI} = 1 + b\,p_t\,p_b$
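The CPI model on this slide can be checked numerically. The following is a minimal sketch, not part of the original deck; the function names are hypothetical. It evaluates both the expanded and the simplified form, and reuses the "30% branches, 3-cycle stall" case from the Branch Penalty Impact slide by assuming every branch pays the penalty (modeled here as p_t = 1).

```python
def cpi_expanded(p_b: float, p_t: float, b: float) -> float:
    """CPI = (1 - p_b) + p_b * [p_t * (1 + b) + (1 - p_t)]."""
    # Non-branches cost 1 cycle, taken branches cost 1 + b, untaken branches cost 1.
    return (1 - p_b) + p_b * (p_t * (1 + b) + (1 - p_t))


def cpi_simplified(p_b: float, p_t: float, b: float) -> float:
    """CPI = 1 + b * p_t * p_b (algebraically the same expression)."""
    return 1 + b * p_t * p_b


# Assumption: a pure stall scheme charges every branch, so p_t is set to 1.
stall_cpi = cpi_expanded(p_b=0.3, p_t=1.0, b=3)
print(round(stall_cpi, 3))
```

Both forms should agree for any inputs, which is a quick way to confirm that the simplification on the slide drops no term.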

Chapter 2: A Five Stage RISC Pipeline 9 Delay Branch Technique

Chapter 2: A Five Stage RISC Pipeline 10 Delay Branch Technique (1)
“From before”
    A := B + C
    If B > C Then Goto Next
    Delay Slot
    ...
Next:
becomes
    If B > C Then Goto Next
    A := B + C
    ...
Next:

Chapter 2: A Five Stage RISC Pipeline 11 Delay Branch Technique (2)
“From target”
Next: X := Y * Z
    ...
    B := A + C
    If B > C Then Goto Next
    Delay Slot
becomes
    X := Y * Z
Next: ...
    B := A + C
    If B > C Then Goto Next
    X := Y * Z
Must be OK to execute when not taken
May need to duplicate

Chapter 2: A Five Stage RISC Pipeline 12 Delay Branch Technique (3)
“From fall through”
    B := A + C
    If B > C Then Goto Next
    Delay Slot
    X := Y * Z
    ...
Next:
becomes
    B := A + C
    If B > C Then Goto Next
    X := Y * Z
    ...
Next:
Must be OK to execute when taken

Chapter 2: A Five Stage RISC Pipeline 13 Delay Branch Technique (cont.)
The performance of delay branches can be modeled by the following equation:
$\mathrm{CPI} = 1 + b\,p_b\,p_{nop}$
where $p_{nop}$ is the fraction of the $b$ delay slots filled with nops. Thus, if $f_i$ is the probability that delay slot $i$ is filled with a useful instruction, then
$p_{nop} = 1 - (f_1 + f_2 + \dots + f_b)/b$
Example: suppose $b = 4$, $f_1 = 0.6$, $f_2 = 0.1$, $f_3 = f_4 = 0$, $p_b = 0.2$. Then $p_{nop} = 1 - 0.7/4 = 0.825$ and
$\mathrm{CPI} = 1 + 4 \times 0.2 \times 0.825 = 1.66$

Chapter 2: A Five Stage RISC Pipeline 14 Delay Branch Technique (cont.)
The concept of squashing or annulling can be used in conjunction with delay branches.
    X := Y * Z
Next: ...
    …
    B := A + C
    If B > C Then Goto Next
    X := Y * Z   => This instruction is nullified
bne,a rs,rt,label
a bit      Branch outcome   Delay inst. executed?
(not set)  taken            yes
(not set)  not taken        yes
a          taken            yes
a          not taken        no (annulled)

Chapter 2: A Five Stage RISC Pipeline 15 Delay Branch Technique (cont.)
For processors with this capability, the performance can be modeled as
$\mathrm{CPI} = 1 + b\,p_b\,[\,p_{nop}(1 - p_{null}) + p_{null}\,]$
where $p_{null} = 1 - p_t$ for nullify-on-branch-not-taken.
Suppose $b = 4$, $f_1 = 0.8$, $f_2 = 0.3$, $f_3 = 0.1$, $f_4 = 0$, $p_b = 0.2$, $p_{null} = 0.35$. Then $p_{nop} = 1 - 1.2/4 = 0.7$ and
$\mathrm{CPI} = 1 + 4 \times 0.2 \times (0.7 \times 0.65 + 0.35) = 1.644$
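The two delayed-branch formulas (with and without nullification) can be evaluated with one small helper. This is an illustrative sketch only; `nop_fraction` and `cpi_delayed_branch` are hypothetical names, and setting `p_null = 0` is assumed to recover the plain delayed-branch model from the earlier slide.

```python
def nop_fraction(fill_probs):
    """p_nop = 1 - (f_1 + ... + f_b) / b, where b = number of delay slots."""
    return 1 - sum(fill_probs) / len(fill_probs)


def cpi_delayed_branch(p_b, fill_probs, p_null=0.0):
    """CPI = 1 + b * p_b * [p_nop * (1 - p_null) + p_null]."""
    b = len(fill_probs)
    p_nop = nop_fraction(fill_probs)
    wasted_per_slot = p_nop * (1 - p_null) + p_null
    return 1 + b * p_b * wasted_per_slot


# Example from the slide without nullification (b = 4, p_b = 0.2).
cpi_plain = cpi_delayed_branch(0.2, [0.6, 0.1, 0.0, 0.0])
# Example from the slide with nullification (p_null = 0.35).
cpi_null = cpi_delayed_branch(0.2, [0.8, 0.3, 0.1, 0.0], p_null=0.35)
print(round(cpi_plain, 3), round(cpi_null, 3))
```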

Chapter 2: A Five Stage RISC Pipeline 16 Delayed Branch Performance Compiler effectiveness for single branch delay slot: –Fills about 60% of branch delay slots. –About 80% of instructions executed in branch delay slots useful in computation. –About 50% (60% x 80%) of slots usefully filled.

Chapter 2: A Five Stage RISC Pipeline 17 Evaluating Branch Alternatives
Suppose conditional & unconditional branches = 14%, and 65% of them change the PC
[Table: columns Prediction scheme, Branch penalty, CPI, Speedup v. unpipelined, Speedup v. stall; rows Stall pipeline, Predict taken, Predict not taken, Delayed branch; cell values missing]

18 Reducing Branch Penalty
Branch penalty in dynamically scheduled processors: wasted cycles due to pipeline flushing on mispredicted branches
Reduce branch penalty:
1. Predict branch/jump instructions AND branch direction (taken or not taken)
2. Predict branch/jump target address (for taken branches)
3. Speculatively execute instructions along the predicted path

19 What to Use and What to Predict
Available info:
–Current predicted PC
–Past branch history (direction and target)
What to predict:
–Conditional branch inst: branch direction and target address
–Jump inst: target address
–Procedure call/return: target address
May need instruction pre-decoded
[Figure: block diagram with labels IM, PC, Predictors, pred_PC, pred info, feedback, PC & Inst]

20 Mis-prediction Detections and Feedbacks
Detections:
At the end of decoding
–Target address is known at decode and does not match the prediction
–Flush the fetch stage
At commit (most cases)
–Wrong branch direction, or target address does not match
–Flush the whole pipeline
Feedbacks:
–Any time a mis-prediction is detected
–At a branch’s commit (at EXE: called speculative update)
[Figure: pipeline with labels FETCH, RENAME, SCHD, REB/ROB, COMMIT, WB, EXE, predictors]

21 Branch Direction Prediction
Predict branch direction: taken or not taken (T/NT)
Static prediction: compilers decide the direction
Dynamic prediction: hardware decides the direction using dynamic information
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. and more …
[Figure: BNE R1, R2, L1 with a not-taken path that falls through and a taken path to L1]

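22 Predictor for a Single Branch
General form:
1. Access the predictor state with the PC
2. Predict: output T/NT
3. Feedback: the actual T/NT outcome updates the state
1-bit prediction: state 1 predicts taken and state 0 predicts not taken; feedback T moves the state to 1 and feedback NT moves it to 0.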

23 Branch History Table of 1-bit Predictor
The BHT is also called the Branch Prediction Buffer in the textbook
Can use only one 1-bit predictor, but accuracy is low
BHT: use a table of simple predictors, indexed by bits from the PC
Similar to a direct-mapped cache
More entries cost more, but give fewer conflicts and higher accuracy
A BHT can contain complex predictors
[Figure: k bits of the branch address index a table of $2^k$ entries, which outputs the prediction]
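A table of 1-bit predictors indexed by PC bits might look like the sketch below. It is illustrative only: the class name, the 4-byte instruction size (two low PC bits dropped), and the example addresses are assumptions, not details from the slides. Like a direct-mapped cache without tags, two branches whose index bits match share an entry, which is the conflict the slide mentions.

```python
class OneBitBHT:
    """Branch history table of 1-bit predictors, indexed by k PC bits."""

    def __init__(self, index_bits: int):
        self.mask = (1 << index_bits) - 1
        # One bit per entry: False = predict not taken, True = predict taken.
        self.table = [False] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte aligned instructions, so skip the two low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)]

    def update(self, pc: int, taken: bool) -> None:
        # A 1-bit predictor simply remembers the last outcome.
        self.table[self._index(pc)] = taken


bht = OneBitBHT(index_bits=4)          # 2**4 = 16 entries
bht.update(0x1000, True)               # branch at 0x1000 was taken
aliased = bht.predict(0x1040)          # same index as 0x1000: conflict
separate = bht.predict(0x1004)         # different index: untouched entry
print(aliased, separate)
```

Adding index bits separates the two aliased branches at the cost of a larger table.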

24 1-bit BHT Weakness
Example: in a loop, a 1-bit BHT will cause 2 mispredictions
Consider a loop of 9 iterations before exit:
    for (…){
        for (i=0; i<9; i++)
            a[i] = a[i] * 2.0;
    }
–End-of-loop case, when it exits instead of looping as before
–First time through the loop on the next time through the code, when it predicts exit instead of looping
–Only 80% accuracy even if the branch loops 90% of the time
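The 80% figure can be reproduced with a few lines of simulation. This is a sketch under a stated assumption: the loop branch is modeled as taken 9 times and then not taken once per visit (the "loops 90% of the time" case), and the visit count of 100 is arbitrary.

```python
def one_bit_accuracy(outcomes, initial=False):
    """Fraction of correct predictions for a single 1-bit predictor."""
    state = initial          # False = predict not taken, True = predict taken
    correct = 0
    for taken in outcomes:
        if state == taken:
            correct += 1
        state = taken        # remember only the most recent outcome
    return correct / len(outcomes)


# Assumed outcome stream: 9 taken + 1 not taken, repeated for 100 visits.
loop_visit = [True] * 9 + [False]
accuracy_1bit = one_bit_accuracy(loop_visit * 100)
print(accuracy_1bit)
```

Each visit costs two mispredictions: the exit itself, and the first iteration of the next visit, where the predictor still remembers the exit.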

25 Solution: 2-bit scheme that changes the prediction only after mispredicting twice (Figure 3.7, p. 249)
Gray: stop, not taken
Blue: go, taken
Adds hysteresis to the decision-making process
[Figure: 2-bit saturating counter state diagram with two Predict Taken states and two Predict Not Taken states, connected by T and NT transitions]
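A 2-bit saturating counter, the variant named on this slide, can be sketched as follows. The encoding (0 and 1 predict not taken, 2 and 3 predict taken), the starting state, and the 9-taken-then-1-not-taken outcome stream are assumptions chosen for illustration.

```python
def two_bit_accuracy(outcomes, initial=2):
    """Fraction of correct predictions for one 2-bit saturating counter."""
    counter = initial                      # 0..3; >= 2 means predict taken
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# Assumed outcome stream: 9 taken + 1 not taken, repeated for 100 visits.
loop_visit = [True] * 9 + [False]
accuracy_2bit = two_bit_accuracy(loop_visit * 100)
print(accuracy_2bit)
```

The loop exit still mispredicts, but it only moves the counter from 3 to 2, so the first iteration of the next visit is predicted taken again. That hysteresis removes one of the two mispredictions per visit.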

26 Correlating Branches
Code example showing the potential:
    if (d==0) d=1;
    if (d==1) …
Assembly code:
    BNEZ R1, L1
    DADDIU R1,R0,#1
L1: DADDIU R3,R1,#-1
    BNEZ R3, L2
L2: …
Observation: if BNEZ1 is not taken (d == 0), d is set to 1, so R3 = 0 and BNEZ2 is not taken either

Chapter 3 - Exploiting ILP 27 (1, 1) Predictor
(1,1) predictor: last branch, 1-bit prediction
We use a pair of bits, where the first bit is the prediction if the last branch in the program was not taken, and the second bit is the prediction if the last branch was taken.
Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
NT/NT             Not taken                             Not taken
NT/T              Not taken                             Taken
T/NT              Taken                                 Not taken
T/T               Taken                                 Taken

Chapter 3 - Exploiting ILP 28 (1, 1) Predictor: Example
Consider the following code, assuming d is assigned to R1.
    if (d==0) d=1;
    if (d==1)
    bnez R1,L1      ; branch b1 (d!=0)
    addi R1,R0,#1   ; d==0, so d=1
L1: subi R3,R1,#1
    bnez R3,L2      ; branch b2 (d!=1)
    ...
L2:
Suppose d alternates between 2 and 0, and the (1, 1) predictor is initialized to not taken. Brackets indicate the prediction used. The only mispredictions are on the first iteration, when d=2, because b1 was not correlated with the previous prediction of b2.
d=?  b1 pred   b1 action  new b1 pred  b2 pred   b2 action  new b2 pred
2    [NT]/NT   T          T/NT         NT/[NT]   T          NT/T
0    T/[NT]    NT         T/NT         [NT]/T    NT         NT/T
2    [T]/NT    T          T/NT         NT/[T]    T          NT/T
0    T/[NT]    NT         T/NT         [NT]/T    NT         NT/T
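The table can be replayed with a short simulation. This is a sketch, not the deck's own material: the function name is hypothetical, and the "last branch" before the first iteration is assumed to be not taken, which is what the first row of the table implies.

```python
def simulate_one_one(d_values):
    """Replay branches b1 and b2 through a (1,1) predictor.

    Returns a list of (iteration, branch) pairs that were mispredicted.
    """
    # pred[branch][h]: 1-bit prediction used when the last branch outcome
    # was h (0 = not taken, 1 = taken). Everything starts at not taken.
    pred = {"b1": [False, False], "b2": [False, False]}
    last = 0                       # assumption: no taken branch before start
    misses = []
    for step, d in enumerate(d_values):
        b1_taken = d != 0          # bnez R1,L1
        if d == 0:
            d = 1                  # addi R1,R0,#1
        b2_taken = d != 1          # bnez R3,L2 with R3 = d - 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            if pred[name][last] != taken:
                misses.append((step, name))
            pred[name][last] = taken   # update only the bit that was used
            last = int(taken)
    return misses


misses_11 = simulate_one_one([2, 0, 2, 0])
print(misses_11)
```

Only the two branches of the first iteration miss; after that, the bit selected by the previous outcome always matches the alternating pattern.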

Chapter 3 - Exploiting ILP 29 (1, 1) Predictor: Example
If we had used a 1-bit predictor, all the branches would have been mispredicted!
d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
2    NT       T          T            NT       T          T
0    T        NT         NT           T        NT         NT
2    NT       T          T            NT       T          T
0    T        NT         NT           T        NT         NT
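For comparison, the same d sequence can be run through one plain 1-bit predictor per branch. The sketch below uses a hypothetical function name and assumes both predictors start at not taken, as in the table.

```python
def one_bit_misses(d_values):
    """Count mispredictions of b1 and b2 with one 1-bit predictor each."""
    pred = {"b1": False, "b2": False}   # assumed initial state: not taken
    misses = 0
    total = 0
    for d in d_values:
        b1_taken = d != 0
        if d == 0:
            d = 1
        b2_taken = d != 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            total += 1
            if pred[name] != taken:
                misses += 1
            pred[name] = taken          # remember the last outcome only
    return misses, total


result_1bit = one_bit_misses([2, 0, 2, 0])
print(result_1bit)
```

Because each branch alternates between taken and not taken, "repeat the last outcome" is wrong every time.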

Chapter 3 - Exploiting ILP 30 (m, n) Predictor
(m,n) predictor: in general, an (m,n) predictor uses the behavior of the last m branches (recorded in a shift register) to choose from $2^m$ branch predictors, each of which is an n-bit predictor for a single branch.
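One possible way to organize an (m,n) predictor is sketched below. The class name, the PC indexing (4-byte instructions, 16 table entries), and the use of saturating counters for the n-bit predictors are assumptions made for illustration; the slide only fixes the roles of m and n.

```python
class CorrelatingPredictor:
    """(m,n) predictor: m bits of global history pick one of 2**m
    n-bit saturating counters in the entry selected by the PC."""

    def __init__(self, m: int, n: int, index_bits: int = 4):
        self.m, self.n = m, n
        self.history = 0                          # m-bit shift register
        self.history_mask = (1 << m) - 1
        self.index_mask = (1 << index_bits) - 1
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)             # counter >= threshold: taken
        self.table = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def _entry(self, pc: int):
        return self.table[(pc >> 2) & self.index_mask]   # assumed indexing

    def predict(self, pc: int) -> bool:
        return self._entry(pc)[self.history] >= self.threshold

    def update(self, pc: int, taken: bool) -> None:
        entry = self._entry(pc)
        c = entry[self.history]
        entry[self.history] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        # Shift the newest outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask

    def storage_bits(self) -> int:
        return len(self.table) * (1 << self.m) * self.n


predictor_22 = CorrelatingPredictor(m=2, n=2)
print(predictor_22.storage_bits())   # 16 entries * 4 counters * 2 bits
```

With m = 0 this degenerates to an ordinary table of n-bit counters, and with m = 1, n = 1 it behaves like the (1,1) predictor from the earlier slides.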

Chapter 3 - Exploiting ILP 31 Performance of (2, 2) Predictor
Improvement is most noticeable in integer benchmarks. The (m,n) predictor outperforms the 2-bit predictor, even a 2-bit predictor with unlimited entries!
[Figure: predictor performance on integer benchmarks]

Chapter 3 - Exploiting ILP 32 Tournament Predictors
Uses multiple predictors, usually one based on local information and one based on global information.
–Local predictors are better for some branches
–Global predictors are better at utilizing correlation
A selector, usually a 2-bit saturating counter, is used to choose among the predictors.
[Figure: selector state diagram. Transition labels n/m give the result of the left predictor (n) and the right predictor (m): 0 = incorrect, 1 = correct]
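The selector logic can be sketched as a 2-bit saturating counter that moves toward whichever predictor was right when the two disagree. The class name and the convention that low counter values favor predictor A are assumptions for illustration; the two component predictors are represented only by their predictions.

```python
class TournamentSelector:
    """2-bit saturating counter that chooses between two predictors."""

    def __init__(self):
        # Assumed encoding: 0,1 -> trust predictor A; 2,3 -> trust predictor B.
        self.counter = 1

    def choose(self, pred_a: bool, pred_b: bool) -> bool:
        return pred_b if self.counter >= 2 else pred_a

    def update(self, pred_a: bool, pred_b: bool, taken: bool) -> None:
        a_ok = pred_a == taken
        b_ok = pred_b == taken
        if b_ok and not a_ok:          # the "0/1" transition
            self.counter = min(self.counter + 1, 3)
        elif a_ok and not b_ok:        # the "1/0" transition
            self.counter = max(self.counter - 1, 0)
        # "0/0" and "1/1": both wrong or both right, the state is unchanged.


selector = TournamentSelector()
before = selector.choose(pred_a=True, pred_b=False)   # follows A
selector.update(pred_a=True, pred_b=False, taken=False)
after = selector.choose(pred_a=True, pred_b=False)    # now follows B
print(before, after)
```

In a full predictor there would typically be one such counter per table entry rather than a single global one.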

Chapter 3 - Exploiting ILP 33 Example: Alpha 21264 Branch Predictor
The Alpha 21264 uses the most sophisticated branch predictor.
[Figure: predictor block diagram with labels: last 10 outcomes of this branch, 3-bit saturating counter, 2-bit predictor, 2-bit saturating counter, last 12 outcomes of all the branches]

Tournament Predictor in Alpha 21264
Local predictor consists of a 2-level predictor:
–Top level: a local history table consisting of 1K 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
–Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction.
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180K transistors)
[Figure: local history table (1K × 10 bits) and local counter table (1K × 3 bits)]
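The storage total can be checked with a few lines of arithmetic. The slide does not say which structures the two 4K*2 terms stand for; the sketch below assumes they are the global predictor and the selector, each with 4K 2-bit counters, and the variable names are hypothetical.

```python
K = 1024

# Assumption: the two 4K*2 terms are the global predictor and the selector.
global_predictor_bits = 4 * K * 2
selector_bits = 4 * K * 2
local_history_bits = 1 * K * 10      # 1K entries of 10-bit history
local_counter_bits = 1 * K * 3       # 1K 3-bit saturating counters

total_bits = (global_predictor_bits + selector_bits
              + local_history_bits + local_counter_bits)
print(total_bits, total_bits / K)    # total in bits and in Kbits
```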

% of predictions from local predictor in Tournament Prediction Scheme
nasa7      98%
matrix300  100%
tomcatv    94%
doduc      90%
spice      55%
fpppp      76%
gcc        72%
espresso   63%
eqntott    37%
li         69%

Accuracy of Branch Prediction (fig 3.40)
[Figure: bar chart of prediction accuracy for gcc, espresso, li, fpppp, doduc, tomcatv under three schemes: Profile-based, 2-bit counter, Tournament; bars range from 70% to 100%]
Profile: branch profile from last execution (static in that it is encoded in the instruction, but based on a profile)

Accuracy v. Size (SPEC89)
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (Kbits) for three schemes: Local - 2 bit counters, Correlating - (2,2) scheme, Tournament]

Power Consumption BlueRISC’s Compiler-driven Power-Aware Branch Prediction Comparison with 512 entry BTAC bimodal (patent-pending) Copyright 2007 CAM & BlueRISC

Pitfall: Sometimes dumber is better
The Alpha 21264 uses a tournament predictor (29 Kbits)
The earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
On SPEC95 benchmarks, the 21264 outperforms the 21164 in average mispredictions per 1000 instructions
Reversed for transaction processing (TP)!
–21264 avg. 17 mispredictions per 1000 instructions
–21164 avg. 15 mispredictions per 1000 instructions
TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)
What about power?
–Large predictors give some increase in prediction rate, but at a large power cost

Chapter 3 - Exploiting ILP 40 Branch Target Buffer
The branch target buffer (BTB) acts as a cache for branch target addresses (BTAs). This eliminates the cycles that would otherwise be spent on each branch calculating the BTA.
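The "cache for BTAs" idea can be sketched as a small direct-mapped table keyed by the branch PC. Everything below is an illustrative assumption (class name, 16 entries, full-PC tags, 4-byte instructions, example addresses); real BTBs differ in associativity and tag width.

```python
class BranchTargetBuffer:
    """Direct-mapped cache from branch PC to branch target address."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.tags = [None] * entries      # PC of the branch stored in each slot
        self.targets = [0] * entries      # cached branch target address

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries   # assumption: 4-byte instructions

    def lookup(self, pc: int):
        """Return the cached target on a hit, or None on a miss."""
        i = self._index(pc)
        return self.targets[i] if self.tags[i] == pc else None

    def update(self, pc: int, target: int) -> None:
        """Record the target once the branch has actually been resolved."""
        i = self._index(pc)
        self.tags[i] = pc
        self.targets[i] = target


btb = BranchTargetBuffer()
first_lookup = btb.lookup(0x400)      # miss: target must be computed
btb.update(0x400, 0x480)
second_lookup = btb.lookup(0x400)     # hit: target available at fetch time
print(first_lookup, hex(second_lookup))
```

On a hit the fetch stage can redirect immediately instead of waiting for the target to be computed later in the pipeline.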

Chapter 3 - Exploiting ILP 41 BTB (cont.)
The BTA and the outcome of the branch are known by the end of the ID stage …but not relayed until the EX stage

Chapter 3 - Exploiting ILP 42 BTB (cont.)

Chapter 3 - Exploiting ILP 43 Return Address Prediction
BTB and BPB do a good job in predicting how future behavior will repeat. However, the subroutine call/return paradigm makes correct prediction difficult.
The BTB then contains the following after the second subroutine is called:
[Table: columns Inst. Addr, Target Addr]
When we return from subr, we get a hit on a valid entry in the BTB (Inst. Addr. = 520) and predict that we will return to address 104. However, this is not correct. The next instruction should be 116!

Chapter 3 - Exploiting ILP 44 Subroutine Return Stack
In order to avoid such mispredictions, a subroutine return stack can be used to augment the BTB.
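The sketch below contrasts a BTB-style "last target" guess with a return stack for the scenario on the previous slide. The call sites at addresses 100 and 112, the 4-byte instruction size, and the stack depth are assumptions chosen to match the return addresses 104 and 116; only the return instruction at 520 comes from the slide.

```python
class ReturnAddressStack:
    """Small stack of predicted return addresses."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc: int, inst_size: int = 4) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)                 # overflow: drop the oldest entry
        self.stack.append(call_pc + inst_size)

    def predict_return(self):
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
btb_last_target = {}                          # return PC -> last seen target

# First call (assumed at address 100) and its return at 520.
ras.on_call(100)
first_return = ras.predict_return()           # 104
btb_last_target[520] = first_return

# Second call (assumed at address 112) to the same subroutine.
ras.on_call(112)
btb_guess = btb_last_target[520]              # stale target from the first call
ras_guess = ras.predict_return()              # return address of this call
print(btb_guess, ras_guess)
```

The BTB entry for the return instruction still holds 104, while the stack yields 116, the address after the second call site.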

Chapter 3 - Exploiting ILP 45 Performance of SRS SPEC 95

Pentium 4’s Branch Predictor “Unveiling the Intel Branch Predictors” –Pentium 4 – 46

Neural Branch Predictors “Towards a High Performance Neural Branch Predictor” – –The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth –Used in IA-64 simulators 47

Core 2’s Branch Predictor? TAGE: Tagged Geometric Chapter 3 - Exploiting ILP48

TAGE Performance 49

To Learn More Chapter 3 - Exploiting ILP50