Lecture 3: Branch Prediction
Young Cho
Graduate Computer Architecture I



2 - CSE/ESE 560M – Graduate Computer Architecture I
Cycles Per Instruction
“Average Cycles per Instruction”
  CPI = (CPU Time * Clock Rate) / Instruction Count
      = Cycles / Instruction Count
“Instruction Frequency”
  [equation not preserved in the transcript]
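The CPI definition above can be checked with a few lines of Python. This is a minimal sketch: the function names, the timing numbers, and the instruction mix are all hypothetical. The second function assumes the usual weighted-average form, CPI = sum over instruction classes of (frequency × cycles for that class), which is not spelled out in the transcript.

```python
def cpi_from_time(cpu_time_s, clock_rate_hz, instruction_count):
    """CPI = (CPU time * clock rate) / instruction count = cycles / instruction count."""
    cycles = cpu_time_s * clock_rate_hz
    return cycles / instruction_count


def cpi_from_mix(mix):
    """Weighted-average CPI from {class: (frequency, cycles per instruction)}.

    Assumed form (not in the transcript): CPI = sum(frequency_j * CPI_j).
    """
    return sum(freq * cycles for freq, cycles in mix.values())


# Hypothetical measurement: 2 seconds at 1 GHz for one billion instructions.
measured_cpi = cpi_from_time(cpu_time_s=2.0, clock_rate_hz=1e9, instruction_count=10**9)

# Hypothetical instruction mix, chosen only to exercise the formula.
hypothetical_mix = {
    "alu": (0.5, 1),
    "load": (0.2, 2),
    "store": (0.1, 2),
    "branch": (0.2, 2),
}
mix_cpi = cpi_from_mix(hypothetical_mix)

print(measured_cpi, mix_cpi)
```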

3 - Typical Load/Store Processor
[Figure: pipelined datapath with PC, Instruction Memory, Register File, ALU, Data Memory, and Control, divided by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]

4 - Pipelining Laundry
[Figure: three loads of laundry overlapped in a pipeline; segment labels of 30, 35, 35, and 25 minutes]
Three sets of clean clothes in 2 hours 40 minutes
With a large number of sets, each load takes an average of ~35 minutes
3X increase in productivity!!!

5 - Introducing Problems
Hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: the hardware cannot support this combination of instructions (a single person cannot dry and iron clothes simultaneously)
  – Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (a missing sock: both are needed before putting them away)
  – Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)

6 - Data Hazards
Read After Write (RAW)
  – Instr 2 tries to read an operand before Instr 1 writes it
  – Caused by a “dependence” in compiler terms
Write After Read (WAR)
  – Instr 2 writes an operand before Instr 1 reads it
  – Called an “anti-dependence” in compiler terms
Write After Write (WAW)
  – Instr 2 writes an operand before Instr 1 writes it
  – Called an “output dependence” in compiler terms
WAR and WAW occur in more complex systems
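The three definitions above reduce to set checks on destination and source registers. The sketch below is illustrative only: the `Instr` tuple and `classify` function are hypothetical names, and the function reports potential hazards (dependences). Whether one actually stalls a given pipeline depends on that pipeline's timing.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def classify(instr1: Instr, instr2: Instr) -> List[str]:
    """Return the potential data hazards when instr2 follows instr1."""
    found = []
    if instr1.dest is not None and instr1.dest in instr2.srcs:
        found.append("RAW")   # instr2 reads what instr1 writes
    if instr2.dest is not None and instr2.dest in instr1.srcs:
        found.append("WAR")   # instr2 writes what instr1 reads
    if instr1.dest is not None and instr1.dest == instr2.dest:
        found.append("WAW")   # both write the same register
    return found


add = Instr("add r1,r2,r3", "r1", ("r2", "r3"))
print(classify(add, Instr("sub r4,r1,r5", "r4", ("r1", "r5"))))  # RAW on r1
print(classify(add, Instr("sub r2,r4,r5", "r2", ("r4", "r5"))))  # WAR on r2
print(classify(add, Instr("sub r1,r4,r5", "r1", ("r4", "r5"))))  # WAW on r1
```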

7 - Branch Hazard (Control)
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: pipeline diagram; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
3 instructions are in the pipeline before the new instruction can be fetched.

8 - Branch Hazard Alternatives
Stall until branch direction is clear
Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in the pipeline if the branch is actually taken
  – Advantage of late pipeline state update
  – 47% of DLX branches not taken on average
  – PC+4 already calculated, so use it to get the next instruction
Predict Branch Taken
  – 53% of DLX branches taken on average
  – DLX still incurs a 1-cycle branch penalty
  – Other machines: branch target known before outcome

9 - Branch Hazard Alternatives
Delayed Branch
  – Define the branch to take place AFTER a following instruction (fill in the branch delay slot)
        branch instruction
        sequential successor 1
        sequential successor 2
        ...
        sequential successor n      (branch delay of length n)
        branch target if taken
  – A 1-slot delay allows proper decision and branch target address in a 5-stage pipeline

10 - Evaluating Branch Alternatives
[Table: columns are scheduling scheme, branch penalty, CPI, speedup v. unpipelined, and speedup v. stall; rows are Stall pipeline, Predict taken, Predict not taken, and Delayed branch; the numeric entries were not preserved in the transcript]
Conditional & unconditional branches = 14% of instructions; 65% of them change the PC
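The comparison on this slide rests on one relation: pipeline CPI = 1 + branch frequency × branch penalty. The sketch below applies it using the 14% and 65% figures from the slide. The per-scheme penalties are assumptions made for illustration, since the table values did not survive: 3 cycles for stalling, 1 cycle for predict taken, 1 cycle only when the PC actually changes for predict not taken, and 0.5 cycles for delayed branch.

```python
BRANCH_FREQUENCY = 0.14       # from the slide: branches are 14% of instructions
FRACTION_CHANGING_PC = 0.65   # from the slide: 65% of branches change the PC

# Assumed penalties in cycles per branch (NOT taken from the transcript).
ASSUMED_PENALTY = {
    "stall pipeline": 3.0,
    "predict taken": 1.0,
    "predict not taken": 1.0 * FRACTION_CHANGING_PC,  # pay only when the PC changes
    "delayed branch": 0.5,
}


def branch_cpi(penalty, branch_frequency=BRANCH_FREQUENCY, base_cpi=1.0):
    """CPI = base CPI + branch frequency * branch penalty."""
    return base_cpi + branch_frequency * penalty


cpi = {scheme: branch_cpi(p) for scheme, p in ASSUMED_PENALTY.items()}
speedup_vs_stall = {scheme: cpi["stall pipeline"] / value for scheme, value in cpi.items()}

for scheme in ASSUMED_PENALTY:
    print(f"{scheme:18s} CPI={cpi[scheme]:.3f} speedup vs. stall={speedup_vs_stall[scheme]:.2f}")
```

Changing the assumed penalties changes the ranking only if a scheme's penalty crosses another's. The branch frequency scales every gap equally.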

11 - Solution to Hazards
Structural Hazards
  – Delay the hardware-dependent instruction
  – Increase resources (e.g., dual-port memory)
Data Hazards
  – Data forwarding
  – Software scheduling
Control Hazards
  – Pipeline stalling
  – Predict and flush
  – Fill delay slots with previous instructions

12 - Administrative
Literature Survey
  – One Q&A per literature
  – Q&A should show that you read the paper
Changes in Schedule
  – Need to be out of town on Oct 4th (Tuesday)
  – Quiz 2 moved up 1 lecture
Tool and VHDL help

13 - Typical Pipeline Example: MIPS R4000
[Figure: IF and ID feed four parallel execution units, followed by MEM and WB: the integer unit (EX), the FP/int multiply unit (M1 to M7), the FP adder (A1 to A4), and the FP/int divider (Div, latency = 25, initiation interval = 25)]

14 - Prediction
Easy to fetch multiple (consecutive) instructions per cycle
  – Essentially speculating on sequential flow
Jump: unconditional change of control flow
  – Always taken
Branch: conditional change of control flow
  – Taken typically ~50% of the time in applications
      Backward: 30% of branches × 80% taken = ~24%
      Forward: 70% of branches × 40% taken = ~28%
      Together: ~24% + ~28% = ~52% of all branches taken

15 - Current Ideas
Reactive
  – Adapt current action based on the past
  – TCP windows
  – URL completion, ...
Proactive
  – Anticipate future action based on the past
  – Branch prediction
  – Long cache block
  – Tracing

16 - Branch Prediction Schemes
Static Branch Prediction
Dynamic Branch Prediction
  – 1-bit Branch-Prediction Buffer
  – 2-bit Branch-Prediction Buffer
  – Correlating Branch Prediction Buffer
  – Tournament Branch Predictor
Branch Target Buffer
Integrated Instruction Fetch Units
Return Address Predictors

17 - Static Branch Prediction
Execution profiling
  – Very accurate if you actually take the time to profile
  – Inconvenient
Heuristics based on nesting and coding
  – Simple heuristics are very inaccurate
Programmer-supplied hints
  – Inconvenient and potentially inaccurate

18 - Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction)
1-bit Branch History Table
  – Bitmap indexed by the lower bits of the PC address
  – Says whether or not the branch was taken last time
  – If the instruction is a branch, predict and update the table
Problem
  – A 1-bit BHT causes 2 mispredictions per loop
      First time through the loop, it predicts exit instead of loop
      At the end of the loop, it predicts loop instead of exit
  – If the average is 9 iterations before exit, accuracy is only 80% even though the branch loops 90% of the time
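The 80% figure can be reproduced with a small simulation of a single 1-bit entry. This is a sketch under stated assumptions: one table entry (no index aliasing), a loop branch that is taken 9 times and then falls through once, and an initial prediction of "not taken". The function name `simulate_one_bit` is hypothetical.

```python
def simulate_one_bit(outcomes, initial_prediction=False):
    """Predict each branch outcome as 'same as last time'; return the accuracy."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # the single bit records only the last outcome
    return correct / len(outcomes)


# Loop branch: taken 9 times, then not taken once at loop exit; repeated 100 times.
loop_outcomes = ([True] * 9 + [False]) * 100
accuracy = simulate_one_bit(loop_outcomes)
print(f"1-bit accuracy: {accuracy:.0%}")
```

Each pass through the loop costs two mispredictions out of ten branches: one on re-entry (the bit still says "not taken" from the previous exit) and one at the exit itself.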

19 - N-bit Dynamic Branch Prediction
N-bit scheme: change the prediction only after mispredicting N times
[Figure: state diagram with four states (Predict Taken, Predict Taken, Predict Not Taken, Predict Not Taken) connected by T (taken) and NT (not taken) transitions]
2-bit scheme: the prediction saturates after up to 2 consecutive outcomes in the same direction
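One common way to realize the 2-bit scheme is a saturating counter, sketched below. The class name and the starting state (weakly taken) are assumptions. The state diagram on the slide may use slightly different transitions between the two middle states than a pure up/down counter does.

```python
class SaturatingCounter:
    """n-bit up/down counter; the upper half of its range predicts 'taken'."""

    def __init__(self, bits=2):
        self.maximum = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.value = self.threshold  # assumed start: weakly taken

    def predict(self):
        return self.value >= self.threshold

    def update(self, taken):
        if taken:
            self.value = min(self.maximum, self.value + 1)
        else:
            self.value = max(0, self.value - 1)


def run(counter, outcomes):
    """Return the fraction of outcomes the counter predicts correctly."""
    correct = 0
    for taken in outcomes:
        correct += counter.predict() == taken
        counter.update(taken)
    return correct / len(outcomes)


# Same loop as the 1-bit case: taken 9 times, then one exit; repeated 100 times.
loop_outcomes = ([True] * 9 + [False]) * 100
two_bit_accuracy = run(SaturatingCounter(bits=2), loop_outcomes)
print(f"2-bit accuracy: {two_bit_accuracy:.0%}")
```

The single not-taken outcome at loop exit moves the counter only one step, so the next entry into the loop is still predicted taken and only the exit is mispredicted.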

20 - Correlating Branches
(2,2) predictor
  – 2-bit global: indicates the behavior of the last two branches
  – 2-bit local (2-bit dynamic branch prediction)
Branch History Table
  – Global branch history is used to choose one of four history bitmap tables
  – Predicts the branch behavior, then updates only the selected bitmap table
[Figure: the branch address (4 bits) indexes each table; the 2-bit recent global branch history (01 = not taken, then taken) selects which table supplies the prediction]
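A (2,2) predictor along these lines can be sketched as four tables of 2-bit counters, with the global history choosing the table and 4 bits of the branch address choosing the entry. The class name, the word-aligned index (`pc >> 2`), the initial counter value, and the test pattern are all assumptions made for illustration.

```python
class CorrelatingPredictor:
    """(m, n) = (2, 2): 2 bits of global history select among four 2-bit-counter tables."""

    def __init__(self, index_bits=4, history_bits=2):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0  # newest outcome in the low bit (01 = not taken, then taken)
        # Counters range 0..3; assumed start at 1 (weakly not taken).
        self.tables = [[1] * (1 << index_bits) for _ in range(1 << history_bits)]

    def _index(self, pc):
        return (pc >> 2) & self.index_mask  # assumes 4-byte instructions

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        table = self.tables[self.history]  # only the selected table is updated
        i = self._index(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


# A branch that strictly alternates taken / not taken defeats a lone 2-bit counter,
# but its outcome is fully determined by the recent history.
predictor = CorrelatingPredictor()
outcomes = [True, False] * 50
hits = 0
for taken in outcomes:
    hits += predictor.predict(0x40) == taken
    predictor.update(0x40, taken)
accuracy = hits / len(outcomes)
print(f"(2,2) accuracy on alternating branch: {accuracy:.0%}")
```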

21 - Accuracy of Different Schemes
[Figure: bar chart of the frequency of mispredictions (0% to 20%) on nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li for three predictors: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT]

22 - BHT Accuracy
Mispredict because either:
  – Wrong guess for the branch
  – Wrong index for the branch (the table entry holds another branch's history)
4096-entry table
  – Programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
For SPEC92
  – 4096 entries are about as good as an infinite table

23 - Tournament Branch Predictors
Correlating Predictor
  – 2-bit predictor failed on important branches
  – Better results by also using global information
Tournament Predictors
  – 1 predictor based on global information
  – 1 predictor based on local information
  – Use the predictor that guesses better
[Figure: the branch address (addr) feeds Predictor A and Predictor B]
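The "use the predictor that guesses better" rule is usually implemented with a table of chooser counters. The sketch below is a simplified illustration, not a model of any particular processor. The table sizes, the initial counter values, and the convention that a chooser value of 2 or more means "trust the global predictor" are all assumptions.

```python
def bump(counter, up, limit=3):
    """Move a saturating counter one step up or down within 0..limit."""
    return min(limit, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, entries=16, history_bits=4):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        self.local = [1] * entries                   # 2-bit counters indexed by PC
        self.global_ = [1] * (1 << history_bits)     # 2-bit counters indexed by history
        self.chooser = [1] * entries                 # >= 2 means "trust global"

    def _slot(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        i = self._slot(pc)
        local_guess = self.local[i] >= 2
        global_guess = self.global_[self.history] >= 2
        return global_guess if self.chooser[i] >= 2 else local_guess

    def update(self, pc, taken):
        i = self._slot(pc)
        local_ok = (self.local[i] >= 2) == taken
        global_ok = (self.global_[self.history] >= 2) == taken
        if local_ok != global_ok:  # train the chooser only when exactly one was right
            self.chooser[i] = bump(self.chooser[i], global_ok)
        self.local[i] = bump(self.local[i], taken)
        self.global_[self.history] = bump(self.global_[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = TournamentPredictor()
outcomes = [True, False] * 100  # alternating branch: hard for local, easy for global
hits = 0
for taken in outcomes:
    hits += predictor.predict(0x40) == taken
    predictor.update(0x40, taken)
accuracy = hits / len(outcomes)
print(f"tournament accuracy: {accuracy:.0%}")
```

On this pattern the chooser moves toward the global component after the first few disagreements, which is the behavior the slide describes.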

24 - Alpha 21264
4K 2-bit counters to choose from among a global predictor and a local predictor
Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
  – 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
Local predictor consists of a 2-level predictor:
  – Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
  – Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
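The total on this slide is a sum of four table sizes, which is easy to verify. The variable names below are hypothetical; the entry counts and widths are the ones given in the total-size line.

```python
K = 1024

chooser_bits = 4 * K * 2          # 4K choice counters, 2 bits each
global_bits = 4 * K * 2           # 4K global-predictor entries, 2 bits each
local_history_bits = 1 * K * 10   # 1K local history entries, 10 bits each
local_counter_bits = 1 * K * 3    # 1K local counters, 3 bits each

total_bits = chooser_bits + global_bits + local_history_bits + local_counter_bits
print(f"{total_bits} bits = {total_bits // K}K bits")
```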

25 - Branch Prediction Accuracy
[Figure: bar chart of prediction accuracy (0% to 100%) on gcc, espresso, li, fpppp, doduc, and tomcatv for three predictors: profile-based, 2-bit dynamic, and tournament]

26 - Accuracy versus Size

27 - Branch Target Buffer
Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  – Note: must check for a branch match now, since we can't use the wrong branch address
[Figure: the PC of the instruction in FETCH is compared (=?) against the stored Branch PC; each entry holds the Branch PC, the Predicted PC, and extra prediction state bits]
  – Yes: instruction is a branch; use the predicted PC as the next PC
  – No: branch not predicted; proceed normally (Next PC = PC+4)
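The fetch-time lookup can be sketched with a dictionary keyed by the full branch PC, which stands in for the tag comparison (=?) in the figure. This is a simplification: a hardware BTB has a fixed number of entries indexed by low PC bits, and the class name, the `update` policy of dropping not-taken branches, and the use of the addresses 10 and 36 from the earlier `beq` example are assumptions.

```python
class BranchTargetBuffer:
    """Maps branch PC -> predicted next PC; a miss means 'fetch PC + 4'."""

    def __init__(self):
        self.entries = {}

    def next_pc(self, pc):
        # Hit: the instruction is a predicted-taken branch; use the stored target.
        # Miss: not a known taken branch; proceed sequentially.
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # assumed policy: keep only taken branches


btb = BranchTargetBuffer()
print(btb.next_pc(10))             # miss before the branch has been seen
btb.update(10, taken=True, target=36)
print(btb.next_pc(10))             # hit: fetch from the stored target
print(btb.next_pc(14))             # unrelated instruction: sequential fetch
```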

28 - Predicated Execution
Built-in Hardware Support
  – Bit for predicated instruction execution
  – Both paths are in the code
  – Execution based on the result of the condition
No Branch Prediction is Required
  – Instructions not selected are ignored
  – Similar to inserting a nop in their place
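The idea that both paths are in the code and the condition only selects a result can be imitated in software. The sketch below is an analogy, not hardware: Python has no predicated instructions, so the predicate is applied arithmetically, and both function names are hypothetical.

```python
def abs_diff_branching(a, b):
    """Conventional version: control flow depends on the condition."""
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    """Predicated-style version: both paths execute, the predicate picks one result."""
    p = int(a > b)           # predicate bit
    then_result = a - b      # computed regardless of p
    else_result = b - a      # computed regardless of p
    return p * then_result + (1 - p) * else_result  # unselected result is discarded


for a, b in [(7, 3), (3, 7), (5, 5)]:
    print(a, b, abs_diff_branching(a, b), abs_diff_predicated(a, b))
```

The trade-off is visible even here: the predicated version does the work of both paths every time, in exchange for having no branch to predict.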

29 - Zero Cycle Jump
What really has to be done at runtime?
  – Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache.
  – Very limited form of dynamic compilation?
Code:
  A:  and  r3,r1,r5
      addi r2,r3,#4
      sub  r4,r2,r1
      jal  doit
      subi r1,r1,#1
Internal cache state (instruction, next address, flag):
      and  r3,r1,r5    A+4    N
      addi r2,r3,#4    A+8    N
      sub  r4,r2,r1    doit   L
      subi r1,r1,#1    A+20   N
Use of “pre-decoded” instruction cache
  – Called “branch folding” in the Bell Labs CRISP processor.
  – Original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction
  – Notice that JAL introduces a structural hazard on write

30 - Dynamic Branch Prediction Summary
Prediction is becoming an important part of scalar execution
Branch History Table
  – 2 bits for loop accuracy
Correlation
  – Recently executed branches are correlated with the next branch
  – Either different branches
  – Or different executions of the same branch
Tournament Predictor
  – Give more resources to competing solutions and pick between them
Branch Target Buffer
  – Branch address & prediction
Predicated Execution
  – No need for prediction
  – Hardware support needed