Computer Architecture 2011 (lectures 2-3): MIPS Pipeline
By Dan Tsafrir, 7/3/2011 and 14/3/2011
Presentation based on slides by David Patterson, Avi Mendelson, Randy Katz, and Lihu Rappoport

Pipeline idea: keep everyone busy

Pipeline: more accurately…
[Figure: a sandwich assembly line in which one expert cuts the bread, one places the roast beef, and one places the tomato and closes the sandwich]
Pipelining elsewhere:
- Unix shell: grep string File | wc -l
- Assembling cars
- Whenever we want to keep functional units busy

Pipeline: microarchitecture
- First commercial use in 1985
- In Intel chips since the 486 (until then, serial execution)
[Figure: program execution order of lw R1,100(R0); lw R2,200(R0); lw R3,300(R0) before and after pipelining; serially, each instruction (Inst Fetch, Reg, ALU, Data Access, Reg) starts 8 ns after the previous one, while in the pipelined version a new instruction starts every 2 ns]

MIPS
- Introduced in 1981 by Hennessy (of "Patterson & Hennessy")
- "Microprocessor without Interlocked Pipeline Stages"
- RISC
- Often used in computer architecture courses
- Was very successful (e.g., inspired the Alpha ISA)
Interlocks
- A mechanism preventing undesired states in a state machine
- Initially, "divide" & "multiply" required interlocks (they allowed stages to indicate they're busy), which paused the other stages upstream

Pipeline: principles
- Ideal speedup = number of pipeline stages
- Every clock cycle, one instruction finishes (namely, the IPC of an ideal pipelined machine is 1)
- Pipelining increases throughput rather than reducing latency: one instruction still takes the same time (or longer)
- Since the max speedup = number of stages, and the latency is determined by the slowest stage, we should:
  - Partition the pipe into many stages
  - Balance the work across stages
  - Shorten the longest stage as much as possible (the sketch below makes this concrete)
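To make these limits concrete, here is a minimal sketch (the stage latencies and instruction count are illustrative numbers, not from the lecture) comparing a serial machine with a pipelined one whose clock is set by its slowest stage:

```python
# Minimal sketch: pipeline speedup is limited by the number of stages and by
# the slowest stage (which dictates the clock). Numbers are illustrative.

def serial_time(stage_latencies_ns, num_instructions):
    """Each instruction runs all stages back to back."""
    return sum(stage_latencies_ns) * num_instructions

def pipelined_time(stage_latencies_ns, num_instructions):
    """The clock equals the slowest stage; after the fill, one instruction finishes per cycle."""
    cycle = max(stage_latencies_ns)
    fill = (len(stage_latencies_ns) - 1) * cycle
    return fill + num_instructions * cycle

n = 1_000_000
balanced = [2, 2, 2, 2, 2]       # 5 balanced stages, 2 ns each
imbalanced = [2, 2, 5, 2, 2]     # one slow stage drags the whole clock down

print(round(serial_time(balanced, n) / pipelined_time(balanced, n), 2))      # ~5.0
print(round(serial_time(imbalanced, n) / pipelined_time(imbalanced, n), 2))  # ~2.6
```

With balanced stages the speedup approaches the number of stages; a single slow stage caps it well below that, which is why the slides insist on balancing and shortening the longest stage.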

Pipeline: overheads & limitations
- Can increase per-instruction latency, due to stage imbalance
- Requires more logic (e.g., for latches)
- Time to "fill" the pipe reduces the speedup
- Time to "drain" the pipe reduces the speedup (e.g., upon an interrupt or context switch)
- Stalls for dependencies
- Too many pipe stages start to lose performance

Pipelined CPU

Pipeline: fetch
- Bring the next instruction from memory; 4 bytes (32 bits) per instruction
- The instruction is saved in a latch, in preparation for the next pipe stage
- When not branching, the next instruction is in the next word

Pipeline: decode + regs fetch
- Decode the source register numbers and read their values from the register file
- Register IDs are 5 bits (2^5 = 32 registers)

Pipeline: decode + regs fetch
- Decode & sign-extend the immediate (from 16 bits to 32)

Pipeline: decode + regs fetch
- Decode the destination register (it can be one of two, depending on the op) & save it in the latch for the next stage
- Based on the op type, the next phase will determine which of the two registers is the destination

Pipeline: execute
- The ALU computes an "R" operation (the "shift" field is missing from this illustration)
- Inputs: reg1, reg2, func (6 bits); the result goes to reg3

Pipeline: execute
- The ALU computes an "I" operation (not a branch and not a load/store)
- Inputs: reg1, immediate, opcode; the result goes to reg2

Pipeline: execute
- The ALU computes an "I" operation: a conditional branch, BEQ or BNE
- [ if (reg1 == reg2) pc = pc+4 + (imm << 2) ]
- Inputs: reg1, reg2, immediate, opcode; output: branch taken or not?

Pipeline: execute
- The ALU computes an "I" operation: a load (store is similar)
- ( reg2 = mem[reg1 + imm] )
- Inputs: reg1, immediate; the result goes to reg2

Pipeline: updating the PC
- No branch: just add 4 to the PC
- Unconditional branch: add the immediate to PC+4 (J-type operation)
- Conditional branch: depends on the result of the ALU

Pipelined CPU with Control

Pipeline Example: cycles 1-4
The example traces the following program through the pipeline, one cycle per slide:
0:  lw  R10, 9(R1)
4:  sub R11, R2, R3
8:  and R12, R4, R5
12: or  R13, R6, R7
[Figure: the pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB latches, register file, ALU, data memory, and control signals such as RegWrite, ALUSrc, MemRead, MemWrite, Branch, PCSrc, MemtoReg); by cycle 4, lw is reading data memory at address [R1]+9, sub is computing [R2]-[R3], and is reading [R4] and [R5], and or is being fetched]

Structural Hazards

Structural Hazard
- Two instructions attempt to use the same resource simultaneously
- Problem: the register file is accessed in 2 stages:
  - Written during stage 5 (WB)
  - Read during stage 2 (ID)
  => Resource (RF) conflict
- Solution:
  - Split the stage into two sub-stages: write in the first half, read in the second half
  - 2 read ports, 1 (separate) write port

Structural Hazard
- Problem: memory is accessed in 2 stages:
  - Fetch (stage 1), when reading instructions from memory
  - Memory (stage 4), when data is read from / written to memory
- Solution:
  - "Memory" is actually a cache
  - Use separate instruction and data caches

Dependencies: RAW Hazard
- The problem with starting the next instruction before the first has finished: dependencies that "go backward in time" are data hazards
sub R2, R1, R3
and R12, R2, R5
or  R13, R6, R2
add R14, R2, R2
sw  R15, 100(R2)
[Figure: pipeline diagram over clock cycles CC1-CC9; R2 (whose value changes from 10 to -20) is written at the end of sub's WB stage, but the following instructions read it earlier]
(A sketch that spots these dependences follows below.)
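As a software-level illustration (not the lecture's hardware), the following sketch scans a short instruction sequence, given each instruction's destination and source registers, and reports the RAW dependences that the pipeline diagram shows graphically:

```python
# Minimal sketch: find read-after-write (RAW) dependences in a short sequence.
# Each instruction is (opcode, destination register, source registers).

instrs = [
    ("sub", "R2",  ["R1", "R3"]),
    ("and", "R12", ["R2", "R5"]),
    ("or",  "R13", ["R6", "R2"]),
    ("add", "R14", ["R2", "R2"]),
    ("sw",  None,  ["R15", "R2"]),   # the store writes memory, not a register
]

for i, (op_i, dst, _) in enumerate(instrs):
    if dst is None:
        continue
    for j in range(i + 1, len(instrs)):
        op_j, _, srcs = instrs[j]
        if dst in srcs:
            print(f"RAW: instruction {j} ({op_j}) reads {dst}, written by instruction {i} ({op_i})")
```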

RAW Hazard: HW Solution 1 - Add Stalls
sub R2, R1, R3
(stall)
and R12, R2, R5
or  R13, R6, R2
add R14, R2, R2
sw  R15, 100(R2)
- Have the hardware detect the hazard and add stalls (bubbles) if needed
- Problem: slow!
- Solution: forwarding, whenever possible
[Figure: pipeline diagram over CC1-CC9 with bubbles inserted after sub, so the instructions that read R2 see its new value (10/-20)]

RAW Hazard: HW Solution 2 - Forwarding
- Use temporary results; don't wait for them to be written to the register file
  - Register-file forwarding handles a read and a write to the same register in the same cycle
  - ALU forwarding feeds ALU results directly back to the ALU inputs
sub R2, R1, R3
and R12, R2, R5
or  R13, R6, R2
add R14, R2, R2
sw  R15, 100(R2)
[Figure: pipeline diagram showing the value of R2 being forwarded from the EX/MEM and MEM/WB latches to the ALU inputs of the dependent instructions]

Forwarding Hardware

Forwarding Hardware
- Two mux units were added before the ALU
- Each mux gets 3 inputs, from:
  1. The previous stage (ID/EX)
  2. The next stage (EX/MEM)
  3. The one after that (MEM/WB)
- The forwarding unit tells the 2 muxes which input to use

Forwarding Control
- EX hazard:
  if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) ALUSelA = 1
  if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) ALUSelB = 1
- MEM hazard:
  if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg ≠ ID/EX.ReadReg1)) and (MEM/WB.WriteReg = ID/EX.ReadReg1)) ALUSelA = 2
  if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg ≠ ID/EX.ReadReg2)) and (MEM/WB.WriteReg = ID/EX.ReadReg2)) ALUSelB = 2
In words, for the first EX-hazard condition: if the instruction in the memory stage is writing its output to a register, and the register it is writing to also happens to be input register 1 of the execute stage, then mux A should select input 1, i.e., the ALU should feed itself. A sketch of this selection logic follows below.
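A minimal Python sketch of the same selection logic (the latch dictionaries and field names mirror the slide's pseudocode but are otherwise made up for illustration):

```python
# Sketch of the forwarding-mux selection above. The dictionaries stand in for
# the ID/EX, EX/MEM and MEM/WB pipeline latches.

def forward_selects(id_ex, ex_mem, mem_wb):
    """Return (ALUSelA, ALUSelB): 0 = value read in ID, 1 = EX/MEM result, 2 = MEM/WB result."""
    sel_a, sel_b = 0, 0

    # EX hazard: the instruction one stage ahead writes a register we need right now.
    if ex_mem["RegWrite"] and ex_mem["WriteReg"] == id_ex["ReadReg1"]:
        sel_a = 1
    if ex_mem["RegWrite"] and ex_mem["WriteReg"] == id_ex["ReadReg2"]:
        sel_b = 1

    # MEM hazard: the instruction two stages ahead writes the register we need,
    # unless the newer EX/MEM result already covers it.
    if (mem_wb["RegWrite"]
            and (not ex_mem["RegWrite"] or ex_mem["WriteReg"] != id_ex["ReadReg1"])
            and mem_wb["WriteReg"] == id_ex["ReadReg1"]):
        sel_a = 2
    if (mem_wb["RegWrite"]
            and (not ex_mem["RegWrite"] or ex_mem["WriteReg"] != id_ex["ReadReg2"])
            and mem_wb["WriteReg"] == id_ex["ReadReg2"]):
        sel_b = 2
    return sel_a, sel_b

# Example: "sub R2,R1,R3" is in MEM, "and R12,R2,R5" is in EX -> forward R2 from EX/MEM.
print(forward_selects({"ReadReg1": 2, "ReadReg2": 5},
                      {"RegWrite": True, "WriteReg": 2},
                      {"RegWrite": False, "WriteReg": 0}))   # (1, 0)
```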

Forwarding Hardware Example: Bypassing From EX to Src1 and From WB to Src2
lw  R11, 9(R1)
sub R10, R2, R3
and R12, R10, R11
(load op => read from "1")

Forwarding Hardware Example #2: Bypassing From WB to Src2
sub R10, R2, R3
xxx
and R12, R10, R11
(not a load op => read from "0")

RF Split => no need to forward
sub R2, R1, R3
xxx
and R12, R2, R11
- The register file is written during the first half of the cycle
- The register file is read during the second half of the cycle
- Thus the register file is written before it is read, and the read returns the correct data

Can't always forward (a stall is inevitable)
- A "load" op can cause un-forward-able hazards:
  - Load a value into R
  - The next instruction uses R as an input
- A hazard detection unit is needed to stall the pipeline in this case
lw  R2, 30(R1)
and R12, R2, R5
or  R13, R6, R2
add R14, R2, R2
sw  R15, 100(R2)
[Figure: pipeline diagram over CC1-CC9; the loaded value is available only after lw's MEM stage, too late to forward to and's EX stage without a stall]

Stalling
- De-assert the enable of the ID/EX latch
  - The dependent instruction (and) stays another cycle in the ID/EX latch
- De-assert the enable of the IF/ID latch and of the PC
  - This freezes the pipeline stages preceding the stalled instruction
- Issue a NOP into the EX/MEM latch (instead of the stalled instruction)
- Allow the stalling instruction (lw) to move on

Hazard Detection (Stall) Logic
if (ID/EX.RegWrite and (ID/EX.opcode = lw) and
    ((ID/EX.WriteReg = IF/ID.ReadReg1) or (ID/EX.WriteReg = IF/ID.ReadReg2)))
then stall
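A minimal sketch of this stall condition (the latch dictionaries are made-up stand-ins; the field names follow the slide):

```python
# Sketch of the load-use hazard check above.

def must_stall(if_id, id_ex):
    """True if the instruction in ID/EX is a load whose destination is read by the instruction in IF/ID."""
    return (id_ex["RegWrite"]
            and id_ex["opcode"] == "lw"
            and id_ex["WriteReg"] in (if_id["ReadReg1"], if_id["ReadReg2"]))

# Example: "lw R2,30(R1)" is one stage ahead of "and R12,R2,R5" -> stall one cycle.
print(must_stall({"ReadReg1": 2, "ReadReg2": 5},
                 {"RegWrite": True, "opcode": "lw", "WriteReg": 2}))   # True
```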

Forwarding + Hazard Detection Unit

Example: code for the following (assume all variables are in memory):
a = b + c;
d = e - f;

Slow code:
LW  Rb, b
LW  Rc, c
(stall)
ADD Ra, Rb, Rc
SW  a, Ra
LW  Re, e
LW  Rf, f
(stall)
SUB Rd, Re, Rf
SW  d, Rd

Fast code:
LW  Rb, b
LW  Rc, c
LW  Re, e
ADD Ra, Rb, Rc
LW  Rf, f
SW  a, Ra
SUB Rd, Re, Rf
SW  d, Rd

The instruction order can be changed as long as correctness is kept (no dependencies violated). Compiler scheduling helps avoid load hazards when possible, as the sketch below illustrates.
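To see why the reordered code is faster, here is a small sketch (a simplified model, not a real scheduler) that counts load-use stalls, assuming forwarding covers every case except a load followed immediately by a consumer of its result:

```python
# Sketch: count 1-cycle load-use stalls in an instruction sequence.
# Each instruction is (opcode, destination, source registers).

def load_use_stalls(instrs):
    stalls = 0
    for prev, curr in zip(instrs, instrs[1:]):
        p_op, p_dst, _ = prev
        _, _, c_srcs = curr
        if p_op == "LW" and p_dst in c_srcs:   # consumer right after the load
            stalls += 1
    return stalls

slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, ["Ra"]),
        ("LW", "Re", []), ("LW", "Rf", []), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("LW", "Rf", []), ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]

print(load_use_stalls(slow), load_use_stalls(fast))   # 2 stalls vs. 0
```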

14/3/2011

Control Hazards

Branch, but where?
- The decision whether to branch happens deep within the pipeline
- Likewise, the target of the branch becomes known deep within the pipeline
- How does this affect the pipeline logic?
- For example…

Executing a BEQ Instruction (i)
BEQ R4, R5, 27 ; if (R4-R5 == 0) then PC ← PC+4 + SignExt(27)*4 ; else PC ← PC+4
Assume this program state:
0:  or  ...
4:  beq R4, R5, 27
8:  and ...
12: sw  ...
16: sub ...
At this point we know the values of the registers, but we do not yet know whether the branch will be taken, or what its target is.

Executing a BEQ Instruction (ii)
BEQ R4, R5, 27 ; if (R4-R5 == 0) then PC ← PC+4 + SignExt(27)*4 ; else PC ← PC+4
In the execute stage we calculate the branch condition (compute R4-R5 and compare to 0) and the branch target. Now we know, but only in the next cycle will this affect the PC.

Executing a BEQ Instruction (iii)
BEQ R4, R5, 27 ; if (R4-R5 == 0) then PC ← PC+4 + SignExt(27)*4 ; else PC ← PC+4
Finally, if taken, the branch sets the PC. (A sketch of this next-PC computation follows below.)
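The branch semantics on the slide can be written out directly; this is a minimal sketch of the next-PC computation (pure arithmetic, not the datapath):

```python
# Sketch of the slide's BEQ semantics: PC <- PC+4 + SignExt(imm)*4 when taken.

def sign_extend16(imm):
    """Sign-extend a 16-bit immediate."""
    return imm - (1 << 16) if imm & 0x8000 else imm

def beq_next_pc(pc, r_s, r_t, imm16):
    if r_s == r_t:                                    # condition: R4 - R5 == 0
        return pc + 4 + (sign_extend16(imm16) << 2)   # word offset from PC+4
    return pc + 4                                     # fall through

print(hex(beq_next_pc(0x4, 7, 7, 27)))   # taken:     0x4 + 4 + 27*4 = 0x74
print(hex(beq_next_pc(0x4, 7, 8, 27)))   # not taken: 0x8
```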

Control Hazard on Branches
[Figure: pipeline diagram of beq followed by and, sw, sub, and then the instruction from the branch target; the branch outcome updates the PC only several cycles after the beq is fetched]
Outcome: the 3 instructions following the branch are already in the pipeline, even if the branch is taken!

Stall
- Easiest solution: stall the pipe when a branch is encountered, until it is resolved
- The impact of stalling, assuming:
  - CPI = 1
  - 20% of instructions are branches (realistic)
  - A 3-cycle stall on every branch
- Is:
  - CPI_new = 1 + 0.2 × 3 = 1.6
  - [ CPI_new = CPI_ideal + average stall cycles per instruction ]
- Namely: we lose 60% of the performance!

Static branch prediction: not taken
- Execute instructions from the fall-through (not-taken) path, as if there were no branch
- If the branch is indeed not taken (~50%), no penalty is paid
- If the branch is actually taken:
  - Flush the fall-through-path instructions before they change the machine state (memory / registers)
  - Fetch the instructions from the correct (taken) path
- Assuming ~50% of branches are not taken on average:
  - CPI_new = 1 + (0.2 × 0.5) × 3 = 1.3
  - A 30% slowdown instead of 60%
- Actually, we can do much better: with modern branch predictors, misprediction is often < 2% (the sketch below plugs these numbers into the CPI model)
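The CPI numbers on this slide and its neighbors come from one formula; here is a minimal sketch (the parameter names are made up):

```python
# Sketch of the slides' CPI model:
#   CPI_new = CPI_ideal + branch_fraction * penalty_fraction * stall_cycles

def cpi(branch_frac=0.2, penalty_frac=1.0, stall_cycles=3, cpi_ideal=1.0):
    """penalty_frac: fraction of branches that actually pay the stall."""
    return cpi_ideal + branch_frac * penalty_frac * stall_cycles

print(round(cpi(penalty_frac=1.0), 3))    # always stall:              1.6
print(round(cpi(penalty_frac=0.5), 3))    # predict not-taken (~50%):  1.3
print(round(cpi(penalty_frac=0.02), 3))   # good dynamic predictor:    1.012
```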

Dynamic branch prediction
- Given an instruction, we need to predict:
  - Whether it is a branch
  - Whether the branch is taken
  - What the target address is
- To avoid stalling, we need all of this at the end of the fetch phase, before we even know what the instruction is…
- We do this with the help of the BTB (Branch Target Buffer)

BTB: fast lookup table
- The table holds, per entry: Branch PC, Target PC, and History (taken or not taken the last few times)
- The PC of the fetched instruction is compared against the Branch PC entries:
  - No match => we don't know anything about it, so we don't predict
  - Match => the instruction is a branch, so let's predict it: the stored history gives the direction and the stored Target PC gives the predicted target
(A toy sketch of this lookup follows below.)
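A toy, dictionary-based sketch of the lookup (a real BTB is a small hardware table indexed and tagged by PC bits, and the single "taken last time" bit here is a simplification of the history field):

```python
# Toy BTB: branch PC -> predicted target and last outcome.

btb = {}

def predict(pc):
    """Predicted next PC at fetch time."""
    entry = btb.get(pc)
    if entry is None:            # BTB miss: we don't even know this is a branch
        return pc + 4
    return entry["target"] if entry["taken"] else pc + 4

def update(pc, taken, target):
    """Called once the branch outcome is known; allocate taken branches only."""
    if pc in btb or taken:
        btb[pc] = {"target": target, "taken": taken}

update(0x4, True, 0x74)      # the first taken execution of the beq at 0x4 allocates it
print(hex(predict(0x4)))     # 0x74: predicted taken next time it is fetched
print(hex(predict(0x8)))     # 0xc:  not in the BTB, fall through
```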

How it works in a nutshell
- Until proven otherwise, assume branches are not taken
  - Fall through to the next instructions (assume the branch has no effect)
- Upon the first time a branch is taken:
  - Pay the price (in terms of stalls), but
  - Save the details of the branch in the BTB (= PC + target PC + whether or not the branch was taken)
- While fetching, the hardware checks (in parallel) whether the PC is in the BTB
  - If found, make a prediction: taken? what address?
- Upon a misprediction:
  - Flush (throw out) the latches' content & start over from the right PC

Prediction steps
1. Allocate
   - Insert an instruction into the BTB once it is identified as a taken branch
   - Insert both conditional & unconditional branches
   - Do not insert untaken branches: implicitly predict that they will continue to be not taken
2. Predict
   - The BTB lookup is done in parallel with the PC lookup, providing:
     - An indication of whether the PC is a branch (=> BTB "hit")
     - The branch target
     - The branch direction (forward or backward in the program)
     - The branch type (conditional or not)
3. Update (when the branch is taken & its outcome becomes known)
   - Update the branch target, the history (taken or not), and the type

Misprediction
- Occurs when:
  - Predicted not taken, but in reality taken
  - Predicted taken, but in reality not taken
  - The branch is taken as predicted, but to the wrong target (e.g., a jmp through a register)
- Must flush the pipeline:
  - Reset the latches (same as making all in-flight instructions NOPs)
  - Set the PC source to the correct path
  - Start fetching instructions from the correct path

CPI
- Assuming a fraction p of correct predictions:
  - CPI_new = 1 + (0.2 × (1-p)) × 3
- Example, p = 0.7:
  - CPI_new = 1 + (0.2 × 0.3) × 3 = 1.18
- Example, p = 0.98:
  - CPI_new = 1 + (0.2 × 0.02) × 3 = 1.012
- (But this is a simplistic model; in reality the price can sometimes be much higher.)

History & prediction algorithm
- We can save a history window: what happened last time, and the time before, and before that…
  - The bigger the window, the greater the complexity
- Some branches regularly alternate between taken & untaken (taken, then untaken, then taken, …)
  - Only one history bit is needed to identify this pattern
- Some branches exhibit "locality": they typically behave as they did the last time they were invoked, i.e., they depend on their previous outcome (& it alone); a sketch of this scheme follows below
- Some branches are correlated with previous branches (those that lead to them)
- "Always predict backward branches taken" works for long loops
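As a concrete example of the simplest history-based scheme the slide alludes to (one history bit per branch, predicting that a branch repeats its last outcome; real predictors keep longer histories and exploit correlation), consider:

```python
# Sketch: a one-bit "repeat the last outcome" predictor.

from collections import defaultdict

history = defaultdict(lambda: False)   # branch PC -> taken last time? (start: not taken)

def predict(pc):
    return history[pc]

def update(pc, taken):
    history[pc] = taken

# A loop branch taken 9 times and then falling through mispredicts only twice:
# on the first iteration and on the loop exit.
mispredictions = 0
for taken in [True] * 9 + [False]:
    if predict(0x40) != taken:
        mispredictions += 1
    update(0x40, taken)
print(mispredictions)   # 2
```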

Adding a BTB to the Pipeline

Using The BTB
[Flowchart, IF/ID stages: the PC moves to the next instruction; the instruction memory gets the PC and fetches the new instruction while the BTB gets the PC and looks it up. On a BTB hit with the branch predicted taken, PC ← predicted address and the IF/ID latch is loaded with the predicted-path instruction; otherwise PC ← PC + 4 and the IF/ID latch is loaded with the sequential instruction. In ID, if the instruction turns out not to be a branch, execution simply continues into EXE.]

Using The BTB (cont.)
[Flowchart, ID/EXE/MEM/WB stages: if the instruction is a branch, calculate the branch condition & target in EXE. If the prediction was correct, continue and update the BTB; if not, flush the pipe, update the PC, and load the IF/ID latch with the correct instruction. Non-branches simply continue.]

Prediction algorithm
- One could do an entire course on this issue; it is still actively researched
- As noted, modern predictors can often achieve a misprediction rate < 2%
- Still, it has been shown that these 2% can sometimes significantly worsen performance
- We didn't talk about indirect branches
  - As in virtual function calls (object-oriented code)
  - Where the branch target is written in memory, elsewhere