CDA 5155 Week 3 Branch Prediction Superscalar Execution.



[Datapath diagram: the 5-stage pipeline (IF/ID, ID/EX, EX/Mem, Mem/WB pipeline registers) with PC, instruction memory, register file, ALU, data memory, sign extend, and control, extended with beq branch logic: a branch target adder and eq? comparison driving a mux before the PC]

Branch Target Buffer
Send the fetch PC to the BTB:
–Found? Yes: use the predicted target PC
–No: use PC+1

Branch prediction
Predict not taken: ~50% accurate
–No BTB needed; always use PC+1
Predict backward taken: ~65% accurate
–BTB holds targets for backward branches (loops)
Predict same as last time: ~80% accurate
–Update BTB for any taken branch

What about indirect branches? Could use the same approach, but:
–PC+1 is an unlikely indirect target
–Indirect jumps often have multiple targets (for the same instruction): switch statements, virtual function calls, shared library (DLL) calls

Indirect jump special case: the return address stack
Function returns have deterministic behavior (usually):
–They return to different locations (a BTB doesn't work well)
–The return location is known ahead of time: in some register at the time of the call
Build a specialized structure for return addresses:
–Call instructions write the return address to R31 AND push it on the RAS
–Return instructions pop the predicted target off the stack
Issues: finite size (save or forget on overflow?); long jumps (clear when wrong?)
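The push/pop behavior above can be sketched in software; the stack size and the overflow policy (forget the oldest entry) are illustrative assumptions, not details from the slides.

```python
# Minimal return-address-stack (RAS) sketch. Size and overflow
# policy (drop the oldest entry) are illustrative assumptions.
class ReturnAddressStack:
    def __init__(self, size=8):
        self.size = size
        self.entries = []

    def push(self, return_addr):
        # On a call: remember where the matching return should go.
        if len(self.entries) == self.size:
            self.entries.pop(0)      # overflow: forget the oldest
        self.entries.append(return_addr)

    def predict_return(self):
        # On a return: pop the predicted target (None if empty).
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack(size=4)
ras.push(0x1004)     # call at 0x1000 -> return to 0x1004
ras.push(0x2008)     # nested call at 0x2004 -> return to 0x2008
assert ras.predict_return() == 0x2008   # inner return predicted first
assert ras.predict_return() == 0x1004
```

Popping the most recent entry matches call/return nesting, which is exactly why a stack beats a BTB here.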

Costs of branch prediction/speculation
Performance costs?
–Minimal: there is no difference between waiting and squashing, and it is a huge gain when the prediction is correct!
Power?
–Large: in very long/wide pipelines many instructions can be squashed
–Squashed = #mispredictions × pipeline width × depth before the target is resolved
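Plugging illustrative numbers into the relation above (all values are assumptions, not measurements from the lecture):

```python
# Squashed work = mispredictions x (depth before resolution x width).
# All numbers below are illustrative assumptions.
mispredictions = 1_000_000
depth_before_resolve = 10    # stages fetched past the branch before it resolves
width = 4                    # instructions per stage in a 4-wide machine
squashed = mispredictions * depth_before_resolve * width
assert squashed == 40_000_000   # 40M instructions of wasted fetch/decode energy
```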

Costs of branch prediction/speculation
Area?
–Can be large: predictors can get very big, as we will see next time
Complexity?
–Designs are more complex
–Testing becomes more difficult, but …

What else can be speculated?
Dependencies
–"I think this data is coming from that store instruction"
Values
–"I think I will load a 0 value"
Accuracy?
–Branch prediction (direction) is Boolean (T, NT)
–Branch targets are stable or predictable (RAS)
–Dependencies are limited in number
–Values cover a huge space (0 – 4B)

Parts of the branch predictor
Direction predictor (for conditional branches)
–Predicts whether the branch will be taken
–Examples: always taken; backwards taken
Address predictor
–Predicts the target address (used if predicted taken)
–Examples: BTB; return address stack; precomputed branch
Recovery logic
Ref: The Precomputed Branch Architecture

Characteristics of branches
Individual branches differ:
–Loops tend not to exit
Unoptimized code: not-taken
Optimized code: taken
–If-statements tend to be less predictable
–Unconditional branches still need address prediction

Example: gzip loop branch 0x…d8
Executed: … times; taken: … times; not-taken: 10 times
% time taken: 99% – 100%
Easy to predict (direction and address)

Example: gzip if branch 0x12000fa04
Executed: … times; taken: 71480 times; not-taken: 79929 times
% time taken: ~49%
Easy to predict? (maybe not / maybe dynamically)

Example: gzip
Direction prediction: always taken
Accuracy: ~73%
Easy to predict
[Chart: behavior of branches A and B]

Branching backwards
Most backward branches are heavily TAKEN
Forward branches are slightly more likely to be NOT-TAKEN
Ref: The Effects of Predicated Execution on Branch Prediction

Using history
1-bit history (direction predictor)
–Remember the last direction (NT/T) for each branch
[Diagram: the branch PC indexes a Branch History Table of 1-bit entries]
How big is the BHT?
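A 1-bit BHT fits in a few lines of simulation; the table size and the initial not-taken state are assumptions. Note that on a loop branch it mispredicts twice per loop visit: once at the exit and again on re-entry.

```python
# 1-bit branch history table sketch: remember only the last outcome
# per branch. Table size and initial state (not-taken) are assumptions.
class OneBitPredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)   # False = not taken

    def predict(self, pc):
        return self.table[pc & self.mask]

    def update(self, pc, taken):
        self.table[pc & self.mask] = taken

# A loop branch taken 4 times then not taken, visited 10 times:
p = OneBitPredictor()
misses = 0
for _ in range(10):
    for taken in [True, True, True, True, False]:
        if p.predict(0x40) != taken:
            misses += 1
        p.update(0x40, taken)
assert misses == 20   # two mispredicts per visit: exit and re-entry
```

With 50 dynamic branches and 20 misses, that is only 60% accuracy on this pattern.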

Example: gzip
Direction prediction: always taken
Accuracy: ~73%
How many times will branch A mispredict? How many times will branch B mispredict?

Using history
2-bit history (direction predictor)
[Diagram: the branch PC indexes a Branch History Table of 2-bit saturating counters with states SN, NT, T, ST]
How big is the BHT?
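A sketch of the 2-bit saturating-counter BHT (states SN=0, NT=1, T=2, ST=3); the table size and weakly-not-taken initial state are assumptions. On the same loop pattern as before (taken 4 times, then exit, visited 10 times), the hysteresis removes the re-entry mispredict:

```python
# 2-bit saturating counter BHT sketch. Counter states:
# 0 = strongly NT, 1 = weakly NT, 2 = weakly T, 3 = strongly T.
class TwoBitPredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor()
misses = 0
for _ in range(10):                 # loop branch: taken 4x, then exit
    for taken in [True, True, True, True, False]:
        if p.predict(0x40) != taken:
            misses += 1
        p.update(0x40, taken)
assert misses == 11   # 2 warm-up misses, then only 1 per loop exit
```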

Example: gzip
Direction prediction: always taken
Accuracy: ~76%
How many times will branch A mispredict? How many times will branch B mispredict?

Using history patterns
~80 percent of branches are either heavily TAKEN or heavily NOT-TAKEN
For the other 20%, we need to look at patterns of reference to see if they are predictable using a more complex predictor
Example: gcc has a branch that flips each time: T(1), NT(0), T(1), NT(0), …

Local history
[Diagram: the branch PC indexes a Branch History Table of per-branch history bits; that history pattern then indexes a Pattern History Table of NT/T entries]
What is the prediction for this BHT? When do I update the tables?

Local history
On the next execution of this branch instruction, the branch history table entry is …, pointing to a different pattern
What is the accuracy of a flip/flop branch …?
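A two-level local-history sketch answers the flip/flop question: once the per-branch history register fills, patterns 010 and 101 each map to their own stable PHT counter, so the alternating branch becomes ~100% predictable after warm-up. History length, table sizes, and initial states here are assumptions.

```python
# Two-level local predictor sketch: a per-branch history register
# indexes a pattern history table of 2-bit counters.
class LocalHistoryPredictor:
    def __init__(self, hist_bits=3, index_bits=10):
        self.hist_mask = (1 << hist_bits) - 1
        self.idx_mask = (1 << index_bits) - 1
        self.bht = [0] * (1 << index_bits)     # per-branch histories
        self.pht = [1] * (1 << hist_bits)      # 2-bit counters, weakly NT

    def predict(self, pc):
        return self.pht[self.bht[pc & self.idx_mask]] >= 2

    def update(self, pc, taken):
        i = pc & self.idx_mask
        h = self.bht[i]
        self.pht[h] = min(3, self.pht[h] + 1) if taken else max(0, self.pht[h] - 1)
        self.bht[i] = ((h << 1) | int(taken)) & self.hist_mask

# Flip/flop branch (T, NT, T, NT, ...): perfect after a short warm-up.
lp = LocalHistoryPredictor()
late_misses = 0
for k in range(50):
    taken = (k % 2 == 0)
    if k >= 10 and lp.predict(0x40) != taken:
        late_misses += 1
    lp.update(0x40, taken)
assert late_misses == 0
```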

Global history
[Diagram: a single global Branch History Register indexes the Pattern History Table]
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { …
How can branches interfere with each other?
for (i=0; i<100; i++)
  for (j=0; j<3; j++)
The j<3 test goes taken, taken, not taken each time around; the i<100 test is usually taken.

Gshare predictor
[Diagram: the branch PC is XORed with the Branch History Register to index the Pattern History Table]
Ref: Combining Branch Predictors. Must read!
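A sketch of gshare in the spirit of McFarling's Combining Branch Predictors: XOR the global history register into the PC to index a single table of 2-bit counters. Table size, history length, and initial states are assumptions.

```python
# Gshare sketch: (global history XOR branch PC) indexes one PHT of
# 2-bit counters. Sizes and initial states are assumptions.
class Gshare:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.pht = [1] * (1 << index_bits)   # weakly not-taken
        self.ghr = 0                         # global history register

    def _index(self, pc):
        return (pc ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

# The gcc-style flip/flop branch: the alternating history yields two
# distinct PHT indices, so gshare predicts it perfectly after warm-up.
g = Gshare()
late_misses = 0
for k in range(100):
    taken = (k % 2 == 0)
    if k >= 50 and g.predict(0x100) != taken:
        late_misses += 1
    g.update(0x100, taken)
assert late_misses == 0
```

Sharing one table means unrelated branches can alias, which is the interference question raised on the global-history slide.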

Bi-mode predictor
[Diagram: the branch PC XORed with the global history register indexes two PHTs, one skewed taken and one skewed not-taken; a choice predictor indexed by the branch PC drives a mux selecting between them]

Tournament predictors
Local predictor (e.g., 2-bit) gives prediction 1
Global/gshare predictor (much more state) gives prediction 2
A selection table (2-bit state machines) picks the final prediction
How do you select which predictor to use? How do you update the various predictors/selector?
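One common answer to the selection question, sketched with assumed sizes and conventions: a table of 2-bit counters, trained only when the two component predictors disagree, nudged toward whichever component was correct.

```python
# Tournament chooser sketch: a per-entry 2-bit counter selects between
# two component predictions. Sizes and conventions are assumptions.
class Chooser:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # 0-1 favor local, 2-3 favor global

    def choose(self, pc, local_pred, global_pred):
        return global_pred if self.table[pc & self.mask] >= 2 else local_pred

    def update(self, pc, local_pred, global_pred, taken):
        if local_pred == global_pred:
            return                      # train only on disagreement
        i = pc & self.mask
        if global_pred == taken:
            self.table[i] = min(3, self.table[i] + 1)   # global was right
        else:
            self.table[i] = max(0, self.table[i] - 1)   # local was right

c = Chooser()
assert c.choose(0, False, True) is False    # initially favors local
for _ in range(2):
    c.update(0, False, True, True)          # global keeps being right
assert c.choose(0, False, True) is True     # now favors global
```

The components themselves are updated with the real outcome every time; only the chooser waits for a disagreement.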

Overriding predictors
Big predictors are slow, but more accurate
Use a single-cycle predictor in fetch
Start the multi-cycle predictor at the same time
–When it completes, compare it to the fast prediction:
If the same, do nothing
If different, assume the slow predictor is right and flush the pipeline
Advantage: reduced branch penalty for those branches mispredicted by the fast predictor and correctly predicted by the slow predictor

Pipelined gshare predictor
How can we get a pipelined global prediction by stage 1?
–Start in stage −2
–But then we don't have the most recent branch history…
Access multiple entries
–E.g., if we are missing the last three branches, read all 8 candidate histories and pick between them during the fetch stage
Ref: Reconsidering Complex Branch Predictors

Exceptions
Exceptions are events that are difficult or impossible to manage in hardware alone.
Exceptions are usually handled by jumping into a service (software) routine.
Examples: I/O device request, page fault, divide by zero, memory protection violation (seg fault), hardware failure, etc.

Taking an exception
Once an exception occurs, how does the processor proceed?
–Non-pipelined: don't fetch from the PC; save state; fetch from the interrupt vector table
–Pipelined: depends on the exception
Precise interrupt: must squash all instructions "after the exception"
–Divide by zero: flush fetch/decode
–Page fault: (fetch or mem stage?)
Save state (PC, regs) after the last instruction before the exception completes
Fetch from the interrupt vector table

Optimizing CPU performance
Golden rule: t_CPU = N_inst × CPI × t_CLK
Given this, what are our options?
–Reduce the number of instructions executed: the compiler's job (COP 5621; COP 5622)
–Reduce the clock period: fabrication (some engineering classes)
–Reduce the cycles to execute an instruction: our approach, instruction-level parallelism (ILP)
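A quick numeric instance of the golden rule (all values are illustrative assumptions):

```python
# t_CPU = N_inst * CPI * t_CLK, with assumed values.
n_inst = 1_000_000_000       # 1 billion dynamic instructions
cpi = 1.5                    # average cycles per instruction
t_clk = 0.5e-9               # 2 GHz clock -> 0.5 ns period
t_cpu = n_inst * cpi * t_clk
assert abs(t_cpu - 0.75) < 1e-9   # 0.75 seconds of CPU time
```

ILP attacks the CPI term: ideally, issuing 2 instructions per cycle halves CPI and thus halves t_CPU.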

Adding width to basic pipelining
5-stage "RISC" load-store architecture (about as simple as things get):
1. Instruction fetch: get 2+ instructions from memory/cache
2. Instruction decode: translate opcodes into control signals and read regs
3. Execute: perform ALU operations
4. Memory: perform memory access if load/store
5. Writeback/retire: update register file
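The five stages at width 2 can be visualized with a toy schedule (no hazards modeled; this only illustrates pipeline-register occupancy, not a real datapath):

```python
# Toy 2-wide in-order pipeline schedule, ignoring all hazards:
# instructions i and i+1 (a fetch group) move through the five
# stages together, one stage per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_insts, width=2):
    # Returns {inst: {stage: cycle}} for a hazard-free pipeline.
    sched = {}
    for i in range(n_insts):
        fetch_cycle = i // width
        sched[i] = {s: fetch_cycle + k for k, s in enumerate(STAGES)}
    return sched

t = schedule(4)
assert t[0]["IF"] == 0 and t[1]["IF"] == 0   # first pair fetched together
assert t[2]["IF"] == 1                       # second pair one cycle later
assert t[3]["WB"] == 5                       # retires at cycle 5
```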

Stage 1: Fetch
Design a datapath that can fetch two instructions from memory every cycle:
–Use the PC to index memory and read instructions
–Read 2 instructions
–Increment the PC (by 2)
Write everything needed to complete execution to the pipeline register (IF/ID):
–Instruction 1; instruction 2; PC+1; PC+2

[Datapath diagram: the PC indexes the instruction memory/cache (with enable), which delivers Instr1 and Instr2 into the IF/ID pipeline register along with PC+1 and PC+2; the PC is incremented by 2 through an adder and a mux, feeding the rest of the pipelined datapath]

Stage 2: Decode
Design a datapath that reads the IF/ID pipeline register, decodes the instructions, and reads the register file (using the regA and regB specifiers from the instruction bits of both instructions).
Write everything needed to complete execution to the pipeline register (ID/EX):
–Pass on both instructions
–Including PC+1 and PC+2, even though decode didn't use them

[Datapath diagram: instruction bits and PC+1 from the IF/ID pipeline register feed the register file (regA/regB read ports, destReg/data write port with enable); the contents of regA, contents of regB, and control signals flow into the ID/EX pipeline register and on to the rest of the pipelined datapath; the stage 1 fetch datapath sits to the left]
Changes? Hazard detection?

Stage 3: Execute
Design a datapath that performs the proper ALU operations for the instructions and values present in the ID/EX pipeline register:
–Inputs to the top ALU: contents of the top regA, and either the contents of the top regB or the top offset field of the instruction
–Inputs to the bottom ALU: contents of the bottom regA, and either the contents of the bottom regB or the bottom offset field of the instruction
–Also calculate PC+1+offset(top) in case the top instruction is a branch
–Also calculate PC+2+offset(bottom) in case the bottom instruction is a branch

[Datapath diagram: from the ID/EX pipeline register, the contents of regA and regB feed the ALU through an operand mux; an adder computes PC+1+offset; the ALU result, PC+1+offset, contents of regB, and control signals flow into the EX/Mem pipeline register; the stage 2 decode datapath sits to the left]
How many data forwarding paths?

Stage 4: Memory operation
Design a datapath that performs the proper memory operation(s) for the instructions and values present in the EX/Mem pipeline register:
–ALU results contain addresses for ld and st instructions
–Opcode bits control memory R/W and enable signals
Write everything needed to complete execution to the pipeline register (Mem/WB):
–ALU results and MemData (×2)
–Instruction bits for opcodes and destReg specifiers

[Datapath diagram: from the EX/Mem pipeline register, the ALU result addresses the data memory (R/W and enable driven by control signals); PC+1+offset and the mux control for the PC input go back to the mux before the PC in stage 1; the ALU result and memory read data flow into the Mem/WB pipeline register; the stage 3 execute datapath sits to the left]
Should we process 2 memory operations in one cycle?

Stage 5: Writeback
Design a datapath that completes the execution of these instructions, writing to the register file if required:
–Write MemData to destReg for ld instructions
–Write the ALU result to destReg for add or nand instructions
–Opcode bits also control the register write enable signal

[Datapath diagram: from the Mem/WB pipeline register, the ALU result and memory read data pass through a mux back to the data input of the register file; instruction bits select the destination register specifier, which goes back to the register file; control signals drive the register write enable; the stage 4 memory datapath sits to the left]
What about ordering the register writes if both instructions have the same destination specifier?

How Much ILP is There?

ALU operations GOOD, branches BAD
Expected number of branches between mispredicts: E(X) ≈ 1/(1−p)
E.g., p = 95% gives E(X) ≈ 20 branches, 100-ish instructions
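Checking the arithmetic of the geometric argument (the ~20% branch frequency is an assumption consistent with the slide's "100-ish insts"):

```python
# Mispredict spacing: with per-branch accuracy p, runs of correct
# predictions are geometric with mean 1/(1-p).
p = 0.95
expected_branches = 1 / (1 - p)
assert abs(expected_branches - 20.0) < 1e-9

branch_frequency = 0.20      # assumed: ~20% of instructions are branches
expected_insts = expected_branches / branch_frequency
assert abs(expected_insts - 100.0) < 1e-6   # ~100 instructions per mispredict
```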

How Accurate are Branch Predictors?