CDA 5155 Week 3 Branch Prediction Superscalar Execution.

1 CDA 5155 Week 3 Branch Prediction Superscalar Execution

2 (figure: pipelined datapath review: PC, instruction memory, register file, ALU, and data memory, connected through the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers, with sign extension and control; branch support adds a target adder, an eq? comparator, and a MUX at the PC selecting between PC+1 and the branch target under beq/bpc control)

3 Branch Target Buffer
Send the fetch PC to the BTB. Found?
–Yes: use the predicted target PC
–No: use PC+1
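The hit/miss decision above can be sketched as a small direct-mapped table. This is a hypothetical Python model; the size and function names are illustrative, not from the slides.

```python
BTB_SIZE = 16  # entries; a power of two so the low PC bits form the index

btb = [None] * BTB_SIZE  # each entry holds (tag_pc, target) or None

def btb_lookup(pc):
    """Predicted next PC: the stored target on a hit, else fall through to PC+1."""
    entry = btb[pc % BTB_SIZE]
    if entry is not None and entry[0] == pc:
        return entry[1]        # found? yes: use target
    return pc + 1              # found? no: use PC+1

def btb_update(pc, target):
    """On a taken branch, record its target; a conflicting entry is evicted."""
    btb[pc % BTB_SIZE] = (pc, target)

btb_update(0x40, 0x10)     # a loop branch at 0x40 jumps back to 0x10
print(btb_lookup(0x40))    # hit: predicts 0x10
print(btb_lookup(0x41))    # miss: falls through to 0x42
```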

4 Branch prediction
Predict not taken: ~50% accurate
–No BTB needed; always use PC+1
Predict backward taken: ~65% accurate
–BTB holds targets for backward branches (loops)
Predict same as last time: ~80% accurate
–Update BTB for any taken branch

5 What about indirect branches?
Could use the same approach
–PC+1 is an unlikely indirect target
–Indirect jumps often have multiple targets (for the same instruction)
Switch statements
Virtual function calls
Shared library (DLL) calls

6 Indirect jump: special case
Return address stack
–Function returns have deterministic behavior (usually)
Returns go to different locations, so the BTB doesn’t work well
The return location is known ahead of time (in some register at the time of the call)
–Build a specialized structure for return addresses
Call instructions write the return address to R31 AND push it on the RAS
Return instructions pop the predicted target off the stack
–Issues: finite size (save or forget on overflow?)
–Issues: long jumps (clear when wrong?)
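A minimal sketch of the return-address-stack idea follows (a toy Python model; the overflow policy shown, forgetting the oldest entry, is just one of the options the slide raises):

```python
class ReturnAddressStack:
    """Fixed-size predictor stack for return targets (size is illustrative)."""

    def __init__(self, size=8):
        self.size = size
        self.stack = []

    def on_call(self, pc):
        """A call at pc pushes its return address (pc + 1 in this toy ISA)."""
        if len(self.stack) == self.size:
            self.stack.pop(0)        # overflow: forget the oldest entry
        self.stack.append(pc + 1)

    def on_return(self):
        """Pop the predicted return target; an empty stack gives no prediction."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(100)           # call at PC=100: push 101
ras.on_call(200)           # nested call: push 201
print(ras.on_return())     # 201 -- the inner return is predicted first
print(ras.on_return())     # 101
```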

7 Costs of branch prediction/speculation
Performance costs?
–Minimal: no difference between waiting and squashing, and it is a huge gain when the prediction is correct!
Power?
–Large: in very long/wide pipelines, many instructions can be squashed
Squashed ≈ # mispredictions × pipeline length/width before the target is resolved

8 Costs of branch prediction/speculation
Area?
–Can be large: predictors can get very big, as we will see next time
Complexity?
–Designs are more complex
–Testing becomes more difficult, but …

9 What else can be speculated?
Dependencies
–I think this data is coming from that store instruction
Values
–I think I will load a 0 value
Accuracy?
–Branch prediction (direction) is Boolean (T, NT)
–Branch targets are stable or predictable (RAS)
–Dependencies are limited
–Values cover a huge space (0 – 4B)

10 Parts of the branch predictor
Direction predictor
–For conditional branches: predicts whether the branch will be taken
–Examples: always taken; backwards taken
Address predictor
–Predicts the target address (used if predicted taken)
–Examples: BTB; Return Address Stack; Precomputed Branch
Recovery logic
Ref: The Precomputed Branch Architecture

11 Characteristics of branches
Individual branches differ
–Loops tend not to exit
Unoptimized code: not-taken
Optimized code: taken
–If-statements tend to be less predictable
–Unconditional branches still need address prediction

12 Example: gzip
Loop branch A @ 0x1200098d8
Executed: 1359575 times
Taken: 1359565 times
Not-taken: 10 times
% time taken: 99% – 100%
Easy to predict (direction and address)

13 Example: gzip
If branch B @ 0x12000fa04
Executed: 151409 times
Taken: 71480 times
Not-taken: 79929 times
% time taken: ~49%
Easy to predict? (maybe not / maybe dynamically)

14 Example: gzip
(figure: % time taken, from 0 to 100, for branches A and B)
Direction prediction: always taken
Accuracy: ~73%
Easy to predict

15 Branching backwards
Most backward branches are heavily TAKEN
Forward branches are slightly more likely to be NOT-TAKEN
Ref: The Effects of Predicated Execution on Branch Prediction

16 Using history
1-bit history (direction predictor)
–Remember the last direction (NT/T) for each branch, in a Branch History Table indexed by branch PC
How big is the BHT?
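A 1-bit predictor can be modeled in a few lines (a hypothetical Python sketch; the table size is illustrative). Note how a perfectly alternating branch, like the gcc example a few slides ahead, gets every prediction wrong:

```python
BHT_SIZE = 1024
bht = [0] * BHT_SIZE   # one bit per entry: 0 = predict NT, 1 = predict T

def predict(pc):
    return bht[pc % BHT_SIZE] == 1

def update(pc, taken):
    bht[pc % BHT_SIZE] = 1 if taken else 0   # remember only the last direction

# An alternating branch (T, NT, T, NT, ...) defeats 1-bit history completely:
correct = 0
outcomes = [True, False] * 8
for taken in outcomes:
    correct += (predict(0x100) == taken)
    update(0x100, taken)
print(correct / len(outcomes))   # 0.0 -- every prediction is wrong
```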

17 Example: gzip
(figure: % time taken, from 0 to 100, for branches A and B)
Direction prediction: always taken
Accuracy: ~73%
How many times will branch A mispredict?
How many times will branch B mispredict?

18 Using history
2-bit history (direction predictor)
–A Branch History Table indexed by branch PC, with four states per entry: SN (strongly not-taken), NT, T, ST (strongly taken)
How big is the BHT?
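The 2-bit state machine (SN, NT, T, ST) is a saturating counter. A sketch in Python (sizes illustrative) shows why a loop branch now mispredicts only on the exit, instead of twice per invocation as with 1-bit history:

```python
BHT_SIZE = 1024
bht = [0] * BHT_SIZE   # 0 = SN, 1 = NT, 2 = T, 3 = ST

def predict(pc):
    return bht[pc % BHT_SIZE] >= 2   # predict taken in states T and ST

def update(pc, taken):
    i = pc % BHT_SIZE
    bht[i] = min(bht[i] + 1, 3) if taken else max(bht[i] - 1, 0)

# A loop branch: taken 9 times, then the exit.
for _ in range(9):
    update(0x200, True)     # the counter saturates at ST
update(0x200, False)        # loop exit: drops ST -> T, one misprediction
print(predict(0x200))       # True -- re-entering the loop is still predicted taken
```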

19 Example: gzip
(figure: % time taken, from 0 to 100, for branches A and B)
Direction prediction: always taken
Accuracy: ~76%
How many times will branch A mispredict?
How many times will branch B mispredict?

20 Using history patterns
~80 percent of branches are either heavily TAKEN or heavily NOT-TAKEN
For the other 20%, we need to look at patterns of reference to see if they are predictable using a more complex predictor
Example: gcc has a branch that flips each time: T(1) NT(0) 10101010101010101010101010101010101010

21 Local history
The Branch History Table (indexed by branch PC) now holds a history of recent directions (e.g. 10101010), which in turn indexes a Pattern History Table of NT/T counters
What is the prediction for this BHT history 10101010?
When do I update the tables?

22 Local history
On the next execution of this branch instruction, the branch history is 01010101, pointing to a different Pattern History Table entry
What is the accuracy of a flip/flop branch 0101010101010…?
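The two-level local predictor can be sketched as below (a hypothetical Python model; table sizes are illustrative). The flip/flop branch that defeats plain 1-bit and 2-bit counters becomes perfectly predictable once its 1010… pattern is learned, because each history value always precedes the same outcome:

```python
HIST_BITS = 8
BHT_SIZE = 256
bht = [0] * BHT_SIZE            # per-branch history registers (level 1)
pht = [1] * (1 << HIST_BITS)    # 2-bit counters indexed by history (level 2)

def predict(pc):
    return pht[bht[pc % BHT_SIZE]] >= 2

def update(pc, taken):
    i = pc % BHT_SIZE
    h = bht[i]
    pht[h] = min(pht[h] + 1, 3) if taken else max(pht[h] - 1, 0)
    bht[i] = ((h << 1) | taken) & ((1 << HIST_BITS) - 1)   # shift in the outcome

# The flip/flop branch 101010...:
results = []
for k in range(40):
    taken = (k % 2 == 0)
    results.append(predict(0x300) == taken)
    update(0x300, int(taken))
print(all(results[-10:]))   # True -- after warm-up, every prediction is correct
```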

23 Global history
A single (global) Branch History Register (e.g. 01110101) indexes the Pattern History Table
How can branches interfere with each other?
if (aa == 2) aa = 0; if (bb == 2) bb = 0; if (aa != bb) { …
for (i=0; i<100; i++) for (j=0; j<3; j++)
–j<3, j = 1: history 1101 → taken
–j<3, j = 2: history 1011 → taken
–j<3, j = 3: history 0111 → not taken
–i<100: history 1110 → usually taken

24 Gshare predictor
The branch PC is XORed with the (global) Branch History Register (e.g. 01110101) to index the Pattern History Table
Ref: Combining Branch Predictors (must read!)
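A gshare sketch in Python (sizes illustrative), exercised on the nested-loop example from the previous slide: global history distinguishes the three phases of the inner j<3 branch, so its exit becomes predictable.

```python
HIST_BITS = 10
MASK = (1 << HIST_BITS) - 1
pht = [1] * (1 << HIST_BITS)   # shared 2-bit counters, start weakly not-taken
ghr = 0                        # global branch history register

def predict(pc):
    return pht[(pc ^ ghr) & MASK] >= 2

def update(pc, taken):
    global ghr
    i = (pc ^ ghr) & MASK
    pht[i] = min(pht[i] + 1, 3) if taken else max(pht[i] - 1, 0)
    ghr = ((ghr << 1) | taken) & MASK   # the history is shared by all branches

# Inner j<3 branch: taken, taken, not-taken, repeating 100 times.
correct = total = 0
for i in range(100):
    for outcome in (1, 1, 0):
        correct += (predict(0x40) == bool(outcome))
        update(0x40, outcome)
        total += 1
print(correct / total > 0.9)   # True once the pattern is learned
```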

25 Bimod predictor
Two Pattern History Tables, one skewed taken and one skewed not-taken, are indexed by global history reg XOR branch PC
A choice predictor, indexed by branch PC, drives the mux that selects between them

26 Tournament predictors
Local predictor (e.g. 2-bit) → prediction 1
Global/gshare predictor (much more state) → prediction 2
A selection table (2-bit state machines) picks the final prediction
How do you select which predictor to use?
How do you update the various predictors/selector?
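One common answer to the selection question is a per-branch 2-bit chooser trained toward whichever component was right (a hedged sketch; names are illustrative, and a real selector would be a fixed-size table rather than a dict):

```python
chooser = {}   # branch PC -> 2-bit counter: 0/1 favor P1, 2/3 favor P2

def select(pc, pred1, pred2):
    """Final prediction: the component the chooser currently favors."""
    return pred2 if chooser.get(pc, 1) >= 2 else pred1

def train(pc, pred1, pred2, taken):
    """Shift toward the component that was correct; no move on a tie."""
    c = chooser.get(pc, 1)
    if pred1 == taken and pred2 != taken:
        chooser[pc] = max(c - 1, 0)
    elif pred2 == taken and pred1 != taken:
        chooser[pc] = min(c + 1, 3)

# If the global predictor (P2) keeps being right where the local one (P1)
# is wrong, the chooser flips to P2 after two updates:
for _ in range(3):
    train(0x50, False, True, True)
print(select(0x50, False, True))   # True -- P2 is now selected
```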

27 Overriding predictors
Big predictors are slow but more accurate
Use a single-cycle predictor in fetch
Start the multi-cycle predictor in parallel
–When it completes, compare it to the fast prediction:
If same, do nothing
If different, assume the slow predictor is right and flush the pipeline
Advantage: reduced branch penalty for branches mispredicted by the fast predictor but correctly predicted by the slow predictor

28 Pipelined gshare predictor
How can we get a pipelined global prediction by stage 1?
–Start in stage −2
–We don’t have the most recent branch history…
Access multiple entries
–E.g., if we are missing the last three branches, get all 8 candidate histories and pick between them during the fetch stage
Ref: Reconsidering Complex Branch Predictors

29 Exceptions
Exceptions are events that are difficult or impossible to manage in hardware alone
They are usually handled by jumping into a (software) service routine
Examples: I/O device request, page fault, divide by zero, memory protection violation (seg fault), hardware failure, etc.

30 Taking an exception
Once an exception occurs, how does the processor proceed?
–Non-pipelined: stop fetching from the PC; save state; fetch from the interrupt vector table
–Pipelined: depends on the exception
Precise interrupt: must squash all instructions “after the exception”
–Divide by zero: flush fetch/decode
–Page fault: (fetch or mem stage?)
Save state (PC, regs) after the last instruction before the exception completes
Fetch from the interrupt vector table

31 Optimizing CPU performance
Golden rule: t_CPU = N_inst × CPI × t_CLK
Given this, what are our options?
–Reduce the number of instructions executed: the compiler’s job (COP 5621; COP 5622)
–Reduce the clock period: fabrication (some engineering classes)
–Reduce the cycles to execute an instruction: our approach, Instruction Level Parallelism (ILP)
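A worked instance of the golden rule, with made-up numbers for illustration (none of these values come from the slides):

```python
# t_CPU = N_inst * CPI * t_CLK
n_inst = 1_000_000   # instructions executed (assumed)
cpi = 1.5            # average cycles per instruction (assumed)
t_clk = 1e-9         # 1 ns clock period, i.e. a 1 GHz clock (assumed)

t_cpu = n_inst * cpi * t_clk
print(f"{t_cpu * 1e3:.2f} ms")   # 1.50 ms

# Halving CPI (more ILP) halves execution time for the same program and clock:
print(f"{n_inst * (cpi / 2) * t_clk * 1e3:.2f} ms")   # 0.75 ms
```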

32 Adding width to basic pipelining
5-stage “RISC” load-store architecture
–About as simple as things get
1. Instruction fetch: get 2+ instructions from memory/cache
2. Instruction decode: translate opcodes into control signals and read regs
3. Execute: perform ALU operations
4. Memory: perform memory operations if load/store
5. Writeback/retire: update the register file

33 Stage 1: Fetch
Design a datapath that can fetch two instructions from memory every cycle
–Use the PC to index memory and read 2 instructions
–Increment the PC (by 2)
Write everything needed to complete execution to the pipeline register (IF/ID)
–Instruction 1; instruction 2; PC+1; PC+2
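The fetch stage above can be sketched functionally (a toy Python model with word-addressed instruction memory; all names are illustrative):

```python
def fetch(pc, imem):
    """Read two sequential instructions and produce the IF/ID latch contents."""
    if_id = {
        "instr1": imem[pc],
        "instr2": imem[pc + 1],
        "pc_plus_1": pc + 1,   # carried along for later branch target math
        "pc_plus_2": pc + 2,
    }
    return if_id, pc + 2       # next PC: incremented by 2

imem = ["add", "nand", "lw", "sw"]
latch, next_pc = fetch(0, imem)
print(latch["instr1"], latch["instr2"], next_pc)   # add nand 2
```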

34 (figure: 2-wide fetch datapath: the PC indexes the instruction memory/cache (en); instruction 1, instruction 2, PC+1, and PC+2 are written to the IF/ID pipeline register; a MUX selects the next PC, incremented by 2, feeding the rest of the pipelined datapath)

35 Stage 2: Decode
Design a datapath that reads the IF/ID pipeline register, decodes both instructions, and reads the register file (using the regA and regB specifiers from the instruction bits of both instructions)
Write everything needed to complete execution to the pipeline register (ID/EX)
–Pass on both instructions
–Including PC+1 and PC+2, even though decode didn’t use them

36 (figure: decode datapath: instruction bits from the IF/ID register drive the regA/regB read ports of the register file (en, with destReg/data write inputs); register contents, instruction bits, PC+1, and control signals are written to the ID/EX pipeline register, feeding the rest of the pipelined datapath)
Changes? Hazard detection?

37 Stage 3: Execute
Design a datapath that performs the proper ALU operations for the instructions and values in the ID/EX pipeline register
–The inputs to the top ALU are the contents of regA (top) and either the contents of regB (top) or the offset field of the top instruction
–The inputs to the bottom ALU are the contents of regA (bottom) and either the contents of regB (bottom) or the offset field of the bottom instruction
–Also calculate PC+1+offset (top) in case the top instruction is a branch
–Also calculate PC+2+offset (bottom) in case the bottom instruction is a branch

38 (figure: execute datapath: the contents of regA and regB from the ID/EX register feed the ALU, with a MUX choosing between regB and the offset; the ALU result, PC+1+offset, the contents of regB, and control signals are written to the EX/Mem pipeline register, feeding the rest of the pipelined datapath)
How many data forwarding paths?

39 Stage 4: Memory operation
Design a datapath that performs the proper memory operation(s) for the instructions and values in the EX/Mem pipeline register
–ALU results contain the addresses for ld and st instructions
–Opcode bits control memory R/W and enable signals
Write everything needed to complete execution to the pipeline register (Mem/WB)
–ALU results and MemData (x2)
–Instruction bits for opcodes and destReg specifiers

40 (figure: memory datapath: the ALU result from the EX/Mem register addresses the data memory (en, R/W from control signals); the ALU result and memory read data are written to the Mem/WB pipeline register; PC+1+offset and the MUX control for the PC input go back to the MUX before the PC in stage 1)
Should we process 2 memory operations in one cycle?

41 Stage 5: Writeback
Design a datapath that completes the execution of these instructions, writing to the register file if required
–Write MemData to destReg for ld instructions
–Write the ALU result to destReg for add or nand instructions
–Opcode bits also control the register write enable signal
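The write-ordering question raised on the next slide has a simple answer in a sketch: perform the two writes in program order, so on a destination conflict the younger instruction's result wins (a hypothetical Python model; names are illustrative):

```python
regs = [0] * 8   # toy register file

def writeback(wb1, wb2):
    """Each wb is (dest_reg, value) or None; wb1 is earlier in program order."""
    if wb1 is not None:
        regs[wb1[0]] = wb1[1]
    if wb2 is not None:
        regs[wb2[0]] = wb2[1]   # same destReg: the later write overwrites

writeback((3, 10), (3, 42))   # both instructions write R3
print(regs[3])                # 42 -- the younger instruction's value survives
```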

42 (figure: writeback datapath: a MUX selects between the ALU result and the memory read data from the Mem/WB register, going back to the data input of the register file; another MUX on instruction bits 0–2 and 16–18 selects the destination register specifier; control signals drive the register write enable)
What about ordering the register writes if both instructions have the same destination specifier?

43 How Much ILP is There?

44 ALU operations GOOD, branches BAD
Expected number of branches between mispredicts: E(X) ≈ 1/(1−p)
E.g., p = 95%: E(X) ≈ 20 branches, ~100 instructions
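The expectation above is the mean of a geometric distribution; a quick check with the slide's numbers (the 1-branch-per-5-instructions ratio is implied by the slide's "100-ish" figure):

```python
def expected_branches(p):
    """Expected branches between mispredicts, for per-branch accuracy p."""
    return 1.0 / (1.0 - p)

e = expected_branches(0.95)
print(round(e))        # 20 branches between mispredicts
print(round(e) * 5)    # 100 instructions, at roughly one branch per five
```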

45 How Accurate are Branch Predictors?

