Lecture 3: Branch Prediction Young Cho Graduate Computer Architecture I.

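2 - CSE/ESE 560M – Graduate Computer Architecture I
Cycles Per Instruction
CPI = (CPU Time × Clock Rate) / Instruction Count = Cycles / Instruction Count
– CPI is the "Average Cycles per Instruction"
– "Instruction Frequency": the fraction of executed instructions in each instruction class, used to weight that class's cycle count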
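As a minimal sketch of the CPI definition above, the following Python computes CPI from total cycles and, alternatively, as a frequency-weighted average over instruction classes. The instruction mix and cycle counts are hypothetical values chosen for illustration; they do not come from the lecture.

```python
def cpi_from_totals(cpu_time_s, clock_rate_hz, instruction_count):
    """CPI = (CPU time * clock rate) / instruction count = cycles / instruction count."""
    cycles = cpu_time_s * clock_rate_hz
    return cycles / instruction_count


def cpi_from_mix(mix):
    """CPI as a frequency-weighted average of per-class cycle counts."""
    return sum(frequency * cycles for frequency, cycles in mix.values())


# Hypothetical instruction mix: class -> (frequency, cycles per instruction).
EXAMPLE_MIX = {
    "alu": (0.50, 1),
    "load": (0.20, 2),
    "store": (0.10, 2),
    "branch": (0.20, 3),
}

if __name__ == "__main__":
    print(f"CPI from mix:    {cpi_from_mix(EXAMPLE_MIX):.2f}")
    print(f"CPI from totals: {cpi_from_totals(2.0, 1e9, 1e9):.2f}")
```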

3 - Typical Load/Store Processor
[Figure: five-stage pipelined datapath with PC, Instruction Memory, Register File, ALU, Data Memory, and Control, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

4 - Pipelining Laundry
[Figure: overlapped laundry loads with time labels of 30, 35, and 25 minutes.]
– Three sets of clean clothes in 2 hours 40 minutes
– With a large number of sets, each load takes ~35 minutes on average
– 3X increase in productivity!
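The laundry timing can be checked with a short sketch. It assumes the three time labels on the slide are the durations of three consecutive stages (30, 35, and 25 minutes) and that each stage handles one load at a time; the function names are illustrative.

```python
def pipelined_finish_time(stage_minutes, loads):
    """Finish time of the last load when loads overlap across stages."""
    # finish[s] holds the time at which stage s finished its previous load.
    finish = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, finish[s])  # wait for the load and for the stage
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


def sequential_finish_time(stage_minutes, loads):
    """Finish time when each load completes before the next one starts."""
    return loads * sum(stage_minutes)


STAGES = [30, 35, 25]  # assumed stage durations in minutes

if __name__ == "__main__":
    print(pipelined_finish_time(STAGES, 3))   # 160 minutes = 2 hours 40 minutes
    print(sequential_finish_time(STAGES, 3))  # 270 minutes
    print(pipelined_finish_time(STAGES, 1000) / 1000)  # approaches the 35-minute stage
```

Under these assumptions the steady-state rate is set by the slowest stage, which is why the average time per load approaches 35 minutes.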

5 - Introducing Problems
Hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to dry and iron clothes simultaneously)
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock: needs both before putting them away)
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branch & jump)

6 - Data Hazards
Read After Write (RAW)
– Instr 2 tries to read an operand before Instr 1 writes it
– Caused by a "dependence" in compiler terms
Write After Read (WAR)
– Instr 2 writes an operand before Instr 1 reads it
– Called an "anti-dependence" in compiler terms
Write After Write (WAW)
– Instr 2 writes an operand before Instr 1 writes it
– Called an "output dependence" in compiler terms
WAR and WAW occur in more complex systems
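The three definitions above can be expressed as a small classifier. This is only a sketch: the dictionary layout (`dst`, `src`) and the function name are assumptions made for illustration, and the function reports register dependences that could become hazards, without modeling any particular pipeline.

```python
def data_hazards(first, second):
    """Return the hazard types that may arise when `second` follows `first`.

    Each instruction is a dict with 'dst' (a register name or None)
    and 'src' (a set of register names).
    """
    hazards = set()
    if first["dst"] is not None and first["dst"] in second["src"]:
        hazards.add("RAW")  # second reads what first writes
    if second["dst"] is not None and second["dst"] in first["src"]:
        hazards.add("WAR")  # second writes what first reads
    if first["dst"] is not None and first["dst"] == second["dst"]:
        hazards.add("WAW")  # both write the same register
    return hazards


if __name__ == "__main__":
    add = {"dst": "r1", "src": {"r2", "r3"}}  # add r1,r2,r3
    sub = {"dst": "r4", "src": {"r1", "r5"}}  # sub r4,r1,r5
    print(data_hazards(add, sub))
```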

7 - Branch Hazard (Control)
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
[Figure: pipeline diagram showing each instruction passing through Ifetch, Reg, ALU, DMem, and Reg in successive cycles.]
Three instructions are already in the pipeline before the instruction at the branch target can be fetched.

8 - Branch Hazard Alternatives
Stall until branch direction is clear
Predict Branch Not Taken
– Execute successor instructions in sequence
– "Squash" instructions in the pipeline if the branch is actually taken
– Advantage of late pipeline state update
– 47% of DLX branches are not taken on average
– PC+4 is already calculated, so use it to get the next instruction
Predict Branch Taken
– 53% of DLX branches are taken on average
– DLX still incurs a 1-cycle branch penalty
– Other machines: branch target known before outcome

9 - Branch Hazard Alternatives
Delayed Branch
– Define the branch to take place AFTER a following instruction (fill in the branch delay slot)
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken
– The n sequential successors form a branch delay of length n
– A 1-slot delay allows the proper decision and branch target address in a 5-stage pipeline

10 - Evaluating Branch Alternatives
Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
Conditional and unconditional branches = 14% of instructions; 65% of them change the PC
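The CPI column can be reproduced with a short calculation. This sketch assumes an ideal CPI of 1 and that CPI = 1 + branch frequency × effective branch penalty, with the predict-not-taken penalty paid only on the 65% of branches that change the PC; these assumptions are inferred from the numbers on the slide, not stated on it.

```python
BRANCH_FREQUENCY = 0.14   # conditional and unconditional branches
CHANGE_PC_FRACTION = 0.65  # fraction of branches that change the PC

# Effective penalty in cycles per branch for each scheme (assumed model).
EFFECTIVE_PENALTY = {
    "Stall pipeline": 3,
    "Predict taken": 1,
    "Predict not taken": 1 * CHANGE_PC_FRACTION,
    "Delayed branch": 0.5,
}


def pipeline_cpi(effective_penalty, ideal_cpi=1.0):
    """CPI = ideal CPI + branch frequency * effective branch penalty."""
    return ideal_cpi + BRANCH_FREQUENCY * effective_penalty


if __name__ == "__main__":
    for scheme, penalty in EFFECTIVE_PENALTY.items():
        print(f"{scheme:18s} CPI = {pipeline_cpi(penalty):.2f}")
```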

11 - Solution to Hazards
Structural Hazards
– Delaying HW-dependent instruction
– Increase resources (e.g., dual-port memory)
Data Hazards
– Data forwarding
– Software scheduling
Control Hazards
– Pipeline stalling
– Predict and flush
– Fill delay slots with previous instructions

12 - Administrative
Literature Survey
– One Q&A per literature
– Q&A should show that you read the paper
Changes in Schedule
– Need to be out of town on Oct 4th (Tuesday)
– Quiz 2 moved up 1 lecture
Tool and VHDL help

13 - Typical Pipeline Example: MIPS R4000
[Figure: IF and ID feed four parallel execution units that rejoin at MEM and WB: the integer unit (EX), the FP/int multiplier (M1 to M7), the FP adder (A1 to A4), and the FP/int divider (Div, latency = 25, initiation interval = 25).]

14 - Prediction
Easy to fetch multiple (consecutive) instructions per cycle
– Essentially speculating on sequential flow
Jump: unconditional change of control flow
– Always taken
Branch: conditional change of control flow
– Taken typically ~50% of the time in applications
– Backward: 30% of branches × 80% taken = ~24% of all branches
– Forward: 70% of branches × 40% taken = ~28% of all branches
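The "~50% taken" figure follows from combining the backward and forward numbers. A minimal sketch of that arithmetic, with illustrative names:

```python
# Branch class -> (share of all branches, fraction of that class taken).
BRANCH_CLASSES = {
    "backward": (0.30, 0.80),
    "forward": (0.70, 0.40),
}


def overall_taken_fraction(classes):
    """Sum each class's share of branches times its taken rate."""
    return sum(share * taken for share, taken in classes.values())


if __name__ == "__main__":
    print(f"{overall_taken_fraction(BRANCH_CLASSES):.2f}")  # 0.24 + 0.28
```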

15 - Current Ideas
Reactive
– Adapt current action based on the past
– TCP windows
– URL completion, ...
Proactive
– Anticipate future action based on the past
– Branch prediction
– Long cache block
– Tracing

16 - Branch Prediction Schemes
Static Branch Prediction
Dynamic Branch Prediction
– 1-bit Branch-Prediction Buffer
– 2-bit Branch-Prediction Buffer
– Correlating Branch Prediction Buffer
– Tournament Branch Predictor
Branch Target Buffer
Integrated Instruction Fetch Units
Return Address Predictors

17 - Static Branch Prediction
Execution profiling
– Very accurate if you actually take the time to profile
– Inconvenient
Heuristics based on nesting and coding
– Simple heuristics are very inaccurate
Programmer-supplied hints...
– Inconvenient and potentially inaccurate

18 - Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction)
1-bit Branch History Table
– Bitmap indexed by the lower bits of the PC address
– Says whether or not the branch was taken last time
– If the instruction is a branch, predict and update the table
Problem
– A 1-bit BHT causes 2 mispredictions per loop execution
  – First time through the loop, it predicts exit instead of loop
  – At the end of the loop, it predicts loop instead of exit
– Average is 9 iterations before exit
  – Only 80% accuracy even if the loop branch is taken 90% of the time
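The 80% figure can be reproduced with a small simulation. This sketch models a single loop branch with one history bit (so no table indexing is involved) and assumes the bit starts as "not taken", as it would after a previous loop exit; the names are illustrative.

```python
def simulate_one_bit(outcomes, initial_prediction=False):
    """Return the accuracy of a 1-bit predictor on a list of branch outcomes."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember only what the branch did last time
    return correct / len(outcomes)


# One loop execution: the loop branch is taken 9 times, then falls through.
LOOP_ONCE = [True] * 9 + [False]

if __name__ == "__main__":
    trace = LOOP_ONCE * 100  # the loop is executed 100 times
    print(simulate_one_bit(trace))  # 2 mispredictions per 10 branches
```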

19 - N-bit Dynamic Branch Prediction
N-bit scheme: change the prediction only after mispredicting N times
[Figure: 2-bit state diagram with two Predict Taken states and two Predict Not Taken states, connected by T (taken) and NT (not taken) transitions.]
2-bit scheme: the prediction saturates, so it takes 2 consecutive mispredictions to change it
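A minimal sketch of a 2-bit predictor, written as a saturating counter (values 0 to 3, predicting taken when the counter is 2 or 3). This is one common encoding and an assumption here; the exact transitions in the slide's state diagram may differ. On the loop pattern of 9 taken branches followed by an exit, it mispredicts only the exit.

```python
def simulate_two_bit(outcomes, counter=3):
    """Return the accuracy of a 2-bit saturating counter on branch outcomes."""
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2  # 2 and 3 predict taken, 0 and 1 predict not taken
        if prediction == taken:
            correct += 1
        # Move toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# One loop execution: the loop branch is taken 9 times, then falls through.
LOOP_ONCE = [True] * 9 + [False]

if __name__ == "__main__":
    print(simulate_two_bit(LOOP_ONCE * 100))  # 1 misprediction per 10 branches
```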

20 - Correlating Branches
(2,2) predictor
– 2-bit global: indicates the behavior of the last two branches
– 2-bit local (2-bit Dynamic Branch Prediction)
Branch History Table
– Global branch history is used to choose one of four history bitmap tables
– Predicts the branch behavior, then updates only the selected bitmap table
[Figure: the branch address (4 bits) indexes each table, and the 2-bit recent global branch history (01 = not taken, then taken) selects which table supplies the prediction.]
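A (2,2) predictor can be sketched as four tables of 2-bit counters selected by a 2-bit global history. The class and method names, the initial counter value, and the use of `pc >> 2` as the index (assuming 4-byte instructions) are illustrative assumptions, not details from the slide.

```python
class CorrelatingPredictor:
    """(m, n) = (2, 2): 2 bits of global history, 2-bit counters."""

    def __init__(self, index_bits=4, history_bits=2):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0  # newest outcome in the low bit (01 = not taken, then taken)
        # One table of 2-bit counters per history value, all starting at 1.
        self.tables = [[1] * (1 << index_bits) for _ in range(1 << history_bits)]

    def _index(self, pc):
        return (pc >> 2) & self.index_mask  # assumes 4-byte instructions

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        table, i = self.tables[self.history], self._index(pc)
        # Only the table selected by the current history is updated.
        table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    alternating = [True, False] * 50
    print(count_mispredictions(CorrelatingPredictor(), 0x40, alternating))
```

A strictly alternating branch is a pattern the history bits can capture: once the relevant counters are trained, each history value always leads to the same outcome.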

21 - Accuracy of Different Schemes
[Figure: bar chart of the frequency of mispredictions (0% to 20%) on nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li for three predictors: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT.]

22 - BHT Accuracy
Mispredict because either:
– Wrong guess for the branch
– Wrong index for the branch (the table entry holds the history of a different branch)
4096-entry table
– Programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
For SPEC92
– 4096 entries is about as good as an infinite table

23 - Tournament Branch Predictors
Correlating Predictor
– 2-bit predictor failed on important branches
– Better results by also using global information
Tournament Predictors
– 1 predictor based on global information
– 1 predictor based on local information
– Use the predictor that guesses better
[Figure: the branch address (addr) feeding Predictor A and Predictor B.]
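The selection idea can be sketched with two component predictors and a table of 2-bit chooser counters. This is a simplified illustration, not the organization of any real processor: the table sizes, initial counter values, indexing by `pc >> 2`, and all names are assumptions.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self, index_bits=10, history_bits=10):
        self.size = 1 << index_bits
        self.history_mask = (1 << history_bits) - 1
        self.local = [1] * self.size                   # indexed by branch address
        self.global_table = [1] * (1 << history_bits)  # indexed by global history
        self.chooser = [1] * self.size                 # >= 2 means "use global"
        self.history = 0

    def _index(self, pc):
        return (pc >> 2) % self.size  # assumes 4-byte instructions

    def predict(self, pc):
        i = self._index(pc)
        local_pred = self.local[i] >= 2
        global_pred = self.global_table[self.history] >= 2
        return global_pred if self.chooser[i] >= 2 else local_pred

    def update(self, pc, taken):
        i = self._index(pc)
        local_ok = (self.local[i] >= 2) == taken
        global_ok = (self.global_table[self.history] >= 2) == taken
        if local_ok != global_ok:
            # Train the chooser toward whichever predictor guessed better.
            self.chooser[i] = bump(self.chooser[i], global_ok)
        self.local[i] = bump(self.local[i], taken)
        self.global_table[self.history] = bump(self.global_table[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run(predictor, pc, outcomes):
    """Return a list of booleans: True where the prediction was correct."""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results


if __name__ == "__main__":
    results = run(TournamentPredictor(), 0x40, [True, False] * 100)
    print(sum(results), "of", len(results), "correct")
```

On a strictly alternating branch the local 2-bit counter in this sketch keeps guessing wrong while the history-indexed table learns the pattern, so the chooser settles on the global predictor.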

24 - Alpha 21264
4K 2-bit counters choose between a global predictor and a local predictor
Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor
– 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
Local predictor consists of a 2-level predictor:
– Top level: a local history table of 1024 10-bit entries; each 10-bit entry records the most recent 10 branch outcomes for that entry. A 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
– Next level: the selected entry from the local history table indexes a table of 1K entries of 3-bit saturating counters, which provide the local prediction
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
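The total-size arithmetic can be checked directly. The variable names below are illustrative; the entry counts and widths are the ones given on the slide.

```python
K = 1024

chooser_bits = 4 * K * 2         # 4K 2-bit chooser counters
global_bits = 4 * K * 2          # 4K 2-bit global predictor entries
local_history_bits = 1 * K * 10  # 1024 local histories, 10 bits each
local_counter_bits = 1 * K * 3   # 1K 3-bit saturating counters

total_bits = chooser_bits + global_bits + local_history_bits + local_counter_bits

if __name__ == "__main__":
    print(total_bits, "bits =", total_bits // K, "K bits")
    # The table sizes match the history lengths used to index them.
    print(2 ** 12 == 4 * K, 2 ** 10 == 1 * K)
```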

25 - Branch Prediction Accuracy
[Figure: bar chart of prediction accuracy (0% to 100%) on gcc, espresso, li, fpppp, doduc, and tomcatv for three schemes: profile-based, 2-bit dynamic, and tournament. The plotted values range from 70% to 100%.]

26 - Accuracy versus Size

27 - Branch Target Buffer
Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
– Note: must check for a branch match now, since the wrong branch's target address can't be used
[Figure: the PC of the instruction in FETCH is compared (=?) against the stored Branch PC; each entry also holds the Predicted PC and extra prediction state bits.]
– Yes: instruction is a branch; use the predicted PC as the next PC
– No: branch not predicted; proceed normally (Next PC = PC+4)
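The lookup described above can be sketched with a dictionary keyed by the full branch PC, which stands in for the tag comparison (=?). This is a simplification under stated assumptions: a hardware BTB has a fixed number of entries, the policy of storing only taken branches is an assumption of this sketch, and the class and method names are illustrative.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}  # branch PC -> predicted target PC

    def next_pc(self, pc):
        """Return the predicted next PC for the instruction being fetched."""
        target = self.entries.get(pc)
        if target is not None:
            return target  # hit: instruction is a branch, use the predicted PC
        return pc + 4      # miss: proceed normally

    def update(self, pc, taken, target):
        """Record the resolved branch outcome."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # assumed policy: keep only taken branches


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(btb.next_pc(10))   # no entry yet, so PC + 4
    btb.update(10, True, 36)
    print(btb.next_pc(10))   # predicted target
```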

28 - Predicated Execution
Built-in Hardware Support
– Bit for predicated instruction execution
– Both paths are in the code
– Execution based on the result of the condition
No Branch Prediction is Required
– Instructions not selected are ignored
– Similar to inserting NOPs

29 - Zero Cycle Jump
What really has to be done at runtime?
– Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache.
– Very limited form of dynamic compilation?
Code sequence:
        and  r3,r1,r5
        addi r2,r3,#4
        sub  r4,r2,r1
        jal  doit
        subi r1,r1,#1
Internal cache state (instruction, flag, next address):
    A:  and  r3,r1,r5    N   A+4
        addi r2,r3,#4    N   A+8
        sub  r4,r2,r1    L   doit
        ---              --
        subi r1,r1,#1    N   A+20
Use of "pre-decoded" instruction cache
– Called "branch folding" in the Bell Labs CRISP processor.
– Original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction
– Notice that JAL introduces a structural hazard on write

30 - Dynamic Branch Prediction Summary
Prediction is becoming an important part of scalar execution
Branch History Table
– 2 bits for loop accuracy
Correlation
– Recently executed branches are correlated with the next branch
– Either different branches
– Or different executions of the same branch
Tournament Predictor
– Devote more resources to competing predictors and pick between them
Branch Target Buffer
– Branch address & prediction
Predicated Execution
– No need for prediction
– Hardware support needed
