Presentation is loading. Please wait.

Presentation is loading. Please wait.

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 1 Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

Similar presentations


Presentation on theme: "Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 1 Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining."— Presentation transcript:

1 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 1 Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining

2 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 2 Review: Tomasulo Summary Register file is not the bottleneck Avoids the WAR, WAW hazards of Scoreboard Not limited to basic blocks (provided branch prediction) Allows loop unrolling in HW Lasting Contributions –Dynamic scheduling –Register renaming –Load/store disambiguation

3 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 3 Reasons for Branch Prediction For a branch instruction, predict whether the branch will actually be TAKEN or NOT TAKEN Pipeline bubbles(stalls) due to branch are the main source of performance degradation –Avoids pipeline bubbles by feeding the pipeline with the instructions in the predicted path(speculative) Speculative execution significantly improves the performance of the deeply pipelined, wide issue superscalar processors More accurate branch prediction is required because all speculative work beyond a branch must be thrown away if mispredicted Performance=f(Accuracy, cost of misprediction) –Accuracy is better with Dynamic Prediction

4 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 4 1-Bit Branch History Table Simplest of all dynamic prediction schemes Based on the properties that most branches are either usually TAKEN or usually NOT TAKEN –property found in the iterations of a loop Keep the last outcome of each branch in the BHT BHT is indexed using the lower order bits of PC..... Prediction PC BHT Actual outcome Problem: in a loop, 1-bit BHT will cause two mispredictions: End of loop branch, - Last time through the loop, when it exits instead of looping as before - First time through loop, when it predicts exit instead of looping

5 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 5 1-Bit Branch History Table Example main() { int i, j; while (1) for (i=1; i<=4; i++) { j = i++; } ‘while’ branch : always TAKEN 11111111111111111…. ‘for’ branch : TAKEN x 3, NOT TAKEN x 1 1110111011101110…. Prediction accuracy of ‘for’ branch: 50% O : correct, X : mispredict 11T1O11T1O 11T1O11T1O 11T1O11T1O 01T0X01T0X 11T1O11T1O 11T1O11T1O 01T0X01T0X 11T1O11T1O 11T1O11T1O 01T0X01T0X branch outcome last outcome prediction new last outcome correctness... 10N1X10N1X 10N1X10N1X 10N1X10N1X Assume initial last outcome = 1

6 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 6 2-Bit Counter Scheme Prediction is made based on the last two branch outcomes Each of BHT entries consists of 2 bits, usually a 2-bit counter, which is associated with the state of the automaton(many different automata are possible)..... Prediction PC BHT Automaton Branch outcome

7 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 7 2-Bit BHT(1) 2-bit scheme where change prediction only if get misprediction twice MSB of the state symbol represents the prediction; –1x: TAKEN, 0x: NOT TAKEN Predict TAKEN11 Predict TAKEN10 Predict NOT TAKEN01 Predict NOT TAKEN00 T NT T T T Prediction accuracy of ‘for’ branch: 75 % 1 11 T 11 O 1 11 T 11 O 1 11 T 11 O 1 10 T 11 O 1 11 T 11 O 1 11 T 11 O 1 10 T 11 O 1 11 T 11 O 1 11 T 11 O O : correct, X : mispredict branch outcome counter value prediction new counter value correctness... 0 11 T 10 X 0 11 T 10 X 0 11 T 10 X assume initial counter value : 11

8 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 8 2-Bit BHT(2) A branch going unusual direction once causes a misprediction only once, rather than twice as in the 1-bit BHT MSB of the state represents the prediction; –1x: TAKEN, 0x: NOT TAKEN Predict TAKEN 11 Predict TAKEN 10 Predict NOT TAKEN 00 Predict NOT TAKEN 01 T NT T T T Prediction accuracy of ‘for’ branch: 75 % 1 10 T 11 O 1 11 T 11 O 1 11 T 11 O 1 10 T 11 O 1 11 T 11 O 1 11 T 11 O 1 10 T 11 O 1 11 T 11 O 1 11 T 11 O branch outcome counter value prediction new counter value correctness... 0 11 T 10 X 0 11 T 10 X 0 11 T 10 X assume initial counter value : 10

9 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 9 2-Level Self Prediction First level –Record previous few outcomes(history) of the branch itself –Branch History Table(BHT) - A shift register Second Level –Predictor Table(PT) indexed by history pattern captured by BHT –Each entry is a 2-bit counter PC... Prediction BHT PT

10 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 10 2-Level Self Prediction Algorithm Using PC, read BHT Using self history from BHT, read the counter value from PT If (MSB==1), predict T else predict NT Branch Outcome Self History(BHT) Counter Value(PT) Prediction New BHT entry New PT entry Correctness 1 0000 10 T 1000 11 O 1 1000 10 T 1100 11 O 1 1100 10 T 1110 11 O 1 0111 10 T 1011 11 O 1 1011 10 T 1101 11 O 1 1101 10 T 1110 11 O 1 0111 11 T 1011 11 O 1 1011 11 T 1101 11 O 1 1101 11 T 1110 11 O... 0 1110 10 T 0111 01 X 0 1110 01 N 0111 00 O 0 1110 00 N 0111 00 O Initial BHT entry =0000, Self history length=4, Initial PT value=10 Prediction Accuracy = 100% After resolving branch outcome, bookkeeping If mispredicted, discard all speculative executions Shift right the new branch outcome into the entry read from BHT Update PT:If(branch outcome==T) increase counter value by 1 else decrease by 1

11 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 11 BHT Accuracy Mispredict because either: –Wrong guess for that branch –Got branch history of wrong branch when indexing the table 4096 entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% 4096 about as good as infinite table, but 4096 is a lot of HW

12 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 12 Correlating Branches Example: if (d==0) d=1; if (d==1) BNEZR1,L1 ; (b1)(d!=0) ADDIR1,R1,#1; d==0, so d=1 L1 :SUBIR3,R1,#1 BNEZR3,L2; (b2)(d!=1)..... L2 : Initial value of d 0 1 2 d==0? Y N b1 NT T Value of d before b2 1 2 d==1? Y N b2 NT T If b1 is NT, then b2 is NT d=? 2 0 2 0 b1 prediction NT T NT T b1 action T NT T NT New b1 prediction T NT T NT b2 prediction NT T NT T b2 action T NT T NT New b2 prediction T NT T NT All branches are mispredicted 1-bit self history predictor Sequence of 2,0,2,0,2,0,...

13 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 13 Correlating Branches Idea: TAKEN/NOT TAKEN of recently executed branches (GHR-global history) is related to the behavior of the next branch (as well as the history of that branch behavior)(PHT- self history) - Behavior of the recent branches(GHR) selects from, say, four predictions (PHT) of the next branch, and updating the selected prediction only Branch address 2-bits per predictors Prediction XX 4 (2,2) correlating predictor XX Pattern History Table(PHT) Global History Register(GHR; 2 bits)

14 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 14 Correlating Branches Example: if (d==0) d=1; if (d==1) BNEZR1,L1 ; branch b1 (d!=0) ADDIR1,R1,#1; d==0, so d=1 L1 :SUBIR3,R1,#1 BNEZR3,L2; branch b2(d!=1)..... L2 : Self Prediction bits(XX) NT/NT NT/T T/NT T/T Prediction, if last branch action was NT NT T Prediction, if last branch action was T NT T NT T Gloabal d=? 2 0 2 0 b1 prediction NT/NT T/NT b1 action T NT T NT new b1 prediction T/NT b2 prediction NT/NT NT/T b2 action T NT T NT new b2 prediction NT/T Initial self prediction bits NT/NT and Initial last branch was NT. Prediction used is shown in purple Misprediction only in the firest two predictions

15 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 15 Accuracy of Different Schemes Frequency of Mispredictions 0% 2% 4% 6% 8% 10% 12% 14% 16% 18% nasa7 matrix300 tomcatv doducd spice fpppp gcc espresso eqntott li 0% 1% 5% 6% 11% 4% 6% 5% 1% 4096 Entries 2-bit BHT Unlimited Entries 2-bit BHT 1024 Entries (2,2) BHT Frequency of Mispredictions

16 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 16 Need Predicted Address as well as Prediction Branch Target Buffer (BTB): Indexed by the Address of branch to get prediction AND branch address (if taken) Predicted PC T/NT(as predicted) PC of instruction to fetch Number of entries in branch- target buffer No: instruction is not predicted to be a branch. Proceed normally Yes: then instruction is a branch and predicted PC should be used as the next PC Look up =

17 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 17 Branch Target Buffer Send PC to memory and branch-target buffer Entry found in BTB? Send out predicted PC Is instruction a taken branch? Taken branch? Enter branch instr. address and next PC into BTB Mispredicted branch, kill fetched instr.; restart fetch at other target; delete entry from BTB. Branch correctly predicted Normal instruction execution IF ID EX No Yes No Yes No Yes BTB Prediction Actual Penalty Y T T 0 Y T NT 2 N - T 2

18 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 18

19 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 19 Getting CPI < 1: Issuing Multiple Instructions/Cycle Two variations –Superscalar: varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW(Tomasulo) IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100 –Very Long Instruction Words (VLIW): fixed number of instructions (16) scheduled by the compiler Joint HP/Intel agreement in 1998?

20 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 20 Getting CPI < 1: Issuing Multiple Instructions/Cycle  1 cycle load delay expands to 3 instructions in SS – instruction in right half cannot use it, nor instructions in the next slot Superscalar DLX: 2 instructions, 1 FP & 1 anything else – Fetch 64-bits/clock cycle; Int on left, FP on right Integer Instruction FP Instruction Instr TypePipe Stages Int. instructionIF ID EX MEM WB FP instructionIF ID EX MEM WB Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB – Can issue the 2nd instruction only if the 1st instruction issues – More ports for FP registers to do FP load & FP op in a pair

21 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 21 Unrolled Loop that Minimizes Stalls for Scalar 1 Loop:LDF0,0(R1) 2 LDF6,-8(R1) 3 LDF10,-16(R1) 4 LDF14,-24(R1) 5 ADDDF4,F0,F2 6 ADDDF8,F6,F2 7 ADDDF12,F10,F2 8 ADDDF16,F14,F2 9 SD0(R1),F4 10 SD-8(R1),F8 11 SUBIR1,R1,#32 12 SD16(R1),F12 13 BNEZR1,LOOP 14 SD8(R1),F16; 8-32 = -24 LD to ADDD: 1 Cycle ADDD to SD: 2 Cycles 14 clock cycles, or 3.5 per iteration Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #32 BNEZ R1, Loop

22 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 22 Loop Unrolling in Superscalar Integer instruction FP instruction Clock cycle Loop:LD F0,0(R1)1 LD F6,-8(R1)2 LD F10,-16(R1)ADDD F4,F0,F23 LD F14,-24(R1)ADDD F8,F6,F24 LD F18,-32(R1)ADDD F12,F10,F25 SD 0(R1),F4ADDD F16,F14,F26 SD -8(R1),F8ADDD F20,F18,F27 SD -16(R1),F128 SUBI R1,R1,#409 SD 16(R1),F16 10 BNEZ R1,LOOP11 SD 8(R1),F2012 Unrolled 5 times to avoid delays (+1 due to 2 cycle initiation delay for ADDD to SD) 12 clocks, or 2.4 clocks per iteration LD to ADDD: 1 Cycle ADDD to SD: 2 Cycles

23 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 23 Dynamic Scheduling in Superscalar Dependencies stop instruction issue Code compiled for scalar version will run poorly on SS –Good code for superscalar depends on the structure of the superscalar Simple approach –Issue an integer instruction and a FP instruction to their separate reservation stations, i.e. for Integer FU/Reg and for FP FU/Reg –Issue both only when two instructions do not access the same register - instructions with data dependence cannot be issued together

24 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 24 Dynamic Scheduling in Superscalar How to issue two instructions to the reservation stations and keep in-order issue for Tomasulo? –In order issue for the purpose of simplifying bookkeeping –Issue stage runs 2X Clock Rate, so that issue can be made 2 times in an ordinary clock cycle to make issues remain in order but execution can be done on the same ordinary clock cycle –Only FP loads might cause dependency between integer and FP issue: Replace load reservation station with a load queue; operands must be read in the order they are fetched Load checks addresses in Store Queue to avoid RAW violation Store checks addresses in Load Queue to avoid WAR, WAW

25 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 25 Performance of Dynamic SS Iter No. Issues Executes Memory access Write Result 1 LD F0,0(R1) 1 ADDD F4,F0,F2 1 SD 0(R1),F4 1 SUBI R1,R1,# 8 1 BNEZ R1,LOOP 2 LD F0,0(R1) 2 ADDD F4,F0,F2 2 SD 0(R1),F4 2 SUBI R1,R1,#8 2 BNEZ R1,LOOP * 4 clocks per iteration Branches, Decrements still take 1 clock cycle 2 more cycles to complete 2 2(efa) 4 3 3 3(efa) 4 4 4 7 7 7(efa) 7 5 5 5 5 8 8 8 8 9 9 10 11 1 1 6 6(efa) 6

26 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 26 Limits of Superscalar While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with: –Exactly 50% FP operations –No hazards If more instructions issue at the same time, greater difficulty of decode and issue –Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue VLIW: trade-off instruction space for simple decoding –The long instruction word has room for many operations –By definition, all the operations the compiler puts in the long instruction word can execute in parallel –E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch 16 to 24 bits per field => 7x16 or 112 bits to 7x24 or 168 bits wide –Need compiling technique that schedules across several branches

27 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 27 Loop Unrolling in VLIW Memory MemoryFPFPInt. op/Clock reference 1 reference 2 operation 1 op. 2 branch a LD F0,0(R1)LD F6,-8(R1)1 LD F10,-16(R1)LD F14,-24(R1)2 LD F18,-32(R1)LD F22,-40(R1)ADDD F4,F0,F2ADDD F8,F6,F23 LD F26,-48(R1)ADDD F12,F10,F2ADDD F16,F14,F24 ADDD F20,F18,F2ADDD F24,F22,F25 SD 0(R1),F4SD -8(R1),F8ADDD F28,F26,F26 SD -16(R1),F12SD -24(R1),F167 SD -32(R1),F20SD -40(R1),F24SUBI R1,R1,#488 SD 0(R1),F28BNEZ R1,LOOP9 Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration Need more registers in VLIW

28 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 28 Limits to Multi-Issue Machines 1. Inherent limitations of ILP in programs –1 branch in 5 instructions => how to keep a 5-way VLIW busy? –Latencies of units => many operations must be scheduled –Need about Pipeline Depth x No. Functional Units of independent operations to keep machines busy 2. Difficulties in building HW –Duplicate FUs to get parallel execution –Increase ports to Register File - VLIW example needs 6 reads(2 Integer Unit, 2 LD Units and 2 SD Units) and 3 writes(1 Integer Unit and 2 LD Units) for Int. register needs 6 reads(4 FP Units, 2 SD Units) and 4 writes(2 FP Units and 2 LD Units) for FP register –Increase ports to memory –Decoding SS and impact on clock rate, pipeline depth

29 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 29 Limits to Multi-Issue Machines 3.Limitations specific to either SS or VLIW implementation –Multiple issue logic in SS –VLIW code size: unroll loops + wasted fields in VLIW –VLIW lock step => 1 hazard & all instructions stall –VLIW & binary compatibility is practical weakness

30 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 30 Detecting and Eliminating Dependencies Read Section 4.5

31 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 31 Software Pipelining Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)

32 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 32 SW Pipelining Example Before: Unrolled 3 times 1 LOOP LD F0,0(R1) 2ADDD F4,F0,F2 3SD 0(R1),F4 4LD F6,-8(R1) 5ADDD F8,F0,F2 6SD -8(R1),F8 7LD F10,16(R1) 8ADDD F12,F10,F2 9SD-16(R1),F12 10SUBIR1,R1,#24 11BNEZR1,LOOP After: Software Pipelined version of loop LD F0,0(R1) ADDD F4,F0,F2 LD F0,-8(R1) 1 LOOPSD 0(R1),F4; Stores to M[i] 2ADDD F4,F0,F2; Adds to M[i-1] 3LD F0,-16(R1); Loads from M[i-2] 4SUBI R1,R1,#8 5BNEZ R1,LOOP SD 0(R1),F4 ADDD F4,F0,F2 SD -8(R1),F4 Start-up code Finish code Iter i Iter i+1 Iter i+2 IF ID EX Mem WB SD ADDD LD Read F4(i) Write F4(i+1) Read F0(i) Write F0(i+2)

33 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 33 SW Pipelining Example Symbolic Loop Unrolling – Less code space – Overhead paid only once vs. each iteration in loop unrolling Software Pipelining Number of overlapped operations Time Loop Unrolling 100 iterations = 25 loops with 4 unrolled iterations each Number of overlapped operations Time...

34 Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 34 SummarySummary Branch Prediction –Branch History Table: 2 bits for loop accuracy –Correlation: Recently executed branches correlated with next branch –Branch Target Buffer: include branch address & prediction Superscalar and VLIW –CPI < 1 –Dynamic issue vs. Static issue –More instructions issue at the same time, larger the penalty of hazards SW Pipelining –Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead –SW pipelining works when the behavior of branches is fairly predictable at compile time


Download ppt "Dynamic Branch PredictionCS510 Computer ArchitecturesLecture 10 - 1 Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining."

Similar presentations


Ads by Google