CS510 Computer Architectures, Lecture 10: Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining



Review: Tomasulo Summary
Register file is not the bottleneck
Avoids the WAR and WAW hazards of the scoreboard
Not limited to basic blocks (given branch prediction)
Allows loop unrolling in HW
Lasting contributions:
– Dynamic scheduling
– Register renaming
– Load/store disambiguation

Reasons for Branch Prediction
For a branch instruction, predict whether the branch will actually be TAKEN or NOT TAKEN
Pipeline bubbles (stalls) due to branches are a main source of performance degradation
– Prediction avoids these bubbles by feeding the pipeline with instructions from the predicted path (speculatively)
Speculative execution significantly improves the performance of deeply pipelined, wide-issue superscalar processors
Accurate branch prediction is essential because all speculative work beyond a mispredicted branch must be thrown away
Performance = f(accuracy, cost of misprediction)
– Accuracy is better with dynamic prediction
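To make the cost of misprediction concrete, here is a back-of-the-envelope CPI model in Python; the 20% branch frequency and 3-cycle flush penalty are illustrative assumptions, not figures from the lecture:

```python
def effective_cpi(base_cpi, branch_freq, accuracy, mispredict_penalty):
    # Every mispredicted branch costs a fixed pipeline-flush penalty.
    return base_cpi + branch_freq * (1.0 - accuracy) * mispredict_penalty

# Assumed workload: ideal CPI of 1, 20% branches, 3-cycle misprediction penalty.
for acc in (0.50, 0.90, 0.99):
    print(f"accuracy={acc:.2f}  CPI={effective_cpi(1.0, 0.20, acc, 3):.3f}")
```

In this model, going from 50% to 99% accuracy cuts the branch penalty from 0.30 to 0.006 cycles per instruction, which is why dynamic predictors are worth their hardware cost.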

1-Bit Branch History Table
Simplest of all dynamic prediction schemes
Based on the property that most branches are either usually TAKEN or usually NOT TAKEN
– a property found in the iterations of a loop
Keep the last outcome of each branch in the BHT
The BHT is indexed using the low-order bits of the PC
[Figure: the PC indexes the BHT; the stored bit is the prediction, and the actual outcome is written back into the entry]
Problem: for the branch at the end of a loop, a 1-bit BHT causes two mispredictions per execution of the loop:
– the last time through the loop, when it exits instead of looping as before
– the first time through the loop on the next execution, when it predicts exit instead of looping
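A minimal sketch of such a table in Python (the 4096-entry size and direct PC masking are assumptions for illustration; a real table may hash the PC or add tags):

```python
class OneBitBHT:
    """1-bit branch history table: remember only the last outcome per entry."""
    def __init__(self, entries=4096):
        self.table = [1] * entries               # 1 = last outcome was TAKEN
        self.mask = entries - 1                  # entries assumed to be a power of two

    def predict(self, pc):
        return self.table[pc & self.mask] == 1   # index with the low-order PC bits

    def update(self, pc, actually_taken):
        self.table[pc & self.mask] = 1 if actually_taken else 0
```

Note that two branches whose PCs share the same low-order bits alias to the same entry; the BHT Accuracy slide below counts this as one of the two sources of misprediction.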

1-Bit Branch History Table Example

main() {
  int i, j;
  while (1)
    for (i=1; i<=4; i++) {
      j = i++;
    }
}

'while' branch: always TAKEN
'for' branch: TAKEN x 3, NOT TAKEN x 1
Prediction accuracy of the 'for' branch: 50%
[Trace table omitted: columns are branch outcome, last outcome, prediction, new last outcome, and correctness (O = correct, X = mispredict), assuming an initial last outcome of 1; in steady state the 'for' branch is mispredicted both when the loop exits and again on the first iteration of the next execution]
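A short, self-contained simulation of the 'for' branch under a 1-bit predictor; the outcome pattern and the initial value follow the slide, and everything else is just scaffolding:

```python
last, correct = 1, 0                       # initial last outcome = 1 (TAKEN), as on the slide
outcomes = [1, 1, 1, 0] * 1000             # 'for' branch: TAKEN x 3, then NOT TAKEN, repeated
for taken in outcomes:
    if last == taken:                      # the prediction is simply the last outcome
        correct += 1
    last = taken                           # remember this outcome for next time
print(correct / len(outcomes))             # ~0.50: two mispredictions every four iterations
```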

2-Bit Counter Scheme
Prediction is made based on the last two branch outcomes
Each BHT entry consists of 2 bits, usually a 2-bit counter, associated with the state of an automaton (many different automata are possible)
[Figure: the PC indexes a BHT of 2-bit entries; each entry's automaton produces the prediction and is updated with the branch outcome]

2-Bit BHT (1)
A 2-bit scheme in which the prediction changes only after two consecutive mispredictions
The MSB of the state represents the prediction: 1x = TAKEN, 0x = NOT TAKEN
States: 11 Predict TAKEN, 10 Predict TAKEN, 01 Predict NOT TAKEN, 00 Predict NOT TAKEN
[State diagram omitted: a saturating counter in which a TAKEN outcome moves the state up toward 11 and a NOT TAKEN outcome moves it down toward 00]
Prediction accuracy of the 'for' branch: 75%
[Trace table omitted: columns are branch outcome, counter value, prediction, new counter value, and correctness, assuming an initial counter value of 11; only the NOT TAKEN outcome at the end of each loop execution is mispredicted]
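A sketch of one such 2-bit saturating counter, replaying the slide's 'for'-branch pattern; the counter encoding and the initial value of 11 follow the slide, the rest is illustrative:

```python
def predict(counter):
    return counter >= 2                    # MSB set (states 10, 11) -> predict TAKEN

def update(counter, taken):
    return min(counter + 1, 3) if taken else max(counter - 1, 0)   # saturate at 11 / 00

counter, correct = 3, 0                    # initial counter value 11, as on the slide
pattern = [True, True, True, False] * 1000 # 'for' branch: T, T, T, NT repeated
for taken in pattern:
    correct += (predict(counter) == taken)
    counter = update(counter, taken)
print(correct / len(pattern))              # 0.75: only the loop-exit outcome is mispredicted
```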

2-Bit BHT (2)
A variant 2-bit automaton with the same four states but different transitions
A branch going in the unusual direction once causes only one misprediction, rather than two as in the 1-bit BHT
The MSB of the state represents the prediction: 1x = TAKEN, 0x = NOT TAKEN
States: 11 Predict TAKEN, 10 Predict TAKEN, 01 Predict NOT TAKEN, 00 Predict NOT TAKEN
Prediction accuracy of the 'for' branch: 75%
[Trace table omitted: same columns as before, assuming an initial counter value of 10; again only the NOT TAKEN outcome at the end of each loop execution is mispredicted]

2-Level Self Prediction
First level:
– Records the previous few outcomes (the history) of the branch itself
– Branch History Table (BHT): each entry is a shift register
Second level:
– Predictor Table (PT), indexed by the history pattern captured in the BHT
– Each entry is a 2-bit counter
[Figure: the PC indexes the BHT; the history read from the BHT indexes the PT, whose counter gives the prediction]

2-Level Self Prediction Algorithm
Using the PC, read the branch's history from the BHT
Using that self history, read the counter value from the PT
If the MSB of the counter is 1, predict TAKEN; else predict NOT TAKEN
After the branch outcome is resolved, do the bookkeeping:
– If mispredicted, discard all speculative execution
– Shift the new branch outcome into the BHT entry that was read
– Update the PT: if the branch outcome was TAKEN, increase the counter value by 1; else decrease it by 1
[Trace table omitted: with an initial BHT entry of 0000, a self-history length of 4, and initial PT values of 10, the 'for' branch is predicted with 100% accuracy once the tables warm up]
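A minimal sketch of this two-level local predictor; the 1024 history entries and 4-bit history length are illustrative assumptions, and the shift direction is arbitrary as long as predict and update agree:

```python
class TwoLevelLocal:
    def __init__(self, bht_entries=1024, hist_bits=4):
        self.hist_bits = hist_bits
        self.bht = [0] * bht_entries              # level 1: per-branch history shift registers
        self.pt = [2] * (1 << hist_bits)          # level 2: one 2-bit counter per history pattern
        self.mask = bht_entries - 1

    def predict(self, pc):
        history = self.bht[pc & self.mask]
        return self.pt[history] >= 2              # MSB of the selected counter

    def update(self, pc, taken):
        i = pc & self.mask
        history = self.bht[i]
        self.pt[history] = min(self.pt[history] + 1, 3) if taken else max(self.pt[history] - 1, 0)
        self.bht[i] = ((history << 1) | int(taken)) & ((1 << self.hist_bits) - 1)
```

On the slide's repeating T, T, T, NT pattern, each of the four distinct 4-bit histories ends up with its own counter, so after warm-up every prediction is correct, matching the 100% figure above.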

BHT Accuracy
A misprediction occurs for one of two reasons:
– The prediction stored for that branch was wrong
– The entry indexed actually holds the history of a different branch (aliasing in the table)
With a 4096-entry table, misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
A 4096-entry table is about as good as an infinite table, but 4096 entries is a lot of HW

Correlating Branches Example

if (d==0) d=1;
if (d==1) ...

    BNEZ  R1, L1      ; branch b1 (taken if d != 0)
    ADDI  R1, R1, #1  ; d was 0, so d = 1
L1: SUBI  R3, R1, #1
    BNEZ  R3, L2      ; branch b2 (taken if d != 1)
    .....
L2:

Initial value of d   d==0?   b1   Value of d before b2   d==1?   b2
0                    Yes     NT   1                      Yes     NT
2                    No      T    2                      No      T

If b1 is NOT TAKEN, then b2 is NOT TAKEN.

With 1-bit self-history predictors and d following the sequence 2, 0, 2, 0, ..., every branch is mispredicted:

d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT

Correlating Branches
Idea: whether recently executed branches were TAKEN or NOT TAKEN (the global history, kept in the GHR) is related to the behavior of the next branch, in addition to that branch's own history
The behavior of the recent branches (GHR) selects one of, say, four predictions of the next branch from the Pattern History Table (PHT), and only the selected prediction is updated
[Figure: a (2,2) correlating predictor; the branch address indexes the Pattern History Table (PHT), whose entries each hold four 2-bit predictors, and the 2-bit Global History Register (GHR) selects which of the four to use]
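A sketch of such a (2,2) predictor in Python; the 1024-entry table and direct PC masking are illustrative assumptions:

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2 bits of global history select one of four 2-bit counters."""
    def __init__(self, entries=1024, ghr_bits=2):
        self.ghr_bits = ghr_bits
        self.ghr = 0                                        # outcomes of the last 2 branches
        self.pht = [[1] * (1 << ghr_bits) for _ in range(entries)]
        self.mask = entries - 1

    def predict(self, pc):
        return self.pht[pc & self.mask][self.ghr] >= 2      # MSB of the selected counter

    def update(self, pc, taken):
        entry = self.pht[pc & self.mask]
        entry[self.ghr] = min(entry[self.ghr] + 1, 3) if taken else max(entry[self.ghr] - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.ghr_bits) - 1)
```

Only the counter selected by the current global history is updated, exactly as the slide describes; the other three predictions for that branch are left untouched.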

Correlating Branches
Same code as before: branch b1 is taken if d != 0, branch b2 is taken if d != 1
Each branch now keeps a pair of 1-bit predictions; the first is used if the most recently executed branch was NOT TAKEN, the second if it was TAKEN:

Prediction bits   Prediction if last branch was NT   Prediction if last branch was T
NT/NT             NT                                 NT
NT/T              NT                                 T
T/NT              T                                  NT
T/T               T                                  T

With initial prediction bits of NT/NT for both branches and the last branch initially NOT TAKEN, the sequence d = 2, 0, 2, 0, ... gives:

d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     NT/NT           T           T/NT                NT/NT           T           NT/T
0     T/NT            NT          T/NT                NT/T            NT          NT/T
2     T/NT            T           T/NT                NT/T            T           NT/T
0     T/NT            NT          T/NT                NT/T            NT          NT/T

Mispredictions occur only in the first two predictions (the first execution of b1 and the first execution of b2).

Accuracy of Different Schemes
[Chart omitted: frequency of mispredictions, 0% to 18%, on nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li for three predictors: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) correlating BHT; the (2,2) predictor generally has the lowest misprediction rate despite using less storage]

Need the Predicted Address as well as the Prediction
Branch Target Buffer (BTB): indexed by the address of the branch instruction to get the prediction AND the branch target address (if taken)
[Figure: the PC of the instruction being fetched is compared against the branch addresses stored in the BTB; No match: the instruction is not predicted to be a branch, proceed normally; Match: the instruction is predicted to be a branch, and the predicted PC should be used as the next PC]
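A tiny sketch of the lookup and update, using a Python dict in place of the tagged hardware table; the 4-byte fall-through and the keep-only-taken-branches policy mirror the flow chart on the next slide:

```python
btb = {}                              # branch PC -> predicted target PC (taken branches only)

def fetch_next_pc(pc):
    """Return (hit, next_pc) for the instruction being fetched at pc."""
    if pc in btb:                     # hit: predicted to be a taken branch
        return True, btb[pc]
    return False, pc + 4              # miss: not predicted as a branch, fall through

def resolve_branch(pc, taken, target):
    """Called once the branch executes, to keep the BTB in sync."""
    if taken:
        btb[pc] = target              # enter / refresh the branch PC -> target mapping
    else:
        btb.pop(pc, None)             # not taken: remove the entry (it caused a misprediction)
```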

Branch Target Buffer
IF: send the PC to memory and to the branch-target buffer
– Entry found in the BTB? If yes, send out the predicted PC
ID: is the instruction actually a taken branch?
– Not in the BTB but taken: enter the branch instruction address and the next PC into the BTB
– In the BTB but not taken: mispredicted branch; kill the fetched instructions, restart fetch at the other target, and delete the entry from the BTB
– In the BTB and taken: branch correctly predicted
EX: normal instruction execution

Instruction in BTB?   Prediction   Actual branch   Penalty (cycles)
Yes                   Taken        Taken           0
Yes                   Taken        Not taken       2
No                    -            Taken           2
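Those penalties can be folded into an expected cost per branch; the rates below (90% BTB hit rate, 90% prediction accuracy for branches found in the BTB, 60% of the remaining branches taken) are assumed for illustration, not figures from the lecture:

```python
btb_hit_rate = 0.90          # fraction of branches found in the BTB
prediction_accuracy = 0.90   # of the branches found in the BTB, fraction actually taken
fraction_taken = 0.60        # of the branches not in the BTB, fraction that turn out taken

penalty = (btb_hit_rate * (1 - prediction_accuracy) * 2     # in BTB, but not taken
           + (1 - btb_hit_rate) * fraction_taken * 2)       # missed a taken branch
print(f"average branch penalty = {penalty:.2f} cycles")     # 0.30
```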


Getting CPI < 1: Issuing Multiple Instructions per Cycle
Two variations:
– Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
  Examples: IBM PowerPC, Sun SuperSPARC, DEC Alpha, HP 7100
– Very Long Instruction Word (VLIW): a fixed number of instructions (16), scheduled by the compiler
  Joint HP/Intel agreement in 1998?

Getting CPI < 1: Issuing Multiple Instructions per Cycle
Superscalar DLX: 2 instructions per cycle, 1 FP and 1 of anything else
– Fetch 64 bits per clock cycle; the integer instruction on the left, the FP instruction on the right
– The 2nd instruction can issue only if the 1st instruction issues
– More ports are needed on the FP registers to do an FP load and an FP op as a pair

Instr. type        Pipe stages
Int. instruction   IF ID EX MEM WB
FP instruction     IF ID EX MEM WB
Int. instruction      IF ID EX MEM WB
FP instruction        IF ID EX MEM WB
Int. instruction         IF ID EX MEM WB
FP instruction           IF ID EX MEM WB

A 1-cycle load delay expands to 3 instructions in this superscalar: the instruction in the right half of the same pair cannot use the loaded value, nor can either instruction in the next issue slot.

Unrolled Loop that Minimizes Stalls for Scalar

Original loop:
Loop:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4
       SUBI  R1, R1, #8
       BNEZ  R1, Loop

Unrolled four times and scheduled (LD to ADDD: 1 cycle; ADDD to SD: 2 cycles):
1  Loop:  LD    F0, 0(R1)
2         LD    F6, -8(R1)
3         LD    F10, -16(R1)
4         LD    F14, -24(R1)
5         ADDD  F4, F0, F2
6         ADDD  F8, F6, F2
7         ADDD  F12, F10, F2
8         ADDD  F16, F14, F2
9         SD    0(R1), F4
10        SD    -8(R1), F8
11        SUBI  R1, R1, #32
12        SD    16(R1), F12
13        BNEZ  R1, LOOP
14        SD    8(R1), F16    ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration

Loop Unrolling in Superscalar

Integer instruction        FP instruction         Clock cycle
Loop: LD   F0, 0(R1)                              1
      LD   F6, -8(R1)                             2
      LD   F10, -16(R1)    ADDD F4, F0, F2        3
      LD   F14, -24(R1)    ADDD F8, F6, F2        4
      LD   F18, -32(R1)    ADDD F12, F10, F2      5
      SD   0(R1), F4       ADDD F16, F14, F2      6
      SD   -8(R1), F8      ADDD F20, F18, F2      7
      SD   -16(R1), F12                           8
      SUBI R1, R1, #40                            9
      SD   16(R1), F16                            10
      BNEZ R1, LOOP                               11
      SD   8(R1), F20                             12

Unrolled 5 times to avoid delays (+1 iteration due to the 2-cycle ADDD-to-SD latency)
12 clocks, or 2.4 clocks per iteration
Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles

Dynamic Scheduling in Superscalar
Dependences stop instruction issue
Code compiled for the scalar version will run poorly on a superscalar
– Good code for a superscalar depends on the structure of that particular superscalar
Simple approach:
– Issue an integer instruction and an FP instruction to their separate reservation stations, i.e., one set for the integer FU/registers and one for the FP FU/registers
– Issue both only when the two instructions do not access the same register; instructions with a data dependence cannot be issued together

Dynamic Scheduling in Superscalar
How can two instructions be issued to the reservation stations while keeping in-order issue for Tomasulo?
– In-order issue simplifies the bookkeeping
– Run the issue stage at 2x the clock rate, so two issues can be made within one ordinary clock cycle; issue remains in order, yet both instructions can begin execution in the same ordinary clock cycle
– Only FP loads can cause a dependence between the integer and FP issue: replace the load reservation stations with a load queue; operands must be read in the order they are fetched
– A load checks the addresses in the store queue to avoid a RAW violation
– A store checks the addresses in the load queue to avoid WAR and WAW violations
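A conceptual sketch of those address checks (not the lecture's hardware; just the rule each queue enforces, with bare addresses standing in for full queue entries):

```python
store_queue = []   # addresses of older stores that have not yet written memory (fetch order)
load_queue = []    # addresses of older loads that have not yet read memory (fetch order)

def load_can_issue(addr):
    # RAW: a load must wait for any older store to the same address to complete first.
    return addr not in store_queue

def store_can_issue(addr):
    # WAR: an older load of the same address must read before this store overwrites it;
    # WAW: an older store to the same address must write first.
    return addr not in load_queue and addr not in store_queue
```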

Performance of Dynamic SS

Iter  Instruction
1     LD    F0, 0(R1)
1     ADDD  F4, F0, F2
1     SD    0(R1), F4
1     SUBI  R1, R1, #8
1     BNEZ  R1, LOOP
2     LD    F0, 0(R1)
2     ADDD  F4, F0, F2
2     SD    0(R1), F4
2     SUBI  R1, R1, #8
2     BNEZ  R1, LOOP

[Timing columns omitted: the slide also lists, for each instruction, the clock cycle in which it issues, executes (efa = effective-address calculation), accesses memory, and writes its result]
4 clocks per iteration; branches and decrements still take 1 clock cycle
2 more cycles are needed for the last iteration to complete

Limits of Superscalar
While the integer/FP split is simple for the HW, it achieves a CPI of 0.5 only for programs with:
– Exactly 50% FP operations
– No hazards
If more instructions issue at the same time, decode and issue become harder
– Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue
VLIW: trade instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word can execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch; at 16 to 24 bits per field, the word is 7 x 16 = 112 bits to 7 x 24 = 168 bits wide
– Needs a compiling technique that schedules across several branches

Loop Unrolling in VLIW

Memory ref 1       Memory ref 2       FP operation 1      FP operation 2      Int. op / branch   Clock
LD F0, 0(R1)       LD F6, -8(R1)                                                                 1
LD F10, -16(R1)    LD F14, -24(R1)                                                               2
LD F18, -32(R1)    LD F22, -40(R1)    ADDD F4, F0, F2     ADDD F8, F6, F2                        3
LD F26, -48(R1)                       ADDD F12, F10, F2   ADDD F16, F14, F2                      4
                                      ADDD F20, F18, F2   ADDD F24, F22, F2                      5
SD 0(R1), F4       SD -8(R1), F8      ADDD F28, F26, F2                                          6
SD -16(R1), F12    SD -24(R1), F16                                                               7
SD -32(R1), F20    SD -40(R1), F24                                            SUBI R1, R1, #48   8
SD 0(R1), F28                                                                  BNEZ R1, LOOP      9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
Needs more registers in VLIW

Limits to Multi-Issue Machines
1. Inherent limitations of ILP in programs
– 1 branch in every 5 instructions: how do you keep a 5-way VLIW busy?
– Latencies of units mean many operations must be scheduled
– Need about (pipeline depth) x (number of functional units) independent operations to keep the machine busy
2. Difficulties in building the HW
– Duplicate FUs to get parallel execution
– More ports on the register file: the VLIW example needs 6 reads (2 integer unit, 2 LD units, 2 SD units) and 3 writes (1 integer unit, 2 LD units) for the integer registers, and 6 reads (4 FP units, 2 SD units) and 4 writes (2 FP units, 2 LD units) for the FP registers
– More ports to memory
– Decoding in the superscalar and its impact on clock rate and pipeline depth

Limits to Multi-Issue Machines
3. Limitations specific to either the superscalar or the VLIW implementation
– Multiple-issue logic in the superscalar
– VLIW code size: unrolled loops plus wasted fields in the long instruction words
– VLIW lock-step execution: 1 hazard stalls all the instructions in the word
– Binary compatibility is a practical weakness of VLIW

Detecting and Eliminating Dependencies
Read Section 4.5

Software Pipelining
Observation: if the iterations of a loop are independent, then ILP can be obtained by taking instructions from different iterations
Software pipelining reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)

SW Pipelining Example

Before: unrolled 3 times
1   LOOP:  LD    F0, 0(R1)
2          ADDD  F4, F0, F2
3          SD    0(R1), F4
4          LD    F6, -8(R1)
5          ADDD  F8, F6, F2
6          SD    -8(R1), F8
7          LD    F10, -16(R1)
8          ADDD  F12, F10, F2
9          SD    -16(R1), F12
10         SUBI  R1, R1, #24
11         BNEZ  R1, LOOP

After: software-pipelined version of the loop
Start-up code:
           LD    F0, 0(R1)
           ADDD  F4, F0, F2
           LD    F0, -8(R1)
Kernel:
1   LOOP:  SD    0(R1), F4     ; stores into M[i]
2          ADDD  F4, F0, F2    ; adds for M[i-1]
3          LD    F0, -16(R1)   ; loads from M[i-2]
4          SUBI  R1, R1, #8
5          BNEZ  R1, LOOP
Finish code:
           SD    0(R1), F4
           ADDD  F4, F0, F2
           SD    -8(R1), F4

In each pass through the kernel, the SD completes the oldest iteration, the ADDD does the add for the next one, and the LD fetches data for the one after that, so the LD-to-ADDD and ADDD-to-SD latencies are covered without stalls.
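The same transformation written out in Python rather than DLX, so the start-up/kernel/finish structure is easy to see; the function name and the a[i] = a[i] + c computation are illustrative stand-ins for the LD/ADDD/SD loop above, and the sketch assumes the array has at least two elements:

```python
def add_constant_pipelined(a, c):
    """Software-pipelined form of: for each i, a[i] = a[i] + c."""
    n = len(a)
    # Start-up code: get the first load and add "in flight".
    loaded = a[0]                      # load for iteration 0
    added = loaded + c                 # add for iteration 0
    loaded = a[1]                      # load for iteration 1
    # Kernel: each pass stores iteration i, adds for iteration i+1, loads for iteration i+2.
    for i in range(n - 2):
        a[i] = added                   # store for iteration i
        added = loaded + c             # add for iteration i+1
        loaded = a[i + 2]              # load for iteration i+2
    # Finish code: drain the last two iterations.
    a[n - 2] = added
    a[n - 1] = loaded + c
    return a

print(add_constant_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 10.0))   # [11.0, 12.0, 13.0, 14.0, 15.0]
```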

SW Pipelining Example
Software pipelining is symbolic loop unrolling
– Less code space
– The fill/drain overhead is paid only once, rather than in every unrolled loop body as in loop unrolling (e.g., 100 iterations = 25 passes through a loop unrolled 4 times)
[Figure omitted: number of overlapped operations vs. time; software pipelining sustains maximum overlap across the whole loop, whereas loop unrolling loses overlap at the start and end of each unrolled body]

Summary
Branch prediction
– Branch History Table: 2 bits per entry for loop accuracy
– Correlation: recently executed branches are correlated with the next branch
– Branch Target Buffer: includes the branch target address as well as the prediction
Superscalar and VLIW
– CPI < 1
– Dynamic issue (superscalar) vs. static issue (VLIW)
– The more instructions that issue at the same time, the larger the penalty of hazards
SW pipelining
– Symbolic loop unrolling gets the most from the pipeline with little code expansion and little overhead
– Works when the behavior of the loop branches is fairly predictable at compile time