Slide 1: Control Hazard Review (Spring 2003, CSE P548)

The nub of the problem:
- In what pipeline stage does the processor fetch the next instruction?
- If that instruction is a conditional branch, when does the processor know whether the branch is taken (execute code at the target address) or not taken (execute the sequential code)?
- What is the difference in cycles between them?

The cost of stalling until you know whether to branch:
  number of cycles in between * branch frequency = the contribution to CPI due to branches

Predict the branch outcome to avoid stalling.
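The stall-cost formula above can be made concrete with a small worked example; the branch frequency and stall count below are illustrative assumptions, not measurements from the slides:

```python
# Worked example of: stall cycles * branch frequency = CPI contribution.
# Both inputs are assumed values for illustration.
branch_frequency = 0.20   # assume one branch every 5 instructions
stall_cycles = 2          # assume the branch resolves 2 stages after fetch

cpi_penalty = stall_cycles * branch_frequency
base_cpi = 1.0
print(f"CPI contribution of branch stalls: {cpi_penalty:.2f}")  # 0.40
print(f"effective CPI: {base_cpi + cpi_penalty:.2f}")           # 1.40
```

With these numbers, stalling on every branch inflates CPI by 40%, which is why prediction is worth hardware.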

Slide 2: Branch Prediction Review

Branch prediction: resolve a branch hazard by predicting which path will be taken
- proceed under that assumption
- if wrong, flush the wrong-path instructions from the pipeline & fetch the right path

Dynamic branch prediction: the prediction changes as program behavior changes
- branch prediction implemented in hardware (static branch prediction is done by the compiler)
- common algorithm:
  - predict the branch taken if it branched the last time
  - predict the branch not taken if it didn't branch the last time

Performance improvement depends on:
- how soon you can check the prediction
- whether the prediction is correct (here's most of the innovation)

Slide 3: Branch Prediction Buffer

Branch prediction buffer:
- a small memory indexed by the lower bits of the address of a branch instruction, accessed during the fetch stage
- contains a prediction (which path the last branch to index this BPB location took)
- do what the prediction says to do

Outcomes:
- if the prediction is taken & it is correct: incur only a one-cycle penalty (why?)
- if the prediction is not taken & it is correct: incur no penalty (why?)
- if the prediction is incorrect:
  - change the prediction
  - also flush the pipeline (why?)
  - the penalty is the same as if there were no branch prediction (why?)

Slide 4: Two-bit Prediction

A single prediction bit does not work well with loops:
- it mispredicts the first & last iterations of a nested loop

Two-bit branch prediction for loops:
- algorithm: the predictor has to be wrong twice before the prediction is changed
- works well when branches predominantly go in one direction

Why? The second check makes sure that a short & temporary change of direction does not move the prediction away from the dominant direction.

What pattern is bad for two-bit branch prediction?

Slide 5: Two-bit Prediction

Often implemented as a table (a prediction buffer) of 2-bit saturating counters:
- increment on a taken branch, to no greater than 3
- decrement on a not-taken branch, to no less than 0
- the most significant bit is the prediction

Indexed by the low-order bits of the PC:
- prediction improves with table size -- why?

Could also be bits/counters associated with each cache line.
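The counter table above can be sketched in a few lines; the table size and the branch address used are illustrative assumptions:

```python
# Sketch of a table of 2-bit saturating counters, as described above.
# TABLE_BITS and the initial "weakly taken" state are assumptions.
TABLE_BITS = 10
table = [2] * (1 << TABLE_BITS)      # counters start at 2 (weakly taken)

def predict(pc: int) -> bool:
    """Most significant bit of the counter is the prediction."""
    return table[pc & ((1 << TABLE_BITS) - 1)] >= 2

def update(pc: int, taken: bool) -> None:
    """Saturating increment on taken, decrement on not taken."""
    i = pc & ((1 << TABLE_BITS) - 1)
    table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)

# A loop branch taken 9 times, then not taken once on loop exit:
correct = 0
for taken in [True] * 9 + [False]:
    if predict(0x40) == taken:
        correct += 1
    update(0x40, taken)
print(correct, "of 10 predicted correctly")  # 9 of 10
```

This shows why the scheme suits loops: the single exit misprediction does not flip the counter out of the taken half, so the next loop entry is still predicted correctly.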

Slide 6: Is Branch Prediction More Important Today?

Think about:
- Is the number of branches in code changing?
- Is the number of branches being executed changing?
- Is the misprediction penalty changing?

Slide 7: Branch Prediction is More Important Today

Conditional branches still comprise about 20% of instructions, but correct predictions are more important today. Why?
- pipelines are deeper: a branch is not resolved until more cycles after fetch, so the misprediction penalty is greater
- cycle times are smaller: more emphasis on throughput (performance)
- more functionality sits between fetch & execute
- multiple instruction issue (superscalars & VLIW): a branch occurs almost every cycle, and flushing & refetching involve more instructions
- object-oriented programming: more indirect branches, which are harder to predict
- the dual of Amdahl's Law: other forms of pipeline stalling are being addressed, so the portion of CPI due to branch delays is relatively larger

All of this means that the potential stalling due to branches is greater.

Slide 8: Branch Prediction is More Important Today

On the other hand, chips are denser, so we can consider sophisticated hardware solutions:
- the hardware cost is small compared to the performance gain

Slide 9: Directions in Branch Prediction

1: Improve the prediction
- correlated (2-level) predictor (Pentium Pro, Pentium III)
- hybrid local/global predictor (Alpha 21264)
- confidence predictors

2: Determine the target earlier
- branch target buffer (Pentium Pro, IA-64 Itanium)
- next address in I-cache (Alpha 21264, UltraSPARC)
- return address stack (Alpha 21264, IA-64 Itanium, MIPS R10000, Pentium Pro, UltraSPARC-3)

3: Reduce the misprediction penalty
- fetch both instruction streams (IBM mainframes, SuperSPARC)

4: Eliminate the branch
- predicated execution (IA-64 Itanium, Alpha 21264)

Slide 10: Correlated Predictor

The rationale: having the prediction depend on the outcome of only one branch might produce bad predictions, because some branch outcomes are correlated.

Example: same condition variable
  if (d==0) ...
  if (d!=0) ...

Example: related condition variables
  if (d==0) b=1;
  if (b==1) ...

Slide 11: Correlated Predictor

Another example: related condition variables
  if (x==2) /* branch 1 */ x=0;
  if (y==2) /* branch 2 */ y=0;
  if (x!=y) /* branch 3 */ do this; else do that;

If branches 1 & 2 are taken, branch 3 is not taken.

So: use a history of the past m branches
- it represents a path through the program (but still n bits of prediction)

Slide 12: Correlated Predictor

General idea of correlated branch prediction:
- put the global branch history in a global history register
  - the global history register is a shift register: shift the new branch outcome in at the left
- use its value to access a pattern history table (PHT) of 2-bit saturating counters

Slide 13: Correlated Predictor

Many implementation variations:
- number of history registers
  - 1 history register for all branches (global)
  - a table of history registers, 1 for each branch (private)
  - a table of history registers, each shared by several branches (shared)
- history length (the size of the history registers)
- number of PHTs

What is the trade-off?

Slide 14: Correlated Predictor

[Figure: private history tables, 4 bits of history, shared PHTs]

Slide 15: Correlated Predictor

Organization in the book: really a linear buffer, accessed by concatenating the global history with the low-order bits of the PC.

Current implementations instead XOR the branch address & the global history bits:
- called gshare
- more accuracy with the same bits, or equivalent accuracy with fewer bits
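The gshare scheme above can be sketched directly; the history length and the branch address are illustrative assumptions:

```python
# Minimal gshare sketch: PHT index = branch PC XOR global history.
# HIST_BITS and the initial counter values are assumptions.
HIST_BITS = 8
ghr = 0                               # global history register
pht = [2] * (1 << HIST_BITS)          # PHT of 2-bit saturating counters

def gshare_index(pc: int) -> int:
    # XOR the branch address with the global history (the gshare hash)
    return (pc ^ ghr) & ((1 << HIST_BITS) - 1)

def predict(pc: int) -> bool:
    return pht[gshare_index(pc)] >= 2

def update(pc: int, taken: bool) -> None:
    global ghr
    i = gshare_index(pc)
    pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)
    # shift the new outcome into the history register
    ghr = ((ghr << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

# An alternating branch (T,N,T,N,...) defeats a plain 2-bit counter,
# but gshare learns it once the pattern is in the history register:
results = []
for i in range(40):
    taken = (i % 2 == 0)
    results.append(predict(0x10) == taken)
    update(0x10, taken)
```

After a short warm-up, the two alternating history values map to two different PHT counters, each trained for its own outcome, so the later predictions are all correct.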

Slide 16: Correlated Predictor

Predictor classification (the book's simple version): (m,n) predictors
- m = history bits, the number of branches in the history
- n = prediction bits

Examples:
- (0,1) = 1-bit branch prediction buffer
- (0,2) = 2-bit branch prediction buffer
- (5,2) = first picture
- (4,2) = second picture
- (2,2) = book picture
- (4,2) = Pentium Pro scheme

Slide 17: Correlated Predictor

Predictor classification (the real version): Xy(m,n) predictors
- X = history register organization
  - GA = global, one for all branches
  - PA = private, one for each branch
  - SA = several, shared among a group of branches
- y = the number of PHTs
  - g = one
  - p = one per branch address
  - s = one per group of branch addresses
- m = history bits, the number of branches in the history
- n = prediction bits

Examples:
- GAs(5,2) = first picture
- PAs(4,2) = second picture

Slide 18: Tournament Predictor

Combine branch predictors:
- a local, per-branch prediction, accessed by the PC
- a correlated prediction based on the last m branches, accessed by the global history
- an indicator of which has been the better predictor for this branch
  - a 2-bit counter: increment for one, decrement for the other

Compaq Alpha:
- ~5% misprediction on SPEC95
- 2% of the die
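The chooser counter described above can be sketched on its own, with the two component predictors abstracted away; the table size and starting state are assumptions:

```python
# Sketch of a tournament chooser: a 2-bit counter per entry selects
# between two component predictions A and B. Sizes are assumptions.
N = 1 << 10
chooser = [1] * N        # 0-1 favor predictor A, 2-3 favor predictor B

def choose(pc: int, pred_a: bool, pred_b: bool) -> bool:
    """Return whichever component prediction the chooser currently favors."""
    return pred_b if chooser[pc % N] >= 2 else pred_a

def train_chooser(pc: int, pred_a: bool, pred_b: bool, taken: bool) -> None:
    # Move toward whichever component was right (no change on a tie).
    i = pc % N
    if pred_a == taken and pred_b != taken:
        chooser[i] = max(0, chooser[i] - 1)
    elif pred_b == taken and pred_a != taken:
        chooser[i] = min(3, chooser[i] + 1)
```

Because the counter saturates, a predictor that is consistently better for a branch wins that branch's entry, while a single disagreement cannot flip it back.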

Slide 19: Confidence Predictors

A confidence predictor indicates how confident you are in the prediction:
- if very confident, then follow the prediction
- if not confident, then stall

An implementation:
- a counter that is incremented when a prediction is correct and cleared when it is wrong
- the higher the value, the more confident you can be in the prediction
- pick a threshold value for following the prediction
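The resetting counter above is simple enough to sketch directly; the counter width and threshold are illustrative assumptions:

```python
# Sketch of the resetting confidence counter described above.
# MAX_COUNT and THRESHOLD are assumed values.
MAX_COUNT = 15          # 4-bit saturating counter
THRESHOLD = 4

confidence = 0

def record(prediction_correct: bool) -> None:
    global confidence
    # increment on a correct prediction, clear to zero on a wrong one
    confidence = min(MAX_COUNT, confidence + 1) if prediction_correct else 0

def confident() -> bool:
    return confidence >= THRESHOLD
```

Clearing (rather than decrementing) on a miss makes the counter a run-length detector: a high value means the last several predictions in a row were correct.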

Slide 20: Branch Target Buffer (BTB)

A cache that stores:
- the PCs of branches (tag)
- the predicted target address (data)
- branch prediction bits (data)

Accessed by the PC in the fetch stage:
- on a hit, the address was for this branch instruction
- fetch the target instruction if the prediction bits say taken

No branch delay if:
- the branch is found in the BTB, and
- the prediction is correct
(assume the BTB update is done in the following cycles)
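A toy model of the lookup above, using a dict in place of the tag-matched cache; the 4-byte instruction size and counter encoding are assumptions:

```python
# Toy BTB: maps a branch PC (the tag) to its predicted target address
# and a 2-bit prediction counter (the data). Assumes 4-byte instructions.
btb: dict[int, tuple[int, int]] = {}

def fetch_next(pc: int) -> int:
    """Return the predicted next PC during the fetch stage."""
    entry = btb.get(pc)
    if entry is not None:
        target, counter = entry
        if counter >= 2:          # prediction bits say taken
            return target
    return pc + 4                 # BTB miss or predicted not taken

def update(pc: int, target: int, taken: bool) -> None:
    """After resolution: allocate/train the entry (done in later cycles)."""
    _, counter = btb.get(pc, (target, 2))
    counter = min(3, counter + 1) if taken else max(0, counter - 1)
    btb[pc] = (target, counter)
```

The key point the slide makes is visible here: `fetch_next` needs only the PC, so the target is available in the fetch stage, before the branch is even decoded.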

Slide 21: Return Address Stack

The bad news:
- indirect jumps are hard to predict
- registers are accessed several stages after fetch

The good news:
- most indirect jumps (85%) are returns
- so optimize for the common case

A return address stack provides the return target early:
- the return address is pushed on a call, popped on a return
- best for procedures that are called from multiple call sites
  - a BTB would predict the address of the return from the last call
- if "big enough", it can predict returns perfectly
- these days: 1-32 entries
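The push-on-call, pop-on-return behavior can be sketched as follows; the 16-entry depth (within the 1-32 range the slide mentions), the 4-byte call size, and the overflow policy are assumptions:

```python
# Sketch of a return address stack: push on call, pop on return.
# RAS_DEPTH and the drop-oldest overflow policy are assumptions.
RAS_DEPTH = 16
ras: list[int] = []

def on_call(pc: int) -> None:
    # push the return address (the instruction after the call)
    if len(ras) == RAS_DEPTH:
        ras.pop(0)               # overflow: drop the oldest entry
    ras.append(pc + 4)

def predict_return(fallback: int) -> int:
    # pop the predicted return target; fall back if the stack is empty
    return ras.pop() if ras else fallback
```

Unlike a BTB entry, the stack naturally handles a procedure called from many sites: each call pushes its own return address, so nested returns unwind in the right order.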

Slide 22: Fetch Both Targets

Fetch both the target & the fall-through code:
- reduces the misprediction penalty
- but requires lots of I-cache bandwidth:
  - a dual-ported instruction cache or independent bank accessing
  - wide cache-to-pipeline buses

Slide 23: Predicated Execution

A review: instructions are executed conditionally
- set a condition
- test the condition & execute the instruction if the condition is true
- if the condition is false, don't write the instruction's result to the register file (disable the register write signal)
- i.e., instruction execution is predicated on the condition

Replaces a conditional branch (expensive if mispredicted):
- can fetch & execute instructions on both branch paths
- all instructions depend on the same condition
- eliminates branch latencies by changing a control hazard into a data hazard
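The transformation can be illustrated in software, modeling a conditional-move-style predicated write-back; the function names and the absolute-value example are invented for illustration:

```python
# Branchy vs. predicated forms of the same computation, modeling the
# idea on the slide. Names and the abs() example are illustrative.

def branchy_abs(x: int) -> int:
    if x < 0:            # conditional branch the predictor must guess
        x = -x
    return x

def predicated_abs(x: int) -> int:
    p = x < 0            # set the condition (the predicate)
    neg = -x             # execute the "taken path" unconditionally
    # write back the result only if the predicate is true
    # (the conditional-move step; otherwise keep the old value):
    return neg if p else x
```

Both paths' work is done in `predicated_abs`, but there is no control-flow decision left to mispredict; the condition has become an ordinary data dependence, exactly the trade the slide describes.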

Slide 24: Eliminate the Branch

Advantages of predicated execution:
+ no branch hazard
+ creates straight-line code, therefore better prefetching of instructions
  (prefetching = fetching instructions before you need them, to hide I-cache miss latency)
+ more independent instructions, therefore better code scheduling

Disadvantages of predicated execution:
- instructions on both paths are executed
- it may be hard to add predicated instructions to an existing instruction set
- additional register pressure
- complex conditions if loops are nested
- predicated instructions cannot generate exceptions, because that path might not actually execute
- good branch prediction might get the same effect

Slide 25: Today's Branch Prediction Strategy

Static and dynamic branch prediction work together.

Correlated branch prediction:
- Pentium III (512 entries, 2-bit)
- Pentium Pro (4 history bits)

gshare:
- MIPS R12000 (2K entries, 11 bits of PC, 8 bits of history)
- UltraSPARC-3 (16K entries, 14 bits of PC, 12 bits of history)

Combined branch prediction:
- Alpha: a combination of local (1K entries, 10 history bits) & global (4K entries) predictors

2 bits for every 2 instructions in the I-cache (UltraSPARC-1)

Slide 26: Today's Branch Prediction Strategy

BTB:
- 512 entries, 4-way set associative (Pentium Pro)
- 32 entries, 2-way set associative (?)
- no BTB; the target address is calculated (MIPS R10000, Alpha 21164, UltraSPARC-3)
- a next address for every 4 instructions in the I-cache (Alpha 21264)
  - here the "address" is an I-cache entry & set

Return address stack:
- Alpha 21264, MIPS R10000, Pentium Pro, UltraSPARC-3

Predicated execution:
- Alpha (conditional move)
- IA-64 Itanium (full predication)

Slide 27: Calculating the Cost of Branches

Factors to consider:
- branch frequency (a branch every 4-6 instructions)
- correct prediction rate
  - 1 bit: ~80% to 85%
  - 2 bits: ~high 80s to 90%
  - correlated branch prediction: ~95%
- misprediction penalty
  - Alpha 21164: 5 cycles; Alpha 21264: 7 cycles
  - UltraSPARC-1: 4 cycles
  - Pentium Pro: at least 9 cycles, 15 on average
  - then multiply by the instruction issue width
- or the misfetch penalty
  - the prediction is correct but the target address is not yet known
    (may also apply to unconditional branches)

Slide 28: Calculating the Cost of Branches

What is the probability that a branch is taken?

Given:
- 20% of branches are unconditional branches
- of the conditional branches, 66% branch forward & are evenly split between taken & not taken
- the rest branch backward & are almost always taken
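Working the slide's numbers (and treating "almost always taken" as 100%, an approximation):

```python
# P(taken) from the slide's givens. Unconditional branches are always
# taken; "almost always taken" is approximated as 1.0.
p_uncond = 0.20                 # unconditional branches
p_cond = 1.0 - p_uncond         # conditional branches

p_fwd_taken = 0.66 * 0.5        # forward conditionals, half taken
p_bwd_taken = 0.34 * 1.0        # backward conditionals, ~always taken

p_taken = p_uncond + p_cond * (p_fwd_taken + p_bwd_taken)
print(f"P(branch taken) = {p_taken:.3f}")   # 0.736
```

So roughly three out of four branches are taken, which is why "predict taken" is a sensible static default.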

Slide 29: Calculating the Cost of Branches

What is the contribution to CPI of conditional branch stalls, given:
- 15% branch frequency
- a BTB for conditional branches only, with a 10% miss rate
- a 3-cycle miss penalty
- 92% prediction accuracy
- a 7-cycle misprediction penalty
- a base CPI of 1

BTB result | Prediction | Frequency (per instruction) | Penalty (cycles) | Stalls
miss       | --         | .15 * .10       = .015      | 3                | .045
hit        | correct    | .15 * .90 * .92 = .1242     | 0                | 0
hit        | incorrect  | .15 * .90 * .08 = .0108     | 7                | .0756

Total contribution to CPI = .045 + .0756 = .121
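The table above can be reproduced in code, which also shows where each row's stall term comes from:

```python
# Reproducing the slide's CPI table: stalls = frequency * penalty per row.
branch_freq = 0.15
btb_miss = 0.10
miss_penalty = 3
accuracy = 0.92
mispredict_penalty = 7

stall_miss = branch_freq * btb_miss * miss_penalty                       # .045
stall_wrong = branch_freq * (1 - btb_miss) * (1 - accuracy) \
              * mispredict_penalty                                       # .0756
cpi_contribution = stall_miss + stall_wrong                              # .1206

print(f"branch stalls add {cpi_contribution:.3f} to CPI")  # 0.121
base_cpi = 1.0
print(f"effective CPI = {base_cpi + cpi_contribution:.3f}")
```

The hit-and-correct row contributes nothing, so only the BTB-miss and misprediction rows appear; branches alone inflate the base CPI of 1 by about 12%.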

Slide 30: Dynamic Branch Prediction, in Summary

Stepping back & looking forward, how do you figure out whether branch prediction (or any other aspect of a processor) is still important to pursue?
- Look at technology trends.
- How do the trends affect the different aspects of prediction performance?
- What techniques improve the important factors?

Slide 31: Prediction Research

- Predicting variable values
- Predicting load addresses
- Predicting many levels of branches