ECE 486/586 Computer Architecture Chapter 16 Branch Prediction


1 ECE 486/586 Computer Architecture Chapter 16 Branch Prediction
Herbert G. Mayer, PSU Status 3/8/2017

2 Motivation A computer architecture specification defines the instruction set, memory, connecting bus, and the way the processor interfaces with the real world, as viewed by that outside, real world. This view is referred to as the ISA, the Instruction Set Architecture. The expectation is that execution be fast! The need for speed holds even in the presence of deep instruction pipes, despite operations that change the flow of control. To achieve its speed goal, a processor needs good branch prediction, something not visible in the ISA. Some past architectures include branch prediction bits in instructions, e.g. Alpha; is this a good idea? Here we study branch prediction in detail

3 Motivation If we could predict the future, computation would be swift and accurate :-) All we’d need is clairvoyance! In such an imaginary world, with perception of the future, the effect of stalls created by branches in a pipelined CPU could be eliminated

4 Motivation Even before a branch instruction is decoded, the pipeline could be primed again with the right instruction stream from a new location, i.e. the destination of the predicted branch; again assuming clairvoyance! Unfortunately, we generally don’t know for certain whether a conditional branch will be taken until the condition is completely evaluated. We also don’t know the destination of a branch until that address has been computed. Ditto for other flow-control instructions, such as calls, returns, conditional branches, exceptions, etc.

5 Motivation But we can guess (i.e. predict) the outcome of the condition for a branch, and we can guess the destination; we can also guess either one wrong. We cannot predict with certainty! To help us guess the right branch condition & destination, we remember them from a previous execution step, and then predict that the next condition & destination might be the same. If this helps us guess correctly most of the time, we still gain great performance advantages. Guessing always right would be nicer, but we are, after all, only mere mortals. Branch prediction strategies learn from the past to guess future behavior correctly most of the time. Practically this means almost 98% accurate predictions, which is necessary particularly on highly pipelined architectures

6 Motivation To exploit deep pipelines, very high prediction accuracy is mandatory; close to 98% may not be perfect but is quite accurate! Achieve this for the lottery and you end up rich! Each time a prediction is wrong, the pipe has to be flushed: arithmetic units hold invalid operands that are not needed; new ones must be loaded instead! The deeper the stages of a pipelined architecture, the more stringent the accuracy requirements for a branch prediction scheme. On an Intel Cedar Mill (2006) processor, like Prescott, the pipeline is over 2 dozen stages deep; inside that pipeline there are about 5 branches in flight on average; without branch prediction it would practically never reach the steady state

7 Syllabus
Introduction
What’s Bad About Branches?
Static Branch Prediction
Bimodal Branch Prediction
Gshare Prediction
Branch Prediction Via I-Cache
Prediction With Perceptrons
Dynamic Branch Prediction
Prediction Accuracy for SPECint92
Appendix
Bibliography

8 Introduction Execution on a highly pipelined and wide-issue architecture suffers severe degradation whenever an instruction disrupts the prefetched flow of operations, which are in various stages of partial completion. Typically, control-transfer instructions cause such pipeline hazards. The higher the degree of pipelining, the larger the number of partially executed (fetched, decoded, operand-fetched, etc.) instructions to be discarded when branches occur. The pipeline must be flushed and primed again, i.e. be filled again with other instructions, soon partially executed themselves

9 Introduction The bad news: about one in five operations is a control-flow instruction, e.g. branch, call, return, conditional branch, exit, abort, exception, etc. This almost invalidates the grand architectural advantage of pipelining! If it were possible to predict a condition, and if the machine could predict the destination of a branch before computing it from the instruction stream, then as soon as any branch is fetched, the pipe could be filled correctly from the proper code address: stalls would be avoided. Next follow statistics about branch prediction accuracy for a common benchmark on typical μPs

10 Introduction Intel Core Duo two-level dynamic branch prediction vs. AMD K8 shows the benefit of Intel’s branch prediction investment; [13]

11 What’s Bad About Branches?
Performance Penalties, Delay, Disturbance: Branches disrupt the sequential control flow, and that disturbs anticipated next execution steps, i.e. this messes up I-caches! The higher the number of pipeline stages, the greater the performance penalty Thus, a high degree of pipelining is also a liability, not purely goodness! Branches cause I-cache disturbance due to some new address range Conditional branch must determine the future direction: fall-through or branch to new target? Unconditional branches must only determine the new instruction’s target

12 What’s Bad About Branches?
Determine Branch Direction: Cannot immediately fetch the subsequent instruction, since its location is not known; how to fix this? Remedy 1: if possible, replace conditional execution via branch with a conditional move instruction! Remedy 2: if possible, relocate operations that compute the branch condition ahead of the branch, so that waiting for the condition’s result is minimized. Remedy 3: make use of the delay penalty, e.g. via a Branch Delay Slot – fallen out of favor since the 2000s. Remedy 4: bias conditional branches toward NOT taken (or vice versa), as done in some static prediction schemes; except this is not sufficiently accurate! Remedy 5: predict the condition and destination!!

13 What’s Bad About Branches?
Determine Branch Direction, Cont’d: Fill delay slot with useful instruction; e.g. Intel i860 processor This HW trick is being used less since the 2000s; often ended up being a noop anyway, i.e. unused Or, execute both paths speculatively. Once condition result is known, kill the superfluous path. Requires more HW, and can cause explosion of HW when jumping to further branches So done successfully on Itanium Processor Family (IPF); high HW cost, but with real performance boost Or predict branch direction, Remedy 5 covered here! Determine branch target: Must know target address, to fetch next; for that foresight, use prediction

14 What’s Bad About Branches?
Saturating Counter prediction (see bimodal algorithm) with 2 bits can reach 80% direction accuracy, in rare cases even 90%, though that is not typical! Awesome policy: 2 data bits plus simple logic suffice for remarkable accuracy! Including global prediction, for all branches; and despite interference!

15 How Can This Possibly Work?
Students: rationalize why, how, and when this can possibly work, as opposed to a random guess. Why would this work even with interference? When will it fail?

16 Static Branch Prediction

17 Static Branch Prediction
Static branch prediction is set at the latest by the start of execution. It can be: encoded in instruction bits, i.e. ISA-visible; or invisible in object code, defined by the HW logic. It does not change during execution, hence does not “learn” from failure. It can be accurate, if a-priori assumptions about branch conditions happen to match execution reality, and is inaccurate otherwise

18 Static Branch Prediction
Common to static branch prediction is: small cost in extra hardware and cache; poor accuracy, generally with a ~70% ceiling; insufficient for highly pipelined, or for multi-way, superscalar architectures. Typical static prediction schemes are: Condition not taken: assumes the conditional branch is not taken; the pipeline continues to be filled with instructions physically after the conditional branch; example: early Intel® 486. This proved correct only a little over 40% of the time; hence the 486 μP would have been better off abstaining from prediction

19 Static Branch Prediction
Condition taken: assumes conditional branches are taken; the pipeline continues to be filled with instructions at the destination of the conditional branch. Correct about 60% of the time; can be acceptable only for shallow pipelines. BTFN assumes execution is dominated by while loops; true for a great subset of HLL code. While-loops un-optimized: have a conditional branch around the loop body, its direction being forward, to the first instruction after the loop body. While-loops optimized: use BTFN prediction at the end of the loop, to predict backward taken, forward not; a SW issue, but EEs must be aware!

20 Static Branch Prediction: While
Unoptimized code for a while loop has two branches each iteration: one conditional, plus one unconditional, so BTFN will fail. Optimized object code has one unconditional branch initially (start-up cost), plus one conditional branch during each iteration. Only for an empty while loop is the optimized version worse. BTFN works for the optimized form.
Unoptimized While Loop:
w:    <code for cond>
      branch_if_false e_w
      <code for body>
      br w
e_w:  <code after while>
Optimized While Loop:
      br e_w
w:    <code for body>
e_w:  <code for cond>
      branch_if_true w
      <code after while>

21 Static Branch Prediction
Single-bit bias, no profile: provide conditional instructions with a bit in the opcode, indicating whether the condition is likely true; this is a clue for the HW, e.g. on the Alpha μP. The compiler can analyze source code and make a reasonable guess about the condition’s outcome, to be encoded in the extra profile bit; reaches ~70% accuracy. For example, exceptions and assertions are almost never taken; the compiler generates that clue in the object code. Single-bit bias, with profiling: run the program, initially compiled without profile in the bias bit. Then, for all conditional branches, count the number of times the condition was true during the run; use the count to set the bias bit; achieves ~75% accuracy; requires an extra run :-) Note the similarity to Trace Scheduling! There the penalty for wrong prediction is correction code

22 Bimodal Branch Prediction

23 Bimodal Branch Prediction
Branch behavior is far from random. Most branches are either usually taken or usually not taken; bimodal branch prediction exploits this! Bimodal simply differentiates between branches usually taken and branches usually not taken. This can be done in many ways; an approximation is to use an array of counters, indexed by the low-order address bits of the pc. Each counter is n = 2 bits long; this can be extended. For each branch taken, its counter increases, thus tracking branch history between 0 and 2^n - 1. Likewise, if not taken, the counter is decremented, never shrinking below 0 but also never exceeding 3 (for n = 2)

24 Bimodal Branch Prediction
Most significant bit determines the prediction: repeatedly taken branches are predicted to be taken again in the future, and repeatedly not-taken branches will be predicted to be not-taken. Using a 2-bit counter, the predictor can tolerate a branch going an unusual direction one time and keep predicting the usual branch direction. For large counter tables, each branch can be mapped to a unique counter. For smaller tables, multiple branches share the same counter, causing interference. A plausible implementation is to store a tag with each counter and use a set-associative lookup to match the counter tag with the branch address
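The counter scheme above can be sketched in a few lines of Python; the table size, the starting counter value, and the class name are illustrative assumptions, not from the slides:

```python
class BimodalPredictor:
    """Array of 2-bit saturating counters indexed by low-order pc bits (sketch)."""

    def __init__(self, size_bits=12):
        self.mask = (1 << size_bits) - 1
        self.table = [1] * (1 << size_bits)   # start at 1: weakly not-taken

    def predict(self, pc):
        # The most significant bit of the 2-bit counter gives the prediction
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)   # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)   # saturate at 0
```

Note how one not-taken outcome after a run of taken outcomes moves the counter from 3 to 2, so the prediction stays “taken”: that is the tolerance for a single unusual direction described above.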

25 Bimodal Branch Prediction
For a fixed number of counters, a set-associative table has better performance. However, once the size of the tags is accounted for, a simple array of counters often has better performance for a given predictor size. McFarling [8], [16] used the SPEC benchmarks to quantify the quality of bimodal prediction. See the lower of the 3 curves in the graph below, showing accuracy of about 90% for a selected, commonly used benchmark [SPEC’89], when each branch has its own, local counter. That accuracy is not sufficient for pipelined architectures! Need at least 97% accuracy!

26 Bimodal Branch Prediction
Bimodal Branch Prediction Results, McFarling, 1993

27 Gshare Branch Prediction

28 Gshare Prediction Branch prediction via gshare, adapted from an Oregon State website [17]. The gshare method is similar to the bimodal predictor. It records branch history in an n-bit shift register; after the maximum history length of n steps, old history just falls into the bit bucket. The method XORs the tagged address of the branch instruction with the branch history (the n-bit shift register); the result is used as an index into a table holding prediction bits. Two methods: tagged or tagless tables; the tag being a defined subset of address bits

29 Gshare Prediction, Tagged
Initial table entries are NULL. The tagged method compares the indexed entry with a tag. If there is no match, predict branch not taken; the tag is filled in and the outcome recorded in the history (bi-modally). Initially there will always be no match, yet the history starts being remembered; recorded as strongly not taken. If there is a match, use the corresponding prediction bits to predict the branch outcome. Those prediction bits function as a two-bit saturating counter. With n bits in the shift register, the table size is 2^n entries

30 Gshare Prediction, Tagged
Prediction via Tagged Gshare, Prediction bits = Bimodal

31 Gshare Prediction, Tagless
In the tagless implementation, the table is directly indexed and the corresponding prediction bits are used to predict the branch outcome. The Oregon State experiments found that the tagless table performs better, simply indexing an entry and using its contents (prediction bits) to predict the branch outcome. HW requirements for the tagless table are the same as for bimodal prediction. For some tests, a publication by UC San Diego [18] records accuracy of over 96%
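The tagless lookup can be sketched in Python; the history length, table size, and initial counter value are illustrative assumptions:

```python
class GsharePredictor:
    """Tagless gshare sketch: global history XORed with the pc indexes a
    table of 2-bit saturating counters (parameters are illustrative)."""

    def __init__(self, hist_bits=10):
        self.mask = (1 << hist_bits) - 1
        self.history = 0                      # n-bit global shift register
        self.table = [1] * (1 << hist_bits)   # 2-bit saturating counters

    def _index(self, pc):
        return (pc ^ self.history) & self.mask   # XOR of branch address and history

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        # Shift the actual outcome into the history; the oldest bit falls off
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the history takes part in the index, a strictly alternating branch, hopeless for a lone bimodal counter, maps its two history phases to two different counters and becomes predictable after a short warm-up.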

32 Gshare Prediction, Tagless
Tagless Gshare Outperforms Tagged for a Fixed HW Budget

33 I-Cache Branch Prediction

34 Branch Prediction Via I-Cache
One prediction bit per I-cache line: this scheme encodes no information in the instruction stream, i.e. no information is assembled into the conditional branch instruction Instead, each cache line holding x instructions in the I-cache has 1 associated prediction bit If set, bit predicts that next executed conditional branch in this I-cache line will be taken Problem: There may be no conditional branch in the line at all, thus wasting the bit in the cache More serious for performance, there may be multiple conditional branches, causing interference about the predictions of their respective conditions

35 Branch Prediction Via I-Cache
Advantage: low cost, order 1% of cache area; reaches up to 80% accuracy, amazingly! 2 prediction bits per I-cache line: similar to above, but uses a 2-bit saturating counter to predict the next branch; can achieve additional accuracy > 80%; a single wrong guess does not disrupt the scheme; yet suffers similarly from waste & interference. Branch History Table (BHT): use history bits, or a 2-bit saturating counter, or a longer (> 2 bit) shift register for each represented branch, named the history cache. Did reach an astonishing accuracy of 85%; so implemented in the Pentium®. But insufficient for highly pipelined processors

36 Prediction with Perceptrons

37 Prediction With Perceptrons
Key idea: use the simplest possible neural network, the perceptron, offering better predictive capability than bimodal prediction. The bimodal method reaches only up to ~90% accuracy. The perceptron method is the newest, but by itself not more accurate than the older two-level method by Yeh & Patt. General neural networks would be too costly in HW, yet perceptrons are a simplest-possible version, thus the total cost in silicon is tolerable. Perceptrons benefit from longer branch histories. Hardware resources needed for perceptron prediction scale linearly with history length. Perceptron prediction works well for certain branch classes, best when combined with other, traditional schemes, as a component of a hybrid predictor

38 Prediction With Perceptrons
Prediction via perceptrons is a machine-learning technique. With a modest 4 kB HW budget, it performs ~14% better than gshare. The methodology is sufficiently simple to complete its computations within a single cycle. Restriction: prediction uses the simplest possible, single-layer perceptron, due to the single-cycle timing constraint!

39 Prediction Misses with Perceptrons
Jimenez, Lin [17], Perceptron Hybrid vs. Gshare, Bimodal, pure Perceptron

40 Prediction With Perceptrons
How Perceptrons Work: A single-layer perceptron consists of 1 artificial neuron, connected to multiple inputs xi, each with an individual weight wi, and to a single output y, or yout. The various xi are the i bits of a global branch history shift register; astonishingly, a global shift register yields good accuracy! Inputs xi for i > 0 are bipolar, holding only the values +1 or -1. A weight wi close to 0 means: that input contributes little to the outcome y; or inversely, far from 0: the input contributes much. The perceptron learns, generating the output y, i.e. the prediction y, as a function of the weighted inputs xi. The target function y predicts whether the branch will be taken next time around

41 Prediction With Perceptrons
How Perceptrons Work: The perceptron tracks weighted positive and negative correlation between branch outcomes in the global history table and the particular branch being predicted. Weight wi indicates how much that input contributes. Perceptron inputs and output: Input value xi = -1 means: was not taken. Else input value xi = 1 means: was taken. Output value y = -1 means: predict not taken! Else output value y = 0 or 1 means: predict taken! Output y is the dot product of the input vector and the weight vector. Note: input x0 is always forced to be 1, serving as a bias input, a reference point

42 Prediction With Perceptrons
Perceptron diagram: inputs x0 = 1, x1, x2, ..., xn-1, xn, with weights w0, w1, w2, ..., wn-1, wn, feed the single output y:
y = w0 + ∑ xi wi, for i = 1 .. n
From Jimenez, Lin [16]. Input w0 is a Reference Point Input

43 Perceptrons Early in Architecture
Mark I Perceptron Machine, Built about 1960, to Study AI

44 Prediction With Perceptrons
Training Perceptrons: When y has been computed, the following method trains the perceptron: Let t = -1 if the last branch was not taken; else let t = 1 if it was taken. Define θ to be some tunable threshold quantifying whether or not sufficient training has been completed. The newly computed ynew will be -1, 0, or 1, and t and xi will only be either 1 or -1. The new value ynew is computed as follows:

45 Prediction With Perceptrons
Training Perceptrons:
ynew =  1 if y > θ
ynew =  0 if -θ ≤ y ≤ +θ
ynew = -1 if y < -θ
if ynew ≠ t then:
  for i in 1 .. n loop
    wi = wi + t * xi
  end loop
end if
This method increments the ith weight wi when the branch outcome agrees with xi; else wi is decremented. Repeated agreement grows the weight significantly; for repeated disagreement the weight becomes strongly negative. Such a weight strongly influences the next prediction. When prediction and reality agree and disagree about equally often, the weight stays near 0 and contributes little
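The predict-and-train loop can be sketched in Python. This sketch uses the training condition commonly given by Jimenez & Lin (train on a misprediction, or while |y| ≤ θ), a close variant of the thresholded ynew rule above; the history length, θ, and class name are illustrative assumptions:

```python
class PerceptronPredictor:
    """Single-perceptron branch predictor sketch; x[0] is fixed at 1 (bias input)."""

    def __init__(self, hist_len=8, theta=None):
        self.n = hist_len
        # Jimenez & Lin suggest a threshold of roughly 1.93 * history_length + 14
        self.theta = theta if theta is not None else int(1.93 * hist_len + 14)
        self.w = [0] * (hist_len + 1)   # w[0] is the bias weight
        self.x = [1] * (hist_len + 1)   # bipolar history bits: +1 taken, -1 not

    def output(self):
        return self.w[0] + sum(self.w[i] * self.x[i] for i in range(1, self.n + 1))

    def predict(self):
        return self.output() >= 0       # y >= 0 predicts taken

    def train(self, taken):
        t = 1 if taken else -1
        y = self.output()
        # Train on a misprediction, or while the output is not yet confident
        if (y >= 0) != taken or abs(y) <= self.theta:
            self.w[0] += t
            for i in range(1, self.n + 1):
                self.w[i] += t * self.x[i]   # w[i] grows if outcome agrees with x[i]
        # Shift the actual outcome into the history; x[1] is the newest bit
        self.x = [1, t] + self.x[1:self.n]
```

A strictly alternating branch drives w[1] strongly negative (each outcome disagrees with the previous one), which is exactly the negative correlation the slides describe.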

46 Prediction With Perceptrons
Core Algorithm: Perceptrons “learn” the correlation between the outcomes recorded in the global history and the actual branch! If the actual branch outcome and xi agree, then wi is incremented, else wi is decremented. The input to the bias weight w0 is always 1, meaning: w0 does not learn a correlation with any history bit, but learns the branch’s overall bias, separately from the branch history. The total number of perceptrons N used is dictated by the available HW budget. When executing a branch, the processor does:

47 Prediction With Perceptrons
1. For a large HW budget, the branch address maps directly to an index i into a table of N perceptrons; else, for a small HW budget, the branch address is hashed to an index i in range 0 .. N-1 of perceptron entries; interference is to be expected! 2. Perceptron i is fetched into a vector register P of weights w0 .. wn. 3. y is computed as the dot product of P and the global history register. 4. If y < 0, predict not taken, else taken! 5. When the actual branch outcome is known, use that info to update the weights of P. 6. P is stored back into table_of_perceptrons[ i ]

48 Prediction With Perceptrons
Challenge for HW Designer to Implement All 6 steps in 1 Cycle

49 Two-Level Adaptive Branch Prediction

50 Dynamic Branch Prediction
Two-Level Adaptive Prediction: Yeh & Patt Method taken from the Yeh and Patt paper, using 2 levels of branch history to predict the next conditional branch: Level 1 is the history of the last n conditional branches (HR). Level 2 is the branch behavior of the last s occurrences of that unique pattern of the last n conditional branches, PT[ i ]. Such history is collected on the fly, i.e. there is no need for a trial run, as used in some earlier, static schemes. Yeh and Patt’s method uses a branch history register (HR) and a branch history pattern table (PT). Prediction is based on the branch behavior for the last s occurrences of the pattern in question. Yeh & Patt published several similar papers in the 1990s, slightly varying the names of the tables used; don’t get confused!

51 Dynamic Branch Prediction
Two-Level Dynamic Branch Prediction The direction history (taken = 1, not taken = 0) of the last k conditional branches is stored in a special-purpose cache, implemented as a shift register: the History Register (HR). The HR shifts in the actual branch result of the most recent history, not the predicted result! And it shifts out the oldest bit. It can be global or local; global means there is 1 HR for all branches; local means 1 HR per branch. Target addresses of the last branches reside in a special-purpose prediction cache, called the branch target address cache (BTAC). Focus here: conditional branches

52 Dynamic Branch Prediction
Two-Level Dynamic Branch Prediction Use the HR as an index into an array of patterns, called the pattern table (PT). Each pattern PT[*] is typically implemented as a 2-bit counter predicting the future condition for this situation; but it could be more bits, or even fewer (1). Same technique as shown earlier on p. 14. Once the current branch info has been completely computed, update the HR by shifting in the current condition, shifting out the oldest, leftmost bit, and updating the selected PT[ HR ] entry, indexed by the last history register state
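A local two-level predictor in this spirit can be sketched in Python; the slot count, history length k, and the direct-mapped pc-to-slot mapping are simplifying assumptions:

```python
class TwoLevelPredictor:
    """Local two-level sketch: one k-bit HR per branch slot, each with its own
    pattern table (PT) of 2-bit saturating counters (parameters illustrative)."""

    def __init__(self, k=6, slots=256):
        self.hr_mask = (1 << k) - 1
        self.slot_mask = slots - 1
        self.hr = [0] * slots                              # one HR per branch slot
        self.pt = [[1] * (1 << k) for _ in range(slots)]   # per-slot pattern tables

    def predict(self, pc):
        s = pc & self.slot_mask
        return self.pt[s][self.hr[s]] >= 2                 # PT[HR] gives the guess

    def update(self, pc, taken):
        s = pc & self.slot_mask
        h = self.hr[s]
        ctr = self.pt[s][h]
        self.pt[s][h] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        # Shift the actual outcome into the HR; the oldest, leftmost bit falls off
        self.hr[s] = ((h << 1) | int(taken)) & self.hr_mask
```

A loop branch that is taken three times and then falls through maps each phase of its repeating pattern to a different PT entry, so after warm-up every iteration, including the final not-taken one, is predicted correctly; a lone 2-bit counter would mispredict the loop exit every time.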

53 Dynamic Branch Prediction
Local Branch Prediction Local means that each conditional branch (modulo some maximum number) has its own, private branch prediction history cache. For example, each conditional branch may have its own two-level, adaptive branch predictor, with a unique history buffer and local pattern history table. It may even use a global history table, shared between conditional branches. E.g. Intel Pentium MMX, Pentium II, and Pentium III use local branch predictors, with a local 4-bit branch history and a local pattern history table of 16 entries (2^4 entries per conditional)

54 Dynamic Branch Prediction
Local Two-Level prediction scheme means: 1 HR (History Register) per branch, modulo MAX entries. Here HR has 6 bits: the associated local PT has 2^6 = 64 entries

55 Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt Uses the by-now familiar History Register (HR). Whether this is one local register per conditional branch or a single global register, we differentiate later. The HR has an associated Pattern Table (PT). The HR is a k-bit shift register storing the history of the last k outcomes of its associated conditional branch. The PT is accessed (indexed) by this history register’s pattern, i.e. PT[ HR ]. So the indexed entry can predict the next condition’s outcome

56 Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt That prediction is made by a finite state automaton (FSA), using the stored bits of the PT to make the next prediction. The new state of the PT is derived from 2 inputs: the previous state and the real outcome of the branch, once the condition has actually been computed –and corrected, if the actual outcome differed from the predicted. Also the HR is updated by left-shifting in the new branch bit (e.g. 1 if taken, else 0); the oldest bit falls into the bit bucket :-) to the left of the HR. Usually each PT entry is a 2-bit saturating counter. Reaches accuracy of ~97%. Yeh and Patt argue that for super-pipelined, high-issue architectures even ~97% accuracy must still be improved!

57 Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt Figure below shows the scheme for one specific conditional branch instruction C0 HR can exist once, in which case it applies globally to all branch instructions, and then interferes with the prediction of other branches Or architecture may dedicate one local HR per branch, replicating n HRs, one for each of the last n distinct branch instructions Also, PT may exist once globally for all HR, or a private PT may exist for each HR, provided HRs are replicated per branch

58 Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt

59 Dynamic Branch Prediction
Yeh and Patt Nomenclature The prediction scheme by Yeh and Patt (ref. [4] - [7]) can be effective, but consumes ample cache space. For each branch instruction the method can consume a Branch History Register HR of k bits, plus an address tag, and a PT of 2^(k+1) bits, i.e. 2^k entries of 2-bit prediction patterns. Could this same space be used better? Yeh and Patt measured the different accuracies for the same program, and the same number of cache bits, varying the prediction scheme as follows: instead of always using one HR per branch and one PT per branch, their experiments associate a varying number of PT entries with a branch HR
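The space accounting can be made concrete with a small sketch. The per-branch formula follows the text (a k-bit HR, an address tag, and a PT of 2^(k+1) bits, i.e. 2^k two-bit patterns); the tag size and the example parameters are illustrative assumptions:

```python
def pap_bits(n_branches, k, tag_bits=20):
    """PAp sketch: per-address HRs, each with its own pattern table."""
    per_branch = k + tag_bits + (1 << k) * 2   # HR + tag + PT of 2^(k+1) bits
    return n_branches * per_branch

def gag_bits(k):
    """GAg sketch: one global HR and one global PT, no tags needed."""
    return k + (1 << k) * 2

# Example: 64 tracked branches with 6 history bits each, vs. one global 12-bit HR
print(pap_bits(64, 6))   # 64 * (6 + 20 + 128) = 9856 bits
print(gag_bits(12))      # 12 + 8192 = 8204 bits
```

For roughly the same bit budget, the global scheme affords twice the history length per lookup, which is exactly the kind of trade-off Yeh and Patt explored when keeping total cache bits constant.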

60 Dynamic Branch Prediction
Yeh and Patt Nomenclature Surprisingly, Yeh and Patt observed good prediction accuracy for one global PT and measured this variation as well. Since the total number of bits consumed for the cache was kept constant, a larger number of history bits, or else a longer history of the last executed branches, could be used. Varying the number of BH registers and the number of PTs leads to the nomenclature PAp, PAg, GAp, GAg :-) (first letter: Per-address or Global history registers; last letter: per-address or global pattern tables)

61 Dynamic Branch Prediction
Yeh and Patt Nomenclature In reality there are just 3 meaningful choices: GAg, PAg, and PAp. Complete measurements were conducted for a growing budget of bits, from 8K to 128K bits of total cache space. Interestingly, for sufficiently large cache storage, Yeh and Patt found that the best scheme, constrained by 128K bits, is the PAg scheme. Also unintuitively, PAg is the most cost-effective, delivering the highest accuracy for a fixed HW budget, despite interference. For other HW budgets, Patt and Yeh found different optimal schemes

62 Dynamic Branch Prediction
Yeh and Patt Marketing :-) In their 1992 paper “Two-Level Adaptive Branch Prediction” at U. Michigan, Yeh and Patt use skillful marketing language: instead of documenting the hit rate, they argue, comparisons should focus on the miss rate; clearly the 2 are equivalent in a complementary way! For their scheme published in the early 1990s, the miss rate shows a fantastic result of ~3%. The state-of-the-art best cases until then measured a miss rate of around 7%. So they cleverly conclude that their improvement yields a greater than 100% improvement of the miss rate! Yeh and Patt have multiple skills :-)

63 Prediction Accuracy For SPECint92
Vertical axis below shows the percentage of correctly predicted branches in SPECint92; the horizontal axis shows the different prediction methods

64 Summary Without good branch prediction, highly pipelined architectures would not be useful. The number of transfer-of-control instructions dynamically executed would be too large to ever reach the steady state. Static branch predictions are cost-effective, but inadequate for deep pipes. Dynamic branch prediction is required to achieve around 97% prediction accuracy, needed for reaching the steady state long term

65 Appendix: Some Definitions

66 Definitions Aliasing Synonym for interference
When the HW budget for a branch prediction method is small, multiple branches are associated with a single prediction case. Small here means: the total number of branches is way larger than the number of table entries of some prediction data structure. This causes interference of one branch history with that of another branch; such interference is referred to as aliasing. Aliasing is less harmful than expected. Notice the similarity to hashing, where aliasing is perfectly tolerable up to some threshold fill-factor

67 Definitions BHT, acronym for Branch History Table (BHT)
The Branch History Table (BHT) is the collection of Branch History Registers (HR), used in Single-Level or Two-Level dynamic branch prediction There could be a.) one HR per conditional branch, b.) one HR each for the last n > 1 branches, or c.) just a single HR for all conditional branches The cost for the choice a.) can be excessive, yet is more accurate. Choice c.), while being the least accurate, also costs the least in terms of HW Often architects must select a compromise This trade-off of resource cost vs. accuracy is akin to the mapping policy employed in cache design

68 Definitions BHT, Cont’d
On actual branch prediction HW, just the last few branches executed have their associated HR, otherwise too much HW –silicon space– for the BHT would be consumed Each HR records for the last k executions of its associated conditional branch whether that branch was taken In a Two-Level dynamic branch prediction scheme, the HR has an associated Pattern Table (PT), indexed by the HR The entry in the PT guesses, whether the next branch will be taken. The cost in bits can be contained, because not all branches need to have an associated HR

69 Definitions Branch Prediction
Heuristic that guesses –based on past branching history– the destination of the current branch, the Boolean outcome of the next condition for a branch, or both, as soon as a branch instruction is being decoded 100% accurate prediction of a branch is, of course, not possible; neither the condition, nor the target Heuristics aim at guessing right most of the time For highly pipelined, superscalar architectures most of the time has to mean 97% or more

70 Definitions Branch Profiling
Compile a program with a special compiler directive. Then measure at run-time, for each conditional branch, how many times that branch was taken. The next time this same program is compiled, the measured results of the prior run are available to the compiler. That info enables the compiler to bias conditional branches according to past behavior. Underlying this scheme is the assumption that past behavior is a reflection of the future. Branch profiling is one of the static branch prediction schemes. It costs one additional (profiling) run, plus HW instruction bits for the compiler to set the branch bias one way or another. Generally, static prediction, even with the benefit of a profiling run, is not sufficiently effective

71 Definitions BTAC, Branch Target Address Cache
For very fast performance, it is not sufficient to know (i.e. guess) ahead of time, whether a conditional branch will be taken For any branch –including unconditional– the branch destination must be known (or guessed) a priori For this reason, each branch in a BTAC implementation has an associated target address, used by the instruction fetch unit to continue filling the pipeline from places other than the next one After complete decoding of an instruction, the target is accurately computed. But knowing the target earlier speeds up filling a potentially stalled pipeline

72 Definitions BTB, Branch Target Buffer
For very fast performance, it is best to know ahead of time whether or not a conditional branch will be taken, and where such a branch leads. The former can be implemented using a BHT with a Pattern Table; the latter can be implemented using a BTAC. The combination of these two is called the BTB. This scheme is implemented on the Intel Pentium Pro® and newer Intel architectures

73 Definitions BTFN, Backwards Taken Forward Not
A static prediction heuristic assuming that program execution time is dominated by loops, especially While Loops While loops are characterized by an unconditional branch at the end of the loop body back to the condition, and a conditional branch if false at the start, leading to the successor of the loop body The backward branch is always taken, and to the same destination; the forward branch, if the condition is false, is taken just once

74 Definitions BTFN, Backwards Taken Forward Not
Since while statements are often executed repeatedly, the BTFN heuristic guesses correctly the majority of the time. This method has the inherent limitations of static schemes. Also, many optimizers re-arrange the object code for while statements so that the condition is moved to the end, obscuring this whole scheme. Exercise for students: how do you convert while-code with a conditional branch at the top plus an unconditional branch back at the end into code with a single conditional branch at the end? Hint: there will be an initial, fixed-cost overhead!

75 Definitions BTFN, create while loop with single repeated branch:
w:  while <c> {            -- original stylized source
      <s>                  -- <c> is the condition, <s> the statement code
    } // end while

straight code for the while statement:
w:  <c>
    bf ew                  -- first branch, if false; executed O(n) times
    <s>
    b  w                   -- second branch; executed O(n) times
ew: ...

improved code (fixed added cost: 1 new branch):
    b  ew                  -- new branch; executed O(1) time
w:  <s>
ew: <c>
    bt w                   -- single repeated branch; executed O(n) times
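The branch-count savings of the rotated form can be checked with a small simulation. The functions below are an illustrative model of the two code shapes above, counting executed branch instructions for n loop iterations.

```python
def while_shape(n):
    """Original shape: conditional branch at the top + unconditional branch back."""
    branches = 0
    i = 0
    while True:
        branches += 1              # bf ew: top conditional, executed n+1 times
        if not (i < n):
            break
        i += 1                     # loop body <s>
        branches += 1              # b w: unconditional back-branch, executed n times
    return branches                # total: 2n + 1

def rotated_shape(n):
    """Improved shape: one initial branch, then a single conditional at the end."""
    branches = 1                   # b ew: executed once (the fixed added cost)
    i = 0
    while True:
        branches += 1              # bt w: end conditional, executed n+1 times
        if i < n:
            i += 1                 # loop body <s>
        else:
            break
    return branches                # total: n + 2
```

For large n the rotated form executes roughly half as many branch instructions, at the price of one extra branch before the loop.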

76 Definitions Delay of Transfer, Delay Transfer Slot
Certain pipelined CPUs execute one more instruction after an unconditional branch, before the branch takes effect; the slot holding that instruction is the delay slot. The reason is to greedily recover some of the time lost to the pipeline stall caused by the branch. Thus, compilers or programmers can place the target instruction of the branch physically after the branch: placed after the branch, it is executed before the branch completes, and it is never reached by normal sequential execution. Since it is supposed to be executed anyway as soon as the branch reaches its target, and since the HW already executes it before completing the branch, time is saved

77 Definitions Delay of Transfer, Delay Transfer Slot
Note that at the target of such an unconditional branch the relocated instruction must be skipped; that is what enables the time saving! Example: the Intel i860 architecture. When a suitable candidate cannot be found, a NOP instruction is placed physically after the branch, i.e. into the delay slot. This is done also on the Sun SPARC architecture. There are restrictions: for example, branch instructions and other control-transfer instructions cannot be placed into the delay slot. If that happened, a phenomenon called code visiting would occur, with at times unpredictable side-effects; hence the restriction

78 Definitions Dynamic Branch Prediction
A branch prediction policy that changes dynamically with the execution of the program. Dynamic branch prediction is architecture-transparent, i.e. no bits are visible in the opcode. This differs from some static branch prediction methods, which have suitable bits in their opcode. Antonym: Static Branch Prediction. We focus on dynamic branch prediction here

79 Definitions Hazard Instruction i+1 is pre-fetched under the assumption that it will be executed after instruction i. Yet while decoding instruction i it becomes clear that operation i is a transfer of control. Hence the subsequently pre-fetched instructions i+1… are wasted. This is called a hazard. A hazard causes part of the pipeline to be flushed, while a stall (caused by data dependence) causes a delay

80 Definitions History Register (HR)
A k-bit shift register associated with a conditional branch. The bits indicate, for each of the last k executions of that associated conditional branch, whether it was taken, with 1 meaning yes. The newest bit shifts out the oldest, since an HR has only a limited, fixed length of k bits available

81 Definitions Interference, Branch Interference Synonym to Aliasing
When multiple branches are associated with one HW data structure (such as an HR or PT), the behavior of each branch will influence the data structure's state. However, that data will be used for the next branch, even if it is not the one that modified the most recent state. The reason for doing this is limited HW availability, i.e. the cost saving of HW (of silicon space). The effect is diminished precision

82 Definitions IPC Instructions per cycle: a measure of Instruction Level Parallelism. IPC quantifies how many different instructions are being executed –not necessarily all to completion– during one single cycle. It is desirable to have an IPC rate > 1. Given sufficient parallelism, IPC can be >> 1. On conventional UP CISC architectures it is typical to have IPC << 1

83 Definitions Mispredicted Branch, AKA Miss
The branch condition or branch destination was predicted incorrectly. As a consequence, the control of execution took a different flow than predicted. This requires dynamic correction at run time and costs time. The cost often is a stalled pipeline that has to be flushed and re-loaded

84 Definitions Mispredicted Branch Penalty
The number of cycles lost due to having incorrectly guessed the change in flow of control caused by a branch instruction. Since prediction accuracy is never 100%, there will always be some mispredicted branch penalty. The goal is to keep the number of mispredictions at or below 3% of all branches executed
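To see why the misprediction rate matters so much, here is a back-of-the-envelope CPI calculation; the branch frequency, miss rate, and flush penalty are assumed illustrative values, not measurements of any real machine.

```python
base_cpi    = 1.0    # ideal pipelined CPI: one instruction completes per cycle
branch_frac = 0.20   # assume 1 in 5 executed instructions is a branch
miss_rate   = 0.03   # the 3% misprediction goal mentioned above
penalty     = 15     # assumed cycles to flush and refill a deep pipeline

# each instruction pays, on average, the penalty weighted by how often it occurs
effective_cpi = base_cpi + branch_frac * miss_rate * penalty
# 1.0 + 0.20 * 0.03 * 15 = 1.09, i.e. about 9% slower than ideal
```

Doubling the miss rate to 6% under these same assumptions would double the overhead to 18%, which is why predictors aim for the high-90s in accuracy.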

85 Definitions Pattern Table (PT)
A HW table of entries, each specifying whether its associated conditional branch will be taken. An entry in the PT is selected by using the history bits of a branch History Register (HR). This can be done by indexing, in which case the number of entries in the PT is 2^k, with k being the number of bits stored in the History Register. Otherwise, if the number of entries is < 2^k, a hashing scheme is applied, causing interference! Each PT entry holds Boolean information about the next execution of the conditional branch: will it be taken or not?

86 Definitions Pipelining
A mode of execution in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete. Highly pipelined Xeon processors, for example, have a > 20-stage pipeline

87 Definitions Saturating Counter
A HW n-bit unsigned integer counter, n typically being 2 for branch prediction HW. When all bits are on and counting up continues, a saturating counter simply stays at the maximum value. Similarly, when all bits are off and counting down continues, the saturating counter stays at 0. This creates a limited hysteresis effect on the behavior of the specific event that depends on this counter. The architecture challenge is to select a history length (n bits) such that the cost is low and the accuracy sufficient to support the overall goal of > 97% correct predictions
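A 2-bit saturating counter can be sketched in a few lines; the predict-taken threshold (upper half of the range) follows the usual convention, and the class and method names are illustrative.

```python
class SaturatingCounter:
    def __init__(self, bits=2):
        self.max = (1 << bits) - 1   # e.g. 3 for a 2-bit counter
        self.value = 0

    def taken(self):                 # branch was taken: count up, saturating at max
        self.value = min(self.value + 1, self.max)

    def not_taken(self):             # branch was not taken: count down, saturating at 0
        self.value = max(self.value - 1, 0)

    def predict_taken(self):
        return self.value > self.max // 2   # upper half of the range predicts taken
```

The hysteresis mentioned above shows up directly: a saturated counter needs two consecutive surprises before its prediction flips.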

88 Definitions Shift Register
A HW register with a small number of bits, tracking a binary event. If the event did occur, a 1 bit is shifted into the register at one end; this becomes the newest bit. The oldest bit is shifted out at the opposite end. Conversely, if the event did NOT occur, a 0 bit is shifted in, and the oldest bit is shifted out. All other bits shift their position by one place. At any moment the shift register holds a history of the associated event's last n occurrences
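In software the same behavior is a masked shift; the width k and the bit convention (newest outcome in the low bit) are assumptions of this sketch.

```python
def shift_in(history, event_occurred, k=4):
    """Shift the newest outcome into a k-bit history, dropping the oldest bit."""
    bit = 1 if event_occurred else 0
    return ((history << 1) | bit) & ((1 << k) - 1)  # mask keeps only k bits
```

For example, with k=4 and history 0b1011, shifting in a new 1 yields 0b0111: the oldest bit (the leading 1) falls off, and the new outcome appears as the low bit.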

89 Definitions Stall If instruction i requires an operand o that is being computed by another instruction j, and j is not complete when i needs o, there exists a dependence between i and j; the wait thus created is called a stall. A stall prevents the two instructions from being executed simultaneously, since instruction i must wait for the other to complete. See also: hazard, interlock. A stall can also be caused by a HW resource conflict: some earlier instruction i may use HW resource m while another instruction j needs m. Generally j has to wait until i frees m, causing a stall for j

90 Definitions Static Branch Prediction
A branch prediction policy that is either embedded in the binary code –ISA visible– or implemented in the hardware executing the branches –not ISA visible. The policy does not change during execution of the program, even if it is known to be wrong all the time. In the latter case, execution would be better off without branch prediction

91 Definitions Static Branch Prediction
The BTFN heuristic is a static branch prediction policy. It requires no opcode bits, hence is NOT ISA visible. The HW compares the destination of a branch with the conditional branch's own address. Smaller destinations lead backwards and are assumed taken. Destination addresses larger than the branch address are assumed not taken, and the predicted next instruction is the successor of the conditional branch. Typical industry benchmarks (SPECint89) achieve almost 65% correct prediction with this simple scheme
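The HW comparison described above amounts to a one-line rule, sketched here with illustrative names:

```python
def btfn_predict_taken(branch_pc, target_pc):
    """BTFN static heuristic: a backward branch (target below the branch) is
    predicted taken; a forward branch is predicted not taken."""
    return target_pc < branch_pc
```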

92 Definitions Two-Level Branch Prediction
Instead of solely associating a local branch history register with a conditional branch, a two-level branch prediction scheme also associates prediction bits (a pattern table) with the branch execution history. Thus, each pattern of past branch behavior has its own future prediction, costing more HW but yielding better accuracy. For example, each conditional branch may have a k-bit Branch History Register, which records for each of the last k executions whether or not the condition was satisfied. And each history pattern has an associated prediction of the future in another data structure, typically implemented as a 2-bit saturating counter
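A minimal local two-level predictor combines the structures defined above: a k-bit History Register indexing a Pattern Table of 2-bit saturating counters. The history length k, the initial counter value, and the class name below are illustrative choices, not a description of any specific implementation.

```python
class TwoLevelPredictor:
    def __init__(self, k=4):
        self.k = k
        self.hr = 0                       # k-bit History Register for this branch
        self.pt = [1] * (1 << k)          # Pattern Table: 2-bit counters, weakly not-taken

    def predict(self):
        return self.pt[self.hr] >= 2      # counter in upper half -> predict taken

    def update(self, taken):
        # first train the counter selected by the current history pattern...
        c = self.pt[self.hr]
        self.pt[self.hr] = min(c + 1, 3) if taken else max(c - 1, 0)
        # ...then shift the actual outcome into the History Register
        self.hr = ((self.hr << 1) | int(taken)) & ((1 << self.k) - 1)
```

After a short warm-up, such a predictor learns any repeating pattern of period up to k perfectly, e.g. an alternating taken/not-taken branch that would defeat a lone 2-bit counter.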

93 Definitions Wide Issue
Older architectures issue (i.e. fetch, decode, etc.) one instruction at a time; for example, 1 instruction per clock cycle on a RISC architecture. Computers after 1980 issue more than 1 instruction at a time; this is called wide issue. Synonym: super-scalar architecture. More precisely, superscalar architectures require wide-issue I-fetches. Antonym: single issue

94 Bibliography
Gwennap, L. [1995]. “New Algorithm Improves Branch Prediction,” Microprocessor Report, MicroDesign Resources, Vol. 9, No. 4, March 27, 1995
Smith, J. [1981]. “A Study of Branch Prediction Strategies,” 8th International Symposium on Computer Architecture
Yeh, T. and Y. Patt [1991]. “Two-Level Adaptive Branch Prediction,” 24th International Symposium on Microarchitecture, November 1991
Yeh, T. and Y. Patt [1992]. “Alternative Implementations of Two-Level Adaptive Branch Prediction,” 19th International Symposium on Computer Architecture, May 1992
Yeh, T. and Y. Patt [1993]. “A Comparison of Dynamic Branch Predictors That Use Two Levels of Branch History,” 20th International Symposium on Computer Architecture, May 1993
McFarling, Scott [1993]. “Combining Branch Predictors,” WRL Technical Note TN-36, Digital Western Research Lab, June 1993

95 Bibliography
Hilgendorf, R. B., et al. [1999]. “Evaluation of Branch-Prediction Methods on Traces from Commercial Applications,” IBM Journal of Research & Development
Hsien-Hsin Sean Lee, “Branch Prediction” (lecture notes)
Jiménez, Daniel A. and Calvin Lin [2000]. “Dynamic Branch Prediction with Perceptrons,” Proceedings of the 7th International Symposium on High Performance Computer Architecture
Additional web resources: Wikipedia [2011]; Real World Technologies on perceptrons; McFarling [2001]; Jiménez and Lin

96 Bibliography
Branch prediction summaries from Oregon State and UCSD

