Neural Methods for Dynamic Branch Prediction
Daniel A. Jiménez (Dept. of Computer Science, Rutgers University) and Calvin Lin (Univ. of Texas at Austin)
Presented by: Rohit Mittal

2 Overview
• Branch prediction background
• Applying machine learning to branch prediction
• Results and analysis
• Future work and conclusions

3 Branch Prediction Background

4 Outline
• What are branches?
• Reducing branch penalties
• Branch prediction
• Why is branch prediction necessary?
• Branch prediction basics
• Issues that affect accurate branch prediction
• Examples of real predictors

5 Branches
• Instructions that can alter the flow of instruction execution in a program

6 The Context
• How can we exploit program behavior to make it go faster?
• Remove control dependences
• Increase instruction-level parallelism

7 An Example
• The inner loop of this code executes two statements each time through the loop.

    int foo (int w[], bool v[], int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (v[i])
                sum += w[i];
            else
                sum += ~w[i];
        }
        return sum;
    }

8 An Example, continued
• This C++ code computes the same thing with three statements in the loop.
• This version is 55% faster on a Pentium 4.
• The previous version had many mispredicted branch instructions; here b is either 0 or all ones, so ~(a ^ b) selects w[i] or ~w[i] without a branch.

    int foo2 (int w[], bool v[], int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int a = w[i];
            int b = -(int) v[i];   // 0 if v[i] is false, -1 (all ones) if true
            sum += ~(a ^ b);       // equals w[i] if v[i], else ~w[i]
        }
        return sum;
    }

9 Branch Prediction
• To speed things up, pipelining overlaps the execution of multiple instructions, exploiting parallelism between instructions.
• Conditional branches create a problem for pipelining: the next instruction can't be fetched until the branch has executed, several stages later.
• A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path. Branch predictors must be highly accurate to avoid mispredictions!

10 Why Good Branch Prediction Is Necessary
• Branches are frequent (roughly 15-25% of instructions)
• Today's pipelines are deeper and wider
• Higher performance penalty for stalling
• High misprediction penalty
• A lot of cycles can be wasted!

11 Branch Predictors Must Improve
• The cost of a misprediction is proportional to pipeline depth
• As pipelines deepen, we need more accurate branch predictors
• The Pentium 4 pipeline has 20 stages
• Future pipelines will have more than 32 stages
• Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage
• In simulations with SimpleScalar/Alpha, decreasing the misprediction rate from 9% to 4% yields a 31% speedup for a 32-stage pipeline

12 Branch Prediction
• Predicting the outcome of a branch
• Direction:
  • Taken / Not Taken
  • Direction predictors
• Target address:
  • PC + offset (Taken) / PC + 4 (Not Taken)
  • Target address predictors
  • Branch Target Address Cache (BTAC) or Branch Target Buffer (BTB)

13 Why do we need branch prediction?
• Branch prediction:
  • Increases the number of instructions available for the scheduler to issue, increasing instruction-level parallelism (ILP)
  • Allows useful work to be completed while waiting for the branch to resolve

14 Branch Prediction Strategies
• Static
  • Decided before runtime
  • Examples: Always-Not-Taken; Always-Taken; Backwards Taken, Forward Not Taken (BTFNT); profile-driven prediction
• Dynamic
  • Prediction decisions may change during the execution of the program
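Of the static schemes, BTFNT is the easiest to make concrete: loop-closing branches jump backward and are usually taken. A minimal sketch of the heuristic (our illustration, not part of the original slides):

    #include <cstdint>

    // Minimal sketch of BTFNT (Backwards Taken, Forward Not Taken):
    // predict "taken" whenever the target address precedes the branch,
    // since backward branches usually close loops.
    bool predictBTFNT(uint64_t branchPC, uint64_t targetPC) {
        return targetPC < branchPC;   // backward branch => predict taken
    }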

15 Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• The Branch History Table (BHT) is the simplest scheme
  • Also called a branch-prediction buffer
  • The lower bits of the branch address index a table of 1-bit values
  • Each bit says whether or not the branch was taken last time
  • If the branch was taken last time, predict taken again
  • Initially, all bits are set to predict that all branches are taken
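A minimal sketch of the 1-bit BHT just described (class and parameter names are ours; a simple modulo index stands in for taking the low-order address bits):

    #include <cstdint>
    #include <vector>

    // Minimal 1-bit Branch History Table: one bit per entry, indexed by
    // the low-order bits of the branch address. Predict whatever the
    // branch did last time.
    class OneBitBHT {
        std::vector<uint8_t> table;   // one bit per entry
    public:
        explicit OneBitBHT(size_t entries) : table(entries, 1) {}  // init: predict taken

        bool predict(uint64_t pc) const {
            return table[pc % table.size()] != 0;
        }
        void update(uint64_t pc, bool taken) {
            table[pc % table.size()] = taken ? 1 : 0;
        }
    };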

16 1-bit Branch History Table
• Problems:
  • Two branches can have the same low-order bits, so they share (and pollute) the same table entry.
  • In a loop, a 1-bit BHT causes two mispredictions:
    • At the end of the loop, when it exits instead of looping as before
    • The first time through the loop on the next execution, when it predicts exit instead of looping

    LOOP: LOAD  R1, 100(R2)
          MUL   R6, R6, R1
          SUBI  R2, R2, #4
          BNEZ  R2, LOOP

17 2-bit Predictor
• Solution: a 2-bit predictor scheme that changes its prediction only after mispredicting twice in a row
[State diagram: the four states of the 2-bit counter, two Predict Taken and two Predict Not Taken, with taken (T) and not-taken (NT) transitions]
• This idea can be extended to n-bit saturating counters:
  • Increment the counter when the branch is taken
  • Decrement the counter when the branch is not taken
  • If the counter >= 2^(n-1), predict taken; otherwise predict not taken
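A sketch of a table of such counters for the n = 2 case of the rule above (names and sizes are illustrative assumptions):

    #include <cstdint>
    #include <vector>

    // Table of 2-bit saturating counters. Values 0-3; predict taken when
    // the counter >= 2 (i.e. 2^(n-1) with n = 2), so a single misprediction
    // in a "strong" state does not flip the predicted direction.
    class TwoBitPredictor {
        std::vector<uint8_t> counters;
    public:
        explicit TwoBitPredictor(size_t entries) : counters(entries, 2) {}  // start weakly taken

        bool predict(uint64_t pc) const {
            return counters[pc % counters.size()] >= 2;
        }
        void update(uint64_t pc, bool taken) {
            uint8_t &c = counters[pc % counters.size()];
            if (taken)  { if (c < 3) ++c; }   // saturate at 3
            else        { if (c > 0) --c; }   // saturate at 0
        }
    };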

18 Correlating Branches
• Often the behavior of one branch is correlated with the behavior of other branches.
• Example C code:

    if (aa == 2)    // B1
        aa = 0;
    if (bb == 2)    // B2
        bb = 0;
    if (aa != bb)   // B3
        cc = 4;

• If the first two branches are not taken, the third one will be.
• B3 can be predicted with 100% accuracy based on the outcomes of B1 and B2.

19 Correlating Branches, continued
• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
• In general, an (m, n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters
• The old 2-bit BHT is then a (0, 2) predictor
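A hypothetical sketch of an (m, n) predictor with n = 2: the m-bit global history selects one of 2^m pattern history tables of 2-bit counters (structure and names here are illustrative, not from the slides):

    #include <cstdint>
    #include <vector>

    // (m, 2) correlating predictor: the last m branch outcomes, kept in a
    // global history register, select one of 2^m tables of 2-bit counters,
    // each indexed by low-order PC bits.
    class CorrelatingPredictor {
        unsigned m;                                // global history bits
        uint32_t ghr = 0;                          // m most recent outcomes
        std::vector<std::vector<uint8_t>> pht;     // 2^m tables of 2-bit counters
    public:
        CorrelatingPredictor(unsigned historyBits, size_t entries)
            : m(historyBits),
              pht(1u << historyBits, std::vector<uint8_t>(entries, 2)) {}

        bool predict(uint64_t pc) const {
            return pht[ghr][pc % pht[ghr].size()] >= 2;
        }
        void update(uint64_t pc, bool taken) {
            uint8_t &c = pht[ghr][pc % pht[ghr].size()];
            if (taken)  { if (c < 3) ++c; } else { if (c > 0) --c; }
            ghr = ((ghr << 1) | (taken ? 1u : 0u)) & ((1u << m) - 1);  // shift in outcome
        }
    };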

20 Need Address at the Same Time as the Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
• Note: must check for a tag match, since we can't use the wrong branch's address
• Return instruction addresses are predicted with a stack
[Figure: the fetch-stage PC is compared (=?) against the stored branch PC; on a match the BTB supplies the predicted PC and a taken/not-taken prediction]

21 Branch Target Buffer
• A branch-target buffer (or branch-target cache) stores the predicted target address of branches that are predicted to be taken.
• Branches not in the buffer are predicted not taken.
• The branch-target buffer is accessed during the IF stage, using the k low-order bits of the branch address.
• If the branch target is in the buffer and the branch is predicted correctly, the one-cycle stall is eliminated.
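A minimal sketch of a direct-mapped BTB along these lines (field names and the trivial replacement policy are our assumptions):

    #include <cstdint>
    #include <vector>

    // One BTB entry: the full branch PC as the tag, plus the predicted
    // target. A miss (or tag mismatch) means "predict not taken".
    struct BTBEntry {
        uint64_t tag;
        uint64_t target;
        bool     valid;
    };

    class BranchTargetBuffer {
        std::vector<BTBEntry> entries;
    public:
        explicit BranchTargetBuffer(size_t n) : entries(n) {}  // all entries start invalid

        // Returns true and fills in the target if this PC hits in the BTB.
        bool lookup(uint64_t pc, uint64_t &target) const {
            const BTBEntry &e = entries[pc % entries.size()];
            if (e.valid && e.tag == pc) { target = e.target; return true; }
            return false;   // not in buffer => predicted not taken
        }
        void insert(uint64_t pc, uint64_t target) {
            entries[pc % entries.size()] = {pc, target, true};
        }
    };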

22 Branch Predictor Accuracy
• Larger tables and smarter organizations yield better accuracy
• Longer histories provide more context for finding correlations
• But table size is exponential in history length
• The cost is increased access delay and chip area

23 Alpha 21264
• 8-stage pipeline, 7-cycle misprediction penalty
• 64 KB, 2-way instruction cache with line and way prediction bits (fetch stage)
  • Each 4-instruction fetch block contains a prediction for the next fetch block
• Hybrid predictor (fetch stage):
  • 12-bit GAg (4K-entry PHT, 2-bit counters)
  • 10-bit PAg (1K-entry BHT, 1K-entry PHT, 3-bit counters)

24 UltraSPARC III
• 14-stage pipeline; branch prediction is accessed in instruction fetch stages 2-3
• 16K-entry gshare predictor with 2-bit counters
  • A bimodal-style counter table indexed by PC bits (excluding the 3 low-order bits) XORed with the global history register, to reduce aliasing
• Miss queue
  • Halves the mispredict penalty by providing instructions for immediate use
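The gshare index computation reduces to an XOR. A sketch under the stated assumptions (the exact bit selection in the UltraSPARC III may differ):

    #include <cstdint>

    // Form a gshare index: XOR the global history with PC bits so that
    // different (branch, history) pairs spread across the counter table,
    // reducing aliasing. Dropping the 2 byte-offset bits is illustrative.
    uint32_t gshareIndex(uint64_t pc, uint32_t globalHistory, unsigned tableBits) {
        uint32_t mask = (1u << tableBits) - 1;
        return (static_cast<uint32_t>(pc >> 2) ^ globalHistory) & mask;
    }

The returned index then selects a 2-bit saturating counter, exactly as in the earlier table sketches.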

25 Pentium III
• Dynamic branch prediction
  • 512-entry BTB predicts direction and target; a 4-bit history is used with the PC to derive the direction
• Static branch predictor for BTB misses
• Branch penalties:
  • Not taken: no penalty
  • Correctly predicted taken: 1 cycle
  • Mispredicted: at least 9 cycles, as many as 26

26 AMD Athlon K7
• 10-stage integer pipeline, 15-stage floating-point pipeline; predictor accessed in fetch
• 2K-entry bimodal predictor, 2K-entry BTB
• Branch penalties:
  • Correctly predicted taken: 1 cycle
  • Mispredict penalty: at least 10 cycles

27 Applying Machine Learning to Branch Prediction

28 Branch Prediction is a Machine Learning Problem
• So why not apply a machine learning algorithm?
• Replace 2-bit counters with a more accurate predictor
• Tight constraints on the prediction mechanism:
  • Must be fast and small enough to work as a component of a microprocessor
• Artificial neural networks:
  • A simple model of the neural networks in brain cells
  • Learn to recognize and classify patterns
  • Most neural nets are slow and complex relative to tables
  • For branch prediction, we need a small and fast neural method

29 A Neural Method for Branch Prediction
• Several neural methods were investigated
  • Most were too slow, too big, or not accurate enough
• The perceptron [Rosenblatt '62, Block '62]:
  • Very high accuracy for branch prediction
  • Prediction and update are quick, relative to other neural methods
  • Sound theoretical foundation: the perceptron convergence theorem
  • Proven to work well for many classification problems

30 Branch-Predicting Perceptron
• Inputs (x's) come from the branch history register
• Weights (w's) are small integers learned by on-line training
• The output (y) gives the prediction: the dot product of the x's and w's
• Training finds correlations between history and outcome
• w_0 is a bias weight, independent of the history
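In sketch form, the prediction is just an integer dot product. Here history bits are assumed to be stored in bipolar form, x_i in {-1, +1} (a common convention, and our assumption):

    #include <cstdint>
    #include <vector>

    // Perceptron output: y = w_0 + sum_i w_i * x_i, with w_0 the bias
    // (implicit input of 1). Predict taken iff y >= 0.
    int perceptronOutput(const std::vector<int8_t> &w,    // w[0..n], small integers
                         const std::vector<int8_t> &x) {  // x[1..n] in {-1, +1}
        int y = w[0];                          // bias term, independent of history
        for (size_t i = 1; i < w.size(); ++i)
            y += w[i] * x[i - 1];              // dot product of weights and history
        return y;
    }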

31 Training Algorithm
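The algorithm on this slide was a figure that did not survive transcription. As a hedged reconstruction: the training rule published for the perceptron predictor updates the weights only on a misprediction or when |y| falls below a threshold theta. Variable names and the saturation details below are ours:

    #include <cstdint>
    #include <cstdlib>   // std::abs
    #include <vector>

    // Perceptron training rule (sketch). t is +1 for a taken branch,
    // -1 for not taken; y is the output computed at prediction time;
    // theta is the training threshold.
    void train(std::vector<int8_t> &w, const std::vector<int8_t> &x,
               int y, int t, int theta) {
        auto bump = [](int8_t &wi, int d) {    // saturating add, keeps weights small
            int v = wi + d;
            wi = static_cast<int8_t>(v > 127 ? 127 : (v < -128 ? -128 : v));
        };
        bool mispredicted = (y >= 0) != (t > 0);
        if (mispredicted || std::abs(y) <= theta) {
            bump(w[0], t);                     // bias input is always 1
            for (size_t i = 1; i < w.size(); ++i)
                bump(w[i], t * x[i - 1]);      // move w_i toward correlation with t
        }
    }

The effect is that each weight drifts positive when its history bit usually agrees with the branch outcome and negative when it usually disagrees.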

32 Training Perceptrons
• The updated weight vector w' might be a worse fit for some other training example, so it is not self-evident that this is a useful algorithm.
• Perceptron convergence theorem: if any set of weights correctly classifies a finite, linearly separable set of training examples, then perceptron learning will arrive at a (possibly different) set of weights that also correctly classifies all of the examples, after a finite number of weight changes.

33 Linear Separability
• A limitation of perceptrons is that they are only capable of learning linearly separable functions.
• A boolean function over variables x_1..x_n is linearly separable iff there exist values for w_0..w_n such that all the true instances can be separated from all the false instances by the hyperplane defined by

    w_0 + Σ_{i=1}^{n} x_i w_i = 0

  For example, if n = 2, the hyperplane is a line.

34 Linear Separability, continued
• Example: a perceptron can learn the logical AND of two inputs, but not the XOR.
• A perceptron can still give good predictions for inseparable functions, but it will not achieve 100% accuracy. In contrast, a two-level PHT (pattern history table) scheme like gshare can learn any boolean function if given enough time.
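To make the XOR claim concrete, here is the standard impossibility argument (our addition, not from the slides), written for inputs in {0, 1}:

    \begin{align*}
    &\text{Suppose } w_0 + w_1 x_1 + w_2 x_2 \ge 0 \iff x_1 \oplus x_2 = 1,
      \quad x_1, x_2 \in \{0, 1\}.\\
    &(x_1, x_2) = (0,0):\; w_0 < 0 \qquad\quad (1,1):\; w_0 + w_1 + w_2 < 0\\
    &(x_1, x_2) = (1,0):\; w_0 + w_1 \ge 0 \qquad (0,1):\; w_0 + w_2 \ge 0\\
    &\text{Summing the last two: } 2 w_0 + w_1 + w_2 \ge 0,
      \text{ so } w_0 + w_1 + w_2 \ge -w_0 > 0,\\
    &\text{contradicting the } (1,1) \text{ case. Hence no such weights exist.}
    \end{align*}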

35 Putting It All Together: the Perceptron-Based Predictor
1. The branch address is hashed into the table of perceptrons.
2. The i-th perceptron is fetched into a vector register P_{1..n} of weights.
3. The value of y is computed as the dot product of P and the global history register.
4. The branch is predicted not taken if y is negative, or taken otherwise.
5. Once the branch is resolved, the outcome is used by the training algorithm to update P.
6. P is written back to the i-th entry in the table.
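The six steps map directly onto a short simulation model. The sketch below ties the output and training sketches together into one hypothetical predictor; the table size, history length, and the threshold formula theta ≈ 1.93h + 14 (reported in the perceptron-predictor literature) are illustrative choices, not a definitive implementation:

    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    class PerceptronPredictor {
        static constexpr int HISTORY = 24;   // global history length (illustrative)
        // Training threshold; roughly 1.93*h + 14 per the literature.
        static constexpr int THETA = static_cast<int>(1.93 * HISTORY + 14);

        std::vector<std::vector<int8_t>> table;  // one weight vector per entry
        std::vector<int8_t> history;             // global history as {-1, +1}

        size_t index(uint64_t pc) const { return pc % table.size(); }   // step 1: hash

        int output(const std::vector<int8_t> &w) const {                // step 3: dot product
            int y = w[0];
            for (int i = 0; i < HISTORY; ++i) y += w[i + 1] * history[i];
            return y;
        }
    public:
        explicit PerceptronPredictor(size_t entries)
            : table(entries, std::vector<int8_t>(HISTORY + 1, 0)),
              history(HISTORY, -1) {}

        bool predict(uint64_t pc) const {        // steps 2 and 4
            return output(table[index(pc)]) >= 0;
        }

        // Steps 5-6: after the branch resolves, train and write back.
        void update(uint64_t pc, bool taken) {
            std::vector<int8_t> &w = table[index(pc)];
            int y = output(w);
            int t = taken ? 1 : -1;
            if ((y >= 0) != taken || std::abs(y) <= THETA) {
                auto bump = [](int8_t &wi, int d) {   // saturating add
                    int v = wi + d;
                    wi = static_cast<int8_t>(v > 127 ? 127 : (v < -128 ? -128 : v));
                };
                bump(w[0], t);
                for (int i = 0; i < HISTORY; ++i) bump(w[i + 1], t * history[i]);
            }
            for (int i = HISTORY - 1; i > 0; --i) history[i] = history[i - 1];
            history[0] = static_cast<int8_t>(t);      // shift outcome into history
        }
    };

In a trace-driven simulator one would call predict() at fetch and update() at retire for each conditional branch.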

36 Organization of the Perceptron Predictor
• Keeps a table of perceptrons, indexed by branch address
• Inputs come from the branch history register
• Predict taken if the output y >= 0; otherwise predict not taken
• Key intuition: the table size isn't exponential in the history length, so we can consider much longer histories

37 Results and Analysis for the Perceptron Predictor

38 Results: Predictor Accuracy
• The perceptron predictor outperforms a competitive hybrid predictor by 36% at a ~4 KB hardware budget: a 1.71% misprediction rate vs. 2.66%

39 Results: Large Hardware Budgets
• The multi-component hybrid was the most accurate fully dynamic predictor known in the literature [Evers 2000]
• The perceptron predictor is even more accurate

40 Results: IPC with a High Clock Rate
• Pentium 4-like configuration: 20-cycle misprediction penalty, 1.76 GHz
• 15.8% higher IPC than gshare, 5.7% higher than the hybrid predictor

41 Analysis: History Length
• The fixed-length path branch predictor can also use long histories [Stark, Evers & Patt '98]

42 Analysis: Training Times
• The perceptron predictor "warms up" faster

43 Future Work and Conclusions

44 Future Work with the Perceptron Predictor
• Let's make the best predictor even better:
  • Better representation
  • Better training algorithm
• Latency is a problem:
  • How can we eliminate the latency of the perceptron predictor?

45 Future Work with the Perceptron Predictor, continued
• Value prediction
  • Predict which value is likely to be the result of a load operation, to mitigate memory latency
• Indirect branch prediction
  • Virtual dispatch
  • switch statements in C

46 Future Work: Characterizing Predictability
• Branch predictability, value predictability
• How can we characterize algorithms in terms of their predictability?
• Given an algorithm, how can we transform it so that its branches and values are easier to predict?
• How much predictability is inherent in the algorithm, and how much is an artifact of the program structure?
• How can we compare different algorithms' predictability?

47 Conclusions
• Neural predictors can improve performance for deeply pipelined microprocessors
• Perceptron learning is well suited to microarchitectural implementation
• There is still a lot of work left to be done on the perceptron predictor in particular, and on microarchitectural prediction in general

48 The End