Idealized Piecewise Linear Branch Prediction Daniel A. Jiménez Department of Computer Science Rutgers University and Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya

2 This Talk
• Brief introduction to conditional branch prediction
  • Some motivation, some background
• Introduction to neural branch prediction
  • Perceptron predictor
  • Mathematical intuition
  • Some pictures and movies
• Piecewise Linear Branch Prediction
  • The algorithm
  • Why it's better
• Idealized Piecewise Linear Branch Prediction
  • My entry into the championship branch predictor contest

3 Pipelining and Branches
(Figure: a five-stage pipeline — instruction fetch, instruction decode, execute, memory access, write back — stalled behind an unresolved branch instruction.)
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.

4 Branch Prediction
(Figure: the same pipeline, with speculative execution proceeding past the branch.)
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path. Branch predictors must be highly accurate to avoid mispredictions!

5 Branch Predictors Must Improve
• The cost of a misprediction is proportional to pipeline depth
• As pipelines deepen, we need more accurate branch predictors
  • The Pentium 4 pipeline has 31 stages!
• Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage
• Decreasing the misprediction rate from 9% to 4% results in a 31% speedup for a 32-stage pipeline
  (simulations with SimpleScalar/Alpha)

6 Previous Work on Branch Prediction
• The architecture literature is replete with branch prediction papers
• Most refine two-level adaptive branch prediction [Yeh & Patt `91]
  • A 1st-level table records recent global or per-branch pattern histories
  • A 2nd-level table learns correlations between histories and outcomes
  • Refinements focus on reducing destructive interference in the tables
• Some of the better refinements (not an exhaustive list):
  • gshare [McFarling `93]
  • agree [Sprangle et al. `97]
  • hybrid predictors [Evers et al. `96]
  • skewed predictors [Michaud et al. `97]

7 Conditional Branch Prediction is a Machine Learning Problem
• The machine learns to predict conditional branches
• So why not apply a machine learning algorithm?
• Artificial neural networks
  • A simple model of the neural networks in brain cells
  • Learn to recognize and classify patterns
• We used fast and accurate perceptrons [Rosenblatt `62, Block `62] for dynamic branch prediction [Jiménez & Lin, HPCA 2001]
• We were the first to use single-layer perceptrons and to achieve accuracy superior to PHT techniques; previous work used LVQ and MLPs for branch prediction [Vintan & Iridon `99]

8 Input and Output of the Perceptron
• The inputs to the perceptron are branch outcome histories
  • Just like in 2-level adaptive branch prediction
  • Can be global or local (per-branch) or both (alloyed)
• Conceptually, branch outcomes are represented as
  • +1, for taken
  • -1, for not taken
• The output of the perceptron is
  • Non-negative, if the branch is predicted taken
  • Negative, if the branch is predicted not taken
• Ideally, each static branch is allocated its own perceptron

9 Branch-Predicting Perceptron
• Inputs (x's) are from branch history and are -1 or +1
• n + 1 small integer weights (w's) learned by on-line training
• Output (y) is the dot product of the x's and w's; predict taken if y ≥ 0
• Training finds correlations between history and outcome

10 Training Algorithm

11 What Do The Weights Mean?
• The bias weight, w_0:
  • Proportional to the probability that the branch is taken
  • Doesn't take into account other branches; just like a Smith predictor
• The correlating weights, w_1 through w_n:
  • w_i is proportional to the probability that the predicted branch agrees with the i-th branch in the history
• The dot product of the w's and x's
  • w_i × x_i is proportional to the probability that the predicted branch is taken, based on the correlation between this branch and the i-th branch
  • The sum takes into account all estimated probabilities
• What's θ?
  • It keeps the predictor from overtraining, so it can adapt quickly to changing behavior

12 Organization of the Perceptron Predictor
• Keeps a table of m perceptron weights vectors
• Table is indexed by branch address modulo m
[Jiménez & Lin, HPCA 2001]

13 Mathematical Intuition
A perceptron defines a hyperplane in (n+1)-dimensional space:

    y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n

For instance, in 2D space we have:

    y = w_0 + w_1 x_1

This is the equation of a line, the same as

    y = mx + b

14 Mathematical Intuition continued
In 3D space, we have

    y = w_0 + w_1 x_1 + w_2 x_2

Or you can think of it as

    z = c + ax + by

i.e. the equation of a plane in 3D space. This hyperplane forms a decision surface separating predicted-taken from predicted-not-taken histories. This surface intersects the feature space. It is a linear surface: a line in 2D, a plane in 3D, a hyperplane in higher dimensions.

15 Example: AND
• Here is a representation of the AND function
• White means false, black means true for the output
• -1 means false, +1 means true for the input

    -1 AND -1 = false
    -1 AND +1 = false
    +1 AND -1 = false
    +1 AND +1 = true

16 Example: AND continued
• A linear decision surface (i.e. a plane in 3D space) intersecting the feature space (i.e. the 2D plane where z = 0) separates false from true instances

17 Example: AND continued
• Watch a perceptron learn the AND function:

18 Example: XOR
• Here's the XOR function:

    -1 XOR -1 = false
    -1 XOR +1 = true
    +1 XOR -1 = true
    +1 XOR +1 = false

• Perceptrons cannot learn such linearly inseparable functions

19 My Previous Work on Neural Predictors
• The perceptron predictor uses only pattern history information
  • The same weights vector is used for every prediction of a static branch
  • The i-th history bit could come from any number of static branches
  • So the i-th correlating weight is aliased among many branches
• The newer path-based neural predictor uses path information
  • The i-th correlating weight is selected using the i-th branch address
  • This allows the predictor to be pipelined, mitigating latency
  • This strategy improves accuracy because of the path information
  • But there is now even more aliasing, since the i-th weight could be used to predict many different branches

20 Piecewise Linear Branch Prediction
• A generalization of the perceptron and path-based neural predictors
• Ideally, there is a weight giving the correlation between each
  • static branch b, and
  • each pair of branch and history position (i.e. i) in b's history
• b might have thousands of correlating weights or just a few
  • Depends on the number of static branches in b's history
• First, I'll show a practical version

21 The Algorithm: Parameters and Variables
• GHL – the global history length
• GHR – a global history shift register
• GA – a global array of previous branch addresses
• W – an n × m × (GHL + 1) array of small integers

22 The Algorithm: Making a Prediction
Weights are selected based on the current branch and the i-th most recent branch.

23 The Algorithm: Training

24 Why It's Better
• Forms a piecewise linear decision surface
• Each piece is determined by the path to the predicted branch
• Can solve more problems than the perceptron
(Figures: the perceptron decision surface for XOR doesn't classify all inputs correctly; the piecewise linear decision surface for XOR classifies all inputs correctly.)

25 Learning XOR
From a program that computes XOR using if statements.
(Movies: perceptron prediction vs. piecewise linear prediction.)

26 A Generalization of Neural Predictors
• When m = 1, the algorithm is exactly the perceptron predictor
  • W[n, 1, h+1] holds n weights vectors
• When n = 1, the algorithm is the path-based neural predictor
  • W[1, m, h+1] holds m weights vectors
  • Can be pipelined to reduce latency
• The design space in between contains more accurate predictors
  • If n is small, the predictor can still be pipelined to reduce latency

27 Generalization Continued
The perceptron and path-based predictors are the least accurate extremes of piecewise linear branch prediction!

28 Idealized Piecewise Linear Branch Prediction
• Presented at the CBP workshop at MICRO 2004
  • Hardware budget limited to 64K bits
  • No other limitations
• Get rid of n and m
  • Allow the 1st and 2nd dimensions of W to be unlimited
  • Now branches cannot alias one another; accuracy is much better
• One small problem: an unlimited amount of storage is required
  • How to squeeze this into 65,792 bits for the contest?

29 Hashing
• The 3 indices of W — i, j, and k — index arbitrary numbers of weights
• Hash them into 0..N-1, indexing an array of N weights
• Collisions will cause aliasing, but it is more uniformly distributed
• The hash function uses three primes, H_1, H_2, and H_3

30 More Tricks
• Weights are 7 bits, elements of GA are 8 bits
• Separate arrays for bias weights and correlating weights
• Using global and per-branch history
  • An array of per-branch histories is kept, alloyed with the global history
• Slightly bias the predictor toward not taken
• Dynamically adjust the history length
  • Based on an estimate of the number of static branches
• Extra weights
  • Extra bias weights for each branch
  • Extra correlating weights for more recent history bits
  • Inverted bias weights that track the opposite of the branch bias

31 Parameters to the Algorithm

    #define NUM_WEIGHTS 8590
    #define NUM_BIASES 599
    #define INIT_GLOBAL_HISTORY_LENGTH 30
    #define HIGH_GLOBAL_HISTORY_LENGTH 48
    #define LOW_GLOBAL_HISTORY_LENGTH 18
    #define INIT_LOCAL_HISTORY_LENGTH 4
    #define HIGH_LOCAL_HISTORY_LENGTH 16
    #define LOW_LOCAL_HISTORY_LENGTH 1
    #define EXTRA_BIAS_LENGTH 6
    #define HIGH_EXTRA_BIAS_LENGTH 2
    #define LOW_EXTRA_BIAS_LENGTH 7
    #define EXTRA_HISTORY_LENGTH 5
    #define HIGH_EXTRA_HISTORY_LENGTH 7
    #define LOW_EXTRA_HISTORY_LENGTH 4
    #define INVERTED_BIAS_LENGTH 8
    #define HIGH_INVERTED_BIAS_LENGTH 4
    #define LOW_INVERTED_BIAS_LENGTH 9
    #define NUM_HISTORIES 55
    #define WEIGHT_WIDTH 7
    #define MAX_WEIGHT 63
    #define MIN_WEIGHT -64
    #define INIT_THETA_UPPER 70
    #define INIT_THETA_LOWER -70
    #define HIGH_THETA_UPPER 139
    #define HIGH_THETA_LOWER -136
    #define LOW_THETA_UPPER 50
    #define LOW_THETA_LOWER -46
    #define HASH_PRIME_ U
    #define HASH_PRIME_ U
    #define HASH_PRIME_ U
    #define TAKEN_THRESHOLD 3

All determined empirically with an ad hoc approach.

32 Per-Benchmark Accuracy
• I used several highly accurate predictors to compete with my predictor
• I measured the potential of my technique using an unlimited hardware budget

33 Scores for the 6 Finalists (out of 18 entries)
1. Hongliang Gao, Huiyang Zhou
2. André Seznec
3. Gabriel Loh
4. Daniel A. Jiménez
5. Pierre Michaud
6. Veerle Desmet et al.

Scores are in average MPKI (mispredictions per 1000 instructions), corrected, over a suite of 20 traces from Intel. 5 of the 6 finalists used ideas from the perceptron predictor in their entries.

34 References
• Jiménez and Lin, HPCA 2001 (perceptron predictor)
• Jiménez and Lin, TOCS 2002 (global/local perceptron)
• Jiménez, MICRO 2003 (path-based neural predictor)
• Jiménez, ISCA 2005 (piecewise linear branch prediction)
• Juan, Sanjeevan, Navarro, SIGARCH Comp. News, 1998 (dynamic history-length fitting)
• Skadron, Martonosi, Clark, PACT 2000 (alloyed history)

35 The End

36 Extra Slides
• Following this slide are some slides cut from the talk to fit within time constraints.

37 Program to Compute XOR

    #include <stdlib.h>   /* for rand */

    int f () {
        int a, b, x, i, s = 0;
        for (i = 0; i < 100; i++) {
            a = rand () % 2;
            b = rand () % 2;
            if (a) {
                if (b) x = 0; else x = 1;
            } else {
                if (b) x = 1; else x = 0;
            }
            if (x) s++;   /* this is the branch */
        }
        return s;
    }

38 Example: XOR continued
• Watch a perceptron try to learn XOR