CS 704 Advanced Computer Architecture

Lecture 15: Instruction Level Parallelism (Dynamic Branch Prediction)
Prof. Dr. M. Ashraf Chughtai
Welcome to the 15th lecture of the series of lectures on Advanced Computer Architecture.

Lecture 15 – Instruction Level Parallelism -Dynamic (4)
Today's Topics
- Recap: Lecture 14
- Dynamic Branch Prediction
- Branch Prediction Buffer
- Examples of Branch Predictor
- Summary
MAC/VU-Advanced Computer Architecture Lecture 15 – Instruction Level Parallelism -Dynamic (4)

Recap: Lecture 14
Tomasulo's approach for the IBM 360/91 achieves high performance without special compilers.
- The control and buffers are distributed with the function units (FUs).
- Registers in instructions are replaced by values or pointers to reservation stations (RS); i.e., the registers are renamed.
- Unlike the scoreboard, Tomasulo can have multiple loads outstanding.

Recap: Lecture 14
These two properties allow an instruction with a name dependence to be issued; e.g., MULT is issued even though it has a name dependence on register F2.

Recap: Lecture 14
Tomasulo eliminates the WAR hazard: in this example, ADD.D writes its result in cycle 11 even though DIV.D does not start execution until cycle 16.

Recap: Lecture 14
Tomasulo issues in order and may execute out of order.

Recap: Lecture 14
Here, the integer instructions SUBI and BNEZ are executed out of order to evaluate the condition.
The prediction Branch-Taken is implemented by repeating the loop instructions as shown.

Recap: Lecture 14
The prediction Branch-Taken is implemented by two iterations of the code.
R1 has been initialized to 80.

Recap: Lecture 14
- L.D is issued in clock cycle 6, prior to the condition evaluation (predict branch taken).
- R1 is updated in clock cycle 6, by executing SUBI in clock cycle 5.
- F0 never sees the result.
- SUBI and BNEZ are issued in clock cycles 4 and 5 respectively.

Recap: Lecture 14
MUL1, issued in clock cycle 2, does not start execution until the write to F0 by L.D is complete. MUL1 reads the value that L.D produces, so this wait resolves a RAW hazard.

Recap: Lecture 14
- L.D 1 issues in cycle 1 and completes execution in cycle 9 (8 cycles the first time).
- It writes to F0 in cycle 10.
- L.D 2, issued in cycle 6, completes execution (4 cycles the second time).
- So MUL1 starts in cycle 11, after the F0 value it reads is available (RAW hazard).
- SD1 starts execution on the completion of MUL1, because it stores the result of MUL1 (RAW hazard).
- SUBI and BNEZ are issued in clock cycles 9 and 10 respectively.
- SUBI completes execution in cycle 10 and updates R1 for the next iteration.

Recap: Lecture 14
- MUL1 starts execution in cycle 11, completes in cycle 14, and writes its result to F4 in cycle 15.
- SD1, issued in cycle 3, starts execution in cycle 16, after the F4 value it stores has been written (RAW hazard).

Recap: Lecture 14
- MUL1 starts execution in cycle 11, completes in cycle 14, and writes its result to F4 in cycle 15.
- SD1, issued in cycle 3, starts execution in cycle 16 and completes in cycle 18.
- SUBI, issued in cycle 16, updates R1 for the next iteration in cycle 18.

Recap: Lecture 14
- MUL2 starts execution in cycle 12, completes in cycle 15, and writes its result to F4 in cycle 16.
- SD2, issued in cycle 8, starts execution in cycle 17, after MUL2 writes its result in cycle 16 (RAW hazard).

Introduction to Dynamic Branch Prediction
In the last lecture, we considered a loop-based example to discuss Tomasulo's approach to overcoming the WAW and WAR hazards.
There we observed that a dynamically scheduled pipeline can yield high performance, provided branches are predicted accurately.

Branch History Table
If the prediction is wrong, then invert the prediction bit.

1-bit Dynamic Branch Prediction
Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop: one at the loop exit and one when the loop is entered again.
A 1-bit predictor therefore mispredicts at twice the rate at which the branch is not taken.
Let us consider the example of a loop branch (For i=1 to 10); i.e., the branch is taken 9 times and not taken once.

1-bit Dynamic Branch Prediction … Conclusion
Performance = ƒ(accuracy, cost of mispredictions)
The accuracy of the predictor is expected to match the taken-branch frequency, which in the previous example is 9 out of 10 (90%).
But the 1-bit predictor achieves only 8 out of 10 (80%).
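The 80% figure can be checked with a small simulation. The following is only a sketch under stated assumptions: the function name `simulate_one_bit`, the initial prediction bit, and the choice to execute the loop three times in a row are illustrative and do not come from the lecture.

```python
def simulate_one_bit(outcomes, prediction=True):
    """Return a list of flags, True where a 1-bit predictor mispredicted.

    outcomes: branch outcomes in program order (True = taken).
    prediction: assumed initial value of the prediction bit.
    """
    missed = []
    for taken in outcomes:
        wrong = prediction != taken
        missed.append(wrong)
        if wrong:
            prediction = not prediction  # invert the bit on a misprediction
    return missed


# One execution of the loop: branch taken 9 times, then not taken at exit.
loop_run = [True] * 9 + [False]

# Assumption: the loop itself is executed three times in succession.
missed = simulate_one_bit(loop_run * 3)
steady = missed[-10:]  # last execution of the loop (steady state)

print("mispredictions in steady state:", sum(steady))
print("accuracy: %d%%" % (100 * (len(steady) - sum(steady)) // len(steady)))
```

In the steady state the predictor misses on the loop exit and again on the first iteration of the next execution, which gives the 8 out of 10 stated above.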

2 bits are used to encode 4 states in the system (counter). Say:
- States 00 and 01 for Predict Not-Taken
- States 10 and 11 for Predict Taken

[Figure: state diagram of the 2-bit prediction scheme. States 11 and 10 predict taken; states 01 and 00 predict not taken; the transitions are labeled Taken and NT.]

In a saturating counter implementation, the 2-bit counter saturates at:
- 00 (Predict Not-Taken), or
- 11 (Predict Taken)
The counter is incremented when a branch is taken and decremented when it is not taken; e.g.,
- 00 to 01 for Taken when predicted not taken
- 10 to 11 for Taken when predicted taken

Here, when the counter is greater than or equal to ½ of its maximum value (>=10; i.e., states 10 and 11), the branch is predicted as taken; otherwise (i.e., <10: states 01 and 00), the branch is predicted as untaken.

Let us try the example of the loop For i=1,10 (P.S. = present state, NS = next state):

    Iteration   P.S.   Branch      NS   Prediction
    0           --     Not taken   11   Taken
    1           11     Taken       11   Taken
    2           11     Taken       11   Taken
    :                  Taken       11   Taken
                       Not taken   10   Taken

Prediction fails once only.
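The same loop can be run through a 2-bit saturating counter to see the single misprediction per execution. This is a minimal sketch, assuming the counter starts in state 11 and that the loop is executed three times in succession; the name `simulate_two_bit` is hypothetical.

```python
def simulate_two_bit(outcomes, state=3):
    """Return a list of flags, True where a 2-bit counter mispredicted.

    state: counter value 0..3 (binary 00..11); 2 and 3 predict taken.
    The starting state 3 (binary 11) is an assumption for this sketch.
    """
    missed = []
    for taken in outcomes:
        predicted_taken = state >= 2
        missed.append(predicted_taken != taken)
        if taken:
            state = min(state + 1, 3)  # increment, saturating at 11
        else:
            state = max(state - 1, 0)  # decrement, saturating at 00
    return missed


# One execution of the loop: branch taken 9 times, then not taken at exit.
loop_run = [True] * 9 + [False]

missed = simulate_two_bit(loop_run * 3)
per_run = [sum(missed[i:i + 10]) for i in range(0, len(missed), 10)]
print("mispredictions per execution of the loop:", per_run)
```

After the loop exit the counter drops only to state 10, so the first iteration of the next execution is still predicted taken. That is why the second misprediction of the 1-bit scheme disappears.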

Branch Prediction Buffer (BPB) or BHT Implementation
If the prediction is wrong, then the prediction bits are changed:
- Predicted taken: state changes from 11 to 10
- Predicted not taken: state changes from 00 to 01

Branch History Table Accuracy
For example: [Fig. 3.8, p. 200]
Here, for the SPEC89 benchmarks, a branch prediction buffer with 4096 entries results in:
- Prediction accuracy ranging from 99% to 82%, or
- A misprediction rate of 1% to 18%
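A table of 2-bit counters with 4096 entries might be organized as in the sketch below. The indexing scheme (low-order bits of the branch address, 4-byte instructions), the initial counter state, and all names are assumptions made for illustration; the slides do not spell them out.

```python
class BranchHistoryTable:
    """Sketch of a branch prediction buffer of 2-bit saturating counters."""

    def __init__(self, entries=4096, initial_state=1):
        # Assumption: the entry count is a power of two so a mask can index it.
        assert entries & (entries - 1) == 0
        self.mask = entries - 1
        self.counters = [initial_state] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the low 2 address bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # states 10 and 11

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


bht = BranchHistoryTable()
pc = 0x400100                    # hypothetical branch address
first = bht.predict(pc)          # counter starts at 01: predict not taken
bht.update(pc, taken=True)       # 01 -> 10
second = bht.predict(pc)         # now predict taken

# Two branches whose addresses differ by entries * 4 bytes share one counter.
alias = pc + 4096 * 4
shares_entry = bht._index(alias) == bht._index(pc)
```

Because the buffer has no tags in this sketch, two different branches that map to the same entry interfere with each other, which is one reason table size affects accuracy.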

Branch History Table Accuracy wrt size
[Fig. 3.9, p. 201]

Impact of size on accuracy of BHT
As we try to exploit more ILP, the accuracy of the branch predictor becomes critical.
Here, the accuracy of the predictor is shown as the size of the buffer is increased:
- 4096-entry 2-bit BHT
- Unlimited-entry 2-bit BHT
Simply increasing the number of bits per predictor without changing the predictor structure has little impact, so we have to look at other methods to increase the accuracy of the predictors.

Correlating Branches
The 2-bit predictor scheme uses only the recent behavior of a single branch to predict the future behavior of that branch.
In practice, the behavior of other branches, and not only of the branch we are trying to predict, may also influence the prediction accuracy.
Let us consider the worst case of the SPEC92 benchmark for the 2-bit predictor.

Correlating Branches SPEC92 benchmark for 2-bit predictor example:
Assume aa is assigned to R1 and bb to R2.

    if (aa==2)            DSUBUI R3, R1, #2
        aa=0;             BNEZ   R3, L1      ; branch b1 (aa!=2)
                          DADD   R1, R0, R0  ; aa=0 (b1 not taken)
    if (bb==2)       L1:  DSUBUI R3, R2, #2
        bb=0;             BNEZ   R3, L2      ; branch b2 (bb!=2)
                          DADD   R2, R0, R0  ; bb=0 (b2 not taken)
    if (aa!=bb) {    L2:  DSUBU  R3, R1, R2
                          BEQZ   R3, L3      ; branch b3 (aa==bb)

Here, the behavior of b3 (at L2) is correlated with the behavior of b1 and b2.

Correlating Branches
Here, if b1 and b2 are both not taken (aa=0; bb=0), then b3 is taken.
A predictor that uses the behavior of a single branch to predict the behavior of that branch cannot capture this behavior.
So we need a correlating branch predictor.
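The correlation can be checked by computing the three branch outcomes directly from the code fragment. This is a sketch only; the function name `branch_outcomes` and the range of test values are illustrative assumptions.

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3) for the code fragment; True means branch taken."""
    b1 = aa != 2          # BNEZ R3, L1 is taken when aa != 2
    if not b1:
        aa = 0            # fall-through path: aa = 0
    b2 = bb != 2          # BNEZ R3, L2 is taken when bb != 2
    if not b2:
        bb = 0            # fall-through path: bb = 0
    b3 = aa == bb         # BEQZ R3, L3 is taken when aa == bb
    return b1, b2, b3


# Whenever b1 and b2 are both not taken, b3 must be taken.
for aa in range(4):
    for bb in range(4):
        b1, b2, b3 = branch_outcomes(aa, bb)
        if not b1 and not b2:
            assert b3

print(branch_outcomes(2, 2))
```

A per-branch counter for b3 sees only the outcomes of b3 itself, so it has no access to the information carried by the outcomes of b1 and b2.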

Correlating Branch Predictors
Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.

Correlating Branch Predictors
In general, an (m,n) predictor records the behavior of the last m branches to select between 2^m history tables, each with n-bit counters.
The old 2-bit BHT is then a (0,2) predictor.

Correlating Branch Predictor: Example
Let us consider an illustrative code fragment (d is assigned to R1):

    if (d==0)             BNEZ   R1, L1      ; branch b1 (d!=0)
        d=1;              DADDIU R1, R0, #1  ; branch not taken, d=1
    if (d==1)        L1:  DADDIU R3, R1, #-1
                          BNEZ   R3, L2      ; branch b2 (d!=1)

The working of the correlating predictor is as follows:

    Initial d   d==0?   b1   d before b2   d==1?   b2
    0           yes     NT   1             yes     NT
    1           no      T    1             yes     NT
    2           no      T    2             no      T

Here, if b1 is not taken, b2 will not be taken.

We write the pair of prediction bits as:
prediction if the last branch in the program is not taken / prediction if the last branch is taken
Therefore, the 4 possible combinations are:

    Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
    NT/NT             NT                                    NT
    NT/T              NT                                    T
    T/NT              T                                     NT
    T/T               T                                     T

The action of the 1-bit predictor with 1 bit of correlation, written as (1,1), for the above example is shown here (Fig. … p. 203).
In this case the only misprediction is on the first iteration, when d=2, as this is not correlated with the previous prediction.
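The behavior of the (1,1) predictor on this code can be traced with a short simulation. The sketch below makes several assumptions that the transcript does not state: d alternates between 2 and 0, all prediction bits start as not taken, and the branch executed before the fragment is treated as not taken. The function names are hypothetical.

```python
def run_code(d):
    """Return (b1_taken, b2_taken) for one pass through the code with initial d."""
    b1_taken = d != 0
    if not b1_taken:
        d = 1
    b2_taken = d != 1
    return b1_taken, b2_taken


def simulate_1_1(d_values):
    """Trace a (1,1) predictor; returns (d, branch, mispredicted) tuples."""
    # predictors[branch][0] is used when the last branch was not taken,
    # predictors[branch][1] when it was taken. False means predict not taken.
    predictors = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # assumed outcome of the branch before the fragment
    trace = []
    for d in d_values:
        for name, taken in zip(("b1", "b2"), run_code(d)):
            slot = int(last_taken)
            trace.append((d, name, predictors[name][slot] != taken))
            predictors[name][slot] = taken  # 1-bit entry records the last outcome
            last_taken = taken
    return trace


trace = simulate_1_1([2, 0, 2, 0])
for d, name, wrong in trace:
    print(d, name, "mispredicted" if wrong else "correct")
```

Under these assumptions both branches are mispredicted only during the first pass with d=2; every later prediction is correct because the outcome of b1 selects the right entry for b2.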

Correlating Branches
A (2,2) branch prediction buffer uses a 2-bit global history to choose from among 4 predictors for each branch address.
The behavior of recent branches then selects between, say, four predictions of the next branch, updating just that prediction.
[Figure: the branch address and the 2-bit global branch history together select one of the 2-bit per-branch predictors, which supplies the prediction.]
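A general (m,n) predictor can be sketched as a set of 2^m counter tables selected by an m-bit global history register. Everything below is an illustrative assumption rather than the design on the slide: the class name, the indexing by instruction address, the all-zero initial counters, and the alternating test pattern.

```python
class CorrelatingPredictor:
    """Sketch of an (m,n) predictor: 2**m tables of n-bit saturating counters."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m, self.n, self.entries = m, n, entries
        self.max_count = (1 << n) - 1
        self.history = 0  # outcomes of the last m branches, newest in bit 0
        self.tables = [[0] * entries for _ in range(1 << m)]

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        counter = self.tables[self.history][self._index(pc)]
        return counter >= (self.max_count + 1) // 2

    def update(self, pc, taken):
        table, i = self.tables[self.history], self._index(pc)
        if taken:
            table[i] = min(table[i] + 1, self.max_count)
        else:
            table[i] = max(table[i] - 1, 0)
        # Shift the new outcome into the global history (stays 0 when m == 0).
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    def storage_bits(self):
        return (1 << self.m) * self.n * self.entries


def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses


pattern = [True, False] * 10  # hypothetical branch that alternates T, NT
misses_2_2 = count_misses(CorrelatingPredictor(2, 2, 1024), 0x400100, pattern)
misses_0_2 = count_misses(CorrelatingPredictor(0, 2, 4096), 0x400100, pattern)
print(misses_2_2, misses_0_2)
```

With these parameters the 1024-entry (2,2) predictor and the 4096-entry (0,2) predictor use the same number of storage bits, so the comparison isolates the effect of the predictor structure.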

Accuracy of Different Schemes
[Figure: frequency of mispredictions (axis from 0% to 18%) for three schemes: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT.]

Branch History Table or Branch Target Buffer
[Figure: branch target buffer. The PC of the instruction to fetch is looked up in the buffer, whose height is the number of entries in the branch target buffer; each entry holds a predicted PC and whether the branch is predicted taken or not taken.]
- No: the instruction is not predicted to be a branch; proceed normally.
- Yes: the instruction is a branch, and the predicted PC should be used as the next PC.
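The lookup described above can be sketched with a dictionary keyed by the fetch PC. This is a simplified illustration under stated assumptions: only taken branches are kept in the buffer, instructions are 4 bytes long, capacity is unlimited, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Sketch of a branch target buffer: fetch PC -> predicted next PC."""

    def __init__(self):
        self.entries = {}  # assumption: unlimited capacity, exact PC match

    def next_pc(self, pc, instr_size=4):
        target = self.entries.get(pc)
        if target is None:
            # No entry: not predicted to be a branch, proceed normally.
            return pc + instr_size
        # Entry found: use the predicted PC as the next PC.
        return target

    def update(self, pc, taken, target):
        # Assumption: only taken branches are stored; a not-taken branch
        # removes its entry so the fall-through path is fetched next time.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
before = btb.next_pc(0x100)               # no entry yet: sequential fetch
btb.update(0x100, taken=True, target=0x40)
after = btb.next_pc(0x100)                # predicted target is used
btb.update(0x100, taken=False, target=0x40)
evicted = btb.next_pc(0x100)              # entry removed: sequential again
```

Because the lookup uses only the fetch PC, the predicted next PC is available before the instruction is decoded, which is what distinguishes the target buffer from a plain prediction buffer.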

Dynamic Branch Prediction Summary
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
- Branch Target Buffer: includes the branch address and the prediction
- Predicated execution can reduce the number of branches and the number of mispredicted branches

Assalam-u-Alaikum and Allah Hafiz
