CSC 4250 Computer Architectures October 27, 2006 Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation.


1 CSC 4250 Computer Architectures October 27, 2006 Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation

2 Nested Loops
        DADDIU  R1,R0,#80
Loop1:  L.D     F2,1600(R1)
        DADDIU  R2,R0,#40
Loop2:  L.D     F0,1000(R2)
        ADD.D   F0,F0,F2
        S.D     F0,1000(R2)
        DADDIU  R2,R2,#-8
        BNEZ    R2,Loop2
        DADDIU  R1,R1,#-8
        BNEZ    R1,Loop1
How many times do Loop1 and Loop2 iterate?
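The question can be checked by tracing the two loop counters. The following is a minimal Python sketch, not part of the original slides; the function name `count_iterations` is hypothetical, and the model tracks only R1 and R2 while ignoring the loads, add, and store.

```python
def count_iterations():
    """Trace the R1/R2 loop counters of the nested-loop code."""
    loop1_iters = 0
    loop2_iters_per_pass = []
    r1 = 80                      # DADDIU R1,R0,#80
    while True:                  # Loop1 body
        loop1_iters += 1
        r2 = 40                  # DADDIU R2,R0,#40
        inner = 0
        while True:              # Loop2 body
            inner += 1
            r2 -= 8              # DADDIU R2,R2,#-8
            if r2 == 0:          # BNEZ R2,Loop2 falls through
                break
        loop2_iters_per_pass.append(inner)
        r1 -= 8                  # DADDIU R1,R1,#-8
        if r1 == 0:              # BNEZ R1,Loop1 falls through
            break
    return loop1_iters, loop2_iters_per_pass


outer, inner = count_iterations()
print(outer, inner[0])           # Loop1 count, Loop2 count per Loop1 iteration
```

Under this model, Loop1 runs 80/8 = 10 times and Loop2 runs 40/8 = 5 times on each pass, which is consistent with the five-outcome groups (TTTTN) in the branch history on the next slide.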

3 BNEZ R2,Loop2
Branch history:   TTTTN|TTTTN|TTTTN|TTTTN|…  (N means branch not taken)
1-bit predictor:  TTTTT|NTTTT|NTTTT|NTTTT|…  → two errors per pass through Loop2 (that is, per Loop1 iteration) in the steady state.
2-bit predictor:  TTTTT|TTTTT|TTTTT|TTTTT|…  → one error per pass through Loop2.
The error behavior for Loop1 is similar. Put more bits in the counter to improve error behavior?
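The two prediction strings above can be reproduced with a short simulation. This is a sketch under stated assumptions: both predictors start out predicting taken (the 2-bit counter starts at its strongest taken state), and the helper names are hypothetical.

```python
def one_bit(history, state="T"):
    """1-bit predictor: predict whatever the branch did last time."""
    preds = []
    for outcome in history:
        preds.append(state)
        state = outcome
    return "".join(preds)


def two_bit(history, counter=3):
    """2-bit saturating counter: values 0-1 predict N, values 2-3 predict T."""
    preds = []
    for outcome in history:
        preds.append("T" if counter >= 2 else "N")
        counter = min(counter + 1, 3) if outcome == "T" else max(counter - 1, 0)
    return "".join(preds)


def errors(history, preds):
    """Count positions where the prediction differs from the outcome."""
    return sum(h != p for h, p in zip(history, preds))


history = "TTTTN" * 4
print(one_bit(history), errors(history, one_bit(history)))
print(two_bit(history), errors(history, two_bit(history)))
```

With four passes, the 1-bit predictor misses once in the first pass and twice in each later pass, while the 2-bit predictor misses only on each loop exit.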

4 Global Branch History
Global branch history:  TTTTN|T|TTTTN|T|TTTTN|T|TTTTN|T|…
Loop:                   22222|1|22222|1|22222|1|22222|1|…
Can we use global branch history to get a better result? (On the previous slide, we looked at local branch history.)

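5 5-Bit Global Branch History
We keep a 5-bit global branch history and use the bit pattern to choose one of 2^5 = 32 1-bit predictors:
History   Prediction
TTTTT     N
TTTTN     T
TTTNT     T
TTNTT     T
TNTTT     T
NTTTT     T
…         …
NNNNN     T
We get 100% accuracy in the steady state. This strategy works if at least 5 bits are used. With only 4 bits, the history TTTT is followed by T in one case and by N in another, so a single 1-bit predictor cannot get both right.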
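The claim about history length can be checked by simulation. The sketch below is illustrative only: it assumes the steady-state outcome stream TTTTNT (five Loop2 branches followed by one Loop1 branch), an initial history of all T, and a default prediction of taken for histories not yet seen. The function name and parameters are hypothetical.

```python
def global_history_accuracy(bits, passes=50, warmup_passes=10):
    """Steady-state accuracy of 1-bit predictors selected by `bits` of global history.

    `bits` must be at least 1.
    """
    stream = "TTTTNT" * passes           # Loop2 five times, then Loop1, repeated
    table = {}                           # history pattern -> last outcome seen (1-bit predictor)
    history = "T" * bits                 # assumed initial history
    correct = total = 0
    for i, outcome in enumerate(stream):
        prediction = table.get(history, "T")     # assumed default: predict taken
        if i >= warmup_passes * 6:               # skip the warm-up period
            total += 1
            correct += prediction == outcome
        table[history] = outcome                 # 1-bit update: remember last outcome
        history = (history + outcome)[-bits:]    # shift the new outcome into the history
    return correct / total


for bits in (3, 4, 5, 6):
    print(bits, global_history_accuracy(bits))
```

In this model, 5 or more bits give perfect steady-state accuracy, and 4 bits give two misses in every six branches.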

6 Correlating Branch Predictors (p. 200)
A 2-bit predictor uses only the recent behavior of a single branch.
SPEC92 benchmark eqntott (the worst case in Figures 3.8 and 3.9, with an 18% error rate):
    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb){

7 MIPS Code
Assume that aa and bb are assigned to R1 and R2:
        DSUBUI  R3,R1,#2
        BNEZ    R3,L1       ;branch b1 (aa!=2)
        DADD    R1,R0,R0    ;aa=0
L1:     DSUBUI  R3,R2,#2
        BNEZ    R3,L2       ;branch b2 (bb!=2)
        DADD    R2,R0,R0    ;bb=0
L2:     DSUBU   R3,R1,R2    ;R3=aa−bb
        BEQZ    R3,L3       ;branch b3 (aa==bb)
Consider the branches. The behavior of branch b3 is correlated with the behavior of branches b1 and b2: if both b1 and b2 are not taken, then b3 will be taken (as aa and bb are equal).
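The correlation can be checked by modeling the three branches directly. This is a minimal sketch; the function name `branch_outcomes` and the tested range of values are assumptions made for illustration.

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3) as taken flags for the eqntott fragment."""
    b1 = aa != 2             # BNEZ R3,L1 is taken when aa != 2
    if not b1:
        aa = 0               # fall-through path: aa = 0
    b2 = bb != 2             # BNEZ R3,L2 is taken when bb != 2
    if not b2:
        bb = 0               # fall-through path: bb = 0
    b3 = aa == bb            # BEQZ R3,L3 is taken when aa == bb
    return b1, b2, b3


# Whenever b1 and b2 are both not taken, b3 must be taken.
for aa in range(4):
    for bb in range(4):
        b1, b2, b3 = branch_outcomes(aa, bb)
        if not b1 and not b2:
            assert b3
```

A predictor that looks only at the history of b3 cannot see this relationship, which is the motivation for using the outcomes of other branches.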

8 Simplified Example (p. 202)
Suppose that d has values 0, 1, and 2:
    if (d==0) d=1;
    if (d==1)
MIPS code, assuming that d is assigned to R1:
        BNEZ    R1,L1       ;branch b1 (d!=0)
        DADDIU  R1,R0,#1    ;d==0, so d=1
L1:     DADDIU  R3,R1,#-1
        BNEZ    R3,L2       ;branch b2 (d!=1)
        …
L2:

9 Figure 3.10. Possible execution sequence
Initial value of d   d==0?   b1   Value of d before b2   d==1?   b2
0                    Yes     NT   1                      Yes     NT
1                    No      T    1                      Yes     NT
2                    No      T    2                      No      T

10 Figure 3.11. Behavior of 1-bit predictor initialized to NT
Suppose that d = 2, 0, 2, 0, … Misprediction rate = 100%!
d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
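The 100% misprediction rate follows mechanically from the branch outcomes in Figure 3.10. Here is a small sketch that replays the sequence d = 2, 0, 2, 0 through two independent 1-bit predictors; the function names are hypothetical.

```python
def outcomes(d):
    """Taken flags for b1 (d != 0) and b2 (d != 1), following Figure 3.10."""
    b1 = d != 0
    if not b1:
        d = 1                # fall-through path sets d = 1
    b2 = d != 1
    return b1, b2


def run_one_bit(ds):
    """Return (mispredictions, total branches) for per-branch 1-bit predictors."""
    pred = {"b1": False, "b2": False}    # both initialized to NT
    miss = total = 0
    for d in ds:
        for name, taken in zip(("b1", "b2"), outcomes(d)):
            miss += pred[name] != taken
            total += 1
            pred[name] = taken           # 1-bit update: remember the last outcome
    return miss, total


print(run_one_bit([2, 0, 2, 0]))         # every one of the 8 branches is mispredicted
```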

11 Figure 3.12. Meaning of Prediction Bits
Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
NT/NT             NT                                    NT
NT/T              NT                                    T
T/NT              T                                     NT
T/T               T                                     T

12 Fig. 3.13. Action of 1-bit predictor with 1 bit of correlation. Initialized to NT/NT
d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     NT/NT           T           T/NT                NT/NT           T           NT/T
0     T/NT            NT          T/NT                NT/T            NT          NT/T
2     T/NT            T           T/NT                NT/T            T           NT/T
0     T/NT            NT          T/NT                NT/T            NT          NT/T
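The table can be regenerated by simulation. The sketch below keeps two 1-bit predictions per branch and selects between them using the outcome of the most recently executed branch. It assumes that the branch preceding the first b1 was not taken; the function names are hypothetical.

```python
def outcomes(d):
    """Taken flags for b1 (d != 0) and b2 (d != 1), following Figure 3.10."""
    b1 = d != 0
    if not b1:
        d = 1
    b2 = d != 1
    return b1, b2


def run_correlating(ds):
    """Simulate a 1-bit predictor with 1 bit of global correlation."""
    # Per branch: [prediction if last branch NT, prediction if last branch T].
    table = {"b1": [False, False], "b2": [False, False]}   # NT/NT
    last_taken = False       # assumption: the branch before the first b1 was not taken
    misses = []
    for d in ds:
        for name, taken in zip(("b1", "b2"), outcomes(d)):
            slot = int(last_taken)                   # global history selects the bit
            misses.append(table[name][slot] != taken)
            table[name][slot] = taken                # update only the selected bit
            last_taken = taken
    return misses, table


misses, table = run_correlating([2, 0, 2, 0])
print(sum(misses), table)
```

Only the two branches of the first iteration are mispredicted; afterward b1 holds T/NT and b2 holds NT/T, matching the last rows of the table.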

13 Figure 3.14. A (2,2) Branch Prediction Buffer
This buffer uses a 2-bit global history to choose from among 2^2 = 4 predictors for each branch address. Each predictor is in turn a 2-bit predictor for that branch. Figure 3.12 shows a (1,1) branch prediction buffer.
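The general (m,n) organization can be sketched as a small table of saturating counters. This is an illustrative model rather than the buffer in the figure: the class name, the use of `pc % entries` to stand in for the low-order branch-address bits, and the all-zero initial state are assumptions.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history pick one of 2**m n-bit counters."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.max_count = (1 << n) - 1
        self.history = 0                          # last m outcomes, newest in bit 0
        self.counters = [[0] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        """Predict taken when the selected counter is in its upper half."""
        counter = self.counters[pc % self.entries][self.history]
        return counter > self.max_count // 2

    def update(self, pc, taken):
        """Train the selected counter, then shift the outcome into the history."""
        row = self.counters[pc % self.entries]
        c = row[self.history]
        row[self.history] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    def total_bits(self):
        """Storage cost: 2**m counters of n bits for each entry."""
        return (1 << self.m) * self.n * self.entries


p = CorrelatingPredictor(m=2, n=2, entries=1024)
print(p.total_bits())                             # 4 * 2 * 1024 bits
```

With m = 0 the history is ignored and the model reduces to a plain table of n-bit counters, which makes it convenient for comparing predictors at equal storage cost.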

14 Figure 3.15. Comparison of 2-bit Predictors

15 Tournament Predictors (p. 206)
Adaptively combine local and global predictors.
The Alpha 21264 has a tournament predictor that uses 4K 2-bit counters, indexed by the local branch address, to choose between a global predictor and a local predictor.
The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor.
The local predictor is a 2-level predictor. The top level is a local history table consisting of 1024 10-bit entries. The selected entry is used to index a table of 1K entries consisting of 3-bit saturating counters, which provides the local prediction.
(Total = 29K bits. For SPECfp95 benchmarks, fewer than 1 misprediction per 1000 completed instructions.)
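The 29K-bit total follows from the table sizes listed above. A short arithmetic check, with hypothetical variable names:

```python
K = 1024

chooser_bits = 4 * K * 2          # 4K 2-bit selector counters
global_bits = 4 * K * 2           # 4K 2-bit global predictors
local_history_bits = 1024 * 10    # 1024 local history entries of 10 bits each
local_counter_bits = 1 * K * 3    # 1K 3-bit saturating counters

total_bits = chooser_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits, total_bits // K)    # 29696 bits, i.e. 29K

# The table sizes match the history lengths that index them.
assert 2 ** 12 == 4 * K           # 12 bits of global history -> 4K entries
assert 2 ** 10 == 1 * K           # 10 bits of local history -> 1K counters
```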

16 Fig. 3.16. State Transition Diagram for Tournament Predictor The counter is incremented whenever the “predicted” predictor is correct and the other predictor is incorrect, and it is decremented in the reverse situation.
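The selector described above behaves like a 2-bit saturating counter. The sketch below is one reading of that rule, and its encoding is an assumption: counter values 2 and 3 select the first predictor, values 0 and 1 select the second, and the counter moves only when exactly one of the two predictors is correct. The function names are hypothetical.

```python
def update_chooser(counter, first_correct, second_correct):
    """Move a 2-bit selector toward whichever predictor was right alone."""
    if first_correct and not second_correct:
        return min(counter + 1, 3)
    if second_correct and not first_correct:
        return max(counter - 1, 0)
    return counter                    # both right or both wrong: no change


def selected(counter):
    """Name the predictor that the selector currently trusts."""
    return "first" if counter >= 2 else "second"


counter = 3
for first_ok, second_ok in [(False, True), (False, True), (True, True)]:
    counter = update_chooser(counter, first_ok, second_ok)
print(counter, selected(counter))
```

Starting from the strongest preference for the first predictor, two consecutive cases where only the second predictor is correct are needed before the selection switches.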

17 Figure 3.17. Fraction of predictions from local predictor for a tournament predictor using SPEC89

18 Figure 3.18. Misprediction rates for three different predictors on SPEC89 as total # of bits is increased

