Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.

Dynamic Branch Prediction

Performance = ƒ(accuracy, cost of misprediction)

The Branch History Table (BHT) is the simplest scheme
– Also called a branch-prediction buffer
– The lower bits of the branch address index a table of 1-bit values
– The bit says whether or not the branch was taken last time
– If the branch was taken last time, predict taken again
– Initially, all bits are set to predict that branches are taken

Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop:
– At the end of the loop, when it exits instead of looping as before
– On the first iteration the next time through the loop, when it predicts exit instead of looping

    LOOP:  LOAD  R1, 100(R2)
           MUL   R6, R6, R1
           SUBI  R2, R2, #4
           BNEZ  R2, LOOP
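A minimal sketch in C of the 1-bit scheme described above; the table size, index function, and names are assumptions for illustration, not from the lecture:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096                 /* assumed table size */
    static bool bht[BHT_ENTRIES];            /* 1 bit per entry: taken last time? */

    /* Index with the low-order bits of the (word-aligned) branch address. */
    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);
    }

    void bht_init(void) {                    /* slide: initially predict all taken */
        for (unsigned i = 0; i < BHT_ENTRIES; i++) bht[i] = true;
    }

    bool bht_predict(uint32_t pc) {          /* predict taken if taken last time */
        return bht[bht_index(pc)];
    }

    void bht_update(uint32_t pc, bool taken) {  /* record the actual outcome */
        bht[bht_index(pc)] = taken;
    }

On the loop above, this scheme mispredicts twice per loop execution: once at loop exit, and once on the first iteration the next time the loop runs.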

Dynamic Branch Prediction

Solution: a 2-bit predictor scheme that changes its prediction only after mispredicting twice in a row (Figure 4.13, p. 264). The state diagram has two Predict Taken states and two Predict Not Taken states; taken (T) outcomes move the state toward Predict Taken, and not-taken (NT) outcomes move it toward Predict Not Taken.

This idea can be extended to n-bit saturating counters:
– Increment the counter when the branch is taken
– Decrement the counter when the branch is not taken
– If the counter >= 2^(n-1), predict the branch taken; otherwise predict not taken
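A sketch of the 2-bit (n = 2) saturating-counter version, reusing BHT_ENTRIES and bht_index from the 1-bit sketch above:

    static uint8_t bht2[BHT_ENTRIES];        /* 2-bit counters, values 0..3 */

    /* Predict taken when the counter is in the upper half (>= 2 = 2^(n-1)). */
    bool bht2_predict(uint32_t pc) {
        return bht2[bht_index(pc)] >= 2;
    }

    /* Saturating increment on taken, saturating decrement on not taken. */
    void bht2_update(uint32_t pc, bool taken) {
        uint8_t *c = &bht2[bht_index(pc)];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
    }

Because the prediction flips only after two consecutive mispredictions, a loop branch that is almost always taken now mispredicts only once per loop execution.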

2-bit BHT Accuracy

Mispredictions occur because:
– The branch is encountered for the first time
– The guess for that branch is wrong (e.g., at the end of a loop)
– The history of the wrong branch is read when indexing the table, because two branches map to the same entry (can happen in large programs)

With a 4096-entry 2-bit table, misprediction rates vary with the program:
– 1% for nasa7 and tomcatv (lots of loops with many iterations)
– 9% for spice
– 12% for gcc
– 18% for eqntott (few loops, branches relatively hard to predict)

A 4096-entry table is about as good as an infinite table. Instead of using a separate table, the branch-prediction bits can be stored in the instruction cache.

Correlating Branches

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.

Idea: record the outcomes (taken or not taken) of the m most recently executed branches, and use that pattern to select the proper branch history table.

In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters.
– The plain 2-bit BHT is then a (0,2) predictor

Correlating Branches

Often the behavior of one branch is correlated with the behavior of other branches. For example (assume aa is in R1 and bb is in R2):

    C code              DLX code
    if (aa == 2)            SUBI R3, R1, #2
        aa = 0;             BNEZ R3, L1        ; branch b1 (aa != 2)
                            ADD  R1, R0, R0    ; aa = 0
    if (bb == 2)        L1: SUBI R3, R2, #2
        bb = 0;             BNEZ R3, L2        ; branch b2 (bb != 2)
                            ADD  R2, R0, R0    ; bb = 0
    if (aa != bb)       L2: SUB  R3, R1, R2
        cc = 4;             BEQZ R3, L3        ; branch b3 (aa == bb)
                            ADDI R4, R0, #4    ; cc = 4
                        L3:

If the first two branches are not taken, the third one will be taken.

Correlating Predictors

Correlating predictors, or two-level predictors, use the behavior of other branches to predict whether the current branch is taken.
– An (m, n) predictor uses the behavior of the last m branches to choose from 2^m sets of n-bit predictors.
– The predictor is accessed using the low-order k bits of the branch address together with the m-bit global history.
– The number of bits needed to implement an (m, n) predictor that uses k bits of the branch address is 2^m x n x 2^k.
– In the figure (a (2, 2) predictor with 2-bit per-branch predictors, indexed by the branch address and the global history): m = 2, n = 2, k = 4, giving 2^2 x 2 x 2^4 = 128 bits.
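A sketch of a (2, 2) predictor organized along these lines; the global-history register, the table shape, and the names are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define M 2                              /* global history bits (last M branches) */
    #define K 4                              /* low-order branch-address bits used */

    static uint8_t ctr[1 << M][1 << K];      /* 2^M x 2^K two-bit counters */
    static unsigned ghist;                   /* M-bit global history register */

    /* Total state: 2^M x 2 x 2^K = 4 x 2 x 16 = 128 bits, as on the slide. */

    bool corr_predict(uint32_t pc) {
        unsigned col = (pc >> 2) & ((1 << K) - 1);
        return ctr[ghist][col] >= 2;         /* predict taken in the upper half */
    }

    void corr_update(uint32_t pc, bool taken) {
        unsigned col = (pc >> 2) & ((1 << K) - 1);
        uint8_t *c = &ctr[ghist][col];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
        /* Shift the outcome into the global history register. */
        ghist = ((ghist << 1) | (taken ? 1u : 0u)) & ((1 << M) - 1);
    }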

Accuracy of Different Schemes (Figure 4.21, p. 272)

The figure plots the frequency of mispredictions (up to 18%) for three schemes: a 4096-entry 2-bit BHT, a 2-bit BHT with an unlimited number of entries, and a 1024-entry (2,2) BHT.

Branch-Target Buffers

DLX computes the branch target in the ID stage, which leads to a one-cycle stall when a branch is taken. A branch-target buffer, or branch-target cache, stores the predicted target address of branches that are predicted taken. Branches not found in the buffer are predicted not taken. The branch-target buffer is accessed during the IF stage, using the k low-order bits of the branch address. If the branch is in the buffer and is predicted correctly, the one-cycle stall is eliminated.

Branch Target Buffer

The figure shows the buffer organization: the PC of the instruction being fetched is compared against the stored "PC of instruction" tags; on a match (Yes), predict a taken branch and use the stored predicted target PC as the next PC; on a miss (No), predict not a taken branch and proceed normally.

– For predictors with more than a single bit, the buffer also needs to store prediction information
– Variation: instead of storing the predicted target PC, store the target instruction itself
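A minimal sketch of the IF-stage lookup and post-resolution update for a branch-target buffer as described above; the buffer size, direct-mapped organization, and names are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 256                  /* assumed buffer size */

    struct btb_entry {
        bool     valid;
        uint32_t branch_pc;                  /* tag: address of the branch */
        uint32_t target_pc;                  /* predicted target address */
    };
    static struct btb_entry btb[BTB_ENTRIES];

    /* IF stage: on a hit, predict taken and fetch from the stored target;
     * on a miss, predict not taken and fall through. */
    uint32_t btb_next_pc(uint32_t fetch_pc) {
        unsigned i = (fetch_pc >> 2) & (BTB_ENTRIES - 1);
        if (btb[i].valid && btb[i].branch_pc == fetch_pc)
            return btb[i].target_pc;
        return fetch_pc + 4;
    }

    /* After the branch resolves: keep taken branches in the buffer,
     * drop entries for branches that turned out not taken. */
    void btb_update(uint32_t branch_pc, bool taken, uint32_t target_pc) {
        unsigned i = (branch_pc >> 2) & (BTB_ENTRIES - 1);
        if (taken) {
            btb[i].valid = true;
            btb[i].branch_pc = branch_pc;
            btb[i].target_pc = target_pc;
        } else if (btb[i].valid && btb[i].branch_pc == branch_pc) {
            btb[i].valid = false;
        }
    }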

Issuing Multiple Instructions/Cycle

Two variations:

Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo)
– IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

(Very) Long Instruction Word (V)LIW: a fixed number of instructions (4-16) scheduled by the compiler; operations are packed into wide instruction templates

The anticipated success of multiple issue led to the use of Instructions Per Clock cycle (IPC) rather than CPI.

Superscalar DLX

Superscalar DLX: 2 instructions per cycle; 1 FP op and 1 other
– Fetch 64 bits per clock cycle; integer instruction on the left, FP on the right
– Can only issue the 2nd instruction if the 1st instruction issues
– Need 2 more ports on the FP register file so an FP load or FP store can execute alongside an FP op

    Type               Pipe stages
    Int. instruction   IF  ID  EX  MEM WB
    FP instruction     IF  ID  EX  MEM WB
    Int. instruction       IF  ID  EX  MEM WB
    FP instruction         IF  ID  EX  MEM WB
    Int. instruction           IF  ID  EX  MEM WB
    FP instruction             IF  ID  EX  MEM WB

The 1-cycle load delay expands to cover 3 instruction slots in the 2-way superscalar:
– the instruction in the right half of the load's issue pair cannot use the result, nor can the instructions in the next pair
Branches similarly have a delay of 3 instruction slots

Unrolled Loop that Minimizes Stalls for Scalar

     1  Loop: LD    F0,0(R1)
     2        LD    F6,-8(R1)
     3        LD    F10,-16(R1)
     4        LD    F14,-24(R1)
     5        ADDD  F4,F0,F2
     6        ADDD  F8,F6,F2
     7        ADDD  F12,F10,F2
     8        ADDD  F16,F14,F2
     9        SD    0(R1),F4
    10        SD    -8(R1),F8
    11        SUBI  R1,R1,#32
    12        SD    16(R1),F12
    13        BNEZ  R1,LOOP
    14        SD    8(R1),F16

14 clock cycles, or 3.5 per iteration
Assumed latencies: LD to ADDD, 1 cycle; ADDD to SD, 2 cycles

Loop Unrolling in Superscalar

          Integer instruction    FP instruction       Clock cycle
    Loop: LD   F0,0(R1)                               1
          LD   F6,-8(R1)                              2
          LD   F10,-16(R1)       ADDD F4,F0,F2        3
          LD   F14,-24(R1)       ADDD F8,F6,F2        4
          LD   F18,-32(R1)       ADDD F12,F10,F2      5
          SD   0(R1),F4          ADDD F16,F14,F2      6
          SD   -8(R1),F8         ADDD F20,F18,F2      7
          SD   -16(R1),F12                            8
          SUBI R1,R1,#40                              9
          SD   16(R1),F16                             10
          BNEZ R1,LOOP                                11
          SD   8(R1),F20                              12

Unrolled 5 times to avoid delays (+1 due to the superscalar issue restrictions)
12 clocks, or 2.4 clocks per iteration

Dynamic Scheduling in Superscalar

How can we issue two instructions per cycle and keep instruction issue in order for Tomasulo's algorithm?
– Assume 1 integer + 1 floating-point instruction per cycle
– One set of Tomasulo control logic for integer, one for floating point
Run the issue stage at 2X the clock rate, so that issue remains in order.
Only FP loads might cause a dependence between the integer and FP issue streams:
– Replace the load reservation stations with a load queue; operands must be read in the order they are fetched
– A load checks addresses in the store queue to avoid a RAW violation
– A store checks addresses in the load and store queues to avoid WAR and WAW violations
This organization is called a "decoupled architecture".

Limits of Superscalar

While the integer/FP split is simple for the hardware, a CPI of 0.5 is reached only for programs with:
– Exactly 50% FP operations
– No hazards
The more instructions that issue at the same time, the greater the difficulty of decode and issue:
– Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue
Issue rates of modern processors vary between 2 and 8 instructions per cycle.

VLIW Processors

Very Long Instruction Word (VLIW) processors
– Trade off instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word can execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 memory references, 1 branch
  » With 16 to 24 bits per field, the word is 7*16 = 112 to 7*24 = 168 bits wide
– Need compiling techniques that schedule across branches
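As a rough illustration of the instruction-space tradeoff, one long instruction word with the seven slots from the example above might be represented like this; the field names and widths are hypothetical, not a real encoding:

    #include <stdint.h>

    /* One VLIW instruction word: 7 operation slots, all of which the
     * compiler guarantees are independent and can execute in parallel.
     * At 16-24 bits per slot, the packed word is 112-168 bits wide. */
    struct vliw_word {
        uint32_t int_op[2];      /* 2 integer operation slots */
        uint32_t fp_op[2];       /* 2 FP operation slots */
        uint32_t mem_ref[2];     /* 2 memory reference slots */
        uint32_t branch;         /* 1 branch slot */
    };
    /* Slots the compiler cannot fill are encoded as NOPs, which is one
     * source of the VLIW code-size growth noted later. */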

Loop Unrolling in VLIW

    Memory ref 1      Memory ref 2      FP operation 1     FP operation 2     Int. op/branch    Clock
    LD F0,0(R1)       LD F6,-8(R1)                                                              1
    LD F10,-16(R1)    LD F14,-24(R1)                                                            2
    LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                        3
    LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                      4
                                        ADDD F20,F18,F2    ADDD F24,F22,F2                      5
    SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                         6
    SD -16(R1),F12    SD -24(R1),F16                                                            7
    SD -32(R1),F20    SD -40(R1),F24                                          SUBI R1,R1,#48    8
    SD -0(R1),F28                                                             BNEZ R1,LOOP      9

Unrolled 7 times to avoid delays: 7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers in VLIW

Limits to Multi-Issue Machines

Limitations specific to either the superscalar or the VLIW implementation:
– Decode/issue complexity in superscalar
– VLIW code size: unrolled loops plus wasted fields in the long instruction words
– VLIW lockstep operation => 1 hazard stalls all instructions
– VLIW binary compatibility is a practical weakness

Inherent limitations of ILP:
– 1 branch per 5 instructions => how do we keep a 5-way VLIW busy?
– Functional-unit latencies => many operations must be scheduled
– Need about (pipeline depth) x (number of functional units) independent instructions to keep all units busy

Summary

Branch prediction
– Branch History Table: 2 bits per entry for loop accuracy
– Correlation: recently executed branches are correlated with the next branch
– Branch Target Buffer: holds the branch address and predicted target for branches predicted taken

Superscalar and VLIW
– CPI < 1
– Superscalar is more hardware dependent (dynamic)
– VLIW is more compiler dependent (static)
– The more instructions that issue at the same time, the larger the penalties for hazards