
Pipelining (Chapter 6)

Overview of Pipelining: Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Pipelining improves performance by increasing instruction throughput; the execution time of an individual instruction is not decreased.

Analogy: Doing laundry: 1. Put clothes in washer to wash. 2. Put clothes in dryer to dry. 3. Put clothes on table to fold. 4. Put clothes away.

Analogy: Non-pipelined laundry (figure).

Analogy: Pipelined laundry (figure).

Example: Assume that the operation times for the major functional units are 200 ps for memory access, 200 ps for ALU operation, and 100 ps for register access.

MIPS Instructions: The 5 stages for a MIPS instruction are Fetch → Reg. Read → ALU Op. → Data access → Reg. Write. Example instructions: lw $s1, 100($s2); sw $s1, 100($s2); add $s1, $s2, $s3; beq $s1, $s2, 25.

Example: Execution time for each instruction class:

Instruction   Fetch     Reg read   ALU op    Data access   Reg write   Total time
lw            200 ps    100 ps     200 ps    200 ps        100 ps      800 ps
sw            200 ps    100 ps     200 ps    200 ps                    700 ps
add           200 ps    100 ps     200 ps                  100 ps      600 ps
beq           200 ps    100 ps     200 ps                              500 ps
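As a cross-check, a minimal Python sketch (not part of the slide deck; names are illustrative) that recomputes each class total from the 200/200/100 ps stage times:

```python
# Stage latencies from the previous slide, in picoseconds.
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Which stages each instruction class actually uses.
STAGES_USED = {
    "lw":  ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw":  ["fetch", "reg_read", "alu", "mem"],
    "add": ["fetch", "reg_read", "alu", "reg_write"],
    "beq": ["fetch", "reg_read", "alu"],
}

for instr, stages in STAGES_USED.items():
    total = sum(STAGE_PS[s] for s in stages)
    print(f"{instr}: {total} ps")   # lw 800, sw 700, add 600, beq 500
```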

Example: For the single-cycle design, we must allow for the slowest instruction (lw), so the time required for every instruction is 800 ps.

Example: Non-pipelined execution of three lw instructions (figure).

Example: Non-pipelined execution of three lw instructions: the time between the first and the fourth instruction is 3 x 800 ps = 2400 ps.

Example: For the pipelined multi-cycle design, each clock cycle must be long enough to accommodate the slowest operation, so the time required for every clock cycle is 200 ps.

Example: Pipelined execution of three lw instructions (figure).

Example: Pipelined execution of three lw instructions: the time between the first and the fourth instruction is 3 x 200 ps = 600 ps. 2400 / 600 = 4, a fourfold performance improvement.
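The same arithmetic in a short Python sketch (illustrative, not from the slides), including the total completion time for comparison:

```python
STAGE_PS = 200      # pipelined clock cycle (slowest stage)
SINGLE_PS = 800     # single-cycle clock (slowest instruction, lw)
N = 3               # three lw instructions, as in the example

# Time between the first and the fourth instruction (the slide's measure):
nonpipelined_gap = N * SINGLE_PS   # 2400 ps
pipelined_gap = N * STAGE_PS       # 600 ps
print(nonpipelined_gap, pipelined_gap, nonpipelined_gap / pipelined_gap)  # 2400 600 4.0

# Total time to finish all three loads (pipeline fill included), for comparison:
stages = 5
print((stages + N - 1) * STAGE_PS)  # 1400 ps, versus 2400 ps without pipelining
```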

Pipeline Hazards: structural hazards, data hazards, and control hazards.

Structural Hazards: There is a structural hazard when the hardware cannot support the combination of instructions that we want to execute in the same clock cycle. Analogy: having a washer/dryer combination.

Example: What happens if we execute four lw instructions one after another? The 1st instruction is accessing data while the 4th instruction is being fetched.

Solution: Have two separate memories, one for instructions and one for data.

Data Hazards: Data hazards occur when the pipeline must be stalled because one step must wait for another to complete. They arise from the dependence of one instruction on an earlier one that is still in the pipeline. Example:
add $s0, $t0, $t1
sub $t2, $s0, $t3
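The dependence can be stated mechanically: the sub reads the register that the add writes. A tiny illustrative check (the helper name and hard-coded registers are made up for this sketch):

```python
def raw_dependence(producer_dest, consumer_sources):
    # True if the later instruction reads the earlier instruction's destination register.
    return producer_dest in consumer_sources

# add $s0, $t0, $t1  followed by  sub $t2, $s0, $t3
print(raw_dependence("$s0", ("$s0", "$t3")))   # True: sub needs $s0 before add writes it back
```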

Solution 1: Compilers can remove the data hazard by moving non-dependent instructions in between.

Solution 2: Observation: we don't need to wait for the add instruction to complete before trying to resolve the data hazard. As soon as the ALU creates the sum for the add, we can supply it as an input for the subtract.

Forwarding: Forwarding, or bypassing, is when extra hardware is added to retrieve the missing item early from the internal resources.

Forwarding: Forwarding paths are valid only if the destination stage is later in time than the source stage.

Forwarding: What happens when we have a sub instruction after a lw instruction?

Forwarding (figure).

Pipeline Stall: Even with forwarding, we need to stall one stage for a load-use data hazard. This is referred to as a pipeline stall.

Example of reordering code: Consider the following code segment in C: A = B + E; C = B + F; Assume that all variables are in memory and are addressable as offsets from $t0.

Example of reordering code: The corresponding MIPS code is:
lw  $t1, 0($t0)    // load B; offset from $t0
lw  $t2, 4($t0)    // load E
add $t3, $t1, $t2  // B + E
sw  $t3, 12($t0)   // store A
lw  $t4, 8($t0)    // load F
add $t5, $t1, $t4  // B + F
sw  $t5, 16($t0)   // store C

Example of reordering code: What are the problems? Each add immediately follows the lw that produces one of its operands ($t2 for the first add, $t4 for the second), so each one is a load-use hazard that costs a stall even with forwarding.
lw  $t1, 0($t0)    // load B; offset from $t0
lw  $t2, 4($t0)    // load E
add $t3, $t1, $t2  // B + E, must wait on lw $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4  // must wait on lw $t4
sw  $t5, 16($t0)

Example of reordering code: Code reordered with no stalls:
lw  $t1, 0($t0)    // load B; offset from $t0
lw  $t2, 4($t0)    // load E
lw  $t4, 8($t0)    // load F
add $t3, $t1, $t2  // B + E
sw  $t3, 12($t0)   // store A
add $t5, $t1, $t4  // B + F
sw  $t5, 16($t0)   // store C
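Why the reordered version avoids stalls can be checked with a rough Python sketch that counts load-use stalls the way the earlier slides describe (one stall whenever a dependent instruction immediately follows a lw); the tuple encoding of instructions is a simplification invented for this sketch and ignores base registers:

```python
# Each instruction: (opcode, destination register, tuple of source registers).
def load_use_stalls(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in curr[2]:
            stalls += 1          # load result is needed by the very next instruction
    return stalls

original = [("lw", "$t1", ()), ("lw", "$t2", ()), ("add", "$t3", ("$t1", "$t2")),
            ("sw", None, ("$t3",)), ("lw", "$t4", ()), ("add", "$t5", ("$t1", "$t4")),
            ("sw", None, ("$t5",))]
reordered = [original[0], original[1], original[4], original[2],
             original[3], original[5], original[6]]
print(load_use_stalls(original), load_use_stalls(reordered))   # 2 stalls versus 0
```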

Control Hazards: A control hazard (also called a branch hazard) arises from the need to make a decision based on the results of one instruction while others are executing. The proper instruction cannot execute in the proper clock cycle because the instruction that was fetched is not the one that is needed. Caused by branch instructions.

Pipelined Datapath (figures).

Pipelined DP for lw: instruction fetch, instruction decode, instruction execute, memory access, and write back (figures), plus a figure showing how to properly handle write back.

Pipelined Control: The pipeline registers (IF/ID, ID/EX, EX/MEM, and MEM/WB) are written in every clock cycle, so there are no separate write signals for them. To specify control for the pipeline, we need only set the control values during each pipeline stage. Each control line is associated with a component that is active in only a single pipeline stage.

Pipelined Control: Divide the control lines into five groups:
1. Instruction fetch – the same operation occurs in every clock cycle, so the relevant signals are always asserted.
2. Instruction decode – same as instruction fetch; nothing special to set.
3. Execution/address calculation – the signals to be set are RegDst, ALUOp, and ALUSrc.
4. Memory access – the signals to be set are Branch, MemRead, and MemWrite; PCSrc is asserted when the branch is taken (Branch combined with the ALU's Zero output).
5. Write back – the signals to be set are MemtoReg and RegWrite.

Pipelined Control: The 9 control signals (figure).

Pipelined Control: Implementing pipelined control means setting the nine control lines to these values in each stage for each instruction.
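One way to picture "setting the nine control lines to these values" is as a lookup table attached to each instruction in ID; the Python sketch below lists the usual per-class values (RegDst, ALUOp, ALUSrc for EX; Branch, MemRead, MemWrite for MEM; RegWrite, MemtoReg for WB). It is a sketch of the textbook control table, not a full controller:

```python
# Nine control signals per instruction class, grouped by the stage that consumes them.
# "x" marks a don't-care value.
CONTROL = {
    "R-format": {"RegDst": 1,   "ALUOp": "10", "ALUSrc": 0,
                 "Branch": 0,   "MemRead": 0,  "MemWrite": 0,
                 "RegWrite": 1, "MemtoReg": 0},
    "lw":       {"RegDst": 0,   "ALUOp": "00", "ALUSrc": 1,
                 "Branch": 0,   "MemRead": 1,  "MemWrite": 0,
                 "RegWrite": 1, "MemtoReg": 1},
    "sw":       {"RegDst": "x", "ALUOp": "00", "ALUSrc": 1,
                 "Branch": 0,   "MemRead": 0,  "MemWrite": 1,
                 "RegWrite": 0, "MemtoReg": "x"},
    "beq":      {"RegDst": "x", "ALUOp": "01", "ALUSrc": 0,
                 "Branch": 1,   "MemRead": 0,  "MemWrite": 0,
                 "RegWrite": 0, "MemtoReg": "x"},
}

# ID attaches the whole bundle to the instruction in the ID/EX register;
# each later pipeline register simply drops the signals already consumed.
print(CONTROL["lw"])
```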

Pipelined Control: The 9 control signals (figure).

Pipelined Control: 4 of the 9 control lines are used in the EX stage; the other 5 are passed on to the EX/MEM register.

Pipelined Control: 3 of the 9 lines are used in the MEM stage; 2 are passed on to the MEM/WB register.

Pipelined Control: 2 of the 9 control lines are used in the WB stage.

Pipelined Control (figure).

Data Hazards: Pipelined dependences for a sequence of 5 instructions (figure).

Forwarding (figure).

Datapath with Forwarding Unit (figure). The figure ignores forwarding of a store value to a store instruction.

Forwarding Unit: The forwarding unit controls the ALU multiplexors to replace the value from a general-purpose register with the value from the proper pipeline register.
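The selection logic is commonly written as an EX-hazard test against EX/MEM and a MEM-hazard test against MEM/WB. Below is a hedged Python paraphrase for the first ALU operand; ForwardB for the second operand is the same test with rt in place of rs, and the field names mirror the pipeline registers but are otherwise illustrative:

```python
def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Select the ALU's first operand: 0b00 = register file, 0b10 = EX/MEM, 0b01 = MEM/WB."""
    # EX hazard: the instruction one ahead is about to write the register we need.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
        return 0b10
    # MEM hazard: the instruction two ahead wrote it, and no closer instruction does
    # (the early return above already covers the "no closer instruction" part).
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
        return 0b01
    return 0b00   # no forwarding; use the value read from the register file

# add $s0,... sitting in EX/MEM, sub ..., $s0, ... currently in EX ($s0 is register 16):
print(forward_a({"RegWrite": 1, "Rd": 16}, {"RegWrite": 0, "Rd": 0}, 16))  # 2, i.e. 0b10
```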

Data Hazards and Stalls: One case where forwarding cannot solve the problem is when an instruction tries to read a register following a load instruction that writes the same register, e.g., a lw followed by a sub.

Data Hazards and Stalls: Since the dependence between the lw and the and goes back in time, this hazard cannot be solved by forwarding.

Inserting a Stall (figure).

Inserting a Stall: The and instruction is turned into a nop, and all instructions beginning with the and instruction are delayed one cycle.

Hazard Detection Unit (figure).

Hazard Detection Unit: The hazard detection unit controls the writing of the PC and IF/ID registers, plus the multiplexor that chooses between the real control values and all 0s. It stalls and deasserts the control fields if the load-use hazard test is true.
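The load-use hazard test referred to above is the standard check on the ID/EX and IF/ID registers; a small Python paraphrase (field names illustrative):

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    # Stall if the instruction in EX is a load whose destination (rt) is a source
    # register of the instruction currently being decoded.
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

# On a stall: hold PC and IF/ID unchanged, and force the ID/EX control bundle to all 0s
# (the multiplexor mentioned above), which inserts a bubble behind the load.
print(load_use_stall(True, 2, 2, 5))    # True: e.g. lw $2, ... followed by and $4, $2, $5
```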

Control Hazard: Pipeline hazards involving branches. The branch instruction decides whether to branch in the MEM stage (clock cycle 4 in the figure); in the meantime, three following instructions will have begun execution.

Control Hazard (figure).

Solutions for Control Hazards: 1. Assume branch not taken. Continue execution down the sequential instruction stream; if the branch is taken, the instructions that are in the pipeline must be discarded and execution continues at the branch target. If branches are untaken half the time, and if it costs little to discard the instructions, this optimization halves the cost of control hazards.

Solutions for Control Hazards: 1. Assume branch not taken (cont.). Discarding instructions means flushing the instructions in the IF, ID, and EX stages of the pipeline: change the original control values to 0s and let them percolate through the pipeline.

Solutions for Control Hazards: 2. Reducing the delay of branches. Reduce the cost of the taken branch by moving the branch execution earlier in the pipeline, so that fewer instructions need to be flushed. This requires two actions to occur earlier: i. computing the branch target address, and ii. evaluating the branch decision.

Solutions for Control Hazards: 2.i. Computing the branch target address. Easy: we already have the PC and the immediate field in the IF/ID pipeline register, so just move the branch adder from the EX stage to the ID stage. The address calculation will be performed for all instructions, but used only when needed.
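Concretely, the ID-stage adder needs only values already in IF/ID. A small sketch of the calculation, assuming PC+4 and a 16-bit immediate as inputs (names are illustrative):

```python
def branch_target(pc_plus_4, imm16):
    # Sign-extend the 16-bit offset, shift left by 2 (word offset), and add to PC+4.
    offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return pc_plus_4 + (offset << 2)

print(hex(branch_target(0x00400004, 0x0003)))   # 0x400010: branch 3 instructions ahead
```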

Branch adder location: moved from the EX stage to the ID stage (figure).

Solutions for Control Hazards: 2.ii. Evaluating the branch decision. Harder: we need to compare the two registers read during the ID stage. During ID, we must decode the instruction, decide whether a bypass to the equality unit is needed (the source can come from the EX/MEM or MEM/WB pipeline registers), complete the comparison, and set the PC to the branch address if necessary.
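Putting those ID-stage steps together, a rough sketch of early branch resolution; the operand values are assumed to be already bypassed where needed, and the function name and encoding are made up for this illustration:

```python
def resolve_branch_in_id(opcode, rs_val, rt_val, pc_plus_4, imm16):
    """Return (next_pc, flush_fetched) for a beq resolved during ID."""
    offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    target = pc_plus_4 + (offset << 2)
    taken = (opcode == "beq") and (rs_val == rt_val)   # dedicated equality test, not the ALU
    next_pc = target if taken else pc_plus_4
    return next_pc, taken   # if taken, the one instruction already fetched is flushed

next_pc, flush = resolve_branch_in_id("beq", 7, 7, 0x00400008, 25)
print(hex(next_pc), flush)   # 0x40006c True: taken branch, target = PC+4 + 25 words
```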

Solutions for Control Hazards: 2.ii. Evaluating the branch decision (cont.). The values in a branch comparison are needed during ID but may be produced later in time, which can cause a data hazard, so a stall might be needed. Example: if an ALU instruction immediately preceding a branch produces one of the operands for the comparison in the branch, a stall will be required. Why? Because the EX stage for the ALU instruction will occur after the ID cycle of the branch.

Solutions for Control Hazards: 2.ii. Evaluating the branch decision (cont.). Example: if a load instruction immediately preceding a branch produces one of the operands for the comparison in the branch, two stalls will be required, because the result from the load appears at the end of the MEM cycle but is needed at the beginning of the ID cycle of the branch.

Solutions for Control Hazards: 2. Reducing the delay of branches (summary). Moving the branch execution to the ID stage is an improvement, since it reduces the penalty of a branch to only one instruction if the branch is taken, namely the one currently being fetched. The flush zeros the instruction field of the IF/ID pipeline register; clearing the register transforms the fetched instruction into a nop.

Solutions for Control Hazards: 3. Dynamic branch prediction. Assuming a branch is not taken is one simple form of branch prediction. With deeper pipelines and multiple issue, the branch penalty increases in terms of instructions lost, and simple static branch prediction wastes too much performance. It is possible to try to predict branch behavior dynamically (i.e., during program execution).

Dynamic Branch Prediction: Implementation: a branch prediction buffer (or branch history table) is used. This is a small memory indexed by the lower portion of the address of the branch instruction; it contains a bit that says whether the branch was recently taken or not.

Dynamic Branch Prediction: Look up the address of the instruction to see whether the branch was taken the last time this instruction was executed; if so, fetch the new instruction from the same place.

Dynamic Branch Prediction: The bit may have been put there by another branch instruction that has the same low-order address bits. If the hint is wrong: the incorrectly predicted instructions are deleted, the prediction bit is inverted and stored back, and the proper sequence is fetched and executed.
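A minimal 1-bit branch prediction buffer along these lines; the table size and the exact indexing are illustrative choices, not something the slides specify:

```python
class OneBitPredictor:
    def __init__(self, entries=1024):
        self.table = [False] * entries        # one taken/not-taken bit per entry
        self.mask = entries - 1

    def _index(self, branch_pc):
        return (branch_pc >> 2) & self.mask   # low-order bits of the branch's word address

    def predict(self, branch_pc):
        return self.table[self._index(branch_pc)]

    def update(self, branch_pc, actually_taken):
        # If the hint was wrong, the stored bit simply gets overwritten (inverted).
        self.table[self._index(branch_pc)] = actually_taken

bp = OneBitPredictor()
bp.update(0x00400100, True)
print(bp.predict(0x00400100))   # True: predicted taken the next time this branch is fetched
```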

Dynamic Branch Prediction: Problem: if the branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken. Example: consider a loop branch that branches nine times in a row, then is not taken once on the tenth time. What is the prediction accuracy, assuming the prediction bit for this branch remains in the prediction buffer?

Dynamic Branch Prediction: Answer: the steady-state prediction behavior will mispredict on the first and last loop iterations. Mispredicting the last iteration is inevitable, since the bit has been set to taken during the first nine iterations. Mispredicting the first iteration happens because the bit was flipped on the last iteration of the prior execution of the loop.

Dynamic Branch Prediction: The prediction accuracy for this branch, which is taken 90% of the time, is only 80% (8 correct out of 10). Ideally, the accuracy of the predictor should match the taken branch frequency for these highly regular branches.
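A quick simulation with a single prediction bit reproduces the 80% figure for the loop example (only the branch outcomes are modeled; the loop body is abstracted away):

```python
outcomes = ([True] * 9 + [False]) * 100   # taken 9 times, not taken once, repeated
bit = False                               # the single prediction bit for this branch
correct = 0
for taken in outcomes:
    if bit == taken:
        correct += 1
    bit = taken                           # 1-bit scheme: remember only the last outcome
print(correct / len(outcomes))            # 0.8 in steady state, even though 90% are taken
```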

Dynamic Branch Prediction: A 2-bit prediction scheme: a prediction must be wrong twice before it is changed.

2-bit prediction scheme (figure).
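For comparison, a sketch of the 2-bit saturating-counter scheme run on the same loop pattern; the 0-3 counter encoding is the usual convention and an illustrative choice here:

```python
counter = 0                                # 0,1 = predict not taken; 2,3 = predict taken
outcomes = ([True] * 9 + [False]) * 100
correct = 0
for taken in outcomes:
    prediction = counter >= 2
    if prediction == taken:
        correct += 1
    # A prediction must be wrong twice before it flips: saturate at 0 and 3.
    counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
print(correct / len(outcomes))             # ~0.9: only the single not-taken iteration is missed
```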

Solutions for Control Hazards: 4. Scheduling the branch delay slot.

Partial MIPS Instructions (all numbers are in decimal):

Instruction   OP (6)   rs (5)   rt (5)    rd (5)   shamt (5)   funct (6)
LW            35       rs       rd        offset
SW            43       rs       rd        offset
BEQ           4        rs       rt        offset
ADD           0        rs       rt        rd       0           32
SUB           0        rs       rt        rd       0           34
AND           0        rs       rt        rd       0           36
OR            0        rs       rt        rd       0           37
SLT           0        rs       rt        rd       0           42
ADDI          8        rs       rt        imm
OUT           63       rs