Appendix A. Pipelining: Basic and Intermediate Concept

Rung-Bin Lin

What is Pipelining?
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
Pipe stage (pipe segment): one step in the pipeline; each stage completes a part of an instruction.
Throughput: how often an instruction exits the pipeline.
Machine cycle: the time required between moving an instruction one step down the pipeline. This time is equal to the time required for the slowest pipe stage. In a computer, the machine cycle is usually one clock cycle.
The pipeline designer's goal is to balance the length of each pipe stage. If the stages are perfectly balanced, the time per instruction on the pipelined machine equals the time per instruction on the unpipelined machine divided by the number of pipe stages.

A Simple Implementation of A RISC ISA
Five-cycle implementation
Instruction fetch cycle (IF)
Instruction decode/register fetch cycle (ID)
  Operand fetches; sign-extending the immediate field.
  Decoding is done in parallel with reading registers. This technique is known as fixed-field decoding.
  Test the branch condition and compute the branch address; the branch is finished at the end of this cycle.
Execution/effective address cycle (EX)
  Memory reference; Register-Register ALU instruction; Register-Immediate ALU instruction.
Memory access/branch completion cycle (MEM)
Write-back cycle (WB)
  Load instruction;

Performance of the Five-Cycle Implementation
CPI = 4.54
Branch instructions (12%) take 2 cycles.
Store instructions (10%) require 4 cycles.
All other instructions take 5 cycles.
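The CPI figure can be checked by weighting each instruction class by its frequency. A minimal sketch in Python; the 78% share for the remaining instructions is not stated on the slide and is assumed here as 100% - 12% - 10%, and the names `MIX` and `average_cpi` are illustrative.

```python
# Instruction mix for the unpipelined five-cycle implementation.
# Frequencies are in percent. The 78% for "other" is an assumption
# (100 - 12 - 10); the slide gives only the branch and store shares.
MIX = {
    "branch": (12, 2),   # (frequency in %, cycles)
    "store":  (10, 4),
    "other":  (78, 5),
}

def average_cpi(mix):
    """Frequency-weighted average of cycles per instruction."""
    total = sum(freq for freq, _ in mix.values())
    return sum(freq * cycles for freq, cycles in mix.values()) / total

print(average_cpi(MIX))  # 4.54
```

The result, (12*2 + 10*4 + 78*5) / 100, matches the CPI quoted on the slide.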

The Classic Five-Stage Pipeline for a RISC Processor

The RISC Pipeline with Registers

Instruction Issue: the process of letting an instruction move from the instruction decode stage (ID) into the execution stage (EX) of this pipeline.

Basic Performance Issues in Pipelining
Pipelining increases instruction execution throughput, but it does not reduce the execution time of an individual instruction; pipeline overhead in fact lengthens it slightly.
Sources of pipeline overhead: register delay and clock skew.
The limitation on pipeline depth is due to pipeline latency, pipe stage imbalance, and pipeline overhead.
Example in A-10.
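The effect of stage imbalance and overhead on speedup can be sketched numerically. The sketch below assumes that the unpipelined machine spends the sum of the stage times on each instruction and that the pipelined clock period is the slowest stage plus a lumped overhead (register delay and clock skew). The stage times and the 200 ps overhead are hypothetical values, not taken from the example in A-10.

```python
def pipelined_speedup(stage_times_ps, overhead_ps):
    """Speedup of a pipelined design over an unpipelined one.

    Assumption: unpipelined time per instruction = sum of stage times;
    pipelined clock period = slowest stage + overhead.
    """
    unpipelined = sum(stage_times_ps)
    cycle = max(stage_times_ps) + overhead_ps
    return unpipelined / cycle

# Hypothetical stage times (IF, ID, EX, MEM, WB) in picoseconds.
stages = [1000, 800, 1000, 1000, 700]

print(pipelined_speedup(stages, 0))    # 4.5  (below 5 because of imbalance)
print(pipelined_speedup(stages, 200))  # 3.75 (overhead lowers it further)
```

With five perfectly balanced stages and no overhead the same function returns 5, the pipeline depth.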

The Major Hurdle of Pipelining - Pipelining Hazards
A hazard is a situation that prevents the next instruction in the instruction stream from executing during its designated clock cycle.
Three classes of hazards
  Structural hazard: arises from resource conflicts.
  Data hazard: arises when an instruction depends on the result of a previous instruction.
  Control hazard: arises from branches and other instructions that change the PC.
A pipeline can be stalled by a hazard. When an instruction is stalled to clear a hazard:
  Instructions issued later than the stalled instruction are also stalled.
  Instructions issued earlier than the stalled one must continue.
Note that a cache miss stalls the whole pipeline.

Performance of Pipeline with Stalls
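When pipelining is thought of as decreasing the CPI (ideal CPI of 1, equal clock cycle times),
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

When pipelining is thought of as improving the clock cycle time (CPI of 1 on both machines),
Speedup = [1 / (1 + Pipeline stall cycles per instruction)] x (Clock cycle unpipelined / Clock cycle pipelined)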
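A small numeric sketch of the two views of speedup with stalls. It assumes the standard textbook forms: speedup = depth / (1 + stall cycles per instruction) for the CPI view, and speedup = 1 / (1 + stall cycles per instruction) times the clock ratio for the clock-cycle view. The function names and the sample values are illustrative.

```python
def speedup_cpi_view(pipeline_depth, stall_cycles_per_instruction):
    """CPI view: ideal pipelined CPI is 1, clock cycles are equal (assumed)."""
    return pipeline_depth / (1 + stall_cycles_per_instruction)

def speedup_clock_view(stall_cycles_per_instruction,
                       clock_unpipelined, clock_pipelined):
    """Clock view: both machines have an ideal CPI of 1 (assumed)."""
    return (1 / (1 + stall_cycles_per_instruction)) * (
        clock_unpipelined / clock_pipelined)

print(speedup_cpi_view(5, 0))     # 5.0, no stalls: speedup equals the depth
print(speedup_cpi_view(5, 0.25))  # 4.0, a quarter stall cycle per instruction
```

When the stages are perfectly balanced and there is no overhead, the clock ratio equals the pipeline depth and the two views give the same number.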

Structural Hazards: due to resource conflicts (Example in A-14)
They arise when some functional unit is not fully pipelined, or when some resource has not been duplicated enough.

Data Hazards: a memory access depends on the results of unfinished instructions.

Forwarding (Bypassing) ALU Results To Minimize Hazards

Forwarding (Bypassing) Results to Store

Bypassing Results of LOAD

Data Hazard Classification
Consider two instructions i and j, with i occurring before j. The possible hazards are:
RAW (read after write): j tries to read a source before i writes it.
WAW (write after write): j tries to write an operand before it is written by i. For example,
  LW   R1, 0(R2)    IF    ID    EX    MEM1  MEM2  WB
  DADD R1, R2, R          IF    ID    EX    WB
WAR (write after read): j tries to write a destination before it is read by i. For example, if the read is done in the second half of MEM2 and the write is done in the first half of WB,
  SW   0(R1), R     IF    ID    EX    MEM1  MEM2  WB
  DADD R2, R3, R          IF    ID    EX    WB
RAR (read after read): not a hazard.

Data Hazards Requiring Stalls
Pipeline interlock: a piece of hardware that detects a hazard and stalls the pipeline until the hazard is cleared.
Load interlock
Example (Fig. A.10 at A-21)

Control Hazards: caused by instructions that change the PC. Some basics
If a branch changes the PC to its target address, it is a taken branch. If it does not change the PC, it falls through, or is not taken.
Recall that if an instruction i is a taken branch, the PC is normally not changed until the end of ID. A stall cycle is required.
  Branch instruction  IF    ID    EX    MEM   WB
  Branch successor          IF    IF    ID    EX    MEM   WB
  Branch successor                      IF    ID    EX    MEM   WB
  Branch successor                            IF    ID    EX    MEM   WB

Branch Penalty
Branch delay: the length of a control hazard.
Branch penalty: the branch delay, unless it is dealt with, turns into a branch penalty. The deeper the pipeline, the worse the branch penalty.
The number of branch stalls can be reduced in two steps:
  Find out whether the branch is taken or not taken earlier in the pipeline.
  Compute the taken PC (i.e., the address of the branch target) earlier.
Branch behavior in programs
  Average frequency of taken branches: 67%
  60% of the forward branches are taken.
  85% of the backward branches are taken.

Reducing Pipeline Branch Penalties
Static branch prediction methods (compile-time guess)
Freeze or flush the pipeline: hold or delete any instructions after the branch until the branch destination is known.
Predict-not-taken (untaken) (Fig. A.12 in A-23)
Predict-taken: does it have any advantage? Ans: no, because in this pipeline the target address is not known any earlier than the branch outcome.
Delayed branch: the execution sequence with a branch delay of length n is
  Branch instruction
  Sequential successor 1
  Sequential successor 2
  …
  Sequential successor n (n=1 for MIPS)
  Branch target if taken
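The freeze-or-flush and predict-not-taken schemes can be compared with a rough CPI estimate. The sketch assumes an ideal CPI of 1, a one-cycle branch penalty, the 12% branch frequency and 67% taken rate quoted on earlier slides, and no other stalls. Under freeze or flush every branch pays the penalty; under predict-not-taken only taken branches do. The function names are illustrative.

```python
def cpi_freeze(branch_freq, penalty, ideal_cpi=1.0):
    """Every branch stalls the pipeline for `penalty` cycles."""
    return ideal_cpi + branch_freq * penalty

def cpi_predict_not_taken(branch_freq, taken_fraction, penalty, ideal_cpi=1.0):
    """Only taken branches (mispredictions) pay the penalty."""
    return ideal_cpi + branch_freq * taken_fraction * penalty

BRANCH_FREQ = 0.12   # from the five-cycle implementation slide
TAKEN = 0.67         # average frequency of taken branches
PENALTY = 1          # one stall cycle, as in the basic pipeline

print(round(cpi_freeze(BRANCH_FREQ, PENALTY), 4))                    # 1.12
print(round(cpi_predict_not_taken(BRANCH_FREQ, TAKEN, PENALTY), 4))  # 1.0804
```

Because most branches are taken, predict-not-taken recovers only about a third of the branch stalls in this estimate.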

Scheduling the Branch Delay Slot

Effectiveness of Scheduling Branch Delay Slots
Requirements for being effective
  Scheduling from before: always
  Scheduling from target: when the branch is taken
  Scheduling from fall-through: when the branch is not taken
The limitations on delayed-branch scheduling arise from
  The restrictions on the instructions that are scheduled into the delay slots.
  The ability to predict at compile time whether a branch is likely to be taken or not.
Using a canceling (nullifying) branch to relieve the limits
  In a canceling branch, the instruction includes the direction in which the branch was predicted.
  When the branch behaves as predicted, the instruction in the branch delay slot is simply executed.
  Otherwise, the instruction in the branch delay slot is turned into a no-op.
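The canceling-branch rule reduces to a comparison between the predicted and the actual direction. A minimal sketch; the function name and the string results are illustrative only.

```python
def delay_slot_action(predicted_taken, actually_taken):
    """What happens to the delay-slot instruction of a canceling branch."""
    if predicted_taken == actually_taken:
        return "execute"   # branch behaved as predicted
    return "no-op"         # misprediction: the slot instruction is canceled

print(delay_slot_action(True, True))    # execute
print(delay_slot_action(True, False))   # no-op
```

This is what lets the compiler fill the slot from the target or the fall-through path without first proving that the instruction is harmless on the other path.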

How Is Pipelining Implemented?
Unpipelined 5-cycle implementation

Simple Pipelining Implementation for MIPS

Implementing the Control for MIPS Pipeline
Implementing the control focuses on detecting hazards and generating the control signals for forwarding.
Hazard detection
  All data hazards can be checked, and forwarding control signals set, during the ID phase. If a data hazard exists, the instruction is stalled before it is issued.
  Alternatively, hazards and forwarding are checked at the beginning of a clock cycle that uses an operand (EX and MEM for the MIPS pipeline).
Implementing the logic for hazard detection
  Hazards are detected by comparing the destination and sources of adjacent instructions (Fig. A.20 on page A-34).
  An example shows the detection of all load interlocks when the instruction using the load result is in the ID stage (Fig. A.21 on page A-34).
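Load-interlock detection in ID can be sketched as a comparison between the destination of the load held in the ID/EX pipeline register and the sources of the instruction held in IF/ID. The dictionary layout (`op`, `dest`, `sources`) is a hypothetical stand-in for the IR fields of the pipeline registers, not the encoding used in Fig. A.21.

```python
def needs_load_interlock(id_ex, if_id):
    """True if the instruction in ID must stall one cycle behind a load.

    id_ex: instruction currently in EX (hypothetical dict form)
    if_id: instruction currently in ID (hypothetical dict form)
    """
    if id_ex["op"] != "load":
        return False                      # only loads need this interlock
    return id_ex["dest"] in if_id["sources"]

load  = {"op": "load", "dest": "R1", "sources": ("R2",)}        # LD   R1, 0(R2)
use   = {"op": "alu",  "dest": "R4", "sources": ("R1", "R5")}   # DSUB R4, R1, R5
other = {"op": "alu",  "dest": "R6", "sources": ("R7", "R8")}   # DADD R6, R7, R8

print(needs_load_interlock(load, use))    # True
print(needs_load_interlock(load, other))  # False
```

A result produced by an ALU instruction never triggers this check, since it can be forwarded in time without a stall.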

Implementing Forwarding Logic
Forwarding sources: ALU output or data memory output.
Forwarding destinations: ALU input, data memory input, or zero detection unit (for BRANCH).
Forwarding can be implemented by checking the following conditions:
  EX/MEM.IR.destination = ID/EX.IR.source ?
  MEM/WB.IR.destination = ID/EX.IR.source ?
  MEM/WB.IR.destination = EX/MEM.IR.source ?
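The first two conditions select the value for an ALU input. A minimal sketch: when both pipeline registers hold a matching destination, the EX/MEM value is the more recent one and takes priority. The function name and the returned labels are illustrative. As a simplification, a destination of `None` stands for an instruction that writes no register, and the case of a load in EX/MEM (whose data is not yet available) is ignored.

```python
def alu_input_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where an ALU operand for the instruction in EX comes from."""
    if src_reg is not None and src_reg == ex_mem_dest:
        return "EX/MEM"         # result of the immediately preceding instruction
    if src_reg is not None and src_reg == mem_wb_dest:
        return "MEM/WB"         # result from two instructions back
    return "register file"      # no forwarding needed

# DADD R1, R2, R3 ; DSUB R4, R1, R5 ; AND R6, R1, R7  (hypothetical sequence)
print(alu_input_source("R1", ex_mem_dest="R1", mem_wb_dest=None))  # EX/MEM
print(alu_input_source("R1", ex_mem_dest="R4", mem_wb_dest="R1"))  # MEM/WB
print(alu_input_source("R7", ex_mem_dest="R4", mem_wb_dest="R1"))  # register file
```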

Forwarding Data to the Two ALU Inputs

Dealing with Branches in the Pipeline

What Makes Pipelining Hard to Implement
Exceptions (interrupts, faults) make pipelining difficult to implement.
Instruction set complications

Types of Exceptions
I/O device request
Invoking an OS service from a user program
Tracing instruction execution
Breakpoint
Integer arithmetic overflow or underflow
FP arithmetic anomaly
Page fault
Misaligned memory access
Memory-protection violation
Using an undefined instruction
Hardware malfunction
Power failure
Exceptions for different architectures (Fig. A.26 on page A-40).

Classification of Exceptions
Synchronous versus asynchronous: if the event occurs at the same place every time the program is executed with the same data and memory allocation, the event is called synchronous.
User requested versus coerced
User maskable versus nonmaskable
Within versus between instructions: depends on whether the event prevents instruction completion by occurring in the middle of execution or whether it is recognized between instructions.
Resume versus terminate (fig on page 182).

Action Requirements for Different Exception Types (Fig. A.27 on page A-42)
Actions: Resume, Terminate
The most difficult exceptions have two properties:
  They occur within instructions (i.e., at the EX or MEM stages).
  They must be restartable (the PC of the instruction at which to restart must be saved).

Exception Handling: stopping and restarting execution
Force a trap instruction on the next IF.
Until the trap is taken, turn off all writes for the faulting instruction and for all instructions that follow in the pipeline.
After the exception-handling routine in the operating system receives control, it immediately saves the PC of the faulting instruction.
  Faulting instruction  IF    ID    EX    MEM   WB
                              IF    ID    EX    MEM   WB
                                    IF    ID    EX    MEM   WB
                                          IF    ID    EX    MEM   WB
                                                IF    ID    EX    MEM
  Trap instruction                                    IF    ID    EX
If delayed branch is used, we need to save and restore as many PCs as the length of the branch delay plus one.

Precise Interrupt
If a pipeline can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch, the pipeline is said to have precise exceptions (interrupts). Supporting precise interrupts is a requirement in many systems.
Exceptions in DLX
With pipelining, multiple exceptions may occur in the same clock cycle (Fig. A.28 on page A-44).

Implementations of Precise Exceptions
Principle
  The pipeline should be able to handle the exceptions caused by instruction i prior to the exceptions caused by instruction i+1.
Implementation
  Hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction.
  Once an exception indication is set in the exception status vector, any control signal that may cause a data value to be written is turned off.
  When an instruction enters WB, the exception status vector is checked. If any exceptions are posted, they are handled in the order in which they would occur in time on an unpipelined machine.
  This guarantees that all exceptions will be seen on instruction i before any are seen on i+1.
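The status-vector scheme can be sketched in a few lines. The model below is deliberately simplified: each in-flight instruction carries a list of (stage, cause) entries, posting an entry disables its writes, and instructions are checked in program order, as they would be when reaching WB. The class and function names and the two example instructions are hypothetical.

```python
STAGE_ORDER = ("IF", "ID", "EX", "MEM")   # order of events on an unpipelined machine

class InFlight:
    def __init__(self, text):
        self.text = text
        self.status_vector = []      # posted exceptions as (stage, cause)
        self.writes_enabled = True

    def post(self, stage, cause):
        """Record an exception and turn off all writes for this instruction."""
        self.status_vector.append((stage, cause))
        self.writes_enabled = False

def first_exception(program_order):
    """Check instructions as they reach WB, i.e. in program order."""
    for instr in program_order:
        if instr.status_vector:
            stage, cause = min(instr.status_vector,
                               key=lambda e: STAGE_ORDER.index(e[0]))
            return instr.text, stage, cause
    return None

ld = InFlight("LD R1, 0(R2)")
dadd = InFlight("DADD R3, R4, R5")
dadd.post("IF", "page fault")   # raised in an earlier clock cycle...
ld.post("MEM", "page fault")    # ...than this one, but LD is earlier in the program

print(first_exception([ld, dadd]))  # ('LD R1, 0(R2)', 'MEM', 'page fault')
```

Although the instruction-fetch fault of the second instruction is raised first in time, the data page fault of the earlier instruction is the one handled first.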

Instruction Committed
When an instruction is guaranteed to complete, it is called committed. In the MIPS pipeline, all instructions are committed when they reach the end of the MEM stage, and no instruction updates the state before that stage. Thus precise exceptions are straightforward.

Instruction Set Complications
Some machines have instructions that change the state in the middle of the instruction execution.
  VAX: autoincrement addressing mode.
  VAX or IBM 360: string copy.
Implicitly set condition codes cause difficulties in scheduling any pipeline delays between setting the condition code and the branch.
  ADD XXX      <--- Sets condition code C.
  …            <--- Cannot place instructions that change C.
  BR C, YYY    <--- Uses C for the branch.
In fact, the condition code must be treated as an operand that requires hazard detection for RAW hazards with branches, no matter whether the condition code is set implicitly or explicitly.
Multicycle operations in VAX

Extending the MIPS Pipeline to Handle Multi-Cycle Operations
Assuming four separate functional units in our MIPS implementation:
  Integer unit: handles loads and stores, ALU operations, and branches.
  FP and integer multiplier
  FP adder
  FP and integer divider
If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.

MIPS Pipeline with Multi-cycle Functional Units

Pipelining Multi-cycle Functional Units

Latency and Initiation (Repeat) Interval
Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
Initiation (repeat) interval: the number of cycles that must elapse between issuing two operations of a given type.
Latency and initiation interval for pipelining multi-cycle functional units
  Functional Unit         Latency   Initiation interval
  Integer ALU
  Data memory access      1         1
  FP add
  FP (integer) multiply   6         1
  FP (integer) divide
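The two definitions translate directly into issue-cycle arithmetic. A minimal sketch using only the two rows whose values appear in the table (data memory access: latency 1, interval 1; FP multiply: latency 6, interval 1). The function names and the choice of cycle 1 as the starting point are illustrative.

```python
def earliest_consumer_issue(producer_issue_cycle, latency):
    """Latency = intervening cycles between producer and consumer,
    so the consumer can issue latency + 1 cycles after the producer."""
    return producer_issue_cycle + latency + 1

def earliest_next_same_type(issue_cycle, initiation_interval):
    """Earliest cycle at which another operation of the same type can issue."""
    return issue_cycle + initiation_interval

# FP multiply issued in cycle 1 (latency 6, initiation interval 1).
print(earliest_consumer_issue(1, 6))   # 8
print(earliest_next_same_type(1, 1))   # 2

# Load issued in cycle 1 (latency 1): one intervening cycle before its user.
print(earliest_consumer_issue(1, 1))   # 3
```

An initiation interval of 1 means the unit is fully pipelined: a new operation can start every cycle even though each result takes several cycles to appear.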

Hazards and Forwarding in Longer Latency Pipelines
Hazard detection and forwarding work as before, with the following new aspects:
  Structural hazards can occur because the divide unit is not fully pipelined.
  The number of register writes required in a cycle can be larger than 1 because the instructions have varying running times.
  WAW hazards are possible, but WAR hazards are not.
  Instructions can complete in a different order than they were issued, causing problems with exceptions.
  Stalls for RAW hazards will be more frequent because of the longer latency.
Assuming all hazard detection is done in ID, three checks must be done before issuing an instruction:
  Check for structural hazards
  Check for a RAW data hazard
  Check for a WAW data hazard
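The three issue checks can be sketched as one function evaluated in ID. This is a simplified model under stated assumptions: `busy_units` holds the non-pipelined units still occupied, `pending_dests` holds the destination registers of issued but unfinished instructions, and register write-port conflicts and the exact timing of forwarding are ignored. All names and the example instructions are hypothetical.

```python
def issue_check(instr, busy_units, pending_dests):
    """Return the reason an instruction must stall in ID, or None to issue."""
    if instr["unit"] in busy_units:
        return "structural"                       # required unit not available
    if pending_dests & set(instr["sources"]):
        return "RAW"                              # a source is still being produced
    if instr["dest"] in pending_dests:
        return "WAW"                              # would write out of order
    return None

# State while a hypothetical DIV.D F0, F2, F4 is still executing.
busy = {"divide"}
pending = {"F0"}

add_d = {"unit": "fp_add",  "dest": "F10", "sources": ("F0", "F8")}
div_d = {"unit": "divide",  "dest": "F6",  "sources": ("F8", "F10")}
l_d   = {"unit": "integer", "dest": "F0",  "sources": ("R2",)}
sub_d = {"unit": "fp_add",  "dest": "F12", "sources": ("F12", "F14")}

print(issue_check(add_d, busy, pending))  # RAW
print(issue_check(div_d, busy, pending))  # structural
print(issue_check(l_d, busy, pending))    # WAW
print(issue_check(sub_d, busy, pending))  # None
```

No WAR check appears because, as stated above, WAR hazards are not possible in this pipeline.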

RAW Hazards Caused by Longer Pipeline
Fig. A.33

Structural Hazards in Longer Pipeline
Fig. A.34

Maintaining Precise Exceptions (1)
Problems caused by out-of-order completion
  DIV.D F0, F2, F4
  ADD.D F10, F10, F8
  SUB.D F12, F12, F14
Four possible approaches
  Ignore the problem and settle for imprecise exceptions.
  Buffer the results of an operation until all the operations that were issued earlier are completed.
    History file approach: buffer the original register values.
    Future file approach: keep the newer values of registers.
  Allow the exceptions to become somewhat imprecise, but keep enough information so that the trap-handling routines can create a precise sequence for exceptions. This means knowing what operations were in the pipeline and their PCs.

Maintaining Precise Exceptions (2)
Worst-case scenario:
  Instruction 1: a long-running instruction that interrupts.
  Instruction 2: not completed.
  ….
  Instruction n-1: not completed.
  Instruction n: completed. <-- The latest completed instruction.
The software must simulate instruction 1 through instruction n-1 and restart the execution at instruction n+1.
Fourth approach: allow instruction issue to continue only if it is certain that all the instructions before the issuing instruction will complete without causing an exception. This sometimes means stalling the machine to maintain precise exceptions.

Number of Stalls per FP Operation

Performance of a MIPS FP Pipeline

Overview of The MIPS R4000 Pipeline
An implementation of MIPS64
Eight pipeline stages (superpipelining)

Load Delay in MIPS R4000

Branch Delay in MIPS R4000

CPI of MIPS R4000

Concluding Remarks We can spend a little money to buy a very powerful computer today.
