Overview What are pipeline hazards? Types of hazards

Overview
- What are pipeline hazards?
- Types of hazards
  - Structural
  - Data
  - Control
- Performance implications
- Basic techniques
- Exceptions

Pipeline Hazards
[Figure: pipeline diagram of overlapped instructions, each passing through IM, Reg, ALU, DM, Reg]

Types of Hazards
Hazards
- Prevent the next instruction(s) from executing
- Reduce performance below the ideal speedup
Types
- Structural: hardware resource conflicts
- Data: dependencies among operands
- Control: jumps and branches

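Pipeline Stalls
$$
\text{Speedup from pipelining}
= \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}
= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
$$
With CPI pipelined = 1 + pipeline stall cycles per instruction, equal clock cycles, and CPI unpipelined equal to the pipeline depth:
$$
\text{Speedup from pipelining}
= \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
$$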
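A minimal sketch of the last form of the speedup relation, to show how quickly stalls erode the ideal speedup. The function name and the sample stall values are illustrative assumptions; the formula assumes an ideal pipelined CPI of 1, balanced stages, and an unpipelined CPI equal to the pipeline depth.

```python
def pipeline_speedup(pipeline_depth: int, stall_cycles_per_instr: float) -> float:
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction).

    Assumptions: ideal pipelined CPI of 1, perfectly balanced stages,
    and an unpipelined CPI equal to the pipeline depth.
    """
    if pipeline_depth < 1:
        raise ValueError("pipeline_depth must be at least 1")
    if stall_cycles_per_instr < 0:
        raise ValueError("stall_cycles_per_instr must be non-negative")
    return pipeline_depth / (1.0 + stall_cycles_per_instr)


if __name__ == "__main__":
    # Hypothetical stall rates for a 5-stage pipeline.
    for stalls in (0.0, 0.25, 1.0):
        print(f"stalls/instr = {stalls:4.2f} -> speedup = {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the depth; one stall cycle per instruction halves it.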

Structural Hazards
Overlapped execution of instructions:
- Pipelining of functional units
- Duplication of resources
Structural hazard
- Occurs when the pipeline cannot accommodate some combination of instructions
Consequences
- Stall
- Increase of CPI from its ideal value (1)

Pipelining of Functional Units
[Figure: FP multiply unit (stages M1 to M5) beside the IF, ID, EX, MEM, WB pipeline, in three variants: fully pipelined, partially pipelined, not pipelined]

To pipeline or Not to pipeline
Elements to consider
- Effects of pipelining and duplicating units
  - Increased costs
  - Higher latency (pipeline register overhead)
- Frequency of structural hazard
Example: unpipelined FP multiply unit in DLX
- Latency: 5 cycles
- Impact on the mdljdp2 program?
  - Frequency of FP instructions: 14%
  - Depends on the distribution of FP multiplies
    - Best case: uniform distribution
    - Worst case: clustered, back-to-back multiplies
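The best-case versus worst-case contrast above can be illustrated with a toy model. This is a sketch under stated assumptions, not a DLX simulator: one instruction issues per cycle, a multiply occupies the single unpipelined unit for `latency` cycles, and a later multiply stalls (together with everything behind it) until the unit is free. Only the 5-cycle latency comes from the slide; the function name and the two instruction streams are hypothetical.

```python
from typing import Iterable


def structural_stall_cycles(is_multiply: Iterable[bool], latency: int = 5) -> int:
    """Count stall cycles caused by one unpipelined multiply unit.

    Assumed model: in-order issue, one instruction per cycle; a multiply
    holds the unit for `latency` cycles; a multiply that finds the unit
    busy waits, delaying all following instructions.
    """
    cycle = 0          # cycle in which the next instruction would issue
    unit_free_at = 0   # first cycle in which the multiplier is free again
    stalls = 0
    for mul in is_multiply:
        if mul:
            if cycle < unit_free_at:
                stalls += unit_free_at - cycle
                cycle = unit_free_at
            unit_free_at = cycle + latency
        cycle += 1
    return stalls


if __name__ == "__main__":
    # Hypothetical streams with the same number of multiplies (3 of 21).
    spread = ([True] + [False] * 6) * 3      # multiplies evenly spaced
    clustered = [True] * 3 + [False] * 18    # back-to-back multiplies
    print("spread:   ", structural_stall_cycles(spread), "stall cycles")
    print("clustered:", structural_stall_cycles(clustered), "stall cycles")
```

The same instruction mix produces no stalls when the multiplies are spread out and several when they are clustered, which is why the frequency alone does not determine the cost.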

Resource Duplication
[Figure: pipeline diagram for Load, Inst 1, Inst 2, Stall, Inst 3 through the stages M, Reg, ALU, M, Reg]

Resource Duplication - Example
- A: machine with structural hazard
- B: machine without structural hazard
- Data references: 40%
- CPI (B): 1
- Clock time (B): 1.05 x clock_time (A)
CPU time (B) = IC * 1 * clock_time (B)
CPU time (A) = IC * (1 + 0.4 * 1) * clock_time (B) / 1.05
CPU time (A) = 1.3 * CPU time (B)
Does the distribution of load/store instructions within the program matter?
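The arithmetic of this example can be checked with a short sketch. It assumes that every data reference on machine A costs one stall cycle (the term inside the parentheses is not legible in the slide, so this is an assumption that reproduces the stated 1.3 ratio). The function and parameter names are illustrative.

```python
def cpu_time_ratio_a_over_b(
    data_ref_fraction: float = 0.40,     # from the slide: 40% data references
    stall_per_data_ref: float = 1.0,     # ASSUMPTION: one stall cycle per data reference on A
    clock_time_b_over_a: float = 1.05,   # from the slide: clock_time(B) = 1.05 x clock_time(A)
) -> float:
    """Return CPU_time(A) / CPU_time(B) for the same instruction count."""
    cpi_a = 1.0 + data_ref_fraction * stall_per_data_ref
    cpi_b = 1.0
    # Express both times in units of clock_time(B):
    #   CPU_time(A) = IC * cpi_a * clock_time(B) / 1.05
    #   CPU_time(B) = IC * cpi_b * clock_time(B)
    return (cpi_a / clock_time_b_over_a) / cpi_b


if __name__ == "__main__":
    print(f"CPU_time(A) / CPU_time(B) = {cpu_time_ratio_a_over_b():.2f}")
```

Under these assumptions machine A is slower despite its faster clock, because the stalls outweigh the 5% clock advantage.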

Data Hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

Forwarding
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below, with forwarding paths]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

Types of Data Hazards
- RAW: read after write
- WAW: write after write
- WAR: write after read
DLX
- Only RAW hazards with registers
- Memory references are always kept in order
LW R1, 0(R2)      IF ID EX M1 M2 WB
ADD R1, R2, R     IF ID EX WB
SW 0(R1), R       IF ID EX M1 M2 WB
ADD R2, R3, R     IF ID EX WB

Stalls in Data Hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below]
LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9

Pipeline Interlocks
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below, with the stall inserted]
LW R1, 0(R2)      IF    ID    EX    MEM   WB
SUB R4, R1, R5          IF    ID    stall EX    MEM   WB
AND R6, R1, R7                IF    stall ID    EX    MEM   WB
OR R8, R1, R9                       stall IF    ID    EX    MEM   WB

Compiler Scheduling (1/3)
Source:
a = b + c;
d = e - f;
Original code:
LW Rb, b
LW Rc, c
ADD Ra, Rb, Rc
SW a, Ra
LW Re, e
LW Rf, f
SUB Rd, Re, Rf
SW d, Rd
Scheduled code:
LW Rb, b
LW Rc, c
LW Re, e
ADD Ra, Rb, Rc
LW Rf, f
SW a, Ra
SUB Rd, Re, Rf
SW d, Rd
Load interlock in the original code:
LW R1, b          IF    ID    EX    MEM   WB
LW R2, c                IF    ID    EX    MEM   WB
ADD R3, R1, R2                IF    ID    stall EX    MEM   WB
SW a, R3                            IF    stall ID    EX    MEM   WB

Original order:
LW Rb,b           IF ID EX MEM WB
LW Rc,c           IF ID EX MEM WB
ADD Ra,Rb,Rc      IF ID EX MEM WB
SW a,Ra           IF ID EX MEM WB
LW Re,e           IF ID EX MEM WB
LW Rf,f           IF ID EX MEM WB
SUB Rd,Re,Rf      IF ID EX MEM WB
SW d,Rd           IF ID EX MEM WB
Scheduled order:
LW Rb,b           IF ID EX MEM WB
LW Rc,c           IF ID EX MEM WB
LW Re,e           IF ID EX MEM WB
ADD Ra,Rb,Rc      IF ID EX MEM WB
LW Rf,f           IF ID EX MEM WB
SW a,Ra           IF ID EX MEM WB
SUB Rd,Re,Rf      IF ID EX MEM WB
SW d,Rd           IF ID EX MEM WB

- Eliminates load interlocks
- Demands more registers
- Simple scheduling
  - Basic block (sequential segment of code)
  - Good for simple pipelines
- Percentage of loads that result in a stall
  - FP: 13%
  - Int: 25%
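The effect of scheduling on load interlocks can be checked by counting the loads whose result is used by the very next instruction. The sketch below is a simplification under stated assumptions: a 5-stage pipeline with forwarding, where only a load immediately followed by a consumer costs one stall cycle. The `Op` type and `load_use_stalls` function are hypothetical names; the two instruction sequences are the original and scheduled code for a = b + c; d = e - f.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    name: str               # mnemonic, e.g. "LW"
    dest: str               # destination register, "" if none
    srcs: Tuple[str, ...]   # source registers


def load_use_stalls(program: List[Op]) -> int:
    """Count load interlocks: a load followed directly by a user of its result.

    ASSUMPTION: forwarding covers every other dependence, so each such
    pair costs exactly one stall cycle.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.name == "LW" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


ORIGINAL = [
    Op("LW", "Rb", ()), Op("LW", "Rc", ()),
    Op("ADD", "Ra", ("Rb", "Rc")), Op("SW", "", ("Ra",)),
    Op("LW", "Re", ()), Op("LW", "Rf", ()),
    Op("SUB", "Rd", ("Re", "Rf")), Op("SW", "", ("Rd",)),
]

SCHEDULED = [
    Op("LW", "Rb", ()), Op("LW", "Rc", ()), Op("LW", "Re", ()),
    Op("ADD", "Ra", ("Rb", "Rc")), Op("LW", "Rf", ()),
    Op("SW", "", ("Ra",)), Op("SUB", "Rd", ("Re", "Rf")),
    Op("SW", "", ("Rd",)),
]

if __name__ == "__main__":
    print("original: ", load_use_stalls(ORIGINAL), "load interlocks")
    print("scheduled:", load_use_stalls(SCHEDULED), "load interlocks")
```

In the scheduled order no load is followed directly by its consumer, at the price of keeping more registers live at once.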

Control for the DLX Pipeline
Goal: simple control hardware to handle interlocks and forwarding
Options
- Centralized: check interlocks and forwarding during ID
- Distributed: check interlocks/forwarding at the beginning of EX, MEM

Control for RAW Load Hazards
Compare the load's destination (R1) with the sources of the 2 adjacent instructions
Situations (load: LW R1, 10(R2))
- No dependence
- Dependence requiring stall
- Dependence overcome by forwarding
- Dependence with accesses in order

Load Interlock Implementation
RAW load interlock detection during ID
- Load instruction in EX
- Instruction that needs the load data in ID
Logic to detect load interlock
ID/EX.IR    IF/ID.IR                        Comparison
Load        r-r ALU                         ID/EX.IR == IF/ID.IR6..10
Load        r-r ALU                         ID/EX.IR == IF/ID.IR11..15
Load        Load, Store, r-i ALU, branch    ID/EX.IR == IF/ID.IR6..10
Action (insert the pipeline stall)
- ID/EX.IR0..5 = 0 (no-op)
- Re-circulate contents of IF/ID
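The comparison table above can be sketched as a small predicate. This is an illustration, not the DLX control logic itself: instructions are modelled by an opcode class plus the register numbers held in bits 6..10 and 11..15, and the load's destination is assumed to sit in bits 11..15 of ID/EX.IR (the slide text does not show that bit range). All names are hypothetical.

```python
from dataclasses import dataclass

# Opcode classes used by the comparison table.
LOAD, STORE, RR_ALU, RI_ALU, BRANCH = "load", "store", "rr_alu", "ri_alu", "branch"


@dataclass(frozen=True)
class IR:
    kind: str     # opcode class (stands in for IR0..5)
    f6_10: int    # register number in bits 6..10
    f11_15: int   # register number in bits 11..15


def needs_load_interlock(id_ex: IR, if_id: IR) -> bool:
    """True if the instruction in ID must stall behind a load in EX."""
    if id_ex.kind != LOAD:
        return False
    load_dest = id_ex.f11_15  # ASSUMPTION: load destination is bits 11..15
    if if_id.kind == RR_ALU:
        # Register-register ALU ops read both register fields.
        return load_dest in (if_id.f6_10, if_id.f11_15)
    if if_id.kind in (LOAD, STORE, RI_ALU, BRANCH):
        # These read only the register in bits 6..10 at this point.
        return load_dest == if_id.f6_10
    return False


if __name__ == "__main__":
    lw = IR(LOAD, f6_10=2, f11_15=1)     # LW R1, 0(R2)
    sub = IR(RR_ALU, f6_10=1, f11_15=5)  # SUB R4, R1, R5
    print(needs_load_interlock(lw, sub))
```

When the predicate holds, the action listed above applies: the opcode bits of ID/EX.IR are cleared to form a no-op and IF/ID is re-circulated.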

Forwarding Implementation (1/2)
- Source: ALU or MEM output
- Destination: ALU, MEM or Zero? input(s)
Compare (forwarding to ALU input):
- Opcode of source instructions (ID/EX.IR 0..5, EX/MEM.IR 0..5): R-R ALU, R-I ALU, Load
- Opcode of destination instructions (ID/EX.IR 0..5): R-R ALU, R-I ALU, Load, Store, Branch
- Source register fields: EX/MEM.IR, MEM/WB.IR
- Destination register fields: ID/EX.IR 6..10, ID/EX.IR
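The comparison for forwarding to an ALU input can be sketched as a multiplexer select. This is a minimal illustration with hypothetical names: it assumes the destination register numbers of the instructions held in EX/MEM and MEM/WB have already been extracted, that the newer result (EX/MEM) takes priority, that R0 is hard-wired to zero and never forwarded, and that a load in EX/MEM has already been handled by the load interlock.

```python
from typing import Optional


def alu_input_source(
    src_reg: int,
    ex_mem_dest: Optional[int],
    mem_wb_dest: Optional[int],
) -> str:
    """Select where an ALU operand comes from: 'EX/MEM', 'MEM/WB' or 'REGFILE'.

    `None` means the instruction in that pipeline register writes no register.
    """
    if src_reg == 0:
        return "REGFILE"          # ASSUMPTION: R0 always reads as zero
    if ex_mem_dest == src_reg:
        return "EX/MEM"           # newest value wins
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # ADD R1,R2,R3 ; SUB R4,R1,R5 ; AND R6,R1,R7
    # SUB in EX: ADD's result sits in EX/MEM.
    print(alu_input_source(1, ex_mem_dest=1, mem_wb_dest=None))
    # AND in EX: SUB (dest R4) is in EX/MEM, ADD (dest R1) is in MEM/WB.
    print(alu_input_source(1, ex_mem_dest=4, mem_wb_dest=1))
```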

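Forwarding Implementation (2/2)
[Figure: forwarding datapath with the ID/EX, EX/MEM and MEM/WB pipeline registers, multiplexers on the ALU inputs, the Zero? test, the ALU, and data memory]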

Control Hazards
Branch            IF    ID    EX    MEM   WB
Branch successor        IF    stall stall IF    ID    EX    MEM   WB
Branch successor                                IF    ID    EX    MEM   WB
Branch successor                                      IF    ID    EX    MEM   WB
Branch successor                                            IF    ID    EX    MEM
Branch successor                                                  IF    ID    EX
Stall the pipeline until we reach MEM
- Easy, but expensive
- Three cycles for every branch
To reduce the branch delay
- Find out whether the branch is taken or not taken ASAP
- Compute the branch target ASAP
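The cost of the three-cycle stall can be put into CPI terms with a one-line model. This is a sketch: the function name is illustrative, and the 20% branch frequency in the example is a hypothetical value, not a figure from the slides.

```python
def cpi_with_branch_stalls(
    branch_fraction: float,
    penalty_cycles: float = 3.0,   # from the slide: three cycles for every branch
    base_cpi: float = 1.0,         # ideal pipelined CPI
) -> float:
    """Effective CPI = base CPI + branch frequency x branch penalty."""
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0 and 1")
    return base_cpi + branch_fraction * penalty_cycles


if __name__ == "__main__":
    # HYPOTHETICAL branch frequency of 20%.
    print(f"CPI with stalls: {cpi_with_branch_stalls(0.20):.2f}")
    # Same frequency if the penalty were cut to one cycle.
    print(f"CPI with 1-cycle penalty: {cpi_with_branch_stalls(0.20, penalty_cycles=1.0):.2f}")
```

The second call shows why resolving the branch earlier matters: the CPI overhead scales linearly with the penalty.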

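Optimized Branch Execution
[Figure: DLX datapath with the PC, instruction cache, registers, sign extend, ALU, data cache, multiplexers, an added adder and Zero? test for early branch resolution, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers]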

Branch Behavior in Programs
                                 Integer   FP
Forward conditional branches     %         7%
Backward conditional branches    %         %
Unconditional branches           %         %
Branches taken                   %         %

Reduction of Branch Penalties
Static, compile-time, branch prediction schemes
1. Stall the pipeline
   - Simple in hardware and software
2. Treat every branch as not taken
   - Continue execution as if the branch were a normal instruction
   - If the branch is taken, turn the fetched instruction into a no-op
3. Treat every branch as taken
   - Useless in DLX
4. Delayed branch
   - Sequential successors (in delay slots) are executed anyway
   - No branches in the delay slots

Predict-not-taken Scheme
Untaken Branch    IF    ID    EX    MEM   WB
Instruction i           IF    ID    EX    MEM   WB
Instruction i                 IF    ID    EX    MEM   WB
Instruction i                       IF    ID    EX    MEM   WB
Instruction i                             IF    ID    EX    MEM   WB

Taken Branch      IF    ID    EX    MEM   WB
Instruction i           IF    stall stall stall stall   (clear the IF/ID register)
Branch target                 IF    ID    EX    MEM   WB
Branch target                       IF    ID    EX    MEM   WB
Branch target                             IF    ID    EX    MEM   WB

Compiler organizes code so that the most frequent path is the not-taken one

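Optimizations of the Branch Slot
(a) From before
  Before: ADD R1,R2,R3; if R2=0 then; [delay slot]
  After:  if R2=0 then; ADD R1,R2,R3 (in delay slot)
(b) From target
  Before: SUB R4,R5,R6 (branch target); ADD R1,R2,R3; if R1=0 then; [delay slot]
  After:  SUB R4,R5,R6 (branch target); ADD R1,R2,R3; if R1=0 then; SUB R4,R5,R6 (in delay slot)
(c) From fall through
  Before: ADD R1,R2,R3; if R1=0 then; [delay slot]; OR R7,R8,R9; SUB R4,R5,R6
  After:  ADD R1,R2,R3; if R1=0 then; OR R7,R8,R9 (in delay slot); SUB R4,R5,R6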

Branch Slot Requirements
Strategy: a) From before
- Requirements: branch must not depend on the delayed instruction
- Improves performance: always
Strategy: b) From target
- Requirements: must be OK to execute the delayed instruction if the branch is not taken
- Improves performance: when the branch is taken
Strategy: c) From fall through
- Requirements: must be OK to execute the delayed instruction if the branch is taken
- Improves performance: when the branch is not taken
Limitations in delayed-branch scheduling
- Restrictions on instructions that are scheduled
- Ability to predict branches at compile time

Cancelling Branch Instructions
- Cancelling branch includes the predicted direction
- Incorrect prediction => delay-slot instruction becomes a no-op
- Helps the compiler to fill branch delay slots (no requirements for b and c)
Behavior of a predicted-taken cancelling branch
Untaken Branch    IF    ID    EX    MEM   WB
Instruction i           IF    stall stall stall stall   (clear the IF/ID register)
Instruction i                 IF    ID    EX    MEM   WB
Instruction i                       IF    ID    EX    MEM   WB
Instruction i                             IF    ID    EX    MEM   WB

Taken Branch      IF    ID    EX    MEM   WB
Instruction i           IF    ID    EX    MEM   WB
Branch target                 IF    ID    EX    MEM   WB
Branch target i                     IF    ID    EX    MEM   WB
Branch target i                           IF    ID    EX    MEM   WB
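The cancelling rule reduces to a comparison between the direction encoded in the branch and the actual outcome. A minimal sketch with a hypothetical function name:

```python
def delay_slot_completes(predicted_taken: bool, actually_taken: bool) -> bool:
    """Return True if the delay-slot instruction executes normally.

    A cancelling branch carries its predicted direction; when the actual
    outcome differs, the delay-slot instruction is turned into a no-op.
    """
    return predicted_taken == actually_taken


if __name__ == "__main__":
    for predicted in (True, False):
        for actual in (True, False):
            outcome = "executes" if delay_slot_completes(predicted, actual) else "becomes no-op"
            print(f"predicted taken={predicted!s:5}  actual taken={actual!s:5}  -> slot {outcome}")
```

For the predicted-taken case traced above, the slot is cancelled on the untaken path and completes on the taken path, which is why the compiler may fill the slot from the target without further checks.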

Performance of Branch Schemes
[Table: branch penalty (conditional, unconditional, average) and effective CPI with branch stalls, for Int and FP programs, by scheduling scheme: stall pipeline, predict taken, predict not taken, delayed branch]

Example
Role of pipeline delay on branch penalty (R4000)
- Computation of branch target (3 stages)
- Evaluation of branch condition (4 stages)
[Table: penalty for unconditional, untaken and taken branches, by branch scheme: stall pipeline, predict taken, predict untaken]
Predict untaken
Taken Branch      IF    IS    RF    EX    DF    DS    TC    WB
Instruction i           IF    IS    RF    stall stall stall ...
Instruction i                 IF    IS    stall stall stall ...
Instruction i                       IF    stall stall stall ...
Branch target                             IF    IS    RF    DF    DS    TC    WB

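Static Branch Prediction
Correct predictions
- Reduce branch hazard penalty
- Help the scheduling of data hazards:
     LW R1, 0(R2)
     SUB R1, R1, R3
     BEQZ R1, L
     OR R4, R5, R6
     ADD R10, R4, R3
  L: ADD R7, R8, R9
  - If the branch is almost never taken: OR R4, R5, R6 can be moved after the LW (provided R4 is not needed on the taken path)
  - If the branch is almost always taken: ADD R7, R8, R9 can be moved after the LW (provided R7 is not needed on the fall-through path)
Prediction methods
- Examination of program behavior (benchmarks)
- Use of profile information from previous runs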

Performance of the DLX Integer Pipeline

Exceptions
- I/O device request
- Operating system call
- Tracing instruction execution
- Breakpoint
- Integer overflow
- FP arithmetic anomaly
- Page fault
- Misaligned memory access
- Memory protection violation
- Undefined instruction
- Hardware malfunctions
- Power failure

Types of Exceptions
- Synchronous vs. asynchronous
- User requested vs. coerced
- User maskable vs. nonmaskable
- Within vs. between instructions
- Resume vs. terminate
Most difficult
- Occur in the middle of the instruction (EX or MEM)
- Must be able to restart

Stopping and Restarting Execution
- TRAP, RTE instructions
- IAR register
Safely save the state of the pipeline
- Force a TRAP on the next IF
- Until the TRAP is taken, turn off all writes for the faulting instruction and the following ones
- Exception-handling routine saves the PC of the faulting instruction
- For delayed branches we need to save more PCs
Precise Exceptions
