Published by Phebe Richardson. Modified over 5 years ago.
1
Overview
- What are pipeline hazards?
- Types of hazards
  - Structural
  - Data
  - Control
- Performance implications
- Basic techniques
- Exceptions
2
Pipeline Hazards
[Figure: pipeline diagram of overlapped instructions, each passing through the stages IM, Reg, ALU, DM, Reg]
3
Types of Hazards
- Hazards
  - Prevent the next instruction(s) from executing
  - Reduce performance below the ideal speedup
- Types
  - Structural: hardware resource conflicts
  - Data: dependencies among operands
  - Control: jumps and branches
4
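Pipeline Stalls
\[
\text{Speedup from pipelining}
= \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}
= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
= \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\]
The last form assumes equal clock cycles, an ideal pipelined CPI of 1, and an unpipelined CPI equal to the pipeline depth.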
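The speedup relation on this slide can be checked numerically. The sketch below is only an illustration: the function name and the sample stall rate of 0.25 cycles per instruction are assumptions, not values from the slides, and it relies on the usual simplifications (equal clock cycles, ideal pipelined CPI of 1).

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup over an unpipelined machine.

    Assumes equal clock cycles and an ideal pipelined CPI of 1, so that
    speedup = depth / (1 + stall cycles per instruction).
    """
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)


# Hypothetical 5-stage pipeline: no stalls versus 0.25 stall cycles per instruction.
ideal = pipeline_speedup(5, 0.0)
with_stalls = pipeline_speedup(5, 0.25)
print(ideal, with_stalls)
```

Every stall cycle per instruction enlarges the denominator, so even a modest stall rate pulls the speedup noticeably below the pipeline depth.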
5
Structural Hazards
- Overlapped execution of instructions requires:
  - Pipelining of functional units
  - Duplication of resources
- Structural hazard
  - Occurs when the pipeline cannot accommodate some combination of instructions
- Consequences
  - Stall
  - Increase of CPI from its ideal value (1)
6
Pipelining of Functional Units
[Figure: FP multiply unit (stages M1 to M5) alongside the IF, ID, EX, MEM, WB pipeline, drawn in three variants: fully pipelined, partially pipelined, and not pipelined]
7
To pipeline or Not to pipeline
- Elements to consider
  - Effects of pipelining and duplicating units
    - Increased costs
    - Higher latency (pipeline register overhead)
  - Frequency of structural hazard
- Example: unpipelined FP multiply unit in DLX
  - Latency: 5 cycles
  - Impact on mdljdp2 program?
    - Frequency of FP instructions: 14%
    - Depends on the distribution of FP multiplies
      - Best case: uniform distribution
      - Worst case: clustered, back-to-back multiplies
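The best-case and worst-case distributions can be compared with a small issue model. This is a rough sketch under stated assumptions: one instruction issues per cycle, an instruction that needs the busy unit simply waits for it, and the 100-instruction traces (14 multiplies each) are hypothetical, not measurements of mdljdp2.

```python
def structural_stall_cycles(uses_unit_trace, latency=5):
    """Count stall cycles caused by one unpipelined functional unit.

    uses_unit_trace: one boolean per instruction, in program order.
    Model (an assumption): in-order issue, one instruction per cycle,
    and the unit is busy for `latency` cycles after each use.
    """
    cycle = 0          # cycle in which the current instruction would issue
    unit_free_at = 0   # first cycle in which the unit can accept new work
    stalls = 0
    for uses_unit in uses_unit_trace:
        if uses_unit:
            if cycle < unit_free_at:
                stalls += unit_free_at - cycle   # wait for the busy unit
                cycle = unit_free_at
            unit_free_at = cycle + latency
        cycle += 1
    return stalls


# Hypothetical traces with the same number of multiplies (14).
uniform = ([True] + [False] * 6) * 14        # one multiply every 7 instructions
clustered = [True] * 14 + [False] * 84       # 14 back-to-back multiplies

uniform_stalls = structural_stall_cycles(uniform)
clustered_stalls = structural_stall_cycles(clustered)
print(uniform_stalls, clustered_stalls)
```

With multiplies spaced further apart than the unit's latency, the unpipelined unit never causes a stall in this model; back-to-back multiplies each wait four cycles for the previous one.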
8
Resource Duplication
[Figure: pipeline diagram (M, Reg, ALU, M, Reg) for Load, Inst 1, Inst 2, and Inst 3 sharing a single memory M; a stall is inserted before Inst 3]
9
Resource Duplication - Example
- A: machine with structural hazard
- B: machine without structural hazard
- Data references: 40%
- CPI (B): 1
- Clock time (B): 1.05 x clock_time (A)
- CPU time (B) = IC * 1 * clock_time (B)
- CPU time (A) = IC * (1 + 0.4 * 1) * clock_time (B) / 1.05, taking one stall cycle per data reference
- CPU time (A) = 1.33 * CPU time (B)
- Does the distribution of load/store instructions within the program matter?
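The comparison of machines A and B can be reproduced in a few lines. In this sketch the instruction count and the absolute clock time are arbitrary placeholders, and the one-cycle stall per data reference is an assumption (it is what makes the ratio come out near the slide's 1.3).

```python
def cpu_time(instruction_count, cpi, clock_time):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_time


ic = 1_000_000              # arbitrary instruction count (cancels in the ratio)
clock_b = 1.0               # arbitrary time unit for machine B
clock_a = clock_b / 1.05    # B's clock cycle is 1.05 times A's

data_reference_fraction = 0.40
stall_per_reference = 1     # assumed: one stall cycle per data reference on A

time_b = cpu_time(ic, 1.0, clock_b)
time_a = cpu_time(ic, 1.0 + data_reference_fraction * stall_per_reference, clock_a)
ratio = time_a / time_b
print(round(ratio, 2))
```

Machine A's faster clock does not make up for its extra stall cycles, so B finishes sooner despite the slower clock.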
10
Data Hazards
    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence above, in which the later instructions read R1, written by ADD]
11
Forwarding
    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence above, with the result of ADD forwarded to the instructions that read R1]
12
Types of Data Hazards
- RAW: read after write
- WAW: write after write
- WAR: write after read
- DLX
  - Only RAW hazards with registers
  - Memory references are always kept in order

    LW  R1, 0(R2)    IF  ID  EX  M1  M2  WB
    ADD R1, R2, R        IF  ID  EX  WB

    SW  0(R1), R     IF  ID  EX  M1  M2  WB
    ADD R2, R3, R        IF  ID  EX  WB
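The three hazard types follow directly from which registers two instructions read and write. The sketch below classifies the potential hazards between an earlier and a later instruction; the dictionary representation is a hypothetical encoding chosen for the example, and whether a potential hazard actually stalls a given pipeline depends on its timing.

```python
def classify_hazards(first, second):
    """Return the potential data hazards when `first` precedes `second`.

    Each instruction is a dict with 'reads' and 'writes' sets of register
    names (a hypothetical representation for this sketch).
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")   # second reads what first writes
    if first["writes"] & second["writes"]:
        hazards.add("WAW")   # both write the same register
    if first["reads"] & second["writes"]:
        hazards.add("WAR")   # second overwrites what first still reads
    return hazards


add = {"reads": {"R2", "R3"}, "writes": {"R1"}}   # ADD R1, R2, R3
sub = {"reads": {"R1", "R5"}, "writes": {"R4"}}   # SUB R4, R1, R5
print(classify_hazards(add, sub))
```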
13
Stalls in Data Hazards
    LW  R1, 0(R2)
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence above, in which SUB needs R1 before the load has produced it]
14
Pipeline Interlocks
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below, with the instructions after the load delayed by one cycle]

    LW  R1, 0(R2)    IF    ID    EX    MEM   WB
    SUB R4, R1, R5         IF    ID    stall EX    MEM   WB
    AND R6, R1, R7               IF    stall ID    EX    MEM   WB
    OR  R8, R1, R9                     stall IF    ID    EX    MEM   WB
15
Compiler Scheduling (1/3)
    a = b + c;
    d = e - f;

    Original code        Scheduled code
    LW  Rb, b            LW  Rb, b
    LW  Rc, c            LW  Rc, c
    ADD Ra, Rb, Rc       LW  Re, e
    SW  a, Ra            ADD Ra, Rb, Rc
    LW  Re, e            LW  Rf, f
    LW  Rf, f            SW  a, Ra
    SUB Rd, Re, Rf       SUB Rd, Re, Rf
    SW  d, Rd            SW  d, Rd

    LW  R1, b        IF    ID    EX    MEM   WB
    LW  R2, c              IF    ID    EX    MEM   WB
    ADD R3, R1, R                IF    ID    stall EX    MEM   WB
    SW  a, R                           IF    stall ID    EX    MEM   WB
16
Compiler Scheduling (2/3)
Original order:
    LW  Rb,b          IF ID EX MEM WB
    LW  Rc,c          IF ID EX MEM WB
    ADD Ra,Rb,Rc      IF ID EX MEM WB
    SW  a,Ra          IF ID EX MEM WB
    LW  Re,e          IF ID EX MEM WB
    LW  Rf,f          IF ID EX MEM WB
    SUB Rd,Re,Rf      IF ID EX MEM WB
    SW  d,Rd          IF ID EX MEM WB

Scheduled order:
    LW  Rb,b          IF ID EX MEM WB
    LW  Rc,c          IF ID EX MEM WB
    LW  Re,e          IF ID EX MEM WB
    ADD Ra,Rb,Rc      IF ID EX MEM WB
    LW  Rf,f          IF ID EX MEM WB
    SW  a,Ra          IF ID EX MEM WB
    SUB Rd,Re,Rf      IF ID EX MEM WB
    SW  d,Rd          IF ID EX MEM WB
17
Compiler Scheduling (3/3)
- Eliminates load interlocks
- Demands more registers
- Simple scheduling
  - Basic block (sequential segment of code)
  - Good for simple pipelines
- Percentage of loads that result in a stall
  - FP: 13%
  - Int: 25%
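The effect of scheduling on load interlocks can be counted mechanically for the `a = b + c; d = e - f;` example. The sketch below is a simplification, not the DLX interlock logic: it assumes forwarding, so a stall occurs only when the instruction immediately after a load reads the loaded register, and its tiny parser (a helper invented for this example) understands only the three instruction shapes used in the slides.

```python
def parse(instr):
    """Split a toy instruction into (opcode, destination, sources).

    Handles only the shapes used in this example:
      LW Rd, addr    SW addr, Rs    OP Rd, Rs1, Rs2
    Address operands are treated as plain labels (no base register).
    """
    op, rest = instr.split(None, 1)
    operands = [x.strip() for x in rest.split(",")]
    if op == "LW":
        return op, operands[0], []
    if op == "SW":
        return op, None, [operands[1]]
    return op, operands[0], operands[1:]


def load_use_stalls(program):
    """Count loads whose result is read by the very next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(prev)
        _, _, curr_sources = parse(curr)
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls


original = ["LW Rb, b", "LW Rc, c", "ADD Ra, Rb, Rc", "SW a, Ra",
            "LW Re, e", "LW Rf, f", "SUB Rd, Re, Rf", "SW d, Rd"]
scheduled = ["LW Rb, b", "LW Rc, c", "LW Re, e", "ADD Ra, Rb, Rc",
             "LW Rf, f", "SW a, Ra", "SUB Rd, Re, Rf", "SW d, Rd"]

original_stalls = load_use_stalls(original)
scheduled_stalls = load_use_stalls(scheduled)
print(original_stalls, scheduled_stalls)
```

The scheduled order hoists the load of `e` between the load of `c` and the ADD, and the load of `f` ahead of the store, so no load is immediately followed by its consumer.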
18
Control for the DLX Pipeline
- Goal: simple control hardware to handle interlocks and forwarding
- Options
  - Centralized: check interlocks and forwarding during ID
  - Distributed: check interlocks/forwarding at the beginning of EX, MEM
19
Control for RAW Load Hazards
- Compare the load’s destination (R1) with the sources of the 2 adjacent instructions
- Situations for LW R1, 10(R2)
  - No dependence
  - Dependence requiring stall
  - Dependence overcome by forwarding
  - Dependence with accesses in order
20
Load Interlock Implementation
- RAW load interlock detection during ID
  - Load instruction in EX
  - Instruction that needs the load data in ID
- Logic to detect load interlock

    ID/EX.IR    IF/ID.IR                        Comparison
    Load        r-r ALU                         ID/EX.IR == IF/ID.IR6..10
    Load        r-r ALU                         ID/EX.IR == IF/ID.IR11..15
    Load        Load, Store, r-i ALU, branch    ID/EX.IR == IF/ID.IR6..10

- Action (insert the pipeline stall)
  - ID/EX.IR0..5 = 0 (no-op)
  - Re-circulate contents of IF/ID
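The comparison table can be expressed as a small predicate. This is a sketch under assumptions that the transcript does not state: bits are numbered with bit 0 as the most significant bit of a 32-bit word, the load's destination register is taken from bits 11..15 of ID/EX.IR, and the opcode class of each instruction is passed in as a flag or string rather than decoded from real DLX opcodes. The `encode` helper and the class names are hypothetical.

```python
def field(ir, first_bit, last_bit):
    """Extract bits first_bit..last_bit of a 32-bit word (bit 0 = most significant)."""
    width = last_bit - first_bit + 1
    shift = 31 - last_bit
    return (ir >> shift) & ((1 << width) - 1)


def needs_load_interlock(id_ex_is_load, id_ex_ir, if_id_ir, if_id_class):
    """Decide during ID whether the instruction in IF/ID must stall.

    if_id_class: 'rr_alu' (two register sources) or 'one_source'
    (load, store, r-i ALU, branch); any other value means no check.
    """
    if not id_ex_is_load:
        return False
    load_dest = field(id_ex_ir, 11, 15)   # assumed position of the load's destination
    if if_id_class in ("rr_alu", "one_source") and load_dest == field(if_id_ir, 6, 10):
        return True
    if if_id_class == "rr_alu" and load_dest == field(if_id_ir, 11, 15):
        return True
    return False


def encode(bits_6_10, bits_11_15):
    """Hypothetical helper: build a word with only the two register fields set."""
    return (bits_6_10 << 21) | (bits_11_15 << 16)


load_r1 = encode(2, 1)        # LW R1, 0(R2): base in 6..10, destination in 11..15
sub_uses_r1 = encode(1, 5)    # SUB R4, R1, R5: sources R1 and R5
stall = needs_load_interlock(True, load_r1, sub_uses_r1, "rr_alu")
print(stall)
```

When the predicate is true, the control logic would apply the action listed on the slide: place a no-op in ID/EX and re-circulate IF/ID.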
21
Forwarding Implementation (1/2)
- Source: ALU or MEM output
- Destination: ALU, MEM or Zero? input(s)
- Compare (forwarding to ALU input):
  - Opcode of source instructions (ID/EX.IR 0..5, EX/MEM.IR 0..5): R-R ALU, R-I ALU, Load
  - Opcode of destination instructions (ID/EX.IR 0..5): R-R ALU, R-I ALU, Load, Store, Branch
  - Source register fields: EX/MEM.IR, MEM/WB.IR
  - Destination register fields: ID/EX.IR 6..10, ID/EX.IR
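The comparison for forwarding to an ALU input amounts to choosing among three candidate values. The following sketch shows that selection only; the function name, the string labels, and the use of `None` for "this instruction writes no register" are assumptions for illustration, and the load case that still needs an interlock is not modeled.

```python
def select_alu_input(source_reg, ex_mem_dest, mem_wb_dest):
    """Pick where an ALU operand should come from.

    source_reg:  register the instruction in EX wants to read
    ex_mem_dest: destination register of the instruction one ahead (or None)
    mem_wb_dest: destination register of the instruction two ahead (or None)
    """
    if ex_mem_dest is not None and source_reg == ex_mem_dest:
        return "EX/MEM"    # most recent producer takes priority
    if mem_wb_dest is not None and source_reg == mem_wb_dest:
        return "MEM/WB"
    return "REGFILE"       # no newer value in flight


# ADD R1,R2,R3 followed by SUB R4,R1,R5 and then AND R6,R1,R7:
sub_operand = select_alu_input("R1", ex_mem_dest="R1", mem_wb_dest=None)
and_operand = select_alu_input("R1", ex_mem_dest="R4", mem_wb_dest="R1")
print(sub_operand, and_operand)
```

Checking EX/MEM before MEM/WB matters when both in-flight instructions write the same register: the younger result is the one the program order requires.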
22
Forwarding Implementation (2/2)
[Figure: forwarding datapath showing the ID/EX, EX/MEM, and MEM/WB registers, multiplexers on the ALU inputs, the Zero? test, and the data memory]
23
Control Hazards
    Branch              IF    ID    EX    MEM   WB
    Branch successor          IF    stall stall IF    ID    EX    MEM   WB
    Branch successor                                  IF    ID    EX    MEM   WB
    Branch successor                                        IF    ID    EX    MEM   WB
    Branch successor                                              IF    ID    EX    MEM
    Branch successor                                                    IF    ID    EX

- Stall the pipeline until we reach MEM
  - Easy, but expensive
  - Three cycles for every branch
- To reduce the branch delay
  - Find out whether the branch is taken or not taken ASAP
  - Compute the branch target ASAP
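The cost of a three-cycle stall on every branch can be put into numbers. In this sketch the 20% branch frequency is a made-up illustrative value (the slides do not give one here), and the base CPI of 1 assumes no other stalls.

```python
def cpi_with_branch_stalls(branch_frequency, penalty_cycles, base_cpi=1.0):
    """Effective CPI when every branch costs `penalty_cycles` stall cycles."""
    return base_cpi + branch_frequency * penalty_cycles


branch_frequency = 0.20    # hypothetical: one instruction in five is a branch
stall_cpi = cpi_with_branch_stalls(branch_frequency, penalty_cycles=3)
reduced_cpi = cpi_with_branch_stalls(branch_frequency, penalty_cycles=1)
print(stall_cpi, reduced_cpi)
```

Cutting the penalty from three cycles to one, which is what resolving the branch earlier aims for, removes two thirds of the branch-induced CPI increase in this model.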
24
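Optimized Branch Execution
[Figure: pipelined datapath for optimized branch execution, showing the PC, instruction cache, registers, sign extend, Zero? test, two adders, ALU, data cache, multiplexers, and the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]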
25
Branch Behavior in Programs
                                     Integer    FP
    Forward conditional branches     %          7%
    Backward conditional branches    %          %
    Unconditional branches           %          %
    Branches taken                   %          %
26
Reduction of Branch Penalties
Static, compile-time, branch prediction schemes
1. Stall the pipeline
   - Simple in hardware and software
2. Treat every branch as not taken
   - Continue execution as if the branch were a normal instruction
   - If the branch is taken, turn the fetched instruction into a no-op
3. Treat every branch as taken
   - Useless in DLX
4. Delayed branch
   - Sequential successors (in delay slots) are executed anyway
   - No branches in the delay slots
27
Predict-not-taken Scheme
    Untaken branch    IF    ID    EX    MEM   WB
    Instruction i           IF    ID    EX    MEM   WB
    Instruction i                 IF    ID    EX    MEM   WB
    Instruction i                       IF    ID    EX    MEM   WB
    Instruction i                             IF    ID    EX    MEM   WB

    Taken branch      IF    ID    EX    MEM   WB
    Instruction i           IF    stall stall stall stall   (clear the IF/ID register)
    Branch target                 IF    ID    EX    MEM   WB
    Branch target                       IF    ID    EX    MEM   WB
    Branch target                             IF    ID    EX    MEM   WB

- Compiler organizes code so that the most frequent path is the not-taken one
28
Optimizations of the Branch Slot
[Figure: three ways to fill the branch delay slot, each shown before and after scheduling: from before the branch, from the branch target, and from the fall-through path. The examples use ADD R1,R2,R3, SUB R4,R5,R6, and OR R7,R8,R9 with the branches "if R2=0 then" and "if R1=0 then"]
29
Branch Slot Requirements
    Strategy                Requirements                                                        Improves performance
    a) From before          Branch must not depend on delayed instruction                       Always
    b) From target          Must be OK to execute delayed instruction if branch is not taken    When branch is taken
    c) From fall through    Must be OK to execute delayed instruction if branch is taken        When branch is not taken

- Limitations in delayed-branch scheduling
  - Restrictions on instructions that are scheduled
  - Ability to predict branches at compile time
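The requirement for strategy (a) is easy to state as a check: the instruction moved into the delay slot must not write a register that the branch condition reads. The sketch below illustrates just that one rule; the function name and the register-name representation are hypothetical, and a real scheduler would also check the other dependences around the moved instruction.

```python
def can_fill_from_before(candidate_dest, branch_sources):
    """True if an instruction from before the branch may move into the delay slot.

    candidate_dest: register written by the candidate instruction
    branch_sources: registers the branch condition reads
    """
    return candidate_dest not in branch_sources


# ADD R1,R2,R3 ahead of "if R2=0 then": the branch does not read R1.
ok_case = can_fill_from_before("R1", {"R2"})
# ADD R1,R2,R3 ahead of "if R1=0 then": the branch depends on the ADD.
blocked_case = can_fill_from_before("R1", {"R1"})
print(ok_case, blocked_case)
```

When the check fails, the slot has to be filled from the target or the fall-through path instead, subject to the requirements in rows (b) and (c).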
30
Cancelling Branch Instructions
- Cancelling branch includes the predicted direction
- Incorrect prediction => delay-slot instruction becomes a no-op
- Helps the compiler to fill branch delay slots (no requirements for b and c)
- Behavior of a predicted-taken cancelling branch

    Untaken branch    IF    ID    EX    MEM   WB
    Instruction i           IF    stall stall stall stall   (clear the IF/ID register)
    Instruction i                 IF    ID    EX    MEM   WB
    Instruction i                       IF    ID    EX    MEM   WB
    Instruction i                             IF    ID    EX    MEM   WB

    Taken branch      IF    ID    EX    MEM   WB
    Instruction i           IF    ID    EX    MEM   WB
    Branch target                 IF    ID    EX    MEM   WB
    Branch target i                     IF    ID    EX    MEM   WB
    Branch target i                           IF    ID    EX    MEM   WB
31
Performance of Branch Schemes
[Table: for each scheduling scheme (Stall pipeline, Predict taken, Predict not taken, Delayed branch), the branch penalty (conditional, unconditional, average) and the effective CPI with branch stalls, for Int and FP programs]
32
Example
Role of pipeline delay on branch penalty (R4000)
- Computation of branch target (3 stages)
- Evaluation of branch condition (4 stages)

    Branch scheme      Penalty uncond    Penalty untaken    Penalty taken
    Stall pipeline
    Predict taken
    Predict untaken

Predict untaken:

    Taken branch     IF    IS    RF    EX    DF    DS    TC    WB
    Instruction i          IF    IS    RF    stall stall stall ...
    Instruction i                IF    IS    stall stall stall ...
    Instruction i                      IF    stall stall stall ...
    Branch target                            IF    IS    RF    EX    DF    DS    TC    WB
33
Static Branch Prediction
- Correct predictions
  - Reduce branch hazard penalty
  - Help the scheduling of data hazards:

        LW   R1, 0(R2)
        SUB  R1, R1, R3
        BEQZ R1, L
        OR   R4, R5, R6
        ADD  R10, R4, R3
    L:  ADD  R7, R8, R9

    - If branch is almost never taken
    - If branch is almost always taken
- Prediction methods
  - Examination of program behavior (benchmarks)
  - Use of profile information from previous runs
34
Performance of the DLX Integer Pipeline
35
Exceptions
- I/O device request
- Operating system call
- Tracing instruction execution
- Breakpoint
- Integer overflow
- FP arithmetic anomaly
- Page fault
- Misaligned memory access
- Memory protection violation
- Undefined instruction
- Hardware malfunctions
- Power failure
36
Types of Exceptions
- Synchronous vs. asynchronous
- User requested vs. coerced
- User maskable vs. nonmaskable
- Within vs. between instructions
- Resume vs. terminate
- Most difficult
  - Occur in the middle of the instruction (EX or MEM)
  - Must be able to restart
37
Stopping and Restarting Execution
- TRAP, RTE instructions
- IAR register
- Safely save the state of the pipeline
  - Force a TRAP on the next IF
  - Until the TRAP is taken, turn off all writes for the faulting instruction and the following ones
  - Exception-handling routine saves the PC of the faulting instruction
  - For delayed branches we need to save more PCs
- Precise exceptions