Hazard Analysis Procedure: Memory Data Dependences — Output Dependence (WAW), Anti-Dependence (WAR), True Data Dependence (RAW) — and Register Data Dependences


1 Hazard Analysis Procedure
- Memory Data Dependences: Output Dependence (WAW), Anti-Dependence (WAR), True Data Dependence (RAW)
- Register Data Dependences: Output Dependence (WAW), Anti-Dependence (WAR), True Data Dependence (RAW)
- Control Dependences

2 Terminology
- Pipeline Hazards: potential violations of program dependences; the pipeline must ensure program dependences are not violated
- Hazard Resolution:
  - Static method: performed at compile time in software
  - Dynamic method: performed at run time using hardware (stall, flush, or forward)
- Pipeline Interlock: hardware mechanisms for dynamic hazard resolution; must detect and enforce dependences at run time

3 Necessary Conditions for Data Hazards
- WAW hazard (output dependence, i O j): i: rk ← _ and j: rk ← _ — both register writes
- WAR hazard (anti-dependence, i A j): i: _ ← rk (register read) and j: rk ← _ (register write)
- RAW hazard (true dependence, i D j): i: rk ← _ (register write) and j: _ ← rk (register read)
In each case, with i accessing the register in stage X and j accessing it in stage Y:
- dist(i,j) ≤ dist(X,Y) ⇒ Hazard!
- dist(i,j) > dist(X,Y) ⇒ Safe
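The distance condition above can be written as a small predicate. This is only a sketch, assuming the 6-stage pipeline (IF, ID, RD, ALU, MEM, WB) used later in these slides; the stage numbering table and function name are mine, not from the slides.

```python
# Hazard predicate: instruction i accesses the shared register in stage X,
# instruction j accesses it in stage Y, and j trails i by dist_ij
# instructions.  Assumed 6-stage pipeline numbering from the slides.
STAGE = {"IF": 1, "ID": 2, "RD": 3, "ALU": 4, "MEM": 5, "WB": 6}

def is_hazard(dist_ij, stage_i, stage_j):
    """dist(i,j) <= dist(X,Y) => hazard; dist(i,j) > dist(X,Y) => safe."""
    return dist_ij <= STAGE[stage_i] - STAGE[stage_j]

# RAW through a register: i writes in WB (6), j reads in RD (3):
print(is_hazard(3, "WB", "RD"))  # True  -- within dist(X,Y) = 3
print(is_hazard(4, "WB", "RD"))  # False -- far enough apart to be safe
```

Note that for a WAR pair in this in-order pipeline the read stage (RD) precedes the write stage (WB), so dist(X,Y) is negative and the predicate never fires, matching the intuition that an in-order scalar pipeline cannot violate anti-dependences.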

4 Inter-instruction Data Hazards
[Table: resources read/written in each pipe stage for ALU, Load, Store, and Branch instructions; row 1: IF — I-cache, PC]

5 Pipelining: Steady State

          t0   t1   t2   t3   t4   t5
  Inst h  IF   ID   RD   ALU  MEM  WB
  Inst i       IF   ID   RD   ALU  MEM
  Inst j            IF   ID   RD   ALU
  Inst k                 IF   ID   RD
  Inst l                      IF   ID

6 Pipelining: Data Hazards
i: rk ← _ and j: _ ← rk (i precedes j in the pipeline):

          t0   t1   t2   t3   t4   t5
  Inst h  IF   ID   RD   ALU  MEM  WB
  Inst i       IF   ID   RD   ALU  MEM
  Inst j            IF   ID   RD   ALU
  Inst k                 IF   ID   RD
  Inst l                      IF   ID

7 Pipelining: Stall on Data Hazard
i: rk ← _ and j: _ ← rk — a bubble separates them until the hazard has passed.
[Diagram, three snapshots: Inst j stalled in RD, Inst k stalled in ID, Inst l stalled in IF, while bubbles drain through ALU, MEM, and WB]
Stall == make the younger instruction wait until the hazard has passed:
1. stop all up-stream stages
2. drain all down-stream stages
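The "freeze upstream, drain downstream" behaviour can be rendered as a toy reservation table. This is my own illustrative sketch (the renderer, its layout, and the 16-cycle window are assumptions, not the slides' figure): the stalled instruction repeats its stage, each younger instruction freezes one stage further upstream, and bubbles implicitly enter the stages downstream of the stall.

```python
# Toy renderer for one row of a pipeline diagram with stalls.
STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]

def row(label, start, stall_stage=None, stalls=0):
    """Instruction enters IF at cycle `start` and repeats `stall_stage`
    an extra `stalls` times before advancing ('--' marks empty cycles)."""
    cells, t, s = ["--"] * 16, start, 0
    while s < len(STAGES):
        cells[t] = STAGES[s]
        if STAGES[s] == stall_stage and stalls:
            stalls -= 1        # stay put: a bubble enters the next stage
        else:
            s += 1
        t += 1
    return f"{label}: " + " ".join(c.ljust(3) for c in cells)

print(row("i", 0))                              # producer flows freely
print(row("j", 1, stall_stage="RD", stalls=3))  # consumer held in RD
print(row("k", 2, stall_stage="ID", stalls=3))  # frozen upstream, in ID
```

The three printed rows never place two instructions in the same stage in the same cycle: while j repeats RD, k repeats ID directly behind it, which is exactly the "stop all up-stream stages" rule.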

8 Pipeline: Stall on Data Hazard
[Reservation table, stages (IF, ID, RD, ALU, MEM, WB) versus cycles t0–t10: Ih and Ii drain normally; Ij, Ik, and Il stall during t3–t5 while nops (bubbles) pass through ALU, MEM, and WB, then resume and complete by t10]

9 What should these cases look like in the pipeline diagram?
- case 1: i: rk ← _ ; j: ... ; k: _ ← rk
- case 2: i: rk ← _ ; j: ... ; k: ... ; l: _ ← rk
- case 3: i: rk ← _ ; j: rk ← _ ; k: rk ← _ ; l: _ ← rk
- case 4: i: rk ← _ ; j: _ ← rk ; k: _ ← rk ; l: _ ← rk
[Pipeline diagram as on slide 5, with the dependent instructions i and j marked]

10 Stall Conditions
[Table: stall conditions per pipe stage for ALU, Load, Store, and Branch instructions; row 1: IF — I-cache, PC]

11 Inter-instruction Control Hazards
[Table: per-pipe-stage resources for ALU, Load, Store, and Branch instructions; row 1: IF — I-cache, PC]

12 Pipeline: Control Hazard
i: br k    i+1: xxxxxx    ...    k: yyyyyy
[Pipeline diagram: instructions i through i+4 are fetched sequentially while the branch resolves; the target k cannot be fetched until the branch outcome and target are known]

13 Pipeline: Control Hazard
i: br k    i+1: xxxxxx    ...    k: yyyyyy
[Diagram: the sequential instructions fetched behind the branch (i+1, i+2, ...) are squashed when the branch resolves, and fetch redirects to k, k+1, k+2]

14 Pipeline Flush on Control Hazards
[Reservation table, cycles t0–t10: Ii and its sequential successors Ii+1 ... Ii+4 fill the pipe; when the branch resolves, the successors are flushed (replaced by nops) and fetch restarts at the target, with Ik ... Ik+5 refilling the pipeline by t10]

15 Branch Instruction Interlock
[Figure: branch instruction interlock]

16 Delayed Branches
i: br to k    i+1: xxxx    ...    k: yyyy
[Diagram: the branch i resolves late, but i+1 ... i+4 — the branch delay slots — are allowed to complete; fetch then continues at k, k+1, k+2]

17 Penalties Due to Stalling for RAW

  Leading inst i:             ALU / Load       Branch
  Trailing inst j:            ALU, L/S, Br.    ALU, L/S, Br.
  Hazard register:            Int. reg (Ri)    PC
  Register WRITE stage (i):   WB (stage 6)     MEM (stage 5)
  Register READ stage (j):    RD (stage 3)     IF (stage 1)
  RAW distance / penalty:     3 cycles         4 cycles
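The penalty row is just the stage distance minus the instruction distance, plus one, clamped at zero. A sketch of that arithmetic, assuming the slides' 6-stage numbering and that there is no same-cycle write/read bypass inside the register file (function name mine):

```python
# Stall penalty without forwarding: j's register read must come strictly
# after i's register write.
STAGE = {"IF": 1, "ID": 2, "RD": 3, "ALU": 4, "MEM": 5, "WB": 6}

def stall_penalty(dist_ij, write_stage, read_stage):
    return max(0, STAGE[write_stage] - STAGE[read_stage] - dist_ij + 1)

print(stall_penalty(1, "WB", "RD"))   # ALU/Load RAW on a register: 3
print(stall_penalty(1, "MEM", "IF"))  # Branch RAW on the PC: 4
```

With a distance of 4 or more the same call returns 0, matching the "dist(i,j) > dist(X,Y) ⇒ safe" rule from slide 3.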

18 Forwarding Path Analysis
[Pipeline diagram (as on slide 5) for instructions i ... i+4, used to identify which stage outputs can be forwarded to which stage inputs]

19 Critical Forwarding Paths
[Table: critical forwarding paths per pipe stage for ALU, Load, Store, and Branch instructions; row 1: IF — I-cache, PC]

20 Penalties with Forwarding Paths

  Leading inst i (producer):    ALU             Load            Branch
  Trailing inst j (consumer):   ALU, L/S, Br.   ALU, L/S, Br.   ALU, L/S, Br.
  Hazard register:              Int. reg (Ri)   Int. reg (Ri)   PC
  Value produced in stage (i):  ALU (stage 4)   MEM (stage 5)   MEM (stage 5)
  Value consumed in stage (j):  ALU (stage 4)   ALU (stage 4)   IF (stage 1)
  Forward from outputs of:      ALU, MEM, WB    MEM, WB         MEM
  Forward to input of:          ALU             ALU             IF
  RAW distance / penalty:       0 cycles        1 cycle         4 cycles
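With forwarding, the same distance arithmetic as in slide 17 applies, but to the value-produced and value-consumed stages rather than the register-file write and read stages. A sketch (stage numbers from the slides' 6-stage pipeline; function name mine):

```python
# Penalty with forwarding: the value leaves the producing stage's output and
# is fed to the consuming stage's input, so only that distance matters.
STAGE = {"IF": 1, "ID": 2, "RD": 3, "ALU": 4, "MEM": 5, "WB": 6}

def forward_penalty(dist_ij, produce_stage, consume_stage):
    return max(0, STAGE[produce_stage] - STAGE[consume_stage] - dist_ij + 1)

print(forward_penalty(1, "ALU", "ALU"))  # ALU producer -> ALU consumer: 0
print(forward_penalty(1, "MEM", "ALU"))  # Load producer -> ALU consumer: 1
print(forward_penalty(1, "MEM", "IF"))   # Branch: the PC still costs 4
```

The branch row shows why forwarding alone does not help branches here: the PC is consumed in IF (stage 1), before any value-producing stage, so the 4-cycle distance survives.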

21 Implementation of Pipeline Interlock: ALU
i: RX ← _ with later readers of RX:
- dist=1 (i → i+1: _ ← RX): forwarding via path a
- dist=2 (i → i+2: _ ← RX, with i+1: RY ← _ in between): forwarding via path b
- dist=3 (i → i+3: _ ← RX): no forwarding needed — i writes RX back before i+3 reads it

22 ALU Forwarding Paths
[Figure: ALU forwarding paths]

23 Forwarding Path(s) for Load Instructions
i: RX ← mem[ ] with later readers of RX:
- dist=1 (i → i+1: _ ← RX): stall i+1, then forwarding via path d
- dist=2 (i → i+2: _ ← RX, with i+1: RY ← _ in between): i writes RX before i+2 reads it

24 Load Forwarding Path
[Figure: the D-cache data output is forwarded back toward the ALU input; on a load-use hazard, IF, ID, RD, and ALU stall while the load finishes MEM and WB]

25 Load Delay Slot (MIPS R2000)
h: Rk ← --
i: Rk ← MEM[ - ]
j: -- ← Rk
k: -- ← Rk
Which (Rk) do we really mean? The effect of a "delayed" load is not visible to the instructions in its delay slots.
[Pipeline diagram for h, i, j, k]

26 Interrupts
- An unexpected transfer of control flow
  - Control is given to a system program and later given back (restartable)
  - Transparent to the interrupted program
- Asynchronous interrupt (e.g. I/O): can be taken at some convenient point
- Synchronous interrupt (e.g. divide by 0): associated with a particular instruction
[Figure: instruction stream i1, i2, i3 ... is interrupted; handler H1 ... Hn runs, then execution restarts]

27 Precise Interrupts
[Figure: sequential code semantics (i1, i2, i3 one after another) versus overlapped execution in the pipeline]
A precise interrupt appears (to the interrupt handler) to take place exactly between two instructions.

28 What to Do on an Interrupt
- Find the point of interrupt in the instruction stream
- Allow instructions prior to the interrupt point to complete
- Save enough state to allow restarting at the same point
  - Program counter, status registers, etc. (registers that are clobbered during the transition)
  - Why not the general-purpose register file?
- Start executing from a pre-specified interrupt handler address
  - Almost always involves a change in "protection mode"
  - Must be careful not to leave any loophole by which a draining user instruction can act like a "privileged" instruction during the transition
- "Return-from-interrupt" instruction: jump to the saved PC and restore the clobbered state from the saved copies

29 Stopping and Restarting a Pipeline
[Reservation table, cycles t0–t10: instructions I1–I6 are in flight when the interrupt arrives; I1 and I2 are allowed to complete, the younger instructions become nops, and the handler instructions I_IH, I_IH+1, ... are fetched into the draining pipeline]

30 Real Pipelined Processor Example: MIPS R2000 (each stage has two phases, φ1 and φ2)

  Stage    Phase  Function performed
  1. IF    φ1     Translate virtual instr. addr. using TLB
           φ2     Access I-cache using physical address
  2. RD    φ1     Return instr. from I-cache, check tags & parity
           φ2     Read reg. file; if a branch, generate target addr.
  3. ALU   φ1     Start ALU op.; if a branch, check br. condition
           φ2     Finish ALU op.; if a load/store, translate virtual addr.
  4. MEM   φ1     Access D-cache
           φ2     Return data from D-cache, check tags & parity
  5. WB    φ1     Write register file
           φ2     ---

31 Intel i486 5-Stage "CISC" Pipeline

  Stage name               Function performed
  1. Instruction Fetch     Fetch instruction from the 32-byte prefetch queue (prefetch unit fills and flushes the prefetch queue)
  2. Instruction Decode-1  Translate instr. into control signals or microcode addr.; initiate addr. generation and memory access
  3. Instruction Decode-2  Access microcode memory; output microinstruction to execution unit
  4. Execute               Execute ALU and memory-accessing operations
  5. Register Write-back   Write back results to register

32 IBM's Experience on Pipelined Processors [Agerwala and Cocke 1987]
Attributes and assumptions:
- Memory bandwidth
  - at least one word/cycle to fetch 1 instruction/cycle from I-cache
  - 40% of instructions are load/store and require access to D-cache
- Code characteristics (dynamic)
  - loads: 25%
  - stores: 15%
  - ALU/RR: 40%
  - branches: 20% — 1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken

33 More Statistics and Assumptions
- Cache performance
  - hit ratio of 100% is assumed in the experiments
  - cache latency: I-cache = i cycles, D-cache = d cycles; default i = d = 1 cycle
- Load and branch scheduling
  - loads: 25% cannot be scheduled; 75% can be moved back 1 instruction
  - branches: unconditional — 100% schedulable; conditional — 50% schedulable

34 CPI Calculations I
- No cache bypass of reg. file, no scheduling of loads or branches
  - Load penalty: 2 cycles (0.25 × 2 = 0.5)
  - Branch penalty: 2 cycles (0.2 × 0.66 × 2 = 0.27)
  - Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
- Bypass, no scheduling of loads or branches
  - Load penalty: 1 cycle (0.25 × 1 = 0.25)
  - Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
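These totals follow mechanically from the instruction mix on slide 32 (25% loads, 20% branches, 2/3 of branches taken). A sketch of the check, with variable names of my own:

```python
# CPI I: base CPI of 1 plus load and taken-branch stall penalties.
branch_pen = 0.20 * (2 / 3) * 2          # 2-cycle penalty per taken branch
cpi_no_bypass = 1 + 0.25 * 2 + branch_pen    # 2-cycle load penalty
cpi_bypass    = 1 + 0.25 * 1 + branch_pen    # bypass halves the load penalty
print(round(cpi_no_bypass, 2), round(cpi_bypass, 2))  # 1.77 1.52
```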

35 CPI Calculations II
- Bypass, scheduling of loads and branches
  - Load penalty: 75% can be moved back 1 instruction ⇒ no penalty; remaining 25% ⇒ 1-cycle penalty (0.25 × 0.25 × 1 = 0.063)
  - Branch penalty:
    - 1/3 unconditional, 100% schedulable ⇒ 1 cycle (0.2 × 0.33 × 1 = 0.066)
    - 1/3 conditional not taken: if biased for NT ⇒ no penalty
    - 1/3 conditional taken: 50% schedulable ⇒ 1 cycle (0.2 × 0.33 × 0.5 × 1 = 0.033); 50% unschedulable ⇒ 2 cycles (0.2 × 0.33 × 0.5 × 2 = 0.066)
  - Total CPI: 1 + 0.063 + 0.066 + 0.033 + 0.066 = 1.23 CPI
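The same check for the scheduled case (a sketch; variable names mine, mix and scheduling fractions from slides 32–33):

```python
# CPI II: bypass plus compiler scheduling of loads and branches.
load_pen  = 0.25 * 0.25 * 1                     # 25% of loads unschedulable
br_uncond = 0.20 * (1 / 3) * 1                  # scheduled: 1 cycle remains
br_taken  = 0.20 * (1 / 3) * (0.5 * 1 + 0.5 * 2)  # cond taken, 50% scheduled
cpi = 1 + load_pen + br_uncond + br_taken       # cond not-taken: no penalty
print(round(cpi, 2))  # 1.23
```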

36 CPI Calculations III
- Parallel target address generation
  - 90% of branches can be coded as PC-relative, i.e. the target address can be computed without a register access
  - A separate adder can compute (PC + offset) in the decode stage
- Branch penalty:

  PC-relative addressing   Schedulable   Branch penalty
  YES (90%)                YES (50%)     0 cycles
  YES (90%)                NO (50%)      1 cycle
  NO (10%)                 YES (50%)     1 cycle
  NO (10%)                 NO (50%)      2 cycles

- Total CPI: 1.15 CPI = 0.87 IPC
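Applying the four-row penalty table to the taken branches (2/3 of the 20% branch mix) reproduces the slide's totals to within rounding — the same arithmetic gives about 1.14 here versus the slide's 1.15, because the slide rounds the per-case products more coarsely. A sketch, names mine:

```python
# CPI III: per-taken-branch penalty weighted by the PC-relative (90%/10%)
# and schedulable (50%/50%) fractions from the table.
load_pen  = 0.25 * 0.25 * 1
taken     = 0.20 * (2 / 3)         # unconditional + conditional-taken
per_taken = (0.9 * 0.5 * 0 + 0.9 * 0.5 * 1 +   # PC-rel: sched / unsched
             0.1 * 0.5 * 1 + 0.1 * 0.5 * 2)    # not PC-rel: sched / unsched
cpi = 1 + load_pen + taken * per_taken
print(round(cpi, 2), round(1 / cpi, 2))
```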

37 Pipelined Depth
[Figure: an unpipelined unit with total latency T versus a k-stage pipeline whose cycle time is T/k + S, where S is the per-stage staging (latch) overhead]
Throughput speedup of the k-stage pipeline over the unpipelined unit: T / (T/k + S). What is the right k?
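The figure's quantities fit the standard pipelining speedup expression; as a sketch (symbols T, k, S from the slide, the function itself mine):

```python
# Speedup of a k-stage pipeline over the unpipelined unit: cycle time is
# T/k + S, so speedup = T / (T/k + S), which approaches T/S as k grows.
def speedup(T, k, S):
    return T / (T / k + S)

print(speedup(100, 4, 2))    # 100 / (25 + 2) ~ 3.7x
assert speedup(100, 8, 2) > speedup(100, 4, 2)   # deeper pipelines help...
assert speedup(100, 10**6, 2) < 100 / 2          # ...but S caps the gain
```

This is why the latch overhead S, not k, ultimately limits useful pipeline depth.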

38 Limitations of Scalar Pipelines
- Upper bound on scalar pipeline throughput: limited by IPC = 1
- Inefficient unification into a single pipeline: long latency for each instruction
- Performance lost due to a rigid pipeline: unnecessary stalls

39 Stalls in an In-order Scalar Pipeline
Instructions are in order with respect to any one stage, i.e. there is no dynamic reordering.

