Can Pipelining Get Us Into Trouble?
- Yes: pipeline hazards
- Structural hazards: attempt to use the same resource in two different ways at the same time
  - E.g., a combined washer/dryer would be a structural hazard, or the folder being busy doing something else (watching TV)
- Control hazards: attempt to make a decision before the condition is evaluated
  - E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before putting the next load in
  - Branch instructions
- Data hazards: attempt to use an item before it is ready
  - E.g., one sock of a pair in the dryer and one in the washer; can't fold until the sock from the washer gets through the dryer
  - Instruction depends on the result of a prior instruction still in the pipeline

Structural Hazard
- A relation between two instructions indicating that the two instructions may want to use the same hardware resource (function unit, register file port, shared bus, cache port, etc.) at the same time
- The MIPS pipeline as designed so far does not have structural hazards
  - But we had to avoid them
- Usually occurs when a functional unit is not fully pipelined (e.g., in a floating-point pipeline)

Single Memory Port / Structural Hazard
[Figure: pipeline diagram with time (clock cycles 1 to 7) on the horizontal axis and instruction order on the vertical axis; Load, Instr 1, Instr 2, Instr 3, and Instr 4 each pass through Ifetch, Reg, ALU, DMem, Reg.]

Single Memory Port / Structural Hazard
[Figure: the same pipeline diagram (clock cycles 1 to 7, instruction order Load, Instr 1, Instr 2, Stall, Instr 3), with a row of bubbles for the stall delaying Instr 3.]
- How do you “bubble” the pipe?

Single Memory Port / Structural Hazard
- Other solutions, instead of stalling the pipeline
  - Make the memory dual-ported
  - Physically separate the memory architecture into instruction and data memories (Harvard Architecture, from the Harvard Mark I project of IBM led by Dr. Howard Aiken)
- Another typical structural hazard
  - Functional unit is not fully pipelined due to cost/complexity
  - Pipeline interval > 1 pipe stage

Example: Cost of Structural Hazard
Suppose that 40% of the instruction mix are loads or stores, and that the ideal CPI of the pipelined machine is 1. Assume that the machine with the structural hazard has a clock rate that is 5% higher than the clock rate of the machine without the hazard. Which pipeline is faster, and by how much?

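One way to work the example above is to compare average instruction times (CPI times clock cycle time). The sketch below assumes that every load or store on the machine with the hazard costs exactly one stall cycle, which matches the single-memory-port case but is not stated in the problem; the variable names are illustrative only.

```python
# Sketch: average instruction time = CPI * clock cycle time.
# Assumption (not stated in the problem): each load/store stalls 1 cycle
# on the machine with the structural hazard.

def average_instruction_time(cpi, clock_cycle_time):
    """Average time per instruction, in units of the given cycle time."""
    return cpi * clock_cycle_time

mem_fraction = 0.40    # loads and stores in the instruction mix (from the problem)
ideal_cpi = 1.0        # ideal CPI of the pipelined machine (from the problem)
stall_cycles = 1       # assumed stall per load/store
clock_rate_gain = 1.05 # hazard machine's clock rate is 5% higher (from the problem)

# Machine without the hazard: ideal CPI, reference cycle time of 1.0.
time_no_hazard = average_instruction_time(ideal_cpi, 1.0)

# Machine with the hazard: extra stall cycles, but a shorter cycle time.
time_hazard = average_instruction_time(
    ideal_cpi + mem_fraction * stall_cycles, 1.0 / clock_rate_gain
)

ratio = time_hazard / time_no_hazard
print(f"Machine without the hazard is {ratio:.2f}x faster")  # prints 1.33x
```

Under that assumption the machine without the structural hazard is about 1.33 times faster, despite its slower clock.
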
Data Hazards
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
[Figure: pipeline diagram (instruction order vs. time) of these instructions passing through Ifetch, Reg, ALU, DMem, Reg.]

Three Generic Data Hazards
- True (or Flow) Dependency (Read After Write, or RAW)
  - A later instruction tries to read an operand before an earlier instruction writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3

RAW Hazards
- True (value, flow) dependence between instructions i and j means i produces a result value that j uses
  - This is a producer-consumer relationship
  - This is a dependence based on values, not on the names of the containers of the values
- Every true dependence is a RAW hazard
- Not every RAW hazard is a true dependence
- Any RAW hazard that cannot be removed by renaming is a true dependence

Original program:
    1: A = B+C
    2: A = D+E
    3: G = A+H
  True dependence: (2,3)
  RAW hazard: (1,3), (2,3)

Renamed program:
    1: X = B+C
    2: A = D+E
    3: G = A+H
  True dependence: (2,3)
  RAW hazard: (2,3)

Three Generic Data Hazards
- Anti-Dependency (Write After Read, or WAR)
  - A later instruction tries to write an operand before an earlier instruction reads it
    I: add r2,r1,r3
    J: sub r1,r4,r3
  - This hazard results from reuse of the same register
  - Can't happen in our simple 5-stage pipeline because:
    - All instructions take 5 stages, and
    - Reads are always in stage 2, and
    - Writes are always in stage 5

Three Generic Data Hazards
- Output Dependency (Write After Write, or WAW)
  - A later instruction tries to write an operand before an earlier instruction writes it
    I: add r1,r2,r3
    J: sub r1,r4,r3
  - This hazard results from reuse of the same register
  - Can't happen in our simple 5-stage pipeline because:
    - All instructions take 5 stages, and
    - Reads are always in stage 2, and
    - Writes are always in stage 5

More on WAR and WAW
- WAR and WAW hazards are name dependences
  - Two instructions happen to use the same register (name), although they don't have to
- Can often be eliminated by renaming, either in software or hardware
  - Implies the use of additional resources, hence additional cost
  - Renaming is not always possible: implicit operands such as the accumulator, PC, or condition codes cannot be renamed

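The three hazard definitions above can be checked mechanically by comparing register names. The following is a minimal sketch, assuming each instruction has one destination and a tuple of source registers; the `Instr` type and `classify_hazards` function are hypothetical names for illustration, and the check looks only at one pair of instructions (it ignores anything in between).

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, source registers."""
    op: str
    dest: str
    srcs: Tuple[str, ...]

def classify_hazards(i: Instr, j: Instr) -> Set[str]:
    """Return the potential hazards when i comes before j in program order."""
    hazards = set()
    if i.dest in j.srcs:   # j reads what i writes
        hazards.add("RAW")
    if j.dest in i.srcs:   # j writes what i reads
        hazards.add("WAR")
    if i.dest == j.dest:   # both write the same register
        hazards.add("WAW")
    return hazards

# The I/J pairs used on the three hazard slides.
raw = classify_hazards(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3")))
war = classify_hazards(Instr("add", "r2", ("r1", "r3")), Instr("sub", "r1", ("r4", "r3")))
waw = classify_hazards(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r1", ("r4", "r3")))
print(raw, war, waw)
```

Because the check compares names only, it reports the name dependences (WAR, WAW) that renaming could remove as well as true dependences.
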
How to Break the Dependency
- Dependency reduces concurrency
- Can we break
  - True dependency (RAW)?
  - Name dependency or false dependency (WAR, WAW)?

Software Solution
- Have the compiler guarantee no hazards
- Where do we insert the “nops”?
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
- Problem: this really slows us down!

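A compiler pass that answers the question above might look like the sketch below. It assumes a pipeline with no forwarding in which the register file is written in the first half of a cycle and read in the second half, so a consumer must sit at least three slots after its producer; that distance, the tuple encoding, and the name `insert_nops` are assumptions for illustration, not something the slides specify.

```python
# Sketch: pad a program with nops so that no instruction reads a register
# written by one of the two instructions immediately before it.
# Assumption: no forwarding; write in first half-cycle, read in second half.

MIN_DISTANCE = 3            # assumed producer-to-consumer distance in slots
NOP = ("nop", None, ())     # (opcode, destination, sources)

def insert_nops(program):
    out = []
    for op, dest, srcs in program:
        needed = 0
        # Look back at the slots that are still too close to this instruction.
        for distance in range(1, MIN_DISTANCE):
            if distance > len(out):
                break
            prev_dest = out[-distance][1]
            # Stores and nops write no register; $0 never carries a dependence.
            if prev_dest is not None and prev_dest != "$0" and prev_dest in srcs:
                needed = max(needed, MIN_DISTANCE - distance)
        out.extend([NOP] * needed)
        out.append((op, dest, srcs))
    return out

program = [
    ("sub", "$2", ("$1", "$3")),
    ("and", "$12", ("$2", "$5")),
    ("or", "$13", ("$6", "$2")),
    ("add", "$14", ("$2", "$2")),
    ("sw", None, ("$15", "$2")),   # sw reads both registers, writes none
]
padded = insert_nops(program)
print([instr[0] for instr in padded])
```

Under these assumptions the two nops go between `sub` and `and`; the later uses of `$2` are then far enough from the `sub` that no more padding is needed.
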
Forwarding Unit
1. Forwarding between ALUOut and ALUMuxA
    sub $2, $1, $3
    and $12, $2, $5
  EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 => use EX/MEM.ALUOut instead of ID/EX.A
  a. Some instructions do not write registers
  b. Every use of $0 as an operand must yield an operand value of zero
  If ( EX/MEM.RegWrite &
       (EX/MEM.RegisterRd ≠ 0) &
       (EX/MEM.RegisterRd = ID/EX.RegisterRs) )
    ForwardA = 01

Forwarding Unit
2. Forwarding between ALUOut and ALUMuxB
    sub $2, $1, $3
    and $12, $5, $2
  EX/MEM.RegisterRd = ID/EX.RegisterRt = $2 => use EX/MEM.ALUOut instead of ID/EX.B
  If ( EX/MEM.RegWrite &
       (EX/MEM.RegisterRd ≠ 0) &
       (EX/MEM.RegisterRd = ID/EX.RegisterRt) )
    ForwardB = 01

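The two forwarding conditions above translate almost directly into code. This is a minimal sketch covering only the EX/MEM-to-ALU case shown on these slides; the `ExMem` and `IdEx` classes and the function name are hypothetical stand-ins for the pipeline registers, and the function returns booleans rather than a mux encoding.

```python
from dataclasses import dataclass

@dataclass
class ExMem:
    """Hypothetical model of the EX/MEM fields the forwarding unit reads."""
    reg_write: bool
    register_rd: int

@dataclass
class IdEx:
    """Hypothetical model of the ID/EX fields the forwarding unit reads."""
    register_rs: int
    register_rt: int

def forward_from_ex_mem(ex_mem: ExMem, id_ex: IdEx):
    """Return (forward_a, forward_b): use EX/MEM.ALUOut for ALU input A / B."""
    # The producer must really write a register, and $0 must stay zero.
    usable = ex_mem.reg_write and ex_mem.register_rd != 0
    forward_a = usable and ex_mem.register_rd == id_ex.register_rs
    forward_b = usable and ex_mem.register_rd == id_ex.register_rt
    return forward_a, forward_b

# sub $2, $1, $3 is in EX/MEM in both cases.
sub_in_mem = ExMem(reg_write=True, register_rd=2)
case_a = forward_from_ex_mem(sub_in_mem, IdEx(register_rs=2, register_rt=5))  # and $12, $2, $5
case_b = forward_from_ex_mem(sub_in_mem, IdEx(register_rs=5, register_rt=2))  # and $12, $5, $2
print(case_a, case_b)
```

The `register_rd != 0` test implements condition (b) and the `reg_write` test implements condition (a) from the first forwarding slide.
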
Some Other Data Dependencies
    add $1, $1, $     F | D | X | M | W
    sw  $7, 0($1)         F | D | X | M | W
    sw  $8, 0($1)             F | D | X | M | W
    sw  $9, 0($1)                 F | D | X | M | W

    add $1, $1, $     F | D | X | M | W
    sw  $1, 0($7)         F | D | X | M | W
    sw  $1, 0($8)             F | D | X | M | W
    sw  $1, 0($9)                 F | D | X | M | W

    lw  $1, 0($2)     F | D | X | M | W
    sw  $1, 0($7)         F | D | X | M | W
    sw  $1, 0($8)             F | D | X | M | W
    sw  $1, 0($9)                 F | D | X | M | W

Can't Always Forward
- Load word can still cause a hazard
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
[Figure: pipeline diagram (instruction order vs. time in clock cycles) of these instructions passing through Ifetch, Reg, ALU, DMem, Reg.]

Data Hazard Even with Forwarding
[Figure: pipeline diagram (instruction order vs. time in clock cycles) for lw r1, 0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9. A bubble delays sub, and, and or by one cycle each.]
- Thus, we need a hazard detection unit to “stall” the instruction that uses the load result

Stalling
- Hazard detection unit:
  If ( ID/EX.MemRead &
       ((ID/EX.RegisterRt = IF/ID.RegisterRs) |
        (ID/EX.RegisterRt = IF/ID.RegisterRt)) )
    stall the pipeline
- When the pipeline is stalled:
  - Do not fetch a new instruction: prevent the PC and IF/ID registers from changing
  - Create a “bubble” in the pipeline: set all control signals to 0 to create a “do nothing” instruction

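The stall condition above is a single boolean expression, sketched here as a function. The parameter names mirror the pipeline register fields in the condition; the function name `must_stall` is an assumption for illustration.

```python
def must_stall(id_ex_mem_read, id_ex_register_rt, if_id_register_rs, if_id_register_rt):
    """Load-use check: the instruction in EX is a load whose destination (Rt)
    is a source register of the instruction currently in ID."""
    return id_ex_mem_read and (
        id_ex_register_rt == if_id_register_rs
        or id_ex_register_rt == if_id_register_rt
    )

# lw r1, 0(r2) is in ID/EX (Rt = 1); sub r4, r1, r6 is in IF/ID (Rs = 1, Rt = 6).
stall_after_load = must_stall(True, 1, 1, 6)

# Same registers, but the instruction in ID/EX is not a load: no stall needed.
stall_after_alu = must_stall(False, 1, 1, 6)
print(stall_after_load, stall_after_alu)
```

When the function returns true, the hardware described above holds the PC and IF/ID register and zeroes the control signals entering ID/EX.
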
Hazard Detection Unit
[Figure: pipelined datapath (PC, instruction memory, registers, ALU, data memory, pipeline registers) with the hazard detection unit and the forwarding unit.]

Code Rescheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.

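To see why the second schedule is faster, one can count load-use stalls in each sequence. The sketch below assumes full forwarding, so the only stall is one cycle when an instruction reads a register loaded by the instruction immediately before it; the tuple encoding and the name `count_load_use_stalls` are illustrative assumptions.

```python
# Sketch: count one-cycle load-use stalls, assuming forwarding resolves
# every other data hazard. Each instruction is (opcode, destination, sources).

def count_load_use_stalls(program):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        # A load's value arrives too late for the very next instruction.
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

slow_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
slow_stalls = count_load_use_stalls(slow_code)
fast_stalls = count_load_use_stalls(fast_code)
print(slow_stalls, fast_stalls)
```

Under this model the slow code stalls twice (ADD right after the load of Rc, SUB right after the load of Rf), while the rescheduled code has no instruction directly after the load it depends on.
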
Branch in the Pipelined Datapath
[Figure: pipelined datapath annotated with where the branch is handled: "Computes branch target address", "Computes branch outcome", and "Changes PC".]

Branch (Control) Hazards
- When we decide to branch, other instructions are in the pipeline!
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
[Figure: pipeline diagram of these instructions passing through Ifetch, Reg, ALU, DMem, Reg.]

Solving Branch Hazards
- Stall the pipeline until the branch is complete
  - Branch is detected in the ID stage
  - Pipeline is stalled
  - Pipeline is restarted in the IF stage with either
    - the next instruction, or
    - the branch target
- Three clock cycles will be lost for each branch!

Reducing Taken Branch Penalty
- Branch is completed in the ID stage
- If the branch is taken, flush the pipeline
- 1-cycle loss for a taken branch
[Figure: pipeline timing table with rows Taken branch (F D X M W), Branch + 1, Branch target, and BT + 1.]

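Flushing the Instruction After Branch
[Figure: pipelined datapath with hazard detection and forwarding units, showing the branch comparison and target computation used to flush the instruction after the branch.]
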
Predict-Not-Taken (Predict-Untaken)
- Continue execution after the branch
- If the branch is not taken, no penalty
- If the branch is taken, flush the pipeline and lose 1 clock cycle
- What about Predict-Taken?

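The effect of these schemes on performance can be estimated by adding the average branch penalty to the base CPI. In the sketch below, the 3-cycle stall and the 1-cycle taken-branch penalty come from the slides, while the branch frequency (20%) and taken fraction (60%) are made-up illustrative values and `effective_cpi` is a hypothetical helper.

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction,
                  taken_penalty, untaken_penalty):
    """Base CPI plus the average number of cycles lost per instruction to branches."""
    lost_per_branch = (taken_fraction * taken_penalty
                       + (1.0 - taken_fraction) * untaken_penalty)
    return base_cpi + branch_fraction * lost_per_branch

# Hypothetical workload, for illustration only.
branch_fraction = 0.20
taken_fraction = 0.60

# Always stall: 3 cycles lost for every branch, taken or not.
cpi_stall = effective_cpi(1.0, branch_fraction, taken_fraction, 3, 3)

# Predict-not-taken with the branch resolved in ID: 1 cycle lost only if taken.
cpi_predict_not_taken = effective_cpi(1.0, branch_fraction, taken_fraction, 1, 0)

print(f"stall: {cpi_stall:.2f}, predict-not-taken: {cpi_predict_not_taken:.2f}")
```

With these assumed frequencies the CPI drops from 1.60 to 1.12; different workloads would give different numbers, but the untaken branches always become free under predict-not-taken.
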
Delayed Branches
- Execution cycle with a branch delay of length n:
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
  (sequential successors 1 through n form the branch delay of length n)
- Instructions in the branch delay slot are executed irrespective of the branch outcome

Delayed Branches on MIPS
- One branch delay slot on MIPS
- Taken and untaken branch behaviour are similar
- Compiler must fill the branch delay slot with a useful instruction

Delayed Branches
- Question: what instruction do we put in the branch delay slot?
  - Fill with NOP (always possible)
  - Fill from before (not always possible)
  - Fill from target (not always possible)
  - Fill from fall-through (not always possible)

Filling Branch Delay Slot
- Make sure R7 will not be used in the taken path before it is redefined

Cancelling Branches
- Improves the ability of the compiler to fill delay slots
- Instruction includes a bit showing its predicted direction
- When the branch behaves as predicted, the instruction in the delay slot is executed
- When the branch is incorrectly predicted, the instruction in the delay slot is turned into a NOP

Summary: Pipelining
- Reduce CPI by overlapping many instructions
  - Average throughput of approximately 1 CPI with a fast clock
- Utilize capabilities of the datapath
  - Start the next instruction while working on the current one
  - Limited by the length of the longest stage (plus fill/flush)
  - Detect and resolve hazards
- What makes it easy?
  - All instructions are the same length
  - Just a few instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction