# Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 1 Lecture 8 Advanced Pipeline.

## Presentation on theme: "Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 1 Lecture 8 Advanced Pipeline."— Presentation transcript:

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 1 Lecture 8 Advanced Pipeline

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 2 Extending the DLX to Handle Multi-cycle Operations IF IDMEM WB EX IF ID MEM WB EX int unit EX FP/int Multiply EX FP/Int divider EX FP adder DLX pipeline with 3 additional unpipelined, floadting-point functional units

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 3 Multicycle Operations IFID MEM WB EX integer unit M1 M2M3M4 M5 M6M7 FP/integer multiply A1A2 A3A4 FP adder DIV FP/integer divider 24 clock cycles

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 4 Latency and Initiation Interval Latency: Number of intervening cycles between an instruction that produces a result and an instruction that uses the result Initiation Interval: number of cycles that must elapse between issuing of two operations of a given type Integer ALU 0 1 Load 1 1 FP add 3 1 FP mul 6 1 FP div 24 25 Data needed Result available * FP LD and ST are same as integer by having 64-bit path to memory. MULTD IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB ADDD IF ID AI A2 A3 A4 MEM WB LD* IF ID EX MEM WB SD* IF ID EX MEM WB Example latency initiation interval

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 5 Floating Point Operations FP Instruction Latency Initiation Interval (MIPS R4000) Add, Subtract43 Multiply84 Divide3635 Square root112111 Negate21 Absolute value21 FP compare32 Cycles before using result Cycles before issuing instr of the same type Floating Point: long execution time Also, pipeline FP execution unit may initiate new instructions without waiting full latency Reality: MIPS R4000

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 6 Complications Due to FP Operations (in DLX) –Because the divide unit is not fully pipelined, structural hazards can ocur –WAW hazards are possible, since instructions no longer reach WB in order. (WAR hazards are not possible, since register reads always occur in ID) –Instructions can complete in a different order than they were issued, causing problems with exceptions –Because of longer latency of operations, stalls for RAW hazards will be more frequent

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 7 Summary of Pipelining Basics Hazards limit performance –Structural: need more HW resources –Data: need forwarding, compiler scheduling –Control: early evaluation of PC, delayed branch, prediction Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency Interrupts, FP Instruction Set makes pipelining harder Compilers reduce cost of data and control hazards –Load delay slots –Branch delay slots –Branch prediction

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 8 Case Study: MIPS R4000 and Introduction to Advanced Pipelining

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 9 Case Study: MIPS R4000 Pipeline 8 Stage Pipeline: IF First half of fetching of instruction PC selection Initiation of instruction cache access IS - Second half of fetching of instruction Access to instruction cache RFInstruction decode, register fetch, hazard checking, and also instruction cache hit detection(tag check) EX Execution Effective address calculation ALU operation Branch target computation and condition evaluation DF - First half of access to data cache DS - Second half of access to data cache TC - Tag check for data cache hit WB -Write back for loads and register-register operations

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 10 The Pipeline Structure of the R4000 Instruction Memory REG ALU Data Memory REG Instruction is available Tag check load data available IF IS RF EX DF DS TC WB

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 11 Case Study: MIPS R4000 LOAD Latency 2 Cycle Load Latency Load data available with forwarding LD R1, X IF IS RF EX DF DS TC WB IF IS RF EX DF DS IF IS RF EX DF DS... ADD R3, R1, R2 IF IS RF EX DF DS TC WB IF IS RF EX DF DS TC... IF IS RF EX DF... EX Load data needed EX 2 Stall Cycles

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 12 Case Study: MIPS R4000 LOAD Followed by ALU Instructions 2 cycle Load Latency with Forwarding Circuit IFIS IF RF IS IF EX RF IS IF DF stall DS stall TC EX RF IS WB DF... EX... RF... LW R1 ADD R2, R1 SUB R3, R1 OR R4, R1 Forwarding

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 13 Case Study: MIPS R4000 Branch Latency Predict NOT TAKEN strategy NOT TAKEN: one-cycle delayed slot TAKEN: one-cycle delayed slot followed by two stalls - 3 cycle latency NOT TAKEN R4000 uses Predict NOT TAKEN Delay Slot plus 2 stall cycles IFIS IF RF IS IF RF IS IF DF EX RF IS DS DF EX RF IS TC DS DF EX RF WB TC... DS... DF... EX... NOT TAKEN NOT TAKEN Br Delay Slot Br instr +2 Br instr +3 Br instr +4 EX IF DS DF IS TC DS RF IFIS IF RF ISRF DF EX WB TC... EX... EX TAKEN TAKEN Br Delay Slot StallStall Br Target instr IF Branch target address available after EX stage

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 14 Extending DLX to Handle Floating Point Operations IFID MEM WB Integer Unit(EX) FP/integer multiply FP Multiplier FP Adder FP Divider

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 15 MIPS R4000 FP Unit FP Adder, FP Multiplier, FP Divider Last step of FP Multiplier/Divider uses FP Adder HW 8 kinds of stages in FP units: (single copy of each) StageFunctional unitDescription AFP adderMantissa ADD stage DFP dividerDivide pipeline stage EFP multiplierException test stage MFP multiplierFirst stage of multiplier NFP multiplierSecond stage of multiplier RFP adderRounding stage SFP adderOperand shift stage U Unpack FP numbers

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 16 MIPS R4000 FP Pipe Stages FP Instr 1 2 3 4 5 6 7 8  latency Add, SubtractU S+AA+RR+S4 MultiplyU E+MMMMNN+AR 8 DivideU AR D 27 … D+AD+R, D+A, D+R, A, R 36 Square rootU E(A+R) 108  AR 112 NegateU S 2 Absolute valueU S 2 FP compareU AR 3 Stages: MFirst stage of multiplierN Second stage of multiplier RRounding stageA Mantissa ADD stage SOperand shift stageD Divide pipeline stage UUnpack FP numbersE Exception test stage

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 17 Latency and Initiation Intervals FP Instruction Latency Initiation Interval Add, Subtract43 Multiply84 Divide3635 Square root112111 Negate21 Absolute value21 FP compare32

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 18 MIPS R4000 FP Pipe Stages Add Issue U S+A A+R R+S Add Stall U S+A A + R R +S Add Issue U S+A A+R R+S Multiply Issue U M M M M N N+ A R clock cycle OperationIssue/stall 0 1 2 3 4 5 6 7 8 9 10 11 12 A A A A ADD issued at 5 cycles after Multiply will stall 1 cycle. Stall ADD issued at 4 cycles after Multiply will stall 2 cycles.

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 19 R4000 Performance Not an ideal pipeline CPI of 1: –Load stalls –Branch stalls: (2 cycles for taken br. + unfilled branch slots or cancelled branch delay slots) –FP result stalls: RAW data hazard (latency) –FP structural stalls: Not enough FP hardware (parallelism) 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 eqntott espresso gcc li doduc nasa7 ora spice2g6 su2cor tomcatv BaseLoad stallsBranch stallsFP result stallsFP structural stalls Integer programsFloating Point programs Pipeline CPI

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 20 Advanced Pipeline And Instruction Level Parallelism

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 21 Advanced Pipelining and Instruction Level Parallelism gcc 17% control transfer –5 instructions + 1 branch –Beyond single block to get more instruction level parallelism Loop level parallelism is one opportunity, SW and HW... Branch Target... Branch instruction... Any instruction... Branch instruction... Block of Code

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 22 Advanced Pipelining and Instruction Level Parallelism Loop unrollingControl stalls Basic pipeline schedulingRAW stalls Dynamic scheduling with scoreboardingRAW stalls Dynamic scheduling with register renamingWAR and WAW stalls Dynamic branch predictionControl stalls Issuing multiple instructions per cycleIdeal CPI Compiler dependence analysisIdeal CPI and data stalls Software pipelining and trace schedulingIdeal CPI and data stalls SpeculationAll data and control stalls Dynamic memory disambiguationRAW stalls involving memory Technique Reduces

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 23 Basic Pipeline Scheduling and Loop Unrolling FP unit latencies Instruction producing Instruction using Latency in result result clock cycles FP ALU op Another FP ALU op3 FP ALU op Store double 2 Load double* FP ALU op1 Load double* Store double 0 * Same as integer Load since there is a 64-bit data path from/to memory. Fully pipelined or replicated --- no structural hazards, issue on every clock cycle for ( i =1; i <= 1000; i++) x[i] = x[i] + s;

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 24 Loop:LDF0,0(R1);R1 is the pointer to a vector ADDDF4,F0,F2;F2 contains a scalar value SD0(R1),F4;store back result SUBIR1,R1,8;decrement pointer 8B (DW) BNEZR1,Loop;branch R1!=zero NOP;delayed branch slot FP Loop Hazards Where are the stalls? InstructionInstructionLatency in producing resultusing result clock cycles FP ALU opAnother FP ALU op3 FP ALU opStore double2 Load doubleFP ALU op1 Load doubleStore double0 Integer opInteger op0

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 25 FP Loop Showing Stalls 1 Loop:LDF0,0(R1);F0=vector element 2stall 3 ADDDF4,F0,F2;add scalar in F2 4stall 5stall 6 SD0(R1),F4;store result 7 SUBIR1,R1,8;decrement pointer 8B (DW) 8stall 9 BNEZR1,Loop;branch R1!=zero 10stall ;delayed branch slot Rewrite code to minimize stalls?

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 26 Reducing Stalls 1 Loop:LDF0,0(R1) 2stall 3 ADDDF4,F0,F2 4stall 5stall 6 SD0(R1),F4 7 SUBIR1,R1,#8 8 stall 9 BNEZR1,Loop 10stall For Load-ALU latency There is only one instruction left, i.e., BNEZ. When we do that, SD instruction fills the delayed branch slot. For ALU-ALU latency Reading R1 by LD is done before Writing R1 by SUBI. Yes we can. Consider moving SUBI into this Load Delay Slot. When we do this, we need to change the immediate value 0 to 8 in SD 8

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 27 Revised FP Loop to Minimize Stalls 1 Loop:LDF0,0(R1) 2 SUBIR1,R1,#8 3 ADDDF4,F0,F2 4stall 5 BNEZR1,Loop;delayed branch 6 SD8(R1),F4;altered when move past SUBI InstructionInstructionLatency in producing resultusing result clock cycles FP ALU opAnother FP ALU op3 FP ALU opStore double2 Load doubleFP ALU op1 Unroll loop 4 times to make the code faster

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 28 Unroll Loop 4 Times 1 Loop:LDF0,0(R1) 2 ADDDF4,F0,F2 3 SD0(R1),F4 ;drop SUBI & BNEZ 4 LDF6,-8(R1) 5 ADDDF8,F6,F2 6 SD-8(R1),F8 ;drop SUBI & BNEZ 7 LDF10,-16(R1) 8 ADDDF12,F10,F2 9 SD-16(R1),F12 ;drop SUBI & BNEZ 10 LDF14,-24(R1) 11 ADDDF16,F14,F2 12 SD-24(R1),F16 13 SUBIR1,R1,#32;alter to 4*8 14 BNEZR1,Loop 15 NOP 15 + 4 x (1*+2 + )+1^= 28 clock cycles, or 7 per iteration. 1*: LD to ADDD stall 1 cycle 2 + : ADDD to SD stall 2 cycles 1^: Data dependency on R1 Rewrite loop to minimize the stalls

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 29 Unrolled Loop to Minimize Stalls 1 Loop:LDF0,0(R1) 2 LDF6,-8(R1) 3 LDF10,-16(R1) 4 LDF14,-24(R1) 5 ADDDF4,F0,F2 6 ADDDF8,F6,F2 7 ADDDF12,F10,F2 8 ADDDF16,F14,F2 9 SD0(R1),F4 10 SD-8(R1),F8 11 SUBIR1,R1,#32 12 SD16(R1),F12 ; 16-32= -16 13 BNEZR1,LOOP 14 SD8(R1),F16; 8-32 = -24 Assumptions - OK to move SD past SUBI even though SUBI changes R1 SUBI IF RF EX MEM WB SD IF ID EX MEM WB BNEZ IF ID EX MEM WB - OK to move loads before stores(Get right data) - When is it safe for compiler to do such changes? 14 clock cycles, or 3.5 per iteration

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 30 Compiler Perspectives on Code Movement Definitions: Compiler is concerned about dependencies in the program, whether this causes a HW hazard or not depends on a given pipeline Data dependencies (RAW if a hazard for HW): Instruction j is data dependent on instruction i if either –Instruction i produces a result used by instruction j, or –Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i. Easy to determine for registers (fixed names) Hard for memory: –Does 100(R4) = 20(R6)? –From different loop iterations, does 20(R6) = 20(R6)?

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 31 Compiler Perspectives on Code Movement Name Dependence: Two instructions use the same name(register or memory location) but they do not exchange data Two kinds of Name Dependence Instruction i precedes instruction j –Antidependence (WAR if a hazard for HW) Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first –Output dependence (WAW if a hazard for HW) Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 32 Again Hard for Memory Accesses –Does 100(R4) = 20(R6)? –From different loop iterations, does 20(R6) = 20(R6)? Our example required compiler to know that if R1 doesn’t change then: 0(R1) ¹ -8(R1) ¹ -16(R1) ¹ -24(R1) 1 There were no dependencies between some loads and stores, so they could be moved by each other. Compiler Perspectives on Code Movement

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 33 Compiler Perspectives on Code Movement Control Dependence Example if p1 {S1;}; if p2 {S2;} S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 34 Compiler Perspectives on Code Movement Two (obvious) constraints on control dependencies: –An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. –An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch. Control dependencies may be relaxed in some systems to get parallelism; get the same effect if preserve the order of exceptions and data flow

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 35 When Safe to Unroll Loop? Example: When a loop is unrolled, where are data dependencies? (A,B,C distinct, non-overlapping) for (i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1];} /* S2 */ 1. S2 uses the value A[i+1], computed by S1 in the same iteration. 2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a loop-carried dependence between iterations Implies that iterations are dependent, and can’t be executed in parallel Not the case for our example; each iteration was distinct

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 36 When Safe to Unroll Loop? Example: Where are data dependencies? (A,B,C,D distinct & non-overlapping) Following looks like there is a loop carried dependence for (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i];} /* S2 */ However, we can rewrite it as follows for loop carried dependence-free A[1] = A[1] + B[1]; for (i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1];} B[101] = C[100]+D[100];

Pipeline ComplicationsCS510 Computer ArchitecturesLecture 8 - 37 SummarySummary Instruction Level Parallelism in SW or HW Loop level parallelism is easiest to see SW parallelism dependencies defined for a program, hazards if HW cannot resolve SW dependencies/compiler sophistication determine if compiler can unroll loops