Presentation on theme: "Now that’s what I call dirty laundry"— Presentation transcript:
1 Pipelining: now that’s what I call dirty laundry. Between Spring break and 411 problem sets, I haven’t had a minute to do laundry. Now that’s what I call dirty laundry! Read Chapter 6.1.
2 Forget 411… Let’s Solve a “Relevant Problem”. INPUT: dirty laundry. OUTPUT: 4 more weeks. Device: Washer. Function: Fill, Agitate, Spin. WasherPD = 30 mins. Device: Dryer. Function: Heat, Spin. DryerPD = 60 mins.
3 One Load at a Time. Everyone knows that the real reason UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart. (Sorry Mom, but you were wrong about this one!) Step 1: wash. Step 2: dry. Total = WasherPD + DryerPD = 90 mins.
4 Doing N Loads of Laundry. Here’s how they do laundry at Duke, the “combinational” way. (Actually, this is just an urban legend. No one at Duke actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry, and return it all pressed and starched by dinner.) Steps 1 through N: wash and dry each load to completion before starting the next. Total = N*(WasherPD + DryerPD) = N*90 mins.
5 Doing N Loads… the UNC Way. UNC students “pipeline” the laundry process; that’s why we wait! Steps 1, 2, 3, …: start the next wash while the previous load dries. Total = N * Max(WasherPD, DryerPD) = N*60 mins. Actually, it’s more like 30 + N*60 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state”, where we assume we have an infinite supply of inputs.
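The laundry arithmetic from the slides above can be sketched in a few lines of code. This is a minimal sketch; the function names are mine, not the lecture's, and the delays are the slides' 30-minute washer and 60-minute dryer.

```python
WASHER_PD = 30   # mins, from the slides
DRYER_PD = 60    # mins, from the slides

def one_load_at_a_time(n):
    """Combinational style: each load fully finishes before the next starts."""
    return n * (WASHER_PD + DRYER_PD)

def pipelined_steady_state(n):
    """Pipelined, steady-state approximation: N * max(stage delays)."""
    return n * max(WASHER_PD, DRYER_PD)

def pipelined_exact(n):
    """Pipelined, counting the startup transient: the first wash overlaps nothing."""
    return WASHER_PD + n * max(WASHER_PD, DRYER_PD)
```

For 4 loads this gives 360 minutes combinationally versus 240 (steady state) or 270 (with the startup transient) pipelined.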
6 Recall Our Performance Measures. Latency: the delay from when an input is established until the output associated with that input becomes valid. (Duke Laundry = 90 mins; UNC Laundry = 120 mins, assuming the wash is started as soon as possible and waits, wet, in the washer until the dryer is available.) Throughput: the rate at which inputs or outputs are processed. (Duke Laundry = 1/90 outputs/min; UNC Laundry = 1/60 outputs/min.) Even though we increase latency, it takes less time per load.
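The latency and throughput answers filled in above follow directly from the stage delays; a small check, using the slides' numbers:

```python
WASHER_PD, DRYER_PD = 30, 60        # mins, from the slides
stage = max(WASHER_PD, DRYER_PD)    # slowest pipeline stage: 60 mins

duke_latency = WASHER_PD + DRYER_PD  # 90 mins, one load start to finish
unc_latency = 2 * stage              # 120 mins: the wet load waits in the washer
duke_throughput = 1 / duke_latency   # 1/90 loads per minute
unc_throughput = 1 / stage           # 1/60 loads per minute
```

The pipelined latency is worse (120 vs 90 mins) precisely because a load sits a full 60-minute slot in each stage, but throughput improves from 1/90 to 1/60.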
7 Okay, Back to Circuits… Consider a combinational circuit in which H combines the outputs F(X) and G(X) to produce P(X). For combinational logic: latency = tPD, throughput = 1/tPD. We can’t get the answer faster, but are we making effective use of our hardware at all times? F & G are “idle”, just holding their outputs stable while H performs its computation.
8 Pipelined Circuits: use registers to hold H’s input stable! Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We’ve created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2. Pipelining uses registers to improve the throughput of combinational circuits. Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0): unpipelined, latency 45 ns and throughput 1/45; 2-stage pipeline, latency 50 ns (worse) and throughput 1/25 (better).
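The unpipelined vs 2-stage numbers above can be derived mechanically from the stage delays. A sketch using the slide's 15/20/25 ns delays and ideal registers:

```python
tF, tG, tH = 15, 20, 25   # ns, propagation delays from the slide

# Unpipelined: F and G compute in parallel, then H uses both results.
latency_comb = max(tF, tG) + tH     # 45 ns
throughput_comb = 1 / latency_comb  # 1/45 results per ns

# 2-stage pipeline: the clock period is set by the slowest stage.
t_clk = max(max(tF, tG), tH)        # 25 ns
latency_pipe = 2 * t_clk            # 50 ns (worse than 45)
throughput_pipe = 1 / t_clk         # 1/25 results per ns (better than 1/45)
```

Note the trade: latency grows from 45 to 50 ns because every stage now takes a full 25 ns clock, but a new result emerges every 25 ns instead of every 45 ns.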
9 Pipeline Diagrams. For the 2-stage pipeline above:
Clock cycle:  i             i+1              i+2              i+3
Input:        Xi            Xi+1             Xi+2             Xi+3
F Reg, G Reg: F(Xi), G(Xi)  F(Xi+1), G(Xi+1) F(Xi+2), G(Xi+2) …
H Reg:                      H(Xi)            H(Xi+1)          H(Xi+2) …
This is an example of parallelism: at any instant we are computing 2 results. The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.
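Diagrams like the one above can be generated programmatically: in clock cycle c, stage s holds input X[c−s] (or nothing during startup). A small sketch (the function name is mine):

```python
def pipeline_diagram(n_stages, n_cycles):
    """One row per clock cycle; entry s is the index of the input
    occupying pipeline stage s that cycle, or None during startup."""
    return [[c - s if c - s >= 0 else None for s in range(n_stages)]
            for c in range(n_cycles)]
```

For the 2-stage F/G-then-H pipeline, `pipeline_diagram(2, 4)` yields `[[0, None], [1, 0], [2, 1], [3, 2]]`: each input's index moves down one stage per cycle, the diagonal movement the slide describes.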
10 Pipelining Summary. Advantages: higher throughput than a combinational system; different parts of the logic work on different parts of the problem. Disadvantages: generally increases latency; only as good as the *weakest* link (often called the pipeline’s BOTTLENECK). This bottleneck is the only problem. Isn’t there a way around this “weak link” problem?
11 How do UNC students REALLY do laundry? They work around the bottleneck. First, they find a place with twice as many dryers as washers, interleaving successive loads between the two dryers. Throughput = 1/30 loads/min. Latency = 90 mins/load.
12 Better Yet… Parallelism. We can combine interleaving and pipelining with parallelism: run two washer/two-dryer pipelines side by side. Throughput = 2/30 = 1/15 loads/min. Latency = 90 mins/load.
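The two slides above both follow one rule: replicate a stage and its effective delay shrinks by the replication factor, and the pipeline runs at the rate of its slowest stage. A sketch under that assumption (function name is mine):

```python
WASHER_PD, DRYER_PD = 30, 60   # mins, from the slides

def throughput(n_washers, n_dryers):
    """Steady-state loads/min: each stage's capacity is copies / stage delay,
    and the pipeline is limited by the smaller capacity."""
    return min(n_washers / WASHER_PD, n_dryers / DRYER_PD)
```

One washer with two interleaved dryers gives `throughput(1, 2)` = 1/30 loads/min; doubling the whole setup, `throughput(2, 4)` = 1/15, matching the slides.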
13 “Classroom Computer”. There are lots of problem sets to grade, each with six problems. Students in Row 1 grade Problem 1 and then hand it back to Row 2 for grading Problem 2, and so on… Assuming we want to pipeline the grading, how do we time the passing of papers between rows? (Psets flow in through Row 1 and onward through Rows 2–6.)
14 Controls for “Classroom Computer”.
Globally Timed, Synchronous: Teacher picks a time interval long enough for the worst-case student to grade the toughest problem. Everyone passes psets at the end of the interval.
Globally Timed, Asynchronous: Teacher picks a variable time interval long enough for the current students to grade the current set of problems. Everyone passes psets at the end of the interval.
Locally Timed, Synchronous: Students raise hands when they finish grading the current problem. The teacher checks every 10 secs; when all hands are raised, everyone passes psets to the row behind. Variant: students can pass when all students in a “column” have hands raised.
Locally Timed, Asynchronous: Students grade the current problem, wait for the student in the next row to be free, and then pass the pset back.
15 Control Structure Taxonomy.
Globally Timed, Synchronous: Centralized clocked FSM generates all control signals. Easy to design, but the fixed-size interval can be wasteful (no data-dependencies in timing).
Globally Timed, Asynchronous: Central control unit tailors the current time slice to the current tasks. Large systems lead to very complicated timing generators… just say no!
Locally Timed, Synchronous: Start and Finish signals generated by each major subsystem, synchronously with a global clock. The best way to build large systems that have independent components.
Locally Timed, Asynchronous: Each subsystem takes an asynchronous Start and generates an asynchronous Finish (perhaps using a local clock). The “next big idea” for the last several decades: a lot of design work to do in general, but the extra work is worth it in special cases.
16 Review of CPU Performance. MIPS = Millions of Instructions per Second. MIPS = Freq / CPI, where Freq = clock frequency in MHz and CPI = clocks per instruction. To increase MIPS: 1. DECREASE CPI. RISC simplicity reduces CPI to 1.0; for CPI below 1.0, state-of-the-art processors use multiple instruction issue. 2. INCREASE Freq. Freq is limited by the delay along the longest combinational path; hence PIPELINING is the key to improving performance.
17 miniMIPS Timing. The diagram illustrates the data flow of miniMIPS over one clock period: New PC → PC+4 → Fetch Inst. → Control Logic / Read Regs / Sign Extend + OFFSET → ASEL mux / BSEL mux → ALU → Fetch data → PCSEL / WASEL / WDSEL muxes → PC setup / RF setup / Mem setup → CLK. Wanted: the longest path. Complications: some apparent paths aren’t “possible”; functional units have variable execution times (e.g., ALU); the time axis is not to scale (e.g., tPD,MEM is very big!).
18 Where Are the Bottlenecks? (Datapath figure: PC, Instruction Memory, Register File, ALU, and Data Memory, connected through the PCSEL/WASEL/ASEL/BSEL/WDSEL muxes and their control signals.) Pipelining goal: break LONG combinational paths by placing the memories and the ALU in separate stages.
19 Ultimate Goal: 5-Stage Pipeline. GOAL: maintain (nearly) 1.0 CPI, but increase clock speed to barely include the slowest components (mems, regfile, ALU). APPROACH: structure the processor as a 5-stage pipeline:
IF, Instruction Fetch stage: maintains the PC, fetches one instruction per cycle, and passes it to
ID/RF, Instruction Decode/Register File stage: decodes control lines and selects source operands for
ALU stage: performs the specified operation and passes the result to
MEM, Memory stage: if it’s a lw, uses the ALU result as an address and passes the memory data (or the ALU result if not lw) to
WB, Write-Back stage: writes the result back into the register file.
20 miniMIPS Timing. Different instructions use different parts of the data path: 1 instr every 14 ns, 14 ns, 20 ns, 9 ns, 19 ns for add $4,$5,$6; beq $1,$2,40; lw $3,30($0); jal; and sw $2,20($4), respectively. This is an example of an “Asynchronous Globally-Timed” control strategy (see Lecture 18): such a system would vary the clock period based on the instruction being executed. This leads to complicated timing generation and, in the end, slower systems, since it is not very compatible with pipelining! (Figure legend: Instruction Fetch, Instruction Decode, Register Prop Delay, ALU Operation, Branch Target, Data Access, Register Setup, with component delays of 6 ns, 2 ns, 5 ns, 4 ns, and 1 ns.)
21 Uniform miniMIPS Timing. With a fixed clock period, we have to allow for the worst case: 1 instr EVERY 20 ns. By accounting for the “worst case” path (i.e., allowing time for each possible combination of operations) we can implement a “Synchronous Globally-Timed” control strategy. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining! Isn’t the net effect just a slower CPU? (Figure legend: same instruction sequence and component delays as the previous slide.)
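Before pipelining enters the picture, the fixed clock really is slower for this sequence; a quick tally using the per-instruction times read off the two slides above:

```python
# Per-instruction times from the variable-clock slide (ns):
# add, beq, lw, jal, sw
variable_ns = [14, 14, 20, 9, 19]
fixed_clk = 20                      # ns, worst-case fixed clock period

variable_total = sum(variable_ns)            # 76 ns for the 5 instructions
fixed_total = fixed_clk * len(variable_ns)   # 100 ns for the same 5
```

The fixed clock costs about 30% more time on this straight-line sequence, which is exactly the slide's question; the payoff is that only the fixed-period design can be pipelined.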
22 Step 1: A 2-Stage Pipeline. (Datapath figure: a fetch stage with its own PCEXE and IREXE pipeline registers, feeding an execute stage containing the register file, ALU, and data memory.) IR stands for “Instruction Register”. The superscript “EXE” denotes the pipeline stage in which the PC and IR are used.
23 2-Stage Pipe Timing. Pipelining improves performance by increasing instruction throughput; the ideal speedup is the number of stages in the pipeline. During this, and all subsequent clocks, two instructions are in various stages of execution. By partitioning each instruction cycle into a “fetch” stage and an “execute” stage, we get a simple pipeline. Why not include the Instruction-Decode/Register-Access time with the Instruction Fetch? You could, but this partitioning allows for a useful variant with 2-cycle loads and stores. Latency? 2 clock periods = 2*14 ns = 28 ns. Throughput? 1 instr per 14 ns.
24 2-Stage Pipe w/2-Cycle Loads & Stores. This further improves performance, with a slight increase in control complexity; some 1st-generation (pre-cache) RISC processors used this approach. This design is very similar to the multicycle CPU described in Section 5.5 of the text, but with pipelining. Clock: 8 ns! The clock rate of this variant is over twice that of our original design. Does that mean it is that much faster? Not likely: in practice, as many as 30% of instructions access memory and take two cycles, so the effective speedup is well below the clock-rate ratio.
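The slide leaves the effective speedup unstated; a hedged estimate using the figures it does give (14 ns original clock, 8 ns new clock, ~30% memory instructions taking 2 cycles):

```python
clk_old, clk_new = 14, 8   # ns, clock periods from the slides
mem_frac = 0.30            # fraction of instructions that access memory

# Memory instructions take 2 cycles, everything else takes 1.
avg_cycles = (1 - mem_frac) * 1 + mem_frac * 2   # 1.3 cycles/instruction
time_per_instr = avg_cycles * clk_new            # about 10.4 ns on average
speedup = clk_old / time_per_instr               # roughly 1.35x, not 14/8 = 1.75x
```

So despite a clock nearly twice as fast, the two-cycle memory operations pull the effective speedup down to about 1.35x: the slide's point that a faster clock alone does not tell the whole story.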
25 2-Stage Pipelined Operation. Consider a sequence of instructions: … addi $t2,$t1,1; xor $t2,$t1,$t2; sltiu $t3,$t2,1; srl $t2,$t2,1 … Recall “Pipeline Diagrams” from an earlier slide. Executed on our 2-stage pipeline (stages IF and EXE, clock cycles i through i+6), each instruction moves from IF to EXE one cycle later. It can’t be this easy!?
26 Step 2: 4-Stage miniMIPS. (Datapath figure: Instruction Fetch, Register File, ALU, and Write-Back stages, with pipelined PC and IR registers per stage.) This treats the register file as two separate devices: a combinational READ in the Register File stage and a clocked WRITE at the end of the pipe (NB: the SAME register file in both places!). What other information do we have to pass down the pipeline? The PC (for return addresses) and the instruction fields (for decoding). What sort of improvement should we expect in cycle time?
27 4-Stage miniMIPS Operation. Consider a sequence of instructions: … addi $t0,$t0,1; sll $t1,$t1,2; andi $t2,$t2,15; sub $t3,$0,$t3 … Executed on our 4-stage pipeline (stages IF, RF, ALU, WB, clock cycles i through i+6), each instruction advances one stage per clock cycle.
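The 4-stage diagram above can be generated the same way as the earlier 2-stage one: the instruction fetched in cycle c occupies stage s in cycle c+s. A sketch for the slide's sequence (the helper name is mine):

```python
INSTRS = ["addi", "sll", "andi", "sub"]   # the sequence from the slide
STAGES = ["IF", "RF", "ALU", "WB"]        # the 4 pipeline stages

def occupancy(cycle):
    """Which instruction occupies each pipeline stage in a given clock cycle
    (None if that stage is still empty or already drained)."""
    return {stage: (INSTRS[cycle - s] if 0 <= cycle - s < len(INSTRS) else None)
            for s, stage in enumerate(STAGES)}
```

In cycle 0 only addi is in flight (in IF); by cycle 3 the pipeline is full, with addi writing back while sub is being fetched, i.e. four instructions in four different stages at once.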