Presentation on theme: "L17 – Pipelining 1 Comp 411 – Spring 2008 4/1/2008 Pipelining Between Spring break and 411 problems sets, I haven’t had a minute to do laundry Now that’s."— Presentation transcript:
L17 – Pipelining 1 Comp 411 – Spring /1/2008 Pipelining Between Spring break and 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty laundry Read Chapter 6.1
L17 – Pipelining 2 Comp 411 – Spring /1/2008 Forget 411… Let’s Solve a “Relevant Problem” Device: Washer Function: Fill, Agitate, Spin Washer PD = 30 mins Device: Dryer Function: Heat, Spin Dryer PD = 60 mins INPUT: dirty laundry OUTPUT: 4 more weeks
L17 – Pipelining 3 Comp 411 – Spring /1/2008 One Load at a Time Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart. (Sorry Mom, but you were wrong about this one!) Step 1: Step 2: Total = Washer PD + Dryer PD = _________ mins 90
L17 – Pipelining 4 Comp 411 – Spring /1/2008 Doing N Loads of Laundry Here’s how they do laundry at Duke, the “combinational” way. (Actually, this is just an urban legend. No one at Duke actually does laundry. The butler’s all arrive on Wednesday morning, pick up the dirty laundry and return it all pressed and starched by dinner) Step 1: Step 2: Step 3: Step 4: Total = N*(Washer PD + Dryer PD ) = ____________ mins N*90 …
L17 – Pipelining 5 Comp 411 – Spring /1/2008 Doing N Loads… the UNC way UNC students “pipeline” the laundry process. That’s why we wait! Step 1: Step 2: Step 3: Total = N * Max(Washer PD, Dryer PD ) = ____________ mins N*60 … Actually, it’s more like N* if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.
L17 – Pipelining 6 Comp 411 – Spring /1/2008 Recall Our Performance Measures Latency: The delay from when an input is established until the output associated with that input becomes valid. (Duke Laundry = _________ mins) ( UNC Laundry = _________ mins) Throughput: The rate of which inputs or outputs are processed. (Duke Laundry = _________ outputs/min) ( UNC Laundry = _________ outputs/min) /90 1/60 Assuming that the wash is started as soon as possible and waits (wet) in the washer until dryer is available. Even though we increase latency, it takes less time per load
L17 – Pipelining 7 Comp 411 – Spring /1/2008 Okay, Back to Circuits… F G HXP(X) For combinational logic: latency = t PD, throughput = 1/t PD. We can’t get the answer faster, but are we making effective use of our hardware at all times? G(X) F(X) P(X) X F & G are “idle”, just holding their outputs stable while H performs its computation
L17 – Pipelining 8 Comp 411 – Spring /1/2008 Pipelined Circuits use registers to hold H’s input stable! F G HXP(X) Now F & G can be working on input X i+1 while H is performing its computation on X i. We’ve created a 2-stage pipeline : if we have a valid input X during clock cycle j, P(X) is valid during clock j+2. Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (t s = 0, t pd = 0): latency 45 ______ throughput 1/45 ______ unpipelined 2-stage pipeline 50 worse 1/25 better Pipelining uses registers to improve the throughput of combinational circuits
L17 – Pipelining 9 Comp 411 – Spring /1/2008 Pipeline Diagrams Input F Reg G Reg H Reg ii+1i+2i+3 XiXi X i+1 F(X i ) G(X i ) X i+2 F(X i+1 ) G(X i+1 ) H(X i ) X i+3 F(X i+2 ) G(X i+2 ) H(X i+1 ) Clock cycle Pipeline stages The results associated with a particular set of input data moves diagonally through the diagram, progressing through one pipeline stage each clock cycle. H(X i+2 ) … … F G HXP(X) This is an example of parallelism. At any instant we are computing 2 results.
L17 – Pipelining 10 Comp 411 – Spring /1/2008 Pipelining Summary Advantages: – Higher throughput than combinational system – Different parts of the logic work on different parts of the problem… Disadvantages: – Generally, increases latency – Only as good as the *weakest* link (often called the pipeline’s BOTTLENECK) Isn’t there a way around this “weak link” problem? This bottleneck is the only problem
L17 – Pipelining 11 Comp 411 – Spring /1/2008 How do UNC students REALLY do Laundry? They work around the bottleneck. First, they find a place with twice as many dryers as washers. Throughput = ______ loads/min Latency = ______ mins/load Step 1: Step 2: Step 3: Step 4: 1/30 90
L17 – Pipelining 12 Comp 411 – Spring /1/2008 Step 5: Better Yet… Parallelism Step 1: Step 2: Step 3: Step 4: We can combine interleaving and pipelining with parallelism. Throughput = _______ load/min Latency = _______ min 2/30 = 1/15 90
L17 – Pipelining 13 Comp 411 – Spring /1/2008 “Classroom Computer” Row 1 Row 2 Row 3 Row 4 Row 5 Row 6 Psets in There are lots of problem sets to grade, each with six problems. Students in Row 1 grade Problem 1 and then hand it back to Row 2 for grading Problem 2, and so on… Assuming we want to pipeline the grading, how do we time the passing of papers between rows?
L17 – Pipelining 14 Comp 411 – Spring /1/2008 Controls for “Classroom Computer” SynchronousAsynchronous Globally Timed Locally Timed Teacher picks time interval long enough for worst-case student to grade toughest problem. Everyone passes psets at end of interval. Teacher picks variable time interval long enough for current students to grade current set of problems. Everyone passes psets at end of interval. Students raise hands when they finish grading current problem. Teacher checks every 10 secs, when all hands are raised, everyone passes psets to the row behind. Variant: students can pass when all students in a “column” have hands raised. Students grade current problem, wait for student in next row to be free, and then pass the pset back.
L17 – Pipelining 15 Comp 411 – Spring /1/2008 Control Structure Taxonomy SynchronousAsynchronous Globally Timed Locally Timed Centralized clocked FSM generates all control signals. Central control unit tailors current time slice to current tasks. Start and Finish signals generated by each major subsystem, synchronously with global clock. Each subsystem takes asynchronous Start, generates asynchronous Finish (perhaps using local clock). Easy to design but fixed-sized interval can be wasteful (no data-dependencies in timing) Large systems lead to very complicated timing generators… just say no! The best way to build large systems that have independent components. The “next big idea” for the last several decades: a lot of design work to do in general, but extra work is worth it in special cases
L17 – Pipelining 16 Comp 411 – Spring /1/2008 Review of CPU Performance MIPS = Millions of Instructions/Second Freq = Clock Frequency, MHz CPI = Clocks per Instruction MIPS = Freq CPI To Increase MIPS: 1. DECREASE CPI. - RISC simplicity reduces CPI to CPI below 1.0? State-of-the-art multiple instruction issue 2. INCREASE Freq. - Freq limited by delay along longest combinational path; hence - PIPELINING is the key to improving performance.
L17 – Pipelining 17 Comp 411 – Spring /1/2008 miniMIPS Timing New PC PC+4Fetch Inst. Control Logic Read Regs ASEL muxBSEL mux ALU Fetch data WDSEL mux RF setupPC setupMem setup PCSEL mux +OFFSET CLK The diagram on the left illustrates the Data Flow of miniMIPS Wanted: longest path Complications: some apparent paths aren’t “possible” functional units have variable execution times (eg, ALU) time axis is not to scale (eg, t PD,MEM is very big!) Sign Extend WASEL mux
L17 – Pipelining 18 Comp 411 – Spring /1/2008 Where Are the Bottlenecks? Pipelining goal: Break LONG combinational paths memories, ALU in separate stages
L17 – Pipelining 19 Comp 411 – Spring /1/2008 Ultimate Goal: 5-Stage Pipeline GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU) APPROACH: structure processor as 5-stage pipeline: IF Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to WB Write-Back stage: writes result back into register file. ID/RF Instruction Decode/Register File stage: Decode control lines and select source operands ALU ALU stage: Performs specified operation, passes result to MEM Memory stage: If it’s a lw, use ALU result as an address, pass mem data (or ALU result if not lw) to
L17 – Pipelining 20 Comp 411 – Spring /1/2008 miniMIPS Timing Different instructions use various parts of the data path. add $4, $5, $6 beq $1, $2, 40 lw $3, 30($0) jal Program execution order Time CLK Instruction Fetch Instruction Decode Register Prop Delay ALU Operation Branch Target Data Access Register Setup sw $2, 20($4) This is an example of a “Asynchronous Globally-Timed” control strategy (see Lecture 18). Such a system would vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining! 1 instr every 14 nS, 14 nS, 20 nS, 9 nS, 19 nS 6 nS 2 nS 5 nS 4 nS 6 nS 1 nS
L17 – Pipelining 21 Comp 411 – Spring /1/2008 Uniform miniMIPS Timing With a fixed clock period, we have to allow for the worse case. add $4, $5, $6 beq $1, $2, 40 lw $3, 30($0) jal Program execution order Time CLK Instruction Fetch Instruction Decode Register Prop Delay ALU Operation Branch Target Data Access Register Setup sw $2, 20($4) By accounting for the “worse case” path (i.e. allowing time for each possible combination of operations) we can implement a “Synchronous Globally-Timed” control strategy. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining! Isn’t the net effect just a slower CPU? 1 instr EVERY 20 nS 6 nS 2 nS 5 nS 4 nS 6 nS 1 nS
L17 – Pipelining 22 Comp 411 – Spring /1/2008 Step 1: A 2-Stage Pipeline IF EXE WA PC +4 Instruction Memory A D Register File RA1RA2 RD1RD2 ALU AB WA WD WE ALUFN Control Logic Data Memory RD WD R/W Adr Wr WDSEL BSEL WDSEL ALUFN Wr J: PCSEL WERF 00 PC+4 Imm: ASEL SEXT + x4 BT Z WASEL Rd: Rt: WASEL PC :J :00 JT N V C Z VNC Rt: Rs: ASEL 20 SEXT BSEL 01 SEXT shamt: PCSEL “16” IRQ 0x x x RESET “31” “27” 1 PC EXE 00 IR EXE IR stands for “Instruction Register”. The superscript “EXE” denotes the pipeline stage, in which the PC and IR are used.
L17 – Pipelining 23 Comp 411 – Spring /1/ Stage Pipe Timing Improves performance by increasing instruction throughput. Ideal speedup is number of pipeline stages in the pipeline. add $4, $5, $6 beq $1, $2, 40 lw $3, 30($0) jal Program execution order Time CLK Instruction Fetch Instruction Decode Register Prop Delay ALU Operation Branch Target Data Access Register Setup sw $2, 20($4) By partitioning each instruction cycle into a “fetch” stage and an “execute” stage, we get a simple pipeline. Why not include the Instruction-Decode/Register-Access time with the Instruction Fetch? You could. But this partitioning allows for a useful variant with 2-cycle loads and stores. Latency? Throughput? 2 Clock periods = 2*14 nS 1 instr per 14 nS During this, and all subsequent clocks two instructions are in various stages of execution 6 nS 2 nS 5 nS 4 nS 6 nS 1 nS
L17 – Pipelining 24 Comp 411 – Spring /1/ Stage w/2-Cycle Loads & Stores Further improves performance, with slight increase in control complexity. Some 1 st generation (pre-cache) RISC processors used this approach. add $4, $5, $6 beq $1, $2, 40 lw $3, 30($0) jal Program execution order Time CLK Instruction Fetch Instruction Decode Register Prop Delay ALU Operation Branch Target Data Access Register Setup sw $2, 20($4) The clock rate of this variant is over twice that of our original design. Does that mean it is that much faster? This design is very similar to the multicycle CPU described in section 5.5 of the text, but with pipelining. Clock: 8 nS! 6 nS 2 nS 5 nS 4 nS 6 nS 1 nS Not likely. In practice, as many as 30% of instructions access memory. Thus, the effective speed up is:
L17 – Pipelining 25 Comp 411 – Spring /1/ Stage Pipelined Operation... addi $t2,$t1,1 xor $t2,$t1,$t2 sltiu $t3,$t2,1 srl $t2,$t2,1... Consider a sequence of instructions: Executed on our 2-stage pipeline: IF EXE i i+1 i+2i+3i+4i+5i+6...xoraddisrlsltiu...xoraddisrlsltiu TIME (cycles) Pipeline Recall “Pipeline Diagrams” from an earlier slide. It can’t be this easy!?
L17 – Pipelining 26 Comp 411 – Spring /1/2008 ALU AB ALUFN Data Memory RD WD R/W Adr Wr WDSEL PC+4 Z VNC PC +4 Instruction Memory A D 00 BT PC :J :00 JT PCSEL x x x PC REG 00 IR REG WA Register File RA1RA2 RD1RD2 J: Imm: + x4 BT JT Rt: Rs: ASEL 20 BSEL 01 SEXT shamt: “16” 1 = BZ Step 2: 4-Stage miniMIPS PC ALU 00 IR ALU A B WD ALU PC MEM 00 IR MEM Y WD MEM WA Register File WA WD WE WERF WASEL Rd: Rt: “31” “27” (NB: SAME RF AS ABOVE!) Treats register file as two separate devices: combinational READ, clocked WRITE at end of pipe. What other information do we have to pass down pipeline? PC instruction fields What sort of improvement should expect in cycle time? (return addresses) (decoding) Instruction Fetch Register File ALU Write Back
L17 – Pipelining 27 Comp 411 – Spring /1/ Stage miniMIPS Operation... addi $t0,$t0,1 sll $t1,$t1,2 andi $t2,$t2,15 sub $t3,$0,$t3... Consider a sequence of instructions: Executed on our 4-stage pipeline: IF RF ALU WB i i+1 i+2i+3i+4i+5i+6... slladdisubandi... slladdisubandi... slladdisubandi slladdisubandi TIME (cycles) Pipeline