Pipelining Read Chapter

Pipelining Read Chapter 4.5-4.8
Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty laundry Read Chapter

Forget 411… Let’s Solve a “Relevant Problem”
INPUT: dirty laundry Device: Washer Function: Fill, Agitate, Spin WasherPD = 30 mins OUTPUT: 4 more weeks Device: Dryer Function: Heat, Spin DryerPD = 60 mins

Total = WasherPD + DryerPD
One Load at a Time Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart. (Sorry Mom, but you were wrong about this one!) Step 1: Step 2: Total = WasherPD + DryerPD = _________ mins 90

Doing N Loads of Laundry
Here’s how they do laundry at Duke, the “combinational” way. (Actually, this is just an urban legend. No one at Duke actually does laundry. The butler’s all arrive on Wednesday morning, pick up the dirty laundry and return it all pressed and starched by dinner) Step 1: Step 2: Step 3: Step 4: … Total = N*(WasherPD + DryerPD) = ____________ mins N*90

Doing N Loads… the UNC way
UNC students “pipeline” the laundry process. That’s why we wait! Step 1: Step 2: Step 3: … Actually, it’s more like N* if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs. Total = N * Max(WasherPD, DryerPD) = ____________ mins N*60

Recall Our Performance Measures
Latency: The delay from when an input is established until the output associated with that input becomes valid. (Duke Laundry = _________ mins) ( UNC Laundry = _________ mins) Throughput: The rate at which inputs or outputs are processed. (Duke Laundry = _________ outputs/min) ( UNC Laundry = _________ outputs/min) 90 Assuming that the wash is started as soon as possible and waits (wet) in the washer until dryer is available. 120 Even though we increase latency, it takes less time per load 1/90 1/60

Okay, Back to Circuits… X F(X) G(X) P(X)
H X P(X) For combinational logic: latency = tPD, throughput = 1/tPD. We can’t get the answer faster, but are we making effective use of our hardware at all times? X F(X) G(X) P(X) F & G are “idle”, just holding their outputs stable while H performs its computation

Pipelined Circuits use registers to hold H’s input stable! unpipelined
Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We’ve created a 2-stage pipeline : if we have a valid input X during clock cycle j, P(X) is valid during clock j+2. F G H X P(X) 15 20 25 Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0): Pipelining uses registers to improve the throughput of combinational circuits unpipelined 2-stage pipeline latency 45 ______ throughput 1/45 ______ 50 worse 1/25 better

Pipeline Diagrams Clock cycle i i+1 i+2 i+3 Input Xi Xi+1 F(Xi) G(Xi)
H X P(X) 15 20 25 This is an example of parallelism. At any instant we are computing 2 results. Clock cycle i i+1 i+2 i+3 Input Xi Xi+1 F(Xi) G(Xi) Xi+2 F(Xi+1) G(Xi+1) H(Xi) Xi+3 F(Xi+2) G(Xi+2) H(Xi+1) H(Xi+2) … F Reg Pipeline stages G Reg H Reg The results associated with a particular set of input data moves diagonally through the diagram, progressing through one pipeline stage each clock cycle.

Pipelining Summary Advantages: Disadvantages:
Higher throughput than combinational system Different parts of the logic work on different parts of the problem… Disadvantages: Generally, increases latency Only as good as the *weakest* link (often called the pipeline’s BOTTLENECK)

Review of CPU Performance
MIPS = Millions of Instructions/Second MIPS = Freq CPI Freq = Clock Frequency, MHz CPI = Clocks per Instruction To Increase MIPS: 1. DECREASE CPI. - RISC simplicity reduces CPI to 1.0. - CPI below 1.0? State-of-the-art multiple instruction issue 2. INCREASE Freq. - Freq limited by delay along longest combinational path; hence - PIPELINING is the key to improving performance.

Where Are the Bottlenecks?
WA PC +4 Instruction Memory A D Register File RA1 RA2 RD1 RD2 ALU B ALUFN Control Logic Data Memory RD WD R/W Adr Wr WDSEL BSEL J:<25:0> PCSEL WERF 00 PC+4 Rt: <20:16> Imm: <15:0> ASEL SEXT + x4 BT Z WASEL Rd:<15:11> Rt:<20:16> 1 2 3 PC<31:29>:J<25:0>:00 JT N V C Rs: <25:21> shamt:<10:6> 4 5 6 “16” IRQ 0x 0x 0x RESET “31” “27” WE Pipelining goal: Break LONG combinational paths  memories, ALU in separate stages

miniMIPS Timing Different instructions use various parts of the data path. 1 instr every 14 nS, 14 nS, 20 nS, 9 nS, 19 nS Program execution order Time CLK add $4, $5, $6 beq $1, $2, 40 lw $3, 30($0) jal sw $2, 20($4) 6 nS 2 nS 5 nS 4 nS 1 nS Instruction Fetch The above scenario is possible only if the system could vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining! Instruction Decode Register Prop Delay ALU Operation Branch Target Data Access Register Setup

Uniform miniMIPS Timing
With a fixed clock period, we have to allow for the worse case. 1 instr EVERY 20 nS Program execution order Time CLK add $4, $5, $6 beq $1, $2, 40 lw $3, 30($0) jal sw $2, 20($4) 6 nS 2 nS 5 nS 4 nS 1 nS Instruction Fetch By accounting for the “worse case” path (i.e. allowing time for each possible combination of operations) we can implement a fixed clock period. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining! Instruction Decode Register Prop Delay Isn’t the net effect just a slower CPU? ALU Operation Branch Target Data Access Register Setup

Goal: 5-Stage Pipeline IF ID/RF ALU MEM WB
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU) APPROACH: structure processor as 5-stage pipeline: IF Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to ID/RF Instruction Decode/Register File stage: Decode control lines and select source operands ALU ALU stage: Performs specified operation, passes result to MEM Memory stage: If it’s a lw, use ALU result as an address, pass mem data (or ALU result if not lw) to WB Write-Back stage: writes result back into register file.

PC<31:29>:J<25:0>:00
0x 0x PC<31:29>:J<25:0>:00 0x JT BT 5-Stage miniMIPS PCSEL 6 5 4 3 2 1 PC 00 Instruction Memory • Omits some details A Instruction +4 D • NO bypass or interlock logic Fetch PCREG 00 IRREG Rt: <20:16> Rs: <25:21> J:<25:0> RA1 Register RA2 WA File RD1 RD2 JT = Imm: <15:0> SEXT SEXT BZ x4 shamt:<10:6> + “16” Register 1 2 ASEL 1 BSEL File BT PCALU 00 IRALU A B WDALU Address is available right after instruction enters Memory stage A B ALU ALUFN ALU N V C Z PCMEM 00 IRMEM YMEM WDMEM Wr PC+4 Adr WD R/W Memory almost 2 clock cycles PCWB 00 IRWB YWB Data Memory RD Rt:<20:16> Rd:<15:11> “31” “27” Data is needed just before rising clock edge at end of Write Back stage WASEL WDSEL Write WA Register WD Back WERF WE WA File

Pipelining Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?

Pipelining What makes it easy What makes it hard?
all instructions are the same length just a few instruction formats memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction Individual Instructions still take the same number of cycles But we’ve improved the through-put by increasing the number of simultaneously executing instructions

Structural Hazards Inst Fetch Reg Read ALU Data Access Reg Write

Data Hazards Problem with starting next instruction before first is finished dependencies that “go backward in time” are data hazards

Software Solution Have compiler guarantee no hazards
Where do we insert the “nops” ? sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) Problem: this really slows us down!

Forwarding Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding

Can't always forward Load word can still cause a hazard:
an instruction tries to read a register following a load instruction that writes to the same register. Thus, we need a hazard detection unit to “stall” the instruction

Stalling We can stall the pipeline by keeping an instruction in the same stage

Branch Hazards When we decide to branch, other instructions are in the pipeline! We are predicting “branch not taken” need to add hardware for flushing instructions if we are wrong

PC<31:29>:J<25:0>:00
0x 0x PC<31:29>:J<25:0>:00 0x JT BT 5-Stage miniMIPS PCSEL 6 5 4 3 2 1 • added CLK EN to freeze IF/RF stages so we can wait for lw to reach WB stage NOP We wanted a simple, clean pipeline but… PC 00 Instruction Memory A +4 D Instruction Fetch broke the sequential semantics of ISA by adding a branch delay-slot and early branch resolution logic PCREG 00 IRREG Rt: <20:16> Rs: <25:21> J:<25:0> RA1 Register RA2 WA File added A/B bypass muxes to get data before it’s written to regfile RD1 RD2 JT = Imm: <15:0> SEXT SEXT BZ x4 shamt:<10:6> + “16” Register 1 2 ASEL 1 BSEL File BT PCALU 00 IRALU A B WDALU A B ALU ALUFN ALU N V C Z PCMEM 00 IRMEM YMEM WDMEM Wr PC+4 Adr WD R/W Memory PCWB 00 IRWB YWB Data Memory RD Rt:<20:16> Rd:<15:11> “31” “27” WASEL WDSEL Write WA Register WD Back WERF WE WA File

Pipeline Summary (I) • Started with unpipelined implementation
– direct execute, 1 cycle/instruction – it had a long cycle time: mem + regs + alu + mem + wb • We ended up with a 5-stage pipelined implementation – increase throughput (3x???) – delayed branch decision (1 cycle) Choose to execute instruction after branch – delayed register writeback (3 cycles) Add bypass paths (6 x 2 = 12) to forward correct value – memory data available only in WB stage Introduce NOPs at IRALU, to stall IF and RF stages until LD result was ready

Pipeline Summary (II) Fallacy #1: Pipelining is easy
Smart people get it wrong all of the time! Fallacy #2: Pipelining is independent of ISA Many ISA decisions impact how easy/costly it is to implement pipelining (i.e. branch semantics, addressing modes). Fallacy #3: Increasing Pipeline stages improves performance Diminishing returns. Increasing complexity.

Pipelining Read Chapter

Similar presentations

Presentation on theme: "Pipelining Read Chapter"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Pipelining Read Chapter

Similar presentations

Presentation on theme: "Pipelining Read Chapter"— Presentation transcript:

Similar presentations

About project

Feedback