Presentation transcript:

L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining
"Between 411 problem sets, I haven't had a minute to do laundry." "Now that's what I call dirty laundry." Read Chapter

L16 – Pipelining 2 Comp 411 – Spring 2011 4/20/2011 Forget 411… Let's Solve a "Relevant Problem"
Device: Washer. Function: Fill, Agitate, Spin. WasherPD = 30 mins.
Device: Dryer. Function: Heat, Spin. DryerPD = 60 mins.
INPUT: dirty laundry. OUTPUT: 4 more weeks.

L16 – Pipelining 3 Comp 411 – Spring 2011 4/20/2011 One Load at a Time
Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart. (Sorry Mom, but you were wrong about this one!)
Step 1: Wash (30 mins). Step 2: Dry (60 mins).
Total = WasherPD + DryerPD = 90 mins

L16 – Pipelining 4 Comp 411 – Spring 2011 4/20/2011 Doing N Loads of Laundry
Here's how they do laundry at Duke, the "combinational" way. (Actually, this is just an urban legend. No one at Duke actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry, and return it all pressed and starched by dinner.)
Step 1: Wash load 1. Step 2: Dry load 1. Step 3: Wash load 2. Step 4: Dry load 2. …
Total = N*(WasherPD + DryerPD) = N*90 mins

L16 – Pipelining 5 Comp 411 – Spring 2011 4/20/2011 Doing N Loads… the UNC way
UNC students "pipeline" the laundry process. That's why we wait!
Step 1: Wash load 1. Step 2: Dry load 1 while washing load 2. Step 3: Dry load 2 while washing load 3. …
Total = N * Max(WasherPD, DryerPD) = N*60 mins
Actually, it's more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we're mostly interested in the "steady state", where we assume we have an infinite supply of inputs.
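A quick sanity check of the two totals above (an illustrative Python sketch using the 30- and 60-minute stage times from these slides; the function names are mine):

```python
# Compare "combinational" (one load at a time) laundry with pipelined laundry.
WASHER_PD = 30  # minutes per load in the washer
DRYER_PD = 60   # minutes per load in the dryer

def serial_time(n_loads):
    """Each load finishes washing and drying before the next one starts."""
    return n_loads * (WASHER_PD + DRYER_PD)            # N*90

def pipelined_time(n_loads):
    """Washer and dryer run concurrently; the slower stage sets the pace."""
    bottleneck = max(WASHER_PD, DRYER_PD)
    # One startup transient (the first wash), then one finished load per
    # bottleneck period: 30 + N*60 for these stage times.
    return WASHER_PD + n_loads * bottleneck

for n in (1, 4, 100):
    print(n, serial_time(n), pipelined_time(n))
```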

L16 – Pipelining 6 Comp 411 – Spring 2011 4/20/2011 Recall Our Performance Measures
Latency: The delay from when an input is established until the output associated with that input becomes valid. (Duke Laundry = 90 mins) (UNC Laundry = 120 mins, assuming that the wash is started as soon as possible and waits, wet, in the washer until the dryer is available.)
Throughput: The rate at which inputs or outputs are processed. (Duke Laundry = 1/90 outputs/min) (UNC Laundry = 1/60 outputs/min)
Even though we increase latency, it takes less time per load.

L16 – Pipelining 7 Comp 411 – Spring 2011 4/20/2011 Okay, Back to Circuits…
[Circuit: F and G each operate on input X; H combines F(X) and G(X) to produce P(X).]
For combinational logic: latency = tPD, throughput = 1/tPD. We can't get the answer faster, but are we making effective use of our hardware at all times? F & G are "idle", just holding their outputs stable while H performs its computation.

L16 – Pipelining 8 Comp 411 – Spring 2011 4/20/2011 Pipelined Circuits
Use registers to hold H's input stable! Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We've created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.
Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):
                     latency         throughput
  unpipelined        45 ns           1/45 per ns
  2-stage pipeline   50 ns (worse)   1/25 per ns (better)
Pipelining uses registers to improve the throughput of combinational circuits.
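The table entries follow directly from the stage delays; a minimal sketch of the arithmetic, assuming the ideal registers stated above:

```python
# F and G feed H; propagation delays are 15, 20, and 25 ns respectively.
t_F, t_G, t_H = 15, 20, 25

# Unpipelined: one long combinational path through the slower of F/G, then H.
unpipelined_latency = max(t_F, t_G) + t_H           # 45 ns
unpipelined_throughput = 1 / unpipelined_latency    # 1/45 results per ns

# 2-stage pipeline with ideal (zero-delay) registers: the clock period is set
# by the slowest stage, and a result takes two clock periods to emerge.
clock = max(t_F, t_G, t_H)                          # 25 ns
pipelined_latency = 2 * clock                       # 50 ns (worse)
pipelined_throughput = 1 / clock                    # 1/25 results per ns (better)

print(unpipelined_latency, unpipelined_throughput)
print(pipelined_latency, pipelined_throughput)
```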

L16 – Pipelining 9 Comp 411 – Spring 2011 4/20/2011 Pipeline Diagrams
  Pipeline stage \ Clock cycle:   i       i+1       i+2       i+3      …
  Input                           Xi      Xi+1      Xi+2      Xi+3
  F Reg                                   F(Xi)     F(Xi+1)   F(Xi+2)
  G Reg                                   G(Xi)     G(Xi+1)   G(Xi+2)
  H Reg                                             H(Xi)     H(Xi+1)
The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle. This is an example of parallelism: at any instant we are computing 2 results.
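The diagonal pattern is easy to regenerate programmatically; here is a small sketch (the helper names are mine) that prints the same table for the 2-stage F/G-into-H pipeline:

```python
# Print which value occupies each pipeline register during each clock cycle.
def cell(fn, cycle, offset):
    # The register after `fn` holds fn(X_{cycle-offset}) once the pipe has filled.
    k = cycle - offset
    if k < 0:
        return "-"
    return f"{fn}(Xi+{k})" if k else f"{fn}(Xi)"

def pipeline_diagram(n_cycles=4):
    header = ["cycle:", "i"] + [f"i+{c}" for c in range(1, n_cycles)]
    rows = [
        ["Input"] + [f"Xi+{c}" if c else "Xi" for c in range(n_cycles)],
        ["F Reg"] + [cell("F", c, 1) for c in range(n_cycles)],
        ["G Reg"] + [cell("G", c, 1) for c in range(n_cycles)],
        ["H Reg"] + [cell("H", c, 2) for c in range(n_cycles)],  # one stage later
    ]
    for row in [header] + rows:
        print("".join(f"{x:>10}" for x in row))

pipeline_diagram()
```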

L16 – Pipelining 10 Comp 411 – Spring 2011 4/20/2011 Pipelining Summary
Advantages: – Higher throughput than a combinational system – Different parts of the logic work on different parts of the problem
Disadvantages: – Generally increases latency – Only as good as the *weakest* link (often called the pipeline's BOTTLENECK)

L16 – Pipelining 11 Comp 411 – Spring 2011 4/20/2011 Review of CPU Performance
MIPS = Millions of Instructions/Second; Freq = Clock Frequency, MHz; CPI = Clocks per Instruction
MIPS = Freq / CPI
To Increase MIPS:
1. DECREASE CPI. – RISC simplicity reduces CPI to 1.0. CPI below 1.0? State-of-the-art multiple instruction issue.
2. INCREASE Freq. – Freq is limited by the delay along the longest combinational path; hence PIPELINING is the key to improving performance.
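A concrete use of the formula (the frequencies and CPIs below are made-up illustrative values, not figures from these slides):

```python
# MIPS = Freq (MHz) / CPI: millions of instructions completed per second.
def mips(freq_mhz, cpi):
    return freq_mhz / cpi

print(mips(100, 5.0))   # multi-cycle machine: 100 MHz at 5 clocks/instr -> 20 MIPS
print(mips(100, 1.0))   # ideal pipeline at the same clock rate -> 100 MIPS
print(mips(500, 1.2))   # faster clock, some pipeline stalls -> ~417 MIPS
```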

L16 – Pipelining 12 Comp 411 – Spring 2011 4/20/2011 Where Are the Bottlenecks?
Pipelining goal: Break LONG combinational paths ⇒ put memories and the ALU in separate stages.

L16 – Pipelining 13 Comp 411 – Spring 2011 4/20/2011 miniMIPS Timing
Different instructions use various parts of the data path.
[Timing diagram: add $4, $5, $6; beq $1, $2, 40; lw $3, 30($0); jal; and sw $2, 20($4) executed back to back, showing the delays of Instruction Fetch, Instruction Decode, Register Prop Delay, ALU Operation, Branch Target, Data Access, and Register Setup (individual delays between 1 nS and 6 nS). One instruction completes every 14 nS, 14 nS, 20 nS, 9 nS, and 19 nS respectively.]
The above scenario is possible only if the system could vary the clock period based on the instruction being executed. This leads to complicated timing generation and, in the end, slower systems, since it is not very compatible with pipelining!

L16 – Pipelining 14 Comp 411 – Spring 2011 4/20/2011 Uniform miniMIPS Timing
With a fixed clock period, we have to allow for the worst case.
[Timing diagram: the same instruction sequence (add, beq, lw, jal, sw), now with every instruction given the full worst-case period: 1 instr EVERY 20 nS.]
By accounting for the "worst case" path (i.e. allowing time for each possible combination of operations) we can implement a fixed clock period. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining! Isn't the net effect just a slower CPU?
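Putting the numbers from the two timing slides side by side shows what the fixed clock costs before pipelining is added (a rough sketch; it assumes each instruction class occurs equally often, which the slides do not claim):

```python
# Critical-path delay of each instruction class from the miniMIPS timing slide (ns).
instr_delay = {"add": 14, "beq": 14, "lw": 20, "jal": 9, "sw": 19}

# Variable-length clock: each instruction takes exactly as long as it needs.
avg_variable = sum(instr_delay.values()) / len(instr_delay)   # 15.2 ns/instr

# Fixed clock: every instruction pays for the worst case (lw here).
fixed_clock = max(instr_delay.values())                       # 20 ns/instr

print(avg_variable, fixed_clock)
# The fixed clock looks slower per instruction, but it is exactly what makes
# the 5-stage pipeline (and its ~1 instruction per cycle) possible.
```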

L16 – Pipelining 15 Comp 411 – Spring 2011 4/20/2011 Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include the slowest components (mems, regfile, ALU).
APPROACH: structure the processor as a 5-stage pipeline:
IF, the Instruction Fetch stage: Maintains the PC, fetches one instruction per cycle and passes it to the
ID/RF, the Instruction Decode/Register File stage: Decodes control lines and selects source operands, passing them to the
ALU stage: Performs the specified operation and passes its result to the
MEM, the Memory stage: If it's a lw, uses the ALU result as an address and passes mem data (or the ALU result if not lw) to the
WB, the Write-Back stage: Writes the result back into the register file.

L16 – Pipelining 16 Comp 411 – Spring 2011 4/20/2011 5-Stage miniMIPS
[Datapath figure: the miniMIPS datapath (PC logic, instruction memory, register file, ALU, data memory, and the associated muxes and control fields) cut into Instruction Fetch, Register File, ALU, Memory, and Write Back stages by PC/IR pipeline registers.]
Address is available right after the instruction enters the Memory stage. Data is needed just before the rising clock edge at the end of the Write Back stage: almost 2 clock cycles later. This version omits some details: NO bypass or interlock logic.

L16 – Pipelining 17 Comp 411 – Spring 2011 4/20/2011 Pipelining
Improve performance by increasing instruction throughput. Ideal speedup is the number of stages in the pipeline. Do we achieve this?
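One way to make "do we achieve this?" concrete: with k perfectly balanced stages, no stalls, and N instructions, the pipeline needs N + k - 1 cycles instead of N*k. A back-of-the-envelope sketch under those assumptions (the function name is mine):

```python
# Speedup of a k-stage pipeline over unpipelined execution of N instructions,
# assuming perfectly balanced stages and no hazard stalls.
def speedup(k_stages, n_instructions):
    unpipelined_cycles = n_instructions * k_stages
    pipelined_cycles = n_instructions + k_stages - 1   # fill + drain overhead
    return unpipelined_cycles / pipelined_cycles

for n in (5, 100, 1_000_000):
    print(n, round(speedup(5, n), 3))   # approaches the ideal of 5 only as N grows
```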

L16 – Pipelining 18 Comp 411 – Spring 2011 4/20/2011 Pipelining
What makes it easy? – all instructions are the same length – just a few instruction formats – memory operands appear only in loads and stores
What makes it hard? – structural hazards: suppose we had only one memory – control hazards: need to worry about branch instructions – data hazards: an instruction depends on a previous instruction
Individual instructions still take the same number of cycles, but we've improved the throughput by increasing the number of simultaneously executing instructions.

L16 – Pipelining 19 Comp 411 – Spring 2011 4/20/2011 Structural Hazards
[Figure: four overlapped instructions, each passing through Inst Fetch, Reg Read, ALU, Data Access, and Reg Write.]

L16 – Pipelining 20 Comp 411 – Spring 2011 4/20/2011 Data Hazards
Problem with starting the next instruction before the first is finished: dependencies that "go backward in time" are data hazards.

L16 – Pipelining 21 Comp 411 – Spring 2011 4/20/2011 Software Solution
Have the compiler guarantee no hazards. Where do we insert the "nops"?
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
Problem: this really slows us down!
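A toy version of what such a compiler pass has to do for the sequence above (a sketch, not a real scheduler; it assumes no forwarding and a register file that can write and then read the same register within one cycle, so two intervening instructions are enough in a 5-stage pipeline):

```python
# Insert nops so that at least REQUIRED_GAP instructions separate an
# instruction that writes a register from any later instruction that reads it.
REQUIRED_GAP = 2

# (op, destination, sources) for the sequence on the slide.
program = [
    ("sub", "$2",  ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or",  "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw",  None,  ["$15", "$2"]),
]

scheduled = []
last_write = {}   # register -> index in `scheduled` of the instruction that wrote it

for op, dest, srcs in program:
    # Pad with nops until every source register is "old enough" to read safely.
    while any(len(scheduled) - last_write[r] <= REQUIRED_GAP
              for r in srcs if r in last_write):
        scheduled.append(("nop", None, []))
    scheduled.append((op, dest, srcs))
    if dest is not None:
        last_write[dest] = len(scheduled) - 1

for op, dest, srcs in scheduled:
    print(op, dest or "", ", ".join(srcs))
# Output order: sub, nop, nop, and, or, add, sw -- two nops between sub and and.
```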

L16 – Pipelining 22 Comp 411 – Spring 2011 4/20/2011 Forwarding
Use temporary results; don't wait for them to be written: – register file forwarding to handle read/write to the same register – ALU forwarding
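The forwarding decision comes down to comparing a source register against the destination fields carried in the pipeline registers. A sketch of the standard conditions, using textbook-style EX/MEM and MEM/WB names (the slides show this only pictorially):

```python
# Decide where one ALU source register `rs` of the instruction now in EX
# should come from: a bypass path or the register file.
def forward_source(rs, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "EX/MEM"   # forward the ALU result computed last cycle
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "MEM/WB"   # forward the value about to be written back
    return "REG"          # no hazard: the register-file value is fine

# Example: `and $12, $2, $5` issued right after `sub $2, $1, $3`.
print(forward_source(rs=2, ex_mem_regwrite=True, ex_mem_rd=2,
                     mem_wb_regwrite=False, mem_wb_rd=0))   # -> "EX/MEM"
```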

L16 – Pipelining 23 Comp 411 – Spring 2011 4/20/2011 Can't Always Forward
A load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register. Thus, we need a hazard detection unit to "stall" the instruction.
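The check that the hazard detection unit performs is a single comparison against the load's destination; a sketch (signal names follow the usual textbook convention rather than these slides):

```python
# Stall when the instruction currently in EX is a load whose destination
# register is a source register of the instruction currently being decoded.
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) followed immediately by and $4, $2, $5 -> stall one cycle.
print(load_use_stall(id_ex_memread=True, id_ex_rt=2, if_id_rs=2, if_id_rt=5))
```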

L16 – Pipelining 24 Comp 411 – Spring 2011 4/20/2011 Stalling
We can stall the pipeline by keeping an instruction in the same stage.

L16 – Pipelining 25 Comp 411 – Spring 2011 4/20/2011 Branch Hazards
When we decide to branch, other instructions are already in the pipeline! We are predicting "branch not taken", so we need to add hardware for flushing instructions if we are wrong.
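The cost of guessing wrong can be folded into CPI. An illustrative calculation (the 20% branch frequency, 50% misprediction rate, and 1-cycle flush penalty are assumed numbers for the example, not figures from these slides):

```python
# Effective CPI when mispredicted branches flush instructions from the pipeline.
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

print(effective_cpi(base_cpi=1.0, branch_fraction=0.20,
                    mispredict_rate=0.5, flush_penalty=1))   # -> 1.1
```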

L16 – Pipelining 26 Comp 411 – Spring 2011 4/20/2011 5-Stage miniMIPS
[Datapath figure: the same 5-stage miniMIPS datapath, now annotated with the hardware added to cope with hazards.]
We wanted a simple, clean pipeline, but: – added CLK EN to freeze the IF/RF stages (inserting a NOP) so we can wait for lw to reach the WB stage – broke the sequential semantics of the ISA by adding a branch delay slot and early branch resolution logic – added A/B bypass muxes to get data before it's written to the regfile

L16 – Pipelining 27 Comp 411 – Spring 2011 4/20/2011 Pipeline Summary (I)
Started with an unpipelined implementation – direct execute, 1 cycle/instruction – it had a long cycle time: mem + regs + alu + mem + wb
We ended up with a 5-stage pipelined implementation – increased throughput (3x???) – delayed branch decision (1 cycle): choose to execute the instruction after the branch – delayed register writeback (3 cycles): add bypass paths (6 x 2 = 12) to forward the correct value – memory data available only in the WB stage: introduce NOPs at IRALU to stall the IF and RF stages until the LD result is ready

L16 – Pipelining 28 Comp 411 – Spring 2011 4/20/2011 Pipeline Summary (II)
Fallacy #1: Pipelining is easy. Smart people get it wrong all of the time!
Fallacy #2: Pipelining is independent of ISA. Many ISA decisions impact how easy or costly it is to implement pipelining (e.g., branch semantics, addressing modes).
Fallacy #3: Increasing pipeline stages improves performance. Diminishing returns; increasing complexity.