1  What is the most boring household activity?. 2 A relevant question  Assuming you’ve got: —One washer (takes 30 minutes) —One drier (takes 40 minutes)

Slides:



Advertisements
Similar presentations
PipelineCSCE430/830 Pipeline: Introduction CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Prof. Yifeng Zhu, U of Maine Fall,
Advertisements

Pipeline Example: cycle 1 lw R10,9(R1) sub R11,R2, R3 and R12,R4, R5 or R13,R6, R7.
CMPT 334 Computer Organization
Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.
Pipelining I (1) Fall 2005 Lecture 18: Pipelining I.
Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee.
Chapter Six 1.
1 Today  All HW1 turned in on time, this is great!  HW2 will be out soon —You will work on procedure calls/stack/etc.  Lab1 will be out soon (possibly.
Computer Organization
The Processor: Datapath & Control
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan
Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 9, 2002 Topic: Pipelining Basics.
1 Atanasoff–Berry Computer, built by Professor John Vincent Atanasoff and grad student Clifford Berry in the basement of the physics building at Iowa State.
Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.
Morgan Kaufmann Publishers
CMPE 421 Advanced Computer Architecture Supplementary material for Pipelining PART1.
Automobile Manufacturing 1. Build frame. 60 min. 2. Add engine. 50 min. 3. Build body. 80 min. 4. Paint. 40 min. 5. Finish.45 min. 275 min. Latency: Time.
Ch6a- 2 EE/CS/CPE Computer Organization  Seattle Pacific University Automobile Manufacturing 1. Build frame. 60 min. 2. Add engine. 50 min. 3.
B 0000 Pipelining ENGR xD52 Eric VanWyk Fall
1 A pipeline diagram  A pipeline diagram shows the execution of a series of instructions. —The instruction sequence is shown vertically, from top to bottom.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
Computer Organization CS224 Chapter 4 Part b The Processor Spring 2010 With thanks to M.J. Irwin, T. Fountain, D. Patterson, and J. Hennessy for some lecture.
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
1 CS/COE0447 Computer Organization & Assembly Language Multi-Cycle Execution.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
Electrical and Computer Engineering University of Cyprus LAB3: IMPROVING MIPS PERFORMANCE WITH PIPELINING.
1 A single-cycle MIPS processor  An instruction set architecture is an interface that defines the hardware operations which are available to software.
December 26, 2015©2003 Craig Zilles (derived from slides by Howard Huang) 1 A single-cycle MIPS processor  As previously discussed, an instruction set.

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
Performance of Single-cycle Design
Pipelining Example Laundry Example: Three Stages
IT 251 Computer Organization and Architecture Multi Cycle CPU Datapath Chia-Chi Teng.
Pipelining CS365 Lecture 9. D. Barbara Pipeline CS465 2 Outline  Today’s topic  Pipelining is an implementation technique in which multiple instructions.
LECTURE 7 Pipelining. DATAPATH AND CONTROL We started with the single-cycle implementation, in which a single instruction is executed over a single cycle.
Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.
Chapter 4 From: Dr. Iyad F. Jafar Basic MIPS Architecture: Single-Cycle Datapath and Control.
February 22, 2016©2003 Craig Zilles (derived from slides by Howard Huang) 1 A single-cycle MIPS processor  As previously discussed, an instruction set.
1 CS/COE0447 Computer Organization & Assembly Language Chapter 5 Part 2.
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 10 Computer Hardware Design (Pipeline Datapath and Control Design) Prof. Dr.
1 The final datapath. 2 Control  The control unit is responsible for setting all the control signals so that each instruction is executed properly. —The.
Computer Architecture Lecture 6.  Our implementation of the MIPS is simplified memory-reference instructions: lw, sw arithmetic-logical instructions:
Lecture 18: Pipelining I.
Computer Organization
Pipelines An overview of pipelining
Note how everything goes left to right, except …
Computer Architecture
IT 251 Computer Organization and Architecture
Performance of Single-cycle Design
CMSC 611: Advanced Computer Architecture
ECE232: Hardware Organization and Design
CS/COE0447 Computer Organization & Assembly Language
Chapter 4 The Processor Part 2
Single-cycle datapath, slightly rearranged
A pipeline diagram Clock cycle lw $t0, 4($sp) IF ID
Lecturer: Alan Christopher
Serial versus Pipelined Execution
Systems Architecture II
The Processor Lecture 3.2: Building a Datapath with Control
Pipelining Appendix A and Chapter 3.
Introduction to Computer Organization and Architecture
A relevant question Assuming you’ve got: One washer (takes 30 minutes)
Pipelining.
Pipelined datapath and control
Presentation transcript:

1  What is the most boring household activity?

2 A relevant question  Assuming you’ve got: —One washer (takes 30 minutes) —One drier (takes 40 minutes) —One “folder” (takes 20 minutes)  It takes 90 minutes to wash, dry, and fold 1 load of laundry. —How long does 4 loads take?

3 The slow way  If each load is done sequentially it takes 6 hours PM Midnight Time

4 Laundry Pipelining  Start each load as soon as possible —Overlap loads  Pipelined laundry takes 3.5 hours 6 PM Midnight Time

5 Pipelining Lessons  Pipelining doesn’t help latency of single load, it helps throughput of entire workload  Pipeline rate limited by slowest pipeline stage  Multiple tasks operating simultaneously using different resources  Potential speedup = Number pipe stages  Unbalanced lengths of pipe stages reduces speedup  Time to “fill” pipeline and time to “drain” it reduces speedup 6 PM 789 Time

6 Pipelining  Pipelining is a general-purpose efficiency technique —It is not specific to processors  Pipelining is used in: —Assembly lines —Bucket brigades —Fast food restaurants  Pipelining is used in other CS disciplines: —Networking —Server software architecture  Useful to increase throughput in the presence of long latency —More on that later…

7 Pipelining Processors  We’ve seen two possible implementations of the MIPS architecture. —A single-cycle datapath executes each instruction in just one clock cycle, but the cycle time may be very long. —A multicycle datapath has much shorter cycle times, but each instruction requires many cycles to execute.  Pipelining gives the best of both worlds and is used in just about every modern processor. —Cycle times are short so clock rates are high. —But we can still execute an instruction in about one clock cycle! Single Cycle DatapathCPI = 1Long Cycle Time Multi-cycle DatapathCPI = ~4Short Cycle Time Pipelined DatapathCPI = ~1Short Cycle Time

8 Instruction execution review  Executing a MIPS instruction can take up to five steps.  However, as we saw, not all instructions need all five steps. StepNameDescription Instruction FetchIFRead an instruction from memory. Instruction DecodeIDRead source registers and generate control signals. ExecuteEXCompute an R-type result or a branch outcome. MemoryMEMRead or write the data memory. WritebackWBStore a result in the destination register. InstructionSteps required beqIFIDEX R-typeIFIDEXWB swIFIDEXMEM lwIFIDEXMEMWB

9 Single-cycle datapath diagram 2ns 1ns  How long does it take to execute each instruction?

10 Single-cycle review  All five execution steps occur in one clock cycle.  This means the cycle time must be long enough to accommodate all the steps of the most complex instruction—a “lw” in our instruction set. —If the register file has a 1ns latency and the memories and ALU have a 2ns latency, “lw” will require 8ns. —Thus all instructions will take 8ns to execute.  Each hardware element can only be used once per clock cycle. —A “lw” or “sw” must access memory twice (in the IF and MEM stages), so there are separate instruction and data memories. —There are multiple adders, since each instruction increments the PC (IF) and performs another computation (EX). On top of that, branches also need to compute a target address.

11 Example: Instruction Fetch (IF) Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite  Let’s quickly review how lw is executed in the single-cycle datapath.  We’ll ignore PC incrementing and branching for now.  In the Instruction Fetch (IF) step, we read the instruction memory.

12 Instruction Decode (ID) Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite  The Instruction Decode (ID) step reads the source register from the register file.

13 Execute (EX) Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite  The third step, Execute (EX), computes the effective memory address from the source register and the instruction’s constant field.

14 Memory (MEM) Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite  The Memory (MEM) step involves reading the data memory, from the address computed by the ALU.

15 Writeback (WB) Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite  Finally, in the Writeback (WB) step, the memory value is stored into the destination register.

16 A bunch of lazy functional units  Notice that each execution step uses a different functional unit.  In other words, the main units are idle for most of the 8ns cycle! —The instruction RAM is used for just 2ns at the start of the cycle. —Registers are read once in ID (1ns), and written once in WB (1ns). —The ALU is used for 2ns near the middle of the cycle. —Reading the data memory only takes 2ns as well.  That’s a lot of hardware sitting around doing nothing.

17 Putting those slackers to work  We shouldn’t have to wait for the entire instruction to complete before we can re-use the functional units.  For example, the instruction memory is free in the Instruction Decode step as shown below, so... Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite Instruction Decode (ID) Idle

18 Decoding and fetching together  Why don’t we go ahead and fetch the next instruction while we’re decoding the first one? Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite Read address Decode 1st instructionFetch 2nd

19 Executing, decoding and fetching  Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction.  But now the instruction memory is free again, so we can fetch the third instruction! Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite Decode 2nd Fetch 3rdExecute 1st

20 Making Pipelining Work  We’ll make our pipeline 5 stages long, to handle load instructions as they were handled in the multi-cycle implementation —Stages are: IF, ID, EX, MEM, and WB  We want to support executing 5 instructions simultaneously: one in each stage.

21 Break datapath into 5 stages  Each stage has its own functional units.  Each stage can execute in 2ns —Just like the multi-cycle implementation Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite ID IF EXEMEM WB 2ns

22 Pipelining Loads 6 PM 789 Time Clock cycle lw$t0, 4($sp) IFIDEXMEMWB lw$t1, 8($sp) IFIDEXMEMWB lw$t2, 12($sp) IFIDEXMEMWB lw$t3, 16($sp) IFIDEXMEMWB lw$t4, 20($sp) IFIDEXMEMWB

23 A pipeline diagram  A pipeline diagram shows the execution of a series of instructions. —The instruction sequence is shown vertically, from top to bottom. —Clock cycles are shown horizontally, from left to right. —Each instruction is divided into its component stages. (We show five stages for every instruction, which will make the control unit easier.)  This clearly indicates the overlapping of instructions. For example, there are three instructions active in the third cycle above. —The “lw” instruction is in its Execute stage. —Simultaneously, the “sub” is in its Instruction Decode stage. —Also, the “and” instruction is just being fetched. Clock cycle lw$t0, 4($sp) IFIDEXMEMWB sub$v0, $a0, $a1 IFIDEXMEMWB and$t1, $t2, $t3 IFIDEXMEMWB or$s0, $s1, $s2 IFIDEXMEMWB add$sp, $sp, -4 IFIDEXMEMWB

24 Pipeline terminology  The pipeline depth is the number of stages—in this case, five.  In the first four cycles here, the pipeline is filling, since there are unused functional units.  In cycle 5, the pipeline is full. Five instructions are being executed simultaneously, so all hardware units are in use.  In cycles 6-9, the pipeline is emptying. fillingfullemptying Clock cycle lw$t0, 4($sp) IFIDEXMEMWB sub$v0, $a0, $a1 IFIDEXMEMWB and$t1, $t2, $t3 IFIDEXMEMWB or$s0, $s1, $s2 IFIDEXMEMWB add$sp, $sp, -4 IFIDEXMEMWB

25 Pipelining Performance  Execution time on ideal pipeline: —time to fill the pipeline + one cycle per instruction —What is the execution time for N instructions?  Compare with other implementations: —Single Cycle: (8ns clock period)  How much faster is pipelining for N=1000 ? Clock cycle lw$t0, 4($sp) IFIDEXMEMWB lw$t1, 8($sp) IFIDEXMEMWB lw$t2, 12($sp) IFIDEXMEMWB lw$t3, 16($sp) IFIDEXMEMWB lw$t4, 20($sp) IFIDEXMEMWB filling

26 Pipeline Datapath: Resource Requirements Clock cycle lw$t0, 4($sp) IFIDEXMEMWB lw$t1, 8($sp) IFIDEXMEMWB lw$t2, 12($sp) IFIDEXMEMWB lw$t3, 16($sp) IFIDEXMEMWB lw$t4, 20($sp) IFIDEXMEMWB  We need to perform several operations in the same cycle. —Increment the PC and add registers at the same time. —Fetch one instruction while another one reads or writes data.  What does this mean for our hardware?

27 Pipelining other instruction types  R-type instructions only require 4 stages: IF, ID, EX, and WB —We don’t need the MEM stage  What happens if we try to pipeline loads with R-type instructions? Clock cycle add$sp, $sp, -4 IFIDEXWB sub$v0, $a0, $a1 IFIDEXWB lw$t0, 4($sp) IFIDEXMEMWB or$s0, $s1, $s2 IFIDEXWB lw$t1, 8($sp) IFIDEXMEMWB

28 Important Observation  Each functional unit can only be used once per instruction  Each functional unit must be used at the same stage for all instructions. See the problem if: —Load uses Register File’s Write Port during its 5th stage —R-type uses Register File’s Write Port during its 4th stage Clock cycle add$sp, $sp, -4 IFIDEXWB sub$v0, $a0, $a1 IFIDEXWB lw$t0, 4($sp) IFIDEXMEMWB or$s0, $s1, $s2 IFIDEXWB lw$t1, 8($sp) IFIDEXMEMWB

29 A solution: Insert NOP stages  Enforce uniformity —Make all instructions take 5 cycles. —Make them have the same stages, in the same order Some stages will do nothing for some instructions Stores and Branches have NOP stages, too… Clock cycle add$sp, $sp, -4 IFIDEXNOPWB sub$v0, $a0, $a1 IFIDEXNOPWB lw$t0, 4($sp) IFIDEXMEMWB or$s0, $s1, $s2 IFIDEXNOPWB lw$t1, 8($sp) IFIDEXMEMWB R-type IFIDEXNOPWB store IFIDEXMEMNOP branch IFIDEXNOP

30 Summary  Pipelining attempts to maximize instruction throughput by overlapping the execution of multiple instructions.  Pipelining offers amazing speedup. —In the best case, one instruction finishes on every cycle, and the speedup is equal to the pipeline depth.  The pipeline datapath is much like the single-cycle one, but with added pipeline registers —Each stage needs is own functional units  Next time we’ll see the datapath and control, and walk through an example execution.