COMP541: Pipelined MIPS. Montek Singh, Apr 9, 2012

Topics  Today's topic: Pipelining. Can think of it as: a way to parallelize, or a way to make better utilization of the hardware. Goal: try to use all of the hardware every clock cycle. Reading: Section 7.5 of the textbook.

Parallelism  Two types of parallelism: Spatial parallelism Spatial parallelism  duplicate hardware performs multiple tasks at once Temporal parallelism Temporal parallelism  task is broken into multiple stages –each stage operating on different parts of distinct instructions  also called pipelining –example: an assembly line

Parallelism Definitions  Token: a group of inputs processed together to produce a group of outputs (a "bundle"). Latency: the time for one token to pass from start to end. Throughput: the number of tokens that can be processed per unit time. Parallelism increases throughput, often sacrificing latency.

Parallelism Example  Ben is baking cookies. It is a 2-part task: it takes 5 minutes to roll the cookies, and 15 minutes to bake them. After finishing one batch he immediately starts the next batch. What are the latency and throughput if Ben does NOT use parallelism? Latency = 5 + 15 = 20 min = 1/3 hour. Throughput = 1 tray / 20 min = 3 trays/hour.

Parallelism Example  What are the latency and throughput if Ben uses parallelism? Spatial parallelism: Ben asks Allysa to help, using her own oven. Temporal parallelism: Ben breaks the task into two stages, rolling and baking, and uses two trays: while the first batch is baking he rolls the second batch, and so on.

Spatial Parallelism Latency = ? Throughput = ?

Spatial Parallelism  Latency = 5 + 15 = 20 min = 1/3 hour (same). Throughput = 2 trays / 20 min = 6 trays/hour (doubled).

Temporal Parallelism Latency = ? Throughput = ?

Temporal Parallelism  Latency = 5 + 15 = 20 min = 1/3 hour (same). Throughput = 1 tray / 15 min = 4 trays/hour (limited by the longer stage). Using both techniques, the throughput would be 8 trays/hour.
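To make the arithmetic of the last few slides concrete, here is a small Python sketch (not part of the original slides) that recomputes the latency and throughput numbers from the roll/bake times given in the example:

ROLL, BAKE = 5, 15  # minutes per stage, from the cookie example

latency = ROLL + BAKE                    # 20 min: one tray, start to finish
throughput_serial = 60 / latency         # 3 trays/hour: no parallelism
throughput_spatial = 2 * 60 / latency    # 6 trays/hour: two bakers, two ovens
throughput_temporal = 60 / max(ROLL, BAKE)  # 4 trays/hour: limited by the slower stage

print(latency, throughput_serial, throughput_spatial, throughput_temporal)
# prints: 20 3.0 6.0 4.0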

Pipelined MIPS  Temporal parallelism: divide the single-cycle processor into 5 stages: Fetch, Decode, Execute, Memory, Writeback. Add pipeline registers between stages.
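As a rough illustration (an informal Python sketch, not the actual datapath), the following shows how instructions occupy the five stages cycle by cycle once pipeline registers are added, assuming no stalls; the function name and instruction labels are made up for the example:

STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]

def timeline(instrs):
    # With no hazards, instruction i sits in stage s during cycle i + s + 1.
    n, k = len(instrs), len(STAGES)
    for cycle in range(1, n + k):
        active = [(STAGES[cycle - 1 - i], instr)
                  for i, instr in enumerate(instrs)
                  if 0 <= cycle - 1 - i < k]
        print(f"cycle {cycle}: " + ", ".join(f"{s}={x}" for s, x in active))

timeline(["lw", "add", "sub"])   # 3 instructions complete in 5 + 3 - 1 = 7 cycles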

Single-Cycle vs. Pipelined Performance  Pipelining: break each instruction into 5 steps, each 250 ps long (the length of the longest step). Register write and register read occur during the first half and second half of the clock cycle, respectively, which allows more instruction overlap!

Pipelining Abstraction  The 5 stages of execution: instruction memory (fetch), register file read, ALU/execution, data memory access, register file write (writeback).

Single-Cycle and Pipelined Datapath  Key difference: insertion of pipeline registers

Multi-Cycle and Pipelined Datapath  Key difference: full-length registers for datapath and control

One Problem  There is a problem: ResultW and WriteReg are out of step.

Corrected Pipelined Datapath  Solution: one modification is needed: send both ResultW and WriteReg through an equal number of registers.

Pipelined Control  What is similar/different w.r.t. single-cycle MIPS? The control signal values are the same, but they must be delayed and delivered in the correct cycles: control signals must go through pipeline registers as well!

Pipelining Challenges  "Hazards" arise when an instruction depends on a result from a previous instruction that has not yet completed. There are 2 types of hazards. Data hazard: a register value has not been written back to the register file yet; an instruction produces a result needed by the next instruction, but the new value is stored during the writeback stage while the next instruction needs it during register read (following too close behind!). Control hazard: we don't know which instruction is next; the next PC is not decided yet; caused by conditional branches; the next instruction cannot be fetched until the current branch is decided.

Data Hazard: Example  The first instruction computes a result ($s0) needed by the next 3 instructions: the first two cause problems; the third actually does not!

Handling Data Hazards  Static/compiler approaches: insert nops (no operations) in the code at compile time, or rearrange the code at compile time. Dynamic/runtime approaches: forward data at run time, or stall the processor at run time.

Compile-Time Hazard Elimination  Insert enough nops between dependent instructions, or re-arrange the code: move independent instructions earlier.

Dynamic Approach: Data Forwarding  Also known as bypassing: the result is actually available even though it has not yet been stored in the register file, so grab a copy and send it where it is needed! Note: forwarding is actually not needed for sub. Why?

Data Forwarding

Forward to the Execute stage from either the Memory stage or the Writeback stage.
Forwarding logic for ForwardAE:
if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
else ForwardAE = 00
The forwarding logic for ForwardBE is the same, but with rsE replaced by rtE.

Data Forwarding
if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
else ForwardAE = 00
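A behavioral Python sketch of the forwarding logic above (a readable paraphrase, not the hardware itself); the signal names follow the slides, and the function names are just mine:

def forward_ae(rsE, WriteRegM, RegWriteM, WriteRegW, RegWriteW):
    # Select for the Execute-stage SrcA mux:
    #   0b10 -> forward the ALU result from the Memory stage
    #   0b01 -> forward the result from the Writeback stage
    #   0b00 -> use the value read from the register file
    if rsE != 0 and rsE == WriteRegM and RegWriteM:
        return 0b10
    if rsE != 0 and rsE == WriteRegW and RegWriteW:
        return 0b01
    return 0b00

def forward_be(rtE, WriteRegM, RegWriteM, WriteRegW, RegWriteW):
    # Identical logic, but checks rtE (the second source register).
    return forward_ae(rtE, WriteRegM, RegWriteM, WriteRegW, RegWriteW)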

Forwarding may not always work…  Example: a load followed immediately by an R-type instruction. Loads are harder to deal with because they need memory access, so they need one extra cycle compared to R-type instructions: lw has a 2-cycle latency!

Stalling  Stall for a cycle, then forward solves the Load followed by R-type problem solves the Load followed by R-type problem

Stalling Hardware

Stalling Control  Stalling logic:
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall

Stalling Control
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
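A matching Python sketch of the lwstall logic on the last two slides (again a behavioral paraphrase, reusing the slide's signal names):

def lwstall(rsD, rtD, rtE, MemtoRegE):
    # The instruction in Decode reads a register that a lw currently in Execute
    # will not produce until the Memory stage, so the pipeline must stall.
    return (rsD == rtE or rtD == rtE) and MemtoRegE

def stall_controls(rsD, rtD, rtE, MemtoRegE):
    stall = lwstall(rsD, rtD, rtE, MemtoRegE)
    # Freeze Fetch and Decode; turn the instruction entering Execute into a bubble.
    return {"StallF": stall, "StallD": stall, "FlushE": stall}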

Control Hazards  beq: the branch is not determined until the fourth stage of the pipeline. Instructions after the branch are fetched before the branch occurs, and these instructions must be flushed if the branch is taken.

Effect & Solutions  Could always stall when a branch is decoded: expensive, 3 cycles lost per branch! Could predict the branch outcome, and flush if wrong: the branch misprediction penalty (instructions flushed when the branch is taken) may be reduced by determining the branch earlier.

Control Hazards: Flushing  Flushing turns an instruction into a NOP (all zeros), rendering it harmless!

Control Hazards: Original Pipeline (for comparison)

Control Hazards: Early Branch Resolution  This introduces another data hazard in the Decode stage (fix a few slides away).

Control Hazards with Early Branch Resolution Penalty now only one lost cycle

Aside: Delayed Branch  MIPS always executes the instruction following a branch, so the branch is delayed. This allows us to avoid killing the instruction: compilers move an instruction that has no conflict with the branch into the delay slot.

Example  This sequence add $4 $5 $6 beq $1 $2 40  reordered to this beq $1 $2 40 add $4 $5 $6 39

Handling the New Hazards

Control Forwarding and Stalling Hardware  Forwarding logic:
ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM
Stalling logic:
branchstall = (BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD)) OR (BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD))
StallF = StallD = FlushE = lwstall OR branchstall
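The same logic sketched behaviorally in Python (signal names follow the slide; this is illustrative, not the actual hardware description):

def ForwardAD(rsD, WriteRegM, RegWriteM):
    # Forward the Memory-stage ALU result to the Decode-stage branch comparator.
    return rsD != 0 and rsD == WriteRegM and RegWriteM

def ForwardBD(rtD, WriteRegM, RegWriteM):
    return rtD != 0 and rtD == WriteRegM and RegWriteM

def branchstall(BranchD, rsD, rtD, RegWriteE, WriteRegE, MemtoRegM, WriteRegM):
    # Stall the branch in Decode if it needs an ALU result still in Execute,
    # or a load result still in Memory.
    return ((BranchD and RegWriteE and WriteRegE in (rsD, rtD)) or
            (BranchD and MemtoRegM and WriteRegM in (rsD, rtD)))

# As on the slide: StallF = StallD = FlushE = lwstall OR branchstall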

Branch Prediction  Especially important if the branch penalty is more than 1 cycle. Guess whether the branch will be taken: backward branches are usually taken (loops); perhaps consider the history of whether the branch was previously taken to improve the guess. Good prediction reduces the fraction of branches requiring a flush.
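The slide leaves the history mechanism open; one common choice, shown here only as an illustrative Python sketch (not something specified on the slides), is a table of 2-bit saturating counters indexed by the branch's PC:

class TwoBitPredictor:
    def __init__(self, entries=256):
        # Each entry is a 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
        self.entries = entries
        self.counters = [1] * entries   # start at "weakly not taken"

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)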

Pipelined Performance Example  Ideally CPI = 1, but in practice it is higher because of stalls (caused by loads and branches). SPECINT2000 benchmark: 25% loads, 10% stores, 11% branches, 2% jumps, 52% R-type. Suppose: 40% of loads are used by the next instruction, 25% of branches are mispredicted, and all jumps flush the next instruction. What is the average CPI?

Pipelined Performance Example  SPECINT2000 benchmark: 25% loads, 10% stores, 11% branches, 2% jumps, 52% R-type. Suppose: 40% of loads are used by the next instruction, 25% of branches are mispredicted, and all jumps flush the next instruction. What is the average CPI? Load/branch CPI = 1 when not stalling, 2 when stalling. Thus: CPI_lw = 1(0.6) + 2(0.4) = 1.4; CPI_beq = 1(0.75) + 2(0.25) = 1.25. Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15.
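The same computation in Python, simply restating the instruction mix and stall rates given on the slide:

mix = {"lw": 0.25, "sw": 0.10, "beq": 0.11, "j": 0.02, "R-type": 0.52}
cpi = {
    "lw":     1 * 0.60 + 2 * 0.40,   # 40% of loads stall one cycle -> 1.4
    "beq":    1 * 0.75 + 2 * 0.25,   # 25% of branches mispredicted -> 1.25
    "j":      2,                     # every jump flushes the next instruction
    "sw":     1,
    "R-type": 1,
}
average_cpi = sum(mix[k] * cpi[k] for k in mix)
print(round(average_cpi, 4))         # 1.1475, i.e. about 1.15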

Pipelined Performance  Pipelined processor critical path:
Tc = max { t_pcq + t_mem + t_setup,
           2 (t_RFread + t_mux + t_eq + t_AND + t_mux + t_setup),
           t_pcq + t_mux + t_mux + t_ALU + t_setup,
           t_pcq + t_memwrite + t_setup,
           2 (t_pcq + t_mux + t_RFwrite) }

Pipelined Performance Example  Tc = 2 (t_RFread + t_mux + t_eq + t_AND + t_mux + t_setup) = 2 (275 ps) = 550 ps

Pipelined Performance Example  For a program with 100 billion instructions executing on a pipelined MIPS processor: CPI = 1.15, Tc = 550 ps. Execution Time = (# instructions) × CPI × Tc = (100 × 10^9)(1.15)(550 × 10^-12) s ≈ 63 seconds.
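And the execution-time arithmetic, again just the slide's numbers in Python:

instructions = 100e9        # 100 billion instructions
CPI = 1.15
Tc = 550e-12                # 550 ps cycle time

execution_time = instructions * CPI * Tc
print(execution_time)       # 63.25, i.e. about 63 seconds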

Summary  Pipelining benefits: uses hardware more efficiently; throughput increases. Pipelining challenges/drawbacks: latency increases; hazards ensue; energy/power consumption increases. All modern processors are pipelined; some have way more than 5 pipeline stages (some Pentiums have had more than 20 pipeline stages!).