EECS 470 Pipeline Hazards Lecture 4 Coverage: Appendix A.


Slide 2: Basic Pipelining
Data hazards
- What are they?
- How do you detect them?
- How do you deal with them?
Micro-architectural changes
- Pipeline depth
- Pipeline width
Forwarding ISA

Slide 3: [Figure: pipelined datapath with stages Fetch, Decode, Execute, Memory, WB. Components: PC, instruction memory, register file (R0-R7, read ports regA and regB), ALU, data memory, adders, and muxes. Pipeline registers IF/ID, ID/EX, EX/Mem, and Mem/WB hold instruction, op, dest, offset, valA, valB, PC+1, target, ALU result, eq?, and mdata. Instruction bit fields are labeled.]

Slide 4: [Figure: the same pipelined datapath without the instruction bit-field labels.]

Slide 5: [Figure: the same pipelined datapath without the dest fields in the pipeline registers, and with a fwd label added.]

Slide 6: Pipeline function for ADD
Fetch: read instruction from memory
Decode: read source operands from reg file
Execute: calculate sum
Memory: pass result to next stage
Writeback: write sum into register file

Slide 7: Data Hazards
  add 1 2 3
  nand 3 4 5
[Figure: timing diagram of add and nand moving through fetch, decode, execute, memory, writeback over time.]
If not careful, you will read the wrong value of R3.
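The timing problem on this slide can be sketched in a few lines of Python. The stage order comes from the slides; the function names are hypothetical, and the sketch assumes one instruction is fetched per cycle, nothing stalls, and a register read in the same cycle as the write still sees the old value.

```python
# Stage order taken from the slides; everything else is an illustrative assumption.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def stage_cycle(fetch_cycle, stage):
    """Cycle in which an instruction fetched in `fetch_cycle` occupies `stage`,
    assuming it never stalls."""
    return fetch_cycle + STAGES.index(stage)

def reads_stale_value(producer_fetch, consumer_fetch):
    """True if the consumer reads its source register (in decode) no later than
    the cycle in which the producer writes it (in writeback).
    Assumption: a same-cycle read does NOT see the new value."""
    write_cycle = stage_cycle(producer_fetch, "writeback")
    read_cycle = stage_cycle(consumer_fetch, "decode")
    return read_cycle <= write_cycle

# add fetched in cycle 1, nand fetched in cycle 2:
# add writes R3 in cycle 5, nand reads R3 in cycle 3.
print(reads_stale_value(1, 2))
```

Under these assumptions the back-to-back add and nand pair reads R3 two cycles before it is written, which is the hazard the slide warns about.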

Slide 8: Three approaches to handling data hazards
Avoidance
- Make sure there are no hazards in the code.
Detect and Stall
- If hazards exist, stall the processor until they go away.
Detect and Forward
- If hazards exist, fix up the pipeline to get the correct value (if possible).

Slide 9: Handling data hazards: avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation.
- Make sure no hazards exist. Put noops between any dependent instructions.
  add 1 2 3     (write R3 in cycle 5)
  noop
  nand 3 4 5    (read R3 in cycle 6)
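As a rough illustration of what a compiler would have to do under the avoidance approach, the sketch below pads a program with noops. The tuple layout (op, regA, regB, dest) mirrors the add and nand examples on the slides; the function name and the `min_distance` parameter are hypothetical, and the right distance depends on the pipeline implementation, which is exactly the coupling the slide describes.

```python
NOOP = ("noop",)

def insert_noops(program, min_distance):
    """Pad `program` so that any instruction reading a register sits at least
    `min_distance` slots after the instruction that last wrote it.
    Instructions are (op, reg_a, reg_b, dest) tuples; reg_a and reg_b are
    sources. `min_distance` is an assumed, implementation-specific parameter."""
    padded = []
    last_write = {}  # register number -> index in `padded` of its latest writer
    for instr in program:
        if instr == NOOP:
            padded.append(instr)
            continue
        op, reg_a, reg_b, dest = instr
        # Earliest slot at which every source register is safe to read.
        earliest = len(padded)
        for src in (reg_a, reg_b):
            if src in last_write:
                earliest = max(earliest, last_write[src] + min_distance)
        padded.extend([NOOP] * (earliest - len(padded)))
        last_write[dest] = len(padded)
        padded.append(instr)
    return padded

program = [("add", 1, 2, 3), ("nand", 3, 4, 5)]
for instr in insert_noops(program, min_distance=3):
    print(" ".join(str(field) for field in instr))
```

Changing `min_distance` changes the emitted program, which is why code padded for one pipeline depth may not be correct on another.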

Slide 10: Problems with this solution
Old programs (legacy code) may not run correctly on new implementations
- Longer pipelines need more noops
Programs get larger as noops are included
- Especially a problem for machines that try to execute more than one instruction every cycle
- Intel EPIC: often 25% - 40% of instructions are noops
Program execution is slower
- CPI is one, but some I's are noops
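The last point can be made concrete with a small calculation. This is a sketch under the assumption that every issued instruction, noop or not, takes exactly one cycle; the function name is hypothetical.

```python
def cycles_per_useful_instruction(noop_fraction):
    """Cycles spent per non-noop instruction when every issued instruction
    (including noops) takes one cycle.
    Assumption: `noop_fraction` is the share of issued instructions that are noops."""
    if not 0 <= noop_fraction < 1:
        raise ValueError("noop_fraction must be in [0, 1)")
    return 1.0 / (1.0 - noop_fraction)

for fraction in (0.25, 0.40):
    print(fraction, round(cycles_per_useful_instruction(fraction), 2))
```

With a quarter of the instruction stream being noops, each useful instruction costs about 1.33 cycles; at 40% it costs about 1.67 cycles, even though the nominal CPI is still one.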

Slide 11: Handling data hazards: detect and stall
Detection:
- Compare regA with previous DestRegs (3-bit operand fields)
- Compare regB with previous DestRegs (3-bit operand fields)
Stall:
- Keep current instructions in fetch and decode
- Pass a noop to execute
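The detection and stall rules above might be modeled as follows. This is a behavioral sketch, not the lecture's hardware: the function names, the (op, regA, regB, dest) tuple layout, and the use of `None` for instructions that write no register are assumptions made for illustration.

```python
def detect_hazard(reg_a, reg_b, prior_dests):
    """True if either source register matches the destination register of an
    older instruction still in the pipeline.
    `prior_dests` holds the dest fields of those older instructions
    (None for an instruction that writes no register)."""
    return any(d is not None and d in (reg_a, reg_b) for d in prior_dests)

def decode_stage_output(instr, prior_dests):
    """Decide what decode hands to execute this cycle.
    Returns (instruction_for_execute, advance), where advance=False means
    fetch and decode keep their current instructions."""
    op, reg_a, reg_b, dest = instr
    if detect_hazard(reg_a, reg_b, prior_dests):
        return ("noop",), False  # stall: send a bubble down the pipeline
    return instr, True

# nand 3 4 5 in decode while add (dest 3) is still ahead of it in the pipeline:
print(decode_stage_output(("nand", 3, 4, 5), prior_dests=[3]))
```

Because a stall only needs a yes or no answer, the individual comparator results can simply be combined into one signal here.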

Slide 12: [Figure: pipelined datapath at the end of cycle 1; add is in the IF/ID register.]

Slide 13: [Figure: pipelined datapath at the end of cycle 2; add is in ID/EX with destination 3, and nand is in IF/ID.]

Slide 14: Hazard detection
[Figure: pipelined datapath in the first half of cycle 3; add is in ID/EX, nand is in IF/ID, and register number 3 appears on both.]

Slide 15: [Figure: close-up of the register file between IF/ID and ID/EX; regA and regB are each compared against destination register 3, and the result is labeled "Hazard detected".]

Slide 16: [Figure: hazard detection logic; comparators on regA and regB produce the "Hazard detected" signal.]

Slide 17: Handling data hazards: detect and stall the pipeline until ready
Detection:
- Compare regA with previous DestReg (3-bit operand fields)
- Compare regB with previous DestReg (3-bit operand fields)
Stall:
- Keep current instructions in fetch and decode
- Pass a noop to execute

Slide 18: [Figure: pipelined datapath in the first half of cycle 3 with Hazard asserted; add is in ID/EX, nand is in IF/ID, and an en (enable) input is shown.]

Slide 19: Handling data hazards: detect and stall the pipeline until ready
Detection:
- Compare regA with previous DestReg (3-bit operand fields)
- Compare regB with previous DestReg (3-bit operand fields)
Stall:
- Keep current instructions in fetch and decode
- Pass a noop to execute

Slide 20: [Figure: pipelined datapath at the end of cycle 3; a noop has been passed to execute, add has advanced, and nand remains in IF/ID.]

Slide 21: [Figure: pipelined datapath in the first half of cycle 4 with Hazard asserted; noop and add are ahead of nand, and the en (enable) input is shown.]

Slide 22: [Figure: pipelined datapath at the end of cycle 4; a second noop has been passed to execute, add has advanced again, and nand remains in IF/ID.]

Slide 23: [Figure: pipelined datapath in the first half of cycle 5, labeled No Hazard; two noops and add are ahead of nand.]

Slide 24: [Figure: pipelined datapath at the end of cycle 5; nand has advanced out of decode, followed in the figure by noop and add.]

Slide 25: No more hazard: stalling
  add 1 2 3
  nand 3 4 5
  time    1      2       3        4       5          6
  add     fetch  decode  execute  memory  writeback
  nand           fetch   decode   decode  decode     execute
hazard: nand repeats decode until the hazard clears.
We are careful to get the right value of R3.
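The stalled timeline on this slide can be reproduced with a short sketch. It assumes, as the slide's timeline suggests, that the dependent instruction's final decode falls in the same cycle as the producer's writeback; the function name and the fetch-cycle parameters are hypothetical.

```python
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def consumer_timeline(producer_fetch, consumer_fetch):
    """Map cycle -> stage for a dependent instruction that is held in decode
    until the producer reaches writeback.
    Assumption: the final decode may share a cycle with the producer's writeback."""
    producer_wb = producer_fetch + STAGES.index("writeback")
    timeline = {consumer_fetch: "fetch"}
    cycle = consumer_fetch + 1
    # Stalled decode cycles: the producer has not reached writeback yet.
    while cycle < producer_wb:
        timeline[cycle] = "decode"
        cycle += 1
    # Final decode: the source register now holds the right value.
    timeline[cycle] = "decode"
    cycle += 1
    for stage in STAGES[2:]:
        timeline[cycle] = stage
        cycle += 1
    return timeline

# add fetched in cycle 1, nand fetched in cycle 2
for cycle, stage in sorted(consumer_timeline(1, 2).items()):
    print(cycle, stage)
```

For the add and nand pair this gives fetch, then three decode cycles, then execute in cycle 6, matching the slide; an instruction fetched far enough behind its producer gets a single decode cycle and no stall.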

Slide 26: Problems with detect and stall
CPI increases every time a hazard is detected!
Is that necessary? Not always!
- Re-route the result of the add to the nand
  nand no longer needs to read R3 from the reg file
  It can get the data later (when it is ready)
  This lets us complete the decode this cycle
- But we need more control to remember that the data we aren't getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.

Slide 27: Handling data hazards: detect and forward
Detection: same as detect and stall
- Except that all 4 hazards are treated differently
  i.e., you can't logical-OR the 4 hazard signals
Forward:
- New datapaths to route computed data to where it is needed
- New mux and control to pick the right data
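One way to picture why the hazard signals cannot be OR-ed together: the control must know which older instruction matched which operand in order to steer the mux. The sketch below is a behavioral model under assumptions; the function name, the "regfile" label, and the youngest-first ordering of `in_flight` are illustrative and not taken from the slides. Only the pipeline register names come from the datapath figures.

```python
def forward_select(src_reg, in_flight):
    """Choose the mux input for one source operand.
    `in_flight` is a list of (source_name, dest_reg) pairs for older
    instructions, ordered youngest first (an assumed convention).
    The youngest matching producer wins; with no match, read the register file."""
    for source_name, dest in in_flight:
        if dest is not None and dest == src_reg:
            return source_name
    return "regfile"

# nand 3 4 5, with an older instruction writing R3 one stage ahead
in_flight = [("EX/Mem", 3), ("Mem/WB", 4)]
print(forward_select(3, in_flight))  # operand regA
print(forward_select(4, in_flight))  # operand regB
```

Each operand gets its own selection, so regA and regB can be forwarded from different pipeline registers in the same cycle.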

Slide 28: [Figure: pipelined datapath in the first half of cycle 3 with Hazard asserted on register 3; add and nand are in flight and a fwd path is shown.]

Slide 29: [Figure: pipelined datapath at the end of cycle 3; nand, add, and add are in flight, and hazard register H1 holds 3.]

Slide 30: [Figure: pipelined datapath in the first half of the next cycle, labeled New Hazard; nand, add, and add are in flight, with H1 and an added mux shown.]

Slide 31: [Figure: pipelined datapath at the end of cycle 4; add, nand, add, and lw are in flight, with hazard registers H1 and H2 shown.]

Slide 32: [Figure: pipelined datapath in the first half of cycle 5, labeled No Hazard; add, nand, add, and lw are in flight, with H1 and H2 shown.]

Slide 33: [Figure: pipelined datapath at the end of cycle 5; lw, add, nand, and sw are in flight, with H1 and H2 shown.]

Slide 34: [Figure: pipelined datapath in the first half of cycle 6 with Hazard asserted on register 6; lw, add, nand, and sw are in flight, and the en (enable) input is shown.]

Slide 35: [Figure: pipelined datapath at the end of cycle 6; a noop has been inserted, with lw, add, and sw in flight and H2 shown.]

Slide 36: [Figure: pipelined datapath in the first half of cycle 7 with Hazard on register 6; noop, lw, add, and sw are in flight, with H2 shown.]

Slide 37: [Figure: pipelined datapath at the end of cycle 7; sw, noop, and lw are in flight, with hazard register H3 shown.]

Slide 38: [Figure: pipelined datapath in the first half of the next cycle; sw, noop, and lw are in flight, with H3 shown.]

Slide 39: [Figure: pipelined datapath at the end of cycle 8; sw and noop are in flight, with H3 shown.]

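Slide 40: FP pipeline support
[Figure: fetch and decode feed parallel execution paths that rejoin at Mem and WB. Paths are labeled I (add), M1 through M7 (FP multiply), A1 through A4 (FP adder), and a non-pipelined divide.]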

Slide 41: Adding pipeline stages
Pipeline frontend
- Fetch, Decode
Pipeline middle
- Execute
Pipeline backend
- Memory, Writeback

Slide 42: Adding stages to fetch, decode
Delays hazard detection
No change in forwarding paths
No performance penalty with respect to data hazards

Slide 43: Adding stages to execute
Check for structural hazards
- ALU not pipelined
- Multiple ALU ops completing at same time
Data hazards may cause delays
- If multicycle op hasn't computed data before the dependent instruction is ready to execute
Performance penalty for each stall
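The "multiple ALU ops completing at same time" case can be checked with a small sketch. The function name, the operation names, and the latencies in the example are assumptions chosen for illustration (seven and four execute cycles echo the M and A stage labels on the FP pipeline slide, but that mapping is not stated in the source).

```python
from collections import defaultdict

def completion_conflicts(ops):
    """Find cycles in which more than one operation finishes execute.
    `ops` is a list of (name, issue_cycle, latency) tuples, where latency is
    the number of execute cycles (an assumed model of a multicycle unit).
    Returns {cycle: [names]} for every cycle with two or more completions."""
    finishing = defaultdict(list)
    for name, issue_cycle, latency in ops:
        finishing[issue_cycle + latency - 1].append(name)
    return {cycle: names for cycle, names in finishing.items() if len(names) > 1}

ops = [("fmul", 1, 7), ("fadd", 4, 4), ("add", 5, 1)]
print(completion_conflicts(ops))
```

In this example the multiply and the FP add both leave execute in cycle 7, so they would compete for the single path into the memory stage unless one of them is stalled.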

Slide 44: Adding stages to memory, writeback
Instructions ready to execute may need to wait longer for multi-cycle memory stage
Adds more pipeline registers
- Thus more source registers to forward
More complex hazard detection
Wider muxes
More control bits to manage muxes

Slide 45: Wider pipelines
[Figure: two parallel pipelines, each with fetch, decode, execute, mem, WB.]
More complex hazard detection
- 2X pipeline registers to forward from
- 2X more instructions to check
- 2X more destinations (muxes)

Slide 46: Making forwarding explicit
  add r1 <- r2, EX/Mem ALU result
- Include direct mux controls into the ISA
- Hazard detection is now a compiler task
- New micro-architecture leads to new ISA
- Can reduce some resources
  No longer need to build a heavily ported reg file
Ref: TTAs: Missing the ILP complexity wall