EEL4713C Ann Gordon-Ross.1 EEL-4713C Computer Architecture Pipelined Processor - Hazards.

Slides:

Advertisements

Similar presentations

CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.

Advertisements

Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.

Instruction-Level Parallelism (ILP)

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Review: MIPS Pipeline Data and Control Paths

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

ECE 361 Computer Architecture Lecture 13: Designing a Pipeline Processor Start X:40.

Chapter 5 Pipelining and Hazards

1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.

©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan

1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Computer ArchitectureFall 2007 © October 24nd, 2007 Majd F. Sakr CS-447– Computer Architecture.

Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.

ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.

1 第三章 Instruction-Level Parallelism and Its Dynamic Exploitation 陈文智浙江大学计算机学院 2011 年 09 月.

Pipelining. 10/19/ Outline 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion.

CPE 731 Advanced Computer Architecture Pipelining Review Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of California,

EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.

CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.

CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.

CMPE 421 Parallel Computer Architecture

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.

CMPE 421 Parallel Computer Architecture Part 2: Hardware Solution: Forwarding.

Chapter 6 Pipelined CPU Design. Spring 2005 ELEC 5200/6200 From Patterson/Hennessey Slides Pipelined operation – laundry analogy Text Fig. 6.1.

CECS 440 Pipelining.1(c) 2014 – R. W. Allison [slides adapted from D. Patterson slides with additional credits to M.J. Irwin]

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and

1 (Based on text: David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3 rd Ed., Morgan Kaufmann,

HazardsCS510 Computer Architectures Lecture Lecture 7 Pipeline Hazards.

CPE 442 hazards.1 Introduction to Computer Architecture CpE 442 Designing a Pipeline Processor (lect. II)

CS252/Patterson Lec 1.1 1/17/01 معماري کامپيوتر - درس نهم pipeline برگرفته از درس : Prof. David A. Patterson.

CSIE30300 Computer Architecture Unit 05: Overcoming Data Hazards Hsin-Chou Chi [Adapted from material by and

HazardsCS510 Computer Architectures Lecture Lecture 7 Pipeline Hazards.

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)

CSE 340 Computer Architecture Spring 2016 Overcoming Data Hazards.

Computer Organization

Computer Organization CS224

Stalling delays the entire pipeline

Review: Instruction Set Evolution

Pipelining: Hazards Ver. Jan 14, 2014

CDA 3101 Spring 2016 Introduction to Computer Organization

5 Steps of MIPS Datapath Figure A.2, Page A-8

Single Clock Datapath With Control

Appendix C Pipeline implementation

ECS 154B Computer Architecture II Spring 2009

Appendix A - Pipelining

CpE 442 Designing a Pipeline Processor (lect. II)

Chapter 4 The Processor Part 3

Review: MIPS Pipeline Data and Control Paths

Pipelining review.

The processor: Pipelining and Branching

Lecture 9. MIPS Processor Design – Pipelined Processor Design #2

Pipelining in more detail

Data Hazards Data Hazard

Pipeline control unit (highly abstracted)

The Processor Lecture 3.6: Control Hazards

Control unit extension for data hazards

The Processor Lecture 3.5: Data Hazards

Instruction Execution Cycle

Overview What are pipeline hazards? Types of hazards

Pipeline control unit (highly abstracted)

Pipeline Control unit (highly abstracted)

Pipelining (II).

Control unit extension for data hazards

Introduction to Computer Organization and Architecture

Control unit extension for data hazards

Throughput = #instructions per unit time (seconds/cycles etc.)

Stalls and flushes Last time, we discussed data hazards that can occur in pipelined CPUs if some instructions depend upon others that are still executing.

ELEC / Computer Architecture and Design Spring 2015 Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher.

Presentation transcript:

EEL4713C Ann Gordon-Ross.1 EEL-4713C Computer Architecture Pipelined Processor - Hazards

EEL4713C Ann Gordon-Ross.2 Outline & Announcements °Introduction to Hazards °Forwarding °1 cycle Load Delay °1 cycle Branch Delay °What makes pipelining hard

EEL4713C Ann Gordon-Ross.3 Pipelining – dealing with hazards °Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle structural hazards: HW cannot support this combination of instructions data hazards: instruction depends on result of prior instruction still in the pipeline control hazards: pipelining of branches & other instructions that change the PC °Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

EEL4713C Ann Gordon-Ross.4 Mem Single Memory is a Structural Hazard I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 ALU Mem Reg MemReg ALU Mem Reg MemReg ALU Mem Reg MemReg ALU Reg MemReg ALU Mem Reg MemReg

EEL4713C Ann Gordon-Ross.5 Option 1: Stall to resolve Memory Structural Hazard I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Instr 3(stall) Instr 4 ALU Mem Reg MemReg ALU Mem Reg MemReg ALU Mem Reg MemReg bubble ALU Mem Reg MemReg ALU Mem Reg MemReg

EEL4713C Ann Gordon-Ross.6 Option 2: Duplicate to Resolve Structural Hazard I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 ALU Im Reg Dm Reg ALU Im Reg DmReg ALU Im Reg DmReg Im ALU Reg DmReg ALU Im Reg DmReg Separate Instruction Cache (Im) & Data Cache (Dm)

EEL4713C Ann Gordon-Ross.7 Data Hazard on r1 add r1,r2,r3 sub r4, r1,r3 and r6, r1,r7 or r8, r1,r9 xor r10, r1,r11

EEL4713C Ann Gordon-Ross.8 Data Hazard on r1: I n s t r. O r d e r Time (clock cycles) add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 IFID/RFEXMEMWB ALU Im Reg Dm Reg ALU Im Reg DmReg ALU Im Reg DmReg Im ALU Reg DmReg ALU Im Reg DmReg Dependencies backwards in time are hazards Recall, read/write at end/beginning of cycles, thus values passed through reg file

EEL4713C Ann Gordon-Ross.9 sub r4, r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Option1: HW Stalls to Resolve Data Hazard I n s t r. O r d e r Time (clock cycles) add r1,r2,r3 IFID/RFEXMEMWB ALU Im Reg Dm Reg ALU Im Reg Dm Im bubble ALU Reg Dm Reg ALU Im Reg Im Reg

EEL4713C Ann Gordon-Ross.10 But recall how the control logic works °The Main Control generates the control signals during Reg/Dec Control signals for Exec (ExtOp, ALUSrc,...) are used 1 cycle later Control signals for Mem (MemWr Branch) are used 2 cycles later Control signals for Wr (MemtoReg MemWr) are used 3 cycles later IF/ID Register ID/Ex Register Ex/Mem Register Mem/Wr Register Reg/DecExecMem ExtOp ALUOp RegDst ALUSrc Branch MemWr MemtoReg RegWr Main Control ExtOp ALUOp RegDst ALUSrc MemtoReg RegWr MemtoReg RegWr MemtoReg RegWr Branch MemWr Branch MemWr Wr

EEL4713C Ann Gordon-Ross.11 Option 1: HW stalls pipeline I n s t r. O r d e r Time (clock cycles) add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 IFID/RFEXMEMWB ALU Im Reg Dm Reg ALU Im Reg DmReg HW doesn ’ t change PC => keeps fetching same instruction & sets control signals to benign values (0) stall ALU Im Reg Dm bubble Im bubble Im bubble Im

EEL4713C Ann Gordon-Ross.12 Option 2: SW inserts independent instructions I n s t r. O r d e r Time (clock cycles) add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 IFID/RFEXMEMWB ALU Im Reg Dm Reg ALU Im Reg DmReg Worst case inserts NOP instructions nop ALU Im Reg Dm ALU Im Reg DmReg ALU Im Reg DmReg Im ALU Reg DmReg

EEL4713C Ann Gordon-Ross.13 Option 3 Insight: Data is available! I n s t r. O r d e r Time (clock cycles) add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 IFID/RFEXMEMWB ALU Im Reg Dm Reg ALU Im Reg DmReg ALU Im Reg DmReg Im ALU Reg DmReg ALU Im Reg DmReg Pipeline registers already contain needed data Key enabler: Reg file written at beginning of cycle, read at end

EEL4713C Ann Gordon-Ross.14 HW Change for “Forwarding” (Bypassing): ) Increase multiplexers to add paths from pipeline registers

EEL4713C Ann Gordon-Ross.15 Load delays °Although Load is fetched during Cycle 1: Data loaded from memory in cycle 4 The data is NOT written into the Reg File until Cycle 5 We cannot read this value from the Reg File until Cycle 6 2-instruction delay before the load take effect Clock Cycle 1Cycle 2Cycle 3Cycle 4Cycle 5Cycle 6Cycle 7Cycle 8 IfetchReg/DecExecMemWrI0: Load IfetchReg/DecExecMemWrPlus 1 IfetchReg/DecExecMemWrPlus 2 IfetchReg/DecExecMemWrPlus 3 IfetchReg/DecExecMemWrPlus 4

EEL4713C Ann Gordon-Ross.16 Forwarding reduces Data Hazard to 1 cycle: I n s t r. O r d e r Time (clock cycles) lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9 IFID/RFEXMEMWB ALU Im Reg Dm Reg ALU Im Reg DmReg ALU Im Reg DmReg Im ALU Reg DmReg

EEL4713C Ann Gordon-Ross.17 Option1: HW Stalls to Resolve Data Hazard I n s t r. O r d e r Time (clock cycles) lw r1, 0(r2) sub r4,r1,r3 IFID/RFEXMEMWB ALU Im Reg Dm Reg Check for hazard & stalls stall bubble Im and r6,r1,r7 or r8,r1,r9 ALU Im Reg DmReg ALU Im Reg DmReg Im ALU Reg DmReg

EEL4713C Ann Gordon-Ross.18 Option 2: SW inserts independent instructions I n s t r. O r d e r Time (clock cycles) lw r1, 0(r2) sub r4,r1,r3 IFID/RFEXMEMWB Worst case inserts NOP instructions nop and r6,r1,r7 or r8,r1,r9 ALU Im Reg DmReg ALU Im Reg DmReg Im ALU Reg DmReg ALU Im Reg Dm Reg ALU Im Reg Dm Reg

EEL4713C Ann Gordon-Ross.19 Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SWd,Rd *Software Scheduling to Avoid Load Hazards

EEL4713C Ann Gordon-Ross.20 Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SWd,Rd *Software Scheduling to Avoid Load Hazards stall

EEL4713C Ann Gordon-Ross.21 Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SWd,Rd Software Scheduling to Avoid Load Hazards Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SWd,Rd

EEL4713C Ann Gordon-Ross.22 Compiler Avoiding Load Stalls:

EEL4713C Ann Gordon-Ross.23 Branch delay °Although Beq is fetched during Cycle 4: Target address is NOT written into the PC until the end of Cycle 7 Branch’s target is NOT fetched until Cycle 8 3-instruction delay before the branch take effect Cycle 4Cycle 5Cycle 6Cycle 7Cycle 8Cycle 9Cycle 10Cycle 11 IfetchReg/DecExecMemWr IfetchReg/DecExecMemWr 16: R-type IfetchReg/DecExecMemWr IfetchReg/DecExecMemWr24: R-type 12: Beq (target is 1000) 20: R-type Clk IfetchReg/DecExecMemWr1000: Target of Br

EEL4713C Ann Gordon-Ross.24 Branch Stall Impact °If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! °2 part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier °MIPS branch tests = 0 or != 0 ° Solution Option 1: Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch vs. 3

EEL4713C Ann Gordon-Ross.25 Option 1: move HW forward to reduce branch delay 25 Adder IF/ID Memory Access Write Back Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc ALU Memory Reg File MUX Data Memory MUX Sign Extend Zero? MEM/WB EX/MEM 4 Adder Next SEQ PC RD WB Data Next PC Address RS1 RS2 Imm MUX ID/EX

EEL4713C Ann Gordon-Ross.26 Option 2: Define Branch as Delayed °Add instructions after branch that need to execute independent of the branch outcome Worst case, SW inserts NOP into branch delay °Where to get instructions to fill branch delay slot? Before branch instruction From the target address: only valuable when branch From fall through: only valuable when don’t branch °Compiler effectiveness for single branch delay slot: Profiling: about 50% of slots usefully filled

EEL4713C Ann Gordon-Ross.27 Example °Add r1,r2,r3 °Beq r2, r4, target °Next °Target: x Branch not depending on add, so swap

EEL4713C Ann Gordon-Ross.28 Branch prediction °Aggressive pipelined processors: Place branch resolution as early as possible in pipeline Beyond that, use branch prediction and speculation °Simple branch prediction: Assume branch not taken, fetch from fall-through If branch is taken, flush pipeline °More complex techniques are often used: Predict taken or not taken based on learning of past behavior of a branch -Keep counters indexed by PC on a “branch predictor table” Predict target address before it is calculated -Branch target table, also indexed by PC

EEL4713C Ann Gordon-Ross.29 Branch prediction °Speculative execution: Trust, but verify Assume branch prediction is correct, have mechanisms to detect otherwise and flush pipeline before any damage to architectural state is done (i.e. registers or memory get corrupted) °Example: use the PC to look up a branch predictor table and a branch target table If there is a matching entry for the PC, chances are it is a branch, and chances are the direction (taken/not taken) and target match the prediction Go ahead and set the next PC to be the predicted one Later on in the pipeline, once the branch is resolved (is it a branch? Condition satisfied? What is the target?), either let the instructions that follow it commit, or discard them

EEL4713C Ann Gordon-Ross.30 Summary – 5-stage pipeline revisited °Pipeline registers Data and control signals propagate every cycle °Hazard detection logic and forwarding for data hazards 1 cycle load delay slot, R-type has zero delay °Move branch resolution to ID stage to reduce delay to 1 cycle

EEL4713C Ann Gordon-Ross.31 5-stage pipeline revisited Hazard detection Forwarding muxes Register Comparison logic

EEL4713C Ann Gordon-Ross.32 5-stage pipeline revisited Hazard detection: Load? Rt(load) = Rs,Rt(next)? Yes: stall PC, IF/ID, insert bubble Forwarding muxes Register Comparison logic

EEL4713C Ann Gordon-Ross.33 5-stage pipeline revisited ID/EX.MemRead==1 and ((ID/EX.Rt==IF/ID.Rs) or (ID/EX.Rt==IF/ID.Rt)) Forwarding muxes Register Comparison logic Clear control Signals for EX, M, WB Disable writing PC, IF/ID register

EEL4713C Ann Gordon-Ross.34 5-stage pipeline revisited Hazard detection Forwarding muxes WB=1? Destination reg (rt for loads, rd otherwise) matches rs or rt of next inst? Matches second next?

EEL4713C Ann Gordon-Ross.35 5-stage pipeline revisited Hazard detection: Is it a branch? Taken? Yes: flush IF/ID register (force nop) Forwarding muxes Register Comparison logic

EEL4713C Ann Gordon-Ross.36 Examples of other hazards °“Read-after-write” (RAW) Load followed by ALU instruction using same register Register read must occur after load writes it °“Write-after-write” (WAW) div.d $f0,$f2,$f4 add.d $f0,$f6,$f8 add.d’s write must occue after div.d’s °“Write-after-read” (WAR) div.d $f0,$f2,$f4 add.d $f2,$f4,$f6 add.d’s write must occur after div.d’s read

EEL4713C Ann Gordon-Ross.37 When is pipelining hard? °Interrupts: 5 instructions executing in 5 stage pipeline How to stop the pipeline? Restart? Who caused the interrupt? StageProblem interrupts occurring IFPage fault on instruction fetch; misaligned memory access; memory-protection violation IDUndefined or illegal opcode EXArithmetic interrupt MEMPage fault on data fetch; misaligned memory access; memory-protection violation

EEL4713C Ann Gordon-Ross.38 When is pipelining hard? °Complex Addressing Modes and Instructions °Address modes: Autoincrement causes register change during instruction execution Now worry about write hazards since write no longer last stage -Write After Read (WAR): Write occurs before independent read -Write After Write (WAW): Writes occur in wrong order, leaving wrong result in registers -(Previous data hazard called RAW, for Read After Write) °Memory-memory Move instructions Multiple page faults

EEL4713C Ann Gordon-Ross.39 When is pipelining hard? °Floating Point: long execution time °Also, may pipeline FP execution unit so that can initiate new instructions without waiting for full latency FP InstructionLatency(MIPS R4000) Add, Subtract4 Multiply8 Divide36 Square root112 Negate2 Absolute value2 FP compare3 °Divide, Square Root take 10X to 30X longer than Add Exceptions? Adds WAR and WAW hazards since pipelines are no longer same length

EEL4713C Ann Gordon-Ross.40 First Generation RISC Pipelines (“Scalar”) ° All instructions follow same pipeline order (“static schedule”). ° Register write in last stage – Avoid WAW hazards ° All register reads performed in first stage after issue. – Avoid WAR hazards ° Memory access in stage 4 – Avoid all memory hazards ° Control hazards resolved by delayed branch (with fast path) ° RAW hazards resolved by bypass, except on load results which are resolved by delayed load. Substantial pipelining with very little cost or complexity. Machine organization is (slightly) exposed!

EEL4713C Ann Gordon-Ross.41 Examples °Alpha (92): up to two instructions per cycle One floating-point, one integer (in-order) 7 stages (int), 10 stages (FP) °MIPS R3000 (88) One (integer) instruction per cycle 5 stages (int) °Sparc Micro (91) 5 stages

EEL4713C Ann Gordon-Ross.42 Today’s RISC Pipelines (“Superscalar”) ° Instructions can be issued out of order in pipeline (“dynamic schedule”) – Must handle WAW, WAR hazards in addition to RAW – Tomasulo, Scoreboarding techniques °Multiple instructions issued in a single cycle Instructions are “queued up” for execution in a reorder buffer CPIeffective < 1! ° Control hazards resolved (speculatively) by predicting branches ° Single-cycle memory access in best case (cache hit) Tens-hundreds if need to go to main memory °Aggressive pipelining with rapidly increasing cost/complexity. °Diminishing returns as more resources are added

EEL4713C Ann Gordon-Ross.43 Examples °Alpha (98) up to 4 instructions per cycle 7 stages (int), 10 stages (FP) °MIPS R10000 (96) 4 instruction per cycle 5 stages (int), 10 stages (FP) °Sparc Ultra II (96) 9 stages (int, FP) 4 instructions issued per cycle

EEL4713C Ann Gordon-Ross.44 NetBurst °Successor to Pentium Pro 3 uops per cycle, out-of-order °Key differences Deeper pipeline for fast clocks: 20 stages Seven integer execution units vs. 5 Can overlap instructions from two programs in the pipeline -“Hyper-threading”; simultaneous multi-threading -To software, looks as if it has 2 processors

EEL4713C Ann Gordon-Ross.45 Review: Summary of Pipelining Basics °Speed Up proportional to pipeline depth; if ideal CPI is 1, then: °Hazards limit performance on computers: structural: need more HW resources data: need forwarding, compiler scheduling control: early evaluation & PC, delayed branch, prediction °Increasing length of pipe increases impact of hazards since pipelining helps instruction bandwidth, not latency °Compilers key to reducing cost of data and control hazards load delay slots branch delay slots °Exceptions, Instruction Set, FP makes pipelining harder