CS510 Computer Architectures, Lecture 7: Pipeline Hazards

Presentation transcript:


It's Not That Easy to Achieve the Promised Performance
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
– Structural hazards: the hardware cannot support this combination of instructions
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: pipelining of branches and other instructions that change the PC
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" (idle clock cycles) into the pipeline.

Structural Hazards: Memory
[Pipeline diagram: LOAD and Instr 1 to Instr 4 in instruction order, over clock cycles CC1 to CC9; each instruction passes through Mem, Reg, ALU, Mem, Reg.]
Two different instructions operate on memory in the same clock cycle.

Structural Hazards with Single-Port Memory
[Pipeline diagram: LOAD, Instr 1, Instr 2, and Instr 3 in instruction order, over clock cycles CC1 to CC9; Instr 3 is stalled.]
3 cycles of stall with 1-port memory.

Avoiding Structural Hazards with Dual-Port Memory
[Pipeline diagram: LOAD and Instr 1 to Instr 5 in instruction order, over clock cycles CC1 to CC9; each instruction passes through IM, Reg, ALU, DM, Reg, with instruction memory (IM) and data memory (DM) accessed separately.]
No stall with 2-port memory.

Data Hazard on Registers
  ADD R1,R2,R3
  SUB R4,R1,R3
  AND R6,R1,R7
  OR  R8,R1,R9
  XOR R10,R11,R1
[Pipeline diagram over clock cycles CC1 to CC9: ADD writes R1 in its final Reg stage; the following instructions read R1 in their Reg stages.]

Data Hazard on Registers
Registers can be made to read and write in the same cycle: data is stored in the first half of the clock cycle and can be read in the second half of the same clock cycle.
[Timing diagram: within one clock cycle, register Ri is first stored into and then read from.]

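Data Hazard on Registers
  ADD R1,R2,R3
  SUB R4,R1,R3
  AND R6,R1,R7
  OR  R8,R1,R9
  XOR R10,R11,R1
[Pipeline diagram over clock cycles CC1 to CC9, tracking R1 through the Reg stages.]
Needs to stall 2 cycles: with the split-cycle register file, SUB can read R1 no earlier than CC5, the cycle in which ADD writes it, so its register read is delayed from CC3 by 2 cycles.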

Three Generic Data Hazards
Instr i followed by Instr j
Read After Write (RAW): Instr j tries to read an operand before Instr i writes it.
  Instr i: LW  R1, 0(R2)
  Instr j: SUB R4, R1, R5

Three Generic Data Hazards
Instr i followed by Instr j
Write After Read (WAR): Instr j tries to write an operand before Instr i reads it.
  Instr i: ADD R1, R2, R3
  Instr j: LW  R2, 0(R5)
Can't happen in the DLX 5-stage pipeline because:
– All instructions take 5 stages,
– Reads are always in stage 2, and
– Writes are always in stage 5

Three Generic Data Hazards
Instr i followed by Instr j
Write After Write (WAW): Instr j tries to write an operand before Instr i writes it.
– Leaves the wrong result (that of Instr i, not Instr j)
  Instr i: LW R1, 0(R2)
  Instr j: LW R1, 0(R3)
Can't happen in the DLX 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
We will see WAR and WAW in later, more complicated pipelines.
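The three definitions above can be checked mechanically from the registers each instruction reads and writes. The following is a minimal sketch, not part of the lecture: the `Instr` record and the `hazards` function are hypothetical names, and the code only detects the register dependence. Whether a dependence actually becomes a hazard depends on the pipeline, as the slides note for WAR and WAW in DLX.

```python
from typing import NamedTuple, Optional, Set


class Instr(NamedTuple):
    """Hypothetical instruction record: only the registers matter here."""
    text: str
    dest: Optional[str]   # register written, or None (e.g. a store)
    srcs: frozenset       # registers read


def hazards(i: Instr, j: Instr) -> Set[str]:
    """Return the dependence kinds between instruction i and a later instruction j."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")   # j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")   # j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")   # both write the same register
    return found


# The instruction pairs used on the three slides above.
raw = hazards(Instr("LW R1, 0(R2)", "R1", frozenset({"R2"})),
              Instr("SUB R4, R1, R5", "R4", frozenset({"R1", "R5"})))
war = hazards(Instr("ADD R1, R2, R3", "R1", frozenset({"R2", "R3"})),
              Instr("LW R2, 0(R5)", "R2", frozenset({"R5"})))
waw = hazards(Instr("LW R1, 0(R2)", "R1", frozenset({"R2"})),
              Instr("LW R1, 0(R3)", "R1", frozenset({"R3"})))

print(raw, war, waw)
```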

Forwarding to Avoid Data Hazards
  ADD R1,R2,R3
  SUB R4,R1,R3
  AND R6,R1,R7
  OR  R8,R1,R9
  XOR R10,R11,R1
[Pipeline diagram over clock cycles CC1 to CC9; each instruction passes through Mem, Reg, ALU, Mem, Reg.]

HW Change for Forwarding
[Datapath diagram: MUX, Zero? test, ALU, Data Memory, and the D/A, A/M, and M/W buffers.]

Load Delay Due to Data Hazard
  LOAD R1,0(R2)
  SUB  R4,R1,R6
  AND  R6,R1,R7
  OR   R8,R1,R9
[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg.]
Load delay = 2 cycles

Load Delay with Forwarding
  LOAD R1,0(R2)
  SUB  R4,R1,R6
  AND  R6,R1,R7
  OR   R8,R1,R9
[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg.]
We need to add hardware, called a pipeline interlock.
Load delay with forwarding = 1 cycle

Software Scheduling to Avoid Load Hazards
Try to produce fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code (with forwarding):
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc    (RAW stall)
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf    (RAW stall)
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
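The difference between the two schedules can be counted with a small model. This is a rough sketch under a stated assumption: with forwarding, the only stall is the 1-cycle load delay, charged when an instruction reads a register loaded by the instruction immediately before it. The tuple encoding and the `load_use_stalls` name are illustrative, not from the lecture.

```python
# Each instruction is (opcode, destination register or None, source registers).
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]


def load_use_stalls(program):
    """Count load-use stalls, assuming a 1-cycle load delay with forwarding."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        # A stall occurs only when a load's result is used by the very next instruction.
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


slow_stalls = load_use_stalls(SLOW)
fast_stalls = load_use_stalls(FAST)
print("slow code stalls:", slow_stalls)
print("fast code stalls:", fast_stalls)
```

In this model the slow code stalls at ADD and at SUB, matching the two RAW stalls marked on the slide, while the reordered code separates every load from its first use.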

Compiler Avoiding Load Stalls
% of loads stalling the pipeline:
  Benchmark   Scheduled   Unscheduled
  tex         25%         65%
  spice       14%         42%
  gcc         31%         54%

Pipelined DLX Datapath
[Datapath diagram: IF, ID, EX, Mem, and WB stages separated by the F/D, D/A, A/M, and M/W buffers; components are PC, Instr. Memory, Add, MUX, Reg File, Sign Ext, Zero?, ALU, Data Memory, SMD, and LMD.]
Annotations: Branch Address Calculation; Decide Condition; Branch Decision for target address.

Control Hazard on Branches: Three Stall Cycles
[Pipeline diagram over clock cycles CC1 to CC9; each instruction passes through IM, Reg, ALU, DM, Reg. Program execution order, in instructions:]
  40 BEQ R1,R3,
     AND R12,R2,R5
  48 OR  R13,R6,R2
  52 ADD R14,R2,R2
  80 LD  R4,R7,100
The instructions fetched after the branch shouldn't be executed when the branch condition is true.
Branch delay = 3 cycles, after which the branch target is available.

Control Hazard on Branches: Three Stall Cycles
  Branch instruction      IF ID EX MEM WB
  Branch successor        IF ID EX MEM
  Branch successor + 1    IF ID EX
  Branch successor + 2    IF ID
– We don't know yet that the instruction being executed is a branch. Fetch the branch successor.
– Now we know the instruction being executed is a branch, but we stall until the branch target address is known.
– Now the target address is available.
3 wasted clock cycles for the TAKEN branch.

Branch Stall Impact
If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then new CPI = 1 + 0.3 x 3 = 1.9
– About half of the ideal speed
Two-part solution:
– Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
– Compute the TAKEN branch address (branch target) earlier
DLX branch tests whether a register = 0 or ≠ 0
DLX solution: get the new PC earlier
– Move the Zero test to the ID stage
– Additional ADDER to calculate the new PC (taken PC) in the ID stage
– 1 clock cycle penalty for a branch, in contrast to 3 cycles
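The CPI figure on this slide follows from adding the average branch stall per instruction to the base CPI. A minimal sketch of that arithmetic, with `effective_cpi` as an illustrative name and the slide's numbers (base CPI of 1, 30% branches) as inputs:

```python
def effective_cpi(base_cpi: float, branch_freq: float, branch_penalty: float) -> float:
    """Base CPI plus the average number of branch stall cycles per instruction."""
    return base_cpi + branch_freq * branch_penalty


# Original pipeline: every branch costs 3 stall cycles.
cpi_stall_3 = effective_cpi(1.0, 0.30, 3)
# Revised DLX pipeline: zero test and target adder moved to ID, 1-cycle penalty.
cpi_stall_1 = effective_cpi(1.0, 0.30, 1)

print(f"3-cycle penalty: CPI = {cpi_stall_3:.2f}")
print(f"1-cycle penalty: CPI = {cpi_stall_1:.2f}")
```

Under the same assumptions, cutting the penalty from 3 cycles to 1 lowers the CPI from 1.9 to 1.3.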

Pipelined DLX Datapath
[Datapath diagram annotations:]
– To get the target address earlier
– To get the condition earlier
– Target address available after ID
– When a branch instruction is in the Execute stage, the next address is available here

Branch Behavior in Programs
Conditional branch frequencies
– integer average to 16 %
– floating point to 12 %
Forward and backward taken branches
– forward taken %
– backward taken %
– the average of all conditional branches %

Branch Hazard Alternatives
– Stall until branch direction is clear
– Predict branch NOT TAKEN
– Predict branch TAKEN
– Delayed branch

Branch Hazard Alternatives: (1) STALL
Stall until branch direction is clear
Original pipeline, 3 cycle penalty:
  Branch instruction      IF    ID    EX    MEM   WB
  Branch successor              stall stall stall IF    ID    EX    MEM
  Branch successor + 1                                  IF    ID    EX
  Branch successor + 2                                        IF    ID
Revised DLX pipeline (get the branch address at EX), 1 cycle penalty (branch delay slot):
  Branch instruction      IF    ID    EX    MEM   WB
  Branch successor              stall IF    ID    EX    MEM   WB
  Branch successor + 1                      IF    ID    EX    MEM
  Branch successor + 2                            IF    ID

Branch Hazard Alternatives: (2) Predict Branch "NOT TAKEN"
– Execute successor instructions in sequence
– PC+4 is already calculated, so use it to get the next instruction
– Flush instructions in the pipeline if the branch is actually TAKEN
– Advantage of late pipeline state update
– 47% of DLX branches are NOT TAKEN on average
NOT TAKEN branch, no penalty:
  NOT TAKEN branch instruction i   IF    ID    EX    MEM   WB
  instruction i+1                        IF    ID    EX    MEM   WB
  instruction i+2                              IF    ID    EX    MEM   WB
TAKEN branch, 1 cycle penalty:
  TAKEN branch instruction i       IF    ID    EX    MEM   WB
  instruction i+1                        IF    ID    EX    MEM   WB    (flush this instruction in progress)
  instruction T                                IF    ID    EX    MEM   WB

Branch Hazard Alternatives: (3) Predict Branch "TAKEN"
– 53% of DLX branches are TAKEN on average
– The branch target address is available after ID in DLX
– DLX still incurs a 1 cycle branch penalty for a TAKEN branch
– Other machines: branch target known before outcome
NOT TAKEN branch, 2 cycle penalty in DLX (1 in other machines):
  NOT TAKEN instruction i      IF ID EX MEM WB
  Instruction T                stall IF
  Instruction i+1              IF ID EX MEM WB
TAKEN branch, 1 cycle penalty in DLX (0 in other machines):
  TAKEN branch instruction i   IF ID EX MEM WB
  Instruction T                stall IF ID EX MEM WB
  Instruction T+1              IF ID EX MEM WB
Diagram annotations: the TAKEN address is not available during the stall; the TAKEN address is available when Instruction T is fetched.

Branch Hazard Alternatives: (4) Delayed Branch
Delayed branch
– Delay the branch so that it takes place AFTER a successor instruction
Delayed branch of length n:
  branch instruction
  sequential successor 1
  sequential successor 2
  ...
  sequential successor n
  branch target if taken
– A 1-slot delayed branch allows the proper decision and branch target address in the 5-stage DLX pipeline with the control hazard improvement

Delayed Branch
Where to get instructions to fill the branch delay slot?
– From before the branch instruction
– From the target address: only valuable when the branch is TAKEN
– From fall through: only valuable when the branch is NOT TAKEN
– Canceling branches allow more slots to be filled
Compiler effectiveness for a single delayed branch slot:
– Fills about 60% of delayed branch slots
– About 80% of instructions executed in delayed branch slots are useful in computation
– About 50% (60% x 80%) of slots usefully filled

Branch Hazard Alternatives: Delayed Branch
From before
  Before scheduling:
    ADD R1, R2, R3
    if R2=0 then
      [delay slot]
  After scheduling:
    if R2=0 then
      ADD R1, R2, R3    (in the delay slot)
– Always improves performance
– The branch must not depend on the rescheduled instructions
From target
  Before scheduling:
    SUB R4, R5, R6      (branch target)
    ADD R1, R2, R3
    if R1=0 then
      [delay slot]
  After scheduling:
    ADD R1, R2, R3
    if R1=0 then
      SUB R4, R5, R6    (in the delay slot)
– Improves performance when TAKEN (loop)
– It must be all right to execute the rescheduled instruction if the branch is NOT TAKEN
– May need to duplicate the instruction if it is the target of another branch instruction
From fall through
  Before scheduling:
    ADD R1, R2, R3
    if R1=0 then
      [delay slot]
    SUB R4, R5, R6
  After scheduling:
    ADD R1, R2, R3
    if R1=0 then
      SUB R4, R5, R6    (in the delay slot)
– Improves performance when NOT TAKEN
– It must be all right to execute the instruction if the branch is TAKEN

Limitations on Delayed Branch
Difficulty in finding useful instructions to fill the delayed branch slots
Solution: squashing
– The delayed branch is associated with a branch prediction
– Instructions in the predicted path are executed in the delayed branch slot
– If the branch outcome is mispredicted, the instructions in the delayed branch slot are squashed (discarded)

Canceling Branch
Used when delayed branch scheduling (i.e., filling the delay slot) cannot be done due to
– Restrictions on scheduling instructions in the delay slots
– Limitations on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN
The instruction includes the direction in which the branch was predicted
– When the branch behaves as predicted, the instructions in the delay slot are executed
– When the branch is incorrectly predicted, the instructions in the delay slot are turned into no-ops
A canceling branch allows the delay slot to be filled even if the instruction placed there does not meet the requirements.

Evaluating Branch Alternatives
Pipeline speedup = Pipeline depth / CPI = Pipeline depth / (1 + Branch frequency x Branch penalty)
Conditional and unconditional branches collectively have 14% frequency; 65% of branches are TAKEN.
  Scheduling scheme    Branch penalty   CPI                       Speedup vs. unpipelined   Speedup vs. stall
  Stall pipeline       3                1 + 0.14 x 3    = 1.42    5/1.42
  Predict taken        1                1 + 0.14 x 1    = 1.14    5/1.14
  Predict not taken    1                1 + 0.14 x 0.65 = 1.09    5/1.09
  Delayed branch       0.5              1 + 0.14 x 0.5  = 1.07    5/1.07
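The table entries can be recomputed from the speedup formula on this slide. The sketch below assumes a pipeline depth of 5, the slide's 14% branch frequency, and the effective penalties 3, 1, 0.65 (1 cycle x 65% taken), and 0.5. The "speedup vs. stall" column is computed as the ratio of the stall scheme's CPI to each scheme's CPI, which is an assumption about how that column is defined; the function and dictionary names are illustrative.

```python
PIPELINE_DEPTH = 5
BRANCH_FREQUENCY = 0.14

# Effective branch penalty in cycles for each scheme, as used on the slide.
PENALTIES = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 0.65,   # 1-cycle penalty, paid only by the 65% taken branches
    "Delayed branch": 0.5,
}


def branch_table(depth, freq, penalties, baseline="Stall pipeline"):
    """Compute CPI and speedups for each branch scheme."""
    rows = {}
    for scheme, penalty in penalties.items():
        cpi = 1.0 + freq * penalty
        rows[scheme] = {"cpi": cpi, "speedup_vs_unpipelined": depth / cpi}
    baseline_cpi = rows[baseline]["cpi"]
    for row in rows.values():
        # Assumption: same clock rate, so relative speedup is the CPI ratio.
        row["speedup_vs_stall"] = baseline_cpi / row["cpi"]
    return rows


table = branch_table(PIPELINE_DEPTH, BRANCH_FREQUENCY, PENALTIES)
for scheme, row in table.items():
    print(f"{scheme:18s} CPI={row['cpi']:.2f} "
          f"vs unpipelined={row['speedup_vs_unpipelined']:.2f} "
          f"vs stall={row['speedup_vs_stall']:.2f}")
```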

Static (Compiler) Prediction of Taken/Untaken Branches: Code Motion
      LW   R1, 0(R2)
      SUB  R1, R1, R3     (depends on LW, needs to stall)
      BEQZ R1, L
      OR   R4, R5, R6
      ADD  R10, R4, R3
  L:  ADD  R7, R8, R9
NOT TAKEN: if the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), moving OR R4, R5, R6 up can increase speed.
TAKEN: if the branch is almost always TAKEN, R7 is not needed, and R8 and R9 are not modified on the fall-through path, moving ADD R7, R8, R9 up can increase speed.

Static (Compiler) Prediction of Taken/Untaken Branches
Improves the strategy for placing instructions in the delay slot
Two strategies
– Direction-based prediction: predict TAKEN for backward branches, NOT TAKEN for forward branches
– Profile-based prediction: record branch behavior and predict each branch based on the prior run(s)
[Bar chart "Always taken": frequency of misprediction (axis 0% to 70%) for alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, tomcatv.]
[Bar chart "Taken backwards, Not Taken forwards": misprediction rate (axis 0% to 14%) for the same benchmarks.]
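Direction-based prediction needs only the branch address and its target. The sketch below illustrates the rule and how a misprediction rate would be measured; the function names are illustrative and `toy_trace` is a made-up trace for demonstration, not data from the benchmarks in the charts.

```python
def predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Direction-based rule: backward branches TAKEN, forward branches NOT TAKEN."""
    return target_pc < branch_pc


def misprediction_rate(trace) -> float:
    """Fraction of executed branches whose outcome differs from the static prediction.

    Each trace entry is (branch_pc, target_pc, actually_taken).
    """
    wrong = sum(1 for pc, target, taken in trace
                if predict_taken(pc, target) != taken)
    return wrong / len(trace)


# Hypothetical trace: a loop-closing backward branch that is taken three times
# and then falls through, plus a forward branch that is rarely taken.
toy_trace = (
    [(0x40, 0x20, True)] * 3 + [(0x40, 0x20, False)]
    + [(0x60, 0x80, False)] * 3 + [(0x60, 0x80, True)]
)

rate = misprediction_rate(toy_trace)
print(f"misprediction rate on the toy trace: {rate:.0%}")
```

A profile-based predictor would replace `predict_taken` with a per-branch lookup of the majority outcome recorded in a prior run.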

Evaluating Static Branch Prediction Strategies
– The misprediction rate ignores the frequency of branches
– The number of instructions between mispredicted branches is a better metric
[Bar chart: instructions per mispredicted branch, profile-based vs. direction-based, for alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, tomcatv.]
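The reason the second metric is preferred can be seen with two invented programs. In this sketch all counts are hypothetical and chosen only to illustrate the point: program B has the worse misprediction rate, yet it runs longer between mispredictions because it executes far fewer branches.

```python
def misprediction_rate(branches: int, mispredicted: int) -> float:
    """Mispredicted branches as a fraction of executed branches."""
    return mispredicted / branches


def instructions_per_misprediction(instructions: int, mispredicted: int) -> float:
    """Average number of instructions executed between mispredicted branches."""
    return instructions / mispredicted


# Hypothetical program A: branch-heavy, well predicted.
rate_a = misprediction_rate(branches=200, mispredicted=20)
dist_a = instructions_per_misprediction(instructions=1000, mispredicted=20)

# Hypothetical program B: few branches, poorly predicted.
rate_b = misprediction_rate(branches=50, mispredicted=10)
dist_b = instructions_per_misprediction(instructions=1000, mispredicted=10)

print(f"A: rate={rate_a:.0%}, instructions per misprediction={dist_a:.0f}")
print(f"B: rate={rate_b:.0%}, instructions per misprediction={dist_b:.0f}")
```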

End of Hazards and their Resolution. Thank you.