1 Lecture 3 Pipeline Contd. (Appendix A) Instructor: L.N. Bhuyan CS 203A Advanced Computer Architecture Some slides are adapted from Roth.

2 Pipeline Hazards
Hazards are caused by conflicts between instructions and lead to incorrect behavior if not fixed. Three types:
– Structural: two instructions use the same h/w in the same cycle – resource conflicts (e.g. one memory port, an unpipelined divider).
– Data: two instructions use the same data storage (register/memory) – dependent instructions.
– Control: one instruction affects which instruction is next – a PC-modifying instruction changes the control flow of the program.

3 Data Hazards
Two different instructions use the same storage location; it must appear as if they executed in sequential order.

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

– read-after-write (RAW): sub reads R1 after add writes it – true dependence (real)
– write-after-read (WAR): sub writes R2 after add reads it – antidependence (artificial)
– write-after-write (WAW): or writes R1 after add also writes it – output dependence (artificial)

What about a read-after-read (RAR) dependence?
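The three dependence kinds can be checked mechanically. A minimal sketch (the tuple encoding of instructions is my own, illustrative choice): each instruction is a (destination, sources) pair, and we compare a pair of instructions in program order.

```python
# Classify the data dependences between two instructions given in
# program order, where each instruction is (dest, [sources]).
def classify(first, second):
    d1, srcs1 = first
    d2, srcs2 = second
    kinds = []
    if d1 in srcs2:
        kinds.append("RAW")   # true dependence: second reads what first wrote
    if d2 in srcs1:
        kinds.append("WAR")   # antidependence: second overwrites a source of first
    if d1 == d2:
        kinds.append("WAW")   # output dependence: both write the same register
    return kinds

add = ("R1", ["R2", "R3"])   # add R1, R2, R3
sub = ("R2", ["R4", "R1"])   # sub R2, R4, R1
orr = ("R1", ["R6", "R3"])   # or  R1, R6, R3

print(classify(add, sub))    # ['RAW', 'WAR']  (RAW on R1, WAR on R2)
print(classify(add, orr))    # ['WAW']         (both write R1)
```

Note that a read-after-read pair produces no entry: two reads of the same register never conflict, so RAR is not a hazard.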

4 Reducing RAW Hazards: Bypassing
Data is available at the end of the EX stage, so why wait until the WB stage? Bypass (forward) the data directly to the input of EX.
+ Reduces/avoids stalls in a big way: a large fraction of input operands are bypassed.
– Complex.
Important: bypassing does not relieve you from having to perform WB.

add R1, R2, R3   f d x m w
sub R2, R4, R1     f d x m w

Can bypass from MEM also.
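The effect of bypassing on RAW stalls can be summarized in a tiny model. This is a simplified sketch of a classic 5-stage pipeline (my own framing, not from the slides): without bypassing a consumer waits for the producer's WB (assuming register write and read can share a cycle), with bypassing an ALU result is forwarded from EX, and only the load-use case keeps one stall because load data appears at the end of MEM.

```python
# Stall cycles between a producer and an immediately following dependent
# consumer in a 5-stage pipeline, with and without bypassing.
def raw_stalls(producer_is_load, bypassing):
    if bypassing:
        # ALU result forwarded from end of EX: no stall.
        # Load data is only ready at end of MEM: one stall remains.
        return 1 if producer_is_load else 0
    # No bypassing: wait until WB writes the register
    # (write in first half of the cycle, read in second half).
    return 2

print(raw_stalls(producer_is_load=False, bypassing=True))   # add -> sub: 0
print(raw_stalls(producer_is_load=True,  bypassing=True))   # lw  -> sub: 1
print(raw_stalls(producer_is_load=False, bypassing=False))  # 2
```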

5 Minimizing Data Hazard Stalls by Forwarding

6 But …
Even with bypassing, not all RAW stalls can be avoided:
– A load whose result feeds an ALU op immediately after it must still stall.
– That stall can often be eliminated with compiler scheduling.

lw  R1, 16(R3)   f d x m w
sub R2, R4, R1   f - d x m w

You can also stall before the EX stage, but it is better to separate the stall logic from the bypassing logic.

7 Compiler Scheduling
The compiler moves instructions around to reduce stalls. E.g. for the code sequence a = b+c, d = e-f:

Before scheduling:            After scheduling:
lw  Rb, b                     lw  Rb, b
lw  Rc, c                     lw  Rc, c
add Ra, Rb, Rc   ; stall      lw  Re, e
sw  Ra, a                     add Ra, Rb, Rc   ; no stall
lw  Re, e                     lw  Rf, f
lw  Rf, f                     sw  Ra, a
sub Rd, Re, Rf   ; stall      sub Rd, Re, Rf   ; no stall
sw  Rd, d                     sw  Rd, d

8 WAR: Why Do They Exist? (Antidependence)
Recall WAR:
add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3
Problem: swapping the two instructions would introduce a (false) RAW hazard on R2.
Artificial: it can be removed if sub uses a different destination register.
Can't happen in an in-order pipeline, since reads happen in ID but writes happen in WB.
Can happen with out-of-order reads, e.g. out-of-order execution.

9 WAW (Output Dependence)
add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3
Problem: reordering would leave the wrong value in R1 for the sub.
Artificial: using a different destination register would solve it.
Can't happen in an in-order pipeline in which every instruction takes the same number of cycles, since writes occur in order.
Can happen in the presence of multi-cycle operations, i.e. out-of-order writes.

10 Example
I1. Load R1, A    /R1 ← Memory(A)/
I2. Add  R2, R1   /R2 ← (R2)+(R1)/
I3. Add  R3, R4   /R3 ← (R3)+(R4)/
I4. Mul  R4, R5   /R4 ← (R4)*(R5)/
I5. Comp R6       /R6 ← Not(R6)/
I6. Mul  R6, R7   /R6 ← (R6)*(R7)/

Dependence graph (in program order):
– I1 → I2: RAW (flow dependence) on R1.
– I3 → I4: WAR (antidependence) on R4.
– I5 → I6: WAW (output dependence) and also RAW (flow dependence) on R6.

Due to superscalar processing, it is possible that I4 completes before I3 starts. Similarly, the value of R6 depends on the beginning and end of I5 and I6. Unpredictable result!

In this program, I2 and I3 share the adder and I4 and I6 share the same multiplier. These two resource dependences can be removed by duplicating the resources, or by using pipelined adders and multipliers.

11 Register Renaming
Rewrite the previous program as:
I1. R1b ← Memory(A)
I2. R2b ← (R2a) + (R1b)
I3. R3b ← (R3a) + (R4a)
I4. R4b ← (R4a) * (R5a)
I5. R6b ← -(R6a)
I6. R6c ← (R6b) * (R7a)
Allocate more registers and rename the registers that do not really have a flow dependence. The WAR hazard between I3 and I4 and the WAW hazard between I5 and I6 have been removed. These two hazards are also called name dependences.
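The renaming rule is purely mechanical: every read uses the newest version of its source register, and every write creates a fresh version. A minimal sketch over the six-instruction example (the tuple encoding and suffix scheme are illustrative, chosen to match the slide's R1b/R6c naming):

```python
# Rename architectural registers to versioned names so that WAR and WAW
# hazards disappear. Instructions are (dest, [sources]); "a" is the
# version live on entry, and each write bumps the suffix (b, c, ...).
def rename(program):
    version = {}                  # register -> suffix of the latest write
    out = []
    for dest, srcs in program:
        # reads use the most recent version of each source
        new_srcs = [s + version.get(s, "a") for s in srcs]
        # each write gets a fresh version letter
        version[dest] = chr(ord(version.get(dest, "a")) + 1)
        out.append((dest + version[dest], new_srcs))
    return out

prog = [("R1", []),             # I1. Load R1, A
        ("R2", ["R2", "R1"]),   # I2. Add  R2, R1
        ("R3", ["R3", "R4"]),   # I3. Add  R3, R4
        ("R4", ["R4", "R5"]),   # I4. Mul  R4, R5
        ("R6", ["R6"]),         # I5. Comp R6
        ("R6", ["R6", "R7"])]   # I6. Mul  R6, R7

for dst, srcs in rename(prog):
    print(dst, srcs)
```

The output reproduces the slide: I4 becomes R4b ← (R4a)*(R5a), so it no longer conflicts with I3's read of R4a, and I6 becomes R6c ← (R6b)*(R7a), removing the WAW on R6 while keeping the true dependence on I5's result.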

12 Control Hazards
Branch problem: branches are resolved in the EX stage ⇒ 2-cycle penalty on taken branches.
Ideal CPI = 1. Assuming 2 cycles for all branches and 32% branch instructions ⇒ new CPI = 1 + 0.32 × 2 = 1.64.
Solutions:
– Reduce the branch penalty: change the datapath – a new adder is needed in the ID stage.
– Fill the branch delay slot(s) with a useful instruction.
– Fixed branch prediction.
– Static branch prediction.
– Dynamic branch prediction.
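The CPI figure follows directly from the slide's numbers (ideal CPI of 1, 32% branches, 2-cycle penalty per branch):

```python
# CPI = ideal CPI + (branch fraction) x (cycles lost per branch)
ideal_cpi = 1.0
branch_frac = 0.32    # 32% of instructions are branches
penalty = 2           # branches resolve in EX: 2 lost cycles

cpi = ideal_cpi + branch_frac * penalty
print(cpi)            # 1.64
```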

13 Control Hazards – Branch Delay Slots
Reduced branch penalty:
– Compute the condition and target address in the ID stage: 1-cycle stall.
– The target and condition are computed even when the instruction is not a branch.
Branch delay slot filling: move an instruction into the slot right after the branch, hoping that its execution is necessary. Three alternatives (next slide).
Limitations: restrictions on which instructions can be rescheduled, and compile-time prediction of taken vs. untaken branches.

14 Example: Nondelayed vs. Delayed Branch

Nondelayed branch:
      or  M8, M9, M10
      add M1, M2, M3
      sub M4, M5, M6
      beq M1, M4, Exit
      xor M10, M1, M11
Exit:

Delayed branch (the or, independent of the branch, fills the delay slot):
      add M1, M2, M3
      sub M4, M5, M6
      beq M1, M4, Exit
      or  M8, M9, M10   ; delay slot
      xor M10, M1, M11
Exit:

15 Control Hazards: Branch Prediction
Idea: doing something is better than waiting around doing nothing.
– Guess the branch target and start executing at the guessed position.
– Execute the branch and verify (check) your guess.
+ Minimizes the penalty when the guess is right (to zero).
– May increase the penalty for wrong guesses.
Heavily researched area in the last 15 years.
Fixed branch prediction: each of these strategies must be applied to all branch instructions indiscriminately.
– Predict not-taken (47% of branches are actually not taken): continue to fetch instructions without stalling; do not change any state (no register writes); if the branch is taken, turn the fetched instruction into a no-op and restart the fetch at the target address: 1-cycle penalty.

16 Control Hazards: Branch Prediction
– Predict taken (53% of branches are taken): more difficult, since the target must be known before the branch is decoded; no advantage in our simple 5-stage pipeline.
Static branch prediction:
– Opcode-based: prediction based on the opcode itself and the related condition. Examples: MC 88110, PowerPC 601/603.
– Displacement-based prediction: if the displacement d < 0 predict taken, otherwise predict not taken. Examples: Alpha (as an option), PowerPC 601/603 for regular conditional branches.
– Compiler-directed prediction: the compiler sets or clears a predict bit in the instruction itself. Examples: AT&T 9210 Hobbit, PowerPC 601/603 (the predict bit reverses the opcode or displacement prediction), HP PA 8000 (as an option).
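The fixed strategies can be compared with the 47%/53% mix from these two slides. A small sketch (the 1-cycle penalty assumes the branch resolves in ID, as on the earlier "reduced branch penalty" slide):

```python
# Expected stall cycles per branch under fixed predict-not-taken:
# correctly predicted not-taken branches cost 0, taken branches cost
# the 1-cycle restart penalty.
p_taken = 0.53            # 53% of branches are taken (slide numbers)
mispredict_penalty = 1    # 1 cycle to squash and refetch at the target

expected_penalty = p_taken * mispredict_penalty
print(expected_penalty)   # 0.53 cycles per branch on average
```

With a 32% branch frequency this adds about 0.32 × 0.53 ≈ 0.17 to the CPI, much better than the 0.64 added by always stalling 2 cycles.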

17 Control Hazards: Branch Prediction Dynamic branch prediction –Later

18 MIPS R4000 pipeline

19 MIPS FP Pipe Stages

FP instruction     Stages
Add, subtract      U, S+A, A+R, R+S
Multiply           U, E+M, M, M, M, N, N+A, R
Divide             U, A, R, D (×28), D+A, D+R, D+R, D+A, D+R, A, R
Square root        U, E, (A+R) (×108), A, R
Negate             U, S
Absolute value     U, S
FP compare         U, A, R

Stages:
U – Unpack FP numbers
S – Operand shift stage
A – Mantissa ADD stage
R – Rounding stage
M – First stage of multiplier
N – Second stage of multiplier
D – Divide pipeline stage
E – Exception test stage

20 R4000 Performance
CPI is not the ideal 1 because of:
– Load stalls (1 or 2 clock cycles).
– Branch stalls (2 cycles plus unfilled delay slots).
– FP result stalls: RAW data hazards (latency).
– FP structural stalls: not enough FP hardware (parallelism).
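Each of those causes adds stall cycles per instruction on top of the ideal CPI of 1. A sketch of the accounting; the per-cause numbers below are hypothetical placeholders to show the arithmetic, not measured R4000 data:

```python
# CPI = ideal CPI + sum of stall cycles per instruction by cause.
# NOTE: these stall contributions are invented for illustration only.
stalls_per_instr = {
    "load":          0.10,   # load stalls
    "branch":        0.15,   # branch stalls
    "fp_result":     0.20,   # FP RAW latency stalls
    "fp_structural": 0.05,   # not enough FP hardware
}

cpi = 1.0 + sum(stalls_per_instr.values())
print(cpi)   # 1.5 with these placeholder numbers
```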

21 FP Loop: Where Are the Hazards? (Oct. 26)

Loop:  LD    F0, 0(R1)   ; F0 = vector element
       ADDD  F4, F0, F2  ; add scalar from F2
       SD    0(R1), F4   ; store result
       SUBI  R1, R1, 8   ; decrement pointer by 8 bytes (DW)
       BNEZ  R1, Loop    ; branch if R1 != zero
       NOP               ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

Where are the stalls?

22 FP Loop Showing Stalls

1  Loop:  LD    F0, 0(R1)   ; F0 = vector element
2         stall
3         ADDD  F4, F0, F2  ; add scalar in F2
4         stall
5         stall
6         SD    0(R1), F4   ; store result
7         SUBI  R1, R1, 8   ; decrement pointer by 8 bytes (DW)
8         BNEZ  R1, Loop    ; branch if R1 != zero
9         stall             ; delayed branch slot

Latencies: FP ALU op → another FP ALU op: 3; FP ALU op → store double: 2; load double → FP ALU op: 1.

9 clocks: can we rewrite the code to minimize stalls?

23 Minimizing Stalls, Technique 1: Compiler Optimization

1  Loop:  LD    F0, 0(R1)
2         stall
3         ADDD  F4, F0, F2
4         SUBI  R1, R1, 8
5         BNEZ  R1, Loop    ; delayed branch
6         SD    8(R1), F4   ; address altered when moved past SUBI

6 clocks. Swap BNEZ and SD by changing the address of SD.
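The 9-clock and 6-clock counts can be replayed with a small in-order issue model driven by the slide's latency table. This is a simplified sketch (the instruction-class encoding is mine; store-address and branch-register timing are approximated as zero-latency integer dependences):

```python
# Stall cycles required between producer class and consumer class,
# from the slide's latency table.
LAT = {("load", "fp_alu"): 1,
       ("fp_alu", "fp_alu"): 3,
       ("fp_alu", "fp_store"): 2}

def loop_cycles(body):
    """Cycles for one loop body: one issue per cycle, in order,
    delayed by any latency owed to a producing instruction."""
    produced = {}                       # register -> (issue cycle, producer class)
    cycle = 0
    for cls, dest, srcs in body:
        cycle += 1                      # next issue slot...
        for s in srcs:                  # ...pushed back by required latencies
            if s in produced:
                when, pcls = produced[s]
                cycle = max(cycle, when + 1 + LAT.get((pcls, cls), 0))
        if dest:
            produced[dest] = (cycle, cls)
    return cycle

original = [("load", "F0", []),             # LD   F0, 0(R1)
            ("fp_alu", "F4", ["F0", "F2"]), # ADDD F4, F0, F2
            ("fp_store", None, ["F4"]),     # SD   0(R1), F4
            ("int", "R1", ["R1"]),          # SUBI R1, R1, 8
            ("branch", None, ["R1"]),       # BNEZ R1, Loop
            ("nop", None, [])]              # delayed branch slot

scheduled = [("load", "F0", []),            # LD   F0, 0(R1)
             ("fp_alu", "F4", ["F0", "F2"]),# ADDD F4, F0, F2
             ("int", "R1", ["R1"]),         # SUBI R1, R1, 8
             ("branch", None, ["R1"]),      # BNEZ R1, Loop
             ("fp_store", None, ["F4"])]    # SD 8(R1), F4 fills the slot

print(loop_cycles(original))    # 9
print(loop_cycles(scheduled))   # 6
```

Moving SUBI and BNEZ up hides the ADDD→SD latency, and SD in the delay slot absorbs the branch bubble, which is exactly where the three saved clocks come from.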