Pipelining 6.1, 6.2

Performance Measurements
Cycle Time: Time __________________
Latency: Time to finish a _____________, start to finish
Throughput: Average ______________ per unit time
CyclesPerJob: On average, number of _______________________________.

Goals Faster clock rate Use machine more efficiently No longer execute only one instruction at a time

Laundry
Laundry-o-matic washes, dries & folds
Wash: 30 min
Dry: 40 min
Fold: 20 min
It switches them internally with no delay
How long to complete 1 load? ______

Laundry-o-Matic - Single Cycle
[Timing chart: loads vs. minutes; each load finishes Wash, Dry, and Fold before the next load starts]

Laundry-o-Matic Cycle Time: Clothing is switched every ____ minutes Latency: A single load takes a total of ______ minutes Throughput: A load completes each ______ minutes CyclesPerLoad: Every ____ cycle(s), a load completes

Pipelined Laundry Split the laundry-o-matic into a washer, dryer, and folder (what a concept) Moving the laundry from one to another takes 6 minutes

Pipelined Laundry
Switch all loads at the same time
[Timing chart: loads vs. minutes; the Wash, Dry, and Fold stages of different loads overlap]

Pipelined Laundry Cycle Time: Clothing is switched every ____ minutes Latency: A single load takes a total of ______ minutes Throughput: A load completes each ______ minutes CyclesPerLoad: Every ____ cycle(s), a load completes

Single-Cycle vs Pipelined
_________ has the higher cycle time
_________ has the higher clock rate
_________ has the higher single-load latency
_________ has the higher throughput
_________ has the higher CPL (Cycles per Load)
More stages makes a _______ clock rate
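One way to check these comparisons is to compute both machines' numbers directly. The following is a minimal Python sketch (not part of the original slides), assuming the times given earlier: wash 30, dry 40, fold 20 minutes, and 6 minutes to move a load between units in the pipelined version.

# Sketch: single-cycle vs. pipelined laundry-o-matic.
# Assumed numbers from the slides: wash 30, dry 40, fold 20 minutes,
# plus 6 minutes to move a load between units when pipelined.
stages = {"wash": 30, "dry": 40, "fold": 20}
move_delay = 6
n_loads = 4

# Single-cycle: one load occupies the whole machine from start to finish.
single_cycle_time = sum(stages.values())              # 90 min per load
single_latency = single_cycle_time                    # 90 min
single_total = single_cycle_time * n_loads            # 360 min for 4 loads

# Pipelined: everything switches at the rate of the slowest stage plus the move delay.
pipe_cycle_time = max(stages.values()) + move_delay   # 46 min per cycle
pipe_latency = pipe_cycle_time * len(stages)          # 138 min for one load
pipe_total = pipe_cycle_time * (len(stages) + n_loads - 1)  # 276 min for 4 loads

print(single_cycle_time, single_latency, single_total)
print(pipe_cycle_time, pipe_latency, pipe_total)

Note that the pipelined version has a worse single-load latency but finishes the batch of loads sooner; that throughput win is the point of the next slides.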

Obstacles to speedup in Pipelining
[Diagram: Wash, Dry, and Fold stage lengths]
Ideal cycle time without the above limitations, with an n-stage pipeline:

Creating Stages
Fetch – get instruction
Decode – read registers
Execute – use ALU
Memory – access memory
WriteBack – write registers
[Stage diagram: Fetch (IF) → Decode (ID) → Execute → Memory (MEM) → WriteBack (WB)]

Pipelined Machine
[Datapath diagram: PC and Instruction Memory (Fetch); Register File and Sign Extend (Decode); ALU op/fun (Execute); Data Memory (Memory); register write-back (Writeback); pipeline registers between the stages]

In what cycle was $s1 written? In what cycle was $s4 read? In what cycle was the add executed?
add $s0, $0, $0
lw $s1, 0($t0)
sw $s2, 0($t1)
or $s3, $s4, $t3
[Pipeline timing diagram: each instruction enters the pipeline one cycle after the previous one]

Performance Analysis
Measurements related to our machine; Job = single instruction
Latency - Time to finish a complete _______________, start to finish.
Throughput - Average ______________ completed per unit time.
CyclesPerInstruction - An _________ completes each how many cycles?
Which is more important for reducing program execution time?

Machine Comparison
Stage delays: Fetch 2 ns, Decode 1 ns, Execute 2 ns, Memory 2 ns, WriteBack 1 ns; 0.1 ns pipeline register delay
Single-Cycle Implementation: Clock cycle time: _____ ns; Latency of a single instruction: _____ ns; Throughput for machine: _____ inst/ns
Pipelined Implementation: Clock cycle time: _____ ns; Latency of a single instruction: _____ ns; Throughput for machine: _____ inst/ns
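As a sanity check on this kind of comparison, here is a small Python sketch using the stage delays shown above (2, 1, 2, 2, 1 ns and a 0.1 ns pipeline register delay); the results are only as good as those assumptions, and the pipelined throughput is the ideal, stall-free value.

# Sketch: single-cycle vs. pipelined clock, latency, and throughput.
stage_ns = {"Fetch": 2, "Decode": 1, "Execute": 2, "Memory": 2, "WriteBack": 1}
reg_delay_ns = 0.1

# Single-cycle: the clock must cover every stage back to back.
single_clock = sum(stage_ns.values())                # 8 ns
single_latency = single_clock                        # 8 ns
single_throughput = 1 / single_clock                 # 0.125 inst/ns

# Pipelined: the clock must cover the slowest stage plus one register delay.
pipe_clock = max(stage_ns.values()) + reg_delay_ns   # 2.1 ns
pipe_latency = pipe_clock * len(stage_ns)            # 10.5 ns for one instruction
pipe_throughput = 1 / pipe_clock                     # about 0.48 inst/ns, ideal (no stalls)

print(single_clock, single_latency, single_throughput)
print(pipe_clock, pipe_latency, pipe_throughput)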

Example 2 – How do we speed up the pipelined machine?
Stage delays: Fetch 6 ns, Decode 4 ns, Execute 8 ns, Memory 10 ns, Writeback 4 ns; 0.1 ns pipeline register delay
Single cycle: 1 / ____ ns
Pipelined: 1 / ____ ns

Example 2 – Split more stages
Stage delays: Fetch 6 ns, Decode 4 ns, Execute 8 ns, Memory 10 ns, Writeback 4 ns; 0.1 ns pipeline register delay
Which stage(s) should we split? _________ and _________

Example 2 – After Split
[Stage diagram: seven stages of ___ ns each, Fetch through Writeback; 0.1 ns pipeline register delay]
Single cycle: 1 / ____ ns
Pipelined: 1 / ____ ns

Easy Right? Not so fast.
In what cycle does the lw write $s0? In what cycle does the or read $s0?
lw $s0, 0($t4)
sw $s2, 0($t1)
or $s3, $s0, $t3
and $s6, $s4, $t3
[Pipeline timing diagram labeled "Incorrect Execution": the or reads $s0 before the lw has written it]

Data Hazards (Correct, Slow Execution)
In what cycle does the lw write $s0? In what cycle does the or read $s0?
Stall - wasted cycles. ID & WB take only ½ cycle each; all other stages take the full cycle.
lw $s0, 0($t4)
sw $s2, 0($t1)
or $s3, $s0, $t3
and $s6, $s4, $t3
[Pipeline timing diagram with stall cycles inserted before the or]

Barriers to pipeline performance Uneven stages Pipeline register delays Data Hazards – ____________________________________ __

Default: Stall
Stall - wasted cycles.
lw $s0, 0($t4)
sw $s2, 0($t1)
or $s3, $s0, $t3
and $s6, $s4, $t3
[Pipeline timing diagram: the or is held until $s0 has been written]
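The stall decision is made by a small hazard-detection check in the decode stage. The Python sketch below follows the classic textbook-style condition (a load in the execute stage whose destination matches a source of the instruction being decoded); the field names are illustrative, not this datapath's actual signal names.

# Sketch of a load-use hazard check (textbook-style; names are illustrative).
def must_stall(idex_mem_read, idex_dest, ifid_src1, ifid_src2):
    """Stall if the instruction in EX is a load whose destination register
    is a source of the instruction currently being decoded."""
    return idex_mem_read and idex_dest in (ifid_src1, ifid_src2)

# Example: lw $s0, 0($t4) is in EX while or $s3, $s0, $t3 is in ID -> stall.
print(must_stall(idex_mem_read=True, idex_dest="$s0",
                 ifid_src1="$s0", ifid_src2="$t3"))   # True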

Solution 1: Data Forwarding
In what cycle is $s0 produced in the machine? In what cycle is $s0 used in the machine?
Add wires that pass data from the stage in which it is produced to the stage in which it is used.
lw $s0, 0($t4)
sw $s2, 0($t1)
or $s3, $s0, $t3
and $s6, $s4, $t3
[Pipeline timing diagram with forwarding]

Data-Forwarding
Where are those wires?
[Datapath diagram: the pipelined datapath from before (PC, Instruction Memory, Register File, Sign Extend, ALU, Data Memory, pipeline registers), with the forwarding paths drawn in]
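A forwarding unit decides when to use those wires instead of the value read from the register file. The sketch below gives the usual selection rule in Python-flavored pseudocode; the pipeline-register field names are assumptions, not the labels in this datapath figure.

# Sketch of a forwarding unit's decision for one ALU source operand.
def forward_select(src_reg, ex_mem_regwrite, ex_mem_dest,
                   mem_wb_regwrite, mem_wb_dest):
    """Return where the ALU should take its operand from."""
    if ex_mem_regwrite and ex_mem_dest == src_reg and ex_mem_dest != "$0":
        return "EX/MEM"      # forward the result just computed
    if mem_wb_regwrite and mem_wb_dest == src_reg and mem_wb_dest != "$0":
        return "MEM/WB"      # forward the older result about to be written back
    return "REGFILE"         # no hazard: use the value read in decode

# or $s3, $s0, $t3 needs $s0, which the lw's result (now in MEM/WB) supplies.
print(forward_select("$s0", False, None, True, "$s0"))   # "MEM/WB"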

Data Forwarding Example 2
Draw the timing diagram with data forwarding. Draw arrows to indicate data passing through forwarding.
lw $t0, 0($s0)
addi $t0, $t0, 1
add $s2, $s2, $t0
sw $s2, 0($s0)
[Blank timing grid]

Who reorders instructions?
Static scheduling - Compiler - ___________, but does not know when caches miss or loads/stores are to the same locations (conservative)
Dynamic scheduling - Hardware - _____________________________, but has all knowledge (more accurate)

Solution 2: Instruction Reordering (Before reordering)
Stall - wasted cycles.
lw $s0, 0($t4)
sw $s2, 0($t1)
or $s3, $s0, $t3
and $s6, $s4, $t3
[Pipeline timing diagram with stalls]

Solution 2: Instruction Reordering (With Instruction Reordering)
[Pipeline timing diagram: the same instructions after reordering, beginning with lw $s0, 0($t4); the stall is gone]

Instruction Reordering - Are these equivalent? (Get the same answer)
or $s3, $s0, $t3
sw $s3, 0($t1)
and $s0, $s4, $t3
lw $s0, 0($t4)
[Pipeline timing diagram]

A deeper look
or $s3, $s0, $t3
sw $s3, 0($t1)
and $s0, $s4, $t3
lw $s0, 0($t4)
[Pipeline timing diagram]

Dependencies And our friends RAW, WAR, and WAW

RAW – Read after Write
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0

WAR – Write after Read
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0

WAW – Write after Write
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0
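One way to make the three definitions concrete is to scan an instruction list and report each later instruction that reads or writes a register an earlier instruction touched. The Python sketch below does that for the four instructions above; it is an illustration of the definitions, not how the hardware finds hazards.

# Sketch: classify register dependences (RAW, WAR, WAW) in a short sequence.
code = [
    ("add", "$t0", ["$s0", "$s1"]),   # (mnemonic, destination, sources)
    ("sub", "$t2", ["$t0", "$s3"]),
    ("or",  "$s3", ["$t7", "$s2"]),
    ("and", "$t2", ["$t7", "$s0"]),
]

for i, (op_i, dst_i, srcs_i) in enumerate(code):
    for j in range(i + 1, len(code)):
        op_j, dst_j, srcs_j = code[j]
        if dst_i in srcs_j:
            print(f"RAW: {op_j} reads {dst_i}, written by {op_i}")
        if dst_j in srcs_i:
            print(f"WAR: {op_j} writes {dst_j}, read earlier by {op_i}")
        if dst_j == dst_i:
            print(f"WAW: {op_j} and {op_i} both write {dst_i}")

For this example it reports a RAW on $t0 (add then sub), a WAR on $s3 (sub then or), and a WAW on $t2 (sub then and), matching the next slide.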

Identify all of the dependencies (RAW, WAR, WAW)
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0
[Pipeline timing diagram]

Which dependencies might cause stalls (hazards)?
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0
[Pipeline timing diagram with RAW, WAR, and WAW dependences marked]
So why do we care about WAW and WAR dependencies at all?!?

Can we reorder the or?
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0
[Pipeline timing diagram with RAW, WAR, and WAW dependences marked]

No, we cannot reorder the or
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0
[Pipeline timing diagram with RAW, WAR, and WAW dependences marked]

Can we reorder the mul?
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0
[Pipeline timing diagram with RAW, WAR, and WAW dependences marked]

No, siree. Mul stays put.
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0
[Pipeline timing diagram with RAW, WAR, and WAW dependences marked]

How to remove WAR, WAW dependencies?
add $t0, $s0, $s1
sub $t2, $t0, $s3
or $s3, $t7, $s2
and $t2, $t7, $s0
and $t4, $s3, $s5
add $s3, $s4, $s6
[Pipeline timing diagram]

___________________________: use a different register for that result (and all subsequent uses of that result)
add $t0, $s0, $s1
sub ____, $t0, $s3
or ____, $t7, $s2
and $t2, $t7, $s0
and $t4, ___, $s5
add $s3, $s4, $s6
[Pipeline timing diagram]

Who renames registers?
Static register renaming - Compiler - Compiler is the one who makes assignments in the first place! - Number of registers limited by ____________
Dynamic register renaming - Hardware - Can offer more registers - Number of registers limited by ____________
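A minimal sketch of the renaming idea, using the same four-instruction example (the physical-register names $p0, $p1, ... are made up for illustration): every time an instruction writes a register, give the result a fresh name and make later readers use that name. WAR and WAW pairs disappear; the RAW (true) dependence stays.

# Sketch: rename destination registers so only true (RAW) dependences remain.
code = [
    ("add", "$t0", ["$s0", "$s1"]),
    ("sub", "$t2", ["$t0", "$s3"]),
    ("or",  "$s3", ["$t7", "$s2"]),
    ("and", "$t2", ["$t7", "$s0"]),
]

current_name = {}      # architectural register -> latest physical name
fresh = (f"$p{i}" for i in range(100))

renamed = []
for op, dst, srcs in code:
    srcs = [current_name.get(r, r) for r in srcs]   # read the newest version
    new_dst = next(fresh)                           # fresh name for the result
    current_name[dst] = new_dst
    renamed.append((op, new_dst, srcs))

for op, dst, srcs in renamed:
    print(op, dst, ", ".join(srcs))

After renaming, the WAW on $t2 and the WAR on $s3 are gone, while the sub still reads the add's result through its new name.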

Minimizing Data Hazards Summary

Dependencies Summary What is the difference between a hazard and a dependency? How can we get rid of name dependencies? What limits this solution?

The Big Picture
Fewer Stages: smaller, fewer stalls
More Stages: more hardware, more power, faster clock, higher throughput

The Big Picture
Throughput (inst / ns) = inst / cycle * cycle / ns
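A tiny worked example of that decomposition, with assumed numbers (not from the slides), to show how the two factors trade off:

# Sketch: throughput = (instructions / cycle) * (cycles / ns).
cycle_time_ns = 2.1            # deeper pipelines shrink this...
instructions_per_cycle = 0.8   # ...but more stalls shrink this
throughput = instructions_per_cycle * (1 / cycle_time_ns)
print(f"{throughput:.3f} instructions per ns")   # about 0.381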

Control Hazard (Incorrect Execution)
In what cycle does the nextPC get calculated for the bne? In what cycle does the or get fetched?
bne $s0, $s1, end
or $s3, $s0, $t3
add $s5, $s4, $t1
end: sw $s2, 0($t1)
[Pipeline timing diagram]

Control Hazard (Correct, Slow execution)
bne $s0, $s1, end
or $s3, $s0, $t3
add $s5, $s4, $t1
end: sw $s2, 0($t1)
[Pipeline timing diagram with stall cycles after the bne]

Barriers to pipeline performance Uneven stages Pipeline register delays Data Hazards Control Hazards –___________________________________

Solution 1: Add hardware to determine the branch in the decode stage
Now in what cycle does the nextPC get calculated for the bne? In what cycle does the or get fetched?
bne $s0, $s1, end
or $s3, $s0, $t3
add $s5, $s4, $t1
end: sw $s2, 0($t1)
[Pipeline timing diagram]

REMEMBER
For the rest of this course, branches will be determined in the decode stage.
All other optimizations will be in addition to moving the branch calculation to the decode stage.

Solution 2: Branch Delay Slot
Redefine the semantics of a branch: ALWAYS execute the instruction after the branch, regardless of the outcome of the branch.
bne $s0, $s1, end
nop
or $s3, $s0, $t3
add $s5, $s4, $t1
end: sw $s2, 0($t1)
[Pipeline timing diagram]

Solution 2: Branch Delay Slot
Try to fill that slot with an instruction from before the branch.
bne $s0, $s1, end
or $s3, $s0, $t3
add $s5, $s4, $t1
end: sw $s2, 0($t1)
[Pipeline timing diagram]

Branch Delay Slot
The hardware ________ executes the instruction after a branch
The _________ tries to take an instruction from before the branch and move it after the branch
If it can find no instruction, it inserts a _____ after the branch
If it forgets to place a nop or an instruction there, you can get incorrect execution!!!!!

Branch Delay Slot - Limitations
If you have a machine with 20 pipeline stages, and it takes 10 stages to calculate the branch, how many branch delay slots are there?
Can you move any instruction into the branch delay slot?
What happens as the pipeline gets deeper?

Solution 3: Branch Prediction
Guess which way the branch will go before the calculation occurs. Clean up if the predictor is wrong.
Original code:
bne $s0, $s1, end
or $s3, $s0, $t3
add $s5, $s4, $t1
end: sw $s2, 0($t1)

Correct Branch Prediction
Initial algorithm: Always predict not taken
If we are right, how many cycles do we stall?
bne $s0, $s1, end
or $s3, $s0, $t3
add $s5, $s4, $t1
end: sw $s2, 0($t1)
[Pipeline timing diagram]
Note: There are several prediction algorithms. Predict not taken is merely an example.

Branch Misprediction
Initial algorithm: Always predict not taken
If we are wrong, then flush incorrect instruction(s). How many cycles do we stall?
bne $s0, $s1, end
or $s3, $s0, $t3
add $s5, $s4, $t1
end: sw $s2, 0($t1)
[Pipeline timing diagram with the flushed instruction]
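The cost of that flush is usually summarized as an average CPI penalty. Here is a hedged sketch of the usual calculation; the branch frequency, misprediction rate, and one-cycle penalty are assumptions for illustration (the penalty matches a machine that resolves branches in decode).

# Sketch: average CPI including branch misprediction stalls (assumed numbers).
base_cpi = 1.0            # ideal pipelined CPI with no stalls
branch_fraction = 0.2     # fraction of instructions that are branches
mispredict_rate = 0.1     # predictor is wrong 10% of the time
penalty_cycles = 1        # flush cost when the branch resolves in decode

avg_cpi = base_cpi + branch_fraction * mispredict_rate * penalty_cycles
print(avg_cpi)            # about 1.02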

Designing a Branch Predictor Understand the nature of programs Are branch directions random? If not, what will correlate? –Past behavior? –Previous branches’ behavior?

Branch Prediction
for (i; i < n; i++) do some work
slt $t1, $s2, $s3
beq $t1, $0, end
loop: do some work
addi $s2, $s2, 1
slt $t1, $s2, $s3
bne $t1, $0, loop
end:
Observations: Is the beq often taken or not taken? Is the bne often taken or not taken?
Conclusions:

First Branch Predictor
Predict whatever happened last time. Update the predictor for next time.
[Two-state diagram (states 0 and 1): Predict Not Taken and Predict Taken; a taken branch (T) moves to Predict Taken, a not-taken branch (NT) moves to Predict Not Taken]
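A minimal Python sketch of that one-bit predictor, with the state and update rule as described on the slide (the training sequence is just an example):

# Sketch: one-bit "predict what happened last time" branch predictor.
state = 0   # 0 = predict not taken, 1 = predict taken

def predict_and_update(actual_taken):
    global state
    prediction = (state == 1)
    state = 1 if actual_taken else 0     # remember the latest outcome
    return prediction

# Example: a loop branch taken 4 times, then not taken once.
outcomes = [True, True, True, True, False]
hits = sum(predict_and_update(t) == t for t in outcomes)
print(f"{hits}/{len(outcomes)} correct")   # 3/5: wrong on the first and the last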

Branch Prediction
for (i; i < n; i++) do some work
slt $t1, $s2, $s3
beq $t1, $0, end
loop: do some work
addi $s2, $s2, 1
slt $t1, $s2, $s3
bne $t1, $0, loop
end:
[Table: for beq iterations 1, 2, …, x and bne iterations 1, 2, …, y, track CurState (initially 0), Prediction, Reality, and NextState]
When are we wrong?????

Second Branch Predictor
Must be wrong twice to switch the prediction. Update the predictor for next time.
[Four-state diagram (states 0-3): two Predict Not Taken states and two Predict Taken states; each taken branch (T) moves one state toward Predict Taken, each not-taken branch (NT) moves one state toward Predict Not Taken]
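And a sketch of the two-bit version, assuming the usual saturating-counter encoding (states 0 and 1 predict not taken, states 2 and 3 predict taken):

# Sketch: two-bit saturating-counter predictor (must be wrong twice to flip).
state = 2   # start weakly predicting taken

def predict_and_update(actual_taken):
    global state
    prediction = state >= 2
    state = min(state + 1, 3) if actual_taken else max(state - 1, 0)
    return prediction

# Same kind of loop branch: the single not-taken exit no longer flips it for long.
outcomes = [True, True, True, True, False, True, True]
hits = sum(predict_and_update(t) == t for t in outcomes)
print(f"{hits}/{len(outcomes)} correct")   # 6/7: only the loop exit is mispredicted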

Branch Prediction
for (i; i < n; i++) do some work
slt $t1, $s2, $s3
beq $t1, $0, end
loop: do some work
addi $s2, $s2, 1
slt $t1, $s2, $s3
bne $t1, $0, loop
end:
[Table: for beq iterations 1, 2, …, x and bne iterations 1, 2, …, y, track CurState (initially 2), Prediction, Reality, and NextState]
When are we wrong?????

Real Branch Predictors
TargetPC saved with the predictor
Limited space, so different branches may map to the same predictor - errors?
Prediction based on past behavior of several branches
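A sketch of how "limited space" shows up in practice: the predictor is a small table of counters indexed by low-order bits of the branch's PC, so two different branches can land on the same counter (aliasing). The table size and indexing here are assumptions for illustration, not a description of any particular processor.

# Sketch: a table of 2-bit counters indexed by low-order PC bits.
TABLE_SIZE = 16                      # assumed; real tables are much larger
table = [2] * TABLE_SIZE             # start weakly taken everywhere

def index(pc):
    return (pc >> 2) % TABLE_SIZE    # word-aligned PCs: drop the low 2 bits

def predict(pc):
    return table[index(pc)] >= 2

def update(pc, taken):
    i = index(pc)
    table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)

# Two branches 64 bytes apart alias to the same entry and disturb each other.
print(index(0x400100), index(0x400140))   # same index -> shared counter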

Solution 4: Conditional Instructions
Branch: Condition determines NextPC
Conditional Instruction: Condition determines RegWrite control bit. Only execute an instruction if a condition is true.
Hypothetical MIPS instructions:
movn $8, $11, $4 - copy $11 to $8 if $4 != zero
movz $8, $11, $4 - copy $11 to $8 if $4 == zero

Conventional MIPS Write the following code in conventional MIPS If ($s0 < $s1) $s2 = $s0+$s1 else $s2 = $s1 << 2

Conditional Instructions Write the following code in MIPS using movn and/or movz If ($s0 < $s1) $s2 = $s0+$s1 else $s2 = $s1 << 2

Conditional Instructions How many insts with conventional MIPS? How many insts with conditional insts? So why would we use conditional insts?

Branches - Control Hazard
slt $s2, $s0, $s1
beq $s2, $0, else
else: sll $s2, $s1, 2
[Pipeline timing diagram]

Conditional Instructions - No Control Hazard
[Pipeline timing diagram: no branch, so the fetch stream is never interrupted]

Advantages of Conditional Instructions No control hazards Fetch is uninterrupted - no chance of a branch misprediction Fairly simple hardware implementation

Disadvantages/Limits of Conditional Instructions
More instructions to execute
Compiler support is necessary
Does not work with loops
Nested if statements can become a problem
For highly predictable branches, which have no penalty when using branch prediction, the extra instructions make things worse.

Advantages of Branch Prediction No extra instructions Highly predictable branches have no stalls Works well with loops. All hardware - no compiler necessary

Disadvantages/Limits of Branch Prediction Large penalty when wrong –Badly behaved branches kill performance Only a few can be performed each cycle (only a problem in multi-issue machines)

Therefore…. Branch Prediction is probably necessary either way (for loops) Conditional Instructions are an optimization beyond branch prediction that can be very useful with a good compiler

Minimizing Control Hazards