Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time ______ Latency: Time to finish a _, start to finish Throughput:

Pipelining 6.1, 6.2

Performance Measurements Cycle Time: Time __________________ Latency: Time to finish a _____________, start to finish Throughput: Average ______________ per unit time CyclesPerJob: On average, number of _______________________________.

Goals Faster clock rate Use machine more efficiently No longer execute only one instruction at a time

Laundry Laundry-o-matic washes, dries & folds Wash: 30 min Dry: 40 min Fold: 20 min It switches them internally with no delay How long to complete 1 load? ______

Laundry-o-Matic - SingleCycle Minutes Load 123123 0 30 60 90 120 150 180 210 240 270 WFD

Laundry-o-Matic Cycle Time: Clothing is switched every ____ minutes Latency: A single load takes a total of ______ minutes Throughput: A load completes each ______ minutes CyclesPerLoad: Every ____ cycle(s), a load completes

Pipelined Laundry Split the laundry-o-matic into a washer, dryer, and folder (what a concept) Moving the laundry from one to another takes 6 minutes

Pipelined Laundry Minutes Load 123123 0 30 60 90 120 150 180 210 240 270 WF Switch all loads at the same time D

Pipelined Laundry Cycle Time: Clothing is switched every ____ minutes Latency: A single load takes a total of ______ minutes Throughput: A load completes each ______ minutes CyclesPerLoad: Every ____ cycle(s), a load completes

Single-Cycle vs Pipelined _________ has the higher cycle time _________ has the higher clock rate _________ has the higher single-load latency _________ has the higher throughput _________ has the higher CPL (Cycles per Load) More stages makes a _______ clock rate

Obstacles to speedup in Pipelining WFD 1. 2. Ideal cycle time w/out above limitations with n stage pipeline:

Creating Stages Fetch – get instruction Decode – read registers Execute – use ALU Memory – access memory WriteBack – write registers FetchDecodeExecuteMemoryWriteBack IFWB MEM ID

Pipelined Machine Read Addr Out Data Instruction Memory PC 4 src1 src1data src2 src2data Register File destreg destdata op/fun rs rt rd imm Addr Out Data Data Memory In Data 32 Sign Ext 16 << 2 << 2 Pipeline Register Fetch (Writeback) ExecuteDecodeMemory

In what cycle was $s1 written? In what cycle was $s4 read? In what cycle was the Add executed? Time-> add $s0, $0, $0 lw $s1, 0($t0) sw $s2, 0($t1) or $s3, $s4, $t3 IF ID IF ID IF MEM ID IF 1 2 3 4 5 6 7 8 ID WB MEM WB MEM WB MEM WB

Performance Analysis Measurements related to our machine Job = single instruction Latency - Time to finish a complete _______________, start to finish. Throughput – Average ______________ completed per unit time. CyclesPerInstruction – An _________ completes each how many cycles? Which is more important for reducing program execution time?

Machine Comparison FetchDecode Execute Memory WriteBack 2ns1ns 2ns 2ns 1ns 0.1 ns pipeline register delay Single-Cycle Implementation Clock cycle time:_____ ns Latency of a single instruction: _____ ns Throughput for machine:_____ inst/ns Pipelined Implementation Clock cycle time:_____ ns Latency of a single instruction: _____ ns Throughput for machine:_____ inst/ns

Example 2 – How do we speed up pipelined machine? FetchDecode Execute Memory Writeback 6ns4ns 8ns 10ns 4ns 0.1 ns pipelined register delay Single cycle:1 / ns Pipelined:1 / ns

Example 2 – Split more stages FetchDecode Execute Memory Writeback 6ns4ns 8ns 10ns 4ns 0.1 ns pipelined register delay Which stage(s) should we split? _________ and _________

Example 2 – After Split FWB ___ns ___ns ___ns ___ns ___ns ___ns ___ns 0.1 ns pipelined register delay Single cycle:1 / ns Pipelined:1 / ns

Easy Right? Not so fast. In what cycle does the sw write $s0? In what cycle does the or read $s0? Time-> or $s3, $s0, $t3 IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB sw $s2, 0($t1) and $s6, $s4, $t3 IF ID IF ID WB MEM WB Incorrect Execution lw $s0, 0($t4)

Time-> or $s3, $s0, $t3 IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB In what cycle does the sw write $s0? In what cycle does the oi read $s0? Stall - wasted cycles sw $s2, 0($t1) and $s6, $s4, $t3 IF ID IF ID WB MEM WB IF Correct, Slow Execution Data Hazards ID & WB take only ½ cycle each. All other stages take the full cycle lw $s0, 0($t4)

Barriers to pipeline performance Uneven stages Pipeline register delays Data Hazards – ____________________________________ __

Time-> or $s3, $s0, $t3 IF ID IF MEM 1 2 3 4 5 6 7 8 Stall - wasted cycles sw $s2, 0($t1) and $s6, $s4, $t3 Default: Stall lw $s0, 0($t4) WB

Time-> or $s3, $s0, $t3 IF ID IF MEM 1 2 3 4 5 6 7 8 In what cycle is $s0 produced in the machine? In what cycle is $s0 used in the machine? Add wires that pass data from the stage in which it is produced to the stage in which it is used. sw $s2, 0($t1) and $s6, $s4, $t3 IF Solution 1: Data Forwarding lw $s0, 0($t4) WB ID

Data-Forwarding Where are those wires? Read Addr Out Data Instruction Memory PC 4 src1 src1data src2 src2data Register File destreg destdata op/fun rs rt rd imm Addr Out Data Data Memory In Data 32 Sign Ext 16 << 2 << 2 Pipeline Register Fetch (Writeback) ExecuteDecodeMemory

Time-> add $s2, $s2, $t0 FD F M 1 2 3 4 5 6 7 8 9 10 11 12 Draw the timing diagram with data forwarding Draw arrows to indicate data passing through forwarding F lw $t0, 0($s0) W D Data Forwarding Example 2 sw $s2, 0($s0) addi $t0, $t0, 1

Who reorders instructions? Static scheduling –Compiler –___________, but does not know when caches miss or loads/stores are to the same locations (conservative) Dynamic scheduling –Hardware –_____________________________, but has all knowledge (more accurate)

Time-> or $s3, $s0, $t3 IF ID IF MEM 1 2 3 4 5 6 7 8 MEMWB IF Stall - wasted cycles sw $s2, 0($t1) and $s6, $s4, $t3 IF ID IF ID WB MEM WB Solution 2: Instruction Reordering (Before reordering) lw $s0, 0($t4) WB ID

Time-> IF ID IF ID MEM 1 2 3 4 5 6 7 8 WB IF Solution 2: Instruction Reordering (With Instruction Reordering) lw $s0, 0($t4) WB

Time-> IF ID IF MEM 1 2 3 4 5 6 7 8 MEMWB IF ID IF ID WB MEM WB ID or $s3, $s0, $t3 sw $s3, 0($t1) and $s0, $s4, $t3 lw $s0, 0($t4) Instruction Reordering - Are these equivalent? (Get the same answer)

Time-> or $s3, $s0, $t3 IF ID IF MEM 1 2 3 4 5 6 7 8 MEM WB sw $s3, 0($t1) and $s0, $s4, $t3 IF ID IF ID WB MEM WB lw $s0, 0($t4) ID WB A deeper look

Dependencies And our friends RAW, WAR, and WAW

RAW – Read after Write add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

WAR - Write after Read add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

WAW –Write after Write add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

Identify all of the dependencies IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW

Which dependencies might cause stalls (hazards)? IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW So why do we care about WAW and WAR dependencies at all?!?

Can we reorder the or? IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW

No, we can not reorder the or IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW RAW

Can we reorder the mul? IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW

No, siree. Mul stays put. IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW

How to remove WAR, WAW dependencies? IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 and $t4, $s3, $s5 add $s3, $s4, $s6

___________________________ use a different register for that result (and all subsequent uses of that result) IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub ____, $t0, $s3 or ____, $t7, $s2 and $t2, $t7, $s0 and $t4, ___, $s5 add $s3, $s4, $s6

Who renames registers? Static register renaming –Compiler –Compiler is the one who makes assignments in the first place! –Number of registers limited by ____________ Dynamic register renaming –Hardware –Can offer more registers – –Number of registers limited by ____________

Minimizing Data Hazards Summary

Dependencies Summary What is the difference between a hazard and a dependency? How can we get rid of name dependencies? What limits this solution?

The Big Picture Fewer StagesMore Stages More hardware Smaller More power Faster Clock Fewer Stalls Higher Throughput

The Big Picture Throughput (inst / ns) = –inst / cycle * cycle / ns

In what cycle does the nextPC get calculated for the bne? In what cycle does the or get fetched? Time-> bne $s0, $s1, end or $s3, $s0, $t3 IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB end: sw $s2, 0($t1) IF ID IF ID WB MEM WB Control Hazard (Incorrect Execution) add $s5, $s4, $t1

Time-> bne $s0, $s1, end or $s3, $s0, $t3 IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB end: sw $s2, 0($t1) IF ID MEMWB Control Hazard (Correct, Slow execution) IF ID MEMWB add $s5, $s4, $t1

Barriers to pipeline performance Uneven stages Pipeline register delays Data Hazards Control Hazards –___________________________________

Now in what cycle does the nextPC get calculated for the bne? In what cycle does the ori get fetched? Time-> IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID MEMWB Solution 1: Add hardware to determine branch in decode stage bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB

REMEMBER For the rest of this course, the branches will be determined in the decode stage All other optimizations will be in addition to moving branch calculation to decode stage

Redefine the semantics of a branch ALWAYS execute the instruction after the branch, regardless of the outcome of the branch. Time-> IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID MEMWB Solution 2: Branch Delay Slot bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB nop

Try to fill that slot with an instruction from before the branch. Time-> IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID MEMWB Solution 2: Branch Delay Slot bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB

Branch Delay Slot The hardware ________ executes instruction after a branch The _________ tries to take an instruction from before branch and move it after branch If it can find no instruction, it inserts a _____ after the branch If it forgets to place nop or inst there, you can get incorrect execution!!!!!

Branch Delay Slot - Limitations If you have a machine with 20 pipeline stages, and it takes 10 stages to calculate branch, how many branch delay slots are there? Can you move any instruction into branch delay slot? What happens as the pipeline gets deeper?

Solution 3: Branch Prediction bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 Guess which way the branch will go before calculation occurs. Clean up if predictor is wrong. Original code:

Initial algorithm: Always predict not taken If we are right, how many cycles do we stall? Time-> IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID MEMWB bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB Note: There are several prediction algorithms. Predict taken is merely an example. Correct Branch Prediction

Initial algorithm: Always predict not taken If we are wrong, then flush incorrect instruction(s) How many cycles do we stall? Time-> IF ID IF MEM 1 2 3 4 5 6 7 8 WB IF ID MEMWB Branch Misprediction bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB

Designing a Branch Predictor Understand the nature of programs Are branch directions random? If not, what will correlate? –Past behavior? –Previous branches’ behavior?

Branch Prediction slt $t1, $s2, $s3 beq $t1, $0, end loop: do some work addi $s2, $s2, 1 slt $t1, $s2, $s3 bne $t1, $0, loop End: for(i; i<n;i++) do some work Observations: Is beq often taken or not taken? Is bne often taken or not taken? Conclusions:

First Branch Predictor 0 Predict TakenPredict Not Taken Predict whatever happened last time Update the predictor for next time T NT T 1

Branch Prediction slt $t1, $s2, $s3 beq $t1, $0, end loop: do some work addi $s2, $s2, 1 slt $t1, $s2, $s3 bne $t1, $0, loop End: for(i; i<n;i++) do some work Iteration1 2 … x 1 2 … y CurState0 Prediction Reality NextState When are we wrong?????

Second Branch Predictor 0 Predict Taken Predict Not Taken Must be wrong twice to switch prediction Update the predictor for next time T NT T T T 123

Branch Prediction slt $t1, $s2, $s3 beq $t1, $0, end loop: do some work addi $s2, $s2, 1 slt $t1, $s2, $s3 bne $t1, $0, loop End: for(i; i<n;i++) do some work Iteration1 2 … x 1 2 … y CurState2 Prediction Reality NextState When are we wrong?????

Real Branch Predictors TargetPC saved with predictor Limited space, so different branches may map to the same predictor –errors? Prediction based on past behavior of several branches

Solution 4: Conditional Instructions Branch: Condition determines NextPC Conditional Instruction: Condition determines RegWrite control bit. Only execute an instruction if a condition is true Hypothetical MIPS instructions: movn $8, $11, $4 - copy $11 to $8 if $4 != zero movz $8, $11, $4 - copy $11 to $8 if $4 == zero

Conventional MIPS Write the following code in conventional MIPS If ($s0 < $s1) $s2 = $s0+$s1 else $s2 = $s1 << 2

Conditional Instructions Write the following code in MIPS using movn and/or movz If ($s0 < $s1) $s2 = $s0+$s1 else $s2 = $s1 << 2

Conditional Instructions How many insts with conventional MIPS? How many insts with conditional insts? So why would we use conditional insts?

Time-> IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB Branches - Control Hazard IF ID MEMWB slt $s2, $s0, $s1 beq $s2, $0, else else:sll $s2, $s1, 2

Time-> IF ID IF ID MEM 1 2 3 4 5 6 7 8 MEM WB IF ID MEMWB Conditional Instructions - No Control Hazard IF ID MEMWB

Advantages of Conditional Instructions No control hazards Fetch is uninterrupted - no chance of a branch misprediction Fairly simple hardware implementation

Disadvantages/Limits of Conditional Instructions More instructions to execute Compiler support is necessary Does not work with loops Nested if statements can become a problem In highly predictable branches, which have no penalty when using branch prediction, the extra instructions are worse.

Advantages of Branch Prediction No extra instructions Highly predictable branches have no stalls Works well with loops. All hardware - no compiler necessary

Disadvantages/Limits of Branch Prediction Large penalty when wrong –Badly behaved branches kill performance Only a few can be performed each cycle (only a problem in multi-issue machines)

Therefore…. Branch Prediction is probably necessary either way (for loops) Conditional Instructions are an optimization beyond branch prediction that can be very useful with a good compiler

Minimizing Control Hazards

Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time ______ Latency: Time to finish a _, start to finish Throughput:

Similar presentations

Presentation on theme: "Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time ______ Latency: Time to finish a _, start to finish Throughput:"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time __________________ Latency: Time to finish a _____________, start to finish Throughput:

Similar presentations

Presentation on theme: "Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time __________________ Latency: Time to finish a _____________, start to finish Throughput:"— Presentation transcript:

Similar presentations

About project

Feedback

Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time ______ Latency: Time to finish a _, start to finish Throughput:

Presentation on theme: "Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time ______ Latency: Time to finish a _, start to finish Throughput:"— Presentation transcript: