Presentation is loading. Please wait.

Presentation is loading. Please wait.

Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time __________________ Latency: Time to finish a _____________, start to finish Throughput:

Similar presentations


Presentation on theme: "Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time __________________ Latency: Time to finish a _____________, start to finish Throughput:"— Presentation transcript:

1 Pipelining 6.1, 6.2

2 Performance Measurements Cycle Time: Time __________________ Latency: Time to finish a _____________, start to finish Throughput: Average ______________ per unit time CyclesPerJob: On average, number of _______________________________.

3 Goals Faster clock rate Use machine more efficiently No longer execute only one instruction at a time

4 Laundry Laundry-o-matic washes, dries & folds Wash: 30 min Dry: 40 min Fold: 20 min It switches them internally with no delay How long to complete 1 load? ______

5 Laundry-o-Matic - SingleCycle Minutes Load WFD

6 Laundry-o-Matic Cycle Time: Clothing is switched every ____ minutes Latency: A single load takes a total of ______ minutes Throughput: A load completes each ______ minutes CyclesPerLoad: Every ____ cycle(s), a load completes

7 Pipelined Laundry Split the laundry-o-matic into a washer, dryer, and folder (what a concept) Moving the laundry from one to another takes 6 minutes

8 Pipelined Laundry Minutes Load WF Switch all loads at the same time D

9 Pipelined Laundry Cycle Time: Clothing is switched every ____ minutes Latency: A single load takes a total of ______ minutes Throughput: A load completes each ______ minutes CyclesPerLoad: Every ____ cycle(s), a load completes

10 Single-Cycle vs Pipelined _________ has the higher cycle time _________ has the higher clock rate _________ has the higher single-load latency _________ has the higher throughput _________ has the higher CPL (Cycles per Load) More stages makes a _______ clock rate

11 Obstacles to speedup in Pipelining WFD Ideal cycle time w/out above limitations with n stage pipeline:

12

13 Creating Stages Fetch – get instruction Decode – read registers Execute – use ALU Memory – access memory WriteBack – write registers FetchDecodeExecuteMemoryWriteBack IFWB MEM ID

14 Pipelined Machine Read Addr Out Data Instruction Memory PC 4 src1 src1data src2 src2data Register File destreg destdata op/fun rs rt rd imm Addr Out Data Data Memory In Data 32 Sign Ext 16 << 2 << 2 Pipeline Register Fetch (Writeback) ExecuteDecodeMemory

15 In what cycle was $s1 written? In what cycle was $s4 read? In what cycle was the Add executed? Time-> add $s0, $0, $0 lw $s1, 0($t0) sw $s2, 0($t1) or $s3, $s4, $t3 IF ID IF ID IF MEM ID IF ID WB MEM WB MEM WB MEM WB

16 Performance Analysis Measurements related to our machine Job = single instruction Latency - Time to finish a complete _______________, start to finish. Throughput – Average ______________ completed per unit time. CyclesPerInstruction – An _________ completes each how many cycles? Which is more important for reducing program execution time?

17 Machine Comparison FetchDecode Execute Memory WriteBack 2ns1ns 2ns 2ns 1ns 0.1 ns pipeline register delay Single-Cycle Implementation Clock cycle time:_____ ns Latency of a single instruction: _____ ns Throughput for machine:_____ inst/ns Pipelined Implementation Clock cycle time:_____ ns Latency of a single instruction: _____ ns Throughput for machine:_____ inst/ns

18 Example 2 – How do we speed up pipelined machine? FetchDecode Execute Memory Writeback 6ns4ns 8ns 10ns 4ns 0.1 ns pipelined register delay Single cycle:1 / ns Pipelined:1 / ns

19 Example 2 – Split more stages FetchDecode Execute Memory Writeback 6ns4ns 8ns 10ns 4ns 0.1 ns pipelined register delay Which stage(s) should we split? _________ and _________

20 Example 2 – After Split FWB ___ns ___ns ___ns ___ns ___ns ___ns ___ns 0.1 ns pipelined register delay Single cycle:1 / ns Pipelined:1 / ns

21 Easy Right? Not so fast. In what cycle does the sw write $s0? In what cycle does the or read $s0? Time-> or $s3, $s0, $t3 IF ID IF ID MEM MEM WB sw $s2, 0($t1) and $s6, $s4, $t3 IF ID IF ID WB MEM WB Incorrect Execution lw $s0, 0($t4)

22 Time-> or $s3, $s0, $t3 IF ID IF ID MEM MEM WB In what cycle does the sw write $s0? In what cycle does the oi read $s0? Stall - wasted cycles sw $s2, 0($t1) and $s6, $s4, $t3 IF ID IF ID WB MEM WB IF Correct, Slow Execution Data Hazards ID & WB take only ½ cycle each. All other stages take the full cycle lw $s0, 0($t4)

23 Barriers to pipeline performance Uneven stages Pipeline register delays Data Hazards – ____________________________________ __

24 Time-> or $s3, $s0, $t3 IF ID IF MEM Stall - wasted cycles sw $s2, 0($t1) and $s6, $s4, $t3 Default: Stall lw $s0, 0($t4) WB

25 Time-> or $s3, $s0, $t3 IF ID IF MEM In what cycle is $s0 produced in the machine? In what cycle is $s0 used in the machine? Add wires that pass data from the stage in which it is produced to the stage in which it is used. sw $s2, 0($t1) and $s6, $s4, $t3 IF Solution 1: Data Forwarding lw $s0, 0($t4) WB ID

26 Data-Forwarding Where are those wires? Read Addr Out Data Instruction Memory PC 4 src1 src1data src2 src2data Register File destreg destdata op/fun rs rt rd imm Addr Out Data Data Memory In Data 32 Sign Ext 16 << 2 << 2 Pipeline Register Fetch (Writeback) ExecuteDecodeMemory

27 Time-> add $s2, $s2, $t0 FD F M Draw the timing diagram with data forwarding Draw arrows to indicate data passing through forwarding F lw $t0, 0($s0) W D Data Forwarding Example 2 sw $s2, 0($s0) addi $t0, $t0, 1

28 Who reorders instructions? Static scheduling –Compiler –___________, but does not know when caches miss or loads/stores are to the same locations (conservative) Dynamic scheduling –Hardware –_____________________________, but has all knowledge (more accurate)

29 Time-> or $s3, $s0, $t3 IF ID IF MEM MEMWB IF Stall - wasted cycles sw $s2, 0($t1) and $s6, $s4, $t3 IF ID IF ID WB MEM WB Solution 2: Instruction Reordering (Before reordering) lw $s0, 0($t4) WB ID

30 Time-> IF ID IF ID MEM WB IF Solution 2: Instruction Reordering (With Instruction Reordering) lw $s0, 0($t4) WB

31 Time-> IF ID IF MEM MEMWB IF ID IF ID WB MEM WB ID or $s3, $s0, $t3 sw $s3, 0($t1) and $s0, $s4, $t3 lw $s0, 0($t4) Instruction Reordering - Are these equivalent? (Get the same answer)

32 Time-> or $s3, $s0, $t3 IF ID IF MEM MEM WB sw $s3, 0($t1) and $s0, $s4, $t3 IF ID IF ID WB MEM WB lw $s0, 0($t4) ID WB A deeper look

33 Dependencies And our friends RAW, WAR, and WAW

34 RAW – Read after Write add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

35 WAR - Write after Read add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

36 WAW –Write after Write add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

37 Identify all of the dependencies IF ID IF ID MEM MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW

38 Which dependencies might cause stalls (hazards)? IF ID IF ID MEM MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW So why do we care about WAW and WAR dependencies at all?!?

39 Can we reorder the or? IF ID IF ID MEM MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW

40 No, we can not reorder the or IF ID IF ID MEM MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW RAW

41 Can we reorder the mul? IF ID IF ID MEM MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW

42 No, siree. Mul stays put. IF ID IF ID MEM MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 RAW WAR WAW

43 How to remove WAR, WAW dependencies? IF ID IF ID MEM MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0 and $t4, $s3, $s5 add $s3, $s4, $s6

44 ___________________________ use a different register for that result (and all subsequent uses of that result) IF ID IF ID MEM MEM WB IF ID IF ID WB MEM WB add $t0, $s0, $s1 sub ____, $t0, $s3 or ____, $t7, $s2 and $t2, $t7, $s0 and $t4, ___, $s5 add $s3, $s4, $s6

45 Who renames registers? Static register renaming –Compiler –Compiler is the one who makes assignments in the first place! –Number of registers limited by ____________ Dynamic register renaming –Hardware –Can offer more registers – –Number of registers limited by ____________

46 Minimizing Data Hazards Summary

47 Dependencies Summary What is the difference between a hazard and a dependency? How can we get rid of name dependencies? What limits this solution?

48 The Big Picture Fewer StagesMore Stages More hardware Smaller More power Faster Clock Fewer Stalls Higher Throughput

49 The Big Picture Throughput (inst / ns) = –inst / cycle * cycle / ns

50 In what cycle does the nextPC get calculated for the bne? In what cycle does the or get fetched? Time-> bne $s0, $s1, end or $s3, $s0, $t3 IF ID IF ID MEM MEM WB end: sw $s2, 0($t1) IF ID IF ID WB MEM WB Control Hazard (Incorrect Execution) add $s5, $s4, $t1

51 Time-> bne $s0, $s1, end or $s3, $s0, $t3 IF ID IF ID MEM MEM WB end: sw $s2, 0($t1) IF ID MEMWB Control Hazard (Correct, Slow execution) IF ID MEMWB add $s5, $s4, $t1

52 Barriers to pipeline performance Uneven stages Pipeline register delays Data Hazards Control Hazards –___________________________________

53 Now in what cycle does the nextPC get calculated for the bne? In what cycle does the ori get fetched? Time-> IF ID IF ID MEM MEM WB IF ID MEMWB Solution 1: Add hardware to determine branch in decode stage bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB

54 REMEMBER For the rest of this course, the branches will be determined in the decode stage All other optimizations will be in addition to moving branch calculation to decode stage

55 Redefine the semantics of a branch ALWAYS execute the instruction after the branch, regardless of the outcome of the branch. Time-> IF ID IF ID MEM MEM WB IF ID MEMWB Solution 2: Branch Delay Slot bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB nop

56 Try to fill that slot with an instruction from before the branch. Time-> IF ID IF ID MEM MEM WB IF ID MEMWB Solution 2: Branch Delay Slot bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB

57 Branch Delay Slot The hardware ________ executes instruction after a branch The _________ tries to take an instruction from before branch and move it after branch If it can find no instruction, it inserts a _____ after the branch If it forgets to place nop or inst there, you can get incorrect execution!!!!!

58 Branch Delay Slot - Limitations If you have a machine with 20 pipeline stages, and it takes 10 stages to calculate branch, how many branch delay slots are there? Can you move any instruction into branch delay slot? What happens as the pipeline gets deeper?

59 Solution 3: Branch Prediction bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 Guess which way the branch will go before calculation occurs. Clean up if predictor is wrong. Original code:

60

61 Initial algorithm: Always predict not taken If we are right, how many cycles do we stall? Time-> IF ID IF ID MEM MEM WB IF ID MEMWB bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB Note: There are several prediction algorithms. Predict taken is merely an example. Correct Branch Prediction

62 Initial algorithm: Always predict not taken If we are wrong, then flush incorrect instruction(s) How many cycles do we stall? Time-> IF ID IF MEM WB IF ID MEMWB Branch Misprediction bne $s0, $s1, end or $s3, $s0, $t3 end: sw $s2, 0($t1) add $s5, $s4, $t1 IF ID MEMWB

63 Designing a Branch Predictor Understand the nature of programs Are branch directions random? If not, what will correlate? –Past behavior? –Previous branches’ behavior?

64 Branch Prediction slt $t1, $s2, $s3 beq $t1, $0, end loop: do some work addi $s2, $s2, 1 slt $t1, $s2, $s3 bne $t1, $0, loop End: for(i; i

65 First Branch Predictor 0 Predict TakenPredict Not Taken Predict whatever happened last time Update the predictor for next time T NT T 1

66 Branch Prediction slt $t1, $s2, $s3 beq $t1, $0, end loop: do some work addi $s2, $s2, 1 slt $t1, $s2, $s3 bne $t1, $0, loop End: for(i; i

67 Second Branch Predictor 0 Predict Taken Predict Not Taken Must be wrong twice to switch prediction Update the predictor for next time T NT T T T 123

68 Branch Prediction slt $t1, $s2, $s3 beq $t1, $0, end loop: do some work addi $s2, $s2, 1 slt $t1, $s2, $s3 bne $t1, $0, loop End: for(i; i

69 Real Branch Predictors TargetPC saved with predictor Limited space, so different branches may map to the same predictor –errors? Prediction based on past behavior of several branches

70

71

72

73 Solution 4: Conditional Instructions Branch: Condition determines NextPC Conditional Instruction: Condition determines RegWrite control bit. Only execute an instruction if a condition is true Hypothetical MIPS instructions: movn $8, $11, $4 - copy $11 to $8 if $4 != zero movz $8, $11, $4 - copy $11 to $8 if $4 == zero

74 Conventional MIPS Write the following code in conventional MIPS If ($s0 < $s1) $s2 = $s0+$s1 else $s2 = $s1 << 2

75 Conditional Instructions Write the following code in MIPS using movn and/or movz If ($s0 < $s1) $s2 = $s0+$s1 else $s2 = $s1 << 2

76 Conditional Instructions How many insts with conventional MIPS? How many insts with conditional insts? So why would we use conditional insts?

77 Time-> IF ID IF ID MEM MEM WB Branches - Control Hazard IF ID MEMWB slt $s2, $s0, $s1 beq $s2, $0, else else:sll $s2, $s1, 2

78 Time-> IF ID IF ID MEM MEM WB IF ID MEMWB Conditional Instructions - No Control Hazard IF ID MEMWB

79 Advantages of Conditional Instructions No control hazards Fetch is uninterrupted - no chance of a branch misprediction Fairly simple hardware implementation

80 Disadvantages/Limits of Conditional Instructions More instructions to execute Compiler support is necessary Does not work with loops Nested if statements can become a problem In highly predictable branches, which have no penalty when using branch prediction, the extra instructions are worse.

81 Advantages of Branch Prediction No extra instructions Highly predictable branches have no stalls Works well with loops. All hardware - no compiler necessary

82 Disadvantages/Limits of Branch Prediction Large penalty when wrong –Badly behaved branches kill performance Only a few can be performed each cycle (only a problem in multi-issue machines)

83 Therefore…. Branch Prediction is probably necessary either way (for loops) Conditional Instructions are an optimization beyond branch prediction that can be very useful with a good compiler

84 Minimizing Control Hazards


Download ppt "Pipelining 6.1, 6.2. Performance Measurements Cycle Time: Time __________________ Latency: Time to finish a _____________, start to finish Throughput:"

Similar presentations


Ads by Google