1 © 1998 Morgan Kaufmann Publishers Chapter Six

2 © 1998 Morgan Kaufmann Publishers Pipelining: improve performance by increasing instruction throughput. The ideal speedup is the number of stages in the pipeline. Do we achieve this?
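
To make the "ideal speedup equals the number of stages" claim concrete, here is a small worked example, assuming a 5-stage pipeline, n instructions, equal stage times T, and no stalls (the numbers are illustrative, not from the slides):

\[
\text{Speedup} = \frac{\text{Time}_{\text{non-pipelined}}}{\text{Time}_{\text{pipelined}}}
             = \frac{n \cdot 5T}{(n + 4)\,T}
             = \frac{5n}{n + 4} \;\xrightarrow{\;n \to \infty\;}\; 5
\]

For n = 100 instructions this is already 5(100)/104 ≈ 4.8, but hazards and stalls keep real pipelines below the ideal.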

3 © 1998 Morgan Kaufmann Publishers Pipelining
What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
We will build a simple pipeline and look at these issues.
We will talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.

4 © 1998 Morgan Kaufmann Publishers Basic Idea: what do we need to add to actually split the datapath into stages?

5 © 1998 Morgan Kaufmann Publishers Pipelined Datapath: can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

6 © 1998 Morgan Kaufmann Publishers Corrected Datapath

7 © 1998 Morgan Kaufmann Publishers Instruction fetch stage of lw

8 © 1998 Morgan Kaufmann Publishers Instruction decode stage of lw

9 © 1998 Morgan Kaufmann Publishers Execution stage of lw

10 © 1998 Morgan Kaufmann Publishers Memory stage of lw

11 © 1998 Morgan Kaufmann Publishers Write back stage of lw

12 © 1998 Morgan Kaufmann Publishers Execution stage of sw

13 © 1998 Morgan Kaufmann Publishers Memory stage of sw

14 © 1998 Morgan Kaufmann Publishers Write back stage of sw

15 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline

16 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

17 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

18 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

19 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

20 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

21 © 1998 Morgan Kaufmann Publishers Graphically Representing Pipelines
Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
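
For the first question, the pipeline-diagram view gives the count directly. Assuming the chapter's 5-stage pipeline and no stalls, with k stages and n instructions the last instruction completes in cycle

\[
\text{cycles} = k + (n - 1), \qquad \text{e.g. } 5 + (5 - 1) = 9 \text{ cycles for a 5-instruction sequence,}
\]

and every stall or flush visible in the diagram adds one cycle to this total.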

22 © 1998 Morgan Kaufmann Publishers Pipeline Control
– Can these operations be completed one stage earlier? Think about the critical path!
– Why does the instruction memory (IM) have no control signal at all?
– Why does the register file (RF) have only a write control?
– Could the ALU result be written back in MEM? Consider the cost of an RF write port!

23 © 1998 Morgan Kaufmann Publishers Pipeline design considerations
Simplification of the control mechanism: units active in every clock cycle can be always enabled.
– Ex: the instruction memory has no control signal; neither do the pipeline registers.
Minimization of power consumption: explicit control for infrequent operations.
– Ex: both read and write controls for the data memory.
Cost consideration
– Ex: alternatives for writing back the ALU result, in MEM or in WB.

24 © 1998 Morgan Kaufmann Publishers Pipeline control
We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back
How would control be handled in an automobile plant? A fancy control center telling everyone what to do? Should we use a finite state machine?
– Centralized
– Distributed

25 © 1998 Morgan Kaufmann Publishers Pipeline Control
Pass control signals along just like the data (see the sketch below):
– Generate all control signals at the decode stage, similar to the single-cycle implementation.
– Pass the generated control signals along the pipeline and consume them at the corresponding stage, similar to the multicycle implementation.
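
A minimal C sketch of this organization, assuming a simplified MIPS-style five-stage pipeline (the struct and field names are illustrative, not taken from the slides): ID produces every control bit once, and each pipeline register carries the bits that later stages still need.

    #include <stdint.h>
    #include <stdbool.h>

    /* All control bits are generated in the decode stage, grouped by the
     * stage that will consume them. */
    typedef struct {
        bool    reg_dst, alu_src;            /* consumed in EX  */
        uint8_t alu_op;                      /* consumed in EX  */
        bool    branch, mem_read, mem_write; /* consumed in MEM */
        bool    reg_write, mem_to_reg;       /* consumed in WB  */
    } Control;

    /* Each pipeline register holds the data for its stage plus the
     * control bits that have not been consumed yet. */
    typedef struct { uint32_t rs_val, rt_val, imm; uint8_t rs, rt, rd; Control ctl; } ID_EX;
    typedef struct { uint32_t alu_result, rt_val; uint8_t dest_reg;    Control ctl; } EX_MEM;
    typedef struct { uint32_t alu_result, mem_data; uint8_t dest_reg;  Control ctl; } MEM_WB;

    /* On every clock edge the unconsumed control simply advances with the
     * instruction, e.g. ex_mem.ctl = id_ex.ctl; mem_wb.ctl = ex_mem.ctl; */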

26 © 1998 Morgan Kaufmann Publishers Datapath with Control

27 © 1998 Morgan Kaufmann Publishers

28 © 1998 Morgan Kaufmann Publishers

29 © 1998 Morgan Kaufmann Publishers

30 © 1998 Morgan Kaufmann Publishers $4, $5

31 © 1998 Morgan Kaufmann Publishers

32 © 1998 Morgan Kaufmann Publishers

33 © 1998 Morgan Kaufmann Publishers

34 © 1998 Morgan Kaufmann Publishers

35 © 1998 Morgan Kaufmann Publishers

36 © 1998 Morgan Kaufmann Publishers Dependencies
Problem with starting the next instruction before the first is finished:
– dependencies that point backward in time are data hazards

37 © 1998 Morgan Kaufmann Publishers Software Solution
Have the compiler guarantee no hazards. Where do we insert the nops?
    sub  $2, $1, $3
    and  $12, $2, $5
    or   $13, $6, $2
    add  $14, $2, $2
    sw   $15, 100($2)
Insert NOPs:
    sub  $2, $1, $3
    nop
    nop
    nop
    and  $12, $2, $5
    or   $13, $6, $2
    add  $14, $2, $2
    sw   $15, 100($2)
Problem: this really slows us down!

38 © 1998 Morgan Kaufmann Publishers Forwarding
Use temporary results; don't wait for them to be written back:
– ALU forwarding
– Register file forwarding (latch-based register file) to handle a read and write of the same register in the same cycle (read what you just wrote, through a transparent latch)
(Figure annotation: what if this $2 was $13?)

39 © 1998 Morgan Kaufmann Publishers No forwarding datapath

40 © 1998 Morgan Kaufmann Publishers With forwarding datapath

41 © 1998 Morgan Kaufmann Publishers The control values for the forwarding multiplexors

    Mux control     Source    Explanation
    ForwardA = 00   ID/EX     The first ALU operand comes from the register file
    ForwardA = 10   EX/MEM    The first ALU operand is forwarded from the prior ALU result
    ForwardA = 01   MEM/WB    The first ALU operand is forwarded from data memory or an earlier ALU result
    ForwardB = 00   ID/EX     The second ALU operand comes from the register file
    ForwardB = 10   EX/MEM    The second ALU operand is forwarded from the prior ALU result
    ForwardB = 01   MEM/WB    The second ALU operand is forwarded from data memory or an earlier ALU result

    MEM-hazard forwarding conditions:
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))  then ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))  then ForwardB = 01
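
The complete forwarding unit (the EX hazard plus the MEM hazard shown above) can be sketched in C roughly as follows; the field and constant names are illustrative, and the EX-hazard clauses are the standard ones for this chapter's design rather than text taken from the slide:

    /* Forwarding mux selects: 00 = register file, 10 = EX/MEM, 01 = MEM/WB. */
    enum { FWD_REGFILE = 0, FWD_MEMWB = 1, FWD_EXMEM = 2 };

    typedef struct { int reg_write, register_rd; } ExMemReg;
    typedef struct { int reg_write, register_rd; } MemWbReg;
    typedef struct { int register_rs, register_rt; } IdExReg;

    static int forward_a(IdExReg idex, ExMemReg exmem, MemWbReg memwb) {
        /* EX hazard: the newest result, still in EX/MEM, has priority. */
        if (exmem.reg_write && exmem.register_rd != 0 &&
            exmem.register_rd == idex.register_rs)
            return FWD_EXMEM;                       /* ForwardA = 10 */
        /* MEM hazard: otherwise take the older result from MEM/WB. */
        if (memwb.reg_write && memwb.register_rd != 0 &&
            memwb.register_rd == idex.register_rs)
            return FWD_MEMWB;                       /* ForwardA = 01 */
        return FWD_REGFILE;                         /* ForwardA = 00 */
    }
    /* forward_b is identical with register_rt in place of register_rs. */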

42 © 1998 Morgan Kaufmann Publishers The modified datapath resolves hazards via forwarding

43 © 1998 Morgan Kaufmann Publishers

44 © 1998 Morgan Kaufmann Publishers

45 © 1998 Morgan Kaufmann Publishers Register file forwarding of $2 (through the transparent latch).

46 © 1998 Morgan Kaufmann Publishers Forward vs. Reg. Write
The $4 operand matches simultaneously in the MEM stage and the WB stage:
– forward from the nearest stage, MEM (this follows the sequential programming model);
– write from the WB stage to the register file (RF).

47 © 1998 Morgan Kaufmann Publishers Can't always forward
Load word can still cause a hazard:
– an instruction tries to read a register in the cycle immediately following a load that writes the same register.
Thus we need a hazard detection unit to stall the instruction that depends on the load.
– Latch-based RF: read what you just write (write, then read).

48 © 1998 Morgan Kaufmann Publishers Stalling
We can stall the pipeline by keeping an instruction in the same stage.

49 © 1998 Morgan Kaufmann Publishers Hazard Detection Unit
Stall by letting an instruction that won't write anything go forward (see the sketch below).
(Figure annotations: stall; insert NOP; the load continues to the next stage.)
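
A minimal C sketch of the load-use check done in the ID stage, assuming the standard condition for this chapter's pipeline (the names are illustrative): if the instruction currently in EX is a load whose destination register matches either source of the instruction in ID, the pipeline stalls for one cycle.

    /* Returns 1 when the instruction in ID must be held back one cycle. */
    static int load_use_hazard(int idex_mem_read, int idex_register_rt,
                               int ifid_register_rs, int ifid_register_rt) {
        return idex_mem_read &&
               (idex_register_rt == ifid_register_rs ||
                idex_register_rt == ifid_register_rt);
    }

    /* On a stall: zero the control signals entering EX (the bubble/NOP),
     * and deassert PCWrite and IF/IDWrite so the same instructions are
     * fetched and decoded again next cycle while the load goes on ahead. */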

50 © 1998 Morgan Kaufmann Publishers

51 © 1998 Morgan Kaufmann Publishers

52 © 1998 Morgan Kaufmann Publishers

53 © 1998 Morgan Kaufmann Publishers

54 © 1998 Morgan Kaufmann Publishers

55 © 1998 Morgan Kaufmann Publishers

56 © 1998 Morgan Kaufmann Publishers Branch Hazards
When we decide to branch, other instructions are already in the pipeline!
We are predicting the branch is not taken:
– need to add hardware for flushing instructions if we are wrong (a simple sketch follows).
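
A rough sketch of the flush hardware under the predict-not-taken policy of this chapter (the IF.Flush idea is the textbook's; the C names around it are assumptions): when the branch is resolved as taken, the one instruction fetched behind it is squashed by clearing the IF/ID register, which decodes as a MIPS nop.

    /* Called when the branch condition is resolved (in ID in the
     * optimized datapath of the next slide). */
    static void resolve_branch(int taken, unsigned target,
                               unsigned *pc, unsigned *if_id_instruction) {
        if (taken) {
            *pc = target;              /* redirect instruction fetch          */
            *if_id_instruction = 0;    /* IF.Flush: 0x00000000 is a MIPS nop  */
        }
    }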

57 © 1998 Morgan Kaufmann Publishers Flushing Instructions
Optimized datapath for branch performance: the branch is resolved earlier, reducing the branch delay from 3 cycles to 1.
(Figure also shows the original datapath for comparison.)

58 © 1998 Morgan Kaufmann Publishers Instruction Flushing for Branch

59 © 1998 Morgan Kaufmann Publishers Instruction Flushing for Branch (the "and" instruction is flushed)

60 © 1998 Morgan Kaufmann Publishers Improving Performance
Try to avoid stalls! E.g., reorder these instructions:
    Before:                     After:
    lw  $t0, 0($t1)             lw  $t0, 0($t1)
    lw  $t2, 4($t1)             lw  $t2, 4($t1)
    sw  $t2, 0($t1)             sw  $t0, 4($t1)
    sw  $t0, 4($t1)             sw  $t2, 0($t1)
Add a branch delay slot (delayed branch):
– the next instruction after a branch is always executed
– rely on the compiler to fill the slot with something useful
    Before:                     With delayed branch:
    add $2, $3, $4              beq $9, $10, 400
    beq $9, $10, 400            add $2, $3, $4     ; always executed
    sub $11, $12, $13           ...
Superscalar: start more than one instruction in the same cycle.

61 © 1998 Morgan Kaufmann Publishers Utilizing the branch delay slot (compiler's task)

62 © 1998 Morgan Kaufmann Publishers Final data/control path for hazard handling

63 © 1998 Morgan Kaufmann Publishers Dynamic Branch Prediction

64 © 1998 Morgan Kaufmann Publishers Final data/control path for exception handling
1. flush the instructions behind the faulting one;
2. save the PC (PC + 4) and record the Cause;
3. set the new PC;
4. the overflowed instruction (in EX) becomes a NOP.

65 © 1998 Morgan Kaufmann Publishers Overflow!

66 © 1998 Morgan Kaufmann Publishers Flushing (NOPs); then continue.

67 © 1998 Morgan Kaufmann Publishers Fig. 6.56: Complete datapath and control for Chapter 6

68 © 1998 Morgan Kaufmann Publishers Superscalar Execution
One ALU/branch instruction and one load/store instruction issue together each clock cycle:
    Instruction type             Pipe stages
    ALU or branch instruction    IF ID EX MEM WB
    Load or store instruction    IF ID EX MEM WB
    ALU or branch instruction       IF ID EX MEM WB
    Load or store instruction       IF ID EX MEM WB
    ALU or branch instruction          IF ID EX MEM WB
    Load or store instruction          IF ID EX MEM WB
    ALU or branch instruction             IF ID EX MEM WB
    Load or store instruction             IF ID EX MEM WB

69 © 1998 Morgan Kaufmann Publishers

70 © 1998 Morgan Kaufmann Publishers Simple Superscalar Code Scheduling
Loop being compiled:
    do { *i = *i + B; i = i - 4; } while (i != 0);
MIPS code:
    Loop: lw   $t0, 0($s1)       # $t0 = array element ($s1 is i)
          addu $t0, $t0, $s2     # add scalar in $s2 (B)
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch if $s1 != 0
Scheduled for the two-issue pipeline:
    ALU or branch instruction    Data transfer instruction    Clock cycle
    Loop:                        lw   $t0, 0($s1)             1
    addi $s1, $s1, -4                                         2
    addu $t0, $t0, $s2                                        3
    bne  $s1, $zero, Loop        sw   $t0, 4($s1)             4

71 © 1998 Morgan Kaufmann Publishers Loop Unrolling for Superscalar Pipelines
Unrolled loop being compiled:
    do {
        i = i - 16;
        *i       = *i       + B;
        *(i + 12) = *(i + 12) + B;
        *(i + 8)  = *(i + 8)  + B;
        *(i + 4)  = *(i + 4)  + B;
    } while (i != 0);
Scheduled for the two-issue pipeline:
    ALU or branch instruction    Data transfer instruction    Clock cycle
    Loop: addi $s1, $s1, -16     lw   $t0, 0($s1)             1
                                 lw   $t1, 12($s1)            2
    addu $t0, $t0, $s2           lw   $t2, 8($s1)             3
    addu $t1, $t1, $s2           lw   $t3, 4($s1)             4
    addu $t2, $t2, $s2           sw   $t0, 16($s1)            5
    addu $t3, $t3, $s2           sw   $t1, 12($s1)            6
                                 sw   $t2, 8($s1)             7
    bne  $s1, $zero, Loop        sw   $t3, 4($s1)             8

72 © 1998 Morgan Kaufmann Publishers Loop Unrolling
A superscalar machine has the hardware to perform calculations in parallel. For the C source code
    for (i = 100; i != 0; i--) { A[i] = A[i] + 1; }
the unrolled version is
    for (i = 100; i != 0; i = i - 4) {
        A[i]     = A[i]     + 1;
        A[i - 1] = A[i - 1] + 1;
        A[i - 2] = A[i - 2] + 1;
        A[i - 3] = A[i - 3] + 1;
    }
On a single-issue processor the two versions behave the same, but on a superscalar the larger number of independent operations in the loop body provides a richer opportunity for parallel execution.

73 © 1998 Morgan Kaufmann Publishers

74 © 1998 Morgan Kaufmann Publishers

75 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: dispatch add

76 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: dispatch subi

77 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: execute add

78 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: execute subi

79 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: write back add

80 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: write back subi

81 © 1998 Morgan Kaufmann Publishers

82 © 1998 Morgan Kaufmann Publishers

83 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling
The hardware performs the scheduling:
– hardware tries to find instructions to execute
– out-of-order execution is possible
– speculative execution and dynamic branch prediction
All modern processors are very complicated:
– DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
– PowerPC and Pentium: branch history table
– compiler technology is important
This class has given you the background you need to learn more.
Video: An Overview of the Intel Pentium Processor (available from University Video Communications)

84 © 1998 Morgan Kaufmann Publishers Figure 6.52: The performance consequences of the single-cycle, multiple-cycle, and pipelined implementations

85 © 1998 Morgan Kaufmann Publishers Figure 6.53: Basic relationship between the datapaths in Figure 6.52

