1 © 1998 Morgan Kaufmann Publishers Chapter Six

2 © 1998 Morgan Kaufmann Publishers Pipelining: improve performance by increasing instruction throughput. The ideal speedup is the number of stages in the pipeline. Do we achieve this?
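
To make the "ideal speedup equals the number of stages" claim concrete, here is a small worked example, assuming a 5-stage pipeline, n instructions, equal stage times T, and no stalls (the numbers are illustrative, not from the slides):

\[
\text{Speedup} = \frac{\text{Time}_{\text{non-pipelined}}}{\text{Time}_{\text{pipelined}}}
             = \frac{n \cdot 5T}{(n + 4)\,T}
             = \frac{5n}{n + 4} \;\xrightarrow{\;n \to \infty\;}\; 5
\]

For n = 100 instructions this is already 5(100)/104 ≈ 4.8, but hazards and stalls keep real pipelines below the ideal.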

3 © 1998 Morgan Kaufmann Publishers Pipelining
What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
We will build a simple pipeline and look at these issues.
We will talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.

4 © 1998 Morgan Kaufmann Publishers Basic Idea: what do we need to add to actually split the datapath into stages?

5 © 1998 Morgan Kaufmann Publishers Pipelined Datapath: can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

6 © 1998 Morgan Kaufmann Publishers Corrected Datapath

7 © 1998 Morgan Kaufmann Publishers Instruction fetch stage of lw

8 © 1998 Morgan Kaufmann Publishers Instruction decode stage of lw

9 © 1998 Morgan Kaufmann Publishers Execution stage of lw

10 © 1998 Morgan Kaufmann Publishers Memory stage of lw

11 © 1998 Morgan Kaufmann Publishers Write back stage of lw

12 © 1998 Morgan Kaufmann Publishers Execution stage of sw

13 © 1998 Morgan Kaufmann Publishers Memory stage of sw

14 © 1998 Morgan Kaufmann Publishers Write back stage of sw

15 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline

16 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

17 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

18 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

19 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

20 © 1998 Morgan Kaufmann Publishers Instructions flowing in the pipeline (cont.)

21 © 1998 Morgan Kaufmann Publishers Graphically Representing Pipelines
Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
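
For the first question, the pipeline-diagram view gives the count directly. Assuming the chapter's 5-stage pipeline and no stalls, with k stages and n instructions the last instruction completes in cycle

\[
\text{cycles} = k + (n - 1), \qquad \text{e.g. } 5 + (5 - 1) = 9 \text{ cycles for a 5-instruction sequence,}
\]

and every stall or flush visible in the diagram adds one cycle to this total.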

22 © 1998 Morgan Kaufmann Publishers Pipeline Control
– Can these operations be completed one stage earlier? Think about the critical path!
– Why does the instruction memory (IM) have no control signal at all?
– Why does the register file (RF) have only a write control?
– Could the ALU result be written back in MEM? Consider the cost of an RF write port!

23 © 1998 Morgan Kaufmann Publishers Pipeline design considerations
Simplification of the control mechanism: units active in every clock cycle can be always enabled.
– Ex: the instruction memory has no control signal; neither do the pipeline registers.
Minimization of power consumption: explicit control for infrequent operations.
– Ex: both read and write controls for the data memory.
Cost consideration
– Ex: alternatives for writing back the ALU result, in MEM or in WB.

24 © 1998 Morgan Kaufmann Publishers Pipeline control
We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back
How would control be handled in an automobile plant? A fancy control center telling everyone what to do? Should we use a finite state machine?
– Centralized
– Distributed

25 © 1998 Morgan Kaufmann Publishers Pipeline Control
Pass control signals along just like the data (see the sketch below):
– Generate all control signals at the decode stage, similar to the single-cycle implementation.
– Pass the generated control signals along the pipeline and consume them at the corresponding stage, similar to the multicycle implementation.
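
A minimal C sketch of this organization, assuming a simplified MIPS-style five-stage pipeline (the struct and field names are illustrative, not taken from the slides): ID produces every control bit once, and each pipeline register carries the bits that later stages still need.

    #include <stdint.h>
    #include <stdbool.h>

    /* All control bits are generated in the decode stage, grouped by the
     * stage that will consume them. */
    typedef struct {
        bool    reg_dst, alu_src;            /* consumed in EX  */
        uint8_t alu_op;                      /* consumed in EX  */
        bool    branch, mem_read, mem_write; /* consumed in MEM */
        bool    reg_write, mem_to_reg;       /* consumed in WB  */
    } Control;

    /* Each pipeline register holds the data for its stage plus the
     * control bits that have not been consumed yet. */
    typedef struct { uint32_t rs_val, rt_val, imm; uint8_t rs, rt, rd; Control ctl; } ID_EX;
    typedef struct { uint32_t alu_result, rt_val; uint8_t dest_reg;    Control ctl; } EX_MEM;
    typedef struct { uint32_t alu_result, mem_data; uint8_t dest_reg;  Control ctl; } MEM_WB;

    /* On every clock edge the unconsumed control simply advances with the
     * instruction, e.g. ex_mem.ctl = id_ex.ctl; mem_wb.ctl = ex_mem.ctl; */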

26 © 1998 Morgan Kaufmann Publishers Datapath with Control

27 © 1998 Morgan Kaufmann Publishers

28 © 1998 Morgan Kaufmann Publishers

29 © 1998 Morgan Kaufmann Publishers

30 © 1998 Morgan Kaufmann Publishers $4, $5

31 © 1998 Morgan Kaufmann Publishers

32 © 1998 Morgan Kaufmann Publishers

33 © 1998 Morgan Kaufmann Publishers

34 © 1998 Morgan Kaufmann Publishers

35 © 1998 Morgan Kaufmann Publishers

36 © 1998 Morgan Kaufmann Publishers Dependencies
Problem with starting the next instruction before the first is finished:
– dependencies that point backward in time are data hazards

37 © 1998 Morgan Kaufmann Publishers Software Solution
Have the compiler guarantee no hazards. Where do we insert the nops?
    sub  $2, $1, $3
    and  $12, $2, $5
    or   $13, $6, $2
    add  $14, $2, $2
    sw   $15, 100($2)
Insert NOPs:
    sub  $2, $1, $3
    nop
    nop
    nop
    and  $12, $2, $5
    or   $13, $6, $2
    add  $14, $2, $2
    sw   $15, 100($2)
Problem: this really slows us down!

38 © 1998 Morgan Kaufmann Publishers Forwarding
Use temporary results; don't wait for them to be written back:
– ALU forwarding
– Register file forwarding (latch-based register file) to handle a read and write of the same register in the same cycle (read what you just wrote, through a transparent latch)
(Figure annotation: what if this $2 was $13?)

39 © 1998 Morgan Kaufmann Publishers No forwarding datapath

40 © 1998 Morgan Kaufmann Publishers With forwarding datapath

41 © 1998 Morgan Kaufmann Publishers The control values for the forwarding multiplexors

    Mux control     Source    Explanation
    ForwardA = 00   ID/EX     The first ALU operand comes from the register file
    ForwardA = 10   EX/MEM    The first ALU operand is forwarded from the prior ALU result
    ForwardA = 01   MEM/WB    The first ALU operand is forwarded from data memory or an earlier ALU result
    ForwardB = 00   ID/EX     The second ALU operand comes from the register file
    ForwardB = 10   EX/MEM    The second ALU operand is forwarded from the prior ALU result
    ForwardB = 01   MEM/WB    The second ALU operand is forwarded from data memory or an earlier ALU result

    MEM-hazard forwarding conditions:
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))  then ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))  then ForwardB = 01
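
The complete forwarding unit (the EX hazard plus the MEM hazard shown above) can be sketched in C roughly as follows; the field and constant names are illustrative, and the EX-hazard clauses are the standard ones for this chapter's design rather than text taken from the slide:

    /* Forwarding mux selects: 00 = register file, 10 = EX/MEM, 01 = MEM/WB. */
    enum { FWD_REGFILE = 0, FWD_MEMWB = 1, FWD_EXMEM = 2 };

    typedef struct { int reg_write, register_rd; } ExMemReg;
    typedef struct { int reg_write, register_rd; } MemWbReg;
    typedef struct { int register_rs, register_rt; } IdExReg;

    static int forward_a(IdExReg idex, ExMemReg exmem, MemWbReg memwb) {
        /* EX hazard: the newest result, still in EX/MEM, has priority. */
        if (exmem.reg_write && exmem.register_rd != 0 &&
            exmem.register_rd == idex.register_rs)
            return FWD_EXMEM;                       /* ForwardA = 10 */
        /* MEM hazard: otherwise take the older result from MEM/WB. */
        if (memwb.reg_write && memwb.register_rd != 0 &&
            memwb.register_rd == idex.register_rs)
            return FWD_MEMWB;                       /* ForwardA = 01 */
        return FWD_REGFILE;                         /* ForwardA = 00 */
    }
    /* forward_b is identical with register_rt in place of register_rs. */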

42 © 1998 Morgan Kaufmann Publishers The modified datapath resolves hazards via forwarding

43 © 1998 Morgan Kaufmann Publishers

44 © 1998 Morgan Kaufmann Publishers

45 © 1998 Morgan Kaufmann Publishers Register file forwarding of $2 (through the transparent latch).

46 © 1998 Morgan Kaufmann Publishers Forward vs. Reg. Write
The $4 operand matches simultaneously in the MEM stage and the WB stage:
– forward from the nearest stage, MEM (this follows the sequential programming model);
– write from the WB stage to the register file (RF).

47 © 1998 Morgan Kaufmann Publishers Can't always forward
Load word can still cause a hazard:
– an instruction tries to read a register in the cycle immediately following a load that writes the same register.
Thus we need a hazard detection unit to stall the instruction that depends on the load.
– Latch-based RF: read what you just write (write, then read).

48 © 1998 Morgan Kaufmann Publishers Stalling
We can stall the pipeline by keeping an instruction in the same stage.

49 © 1998 Morgan Kaufmann Publishers Hazard Detection Unit
Stall by letting an instruction that won't write anything go forward (see the sketch below).
(Figure annotations: stall; insert NOP; the load continues to the next stage.)
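
A minimal C sketch of the load-use check done in the ID stage, assuming the standard condition for this chapter's pipeline (the names are illustrative): if the instruction currently in EX is a load whose destination register matches either source of the instruction in ID, the pipeline stalls for one cycle.

    /* Returns 1 when the instruction in ID must be held back one cycle. */
    static int load_use_hazard(int idex_mem_read, int idex_register_rt,
                               int ifid_register_rs, int ifid_register_rt) {
        return idex_mem_read &&
               (idex_register_rt == ifid_register_rs ||
                idex_register_rt == ifid_register_rt);
    }

    /* On a stall: zero the control signals entering EX (the bubble/NOP),
     * and deassert PCWrite and IF/IDWrite so the same instructions are
     * fetched and decoded again next cycle while the load goes on ahead. */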

50 © 1998 Morgan Kaufmann Publishers

51 © 1998 Morgan Kaufmann Publishers

52 © 1998 Morgan Kaufmann Publishers

53 © 1998 Morgan Kaufmann Publishers

54 © 1998 Morgan Kaufmann Publishers

55 © 1998 Morgan Kaufmann Publishers

56 © 1998 Morgan Kaufmann Publishers Branch Hazards
When we decide to branch, other instructions are already in the pipeline!
We are predicting the branch is not taken:
– need to add hardware for flushing instructions if we are wrong (a simple sketch follows).
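
A rough sketch of the flush hardware under the predict-not-taken policy of this chapter (the IF.Flush idea is the textbook's; the C names around it are assumptions): when the branch is resolved as taken, the one instruction fetched behind it is squashed by clearing the IF/ID register, which decodes as a MIPS nop.

    /* Called when the branch condition is resolved (in ID in the
     * optimized datapath of the next slide). */
    static void resolve_branch(int taken, unsigned target,
                               unsigned *pc, unsigned *if_id_instruction) {
        if (taken) {
            *pc = target;              /* redirect instruction fetch          */
            *if_id_instruction = 0;    /* IF.Flush: 0x00000000 is a MIPS nop  */
        }
    }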

57 © 1998 Morgan Kaufmann Publishers Flushing Instructions
Optimized datapath for branch performance: the branch is resolved earlier, reducing the branch delay from 3 cycles to 1.
(Figure also shows the original datapath for comparison.)

58 © 1998 Morgan Kaufmann Publishers Instruction Flushing for Branch

59 © 1998 Morgan Kaufmann Publishers Instruction Flushing for Branch (the "and" instruction is flushed)

60 © 1998 Morgan Kaufmann Publishers Improving Performance
Try to avoid stalls! E.g., reorder these instructions:
    Before:                     After:
    lw  $t0, 0($t1)             lw  $t0, 0($t1)
    lw  $t2, 4($t1)             lw  $t2, 4($t1)
    sw  $t2, 0($t1)             sw  $t0, 4($t1)
    sw  $t0, 4($t1)             sw  $t2, 0($t1)
Add a branch delay slot (delayed branch):
– the next instruction after a branch is always executed
– rely on the compiler to fill the slot with something useful
    Before:                     With delayed branch:
    add $2, $3, $4              beq $9, $10, 400
    beq $9, $10, 400            add $2, $3, $4     ; always executed
    sub $11, $12, $13           ...
Superscalar: start more than one instruction in the same cycle.

61 © 1998 Morgan Kaufmann Publishers Utilizing the branch delay slot (compiler's task)

62 © 1998 Morgan Kaufmann Publishers Final data/control path for hazard handling

63 © 1998 Morgan Kaufmann Publishers Dynamic Branch Prediction

64 © 1998 Morgan Kaufmann Publishers Final data/control path for exception handling
1. flush the instructions behind the faulting one;
2. save the PC (PC + 4) and record the Cause;
3. set the new PC;
4. the overflowed instruction (in EX) becomes a NOP.

65 © 1998 Morgan Kaufmann Publishers Overflow!

66 © 1998 Morgan Kaufmann Publishers Flushing (NOPs); then continue.

67 © 1998 Morgan Kaufmann Publishers Fig. 6.56: Complete datapath and control for Chapter 6

68 © 1998 Morgan Kaufmann Publishers Superscalar Execution
One ALU/branch instruction and one load/store instruction issue together each clock cycle:
    Instruction type             Pipe stages
    ALU or branch instruction    IF ID EX MEM WB
    Load or store instruction    IF ID EX MEM WB
    ALU or branch instruction       IF ID EX MEM WB
    Load or store instruction       IF ID EX MEM WB
    ALU or branch instruction          IF ID EX MEM WB
    Load or store instruction          IF ID EX MEM WB
    ALU or branch instruction             IF ID EX MEM WB
    Load or store instruction             IF ID EX MEM WB

69 © 1998 Morgan Kaufmann Publishers

70 © 1998 Morgan Kaufmann Publishers Simple Superscalar Code Scheduling
Loop being compiled:
    do { *i = *i + B; i = i - 4; } while (i != 0);
MIPS code:
    Loop: lw   $t0, 0($s1)       # $t0 = array element ($s1 is i)
          addu $t0, $t0, $s2     # add scalar in $s2 (B)
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch if $s1 != 0
Scheduled for the two-issue pipeline:
    ALU or branch instruction    Data transfer instruction    Clock cycle
    Loop:                        lw   $t0, 0($s1)             1
    addi $s1, $s1, -4                                         2
    addu $t0, $t0, $s2                                        3
    bne  $s1, $zero, Loop        sw   $t0, 4($s1)             4

71 © 1998 Morgan Kaufmann Publishers Loop Unrolling for Superscalar Pipelines
Unrolled loop being compiled:
    do {
        i = i - 16;
        *i       = *i       + B;
        *(i + 12) = *(i + 12) + B;
        *(i + 8)  = *(i + 8)  + B;
        *(i + 4)  = *(i + 4)  + B;
    } while (i != 0);
Scheduled for the two-issue pipeline:
    ALU or branch instruction    Data transfer instruction    Clock cycle
    Loop: addi $s1, $s1, -16     lw   $t0, 0($s1)             1
                                 lw   $t1, 12($s1)            2
    addu $t0, $t0, $s2           lw   $t2, 8($s1)             3
    addu $t1, $t1, $s2           lw   $t3, 4($s1)             4
    addu $t2, $t2, $s2           sw   $t0, 16($s1)            5
    addu $t3, $t3, $s2           sw   $t1, 12($s1)            6
                                 sw   $t2, 8($s1)             7
    bne  $s1, $zero, Loop        sw   $t3, 4($s1)             8

72 © 1998 Morgan Kaufmann Publishers Loop Unrolling
A superscalar machine has the hardware to perform calculations in parallel. For the C source code
    for (i = 100; i != 0; i--) { A[i] = A[i] + 1; }
the unrolled version is
    for (i = 100; i != 0; i = i - 4) {
        A[i]     = A[i]     + 1;
        A[i - 1] = A[i - 1] + 1;
        A[i - 2] = A[i - 2] + 1;
        A[i - 3] = A[i - 3] + 1;
    }
On a single-issue processor the two versions behave the same, but on a superscalar the larger number of independent operations in the loop body provides a richer opportunity for parallel execution.

73 © 1998 Morgan Kaufmann Publishers

74 © 1998 Morgan Kaufmann Publishers

75 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: dispatch add

76 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: dispatch subi

77 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: execute add

78 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: execute subi

79 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: write back add

80 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling: write back subi

81 © 1998 Morgan Kaufmann Publishers

82 © 1998 Morgan Kaufmann Publishers

83 © 1998 Morgan Kaufmann Publishers Dynamic Scheduling
The hardware performs the scheduling:
– hardware tries to find instructions to execute
– out-of-order execution is possible
– speculative execution and dynamic branch prediction
All modern processors are very complicated:
– DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
– PowerPC and Pentium: branch history table
– compiler technology is important
This class has given you the background you need to learn more.
Video: An Overview of the Intel Pentium Processor (available from University Video Communications)

84 © 1998 Morgan Kaufmann Publishers Figure 6.52: The performance consequences of the single-cycle, multiple-cycle, and pipelined implementations

85 © 1998 Morgan Kaufmann Publishers Figure 6.53: Basic relationship between the datapaths in Figure 6.52

