1 1998 Morgan Kaufmann Publishers Chapter Six
2 1998 Morgan Kaufmann Publishers Pipelining
Improve performance by increasing instruction throughput.
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
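As a rough back-of-the-envelope model (not on the slide), assume n instructions, k perfectly balanced stages of delay t each, and no hazards:
$$T_{\text{unpipelined}} = n\,k\,t, \qquad T_{\text{pipelined}} = (k + n - 1)\,t, \qquad \text{Speedup} = \frac{n\,k}{k + n - 1} \;\rightarrow\; k \ \text{as } n \rightarrow \infty.$$
In practice, unbalanced stages, pipeline-register overhead, and hazards keep the real speedup below k.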
3 1998 Morgan Kaufmann Publishers Pipelining
What makes it easy?
–all instructions are the same length
–just a few instruction formats
–memory operands appear only in loads and stores
What makes it hard?
–structural hazards: suppose we had only one memory
–control hazards: need to worry about branch instructions
–data hazards: an instruction depends on a previous instruction
We will build a simple pipeline and look at these issues.
We will talk about modern processors and what really makes it hard:
–exception handling
–trying to improve performance with out-of-order execution, etc.
4 1998 Morgan Kaufmann Publishers Basic Idea What do we need to add to actually split the datapath into stages?
5 1998 Morgan Kaufmann Publishers Pipelined Datapath Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
6 1998 Morgan Kaufmann Publishers Corrected Datapath
7 1998 Morgan Kaufmann Publishers Instruction fetch stage of lw
8 1998 Morgan Kaufmann Publishers Instruction decode stage of lw
9 1998 Morgan Kaufmann Publishers Execution stage of lw
10 1998 Morgan Kaufmann Publishers Memory stage of lw
11 1998 Morgan Kaufmann Publishers Write back stage of lw
12 1998 Morgan Kaufmann Publishers Execution stage of sw
13 1998 Morgan Kaufmann Publishers Memory stage of sw
14 1998 Morgan Kaufmann Publishers Write back stage of sw
15 1998 Morgan Kaufmann Publishers Instructions flowing in pipeline
16 1998 Morgan Kaufmann Publishers Instructions flowing in pipeline (cont.)
17 1998 Morgan Kaufmann Publishers Instructions flowing in pipeline (cont.)
18 1998 Morgan Kaufmann Publishers Instructions flowing in pipeline (cont.)
19 1998 Morgan Kaufmann Publishers Instructions flowing in pipeline (cont.)
20 1998 Morgan Kaufmann Publishers Instructions flowing in pipeline (cont.)
21 1998 Morgan Kaufmann Publishers Graphically Representing Pipelines
Can help with answering questions like:
–how many cycles does it take to execute this code?
–what is the ALU doing during cycle 4?
–use this representation to help understand datapaths
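A worked example (not on the slide), assuming no stalls: n instructions on a k-stage pipeline need
$$\text{cycles} = k + (n - 1) = 5 + (5 - 1) = 9$$
clock cycles for n = 5 instructions on the 5-stage pipeline, since the first instruction takes 5 cycles and each later one completes one cycle after its predecessor.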
22 1998 Morgan Kaufmann Publishers Pipeline Control
Can these operations be completed one stage earlier? Think about the critical path!
Why does IM have no control signal at all?
Why does RF have only write control?
Can the ALU result be written back in MEM? Think about the cost of an RF write port!
23 1998 Morgan Kaufmann Publishers Pipeline design considerations
Simplification of the control mechanism: make units active in every clock cycle (always enabled). Ex:
–Instruction memory has no control signal.
–Pipeline registers
Minimization of power consumption: explicit control for infrequent operations. Ex: both read and write controls for the data memory.
Cost consideration
–Ex: alternatives for the ALU-result write-back: in MEM or in WB
24 1998 Morgan Kaufmann Publishers Pipeline control
We have 5 stages. What needs to be controlled in each stage?
–Instruction Fetch and PC Increment
–Instruction Decode / Register Fetch
–Execution
–Memory Stage
–Write Back
How would control be handled in an automobile plant? A fancy control center telling everyone what to do? Should we use a finite state machine?
–Centralized
–Distributed
25 1998 Morgan Kaufmann Publishers Pipeline Control
Pass control signals along just like the data:
–Generate all control signals at the decode stage, similar to the single-cycle implementation.
–Pass the generated control signals along the pipeline and consume them at the corresponding stage, similar to the multicycle implementation.
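A minimal C sketch (not from the slides) of this idea: decode bundles the control bits into per-stage groups, and each group rides in the pipeline registers until its stage consumes it. All field names here are illustrative.

#include <stdbool.h>
#include <stdint.h>

typedef struct {            /* consumed in EX  */
    bool    reg_dst;        /* destination register field: rt or rd      */
    bool    alu_src;        /* second ALU operand: register or immediate */
    uint8_t alu_op;         /* ALU operation selector                    */
} ExCtrl;

typedef struct {            /* consumed in MEM */
    bool branch, mem_read, mem_write;
} MemCtrl;

typedef struct {            /* consumed in WB  */
    bool reg_write, mem_to_reg;
} WbCtrl;

typedef struct {            /* control portion of the ID/EX register     */
    ExCtrl  ex;             /* used by EX                                */
    MemCtrl mem;            /* just passed on to EX/MEM                  */
    WbCtrl  wb;             /* passed on to EX/MEM and then MEM/WB       */
} IdExCtrl;

Each stage peels off its own group, so the EX/MEM register carries only the MEM and WB groups, and MEM/WB carries only the WB group.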
26 1998 Morgan Kaufmann Publishers Datapath with Control
27–35 1998 Morgan Kaufmann Publishers (Figure-only slides; no extractable text.)
36 1998 Morgan Kaufmann Publishers Dependencies
Problem with starting the next instruction before the first is finished:
–dependencies that point backward in time are data hazards
37 1998 Morgan Kaufmann Publishers Software Solution
Have the compiler guarantee no hazards. Where do we insert the nops?
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
Insert NOPs:
sub $2, $1, $3
nop
nop
nop
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
(sub does not write $2 until the end of cycle 5, so three nops are needed to delay and's register read until cycle 6.)
Problem: this really slows us down!
38 1998 Morgan Kaufmann Publishers Forwarding
Use temporary results, don't wait for them to be written:
–ALU forwarding
–Register file forwarding (latch-based register file, i.e. a transparent latch) to handle a read and a write to the same register: read what you just write!
(Figure callout: what if this $2 was $13?)
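A toy C model (not from the slides) of the register file's internal write-then-read forwarding; the names and signature are illustrative.

#include <stdbool.h>
#include <stdint.h>

uint32_t regs[32];

/* Read port with internal forwarding: if the same register is being written
 * in this cycle, return the incoming write data instead of the stale value. */
uint32_t rf_read(unsigned read_reg, bool wb_reg_write,
                 unsigned wb_reg, uint32_t wb_data)
{
    if (wb_reg_write && wb_reg != 0 && wb_reg == read_reg)
        return wb_data;        /* "read what you just write" */
    return regs[read_reg];
}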
39 1998 Morgan Kaufmann Publishers No forwarding datapath
40 1998 Morgan Kaufmann Publishers With forwarding datapath
41 1998 Morgan Kaufmann Publishers The control values for the forwarding multiplexors
Mux control    Source   Explanation
ForwardA = 00  ID/EX    The first ALU operand comes from the register file
ForwardA = 10  EX/MEM   The first ALU operand is forwarded from the prior ALU result
ForwardA = 01  MEM/WB   The first ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00  ID/EX    The second ALU operand comes from the register file
ForwardB = 10  EX/MEM   The second ALU operand is forwarded from the prior ALU result
ForwardB = 01  MEM/WB   The second ALU operand is forwarded from data memory or an earlier ALU result
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
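The slide lists only the MEM/WB (01) conditions. The C sketch below (not from the slides) also covers the usual EX/MEM (10) case and gives it priority, which has the same effect as the "EX/MEM.RegisterRd ≠ ..." guard above; all names are illustrative.

#include <stdbool.h>

typedef struct {
    bool     ex_mem_reg_write, mem_wb_reg_write;
    unsigned ex_mem_rd, mem_wb_rd;   /* destination registers               */
    unsigned id_ex_rs, id_ex_rt;     /* source registers of the instr in EX */
} FwdInputs;

/* Returns the mux select for one ALU input: 0 = 00 (register file),
 * 2 = 10 (forward from EX/MEM), 1 = 01 (forward from MEM/WB). */
unsigned forward(const FwdInputs *f, unsigned src)
{
    if (f->ex_mem_reg_write && f->ex_mem_rd != 0 && f->ex_mem_rd == src)
        return 2;                    /* nearest (most recent) result wins   */
    if (f->mem_wb_reg_write && f->mem_wb_rd != 0 && f->mem_wb_rd == src)
        return 1;
    return 0;
}

/* ForwardA = forward(&f, f.id_ex_rs);  ForwardB = forward(&f, f.id_ex_rt); */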
42 1998 Morgan Kaufmann Publishers The modified datapath resolves hazards via forwarding
43–44 1998 Morgan Kaufmann Publishers (Figure-only slides; no extractable text.)
45 1998 Morgan Kaufmann Publishers Register file forwarding of $2 (through the transparent latch).
46 1998 Morgan Kaufmann Publishers When the $4 operand matches in both the MEM and the WB stage simultaneously, forward from the nearest stage, MEM (this follows the sequential programming model); the WB stage still writes its result to the register file (RF).
47 1998 Morgan Kaufmann Publishers Can't always forward
Load word can still cause a hazard:
–an instruction tries to read a register right after a load instruction that writes the same register.
Thus we need a hazard detection unit that stalls the instruction depending on the load.
(With the latch-based RF you read what you just write: write, then read.)
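A small C sketch (not from the slides) of the classic load-use detection condition; the signal names are illustrative.

#include <stdbool.h>

/* Stall when the instruction in EX is a load whose destination register
 * matches a source register of the instruction currently being decoded. */
bool load_use_stall(bool id_ex_mem_read, unsigned id_ex_rt,
                    unsigned if_id_rs, unsigned if_id_rt)
{
    return id_ex_mem_read &&
           (id_ex_rt == if_id_rs || id_ex_rt == if_id_rt);
}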
48 1998 Morgan Kaufmann Publishers Stalling We can stall the pipeline by keeping an instruction in the same stage
49 1998 Morgan Kaufmann Publishers Hazard Detection Unit
Stall by letting an instruction that won't write anything (an inserted nop, or bubble) go forward, while the load continues to its next stage.
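One way (not from the slides) to express the bubble insertion in C; the three outputs correspond to holding the PC, holding the IF/ID register, and zeroing the control signals that enter ID/EX. Names are illustrative.

#include <stdbool.h>

void apply_stall(bool stall, bool *pc_write, bool *if_id_write,
                 bool *id_ex_ctrl_zero)
{
    *pc_write        = !stall;   /* don't fetch a new instruction this cycle   */
    *if_id_write     = !stall;   /* re-decode the same instruction next cycle  */
    *id_ex_ctrl_zero =  stall;   /* zeroed control moves into EX as a bubble   */
}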
50–55 1998 Morgan Kaufmann Publishers (Figure-only slides; no extractable text.)
56 1998 Morgan Kaufmann Publishers Branch Hazards
When we decide to branch, other instructions are already in the pipeline!
We are predicting branch not taken:
–need to add hardware for flushing instructions if we are wrong
57 1998 Morgan Kaufmann Publishers Flushing Instructions
Original datapath vs. datapath optimized for branch performance: moving the branch decision earlier reduces the branch delay from 3 instructions to 1.
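A sketch (not from the slides) of the single-instruction flush once the branch is resolved early, in the ID stage; the names are illustrative.

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t pc, instr; } IfId;

void resolve_branch_in_id(bool is_beq, uint32_t rs_val, uint32_t rt_val,
                          uint32_t branch_target, uint32_t *pc, IfId *if_id)
{
    if (is_beq && rs_val == rt_val) {  /* branch taken                         */
        *pc = branch_target;           /* fetch from the target next cycle     */
        if_id->instr = 0;              /* flush: 0x00000000 encodes a MIPS nop */
    }
}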
58 1998 Morgan Kaufmann Publishers Instruction Flushing for Branch
59 1998 Morgan Kaufmann Publishers Instruction Flushing for Branch (the and instruction is flushed)
60 1998 Morgan Kaufmann Publishers Improving Performance
Try to avoid stalls! E.g., reorder these instructions:
Original:              Reordered:
lw  $t0, 0($t1)        lw  $t0, 0($t1)
lw  $t2, 4($t1)        lw  $t2, 4($t1)
sw  $t2, 0($t1)        sw  $t0, 4($t1)
sw  $t0, 4($t1)        sw  $t2, 0($t1)
(Moving sw $t0 between lw $t2 and sw $t2 removes the load-use stall.)
Add a branch delay slot (delayed branch):
–the next instruction after a branch is always executed
–rely on the compiler to fill the slot with something useful
Original:              With the delay slot filled:
add $2, $3, $4         beq $9, $10, 400
beq $9, $10, 400       add $2, $3, $4    ; always executed
                       sub $11, $12, $13
Superscalar: start more than one instruction in the same cycle
61 1998 Morgan Kaufmann Publishers Utilizing the branch delay slot (the compiler's task)
62 1998 Morgan Kaufmann Publishers Final data/control path for hazard handling
63 1998 Morgan Kaufmann Publishers Dynamic Branch Prediction
64 1998 Morgan Kaufmann Publishers Final data/control path for exception handling
1. flush the following instructions;
2. save the PC (PC+4) and set the Cause register;
3. set the new PC;
4. the overflowed instruction (in EX) becomes a NOP
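A very rough C sketch (not from the slides) of the four steps above; the handler address and cause code below are placeholders, not architected MIPS values.

#include <stdint.h>

#define EXC_HANDLER_ADDR 0x00000000u   /* placeholder exception entry point */
#define CAUSE_OVERFLOW   12u           /* placeholder cause code            */

void take_overflow_exception(uint32_t faulting_pc, uint32_t *pc,
                             uint32_t *epc, uint32_t *cause,
                             uint32_t *if_id, uint32_t *id_ex, uint32_t *ex_ctrl)
{
    *if_id = *id_ex = *ex_ctrl = 0;    /* 1. flush; the overflowed instruction */
                                       /*    in EX becomes a nop (step 4)      */
    *epc   = faulting_pc + 4;          /* 2. save PC (PC+4)                    */
    *cause = CAUSE_OVERFLOW;           /*    and record the cause              */
    *pc    = EXC_HANDLER_ADDR;         /* 3. set the new PC                    */
}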
65 1998 Morgan Kaufmann Publishers Overflow!
66 1998 Morgan Kaufmann Publishers Flushing (NOPs), then execution continues
67 1998 Morgan Kaufmann Publishers Fig. 6.56: Complete datapath and control for Chapter 6
68 1998 Morgan Kaufmann Publishers Superscalar Execution
Instruction type             Pipe stages
ALU or branch instruction    IF ID EX MEM WB
Load or store instruction    IF ID EX MEM WB
ALU or branch instruction       IF ID EX MEM WB
Load or store instruction       IF ID EX MEM WB
ALU or branch instruction          IF ID EX MEM WB
Load or store instruction          IF ID EX MEM WB
ALU or branch instruction             IF ID EX MEM WB
Load or store instruction             IF ID EX MEM WB
(One ALU/branch instruction and one load/store instruction are issued together each clock cycle.)
69 1998 Morgan Kaufmann Publishers (Figure-only slide; no extractable text.)
70 1998 Morgan Kaufmann Publishers Simple Superscalar Code Scheduling
Loop: lw   $t0, 0($s1)        # $t0 = array element ($s1 is I)
      addu $t0, $t0, $s2      # add scalar in $s2 (B)
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0
Source loop (pseudo-code):
Do { *I = *I + B; I = I - 4; } While (I != 0);
Scheduled for the two-issue pipeline:
ALU or branch instruction      Data transfer instruction   Clock cycle
Loop:                          lw  $t0, 0($s1)             1
addi $s1, $s1, -4                                          2
addu $t0, $t0, $s2                                         3
bne  $s1, $zero, Loop          sw  $t0, 4($s1)             4
(One loop iteration now completes in 4 clock cycles.)
71 1998 Morgan Kaufmann Publishers Loop Unrolling for Superscalar Pipelines
ALU or branch instruction      Data transfer instruction   Clock cycle
Loop: addi $s1, $s1, -16       lw  $t0, 0($s1)             1
                               lw  $t1, 12($s1)            2
addu $t0, $t0, $s2             lw  $t2, 8($s1)             3
addu $t1, $t1, $s2             lw  $t3, 4($s1)             4
addu $t2, $t2, $s2             sw  $t0, 16($s1)            5
addu $t3, $t3, $s2             sw  $t1, 12($s1)            6
                               sw  $t2, 8($s1)             7
bne  $s1, $zero, Loop          sw  $t3, 4($s1)             8
Unrolled loop (pseudo-code):
Do { I = I - 16; *(I+16) = *(I+16) + B; *(I+12) = *(I+12) + B; *(I+8) = *(I+8) + B; *(I+4) = *(I+4) + B; } While (I != 0);
(14 instructions in 8 clock cycles, versus 5 instructions in 4 cycles without unrolling.)
72 1998 Morgan Kaufmann Publishers Loop Unrolling
A superscalar has the hardware to perform calculations in parallel. For the C source code
for (i = 100; i != 0; i--) { A[i] = A[i] + 1; }
the unrolled version is
for (i = 100; i != 0; i = i - 4) { A[i] = A[i] + 1; A[i-1] = A[i-1] + 1; A[i-2] = A[i-2] + 1; A[i-3] = A[i-3] + 1; }
On a single-issue processor the two versions behave the same, but on a superscalar the larger number of independent operations per iteration provides a richer opportunity for parallel execution.
73–74 1998 Morgan Kaufmann Publishers (Figure-only slides; no extractable text.)
75 1998 Morgan Kaufmann Publishers Dynamic Scheduling: dispatch add
76 1998 Morgan Kaufmann Publishers Dynamic Scheduling: dispatch subi
77 1998 Morgan Kaufmann Publishers Dynamic Scheduling: execute add
78 1998 Morgan Kaufmann Publishers Dynamic Scheduling: execute subi
79 1998 Morgan Kaufmann Publishers Dynamic Scheduling: write back add
80 1998 Morgan Kaufmann Publishers Dynamic Scheduling: write back subi
81–82 1998 Morgan Kaufmann Publishers (Figure-only slides; no extractable text.)
83 1998 Morgan Kaufmann Publishers Dynamic Scheduling
The hardware performs the scheduling:
–hardware tries to find instructions to execute
–out-of-order execution is possible
–speculative execution and dynamic branch prediction
All modern processors are very complicated:
–DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
–PowerPC and Pentium: branch history table
–Compiler technology is important
This class has given you the background you need to learn more.
Video: An Overview of Intel Pentium Processor (available from University Video Communications)
84 1998 Morgan Kaufmann Publishers Figure 6.52: The performance consequences of single-cycle, multiple-cycle, and pipelined implementations
85 1998 Morgan Kaufmann Publishers Figure 6.53: Basic relationship between the datapaths in Figure 6.52