1
CSE 502: Computer Architecture
Core Pipelining
2
Before there was pipelining…
Single-cycle: insn0.(fetch,decode,exec), then insn1.(fetch,decode,exec)
Multi-cycle: insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec
Single-cycle control: hardwired; low CPI (1); long clock period (to accommodate the slowest instruction)
Multi-cycle control: micro-programmed; short clock period; high CPI
Can we have both low CPI and a short clock period?
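A quick way to see the tradeoff is to put numbers on it. The following Python sketch uses illustrative stage latencies (6, 2, and 9 time units) and a 100-instruction program; these numbers are assumptions for illustration, not from the slides.

```python
# Sketch: why neither single-cycle nor multi-cycle wins outright.
# Stage latencies (in time units) are illustrative assumptions.
stages = [6, 2, 9]  # fetch, decode, execute

n_insns = 100

# Single-cycle: clock period must cover the slowest *instruction* (all stages).
single_cycle_clock = sum(stages)          # 17 units, CPI = 1
t_single = n_insns * single_cycle_clock

# Multi-cycle: clock period covers only the slowest *stage*, but CPI = #stages.
multi_cycle_clock = max(stages)           # 9 units, CPI = 3
t_multi = n_insns * len(stages) * multi_cycle_clock

# Pipelined: same short clock, but one instruction completes per cycle
# once the pipeline fills (CPI -> 1).
t_pipe = (n_insns + len(stages) - 1) * multi_cycle_clock

print(t_single, t_multi, t_pipe)  # 1700 2700 918
```

Pipelining combines the multi-cycle clock with near-unit CPI once the pipeline fills, which is exactly the "both" the slide asks for.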
3
Pipelining
[Timing diagram: multi-cycle vs. pipelined execution of insn0-insn2 over time]
Start with the multi-cycle design
When insn0 goes from stage 1 to stage 2, insn1 starts stage 1
Each instruction passes through all stages, but instructions enter and leave at a faster rate
Can have as many insns in flight as there are stages
4
Pipeline Examples (cache lookup: address in, hit? out)
1 stage: stage delay = n, bandwidth ≈ 1/n
2 stages: stage delay = n/2, bandwidth ≈ 2/n
3 stages: stage delay = n/3, bandwidth ≈ 3/n
Increases throughput at the expense of latency
5
Processor Pipeline Review
[Diagram: PC → I-cache (Fetch) → Reg File (Decode) → ALU (Execute) → D-cache (Memory) → Write-back, with a +4 PC increment]
6
Stage 1: Fetch Fetch an instruction from memory every cycle
Use PC to index memory Increment PC (assume no branches for now) Write state to the pipeline register (IF/ID) The next stage will read this pipeline register
7
Stage 1: Fetch Diagram
[Diagram: a mux selects the branch target or PC+1 as the next PC; the PC indexes the Instruction Cache; the instruction bits and PC+1 are written to the IF/ID pipeline register, which Decode reads]
8
Stage 2: Decode Decodes opcode bits
Set up Control signals for later stages Read input operands from register file Specified by decoded instruction bits Write state to the pipeline register (ID/EX) Opcode Register contents PC+1 (even though decode didn’t use it) Control signals (from insn) for opcode and destReg
9
Stage 2: Decode Diagram
[Diagram: the IF/ID register feeds instruction bits to the Register File (read ports regA, regB) and control logic; register contents, PC+1, and control signals are written to the ID/EX pipeline register; the destReg/data write port comes back from Write-back]
10
Stage 3: Execute Perform ALU operations
Calculate result of instruction Control signals select operation Contents of regA used as one input Either regB or constant offset (from insn) used as second input Calculate PC-relative branch target PC+1+(constant offset) Write state to the pipeline register (EX/Mem) ALU result, contents of regB, and PC+1+offset Control signals (from insn) for opcode and destReg
11
Stage 3: Execute Diagram
[Diagram: the ID/EX register feeds regA contents and, via a mux, either regB contents or the constant offset to the ALU; an adder computes PC+1+offset as the branch target; ALU result, regB contents, target, and control signals are written to the EX/Mem pipeline register]
12
Stage 4: Memory Perform data cache access
ALU result contains address for LD or ST Opcode bits control R/W and enable signals Write state to the pipeline register (Mem/WB) ALU result and Loaded data Control signals (from insn) for opcode and destReg
13
Stage 4: Memory Diagram
[Diagram: the EX/Mem register supplies the ALU result as in_addr and regB contents as in_data to the Data Cache (R/W and enable driven by control signals); ALU result, loaded data, and control signals are written to the Mem/WB pipeline register; the branch target goes back to Fetch]
14
Stage 5: Write-back Writing result to register file (if required)
Write Loaded data to destReg for LD Write ALU result to destReg for arithmetic insn Opcode bits control register write enable signal
15
Stage 5: Write-back Diagram
[Diagram: the Mem/WB register feeds the ALU result and loaded data to a mux; control signals select which value is written to destReg in the register file]
16
Putting It All Together
[Diagram: full five-stage datapath — PC mux and Inst Cache into IF/ID; register file (R0-R7) into ID/EX; ALU with valA/valB/offset muxes and eq? branch logic into EX/Mem; Data Cache into Mem/WB; write-back mux to the dest register, with op/dest control signals carried down the pipeline registers]
17
Pipelining Idealism Uniform Sub-operations
Operation can be partitioned into uniform-latency sub-ops
Repetition of Identical Operations: same ops performed on many different inputs
Repetition of Independent Operations: all repetitions of the op are mutually independent
18
Pipeline Realism
Uniform Sub-operations … NOT! Balance pipeline stages: stage quantization to yield balanced stages; minimize internal fragmentation (left-over time near the end of a cycle)
Repetition of Identical Operations … NOT! Unify instruction types: coalesce instruction types into one "multi-function" pipe; minimize external fragmentation (idle stages to match lengths)
Repetition of Independent Operations … NOT! Resolve data and resource hazards: inter-instruction dependency detection and resolution
Pipelining is expensive
19
The Generic Instruction Pipeline
IF Instruction Fetch ID Instruction Decode OF Operand Fetch EX Instruction Execute WB Write-back
20
Balancing Pipeline Stages
TIF = 6 units, TID = 2 units, TOF = 9 units, TEX = 5 units, TOS = 9 units
Without pipelining: Tcyc = TIF + TID + TOF + TEX + TOS = 31
Pipelined: Tcyc = max{TIF, TID, TOF, TEX, TOS} = 9
Speedup = 31 / 9 ≈ 3.4
Can we do better?
21
Balancing Pipeline Stages (1/2)
Two methods for stage quantization Merge multiple sub-ops into one Divide sub-ops into smaller pieces Recent/Current trends Deeper pipelines (more and more stages) Multiple different pipelines/sub-pipelines Pipelining of memory accesses
22
Balancing Pipeline Stages (2/2)
Coarser-grained machine cycle: 4 machine cycles / instruction, with stages IF&ID, OF, EX, WB (TIF&ID = 8, TOF = 9, TEX = 5, TOS = 9 units); # stages = 4, Tcyc = 9 units
Finer-grained machine cycle: 11 machine cycles / instruction, the same sub-ops divided into 11 stages; # stages = 11, Tcyc = 3 units
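To make the quantization tradeoff concrete, here is a small Python sketch of the slide's arithmetic; the sub-op latencies and the 11-stage Tcyc = 3 are the slide's numbers.

```python
# Sketch: redoing the slide's speedup arithmetic for three quantizations.
sub_ops = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}
unpipelined = sum(sub_ops.values())        # Tcyc = 31, one insn per "cycle"

tcyc_5 = max(sub_ops.values())                         # 5 stages as drawn: 9
tcyc_4 = max(sub_ops["IF"] + sub_ops["ID"], 9, 5, 9)   # merge IF & ID: still 9
tcyc_11 = 3                                # 11 finer-grained stages of 3 units

for tcyc in (tcyc_5, tcyc_4, tcyc_11):
    print(unpipelined / tcyc)              # speedup vs. no pipelining

# Note 11 * 3 = 33 > 31: finer stages introduce internal fragmentation,
# so the 31/3 = 10.3x speedup is optimistic.
```

Merging IF and ID costs nothing here (the 9-unit stages still set the clock), while subdividing to 11 stages triples the speedup at the price of fragmentation and more latch overhead.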
23
Pipeline Examples
AMDAHL 470V/7 (12 stages): PC GEN → Cache Read → Cache Read → Decode → Read REG → Addr GEN → Cache Read → Cache Read → EX 1 → EX 2 → Check Result → Write Result
MIPS R2000/R3000 (5 stages): IF → RD (decode + register read) → ALU → MEM → WB
24
Instruction Dependencies (1/2)
Data Dependence
Read-After-Write (RAW) (the only true dependence): a read must wait until the earlier write finishes
Anti-Dependence (WAR): a write must wait until the earlier read finishes (avoid clobbering)
Output Dependence (WAW): an earlier write can't be allowed to overwrite a later write
Control Dependence (a.k.a. Procedural Dependence): the branch condition must execute before the branch target; instructions after the branch cannot run before the branch
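The three register data dependences reduce to a few set checks. A minimal Python sketch (the instruction encoding and names are illustrative assumptions):

```python
# Sketch: classifying register data dependences between two instructions.
# Each instruction is modeled as (dest_reg, set_of_src_regs).
def classify(earlier, later):
    """Return the set of register data dependences from earlier -> later."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    deps = set()
    if e_dest is not None and e_dest in l_srcs:
        deps.add("RAW")            # true dependence: read-after-write
    if l_dest is not None and l_dest in e_srcs:
        deps.add("WAR")            # anti-dependence: write-after-read
    if e_dest is not None and e_dest == l_dest:
        deps.add("WAW")            # output dependence: write-after-write
    return deps

# r1 <- r2 + r3  followed by  r4 <- r1 + r1   => RAW on r1
print(classify(("r1", {"r2", "r3"}), ("r4", {"r1"})))  # {'RAW'}
```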
25
Instruction Dependencies (2/2)
# for (;(j<high)&&(array[j]<array[low]);++j);
bge j, high, $36
mul $15, j, 4
addu $24, array, $15
lw $25, 0($24)
mul $13, low, 4
addu $14, array, $13
lw $15, 0($14)
bge $25, $15, $36
$35: addu j, j, 1
. . .
$36: addu $11, $11, -1
Real code has lots of dependencies
26
Hardware Dependency Analysis
Processor must handle Register Data Dependencies (same register) RAW, WAW, WAR Memory Data Dependencies (same address) Control Dependencies
27
Pipeline Terminology
Pipeline Hazards: potential violations of program dependencies; must ensure program dependencies are not violated
Hazard Resolution: static method, performed at compile time in software; dynamic method, performed at runtime using hardware; two options: stall (costs perf.) or forward (costs hw.)
Pipeline Interlock: hardware mechanism for dynamic hazard resolution; must detect and enforce dependencies at runtime
28
Pipeline: Steady State
[Timing diagram: Instj through Instj+4 flow through IF, ID, RD, ALU, MEM, WB one stage apart; in steady state every stage is busy and one instruction completes per cycle]
29
Pipeline: Data Hazard
[Timing diagram: Instj in IF ID RD ALU MEM WB across t0-t5; a following instruction reads a register before Instj has written it back]
30
Option 1: Stall on Data Hazard
[Timing diagram: Instj and Instj+1 proceed normally; Instj+2 is stalled in RD, Instj+3 stalled in ID, and Instj+4 stalled in IF until the hazard clears, then all resume one stage apart]
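Stalling trades cycles for correctness. A rough Python sketch of the resulting CPI, assuming an illustrative 6-stage pipeline and a fixed 2-cycle penalty per hazard (both numbers are assumptions, not from the slides):

```python
# Sketch: how stalls inflate effective CPI. Numbers are illustrative.
DEPTH = 6                 # IF ID RD ALU MEM WB
STALL_PER_HAZARD = 2      # e.g., a dependent insn held in RD for 2 cycles

def cycles(n_insns, n_hazards):
    ideal = n_insns + DEPTH - 1        # pipeline fill + one completion/cycle
    return ideal + n_hazards * STALL_PER_HAZARD

n = 1000
for hazards in (0, 100, 300):
    print(hazards, cycles(n, hazards) / n)   # effective CPI
```

Even a modest hazard rate pushes CPI well above 1, which motivates forwarding on the next slides.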
31
Option 2: Forwarding Paths (1/3)
[Timing diagram: the same instruction stream with forwarding paths drawn from the ALU and MEM outputs back to the ALU inputs; many possible paths]
Requires stalling even with forwarding paths
32
Option 2: Forwarding Paths (2/3)
[Diagram: IF, ID (Register File read), ALU, MEM, WB pipeline; the src1, src2, and dest register fields travel down the pipeline with each instruction]
33
Option 2: Forwarding Paths (3/3)
[Diagram: the same pipeline with comparators (=) matching src1/src2 against downstream dest fields to steer the forwarding muxes]
Deeper pipeline may require additional forwarding paths
34
Pipeline: Control Hazard
[Timing diagram: Insti through Insti+4 one stage apart; Insti is a branch, so Insti+1 onward are fetched before the branch outcome is known]
35
Pipeline: Stall on Control Hazard
[Timing diagram: Insti proceeds; Insti+1 is fetched, then fetch stalls with Insti+2 held in IF until the branch resolves; later instructions follow behind, costing bubbles]
36
Pipeline: Prediction for Control Hazards
[Timing diagram: fetch continues speculatively past the branch; on a misprediction the speculative state is cleared, Insti+2 through Insti+4 become nops, and fetch is resteered to New Insti+2, New Insti+3, New Insti+4]
37
Going Beyond Scalar Scalar pipeline limited to CPI ≥ 1.0
Can never run more than 1 insn. per cycle “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0) Superscalar means executing multiple insns. in parallel
38
Architectures for Instruction Parallelism
Scalar pipeline (baseline)
Instruction/overlap parallelism = D
Operation latency = 1
Peak IPC = 1.0
[Diagram: D different instructions overlapped; successive instructions enter one per cycle over cycles 1-12]
39
Superscalar Machine Superscalar (pipelined) Execution
Instruction parallelism = D x N
Operation latency = 1
Peak IPC = N per cycle
[Diagram: D x N different instructions overlapped; N successive instructions enter each cycle]
40
Superscalar Example: Pentium
Prefetch: four 32-byte buffers
Decode1: decode up to 2 insts
Decode2: read operands, address computation
Execute: asymmetric u-pipe and v-pipe; mov, lea, simple ALU, push/pop, test/cmp can issue to either; shift, rotate, and some FP only to the u-pipe; jmp, jcc, call, fxch only to the v-pipe
Writeback
41
Pentium Hazards & Stalls
“Pairing Rules” (when can’t two insns exec together?)
Read/flow dependence: mov eax, 8 followed by mov [ebp], eax
Output dependence: two writes to the same register, e.g. a second mov eax, [ebp]
Partial register stalls: mov al, 1 followed by mov ah, 0
Function unit rules: some instructions can never be paired (MUL, DIV, PUSHA, MOVS, some FP)
42
Limitations of In-Order Pipelines
If the machine parallelism is increased … dependencies reduce performance
CPI of in-order pipelines degrades sharply as N approaches the avg. distance between dependent instructions: forwarding is no longer effective, so the machine must stall often
In-order pipelines are rarely full
43
The In-Order N-Instruction Limit
On average, parent-child separation is about ± 5 insns (Franklin and Sohi ’92)
Ex. superscalar degree N = 4: dependent insns must be at least N = 4 instructions apart; any dependency between closer instructions will cause a stall
An average of 5 means there are many cases when the separation is < 4 … each of these limits parallelism
A reasonable in-order superscalar is effectively N = 2
44
In Search of Parallelism
“Trivial” Parallelism is limited
What is trivial parallelism? In-order: sequential instructions that happen to have no dependencies
In all previous examples, every instruction executed either at the same time as or after earlier instructions
The previous slides show that superscalar execution quickly hits a ceiling
So what is “non-trivial” parallelism? …
45
What is Parallelism? Work Critical Path Average Parallelism
Work T1: time to complete the computation on a sequential system
Critical Path T∞: time to complete the same computation on an infinitely-parallel system
Average Parallelism: Pavg = T1 / T∞
For a p-wide system: Tp ≥ max{T1/p, T∞}; if Pavg >> p, then Tp ≈ T1/p
Example: x = a + b; y = b * 2; z = (x-y) * (x+y)
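Worked through for the slide's example (x = a + b; y = b * 2; z = (x-y) * (x+y)), counting each arithmetic op as one time unit:

```python
# Sketch: work (T1) and critical path (Tinf) for the slide's example.
# Dataflow: x and y are independent; (x-y) and (x+y) need both;
# the final * needs both of those.
T1 = 5          # ops: a+b, b*2, x-y, x+y, final *  -> total work
Tinf = 3        # levels: {x, y} -> {x-y, x+y} -> {z}
Pavg = T1 / Tinf
print(Pavg)     # ~1.67

# Lower bound for a p-wide machine: Tp >= max(T1/p, Tinf)
p = 2
Tp_lower_bound = max(T1 / p, Tinf)
print(Tp_lower_bound)
```

With Pavg ≈ 1.67, a 2-wide machine is already critical-path-bound (max(2.5, 3) = 3): adding width beyond Pavg buys nothing.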
46
ILP: Instruction-Level Parallelism
ILP is a measure of the amount of inter-dependencies between instructions
Average ILP = # instructions / length of longest path
"Longest path" is measured by the number of instructions in the path (not the number of edges)
code1 (must execute serially): r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3, so T1 = 3, T∞ = 3, ILP = 1
code2 (can execute at the same time): r1 ← r2 + 1; r3 ← r9 / 17; r4 ← r0 - r10, so T1 = 3, T∞ = 1, ILP = 3
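The "instructions divided by longest path" definition is easy to mechanize. A Python sketch that builds RAW edges only, matching the slide's examples (the encoding is illustrative):

```python
# Sketch: ILP = (# instructions) / (longest dependence chain, in instructions).
# Instructions are (dest, srcs); only RAW edges are considered here.
def ilp(insns):
    depth = {}               # longest chain (in insns) ending at each insn
    last_writer = {}         # register name -> index of most recent writer
    for i, (dest, srcs) in enumerate(insns):
        preds = [last_writer[r] for r in srcs if r in last_writer]
        depth[i] = 1 + max((depth[p] for p in preds), default=0)
        last_writer[dest] = i
    return len(insns) / max(depth.values())

code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]
print(ilp(code1), ilp(code2))   # 1.0 3.0
```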
47
ILP != IPC Instruction level parallelism usually assumes infinite resources, perfect fetch, and unit-latency for all instructions ILP is more a property of the program dataflow IPC is the “real” observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine The ILP of a program is an upper-bound on the attainable IPC
48
Scope of ILP Analysis
r1 ← r2 + 1
r3 ← r1 / 17
r4 ← r0 - r3
ILP = 1 for this block in isolation
This is just to point out that when you talk about ILP, you need to be very clear about what part(s) of the program you’re considering.
49
DFG Analysis
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 – 1
K: R3 → ST 0[R1]
In-class example: draw out all of the dataflow graph nodes from A to K, find the longest path, compute the ILP.
50
In-Order Issue, Out-of-Order Completion
Issue = send an instruction to execution
Issue stage needs to check: 1. structural dependence, 2. RAW hazard, 3. WAW hazard, 4. WAR hazard
[Diagram: in-order inst. stream; execution begins in order across INT, Ld/St, Fadd1, Fadd2, Fmul1, Fmul2, Fmul3 units; out-of-order completion]
51
Example A: R1 = R2 + R3 B: R4 = R5 + R6 C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 – 1
K: R3 → ST 0[R1]
[Schedule: the 10 instructions A-K issue in order over 8 cycles]
IPC = 10/8 = 1.25
This example is about IPC, not ILP.
52
Example (2) A: R1 = R2 + R3 B: R4 = R5 + R6 C: R1 = R1 * R4
D: R9 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R9]
J: R1 = R9 – 1
K: R3 → ST 0[R1]
[Schedule: with the renamed registers, A-K complete in 7 cycles]
IPC = 10/7 = 1.43
53
Track with Simple Scoreboarding
Scoreboard: a bit-array, 1 bit for each GPR
If the bit is not set: the register has valid data
If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
Issue in order: RD ← Fn(RS, RT)
If SB[RS] or SB[RT] is set: RAW, stall
If SB[RD] is set: WAW, stall
Else: dispatch to FU (Fn) and set SB[RD]
Complete out-of-order: update GPR[RD], clear SB[RD]
(H&P-style notation)
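A minimal Python model of this 1-bit scoreboard, following the issue rules above (the class and method names are illustrative):

```python
# Sketch of the 1-bit-per-register scoreboard described above.
class Scoreboard:
    def __init__(self, n_regs=32):
        self.busy = [False] * n_regs   # set => an outstanding write is pending

    def try_issue(self, rd, rs, rt):
        """Issue RD <- Fn(RS, RT) if hazard-free; return True on issue."""
        if self.busy[rs] or self.busy[rt]:
            return False               # RAW: a source is still being produced
        if self.busy[rd]:
            return False               # WAW: an older write to rd is in flight
        self.busy[rd] = True           # claim the destination
        return True

    def complete(self, rd):
        self.busy[rd] = False          # result written to GPR[rd]

sb = Scoreboard()
print(sb.try_issue(3, 1, 2))   # True: issues and sets SB[3]
print(sb.try_issue(4, 3, 2))   # False: RAW on r3
sb.complete(3)
print(sb.try_issue(4, 3, 2))   # True
```

WAR needs no check in this in-order model: sources are read at issue, before any later writer can issue.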
54
Out-of-Order Issue
Need an extra stage/buffers for dependency resolution (DR)
[Diagram: the in-order inst. stream feeds DR buffers; execution proceeds out of program order across INT, Ld/St, Fadd1, Fadd2, Fmul1, Fmul2, Fmul3 units; out-of-order completion]
55
OOO Scoreboarding Similar to In-Order scoreboarding
Need new tables to track the status of individual instructions and functional units
Still enforce dependencies: stall dispatch on WAW, stall issue on RAW, stall completion on WAR
Limitations of scoreboarding? Hints:
No structural hazards
Can always write a RAW-free code sequence: Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
Think about the x86 ISA with only 8 registers
A finite number of registers in any ISA will force you to reuse register names at some point: WAR and WAW stalls
56
Lessons thus Far More out-of-orderness More ILP exposed
But more hazards
Stalling is a generic technique to ensure sequencing
RAW stall is a fundamental requirement (?)
Compiler analysis and scheduling can help (not covered in this course)
The question mark next to RAW is hinting at the various value-prediction styles of studies that attempt to bypass RAW dependencies.
57
Ex. Tomasulo’s Algorithm [IBM 360/91, 1967]
58
FYI: Historical Note Tomasulo’s algorithm (1967) was not the first
Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in Feb 1966
Ideas got buried due to internal politics, changing project goals, etc.
But it's still the first (as far as I know)
59
Modern Enhancements to Tomasulo’s Algorithm
Tomasulo (peak IPC = 1) vs. modern (peak IPC = 6+):
Machine width: 1 vs. 6+
Structural deps: 2 FP FUs with a single CDB vs. 6-10+ FUs with many forwarding buses
Anti-deps: operand copying into the RS vs. renamed registers
Output-deps: RS tags vs. renamed registers
True deps: tag-based forwarding in both
Exceptions: imprecise vs. precise (requires ROB)
61
Balancing Pipeline Stages
TIF = 6 units, TID = 2 units, TEX = 9 units, TMEM = 5 units, TWB = 9 units
Without pipelining: Tcyc = TIF + TID + TEX + TMEM + TWB = 31
Pipelined: Tcyc = max{TIF, TID, TEX, TMEM, TWB} = 9
Speedup = 31 / 9
Can we do better in terms of either performance or efficiency?
62
Balancing Pipeline Stages
Two Methods for Stage Quantization: Merging of multiple stages Further subdividing a stage Recent Trends: Deeper pipelines (more and more stages) Pipeline depth growing more slowly since Pentium 4. Why? Multiple pipelines (subpipelines) Pipelined memory/cache accesses (tricky)
63
The Cost of Deeper Pipelines
Instruction pipelines are not ideal, i.e., instructions in different stages can have dependencies
Suppose an add is followed by a dependent nand: RAW!
[Timing diagram: add goes F D E M W; the nand stalls in D until the add result is available, and the instructions behind it (Inst0, Inst1) stall as well]
64
Types of Dependencies and Hazards
Data Dependence (Both memory and register) True dependence (RAW) Instruction must wait for all required input operands Anti-Dependence (WAR) Later write must not clobber a still-pending earlier read Output dependence (WAW) Earlier write must not clobber already-completed later write Control Dependence (aka Procedural Dependence) Conditional branches may change instruction sequence Instructions after cond. branch depend on outcome (more exact definition later)
65
Terminology
Pipeline Hazards: potential violations of program dependences
Hazard Resolution: must ensure program dependences are not violated; static method: performed at compile time in software; dynamic method: performed at run time using hardware
Pipeline Interlock: hardware mechanisms for dynamic hazard resolution; must detect and enforce dependences at run time
66
Necessary Conditions for Data Hazards
Earlier instruction i accesses rk in a late stage Y; later instruction j accesses rk in an early stage X
WAW: i writes, j writes; WAR: i reads, j writes; RAW: i writes, j reads
If dist(i,j) ≤ dist(X,Y): hazard!
If dist(i,j) > dist(X,Y): safe
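The condition can be written as a one-line predicate. A sketch using an illustrative 6-stage pipeline (IF ID RD ALU MEM WB); the stage numbering is an assumption for the example:

```python
# Sketch of the necessary condition above: a hazard between two
# instructions issued dist_ij cycles apart is possible only if
# dist_ij <= dist(X, Y), the distance between the conflicting stages.
STAGE = {"IF": 0, "ID": 1, "RD": 2, "ALU": 3, "MEM": 4, "WB": 5}

def may_hazard(dist_ij, stage_x, stage_y):
    return dist_ij <= STAGE[stage_y] - STAGE[stage_x]

# RAW: the earlier insn writes in WB (stage 5), the later insn reads
# its source in RD (stage 2), so the hazard window is 3 cycles.
print(may_hazard(2, "RD", "WB"))   # True  -> hazard possible
print(may_hazard(4, "RD", "WB"))   # False -> safe
```

For this RAW case, only instructions four or more apart are safe without interlocks or forwarding.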
67
Handling Data Hazards Avoidance (static) Detect and Stall (dynamic)
Make sure there are no hazards in the code Detect and Stall (dynamic) Stall until earlier instructions finish Detect and Forward (dynamic) Get correct value from elsewhere in pipeline
68
Handling Data Hazards: Avoidance
Programmer/compiler must know implementation details
Insert nops between dependent instructions, e.g. add … nop … nand, so that the add writes R3 (in cycle 5) before the nand reads R3 (in cycle 6)
69
Problems with Avoidance
Binary compatibility: new implementations may require more nops
Code size: higher instruction cache footprint and longer binary load times; worse in machines that execute multiple instructions / cycle (Intel Itanium: 25-40% of instructions are nops)
Slower execution: CPI = 1, but many instructions are nops
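The slower-execution point is simple arithmetic; a tiny sketch (the 25-40% figure is the slide's Itanium number, the function itself is illustrative):

```python
# Sketch: useful IPC when a fixed fraction of issued instructions are nops.
def useful_ipc(ipc, nop_fraction):
    return ipc * (1 - nop_fraction)

print(useful_ipc(1.0, 0.25))  # 0.75
print(useful_ipc(1.0, 0.40))  # 0.6
```

Even at CPI = 1, a quarter to nearly half of the issue slots do no useful work.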
70
Handling Data Hazards: Detect & Stall
Detection: compare regA & regB with the DestReg of each preceding insn still in the pipeline (3-bit comparators)
Stall: do not advance the pipeline register for Fetch/Decode; pass a nop to Execute
71
Problems with Detect & Stall
CPI increases on every hazard Are these stalls necessary? Not always! The new value for R3 is in the EX/Mem register Reroute the result to the nand Called “forwarding” or “bypassing”
72
Handling Data Hazards: Detect & Forward
Detection Same as detect and stall, but… each possible hazard requires different forwarding paths Forward Add data paths for all possible sources Add mux in front of ALU to select source “bypassing logic” often a critical path in wide-issue machines # paths grows quadratically with machine width
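A sketch of the bypass mux select for one ALU operand in a classic 5-stage pipeline; the signal names are illustrative assumptions:

```python
# Sketch: mux-select logic for one ALU operand (detect & forward).
# Compares the source register against destinations in EX/MEM and MEM/WB;
# the younger result (EX/MEM) takes priority.
def forward_select(src, exmem_dest, exmem_wr, memwb_dest, memwb_wr):
    if exmem_wr and exmem_dest == src:
        return "EX/MEM"      # forward the ALU result of the previous insn
    if memwb_wr and memwb_dest == src:
        return "MEM/WB"      # forward from two insns back
    return "REGFILE"         # no hazard: read the register file

print(forward_select(3, 3, True, 3, True))   # EX/MEM (younger value wins)
print(forward_select(5, 3, True, 5, True))   # MEM/WB
print(forward_select(7, 3, True, 5, True))   # REGFILE
```

Prioritizing EX/MEM over MEM/WB returns the youngest value of the register, which is what program order requires; replicating this mux per operand per issue slot is why bypass networks grow quadratically with width.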
73
Handling Control Hazards
Avoidance (static) No branches? Convert branches to predication Control dependence becomes data dependence Detect and Stall (dynamic) Stop fetch until branch resolves Speculate and squash (dynamic) Keep going past branch, throw away instructions if wrong
74
Avoidance: if-conversion
if (a == b) { x++; y = n / d; }

Branching version:
sub t1 ← a, b
jnz t1, PC+2
add x ← x, #1
div y ← n, d

With conditional moves:
sub t1 ← a, b
add t2 ← x, #1
div t3 ← n, d
cmov(t1) x ← t2
cmov(t1) y ← t3

Fully predicated:
sub t1 ← a, b
add(t1) x ← x, #1
div(t1) y ← n, d
75
Handling Control Hazards: Detect & Stall
Detection In decode, check if opcode is branch or jump Stall Hold next instruction in Fetch Pass noop to Decode
76
Problems with Detect & Stall
CPI increases on every branch Are these stalls necessary? Not always! Branch is only taken half the time Assume branch is NOT taken Keep fetching, treat branch as noop If wrong, make sure bad instructions don’t complete