
1 Lecture 2: Pipelining and Superscalar Review

2 Motivation: Increase throughput with little increase in cost (hardware, power, complexity, etc.)
–Bandwidth or Throughput = Performance; BW = number of tasks / unit time
–For a system that operates on one task at a time: BW = 1 / latency
–Pipelining can increase BW if there are many repetitions of the same operation/task
–Latency per task remains the same or increases

3 [Figure: pipelining combinational logic. Unpipelined logic with N gate delays has BW ≈ 1/N; two stages of N/2 gate delays give BW ≈ 2/N; three stages of N/3 gate delays give BW ≈ 3/N]

4 Starting from an unpipelined version with propagation delay T and BW = 1/T:
Perf_pipe = BW_pipe = 1 / (T/k + S), where k = number of stages and S = latch delay
[Figure: unpipelined logic of delay T vs. a k-stage pipeline with stage delay T/k plus a latch of delay S after each stage]

5 Starting from an unpipelined version with hardware cost G:
Cost_pipe = G + kL, where k = number of stages and L = latch cost (incl. control)
[Figure: unpipelined hardware of cost G vs. a k-stage pipeline with cost G/k per stage plus k latches of cost L]

6 Cost/Performance:
C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S) = LT + GS + LSk + GT/k
Optimal cost/performance: find the k that minimizes C/P:
d(C/P)/dk = LS − GT/k² = 0  ⟹  k_opt = √(GT / (LS))

7 [Plot: cost/performance ratio C/P (×10⁴) vs. pipeline depth k, for two parameter sets: G=175, L=41, T=400, S=22 and G=175, L=21, T=400, S=11]
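The optimum is easy to check numerically. A minimal Python sketch (the parameter values are the ones from the plot above) evaluating C/P and the closed-form k_opt = √(GT/(LS)) derived on the previous slide:

```python
# Minimal sketch: evaluate the C/P expression and its closed-form optimum.
import math

def cost_perf(k, G, L, T, S):
    """C/P = (L*k + G) * (T/k + S): pipeline cost divided by pipelined BW."""
    return (L * k + G) * (T / k + S)

for (G, L, T, S) in [(175, 41, 400, 22), (175, 21, 400, 11)]:
    k_opt = math.sqrt(G * T / (L * S))
    print(f"G={G} L={L} T={T} S={S}: k_opt = {k_opt:.1f}, "
          f"C/P at k_opt = {cost_perf(k_opt, G, L, T, S):.0f}")
# k_opt is ~8.8 for the first curve and ~17.4 for the second
```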

8 “Hardware Cost”:
–Transistor/gate count (should include the additional logic to control the pipeline)
–Area (related to gate count)
–Power! More gates → more switching; more gates → more leakage
Many metrics to optimize; very difficult to determine what really is “optimal”

9 Ideal conditions for pipelining:
–Uniform Suboperations: the operation to be pipelined can be evenly partitioned into uniform-latency suboperations
–Repetition of Identical Operations: the same operation is performed repeatedly on a large number of different inputs
–Repetition of Independent Operations: all repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts
Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)

10 Uniform Suboperations … NOT!
–Balance pipeline stages: stage quantization to yield balanced stages; minimize internal fragmentation (some stages wait)
Identical Operations … NOT!
–Unify instruction types: coalesce instruction types into one “multi-function” pipe; minimize external fragmentation (some stages idle)
Independent Operations … NOT!
–Resolve data and resource hazards: inter-instruction dependency detection and resolution; minimize performance loss

11 The “computation” to be pipelined:
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand(s) Fetch (OF)
4. Instruction Execution (EX)
5. Operand Store (OS), a.k.a. Writeback (WB)
6. Update Program Counter (PC)

12 Pipelining based on the obvious subcomputations:
IF → ID → OF/RF → EX → OS/WB
(Instruction Fetch, Instruction Decode, Operand Fetch, Instruction Execute, Operand Store)

13 Example stage latencies: T_IF = 6 units, T_ID = 2 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units
Without pipelining: T_cyc ≥ T_IF + T_ID + T_OF + T_EX + T_OS = 31
Pipelined: T_cyc ≥ max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9
Speedup = 31 / 9 ≈ 3.4
Can we do better in terms of either performance or efficiency?
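A quick check of those numbers as a minimal Python sketch (latch overhead S is ignored, as on the slide):

```python
# Minimal sketch: unpipelined cycle time is the sum of stage latencies,
# pipelined cycle time is the max; speedup is their ratio.
stage_latency = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

t_unpipelined = sum(stage_latency.values())   # 31 units
t_pipelined = max(stage_latency.values())     # 9 units (the OF/OS stages)
print(t_unpipelined / t_pipelined)            # speedup = 31/9 ~ 3.44
```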

14 Two methods for stage quantization:
–Merging multiple subcomputations into one
–Subdividing a subcomputation into multiple smaller ones
Recent/current trends:
–Deeper pipelines (more and more stages), up to a certain point: then the cost function takes over
–Multiple different pipelines/subpipelines
–Pipelining of memory accesses (tricky)

15 Two stage-quantization examples (T_IF/T_ID/T_OF/T_EX/T_OS = 6/2/9/5/9 units):
Coarser-grained machine cycle: 4 machine cycles / instruction
–Merge IF and ID: T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units
Finer-grained machine cycle: 11 machine cycles / instruction
–T_cyc = 3 units; the stages are subdivided into 2/1/3/2/3 subcycles, respectively
[Figure: the 4-stage pipeline IF&ID–OF–EX–OS vs. the 11-stage pipeline IF,IF,ID,OF,OF,OF,EX,EX,OS,OS,OS]
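To compare the two designs, a hedged Python sketch (assuming steady-state throughput of one instruction per machine cycle and ignoring latch overhead):

```python
# Hedged sketch comparing the two quantizations above.
import math

latencies = [6, 2, 9, 5, 9]          # T_IF, T_ID, T_OF, T_EX, T_OS

# Coarser-grained: merge IF+ID into one stage -> stage times 8, 9, 5, 9
t_cyc_coarse = max([6 + 2, 9, 5, 9])                            # 9 units

# Finer-grained: subdivide every stage into 3-unit subcycles
t_cyc_fine = 3
subcycles = sum(math.ceil(t / t_cyc_fine) for t in latencies)   # 2+1+3+2+3 = 11

print(1 / t_cyc_coarse)   # ~0.11 instructions per time unit
print(1 / t_cyc_fine)     # ~0.33 instructions per time unit: 3x the throughput
print(subcycles)          # 11 machine cycles of latency per instruction
```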

16 Hardware requirements of the two designs:
–Logic needed for each pipeline stage
–Register file ports needed to support all (relevant) stages
–Memory access ports needed to support all (relevant) stages
[Figure: the same 4-stage and 11-stage pipelines as the previous slide]

17 Two commercial examples, mapped onto the generic IF/ID/OF/EX/OS stages:
–MIPS R2000/R3000 (5 stages): IF – RD – ALU – MEM – WB
–AMDAHL 470V/7: IF = {PC GEN, Cache Read}, ID = {Decode}, OF = {Read REG, Add GEN, Cache Read}, EX = {EX 1, EX 2}, OS = {Check Result, Write Result}

18 Data Dependence:
–True Dependence (RAW): an instruction must wait for all required input operands
–Anti-Dependence (WAR): a later write must not clobber a still-pending earlier read
–Output Dependence (WAW): an earlier write must not clobber an already-finished later write
Control Dependence (a.k.a. Procedural Dependence):
–Conditional branches cause uncertainty in instruction sequencing
–Instructions following a conditional or computed branch depend on the execution of the branch instruction
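As an illustration, a minimal sketch (not from the slides) that classifies the register data dependences between an earlier instruction A and a later instruction B from their read/write sets; the sample pair is taken from the code on the next slide:

```python
# Hedged sketch: classify register data dependences between an earlier
# instruction A and a later instruction B, given their read/write sets.
def classify_deps(a_reads, a_writes, b_reads, b_writes):
    deps = []
    if a_writes & b_reads:
        deps.append("RAW (true)")     # B reads what A writes
    if a_reads & b_writes:
        deps.append("WAR (anti)")     # B overwrites what A still reads
    if a_writes & b_writes:
        deps.append("WAW (output)")   # B overwrites what A writes
    return deps

# mul $15, $10, 4  followed by  addu $24, $6, $15:
print(classify_deps(a_reads={"$10"}, a_writes={"$15"},
                    b_reads={"$6", "$15"}, b_writes={"$24"}))  # ['RAW (true)']
```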

19 # for (; (j<high) && (array[j]<array[low]); ++j);
# $10 = j; $9 = high; $6 = array; $8 = low
bge  $10, $9, $36
mul  $15, $10, 4
addu $24, $6, $15
lw   $25, 0($24)
mul  $13, $8, 4
addu $14, $6, $13
lw   $15, 0($14)
bge  $25, $15, $36
$35: addu $10, $10, 1
     ...
$36: addu $11, $11, -1
     ...

20 The processor must handle:
–Register data dependencies: RAW, WAW, WAR
–Memory data dependencies: RAW, WAW, WAR
–Control dependencies

21 Pipeline Hazards:
–Potential violations of program dependencies
–Must ensure program dependencies are not violated
Hazard Resolution:
–Static method: performed at compile time in software
–Dynamic method: performed at runtime using hardware (stall, flush, or forward)
Pipeline Interlock:
–Hardware mechanism for dynamic hazard resolution
–Must detect and enforce dependencies at runtime

22 [Pipeline diagram: instructions j..j+4 each flowing through IF–ID–RD–ALU–MEM–WB, one new instruction entering per cycle t0, t1, t2, …]

23 [Pipeline diagram: the same overlapped execution of instructions j..j+4 through IF–ID–RD–ALU–MEM–WB]

24 [Pipeline diagram with interlock stalls: inst j+2 is stalled in RD, inst j+3 in ID, and inst j+4 in IF until the hazard clears, then the pipeline resumes]

25 [Figure-only slide; no transcript text]

26 [Pipeline diagram: forwarding. Results are routed from the ALU and MEM stage outputs back to younger instructions; there are many possible paths, and some dependences require stalling even with forwarding paths]

27 Deeper pipelines may require additional forwarding paths
[Datapath figure: each source register (src1, src2) is compared against the destination registers of the instructions currently in the ALU and MEM stages; on a match, the in-flight result is forwarded instead of the register-file value]
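A hedged sketch of the comparator logic the figure implies, for one source operand (the function name and stage layout are illustrative, not the slide's exact datapath):

```python
# Hedged sketch of forwarding-mux select logic for one source operand.
# alu_dest/mem_dest are the destination registers of the instructions
# currently in the ALU and MEM stages (None if that stage writes nothing).
def select_operand(src, regfile, alu_dest, alu_result, mem_dest, mem_result):
    if src == alu_dest and alu_result is not None:
        return alu_result        # forward from ALU stage (youngest producer wins)
    if src == mem_dest and mem_result is not None:
        return mem_result        # forward from MEM stage
    return regfile[src]          # no in-flight producer: read the register file

regs = {"r1": 10, "r2": 20}
# r1 is being recomputed by the instruction now in ALU; use its result, not 10:
print(select_operand("r1", regs, "r1", 99, None, None))   # 99
```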

28 [Pipeline diagram: instructions i..i+4 overlapped through IF–ID–RD–ALU–MEM–WB]

29 [Pipeline diagram: younger instructions are stalled in IF until the control hazard resolves]

30 [Pipeline diagram: on a branch misprediction, the wrong-path instructions i+2..i+4 are turned into nops (speculative state cleared) and fetch is resteered to new instructions i+2..i+4]

31 A simple pipeline is limited to CPI ≥ 1.0; a “superscalar” machine can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
–Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
–Contrast with Vector, which effectively executes multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)

32 Scalar pipeline (baseline):
–Instruction/overlap parallelism = D
–Operation latency = 1
–Peak IPC = 1
[Figure: D different instructions overlapped, one issued per cycle]

33 Superscalar (pipelined) execution:
–Instruction parallelism = D × N
–Operation latency = 1
–Peak IPC = N per cycle
[Figure: N instructions issued per cycle; D × N different instructions overlapped]

34 Example: the Intel Pentium pipeline (2-way superscalar):
–Prefetch: 4× 32-byte buffers
–Decode1: decode up to 2 instructions
–Decode2: read operands, address computation
–Execute: asymmetric pipes; the u-pipe handles shift, rotate, and some FP; the v-pipe handles jmp, jcc, call, fxch; both handle mov, lea, simple ALU, push/pop, test/cmp
–Writeback

35 “Pairing Rules” (when can/can’t two instructions execute at the same time?)
–read/flow dependence:
    mov eax, 8
    mov [ebp], eax
–output dependence:
    mov eax, 8
    mov eax, [ebp]
–partial register stalls:
    mov al, 1
    mov ah, 0
–function unit rules: some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP
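A hedged Python sketch of a Pentium-style pairing check, simplified to registers and to just the rules listed above (the can_pair helper is illustrative, not Intel's actual logic; partial-register checks are omitted):

```python
# Hedged sketch of a simplified pairing check.
UNPAIRABLE = {"mul", "div", "pusha", "movs"}  # plus some FP ops

def can_pair(i1, i2):
    """i = (opcode, reads, writes); i1 goes to the u-pipe, i2 to the v-pipe."""
    op1, r1, w1 = i1
    op2, r2, w2 = i2
    if op1 in UNPAIRABLE or op2 in UNPAIRABLE:
        return False                     # function unit rules
    if w1 & r2:
        return False                     # read/flow dependence
    if w1 & w2:
        return False                     # output dependence
    return True

# mov eax, 8 ; mov [ebp], eax -- flow dependence, cannot pair:
print(can_pair(("mov", set(), {"eax"}),
               ("mov", {"eax", "ebp"}, set())))   # False
```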

36 CPI of in-order pipelines degrades very sharply if machine parallelism is increased beyond a certain point:
–i.e., when N approaches the average distance between dependent instructions
–Forwarding is no longer effective → must stall more often
–The pipeline may never be full due to the frequency of dependency stalls

37 Example: superscalar degree N = 4
–Any dependency among instructions issued together will cause a stall; a dependent instruction must be N = 4 instructions away
–On average, the parent–child separation is only about 5 instructions! (Franklin and Sohi ’92)
–An average of 5 means there are many cases where the separation is < 4… each of these limits parallelism
–Pentium: superscalar degree N = 2 is reasonable… going much further encounters rapidly diminishing returns

38 “Trivial” parallelism is limited
–What is trivial parallelism? In-order overlap: sequential instructions that do not have dependencies. In all previous examples, every instruction executed either at the same time as or after earlier instructions
–The previous slides show that this style of superscalar execution quickly hits a ceiling
So what is “non-trivial” parallelism? …

39 Work T₁: time to complete a computation on a sequential system
Critical Path T∞: time to complete the same computation on an infinitely-parallel system
Average Parallelism: P_avg = T₁ / T∞
For a p-wide system: T_p ≥ max{T₁/p, T∞}; if P_avg >> p, then T_p ≈ T₁/p
Example: x = a + b; y = b * 2; z = (x-y) * (x+y)
(here T₁ = 5 operations, T∞ = 3: x and y in parallel, then x-y and x+y in parallel, then the final multiply, so P_avg = 5/3)

40 ILP is a measure of the inter-dependencies between instructions
Average ILP = number of instructions / longest dependence path
code1: ILP = 1 (must execute serially); T₁ = 3, T∞ = 3
    r1 ← r2 + 1
    r3 ← r1 / 17
    r4 ← r0 - r3
code2: ILP = 3 (all can execute at the same time); T₁ = 3, T∞ = 1
    r1 ← r2 + 1
    r3 ← r9 / 17
    r4 ← r0 - r10

41 Instruction-level parallelism usually assumes infinite resources, perfect fetch, and unit latency for all instructions; ILP is a property of the program’s dataflow
IPC is the “real” observed metric: exactly how many instructions are executed per machine cycle, including all the limitations of a real machine
The ILP of a program is an upper bound on the attainable IPC

42 Example: two independent three-instruction sequences
    r1 ← r2 + 1        (this chain is serial: ILP = 1)
    r3 ← r1 / 17
    r4 ← r0 - r3
    r11 ← r12 + 1      (these are mutually independent: ILP = 3)
    r13 ← r19 / 17
    r14 ← r0 - r20
Overall: 6 instructions / longest path of 3 → ILP = 2
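The definition on slide 40 is easy to mechanize. A minimal Python sketch (assuming unit latency, infinite resources, and that each register is written at most once in the block) that computes T₁, T∞, and ILP for the example above:

```python
# Minimal sketch: work T1, critical path Tinf, and ILP = T1/Tinf for a block.
def ilp(block):
    """block: list of (dest, set_of_sources); returns (T1, Tinf, ILP)."""
    finish = {}                     # register -> cycle its value is ready
    t_inf = 0
    for dest, srcs in block:
        ready = 1 + max((finish.get(s, 0) for s in srcs), default=0)
        finish[dest] = ready
        t_inf = max(t_inf, ready)
    return len(block), t_inf, len(block) / t_inf

code = [("r1", {"r2"}), ("r3", {"r1"}), ("r4", {"r0", "r3"}),        # serial chain
        ("r11", {"r12"}), ("r13", {"r19"}), ("r14", {"r0", "r20"})]  # independent
print(ilp(code))    # (6, 3, 2.0) -- matches ILP = 2 above
```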

43 A running example:
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]

44 Issue = send an instruction to execution. The issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard
[Figure: an in-order instruction stream feeding the FUs (INT, Fadd1–Fadd2, Fmul1–Fmul3, Ld/St); execution begins in-order, completion is out-of-order]

45 In-order issue (out-of-order completion) of the running example, instructions A–K
[Figure: cycle-by-cycle schedule over 8 cycles and the corresponding dependence graph]
IPC = 10/8 = 1.25

46 The same example with the code slightly rewritten (D now writes R9, and H and J read R9), shortening the dependence chains:
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R9 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R9]
J: R1 = R9 - 1
K: R3 → ST 0[R1]
[Figure: cycle-by-cycle schedule over 7 cycles and the corresponding dependence graph]
IPC = 10/7 = 1.43

47 Scoreboard: a bit-array, 1 bit for each GPR
–If the bit is not set: the register has valid data
–If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
Issue in order: RD ← Fn(RS, RT)
–If SB[RS] or SB[RT] is set → RAW, stall
–If SB[RD] is set → WAW, stall
–Else, dispatch to the FU for Fn and set SB[RD]
Complete out-of-order:
–Update GPR[RD], clear SB[RD]
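A minimal Python sketch of exactly this policy (issue timing and the FUs themselves are abstracted away; the class name and interface are illustrative):

```python
# Minimal sketch of the scoreboard issue/complete policy described above.
class Scoreboard:
    def __init__(self, num_regs=32):
        self.sb = [False] * num_regs        # True = register result pending

    def try_issue(self, rd, rs, rt):
        if self.sb[rs] or self.sb[rt]:
            return "stall: RAW"             # a source is still being produced
        if self.sb[rd]:
            return "stall: WAW"             # an older write to rd is pending
        self.sb[rd] = True                  # dispatch to FU, mark rd pending
        return "issued"

    def complete(self, rd):
        self.sb[rd] = False                 # update GPR[rd], clear the bit

sb = Scoreboard()
print(sb.try_issue(1, 2, 3))   # issued        (R1 = F(R2, R3))
print(sb.try_issue(4, 1, 5))   # stall: RAW    (R1 still pending)
sb.complete(1)
print(sb.try_issue(4, 1, 5))   # issued
```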

48 [Figure: an in-order instruction stream with a dependency-resolution (DR) stage/buffer in front of each FU (INT, Fadd1–Fadd2, Fmul1–Fmul3, Ld/St); execution proceeds out of program order and completion is out-of-order]
Out-of-program-order execution needs an extra stage/buffers for dependency resolution

49 Out-of-order scoreboarding is similar to in-order scoreboarding:
–Need new tables to track the status of individual instructions and functional units
–Still enforce dependencies: stall dispatch on WAW, stall issue on RAW, stall completion on WAR
Limitations of scoreboarding? Hints:
–Assume no structural hazards
–You can always write a RAW-free code sequence: Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
–Think about the x86 ISA with only 8 registers
The finite number of registers in any ISA will force you to reuse register names at some point → WAR, WAW → stalls

50 More out-of-orderness → more ILP exposed → but more hazards
–Stalling is a generic technique to ensure sequencing
–A RAW stall is a fundamental requirement (?)
–Compiler analysis and scheduling can help (not covered in this course)

51 [Figure-only slide; no transcript text]

52 Tomasulo’s algorithm (1967) was not the first
–Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in Feb 1966
–Her ideas got buried due to internal politics, changing project goals, etc.
–But her proposal is still the first (as far as I know)

53 Tomasulo (1967) vs. a modern out-of-order machine:

                    Tomasulo                Modern
Machine width       Peak IPC = 1            Peak IPC = 6+
Structural deps     2 FP FUs, single CDB    6-10+ FUs, many forwarding buses
Anti-deps (WAR)     Operand copying         Renamed registers
Output-deps (WAW)   RS tag                  Renamed registers
True deps (RAW)     Tag-based forwarding    Tag-based forwarding
Exceptions          Imprecise               Precise (requires ROB)
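For the “renamed registers” entries in the table, a hedged sketch (illustrative, not any specific machine) of how renaming removes WAR/WAW hazards by giving every write a fresh physical register, so only true (RAW) dependences remain:

```python
# Hedged sketch: rename architectural destinations to fresh physical registers.
def rename(block):
    rat, next_preg, out = {}, 0, []             # rat = register alias table
    for dest, srcs in block:
        psrcs = {rat.get(s, s) for s in srcs}   # map sources through the RAT
        rat[dest] = f"p{next_preg}"             # every write gets a fresh preg
        next_preg += 1
        out.append((rat[dest], psrcs))
    return out

# A WAW-laden sequence: both writes to r1 get distinct physical registers.
print(rename([("r1", {"r2"}), ("r3", {"r1"}), ("r1", {"r4"})]))
# [('p0', {'r2'}), ('p1', {'p0'}), ('p2', {'r4'})]
```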

