
1 CSE502: Computer Architecture Core Pipelining

2 CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1) – Long clock period (to accommodate slowest instruction). Multi-cycle control: micro-programmed – Short clock period – High CPI. Can we have both low CPI and short clock period? [Timing diagram: single-cycle runs insn0.(fetch,decode,exec) then insn1.(fetch,decode,exec), one long cycle each; multi-cycle runs insn0.fetch, insn0.dec, insn0.exec, insn1.fetch, insn1.dec, insn1.exec, one short cycle each.]

3 CSE502: Computer Architecture Pipelining Start with multi-cycle design. When insn0 goes from stage 1 to stage 2, insn1 starts stage 1. Each instruction passes through all stages, but instructions enter and leave at a faster rate. Can have as many insns in flight as there are stages. [Timing diagram: multi-cycle runs insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec; pipelined overlaps them, so insn1.fetch coincides with insn0.dec, and insn2.fetch coincides with insn1.dec and insn0.exec.]
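
As a rough sketch of the timing comparison on slides 2 and 3, the Python below estimates total execution time for n instructions under the three designs. The function names and the unit stage latencies are illustrative assumptions, not values from the slides, and the multi-cycle model simplifies by giving every instruction one cycle per stage.

```python
def single_cycle_time(n, stage_latencies):
    # One long clock per instruction: the period must cover every stage.
    return n * sum(stage_latencies)

def multi_cycle_time(n, stage_latencies):
    # Short clock (set by the slowest stage), one cycle per stage, no overlap.
    return n * len(stage_latencies) * max(stage_latencies)

def pipelined_time(n, stage_latencies):
    # Short clock; the first instruction fills the pipe, then one finishes per cycle.
    return (len(stage_latencies) + n - 1) * max(stage_latencies)

stages = [1, 1, 1]  # hypothetical fetch, decode, exec latencies
for n in (1, 3, 100):
    print(n, single_cycle_time(n, stages), multi_cycle_time(n, stages),
          pipelined_time(n, stages))
```

With equal stage latencies the single-cycle and multi-cycle totals coincide, while the pipelined total approaches one short cycle per instruction as n grows.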

4 CSE502: Computer Architecture Pipeline Examples Increases throughput at the expense of latency. [Figure: three versions of an address "hit?" lookup built from equality comparators.]

5 CSE502: Computer Architecture Processor Pipeline Review [Figure: five-stage datapath: Fetch (PC, +4, I-cache), Decode (Reg File), Execute (ALU), Memory (D-cache), Write-back.]

6 CSE502: Computer Architecture Stage 1: Fetch Fetch an instruction from memory every cycle – Use PC to index memory – Increment PC (assume no branches for now) Write state to the pipeline register (IF/ID) – The next stage will read this pipeline register

7 CSE502: Computer Architecture Stage 1: Fetch Diagram [Figure: the PC (with enable) indexes the Instruction Cache; an adder computes PC + 1; a MUX selects the next PC from PC + 1 or the branch target; instruction bits and PC + 1 are written to the IF/ID pipeline register for Decode.]

8 CSE502: Computer Architecture Stage 2: Decode Decodes opcode bits – Set up Control signals for later stages Read input operands from register file – Specified by decoded instruction bits Write state to the pipeline register (ID/EX) – Opcode – Register contents – PC+1 (even though decode didn't use it) – Control signals (from insn) for opcode and destReg

9 CSE502: Computer Architecture Stage 2: Decode Diagram [Figure: instruction bits from the IF/ID pipeline register select regA and regB in the Register File; regA contents, regB contents, PC + 1, and control signals are written to the ID/EX pipeline register for Execute; destReg and data arrive from later stages at the Register File write port.]

10 CSE502: Computer Architecture Stage 3: Execute Perform ALU operations – Calculate result of instruction: control signals select operation; contents of regA used as one input; either regB or constant offset (from insn) used as second input – Calculate PC-relative branch target: PC+1+(constant offset). Write state to the pipeline register (EX/Mem) – ALU result, contents of regB, and PC+1+offset – Control signals (from insn) for opcode and destReg

11 CSE502: Computer Architecture Stage 3: Execute Diagram [Figure: from the ID/EX pipeline register, a MUX selects the second ALU input; the ALU produces the ALU result; an adder computes PC+1+offset; ALU result, regB contents, PC+1+offset, and control signals are written to the EX/Mem pipeline register for Memory.]

12 CSE502: Computer Architecture Stage 4: Memory Perform data cache access – ALU result contains address for LD or ST – Opcode bits control R/W and enable signals Write state to the pipeline register (Mem/WB) – ALU result and Loaded data – Control signals (from insn) for opcode and destReg

13 CSE502: Computer Architecture Stage 4: Memory Diagram [Figure: from the EX/Mem pipeline register, the ALU result drives in_addr of the Data Cache, regB contents drive in_data, and control signals drive en and R/W; ALU result, Loaded data, and control signals are written to the Mem/WB pipeline register for Write-back; PC+1+offset leaves as the branch target.]

14 CSE502: Computer Architecture Stage 5: Write-back Writing result to register file (if required) – Write Loaded data to destReg for LD – Write ALU result to destReg for arithmetic insn – Opcode bits control register write enable signal

15 CSE502: Computer Architecture Stage 5: Write-back Diagram [Figure: a MUX selects between ALU result and Loaded data from the Mem/WB pipeline register; the selected data and destReg go back to the register file under control-signal enable.]

16 CSE502: Computer Architecture Putting It All Together [Figure: full datapath: PC, Inst Cache, Register file (R0 to R7; regA, regB, data, dest ports), ALU, and Data Cache, separated by the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers. ID/EX holds op, dest, offset, valA, valB, PC+1; EX/Mem holds op, dest, target, ALU result, valB, eq?; Mem/WB holds op, dest, ALU result, mdata.]

17 CSE502: Computer Architecture Pipelining Idealism Uniform Sub-operations – Operation can be partitioned into uniform-latency sub-ops Repetition of Identical Operations – Same ops performed on many different inputs Repetition of Independent Operations – All repetitions of op are mutually independent

18 CSE502: Computer Architecture Pipeline Realism Uniform Sub-operations … NOT! – Balance pipeline stages: stage quantization to yield balanced stages; minimize internal fragmentation (left-over time near end of cycle). Repetition of Identical Operations … NOT! – Unifying instruction types: coalescing instruction types into one multi-function pipe; minimize external fragmentation (idle stages to match length). Repetition of Independent Operations … NOT! – Resolve data and resource hazards: inter-instruction dependency detection and resolution. Pipelining is expensive

19 CSE502: Computer Architecture The Generic Instruction Pipeline Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Instruction Execute (EX), Write-back (WB)

20 CSE502: Computer Architecture Balancing Pipeline Stages T_IF = 6 units, T_ID = 2 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units. Without pipelining: T_cyc ≈ T_IF + T_ID + T_OF + T_EX + T_OS = 31. Pipelined: T_cyc ≈ max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9. Speedup = 31 / 9 ≈ 3.4. Can we do better? [Figure: IF, ID, OF, EX, WB stages.]
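
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it reads the second 9-unit latency as the operand-fetch stage (the formula on the slide lists T OF), and the dictionary and function names are made up for illustration.

```python
# Stage latencies in the slide's abstract time units.
stage_latency = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

def unpipelined_cycle(latency):
    # Without pipelining, one cycle must cover every stage in sequence.
    return sum(latency.values())

def pipelined_cycle(latency):
    # With pipelining, the clock is set by the slowest stage.
    return max(latency.values())

def ideal_speedup(latency):
    return unpipelined_cycle(latency) / pipelined_cycle(latency)

print(unpipelined_cycle(stage_latency))        # 31
print(pipelined_cycle(stage_latency))          # 9
print(round(ideal_speedup(stage_latency), 2))  # 3.44
```

Five perfectly balanced stages would approach a speedup of 5, so the gap between 3.44 and 5 is the cost of the imbalance.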

21 CSE502: Computer Architecture Balancing Pipeline Stages (1/2) Two methods for stage quantization – Merge multiple sub-ops into one – Divide sub-ops into smaller pieces Recent/Current trends – Deeper pipelines (more and more stages) – Multiple different pipelines/sub-pipelines – Pipelining of memory accesses

22 CSE502: Computer Architecture Balancing Pipeline Stages (2/2) Coarser-Grained Machine Cycle: 4 machine cyc / instruction; T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units; # stages = 4, T_cyc = 9 units (IF and ID merged, then OF, EX, WB). Finer-Grained Machine Cycle: 11 machine cyc / instruction; # stages = 11, T_cyc = 3 units (IF ×2, ID ×1, OF ×3, EX ×2, WB ×3).
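
A small sketch of stage quantization, assuming each sub-operation is simply split into ceil(latency / T_cyc) stages. The helper names are hypothetical; the latencies are the ones on the slide.

```python
import math

def quantize(latency, t_cyc):
    """Number of pipeline stages each sub-op needs at clock period t_cyc."""
    return {name: math.ceil(units / t_cyc) for name, units in latency.items()}

def summarize(latency, t_cyc):
    stages = quantize(latency, t_cyc)
    depth = sum(stages.values())             # machine cycles per instruction
    total = depth * t_cyc                    # instruction latency in time units
    wasted = total - sum(latency.values())   # internal fragmentation
    return depth, total, wasted

coarse = {"IF&ID": 8, "OF": 9, "EX": 5, "OS": 9}        # IF and ID merged
fine = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}    # sub-ops subdivided

print(summarize(coarse, 9))  # (4, 36, 5)
print(summarize(fine, 3))    # (11, 33, 2)
```

Under these assumptions the finer-grained design has a 3x shorter clock and less internal fragmentation (2 units instead of 5), at the price of 11 stages to keep hazard-free.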

23 CSE502: Computer Architecture Pipeline Examples MIPS R2000/R3000: IF, RD, ALU, MEM, WB. AMDAHL 470V/7: PC GEN, Cache Read, Decode, Read REG, Addr GEN, Cache Read, EX 1, EX 2, Check Result, Write Result. [Figure: both pipelines drawn against the generic IF, ID, OF, EX, WB stages.]

24 CSE502: Computer Architecture Instruction Dependencies (1/2) Data Dependence – Read-After-Write (RAW) (only true dependence): read must wait until earlier write finishes – Anti-Dependence (WAR): write must wait until earlier read finishes (avoid clobbering) – Output Dependence (WAW): earlier write can't overwrite later write. Control Dependence (a.k.a. Procedural Dependence) – Branch condition must execute before branch target – Instructions after branch cannot run before branch
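
To make the three data-dependence types concrete, here is a minimal Python sketch that classifies the register dependences between an earlier and a later instruction. The `Insn` tuple and the `dependences` function are illustrative names, not part of the slides.

```python
from typing import NamedTuple, Optional, Set, Tuple

class Insn(NamedTuple):
    dest: Optional[str]      # register written, or None (e.g. a store or branch)
    srcs: Tuple[str, ...]    # registers read

def dependences(earlier: Insn, later: Insn) -> Set[str]:
    """Register dependences that `later` has on `earlier` (program order)."""
    deps = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        deps.add("RAW")   # later reads what earlier writes
    if later.dest is not None and later.dest in earlier.srcs:
        deps.add("WAR")   # later overwrites what earlier still has to read
    if earlier.dest is not None and earlier.dest == later.dest:
        deps.add("WAW")   # both write the same register
    return deps

add = Insn(dest="r1", srcs=("r2", "r3"))
print(dependences(add, Insn(dest="r4", srcs=("r1", "r5"))))  # {'RAW'}
print(dependences(add, Insn(dest="r2", srcs=("r6", "r7"))))  # {'WAR'}
print(dependences(add, Insn(dest="r1", srcs=("r6", "r7"))))  # {'WAW'}
```

Only RAW reflects real data flow; WAR and WAW arise from reusing a register name.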

25 CSE502: Computer Architecture Instruction Dependencies (1/2) #for (;(j

26 CSE502: Computer Architecture Hardware Dependency Analysis Processor must handle – Register Data Dependencies (same register) RAW, WAW, WAR – Memory Data Dependencies (same address) RAW, WAW, WAR – Control Dependencies

27 CSE502: Computer Architecture Pipeline Terminology Pipeline Hazards – Potential violations of program dependencies – Must ensure program dependencies are not violated Hazard Resolution – Static method: performed at compile time in software – Dynamic method: performed at runtime using hardware – Two options: Stall (costs perf.) or Forward (costs hw.) Pipeline Interlock – Hardware mechanism for dynamic hazard resolution – Must detect and enforce dependencies at runtime

28 CSE502: Computer Architecture Pipeline: Steady State [Pipeline diagram: Inst j through Inst j+4 each pass through IF, ID, RD, ALU, MEM, WB, one cycle apart over t0 to t5, with earlier and later instructions filling the remaining stages so that every stage is busy.]
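
The steady-state picture can be regenerated with a short Python sketch. It assumes one instruction enters IF per cycle with no stalls; `stage_at` and `diagram` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]

def stage_at(insn_index, cycle):
    """Stage occupied by instruction `insn_index` in `cycle` (both 0-based), or None."""
    k = cycle - insn_index  # instruction i enters IF in cycle i
    return STAGES[k] if 0 <= k < len(STAGES) else None

def diagram(num_insns):
    total_cycles = num_insns + len(STAGES) - 1
    lines = []
    for i in range(num_insns):
        cells = [(stage_at(i, c) or "").ljust(4) for c in range(total_cycles)]
        lines.append(f"Inst j+{i}: " + "".join(cells).rstrip())
    return "\n".join(lines)

print(diagram(5))
```

Each printed row is shifted one cycle to the right of the row above it, which is the staircase shape of the slide's diagram.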

29 CSE502: Computer Architecture Pipeline: Data Hazard [Pipeline diagram: Inst j through Inst j+4 in stages IF, ID, RD, ALU, MEM, WB over t0 to t5, as in the steady-state slide.]

30 CSE502: Computer Architecture Option 1: Stall on Data Hazard [Pipeline diagram over t0 to t5: Inst j and Inst j+1 proceed normally; Inst j+2 is stalled in RD, Inst j+3 is stalled in ID, and Inst j+4 is stalled in IF before continuing through the remaining stages.]

31 CSE502: Computer Architecture Option 2: Forwarding Paths (1/3) Many possible paths. [Pipeline diagram: Inst j through Inst j+4 in stages IF, ID, RD, ALU, MEM, WB over t0 to t5, with forwarding arrows between stages; the MEM-to-ALU case is labeled "Requires stalling even with forwarding paths".]

32 CSE502: Computer Architecture Option 2: Forwarding Paths (2/3) [Figure: IF, ID, Register File (src1, src2), ALU, MEM, WB (dest), with forwarding paths from later stages back to the ALU inputs.]

33 CSE502: Computer Architecture Option 2: Forwarding Paths (3/3) Deeper pipeline may require additional forwarding paths. [Figure: IF, ID, Register File (src1, src2), ALU, MEM, WB (dest), with equality comparators matching source registers against in-flight destination registers.]

34 CSE502: Computer Architecture Pipeline: Control Hazard [Pipeline diagram: Inst i through Inst i+4 in stages IF, ID, RD, ALU, MEM, WB over t0 to t5.]

35 CSE502: Computer Architecture Pipeline: Stall on Control Hazard [Pipeline diagram: Inst i through Inst i+4 over t0 to t5; instructions after the branch are held, labeled "Stalled in IF".]

36 CSE502: Computer Architecture Pipeline: Prediction for Control Hazards [Pipeline diagram over t0 to t5: Inst i and Inst i+1 complete normally; the speculatively fetched Inst i+2, Inst i+3, and Inst i+4 are replaced by nops in their remaining stages (Speculative State Cleared), and fetch restarts with New Inst i+2, New Inst i+3, New Inst i+4 (Fetch Resteered).]

37 CSE502: Computer Architecture Going Beyond Scalar Scalar pipeline limited to CPI ≥ 1.0 – Can never run more than 1 insn. per cycle. Superscalar can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0) – Superscalar means executing multiple insns. in parallel

38 CSE502: Computer Architecture Architectures for Instruction Parallelism Scalar pipeline (baseline) – Instruction/overlap parallelism = D – Operation Latency = 1 – Peak IPC = 1.0 [Figure: successive instructions vs. time in cycles (1 to 12); D different instructions overlapped.]

39 CSE502: Computer Architecture Superscalar Machine Superscalar (pipelined) Execution – Instruction parallelism = D x N – Operation Latency = 1 – Peak IPC = N per cycle [Figure: successive instructions vs. time in cycles (1 to 12), N instructions per cycle; D x N different instructions overlapped.]

40 CSE502: Computer Architecture Superscalar Example: Pentium Pipeline: Prefetch (4× 32-byte buffers), Decode1 (decode up to 2 insts), Decode2 ×2 (read operands, addr comp), Execute ×2 (asymmetric pipes), Writeback ×2. u-pipe only: shift, rotate, some FP. v-pipe only: jmp, jcc, call, fxch. Both: mov, lea, simple ALU, push/pop, test/cmp

41 CSE502: Computer Architecture Pentium Hazards & Stalls Pairing Rules (when can't two insns exec?) – Read/flow dependence: mov eax, 8; mov [ebp], eax – Output dependence: mov eax, 8; mov eax, [ebp] – Partial register stalls: mov al, 1; mov ah, 0 – Function unit rules: some instructions can never be paired (MUL, DIV, PUSHA, MOVS, some FP)

42 CSE502: Computer Architecture Limitations of In-Order Pipelines If the machine parallelism is increased – … dependencies reduce performance – CPI of in-order pipelines degrades sharply As N approaches avg. distance between dependent instructions Forwarding is no longer effective – Must stall often In-order pipelines are rarely full

43 CSE502: Computer Architecture The In-Order N-Instruction Limit On average, parent-child separation is about ± 5 insn. – (Franklin and Sohi '92). Ex. Superscalar degree N = 4: any dependency between these instructions will cause a stall; a dependent insn must be N = 4 instructions away. Average of 5 means there are many cases when the separation is < 4… each of these limits parallelism. Reasonable in-order superscalar is effectively N=2

44 CSE502: Computer Architecture In Search of Parallelism Trivial Parallelism is limited – What is trivial parallelism? In-order: sequential instructions do not have dependencies In all previous examples, all instructions executed either at the same time or after earlier instructions – previous slides show that superscalar execution quickly hits a ceiling So what is non-trivial parallelism? …

45 CSE502: Computer Architecture What is Parallelism? Work – T1: time to complete a computation on a sequential system. Critical Path – T∞: time to complete the same computation on an infinitely-parallel system. Average Parallelism – Pavg = T1 / T∞. For a p-wide system – Tp ≥ max{T1/p, T∞} – Pavg >> p ⇒ Tp ≈ T1/p. Example: x = a + b; y = b * 2; z = (x-y) * (x+y)
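
As a worked sketch of these definitions, the Python below computes the work T1, the critical path T∞, and the average parallelism for the slide's example expression, assuming every operation takes one time unit. The graph encoding and the helper names (`deps`, `work`, `critical_path`, `lower_bound_tp`) are illustrative choices.

```python
from functools import lru_cache

# Each operation maps to the operations it depends on (inputs a, b are free).
deps = {
    "x": [],           # x = a + b
    "y": [],           # y = b * 2
    "d": ["x", "y"],   # d = x - y
    "s": ["x", "y"],   # s = x + y
    "z": ["d", "s"],   # z = d * s
}

def work(graph):
    # T1: one unit per operation on a sequential machine.
    return len(graph)

def critical_path(graph):
    # T_inf: length of the longest dependence chain.
    @lru_cache(maxsize=None)
    def depth(op):
        return 1 + max((depth(p) for p in graph[op]), default=0)
    return max(depth(op) for op in graph)

def lower_bound_tp(t1, t_inf, p):
    # Tp >= max{T1/p, T_inf} for a p-wide machine.
    return max(t1 / p, t_inf)

t1, t_inf = work(deps), critical_path(deps)
print(t1, t_inf, round(t1 / t_inf, 2))  # 5 3 1.67
print(lower_bound_tp(t1, t_inf, 2))     # 3
```

For this expression a 2-wide machine is already limited by the critical path rather than by its width.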

46 CSE502: Computer Architecture ILP: Instruction-Level Parallelism ILP is a measure of the amount of inter-dependencies between instructions. Average ILP = num instructions / longest path. code 1: r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3 – ILP = 1 (must execute serially) – T1 = 3, T∞ = 3. code 2: r1 ← r2 + 1; r3 ← r9 / 17; r4 ← r0 - r10 – ILP = 3 (can execute at the same time) – T1 = 3, T∞ = 1
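
The average-ILP formula can be applied mechanically. The sketch below assumes unit latency and considers only RAW dependences through registers (as if WAR and WAW were removed by renaming); `average_ilp` and the tuple encoding of instructions are hypothetical.

```python
def average_ilp(insns):
    """insns: list of (dest, [srcs]) in program order. Returns N / longest path."""
    ready = {}    # register -> depth at which its most recent value is produced
    longest = 0
    for dest, srcs in insns:
        finish = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = finish
        longest = max(longest, finish)
    return len(insns) / longest

code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(average_ilp(code1))  # 1.0
print(average_ilp(code2))  # 3.0
```

Analyzing code 1 together with an independent three-instruction sequence in the same window gives 6 instructions over a longest path of 3, so the measured ILP depends on the scope of the analysis.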

47 CSE502: Computer Architecture ILP != IPC Instruction level parallelism usually assumes infinite resources, perfect fetch, and unit-latency for all instructions. ILP is more a property of the program dataflow. IPC is the real observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine. The ILP of a program is an upper-bound on the attainable IPC

48 CSE502: Computer Architecture Scope of ILP Analysis First sequence: r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3 (ILP = 1). Second sequence: r11 ← r12 + 1; r13 ← r19 / 17; r14 ← r0 - r20 (ILP = 3). Both sequences analyzed together: ILP = 2

49 CSE502: Computer Architecture DFG Analysis A: R1 = R2 + R3; B: R4 = R5 + R6; C: R1 = R1 * R4; D: R7 = LD 0[R1]; E: BEQZ R7, +32; F: R4 = R7 - 3; G: R1 = R1 + 1; H: R4 → ST 0[R1]; J: R1 = R1 – 1; K: R3 → ST 0[R1]

50 CSE502: Computer Architecture In-Order Issue, Out-of-Order Completion Issue = send an instruction to execution. Issue stage needs to check: 1. Structural Dependence 2. RAW Hazard 3. WAW Hazard 4. WAR Hazard. [Figure: an in-order inst. stream issues to functional units INT, Fadd1 to Fadd2, Fmul1 to Fmul3, and Ld/St; execution begins in order, completion is out of order.]

51 CSE502: Computer Architecture Example A: R1 = R2 + R3; B: R4 = R5 + R6; C: R1 = R1 * R4; D: R7 = LD 0[R1]; E: BEQZ R7, +32; F: R4 = R7 - 3; G: R1 = R1 + 1; H: R4 → ST 0[R1]; J: R1 = R1 – 1; K: R3 → ST 0[R1]. IPC = 10/8 = 1.25. [Figure: issue schedule across cycles 1 to 8 for the groups AB, C, D, EF, G, H, J, K, next to the dataflow graph.]

52 CSE502: Computer Architecture Example (2) A: R1 = R2 + R3; B: R4 = R5 + R6; C: R1 = R1 * R4; D: R9 = LD 0[R1]; E: BEQZ R7, +32; F: R4 = R7 - 3; G: R1 = R1 + 1; H: R4 → ST 0[R9]; J: R1 = R9 – 1; K: R3 → ST 0[R1]. IPC = 10/7 = 1.43. [Figure: issue schedule across cycles 1 to 7 for the groups AB, C, D, E, FG, HJ, K, next to the dataflow graph.]

53 CSE502: Computer Architecture Track with Simple Scoreboarding Scoreboard: a bit-array, 1-bit for each GPR – If the bit is not set: the register has valid data – If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it. Issue in Order: RD ← Fn (RS, RT) – If SB[RS] or SB[RT] is set ⇒ RAW, stall – If SB[RD] is set ⇒ WAW, stall – Else, dispatch to FU (Fn) and set SB[RD]. Complete out-of-order – Update GPR[RD], clear SB[RD]
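
The issue and completion rules above translate almost line for line into code. The following Python is a minimal sketch with one pending bit per register; the class and method names are invented for illustration, and functional-unit availability is not modeled.

```python
class Scoreboard:
    def __init__(self, num_regs):
        # True means some in-flight instruction is going to write this register.
        self.pending = [False] * num_regs

    def can_issue(self, rd, rs, rt):
        if self.pending[rs] or self.pending[rt]:
            return False   # RAW: a source is still being produced
        if self.pending[rd]:
            return False   # WAW: an earlier write to rd is outstanding
        return True

    def issue(self, rd, rs, rt):
        """Try to issue RD <- Fn(RS, RT); returns False if the insn must stall."""
        if not self.can_issue(rd, rs, rt):
            return False
        self.pending[rd] = True
        return True

    def complete(self, rd):
        # The result has been written to GPR[rd]; the register is valid again.
        self.pending[rd] = False

sb = Scoreboard(8)
print(sb.issue(1, 2, 3))  # True:  R1 <- Fn(R2, R3)
print(sb.issue(4, 1, 5))  # False: RAW on R1
print(sb.issue(1, 6, 7))  # False: WAW on R1
sb.complete(1)
print(sb.issue(4, 1, 5))  # True
```

WAR needs no check in this scheme because issue is in order and, in this sketch, operands are assumed to be read at issue.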

54 CSE502: Computer Architecture Out-of-Order Issue Need an extra stage/buffers for Dependency Resolution. [Figure: an in-order inst. stream enters dependency-resolution (DR) buffers in front of functional units INT, Fadd1 to Fadd2, Fmul1 to Fmul3, and Ld/St; execution is out of program order and completion is out of order.]

55 CSE502: Computer Architecture OOO Scoreboarding Similar to In-Order scoreboarding – Need new tables to track status of individual instructions and functional units – Still enforce dependencies: stall dispatch on WAW, stall issue on RAW, stall completion on WAR. Limitations of Scoreboarding? Hints – No structural hazards – Can always write a RAW-free code sequence: Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; … – Think about x86 ISA with only 8 registers. Finite number of registers in any ISA will force you to reuse register names at some point ⇒ WAR, WAW stalls

56 CSE502: Computer Architecture Lessons thus Far More out-of-orderness ⇒ More ILP exposed – But more hazards. Stalling is a generic technique to ensure sequencing. RAW stall is a fundamental requirement (?). Compiler analysis and scheduling can help (not covered in this course)

57 CSE502: Computer Architecture Ex. Tomasulo's Algorithm [IBM 360/91, 1967]

58 CSE502: Computer Architecture FYI: Historical Note Tomasulo's algorithm (1967) was not the first. Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in Feb 1966 – Ideas got buried due to internal politics, changing project goals, etc. – But it's still the first (as far as I know)

59 CSE502: Computer Architecture Modern Enhancements to Tomasulo's Algorithm
Machine Width: Tomasulo Peak IPC = 1; Modern Peak IPC = 6+
Structural Deps: Tomasulo 2 FP FUs, Single CDB; Modern 6-10+ FUs, Many forwarding buses
Anti-Deps: Tomasulo Operand copying; Modern Renamed registers
Output-Deps: Tomasulo RS Tag; Modern Renamed registers
True Deps: Tomasulo Tag-based forwarding; Modern Tag-based forwarding
Exceptions: Tomasulo Imprecise; Modern Precise (requires ROB)

60 CSE502: Computer Architecture

61 Balancing Pipeline Stages T_IF = 6 units, T_ID = 2 units, T_EX = 9 units, T_MEM = 5 units, T_WB = 9 units. Without pipelining: T_cyc ≈ T_IF + T_ID + T_EX + T_MEM + T_WB = 31. Pipelined: T_cyc ≈ max{T_IF, T_ID, T_EX, T_MEM, T_WB} = 9. Speedup = 31 / 9. Can we do better in terms of either performance or efficiency? [Figure: IF, ID, EX, MEM, WB stages.]

62 CSE502: Computer Architecture Balancing Pipeline Stages Two Methods for Stage Quantization: – Merging of multiple stages – Further subdividing a stage Recent Trends: – Deeper pipelines (more and more stages) Pipeline depth growing more slowly since Pentium 4. Why? – Multiple pipelines (subpipelines) – Pipelined memory/cache accesses (tricky)

63 CSE502: Computer Architecture The Cost of Deeper Pipelines Instruction pipelines are not ideal, i.e., instructions in different stages can have dependencies. Suppose: add 1 2 3; nand 3 4 5. [Pipeline diagram: Inst 0 (add) and Inst 1 (nand) each pass through F, D, E, M, W over t0 to t5, with a RAW hazard on register 3; in the second diagram nand stalls in D before continuing to E and M.]

64 CSE502: Computer Architecture Types of Dependencies and Hazards Data Dependence (Both memory and register) – True dependence (RAW) Instruction must wait for all required input operands – Anti-Dependence (WAR) Later write must not clobber a still-pending earlier read – Output dependence (WAW) Earlier write must not clobber already-completed later write Control Dependence (aka Procedural Dependence) – Conditional branches may change instruction sequence – Instructions after cond. branch depend on outcome (more exact definition later)

65 CSE502: Computer Architecture Terminology Pipeline Hazards: – Potential violations of program dependences – Must ensure program dependences are not violated Hazard Resolution: – Static Method: Performed at compile time in software – Dynamic Method: Performed at run time using hardware Pipeline Interlock: – Hardware mechanisms for dynamic hazard resolution – Must detect and enforce dependences at run time

66 CSE502: Computer Architecture Necessary Conditions for Data Hazards WAW Hazard: i: r_k ← _ (Reg Write, stage X); j: r_k ← _ (Reg Write, stage Y). WAR Hazard: i: _ ← r_k (Reg Read, stage X); j: r_k ← _ (Reg Write, stage Y). RAW Hazard: i: r_k ← _ (Reg Write, stage X); j: _ ← r_k (Reg Read, stage Y). Hazard Distance: dist(i,j) ≤ dist(X,Y) ⇒ Hazard!!; dist(i,j) > dist(X,Y) ⇒ Safe

67 CSE502: Computer Architecture Handling Data Hazards Avoidance (static) – Make sure there are no hazards in the code Detect and Stall (dynamic) – Stall until earlier instructions finish Detect and Forward (dynamic) – Get correct value from elsewhere in pipeline

68 CSE502: Computer Architecture Handling Data Hazards: Avoidance Programmer/compiler must know implementation details – Insert nops between dependent instructions: add 1 2 3; nop; nand 3 4 5 (write R3 in cycle 5, read R3 in cycle 6)
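
The hazard-distance condition (a hazard exists when the distance between the two instructions is no larger than the distance between the writing and reading stages) gives a direct way to count the nops that static avoidance needs. The sketch below assumes a 5-stage F, D, E, M, W pipeline that reads registers in stage 2 and writes them in stage 5, and a register file that cannot be read in the same cycle it is written; the function names are hypothetical.

```python
def is_hazard(dist_insns, dist_stages):
    # Hazard when the consumer is too close behind the producer.
    return dist_insns <= dist_stages

def nops_needed(dist_insns, read_stage, write_stage):
    """Nops to insert so the reader arrives after the writer has written."""
    dist_stages = write_stage - read_stage
    return max(0, dist_stages + 1 - dist_insns)

# add writes R3 in stage 5 (W); the immediately following nand reads it in stage 2 (D).
print(is_hazard(1, 5 - 2))       # True
print(nops_needed(1, 2, 5))      # 3
print(is_hazard(1 + 3, 5 - 2))   # False
```

Under these assumptions, if add is fetched in cycle 1 it writes R3 in cycle 5, and a nand fetched three slots later reads R3 in cycle 6. A register file that writes in the first half of a cycle and reads in the second half would need one nop fewer.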

69 CSE502: Computer Architecture Problems with Avoidance Binary compatibility – New implementations may require more nops Code size – Higher instruction cache footprint – Longer binary load times – Worse in machines that execute multiple instructions / cycle Intel Itanium – 25-40% of instructions are nops Slower execution – CPI=1, but many instructions are nops

70 CSE502: Computer Architecture Handling Data Hazards: Detect & Stall Detection – Compare regA & regB with DestReg of preceding insn. 3 bit comparators Stall – Do not advance pipeline register for Fetch/Decode – Pass nop to Execute
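
A minimal sketch of the detection step, assuming the instruction in Decode compares its two source registers against the destination registers of the instructions still ahead of it in the pipeline. `must_stall` and its argument layout are illustrative, not taken from the slides.

```python
def must_stall(reg_a, reg_b, in_flight_dests):
    """True if regA or regB matches the destReg of an insn still in the pipeline.

    in_flight_dests: destReg of each older in-flight instruction,
    or None for an instruction that writes no register (e.g. a nop).
    """
    return any(d is not None and d in (reg_a, reg_b) for d in in_flight_dests)

# nand 3 4 5 in Decode while add 1 2 3 (destReg 3) is in Execute:
print(must_stall(3, 4, [3, None, None]))   # True: hold IF/ID, pass a nop to Execute
print(must_stall(3, 4, [None, 5, None]))   # False: no match, advance normally
```

In hardware each membership test is one small comparator per source and in-flight destination, which is where the 3-bit comparators (for 8 registers) come from.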

71 CSE502: Computer Architecture Problems with Detect & Stall CPI increases on every hazard Are these stalls necessary? Not always! – The new value for R3 is in the EX/Mem register – Reroute the result to the nand Called forwarding or bypassing

72 CSE502: Computer Architecture Handling Data Hazards: Detect & Forward Detection – Same as detect and stall, but… each possible hazard requires different forwarding paths Forward – Add data paths for all possible sources – Add mux in front of ALU to select source bypassing logic often a critical path in wide-issue machines – # paths grows quadratically with machine width
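
The mux in front of the ALU can be sketched as a priority selection among the possible sources. This Python version assumes two forwarding sources, the EX/Mem and Mem/WB pipeline registers, each modeled as a (destReg, value) pair; it ignores the load case where the value is not yet available in EX/Mem. All names are hypothetical.

```python
def select_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the ALU input for src_reg: the youngest in-flight producer wins.

    ex_mem, mem_wb: (dest_reg, value) held in that pipeline register, or None.
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]        # result of the immediately preceding insn
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]        # result from two insns back
    return regfile_value        # no hazard: value read in Decode is current

print(select_operand(3, 0, (3, 42), (3, 7)))   # 42
print(select_operand(3, 0, None, (3, 7)))      # 7
print(select_operand(3, 99, (4, 1), None))     # 99
```

Every extra pipeline stage or issue slot adds another candidate source to this selection, which is why bypass logic grows quickly with machine width.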

73 CSE502: Computer Architecture Handling Control Hazards Avoidance (static) – No branches? – Convert branches to predication Control dependence becomes data dependence Detect and Stall (dynamic) – Stop fetch until branch resolves Speculate and squash (dynamic) – Keep going past branch, throw away instructions if wrong

74 CSE502: Computer Architecture Avoidance: if-conversion Source: if (a == b) { x++; y = n / d; } With a branch: sub t1 ← a, b; jnz t1, PC+2; add x ← x, #1; div y ← n, d. With predicated instructions: sub t1 ← a, b; add(t1) x ← x, #1; div(t1) y ← n, d. With conditional moves: sub t1 ← a, b; add t2 ← x, #1; div t3 ← n, d; cmov(t1) x ← t2; cmov(t1) y ← t3
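
The conditional-move variant can be mimicked in Python to show that it computes the same result as the branching code. This is a sketch under two assumptions that the slide does not spell out: the predicate fires when t1 is zero (that is, a == b), and the unconditional divide does not fault when d is zero.

```python
def branchy(a, b, x, y, n, d):
    if a == b:
        x += 1
        y = n // d
    return x, y

def cmov(pred, old, new):
    """Conditional move: take `new` when the predicate holds, else keep `old`."""
    return new if pred else old

def if_converted(a, b, x, y, n, d):
    t1 = a - b                      # sub t1 <- a, b
    taken = (t1 == 0)               # assumed predicate sense: a == b
    t2 = x + 1                      # add t2 <- x, #1 (always executes)
    t3 = n // d if d != 0 else 0    # div t3 <- n, d (assumed non-faulting)
    x = cmov(taken, x, t2)          # cmov(t1) x <- t2
    y = cmov(taken, y, t3)          # cmov(t1) y <- t3
    return x, y

print(branchy(1, 1, 10, 0, 20, 4), if_converted(1, 1, 10, 0, 20, 4))  # (11, 5) (11, 5)
print(branchy(1, 2, 10, 0, 20, 4), if_converted(1, 2, 10, 0, 20, 4))  # (10, 0) (10, 0)
```

The add and div now execute on every path, so the control dependence has become a data dependence on t1, at the cost of doing work whose result may be discarded.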

75 CSE502: Computer Architecture Handling Control Hazards: Detect & Stall Detection – In decode, check if opcode is branch or jump Stall – Hold next instruction in Fetch – Pass noop to Decode

76 CSE502: Computer Architecture Problems with Detect & Stall CPI increases on every branch. Are these stalls necessary? Not always! – Branch is only taken half the time – Assume branch is NOT taken: keep fetching, treat branch as noop; if wrong, make sure bad instructions don't complete
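
A back-of-the-envelope model shows why assuming not-taken helps. The sketch below uses a base CPI of 1 and charges a fixed penalty per stalled or mispredicted branch; the branch frequency (20%) and the 2-cycle penalty are made-up example values, and only the 50% taken rate comes from the slide.

```python
def cpi_stall(branch_frac, penalty):
    # Every branch pays the full penalty.
    return 1 + branch_frac * penalty

def cpi_predict_not_taken(branch_frac, taken_frac, penalty):
    # Only taken branches (the mispredictions) pay the penalty.
    return 1 + branch_frac * taken_frac * penalty

branch_frac, taken_frac, penalty = 0.2, 0.5, 2   # hypothetical workload numbers
print(round(cpi_stall(branch_frac, penalty), 2))                          # 1.4
print(round(cpi_predict_not_taken(branch_frac, taken_frac, penalty), 2))  # 1.2
```

With half of the branches taken, predicting not-taken removes half of the branch penalty in this model.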

