CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

Slides:

Advertisements

Similar presentations

CSE 502: Computer Architecture

Advertisements

Advanced Computer Architectures Laboratory on DLX Pipelining Vittorio Zaccaria.

Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This.

Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.

Instruction-Level Parallelism (ILP)

Hazard Analysis Procedure  Memory Data Dependences Output Dependence (WAW) Anti Dependence (WAR) True Data Dependence (RAW)  Register Data Dependences.

1 RISC Pipeline Han Wang CS3410, Spring 2010 Computer Science Cornell University See: P&H Chapter 4.6.

Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University RISC Pipeline See: P&H Chapter 4.6.

Chapter 12 Pipelining Strategies Performance Hazards.

EECS 470 Pipeline Hazards Lecture 4 Coverage: Appendix A.

Computer Architecture Lecture 3 Coverage: Appendix A

“Iron Law” of Processor Performance

1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.

©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan

1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Computer ArchitectureFall 2007 © October 24nd, 2007 Majd F. Sakr CS-447– Computer Architecture.

Lec 8: Pipelining Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University.

DLX Instruction Format

Lec 9: Pipelining Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University.

Pipelining - II Adapted from CS 152C (UC Berkeley) lectures notes of Spring 2002.

Appendix A Pipelining: Basic and Intermediate Concepts

ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.

EECS 470 Further review: Pipeline Hazards and More Lecture 2 – Winter 2014 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,

Pipeline Hazard CT101 – Computing Systems. Content Introduction to pipeline hazard Structural Hazard Data Hazard Control Hazard.

1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.

Pipelining. 10/19/ Outline 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion.

Chapter 2 Summary Classification of architectures Features that are relatively independent of instruction sets “Different” Processors –DSP and media processors.

1 Appendix A Pipeline implementation Pipeline hazards, detection and forwarding Multiple-cycle operations MIPS R4000 CDA5155 Spring, 2007, Peir / University.

EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.

Chapter 4 CSF 2009 The processor: Pipelining. Performance Issues Longest delay determines clock period – Critical path: load instruction – Instruction.

Comp Sci pipelining 1 Ch. 13 Pipelining. Comp Sci pipelining 2 Pipelining.

CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.

CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.

CMPE 421 Parallel Computer Architecture

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.

Sample Code (Simple) Run the following code on a pipelined datapath: add1 2 3 ; reg 3 = reg 1 + reg 2 nand ; reg 6 = reg 4 & reg 5 lw ; reg.

Electrical and Computer Engineering University of Cyprus LAB3: IMPROVING MIPS PERFORMANCE WITH PIPELINING.

EECE 476: Computer Architecture Slide Set #5: Implementing Pipelining Tor Aamodt Slide background: Die photo of the MIPS R2000 (first commercial MIPS microprocessor)

CDA 5155 Computer Architecture Week 1.5. Start with the materials: Conductors and Insulators Conductor: a material that permits electrical current to.

Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam.

Lecture 2: Pipelining and Superscalar Review. Motivation: Increase throughput with little increase in cost (hardware, power, complexity, etc.) Bandwidth.

11 Pipelining Kosarev Nikolay MIPT Oct, Pipelining Implementation technique whereby multiple instructions are overlapped in execution Each pipeline.

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)

ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.

Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 1/29/2014.

Computer Organization

CDA3101 Recitation Section 8

Pipelining: Hazards Ver. Jan 14, 2014

Single Clock Datapath With Control

Appendix C Pipeline implementation

ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design

Morgan Kaufmann Publishers The Processor

Chapter 4 The Processor Part 2

Lecture 6: Advanced Pipelines

Pipelining Multicycle, MIPS R4000, and More

Pipelining in more detail

How to improve (decrease) CPI

Pipeline control unit (highly abstracted)

The Processor Lecture 3.6: Control Hazards

Control unit extension for data hazards

The Processor Lecture 3.5: Data Hazards

Instruction Execution Cycle

Pipeline control unit (highly abstracted)

Pipeline Control unit (highly abstracted)

Control unit extension for data hazards

Introduction to Computer Organization and Architecture

Control unit extension for data hazards

Spring 2010 Ilam University

Presentation transcript:

CSE502: Computer Architecture Core Pipelining

CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1) – Long clock period (to accommodate slowest instruction) Multi-cycle control: micro-programmed – Short clock period – High CPI Can we have both low CPI and short clock period? Single-cycle Multi-cycle insn0.(fetch,decode,exec)insn1.(fetch,decode,exec) insn0.decinsn0.fetchinsn1.decinsn1.fetchinsn0.execinsn1.exec time

CSE502: Computer Architecture Pipelining Start with multi-cycle design When insn0 goes from stage 1 to stage 2 … insn1 starts stage 1 Each instruction passes through all stages … but instructions enter and leave at faster rate Multi-cycle insn0.decinsn0.fetchinsn1.decinsn1.fetchinsn0.execinsn1.exec time Pipelined insn0.execinsn0.decinsn0.fetch insn1.decinsn1.fetchinsn1.exec insn2.decinsn2.fetchinsn2.exec Can have as many insns in flight as there are stages

CSE502: Computer Architecture Pipeline Examples address hit? = = = = = = = = Increases throughput at the expense of latency address hit? = = = = = = = = address hit? = = = = = = = =

CSE502: Computer Architecture Processor Pipeline Review I-cache Reg File PC +4+4 D-cache ALU FetchDecodeMemory (Write-back) Execute

CSE502: Computer Architecture Stage 1: Fetch Fetch an instruction from memory every cycle – Use PC to index memory – Increment PC (assume no branches for now) Write state to the pipeline register (IF/ID) – The next stage will read this pipeline register

CSE502: Computer Architecture Stage 1: Fetch Diagram Instruction bits Instruction bits IF / ID Pipeline register PC Instruction Cache Instruction Cache en MUXMUX MUXMUX PC + 1 Decode target

CSE502: Computer Architecture Stage 2: Decode Decodes opcode bits – Set up Control signals for later stages Read input operands from register file – Specified by decoded instruction bits Write state to the pipeline register (ID/EX) – Opcode – Register contents – PC+1 (even though decode didn’t use it) – Control signals (from insn) for opcode and destReg

CSE502: Computer Architecture Stage 2: Decode Diagram ID / EX Pipeline register regA contents regA contents regB contents regB contents Register File regA regB en Instruction bits Instruction bits IF / ID Pipeline register PC + 1 Control signals Control signals Fetch Execute destReg data target

CSE502: Computer Architecture Stage 3: Execute Perform ALU operations – Calculate result of instruction Control signals select operation Contents of regA used as one input Either regB or constant offset (from insn) used as second input – Calculate PC-relative branch target PC+1+(constant offset) Write state to the pipeline register (EX/Mem) – ALU result, contents of regB, and PC+1+offset – Control signals (from insn) for opcode and destReg

CSE502: Computer Architecture Stage 3: Execute Diagram ID / EX Pipeline register regA contents regA contents regB contents regB contents ALU result ALU result EX/Mem Pipeline register PC + 1 Control signals Control signals Control signals Control signals PC+1 +offset PC+1 +offset + regB contents regB contents ALUALU MUXMUX MUXMUX Decode Memory destReg data target

CSE502: Computer Architecture Stage 4: Memory Perform data cache access – ALU result contains address for LD or ST – Opcode bits control R/W and enable signals Write state to the pipeline register (Mem/WB) – ALU result and Loaded data – Control signals (from insn) for opcode and destReg

CSE502: Computer Architecture Stage 4: Memory Diagram ALU result ALU result Mem/WB Pipeline register ALU result ALU result EX/Mem Pipeline register Control signals Control signals PC+1 +offset PC+1 +offset regB contents regB contents Loaded data Loaded data Data Cache en R/W in_data in_addr Control signals Control signals Execute Write-back destReg data target

CSE502: Computer Architecture Stage 5: Write-back Writing result to register file (if required) – Write Loaded data to destReg for LD – Write ALU result to destReg for arithmetic insn – Opcode bits control register write enable signal

CSE502: Computer Architecture Stage 5: Write-back Diagram ALU result ALU result Mem/WB Pipeline register Control signals Control signals Loaded data Loaded data MUXMUX MUXMUX destReg MUXMUX MUXMUX Memory

CSE502: Computer Architecture Putting It All Together PC Inst Cache Inst Cache Register file MUXMUX MUXMUX 1 1 Data Cache Data Cache MUXMUX MUXMUX IF/IDID/EXEX/MemMem/WB MUXMUX MUXMUX op dest offset valB valA PC+1 target ALU result ALU result op dest valB op dest ALU result ALU result mdata eq? instruction 0 R2 R3 R4 R5 R1 R6 R0 R7 regA regB data dest MUXMUX MUXMUX

CSE502: Computer Architecture Pipelining Idealism Uniform Sub-operations – Operation can partitioned into uniform-latency sub-ops Repetition of Identical Operations – Same ops performed on many different inputs Repetition of Independent Operations – All repetitions of op are mutually independent

CSE502: Computer Architecture Pipeline Realism Uniform Sub-operations … NOT! – Balance pipeline stages Stage quantization to yield balanced stages Minimize internal fragmentation (left-over time near end of cycle) Repetition of Identical Operations … NOT! – Unifying instruction types Coalescing instruction types into one “multi-function” pipe Minimize external fragmentation (idle stages to match length) Repetition of Independent Operations … NOT! – Resolve data and resource hazards Inter-instruction dependency detection and resolution Pipelining is expensive

CSE502: Computer Architecture The Generic Instruction Pipeline Instruction Fetch Instruction Decode Operand Fetch Instruction Execute Write-back IFIF IDID OFOF EXEX WBWB

CSE502: Computer Architecture Balancing Pipeline Stages T IF = 6 units T ID = 2 units T ID = 9 units T EX = 5 units T OS = 9 units Without pipelining T cyc  T IF +T ID +T OF +T EX +T OS = 31 Pipelined T cyc  max{T IF, T ID, T OF, T EX, T OS } = 9 Speedup= 31 / 9 IFIF IDID OFOF EXEX WBWB Can we do better?

CSE502: Computer Architecture Balancing Pipeline Stages (1/2) Two methods for stage quantization – Merge multiple sub-ops into one – Divide sub-ops into smaller pieces Recent/Current trends – Deeper pipelines (more and more stages) – Multiple different pipelines/sub-pipelines – Pipelining of memory accesses

CSE502: Computer Architecture Balancing Pipeline Stages (2/2) Coarser-Grained Machine Cycle: 4 machine cyc / instruction Finer-Grained Machine Cycle: 11 machine cyc /instruction T IF&ID = 8 units T OF = 9 units T EX = 5 units T OS = 9 units IFIF IDID OFOF WBWB EXEX # stages = 11 T cyc = 3 units IFIF IFIF IDID OFOF OFOF OFOF EXEX EXEX WBWB WBWB WBWB # stages = 4 T cyc = 9 units

CSE502: Computer Architecture Pipeline Examples IFIF RDRD ALUALU MEMMEM WBWB IF ID OF EX WB PC GEN Cache Read DecodeDecode Read REG Addr GEN Cache Read EX 1 EX 2 Check Result Write Result WB EX OF ID IF MIPS R2000/R3000 AMDAHL 470V/7

CSE502: Computer Architecture Instruction Dependencies (1/2) Data Dependence – Read-After-Write (RAW) (only true dependence) Read must wait until earlier write finishes – Anti-Dependence (WAR) Write must wait until earlier read finishes (avoid clobbering) – Output Dependence (WAW) Earlier write can’t overwrite later write Control Dependence (a.k.a. Procedural Dependence) – Branch condition must execute before branch target – Instructions after branch cannot run before branch

CSE502: Computer Architecture Instruction Dependencies (1/2) #for (;(j<high)&&(array[j]<array[low]);++j); bge j, high, $36 mul$15, j, 4 addu$24, array, $15 lw$25, 0($24) mul$13, low, 4 addu$14, array, $13 lw$15, 0($14) bge$25, $15, $36 $35: addu j, j, 1... $36: addu$11, $11, Real code has lots of dependencies

CSE502: Computer Architecture Hardware Dependency Analysis Processor must handle – Register Data Dependencies (same register) RAW, WAW, WAR – Memory Data Dependencies (same address) RAW, WAW, WAR – Control Dependencies

CSE502: Computer Architecture Pipeline Terminology Pipeline Hazards – Potential violations of program dependencies – Must ensure program dependencies are not violated Hazard Resolution – Static method: performed at compile time in software – Dynamic method: performed at runtime using hardware – Two options: Stall (costs perf.) or Forward (costs hw.) Pipeline Interlock – Hardware mechanism for dynamic hazard resolution – Must detect and enforce dependencies at runtime

CSE502: Computer Architecture Pipeline: Steady State IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEM IFIDRDALU IFIDRD IFID IF t0t0 t1t1 t2t2 t3t3 t4t4 t5t5 Inst j Inst j+1 Inst j+2 Inst j+3 Inst j+4

CSE502: Computer Architecture Pipeline: Data Hazard t0t0 t1t1 t2t2 t3t3 t4t4 t5t5 IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEM IFIDRDALU IFIDRD IFID IF Inst j Inst j+1 Inst j+2 Inst j+3 Inst j+4

CSE502: Computer Architecture Option 1: Stall on Data Hazard IFIDRDALUMEMWB IFIDRDALUMEMWB IFID Stalled in RD ALUMEMWB IF Stalled in ID RDALUMEMWB Stalled in IF IDRDALUMEM IFIDRDALU t0t0 t1t1 t2t2 t3t3 t4t4 t5t5 RD ID IF IFIDRD IFID IF Inst j Inst j+1 Inst j+2 Inst j+3 Inst j+4

CSE502: Computer Architecture Option 2: Forwarding Paths (1/3) IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEM IFIDRDALU IFIDRD IFID IF t0t0 t1t1 t2t2 t3t3 t4t4 t5t5 Many possible paths Inst j Inst j+1 Inst j+2 Inst j+3 Inst j+4 MEMMEMALUALU Requires stalling even with forwarding paths

CSE502: Computer Architecture Option 2: Forwarding Paths (2/3) IFIFIDID Register File src1 src2 ALUALU MEMMEM dest WBWB

CSE502: Computer Architecture Option 2: Forwarding Paths (3/3) Deeper pipeline may require additional forwarding paths Deeper pipeline may require additional forwarding paths IFIF Register File src1 src2 ALUALU MEMMEM dest = = = = = = = = WBWB = = = = IDID

CSE502: Computer Architecture Pipeline: Control Hazard t0t0 t1t1 t2t2 t3t3 t4t4 t5t5 Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEM IFIDRDALU IFIDRD IFID IF

CSE502: Computer Architecture Pipeline: Stall on Control Hazard IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUMEM IFIDRDALU IFIDRD IFID IF t0t0 t1t1 t2t2 t3t3 t4t4 t5t5 Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 Stalled in IF

CSE502: Computer Architecture Pipeline: Prediction for Control Hazards t0t0 t1t1 t2t2 t3t3 t4t4 t5t5 Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 IFIDRDALUMEMWB IFIDRDALUMEMWB IFIDRDALUnopnop IFIDRDnopnop IFIDnopnop IFIDRD IFID IF nop nopnop ALUnop RDALU IDRD nop nop nop New Inst i+2 New Inst i+3 New Inst i+4 Speculative State Cleared Fetch Resteered

CSE502: Computer Architecture Going Beyond Scalar Scalar pipeline limited to CPI ≥ 1.0 – Can never run more than 1 insn. per cycle “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0) – Superscalar means executing multiple insns. in parallel

CSE502: Computer Architecture Architectures for Instruction Parallelism Scalar pipeline (baseline) – Instruction/overlap parallelism = D – Operation Latency = 1 – Peak IPC = 1.0 D Successive Instructions Time in cycles D different instructions overlapped

CSE502: Computer Architecture Superscalar Machine Superscalar (pipelined) Execution – Instruction parallelism = D x N – Operation Latency = 1 – Peak IPC = N per cycle Successive Instructions Time in cycles N D x N different instructions overlapped

CSE502: Computer Architecture Superscalar Example: Pentium PrefetchPrefetch Decode1Decode1 Decode2Decode2Decode2Decode2 ExecuteExecuteExecuteExecute WritebackWritebackWritebackWriteback 4× 32-byte buffers Decode up to 2 insts Read operands, Addr comp Asymmetric pipes u-pipev-pipe shift rotate some FP jmp, jcc, call, fxch both mov, lea, simple ALU, push/pop test/cmp

CSE502: Computer Architecture Pentium Hazards & Stalls “Pairing Rules” (when can’t two insns exec?) – Read/flow dependence mov eax, 8 mov [ebp], eax – Output dependence mov eax, 8 mov eax, [ebp] – Partial register stalls mov al, 1 mov ah, 0 – Function unit rules Some instructions can never be paired – MUL, DIV, PUSHA, MOVS, some FP

CSE502: Computer Architecture Limitations of In-Order Pipelines If the machine parallelism is increased – … dependencies reduce performance – CPI of in-order pipelines degrades sharply As N approaches avg. distance between dependent instructions Forwarding is no longer effective – Must stall often In-order pipelines are rarely full

CSE502: Computer Architecture The In-Order N-Instruction Limit On average, parent-child separation is about ± 5 insn. – (Franklin and Sohi ’92) Ex. Superscalar degree N = 4 Any dependency between these instructions will cause a stall Dependent insn must be N = 4 instructions away Average of 5 means there are many cases when the separation is < 4… each of these limits parallelism Reasonable in-order superscalar is effectively N=2