Pipelining
Hwanmo Sung, CS147 Presentation, Professor Sin-Min Lee

Pipelining

What is pipelining?
An implementation technique that overlaps the execution of multiple instructions.
More generally, any architecture in which digital information flows through a series of stations that each inspect, interpret, or modify the information.

Example (1): Laundry
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes.
Dryer takes 40 minutes.
"Folder" takes 20 minutes.

Sequential Laundry
Sequential laundry takes 6 hours for 4 loads.
[Figure: task-order timeline from 6 PM to midnight, with loads A, B, C, D done one after another.]

Pipelined Laundry
Start work ASAP.
Pipelined laundry takes 3.5 hours for 4 loads.
Speedup = 6 / 3.5 ≈ 1.7.
[Figure: task-order timeline from 6 PM to midnight, with loads A, B, C, D overlapped across washer, dryer, and folder.]
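A minimal sketch of my own (not from the slides) that checks the laundry arithmetic in Python, assuming the slowest stage (the dryer) sets the pipelined rate:

stages_min = [30, 40, 20]        # washer, dryer, "folder" times in minutes
loads = 4

sequential = loads * sum(stages_min)                          # 4 * 90 = 360 min = 6 hours
pipelined = sum(stages_min) + (loads - 1) * max(stages_min)   # 90 + 3 * 40 = 210 min = 3.5 hours

print(sequential / 60, pipelined / 60)     # 6.0 vs 3.5 hours
print(round(sequential / pipelined, 2))    # speedup ~1.71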

Pipelining Lessons
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
Pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to "fill" the pipeline and time to "drain" it reduce speedup.

Computer Pipelines
Processors execute billions of instructions, so throughput is what matters.
Desirable RISC features: all instructions are the same length, registers are located in the same place in the instruction format, and memory operands appear only in loads and stores.

Unpipelined Design
Single-cycle implementation: the cycle time depends on the slowest instruction, and every instruction takes the same amount of time.
Multi-cycle implementation: the execution of an instruction is divided into multiple steps, and each instruction may take a variable number of steps (clock cycles).

Unpipelined System
[Figure: 30 ns of combinational logic followed by a 3 ns register, with a timing diagram of Op1 through Op3.]
One operation must complete before the next can begin.
Operations are spaced 33 ns apart (30 ns of logic plus 3 ns of register delay).

Pipelined Design
Divide the execution of an instruction into multiple steps (stages) and overlap the execution of different instructions in different stages; each cycle, a different instruction is executed in each stage.
For example, with a 5-stage pipeline (Fetch, Decode, Read, Execute, Write), 5 instructions are executed concurrently in the 5 pipeline stages, and one instruction completes every cycle (instead of every 5 cycles).
This can increase the throughput of the machine 5 times.

Example (1): 3-Stage Pipeline
Operations are spaced 13 ns apart, and 3 operations occur simultaneously.
[Figure: three pipeline stages, each with 10 ns of combinational logic followed by a 3 ns register, and a timing diagram of Op1 through Op4.]
Delay = 39 ns, throughput = 77 MHz.
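A short sketch of my own, using the figure's numbers, showing where the 13 ns, 39 ns, and 77 MHz values come from:

logic_ns, reg_ns, stages = 10, 3, 3

clock_period_ns = logic_ns + reg_ns        # 13 ns per stage
latency_ns = stages * clock_period_ns      # 3 * 13 = 39 ns for one operation
throughput_mhz = 1e3 / clock_period_ns     # 1000 / 13 ≈ 77 MHz (one result completes per cycle)

print(latency_ns, round(throughput_mhz))   # 39 77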

Example (2)
5-stage pipeline: Fetch, Decode, Read, Execute, Write.
[Figure: pipeline diagram of five instructions, each passing through the stages F, D, R, E, W one cycle apart; the first cycles fill the pipeline and the last cycles drain it.]
Non-pipelined processor: 25 cycles = number of instructions (5) x number of stages (5).
Pipelined processor: 9 cycles = start-up latency (4) + number of instructions (5).
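Generalizing the slide's arithmetic, a sketch (helper names are my own) of the cycle counts for n instructions on a k-stage pipeline:

def unpipelined_cycles(n, k):
    return n * k              # each instruction uses all k stages before the next one starts

def pipelined_cycles(n, k):
    return (k - 1) + n        # k - 1 cycles to fill the pipeline, then one instruction completes per cycle

print(unpipelined_cycles(5, 5))   # 25
print(pipelined_cycles(5, 5))     # 9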

Basic Performance Issues in Pipelining
Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction.

Pipeline Speedup Example
Assume the multi-cycle implementation has a 10 ns clock cycle, loads take 5 clock cycles and account for 40% of the instructions, and all other instructions take 4 clock cycles.
If pipelining the machine adds 1 ns to the clock cycle, how much speedup in instruction execution rate do we get from pipelining?
MC average instruction time = clock cycle x average CPI = 10 ns x (0.6 x 4 + 0.4 x 5) = 44 ns
PL average instruction time = (10 + 1) ns x 1 = 11 ns
Speedup = 44 / 11 = 4
This ignores the time needed to fill and empty the pipeline and delays due to hazards.
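The same arithmetic, as a quick check in Python (numbers taken from the example above):

mc_clock_ns = 10.0
pl_clock_ns = mc_clock_ns + 1.0          # pipelining adds 1 ns of overhead to the clock cycle

avg_cpi_multicycle = 0.6 * 4 + 0.4 * 5   # 60% of instructions take 4 cycles, 40% (loads) take 5
mc_instr_time = mc_clock_ns * avg_cpi_multicycle   # 10 * 4.4 = 44 ns
pl_instr_time = pl_clock_ns * 1.0                  # ideal pipelined CPI = 1, so 11 ns

print(mc_instr_time, pl_instr_time, mc_instr_time / pl_instr_time)   # 44.0 11.0 4.0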

What makes it easy?
All instructions are the same length.
There are just a few instruction formats.
Memory operands appear only in loads and stores.

It's Not Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
Structural hazards: the hardware cannot support this combination of instructions (two instructions need the same resource).
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
Control hazards: pipelining of branches and other instructions that change the PC.
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" into the pipeline.

Limitation: Nonuniform Pipelining
[Figure: three pipeline stages with 5 ns, 15 ns, and 10 ns of combinational logic, each followed by a 3 ns register.]
Throughput is limited by the slowest stage.
Delay is determined by clock period x number of stages, so stages must be balanced.
Delay = 18 ns x 3 = 54 ns, throughput = 55 MHz.

Limitation: Deep Pipelines
Diminishing returns as more pipeline stages are added: register delays become the limiting factor, latency increases, and throughput gains are small.
[Figure: six pipeline stages, each with 5 ns of combinational logic followed by a 3 ns register.]
Delay = 48 ns, throughput = 125 MHz (one result every 8 ns).
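To see the diminishing returns numerically, a sketch of my own (not from the slides) that splits the 30 ns of logic from the unpipelined system into ever more stages, keeping 3 ns of register overhead per stage:

total_logic_ns, reg_ns = 30, 3

for stages in (1, 3, 6, 10, 30):
    period = total_logic_ns / stages + reg_ns   # slowest-stage logic plus register overhead
    latency = period * stages                   # grows with depth because of the per-stage register delay
    throughput = 1e3 / period                   # in MHz; gains shrink as the register delay dominates
    print(stages, round(latency), round(throughput))
# Prints 1 33 30, 3 39 77, 6 48 125, 10 60 167, 30 120 250: far below the ideal 30x throughput gain.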

Limitation: Sequential Dependencies
Op4 gets its result from Op1: a pipeline hazard.
[Figure: three-stage pipeline timing diagram of Op1 through Op4, with Op4 waiting on Op1's result.]

Structural Hazard
Sometimes called a resource conflict.
Example: some pipelined machines share a single memory pipeline for data and instructions. As a result, when an instruction contains a data memory reference, it conflicts with the instruction fetch of a later instruction.

Solutions to Structural Hazard
Resource duplication, for example:
Separate I and D caches for the memory access conflict.
A time-multiplexed or multi-port register file for the register file access conflict.

Data Hazard
A data hazard occurs when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine.

Solutions to Data Hazard
Freezing the pipeline.
(Internal) forwarding.
Compiler scheduling.
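To make the forwarding idea concrete, here is a minimal sketch of my own (the field and register names are hypothetical, MIPS-style) of the classic forwarding-unit test for one ALU input:

def forward_select(ex_mem, mem_wb, src_reg):
    """Return where the ALU should take its operand for register src_reg."""
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"    # forward the ALU result of the immediately preceding instruction
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"    # forward the value being written back by the instruction issued two cycles earlier
    return "REGFILE"       # no dependence: read the register file normally

# add $1,$2,$3 followed by sub $4,$1,$5: $1 must be forwarded from the EX/MEM register.
ex_mem = {"reg_write": True, "rd": 1}
mem_wb = {"reg_write": False, "rd": 0}
print(forward_select(ex_mem, mem_wb, src_reg=1))   # EX/MEM
print(forward_select(ex_mem, mem_wb, src_reg=5))   # REGFILE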

Control (Branch) Hazards
Caused by branches: the fetch of the next instruction has to wait until the target (including the branch condition) of the current branch instruction is resolved.

Solutions to Control Hazard
Optimized branch processing:
1. Find out early whether the branch is taken (simplified branch condition).
2. Compute the branch target address early (extra hardware).
Branch prediction: predict the next target address and, if wrong, flush all the speculatively fetched instructions from the pipeline.
Delayed branch: stall the pipeline to delay the fetch of the next instruction.
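As an illustration of branch prediction (my own sketch, not from the slides), a 2-bit saturating-counter predictor indexed by the low bits of the PC:

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.counters = [1] * entries      # states 0-1 predict not-taken, 2-3 predict taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)

# A loop branch taken 9 times and then not taken at exit:
bp, mispredicts = TwoBitPredictor(), 0
for outcome in [True] * 9 + [False]:
    if bp.predict(0x400) != outcome:
        mispredicts += 1
    bp.update(0x400, outcome)
print(mispredicts)   # 2 mispredictions: one while warming up, one at loop exit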

Summary
Pipelining overlaps the execution of multiple instructions.
With an ideal pipeline, the CPI (cycles per instruction) is one, and the speedup is equal to the number of stages in the pipeline.
However, several factors prevent us from achieving the ideal speedup, including:
Not being able to divide the pipeline evenly.
The time needed to fill and empty the pipeline.
The overhead needed for pipelining.
Structural, data, and control hazards.

Summary
Just overlap tasks; pipelining is easy if the tasks are independent.
Speedup vs. pipeline depth: if the ideal CPI is 1, then
Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Clock cycle unpipelined / Clock cycle pipelined)
Hazards limit performance on computers:
Structural: need more hardware resources.
Data: need forwarding and compiler scheduling.
Control: discussed next time.
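The summary formula, evaluated for a couple of cases (a sketch with my own example values, not from the slides):

def pipeline_speedup(depth, stall_cpi, clk_unpipelined_ns, clk_pipelined_ns):
    return (depth / (1 + stall_cpi)) * (clk_unpipelined_ns / clk_pipelined_ns)

# Ideal case: 5 stages, no stalls, no clock overhead -> speedup equals the pipeline depth.
print(pipeline_speedup(5, 0.0, 10, 10))   # 5.0
# With 0.4 stall cycles per instruction and a 10 ns -> 11 ns clock, the speedup drops to about 3.2.
print(pipeline_speedup(5, 0.4, 10, 11))   # ~3.25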