Pipelining I (Systems I). Topics: pipelining principles, pipeline overheads, pipeline registers and stages.

2 Overview
What’s wrong with the sequential (SEQ) Y86? It’s slow! Each piece of hardware is used only a small fraction of the time. We would like to find a way to get more performance with only a little more hardware.
General principles of pipelining: goal, difficulties.
Creating a pipelined Y86 processor: rearranging SEQ, inserting pipeline registers, problems with data and control hazards.

3 Real-World Pipelines: Car Washes
Idea: divide the process into independent stages and move objects through the stages in sequence. At any given time, multiple objects are being processed.
[Figure: sequential, parallel, and pipelined car washes]

4 Laundry example
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes.
Dryer takes 30 minutes.
“Folder” takes 30 minutes.
“Stasher” takes 30 minutes to put clothes into drawers.
Slide courtesy of D. Patterson

5 Sequential Laundry
Sequential laundry takes 8 hours for 4 loads. If they learned pipelining, how long would laundry take?
[Figure: task order A, B, C, D plotted against time in 30-minute steps, starting at 6 PM]
Slide courtesy of D. Patterson

6 Pipelined Laundry: Start ASAP
Pipelined laundry takes 3.5 hours for 4 loads!
[Figure: task order A, B, C, D overlapped in time in 30-minute steps, starting at 6 PM]
Slide courtesy of D. Patterson
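The 8-hour and 3.5-hour figures on slides 5 and 6 can be checked with a short sketch. It assumes every stage takes the same time and that a new load starts as soon as the washer is free; the function names are illustrative and not part of the slides.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load finishes every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load fills the pipeline (stages steps); every later
    load finishes one step after the load before it."""
    return (stages + loads - 1) * stage_minutes


# Four loads, four 30-minute stages (wash, dry, fold, stash).
print(sequential_minutes(4, 4, 30) / 60)  # hours, sequential
print(pipelined_minutes(4, 4, 30) / 60)   # hours, pipelined
```

The `stages + loads - 1` term is the fill time plus one step per remaining load, which is why the speedup for only four loads falls short of the number of stages.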

7 Pipelining Lessons
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.
Multiple tasks operate simultaneously using different resources.
Potential speedup = number of pipe stages.
Pipeline rate is limited by the slowest pipeline stage.
Unbalanced lengths of pipe stages reduce speedup.
Time to “fill” the pipeline and time to “drain” it reduce speedup.
Stall for dependences.
[Figure: pipelined task order A, B, C, D against time in 30-minute steps, 6 PM to 9 PM]
Slide courtesy of D. Patterson

8 Latency and Throughput
Latency: time to complete an operation.
Throughput: work completed per unit time.
Consider plumbing.
  Low latency: turn on the faucet and water comes out.
  High bandwidth: lots of water (e.g., to fill a pool).
What is “high-speed Internet”?
  Low latency: needed for interactive gaming.
  High bandwidth: needed for downloading large files.
  Marketing departments like to conflate latency and bandwidth…

9 Relationship between Latency and Throughput
Latency and bandwidth are only loosely coupled.
Henry Ford: assembly lines increase bandwidth without reducing latency.
  My factory takes 1 day to make a Model T Ford, but I can start building a new car every 10 minutes.
  At 24 hrs/day, I can make 24 * 6 = 144 cars per day.
  A special order for 1 green car still takes 1 day.
  Throughput is increased, but latency is not.
Latency reduction is difficult.
Often, one can buy bandwidth, e.g., more memory chips, more disks, more computers.
Big server farms (e.g., Google) are high bandwidth.

10 Computational Example
System: computation requires a total of 300 picoseconds, plus an additional 20 picoseconds to save the result in a register, so the clock cycle must be at least 320 ps.
[Figure: combinational logic (300 ps) feeding a register (20 ps), driven by the clock]
Delay = 320 ps, Throughput = 3.125 GOPS

11 3-Way Pipelined Version
System: divide the combinational logic into 3 blocks of 100 ps each. A new operation can begin as soon as the previous one passes through stage A, so a new operation begins every 120 ps. Overall latency increases to 360 ps from start to finish.
[Figure: comb. logic A, B, and C (100 ps each), each followed by a register (20 ps), driven by the clock]
Delay = 360 ps, Throughput = 8.33 GOPS
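The delay and throughput numbers for the unpipelined and 3-way pipelined systems follow from one small calculation. This is a sketch under the slides' assumptions (logic splits evenly across stages, each stage ends in a 20 ps register); `pipeline_timing` is a name made up for illustration.

```python
def pipeline_timing(logic_ps: float, reg_ps: float, stages: int):
    """Return (cycle time in ps, latency in ps, throughput in GOPS)
    for combinational logic split evenly into `stages` blocks."""
    cycle_ps = logic_ps / stages + reg_ps   # one block plus its register
    latency_ps = cycle_ps * stages          # an operation visits every stage
    throughput_gops = 1000.0 / cycle_ps     # 1 op per cycle; 1000 ps = 1 ns
    return cycle_ps, latency_ps, throughput_gops


for n in (1, 3):
    cycle, latency, gops = pipeline_timing(300, 20, n)
    print(f"{n}-stage: cycle {cycle:.0f} ps, latency {latency:.0f} ps, "
          f"throughput {gops:.2f} GOPS")
```

Latency grows with the stage count because each added register contributes its 20 ps to every operation, while throughput depends only on the cycle time.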

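12 Pipeline Diagrams
Unpipelined: cannot start a new operation until the previous one completes.
3-way pipelined: up to 3 operations in process simultaneously.
[Figure: timelines of OP1, OP2, OP3, first back to back, then overlapped with each operation passing through stages A, B, C one step behind the previous one]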
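A text version of the pipelined diagram can be generated with a few lines of code. This is only a sketch: it assumes one operation enters the pipeline per cycle with no stalls, and `pipeline_diagram` is a hypothetical helper, not something defined in the lecture.

```python
def pipeline_diagram(num_ops: int, stages=("A", "B", "C")) -> list[str]:
    """One row per operation; each column is one clock cycle.
    Operation k enters stage A in cycle k, so it is shifted k columns."""
    rows = []
    for op in range(num_ops):
        cells = ["."] * op + list(stages)   # "." marks cycles before entry
        rows.append(f"OP{op + 1} " + " ".join(cells))
    return rows


for row in pipeline_diagram(3):
    print(row)
# OP1 A B C
# OP2 . A B C
# OP3 . . A B C
```

Reading any column top to bottom shows which operations are in flight during that cycle; the third column holds C, B, and A at once, which is the "up to 3 operations" case.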

13 Operating a Pipeline
[Figure: the three-stage pipeline (comb. logic A, B, C at 100 ps each, each followed by a 20 ps register) shown at time points 239, 241, 300, and 359 as OP1, OP2, and OP3 advance through stages A, B, and C]

14 Limitations: Nonuniform Delays
Throughput is limited by the slowest stage.
Other stages sit idle for much of the time.
It is challenging to partition a system into balanced stages.
[Figure: comb. logic A (50 ps), B (150 ps), and C (100 ps), each followed by a 20 ps register; timeline of OP1, OP2, OP3 through stages A, B, C]
Delay = 510 ps, Throughput = 5.88 GOPS
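The 510 ps and 5.88 GOPS figures come from clocking every stage at the pace of the slowest one. The sketch below assumes a single shared clock and the slide's 20 ps register delay; `unbalanced_timing` is an illustrative name.

```python
def unbalanced_timing(stage_logic_ps: list[float], reg_ps: float):
    """Return (cycle ps, latency ps, throughput GOPS, idle ps per stage)
    when all stages share one clock set by the slowest stage."""
    slowest = max(stage_logic_ps)
    cycle_ps = slowest + reg_ps
    latency_ps = cycle_ps * len(stage_logic_ps)
    throughput_gops = 1000.0 / cycle_ps
    idle_ps = [slowest - d for d in stage_logic_ps]  # wasted time per cycle
    return cycle_ps, latency_ps, throughput_gops, idle_ps


cycle, latency, gops, idle = unbalanced_timing([50, 150, 100], 20)
print(f"cycle {cycle} ps, latency {latency} ps, {gops:.2f} GOPS, idle {idle}")
```

With stages of 50, 150, and 100 ps, stage A is idle for 100 ps of every 170 ps cycle, which is the "sit idle for much of the time" point made on the slide.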

15 Limitations: Register Overhead
As we try to deepen the pipeline, the overhead of loading registers becomes more significant.
Percentage of clock cycle spent loading the register:
  1-stage pipeline: 6.25%
  3-stage pipeline: 16.67%
  6-stage pipeline: 28.57%
High speeds of modern processor designs are obtained through very deep pipelining.
[Figure: six blocks of comb. logic (50 ps each), each followed by a register (20 ps), driven by the clock]
Delay = 420 ps, Throughput = 14.29 GOPS
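The three overhead percentages can be reproduced directly. The sketch assumes the same 300 ps of logic split evenly and a fixed 20 ps register load time per stage; the function name is made up for this example.

```python
def register_overhead_percent(logic_ps: float, reg_ps: float, stages: int) -> float:
    """Share of each clock cycle spent loading the pipeline register."""
    cycle_ps = logic_ps / stages + reg_ps
    return 100.0 * reg_ps / cycle_ps


for n in (1, 3, 6):
    print(f"{n}-stage pipeline: {register_overhead_percent(300, 20, n):.2f}%")
```

Because the register time stays fixed while the logic per stage shrinks, the overhead fraction rises with depth and the returns from each extra stage diminish.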

16 CPU Performance Equation 3 components to execution time: Factors affecting CPU execution time: Consider all three elements when optimizing Workloads change!
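The equation itself did not survive in this transcript. The usual textbook form, which matches the three terms revisited on slide 22 (instruction count, cycles per instruction, clock cycle time), is given here as an assumption about what the slide showed:

```latex
\text{CPU time}
  = \frac{\text{Instructions}}{\text{Program}}
    \times \frac{\text{Cycles}}{\text{Instruction}}
    \times \frac{\text{Seconds}}{\text{Cycle}}
```

The units cancel to seconds per program, so improving any one factor helps only if the other two do not get worse by more.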

17 Cycles Per Instruction (CPI) Depends on the instruction Average cycles per instruction Example:
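The slide's own example is not preserved in the transcript. As a stand-in, here is a sketch of how an average CPI is computed from an instruction mix; the instruction classes, frequencies, and cycle counts below are hypothetical numbers chosen for illustration only.

```python
def average_cpi(mix: dict[str, tuple[float, int]]) -> float:
    """Weighted average CPI. `mix` maps an instruction class to
    (fraction of executed instructions, cycles for that class)."""
    total_fraction = sum(freq for freq, _ in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix.values())


# Hypothetical workload, not taken from the lecture.
example_mix = {
    "alu":    (0.5, 1),
    "load":   (0.2, 2),
    "store":  (0.1, 2),
    "branch": (0.2, 3),
}
print(f"average CPI = {average_cpi(example_mix):.2f}")
```

Since the weights are dynamic frequencies, the same processor has a different average CPI on a different workload.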

18 Comparing and Summarizing Performance Fair way to summarize performance? Capture in a single number? Example: Which of the following machines is best?

19 Means
Arithmetic mean: can be weighted (terms a_i T_i); represents total execution time; should not be used for aggregating normalized numbers.
Geometric mean: consistent independent of reference; best for combining results; best for normalized results.

20 What is the geometric mean of 2 and 8? A. 5 B. 4
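The two means can be compared in a few lines. This sketch computes both for the values 2 and 8, then uses a pair of hypothetical machines (the timings are invented for illustration) to show why the arithmetic mean of normalized numbers depends on which machine is the reference.

```python
import math


def arithmetic_mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)


def geometric_mean(xs: list[float]) -> float:
    """n-th root of the product of n positive values."""
    return math.prod(xs) ** (1.0 / len(xs))


print(arithmetic_mean([2, 8]), geometric_mean([2, 8]))

# Hypothetical execution times (seconds) on two programs.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]
b_rel_a = [b / a for a, b in zip(times_a, times_b)]  # B normalized to A
a_rel_b = [a / b for a, b in zip(times_a, times_b)]  # A normalized to B

# Arithmetic mean: each machine looks about 5x slower than the reference.
print(arithmetic_mean(b_rel_a), arithmetic_mean(a_rel_b))
# Geometric mean: the verdict is the same whichever reference is used.
print(geometric_mean(b_rel_a), geometric_mean(a_rel_b))
```

The geometric mean of ratios equals the ratio of geometric means, which is why its ranking does not change with the reference machine.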

21 Is Speed the Last Word in Performance?
Depends on the application!
Cost: not just the processor, but other components (i.e., memory).
Power consumption: trade power for performance in many applications.
Capacity: many database applications are I/O bound, and disk bandwidth is the precious commodity.

22 Revisiting the Performance Eqn
Instruction count: no change.
Clock cycle time: improves by a factor of almost N for an N-deep pipeline; not quite a factor of N because of pipeline overheads.
Cycles per instruction: in an ideal world, CPI would stay the same. An individual instruction takes N cycles, but we have N instructions in flight at a time, so average CPI_pipe = CPI_no_pipe * 1/N.
Thus performance can improve by up to a factor of N.
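To see why the gain is "almost" a factor of N, the speedup can be written out under the assumptions that instruction count and effective CPI are unchanged, the logic splits evenly, and each stage adds a fixed register delay. The symbols t_logic and t_reg are introduced here for illustration and do not appear on the slides:

```latex
\text{Speedup}(N)
  = \frac{t_{\text{logic}} + t_{\text{reg}}}
         {t_{\text{logic}}/N + t_{\text{reg}}}
  < N \quad (N > 1),
\qquad
\text{Speedup}(3) = \frac{300 + 20}{100 + 20} \approx 2.67,
\qquad
\text{Speedup}(6) = \frac{300 + 20}{50 + 20} \approx 4.57
```

With the earlier example's 300 ps of logic and 20 ps registers, the speedup can never exceed 320/20 = 16 no matter how deep the pipeline gets.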

23 Data Dependencies
Result from one instruction is used as an operand for another: a read-after-write (RAW) dependency.
Very common in actual programs.
We must make sure our pipeline handles these properly: get correct results and minimize the performance impact.
  irmovl $50, %eax
  addl %eax, %ebx
  mrmovl 100(%ebx), %edx
[Figure: timeline of OP1, OP2, OP3]
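The RAW dependencies in the three-instruction example can be found mechanically by tracking the most recent writer of each register. The sketch below is not a Y86 parser: the read and write sets are filled in by hand, condition codes are ignored, and `Instr` and `raw_dependencies` are hypothetical names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    reads: frozenset   # registers read as operands
    writes: frozenset  # registers written with a result


program = [
    Instr("irmovl $50, %eax", frozenset(), frozenset({"%eax"})),
    Instr("addl %eax, %ebx", frozenset({"%eax", "%ebx"}), frozenset({"%ebx"})),
    Instr("mrmovl 100(%ebx), %edx", frozenset({"%ebx"}), frozenset({"%edx"})),
]


def raw_dependencies(prog: list[Instr]) -> list[tuple[int, int, str]]:
    """Return (producer index, consumer index, register) triples,
    using 0-based instruction indices."""
    deps = []
    last_writer: dict[str, int] = {}
    for i, ins in enumerate(prog):
        for reg in sorted(ins.reads):
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        for reg in ins.writes:          # record writes after checking reads
            last_writer[reg] = i
    return deps


for producer, consumer, reg in raw_dependencies(program):
    print(f"{program[consumer].text!r} needs {reg} "
          f"from {program[producer].text!r}")
```

Each instruction here depends on the one immediately before it, which is the worst case for a pipeline because the result is needed before it has been written back.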

24 Data Hazards
The result does not feed back around in time for the next operation.
Pipelining has changed the behavior of the system.
[Figure: three-stage pipeline (comb. logic A, B, C with registers and a feedback path); timeline of OP1 through OP4 overlapped across stages A, B, C]

25 SEQ Hardware
Stages occur in sequence.
One operation in process at a time.
One stage for each logical pipeline operation:
  Fetch (get next instruction from memory)
  Decode (figure out what the instruction does and get values from regfile)
  Execute (compute)
  Memory (access data memory if necessary)
  Write back (write any instruction result to regfile)

26 SEQ+ Hardware
Still a sequential implementation.
Reorder the PC stage to put it at the beginning.
PC stage: the task is to select the PC for the current instruction, based on results computed by the previous instruction.
Processor state: the PC is no longer stored in a register, but it can be determined from other stored information.

27 Adding Pipeline Registers

28 Pipeline Stages
Fetch: select current PC, read instruction, compute incremented PC.
Decode: read program registers.
Execute: operate ALU.
Memory: read or write data memory.
Write back: update register file.

29 Summary
Today: pipelining principles (assembly line); overheads due to imperfect pipelining; breaking instruction execution into a sequence of stages.
Next time: pipelining hardware (registers and feedback paths); difficulties with pipelines (hazards); methods of mitigating hazards.