Serial versus Pipelined Execution

Slides:



Advertisements
Similar presentations
PipelineCSCE430/830 Pipeline: Introduction CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Prof. Yifeng Zhu, U of Maine Fall,
Advertisements

Lecture 4: CPU Performance
Adding the Jump Instruction
CMPT 334 Computer Organization
Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.
Pipelining I (1) Fall 2005 Lecture 18: Pipelining I.
Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee.
EECS 318 CAD Computer Aided Design LECTURE 2: DSP Architectures Instructor: Francis G. Wolff Case Western Reserve University This presentation.
Computer Architecture
CS252/Patterson Lec 1.1 1/17/01 Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer.
Pipelined Datapath and Control (Lecture #13) ECE 445 – Computer Organization The slides included herein were taken from the materials accompanying Computer.
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania Computer Organization Pipelined Processor Design 1.
Computer ArchitectureFall 2007 © October 24nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
King Fahd University of Petroleum and Minerals King Fahd University of Petroleum and Minerals Computer Engineering Department Computer Engineering Department.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 9, 2002 Topic: Pipelining Basics.
1 Atanasoff–Berry Computer, built by Professor John Vincent Atanasoff and grad student Clifford Berry in the basement of the physics building at Iowa State.
Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 17 - Pipelined.
Introduction to Pipelining Rabi Mahapatra Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley)
9.2 Pipelining Suppose we want to perform the combined multiply and add operations with a stream of numbers: A i * B i + C i for i =1,2,3,…,7.
B 0000 Pipelining ENGR xD52 Eric VanWyk Fall
Lecture 8: Processors, Introduction EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014,
Lecture 14: Processors CS 2011 Fall 2014, Dr. Rozier.
EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.
Pipelining (I). Pipelining Example  Laundry Example  Four students have one load of clothes each to wash, dry, fold, and put away  Washer takes 30.
Analogy: Gotta Do Laundry
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
ECE 232 L18.Pipeline.1 Adapted from Patterson 97 ©UCBCopyright 1998 Morgan Kaufmann Publishers ECE 232 Hardware Organization and Design Lecture 18 Pipelining.

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
Performance of Single-cycle Design
Pipelining Example Laundry Example: Three Stages
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Chapter One Introduction to Pipelined Processors.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 10 Computer Hardware Design (Pipeline Datapath and Control Design) Prof. Dr.
Lecture 18: Pipelining I.
Pipelines An overview of pipelining
Review: Instruction Set Evolution
Pipelining concepts, datapath and hazards
Morgan Kaufmann Publishers
Performance of Single-cycle Design
CMSC 611: Advanced Computer Architecture
Morgan Kaufmann Publishers The Processor
Pipelining.
Morgan Kaufmann Publishers
Morgan Kaufmann Publishers The Processor
Single Clock Datapath With Control
Pipeline Implementation (4.6)
Chapter One Introduction to Pipelined Processors
ECE232: Hardware Organization and Design
Chapter 3: Pipelining 순천향대학교 컴퓨터학부 이 상 정 Adapted from
Pipelining.
Chapter 4 The Processor Part 2
Pipelined Processor Design
Lecturer: Alan Christopher
Systems I Pipelining I Topics Pipelining principles Pipeline overheads
An Introduction to pipelining
Chapter 8. Pipelining.
Pipelining: Basic Concepts
Pipelining Appendix A and Chapter 3.
Morgan Kaufmann Publishers The Processor
A relevant question Assuming you’ve got: One washer (takes 30 minutes)
Pipelining.
Presentation transcript:

Serial versus Pipelined Execution COE 301 / ICS 233 Computer Organization Dr. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals

Drawback of Single Cycle Processor Drawback is the Long cycle time All instructions take as much time as the slowest instruction Instruction Fetch ALU Decode Reg Read Reg Write longest delay Load Instruction Fetch Decode Reg Read Compute Address Reg Write Memory Read Store Instruction Fetch Decode Reg Read Compute Address Memory Write Branch Instruction Fetch Reg Read Br Target Compare & Update PC Jump Instruction Fetch Decode & Update PC

Alternative: Multicycle Implementation Break instruction execution into five steps Instruction fetch Instruction decode, register read, target address for jump/branch Execution, memory address calculation, or branch outcome Memory access or ALU instruction completion Load instruction completion One clock cycle per step (clock cycle is reduced) First 2 steps are the same for all instructions Instruction # cycles ALU & Store 4 Branch 3 Load 5 Jump 2

Performance Example Assume the following operation times for components: Access time for Instruction and data memories: 200 ps Delay in ALU and adders: 180 ps Delay in Decode and Register file access (read or write): 150 ps Ignore the other delays in PC, mux, extender, and wires Which of the following would be faster and by how much? Single-cycle implementation for all instructions Multicycle implementation optimized for every class of instructions Assume the following instruction mix: 40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps

Solution For fixed single-cycle implementation: Instruction Class Memory Register Read ALU Operation Data Write Total 200 150 180 680 ps Load 880 ps Store 730 ps Branch 530 ps Jump 350 ps Compare and update PC Decode and update PC For fixed single-cycle implementation: Clock cycle = For multi-cycle implementation: Average CPI = Speedup = 880 ps determined by longest delay (load instruction) max (200, 150, 180) = 200 ps (maximum delay at any step) 0.4×4 + 0.2×5 + 0.1×4+ 0.2×3 + 0.1×2 = 3.8 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16

Pipelining Example Laundry Example: Three Stages Wash dirty load of clothes Dry wet clothes Fold and put clothes into drawers Each stage takes 30 minutes to complete Four loads of clothes to wash, dry, and fold A B C D

Sequential Laundry Sequential laundry takes 6 hours for 4 loads 6 PM 7 8 9 10 11 12 AM Time 30 D 30 C 30 B A 30 Sequential laundry takes 6 hours for 4 loads Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP 6 PM 7 8 9 PM A 30 C 30 B 30 Time D 30 30 30 Pipelined laundry takes 3 hours for 4 loads Speedup factor is 2 for 4 loads Time to wash, dry, and fold one load is still the same (90 minutes)

Serial Execution versus Pipelining Consider a task that can be divided into k subtasks The k subtasks are executed on k different stages Each subtask requires one time unit The total execution time of the task is k time units Pipelining is to overlap the execution The k stages work in parallel on k different tasks Tasks enter/leave pipeline at the rate of one task per time unit 1 2 k … 1 2 k … Without Pipelining One completion every k time units With Pipelining One completion every 1 time unit

Synchronous Pipeline Uses clocked registers between stages Upon arrival of a clock edge … All registers hold the results of previous stages simultaneously The pipeline stages are combinational logic circuits It is desirable to have balanced stages Approximately equal delay in all stages Clock period is determined by the maximum stage delay S1 S2 Sk Register Input Clock Output

Pipeline Performance Let ti = time delay in stage Si Clock cycle t = max(ti) is the maximum stage delay Clock frequency f = 1/t = 1/max(ti) A pipeline can process n tasks in k + n – 1 cycles k cycles are needed to complete the first task n – 1 cycles are needed to complete the remaining n – 1 tasks Ideal speedup of a k-stage pipeline over serial execution k + n – 1 Pipelined execution in cycles Serial execution in cycles = Sk → k for large n nk Sk

MIPS Processor Pipeline Morgan Kaufmann Publishers 29 November, 2018 Five stages, one cycle per stage IF: Instruction Fetch from instruction memory ID: Instruction Decode, register read, and J/Br address EX: Execute operation or calculate load/store address MEM: Memory access for load and store WB: Write Back result to register Chapter 4 — The Processor

Single-Cycle vs Pipelined Performance Consider a 5-stage instruction execution in which … Instruction fetch = ALU operation = Data memory access = 200 ps Register read = register write = 150 ps What is the clock cycle of the single-cycle processor? What is the clock cycle of the pipelined processor? What is the speedup factor of pipelined execution? Solution Single-Cycle Clock = 200+150+200+200+150 = 900 ps Reg ALU MEM IF 900 ps

Single-Cycle versus Pipelined – cont’d Pipelined clock cycle = CPI for pipelined execution = One instruction completes each cycle (ignoring pipeline fill) Speedup of pipelined execution = Instruction count and CPI are equal in both cases Speedup factor is less than 5 (number of pipeline stage) Because the pipeline stages are not balanced max(200, 150) = 200 ps 200 IF Reg MEM ALU 1 900 ps / 200 ps = 4.5

Pipeline Performance Summary Pipelining doesn’t improve latency of a single instruction However, it improves throughput of entire workload Instructions are initiated and completed at a higher rate In a k-stage pipeline, k instructions operate in parallel Overlapped execution using multiple hardware resources Potential speedup = number of pipeline stages k Pipeline rate is limited by slowest pipeline stage Unbalanced lengths of pipeline stages reduces speedup Also, time to fill and drain pipeline reduces speedup