Superscalar Processors. ELEC6200, 10/24/05. Kasi L.K. Anbumony, Department of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849.



Outline
- Pipelining: Motivation
- Pipeline Hazards
- Advanced Pipelining
  - Instruction Level Parallelism (ILP)
  - Multiple Issue (MIPS Superscalar)
    - Static Multiple Issue (SW centric)
    - Dynamic Multiple Issue (HW centric)
- Superscalar Processor
- Conclusion

Pipelining: Motivation
Multiple instructions are overlapped in execution, which exploits instruction-level parallelism (ILP). Pipelining is one technique for making processors fast.
Some terms: stages, task order, throughput.
In a pipeline the stages operate concurrently (in parallel); this is possible as long as each stage has its own resources.

Sequential Laundry: Non-pipelined
Four loads of laundry (A, B, C, D), each needing a wash, dry, and fold, are done one after another starting at 6 PM. Sequential laundry takes 6 hours for 4 loads, finishing at midnight. If the loads were pipelined, how long would laundry take?
[Chart: task order A-D versus time, 6 PM to midnight]

Pipelined Laundry: Start Work ASAP
Pipelined laundry takes 3.5 hours for the same 4 loads: each load starts as soon as the washer is free.
[Chart: tasks A-D overlapped in time; wash 30 min, dry 40 min, fold 20 min]

Pipelining: Lessons
- Pipelining improves the throughput of the entire workload without reducing the time to complete any single load.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously.
- Potential speedup = number of pipe stages.
- Unbalanced stage lengths reduce the speedup.
- Time to "fill" the pipeline and time to "drain" it also reduce the speedup.
[Chart: pipelined laundry timeline, 6 PM to 9 PM]
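The laundry arithmetic above can be checked with a short sketch. This is illustrative only; it assumes the classic 30-minute wash, 40-minute dry, and 20-minute fold stage times from the analogy:

```python
# Stage times in minutes: wash, dry, fold (assumed from the laundry analogy).
stages = [30, 40, 20]
loads = 4

# Sequential: each load runs all stages before the next one starts.
sequential = loads * sum(stages)  # 4 * 90 = 360 min = 6 hours

# Pipelined: after the first load flows through, each additional load
# finishes one slowest-stage time (the 40-minute dryer) later.
pipelined = sum(stages) + (loads - 1) * max(stages)  # 90 + 3*40 = 210 min

print(sequential / 60, "hours vs", pipelined / 60, "hours")  # 6.0 vs 3.5
```

This also shows the "pipeline rate limited by slowest stage" lesson: the dryer, not the washer, sets the steady-state rate.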

Comparison: Example
Consider a non-pipelined machine with 5 execution steps of lengths 200 ps, 100 ps, 200 ps, 200 ps, and 100 ps. Due to clock skew and setup, pipelining adds 5 ps of overhead to each stage. Ignoring the latency impact, how much speedup in the instruction execution rate do we gain from a pipeline?
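The arithmetic for this example can be sketched directly (using the stage lengths and overhead stated above):

```python
stage_ps = [200, 100, 200, 200, 100]
overhead_ps = 5

# Non-pipelined: one instruction takes the sum of all steps.
nonpipelined_ps = sum(stage_ps)             # 800 ps per instruction

# Pipelined: the clock cycle must fit the slowest stage plus overhead,
# and one instruction completes every cycle in steady state.
pipelined_ps = max(stage_ps) + overhead_ps  # 205 ps per instruction

speedup = nonpipelined_ps / pipelined_ps
print(f"{speedup:.2f}x")                    # ~3.90x, not the ideal 5x
```

The speedup falls short of the stage count (5) because the stages are unbalanced and the overhead stretches the clock, exactly as the "lessons" slide warns.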

Sequential vs. Pipelined Execution
[Figure: the same instruction stream executed sequentially and in pipelined fashion]

Speedup Equation for Pipelining
Speedup from pipelining = (CPI unpipelined / CPI pipelined) x (Clock cycle unpipelined / Clock cycle pipelined)
Ideal CPI pipelined = CPI unpipelined / Pipeline depth
CPI pipelined = Ideal CPI pipelined + Pipeline stall cycles per instruction
Therefore:
Speedup = [Pipeline depth / (1 + Pipeline stall cycles per instruction / Ideal CPI pipelined)] x (Clock cycle unpipelined / Clock cycle pipelined)


It's Not That Easy for Computers: Limitations
Hazards prevent the next instruction from executing during its designated clock cycle:
- Structural hazards: the hardware cannot support the combination of instructions that must execute in the same clock cycle (a combined washer+dryer).
- Data hazards: an instruction depends on the result of a prior instruction that is still in the pipeline (one sock missing).
- Control hazards: branches and other instructions that change control flow complicate pipelining.
A common solution is to stall the pipeline until the hazard "bubbles" through.
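The cost of stalling is easy to quantify: effective CPI = ideal CPI + average stall cycles per instruction. A small sketch with hypothetical hazard frequencies (the 20% and 10% figures below are illustrative assumptions, not from the slides):

```python
# Assumed ideal pipeline: one instruction completes per cycle.
ideal_cpi = 1.0

# Hypothetical mix: 20% of instructions incur a 1-cycle data-hazard
# stall, and 10% incur a 2-cycle control-hazard stall.
stall_cpi = 0.20 * 1 + 0.10 * 2   # average stall cycles per instruction

effective_cpi = ideal_cpi + stall_cpi
slowdown = effective_cpi / ideal_cpi   # 1.4x slower than the ideal pipeline
print(effective_cpi, slowdown)
```

This is why forwarding, branch prediction, and the scheduling techniques later in the deck matter: they shrink the stall term.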

Instruction-Level Parallelism: Longer Pipelines
Laundry analogy: divide the washer into three machines that perform the wash, rinse, and spin steps of a traditional machine. To get the full speedup, we need to rebalance the remaining steps so that they are all the same length. The amount of parallelism exploited is higher, since more operations are being overlapped.

Advanced Pipelining: Techniques
Motivation: to further exploit instruction-level parallelism (ILP).
- Multiple issue: replicate the internal components of the computer so that it can launch multiple instructions in every pipeline stage.
- Dynamic pipeline scheduling (dynamic pipelining, or dynamic multiple issue): the hardware reorders instruction execution at run time to avoid pipeline hazards.

Multiple Issue: Superscalar
Launch multiple instructions in parallel. A superscalar laundry would replace our household washer and dryer with, say, three washers and three dryers, plus three assistants to fold and put away three times as much laundry in the same amount of time. The downside is the extra work needed to keep all the machines busy and to transfer each load to the next pipeline stage. A superscalar processor is one that executes more than one instruction per clock cycle.

Performance Metrics: CPI & IPC
The instruction execution rate can exceed the clock rate. Example: a 6 GHz, 4-way multiple-issue microprocessor can execute at a peak rate of 24 billion instructions per second, with a best-case CPI of 0.25, i.e., 4 instructions per clock cycle (IPC). Assuming a 5-stage pipeline, such a processor would have 20 instructions in execution at any given time.
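The numbers in this example follow directly from the clock rate, issue width, and pipeline depth:

```python
clock_hz = 6e9      # 6 GHz clock, from the slide's example
issue_width = 4     # 4-way multiple issue
pipeline_depth = 5  # 5-stage pipeline

peak_rate = clock_hz * issue_width        # 24 billion instructions/second
best_cpi = 1 / issue_width                # 0.25 cycles per instruction
best_ipc = issue_width                    # 4 instructions per clock
in_flight = pipeline_depth * issue_width  # 20 instructions in execution

print(peak_rate, best_cpi, in_flight)
```

Note that these are peak figures; hazards and cache misses push the achieved CPI well above 0.25 in practice.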

Multiple-Issue Processors: Decision Strategy
Static multiple issue:
- Decisions are made at compile time, before execution
- Software based: compiler scheduling
- VLIW (Very Long Instruction Word)
Dynamic multiple issue:
- Decisions are made at run/execution time by the processor
- Hardware based: dynamic scheduling

Static Multiple-Issue Processors
Issue packet: a set of instructions that can be paired to form one large instruction with multiple operations (VLIW). The approach relies on the compiler to take on the responsibility for handling data and control hazards; the compiler's responsibilities may include static branch prediction and code scheduling.

Getting CPI < 1: Static 2-Issue Pipeline
Superscalar MIPS: 2 instructions per cycle, 1 ALU/branch instruction and 1 load/store instruction
- Fetch 64 bits per clock cycle; ALU instruction on the left, load/store on the right
- Can only issue the 2nd instruction if the 1st instruction issues

Type             | Pipe stages
ALU instruction  | IF ID EX MEM WB
Load instruction | IF ID EX MEM WB
ALU instruction  |    IF ID EX MEM WB
Load instruction |    IF ID EX MEM WB
ALU instruction  |       IF ID EX MEM WB
Load instruction |       IF ID EX MEM WB
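The pairing rule can be sketched as a tiny cycle counter. This is a hypothetical helper, not from the slides: each cycle issues at most one ALU/branch instruction and one load/store instruction, strictly in program order, stopping the cycle at the first instruction whose slot is already taken.

```python
def schedule_2issue(kinds):
    """kinds: list of 'alu' or 'mem' instruction types in program order.
    Returns the cycle count under the static 2-issue pairing rule:
    one 'alu' slot and one 'mem' slot per cycle, issued in order."""
    cycles, i = 0, 0
    while i < len(kinds):
        cycles += 1
        slots = {"alu": 1, "mem": 1}
        # In-order issue: stop this cycle at the first instruction
        # whose slot has already been used.
        while i < len(kinds) and slots[kinds[i]] > 0:
            slots[kinds[i]] -= 1
            i += 1
    return cycles

# An alu+mem pair issues together; back-to-back alu ops take two cycles.
print(schedule_2issue(["alu", "mem", "alu", "alu", "mem"]))  # 3 cycles
```

This ignores data hazards entirely; it only models the structural slot constraint, which is why the code-scheduling examples that follow still need the compiler to move dependent instructions apart.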

Static Multiple Issue: Datapath
[Figure: two-issue datapath with instruction memory, register file, one ALU for the ALU/branch instruction, and a separate path for the lw/sw instruction]

Example: Multiple-Issue Code Scheduling

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

After reordering the instructions based on their dependencies, we get CPI = 0.8 (IPC = 1.25):

ALU/branch instruction     | lw/sw instruction | Clock cycle
Loop:                      | lw   $t0, 0($s1)  | 1
addi $s1, $s1, -4          |                   | 2
addu $t0, $t0, $s2         |                   | 3
bne  $s1, $zero, Loop      | sw   $t0, 4($s1)  | 4

(Note: the sw offset becomes 4($s1) because the addi has already decremented $s1 by the time the store executes.)
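The CPI figure above is just the ratio of cycles to instructions in the scheduled loop body:

```python
instructions = 5  # lw, addu, sw, addi, bne per loop iteration
cycles = 4        # per iteration, after scheduling into the two issue slots

cpi = cycles / instructions  # 0.8
ipc = instructions / cycles  # 1.25
print(cpi, ipc)
```

Even though the machine can issue 2 instructions per cycle (peak IPC = 2), data dependences leave two issue slots empty, so the achieved IPC is only 1.25.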

Loop Unrolling: 4 Iterations
Multiple copies of the loop body are made, exposing more ILP by overlapping instructions from different iterations. CPI = 8/14 = 0.57:

ALU/branch instruction     | lw/sw instruction | Clock cycle
Loop: addi $s1, $s1, -16   | lw $t0, 0($s1)    | 1
                           | lw $t1, 12($s1)   | 2
addu $t0, $t0, $s2         | lw $t2, 8($s1)    | 3
addu $t1, $t1, $s2         | lw $t3, 4($s1)    | 4
addu $t2, $t2, $s2         | sw $t0, 16($s1)   | 5
addu $t3, $t3, $s2         | sw $t1, 12($s1)   | 6
                           | sw $t2, 8($s1)    | 7
bne  $s1, $zero, Loop      | sw $t3, 4($s1)    | 8

(Register renaming: each unrolled copy uses its own temporary $t0-$t3, and each store writes back the register its copy computed.)
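Comparing the two schedules per array element shows what unrolling buys. The element counts below follow from the loop structure (one element per original iteration, four per unrolled iteration):

```python
# Scheduled but not unrolled: 5 instructions, 4 cycles, 1 array element.
cycles_scheduled, elems_scheduled = 4, 1

# Unrolled 4x: 14 instructions, 8 cycles, 4 array elements.
cycles_unrolled, elems_unrolled = 8, 4
instrs_unrolled = 14

cpi_unrolled = cycles_unrolled / instrs_unrolled  # 8/14 ~ 0.57, as on the slide

# Throughput in cycles per array element: 4 -> 2, a 2x speedup.
speedup = (cycles_scheduled / elems_scheduled) / (cycles_unrolled / elems_unrolled)
print(round(cpi_unrolled, 2), speedup)
```

The CPI improvement (0.8 to 0.57) understates the real gain: unrolling also removes three addi/bne pairs, so the work per element drops as well.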

Dynamic Multiple-Issue Processors
Instructions are issued in order, and the processor decides whether zero, one, or more instructions can issue in a given clock cycle. Achieving good performance still requires the compiler to schedule instructions so that dependences are moved apart, thereby improving the instruction issue rate.

Dynamic Scheduling: Definition
Dynamic pipeline scheduling looks past stalls to find later instructions to execute while waiting for the stall to be resolved. The hardware chooses which instruction to execute next, reordering instructions to avoid stalls (dynamic issue decisions). For example, while the addu below waits on the lw, the independent sub and slti can proceed:

lw   $t0, 20($s2)
addu $t1, $t0, $s2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
bne  $s1, $zero, Loop

HW Schemes: Why?
Why do this in hardware at run time?
- It works when the real dependences cannot be known at compile time
- It keeps the compiler simpler
- Code compiled for one machine runs well on another

Dynamic Pipeline Scheduling: Model
[Figure: an in-order instruction fetch & decode unit feeds reservation stations in front of the functional units (integer, FP, lw/sw); results complete out of order into a reorder buffer, and an in-order commit unit retires them]

HW Units: Operation
The instruction fetch/decode unit fetches instructions, decodes them, and sends each instruction to the corresponding functional unit of the execute stage. There are typically 5-10 functional units, each with buffers called reservation stations that hold the operands and the operation. As soon as a reservation station contains all of its operands, the functional unit executes and the result is calculated. It is up to the commit unit to decide when it is safe to write the result into the register file or store it to memory.

Dynamic Scheduling: In-Order Completion
To make programs behave as if they were running on a non-pipelined computer, the instruction fetch and decode unit is required to issue instructions in order, and the commit unit is required to write results to registers and memory in program execution order (in-order completion). Hence, when an exception occurs, the computer can point to the last instruction executed, and the only registers updated are those written by instructions before the exception.

Dynamic Scheduling: Speculation
Speculative execution: dynamic scheduling can be combined with branch prediction, so that after a mispredicted branch the commit unit is able to discard all the results in the execution units from the wrong path. Dynamic scheduling can also be combined with superscalar execution, so the commit unit may retire 4 to 6 instructions per clock cycle.

Superscalar Processor
[Figure: block diagram of a superscalar processor]

Conclusion: Steps in ILP Exploitation
[Figure: summary of the steps in exploiting ILP]

References
- Computer Organization & Design, Patterson & Hennessy, 2nd & 3rd Editions
- Pipelining lecture slides: …116%20lecture%204-05f.ppt