
Pipelining 5

Two Approaches for Multiple Issue

Superscalar
– Issue a variable number of instructions per clock
– Instructions are scheduled either statically or dynamically

VLIW (Very Long Instruction Word)
– Issue a single very long instruction per clock that contains a large number of real instructions
– Instructions are scheduled statically by the compiler

Superscalar

Superscalar DLX: 2 instructions per clock, 1 FP & 1 anything else
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op as doubles

Type               Pipe stages (one column per clock)
Int. instruction   IF   ID   EX   MEM  WB
FP instruction     IF   ID   EX   MEM  WB
Int. instruction        IF   ID   EX   MEM  WB
FP instruction          IF   ID   EX   MEM  WB
Int. instruction             IF   ID   EX   MEM  WB
FP instruction               IF   ID   EX   MEM  WB

A 1-cycle load delay expands to 3 instructions in the superscalar pipeline
– the instruction in the right half of the same slot can't use the loaded value, nor can the instructions in the next slot
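As a rough sketch of the issue-pairing rule above (not from the slides; the enum, function name, and return convention are invented for illustration):

#include <stdbool.h>

/* Sketch of the 2-issue DLX pairing rule: the integer/memory instruction
   occupies the left slot, the FP instruction the right slot, and the
   second instruction may issue only if the first one does. Returns how
   many of the two fetched instructions issue this cycle. */
typedef enum { CLASS_INT, CLASS_FP } iclass;

int issued_this_cycle(iclass first, iclass second, bool first_can_issue)
{
    if (!first_can_issue)
        return 0;                 /* in-order issue: slot 2 never goes alone */
    if (first == CLASS_INT && second == CLASS_FP)
        return 2;                 /* a well-formed int+FP pair dual-issues   */
    return 1;                     /* wrong mix: only the first one issues    */
}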

Unrolled Loop

 1 Loop: LD    F0,0(R1)
 2       LD    F6,-8(R1)
 3       LD    F10,-16(R1)
 4       LD    F14,-24(R1)
 5       ADDD  F4,F0,F2
 6       ADDD  F8,F6,F2
 7       ADDD  F12,F10,F2
 8       ADDD  F16,F14,F2
 9       SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP
14       SD    8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration
Latencies assumed: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
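At the source level, this DLX code corresponds roughly to the following C loop and its hand-unrolled-by-4 version (a sketch: the names x, s, n and the assumption that n is a multiple of 4 are mine, not the slide's):

/* Original loop: add the scalar s to every element of x. */
void add_scalar(double *x, double s, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by 4, matching the four LD/ADDD/SD groups above: the loop
   overhead (decrement + branch) is paid once per four elements, and the
   independent copies can be scheduled to hide the LD and ADDD latencies. */
void add_unrolled4(double *x, double s, int n)   /* assumes n % 4 == 0 */
{
    for (int i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}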

Loop Unrolling in Superscalar

      Integer instruction    FP instruction      Clock cycle
Loop: LD   F0,0(R1)                                   1
      LD   F6,-8(R1)                                  2
      LD   F10,-16(R1)       ADDD F4,F0,F2            3
      LD   F14,-24(R1)       ADDD F8,F6,F2            4
      LD   F18,-32(R1)       ADDD F12,F10,F2          5
      SD   0(R1),F4          ADDD F16,F14,F2          6
      SD   -8(R1),F8         ADDD F20,F18,F2          7
      SD   -16(R1),F12                                8
      SD   -24(R1),F16                                9
      SUBI R1,R1,#40                                 10
      BNEZ R1,LOOP                                   11
      SD   -32(R1),F20                               12

Unrolled 5 times to avoid delays (+1 due to superscalar)
12 clocks, or 2.4 clocks per iteration

Dynamic Scheduling in Superscalar

Dependencies will stop instruction issue
Code compiled for a non-superscalar machine will run poorly on a superscalar
Simple approach
– Separate Tomasulo control
– Separate reservation stations for Integer FU/Reg and for FP FU/Reg

Dynamic Scheduling in Superscalar

How to issue two dependent instructions in the same cycle? (Otherwise, why use dynamic scheduling at all?)
– Issue at 2X the clock rate, so that issue remains in order
– More complex issue logic: relatively easy here, since only FP loads can cause a dependency between the integer and FP issue
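As a toy illustration of the extra issue-logic check (a sketch; the function name and integer register arguments are invented for this example), the second instruction of a pair must be compared against the destination of the first before both can issue in one cycle:

#include <stdbool.h>

/* Sketch only: returns true if the second instruction of an issue pair
   reads the register the first one writes, i.e. the pair is dependent and
   the issue logic must hand the second instruction the first instruction's
   reservation-station tag instead of a register value. */
bool pair_is_dependent(int first_dest, int second_src1, int second_src2)
{
    return second_src1 == first_dest || second_src2 == first_dest;
}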

Performance of Dynamic Superscalar

Iteration  Instruction        Issues at      Executes at    Writes result at
no.                           clock cycle    clock cycle    clock cycle
1          LD   F0,0(R1)           1              2               4
1          ADDD F4,F0,F2           1              5               8
1          SD   0(R1),F4           2              9
1          SUBI R1,R1,#8           3              4               5
1          BNEZ R1,LOOP            4              5
2          LD   F0,0(R1)           5              6               8
2          ADDD F4,F0,F2           5              9              12
2          SD   0(R1),F4           6             13
2          SUBI R1,R1,#8           7              8               9
2          BNEZ R1,LOOP            8              9

5 clocks per iteration
Branches and decrements still take 1 clock cycle each

Limits of Superscalar

While the Integer/FP split is simple for the HW, a CPI of 0.5 is reached only for programs with:
– Exactly 50% FP operations
– No hazards
If more instructions issue at the same time, decode and issue become harder
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
Issue rates of modern processors vary between 2 and 4 instructions per cycle.
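A back-of-envelope check of the 0.5 figure (a sketch under the no-hazard assumption; the function and variable names are mine):

#include <stdio.h>

/* With one integer slot and one FP slot per cycle and no hazards, N
   instructions containing a fraction f of FP operations need at least
   max(f, 1 - f) * N cycles, so the best achievable CPI is max(f, 1 - f);
   it reaches 0.5 only when exactly half the instructions are FP. */
static double best_cpi(double fp_fraction)
{
    double int_fraction = 1.0 - fp_fraction;
    return fp_fraction > int_fraction ? fp_fraction : int_fraction;
}

int main(void)
{
    printf("f = 0.50 -> CPI >= %.2f\n", best_cpi(0.50));   /* 0.50 */
    printf("f = 0.20 -> CPI >= %.2f\n", best_cpi(0.20));   /* 0.80 */
    return 0;
}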

VLIW Processors

Very Long Instruction Word (VLIW) processors
– Trade off instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word can execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need a compiling technique that schedules across several branches (trace scheduling)
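For concreteness, a sketch of how such a 7-slot instruction word might be represented (the struct, its field names, and the padding to 32-bit fields are assumptions for illustration, not a real VLIW encoding):

#include <stdint.h>

/* One VLIW instruction word as described above: 7 operation slots
   (2 memory references, 2 FP operations, 2 integer operations, 1 branch).
   At 16 to 24 bits of encoding per slot that is 112 to 168 bits; here each
   slot is padded to 32 bits purely for simplicity. The compiler guarantees
   that all 7 operations in one word can execute in parallel. */
typedef struct {
    uint32_t mem_ref[2];   /* memory-reference slots          */
    uint32_t fp_op[2];     /* floating-point operation slots  */
    uint32_t int_op[2];    /* integer ALU operation slots     */
    uint32_t branch;       /* branch slot                     */
} vliw_word;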

Loop Unrolling in VLIW

Memory ref 1     Memory ref 2     FP op 1           FP op 2           Int. op/branch   Clock
LD F0,0(R1)      LD F6,-8(R1)                                                             1
LD F10,-16(R1)   LD F14,-24(R1)                                                           2
LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2     ADDD F8,F6,F2                         3
LD F26,-48(R1)                    ADDD F12,F10,F2   ADDD F16,F14,F2                       4
                                  ADDD F20,F18,F2   ADDD F24,F22,F2                       5
SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                         6
SD -16(R1),F12   SD -24(R1),F16                                                           7
SD -32(R1),F20   SD -40(R1),F24                                       SUBI R1,R1,#48      8
SD -0(R1),F28                                                          BNEZ R1,LOOP        9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers in VLIW

Limits of Multi-Issue Machines

Inherent limitations of ILP
– 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
– Latencies of units => many operations must be scheduled
– Need about (pipeline depth x no. of functional units) independent operations to keep the functional units busy
Difficulties in building HW
– Duplicate FUs to get parallel execution
– Increased # of ports to the register file
– Increased # of ports to memory
– Instruction issue hardware (a wide spectrum)
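For a rough sense of scale (illustrative numbers, not from the slides): with functional-unit latencies of about 5 cycles and 6 functional units, roughly 5 x 6 = 30 independent operations must be in flight at once to keep every unit busy.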

Limits to Multi-Issue Machines

Limitations specific to either the superscalar or the VLIW implementation
– Superscalar: instruction issue logic
– VLIW: code size (unrolled loops + wasted fields)
– VLIW: lock step => 1 hazard stalls all instructions
– VLIW: binary compatibility => requires object-code (binary) translation

Hardware-based Speculation

Instructions are executed out of order and speculatively, but committed in order
Instruction commit is separated from instruction completion
In-order instruction commit requires a hardware buffer called the reorder buffer
The reorder buffer holds the result of an instruction between its completion time and its commit time
Exceptions are also processed in order (precise exception model)
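A minimal sketch of a reorder-buffer entry and the in-order commit rule described above (field names, sizes, and the one-commit-per-call simplification are assumptions, not the textbook's structures):

#include <stdbool.h>

#define ROB_SIZE 16

/* One reorder-buffer entry: allocated at issue, marked ready when the
   result is written, and drained only at commit. */
typedef struct {
    bool busy;         /* entry is allocated                           */
    bool ready;        /* result has been produced (completion)        */
    bool is_store;     /* stores update memory, not a register         */
    bool exception;    /* deferred until commit => precise exceptions  */
    int  dest_reg;     /* architectural register written at commit     */
    long value;        /* result held between completion and commit    */
} rob_entry;

typedef struct {
    rob_entry entry[ROB_SIZE];
    int head;          /* oldest instruction: the only one allowed to commit  */
    int tail;          /* next free slot, allocated in program order at issue */
} rob;

/* Commit at most one instruction: only the head may commit, and only once
   its result is present; this is what keeps commit (and exceptions) in order. */
bool commit_one(rob *r, long regfile[])
{
    rob_entry *h = &r->entry[r->head];

    if (!h->busy || !h->ready)
        return false;     /* head not finished: everything younger waits  */
    if (h->exception)
        return false;     /* raise the exception precisely at this point  */
    if (!h->is_store)
        regfile[h->dest_reg] = h->value;
    h->busy = false;
    r->head = (r->head + 1) % ROB_SIZE;
    return true;
}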

Hardware-based Speculation

(Pipeline diagram: Issue -> reservation stations -> EX in FU1 ... FUn -> Write results -> Commit)
– Issue (structural hazard): delay the issue until there is an empty reservation station and an empty slot in the reorder buffer
– Execute (RAW data hazard): wait at the reservation station until the values of the source registers are available
– Commit: wait until the instruction reaches the head of the reorder buffer and its result is present

Hardware-based Speculation

Example Code

LD    F6,34(R2)
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Figure 4.35 in the book

Hardware-based Speculation

Advantages
– Lessens the performance degradation resulting from control hazards
– Allows a precise exception model, since exception conditions can be checked at instruction commit time
– Can incorporate hardware-based branch prediction
– Does not require additional bookkeeping code
– Does not depend on a good compiler; performs OK with non-optimized code
Disadvantage
– Hardware complexity

Summary

Superscalar and VLIW
– CPI < 1
– Superscalar is more hardware dependent (dynamic)
– VLIW is more compiler dependent (static)
– More instructions issued at the same time => larger penalties for hazards
Hardware-based speculative execution
– Minimizes the impact of control hazards on performance
– Enables a precise exception model for out-of-order and/or speculative execution