Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.



Pipelining in Today’s Most Advanced Processors
Not fundamentally different from the techniques we have discussed:
–deeper pipelines
–pipelining combined with superscalar execution, out-of-order execution, or VLIW (very long instruction word)

Deeper Pipelines
Power 4, Pentium 3, Pentium 4.
[Speaker note: give the review of “Intel’s game.”]

End of Superpipelining?
Pipeline register overheads were already starting to play a role. Thermal wall/power wall – we cannot increase the clock rate, so deeper pipelines have less utility. In fact, pipelines are getting shorter. Much like parallelism – cutting each stage in half didn’t halve the cycle time.

Superscalar Execution
[Diagram: multiple instructions per cycle flowing through overlapping pipelines, each with IM, Reg, ALU, DM, Reg stages]

What can this machine do in parallel?
A. Any two instructions
B. Any two independent instructions
C. An arithmetic instruction and a memory instruction
D. Any instruction and a memory instruction
E. None of the above

A Modest Superscalar MIPS
What can this machine do in parallel? What other logic is required? Represents the earliest superscalar technology (e.g., circa early 1990s). Hazards – detecting independent instructions.

Superscalar Execution
To execute four instructions in the same cycle, we must find four independent instructions.
–If the four instructions fetched are guaranteed by the compiler to be independent, this is a VLIW machine.
–If the four instructions fetched are only executed together when hardware confirms that they are independent, this is an in-order superscalar processor.
–If the hardware actively finds four (not necessarily consecutive) instructions that are independent, this is an out-of-order superscalar processor.
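The independence test at the heart of both approaches can be sketched in a few lines. This is a simplified illustration, not real issue logic: each instruction is modeled as a (destination, sources) pair, and the helper name `independent` is made up for this example.

```python
# Hypothetical sketch: deciding whether two instructions may issue together.
# An instruction is modeled as (dest_reg, src_regs); dest_reg may be None.

def independent(i1, i2):
    """True if the pair has no RAW, WAR, or WAW hazard."""
    d1, s1 = i1
    d2, s2 = i2
    if d1 is not None and (d1 in s2 or d1 == d2):  # RAW or WAW on i1's dest
        return False
    if d2 is not None and d2 in s1:                # WAR on i2's dest
        return False
    return True

# lw $6, 36($2) writes $6; add $5, $6, $4 reads $6 -> RAW, not independent
lw_inst  = ("$6", ["$2"])
add_inst = ("$5", ["$6", "$4"])
print(independent(lw_inst, add_inst))   # False
print(independent(lw_inst, ("$9", ["$12", "$8"])))  # True
```

A VLIW compiler performs this check statically when packing a bundle; an in-order superscalar performs the same check in hardware on the fetched group.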

Superscalar Scheduling
Assume in-order, 2-issue: one load/store followed by one integer instruction. In which cycle can we start “executing” each instruction? (Assume the first lw has already gone through F and D.)
lw $6, 36($2)
add $5, $6, $4
lw $7, 1000($5)
sub $9, $12, $
Choices A–D gave candidate cycle schedules; E: none are correct. Answer: D (X has to wait on the stall).
[Speaker notes: point out that “issue” is used in modern machines to mean “start executing,” i.e., go to X in our MIPS pipeline. Assume the same pipeline as MIPS.]

Superscalar Scheduling
Assume in-order, 4-issue, any combination of instructions. In which cycle can we start “executing” each instruction? (Assume the first lw has already gone through F and D.)
lw $6, 36($2)
add $5, $6, $4
lw $7, 1000($5)
sub $9, $12, $5
sw $5, 200($6)
add $3, $9, $9
and $11, $7, $
Choices A–D gave candidate cycle schedules; E: none are correct. Answer: C.
[Speaker note: what we’ve done here could be done in HW or SW.]
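The scheduling exercise can be mimicked with a toy in-order multiple-issue model. This is a hedged simplification, not the exact pipeline from the slides: ALU results are assumed consumable the next cycle, loads one cycle later than that (the classic load-use delay), and the `schedule` helper and latencies are assumptions for illustration.

```python
# Toy in-order multiple-issue scheduler (simplified assumptions, see above).
# Each instruction: (op, dest, srcs). Returns the cycle each one starts X.

def schedule(insts, width):
    ready = {}        # register -> first cycle its value can be consumed
    counts = {}       # cycle -> number of instructions issued that cycle
    last = 1          # in-order: can't start before the previous instruction
    starts = []
    for op, dest, srcs in insts:
        c = max([last] + [ready.get(r, 1) for r in srcs])
        while counts.get(c, 0) >= width:      # issue-width limit
            c += 1
        counts[c] = counts.get(c, 0) + 1
        if dest:
            ready[dest] = c + (2 if op == "lw" else 1)  # load-use delay
        last = c
        starts.append(c)
    return starts

prog = [("lw",  "$6", ["$2"]),
        ("add", "$5", ["$6", "$4"]),
        ("lw",  "$7", ["$5"]),
        ("sub", "$9", ["$12", "$5"])]
print(schedule(prog, 2))   # [1, 3, 4, 4] under these assumptions
```

Tracing it by hand shows the same phenomenon as the clicker question: the add stalls for the load’s result, and everything behind it waits because issue is in order.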

VLIW Advantages
In the past, the strongest argument for VLIW was that by removing the complexity of dynamic scheduling from the hardware, we can increase the clock rate. That advantage now runs into the same problem as superpipelining. What is now the best argument for VLIW?
A. VLIW can find more ILP than hardware scheduling
B. VLIW is a legacy component of the Itanium 2
C. VLIW enables a single compilation for multiple generations of processors
D. VLIW can be more power efficient
E. None of the above
[Speaker note: talk about predication.]

Which of the following pairs of instructions represent hazards that only apply to out-of-order execution?
1. lw $1, 0($2) followed by add $1, $2, $3
2. lw $1, 0($2) followed by add $3, $2, $1
3. add $2, $1, $
A. 1
B. 2
C. 3
D. 1 and 2
E. 1 and 3
[Speaker notes: 1: WAW, 2: RAW, 3: WAR. Talk through register renaming (virtual registers, in a way).]
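The three hazard classes named in the speaker notes (RAW, WAR, WAW) can be sketched as a small classifier. The instruction encoding and the `hazards` helper are illustrative assumptions, not part of the original slides.

```python
# Sketch: classifying hazards between an earlier instruction i1 and a
# later instruction i2, each modeled as (dest_reg, src_regs).

def hazards(i1, i2):
    d1, s1 = i1
    d2, s2 = i2
    found = []
    if d1 is not None and d1 in s2:
        found.append("RAW")   # i2 reads what i1 writes (true dependence)
    if d2 is not None and d2 in s1:
        found.append("WAR")   # i2 writes what i1 reads (name dependence)
    if d1 is not None and d1 == d2:
        found.append("WAW")   # both write the same register (name dependence)
    return found

# Pair 1: lw $1, 0($2) then add $1, $2, $3
print(hazards(("$1", ["$2"]), ("$1", ["$2", "$3"])))  # ['WAW']
# Pair 2: lw $1, 0($2) then add $3, $2, $1
print(hazards(("$1", ["$2"]), ("$3", ["$2", "$1"])))  # ['RAW']
```

Only the name dependences (WAR and WAW) become hazards specifically because out-of-order execution can reorder the writes; a RAW dependence constrains in-order pipelines too.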

Early Out-of-Order Processor
[Diagram: Fetch → Decode → Instruction Queue → Register Rename → INT ALU ×3 / FP ALU ×2, with Load Queue, Store Queue, L1, and Result Bus]
[Speaker notes: spend a lot of time talking this through – how the various units communicate, reservation stations, etc. Point out the memory load/store queues. In-order front end, out-of-order back end.]

Which of the following are not possible if you allow out-of-order commit (no reorder buffer)?
1. Speculate on load instructions
2. Forward values between instructions
3. Speculate on branches
4. Provide “precise” interrupts
A. 1, 2, and 4
B. 1 and 3
C. 3 and 4
D. 1, 3, and 4
E. None of the above
Answer: D. [Speaker note: talk through what each means before posing the question.]

Modern OOO Processor
[Diagram: Fetch → Decode → Instruction Queue → Register Rename → INT ALU ×3 / FP ALU ×2, with Load Queue, Store Queue, L1, and Reorder Buffer]
[Speaker note: point out the memory load/store queues and how a long-latency miss can still mess things up.]

Out of Order with Reorder Buffer
Assume 2-issue out-of-order, any pair of instructions, with register renaming. Execute begins as soon as operands are available. (Assume all applicable instructions are held in the instruction queue.) When does each instruction issue (execute)?
lw $6, 36($2)
add $5, $6, $4
beq $2, $6, there   (predicted not taken; fetch continues at PC+4)
lw $7, 1000($4)
sub $9, $12, $8
add $3, $5, $
Choices A–D gave candidate cycle schedules; E: none are correct.
[Speaker note: point out that VLIW can’t do this.]
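Register renaming, which the question assumes, can be sketched minimally: every architectural destination gets a fresh physical register, which removes WAR and WAW hazards while preserving true (RAW) dependences. The table sizes, `p0`-style names, and the `rename` helper are assumptions for illustration; real renamers also reclaim physical registers, which this sketch omits.

```python
# Minimal register-renaming sketch (no physical-register reclamation).
# Each instruction: (op, dest, srcs) over architectural register names.

def rename(insts, num_phys=32):
    mapping = {}                               # architectural -> physical
    free = [f"p{i}" for i in range(num_phys)]  # free physical registers
    out = []
    for op, dest, srcs in insts:
        new_srcs = [mapping.get(r, r) for r in srcs]  # read current mapping
        new_dest = dest
        if dest is not None:
            new_dest = free.pop(0)             # fresh physical register
            mapping[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out

prog = [("lw",  "$1", ["$2"]),         # writes $1
        ("add", "$3", ["$2", "$1"]),   # true RAW on $1 is preserved
        ("add", "$1", ["$2", "$3"])]   # WAW on $1 disappears after renaming
for inst in rename(prog):
    print(inst)
```

After renaming, the two writers of $1 target different physical registers, so the hardware is free to execute them out of order; only the RAW edges constrain the schedule.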

Pentium 4

Modern Processors
–Pentium II, III – 3-wide superscalar, out-of-order, 14 integer pipeline stages
–Pentium 4 – 3-wide superscalar, out-of-order, simultaneous multithreading, 20+ pipe stages
–AMD Athlon – 3-wide superscalar, out-of-order, 10 integer pipe stages
–AMD Opteron – similar to Athlon, with 64-bit registers, 12 pipe stages, better multiprocessor support
–Alpha – 2-wide superscalar, in-order, 7 pipe stages
–Alpha – 4-wide superscalar, out-of-order, 7 pipe stages
–Intel Itanium – 3-operation VLIW, 2-instruction issue (6 ops per cycle), in-order, 10-stage pipeline

Nehalem From: Fast Thread Migration via Working Set Prediction. Brown, Porter, Tullsen. HPCA 2011

Advanced Pipelining – Key Points
–ET = number of instructions × CPI × cycle time
–Pipelining attempts to get CPI close to 1.
–To improve performance further, we must reduce CT (superpipelining) or push CPI below one (superscalar, VLIW).
–Either hardware or software can guarantee instruction independence.
–Modern processors often do in-order fetch and in-order commit, with out-of-order execution in between.
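The execution-time formula in the key points can be checked with a small worked example; the instruction count, CPI, and clock rate below are illustrative numbers, not measurements from any real machine.

```python
# ET = instruction count * CPI * cycle time
insts = 1_000_000
cpi = 0.5                 # a 2-wide superscalar sustaining 2 insts/cycle
cycle_time = 0.5e-9       # a 2 GHz clock
et = insts * cpi * cycle_time
print(et)                 # 0.00025 seconds
```

Halving CPI (wider issue) or halving cycle time (deeper pipelining) each halve ET, which is why both superscalar and superpipelined designs attack the same equation from different terms.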