1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2

2 Improving Performance
Three ways:
- Reduce clock cycle time: technology, implementation
- Reduce number of instructions: improve the instruction set, improve the compiler
- Reduce cycles per instruction (CPI): improve the implementation
Can we do better than 1 cycle/instruction?
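
The three factors multiply into the usual execution-time equation. A minimal sketch in Python (with made-up numbers, not measurements of any real machine) shows why pushing CPI below 1 pays off even when the clock rate and instruction count stay fixed:

    # Execution time = instruction count x CPI x clock period (the "iron law").
    # All numbers here are illustrative assumptions, not real measurements.
    def exec_time(inst_count, cpi, clock_hz):
        return inst_count * cpi / clock_hz

    single_issue = exec_time(1e9, 1.0, 2e9)   # ideal 1-cycle-per-instruction pipeline
    two_wide     = exec_time(1e9, 0.5, 2e9)   # ideal 2-way superscalar, CPI = 0.5
    print(single_issue, two_wide)             # 0.5 s vs 0.25 s at the same clock rate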

3 Instruction Level Parallelism
- A property of the software: how many instructions are independent? Very dependent on the compiler (a toy measure is sketched below)
- Many ways to find ILP: dynamic scheduling, superscalar, speculation, static methods (Chapter 4)
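
To make "how many instructions are independent?" concrete, here is a toy sketch of my own (not from the lecture): it walks a short hypothetical instruction sequence, tracks which instruction wrote each source register, and estimates the available ILP as sequence length divided by the longest RAW chain:

    # Each hypothetical instruction: (destination register, source registers).
    seq = [
        ("r1", ("r2", "r3")),   # i0: r1 = r2 + r3
        ("r4", ("r1",)),        # i1: r4 = r1 * 2    -> depends on i0
        ("r5", ("r6", "r7")),   # i2: r5 = r6 - r7   -> independent of i0/i1
        ("r8", ("r4", "r5")),   # i3: r8 = r4 + r5   -> depends on i1 and i2
    ]

    # depth[i] = length of the longest RAW chain ending at instruction i.
    depth = []
    last_writer = {}            # register name -> index of its latest producer
    for i, (dst, srcs) in enumerate(seq):
        producers = [last_writer[s] for s in srcs if s in last_writer]
        depth.append(1 + max((depth[p] for p in producers), default=0))
        last_writer[dst] = i

    print("chain depths:", depth)                 # [1, 2, 1, 3]
    print("ideal ILP ~", len(seq) / max(depth))   # 4 instructions over 3 dependence levels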

4 Multiple Issue
Issue more than 1 instruction per cycle. Two variants:
- Superscalar: extract parallelism from a single-issue instruction set; run unmodified sequential programs
- VLIW: parallelism is explicit in the instruction set (Chapter 4)

5 Superscalar variants
Scheduling:
- Static: in-order execution; early superscalar processors, e.g., Sun UltraSPARC II
- Dynamic: Tomasulo's algorithm – out-of-order execution, in-order issue
- Dynamic with speculation: out-of-order execution, out-of-order issue; most high-end CPUs today

6 Issue constraints
- Number of instructions to issue at once: minimum 2, maximum 8
- Number of execution units: can be larger than the number issued; usually classes of execution units, sometimes with multiple instances per class
- Dependencies: RAW still applies (see the sketch below)
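
As a rough illustration of the dependency constraint (my own sketch, not part of the lecture), the check below decides whether a candidate issue packet can go out in one cycle: no instruction in the packet may read a register written by an earlier instruction in the same packet.

    # A hypothetical issue packet: each entry is (destination, sources).
    def can_issue_together(packet):
        written = set()
        for dst, srcs in packet:
            if any(s in written for s in srcs):
                return False        # RAW hazard inside the packet
            written.add(dst)
        return True

    print(can_issue_together([("r1", ("r2",)), ("r3", ("r4",))]))  # True: independent
    print(can_issue_together([("r1", ("r2",)), ("r3", ("r1",))]))  # False: RAW on r1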

7 Implementation Costs
- Lookahead
- Instruction window
- IC alignment
- Register pressure: 2 extra read ports per function unit
- Branch penalties: e.g., a 4-issue CPU in which 25% of instructions are branches
- Hazard detection
- More?

8 Single Issue [Datapath diagram: PC, instruction memory, register array, ALU, data memory; 32-bit instruction fetch]

9 Single Issue – separate float [Datapath diagram: PC, instruction memory, integer register array and ALU, float register array and ALU, data memory; 32-bit instruction fetch]

10 Multiple Issue – separate float [Datapath diagram: as above, but with a 64-bit (two-instruction) fetch]

11 Multiple Issue – replication [Datapath diagram: as above, with the integer ALU and float ALU replicated; 64-bit fetch]

12 Superscalar Function Units
Can be the same as the issue width, or wider; varies by design:
- IBM RS/6000 (first superscalar): 1 ALU/Load, 1 FP
- Pentium II: 1 ALU/FP, 1 ALU, 1 Load, 1 Store, 1 Branch
- Alpha 21264: 1 ALU/FP/Branch, 2 ALU, 1 Load/Store

13 Forwarding – Standard Pipeline [Diagram: forwarding paths around a single integer ALU]

14 Forwarding – 2-way Superscalar [Diagram: forwarding paths between two integer ALUs]

15 Forwarding – 4-way Superscalar [Diagram: forwarding paths among four integer ALUs]

16 Clustering [Diagram: four integer ALUs organized into clusters] The Alpha is an example.

17 Forwarding costs
- RAW detection: N² growth in detection logic, but relatively localized (a rough count follows below)
- N² growth in input muxes: not so good
- Wide buses (64-bit data)
- Probably limits growth
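
The N² terms can be made concrete with a back-of-the-envelope count. The model below is an assumption of mine (every result may need to be compared against, and forwarded to, every source operand in the machine), so the constants are illustrative; only the growth trend matters:

    # Rough forwarding-cost model for an N-wide machine with 2 source operands per ALU.
    # Comparators: each of the N results is checked against each of the 2N sources.
    # Mux inputs: each of the 2N sources can select any of the N results (plus the register file).
    def forwarding_cost(n_alus, srcs_per_alu=2):
        comparators = n_alus * (n_alus * srcs_per_alu)
        mux_inputs = (n_alus * srcs_per_alu) * (n_alus + 1)
        return comparators, mux_inputs

    for n in (1, 2, 4, 8):
        print(n, forwarding_cost(n))   # grows roughly quadratically with issue width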

18 Impact on Caches
- Need to fetch n times as many bytes to issue n instructions
- Branches are a problem: the not-taken path can be issued, but must be checked
- Data cache pressure: more reads in parallel, more writes in parallel

19 Instruction Window
- Must fetch enough bytes from the I-cache to issue all instructions: a wider fetch at the same clock rate
- Must be able to analyze all instructions in the window at once, looking for branches

20 Trace Caches
- Combine the instruction cache with the Branch Target Buffer
- Tag: PC plus the directions of branches (a minimal lookup sketch follows below)
- Instruction fetches come from the trace cache before the I-cache; still need the regular I-cache as backup
- Used in high-end superscalars, e.g., the Pentium 4 (which actually caches micro-ops)
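
A minimal lookup sketch, under assumed structure rather than the Pentium 4's actual organization: the trace cache is indexed by the starting PC together with the predicted directions of the branches inside the trace, and falls back to the ordinary instruction cache on a miss.

    # Hypothetical trace cache: key = (start PC, tuple of predicted branch directions).
    trace_cache = {
        (0x0, (True,)): ["inst 0", "inst 4"],   # a trace that follows a taken branch
    }
    icache = {0x0: ["inst 0", "inst 1"], 0x4: ["inst 4", "inst 5"]}

    def fetch(pc, predicted_dirs):
        trace = trace_cache.get((pc, tuple(predicted_dirs)))
        if trace is not None:
            return trace                 # one fetch supplies the whole predicted path
        return icache[pc]                # fall back to the regular instruction cache

    print(fetch(0x0, [True]))    # ['inst 0', 'inst 4'] from the trace cache
    print(fetch(0x0, [False]))   # ['inst 0', 'inst 1'] from the I-cache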

21 Trace Cache example
Assume inst 0 is a branch to inst 4.

I-cache:
  Addr | Data
  0    | inst 0, inst 1
  2    | inst 2, inst 3
  4    | inst 4, inst 5

Trace cache:
  Addr | Taken? | Data
  0    | Yes    | inst 0, inst 4
  2    | -      | inst 2, inst 3
  4    | -      | inst 4, inst 5

[Pipeline diagrams: inst 0, inst 4, inst 5, inst 6, inst 7 through stages F D X M W, once fetched via the I-cache and once via the trace cache]

22 Dynamic Scheduling
- Apply Tomasulo's algorithm to a superscalar: reservation stations at each function unit (a minimal record sketch follows below)
- Difference: simultaneous assignments make pipeline management more complex; sometimes the pipeline is extended to do this calculation
- May need even more function units: out-of-order execution may result in resource conflicts
- May need an extra CDB (allows more than 1 completion per cycle)
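
For reference, a bare-bones reservation-station record in the style of Tomasulo's algorithm; the field names follow the usual textbook convention, and this is a sketch rather than any specific machine's implementation:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReservationStation:
        busy: bool = False
        op: Optional[str] = None     # operation to perform
        Vj: Optional[int] = None     # value of the first operand, once known
        Vk: Optional[int] = None     # value of the second operand, once known
        Qj: Optional[str] = None     # station producing Vj (None = value is ready)
        Qk: Optional[str] = None     # station producing Vk (None = value is ready)

        def ready(self):
            return self.busy and self.Qj is None and self.Qk is None

    rs = ReservationStation(busy=True, op="ADD", Vj=3, Qk="Mult1")
    print(rs.ready())   # False: still waiting on the CDB broadcast from Mult1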

23 Speculation
- Branches halt dynamic in-order issue
- Speculate on the result of branches and issue anyway; fix up later if you guess wrong
- Really need good branch prediction; likewise, need dynamic scheduling
- Used in all modern high-performance processors: Pentium III, Pentium 4, Pentium M; PowerPC 603/604/G3/G4/G5; MIPS R10000/R12000; AMD K5/K6/Athlon/Opteron

24 How?
- Allow out-of-order issue, out-of-order execution, and out-of-order writeback
- Commit writes only when branches are resolved; prevent speculative execution from changing architectural state
- Keep pending writes in a reorder buffer (a sketch of an entry follows below); like adding even more registers than dynamic scheduling alone
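
A reorder-buffer entry can be pictured as a small record like the one below. This is a sketch of my own; real designs carry more fields (e.g., exception state, store addresses):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ROBEntry:
        instr: str                    # the instruction this slot belongs to
        dest: Optional[str] = None    # architectural register (or memory location) to update
        value: Optional[int] = None   # pending result, held here instead of the register file
        done: bool = False            # has the result been written back yet?
        is_branch: bool = False
        mispredicted: bool = False    # set when the branch resolves the wrong way

    slot = ROBEntry("add r1,r2,r3", dest="r1", value=7, done=True)
    print(slot.dest, slot.value)      # r1 7 -- held in the buffer until commit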

25 Adding a reorder buffer [Diagram of the extended datapath] The store buffer is gone.

26 Stages of Speculative Execution
- Issue: issue if a reservation station and a reorder buffer slot are free
- Execute: when both operands are available, execute (check the CDB)
- Write Result: write to the reservation stations and the reorder buffer
- Commit: when the instruction at the head of the buffer is ready, update registers/memory; if it is a mispredicted branch, flush the reorder buffer (sketched below)
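
The commit stage in particular is easy to sketch as a loop over the head of the buffer. This is an illustrative model (entries are plain dictionaries here, independent of the record on the previous slide); only the retired head is allowed to touch the register file:

    from collections import deque

    # Each ROB slot is a dict with keys: dest, value, done, is_branch, mispredicted.
    def commit(rob, regfile):
        while rob and rob[0]["done"]:            # retire strictly in program order
            head = rob.popleft()
            if head.get("is_branch") and head.get("mispredicted"):
                rob.clear()                      # squash everything younger than the branch
                return "flush"                   # caller redirects fetch to the correct target
            if head.get("dest") is not None:
                regfile[head["dest"]] = head["value"]   # state changes only at commit
        return "ok"

    rob = deque([
        {"dest": "r1", "value": 7, "done": True},
        {"is_branch": True, "mispredicted": True, "done": True},
        {"dest": "r4", "value": 99, "done": True},   # younger than the bad branch
    ])
    regs = {}
    print(commit(rob, regs), regs)   # flush {'r1': 7} -- the younger write never commits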

27 Different implementations
Register Renaming (e.g., MIPS R10K):
- Architectural registers are replaced by a larger physical register array
- Registers are dynamically mapped to correspond to reservation stations and the reorder buffer (a toy renaming step follows below)
- Removes hazards: no name conflicts, since names are not reused
- Needs a free list! Deallocation is complex: do we still need a register after commit?
- Permits a lot of parallelism
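
A toy renaming step of my own, with a tiny hypothetical register file; a real design such as the R10K's is much wider and also checkpoints the map at branches so it can recover from mispredictions:

    # Map architectural registers to physical registers; allocate a fresh physical
    # register for every destination so names are never reused while in flight.
    free_list = ["p4", "p5", "p6", "p7"]
    rename_map = {"r1": "p0", "r2": "p1", "r3": "p2"}   # architectural -> current physical

    def rename(dst, srcs):
        phys_srcs = [rename_map[s] for s in srcs]   # read the current mappings first
        new_phys = free_list.pop(0)                 # needs a free list!
        rename_map[dst] = new_phys
        return new_phys, phys_srcs

    print(rename("r1", ["r2", "r3"]))   # ('p4', ['p1', 'p2'])
    print(rename("r2", ["r1"]))         # ('p5', ['p4']) -- reads the renamed r1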

28 Advantages of Reorder Buffer
- No more imprecise interrupts: instructions complete out of order, but retire in order
- Keeps the dynamic schedule working through branches: they function as no-ops with respect to the pipeline (unless mispredicted)
- Extends register space: even more instructions can be in flight

29 Disadvantages of speculation
- Complexity: handling variable-length instructions, a lot of internal data copying, even more load on the CDB
- Even larger instruction window: looking past branches
- Pressure on clock frequency: sometimes simpler/faster beats complex/slower for performance, and simpler designs can be replicated more (see IBM's Blue Gene)

30 How much further?
- How far past a branch to speculate? How many branches to speculate past?
- Quadratic increase in complexity; pressure on pipeline depth
- No one does it very far