Presentation transcript:

Topics Left Superscalar machines IA64 / EPIC architecture Multithreading (explicit and implicit) Multicore Machines Clusters Parallel Processors Hardware implementation vs microprogramming

Superscalar Processors Chapter 14 Superscalar Processors Definition of Superscalar Design Issues: Instruction Issue Policy Register renaming Machine parallelism Branch Prediction Execution Pentium 4 example

What is Superscalar? A Superscalar machine executes multiple independent instructions in parallel. They are pipelined as well. “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently. Equally applicable to RISC & CISC, but more straightforward in RISC machines. The order of execution is usually assisted by the compiler.

Example of Superscalar Organization 2 Integer ALU pipelines, 2 FP ALU pipelines, 1 memory pipeline (?)

Superscalar v Superpipelined

Limitations of Superscalar Dependent upon:
- Instruction-level parallelism available
- Compiler-based optimization
- Hardware support
Limited by:
- Data dependency
- Procedural dependency
- Resource conflicts

(Recall) True Data Dependency (Must Write before Read)
ADD r1, r2     ; r1 + r2 → r1
MOVE r3, r1    ; r1 → r3
Can fetch and decode second instruction in parallel with first
LOAD r1, X     ; X (memory) → r1
MOVE r3, r1    ; r1 → r3
Can NOT execute second instruction until first is finished
Second instruction is dependent on first (Read after Write)

(Recall) Antidependency (Must Read before Write)
ADD R4, R3, 1    ; R3 + 1 → R4
ADD R3, R5, 1    ; R5 + 1 → R3
Cannot complete the second instruction before the first has read R3
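
Both kinds of dependency (plus output dependencies) can be detected mechanically by intersecting the register sets that two instructions read and write. A minimal sketch in Python (the instruction encoding is invented for illustration):

```python
def classify_dependency(first, second):
    """Classify the hazard between two instructions in program order.

    Each instruction is a dict with 'reads' and 'writes' register sets
    (a simplified encoding assumed for illustration).
    """
    hazards = []
    if first['writes'] & second['reads']:
        hazards.append('RAW')   # true dependency: read-after-write
    if first['reads'] & second['writes']:
        hazards.append('WAR')   # antidependency: write-after-read
    if first['writes'] & second['writes']:
        hazards.append('WAW')   # output dependency: write-after-write
    return hazards

# ADD R4, R3, 1  then  ADD R3, R5, 1  -- the antidependency above
i1 = {'reads': {'R3'}, 'writes': {'R4'}}
i2 = {'reads': {'R5'}, 'writes': {'R3'}}
print(classify_dependency(i1, i2))  # ['WAR']
```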

(Recall) Procedural Dependency Can't execute instructions after a branch in parallel with instructions before the branch, because the branch outcome (and target) is not known until the branch executes. Note: also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed

(recall) Resource Conflict Two or more instructions requiring access to the same resource at the same time e.g. two arithmetic instructions need the ALU Solution - Can possibly duplicate resources e.g. have two arithmetic units

Effect of Dependencies on Superscalar Operation Notes: 1) A stall hurts superscalar operation doubly: every stalled cycle wastes multiple issue slots, not just one. 2) CISC machines typically have variable-length instructions that must be at least partially decoded before the next can be fetched – not good for superscalar operation

Instruction-level Parallelism – degree of
Consider:
LOAD R1, R2
ADD R3, 1
ADD R4, R2
These can be handled in parallel.
ADD R4, R3
STO (R4), R0
These cannot be handled in parallel.
The "degree" of instruction-level parallelism is the number of instructions that can be executed in parallel without stalling for dependencies
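
The grouping in this example can be computed mechanically: greedily pack consecutive instructions into issue groups that contain no mutual dependencies. A sketch (the register-set encoding is an assumption, and real hardware also checks functional-unit limits):

```python
def parallel_groups(instrs):
    """Greedily pack instructions into dependency-free issue groups.
    Each instruction: (reads, writes) as sets of register names."""
    groups = []
    for reads, writes in instrs:
        placed = False
        for g in groups[-1:]:  # in-order issue: only try the open group
            conflict = any(w & reads or w & writes or r & writes
                           for r, w in g)
            if not conflict:
                g.append((reads, writes))
                placed = True
        if not placed:
            groups.append([(reads, writes)])
    return groups

# The slide's sequence: first three independent, last two dependent
seq = [({'R2'}, {'R1'}),        # LOAD R1, R2
       ({'R3'}, {'R3'}),        # ADD  R3, 1
       ({'R2'}, {'R4'}),        # ADD  R4, R2
       ({'R4', 'R3'}, {'R4'}),  # ADD  R4, R3
       ({'R4', 'R0'}, set())]   # STO (R4), R0
print([len(g) for g in parallel_groups(seq)])  # [3, 1, 1]
```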

Instruction Issue Policies Three orderings matter:
- Order in which instructions are fetched
- Order in which instructions are executed
- Order in which instructions update registers and memory values (order of completion)
Standard Categories:
- In-order issue with in-order completion
- In-order issue with out-of-order completion
- Out-of-order issue with out-of-order completion

In-Order Issue -- In-Order Completion Issue instructions in the order they occur: not very efficient, since instructions must stall whenever necessary (and stalling in superpipelining is expensive)

In-Order Issue -- In-Order Completion (Example) Assume: I1 requires 2 cycles to execute I3 & I4 conflict for the same functional unit I5 depends upon value produced by I4 I5 & I6 conflict for a functional unit

In-Order Issue -- Out-of-Order Completion (Example) Again: I1 requires 2 cycles to execute I3 & I4 conflict for the same functional unit I5 depends upon value produced by I4 I5 & I6 conflict for a functional unit How does this affect interrupts?

Out-of-Order Issue -- Out-of-Order Completion Decouple the decode pipeline from the execution pipeline. Can continue to fetch and decode until the "window" is full. When a functional unit becomes available, an instruction can be executed (usually staying as close to program order as possible). Since instructions have already been decoded, the processor can look ahead.

Out-of-Order Issue -- Out-of-Order Completion (Example) Again: I1 requires 2 cycles to execute I3 & I4 conflict for the same functional unit I5 depends upon value produced by I4 I5 & I6 conflict for a functional unit Note: I5 depends upon I4, but I6 does not
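
The policies above can be contrasted with a toy completion-time model. The sketch below (latencies and the dependency table are illustrative, not the slide's exact functional-unit conflicts) shows how holding results back for in-order completion serializes retirement:

```python
def completion_order(latencies, deps, in_order=True):
    """Return (completion order, finish times) for a toy pipeline.

    latencies: list of execute latencies per instruction (cycles).
    deps: dict mapping instruction index -> set of indices it needs.
    With in_order=True an instruction cannot complete before its
    predecessor (results are held back); otherwise it completes as
    soon as its own execution and dependencies allow.
    """
    finish = []
    for i, lat in enumerate(latencies):
        start = max([finish[d] for d in deps.get(i, ())], default=0)
        t = start + lat
        if in_order and finish:
            t = max(t, finish[-1])  # cannot retire before predecessor
        finish.append(t)
    order = sorted(range(len(latencies)), key=lambda i: (finish[i], i))
    return order, finish

lat = [2, 1, 1]   # I1 takes 2 cycles; I2 and I3 take 1
print(completion_order(lat, {}, in_order=False))  # I2, I3 finish first
print(completion_order(lat, {}, in_order=True))   # retirement serialized
```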

Register Renaming to avoid hazards Output and antidependencies occur because register contents may not reflect the correct ordering from the program Can require a pipeline stall One solution: Allocate Registers dynamically (renaming registers)

Register Renaming example
Add R3, R3, R5    R3b := R3a + R5a   (I1)
Add R4, R3, 1     R4b := R3b + 1     (I2)
Add R3, R5, 1     R3c := R5a + 1     (I3)
Add R7, R3, R4    R7b := R3c + R4b   (I4)
Without "subscript": the logical register named in the instruction
With subscript: the hardware register actually allocated: R3a, R3b, R3c
Note: R3c avoids both the antidependency on I2 and the output dependency on I1
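
The renaming shown here is essentially a table lookup: sources read the current architectural-to-physical mapping, and each destination gets a fresh physical register. A sketch that reproduces the slide's subscripts (the suffix scheme is just for illustration):

```python
def rename(instrs, arch_regs):
    """Rename destinations to fresh physical registers (sketch).
    instrs: (dest, src1, src2) tuples; sources that are not register
    names (e.g. the literal '1') pass through unchanged."""
    mapping = {r: r + 'a' for r in arch_regs}   # R3 -> R3a initially
    suffix = {r: 'a' for r in arch_regs}
    out = []
    for dest, s1, s2 in instrs:
        s1 = mapping.get(s1, s1)                # read current mapping
        s2 = mapping.get(s2, s2)
        suffix[dest] = chr(ord(suffix[dest]) + 1)  # a -> b -> c ...
        mapping[dest] = dest + suffix[dest]     # fresh register for dest
        out.append((mapping[dest], s1, s2))
    return out

prog = [('R3', 'R3', 'R5'),   # I1
        ('R4', 'R3', '1'),    # I2
        ('R3', 'R5', '1'),    # I3
        ('R7', 'R3', 'R4')]   # I4
for line in rename(prog, ['R3', 'R4', 'R5', 'R7']):
    print(line)
# ('R3b', 'R3a', 'R5a')
# ('R4b', 'R3b', '1')
# ('R3c', 'R5a', '1')
# ('R7b', 'R3c', 'R4b')
```

Because I3 writes R3c rather than reusing R3b, the antidependency with I2 and the output dependency with I1 disappear, exactly as the slide notes.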

Recapping: Machine Parallelism Support Duplication of Resources Out of order issue hardware Windowing to decouple execution from decode Register Renaming capability

Speedups of Machine Organizations (Without Procedural Dependencies) Not worth duplication of functional units without register renaming Need instruction window large enough (more than 8, probably not more than 32)

Branch Prediction in Superscalar Machines Delayed branch not used much. Why? Multiple instructions need to execute in the delay slot. This leads to much complexity in recovery. Branch prediction should be used - Branch history is very useful
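
Branch history as mentioned here is commonly kept as a small saturating counter per branch. The classic 2-bit scheme (a standard textbook design, not specific to any machine in these slides) can be sketched as:

```python
class TwoBitPredictor:
    """Classic 2-bit saturating counter: states 0,1 predict not-taken;
    states 2,3 predict taken. One counter per branch address."""
    def __init__(self):
        self.table = {}

    def predict(self, pc):
        return self.table.get(pc, 1) >= 2   # default: weakly not-taken

    def update(self, pc, taken):
        c = self.table.get(pc, 1)
        self.table[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

bp = TwoBitPredictor()
outcomes = [True, True, False, True]   # a loop-like branch pattern
hits = 0
for t in outcomes:
    hits += bp.predict(0x40) == t
    bp.update(0x40, t)
print(hits, "of", len(outcomes), "predicted correctly")  # 2 of 4
```

The 2-bit hysteresis means a single not-taken iteration (a loop exit) does not immediately flip the prediction for the next run of the loop.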

View of Superscalar Execution

Committing or Retiring Instructions Results need to be put into order (commit or retire) Results sometimes must be held in temporary storage until it is certain they can be placed in “permanent” storage. (either committed or retired/flushed) Temporary storage requires regular clean up – overhead – done in hardware.
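
In-order commit with out-of-order completion is typically implemented with a circular reorder buffer: entries may complete in any order, but only completed entries at the head may retire. A minimal sketch (not any particular machine's design):

```python
from collections import deque

class ReorderBuffer:
    """Entries retire strictly from the head; completion may be OOO."""
    def __init__(self):
        self.entries = deque()   # [tag, completed?] in program order

    def allocate(self, tag):
        self.entries.append([tag, False])

    def complete(self, tag):
        for e in self.entries:
            if e[0] == tag:
                e[1] = True

    def retire(self):
        retired = []
        while self.entries and self.entries[0][1]:
            retired.append(self.entries.popleft()[0])
        return retired

rob = ReorderBuffer()
for t in ('I1', 'I2', 'I3'):
    rob.allocate(t)
rob.complete('I3')       # finishes first, but must wait in the buffer
print(rob.retire())      # [] : I1 is not done yet
rob.complete('I1')
rob.complete('I2')
print(rob.retire())      # ['I1', 'I2', 'I3']
```

The "clean up" the slide mentions is the retire loop: hardware walks the head of the buffer each cycle, either committing results or flushing entries from a mispredicted path.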

Superscalar Hardware Support Facilities to simultaneously fetch multiple instructions Logic to determine true dependencies involving register values and Mechanisms to communicate these values Mechanisms to initiate multiple instructions in parallel Resources for parallel execution of multiple instructions Mechanisms for committing process state in correct order

Example: Pentium 4 A Superscalar CISC Machine

Pentium 4 alternate view

Pentium 4 pipeline 20 stages !

a) Generation of Micro-ops (stages 1 &2) Using the Branch Target Buffer and Instruction Translation Lookaside Buffer, the x86 instructions are fetched 64 bytes at a time from the L2 cache The instruction boundaries are determined and instructions decoded into 1-4 118-bit RISC micro-ops Micro-ops are stored in the trace cache

b) Trace cache next instruction pointer (stage 3) The Trace Cache Branch Target Buffer contains dynamically gathered history information (4-bit tag). If the target is not in the BTB:
- Branch not PC-relative: predict taken if it is a return, not taken otherwise
- For PC-relative backward conditional branches, predict taken; otherwise predict not taken
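
These BTB-miss fallback rules amount to a small static decision procedure. A hedged sketch (the branch-descriptor fields are an assumed encoding, not Intel's):

```python
def static_predict(branch):
    """Static prediction on a BTB miss, following the rules above.
    'branch' is a dict with boolean fields (assumed encoding)."""
    if not branch['pc_relative']:
        return branch['is_return']    # taken only if it is a return
    if branch['conditional'] and branch['backward']:
        return True                   # backward branch: likely a loop
    return False                      # forward: predict not taken

loop_branch = {'pc_relative': True, 'conditional': True,
               'backward': True, 'is_return': False}
print(static_predict(loop_branch))  # True
```

The backward-taken heuristic works because backward conditional branches almost always close loops, which iterate more often than they exit.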

c) Trace Cache fetch (stage 4) Orders micro-ops into program-ordered sequences called traces. These are fetched in order, subject to branch prediction. Some instructions (complex CISC instructions) require many micro-ops; their sequences are coded into a ROM and are fetched from the ROM

d) Drive (stage 5) Delivers instructions from the Trace Cache to the Rename/Allocator module for reordering

e) Allocate: register renaming (stages 6, 7, & 8) Allocates resources for execution (3 micro-ops arrive per clock cycle):
- Each micro-op is allocated a slot in the 126-entry circular Reorder Buffer (ROB), which tracks the progress of the micro-ops. Buffer entries include:
  - State: scheduled, dispatched, completed, ready for retire
  - Address of the instruction that generated the micro-op
  - The operation
- Alias registers are assigned for one of the 16 architectural registers (128 alias registers) {to remove data dependencies}
- An entry is allocated in one of the 2 scheduler queues: memory access or not
The micro-ops are dispatched out of order as resources become available; they are retired in order from the ROB

f) Micro-op queuing (stage 9) Micro-ops are loaded into one of 2 queues: - one for memory operations - one for non memory operations Each queue operates on a FIFO policy

g) Micro-op scheduling (stages 10, 11, & 12) h) Dispatch (stages 13 & 14) The 2 schedulers retrieve micro-ops that have all their operands ready and dispatch them to an available execution unit (up to 6 per clock cycle). If two micro-ops need the same unit, they are dispatched in sequence.
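
Operand-ready scheduling of this kind can be sketched as picking queued micro-ops whose sources are available, at most one per functional unit per cycle, up to the dispatch width (all structures here are illustrative, not the actual Pentium 4 hardware):

```python
def dispatch(queue, ready_regs, width=6):
    """Pick up to `width` micro-ops whose source operands are all
    ready, in queue (FIFO) order; ties for the same unit go in order."""
    dispatched, used_units, remaining = [], set(), []
    for uop in queue:
        ok = (uop['srcs'] <= ready_regs and       # operands ready?
              uop['unit'] not in used_units and   # unit free this cycle?
              len(dispatched) < width)
        if ok:
            dispatched.append(uop['name'])
            used_units.add(uop['unit'])
        else:
            remaining.append(uop)                 # wait for next cycle
    return dispatched, remaining

q = [{'name': 'u1', 'srcs': {'P1'}, 'unit': 'alu0'},
     {'name': 'u2', 'srcs': {'P9'}, 'unit': 'alu1'},   # P9 not ready
     {'name': 'u3', 'srcs': {'P2'}, 'unit': 'alu0'}]   # alu0 taken
d, rest = dispatch(q, ready_regs={'P1', 'P2'})
print(d, [u['name'] for u in rest])  # ['u1'] ['u2', 'u3']
```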

i) Register file (stages 15 & 16) j) Execute: flags (stages 17 & 18) The register files are the sources for pending fixed-point and floating-point operations A separate stage is used to compute the flags

k) Branch check (stage 19) l) Branch check results (stage 20) Checks flags and compares results with predictions If the branch prediction was wrong: all incorrect micro-ops must be flushed (don’t want to be wrong!) the correct branch destination is provided to the Branch Predictor the pipeline is restarted from the new target address