The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.

Alpha 21264 (EV6) pipeline

Instruction Fetch
Fetches 4 instructions per cycle.
Techniques for maximum fetch efficiency:
 Large 64KB 2-way set-associative instruction cache
 Line and set prediction to indicate where to fetch the next block from, including which set should be used
 Low misprediction cost for line and set prediction (a single-cycle bubble)
 Branch predictor whose prediction scheme dynamically chooses between local and global history
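The combined local/global scheme above can be sketched as a small simulator. This is a toy model under stated assumptions, not the 21264's actual predictor: the table sizes, the indexing, and the use of 2-bit counters everywhere are illustrative choices.

```python
def sat_update(ctr, inc):
    """2-bit saturating counter in [0, 3]; a value >= 2 means 'predict taken'."""
    return min(ctr + 1, 3) if inc else max(ctr - 1, 0)

class TournamentPredictor:
    """Toy tournament predictor: a chooser selects, per global history,
    between a local-history component and a global-history component.
    Sizes and indexing are assumptions, not the 21264's configuration."""

    def __init__(self, bits=10):
        n = 1 << bits
        self.mask = n - 1
        self.local_hist = [0] * n    # per-branch local history registers
        self.local_ctrs = [1] * n    # counters indexed by local history
        self.global_ctrs = [1] * n   # counters indexed by global history
        self.chooser = [1] * n       # >= 2 -> trust the global component
        self.ghist = 0

    def predict(self, pc):
        local = self.local_ctrs[self.local_hist[pc & self.mask]] >= 2
        glob = self.global_ctrs[self.ghist] >= 2
        return glob if self.chooser[self.ghist] >= 2 else local

    def update(self, pc, taken):
        i = pc & self.mask
        lh = self.local_hist[i]
        local = self.local_ctrs[lh] >= 2
        glob = self.global_ctrs[self.ghist] >= 2
        if local != glob:  # train the chooser toward the correct component
            self.chooser[self.ghist] = sat_update(self.chooser[self.ghist],
                                                  glob == taken)
        self.local_ctrs[lh] = sat_update(self.local_ctrs[lh], taken)
        self.global_ctrs[self.ghist] = sat_update(self.global_ctrs[self.ghist],
                                                  taken)
        self.local_hist[i] = ((lh << 1) | int(taken)) & self.mask
        self.ghist = ((self.ghist << 1) | int(taken)) & self.mask
```

After enough history, a strongly biased branch is predicted correctly by both components, and the chooser only matters when the two disagree.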

Register Renaming
Assigns a unique storage location to each write reference to a register.
Eliminates WAR and WAW register dependencies, while preserving all the RAW register dependencies necessary for correct computation.
64 architectural registers, plus 41 integer and 41 floating-point registers available for holding speculative results prior to instruction retirement, in an 80-instruction in-flight window.
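A minimal sketch of the renaming idea, assuming a simple map-plus-free-list design; the register names (`r*`, `p*`) and counts are illustrative, not the 21264's exact structures.

```python
class Renamer:
    """Toy register renamer: every architectural destination gets a fresh
    physical register, so WAR and WAW hazards disappear while RAW links
    (reads through the current map) are preserved."""

    def __init__(self, num_arch=32, num_phys=80):
        # initial map: architectural ri -> physical pi
        self.map = {f"r{i}": f"p{i}" for i in range(num_arch)}
        self.free = [f"p{i}" for i in range(num_arch, num_phys)]

    def rename(self, dst, srcs):
        # read the sources through the current map (preserves RAW)
        new_srcs = [self.map[s] for s in srcs]
        # allocate a fresh physical register for the destination
        new_dst = self.free.pop(0)
        self.map[dst] = new_dst
        return new_dst, new_srcs
```

Two successive writes to the same architectural register receive different physical registers (no WAW), while an intervening reader is renamed to the first writer's physical register (RAW intact).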

Out-of-Order Issue Queues
Separate integer and floating-point queues.
 Each cycle, the queues select from pending instructions as they become data-ready, using register scoreboards based on the renamed register numbers.
 The scoreboards maintain the status of the renamed registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle instructions.
 When a functional-unit (FU) result becomes available, the scoreboard unit notifies the instructions in the queue that require that register value.
 These instructions can issue as soon as the bypassed result is available from the FU or load.

Out-of-Order Issue Queues (cont.)
20-entry integer queue
 can issue 4 instructions per cycle
15-entry floating-point queue
 can issue 2 instructions per cycle
Instructions are statically assigned to 2 of the 4 integer execution pipes before entering the queue.
The issue queue has 2 arbiters that each cycle dynamically issue the oldest 2 instructions within the upper and lower pipes, respectively.
The queues issue instructions speculatively.
The queue is collapsing: an entry becomes available as soon as its instruction issues or is squashed due to mis-speculation.
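The data-ready selection and collapsing behavior can be sketched as one issue cycle of a toy scoreboarded queue. The tuple layout, register names, and the single shared ready-set are assumptions for illustration, not the hardware's structures.

```python
def issue_cycle(queue, ready, width):
    """One issue cycle of a toy scoreboarded issue queue: instructions
    whose renamed source registers are all ready may issue; the oldest
    `width` of them are selected, and the queue collapses (non-issuing
    entries are kept in order). Each instruction is (name, srcs, dst).
    `ready` is mutated to model the result bypass to dependents."""
    issued, remaining = [], []
    for instr in queue:                      # ordered oldest -> youngest
        name, srcs, dst = instr
        if len(issued) < width and set(srcs) <= ready:
            issued.append(instr)
        else:
            remaining.append(instr)
    # issued results become ready for the next cycle's dependents (bypass)
    ready |= {dst for _, _, dst in issued}
    return issued, remaining
```

Note how a younger but data-ready instruction can issue ahead of an older one that is still waiting on an operand.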

Execution Engine
All execution units require access to the register file.
 14 ports would be needed to support 4 simultaneous instructions in addition to 2 load operations → a very large register file.
 The 21264 instead splits the register file into 2 clusters that contain duplicates of the 80-entry register file.
 2 pipes access a single register file to form a cluster, and 2 clusters are combined to support 4-way integer instruction execution.
Incremental cost: an additional cycle of latency to broadcast results from each integer cluster to the other cluster → a small cost.
The integer issue queue dynamically schedules instructions to minimize the 1-cycle cross-cluster communication cost.
The 2 floating-point execution pipes access a single 72-entry register file.
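One way to picture "scheduling to minimize the cross-cluster cost" is a heuristic that steers an instruction toward the cluster that produced its operands. This is purely an illustrative sketch, not the 21264's actual arbiter logic.

```python
def pick_cluster(src_clusters, free_slots):
    """Heuristic sketch: prefer the cluster (0 or 1) that produced most
    of this instruction's source operands, avoiding the extra 1-cycle
    cross-cluster broadcast; fall back to the other cluster when the
    preferred one has no free issue slot this cycle."""
    prefer = 0 if src_clusters.count(0) >= src_clusters.count(1) else 1
    return prefer if free_slots[prefer] > 0 else 1 - prefer
```

The fallback matters: issuing on the "wrong" cluster costs one broadcast cycle, but stalling for a slot on the preferred cluster could cost more.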

Execution Engine (cont.)
New functionality not present in prior Alpha microprocessors:
 Fully pipelined integer multiply unit
 Integer population count and leading/trailing zero count unit
 Floating-point square-root functional unit
 Instructions to move register values directly between the floating-point and integer registers

Memory System
Supports many in-flight memory references and out-of-order operation.
Receives up to 2 memory operations from the integer execution pipes every cycle.
The data cache operates at twice the processor clock frequency.
Load latency: 3 cycles for integer loads and 4 cycles for floating-point loads.

Store/Load Memory Ordering
Hazard detection logic recovers from the mis-speculation that occurs when a load incorrectly issues before an earlier store to the same address.
After a load mis-speculates for the first time, the out-of-order execution core is trained to avoid the hazard on subsequent executions of the same load: a bit is set in a load wait table that is examined when the load is fetched. If the bit is set, the processor forces the issue point of the load to be delayed until all prior stores have issued.
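The training mechanism above amounts to a small PC-indexed bit table. A minimal sketch, assuming a power-of-two table size and simple PC masking (both illustrative):

```python
class LoadWaitTable:
    """Toy load wait table: after a load mis-speculates past an older
    store, a bit indexed by the load's PC is set; when the bit is set,
    the load is held until all prior stores have issued. The table size
    and index function are assumptions for illustration."""

    def __init__(self, entries=1024):
        self.bits = [False] * entries
        self.mask = entries - 1      # assumes entries is a power of two

    def must_wait(self, pc):
        # consulted when the load is fetched
        return self.bits[pc & self.mask]

    def record_violation(self, pc):
        # set on the first store/load order mis-speculation for this load
        self.bits[pc & self.mask] = True
```

Because the table is finite and PC-indexed, unrelated loads can alias to the same entry and be delayed unnecessarily; that is the usual trade-off for this kind of predictor.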

Load Hit/Miss Prediction
To achieve the 3-cycle integer load-hit latency, consumers of integer load data must be issued speculatively, before it is known whether the load hit or missed in the on-chip data cache.
A load miss triggers a mini-restart:
 When consumers speculatively issue 3 cycles after a load that misses, 2 integer issue cycles are squashed, and all instructions that issued during these 2 cycles are pulled back into the issue queue to be re-issued later.
 This is less costly than a full pipeline restart, but still expensive for applications with many integer load misses.
The 21264 therefore predicts when loads will miss and, in that case, does not speculatively issue the consumers of the load.
Effective load latency: 5 cycles for an integer load hit that is incorrectly predicted to miss.

Load Hit Speculation
Pipeline stage symbols:
  Q  Issue queue
  R  Register file read
  E  Execute
  D  Dcache access
  B  Data bus active
[Figure: Pipeline Timing for Integer Load. The load proceeds through stages Q, R, E, D, B; instructions 1 and 2 issue speculatively in its shadow, before the hit/miss outcome is known.]

Load Hit Speculation (cont.)
There are 2 cycles in which the issue queue may speculatively issue instructions that use load data before the Dcache hit information is known.
Any instructions issued in these 2 cycles are kept in the issue queue until the load-hit condition is known, even if they do not depend on the load operation.
 If the load hits → the instructions are removed from the queue.
 If the load misses → execution of these instructions is aborted, and the instructions are allowed to request service again.
In the previous example, instructions 1 and 2 are issued within the speculative window of the load instruction. If the load hits, the instructions will be removed from the queue by the start of cycle 7; if it misses, both instructions will be aborted from the execution pipelines.

Load Hit Speculation (cont.)
If a program's loads are likely to miss, the 21264 can still benefit from an instruction stream scheduled for the Dcache miss latency.
A saturating 4-bit counter is incremented by 1 when a load hits and decremented by 2 when a load misses. When the upper bit of the counter is 0, the integer load latency is increased to 5 cycles and the speculative window is removed.
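The counter described above is easy to state in code. The initial counter value is an assumption (the slide does not give one); the update rule and the upper-bit test follow the text.

```python
class LoadHitPredictor:
    """Sketch of the 4-bit saturating hit/miss counter described above:
    +1 on a cache hit, -2 on a miss. When the upper bit is clear, loads
    are predicted to miss, so consumers are not issued speculatively.
    The initial value (optimistic: predict hit) is an assumption."""

    def __init__(self):
        self.ctr = 15                         # 4-bit counter, range 0..15

    def predict_hit(self):
        return (self.ctr & 0b1000) != 0       # upper bit set -> predict hit

    def update(self, hit):
        self.ctr = min(self.ctr + 1, 15) if hit else max(self.ctr - 2, 0)
```

The asymmetric update (+1 hit, -2 miss) makes the predictor abandon speculation quickly during a run of misses but reinstate it only after a sustained run of hits.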

Load Hit Speculation (cont.)
[Figure: Pipeline Timing for Floating Point Load. Same stage symbols as above (Q, R, E, D, B); instructions 1 and 2 issue in the load's shadow.]

Load Hit Speculation (cont.)
The speculative window for floating-point loads is 1 cycle.
Instructions issued from the floating-point queue within this window of an FP load that misses are aborted only if they depend on the load being successful.
In the example, only instruction 1 is issued in the speculative window. If this instruction does not consume the data returned by the load, it is removed from the queue at the normal time (cycle 7). If it does depend on the load data and the load hits, it is removed from the queue one cycle later. If the load misses, instruction 1 is aborted from the execution pipelines and may request service again in cycle 7.
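The difference between the integer and floating-point squash rules from the last few slides can be summarized in one function. The instruction representation (name plus a set of source registers) is an assumption for illustration.

```python
def replay_set(spec_issued, load_dst, is_fp, load_hit):
    """Toy squash rule: on a hit, nothing replays. On an integer load
    miss, every instruction in the load's speculative window replays;
    on a floating-point load miss, only consumers of the load's result
    replay. Each instruction is (name, set_of_source_registers)."""
    if load_hit:
        return []                             # all removed from the queue
    if not is_fp:
        return list(spec_issued)              # integer: squash whole window
    return [ins for ins in spec_issued if load_dst in ins[1]]  # FP: dependents
```

This captures why the FP scheme is cheaper: independent instructions caught in an FP load's window keep their work.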

Conclusion
The 21264 was the fastest microprocessor available at its introduction.
 It combines high Alpha clock speeds with many advanced micro-architectural techniques, including out-of-order and speculative execution with many in-flight instructions.
 A high-bandwidth memory system quickly delivers data values to the execution core.
 The result is robust performance across many applications.