Out-of-Order Commit Processor

Out-of-Order Commit Processor
Adrián Cristal, Daniel Ortega, Josep Llosa and Mateo Valero

Performance-limiting factors
- Widening gap between memory and processor performance
- Increasing wire delays

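High Memory Latency
Current solutions:
- Multi-level cache hierarchy
- Large number of in-flight instructions, supported by these structures:
  - ROB
  - Register file
  - Load/store queue
  - Instruction queue
[Figure: ROB whose oldest entry is a load that misses in the cache, LD R1, 0(R3), followed by younger instructions such as DADDI R2, R4, #2]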

Motivation

Goal of the paper
To support a large number of in-flight instructions without up-sizing the ROB and instruction queue, using:
- Out-of-order commit
- Slow Lane Instruction Queuing

Re-Order Buffer
- Commits in order, to support precise interrupts
- Controls exactly when stores can write to memory
- Frees physical registers
- Enables the processor to recover from branch mispredictions
- Keeps track of all in-flight instructions
- A large number of in-flight instructions requires a huge ROB structure, which limits cycle time

Checkpointing instead of ROB

Implementation
- CAM (content-addressable memory) register mapping
- Inclusion of a Future Free bit, for freeing physical registers
- Free list, used for choosing a free register
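A minimal sketch of how the three pieces listed above could fit together, assuming one CAM entry per physical register and an identity mapping at reset. The class and method names (`CamRenameTable`, `rename_dest`, `release_future_free`) are hypothetical and are not taken from the slides; the slides do not say exactly when Future Free registers are released, so the release step is left as an explicit call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CamEntry:
    logical: Optional[int] = None  # logical register mapped to this physical register
    valid: bool = False            # True if this is the current mapping
    future_free: bool = False      # True if superseded and waiting to be freed


class CamRenameTable:
    """Illustrative CAM-style map table: one entry per physical register."""

    def __init__(self, num_physical: int, num_logical: int):
        self.entries = [CamEntry() for _ in range(num_physical)]
        # Assumption: logical register r starts out mapped to physical register r.
        for r in range(num_logical):
            self.entries[r] = CamEntry(logical=r, valid=True)
        self.free_list = list(range(num_logical, num_physical))

    def lookup(self, logical: int) -> int:
        # A hardware CAM searches all entries in parallel; here we scan.
        for phys, entry in enumerate(self.entries):
            if entry.valid and entry.logical == logical:
                return phys
        raise KeyError(logical)

    def rename_dest(self, logical: int) -> int:
        old = self.lookup(logical)
        new = self.free_list.pop(0)           # choose a free register
        self.entries[old].valid = False       # old mapping is no longer current
        self.entries[old].future_free = True  # but cannot be reused yet
        self.entries[new] = CamEntry(logical=logical, valid=True)
        return new

    def release_future_free(self) -> list:
        # Return every Future Free register to the free list.
        freed = []
        for phys, entry in enumerate(self.entries):
            if entry.future_free:
                entry.future_free = False
                entry.logical = None
                self.free_list.append(phys)
                freed.append(phys)
        return freed
```

Renaming a destination never frees the old register directly; it only marks it, which is what lets the release be deferred to a later commit point.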

Checkpointing

Operation

Checkpoint
Each checkpoint stores:
- Valid bits
- Future free bits
- Number of (active) instructions in that checkpoint
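The three fields above are enough to model commit at checkpoint granularity. The following is a simplified sketch, not the paper's exact mechanism: it assumes a checkpoint commits once it is the oldest, a younger checkpoint exists, and its instruction counter has reached zero, and that its future free registers are released at that moment. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare checkpoints by identity, not by field values
class Checkpoint:
    valid_bits: list                               # snapshot of the map table's valid bits
    future_free: set = field(default_factory=set)  # registers to free when this commits
    pending: int = 0                               # instructions not yet finished


class CheckpointTable:
    def __init__(self):
        self.checkpoints = []  # oldest first
        self.free_list = []

    def take(self, valid_bits):
        self.checkpoints.append(Checkpoint(list(valid_bits)))

    def dispatch(self, superseded_reg=None):
        # A new instruction always belongs to the youngest checkpoint.
        cp = self.checkpoints[-1]
        cp.pending += 1
        if superseded_reg is not None:
            cp.future_free.add(superseded_reg)
        return cp

    def complete(self, cp):
        # Instructions may finish in any order; only the count matters.
        cp.pending -= 1

    def try_commit(self):
        freed = []
        while len(self.checkpoints) > 1 and self.checkpoints[0].pending == 0:
            cp = self.checkpoints.pop(0)
            freed.extend(sorted(cp.future_free))
        self.free_list.extend(freed)
        return freed

    def rollback(self, cp):
        # Discard everything younger than cp and restart from its snapshot.
        idx = self.checkpoints.index(cp)
        del self.checkpoints[idx + 1:]
        cp.pending = 0
        cp.future_free.clear()
        return list(cp.valid_bits)
```

Because only a counter is kept per checkpoint, no per-instruction ROB entry is needed; the cost is that recovery restarts from the checkpoint rather than from the exact faulting instruction.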

Heuristic for taking checkpoints
- At the first branch after 64 instructions
- Every 512 instructions
- After 64 stores
- After flushing the pipeline
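The four rules can be sketched as a small policy object. This assumes all counts are measured since the previous checkpoint, which the slide does not state explicitly; `CheckpointPolicy` and its method names are hypothetical.

```python
class CheckpointPolicy:
    BRANCH_THRESHOLD = 64    # first branch after this many instructions
    MAX_INSTRUCTIONS = 512   # unconditional limit
    MAX_STORES = 64          # store limit

    def __init__(self):
        self.instructions = 0
        self.stores = 0

    def _reset(self):
        self.instructions = 0
        self.stores = 0

    def on_flush(self) -> bool:
        # A checkpoint is always taken after a pipeline flush.
        self._reset()
        return True

    def on_instruction(self, kind: str) -> bool:
        """Return True if a checkpoint should be taken at this instruction."""
        self.instructions += 1
        if kind == "store":
            self.stores += 1
        take = (
            (kind == "branch" and self.instructions > self.BRANCH_THRESHOLD)
            or self.instructions >= self.MAX_INSTRUCTIONS
            or self.stores >= self.MAX_STORES
        )
        if take:
            self._reset()
        return take
```

For example, 64 ALU instructions followed by a branch trigger a checkpoint at the branch, while a branch immediately after that checkpoint does not.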

Slow Lane Instruction Queuing

Slow Lane Instruction Queuing
- Identify instructions that will take a long time to execute
- Put them in a secondary buffer until they are ready
- An alternative approach in another paper treats these as critical instructions and puts them in the fast queue

SLIQ
- Pseudo-ROB for finding long-latency instructions
- Slow queue to store the long-latency instructions
- 32-bit register, one bit for each of the 32 logical registers, to keep track of dependences
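A rough sketch of how the 32-bit dependence register could steer instructions into the slow queue. It assumes a bit is set when a logical register is written by a long-latency load or by an instruction that depends on one, and cleared when the register is overwritten by an independent instruction; the clearing rule and the `classify` helper are assumptions for illustration.

```python
def classify(instructions, missing_loads):
    """Split instructions into slow-lane and fast-lane lists.

    instructions:  list of (name, dest_reg, [source_regs]) in program order
    missing_loads: set of instruction names known to miss in the cache
    Returns (slow, fast, mask), where mask is the final dependence register.
    """
    mask = 0  # bit r set => logical register r depends on a long-latency load
    slow, fast = [], []
    for name, dest, sources in instructions:
        dependent = name in missing_loads or any((mask >> s) & 1 for s in sources)
        if dependent:
            slow.append(name)
            if dest is not None:
                mask |= 1 << dest      # the result is late as well
        else:
            fast.append(name)
            if dest is not None:
                mask &= ~(1 << dest)   # the register is redefined by a fast instruction
    return slow, fast, mask


program = [
    ("LD", 1, [3]),       # LD R1, 0(R3), assumed to miss
    ("DADDI", 2, [4]),    # DADDI R2, R4, #2, independent of the load
    ("DADD", 5, [1, 2]),  # hypothetical consumer of R1
]
slow, fast, mask = classify(program, {"LD"})
```

Here the load and its consumer go to the slow lane while `DADDI` stays in the regular instruction queue.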

Wakeup of instructions in SLIQ
- Every long-latency load is stored in the SLIQ along with its destination register
- Wakeup is done at a pace of four instructions per cycle
[Figure: SLIQ contents in program order: LD R1, 0(R3); DADDI R2, R4, #2; ...; New Load; ...]
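To make the four-per-cycle pace concrete, the sketch below drains a SLIQ in program order and reports which entries leave in each cycle. It deliberately ignores where the walk starts and stops relative to other outstanding loads, which the slide shows only in a figure; `wakeup_schedule` is a hypothetical helper.

```python
from collections import deque


def wakeup_schedule(sliq, width=4):
    """Return one list per cycle of the entries moved out of the SLIQ."""
    queue = deque(sliq)
    cycles = []
    while queue:
        batch = [queue.popleft() for _ in range(min(width, len(queue)))]
        cycles.append(batch)
    return cycles
```

With ten waiting instructions, re-insertion takes three cycles, so a long dependence chain behind a load adds a delay proportional to its length divided by four.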

Baseline Processor Configuration

RESULTS

Effect of delay in re-insertion
- Clearly shows that the program is highly parallel
- What about integer programs?

Number of In-flight instructions

Results

Ephemeral Registers
- Conventional scheme
- Virtual-physical registers
- Early release
- Ephemeral registers

Early Release
Early release of registers
- Needs a pending counter for each register
- When an instruction is decoded, the pending counter of each of its source registers is incremented; when the instruction is issued, those counters are decremented
- Instructions on a wrong path are nullified but still issued, in order to keep the pending counters correct
- Coupled with the CAM map table scheme of the renaming logic
- A register can be freed if it is not referenced in any map table and its pending counter is zero
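The release condition can be sketched as follows. Map tables are represented here as plain dictionaries from logical to physical register purely for illustration (the slides describe a CAM scheme), and `EarlyReleaseTracker` is a hypothetical name.

```python
class EarlyReleaseTracker:
    def __init__(self, num_physical: int):
        self.pending = [0] * num_physical  # readers decoded but not yet issued

    def decode(self, source_regs):
        for phys in source_regs:
            self.pending[phys] += 1

    def issue(self, source_regs):
        # Also called for nullified wrong-path instructions, so counts stay balanced.
        for phys in source_regs:
            self.pending[phys] -= 1

    def can_free(self, phys: int, map_tables) -> bool:
        # map_tables: the current table plus every live checkpoint.
        referenced = any(phys in table.values() for table in map_tables)
        return not referenced and self.pending[phys] == 0
```

A register therefore stays allocated while either a checkpoint could still restore its mapping or a decoded consumer has not yet read it.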

Virtual Registers
- Decouple renaming from physical register allocation
- Requires two map tables: GMT (General Map Table) and PMT (Physical Map Table)
- PMT: new table which maps virtual registers to physical registers
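One way to picture the decoupling, as a sketch under stated assumptions: the GMT is taken to map logical registers to virtual tags at rename time, and a physical register is allocated only when the result is written back. The slides do not spell out the GMT contents or the allocation point, so both are assumptions, as are all names below.

```python
class VirtualRegisterRenamer:
    def __init__(self, num_physical: int):
        self.gmt = {}  # logical register -> virtual tag (assumed GMT contents)
        self.pmt = {}  # virtual tag -> physical register
        self.free_physical = list(range(num_physical))
        self.next_virtual = 0

    def rename(self, logical: int) -> int:
        # Renaming consumes only a virtual tag, never a physical register.
        virtual = self.next_virtual
        self.next_virtual += 1
        self.gmt[logical] = virtual
        return virtual

    def writeback(self, virtual: int) -> int:
        # Physical storage is claimed only when there is a value to store.
        phys = self.free_physical.pop(0)
        self.pmt[virtual] = phys
        return phys

    def read(self, logical: int):
        # Returns None while the producer has not written back yet.
        return self.pmt.get(self.gmt[logical])
```

In this model three instructions can be renamed even with only two physical registers available, because none of them holds a physical register until it produces its result.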

Putting it together

Analysis
- How efficient are these methods for integer programs, which have:
  - Very little parallelism
  - Very poor branch prediction accuracy
  - Lengthy critical paths
- How scalable is the CAM scheme they have used for future processors with hundreds of physical registers running at very high clock speeds?
- What is the impact of these techniques on power?