Out-of-Order Commit Processor

Adrian Cristal, Daniel Ortega, Josep Llosa and Mateo Valero

Performance Limiting factors
Widening gap between memory and processor performance
Increasing wire delays

High Memory Latency
Current Solution
Multiple cache hierarchy
Large number of in-flight instructions, which needs a larger ROB, Register File, Load Store Queue and Instruction queue
[Figure: ROB holding LD R1, 0(R3), which misses in the cache, followed by DADDI R2, R4 #2 and further instructions]

Motivation

Goal of the paper
To support a large number of in-flight instructions without up-sizing the ROB and Instruction queue
Out-of-Order commit
Slow Lane Instruction Queuing

Re-Order Buffer
In-order commit, to handle precise interrupts
Controls exactly when stores can write to memory
Frees physical registers
Enables the processor to recover from branch misprediction
Keeps track of all in-flight instructions
A large number of in-flight instructions needs a huge ROB structure, which limits cycle time

Checkpointing instead of ROB

Implementation
CAM (content-addressable memory) register mapping
Inclusion of a Future Free bit, for freeing physical registers
Free List, used for choosing a free register

Checkpointing

Operation

Checkpoint
Valid bits
Future free bits
Number of (active) instructions in that checkpoint
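
The three fields listed above can be pictured as a small record. The following is a minimal sketch, not the authors' design: the class name `Checkpoint`, the helper `commit`, and the rule that a checkpoint commits once its active-instruction count reaches zero and then hands its future-free registers to the free list are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical checkpoint record holding the three fields from the slide."""
    valid: list                   # snapshot of the map table valid bits
    future_free: list             # registers to release when this checkpoint commits
    active_instructions: int = 0  # instructions of this checkpoint still in flight

    def instruction_finished(self):
        # One instruction belonging to this checkpoint has completed.
        self.active_instructions -= 1

    def can_commit(self):
        # Assumption: a checkpoint may commit once nothing in it is in flight.
        return self.active_instructions == 0


def commit(checkpoint, free_list):
    """Release every register whose future-free bit is set (assumed behaviour)."""
    if not checkpoint.can_commit():
        raise RuntimeError("checkpoint still has active instructions")
    for reg, flagged in enumerate(checkpoint.future_free):
        if flagged:
            free_list.append(reg)
            checkpoint.future_free[reg] = False
```

In this sketch no per-instruction ROB entry is needed: only the counter tells the checkpoint when its whole group of instructions is done.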

Heuristic for taking checkpoints
First branch after 64 instructions
Every 512 instructions
After 64 stores
After flushing the pipeline
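
The four rules can be written down as a small decision function. This is a sketch under assumptions: the class name `CheckpointHeuristic`, the method `observe`, and the choice to take the checkpoint just before the triggering instruction (so that instruction is counted in the new checkpoint) are illustrative and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class CheckpointHeuristic:
    """Hypothetical model of the checkpoint-taking rules listed on the slide."""
    branch_threshold: int = 64   # first branch after this many instructions
    max_instructions: int = 512  # upper bound on instructions per checkpoint
    store_threshold: int = 64    # stores allowed per checkpoint
    instructions: int = 0        # instructions since the last checkpoint
    stores: int = 0              # stores since the last checkpoint

    def observe(self, kind, after_flush=False):
        """Return True if a checkpoint is taken at this instruction."""
        take = bool(
            after_flush
            or self.instructions >= self.max_instructions
            or self.stores >= self.store_threshold
            or (kind == "branch" and self.instructions >= self.branch_threshold)
        )
        if take:
            # Assumption: counters restart with the new checkpoint.
            self.instructions = 0
            self.stores = 0
        self.instructions += 1
        if kind == "store":
            self.stores += 1
        return take
```

Calling `observe("branch")` after 64 non-branch instructions returns `True`; a second branch right after it returns `False`, because the counters have just been reset.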

Slow Lane Instruction Queuing

Slow Lane Instruction Queuing
Identify instructions that will take a long time
Put them in a secondary buffer until they are ready
An alternative paper treats these as critical instructions and puts them in the fast queue

SLIQ
Pseudo-ROB for finding long latency instructions
Slow queue to store the long latency instructions
32-bit register, one bit for each of the 32 logical registers, to keep track of dependences
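
One way to read the 32-bit dependence register is as a bit mask over the logical registers. The sketch below is an assumption-laden illustration: the class `DependenceMask`, its method names, and the exact set/clear policy (set the destination bit of a dependent instruction, clear it when an independent instruction overwrites the register) are not stated on the slide.

```python
NUM_LOGICAL_REGS = 32  # one mask bit per logical register


class DependenceMask:
    """Hypothetical 32-bit mask: bit i set means Ri depends on a long latency load."""

    def __init__(self):
        self.mask = 0

    def mark_long_latency_load(self, dest):
        # The load's destination register becomes the root of a slow chain.
        self.mask |= 1 << dest

    def classify(self, srcs, dest=None):
        """Return True if the instruction reads a register marked as slow."""
        slow = any((self.mask >> s) & 1 for s in srcs)
        if dest is not None:
            if slow:
                self.mask |= 1 << dest     # dependence propagates to the result
            else:
                self.mask &= ~(1 << dest)  # register redefined by a fast instruction
        return slow
```

With `LD R1, 0(R3)` marked as a long latency load, an instruction reading R1 is classified as slow and would go to the SLIQ, while `DADDI R2, R4 #2` reads only R4 and stays in the normal instruction queue.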

Wakening of instructions in SLIQ
Every long latency load is stored in SLIQ along with its destination register
Wakening is done at a pace of four instructions per cycle
[Figure: SLIQ contents: LD R1, 0(R3), DADDI R2, R4 #2, …, New Load, …]
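
The wakening pace can be sketched as a FIFO that is drained four entries per cycle. This is only an illustration: the class `SlowLaneQueue`, its methods, and the rule that wakening stops when it reaches the next long latency load that is still pending (suggested by the "New Load" entry in the figure) are assumptions.

```python
from collections import deque

WAKEUP_WIDTH = 4  # instructions re-inserted per cycle, as stated on the slide


class SlowLaneQueue:
    """Hypothetical FIFO model of the SLIQ."""

    def __init__(self):
        self.entries = deque()  # items are (text, is_long_latency_load)

    def insert(self, text, is_long_latency_load=False):
        self.entries.append((text, is_long_latency_load))

    def load_completed(self):
        # Assumption: the completed load sits at the head of the queue.
        if self.entries and self.entries[0][1]:
            self.entries.popleft()

    def wake_cycle(self):
        """Return the instructions moved back to the instruction queue this cycle."""
        woken = []
        while self.entries and len(woken) < WAKEUP_WIDTH:
            text, is_load = self.entries[0]
            if is_load:
                break  # a pending load blocks everything queued behind it
            woken.append(text)
            self.entries.popleft()
        return woken
```

With six dependents queued behind a completed load, the first call to `wake_cycle()` returns four of them and the second returns the remaining two.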

Baseline Processor Configuration

RESULTS

Effect of delay in re-insertion
Clearly shows that the program is highly parallel
What about integer programs?

Number of In-flight instructions

Results

Ephemeral Registers
[Figure labels: Conventional Scheme, Virtual Physical Registers, Early release, Ephemeral Registers]

Early Release of Registers
Needs a pending counter for each register
When an instruction is decoded, the pending counter of each of its source registers is incremented; when the instruction is issued, those counters are decremented
Instructions on a wrong path are nullified and still issued, in order to keep the pending counters correct
Coupled with the renaming logic
CAM map table scheme
A register can be freed if it is not referenced in any map table and its pending counter is zero
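
The release condition can be expressed with two counters per physical register. This is a minimal sketch: the class `EarlyReleaseTracker` and the `map_refs` counter (standing in for "referenced in any map table") are hypothetical names, and the CAM lookup itself is not modelled.

```python
class EarlyReleaseTracker:
    """Hypothetical bookkeeping for early release of physical registers."""

    def __init__(self, num_physical):
        self.pending = [0] * num_physical   # decoded but not yet issued readers
        self.map_refs = [0] * num_physical  # map tables still naming the register

    def decode(self, src_regs):
        # Each source register gains one pending reader at decode.
        for reg in src_regs:
            self.pending[reg] += 1

    def issue(self, src_regs):
        # Issued instructions (including nullified wrong-path ones) release their claim.
        for reg in src_regs:
            self.pending[reg] -= 1

    def can_free(self, reg):
        # Both conditions from the slide must hold.
        return self.map_refs[reg] == 0 and self.pending[reg] == 0
```

A register that has dropped out of every map table still cannot be freed while a decoded reader has not issued, which is why wrong-path instructions must still be issued to bring the counter back to zero.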

Virtual Registers
Decouple renaming from physical register allocation
Requires two map tables: GMT (General Map Table) and PMT (Physical Map Table)
PMT: new table which maps a virtual register to a physical register
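
The decoupling can be illustrated with two dictionaries. The sketch below rests on assumptions: the class `VirtualRenamer`, the idea that the GMT maps a logical register to a virtual tag, and the choice to allocate the physical register in a separate later step are illustrative; the slide only states that the PMT maps virtual registers to physical registers.

```python
class VirtualRenamer:
    """Hypothetical two-table renamer: GMT (logical -> virtual), PMT (virtual -> physical)."""

    def __init__(self, num_physical):
        self.gmt = {}                                  # logical register -> virtual tag
        self.pmt = {}                                  # virtual tag -> physical register
        self.free_physical = list(range(num_physical))
        self.next_virtual = 0

    def rename_dest(self, logical):
        # Renaming only consumes a virtual tag, not a physical register.
        tag = self.next_virtual
        self.next_virtual += 1
        self.gmt[logical] = tag
        return tag

    def allocate_physical(self, tag):
        # Assumption: a physical register is bound later, when it is really needed.
        phys = self.free_physical.pop(0)
        self.pmt[tag] = phys
        return phys

    def lookup(self, logical):
        # Returns (virtual tag, physical register or None if not yet allocated).
        tag = self.gmt[logical]
        return tag, self.pmt.get(tag)
```

In this model three destinations can be renamed even when only two physical registers exist, since storage is claimed only when `allocate_physical` is called.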

Putting it together

Analysis
How efficient are these methods for integer programs, which have very little parallelism, very poor branch prediction accuracy and a lengthy critical path?
How scalable is the CAM scheme they have used for future processors having hundreds of physical registers and running at very high clock speeds?
What is the impact of these techniques on power?
