
1 Computer Structure Out-Of-Order Execution
Lihu Rappoport and Adi Yoaz

2 What's Next
Goal: minimize CPU Time, where CPU Time = clock cycle × CPI × IC.
So far we have learned: minimize the clock cycle → add more pipe stages; minimize CPI → use pipelining; minimize IC → architecture.
In a pipelined CPU, CPI without hazards is 1 and CPI with hazards is > 1. Adding more pipe stages reduces the clock cycle but increases CPI: higher penalty due to control hazards, and more data hazards.
What can we do? Further reduce the CPI!

3 A Superscalar CPU
Duplicating HW in one pipe stage won't help: e.g., with 2 ALUs the bottleneck simply moves to the other stages.
Getting IPC > 1 requires fetching, decoding, executing, and retiring more than one instruction per clock (two parallel IF ID EXE MEM WB pipes).

4 The Pentium Processor
Fetches and decodes 2 instructions per cycle.
Before the register file is read, decide on pairing: can the two instructions be executed in parallel?
The pairing decision is based on:
Data dependencies: the 2nd instruction must be independent of the 1st.
Resources: the U-pipe and V-pipe are not symmetric (to save HW). Common instructions can execute on either pipe; some instructions can execute only on the U-pipe; the V-pipe can run a subset of the instructions that can run on the U-pipe. If the 2nd instruction requires the U-pipe, it cannot pair. Some instructions use the resources of both pipes.
[Diagram: IF → ID (pairing) → U-pipe / V-pipe]

5 Misprediction Penalty in a Superscalar CPU
MPI (misses per instruction):
MPI = (# incorrectly predicted branches) / (total # of instructions) = MPR × (# branches) / (total # of instructions)
MPI correlates better with performance, e.g., assume MPR = 5% and %branches = 20% → MPI = 1%.
Assume a peak IPC of 2 without control hazards (2 instructions per cycle) and a flush penalty of 5 cycles.
With MPI = 1% we get a flush every 100 instructions; at IPC = 2 that is a flush every 100/2 = 50 cycles.
A 5-cycle flush penalty every 50 cycles → 10% performance hit.
For IPC = 1 we would get a 5-cycle flush penalty per 100 cycles → 5% performance hit.
The flush penalty increases as the machine gets deeper and wider.
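A small sketch reproducing the slide's arithmetic (the function and variable names are illustrative):

```python
# MPI = MPR * branch fraction; overhead = fraction of cycles lost to flushes.
def flush_overhead(mpr: float, branch_frac: float, ipc: float, flush_penalty: int) -> float:
    mpi = mpr * branch_frac                  # mispredictions per instruction
    instr_per_flush = 1.0 / mpi              # e.g., 1% -> one flush per 100 instructions
    cycles_per_flush = instr_per_flush / ipc
    return flush_penalty / cycles_per_flush  # fraction of cycles spent on flushes

print(flush_overhead(0.05, 0.20, 2.0, 5))    # 0.10 -> ~10% performance hit
print(flush_overhead(0.05, 0.20, 1.0, 5))    # 0.05 -> ~5% performance hit
```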

6 Extract More ILP
ILP – Instruction Level Parallelism.
A given program, executed on given input data, has a given parallelism: we can execute only independent instructions in parallel. If, for example, each instruction depends on the previous instruction, the ILP of the program is 1, and adding more HW will not change that.
Adjacent instructions are usually dependent, so the utilization of the 2nd pipe is usually low (though there are algorithms in which both pipes are highly utilized).
Solution: Out-Of-Order Execution. Look for independent instructions further ahead in the program, and execute instructions based on data readiness, while still keeping the semantics of the original program.

7 Data Flow Analysis
Example: (1) r1 ← r4 / r7   (2) r8 ← r1 + r2   …
[Slide figure: the data-flow graph of instructions 1–6 and their in-order vs. out-of-order execution schedules, assuming divide takes multiple cycles.]
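To make the data-flow idea concrete, here is a minimal sketch that builds the true-dependence (read-after-write) graph limiting the available ILP; the third instruction below is a hypothetical independent one, not from the slide:

```python
instrs = [                      # (destination register, source registers)
    ("r1", ("r4", "r7")),       # (1) r1 <- r4 / r7
    ("r8", ("r1", "r2")),       # (2) r8 <- r1 + r2
    ("r5", ("r5", "r6")),       # (3) hypothetical independent instruction
]

def raw_deps(instrs):
    """For each instruction, return the indices of earlier producers it must wait for."""
    deps = []
    for i, (_, srcs) in enumerate(instrs):
        producers = set()
        for src in srcs:
            # The most recent earlier write to src (if any) is the producer.
            for j in range(i - 1, -1, -1):
                if instrs[j][0] == src:
                    producers.add(j)
                    break
        deps.append(producers)
    return deps

print(raw_deps(instrs))  # [set(), {0}, set()] -> (3) may execute out of order, before (1)
```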

8 OOOE – General Scheme
[Diagram: Fetch & Decode (in-order) → Instruction pool → Execute (out-of-order) → Retire/commit (in-order)]
Fetch & decode instructions in parallel but in-order, and fill the instruction pool.
Execute ready instructions from the instruction pool: all source data ready + needed execution resources available. Once an instruction is executed, signal all dependent instructions that its data is ready.
Commit instructions in parallel but in-order: state change (memory, register) and fault/exception handling.

9 Write-After-Write Dependency
(1) r1 ← r9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2

10 Write-After-Write Dependency
(1) r1 ← r9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
If inst (3) is executed before inst (1), r1 ends up with a wrong value. This is called a write-after-write false dependency: the order of two writes to the same register is flipped.

11 Write-After-Write Dependency
(1) r1 ← r9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3).
Write-After-Write (WAW) is a false dependency: not a real data dependency, but an artifact of OOO execution.

12 Speculative Execution
(1) r1 ← r9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
About 1 in 5 instructions is a branch, so we must continue fetching, decoding, and allocating instructions into the instruction pool along the predicted path. Otherwise we would limit the search for independent instructions to the window up to the next branch.
This is called OOO speculative execution: instructions from the predicted path can be executed before the jump itself is executed (before the prediction is verified to be correct), e.g., inst (6) can be executed before inst (1).
If the jump prediction eventually turns out to be wrong, we need a way to undo the instructions that executed on the speculative path.

13 Write-After-Read Dependency
(1) r1 ← r9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2

14 Write-After-Read Dependency
(1) r1 ← r9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3. This is called a write-after-read false dependency.
Write-After-Read (WAR) is a false dependency: not a real data dependency, but an artifact of OOO execution.

15 Register Renaming
Hold a pool of physical registers, and maintain a mapping of architectural registers to physical registers.
When an instruction is allocated into the instruction pool (in-order):
Allocate a free physical register from the pool, PRi. When the instruction executes, instead of writing the result to its architectural destination register Rx, it writes the result to PRi.
Replace each of the instruction's architectural source registers with the currently mapped physical register (or the architectural register if no physical register is mapped).
Update the mapping of Rx to PRi: the latest value of Rx is now stored in PRi.
Send the instruction to EXE (out-of-order) when its physical sources are ready AND the needed execution resources are available. The instruction then uses the correct sources, regardless of the execution order.
When an instruction commits (in-order):
Copy the value from its physical register to the architectural register: Rx ← PRi.
If the current mapping of the instruction's architectural destination register Rx is still the physical register allocated to the instruction (PRi), set Rx's mapping back to architectural (the most up-to-date value of Rx is now in Rx itself).
Return PRi to the free physical register pool.
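A minimal sketch of the allocation-time renaming step described above (the register names and free-list handling are illustrative, not the slides' exact mechanism):

```python
free_regs = ["pr1", "pr2", "pr3", "pr4", "pr5", "pr6", "pr7"]
rat = {}                         # architectural reg -> physical reg (absent = arch value)

def rename(dst, srcs):
    renamed_srcs = [rat.get(s, s) for s in srcs]   # sources use the current mapping
    pr = free_regs.pop(0)                          # allocate a free physical register
    rat[dst] = pr                                  # latest value of dst now lives in pr
    return pr, renamed_srcs

print(rename("r1", []))            # (1) r1 <- 17      => pr1 <- 17
print(rename("r2", ["r2", "r1"]))  # (2) r2 <- r2 + r1 => pr2 <- r2 + pr1
print(rename("r1", []))            # (3) r1 <- 23      => pr3 <- 23
```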

16 Register Renaming
(1) r1 ← 17  (2) r2 ← r2 + r1  (3) r1 ← 23  (4) r3 ← r3 + r1  (5) jcc L2  (6) L2: r1 ← 35  (7) r4 ← r3 + r1  (8) r3 ← 2
Renaming: (1) r1:pr1, pr1 ← 17   (2) r2:pr2, pr2 ← r2 + pr1
r2 is used both as a source and as a destination. This explains why the source registers are renamed first, according to the current mapping, and only then is the destination mapping of r2 updated to the newly allocated physical register pr2.
[Register mapping after (2): r1 → pr1, r2 → pr2]

17 Register Renaming
(1) r1 ← 17  (2) r2 ← r2 + r1  (3) r1 ← 23  (4) r3 ← r3 + r1  (5) jcc L2  (6) L2: r1 ← 35  (7) r4 ← r3 + r1  (8) r3 ← 2
Renaming: (1) r1:pr1, pr1 ← 17   (2) r2:pr2, pr2 ← r2 + pr1   (3) r1:pr3, pr3 ← 23   (4) r3:pr4, pr4 ← r3 + pr3   (5) jcc L2   (6) r1:pr5, pr5 ← 35   (7) r4:pr6, pr6 ← pr4 + pr5   (8) r3:pr7, pr7 ← 2
[Register mapping after (8): r1 → pr5, r2 → pr2, r3 → pr7, r4 → pr6]
Each instruction uses the correct sources, regardless of the execution order.

18 Speculative Execution – Misprediction
(1) r1 ← 17  (2) r2 ← r2 + r1  (3) r1 ← 23  (4) r3 ← r3 + r1  (5) jcc L2  (6) L2: r1 ← 35  (7) r4 ← r3 + r1  (8) r3 ← 2
Renaming: (1) r1:pr1, pr1 ← 17   (2) r2:pr2, pr2 ← r2 + pr1   (3) r1:pr3, pr3 ← 23   (4) r3:pr4, pr4 ← r3 + pr3   (5) jcc L2   (6) r1:pr5, pr5 ← 35   (7) r4:pr6, pr6 ← pr4 + pr5   (8) r3:pr7, pr7 ← 2
If the predicted branch path turns out to be wrong (when the branch is executed), the instructions following the branch are flushed before they are committed → the architectural state is not changed.
[Register mapping: r1 → pr5, r2 → pr2, r3 → pr7, r4 → pr6]

19 Speculative Execution – Misprediction
(1) r1 ← 17  (2) r2 ← r2 + r1  (3) r1 ← 23  (4) r3 ← r3 + r1  (5) jcc L2  (6) L2: r1 ← 35  (7) r4 ← r3 + r1  (8) r3 ← 2
Renaming: (1) r1:pr1, pr1 ← 17   (2) r2:pr2, pr2 ← r2 + pr1   (3) r1:pr3, pr3 ← 23   (4) r3:pr4, pr4 ← r3 + pr3   (5) jcc L2   (6) r1:pr5, pr5 ← 35   (7) r4:pr6, pr6 ← pr4 + pr5   (8) r3:pr7, pr7 ← 2
But the register mapping was already wrongly updated by the wrong-path instructions.
[Register mapping: r1 → pr5, r2 → pr2, r3 → pr7, r4 → pr6]
Later on we will see various schemes for recovering the mapping.

20 A Superscalar OOO Machine
[Diagram: in-order front-end — Fetch & Decode → Rename]
Fetch and decode instructions in parallel, but in order.
Rename: rename sources to physical registers, allocate a physical register to the destination, and update the mapping.

21 A Superscalar OOO Machine
[Diagram: Fetch & Decode → Rename → allocate into RS and ROB (in-order)]
Reservation Stations (RS): pool of instructions waiting for execution; maintains a per-instruction ready/not-ready status for the sources.
Reorder Buffer (ROB): pool of instructions waiting for retirement.
Allocate instructions in the RS and in the ROB.

22 A Superscalar OOO Machine
[Diagram: Fetch & Decode → Rename → RS/ROB → Execute (out-of-order)]
Track which instructions are ready: all of their sources are ready.
Dispatch ready instructions to the execution ports in FIFO order.
Write back the result value and mark more sources as ready; send an "exe done" indication to the ROB and reclaim the RS entry.
The RS handles only register dependencies – it does not handle memory dependencies.

23 A Superscalar OOO Machine
[Diagram: Fetch & Decode → Rename → RS/ROB → Execute → Retire (commit, in-order)]
Commit instructions in parallel but in-order.
Handle faults/exceptions.
Reclaim the ROB entry and the physical register.

24 Re-order Buffer (ROB)
Holds instructions from allocation until retirement, in the same order as in the program.
Provides a large physical register space for register renaming: one physical register per ROB entry. Each instruction has at most one destination, so the physical register number = the ROB entry number.
Buffers the execution results until retirement: Data Valid is set after the instruction has executed and its result has been written to the physical register.
[Example ROB entries — #entry, entry valid, data valid, data (physical reg), architectural dest. reg: one entry holds 12H destined for R2, another holds 33H destined for R3, and a younger entry destined for R8 is still waiting for its data.]
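A minimal sketch of one ROB entry as described above (field names and sizes are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ROBEntry:
    entry_valid: bool = False   # entry is allocated
    data_valid: bool = False    # set once execution has written the result
    data: int = 0               # result buffered in the physical register until retirement
    arch_dest: str = ""         # architectural destination register (at most one)

# Physical register number == ROB entry number, as on the slide.
rob = [ROBEntry() for _ in range(40)]
rob[37] = ROBEntry(entry_valid=True, arch_dest="R1")   # allocated, not yet executed
rob[37].data, rob[37].data_valid = 0x23, True          # execution writes 23H back
```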

25 RRF – Real Register File
Holds the Architectural Register File.
Architectural registers are numbered: 0 – R0, 1 – R1, …
The value of an architectural register is the value written to it by the last committed instruction that writes to this register.
[Example RRF entries: entry 0 (R0) = 9AH, entry 1 (R1) = F34H]

26 Reservation station (RS)
Pool of all not-yet-executed instructions. Holds each instruction's attributes and source data until it is executed.
When an instruction is allocated in the RS, its operand values are updated:
Operand comes from the RRF (architectural register) → data valid, take the value.
Operand comes from the ROB (physical register) → if the data is not yet valid, wait for the value.
[Example RS entries — entry valid, opcode, src1 data valid, src1, src2 data valid, src2, Pdst: an add with both sources valid (97H, 12H) and Pdst 37; a sub waiting for rob37, with its second source valid (33H) and Pdst 38.]

27 Reservation station (RS)
The RS maintains an operand status of ready/not-ready. Each cycle, executed instructions make more operands ready.
The RS arbitrates the WB busses between the execution units, and monitors the WB bus to capture data needed by waiting instructions. Data can be bypassed directly from the WB bus to an execution unit.
Instructions for which all operands are ready can be dispatched for execution: the dispatcher chooses which of the ready instructions to execute next, and dispatches them to the execution units.
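A minimal sketch of the RS wake-up/dispatch behavior described above (the entry layout and tags are illustrative assumptions):

```python
rs = [
    {"op": "add", "srcs": {"pr1": True,  "imm": True}, "pdst": "pr2"},
    {"op": "sub", "srcs": {"pr2": False, "pr3": True}, "pdst": "pr4"},
]

def writeback(tag):
    """Snoop the write-back bus: mark matching sources ready in every waiting entry."""
    for entry in rs:
        if tag in entry["srcs"]:
            entry["srcs"][tag] = True

def dispatch_ready():
    """Report entries whose sources are all ready (these could be dispatched)."""
    return [e["op"] for e in rs if all(e["srcs"].values())]

print(dispatch_ready())   # ['add']  - sub still waits for pr2
writeback("pr2")          # add's result appears on the WB bus
print(dispatch_ready())   # ['add', 'sub']
```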

28 Allocation and Renaming
Perform register allocation and renaming for ≤4 instructions/cycle.
The Register Alias Table (RAT) maps architectural registers to physical registers: for each architectural register it holds the number of the latest physical register that updates it. When a new instruction that writes to architectural register R is allocated, record the physical register allocated to that instruction as the latest register updating R.
The Allocator (Alloc) assigns each instruction an entry number in the ROB / RS. For each of the instruction's sources (architectural registers), look up the RAT to find the latest physical register updating it, and write it into the RS entry.
Load & Store instructions are also allocated in the MOB (discussed later on).
[Example RAT: R1 → entry 1 in the RRF, R2 → ROB entry 19, R3 → ROB entry 23]

29 Register Renaming example
Instruction from the IDQ: add R1 ← R2, R1
RAT before: R1 → RRF entry 1, R2 → ROB 19, R3 → ROB 23. RRF entry 1 (R1) holds 97H.
Renamed instruction: add ROB37 ← ROB19, RRF1
RAT after: R1 → ROB 37, R2 → ROB 19, R3 → ROB 23.
ROB: entry 19 holds 12H (dst R2, valid), entry 23 holds 33H (dst R3, valid); entry 37 is allocated for R1 with its data not valid yet.
RS: entry allocated with opcode add, src1 = 97H (valid), src2 = 12H (valid), dst = ROB 37.

30 Register Renaming example (2)
Instruction from the IDQ: sub R1 ← R3, R1
RAT before: R1 → ROB 37, R2 → ROB 19, R3 → ROB 23.
Renamed instruction: sub ROB38 ← ROB23, ROB37
RAT after: R1 → ROB 38, R2 → ROB 19, R3 → ROB 23.
ROB: entry 38 is allocated for R1 with its data not valid yet.
RS: entry allocated with opcode sub, src1 = rob37 (not valid yet), src2 = 33H (valid), dst = ROB 38.

31 Out-of-order Core: Execution Units
[Diagram: the RS dispatches to execution units on ports 0–4 — SHF, FMU, FDIV, IDIV, FAU, IEU, JEU, and the AGUs for load and store addresses — connected through the MIU to the DCU and SDB. There is an internal 0-delay bypass within each EU, a 1st-level bypass in the MIU, and a 2nd-level bypass in the RS.]

32 Execution Units Bypasses
[Diagram: same execution-unit/port layout as the previous slide.]
0-level bypass: within the port.
1st-level bypass: between EUs on different ports.
2nd-level bypass: through the RS.

33 The Bypass Network
Assume instruction Y is assigned ROB19 as its destination physical register, and instruction X uses ROB19 as a source (its other source is an immediate value).
The RS marks instruction X as ready in the following cases:
Instruction Y finished execution before instruction X was allocated: ROB19 is marked valid at instruction X's allocation time and its value is written to the RS at that time, so instruction X is ready at allocation – this is the non-bypass case.
Instruction Y is sent to the same EU that instruction X is assigned to: instruction X is marked ready the next cycle.
Instruction Y is sent to a different port than the one instruction X is assigned to: instruction X is marked ready 3 cycles later.
Y is sent to execution before X is allocated, but has not yet finished execution: when Y finishes execution and its WB value reaches the RS, X can use it.

34 In-Order Retire The Reorder Buffer (ROB)
Instructions are allocated in-order.
[Diagram: ROB entries between the retire pointer (oldest) and the alloc pointer (youngest)]

35 The ROB The Reorder Buffer (ROB) Instructions are allocated in-order
After an instruction is executed (OOO), mark it as executed → ready to retire.
[Diagram: executed entries scattered between the retire (oldest) and alloc (youngest) pointers]

36 The ROB The Reorder Buffer (ROB) Instructions are allocated in-order
After an instruction is executed (OOO), mark it as executed → ready to retire.
An executed instruction can be retired only once all prior instructions have retired: once the oldest instructions in the ROB are ready to retire, they are retired. Instructions are retired in-order.
Upon instruction retirement:
Copy the value from the physical register to the architectural register.
If the Renamer still maps the architectural register to the instruction's physical register, re-map the architectural register to the RRF.
Reclaim the ROB entry.
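A minimal sketch of in-order retirement from the ROB head (structures are illustrative):

```python
from collections import deque

rob = deque([
    {"executed": True,  "arch_dest": "R2", "value": 0x12},
    {"executed": True,  "arch_dest": "R3", "value": 0x33},
    {"executed": False, "arch_dest": "R1", "value": None},   # still in flight
])
arch_regs = {}

def retire():
    # Retire only from the head, and only while the head has already executed.
    while rob and rob[0]["executed"]:
        entry = rob.popleft()                            # reclaim the ROB entry
        arch_regs[entry["arch_dest"]] = entry["value"]   # physical reg -> arch reg

retire()
print(arch_regs)   # {'R2': 18, 'R3': 51}; R1's un-executed entry blocks younger retirement
```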

37 Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …).

38 Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …).
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault.

39 Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …).
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault: flush the ROB (and the pipeline).

40 Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …).
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault: flush the ROB (and the pipeline), initiate the fault-handling code according to the fault type, and re-fetch the faulting instruction and the subsequent instructions.

41 Interrupts Interrupts are served when the next instruction retires
Let the instructions in the current cycle retire.
[Diagram: an interrupt arrives while the ROB still holds in-flight instructions]

42 Interrupts Interrupts are served when the next instruction retires
Let the instructions in the current cycle retire, then flush the subsequent instructions.
[Diagram: the remaining ROB entries are flushed]

43 Interrupts
Interrupts are served when the next instruction retires.
Let the instructions in the current cycle retire, then flush the subsequent instructions.
Initiate the interrupt service routine, and re-fetch the subsequent instructions afterwards.

44 Pipeline: Fetch
[Pipeline: Predict/Fetch → Decode → Alloc → Schedule → EX → Retire; structures: IQ, IDQ, ROB, RS]
Fetch multiple instruction bytes from the I$ (e.g., 16 bytes per cycle).
In the case of a variable instruction-length architecture (like x86), length-decode the instructions within the fetched instruction bytes.
Write the instructions into an Instruction Queue (IQ), at a rate of up to 6 instructions per cycle.

45 Pipeline: Decode
[Pipeline: Predict/Fetch → Decode → Alloc → Schedule → EX → Retire]
Read multiple instructions from the IQ and decode them.
For x86 processors, each instruction is decoded into ≥1 μops (RISC-like instructions).
Write the resulting μops into the Instruction Decoder Queue (IDQ).

46 Pipeline: Allocate
[Pipeline: Predict/Fetch → Decode → Alloc → Schedule → EX → Retire]
Rename multiple μops.
Perform port-binding per μop: select the execution port to which the μop will be dispatched when ready.
Allocate a ROB/RS entry per μop. If the source data is available from the ROB or RRF, write the data to the RS; otherwise, mark the data as not ready in the RS.

47 Pipeline: EXE
[Pipeline: Predict/Fetch → Decode → Alloc → Schedule → EX → Retire]
Ready/Schedule: check for data-ready μops whose needed port / functional unit is available; select and dispatch multiple ready μops per clock to EXE.
Write back results into the RS/ROB: wake up μops in the RS that are waiting for the results as sources, and update their data-ready status.
Write back results into the physical registers.
Reclaim RS entries.

48 Pipeline: Retire
[Pipeline: Predict/Fetch → Decode → Alloc → Schedule → EX → Retire]
Retire the oldest μops in the ROB. A μop may retire if its "executed" bit is set, it is not marked with an exception / fault, and all preceding μops are eligible for retirement.
Commit the results from the physical register to the architectural register, and reclaim the ROB entry.
In case of an exception / fault: flush the pipeline and handle the exception / fault.

49 Jump Misprediction – Flush at Retire
When a mispredicted jump retires:
Flush the pipeline. When the jump commits, all the instructions remaining in the pipe are younger than the jump → they are from the wrong path.
Reset the renaming map, so all the registers are mapped to the architectural registers. This is OK since there are no consumers of physical registers (the pipe is flushed).
Start fetching instructions from the correct path.
Disadvantage: a very high misprediction penalty – the misprediction is already known once the jump executes. We will see ways to recover from a misprediction at execution.

50 Flush at Retire: Misp. Jump at EXE
[Pipeline diagram: Fetch → Decode → IQ → IDQ → Alloc → RS/ROB → Schedule → JEU → Retire]
The misprediction is detected when the jump is executed; mark the jump as mispredicted in the ROB.

51 Flush at Retire: Misp. Jump Retires
[Pipeline diagram: a "clear" signal flushes the whole pipeline]
When the mispredicted jump retires:
All instructions in the pipe are younger than the jump → all are from the wrong path → flush the entire pipeline.
Reset the renaming map: map all the registers to the architectural registers. This is OK since there are no consumers of physical registers (the pipe is flushed).
Start fetching instructions from the correct path.

52 Jump Misprediction – Flush at Execute
When a jump misprediction is detected (at jump execution):
Flush the in-order front-end. Instructions already in the OOO part continue to execute, including wrong-path instructions (a waste of execution resources and power).
Start fetching and decoding instructions from the correct path. Note that the "correct" path may still be wrong: an older instruction may cause an exception when it retires, and an older jump executed out-of-order can also mispredict. Block younger jumps (executed OOO) from clearing.
The correct instruction stream is stalled at the Renamer, since the RAT was wrongly updated by wrong-path instructions as well.
When the mispredicted jump retires: all instructions in the RS/ROB are from the wrong path → flush all instructions from the RS/ROB, reset the Renamer to point only to architectural registers, then un-stall Rename and allow instructions from the correct path to rename/alloc.

53 Flush at EXE: Jump Gets to EXE
[Pipeline diagram: the mispredicted jump reaches the JEU at execute]

54 Flush at EXE : Jump Gets to EXE
[Pipeline diagram: a "flush" signal clears the front-end]
Flush the front-end and re-steer it to the correct path. The Renamer state was updated by the wrong path → block further allocation. Update the BPU.
The OOO part is not flushed: it contains instructions both older and younger than the mispredicted jump. Instructions already in the OOO part continue to execute, including wrong-path instructions (a waste of execution resources and power). Block younger jumps from clearing.

55 Flush at EXE : Misp. Jump Retires
[Pipeline diagram: a "clear" signal flushes the OOO part when the jump retires]
When the mispredicted jump retires:
Flush the OOO part: only instructions following the jump are left in the machine → they must all be flushed. Reset all state in the OOO part (Renamer, RS, ROB, MOB, etc.), and reset the Renamer to point only to architectural registers.
Allow allocation of instructions from the correct path.

56 Periodic Checkpoints
Allow allocation and execution of instructions from the correct path before the mispredicted jump retires.
Every few instructions, take a checkpoint of the Renamer: a snapshot of the current renaming map (taking a snapshot on every jump would be expensive).
In case of a misprediction:
Flush the front-end and start fetching instructions from the correct path.
Selectively flush younger instructions from the ROB/RS.
Recover the Renamer to the latest checkpoint taken prior to the mispredicted jump, then recover it to its state at the jump by re-renaming the instructions from the checkpoint up to the jump.
Allow instructions from the correct path to allocate.
[Diagram: checkpoints taken periodically along the instruction stream; on a misprediction, re-rename from the latest checkpoint before the jump up to the jump]
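A minimal sketch of periodic RAT checkpoints and recovery under the scheme described above; the checkpoint interval, data structures, and the "walk" list of re-renames are illustrative assumptions:

```python
CHECKPOINT_INTERVAL = 4
rat, checkpoints, alloc_count = {}, [], 0

def allocate(dst, phys_reg):
    """Rename one destination; snapshot the RAT every CHECKPOINT_INTERVAL allocations."""
    global alloc_count
    if alloc_count % CHECKPOINT_INTERVAL == 0:
        checkpoints.append((alloc_count, dict(rat)))   # snapshot of the current mapping
    rat[dst] = phys_reg
    alloc_count += 1

def recover(jump_alloc_id, walk):
    """Restore the latest checkpoint taken before the jump, then re-rename
    (the 'history walk') the instructions between the checkpoint and the jump."""
    global rat
    _, cp_map = max((c for c in checkpoints if c[0] <= jump_alloc_id),
                    key=lambda c: c[0])
    rat = dict(cp_map)
    for dst, phys_reg in walk:      # renames of instructions after the checkpoint,
        rat[dst] = phys_reg         # up to (and including) the jump
```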

57 Checkpoints: Jump Gets to EXE
[Pipeline diagram: Predict/Fetch → Decode → Alloc → Schedule → JEU → Retire]
A clear is raised on the mispredicted jump.

58 Checkpoints: Jump Gets to EXE
[Pipeline diagram: the clear re-steers the front-end and updates the BPU]
A clear is raised on the mispredicted jump:
Flush the front-end and re-steer it to the correct path.
Flush all younger instructions in the OOO part, based on age (ROB-id) comparison.
Update the BPU.
Block further allocation until the Renamer mapping is recovered.

59 Checkpoints: Renamer Recovery
Restore the Renamer from the latest checkpoint before the jump, then recover it to its state just after the jump.
Meanwhile, the front-end starts fetching and decoding instructions from the correct path.

60 Checkpoints: Renamer Recovery
Once the Renamer has been restored, allow allocation of instructions from the correct path.

61 Jump Misprediction Recovery
[Timeline comparison of when the correct path can be fetched and allocated:
Flush at retire – fetch and alloc of the correct path start only after the jump retires.
Flush at EXE – fetch of the correct path starts right after the jump executes; alloc still waits for the jump to retire.
Flush at EXE with checkpoints – alloc of the correct path starts once the RAT is recovered from the checkpoint + history walk.]

62 Large ROB and RS are Important
A large RS increases the window in which we look for independent instructions, exposing more parallelism potential.
A large ROB (the ROB is a superset of the RS → ROB size ≥ RS size) allows covering long-latency operations (cache miss, divide).
Example: assume a Load misses the L1 cache and hits the L2 cache. The data takes ~10 cycles to return → ~30 new instructions get into the pipeline. Instructions following the Load cannot commit and pile up in the ROB, while instructions independent of the Load are executed and leave the RS. As long as the ROB is not full, we can keep fetching, renaming, and executing instructions.

63 OOO Requires Accurate Branch Predictor
An accurate branch predictor increases the effective scheduling window size: we speculate across multiple branches (a branch every 5 – 10 instructions).
[Diagram: instruction pool — older instructions (behind few branches) have high chances to commit; younger instructions (behind many branches) have low chances to commit]

64 Out Of Order Execution Summary
Look ahead in a window of instructions, and dispatch ready instructions to execution: instructions that do not depend on data from previous instructions that have not yet executed, and for which the required execution resources are available.
Advantages:
Exploits instruction-level parallelism beyond adjacent instructions.
Helps cover latencies (e.g., an L1 data cache miss, divide).
Superior/complementary to the compiler scheduler: can look for ILP beyond conditional branches, and within a given control path instructions may be independent.
Register renaming: use more registers than the number of architectural registers.
Cost: a complex micro-architecture – register renaming, a complex scheduler, misprediction recovery, and memory ordering (coming next).

65 OOO Execution of Memory Operations

66 OOO Execution of Memory Operations
The RS dispatches instructions based on register dependencies; it cannot detect memory dependencies, e.g.:
store Mem[r1+r3*2] ← r9
load r2 ← Mem[r10+r7*2]
It does not know the load/store memory addresses.
The RS dispatches load/store instructions to the Address Generation Unit (AGU) when the sources for the address calculation are ready.
The AGU calculates the virtual memory address: Base Register + (Scale × Index Register) + Displacement.
The AGU sends the address to the Memory Order Buffer (MOB); the MOB resolves memory dependencies and enforces memory ordering.
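A minimal sketch of the AGU's address computation as given by the formula above (the operand values are illustrative):

```python
def agu_address(base: int, index: int, scale: int, displacement: int) -> int:
    """Virtual address = base + scale * index + displacement."""
    return base + scale * index + displacement

# e.g., Mem[r1 + r3*2] with r1 = 0x1000 and r3 = 0x10:
print(hex(agu_address(base=0x1000, index=0x10, scale=2, displacement=0)))  # 0x1020
```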

67 Load/Store Ordering
Loads can be executed OOO with respect to other loads – not allowing this would have a very big performance impact.
Stores write their value to the data cache in-order, post retirement: stores cannot write values to the data cache speculatively, since there is no simple way to undo them. Stores to the data cache are never re-ordered among themselves – they must be seen in order by other cores and agents.
Loads need to maintain the correct order with respect to stores. The RS dispatches instructions based on register dependencies and cannot detect memory dependencies; the MOB tracks the dependencies between loads and stores.

68 Load/Store Ordering
The MOB tracks dependencies between loads and stores, and blocks a load when:
An older store has an unresolved address → block the load:
Store Mem[2000] ← 7
Store Mem[r1+r3*2] ← 8   // address not calculated yet, since one of the source registers is not ready
Load R1 ← Mem[1000]      // STOP – the unresolved store address may be 1000
An older store is to the same address, but the store's data is not ready → block the load:
Store Mem[2000] ← 7
Store Mem[1000] ← r5     // data unknown, since r5 is not ready
Load R1 ← Mem[1000]      // STOP

69 Store Implemented as 2 μops
Stores are decoded into two independent μops:
STA (store-address): calculates the address of the store.
STD (store-data): stores the data into the Store Data Buffer, which makes the data available to any Load that needs it.
The actual write to memory is done when the store retires.
Separating a Store into STA & STD improves performance: it allows the STA to dispatch earlier, even before the data is known, so address conflicts are resolved earlier → the memory pipeline is opened for other loads. STA and STD can be issued to the execution units in parallel: the STA is dispatched to the AGU when its sources (base + index) are ready, and the STD is dispatched to the SDB when its source operand is available.

70 Memory Order Buffer (MOB)
Store coloring: each Store is allocated in-order in the Store Buffer and gets an SBID; each Load is allocated in-order in the Load Buffer and gets an LBID plus the current SBID.
A Load is checked against all previous stores (stores with SBID ≤ the load's SBID). The Load is blocked if:
A relevant STA has an unresolved address.
A STA is to the same address, but its data is not ready.
There is a DCU miss.
The MOB writes the blocking info into the Load Buffer and re-dispatches the load when a wake-up signal is received. If the Load is not blocked → it is executed (bypassed).
[Example: a Store gets SBID 1; a following Load gets an LBID and is colored with SBID 1]
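A minimal sketch of the MOB's blocking check using store coloring as described above (the store-buffer layout is an illustrative assumption):

```python
store_buffer = [
    {"sbid": 1, "addr": 2000, "data_ready": True},    # Store Mem[2000] <- 7
    {"sbid": 2, "addr": None, "data_ready": False},   # STA not executed yet (unresolved)
]

def load_blocked(load_addr, load_sbid_color):
    """A load is blocked by any older store (sbid <= its color) whose address is
    unresolved, or that hits the same address while its data is not yet ready."""
    for st in store_buffer:
        if st["sbid"] > load_sbid_color:
            continue                                  # younger store: irrelevant
        if st["addr"] is None:
            return True                               # unresolved STA -> block
        if st["addr"] == load_addr and not st["data_ready"]:
            return True                               # same address, data not ready -> block
    return False

print(load_blocked(1000, load_sbid_color=2))          # True: the sbid-2 store is unresolved
```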

71 Memory Disambiguation
The MOB predicts whether a load can proceed despite unknown STAs:
Predicted colliding → block the Load if there is an unknown STA (as usual).
Predicted non-colliding → execute even if there are unknown STAs.
In case of a wrong prediction, the entire pipeline is flushed when the load retires, and the load and all succeeding instructions are re-fetched.
Store Mem[2000] ← 7
Store Mem[????] ← 8
Load R1 ← Mem[1000]
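A small sketch in the spirit of the predictor described above; the per-load saturating counter and its thresholds are an assumption for illustration, not the slides' actual mechanism:

```python
predictor = {}   # load PC -> counter (0..3); a high value means "predict non-colliding"

def predict_non_colliding(load_pc):
    return predictor.get(load_pc, 3) >= 2      # optimistic by default

def update(load_pc, collided):
    c = predictor.get(load_pc, 3)
    predictor[load_pc] = max(0, c - 2) if collided else min(3, c + 1)

# A load that actually collided is trained towards blocking on unknown STAs:
update(0x400AD0, collided=True)
print(predict_non_colliding(0x400AD0))   # False -> be conservative next time
```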

72 Store → Load Forwarding
If an older STA is to the same address and the Store's data is ready, the Load gets its data directly from the Store Data Buffer (Store → Load forwarding): it does not need to wait for the data to be written to the DCU (after the store retires).
Some conditions apply, e.g., the store contains all the data being loaded, the load is from a write-back memory type, and some conditions on the load alignment.
Store Mem[1000] ← 7
Load R1 ← Mem[1000]   // gets 7 from the SDB
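A minimal sketch of the forwarding conditions listed above (field names are illustrative; the alignment rules are deliberately omitted):

```python
def can_forward(store, load):
    """Forward only if the store's data fully covers the load and is already available."""
    if not store["data_ready"]:
        return False
    store_end = store["addr"] + store["size"]
    load_end = load["addr"] + load["size"]
    covers = store["addr"] <= load["addr"] and load_end <= store_end
    return covers and load["memtype"] == "WB"   # write-back memory type required

store = {"addr": 1000, "size": 4, "data_ready": True}
load  = {"addr": 1000, "size": 4, "memtype": "WB"}
print(can_forward(store, load))   # True: the load can take its data from the SDB
```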

73 DCU Miss
A blocking cache would severely hurt OOO: a cache miss would prevent other cache requests (which could be hits) from being served, defeating one of the main gains of OOO – hiding cache misses. The data caches in an OOO machine are therefore non-blocking.
If a Load misses in the DCU:
The DCU marks the write-back data as invalid, assigns a fill buffer to the load, and issues an L2 request.
As long as there are free fill buffers, more loads can be dispatched.
When the critical chunk returns, wake up and re-dispatch the load.
Subsequent requests for the same missed cache line are squashed and use the same fill buffer.

74 Pipeline: Load: Allocate
[Pipeline diagram: Alloc → Schedule → AGU → LB write → DTLB → DCU → WB; structures: IDQ, RS, ROB, MOB, Load Buffer]
Allocate an RS entry, a ROB entry, and a Load Buffer entry for the Load.
Assign the Store Buffer ID (SBID) to enable ordering.

75 Pipeline: Bypassed Load: EXE
[Pipeline diagram: Alloc → Schedule → AGU → LB write → DTLB → DCU → WB]
The RS dispatches the Load to the AGU when the sources for the address calculation are ready.
The AGU calculates the Load's linear address: DS-Base + base + (Scale × Index) + Disp., and writes the address to the Load Buffer.
DTLB: virtual → physical translation + DCU set access.
The MOB checks blocking and forwarding.
DCU read / Store Data Buffer read (Store → Load forwarding).
Write back the data / write the block code.

76 Pipeline: Blocked Load Re-dispatch
[Pipeline diagram: MOB re-dispatch → DTLB → DCU → WB]
The MOB determines which loads are ready and schedules one; the load arbitrates for the DTLB pipeline.
DTLB: virtual → physical translation + DCU set access.
The MOB checks blocking and forwarding.
DCU read / Store Data Buffer read.
Write back the data / write the block code.

77 Pipeline: Load: Retire
[Pipeline diagram: Retire]
Reclaim the ROB and Load Buffer entries.
Commit the result to the RRF.

78 Stores are Done in Two Phases
At EXE, once the store address and the store data are known, fill the store buffers with the linear + physical address and with the data.
Forward the Store data, via the Store Data Buffer, to load operations that need it.

79 Store Completion
After the store retires – the completion phase:
The line must be in the L1 D$ in the E or M MESI state. Otherwise, fetch it using a Read-for-Ownership request: L1 D$ → L2$ → LLC → L2$ and L1 D$ in other cores → memory.
Read the data from the store buffers and write it to the L1 D$ in the M state. This is done post-retirement to preserve the order of memory writes.
Release the store buffer entry taken by the store; this affects performance only if the store buffer becomes full.
[Diagram: platform with an on-die L3, two cores each with L1/L2, and memory]

80 Pipeline: Store: Allocate
[Pipeline diagram: Alloc → Schedule → AGU → Store Buffer; structures: IDQ, RS, ROB, DTLB, SB]
Allocate ROB/RS entries.
Allocate a Store Buffer entry.

81 Pipeline: Store: STA EXE
[Pipeline diagram: Alloc → Schedule → AGU → SB (virtual address) → DTLB → SB (physical address)]
The RS dispatches the STA to the AGU when the sources used for the address calculation are ready.
The AGU calculates the Store's linear address and writes the virtual address to the Store Buffer.
DTLB: virtual → physical translation.
Load Buffer: memory-disambiguation verification.
Write the physical address to the Store Buffer.

82 Pipeline: Store: STD EXE
[Pipeline diagram: Alloc → Schedule → SB data]
The RS dispatches the STD when its source register is ready.
Write the data to the Store Data Buffer.

83 Pipeline: Senior Store Retirement
[Pipeline diagram: ROB retire → Store Buffer (senior) → MOB → DCU]
When the Store retires, mark its Store Buffer entry as senior.
When the DCU is idle → the MOB dispatches a senior store: the senior entry is read from the Store Buffer, the Store Buffer sends the data and the physical address, and the DCU writes the data to that physical address.
Reclaim the Store Buffer entry.
