1
Computer Structure Out-Of-Order Execution
Lihu Rappoport and Adi Yoaz
2
CPU Time = clock cycle × CPI × IC
What’s Next
Goal: minimize CPU Time: CPU Time = clock cycle × CPI × IC
So far we have learned:
- Minimize clock cycle: add more pipe stages
- Minimize CPI: use a pipeline
- Minimize IC: better architecture
In a pipelined CPU:
- CPI without hazards is 1; CPI with hazards is > 1
- Adding more pipe stages reduces the clock cycle but increases CPI: higher penalty due to control hazards, and more data hazards
What can we do? Further reduce the CPI!
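The trade-off above can be made concrete with a small calculation. The numbers below (cycle times, CPIs, instruction count) are illustrative assumptions, not measurements:

```python
def cpu_time(clock_cycle_ns, cpi, instruction_count):
    """CPU Time = clock cycle x CPI x IC (result in nanoseconds)."""
    return clock_cycle_ns * cpi * instruction_count

# A deeper pipeline: shorter cycle (1.0 -> 0.8 ns) but higher CPI (1.2 -> 1.4)
base = cpu_time(1.0, 1.2, 1_000_000)
deep = cpu_time(0.8, 1.4, 1_000_000)
print(base, deep)  # the deeper pipe wins only if CPI doesn't grow too much
```

With these assumed numbers the deeper pipeline still comes out ahead; a larger CPI penalty from hazards would erase the gain.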
3
A Superscalar CPU
[Diagram: two parallel pipes, each IF–ID–EXE–MEM–WB]
Duplicating HW in only one pipe stage won’t help: e.g., with 2 ALUs the bottleneck just moves to the other stages.
Getting IPC > 1 requires fetching, decoding, executing, and retiring more than one instruction per clock.
4
The Pentium Processor
Fetches and decodes 2 instructions per cycle.
Before the register file read, decide on pairing: can the two instructions be executed in parallel?
The pairing decision is based on:
- Data dependencies: the 2nd instruction must be independent of the 1st
- Resources: the U-pipe and V-pipe are not symmetric (to save HW)
  - Common instructions can execute on either pipe
  - Some instructions can execute only on the U-pipe; if the 2nd instruction requires the U-pipe, it cannot pair
  - Some instructions use the resources of both pipes
5
Misprediction Penalty in a Superscalar CPU
MPI (misses per instruction):
MPI = #incorrectly predicted branches / total # of instructions = MPR × (#branches / total # of instructions)
MPI correlates well with performance. For example, assume MPR = 5% and branches = 20% of instructions → MPI = 1%.
Assume IPC = 2 without hazards (2 instructions per cycle) and a flush penalty of 5 cycles:
- MPI = 1% → a flush every 100 instructions
- IPC = 2 → a flush every 100/2 = 50 cycles
- 5 flush cycles every 50 cycles → 10% performance hit
For IPC = 1 we would get 5 flush cycles per 100 cycles → 5% performance hit.
The flush penalty increases as the machine gets deeper and wider.
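The arithmetic above can be checked with a short back-of-envelope model, using the slide's numbers (MPR = 5%, 20% branches, 5-cycle flush):

```python
def perf_hit(mpr, branch_frac, ipc, flush_cycles):
    """Fraction of execution time lost to branch-misprediction flushes."""
    mpi = mpr * branch_frac                 # mispredictions per instruction
    base_cycles = 100 / ipc                 # cycles to run 100 instructions
    flush_cycles_per_100 = mpi * 100 * flush_cycles
    return flush_cycles_per_100 / base_cycles

print(perf_hit(0.05, 0.20, 2, 5))  # ~0.10 -> 10% hit, as in the slide
print(perf_hit(0.05, 0.20, 1, 5))  # ~0.05 -> 5% hit
```

The model confirms the slide's point: at the same MPI, doubling the IPC doubles the relative cost of each flush.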
6
Extract More ILP ILP – Instruction Level Parallelism
A given program, executed on given input data, has a given amount of parallelism:
- Only independent instructions can execute in parallel
- If, for example, each instruction depends on the previous one, the ILP of the program is 1; adding more HW will not change that
Adjacent instructions are usually dependent:
- The utilization of the 2nd pipe is usually low
- There are algorithms in which both pipes are highly utilized
Solution: Out-Of-Order Execution
- Look for independent instructions further ahead in the program
- Execute instructions based on data readiness
- Still need to keep the semantics of the original program
7
Data Flow Analysis Example:
(1) r1 ← r4 / r7   ; assume divide takes 20 cycles
(2) r8 ← r1 + r2
(3) r5 ← r5 + 1
(4) r6 ← r6 - r3
(5) r4 ← r5 + r6
(6) r7 ← r8 * r4
[Figure: data-flow graph of the six instructions, with the in-order schedule vs. the out-of-order schedule]
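The data-flow limit of this example can be computed mechanically. This is a toy scheduler, assuming the slide's 20-cycle divide, 1 cycle for every other op, and unlimited execution units:

```python
insts = {   # inst number -> (dest, sources, latency in cycles)
    1: ('r1', ['r4', 'r7'], 20),
    2: ('r8', ['r1', 'r2'], 1),
    3: ('r5', ['r5'], 1),
    4: ('r6', ['r6', 'r3'], 1),
    5: ('r4', ['r5', 'r6'], 1),
    6: ('r7', ['r8', 'r4'], 1),
}

def finish_cycle(i, memo={}):
    """Earliest cycle inst i can complete, limited only by data flow."""
    if i not in memo:
        dest, srcs, lat = insts[i]
        # earlier instructions writing one of our sources must finish first
        deps = [j for j in range(1, i) if insts[j][0] in srcs]
        memo[i] = max((finish_cycle(j) for j in deps), default=0) + lat
    return memo[i]

print(max(finish_cycle(i) for i in insts))  # dataflow-limited time: 22 cycles
```

Instructions (3), (4), (5) execute under the shadow of the divide, so only the chain (1)→(2)→(6) matters: 20 + 1 + 1 = 22 cycles, versus 25 for strict in-order issue.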
8
OOOE – General Scheme
[Diagram: in-order Fetch & Decode → instruction pool → out-of-order Execute → in-order Retire (commit)]
- Fetch & decode instructions in parallel but in order, to fill the instruction pool
- Execute ready instructions from the instruction pool: all source data ready + needed execution resources available
- Once an instruction is executed, signal all dependent instructions that its data is ready
- Commit instructions in parallel but in order: state change (memory, register) and fault/exception handling
9
Write-After-Write Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
10
Write-After-Write Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
If inst (3) is executed before inst (1), r1 ends up having a wrong value. This is called a write-after-write false dependency.
11
Write-After-Write Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3).
Write-After-Write (WAW) is a false dependency: not a real data dependency, but an artifact of OOO execution.
12
Speculative Execution
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
Instruction (5) is a branch → continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path. This is called “speculative execution”.
13
Write-After-Read Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
14
Write-After-Read Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3. This is called a write-after-read false dependency.
Write-After-Read (WAR) is a false dependency: not a real data dependency, but an artifact of OOO execution.
15
Register Renaming Hold a pool of physical registers
Map architectural registers into physical registers:
- When an instruction is allocated into the instruction pool (still in order): allocate a free physical register from the pool; the physical register is mapped to the instruction’s architectural destination register
- When an instruction executes and writes a result: write the result value to the physical register
- When an instruction needs data from a register: read the data from the physical register allocated to the latest instruction that writes to the same arch register and precedes the current instruction; if no such instruction exists, read the architectural register’s value (from the RRF)
- When an instruction commits: copy the value from its physical register to the architectural register; if the RAT (still) maps the arch reg to the instruction’s physical register, update the RAT to map it to the architectural register in the RRF
16
Register Renaming
(1) r1 ← 17        r1→pr1:  pr1 ← 17
(2) r2 ← r2 + r1   r2→pr2:  pr2 ← r2 + pr1
(3) r1 ← 23        r1→pr3:  pr3 ← 23
(4) r3 ← r3 + r1   r3→pr4:  pr4 ← r3 + pr3
(5) jcc L2
(6) L2: r1 ← 35    r1→pr5:  pr5 ← 35
(7) r4 ← r3 + r1   r4→pr6:  pr6 ← pr4 + pr5
(8) r3 ← 2         r3→pr7:  pr7 ← 2
Final register mapping: r1→pr5, r2→pr2, r3→pr7, r4→pr6
When an instruction commits: copy its physical register into the architectural register.
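The renaming walk shown above can be expressed as a small algorithm. This is a toy model (not any specific CPU's RAT): each destination gets a fresh physical register, and sources read the current mapping:

```python
def rename(program):
    """program: list of (dest_arch_reg, [source_arch_regs])."""
    rat = {}        # arch reg -> physical reg holding its latest value
    next_pr = 1
    renamed = []
    for dest, srcs in program:
        # an unmapped source reads the architectural (committed) value
        psrcs = [rat.get(s, s) for s in srcs]
        pdst = f'pr{next_pr}'
        next_pr += 1
        rat[dest] = pdst            # dest now maps to the new physical reg
        renamed.append((pdst, psrcs))
    return renamed, rat

# The slide's sequence, minus the jcc (it writes no register): (dest, sources)
prog = [('r1', []), ('r2', ['r2', 'r1']), ('r1', []), ('r3', ['r3', 'r1']),
        ('r1', []), ('r4', ['r3', 'r1']), ('r3', [])]
renamed, rat = rename(prog)
print(rat)   # final map: r1->pr5, r2->pr2, r3->pr7, r4->pr6
```

Running it reproduces the slide's final mapping, and shows how the WAW/WAR hazards disappear: each write to r1 lands in a different physical register.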
17
Speculative Execution – Misprediction
(Same renamed sequence as in the previous slide.)
If the predicted branch path turns out to be wrong (discovered when the branch is executed):
- The instructions following the branch are flushed before they are committed → the architectural state is not changed
18
Speculative Execution – Misprediction
(Same renamed sequence as in the previous slide.)
But the register mapping was already wrongly updated by the wrong-path instructions.
Later on we will see various schemes to fix this.
19
A Superscalar OOO Machine
In-order part:
- Fetch and decode instructions in parallel but in order
- Rename: renames the sources to physical registers; allocates a physical register for the destination
20
A Superscalar OOO Machine
In-order part (continued): allocate instructions into the RS and ROB
- Reservation Stations (RS): pool of instructions waiting for execution; maintains per-instruction ready/not-ready status of the sources
- Reorder Buffer (ROB): pool of instructions waiting for retirement
21
A Superscalar OOO Machine
Out-of-order part: Execute
- Track which instructions are ready: all of their sources are ready
- Dispatch ready instructions to the execution ports, oldest-first (FIFO order)
- Write back the result value and mark more sources as ready
- Send an “exe done” indication to the ROB and reclaim the RS entry
- The RS handles only register dependencies; it does not handle memory dependencies
22
A Superscalar OOO Machine
In-order part: Retire (commit)
- Commit instructions in parallel but in order
- Handle faults/exceptions
- Reclaim the ROB entry
23
Re-order Buffer (ROB)
- Holds instructions from allocation until retirement, in the same order as in the program
- Provides a large physical register space for register renaming: one physical register per ROB entry; physical register number = entry number; each instruction has only one destination
- Buffers the execution results until retirement: the valid bit is set after the instruction executes and the result is written to the physical register
[Table: example ROB entries — entry #, data-valid bit, physical register data (e.g., 12H, 33H), architectural destination register (e.g., R2, R3, R8)]
24
RRF – Real Register File
Holds the Architectural Register File.
- Architectural registers are numbered: 0 – R0, 1 – R1, …
- The value of an architectural register is the value written to it by the last committed instruction that writes to this register
RRF:
  #entry   Data
  0 (R0)   9AH
  1 (R1)   F34H
25
Reservation station (RS)
- Pool of all “not yet executed” instructions
- Holds each instruction’s attributes and source data until it is executed
- When an instruction is allocated into the RS, the available operand values are written into its entry
RS entry (example):
  entry valid   opcode   src1 valid   src1   src2 valid   src2   Pdst
  1             add      1            97H    1            12H    37
Where an operand comes from: if the data is valid in the RRF, get the value from the architectural register; if it is valid in the ROB, get it from the physical register; otherwise, wait for the value.
26
Reservation station (RS)
- The RS maintains operand status “ready/not-ready”; each cycle, executed instructions make more operands “ready”
- The RS arbitrates the WB busses between the units
- The RS monitors the WB busses to capture data needed by waiting instructions; data can be bypassed directly from a WB bus to an execution unit
- Instructions for which all operands are ready can be dispatched for execution
- The dispatcher chooses which of the ready instructions to execute next, and dispatches the chosen instructions to the functional units
27
Allocation and Renaming
Perform register allocation and renaming for ≤4 instructions/cycle.
The Register Alias Table (RAT):
- Maps architectural registers into physical registers: for each arch reg, holds the number of the latest phy reg that updates it
- When a new instruction that writes to an arch reg R is allocated: record the phy reg allocated to the instruction as the latest reg that updates R
The Allocator (Alloc):
- Assigns each instruction an entry number in the ROB / RS
- For each source (architectural register) of the instruction: look up the RAT to find the latest phy reg updating it, and write it into the RS entry
- Allocates Load & Store buffers in the MOB
RAT example:
  Arch reg   #reg   Location
  R1         1      RRF
  R2         19     ROB
  R3         23     ROB
28
Register Renaming example
IDQ: add R1 ← R2 + R1
RAT / Alloc before: R1 → RRF entry 1, R2 → ROB19, R3 → ROB23; after renaming: R1 → ROB37
Renamed μop: add ROB37 ← ROB19 + RRF1
ROB (entry: data valid, data, dest arch reg): 19: 1, 12H, R2; 23: 1, 33H, R3; 37: allocated, data not valid yet, dest R1
RS entry: valid=1, opcode=add, src1 valid=1, src1=97H, src2 valid=1, src2=12H, dst=37
RRF: entry 1 (R1) = 97H
29
Register Renaming example (2)
IDQ: sub R1 ← R3, R1
RAT / Alloc before: R1 → ROB37, R2 → ROB19, R3 → ROB23; after renaming: R1 → ROB38
Renamed μop: sub ROB38 ← ROB23, ROB37
ROB (entry: data valid, data, dest arch reg): 19: 1, 12H, R2; 23: 1, 33H, R3; 37: not valid yet, dest R1; 38: allocated, dest R1
RS: the new entry holds opcode=sub, src1=rob37 (not ready yet), src2=33H (ready), dst=38, alongside the earlier add entry
RRF: entry 1 (R1) = 97H
30
Out-of-order Core: Execution Units
[Diagram: out-of-order core execution units — RS dispatch ports: Port 0: SHF, FMU, FDIV, IDIV, FAU, IEU; Port 1: IEU, JEU; Port 2: AGU (load address); Ports 3,4: AGU (store address), SDB; plus MIU and DCU. Bypasses: internal 0-delay bypass within each EU, 1st bypass in the MIU, 2nd bypass in the RS]
31
In-Order Retire The Reorder Buffer (ROB)
Instructions are allocated in order.
[Diagram: ROB — retire pointer at the oldest entry, alloc pointer at the youngest]
32
The ROB The Reorder Buffer (ROB) Instructions are allocated in-order
After an instruction is executed, it is marked as executed → ready to retire.
33
The ROB The Reorder Buffer (ROB) Instructions are allocated in-order
After an instruction is executed, it is marked as executed → ready to retire.
An executed instruction can be retired once all prior instructions have retired:
- Once the oldest instructions in the ROB are ready to retire, they are retired
- Instructions are retired in order
Upon instruction retirement:
- Copy the value from the phy reg to the arch reg
- If the RAT maps the arch reg to the instruction’s phy reg, update the RAT to map it to the arch reg in the RRF
- Its ROB entry is released
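The retirement rule above can be sketched as a toy model: entries leave only from the head of the ROB, and only while the head is already executed (entry names and the retirement width are illustrative):

```python
from collections import deque

def retire(rob, width=4):
    """Retire up to `width` executed entries from the head of the ROB."""
    retired = []
    while rob and rob[0]['executed'] and len(retired) < width:
        retired.append(rob.popleft()['tag'])
    return retired

rob = deque([{'tag': 'i1', 'executed': True},
             {'tag': 'i2', 'executed': True},
             {'tag': 'i3', 'executed': False},   # e.g., still waiting on a divide
             {'tag': 'i4', 'executed': True}])   # done, but must wait for i3
print(retire(rob))  # ['i1', 'i2'] -- i4 cannot retire past the unexecuted i3
```

This is exactly why a long-latency instruction makes younger, already-executed instructions pile up in the ROB.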
34
Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …)
35
Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …)
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault.
36
Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …)
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault: flush the ROB.
37
Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …)
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault:
- Flush the ROB
- Initiate the fault-handling code according to the fault type
- Re-fetch the faulting instruction and the subsequent instructions
38
Interrupts Interrupts are served when the next instruction retires
Let the instructions retiring in the current cycle retire.
39
Interrupts Interrupts are served when the next instruction retires
Let the instructions retiring in the current cycle retire.
Flush the subsequent instructions.
40
Interrupts Interrupts are served when the next instruction retires
Let the instructions retiring in the current cycle retire.
Flush the subsequent instructions.
Initiate the interrupt service code.
Then fetch the subsequent instructions.
41
Pipeline: Fetch Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
- Fetch multiple instruction bytes from the I$ (16 bytes per cycle)
- In a variable instruction-length architecture (like x86): length-decode the instructions within the fetched instruction bytes
- Write the instructions into an Instruction Queue (IQ), at up to 6 instructions per cycle
42
Pipeline: Decode Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
- Read multiple instructions from the IQ
- Decode the instructions; for x86 processors, each instruction is decoded into ≥1 μops (RISC-like instructions)
- Write the resulting μops into the Instruction Decoder Queue (IDQ)
43
Pipeline: Allocate Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
- Allocate, port-bind, and rename multiple μops
- Allocate a ROB/RS entry per μop
- If source data is available from the ROB or RRF, write the data to the RS; otherwise, mark the data not-ready in the RS
44
Pipeline: EXE Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
Ready/Schedule:
- Check for data-ready μops whose needed functional unit is available
- Select and dispatch multiple ready μops per clock to EXE
Write-back:
- Write the results back into the RS/ROB
- Wake up μops in the RS that are waiting for the results as sources; update the data-ready status of these μops in the RS
- Write the results back into the physical registers
- Reclaim RS entries
45
Pipeline: Retire Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
Retire the oldest μops in the ROB. A μop may retire if:
- Its “executed” bit is set
- It is not marked with an exception / fault
- All preceding μops are eligible for retirement
Then:
- Commit the results from the physical register to the arch register
- Reclaim the ROB entry
In case of an exception / fault: flush the pipeline and handle the exception / fault.
46
Jump Misprediction – Flush at Retire
When a mispredicted jump retires:
- Flush the pipeline: when the jump commits, all the instructions remaining in the pipe are younger than the jump → they are from the wrong path
- Reset the renaming map, so all the registers are mapped to the architectural registers; this is OK since there are no consumers of physical registers (the pipe is flushed)
- Start fetching instructions from the correct path
Disadvantage: very high misprediction penalty, since the misprediction is already known once the jump executes.
We will see ways to recover from a misprediction at execution.
47
Pipeline: Jump Gets to EXE
Fetch Decode IQ IDQ Alloc ROB RS Retire Schedule JEU
Misprediction detected when the jump is executed → do nothing (wait for the jump to retire).
48
Pipeline: Mispredicted Jump Retires
Clear Fetch Decode Alloc Schedule JEU Retire IQ IDQ RS ROB
When the mispredicted jump retires:
- All instructions in the pipe are younger than the jump → from the wrong path → flush the pipeline
- Reset the renaming map, so all the registers are mapped to the architectural registers; this is OK since there are no consumers of physical registers (the pipe is flushed)
- Start fetching instructions from the correct path
49
Jump Misprediction – Flush at Execute
When a jump misprediction is detected (at jump execution):
- Flush the in-order front-end
- Instructions already in the OOO part continue to execute, including wrong-path instructions (a waste of execution resources and power)
- Start fetching and decoding instructions from the correct path
- Note that the “correct” path may still be wrong: an older instruction may cause an exception when it retires, and an older jump executed out-of-order can also mispredict
- Block younger jumps (executed OOO) from clearing
- The correct instruction stream is stalled at the RAT: the RAT was wrongly updated by wrong-path instructions as well
When the mispredicted jump retires:
- All instructions in the RS/ROB are from the wrong path → flush all instructions from the RS/ROB
- Reset the RAT to point only to architectural registers
- Un-stall the RAT, and allow instructions from the correct path to rename/alloc
50
Pipeline: Jump Gets to EXE
Fetch Decode IQ IDQ Alloc ROB RS Retire Schedule JEU
51
Pipeline: Jump Gets to EXE
Fetch Decode IQ IDQ Alloc ROB RS Retire Schedule JEU
Flush:
- Flush the front-end and re-steer it to the correct path
- The RAT state was already updated by the wrong path → block further allocation
- Update the BPU
The OOO part is not flushed:
- Instructions already in the OOO part continue to execute, including wrong-path instructions (a waste of execution resources and power)
- Block younger jumps from clearing
52
Pipeline: Mispredicted jump Retires
Fetch Decode IQ IDQ Alloc ROB RS Retire Schedule JEU
Clear: when the mispredicted jump retires:
- Flush the OOO part: only instructions following the jump are left → they must all be flushed
- Reset all state in the OOO part (RAT, RS, ROB, MOB, etc.); reset the RAT to point only to architectural registers
- Allow allocation of instructions from the correct path
53
Periodic Checkpoints
Allow allocation and execution of instructions from the correct path before the mispredicted jump retires.
- Every few instructions, take a checkpoint of the RAT: a snapshot of the current renaming map
In case of misprediction:
- Flush the front-end and start fetching instructions from the correct path
- Selectively flush younger instructions from the ROB/RS
- Recover the RAT to the latest checkpoint taken prior to the mispredicted jump
- Recover the RAT to its state at the jump: re-rename the instructions from the checkpoint until the jump
- Allow instructions from the correct path to allocate
54
Mispredicted Jump Gets to EXE
Clear Predict/Fetch Decode Alloc Schedule JEU Retire IQ IDQ RS ROB
Clear raised on a mispredicted jump.
55
Mispredicted Jump Gets to EXE
BPU Update Clear Predict/Fetch Decode Alloc Schedule JEU Retire IQ IDQ RS ROB
Clear raised on a mispredicted jump:
- Flush the front-end and re-steer it to the correct path
- Flush all younger instructions in the OOO part
- Update the BPU
- Block further allocation
56
RAT Recovery Predict/Fetch Decode Alloc Schedule JEU Retire
IQ IDQ RS ROB
- Restore the RAT from the latest checkpoint before the jump
- Recover the RAT to its state just after the jump, before any instruction on the wrong path
- Meanwhile, the front-end starts fetching and decoding instructions from the correct path
57
RAT Recovery Predict/Fetch Decode Alloc Schedule JEU Retire
IQ IDQ RS ROB
Once done restoring the RAT, allow allocation of instructions from the correct path.
58
Large ROB and RS are Important
A large RS:
- Increases the window in which we look for independent instructions
- Exposes more parallelism potential
A large ROB:
- The ROB is a superset of the RS → ROB size ≥ RS size
- Allows covering long-latency operations (cache miss, divide)
Example: assume a Load misses the L1 cache
- The data takes ~10 cycles to return → ~30 new instructions enter the pipeline meanwhile
- Instructions following the Load cannot commit → they pile up in the ROB
- Instructions independent of the Load are executed, and leave the RS
- As long as the ROB is not full, we can keep executing instructions
59
OOO Requires Accurate Branch Predictor
An accurate branch predictor increases the effective scheduling window size:
- Speculate across multiple branches (a branch every 5 – 10 instructions)
[Diagram: instruction pool — instructions behind few branches have high chances to commit, those behind many branches have low chances]
60
Out Of Order Execution Summary
Look ahead in a window of instructions, and dispatch ready instructions to execution:
- Instructions that do not depend on data from previous, still-unexecuted instructions
- Instructions whose required execution resources are available
Advantages:
- Exploits Instruction Level Parallelism beyond adjacent instructions
- Helps cover latencies (e.g., L1 data cache miss, divide)
- Superior/complementary to a compiler scheduler: can look for ILP beyond conditional branches; within a given control path, instructions may be independent
- Register renaming: uses more registers than the number of architectural registers
Cost: a complex micro-architecture
- Register renaming, a complex scheduler, misprediction recovery
- Memory ordering (coming next)
61
OOO Execution of Memory Operations
62
OOO Execution of Memory Operations
The RS dispatches instructions based on register dependencies. The RS cannot detect memory dependencies:
  store Mem[r1+r3*2] ← r9
  load  r2 ← Mem[r10+r7*2]
The RS does not know the load/store memory addresses.
- The RS dispatches load/store instructions to the Address Generation Unit (AGU) when the sources for the address calculation are ready
- The AGU calculates the linear (virtual) memory address: Segment-Base + Base Register + (Scale × Index Register) + Displacement
- The AGU sends the linear address to the Memory Order Buffer (MOB)
- The MOB resolves memory dependencies and enforces memory ordering
63
Load and Store Ordering
x86 has a small register set → it uses memory often.
- Preventing Stores from passing Stores/Loads: 3%~5% perf. loss → P6 chooses not to allow Stores to pass Stores/Loads
- Preventing Loads from passing Loads/Stores: big perf. loss → P6 allows Loads to pass Stores, and Loads to pass Loads
Stores are not executed OOO:
- Stores are never performed speculatively: there is no transparent way to undo them
- Stores are also never re-ordered among themselves
- The Store Buffer dispatches a store only when the store has both its address and its data, and there are no older stores awaiting dispatch
- A store commits its value to memory (DCU) post-retirement
64
Store Implemented as 2 μops
A store is decoded as two independent μops:
- STA (store-address): calculates the address of the store
- STD (store-data): stores the data into the Store Data Buffer
The actual write to memory is done when the store retires.
Separating STA & STD is important for memory OOO:
- It allows the STA to dispatch earlier, even before the data is known: address conflicts are resolved earlier → opens the memory pipeline for other loads
- STA and STD can be issued to execution units in parallel: the STA is dispatched to the AGU when its sources (base + index) are ready; the STD is dispatched to the SDB when its source operand is available
65
Load/Store Ordering
The MOB tracks dependencies between loads and stores.
An older STA has an unresolved address → block the load:
  Store Mem[2000] ← 7
  Store Mem[????] ← 8
  Load  R1 ← Mem[1000]   (blocked)
An older STA to the same address, but the store’s data is not ready → block the load:
  Store Mem[2000] ← 7
  Store Mem[1000] ← ??
  Load  R1 ← Mem[1000]   (blocked)
66
Memory Order Buffer (MOB)
Store coloring:
- Each store is allocated in order in the Store Buffer, and gets an SBID
- Each load is allocated in order in the Load Buffer, and gets an LBID + the current SBID (its “color”)
A load is checked against all previous stores, i.e., stores with SBID ≤ the load’s SBID. The load is blocked if:
- A relevant STA has an unresolved address, or
- An STA is to the same address, but its data is not ready
The MOB writes the blocking info into the load buffer, and re-dispatches the load when the wake-up signal is received.
If the load is not blocked → it is executed (bypasses the older stores).
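The blocking check described above can be sketched as a toy model (not the actual MOB logic): a load is compared against older stores, youngest first, so the nearest older store to the same address decides the outcome:

```python
def check_load(load_addr, load_color, stores):
    """stores: list of (sbid, addr_or_None, data_ready). Returns the action."""
    # scan older stores from youngest to oldest (higher SBID first)
    for sbid, addr, data_ready in sorted(stores, key=lambda s: -s[0]):
        if sbid > load_color:
            continue                 # younger than the load: ignore
        if addr is None:
            return 'block: unresolved older store address'
        if addr == load_addr:
            return ('forward from store buffer' if data_ready
                    else 'block: store data not ready')
    return 'execute from cache'

stores = [(1, 2000, True), (2, None, False)]
print(check_load(1000, 2, stores))  # store 2's address unknown -> block
print(check_load(1000, 1, stores))  # only store 1 is older; no conflict
```

The same routine also covers the forwarding case on the later slide: an older resolved store to the load's address with ready data supplies the value directly.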
67
Memory Disambiguation
The MOB predicts whether a load can proceed despite unknown STAs:
- Predicted colliding → block the load if there is an unknown STA (as usual)
- Predicted non-colliding → execute even if there are unknown STAs
In case of a wrong prediction, the entire pipeline is flushed when the load retires.
  Store Mem[2000] ← 7
  Store Mem[????] ← 8
  Load  R1 ← Mem[1000]
68
Store → Load Forwarding
An older STA is to the same address, and the store’s data is ready →
Store → Load Forwarding: the load gets the data directly from the SDB; it does not need to wait for the data to be written to the DCU.
  Store Mem[1000] ← 7
  Load  R1 ← Mem[1000]   (forwarded)
69
DCU Miss Blocking caches severely hurt OOO
Blocking caches severely hurt OOO:
- A cache miss prevents other cache requests (which could be hits) from being served
- This hurts one of the main gains of OOO: hiding cache misses
Caches in an OOO machine are non-blocking. If a load misses in the DCU:
- The DCU marks the write-back data as invalid, assigns a fill buffer to the load, and issues an L2 request
- As long as there are still free fill buffers, more loads can be dispatched
- When the critical chunk returns, wake up and re-dispatch the load
- Subsequent requests for the same missed cache line are squashed: they use the same fill buffer
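The fill-buffer behavior can be illustrated with a toy model; the buffer count and the return values here are made up for illustration:

```python
class NonBlockingCache:
    def __init__(self, fill_buffers=4):
        self.free = fill_buffers
        self.pending = {}          # missed line -> loads squashed onto it

    def load(self, line):
        if line in self.pending:          # miss to an already-missing line:
            self.pending[line].append(line)   # squash onto the same fill buffer
            return 'squashed'
        if self.free == 0:                # all fill buffers busy
            return 'stall: no fill buffer'
        self.free -= 1                    # allocate a fill buffer, go to L2
        self.pending[line] = [line]
        return 'miss: fill buffer allocated'

c = NonBlockingCache(fill_buffers=1)
print(c.load(0x40))   # first miss takes the only fill buffer
print(c.load(0x40))   # same line: squashed, no new buffer needed
print(c.load(0x80))   # different line, no buffer left -> dispatch stalls
```

Even this tiny model shows the key property: misses to already-outstanding lines do not consume extra buffers, so independent loads keep flowing until the fill buffers are exhausted.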
70
Pipeline: Load: Allocate
Schedule AGU LB Write DTLB DCU WB MOB Retire IDQ RS ROB LB
- Allocate an RS entry, a ROB entry, and a Load Buffer entry for the load
- Assign a Store Buffer ID (SBID) to enable ordering
71
Pipeline: Bypassed Load: EXE
(slide diagram: pipeline stages Alloc → Schedule → AGU → LB Write → DTLB → DCU → WB → Retire, over the IDQ, RS, ROB, MOB, and LB structures)
- The RS checks when the data used for address calculation is ready, and dispatches the load
- The AGU calculates the linear address: DS-Base + base + (scale × index) + disp.
- Write the address to the Load Buffer
- DTLB: virtual → physical translation + DCU set access
- The MOB checks blocking and forwarding
- DCU read / Store Data Buffer read (Store → Load forwarding)
- Write back the data / write a block code
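The AGU's address calculation is simple enough to state directly (a sketch of the slide's formula; segment-limit checks and address-size truncation are omitted):

```python
def agu_linear_address(ds_base: int, base: int, index: int,
                       scale: int, disp: int) -> int:
    """Linear address as defined on the slide:
    DS-Base + base + (scale * index) + disp."""
    return ds_base + base + scale * index + disp

# e.g., flat segment (DS-Base = 0), base 0x1000, index 4 scaled by 8,
# displacement 0x20:
addr = agu_linear_address(0, 0x1000, 4, 8, 0x20)   # -> 0x1040
```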
72
Pipeline: Blocked Load Re-dispatch
(slide diagram: pipeline stages Alloc → Schedule → AGU → LB Write → DTLB → DCU → WB → Retire, over the IDQ, RS, ROB, MOB, and LB structures)
- The MOB determines which loads are ready, and schedules one
- The load arbitrates for the DTLB pipeline
- DTLB: virtual → physical translation + DCU set access
- The MOB checks blocking and forwarding
- DCU read / Store Data Buffer read
- Write back the data / write a block code
73
Pipeline: Load: Retire
(slide diagram: pipeline stages Alloc → Schedule → AGU → LB Write → DTLB → DCU → WB → Retire, over the IDQ, RS, ROB, MOB, and LB structures)
- Reclaim the ROB and Load Buffer entries
- Commit the result to the RRF
74
Stores are Done in Two Phases
At EXE
- Fill the store buffer with the linear + physical address, and with the data
- Once the store address and data are known, forward the store data to load operations that need it
After the store retires – completion phase
- First, the line must be in the L1 D$, in E or M MESI state
  - Otherwise, fetch it using a Read For Ownership request: L1 D$ → L2$ → LLC → L2$ and L1 D$ in other cores → memory
- Read the data from the store buffer and write it to the L1 D$ in M state
  - Done at retirement, to preserve the order of memory writes
- Release the store buffer entry taken by the store
  - Affects performance only if the store buffer becomes full (allocation stalls)
  - Loads needing the store data get it when the store is executed
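The completion phase can be sketched as follows (toy model: `rfo` stands in for the whole L2$ → LLC → other cores → memory ownership request, and 64-byte lines are an assumption):

```python
def retire_store(entry: dict, dcache_state: dict, rfo) -> int:
    """Completion phase of a retired store (sketch).
    dcache_state maps line number -> MESI state ('M','E','S','I');
    rfo(line) models a Read For Ownership that installs the line in E state."""
    line = entry["addr"] // 64               # assumed 64B cache lines
    if dcache_state.get(line) not in ("E", "M"):
        dcache_state[line] = rfo(line)       # fetch ownership before writing
    dcache_state[line] = "M"                 # write the data: line is Modified
    return entry["data"]                     # value now committed to L1 D$
```

Requiring E or M state before the write is what makes the store globally safe: no other core can hold a valid copy of the line at that point.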
75
Pipeline: Store: Allocate
(slide diagram: pipeline stages Alloc → Schedule → AGU → DTLB → SB → Retire, over the IDQ, RS, ROB, and SB structures)
- Allocate ROB/RS entries
- Allocate a Store Buffer entry
76
Pipeline: Store: STA EXE
(slide diagram: pipeline stages Alloc → Schedule → AGU → SB V.A. → DTLB → SB P.A. → Retire, over the IDQ, RS, ROB, and SB structures)
- The RS checks when the data used for address calculation is ready, and dispatches the STA to the AGU
- The AGU calculates the linear address
- Write the linear address to the Store Buffer
- DTLB: virtual → physical translation
- Load Buffer memory-disambiguation verification
- Write the physical address to the Store Buffer
77
Pipeline: Store: STD EXE
(slide diagram: pipeline stages Alloc → Schedule → SB data → Retire, over the IDQ, RS, and ROB structures)
- The RS checks when the data for the STD is ready, and dispatches the STD
- Write the data to the Store Data Buffer
78
Pipeline: Senior Store Retirement
(slide diagram: pipeline stages Alloc → Schedule → Retire, over the IDQ, RS, ROB, SB, MOB, and DCU structures)
- When the STA (and thus the STD) retires, the Store Buffer entry is marked as senior
- When the DCU is idle, the MOB dispatches a senior store
  - Read the senior store entry from the Store Buffer
  - The Store Buffer sends the data and the physical address
  - The DCU writes the data to the specified physical address
  - Reclaim the Store Buffer entry