1
Computer Structure Out-Of-Order Execution
Lihu Rappoport and Adi Yoaz
2
CPU Time = clock cycle × CPI × IC
What’s Next
Goal: minimize CPU Time: CPU Time = clock cycle × CPI × IC
So far we have learned:
- Minimize clock cycle: add more pipe stages
- Minimize CPI: use a pipeline
- Minimize IC: better architecture
In a pipelined CPU:
- CPI without hazards is 1; CPI with hazards is > 1
- Adding more pipe stages reduces the clock cycle but increases CPI: higher penalty due to control hazards, and more data hazards
What can we do? Further reduce the CPI!
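The trade-off above can be made concrete with a small calculation. The numbers below (cycle times, CPIs, instruction count) are illustrative assumptions, not measurements:

```python
def cpu_time(clock_cycle_ns, cpi, instruction_count):
    """CPU Time = clock cycle x CPI x IC (result in nanoseconds)."""
    return clock_cycle_ns * cpi * instruction_count

# A deeper pipeline: shorter cycle (1.0 -> 0.8 ns) but higher CPI (1.2 -> 1.4)
base = cpu_time(1.0, 1.2, 1_000_000)
deep = cpu_time(0.8, 1.4, 1_000_000)
print(base, deep)  # the deeper pipe wins only if CPI doesn't grow too much
```

With these assumed numbers the deeper pipeline still comes out ahead; a larger CPI penalty from hazards would erase the gain.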
3
A Superscalar CPU
[Diagram: two parallel pipes, each IF–ID–EXE–MEM–WB]
Duplicating HW in only one pipe stage won’t help: e.g., with 2 ALUs the bottleneck just moves to the other stages.
Getting IPC > 1 requires fetching, decoding, executing, and retiring more than one instruction per clock.
4
The Pentium Processor
Fetches and decodes 2 instructions per cycle.
Before the register file read, decide on pairing: can the two instructions be executed in parallel?
The pairing decision is based on:
- Data dependencies: the 2nd instruction must be independent of the 1st
- Resources: the U-pipe and V-pipe are not symmetric (to save HW)
  - Common instructions can execute on either pipe
  - Some instructions can execute only on the U-pipe; if the 2nd instruction requires the U-pipe, it cannot pair
  - Some instructions use the resources of both pipes
5
Misprediction Penalty in a Superscalar CPU
MPI (misses per instruction):
MPI = #incorrectly predicted branches / total # of instructions = MPR × (#branches / total # of instructions)
MPI correlates well with performance. For example, assume MPR = 5% and branches = 20% of instructions → MPI = 1%.
Assume IPC = 2 without hazards (2 instructions per cycle) and a flush penalty of 5 cycles:
- MPI = 1% → a flush every 100 instructions
- IPC = 2 → a flush every 100/2 = 50 cycles
- 5 flush cycles every 50 cycles → 10% performance hit
For IPC = 1 we would get 5 flush cycles per 100 cycles → 5% performance hit.
The flush penalty increases as the machine gets deeper and wider.
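The arithmetic above can be checked with a short back-of-envelope model, using the slide's numbers (MPR = 5%, 20% branches, 5-cycle flush):

```python
def perf_hit(mpr, branch_frac, ipc, flush_cycles):
    """Fraction of execution time lost to branch-misprediction flushes."""
    mpi = mpr * branch_frac                 # mispredictions per instruction
    base_cycles = 100 / ipc                 # cycles to run 100 instructions
    flush_cycles_per_100 = mpi * 100 * flush_cycles
    return flush_cycles_per_100 / base_cycles

print(perf_hit(0.05, 0.20, 2, 5))  # ~0.10 -> 10% hit, as in the slide
print(perf_hit(0.05, 0.20, 1, 5))  # ~0.05 -> 5% hit
```

The model confirms the slide's point: at the same MPI, doubling the IPC doubles the relative cost of each flush.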
6
Extract More ILP ILP – Instruction Level Parallelism
A given program, executed on given input data, has a given amount of parallelism:
- Only independent instructions can execute in parallel
- If, for example, each instruction depends on the previous one, the ILP of the program is 1; adding more HW will not change that
Adjacent instructions are usually dependent:
- The utilization of the 2nd pipe is usually low
- There are algorithms in which both pipes are highly utilized
Solution: Out-Of-Order Execution
- Look for independent instructions further ahead in the program
- Execute instructions based on data readiness
- Still need to keep the semantics of the original program
7
Data Flow Analysis Example:
(1) r1 ← r4 / r7   ; assume divide takes 20 cycles
(2) r8 ← r1 + r2
(3) r5 ← r5 + 1
(4) r6 ← r6 - r3
(5) r4 ← r5 + r6
(6) r7 ← r8 * r4
[Figure: data-flow graph of the six instructions, with the in-order schedule vs. the out-of-order schedule]
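The data-flow limit of this example can be computed mechanically. This is a toy scheduler, assuming the slide's 20-cycle divide, 1 cycle for every other op, and unlimited execution units:

```python
insts = {   # inst number -> (dest, sources, latency in cycles)
    1: ('r1', ['r4', 'r7'], 20),
    2: ('r8', ['r1', 'r2'], 1),
    3: ('r5', ['r5'], 1),
    4: ('r6', ['r6', 'r3'], 1),
    5: ('r4', ['r5', 'r6'], 1),
    6: ('r7', ['r8', 'r4'], 1),
}

def finish_cycle(i, memo={}):
    """Earliest cycle inst i can complete, limited only by data flow."""
    if i not in memo:
        dest, srcs, lat = insts[i]
        # earlier instructions writing one of our sources must finish first
        deps = [j for j in range(1, i) if insts[j][0] in srcs]
        memo[i] = max((finish_cycle(j) for j in deps), default=0) + lat
    return memo[i]

print(max(finish_cycle(i) for i in insts))  # dataflow-limited time: 22 cycles
```

Instructions (3), (4), (5) execute under the shadow of the divide, so only the chain (1)→(2)→(6) matters: 20 + 1 + 1 = 22 cycles, versus 25 for strict in-order issue.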
8
OOOE – General Scheme
[Diagram: in-order Fetch & Decode → instruction pool → out-of-order Execute → in-order Retire (commit)]
- Fetch & decode instructions in parallel but in order, to fill the instruction pool
- Execute ready instructions from the instruction pool: all source data ready + needed execution resources available
- Once an instruction is executed, signal all dependent instructions that its data is ready
- Commit instructions in parallel but in order: state change (memory, register) and fault/exception handling
9
Write-After-Write Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
10
Write-After-Write Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
If inst (3) is executed before inst (1), r1 ends up having a wrong value. This is called a write-after-write false dependency.
11
Write-After-Write Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3).
Write-After-Write (WAW) is a false dependency: not a real data dependency, but an artifact of OOO execution.
12
Speculative Execution
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
Instruction (5) is a branch → continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path. This is called “speculative execution”.
13
Write-After-Read Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
14
Write-After-Read Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3. This is called a write-after-read false dependency.
Write-After-Read (WAR) is a false dependency: not a real data dependency, but an artifact of OOO execution.
15
Register Renaming Hold a pool of physical registers
Map architectural registers into physical registers:
- When an instruction is allocated into the instruction pool (still in order): allocate a free physical register from the pool; the physical register is mapped to the instruction’s architectural destination register
- When an instruction executes and writes a result: write the result value to the physical register
- When an instruction needs data from a register: read the data from the physical register allocated to the latest instruction that writes to the same arch register and precedes the current instruction; if no such instruction exists, read the architectural register’s value (from the RRF)
- When an instruction commits: copy the value from its physical register to the architectural register; if the RAT (still) maps the arch reg to the instruction’s physical register, update the RAT to map it to the architectural register in the RRF
16
Register Renaming
(1) r1 ← 17        r1→pr1:  pr1 ← 17
(2) r2 ← r2 + r1   r2→pr2:  pr2 ← r2 + pr1
(3) r1 ← 23        r1→pr3:  pr3 ← 23
(4) r3 ← r3 + r1   r3→pr4:  pr4 ← r3 + pr3
(5) jcc L2
(6) L2: r1 ← 35    r1→pr5:  pr5 ← 35
(7) r4 ← r3 + r1   r4→pr6:  pr6 ← pr4 + pr5
(8) r3 ← 2         r3→pr7:  pr7 ← 2
Final register mapping: r1→pr5, r2→pr2, r3→pr7, r4→pr6
When an instruction commits: copy its physical register into the architectural register.
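The renaming walk shown above can be expressed as a small algorithm. This is a toy model (not any specific CPU's RAT): each destination gets a fresh physical register, and sources read the current mapping:

```python
def rename(program):
    """program: list of (dest_arch_reg, [source_arch_regs])."""
    rat = {}        # arch reg -> physical reg holding its latest value
    next_pr = 1
    renamed = []
    for dest, srcs in program:
        # an unmapped source reads the architectural (committed) value
        psrcs = [rat.get(s, s) for s in srcs]
        pdst = f'pr{next_pr}'
        next_pr += 1
        rat[dest] = pdst            # dest now maps to the new physical reg
        renamed.append((pdst, psrcs))
    return renamed, rat

# The slide's sequence, minus the jcc (it writes no register): (dest, sources)
prog = [('r1', []), ('r2', ['r2', 'r1']), ('r1', []), ('r3', ['r3', 'r1']),
        ('r1', []), ('r4', ['r3', 'r1']), ('r3', [])]
renamed, rat = rename(prog)
print(rat)   # final map: r1->pr5, r2->pr2, r3->pr7, r4->pr6
```

Running it reproduces the slide's final mapping, and shows how the WAW/WAR hazards disappear: each write to r1 lands in a different physical register.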
17
Speculative Execution – Misprediction
(Same renamed sequence as in the previous slide.)
If the predicted branch path turns out to be wrong (discovered when the branch is executed):
- The instructions following the branch are flushed before they are committed → the architectural state is not changed
18
Speculative Execution – Misprediction
(Same renamed sequence as in the previous slide.)
But the register mapping was already wrongly updated by the wrong-path instructions.
Later on we will see various schemes to fix this.
19
A Superscalar OOO Machine
In-order part:
- Fetch and decode instructions in parallel but in order
- Rename: renames the sources to physical registers; allocates a physical register for the destination
20
A Superscalar OOO Machine
In-order part (continued): allocate instructions into the RS and ROB
- Reservation Stations (RS): pool of instructions waiting for execution; maintains per-instruction ready/not-ready status of the sources
- Reorder Buffer (ROB): pool of instructions waiting for retirement
21
A Superscalar OOO Machine
Out-of-order part: Execute
- Track which instructions are ready: all of their sources are ready
- Dispatch ready instructions to the execution ports, oldest-first (FIFO order)
- Write back the result value and mark more sources as ready
- Send an “exe done” indication to the ROB and reclaim the RS entry
- The RS handles only register dependencies; it does not handle memory dependencies
22
A Superscalar OOO Machine
In-order part: Retire (commit)
- Commit instructions in parallel but in order
- Handle faults/exceptions
- Reclaim the ROB entry
23
Re-order Buffer (ROB)
- Holds instructions from allocation until retirement, in the same order as in the program
- Provides a large physical register space for register renaming: one physical register per ROB entry; physical register number = entry number; each instruction has only one destination
- Buffers the execution results until retirement: the valid bit is set after the instruction executes and the result is written to the physical register
[Table: example ROB entries — entry #, data-valid bit, physical register data (e.g., 12H, 33H), architectural destination register (e.g., R2, R3, R8)]
24
RRF – Real Register File
Holds the Architectural Register File.
- Architectural registers are numbered: 0 – R0, 1 – R1, …
- The value of an architectural register is the value written to it by the last committed instruction that writes to this register
RRF:
  #entry   Data
  0 (R0)   9AH
  1 (R1)   F34H
25
Reservation station (RS)
- Pool of all “not yet executed” instructions
- Holds each instruction’s attributes and source data until it is executed
- When an instruction is allocated into the RS, the available operand values are written into its entry
RS entry (example):
  entry valid   opcode   src1 valid   src1   src2 valid   src2   Pdst
  1             add      1            97H    1            12H    37
Where an operand comes from: if the data is valid in the RRF, get the value from the architectural register; if it is valid in the ROB, get it from the physical register; otherwise, wait for the value.
26
Reservation station (RS)
- The RS maintains operand status “ready/not-ready”; each cycle, executed instructions make more operands “ready”
- The RS arbitrates the WB busses between the units
- The RS monitors the WB busses to capture data needed by waiting instructions; data can be bypassed directly from a WB bus to an execution unit
- Instructions for which all operands are ready can be dispatched for execution
- The dispatcher chooses which of the ready instructions to execute next, and dispatches the chosen instructions to the functional units
27
Allocation and Renaming
Perform register allocation and renaming for ≤4 instructions/cycle.
The Register Alias Table (RAT):
- Maps architectural registers into physical registers: for each arch reg, holds the number of the latest phy reg that updates it
- When a new instruction that writes to an arch reg R is allocated: record the phy reg allocated to the instruction as the latest reg that updates R
The Allocator (Alloc):
- Assigns each instruction an entry number in the ROB / RS
- For each source (architectural register) of the instruction: look up the RAT to find the latest phy reg updating it, and write it into the RS entry
- Allocates Load & Store buffers in the MOB
RAT example:
  Arch reg   #reg   Location
  R1         1      RRF
  R2         19     ROB
  R3         23     ROB
28
Register Renaming example
IDQ: add R1 ← R2 + R1
RAT / Alloc before: R1 → RRF entry 1, R2 → ROB19, R3 → ROB23; after renaming: R1 → ROB37
Renamed μop: add ROB37 ← ROB19 + RRF1
ROB (entry: data valid, data, dest arch reg): 19: 1, 12H, R2; 23: 1, 33H, R3; 37: allocated, data not valid yet, dest R1
RS entry: valid=1, opcode=add, src1 valid=1, src1=97H, src2 valid=1, src2=12H, dst=37
RRF: entry 1 (R1) = 97H
29
Register Renaming example (2)
IDQ: sub R1 ← R3, R1
RAT / Alloc before: R1 → ROB37, R2 → ROB19, R3 → ROB23; after renaming: R1 → ROB38
Renamed μop: sub ROB38 ← ROB23, ROB37
ROB (entry: data valid, data, dest arch reg): 19: 1, 12H, R2; 23: 1, 33H, R3; 37: not valid yet, dest R1; 38: allocated, dest R1
RS: the new entry holds opcode=sub, src1=rob37 (not ready yet), src2=33H (ready), dst=38, alongside the earlier add entry
RRF: entry 1 (R1) = 97H
30
Out-of-order Core: Execution Units
[Diagram: out-of-order core execution units — RS dispatch ports: Port 0: SHF, FMU, FDIV, IDIV, FAU, IEU; Port 1: IEU, JEU; Port 2: AGU (load address); Ports 3,4: AGU (store address), SDB; plus MIU and DCU. Bypasses: internal 0-delay bypass within each EU, 1st bypass in the MIU, 2nd bypass in the RS]
31
In-Order Retire The Reorder Buffer (ROB)
Instructions are allocated in order.
[Diagram: ROB — retire pointer at the oldest entry, alloc pointer at the youngest]
32
The ROB The Reorder Buffer (ROB) Instructions are allocated in-order
After an instruction is executed, it is marked as executed → ready to retire.
33
The ROB The Reorder Buffer (ROB) Instructions are allocated in-order
After an instruction is executed, it is marked as executed → ready to retire.
An executed instruction can be retired once all prior instructions have retired:
- Once the oldest instructions in the ROB are ready to retire, they are retired
- Instructions are retired in order
Upon instruction retirement:
- Copy the value from the phy reg to the arch reg
- If the RAT maps the arch reg to the instruction’s phy reg, update the RAT to map it to the arch reg in the RRF
- Its ROB entry is released
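The retirement rule above can be sketched as a toy model: entries leave only from the head of the ROB, and only while the head is already executed (entry names and the retirement width are illustrative):

```python
from collections import deque

def retire(rob, width=4):
    """Retire up to `width` executed entries from the head of the ROB."""
    retired = []
    while rob and rob[0]['executed'] and len(retired) < width:
        retired.append(rob.popleft()['tag'])
    return retired

rob = deque([{'tag': 'i1', 'executed': True},
             {'tag': 'i2', 'executed': True},
             {'tag': 'i3', 'executed': False},   # e.g., still waiting on a divide
             {'tag': 'i4', 'executed': True}])   # done, but must wait for i3
print(retire(rob))  # ['i1', 'i2'] -- i4 cannot retire past the unexecuted i3
```

This is exactly why a long-latency instruction makes younger, already-executed instructions pile up in the ROB.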
34
Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …)
35
Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …)
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault.
36
Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …)
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault: flush the ROB.
37
Faults And Exceptions
Faults and exceptions are served in program order.
At EXE: mark an instruction that takes a fault/exception (divide by 0, page fault, …)
Instructions older than the faulting instruction are retired.
When the faulting instruction retires – handle the fault:
- Flush the ROB
- Initiate the fault-handling code according to the fault type
- Re-fetch the faulting instruction and the subsequent instructions
38
Interrupts Interrupts are served when the next instruction retires
Let the instructions retiring in the current cycle retire.
39
Interrupts Interrupts are served when the next instruction retires
Let the instructions retiring in the current cycle retire.
Flush the subsequent instructions.
40
Interrupts Interrupts are served when the next instruction retires
Let the instructions retiring in the current cycle retire.
Flush the subsequent instructions.
Initiate the interrupt service code.
Then fetch the subsequent instructions.
41
Pipeline: Fetch Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
- Fetch multiple instruction bytes from the I$ (16 bytes per cycle)
- In a variable instruction-length architecture (like x86): length-decode the instructions within the fetched instruction bytes
- Write the instructions into an Instruction Queue (IQ), at up to 6 instructions per cycle
42
Pipeline: Decode Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
- Read multiple instructions from the IQ
- Decode the instructions; for x86 processors, each instruction is decoded into ≥1 μops (RISC-like instructions)
- Write the resulting μops into the Instruction Decoder Queue (IDQ)
43
Pipeline: Allocate Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
- Allocate, port-bind, and rename multiple μops
- Allocate a ROB/RS entry per μop
- If source data is available from the ROB or RRF, write the data to the RS; otherwise, mark the data not-ready in the RS
44
Pipeline: EXE Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
Ready/Schedule:
- Check for data-ready μops whose needed functional unit is available
- Select and dispatch multiple ready μops per clock to EXE
Write-back:
- Write the results back into the RS/ROB
- Wake up μops in the RS that are waiting for the results as sources; update the data-ready status of these μops in the RS
- Write the results back into the physical registers
- Reclaim RS entries
45
Pipeline: Retire Predict/Fetch Decode Alloc Schedule EX Retire
IQ → IDQ → Alloc → ROB/RS → Schedule → EX → Retire
Retire the oldest μops in the ROB. A μop may retire if:
- Its “executed” bit is set
- It is not marked with an exception / fault
- All preceding μops are eligible for retirement
Then:
- Commit the results from the physical register to the arch register
- Reclaim the ROB entry
In case of an exception / fault: flush the pipeline and handle the exception / fault.
46
Jump Misprediction – Flush at Retire
When a mispredicted jump retires:
- Flush the pipeline: when the jump commits, all the instructions remaining in the pipe are younger than the jump → they are from the wrong path
- Reset the renaming map, so all the registers are mapped to the architectural registers; this is OK since there are no consumers of physical registers (the pipe is flushed)
- Start fetching instructions from the correct path
Disadvantage: very high misprediction penalty, since the misprediction is already known once the jump executes.
We will see ways to recover from a misprediction at execution.
47
Pipeline: Jump Gets to EXE
Fetch Decode IQ IDQ Alloc ROB RS Retire Schedule JEU
Misprediction detected when the jump is executed → do nothing (wait for the jump to retire).
48
Pipeline: Mispredicted Jump Retires
Clear Fetch Decode Alloc Schedule JEU Retire IQ IDQ RS ROB
When the mispredicted jump retires:
- All instructions in the pipe are younger than the jump → from the wrong path → flush the pipeline
- Reset the renaming map, so all the registers are mapped to the architectural registers; this is OK since there are no consumers of physical registers (the pipe is flushed)
- Start fetching instructions from the correct path
49
Jump Misprediction – Flush at Execute
When a jump misprediction is detected (at jump execution):
- Flush the in-order front-end
- Instructions already in the OOO part continue to execute, including wrong-path instructions (a waste of execution resources and power)
- Start fetching and decoding instructions from the correct path
- Note that the “correct” path may still be wrong: an older instruction may cause an exception when it retires, and an older jump executed out-of-order can also mispredict
- Block younger jumps (executed OOO) from clearing
- The correct instruction stream is stalled at the RAT: the RAT was wrongly updated by wrong-path instructions as well
When the mispredicted jump retires:
- All instructions in the RS/ROB are from the wrong path → flush all instructions from the RS/ROB
- Reset the RAT to point only to architectural registers
- Un-stall the RAT, and allow instructions from the correct path to rename/alloc
50
Pipeline: Jump Gets to EXE
Fetch Decode IQ IDQ Alloc ROB RS Retire Schedule JEU
51
Pipeline: Jump Gets to EXE
Fetch Decode IQ IDQ Alloc ROB RS Retire Schedule JEU
Flush:
- Flush the front-end and re-steer it to the correct path
- The RAT state was already updated by the wrong path → block further allocation
- Update the BPU
The OOO part is not flushed:
- Instructions already in the OOO part continue to execute, including wrong-path instructions (a waste of execution resources and power)
- Block younger jumps from clearing
52
Pipeline: Mispredicted jump Retires
Fetch Decode IQ IDQ Alloc ROB RS Retire Schedule JEU
Clear: when the mispredicted jump retires:
- Flush the OOO part: only instructions following the jump are left → they must all be flushed
- Reset all state in the OOO part (RAT, RS, ROB, MOB, etc.); reset the RAT to point only to architectural registers
- Allow allocation of instructions from the correct path
53
Periodic Checkpoints
Allow allocation and execution of instructions from the correct path before the mispredicted jump retires.
- Every few instructions, take a checkpoint of the RAT: a snapshot of the current renaming map
In case of misprediction:
- Flush the front-end and start fetching instructions from the correct path
- Selectively flush younger instructions from the ROB/RS
- Recover the RAT to the latest checkpoint taken prior to the mispredicted jump
- Recover the RAT to its state at the jump: re-rename the instructions from the checkpoint until the jump
- Allow instructions from the correct path to allocate
54
Mispredicted Jump Gets to EXE
Clear Predict/Fetch Decode Alloc Schedule JEU Retire IQ IDQ RS ROB
Clear raised on a mispredicted jump.
55
Mispredicted Jump Gets to EXE
BPU Update Clear Predict/Fetch Decode Alloc Schedule JEU Retire IQ IDQ RS ROB
Clear raised on a mispredicted jump:
- Flush the front-end and re-steer it to the correct path
- Flush all younger instructions in the OOO part
- Update the BPU
- Block further allocation
56
RAT Recovery Predict/Fetch Decode Alloc Schedule JEU Retire
IQ IDQ RS ROB
- Restore the RAT from the latest checkpoint before the jump
- Recover the RAT to its state just after the jump, before any instruction on the wrong path
- Meanwhile, the front-end starts fetching and decoding instructions from the correct path
57
RAT Recovery Predict/Fetch Decode Alloc Schedule JEU Retire
IQ IDQ RS ROB
Once done restoring the RAT, allow allocation of instructions from the correct path.
58
Large ROB and RS are Important
A large RS:
- Increases the window in which we look for independent instructions
- Exposes more parallelism potential
A large ROB:
- The ROB is a superset of the RS → ROB size ≥ RS size
- Allows covering long-latency operations (cache miss, divide)
Example: assume a Load misses the L1 cache
- The data takes ~10 cycles to return → ~30 new instructions enter the pipeline meanwhile
- Instructions following the Load cannot commit → they pile up in the ROB
- Instructions independent of the Load are executed, and leave the RS
- As long as the ROB is not full, we can keep executing instructions
59
OOO Requires Accurate Branch Predictor
An accurate branch predictor increases the effective scheduling window size:
- Speculate across multiple branches (a branch every 5 – 10 instructions)
[Diagram: instruction pool — instructions behind few branches have high chances to commit, those behind many branches have low chances]
60
Out Of Order Execution Summary
Look ahead in a window of instructions, and dispatch ready instructions to execution:
- Instructions that do not depend on data from previous, still-unexecuted instructions
- Instructions whose required execution resources are available
Advantages:
- Exploits Instruction Level Parallelism beyond adjacent instructions
- Helps cover latencies (e.g., L1 data cache miss, divide)
- Superior/complementary to a compiler scheduler: can look for ILP beyond conditional branches; within a given control path, instructions may be independent
- Register renaming: uses more registers than the number of architectural registers
Cost: a complex micro-architecture
- Register renaming, a complex scheduler, misprediction recovery
- Memory ordering (coming next)
61
OOO Execution of Memory Operations
62
OOO Execution of Memory Operations
The RS dispatches instructions based on register dependencies. The RS cannot detect memory dependencies:
  store Mem[r1+r3*2] ← r9
  load  r2 ← Mem[r10+r7*2]
The RS does not know the load/store memory addresses.
- The RS dispatches load/store instructions to the Address Generation Unit (AGU) when the sources for the address calculation are ready
- The AGU calculates the linear (virtual) memory address: Segment-Base + Base Register + (Scale × Index Register) + Displacement
- The AGU sends the linear address to the Memory Order Buffer (MOB)
- The MOB resolves memory dependencies and enforces memory ordering
63
Load and Store Ordering
x86 has a small register set → it uses memory often.
- Preventing Stores from passing Stores/Loads: 3%~5% perf. loss → P6 chooses not to allow Stores to pass Stores/Loads
- Preventing Loads from passing Loads/Stores: big perf. loss → P6 allows Loads to pass Stores, and Loads to pass Loads
Stores are not executed OOO:
- Stores are never performed speculatively: there is no transparent way to undo them
- Stores are also never re-ordered among themselves
- The Store Buffer dispatches a store only when the store has both its address and its data, and there are no older stores awaiting dispatch
- A store commits its value to memory (DCU) post-retirement
64
Store Implemented as 2 μops
A store is decoded as two independent μops:
- STA (store-address): calculates the address of the store
- STD (store-data): stores the data into the Store Data Buffer
The actual write to memory is done when the store retires.
Separating STA & STD is important for memory OOO:
- It allows the STA to dispatch earlier, even before the data is known: address conflicts are resolved earlier → opens the memory pipeline for other loads
- STA and STD can be issued to execution units in parallel: the STA is dispatched to the AGU when its sources (base + index) are ready; the STD is dispatched to the SDB when its source operand is available
65
Load/Store Ordering
The MOB tracks dependencies between loads and stores.
An older STA has an unresolved address → block the load:
  Store Mem[2000] ← 7
  Store Mem[????] ← 8
  Load  R1 ← Mem[1000]   (blocked)
An older STA to the same address, but the store’s data is not ready → block the load:
  Store Mem[2000] ← 7
  Store Mem[1000] ← ??
  Load  R1 ← Mem[1000]   (blocked)
66
Memory Order Buffer (MOB)
Store coloring:
- Each store is allocated in order in the Store Buffer, and gets an SBID
- Each load is allocated in order in the Load Buffer, and gets an LBID + the current SBID (its “color”)
A load is checked against all previous stores, i.e., stores with SBID ≤ the load’s SBID. The load is blocked if:
- A relevant STA has an unresolved address, or
- An STA is to the same address, but its data is not ready
The MOB writes the blocking info into the load buffer, and re-dispatches the load when the wake-up signal is received.
If the load is not blocked → it is executed (bypasses the older stores).
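The blocking check described above can be sketched as a toy model (not the actual MOB logic): a load is compared against older stores, youngest first, so the nearest older store to the same address decides the outcome:

```python
def check_load(load_addr, load_color, stores):
    """stores: list of (sbid, addr_or_None, data_ready). Returns the action."""
    # scan older stores from youngest to oldest (higher SBID first)
    for sbid, addr, data_ready in sorted(stores, key=lambda s: -s[0]):
        if sbid > load_color:
            continue                 # younger than the load: ignore
        if addr is None:
            return 'block: unresolved older store address'
        if addr == load_addr:
            return ('forward from store buffer' if data_ready
                    else 'block: store data not ready')
    return 'execute from cache'

stores = [(1, 2000, True), (2, None, False)]
print(check_load(1000, 2, stores))  # store 2's address unknown -> block
print(check_load(1000, 1, stores))  # only store 1 is older; no conflict
```

The same routine also covers the forwarding case on the later slide: an older resolved store to the load's address with ready data supplies the value directly.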
67
Memory Disambiguation
The MOB predicts whether a load can proceed despite unknown STAs:
- Predicted colliding → block the load if there is an unknown STA (as usual)
- Predicted non-colliding → execute even if there are unknown STAs
In case of a wrong prediction, the entire pipeline is flushed when the load retires.
  Store Mem[2000] ← 7
  Store Mem[????] ← 8
  Load  R1 ← Mem[1000]
68
Store → Load Forwarding
An older STA is to the same address, and the store’s data is ready →
Store → Load Forwarding: the load gets the data directly from the SDB; it does not need to wait for the data to be written to the DCU.
  Store Mem[1000] ← 7
  Load  R1 ← Mem[1000]   (forwarded)
69
DCU Miss Blocking caches severely hurt OOO
Blocking caches severely hurt OOO:
- A cache miss prevents other cache requests (which could be hits) from being served
- This hurts one of the main gains of OOO: hiding cache misses
Caches in an OOO machine are non-blocking. If a load misses in the DCU:
- The DCU marks the write-back data as invalid, assigns a fill buffer to the load, and issues an L2 request
- As long as there are still free fill buffers, more loads can be dispatched
- When the critical chunk returns, wake up and re-dispatch the load
- Subsequent requests for the same missed cache line are squashed: they use the same fill buffer
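The fill-buffer behavior can be illustrated with a toy model; the buffer count and the return values here are made up for illustration:

```python
class NonBlockingCache:
    def __init__(self, fill_buffers=4):
        self.free = fill_buffers
        self.pending = {}          # missed line -> loads squashed onto it

    def load(self, line):
        if line in self.pending:          # miss to an already-missing line:
            self.pending[line].append(line)   # squash onto the same fill buffer
            return 'squashed'
        if self.free == 0:                # all fill buffers busy
            return 'stall: no fill buffer'
        self.free -= 1                    # allocate a fill buffer, go to L2
        self.pending[line] = [line]
        return 'miss: fill buffer allocated'

c = NonBlockingCache(fill_buffers=1)
print(c.load(0x40))   # first miss takes the only fill buffer
print(c.load(0x40))   # same line: squashed, no new buffer needed
print(c.load(0x80))   # different line, no buffer left -> dispatch stalls
```

Even this tiny model shows the key property: misses to already-outstanding lines do not consume extra buffers, so independent loads keep flowing until the fill buffers are exhausted.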
70
Pipeline: Load: Allocate
Schedule AGU LB Write DTLB DCU WB MOB Retire IDQ RS ROB LB
- Allocate an RS entry, a ROB entry, and a Load Buffer entry for the load
- Assign a Store Buffer ID (SBID) to enable ordering
71
Pipeline: Bypassed Load: EXE
(slide diagram: pipeline stages Alloc → Schedule → AGU → LB Write → DTLB → DCU → WB → Retire, over the IDQ, RS, ROB, MOB, and LB structures)
- The RS checks when the data used for address calculation is ready, and dispatches the load
- The AGU calculates the linear address: DS-Base + base + (scale × index) + disp.
- Write the address to the Load Buffer
- DTLB: virtual → physical translation + DCU set access
- The MOB checks blocking and forwarding
- DCU read / Store Data Buffer read (Store → Load forwarding)
- Write back the data / write a block code
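The AGU's address calculation is simple enough to state directly (a sketch of the slide's formula; segment-limit checks and address-size truncation are omitted):

```python
def agu_linear_address(ds_base: int, base: int, index: int,
                       scale: int, disp: int) -> int:
    """Linear address as defined on the slide:
    DS-Base + base + (scale * index) + disp."""
    return ds_base + base + scale * index + disp

# e.g., flat segment (DS-Base = 0), base 0x1000, index 4 scaled by 8,
# displacement 0x20:
addr = agu_linear_address(0, 0x1000, 4, 8, 0x20)   # -> 0x1040
```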
72
Pipeline: Blocked Load Re-dispatch
(slide diagram: pipeline stages Alloc → Schedule → AGU → LB Write → DTLB → DCU → WB → Retire, over the IDQ, RS, ROB, MOB, and LB structures)
- The MOB determines which loads are ready, and schedules one
- The load arbitrates for the DTLB pipeline
- DTLB: virtual → physical translation + DCU set access
- The MOB checks blocking and forwarding
- DCU read / Store Data Buffer read
- Write back the data / write a block code
73
Pipeline: Load: Retire
(slide diagram: pipeline stages Alloc → Schedule → AGU → LB Write → DTLB → DCU → WB → Retire, over the IDQ, RS, ROB, MOB, and LB structures)
- Reclaim the ROB and Load Buffer entries
- Commit the result to the RRF
74
Stores are Done in Two Phases
At EXE
- Fill the store buffer with the linear + physical address, and with the data
- Once the store address and data are known, forward the store data to load operations that need it
After the store retires – completion phase
- First, the line must be in the L1 D$, in E or M MESI state
  - Otherwise, fetch it using a Read For Ownership request: L1 D$ → L2$ → LLC → L2$ and L1 D$ in other cores → memory
- Read the data from the store buffer and write it to the L1 D$ in M state
  - Done at retirement, to preserve the order of memory writes
- Release the store buffer entry taken by the store
  - Affects performance only if the store buffer becomes full (allocation stalls)
  - Loads needing the store data get it when the store is executed
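The completion phase can be sketched as follows (toy model: `rfo` stands in for the whole L2$ → LLC → other cores → memory ownership request, and 64-byte lines are an assumption):

```python
def retire_store(entry: dict, dcache_state: dict, rfo) -> int:
    """Completion phase of a retired store (sketch).
    dcache_state maps line number -> MESI state ('M','E','S','I');
    rfo(line) models a Read For Ownership that installs the line in E state."""
    line = entry["addr"] // 64               # assumed 64B cache lines
    if dcache_state.get(line) not in ("E", "M"):
        dcache_state[line] = rfo(line)       # fetch ownership before writing
    dcache_state[line] = "M"                 # write the data: line is Modified
    return entry["data"]                     # value now committed to L1 D$
```

Requiring E or M state before the write is what makes the store globally safe: no other core can hold a valid copy of the line at that point.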
75
Pipeline: Store: Allocate
(slide diagram: pipeline stages Alloc → Schedule → AGU → DTLB → SB → Retire, over the IDQ, RS, ROB, and SB structures)
- Allocate ROB/RS entries
- Allocate a Store Buffer entry
76
Pipeline: Store: STA EXE
(slide diagram: pipeline stages Alloc → Schedule → AGU → SB V.A. → DTLB → SB P.A. → Retire, over the IDQ, RS, ROB, and SB structures)
- The RS checks when the data used for address calculation is ready, and dispatches the STA to the AGU
- The AGU calculates the linear address
- Write the linear address to the Store Buffer
- DTLB: virtual → physical translation
- Load Buffer memory-disambiguation verification
- Write the physical address to the Store Buffer
77
Pipeline: Store: STD EXE
(slide diagram: pipeline stages Alloc → Schedule → SB data → Retire, over the IDQ, RS, and ROB structures)
- The RS checks when the data for the STD is ready, and dispatches the STD
- Write the data to the Store Data Buffer
78
Pipeline: Senior Store Retirement
(slide diagram: pipeline stages Alloc → Schedule → Retire, over the IDQ, RS, ROB, SB, MOB, and DCU structures)
- When the STA (and thus the STD) retires, the Store Buffer entry is marked as senior
- When the DCU is idle, the MOB dispatches a senior store
  - Read the senior store entry from the Store Buffer
  - The Store Buffer sends the data and the physical address
  - The DCU writes the data to the specified physical address
  - Reclaim the Store Buffer entry