Computer Structure 2015 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

What's Next
• Goal: minimize CPU time
  CPU Time = clock cycle × CPI × IC
• So far we have learned how to
  - Minimize the clock cycle: add more pipe stages
  - Minimize CPI: use a pipeline
  - Minimize IC: architecture
• In a pipelined CPU
  - CPI without hazards is 1
  - CPI with hazards is > 1
• Adding more pipe stages reduces the clock cycle but increases CPI
  - Higher penalty due to control hazards
  - More data hazards
• What can we do? Further reduce the CPI!

A Superscalar CPU
• Duplicating HW in one pipe stage won't help
  - e.g., with 2 ALUs, the bottleneck moves to other stages
• Getting IPC > 1 requires fetching, decoding, executing, and retiring more than one instruction per clock
[Figure: pipeline stages IF, ID, EXE, MEM, WB]

The Pentium Processor
• Fetches and decodes 2 instructions per cycle
  - Before the register file read, decide on pairing: can the two instructions be executed in parallel?
• The pairing decision is based on
  - Data dependencies: the 2nd instruction must be independent of the 1st
  - Resources: the U-pipe and V-pipe are not symmetric (to save HW)
    - Common instructions can execute on either pipe
    - Some instructions can execute only on the U-pipe
    - If the 2nd instruction requires the U-pipe, it cannot pair
    - Some instructions use resources of both pipes
[Figure: IF and ID stages, a pairing decision, then the U-pipe and V-pipe]

Misprediction Penalty in a Superscalar CPU
• MPI: miss-per-instruction
  MPI = (# incorrectly predicted branches) / (total # of instructions)
      = MPR × (# predicted branches) / (total # of instructions)
• MPI correlates well with performance, e.g., assume
  - MPR = 5%, %branches = 20% ⇒ MPI = 1%
  - Without hazards IPC = 2 (2 instructions per cycle)
  - Flush penalty of 5 cycles
• We get
  - MPI = 1% ⇒ a flush every 100 instructions
  - IPC = 2 ⇒ a flush every 100/2 = 50 cycles
  - 5 cycles of flush penalty every 50 cycles ⇒ 10% performance hit
• For IPC = 1 we would get
  - 5 cycles of flush penalty per 100 cycles ⇒ 5% performance hit
• Flush penalty increases as the machine gets deeper and wider
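The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch of the slide's back-of-the-envelope model: the function name and parameter names are made up for illustration, and the "performance hit" is computed the way the slide does it, as flush cycles divided by the cycles between flushes.

```python
def misprediction_cost(mpr, branch_fraction, ipc, flush_penalty):
    """Return (MPI, performance hit) under the slide's simple model."""
    mpi = mpr * branch_fraction              # mispredictions per instruction
    instructions_per_flush = 1.0 / mpi       # e.g. one flush every 100 instructions
    cycles_per_flush = instructions_per_flush / ipc
    return mpi, flush_penalty / cycles_per_flush


# The two cases from the slide: IPC = 2 and IPC = 1.
for ipc in (2, 1):
    mpi, hit = misprediction_cost(mpr=0.05, branch_fraction=0.20,
                                  ipc=ipc, flush_penalty=5)
    print(f"IPC={ipc}: MPI={mpi:.3f}, performance hit={hit:.3f}")
```

Doubling the IPC halves the number of cycles between flushes, so the same 5-cycle penalty costs twice as much relative to useful work.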

Extract More ILP
• ILP: Instruction Level Parallelism
  - A given program, executed on given input data, has a given parallelism
  - Only independent instructions can execute in parallel
  - If, for example, each instruction depends on the previous instruction, the ILP of the program is 1
    - Adding more HW will not change that
• Adjacent instructions are usually dependent
  - The utilization of the 2nd pipe is usually low
  - There are algorithms in which both pipes are highly utilized
• Solution: Out-Of-Order Execution
  - Look for independent instructions further ahead in the program
  - Execute instructions based on data readiness
  - Still need to keep the semantics of the original program

Data Flow Analysis
• Example:
  (1) r1 ← r4 / r7   ; assume divide takes 20 cycles
  (2) r8 ← r1 + r2
  (3) r5 ← r5 + 1
  (4) r6 ← r6 - r3
  (5) r4 ← r5 + r6
  (6) r7 ← r8 * r4
[Figure: data flow graph with edges (1)→(2) via r1, (3)→(5) via r5, (4)→(5) via r6, (2)→(6) via r8, (5)→(6) via r4, next to timelines comparing in-order execution and out-of-order execution]
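The difference between the two timelines can be estimated with a small scheduling sketch. The model here is an assumption, not the slide's exact timing: every instruction other than the divide takes 1 cycle, the in-order machine starts at most one instruction per cycle in program order, the out-of-order machine has unlimited execution units, and false dependencies are ignored (as if registers were already renamed).

```python
# Each entry: (destination, sources, latency in cycles), in program order.
PROGRAM = [
    ("r1", ["r4", "r7"], 20),  # (1) r1 <- r4 / r7
    ("r8", ["r1", "r2"], 1),   # (2) r8 <- r1 + r2
    ("r5", ["r5"], 1),         # (3) r5 <- r5 + 1
    ("r6", ["r6", "r3"], 1),   # (4) r6 <- r6 - r3
    ("r4", ["r5", "r6"], 1),   # (5) r4 <- r5 + r6
    ("r7", ["r8", "r4"], 1),   # (6) r7 <- r8 * r4
]


def finish_times(program, in_order):
    """Cycle at which each instruction's result becomes available."""
    ready_at = {}       # register -> cycle its newest value is ready
    finish = []
    prev_start = -1
    for dest, srcs, latency in program:
        # True (read-after-write) dependencies: wait for the producers.
        start = max(ready_at.get(s, 0) for s in srcs)
        if in_order:
            # An in-order machine cannot start before the previous instruction.
            start = max(start, prev_start + 1)
        prev_start = start
        ready_at[dest] = start + latency
        finish.append(start + latency)
    return finish


print("in-order    :", finish_times(PROGRAM, in_order=True))
print("out-of-order:", finish_times(PROGRAM, in_order=False))
```

In this model, instructions (3), (4) and (5) complete under the shadow of the divide when executed out of order, so only (2) and (6) remain on the critical path.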

OOOE – General Scheme
• Fetch & decode instructions in parallel but in order
  - Fill the instruction pool
• Execute ready instructions from the instruction pool
  - All source data ready + needed execution resources available
• Once an instruction is executed
  - Signal all dependent instructions that data is ready
• Commit instructions in parallel but in order
  - State change (memory, register) and fault/exception handling
[Figure: Fetch & Decode (in-order) → Instruction pool → Execute (out-of-order) → Retire/commit (in-order)]

Write-After-Write Dependency
  (1) r1 ← r9 / 17
  (2) r2 ← r2 + r1
  (3) r1 ← 23
  (4) r3 ← r3 + r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3 + r1
  (8) r3 ← 2
• If inst (3) is executed before inst (1), r1 ends up having a wrong value
  - Called a write-after-write false dependency
• Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3)
• Write-After-Write (WAW) is a false dependency
  - Not a real data dependency, but an artifact of OOO execution

Speculative Execution
  (1) r1 ← r9 / 17
  (2) r2 ← r2 + r1
  (3) r1 ← 23
  (4) r3 ← r3 + r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3 + r1
  (8) r3 ← 2
• About 1 in 5 instructions is a branch ⇒ continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path
  - Called "speculative execution"

Write-After-Read Dependency
  (1) r1 ← r9 / 17
  (2) r2 ← r2 + r1
  (3) r1 ← 23
  (4) r3 ← r3 + r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3 + r1
  (8) r3 ← 2
• If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3
  - Called a write-after-read false dependency
• Write-After-Read (WAR) is a false dependency
  - Not a real data dependency, but an artifact of OOO execution
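The true and false dependencies in the running example can be enumerated mechanically. The sketch below is an illustration only: the dictionary encoding of the program and the function name are assumptions, the jump (5) is left out because the listing gives it no register operands, and immediates are not tracked.

```python
# Instruction number -> (destination register, source registers).
PROGRAM = {
    1: ("r1", ["r9"]),         # r1 <- r9 / 17
    2: ("r2", ["r2", "r1"]),   # r2 <- r2 + r1
    3: ("r1", []),             # r1 <- 23
    4: ("r3", ["r3", "r1"]),   # r3 <- r3 + r1
    6: ("r1", []),             # r1 <- 35
    7: ("r4", ["r3", "r1"]),   # r4 <- r3 + r1
    8: ("r3", []),             # r3 <- 2
}


def classify(program):
    """Return (kind, older, younger, register) tuples in program order."""
    last_writer = {}            # register -> latest instruction that wrote it
    readers_since_write = {}    # register -> readers of its current value
    deps = []
    for num, (dest, srcs) in program.items():
        for s in srcs:
            if s in last_writer:                    # true dependency
                deps.append(("RAW", last_writer[s], num, s))
        if dest in last_writer:                     # false: same destination
            deps.append(("WAW", last_writer[dest], num, dest))
        for reader in readers_since_write.get(dest, []):
            deps.append(("WAR", reader, num, dest))  # false: overwrite after read
        for s in srcs:
            readers_since_write.setdefault(s, []).append(num)
        last_writer[dest] = num
        readers_since_write[dest] = []
    return deps


for dep in classify(PROGRAM):
    print(dep)
```

Only the RAW entries constrain a machine that renames registers; the WAW and WAR entries are exactly the ones that renaming removes.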

Register Renaming
• Hold a pool of physical registers
  - Map architectural registers into physical registers (still in-order)
• When an instruction is allocated into the instruction pool (still in-order)
  - Allocate a free physical register from the pool
  - The physical register points to the architectural register
• When an instruction executes and writes a result
  - Write the result value to the physical register
• When an instruction needs data from a register
  - Read the data from the physical register allocated to the latest instruction that writes to the same architectural register and precedes the current instruction
    - If no such instruction exists, read the value from the architectural register
• When an instruction commits
  - Copy the value from its physical register to the architectural register

Register Renaming
  Original              Renaming   Renamed
  (1) r1 ← 17           r1:pr1     pr1 ← 17
  (2) r2 ← r2+r1        r2:pr2     pr2 ← r2+pr1
  (3) r1 ← 23           r1:pr3     pr3 ← 23
  (4) r3 ← r3+r1        r3:pr4     pr4 ← r3+pr3
  (5) jcc L2
  (6) L2: r1 ← 35       r1:pr5     pr5 ← 35
  (7) r4 ← r3+r1        r4:pr6     pr6 ← pr4+pr5
  (8) r3 ← 2            r3:pr7     pr7 ← 2
[Figure: register mapping table for r1..r4 pointing at pr1..pr7, next to the architectural registers r1..r4]
• When an instruction commits: copy its physical register into the architectural register
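The renaming shown on this slide can be reproduced with a short sketch. The tuple encoding of the program and the `rename` helper are assumptions made for illustration; physical registers are simply handed out in allocation order, with no free list or reclamation, and the jump is omitted because it has no register destination.

```python
# (destination, register sources) in program order; immediates are omitted.
PROGRAM = [
    ("r1", []),            # (1) r1 <- 17
    ("r2", ["r2", "r1"]),  # (2) r2 <- r2 + r1
    ("r1", []),            # (3) r1 <- 23
    ("r3", ["r3", "r1"]),  # (4) r3 <- r3 + r1
    ("r1", []),            # (6) r1 <- 35
    ("r4", ["r3", "r1"]),  # (7) r4 <- r3 + r1
    ("r3", []),            # (8) r3 <- 2
]


def rename(program):
    """Rename in program order; return the renamed program and final mapping."""
    mapping = {}     # architectural register -> latest physical register
    renamed = []
    for n, (dest, srcs) in enumerate(program, start=1):
        # Sources read the latest mapping, or the architectural register
        # itself when no earlier instruction in the window writes it.
        new_srcs = [mapping.get(s, s) for s in srcs]
        phys = f"pr{n}"
        mapping[dest] = phys     # updated only after the sources were looked up
        renamed.append((phys, new_srcs))
    return renamed, mapping


renamed, mapping = rename(PROGRAM)
for phys, srcs in renamed:
    print(phys, "<-", srcs)
print(mapping)
```

Because every write gets a fresh physical register, instruction (3) can no longer clobber the value instruction (2) needs, and instruction (8) can no longer clobber the r3 read by (7).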

Speculative Execution – Misprediction
  Original              Renaming   Renamed
  (1) r1 ← 17           r1:pr1     pr1 ← 17
  (2) r2 ← r2+r1        r2:pr2     pr2 ← r2+pr1
  (3) r1 ← 23           r1:pr3     pr3 ← 23
  (4) r3 ← r3+r1        r3:pr4     pr4 ← r3+pr3
  (5) jcc L2
  (6) L2: r1 ← 35       r1:pr5     pr5 ← 35
  (7) r4 ← r3+r1        r4:pr6     pr6 ← pr4+pr5
  (8) r3 ← 2            r3:pr7     pr7 ← 2
[Figure: register mapping table for r1..r4 pointing at pr1..pr7, next to the architectural registers r1..r4]
• If the predicted branch path turns out to be wrong (when the branch is executed):
  - The instructions following the branch are flushed before they are committed ⇒ the architectural state is not changed
• But the register mapping was already wrongly updated by the wrong-path instructions
  - Later on we will see various schemes to fix this

A Superscalar OOO Machine
• Fetch & Decode (in-order)
  - Fetch and decode instructions in parallel but in order
• Rename (in-order)
  - Renames sources to physical registers
  - Allocates a physical register to the destination
[Figure: Fetch & Decode → Rename, marked in-order]

A Superscalar OOO Machine
• Allocate instructions in the RS and ROB (in-order)
• Reservation Stations (RS)
  - Pool of instructions waiting for execution
  - Maintains per-instruction source ready/not-ready status
• Reorder Buffer (ROB)
  - Pool of instructions waiting for retirement
[Figure: Fetch & Decode → Rename → RS and ROB, marked in-order]

A Superscalar OOO Machine
• RS (out-of-order)
  - Track which instructions are ready: all their sources are ready
  - Dispatch ready instructions to execution ports in FIFO order
  - The RS handles only register dependencies; it does not handle memory dependencies
• Execute (out-of-order)
  - Write back the value and mark more sources as ready
  - Send an "exe done" indication to the ROB and reclaim the RS entry
[Figure: Fetch & Decode → Rename → RS and ROB → Execute, with the execution part marked out-of-order]

A Superscalar OOO Machine
• Retire (commit), in-order
  - Commit instructions in parallel but in order
  - Handle faults/exceptions
  - Reclaim the ROB entry
[Figure: Fetch & Decode → Rename → RS and ROB → Execute → Retire (commit), with retirement marked in-order]

Re-order Buffer (ROB)
• Holds instructions from allocation until retirement
  - In the same order as in the program
• Provides a large physical register space for register renaming
  - One physical register per ROB entry
    - Physical register number = entry number
    - Each instruction has only one destination
    - Buffers the execution results until retirement
  - Data Valid is set after the instruction has executed and the result has been written to the physical register

  #entry  Entry Valid  Data Valid  Physical Reg Data  Architectural dest. reg
  0       1            1           12H                EBX
  1       1            1           33H                ECX
  2       1            0           xxx                ESI
  39      0            0           xxx                XXX

RRF – Real Register File
• Holds the Architectural Register File
  - Architectural registers are numbered: 0 – EAX, 1 – EBX, …
• The value of an architectural register
  - Is the value written to it by the last committed instruction that writes to this register

  RRF:
  #entry (Arch Reg)  Data
  0 (EAX)            9AH
  1 (EBX)            F34H

Reservation station (RS)
• Pool of all "not yet executed" instructions
  - Holds the instruction's attributes and source data until it is executed
• When an instruction is allocated in the RS, its operand values are updated

  Operand from            Data valid  Get value from
  architectural register  1           RRF
  physical register       1           ROB
  physical register       0           Wait for value

  RS:
  op   v  src1  v  src2  Pdst
  add  1  97H   1  12H   37

Reservation station (RS)
• The RS maintains the operands' "ready/not-ready" status
  - Each cycle, executed instructions make more operands "ready"
    - The RS arbitrates the WB buses between the units
    - The RS monitors the WB bus to capture data needed by waiting instructions
    - Data can be bypassed directly from the WB bus to an execution unit
  - Instructions whose operands are all ready can be dispatched for execution
    - The dispatcher chooses which of the ready instructions to execute next
    - Dispatches the chosen instructions to functional units
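The wake-up and dispatch behaviour described here can be sketched as follows. This is a toy model under stated assumptions: the `RSEntry` layout, the one-result-per-call `writeback`, and the oldest-first selection are illustrative choices, and the operand values are made-up numbers, not a description of any real scheduler.

```python
from dataclasses import dataclass


@dataclass
class RSEntry:
    op: str
    pdst: int       # ROB entry (physical register) that receives the result
    values: list    # operand values; None while the operand is not ready
    tags: list      # ROB entry each operand waits for; None once it is ready

    def ready(self):
        return all(tag is None for tag in self.tags)


def writeback(rs, pdst, value):
    """Broadcast a result on the WB bus; waiting entries capture it."""
    for entry in rs:
        for i, tag in enumerate(entry.tags):
            if tag == pdst:
                entry.values[i] = value
                entry.tags[i] = None    # operand is now ready


def dispatch(rs):
    """Remove and return the oldest ready entry, or None if none is ready."""
    for i, entry in enumerate(rs):
        if entry.ready():
            return rs.pop(i)
    return None


# An add whose operands are ready, and a sub that waits for the add's result.
rs = [
    RSEntry("add", pdst=37, values=[0x97, 0x12], tags=[None, None]),
    RSEntry("sub", pdst=38, values=[None, 0x33], tags=[37, None]),
]
first = dispatch(rs)                       # the add goes to execution
writeback(rs, first.pdst, sum(first.values))
second = dispatch(rs)                      # the sub is now ready
print(first.op, second.op, [hex(v) for v in second.values])
```

Keeping a tag per operand is what lets a single broadcast wake up every consumer of the result at once.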

Allocation and Renaming
• Perform register allocation and renaming for ≤4 instructions/cycle
• The Register Alias Table (RAT)
  - Maps architectural registers into physical registers
    - For each arch reg, holds the number of the latest phy reg that updates it
  - When a new instruction that writes to an arch reg R is allocated
    - Record the phy reg allocated to the instruction as the latest reg that updates R
• The Allocator (Alloc)
  - Assigns each instruction an entry number in the ROB / RS
  - For each of the sources (architectural registers) of the instruction
    - Look up the RAT to find the latest phy reg updating it
    - Write it into the RS entry
  - Allocates Load & Store buffers in the MOB

  RAT:
  Arch reg  #reg  Location
  EAX       0     RRF
  EBX       19    ROB
  ECX       23    ROB

Register Renaming example
  IDQ: add EAX, EBX, EAX  ⇒  RAT / Alloc  ⇒  add ROB37, ROB19, RRF0

  RAT before:                  RAT after:
  Arch reg  #reg  Location     Arch reg  #reg  Location
  EAX       0     RRF          EAX       37    ROB
  EBX       19    ROB          EBX       19    ROB
  ECX       23    ROB          ECX       23    ROB

  ROB before:
  #   Entry Valid  Data Valid  Data  DST
  19  V            V           12H   EBX
  23  V            V           33H   ECX
  37  I            x           xxx   XXX
  38  I            x           xxx   XXX

  ROB after:
  #   Entry Valid  Data Valid  Data  DST
  19  V            V           12H   EBX
  23  V            V           33H   ECX
  37  V            I           xxx   EAX
  38  I            x           xxx   XXX

  RRF:
  0  EAX  97H

  RS:
  op   v  src1  v  src2  Pdst
  add  1  97H   1  12H   37

Register Renaming example (2)
  IDQ: sub EAX, ECX, EAX  ⇒  RAT / Alloc  ⇒  sub ROB38, ROB23, ROB37

  RAT before:                  RAT after:
  Arch reg  #reg  Location     Arch reg  #reg  Location
  EAX       37    ROB          EAX       38    ROB
  EBX       19    ROB          EBX       19    ROB
  ECX       23    ROB          ECX       23    ROB

  ROB before:
  #   Entry Valid  Data Valid  Data  DST
  19  V            V           12H   EBX
  23  V            V           33H   ECX
  37  V            I           xxx   EAX
  38  I            x           xxx   XXX

  ROB after:
  #   Entry Valid  Data Valid  Data  DST
  19  V            V           12H   EBX
  23  V            V           33H   ECX
  37  V            I           xxx   EAX
  38  V            I           xxx   EAX

  RRF:
  0  EAX  97H

  RS:
  op   v  src1   v  src2  Pdst
  add  1  97H    1  12H   37
  sub  0  rob37  1  33H   38
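The two RAT lookups in this example can be traced with a small sketch. The dictionary layout of the RAT and the `allocate` helper are assumptions made for illustration; they model only the renaming step, not the ROB and RS bookkeeping.

```python
# RAT state before the first instruction: arch reg -> (#reg, location).
rat = {"EAX": (0, "RRF"), "EBX": (19, "ROB"), "ECX": (23, "ROB")}


def allocate(rat, rob_entry, op, dst, src1, src2):
    """Rename one uop written destination-first ('op dst, src1, src2')."""
    # Look up the sources first, so that a source that is also the
    # destination still refers to the older value.
    renamed = [f"{rat[s][1]}{rat[s][0]}" for s in (src1, src2)]
    # Record the newly allocated ROB entry as the latest writer of dst.
    rat[dst] = (rob_entry, "ROB")
    return f"{op} ROB{rob_entry}, {renamed[0]}, {renamed[1]}"


print(allocate(rat, 37, "add", "EAX", "EBX", "EAX"))
print(allocate(rat, 38, "sub", "EAX", "ECX", "EAX"))
print(rat["EAX"])
```

The second instruction picks up ROB37 for EAX because the first one updated the RAT, which is how the true dependency between the add and the sub is preserved.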

Out-of-order Core: Execution Units
[Figure: the RS feeding execution ports 0, 1, 2, and 3,4; units shown: SHF, FMU, FDIV, IDIV, FAU, IEU, JEU, IEU, AGU (Load Address, Store Address), MIU, DCU, SDB]
• Internal 0-delay bypass within each EU
• 1st bypass in MIU
• 2nd bypass in RS

In-Order Retire: The ROB
• The Reorder Buffer (ROB)
  - Instructions are allocated in-order
  - After an instruction is executed, it is marked as executed ⇒ ready for retirement
  - An executed instruction can be retired once all prior instructions have retired
    - Once the oldest instructions in the ROB are ready to retire, they are retired
    - Instructions are retired in-order
  - Upon instruction retirement
    - Copy the value from its physical register to the architectural register
    - Its ROB entry is released
[Figure: the ROB as a queue; retire at the oldest end, alloc at the youngest end]
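In-order retirement from the ROB can be sketched as a queue whose head may leave only once it has executed. The class below is an illustration under assumptions: the entry fields, the method names, and the retire width of 3 per cycle are all hypothetical choices.

```python
from collections import deque


class ROB:
    def __init__(self):
        self.entries = deque()   # oldest instruction on the left
        self.arch = {}           # architectural register file (RRF)

    def alloc(self, dest):
        """Allocate an entry at the youngest end, in program order."""
        entry = {"dest": dest, "value": None, "executed": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Execution finished (possibly out of order): buffer the result."""
        entry["value"] = value
        entry["executed"] = True

    def retire(self, width=3):
        """Retire up to `width` of the oldest entries, stopping at the first
        one that has not executed yet."""
        retired = []
        while (self.entries and self.entries[0]["executed"]
               and len(retired) < width):
            entry = self.entries.popleft()              # release the ROB entry
            self.arch[entry["dest"]] = entry["value"]   # commit to the RRF
            retired.append(entry["dest"])
        return retired


rob = ROB()
a, b, c = rob.alloc("r1"), rob.alloc("r2"), rob.alloc("r3")
rob.complete(c, 30)          # the youngest finishes first
rob.complete(b, 20)
print(rob.retire())          # nothing retires: the oldest has not executed
rob.complete(a, 10)
print(rob.retire())          # now all three retire, in program order
```

Results that finish early simply wait in their ROB entries, which is why the architectural state only ever changes in program order.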

Faults And Exceptions
• Faults and exceptions are served in program order
  - At EXE: mark an instruction that takes a fault/exception
    - Divide by 0, page fault, …
  - Instructions older than the faulting instruction are retired
  - When the faulting instruction retires, handle the fault
    - Flush the ROB
    - Initiate the fault handling code according to the fault type
    - Re-fetch the faulting instruction and the subsequent instructions
[Figure: the ROB from oldest (retire) to youngest (alloc), with the faulting instruction marked and the fault handling code allocated after the flush]

Interrupts
• Interrupts are served when the next instruction retires
  - Let the instruction in the current cycle retire
  - Flush subsequent instructions
  - Initiate the interrupt service code
  - Fetch the subsequent instructions
[Figure: the ROB from oldest (retire) to youngest (alloc), with the interrupt arriving at the retire end]

Pipeline: Fetch
• Fetch multiple instruction bytes from the I$
• In case of a variable instruction length architecture (like x86)
  - Length-decode instructions within the fetched instruction bytes
  - Write the instructions into an Instruction Queue (IQ)
[Figure: pipeline stages Predict/Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, EX, Retire]

Pipeline: Decode
• Read multiple instructions from the IQ
• Decode the instructions
  - For x86 processors, each instruction is decoded into ≥1 μops
  - μops are RISC-like instructions
• Write the resulting μops into the Instruction Decoder Queue (IDQ)
[Figure: pipeline stages Predict/Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, EX, Retire]

Pipeline: Allocate
• Allocate, port-bind, and rename multiple μops
• Allocate a ROB/RS entry per μop
  - If source data is available from the ROB or RRF, write the data to the RS
  - Otherwise, mark the data as not ready in the RS
[Figure: pipeline stages Predict/Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, EX, Retire]

Pipeline: EXE
• Ready/Schedule
  - Check for data-ready μops whose needed functional unit is available
  - Select and dispatch multiple ready μops/clock to EXE
• Write back results into the RS/ROB
  - Wake up μops in the RS that are waiting for the results as sources
    - Update the data-ready status of these μops in the RS
  - Write back results into the physical registers
  - Reclaim RS entries
[Figure: pipeline stages Predict/Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, EX, Retire]

Pipeline: Retire
• Retire the oldest μops in the ROB
  - A μop may retire if
    - Its "executed" bit is set
    - It is not marked with an exception / fault
    - All preceding μops are eligible for retirement
  - Commit results from the physical register to the architectural register
  - Reclaim the ROB entry
• In case of an exception / fault
  - Flush the pipeline and handle the exception / fault
[Figure: pipeline stages Predict/Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, EX, Retire]

Jump Misprediction – Flush at Retire
• When a mispredicted jump retires
  - Flush the pipeline
    - When the jump commits, all the instructions remaining in the pipe are younger than the jump ⇒ they are from the wrong path
  - Reset the renaming map
    - So all the registers are mapped to the architectural registers
    - This is OK since there are no consumers of physical registers (the pipe is flushed)
  - Start fetching instructions from the correct path
• Disadvantage
  - Very high misprediction penalty
  - The misprediction is already known after the jump was executed
  - We will see ways to recover from a misprediction at execution

Pipeline: Jump Gets to EXE
• Misprediction detected when the jump is executed
  - Do nothing
[Figure: pipeline stages Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, JEU, Retire]

Pipeline: Mispredicted Jump Retires
• When the mispredicted jump retires
  - All instructions in the pipe are younger than the jump ⇒ from the wrong path ⇒ flush the pipeline
  - Reset the renaming map
    - So all the registers are mapped to the architectural registers
    - This is OK since there are no consumers of physical registers (the pipe is flushed)
  - Start fetching instructions from the correct path
[Figure: pipeline stages Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, JEU, Retire, with a Clear signal raised at retirement]

Jump Misprediction – Flush at Execute
• When a jump misprediction is detected (at jump execution)
  - Flush the in-order front-end
  - Instructions already in the OOO part continue to execute
    - Including wrong-path instructions (a waste of execution resources and power)
  - Start fetching and decoding instructions from the correct path
    - Note that the "correct" path may still be wrong
      - An older instruction may cause an exception when it retires
      - An older jump executed out-of-order can also mispredict
  - Block younger jumps (executed OOO) from clearing
  - The correct instruction stream is stalled at the RAT
    - The RAT was also wrongly updated by wrong-path instructions
• When the mispredicted jump retires
  - All instructions in the RS/ROB are from the wrong path
  - Flush all instructions from the RS/ROB
  - Reset the RAT to point only to architectural registers
  - Un-stall the RAT, and allow instructions from the correct path to rename/alloc

Pipeline: Jump Gets to EXE
• Flush the front-end and re-steer it to the correct path
• RAT state already updated by the wrong path
  - Block further allocation
• Update the BPU
• OOO not flushed
  - Instructions already in the OOO part continue to execute
    - Including wrong-path instructions (a waste of execution resources and power)
• Block younger jumps from clearing
[Figure: pipeline stages Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, JEU, Retire, with a Flush signal sent from the JEU to the front-end]

Pipeline: Mispredicted Jump Retires
• When the mispredicted jump retires
  - Flush the OOO
    - Only instructions following the jump are left; they must all be flushed
    - Resets all state in the OOO (RAT, RS, ROB, MOB, etc.)
    - Reset the RAT to point only to architectural registers
  - Allow allocation of instructions from the correct path
[Figure: pipeline stages Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, JEU, Retire, with a Clear signal raised at retirement]

Periodic Checkpoints
• Allow allocation and execution of instructions from the correct path before the mispredicted jump retires
• Every few instructions, take a checkpoint of the RAT
  - A snapshot of the current renaming map
• In case of misprediction
  - Flush the front-end and start fetching instructions from the correct path
  - Selectively flush younger instructions from the ROB/RS
  - Recover the RAT to the latest checkpoint taken prior to the mispredicted jump
  - Recover the RAT to its state at the jump
    - Re-rename the instructions from the checkpoint until the jump
  - Allow instructions from the correct path to allocate
[Figure: instruction stream with a checkpoint before the jump and the re-renamed span between the checkpoint and the jump]
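The recovery steps above can be sketched as a renamer that snapshots its map every few instructions and, on a misprediction, restores the nearest earlier snapshot and replays the renames up to the jump. Everything here is a simplified assumption: the class and method names, the checkpoint interval, the per-instruction rename log, and the way physical register numbers are reused after a flush.

```python
class CheckpointedRenamer:
    def __init__(self, interval):
        self.interval = interval   # take a checkpoint every `interval` instructions
        self.rat = {}              # architectural register -> physical register
        self.log = []              # (dest, phys) per allocated instruction
        self.checkpoints = {}      # instruction index -> RAT snapshot before it
        self.next_pr = 0

    def allocate(self, dest):
        """Allocate one instruction; dest is None for a jump."""
        idx = len(self.log)
        if idx % self.interval == 0:
            self.checkpoints[idx] = dict(self.rat)
        phys = None
        if dest is not None:
            self.next_pr += 1
            phys = f"pr{self.next_pr}"
            self.rat[dest] = phys
        self.log.append((dest, phys))
        return idx

    def recover(self, jump_idx):
        """Restore the RAT to its state just after the instruction at jump_idx."""
        start = max(i for i in self.checkpoints if i <= jump_idx)
        self.rat = dict(self.checkpoints[start])
        # Re-rename from the checkpoint up to and including the jump.
        for dest, phys in self.log[start:jump_idx + 1]:
            if dest is not None:
                self.rat[dest] = phys
        # Drop the wrong-path instructions and any checkpoints they took.
        del self.log[jump_idx + 1:]
        self.checkpoints = {i: s for i, s in self.checkpoints.items()
                            if i <= jump_idx}
        self.next_pr = sum(1 for dest, _ in self.log if dest is not None)


# Destinations of instructions (1)..(8) of the running example; None is the jump.
renamer = CheckpointedRenamer(interval=3)
for dest in ["r1", "r2", "r1", "r3", None, "r1", "r4", "r3"]:
    renamer.allocate(dest)
print(renamer.rat)       # polluted by the wrong-path instructions (6)..(8)
renamer.recover(4)       # the jump sits at index 4
print(renamer.rat)       # state right after the jump
```

A shorter interval means fewer instructions to re-rename on recovery at the cost of storing more snapshots.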

Mispredicted Jump Gets to EXE
• Clear raised on a mispredicted jump
  - Flush the front-end and re-steer it to the correct path
  - Flush all younger instructions in the OOO
  - Update the BPU
  - Block further allocation
[Figure: pipeline stages Predict/Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, JEU, Retire, with Clear and BPU Update signals sent from the JEU]

RAT Recovery
• Restore the RAT from the latest checkpoint before the jump
• Recover the RAT to its state just after the jump
  - Before any instruction on the wrong path
• Meanwhile, the front-end starts fetching and decoding instructions from the correct path
• Once done restoring the RAT
  - Allow allocation of instructions from the correct path
[Figure: pipeline stages Predict/Fetch, IQ, Decode, IDQ, Alloc, RS/ROB, Schedule, JEU, Retire]

Large ROB and RS are Important
• A large RS
  - Increases the window in which to look for independent instructions
    - Exposes more parallelism potential
• A large ROB
  - The ROB is a superset of the RS ⇒ ROB size ≥ RS size
  - Allows covering long-latency operations (cache miss, divide)
• Example
  - Assume there is a Load that misses the L1 cache
    - Data takes ~10 cycles to return ⇒ ~30 new instructions get into the pipeline
  - Instructions following the Load cannot commit ⇒ they pile up in the ROB
  - Instructions independent of the Load are executed and leave the RS
    - As long as the ROB is not full, we can keep executing instructions

OOO Requires Accurate Branch Predictor
• An accurate branch predictor increases the effective scheduling window size
  - Speculate across multiple branches (a branch every 5 – 10 instructions)
[Figure: instruction pool divided by branches; instructions before the first branches have high chances to commit, instructions beyond many branches have low chances to commit]

Out Of Order Execution Summary
• Look ahead in a window of instructions
  - Dispatch ready instructions to execution
    - They do not depend on data from previous instructions that are still not executed
    - They have the required execution resources available
• Advantages
  - Exploits Instruction Level Parallelism beyond adjacent instructions
  - Helps cover latencies (e.g., L1 data cache miss, divide)
  - Superior/complementary to the compiler scheduler
    - Can look for ILP beyond conditional branches
    - In a given control path, instructions may be independent
    - Register renaming: uses more registers than the number of architectural registers
• Complex micro-architecture
  - Register renaming, complex scheduler, misprediction recovery
  - Memory ordering (coming next)

OOO Execution of Memory Operations
• The RS dispatches instructions based on register dependencies
  - The RS cannot detect memory dependencies
      store Mem[r1+r3*2] ← r9
      load  r2 ← Mem[r10+r7*2]
    - It does not know the load/store memory addresses
  - The RS dispatches load/store instructions to the Address Generation Unit (AGU) when the sources for the address calculation are ready
  - The AGU calculates the linear (virtual) memory address
      Segment-Base + Base Register + (Scale × Index Register) + Displacement
• The AGU sends the linear address to the Memory Order Buffer (MOB)
  - The MOB resolves memory dependencies and enforces memory ordering
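The AGU formula can be written out directly. The function below is a minimal sketch: the function and parameter names are made up for illustration, and the wrap-around to 32 bits is an assumption about the address width.

```python
def linear_address(segment_base, base, index, scale, displacement, bits=32):
    """Segment-Base + Base Register + (Scale x Index Register) + Displacement."""
    if scale not in (1, 2, 4, 8):            # the scale factors x86 can encode
        raise ValueError("scale must be 1, 2, 4 or 8")
    address = segment_base + base + scale * index + displacement
    return address % (1 << bits)             # wrap to the assumed address width


# The store address Mem[r1 + r3*2] from the slide, with illustrative values
# r1 = 0x1000 and r3 = 3, a zero segment base and no displacement.
print(hex(linear_address(segment_base=0, base=0x1000, index=3,
                         scale=2, displacement=0)))
```

The address depends only on register values, so the RS can dispatch the calculation as soon as the base and index registers are ready, before the store data is known.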

Load and Store Ordering
• x86 has a small register set ⇒ it uses memory often
  - Preventing Stores from passing Stores/Loads: 3%~5% perf. loss
    - P6 chooses not to allow Stores to pass Stores/Loads
  - Preventing Loads from passing Loads/Stores: big perf. loss
    - P6 allows Loads to pass Stores, and Loads to pass Loads
• Stores are not executed OOO
  - Stores are never performed speculatively
    - There is no transparent way to undo them
  - Stores are also never re-ordered among themselves
    - The Store Buffer dispatches a store only when
      - The store has both its address and its data, and
      - There are no older stores awaiting dispatch
  - A Store commits its value to memory (DCU) post retirement

Store Implemented as 2 μops
• A Store is decoded as two independent μops
  - STA (store-address): calculates the address of the store
  - STD (store-data): stores the data into the Store Data buffer
    - The actual write to memory is done when the store retires
• Separating STA & STD is important for memory OOO
  - Allows the STA to dispatch earlier, even before the data is known
  - Address conflicts are resolved earlier ⇒ opens the memory pipeline for other loads
• STA and STD can be issued to execution units in parallel
  - STA is dispatched to the AGU when its sources (base + index) are ready
  - STD is dispatched to the SDB when its source operand is available

Load/Store Ordering
• The MOB tracks dependencies between loads and stores
  - An older STA has an unresolved address ⇒ block the load
      Store Mem[2000] ← 7
      Store Mem[????] ← 8
      Load  R1 ← Mem[1000]     (blocked)
  - An older STA to the same address, but the Store's data is not ready ⇒ block the load
      Store Mem[2000] ← 7
      Store Mem[1000] ← ??
      Load  R1 ← Mem[1000]     (blocked)

Memory Order Buffer (MOB)
• Store Coloring
  - Each Store is allocated in-order in the Store Buffer, and gets an SBID
  - Each Load is allocated in-order in the Load Buffer, and gets an LBID + the current SBID
• A Load is checked against all previous stores
  - Stores with SBID ≤ the load's SBID
• A Load is blocked if
  - There is an unresolved address of a relevant STA
  - There is an STA to the same address, but its data is not ready
• The MOB writes blocking info into the load buffer
  - Re-dispatches the load when a wake-up signal is received
• If a Load is not blocked ⇒ it is executed (bypassed)

         LBID  SBID
  Store  -     0
  Store  -     1
  Load   0     1
  Store  -     2
  Load   1     2
  Load   2     2
  Load   3     2
  Store  -     3
  Load   4     3

Memory Disambiguation
• The MOB predicts if a load can proceed despite unknown STAs
  - Predict colliding ⇒ block the Load if there is an unknown STA (as usual)
  - Predict non-colliding ⇒ execute even if there are unknown STAs
      Store Mem[2000] ← 7
      Store Mem[????] ← 8
      Load  R1 ← Mem[1000]
• In case of a wrong prediction
  - The entire pipeline is flushed when the load retires

Store → Load Forwarding
• An older STA to the same address, and the Store's data is ready
  - Store → Load Forwarding: the Load gets its data directly from the SDB
  - It does not need to wait for the data to be written to the DCU
      Store Mem[1000] ← 7
      Load  R1 ← Mem[1000]
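The blocking rules from the Load/Store Ordering and MOB slides, together with the forwarding case on this slide, can be combined into one check. This is a sketch under assumptions: the store-buffer entry layout and the return values are hypothetical, addresses are compared for exact equality (no partial overlaps or access sizes), and the memory disambiguation predictor is not modeled.

```python
def check_load(load_addr, load_sbid, store_buffer):
    """Decide what happens to a load given the stores allocated before it.

    store_buffer entries are dicts with 'sbid', 'addr' (None while the STA
    has not executed) and 'data' (None while the STD has not executed).
    """
    # Store coloring: only stores with SBID <= the load's SBID are older.
    older = [s for s in store_buffer if s["sbid"] <= load_sbid]
    # Rule 1: an older store with an unresolved address blocks the load.
    for s in older:
        if s["addr"] is None:
            return ("blocked", s["sbid"])
    # Rule 2: look at the youngest older store to the same address.
    matches = [s for s in older if s["addr"] == load_addr]
    if matches:
        youngest = max(matches, key=lambda s: s["sbid"])
        if youngest["data"] is None:
            return ("blocked", youngest["sbid"])   # data not ready yet
        return ("forward", youngest["data"])       # store -> load forwarding
    return ("bypass", None)                        # read from the DCU


# The examples from the slides.
unknown_sta = [{"sbid": 0, "addr": 2000, "data": 7},
               {"sbid": 1, "addr": None, "data": 8}]
data_missing = [{"sbid": 0, "addr": 2000, "data": 7},
                {"sbid": 1, "addr": 1000, "data": None}]
forwarding = [{"sbid": 0, "addr": 1000, "data": 7}]

print(check_load(1000, 1, unknown_sta))
print(check_load(1000, 1, data_missing))
print(check_load(1000, 0, forwarding))
```

Returning the SBID of the blocking store mirrors the blocking info the MOB writes into the load buffer, so the load can be re-dispatched when that store wakes it up.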

DCU Miss
• Blocking caches severely hurt OOO
  - A cache miss prevents other cache requests (which could possibly be hits) from being served
  - Hurts one of the main gains from OOO: hiding cache misses
  - Caches in OOO machines are non-blocking
• If a Load misses in the DCU
  - The DCU marks the write-back data as invalid
  - Assigns a fill buffer to the load, and issues an L2 request
    - As long as there are still free fill buffers, more loads can be dispatched
  - When the critical chunk returns, wake up and re-dispatch the load
• Squash subsequent requests for the same missed cache line
  - Use the same fill buffer
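The fill-buffer behaviour on a miss can be sketched as a small table keyed by cache line. The class below is illustrative only: the number of fill buffers, the 64-byte line size, and the string return values are assumptions, not details given on the slide.

```python
class FillBuffers:
    def __init__(self, count, line_size=64):
        self.count = count           # number of fill buffers (assumed)
        self.line_size = line_size   # cache line size in bytes (assumed)
        self.pending = {}            # line number -> ids of waiting loads

    def miss(self, load_id, addr):
        """Handle a load that missed in the DCU."""
        line = addr // self.line_size
        if line in self.pending:
            # Same missed line: squash the request, reuse the fill buffer.
            self.pending[line].append(load_id)
            return "merged"
        if len(self.pending) == self.count:
            return "stall"           # no free fill buffer
        self.pending[line] = [load_id]
        return "l2_request"          # assign a fill buffer, go to the L2

    def fill(self, addr):
        """Data returned: free the buffer, return the loads to re-dispatch."""
        return self.pending.pop(addr // self.line_size, [])


fb = FillBuffers(count=2)
print(fb.miss(0, 0x1000))   # first miss to the line
print(fb.miss(1, 0x1008))   # same cache line
print(fb.miss(2, 0x2000))   # different line, second fill buffer
print(fb.miss(3, 0x3000))   # no fill buffer left
print(fb.fill(0x1000))      # loads waiting on the returned line
```

Loads to other lines keep flowing while misses are outstanding, which is the property a blocking cache would take away.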

Pipeline: Load: Allocate
• Allocate an RS entry, a ROB entry, and a Load Buffer entry for the Load
• Assign a Store Buffer ID (SBID) to enable ordering
[Figure: load pipeline blocks IDQ, Alloc, RS, ROB, Schedule, AGU, LB Write, DTLB, DCU, MOB, LB, WB, Retire]

Pipeline: Bypassed Load: EXE
• The RS checks when the data used for the address calculation is ready
• The AGU calculates the linear address: DS-Base + Base + (Scale × Index) + Disp.
  - Writes the address to the Load Buffer
• DTLB: Virtual → Physical + DCU set access
• MOB checks blocking and forwarding
• DCU read / Store Data Buffer read (Store → Load forwarding)
• Write back data / write block code
[Figure: load pipeline blocks IDQ, Alloc, RS, ROB, Schedule, AGU, LB Write, DTLB, DCU, MOB, LB, WB, Retire]

Pipeline: Blocked Load Re-dispatch
• The MOB determines which loads are ready, and schedules one
• The Load arbitrates for the DTLB pipeline
• DTLB: Virtual → Physical + DCU set access
• MOB checks blocking and forwarding
• DCU read / Store Data Buffer read
• Write back data / write block code
[Figure: load pipeline blocks IDQ, Alloc, RS, ROB, Schedule, AGU, LB Write, DTLB, DCU, MOB, LB, WB, Retire]

Pipeline: Load: Retire
• Reclaim the ROB and Load Buffer entries
• Commit results to the RRF
[Figure: load pipeline blocks IDQ, Alloc, RS, ROB, Schedule, AGU, LB Write, DTLB, DCU, MOB, LB, WB, Retire]

Pipeline: Store: Allocate
• Allocate ROB/RS entries
• Allocate a Store Buffer entry
[Figure: store pipeline blocks IDQ, Alloc, RS, ROB, Schedule, AGU, DTLB, SB, Retire]

Pipeline: Store: STA EXE
• The RS checks when the data used for the address calculation is ready ⇒ dispatches the STA to the AGU
• The AGU calculates the linear address
• Write the linear address to the Store Buffer
• DTLB: Virtual → Physical
• Load Buffer memory disambiguation verification
• Write the physical address to the Store Buffer
[Figure: store pipeline blocks IDQ, Alloc, RS, ROB, Schedule, AGU, SB V.A., DTLB, SB P.A., SB, Retire]

Pipeline: Store: STD EXE
• The RS checks when the data for the STD is ready ⇒ dispatches the STD
• Write the data to the Store Data Buffer
[Figure: store pipeline blocks IDQ, Alloc, RS, ROB, Schedule, SB data, SB, Retire]

Pipeline: Senior Store Retirement
• When the STA (and thus the STD) retires
  - The Store Buffer entry is marked as senior
• When the DCU is idle
  - The MOB dispatches a senior store
• Read the senior store entry from the Store Buffer
  - The Store Buffer sends the data and the physical address
• The DCU writes the data into the specified physical address
• Reclaim the Store Buffer entry
[Figure: store pipeline blocks IDQ, Alloc, RS, ROB, Schedule, Retire, SB, MOB, DCU]
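The life cycle of a store, from allocation through STA/STD execution to the post-retirement write, can be sketched as an in-order queue. The class and method names below are assumptions for illustration, and memory is modeled as a plain dictionary standing in for the DCU.

```python
from collections import deque


class StoreBuffer:
    def __init__(self):
        self.entries = deque()   # oldest store on the left (allocated in order)

    def allocate(self):
        entry = {"addr": None, "data": None, "senior": False}
        self.entries.append(entry)
        return entry

    @staticmethod
    def sta(entry, addr):        # STA executed: the address is known
        entry["addr"] = addr

    @staticmethod
    def std(entry, data):        # STD executed: the data is known
        entry["data"] = data

    @staticmethod
    def retire(entry):           # STA and STD retired: the store becomes senior
        assert entry["addr"] is not None and entry["data"] is not None
        entry["senior"] = True

    def drain_one(self, memory):
        """Called when the DCU is idle: write the oldest store if it is senior."""
        if self.entries and self.entries[0]["senior"]:
            entry = self.entries.popleft()          # reclaim the entry
            memory[entry["addr"]] = entry["data"]   # the DCU performs the write
            return True
        return False


memory = {}
sb = StoreBuffer()
s0, s1 = sb.allocate(), sb.allocate()
sb.sta(s1, 0x10); sb.std(s1, 2)   # the younger store executes first
sb.sta(s0, 0x10); sb.std(s0, 1)
print(sb.drain_one(memory))       # nothing is senior yet
sb.retire(s0)
print(sb.drain_one(memory), memory)
sb.retire(s1)
print(sb.drain_one(memory), memory)
```

Because only the head of the queue can drain, stores reach memory in program order even when their STA and STD μops executed out of order.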
