Presentation transcript: Computer Structure 2014 – Out-Of-Order Execution
Lihu Rappoport and Adi Yoaz
Slide 2: What's Next
- Goal: minimize CPU Time
    CPU Time = clock cycle × CPI × IC
- So far we have learned
  - Minimize clock cycle ⇒ add more pipe stages
  - Minimize CPI ⇒ use a pipeline
  - Minimize IC ⇒ architecture
- In a pipelined CPU
  - CPI without hazards is 1
  - CPI with hazards is > 1
- Adding more pipe stages reduces the clock cycle but increases CPI
  - Higher penalty due to control hazards
  - More data hazards
- What can we do? Further reduce the CPI!
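The trade-off between clock cycle and CPI can be made concrete with a small calculation. The sketch below evaluates CPU Time = clock cycle × CPI × IC for two pipelines; the function name and all numbers are hypothetical, chosen only to illustrate that a shorter cycle can win even when CPI rises.

```python
def cpu_time(clock_cycle_ns: float, cpi: float, instruction_count: int) -> float:
    """CPU time in nanoseconds: clock cycle x CPI x IC."""
    return clock_cycle_ns * cpi * instruction_count


# Hypothetical numbers: the deeper pipeline halves the clock cycle,
# but hazards push its CPI from 1.2 up to 1.6.
shallow = cpu_time(clock_cycle_ns=1.0, cpi=1.2, instruction_count=1_000_000)
deep = cpu_time(clock_cycle_ns=0.5, cpi=1.6, instruction_count=1_000_000)

print(f"shallow pipeline: {shallow:.0f} ns")
print(f"deep pipeline:    {deep:.0f} ns")
```

With these assumed values the deeper pipeline is still faster overall, which is why the remaining lever is to reduce CPI further.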
Slide 3: A Superscalar CPU
- Duplicating HW in one pipe stage won't help
  - e.g., with 2 ALUs, the bottleneck moves to other stages
- Getting IPC > 1 requires fetching, decoding, executing, and retiring more than 1 instruction per clock
[Figure: pipeline with stages IF, ID, EXE, MEM, WB]
Slide 4: The Pentium Processor
- Fetches and decodes 2 instructions per cycle
  - Before register file read, decide on pairing: can the two instructions be executed in parallel?
- The pairing decision is based on
  - Data dependencies: the 2nd instruction must be independent of the 1st
  - Resources: the U-pipe and V-pipe are not symmetric (to save HW)
    - Common instructions can execute on either pipe
    - Some instructions can execute only on the U-pipe
    - If the 2nd instruction requires the U-pipe, it cannot pair
    - Some instructions use resources of both pipes
[Figure: IF and ID stages, a pairing decision, then the U-pipe and V-pipe]
Slide 5: Misprediction Penalty in a Superscalar CPU
- MPI: misses per instruction
    MPI = (# incorrectly predicted branches) / (total # of instructions)
        = MPR × (# predicted branches) / (total # of instructions)
- MPI correlates well with performance. For example, assume
  - MPR = 5%, %branches = 20% ⇒ MPI = 1%
  - Without hazards, IPC = 2 (2 instructions per cycle)
  - Flush penalty of 5 cycles
- We get
  - MPI = 1% ⇒ a flush every 100 instructions
  - IPC = 2 ⇒ a flush every 100/2 = 50 cycles
  - 5 cycles of flush penalty every 50 cycles ⇒ 10% performance hit
- For IPC = 1 we would get
  - 5 cycles of flush penalty per 100 cycles ⇒ 5% performance hit
- Flush penalty increases as the machine gets deeper and wider
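The arithmetic on this slide can be reproduced step by step. The sketch below follows the slide's own estimate (flush cycles divided by the hazard-free cycles between flushes); the function names are hypothetical.

```python
def misses_per_instruction(mpr: float, branch_fraction: float) -> float:
    """MPI = MPR x (fraction of instructions that are predicted branches)."""
    return mpr * branch_fraction


def flush_overhead(mpi: float, ipc: float, flush_penalty_cycles: float) -> float:
    """Flush cycles per hazard-free cycle, as estimated on the slide."""
    instructions_between_flushes = 1 / mpi
    cycles_between_flushes = instructions_between_flushes / ipc
    return flush_penalty_cycles / cycles_between_flushes


mpi = misses_per_instruction(mpr=0.05, branch_fraction=0.20)
hit_ipc2 = flush_overhead(mpi, ipc=2, flush_penalty_cycles=5)
hit_ipc1 = flush_overhead(mpi, ipc=1, flush_penalty_cycles=5)

print(round(mpi, 4))       # 0.01
print(round(hit_ipc2, 4))  # 0.1
print(round(hit_ipc1, 4))  # 0.05
```

The same misprediction rate costs twice as much at IPC = 2 as at IPC = 1, because the wider machine reaches the next mispredicted branch in half the cycles.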
Slide 6: Extract More ILP
- ILP: Instruction Level Parallelism
  - A given program, executed on given input data, has a given parallelism
  - Only independent instructions can execute in parallel
  - If, for example, each instruction depends on the previous instruction, the ILP of the program is 1
    - Adding more HW will not change that
- Adjacent instructions are usually dependent
  - The utilization of the 2nd pipe is usually low
  - There are algorithms in which both pipes are highly utilized
- Solution: Out-Of-Order Execution
  - Look for independent instructions further ahead in the program
  - Execute instructions based on data readiness
  - Still need to keep the semantics of the original program
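One way to see that ILP is a property of the program rather than of the hardware is to compute it from the dependency chain. The sketch below is a simplification under stated assumptions: every instruction takes one cycle, only true (read-after-write) dependencies count, and ILP is taken as instruction count divided by the longest dependency chain. The `ilp` function and the two toy programs are hypothetical.

```python
def ilp(program):
    """Estimate ILP for a non-empty list of (dest, [sources]) instructions.

    Assumes unit latency and unlimited hardware; only true dependencies count.
    """
    depth_of_latest_write = {}  # register -> chain depth of its latest writer
    longest_chain = 0
    for dest, sources in program:
        depth = 1 + max((depth_of_latest_write.get(s, 0) for s in sources), default=0)
        depth_of_latest_write[dest] = depth
        longest_chain = max(longest_chain, depth)
    return len(program) / longest_chain


# Each instruction depends on the previous one: ILP is 1.
chain = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
# All instructions read only r0, which none of them writes: ILP is 4.
independent = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r0"]), ("r4", ["r0"])]

print(ilp(chain))        # 1.0
print(ilp(independent))  # 4.0
```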
Slide 8: OOOE – General Scheme
- Fetch & decode instructions in parallel but in order
  - Fill the instruction pool
- Execute ready instructions from the instruction pool
  - All source data ready + needed execution resources available
- Once an instruction is executed
  - Signal all dependent instructions that the data is ready
- Commit instructions in parallel but in order
  - State change (memory, registers) and fault/exception handling
[Figure: Fetch & Decode (in-order) → Instruction pool → Execute (out-of-order) → Retire/commit (in-order)]
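The scheme above can be sketched as a toy cycle-by-cycle simulation. This is a minimal illustration, not a model of any real core: it assumes the whole program is already in the instruction pool, that at most `width` instructions start and at most `width` retire per cycle, and that only true dependencies matter. The `Inst` class, `simulate` function, and sample program are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    name: str
    dest: str
    srcs: tuple
    latency: int = 1


def simulate(program, width=2):
    """Return (start_cycles, retire_cycles) for a toy out-of-order core."""
    # Resolve every source to the index of its latest earlier writer.
    producers, last_writer = [], {}
    for i, inst in enumerate(program):
        producers.append([last_writer[s] for s in inst.srcs if s in last_writer])
        last_writer[inst.dest] = i

    n = len(program)
    start, done, retire = [None] * n, [None] * n, [None] * n
    next_retire, cycle = 0, 0
    while next_retire < n:
        # Execute out of order: oldest ready instructions first, up to `width`.
        issued = 0
        for i in range(n):
            if issued == width:
                break
            ready = all(done[p] is not None and done[p] <= cycle for p in producers[i])
            if start[i] is None and ready:
                start[i] = cycle
                done[i] = cycle + program[i].latency  # result available from this cycle
                issued += 1
        # Retire in order: only the oldest instructions, and only once completed.
        retired = 0
        while (next_retire < n and retired < width
               and done[next_retire] is not None and done[next_retire] <= cycle):
            retire[next_retire] = cycle
            next_retire += 1
            retired += 1
        cycle += 1
    return start, retire


program = [
    Inst("div", "r1", ("r4", "r7"), latency=4),  # long-latency producer
    Inst("add", "r8", ("r1", "r2")),             # depends on div
    Inst("inc", "r5", ("r5",)),                  # independent
    Inst("sub", "r6", ("r6", "r3")),             # independent
]
start, retire = simulate(program)
print(start)   # [0, 4, 0, 1]
print(retire)  # [4, 5, 5, 6]
```

The independent `inc` and `sub` start while `div` is still executing, yet they retire only after `div` and `add`, which keeps the architectural state changes in program order.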
Slide 10: Write-After-Write Dependency
(1) r1 ← R9/17
(2) r2 ← r2+r1
(3) r1 ← 23
(4) r3 ← r3+r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3+r1
(8) r3 ← 2
If inst (3) is executed before inst (1), r1 ends up with a wrong value. This is called a write-after-write false dependency.
Slide 11: Write-After-Write Dependency
(1) r1 ← R9/17
(2) r2 ← r2+r1
(3) r1 ← 23
(4) r3 ← r3+r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3+r1
(8) r3 ← 2
Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3).
Write-After-Write (WAW) is a false dependency
- Not a real data dependency, but an artifact of OOO execution
Slide 12: Speculative Execution
(1) r1 ← R9/17
(2) r2 ← r2+r1
(3) r1 ← 23
(4) r3 ← r3+r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3+r1
(8) r3 ← 2
About 1 in 5 instructions is a branch ⇒ continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path. This is called “speculative execution”.
Slide 14: Write-After-Read Dependency
(1) r1 ← R9/17
(2) r2 ← r2+r1
(3) r1 ← 23
(4) r3 ← r3+r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3+r1
(8) r3 ← 2
If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3. This is called a write-after-read false dependency.
Write-After-Read (WAR) is a false dependency
- Not a real data dependency, but an artifact of OOO execution
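The three kinds of register dependency in the running example can be told apart mechanically. The sketch below is a pairwise check only: it compares one older and one younger instruction and ignores any writes in between. The `hazards` function and the encoding of the program are hypothetical, and `R9` is assumed to be a register, written here as `r9`.

```python
def hazards(older, younger):
    """Classify register hazards between two instructions in program order.

    Each instruction is a (dest, sources) pair.
    """
    older_dest, older_srcs = older
    younger_dest, younger_srcs = younger
    found = set()
    if older_dest in younger_srcs:
        found.add("RAW")  # true dependency: younger reads what older writes
    if older_dest == younger_dest:
        found.add("WAW")  # false dependency: both write the same register
    if younger_dest in older_srcs:
        found.add("WAR")  # false dependency: younger overwrites what older reads
    return found


# The example program, keyed by instruction number (the branch (5) is omitted).
prog = {
    1: ("r1", ["r9"]),
    2: ("r2", ["r2", "r1"]),
    3: ("r1", []),
    4: ("r3", ["r3", "r1"]),
    6: ("r1", []),
    7: ("r4", ["r3", "r1"]),
    8: ("r3", []),
}

print(hazards(prog[1], prog[3]))  # {'WAW'}
print(hazards(prog[7], prog[8]))  # {'WAR'}
print(hazards(prog[3], prog[4]))  # {'RAW'}
```

Only the last pair is a true dependency; the first two exist solely because the same architectural register name is reused.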
Slide 15: Register Renaming
- Hold a pool of physical registers
  - Map architectural registers into physical registers
- When an instruction is allocated into the instruction pool (still in order)
  - Allocate a free physical register from the pool
  - The physical register points to the architectural register
- When an instruction executes and writes a result
  - Write the result value to the physical register
- When an instruction needs data from a register
  - Read the data from the physical register allocated to the latest instruction that writes to the same architectural register and precedes the current instruction
  - If no such instruction exists, read the value from the architectural register
- When an instruction commits
  - Copy the value from its physical register to the architectural register
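The allocation and source-lookup steps above can be sketched on the running example. This is a minimal illustration with simplifying assumptions: the pool of physical registers never runs out, commit and freeing of registers are not modeled, and `R9` is treated as register `r9`. The `rename` function and the `p0, p1, ...` naming are hypothetical.

```python
def rename(program):
    """Rename each destination to a fresh physical register, in program order.

    A source that no earlier in-flight instruction writes keeps its
    architectural name, i.e. it is read from the architectural register.
    """
    mapping = {}   # architectural register -> latest physical register
    renamed = []
    for dest, srcs in program:
        new_srcs = [mapping.get(s, s) for s in srcs]  # latest earlier writer
        phys = f"p{len(renamed)}"                     # allocate a free physical register
        mapping[dest] = phys
        renamed.append((phys, new_srcs))
    return renamed


# Instructions (1)-(4) and (6)-(8) of the example; the branch (5) is omitted.
program = [
    ("r1", ["r9"]),
    ("r2", ["r2", "r1"]),
    ("r1", []),
    ("r3", ["r3", "r1"]),
    ("r1", []),
    ("r4", ["r3", "r1"]),
    ("r3", []),
]
renamed = rename(program)
for dest, srcs in renamed:
    print(dest, srcs)
```

After renaming, every instruction writes a distinct physical register, so the WAW and WAR conflicts on r1 and r3 disappear, while inst (4) still reads the register written by inst (3) and inst (7) reads those written by insts (4) and (6).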
Slide 19: Jump Misprediction – Flush at Retire
- When the mispredicted jump retires
  - Flush the pipeline
    - When the branch commits, all the instructions remaining in the pipe are younger than the branch ⇒ they are all from the wrong path
  - Reset the renaming map
    - So all registers are mapped to architectural registers
    - This is OK since there are no consumers of physical registers (the pipe is flushed)
  - Start fetching instructions from the correct path
- Disadvantage
  - Very high misprediction penalty
    - The misprediction is already known once the jump has executed
  - We will see ways to recover from a misprediction at execution
Slide 20: OOO Requires Accurate Branch Predictor
- An accurate branch predictor increases the effective scheduling window size
  - Speculate across multiple branches (a branch every 5 – 10 instructions)
[Figure: instruction pool divided by branches, ranging from "high chances to commit" to "low chances to commit"]
Slide 21: Interrupts and Faults Handling
- Complications for pipelined and OOO execution
  - Interrupts occur in the middle of an instruction
  - A speculative instruction can get a fault (divide by 0, page fault)
- Faults are served in program order, at retirement only
  - Mark an instruction that takes a fault at execution
  - Instructions older than the faulting instruction are retired
  - Only when the faulting instruction retires, handle the fault
    - Flush subsequent instructions
    - Initiate the fault handling code according to the fault type
    - Restart the faulting and/or subsequent instructions
- Interrupts are served when the next instruction retires
  - Let the instruction in the current cycle retire
  - Flush subsequent instructions and initiate the interrupt service code
  - Fetch the subsequent instructions
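The "mark at execution, handle at retirement" rule for faults can be sketched as a retirement loop over an in-order list of in-flight instructions. This is a simplified illustration: the list-of-dicts structure, the field names, and the `retire` function are hypothetical, and interrupts and retirement width are not modeled.

```python
def retire(in_flight):
    """Retire from the oldest instruction onward.

    in_flight: list of dicts in program order with keys
      'name', 'done' (bool), 'fault' (None or a fault description).
    Returns (retired_names, fault_to_handle, flushed_names).
    """
    retired = []
    for i, entry in enumerate(in_flight):
        if not entry["done"]:
            # Oldest instruction has not finished: stop, nothing is flushed.
            return retired, None, []
        if entry["fault"] is not None:
            # The faulting instruction reached retirement: flush everything younger.
            flushed = [e["name"] for e in in_flight[i + 1:]]
            return retired, (entry["name"], entry["fault"]), flushed
        retired.append(entry["name"])
    return retired, None, []


# B was marked with a fault at execution; it is handled only once A has retired.
window = [
    {"name": "A", "done": True, "fault": None},
    {"name": "B", "done": True, "fault": "divide by 0"},
    {"name": "C", "done": True, "fault": None},
    {"name": "D", "done": False, "fault": None},
]
result = retire(window)
print(result)  # (['A'], ('B', 'divide by 0'), ['C', 'D'])

# A fault on a younger instruction is not acted on while an older one is pending.
pending = [
    {"name": "A", "done": False, "fault": None},
    {"name": "B", "done": True, "fault": "page fault"},
]
print(retire(pending))  # ([], None, [])
```

Deferring the fault to retirement means a fault on a speculative, wrong-path instruction is simply discarded if that instruction is flushed before it becomes the oldest.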
Slide 22: Out Of Order Execution Summary
- Look ahead in a window of instructions
  - Dispatch ready instructions to execution, i.e., instructions that
    - Do not depend on data from previous instructions that have not yet executed
    - Have the required execution resources available
- Advantages
  - Exploit Instruction Level Parallelism beyond adjacent instructions
  - Help cover latencies (e.g., L1 data cache miss, divide)
  - Superior/complementary to the compiler scheduler
    - Can look for ILP beyond conditional branches
    - In a given control path, instructions may be independent
  - Register renaming: use more registers than the number of architectural registers
- Complex micro-architecture
  - Register renaming, complex scheduler, misprediction recovery
  - Memory ordering (not covered so far)