Presentation is loading. Please wait.

Presentation is loading. Please wait.

Out-of-order execution Lihu Rappoport 11/2004 1 MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

Similar presentations


Presentation on theme: "Out-of-order execution Lihu Rappoport 11/2004 1 MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport."— Presentation transcript:

1 out-of-order execution Lihu Rappoport 11/2004 1 MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport

2 out-of-order execution Lihu Rappoport 11/2004 2 What’s Next  Remember our goal: minimize CPU Time CPU Time = clock cycle  CPI  IC  So far we have learned –Minimize clock cycle  add more pipe stages –Minimize CPI  use pipeline –Minimize IC  architecture  In a pipelined CPU: –CPI w/o hazards is 1 –CPI with hazards is > 1  Adding more pipe stages reduces clock cycle but increases CPI: –Higher penalty due to control hazards –More data hazards  Beyond some point adding more pipe does not help  What can we do ? Further reduce the CPI !

3 out-of-order execution Lihu Rappoport 11/2004 3 A Superscalar CPU  Just duplicating the hardware in one of the pipe stage (e.g., have 2 ALUs) won’t help: –the bottleneck will simply move to other stages  In order to have CPI < 1 we need to be able to fetch, decode, execute, and retire more than a single instruction per clock: IF ID EXE MEM WB

4 out-of-order execution Lihu Rappoport 11/2004 4 The Pentium  Processor  Fetches and decodes 2 instructions per cycle  Before register file read a pairing decision is made: can the two instructions be executed in parallel  Pairing decision is based on –Data dependencies: instructions must be independent –Resources:  Some instructions uses resources from the 2 pipes  The second pipe can only execute part of the instructions IF ID U-pipe V-pipe pairing

5 out-of-order execution Lihu Rappoport 11/2004 5  MPI : miss-per-instruction: # of incorrectly predicted branches # of predicted branches MPI = = MPR* total # of instructions total # of instructions  MPI correlates well with performance. For example, assume: –MPR = 5%, %branches = 20%  MPI = 1% –Without hazards CPI=0.5 (2 instructions per cycles) –Flush penalty of 5 cycles.  We get:  MPI = 1%  flush in every 100 instructions  flush in every 50 cycles (since CPI=0.5),  5 cycles flush penalty every 50 cycles  10% in performance (For CPI=1 we get 5 cycles flush penalty every 100 cycles  5% in performance) Misprediction Penalty In a Super Scalar CPU

6 out-of-order execution Lihu Rappoport 11/2004 6 Is Superscalar Good Enough ?  A superscalar processor can fetch, decode, execute and retire two instructions in parallel  However, two instruction can only be executed in parallel if they are independent  But … adjacent instructions are usually dependent –The utilization of the second pipe is usually low –There are algorithms in which both pipes are highly utilized  What to do ?  Out-of-order execution: –Execute instruction not in program order –Still need to do the same as the original program

7 out-of-order execution Lihu Rappoport 11/2004 7 Out Of Order Execution  Execute instructions based on “data flow” rather than program order  Basic idea: look a head in a window of instructions and find instructions that are ready to execute: –Do not have data dependencies with previous instructions, which were still not executed –Resources are available  Out-of-order execution is: Starting the execution stage of an instruction before the execution stage of a previous instruction  Advantages: –Help exploit Instruction Level Parallelism (ILP) –Help cover latencies (e.g., cache miss, divide)  Can Compilers do the Work ? –Compilers can statically reschedule instructions –Compilers do not have run time information (e.g., cache miss, conditional branch direction → limited to basic blocks)

8 out-of-order execution Lihu Rappoport 11/2004 8 Data Flow Analysis  Example : assume that a divide operation takes 20 cycles. (1) r1  r4 / r7 (2) r8  r1 + r2 (3) r5  r5 + 1 (4) r6  r6 - r3 (5) r4  r5 + r6 (6) r7  r8 * r4 134 2 5 6 Data Flow Graph 1 3 4 5 2 6 In-order execution 1 3 4 526 Out-of-order execution

9 out-of-order execution Lihu Rappoport 11/2004 9 OOOE - General Scheme  Fetch & decode instructions in parallel, but in order, to fill-out an inst. pool  Out of the instructions in the window, execute instructions that are ready: –All the data the required for the instruction are ready –Execution resources are available  Executed instructions signals all dependant instructions that the data is ready  Commit a few instructions in-order –An instruction can commit only after all preceding instructions (in program order) committed Fetch & Decode Instruction pool Retire (commit) In-order Execute Out-of-order

10 out-of-order execution Lihu Rappoport 11/2004 10 Out Of Order Execution - Example  Assume that executing a divide operation takes 20 cycles. (1)r1  r5 / r4 (2)r3  r1 + r8 (3)r8  r5 + 1 (4)r3  r7 - 2 (5)r6  r6 + r7  Inst2 has a RAW dependency on r1 with Inst1, so it cannot be executed in parallel with Inst1  Can successive instructions pass Inst2 ? –What about Inst3 ? No - Inst2 must read r8 before Inst3 writes to it –What about Inst4 ? No - Inst4 must write to r3 after Inst2 –What about Inst5 ? Yes

11 out-of-order execution Lihu Rappoport 11/2004 11 False Dependencies  OOOE creates new dependencies: –WAR : an instruction writes into a register which needs to be read by an earlier instruction –WAW: an instruction writes into a register which needs to be written by an earlier instruction  These are false dependencies: –There no missing data  Still, false dependencies prevent executing instructions out-of-order  Solution: Register Renaming

12 out-of-order execution Lihu Rappoport 11/2004 12 Register Renaming  Hold a pool of physical registers  Architectural registers are mapped into physical registers –When an instruction writes to an architectural register  A free physical register is allocated from the pool  The physical register points to the architectural register  The instruction writes the value to the physical register –When an instruction reads from an architectural register  reads the data from the latest instruction which writes to the same architectural register, and precedes the current instruction  If no such instruction exists, read directly from the architectural register –When an instruction commits  Moves the value from the physical register to the architectural register it points

13 out-of-order execution Lihu Rappoport 11/2004 13 OOOE with Register Renaming: Example cycle 1 cycle 2 (1)r1  mem1r1’  mem1 (2)r2  r2 + r1 r2’  r2 + r1’ (3)r1  mem2r1”  mem2 (4)r3  r3 + r1 r3’  r3 + r1” (5)r1  mem3r1”’  mem3 (6)r4  r5 + r1 r4’  r5 + r1”’ (7)r5  2r5’  2 (8)r6  r5 + 2 r6’  r5’ + 2 Register Renaming Benefits Removes false dependencies Removes architecture limit for # of registers WAW WAR

14 out-of-order execution Lihu Rappoport 11/2004 14 Executing Beyond Branches  The scheme we saw so far does not search beyond a branch  Limited to the parallelism within a basic-block – A basic-block is ~5 instruction long (1) r1  r4 / r7 (2)r2  r2 + r1 (3)r3  r2 - 5 (4)beq r3,0,300 If the beq is predicted NT, (5)r8  r8 + 1 Inst 5 can be spec executed  We would like to look beyond branches –But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path ? Solution: Speculative Execution

15 out-of-order execution Lihu Rappoport 11/2004 15 Speculative Execution Execution of instructions from a predicted (yet unsure) path Eventually, path may turn wrong Implementation: –Hold a pool of all “not yet executed” instructions –Fetch instructions into the pool from a predicted path –Instructions for which all operands are “ready” can be executed –An instruction may change the processor state (commit) only when it is safe  An instruction commits only when all previous (in-order) instructions had committed  Instructions commit in-order  Instructions which follow a branch commit only after the branch commits  If a predicted branch is wrong all the instructions which follow it are flushed Register Renaming helps speculative execution –Renamed registers are kept until speculation is verified to be correct

16 out-of-order execution Lihu Rappoport 11/2004 16 Speculative Execution - Example cycle 1 cycle 2 (1) r1  mem1 r1’  mem1 (2) r2  r2 + r1 r2’  r2 + r1’ (3) r1  mem2 r1”  mem2 (4) r3  r3 + r1 r3’  r3 + r1” (5) jmp cond L2 (6)L2 r1  mem3 r1”’  mem3 (7) r4  r5 + r1 r4’  r5 + r1”’ (8) r5  2 r5’  2 (9) r6  r5 + 2 r6’  r5’ + 2  Instructions 6-9 are speculatively executed –If the prediction turns wrong, they will be flushed  If the branch was predicted taken, the instructions from the other path would be have been speculatively executed. WAW WAR Speculative Execution

17 out-of-order execution Lihu Rappoport 11/2004 17 Out Of Order Execution - Summary  Advantages –Help exploit Instruction Level Parallelism (ILP) –Help cover latencies (e.g., cache miss, divide) –Superior/complementary to compiler scheduler  Dynamic instruction window  Reg Renaming: Can use more than the number architectural registers  Complex microarchitecture –Complex scheduler –Requires reordering mechanism (retirement) in the back-end for:  Precise interrupt resolution  Misprediction/speculation recovery  Memory ordering  Speculative Execution –Advantage: larger scheduling window  reveals more ILP –Issues: misprediction cost and misprediction recovery


Download ppt "Out-of-order execution Lihu Rappoport 11/2004 1 MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport."

Similar presentations


Ads by Google