Presentation is loading. Please wait.

Presentation is loading. Please wait.

1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.

Similar presentations


Presentation on theme: "1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren."— Presentation transcript:

1 1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren Eindhoven University of Technology Section of Digital Information Systems

2 1/1/ / faculty of Electrical Engineering eindhoven university of technology Moving instructions around It is possible to change the execution order of instructions which do not have dependencies without renaming:with renaming: 1)R1 := R2 + 3R1b := R2a + 3 2)R3 := R1 x 2R3b := R1b x 2 3)R1 := R6 + R2R1c := R6a + R2a 4)R2 := R1 - 15R2b := R1c - 15 True dependencies: 2) comes after 1), 4) comes after 3) With renaming, these are the only sequence restrictions ! 3)R1c := R6a + R2a 4)R2b := R1c - 15 1)R1b := R2a + 3 2)R3b := R1b x 2 3)R1c := R6a + R2a 1)R1b := R2a + 3 2)R3b := R1b x 2 4)R2b := R1c - 15

3 1/1/ / faculty of Electrical Engineering eindhoven university of technology Out-of-order (OOO) execution Changing the order of instruction execution can remove pipeline stalls and/or fill delay slots: increase the performance –Instructions can be re-ordered in the program, but this is not OOO execution ! OOO execution: instructions are sent to the operational units (ALU, load/store...) in a different order than the program specifies OOO memory accessing is not discussed here

4 1/1/ / faculty of Electrical Engineering eindhoven university of technology Instruction buffers for OOO execution To be able to change the execution order, fetched instructions must be buffered fetch & decode ALU load/ store scheduler reservation station reservation station (renamed) registers scheduler central instruction window program memory fetch & decode ALU load/ store (renamed) registers 1) Separate instruction buffers for each functional unit 2) Central instruction buffer program memory

5 1/1/ / faculty of Electrical Engineering eindhoven university of technology Differences between buffer strategies Reservation stations have advantages +Smaller buffers, schedulers are simpler +Buffer entries can be tailored to instruction format +Routing of instructions across chip simpler The central instruction window also has advantages +Total number of buffered instructions can be smaller +The single scheduler can take better decisions +No ‘false locking’ with identical functional units

6 1/1/ / faculty of Electrical Engineering eindhoven university of technology False locking between functional units Instruction sequence: A1, B1, A2, B2, A3, B3, A4, B4 1 2 scheduler reservation stations (renamed) registers ALU’s fetch & decode program memory A4 A3 A2 A1 B4 B3 B2 B1 A1 B1 A4 A3 A2 B4 B3 B2 A1 B2 A4 A3 A2 B4 B3 A1 B3 A4 A3 A2 B4 A1 B4 A4 A3 A2A1 A4 A3 A2 A1 B1 A4 A3 A2 B4 B3 B2 locked A4 A3 B2 A1 B4 B3 A2 B1 A1 B1 A4 A3 B2 B4 B3 A2 A1 B1 A4 A3 B2 B4 B3 A2 locked A4 A3 B2 B4 B3 A2 A1 A4 A3 B2 B4 B3 A2 A1 false locking This will not happen with a central instruction window ! Hybrid solution:one reservation station + one scheduler for multiple identical functional units

7 1/1/ / faculty of Electrical Engineering eindhoven university of technology Scheduler operation The schedulers actually have only a simple task Pick ready-to-execute instructions from their buffers and send them to the appropriate operational units 'ready-to-execute'  with all source values known Try to calculate conditional jump results ASAP Otherwise: oldest instructions first

8 1/1/ / faculty of Electrical Engineering eindhoven university of technology ‘Ready to execute’ determination The scheduler(s) depend on other system parts to determine which instructions can be executed The fetch unit knows the original order of the instructions and must determine the dependencies The operational units signal the end of a dependency when writing a result operand The instruction buffer(s) determine from this information which instructions are ready to execute and store this knowledge in status flags

9 1/1/ / faculty of Electrical Engineering eindhoven university of technology The ‘scoreboard’, again A simple scoreboarding technique can be used for ‘ready to execute’ determination –Renamed registers get a flag bit which indicates the register does not contain a result yet –Each renamed destination register write sets the attached flag bit to indicate the result is available An instruction is ready to execute when all the flag bits of it's renamed source registers are set

10 1/1/ / faculty of Electrical Engineering eindhoven university of technology The problem with interrupts and traps OOO completion means instructions results may be written in an order which differs from the instruction sequence in the program –If an instruction generates a trap, instructions following it may already have changed registers (and/or memory locations !) –If an interrupt must break off processing, some instructions may not complete while later ones in the program have already completed

11 1/1/ / faculty of Electrical Engineering eindhoven university of technology Solution: a ‘safe state’ register set With these imprecise interrupts and traps, it is almost impossible to get the processor in a state from which it can be safely restarted We must find a way to maintain the 'visible' set of processor registers in a 'safe state': updated in the normal program order –We don't care if this updating of the safe state lags behind the normal updating of the renamed set

12 1/1/ / faculty of Electrical Engineering eindhoven university of technology 'reorder buffer' renamed registers safe register set Implementation of the safe state One common way to provide this 'safe' register set is by using a so-called 'reorder buffer' result bus(es) renamed register number read pointerwrite pointer 'head' 'tail' simulated FIFO renamed real register number in-order updates source operand 0 1 valid flags operand valid

13 1/1/ / faculty of Electrical Engineering eindhoven university of technology Safe register set Operation of the reorder buffer Four instructions writing to (real) registers R2, R1, R2 & R3 renamed renamed register real register value valid real register result renamed register number ‘head ’ NR2?6 N?R1:12 Y6R2:47 N?R3:114 N?R4:0 NR1?7 NR2?6 Y7R1:12 Y6R2:47 N?R3:114 N?R4:0 Y7R1:12 Y8R2:47 N?R3:114 N?R4:0 NR2?8 NR1?7 NR2?6 Y7R1:12 Y8R2:47 Y9R3:114 N?R4:0 NR3?9 NR2?8 NR1?7 NR2?6 N?R1:12 N?R2:47 N?R3:114 N?R4:0 Y7R1:12 Y8R2:47 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 NR2?63 Y7R1:12 Y8R2:47 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 YR26786 678 Y7R1:12 Y8R2:678 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 N?R1:33 Y8R2:678 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 N?R1:33 Y8R2:678 Y9R3:114 N?R4:0 NR3?9 NR2?8 6 N?R1:33 Y8R2:678 Y9R3:114 N?R4:0 YR369 NR2?8 10 N?R1:33 Y8R2:678 Y9R3:114 N?R4:0 YR369 YR2108 N?R1:33 N?R2:10 Y9R3:114 N?R4:0 YR369 YR2108 N?R1:33 N?R2:10 Y9R3:114 N?R4:0 YR369 N?R1:33 N?R2:10 N?R3:6 N?R4:0 YR369 N?R1:33 N?R2:10 N?R3:6 N?R4:0 Y7R1:12 Y8R2:678 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 YR26786 ‘retiring’ Reorder buffer FIFO

14 1/1/ / faculty of Electrical Engineering eindhoven university of technology Other solutions and variations Both 'history buffer' and 'future file' are (minor) variations/extensions on the reorder buffer A central instruction window can combine the reorder buffer and instruction buffer functions 'Checkpoint repair' makes backups of the complete register set when problems may occur –Only instructions which were already in execution at the time of the backup modify the backup's state (these must complete execution)

15 1/1/ / faculty of Electrical Engineering eindhoven university of technology OOO execution & conditional jumps Machines uncapable to move instructions across (conditional) jumps will not perform well –Basic block sizes of 4..6 instructions are normal for CISC's (6..8 instructions for RISC's) –Around half of the jumps is conditional ! The problem with conditional jumps –If the prediction is wrong, the processor state must be restored to the point of the jump instruction In fact, the same as if a trap occurred

16 1/1/ / faculty of Electrical Engineering eindhoven university of technology ‘Speculative’ OOO conditional jumps (1) 'Speculative fetching’ fetches and decodes instructions after the conditional jump, but does not take them in execution 'Speculative execution’ also executes instructions in the predicted path, using renaming as buffer for the in-order (safe) state –The speculative renamed registers are discarded when the prediction was incorrect –Rename indexes must be restored ! (checkpoint repair ?)

17 1/1/ / faculty of Electrical Engineering eindhoven university of technology ‘Speculative’ OOO conditional jumps (2) 'Multi-path speculative execution’ extends speculative execution to handle both paths following a conditional branch –may also allow multiple condition tests to be unresolved (needs more checkpointing buffers) Retiring of renamed registers is frozen for speculative renamed registers until the branch outcome is known

18 1/1/ / faculty of Electrical Engineering eindhoven university of technology Handling more instructions per clock Fetching more than one instruction per clock is generally not such a problem –Make the bus to the instruction memory wider ! Need more than one functional unit to actually execute the instructions in parallel Must also decode more than one instruction per clock to get a 'superscalar' processor

19 1/1/ / faculty of Electrical Engineering eindhoven university of technology Superscalar parts we have already seen Instruction decoders can easily send multiple instructions to separate reservation stations –With a minor increase in complexity even multiple instructions to the same reservation station The central instruction window can be modified to receive multiple instructions in a single cycle –The scheduler can be changed to handle multiple instructions in parallel

20 1/1/ / faculty of Electrical Engineering eindhoven university of technology Superscalar dependency detection Instruction dependency determination must now be partially implemented in a parallel form –Renamed register indexes must be forwarded between concurrently decoded instructions –It must be possible to create multiple renamed registers in a single cycle It must also be possible to update multiple in-order (safe) registers in parallel !

21 1/1/ / faculty of Electrical Engineering eindhoven university of technology Another method to go superscalar Very Large Instruction Word (VLIW) machines pack several ‘normal’ instructions in a single ‘superinstruction’ –They execute this superinstruction using separate functional units With all scheduling done by the compiler ! –Programming VLIW machines in assembly language is virtually impossible faster

22 1/1/ / faculty of Electrical Engineering eindhoven university of technology VLIW, but not exactly The Intel 80860 processor uses another trick which resembles VLIW operation –It always fetches two instructions at a time –If the first one is a floating point operation, it checks a flag in this instruction –If this flag is set, it assumes the second one is not a floating point operation and executes both in parallel Intel Pentium ‘pairs’ instructions without flags


Download ppt "1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren."

Similar presentations


Ads by Google