© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture 10-11 Instruction Execution: Dynamic Scheduling.

1 © Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture 10-11 Instruction Execution: Dynamic Scheduling

2 Outline
General concepts
– dataflow
– dynamic scheduling with Tomasulo’s Algorithm
The P6 Execution Microarchitecture
Dynamic Scheduling Issues

3 The Execution Problem
[Figure: Instruction Supply, Execution Mechanism, Data Supply]
We are able to deliver instructions at high bandwidth, and we have techniques for high-bandwidth, low-latency data supply.
But none of that matters if the execution mechanism cannot consume everything at high bandwidth. We need to execute instructions in parallel.
Fundamental problem: executing instructions strictly in the order prescribed by the programmer lets instruction dependences limit parallel execution.

4 Dynamic Scheduling
Reservation Station
Renaming
Retirement/Recovery
Memory Disambiguation
Tomasulo’s Algorithm

5 Dataflow Concepts
Source Code:
x = (a * b) - (c + d);
y = (r + s) / (t + v);
Machine Code:
1. MUL Ra, Rb -> Rm
2. ADD Rc, Rd -> Rn
3. SUB Rm, Rn -> Rx
4. ADD Rr, Rs -> Rm
5. ADD Rt, Rv -> Rn
6. DIV Rm, Rn -> Ry
[Figure: Dataflow Graph. Nodes 1 and 2 feed node 3; nodes 4 and 5 feed node 6.]
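The dataflow graph on this slide can be derived mechanically from the machine code. The sketch below is illustrative only: the tuple encoding of instructions, the function names, and the unit-latency assumption behind the "levels" are not from the lecture.

```python
# Each instruction: (number, opcode, source registers, destination register),
# transcribed from the machine code on the slide.
PROGRAM = [
    (1, "MUL", ("Ra", "Rb"), "Rm"),
    (2, "ADD", ("Rc", "Rd"), "Rn"),
    (3, "SUB", ("Rm", "Rn"), "Rx"),
    (4, "ADD", ("Rr", "Rs"), "Rm"),
    (5, "ADD", ("Rt", "Rv"), "Rn"),
    (6, "DIV", ("Rm", "Rn"), "Ry"),
]

def dataflow_edges(program):
    """Return (producer, consumer) pairs for true (read-after-write) dependences."""
    last_writer = {}  # register -> number of the most recent instruction writing it
    edges = []
    for num, _op, srcs, dest in program:
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], num))
        last_writer[dest] = num
    return edges

def dataflow_levels(program):
    """Earliest step for each instruction if only true dependences mattered.

    Assumes every operation takes one step (hypothetical unit latency).
    """
    edges = dataflow_edges(program)
    level = {}
    for num, *_ in program:
        preds = [p for p, c in edges if c == num]
        level[num] = 1 + max((level[p] for p in preds), default=0)
    return level

edges = dataflow_edges(PROGRAM)
levels = dataflow_levels(PROGRAM)
print(edges)   # [(1, 3), (2, 3), (4, 6), (5, 6)]
print(levels)  # {1: 1, 2: 1, 3: 2, 4: 1, 5: 1, 6: 2}
```

Under these assumptions instructions 1, 2, 4, and 5 could all start together, which is the parallelism that in-order execution with reused registers Rm and Rn would hide.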

6 Data Dependences
Data flow dependence
– consumer-producer relationship
– register bypass and interlocks
Output dependences and antidependences
– caused by reuse of registers at compile time
– removed by register renaming

7 Interlocking
Allow an instruction to execute only when its data and resources are ready
– simple interlocking based on bypass logic for short pipelines
– scoreboarding for deep pipelines
– Tomasulo’s Algorithm for out-of-order instruction dispatch

8 Tomasulo’s Algorithm
Invented for the floating-point unit (FPU) of the IBM System/360 Model 91
First published in 1967 (IBM Journal)
Not used in general CPU design until the 1990s
– by then, the branch prediction and exception recovery problems had been solved

9 Tomasulo’s Algorithm
Register renaming
– tags for values
Out-of-order execution
– reservation stations
Data forwarding
– common data bus

10 Tomasulo’s Algorithm
Instruction decode
– read the register file for each source's value and tag
– a tag is a handle for data that is still being generated
– determine the reservation station (RS) entry that will hold the decoded operation

11 Reservation Station
Hardware mechanism that enables instructions to execute out-of-order and as early as their source operands are ready.
An instruction waits in the RS until the tags for its source operands have been broadcast by their producers.

12 Tomasulo’s Algorithm
Instruction issue
– insert the operation and its operands into the assigned reservation station entry
– mark the destination register as not ready

13 Tomasulo’s Algorithm
Operation dispatch
– identify operations ready for execution
– determine the highest-priority operation for each port/function unit

14 Tomasulo’s Algorithm
Data forwarding
– result value and tag distributed to RS entries for associative search
– result value and tag delivered to destination register for potential update
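The decode, issue, dispatch, and forwarding steps above can be sketched in a few lines. This is a toy model under stated assumptions, not the IBM 360/91 design: the class and field names are hypothetical, every ready operation dispatches at once (unlimited function units), each operation takes one step, and DIV is integer division.

```python
import operator
from dataclasses import dataclass

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.floordiv}

@dataclass
class RSEntry:
    tag: str          # identifies this entry and the result it will produce
    op: str
    values: list      # operand values; None while a value is still pending
    waiting_on: list  # producer tag per operand; None once the operand is ready

class TomasuloSketch:
    def __init__(self, regs):
        self.value = dict(regs)  # register -> latest written-back value
        self.reg_tag = {}        # register -> tag of its pending producer
        self.rs = []             # occupied reservation station entries
        self.issued = 0

    def issue(self, op, src1, src2, dest):
        """Decode/issue: read a value or a tag per source, mark dest not ready."""
        values, waiting = [], []
        for reg in (src1, src2):
            tag = self.reg_tag.get(reg)
            values.append(None if tag else self.value[reg])
            waiting.append(tag)
        entry = RSEntry(f"RS{self.issued}", op, values, waiting)
        self.issued += 1
        self.rs.append(entry)
        self.reg_tag[dest] = entry.tag  # destination now names its producer
        return entry.tag

    def step(self):
        """Dispatch every ready entry, then forward results by tag."""
        ready = [e for e in self.rs if all(t is None for t in e.waiting_on)]
        results = {e.tag: OPS[e.op](*e.values) for e in ready}
        self.rs = [e for e in self.rs if e.tag not in results]
        for e in self.rs:  # associative search of waiting entries
            for i, t in enumerate(e.waiting_on):
                if t in results:
                    e.values[i], e.waiting_on[i] = results[t], None
        for reg, t in list(self.reg_tag.items()):  # potential register update
            if t in results:
                self.value[reg] = results[t]
                del self.reg_tag[reg]
        return sorted(results)

m = TomasuloSketch({"Ra": 2, "Rb": 3, "Rc": 4, "Rd": 5,
                    "Rr": 6, "Rs": 8, "Rt": 1, "Rv": 1})
for inst in [("MUL", "Ra", "Rb", "Rm"), ("ADD", "Rc", "Rd", "Rn"),
             ("SUB", "Rm", "Rn", "Rx"), ("ADD", "Rr", "Rs", "Rm"),
             ("ADD", "Rt", "Rv", "Rn"), ("DIV", "Rm", "Rn", "Ry")]:
    m.issue(*inst)
first = m.step()
second = m.step()
print(first, second, m.value["Rx"], m.value["Ry"])
```

With the example program from the Dataflow Concepts slide, the first step runs instructions 1, 2, 4, and 5 together and the second runs 3 and 6. The result of instruction 1 never reaches register Rm, because Rm's tag already names instruction 4; that is the "potential update" in the last bullet.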

15 Renaming
Objective: eliminate WAR and WAW (false) dependences
Renaming happens in program order
Renaming requires a table to map between architectural registers and physical registers
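A minimal sketch of table-based renaming follows. It assumes an unbounded supply of physical registers named P0, P1, ... (no free list or reclamation), and the function name and instruction encoding are illustrative rather than taken from the lecture.

```python
def rename(program, arch_regs):
    """Rename (op, sources, dest) instructions in program order.

    Hypothetical model: every destination gets a fresh physical register.
    """
    rat = {reg: f"P{i}" for i, reg in enumerate(arch_regs)}  # map table
    next_phys = len(arch_regs)
    renamed = []
    for op, srcs, dest in program:
        psrcs = tuple(rat[s] for s in srcs)  # read mappings before updating
        pdest = f"P{next_phys}"
        next_phys += 1
        rat[dest] = pdest                    # later readers see the new name
        renamed.append((op, psrcs, pdest))
    return renamed

ARCH = ["Ra", "Rb", "Rc", "Rd", "Rm", "Rn", "Rr", "Rs", "Rt", "Rv", "Rx", "Ry"]
PROGRAM = [
    ("MUL", ("Ra", "Rb"), "Rm"),
    ("ADD", ("Rc", "Rd"), "Rn"),
    ("SUB", ("Rm", "Rn"), "Rx"),
    ("ADD", ("Rr", "Rs"), "Rm"),
    ("ADD", ("Rt", "Rv"), "Rn"),
    ("DIV", ("Rm", "Rn"), "Ry"),
]
renamed = rename(PROGRAM, ARCH)
for inst in renamed:
    print(inst)
```

After renaming, the two writes to Rm (and to Rn) target different physical registers, so only the true dependences 1, 2 -> 3 and 4, 5 -> 6 remain.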

16 Retirement
What happens if we inadvertently execute an instruction that should not have been executed (e.g., after a branch misprediction) or an instruction fails to execute correctly (e.g., it raises an exception)?
We need to flush all bad instructions and make it look as if they never executed, and then start executing from the correct point.

17 Retirement using Reorder Buffer
[Figure: Reorder Buffer holding instructions in program order, with a head pointer and a tail pointer; values arrive from the data bus.]
An instruction that reaches the head and executes without exception can be safely retired.
Flushing in-flight instructions is easy: clear out the RS and ROB.
Recovering RAT state is hard. That’s where the ROB comes in.
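The in-order retirement rule can be sketched with a queue. The entry fields and method names below are assumptions made for illustration; the sketch retires as many completed head entries as it can and flushes everything when the head entry has raised an exception.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # left end is the head (oldest instruction)

    def allocate(self, dest):
        """Allocate at the tail, in program order."""
        entry = {"dest": dest, "value": None, "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, exception=False):
        """Record a result arriving from the data bus, possibly out of order."""
        entry.update(value=value, done=True, exception=exception)

    def retire(self, regfile):
        """Retire completed instructions from the head; flush on exception."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                self.entries.clear()  # discard all in-flight instructions
                break
            regfile[head["dest"]] = head["value"]
            self.entries.popleft()
            retired.append(head["dest"])
        return retired

rob, regfile = ReorderBuffer(), {}
e1, e2, e3 = rob.allocate("Rm"), rob.allocate("Rn"), rob.allocate("Rx")
rob.complete(e2, 9)                    # younger instruction finishes first
blocked = rob.retire(regfile)          # head (e1) not done: nothing retires
rob.complete(e1, 6)
in_order = rob.retire(regfile)         # now Rm then Rn retire
rob.complete(e3, exception=True)
after_fault = rob.retire(regfile)      # faulting head: flush, commit nothing
print(blocked, in_order, after_fault, regfile)
```

The register file only ever sees values in program order, so a flush leaves it in a state that looks as if the discarded instructions never executed.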

18 Putting it all together
[Figure: Register Alias Table, Reservation Stations, FU, Reorder Buffer]

19 Memory Disambiguation
1. MUL Ra, Rb -> Rm
2. ADD Rc, Rd -> Rn
3. ST Rm -> 0(Rn)
4. LD 0(Rs) -> Rm
5. ADD Rt, Rv -> Rn
6. DIV Rm, Rn -> Ry
[Figure: dataflow graph of instructions 1 to 6, with the edge from the store (3) to the load (4) marked "???".]
Whether the load depends on the store depends on whether Rn == Rs.

20 Conceptual Memory Order Buffer
[Figure: buffer entries with fields L/S, Addr, Value, and valid (V) bits; loads and stores are held in program order.]
Stores write into the buffer and pass to memory only after they reach the head and are retired.
What about loads?
– Could go in order (highly conservative)
– Could wait until all previous unknown store addresses are known (not so conservative)
– Could go as soon as the address is known (optimistic)
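The three load policies can be written as one predicate over the older memory operations in the buffer. The policy names, the dictionary fields, and the function are hypothetical; the sketch only decides whether a load may start and leaves out store-to-load forwarding and the misspeculation check that the optimistic policy would need.

```python
def load_may_issue(policy, older_ops, load_addr_known=True):
    """Decide whether a load may access memory now.

    older_ops: memory operations ahead of the load in program order, each a
    dict with "kind" ("LD" or "ST"), "addr_known", and "done" (assumed fields).
    """
    if not load_addr_known:
        return False
    if policy == "in_order":              # highly conservative
        return all(op["done"] for op in older_ops)
    if policy == "wait_for_store_addrs":  # not so conservative
        return all(op["addr_known"] for op in older_ops if op["kind"] == "ST")
    if policy == "optimistic":            # go as soon as the address is known
        return True
    raise ValueError(f"unknown policy: {policy}")

unknown_store = [{"kind": "ST", "addr_known": False, "done": False}]
known_store = [{"kind": "ST", "addr_known": True, "done": False}]

for policy in ("in_order", "wait_for_store_addrs", "optimistic"):
    print(policy,
          load_may_issue(policy, unknown_store),
          load_may_issue(policy, known_store))
```

In the Memory Disambiguation example, the load in instruction 4 would wait for the store in instruction 3 to compute 0(Rn) under the middle policy, and would go ahead immediately under the optimistic one.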

21 The P6 Execution Microarchitecture [making dynamic scheduling work at wide issue]
[Figure: pipeline blocks Fetch/Decode, Renaming, Scheduling/Execution, Memory, Retirement, with in-order and out-of-order regions marked]

22 The P6 Register Alias Table
[Figure: RAT entries with fields ROB Entry Number and RRF Valid. Labels: Srcs for μop0, Srcs for μop1, Srcs for μop2, Dests for μops, ROB Allocator, From retire (Dest, ROB entry #s), Physical sources.]
If the producer has already retired, the value is in the Retirement Register File (RRF Valid is 1).
If the producer has not retired, then the value will have to be provided by the Reorder Buffer at the ROB Entry Number indicated in the RAT (RRF Valid is 0).
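The RAT lookup described above amounts to a two-way choice per source operand. The sketch below uses plain dictionaries with hypothetical field names; the third case, returning a tag when the ROB entry has no result yet, is an assumption about how the operand would then wait in the reservation station.

```python
def read_source(reg, rat, rrf, rob):
    """Locate a source operand for renaming (illustrative field names)."""
    entry = rat[reg]
    if entry["rrf_valid"]:                 # producer retired: value is in RRF
        return ("value", rrf[reg])
    producer = rob[entry["rob_entry"]]     # producer still in flight
    if producer["valid"]:                  # result already written to the ROB
        return ("value", producer["value"])
    return ("tag", entry["rob_entry"])     # assumed: wait on this ROB entry

rrf = {"Ra": 10, "Rm": 0, "Rn": 0}
rob = {3: {"valid": True, "value": 42},
       4: {"valid": False, "value": None}}
rat = {"Ra": {"rrf_valid": True, "rob_entry": None},
       "Rm": {"rrf_valid": False, "rob_entry": 3},
       "Rn": {"rrf_valid": False, "rob_entry": 4}}

from_rrf = read_source("Ra", rat, rrf, rob)
from_rob = read_source("Rm", rat, rrf, rob)
pending = read_source("Rn", rat, rrf, rob)
print(from_rrf, from_rob, pending)
```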

23 ReOrder Buffer (ROB): Psrc Read and Pdest Write
[Figure: ROB entries with fields V, Value, Dest, and Status. Labels: PSrcs for μop0, PSrcs for μop1, PSrcs for μop2, PDests for μops from allocator, Values for Psrcs, Execution results from function units.]

24 Retirement Register File: Psrc Read
[Figure: RRF entries with field Value. Labels: PSrcs for μop0, PSrcs for μop1, PSrcs for μop2, Values for Psrcs, From ReOrder Buffer retirement.]

25 Issue
[Figure: RAT, RRF, ROB, Reservation Station]
Rename (RAT access)
Register Read (also ROB allocate)
Issue (RS allocate)

26 P6 Reservation Station
[Figure: reservation station entry with fields Entry Valid, Opcode, ROB Entry #, Psrc0 tag, Psrc0 data, Psrc0 V, Psrc1 tag, Psrc1 data, Psrc1 V.]
Up to three μops per cycle are added to the reservation station.

27 Execution
[Figure: Reservation Station dispatching through Port0, Port1, Port2, Port3, and Port4 to Integer Unit0, Integer Unit1, Floating point unit, Load addr gen, and Store addr gen; also shown are the Memory Order Buffer, the Data Cache, and the path to the Reorder Buffer.]

28 Memory Order Buffer
[Figure: MOB entries with fields L/S, Store ID, ST Addr, LD Addr, ST Data.]
Address allocation happens in order, at issue.
Store data is buffered in the MOB until retirement of that store.
STIDs correspond to the entry of the previous store.
P6 rule: STs must go in order with respect to other STs. LDs can go out of order with respect to other LDs and STs. LDs go as soon as the address is ready. Clean up at retirement.

29 Retirement
[Figure: Reorder Buffer entries with fields V, Value, Dest, and Status; Head Pointer.]
If Status indicates all is OK, then the value is written, or committed, to the RRF. The (Dest, ROB entry number) pair is also sent to the RAT to potentially set the RRF Valid bit.
If Status indicates something went wrong, then a recovery action is started.
Up to 3 μops can be retired per cycle.
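The retirement step can be sketched as follows. This is a simplified model with assumed names: the ROB is a dictionary keyed by entry number with no wraparound, and "potentially set" is read as "set RRF Valid only if the RAT still points at the retiring ROB entry".

```python
def retire(rob, head, rat, rrf, width=3):
    """Retire up to `width` completed uops from the head.

    Returns (new_head, action), where action is "continue" or "recover".
    """
    for _ in range(width):
        entry = rob.get(head)
        if entry is None or not entry["valid"]:
            break                                # head has not finished yet
        if entry["status"] != "ok":
            return head, "recover"               # start a recovery action
        dest = entry["dest"]
        rrf[dest] = entry["value"]               # commit to the RRF
        mapping = rat[dest]
        if not mapping["rrf_valid"] and mapping["rob_entry"] == head:
            mapping["rrf_valid"] = True          # no newer producer in flight
        del rob[head]
        head += 1
    return head, "continue"

rob = {0: {"valid": True, "status": "ok", "dest": "Rm", "value": 6},
       1: {"valid": True, "status": "ok", "dest": "Rn", "value": 9},
       2: {"valid": False, "status": None, "dest": "Rm", "value": None}}
rat = {"Rm": {"rrf_valid": False, "rob_entry": 2},   # newest Rm producer
       "Rn": {"rrf_valid": False, "rob_entry": 1}}
rrf = {}
head, action = retire(rob, 0, rat, rrf)
print(head, action, rrf, rat)
```

In the example, Rm keeps RRF Valid at 0 after entry 0 retires because a younger writer of Rm (entry 2) is still in flight, while Rn switches to the RRF.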

30 Recovery
ROB – flush all insts.
RS – flush all insts.
RRF – do nothing.
RAT – make all entries indicate RRF valid.
Send new PC to the fetch mechanism.

31 Reservation Station Alternative Designs
Value-capture reservation stations vs. tag-only reservation stations
– Pentium 4 adjusts tags rather than moving values when retiring an instruction
– Need to keep entries in the ROB longer, until they are no longer needed to keep the retired value visible to subsequent instructions

32 Other thoughts
How many cycles for branch misprediction?
Read Sohi and Smith for more general concepts
Read about the MIPS R10K for details on an alternative implementation

33 Data Dependencies
Read After Write – Flow
1. MUL Ra, Rb -> Rm
3. SUB Rm, Rn -> Rx
Write After Write – Output
1. MUL Ra, Rb -> Rm
4. ADD Rr, Rs -> Rm
Write After Read – Anti
3. SUB Rm, Rn -> Rx
4. ADD Rr, Rs -> Rm
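The three dependence types can be checked pairwise from source and destination registers. The function below is an illustrative sketch with assumed names; it compares only the two instructions given and ignores any writes that fall between them.

```python
def classify(earlier, later):
    """Return the dependence kinds from `earlier` to `later`.

    Each instruction is (source_registers, destination_register).
    """
    e_srcs, e_dest = earlier
    l_srcs, l_dest = later
    kinds = set()
    if e_dest in l_srcs:
        kinds.add("RAW")  # flow: later reads what earlier wrote
    if e_dest == l_dest:
        kinds.add("WAW")  # output: both write the same register
    if l_dest in e_srcs:
        kinds.add("WAR")  # anti: later overwrites what earlier reads
    return kinds

I1 = (("Ra", "Rb"), "Rm")  # 1. MUL Ra, Rb -> Rm
I3 = (("Rm", "Rn"), "Rx")  # 3. SUB Rm, Rn -> Rx
I4 = (("Rr", "Rs"), "Rm")  # 4. ADD Rr, Rs -> Rm

print(classify(I1, I3))  # {'RAW'}
print(classify(I1, I4))  # {'WAW'}
print(classify(I3, I4))  # {'WAR'}
```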

