EE524/CptS561 Advanced Computer Architecture Dynamic Scheduling A scheme to overcome data hazards.


1 EE524/CptS561 Advanced Computer Architecture Dynamic Scheduling A scheme to overcome data hazards

2 Advantages of Dynamic Scheduling
Dynamic scheduling: hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior.
- It handles cases where dependences are unknown at compile time
  - It lets the processor tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve
- It allows code that was compiled for one pipeline to run efficiently on a different pipeline
- It simplifies the compiler
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling

3 HW Schemes: Instruction Parallelism
Key idea: allow instructions behind a stall to proceed
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14
- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD does not depend on DIVD, so it can run while ADDD waits for F0)
  - In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
- We will distinguish when an instruction begins execution from when it completes execution; between the two times, the instruction is in execution
- Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder
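The dependences in the three-instruction snippet above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and the `hazards` function are hypothetical names, and the pairwise check is deliberately conservative (it ignores intervening writes).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def hazards(program):
    """Return (kind, earlier_index, later_index, register) for each dependent pair."""
    found = []
    for j, later in enumerate(program):
        for i in range(j):
            earlier = program[i]
            if earlier.dest in later.srcs:   # later reads what earlier writes
                found.append(("RAW", i, j, earlier.dest))
            if later.dest in earlier.srcs:   # later overwrites what earlier reads
                found.append(("WAR", i, j, later.dest))
            if later.dest == earlier.dest:   # both write the same register
                found.append(("WAW", i, j, later.dest))
    return found


program = [
    Instr("DIVD", "F0", ("F2", "F4")),
    Instr("ADDD", "F10", ("F0", "F8")),
    Instr("SUBD", "F12", ("F8", "F14")),
]
deps = hazards(program)
for dep in deps:
    print(dep)
```

Only ADDD depends on DIVD (through F0). SUBD appears in no dependence, which is why a dynamically scheduled pipeline may execute it while ADDD is stalled.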

4 Dynamic Scheduling Step 1
- The simple pipeline had one stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID stage of the simple 5-stage pipeline into 2 stages:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until no data hazards, then read operands

5 Tomasulo Algorithm
- Control and buffers are distributed with the function units (FUs)
  - FU buffers are called "reservation stations" (RS); they hold pending operands
- Registers in instructions are replaced by values or by pointers to reservation stations; this is called register renaming
  - Avoids WAR and WAW hazards
  - There are more reservation stations than registers, so the hardware can do optimizations compilers cannot
- Results go to FUs from RSs, not through registers, over the Common Data Bus (CDB), which broadcasts results to all FUs
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
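As an illustration of register renaming, the sketch below replaces each source register by the tag of the reservation station that will produce it. It uses the instruction stream and station names (load1, load2, mult1, add1, mult2, add2) from the worked example later in the deck. The `rename` function is a hypothetical helper, and the sketch is simplified: it never clears a tag when a result is written, so it shows only the state at issue time.

```python
def rename(program, stations):
    """program: (op, dest, *srcs) tuples; stations: RS tag assigned to each, in issue order."""
    status = {}    # register -> tag of the RS that will write it
    renamed = []
    for (op, dest, *srcs), tag in zip(program, stations):
        # A source with a pending producer becomes a pointer (tag);
        # otherwise it stays a plain register name, standing in for its value.
        new_srcs = tuple(status.get(s, s) for s in srcs)
        status[dest] = tag
        renamed.append((tag, op, new_srcs))
    return renamed, status


program = [
    ("LD", "F6", "R2"),
    ("LD", "F2", "R3"),
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]
stations = ["load1", "load2", "mult1", "add1", "mult2", "add2"]

renamed, status = rename(program, stations)
for row in renamed:
    print(row)
```

DIVD reads F6 and the later ADDD writes F6, which is a WAR hazard on the register name. After renaming, DIVD waits on `load1` while ADDD produces `add2`, so the two no longer conflict.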

6 Tomasulo scheme
[Figure: datapath of the Tomasulo scheme. FP operation queue (from instruction unit), FP registers, load buffers (from memory), store buffers (to memory), and reservation stations in front of the FP adders and FP multipliers, connected by the operation bus, the operand buses, and the common data bus (CDB).]

7 Reservation Station Components
- Op: operation to perform in the unit (e.g., + or -)
- Vj, Vk: values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: reservation stations producing the source registers (the values to be written)
  - Note: no ready flags as in the scoreboard; Qj, Qk = 0 means ready
  - Store buffers only have Qi, for the RS producing the result
- Busy: indicates the reservation station or FU is busy
- Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
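The fields listed above map naturally onto a small record. This is a sketch under stated assumptions: the class and field names are illustrative, `None` plays the role of "Q = 0", and the operand values (2.5 and 1.0) are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first source operand, once known
    vk: Optional[float] = None   # value of second source operand, once known
    qj: Optional[str] = None     # tag of the RS producing Vj; None means "Qj = 0"
    qk: Optional[str] = None     # tag of the RS producing Vk; None means "Qk = 0"

    def ready(self) -> bool:
        # No separate ready flag: the station is ready when it waits on no producer.
        return self.busy and self.qj is None and self.qk is None


# Register result status: which station will write each register (absent = blank).
register_status = {"F2": "load2", "F0": "mult1"}

# MULTD F0,F2,F4 just issued: F4 is available (made-up value), F2 still comes from load2.
mult1 = ReservationStation("mult1", busy=True, op="MULTD", vk=2.5, qj="load2")
print(mult1.ready())

# Once load2 delivers its value (made-up), the station holds both operands.
mult1.vj, mult1.qj = 1.0, None
print(mult1.ready())
```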

8 Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
- Normal data bus: data + destination ("go to" bus)
- Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of functional unit source address
  - A unit captures the value if the source matches the functional unit it expects (the one producing its result)
  - The bus does the broadcast
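The "come from" behavior of the common data bus can be sketched as a tag match against every waiting consumer. This is a minimal illustration, not the original hardware logic: the dictionary layout, the `broadcast` function, and all numeric values are assumptions, and freeing the producing station is left out.

```python
def broadcast(tag, value, stations, register_status, registers):
    """Deliver (source tag, value) to every consumer that is waiting on that tag."""
    for rs in stations.values():
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None
    # The register file listens too, but only where the tag is still the expected producer.
    for reg in [r for r, producer in register_status.items() if producer == tag]:
        registers[reg] = value
        del register_status[reg]


# State loosely modeled on the worked example when load2 writes its result.
stations = {
    "mult1": {"op": "MULTD", "vj": None, "vk": 4.0, "qj": "load2", "qk": None},
    "add1": {"op": "SUBD", "vj": 10.0, "vk": None, "qj": None, "qk": "load2"},
    "mult2": {"op": "DIVD", "vj": None, "vk": 10.0, "qj": "mult1", "qk": None},
}
register_status = {"F2": "load2", "F0": "mult1", "F8": "add1", "F10": "mult2"}
registers = {"F4": 4.0, "F6": 10.0}

broadcast("load2", 3.0, stations, register_status, registers)
print(stations["mult1"], stations["add1"], registers["F2"])
```

A single broadcast wakes both `mult1` and `add1` and updates F2, while `mult2` keeps waiting on `mult1`. The sender never names its destinations, which is the point of tagging the data with its source.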

9 Tomasulo Example Cycle 0

10-28 Tomasulo scheme, animated
[Figure: the Tomasulo datapath (FP operation queue, FP registers, load buffers, store buffers, reservation stations, FP adders, FP multipliers, operation bus, operand buses, common data bus) stepped through cycles 0 to 17 and cycle 57. The state that can be recovered from each frame:]
Cycle 0: LD F6, 34(R2) is at the head of the FP operation queue
Cycle 1: LD F6, 34(R2) issued; F6: load1. Next in queue: LD F2, 45(R3)
Cycle 2: LD F2, 45(R3) issued; F2: load2. Next in queue: MULTD F0,F2,F4
Cycle 3: MULTD F0,F2,F4 issued to mult1, waiting on load2 and holding the value of F4; F0: mult1. Load1 holds Mem[34+R2]. Next in queue: SUB F8,F6,F2
Cycle 4: SUB F8,F6,F2 issued to add1, waiting on load1 and load2; F8: add1. Load1 broadcasts Mem[34+R2] on the CDB; F6 <- Mem[34+R2]. Next in queue: DIVD F10,F0,F6
Cycle 5: DIVD F10,F0,F6 issued to mult2, waiting on mult1; F10: mult2. Load2 broadcasts Mem[45+R3] on the CDB; F2 <- Mem[45+R3]. Next in queue: ADD F6,F8,F2
Cycle 6: ADD F6,F8,F2 issued to add2, waiting on add1; F6: add2, F0: mult1, F10: mult2, F8: add1
Cycle 7: no change in register status
Cycle 8: Add1 broadcasts M()-M(); F8 <- M()-M()
Cycles 9-10: ADD holds both operands; F6: add2, F0: mult1, F10: mult2
Cycle 11: Add2 broadcasts (M()-M())+M(); F6 <- (M()-M())+M()
Cycles 12-15: only MULTD and DIVD remain; F0: mult1, F10: mult2
Cycle 16: Mult1 broadcasts M()*F4; F0 <- M()*F4
Cycle 17: DIVD holds both operands; F10: mult2
Cycle 57: Mult2 broadcasts M()*F4 / M(); F10 <- M()*F4 / M()

29-47 Tomasulo Example, Cycles 1 to 57
[Per-cycle status tables; only the slide notes survive.]
Cycle 2: Note: unlike the CDC 6600, can have multiple loads outstanding
Cycle 3: Note: register names are removed ("renamed") in the reservation stations; MULT issued vs. scoreboard. Load1 completing; what is waiting for Load1?
Cycle 4: Load2 completing; what is waiting for it?
Cycle 6: Issue ADDD here vs. scoreboard?
Cycle 7: Add1 completing; what is waiting for it?
Cycle 10: Add2 completing; what is waiting for it?
Cycle 11: Write result of ADDD here vs. scoreboard?
Cycle 12: Note: all quick instructions complete already
Cycle 15: Mult1 completing; what is waiting for it?
Cycle 16: Note: just waiting for divide
Cycle 56: Mult2 completing; what is waiting for it?
Cycle 57: Again, in-order issue, out-of-order execution and completion
Cycles 1, 5, 8, 9, 13, 14, and 55 carry no notes.
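The completion cycles quoted in the notes above (Load1 at 3, Load2 at 4, Add1 at 7, Add2 at 10, Mult1 at 15, Mult2 at 56) can be reproduced with a small timing calculation. This is a sketch under assumptions that the transcript does not state: execution latencies of 2 cycles for loads, 2 for add/subtract, 10 for multiply, and 40 for divide; one instruction issued per cycle; execution starting the cycle after the last operand is broadcast; and no contention for reservation stations or the CDB. The `schedule` function is a hypothetical helper.

```python
# Assumed execution latencies in cycles (not given in the transcript).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers); load address registers are taken as ready.
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program, latency):
    last_write = {}   # register -> cycle in which its most recently issued producer writes
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1                                   # in-order issue, one per cycle
        operands_ready = max([issue] + [last_write.get(s, 0) for s in srcs])
        start = operands_ready + 1                      # begin executing the next cycle
        complete = start + latency[op] - 1
        write = complete + 1                            # broadcast on the CDB
        last_write[dest] = write                        # later readers wait on this writer
        rows.append((op, dest, issue, complete, write))
    return rows


rows = schedule(PROGRAM, LATENCY)
for op, dest, issue, complete, write in rows:
    print(f"{op:6s}{dest:4s} issue={issue:2d} complete={complete:2d} write={write:2d}")
```

Because each source is resolved against the producer known at issue time, DIVD picks up F6 from the first load (written in cycle 4) rather than from the later ADDD, mirroring what renaming does in hardware. ADDD finishes in cycle 11, long before DIVD writes in cycle 57: in-order issue with out-of-order completion.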

48 Tomasulo Drawbacks
- Complexity
  - Delays of the 360/91, MIPS 10000, IBM 620?
- Many associative stores (CDB) at high speed
- Performance limited by the Common Data Bus
  - Multiple CDBs => more FU logic for parallel associative stores

49 Tomasulo Loop Example
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
- Assume multiply takes 4 clocks
- Assume the first load takes 8 clocks (cache miss?) and the second load takes 4 clocks (hit)
- To be clear, the example will show clocks for SUBI and BNEZ
- In reality, the integer instructions run ahead

50-71 Loop Example, Cycles 0 to 21
[Per-cycle status tables; only the slide notes survive.]
Cycle 3: Note: MULT1 has no register names in RS
Cycle 6: Note: F0 never sees Load1 result
Cycle 7: Note: MULT2 has no register names in RS
Cycle 9: Load1 completing; what is waiting for it?
Cycle 10: Load2 completing; what is waiting for it?
Cycle 14: Mult1 completing; what is waiting for it?
Cycle 15: Mult2 completing; what is waiting for it?
The remaining cycles carry no notes.
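The cycle 6 note, "F0 never sees Load1 result", follows from how the register result status is overwritten when the second iteration issues. The sketch below is a simplified illustration, not the slides' tables: the helper functions are hypothetical, the loaded values are made up, and only F0 and the two multiplies are modeled.

```python
def issue_load(dest, tag, status):
    status[dest] = tag                       # dest will now come from this load buffer


def issue_mult(src, tag, status, waiting):
    waiting[tag] = status.get(src)           # capture the producer tag, not the register name


def write_result(tag, value, status, waiting, regs):
    woken = [rs for rs, q in waiting.items() if q == tag]
    for rs in woken:
        waiting[rs] = None                   # operand delivered straight to the station
    written = [r for r, t in status.items() if t == tag]
    for r in written:
        regs[r] = value                      # register file updates only on a tag match
        del status[r]
    return woken, written


status, waiting, regs = {}, {}, {"F0": 0.0}

# Iteration 1: LD F0, 0(R1) then MULTD F4, F0, F2
issue_load("F0", "load1", status)
issue_mult("F0", "mult1", status, waiting)
# Iteration 2 issues before load1 (the assumed cache miss) has returned
issue_load("F0", "load2", status)
issue_mult("F0", "mult2", status, waiting)

first = write_result("load1", 1.5, status, waiting, regs)
second = write_result("load2", 2.5, status, waiting, regs)
print(first, second, regs)
```

The first load's value reaches `mult1` but not the register file, because F0 is already tagged with `load2`. Each iteration works on its own renamed copy of F0, which is the sense in which the hardware unrolls the loop.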

72 Tomasulo Summary
- Reservation stations: renaming to a larger set of registers + buffering of source operands
  - Prevents registers from becoming a bottleneck
  - Avoids the WAR and WAW hazards of the scoreboard
  - Allows loop unrolling in HW
- Not limited to basic blocks (the integer unit gets ahead, beyond branches)
- Helps with cache misses as well
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

73 [Figure: processor block diagram. Fetch unit and instruction cache; dispatch unit with 8-entry instruction queue; reservation stations in front of the execution units XSU0, XSU1, MCFXU, LSU, FPU, and BPU; data cache; completion unit with reorder buffer. Labeled connections: instruction dispatch buses, instruction operation buses, GP operand buses, GP result buses, FP operand buses, FP result buses, result status buses, register numbers, reorder buffer information, branch correction.]

