
1 Superscalar Processors J. Nelson Amaral

2 Scalar to Superscalar
Scalar Processor: one instruction passes through each pipeline stage in each cycle
Superscalar Processor: multiple instructions occupy each pipeline stage in each cycle – Wider pipeline
Superpipelined Processor: stages are decomposed into smaller stages → More stages – Deeper pipeline
Baer p. 75

3 Superscalar
Front end (IF and ID)
– Must fetch and decode multiple instructions per cycle
– m-way superscalar: brings (ideally) m instructions per cycle into the pipeline
Back end (EX, Mem and WB)
– Must execute and write back several instructions per cycle
Baer p. 75

4 Superscalar
In-order (or static)
– Instructions leave the front end in program order
Out-of-order (or dynamic)
– Instructions leave the front end, and execute, in an order different from program order
– WB is called the commit stage; it must ensure that program semantics are preserved
– More complex design
Baer p. 76

5 Limits to Superscalar Performance
Superscalars rely on exploiting Instruction-Level Parallelism (ILP)
– They remove WAR and WAW dependences
– But the amount of ILP is limited by RAW (true) dependences
Example:
S0: R1 ← R2 + R3
S1: R4 ← R1 + R5
S2: R1 ← R6 + R7
S3: R4 ← R1 + R9
Data Dependence Graph:
S0 → S1: RAW (R1)
S0 → S2: WAW (R1)
S1 → S2: WAR (R1)
S1 → S3: WAW (R4)
S2 → S3: RAW (R1)
Baer p. 76

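The dependence graph for the four-instruction example (S0 to S3) can be derived mechanically. The following is a minimal sketch, not taken from the slides: the function name `dependences` and the tuple encoding of instructions are illustrative assumptions. It scans the program in order, remembers the last writer and the readers of each register, and reports each RAW, WAW, and WAR edge.

```python
from collections import defaultdict

# Each statement is (name, destination register, source registers).
program = [
    ("S0", "R1", ("R2", "R3")),
    ("S1", "R4", ("R1", "R5")),
    ("S2", "R1", ("R6", "R7")),
    ("S3", "R4", ("R1", "R9")),
]

def dependences(stmts):
    """Return a list of (earlier, later, kind, register) dependence edges."""
    last_writer = {}              # register -> most recent statement writing it
    readers = defaultdict(list)   # register -> statements that read it since that write
    edges = []
    for name, dest, srcs in stmts:
        # RAW: this statement reads a value produced by an earlier statement.
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], name, "RAW", reg))
        # WAW: this statement overwrites a register written earlier.
        if dest in last_writer:
            edges.append((last_writer[dest], name, "WAW", dest))
        # WAR: this statement overwrites a register that earlier statements still read.
        for reader in readers[dest]:
            edges.append((reader, name, "WAR", dest))
        # Record this statement's reads, then start a new lifetime for dest.
        for reg in srcs:
            readers[reg].append(name)
        last_writer[dest] = name
        readers[dest] = []
    return edges

for earlier, later, kind, reg in dependences(program):
    print(f"{earlier} -> {later}: {kind} on {reg}")
```

Only the two RAW edges (S0 to S1 and S2 to S3) are true dependences; the WAW and WAR edges are the ones that renaming can remove.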

7 Limits to Superscalar Performance Complexity of logic to remove dependencies – Designers predicted 8-way and 16-way superscalars – We have 6-way superscalars and m is not likely to grow Baer p. 76

8 Limits to Superscalar Performance Number of Forward Paths 1-way: Baer p. 76

9 Limits to Superscalar Performance Number of Forward Paths 2-way: m-way requires m² paths; paths may become too long for signal propagation within a single clock cycle Baer p. 76

10 Limits to Clock Cycle Reduction
Power dissipation increases with frequency
Pipeline registers are read and written in every cycle
– The time to access a pipeline register imposes a lower bound on the duration of a pipeline stage
Baer p. 76

11 Limits on Pipeline Length
Speculative actions (e.g., branch prediction) are resolved later in a longer pipeline
– Recovery from misspeculation is delayed
14-stage pipeline: branch misprediction penalty of 10 cycles
31-stage pipeline: branch misprediction penalty of 20 cycles
Baer p. 76
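A rough worked example shows how the misprediction penalty feeds into performance. This is a sketch under stated assumptions: the base CPI, the fraction of branches, and the misprediction rate below are hypothetical numbers chosen for illustration, and the additive stall model is a common textbook simplification. Only the 10-cycle and 20-cycle penalties come from the slide.

```python
def cpi_with_mispredictions(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Effective CPI when every mispredicted branch adds penalty_cycles of stall."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload: 20% branches, 5% of them mispredicted, base CPI of 1.0.
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

for penalty in (10, 20):   # penalties quoted on the slide
    cpi = cpi_with_mispredictions(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, penalty)
    print(f"penalty {penalty} cycles: CPI = {cpi:.2f}")
```

Under these assumed numbers, doubling the penalty doubles the stall component of the CPI (about 0.1 versus about 0.2 cycles per instruction), so the deeper pipeline needs a correspondingly faster clock or a better predictor to come out ahead.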

12 Why the Multicore Revolution? Power Dissipation: Linear growth with clock frequency - Cannot make single cores faster Moore’s Law: Number of transistors in a chip continues the exponential growth - What to do with extra logic? Design Complexity: Extracting more performance from single core requires extreme design complexity. - What to do with extra logic? Baer p. 77

13 Speed Demons vs. Brainiacs
DEC Alpha: In-Order Superscalar (1994)
Pentium III: Out-of-Order Superscalar (1999), with register renaming, reorder buffer, and reservation stations
Baer p. 77

14 Out-of-Order and Memory Hierarchy Question: Does out-of-order execution help hide memory latencies? Short answer: No. – Latencies of 100 cycles or more are too long and fill up all internal queues and stall pipelines – Latencies around 100 cycles are too short to justify context switching. Solution: hardware for several contexts to enable fast context switching → multithreading Baer p. 78

15 DEC Alpha 21164: 4-way in-order RISC
Miss Address File: merges outstanding misses to the same L2 line.
[Block diagram; surviving labels: bit, 32, Instruction Buffer, virtually indexed]
Baer p. 79

16 21164 Instruction Pipeline Integer pipe 1: shifter and multiplier Integer pipe 2: branches 48-entry I-TLB 64-entry D-TLB Baer p. 79

17 Integer pipe 1: shifter and multiplier
Integer pipe 2: branches
48-entry I-TLB; 64-entry D-TLB
S0: brings 4 instructions from the I-Cache (accesses I-Cache and I-TLB in parallel)
S1: performs branch prediction; calculates branch target
S2: slotting stage: steers instructions to units; resolves static conflicts
S3: resolves dynamic conflicts; schedules forwarding and stalls
Baer p. 80

18 Example i1: R1 ← R2 + R3 # Use integer pipeline 1 i2: R4 ← R1 – R5 # Use integer pipeline 2 i3: R7 ← R8 – R9 # Requires an integer pipeline i4: F0 ← F2 + F4 # Floating point add i5: i6: i7: i8: i9: i10: i11: i12: Assume no structural or data hazard for these instructions. Baer p. 81

19 Front-end Occupancy
Time t0: S0 = i1 i2 i3 i4
Time t0 + 1: S0 = i5 i6 i7 i8; S1 = i1 i2 i3 i4
i1: R1 ← R2 + R3
i2: R4 ← R1 – R5
i3: R7 ← R8 – R9
i4: F0 ← F2 + F4
Baer p. 82

20 Front-end Occupancy
Time t0 + 2: S0 = i9 i10 i11 i12; S1 = i5 i6 i7 i8; S2 = i1 i2 i3 i4
i1: R1 ← R2 + R3
i2: R4 ← R1 – R5
i3: R7 ← R8 – R9
i4: F0 ← F2 + F4
Baer p. 82

21 Front-end Occupancy
Time t0 + 3: S0 = i9 i10 i11 i12; S1 = i5 i6 i7 i8; S2 = i3 i4; S3 = i1 i2
i3 cannot move to S3 because of a resource conflict (there are only two integer pipelines)
i4 does not move to S3, to preserve program order (it is blocked by i3)
i1: R1 ← R2 + R3
i2: R4 ← R1 – R5
i3: R7 ← R8 – R9
i4: F0 ← F2 + F4
Baer p. 82

22 Time: t Front-end Occupancy S0S1S2S3 i11 i12 i3 i4 i9 i10 i1 i2 i5 i6 i7 i8 Backend Time: t i2 cannot move to the backend because of the RAW dependency with i1. i1: R1 ← R2 + R3 i2: R4 ← R1 – R5 i3: R7 ← R8 – R9 i4: F0 ← F2 + F4 Baer p. 82

23 i15 i16 i13 i14 Time: t Front-end Occupancy S0S1S2S3 i3 i4 i11 i12 i9 i10 i2 i5 i6 i7 i8 Backend i1 Time: t i1: R1 ← R2 + R3 i2: R4 ← R1 – R5 i3: R7 ← R8 – R9 i4: F0 ← F2 + F4 Baer p. 82

24 Backend
– Begins L1 D-cache and D-TLB accesses
– Decides hit/miss in L1 D-cache and D-TLB
– Hit: forwards data (if needed); writes to integer or FP register. Miss: starts access to L2
– Data available if hit in L2
Baer p. 82

25 Scoreboard Speculation Example: a load L, and a dependent use U reach S3 at cycle t If the load hits L1-cache, then schedule L at t+1 and U at t+3. Scoreboard assumes it is a hit. Know if it is a hit or miss here. If it is a miss, abort any dependent instruction already issued. Baer p. 82

26 Can Compiler Help Performance? (Example) i1: R1 ← Mem[R2] i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 Assume that all instructions are in issuing slot (state S2) at time t.

27 Compiler Effect S0S1S2S3 Time: t i1 i2 i3 i4 Time: t + 1 Backend Baer p. 82 i1: R1 ← Mem[R2] i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 i5 i6 i7 i8 Instruction i3 cannot advance to S3 because of a structural hazard: the load in i1 uses an integer pipe to compute the address. i9 i10 i11 i12

28 Time: t + 1 Compiler Effect S0S1S2S3 i1 i2 i3 i4 Backend Baer p. 82 i1: R1 ← Mem[R2] i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 i5 i6 i7 i8 Time: t + 2 i2 cannot advance because of the RAW dependency with i1 Time: t + 3 at t+3 the load continues execution in the back end (2-cycle latency) i9 i10 i11 i12

29 i13 i14 i15 i16 Time: t + 3 Compiler Effect S0S1S2S3 i1 Backend Baer p. 82 i1: R1 ← Mem[R2] i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 Time: t + 4 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12

30 Time: t + 4 Compiler Effect S0S1S2S3Backend Baer p. 82 i1: R1 ← Mem[R2] i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 i2 i3 i4 i5 i6 i7 i8 i4 cannot advance because of the RAW dependency with i3 Time: t + 5 i9 i10 i11 i12 i13 i14 i15 i16

31 i17 i18 i19 i20 Time: t + 5 Compiler Effect S0S1S2S3Backend Baer p. 82 i1: R1 ← Mem[R2] i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 i3 i4 advances to execution at t+6 and it will be the only integer instruction executing at that cycle. Time: t + 6 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16

32 After Compiler Optimization S0S1S2S3Time: t i1 i1’ i2 i3 Time: t + 1Backend Baer p. 82 i1: R1 ← Mem[R2] i1’: integer nop i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 i4 i5 i6 i7 Two integer Instructions advance to S3. i8 i9 i10 i11

33 i13 i14 i15 i12 i1 i1’ i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 Time: t + 1S0S1S2S3Backend Baer p. 82 Time: t + 2 After Compiler Optimization i1: R1 ← Mem[R2] i1’: integer nop i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5

34 Time: t + 2S0S1S2S3Backend Baer p. 82 i1 i1’ i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 Time: t + 3 Load in i1 still needs two cycles to execute. Time: t + 4 After Compiler Optimization i1: R1 ← Mem[R2] i1’: integer nop i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 i13 i14 i15 i12

35 i17 i18 i19 i16 Time: t + 4 S0S1S2S3 Backend Baer p. 82 i1 i2 and i3 can advance to the backend together. There is no dependency between them. Time: t + 5 After Compiler Optimization i1: R1 ← Mem[R2] i1’: integer nop i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i13 i14 i15 i12

36 Time: t + 4S0S1S2S3Backend Baer p. 82 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i13 i14 i15 i4 still advances to backend at t+6! Time: t + 5 After Compiler Optimization i1: R1 ← Mem[R2] i1’: integer nop i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 i12 i17 i18 i19 i16 but now i5 could advance along with i4 * Textbook says that i4 would advance to backend at t+5. Time: t + 6

37 Scoreboarding “Scoreboarding allows instructions to execute out of order when there are sufficient resources and no data dependences.” John L. Hennessy and David A. Patterson Computer Architecture: A Quantitative Approach Third Edition, p. A-69.

38 Another scoreboarding

39 Scoreboarding
Thornton Algorithm (Scoreboarding): CDC 6600 (1964)
– A single unit (the scoreboard) monitors the progress of the execution of instructions and the status of all registers.
Tomasulo’s Algorithm: IBM 360/91 (1967)
– Reservation stations buffer operands and results. A Common Data Bus (CDB) distributes results directly to functional units.
Some of this material is from Prof. Vojin G. Oklobdzija’s tutorial at ISSCC’97.
Baer p. 81

40 CDC 6600 Group I Group II Group III Group IV Baer p. 86 Not shown: branch unit that modifies the PC

41 CDC 6600 Scoreboard Operation: Issue
Free functional unit? no → stall; yes → check for WAW hazard
WAW hazard? yes → stall; no → issue
Baer p. 86

42 CDC 6600 Scoreboard Operation: Dispatch
Mark execution unit busy
Operands ready? no → stall; yes → read operands
Baer p. 87

43 CDC 6600 Scoreboard Operation: Execution
Execution complete? no → stall; yes → notify the scoreboard that the unit is ready to write its result
Baer p. 87

44 CDC 6600 Scoreboard Operation: Write result
WAR hazard? yes → stall; no → write
WAR Example:
i0 DIV.D F0, F2, F4
i1 ADD.D F10, F0, F8
i2 SUB.D F8, F8, F14
The write of i2 has to stall until i1 has read F8
Baer p. 87
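The four scoreboard checks (issue, dispatch, execution, write result) can be sketched in a few lines. This is a simplified illustration, not the CDC 6600 logic itself: the class and method names are made up for this example, the field names follow the Fi/Fj/Fk/Qj/Qk/Rj/Rk headings used in the status tables, execution latency is not modeled, and the unit names `Div`, `Add1`, and `Add2` are hypothetical (two adders are assumed so that the WAR example does not stall on a structural hazard first).

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Unit:
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit producing fj (None: no pending producer)
    qk: Optional[str] = None   # unit producing fk
    rj: bool = False           # True: fj is ready and has not been read yet
    rk: bool = False           # True: fk is ready and has not been read yet

class Scoreboard:
    def __init__(self, unit_names):
        self.units: Dict[str, Unit] = {name: Unit() for name in unit_names}
        self.producer: Dict[str, str] = {}   # register -> unit that will write it

    def can_issue(self, unit, dest):
        # Stall on a busy unit (structural hazard) or a pending write to dest (WAW).
        return not self.units[unit].busy and dest not in self.producer

    def issue(self, unit, dest, src1, src2):
        assert self.can_issue(unit, dest)
        qj, qk = self.producer.get(src1), self.producer.get(src2)
        self.units[unit] = Unit(True, dest, src1, src2, qj, qk, qj is None, qk is None)
        self.producer[dest] = unit

    def can_read_operands(self, unit):
        # Dispatch: both operands must be ready (no unresolved RAW dependence).
        u = self.units[unit]
        return u.rj and u.rk

    def read_operands(self, unit):
        assert self.can_read_operands(unit)
        u = self.units[unit]
        u.rj = u.rk = False

    def can_write(self, unit):
        # WAR check: no other unit may still need to read the old value of fi.
        fi = self.units[unit].fi
        return all(
            not (o.fj == fi and o.rj) and not (o.fk == fi and o.rk)
            for name, o in self.units.items()
            if name != unit and o.busy
        )

    def write(self, unit):
        assert self.can_write(unit)
        for o in self.units.values():       # wake up consumers waiting on this unit
            if o.qj == unit:
                o.qj, o.rj = None, True
            if o.qk == unit:
                o.qk, o.rk = None, True
        del self.producer[self.units[unit].fi]
        self.units[unit] = Unit()

# WAR example from the slide.
sb = Scoreboard(["Div", "Add1", "Add2"])
sb.issue("Div", "F0", "F2", "F4")      # i0 DIV.D F0, F2, F4
sb.issue("Add1", "F10", "F0", "F8")    # i1 ADD.D F10, F0, F8 (waits for F0)
sb.issue("Add2", "F8", "F8", "F14")    # i2 SUB.D F8, F8, F14
sb.read_operands("Add2")
print(sb.can_write("Add2"))            # False: i1 has not read F8 yet
sb.read_operands("Div")
sb.write("Div")                        # F0 becomes available
sb.read_operands("Add1")               # i1 now reads F0 and F8
print(sb.can_write("Add2"))            # True: the WAR hazard is gone
```

The design point the sketch makes visible is that i1 reads both of its operands together, so i2 cannot write F8 until the long-running divide has delivered F0.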

45 Scoreboarding Example i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Baer p. 88

46 i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Cycle1 UnitBusy (U)? Mult10 Mult20 Adder0 RegisterUnit R4NIL R6NIL R8NIL Source Reg UnitsReg Flags InstructionStatus FjFkQjQkRjRk Instructions in Flight Fi Res. i1issuedR4R0R211 Baer p. 88 Mult1

47 i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Cycle2 UnitBusy (U)? Mult10 Mult20 Adder0 RegisterUnit R4Mult1 R6NIL R8NIL Source Reg UnitsReg Flags InstructionStatus FjFkQjQkRjRk Instructions in Flight Fi Res. i1dispatchedR4R0R2 i2 issued R6R4R8Mult Baer p Mult2

48 i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Cycle3 UnitBusy (U)? Mult11 Mult20 Adder0 RegisterUnit R4Mult1 R6Mult2 R8NIL Source Reg UnitsReg Flags InstructionStatus FjFkQjQkRjRk Instructions in Flight Fi Res. i1dispatchedR4R0R2 i2issuedR6R4R8Mult execute i3issuedR8R2R1211 i2 cannot be dispatched because R4 is not available Baer p. 88 Adder These values are wrong on Table 3.2 (p. 88) in the textbook

49 i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Cycle4 UnitBusy (U)? Mult11 Mult20 Adder0 RegisterUnit R4Mult1 R6Mult2 R8Adder Source Reg UnitsReg Flags InstructionStatus FjFkQjQkRjRk Instructions in Flight Fi Res. i1R4R0R2 i2issuedR6R4R8Mult execute i3issuedR8R2R1211 i4 cannot issue: (i) Adder is busy; AND (ii) WAW dependency on i1 dispatched Baer p. 88 1

50 i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Cycle5 UnitBusy (U)? Mult11 Mult20 Adder1 RegisterUnit R4Mult1 R6Mult2 R8Adder Source Reg UnitsReg Flags InstructionStatus FjFkQjQkRjRk Instructions in Flight Fi Res. i1R4R0R2 i2issuedR6R4R8Mult execute R8R2R1211dispatchedi3execute Baer p. 88 (No change)

51 i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Cycle6 UnitBusy (U)? Mult11 Mult20 Adder1 RegisterUnit R4Mult1 R6Mult2 R8Adder Source Reg UnitsReg Flags InstructionStatus FjFkQjQkRjRk Instructions in Flight Fi Res. i1R4R0R2 i2issuedR6R4R8Mult execute R8R2R1211i3execute i3 asks for permission to write. Permission is denied (WAR with i2). Baer p. 88

52 i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Cycle8 UnitBusy (U)? Mult11 Mult20 Adder1 RegisterUnit R4Mult1 R6Mult2 R8Adder Source Reg UnitsReg Flags InstructionStatus FjFkQjQkRjRk Instructions in Flight Fi Res. i1R4R0R2 i2issuedR6R4R8Mult execute R8R2R1211i3execute i1 asks for permission to write. Permission is granted. write Baer p. 88

53 i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Cycle9 UnitBusy (U)? Mult10 Mult21 Adder1 RegisterUnit R4 R6Mult2 R8Adder Source Reg UnitsReg Flags InstructionStatus FjFkQjQkRjRk Instructions in Flight Fi Res. i2issuedR6R4R8Mult101 R8R2R1211i3execute dispatched write i4issueR4R14R1611 Baer p. 88 Adder

54 Register Renaming, Reorder Buffer, and Reservation Stations
Difference between in-order and out-of-order execution:
– When do instructions leave the front end?
In-order: WAR and WAW dependences prevent dispatch
Out-of-order: register renaming avoids WAR and WAW dependences
– How are instructions processed in the back end?
Instructions can wait in reservation stations because of RAW dependences or structural hazards
A reorder buffer imposes commitment in program order
Baer p. 89

55 Register Renaming (example) i1: R1 ← R2/R3 # Takes a long time i2: R4 ← R1 + R5 i3: R5 ← R6 + R7 i4: R1 ← R8 + R9 In-order: Only i1 issues. Others are blocked by RAW dependency. Out-of-order: i3 and i4 can issue and finish execution while i1 executes The registers that appear in the program are logical or architectural registers. At the last stage of the front end all registers are mapped to physical registers. Baer p. 89

56 Renaming Process
Renaming Stage: Ri ← Rj op Rk becomes Ra ← Rb op Rc, where
Rb = Rename(Rj);
Rc = Rename(Rk);
Ra = freelist(first);
Rename(Ri) = freelist(first);
first ← next(first)
Baer p. 90

57 Register Renaming (example)
i1: R1 ← R2/R3
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
Rename table: one entry Rename(Ri) for each logical register R1 to R9
Freelist = {R32, R33, R34, R35, R36, …}
Physical registers R32, R33, R34, and R35 are allocated to the destinations of i1, i2, i3, and i4
– i4 will finish execution before i1. Can we allow it to write its result to R1 before i1?
– How about i3: can it write into R5 before i1 and i2 complete?
– If i1 generates an exception, what will be the value of R5 in the exception state?
Baer p. 90
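The renaming step can be sketched directly from the rules Rb = Rename(Rj), Rc = Rename(Rk), Ra = freelist(first). This is a minimal illustration: the function name `rename` and the tuple encoding are assumptions made for this example, and logical registers that have not been renamed yet are assumed to map to themselves.

```python
from collections import deque

def rename(instrs, free_regs):
    """Rename (dest, op, src1, src2) instructions; return them with the final map."""
    rename_map = {}            # logical register -> current physical register
    free = deque(free_regs)    # free list; the head plays the role of 'first'
    renamed = []
    for dest, op, src1, src2 in instrs:
        rb = rename_map.get(src1, src1)   # Rb = Rename(Rj)
        rc = rename_map.get(src2, src2)   # Rc = Rename(Rk)
        ra = free.popleft()               # Ra = freelist(first); first <- next(first)
        rename_map[dest] = ra             # Rename(Ri) = Ra
        renamed.append((ra, op, rb, rc))
    return renamed, rename_map

program = [
    ("R1", "/", "R2", "R3"),   # i1
    ("R4", "+", "R1", "R5"),   # i2
    ("R5", "+", "R6", "R7"),   # i3
    ("R1", "+", "R8", "R9"),   # i4
]

renamed, final_map = rename(program, ["R32", "R33", "R34", "R35", "R36"])
for ra, op, rb, rc in renamed:
    print(f"{ra} <- {rb} {op} {rc}")
# R32 <- R2 / R3
# R33 <- R32 + R5
# R34 <- R6 + R7
# R35 <- R8 + R9
```

After renaming, i2 still reads R32 (the true RAW dependence on i1), but i3 and i4 write R34 and R35, so the WAR on R5 and the WAW on R1 no longer constrain execution order.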

58 Reorder Buffer Even though we allow out-of-order execution, we require in-order completion. A reorder buffer (ROB) ensures that the results produced by instructions are committed to the logical registers in program order. Baer p. 91

59 Reorder Buffer (cont.) Each entry in the ROB has the following fields: – flag: has the instruction completed? – value: value computed by the instruction – result register name: logical register – instruction type: arithmetic/load/store/branch/… Each instruction that has its destination register renamed is entered in the ROB Baer p. 91

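60 i1: R1 ← R2/R3
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
Rename table: one entry Rename(Ri) for each logical register R1 to R9
Freelist = {R32, R33, R34, R35, R36, …}; R32, R33, R34, and R35 are allocated
Reorder buffer (Head at i1, the oldest entry; Tail marks the other end of the occupied entries):
Instruction   Flag        Value   Reg. Name   Type
i1            Not Ready   None    R1          Arit
i2            Not Ready   None    R4          Arit
i3            Not Ready   None    R5          Arit
i4            Not Ready   None    R1          Arit
As instructions complete, their entries change to Flag = Ready and Value = Some.
Baer p. 92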
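In-order commitment from the head of the reorder buffer can be sketched as follows. This is an illustrative model only: the class and method names are invented for this example, the entry fields mirror the flag, value, result register, and type fields listed for the ROB, and the result values are made-up numbers.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobEntry:
    instr: str
    dest: str                    # logical result register
    kind: str = "arith"          # instruction type
    ready: bool = False          # flag: has the instruction completed?
    value: Optional[int] = None  # value computed by the instruction

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # head on the left, tail on the right

    def allocate(self, instr, dest, kind="arith"):
        entry = RobEntry(instr, dest, kind)
        self.entries.append(entry)           # enter at the tail, in program order
        return entry

    def complete(self, entry, value):
        entry.ready, entry.value = True, value

    def commit(self, regfile):
        """Retire completed entries from the head; stop at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0].ready:
            entry = self.entries.popleft()
            regfile[entry.dest] = entry.value
            committed.append(entry.instr)
        return committed

rob = ReorderBuffer()
regs = {}
e1 = rob.allocate("i1", "R1")
e2 = rob.allocate("i2", "R4")
e3 = rob.allocate("i3", "R5")
e4 = rob.allocate("i4", "R1")

rob.complete(e3, 13)       # i3 and i4 finish early (hypothetical values)
rob.complete(e4, 17)
print(rob.commit(regs))    # []: i1 at the head is not ready, so nothing commits
rob.complete(e1, 2)
print(rob.commit(regs))    # ['i1']: i2 now blocks the head
rob.complete(e2, 3)
print(rob.commit(regs))    # ['i2', 'i3', 'i4']
```

If i1 had raised an exception instead of completing, R5 and R1 would still hold their old values at that point, because the results of i3 and i4 had not been committed.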

61 But… Where do instructions wait before being executed? How does an instruction know that it is ready to be executed? Baer p. 93

62 Reservation Stations After register renaming, the front-end dispatches the instruction to a reservation station. Reservation stations can: – be grouped into a centralized queue called an instruction window. – be associated with functional units according to the opcode. Baer p. 93

63 Reservation Stations (cont.) Each entry in the Reservation Station must contain: – Operation to be performed – Source operands (either value or physical name of the register) – a flag indicates which one – physical name of the result register – ROB entry where the result will be stored. Baer p. 93

64 Scheduling Scheduling: Selection of which instruction should execute next in a given execution unit – oldest instruction; – critical instruction; Baer p. 93

65 Ready Bit A ready bit is associated with each physical register. When an instruction that uses a physical register Ri is dispatched: – if Ri is ready, pass Ri value to the reservation station and set flag to true (ready) – if Ri is not ready, pass the name of Ri to the reservation station and set flag to false (not ready) – When both flags are true, the instruction is ready to be issued. Baer p. 93

66 Ready Bit (cont.) Upon completion, an instruction broadcasts the name and content of its result physical register to all reservation stations (RS). – Each RS that needs the result grabs the content and updates its flags. Baer p. 93
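The ready-bit protocol, the result broadcast, and an oldest-first scheduling choice can be put together in a small model. This is a sketch under stated assumptions: all class and method names are hypothetical, a single centralized instruction window is assumed, the ROB entry field is omitted for brevity, and the register values are made up.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Operand:
    ready: bool
    payload: Union[int, str]   # the value if ready, else the physical register name

@dataclass
class RSEntry:
    age: int                   # dispatch order, used for oldest-first selection
    op: str
    dest: str                  # physical name of the result register
    operands: List[Operand]

    def is_ready(self):
        return all(o.ready for o in self.operands)

class ReservationStations:
    def __init__(self, values):
        self.values = dict(values)                # physical register -> value
        self.ready = {r: True for r in values}    # ready bit per physical register
        self.entries = []
        self.next_age = 0

    def dispatch(self, op, dest, srcs):
        operands = []
        for reg in srcs:
            if self.ready.get(reg, False):
                operands.append(Operand(True, self.values[reg]))   # pass the value
            else:
                operands.append(Operand(False, reg))               # pass the name
        self.ready[dest] = False                  # result is not available yet
        entry = RSEntry(self.next_age, op, dest, operands)
        self.next_age += 1
        self.entries.append(entry)
        return entry

    def broadcast(self, reg, value):
        """A completing instruction announces its result register and content."""
        self.values[reg] = value
        self.ready[reg] = True
        for entry in self.entries:
            for operand in entry.operands:
                if not operand.ready and operand.payload == reg:
                    operand.ready, operand.payload = True, value

    def select_oldest_ready(self):
        ready = [e for e in self.entries if e.is_ready()]
        if not ready:
            return None
        oldest = min(ready, key=lambda e: e.age)
        self.entries.remove(oldest)
        return oldest

rs = ReservationStations({"R2": 6, "R3": 3, "R5": 1})
i1 = rs.dispatch("div", "R32", ["R2", "R3"])    # both operands ready
i2 = rs.dispatch("add", "R33", ["R32", "R5"])   # waits for R32
print(rs.select_oldest_ready().op)              # div
print(rs.select_oldest_ready())                 # None: add still waits for R32
rs.broadcast("R32", 2)                          # div completes
print(rs.select_oldest_ready().op)              # add
```

Selecting the oldest ready entry is only one of the policies named for scheduling; a critical-instruction policy would replace the `min` over `age` with a different priority key.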

