Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.


1 Pipelining 5

2 Two Approaches for Multiple Issue
Superscalar
–Issue a variable number of instructions per clock
–Instructions are scheduled either statically or dynamically
VLIW (Very Long Instruction Word)
–Issue a single very long instruction per clock that contains a large number of real instructions
–Instructions are scheduled statically by the compiler

3 Superscalar
Superscalar DLX: 2 instructions, 1 FP & 1 anything else
–Fetch 64 bits/clock cycle; Int on left, FP on right
–Can only issue 2nd instruction if 1st instruction issues
–More ports for FP registers to do FP load & FP op as doubles
  Type               Pipe stages
  Int. instruction   IF  ID  EX  MEM WB
  FP instruction     IF  ID  EX  MEM WB
  Int. instruction       IF  ID  EX  MEM WB
  FP instruction         IF  ID  EX  MEM WB
  Int. instruction           IF  ID  EX  MEM WB
  FP instruction             IF  ID  EX  MEM WB
1 cycle load delay expands to 3 instructions in Superscalar
–The instruction in the right half can't use the loaded value, nor can the instructions in the next slot

4 Unrolled Loop
   1 Loop: LD   F0,0(R1)
   2       LD   F6,-8(R1)
   3       LD   F10,-16(R1)
   4       LD   F14,-24(R1)
   5       ADDD F4,F0,F2
   6       ADDD F8,F6,F2
   7       ADDD F12,F10,F2
   8       ADDD F16,F14,F2
   9       SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SD   -16(R1),F12
  12       SUBI R1,R1,#32
  13       BNEZ R1,LOOP
  14       SD   8(R1),F16     ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
LD to ADDD: 1 cycle
ADDD to SD: 2 cycles

5 Loop Unrolling in Superscalar
        Integer instruction   FP instruction      Clock cycle
  Loop: LD F0,0(R1)                               1
        LD F6,-8(R1)                              2
        LD F10,-16(R1)        ADDD F4,F0,F2       3
        LD F14,-24(R1)        ADDD F8,F6,F2       4
        LD F18,-32(R1)        ADDD F12,F10,F2     5
        SD 0(R1),F4           ADDD F16,F14,F2     6
        SD -8(R1),F8          ADDD F20,F18,F2     7
        SD -16(R1),F12                            8
        SD -24(R1),F16                            9
        SUBI R1,R1,#40                            10
        BNEZ R1,LOOP                              11
        SD 8(R1),F20                              12   ; 8-40 = -32
Unrolled 5 times to avoid delays (+1 due to Superscalar)
12 clocks, or 2.4 clocks per iteration
The last store issues after SUBI has already decremented R1 by 40, so its offset is 8(R1) rather than -32(R1), just as the single-issue version uses 8(R1) after SUBI R1,R1,#32.
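The dual-issue schedule above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the tuple layout, the `STALLS` table and the `check_schedule` helper are illustrative names, and the stall counts are taken from the earlier "LD to ADDD: 1 cycle, ADDD to SD: 2 cycles" note. Only FP data dependences are tracked; the loop-invariant F2 and the R1 address register are left out for brevity.

```python
from collections import Counter

# Stall cycles required between a producer and a dependent instruction
# (assumed from the slide: LD -> ADDD needs 1, ADDD -> SD needs 2).
STALLS = {"LD": 1, "ADDD": 2}

# (clock cycle, issue slot, opcode, destination register, FP source registers)
SCHEDULE = [
    (1, "int", "LD", "F0", []),
    (2, "int", "LD", "F6", []),
    (3, "int", "LD", "F10", []),
    (3, "fp", "ADDD", "F4", ["F0"]),
    (4, "int", "LD", "F14", []),
    (4, "fp", "ADDD", "F8", ["F6"]),
    (5, "int", "LD", "F18", []),
    (5, "fp", "ADDD", "F12", ["F10"]),
    (6, "int", "SD", None, ["F4"]),
    (6, "fp", "ADDD", "F16", ["F14"]),
    (7, "int", "SD", None, ["F8"]),
    (7, "fp", "ADDD", "F20", ["F18"]),
    (8, "int", "SD", None, ["F12"]),
    (9, "int", "SD", None, ["F16"]),
    (10, "int", "SUBI", None, []),
    (11, "int", "BNEZ", None, []),
    (12, "int", "SD", None, ["F20"]),
]


def check_schedule(schedule):
    """Return a list of violations; an empty list means the schedule is legal."""
    problems = []

    # Structural check: at most one instruction per issue slot per cycle.
    slot_use = Counter((cycle, slot) for cycle, slot, *_ in schedule)
    for (cycle, slot), count in slot_use.items():
        if count > 1:
            problems.append(f"cycle {cycle}: {count} instructions in {slot} slot")

    # Data check: a consumer must issue after producer cycle + stalls.
    produced = {}  # register -> (cycle, opcode) of the instruction writing it
    for cycle, _slot, op, dest, sources in sorted(schedule, key=lambda e: e[0]):
        for reg in sources:
            if reg in produced:
                p_cycle, p_op = produced[reg]
                earliest = p_cycle + STALLS.get(p_op, 0) + 1
                if cycle < earliest:
                    problems.append(
                        f"cycle {cycle}: {op} reads {reg} before cycle {earliest}"
                    )
        if dest is not None:
            produced[dest] = (cycle, op)
    return problems


if __name__ == "__main__":
    print(check_schedule(SCHEDULE))
    total_cycles = max(entry[0] for entry in SCHEDULE)
    print(total_cycles / 5, "clocks per iteration")
```

Moving an ADDD one cycle earlier, or placing two loads in the same cycle, makes `check_schedule` report a violation, which is one way to see why the loop has to be unrolled five times before the FP slot can be kept busy.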

6 Dynamic Scheduling in Superscalar
Dependencies will stop instruction issue
Code compiled for a non-superscalar machine will run poorly on a superscalar
Simple approach
–Separate Tomasulo control
–Separate reservation stations for Integer FU/Reg and for FP FU/Reg

7 Dynamic Scheduling in Superscalar
How to issue two dependent instructions in the same cycle? Otherwise, why use dynamic scheduling?
–Run issue at 2X the clock rate, so that issue remains in order
–Use more complex issue logic; this is relatively easy, since only FP loads might cause a dependency between integer and FP issue

8 Performance of Dynamic Superscalar
  Iteration   Instruction        Issues   Executes   Writes result
  no.                            (clock-cycle number)
  1           LD F0,0(R1)        1        2          4
  1           ADDD F4,F0,F2      1        5          8
  1           SD 0(R1),F4        2        9
  1           SUBI R1,R1,#8      3        4          5
  1           BNEZ R1,LOOP       4        5
  2           LD F0,0(R1)        5        6          8
  2           ADDD F4,F0,F2      5        9          12
  2           SD 0(R1),F4        6        13
  2           SUBI R1,R1,#8      7        8          9
  2           BNEZ R1,LOOP       8        9
4 clocks per iteration (every instruction of iteration 2 issues 4 cycles after its counterpart in iteration 1)
Branches and decrements still take 1 clock cycle each to issue

9 Limits of Superscalar
While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
–Exactly 50% FP operations
–No hazards
If more instructions issue at the same time, decode and issue become more difficult
–Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
Issue rates of modern processors vary between 2 and 4 instructions per cycle.
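The 50% condition follows from a simple bound, sketched below in Python under idealized assumptions that are not in the slides: no hazards at all, and instructions can always be paired regardless of program order. With a fraction f of FP instructions, each cycle issues at most one FP and one non-FP instruction, so N instructions need at least max(f, 1 - f) * N cycles. The function name `ideal_cpi` is illustrative.

```python
def ideal_cpi(fp_fraction: float) -> float:
    """Best-case CPI of a 2-issue machine with one FP and one non-FP slot.

    Assumes no hazards and perfect pairing, so the fuller of the two
    instruction streams alone determines the cycle count.
    """
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must be between 0 and 1")
    return max(fp_fraction, 1.0 - fp_fraction)


if __name__ == "__main__":
    for f in (0.0, 0.2, 0.5, 0.8):
        print(f"FP fraction {f:.1f}: ideal CPI {ideal_cpi(f):.2f}")
```

Under this model only an exact 50/50 mix reaches a CPI of 0.5; a program with 20% FP operations is limited to 0.8 even before any hazard is considered.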

10 VLIW Processors
Very Long Instruction Word (VLIW) processors
–Trade off instruction space for simple decoding
–The long instruction word has room for many operations
–By definition, all the operations the compiler puts in the long instruction word can execute in parallel
–E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
»16 to 24 bits per field => 7*16 = 112 bits to 7*24 = 168 bits wide
–Need a compiling technique that schedules across several branches (Trace Scheduling)

11 Loop Unrolling in VLIW
  Memory reference 1   Memory reference 2   FP operation 1    FP operation 2    Int. op/branch   Clock
  LD F0,0(R1)          LD F6,-8(R1)                                                              1
  LD F10,-16(R1)       LD F14,-24(R1)                                                            2
  LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2     ADDD F8,F6,F2                      3
  LD F26,-48(R1)                            ADDD F12,F10,F2   ADDD F16,F14,F2                    4
                                            ADDD F20,F18,F2   ADDD F24,F22,F2                    5
  SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                      6
  SD -16(R1),F12       SD -24(R1),F16                                                            7
  SD -32(R1),F20       SD -40(R1),F24                                           SUBI R1,R1,#56   8
  SD 8(R1),F28                                                                  BNEZ R1,LOOP     9
Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers in VLIW
Seven 8-byte elements are processed per pass, so R1 must decrease by 56; the final store then uses 8(R1), since 8-56 = -48.
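How full the long instruction words are can be counted directly from the table. The Python sketch below is an illustration only: it assumes the five-slot layout shown in the table (two memory references, two FP operations, one integer/branch slot), and `WORDS` and `slot_utilization` are hypothetical names. Each row is one VLIW instruction; `None` marks an empty field.

```python
# One list per VLIW instruction: [mem1, mem2, fp1, fp2, int/branch].
WORDS = [
    ["LD", "LD", None, None, None],
    ["LD", "LD", None, None, None],
    ["LD", "LD", "ADDD", "ADDD", None],
    ["LD", None, "ADDD", "ADDD", None],
    [None, None, "ADDD", "ADDD", None],
    ["SD", "SD", "ADDD", None, None],
    ["SD", "SD", None, None, None],
    ["SD", "SD", None, None, "SUBI"],
    ["SD", None, None, None, "BNEZ"],
]


def slot_utilization(words):
    """Return (used slots, total slots) for a list of VLIW instruction words."""
    used = sum(1 for word in words for op in word if op is not None)
    total = sum(len(word) for word in words)
    return used, total


if __name__ == "__main__":
    used, total = slot_utilization(WORDS)
    print(f"{used} of {total} slots used ({used / total:.0%})")
    print(f"{len(WORDS) / 7:.1f} clocks per iteration")
```

Roughly half of the fields are empty even in this favorable loop, which is the "wasted fields" cost of the fixed-width format.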

12 Limits of Multi-Issue Machines
Inherent limitations of ILP
–1 branch in 5 instructions => how to keep a 5-way VLIW busy?
–Latencies of units => many operations must be scheduled
–Need about Pipeline Depth x No. Functional Units independent operations to keep the functional units busy
Difficulties in building HW
–Duplicate FUs to get parallel execution
–Increased # of ports to register file
–Increased # of ports to memory
–Instruction issue hardware (a wide spectrum)

13 Limits to Multi-Issue Machines
Limitations specific to either Superscalar or VLIW implementation
–Superscalar instruction issue logic
–VLIW code size: unrolled loops + wasted fields
–VLIW lock step => 1 hazard & all instructions stall
–VLIW binary compatibility => object-code (binary) translation

14 Hardware-based Speculation
–Instructions are executed out of order and speculatively but committed in order
–Instruction commit is separated from instruction completion
–In-order instruction commit requires a hardware buffer called the reorder buffer
–The reorder buffer holds the result of an instruction between its completion time and its commit time
–Exceptions are also processed in order (precise exception model)

15 Hardware-based Speculation
[Figure: pipeline with stages IF, Issue, reservation stations feeding execution units EX FU1 ... EX FUn, Write results, Commit]
Issue
–Structural hazard: delay the issue until there is an empty reservation station and an empty slot in the reorder buffer
Reservation stations
–RAW data hazard: wait at the reservation station until the values of the source registers are available
Commit
–Wait until the head of the reorder buffer is reached and the result is present
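The issue and commit rules above can be sketched as a small queue. This Python model is a simplification under stated assumptions, not the hardware described in the slides: reservation stations, speculation recovery and exceptions are omitted, and the names `RobEntry`, `ReorderBuffer`, `issue`, `write_result` and `commit` are hypothetical. It shows only that results may arrive out of order while the register file is updated in program order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    name: str                      # instruction text, for tracing
    dest: Optional[str]            # destination register, if any
    value: Optional[float] = None  # result, once written
    done: bool = False             # True after the write-results stage


class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()     # head = oldest instruction
        self.registers = {}        # architectural state, updated at commit

    def issue(self, name: str, dest: Optional[str]) -> Optional[RobEntry]:
        """Allocate a slot at the tail; None signals a structural hazard."""
        if len(self.entries) >= self.size:
            return None
        entry = RobEntry(name, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry: RobEntry, value: float) -> None:
        """Record a completed result; it is not yet architecturally visible."""
        entry.value = value
        entry.done = True

    def commit(self) -> Optional[str]:
        """Retire the head entry only if its result is present."""
        if self.entries and self.entries[0].done:
            entry = self.entries.popleft()
            if entry.dest is not None:
                self.registers[entry.dest] = entry.value
            return entry.name
        return None


if __name__ == "__main__":
    rob = ReorderBuffer(size=2)
    mul = rob.issue("MULTD F0,F2,F4", "F0")
    add = rob.issue("ADDD F6,F8,F2", "F6")
    rob.write_result(add, 2.0)     # the younger instruction finishes first
    print(rob.commit())            # head (MULTD) is not done, nothing commits
    rob.write_result(mul, 1.0)
    print(rob.commit(), rob.commit(), rob.registers)
```

Because `registers` changes only in `commit`, a mispredicted or faulting instruction could be discarded before it reaches the head without having altered visible state, which is what makes the precise exception model possible.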

16 Hardware-based Speculation

17 Example Code
  LD    F6,34(R2)
  LD    F2,45(R3)
  MULTD F0,F2,F4
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2
Figure 4.35 in the book
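The register dependences in this sequence are what the reservation stations and the reorder buffer have to respect. The Python sketch below lists them by brute-force pairwise comparison; it is an illustration added here, not the book's figure, and `PROGRAM` and `find_hazards` are hypothetical names. The pairwise test is conservative in general (it ignores intervening writes), which does not matter for this short sequence.

```python
# (opcode, destination register, source registers) in program order.
PROGRAM = [
    ("LD", "F6", ["R2"]),
    ("LD", "F2", ["R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def find_hazards(program):
    """Return (kind, earlier index, later index, register) for each dependence."""
    hazards = []
    for j, (_op_j, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _op_i, dest_i, srcs_i = program[i]
            if dest_i in srcs_j:
                hazards.append(("RAW", i, j, dest_i))  # j reads what i writes
            if dest_j in srcs_i:
                hazards.append(("WAR", i, j, dest_j))  # j overwrites what i reads
            if dest_j == dest_i:
                hazards.append(("WAW", i, j, dest_j))  # both write the same register
    return hazards


if __name__ == "__main__":
    for kind, i, j, reg in find_hazards(PROGRAM):
        print(f"{kind} on {reg}: {PROGRAM[i][0]} (#{i}) -> {PROGRAM[j][0]} (#{j})")
```

The final ADDD is the interesting case: it writes F6, which SUBD and DIVD still read and which the first LD also writes, so it must not update F6 early.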

18 Hardware-based Speculation
Advantages
–Lessens the performance degradation resulting from control hazards
–Allows a precise exception model, since exception conditions can be checked at instruction commit time
–Can incorporate hardware-based branch prediction
–Does not require additional bookkeeping code
–Does not depend on a good compiler; performs OK with non-optimized code
Disadvantage
–Hardware complexity

19 Summary
Superscalar and VLIW
–CPI < 1
–Superscalar is more hardware dependent (dynamic)
–VLIW is more compiler dependent (static)
–More instructions issue at the same time => larger penalties for hazards
Hardware-based speculative execution
–Minimizes the impact of control hazards on performance
–Enables a precise exception model for out-of-order and/or speculative execution

