Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Pipelining in Today’s Most Advanced Processors
- Not fundamentally different from the techniques we discussed
- Deeper pipelines
- Pipelining is combined with
  - superscalar execution
  - out-of-order execution
  - VLIW (very long instruction word)

Deeper Pipelines
[Figure: pipelines of the Power 4, Pentium 3, and Pentium 4]
Instructor note: give the review of “Intel’s game”.

End of superpipelining?
- Pipeline register overheads were already starting to play a role
- Thermal wall / power wall: the clock rate cannot keep increasing, so deeper pipelines have less utility
- In fact, pipelines are getting shorter
- Much like parallelism: cutting a stage in half did not halve the cycle time (CT)
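One way to see why halving a stage does not halve the cycle time is a simple overhead model. The model and the numbers below are assumptions chosen for illustration, not values from the slides: total logic delay T_logic = 10 ns is split evenly over N stages, and every stage pays a fixed pipeline-register overhead T_reg = 0.5 ns.

```latex
\[
CT(N) = \frac{T_{\text{logic}}}{N} + T_{\text{reg}}
\]
\[
CT(10) = \frac{10\,\text{ns}}{10} + 0.5\,\text{ns} = 1.5\,\text{ns},
\qquad
CT(20) = \frac{10\,\text{ns}}{20} + 0.5\,\text{ns} = 1.0\,\text{ns}
\]
```

Under these assumed numbers, doubling the pipeline depth shortens the cycle by a factor of 1.5 rather than 2, and the fixed overhead takes a growing share of each cycle as N increases.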

Superscalar Execution
[Figure: superscalar pipeline diagram; each instruction passes through IM, Reg, ALU, DM, Reg]

What can this do in parallel?
Selection
A  Any two instructions
B  Any two independent instructions
C  An arithmetic instruction and a memory instruction
D  Any instruction and a memory instruction
E  None of the above

A modest superscalar MIPS
- What can this machine do in parallel?
- What other logic is required?
- Represents the earliest superscalar technology (e.g., circa early 1990s)
- Hazards: detecting independent instructions

Superscalar Execution
- To execute four instructions in the same cycle, we must find four independent instructions.
- If the four instructions fetched are guaranteed by the compiler to be independent, this is a VLIW machine.
- If the four instructions fetched are executed together only if hardware confirms that they are independent, this is an in-order superscalar processor.
- If the hardware actively finds four (not necessarily consecutive) instructions that are independent, this is an out-of-order superscalar processor.
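As a rough illustration of the check that an in-order superscalar performs in hardware (or that a VLIW compiler performs ahead of time), the sketch below tests whether a fetched group is mutually independent. The `(op, dest, srcs)` tuple encoding and the name `is_independent` are assumptions made for this example. Only register dependences inside the group (RAW and WAW) are considered; structural limits and memory dependences are ignored.

```python
def is_independent(group):
    """Return True if no instruction in the group depends on an earlier one.

    Each instruction is a hypothetical (op, dest, srcs) tuple; dest is None
    for instructions that write no register (for example sw or beq).
    """
    written = set()  # registers written by earlier instructions in the group
    for op, dest, srcs in group:
        if written.intersection(srcs):   # RAW on an earlier group member
            return False
        if dest is not None:
            if dest in written:          # WAW on an earlier group member
                return False
            written.add(dest)
    return True


dependent = [("lw", "$6", ("$2",)), ("add", "$5", ("$6", "$4"))]
unrelated = [("lw", "$7", ("$4",)), ("sub", "$9", ("$12", "$8"))]
print(is_independent(dependent))  # False: add reads $6, which lw writes
print(is_independent(unrelated))  # True
```

The same test could run in a compiler (VLIW) or in issue logic (in-order superscalar); an out-of-order machine would instead search a larger window for instructions that pass it.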

Superscalar Scheduling
Assume in-order, 2-issue, load/store followed by integer. In which cycle can we start “executing” each instruction? (Assume the first lw has already gone through F and D.)
    lw  $6, 36($2)
    add $5, $6, $4
    lw  $7, 1000($5)
    sub $9, $12, $5
Selection  lw  add  lw  sub
A          0   0    1   1
B          0   1    2   2
C          0   1    2   4
D          0   2    3   3
E          None are correct
Answer: D (X has to wait on the stall).
Instructor note: point out that “issue” is used in modern machines to mean “start executing”, that is, go to X in our MIPS pipeline. Point out that we assume the same pipeline as MIPS.

Superscalar Scheduling
Assume in-order, 4-issue, any combination. In which cycle can we start “executing” each instruction? (Assume the first lw has already gone through F and D.)
    lw  $6, 36($2)
    add $5, $6, $4
    lw  $7, 1000($5)
    sub $9, $12, $5
    sw  $5, 200($6)
    add $3, $9, $9
    and $11, $7, $6
Selection  lw  add  lw  sub  sw  add  and
A          0   0    0   0    1   1    1
B          0   1    1   2    2   3    3
C          0   2    3   3    3   4    5
D          0   2    3   4    4   5    5
E          None are correct
Answer: C. What we’ve done here could be done in HW or SW.
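The two in-order scheduling questions can be checked with a small simulator. This is a sketch under stated assumptions, not the course's tool: instructions are hypothetical `(op, dest, srcs)` tuples, the pipeline is the classic MIPS one with full forwarding (an ALU result is usable by a consumer one cycle later, a load result two cycles later), cycle 0 is the X stage of the first lw, and the `mem_then_int` flag models the "load/store followed by integer" pairing rule of the 2-issue machine.

```python
MEM_OPS = {"lw", "sw"}


def fits(used, op, width, mem_then_int):
    """Can `op` join the instructions already issued this cycle?"""
    if len(used) >= width:
        return False
    if mem_then_int and used:
        # Only legal pair: one memory op followed by one integer op.
        return len(used) == 1 and used[0] in MEM_OPS and op not in MEM_OPS
    return True


def schedule_in_order(program, width, mem_then_int=False):
    """Return the cycle in which each instruction enters X (in-order issue)."""
    ready = {}             # register -> first cycle a consumer may enter X
    cycles = []
    cycle, used = 0, []    # current issue cycle and the ops issued in it
    for op, dest, srcs in program:
        operands_ready = max((ready.get(r, 0) for r in srcs), default=0)
        start = max(operands_ready, cycle)   # never pass an older instruction
        if start == cycle and not fits(used, op, width, mem_then_int):
            start = cycle + 1
        if start > cycle:
            cycle, used = start, []
        used.append(op)
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + (2 if op == "lw" else 1)
    return cycles


two_issue = [
    ("lw", "$6", ("$2",)),
    ("add", "$5", ("$6", "$4")),
    ("lw", "$7", ("$5",)),
    ("sub", "$9", ("$12", "$5")),
]
four_issue = two_issue + [
    ("sw", None, ("$5", "$6")),
    ("add", "$3", ("$9", "$9")),
    ("and", "$11", ("$7", "$6")),
]
print(schedule_in_order(two_issue, width=2, mem_then_int=True))  # [0, 2, 3, 3]
print(schedule_in_order(four_issue, width=4))  # [0, 2, 3, 3, 3, 4, 5]
```

Under these assumptions the results match selection D for the 2-issue question and selection C for the 4-issue question.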

VLIW Advantages
In the past, the strongest argument for VLIW has been that by removing the complexity of dynamic scheduling in hardware, we can increase the clock rate. This advantage now runs into the same problem as superpipelining. What now becomes the best argument for VLIW?
Selection  Best argument
A  VLIW can find more ILP than hardware scheduling
B  VLIW is a legacy component of the Itanium 2
C  VLIW enables a single compilation for multiple generations of processors
D  VLIW can be more power efficient
E  None of the above
Instructor note: talk about predication.

Which of the following pairs of instructions represent hazards that apply only to out-of-order execution?
Instructions (grouped into pairs 1, 2, and 3 on the slide):
    lw  $1, 0($2)
    add $1, $2, $3
    lw  $1, 0($2)
    add $3, $2, $1
    add $2, $1, $4
Selection  Instruction pair
A  1
B  2
C  3
D  1 and 2
E  1 and 3
Instructor note: 1 is WAW, 2 is RAW, 3 is WAR. Talk through register renaming (virtual registers, in a way).
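The three hazard classes can be told apart mechanically from the destination and source registers of an earlier and a later instruction. The sketch below is illustrative only: the `(op, dest, srcs)` encoding and the name `classify_hazards` are assumptions, and the three example pairings are one reading of the flattened slide (consecutive instructions in the order listed), not a confirmed reconstruction of it.

```python
def classify_hazards(first, second):
    """Return the register hazards between `first` and a later `second`."""
    _, dest1, srcs1 = first
    _, dest2, srcs2 = second
    hazards = set()
    if dest1 is not None and dest1 in srcs2:
        hazards.add("RAW")   # second reads what first writes
    if dest1 is not None and dest1 == dest2:
        hazards.add("WAW")   # both write the same register
    if dest2 is not None and dest2 in srcs1:
        hazards.add("WAR")   # second overwrites what first still reads
    return hazards


# Assumed pairings for illustration.
pair_waw = (("lw", "$1", ("$2",)), ("add", "$1", ("$2", "$3")))
pair_raw = (("lw", "$1", ("$2",)), ("add", "$3", ("$2", "$1")))
pair_war = (("add", "$3", ("$2", "$1")), ("add", "$2", ("$1", "$4")))

print(classify_hazards(*pair_waw))  # {'WAW'}
print(classify_hazards(*pair_raw))  # {'RAW'}
print(classify_hazards(*pair_war))  # {'WAR'}
```

RAW is a true data dependence and constrains in-order pipelines too. WAW and WAR are name dependences: they only matter once instructions can complete out of program order, and register renaming removes them by giving each write its own physical register.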

Early Out-of-Order Processor
[Figure: block diagram with Fetch, Decode, Instruction Queue, Register Rename, three INT ALUs, two FP ALUs, Load Queue, Store Queue, L1, and Result Bus]
Instructor note: spend a lot of time talking this through: how the various units communicate, reservation stations, etc. Point out the memory load/store queues. In-order front end, out-of-order back end.

Which of the following are not possible if you allow out-of-order commit (no reorder buffer)?
1. Speculate on load instructions
2. Forward values between instructions
3. Speculate on branches
4. Provide “precise” interrupts
Selection
A  1, 2, and 4
B  1 and 3
C  3 and 4
D  1, 3, and 4
E  None of the above
Answer: D. Instructor note: talk through what each means before posing the question.

Modern OOO Processor
[Figure: block diagram with Fetch, Decode, Instruction Queue, Register Rename, three INT ALUs, two FP ALUs, Load Queue, Store Queue, L1, and Reorder Buffer]
Instructor note: point out the memory load/store queues and how a long-latency miss can still mess things up.

Out of order with Reorder Buffer
Assume 2-issue out-of-order, any pair of instructions, with register renaming. Execution begins as soon as operands are available. (Assume all applicable instructions are held in the instruction queue.) When does each instruction issue (execute)?
          lw  $6, 36($2)
          add $5, $6, $4
          beq $2, $6, there    # predicted NT
    PC+4: lw  $7, 1000($4)
          sub $9, $12, $8
          add $3, $5, $6
Selection  lw  add  beq  lw  sub  add
A          0   2    2    0   1    3
B          0   2    0    1   1    2
C          0   2    2    3   3    4
D          0   2    2    3   4    4
E          None are correct
Instructor note: point out that VLIW can’t do this.
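The out-of-order question can be explored with a greedy issue model. This is a sketch under assumptions that go beyond the slide: instructions are hypothetical `(op, dest, srcs)` tuples, each cycle the oldest ready instructions issue first (up to the issue width), renaming leaves only RAW dependences, an ALU result is usable one cycle after issue and a load result two cycles after, and the predicted-not-taken branch does not block younger instructions from issuing speculatively.

```python
def schedule_out_of_order(program, width):
    """Return the issue cycle of each instruction (oldest-ready-first)."""
    # With renaming, each source depends only on its most recent older writer.
    producers, last_writer = [], {}
    for i, (op, dest, srcs) in enumerate(program):
        producers.append([last_writer[r] for r in srcs if r in last_writer])
        if dest is not None:
            last_writer[dest] = i

    def latency(op):
        return 2 if op == "lw" else 1

    issue = [None] * len(program)
    cycle = 0
    while any(c is None for c in issue):
        issued_now = 0
        for i, (op, dest, srcs) in enumerate(program):   # oldest first
            if issue[i] is not None or issued_now == width:
                continue
            operands_ready = all(
                issue[p] is not None
                and issue[p] + latency(program[p][0]) <= cycle
                for p in producers[i]
            )
            if operands_ready:
                issue[i] = cycle
                issued_now += 1
        cycle += 1
    return issue


program = [
    ("lw", "$6", ("$2",)),
    ("add", "$5", ("$6", "$4")),
    ("beq", None, ("$2", "$6")),     # predicted not taken
    ("lw", "$7", ("$4",)),
    ("sub", "$9", ("$12", "$8")),
    ("add", "$3", ("$5", "$6")),
]
print(schedule_out_of_order(program, width=2))  # [0, 2, 2, 0, 1, 3]
```

Under these assumptions the second lw and the sub run ahead of the stalled add and the branch, which is the reordering a statically scheduled machine cannot perform at run time.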

Pentium 4

Modern Processors
- Pentium II, III: 3-wide superscalar, out-of-order, 14 integer pipeline stages
- Pentium 4: 3-wide superscalar, out-of-order, simultaneous multithreading, 20+ pipe stages
- AMD Athlon: 3-wide superscalar, out-of-order, 10 integer pipe stages
- AMD Opteron: similar to Athlon, with 64-bit registers, 12 pipe stages, better multiprocessor support
- Alpha 21164: 2-wide superscalar, in-order, 7 pipe stages
- Alpha 21264: 4-wide superscalar, out-of-order, 7 pipe stages
- Intel Itanium: 3-operation VLIW, 2-instruction issue (6 ops per cycle), in-order, 10-stage pipeline

Nehalem
Figure from: Fast Thread Migration via Working Set Prediction. Brown, Porter, Tullsen. HPCA 2011.

Advanced Pipelining -- Key Points
- ET = Number of instructions * CPI * cycle time
- Pipelining attempts to get CPI close to 1.
- To improve performance we must reduce CT (superpipelining) or push CPI below one (superscalar, VLIW).
- Hardware or software can guarantee instruction independence.
- Modern processors often do in-order fetch and in-order commit, with out-of-order execution in between.
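A short worked example of the execution-time equation shows the two routes to better performance side by side. The numbers are assumptions chosen for easy arithmetic (10^9 instructions, a 1 ns baseline cycle), and both improved cases are idealized: they ignore hazards, stalls, and the register overhead that limits deeper pipelines.

```latex
\[
ET = IC \times CPI \times CT
\]
\[
\text{Baseline pipeline: } ET = 10^{9} \times 1 \times 1\,\text{ns} = 1\,\text{s}
\]
\[
\text{Superpipelined (CT halved): } ET = 10^{9} \times 1 \times 0.5\,\text{ns} = 0.5\,\text{s}
\]
\[
\text{2-wide superscalar (CPI halved): } ET = 10^{9} \times 0.5 \times 1\,\text{ns} = 0.5\,\text{s}
\]
```

In this idealized setting both approaches give the same speedup; in practice neither factor halves cleanly, which is why real designs combine moderate pipeline depth with multiple issue.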
