1
Out of Order Processors
2
Outline
  Pipeline events
  OoO Classes
  IO2I Processors
  Dynamic Scheduling
  Scoreboard
  Tomasulo's Algorithm
  Alpha OoO implementation
3
MIPS Pipeline Events
  Instruction Issue
    When an instruction moves into the EX stage after completing the ID stage
    Decode stage = instruction decode + structural hazard detection and operand-ready identification + register read
  Instruction Commit
    When an instruction is guaranteed to complete
    The instruction updates the state of the processor
  Branch Delay
    Clock cycles needed to ascertain whether the NPC is to be used or the address produced by the effective address calculation
4
Out-of-Order Classes
5
OoO Motivating Code Sequence
  Compilers for sequential machines have no way of expressing the inherent parallelism in the code
  VLIW processors, dataflow machines
6
I4: Inorder Fetch, Issue, Write Back, Commit
[Figure: pipeline diagram; stages shown: X, M, W]
7
I4: Inorder Fetch, Issue, Write Back, Commit
[Figure: pipeline diagram; stages F, D, W with X0 – X1 and M0 – M1]
8
I4: Inorder Fetch, Issue, Write Back, Commit
[Figure: pipeline diagram; stages F, D, W with X0 – X3, M0 – M3, Y0 – Y3; label: Integer Function Unit]
9
I4: Inorder Fetch, Issue, Write Back, Commit
[Figure: pipeline diagram; stages F, D, W with X0 – X3, M0 – M3, Y0 – Y3; label: Memory Access Unit]
10
I4: Inorder Fetch, Issue, Write Back, Commit
[Figure: pipeline diagram; stages F, D, W with X0 – X3, M0 – M3, Y0 – Y3; label: Multiply Function Unit]
11
I4: Inorder Fetch, Issue, Write Back, Commit
[Figure: pipeline diagram; stages F, D, I, W with X0 – X3, M0 – M3, Y0 – Y3; labels: Full bypassing, Issue stage]
12
IO2I: IO Fetch, OoO Issue, OoO Write Back, IO Commit
[Figure: pipeline diagram; stages F, D, I, W, C with X0, S0, M0 – M1, Y0 – Y3; structures: IQ, SB, ROB, PRF, ARF]
13
Dynamic Scheduling
  Out-of-order execution
    Check for structural and data hazards
    Begin executing as soon as operands are available
  Implies out-of-order completion
    WAR and WAW hazards become possible
    Imprecise exceptions
  Example
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F12, F8, F14
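The dependences in a sequence like the one above can be found mechanically. The following is a minimal Python sketch, not taken from the slides: `Instr` and `find_hazards` are hypothetical names, and the pairwise check is deliberately naive (it compares every earlier/later pair and ignores intervening writes to the same register).

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def find_hazards(program):
    """Return (earlier_index, later_index, kind, register) for each dependence."""
    found = []
    for j, later in enumerate(program):
        for i in range(j):
            earlier = program[i]
            if earlier.dest in later.srcs:    # later reads what earlier writes
                found.append((i, j, "RAW", earlier.dest))
            if later.dest in earlier.srcs:    # later writes what earlier reads
                found.append((i, j, "WAR", later.dest))
            if later.dest == earlier.dest:    # both write the same register
                found.append((i, j, "WAW", later.dest))
    return found


program = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("ADD.D", "F10", ("F0", "F8")),
    Instr("SUB.D", "F12", ("F8", "F14")),
]
hazards = find_hazards(program)
# Instructions that are not the consumer in any RAW dependence.
independent = [k for k in range(len(program))
               if all(j != k for _, j, kind, _ in hazards if kind == "RAW")]
```

Here `hazards` contains only the RAW dependence of ADD.D on DIV.D through F0, so SUB.D is free to execute while ADD.D waits for the long-latency divide.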
14
Dynamic Scheduling
  Separate the ID stage into 2 stages:
    Issue: decode and check for structural hazards
    Read Operands: wait until data hazards clear, then read operands
  Multi-cycle execution
  Scoreboard
    CDC 6600 (1965), mainframe computer
    16 functional units: 4 FP, 5 memory reference units, 7 INT
15
Scoreboarding Example
  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2
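The sequence above can be traced with a minimal sketch of scoreboard bookkeeping. This is an illustration under stated assumptions, not the CDC 6600 design: the `Scoreboard` class and its method names are hypothetical, timing is not modeled, and the caller decides when each step is attempted. The field names (Fi, Fj, Fk, Qj, Qk, Rj, Rk) follow the functional unit status table used in the slides.

```python
class Scoreboard:
    """Bookkeeping only: no timing, one instruction per functional unit."""

    def __init__(self, units):
        self.units = {name: None for name in units}  # None means not busy
        self.result = {}  # register result status: register -> producing unit

    def issue(self, unit, op, dest, src1, src2):
        # Stall on a structural hazard (unit busy) or a WAW hazard
        # (an active instruction already targets dest).
        if self.units[unit] is not None or dest in self.result:
            return False
        qj, qk = self.result.get(src1), self.result.get(src2)
        self.units[unit] = {"op": op, "fi": dest, "fj": src1, "fk": src2,
                            "qj": qj, "qk": qk,
                            "rj": qj is None, "rk": qk is None}
        self.result[dest] = unit
        return True

    def read_operands(self, unit):
        # Stall until both operands are ready (RAW hazards resolved).
        entry = self.units[unit]
        if not (entry["rj"] and entry["rk"]):
            return False
        entry["rj"] = entry["rk"] = False  # operands have now been read
        return True

    def write_result(self, unit):
        entry = self.units[unit]
        # Stall on a WAR hazard: another unit still has to read the old value.
        for name, other in self.units.items():
            if other is None or name == unit:
                continue
            if ((other["fj"] == entry["fi"] and other["rj"]) or
                    (other["fk"] == entry["fi"] and other["rk"])):
                return False
        # Wake up units that were waiting on this result.
        for other in self.units.values():
            if other is None:
                continue
            if other["qj"] == unit:
                other["qj"], other["rj"] = None, True
            if other["qk"] == unit:
                other["qk"], other["rk"] = None, True
        del self.result[entry["fi"]]
        self.units[unit] = None
        return True


sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Integer", "L.D", "F6", None, "R2")
sb.read_operands("Integer")
sb.write_result("Integer")
sb.issue("Integer", "L.D", "F2", None, "R3")
sb.read_operands("Integer")
sb.issue("Mult1", "MUL.D", "F0", "F2", "F4")
sb.issue("Add", "SUB.D", "F8", "F6", "F2")
sb.issue("Divide", "DIV.D", "F10", "F0", "F6")
add_stalled = not sb.issue("Add", "ADD.D", "F6", "F8", "F2")  # Add unit is busy
mult_waits_on = sb.units["Mult1"]["qj"]                       # producer of F2

sb.write_result("Integer")   # second L.D writes F2
sb.read_operands("Add")      # SUB.D reads F6 and F2
sb.write_result("Add")       # SUB.D writes F8 and frees the Add unit
sb.issue("Add", "ADD.D", "F6", "F8", "F2")
sb.read_operands("Add")
war_stall = not sb.write_result("Add")  # DIV.D has not read F6 yet
```

In this trace ADD.D first stalls at issue because the Add unit is occupied by SUB.D, and later stalls at write result because DIV.D still has to read the old value of F6.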
16
Scoreboarding Example
Just before the second L.D writes its result

Instruction Status
Instruction          Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)     √      √              √                   √
L.D   F2, 45(R3)     √      √              √
MUL.D F0, F2, F4     √
SUB.D F8, F6, F2     √
DIV.D F10, F0, F6    √
ADD.D F6, F8, F2

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
Integer  Yes   Load  F2   -   R3  -        -        -    No
Mult1    Yes   Mult  F0   F2  F4  Integer  -        No   Yes
Mult2    No
Add      Yes   Sub   F8   F6  F2  -        Integer  Yes  No
Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes

Register Result Status
      F0     F2       F4  F6  F8   F10     F12  ...  F30
FU    Mult1  Integer  -   -   Add  Divide  -         -
17
Scoreboarding Example
Instruction Status
Instruction          Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)     √      √              √                   √
L.D   F2, 45(R3)     √      √              √                   √
MUL.D F0, F2, F4     √
SUB.D F8, F6, F2     √
DIV.D F10, F0, F6    √
ADD.D F6, F8, F2

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
Integer  No
Mult1    Yes   Mult  F0   F2  F4  -      -   Yes  Yes
Mult2    No
Add      Yes   Sub   F8   F6  F2  -      -   Yes  Yes
Divide   Yes   Div   F10  F0  F6  Mult1  -   No   Yes

Register Result Status
      F0     F2  F4  F6  F8   F10     F12  ...  F30
FU    Mult1  -   -   -   Add  Divide  -         -
18
Scoreboarding Example
Instruction Status
Instruction          Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)     √      √              √                   √
L.D   F2, 45(R3)     √      √              √                   √
MUL.D F0, F2, F4     √      √
SUB.D F8, F6, F2     √      √              √                   √
DIV.D F10, F0, F6    √
ADD.D F6, F8, F2

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer  No
Mult1    Yes   Mult  F0   F2  F4  -      -   No  No
Mult2    No
Add      Yes   Sub   F8   F6  F2  -      -   No  No
Divide   Yes   Div   F10  F0  F6  Mult1  -   No  Yes

Register Result Status
      F0     F2  F4  F6  F8   F10     F12  ...  F30
FU    Mult1  -   -   -   Add  Divide  -         -
19
Scoreboarding Example
Instruction Status
Instruction          Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)     √      √              √                   √
L.D   F2, 45(R3)     √      √              √                   √
MUL.D F0, F2, F4     √      √
SUB.D F8, F6, F2     √      √              √                   √
DIV.D F10, F0, F6    √
ADD.D F6, F8, F2

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer  No
Mult1    Yes   Mult  F0   F2  F4  -      -   No  No
Mult2    No
Add      No
Divide   Yes   Div   F10  F0  F6  Mult1  -   No  Yes

Register Result Status
      F0     F2  F4  F6  F8  F10     F12  ...  F30
FU    Mult1  -   -   -   -   Divide  -         -
20
Scoreboarding Example
Instruction Status
Instruction          Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)     √      √              √                   √
L.D   F2, 45(R3)     √      √              √                   √
MUL.D F0, F2, F4     √      √              √
SUB.D F8, F6, F2     √      √              √                   √
DIV.D F10, F0, F6    √
ADD.D F6, F8, F2     √      √              √

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer  No
Mult1    Yes   Mult  F0   F2  F4  -      -   No  No
Mult2    No
Add      Yes   Add   F6   F8  F2  -      -   No  No
Divide   Yes   Div   F10  F0  F6  Mult1  -   No  Yes

Register Result Status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1  -   -   Add  -   Divide  -         -
21
Scoreboarding Example
Instruction Status
Instruction          Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)     √      √              √                   √
L.D   F2, 45(R3)     √      √              √                   √
MUL.D F0, F2, F4     √      √              √                   √
SUB.D F8, F6, F2     √      √              √                   √
DIV.D F10, F0, F6    √      √              √
ADD.D F6, F8, F2     √      √              √                   √

Functional Unit Status
Name     Busy  Op   Fi   Fj  Fk  Qj  Qk  Rj  Rk
Integer  No
Mult1    No
Mult2    No
Add      No
Divide   Yes   Div  F10  F0  F6  -   -   No  No

Register Result Status
      F0  F2  F4  F6  F8  F10     F12  ...  F30
FU    -   -   -   -   -   Divide  -         -
22
Tomasulo's Algorithm
  Invented by Robert Tomasulo for the IBM 360/91 (3 years after the CDC 6600)
  Goal: high performance without special compilers
  Tomasulo's algorithm vs. scoreboard
  Influenced designs of Alpha 21264, HP 8000, MIPS, Pentium II, PowerPC 604 …
  Tomasulo [1967]. "An efficient algorithm for exploiting multiple arithmetic units," IBM J. Research and Development 11:1 (Jan).
23
Tomasulo's Algorithm
[Figure: block diagram; labels: From Instruction Unit, Instruction Queue, FP Registers, Load/Store operations, Address Unit, Load buffers, Store buffers, Reservation Stations, FP Adder, FP Multipliers, Memory Unit (Data, Address), Common Data Bus]
24
Steps in Tomasulo's Algorithm
  Issue (also called dispatch)
    Check for structural hazards
    Queue the instruction in a reservation station (RS)
    Keep track of the FU generating an operand if it is not available in the RF
    Eliminates WAR and WAW hazards
  Execute
    Monitor the CDB for operands and execute once all are available (resolves RAW hazards)
  Write result
    Write the result on the CDB
    The RS is marked available
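The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not the IBM 360/91 design: the `Tomasulo` class, its method names, and the register values are hypothetical, there are no load/store buffers, and execute plus write result are folded into a single `complete` call that the caller invokes in whatever order the functional units would finish.

```python
import operator

OPS = {"ADD.D": operator.add, "SUB.D": operator.sub,
       "MUL.D": operator.mul, "DIV.D": operator.truediv}


class Tomasulo:
    def __init__(self, stations, regs):
        self.rs = {name: None for name in stations}  # None means available
        self.regs = dict(regs)  # register file values
        self.qi = {}            # register status: register -> producing RS

    def issue(self, station, op, dest, src1, src2):
        if self.rs[station] is not None:  # structural hazard: RS occupied
            return False
        entry = {"op": op, "vj": None, "vk": None, "qj": None, "qk": None}
        for v, q, src in (("vj", "qj", src1), ("vk", "qk", src2)):
            if src in self.qi:
                entry[q] = self.qi[src]    # operand pending: remember producer
            else:
                entry[v] = self.regs[src]  # operand ready: copy the value now
        self.rs[station] = entry
        self.qi[dest] = station  # renames dest to this reservation station
        return True

    def ready(self, station):
        entry = self.rs[station]
        return (entry is not None and entry["qj"] is None
                and entry["qk"] is None)

    def complete(self, station):
        """Execute, then broadcast the result on the common data bus."""
        if not self.ready(station):
            return False
        entry = self.rs[station]
        value = OPS[entry["op"]](entry["vj"], entry["vk"])
        for other in self.rs.values():  # waiting stations capture the value
            if other is None:
                continue
            if other["qj"] == station:
                other["vj"], other["qj"] = value, None
            if other["qk"] == station:
                other["vk"], other["qk"] = value, None
        for reg, tag in list(self.qi.items()):
            if tag == station:  # only the latest writer updates the register
                self.regs[reg] = value
                del self.qi[reg]
        self.rs[station] = None  # RS is marked available
        return True


# Register values are made up for the illustration.
t = Tomasulo(["Add1", "Mult1", "Mult2"],
             {"F0": 0.0, "F2": 2.0, "F4": 3.0, "F6": 10.0, "F8": 1.0})
t.issue("Mult1", "MUL.D", "F0", "F2", "F4")
t.issue("Mult2", "DIV.D", "F10", "F0", "F6")  # waits on Mult1, copies F6 = 10.0
t.issue("Add1", "ADD.D", "F6", "F8", "F2")
div_blocked = not t.complete("Mult2")         # F0 is not available yet
t.complete("Add1")                            # ADD.D overwrites F6 early
t.complete("Mult1")
t.complete("Mult2")
```

Because DIV.D copied the old value of F6 into its reservation station at issue, ADD.D can write F6 before DIV.D executes without changing the quotient, which is how the WAR hazard disappears in this sketch.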
25
Example

Instruction Status
Instruction          Issue  Read operands  Write result
L.D   F6, 34(R2)     √      √              √
L.D   F2, 44(R3)     √      √              √
MUL.D F0, F2, F4     √      √              √
SUB.D F8, F2, F6     √      √              √
DIV.D F10, F0, F6    √
ADD.D F6, F8, F2     √      √              √

Reservation Stations (Busy and A list successive values)
Name   Busy           Op    Vj                 Vk                 Qj     Qk     A
Load1  yes, then no   Load  -                  -                  -      -      34, then 34+Regs[R2]
Load2  yes, then no   Load  -                  -                  -      -      44, then 44+Regs[R3]
Add1   yes, then no   SUB   Mem[44+Regs[R3]]   Mem[34+Regs[R2]]   Load2  Load1  -
Add2   yes, then no   ADD   Add1[F8]           Mem[44+Regs[R3]]   Add1   Load2  -
Add3   no
Mult1  yes, then no   MUL   Mem[44+Regs[R3]]   Regs[F4]           Load2  -      -
Mult2  yes            DIV   -                  Mem[34+Regs[R2]]   Mult1  Load1  -

Register Status
Field  F0     F2     F4  F6    F8    F10    F12  ...  F30
Qi     Mult1  Load2  -   Add2  Add1  Mult2  -         -
26
OoO Processor Implementation
[Figure: block diagram; labels: Branch Prediction, Instruction Fetch, Instruction Fetch Queue, Decode and Rename, Issue Queue, ALU (three), Register File R1 – R32, Reorder Buffer (RoB) with entries I1 – I6 and tags T1 – T6]

  Fetched instructions    After Decode and Rename
  R1 ← R1 + R2            T1 ← R1 + R2
  R2 ← R1 + R3            T2 ← T1 + R3
  BEQZ R2                 BEQZ T2
  R3 ← R1 + R2            T4 ← T1 + T2
  R1 ← R3 + R2            T5 ← T4 + T2
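The renaming shown on this slide can be reproduced with a short sketch. The assumption here, inferred from the slide rather than stated on it, is that every instruction occupies one reorder buffer slot and that the tag Tn names the result of the instruction in slot n, which is why the branch uses up T3 even though it writes no register. The function name `rename` and the tuple encoding are hypothetical.

```python
def rename(program):
    """Rename destinations to reorder-buffer tags T1, T2, ...

    program is a list of (dest, sources) pairs; dest is None for an
    instruction that writes no register (for example a branch).
    """
    latest = {}   # architectural register -> tag of its newest producer
    renamed = []
    for slot, (dest, srcs) in enumerate(program, start=1):
        # Sources are looked up before the destination is remapped, so
        # R1 <- R1 + R2 reads the old R1.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"T{slot}"
            latest[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


program = [
    ("R1", ("R1", "R2")),   # R1 <- R1 + R2
    ("R2", ("R1", "R3")),   # R2 <- R1 + R3
    (None, ("R2",)),        # BEQZ R2
    ("R3", ("R1", "R2")),   # R3 <- R1 + R2
    ("R1", ("R3", "R2")),   # R1 <- R3 + R2
]
renamed = rename(program)
```

After renaming, the last instruction writes T5 instead of R1, so it no longer conflicts with the earlier instructions that read or write R1. Only the true (RAW) dependences through T1, T2 and T4 remain.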
27
Alpha 21264 OoO Implementation
[Figure: block diagram; labels: Branch Prediction, Instruction Fetch, Instruction Fetch Queue, Decode and Rename, Issue Queue, ALU (three), Register File R1 – R32, Reorder Buffer (RoB) with entries I1 – I6, register map R1 → P1, R2 → P39, ...]

  Fetched instructions    After Decode and Rename
  R1 ← R1 + R2            T1 ← R1 + R2
  R2 ← R1 + R3            T2 ← T1 + R3
  BEQZ R2                 BEQZ T2
  R3 ← R1 + R2            T4 ← T1 + T2
  R1 ← R3 + R2            T5 ← T4 + T2

R. E. Kessler, The Alpha 21264 Microprocessor. IEEE Micro, 19(2), 1999.
28
References
  ELE475. David Wentzlaff. Princeton.
  CS6810. Rajeev Balasubramonian. University of Utah.
  Shen and Lipasti. Modern Processor Design.
  Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 5th ed.