COMP 740: Computer Architecture and Implementation


COMP 740: Computer Architecture and Implementation Montek Singh Oct 31, 2016 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)

Outline
  Multiple-Issue Architectures
    Superscalar processors
    VLIW (very long instruction word) processors
  Scheduling
    Statically scheduled (using compiler techniques)
    Dynamically scheduled (using variants of Tomasulo’s algorithm)
  Reading: HP5, Sections 3.6-3.10

Multiple Issue
  Eliminating data and control stalls can achieve a CPI of 1
  Can we decrease CPI below 1?
    Not if we issue only one instruction per clock cycle
  Multiple-issue processors allow multiple instructions to issue in a clock cycle
    Superscalar: issue varying numbers of instructions per clock
      Statically scheduled by compiler, OR
      Dynamically scheduled by hardware
    VLIW: issue a fixed number of instructions per clock
      Statically scheduled by compiler
  Examples
    Superscalar: IBM PowerPC, Sun SuperSPARC, DEC Alpha, HP 8000
      Intel Core series: 4-way superscalar
    VLIW: Intel/HP Itanium

A Superscalar Version of MIPS
  Two instructions can be issued per clock cycle
    One can be a load/store/branch/integer operation
    The other can be any FP operation
  Need to fetch and decode 64 bits per cycle
    Instructions paired and aligned on 64-bit boundary
    Integer instruction appears first
  Dynamic issue
    First instruction issues if independent and satisfies other criteria
    Second instruction issues only if the first one does, and is independent and satisfies similar criteria
  Limitation
    One-cycle delay for loads and branches now turns into a three-instruction delay!
    … because instructions are now squeezed closer together

Simple Superscalar MIPS

Performance of Static Superscalar
  Original loop:
    LOOP: L.D   F0, 0(R1)
          ADD.D F4, F0, F2
          S.D   0(R1), F4
          SUBI  R1, R1, 8
          BNEZ  R1, LOOP
  Unrolled and scheduled for the two-issue machine:
          Integer instruction     FP instruction        Clock cycle
    LOOP: L.D  F0, 0(R1)                                1
          L.D  F6, -8(R1)                               2
          L.D  F10, -16(R1)       ADD.D F4, F0, F2      3
          L.D  F14, -24(R1)       ADD.D F8, F6, F2      4
          L.D  F18, -32(R1)       ADD.D F12, F10, F2    5
          S.D  0(R1), F4          ADD.D F16, F14, F2    6
          S.D  -8(R1), F8         ADD.D F20, F18, F2    7
          S.D  -16(R1), F12                             8
          SUBI R1, R1, 40                               9
          S.D  16(R1), F16                              10
          BNEZ R1, LOOP                                 11
          S.D  8(R1), F20                               12
  Assumptions: L.D takes 2 cycles, ADD.D takes 3
  Loop unrolled five times and scheduled statically
    6 cycles per element in original scheduled code
    2.4 cycles per element in superscalar code (2.5x)
      Loop unrolling gets us from 6 to 3.5 cycles per element (1.7x)
      Superscalar execution from 3.5 to 2.4 cycles per element (1.5x)
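The speedups quoted on this slide follow directly from the cycles-per-element figures. A small check, with variable names chosen here for illustration; the figure of 12 clock cycles for 5 elements is inferred from the stated 2.4 cycles per element:

```python
# Cycles per element as quoted on the slide.
cycles_original = 6.0          # original scheduled loop
cycles_unrolled = 3.5          # after unrolling five times (single issue)
cycles_superscalar = 12 / 5    # assumed: 12 clock cycles for 5 elements


def speedup(before, after):
    """Return how many times faster `after` is compared with `before`."""
    return before / after


overall = speedup(cycles_original, cycles_superscalar)
from_unrolling = speedup(cycles_original, cycles_unrolled)
from_dual_issue = speedup(cycles_unrolled, cycles_superscalar)

print(f"overall:         {overall:.2f}x")
print(f"loop unrolling:  {from_unrolling:.2f}x")
print(f"dual issue:      {from_dual_issue:.2f}x")
```

The overall gain is the product of the two partial gains, which is why the slide can attribute part of the 2.5x to unrolling and the rest to issuing two instructions per clock.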

Multiple-Issue with Dynamic Scheduling
  Extend Tomasulo’s algorithm to support issuing 2 instr/cycle: 1 integer, 1 FP
  Simple approach: separate Tomasulo control for the integer and FP units: one set of reservation stations for the integer unit and one for the FP unit
  How to issue two instructions per cycle and still keep in-order instruction issue for Tomasulo?
    Issue logic runs in one-half clock cycle
    Can do two in-order issues in one clock cycle

Performance of Dynamic Superscalar
  (all entries are clock-cycle numbers)
  Iter. no.  Instruction         Issues  Executes  Writes result
  1          L.D F0,0(R1)        1       2         4
  1          ADD.D F4,F0,F2      1       5         8
  1          S.D 0(R1),F4        2       9
  1          SUBI R1,R1,8        3       4         5
  1          BNEZ R1,LOOP        4       5
  2          L.D F0,0(R1)        5       6         8
  2          ADD.D F4,F0,F2      5       9         12
  2          S.D 0(R1),F4        6       13
  2          SUBI R1,R1,8        7       8         9
  2          BNEZ R1,LOOP        8       9
  4 clocks per iteration

Limits of Superscalar
  While the Integer/FP split is simple to implement, we get a CPI of 0.5 only for programs with:
    Exactly 50% FP operations
    No hazards
  If more instructions issue at the same time, decode and issue become more difficult
    Even a 2-scalar machine has to do a lot of work
      Examine 2 opcodes
      Examine 6 register specifiers
      Decide if 1 or 2 instructions can issue
      If 2 are issuing, issue them in order
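The issue decision for the two-way machine can be sketched in a few lines. This is a minimal illustration, not the actual issue logic: the `Instr` type, the `FP_OPCODES` set, and `issue_count` are names invented here, the first instruction is assumed to be free of hazards against earlier instructions, and only RAW and WAW conflicts inside the pair are checked.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    opcode: str
    dest: Optional[str]          # destination register, None for stores/branches
    srcs: Tuple[str, ...]        # source registers


# Assumed set of FP opcodes for this sketch.
FP_OPCODES = {"ADD.D", "SUB.D", "MUL.D", "DIV.D"}


def is_fp(instr: Instr) -> bool:
    return instr.opcode in FP_OPCODES


def issue_count(first: Instr, second: Instr) -> int:
    """Return how many instructions of the aligned pair issue this cycle (1 or 2)."""
    # Slot rule: integer/load/store/branch first, FP operation second.
    if is_fp(first) or not is_fp(second):
        return 1
    # Examine the register specifiers of both instructions:
    # the second may not read or overwrite the first one's destination.
    if first.dest is not None and (first.dest in second.srcs or first.dest == second.dest):
        return 1
    return 2


load = Instr("L.D", "F0", ("R1",))
dependent_add = Instr("ADD.D", "F4", ("F0", "F2"))
independent_add = Instr("ADD.D", "F8", ("F6", "F2"))

print(issue_count(load, dependent_add))    # pair split: the add needs F0 from the load
print(issue_count(load, independent_add))  # both issue together
```

Even this toy version compares one destination against three other specifiers per pair; a wider machine has to make the same comparisons across every pair of instructions in the issue group.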

Modern Superscalar
  4-6 instructions/clock
    Intel Core series (i3/i5/i7): 4-way superscalar
  Issue in fraction-of-clock steps, and also include logic to deal with dependencies
  Need to complete (i.e., write results of) multiple instructions per clock as well

Very Long Instruction Word (VLIW)
  VLIW: trade off instruction space for simple decoding
    The long instruction word has room for many operations
    By definition, all the operations the compiler puts in the long instruction word can execute in parallel
    E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
      16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
    Need very sophisticated compiling technique …
      … that schedules across several branches

Loop Unrolling in VLIW
  Assumptions: L.D takes 2 cycles, ADD.D takes 3
  Memory            Memory            FP                 FP                 Int. op/         Clock
  reference 1       reference 2       operation 1        op. 2              branch
  L.D F0,0(R1)      L.D F6,-8(R1)                                                            1
  L.D F10,-16(R1)   L.D F14,-24(R1)                                                          2
  L.D F18,-32(R1)   L.D F22,-40(R1)   ADD.D F4,F0,F2     ADD.D F8,F6,F2                      3
  L.D F26,-48(R1)                     ADD.D F12,F10,F2   ADD.D F16,F14,F2                    4
                                      ADD.D F20,F18,F2   ADD.D F24,F22,F2                    5
  S.D 0(R1),F4      S.D -8(R1),F8     ADD.D F28,F26,F2                                       6
  S.D -16(R1),F12   S.D -24(R1),F16                                                          7
  S.D -32(R1),F20   S.D -40(R1),F24                                         SUBI R1,R1,#48   8
  S.D -0(R1),F28                                                            BNEZ R1,LOOP     9
  Unrolled 7 times to avoid delays
  7 results in 9 clocks, or 1.3 clocks per iteration
  Need more registers in VLIW
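The numbers on this slide can be recomputed from the schedule itself. The sketch below transcribes the nine long instruction words as tuples (the slot-to-column assignment is reconstructed from the slide) and counts clocks, results, and filled slots; the variable names are illustrative.

```python
# One tuple per long instruction word, in the slot order used on the slide:
# (memory ref 1, memory ref 2, FP op 1, FP op 2, integer op / branch).
# None marks an empty slot.
schedule = [
    ("L.D F0,0(R1)",    "L.D F6,-8(R1)",   None,               None,               None),
    ("L.D F10,-16(R1)", "L.D F14,-24(R1)", None,               None,               None),
    ("L.D F18,-32(R1)", "L.D F22,-40(R1)", "ADD.D F4,F0,F2",   "ADD.D F8,F6,F2",   None),
    ("L.D F26,-48(R1)", None,              "ADD.D F12,F10,F2", "ADD.D F16,F14,F2", None),
    (None,              None,              "ADD.D F20,F18,F2", "ADD.D F24,F22,F2", None),
    ("S.D 0(R1),F4",    "S.D -8(R1),F8",   "ADD.D F28,F26,F2", None,               None),
    ("S.D -16(R1),F12", "S.D -24(R1),F16", None,               None,               None),
    ("S.D -32(R1),F20", "S.D -40(R1),F24", None,               None,               "SUBI R1,R1,#48"),
    ("S.D -0(R1),F28",  None,              None,               None,               "BNEZ R1,LOOP"),
]

clocks = len(schedule)
slots = clocks * len(schedule[0])
operations = sum(op is not None for word in schedule for op in word)
# Each store writes one finished array element back to memory.
results = sum(op is not None and op.startswith("S.D") for word in schedule for op in word)

clocks_per_result = clocks / results
utilization = operations / slots

print(f"{results} results in {clocks} clocks: {clocks_per_result:.2f} clocks per result")
print(f"{operations} of {slots} operation slots used ({utilization:.0%})")
```

Roughly half of the operation slots stay empty even in this favorable loop, which is the code-size cost that the later slide on VLIW limitations refers to as wasted fields.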

Limits to Multi-Issue Machines
  Inherent limitations of ILP
    1 branch in 5 instructions: how to keep a 5-way VLIW busy?
    Latencies of units: many operations must be scheduled
    Number of independent instructions needed: Pipeline Depth * Number of FUs, i.e. ~15-20 independent instructions!
  Difficulties in building the underlying hardware
    Large increase in bandwidth of memory and register file
  Limitations specific to either SS or VLIW implementation
    Decode and issue in SS
    VLIW code size: unrolled loops + wasted fields in VLIW
    VLIW lock step => 1 hazard and all instructions stall
    VLIW and binary compatibility is a practical weakness

Hardware Support for More ILP
  Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
    If false, then do nothing at all: neither store the result nor cause an exception
    Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move (CMOV)
  Drawbacks to conditional instructions
    Still takes a clock even if “annulled”
    Stall if condition evaluated late
    Complex conditions reduce effectiveness; condition becomes known late in pipeline
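The effect of turning a branch into a conditional move can be imitated in software. In this sketch `select` stands in for a CMOV instruction and `+` stands in for the generic `op`; both functions and their names are illustrative only.

```python
def branchy(x, a, b, c):
    """Original form: a branch guards the assignment."""
    if x:
        a = b + c
    return a


def select(cond, if_true, if_false):
    """Stand-in for a conditional move: picks one of two already computed values."""
    return if_true if cond else if_false


def predicated(x, a, b, c):
    """If-converted form: the operation always executes, only the write is conditional."""
    t = b + c                 # computed whether or not x holds; costs a slot even if annulled
    return select(x, t, a)    # CMOV a, t, x


for x in (0, 1):
    print(x, branchy(x, 10, 2, 3), predicated(x, 10, 2, 3))
```

Both forms return the same value for either outcome of the condition; the predicated form has no control dependence, at the price of always spending the cycle on `b + c`.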

Hardware Support for More ILP
  Speculation: allow an instruction that depends on a branch predicted to be taken to issue, without any consequences (including exceptions) if the branch is not actually taken
    Hardware needs to provide an “undo” operation = squash
  Often try to combine with dynamic scheduling
  Tomasulo: compute speculatively, “commit” later
    When an instruction is no longer speculative, write its results (instruction commit)
    Execute out of order but commit in order
  Examples: PowerPC 620, MIPS R10000, Intel P6, AMD K5 …

Hardware Support for More ILP
  Need HW buffer for results of uncommitted instructions: reorder buffer (ROB)
    Reorder buffer can be operand source
    Once instruction commits, result is found in register
    3 fields: instr. type, destination, value
    Use reorder buffer number instead of reservation station
    Instructions commit in order
    As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions
  [Figure: Instr Queue, Reorder Buffer, FP Regs, two sets of Res Stations, FP Adder, FP Mult]

Four Steps of Speculative Tomasulo Algorithm
  1. Issue: get instruction from FP Op Queue
     If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer no. for the destination; each RS now also has a field for ROB#.
  2. Execution: operate on operands (EX)
     When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute
  3. Write result: finish execution (WB)
     Write on the Common Data Bus to all awaiting RSs and the ROB; mark the RS available
  4. Commit: update register with reorder result
     When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer
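The interplay of issue, write result, and commit can be sketched with a small queue. This is a simplified model under stated assumptions: the class `ReorderBuffer` and its method names are invented here, reservation stations and the CDB are left out, and one instruction commits per call.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: in-order issue, out-of-order completion, in-order commit."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()   # oldest entry on the left
        self.next_tag = 1        # ROB number handed out at issue

    def issue(self, instr, dest):
        """Allocate a slot; return its ROB number, or None if the buffer is full."""
        if len(self.entries) >= self.size:
            return None          # no free slot: issue stalls
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append(
            {"tag": tag, "instr": instr, "dest": dest, "value": None, "done": False}
        )
        return tag

    def write_result(self, tag, value):
        """Record a finished result in the entry with the given ROB number."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def commit(self, regs):
        """Retire the head entry if its result is present; return its instruction or None."""
        if not self.entries or not self.entries[0]["done"]:
            return None
        entry = self.entries.popleft()
        if entry["dest"] is not None:
            regs[entry["dest"]] = entry["value"]
        return entry["instr"]


rob = ReorderBuffer(size=4)
regs = {}
slow = rob.issue("DIVD F2,F10,F6", "F2")
fast = rob.issue("ADDD F0,F4,F6", "F0")
rob.write_result(fast, 7.0)          # younger instruction finishes first
print(rob.commit(regs), regs)        # nothing commits: DIVD at the head is not done
rob.write_result(slow, 3.0)
print(rob.commit(regs), rob.commit(regs), regs)
```

Because the register file is only written at commit, discarding every entry behind a mispredicted branch leaves the architectural state untouched.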

Result Shift Register and Reorder Buffer
  General solution to three problems
    Precise exceptions
    Speculative execution
    Register renaming
  Solution in three steps
    In-order initiation, out-of-order termination (using RSRa)
    In-order initiation, in-order termination (using RSRb)
    In-order initiation, in-order termination, with renaming (using ROB)
  Architectural model
    Essentially MIPS FP pipeline
    FP add/sub takes 2 clock cycles, multiplication 5, division 10
    Memory accesses take 1 clock cycle
    Integer instructions take 1 clock cycle
    1 branch delay slot, delayed branches

Step I: I-O Initiation, O-O Termination (RSRa)
  LOOP: LD    F6, 32(R2)
        LD    F2, 48(R3)
        MULTD F0, F2, F4
        ADDI  R2, R2, 8
        ADDI  R3, R3, 8
        SUBD  F8, F6, F2
        DIVD  F10, F10, F0
        ADDD  F6, F8, F6
        BLEZ  R4, LOOP
        ADDI  R4, R4, 1

Step II: I-O Initiation, I-O Termination (RSRb)

Step III: Use Re-order Buffer (ROB)
  Combine benefits of early issue and in-order update of state
  Obtained from RSRa by adding a renaming mechanism to it
  Add a FIFO to RSRa (implement as circular buffer)
    When RSRa allows issuing of a new instruction
      Enter instruction at tail of circular buffer
    Buffer entry has multiple fields
      [Result; Valid Bit; Destination Register Name; PC value; Exceptions]
  Termination happens when the result is produced, broadcast on the CDB, and written into the circular buffer (replace M with T)
    Written ROB entry can serve as source of operands from now on
  Commit happens when the value is moved from the circular buffer to the register (replace W with C)
    Happens when the instruction reaches the head of the circular buffer and has completed execution with no exceptions
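The circular-buffer mechanics described on this slide can be sketched as follows. The class `CircularROB` and its method names are hypothetical, the entry carries only three of the listed fields (result, valid bit, destination register), and exceptions and the PC value are omitted.

```python
class CircularROB:
    """Toy circular reorder buffer with head (oldest) and tail (next free) indices."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0
        self.tail = 0
        self.count = 0

    def enter(self, dest):
        """Place a new instruction at the tail; return its slot, or None if full."""
        if self.count == len(self.slots):
            return None                      # buffer full: structural hazard
        idx = self.tail
        self.slots[idx] = {"dest": dest, "result": None, "valid": False}
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return idx

    def terminate(self, idx, result):
        """Write a produced result into the entry; it can now supply operands."""
        self.slots[idx]["result"] = result
        self.slots[idx]["valid"] = True

    def latest_writer(self, reg):
        """Return the slot of the youngest entry that writes reg, or None."""
        for back in range(1, self.count + 1):
            idx = (self.tail - back) % len(self.slots)   # walk from newest to oldest
            if self.slots[idx]["dest"] == reg:
                return idx
        return None

    def commit(self, regs):
        """Move the head entry's value to the register file if it has terminated."""
        if self.count == 0 or not self.slots[self.head]["valid"]:
            return False
        entry = self.slots[self.head]
        regs[entry["dest"]] = entry["result"]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return True


rob = CircularROB(size=3)
regs = {}
first = rob.enter("F6")     # e.g. LD F6, 32(R2)
second = rob.enter("F2")    # e.g. LD F2, 48(R3)
third = rob.enter("F6")     # a later write to F6, e.g. ADDD F6, F8, F6
print(rob.latest_writer("F6"))   # the younger of the two F6 writers
rob.terminate(first, 1.5)
print(rob.commit(regs), regs)
```

The backward walk in `latest_writer` plays the role of renaming: a later reader of F6 is pointed at the youngest pending writer rather than at the register file.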

ROB: I-O Initiation, I-O Termination
  LOOP: LD    F6, 32(R2)
        LD    F2, 48(R3)
        MULTD F0, F2, F4
        ADDI  R2, R2, 8
        ADDI  R3, R3, 8
        SUBD  F8, F6, F2
        DIVD  F10, F10, F0
        ADDD  F6, F8, F6
        BLEZ  R4, LOOP
        ADDI  R4, R4, 1

States of Circular Reorder Buffer
  LOOP: LD    F6, 32(R2)
        LD    F2, 48(R3)
        MULTD F0, F2, F4
        ADDI  R2, R2, 8
        ADDI  R3, R3, 8
        SUBD  F8, F6, F2
        DIVD  F10, F10, F0
        ADDD  F6, F8, F6
        BLEZ  R4, LOOP
        ADDI  R4, R4, 1
  Entry in yellow is at head of buffer
  Entry in green is tail of buffer, i.e., next instruction goes here
  Greyed instructions have committed

Tomasulo With Reorder buffer:
  [Figure, repeated on each of the following slides: FP Op Queue; Reorder Buffer with slots ROB1 (oldest) to ROB7 (newest) and a Done? column; Registers; Reservation Stations with Dest fields in front of the FP adders and FP multipliers; buffers to and from memory. Notes on the figure: "Resolve RAW memory conflict? (address in memory buffers)" and "Integer unit executes in parallel".]
  Entry  Dest  Value  Instruction       Done?
  ROB1   F0           LD F0,10(R2)      N
  Memory buffers: 1 10+R2

Tomasulo With Reorder buffer:
  Entry  Dest  Value  Instruction       Done?
  ROB2   F10          ADDD F10,F4,F0
  ROB1   F0           LD F0,10(R2)
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1
  Memory buffers: 1 10+R2

Tomasulo With Reorder buffer:
  Entry  Dest  Value  Instruction       Done?
  ROB3   F2           DIVD F2,F10,F6
  ROB2   F10          ADDD F10,F4,F0
  ROB1   F0           LD F0,10(R2)
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1
  Reservation stations (FP multipliers): 3 DIVD ROB2,R(F6)
  Memory buffers: 1 10+R2

Tomasulo With Reorder buffer:
  Entry  Dest  Value  Instruction       Done?
  ROB4   --           BNE F2,<…>
  ROB3   F2           DIVD F2,F10,F6
  ROB2   F10          ADDD F10,F4,F0
  ROB1   F0           LD F0,10(R2)
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1
  Reservation stations (FP multipliers): 3 DIVD ROB2,R(F6)
  Memory buffers: 1 10+R2

Tomasulo With Reorder buffer:
  Entry  Dest  Value  Instruction       Done?
  ROB5   F4           LD F4,0(R3)
  ROB4   --           BNE F2,<…>
  ROB3   F2           DIVD F2,F10,F6
  ROB2   F10          ADDD F10,F4,F0
  ROB1   F0           LD F0,10(R2)
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1
  Reservation stations (FP multipliers): 3 DIVD ROB2,R(F6)
  Memory buffers: 1 10+R2; 5 0+R3

Tomasulo With Reorder buffer:
  Entry  Dest  Value  Instruction       Done?
  ROB6   F0           ADDD F0,F4,F6
  ROB5   F4           LD F4,0(R3)
  ROB4   --           BNE F2,<…>
  ROB3   F2           DIVD F2,F10,F6
  ROB2   F10          ADDD F10,F4,F0
  ROB1   F0           LD F0,10(R2)
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1; 6 ADDD ROB5,R(F6)
  Reservation stations (FP multipliers): 3 DIVD ROB2,R(F6)
  Memory buffers: 1 10+R2; 5 0+R3

Tomasulo With Reorder buffer:
  Entry  Dest  Value  Instruction       Done?
  ROB7   --    ROB5   ST 0(R3),F4
  ROB6   F0           ADDD F0,F4,F6
  ROB5   F4           LD F4,0(R3)
  ROB4   --           BNE F2,<…>
  ROB3   F2           DIVD F2,F10,F6
  ROB2   F10          ADDD F10,F4,F0
  ROB1   F0           LD F0,10(R2)
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1; 6 ADDD ROB5,R(F6)
  Reservation stations (FP multipliers): 3 DIVD ROB2,R(F6)
  Memory buffers: 1 10+R2; 5 0+R3

Tomasulo With Reorder buffer:
  Entry  Dest  Value  Instruction       Done?
  ROB7   --    M[30]  ST 0(R3),F4       Y
  ROB6   F0           ADDD F0,F4,F6     N
  ROB5   F4           LD F4,0(R3)
  ROB4   --           BNE F2,<…>
  ROB3   F2           DIVD F2,F10,F6
  ROB2   F10          ADDD F10,F4,F0
  ROB1   F0           LD F0,10(R2)
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1; 6 ADDD M[30],R(F6)
  Reservation stations (FP multipliers): 3 DIVD ROB2,R(F6)
  Memory buffers: 1 10+R2

Tomasulo With Reorder buffer:
  Entry  Dest  Value   Instruction       Done?
  ROB7   --    M[30]   ST 0(R3),F4       Y
  ROB6   F0    <val2>  ADDD F0,F4,F6     Ex
  ROB5   F4            LD F4,0(R3)
  ROB4   --            BNE F2,<…>        N
  ROB3   F2            DIVD F2,F10,F6
  ROB2   F10           ADDD F10,F4,F0
  ROB1   F0            LD F0,10(R2)
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1
  Reservation stations (FP multipliers): 3 DIVD ROB2,R(F6)
  Memory buffers: 1 10+R2

Tomasulo With Reorder buffer:
  Entry  Dest  Value   Instruction       Done?
  ROB7   --    M[30]   ST 0(R3),F4       Y
  ROB6   F0    <val2>  ADDD F0,F4,F6     Ex
  ROB5   F4            LD F4,0(R3)
  ROB4   --            BNE F2,<…>        N
  ROB3   F2            DIVD F2,F10,F6    N
  ROB2   F10           ADDD F10,F4,F0    N
  ROB1   F0            LD F0,10(R2)      N
  Reservation stations (FP adders): 2 ADDD R(F4),ROB1
  Reservation stations (FP multipliers): 3 DIVD ROB2,R(F6)
  Memory buffers: 1 10+R2
  What about memory hazards???

Avoiding Memory Hazards
  WAW and WAR hazards through memory are eliminated because memory is updated in order: when a store reaches the head of the ROB, no earlier loads or stores can still be pending
  RAW hazards through memory are handled by two restrictions:
    not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
    maintaining program order for the computation of a load’s effective address with respect to all earlier stores
  So, a load that accesses a memory location written to by an earlier store can’t perform the memory access until the store has written the data
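The load restriction can be written as a simple check over the active ROB entries. This is a sketch under assumptions: entries are plain dictionaries in program order, the `kind` and `address` keys are invented here, and an address of `None` models an earlier store whose effective address has not been computed yet.

```python
def load_may_access_memory(rob_entries, load_index):
    """Return True if the load at load_index may start its memory access.

    rob_entries lists the active (uncommitted) ROB entries in program order.
    """
    load = rob_entries[load_index]
    for entry in rob_entries[:load_index]:        # only earlier instructions matter
        if entry["kind"] != "store":
            continue
        if entry["address"] is None:
            return False     # earlier store address unknown: keep program order
        if entry["address"] == load["address"]:
            return False     # earlier store to the same location has not committed
    return True


rob_entries = [
    {"kind": "store", "address": 30},
    {"kind": "alu", "address": None},
    {"kind": "load", "address": 30},    # reads what the store writes: must wait
    {"kind": "load", "address": 64},    # different location: may proceed
]

print(load_may_access_memory(rob_entries, 2))
print(load_may_access_memory(rob_entries, 3))
```

Once the store commits and leaves the list of active entries, the first load passes the check as well.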

Complexity of ROB
  Assume dual-issue superscalar load/store machine with three-operand instructions
    64 registers
    16-entry circular buffer
  Hardware support needed for ROB
    For each buffer entry
      One write port
      Four read ports (two source operands of two instructions)
      Four 6-bit comparators for associative lookup
    For each read port
      16-way “priority” encoder with wrap-around (to get latest value)
  Limited capacity of ROB is a structural hazard
  Repeated writes to same register actually happen
    This is not the case in “classical” Tomasulo