Superscalar Processors

Superscalar Processors J. Nelson Amaral

Scalar to Superscalar
Scalar processor: one instruction passes through each pipeline stage in each cycle.
Superscalar processor: multiple instructions at each pipeline stage in each cycle (wider pipeline).
Superpipelined processor: stages are decomposed into smaller stages, giving more stages (deeper pipeline).
Baer p. 75

Superscalar
Front end (IF and ID): must fetch and decode multiple instructions per cycle. An m-way superscalar brings (ideally) m instructions per cycle into the pipeline.
Back end (EX, Mem and WB): must execute and write back several instructions per cycle.
Baer p. 75

Superscalar
In-order (or static): instructions leave the front end in program order.
Out-of-order (or dynamic): instructions leave the front end, and execute, in a different order than the program order. WB is called the commit stage and must ensure that the program semantics is followed. The design is more complex.
Baer p. 76

Limits to Superscalar Performance
Superscalars rely on exploiting Instruction-Level Parallelism (ILP).
They remove WAR and WAW dependences, but the amount of ILP is limited by RAW (true) dependences.
Example:
S0: R1 ← R2 + R3
S1: R4 ← R1 + R5
S2: R1 ← R6 + R7
S3: R4 ← R1 + R9
Data dependence graph:
S0 to S1: RAW (R1)
S0 to S2: WAW (R1)
S1 to S2: WAR (R1)
S1 to S3: WAW (R4)
S2 to S3: RAW (R1)
The final build of the slide annotates the graph with the renamed registers RA and RB.
Baer p. 76
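The dependence graph on this slide can be derived mechanically. The following is a minimal Python sketch, not part of the original slides: `Instr` and `dependences` are illustrative names, and the approach (tracking the last writer of each register and the readers since that write) is one simple way to do it, assuming straight-line code with register operands only.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    dest: str
    srcs: tuple


def dependences(program):
    """Return (earlier, later, kind, register) tuples for straight-line code."""
    last_writer = {}   # register -> name of the most recent writer
    readers = {}       # register -> readers since the most recent write
    found = []
    for ins in program:
        # RAW: a source was produced by an earlier instruction.
        for src in ins.srcs:
            if src in last_writer:
                found.append((last_writer[src], ins.name, "RAW", src))
        # WAW: the destination was written by an earlier instruction.
        if ins.dest in last_writer:
            found.append((last_writer[ins.dest], ins.name, "WAW", ins.dest))
        # WAR: the destination is still to be read by earlier instructions.
        for reader in readers.get(ins.dest, []):
            found.append((reader, ins.name, "WAR", ins.dest))
        # Record this instruction's reads, then its write.
        for src in ins.srcs:
            readers.setdefault(src, []).append(ins.name)
        last_writer[ins.dest] = ins.name
        readers[ins.dest] = []
    return found


PROGRAM = [
    Instr("S0", "R1", ("R2", "R3")),
    Instr("S1", "R4", ("R1", "R5")),
    Instr("S2", "R1", ("R6", "R7")),
    Instr("S3", "R4", ("R1", "R9")),
]

for edge in dependences(PROGRAM):
    print(edge)
```

Only the two RAW edges survive register renaming; the WAR and WAW edges exist because R1 and R4 are reused as names.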

Limits to Superscalar Performance Complexity of logic to remove dependencies Designers predicted 8-way and 16-way superscalars We have 6-way superscalars and m is not likely to grow Baer p. 76

Limits to Superscalar Performance
Number of forwarding paths [diagrams of the 1-way and 2-way cases].
An m-way superscalar requires m² paths.
Paths may become too long for signal propagation within a single clock cycle.
Baer p. 76

Limits to Clock Cycle Reduction
Power dissipation increases with frequency.
Pipeline registers are read and written in every cycle. The time to access a pipeline register imposes a lower bound on the duration of a pipeline stage.
Baer p. 76

Limits on Pipeline Length
Speculative actions (e.g., branch prediction) are resolved later in a longer pipeline, so recovery from misspeculation is delayed.
14-stage pipeline: branch misprediction penalty of 10 cycles.
31-stage pipeline: branch misprediction penalty of 20 cycles.
Baer p. 76

Why the Multicore Revolution? Power Dissipation: Linear growth with clock frequency - Cannot make single cores faster Moore’s Law: Number of transistors in a chip continues the exponential growth - What to do with extra logic? Design Complexity: Extracting more performance from single core requires extreme design complexity. - What to do with extra logic? Baer p. 77

Speed Demons vs. Brainiacs
DEC Alpha (1994): in-order superscalar.
Pentium III (1999): out-of-order superscalar, with register renaming, a reorder buffer, and reservation stations.
Baer p. 77

Out-of-Order and Memory Hierarchy Question: Does out-of-order execution help hide memory latencies? Short answer: No. Latencies of 100 cycles or more are too long and fill up all internal queues and stall pipelines Latencies around 100 cycles are too short to justify context switching. Solution: hardware for several contexts to enable fast context switching → multithreading Baer p. 78

DEC Alpha 21164
4-way in-order RISC.
[Block diagram. Legible labels: Instruction Buffer (virtually indexed); widths 32, 32, and 64-bit.]
Miss Address File: merges outstanding misses to the same L2 line.
Baer p. 79

21164 Instruction Pipeline Integer pipe 1: shifter and multiplier Integer pipe 2: branches 48-entry I-TLB 64-entry D-TLB Baer p. 79

21164 Instruction Pipeline (front end)
S0: brings 4 instructions from the I-cache (accesses the I-cache and the I-TLB in parallel).
S1: performs branch prediction and calculates the branch target.
S2: slotting stage; steers instructions to units and resolves static conflicts.
S3: resolves dynamic conflicts; schedules forwarding and stalls.
Baer p. 80

Example
i1: R1 ← R2 + R3 # Use integer pipeline 1
i2: R4 ← R1 – R5
i3: R7 ← R8 – R9 # Requires an integer pipeline
i4: F0 ← F2 + F4 # Floating point add
i5 to i12: not listed on the slide. Assume no structural or data hazard for these instructions.
Baer p. 81

Front-end Occupancy
i1: R1 ← R2 + R3
i2: R4 ← R1 – R5
i3: R7 ← R8 – R9
i4: F0 ← F2 + F4
Time t0: S0 holds i1 i2 i3 i4.
Time t0 + 1: S0 holds i5 i6 i7 i8; S1 holds i1 i2 i3 i4.
Baer p. 82

Front-end Occupancy
Time t0 + 2: S0 holds i9 i10 i11 i12; S1 holds i5 i6 i7 i8; S2 holds i1 i2 i3 i4.
Baer p. 82

Front-end Occupancy
Time t0 + 3: S0 holds i9 i10 i11 i12; S1 holds i5 i6 i7 i8; S2 holds i3 i4; S3 holds i1 i2.
i3 cannot move to S3 because of a resource conflict (there are only two integer pipelines).
i4 does not move to S3, to preserve program order (it is blocked by i3).
Baer p. 82

Front-end Occupancy
Time t0 + 4: i2 cannot move to the backend because of the RAW dependency with i1.
Baer p. 82

Front-end Occupancy
Time t0 + 5: [occupancy diagram; instructions i13 to i16 have entered the front end]
Baer p. 82

Backend
Begins L1 D-cache and D-TLB accesses.
Decides hit/miss in the L1 D-cache and D-TLB.
Hit: forward data (if needed); write to the integer or FP register.
Miss: start access to L2; data is available if the access hits in L2.
Baer p. 82

Scoreboard Speculation
Example: a load L and a dependent use U reach S3 at cycle t.
If the load hits in the L1 cache, L is scheduled at t+1 and U at t+3.
The hit/miss outcome becomes known only at a later point (marked on the slide's diagram); until then the scoreboard assumes a hit.
If it is a miss, any dependent instruction already issued is aborted.
Baer p. 82

Can Compiler Help Performance? (Example) i1: R1 ← Mem[R2] i2: R4 ← R1 + R3 i3: R5 ← R1 + R6 i4: R7 ← R4 + R5 Assume that all instructions are in issuing slot (state S2) at time t.

Compiler Effect
Time t: S0 holds i9 i10 i11 i12; S1 holds i5 i6 i7 i8; S2 holds i1 i2 i3 i4.
Time t + 1: instruction i3 cannot advance to S3 because of a structural hazard: the load in i1 uses an integer pipe to compute the address.
Baer p. 82

Compiler Effect
Time t + 2: i2 cannot advance because of the RAW dependency with i1.
Time t + 3: the load continues execution in the back end (2-cycle latency).
Baer p. 82

Compiler Effect
Time t + 4: [occupancy diagram; instructions i13 to i16 have entered the front end]
Baer p. 82

Compiler Effect
Time t + 5: i4 cannot advance because of the RAW dependency with i3.
Baer p. 82

Compiler Effect
Time t + 6: i4 advances to execution, and it will be the only integer instruction executing in that cycle.
Baer p. 82

After Compiler Optimization
i1: R1 ← Mem[R2]
i1’: integer nop
i2: R4 ← R1 + R3
i3: R5 ← R1 + R6
i4: R7 ← R4 + R5
Time t: S0 holds i8 i9 i10 i11; S1 holds i4 i5 i6 i7; S2 holds i1 i1’ i2 i3.
Time t + 1: two integer instructions advance to S3.
Baer p. 82

After Compiler Optimization
Time t + 2: [occupancy diagram; instructions i12 to i15 have entered the front end]
Baer p. 82

After Compiler Optimization
Time t + 3 and t + 4: the load in i1 still needs two cycles to execute.
Baer p. 82

After Compiler Optimization
Time t + 5: i2 and i3 can advance to the backend together. There is no dependency between them.
Baer p. 82

After Compiler Optimization
Time t + 6: i4 still advances to the backend at t+6, but now i5 could advance along with i4.
* The textbook says that i4 would advance to the backend at t+5.
Baer p. 82

Scoreboarding “Scoreboarding allows instructions to execute out of order when there are sufficient resources and no data dependences.” John L. Hennessy and David A. Patterson Computer Architecture: A Quantitative Approach Third Edition, p. A-69.

Another scoreboarding

Scoreboarding
Thornton Algorithm (Scoreboarding): CDC 6600 (1964). A single unit (the scoreboard) monitors the progress of the execution of instructions and the status of all registers.
Tomasulo’s Algorithm: IBM 360/91 (1967). Reservation stations buffer operands and results. A Common Data Bus (CDB) distributes results directly to functional units.
Some of this material is from Prof. Vojin G. Oklobzija’s tutorial at ISSCC’97.
Baer p. 81

CDC 6600
[Block diagram: functional units arranged in Group I, Group II, Group III, and Group IV.]
Not shown: branch unit that modifies the PC.
Baer p. 86

CDC 6600 Scoreboard Operation: Issue
Free functional unit? No: stall. Yes: check for a WAW hazard.
WAW hazard? Yes: stall. No: issue.
Baer p. 86

CDC 6600 Scoreboard Operation: Dispatch
Mark the execution unit busy.
Operands ready? No: stall. Yes: read operands.
Baer p. 87

CDC 6600 Scoreboard Operation: Execution
Execution complete? No: stall. Yes: notify the scoreboard that the unit is ready to write its result.
Baer p. 87

CDC 6600 Scoreboard Operation: Write result
WAR hazard? Yes: stall. No: write.
WAR example:
i0: DIV.D F0, F2, F4
i1: ADD.D F10, F0, F8
i2: SUB.D F8, F8, F14
The write of i2 has to stall until i1 has read F8.
Baer p. 87
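The issue and write-result checks can be written as two small predicates. This is a minimal Python sketch under simplifying assumptions, not the CDC 6600 logic itself: the function names and the set-based bookkeeping (`reserved_units`, `pending_dests`, `pending_reads`) are hypothetical and chosen only for illustration.

```python
def can_issue(unit, dest, reserved_units, pending_dests):
    """Issue needs a free functional unit and no WAW hazard on `dest`."""
    return unit not in reserved_units and dest not in pending_dests


def can_write(dest, pending_reads):
    """A result write stalls while an earlier instruction must still read `dest` (WAR)."""
    return dest not in pending_reads


# WAR example from the slide: i1 (ADD.D F10, F0, F8) waits for F0 from the
# divide, so it has not read F8 yet; i2 (SUB.D F8, F8, F14) may not write F8.
i1_unread_sources = {"F0", "F8"}
print(can_write("F8", i1_unread_sources))   # i2 must stall its write

# Once i1 has read its operands, nothing is pending and i2 may write.
print(can_write("F8", set()))

# A second add that targets a register with a pending write cannot issue,
# and neither can one whose functional unit is already reserved.
print(can_issue("Adder", "R4", {"Mult1", "Adder"}, {"R4", "R8"}))
print(can_issue("Adder", "R10", set(), {"R4"}))
```

Dispatch (waiting for operands) and execution are left out; they only add the RAW check on the source registers.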

Scoreboarding Example i1: R4 ← R0 * R2 # Uses multiplier 1 i2: R6 ← R4 * R8 # Uses multiplier 2 i3: R8 ← R2 + R12 # Uses Adder i4: R4 ← R14 + R16 # Uses Adder Baer p. 88

Instructions in Flight
Scoreboard fields per instruction (Fi: result register; Fj, Fk: source registers; Qj, Qk: units producing the sources; Rj, Rk: register-ready flags):
Instruction | Fi | Fj | Fk | Qj | Qk | Rj | Rk
i1 | R4 | R0 | R2 | - | - | 1 | 1
i2 | R6 | R4 | R8 | Mult1 | - | - | 1
i3 | R8 | R2 | R12 | - | - | 1 | 1
i4 | R4 | R14 | R16 | - | - | 1 (only one flag survives in the transcript)
Register-to-unit table: R4 is assigned to Mult1 in cycle 1, R6 to Mult2 in cycle 2, and R8 to Adder in cycle 3.
Instruction status by cycle (the slides skip cycle 7):
Cycle | i1 | i2 | i3 | i4 | Note
1 | issued | - | - | - |
2 | dispatched | issued | - | - |
3 | execute | issued | issued | - | i2 cannot be dispatched because R4 is not available.
4 | execute | issued | dispatched | - | i4 cannot issue: (i) the Adder is busy; and (ii) WAW dependency on i1.
5 | execute | issued | execute | - | (No change)
6 | execute | issued | execute | - | i3 asks for permission to write. Permission is denied (WAR with i2).
8 | write | issued | execute | - | i1 asks for permission to write. Permission is granted.
9 | - | dispatched | write | issue |
Note on the i3 row in cycle 3: these values are wrong in Table 3.2 (p. 88) of the textbook.
Baer p. 88

Register Renaming, Reorder Buffer, and Reservation Stations
Difference between in-order and out-of-order execution:
When do instructions leave the front end?
In-order: WAR and WAW dependences prevent dispatch.
Out-of-order: register renaming avoids WAR and WAW dependences.
How are instructions processed in the back end?
Instructions can wait in reservation stations because of RAW dependencies or structural hazards.
A reorder buffer imposes commitment in program order.
Baer p. 89

Register Renaming (example)
i1: R1 ← R2/R3 # Takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
The registers that appear in the program are logical or architectural registers. At the last stage of the front end all registers are mapped to physical registers.
In-order: only i1 issues. The others are blocked behind the RAW dependency of i2 on i1.
Out-of-order: i3 and i4 can issue and finish execution while i1 executes.
Baer p. 89

Renaming Process
Renaming stage: Ri ← Rj op Rk becomes Ra ← Rb op Rc, where
Rb = Rename(Rj);
Rc = Rename(Rk);
Ra = freelist(first);
Rename(Ri) = freelist(first);
first ← next(first)
Baer p. 90
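The renaming step can be sketched in a few lines of Python. This is an illustrative sketch rather than the textbook's code: the `rename` function, the tuple layout of the instructions, and the assumption that each logical register initially maps to the physical register of the same name are all choices made here for the example. The free list starts at R32, as in the slides' example.

```python
from collections import deque


def rename(program, num_logical=32, num_physical=64):
    """Rename (dest, src1, op, src2) tuples; return renamed code and the final map."""
    # Assumption: logical Ri initially maps to physical Ri.
    rename_map = {f"R{i}": f"R{i}" for i in range(num_logical)}
    free_list = deque(f"R{i}" for i in range(num_logical, num_physical))
    renamed = []
    for dest, src1, op, src2 in program:
        rb = rename_map[src1]        # Rb = Rename(Rj)
        rc = rename_map[src2]        # Rc = Rename(Rk)
        ra = free_list.popleft()     # Ra = freelist(first); first <- next(first)
        rename_map[dest] = ra        # Rename(Ri) = Ra
        renamed.append((ra, rb, op, rc))
    return renamed, rename_map


PROGRAM = [
    ("R1", "R2", "/", "R3"),   # i1
    ("R4", "R1", "+", "R5"),   # i2
    ("R5", "R6", "+", "R7"),   # i3
    ("R1", "R8", "+", "R9"),   # i4
]

renamed, final_map = rename(PROGRAM)
for dest, a, op, b in renamed:
    print(f"{dest} <- {a} {op} {b}")
```

The sources are looked up before the destination is remapped, so an instruction such as R1 ← R1 + R2 would read the old mapping of R1. Returning physical registers to the free list at commit is not modeled.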

Register Renaming (example)
Freelist = {R32, R33, R34, R35, R36, …}
Original instruction, followed by its renamed form:
i1: R1 ← R2/R3 becomes R32 ← R2/R3
i2: R4 ← R1 + R5 becomes R33 ← R32 + R5
i3: R5 ← R6 + R7 becomes R34 ← R6 + R7
i4: R1 ← R8 + R9 becomes R35 ← R8 + R9
Rename map: Rename(R1) = R32, then R35 after i4; Rename(R4) = R33; Rename(R5) = R34.
How about i3, can it write into R5 before i1 and i2 complete? If i1 generates an exception, what will be the value of R5 in the exception state?
i4 will finish execution before i1. Can we allow it to write the result to R1 before i1?
Baer p. 90

Reorder Buffer Even though we allow out-of-order execution, we require in-order-completion. A reorder buffer (ROB) ensures that the results produced by instructions are committed to the logical register in order. Baer p. 91

Reorder Buffer (cont.)
Each entry in the ROB has the following fields:
flag: has the instruction completed?
value: value computed by the instruction
result register name: logical register
instruction type: arithmetic/load/store/branch/…
Each instruction that has its destination register renamed is entered in the ROB.
Baer p. 91

Reorder Buffer (example)
i1: R1 ← R2/R3
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
Instruction | Flag | Value | Reg. Name | Type
i1 (head) | Not Ready | None | R1 | Arit
i2 | Not Ready | None | R4 | Arit
i3 | Not Ready | None | R5 | Arit
i4 | Not Ready | None | R1 | Arit
Later builds of the slide show the entries of i1, i3, and i4 changing to Ready with a value (Some).
Renamed destinations: i1 uses R32, i2 uses R33, i3 uses R34, i4 uses R35.
Freelist = {R32, R33, R34, R35, R36, …}
Baer p. 92
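In-order commit from a reorder buffer can be sketched as a queue that only retires from its head. The Python below is a minimal illustration under stated assumptions: the class and method names (`ReorderBuffer`, `allocate`, `complete`, `commit`) and the example result values are invented for this sketch, and the instruction-type field is omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    instr: str            # instruction label, e.g. "i1"
    reg: str              # logical result register name
    ready: bool = False   # flag: has the instruction completed?
    value: object = None  # value computed by the instruction


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # head is on the left, tail on the right
        self.registers = {}      # committed logical register state

    def allocate(self, instr, reg):
        """Enter an instruction at the tail, in program order."""
        entry = RobEntry(instr, reg)
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result; instructions may complete in any order."""
        entry.ready = True
        entry.value = value

    def commit(self):
        """Retire ready entries from the head only, preserving program order."""
        committed = []
        while self.entries and self.entries[0].ready:
            head = self.entries.popleft()
            self.registers[head.reg] = head.value
            committed.append(head.instr)
        return committed


rob = ReorderBuffer()
e1 = rob.allocate("i1", "R1")
e2 = rob.allocate("i2", "R4")
e3 = rob.allocate("i3", "R5")
e4 = rob.allocate("i4", "R1")

rob.complete(e3, 13)          # i3 and i4 finish before i1 (made-up values)
rob.complete(e4, 17)
blocked = rob.commit()        # nothing retires: the head (i1) is not ready
rob.complete(e1, 2)
first = rob.commit()          # only i1 retires; i2 now blocks the head
rob.complete(e2, 5)
rest = rob.commit()           # i2, i3, i4 retire in program order
print(blocked, first, rest, rob.registers)
```

Because i4 retires after i1, the committed value of R1 is the one produced by i4, even though i4 finished executing first.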

But… Where do instructions wait before being executed? How does an instruction know that it is ready to be executed? Baer p. 93

Reservation Stations After register renaming, the front-end dispatches the instruction to a reservation station. Reservation stations can: be grouped into a centralized queue called an instruction window. be associated with functional units according to the opcode. Baer p. 93

Reservation Stations (cont.)
Each entry in the Reservation Station must contain:
Operation to be performed
Source operands (either the value or the physical name of the register); a flag indicates which one
Physical name of the result register
ROB entry where the result will be stored
Baer p. 93

Scheduling Scheduling: Selection of which instruction should execute next in a given execution unit oldest instruction; critical instruction; Baer p. 93

Ready Bit
A ready bit is associated with each physical register.
When an instruction that uses a physical register Ri is dispatched:
if Ri is ready, pass the value of Ri to the reservation station and set the flag to true (ready);
if Ri is not ready, pass the name of Ri to the reservation station and set the flag to false (not ready).
When both flags are true, the instruction is ready to be issued.
Baer p. 93

Ready Bit (cont.)
Upon completion, an instruction broadcasts the name and content of its result physical register to all reservation stations (RS). Each RS that needs it grabs the content and updates its flags.
Baer p. 93
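The ready-bit handling at dispatch and the broadcast on completion can be sketched as follows. This is a minimal Python illustration, not the design of any particular processor: `ReservationStation`, `snoop`, and `broadcast` are hypothetical names, and the register file and ready bits are modeled as plain dictionaries.

```python
class ReservationStation:
    def __init__(self, op, dest, sources, regfile, ready_bits):
        self.op = op
        self.dest = dest          # physical name of the result register
        self.operands = []        # each operand is [flag, value-or-register-name]
        for reg in sources:
            if ready_bits[reg]:
                # Register is ready: capture its value, flag = True.
                self.operands.append([True, regfile[reg]])
            else:
                # Register is not ready: keep its name, flag = False.
                self.operands.append([False, reg])

    def is_ready(self):
        """The instruction can be issued once every operand flag is true."""
        return all(flag for flag, _ in self.operands)

    def snoop(self, reg, value):
        """Grab a broadcast result if this entry is waiting for `reg`."""
        for operand in self.operands:
            if not operand[0] and operand[1] == reg:
                operand[0] = True
                operand[1] = value


def broadcast(stations, reg, value):
    """On completion, send the result register name and content to all entries."""
    for rs in stations:
        rs.snoop(reg, value)


# i2 after renaming: R33 <- R32 + R5, dispatched while R32 (from i1) is pending.
regfile = {"R5": 10}
ready_bits = {"R32": False, "R5": True}
rs = ReservationStation("+", "R33", ["R32", "R5"], regfile, ready_bits)
before = rs.is_ready()
broadcast([rs], "R32", 4)     # i1 completes and broadcasts (made-up value 4)
after = rs.is_ready()
print(before, after, rs.operands)
```

A real design would also set the ready bit of the result register in the register file at broadcast time, which this sketch omits.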