Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.

Slides:

Advertisements

Similar presentations

Spring 2003CSE P5481 Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with scoreboarding IBM 360/91 with Tomasulos algorithm.

Advertisements

CMSC 611: Advanced Computer Architecture Tomasulo Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.

Scoreboarding & Tomasulos Approach Bazat pe slide-urile lui Vincent H. Berk.

Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.

Instruction-level Parallelism Compiler Perspectives on Code Movement dependencies are a property of code, whether or not it is a HW hazard depends on.

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 3, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Introduction)

A scheme to overcome data hazards

Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.

CS6290 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.

COMP25212 Advanced Pipelining Out of Order Processors.

Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 7: Dynamic Scheduling and Branch Prediction * Jeremy R. Johnson Wed. Nov. 8, 2000.

ECE 2162 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.

Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.

ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 5, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Scoreboarding)

CSC 4250 Computer Architectures October 13, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation.

Expl. ILP & Dyn.Sched CSE 4711 How to improve (decrease) CPI Recall: CPI = Ideal CPI + CPI contributed by stalls Ideal CPI =1 for single issue machine.

Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)

1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.

1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.

CET 520/ Gannod1 Section A.8 Dynamic Scheduling using a Scoreboard.

© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture Instruction Execution: Dynamic Scheduling.

Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.

2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.

1 Images from Patterson-Hennessy Book Machines that introduced pipelining and instruction-level parallelism. Clockwise from top: IBM Stretch, IBM 360/91,

04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;

CIS 662 – Computer Architecture – Fall Class 11 – 10/12/04 1 Scoreboarding  The following four steps replace ID, EX and WB steps  ID: Issue –

Dataflow Order Execution  Use data copying and/or hardware register renaming to eliminate WAR and WAW register name refers to a temporary value produced.

COMP25212 Advanced Pipelining Out of Order Processors.

CS203 – Advanced Computer Architecture ILP and Speculation.

Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm 吳俊興高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.

Instruction-Level Parallelism and Its Dynamic Exploitation

IBM System 360. Common architecture for a set of machines

Dynamic Scheduling Why go out of style?

/ Computer Architecture and Design

Tomasulo’s Algorithm Born of necessity

Out of Order Processors

Dynamic Scheduling and Speculation

Step by step for Tomasulo Scheme

CS203 – Advanced Computer Architecture

Microprocessor Microarchitecture Dynamic Pipeline

Chapter 3: ILP and Its Exploitation

Advantages of Dynamic Scheduling

High-level view Out-of-order pipeline

CMSC 611: Advanced Computer Architecture

A Dynamic Algorithm: Tomasulo’s

Out of Order Processors

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

ECE 2162 Reorder Buffer.

Checking for issue/dispatch

Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)

How to improve (decrease) CPI

Static vs. dynamic scheduling

Advanced Computer Architecture

Static vs. dynamic scheduling

Tomasulo Organization

Reduction of Data Hazards Stalls with Dynamic Scheduling

Instruction Execution Cycle

Scoreboarding ENGS 116 Lecture 7 Vincent H. Berk October 5, 2005

Extending simple pipeline to multiple pipes

September 20, 2000 Prof. John Kubiatowicz

Pipeline Control unit (highly abstracted)

CSL718 : Superscalar Processors

High-level view Out-of-order pipeline

Lecture 7 Dynamic Scheduling

CSE 586 Computer Architecture Lecture 3

Conceptual execution on a processor which exploits ILP

Presentation transcript:

Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s algorithm as implemented first in IBM 360/91 –Control decentralized at each functional unit –Forwarding –Concept and implementation of renaming registers that eliminates WAR and WAW hazards

Dyn. Sched. CSE 471 Autumn 0220 Improving on Dispatch with Reservation Stations With each functional unit, have a set of buffers or reservation stations –Keep operands and function to perform –Operands can be values or names of units that will produce the value (register renaming) with appropriate flags Not both operands have to be “ready” at the same time When both operands have values, functional unit can execute on that pair of operands When a functional unit computes a result, it broadcasts its name and the value.

Dyn. Sched. CSE 471 Autumn 0221 Tomasulo’s solution to resolve hazards Structural hazards –No free reservation station (stall at issue time). No further issue RAW dependency (detected in each functional unit -- decentralized) –The instruction with the dependency is issued (put in a reservation station) but not dispatched (stalled). Subsequent instructions can be issued, dispatched, executed and completed. No WAR and WAW hazards –Because of register renaming through reservation stations Forwarding –Done at end of execution by use of a common (broadcast) data bus

Dyn. Sched. CSE 471 Autumn 0222 Example machine (cf. Figure 3.2) From memory From I-unit Fp registersLoad buffers Store buffers To memory Reservation stations F-p units Common data bus

Dyn. Sched. CSE 471 Autumn 0223 An instruction goes through 3 steps Assume the instruction has been fetched 1. Issue, dispatch, and read operands –Check for structural hazard (no free reservation station or no free load-store buffer for a memory operation). If there is structural hazard, stall until it is not present any longer –Reserve the next reservation station –Read source operands If they have values, put the values in the reservation station If they have names, store their names in the reservation station –Rename result register with the name of the reservation station in the functional unit that will compute it

Dyn. Sched. CSE 471 Autumn 0224 An instruction goes through 3 steps (c‘ed) 2. Execute –If any of the source operands is not ready (i.e., the reservation station holds at least one name), monitor the bus for broadcast –When both operands have values, execute 3. Write result –Broadcast name of the unit and value computed. Any reservation station/result register with that name grabs the value Note two more sources of structural hazard: –Two reservation stations in the same functional unit are ready to execute in the same cycle: choose the “first” one –Two functional units want to broadcast at the same time. Priority is encoded in the hardware

Dyn. Sched. CSE 471 Autumn 0225 Implementation All registers (except load buffers) contain a pair {value,tag} The tag (or name) can be: –Zero (or a special pattern) meaning that the value is indeed a value –The name of a load buffer –The name of a reservation station within a functional unit A reservation station consists of : –The operation to be performed –2 pairs (value,tag) (Vj,Qj) (Vk,Qk) –A flag indicating whether the accompanying f-u is busy or not

Dyn. Sched. CSE 471 Autumn 0226 Instruction Issue Execute Write result Load F6, 34(r2) yes yes yes Load F2, 45(r3) yes yes Mul F0, F2, F4 yes Sub F8, F6, F2 yes Div F10, F0, F6 yes Add F6,F8,F2 yes Reservation Stations Name Busy Fm Vj Vk Qj Qk Add 1 yes Sub (Load1) 0 Load2 Add2 yes Add Add1 Load2 Mul1 yes Mul (F4) Load2 0 Add3 no Mul2 yes Div (Load1) Mul1 0 Register status F0 (Mul1) F2 (Load2) F4 ( ) F6(Add2 ) F8 (Add1) F10 (Mul2) F12... Initial: waiting for F2 to be loaded from memory (x) Means a value: contents of x Qj = 0 means Vj has a value

Dyn. Sched. CSE 471 Autumn 0227 Instruction Issue Execute Write result Load F6, 34(r2) yes yes yes Load F2, 45(r3) yes yes yes Mul F0, F2, F4 yes yes Sub F8, F6, F2 yes yes Div F10, F0, F6 yes Add F6,F8,F2 yes Reservation Stations Name Busy Fm Vj Vk Qj Qk Add 1 yes Sub (Load1) (Load2) 0 0 Add2 yes Add (Load2) Add1 0 Mul1 yes Mul (Load2) (F4) 0 0 Add3 no Mul2 yes Div (Load1) Mul1 0 Register status F0 (Mul1) F2 () F4 ( ) F6(Add2 ) F8 (Add1) F10 (Mul2) F12... Cycle after 2nd load has written its result

Dyn. Sched. CSE 471 Autumn 0228 Instruction Issue Execute Write result Load F6, 34(r2) yes yes yes Load F2, 45(r3) yes yes yes Mul F0, F2, F4 yes yes Sub F8, F6, F2 yes yes yes Div F10, F0, F6 yes Add F6,F8,F2 yes yes Reservation Stations Name Busy Fm Vj Vk Qj Qk Add 1 no Add2 yes Add (Add1) (Load2) 0 0 Mul1 yes Mul (Load2) (F4) 0 0 Add3 no Mul2 yes Div (Load1) Mul1 0 Register status F0 (Mul1) F2 () F4 ( ) F6(Add2 ) F8 () F10 (Mul2) F12... Cycle after sub has written its result

Dyn. Sched. CSE 471 Autumn 0229 Instruction Issue Execute Write result Load F6, 34(r2) yes yes yes Load F2, 45(r3) yes yes yes Mul F0, F2, F4 yes yes Sub F8, F6, F2 yes yes yes Div F10, F0, F6 yes Add F6,F8,F2 yes yes yes Reservation Stations Name Busy Fm Vj Vk Qj Qk Add 1 no Add2 no Mul1 yes Mul (Load2) (F4) 0 0 Add3 no Mul2 yes Div (Load1) Mul1 0 Register status F0 (Mul1) F2 () F4 ( ) F6() F8 () F10 (Mul2) F12... Cycle after add has written its result

Dyn. Sched. CSE 471 Autumn 0230 Instruction Issue Execute Write result Load F6, 34(r2) yes yes yes Load F2, 45(r3) yes yes yes Mul F0, F2, F4 yes yes yes Sub F8, F6, F2 yes yes yes Div F10, F0, F6 yes yes Add F6,F8,F2 yes yes yes Reservation Stations Name Busy Fm Vj Vk Qj Qk Add 1 no Add2 no Mul1 no Add3 no Mul2 yes Div (Mul1) (Load1) 0 0 Register status F0 () F2 () F4 ( ) F6() F8 () F10 (Mul2) F12... Cycle after mul has written its result

Dyn. Sched. CSE 471 Autumn 0231 Other checks/possibilities In the example in the book there is no load/store dependencies but since they can happen –Load/store buffers must keep the addresses of the operands –On load, check if there is a corresponding address in store buffers. If so, get the value/tag from there (load/store buffers have tags) –Better yet, have load/store functional units (still needs checking) The Tomasulo engine was intended only for f-p operations. We need to generalize to include –Handling branches, exceptions etc –In-order completion –More general register renaming mechanisms –Multiple instruction issues