Tomasulo’s Algorithm: Born of necessity


- Used in the IBM 360/91 floating-point unit
- Many long-latency operations
  - Need dynamic scheduling: mitigate long stalls
- ISA specified only 4 floating-point registers
  - Need register renaming: with only 4 registers, WAW/WAR hazards pop up quickly
  - Especially in floating-point code: loops by definition cause repeated writes to the same registers
- Renaming: recognize and give unique names to different dynamic instances of the same register specifier

ECE 463/521, Profs. Gehringer, Rotenberg, & Conte, Dept. of ECE, NC State University

Key aspects of Tomasulo’s Alg.

- Read register operands at dispatch stage
  - If operands are available, the data is buffered along with the instruction in “reservation stations”
  - CDC: only buffers the instruction; all operands are read from the register file when all operands are ready (operands read at issue stage)
  - CDC: late reads / Tomasulo: early reads. Early reads help the WAR condition
- Unavailable registers are renamed at dispatch stage
  - Waiting instructions replace register specifiers with a “tag” indicating the producer instruction
  - Register specifiers are used only once, at dispatch!
- Renaming eliminates WAR/WAW hazards
  - Successive writes to a register: only the last one is actually used to update the register (helps the WAW condition)
  - CDC: stall in dispatch until the WAW hazard goes away
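The early-read idea can be illustrated with a minimal sketch. It is not the 360/91 hardware: the `Register` class, the `dispatch` function, and the station names `load1` and `add1` are assumptions made for illustration. At dispatch, each source operand is either copied as a value or recorded as the tag of its producer, so a later overwrite of the source register cannot disturb the waiting instruction.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Register:
    value: float
    tag: Optional[str] = None  # station that will produce the value, if one is pending


def dispatch(regs: Dict[str, Register], station: str, dest: str,
             srcs: List[str]) -> List[Tuple[str, object]]:
    """Early read: copy each available source value, else record the producer's tag."""
    operands: List[Tuple[str, object]] = []
    for name in srcs:
        reg = regs[name]
        if reg.tag is None:
            operands.append(("value", reg.value))  # data buffered with the instruction
        else:
            operands.append(("tag", reg.tag))      # wait for this producer instead
    regs[dest].tag = station  # the destination is now renamed to this station
    return operands


# F0 is still being loaded by (hypothetical) station load1; F2 is available.
regs = {"F0": Register(0.0, tag="load1"), "F2": Register(2.5), "F4": Register(0.0)}
add1 = dispatch(regs, "add1", "F4", ["F0", "F2"])

# A later instruction overwrites F2. The copy held by add1 is unaffected,
# which is why early reads help the WAR condition.
regs["F2"].value = 9.0
print(add1)
```

With late reads, as on the CDC, the waiting instruction would hold only the register specifier, so the later write to F2 would have to be held back until the read had happened.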

Other differences with CDC

- Distributed control
  - Reservation stations (versus scoreboard)
- Results broadcast to both register file and functional units
  - Result bus called the “Common Data Bus” (CDB)
  - Don’t have to wait for value to go through register file (i.e., use bypasses).
  - Functional units don’t contend for register file ports.

360/91 FP unit (MIPS version)

[Figure: block diagram of the FP unit. The floating-point operation queue (fed from the IF unit), the 4 FP registers, and the load buffers (fed from memory) drive the operand buses and the operation bus. These feed the reservation stations in front of the FP adders and FP multipliers, and the store buffers (to memory). The common data bus (CDB) carries results back to the registers, reservation stations, and store buffers.]

Tomasulo’s Alg. (MIPS version)

Four stages after fetch:

1. Dispatch
   - Check for structural hazard: stall in dispatch stage if no free reservation station
   - Read register file
     - Read data operands if available
     - Read “tag” if data operand unavailable: a tag is the reservation station number of the producer instruction
   - Route instruction plus data or tags to reservation station, where it waits until all operands are available
2. Issue
   - Wait until operands become available on the CDB (match CDB tag against operand tags).
   - Grab operands from CDB and issue to FU (if free).
3. Execute
4. Write result
   - Broadcast result + tag to all reservation stations, store buffers, and register file via the CDB.
   - Write register file only if the CDB tag matches the tag in the register file.
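The four stages can be sketched as a small software model. This is a simplification under stated assumptions, not the course's or IBM's implementation: the class `TomasuloSketch`, its method names, and the station names (`load1`, `add1`, ...) are hypothetical; execute and write result are collapsed into one call; functional-unit availability and cycle timing are not modeled; and a load is represented by a function that simply returns the loaded value.

```python
from typing import Callable, Dict, List, Optional, Tuple

Operand = Tuple[str, object]  # ("val", number) or ("tag", producer station name)


class TomasuloSketch:
    def __init__(self, values: Dict[str, float], pools: Dict[str, List[str]]):
        self.value = dict(values)         # register file contents
        self.tag: Dict[str, str] = {}     # register -> tag of its pending producer
        self.free = {u: list(n) for u, n in pools.items()}  # free stations per unit
        self.stations: Dict[str, dict] = {}

    def dispatch(self, unit: str, fn: Callable[..., float], dest: str,
                 srcs: List[str]) -> Optional[str]:
        if not self.free[unit]:
            return None                   # structural hazard: stall in dispatch
        name = self.free[unit].pop(0)
        operands: List[Operand] = [
            ("tag", self.tag[r]) if r in self.tag else ("val", self.value[r])
            for r in srcs
        ]
        self.stations[name] = {"unit": unit, "fn": fn, "operands": operands}
        self.tag[dest] = name             # rename the destination register
        return name

    def ready(self, name: str) -> bool:
        """Issue check: every operand has been captured as a value."""
        return all(kind == "val" for kind, _ in self.stations[name]["operands"])

    def execute_and_write(self, name: str) -> float:
        if not self.ready(name):
            raise RuntimeError(f"{name} is still waiting for operands")
        station = self.stations.pop(name)
        result = station["fn"](*(v for _, v in station["operands"]))
        # Write result: broadcast (tag, value) to every waiting station ...
        for other in self.stations.values():
            other["operands"] = [("val", result) if op == ("tag", name) else op
                                 for op in other["operands"]]
        # ... and write the register file only where the tag still matches.
        for reg in [r for r, t in self.tag.items() if t == name]:
            self.value[reg] = result
            del self.tag[reg]
        self.free[station["unit"]].append(name)
        return result


m = TomasuloSketch({"F0": 0.0, "F2": 2.0, "F4": 0.0, "F6": 6.0, "F8": 0.0},
                   {"load": ["load1", "load2"], "add": ["add1", "add2"]})
a = m.dispatch("load", lambda: 10.0, "F0", [])
b = m.dispatch("add", lambda x, y: x + y, "F4", ["F0", "F2"])
c = m.dispatch("load", lambda: 20.0, "F0", [])
d = m.dispatch("add", lambda x, y: x + y, "F8", ["F0", "F6"])
stalled = m.dispatch("load", lambda: 30.0, "F2", [])  # no free load station

for name in (c, a, d, b):  # the second load happens to complete first
    m.execute_and_write(name)
print(m.value)
```

In this run the first load's result reaches `add1` over the broadcast but never reaches F0, because by then F0 no longer carries the tag `load1`. That is the "write register file only if the tag matches" rule in action.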

Four stages (cont…)

Notice:

- Same four stages as Thornton, but Tomasulo reads the register file at dispatch stage and replaces register specifiers with tags.
- No WAW / WAR checks (using tags eliminates these)
- Again, H&P uses different names than what we will use
  - Worse, H&P specifies only three stages for Tomasulo
  - But any dynamic scheduling inherently has dispatch, issue, execute, write result …

Register renaming

- Consider simple example with register reuse
- Dataflow graph with both true and false dependences
- All instructions execute serially
  - Due to reuse of F0 by the loads
  - But those are 2 distinct instances of F0
  - Use different names for 2 instances of F0

    L.D   F0, 34(R2)
    ADD.D F4, F0, F2
    L.D   F0, 45(R3)
    ADD.D F8, F0, F6

[Figure: dataflow graph.
  L.D (1)   -> ADD.D (1): true dependence (RAW / F0)
  L.D (1)   -> L.D (2):   output dependence (WAW / F0)
  ADD.D (1) -> L.D (2):   anti-dependence (WAR / F0)
  L.D (2)   -> ADD.D (2): true dependence (RAW / F0)]
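The dependences in this example can be found mechanically. The following is a minimal sketch: the tuple encoding of instructions and the function name `find_dependences` are assumptions for illustration. It tracks the latest writer of each register and the readers since that write, and reports the same four edges as the dataflow graph.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, List[str]]  # (label, destination register, source registers)


def find_dependences(program: List[Instr]) -> List[Tuple[str, int, int, str]]:
    """Return (kind, earlier index, later index, register) for each dependence."""
    last_writer: Dict[str, int] = {}    # register -> index of its latest writer
    readers: Dict[str, List[int]] = {}  # register -> readers since that write
    deps: List[Tuple[str, int, int, str]] = []
    for j, (_, dest, srcs) in enumerate(program):
        for reg in srcs:
            if reg in last_writer:                 # reads a value written earlier
                deps.append(("RAW", last_writer[reg], j, reg))
        for i in readers.get(dest, []):            # overwrites a register read earlier
            deps.append(("WAR", i, j, dest))
        if dest in last_writer:                    # overwrites an earlier write
            deps.append(("WAW", last_writer[dest], j, dest))
        for reg in srcs:
            readers.setdefault(reg, []).append(j)
        last_writer[dest] = j
        readers[dest] = []
    return deps


program = [
    ("L.D (1)", "F0", ["R2"]),
    ("ADD.D (1)", "F4", ["F0", "F2"]),
    ("L.D (2)", "F0", ["R3"]),
    ("ADD.D (2)", "F8", ["F0", "F6"]),
]
deps = find_dependences(program)
for kind, i, j, reg in deps:
    print(f"{kind} on {reg}: {program[i][0]} -> {program[j][0]}")
```

Only the two RAW edges carry data. The WAR and WAW edges exist purely because the name F0 is reused.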

Register renaming

- Same program segment with F0 renamed
- Tomasulo’s Alg: use the reservation station number (tag) of the producer instruction (e.g. load buffer 1 = load1)
  - This guarantees unique names for unique values
- Dataflow graph with only true dependences
  - Renaming removes output and anti-dependences
  - Parallelism is exposed

    A. L.D   load1, 34(R2)
    B. ADD.D F4, load1, F2
    C. L.D   load2, 45(R3)
    D. ADD.D F8, load2, F6

[Figure: dataflow graph.
  L.D (1) -> ADD.D (1): true dependence (RAW / load1)
  L.D (2) -> ADD.D (2): true dependence (RAW / load2)]
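A minimal sketch of the renaming step follows, assuming a hypothetical `rename` function and tag names of the form `load1` or `add1`. Unlike the listing above, which renames only F0, this sketch gives every destination the tag of its producing station, so F4 and F8 become `add1` and `add2`.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, List[str]]  # (opcode, destination, sources)


def rename(program: List[Instr]) -> List[Instr]:
    """Give each produced value a unique tag; sources refer to the latest producer."""
    current: Dict[str, str] = {}   # architectural register -> tag of latest producer
    counters: Dict[str, int] = {}  # per-unit counter used to build tag names
    renamed: List[Instr] = []
    for op, dest, srcs in program:
        new_srcs = [current.get(reg, reg) for reg in srcs]  # read names before writing
        unit = "load" if op == "L.D" else "add"
        counters[unit] = counters.get(unit, 0) + 1
        tag = f"{unit}{counters[unit]}"
        current[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed


program = [
    ("L.D", "F0", ["R2"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("L.D", "F0", ["R3"]),
    ("ADD.D", "F8", ["F0", "F6"]),
]
renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, "<-", ", ".join(srcs))

# Every destination name is now unique, so no two instructions write the same
# name (no WAW) and no instruction writes a name that an earlier one reads (no WAR).
destinations = [dest for _, dest, _ in renamed]
```

After renaming, the two load/add pairs share no names, so they can proceed independently.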

How (Tomasulo) renaming works (1)

[Figure: FP unit diagram. The operation queue holds A: L.D F0 <-, B: ADD.D <- F0, C: L.D F0 <-, D: ADD.D <- F0, with A next to dispatch. F0 has no tag yet; load buffers and reservation stations are empty.]

Tomasulo renaming (2)

[Figure: FP unit diagram. A has been dispatched to load buffer 1 (load1), and F0 is now tagged load1. The operation queue holds B, C, and D.]

Tomasulo renaming (3)

[Figure: FP unit diagram. B has been dispatched to a reservation station, holding the tag load1 in place of F0 plus the value of its other operand. A is still in load buffer 1 and F0 is still tagged load1. The operation queue holds C and D.]

Tomasulo renaming (4)

[Figure: FP unit diagram. C has been dispatched to load buffer 2 (load2), and the tag on F0 changes from load1 to load2. A is still in load buffer 1, and B still waits on load1. The operation queue holds D.]

Tomasulo renaming (5)

[Figure: FP unit diagram. D has been dispatched to a reservation station, holding the tag load2 in place of F0 plus the value of its other operand. B still waits on load1. F0 is tagged load2. The operation queue is empty.]

Tomasulo renaming (6): second load completes first

[Figure: FP unit diagram. C broadcasts tag load2 with value2 on the CDB. D captures value2 in place of load2, and F0 is written with value2. A is still in load buffer 1, and B still waits on load1.]

Tomasulo renaming (7)

[Figure: FP unit diagram. A broadcasts tag load1 with value1 on the CDB. B captures value1 in place of load1. F0 keeps value2.]