ECE 2162 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.

Slides:



Advertisements
Similar presentations
Spring 2003CSE P5481 Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with scoreboarding IBM 360/91 with Tomasulos algorithm.
Advertisements

Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
Instruction-level Parallelism Compiler Perspectives on Code Movement dependencies are a property of code, whether or not it is a HW hazard depends on.
CS6290 Speculation Recovery. Loose Ends Up to now: –Techniques for handling register dependencies Register renaming for WAR, WAW Tomasulo’s algorithm.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 3, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Introduction)
Lec18.1 Step by step for Dynamic Scheduling by reorder buffer Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
A scheme to overcome data hazards
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
CS6290 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.
Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
COMP25212 Advanced Pipelining Out of Order Processors.
Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.
Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 14, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
1 Zvika Guz Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez Out Of Order Execution.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.
2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;
1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
CIS 662 – Computer Architecture – Fall Class 11 – 10/12/04 1 Scoreboarding  The following four steps replace ID, EX and WB steps  ID: Issue –
CS203 – Advanced Computer Architecture ILP and Speculation.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
IBM System 360. Common architecture for a set of machines
/ Computer Architecture and Design
Tomasulo’s Algorithm Born of necessity
Out of Order Processors
Dynamic Scheduling and Speculation
Step by step for Tomasulo Scheme
Tomasulo Loop Example Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1
CS203 – Advanced Computer Architecture
Microprocessor Microarchitecture Dynamic Pipeline
Lecture 10 Tomasulo’s Algorithm
Lecture 12 Reorder Buffers
High-level view Out-of-order pipeline
CMSC 611: Advanced Computer Architecture
Lecture 6: Advanced Pipelines
A Dynamic Algorithm: Tomasulo’s
Out of Order Processors
Lecture 10: Out-of-order Processors
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
ECE 2162 Reorder Buffer.
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Adapted from the slides of Prof
Checking for issue/dispatch
Static vs. dynamic scheduling
Advanced Computer Architecture
Static vs. dynamic scheduling
Tomasulo Algorithm Example
Tomasulo Organization
Adapted from the slides of Prof
September 20, 2000 Prof. John Kubiatowicz
CS203 – Advanced Computer Architecture
Lecture 7 Dynamic Scheduling
Lecture 9: Dynamic ILP Topics: out-of-order processors
Conceptual execution on a processor which exploits ILP
Presentation transcript:

ECE 2162 Tomasulo’s Algorithm

Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available to satisfy data dependences –Removes name dependences through register renaming –Very similar to what is used today Almost all modern high-performance processors use a derivative of Tomasulo’s… much of the terminology survives to today. 2

Tomasulo’s Algorithm: The Picture 3

Issue (1) –Get next instruction from instruction queue. –Find a free reservation station for it (if none are free, stall until one is) –Read operands that are in the registers –If the operand is not in the register, find which reservation station will produce it –In effect, this step renames registers (reservation station IDs are “temporary” names) 4

Issue (2) F2=F4+F1 AdderFP-Cmplx A1 (1) A2 (2) A3 (3) C1 (4) C2 (5) F1 F2 F3 F4 RAT F1 F2 F3 F4 Reg File  To-Do list (from last slide): Get next inst from IB’s Find free reservation station Read operands from RF Record source of other operands Update source mapping (RAT) To-Do list (from last slide): Get next inst from IB’s Find free reservation station Read operands from RF Record source of other operands Update source mapping (RAT) F1 = F2 + F3 F4 = F1 – F2 F1 = F2 / F3 Instruction Buffers

Issue (2) F2=F4+F1F1=F2/F3 AdderFP-Cmplx A1 (1) A2 (2) A3 (3) C1 (4) C2 (5) F1 F2 F3 F4 RAT F1 F2 F3 F4 Reg File  To-Do list (from last slide): Get next inst from IB’s Find free reservation station Read operands from RF Record source of other operands Update source mapping (RAT) To-Do list (from last slide): Get next inst from IB’s Find free reservation station Read operands from RF Record source of other operands Update source mapping (RAT) (1)2.718 F1 = F2 + F3 F4 = F1 – F2 F1 = F2 / F3 Instruction Buffers

Execute (1) –Monitor results as they are produced –Put a result into all reservation stations waiting for it (missing source operand) –When all operands available for an instruction, it is ready (we can actually execute it) –Several ready instrs for one functional unit? Pick one. Except for load/store Load/Store must be done in the proper order to avoid hazards through memory (more loads/stores this in a later lecture) 7

Execute (2) F1=F2/F3 AdderFP-Cmplx A1 (1) A2 (2) A3 (3) C1 (4) C2 (5) To-Do list (from last slide): Monitor results from ALUs Capture matching operands Compete for ALUs To-Do list (from last slide): Monitor results from ALUs Capture matching operands Compete for ALUs (1)2.718 F1 = F2 + F3 F4 = F1 – F2 F1 = F2 / F3 Instruction Buffers F1 F2 F3 F4 RAT F1 F2 F3 F4 Reg File F2=F4+F1 8

3 Execute (2) F4=F1-F2 F1=F2+F3 F1=F2/F3 AdderFP-Cmplx A1 (1) A2 (2) A3 (3) C1 (4) C2 (5) (4)(1) To-Do list (from last slide): Monitor results from ALUs Capture matching operands Compete for ALUs To-Do list (from last slide): Monitor results from ALUs Capture matching operands Compete for ALUs (1)2.718 F1 = F2 + F3 F4 = F1 – F2 F1 = F2 / F3 Instruction Buffers F1 F2 F3 F4 RAT F1 F2 F3 F4 Reg File F2=F4+F1 9

Execute (2) F4=F1-F2 F1=F2+F3 F1=F2/F3 AdderFP-Cmplx A1 (1) A2 (2) A3 (3) C1 (4) C2 (5) (4)(1) To-Do list (from last slide): Monitor results from ALUs Capture matching operands Compete for ALUs To-Do list (from last slide): Monitor results from ALUs Capture matching operands Compete for ALUs (1)2.718 F2=F4+F1 (1) F1 = F2 + F3 F4 = F1 – F2 F1 = F2 / F3 Instruction Buffers F1 F2 F3 F4 RAT F1 F2 F3 F4 Reg File F2=F4+F1 10

Execute (2) F4=F1-F2 F1=F2+F3 F1=F2/F3 AdderFP-Cmplx A1 (1) A2 (2) A3 (3) C1 (4) C2 (5) (4)(1) To-Do list (from last slide): Monitor results from ALUs Capture matching operands Compete for ALUs To-Do list (from last slide): Monitor results from ALUs Capture matching operands Compete for ALUs (1)2.718 F2=F4+F1 (1) F1 = F2 + F3 F4 = F1 – F2 F1 = F2 / F3 Instruction Buffers F1 F2 F3 F4 RAT F1 F2 F3 F4 Reg File 11

Execute (3) More than one ready inst for the same unit F4=F3-F2 F1=F2+F3 F1=F2/F3 AdderFP-Cmplx A1 (1) A2 (2) A3 (3) C1 (4) C2 (5) (1) (1) F2=F4+F1 (1) Common heuristic: oldest first You can do whatever: it only affects performance, not correctness 12

Write Result (1) –When result is computed, make it available on the “common data bus” (CDB), where waiting reservation stations can pick it up –Result stored in the register file –Stores write to memory –This step frees the reservation station –For our register renaming, this recycles the temporary name (future instructions can again find the value in the actual register, until it is renamed again) 13

Write Result (2) F4=F1-F2 F1=F2+F3 F1=F2/F3 AdderFP-Cmplx A1 (1) A2 (2) A3 (3) C1 (4) C2 (5) (4) To-Do list (from last slide): Broadcast on CDB Writeback to RF Update Mapping Free reservation station To-Do list (from last slide): Broadcast on CDB Writeback to RF Update Mapping Free reservation station F1 F2 F3 F4 Reg File F1 F2 F3 F4 RAT (1) F2=F4+F  (1)  F1 = F2 + F3 F4 = F1 – F2 F1 = F2 / F F2 = F4 + F Only update RAT (and RF) if RAT still contains your mapping! Only update RAT (and RF) if RAT still contains your mapping! X 14

Tomasulo’s Algorithm: Load/Store The reservation stations take care of dependences through registers. Dependences also possible through memory –Loads and stores not reordered in original IBM 360 –We’ll talk about how to do load-store reordering later 15

Detailed Example – Cycle 0 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F2 Is ExW BusyOpVjVkQjQkA Reservation Stations … F0F2F4F6F8F10F12 RAT: LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is … Architecture Reg File: 16

Detailed Example – Cycle 1 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F2 1 Is ExW 1L.D134 BusyOpVjVkQjQkA Reservation Stations LD1… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is 2.5 RAT: 2.5… Architecture Reg File: 17

Detailed Example – Cycle 2 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW 1L.D134 1L.D245 BusyOpVjVkQjQkA Reservation Stations LD2LD1… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles 2 Assume R2 is 100 R3 is 200 F4 is 2.5 RAT: 2.5… Architecture Reg File: 18

Detailed Example – Cycle 3 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW 1L.D134 1L.D245 1MUL.D2.5LD2 BusyOpVjVkQjQkA Reservation Stations ML1LD2LD1… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles 2 3 Assume R2 is 100 R3 is 200 F4 is RAT: 2.5… Architecture Reg File: 19

Detailed Example – Cycle 4 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW 0 1L.D245 1SUB.D0.5LD2 1MUL.D2.5LD2 BusyOpVjVkQjQkA Reservation Stations ML1LD2AD1… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 20

Detailed Example – Cycle 5 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW 0 0 1SUB.D MUL.D DIV.D0.5ML1 BusyOpVjVkQjQkA Reservation Stations ML1AD1ML2… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 21

Detailed Example – Cycle 6 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW 0 0 1SUB.D ADD.D1.5AD1 1MUL.D DIV.D0.5ML1 BusyOpVjVkQjQkA Reservation Stations ML1AD2AD1ML2… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 22

Detailed Example – Cycle 8 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW ADD.D MUL.D DIV.D0.5ML1 BusyOpVjVkQjQkA Reservation Stations ML1AD2ML2… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 23

Detailed Example – Cycle 9 1. L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW ADD.D MUL.D DIV.D0.5ML1 BusyOpVjVkQjQkA Reservation Stations ML1AD2ML2… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 24

Detailed Example – Cycle L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW MUL.D DIV.D0.5ML1 BusyOpVjVkQjQkA Reservation Stations ML1ML2… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 25

Detailed Example – Cycle L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW DIV.D BusyOpVjVkQjQkA Reservation Stations ML2… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 26

Detailed Example – Cycle L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW DIV.D BusyOpVjVkQjQkA Reservation Stations ML2… F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 27

Detailed Example – Cycle L.DF6, 34(R2) 2. L.DF2, 45(R3) 3. MUL.DF0, F2, F4 4. SUB.DF8, F2, F6 5. DIV.DF10,F0,F6 6. ADD.DF6, F8, F Is ExW BusyOpVjVkQjQkA Reservation Stations … F0F2F4F6F8F10F12 LD1 LD2 AD1 AD2 AD3 ML1 ML2 Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Assume R2 is 100 R3 is 200 F4 is RAT: … Architecture Reg File: 28

Timing Example Kind of hard to keep track with previous table-based approach Simplified version to track timing only F6,34(R2)124L.D OperandsIsExecWrCommentsInst F2, 45(R3)235L.D F0,F2,F43616MUL.D F8,F2,F6468SUB.D F10,F0,F651757DIV.D F6,F8,F26911ADD.D Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles Load: 2 cycles Add: 2 cycles Mult: 10 cycles Divide: 40 cycles 29