MS108 Computer System I, Lecture 7: Tomasulo's Algorithm. Prof. Xiaoyao Liang, 2014/3/24



2 Tomasulo's Algorithm
From the IBM 360/91
Goal: high performance using a limited number of registers, without a special compiler
– 4 double-precision FP registers on the 360
– Uses register renaming
Why study a 1966 computer?
– Its descendants include: Alpha 21264, HP PA-8000, MIPS R10000, Pentium III, PowerPC 604, …

3 Tomasulo Algorithm
Control and buffers are distributed with the function units (FUs)
– FU buffers are called reservation stations (RS)
– Each RS holds information about a pending instruction, including its operands
– More reservation stations than registers, so the hardware can do optimizations compilers can't
Registers in instructions are replaced by values or by pointers to reservation stations
– A form of register renaming
– Avoids WAR and WAW hazards
Results go to the FUs from the RSes, not through registers (the equivalent of forwarding)
A Common Data Bus (CDB) broadcasts results to all FUs (their RSes)
Loads and stores are treated as FUs with RSes as well

4 Tomasulo Organization
[Figure: block diagram. Labels: FP Op Queue; FP Registers; Load Buffers (Load1 to Load6, from memory); Store Buffers (to memory); Reservation Stations Add1, Add2, Add3 (FP adders) and Mult1, Mult2 (FP multipliers); Common Data Bus (CDB).]

5 Reservation Station Components
Busy: indicates that the reservation station or FU is busy
Op: operation to perform in the unit (e.g., + or –)
Vj, Vk: values of the source operands
Qj, Qk: reservation stations producing the source operands (the values still to be written)
– Note: Qj, Qk = 0 means the operand is ready
A: effective address (for loads and stores)
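
The fields listed above map naturally onto a small record type. The following is only a minimal sketch in Python, not the 360/91 hardware layout: the class name `ReservationStation`, the `ready()` helper, and the use of `None` in place of "Q = 0" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """One reservation station entry (illustrative field names)."""
    name: str                      # e.g. "Add1", "Mult2"
    busy: bool = False             # Busy
    op: Optional[str] = None       # Op
    vj: Optional[float] = None     # Vj: value of the first source operand
    vk: Optional[float] = None     # Vk: value of the second source operand
    qj: Optional[str] = None       # Qj: producing station; None stands for "Qj = 0"
    qk: Optional[str] = None       # Qk: producing station; None stands for "Qk = 0"
    a: Optional[int] = None        # A: effective address

    def ready(self) -> bool:
        # An entry can start executing once it is in use and
        # neither operand is still waiting on a producer.
        return self.busy and self.qj is None and self.qk is None


# Hypothetical entry: a multiply whose first operand is still owed by Load2.
rs = ReservationStation("Mult2", busy=True, op="MULTD", vk=2.0, qj="Load2")
print(rs.ready())   # False

# Once the value arrives, Vj is filled in and Qj is cleared.
rs.vj, rs.qj = 3.5, None
print(rs.ready())   # True
```

The operand values (2.0, 3.5) are made up; only the field names come from the slide.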

6 Register Result Status and Common Data Bus
Register result status (Qi)
– Indicates which functional unit will write each register, if one exists
– Blank when no pending instruction will write that register
Common data bus
– Normal data bus: data + destination ("go to" bus)
– CDB: data + source ("come from" bus)
– 64 bits of data + 4 bits of functional unit source address
– A waiting unit writes the value if the source matches the functional unit it expects (the one producing its result)
– The CDB does the broadcast
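
The "come from" behaviour of the CDB can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the actual bus logic: the function name `broadcast`, the dictionary layout, and all numeric values are hypothetical. The tags (Load2, Mult2) follow the cycle 10 state of the loop example.

```python
def broadcast(tag, value, stations, reg_status, regs):
    """Deliver (source tag, value) to every consumer that expects that tag."""
    for rs in stations:
        if rs["Qj"] == tag:                 # first operand was waiting on this producer
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:                 # second operand was waiting on this producer
            rs["Vk"], rs["Qk"] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:                 # register still expects this producer
            regs[reg] = value
            reg_status[reg] = None          # blank: no pending writer


# Hypothetical state: Mult2 waits for Load2; F0 will be written by Load2, F4 by Mult2.
stations = [
    {"name": "Mult1", "Vj": 1.5, "Vk": 2.0, "Qj": None, "Qk": None},
    {"name": "Mult2", "Vj": None, "Vk": 2.0, "Qj": "Load2", "Qk": None},
]
reg_status = {"F0": "Load2", "F2": None, "F4": "Mult2"}
regs = {"F0": 0.0, "F2": 2.0, "F4": 0.0}

broadcast("Load2", 7.0, stations, reg_status, regs)
print(stations[1]["Vj"], regs["F0"], reg_status["F4"])   # 7.0 7.0 Mult2
```

Because consumers match on the source tag, a single broadcast updates every waiting station and the register file at once, while F4 stays reserved for Mult2.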

7 Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
– If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers)
2. Execute: operate on operands (EX)
– When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result: finish execution (WB)
– Write on the Common Data Bus to all awaiting units; mark the reservation station available
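
The issue stage is where renaming happens, so it is worth a small sketch. The Python below is a simplified model, not the real control logic: it assumes a single pool of stations for all operations, and the function name `issue`, the instruction tuples, and the register values are hypothetical.

```python
def issue(instr, stations, reg_status, regs):
    """Try to issue (op, dest, src1, src2); return the station name, or None on a stall."""
    op, dest, src1, src2 = instr
    free = next((rs for rs in stations if not rs["busy"]), None)
    if free is None:
        return None                          # structural hazard: no free station
    free.update(busy=True, op=op)
    # Read each source: copy the value if it is available, else record the producer tag.
    for src, v, q in ((src1, "Vj", "Qj"), (src2, "Vk", "Qk")):
        producer = reg_status.get(src)
        if producer is None:
            free[v], free[q] = regs[src], None
        else:
            free[v], free[q] = None, producer
    # Rename the destination: later readers now wait on this station.
    reg_status[dest] = free["name"]
    return free["name"]


stations = [{"name": "Add1", "busy": False}, {"name": "Add2", "busy": False}]
reg_status = {}
regs = {"F0": 1.0, "F2": 2.0, "F6": 0.0, "F8": 0.0}

first = issue(("ADDD", "F6", "F0", "F2"), stations, reg_status, regs)
second = issue(("SUBD", "F8", "F6", "F2"), stations, reg_status, regs)
third = issue(("ADDD", "F0", "F2", "F2"), stations, reg_status, regs)
print(first, second, third)                  # Add1 Add2 None
print(stations[1]["Qj"])                     # Add1
```

The second instruction picks up the tag Add1 for F6 instead of a value, and the third stalls because both stations are occupied. Sources are read before the destination is renamed, so an instruction that reads and writes the same register sees the old producer.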

8 Tomasulo Loop Example
Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    F4, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
This time assume multiply takes 4 clock cycles in the execution stage
Assume the 1st load takes 8 clock cycles (L1 cache miss) in the execution stage; the 2nd load takes 1 extra cycle (hit)
Assume store takes 3 cycles in the execution stage
To be clear, will show clocks for SUBI, BNEZ
Show about 2 iterations

9 Loop Example
Using a simplified presentation for the load/store components: store buffers are added, and the value of the register used for address and iteration control (R1) is shown. ITER is the loop iteration count.

Instruction status:
ITER  Instruction  j   k    Issue  Exec Comp  Write Result
1     LD     F0    0   R1
1     MULTD  F4    F0  F2
1     SD     F4    0   R1
2     LD     F0    0   R1
2     MULTD  F4    F0  F2
2     SD     F4    0   R1

Load/store buffers:
Name    Busy  Addr  Qk
Load1   No
Load2   No
Load3   No
Store1  No
Store2  No
Store3  No

Reservation stations:
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Code:
LD     F0  0   R1
MULTD  F4  F0  F2
SD     F4  0   R1
SUBI   R1  R1  #8
BNEZ   R1  Loop

Register result status:
Clock  R1  F0  F2  F4  F6  F8  F10  F12  ...
Qi

10 Loop Example Cycle 1

11 Loop Example Cycle 2

12 Loop Example Cycle 3

13 Loop Example Cycle 4 Dispatching SUBI Instruction (not in FP queue)

14 Loop Example Cycle 5 Dispatching BNEZ instruction as well (not in FP queue)

15 Loop Example Cycle 6 Notice that F0 never sees Load from location 80

16 Loop Example Cycle 7 Register file completely detached from computation First and Second iteration completely overlapped

17 Loop Example Cycle 8

18 Loop Example Cycle 9 Load1 completing: who is waiting? Note: Dispatching SUBI

19 Loop Example Cycle 10
Load2 completing: who is waiting?
Note: Dispatching BNEZ

Instruction status:
ITER  Instruction  j   k    Issue  Exec Comp  Write Result
1     LD     F0    0   R1   1      9          10
1     MULTD  F4    F0  F2   2
1     SD     F4    0   R1   3
2     LD     F0    0   R1   6      10
2     MULTD  F4    F0  F2   7
2     SD     F4    0   R1   8

Load/store buffers:
Name    Busy  Addr  Qk
Load1   No
Load2   Yes   72
Load3   No
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   Multd  M[80]  R(F2)
      Mult2  Yes   Multd         R(F2)  Load2

Code:
LD     F0  0   R1
MULTD  F4  F0  F2
SD     F4  0   R1
SUBI   R1  R1  #8
BNEZ   R1  Loop

Register result status:
Clock  R1  F0     F2  F4     F6  F8  F10  F12  ...
Qi         Load2      Mult2

20 Tomasulo Organization
[Figure: same block diagram as on slide 4.]

21 Loop Example Cycle 11 Next load in sequence

22 Loop Example Cycle 12 Why not issue third multiply?

23 Loop Example Cycle 13 Why not issue third store?

24 Loop Example Cycle 14 Mult1 completing. Who is waiting?

25 Loop Example Cycle 15 Mult2 completing. Who is waiting?

26 Loop Example Cycle 16

27 Loop Example Cycle 17

28 Loop Example Cycle 18

29 Loop Example Cycle 19

30 Loop Example Cycle 20 Once again: In-order issue, out-of-order execution and out-of-order completion.

31 Why can Tomasulo overlap iterations of loops?
Register renaming
– Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
Reservation stations
– Buffer old values of registers, avoiding the WAR stall that we saw in the scoreboard
Other perspective: Tomasulo builds the data flow dependency graph on the fly
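
The "dependency graph on the fly" view can be illustrated with a short Python sketch. It is a simplification under stated assumptions: each instruction is given a producer tag in issue order, only the FP registers of the loop body are tracked, and the function name `dependency_edges` and the tuple layout are hypothetical.

```python
def dependency_edges(program):
    """Return (producer tag, consumer tag) pairs, tracking the latest writer of each register."""
    latest = {}                              # register -> tag of its most recent writer
    edges = []
    for tag, dest, sources in program:
        for src in sources:
            if src in latest:                # true (RAW) dependence on an in-flight producer
                edges.append((latest[src], tag))
        if dest is not None:
            latest[dest] = tag               # renaming: the register now points at this tag
    return edges


# FP part of the loop body: LD F0; MULTD F4,F0,F2; SD F4 (two iterations).
body = [("Load", "F0", []), ("Mult", "F4", ["F0", "F2"]), ("Store", None, ["F4"])]
program = [(f"{unit}{i}", dest, srcs) for i in (1, 2) for unit, dest, srcs in body]

for edge in dependency_edges(program):
    print(edge)
# ('Load1', 'Mult1')
# ('Mult1', 'Store1')
# ('Load2', 'Mult2')
# ('Mult2', 'Store2')
```

Although both iterations name F0 and F4, the graph falls apart into two independent chains, which matches the cycle 10 state (Store1 waits on Mult1, Mult2 waits on Load2) and is why the iterations can overlap.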

32 Tomasulo's scheme offers 2 major advantages
(1) The distribution of the hazard detection logic
– Distributed reservation stations and the CDB
– If multiple instructions are waiting on a single result, they can be released simultaneously by a broadcast on the CDB
– If a centralized register file were used, the units would have to read their results from the registers when register buses are available
(2) The elimination of stalls for WAW and WAR hazards