Scoreboarding & Tomasulo's Approach. Based on the slides of Vincent H. Berk.

Scoreboarding

3/43 Figure A.51: The basic structure of a DLX processor with a scoreboard. [Diagram labels: Scoreboard, Integer unit, FP add, FP divide, FP mult, Registers, Data buses, Control/status]

4/43 Four Stages of Scoreboard Control: ISSUE
1. Issue: decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
Algorithm:
  - Ensure in-order issue
  - Multiple issues per cycle are allowed
  - Check if the destination register is already reserved for writing (WAW)
  - Check if the read-operand stage of the functional unit is free (structural)
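The issue check above can be sketched in a few lines of Python. This is a minimal illustration, not the CDC 6600 logic itself: `FunctionalUnit`, `can_issue`, and `issue` are hypothetical names, and the register result status is assumed to be a plain dictionary mapping a register name to the unit that will write it.

```python
from dataclasses import dataclass


@dataclass
class FunctionalUnit:
    """Hypothetical stand-in for one functional unit's busy flag."""
    name: str
    busy: bool = False


def can_issue(unit, dest, register_result_status):
    """Return True when neither a structural nor a WAW hazard blocks issue."""
    structural_hazard = unit.busy
    waw_hazard = register_result_status.get(dest) is not None
    return not structural_hazard and not waw_hazard


def issue(unit, dest, register_result_status):
    """Reserve the unit and the destination register, or report a stall."""
    if not can_issue(unit, dest, register_result_status):
        return False  # in-order issue: later instructions wait as well
    unit.busy = True
    register_result_status[dest] = unit.name
    return True


status = {}
add = FunctionalUnit("Add")
mult = FunctionalUnit("Mult1")
first = issue(add, "F6", status)    # free unit, free destination
second = issue(mult, "F6", status)  # WAW: F6 is already reserved by Add
third = issue(add, "F8", status)    # structural: Add is still busy
```

Here `first` succeeds, while `second` and `third` stall on a WAW and a structural hazard respectively.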

5/43 Four Stages of Scoreboard Control: READ-OPERANDS
2. Read operands: wait until no data hazards, then read operands (ID2), the first functional pipeline stage
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
Algorithm:
  - Wait for operands to become available, using the register result status (RAW)
  - Operand caching is allowed
  - Forwarding from another WB stage is allowed

6/43 Four Stages of Scoreboard Control: ex + write
3. Execution: operate on operands (EX)
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. This stage can be (sub-)pipelined.
4. Write result: finish execution (WB)
Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If there are none, it writes the result. If there is a WAR hazard, it stalls the instruction.
Algorithm:
  - Delay the write until all Rj and Rk fields for this register are marked as either cached or read.
  - If operands are cached: forward the answer right away. If not, wait until all operands are read before writing.
  - Forward answers to units waiting for this write for their operand.
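A minimal sketch of the WAR check in the write-result stage follows. It assumes the non-caching variant and the convention that a set Rj/Rk flag means "operand ready but not yet read"; `can_write_result` and the dictionary layout are hypothetical names chosen for illustration.

```python
def can_write_result(writer, fu_status):
    """WAR check: no other unit may still need the old value of the writer's Fi."""
    dest = fu_status[writer]["Fi"]
    for name, fu in fu_status.items():
        if name == writer:
            continue
        # A set R flag means the operand is ready but has not been read yet.
        if (fu["Fj"] == dest and fu["Rj"]) or (fu["Fk"] == dest and fu["Rk"]):
            return False
    return True


fu_status = {
    # DIVD F10, F0, F6: still waiting for F0, so F6 has not been read yet.
    "Divide": {"Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True},
    # ADDD F6, F8, F2: finished executing and wants to overwrite F6.
    "Add": {"Fi": "F6", "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False},
}

before = can_write_result("Add", fu_status)  # DIVD still needs the old F6
fu_status["Divide"]["Rk"] = False            # DIVD has now read its operands
after = can_write_result("Add", fu_status)
```

The ADDD result is held back while DIVD has an unread F6 and is released once the flag is cleared.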

7/43 Three Parts of the Scoreboard
1. Instruction status
Indicates which of 4 steps the instruction is in.
2. Functional unit status
Indicates the state of the functional unit (FU). 9 fields for each functional unit:
  - Busy: indicates whether the unit is busy or not
  - Op: operation to perform in the unit (e.g., + or -)
  - Fi: destination register
  - Fj, Fk: source-register numbers
  - Qj, Qk: functional units producing source registers Fj, Fk
  - Rj, Rk: flags indicating when Fj, Fk are available and not yet read (alternatively: read and cached)
3. Register result status
Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
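The functional unit status and register result status can be sketched as simple Python structures. This is an illustrative model only: `FUStatus`, `issue`, and `write_result` are hypothetical names, hazard checks at issue are omitted, and instruction status is not tracked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    """The nine per-unit fields, lower-cased."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None
    fj: Optional[str] = None
    fk: Optional[str] = None
    qj: Optional[str] = None
    qk: Optional[str] = None
    rj: bool = False
    rk: bool = False


def issue(units, result, name, op, fi, fj, fk):
    """Fill in a unit's fields; Qj/Qk come from the register result status."""
    fu = units[name]
    fu.busy, fu.op, fu.fi, fu.fj, fu.fk = True, op, fi, fj, fk
    fu.qj, fu.qk = result.get(fj), result.get(fk)
    fu.rj, fu.rk = fu.qj is None, fu.qk is None
    result[fi] = name


def write_result(units, result, name):
    """Wake up units waiting on `name`, then free the unit and its register."""
    done = units[name]
    for fu in units.values():
        if fu.qj == name:
            fu.qj, fu.rj = None, True
        if fu.qk == name:
            fu.qk, fu.rk = None, True
    if result.get(done.fi) == name:
        del result[done.fi]
    units[name] = FUStatus()


units = {"Integer": FUStatus(), "Mult1": FUStatus()}
result = {}  # register result status: register -> producing unit
issue(units, result, "Integer", "LD", "F2", "R3", None)   # LD F2, 0(R3)
issue(units, result, "Mult1", "MULTD", "F0", "F2", "F4")  # needs F2 (RAW)
waiting = not (units["Mult1"].rj and units["Mult1"].rk)
write_result(units, result, "Integer")
ready = units["Mult1"].rj and units["Mult1"].rk
```

After the load writes its result, the multiply's Qj is cleared and both of its R flags are set, so it may read its operands.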

8/43 Scoreboard Example Cycle 1 R2 has not been read/cached until cycle 2!!!

9/43 Scoreboard Example Cycle 2 Issue 2nd LD or MULT?

10/43 Scoreboard Example Cycle 4 Yes

11/43 Scoreboard Example Cycle 5 SUPERSCALAR: Issue MULTD?

12/43 Scoreboard Example Cycle 6

13/43 Scoreboard Example Cycle 7 Read multiply operands? DIVD could have been issued on this cycle.

14/43 Scoreboard Example Cycle 8a

15/43 Scoreboard Example Cycle 8b

16/43 Scoreboard Example Cycle 9 Issue ADDD?

17/43 Scoreboard Example Cycle 11

18/43 Scoreboard Example Cycle 12

19/43 Scoreboard Example Cycle 13

20/43 Scoreboarding Summary
Limitations of the CDC 6600 scoreboard:
  - No forwarding hardware
  - Limited to instructions in a basic block (small window)
  - Small number of functional units (structural hazards), especially integer/load/store units
  - Do not issue if structural or WAW hazards
  - Wait for WAR hazards
  - Imprecise exceptions
Key idea: Allow instructions behind a stall to proceed
  - Decode => issue instructions and read operands
  - Enables out-of-order execution => out-of-order completion

21/43 Scoreboarding Summary
Modern-day improvements:
  - All operands are cached as soon as available
  - Forwarding
  - Pipelining functional units
  - Microcoding, e.g. IA32 (widens execution window)
  - More precise exceptions
  - In-order retirement
  - Works best with tons of actual registers
Tomasulo approach:
  - Reservation stations vs. forwarding and caching
  - Temporary registers work as many virtual registers

Tomasulo's Approach

23/43 Hardware Schemes for ILP
Key idea: Allow instructions behind a stall to proceed
  - Decode => issue instructions and read operands
  - Enables out-of-order execution => out-of-order completion
Why in hardware at run time?
  - Works when dependence is not known at compile time
  - Simplifies compiler
  - Allows code for one machine to run well on another
Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands

24/43 Tomasulo's Algorithm
For IBM 360/91, about 3 years after CDC 6600
Goal: High performance without special compilers
Differences between IBM 360 & CDC 6600 ISA:
  - IBM has only 2 register specifiers/instruction vs. 3 in CDC 6600
  - IBM has 4 FP registers vs. 8 in CDC 6600
Differences between Tomasulo's Algorithm & Scoreboard:
  - Control & buffers (called reservation stations) distributed with functional units vs. centralized in scoreboard
  - Registers in instructions replaced by pointers to reservation station buffer
  - HW renaming of registers to avoid WAR, WAW hazards
  - Common data bus (CDB) broadcasts results to functional units
  - Loads and stores treated as functional units as well
Descendants: Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

25/43 Three Stages of Tomasulo Algorithm
1. Issue: Get instruction from FP operation queue
If a reservation station is free, issue the instruction & send operands (renames registers).
2. Execution: Operate on operands (EX)
When operands are ready, execute; if not ready, watch the common data bus for the result.
3. Write result: Finish execution (WB)
Write on the common data bus to all awaiting units; mark the reservation station available.
Common data bus: data + source ("come from" bus)

26/43 Tomasulo Organization [Diagram labels: From Instruction Unit, FP Op Queue, FP Registers, From Memory, Load Buffers, Store Buffers, To Memory, Operand Bus, Operation Bus, Reservation Stations (FP Add Res. Station, FP Mul Res. Station), FP Adders, FP Multipliers, Common data bus (CDB)]

27/43 Reservation Station Components
  - Op: operation to perform in the unit (e.g., + or -)
  - Qj, Qk: reservation stations producing source registers
  - Vj, Vk: value of source operands
  - Rj, Rk: flags indicating when Vj, Vk are ready
  - Busy: indicates reservation station and FU is busy
Register result status
Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
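The reservation station fields and the role of the common data bus can be sketched as follows. This is a simplified model under stated assumptions: `ReservationStation`, `read_source`, `issue`, and `broadcast` are hypothetical names, the Rj/Rk flags are folded into "Qj/Qk is empty", and execution latency is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # tags of stations still producing operands
    qk: Optional[str] = None

    def ready(self):
        return self.busy and self.qj is None and self.qk is None


def read_source(reg, regs, reg_status):
    """Return (value, tag): a value if available, else the producer's tag."""
    tag = reg_status.get(reg)
    return (None, tag) if tag is not None else (regs[reg], None)


def issue(stations, regs, reg_status, name, op, dest, src1, src2):
    rs = stations[name]
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = read_source(src1, regs, reg_status)
    rs.vk, rs.qk = read_source(src2, regs, reg_status)
    reg_status[dest] = name  # rename: dest now refers to this station


def broadcast(stations, regs, reg_status, tag, value):
    """CDB write: waiting stations and registers capture the tagged result."""
    for rs in stations.values():
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]
    stations[tag] = ReservationStation()  # mark the station available


regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 0.0}
reg_status = {}
stations = {"Add1": ReservationStation(), "Mult1": ReservationStation()}
issue(stations, regs, reg_status, "Mult1", "MULTD", "F0", "F2", "F4")
issue(stations, regs, reg_status, "Add1", "ADDD", "F8", "F0", "F6")
waiting = not stations["Add1"].ready()  # F0 is still being produced by Mult1
regs["F6"] = 99.0                       # a later write to F6 cannot hurt Add1
broadcast(stations, regs, reg_status, "Mult1", 8.0)
```

Because Add1 copied the value of F6 at issue, the later overwrite of F6 has no effect on it, which is how buffering operands in the station removes the WAR hazard.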

28/43 Tomasulo Example Cycle 1

29/43 ENGS 116 Lecture 8 29 Tomasulo Example Cycle 2

30/43 Tomasulo Example Cycle 3 Register names are renamed in reservation stations. Load1 completing: who is waiting for Load1?

31/43 Tomasulo Example Cycle 4 Load2 completing: who is waiting for it?

32/43 Tomasulo Example Cycle 5

33/43 Tomasulo Example Cycle 6

34/43 Tomasulo Summary
Reservation stations: renaming to a larger set of registers + buffering source operands
  - Prevents registers from becoming a bottleneck
  - Avoids WAR, WAW hazards of scoreboard
  - Allows loop unrolling in HW
Not limited to basic blocks (integer units get ahead, beyond branches)
Lasting contributions:
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

35/43 Tomasulo with Speculation
1. Issue: requires an empty reservation station and an empty ROB slot. Send operands to the reservation station from the register file or from the ROB. This stage is often referred to as dispatch.
2. Execute: monitor the CDB for operands, check RAW hazards. When both operands are available, execute.
3. Write Result: when available, write the result to the CDB, through to the ROB and any waiting reservation stations. Stores write to the value field in the ROB.
4. Commit: three cases:
  - Normal commit: write registers, in-order commit
  - Store: update memory
  - Incorrect branch: flush ROB and reservation stations, and restart execution at the correct PC
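The three commit cases can be sketched with a queue standing in for the reorder buffer. This is an assumption-laden illustration: `ROBEntry` and `commit` are hypothetical names, reservation stations are not modelled, and restarting at the correct PC is left to the caller.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ROBEntry:
    kind: str                            # "reg", "store", or "branch"
    dest: Union[str, int, None] = None   # register name or memory address
    value: Optional[float] = None
    done: bool = False                   # result has been written to the ROB
    mispredicted: bool = False


def commit(rob, regs, memory):
    """Commit finished entries from the head, in order. Return True on flush."""
    while rob and rob[0].done:
        head = rob.popleft()
        if head.kind == "reg":
            regs[head.dest] = head.value
        elif head.kind == "store":
            memory[head.dest] = head.value
        elif head.kind == "branch" and head.mispredicted:
            rob.clear()  # discard all younger, speculative entries
            return True
    return False


# Case 1: a mispredicted branch flushes the younger instruction behind it.
regs, memory = {}, {}
rob = deque([
    ROBEntry("reg", "F0", 8.0, done=True),
    ROBEntry("branch", done=True, mispredicted=True),
    ROBEntry("reg", "F2", 1.0, done=True),  # speculative, must not commit
])
flushed = commit(rob, regs, memory)

# Case 2: an unfinished head blocks a finished store behind it.
regs2, memory2 = {}, {}
rob2 = deque([
    ROBEntry("reg", "F4"),
    ROBEntry("store", 100, 3.0, done=True),
])
flushed2 = commit(rob2, regs2, memory2)
```

In the first case only F0 reaches the register file. In the second case nothing commits, since the store must wait for the older instruction at the head.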

Now, for the grand finale: let's compare!

38/43 Figure A.51: The basic structure of a DLX processor with a scoreboard. [Diagram labels: Scoreboard, Integer unit, FP add, FP divide, FP mult, Registers, Data buses, Control/status]

39/43 Tomasulo Organization [Diagram labels: From Instruction Unit, FP Op Queue, FP Registers, From Memory, Load Buffers, Store Buffers, To Memory, Operand Bus, Operation Bus, Reservation Stations (FP Add Res. Station, FP Mul Res. Station), FP Adders, FP Multipliers, Common data bus (CDB)]

40/43 Scoreboard Example Cycle 6

41/43 Tomasulo Example Cycle 6

42/43 Differences between Tomasulo's Algorithm & Scoreboard
Tomasulo's Algorithm:
  - Control & buffers (reservation stations) distributed with functional units
  - Registers in instructions replaced by pointers to reservation station buffer
  - HW renaming of registers to avoid WAR, WAW hazards
  - Common data bus (CDB) broadcasts results to functional units
  - Loads and stores treated as functional units as well
  - Stages: Issue, Execution, Write result
Scoreboard:
  - Control & buffers centralized
  - Use actual registers
  - Do not issue if structural or WAW hazards
  - Wait for WAR hazards
  - Forwarding?
  - Stages: Issue, Read operands, Execution, Write result