Dataflow Order Execution  Use data copying and/or hardware register renaming to eliminate WAR and WAW ­register name refers to a temporary value produced.

Slides:



Advertisements
Similar presentations
Spring 2003CSE P5481 Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with scoreboarding IBM 360/91 with Tomasulos algorithm.
Advertisements

Scoreboarding & Tomasulos Approach Bazat pe slide-urile lui Vincent H. Berk.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 30, 2002 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Tomasulo’s.
CSE 502: Computer Architecture
Instruction-level Parallelism Compiler Perspectives on Code Movement dependencies are a property of code, whether or not it is a HW hazard depends on.
Lec18.1 Step by step for Dynamic Scheduling by reorder buffer Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
A scheme to overcome data hazards
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.
Lecture 6: ILP HW Case Study— CDC 6600 Scoreboard & Tomasulo’s Algorithm Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
COMP25212 Advanced Pipelining Out of Order Processors.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining II Steve Ko Computer Sciences and Engineering University at Buffalo.
Register Data Flow Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.
Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
DATAFLOW ARHITEKTURE. Dataflow Processors - Motivation In basic processor pipelining hazards limit performance –Structural hazards –Data hazards due to.
True Data Dependences and the Data Flow Limit  A RAW dependence between two instructions is called a true data dependence Due to the producer-consumer.
1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Cont. Computer Architecture.
EENG449b/Savvides Lec /22/05 March 22, 2005 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.
EENG449b/Savvides Lec /20/04 February 12, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
COMP381 by M. Hamdi 1 Pipelining (Dynamic Scheduling Through Hardware Schemes)
ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:
CSC 4250 Computer Architectures October 13, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation.
Computer Organization and Architecture Instruction-Level Parallelism and Superscalar Processors.
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
Out-of-order execution: Scoreboarding and Tomasulo Week 2
Instruction-Level Parallelism dynamic scheduling prepared and Instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University May 2015Instruction-Level Parallelism.
Chapter One Introduction to Pipelined Processors.
Speeding up of pipeline segments © Fr Dr Jaison Mulerikkal CMI.
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
Computer Architecture: Out-of-Order Execution
Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.
2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.
04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;
COMP25212 Advanced Pipelining Out of Order Processors.
CS203 – Advanced Computer Architecture ILP and Speculation.
Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm 吳俊興 高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.
IBM System 360. Common architecture for a set of machines
Dynamic Scheduling Why go out of style?
/ Computer Architecture and Design
Tomasulo’s Algorithm Born of necessity
Step by step for Tomasulo Scheme
CS203 – Advanced Computer Architecture
Chapter One Introduction to Pipelined Processors
Microprocessor Microarchitecture Dynamic Pipeline
CSCE 432/832 High Performance Processor Architectures Register Data Flow Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti,
Chapter 3: ILP and Its Exploitation
Advantages of Dynamic Scheduling
High-level view Out-of-order pipeline
CMSC 611: Advanced Computer Architecture
A Dynamic Algorithm: Tomasulo’s
Out of Order Processors
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
ECE 2162 Reorder Buffer.
Adapted from the slides of Prof
Checking for issue/dispatch
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
Static vs. dynamic scheduling
Tomasulo Organization
Reduction of Data Hazards Stalls with Dynamic Scheduling
CS5100 Advanced Computer Architecture Dynamic Scheduling
Adapted from the slides of Prof
Scoreboarding ENGS 116 Lecture 7 Vincent H. Berk October 5, 2005
September 20, 2000 Prof. John Kubiatowicz
Lecture 7 Dynamic Scheduling
Conceptual execution on a processor which exploits ILP
Presentation transcript:

Dataflow Order Execution  Use data copying and/or hardware register renaming to eliminate WAR and WAW ­register name refers to a temporary value produced by an earlier instruction (ISA perspective) ­decouple register name from fixed storage location ­disambiguate between register name reuse  Maintain a window (or windows) of several pending instructions with only RAW dependence  Issue instructions out-of-order ­find instructions whose input operands are available ­give preference to older instructions ­A completing instruction’s result can trigger other pending instructions (RAW)

IBM 360 Floating Point Unit (Base Model)

Diversified Pipelined Inorder Issue, Out-of-order Complete  Multiple functional units (FU’s) ­Floating-point add ­Floating-point multiply/divide  Three register files (pseudo reg-reg machine in FP unit) ­(4) floating-point registers (FLR) ­(6) floating-point buffers (FLB) ­(3) store data buffers (SDB)  Out of order instruction execution: ­After decode the instruction unit passes all floating point instructions (in order) to the floating-point operation stack (FLOS). ­In the floating point unit, instructions are then further decoded and issued from the FLOS to the two FU’s  Variable operation latencies (not pipelined): ­Floating-point add: 2 cycles ­Floating-point multiply: 3 cycles ­Floating-point divide: 12 cycles

Tomasulo’s Algorithm [IBM 360/91, 1967]

Reservation Station  Buffers where instructions can wait for RAW hazard resolution and execution  Associate more than one set of buffering registers (control, source, sink) with each FU  virtual FU’s. ­Add unit: three reservation stations ­Multiply/divide unit: two reservation stations  Pending (not yet executing) instructions can have either value operands or pseudo operands (aka. tags). Mult RS2 RS1  Mult RS1 Mult RS2

Rename Tags  Register names are normally bound to FLR registers  When an FLR register is stale, the register “name” is bound to the pending-update instruction  Tags are names to refer to these pending-update instructions  In Tomasulo, A “tag” is statically bound to the buffer where a pending-update instruction waits. ­6 FLB’s ­5 reservation stations (3 add RSs, 2 multiply/divide RSs)  4-bit tag is needed to identify the 11 potential sources  Instructions can be dispatched to RSs with either value operands or just tags. ­Tag operand  unfulfilled RAW dependence ­the instruction in the RS corresponding to the Tag will produce the actual value eventually

Common Data Bus (CDB)  CDB is driven by all units that can update FLR ­When an instruction finishes, it broadcasts both its “tag” and its result on the CDB. ­Why don’t we need the destination register name?  Sources of CDB: ­Floating-point buffers (FLB) ­Two FU’s (add unit and the multiply/divide unit)  The CDB is monitored by all units that was left holding a tag instead of a value operand ­Listens for tag broadcast on the CDB ­If a tag matches, grab the value  Destinations of CDB: ­Reservation stations ­Store data buffers (SDB) ­Floating-point registers (FLR)

Output Dependences (WAW) Superscalar Execution Check List INSTRUCTION PROCESSING CONSTRAINTS Resource ContentionCode Dependences Control Dependences Data Dependences True Dependences Anti-Dependences Storage Conflicts (Structural Dependences) (RAW) (WAR)

Structural Dependence Resolution  Structural dependence: virtual FU’s ­FLOS can hold and decode up to 8 instructions. ­Instructions are dispatched to the 5 reservation stations (virtual FU’s) even though there are only two physical FU’s. ­Hence, structural dependence does not stall decoding Why is this useful?

Resolving True-Dependence  True dependence: Tags + CDB ­If an operand is available in FLR, it is copied to RS ­If an operand is not available then a tag is copied to the RS instead. This tag identifies the source (RS/instruction) of the pending write ­Eventually the source instruction completes and broadcasts its tag and value on the CDB ­Any reservation station entry, FLR entry or SDB entry that holds a matching tag as operand will latch in the broadcasted value from the CDB. RAW dependence does not block subsequent independent instructions and does not block an FU

RAW Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #1: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #2: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #3: i: R2  R0 + R4 j: R8  R0 + R2

RAW Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData X Cyc #1: dispatch i RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData X X2-- Cyc #2: dispatch j RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData X2-- Cyc #3: i in RS 1 broadcasts tag and result: CBD= > i: R2  R0 + R4 j: R8  R0 + R2

Resolving Anti-Dependence  Anti-dependence: Operand Copying ­If an operand is available in FLR, it is copied to RS with the issuing instruction ­By copying this operand to RS, all WAR dependencies due to future writes to this same register are resolved Hence, the reading of an operand is not delayed, possibly due to other dependencies, and subsequent writes are also not delayed.

WAR Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #1: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #2: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #3: i: R4  R0 x R8 j: R0  R4 x R2 k: R2  R2 + R8

WAR Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc Mult/Div FLRBusyTagData X Cyc #1: dispatch i RSTagSinkTagSrc Adder RSTagSinkTagSrc Mult/Div FLRBusyTagData 0X5-- 2X1 4X Cyc #2: dispatch j & k (assume dual issue) RSTagSinkTagSrc Adder RSTagSinkTagSrc Mult/Div FLRBusyTagData 0X5-- 2X1 4X Cyc #3: i: R4  R0 x R8 j: R0  R4 x R2 k: R2  R2 + R8

WAR Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc Mult/Div FLRBusyTagData 0X Cyc #4: RS 1 and 4 completes CBD= > & > i: R4  R0 x R8 j: R0  R4 x R2 k: R2  R2 + R8

Resolving Output-Dependence  Output dependence: “register renaming” + result forwarding ­If a FLR is waiting for a pending write, it’s tag field will contain the tag of the source instruction ­If a 2nd instruction comes along and want to write the same register the register can be renamed to the 2nd instruction (i.e. new tag) Any instruction that needs the value of the 1st pending write has the tag of the 1st instruction. Hence, the correct value will be forwarded from the 1st instruction directly any subsequent instruction that reads the register will get the tag, or eventually the result, of the 2nd instruction WAW dependence is resolved without stalling a physical functional unit and does not require additional buffers to ensure sequential write back to the register file.

WAW Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #1: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #2: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #3: i: R4  R0 x R8 j: R2  R0 + R4 k: R4  R0 + R8 l: R8  R4 x R8

WAW Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #4: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #5: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #6: i: R4  R0 x R8 j: R2  R0 + R4 k: R4  R0 + R8 l: R8  R4 x R8

WAW Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc Mult/Div FLRBusyTagData X1-- 4X Cyc #1: dispatch i and j RSTagSinkTagSrc Adder RSTagSinkTagSrc Mult/Div FLRBusyTagData X1-- 4X2 8X5 Cyc #2: dispatch k and l RSTagSinkTagSrc Adder RSTagSinkTagSrc Mult/Div FLRBusyTagData X1-- 4X2 8X5 Cyc #3: i: R4  R0 x R8 j: R2  R0 + R4 k: R4  R0 + R8 l: R8  R4 x R8

WAW Example: RSTagSinkTagSrc Adder RSTagSinkTagSrc Mult/Div FLRBusyTagData X X5-- Cyc #4: RS 2 and 4 completes: CBD= > & > RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #5: RSTagSinkTagSrc Adder RSTagSinkTagSrc 4 5 Mult/Div FLRBusyTagData Cyc #6: i: R4  R0 x R8 j: R2  R0 + R4 k: R4  R0 + R8 l: R8  R4 x R8

Code Sequence for Example 4 w:R4  R0 + R8 x:R2  R0 x R4 y:R4  R4 + R8 z:R8  R4 x R2 w x y z