Critical Path Analysis
[Figure: dependence graphs over w, x, y, z with node latencies (2), (3), and Max{3,2}; totals of 10 cycles and 8 cycles]

In-order State and Precise Interrupt
• If an IBM 360/91 instruction causes an exception, can we stop the processor in a precise state?
• By the time j executes, k has already updated R4? How do you rewind the register file to the state just after i? Recall, i never even got to update R4!
• Next time: how to maintain an “in-order” state of the machine. (In-order state = the machine state as viewed by the first not-yet-completed instruction.)

  i: R4 ← R0 x R8
  j: R2 ← R0 + R4    Exception!!
  k: R4 ← R0 + R8
  l: R8 ← R4 x R8

A Modern Superscalar Processor

Modern Enhancements to Tomasulo’s Algorithm

                    Tomasulo                    Modern
Machine width       Peak IPC = 1                “Peak” IPC = 8
Structural dep.     2 F.P. functional units     6-10 functional units
                    Single CDB                  Many forwarding buses
Anti-dep.           Operand copying             Renamed register
Output dep.         Reserv. station tag         Renamed register
True data dep.      Tag-based forwarding        Tag-based forwarding
Exceptions          Imprecise                   Precise (requires ROB)

Out-of-Order Machine State
Instruction sequence: R3 ← A, R7 ← B, R8 ← C, R7 ← D, R4 ← E, R3 ← F, R8 ← G, R3 ← H
[Figure: the in-order state, look-ahead state, and architectural state for this sequence; gray = dispatched but not yet executed instructions. Register writes shown in the figure: R3 ← A, R8 ← C, R7 ← D, R4 ← E, R3 ← F, R8 ← G, R3 ← H; and R7 ← D, R4 ← E, R8 ← G, R3 ← H]

Elements of Modern Micro-dataflow
[Figure: stages labeled in-order, out-of-order, in-order]

Steps during Dynamic Execution
• DISPATCH:
  - Read operands from the Register File (ARF) and/or Rename Register File (RRF) (the RRF may return a value or a Tag)
  - Allocate a new RRF entry and rename the destination register to it
  - Allocate a Reorder Buffer (ROB) entry
  - Advance the instruction to the appropriate Reservation Station (RS)
• EXECUTE:
  - The RS entry monitors the result bus for rename register Tag(s) to latch in pending operand(s)
  - When all operands are ready, issue the instruction into a Functional Unit (FU) and deallocate the RS entry (no further stalling in the execution pipe)
  - When execution finishes, broadcast the result to waiting RS entries and to the RRF entry
• COMPLETE:
  - When ready to commit the result into the “in-order” state:
    1. Update the architectural register from the RRF entry, deallocate the RRF entry, and, if it is a store instruction, advance it to the Store Buffer
    2. Deallocate the ROB entry; the instruction is now considered architecturally completed
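The operand read at dispatch and the commit at completion can be sketched in a few lines. This is a minimal illustration, not the design of any particular machine: the dictionary layout, the field names `busy`, `tag`, `valid`, and `data`, the unbounded RRF, and the function names are all assumptions made for the example.

```python
def read_source(arf, rrf, reg):
    """Dispatch-time operand read: return ('value', v) or ('tag', t)."""
    entry = arf[reg]
    if not entry["busy"]:
        return ("value", entry["data"])      # no rename pending: read the ARF
    pending = rrf[entry["tag"]]
    if pending["valid"]:
        return ("value", pending["data"])    # result already sits in the RRF
    return ("tag", entry["tag"])             # wait for this tag on the result bus


def rename_dest(arf, rrf, reg):
    """Allocate a new RRF entry and rename `reg` to it (assumes an unbounded RRF)."""
    tag = len(rrf)
    rrf.append({"valid": False, "data": None})
    arf[reg]["busy"] = True
    arf[reg]["tag"] = tag
    return tag


def complete(arf, rrf, reg, tag):
    """Commit: copy the RRF value into the ARF; clear busy only if no newer rename exists."""
    arf[reg]["data"] = rrf[tag]["data"]
    if arf[reg]["tag"] == tag:
        arf[reg]["busy"] = False


arf = {3: {"busy": False, "tag": None, "data": 10}}
rrf = []
before = read_source(arf, rrf, 3)        # value comes from the ARF
tag = rename_dest(arf, rrf, 3)           # a new writer of R3 is dispatched
pending = read_source(arf, rrf, 3)       # later readers get the tag
rrf[tag].update(valid=True, data=99)     # execution finishes, result broadcast
forwarded = read_source(arf, rrf, 3)     # value now comes from the RRF
complete(arf, rrf, 3, tag)
committed = read_source(arf, rrf, 3)     # value comes from the ARF again
```

The check in `complete` matters when a younger instruction has renamed the same register in the meantime: the ARF entry must stay busy so that readers keep following the newer tag.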

Metaflow Lightning SPARC Processor (circa 1991)
• Superscalar fetch, issue, and execution
• Micro-dataflow instruction scheduling
• Register renaming + memory renaming
• Speculative execution with rapid rewinding
• Precise interrupts
Claim: “Factor of 2-3 performance advantage from architecture”

Metaflow Datapath
[Figure: datapath with ICache, Branch Pred., issue, DRIS (Renaming + Reservation Stations + Reorder Buff.), Scheduler, Retire, and Register File; regions labeled Speculative State and In-order State]

Metaflow DRIS
• Deferred-scheduling Register-renaming Instruction Shelf (i.e. ROB + Rename Table + Reservation Stations)
• A storage array with multiported RAMs and CAMs (a.k.a. a very, very complicated register-file-like thing)
• A DRIS entry is maintained for every instruction in flight.

DRIS entry fields:
  Source 1:     Lock1, RN1, ID1
  Source 2:     Lock2, RN2, ID2
  Destination:  latest, RD, Data
  Status:       Dispatched, Fxn Unit, Executed
  PC

DRIS
• Circular queue structure: instructions are stored in original program order
• New entries are allocated at the head of the queue as new instructions are issued
• Entries are committed in order from the tail of the queue to the register file and memory
[Figure: circular queue, entries labeled n-1, n, ordered from youngest to oldest]
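The queue discipline described above (allocate at the head, commit from the tail) can be sketched as a small ring buffer. The class name `CircularShelf` and its methods are hypothetical and only illustrate the pointer arithmetic; a real DRIS entry holds far more state.

```python
class CircularShelf:
    """Toy circular queue: allocate at the head, commit from the tail."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0      # next slot to allocate (youngest side)
        self.tail = 0      # oldest in-flight entry
        self.count = 0

    def allocate(self, instr):
        if self.count == len(self.entries):
            return None    # shelf full: issue must stall
        slot = self.head
        self.entries[slot] = instr
        self.head = (self.head + 1) % len(self.entries)
        self.count += 1
        return slot        # the slot index doubles as the instruction ID

    def commit(self):
        if self.count == 0:
            return None
        instr = self.entries[self.tail]
        self.entries[self.tail] = None
        self.tail = (self.tail + 1) % len(self.entries)
        self.count -= 1
        return instr


shelf = CircularShelf(4)
ids = [shelf.allocate(op) for op in ("i", "j", "k", "l", "m")]  # fifth issue stalls
oldest = shelf.commit()        # frees the oldest slot
reused = shelf.allocate("m")   # the freed slot is reused, wrapping around
```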

Issue*: (Rename + Decode)
• A new ID (a.k.a. Tag) is allocated to each instruction when it is issued into the DRIS. The ID is the index of the allocated DRIS entry.
• The ID is used to refer to the result of that instruction.
• Register operand lookup for add rd, rs, rt:
  1. Search the DRIS to see if an older instruction has rs or rt as its destination. If so, rename the source by setting the ID field.
  2. If renamed, check whether the data are ready. If not, set the locked bit.
Entry fields involved: Source 1 (Lock1, RN1, ID1), Source 2 (Lock2, RN2, ID2), Destination (latest, RD, Data)

Issue: add rd, rs, rt
Assume new is the ID for the current instruction.

  RN1[new] = rs;  RN2[new] = rt;
  Locked1[new] = false;  Locked2[new] = false;
  ID1[new] = not_valid;  ID2[new] = not_valid;
  forall (id) {                              // over all active DRIS entries
    if ((RD[id] == rs) && Latest[id]) {
      ID1[new] = id;
      if (!Executed[id]) Locked1[new] = true;
    }
  }
  forall (id) {
    if ((RD[id] == rt) && Latest[id]) {
      ID2[new] = id;
      if (!Executed[id]) Locked2[new] = true;
    }
  }

Fields written: Lock1, RN1, ID1, Lock2, RN2, ID2

Associative Lookup
[Figure: rs is compared (=) against the RD field of every entry between the tail pointer and the head pointer; entries carry valid/invalid and latest flags, and the matching latest entry returns its tag (“Return My Tag”)]

Issue: add rd, rs, rt

  RD[new] = rd;
  forall (id) if (RD[id] == rd) Latest[id] = false;
  Latest[new] = true;
  Dispatched[new] = false;
  Executed[new] = false;
  FxnU[new] = Integer ALU;

Fields written: latest, RD, Dispatched, Fxn Unit, Executed
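Putting the source lookup and the destination update together, issue can be sketched as one function. This is an illustrative model only: the DRIS is a plain dictionary keyed by ID, `None` stands in for `not_valid`, and the function name `issue` is an assumption. The source lookup runs before the destination update so that an instruction such as `add r1, r1, r2` is renamed to the older writer of r1 rather than to itself.

```python
def issue(dris, new, rd, rs, rt):
    """Issue a three-register instruction into slot `new` of a toy DRIS."""
    entry = {"RN1": rs, "RN2": rt, "ID1": None, "ID2": None,
             "Locked1": False, "Locked2": False}
    # Source lookup: find the latest in-flight writer of each source register.
    for x, reg in (("1", rs), ("2", rt)):
        for id_, older in dris.items():
            if older["RD"] == reg and older["Latest"]:
                entry["ID" + x] = id_
                entry["Locked" + x] = not older["Executed"]
    # Destination update: this instruction becomes the latest writer of rd.
    for older in dris.values():
        if older["RD"] == rd:
            older["Latest"] = False
    entry.update(RD=rd, Latest=True, Dispatched=False, Executed=False, Data=None)
    dris[new] = entry
    return entry


dris = {}
i = issue(dris, 0, rd=4, rs=0, rt=8)   # i: R4 <- R0 x R8
j = issue(dris, 1, rd=2, rs=0, rt=4)   # j: R2 <- R0 + R4 (waits on i)
k = issue(dris, 2, rd=4, rs=0, rt=8)   # k: R4 <- R0 + R8 (new latest writer of R4)
```

After these three issues, j is locked on entry 0 for its second source, and entry 2 has replaced entry 0 as the latest writer of R4.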

Micro-Dataflow Scheduling
• The scheduler dispatches according to
  - availability of pending instructions’ operands
  - availability of the functional units
  - chronological order of the instructions
• Find the “oldest” N instructions such that
    !Locked1[id] && !Locked2[id] && Dispatched[id] == false && Executed[id] == false && notBusy(FxnU[id])
  Is “oldest-first” always the best strategy?
• Dispatch and set Dispatched[id] = true
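The oldest-first selection can be sketched as a single pass over the entries in age order. This is an assumption-laden toy: `order` is a list of IDs from oldest to youngest, functional units are named by strings, and each unit is assumed to accept at most one instruction per cycle.

```python
def select_ready(dris, order, busy_units, n):
    """Pick up to n of the oldest ready entries; `order` lists IDs oldest first."""
    picked = []
    claimed = set(busy_units)
    for id_ in order:
        e = dris[id_]
        ready = (not e["Locked1"] and not e["Locked2"]
                 and not e["Dispatched"] and not e["Executed"])
        if ready and e["FxnU"] not in claimed:
            e["Dispatched"] = True
            claimed.add(e["FxnU"])   # assumption: one instruction per unit per cycle
            picked.append(id_)
            if len(picked) == n:
                break
    return picked


def make_entry(unit, locked=False):
    return {"Locked1": locked, "Locked2": False,
            "Dispatched": False, "Executed": False, "FxnU": unit}


dris = {0: make_entry("alu"), 1: make_entry("alu"),
        2: make_entry("fpu", locked=True), 3: make_entry("fpu")}
picked = select_ready(dris, order=[0, 1, 2, 3], busy_units=set(), n=2)
```

Here entry 1 loses the ALU to the older entry 0, entry 2 is still waiting on an operand, and entry 3 takes the free FPU, so a younger instruction dispatches ahead of two older ones.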

Dispatching*: add rd, rs, rt
• A dispatched instruction is sent to the functional unit with its operands and its ID
• Operands could come from:
  - DRIS: Data[IDx[ID]] (speculative state), when DRIS[IDx[ID]] is active
  - Register File: RF[RNx[ID]] (in-order state), when DRIS[IDx[ID]] is invalid or retired
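The choice between the two operand sources can be sketched as follows. The sketch assumes that retired entries are removed from the `dris` dictionary and that `None` marks an invalid ID; both are modeling shortcuts, and `read_operand` is a hypothetical name.

```python
def read_operand(dris, regfile, entry, x):
    """Fetch source x (1 or 2) of a dispatched entry."""
    producer = entry["ID%d" % x]
    if producer is not None and producer in dris:
        return dris[producer]["Data"]       # speculative state, from the DRIS
    return regfile[entry["RN%d" % x]]       # in-order state, from the register file


dris = {0: {"Data": 42}}
regfile = {0: 7, 4: 1}
consumer = {"RN1": 0, "ID1": None, "RN2": 4, "ID2": 0}

first = read_operand(dris, regfile, consumer, 1)     # R0 has no in-flight writer
second = read_operand(dris, regfile, consumer, 2)    # R4 comes from entry 0

# Once entry 0 retires, its value lives in the register file instead.
regfile[4] = dris.pop(0)["Data"]
after_retire = read_operand(dris, regfile, consumer, 2)
```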

Scheduling Memory Operations
• Memory data dependences (RAW, WAR, WAW)
• When to start a load instruction (on a uniprocessor)?
  - when there are no more older store instructions in the DRIS, or
  - when the addresses of all older stores in the DRIS are known, or
  - load speculatively and just reload on a RAW hazard
• Storing to memory irrevocably changes the in-order machine state; therefore, a store instruction can only be executed when
  - it is the oldest instruction in the DRIS, or
  - all instructions before the store have completed and thus can no longer cause exceptions (no unresolved/predicted branches)
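The second load policy (wait until every older store address is known) can be sketched as a simple check. This is a simplified model: addresses are compared exactly, a load that matches an older store simply waits (no store-to-load forwarding is modeled), and `load_may_start` is a hypothetical helper.

```python
def load_may_start(load_addr, older_store_addrs):
    """older_store_addrs: addresses of older in-flight stores; None = not yet computed."""
    if any(addr is None for addr in older_store_addrs):
        return False                 # an unknown address might alias the load
    # Assumption: on a match the load waits for the store (no forwarding).
    return load_addr not in older_store_addrs


no_stores = load_may_start(0x100, [])
disjoint = load_may_start(0x100, [0x200, 0x300])
unknown = load_may_start(0x100, [0x200, None])
aliased = load_may_start(0x100, [0x100])
```

The first policy on the slide corresponds to allowing the load only in the `no_stores` case; the third policy would start the load even in the `unknown` case and reload later if a RAW hazard is detected.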

Update
• A Fxn unit returns both the result and the associated ID
• The DRIS entry is updated:
    Data[ID] = result;
    Executed[ID] = true;
• Enable other instructions that use this result:
    forall (id) {
      if (ID1[id] == ID) Locked1[id] = false;
      if (ID2[id] == ID) Locked2[id] = false;
    }
Fields written: Lock1, Lock2, Data, Executed
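The update step translates almost directly into code. In this sketch the DRIS is again a dictionary keyed by ID and `update` is an illustrative name; in hardware the loop is a parallel associative match rather than a sequential scan.

```python
def update(dris, id_, result):
    """Record a result and wake up every entry waiting on it."""
    dris[id_]["Data"] = result
    dris[id_]["Executed"] = True
    for e in dris.values():
        if e["ID1"] == id_:
            e["Locked1"] = False
        if e["ID2"] == id_:
            e["Locked2"] = False


dris = {
    0: {"ID1": None, "ID2": None, "Locked1": False, "Locked2": False,
        "Data": None, "Executed": False},
    1: {"ID1": None, "ID2": 0, "Locked1": False, "Locked2": True,
        "Data": None, "Executed": False},
}
update(dris, 0, 12)   # the functional unit returns (result=12, ID=0)
```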

Retire
• Instructions retire strictly in order from the oldest entry of the DRIS
• Data[retiree] is written (a.k.a. committed) to the register file (speculative → in-order state)
• Store instructions are only executed when retiring from the DRIS
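In-order retirement can be sketched as a loop that stops at the first entry that has not executed. The list `order` (IDs, oldest first) and the function name `retire` are assumptions for the example; stores and exceptions are not modeled.

```python
def retire(dris, order, regfile, n):
    """Retire up to n entries, oldest first, stopping at the first unexecuted one."""
    retired = 0
    while order and retired < n and dris[order[0]]["Executed"]:
        id_ = order.pop(0)
        e = dris.pop(id_)
        regfile[e["RD"]] = e["Data"]    # speculative -> in-order state
        retired += 1
    return retired


dris = {
    0: {"RD": 4, "Data": 12, "Executed": True},
    1: {"RD": 2, "Data": None, "Executed": False},
    2: {"RD": 4, "Data": 20, "Executed": True},
}
order = [0, 1, 2]
regfile = {2: 0, 4: 0}
count = retire(dris, order, regfile, n=4)
```

Entry 2 has finished executing, but it cannot retire past the unfinished entry 1, so the register file sees only the result of entry 0.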

Precise Exceptions
• Discard all DRIS entries younger than the offending instruction
• How about older instructions that haven’t finished yet?
• When to start executing the interrupt handler?
  - Performance
  - Protection
  - An earlier exception?
This works on branch misprediction too!
[Figure: DRIS queue from youngest to oldest, with the excepting entry marked]
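With a circular queue, discarding the younger entries amounts to moving the head pointer back to just past the offending entry. The sketch below assumes in-flight entries occupy slots tail, tail+1, ..., head-1 (modulo the queue size); the function name is hypothetical, and cleanup of rename state such as the latest bits is not modeled.

```python
def squash_younger(head, offending, size):
    """Return (new_head, number_discarded) after dropping entries younger than `offending`."""
    new_head = (offending + 1) % size
    discarded = (head - new_head) % size
    return new_head, discarded


# Queue of 8 slots holding entries in slots 6, 7, 0, 1, 2 (oldest to youngest);
# the next free slot (head) is 3 and the instruction in slot 7 takes an exception.
new_head, discarded = squash_younger(head=3, offending=7, size=8)
```

The same pointer reset applies when the offending instruction is a mispredicted branch, which is why one mechanism can serve both cases.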

The Cost of Implementing DRIS
• To support N-way issue into DRIS per cycle
  - Nx3 simultaneous 5-bit associative lookups
• To support N-way dispatch per cycle
  - 1 prioritized associative lookup of N entries
  - Nx2 indexed lookups in DRIS
  - Nx2 indexed lookups in the GPR
• To support N-way update per cycle
  - N indexed writes to DRIS
  - Nx2 associative lookups and writes in DRIS
• To support N-way retire per cycle
  - N indexed lookups in DRIS
  - N indexed writes to GPR
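To get a feel for how these requirements grow with width, the counts on this slide can be tabulated for a given N. The function and key names are made up for the example; the multipliers are the ones listed on the slide.

```python
def dris_access_counts(n):
    """Per-cycle accesses needed for an N-way DRIS, following the slide's multipliers."""
    return {
        "issue_associative_lookups": 3 * n,
        "dispatch_prioritized_lookups": 1,       # one lookup that selects N entries
        "dispatch_dris_indexed_lookups": 2 * n,
        "dispatch_gpr_indexed_lookups": 2 * n,
        "update_dris_indexed_writes": n,
        "update_associative_lookup_writes": 2 * n,
        "retire_dris_indexed_lookups": n,
        "retire_gpr_indexed_writes": n,
    }


four_way = dris_access_counts(4)
total_dris_associative = (four_way["issue_associative_lookups"]
                          + four_way["update_associative_lookup_writes"])
```

For a 4-way machine this gives 12 associative lookups at issue and 8 associative lookup-and-write operations at update, all in the same cycle.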

Decentralized Reordering Structure
[Figure: in-order stage that allocates reorder buffer entries; out-of-order Integer, Floating-Point, and Load/Store units; in-order register write back (reorder buffer)]

Register Renaming Alternatives
• Number of rename registers
• Organization of rename registers
  - Separate rename register file
  - Pooled architectural/rename register file
• Allocation of rename registers
  - Fixed for each architectural register
  - Shared by all architectural registers
• Physical location of rename registers
  - Attached to the architectural register file
  - Attached to the reorder buffer
• Methods for rename lookup

Register Renaming Mechanisms
[Figure: ARF, RRF, and map table; fields labeled Data, Busy, Tag and Data, Valid; a register specifier drives the operand read; pointers mark the next entry to complete and the next entry to be allocated]
What happens when you get an exception?

Register Renaming in the RS/6000
• Incoming FPU instructions pass through a renaming table prior to decode
• Physical register names exist only within the FPU!
• 32 architectural registers → 40 physical registers
• Complex control logic maintains the active register mapping
[Figure: instruction fields (OP, T, S1, S2, S3) passing through the renaming table]