Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters J. Nelson Amaral.


Tomasulo Algorithm

IBM 360/91 Floating Point Arithmetic Unit
Tomasulo Algorithm: a reservation station for each functional unit.
Each reservation station has a Free/Occupied bit, and each operand has a flag:
Flag = on → Data = value
Flag = off → Data = tag, a pointer to the ROB entry that will store the result.
Baer, p. 97

Decode-rename Stage
Reservation station available? No → structural hazard: stall incoming instructions.
Free ROB entry? No → structural hazard: stall incoming instructions.
Yes to both → assign a reservation station and the tail of the ROB to the instruction.
Baer, p. 97

Dispatch Stage
For each source operand, what does the map hold?
– Logical register → forward the value to the reservation station (RS); ReadyBit(RS) ← 1.
– ROB entry with Flag = value → forward the value to the RS; ReadyBit(RS) ← 1.
– ROB entry with Flag = tag → forward the ROB tag to the RS; ReadyBit(RS) ← 0.
Map the result register to a tag and enter the tag into the RS.
Enter the instruction at the tail of the ROB; ResultFlag(tail of ROB) ← 0.
Baer, p. 98
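The operand lookup performed at dispatch can be sketched in Python (a minimal model, not Baer's code; the names `register_map`, `rob`, and `regfile` are illustrative):

```python
def resolve_operand(reg, register_map, rob, regfile):
    """Resolve one source operand at dispatch.

    Returns (ready_bit, payload): the operand value when it is already
    available, otherwise the ROB tag the reservation station must watch.
    """
    tag = register_map.get(reg)
    if tag is None:                # not renamed: read the register file directly
        return 1, regfile[reg]
    entry = rob[tag]
    if entry["flag"]:              # result computed but not yet committed
        return 1, entry["data"]
    return 0, tag                  # still in flight: forward the ROB tag

# Illustrative state: R4 is renamed to a pending entry, R6 to a completed one.
rob = {"E1": {"flag": 0, "data": None}, "E2": {"flag": 1, "data": 42}}
register_map = {"R4": "E1", "R6": "E2"}
regfile = {"R2": 7}
```

With this state, reading R2 returns its architectural value, reading R6 forwards the value waiting in the ROB, and reading R4 hands the RS the tag E1 with its ready bit cleared.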

Issue Stage
Are both flags in the RS on, and is the functional unit not stalled (waiting for the CDB)?
Yes → issue the instruction to the functional unit to start execution.
If multiple functional units of the same type are available, use a scheduling algorithm.
CDB = Common Data Bus
Baer, p. 98

Execute Stage
On the last cycle of execution, request ownership of the CDB; once ownership is granted, broadcast the result and its associated tag.
If multiple functional units request ownership of the Common Data Bus (CDB) on the same cycle, a hardwired priority protocol picks the winner.
The ROB stores the result in the entry identified by the tag and sets the corresponding ReadyBit.
RSs waiting on the same tag store the result and set the corresponding flag.
Baer, p. 98

Commit Stage
Is there a result at the head of the ROB?
Yes → store the result in the logical register and delete the ROB entry.
Baer, p. 97

Operation Timings (assuming no dependencies)
Each instruction goes through: decoded → dispatched → issued → finish execution → broadcast → commit (if at the head of the ROB).
[The slide tabulated the cycle-by-cycle timing of an addition and a multiplication through these stages; the numbers did not survive transcription.]
Baer, p. 98

Example
i1: R4 ← R0 * R2    # uses reservation station 1 of the multiplier
i2: R6 ← R4 * R8    # uses reservation station 2 of the multiplier
i3: R8 ← R2 + R12   # uses reservation station 1 of the adder
i4: R4 ← R14 + R16  # uses reservation station 2 of the adder

The following slides stepped through cycle-by-cycle snapshots of the register map, the ROB, and the reservation stations for this example. The tables did not survive transcription; the recoverable content:

Snapshot 1: i1 is executing in multiplier reservation station 1 with tag E1; i2 is dispatched in multiplier reservation station 2, with operand R8 ready and the other operand waiting on tag E1. The ROB holds E1 (R4, head) and E2 (R6, tail); the register map sends R4 to E1 and R6 to E2.

Snapshot 2: i3 and i4 are dispatched to the adder reservation stations; the ROB now holds E1 (R4, head), E2 (R6), E3 (R8), and E4 (R4, tail), and i4 is ready to broadcast. “register R4, which was renamed as ROB entry E1 and tagged as such in the reservation station Mult2, is now mapped to ROB entry E4.” (Baer, p. 102)

Snapshot 3: i3 broadcasts its result (assume the adder has priority on the CDB); ROB entry E3 (R8) is marked ready.

Snapshot 4: i4 broadcasts; ROB entry E4 (R4) is marked ready.

Snapshot 5: i1 broadcasts; the broadcast wakes up i2 in multiplier reservation station 2, whose tag E1 now matches.

Snapshot 6: i1 commits from the head of the ROB while i2 executes; the adder reservation stations are free again. Note that although i3 and i4 finished early, they commit only after i1 and i2: commit is in program order.

IBM 360/91 – unveiled in 1966

Some variant of the Tomasulo algorithm is the basis for the design of all out-of-order processors. Baer p. 97

Data dependencies between instructions: several instructions get to the end of the front end and have to wait for operands. Where should these instructions wait? How do they become ready for issue? Baer, p. 177

Wakeup Stage: detects instruction readiness. We hope to wake up m instructions (the machine width) on each cycle. Baer, p. 177

Select (or Scheduling) Step: arbitrates between multiple instructions vying for the same functional unit.
– Variations of first-come-first-served (i.e., FIFO order).
Bypassing (or forwarding) of operands to units allows earlier selection.
Critical instructions may have preference for selection.
Baer, p. 177

Out-of-Order Architectures Key idea: allow instructions following a stalled one to start execution out of order. A FIFO schedule is not a good idea! Where to store stalled instructions? Baer p. 178

Two Extreme Solutions
Tomasulo: a separate reservation station for each functional unit (distributed window), e.g., the IBM PowerPC series.
Instruction window: a centralized reservation station shared by all functional units (centralized window), e.g., the Intel P6 architecture.
Baer, p. 178

A Hybrid Solution
Reservation stations are shared among groups of functional units (hybrid window).
MIPS R10000: 3 sets of reservation stations:
– integer units
– floating-point units
– address calculations (load-store units)
Baer, p. 178

How does a design team choose among a centralized, distributed, or hybrid window? What are the compromises? Baer, p. 179

Window Design
Resource allocation: centralized is better, because static partitioning of resources is worse than dynamic allocation.
Large windows: speed and power come into play.
Baer, p. 179

Two-Step Instruction Issue Wakeup: instruction is ready for execution Select: instruction is assigned to an execution unit.

Wakeup Step
(Figure: f functional units broadcasting to w window entries; a window entry has buses from 8 execution units.)
We need one bus from each functional unit to each window entry.
We also need two comparators per functional unit in each window entry, thus 2fw comparators in total.
If we separate the functional units and window slots into two equal-size groups, each group waking only its own slots, each group needs 2(f/2)(w/2) = fw/2 comparators, i.e., fw in total, half as many.
We will also need fewer (and shorter) buses from units to slots.
Baer, p. 180
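The comparator arithmetic above can be checked directly:

```python
def comparators(f, w):
    """2 comparators per (functional unit, window entry) pair."""
    return 2 * f * w

f, w = 8, 64                              # illustrative machine parameters
full = comparators(f, w)                  # one monolithic window
split = 2 * comparators(f // 2, w // 2)   # two independent half-size groups
```

For f = 8 and w = 64, `full` is 1024 and `split` is 512: splitting into two groups halves the comparator count.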

Select Step
Priority encoder: a circuit that receives several requests and issues one grant.
Woken-up instructions vying for the same unit send requests.
Priority is related to the position in the window.
Smaller window → smaller priority encoder.
Baer, p. 181
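The function of a position-priority encoder can be sketched as a loop over the window, oldest entry first (a real priority encoder is combinational logic; this sequential version only models its behavior):

```python
def select(requests):
    """requests: one bit per window entry, index 0 = highest priority
    (e.g., the oldest instruction). Return the index granted, or None."""
    for idx, req in enumerate(requests):
        if req:
            return idx       # grant the highest-priority requester
    return None              # no instruction requested this unit
```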

When should a centralized window be replaced by a distributed or hybrid one? When the wakeup-select steps are on the critical path. The threshold appears to be windows of around 64 entries on a 4-wide superscalar processor. Baer, p. 182

Intel Pentium 4: 2 large windows, 2 schedulers per window.
Intel Pentium III and Intel Core: smaller centralized window.
AMD Opteron: 4 sets of reservation stations.
Baer, p. 182

Relation between Select and Wakeup
Example:
i:   R51 ← R22 + R33
i+1: R43 ← R27 – R51
The name given to the result of instruction i (R51) must be broadcast as soon as instruction i is selected. Broadcasting the tag of R51 wakes up instruction i+1. For single-cycle latency instructions, waiting until the start of execution is too late to broadcast the tag.
Baer, p. 183

Speculative Wakeup and Select
Example:
i:   R51 ← load(R22)
i+1: R43 ← R27 – R51
i+2: R35 ← R51 + R28
In this case the tag of the destination of instruction i is broadcast. Instructions i+1 and i+2 are speculatively woken up and selected assuming a cache-hit latency. In the case of a cache miss, all dependent instructions that have been woken up and selected must be aborted.
Baer, p. 183

Speculative Selection and the Reservation Stations
An instruction must remain in a reservation station after it is scheduled:
– A bit indicates that the instruction has been selected.
– The station is freed once it is certain that the instruction's selection is no longer speculative.
Windows are large in comparison with the number of functional units, to accommodate many instructions in flight, some of them speculative.
Baer, p. 183

Integrated Register File vs. Tomasulo Reservation Stations
What happens upon selection of an instruction?
Tomasulo: the reservation station already holds the opcode and the operand values, which are sent directly to the functional unit.
Integrated register file: the instruction window holds the opcode and operand names; the operand values are read from the physical register file on the way to the functional unit.
Baer, p. 183

The Complexity of Bypassing
Example:
i:   R51 ← R22 + R33
i+1: R43 ← R27 – R51
Functional unit A computes i; functional unit B computes i+1. The output of A must be forwarded to the input of B, bypassing storage.
Baer, p. 183

The Complexity of Bypassing (cont.)
If the assignment is reversed, with A computing i+1, the bypass must forward the output of B to the input of A. Since either assignment is possible, the hardware has to implement both buses.
Baer, p. 183

The Complexity of Bypassing (cont.)
We also need buses to forward the output of B to the other units. In general, given k functional units we may need k² buses.
Buses become long to avoid crossing each other.
Forwarding may limit the number of functional units in a processor.
Forwarding may need more than one cycle to complete.
Baer, p. 184

Load Speculation
Load address speculation: used for data prefetching.
Memory dependence prediction: used to speculate on the data flow from a store to a subsequent load.
Baer, p. 185

Store Buffer: a circular queue.
– An entry is allocated when the store instruction is decoded.
– The entry is removed when the store is committed.
It keeps the data for stores that have not yet committed.
Baer, p. 185

States of a Store Buffer Entry
AV: Available.
AD: Address is known (after address computation; the data to be stored is still to be computed by another instruction).
RE: Result and address known.
CO: Committed (the store instruction reached the head of the ROB; the data is then written to the cache).
What happens to the store buffer on a branch misprediction?
Baer, p. 185
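The life of a store buffer entry can be sketched as a transition table (only the transitions named on the slide; the event names are illustrative):

```python
# AV -> AD (address computed) -> RE (address + data known)
#    -> CO (committed) -> AV (written to cache, entry freed)
TRANSITIONS = {
    ("AV", "address"): "AD",   # address computation completes
    ("AD", "data"):    "RE",   # data to be stored arrives
    ("RE", "commit"):  "CO",   # store reaches the head of the ROB
    ("CO", "write"):   "AV",   # data written to cache, entry recycled
}

def step(state, event):
    """Advance one entry; ignore events that do not apply in this state."""
    return TRANSITIONS.get((state, event), state)
```

A real design may also let the data arrive before the address; this sketch keeps only the path shown above.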

Handling the Store Buffer on Branch Mispredictions and Exceptions
Entries preceding the mispredicted branch:
– are in the COMMIT state
– must be written to the cache.
Entries following the misprediction become AVAILABLE.
Exceptions are similar: the COMMIT entries must be written to the cache before the exception is handled.
Baer, p. 186

Load Instructions and Load Speculation Baer p. 187

Load/Store Window Implementation – Most Restricted
Load/Store Window (FIFO):
– Loads/stores are inserted in program order.
– Loads/stores are removed in the same order, at most one per cycle.
– A single window is used for both loads and stores.
Baer, p. 187

Load Bypassing
Compare the address of the load with all addresses in the store buffer. This requires an associative search of the operand addresses in the store buffer.
– Load bypassing: if there is no match, the load can proceed.
– What happens if the operand address of any entry in the store buffer is not yet computed? The load cannot proceed.
– What happens if there is a match with an entry that is not committed? The load cannot access the cache. Here “match” means the last match in program order.
Baer, p. 187

Load Forwarding
If these conditions are true:
– the load matches a store buffer entry, AND
– the result is available for that entry (the entry is in the RE or CO state),
then the result can be sent to the register specified by the load.
If the match is with an entry in the AD state, the load waits for the entry to reach the RE state.
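The bypassing and forwarding rules of the last two slides can be sketched as one check against the store buffer. This is a simplified model, not Baer's hardware: the buffer is a list in program order, and the scan walks from the youngest store backward, stopping at the first entry whose address is still unknown (`None`), since that store might match.

```python
def try_load(addr, store_buffer):
    """store_buffer: list of dicts, oldest to youngest, with fields
    'addr' (None if not yet computed), 'state' (AV/AD/RE/CO), 'data'.
    Returns an (action, value) pair."""
    for entry in reversed(store_buffer):        # last match in program order
        if entry["addr"] is None:
            return ("stall", None)              # unknown address: cannot proceed
        if entry["addr"] == addr:
            if entry["state"] in ("RE", "CO"):
                return ("forward", entry["data"])   # load forwarding
            return ("wait", None)               # AD state: wait for the data
    return ("bypass", None)                     # no match: load bypassing, access cache
```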

Load Speculation in Out-of-Order Architectures
Dynamic memory disambiguation problem: loads are issued speculatively ahead of preceding stores in program order. How do we ensure that data dependences are not violated?

Three Approaches
Pessimistic: wait until it is certain that the load can proceed (as in load bypassing and forwarding).
Optimistic: the load always proceeds speculatively; a recovery mechanism is needed.
Dependence prediction: use a predictor to decide whether or not to speculate, trying to cause fewer recoveries.

Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3   # true dependency with an earlier store
⋅⋅⋅
i4: ld R4, memadd4
Pessimistic: i3 cannot issue until i2 has computed its result (i2 must be at least in the RE state); i4 proceeds once i1 and i2 are in the AD state.
Baer, p. 188

Example (cont.)
i1: st R1, memadd1
i2: st R2, memadd2
i3: ld R3, memadd3   # true dependency with an earlier store
i4: ld R4, memadd4
Optimistic: i3 and i4 issue as soon as possible, and load-buffer entries are created for them. When a store reaches CO, its address is compared associatively with the load-buffer entries.
Baer, p. 189

Example (cont.)
Store buffer: i1 (AD, memadd1), i2 (AD, memadd2).
Load buffer: i3 (memadd3) and i4 (memadd4), each with its speculative bit set, indicating that the load is speculative.
When i1 reaches CO, nothing happens because there is no match in the load buffer.
Baer, p. 189

Example (cont.)
Store buffer: i1 (CO, memadd1), i2 (CO, memadd2).
When i2 reaches CO, its address matches i3 in the load buffer:
– i3 has to be reissued.
– i4 has to be reissued because it is after i3 in program order (some implementations only reissue the instructions that depend on i3).
Baer, p. 189

Example (cont.)
Dependence prediction: with correct predictions, i4 can proceed and we avoid reissuing i3.
Baer, p. 189

Motivation: Optimistic Memory dependencies are rare: Less than 10% of loads depend on an earlier store Baer p. 190

Motivation: Dependence Prediction Load misspeculations are expensive and predictors can reduce them. What strategy should we use for predicting profitable speculations? Baer p. 190

Simple Strategy
Memory dependencies are infrequent: predict that all loads can be speculated.
If a load L is misspeculated, all subsequent instances of L must wait.
We need a bit to remember this. Where should this bit be stored?
Baer, p. 190

Simple Strategy (cont.)
A single prediction bit P is associated with the instruction in the cache:
– When the load instruction is brought into the cache → P = 1.
– The load is misspeculated → P = 0.
– The line is evicted from the cache and reloaded → P = 1.
Strategy used in the DEC Alpha.
Baer, p. 190
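The single-bit scheme can be sketched as follows (class and method names are illustrative):

```python
class LoadSpecBit:
    """One P bit carried with a load instruction in the instruction cache."""

    def __init__(self):
        self.p = 1              # set when the load is brought into the cache

    def may_speculate(self):
        return self.p == 1

    def on_misspeculation(self):
        self.p = 0              # later instances of this load must wait

    def on_reload(self):
        self.p = 1              # line evicted and refetched: start over
```

Note the built-in forgetting mechanism: an eviction erases the "do not speculate" history, so a load that once collided gets another chance after its line is refetched.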

Principle Behind Load Prediction
“static store-load instruction pairs that cause most of the dynamic data mispredictions are relatively few and exhibit temporal locality.”
Moshovos, A., Breach, S. E., Vijaykumar, T. N., and Sohi, G. S., “Dynamic Speculation and Synchronization of Data Dependences,” International Symposium on Computer Architecture (ISCA), 1997, Denver, CO, USA.

Ideal Load Speculation
Avoids misspeculation while allowing loads to execute as early as possible:
– Loads with no true dependences → execute without delay.
– A load with a true dependence → executes as soon as the store that produces the data commits.
(Moshovos et al., ISCA 1997)

A Real Predictor
(i) Dynamically identify store-load pairs that are likely to be data dependent.
(ii) Provide a synchronization mechanism for instances of these dependences.
(iii) Use this mechanism to synchronize the store and the load.
(Moshovos et al., ISCA 1997)

Load Predictor Table
A table of saturating counters, hashed on the load PC.
Predictor states:
00: strong nospeculate
01: weak nospeculate
10: weak speculate
11: strong speculate
Incrementing a saturating counter moves it toward strong speculate.
Each load instruction has a loadspec bit.
Load buffer entry fields:
– tag
– op.address: memory address of the operand
– spec.bit: is this a speculative load?
– update.bit: should the predictor be updated at commit/abort?
Baer, p. 190

Load/Decode Stage Set loadspec bit according to value of counter associated with the load PC Baer p. 190

After the Operand Address Is Computed
No uncommitted younger stores → issue the cache access; enter (op.ad, tag, spec.bit = 0, update.bit = 0) in the load buffer.
Uncommitted younger stores and loadspec off → wait, as in the pessimistic solution; enter (op.ad, tag, spec.bit = 0, update.bit = 1) in the load buffer.
Uncommitted younger stores and loadspec on → issue the cache access; enter (op.ad, tag, spec.bit = 1, update.bit = 0) in the load buffer.
Baer, p. 190

Store Commit Stage
For all matches in the load buffer:
– spec.bit off → update.bit ← 1 (it was correct not to speculate, and the load should keep not speculating in the future).
– spec.bit on → load abort: predictor ← strong nospeculate; recover from the misspeculated load.
Baer, p. 191

Load Commit Stage
– spec.bit on → increment the saturating counter (speculating was correct).
– spec.bit off and update.bit on → predictor ← strong nospeculate.
– spec.bit off and update.bit off → increment the saturating counter (we would like to speculate in the future).
Baer, p. 191
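The counter transitions described in the decode and commit stages above can be sketched as follows (function names are illustrative; the 0..3 encoding follows the predictor states listed earlier):

```python
def predict_speculate(counter):
    """loadspec bit: speculate in the weak/strong speculate states."""
    return counter >= 2          # 10 = weak speculate, 11 = strong speculate

def on_load_abort(counter):
    """A committing store matched a speculative load: misspeculation."""
    return 0                     # strong nospeculate

def on_load_commit(counter, spec_bit, update_bit):
    if spec_bit:                 # speculated and was never aborted: correct
        return min(counter + 1, 3)
    if update_bit:               # waited, and a store really did match
        return 0                 # strong nospeculate
    return min(counter + 1, 3)   # waited, but no dependence materialized
```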

Store Sets Baer p. 191

Motivation for Store Sets
The past is a good predictor of future memory-order violations.
Must also predict:
– when one load depends on multiple stores, and
– when multiple loads depend on one store.
Chrysos, G. Z. and Emer, J. S., “Memory Dependence Prediction using Store Sets,” International Symposium on Computer Architecture (ISCA), 1998.

Store Set Definition
Given a load L, the store set of L is the set of all stores that L has ever depended upon.
Ideally, any time a store-load dependence is detected, the store is added to the load’s store set. To make a prediction, the store set of the load is searched for all uncommitted younger stores.
This ideal scheme is too expensive: we need an approximation.
(Chrysos and Emer, ISCA 1998)

Implementation of Store Sets Memory Dependence Prediction
Both loads and stores have entries in the Store Set ID Table (SSIT).
(Chrysos and Emer, ISCA 1998)

Store Set Examples
(The figures showed the Store Set ID Table, SSIT, indexed by PC, and the Last Fetched Store Table, LFST, indexed by store-set ID.)
Multiple loads depend on one store: loads j (add1) and k (add2) and store i (add3) map to the same store-set ID in the SSIT, which points to a single LFST entry.
One load depends on multiple stores: stores i (add2) and j (add3) and load k (add1) share the same store-set ID.
Multiple loads depend on multiple stores: stores i (add2) and j (add3) and loads k (add1) and l (add4). There is a conflict between the LFST entries associated with i and l: the winner is the entry with the smaller index in the SSIT, and the loser is made to point to the winner’s entry.
Baer, p. 192
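A minimal model of the SSIT/LFST pair (illustrative and heavily simplified relative to the Chrysos-Emer hardware, which uses fixed-size, PC-hashed tables; here plain dictionaries stand in, and PCs double as store-set IDs):

```python
class StoreSets:
    def __init__(self):
        self.ssit = {}   # PC -> store-set ID
        self.lfst = {}   # store-set ID -> last fetched, not yet issued store

    def on_violation(self, store_pc, load_pc):
        """A load collided with an earlier store: put both in one set."""
        ssid = self.ssit.get(store_pc)
        if ssid is None:
            ssid = self.ssit.get(load_pc)
        if ssid is None:
            ssid = min(store_pc, load_pc)   # smaller index wins, as above
        self.ssit[store_pc] = ssid
        self.ssit[load_pc] = ssid

    def store_fetched(self, store_pc, store_tag):
        ssid = self.ssit.get(store_pc)
        if ssid is not None:
            self.lfst[ssid] = store_tag     # later loads in the set wait on this

    def load_must_wait_on(self, load_pc):
        """Return the store the load must synchronize with, or None."""
        ssid = self.ssit.get(load_pc)
        return self.lfst.get(ssid) if ssid is not None else None
```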

Evaluating Load Speculation
The performance benefit of load speculation depends on:
– the speculation miss rate
– the cost of misspeculation recovery.
Baer, p. 194

Evaluating Load Speculation – Terminology
Conflicting load: at the time the load is ready to issue, there is a previous store in the instruction window whose operand address is unknown.
Colliding load: the load depends on one of the stores with which it conflicts.
Baer, p. 194

Evaluating Load Speculation – Typical Measurements
In a 32-entry load-store window:
– 25% of loads are non-conflicting.
– Of the 75% conflicting loads, only 10% actually collide.
In larger windows, the percentage of non-conflicting loads increases and the percentage of colliding loads decreases.
Baer, p. 194
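The figures above combine as follows:

```python
conflicting = 1.0 - 0.25        # 75% of loads conflict
colliding = conflicting * 0.10  # 10% of the conflicting loads collide
# i.e., only about 7.5% of all loads actually collide, which is why
# optimistic speculation pays off on average.
```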

Back-End Optimizations
Branch prediction: “a must.”
Load speculation (loads bypassing stores): “important,” because other instructions depend on the load.
Prediction of load latency: “common,” to hide the load latency in the cache hierarchy.
Baer, p. 195

Other Back-End Optimizations
Value prediction: predict the value that an instruction will compute (may be restricted to the values loaded by loads).
Critical instructions: predict which instructions are on the critical path.
Baer p.

Clustered Microarchitectures Baer p. 201

Back-End Limitations to m (the issue width)
Large windows: a large m requires large windows, which are expensive in hardware and power dissipation.
Many functional units: many (long) buses, which affect forwarding.
Centralized resources (e.g., the register file): large resources with many ports.
Baer, p. 201

Definition of a Cluster A cluster is formed by: – A set of functional units – A register file – An instruction window (or reservation stations) Baer p. 201

Clustered Microarchitecture Baer p. 202

Register File Replication
A copy of the register file in each cluster:
– Works with a small number of clusters.
– Can use a crossbar switch for interconnection.
– Example (Alpha 21264): the integer unit is split into two clusters; each cluster has a full copy of the 80 registers.
Baer, p. 202

Changes Because of Clustering
Front end: steer each instruction to the window of a cluster.
– static: compile-time decision
– dynamic: decided by hardware at runtime
Back end:
– Copy results into the registers of other clusters.
– Intercluster latency affects wakeup and select.
Baer, p. 202

Effect of Clustering on Performance
Latency to forward results between clusters.
Sensitive to the load balancing between clusters.
Conflicting goals:
– keep producers and consumers of data in the same cluster
– balance the workload.
Baer, p. 202

Distributed Register Files
Steering affects renaming. Assume that an instruction a is assigned to cluster ci:
– A free register from ci will be used for the result of a.
– If an operand of a is produced by an instruction b in a cluster cj, what needs to be done?
1. Another free register of ci is assigned to this operand.
2. A copy instruction is inserted in cj immediately after b.
3. The copy is kept in ci for use by other instructions.
Baer, p. 203

Clustered microarchitectures can be seen as a step in the evolution from monolithic processors to multiprocessors.

Chapter Summary: the back end is important for performance.
– Tomasulo Algorithm
– Centralized/Distributed/Hybrid windows
– Wakeup/Select steps
– Scheduling: critical instructions first
– Loads: bypassing stores, forwarding values, speculating on the absence of dependences with stores
– Clustering to reduce wiring complexity