Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters J. Nelson Amaral.

1 Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters J. Nelson Amaral

2 Tomasulo Algorithm

3 IBM 360/91 Floating Point Arithmetic Unit. Tomasulo Algorithm: a reservation station for each functional unit. Each entry has a Free/Occupied bit and, per operand, a flag: Flag = on → Data = value; Flag = off → Data = tag, a pointer to the ROB entry that will store the result. Baer, p. 97

4 Decode-Rename Stage. No reservation station available? Structural hazard: stall incoming instructions. No free ROB entry? Structural hazard: stall incoming instructions. Otherwise, assign a reservation station and the tail of the ROB to the instruction. Baer p. 97

5 Dispatch Stage. For each source operand, check the map: if it maps to a logical register, forward the value to the reservation station (RS) and set ReadyBit(RS) ← 1. If it maps to a ROB entry, check that entry's flag: if the value is ready, forward it to the RS and set ReadyBit(RS) ← 1; otherwise forward the ROB tag to the RS and set ReadyBit(RS) ← 0. Map the result register to a tag, enter the tag into the RS, enter the instruction at the tail of the ROB, and set ResultFlag(tail of ROB) ← 0. Baer p. 98

6 Issue Stage. Are both flags in the RS on? If yes, and the functional unit is not stalled (waiting for the CDB, the Common Data Bus), issue the instruction to the functional unit to start execution. If multiple functional units of the same type are available, use a scheduling algorithm. Baer p. 98

7 Execute. On the last cycle of execution, once ownership of the CDB is obtained, broadcast the result and its associated tag. If multiple functional units request ownership of the Common Data Bus (CDB) on the same cycle, a hardwired priority protocol picks the winner. The ROB stores the result in the entry identified by the tag and sets the corresponding ReadyBit; RSs holding the same tag store the result and set the corresponding flag. Baer p. 98

8 Commit Stage. Is there a result at the head of the ROB? If yes, store the result in the logical register and delete the ROB entry. Baer p. 97
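The decode/dispatch, broadcast, and issue steps above can be sketched in a few lines of Python. This is an illustrative toy, not the IBM 360/91 hardware: class and field names (RSEntry, rob_map, and so on) are invented, and everything real hardware does in parallel happens sequentially here.

```python
# Toy sketch of a Tomasulo reservation-station entry and the
# dispatch/broadcast steps described above.

class RSEntry:
    def __init__(self):
        self.busy = False           # Free/Occupied bit
        self.flag = [False, False]  # on -> operand slot holds a value
        self.oper = [None, None]    # value if flag on, ROB tag if off
        self.tag = None             # ROB entry that will hold the result

def dispatch(rs, rob_map, srcs, dest_tag):
    """Fill a reservation station from the register map (Dispatch stage)."""
    rs.busy, rs.tag = True, dest_tag
    for i, src in enumerate(srcs):
        m = rob_map.get(src)
        if m is None:               # value is in the logical register file
            rs.flag[i], rs.oper[i] = True, ("val", src)  # stand-in value
        else:                       # still being computed: store the ROB tag
            rs.flag[i], rs.oper[i] = False, m

def broadcast(rs_list, tag, value):
    """CDB broadcast: entries waiting on this tag capture the value."""
    for rs in rs_list:
        for i in range(2):
            if rs.busy and not rs.flag[i] and rs.oper[i] == tag:
                rs.flag[i], rs.oper[i] = True, value

def ready(rs):
    """Issue condition: both flags on."""
    return rs.busy and all(rs.flag)
```

For example, an instruction dispatched while its first operand is still being produced (tag E1) is not ready; it becomes ready when E1 is broadcast on the CDB.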

9 Operation Timings (assuming no dependencies). [Slide diagram: two timelines over cycles 0-7.] Both an addition and a multiplication pass through the same events: decoded, dispatched, issued, finish execution, broadcast, commit (if at the head of the ROB); the multiplication simply spends more cycles in execution. Baer, p. 98

10 Example
i1: R4 ← R0 * R2 # use reservation station 1 of multiplier
i2: R6 ← R4 * R8 # use reservation station 2 of multiplier
i3: R8 ← R2 + R12 # use reservation station 1 of adder
i4: R4 ← R14 + R16 # use reservation station 2 of adder

11 Example (i1-i4 from slide 10), time 8. [Slide diagram: Register Map, ROB, and reservation-station contents. R4 maps to ROB entry E1 (head of the ROB) and R6 to E2 (tail); i1 is executing in multiplier reservation station 1; i2 is dispatched in multiplier reservation station 2, waiting on tag E1 with the value of R8 already in place.]

12 Example, time 8. [Slide diagram: the ROB now holds entries E1-E4; i1 is ready to broadcast, and the register map now points R4 to E4.] "register R4, which was renamed as ROB entry E1 and tagged as such in the reservation station Mult2, is now mapped to ROB entry E4." (Baer, p. 102)

13 Example. [Slide diagram: i3 broadcasts; its result is written to its ROB entry for R8. i1 is ready to broadcast.] Assume the Adder has priority to broadcast.

14 Example. [Slide diagram: i4 broadcasts; its result is written to its ROB entry for R4. i1 is still ready to broadcast.] Assume the Adder has priority to broadcast.

15 Example. [Slide diagram: i1 broadcasts; its result is written to ROB entry E1 (R4) and captured by the reservation station holding i2.]

16 Example. [Slide diagram: completed entries commit in order and leave the ROB; E2 (R6, awaiting i2's result) is now at the head, and i2 is executing in the multiplier.]

17 IBM 360/91 – unveiled in 1966

18 Some variant of the Tomasulo algorithm is the basis for the design of all out-of-order processors. Baer p. 97

19 Data dependences between instructions. Where should these instructions wait? How do they become ready for issue? Several instructions get to the end of the front end and have to wait for operands. Baer p. 177

20 Wakeup Stage. Detects instruction readiness; ideally, m instructions are woken up on each cycle. Baer p. 177

21 Select Step, or Scheduling Step: arbitrates between multiple instructions vying for the same functional unit. Variations of first-come-first-served (or FIFO). Bypassing (or forwarding) of operands to units allows earlier selection. Critical instructions may have preference for selection. Baer p. 177

22 Out-of-Order Architectures Key idea: allow instructions following a stalled one to start execution out of order. A FIFO schedule is not a good idea! Where to store stalled instructions? Baer p. 178

23 Two Extreme Solutions. Tomasulo: a separate reservation station for each functional unit (distributed window), e.g., the IBM PowerPC series. Instruction window: a centralized reservation station for all functional units (centralized window), e.g., the Intel P6 architecture. Baer p. 178

24 A Hybrid Solution. Reservation stations are shared among groups of functional units (hybrid window). MIPS R10000: 3 sets of reservation stations, for address calculations, floating-point units, and load-store units. Baer p. 178

25 How does a design team select between a centralized, distributed, or hybrid window? What are the compromises? Baer p. 179

26 Window Design. Resource allocation: centralized is better, because static partitioning of resources is worse than dynamic allocation. Large windows: speed and power come into play. Baer p. 179

27 Two-Step Instruction Issue. Wakeup: the instruction is ready for execution. Select: the instruction is assigned to an execution unit.

28 Wakeup Step. [Slide diagram: f functional units connected to w window entries.] Baer p. 180

29 Window entry with buses from 8 exec units

30 Wakeup Step. We need one bus from each functional unit to each window entry, and two comparators (one per source operand) for each functional unit in each window entry, so 2fw comparators in total. If we separate the functional units and window slots into two equal-size groups, each group needs only 2(f/2)(w/2) = fw/2 comparators, i.e., fw in total, half as many. We will also need fewer (shorter) buses from units to slots. Baer p. 180
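The comparator-counting argument above is simple enough to check with a few lines of Python (the function name and grouping parameter are ours):

```python
# Comparator count for the wakeup step: each window entry compares
# both of its source tags against every result bus, so a single
# group of f units and w entries needs 2*f*w comparators.

def comparators(f, w, groups=1):
    """Units and entries split evenly into `groups` clusters;
    each cluster only snoops its own units and entries."""
    return groups * 2 * (f // groups) * (w // groups)

# 8 units and a 64-entry window: 1024 comparators;
# splitting into two equal groups halves the total.
assert comparators(8, 64) == 1024
assert comparators(8, 64, groups=2) == 512
```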

31 Select Step. Priority encoder: a circuit that receives several requests and issues one grant. Woken-up instructions vying for the same unit send requests; priority is related to position in the window. A smaller window means a smaller priority encoder. Baer p. 181
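A priority encoder of this kind can be sketched as a scan over the window that grants the first requester, under the (assumed, illustrative) convention that a lower window index means an older instruction and hence higher priority:

```python
def select(requests):
    """Priority encoder sketch: grant the lowest-indexed (oldest)
    requesting window entry; return None if nobody requests."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None

# Entry 1 and entry 2 both request; the older entry 1 wins.
assert select([False, True, True]) == 1
assert select([False, False, False]) is None
```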

32 When should a centralized window be replaced by a distributed or hybrid one? When the wakeup and select steps are on the critical path. The threshold appears to be windows of around 64 entries on a 4-wide superscalar processor. Baer p. 182

33 Intel Pentium 4: 2 large windows, 2 schedulers per window. Intel Pentium III and Intel Core: smaller centralized window. AMD Opteron: 4 sets of reservation stations. Baer p. 182

34 Relation between Select and Wakeup. Example: i: R51 ← R22 + R33; i+1: R43 ← R27 - R51. The name given to the result of instruction i (R51) must be broadcast as soon as instruction i is selected. Broadcasting the tag of R51 wakes up instruction i+1. For single-cycle-latency instructions, the start of execution is too late to broadcast the tag. Baer p. 183

35 Speculative Wakeup and Select. Example: i: R51 ← load(R22); i+1: R43 ← R27 - R51; i+2: R35 ← R51 + R28. In this case the tag of the destination of instruction i is broadcast. Instructions i+1 and i+2 are speculatively woken up and selected based on a cache-hit latency. In the case of a cache miss, all dependent instructions that have been woken up and selected must be aborted. Baer p. 183

36 Speculative Selection and the Reservation Stations. An instruction must remain in a reservation station after it is scheduled: a bit indicates that the instruction has been selected, and the station is freed once it is certain that the instruction's selection is no longer speculative. Windows are large in comparison with the number of functional units, to accommodate many instructions in flight, some speculatively. Baer p. 183

37 Integrated Register File vs. Tomasulo Reservation Stations. What happens upon selection of an instruction? [Slide diagram: with reservation stations, the opcode and operands are read out of the reservation station into the functional unit; with an instruction window, the operands are read from the physical register file.] Baer p. 183

38 The Complexity of Bypassing. Example: i: R51 ← R22 + R33; i+1: R43 ← R27 - R51. Functional unit A computes i; functional unit B computes i+1. The output of A must be forwarded to B, bypassing storage. Baer p. 183

39 The Complexity of Bypassing. Same example, but now functional unit A computes both i and i+1: the bypass must forward A's output back to A's own input. The hardware has to implement both buses. Baer p. 183

40 The Complexity of Bypassing. We also need buses to forward the output of B. In general, given k functional units we may need k² buses. Buses become long to avoid crossing each other. Forwarding may limit the number of functional units in a processor, and may need more than one cycle to complete. Baer p. 184

41 Load Speculation Load Address Speculation – Used for data prefetching Memory dependence prediction – Used to speculate data flow from a store to a subsequent load. Baer p. 185

42 Store Buffer: a circular queue. An entry is allocated when the store instruction is decoded and removed when the store is committed; the buffer keeps the data for stores that have not yet committed. Baer p. 185

43 States of a Store Buffer Entry: AV (available), AD (address known), RE (result and address known), CO (committed). An entry moves AV → AD after address computation (the data to be stored may still be being computed by another instruction), AD → RE when the data arrives, RE → CO when the store instruction reaches the top of the ROB, and CO → AV once the data is written to cache. What happens to the store buffer on a branch misprediction? Baer p. 185
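The entry life cycle above can be written down as a small state table. The event names are invented for illustration; the slide names only the four states:

```python
# Sketch of the store-buffer entry life cycle: AV -> AD -> RE -> CO -> AV.
# Unknown (state, event) pairs leave the state unchanged.
TRANSITIONS = {
    ("AV", "address_computed"): "AD",
    ("AD", "data_ready"):       "RE",
    ("RE", "reaches_rob_head"): "CO",
    ("CO", "written_to_cache"): "AV",
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)

# A full trip around the life cycle returns the entry to AV (free).
s = "AV"
for e in ["address_computed", "data_ready", "reaches_rob_head", "written_to_cache"]:
    s = step(s, e)
assert s == "AV"
```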

44 Handling the Store Buffer on Branch Mispredictions and Exceptions. Entries preceding the mispredicted branch are in the COMMIT state and must be written to cache; entries following the misprediction become AVAILABLE. Exceptions are similar: the COMMIT entries must be written to cache before handling the exception. Baer p. 186

45 Load Instructions and Load Speculation Baer p. 187

46 Load/Store Window Implementation. Most restricted: a load/store window (FIFO). Loads and stores are inserted in program order and removed in the same order, at most one per cycle, with a single window for both loads and stores. Baer p. 187

47 Load Bypassing. Compare the address of the load with all addresses in the store buffer. If there is no match, the load can proceed (load bypassing). If the operand address of any entry in the store buffer is not yet computed, the load cannot proceed. If there is a match to an entry that is not committed, the load cannot access the cache. The relevant "match" is the last match in program order. This requires an associative search of the operand addresses in the store buffer. Baer p. 187

48 Load Forwarding. If a load matches a store-buffer entry AND the result is available for that entry (the entry is in the RE or CO state), then the result can be sent to the register specified by the load. If the match is with an entry in the AD state, the load waits for the entry to reach the RE state.
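Slides 47 and 48 together describe one decision per load, which can be sketched as a function over the store buffer. This is a toy under invented conventions: entries are (state, address, value) tuples ordered oldest to youngest, and the extra state name "PENDING" stands in for an entry whose address is not yet computed:

```python
def load_action(load_addr, store_buffer):
    """Decide what a load may do, per the bypassing/forwarding rules.
    store_buffer entries: (state, address, value), state in
    {"PENDING", "AD", "RE", "CO"}; address is None while PENDING."""
    last_match = None
    for state, addr, value in store_buffer:
        if addr is None:
            return ("stall", None)     # some store address unknown
        if addr == load_addr:
            last_match = (state, value)  # keep last match in program order
    if last_match is None:
        return ("bypass", None)        # no match: access the cache
    state, value = last_match
    if state in ("RE", "CO"):
        return ("forward", value)      # load forwarding
    return ("wait", None)              # match in AD: wait for the data
```

For example, a load that matches a store in RE state gets the store's data forwarded, while a match in AD state forces it to wait.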

49 Load Speculation in Out-of-Order Architectures: the Dynamic Memory Disambiguation Problem. Loads are issued speculatively, ahead of preceding stores in program order. How do we ensure that data dependences are not violated?

50 Three Approaches. Pessimistic: wait until it is certain that the load can proceed (as in load bypassing and forwarding). Optimistic: the load always proceeds speculatively; a recovery mechanism is needed. Dependence prediction: use a predictor to decide whether or not to speculate, trying to have fewer recoveries.

51 Example (Baer p. 188). i1: st R1, memadd1 ⋅⋅⋅ i2: st R2, memadd2 ⋅⋅⋅ i3: ld R3, memadd3 ⋅⋅⋅ i4: ld R4, memadd4, with a true dependence from i2 to i3. Pessimistic: i3 cannot issue until i2 has computed its result (i2 must be at least in RE); i4 proceeds once i1 and i2 are in AD.

52 Example (same code as slide 51; Baer p. 189). Optimistic: i3 and i4 issue as soon as possible, and load-buffer entries are created for them. When a store reaches CO, its address is compared associatively with the load-buffer entries.

53 Example (Baer p. 189). [Slide diagram: the store buffer holds i1 (AD, memadd1) and i2 (AD, memadd2); the load buffer holds i3 (memadd3) and i4 (memadd4), each with a bit set to indicate the load is speculative.] When i1 reaches CO, nothing happens because there is no match in the load buffer.

54 Example (Baer p. 189). [Slide diagram: i1 and i2 are now in CO; i2's address matches i3's load-buffer entry.] i3 has to be reissued, and i4 has to be reissued because it is after i3 in program order; some implementations only reissue instructions that depend on i3.
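The optimistic scheme's recovery step, as walked through in slides 52-54, can be sketched as a check performed when a store commits. The simple reissue-everything-younger policy is the one the slide describes; the entry layout and names are invented:

```python
# Optimistic load speculation: loads issue immediately and record a
# load-buffer entry (id, address, speculative-bit). When a store
# commits, its address is checked against the speculative loads.

def store_commit(store_addr, load_buffer):
    """Return the ids of loads that must be reissued: the first
    speculative load whose address matches, plus every younger load
    (the simple policy; some implementations reissue only dependents)."""
    for i, (lid, addr, spec) in enumerate(load_buffer):
        if spec and addr == store_addr:
            return [entry[0] for entry in load_buffer[i:]]
    return []

lb = [("i3", 0x30, True), ("i4", 0x40, True)]
assert store_commit(0x50, lb) == []            # no match: nothing happens
assert store_commit(0x30, lb) == ["i3", "i4"]  # i3 matches; i4 is younger
```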

55 Example (Baer p. 189). Dependence prediction: with correct predictions, i4 can proceed and we avoid reissuing i3.

56 Motivation: Optimistic Memory dependencies are rare: Less than 10% of loads depend on an earlier store Baer p. 190

57 Motivation: Dependence Prediction Load misspeculations are expensive and predictors can reduce them. What strategy should we use for predicting profitable speculations? Baer p. 190

58 Simple Strategy. Memory dependencies are infrequent, so predict that all loads can be speculated. If a load L is misspeculated, all subsequent instances of L must wait. We need a bit to remember this. Where should this bit be stored? Baer p. 190

59 Simple Strategy (cont.). A single prediction bit P is associated with the instruction in the cache. When the load instruction is brought into the cache → P = 1. The load is misspeculated → P = 0. The line is evicted from the cache and reloaded → P = 1. Strategy used in the DEC Alpha 21264. Baer p. 190
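The single-bit policy can be sketched directly. This illustrates the rules on the slide, not the Alpha 21264's actual implementation: a dictionary stands in for the bit stored alongside the cached instruction.

```python
# One-bit load speculation predictor: P=1 means "speculate this load";
# a misspeculation clears P until the line is evicted and refetched.

class OneBitPredictor:
    def __init__(self):
        self.p = {}                 # load PC -> prediction bit

    def fetch(self, pc):            # load brought into the cache -> P = 1
        self.p[pc] = 1

    def may_speculate(self, pc):
        return self.p.get(pc, 1) == 1

    def misspeculated(self, pc):    # misspeculation -> P = 0
        self.p[pc] = 0

    def evicted(self, pc):          # eviction drops the bit; refetch resets it
        self.p.pop(pc, None)
```

For instance, a load at PC 0x400 speculates until it is caught misspeculating, then waits until its cache line is evicted and refetched.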

60 Principle Behind Load Prediction. "static store-load instruction pairs that cause most of the dynamic data mispredictions are relatively few and exhibit temporal locality." Moshovos A., Breach S. E., Vijaykumar T. N., Sohi G. S., "Dynamic Speculation and Synchronization of Data Dependences," International Symposium on Computer Architecture (ISCA), 1997, Denver, CO, USA.

61 Ideal Load Speculation (Moshovos et al., ISCA 1997). Avoids misspeculation while allowing loads to execute as early as possible: loads with no true dependences execute without delay; a load with a true dependence executes as soon as the store that produces the data commits.

62 A Real Predictor (Moshovos et al., ISCA 1997): (i) dynamically identify store-load pairs that are likely to be data dependent; (ii) provide a synchronization mechanism for instances of these dependences; (iii) use this mechanism to synchronize the store and the load.

63 Load Predictor Table. The table is hashed on the load PC and holds saturating counters with four states: 00 strong nospeculate, 01 weak nospeculate, 10 weak speculate, 11 strong speculate; incrementing a counter moves it toward strong speculate. Each load instruction has a loadspec bit. A load-buffer entry holds a tag, op.address (the memory address of the operand), a spec.bit (is this a speculative load?), and an update.bit (should the predictor be updated at commit/abort?). Baer p. 190
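The saturating counter at the heart of this table behaves as follows (state encoding as on the slide; the reset-on-abort rule comes from the commit-stage slides that follow):

```python
# Two-bit saturating counter for load speculation:
# 0 = strong nospeculate, 1 = weak nospeculate,
# 2 = weak speculate,    3 = strong speculate.

def increment(c):
    return min(c + 1, 3)        # saturates at strong speculate

def abort(_c):
    return 0                    # misspeculation: strong nospeculate

def predict_speculate(c):
    return c >= 2               # weak/strong speculate predict "speculate"
```

A load in weak nospeculate (01) does not speculate, but one successful increment moves it to weak speculate (10) and it will speculate next time.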

64 Load/Decode Stage Set loadspec bit according to value of counter associated with the load PC Baer p. 190

65 After the Operand Address is Computed. Are there uncommitted younger stores? No: enter the load in the load buffer with spec.bit = 0 and update.bit = 0, and issue the cache access. Yes, and the loadspec bit is on: enter the load with spec.bit = 1 and update.bit = 0, and issue the cache access. Yes, and the loadspec bit is off: enter the load with spec.bit = 0 and update.bit = 1, and wait (as in the pessimistic solution). Baer p. 190

66 Store Commit Stage. For all matches in the load buffer: if spec.bit is off, set update.bit ← 1 (it was correct not to speculate, and the load should keep not speculating in the future); if spec.bit is on, the load aborts: predictor ← strong nospeculate, and we recover from the misspeculated load. Baer p. 191

67 Load Commit Stage. If spec.bit is on: increment the saturating counter (speculating was correct). If spec.bit is off: if update.bit is off, increment the saturating counter (we would like to speculate in the future); if update.bit is on, predictor ← strong nospeculate. Baer p. 191

68 Store Sets Baer p. 191

69 Motivation for Store Sets. The past is a good predictor of future memory-order violations. Must also predict when one load is dependent on multiple stores and when multiple loads depend on one store. [Slide diagram: stores A, B, C with arrows to loads D, E, F.] Chrysos, G. Z. and Emer, J. S., "Memory Dependence Prediction using Store Sets," International Symposium on Computer Architecture, 1998, pp. 142-153.

70 Store Set Definition (Chrysos & Emer, ISCA 1998). Given a load L, the store set of L is the set of all stores that L has ever depended upon. Ideally, any time a store-load dependence is detected, the store is added to the load's store set; to make a prediction, the load's store set is searched for all uncommitted younger stores. Too expensive! We need an approximation.

71 Implementation of Store Sets: Memory Dependence Prediction. Both loads and stores have entries in the Store Set ID Table (SSIT), and each store set has an entry in the Last Fetched Store Table (LFST). (Chrysos & Emer, ISCA 1998)

72 Store Set Examples: multiple loads depend on one store. [Slide diagram: loads j (add1) and k (add2) and store i (add3) index the SSIT; all three point to the same LFST entry.] Baer p. 192

73 Store Set Examples: one load depends on multiple stores. [Slide diagram: stores i (add2) and j (add3) and load k (add1) map to the same store set in the SSIT.] Baer p. 192

74 Store Set Examples: multiple loads depend on multiple stores. [Slide diagram: stores i (add2) and j (add3) and loads k (add1) and l (add4) in the SSIT.] We have a conflict between the LFST entries associated with i and l: the winner is the entry with the smaller index in the SSIT, and the loser is made to point to the winner's entry. Baer p. 192
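The SSIT/LFST mechanism of slides 71-74 can be sketched with two dictionaries. This is a simplification: the real tables are finite and indexed by hashed PCs, and the merge rule here (an existing set id wins, otherwise the smaller PC) only approximates the slide's smaller-SSIT-index rule.

```python
# Sketch of store-set memory dependence prediction:
# SSIT maps a load/store PC to a store-set id;
# LFST maps a set id to the most recently fetched store of that set.

ssit = {}   # PC -> store-set id
lfst = {}   # store-set id -> last fetched store instance

def on_violation(load_pc, store_pc):
    """A load/store pair collided: put both PCs in the same store set."""
    sid = ssit.get(load_pc, ssit.get(store_pc, min(load_pc, store_pc)))
    ssit[load_pc] = ssit[store_pc] = sid

def fetch_store(pc, instance):
    """A fetched store in a known set becomes the set's last fetched store."""
    sid = ssit.get(pc)
    if sid is not None:
        lfst[sid] = instance

def fetch_load(pc):
    """Return the store instance this load must wait on, if any."""
    sid = ssit.get(pc)
    return lfst.get(sid) if sid is not None else None
```

After a violation pairs a load with a store, later dynamic instances of that load are made to wait on the set's last fetched store.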

75 Evaluating Load Speculation. The performance benefits from load speculation depend on: the speculation miss rate, and the cost of misspeculation recovery. Baer p. 194

76 Evaluating Load Speculation - Terminology Conflicting load: at the time the load is ready to issue there is a previous store in the instruction window whose operand address is unknown. Colliding load: the load is dependent on one of the stores with which it conflicts. Baer p. 194

77 Evaluating Load Speculation – Typical measurements. In a 32-entry load-store window, 25% of loads are non-conflicting; of the 75% conflicting loads, only 10% actually collide. In larger windows, the percentage of non-conflicting loads increases and the percentage of colliding loads decreases. Baer p. 194

78 Back-End Optimizations. Branch prediction: "a must." Load speculation (loads bypassing stores): "important," because other instructions depend on the load. Prediction of load latency: "common," to hide load latency in the cache hierarchy. Baer p. 195

79 Other Back-End Optimizations. Value prediction: predict the value that an instruction will compute; may be restricted to the values loaded by loads. Critical instructions: predict which instructions are on the critical path. Baer p. 196-201

80 Clustered Microarchitectures Baer p. 201

81 Back-End Limitations to m. Large windows: a large m requires large windows, which are expensive in hardware and power dissipation. Many functional units: many (long) buses, which affect forwarding. Centralized resources (e.g., the register file): large structures with many ports. Baer p. 201

82 Definition of a Cluster A cluster is formed by: – A set of functional units – A register file – An instruction window (or reservation stations) Baer p. 201

83 Clustered Microarchitecture Baer p. 202

84 Register File Replication A copy of the register file in each cluster – Small number of clusters – Can use crossbar switch for interconnection – Example (Alpha 21264): integer unit is two clusters; each cluster has a full copy of the 80 registers Baer p. 202

85 Changes Because of Clustering. Front end: steer each instruction to the window of a cluster, either statically (a compile-time decision) or dynamically (by hardware at runtime). Back end: copy results into the registers of other clusters; intercluster latency affects wakeup and select. Baer p. 202

86 Effect of Clustering on Performance. Latency to forward results between clusters; sensitive to load balancing between clusters. Conflicting goals: keep producers and consumers of data in the same cluster, yet balance the workload. Baer p. 202

87 Distributed Register Files. Steering affects renaming. Assume that an instruction a is assigned to cluster c i; a free register from c i will be used for the result of a. If an operand of a is produced by an instruction b in a cluster c j, what needs to be done? 1. Another free register of c i is assigned to this operand. 2. A copy instruction is inserted in c j immediately after b. 3. The copy is kept in c i for use by other instructions. Baer p. 203
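Steps 1-3 above can be sketched as a renaming helper. The encoding of the copy instruction and all names here are invented for illustration:

```python
# Renaming with distributed register files: if the producer lives in
# another cluster, allocate a local register and emit a copy
# instruction in the producer's cluster.

def rename_operand(consumer_cluster, producer_cluster, producer_reg,
                   free_regs, copies):
    """free_regs: cluster -> list of free physical registers.
    copies: list collecting emitted copy instructions.
    Returns the register the consumer should read."""
    if producer_cluster == consumer_cluster:
        return producer_reg                      # same cluster: no copy
    local = free_regs[consumer_cluster].pop()    # step 1: new local register
    copies.append(("copy", producer_cluster,     # step 2: copy inserted
                   producer_reg,                 #   in the producer's cluster
                   consumer_cluster, local))
    return local                                 # step 3: kept for reuse
```

In a real design the (producer register → local copy) mapping would also be remembered so later consumers in the same cluster reuse the copy rather than emitting a new one.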

88 Clustered microarchitectures can be seen as a step in the evolution from monolithic processors to multiprocessors.

89 Chapter Summary: Back end is important for performance – Tomasulo Algorithm – Centralized/Distributed/Hybrid Windows – Wakeup/Select steps – Scheduling: Critical instructions first – Loads: Bypassing stores Forwarding values Speculating on the absence of dependences with stores – Clustering to reduce wiring complexity

