
February 18, 2004 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Slide 1 of 23 Understanding Scheduling Replay Schemes Ilhyun Kim Mikko H. Lipasti PHARM Team.




1 February 18, 2004 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Slide 1 of 23 Understanding Scheduling Replay Schemes Ilhyun Kim Mikko H. Lipasti PHARM Team University of Wisconsin-Madison

2 Speculation vs. Recovery
- All speculative techniques share a few common requirements
  - Some mechanism for generating predictions
  - Microarchitectural support for realizing the benefits of predictions
  - Recovery for mispredictions
- Relatively little focus on recovery
  - Prediction and speculative techniques have been discussed extensively
  - Vague descriptions like refetch, squash, reissue and replay
- Recovery for speculative scheduling: scheduling replay
  - What are the issues in scheduling replay?
  - What functionalities should it provide?
  - What are the potential limitations?

3 Related Work
- Selective re-issue
  - Initially proposed for value prediction
  - Assumed by many data-speculation techniques
  - Detailed mechanics were not fully described and/or developed
  - Generic dependence vector scheme [Sazeides, Ph.D. thesis]
- Scheduling replay
  - Alpha 21264: squashing replay
  - Pentium 4: selective replay based on replay queue
  - Evaluation of replay schemes [Morancho et al.]
  - Scheduling miss prediction [Yoaz et al.]
- Our work
  - Provides a framework for developing & analyzing replay schemes
  - Proposes token-based selective replay

4 Outline
- Speculative Scheduling & Wavefront Propagation
- Parallel Verification
- Scheduling Replay Schemes
- Token-based Selective Replay
- Performance Evaluation
- Conclusions

5 Speculative Scheduling Overview
- Original Tomasulo's algorithm: atomic sched / exe
- Pipelined scheduling without speculation (Fetch, Decode, Sched, Disp, RF, Exe, WB, Commit) cannot achieve max ILP
- Speculative scheduling: speculative issue, then verify scheduling decisions
- Source of scheduling misses
  - Load instructions: D-cache / DTLB misses, store-to-load aliasing
  - Performance / complexity optimization techniques
[Figure: three pipeline diagrams: Tomasulo (Fetch, Decode, Sched / Exe, WB, Commit), pipelined scheduling, and speculative scheduling]

6 Speculative Execution Wavefront
- Speculative execution wavefront
  - Initiated by a set of wakeup and select operations that link data dependences
  - Speculative "image" of execution
- Real execution wavefront
  - The scheduled image is projected to the EXE stage, initiating the real execution wavefront
  - Serves to verify the scheduled execution
  - Compares the scheduled and actual execution latencies
- Delay between the two wavefronts
  - Verification runs behind the speculative execution wavefront
  => The current execution verifies scheduling decisions made in the past
[Figure: pipeline (Fetch, Dec, Ren, Que, Sched, Disp, RF, Exe, WB, Commit) marking dependence linking, data linking, the speculative and real execution wavefronts, and the point where a cache miss is detected]

7 Speculative Execution Wavefront
- Speculative execution wavefront
  - Initiated by a set of wakeup and select operations that link data dependences
  - Speculative "image" of execution
- Real execution wavefront
  - The scheduled image is projected to the EXE stage, initiating the real execution wavefront
  - Serves to verify the scheduled execution
  - Compares the scheduled and actual execution latencies
- Serial verification
  - Triggers re-scheduling of directly dependent instructions
  - e.g. a scoreboard propagates poison bits along with data dependences
- Scoreboard is OK, but not enough
  => Hard to stop the invalid speculative execution wavefront
  - Verification and scheduling propagate at the same rate
  - The scheduler doesn't know which instructions depend on the miss
  - The scheduler keeps issuing instructions unnecessarily
[Figure: pipeline (Fetch, Dec, Ren, Que, Sched, Disp, RF, Exe, WB, Commit) with a scoreboard, marking dependence linking, data linking, and the speculative and real execution wavefronts]

8 Invalid Wavefront Propagation
[Figure: invalid wavefront propagation under serial vs. parallel verification, panels for parser and gap, annotated "max 836" and "max 157"]
- A load miss can propagate through 836 instruction levels!
  - Not bounded by the size of the instruction window (8-wide, 128 RUU)
- Total issue count goes up by 15% in parser (compared to parallel verification)
  - Average 10% in SPEC2K INT, worst 42% in mcf
- Negative impacts on performance and power
- Need a mechanism to stop invalid wavefront propagation

9 Parallel Verification
- Issued instructions are verified in parallel
  - Verification catches up with the invalid speculative execution wavefront
  - The scheduler does not trigger any further incorrect issue
  - Other independent instructions may be issued instead
=> Focus of this talk: parallel verification for scheduling replay
[Figure: pipeline (Fetch, Dec, Ren, Que, Sched, Disp, RF, Exe, WB, Commit) in which parallel verification terminates the speculative execution wavefront]
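
The gap between serial and parallel verification can be illustrated with a toy model. This is only a sketch under assumptions that are not from the talk: a single chain of dependents behind one missing load, one dependence level issued per cycle, and a hypothetical helper named `invalid_issues`.

```python
def invalid_issues(chain_len, verify_delay, parallel):
    """Count dependents of a missing load that issue with invalid operands.

    Toy model (an assumption, not the simulator used in the talk):
    the load issues at cycle 0, dependent k (1-based) issues speculatively
    at cycle k, and the miss is detected at cycle `verify_delay`.
    """
    invalid = 0
    for k in range(1, chain_len + 1):
        issue_cycle = k
        if parallel:
            # Every dependent is invalidated at once when the miss is detected.
            stop_cycle = verify_delay
        else:
            # Poison moves one level per cycle, so it reaches dependent k
            # only k cycles after detection and never overtakes the wavefront.
            stop_cycle = verify_delay + k
        if issue_cycle <= stop_cycle:
            invalid += 1
    return invalid


serial_count = invalid_issues(100, 6, parallel=False)
parallel_count = invalid_issues(100, 6, parallel=True)
```

In this model serial verification lets the whole chain issue invalidly, while parallel verification bounds the damage by the schedule-to-verify delay.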

10 Outline
- Speculative Scheduling & Wavefront Propagation
- Parallel Verification
- Scheduling Replay Schemes
- Token-based Selective Replay
- Performance Evaluation
- Conclusions

11 Requirements of Parallel Verification
- Propagation of scheduling verification should be FASTER than speculative execution wavefront propagation
  - Verification catches up with the invalid speculative wavefront
- Verification should be performed on the transitive closure of dependent instructions
  - No invalid wavefront slips through invalidation / recovery
- Ideal scheduling replay
  - All mis-scheduled dependent instructions are invalidated instantly
  - Independent instructions are unaffected (selective replay)
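
What "transitive closure of dependent instructions" means can be made concrete with a small graph walk. The sketch below is software only, not a hardware mechanism from the talk; the `consumers` map and the instruction names are made up for illustration.

```python
from collections import deque


def dependents_closure(consumers, missed):
    """Return every instruction that depends, directly or indirectly, on `missed`.

    `consumers` maps a producer to the instructions that read its result.
    """
    seen = set()
    work = deque([missed])
    while work:
        node = work.popleft()
        for child in consumers.get(node, ()):
            if child not in seen:
                seen.add(child)
                work.append(child)
    return seen


# Hypothetical window: LD -> ADD -> OR -> AND, plus an independent XOR -> BR.
consumers = {"LD": ["ADD"], "ADD": ["OR"], "OR": ["AND"], "XOR": ["BR"]}
to_replay = dependents_closure(consumers, "LD")
```

An ideal selective replay invalidates exactly `to_replay` and leaves `XOR` and `BR` untouched.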

12 Reducing Name Space for Dependence Tracking
- A naive way: the dependence vector scheme works, but...
  - Dependence vector size == the max number of loads in the window
  - Propagate full vectors to dependent instructions at e.g. rename time
  - Scalability issues (e.g. replay at any instruction boundary)
=> Approximation or conversion of the name space for precise dependence tracking into a smaller set
  - Reduce the number of bits in dependence vectors
- Faster verification => multi-level dependence tracking
[Figure: a scheduling miss is detected and each instruction asks "Am I dependent on the miss?"]
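
A minimal sketch of the naive dependence vector scheme, assuming one bit per load in the window and vectors merged at rename time. The function names, the `(op, dst, srcs)` tuple format, and the register names are hypothetical.

```python
def dependence_vectors(program):
    """Compute one dependence vector per instruction, in program order.

    `program` is a list of (op, dst, srcs) tuples. Bit i of a vector means
    "depends on the i-th load in the window".
    """
    reg_vec = {}          # register name -> vector of loads it depends on
    inst_vecs = []
    next_load_bit = 0
    for op, dst, srcs in program:
        vec = 0
        for src in srcs:
            vec |= reg_vec.get(src, 0)   # merge the source vectors
        inst_vecs.append(vec)
        if op == "LD":
            vec |= 1 << next_load_bit    # consumers also depend on this load
            next_load_bit += 1
        reg_vec[dst] = vec
    return inst_vecs


def must_replay(inst_vecs, load_bit):
    """Indices of instructions whose vector contains the missing load's bit."""
    return [i for i, v in enumerate(inst_vecs) if v & (1 << load_bit)]


program = [
    ("LD", "r1", ["r0"]),
    ("ADD", "r2", ["r1", "r3"]),
    ("OR", "r4", ["r5", "r6"]),
    ("LD", "r7", ["r2"]),
    ("AND", "r8", ["r7", "r4"]),
]
vecs = dependence_vectors(program)
```

The vector width grows with the number of loads in flight, which is the scalability problem the slide points at.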

13 Non-Selective Replay (aka "squashing" replay)
- Kill operands with non-zero-value timers
  - Assumes all operands awakened after the mis-scheduled instruction are incorrect
- Dependence tracking: wakeup order (imprecise)
- Invalidate & replay ALL instructions in the load shadow
[Figure: LD, ADD, OR, AND, BR moving through Sched, Disp, RF, Exe, Verify from the cache miss until the miss is resolved; wakeup logic of the OR instruction with tag, timer and ready fields for the left and right operands, tag bus, timer start, and a single kill wire]
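
The timer-based kill can be sketched as follows. This is an illustrative model, assuming a timer that starts at the schedule-to-verify delay on wakeup and counts down once per cycle; the `Operand` class is hypothetical.

```python
class Operand:
    """One source operand entry in the issue queue (illustrative model)."""

    def __init__(self):
        self.ready = False
        self.timer = 0

    def wakeup(self, verify_delay):
        # A tag match sets the ready bit and starts the timer.
        self.ready = True
        self.timer = verify_delay

    def tick(self):
        # Count down each cycle; zero means the wakeup has been verified.
        if self.timer > 0:
            self.timer -= 1

    def kill(self):
        # Kill wire asserted: operands still inside the load shadow
        # (non-zero timer) lose their ready bit.
        if self.timer > 0:
            self.ready = False
            self.timer = 0


old, young = Operand(), Operand()
old.wakeup(4)
for _ in range(4):
    old.tick()          # awakened long enough ago to be verified
young.wakeup(4)
young.tick()            # still in the load shadow
for operand in (old, young):
    operand.kill()
```

Only wakeup order is tracked, so `young` is invalidated whether or not it truly depends on the missing load.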

14 Delayed Selective Replay
- Invalidates all conservatively (same as non-selective replay)
- Samples the completion signal in the given issue slot at timer 0
- Selectively re-validates direct child instructions if no poison bit from scoreboard
  - Re-validating an invalidated source operand prevents further propagation
- Dependence tracking: wakeup order and position (imprecise)
[Figure: LD, ADD, OR, XOR, AND, BR, SUB moving through Sched, Disp, RF, Exe, Verify after a cache miss; wakeup logic with tag, ready and timer fields for the left and right operands, slot number, wakeup bus, kill wire, scoreboard, and a completion bus (one wire per issue slot)]

15 Position-based Selective Replay
- Ideal selective recovery
- Dependence tracking is managed in a matrix form
  - Column: load issue slot, row: pipeline stages
  - Shift down every cycle in sync with pipeline flow
  - Propagate matrices along with tag broadcast
  - Bit merge & shift
- Dependence tracking: 2-dimensional position (precise)
[Figure: ALU pipe and MEM pipe (Sched, Disp, RF, Exe, Verify) with LD, ADD, SLL, OR, AND, XOR carrying dependence matrices before and after a cache miss is detected; wakeup logic with tag and ready fields for the left and right operands, tag bus, kill bus (one wire per mem port), and dependence info bus (mem ports x depth)]
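
The matrix bookkeeping can be sketched in a few lines. This is a toy model under stated assumptions: each row is stored as an integer bit mask over memory ports, a 5-row depth is chosen to match the Sched, Disp, RF, Exe, Verify stages in the figure, and all function names are hypothetical.

```python
def load_issue(depth, port):
    """Matrix broadcast by a load issued from memory port `port`."""
    matrix = [0] * depth
    matrix[0] = 1 << port          # top row = the stage the load just entered
    return matrix


def shift_down(matrix):
    """Advance one cycle: every bit moves one pipeline stage down."""
    return [0] + matrix[:-1]


def merge(left, right):
    """Combine the matrices of two source operands (bitwise OR per row)."""
    return [a | b for a, b in zip(left, right)]


def killed(matrix, port):
    """Kill bus asserted for `port`: match against the last (verify) row."""
    return bool(matrix[-1] & (1 << port))


DEPTH = 5
ld = load_issue(DEPTH, port=1)
independent = [0] * DEPTH
add = merge(shift_down(ld), shift_down(independent))  # woken one cycle later
for _ in range(3):
    add = shift_down(add)          # the load reaches the verify stage
```

After four shifts in total the load's bit sits in the last row, so a miss on port 1 kills `add` while an instruction with an all-zero matrix is unaffected.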

16 Outline
- Speculative Scheduling & Wavefront Propagation
- Parallel Verification
- Scheduling Replay Schemes
- Token-based Selective Replay
- Performance Evaluation
- Conclusions

17 Limitations of Replay Schemes
- Performance scalability
  - Non-selective scheme replays independent instructions
  - Delayed selective replay creates bubbles in scheduling
- Complexity issue in position-based replay
  - Extra wires grow rapidly as the machine grows
  - A function of memory ports, issue width and pipeline depth
  - e.g. 50 to 196 extra wires when transitioning from 4-wide to 8-wide machines
- Incompatible with data-speculation techniques (e.g. value prediction)
  - Data-speculation techniques collapse true data dependences
  - Wakeup order or position no longer correlates to dependences

18 Overcoming the Limitations
- Source of the limitations
  - Dependence information propagates as a part of the scheduling or execution process
- Move dependence propagation out of scheduling logic
  - Track dependences in program order (i.e. in rename stage)
  - Similar to dependence vector scheme => requires a big name space
  - How to reduce the bits while providing precise dependence tracking?
=> Token-based selective replay
  - Tracks dependences only for the instructions likely to be mis-scheduled
  - Plant tokens in loads based on scheduling hit/miss prediction
  - Propagate the tokens to dependent instructions
  - Selectively recover instructions with the token
- Expensive backup recovery if token planting is incorrect
  - Squash & re-insert in program order (analogous to bpred recovery)

19 Token-based Selective Replay
- Pipeline structure: Fetch, Decode, Rename, Queue, Schedule, Exe, Verify, Commit
  - Sched miss predictor (indexed by PC) supplies sched miss confidence to the token allocator
  - Token allocation / deallocation and token propagation at Rename (high confidence? name space available?)
  - Selective replay for token heads
  - Non-selective replay for others: squash & reinsert instructions in program order
  - Deallocate tokens when retired
- Token allocation
  - Source register mapping from the rename table carries a dep vector with each physical reg ID (Src0, Src1)
  - The source dep vectors are merged alongside the conventional instruction / reg info
  - If a token is allocated, the new token ID is added to the merged vector
  - The new dep_vector goes back to the rename table and to the issue queue, with a head bit and token_ID
- # wires in kill bus = 2 x (# tokens)
[Figure: example with Src0 dep vector 0 0 1 1 0 1 0 0 and Src1 dep vector 1 0 1 0 0 1 0 0 merged into 1 0 1 1 0 1 0 0; with head = 1 and token_ID = 111 the new dep_vector is 1 0 1 1 0 1 0 1; issue queue entry with tag and ready fields for the left and right operands, tag bus, and kill bus]
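
Token planting and propagation at rename can be sketched as below. This is a simplified model, not the authors' implementation: the `TokenRenamer` class, its method names, and the string return values of `recover` are hypothetical, and the miss prediction is passed in as a plain flag.

```python
class TokenRenamer:
    """Rename-stage token bookkeeping (illustrative model)."""

    def __init__(self, num_tokens):
        self.free = list(range(num_tokens))
        self.reg_vec = {}              # register -> token dependence vector

    def rename(self, dst, srcs, is_load=False, predicted_miss=False):
        vec = 0
        for src in srcs:
            vec |= self.reg_vec.get(src, 0)    # merge source vectors
        token = None
        if is_load and predicted_miss and self.free:
            token = self.free.pop(0)           # this load becomes a token head
        dst_vec = vec
        if token is not None:
            dst_vec |= 1 << token
        self.reg_vec[dst] = dst_vec
        return vec, token

    def release(self, token):
        # Token head retired: clear its bit everywhere and recycle the token.
        for reg in self.reg_vec:
            self.reg_vec[reg] &= ~(1 << token)
        self.free.append(token)


def recover(token, inst_vecs):
    """Selective replay for token heads, backup recovery otherwise."""
    if token is None:
        return ("squash_and_reinsert", None)
    hit = [i for i, v in enumerate(inst_vecs) if v & (1 << token)]
    return ("selective", hit)


renamer = TokenRenamer(num_tokens=1)
v0, t0 = renamer.rename("r1", ["r0"], is_load=True, predicted_miss=True)
v1, t1 = renamer.rename("r2", ["r1"])
v2, t2 = renamer.rename("r3", ["r9"], is_load=True, predicted_miss=True)
v3, t3 = renamer.rename("r4", ["r3"])
inst_vecs = [v0, v1, v2, v3]
```

With a single token, the second predicted-miss load gets none, so a miss on it falls back to squash and re-insert, which is the out-of-tokens situation the talk later reports for mcf.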

20 Machine Parameters
- SimpleScalar-Alpha-based 4- and 8-wide OoO
  - 4-wide: 128 ROB, 64 LSQ, 64 IQ, 2 memory ports
  - 8-wide: 256 ROB, 128 LSQ, 128 IQ, 4 memory ports
  - Speculative scheduling, 6-cycle schedule-to-verify delay
  - 32K IL1 (2), 32K DL1 (2), 512K L2 (8), memory (100)
  - Combined branch prediction, fetch until the first taken branch
- Position-based selective replay
- Token-based selective replay
  - 4-wide: 8 tokens, 8-wide: 16 tokens
  - Scheduling miss predictor: 4K-entry, PC direct-mapped 2-bit counters
  - 4-cycle penalty for squashing instructions from issue queue
  - Re-insert instructions at the rate of machine width
- SPEC2K INT, reduced input sets
  - Reference input sets for crafty, eon and gap
  - Up to 3B instructions
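
A scheduling miss predictor of the stated shape (4K entries, PC direct-mapped, 2-bit counters) might look like the sketch below. The index function, the update policy (increment on a miss, decrement on a hit) and the threshold of 2 are assumptions for illustration; the slide gives only the table organization.

```python
class SchedMissPredictor:
    """Direct-mapped table of 2-bit saturating counters (illustrative)."""

    def __init__(self, entries=4096, threshold=2):
        self.entries = entries
        self.threshold = threshold     # assumed confidence threshold
        self.counters = [0] * entries

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits are dropped.
        return (pc >> 2) % self.entries

    def predict_miss(self, pc):
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, missed):
        i = self._index(pc)
        if missed:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = SchedMissPredictor()
load_pc = 0x1000
initial = predictor.predict_miss(load_pc)
predictor.update(load_pc, missed=True)
predictor.update(load_pc, missed=True)
after_two_misses = predictor.predict_miss(load_pc)
```

A load whose counter reaches the threshold would be a candidate for a token at rename.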

21 Scheduling Misses Covered by Tokens
- 75% to 92% of scheduling misses are recovered by tokens selectively
  - The misses not covered by tokens are recovered non-selectively (re-insert)
  - mcf runs out of tokens due to many concurrent misses
- Name space reduction
  - 8-wide: naive vector scheme tracks 128 loads => 16 loads (16 tokens)
[Figure: scheduling misses covered by tokens, for 4-wide with 8 tokens and 8-wide with 16 tokens; annotated with % load sched misses / load issues: 3.71, 2.09, 27.59, 10.43, 6.86 and 6.86, 3.18, 27.60, 12.31, 8.88]

22 Normalized Issue Count
- Selective replay is essential for lower issue count
  - Significant increase in non-selective replay
  - Independent instructions are unnecessarily replayed
  - Worse on wider machines
- Token scheme performs as well as the ideal scheme (position-based)
  - Except for mcf: low scheduling miss coverage
[Figure: normalized issue count, 4-wide and 8-wide panels]

23 Normalized IPC
- Non-selective and delayed schemes do not scale to wider machines
  - Scheduling miss penalty grows as the width grows
- Token selective recovery
  - Works better than non-selective or delayed selective schemes in many cases
  - Better performance scalability
[Figure: normalized IPC of the non-selective, delayed and token schemes, 4-wide and 8-wide panels]

24 Discussion
- Delayed selective recovery
  - A good design alternative to the ideal scheme on a 4-wide machine
  - Good tradeoffs among complexity, performance, and issue count
- Complexity is a function of the number of tokens (not the machine width nor depth) in the token scheme
  - # extra wires in the scheduler = 2 x (# tokens)
  - Position-based scheme: {(width) x (depth) + 1} x (# mem ports)
  - 32 (token-based) vs. 196 (position-based) on our 8-wide machine
  - Better for wider and deeper machines
- Support for data-speculation techniques
  - Token scheme correctly tracks true data dependences in program order
  - Other schemes cannot recover unless correct dependences are carried through the scheduler
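
The wire counts quoted in the talk follow directly from the two formulas. The sketch below assumes that "depth" is the 6-cycle schedule-to-verify delay from the machine parameters, which reproduces the 50 and 196 figures; the function names are hypothetical.

```python
def token_wires(num_tokens):
    """Extra scheduler wires for the token scheme: 2 x (# tokens)."""
    return 2 * num_tokens


def position_wires(width, depth, mem_ports):
    """Extra wires for the position-based scheme:
    {(width) x (depth) + 1} x (# mem ports)."""
    return (width * depth + 1) * mem_ports


DEPTH = 6  # assumed: the 6-cycle schedule-to-verify delay

four_wide = (token_wires(8), position_wires(4, DEPTH, 2))
eight_wide = (token_wires(16), position_wires(8, DEPTH, 4))
```

Doubling the machine width doubles the token count in this configuration, while the position-based wire count nearly quadruples because both width and memory ports grow.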

25 Conclusions
- Scheduling replay is essential for speculative scheduling
  - Invalidate and re-schedule incorrectly issued instructions
  - Increasingly important as pipelines become wider and deeper
- Speculative wavefront propagation in scheduling replay
  - Incurred by the schedule-to-verify delay
  - Negatively affects issue count (power) and performance
  - Scheduling replay needs multi-level dependence tracking to avoid unnecessary issue under misses
- Issues in efficient dependence tracking
  - Non-selective, delayed selective and position-based selective schemes
- Token-based selective replay
  - Scalable to wider machines, support for data speculation

26 Questions?

27 Scheduling Miss Predictor Performance at Different Thresholds
- PC-indexed, direct-mapped, 4K entries
[Figure: coverage of scheduling misses (higher is better) and loads predicted to be a miss (lower is better)]

28 Limitations with Data-Speculation Techniques
- Assumptions enabling the name space conversion (into a smaller set)
  - Data-dependence enforcement, deterministic schedule-to-verify delay
  - Tracking issue / execution status filters out independent instructions
- Data speculation breaks those assumptions
  => Cannot be directly applied to data-speculation recovery
[Figure: two timelines over Sched, Exe, Verify from issue to miss detection. Sched replay: issued dependent / independent, executed independent, unissued. Data speculation recovery: variable delay, issued dependent / independent, executed dependent / independent, unissued, collapsed data dependence]

29 Normalized IPC
- Re-insert
  - All scheduling misses are recovered by squashing & re-inserting
  - Worst-case performance of token-based replay
- Conservative
  - Loads with high mis-scheduling confidence are scheduled based on L2 latency
  - Squashing & re-inserting if mis-scheduled
  - May unnecessarily delay too many loads
[Figure: normalized IPC of the non-selective, delayed, token, re-insert and conservative schemes, 4-wide and 8-wide panels]

30 Scheduling Replay Models
- Replay-queue-based replay (like the Pentium 4)
  - Issued instructions move from issue queue to replay queue
  - Circulates instructions until they hit in the scoreboard
  - Parallel verification for this model is left to future work
- Issue-queue-based replay (our assumption)
  - Instructions retire from the issue queue if correctly executed
[Figure: issue queue feeding the Exe pipeline and verify stage, with verification status (kill bus) fed back to the issue queue when a cache miss is detected]

31 Parallel Verification
- A load miss can propagate through 836 instruction levels!
  - Not bounded by the size of the instruction window (8-wide, 128 RUU)
- Total issue count goes up by 15% in parser
  - Average 10% in SPEC2K INT, worst 42% in mcf
- Negative impacts on performance and power
[Figure: Sched and Exe over cycles n, n+1, n+2 with a cache miss signal; a scoreboard / checker compared with dependence tracking / parallel verification, which terminates the speculative execution wavefront]

32 Position-based Selective Replay
- Ideal selective recovery
- Dependence tracking is managed in a matrix form
  - Column: load issue slot, row: pipeline stages
  - Merge matrices: bit merge & shift
  - Invalidate if bits match in the last row
- Dependence tracking: precise position information
[Figure: integer pipeline and mem pipeline (width 2) over Sched, Disp, RF, Exe, Verify at cycle n and cycle n+1, with LD, ADD, OR, XOR, AND, SLL carrying dependence matrices; tag / dep info broadcast, kill bus broadcast on cache miss detected, and killed instructions; wakeup logic with tag and ready fields for the left and right operands, tag bus, kill bus, and dependence info bus]

