
Slide 1: ISCA-36 :: June 23, 2009
Decoupled Store Completion / Silent Deterministic Replay: Enabling Scalable Data Memory for CPR/CFP Processors
Andrew Hilton, Amir Roth, University of Pennsylvania, {adhilton, amir}@cis.upenn.edu

Slide 2: Brief Overview
- Dynamically scheduled superscalar processors: scalable load & store queues via SVW/SQIP [Roth05, Sha05]
- Latency-tolerant processors: CPR/CFP [Akkary03, Srinivasan04], DKIP, FMC [Pericas06, Pericas07]
- Scalable load & store queues for latency-tolerant processors: SA-LQ/HSQ [Akkary03], SRL [Gandhi05], ELSQ [Pericas08]
- Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)
- This talk: Decoupled Store Completion & Silent Deterministic Replay

Slide 3: Outline
- Background: CPR/CFP, SVW/SQIP
- The granularity mismatch problem
- DSC/SDR
- Evaluation

Slide 4: CPR/CFP
- Latency-tolerant: scale key window structures under an LL$ miss: issue queue, regfile, load & store queues
- CFP (Continual Flow Pipeline) [Srinivasan04]: scales issue queue & regfile by "slicing out" miss-dependent insns
- CPR (Checkpoint Processing & Recovery) [Akkary03]: scales regfile by limiting recovery to pre-created checkpoints
  + Aggressive reclamation of non-checkpoint registers
  – Unintended consequence: checkpoint-granularity "bulk commit"
- Load & store queues: SA-LQ (Set-Associative Load Queue) and HSQ (Hierarchical Store Queue) [Akkary03]

Slide 5: Baseline Performance (& Area)
- ASSOC (baseline): 64/48-entry fully-associative load/store queues
- 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
  – Load queue: area is fine, but performance suffers from set conflicts
  – Store queue: performance is fine, but the large CAM is area-inefficient

Slide 6: SQIP
- SQIP (Store Queue Index Prediction) [Sha05]: scales the store queue/buffer by eliminating associative search
  - @dispatch: load predicts the store queue position of its forwarding store
  - @execute: load indexes the store queue at that position
- Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
  - Stores named by monotonically increasing sequence numbers
  - Low-order bits are store queue/buffer positions
  - Global SSNs track dispatch, commit, and (store) completion
[Slide diagram: instruction stream (older to younger) A:St, B:Ld, P:St, Q:St, R:Ld, ..., S:+, T:Br with store addresses (x10, x20) and dispatch/commit pointers]
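A minimal Python sketch (my own naming, not the paper's hardware) of the two ideas above: store-queue slots addressed by the low-order bits of a global SSN, and SQIP-style forwarding that indexes a predicted slot instead of performing an associative search.

```python
# Sketch only: stores are named by a global, monotonically increasing SSN;
# the low-order SSN bits give the store's store-queue slot, so a load can
# read one predicted slot without a CAM search.

SQ_SIZE = 256  # power of two, so ssn % SQ_SIZE == low-order bits

class StoreQueue:
    def __init__(self):
        self.slots = [None] * SQ_SIZE      # each slot: (ssn, addr, value)
        self.ssn_dispatch = 0              # SSN of the youngest dispatched store

    def dispatch_store(self):
        self.ssn_dispatch += 1
        ssn = self.ssn_dispatch
        self.slots[ssn % SQ_SIZE] = (ssn, None, None)   # addr/value not known yet
        return ssn

    def execute_store(self, ssn, addr, value):
        self.slots[ssn % SQ_SIZE] = (ssn, addr, value)

    def forward(self, predicted_ssn, load_addr):
        """SQIP-style forwarding: index the predicted slot instead of searching."""
        entry = self.slots[predicted_ssn % SQ_SIZE]
        if entry and entry[0] == predicted_ssn and entry[1] == load_addr:
            return entry[2]        # speculative forwarding hit
        return None                # fall back to the D$; SVW catches mispredictions
```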

Slide 7: SVW
- SVW (Store Vulnerability Window) [Roth05]: scales the load queue by eliminating associative search
- Load verification by in-order re-execution prior to commit
  - Highly filtered: <1% of loads actually re-execute
- Address-indexed SSBF (SSN Bloom Filter) tracks [addr, SSN] of committed stores
- @commit: loads check the SSBF and re-execute only if possibly incorrect
[Slide diagram: SSBF indexed by address bits (x?0, x?8) holding SSNs of committed stores; instruction stream A:St ... T:Br with verify/commit and complete pointers]
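A rough software analogue of the SSBF check, assuming a direct-mapped table and a per-load "vulnerability" SSN; the real structure hashes address bits and is sized and banked differently.

```python
# Sketch only: each load remembers the SSN of the youngest store it is NOT
# vulnerable to (e.g., the store it forwarded from).  At commit, the load
# probes a small address-indexed table of committed-store SSNs; only a
# possible conflict forces in-order re-execution.

SSBF_SIZE = 1024

class SSBF:
    def __init__(self):
        self.table = [0] * SSBF_SIZE       # SSN of the last committed store per hashed address

    def index(self, addr):
        return addr % SSBF_SIZE            # real designs hash address bits

    def store_committed(self, addr, ssn):
        self.table[self.index(addr)] = ssn

    def load_must_reexecute(self, addr, vulnerability_ssn):
        # Vulnerable iff some store younger than vulnerability_ssn committed to
        # (an alias of) this address.  Aliasing only causes extra re-executions,
        # never missed violations.
        return self.table[self.index(addr)] > vulnerability_ssn
```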

Slide 8: SVW-NAIVE
- SVW: 512-entry indexed load queue, 256-entry store queue
  – Slowdowns over 8SA-LQ (mesa, wupwise)
  – Some slowdowns even over ASSOC (bzip2, vortex)
- Why? Not forwarding mis-predictions, but store-load serialization: a load Y cannot verify until an older store X completes to the D$

Slide 9: Store-Load Serialization: ROB
- SVW/SQIP example: the SSBF verification "hole"
  - Load R forwards from store P, so it is vulnerable to intervening stores
  – No SSBF entry yet for address [x10], so R must replay to verify
  - R cannot search the store buffer, so it waits until the older stores are in the D$
- In a ROB processor: store P will complete (and usually quickly)
- In a CPR processor: ... (next slide)
[Slide diagram: instruction stream A:St ... T:Br with addresses x10/x18/x20 and SSBF state at the verify/commit and complete pointers]
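To make the "hole" concrete, here is a hedged sketch of the stall condition; LoadEntry and ssn_store_complete are illustrative names, not structures from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoadEntry:
    addr: int
    value: int
    forwarded_from_ssn: Optional[int]   # SSN of the forwarding store, or None if the value came from the D$

def can_verify(load: LoadEntry, ssn_store_complete: int) -> bool:
    # A load that forwarded in the store queue can only be re-checked against
    # the D$ once the forwarding store (and everything older) has written the
    # D$; until then SVW verification, and hence commit, stalls.
    if load.forwarded_from_ssn is None:
        return True
    return ssn_store_complete >= load.forwarded_from_ssn
```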

Slide 10: Store-Load Serialization: CPR
- P will complete ... unless it is in the same checkpoint as R
- Deadlock: load R cannot verify, so store P cannot complete
- Resolve: squash (ouch); on re-execution, create a checkpoint before R so that P and R land in separate checkpoints
- Better: learn, and create checkpoints before future instances of R
- This is SVW-TRAIN
[Slide diagram: instruction stream A:St ... T:Br with addresses x10/x18/x20 and the verify, complete, and commit pointers]
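A tiny sketch of the SVW-TRAIN idea as described above: learn the PCs of loads that caused a serialization squash and force a checkpoint before future instances. The predictor organization here is illustrative; the paper's actual table may differ.

```python
# Sketch only: after a serialization-induced squash, remember the load's PC and
# force a checkpoint before future instances of it, so the load and its
# forwarding store end up in different checkpoints.

class CheckpointTrainer:
    def __init__(self):
        self.trained_pcs = set()

    def on_serialization_squash(self, load_pc: int) -> None:
        self.trained_pcs.add(load_pc)      # learn: this load caused a deadlock/squash

    def should_checkpoint_before(self, pc: int) -> bool:
        return pc in self.trained_pcs      # at rename/dispatch: force a checkpoint here
```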

Slide 11: SVW-TRAIN
+ Better than SVW-NAIVE
– But worse in some cases (art, mcf, vpr)
  - Over-checkpointing holds too many registers
  - A checkpoint may not be available for branches

Slide 12: What About Set-Associative SSBFs?
+ Higher associativity helps (reduces hole frequency), but ...
– We are just replacing store queue associativity with SSBF associativity, which is exactly what we are trying to avoid
- We want a better solution ...

Slide 13: DSC (Decoupled Store Completion)
- No fundamental reason we cannot complete these stores
  – All older instructions have already completed
- What is stopping us? The definition of commit & architected state
  - CPR: commit = oldest register checkpoint (checkpoint granularity)
  - ROB: commit = SVW-verify (instruction granularity)
- Restore the ROB definition: allow stores to complete past the oldest checkpoint
- This is DSC (Decoupled Store Completion)
[Slide diagram: instruction stream A:St ... T:Br with addresses x10/x18/x20 and the verify, complete, commit, and verify/commit pointers]
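A hedged sketch contrasting the two completion conditions; the SSN-based pointers (oldest_checkpoint_last_ssn, ssn_all_older_verified) are illustrative stand-ins for the machine's commit/verify state, not names from the paper.

```python
def store_can_complete_cpr(store_ssn: int, oldest_checkpoint_last_ssn: int) -> bool:
    # Baseline CPR: bulk commit -- a store completes only when its whole
    # checkpoint has committed (i.e., its SSN is covered by the oldest
    # committed checkpoint).
    return store_ssn <= oldest_checkpoint_last_ssn

def store_can_complete_dsc(store_ssn: int, ssn_all_older_verified: int) -> bool:
    # DSC: decouple completion from checkpoints -- complete once every older
    # instruction (loads included) has been SVW-verified, even if the store's
    # checkpoint is not yet the oldest one.
    return store_ssn <= ssn_all_older_verified
```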

Slide 14: DSC: What About Mis-Speculations?
- With DSC, architected state is now younger than the oldest checkpoint
- What about a mis-speculation (e.g., branch T mis-predicted)?
  - We can only recover to a checkpoint
  - Squash committed instructions? Squash stores already visible to other processors? etc.
- How do we recover architected state?
[Slide diagram: instruction stream A:St ... T:Br with the verify/commit and complete pointers and the mis-predicted branch T:Br]

Slide 15: Silent Deterministic Recovery (SDR)
- Reconstruct architected state on demand: squash to the oldest checkpoint and replay ...
  - Deterministically: re-produce the committed values
  - Silently: without generating coherence events
- How? Discard committed stores at rename (their values are already in the store buffer or D$)
- How? Read committed load values from the load queue
  - Avoids WAR hazards with younger stores, whether same thread (e.g., load B vs. store Q) or another thread (coherence)
[Slide diagram: instruction stream A:St ... T:Br with addresses x10/x18/x20 and the verify/commit and complete pointers]
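A simplified, assumption-laden sketch of the replay loop: committed stores are dropped, committed loads return recorded load-queue values, and other instructions recompute from registers. The instruction encoding below is invented purely for illustration.

```python
# Sketch only: rebuild architected register state after a mis-speculation by
# replaying from the oldest checkpoint -- deterministically (recorded load
# values) and silently (no memory traffic, hence no coherence events).

def sdr_replay(committed_insns, load_queue_values, regs):
    """committed_insns: tuples between the oldest checkpoint and the
    instruction-granularity commit point, encoded as
    ('store',), ('load', lq_idx, dst), or ('alu', fn, srcs, dst)."""
    for insn in committed_insns:
        kind = insn[0]
        if kind == 'store':
            continue                                   # silent: never re-issued to memory
        elif kind == 'load':
            _, lq_idx, dst = insn
            regs[dst] = load_queue_values[lq_idx]      # deterministic: recorded value
        else:                                          # ALU op recomputes from registers
            _, fn, srcs, dst = insn
            regs[dst] = fn(*(regs[s] for s in srcs))
    return regs
```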

Slide 16: Outline
- Background
- DSC/SDR (yes, that was it)
- Evaluation: performance, performance-area trade-offs

Slide 17: Performance Methodology
- Workloads: SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
- Cycle-level simulator configuration:
  - 4-way superscalar out-of-order CPR/CFP processor
  - 8 checkpoints, 32/32 INT/FP issue queue entries
  - 32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers
  - 400-cycle memory, 4Byte/cycle memory bus

Slide 18: SVW+DSC/SDR
+ Outperforms SVW-NAIVE and SVW-TRAIN
+ Outperforms 8SA-LQ on average (by a lot)
– Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ, due to forwarding mis-speculations

Slide 19: Smaller, Less-Associative SSBFs
- Does DSC/SDR make set-associative SSBFs unnecessary? You can bet your associativity on it.

Slide 20: Fewer Checkpoints
- DSC/SDR reduce the need for large numbers of checkpoints
  - Checkpoints are no longer needed to serialize store/load pairs
  - Efficient use of D$ bandwidth even with widely spaced checkpoints
- Good: checkpoints are expensive

Slide 21: ... And Less Area
- Area methodology: CACTI-4 [Tarjan04], 45nm
  - Sum the areas of the load/store queues (plus SSBF & predictor where needed)
  - E.g., 512-entry 8SA-LQ / 256-entry HSQ: 6.6% speedup, 0.91mm²
- Goal: high performance at low area

Slide 22: How Performance/Area Was Won
+ SVW load queue: big performance gain (no set conflicts) & small area loss
+ SQIP store queue: small performance loss & big area gain (no CAM)
  - The big SVW performance gain offsets the small SQIP performance loss
  - The big SQIP area gain offsets the small SVW area loss
+ DSC/SDR: big performance gain & small area gain

Slide 23: DSC/SDR Performance/Area
- DSC/SDR improve SVW/SQIP IPC and reduce its area
- No new structures, just new ways of using existing structures
  + No SSBF checkpoints
  + No checkpoint-creation predictor
  + More tolerant of reductions in checkpoint count and SSBF size

Slide 24: Pareto Analysis
- SVW/SQIP+DSC/SDR dominates all other designs
  - SVW/SQIP are low-area (no CAMs)
  - DSC/SDR is needed to match the IPC of a fully-associative load queue (FA-LQ)

Slide 25: Related Work
- SRL (Store Redo Log) [Gandhi05]
  - Replaces the large associative store queue with a FIFO buffer + forwarding cache
  - Expands the store queue only under LL$ misses, so it under-performs HSQ
- Unordered late-binding load/store queues [Sethumadhavan08]
  - Entries only for executed loads and stores
  - Poor match for centralized latency-tolerant processors
- Cherry [Martinez02]
  - "Post-retirement" checkpoints
  - No large load/store queues, but may benefit from DSC/SDR
- Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]

Slide 26: Conclusions
- Checkpoint granularity ...
  + ... for register management: good
  – ... for store commit: somewhat painful
- DSC/SDR: keep the good parts of the checkpoint world
  - Checkpoint-granularity registers + instruction-granularity stores
  - Key 1: disassociate commit from the oldest register checkpoint
  - Key 2: reconstruct architected state silently on demand (committed load values are available in the load queue)
  + Allows a checkpoint processor to use SVW/SQIP load/store queues, with performance and area advantages
  + Simplifies multi-processor operation for checkpoint processors


