ISCA-36 :: June 23, 2009 Decoupled Store Completion Silent Deterministic Replay Enabling Scalable Data Memory for CPR/CFP Processors Andrew Hilton, Amir.

Presentation transcript:

ISCA-36 :: June 23, 2009
Decoupled Store Completion / Silent Deterministic Replay
Enabling Scalable Data Memory for CPR/CFP Processors
Andrew Hilton, Amir Roth
University of Pennsylvania
{adhilton,

[ 2 ] Brief Overview
Dynamically scheduled superscalar processors
  Scalable load & store queues
    SVW/SQIP [Roth05, Sha05]
  Latency-tolerant processors
    CPR/CFP [Akkary03, Srinivasan04]
    DKIP, FMC [Pericas06, Pericas07]
  Scalable load & store queues for latency-tolerant processors
    SA-LQ/HSQ [Akkary03]
    SRL [Gandhi05]
    ELSQ [Pericas08]
Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)
Decoupled Store Completion & Silent Deterministic Replay

[ 3 ] Outline
Background
  CPR/CFP
  SVW/SQIP
  The granularity mismatch problem
DSC/SDR
Evaluation

[ 4 ] CPR/CFP
Latency-tolerant: scale key window structures under LL$ miss
  Issue queue, regfile, load & store queues
CFP (Continual Flow Pipeline) [Srinivasan04]
  Scale issue queue & regfile by “slicing out” miss-dependent insns
CPR (Checkpoint Processing & Recovery) [Akkary03]
  Scale regfile by limiting recovery to pre-created checkpoints
  + Aggressive reclamation of non-checkpoint registers
  – Unintended consequence? Checkpoint-granularity “bulk commit”
SA-LQ (Set-Associative Load Queue) [Akkary03]
HSQ (Hierarchical Store Queue) [Akkary03]

[ 5 ] Baseline Performance (& Area)
ASSOC (baseline): 64/48-entry fully-associative load/store queues
8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
  – Load queue: area is fine, poor performance (set conflicts)
  – Store queue: performance is fine, area inefficient (large CAM)

[ 6 ] SQIP
SQIP (Store Queue Index Prediction) [Sha05]
  Scales store queue/buffer by eliminating associative search
  Load predicts store queue position of its forwarding store
  Load indexes store queue at this position
Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
  Stores named by monotonically increasing sequence numbers
  Low-order bits are store queue/buffer positions
  Global SSNs track dispatch, commit, (store) completion
[Figure: instruction stream, older to younger: A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br, annotated with addresses ([x20], [x10], [?]) and dispatch/commit pointers]
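
The SSN and indexed-forwarding ideas on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not the design from [Sha05]: the class name `IndexedStoreQueue`, its methods, the queue size, and the addresses are all hypothetical, and the index predictor itself is left out (the caller supplies the predicted SSN).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    ssn: int    # store sequence number (monotonically increasing)
    addr: int
    value: int


class IndexedStoreQueue:
    """Toy SQIP-style store queue: a load reads one indexed entry, no search."""

    def __init__(self, size: int = 8) -> None:
        self.size = size                      # hypothetical size (power of two)
        self.entries: List[Optional[StoreEntry]] = [None] * size
        self.ssn_dispatch = 0                 # SSN of youngest dispatched store
        self.ssn_complete = 0                 # SSN of youngest store written to D$

    def dispatch_store(self, addr: int, value: int) -> int:
        # Hardware would stall dispatch on a full queue; here it is an error.
        assert self.ssn_dispatch - self.ssn_complete < self.size, "store queue full"
        self.ssn_dispatch += 1
        ssn = self.ssn_dispatch
        # Low-order SSN bits select the store queue position.
        self.entries[ssn % self.size] = StoreEntry(ssn, addr, value)
        return ssn

    def complete_oldest(self) -> None:
        # The oldest in-flight store writes the D$ and leaves the window.
        if self.ssn_complete < self.ssn_dispatch:
            self.ssn_complete += 1

    def forward(self, load_addr: int, predicted_ssn: int) -> Optional[int]:
        # Predicted store is not in flight: the load reads the D$ instead.
        if not (self.ssn_complete < predicted_ssn <= self.ssn_dispatch):
            return None
        entry = self.entries[predicted_ssn % self.size]   # single indexed read
        if entry is not None and entry.ssn == predicted_ssn and entry.addr == load_addr:
            return entry.value
        return None   # address mismatch: no forwarding from this entry
```

A load that predicts the SSN of store P and matches its address gets P's value; any wrong prediction simply yields no forwarding, which is why a separate verification step (SVW, next slide) is needed.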

[ 7 ] SVW
Store Vulnerability Window (SVW) [Roth05]
  Scales load queue by eliminating associative search
  Load verification by in-order re-execution prior to commit
  Highly filtered: <1% of loads actually re-execute
SSBF (SSN Bloom Filter)
  Address-indexed SSBF tracks [addr, SSN] of committed stores
  Loads check SSBF, re-execute if possibly incorrect
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with complete and verify/commit pointers, next to an SSBF whose entries are labeled x?8 and x?0 (addresses x18 and x20)]
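
The SSBF filter check can be illustrated with a small untagged table. This is a minimal sketch, assuming a direct-mapped table indexed by low-order address bits; the class name `SSBF`, the method names, and the table geometry are hypothetical, and SSN wrap-around is ignored.

```python
class SSBF:
    """Toy SSN Bloom filter: per-index SSN of the last recorded store."""

    def __init__(self, entries: int = 16, line_bytes: int = 8) -> None:
        self.entries = entries          # hypothetical geometry
        self.line_bytes = line_bytes
        self.table = [0] * entries      # SSN 0 means "no store recorded"

    def _index(self, addr: int) -> int:
        return (addr // self.line_bytes) % self.entries

    def record_store(self, addr: int, ssn: int) -> None:
        # Called in program order as stores commit, so SSNs only grow.
        self.table[self._index(addr)] = ssn

    def must_reexecute(self, addr: int, load_svw_ssn: int) -> bool:
        # load_svw_ssn: SSN of the youngest store the load is known NOT to be
        # vulnerable to (e.g. the store it forwarded from). A younger store
        # recorded at the same index means the load value may be stale.
        return self.table[self._index(addr)] > load_svw_ssn
```

Because the table is untagged, two addresses that map to the same index alias: the filter may request an unnecessary re-execution but never skips a needed one, which is what makes it safe as a filter.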

[ 8 ] SVW–NAIVE
SVW: 512-entry indexed load queue, 256-entry store queue
  – Slowdowns over 8SA-LQ (mesa, wupwise)
  – Some slowdowns even over ASSOC (bzip2, vortex)
Why? Not forwarding mis-predictions … store-load serialization
  Load Y can’t verify until older store X completes to D$

[ 9 ] Store-Load Serialization: ROB
SVW/SQIP example: SSBF verification “hole”
  Load R forwards from store P → vulnerable to younger stores
  – No SSBF entry for address [x10] → must replay
  Can’t search store buffer → wait until older stores are in D$
In a ROB processor … P will complete (and usually quickly)
In a CPR processor …
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses ([x10], [x20], [x18], [x10]), complete and verify/commit pointers, and SSBF contents before and after the stores complete]

[ 10 ] Store-Load Serialization: CPR
P will complete … unless it’s in the same checkpoint as R
Deadlock: load R can’t verify → store P can’t complete
Resolve: squash (ouch); on re-execution, create a checkpoint before R
  P and R will be in separate checkpoints
Better: learn and create checkpoints before future instances of R
  This is SVW–TRAIN
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with verify, complete, and commit pointers and SSBF contents]
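
The "learn and create checkpoints" step can be pictured as a small PC-indexed predictor. The sketch below is only an illustration of that idea: the names `CheckpointTrainer` and `assign_checkpoints` and the PCs are hypothetical, and it ignores the limited number of checkpoints a real CPR core has.

```python
from typing import Dict, List, Set, Tuple


class CheckpointTrainer:
    """Remembers load PCs that deadlocked against a same-checkpoint store."""

    def __init__(self) -> None:
        self.trained_load_pcs: Set[int] = set()

    def train(self, load_pc: int) -> None:
        # Called after a squash caused by a store-load serialization deadlock.
        self.trained_load_pcs.add(load_pc)

    def wants_checkpoint(self, pc: int) -> bool:
        return pc in self.trained_load_pcs


def assign_checkpoints(stream: List[Tuple[int, str]],
                       trainer: CheckpointTrainer) -> Dict[str, int]:
    """Map each instruction name to a checkpoint id.

    A new checkpoint is opened before every trained load, so the load and
    the older store it depends on end up in different checkpoints.
    """
    ckpt = 0
    assignment: Dict[str, int] = {}
    for pc, name in stream:
        if assignment and trainer.wants_checkpoint(pc):
            ckpt += 1
        assignment[name] = ckpt
    return assignment
```

On the first pass P and R share a checkpoint; after `train` is called with R's PC, the next pass places R in a new checkpoint, at the cost of holding one more checkpoint's registers.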

[ 11 ] SVW–TRAIN
+ Better than SVW–NAIVE
– But worse in some cases (art, mcf, vpr)
  Over-checkpointing holds too many registers
  Checkpoint may not be available for branches

[ 12 ] What About Set-Associative SSBFs?
+ Higher associativity helps (reduces hole frequency) but …
– We’re replacing store queue associativity with SSBF associativity
  Trying to avoid things like this
Want a better solution…

[ 13 ] DSC (Decoupled Store Completion)
No fundamental reason we cannot complete stores
  All older instructions have completed
What’s stopping us? Definition of commit & architected state
  CPR: commit = oldest register checkpoint (checkpoint granularity)
  ROB: commit = SVW-verify (instruction granularity)
Restore ROB definition
  Allow stores to complete past oldest checkpoint
  This is DSC (Decoupled Store Completion)
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses ([x10], [x20], [x18], [x10]), showing the complete and verify/commit pointers advancing past the oldest checkpoint's commit point]
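
The difference between the two commit definitions can be made concrete with a toy progress model. This is a sketch under simplifying assumptions, not the simulator from the talk: `Insn`, `stuck_instructions`, `example`, and the policy names `"cpr"` and `"dsc"` are hypothetical, and each instruction is reduced to a single done/not-done flag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Insn:
    name: str
    kind: str                        # "st", "ld", or "other"
    ckpt: int                        # checkpoint the instruction belongs to
    waits_for: Optional[str] = None  # store that must reach D$ before this load verifies
    done: bool = False               # store completed / load verified / other executed


def stuck_instructions(window: List[Insn], policy: str) -> List[str]:
    """Iterate until nothing can progress; return names that never finish."""
    completed_stores = set()
    progress = True
    while progress:
        progress = False
        for idx, insn in enumerate(window):
            if insn.done:
                continue
            if insn.kind == "ld":
                ready = insn.waits_for is None or insn.waits_for in completed_stores
            elif insn.kind == "st" and policy == "cpr":
                # Bulk commit: everything up to and including this checkpoint
                # must be finished, except the checkpoint's own stores.
                ready = all(o.done or (o.kind == "st" and o.ckpt == insn.ckpt)
                            for o in window if o.ckpt <= insn.ckpt)
            elif insn.kind == "st":
                # DSC: instruction granularity, only older instructions matter.
                ready = all(o.done for o in window[:idx])
            else:
                ready = True
            if ready:
                insn.done = True
                progress = True
                if insn.kind == "st":
                    completed_stores.add(insn.name)
    return [i.name for i in window if not i.done]


def example(r_ckpt: int) -> List[Insn]:
    # A, B, P, Q share checkpoint 0; load R must wait for store P.
    return [Insn("A", "st", 0), Insn("B", "ld", 0), Insn("P", "st", 0),
            Insn("Q", "st", 0), Insn("R", "ld", r_ckpt, waits_for="P")]
```

With P and R in the same checkpoint, the `"cpr"` policy leaves A, P, Q, and R stuck (the deadlock from the previous slide), while `"dsc"` drains the whole window; moving R to its own checkpoint also unblocks `"cpr"`, which is the workaround SVW–TRAIN learns.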

[ 14 ] DSC: What About Mis-Speculations?
DSC: architected state is younger than oldest checkpoint
What about mis-speculation (e.g., branch T mis-predicted)?
  Can only recover to checkpoint
  Squash committed instructions?
  Squash stores visible to other processors? etc.
How do we recover architected state?
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with complete and verify/commit pointers past the oldest checkpoint and branch T marked as mis-predicted]

[ 15 ] Silent Deterministic Replay (SDR)
Reconstruct architected state on demand
  Squash to oldest checkpoint and replay …
  Deterministically: reproduce committed values
  Silently: without generating coherence events
How? Discard committed stores at rename (already in SB or D$)
How? Read load values from load queue
  Avoid WAR hazards with younger stores
  Same thread (e.g., B → Q) or different thread (coherence)
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses ([x10], [x20], [x18], [x10]) and complete and verify/commit pointers]
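
The two "How?" rules can be shown on a tiny replay loop. This is a minimal sketch, assuming a flat dictionary for memory and a per-load value log standing in for the load queue; the function `replay`, its parameters, and the addresses and values in the usage note are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# (name, op, address, store value or None for loads)
ReplayInsn = Tuple[str, str, int, Optional[int]]


def replay(stream: List[ReplayInsn],
           memory: Dict[int, int],
           load_queue: Dict[str, int],
           committed: Set[str],
           use_load_queue: bool = True) -> Tuple[Dict[str, int], int]:
    """Re-execute from the oldest checkpoint.

    Returns (value seen by each load, number of memory writes performed).
    """
    loaded: Dict[str, int] = {}
    writes = 0
    for name, op, addr, value in stream:
        if op == "st":
            if name in committed:
                continue                 # silent: data is already in the SB or D$
            assert value is not None
            memory[addr] = value
            writes += 1
        elif op == "ld":
            if use_load_queue and name in committed:
                # Deterministic: reuse the value the load originally committed,
                # even if a younger store has since overwritten the address.
                loaded[name] = load_queue[name]
            else:
                loaded[name] = memory[addr]
    return loaded, writes
```

For instance, suppose load B originally read 5 from x20 and younger store Q later wrote 9 there. Replaying with the logged values returns 5 for B and performs no memory writes, whereas passing `use_load_queue=False` makes B observe 9, which is the WAR hazard the slide warns about.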

[ 16 ] Outline
Background
DSC/SDR (yes, that was it)
Evaluation
  Performance
  Performance-area trade-offs

[ 17 ] Performance Methodology
Workloads
  SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
Cycle-level simulator configuration
  4-way superscalar out-of-order CPR/CFP processor
  8 checkpoints, 32/32 INT/FP issue queue entries
  32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers
  400-cycle memory, 4Byte/cycle memory bus

[ 18 ] SVW+DSC/SDR
+ Outperforms SVW–NAIVE and SVW–TRAIN
+ Outperforms 8SA-LQ on average (by a lot)
– Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ
  These are due to forwarding mis-speculation

[ 19 ] Smaller, Less-Associative SSBFs
Does DSC/SDR make set-associative SSBFs unnecessary?
  You can bet your associativity on it

[ 20 ] Fewer Checkpoints
DSC/SDR reduce need for large numbers of checkpoints
  Don’t need checkpoints to serialize store/load pairs
  Efficient use of D$ bandwidth even with widely spaced checkpoints
Good: checkpoints are expensive

[ 21 ] … And Less Area
Area methodology
  CACTI-4 [Tarjan04], 45nm
  Sum areas for load/store queues (SSBF & predictor too if needed)
  E.g., 512-entry 8SA-LQ / 256-entry HSQ: 6.6% speedup, 0.91 mm²
High-performance/low-area

[ 22 ] How Performance/Area Was Won
+ SVW load queue: big performance gain (no conflicts) & small area loss
+ SQIP store queue: small performance loss & big area gain (no CAM)
  Big SVW performance gain offsets small SQIP performance loss
  Big SQIP area gain offsets small SVW area loss
+ DSC/SDR: big performance gain & small area gain

[ 23 ] DSC/SDR Performance/Area
DSC/SDR improve SVW/SQIP IPC and reduce its area
  No new structures, just new ways of using existing structures
  + No SSBF checkpoints
  + No checkpoint-creation predictor
  + More tolerant to reduction in checkpoints, SSBF size

[ 24 ] Pareto Analysis
SVW/SQIP+DSC/SDR dominates all other designs
  SVW/SQIP are low area (no CAMs)
  DSC/SDR needed to match IPC of fully-associative load queue (FA-LQ)
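
For readers unfamiliar with the term, "dominates" here is Pareto dominance over (performance, area) points. The sketch below shows the check on made-up numbers; the function `pareto_front`, the design names, and every value in the usage note are hypothetical and are not results from the talk.

```python
from typing import Dict, List, Tuple


def pareto_front(designs: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return names of non-dominated designs.

    Each design maps to (speedup_percent, area_mm2): higher speedup is
    better, lower area is better.
    """
    def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
        # a is at least as good on both axes and differs on at least one.
        return a[0] >= b[0] and a[1] <= b[1] and a != b

    return sorted(name for name, point in designs.items()
                  if not any(dominates(other, point) for other in designs.values()))
```

With placeholder points `{"W": (2.0, 0.60), "X": (6.6, 0.91), "Y": (8.0, 0.50), "Z": (3.0, 0.40)}`, design Y dominates X and Z dominates W, so the front is Y and Z. A design that "dominates all other designs" would be the only name returned.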

[ 25 ] Related Work
SRL (Store Redo Log) [Gandhi05]
  Large associative store queue → FIFO buffer + forwarding cache
  Expands store queue only under LL$ misses → under-performs HSQ
Unordered late-binding load/store queues [Sethumadhavan08]
  Entries only for executed loads and stores
  Poor match for centralized latency-tolerant processors
Cherry [Martinez02]
  “Post retirement” checkpoints
  No large load/store queues, but may benefit from DSC/SDR
Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]

[ 26 ] Conclusions
Checkpoint granularity …
  + … register management: good
  – … store commit: somewhat painful
DSC/SDR: the good parts of the checkpoint world
  Checkpoint-granularity registers + instruction-granularity stores
  Key 1: disassociate commit from oldest register checkpoint
  Key 2: reconstruct architected state silently on demand
    Committed load values available in load queue
+ Allow checkpoint processor to use SVW/SQIP load/store queues
  Performance and area advantages
+ Simplify multi-processor operation for checkpoint processors
