1 Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Speculation Amir Roth University of Pennsylvania.

Presentation transcript:

2 Load Speculation
Load/store unit: performance critical, but complex
  Ambiguous dependences → associative search → scalability issues
Overcome with load speculation (wrt older stores)
  “Load speculation”: no dependence on unresolved stores
  Speculative forwarding: identity of forwarding store
  Redundant load elimination: no conflicting store since first load
  Not value prediction
Common question: how to verify speculation?
  “Load speculation”: associative load queue search
  Speculative forwarding: same
  Redundant load elimination: ??

3 Verification: Load Re-Execution
[Figure: pipeline (IF1 IF2 ID RN SC RR EX1 EX2 WB RX1 RX2 CT) with D$ accesses for ld and st; orig-value != re-executed-value ? flush]
Load re-execution: general-purpose load verifier
  Re-execute loads in-order prior to commit
  Flush/restart if re-executed value != original value
Performance
  +No associative load queue search
  +No false squashes (value based)
  –D$ bandwidth contention
  –Reduced effective capacity of LSQ, regfile
  –Multi-cycle D$ → critical loop with store retirement
Obvious key: re-execute as few loads as possible
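The value-based check on this slide can be sketched in a few lines of Python. This is only an illustration: the `Load` record, the dictionary standing in for the D$, and the name `must_flush` are assumptions made for the example, not part of the original design.

```python
from dataclasses import dataclass


@dataclass
class Load:
    addr: int
    orig_value: int  # value the load obtained when it first (speculatively) executed


def must_flush(load: Load, dcache: dict) -> bool:
    """Re-execute the load against committed memory state just before commit.

    Returns True when the re-executed value differs from the original one,
    i.e. when the pipeline must flush and restart.
    """
    reexecuted_value = dcache.get(load.addr, 0)  # assumption: untouched memory reads as 0
    return reexecuted_value != load.orig_value


# Toy D$ contents after all older stores have committed (hypothetical values).
dcache = {0x100: 7, 0x108: 9}

correct_load = Load(addr=0x100, orig_value=7)  # speculation was right
stale_load = Load(addr=0x108, orig_value=3)    # read before an older store wrote 9
silent_case = Load(addr=0x100, orig_value=7)   # an older store rewrote the same value 7

results = [must_flush(ld, dcache) for ld in (correct_load, stale_load, silent_case)]
print(results)  # [False, True, False]
```

The third case shows why a value-based check produces no false squashes: a load that raced with a store of the same value still compares equal and commits.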

4 SVW: Store Vulnerability Window
[Figure: pipeline with an added SVW stage in which ld and st access the SSBF; RX1/RX2 still access the D$; orig-value != re-executed-value ? flush]
SVW: general-purpose load re-execution filter
  Loads access new table to see if they can safely skip re-execution
Performance
  Additional pipeline stage? Additional table access?
  SSBF is much smaller than D$ (e.g., 1KB vs. 64KB)
  High bandwidth
  Single-cycle access → no critical loop
  +Reduces re-executions by 5-20X

5 Talk Outline
SVW framework
SVW applied to…
  Load speculation
  Speculative forwarding
  Redundant load elimination
Something new … and pretty cool
Summary

6 SVW Concept
Observe: load mis-speculation is a function of stores
  No store to load’s address in a long time → no need to re-execute
SVW formalizes this notion
  Each load vulnerable to some window of stores older than itself…
  …and no stores older than this window
  Load’s Store Vulnerability Window (SVW)
  Re-execute load only if it collides with store in its SVW
[Figure: older stores preceding the load, divided into “don’t care” stores and the SVW]
Size of SVW is important: too small (wrong), too big (slow)

7 SVW Implementation
[Figure: pipeline with the SSN counter at RN and the SVW stage in which ld and st access the SSBF]
At RENAME: a way of physically specifying store windows…
  Monotonically increasing store sequence numbers (SSNs)
  store.SSN = SSN_RENAME++
  load.SVW = SSN-of-youngest-“don’t-care”-store
  Speculation specific
At SVW: …the address collision test
  SSBF (Store Sequence Bloom Filter): address-indexed table of SSNs
  SSBF[store.addr] = store.SSN
  load.must-re-execute = (SSBF[load.addr] > load.SVW)
  Same for all forms of speculation
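A minimal Python sketch of the two SSBF operations named on this slide. The class name, the 8-byte indexing granularity, the modulo hash, and the convention that SSN 0 means "no store yet" are assumptions for illustration; the slide only specifies an address-indexed table of SSNs and the comparison against the load's SVW.

```python
import itertools


class StoreSequenceBloomFilter:
    """Address-indexed table of SSNs (untagged, so distinct addresses may alias)."""

    def __init__(self, entries: int = 512, granularity: int = 8):
        self.table = [0] * entries      # 0 = no store seen (assumed convention)
        self.granularity = granularity  # assumption: one entry per aligned 8-byte word

    def _index(self, addr: int) -> int:
        return (addr // self.granularity) % len(self.table)

    def record_store(self, addr: int, ssn: int) -> None:
        # SSBF[store.addr] = store.SSN
        self.table[self._index(addr)] = ssn

    def must_reexecute(self, addr: int, svw: int) -> bool:
        # load.must-re-execute = (SSBF[load.addr] > load.SVW)
        return self.table[self._index(addr)] > svw


ssn_counter = itertools.count(1)  # monotonically increasing SSNs handed out at rename
ssbf = StoreSequenceBloomFilter()

s1 = next(ssn_counter)            # store 1 writes 0x40
ssbf.record_store(0x40, s1)
load_svw = s1                     # the load is not vulnerable to store 1 or anything older
s2 = next(ssn_counter)            # store 2 writes 0x80, inside the load's window
ssbf.record_store(0x80, s2)

skip = ssbf.must_reexecute(0x40, load_svw)               # False: 1 > 1 does not hold
collide = ssbf.must_reexecute(0x80, load_svw)            # True: 2 > 1
aliased = ssbf.must_reexecute(0x80 + 512 * 8, load_svw)  # True: same entry as 0x80
```

The aliased case is a false positive: the table has no tags, so a load to a different address that maps to the same entry is re-executed unnecessarily. That costs a re-execution but never skips a required one.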

8 SVW for “Load Speculation”
Load speculation
  (Intelligently) issue loads before older unresolved stores
  Widely used: e.g., store sets [Moshovos+’97, Chrysos+Emer’98, Yoaz+’99]
Load speculation+REX
  Replace load queue search with load re-execution [Cain+Lipasti’04]
SVW for load speculation+REX
  load.SVW = SSN-of-youngest-“don’t care”-store
  “Don’t care” stores? Those that committed before load dispatched
    Speculation restricted to instruction window
  Implementation: load.SVW = SSN_COMMIT

9 Load Speculation+SVW: Example
Three events (Rename, Exec, SVW/Re-exec) in the life of a load
Addresses: A, B, C, D; SSNs/SVWs: 0, 1, …, 64, 65, 66, …
[Figure: LSQ (load and store entries with head/tail pointers), SSBF contents, and SSN_COMMIT at each step]
t=X: Rename load, establish SVW
  load.SVW = SSN_COMMIT (64)
t=X+5: Execute load (ambiguous, but correct)
  Also commit store 65
t=X+10: SVW load, check address collision
  SSBF at this point: D=65, C=66, B=0, A=0
  SSBF[load.addr] (0) > load.SVW (64)? No, no need to re-execute

10 Load Speculation+SVW: Other Example
Same setup
[Figure: LSQ (load and store entries with head/tail pointers), SSBF contents, and SSN_COMMIT at each step]
t=X: Rename load, establish SVW
  load.SVW = SSN_COMMIT (64)
t=X+5: Execute load (ambiguous, and wrong)
  Also commit store
t=X+10: SVW load, check address collision
  SSBF at this point: D=65, C=64, B=0, A=66
  SSBF[load.addr] (66) > load.SVW (64)? Yes, must re-execute
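The two examples (slides 9 and 10) can be replayed with a small Python sketch. The initial SSBF contents and the store addresses are one reading of the flattened figure (store 65 writes D; store 66 writes C in the first example and A in the second), so treat them as assumptions. The function name `run_example` is hypothetical.

```python
def run_example(addr_of_store_66: str) -> bool:
    """Replay rename, execute, and the SVW check for one load; return must-re-execute."""
    # Assumed state at t=X: store 64 (to C) is the youngest committed store.
    ssbf = {"A": 0, "B": 0, "C": 64, "D": 0}
    ssn_commit = 64

    # t=X: rename the load and establish its SVW.
    load_svw = ssn_commit

    # t=X+5: the load executes to address A while store 66 is still unresolved;
    # store 65 (to D) commits and updates the SSBF.
    load_addr = "A"
    ssbf["D"] = 65
    ssn_commit = 65

    # Before t=X+10: store 66 resolves, commits, and updates the SSBF.
    ssbf[addr_of_store_66] = 66
    ssn_commit = 66

    # t=X+10: the load reaches the SVW stage and performs the collision test.
    return ssbf[load_addr] > load_svw


first_example = run_example("C")   # SSBF[A] = 0 > 64?  No: skip re-execution
second_example = run_example("A")  # SSBF[A] = 66 > 64? Yes: must re-execute
```

In both runs the load's SVW is fixed at rename time; only the address written by the in-window store decides the outcome.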

11 (How Well) Does This Work?
What we care about
  Reductions in re-execution rates
  Speedups
Experimental parameters
  SPECint2000, Alpha ISA, -O4, no nops, Train inputs, 2% sampling
  8-way out-of-order superscalar, 20 stage pipeline
  256-entry ROB, 48-entry SQ, 80 RS
  32KB, 2-way, 2-cycle I/D$, 512KB, 8-way, 15-cycle L2
  16-bit SSNs → wrap-around every 64K stores
    On wrap-around: drain pipe (like store-barrier), clear SSBF
  512-entry SSBF → 1KB
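The sizing figures follow from simple arithmetic, and the wrap-around policy can be sketched in Python. The `SsnAllocator` class and its method names are hypothetical, SSN 0 is assumed to mean "no store", and the pipeline drain is modeled only as a counter reset plus an SSBF clear.

```python
SSN_BITS = 16
SSBF_ENTRIES = 512

ssbf_bytes = SSBF_ENTRIES * SSN_BITS // 8  # 512 entries x 2 bytes = 1024 bytes (1KB)
ssn_values = 2 ** SSN_BITS                 # 65536 distinct SSNs, i.e. 64K


class SsnAllocator:
    """Hands out store sequence numbers and handles 16-bit wrap-around."""

    MAX_SSN = 2 ** SSN_BITS - 1

    def __init__(self):
        self.ssn_rename = 0
        self.ssbf = [0] * SSBF_ENTRIES
        self.wraps = 0

    def next_store_ssn(self) -> int:
        if self.ssn_rename == self.MAX_SSN:
            self._wrap_around()
        self.ssn_rename += 1
        return self.ssn_rename

    def _wrap_around(self) -> None:
        # Stand-in for "drain pipe (like store-barrier), clear SSBF": once no
        # instruction is in flight, stale SSNs are erased and numbering restarts.
        self.ssbf = [0] * SSBF_ENTRIES
        self.ssn_rename = 0
        self.wraps += 1


alloc = SsnAllocator()
alloc.ssbf[3] = 40000                      # some old store's SSN
last = [alloc.next_store_ssn() for _ in range(65535)][-1]
wraps_before = alloc.wraps
after_wrap = alloc.next_store_ssn()        # triggers the wrap-around
```

Clearing the table on wrap-around matters because the collision test is a plain magnitude comparison: a stale large SSN left in the SSBF would otherwise look younger than every post-wrap SVW.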

13 Load Speculation: Results
Baseline: load speculation with associative load queue (LQ)
Load speculation+REX
  Re-execution rate: ~9%
  Speedup: ~0.5%
  Natural re-execution filter: only if issued with unknown stores…
Load speculation+REX+SVW
  Additional filter: …and only if store in SVW wrote to load’s address
  Re-execution rate: ~1%
  Speedup: ~1.5%
SVW is an enhancer

14 SVW for Speculative Forwarding
Attacks store queue (as opposed to load queue) scalability
Observe: conventional store queue performs two functions
  Store retirement: non-associative, all stores
  Store forwarding: associative, few (predictable) loads/stores
Exploit: decompose store queue [Roth’04, Baugh+Zilles’04]
  Retirement queue: large, non-associative, off critical path
  Forwarding queue: small, associative, on critical path
  Speculatively steer loads/stores to forwarding queue
  Re-execute loads to check speculation/train steering predictor
SVW for speculative forwarding: same
  load.SVW = SSN_COMMIT

16 Speculative Forwarding: Results
Baseline: 48-entry/2-port/4-cycle store queue
Speculative forwarding (16/1/2)+REX
  Re-execution rates: 100%, Speedups: –12%
  No natural re-execution filter
Speculative forwarding+REX+SVW
  Re-execution rate: 14%, Speedups: 14%
SVW is an enabler

17 SVW for Redundant Load Elimination
Redundant load elimination
  Identify redundant loads (e.g., by tracking register dependences)
  Eliminate them (e.g., by pointing to output preg of original load)
  +Removes loads from execution core, reduces observed load latency
  –Must re-execute eliminated loads
    Address-based invalidation is expensive and incomplete
  Many instantiations [Sodani+’97, Onder+’99, Petric+’02,’05]
SVW for redundant load elimination: a different definition
  Eliminated load vulnerable to all stores … starting at original load
  Pass this SSN via reuse table
  Notice: speculation operates outside instruction window
    Load queue search is not an option
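A rough Python sketch of how the SSN could travel through the reuse table. The `ReuseEntry` fields, the string used as the reuse-table key, the starting SSN of 10, and the alias-free dictionary standing in for the SSBF are all assumptions for illustration; the slide only states that the eliminated load is vulnerable to every store since the original load and that this SSN is passed via the reuse table.

```python
from dataclasses import dataclass


@dataclass
class ReuseEntry:
    preg: int  # physical register holding the original load's value
    ssn: int   # SSN_RENAME at the time the original load was renamed


def eliminated_load_must_reexecute(intervening_store_addrs, load_addr) -> bool:
    ssn_rename = 10   # arbitrary starting point for the example
    ssbf = {}         # address -> SSN of the last store to it (idealized, no aliasing)
    reuse_table = {}

    # The original load is renamed: remember where its value lives and which
    # store was the youngest one older than it.
    signature = "ld r2, 8(r1)"  # hypothetical reuse-table key
    reuse_table[signature] = ReuseEntry(preg=42, ssn=ssn_rename)

    # Stores between the original load and the redundant load.
    for addr in intervening_store_addrs:
        ssn_rename += 1
        ssbf[addr] = ssn_rename

    # The redundant load is eliminated; its SVW comes from the reuse table,
    # so it covers every store since the original load.
    load_svw = reuse_table[signature].ssn
    return ssbf.get(load_addr, 0) > load_svw


safe = eliminated_load_must_reexecute([0x200, 0x208], load_addr=0x100)      # False
conflict = eliminated_load_must_reexecute([0x200, 0x100], load_addr=0x100)  # True
```

Because the window is expressed as an SSN rather than as a set of in-flight instructions, it still works when the original load has long since left the instruction window.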

19 Redundant Load Elimination: Results
Baseline: no load elimination
Redundant load elimination+REX
  Re-execution rate: ~24%, Speedups: ~4%
  Natural re-execution filter: only eliminated loads
Redundant load elimination+REX+SVW
  Re-execution rate: ~0.5%, Speedups: ~8%
SVW is an enhancer … but a really good one

20 Something Cool: REX+SVW … –REX!
Originally: SVW as re-execution filter
  [Figure: pipeline with SVW stage (SSBF) followed by re-execution stages RX1/RX2 (D$); orig-value != re-execute value ? flush]
Alternatively: SVW as re-execution substitute
  [Figure: pipeline with SVW stage (SSBF) and no re-execution stages; SSBF hit ? flush]
  SSBF hit = re-execution failure, flush+restart
Thanks Vlad
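The difference between the two schemes can be sketched in Python. The function names and the string results are illustrative assumptions; the slide only says that an SSBF hit either triggers a re-execution (filter) or is itself treated as a re-execution failure (substitute).

```python
def verify_with_rex(ssbf_ssn: int, load_svw: int, orig_value: int, dcache_value: int) -> str:
    """SVW as a filter: an SSBF hit only triggers re-execution; values decide."""
    if ssbf_ssn <= load_svw:
        return "commit"  # filtered, no D$ access needed
    return "flush" if dcache_value != orig_value else "commit"


def verify_without_rex(ssbf_ssn: int, load_svw: int) -> str:
    """SVW as a substitute: an SSBF hit is treated as a mis-speculation."""
    return "flush" if ssbf_ssn > load_svw else "commit"


# Case 1: no store in the load's window wrote its SSBF entry.
no_hit = (verify_with_rex(60, 64, 5, 5), verify_without_rex(60, 64))

# Case 2: a store in the window wrote a different value to the load's address.
real_conflict = (verify_with_rex(66, 64, 5, 9), verify_without_rex(66, 64))

# Case 3: SSBF hit, but the value is unchanged (for example an aliased entry
# or a store of the same value).
false_hit = (verify_with_rex(66, 64, 5, 5), verify_without_rex(66, 64))
```

Case 3 is the price of dropping re-execution: the substitute scheme needs no second D$ access, but every SSBF hit becomes a flush, including hits that a value comparison would have cleared.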

21 REX+SVW–REX: Results
Load reuse
  Low filtered re-execution rates: < 1%
  Almost as good as SVW+REX, often better than REX
Speculative forwarding (not shown)
  Higher filtered re-execution rates: 10+%
  Noticeable performance degradation
  Future: reduce re-executions even more

22 Summary
Load re-execution
  +Supports simplifications in hard part of core (load/store)
  –Cost not exactly zero: D$ bandwidth, serialization, queue pressure
Store Vulnerability Window (SVW)
  Store tracking mechanism for reducing load re-executions
  +Enhances some optimizations (load speculation, load reuse)
  +Enables others (speculative forwarding)