Ginger: Control Independence Using Tag Rewriting
Andrew Hilton, Amir Roth
University of Pennsylvania {adhilton,
ISCA-34 :: June, 2007

Control Independence (CI)
- Branch mis-predictions limit single-thread performance
  - Improve prediction accuracy? Hard
  - Predicate? Cost on correct predictions
- Exploit control independence (CI) to reduce the squash penalty
- This paper: Ginger, a new (better) CI microarchitecture

Example:
  A: bez r1, D
  B: r2=1       }
  C: jmp E      } control dependent (CD) insns
  D: r2=2       }
  E: r3=r1+1    }
  F: r4=r2+1    } control independent (CI) insns
  G: r5=ld(r4)  }

(Remember the acronyms CI and CD.)

Exploiting Control Independence
Conventional recovery:
  1. Squash all post-mis-prediction insns
  2. Fetch/execute all correct-path insns
  – Re-fetches/re-executes CI insns (waste)

CI recovery:
  1. Squash only wrong-path CD insns
  2. Fetch/execute only correct-path CD insns
  + Preserves CI insns: E, F, G
  + Preserves un-dispatched CI insns: H, I, ...

Open questions:
  ? How to "insert" CD insns?
  ? What to do about CI insns that depend on CD insns?

(Figure: the A..G example animated to contrast a conventional squash with CI recovery preserving E, F, G.)
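As a rough illustration of the waste argument above, here is a toy Python accounting of how many insns must be fetched/executed after the branch resolves under the two recovery styles. The function and its parameters are assumptions for illustration, not the paper's model.

```python
# Toy accounting (illustrative assumptions only): conventional recovery squashes
# everything younger than the branch and fetches the whole correct path again;
# CI recovery fetches only the correct-path CD insns and keeps the CI insns.

def refetched_insns(in_flight_after_branch, cd_wrong, cd_correct, ci_recovery):
    ci_insns = in_flight_after_branch - cd_wrong     # insns both paths share
    if ci_recovery:
        return cd_correct                            # fetch/execute only CD insns
    return ci_insns + cd_correct                     # re-fetch CI insns too (waste)

# Slide example: D (wrong-path CD) plus E, F, G (CI) in flight; correct path adds B, C.
print(refetched_insns(4, cd_wrong=1, cd_correct=2, ci_recovery=False))  # -> 5
print(refetched_insns(4, cd_wrong=1, cd_correct=2, ci_recovery=True))   # -> 2
```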

Out-of-Order Renaming
- CI step 1: replace the CD insns
- CI step 2: out-of-order renaming
  - Step 1 changes the inputs of some CI insns
  - CI data-dependent (CIDD) insns: F, and G (transitively, via F)
  1. Must identify CIDD insns and repair their inputs
  2. Must re-issue CIDD insns that have already issued
- Key feature of CI; its implementation distinguishes CI schemes

Start (wrong path):
  A: bez p1, D
  D: p2=2
  E: p3=p1+1
  F: p4=p2+1
  G: p5=ld(p4)

CI halfway (CD insns replaced):
  A: bez p1, D
  B: p6=1
  C: jmp E
  E: p3=p1+1
  F: p4=p2+1
  G: p5=ld(p4)

Goal (correct path):
  A: bez p1, D
  B: p6=1
  C: jmp E
  E: p3=p1+1
  F: p4=p6+1
  G: p5=ld(p4)

(Remember the CIDD acronym too.)
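A small Python sketch of the first requirement above: identifying CIDD insns by propagating "changed" registers transitively through the CI region. The instruction representation and the software-style taint walk are illustrative assumptions, not the paper's hardware.

```python
# Hypothetical instruction format: (dest_arch_reg_or_None, [src_arch_regs]).
def find_cidd(ci_insns, changed_regs):
    """Return indices of CI insns whose inputs must be repaired / re-issued."""
    tainted = set(changed_regs)            # arch regs whose values differ now
    cidd = []
    for idx, (dst, srcs) in enumerate(ci_insns):
        if any(s in tainted for s in srcs):
            cidd.append(idx)               # reads a changed value: CIDD
            if dst is not None:
                tainted.add(dst)           # its result propagates the change
        elif dst is not None:
            tainted.discard(dst)           # a CIDI insn overwrites a tainted reg
    return cidd

# Slide example: only F (reads r2) and G (reads r4, produced by F) are CIDD.
ci = [("r3", ["r1"]),   # E: r3 = r1 + 1
      ("r4", ["r2"]),   # F: r4 = r2 + 1
      ("r5", ["r4"])]   # G: r5 = ld(r4)
print(find_cidd(ci, {"r2"}))   # -> [1, 2]
```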

Outline
- Control independence (CI) and out-of-order renaming
- Prior CI microarchitectures (ooo renaming schemes)
  - "Walker"
  - Skipper
- Ginger
- Comparative performance evaluation
- Conclusion

"Walker" [Rotenberg+, HPCA'99]
- Ooo renaming: walk all CI insns
  - Re-rename; re-dispatch if inputs (transitively) changed
+ Reactive: no penalty on a correct prediction (no worse than base)
– High overhead on a mis-prediction
  - Walks and re-renames CI data-independent (CIDI) insns too: E
  - Typically many more of those than CIDD insns
  - Still better than the baseline

(Figure: after B, C are inserted, the walk visits E, F, G; F's input changed (p2 -> p6) so it re-dispatches, and G's input transitively changed so it re-dispatches too.)
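A hedged sketch of the walk loop this slide describes: every CI insn is visited, but only those whose inputs changed are re-dispatched. The data structures are illustrative, not the HPCA'99 design; destinations of re-dispatched insns receive fresh tags here purely to keep the transitive-change check simple.

```python
def walker_recover(ci_insns, map_table, free_tags):
    """ci_insns: program-order dicts with arch_dst, arch_srcs, phys_srcs, phys_dst.
    map_table: arch reg -> physical tag, already holding correct-path CD mappings."""
    redispatched = []
    for insn in ci_insns:                          # walk *all* CI insns (the overhead)
        new_srcs = [map_table[r] for r in insn["arch_srcs"]]
        if new_srcs != insn["phys_srcs"]:          # an input (transitively) changed
            insn["phys_srcs"] = new_srcs
            insn["phys_dst"] = free_tags.pop()     # re-rename the destination
            redispatched.append(insn)              # must re-dispatch / re-execute
        if insn["arch_dst"] is not None:
            map_table[insn["arch_dst"]] = insn["phys_dst"]
    return redispatched
```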

Skipper [Cher+, MICRO'01]
- Ooo renaming: proactive CI + pre-synchronization
  1. Defer CD fetch until the branch resolves (reserve space)
  2. Pre-synchronize: predict the CD output registers (r2) and pre-allocate
  3. After the correct-path CD insns, dispatch/execute "pmoves"
+ Low ooo renaming overhead on a mis-prediction
  - Proportional to the CD region's register output set
– Same overhead even on a correct prediction

Example:
  A: bez p1, D
  P: p9=??        (pre-synchronize: r2 pre-allocated to p9)
  E: p3=p1+1
  F: p4=p9+1
  G: p5=ld(p4)
  B: p6=1
  C: jmp E
  P: p9=p6        ("pmove" after the correct-path CD insns)
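A rough sketch of the pre-synchronization idea under simplified assumptions (a predicted CD output-register set and a flat map table; not Cher and Vijaykumar's exact design): placeholder tags are pre-allocated so CI insns can rename immediately, and pmoves patch in the real values once the branch resolves.

```python
def presynchronize(predicted_cd_outputs, map_table, free_tags):
    """Pre-allocate a placeholder physical tag per predicted CD output register."""
    placeholders = {}
    for arch_reg in predicted_cd_outputs:       # e.g. {"r2"} in the slide's example
        tag = free_tags.pop()                   # the "p9"-style placeholder
        placeholders[arch_reg] = tag
        map_table[arch_reg] = tag               # CI insns rename against it now
    return placeholders

def make_pmoves(placeholders, resolved_map):
    """After the correct-path CD insns rename, emit one pmove per placeholder."""
    return [("pmove", placeholders[r], resolved_map[r]) for r in placeholders]

free = ["p9"]
amap = {"r1": "p1", "r2": "p2"}
ph = presynchronize({"r2"}, amap, free)
print(make_pmoves(ph, {"r2": "p6"}))   # -> [('pmove', 'p9', 'p6')]
```

Note that the pmove is dispatched whether or not the branch was mis-predicted, which is where Skipper's correct-prediction overhead comes from.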

OOO Renaming: "Walker" + Skipper -> Ginger
"Walker": walk CI insns
+ Reactive: no overhead on correct predictions
– High overhead on mis-predictions: proportional to CI insns

Skipper: pre-synchronize
+ Low overhead on mis-predictions: proportional to CD registers
– Proactive: same overhead on correct predictions

Ginger: tag rewriting
+ Low overhead on mis-predictions: proportional to CD registers
+ Reactive: no overhead on correct predictions
  - Proactive also possible, but not really worth it
+ Uses (mostly) existing hardware
+ Supports ooo renaming of loads

Outline
- Control independence (CI) and out-of-order renaming
- Prior CI microarchitectures (ooo renaming schemes)
- Ginger
  - Tag rewriting
  - Selective re-dispatch
  - Out-of-order renaming for loads
  - Inserting CD insns
- Comparative performance evaluation
- Conclusion

Tag Rewriting at 32K Feet
- Recall: ooo renaming
  - Correctness: repair F's r2 input (p2 -> p6)
  - Performance: without walking E and G as well
- Tag rewriting: ooo renaming by register, not by insn
  1. Identify which registers have changed (r2: p2 -> p6)
  2. Do a fast "search-replace" on the CI insns
+ 1 step ("search-replace" p2 -> p6), not 3 (re-rename E, F, G)
? How to actually do both of these things

(Figure: the "CI halfway" and "correct path" listings again; only F changes, from p4=p2+1 to p4=p6+1.)
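A minimal sketch of the search-and-replace step, assuming a software list of issue-queue-resident CI insns (the hardware version comes on later slides). Only directly affected insns (F) need a tag change; transitive consumers (G) keep their tags and simply re-issue when the rewritten producers re-execute and broadcast.

```python
def rewrite_tags(ci_insns, tag_map):
    """tag_map: {old_phys_tag: new_phys_tag}, e.g. {"p2": "p6"} for r2."""
    rewritten = []
    for insn in ci_insns:
        new_srcs = [tag_map.get(t, t) for t in insn["phys_srcs"]]
        if new_srcs != insn["phys_srcs"]:
            insn["phys_srcs"] = new_srcs
            rewritten.append(insn)       # the initial CIDD wave: must re-issue
    return rewritten

# E, F, G from the slide: only F is touched; G keeps tag p4 and re-issues later,
# when the re-executed F broadcasts p4 again.
insns = [{"pc": "E", "phys_srcs": ["p1"]},
         {"pc": "F", "phys_srcs": ["p2"]},
         {"pc": "G", "phys_srcs": ["p4"]}]
print([i["pc"] for i in rewrite_tags(insns, {"p2": "p6"})])  # -> ['F']
```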

Tag Rewriting 1: Tracking Register Changes
- Active map table: correct-path mappings at E (the CI start)
- Need: a checkpoint of the wrong-path mappings at E
- Bitvectors identify which registers must be rewritten
  - "From -> to" = wrong-path mapping -> correct-path mapping (r2: p2 -> p6)
? How to get the wrong-path checkpoint (the "CI checkpoint")

(Figure: wrong-path and correct-path map tables for r1-r3; only r2's mapping differs, p2 vs. p6.)
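A sketch of step 1 with illustrative data structures: diff the CI checkpoint against the active map table, limited to the registers the write bitvectors flag, to build the old-tag -> new-tag rewrite map.

```python
def changed_mappings(ci_checkpoint, active_map, wrote_wrong, wrote_correct):
    """Return {old_phys_tag: new_phys_tag} for arch regs whose mapping changed."""
    tag_map = {}
    for arch_reg in (wrote_wrong | wrote_correct):    # union of write bitvectors
        old_tag = ci_checkpoint[arch_reg]             # wrong-path mapping at E
        new_tag = active_map[arch_reg]                # correct-path mapping at E
        if old_tag != new_tag:
            tag_map[old_tag] = new_tag                # e.g. p2 -> p6 for r2
    return tag_map

# r2 was written on both paths: p2 on the wrong path, p6 on the correct path.
print(changed_mappings({"r2": "p2"}, {"r2": "p6"}, {"r2"}, {"r2"}))  # -> {'p2': 'p6'}
```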

Tag Rewriting 0: Setup
- How do we know to create the CI checkpoint?
  - Predict that branch A is low-confidence [Jacobson+ MICRO'06]
  - Start tracking written registers
- How do we know where to create it?
  - Predict A's convergence PC: E [Cher+ MICRO'01, Collins+ MICRO'04]
  - Take the CI checkpoint just before the convergence PC is renamed

(Figure: the wrong-path listing with the map table and written-register bitvector at the point the checkpoint is taken.)

Tag Rewriting 2: Actual Tag Rewriting
- Tags must be rewritten in two places:
  - In younger issue queue entries
  - In younger map table checkpoints, so that future insns rename correctly

(Figure: the "CI halfway" listing with the affected issue queue entry, F: p4=p2+1, and the younger map table checkpoints still holding the stale p2 mapping.)
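A sketch of step 2 with hypothetical entry layouts: the rewrite map is applied to source tags in younger issue queue entries and to mappings inside younger map table checkpoints; an explicit age field stands in for the hardware's age tags.

```python
def apply_rewrite(issue_queue, checkpoints, tag_map, ci_age):
    """Rewrite stale tags in entries/checkpoints younger than the CI point."""
    for entry in issue_queue:
        if entry["age"] > ci_age:                    # only younger entries
            entry["src_tags"] = [tag_map.get(t, t) for t in entry["src_tags"]]
    for ckpt in checkpoints:
        if ckpt["age"] > ci_age:                     # only younger checkpoints
            for arch_reg, tag in ckpt["map"].items():
                ckpt["map"][arch_reg] = tag_map.get(tag, tag)

iq = [{"age": 5, "src_tags": ["p2"]}]                # F, still waiting on stale p2
apply_rewrite(iq, [{"age": 6, "map": {"r2": "p2"}}], {"p2": "p6"}, ci_age=4)
print(iq)   # -> [{'age': 5, 'src_tags': ['p6']}]
```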

Basic Tag Rewriting Approach
- Observe: tag-rewriting hardware (mostly) already exists
  - But it is used for different purposes: rename, dispatch, wakeup
- Exploit: borrow the existing hardware
  1. Stop the pipeline for a few cycles
  2. Walk the changed registers and rewrite tags
  3. Restart the pipeline with the correct dependences linked

Tag Rewriting Hardware
- Issue queue
  - Existing: wakeup match = "search", dispatch write = "replace"
  - Some additional logic may be necessary (age tags)
- Map table checkpoints
  - Some additional hardware here (but no associative search)
  - See the paper

(Figure: an issue queue entry showing the wakeup tag comparators, dispatch tag/ready-bit write ports, and the added age comparison.)

CIDD Re-Dispatch
- So far: tag rewriting for insns still in the issue queue
- A ROB-sized issue queue? Segmented/pipelined? [Hrishikesh+, ISCA'02]
  – No: it slows down the common-case wakeup/select
- Instead: a conventional issue queue; issued insns leave as usual
  - CIDD insns re-dispatch from someplace else
  - That place must itself support tag rewriting

(Figure: pipeline diagram — map table, issue queue, ROB, register file, execute — with a question mark over where re-dispatch should come from.)

CIDD Re-Dispatch
- Ginger: a ROB-sized re-dispatch queue
  - Internal wakeup/select re-dispatch loop, separate from issue wakeup/select
+ Supports tag rewriting to identify the initial re-dispatch wave
+ Transitively identifies the minimal dependent slice for re-dispatch
- Segmented/pipelined and "half-bandwidth" -> slow
  - Only 2% of insns re-dispatch -> slow is fine

(Figure: the pipeline diagram again, now with the re-dispatch queue alongside the ROB.)
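An illustrative model of the re-dispatch queue's private wakeup/select loop: the tag-rewrite pass seeds an initial wave, and only transitive dependents of re-executed results join it. The 2-wide select and the entry format are assumptions, not Ginger's exact parameters.

```python
from collections import deque

def redispatch(consumers, initial_wave, bandwidth=2):
    """consumers: {dest_tag: set_of_source_tags}; initial_wave: rewritten dest tags."""
    pending = deque(initial_wave)
    woken = set(initial_wave)
    reissued = []
    while pending:
        for _ in range(min(bandwidth, len(pending))):     # limited select bandwidth
            tag = pending.popleft()
            reissued.append(tag)
            # "broadcast": wake any entry whose sources include this re-executed tag
            for dest, srcs in consumers.items():
                if tag in srcs and dest not in woken:
                    woken.add(dest)
                    pending.append(dest)
    return reissued          # the minimal dependent slice, in re-dispatch order

# F (p4) is seeded by tag rewriting; G (p5) follows only because it consumes p4.
print(redispatch({"p4": {"p6"}, "p5": {"p4"}}, ["p4"]))   # -> ['p4', 'p5']
```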

CIDD Loads
- CIDD loads: depend (via memory) on CD stores
  - How are these identified when CD stores are inserted/removed?
- SQIP (store queue index prediction) [Sha+ MICRO'05]
  - A solution for large LSQs
  + Makes store-load forwarding act like register communication
  - Supports "store tag rewriting"

Example:
  A: bez r1, D
  B: r2=1
  C: jmp E
  D: st(r1)=2
  E: r3=r1+1
  F: r4=r2+1
  G: r5=ld(r1)   <- CIDD load: depends (via memory) on CD store D

SQIP and Store Tag Rewriting
- 15-second introduction to SQIP
  - Store map table: store PC -> SQ index
  - Forwarding predictor: load PC -> store PC
  - Load G -> store D -> SQ index 6
    - Load G's second register tag is 6
    - Load G indexes the SQ at position 6
- Store tag rewriting
  - Checkpoint and walk the store map table
  - Search-and-replace old SQ index -> new SQ index
  - Re-dispatch a load if its SQ-index tag has changed

(Figure: the store map table and SQ, with load G carrying SQ index 6 as an extra tag: G: p5=ld(p1), 6.)
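A hedged sketch of the SQIP pieces named above and of store tag rewriting; the structures are simplified stand-ins for the store map table, forwarding predictor, and SQ, not the SQIP paper's exact organization.

```python
def rename_load(load_pc, forwarding_pred, store_map):
    """Give a load an SQ-index 'tag' via load-PC -> store-PC -> SQ-index lookups."""
    store_pc = forwarding_pred.get(load_pc)      # predicted producer store
    return store_map.get(store_pc)               # SQ index, or None if no forwarding

def store_tag_rewrite(old_store_map, new_store_map, loads):
    """Rewrite loads' SQ-index tags after CD stores are inserted/removed."""
    index_map = {old: new_store_map[pc]          # old SQ index -> new SQ index
                 for pc, old in old_store_map.items()
                 if pc in new_store_map and new_store_map[pc] != old}
    redispatch = []
    for load in loads:                           # loads already carrying SQ tags
        if load["sq_tag"] in index_map:
            load["sq_tag"] = index_map[load["sq_tag"]]
            redispatch.append(load)              # forwarding source moved: re-issue
    return redispatch

print(rename_load("G", {"G": "D"}, {"D": 6}))    # -> 6, as in the slide
```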

Inserting CD Instructions
- Ginger uses proactive resource management (a la Skipper)
  - Not the same as proactive ooo renaming
  - Predict the convergence distance
  - Reserve ROB, LSQ, and physical registers for the CD insns
+ Simplifies CD insn insertion
+ Simplifies commit and recovery; avoids resource deadlocks
+ Keeps CI stores in their SQ positions: minimizes store tag rewriting
– Reduces window utilization, but still better than non-CI

(Figure: the example with a reserved gap between the branch and the CI insns; convergence distance here is 2 insns.)
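A small sketch of the proactive reservation step, with an assumed resource-pool object: space sized by the predicted convergence distance is held until the branch resolves, and the branch falls back to conventional recovery if the reservation cannot be made.

```python
def reserve_cd_region(pools, predicted_distance, predicted_stores, predicted_regs):
    """Hold ROB/SQ/physical-register space for the not-yet-fetched CD insns."""
    need = {"rob": predicted_distance,
            "sq": predicted_stores,      # keeps later CI stores at fixed SQ slots
            "pregs": predicted_regs}
    if any(pools[k] < n for k, n in need.items()):
        return None                      # can't reserve: fall back to normal recovery
    for k, n in need.items():
        pools[k] -= n                    # space is held until the branch resolves
    return need

print(reserve_cd_region({"rob": 512, "sq": 32, "pregs": 128}, 2, 1, 2))
# -> {'rob': 2, 'sq': 1, 'pregs': 2}
```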

Outline
- Control independence (CI) and out-of-order renaming
- Prior CI microarchitectures (ooo renaming schemes)
- Ginger
- Acronym pop quiz
- Comparative performance evaluation
- Conclusion

Experimental Methodology
- Goal: compare ooo renaming schemes
  - Re-implemented "Walker" and Skipper
  - All things equal other than ooo renaming
  - The paper also evaluates selective branch recovery (SBR) [Gandhi+ HPCA'04]
- Simulated configuration
  - 4-way fetch/issue/commit, 21-stage pipe, 512-entry ROB, 64-entry issue queue
  - 32KB hybrid gshare predictor, 8KB confidence predictor
  - 2-way, 8-stage re-dispatch; 16 checkpoints
  - Statically computed convergence PCs and distances
  - CI for branches with confidence <95% and convergence distance <256
- Benchmarks: SPECint2000, MediaBench, CommBench
  - Gmeans over the entire suite

Before We Start: Ideal CI
- Ideal CI: instantaneous, zero-bandwidth ooo renaming
  - Not a CI limit study in any other sense
  - The 95% confidence and 256 convergence-distance limits still apply
- Mis-predictions CI'ed: 55%
- Speedups: 8% SPECint, 14% Comm, 16% Media
  - Perfect branch prediction provides higher speedups

Comparative Performance: Ginger
- Mis-predictions CI'ed: 53%
- Speedups: 5% SPECint, 11% Comm, 12% Media
+ The ooo renaming overhead of tag rewriting is low: ~3%

Comparative Performance: Walker
- Mis-predictions CI'ed: 56%
  + Exploits more CI opportunities: 1 checkpoint per CI, not 2
- Speedups: 1% SPECint, 7% Comm, 5% Media
  – High rename/dispatch bandwidth overhead

Comparative Performance: Skipper
- Mis-predictions CI'ed: 29%
  – Penalty on a correct prediction -> possible slowdowns
  – Limits benefit to very low confidence branches (<80%)
  – In turn, limits CI opportunities
- Speedups: -1% SPECint, 8% Comm, 9% Media

More Insight: Dispatch Bandwidth
- Dispatch bandwidth limits commit bandwidth
  - Overhead: a slot spent on anything other than a committing insn
- Non-CI processor overheads
  – Squashed insns and fetch refill stalls: big components
  – Full-window stalls: smaller, partially due to mis-predictions

(Chart: dispatch-slot breakdown for vpr, SPECint.)

More Insight: Dispatch Bandwidth
- Effect of ideal CI
  + Reduces squashed insns: the CI insns are preserved
  + Reduces fetch refill stalls: front-end CI insns are not squashed, they dispatch
  – Increases full-window stalls: space reservation, higher utilization
  – Some low overhead for CIDD re-dispatch: ~2%

(Chart: dispatch-slot breakdown for vpr, SPECint.)

More Insight: Dispatch Bandwidth
- Effect of realistic CI
  - Some additional ooo renaming overhead: tag rewrites, pmoves
  - Additional inefficiencies and limitations

(Chart: dispatch-slot breakdown for vpr, SPECint.)

More Insight: Dispatch Bandwidth
- Ginger
  + Low ooo renaming overhead (tag rewriting); few other inefficiencies

(Chart: dispatch-slot breakdown for vpr, SPECint.)

More Insight: Dispatch Bandwidth
– Walker: high ooo renaming bandwidth overhead
– Skipper: very high ooo renaming bandwidth overhead
  - Restricted to very low confidence branches

(Chart: dispatch-slot breakdown for vpr, SPECint.)

Conclusions
- Control independence (CI)
  - Complements improvements in predictor accuracy
  - Ooo renaming is the most important feature; it should be:
    - Low-overhead on a mis-prediction
    - No overhead on a correct prediction ("reactive")
- Ginger: a new reactive CI microarchitecture
  - Out-performs previous schemes: "Walker", Skipper
  - Tag rewriting: a new ooo renaming scheme
    + Uses (largely) existing hardware
    + Supports ooo memory renaming too
  - New re-dispatch mechanism: potentially useful beyond CI

Selective Branch Recovery [Gandhi+, HPCA'04]
- Ooo renaming: annul wrong-path CD instructions
  1. Transform wrong-path CD insns into pmoves (in place)
  2. Re-dispatch them and the CIDD insns (from a recovery buffer)
– Limited applicability: can remove CD instructions, but not insert them
  - Requires exact convergence: works for "if-then", not "if-then-else"

Example:
  A: beqz p1, D
  D: p2 = p9      (D: p2 = 2 transformed to a "pmove", re-dispatched)
  E: p3 = p1+1
  F: p4 = p2+1    (re-dispatched)
  G: p5 = ld(p4)
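A simplified sketch of the SBR transform, based on a reading of the slide's example in which p9 stands for r2's pre-branch mapping: each wrong-path CD insn is rewritten in place into a pmove that restores its destination register's prior value, keeping the same destination tag so CIDD consumers stay linked.

```python
def sbr_annul(cd_insns, prior_map):
    """Turn wrong-path CD insns into pmoves restoring each dest's prior value."""
    pmoves = []
    for insn in cd_insns:                        # wrong-path CD insns, in place
        dst = insn["arch_dst"]
        if dst is None:
            continue                             # e.g. a store: no register to restore
        pmoves.append({"op": "pmove",
                       "phys_dst": insn["phys_dst"],      # keep the same dest tag
                       "phys_src": prior_map[dst]})       # value from before the branch
    return pmoves

# Slide example: D: p2 = 2 becomes D: p2 = p9 (r2's pre-branch value lives in p9).
print(sbr_annul([{"arch_dst": "r2", "phys_dst": "p2"}], {"r2": "p9"}))
# -> [{'op': 'pmove', 'phys_dst': 'p2', 'phys_src': 'p9'}]
```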

Comparative Performance: SBR
- Mis-predictions CI'ed: 26%
  – Inability to insert CD insns limits CI opportunities
- Speedups: 0% SPECint, 5% Comm, 3% Media
  – The CD-to-pmove transform adds latency -> possible slowdowns