Ginger: Control Independence Using Tag Rewriting Andrew Hilton, Amir Roth University of Pennsylvania {adhilton, amir}@cis.upenn.edu ISCA-34 :: June, 2007

Control Independence (CI)
Branch mispredictions limit single-thread performance
  Improve prediction accuracy? Hard
  Predicate? Cost on correct predictions
  Exploit control independence (CI) to reduce squash penalty
This paper: Ginger, a new (better) CI microarchitecture
Example:
  A: bez r1, D
  B: r2=1
  C: jmp E
  D: r2=2
  E: r3=r1+1
  F: r4=r2+1
  G: r5=ld(r4)
Control dependent (CD) insns: B, C, D
Control independent (CI) insns: E, F, G
Remember the acronyms CI and CD

Exploiting Control Independence
Conventional recovery
  1 Squash all post mis-prediction insns
  2 Fetch/execute all correct-path insns
  - Re-fetch/re-execute CI insns (waste)
CI recovery
  1 Squash only wrong-path CD insns
  2 Fetch/execute only correct-path CD insns
  + Preserve CI insns: E, F, G
  + Preserve un-dispatched CI insns: H, I…
  ? How to “insert” CD insns?
  ? What to do about CI insns that depend on CD insns?
Example: the wrong path is A, D, E, F, G and the correct path is A, B, C, E, F, G. CI recovery replaces D: r2=2 with B: r2=1 and C: jmp E, and keeps E, F, G. F: r4=r2+1 reads r2, which the CD insns write.

Out-of-Order Renaming
CI step 1: replace CD insns
CI step 2: out-of-order renaming
  Step 1 changes inputs for some CI insns
  CI data dependent (CIDD) insns: F and G (transitively, via F)
  1 Must identify CIDD insns and repair their inputs
  2 Must re-issue CIDD insns that have already issued
  Key feature of CI; its implementation distinguishes CI schemes
Start (wrong path):  A: bez p1, D; D: p2=2; E: p3=p1+1; F: p4=p2+1; G: p5=ld(p4)
CI halfway:          A: bez p1, D; B: p6=1; C: jmp E; E: p3=p1+1; F: p4=p2+1; G: p5=ld(p4)
Goal (correct path): A: bez p1, D; B: p6=1; C: jmp E; E: p3=p1+1; F: p4=p6+1; G: p5=ld(p4)
Remember the CIDD acronym too
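The identification of CIDD insns can be illustrated with a minimal Python sketch. It assumes the CI insns are available in program order with their physical tags; the `Insn` type and the `find_cidd` function are hypothetical names for illustration, not part of any scheme described in the talk.

```python
from typing import NamedTuple, Optional, Tuple

class Insn(NamedTuple):
    label: str
    dest: Optional[str]      # physical destination tag, None if no register output
    srcs: Tuple[str, ...]    # physical source tags

def find_cidd(ci_insns, changed_tags):
    """Return labels of CI insns that read a changed tag, directly or transitively."""
    stale = set(changed_tags)          # tags whose values are no longer valid
    cidd = []
    for insn in ci_insns:              # program order
        if any(src in stale for src in insn.srcs):
            cidd.append(insn.label)
            if insn.dest is not None:
                stale.add(insn.dest)   # consumers of this result are CIDD too
    return cidd

# CI insns from the slide, as renamed on the wrong path.
ci_insns = [
    Insn("E", "p3", ("p1",)),
    Insn("F", "p4", ("p2",)),
    Insn("G", "p5", ("p4",)),
]
print(find_cidd(ci_insns, {"p2"}))     # ['F', 'G']
```

F reads the stale tag p2 directly; G is caught only because F's output p4 becomes stale, which is the transitive case the slide calls out.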

Outline
Control Independence (CI) and out-of-order renaming
Prior CI microarchitectures (ooo renaming schemes)
  “Walker”
  Skipper
Ginger
Comparative performance evaluation
Conclusion

“Walker” [Rotenberg+, HPCA’99]
Ooo renaming: walk all CI insns
  Re-rename, re-dispatch if inputs (transitively) changed
+ Reactive: no penalty on correct prediction (no worse than base)
- High overhead on mis-prediction
  Walks and re-renames CI data independent (CIDI) insns too: E
  Typically many more of those than CIDD
  Still better than baseline
Example: A: bez p1, D; B: p6=1; C: jmp E; E: p3=p1+1; F: p4=p2+1 becomes F: p4=p6+1 (input changed → re-dispatch); G: p5=ld(p4) (input transitively changed → re-dispatch)
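A rough Python sketch of the walk, under stated assumptions: each CI insn records both its logical registers and the physical tags it was renamed with, destination tags are kept across re-renaming (as in the slide, where G keeps p5 and F keeps p4), and `walk_ci_insns` is a hypothetical name. It is meant only to show why the cost scales with the number of CI insns walked rather than with the number re-dispatched.

```python
def walk_ci_insns(ci_insns, map_table):
    """Walk every CI insn; return (insns walked, {label: repaired source tags})."""
    map_table = dict(map_table)        # working copy of the corrected map table
    stale = set()                      # destination tags whose values will change
    walked = 0
    redispatch = {}
    for insn in ci_insns:              # program order
        walked += 1
        new_srcs = tuple(map_table[r] for r in insn["src_regs"])
        if new_srcs != insn["src_tags"] or any(t in stale for t in new_srcs):
            redispatch[insn["label"]] = new_srcs
            stale.add(insn["dest_tag"])
        map_table[insn["dest_reg"]] = insn["dest_tag"]
    return walked, redispatch

ci_insns = [
    {"label": "E", "dest_reg": "r3", "dest_tag": "p3", "src_regs": ("r1",), "src_tags": ("p1",)},
    {"label": "F", "dest_reg": "r4", "dest_tag": "p4", "src_regs": ("r2",), "src_tags": ("p2",)},
    {"label": "G", "dest_reg": "r5", "dest_tag": "p5", "src_regs": ("r4",), "src_tags": ("p4",)},
]
# Map table after the correct-path CD insns (B: p6=1) have been renamed.
walked, redispatch = walk_ci_insns(ci_insns, {"r1": "p1", "r2": "p6"})
print(walked, redispatch)              # 3 {'F': ('p6',), 'G': ('p4',)}
```

All three insns are walked, including the CIDI insn E, although only F and G need to re-dispatch.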

Skipper [Cher+, MICRO’01]
Ooo renaming: proactive CI + pre-synchronization
  1 Defer CD fetch until branch resolves (reserve space)
  2 Pre-synchronize: predict CD output registers (r2) and pre-allocate
  3 After correct-path CD, dispatch/execute “pmoves”
+ Low ooo renaming overhead on mis-prediction
  Proportional to CD region register output set
- Same overhead even on correct prediction
Example: A: bez p1, D; pre-synchronize P: p9=??; E: p3=p1+1; F: p4=p9+1; G: p5=ld(p4); then B: p6=1; C: jmp E; “pmove” P: p9=p6
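The pre-synchronization steps can be sketched as follows. This is an illustrative model only: map tables are plain dictionaries, the function names `pre_synchronize` and `make_pmoves` are hypothetical, and the pre-branch mapping `p0` for r2 is an assumed placeholder that does not appear in the slides.

```python
def pre_synchronize(map_table, predicted_outputs, free_tags):
    """Before CI insns are renamed, give each predicted CD output a fresh tag."""
    sync = {}
    for reg in predicted_outputs:
        tag = free_tags.pop(0)
        map_table[reg] = tag           # CI insns will be renamed against this tag
        sync[reg] = tag
    return sync

def make_pmoves(sync, cd_map_table):
    """After the CD region is renamed, copy its outputs into the reserved tags."""
    return [(tag, cd_map_table[reg]) for reg, tag in sync.items()]

map_table = {"r1": "p1", "r2": "p0"}   # p0: assumed pre-branch mapping of r2
sync = pre_synchronize(map_table, ["r2"], ["p9"])
print(map_table["r2"])                 # p9, so F is renamed as p4=p9+1

# Correct-path CD region B: p6=1 leaves r2 mapped to p6.
pmoves = make_pmoves(sync, {"r1": "p1", "r2": "p6"})
print(pmoves)                          # [('p9', 'p6')], i.e. pmove P: p9=p6
```

One pmove is produced per predicted output register whether or not the branch was mispredicted, which is the overhead on correct predictions that the slide lists as the drawback.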

OOO Renaming: “Walker” + Skipper → Ginger
“Walker”: walk CI insns
  + Reactive: no overhead on correct predictions
  - High overhead on mis-predictions: proportional to CI insns
Skipper: pre-synchronize
  + Low overhead on mis-predictions: proportional to CD registers
  - Proactive: same overhead on correct predictions
Ginger: tag rewriting
  + Low overhead on mis-predictions: proportional to CD registers
  + Reactive: no overhead on correct predictions
    Proactive also possible, but not really worth it
  + Uses (mostly) existing hardware
  + Supports ooo renaming of loads

Outline
Control Independence (CI) and out-of-order renaming
Prior CI microarchitectures (ooo renaming schemes)
Ginger
  Tag rewriting
  Selective re-dispatch
  Out-of-order renaming for loads
  Inserting CD insns
Comparative performance evaluation
Conclusion

Tag Rewriting at 32K Feet
Recall: ooo renaming
  Correctness: repair F’s r2 input p2 → p6
  Performance: without walking E and G also
Tag rewriting: ooo renaming by register, not by insn
  1 Identify which registers have changed (r2: p2 → p6)
  2 Do a fast “search-replace” on CI insns
  + 1 step (“search-replace” p2 → p6), not 3 (re-rename E, F, G)
  ? How to actually do both of these things
CI halfway (you are “here”): A: bez p1, D; B: p6=1; C: jmp E; E: p3=p1+1; F: p4=p2+1; G: p5=ld(p4)
Goal (correct path):          A: bez p1, D; B: p6=1; C: jmp E; E: p3=p1+1; F: p4=p6+1; G: p5=ld(p4)

Tag Rewriting 1: Tracking Register Changes
Active map table: correct-path mappings at E (CI start)
Need: checkpoint for wrong-path mappings at E
Bitvectors identify which registers must be rewritten
  From → to = wrong-path → correct-path
? How to get wrong-path checkpoint (“CI checkpoint”)
[Figure: map tables over r1, r2, r3. The CI checkpoint (wrong path) maps r2 to p2; the active map table (correct path, at “CI halfway”) maps r2 to p6. Written-register bitvectors are combined with OR to select r2 for rewriting, p2 → p6.]
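A minimal sketch of deriving the rewrite pairs, assuming map tables are dictionaries, bitvectors are Python integers with bit i standing for register `REGS[i]`, and the wrong-path and correct-path bitvectors are combined with OR. The function name `rewrite_pairs` and the pre-branch mapping `p11` for r3 are hypothetical placeholders, not taken from the slides.

```python
REGS = ["r1", "r2", "r3"]

def rewrite_pairs(ci_checkpoint, active_map, wrong_written, correct_written):
    """Return {wrong_path_tag: correct_path_tag} for registers that changed."""
    changed = wrong_written | correct_written   # written on either path
    pairs = {}
    for i, reg in enumerate(REGS):
        if (changed >> i) & 1 and ci_checkpoint[reg] != active_map[reg]:
            pairs[ci_checkpoint[reg]] = active_map[reg]
    return pairs

ci_checkpoint = {"r1": "p1", "r2": "p2", "r3": "p11"}   # wrong path, taken at E
active_map    = {"r1": "p1", "r2": "p6", "r3": "p11"}   # correct path, at E
wrong_written   = 0b010    # D: p2=2 wrote r2
correct_written = 0b010    # B: p6=1 wrote r2

print(rewrite_pairs(ci_checkpoint, active_map, wrong_written, correct_written))
# {'p2': 'p6'}
```

Only registers flagged in the bitvectors are compared, so the work is bounded by the CD region's register output set rather than by the number of CI insns.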

Tag Rewriting 0: Setup
How do we know to create the CI checkpoint?
  Predict that branch A is low-confidence [Jacobson+ MICRO’06]
  Start tracking written registers
How do we know where to create it?
  Predict A’s convergence PC: E [Cher+ MICRO’01, Collins+ MICRO’04]
  Take CI checkpoint before convergence PC is renamed
Start (wrong path): A: bez p1, D; D: p2=2; E: p3=p1+1; F: p4=p2+1; G: p5=ld(p4)
[Figure: the written-register bitvector over r1, r2, r3 starts at 000; renaming D: p2=2 sets the bit for r2; the CI checkpoint is taken before E.]
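The setup steps might look like the following sketch, which renames the wrong-path insns after a low-confidence branch, sets a bit for each written register, and takes the CI checkpoint just before the predicted convergence PC. The function name, the instruction encoding as `(pc, dest_reg)` pairs, and the pre-branch tags `p10` and `p11` are all assumptions made for illustration.

```python
LOGICAL = ["r1", "r2", "r3", "r4", "r5"]

def rename_until_convergence(insns, map_table, convergence_pc, free_tags):
    """Rename insns up to convergence_pc; return (ci_checkpoint, written_bits)."""
    map_table = dict(map_table)        # working copy
    free = list(free_tags)
    written = 0                        # bitvector of logical registers written
    for pc, dest in insns:
        if pc == convergence_pc:
            break                      # checkpoint is taken before this insn
        if dest is not None:
            map_table[dest] = free.pop(0)
            written |= 1 << LOGICAL.index(dest)
    return map_table, written

# Wrong path after branch A; p10 and p11 are assumed pre-branch mappings.
wrong_path = [("D", "r2"), ("E", "r3"), ("F", "r4"), ("G", "r5")]
start_map = {"r1": "p1", "r2": "p10", "r3": "p11"}
ci_checkpoint, written = rename_until_convergence(
    wrong_path, start_map, "E", ["p2", "p3", "p4", "p5"])
print(ci_checkpoint["r2"], bin(written))   # p2 0b10
```

In this model the checkpoint and bitvector are cheap bookkeeping done during normal renaming, consistent with the slide's point that setup is triggered only for branches predicted to be low-confidence.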

Tag Rewriting 2: Actual Tag Rewriting
Tags must be re-written in two places
  In younger issue queue entries
  In younger map table checkpoints: to rename future insns correctly
CI halfway (you are “here”): A: bez p1, D; B: p6=1; C: jmp E; E: p3=p1+1; F: p4=p2+1; G: p5=ld(p4)
[Figure: map table checkpoints over r1, r2, r3 in which r2 is rewritten from p2 to p6, alongside the issue queue entry F: p4=p2+1.]
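A software analogy of the two rewriting targets, as a sketch only: issue queue entries and checkpoints are modeled as dictionaries, the function names are hypothetical, and the age comparison that restricts rewriting to younger entries is not modeled.

```python
def rewrite_issue_queue(entries, pairs):
    """Search-and-replace source tags in place; return labels that matched."""
    matched = []
    for entry in entries:
        new_srcs = tuple(pairs.get(tag, tag) for tag in entry["srcs"])
        if new_srcs != entry["srcs"]:
            entry["srcs"] = new_srcs
            matched.append(entry["label"])
    return matched

def rewrite_checkpoints(checkpoints, pairs):
    """Apply the same replacement to every (younger) map table checkpoint."""
    for checkpoint in checkpoints:
        for reg, tag in checkpoint.items():
            checkpoint[reg] = pairs.get(tag, tag)

issue_queue = [
    {"label": "E", "dest": "p3", "srcs": ("p1",)},
    {"label": "F", "dest": "p4", "srcs": ("p2",)},
    {"label": "G", "dest": "p5", "srcs": ("p4",)},
]
checkpoints = [{"r1": "p1", "r2": "p2", "r3": "p3", "r4": "p4"}]  # e.g. taken after F

pairs = {"p2": "p6"}
print(rewrite_issue_queue(issue_queue, pairs))   # ['F']
rewrite_checkpoints(checkpoints, pairs)
print(checkpoints[0]["r2"])                      # p6
```

The loop over entries stands in for what the hardware would do associatively in one step per rewrite pair; insns renamed later from the repaired checkpoint pick up p6 without any further work.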

Basic Tag Rewriting Approach
Observe: tag rewriting hardware (mostly) exists
  But used for different purposes: rename, dispatch, wakeup
Exploit: borrow existing hardware
  1 Stop the pipeline for a few cycles
  2 Walk changed registers & tag rewrite
  3 Restart the pipeline with correct dependences linked

Tag Rewriting Hardware
Issue queue
  Existing: wakeup match = “search”, dispatch write = “replace”
  Some additional logic may be necessary (age tags)
Map table checkpoints
  Some additional hardware here (but not associative search)
  See paper
[Figure: issue queue entry with tag and ready fields, wakeup tag comparators (==), dispatch tags/ready bits, and age comparators (>).]

CIDD Re-Dispatch
So far: tag rewriting for insns in issue queue
  ROB-size issue queue? Segmented/pipelined? [Hrishikesh+, ISCA’02]
  - No, slows down common-case wakeup/select
Now: conventional issue queue, issued insns leave as usual
  CIDD insns re-dispatch from someplace
  That place itself must support tag rewriting
[Figure: pipeline with map table, ready bits, issue queue, ROB, regfile, and exec; the source of re-dispatched insns is marked “?”.]

CIDD Re-Dispatch
Ginger: a ROB-sized re-dispatch queue
  Internal wakeup/select re-dispatch loop
  Separate from issue wakeup/select
  + Supports tag rewriting to identify initial re-dispatch wave
  + Transitively identifies minimal dependent slice for re-dispatch
  Segmented/pipelined and “half-bandwidth” → slow
  Only 2% of insns re-dispatch → slow is fine
[Figure: pipeline with map table, ready bits, issue queue, ROB, regfile, and exec, with the re-dispatch queue added.]

CIDD Loads
CIDD loads: depend (via memory) on CD stores
  How are these identified when CD stores inserted/removed?
SQIP (store queue index prediction) [Sha+ MICRO’05]
  Solution for large LSQ
  + Makes store-load forwarding act like register communication
  Supports “store tag rewriting”
Example: A: bez r1, D; B: r2=1; C: jmp E; D: st(r1)=2; E: r3=r1+1; F: r4=r2+1; G: r5=ld(r1)

SQIP and Store Tag Rewriting
15 second introduction to SQIP
  Store map table: store-PC → SQ index
  Forwarding predictor: load-PC → store-PC
  Load G → store D → SQ index 6
  Load G’s second register tag is 6
  Load G indexes SQ at position 6
Store tag rewriting
  Checkpoint & walk store map table
  Search-and-replace old-SQ-index → new-SQ-index
  Re-dispatch load if SQ-index tag has changed
Example: A: bez p1, D; D: st(p1)=1, @6; E: p3=p1+1; F: p4=p2+1; G: p5=ld(p1), 6
[Figure: forwarding predictor and store map table contents.]
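Store tag rewriting can be sketched by analogy with register tag rewriting. The model below is an assumption-laden illustration: the store map table and forwarding predictor are dictionaries, a missing store map entry is represented as `None`, and `rewrite_store_tags` is a hypothetical name. How SQIP itself handles a squashed store's map entry is not specified in the slides.

```python
def rewrite_store_tags(loads, forwarding_pred, old_store_map, new_store_map):
    """Replace stale SQ-index tags on CI loads; return PCs of loads to re-dispatch."""
    redispatch = []
    for load in loads:
        store_pc = forwarding_pred.get(load["pc"])
        if store_pc is None:
            continue                       # load is not predicted to forward
        old_index = old_store_map.get(store_pc)
        new_index = new_store_map.get(store_pc)
        if load["sq_tag"] == old_index and old_index != new_index:
            load["sq_tag"] = new_index     # search-and-replace old → new
            redispatch.append(load["pc"])
    return redispatch

forwarding_pred = {"G": "D"}               # load G is predicted to read store D
old_store_map = {"D": 6}                   # wrong path: store D sits at SQ index 6
new_store_map = {}                         # correct path (B, C) has no store D

loads = [{"pc": "G", "sq_tag": 6}]
print(rewrite_store_tags(loads, forwarding_pred, old_store_map, new_store_map))
# ['G']
print(loads[0]["sq_tag"])                  # None
```

Treating the SQ index as a second source tag is what lets the load reuse the same search-and-replace and re-dispatch path as register-dependent CIDD insns.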

Inserting CD Instructions
Ginger uses proactive resource management (a la Skipper)
  Not the same as proactive ooo renaming
  Predict convergence distance
  Reserve ROB, LSQ, and physical registers for the CD insns
  + Simplifies CD insn insertion
  + Simplifies commit and recovery, avoids resource deadlocks
  + Keeps CI stores in SQ positions: minimizes store tag rewriting
  - Reduces window utilization, but still better than non-CI
Example (convergence distance: here 2 insns): A: bez p1, D; D: p2=2; E: p3=p1+1; F: p4=p2+1; G: p5=ld(p4)

Outline
Control Independence (CI) and out-of-order renaming
Prior CI microarchitectures (ooo renaming schemes)
Ginger
Acronym pop quiz
Comparative performance evaluation
Conclusion

Experimental Methodology
Goal: compare ooo renaming schemes
  Re-implemented “Walker”, Skipper
  All things equal other than ooo renaming
  Paper also has selective branch recovery (SBR) [Gandhi+ HPCA’04]
Simulated configuration
  4-way fetch/issue/commit, 21-stage pipe, 512 ROB, 64 issue queue
  32KB hybrid gShare, 8KB confidence predictor
  2-way, 8-stage re-dispatch, 16 checkpoints
  Statically computed convergence PCs & distances
  CI for branches with confidence <95%, convergence distance <256
Benchmarks: SPECint2000, MediaBench, CommBench
  Gmeans over entire suite

Before We Start: Ideal CI
Ideal CI: instantaneous, zero bandwidth ooo renaming
  Not a CI limit study in any other sense
  95% confidence, 256 convergence distance limits apply
Mis-predictions CI’ed: 55%
Speedups: 8% SPECint, 14% Comm, 16% Media
  Perfect branch prediction provides higher speedups

Comparative Performance: Ginger
Mis-predictions CI’ed: 53%
Speedups: 5% SPECint, 11% Comm, 12% Media
+ Ooo renaming overhead of tag rewriting is low: ~3%

Comparative Performance: Walker
Mis-predictions CI’ed: 56%
+ Exploits more CI opportunities: 1 checkpoint per CI, not 2
Speedups: 1% SPECint, 7% Comm, 5% Media
- High rename/dispatch bandwidth overhead

Comparative Performance: Skipper
Mis-predictions CI’ed: 29%
- Penalty on correct prediction → possible slowdowns
- Limits benefit to very low confidence branches (<80%)
- In turn, limits CI opportunities
Speedups: -1% SPECint, 8% Comm, 9% Media

More Insight: Dispatch Bandwidth
Dispatch bandwidth: limits commit bandwidth
  Overhead: slot spent on anything other than committing insn
Non-CI processor overheads
  - Squashed insns/fetch refill stalls: big components
  - Full window stalls: smaller, partially due to mis-predictions
[Figure: dispatch bandwidth breakdown for vpr (SPECint).]

More Insight: Dispatch Bandwidth
Effect of ideal CI
  + Reduces squashed insns: CI insns
  + Reduces fetch refill stalls: don’t squash front-end insns, dispatch
  - Increases full window stalls: space reservation, higher utilization
  - Some low overhead for CIDD re-dispatch: ~2%
[Figure: dispatch bandwidth breakdown for vpr (SPECint).]

More Insight: Dispatch Bandwidth
Effect of realistic CI
  Some additional ooo renaming overhead: tag rewrites, pmoves
  Additional inefficiencies and limitations
[Figure: dispatch bandwidth breakdown for vpr (SPECint).]

More Insight: Dispatch Bandwidth
Ginger
  + Low ooo renaming overhead (tag rewriting); few other inefficiencies
[Figure: dispatch bandwidth breakdown for vpr (SPECint).]

More Insight: Dispatch Bandwidth
- Walker: high ooo renaming bandwidth overhead
- Skipper: very high ooo renaming bandwidth overhead
  Restricted to very low confidence branches
[Figure: dispatch bandwidth breakdown for vpr (SPECint).]

Conclusions
Control independence (CI)
  Complements improvements in predictor accuracy
  Ooo renaming: most important feature, should be:
    Low-overhead on mis-prediction
    No overhead on correct prediction (“reactive”)
Ginger: new reactive CI microarchitecture
  Out-performs previous schemes: “Walker”, Skipper
  Tag rewriting: new ooo renaming scheme
    + Uses (largely) existing hardware
    + Supports ooo memory renaming too
  New re-dispatch mechanism: potentially useful beyond CI

Selective Branch Recovery [Gandhi+, HPCA’04]
Ooo renaming: annul wrong-path CD instructions
  1 Transform wrong-path CD insns to pmoves (in place)
  2 Re-dispatch them and CIDD insns (from recovery buffer)
- Limited applicability: can remove CD instructions, but not insert
  Exact convergence: works for “if-then”, not “if-then-else”
Example: A: beqz p1, D; D: p2 = 2 becomes D: p2 = p9 (transform to “pmove”, re-dispatch); E: p3 = p1+1; F: p4 = p2+1 (re-dispatch); G: p5 = ld(p4) (re-dispatch)

Comparative Performance: SBR
Mis-predictions CI’ed: 26%
- Inability to insert CD insns limits CI opportunities
Speedups: 0% SPECint, 5% Comm, 3% Media
- CD to pmove transform adds latency → possible slowdowns
