Exploiting Eager Register Release in a Redundantly Multi-threaded Processor
Niti Madan, Rajeev Balasubramonian
School of Computing, University of Utah

Introduction
Rising soft error rates due to shrinking transistor sizes and lower supply voltages
Existing solutions:
Process level: SOI
Circuit level: rad-hard cells, ECC, BISER
Architecture level:
– Redundant multithreading
– Reducing the time useful state spends in unprotected structures
– Software-assisted fault tolerance

Introduction
CMPs/SMTs enable redundant multithreading (RMT)
"Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002
– 2 processors/threads execute the same program

Chip-level Redundant Multi-threading (CRTR)
Diagram: Processor 1 runs leading thread 1 and trailing thread 2; Processor 2 runs trailing thread 1 and leading thread 2. Both processors are out-of-order (OoO). Branch outcomes, loads, register values, and stores are passed between the processors. Each trailing thread lags behind its leading thread by some slack.

Motivation
The register file is already a critical resource:
– impacts ILP
– impacts cycle time
– impacts peak temperature
Multiple threads increase pressure on the register file

Motivation
Out-of-order processors are "conservative" since they must preserve correctness
– Example: registers are de-allocated conservatively
Having a trailing thread allows the leading thread to be aggressive
– improves the performance of the leading thread
– trailer state can be used to ensure correctness
– some errors may go undetected

Diagram: Processor 1 runs Leading 1 and Processor 2 runs Trailing 1, connected by the RVQ. Labels in the figure: "lr5 = …, lr5 mapped to R1", "lr5 = …, lr5 mapped to R2", "Branch Mispredict", and registers R2 and R1.

Diagram: Processor 1 runs Leading 1 and Processor 2 runs Trailing 1, connected by the RVQ. Labels in the figure: R1, R1’, "Soft error", "Mispredict Recovery", "Fault Propagates".
Very few errors slip through: the slack is usually less than the RVQ size

Our Approach
An RMT processor holds duplicate register value state in the RVQ and the trailer’s state
Improve register file efficiency using Eager Register Release
A smaller register file can deliver the same performance using this technique
– Reduced power
– Increased reliability: ECC is less expensive
– Potentially faster clock speed

Outline
Background on RMT design space
Proposed technique
Evaluation
Conclusions & Future Work

Redundant Multi-threading
Fault model
– Trailer’s state is used for recovery; this does not provide complete recovery
– Caches and the Load Value Queue (LVQ) are ECC protected
– Can detect all single event upset faults
Baseline RMT models include SRTR, CRTR, ST-P-CRTR, MT-P-CRTR

Baseline RMT Model
SRTR (SMT-level RMT): leading thread 1 and trailing thread 1 run on one out-of-order processor
CRTR (chip-level RMT): Processor 1 runs leading 1 and trailing 2; Processor 2 runs trailing 1 and leading 2; both are out-of-order and are connected by the LVQ, BOQ, and RVQ
Proposed by Mukherjee et al. ISCA 2002, Gomaa et al. ISCA 2002, ISCA 2003

Power-efficient RMT model
Our earlier work explores a power-efficient RMT model, P-CRTR (SELSE-2, Tech Report 2005)
Observations
– The trailing thread doesn’t suffer from D-cache misses or branch mispredictions
– The trailing thread is bound to have higher IPC
High trailer IPC enables power reduction
Techniques proposed for power efficiency:
– Dynamic frequency scaling
– In-order execution of the trailer

Dynamic Frequency Scaling
High trailer IPC enables frequency reduction
Reduce the trailer’s frequency to match the leader’s throughput
Reduces the trailer’s dynamic power
Does not impact the trailer’s leakage power
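
The frequency-matching idea on this slide can be sketched as a small calculation. This is only an illustration, assuming throughput is IPC times clock frequency and that the trailer's IPC does not change with frequency; the slides give no formula, and the function name and sample numbers are hypothetical.

```python
def trailer_frequency(leader_freq_ghz, leader_ipc, trailer_ipc):
    """Lowest trailer frequency whose throughput matches the leader's.

    Assumption (not from the slides): throughput = IPC * frequency.
    The result is capped at the leader's frequency.
    """
    if leader_ipc <= 0 or trailer_ipc <= 0:
        raise ValueError("IPC values must be positive")
    needed = leader_freq_ghz * leader_ipc / trailer_ipc
    return min(needed, leader_freq_ghz)


# Hypothetical numbers: leader at 2.0 GHz with IPC 1.0, trailer IPC 2.0.
print(trailer_frequency(2.0, 1.0, 2.0))  # 1.0
```

Under this model, a trailer with twice the leader's IPC could run at half the clock and still keep up.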

In-order Execution of Checker
Our approach
– Send all register values computed by the leading core to the trailer (register value prediction, 100% accurate if there is no fault)
– Trailer reads source operands from the RVQ
– Trailer verifies source operands at commit
RVP enables perfect IPC: no stalls
Cost: extra communication overhead
Benefit: overall reduced dynamic and leakage power
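
A minimal sketch of the checker loop described above, assuming each RVQ entry carries the two source operand values the leader used for one instruction. The tuple layout, register names, and `run_trailer` function are hypothetical and not taken from the slides.

```python
from collections import deque
import operator


def run_trailer(program, rvq, regfile):
    """Execute `program` in order, taking source operands from the RVQ.

    program: list of (dest, src1, src2, op) tuples.
    rvq:     deque of (val1, val2) operand values forwarded by the leader.
    regfile: the trailer's own register state (dict), updated in place.
    Returns the index of the first instruction whose forwarded operands
    disagree with the trailer's registers, or None if all of them match.
    """
    for i, (dest, src1, src2, op) in enumerate(program):
        val1, val2 = rvq.popleft()      # operands come from the RVQ, so no stalls
        result = op(val1, val2)
        # Verification at commit: forwarded operands must match trailer state.
        if regfile[src1] != val1 or regfile[src2] != val2:
            return i
        regfile[dest] = result
    return None


program = [("r3", "r1", "r2", operator.add), ("r4", "r3", "r1", operator.mul)]
print(run_trailer(program, deque([(2, 3), (5, 2)]), {"r1": 2, "r2": 3}))  # None
# The leader's copy of r3 is corrupted to 7 before the second instruction reads it.
print(run_trailer(program, deque([(2, 3), (7, 2)]), {"r1": 2, "r2": 3}))  # 1
```

In the second run the corrupted operand is caught when the consuming instruction commits, because the trailer computed r3 = 5 on its own.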

ST-P-CRTR
Single-thread workloads
Diagram: Processor 1 (out-of-order) runs leading 1; Processor 2 (in-order) runs trailing 1; they are connected by the LVQ, BOQ, and RVQ.

MT-P-CRTR
Multi-threaded workloads
Diagram: Processor 1 (out-of-order) runs leading 1 and leading 2; Processor 2 (in-order) runs trailing 1; Processor 3 (in-order) runs trailing 2. Each in-order processor is connected to Processor 1 by its own LVQ, BOQ, and RVQ.

Eager Register Release
– Releases the older physical register once its logical register has been rewritten and the old value has been used by all consumers
– Requires a mechanism to store the released state elsewhere
Original code:
lr3 = lr1, lr2
lr5 = lr3, lr4
Branch to x
lr3 = …
Renamed code:
pr21 = pr8, pr11
pr15 = pr21, pr12
Branch to x
pr29 = …
lr3 has 2 mappings: new pr29 and old pr21
Conventionally, pr21 cannot be released until the branch resolves
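
The difference between the two release policies can be sketched with the pr21 example from this slide. The `PhysRegState` fields and both predicates are hypothetical names chosen for illustration; the conventional policy is modelled here as "release when the overwriting instruction commits", which is an assumption about the baseline.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    overwritten: bool = False          # a newer mapping of the logical register exists
    pending_consumers: int = 0         # renamed readers that have not yet read the value
    redefiner_committed: bool = False  # the overwriting instruction has committed


def releasable_conventional(state):
    # Assumed baseline: wait until the overwriting instruction commits.
    return state.redefiner_committed


def releasable_eager(state):
    # Eager release: overwritten and no consumer still waiting on the value.
    return state.overwritten and state.pending_consumers == 0


# pr21 is written by "lr3 = lr1, lr2" and read by "lr5 = lr3, lr4".
pr21 = PhysRegState(pending_consumers=1)
pr21.pending_consumers -= 1   # "pr15 = pr21, pr12" has read its operand
pr21.overwritten = True       # "lr3 = ..." is renamed to pr29; branch still unresolved
print(releasable_conventional(pr21), releasable_eager(pr21))  # False True
```

The eager policy frees pr21 while the branch is still in flight, which is why the released value has to remain recoverable from somewhere else.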

Implementation Details
A Usage Table keeps track of various states for each physical register
– A bit that tracks whether the logical register value has been overwritten
– RVQ address/register id in the trailing thread
Counters for each physical register
– To track pending consumers
Modification to the ROB to initiate recovery upon a mispredict
Non-trivial complexity and overheads
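
One way to picture the bookkeeping listed above is a small software model. The three per-register fields come from the slide; the class, its method names, and the dictionary standing in for the RVQ are hypothetical, and real hardware would look quite different.

```python
class UsageTable:
    """Toy model of the per-physical-register state listed on the slide."""

    def __init__(self):
        self.entries = {}     # phys reg -> overwritten bit, consumer count, RVQ address
        self.free_list = []   # registers released eagerly

    def allocate(self, preg, rvq_addr):
        self.entries[preg] = {"overwritten": False, "consumers": 0,
                              "rvq_addr": rvq_addr}

    def add_consumer(self, preg):
        self.entries[preg]["consumers"] += 1

    def consumer_done(self, preg):
        self.entries[preg]["consumers"] -= 1
        self._try_release(preg)

    def mark_overwritten(self, preg):
        self.entries[preg]["overwritten"] = True
        self._try_release(preg)

    def _try_release(self, preg):
        entry = self.entries[preg]
        if (entry["overwritten"] and entry["consumers"] == 0
                and preg not in self.free_list):
            self.free_list.append(preg)

    def recover(self, preg, rvq):
        """On a mispredict, fetch a released register's value back from the RVQ."""
        return rvq[self.entries[preg]["rvq_addr"]]


table = UsageTable()
table.allocate("pr21", rvq_addr=7)
table.add_consumer("pr21")
table.mark_overwritten("pr21")   # still has a pending consumer, so not released
table.consumer_done("pr21")      # last consumer done, so released eagerly
print(table.free_list)                  # ['pr21']
print(table.recover("pr21", {7: 42}))   # 42
```

The sketch leaves out the ROB change that triggers `recover`, which is where much of the complexity noted on the slide would sit.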

Evaluation Methodology
SimpleScalar-3.0 (modified for CMP/SMT) for performance analysis and Wattch for processor power
eCacti-3.0 to model register file power and area overheads
SPEC2k Int and FP benchmark suite
– 16 benchmarks for single-thread experiments
– 10 pairs of high/low IPC and Int/FP combinations for multi-thread experiments
Evaluated all RMT models for a comprehensive analysis of all combinations of leading/trailing threads
RVQ size = 600 entries

Performance Evaluation

Effect of Register File Size: SRTR (ROB size 160)

Effect of Register File Size: ST-P-CRTR

Effect of Register File Size: CRTR

Effect of Register File Size: MT-P-CRTR

Effect of Register File Size
For SRTR, CRTR, MT-P-CRTR:
– A 100-entry RF with ER performs the same as the baseline with 160 entries (37.5% size reduction)
– A 100-entry RF with ER performs 34% better than the baseline with 100 entries
For ST-P-CRTR:
– A 50-entry register file with ER performs the same as the baseline with 80 entries (37.5% size reduction)
– A 100-entry RF with ER performs 12% better than the baseline with 100 entries
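
The 37.5% figure quoted for both configurations follows directly from the register file sizes. A short check, with a hypothetical helper name:

```python
def size_reduction(baseline_entries, er_entries):
    """Fractional reduction in register file entries relative to the baseline."""
    return (baseline_entries - er_entries) / baseline_entries


print(size_reduction(160, 100))  # 0.375
print(size_reduction(80, 50))    # 0.375
```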

Observations
ER is more favorable to models where the leading thread co-executes with another leading/trailing thread
Most FP benchmarks perform better with ER (greater than 20% improvement)
Int benchmarks that have poor branch prediction rates do not benefit much (gcc, equake, eon, etc.: up to 3%)

Performance Overheads
For 100 million single-thread execution:
– 70 million registers are released eagerly
– 6% are copied back upon mispredict recovery
– The cost of copying back depends on the program's mispredict rate
– Each mispredict requires 6.6 copy-back values on average
– The cost of copying can possibly be hidden within the branch recovery time
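
As a back-of-the-envelope reading of these numbers, assuming the 6% refers to the 70 million eagerly released registers (the slide does not say so explicitly), the implied copy-back and mispredict counts can be worked out as follows. The function name is hypothetical and the derived values are estimates, not results reported in the talk.

```python
def copy_back_estimate(released, copied_back_percent, values_per_mispredict):
    """Return (copied_back_values, implied_mispredicts) under the stated assumption."""
    copied_back = released * copied_back_percent / 100
    implied_mispredicts = copied_back / values_per_mispredict
    return copied_back, implied_mispredicts


copied, mispredicts = copy_back_estimate(70_000_000, 6, 6.6)
print(copied)              # 4200000.0
print(round(mispredicts))  # 636364
```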

Performance Overheads
Max IPC loss for a 5-cycle overhead is 4%

Power/Area Analysis
8 read/4 write ports assumed for the single-thread (ST) RF
16 read/8 write ports assumed for the multi-thread (MT) RF

Power/Area Analysis
A single-thread RF of size 50 with ER, compared to a baseline RF of size 80, can
– Improve clock speed by 19%
– Consume 11% less energy and 25% less area
If SEC-DED ECC is implemented on the baseline register file
– 6% energy increase and 16% area increase
A smaller RF can help afford ECC, even for multiple-bit soft error resilience

Fault-Injection Analysis
Modified SimpleScalar for fault analysis
Conservative analysis, as masking effects cannot be modeled
Every 1000 cycles, a register bit is flipped in the trailing register file
– Only 0.0004% of faults go undetected
On average, 99% of the time a logical register is rewritten within a 100-instruction interval
– Ensures that the slack is less than the RVQ size
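
The injection schedule described above can be sketched as a simple loop. This is not the authors' modified simulator; the register width of 64 bits, the uniform choice of register and bit, and both function names are assumptions made for illustration.

```python
import random


def inject_bit_flip(regfile, rng, width=64):
    """Flip one random bit of one random register in place; return (index, bit)."""
    index = rng.randrange(len(regfile))
    bit = rng.randrange(width)   # width is an assumed register size
    regfile[index] ^= 1 << bit
    return index, bit


def run_with_injection(cycles, regfile, interval=1000, seed=0):
    """Inject one fault every `interval` cycles; return a log of (cycle, index, bit)."""
    rng = random.Random(seed)
    faults = []
    for cycle in range(1, cycles + 1):
        if cycle % interval == 0:
            faults.append((cycle,) + inject_bit_flip(regfile, rng))
    return faults


registers = [0] * 32
log = run_with_injection(5000, registers)
print(len(log))  # 5
```

A real study would also run the checker after each injection and count how many faults it misses, which is where the undetected-fault percentage comes from.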

Conclusions and Future Work
The RMT model is very suitable for Eager Register Release
A 100-entry RF can match the throughput of a 160-entry file and shows 34% improvement over the baseline
Fault-coverage reduction is marginal: ~0.0004%
Enables a smaller RF for lower power, higher clock speed, and lower area overheads
Improves reliability by making ECC affordable
Non-trivial implementation overheads
Need to explore complexity-effective solutions
