Exploiting Eager Register Release in a Redundantly Multi-threaded Processor
Niti Madan, Rajeev Balasubramonian
School of Computing, University of Utah



Introduction
- Rising soft error rates due to shrinking transistor sizes and lower supply voltages
- Existing solutions:
  - Process level: SOI
  - Circuit level: rad-hard cells, ECC, BISER
  - Architecture level:
    - Redundant multithreading
    - Reducing the time useful state spends in unprotected structures
    - Software-assisted fault tolerance

Introduction
- CMPs/SMTs enable redundant multithreading (RMT)
- "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002
  - 2 processors/threads execute the same program

Chip-level Redundant Multi-threading (CRTR)
[Figure: Processor 1 (OoO) runs Leading thread 1 and Trailing thread 2; Processor 2 runs Trailing thread 1 and Leading thread 2. Labels between the processors: branch outcomes, loads, register values, stores. The trailing thread lags behind the leading thread by some slack.]

Motivation
- The register file is already a critical resource:
  - impacts ILP
  - impacts cycle time
  - impacts peak temperature
- Multiple threads increase pressure on the register file

Motivation
- Out-of-order processors are "conservative" since they must preserve correctness
  - Example: registers are de-allocated conservatively
- Having a trailing thread allows the leading thread to be aggressive
  - improves the performance of the leading thread
  - trailer state can be used for ensuring correctness
  - some errors may go undetected

[Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ. Labels: "lr5 = … (lr5 mapped to R2)", "lr5 = … (lr5 mapped to R1)", "Branch Mispredict", registers R1 and R2.]

[Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ. Labels: R1, R1', "Soft error", "Mispredict Recovery", "Fault Propagates".]
Very few errors slip through: the slack is, most of the time, less than the RVQ size.

Our Approach
- An RMT processor has duplicate register value state in the RVQ/trailer's state
- Improve register file efficiency using Eager Register Release
- A smaller register file can deliver the same performance using the above technique
  - Reduced power
  - Increased reliability: ECC is less expensive
  - Potentially faster clock speed

Outline
- Background on RMT design space
- Proposed technique
- Evaluation
- Conclusions & Future Work

Redundant Multi-threading
- Fault model
  - Trailer's state used for recovery; does not provide complete recovery
  - Caches and Load Value Queue (LVQ) are ECC protected
  - Can detect all single event upset faults
- Baseline RMT models include SRTR, CRTR, ST-P-CRTR, MT-P-CRTR

Baseline RMT Model
- SRTR: SMT-level RMT
  [Figure: Leading Thread 1 and Trailing Thread 1 on one out-of-order processor]
- CRTR: chip-level RMT
  [Figure: Processor 1 (Leading 1, Trailing 2) and Processor 2 (Trailing 1, Leading 2), labeled out-of-order, connected by the LVQ, BOQ, and RVQ]
- Proposed by Mukherjee et al. ISCA 2002, Gomaa et al. ISCA 2002, ISCA 2003

Power-efficient RMT model
- Our earlier work explores a power-efficient RMT model, P-CRTR (Selse-2, Tech Report 2005)
- Observations
  - The trailing thread doesn't suffer from D-cache misses and branch mispredictions
  - The trailing thread is bound to have higher IPC
- High trailer IPC enables power reduction
- Techniques proposed for power efficiency:
  - Dynamic frequency scaling
  - In-order execution of the trailer

Dynamic Frequency Scaling
- High trailer IPC enables frequency reduction
- Reduce the trailer's frequency to match the leader's throughput
- Reduction in the trailer's dynamic power
- Does not impact the trailer's leakage power
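The throughput-matching idea on this slide can be illustrated with a small calculation. This is only a sketch under a simplifying assumption that is not stated in the slides: throughput is taken to be IPC times clock frequency, and the trailer's IPC is treated as independent of its frequency. The function name and the numbers are hypothetical.

```python
def matched_trailer_frequency(leader_freq_ghz, leader_ipc, trailer_ipc):
    """Frequency at which the trailer's throughput (IPC x f) equals the leader's.

    Assumes IPC does not change with frequency (a simplification).
    """
    if leader_ipc <= 0 or trailer_ipc <= 0:
        raise ValueError("IPC values must be positive")
    # In this sketch the trailer is never clocked faster than the leader.
    return min(leader_freq_ghz, leader_freq_ghz * leader_ipc / trailer_ipc)


# Hypothetical numbers, not taken from the slides: a trailer with twice the
# leader's IPC can run at half the leader's frequency.
print(matched_trailer_frequency(leader_freq_ghz=4.0, leader_ipc=1.0, trailer_ipc=2.0))
```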

In-order Execution of Checker
- Our approach
  - Send all register values computed by the leading core to the trailer (register value prediction, RVP, is 100% accurate if there is no fault)
  - Trailer reads source operands from the RVQ
  - Trailer verifies source operands at commit
- RVP enables perfect IPC: no stalls
- Cost: extra communication overhead
- Benefit: overall reduced dynamic and leakage power
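A minimal sketch of the checker loop described on this slide, assuming a very simplified RVQ entry format (opcode, destination, source register names, and the source values supplied by the leader). The entry layout, the `OPS` table, and `run_trailer` are illustrative names, not the authors' design. The trailer executes each instruction with the leader-supplied operands, so it never waits on a dependence, and at commit it compares those operands with its own register state.

```python
import operator

# Hypothetical opcode table for the sketch.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_trailer(rvq, regfile):
    """Process RVQ entries in order; return the index of the first entry whose
    leader-supplied operands disagree with the trailer's registers, else None."""
    for index, entry in enumerate(rvq):
        a, b = entry["src_vals"]
        # Execute using operands read from the RVQ (no dependence stalls).
        result = OPS[entry["op"]](a, b)
        # At commit, verify the supplied operands against the trailer's state.
        for reg, val in zip(entry["srcs"], entry["src_vals"]):
            if regfile[reg] != val:
                return index
        regfile[entry["dest"]] = result
    return None


rvq = [
    {"op": "add", "dest": "r3", "srcs": ("r1", "r2"), "src_vals": (1, 2)},
    {"op": "mul", "dest": "r4", "srcs": ("r3", "r2"), "src_vals": (3, 2)},
]
print(run_trailer(rvq, {"r1": 1, "r2": 2}))  # fault-free run: None
```

In this sketch a corrupted leader result is caught when a later consumer's operand is verified, since the trailer has written its own recomputed value for that register.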

ST-P-CRTR: single-thread workloads
[Figure: Processor 1 (out-of-order) runs Leading 1; Processor 2 (in-order) runs Trailing 1; connected by the LVQ, BOQ, and RVQ.]

MT-P-CRTR: multi-threaded workloads
[Figure: Processor 1 (out-of-order) runs Leading 1 and Leading 2; Processor 2 (in-order) runs Trailing 1; Processor 3 (in-order) runs Trailing 2; each trailer is connected to Processor 1 by an LVQ, BOQ, and RVQ.]

Eager Register Release
- Involves releasing the older physical register after the value is rewritten and used by all consumers
- Requires a mechanism to store the released state elsewhere

Original Code        Renamed Code
lr3 = lr1, lr2       pr21 = pr8, pr11
lr5 = lr3, lr4       pr15 = pr21, pr12
Branch to x          Branch to x
lr3 = …              pr29 = …

- lr3 has 2 mappings: new pr29 and old pr21
- pr21 cannot be released until the branch resolves
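The difference between the two release rules can be sketched as a predicate. This is an illustration only: the conventional rule is modeled here as "free the old register when the overwriting instruction commits", which is an assumption about the baseline rather than something the slide spells out, and `can_release` is a hypothetical name.

```python
def can_release(remapped, pending_consumers, overwriter_committed, eager):
    """Return True if the old physical register may be freed."""
    if eager:
        # Eager: the logical register has a newer mapping and every
        # consumer of the old value has already read it.
        return remapped and pending_consumers == 0
    # Conventional (assumed baseline): wait for the overwriter to commit,
    # which cannot happen while an older branch is unresolved.
    return overwriter_committed


# pr21 in the slide's example: lr3 is remapped to pr29, the only consumer
# (pr15 = pr21, pr12) has read pr21, but the branch is still unresolved.
state = dict(remapped=True, pending_consumers=0, overwriter_committed=False)
print(can_release(**state, eager=False))  # False
print(can_release(**state, eager=True))   # True
```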

Implementation Details
- Need to keep track of various states for each physical register in a Usage Table
  - Bit that tracks if the logical register value is overwritten
  - RVQ address/register id in the trailing thread
- Counters for each physical register
  - To track pending consumers
- Modification in the ROB to initiate recovery upon mispredict
- Non-trivial complexity and overheads
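A rough software model of the bookkeeping listed on this slide, assuming one usage-table entry per physical register with an overwritten bit, a pending-consumer counter, and a pointer to the trailer-side copy. Class, field, and method names are hypothetical, and mispredict recovery (copying values back) is deliberately not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageEntry:
    overwritten: bool = False       # logical register has a newer mapping
    pending_consumers: int = 0      # renamed consumers that have not read yet
    rvq_addr: Optional[int] = None  # location of the copy on the trailer side


class UsageTable:
    def __init__(self, num_phys_regs):
        self.entries = [UsageEntry() for _ in range(num_phys_regs)]
        self.free_list = []

    def add_consumer(self, pr):
        # Called when an instruction that reads `pr` is renamed.
        self.entries[pr].pending_consumers += 1

    def consumer_read(self, pr):
        # Called when a consumer has read its operand from `pr`.
        self.entries[pr].pending_consumers -= 1
        self._try_release(pr)

    def mark_overwritten(self, pr, rvq_addr):
        # Called when the logical register mapped to `pr` is renamed again.
        entry = self.entries[pr]
        entry.overwritten = True
        entry.rvq_addr = rvq_addr
        self._try_release(pr)

    def _try_release(self, pr):
        entry = self.entries[pr]
        if entry.overwritten and entry.pending_consumers == 0 and pr not in self.free_list:
            self.free_list.append(pr)


table = UsageTable(32)
table.add_consumer(21)                 # pr15 = pr21, pr12 is renamed
table.mark_overwritten(21, rvq_addr=5) # lr3 is remapped to pr29
table.consumer_read(21)                # the consumer reads pr21
print(table.free_list)                 # [21]
```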

Evaluation Methodology
- Simplescalar-3.0 (modified for CMP/SMT) for performance analysis and Wattch for processor power
- eCacti-3.0 to model register file power and area overheads
- SPEC2k Int and FP benchmark suite
  - 16 benchmarks for single-thread experiments
  - 10 pairs of High/Low IPC and Int/FP combinations for multi-thread experiments
- Evaluated all RMT models for a comprehensive analysis of all combinations of leading/trailing threads
- RVQ size = 600 entries

Performance Evaluation

[Figure: Effect of Register File Size, SRTR (ROB size 160)]

[Figure: Effect of Register File Size, ST-P-CRTR]

[Figure: Effect of Register File Size, CRTR]

[Figure: Effect of Register File Size, MT-P-CRTR]

Effect of Register File Size
- For SRTR, CRTR, MT-P-CRTR:
  - Performance of a 100-entry RF with ER is the same as the baseline with 160 entries (37.5% size reduction)
  - Performance improvement of 34% for a 100-entry RF with ER compared to the baseline with 100 entries
- For ST-P-CRTR:
  - Performance of a 50-entry register file with ER is the same as the baseline with 80 entries (37.5% size reduction)
  - Performance improvement of 12% in 100 size RF with ER compared to baseline with 100 size
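The 37.5% figure follows directly from the register file sizes quoted on the slide; the helper below (a hypothetical name) just makes the arithmetic explicit.

```python
def size_reduction_percent(baseline_entries, reduced_entries):
    """Percentage reduction in register file entries relative to the baseline."""
    return 100.0 * (baseline_entries - reduced_entries) / baseline_entries


print(size_reduction_percent(160, 100))  # 37.5 (SRTR, CRTR, MT-P-CRTR)
print(size_reduction_percent(80, 50))    # 37.5 (ST-P-CRTR)
```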

Observations
- ER is more favorable to models where the leading thread co-executes with another leading/trailing thread
- Most FP benchmarks perform better with ER (greater than 20% improvement)
- Int benchmarks that have poor branch prediction rates do not benefit much (gcc, equake, eon, etc.: up to 3%)

Performance Overheads
- For 100 million single thread execution:
  - 70 million registers are released eagerly
  - 6% copied back upon mispredict recovery
  - Cost of copying back depends on the program's mispredict rate
  - Each mispredict requires 6.6 copy-back values
  - Cost of copying can possibly be hidden within the branch recovery time
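A back-of-the-envelope reading of these numbers, under two assumptions that the slide does not confirm: that the 6% is a fraction of the 70 million eagerly released registers, and that 6.6 is the average number of values copied back per mispredict. The function name is hypothetical, and the derived counts are estimates rather than reported results.

```python
def copy_back_estimate(released, copied_fraction, values_per_mispredict):
    """Estimate total copied-back values and the number of recoveries implied."""
    copied = released * copied_fraction
    recoveries = copied / values_per_mispredict
    return copied, recoveries


copied, recoveries = copy_back_estimate(70_000_000, 0.06, 6.6)
print(f"{copied:,.0f} values copied back over about {recoveries:,.0f} recoveries")
```

Under those assumptions this works out to roughly 4.2 million copied-back values spread over roughly 636 thousand mispredict recoveries.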

Performance Overheads
- Max IPC loss for a 5-cycle overhead is 4%

Power/Area Analysis
- 8 Rd/4 Wr ports assumed for ST RF
- 16 Rd/8 Wr ports assumed for MT RF

Power/Area Analysis
- A single-thread RF of size 50 with ER, compared to a baseline RF of size 80, can:
  - Improve clock speed by 19%
  - Consume 11% less energy and 25% less area
- If SEC-DED ECC is implemented on the baseline register file:
  - 6% energy increase and 16% area increase
- A smaller RF can help afford ECC, even for multiple-bit soft error resilience

Fault-Injection Analysis
- Modified Simplescalar for fault analysis
- Conservative analysis, as masking effects cannot be modeled
- Every 1000 cycles, a register bit is flipped in the trailing register file
  - Only % of faults go undetected
- On average, 99% of the time a logical register is rewritten within a 100-instruction interval
  - Ensures that slack is less than RVQ size
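The injection step can be sketched as a periodic single-bit flip in a register array. This is not the authors' modified Simplescalar code; the function names, the 64-bit register width, and the uniform random choice of register and bit are all assumptions made for illustration.

```python
import random


def flip_bit(value, bit, width=64):
    """Flip one bit of a register value, keeping it within `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def inject_faults(regfile, total_cycles, interval=1000, width=64, rng=None):
    """Flip one random bit in one random register every `interval` cycles.

    Returns a log of (cycle, register index, bit position) tuples.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    log = []
    for cycle in range(interval, total_cycles + 1, interval):
        reg = rng.randrange(len(regfile))
        bit = rng.randrange(width)
        regfile[reg] = flip_bit(regfile[reg], bit, width)
        log.append((cycle, reg, bit))
    return log


registers = [0] * 8
faults = inject_faults(registers, total_cycles=5000)
print(len(faults))  # 5 injections at cycles 1000, 2000, ..., 5000
```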

Conclusions and Future Work
- The RMT model is very suitable for Eager Register Release
- A 100-entry RF can match the throughput of a 160-entry file and shows 34% improvement over the baseline
- Fault-coverage reduction is marginal: ~0.0004%
- Enables a smaller RF for lower power, higher clock speed, lower area overheads
- Enables reliability by making ECC affordable
- Nontrivial implementation overheads
- Need to explore complexity-effective solutions