Transient Fault Detection and Recovery via Simultaneous Multithreading Nevroz ŞEN 26/04/2007.

AGENDA
- Introduction & Motivation
- SMT, SRT & SRTR
- Fault Detection via SMT (SRT)
- Fault Recovery via SMT (SRTR)
- Conclusion

INTRODUCTION
Transient faults: faults that persist for a "short" duration.
- Caused by cosmic rays (e.g., neutrons) charging or discharging internal nodes of logic or SRAM cells, and by high-frequency crosstalk
- No practical way to absorb cosmic rays
- Estimated fault rate: 1 fault per 1,000 computers per year
- The future is worse: smaller feature sizes, reduced voltages, higher transistor counts, and reduced noise margins

INTRODUCTION
Fault-tolerant systems use redundancy to improve reliability:
- Time redundancy: separate executions
- Space redundancy: separate physical copies of resources (DMR/TMR)
- Data redundancy: ECC, parity

MOTIVATION
- Simultaneous Multithreading (SMT) improves processor performance by allowing multiple independent threads to execute simultaneously (in the same cycle) in different functional units
- Idea: use the replication the threads already provide to run two copies of the same program, so that faults can be detected by comparing them

MOTIVATION
[Figure: two microprocessors executing "R1 <- (R2)" in cycle-by-cycle lockstep, with input replication and output comparison; memory is covered by ECC, the RAID array by parity, and ServerNet by CRC.]
Replicated microprocessors + cycle-by-cycle lockstepping

MOTIVATION
[Figure: the same system, but with two lockstepped threads executing "R1 <- (R2)" instead of two microprocessors, again with input replication and output comparison; memory covered by ECC, RAID array by parity, ServerNet by CRC.]
Replicated threads + cycle-by-cycle lockstepping???

MOTIVATION
- Less hardware than replicated microprocessors: SMT needs only ~5% more hardware than a uniprocessor, and SRT adds very little hardware on top of an existing SMT design
- Better performance than complete replication, through better use of shared resources
- Lower cost: avoids complete replication, and benefits from the market volume of SMT & SRT

MOTIVATION - CHALLENGES
Cycle-by-cycle output comparison and input replication (lockstepping) are hard on an SMT core:
- Equivalent instructions from the two threads may execute in different cycles
- Equivalent instructions from the two threads may execute in a different order with respect to other instructions in the same thread
- Precise scheduling of the threads is crucial (branch mispredictions, cache misses)

SMT - SRT - SRTR
[Figure: Simultaneous Multithreading (SMT) - Thread1 and Thread2 share one instruction scheduler and a common pool of functional units.]

SMT - SRT - SRTR
- SRT: Simultaneous & Redundantly Threaded processor; SRT = SMT + fault detection
- SRTR: Simultaneous & Redundantly Threaded processor with Recovery; SRTR = SRT + fault recovery

Fault Detection via SMT - SRT
- Sphere of Replication (SoR)
- Output comparison
- Input replication
- Performance optimizations for SRT
- Simulation results

SRT - Sphere of Replication (SoR)
- The logical boundary of redundant execution within a system
- Components inside the sphere are protected against faults by replication
- External components must use other means of fault tolerance (parity, ECC, etc.)
- Its size matters: it determines the error-detection latency and the amount of stored state that must be compared

SRT - Sphere of Replication (SoR) for SRT
- Excludes the instruction and data caches
- Alternate SoRs are possible (e.g., one that excludes the register file)

OUTPUT COMPARISON
- Compare and validate outputs before they leave the SoR, catching faults before they propagate to the rest of the system
- No need to compare every instruction: an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, so checking only stores suffices
What to check:
1. Address and data of stores from the redundant threads (both comparison and validation at commit time)
2. Addresses of uncached loads from the redundant threads
3. Addresses of cached loads: not required
Other output comparisons depend on where the SoR boundary is drawn

OUTPUT COMPARISON - Store Queue
[Figure: stores from both threads (e.g., "Store: R1 <- (R2)") wait in a store queue; matching entries are compared before being sent to the data cache.]
- A shared store queue becomes a bottleneck
- Separate per-thread store queues boost performance
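The store-checking idea above can be sketched in a few lines. This is an illustrative software model, not the SRT hardware: names like `StoreComparator` and `retire_store` are invented here. It shows per-thread store queues whose head entries are compared (address and data) before a store is allowed outside the SoR.

```python
from collections import deque

class StoreComparator:
    """Sketch of SRT output comparison: a store leaves the SoR (goes to
    the data cache) only when both redundant threads agree on its address
    and data. Illustrative model, not the actual hardware design."""

    def __init__(self):
        self.queues = {0: deque(), 1: deque()}  # per-thread store queues
        self.committed = []                     # stores released outside the SoR

    def retire_store(self, thread, addr, data):
        self.queues[thread].append((addr, data))
        self._try_release()

    def _try_release(self):
        # Compare matching head entries from the two threads, in order.
        while self.queues[0] and self.queues[1]:
            lead = self.queues[0].popleft()
            trail = self.queues[1].popleft()
            if lead != trail:
                raise RuntimeError(f"transient fault detected: {lead} != {trail}")
            self.committed.append(lead)  # safe to send to the data cache
```

A mismatch on either field signals a fault before the corrupt value can reach memory.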

INPUT REPLICATION
Replicate inputs coming from outside the SoR and deliver the same values to both redundant copies:
- Instructions: assume no self-modifying code, so no check is needed
- Cached load data: Active Load Address Buffer (ALAB) or Load Value Queue (LVQ)
- Uncached load data: synchronize the threads when comparing the addresses that leave the SoR; when the data returns, replicate the value for the two threads
- External interrupts: either stall the leading thread and deliver the interrupt synchronously, or record the delivery point and deliver it to the trailing thread later

INPUT REPLICATION - Active Load Address Buffer (ALAB)
- Delays a cache block's replacement or invalidation until the trailing thread's load has retired
- A counter tracks the trailing thread's outstanding loads to each block
- When a cache block is about to be replaced, the ALAB is searched for an entry matching the block's address:
  - If the counter != 0: do not replace or invalidate yet; set the pending-invalidate bit so the block is released once the trailing thread is done
  - Else: replace or invalidate immediately
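The counter and pending-invalidate logic above can be sketched as follows. This is a simplified software model under the assumption of one entry per block address; the class and method names are invented for illustration.

```python
class ALAB:
    """Sketch of the Active Load Address Buffer: a block may be replaced
    or invalidated only after the trailing thread has performed all of
    its outstanding loads to that block. Illustrative model only."""

    def __init__(self):
        self.entries = {}  # block addr -> [outstanding count, pending_invalidate]

    def leading_load(self, addr):
        e = self.entries.setdefault(addr, [0, False])
        e[0] += 1  # one more trailing load now outstanding for this block

    def trailing_load(self, addr):
        e = self.entries[addr]
        e[0] -= 1
        if e[0] == 0 and e[1]:
            del self.entries[addr]  # deferred invalidation now completes

    def try_replace(self, addr):
        """Return True if the block may be replaced/invalidated now."""
        e = self.entries.get(addr)
        if e is None or e[0] == 0:
            self.entries.pop(addr, None)
            return True
        e[1] = True  # set pending-invalidate; retry after trailing loads drain
        return False
```

The key property: a replacement attempt while the counter is nonzero is deferred, guaranteeing the trailing thread reads the same cached value the leading thread read.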

INPUT REPLICATION - Load Value Queue (LVQ)
- An alternative to the ALAB, and simpler
- Uses pre-designated leading and trailing threads
- Protected by ECC
[Figure: the leading thread's load ("load R1 <- (R2)") probes the cache and deposits its value in the LVQ; the trailing thread's corresponding load reads the value from the LVQ instead of the cache.]

INPUT REPLICATION - Load Value Queue (LVQ)
Advantages over the ALAB:
- Reduces the pressure on the data cache ports
- Accelerates detection of faulty load addresses
- Simpler design
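A minimal sketch of the LVQ behavior, assuming the trailing thread consumes loads strictly in program order (names here are illustrative, not from the SRT design): only the leading thread touches the cache, and an address mismatch at the queue head immediately flags a fault.

```python
from collections import deque

class LoadValueQueue:
    """Sketch of the LVQ: the leading thread probes the cache and enqueues
    (address, value); the trailing thread pops entries in order and checks
    the address instead of accessing the cache. Illustrative model only."""

    def __init__(self):
        self.q = deque()

    def leading_load(self, addr, cache):
        value = cache[addr]            # the only data-cache access
        self.q.append((addr, value))
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.q.popleft()  # in-order consumption
        if lead_addr != addr:
            raise RuntimeError("faulty load address detected")
        return value                   # replicated load value
```

This shows why the LVQ halves cache-port pressure and detects a corrupted load address as soon as the trailing load issues.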

Performance Optimizations for SRT
Idea: use one thread to improve cache and branch-prediction behavior for the other. Two techniques:
- Slack fetch: maintain a constant slack of instructions between the threads, so the trailing thread never sees mispredictions or cache misses
- Branch Outcome Queue (BOQ)
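The slack-fetch policy reduces to a simple fetch-arbitration rule, sketched below with invented names. The leading thread fetches until the gap reaches the target slack; only then does the trailing thread fetch, by which time the leading thread has already resolved the branches and cache misses ahead of it (the 256 default matches the best-performing slack reported later in these slides).

```python
def pick_fetch_thread(lead_fetched, trail_fetched, slack=256):
    """Sketch of slack fetch: keep the trailing thread a fixed number of
    instructions behind the leading thread. `lead_fetched` and
    `trail_fetched` are running instruction counts; names are illustrative."""
    if lead_fetched - trail_fetched >= slack:
        return "trailing"   # gap large enough: let the trailing thread fetch
    return "leading"        # otherwise keep building the slack
```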

Performance Optimizations for SRT - Branch Outcome Queue (BOQ)
[Figure: pipeline stages Fetch, Decode, Dispatch, Execute, Commit with the data cache; the BOQ carries committed branch outcomes from the leading thread's commit stage back to the trailing thread's fetch stage.]
- Sends committed branch outcomes (branch PCs and outcomes) from the leading thread to the trailing thread
- In its fetch stage, the trailing thread uses the head of the queue like a branch target buffer
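The BOQ behaves like a perfect, in-order branch predictor for the trailing thread. The sketch below is an illustrative software model (class and method names invented; a 4-byte sequential PC increment is assumed): the leading thread pushes committed outcomes, and the trailing thread's fetch consults the head entry.

```python
from collections import deque

class BranchOutcomeQueue:
    """Sketch of the BOQ: committed branch outcomes (PC, taken, target)
    from the leading thread drive the trailing thread's fetch like a
    branch target buffer with perfect outcomes. Illustrative only."""

    def __init__(self):
        self.q = deque()

    def leading_commit_branch(self, pc, taken, target):
        self.q.append((pc, taken, target))

    def trailing_predict(self, fetch_pc):
        pc, taken, target = self.q[0]      # peek at the head entry
        if pc != fetch_pc:
            return None                    # no branch committed at this PC
        self.q.popleft()
        return target if taken else fetch_pc + 4  # next fetch address
```

Because outcomes come from committed leading-thread branches, the trailing thread never fetches down a wrong path.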

Simulation Results
Simulation environment:
- Modified SimpleScalar "sim-outorder"
- Long front-end pipeline, reflecting the out-of-order core and SMT
- Simple approximation of a trace cache
- 11 SPEC95 benchmarks

Simulation Results
- ORH: On-chip Replicated Hardware
- ORH-Dual: two pipelines, each with half the resources
- SMT-Dual: replicated threads with no detection hardware

Simulation Results - Slack Fetch & Branch Outcome Queue
- Up to 27% performance improvement for SF, BOQ, and SF + BOQ
- Performance is better with a slack of 256 instructions than with 32 or 128
- The slack prevents the trailing thread from wasting resources on speculation

Simulation Results - Input Replication
- Very low performance degradation for a 64-entry ALAB or LVQ
- On average, a 16-entry ALAB and a 16-entry LVQ degrade performance by 8% and 5%, respectively

Simulation Results - Overall Comparison with ORH-Dual
- SRT configuration: slack fetch of 256, 128-entry BOQ, 64-entry store buffer, and 64-entry LVQ
- Average improvement: 16%; maximum: 29%, over a lockstepping processor with the "same" hardware

Fault Recovery via SMT (SRTR)
What is wrong with SRT for recovery:
- A leading non-store instruction may commit before the check for a fault occurs, since SRT relies on the trailing thread to trigger detection
- That is acceptable in a fail-fast architecture, but a faulty instruction cannot be undone once it commits, so SRT cannot recover

Fault Recovery via SMT (SRTR) - Motivation
- In SRT, a leading instruction may commit before the fault check occurs, relying on the trailing thread to trigger detection
- In contrast, SRTR must not allow any leading instruction to commit before it has been checked
- SRTR exploits the time between completion and commit of a leading instruction, checking the result as soon as the trailing counterpart completes
- In SPEC95, complete-to-commit takes about 29 cycles
- This short slack has implications: the leading thread must provide branch predictions, and the StB, LVQ, and BOQ must handle mispredictions

Fault Recovery via SMT (SRTR) - Motivation
- The leading thread provides the trailing thread with branch predictions instead of outcomes (as in SRT)
- A register value queue (RVQ) stores register values and other information needed to check instructions, avoiding bandwidth pressure on the register file
- Dependence-based checking elision (DBCE) reduces the number of checks
- Recovery uses the traditional rollback ability of modern pipelines

SRTR Additions to SMT
- predQ: prediction queue
- LVQ: load value queue
- CV: commit vector
- AL: active list
- RVQ: register value queue

SRTR - AL & LVQ
- Leading and trailing instructions occupy the same positions in their ALs (one private AL per thread), but may enter their ALs and become ready to commit at different times
- The LVQ is modified to allow speculative loads; a trailing load might issue before its leading counterpart
- The Shadow Active List (SAL) holds pointers to LVQ entries
- Branches place the LVQ tail pointer in the SAL; on a misprediction, the LVQ is rolled back to that saved tail pointer

SRTR - PREDQ
- The leading thread places its predicted PCs in the predQ
- Similar to the BOQ, but holds predictions instead of outcomes
- Using the predQ, the two threads fetch essentially the same instructions
- On detecting a misprediction, the leading thread clears the predQ
- ECC protected

SRTR - RVQ & CV
- SRTR checks an instruction when its trailing counterpart completes
- The register value queue (RVQ) stores register values for checking, avoiding pressure on the register file
- RVQ entries are allocated when instructions enter the AL; pointers to them are placed in the SAL to ease lookup
- If the check succeeds, the instructions' CV entries are set to checked-ok and the instructions can commit
- If the check fails, the CV entries are set to failed; rollback occurs when those entries reach the head of the AL
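The RVQ/CV interaction above can be sketched as a small state machine (an illustrative model; class and method names, and the string state values, are invented here): the leading instruction deposits its value, the trailing completion compares it, and the CV entry gates commit.

```python
class RVQCheck:
    """Sketch of SRTR checking via the RVQ and commit vector (CV): when a
    trailing instruction completes, its value is compared against the
    leading value parked in the RVQ, and the CV entry is set to checked-ok
    or failed. A failed entry at the AL head triggers rollback."""

    CHECKED_OK, FAILED, UNCHECKED = "ok", "failed", "unchecked"

    def __init__(self, size):
        self.rvq = [None] * size
        self.cv = [self.UNCHECKED] * size

    def leading_writeback(self, idx, value):
        self.rvq[idx] = value        # RVQ pointer obtained via the SAL

    def trailing_complete(self, idx, value):
        ok = (self.rvq[idx] == value)
        self.cv[idx] = self.CHECKED_OK if ok else self.FAILED

    def can_commit(self, idx):
        # Only checked-ok instructions may commit; failed ones roll back.
        return self.cv[idx] == self.CHECKED_OK
```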

SRTR - Pipeline
- After a leading instruction writes back its result, it enters the fault-check stage
- There it puts its value in the RVQ, using the pointer from the SAL
- Trailing instructions also use the SAL to obtain their RVQ pointers and find their leading counterparts' values

SRTR - DBCE
- SRTR stores register values for checking in the RVQ to avoid bandwidth pressure on the register file, but checking every instruction puts bandwidth pressure on the RVQ itself
- The Dependence-Based Checking Elision (DBCE) scheme reduces the number of checks and, thereby, the RVQ bandwidth demand

SRTR - DBCE
- Idea: faults propagate through dependent instructions
- DBCE exploits register dependence chains so that only the last instruction in a chain uses the RVQ and has its leading and trailing values checked

SRTR - DBCE
- If the last instruction's check succeeds, the previous instructions in the chain commit without being checked
- If the check fails, all instructions in the chain are marked as failed, and the earliest instruction in the chain triggers a rollback
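A simplified sketch of DBCE chain formation (the real scheme has additional constraints, e.g., chains must fit within the checking window; the function name and the single-source-register simplification are mine): each instruction joins the chain of its source register's producer, and only chain tails are checked through the RVQ.

```python
def dbce_checks(instructions):
    """Sketch of DBCE: walk instructions in program order, linking each
    onto the chain of its source register's producer; return the ids of
    the chain-tail instructions, the only ones needing an RVQ check.
    `instructions` is a list of (id, dest_reg, src_reg_or_None),
    simplified to at most one source register per instruction."""
    chain_of = {}   # dest register -> chain (list of instruction ids)
    chains = []
    for iid, dest, src in instructions:
        chain = chain_of.get(src)
        if chain is None:
            chain = []          # no in-flight producer: start a new chain
            chains.append(chain)
        chain.append(iid)
        chain_of[dest] = chain  # this instruction now heads the chain
    # a fault anywhere in a chain corrupts its last value,
    # so checking only each chain's tail suffices
    return [c[-1] for c in chains]
```

For a chain r1 -> r2 -> r3 plus an independent instruction, only two checks are needed instead of four.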

SRTR - Performance
- Detection performance is compared between SRT and SRTR
- SRTR shows better interaction between branch mispredictions and slack
- SRTR performs 1%-7% better than SRT

SRTR - Performance SRTR’s average performance peaks at a slack of 32

CONCLUSION
- A more efficient way to detect transient faults was presented: the trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared
- Key mechanisms defined: LVQ, ALAB, slack fetch, and BOQ
- An SRT processor can provide higher performance than an equivalently sized on-chip hardware-replicated solution
- SRT can be extended to provide fault recovery (SRTR)

REFERENCES
1. T. N. Vijaykumar, Irith Pomeranz, and Karl Cheng, "Transient Fault Recovery using Simultaneous Multithreading," Proc. 29th Annual Int'l Symp. on Computer Architecture (ISCA), May 2002.
2. S. K. Reinhardt and S. S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," Proc. 27th Annual Int'l Symp. on Computer Architecture (ISCA), pages 25-36, June 2000.
3. Eric Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," Proc. 29th Int'l Symp. on Fault-Tolerant Computing (FTCS), 1999.
4. S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," Proc. 29th Annual Int'l Symp. on Computer Architecture (ISCA), 2002.