Transient Fault Tolerance via Dynamic Process-Level Redundancy Alex Shye, Vijay Janapa Reddi, Tipp Moseley and Daniel A. Connors University of Colorado.


1 Transient Fault Tolerance via Dynamic Process-Level Redundancy Alex Shye, Vijay Janapa Reddi, Tipp Moseley and Daniel A. Connors University of Colorado at Boulder Department of Electrical and Computer Engineering DRACO Architecture Research Group Workshop on Binary Instrumentation and Applications San Jose, CA 10.22.2006

2 Outline Introduction Background/Terminology Software-centric Fault Detection Process-Level Redundancy Experimental Results Conclusion

3 Introduction
Process technology trends
–Single transistor error rate expected to stay close to constant
–Number of transistors is increasing exponentially with each generation
Transient faults will be a problem for microprocessors!
Hardware Approaches
–Specialized redundant hardware, redundant multi-threading
Software Approaches
–Compiler solutions: instruction duplication, control flow checking
–Low-cost, flexible alternative but higher overhead
Goal: Leverage available hardware parallelism in SMT and CMP machines to improve the performance of software transient fault tolerance

4 Background/Terminology
Types of transient faults (based upon outcome)
–Benign Faults
–Silent Data Corruption (SDC)
–Detected Unrecoverable Error (DUE): True DUE, False DUE
Sphere of Replication (SoR)
–Indicates the scope of fault detection and containment
–Input Replication
–Output Comparison

5 Software-centric Fault Detection
Most previous approaches are hardware-centric
–Even compiler approaches (e.g. EDDI, SWIFT)
A software-centric model can leverage the strengths of a software approach
–Correctness is defined by software output
–Ability to see the larger-scope effect of a fault
–Ignore benign faults
[Figure: system stack of Processor, Cache, Memory, Devices, Operating System, Libraries, and Application, with the Hardware SoR marked for hardware-centric fault detection and the Software SoR marked for software-centric fault detection]

6 Process-Level Redundancy (PLR)
System Call Emulation Unit
–Creates redundant processes
–Barrier synchronizes at all system calls
–Enforces SoR with input replication and output comparison
–Emulates system calls to guarantee determinism among all processes
–Detects and recovers from transient faults
Master Process
–Only process allowed to perform system I/O
Redundant Processes
–Identical address space, file descriptors, etc.
–Not allowed to perform system I/O
Watchdog Alarm
–Occasionally a process will hang
–Set at beginning of barrier synchronization to ensure that all processes are alive
[Figure: three App/Libs process stacks on top of the SysCall Emulation Unit, which sits above the Operating System, with a Watchdog Alarm]
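The watchdog idea on this slide can be illustrated with a small Python sketch. It is only an analogy under stated assumptions: threads stand in for the redundant processes, `threading.Barrier` with a timeout stands in for the barrier plus watchdog alarm, and the names `find_missing_replicas` and `hang_ids` are hypothetical. PLR itself works on real processes at system call boundaries.

```python
import threading

def find_missing_replicas(n_replicas, hang_ids, timeout=0.2):
    """Return the ids of replicas that never reached the barrier.

    A replica listed in hang_ids simulates a hung process by never
    arriving. The barrier timeout plays the role of the watchdog alarm.
    """
    barrier = threading.Barrier(n_replicas)
    arrived = []
    lock = threading.Lock()

    def replica(pid):
        if pid in hang_ids:
            return  # simulated hang: this replica never reaches the barrier
        with lock:
            arrived.append(pid)
        try:
            barrier.wait(timeout=timeout)  # watchdog starts at the barrier
        except threading.BrokenBarrierError:
            pass  # watchdog fired: at least one replica is missing

    threads = [threading.Thread(target=replica, args=(pid,))
               for pid in range(n_replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(set(range(n_replicas)) - set(arrived))
```

In this sketch, a non-empty result corresponds to the Timeout case: the missing process has been identified and would be re-created.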

7 Enforcing SoR and Determinism
Input Replication
–All read events: read(), gettimeofday(), getrusage(), etc.
–Return value from all system calls
Output Comparison
–All write events: write(), msync(), etc.
–System call parameters
Maintaining Determinism at System Calls
–Master process executes system call
–Redundant processes emulate it
 Ignore some: rename(), unlink()
 Execute similar/altered system call
 –Identical address space: mmap()
 –Process-specific data: open(), lseek()
Example of handling a read() system call (diagram labels for the Master Process and Redundant Processes around a Barrier)
–Compare syscall type and cmd line parameters
–Write cmd line parameters and syscall type to shmem
–read()
–Write resulting file offset and read buffer to shmem
–Copy the read buffer from shmem
–lseek() to correct file offset
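The read() example on this slide can be sketched in Python. This is a minimal single-process simulation, not the PLR implementation: a plain dict stands in for the shared memory segment, two file descriptors on the same temporary file stand in for the master and one redundant process, and the names `master_read`, `redundant_read`, `shmem`, and `demo` are hypothetical.

```python
import os
import tempfile

def master_read(fd, nbytes, shmem):
    """Master really executes read(), then publishes offset and data."""
    data = os.read(fd, nbytes)
    shmem["offset"] = os.lseek(fd, 0, os.SEEK_CUR)  # resulting file offset
    shmem["buffer"] = data                          # read buffer
    return data

def redundant_read(fd, shmem):
    """Redundant process emulates read(): no real read, only copy and lseek()."""
    data = bytes(shmem["buffer"])               # copy the read buffer from shmem
    os.lseek(fd, shmem["offset"], os.SEEK_SET)  # move to the master's file offset
    return data

def demo():
    """Run one emulated read() and return both buffers and both offsets."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"transient fault tolerance")
        path = f.name
    master_fd = os.open(path, os.O_RDONLY)
    redundant_fd = os.open(path, os.O_RDONLY)
    try:
        shmem = {}
        m = master_read(master_fd, 9, shmem)
        r = redundant_read(redundant_fd, shmem)
        offsets = (os.lseek(master_fd, 0, os.SEEK_CUR),
                   os.lseek(redundant_fd, 0, os.SEEK_CUR))
        return m, r, offsets
    finally:
        os.close(master_fd)
        os.close(redundant_fd)
        os.unlink(path)
```

After the call, both descriptors sit at the same offset and both sides hold the same bytes, which is the determinism property the slide describes for process-specific data.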

8 Fault Detection and Recovery
PLR supports detection/recovery from multiple faults by increasing the number of redundant processes and scaling the majority vote logic
Type of Error | Detection Mechanism | Recovery Mechanism
Output Mismatch | Detected as a mismatch of compare buffers on an output comparison | Use majority vote to ensure correct data exists, kill incorrect process, and fork() to create a new one
Program Failure | System call emulation unit registers signal handlers for SIGSEGV, SIGIOT, etc. | Re-create the dead process by forking one of the existing processes
Timeout | Watchdog alarm times out | Determine the missing process and fork() to create a new one
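The Output Mismatch row can be illustrated with a short Python sketch of majority voting over compare buffers. It assumes each process contributes one bytes buffer per output comparison. The function name `majority_vote` and the strict-majority rule are illustrative assumptions, not details taken from the slides.

```python
from collections import Counter

def majority_vote(buffers):
    """Return (majority_buffer, faulty_indices) for a list of bytes buffers.

    Raises ValueError when no buffer value is held by a strict majority,
    in which case the mismatch is detected but cannot be repaired by voting.
    """
    counts = Counter(buffers)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(buffers):
        raise ValueError("no majority among compare buffers")
    # Processes whose buffer differs from the majority are the ones to
    # kill and replace with a fork() of a correct process.
    faulty = [i for i, buf in enumerate(buffers) if buf != winner]
    return winner, faulty
```

With three processes, a single corrupted buffer is outvoted and its index is reported. With only two processes, a mismatch raises `ValueError`, since neither side can be trusted over the other.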

9 Experimental Methodology
Use a set of the SPEC2000 benchmarks
PLR prototype developed with Pin
–Intercept system calls to implement PLR
Fault Injection
–Gather an instruction count profile
–Use profile to generate a test case
 Test case: an instruction and a particular execution of the instruction to fault
–Run with Pin in JIT mode and use IARG_RETURN_REGS to alter a random bit of the instruction's source or destination registers
Fault Coverage
–Use fault injector on test inputs, generating 1000 test cases per benchmark
–specdiff in SPEC2000 harness determines output correctness
PLR Performance
–Run PLR (in Probe mode using Pin Probes) on reference inputs with two redundant processes
–4-way SMP machine, each processor is hyper-threaded
–Use sched_setaffinity() to simulate various hardware platforms
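The fault injection steps can be sketched in Python. This models only the test-case selection and the single-bit flip, not the Pin tool itself. The names `pick_test_case` and `flip_bit`, the 32-bit default register width, and the choice to weight instructions by their execution count are all assumptions made for illustration.

```python
import random

def pick_test_case(profile, rng):
    """Pick (instruction_address, nth_execution) from a count profile.

    profile maps an instruction address to its dynamic execution count.
    Weighting by count makes every dynamic execution equally likely,
    which is an assumption rather than something stated on the slide.
    """
    addresses = sorted(profile)
    weights = [profile[a] for a in addresses]
    address = rng.choices(addresses, weights=weights, k=1)[0]
    nth = rng.randrange(1, profile[address] + 1)  # 1-based execution index
    return address, nth

def flip_bit(value, bit, width=32):
    """Return value with one bit inverted, truncated to a width-bit register."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)
```

A test case from `pick_test_case` names which dynamic execution to fault, and `flip_bit` models the transient fault applied to a source or destination register value at that point.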

10 Fault Coverage
Watchdog timeout is very rare, so it is not shown
PLR detects all Incorrect and Failed cases
Effectively detects relevant faults and ignores benign faults
Floating point correctness question (e.g. 168.wupwise, 172.mgrid)
–Results actually differ, but the difference is within specdiff's tolerance

11 Performance
Performance for single processor (PLR 1x1), 2 SMT processors (PLR 2x1) and 4-way SMP (PLR 4x1)
Slowdown for 4-way SMP only 1.26x
–Should be better on a CMP with faster processor interconnect

12 Conclusion
Present a different way to use existing general purpose SMT and CMP machines for transient fault tolerance
Differentiate between hardware-centric and software-centric fault detection models
–Show how software-centric can be effective in ignoring benign faults
PLR on a 4-way SMP executes with only a 26% slowdown, a 36% improvement over the fastest compiler technique
Future Work
–Implementation in a run-time system allows for dynamically altering amount of fault tolerance
–Simple PLR model is presented; work on handling interrupts, shared memory, and threads (the tough one)
Questions?

