Presentation on theme: "Fault Tolerance in Embedded Systems Daniel Shapiro"— Presentation transcript:
Fault Tolerance in Embedded Systems Daniel Shapiro firstname.lastname@example.org http://site.uottawa.ca/~dshap092
Fault Tolerance
This presentation is based upon [1].
Focus is on the basics as applied to embedded systems with processors.
This presentation does not rely on Wikipedia, but see the Wikipedia article on Byzantine fault tolerance.
Trends Problems Fault Tolerance
Goal = safety + liveness
Safe: keep faults from hurting the user, even in failure
Live: performs the desired task
Better to fail than to do harm
Cosmic rays and alpha particles
Trends Problems
More devices per processor means more units can fail
– Think CISC vs. RISC
More complex designs mean more failure cases exist
– Think AVX vs. MMX
Cache faults and, more generally, memory faults
– Recharging DRAM is “easier” than reloading a destroyed cache line
Fault Tolerance Definitions
Fault
– Physical faults
– Software faults
A fault may manifest as an error
A masked fault does not show up as an error
Errors may also be masked
Otherwise the error results in a failure
Logical mask – 0 AND error bit
Architectural mask – error in the destination register specifier of a NOP
Application mask – silent fault like writing garbage to an unused address … produces no failure
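The logical mask bullet ("0 AND error bit") can be sketched in a few lines. This is only an illustration; the names `and_gate`, `correct_bit`, and `faulty_bit` are made up for the example.

```python
def and_gate(a: int, b: int) -> int:
    """Model of a 2-input AND gate on single bits."""
    return a & b

correct_bit = 1
faulty_bit = correct_bit ^ 1  # the same signal after a bit flip

# Other input is 0: the output is 0 either way, so the error is logically masked.
masked = and_gate(0, faulty_bit) == and_gate(0, correct_bit)

# Other input is 1: the flipped bit reaches the output, so the error propagates.
propagated = and_gate(1, faulty_bit) != and_gate(1, correct_bit)
```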
Fault Hiding
Some faults are already recovered automatically: a fault in the branch predictor is cleaned up by the normal misprediction recovery
Dangerous cases are the faults that are NOT masked
Goal: mask all faults
– E.g. HDD faults are common but hidden
Transient fault – signal glitch
Permanent fault – wire burns
Intermittent fault – cold soldered wire
Fault tolerance scheme – design a system for masking the expected fault type (transient/permanent/intermittent)
Fault Avoidance
Fault avoidance is just as good as fault tolerance
Error detection and correction is the alternative
Permanent faults
– Physical wear-out
– Fabrication defects
– Design bugs
Error Models
We only care about errors, since masked faults are innocuous
Error models
– For improving fault tolerance
– E.g. the stuck-at-0/1 model tells us that there is a potential error
– Many, many stuck-at-0 errors can mean that there is NO PROBLEM
– Reduces the need to evaluate all sources of error. Design space size↓↓
3 main error model parameters
Type of error – bridging/coupling error (e.g. short, cross-talk), stuck-at error, fail-stop error, delay error
Error duration – transient, intermittent, permanent
Number of simultaneous errors – errors are rare, how many wars can you fight at once?
Number of Simultaneous Errors
One error may hide another
– E.g. a 2-bit flip gets past a parity checker
Reasons for handling multiple errors:
– Mission critical
– High error rate
– Latent errors (undetected and lingering) may overlap with other errors. Think about an incorrectly stored word: the error only shows up upon the NEXT read of the word
Better to detect the first error AND to have double error correction, since the error rate trends are against us.
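A minimal sketch of the 2-bit flip case, assuming a single even-parity bit over an 8-bit word. The word value and flipped bit positions are arbitrary choices for illustration.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1

stored = 0b1011_0010
check_bit = parity(stored)           # parity saved alongside the word

one_flip = stored ^ 0b0000_0100      # one bit flipped
two_flips = stored ^ 0b0001_0100     # two bits flipped

detected_one = parity(one_flip) != check_bit    # parity changes: detected
detected_two = parity(two_flips) != check_bit   # parity unchanged: second error hides the first
```

With one flip the recomputed parity disagrees with the stored bit. With two flips it agrees again, so the checker reports nothing.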
Fault Tolerance Metrics
Availability
– 99.999% = five nines of availability
Reliability
– P(no failure up to time t)
– Most errors are not failures
Mean != probability
Variance (2 and 20 vs. 11 and 12)
MTTF – Mean Time To Failure
MTTR – Mean Time To Repair
MTBF – Mean Time Between Failures = MTTF + MTTR
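The metrics above connect through the standard steady-state relation availability = MTTF / (MTTF + MTTR) = MTTF / MTBF. The slide does not state this relation, so treat the sketch below as an added assumption; the function names are illustrative.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60

# A system that runs 999 hours between 1-hour repairs is up 99.9% of the time.
three_nines = availability(999.0, 1.0)

# Five nines leaves only a few minutes of downtime per year.
five_nines_downtime = downtime_minutes_per_year(0.99999)
```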
Fault Tolerance Metrics
Failures in Time (FIT)
– Rate
– # failures / 1 billion hours
– Additive
– ∝ 1/MTTF
– Arbitrary
– Raw rate includes masked faults
– Effective rate excludes masked faults
Effective FIT = FIT × AVF
– Helps locate transient error vulnerability
– Shown to be a good lower bound on reliability
Architectural Vulnerability Factor (AVF)
– Architecturally Correct Execution = ACE state
– Otherwise = un-ACE state
– E.g. PC state = ACE; branch pred = un-ACE
– Fraction of time in ACE state
Component AVF = avg # ACE bits per cycle / # state bits
If many ACE bits reside in a structure for a long time, that structure is highly vulnerable: large AVF
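A small worked sketch of the FIT and AVF arithmetic above. All numbers (the 64-bit structure, its per-cycle ACE bit counts, the raw FIT values) are invented for illustration. The conversion MTTF = 10^9 / FIT hours follows from FIT being failures per billion hours.

```python
def component_avf(ace_bits_per_cycle, state_bits):
    """Average number of ACE bits per cycle divided by the structure's total state bits."""
    return (sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)) / state_bits

def effective_fit(raw_fit, avf):
    """Effective FIT = raw FIT x AVF."""
    return raw_fit * avf

def mttf_hours(fit):
    """FIT counts failures per 1e9 hours, so MTTF is its scaled reciprocal."""
    return 1e9 / fit

# Hypothetical 64-bit structure observed over four cycles.
avf = component_avf([16, 32, 48, 32], state_bits=64)   # 0.5
eff = effective_fit(100.0, avf)                        # 50.0
mttf = mttf_hours(eff)                                 # 2e7 hours

# FIT rates are additive across components.
system_fit = sum([eff, 30.0, 20.0])                    # 100.0
```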
Error Detection
Helps to provide safety
Without redundancy we cannot detect errors
What kind of redundancy do we need?
Redundancy
– Physical (dual modular redundancy = DMR, majority gate = TMR, NMR where N is odd and N > 3)
– Temporal (run twice & compare results)
– Information (extra bits like parity)
Boeing 777 uses “triple-triple” modular redundancy: 2 levels of triple voting, where each vote is from a different architecture
[Figure: DMR]
Error Detection
Physical Redundancy
Heterogeneous hardware units can provide physical redundancy
– E.g. Watchdog timer
– E.g. Boeing 777: different architectures running the same program and then voting on the results
– Design diversity
Unit replication
– Gate level
– Register level
– Core level
Wastes lots of area & power
NMR impractical for PCs
False error reporting becomes more likely
Using different hardware for each voting unit reduces the chance that one design bug corrupts every vote
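A software sketch of DMR comparison and TMR majority voting. Real voters are hardware gates; the replica functions `good` and `bad` are hypothetical stand-ins for replicated units.

```python
def dmr_check(a: int, b: int) -> bool:
    """DMR: two copies can detect a mismatch but cannot tell which copy is wrong."""
    return a == b

def majority_vote(a: int, b: int, c: int) -> int:
    """TMR: bitwise majority of three results masks any single faulty copy."""
    return (a & b) | (a & c) | (b & c)

def tmr_run(replicas, x):
    """Run three replicas on the same input and vote on the outputs."""
    a, b, c = (replica(x) for replica in replicas)
    return majority_vote(a, b, c)

# Hypothetical replicated unit: an 8-bit squarer, plus a copy with one flipped output bit.
good = lambda x: (x * x) & 0xFF
bad = lambda x: good(x) ^ 0x10

dmr_ok = dmr_check(good(7), bad(7))   # mismatch detected, no correction
voted = tmr_run([good, bad, good], 7) # single fault outvoted
```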
Error Detection
Temporal Redundancy
Twice the active power but not twice the area
Can find transient but not permanent errors
Smart pipelining can have the votes arrive 1 cycle apart, but wastes pipeline slots
Information Redundancy
Error-Detecting Code (EDC)
Words mapped to code words like checksums and CRC
Hamming Distance (HD)
Single-Error Correcting (SEC) Double-Error Detecting (DED) with HD of 4
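One way to get an HD-4 SEC-DED code is the extended Hamming (8,4) code: Hamming(7,4) plus an overall parity bit. The slide names no specific code, so this choice and the bit layout (index 0 holds the overall parity, indices 1, 2, 4 hold the Hamming parities) are assumptions made for the sketch.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2, 4 hold parity; position 0 is overall parity

def syndrome_of(bits):
    """XOR of the positions (1..7) of all set bits; 0 for a valid Hamming word."""
    s = 0
    for pos in range(1, 8):
        if bits[pos]:
            s ^= pos
    return s

def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming code word (list of bits)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    s = syndrome_of(bits)
    for p in (1, 2, 4):            # set parity bits so the syndrome becomes 0
        bits[p] = 1 if s & p else 0
    bits[0] = sum(bits[1:]) & 1    # overall parity makes the total weight even
    return bits

def decode(bits):
    """Return (nibble, status); nibble is None when a double error is detected."""
    bits = list(bits)
    s = syndrome_of(bits)
    odd_weight = sum(bits) & 1
    if s == 0 and not odd_weight:
        status = "ok"
    elif odd_weight:
        bits[s] ^= 1               # single error; s == 0 means the overall parity bit flipped
        status = "corrected"
    else:
        return None, "double error detected"
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status
```

A single flipped bit makes the overall weight odd and the syndrome points at it. Two flipped bits leave the weight even but the syndrome nonzero, so they are detected without being miscorrected.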
For an ALU we can compare the bit count of the inputs and outputs, but this is not common
Many other techniques exist, like BIST, or calculating a known quantity and comparing it to a ROM with the answer in it
REcomputing with Shifted Operands (RESO) finds permanent errors
Redundant multithreading: use empty slots to run redundant threads
Checking invariant conditions
Anomaly detection like behavioural antivirus (look at data and/or traces)
Error Detection by Duplicated Instructions (EDDI) – software checks the hardware: instructions are duplicated at compile time and the two results are compared
Way, way more stuff about caches, CAMs, consistency, and more
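A sketch of why RESO catches a permanent fault that plain re-execution misses. It assumes a hypothetical adder whose output bit 3 is stuck at 0; `faulty_add` and `STUCK_BIT` are invented for the example.

```python
STUCK_BIT = 3  # hypothetical permanent fault: adder output bit 3 is stuck at 0

def faulty_add(a: int, b: int) -> int:
    """Adder model with one output bit permanently forced to 0."""
    return (a + b) & ~(1 << STUCK_BIT)

def reso_add(a: int, b: int, add=faulty_add):
    """Compute a+b, recompute with operands shifted left by one, shift back, compare."""
    first = add(a, b)
    second = add(a << 1, b << 1) >> 1
    return first, first == second

# 5 + 4 = 9 = 0b1001 has bit 3 set, so the fault corrupts the first pass.
result, agrees = reso_add(5, 4)

# Plain temporal redundancy hits the same stuck bit twice and sees no difference.
plain_rerun_agrees = faulty_add(5, 4) == faulty_add(5, 4)
```

In the shifted pass the stuck bit lines up with a different bit of the true sum, so the two passes disagree whenever the fault affects either one.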
Error Recovery
Safety from detection but what about liveness?
Forward Error Recovery – FER
– Once detected, the error is seamlessly corrected
FER implemented using physical, information, or temporal redundancy
More HW needed to correct than detect
– E.g. DMR can detect but TMR or triple-triple can correct (spatial)
HD = k (information redundancy)
– detects up to k-1 bit errors
– corrects up to ⌊(k-1)/2⌋ bit errors
– (HD, detect, correct) = (5, 4, 2)
TMR by repetition (temporal)
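The HD = k rule can be checked numerically. The sketch below uses a 5-bit repetition code as an arbitrary example of a code with Hamming distance 5; the helper names are illustrative.

```python
from itertools import combinations

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")

def min_distance(code_words) -> int:
    """Hamming distance of a code: the minimum over all pairs of code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))

def capability(hd: int):
    """(detectable, correctable) bit errors for a code with Hamming distance hd."""
    return hd - 1, (hd - 1) // 2

repetition5 = [0b00000, 0b11111]     # each data bit sent five times
hd = min_distance(repetition5)       # 5
detect, correct = capability(hd)     # (4, 2), matching the (5, 4, 2) entry
secded = capability(4)               # HD 4: detect 3, correct 1
```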
Error Recovery
Backward Error Recovery – BER
– Rollback / Safe point
– Restore point
– Recovery line for multicore (cool!)
– How do we model communication in multiprocessors with caches?
– Just log everything? Nope, save it distributed and in the caches. Possibly use software.
– Way more crazy algorithm selection magic…
The Output Commit Problem
– Sphere of recoverability
– Don’t let bad data out
– Wait for error detection hardware to complete
– Latency is usually hidden
– Processor state is difficult to store/restore
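A toy software model of checkpoint, rollback, and output commit. Real BER schemes checkpoint processor and memory state in hardware; the `Recoverable` class, its method names, and the error detector here are all hypothetical.

```python
import copy

class Recoverable:
    """State plus an output buffer that forms the sphere of recoverability."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)
        self._pending = []    # outputs held back until error detection finishes
        self.released = []    # outputs that have left the sphere

    def emit(self, value):
        self._pending.append(value)

    def commit(self):
        """No error found: take a new checkpoint and release buffered outputs."""
        self._checkpoint = copy.deepcopy(self.state)
        self.released.extend(self._pending)
        self._pending.clear()

    def rollback(self):
        """Error found: restore the last checkpoint and drop unreleased outputs."""
        self.state = copy.deepcopy(self._checkpoint)
        self._pending.clear()

def run_step(system, step, error_detected):
    step(system)
    if error_detected(system):
        system.rollback()
        return False
    system.commit()
    return True

def good_step(s):
    s.state["acc"] += 1
    s.emit(s.state["acc"])

def bad_step(s):
    s.state["acc"] = -999     # models a corrupted result
    s.emit(s.state["acc"])

system = Recoverable({"acc": 0})
negative = lambda s: s.state["acc"] < 0   # stand-in error detector
first_ok = run_step(system, good_step, negative)
second_ok = run_step(system, bad_step, negative)
```

The corrupted value never reaches `released`, because outputs are held until the detector has had its say.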
Error Recovery
FER when DRAM module fails
– RAID-M/chipkill
Fault Diagnosis
Diagnosis hardware
– FER and BER do not solve livelock
– E.g. mult fails, recover, mult again… livelock
Idea: be smart, figure out which components are toast
BIST
– Compare boundary scan data or stored tests to a ROM with the right answers
Run BIST at fixed intervals or at the end of a context switch
Commit changes if error free, otherwise restore
Try to test all components in the system, ideally all gates
MPs/NoCs typically have dedicated diagnosis hardware
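A software analogy of BIST-based diagnosis: run stored test vectors through each unit and compare against a table of known answers (the "ROM"). The test vectors and the broken multiplier model are invented for the sketch.

```python
# Hypothetical ROM contents: (input a, input b, expected output) per unit.
TEST_ROM = {
    "add":  ((1, 2, 3), (255, 1, 256), (0, 0, 0)),
    "mult": ((3, 5, 15), (7, 7, 49), (0, 9, 0), (255, 255, 65025)),
}

def bist(unit, vectors) -> bool:
    """A unit passes if every stored test reproduces the expected answer."""
    return all(unit(a, b) == expected for a, b, expected in vectors)

def diagnose(units):
    """Return the names of units that fail their self-test."""
    return [name for name, unit in units.items() if not bist(unit, TEST_ROM[name])]

units = {
    "add":  lambda a, b: a + b,
    "mult": lambda a, b: (a * b) & ~1,   # modelled permanent fault: output bit 0 stuck at 0
}
dead_units = diagnose(units)
```

Knowing that `mult` is the failing unit is what breaks the recover-and-retry livelock: the system can stop retrying it and route around it.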
Self-Repair
BIST can tell you what broke, but not how to fix it
i7 can respond to errors on the on-chip busses at runtime. Partial bus shorts do not kill the system. Data is transferred like a packet (NoC)
– Because of all the prediction, lanes, and issue logic, superscalar has much more redundancy than RISC
– For RISC just steal a core from the grid and mark the old core dead
– CISC has some very crazy metrics for triggering self-repair
Remember the infinite loop mult we diagnosed?
Alternative: notice that mult is dead, use shift-and-add (Booth)
Another cool idea: if shift breaks, use the mult with power-of-2 inputs (hot spare)
A cold spare would be a fully dedicated redundant unit
– CellBE only uses 7 cores and has an 8th cold spare SPE! So cool!
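A sketch of the two hot-spare ideas: replace a dead multiplier with shift-and-add, and replace a dead shifter with a multiply by a power of 2. This is plain unsigned shift-and-add, not Booth recoding, and the `mult_ok` flag is a hypothetical result of an earlier diagnosis step.

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Unsigned multiply using only the shifter and the adder."""
    product = 0
    while b:
        if b & 1:
            product += a
        a <<= 1
        b >>= 1
    return product

def multiply(a: int, b: int, mult_ok: bool, mult=lambda x, y: x * y) -> int:
    """Use the hardware multiplier if diagnosis says it works, else fall back."""
    return mult(a, b) if mult_ok else shift_add_multiply(a, b)

def shift_left_via_multiplier(x: int, n: int, mult=lambda x, y: x * y) -> int:
    """If the shifter is dead, x << n equals x * 2**n on the multiplier."""
    return mult(x, 1 << n)

fallback_product = multiply(13, 11, mult_ok=False)
spare_shift = shift_left_via_multiplier(5, 3)
```

The fallback is slower than the dedicated unit, which is the usual trade for a hot spare that reuses hardware already on the chip.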
Conclusions
Things are getting a bit crazy in error detection and correction
Multicore and caches complicate everything
Although this fault stuff has long been known, it is only now entering the PC market because the error rate is increasing with process technology
Like the Byzantine generals problem, we start to worry about who to trust in the running but broken chip
Voting works best for transient errors. It works for permanent errors too, but land the plane or you will end up crashing.
You can prove that it is easier to detect a problem than to fix it.
References
[1] Daniel J. Sorin, “Fault Tolerant Computer Architecture (Synthesis Lectures on Computer Architecture),” 2010.