Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University.


Lecture 4: Soft Errors Software Techniques

Outline □Soft Errors Recap □Process Technology and Packaging Solutions □Gate-level and Circuit-level Solutions □Microarchitectural Solutions □Single-core □Multi-threaded □Software Solutions □Multi Bit Upsets (MBUs) □Single Event Latchup

Razor □Originally proposed to tolerate process variations and achieve power reduction □Shadow latch clocked with a delayed clock □If difference in values latched, raise error □How to use it to detect soft errors?

Multi-issue Processors □Superscalar □Execute multiple instructions from the same thread in one cycle □Multi-threading □Execute instructions from a single thread in any one cycle, but can switch between threads (applications) from cycle to cycle □Simultaneous Multithreading □Issue instructions from different threads in the same cycle [Figure: issue-slot usage for Superscalar, Multithreading, and Simultaneous Multithreading]

SMT Solutions □SRT: Simultaneous Redundant Threading □Duplicate a thread, and run the two copies on the same core as a leading thread and a trailing thread □Threads maintain their own contexts, including the register file □Threads should not diverge when there are no faults □Memory Interface □Only the leading thread can read from memory □It puts a copy of each loaded value in the LVQ (load value queue); the trailing thread reads from there □Leading thread writes its store values to the STB (store buffer) □Only the trailing thread can write to memory, after checking its value against the STB □Branch Interface □Leading thread writes branch outcomes to the BOQ (branch outcome queue) □Trailing thread therefore has perfect branch prediction
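The memory interface above can be sketched as a small software model. This is only an illustration, not the hardware design: the class name `SRTMemoryInterface`, its method names, and the choice to store (address, value) pairs in both queues are assumptions made for the example.

```python
from collections import deque


class SRTMemoryInterface:
    """Toy model of the SRT memory interface (all names are hypothetical)."""

    def __init__(self, memory):
        self.memory = memory  # dict: address -> value
        self.lvq = deque()    # load value queue filled by the leading thread
        self.stb = deque()    # store buffer filled by the leading thread

    def leading_load(self, addr):
        # Only the leading thread touches memory on a load.
        value = self.memory[addr]
        self.lvq.append((addr, value))
        return value

    def trailing_load(self, addr):
        # The trailing thread consumes the copy left in the LVQ.
        lead_addr, value = self.lvq.popleft()
        if lead_addr != addr:
            raise RuntimeError("fault detected: load address mismatch")
        return value

    def leading_store(self, addr, value):
        # The leading thread never writes memory directly.
        self.stb.append((addr, value))

    def trailing_store(self, addr, value):
        # The store commits only if both threads agree.
        if self.stb.popleft() != (addr, value):
            raise RuntimeError("fault detected: store mismatch")
        self.memory[addr] = value
```

Because both threads see the same loaded values, any disagreement at a store must come from a fault inside the core, which is why the check sits at the store boundary.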

SMT Solutions: PER □Trailing thread competes for resources – High ILP phases □STB fills up causing leading thread stalls □PER: Partial Explicit Redundancy □Leading thread uses all resources during high-ILP phases □SEM: Single Execution Mode □Trailing thread executes during low-ILP phases □REM: Redundant Execution Mode □In REM state, check all instructions □Need resume point for trailing thread □Maintain state (LVQ, STB, RF, etc…) □Proportional to slack size

SMT Solutions: IRTR □IR: Instruction Reuse □Do not execute an instruction, if it has already executed with the same inputs □Keep a reuse buffer □IRTR: Implicit Redundancy Through Reuse □Check with previous value for soft errors □If matches, continue and overwrite the value in buffer □If mis-match, raise flag □During high ILP regions
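A minimal sketch of the reuse-buffer check described above, assuming the buffer is keyed by opcode and operand values. The names `ReuseBuffer`, `execute`, and `alu` are hypothetical; a real reuse buffer is a fixed-size hardware table.

```python
class ReuseBuffer:
    """Toy reuse buffer used as an implicit redundancy check."""

    def __init__(self):
        self.entries = {}  # (opcode, operands) -> previously computed result

    def execute(self, opcode, operands, alu):
        key = (opcode, operands)
        result = alu(opcode, operands)
        # Same instruction, same inputs: the old result acts as a redundant copy.
        if key in self.entries and self.entries[key] != result:
            raise RuntimeError("soft error suspected: result differs from reuse buffer")
        self.entries[key] = result  # on a match (or first use), overwrite the entry
        return result


def alu(opcode, operands):
    """Fault-free functional unit supporting 'add' and 'sub'."""
    a, b = operands
    return a + b if opcode == "add" else a - b
```

Only instructions that repeat with identical inputs get checked this way, so the coverage depends on how often the program reuses values.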

Watchdog Processor & Control Flow Checking □Watchdog processor □Simple processor, receives signals from the main processor □Checks that the signals arrive in the correct order □S3 should not come after S1 □Watchdog program can be automatically generated □Formal techniques for correctness □Asynchronous communication of main processor with watchdog processor [Figure: Processor, Memory, and Watchdog Processor; basic blocks BB1, BB2, BB3 send signals S1, S2, S3 respectively]
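The order check can be sketched as a table of allowed successor signals. The table below assumes a straight-line flow BB1 -> BB2 -> BB3, which is a guess at the figure; the class `Watchdog` and the table `ALLOWED_NEXT` are hypothetical names.

```python
class Watchdog:
    """Checks that signals from the main processor arrive in a legal order."""

    def __init__(self, allowed_next, first_signal):
        self.allowed_next = allowed_next
        self.expected = {first_signal}

    def receive(self, signal):
        if signal not in self.expected:
            raise RuntimeError(f"control-flow error: unexpected signal {signal}")
        # The next legal signals are the successors of the one just seen.
        self.expected = self.allowed_next[signal]


# Assumed control flow: BB1 -> BB2 -> BB3, sending S1, S2, S3 in that order.
ALLOWED_NEXT = {"S1": {"S2"}, "S2": {"S3"}, "S3": set()}
```

In this sketch, receiving S3 directly after S1 raises an error, matching the rule that S3 should not come after S1.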

EDDI (Error Detection by Duplicated Instructions) □Duplicate instructions □Validation instructions □Store and branch are sync points □Check store and branch operands □Memory penalty □Load/store from duplicated locations
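A rough sketch of the EDDI pattern for one add-and-store sequence. Real EDDI operates on machine instructions and registers; here Python variables stand in for master and shadow registers, and the function name and the `fault` hook are hypothetical, added only to make the check observable.

```python
def eddi_add_and_store(mem, shadow_mem, dst, src1, src2, fault=None):
    """Compute mem[dst] = mem[src1] + mem[src2] with duplicated instructions."""
    # Master loads and shadow loads (from the duplicated memory locations).
    r1, r2 = mem[src1], mem[src2]
    r1s, r2s = shadow_mem[src1], shadow_mem[src2]

    r3 = r1 + r2     # master instruction
    r3s = r1s + r2s  # duplicated instruction on shadow registers

    if fault is not None:
        r3 = fault(r3)  # inject an error into the master copy only

    # Validation before the sync point: compare store operands.
    if r3 != r3s:
        raise RuntimeError("error detected before store")

    mem[dst] = r3
    shadow_mem[dst] = r3s
```

The memory penalty is visible in the sketch: every location exists twice, once in `mem` and once in `shadow_mem`.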

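EDDI+CFCSS (Control Flow Checking by Software Signatures) □Each node i has a signature s_i; G is the run-time signature □At the beginning of a node, perform G = G xor d and check that G equals the node's signature □For an edge from node 1 to node 2: d2 = s1 xor s2, then G = s1 xor (s1 xor s2) = s2 □Limitation: if two source nodes jump to the same destination node, then the two source nodes must have the same signature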
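The signature update can be worked through with small numbers. The signature values below are arbitrary, and `enter_node` is a hypothetical helper written only to show the XOR check.

```python
def enter_node(G, d, s):
    """Update the run-time signature G on entry to a node and check it."""
    G ^= d
    if G != s:
        raise RuntimeError("control-flow error detected")
    return G


# Arbitrary example signatures for nodes 1, 2, and 3.
S1, S2, S3 = 0b0001, 0b0010, 0b0111
D2 = S1 ^ S2  # signature difference for node 2, whose legal predecessor is node 1
```

Entering node 2 from node 1 gives G = S1 xor D2 = S2, so the check passes. Entering node 2 from node 3 gives S3 xor D2, which differs from S2, so the illegal jump is caught.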

CFCSS + SWIFT (Software Implemented Fault Tolerance) □Problem: if two source nodes jump to the same destination node, then the two source nodes must have the same signature □Need another, path-dependent variable D, set in each source node □With d5 = s1 xor s5: □B1 -> B5, D = 0, then G = s1 xor d5 xor 0 = s5 □B3 -> B5, D = s1 xor s3, then G = s3 xor (s1 xor s5) xor (s1 xor s3) = s5
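The fan-in case can be checked numerically. The signature values are arbitrary and `enter_fan_in_node` is a hypothetical helper; the formulas follow the slide, with d5 = s1 xor s5.

```python
def enter_fan_in_node(G, d, D, s):
    """Signature update for a node with several predecessors: G = G xor d xor D."""
    G = G ^ d ^ D
    if G != s:
        raise RuntimeError("control-flow error detected")
    return G


# Arbitrary example signatures for B1, B2, B3, and B5.
S1, S2, S3, S5 = 1, 2, 3, 5
D5 = S1 ^ S5          # static signature difference of B5, relative to B1
D_FROM_B1 = 0         # value B1 assigns to D before jumping to B5
D_FROM_B3 = S1 ^ S3   # value B3 assigns to D before jumping to B5
```

Both legal paths reach G = s5 even though B1 and B3 have different signatures. A jump from B2, which is not a predecessor and leaves D at a stale value such as 0, produces a different G and is flagged.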

ED4I: Error Detection by Diverse Data and Duplicated Instructions The simplest way to detect Byzantine faults is to run the same program on multiple processors and compare the results. ED4I is Byzantine fault detection for uniprocessors. It must take into account both temporary and permanent faults. Re-executing with the same inputs does not guard against permanent faults. Overhead = 100%.

Key Idea Let's feed two different sets of data into the program and then compare the results. Key Insight: If the program only uses arithmetic operations, we can alter the input by multiplying all input numbers by a constant. The modified output will then be (real output) * (the constant). Thus, we can verify that the two computations succeeded, AND the two computations will be affected by errors differently.

New Program If we alter the input to the program, we must alter the program to work with this modified input. The transformation is given the constant k (called the “diversity factor”) and creates the “k-factor diverse program”. The new program will have the same control flow graph as the old program, but all the variables will be k-multiples of the original ones.

Transformations If k < 0, inequality comparisons in conditionals are flipped (> ↔ <, ≥ ↔ ≤). All constants in the code get multiplied by k. Addition and subtraction of variables are unchanged. Multiplication: v_1 * v_2 * ... * v_n → (v_1 * v_2 * ... * v_n) / k^(n-1). Division: v_1 / v_2 → (v_1 / v_2) * k.
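A small sketch of a k-factor diverse program, assuming k = -2 and assuming that a negative k reverses inequality comparisons. The functions `original`, `diverse`, and `check` are made up for illustration and are not taken from the ED4I work itself.

```python
K = -2  # diversity factor chosen for this example


def original(a, b):
    if a > b:
        return a * b + 3
    return a - b


def diverse(ka, kb):
    """K-factor diverse version; ka and kb are K times the original inputs."""
    if ka < kb:                       # '>' becomes '<' because K is negative
        return (ka * kb) // K + 3 * K  # product divided by K^(2-1); constant scaled by K
    return ka - kb                    # subtraction is unchanged


def check(a, b):
    """Run both versions; a mismatch indicates an error in one of them."""
    return diverse(K * a, K * b) == K * original(a, b)
```

For a = 5, b = 3 the original returns 18 and the diverse program returns -36, which is K times 18. The two runs exercise the hardware with different bit patterns, which is what lets a permanent fault show up as a mismatch.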

Fault Detection & Data Integrity For functional unit h_i (such as the adder), fault f and diversity factor k: □X_i = the set of inputs to h_i □E_i = subset of X_i containing the inputs that will result in erroneous output due to the fault □E'_i = subset of E_i that will escape detection □C_i(k) = probability of catching an error in h_i □D_i(k) = probability of missing no errors in h_i

Choosing the value of k For some functional units we can derive C_i(k) and D_i(k) analytically for each k. This is too hard in general, so try out a range of k's empirically to determine C_i(k) and D_i(k). Units studied: □Bus signal (12-bit) □12-bit carry look-ahead adder □12-bit multipliers and dividers

Analytical Computation of AVF □Iteration Space (IS) □L-dimensional integer vector space □L: levels of loop □Each point in IS represents an iteration □Data dependences exist □Fully ordered in time □Array Space (AS) □M-dimensional integer vector space □M: array dimension □Every point represents an element of the array □Example: for (i=0; i<N1; i++) for (j=0; j<N2; j++) a[i][j] = a[i][j-1] + a[i-1][j] + a[i][j+1];

Analytical Computation of AVF □Access Function (AF) of a reference □Mapping from IS to AS □When are the elements of array accessed by a reference □References will access different parts of Array Space □Divide the Array Space into regions, in which every element is accessed by a subset of references □Array Interval (AI): Subset of AS that the reference accesses □Every element is accessed by the same set of references

Analytical Computation of AVF: Iteration Intervals for an Array Interval □Each reference will access the elements of an array interval at iterations given by the AF (Access Function) □Iteration Interval (II) is the AF in the Array Interval □Formula for the access time of each element in the II □Vulnerability can be computed as a formula on the II □Time from r/w → r (from any read or write to the next read) □A reference either reads or writes (not both) □Need to time-order points in the II □Break into Iteration Segments, which can be ordered □Strict order, or point-wise ordered
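The "time from r/w to r" rule can be illustrated on an explicit access trace for one array element. This brute-force version is only a sketch of what the analytical formula is meant to compute; the function name and trace format are assumptions, and the time before the first access is ignored.

```python
def vulnerable_time(accesses):
    """Sum the intervals that end in a read.

    accesses: list of (time, kind) tuples sorted by time, kind is 'r' or 'w'.
    An interval that ends in a write is not vulnerable, because the
    possibly corrupted value is overwritten before it is used.
    """
    total = 0
    prev_time = None
    for time, kind in accesses:
        if kind == "r" and prev_time is not None:
            total += time - prev_time
        prev_time = time
    return total
```

For the trace write@0, read@4, write@10, read@12, the vulnerable time is (4 - 0) + (12 - 10) = 6; the interval from 4 to 10 does not count because it ends in a write.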

Multiple-bit Upsets (MBUs) □Error rate ~ 1/100th of SEU □Hamming Code □1-bit error correction, 2-bit error detection □Reed-Solomon Codes □RS(n,k) with s-bit symbols □s: each symbol is s bits □n: total number of symbols per codeword, n = 2^s - 1 □k: number of data symbols □Number of parity symbols = 2t = n - k □Can correct errors in t symbols, where t = (n-k)/2 □RS(255, 223) with 8-bit symbols □Can correct 16 symbol errors in each codeword (255 symbols) □Other multi-bit error detection and correction schemes □LDPC
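The Reed-Solomon parameters can be checked with a few lines of arithmetic. This sketch treats n and k as symbol counts, which is the standard convention for RS(n, k); the function name is hypothetical.

```python
def rs_parameters(s, k):
    """Return (n, parity_symbols, t) for a Reed-Solomon code with s-bit symbols."""
    n = 2 ** s - 1   # codeword length in symbols
    parity = n - k   # parity symbols, equal to 2t
    t = parity // 2  # number of correctable symbol errors
    return n, parity, t
```

For s = 8 and k = 223 this gives n = 255, 32 parity symbols, and t = 16, matching the RS(255, 223) example. Since each symbol is 8 bits, a multi-bit upset confined to one symbol counts as a single symbol error.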

[Figure (Copyright 2005, M. Tahoori): outcome of a strike on a state bit, e.g., in the register file. If the bit is never read: benign fault, no error. If the bit is read and has error protection: either the error can be corrected (e.g., ECC), giving no error, or it is only detected (e.g., parity with no recovery), giving a detected but unrecoverable error (DUE). If the bit is read and has no protection: if the bit matters, Silent Data Corruption (SDC); otherwise benign fault, no error.]

Interleaving bits □Interleaving converts □spatial multi-bit error → multiple single-bit errors [Figure: a row of physically adjacent bits in which neighboring bits belong to different ECC codewords; bits marked with the same symbol (X, +, /, 0) are covered by the same ECC code]
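A short sketch of why interleaving helps, assuming a simple layout in which physical bit i belongs to codeword i mod (number of codewords). The function name, parameters, and layout are illustrative assumptions.

```python
def errors_per_codeword(struck_bits, num_codewords, interleaved, bits_per_codeword=8):
    """Count how many struck physical bits fall into each ECC codeword."""
    counts = [0] * num_codewords
    for bit in struck_bits:
        if interleaved:
            owner = bit % num_codewords       # neighbors belong to different codewords
        else:
            owner = bit // bits_per_codeword  # neighbors belong to the same codeword
        counts[owner] += 1
    return counts
```

With four codewords, a strike that flips physical bits 8 through 11 puts four errors into one codeword without interleaving, but one error into each of the four codewords with interleaving, and a single-error-correcting code can repair each of those.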

Two Separate Strikes on Different Bits: Temporal Double Bit Errors □SECDED ECC (single error correction, double error detection) □could detect the error, but cannot correct it □if errors accumulate □single-bit correctable error becomes a double-bit detectable error [Figure: first strike at cycle 100, second strike at cycle 1,000,000]

Solutions for Temporal Double Bit Errors □Natural Effects □whenever a processor reads a cache block, we can correct the single bit error □check for errors when cache blocks are replaced from the cache □More Powerful ECC □SECDED ECC requires 8 bits per 64 bits □7 bits for single bit correction □8th bit for double bit detection □Overhead = 13% □ECC with two bit correction requires 12 bits per 64 bits □Overhead = 19%
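The 7-bit figure and the overhead percentages can be reproduced with a short calculation. The sketch uses the Hamming condition 2^r >= m + r + 1 for single-error correction over m data bits; the function names are hypothetical, and the 12-bit figure for two-bit correction is taken from the slide rather than derived.

```python
def sec_check_bits(data_bits):
    """Smallest r such that 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def overhead_percent(check_bits, data_bits):
    """Storage overhead of the check bits relative to the data bits."""
    return 100.0 * check_bits / data_bits
```

For 64 data bits, `sec_check_bits` gives 7; adding one overall parity bit for double-error detection gives 8 bits, or 12.5% overhead (rounded to 13% above). Twelve check bits per 64 data bits is 18.75%, rounded to 19%.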

Scrubbing □Periodically read memory and correct all single bit errors □Disallows accumulation of temporal double bit errors □Standard technique in main memories (DRAMs)
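A toy model of how scrubbing prevents accumulation in a single SECDED-protected word. It assumes a scrub pass runs every `scrub_period` cycles and clears any single pending error; the function name and return strings are made up for the example.

```python
def word_outcome(strike_cycles, scrub_period=None):
    """Return the worst state reached by one SECDED word under the given strikes."""
    pending = 0     # uncorrected bit errors currently in the word
    last_scrub = 0  # cycle of the most recent scrub pass applied so far
    for cycle in sorted(strike_cycles):
        if scrub_period is not None:
            most_recent_scrub = (cycle // scrub_period) * scrub_period
            if most_recent_scrub > last_scrub:
                pending = 0  # the scrub corrected the earlier single-bit error
                last_scrub = most_recent_scrub
        pending += 1
        if pending >= 2:
            return "detected, uncorrectable"
    return "correctable"
```

With strikes at cycles 100 and 1,000,000 and no scrubbing, the two errors accumulate into a detectable but uncorrectable double-bit error. With a scrub every 500,000 cycles, the first error is repaired before the second strike arrives.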

Single Event Latchup □SEL: Single Event Latchup □Parasitic circuit elements forming a silicon controlled rectifier (SCR) □Potentially destructive □the device current may destroy the device if it is not current-limited and removed "in time" □Removal of power to the device is required in all non-catastrophic SEL conditions in order to recover device operations □SEL probability increases with temperature!
