
1 Spring 2014, Apr 11...ELEC 7770: Advanced VLSI Design (Agrawal)
ELEC 7770 Advanced VLSI Design, Spring 2014
Soft Errors and Fault-Tolerant Design
Vishwani D. Agrawal, James J. Danaher Professor
ECE Department, Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E7770_Spr14

2 Soft Errors
- Soft errors are errors caused by the operating environment.
- They are not due to a permanent hardware fault.
- Soft errors are intermittent or random, which makes testing for them unreliable.
- One way to deal with soft errors is to make hardware robust:
  - Capable of detecting soft errors
  - Capable of correcting soft errors
- Both measures are probabilistic.

3 Some Early References
- J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” pp. 329-378, 1959, in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963.
- M. A. Breuer, “Testing for Intermittent Faults in Digital Circuits,” IEEE Trans. Computers, vol. C-22, no. 3, pp. 241-246, March 1973.
- T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE Trans. Electron Devices, vol. ED-26, no. 1, pp. 2-9, 1979.

4 Causes of Soft Errors
- Interconnect coupling (crosstalk).
- Power supply noise: IR-drop, power droop, ground bounce.
- Ignition noise.
- Electromagnetic pulse (EMP).
- Effects generally attributed to alpha-particles:
  - Charged particles: electrons, protons, ions.
  - Radiation (photons): X-rays, gamma-rays, ultra-violet light.

5 Sources of Alpha-Particles
- Radioactive contamination in VLSI packaging material.
- Ionosphere, magnetosphere and solar radiation.
- Other electromagnetic radiation.

6 Alpha-Particle
- Helium nucleus: two protons and two neutrons, mass = $6.65 \times 10^{-27}$ kg, charge = +2e ($e = 1.6 \times 10^{-19}$ C).
- Rest-mass energy ($mc^2$) = 3.73 GeV

7 Soft Error Rate (SER)
- Failures in time (FIT): One FIT is 1 error per billion hours of operation.
- An alternative unit is mean time between failures (MTBF) or mean time to failure (MTTF).
- 1 year MTBF = $10^9/(365 \times 24)$ = 114,155 FIT
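
The FIT and MTBF units are reciprocals of each other up to a change of time unit. The following Python sketch shows the conversion; the function names are illustrative, and a year is taken as 365 × 24 hours, as in the slide's arithmetic.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, the convention used on the slide


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Failures per 10^9 hours of operation for a given MTBF in hours."""
    return 1e9 / mtbf_hours


def fit_to_mtbf_years(fit: float) -> float:
    """MTBF in years for a given FIT rate."""
    return 1e9 / fit / HOURS_PER_YEAR


if __name__ == "__main__":
    print(round(mtbf_hours_to_fit(HOURS_PER_YEAR)))  # FIT for a 1-year MTBF
    print(fit_to_mtbf_years(1000.0))                 # MTBF in years at 1000 FIT
```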

8 Particle Strike
[Figure: an ion or charged particle passes through an n region into the p-substrate, leaving + and - charges along its track.]

9 Induced Current
$I(t) = I_0 \left( e^{-t/a} - e^{-t/b} \right)$, $a \gg b$
[Figure: plot of current versus time.]

10 Voltage Induced at a Node
$V = Q/C$, where $Q = \int I(t)\,dt$ and $C$ is the node capacitance.
A smaller node capacitance results in a larger voltage swing.
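
As a rough illustration of the two formulas above, the double-exponential current integrates in closed form to $Q = I_0 (a - b)$, which then gives the node voltage $V = Q/C$. The Python sketch below checks that closed form numerically. The parameter values are assumptions chosen only for illustration, and the model ignores any restoring current from the driving gate.

```python
import math


def induced_current(t: float, i0: float, a: float, b: float) -> float:
    """I(t) = I0 * (exp(-t/a) - exp(-t/b)), with a >> b."""
    return i0 * (math.exp(-t / a) - math.exp(-t / b))


def collected_charge(i0: float, a: float, b: float) -> float:
    """Closed-form integral of I(t) from 0 to infinity."""
    return i0 * (a - b)


def numeric_charge(i0: float, a: float, b: float, steps: int = 20_000) -> float:
    """Midpoint-rule check of the closed form, integrating out to 20 * a."""
    dt = 20 * a / steps
    return sum(induced_current((k + 0.5) * dt, i0, a, b) for k in range(steps)) * dt


def node_voltage(i0: float, a: float, b: float, c: float) -> float:
    """V = Q / C for the total collected charge."""
    return collected_charge(i0, a, b) / c


# Assumed, illustrative values (not from the slides).
I0 = 100e-6       # peak current scale, amperes
A_SLOW = 200e-12  # slow time constant a, seconds
B_FAST = 50e-12   # fast time constant b, seconds

if __name__ == "__main__":
    for cap in (10e-15, 50e-15):  # node capacitance in farads
        print(f"C = {cap:.0e} F -> V = {node_voltage(I0, A_SLOW, B_FAST, cap):.3f} V")
```

The smaller capacitance produces the larger voltage, which is the point made on the slide.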

11 Effect on Digital Circuit
[Figure: clocked circuit with IN, OUT, CK and a combinational logic block; charged particles strike at two locations.]

12 An SRAM Cell
[Figure: SRAM cell schematic with labels bit, VDD, WL and BL; the storage nodes hold 0 and 1.]

13 SRAM Cell Struck by Alpha-Particle
Single-Event Upset (SEU)
[Figure: the same SRAM cell struck by charged particles; the stored values flip, 0→1 and 1→0.]

14 A Resistor Hardened SRAM Cell
[Figure: SRAM cell schematic with labels bit, VDD, WL and BL; the storage nodes hold 0 and 1.]

15 D-Latch
[Figure: D-latch with input D, CK = 0, and the two complementary outputs holding 1 and 0.]

16 SEU in D-Latch
[Figure: the same D-latch with CK = 0 struck by charged particles; the stored values flip, 1→0 and 0→1.]

17 Single Event Transients in Combinational Logic
[Figure: combinational logic with clock CK struck by charged particles; signal labels 1111 and 0 1 0 1.]

18 Effects of Transients
- Error correcting effects
  - Transient pulse is filtered by gate inertia
  - Transient is blocked by an unsensitized path
  - Transient is blocked by an inactive clock
- Error enhancing effects
  - Large number of gates can produce multiple pulses
  - Fanouts can multiply error pulses

19 Typical Soft Error Distribution
[Figure: typical soft error distribution.]
Source: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

20 Soft Error Simulation
- F. Wang and V. D. Agrawal, “Soft Error Rate with Inertial and Logical Masking,” Proc. 22nd International Conference on Quality VLSI Design, January 2009, pp. 459-464.
- F. Wang and V. D. Agrawal, “Soft Error Rate Determination for Nanoscale Sequential Logic,” Proc. 11th International Symposium on Quality Electronic Design (ISQED), March 2010, pp. 225-230.

21 SEUs in FPGA
- Parts that can be affected
  - Look-up table (LUT)
  - Configuration memory cell
  - Flip-flop
  - Block RAM
- F. L. Kastensmidt, L. Carro and R. Reis, Fault-Tolerant Techniques for SRAM-Based FPGAs, Springer, 2006.

22 LUT
[Figure: look-up table with inputs F1, F2, F3, F4 and output out; its 16 memory cells hold 1 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0.]

23 SEU in LUT
[Figure: the same LUT struck by a charged particle; the memory cells now hold 1 0 1 0 0 1 1 0 0 0 0 0 1 1 1 0, with one 1 changed to 0.]

24 Four Types of SEU in FPGA
[Figure: FPGA logic with a LUT (inputs F1 to F4), a flip-flop (FF), configuration memory cells (M) and block RAM, marking SEU Type 1, Type 2, Type 3 and Type 4.]

25 SEU Detection Methods
- Hardware redundancy
- Time redundancy
- Error detection codes (EDC)
- Self-checker techniques

26 SEU Mitigation Techniques
- Triple modular redundancy (TMR)
- Multiple redundancy with voting
- Error detection and correction codes (EDAC)
- Hardened memory cells
- FPGA-specific methods
  - Reconfiguration
  - Partial configuration
  - Rerouting design

27 Hardware Redundancy for Detection
[Figure: the inputs feed the combinational logic and a duplicated copy of it; alongside the output, a check signal at logic 1 indicates an error.]
- Hardware overhead is high, about 100%.
- Performance penalty is negligible.

28 Time Redundancy for Detection
[Figure: one combinational logic block between inputs and output, with D flip-flops clocked by CK and by CK + d; a check signal at logic 1 indicates an error.]
- Hardware overhead is low.
- Performance penalty (about d) = maximum detectable pulse width.

29 Repeat on Error Detection
[Figure: the time-redundancy circuit with D flip-flops clocked by CK and CK + d, plus a C-element (C) driving the output; logic 1 indicates an error.]
Operation: If an error is detected, the output retains its previous value. Repeating the computation can produce the correct result.

30 Muller C-Element
[Figure: C-element symbol (C) with inputs A and B, and an implementation built around an SR latch (S, R, Q) driving the output.]

A B | output
0 0 | 0
0 1 | Old output
1 0 | Old output
1 1 | 1
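
The truth table of the Muller C-element can be captured in a few lines of Python. This is a behavioral sketch only, with a hypothetical function name, and it does not model timing or the SR-latch implementation.

```python
def c_element(a: int, b: int, old_output: int) -> int:
    """Muller C-element: follow the inputs when they agree, otherwise hold."""
    if a == b:
        return a
    return old_output


if __name__ == "__main__":
    output = 0
    # Apply a sequence of (A, B) input pairs and watch the output hold
    # its value whenever the two inputs disagree.
    for a, b in [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]:
        output = c_element(a, b, output)
        print(a, b, output)
```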

31 Dynamic CMOS C-Element
[Figure: C-element symbol (C) with inputs A and B, and its dynamic CMOS transistor circuit.]

A B | output
0 0 | 1
0 1 | Old output
1 0 | Old output
1 1 | 0

32 Pseudostatic CMOS C-Element
[Figure: the CMOS C-element circuit with a weak keeper added at the output.]

A B | output
0 0 | 1
0 1 | Old output
1 0 | Old output
1 1 | 0

33 Built-In Soft Error Resilience (BISER)
[Figure: data from combinational logic is captured by a flip-flop and a duplicate flip-flop on the same clock; their outputs drive inputs A and B of a C-element with a weak keeper, which produces the output.]

A B | output
0 0 | 1
0 1 | Old output
1 0 | Old output
1 1 | 0

34 BISER
- Assumptions:
  - Most soft errors in combinational logic are eliminated by inertial or logic masking.
  - The soft error pulse generated in a flip-flop is much shorter than the clock period.
  - The probability of either a master or slave latch being struck by a soft error exactly at the clock edge is small.
- The flip-flop is duplicated and both outputs are fed to a C-element.
- A twenty-fold reduction in soft errors was observed.
- Ref.: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.
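
The masking idea behind BISER can be sketched behaviorally: as long as only one of the two flip-flop copies is upset, the C-element inputs disagree and the output keeps its old value. The Python model below uses the inverting C-element truth table from the slides; the function names are hypothetical, and timing, the keeper strength, and clock-edge strikes are not modeled.

```python
def c_element_inverting(a: int, b: int, held: int) -> int:
    """Inverting C-element with keeper: 00 -> 1, 11 -> 0, otherwise hold."""
    return 1 - a if a == b else held


def biser_output(ff: int, ff_duplicate: int, held: int) -> int:
    """Output of the C-element fed by a flip-flop and its duplicate."""
    return c_element_inverting(ff, ff_duplicate, held)


# Both flip-flops captured the same data bit (1), so the output settles to 0.
out = biser_output(1, 1, held=0)

# A particle flips only the first flip-flop: the inputs disagree, so the
# keeper holds the previous output and the upset is masked.
out_after_seu = biser_output(0, 1, held=out)

# An upset in both copies at the same time is not masked by this scheme.
out_double_upset = biser_output(0, 0, held=out)

if __name__ == "__main__":
    print(out, out_after_seu, out_double_upset)
```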

35 Triple Modular Redundancy (TMR)
[Figure: the inputs feed three copies of the combinational logic (copy 1, copy 2, copy 3); a majority voter combines their results into the output.]

36 TMR Error Reduction
- Voter input error probability = E, assumed independent for each input.
- Output error probability, e = Prob(two errors or three errors):
  $e = \binom{3}{2} E^2 (1 - E) + \binom{3}{3} E^3 = 3E^2 - 3E^3 + E^3 = 3E^2 - 2E^3$
- For very small E, $E^3 \ll E^2$, so $e \approx 3E^2$.

37 TMR Error Probability

Input error probability, E | Output error probability, e
0.0   | 0.0
0.001 | 0.000002998
0.01  | 0.000298
0.1   | 0.028
0.2   | 0.104
0.3   | 0.216
0.4   | 0.352
0.5   | 0.5
0.6   | 0.648
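
The expression $e = 3E^2 - 2E^3$ can be evaluated directly to tabulate the voter output error probability. A minimal Python sketch, with an illustrative function name:

```python
from math import comb


def tmr_output_error(e_in: float) -> float:
    """Probability that at least two of three independent voter inputs are wrong."""
    two_wrong = comb(3, 2) * e_in**2 * (1 - e_in)
    three_wrong = comb(3, 3) * e_in**3
    return two_wrong + three_wrong


if __name__ == "__main__":
    for e_in in (0.0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
        print(f"{e_in:<6} {tmr_output_error(e_in):.9f}")
```

Note that TMR only helps while E < 0.5; at E = 0.5 the output is no better than a single module, and above that it is worse.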

38 Majority Voter Circuit
[Figure: majority voter block with inputs A, B, C and its gate-level circuit.]

A B C | output
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 1
1 1 0 | 1
1 1 1 | 1
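
The voter truth table corresponds to the Boolean function AB + BC + AC. The short Python sketch below enumerates all eight input combinations; it is a functional model only, not a description of the gate-level circuit in the figure.

```python
from itertools import product


def majority(a: int, b: int, c: int) -> int:
    """Majority of three bits: AB + BC + AC."""
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    print("A B C | output")
    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c, "|", majority(a, b, c))
```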

39 Alternative Implementations of Voter
[Figure: two voter implementations with inputs A, B, C: a LUT whose cells hold 0001011100010111, and a transistor circuit connected to VDD.]

40 Triple Modular Redundancy (TMR)
[Figure: time-redundant TMR built around a single combinational logic block, D flip-flops and a majority voter; clock labels CK, CK + d, CK + 2d and CK + 3d.]

41 TMR for Memory Cells
[Figure: combinational logic, D flip-flops clocked by CK, and a majority voter producing the output.]
Problems:
1. Accumulation of errors in flip-flops.
2. Voter is not protected.

42 FF Refresh and TMR for Memory Cells
[Figure: D flip-flops clocked by CK with four majority voters; signals r1, r2, r3 and the output.]

43 Reliability Analysis
- Determine how long a system will work without failure.
- Find:
  - Mean time to failure (MTTF) or mean time between failures (MTBF)
  - Mean time to repair (MTTR)
  - FIT rate

44 Reliability Function
- Reliability function of a system, R(t) = probability of survival at time t
- Determined from failure rates of components, λ(t) = number of failures per unit time. Generally varies with time.

45 Failure Rate, λ(t)
[Figure: failures per second, λ(t), on a logarithmic axis from $10^{-12}$ to $10^0$, versus time t. Three regions: infant mortality, constant failure rate (useful life, λ(t) = λ), and wearout or aging.]

46 Deriving R(t)
- R(t) is the probability of no error in the interval [0, t].
- Divide the interval into a large number (n) of subintervals of duration t/n. Let x be the probability of error in one subinterval.
- Assume that the duration t/n is so small that at most one error can occur in a subinterval. Then the average number of errors in a subinterval = 0·(1 - x) + 1·x = x = λt/n.
- The probability of no error in the interval [0, t] is $R(t) = (1 - x)^n = (1 - \lambda t/n)^n \to e^{-\lambda t}$ as $n \to \infty$, using the standard limit $\lim_{n \to \infty} (1 - a/n)^n = e^{-a}$.
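
The limiting step in this derivation can be checked numerically: as the number of subintervals n grows, $(1 - \lambda t/n)^n$ approaches $e^{-\lambda t}$. The values of λ and t below are arbitrary choices for illustration.

```python
import math


def survival_discrete(lam: float, t: float, n: int) -> float:
    """Probability of no error in n subintervals, each with error probability lam*t/n."""
    return (1 - lam * t / n) ** n


if __name__ == "__main__":
    lam, t = 0.5, 2.0  # assumed values, so that lam * t = 1
    for n in (10, 100, 10_000, 1_000_000):
        print(n, survival_discrete(lam, t, n))
    print("exp(-lam*t) =", math.exp(-lam * t))
```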

47 R(t) and MTBF
- $R(t) = e^{-\lambda t}$
- Failure rate, λ = failures per unit time
- Number of failures in time T = λT
- $\mathrm{MTBF} = T/(\lambda T) = 1/\lambda = \int_0^\infty R(t)\,dt$
- $R(t) = \exp(-t/\mathrm{MTBF})$
- For t = MTBF, $R(\mathrm{MTBF}) = e^{-1} = 0.368$

48 Reliability and MTBF
[Figure: reliability R(t), from 0.0 to 1.0, versus time t marked at 1 MTBF, 2 MTBF and 3 MTBF; at t = 1 MTBF, R(t) = 1/e = 0.368.]

49 Example: First Generation Computer
- 10,000 electron tubes.
- Average burn out rate: 5 tubes per 100,000 hours.
- MTBF = 100,000/5 = 20,000 hours = 2.3 years, i.e., 37% chance of survival beyond 2.3 years.
- Time for 95% chance of survival:
  - R(t) = exp(-t/MTBF) = 0.95, so t = -MTBF × ln(0.95) ≈ 1,026 hours, or about 1.4 months
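
The numbers in this example follow from $R(t) = \exp(-t/\mathrm{MTBF})$ and its inverse. The Python sketch below reproduces them; the function names are illustrative, and a month is taken as one twelfth of a 365-day year.

```python
import math

HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # assumed definition of a month


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """R(t) = exp(-t / MTBF) for a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


def time_for_reliability(target: float, mtbf_hours: float) -> float:
    """Solve R(t) = target for t."""
    return -mtbf_hours * math.log(target)


mtbf = 100_000 / 5                       # 5 burn-outs per 100,000 hours
t95 = time_for_reliability(0.95, mtbf)   # hours until R drops to 0.95

if __name__ == "__main__":
    print(f"MTBF = {mtbf:.0f} hours = {mtbf / HOURS_PER_YEAR:.1f} years")
    print(f"R(MTBF) = {reliability(mtbf, mtbf):.3f}")
    print(f"t(95%) = {t95:.0f} hours = {t95 / HOURS_PER_MONTH:.1f} months")
```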

50 Reliability of TMR
- R(TMR) = Prob(all three modules correct) + Prob(exactly two modules correct)
  $= R^3 + 3R^2 (1 - R) = 3R^2 - 2R^3 = 3e^{-2\lambda t} - 2e^{-3\lambda t}$

51 MTBF of TMR
$R(\mathrm{TMR}) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
$\mathrm{MTBF} = \int_0^\infty R(\mathrm{TMR})\,dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$
This is less than the MTBF = 1/λ for a single system!
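
The result 5/(6λ) can be checked by numerical integration, and the same sketch shows why TMR is still useful despite its lower MTBF. Since $3R^2 - 2R^3 > R$ exactly when $R > 0.5$, TMR is more reliable than a single module for any time shorter than $\ln 2 / \lambda$. The value λ = 1 and the function names below are assumptions for illustration.

```python
import math


def r_single(t: float, lam: float) -> float:
    """Reliability of one module with constant failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(t: float, lam: float) -> float:
    """Reliability of TMR with a perfect voter: 3R^2 - 2R^3."""
    r = r_single(t, lam)
    return 3 * r**2 - 2 * r**3


def mtbf_numeric(r_func, lam: float, horizon: float = 40.0, steps: int = 200_000) -> float:
    """Midpoint-rule estimate of the integral of R(t) from 0 to horizon / lam."""
    dt = horizon / lam / steps
    return sum(r_func((k + 0.5) * dt, lam) for k in range(steps)) * dt


lam = 1.0                      # assumed failure rate, per unit time
crossover = math.log(2) / lam  # time at which R = 0.5 and both curves meet

if __name__ == "__main__":
    print("MTBF single:", mtbf_numeric(r_single, lam))
    print("MTBF TMR   :", mtbf_numeric(r_tmr, lam), "vs 5/(6*lam) =", 5 / (6 * lam))
    for t in (0.1, 0.5, crossover, 1.0, 2.0):
        print(f"t = {t:.3f}  single = {r_single(t, lam):.4f}  TMR = {r_tmr(t, lam):.4f}")
```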

52 MTBF of TMR
[Figure: reliability R(t), from 0.0 to 1.0, versus time t for a single module and for TMR, with the mission duration marked.]

53 Error Detection Code
- Errors: Bits can flip due to noise in circuits and in communication.
- Extra bits are used for error detection.
- Example: a parity bit in ASCII code
  - 7-bit ASCII code for “A”: 1000001
  - Even parity code for “A”: 01000001 (even number of 1s)
  - Odd parity code for “A”: 11000001 (odd number of 1s)
  - The leading bit is the parity bit.
- A single-bit error in the 7-bit code of “A” changes the symbol, e.g., 1000101 is “E” and 1000000 is “@”. But the error will be detected in the 8-bit code because it changes the specified parity.
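
The parity example can be reproduced with a short Python sketch. Bit strings are used for readability, the parity bit is placed in front of the 7-bit code as on the slide, and the function names are illustrative.

```python
def add_parity(code7: str, odd: bool = False) -> str:
    """Prefix a 7-bit code with an even (default) or odd parity bit."""
    parity = code7.count("1") % 2  # bit that makes the total number of 1s even
    if odd:
        parity ^= 1
    return str(parity) + code7


def parity_ok(code8: str, odd: bool = False) -> bool:
    """Check that an 8-bit code word has the specified parity."""
    return code8.count("1") % 2 == (1 if odd else 0)


if __name__ == "__main__":
    a7 = format(ord("A"), "07b")      # 7-bit ASCII code of "A"
    even_a = add_parity(a7)
    odd_a = add_parity(a7, odd=True)
    print(a7, even_a, odd_a)

    # A single-bit error turns the 7-bit code of "A" into that of "E",
    # but the 8-bit even-parity word no longer passes the parity check.
    corrupted = even_a[0] + format(ord("E"), "07b")
    print(corrupted, parity_ok(corrupted))
```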

54 Richard W. Hamming (1915-1998)
- Error-correcting codes (ECC).
- Also known for
  - Hamming distance: HD = number of bits in which two binary vectors differ
  - Example: HD(1101, 1010) = 3
- Hamming Medal, 1988
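
The Hamming distance definition translates directly into code. A minimal Python sketch operating on bit strings of equal length:

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(bit_x != bit_y for bit_x, bit_y in zip(x, y))


if __name__ == "__main__":
    print(hamming_distance("1101", "1010"))
```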

55 The Idea of Hamming Code
- Code space contains $2^N$ possible N-bit code words.
[Figure: code words 1010 “A”, 1110 “E”, 1011 “B”, 1000 “8” and 0010 “2”; a 1-bit error in “A” lands on another valid code word at HD = 1.]
- Error not correctable. Reason: No redundancy.
- Hamming’s idea: Increase HD between valid code words.

56 Hamming’s Distance ≥ 3 Code
[Figure: valid code words 1010010 “A”, 0010101 “2”, 1000111 “8”, 1011001 “B”, 1110100 “E” and 0011110 “3”, at HD = 3 or HD = 4 from “A”. A 1-bit error in “A” gives 0010010 “?”, which is at HD = 1 from “A” and at HD = 2 or more from the other code words, so shortest distance decoding eliminates the error.]

57 Minimum Distance-3 Hamming Code

Symbol | Original code | Odd-parity code | ECC, HD ≥ 3
0 | 0000 | 10000 | 0000000
1 | 0001 | 00001 | 0001011
2 | 0010 | 00010 | 0010101
3 | 0011 | 10011 | 0011110
4 | 0100 | 00100 | 0100110
5 | 0101 | 10101 | 0101101
6 | 0110 | 10110 | 0110011
7 | 0111 | 00111 | 0111000
8 | 1000 | 01000 | 1000111
9 | 1001 | 11001 | 1001100
A | 1010 | 11010 | 1010010
B | 1011 | 01011 | 1011001
C | 1100 | 11100 | 1100001
D | 1101 | 01101 | 1101010
E | 1110 | 01110 | 1110100
F | 1111 | 11111 | 1111111

- Original code: Symbol “0” with a single-bit error will be interpreted as “1”, “2”, “4” or “8”.
  - Reason: Hamming distance between codes is 1. A code with any bit error will map onto another valid code.
- Remedy: Design codes with HD ≥ 2.
  - Example: Parity code. Single bit error detected but not correctable.
- Remedy: Design codes with HD ≥ 3.
  - For single bit error correction, decode as the valid code at HD = 1.
- For more error bit detection or correction, design code with HD ≥ 4.
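
The ECC column of the table can be generated and decoded with a short Python sketch. The three check-bit equations below are not stated on the slides; they are inferred from the table entries (for data bits d1 d2 d3 d4, the appended bits are d1⊕d2⊕d3, d1⊕d2⊕d4 and d1⊕d3⊕d4) and reproduce all 16 rows. Decoding picks the valid code word at the shortest Hamming distance, as the slides describe.

```python
def encode(data: str) -> str:
    """Append three check bits to a 4-bit data word (equations inferred from the table)."""
    d1, d2, d3, d4 = (int(ch) for ch in data)
    checks = (d1 ^ d2 ^ d3, d1 ^ d2 ^ d4, d1 ^ d3 ^ d4)
    return data + "".join(str(bit) for bit in checks)


def distance(x: str, y: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(x, y))


# Map each 7-bit code word to its hexadecimal symbol, "0" through "F".
CODEBOOK = {encode(format(value, "04b")): format(value, "X") for value in range(16)}


def decode(word: str) -> str:
    """Shortest-distance decoding: return the symbol of the nearest code word."""
    nearest = min(CODEBOOK, key=lambda code_word: distance(code_word, word))
    return CODEBOOK[nearest]


if __name__ == "__main__":
    print(encode("1010"))      # code word for symbol "A"
    print(decode("0010010"))   # "A" with its first bit flipped
```

Because every pair of code words differs in at least three bits, any single-bit error leaves the received word closer to the original code word than to any other.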

58 A Book on Coding Theory
R. W. Hamming, Coding and Information Theory, Englewood Cliffs, New Jersey: Prentice-Hall, 1980.

59 Byzantine Empire, 527-565
[Figure: Emperor Justinian and General Belisarius.]

60 Byzantine General’s Problem
- In a war, a general needs to communicate an attack (a) or retreat (r) order to subordinates in the field.
- For success, a perfect agreement is necessary.
- Byzantine Fault:
  - Subordinates can be unreliable or malicious.
  - Communication (messengers) can be unreliable or malicious.

61 Example 1: Single Fault
- General: D; Subordinates: A, B and C
[Figure: D sends the order r to A, B and C; one of the three orders is corrupted, r→a, and the other two arrive as r.]

62 Example 1: Majority Agreement
- General: D; Subordinates: A, B and C
[Figure: after the corrupted order r→a, the subordinates exchange the orders they received (a, a, r, r, r, r); the majority decision is Retreat.]

63 Example 2: Two Faults
- General: D; Subordinates: A, B and C
[Figure: D sends the order a to A, B and C.]

64 Example 2: Byzantine Failure
- General: D; Subordinates: A, B and C
[Figure: D sends a, a, a; the messages exchanged among the subordinates are r, r, r, r, a, a, and the subordinates reach different decisions, Retreat and Attack.]

65 Byzantine Resilient System
- A system that can function correctly in the presence of Byzantine faults.
- Byzantine protocol for an n-node system:
  - Any node can initiate a message broadcast.
  - Each node rebroadcasts the received message to all nodes it has not heard from.
  - After communications end, nodes take a majority decision.
- Ref.: L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.

66 Byzantine Resilience Conditions
- In order to tolerate t failures:
  - The system must have at least 3t + 1 nodes.
  - There must be at least 2t + 1 disjoint communication paths between nodes.
  - A node must exchange messages at least t + 1 times.
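
The three conditions can be turned into a small sizing helper. This Python sketch only evaluates the bounds listed above; the function names are hypothetical.

```python
def byzantine_requirements(t: int) -> dict:
    """Minimum resources needed to tolerate t Byzantine failures."""
    return {
        "nodes": 3 * t + 1,
        "disjoint_paths": 2 * t + 1,
        "message_rounds": t + 1,
    }


def max_tolerable_faults(nodes: int) -> int:
    """Largest t that satisfies nodes >= 3t + 1."""
    return (nodes - 1) // 3


if __name__ == "__main__":
    for t in (1, 2, 3):
        print(t, byzantine_requirements(t))
    print("A four-node system tolerates", max_tolerable_faults(4), "fault")
```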

67 Four-Core Processor System
[Figure: four processors A, B, C and D.]

68 Example 1: C Initiates Message m, Sends n to A and m to B and D
[Table: for processors A, B and D, the message received in the first round, the messages received in the second round, and the decoded message. In the first round A receives n while B and D receive m; each decoded message is m.]
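
The scenario in this example can be walked through with a simplified Python sketch. It assumes that the three receivers are fault-free, that each one relays its first-round value to the other two in a second round, and that each then takes a majority over the value it received and the two relayed values. This is a reduced model of the exchange described in these slides, not a full implementation of the published protocol.

```python
from collections import Counter


def two_round_exchange(first_round: dict) -> dict:
    """Decode by majority after every receiver relays its first-round value.

    first_round maps each receiver to the value the initiator sent it.
    All receivers are assumed honest and relay faithfully.
    """
    decoded = {}
    for node, own_value in first_round.items():
        relayed = [value for other, value in first_round.items() if other != node]
        votes = Counter([own_value] + relayed)
        decoded[node] = votes.most_common(1)[0][0]
    return decoded


if __name__ == "__main__":
    # Faulty initiator C sends n to A and m to B and D.
    print(two_round_exchange({"A": "n", "B": "m", "D": "m"}))
```

With one faulty node out of four, all three receivers agree on the same value, consistent with the 3t + 1 condition for t = 1.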

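69 Example 2: C Initiates Message m, B Sends p to A and D
[Table: for processors A, B and D, the message received in the first round, the messages received in the second round, and the decoded message. All three receive m in the first round; in the second round A and D each receive m and p, and both decode m.]

70 Example 2: C Initiates Message m, A and B generate faulty message q
[Table: for processors A, B and D, the message received in the first round, the messages received in the second round, and the decoded message. All three receive m in the first round; in the second round D receives q, and D decodes q.]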

71 References
- L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
- D. K. Pradhan, Fault-Tolerant Computer System Design, Upper Saddle River, New Jersey: Prentice Hall PTR, 1996.
- P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, San Francisco: Morgan-Kaufmann, 2001.

