Published by Clifton Gilmore. Modified over 8 years ago.
1
Spring 2014, Apr 11...ELEC 7770: Advanced VLSI Design (Agrawal)
ELEC 7770 Advanced VLSI Design, Spring 2014
Soft Errors and Fault-Tolerant Design
Vishwani D. Agrawal, James J. Danaher Professor
ECE Department, Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E7770_Spr14
2
Soft Errors
- Soft errors are errors caused by the operating environment. They are not due to a permanent hardware fault.
- Soft errors are intermittent or random, which makes testing for them unreliable.
- One way to deal with soft errors is to make the hardware robust:
  - Capable of detecting soft errors
  - Capable of correcting soft errors
- Both measures are probabilistic.
3
Some Early References
- J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” pp. 329-378, 1959, in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963.
- M. A. Breuer, “Testing for Intermittent Faults in Digital Circuits,” IEEE Trans. Computers, vol. C-22, no. 3, pp. 241-246, March 1973.
- T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE Trans. Electron Devices, vol. ED-26, no. 1, pp. 2-9, 1979.
4
Causes of Soft Errors
- Interconnect coupling (crosstalk).
- Power supply noise: IR-drop, power droop, ground bounce.
- Ignition noise.
- Electromagnetic pulse (EMP).
- Effects generally attributed to alpha-particles:
  - Charged particles: electrons, protons, ions.
  - Radiation (photons): X-rays, gamma-rays, ultraviolet light.
5
Sources of Alpha-Particles
- Radioactive contamination in VLSI packaging material.
- Ionosphere, magnetosphere and solar radiation.
- Other electromagnetic radiation.
6
Alpha-Particle
- Helium nucleus: two protons and two neutrons.
- Mass = $6.65 \times 10^{-27}$ kg.
- Charge = $+2e$, where $e = 1.6 \times 10^{-19}$ C.
- Rest energy ($mc^2$) = 3.73 GeV.
7
Soft Error Rate (SER)
- Failures in time (FIT): one FIT is one error per billion ($10^9$) hours of operation.
- An alternative unit is mean time between failures (MTBF) or mean time to failure (MTTF).
- An MTBF of 1 year corresponds to $10^9 / (365 \times 24) = 114{,}155$ FIT.
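The FIT and MTBF conversion above is simple arithmetic, and a short sketch can make the units explicit. The function names below are illustrative only and are not part of the course material; the 365-day year follows the slide's own convention.

```python
HOURS_PER_YEAR = 365 * 24   # 8760 hours, the convention used on the slide
FIT_HOURS = 1e9             # one FIT = one failure per 10^9 hours

def mtbf_hours_to_fit(mtbf_hours):
    """Convert an MTBF given in hours to a failure rate in FIT."""
    return FIT_HOURS / mtbf_hours

def fit_to_mtbf_years(fit):
    """Convert a failure rate in FIT to an MTBF in years."""
    return FIT_HOURS / fit / HOURS_PER_YEAR

# An MTBF of one year, expressed in FIT (the slide's example).
print(round(mtbf_hours_to_fit(HOURS_PER_YEAR)))
```

Because FIT rates of independent parts add, this form is convenient when a system-level MTBF has to be estimated from per-component FIT values.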
8
Particle Strike
[Figure: an ion or charged particle strikes an n region in a p-substrate, leaving + and - charges along its track.]
9
Induced Current
$I(t) = I_0 \left( e^{-t/a} - e^{-t/b} \right)$, with $a \gg b$
[Plot: current versus time.]
10
Voltage Induced at a Node
$V = Q/C$, where $Q = \int I(t)\,dt$ is the collected charge and $C$ is the node capacitance.
A smaller node capacitance results in a larger voltage swing.
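As a rough sketch of how the two relations combine: integrating the double-exponential current from the induced-current slide over $[0, \infty)$ gives $Q = I_0 (a - b)$, and $V = Q/C$ then gives the swing. The numerical values below ($I_0$, $a$, $b$ and the two capacitances) are hypothetical and chosen only for illustration, and the simple $V = Q/C$ model ignores any restoring current from the driving gate.

```python
import math

def collected_charge(i0, a, b):
    """Closed-form integral of i0*(exp(-t/a) - exp(-t/b)) over [0, inf)."""
    return i0 * (a - b)

def collected_charge_numeric(i0, a, b, steps=200_000):
    """Trapezoidal check of the same integral, truncated at t = 40*a."""
    t_end = 40.0 * a
    dt = t_end / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * dt
        current = i0 * (math.exp(-t / a) - math.exp(-t / b))
        total += current * (0.5 if k in (0, steps) else 1.0)
    return total * dt

def node_voltage_swing(charge, capacitance):
    """V = Q/C for a node of the given capacitance."""
    return charge / capacitance

# Hypothetical values: 100 uA peak scale, a = 200 ps, b = 50 ps.
q = collected_charge(100e-6, 200e-12, 50e-12)
for c in (10e-15, 100e-15):   # 10 fF and 100 fF nodes
    print(f"C = {c:.0e} F -> V = {node_voltage_swing(q, c):.2f} V")
```

With these assumed numbers the same strike produces ten times the swing on the 10 fF node as on the 100 fF node, which is the point the slide makes about small node capacitance.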
11
Effect on Digital Circuit
[Figure: combinational logic between inputs (IN) and clocked outputs (OUT, CK), with charged particles striking the circuit.]
12
An SRAM Cell
[Figure: SRAM cell with labels bit, VDD, WL and BL; the internal nodes store 0 and 1.]
13
SRAM Cell Struck by Alpha-Particle
[Figure: single-event upset (SEU); charged particles flip the internal nodes of the SRAM cell, 0→1 and 1→0.]
14
A Resistor Hardened SRAM Cell
[Figure: resistor-hardened SRAM cell with labels bit, VDD, WL and BL; the internal nodes store 0 and 1.]
15
D-Latch
[Figure: D-latch with CK = 0 holding the values 1 and 0 on its complementary outputs.]
16
SEU in D-Latch
[Figure: D-latch with CK = 0; charged particles flip the stored values, 1→0 and 0→1.]
17
Single Event Transients in Combinational Logic
[Figure: charged particles cause a transient on a signal in combinational logic ahead of a clocked (CK) element.]
18
Effects of Transients
Error correcting effects:
- Transient pulse is filtered by gate inertia.
- Transient is blocked by an unsensitized path.
- Transient is blocked by an inactive clock.
Error enhancing effects:
- A large number of gates can produce multiple pulses.
- Fanouts can multiply error pulses.
19
Typical Soft Error Distribution
Source: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.
20
Soft Error Simulation
- F. Wang and V. D. Agrawal, “Soft Error Rate with Inertial and Logical Masking,” Proc. 22nd International Conference on Quality VLSI Design, January 2009, pp. 459-464.
- F. Wang and V. D. Agrawal, “Soft Error Rate Determination for Nanoscale Sequential Logic,” Proc. 11th International Symposium on Quality Electronic Design (ISQED), March 2010, pp. 225-230.
21
SEUs in FPGA
Parts that can be affected:
- Look-up table (LUT)
- Configuration memory cell
- Flip-flop
- Block RAM
Ref.: F. L. Kastensmidt, L. Carro and R. Reis, Fault-Tolerant Techniques for SRAM-Based FPGAs, Springer, 2006.
22
LUT
[Figure: look-up table with inputs F1, F2, F3, F4 and output out.]
Memory cells: 1 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0
23
SEU in LUT
[Figure: look-up table with inputs F1, F2, F3, F4 and output out, struck by a charged particle.]
Memory cells: 1 0 1 0 0 1 1 0 0 0 0 0 1 1 1 0
The charged particle changed the fourth memory cell from 1 to 0.
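A small sketch can show why a single flipped LUT cell changes the implemented logic function for exactly one input combination. The cell contents are taken from the two LUT slides; the address ordering (F1 as the most significant bit) and the helper names are assumptions made here for illustration, since the slides do not specify them.

```python
from itertools import product

# Memory cell contents listed on the fault-free LUT slide.
GOLDEN = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]

def lut_read(cells, f1, f2, f3, f4):
    """Read the LUT output. Assumption: F1 is the most significant address bit."""
    address = (f1 << 3) | (f2 << 2) | (f3 << 1) | f4
    return cells[address]

def flip_cell(cells, index):
    """Return a copy of the cell contents with one cell upset."""
    upset = list(cells)
    upset[index] ^= 1
    return upset

def affected_inputs(golden, upset):
    """List the input combinations whose output differs after the upset."""
    return [bits for bits in product((0, 1), repeat=4)
            if lut_read(golden, *bits) != lut_read(upset, *bits)]

upset = flip_cell(GOLDEN, 3)          # fourth cell, 1 changed to 0
print(affected_inputs(GOLDEN, upset))
```

Under the assumed ordering only the input combination that addresses the upset cell is affected, so the error stays latent until that combination is applied.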
24
Four Types of SEU in FPGA
[Figure: LUT with inputs F1 to F4, flip-flop (FF), configuration memory cells (M) and block RAM, with upset locations labeled Type 1, Type 2, Type 3 and Type 4.]
25
SEU Detection Methods
- Hardware redundancy
- Time redundancy
- Error detection codes (EDC)
- Self-checker techniques
26
SEU Mitigation Techniques
- Triple modular redundancy (TMR)
- Multiple redundancy with voting
- Error detection and correction codes (EDAC)
- Hardened memory cells
- FPGA-specific methods:
  - Reconfiguration
  - Partial configuration
  - Rerouting design
27
Hardware Redundancy for Detection
[Figure: the combinational logic and a duplicated copy share the same inputs; a check signal at logic 1 indicates an error.]
- Hardware overhead is high, about 100%.
- Performance penalty is negligible.
28
Time Redundancy for Detection
[Figure: the combinational logic output is captured by D flip-flops clocked by CK and by CK + d; a check signal at logic 1 indicates an error.]
- Hardware overhead is low.
- Performance penalty (about d) = maximum detectable pulse width.
29
Repeat on Error Detection
[Figure: the combinational logic output is captured by D flip-flops clocked by CK and by CK + d and passed through a C-element (C); a check signal at logic 1 indicates an error.]
Operation: If an error is detected, the output retains its previous value. Repeating the computation can then produce the correct result.
30
Muller C-Element
A  B  output
0  0  0
0  1  Old output
1  0  Old output
1  1  1
[Figure: C-element symbol with inputs A and B, and an implementation with an S-R latch (S, R, Q).]
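The truth table above can be captured in a few lines as a behavioral model. This is only a sketch of the non-inverting Muller C-element as tabulated on this slide; the class name and the choice of initial state are assumptions.

```python
class CElement:
    """Behavioral model of a non-inverting Muller C-element."""

    def __init__(self, initial=0):
        # Assumed power-up state; the slide does not specify one.
        self.output = initial

    def update(self, a, b):
        """Follow the inputs when they agree, otherwise hold the old output."""
        if a == b:
            self.output = a
        return self.output

c = CElement()
for a, b in [(1, 1), (0, 1), (1, 0), (0, 0), (0, 1)]:
    print(a, b, c.update(a, b))
```

The hold behavior on disagreeing inputs is what later slides exploit: a transient on only one input cannot change the output.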
31
Dynamic CMOS C-Element
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0
[Figure: dynamic CMOS circuit with inputs A and B and the C-element symbol.]
32
Pseudostatic CMOS C-Element
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0
[Figure: CMOS circuit with inputs A and B, a weak keeper on the output, and the C-element symbol.]
33
Built-In Soft Error Resilience (BISER)
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0
[Figure: data from the combinational logic feeds a flip-flop and a duplicate flip-flop on a common clock; their outputs A and B drive a C-element with a weak keeper.]
34
BISER
Assumptions:
- Most soft errors in combinational logic are eliminated by inertial or logic masking.
- The soft error pulse generated in a flip-flop is much shorter than the clock period.
- The probability of either a master or a slave latch being struck by a soft error exactly at the clock edge is small.
Design: The flip-flop is duplicated and the two outputs are fed to a C-element.
Result: A twenty-fold reduction in soft errors was observed.
Ref.: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.
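To illustrate why duplicating the flip-flop helps, the sketch below models the BISER output stage behaviorally, using the inverting C-element truth table from the BISER slide (00 gives 1, 11 gives 0, otherwise hold). It is a simplified logical model under the stated assumptions, not the circuit from the cited paper, and the function names are hypothetical.

```python
def c_element_inverting(a, b, previous):
    """Inverting C-element with keeper: 00 -> 1, 11 -> 0, otherwise hold."""
    if a == b:
        return 1 - a
    return previous

def biser_output(stored_bit, upset_main=False, upset_duplicate=False):
    """Output after optional upsets in the flip-flop and/or its duplicate."""
    # Before any upset both copies agree, so the output has settled.
    settled = c_element_inverting(stored_bit, stored_bit, previous=0)
    main = stored_bit ^ int(upset_main)
    duplicate = stored_bit ^ int(upset_duplicate)
    # The keeper holds the settled value if the two copies now disagree.
    return c_element_inverting(main, duplicate, previous=settled)

for bit in (0, 1):
    print(bit,
          biser_output(bit),
          biser_output(bit, upset_main=True),
          biser_output(bit, upset_main=True, upset_duplicate=True))
```

In this model an upset in either single copy leaves the output unchanged, and only a simultaneous upset of both copies propagates, which is consistent with the slide's assumption that such coincidences are rare.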
35
Triple Modular Redundancy (TMR)
[Figure: the inputs drive three copies of the combinational logic (copy 1, copy 2, copy 3); a majority voter produces the output.]
36
TMR Error Reduction
Voter input error probability = $E$, assumed independent for each input.
Output error probability:
$e = \text{Prob(two errors or three errors)} = \binom{3}{2} E^2 (1 - E) + \binom{3}{3} E^3 = 3E^2 - 3E^3 + E^3 = 3E^2 - 2E^3$
For very small $E$, $E^3 \ll E^2$, so $e \approx 3E^2$.
37
TMR Error Probability
Input error probability, E    Output error probability, e
0.0                           0.0
0.001                         0.000002998
0.01                          0.000298
0.1                           0.028
0.2                           0.104
0.3                           0.216
0.4                           0.352
0.5                           0.5
0.6                           0.648
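The tabulated values follow directly from $e = 3E^2 - 2E^3$, and a short sketch makes them easy to regenerate or extend. The function name is illustrative, and the model keeps the slide's assumption of independent input errors and an error-free voter.

```python
from math import comb

def tmr_output_error(e_in):
    """Probability that two or three of the three voter inputs are wrong."""
    return comb(3, 2) * e_in**2 * (1 - e_in) + comb(3, 3) * e_in**3

for e_in in (0.0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"{e_in:<6} {tmr_output_error(e_in):.9f}")
```

Solving $3E^2 - 2E^3 < E$ gives $E < 0.5$, so under this model voting reduces the error probability only when each input is more likely right than wrong; above 0.5 the voter makes things worse.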
38
Majority Voter Circuit
A  B  C  output
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  1
1  0  0  0
1  0  1  1
1  1  0  1
1  1  1  1
[Figure: gate-level majority voter with inputs A, B, C and its block symbol.]
39
Alternative Implementations of Voter
[Figure: a LUT-based voter with inputs A, B, C and contents 0001011100010111, and a second implementation with inputs A, B, C and VDD.]
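A minimal sketch of the voter function, assuming the usual sum-of-products form AB + BC + AC, reproduces both the truth table and the LUT contents shown on the slides. Reading the 16-bit LUT string as the 8-entry table repeated for an unused fourth input is an interpretation made here, not something the slide states.

```python
from itertools import product

def majority(a, b, c):
    """Majority of three bits: AB + BC + AC."""
    return (a & b) | (b & c) | (a & c)

# Truth table in the order ABC = 000, 001, ..., 111.
truth_table = [majority(a, b, c) for a, b, c in product((0, 1), repeat=3)]
lut_bits = "".join(str(bit) for bit in truth_table)

print(lut_bits)        # contents for a 3-input LUT
print(lut_bits * 2)    # same function in a 4-input LUT, fourth input unused
```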
40
Triple Modular Redundancy (TMR)
[Figure: time-redundant TMR; the output of a single combinational logic block is captured by D flip-flops clocked at CK, CK + d and CK + 2d, and a majority voter feeds a D flip-flop clocked at CK + 3d.]
41
TMR for Memory Cells
[Figure: the combinational logic output is stored in replicated D flip-flops clocked by CK; a majority voter produces the output.]
Problems:
1. Accumulation of errors in flip-flops.
2. Voter is not protected.
42
FF Refresh and TMR for Memory Cells
[Figure: replicated D flip-flops clocked by CK with replicated majority voters; the voter outputs r1, r2 and r3 refresh the flip-flops and drive the output.]
43
Reliability Analysis
Determine how long a system will work without failure. Find:
- Mean time to failure (MTTF) or mean time between failures (MTBF)
- Mean time to repair (MTTR)
- FIT rate
44
Reliability Function
- Reliability function of a system, $R(t)$ = probability of survival at time $t$.
- Determined from the failure rates of components, $\lambda(t)$ = number of failures per unit time.
- The failure rate generally varies with time.
45
Failure Rate, λ(t)
[Plot: failures per second, λ(t), versus time, t; the vertical axis runs from $10^{-12}$ to $10^{0}$. Three regions: infant mortality, constant failure rate (useful life) with λ(t) = λ, and wearout or aging.]
46
Deriving R(t)
- $R(t)$ is the probability of no error in the interval $[0, t]$.
- Divide the interval into a large number ($n$) of subintervals of duration $t/n$.
- Let $x$ be the probability of error in one subinterval. Assume that the duration $t/n$ is so small that at most one error can occur in it.
- Then the average number of errors in a subinterval is $0 \cdot (1 - x) + 1 \cdot x = x = \lambda t / n$.
- The probability of no error in the interval $[0, t]$ is $R(t) = (1 - x)^n = (1 - \lambda t / n)^n \to \exp(-\lambda t)$ as $n \to \infty$, by the limit definition of the exponential function.
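The limiting step in this derivation can be checked numerically. The sketch below evaluates $(1 - \lambda t / n)^n$ for growing $n$ and compares it with $\exp(-\lambda t)$; the values of $\lambda$ and $t$ are hypothetical and chosen only so that $\lambda t = 0.5$.

```python
import math

def reliability_discrete(lam, t, n):
    """Probability of no error in n subintervals, each with error probability lam*t/n."""
    return (1.0 - lam * t / n) ** n

lam, t = 1e-4, 5000.0   # hypothetical: failures per hour and hours, lam*t = 0.5
for n in (1, 10, 100, 10_000, 1_000_000):
    print(n, reliability_discrete(lam, t, n))
print("exp(-lam*t) =", math.exp(-lam * t))
```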
47
R(t) and MTBF
- $R(t) = e^{-\lambda t}$
- Failure rate, $\lambda$ = failures per unit time.
- Number of failures in time $T$ = $\lambda T$.
- $\text{MTBF} = T / (\lambda T) = 1/\lambda = \int_0^\infty R(t)\,dt$
- $R(t) = \exp(-t/\text{MTBF})$
- For $t = \text{MTBF}$, $R(\text{MTBF}) = e^{-1} = 0.368$.
48
Reliability and MTBF
[Plot: reliability, R(t), from 0.0 to 1.0 versus time, t, marked at 1 MTBF, 2 MTBF and 3 MTBF; at t = 1 MTBF, R(t) = 1/e = 0.368.]
49
Example: First Generation Computer
- 10,000 electron tubes.
- Average burn-out rate: 5 tubes per 100,000 hours.
- MTBF = 100,000/5 = 20,000 hours = 2.3 years, i.e., a 37% chance of survival beyond 2.3 years.
- Time for a 95% chance of survival: $R(t) = \exp(-t/\text{MTBF}) = 0.95$, or $t$ = 1.4 months.
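The arithmetic in this example can be reproduced in a few lines. The helper names are illustrative; a 365-day year and a month of one twelfth of a year are assumed for the unit conversions.

```python
import math

HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # assumed average month

def reliability(t, mtbf):
    """R(t) = exp(-t / MTBF) for a constant failure rate."""
    return math.exp(-t / mtbf)

def time_for_reliability(r, mtbf):
    """Solve exp(-t / MTBF) = r for t."""
    return -mtbf * math.log(r)

mtbf = 100_000 / 5                      # hours between tube burn-outs
t95 = time_for_reliability(0.95, mtbf)  # hours for a 95% survival chance

print(f"MTBF = {mtbf:.0f} hours = {mtbf / HOURS_PER_YEAR:.2f} years")
print(f"R(MTBF) = {reliability(mtbf, mtbf):.3f}")
print(f"t(95%) = {t95:.0f} hours = {t95 / HOURS_PER_MONTH:.1f} months")
```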
50
Reliability of TMR
$R(\text{TMR}) = \text{Prob(all three modules correct)} + \text{Prob(exactly two modules correct)} = R^3 + 3R^2(1 - R) = 3R^2 - 2R^3 = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
51
MTBF of TMR
$R(\text{TMR}) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
$\text{MTBF} = \int_0^\infty R(\text{TMR})\,dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$
This is less than the MTBF = $1/\lambda$ for a single system!
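A numerical sketch helps reconcile the lower MTBF with the usefulness of TMR. Setting $3R^2 - 2R^3 = R$ gives $R = 1/2$, so the two reliability curves cross at $t = \ln 2 / \lambda$: TMR is more reliable for missions shorter than that and less reliable afterwards. The code below checks the integral and the crossover for a normalized $\lambda = 1$; the function names are illustrative, and a perfect voter is assumed, as in the formula above.

```python
import math

def r_single(t, lam):
    """Reliability of one module with constant failure rate lam."""
    return math.exp(-lam * t)

def r_tmr(t, lam):
    """Reliability of TMR with a perfect voter: 3R^2 - 2R^3."""
    r = r_single(t, lam)
    return 3 * r**2 - 2 * r**3

def mtbf_numeric(reliability, lam, steps=200_000):
    """Trapezoidal integral of R(t) over [0, 60/lam], where the tail is negligible."""
    t_end = 60.0 / lam
    dt = t_end / steps
    total = 0.0
    for k in range(steps + 1):
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * reliability(k * dt, lam)
    return total * dt

lam = 1.0
crossover = math.log(2) / lam
print("MTBF single:", mtbf_numeric(r_single, lam))
print("MTBF TMR   :", mtbf_numeric(r_tmr, lam))
print("crossover t:", crossover)
```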
52
MTBF of TMR
[Plot: reliability, R(t), from 0.0 to 1.0 versus time, t, for a single module and for TMR, with the mission duration marked.]
53
Error Detection Code
- Errors: Bits can flip due to noise in circuits and in communication.
- Extra bits are used for error detection.
- Example: a parity bit in ASCII code. The 7-bit ASCII code of “A” is 1000001; the parity bit is prepended.
  - Even parity code for A: 01000001 (even number of 1s)
  - Odd parity code for A: 11000001 (odd number of 1s)
- A single-bit error in the 7-bit code of “A” will change the symbol, e.g., 1000101 is “E” and 1000000 is “@”. But the error will be detected in the 8-bit code because it changes the specified parity.
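The parity example can be sketched as follows, with the parity bit prepended to the 7-bit ASCII code as on the slide. The function names are illustrative. The last line shows the known limitation of a single parity bit: an even number of bit errors leaves the parity unchanged and goes undetected.

```python
def encode(bits7, even=True):
    """Prepend a parity bit so the 8-bit word has even (or odd) parity."""
    bit = bits7.count("1") % 2      # 1 if the data has an odd number of 1s
    if not even:
        bit ^= 1
    return str(bit) + bits7

def check(codeword8, even=True):
    """Return True if the word has the specified parity."""
    return codeword8.count("1") % 2 == (0 if even else 1)

a_code = format(ord("A"), "07b")    # 7-bit ASCII code of "A"
print(encode(a_code, even=True))
print(encode(a_code, even=False))

# Single-bit error turning "A" into "E" inside an even-parity word.
print(check("0" + format(ord("E"), "07b")))
# Two bit errors leave the parity unchanged, so they are not detected.
print(check("01000111"))
```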
54
Richard W. Hamming (1915-1998)
- Error-correcting codes (ECC).
- Also known for the Hamming distance: HD = number of bits in which two binary vectors differ.
- Example: HD(1101, 1010) = 3
- Hamming Medal, 1988
55
The Idea of Hamming Code
- The code space contains $2^N$ possible N-bit code words.
- [Figure: code word 1010 “A” and its neighbors at HD = 1: 1110 “E”, 1011 “B”, 1000 “8” and 0010 “2”.]
- A 1-bit error in “A” is not correctable. Reason: no redundancy.
- Hamming’s idea: increase the HD between valid code words.
56
Hamming’s Distance ≥ 3 Code
[Figure: code word 1010010 “A” and the neighboring valid code words 0010101 “2”, 1000111 “8”, 1011001 “B”, 1110100 “E” and 0011110 “3”, at distances HD = 3 or HD = 4. The word 0010010 “?” is a 1-bit error in “A”: it is at HD = 1 from “A” and at HD = 2 or more from the other code words, so shortest-distance decoding eliminates the error.]
57
Minimum Distance-3 Hamming Code
Symbol  Original code  Odd-parity code  ECC, HD ≥ 3
0       0000           10000            0000000
1       0001           00001            0001011
2       0010           00010            0010101
3       0011           10011            0011110
4       0100           00100            0100110
5       0101           10101            0101101
6       0110           10110            0110011
7       0111           00111            0111000
8       1000           01000            1000111
9       1001           11001            1001100
A       1010           11010            1010010
B       1011           01011            1011001
C       1100           11100            1100001
D       1101           01101            1101010
E       1110           01110            1110100
F       1111           11111            1111111
- Original code: Symbol “0” with a single-bit error will be interpreted as “1”, “2”, “4” or “8”. Reason: The Hamming distance between codes is 1, so a code with any bit error maps onto another valid code.
- Remedy: Design codes with HD ≥ 2. Example: parity code. A single-bit error is detected but is not correctable.
- Remedy: Design codes with HD ≥ 3. For single-bit error correction, decode as the valid code at HD = 1.
- For detection or correction of more error bits, design a code with HD ≥ 4.
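The ECC column can be regenerated and tested with a short sketch. The check-bit patterns below are read off the tabulated code words for 1000, 0100, 0010 and 0001 (1000111, 0100110, 0010101, 0001011); treating the code as linear, so that other code words are bitwise XORs of these, is an observation about the table rather than something the slide states. Decoding uses a brute-force nearest-code-word search, which is one simple way to realize shortest-distance decoding, not necessarily how a hardware decoder would do it.

```python
from itertools import combinations

# Check bits contributed by each data bit, leftmost data bit first.
CHECK_ROWS = (0b111, 0b110, 0b101, 0b011)

def encode(data4):
    """Append three check bits to a 4-bit data string."""
    check = 0
    for position, bit in enumerate(data4):
        if bit == "1":
            check ^= CHECK_ROWS[position]
    return data4 + format(check, "03b")

def hamming_distance(x, y):
    """Number of bit positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

CODEBOOK = {format(v, "04b"): encode(format(v, "04b")) for v in range(16)}

def decode(received7):
    """Return the data word whose code word is nearest to the received word."""
    return min(CODEBOOK, key=lambda d: hamming_distance(CODEBOOK[d], received7))

min_distance = min(hamming_distance(x, y)
                   for x, y in combinations(CODEBOOK.values(), 2))
print("A ->", CODEBOOK["1010"], " minimum distance:", min_distance)
print("0010010 decodes to", decode("0010010"))
```

The received word 0010010 is the 1-bit error in “A” used on the distance-3 slide, and the sketch decodes it back to 1010.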
58
A Book on Coding Theory
R. W. Hamming, Coding and Information Theory, Englewood Cliffs, New Jersey: Prentice-Hall, 1980.
59
Byzantine Empire, 527-565
Emperor Justinian and General Belisarius
60
Byzantine General’s Problem
- In a war, a general needs to communicate an attack (a) or retreat (r) order to subordinates in the field.
- For success, perfect agreement is necessary.
- Byzantine fault:
  - Subordinates can be unreliable or malicious.
  - Communication (messengers) can be unreliable or malicious.
61
Example 1: Single Fault
General: D; Subordinates: A, B and C
[Figure: D sends the order r to A, B and C; one of the three messages is changed from r to a.]
62
Example 1: Majority Agreement
General: D; Subordinates: A, B and C
[Figure: D's orders (r→a, r, r) followed by the orders exchanged among A, B and C (a, a, r, r, r, r); the majority decision is Retreat.]
63
Example 2: Two Faults
General: D; Subordinates: A, B and C
[Figure: D sends the order a to A, B and C.]
64
Example 2: Byzantine Failure
General: D; Subordinates: A, B and C
[Figure: D's orders (a, a, a) followed by the orders exchanged among A, B and C (r, r, r, r, a, a); the subordinates reach different decisions, Retreat and Attack.]
65
Byzantine Resilient System
A system that can function correctly in the presence of Byzantine faults.
Byzantine protocol for an n-node system:
- Any node can initiate a message broadcast.
- Each node rebroadcasts the received message to all nodes it has not heard from.
- After communications end, the nodes take a majority decision.
Ref.: L. Lamport, R. Shostak and M. Pease, “The Byzantine General’s Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
66
Byzantine Resilience Conditions
In order to tolerate $t$ failures:
- The system must have at least $3t + 1$ nodes.
- There must be at least $2t + 1$ disjoint communication paths between nodes.
- A node must exchange messages at least $t + 1$ times.
67
Four-Core Processor System
[Figure: four processor cores A, B, C and D.]
68
Example 1: C Initiates Message m, Sends n to A and m to B and D
Columns: Processor, First round, Second round, Decoded message
Anm m
Bmm nm
Dm m
69
Example 2: C Initiates Message m, B Sends p to A and D
Columns: Processor, First round, Second round, Decoded message
Amm pm
Bmm m
Dmm pm
70
Example 3: C Initiates Message m, A and B Generate Faulty Message q
Columns: Processor, First round, Second round, Decoded message
Amm qm
Bm m
Dmq q
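The four-core examples can be explored with a small simulation. This is a deliberately simplified model and not the full algorithm of Lamport, Shostak and Pease: C sends one first-round value to each of A, B and D, each receiver forwards a value to the other two receivers once (a faulty receiver may forward something else), and each receiver takes the majority of the three values it holds. The function name and the dictionary-based interface are assumptions made for illustration.

```python
from collections import Counter

def run_exchange(first_round, forwarded_instead=None):
    """Simulate one rebroadcast round followed by a majority decision.

    first_round: dict mapping receiver -> value sent by initiator C.
    forwarded_instead: dict mapping a faulty receiver -> value it forwards.
    Returns a dict mapping receiver -> decoded value (None if no majority).
    """
    faulty = forwarded_instead or {}
    decoded = {}
    for node in first_round:
        received = [first_round[node]]
        for other in first_round:
            if other != node:
                received.append(faulty.get(other, first_round[other]))
        value, votes = Counter(received).most_common(1)[0]
        decoded[node] = value if votes >= 2 else None
    return decoded

# Faulty initiator: C sends n to A and m to B and D.
print(run_exchange({"A": "n", "B": "m", "D": "m"}))
# One faulty receiver: B forwards p instead of m.
print(run_exchange({"A": "m", "B": "m", "D": "m"}, {"B": "p"}))
# Two faulty receivers: A and B both forward q.
print(run_exchange({"A": "m", "B": "m", "D": "m"}, {"A": "q", "B": "q"}))
```

In this model the single-fault cases still decode m at every fault-free core, while with two faulty cores the fault-free core D decodes q, in line with the $3t + 1$ node requirement: four nodes tolerate one fault, not two.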
71
References
- L. Lamport, R. Shostak and M. Pease, “The Byzantine General’s Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
- D. K. Pradhan, Fault-Tolerant Computer System Design, Upper Saddle River, New Jersey: Prentice Hall PTR, 1996.
- P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, San Francisco: Morgan-Kaufmann, 2001.