Programa de Pós-Graduação em Computação (PPGC)
Instituto de Informática, Universidade Federal do Rio Grande do Sul
Porto Alegre – RS – Brazil
Semana Acadêmica (Academic Week) PPGC/UFRGS, 17/10/2006

Dealing with Multiple Simultaneous Faults in Future Technologies
PhD candidate: Carlos Arthur Lang Lisbôa
Advisor: Luigi Carro

Carlos A. L. Lisbôa Semana Acadêmica PPGC/UFRGS 17/10/

Why Multiple Simultaneous Faults?
Future technologies (2010 and beyond):
- very small transistors and fewer electrons to form the channel (SETs)
- transient pulses due to radiation attack will last longer than the propagation delays of gates and cycle times
- devices will be more sensitive to the effects of electromagnetic noise, neutrons and alpha particles

Single Event Upset Origin
[Figure: origin of a single event upset]

Why Should One Study Multiple Faults?
Changes in paradigm:
- Gates will behave statistically, producing correct outputs only a fraction of the time
- Faster devices: cycle times shorter than the duration of transient pulses

How to Deal with Multiple Faults?
New paradigm: multiple simultaneous faults
- new fault tolerance techniques will be required (TMR will no longer provide enough protection)
How to deal with this problem?
- new materials and manufacturing technologies must be developed, OR
- new design approaches must be taken (our bet!)

Research Evolution - Overview
- Stochastic Operators: IOLTS 04
- Bit Stream Operators: DFT 04, WDES 04
- TMR and Analog Voter: ETS 05, SBCCI 05
- Research Report: SRC 2005 TechCon
- Research Report: DATE 06 PhD Forum
- MemProc: LATW 06, ETS 06
- Majority Logic: DFT 06
- Online Hardening: DFT 06
- Low cost redundancy, Statistical Computation: VTS 07 (submitted)

Published Papers
- Lisbôa, C. and Carro, L., Arithmetic Operators Robust to Multiple Simultaneous Upsets, 10th IEEE International Online Test Symposium - IOLTS 2004, IEEE Computer Society, Funchal, Madeira Island, Portugal, July.
- Lisbôa, C. and Carro, L., Highly Reliable Arithmetic Multipliers for Future Technologies, in Proceedings of the International Workshop on Dependable Embedded Systems - WDES, in conjunction with the 23rd International Symposium on Reliable Distributed Systems - SRDS 2004, edited by Becker, L. B. and Kaiser, J., Florianópolis, October 17.
- Lisbôa, C. and Carro, L., Arithmetic Operators Robust to Multiple Simultaneous Upsets, in Proceedings of the 19th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems - DFT 2004, IEEE Computer Society, New York, October 2004.
- Lisbôa, C. A. L., Carro, L. and Cota, E., RobOps - Arithmetic Operators for Future Technologies, 10th European Test Symposium - ETS 2005, Tallinn, Estonia, May.
- Lisbôa, C. A. L., Schüler, E. and Carro, L., Going Beyond TMR for Protection Against Multiple Faults, in Proceedings of the 18th Symposium on Integrated Circuits and Systems Design - SBCCI 2005, September.
- Rhod, E., Lisbôa, C. A. L. and Carro, L., Using Memory to Cope with Simultaneous Transient Faults, in Proceedings of the 7th Latin-American Test Workshop - LATW 2006, IEEE Computer Society, New York, March 2006.
- Rhod, E., Lisbôa, C. A. L., Michels, Á. and Carro, L., Fault Tolerance Against Multiple SEUs using Memory-Based Circuits to Improve the Architectural Vulnerability Factor, in Informal Digest of Papers of the 11th IEEE European Test Symposium - ETS 2006, IEEE Computer Society, New York, May.
- Michels, Á., Petroli, L., Lisbôa, C. A. L., Kastensmidt, F. and Carro, L., SET Fault Tolerant Combinational Circuits Based on Majority Logic, in Proceedings of the 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems - DFT 2006, IEEE Computer Society, Los Alamitos, CA, October.
- Lisbôa, C. A. L., Carro, L., Sonza Reorda, M. and Violante, M., Online Hardening of Programs against SEUs and SETs, in Proceedings of the 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems - DFT 2006, IEEE Computer Society, Los Alamitos, CA, October 2006.

Research Approaches / 2005
- Use of stochastic operators
- Use of bit stream operators
- Ensuring voter reliability to use n-MR while dealing with multiple simultaneous faults

Research Evolution / 2005
- Stochastic Operators (IOLTS 2004): OK for some DSP applications
- Looking for more speed: Bit Stream Operators (DFT 2004, WDES 2004), small footprint and fast
- Looking for tolerant converter: TMR and Analog Voter (ETS 2005, SBCCI 2005), tolerant to multiple faults in n-MR solutions
- Research Report: SRC 2005 TechCon

Research approach / 2007
Cooperation with peers:
- use of memory for computation
- analog voter + majority logic
- use of an I-IP to harden instructions
Low cost redundancy using statistical parallel computation

Research Evolution / 2007
- Research Report: DATE 06 PhD Forum
- MemProc: LATW 06, ETS 06
- Majority Logic: DFT 06
- Online Hardening: DFT 06
- Low cost redundancy, Statistical Computation: VTS 07 (submitted)

Current research - motivation
Future technologies:
- faster devices
- transient pulse duration scaling is not proportional to speed scaling
- transient pulses will last longer than one cycle
- techniques relying on time redundancy will fail
Alternative approach: space redundancy
- current solutions: area overhead 100%
- small granularity does not provide low overhead (what can one do with 50% of a MOSFET?)
Proposed solution: fingerprinting
- parallel processing on a subset of possible inputs
- small transient fault probability (desired: 0%)

Current research - focus
Use of low cost redundancy and statistical computation to cope with transient faults
[Figure: block diagram with a main circuit and a random checker; labels: inputs, output, error]

Sample application
Freivalds: matrix multiplication correctness
- given matrices A and B, n x n
- given one algorithm that calculates C = A x B
- goal: check if the algorithm performs correctly by executing thousands of multiplications and comparing the results
- naive solution: calculate again and compare, O(n^3)

Sample application
Freivalds technique
1. generate a random vector r, with values from {0,1}
2. compute vector Cr = C x r, O(n^2)
3. compute vector ABr = A x (B x r), O(n^2)
4. if C ≠ A x B, then Pr[ABr = Cr] ≤ 1/2
After k independent repetitions of steps 1, 2 and 3: Pr[ABr = Cr] ≤ 1/2^k
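The check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation evaluated in the talk: the helper names `mat_vec` and `freivalds_check`, the list-of-rows matrix representation and the 2 x 2 sample values are all hypothetical.

```python
import random

def mat_vec(m, v):
    """Multiply an n x n matrix (list of rows) by a vector: O(n^2)."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def freivalds_check(a, b, c, k=10, rng=None):
    """Return True if C passes k rounds of the check (C is probably A x B)."""
    rng = rng or random.Random()
    n = len(c)
    for _ in range(k):
        r = [rng.randint(0, 1) for _ in range(n)]  # step 1: random 0/1 vector
        cr = mat_vec(c, r)                         # step 2: Cr = C x r
        abr = mat_vec(a, mat_vec(b, r))            # step 3: ABr = A x (B x r)
        if abr != cr:
            return False  # a single mismatch proves C != A x B
    return True

# Hypothetical sample data: `good` is the true product, `bad` has one wrong element.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
good = [[19, 22], [43, 50]]
bad = [[19, 22], [43, 51]]

print(freivalds_check(a, b, good, k=20))  # True: a correct C never fails the check
print(freivalds_check(a, b, bad, k=20))   # False, except with probability at most 1/2^20
```

Each round costs three matrix-vector products, so k rounds stay at O(k n^2) instead of the O(n^3) of recomputing the product.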

Sample application
Our extension of Freivalds technique
1. generate a random vector r, with values from {0,1}
2. generate a vector rc with rc_i = not(r_i) for i = 1:n
3. compute Cr = C x r and Crc = C x rc
4. compute ABr = A x (B x r) and ABrc = A x (B x rc)
5. if ABr ≠ Cr OR ABrc ≠ Crc, then Pr[ABr ≠ Cr] = 1
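A minimal Python sketch of the complemented-vector check follows. The function name `complement_check` and the sample matrices are assumptions made for illustration. The sketch relies on one property that can be verified by hand: when the error is confined to a single element of C, the corrupted column is selected by either r or rc, so one of the two comparisons must mismatch. Errors that span several elements can still cancel each other, so this sketch does not claim full coverage for them.

```python
import random

def mat_vec(m, v):
    """Multiply an n x n matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def complement_check(a, b, c, rng=None):
    """Return True when an error is detected using r and its complement rc."""
    rng = rng or random.Random()
    n = len(c)
    r = [rng.randint(0, 1) for _ in range(n)]  # step 1
    rc = [1 - x for x in r]                    # step 2: rc_i = not(r_i)
    error_r = mat_vec(a, mat_vec(b, r)) != mat_vec(c, r)     # ABr vs Cr
    error_rc = mat_vec(a, mat_vec(b, rc)) != mat_vec(c, rc)  # ABrc vs Crc
    return error_r or error_rc

# Hypothetical sample data: `good` is the true product of a and b.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
good = [[19, 22], [43, 50]]

print(complement_check(a, b, good))  # False: no error to report

# Corrupt each element of C in turn: every single-element error is caught.
for i in range(2):
    for j in range(2):
        bad = [row[:] for row in good]
        bad[i][j] += 1
        print(i, j, complement_check(a, b, bad))
```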

Sample Implementation
Matrix multiplier with checker: application of Freivalds technique
[Figure: block diagram; labels: C = A * B; Cr = C * r; ABr = A * (B * r); inputs (A, B); output (C); error]

Sample Implementation
Area overhead (# of gates)

Sample implementation
Time overhead (# of instructions)

Sample implementation
Fault injection results

PhD program requirements
- 36 credits
- qualifying examination
- 2 foreign languages proficiency exam
- academic week seminar
- Thesis proposal: February 2007
- Thesis presentation: December 2007

Questions?

Using Stochastic Operators
- SEU-induced transient errors are of random nature
- Stochastic operators rely on randomness to produce approximate results
- The injection of random faults in the input signals processed by stochastic operators did not impact the precision of the results
[Table: % errors in 1,000 additions, stochastic adder vs. conventional adder, with 0, 2, 4 and 8 faults]
- Several application areas (DSP) can deal with approximate values and still produce acceptable results (outputs)
- Benefit: reduced area of the operators
[Figures: stochastic multiplier circuit; stochastic adder circuit with signals S1, S2, S3 and Sum]
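The idea behind these operators can be illustrated with a short Python simulation. It assumes the textbook unipolar stochastic encoding, where a value in [0, 1] is the fraction of 1s in a random bit stream, an AND gate multiplies two independent streams and a multiplexer produces the scaled sum. The slides do not give the circuits in recoverable form, so this encoding and every name below (`to_stream`, `stochastic_multiply`, `stochastic_add`, `flip_bits`) are assumptions, not the author's design.

```python
import random

def to_stream(p, length, rng):
    """Encode a probability p in [0, 1] as a random bit stream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def value(stream):
    """Decode a stream: the fraction of 1s."""
    return sum(stream) / len(stream)

def stochastic_multiply(s1, s2):
    """AND gate: for independent streams the result encodes p1 * p2."""
    return [x & y for x, y in zip(s1, s2)]

def stochastic_add(s1, s2, select):
    """Multiplexer: with a select stream of value 0.5, encodes (p1 + p2) / 2."""
    return [y if s else x for x, y, s in zip(s1, s2, select)]

def flip_bits(stream, count, rng):
    """Inject `count` bit flips at random positions (a crude fault model)."""
    faulty = list(stream)
    for i in rng.sample(range(len(faulty)), count):
        faulty[i] ^= 1
    return faulty

rng = random.Random(42)
n = 10_000
s1 = to_stream(0.5, n, rng)
s2 = to_stream(0.4, n, rng)
select = to_stream(0.5, n, rng)

product = stochastic_multiply(s1, s2)        # close to 0.5 * 0.4 = 0.2
scaled_sum = stochastic_add(s1, s2, select)  # close to (0.5 + 0.4) / 2 = 0.45
faulty_product = stochastic_multiply(flip_bits(s1, 8, rng), s2)

print(value(product), value(scaled_sum), value(faulty_product))
```

In this model, 8 flipped input bits can change at most 8 output bits, so the decoded value moves by at most 8 / n. A single flip in a high-order bit of a binary-coded operand, by contrast, can change the result by a large amount.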

Using Bit Stream Operators
- Computation principles similar to those of the stochastic adder and multiplier
- Operators can produce bit streams which represent the exact results of the operation
- Redundancy is added to the bit streams so that they withstand multiple bit flips
- Conversion of bit streams to binary coded values is delayed as much as possible, and conversion circuits must use TMR or n-MR for protection against faults
- Issues to be further investigated: size of bit streams and area of the conversion circuits

Proposed multiplication algorithm: bit stream product (the count of 1s in the stream is equal to the product value)
[Figure: products of the bits of F1 and F2 mapped onto the stream bits b48 .. b0]

Adding robustness to the bit stream through redundancy: each bit b48 .. b0 is repeated 8 times and 4 is added, so the total count of 1s = 8 * product + 4
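The redundancy rule "total count of 1s = 8 * product + 4" can be sketched as follows. The slides give only the count formula, so the stream layout (a unary 49-bit product stream, each bit repeated 8 times, with 4 extra 1s appended) and the decoding by integer division are assumptions made for illustration, as are the names `encode`, `decode` and `flip`.

```python
def encode(product, width=49, copies=8, offset=4):
    """Build a redundant stream holding copies * product + offset ones."""
    base = [1] * product + [0] * (width - product)  # count of 1s = product
    stream = [bit for bit in base for _ in range(copies)]
    return stream + [1] * offset

def decode(stream, copies=8):
    """Recover the product from the count of 1s."""
    return sum(stream) // copies

def flip(stream, positions):
    """Return a copy of the stream with the given bit positions flipped."""
    faulty = list(stream)
    for i in positions:
        faulty[i] ^= 1
    return faulty

stream = encode(12)                           # 8 * 12 + 4 = 100 ones
print(sum(stream), decode(stream))            # 100 12
print(decode(flip(stream, [0, 1, 2])))        # three 1 -> 0 flips: still 12
print(decode(flip(stream, [100, 101, 102])))  # three 0 -> 1 flips: still 12
```

Under this decoding the offset of 4 centers the count inside its interval of 8, so up to 3 bit flips in either direction leave the decoded product unchanged.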

What is Wrong with TMR?
- TMR protects only against single faults in one of the modules
- TMR does not protect against double faults in different modules
- When a single fault occurs in the voter circuit, the voter output may be wrong
[Figures: Module 1, Module 2 and Module 3 feed the voter; the voter output is correct with no fault or with a wrong output from one module, wrong with wrong outputs from two modules, and uncertain with a fault in the voter itself]
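The three cases can be reproduced with a one-line majority function. This is a generic 2-out-of-3 voter written for illustration; the name `majority_voter` and the single-bit module outputs are assumptions.

```python
def majority_voter(a, b, c):
    """2-out-of-3 majority of three single-bit module outputs."""
    return (a & b) | (a & c) | (b & c)

correct = 1

no_fault = majority_voter(correct, correct, correct)
single_fault = majority_voter(correct, correct ^ 1, correct)      # Module 2 wrong
double_fault = majority_voter(correct ^ 1, correct, correct ^ 1)  # Modules 1 and 3 wrong
voter_fault = majority_voter(correct, correct, correct) ^ 1       # fault flips the voter output

print(no_fault, single_fault, double_fault, voter_fault)  # 1 1 0 0
```

Only the first two cases deliver the correct value, which is why the voter itself becomes the weak point once multiple simultaneous faults are expected.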

Making TMR (n-MR) more reliable
- Known solutions imply area, performance and/or power penalties
- Deadlock: how to protect the output generator?
Proposed solution:
- use TMR to cope with single faults in the modules
- replace the digital voter by an analog voter that uses a comparator to generate the output
- the analog voter can support some noise and still produce the correct result

The Analog Voter
[Figure: analog voter circuit]

Injection of faults in the comparator (*)
Minimum Area Comparator
(*) using CMOS 0.35 µm

Electrical Simulation: Multiple Faults (SPICE and CMOS 0.35 µm)

Dealing with Multiple Simultaneous Faults: n-MR
The Analog Voter with 5 Inputs (for 5-MR)
Simulations with injection of 2 simultaneous faults also succeeded
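An idealized behavioral model can show why a comparator-based voter tolerates some noise. The slides do not describe the voter's internals beyond the comparator, so the averaging stage, the 3.3 V supply and the mid-rail threshold below are assumptions made for this sketch. It is not a substitute for the electrical (SPICE) simulations mentioned above.

```python
def analog_voter(voltages, vdd=3.3, threshold=None):
    """Average the module output voltages and compare with a threshold."""
    if threshold is None:
        threshold = vdd / 2  # assumed mid-rail reference for the comparator
    average = sum(voltages) / len(voltages)
    return 1 if average > threshold else 0

# TMR: one faulty module (logic 0 instead of 1) is outvoted.
tmr_single_fault = analog_voter([3.3, 0.0, 3.3])

# TMR with noisy levels on all three inputs: the comparator still decides 1.
tmr_noisy = analog_voter([3.0, 0.4, 3.1])

# 5-MR: two simultaneous faulty modules are still outvoted.
five_mr_double_fault = analog_voter([3.3, 3.3, 3.3, 0.0, 0.0])

print(tmr_single_fault, tmr_noisy, five_mr_double_fault)  # 1 1 1
```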

The Analog Voter... Oops! Does this work???

