New Approaches to Fault-Tolerant Systems Design Andreas Steininger Vienna University of Technology.


1 New Approaches to Fault-Tolerant Systems Design Andreas Steininger Vienna University of Technology

2 page 2A. Steininger My contact data Andreas Steininger Vienna University of Technology Faculty of Informatics Institute of Computer Engineering Embedded Computing Systems Group Treitlstrasse 3 A- 1040 Vienna Austria steininger@ecs.tuwien.ac.at http://ti.tuwien.ac.at/ecs

3 page 3A. Steininger Main Contributors to this Material • Dr. Thomas Kottke R. Bosch AG / EADS • Dr. Peter Tummeltshammer R. Bosch AG / Thales • Dr. Christoph Scherrer Alcatel / Thales • Dr. Eric Armengaud DecomSys / VirtualVehicle • Dr. Karl Thaller DecomSys / Elektrobit Austria • Dr. Martin Horauer UAT Technikum Wien • Paul Milbredt AUDI AG

4 page 4A. Steininger Outline Fault tolerance – some (very) basics Automotive electronics: the specific situation Design of a cost efficient fault tolerant node –Basic architecture –Temporal diversity –Treatment of common cause faults –Switching performance mode / safety mode –Fault-tolerance validation by fault injection

5 page 5A. Steininger Faults, Errors and Failures computer 10 fault error failure

6 page 6A. Steininger Error Detection computer 10 fault error failure Fault detection: usually too difficult (too many possibilities)

7 page 7A. Steininger Error Detection computer 10 error failure Failure detection: too late: want to prevent failure!

8 page 8A. Steininger Error Detection computer 10 error To decide that "1" is wrong we need a reference. Where to get this reference from? Option 1: Perform the same computation a second time (hopefully the fault is gone by then…) Time redundancy

9 page 9A. Steininger Error Detection computer 10 error To decide that „1“ is wrong we need a reference. Where to get this reference from?

10 0 page 10A. Steininger Error Detection computer 10 To decide that „1“ is wrong we need a reference. Where to get this reference from? Option 2: Use a second computer in parallel (hopefully this one works well…) Space redundancy

11 page 11A. Steininger Error Detection computer 10 error To decide that „1“ is wrong we need a reference. Where to get this reference from? Option 3: Add additional information (hopefully not affected as well…) Information redundancy 0
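The three options for obtaining a reference can be sketched in a few lines. This is a minimal illustration only: the `TransientlyFaulty` unit, the `+ 1` computation and the single parity bit are assumptions made for the example, not something taken from the slides.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the number of set bits in word is odd."""
    return bin(word).count("1") & 1


class TransientlyFaulty:
    """Hypothetical unit computing x + 1 whose first result has a flipped bit."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        result = x + 1
        if self.calls == 1:      # transient fault: affects the first call only
            result ^= 1
        return result


def good(x: int) -> int:
    """Fault-free version of the same computation."""
    return x + 1


def error_by_time_redundancy(compute, x) -> bool:
    """Option 1: run the same computation twice on the same unit."""
    return compute(x) != compute(x)


def error_by_space_redundancy(unit_a, unit_b, x) -> bool:
    """Option 2: run the computation on two units in parallel."""
    return unit_a(x) != unit_b(x)


def error_by_information_redundancy(word: int, stored_parity: int) -> bool:
    """Option 3: check the word against redundant information stored with it."""
    return parity(word) != stored_parity
```

Each check only reports that an error exists; none of them says which value is the correct one, which is why the following slides separate detection from the reaction (fail safe or fail operational).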

12 page 12A. Steininger Achieving Fault Tolerance Fail safe: system can be safely stopped when error is detected => example: train Fail operational: system must keep on working when error is detected => example: autopilot in airplane [Figure: block diagrams built from "computer" and "ED" (error detection) blocks]

13 page 13A. Steininger Outline Fault tolerance – some (very) basics Automotive electronics: the specific situation Design of a cost efficient fault tolerant node –Basic architecture –Temporal diversity –Treatment of common cause faults –Switching performance mode / safety mode –Fault-tolerance validation by fault injection

14 page 14A. Steininger Electronics in Cars – some Facts high proportion of value: up to 30% high development potential: more than 80% of the innovations high number of Electronic Control Units (ECUs) up to 70 complex distributed system different networks & topologies

15 page 15A. Steininger Electronics in Cars - Benefits cheap alternative to existing mechanical solutions –lighter, smaller, cheaper, more flexible,… enabler for further optimizations –electronic ignition, motor management, … key to new functionality –safety: ESP, active suspension, crash sensing… –comfort: air conditioning, infotainment,… –security: immobilizer, alarm, electronic key, GPS tracking,… –autonomy: anticipatory braking, lane keeping,…

16 page 16A. Steininger Key Demands • Safety • Real-Time • Low Cost • Robustness • Testability

17 page 17A. Steininger Key Demands • Safety • Real-Time • Low Cost • Robustness • Testability –high risk potential (energy!) –high public awareness –no safe state (in general) –certification required (EN 61508, ISO 26262) –high complexity of system & application –legal issues (liability)

18 page 18A. Steininger Key Demands • Safety • Real-Time • Low Cost • Robustness • Testability –engine: 6000 rpm = 1 revolution / 10ms –VDM: 100km/h = 28cm/10ms –need to synchronize distributed activities –real-time communication –image processing tasks
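The two real-time figures on this slide follow from simple unit conversions. The helper names below are made up for the illustration.

```python
def ms_per_revolution(rpm: float) -> float:
    """Duration of one engine revolution in milliseconds."""
    return 60_000.0 / rpm


def cm_travelled(speed_kmh: float, interval_ms: float) -> float:
    """Distance in centimetres covered at speed_kmh during interval_ms."""
    metres_per_second = speed_kmh / 3.6
    return metres_per_second * (interval_ms / 1000.0) * 100.0


engine_period_ms = ms_per_revolution(6000)     # one revolution every 10 ms
distance_cm = cm_travelled(100, 10)            # roughly 28 cm in 10 ms
```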

19 page 19A. Steininger Key Demands • Safety • Real-Time • Low Cost • Robustness • Testability –extreme competition –high cost inhibits introduction –tailored safety concepts: minimum degree of replication, use structural redundancies –generic solutions: scalable, configurable, flexible –marginal costs beat NRE

20 page 20A. Steininger Current Status fail safe functions realized: –shut off upon error –mechanical fall-back system assumes control no true “by wire” functions –single-channel solutions sufficient tolerance against random faults –avoid design faults by field experience => no diversity –avoid common cause faults by design (?) single fault assumption –keep faults rare (shielding, etc.)

21 page 21A. Steininger Outline Fault tolerance – some (very) basics Automotive electronics: the specific situation Design of a cost efficient fault tolerant node –Basic architecture –Temporal diversity –Treatment of common cause faults –Switching performance mode / safety mode –Fault-tolerance validation by fault injection

22 page 22A. Steininger A Fault Tolerant Node • mission: make a node (processor) fault tolerant • need to consider CPU and memory • aim is "fail safe" (but keep option for fail op in mind) –simplex unit with error detection capabilities –duplication and comparison –hybrid approach

23 page 23A. Steininger Options for the CPU Core • Single core + ED • Dual core + cmp • Superscalar proc. + cmp + ED modify custom CPU core –parity for buses –two-rail coding for signals –self-checking implementation of simple units –duplicate & compare for complex units –careful layout

24 page 24A. Steininger Options for the CPU Core • Single core + ED • Dual core + cmp • Superscalar proc. + cmp + ED duplicate custom CPU core –master/checker operation –shared (safe) memory –validity check for inputs –self-checking comparator checks equality of outputs –option: clock delay –option: mode switch

25 page 25A. Steininger Solution Example "Dual Core Frame" • benefits: can use custom core without modifications; safety analysis valid for other cores as well; promises high ED coverage with moderate efforts; CPU is hard to protect otherwise • crucial points: enable easy recovery ( => keep outage short); eliminate single points of failure; detect common cause faults

26 page 26A. Steininger Protection in the Dual Core Frame [Figure: Core 1 (Master) and Core 2 (Checker), each with Instr. Addr., Instr., Data Addr., Data out and Data in ports, attached to Instr. Mem and Data Mem; a comparator (=?) drives Error_Sig. Annotations: "Safe memories", Parity for buses, Dual-Rail Coding, Self-Checking Comparators]

27 page 27A. Steininger Potential for Common Cause Faults • identical input data • identical clock (lock step) • shared clock generator • shared power supply • both processors on same die (physical proximity; thermal & mechanical coupling)

28 page 28A. Steininger Temporal Diversity operate checker with a delay against master –same fault hits at different point of computation –therefore different effect => detect by comparison –different critical paths emerge store master output for comparison choose delay of 1 / 1.5 / 2 clock cycles –larger delay causes high effort for little gain (=>experiments) –error detection latency is equal to the delay –need to delay memory write and outputs by this amount
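Why a delayed checker exposes a fault that hits both cores at the same instant can be shown with a toy model. Everything here is an assumption for illustration: the 8-bit accumulator "program", the bit-flip fault and the restriction to whole-cycle delays (the 1.5-cycle option from the slide is not modelled).

```python
def run_core(n_steps, fault_step=None, fault_mask=0x04):
    """Run a toy program; optionally flip a state bit at one program step."""
    acc = 0
    outputs = []
    for step in range(n_steps):
        acc = (acc * 3 + step) & 0xFF
        if step == fault_step:
            acc ^= fault_mask            # the fault corrupts the core state
        outputs.append(acc)
    return outputs


def first_mismatch(n_steps, fault_cycle, delay):
    """Program step at which the comparator first sees different outputs.

    The same physical fault occurs at wall-clock cycle fault_cycle in both
    cores. The master is then at program step fault_cycle, while the checker,
    running delay cycles behind, is at step fault_cycle - delay.
    Returns None if master and checker outputs never differ.
    """
    master = run_core(n_steps, fault_step=fault_cycle)
    checker_step = fault_cycle - delay
    checker = run_core(n_steps,
                       fault_step=checker_step if checker_step >= 0 else None)
    for step, (m, c) in enumerate(zip(master, checker)):
        if m != c:
            return step
    return None


lockstep = first_mismatch(n_steps=10, fault_cycle=5, delay=0)   # None
delayed = first_mismatch(n_steps=10, fault_cycle=5, delay=1)    # 4
```

With `delay=0` both cores are corrupted at the same program step, produce the same wrong outputs, and the comparator stays silent. With `delay=1` the corruption lands on different steps, so the stored master output and the checker output disagree.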

29 page 29A. Steininger Temporal Diversity: Implementation [Figure: Core #1 (Master) and Core #2 (Checker), each with Instr. Addr., Instr., Data Addr., Data out and Data in ports, attached to Instr. Mem and Data Mem; a comparator (=?) drives Error; two delay elements labelled T]

30 page 30A. Steininger Fail Safe Dual Core Frame – Summary safe memories for instructions and data comparison of all core outputs parity protection for buses (data, address) dual rail coding for single signals (int, rst, err) totally self-checking comparators temporal diversity How safe is the proposed solution?
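Dual-rail coding of single signals can be illustrated as follows. The slides do not show the checker circuit, so this sketch uses the textbook two-rail checker cell as an assumption; the function names are invented for the example.

```python
def encode(bit: int) -> tuple:
    """Dual-rail code word for one signal: (value, complement)."""
    return (bit, 1 - bit)


def is_valid(pair: tuple) -> bool:
    """A dual-rail pair is valid only if its two rails are complementary."""
    return pair[0] != pair[1]


def two_rail_cell(a: tuple, b: tuple) -> tuple:
    """Textbook two-rail checker cell: merges two pairs into one.

    The output pair is valid exactly when both input pairs are valid.
    """
    z0 = (a[0] & b[0]) | (a[1] & b[1])
    z1 = (a[0] & b[1]) | (a[1] & b[0])
    return (z0, z1)


def check_all(pairs: list) -> tuple:
    """Reduce any number of dual-rail signals to a single dual-rail verdict."""
    result = pairs[0]
    for pair in pairs[1:]:
        result = two_rail_cell(result, pair)
    return result


signals = [encode(1), encode(0), encode(1)]   # e.g. int, rst, err
verdict_ok = check_all(signals)
verdict_bad = check_all([encode(1), (0, 0), encode(1)])  # one rail stuck at 0
```

Because the verdict itself is dual-rail coded, a stuck output of the checker shows up as an invalid pair instead of a permanent "no error".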

31 page 31A. Steininger Assessment of the Solution's Quality • How to measure quality? (Aim is fail safe) error detection coverage => detect all errors; error detection latency => detect them quickly • Which method to choose? theoretical analysis / modelling; experimental fault injection; field observation

32 page 32A. Steininger Fault Injection Experiment 2 SPEAR cores in fail safe frame (= DUT) synthesized to EDIF netlist exhaustive list of stuck-at-1 and stuck-at-0 faults injected one by one into netlist download to FPGA, application run "golden device" as reference (= REF) upon mismatch (DUT ≠ REF) => check comparator

33 page 33A. Steininger Results of FI Experiment
                                 master   slave   frame   overall
detected      no effect             204   51170    3517     54891
              before effect       19047      98     734     19879
              during effect  RD       0       0       0         0
                             WR     559       0     921      1480
              after effect   RD   31455       0      87     31542
                             WR       0       0       0         0
not detected  no effect            4269    4276    1073      9618
              with effect             0       0       0         0
overall                           55534   55544    6332    117410
No change of memory contents in case of error
Erroneous read access is uncritical
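The coverage figures implied by these results can be recomputed from the "overall" column. The numbers below are a reading of the flattened table in which each row and column sums to the stated totals; the grouping into "with effect" and "without effect" is an assumption of this sketch.

```python
# "overall" column of the fault injection results, as read from the slide
overall = {
    "detected, no effect": 54891,
    "detected before effect": 19879,
    "detected during effect (RD + WR)": 0 + 1480,
    "detected after effect (RD + WR)": 31542 + 0,
    "not detected, no effect": 9618,
    "not detected, with effect": 0,
}

total_faults = sum(overall.values())

# Faults that had, or were about to have, an effect on the outputs
detected_with_effect = (overall["detected before effect"]
                        + overall["detected during effect (RD + WR)"]
                        + overall["detected after effect (RD + WR)"])
undetected_with_effect = overall["not detected, with effect"]

# Error detection coverage over the effective faults only
effect_coverage = detected_with_effect / (detected_with_effect
                                          + undetected_with_effect)

# Share of all injected faults that were never flagged (all without effect)
silent_share = overall["not detected, no effect"] / total_faults
```

Under this reading every fault that influenced the outputs was detected, and the undetected faults are exactly those that never produced an effect during the application run.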

34 page 34A. Steininger Enabling fast Recovery • error signal (dual rail): notifies external component / memory; turns any further WR into RD (error confinement); triggers processor interrupt • status register (memory mapped): updated by HW; indicates source of error (data parity, address mismatch,…) • recovery: can build on uncorrupted status; can benefit from detailed status information
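The error confinement rule (any WR after the error signal is treated as a RD) can be modelled in software as below. The class and method names are hypothetical, and the real mechanism is implemented in hardware.

```python
class ConfinedMemory:
    """Toy memory that ignores writes once an error has been signalled."""

    def __init__(self, size: int):
        self.cells = [0] * size
        self.error_latched = False
        self.status = None            # stands in for the status register

    def signal_error(self, source: str) -> None:
        """Latch the error and record its first reported source."""
        if not self.error_latched:
            self.error_latched = True
            self.status = source

    def access(self, addr: int, write: bool = False, data: int = 0) -> int:
        """Perform a read, or a write that degrades to a read after an error."""
        if write and not self.error_latched:
            self.cells[addr] = data
        return self.cells[addr]


mem = ConfinedMemory(4)
mem.access(1, write=True, data=7)          # normal write
mem.signal_error("address mismatch")       # comparator raised the error
mem.access(1, write=True, data=99)         # blocked: behaves like a read
```

Since the memory contents stay as they were before the error, recovery code can start from uncorrupted state and read the recorded error source.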

35 page 35A. Steininger Why is fast Recovery important? • application specific fault-tolerance time: application can "survive" without computer even in fail-operational case; typ. some 10ms for car (recall: 100km/h = 28cm/10ms) • meaning of fast recovery: if failed computer recovers within FT time, no need for hot standby => COST! re-booting after failure is -pragmatic -safe -expensive!

36 page 36A. Steininger Fail Safe Dual Core – Summary 1 • duplicate & compare: generic approach, applicable to any core type; covers all (local) errors; need to carefully eliminate single points of failure; need to complement with protection for signals & buses • temporal diversity: mitigates (many) common cause failures; requires output delay to ensure error confinement

37 page 37A. Steininger Possible Sources of CCFs • Design & process: design fault or (latent) process deficiency • Thermal coupling: hot spot affects both replicas in the same way • Mechanical defect: affects both replicas symmetrically • Electrical coupling: wire bound (shared lines: VDD, reset, clock); wireless (EMI)

38 page 38A. Steininger Why use Single Die then? • cheaper and faster: use two instances of same design; fast & comprehensive comparison • CCFs on single die: intuitively higher threat; quantification of threat? mitigation techniques? [Figure labels: comparator, error]

39 page 39A. Steininger The Actual Problem with CCFs • One fault event affects both replicas AND • is not detected by comparator, i.e. leads to "symmetric" fault effect AND • produces an erroneous output, i.e. does not crash the cores

40 page 40A. Steininger Possible Countermeasures for CCFs • Design & process: design fault or (latent) process deficiency • Thermal coupling: hot spot affects both replicas in the same way • Mechanical defect: affects both replicas symmetrically • Electrical coupling: wire bound (shared lines: VDD, reset, clock); wireless (EMI) Countermeasures: diversity, burn-in, fault avoidance; asymmetric propagation paths; asymmetric critical paths; asymmetric antennas (?)

41 page 41A. Steininger Possible Countermeasures for CCFs • Design & process: design fault or (latent) process deficiency • Thermal coupling: hot spot affects both replicas in the same way • Mechanical defect: affects both replicas symmetrically • Electrical coupling: wire bound (shared lines: VDD, reset, clock); wireless (EMI) Countermeasure: asymmetric propagation paths

42 page 42A. Steininger Propagation Speed Comparison • Thermal & mechanical propagation are relatively slow => 10000s of clock cycles within 1ms

43 page 43A. Steininger Experimental Assessment • Evaluation Experiments 1) single corresponding points with offset t 2) multiple corresp. points with offset t 3) single non-corresp. points, no offset [Figure: fault locations in Core 1 and Core 2 for each of the three experiments; Master and Checker feed a Compare unit and are checked against a Golden Node (signals: Data, Addr, Iaddr, We). Erroneous write access?]

44 page 44A. Steininger Symmetry Requirements for CCF • even a small offset… • fault multiplicity … • asymmetry of impact … …improve detection coverage

45 page 45A. Steininger Symmetry Requirements for CCF • even a small offset… • fault multiplicity … • asymmetry of impact … …improve detection coverage [Figure labels: RF (7028), ExVecTab (8202), ALU (2472), PSW (308), DEC (152), P2 (158), PC+P1 (182)]

46 page 46A. Steininger Squeezing out more Efficiency • dual core is expensive • normally yields performance improvement; would be welcome here as well: increasing performance demand @ limited clock rates; but: exclusively dedicated to safety here • observation: not all tasks are safety critical; enable flexible switching between "safety mode" and "performance mode"

47 page 47A. Steininger Operation in Performance Mode cores execute different instruction streams in parallel both cores have direct access to memory / peripherals instruction caches introduced to minimize penalties from conflicting access temporal diversity disabled comparator disabled

48 page 48A. Steininger Requirements on the Mode Switching • coherent operation in safety mode: internal states of cores must be aligned before switching to safety mode (register file, cache) • safe operation in safety mode: switching must not introduce safety leakage; no corruption of safety-relevant data in perform. mode • low performance penalty for mode switching: slow or complicated switching would spoil the anticipated performance gain

49 page 49A. Steininger Implementation of the Split Core Frame

50 page 50A. Steininger Mode Switch: Safety => Performance
LDL r1, 248       load ID reg address
LDH r1, 255
mode switching    mode switch instr => core1 wait => core2 wait => clk align => switch mode
LDW r2, r1        load & check ID bit
BTEST r2, 1
JMPI_CT           => cond branch core2

51 page 51A. Steininger Mode Switch: Performance => Safety core1 encounters mode switch instr => trigger MSU (core1 signal) => halt core1 (wait1) => interrupt core2 (message2) core2 encounters interrupt => save context => jump to mode switch instr core2 executes mode switch => halt core2 & switch clock => resume core1 => resume core2 after delay

52 page 52A. Steininger Fault Injection in Safety Mode
                               master   slave   frame   overall
detected      no effect          1029   56962    5334     63325
              before effect      5026       0    1324      6350
              within 1.5cy      50956       0     569     51525
              later                 0       0       0         0
not detected  no effect          7055    7102    4275     18432
              with effect           0       0       0         0
overall                         64066   64064   11502    139632
Delayed WR still ensures error confinement

53 page 53A. Steininger Fault Injection in Performance Mode
columns (detection in): perf mode: early, late, stuck | safety mode: ≤1.5cy, >1.5cy | never
rows (effect in):
perf only: 11494232561734583458
both modes: -- 000
safety only: -- 965400
none: 14734771518560
fault injected in performance mode, then switch to safety mode
No undetected effects / late detections in safety mode
Watchdog important to prevent hang-up in perf mode

54 page 54A. Steininger We still need a "Safe Memory" • detect bit flips in storage cells: parity (or EDC/ECC) • detect erroneous address decoding: special decoder logic design • protect interfaces: parity for data, address and control buses • prevent illegal WR access: provide mask input for write enable Why not duplicate & compare?

55 page 55A. Steininger We still need a "Safe Memory" • detect bit flips in storage cells: parity (or EDC/ECC) • detect erroneous address decoding: special decoder logic design • protect interfaces: parity for data, address and control buses • prevent illegal WR access: provide mask input for write enable

56 page 56A. Steininger Possible Address Decoder Errors • correct behavior: any given address activates exactly one assigned memory cell • erroneous behaviors: • an address activates no memory cell at all • an address activates more than one memory cell • an address activates a wrong memory cell
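The decoder error classes listed above can be expressed as a check on the decoder's select lines. This is a behavioural sketch only: `classify_decoder_output` is a made-up name, and it assumes the select lines are available as a list of 0/1 values together with the index that should be active.

```python
def classify_decoder_output(select_lines: list, expected_index: int) -> str:
    """Classify one decoder output against the four cases on the slide."""
    active = [i for i, line in enumerate(select_lines) if line]
    if not active:
        return "no cell activated"
    if len(active) > 1:
        return "multiple cells activated"
    if active[0] != expected_index:
        return "wrong cell activated"
    return "correct"


# Address 2 of a 4-cell memory should activate select line 2 only
ok = classify_decoder_output([0, 0, 1, 0], expected_index=2)
none_active = classify_decoder_output([0, 0, 0, 0], expected_index=2)
several = classify_decoder_output([0, 1, 1, 0], expected_index=2)
wrong = classify_decoder_output([0, 1, 0, 0], expected_index=2)
```

The first two error classes can be recognised from the select lines alone, whereas the "wrong cell" case needs knowledge of the requested address, which is the motivation for the additional parity re-check behind the cell array.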

57 page 57A. Steininger Checking the Address Decoder large decoders built from cascade of smaller ones re-check parity behind cell array: OR over even cells  parity ? check for missing or multiple cell activations: XOR(upper half)  XOR(lower half) ?

58 page 58A. Steininger Summary • the automotive domain has its own laws and rules: need "extremely cost-effective robust solutions for safety-critical real-time applications, versatile and custom tailored" • on node level: different redundancy concepts applicable; example: dual core CPU and memory with protection mech's; on-line testing for memory may be required • on system level: crucial role of communication infrastructure; advantages of time triggered approach; insufficient suitability of structural testing

59 page 59A. Steininger Hungry for more? http://ti.tuwien.ac.at/ecs steininger@ecs.tuwien.ac.at

60 page 60A. Steininger Related publications of my group (1) [1]T. Kottke and A. Steininger, “A Fail-Silent Memory for Automotive Applications”, 9th IEEE European Test Symposium, Corsica 2004. [2]T. Kottke and A. Steininger, “A Generic Dual Core Architecture with Error Containment”, Journal of Computing and Informatics, vol. 23, no.5, 2004. [3]T. Kottke and A. Steininger, “A Reconfigurable Generic Dual-Core Architecture”, Int’l Conference on Dependable Systems and Networks (DSN2006), Philadelphia, 2006. [4]T. Kottke and A. Steininger, “A Fail-Silent Reconfigurable Superscalar Processor”, 13 th IEEE Pacific Rim Int’l Symposium on Dependable Computing, Melbourne, 2007. [5]C. El Salloum, A. Steininger, P. Tummeltshammer and W. Harter, “Recovery Mechanisms for Dual Core Architectures”, 21st IEEE Int’l Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’06), Washington, 2006. [6]A. Steininger and C. Temple, “Economic Self-Test in the Time-Triggered Architecture”, IEEE Design & Test of Computers, vol 3/1999 [7]A. Steininger, “Testing and Built-in Self-Test – A Survey”, Journal of Systems Architecture 46(2000)

61 page 61A. Steininger Related publications of my group (2) [8] A. Steininger and C. Scherrer, “On the Necessity of BIST in Safety-Critical Applications – A Case Study”, 29th Annual Int’l Symposium on Fault-Tolerant Computing (FTCS’29), Madison, 1999. [9] C. Scherrer and A. Steininger, “How does Resource Utilization Affect Fault Tolerance?”, 2000 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’00), Yamanashi, 2001. [10] C. Scherrer and A. Steininger, “How to Tune the MTTF of a Fail-Silent System”, 2001 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’01), San Francisco, 2001 [11] C. Scherrer and A. Steininger, “Dealing with Dormant Faults in an Embedded Fault-Tolerant Computer System”, IEEE Transactions on Reliability, vol. 52, no. 4, 2003. [12] K. Thaller and A. Steininger, “A Transparent Online Memory Test for Simultaneous Detection of Functional Faults and Soft Errors in Memories”, IEEE Transactions on Reliability, vol. 52, no. 4, 2003.

62 page 62A. Steininger Related publications of my group (3) [13] E. Armengaud, F. Rothensteiner, A. Steininger, R. Pallierer, M. Horauer, M. Zauner, “A Structured Approach for the Systematic Test of Embedded Automotive Communication Systems”, Int’l Test Conference 2005, Austin 2005. [14] E. Armengaud, A. Steininger and M. Horauer, “Automatic Parameter Identification in FlexRay based Automotive Communication Networks”, 11th IEEE Int’l Conference on Emerging Technologies and Factory Automation, Prague 2006. [15] E. Armengaud, A. Steininger, M. Horauer, „Towards a Systematic Test of Embedded Automotive Communication Systems“, IEEE Transactions on Industrial Informatics vol 4, no 3 [16] P. Milbredt, A. Steininger and M. Horauer, “Automated Testing of FlexRay Clusters for System Inconsistencies in Automotive Networks”, 4th Int’l Symposium on Electronic Design, Test and Applications, Hong Kong, 2008. [17] P. Milbredt, A. Steininger, M. Horauer, „An investigation of the Clique Problem in FlexRay“, Proc. 3rd IEEE Symposium on Industrial Embedded Systems, Las Vegas, 2008.

63 page 63A. Steininger Related publications of my group (4) [18] P. Tummeltshammer and A. Steininger, „Power Supply Induced Common Cause Faults — Experimental Assessment of Potential Countermeasures“, 9th IEEE International Conference on Dependable Systems and Networks, Estoril, 2009. [19] E. Armengaud, A. Steininger, M. Horauer, R. Pallierer, “A Layer Model for the Systematic Test of Time-Triggered Automotive Communication Systems”, 5th IEEE Int’l Workshop on Factory Communication Systems, Vienna, 2004. [20] E. Armengaud, A. Steininger and M. Horauer, “Automatic Parameter Identification in FlexRay based Automotive Communication Networks”, 11th IEEE Int’l Conference on Emerging Technologies and Factory Automation, Prague 2006. [21] E. Armengaud and A. Steininger, “Pushing the Limits of Remote Online Diagnosis in Embedded Real-Time Networks”, 6th IEEE Int’l Workshop on Factory Communication Systems, Torino, 2006. [22] P. Milbredt, A. Steininger and M. Horauer, “Automated Testing of FlexRay Clusters for System Inconsistencies in Automotive Networks”, 4th Int’l Symposium on Electronic Design, Test and Applications (DELTA 2008), Hong Kong, 2008.

64 page 64A. Steininger Related PhD theses of my group T. Kottke, “Untersuchung von fehlertoleranten Prozessorarchitekturen für sicherheitsrelevante Automobilanwendungen”, PhD thesis, Vienna University of Technology, 2005. (German) C. Scherrer, “Zuverlässigkeit zweifach redundanter Architekturen unter besonderer Berücksichtigung latenter Fehler”, PhD thesis, Vienna University of Technology, 2002. (German) K. Thaller, “A Transparent Online Memory Test”, PhD thesis, Vienna University of Technology, 2001. E. Armengaud, “A Transparent Online Test Approach for Time-Triggered Communication Protocols”, PhD thesis, Vienna University of Technology, 2008. P. Tummeltshammer, “An Analysis of Common Cause Failures in Dual Core Architectures”, PhD thesis, Vienna University of Technology, 2009. G. Fuchs, “Fault-Tolerant Distributed Algorithm for Robust Tick Synchronization: Concepts, Implementations and Evaluations”, PhD thesis, Vienna University of Technology, 2009

65 page 65A. Steininger Related Projects STEACS (Systematic Test of Embedded Automotive Communication Systems) http://embsys.technikum-wien.at/projects/steacs/index.html EXTRACT (Exploiting Synchrony for Transparent Communication Services Testing) http://ti.tuwien.ac.at/ecs/research/projects/extract DARTS (Distributed Algorithms for Robust Tick Synchronization) http://ti.tuwien.ac.at/ecs/research/projects/DARTS

