Introduction to Fault Tolerance
Amos Wang
Credit: Dr. Axel Krings, Dr. Behrooz Parhami, Prof. Jalal Y. Kawash, Kewal K. Saluja, and Paul Krzyzanowski.

Introduction
Fault tolerance is related to dependability:
o Availability
o Reliability
o Safety
o Maintainability

Faults
Due to a variety of factors:
o Hardware failure
o Software bugs
o Operator errors
o Network errors/outages
Duration:
o Transient faults
o Intermittent faults
o Permanent faults

Failure Models

Fault Tolerance
Fault avoidance
o Design a system with minimal faults
Fault removal
o Validate and test a system to remove the faults that are present
Fault tolerance
o Deal with faults!

Redundancy
Redundancy types:
o Time redundancy: timeout and retransmit
o Software redundancy: N-version programming
o Information redundancy: Hamming codes, parity memory, ECC memory
o Hardware redundancy: RAID disks, backup servers

Time Redundancy
Key concept: do a job more than once over time
o Examples: re-execution, re-transmission of information
o Different schemes cover different faults:
  - Transient faults: re-execution and re-transmission can detect such faults, provided we wait for the transient to subside
  - Permanent faults: send or process a shifted version of the data, or send or process complemented data during the second transmission
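
A minimal Python sketch of both time-redundancy ideas, under stated assumptions. The names `reexecute_until_agree`, `make_transient_adder`, `stuck_at_zero`, and `send_with_complement` are hypothetical, and the fault models (one glitched call, one bus line stuck at 0) are invented for illustration. Plain re-execution catches the transient glitch, while sending the complemented word on the second pass exposes a stuck line that a plain repeat would miss.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1

def reexecute_until_agree(operation, attempts=3):
    """Re-run `operation` until two consecutive runs return the same value."""
    previous = operation()
    for _ in range(attempts - 1):
        current = operation()
        if current == previous:
            return current
        previous = current
    raise RuntimeError("runs never agreed; the fault may be permanent")

def make_transient_adder(faulty_call=1):
    """Hypothetical unit: computes 2 + 3, but one call is hit by a transient."""
    calls = {"n": 0}
    def add():
        calls["n"] += 1
        glitch = 1 if calls["n"] == faulty_call else 0
        return 2 + 3 + glitch
    return add

def stuck_at_zero(word, stuck_bit=0):
    """Hypothetical permanent fault: one line of the channel is stuck at 0."""
    return word & ~(1 << stuck_bit) & MASK

def send_with_complement(word, channel):
    """Send the word, then its complement; the two copies must be inverses."""
    first = channel(word)
    second = channel(~word & MASK)
    consistent = first == (~second & MASK)
    return consistent, first
```

With the stuck line, `send_with_complement(0b1011, stuck_at_zero)` reports an inconsistency because the stuck bit cannot carry both a value and its complement.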

Software Redundancy
o Multiple teams of programmers write different versions of software for the same function
o The hope is that such diversity will ensure that not all the copies fail on the same set of input data
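
As an illustration of N-version software, the sketch below runs three independently written "maximum" routines and takes a majority vote. All function names are hypothetical, and `max_v3` contains a deliberately planted bug (it fails on all-negative input) to show the vote masking one faulty version.

```python
from collections import Counter

def max_v1(values):
    return max(values)

def max_v2(values):
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    return best

def max_v3(values):
    # Deliberately buggy version: starting from 0 is wrong for all-negative input.
    best = 0
    for v in values:
        if v > best:
            best = v
    return best

def n_version_vote(versions, values):
    """Run every version on the same input and return the majority result."""
    results = [version(values) for version in versions]
    winner, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return winner
```

The vote only helps if the versions fail on different inputs, which is exactly the hope the slide expresses.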

Distributed System: Passive Replication
o Only one server processes the client’s request
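
A rough sketch of passive (primary-backup) replication, assuming a hypothetical in-memory `Server` class and a `PrimaryBackup` group in which the first live server is the primary. Only the primary processes the request; it then forwards the state update to the backups, and a backup is promoted when the primary is found to be down.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.state = {}

    def apply(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.state[key] = value

class PrimaryBackup:
    def __init__(self, servers):
        self.servers = list(servers)  # servers[0] acts as the primary

    def write(self, key, value):
        while self.servers:
            primary = self.servers[0]
            try:
                primary.apply(key, value)
            except ConnectionError:
                self.servers.pop(0)  # primary failed: promote the next backup
                continue
            # The primary forwards the update so backups stay in sync.
            for backup in self.servers[1:]:
                if backup.alive:
                    backup.state[key] = value
            return primary.name
        raise RuntimeError("no replicas left")
```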

Distributed System: Active Replication
o The client’s request is processed by all servers
o Atomic broadcast
o Can tolerate Byzantine faults

Information Redundancy
Key concept: add redundancy to information/data
o All schemes use error-detecting or error-correcting codes
o Helps to catch system-induced errors
o Parity checks
o Examples: error-correcting parity codes, Hamming code, cyclic code

Error-Correcting Parity Codes
o Simplest scheme: data is organized in a 2-dimensional array
o A single-bit error anywhere will cause one row and one column to be erroneous
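
The row-and-column idea can be sketched as follows, assuming even parity and a small bit matrix chosen for illustration. The helper names `parities` and `correct_single_error` are hypothetical. A single flipped bit makes exactly one row parity and one column parity disagree, and their intersection is the bit to flip back.

```python
def parities(block):
    """Return (row parities, column parities) of a 2-D list of bits."""
    rows = [sum(row) % 2 for row in block]
    cols = [sum(col) % 2 for col in zip(*block)]
    return rows, cols

def correct_single_error(block, row_par, col_par):
    """Fix one flipped bit in place; return its (row, col) or None if clean."""
    rows, cols = parities(block)
    bad_rows = [i for i, (got, want) in enumerate(zip(rows, row_par)) if got != want]
    bad_cols = [j for j, (got, want) in enumerate(zip(cols, col_par)) if got != want]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        block[r][c] ^= 1
        return r, c
    raise ValueError("more than one bit in error; cannot correct")
```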

Hamming Code

Compute Check

Overlapped Parity Example o data = o compute check bits:

Overlapped Parity Example
o data sent is ; transmitted check bits are 1110
o assume received data is:
  » note that the most significant bit has been corrupted/flipped
o received check bits are: 1110
o recomputed check bits:

Overlapped Parity Syndrome: 1110 XOR 0010 = 1100 (D8 as faulty)
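
The syndrome on this slide is the XOR of the received and recomputed check bits; a nonzero result points at the faulty bit. The sketch below shows the same mechanism on the standard Hamming(7,4) code, which is an assumption made for brevity and is smaller than the 8-data-bit code used in the slides. Function names are hypothetical. Parity bits sit at positions 1, 2, and 4, so the syndrome equals the position of a single flipped bit.

```python
def hamming74_encode(data):
    """data: 4 bits -> 7-bit codeword, listed for positions 1..7."""
    code = [0] * 8  # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    return code[1:]

def hamming74_syndrome(word):
    """XOR of the positions of all 1 bits; 0 means every parity check passes."""
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    return syndrome

def hamming74_correct(word):
    """Return (corrected word, syndrome); flips the bit the syndrome names."""
    word = list(word)
    syndrome = hamming74_syndrome(word)
    if syndrome:
        word[syndrome - 1] ^= 1
    return word, syndrome
```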

Hardware Redundancy
o Passive (static): uses fault masking to hide the occurrence of a fault, e.g. voting
o Active (dynamic): uses comparison for detection and/or diagnosis, and removes the faulty hardware from the system
o Hybrid

Passive Hardware Redundancy
N-Modular Redundancy (NMR)
o N independent modules replicate the same function
o Requirement: N >= 3; with N = 3 this is TMR (Triple Modular Redundancy)

Voting
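
A voter for TMR can be sketched as a bitwise two-out-of-three majority, plus a general word-level vote for N modules. The function names are hypothetical; the bitwise form mirrors the usual gate-level expression, and it masks faults as long as no bit position is wrong in more than one module.

```python
from collections import Counter

def vote_bitwise(a, b, c):
    """Two-out-of-three majority, computed independently for every bit."""
    return (a & b) | (a & c) | (b & c)

def vote_nmr(outputs):
    """Word-level majority over N module outputs; fails without a majority."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority")
    return value
```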

Active Hardware Redundancy: Duplicate and Compare
o Can only detect, but NOT diagnose
o The comparator is a single point of failure

Active Hardware Redundancy: Standby Sparing
o Only one module is driving the outputs
o Error detection => switch to a new module
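
Standby sparing might look like the sketch below. The `StandbySystem` class and its `check` callback (an acceptance test standing in for the error detector) are assumptions made for illustration. Only the first module drives the output; when the detector rejects its result, that module is switched out and the next spare takes over.

```python
class StandbySystem:
    def __init__(self, modules, check):
        self.modules = list(modules)  # modules[0] drives the output
        self.check = check            # error detector: True if output is acceptable

    def compute(self, x):
        while self.modules:
            y = self.modules[0](x)
            if self.check(x, y):
                return y
            self.modules.pop(0)  # error detected: switch to the next spare
        raise RuntimeError("spares exhausted")
```

For example, with a faulty squarer that always returns -1 and a check that rejects negative results, the first call switches to the spare and returns the correct square.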

Active Hardware Redundancy: Pair and Spare
o Duplication combined with compare and spare
o 2 modules are always on-line

Hybrid Hardware Redundancy: NMR with Spares
o N active + S spare modules (off-line)
o Replace an erroneous module from the spare pool
o Maintains N constant
o Uses an N-of-(N+S) switch
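
One voting round of NMR with spares could be sketched like this. The function name `nmr_with_spares_step` and the list-based spare pool are hypothetical. Modules that disagree with the majority are swapped for a spare, which keeps N constant until the pool runs dry.

```python
from collections import Counter

def nmr_with_spares_step(active, spares, x):
    """Vote once; return (voted output, next active set). Mutates `spares`."""
    outputs = [module(x) for module in active]
    voted, count = Counter(outputs).most_common(1)[0]
    if count <= len(active) // 2:
        raise RuntimeError("no majority")
    next_active = []
    for module, output in zip(active, outputs):
        if output == voted:
            next_active.append(module)
        elif spares:
            next_active.append(spares.pop(0))  # replace the disagreeing module
        # With no spare left, the module is simply dropped and N shrinks.
    return voted, next_active
```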

Summary

Reference

Fault tolerance in automotive systems Namhoon Kim

Fault Behavior
o Fail-operational (FO): One failure is tolerated. This is required if no safe state exists immediately after the component fails.
o Fail-safe (FS): After one (or several) failure(s), the component directly reaches a safe state (passive fail-safe) or is brought to a safe state by a special action (active fail-safe).
o Fail-silent (FSIL): After one (or several) failure(s), the component exhibits quiet behavior externally and therefore does not wrongly influence other components.

Fail Behavior Credit from Fault-Tolerant Drive-by-Wire Systems

Automotive Electronic Systems
o Communications network
o Sensors and actuators
o Electronic Control Unit (ECU)

Communication Network Figure from: Expanding automotive Electronic Systems

Reliable Communication
o The network should remain active and working even in case of an error
o Active redundancy and error detection
o Two approaches to operation:
  - Event-triggered (ET) systems: transmissions are driven by the occurrence of events
  - Time-triggered (TT) systems: transmissions are driven by the progress of time

Time-Triggered vs. Event-Triggered
Dependability is much easier to ensure using a TT bus:
1. Access to the medium is deterministic
2. Adding new nodes without affecting existing ones is simple
3. The behavior of a TT system is predictable
4. Message transmissions can be used as “heartbeats”
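
The determinism and heartbeat points can be illustrated with a toy time-triggered schedule. This is a simplified sketch: the node names, the 2 ms slot length, and the helpers `slot_owner` and `silent_nodes` are all assumptions. Because the owner of every slot is a pure function of time, a slot that stays empty immediately identifies a node that has gone silent.

```python
def slot_owner(schedule, slot_length_ms, now_ms):
    """Deterministic medium access: the current time alone decides who sends."""
    cycle_ms = len(schedule) * slot_length_ms
    return schedule[(now_ms % cycle_ms) // slot_length_ms]

def silent_nodes(schedule, slots_with_frames):
    """Heartbeat check: nodes whose slot carried no frame during the last cycle."""
    return [node for slot, node in enumerate(schedule)
            if slot not in slots_with_frames]
```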

Fault Tolerance in Communication
EMI (electromagnetic interference)
o EMI can be radiated by in-vehicle devices (switches, relays, etc.)
o Use a resilient physical layer (e.g., optical)
o Or replicate the transmission channels
o A cyclic redundancy check (CRC) can detect a corrupted frame
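
A sketch of CRC-protected frames over replicated channels. It uses Python's `zlib.crc32` purely for convenience; automotive buses define their own CRC polynomials and frame layouts, so the 4-byte trailer and the helper names here are assumptions. The receiver accepts the first channel copy whose CRC matches.

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def frame_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received

def receive_redundant(copies):
    """Given one frame copy per channel, return the first intact payload."""
    for frame in copies:
        if frame_ok(frame):
            return frame[:-4]
    return None
```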

Fault Tolerance in Communication
Bus guardian component
o Avoids the “babbling idiot” situation
o Restricts the node’s ability to transmit
o Allows transmission only when the node exhibits a specified behavior
o Ideally, the bus guardian should have its own copy of the communication schedule and its own power supply, and should be able to construct the global time itself
o Due to cost, these assumptions are generally not fulfilled
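
A bus guardian can be pictured as a gate that holds its own copy of the schedule, as in the idealized case described above. The `BusGuardian` class, the slot length, and the node names are hypothetical. A babbling node that tries to send all the time gets through only during its own slot.

```python
class BusGuardian:
    def __init__(self, node, schedule, slot_length_ms):
        self.node = node
        self.schedule = list(schedule)  # the guardian's own copy of the schedule
        self.slot_length_ms = slot_length_ms

    def may_transmit(self, now_ms):
        cycle_ms = len(self.schedule) * self.slot_length_ms
        slot = (now_ms % cycle_ms) // self.slot_length_ms
        return self.schedule[slot] == self.node

    def forward(self, frame, now_ms):
        """Pass the frame to the bus only inside the node's own slot."""
        return frame if self.may_transmit(now_ms) else None
```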

In-Vehicle Networks
Two or three separate controller area networks (CANs)
o A low-speed CAN (< 125 kbps) manages “comfort electronics”
o A high-speed CAN runs more real-time-critical functions
o A very cost-effective and well-performing solution over the last 20 years
Local interconnect network (LIN)
o A cheap serial network
o A master-slave, time-triggered protocol
o Used for on-off devices (door locks, sunroofs, rain sensors, door mirrors)

In-Vehicle Networks
Media-oriented systems transport (MOST)
o A fiber-optic network protocol with capacity for high-volume streaming
o For multimedia networking in automobiles
o Redundant double-ring configurations for safety-critical applications
o Developed by more than 50 firms (including Audi, BMW, Daimler-Chrysler, Toyota, Volkswagen, Volvo)

In-Vehicle Networks
FlexRay
o BMW, Bosch, GM, Daimler-Chrysler, Philips, and Motorola are collaborating on FlexRay
o A fault-tolerant protocol designed for high-data-rate applications
o Time-triggered communication with bus guardian and clock synchronization on dual wires
o Allows event-triggered behavior
o Real-time data transmission with bounded latency
o Full use of FlexRay was introduced in 2008 in the new BMW 7 Series

Sensors and Actuators
o Sensors are the first in the information flow
o Static or dynamic redundancy with cold or hot standby can be used
o The fail-silence property of actuators is essential
o Fail-silent: after a failure the component remains silent, so that it cannot wrongly influence other components

Fault-Tolerant Sensors Credit from Fault-Tolerant Drive-by-Wire Systems

Fault-Tolerant Actuator Credit from Fault-Tolerant Drive-by-Wire Systems

An Example Brake-by-Wire System
Electromechanical brake, developed by Continental Teves, Germany
The system consists of:
o 4 electromechanical wheel brake modules
o An electromechanical brake pedal module
o A communication and power system
o A central brake management computer
Credit from Fault-Tolerant Drive-by-Wire Systems

An Example Brake-by-Wire System
Figure from Safety in automotive by-wire systems
The communication system and power system have dynamic redundancy with hot standby.

An Example Brake-by-Wire System Figure from Safety in automotive by-wire systems

An Example Brake-by-Wire System Figure from Safety in automotive by-wire systems

ECU Lock-step dual processor architecture Figure from Fault Tolerant Platforms for Automotive Safety Critical Applications

Lock-Step Architecture
o Two processors, referred to as the master and the checker, execute the same code while strictly synchronized
o The master has access to the system memory and drives all system outputs
o The checker continuously executes the instructions fetched by the master
o The compare logic checks the consistency of their data, address, and control lines
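
The master/checker arrangement can be mimicked in software as a rough analogy. The tiny `alu`, the planted fault in `faulty_alu`, and `run_lockstep` are all hypothetical. Both cores execute each instruction, the compare step checks that the results match, and on the first mismatch no further outputs are driven. The comparison detects the disagreement but cannot tell which core is wrong.

```python
def alu(op, a, b):
    """Hypothetical two-instruction core."""
    return a + b if op == "add" else a * b

def faulty_alu(op, a, b):
    # Planted fault for illustration: multiplier output bit 0 is stuck at 1.
    result = alu(op, a, b)
    return result | 1 if op == "mul" else result

def run_lockstep(program, master, checker):
    """Return (outputs driven by the master, index of first mismatch or None)."""
    outputs = []
    for step, (op, a, b) in enumerate(program):
        master_result = master(op, a, b)
        checker_result = checker(op, a, b)
        if master_result != checker_result:
            return outputs, step  # stop driving outputs (fail-silent)
        outputs.append(master_result)
    return outputs, None
```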

ECU Loosely-synchronized dual processor architecture Figure from Fault Tolerant Platforms for Automotive Safety Critical Applications

Loosely-Synchronized Architecture
o Two CPUs run independently, with access to distinct memory subsystems
o A real-time operating system handles interprocessor communication and synchronization
o The OS is responsible for error detection (cross-checks), correction, and containment
o Critical tasks are executed in parallel as software replicas

ECU Triple modular redundant (TMR) architecture Figure from Fault Tolerant Platforms for Automotive Safety Critical Applications

TMR Architecture
o Three identical CPUs execute the same code in lock-step
o A majority vote of the outputs masks any single CPU fault
o Memory and communication faults can be masked by employing ECC techniques

ECU Dual lock-step architecture Figure from Fault Tolerant Platforms for Automotive Safety Critical Applications

Dual Lock-Step Architecture
o Consists of the combination of two fail-silent channels
o Each channel consists of a lock-step architecture
o Can be used in different configurations:
  - Two cores execute the same code in lock-step: provides fault-tolerance capability
  - Two channels operate independently: behaves like a traditional dual-processor solution

References
M. Davies, Safety in automotive by-wire systems, Vienna University of Technology, Jun
G. Leen and D. Heffernan, Expanding Automotive Electronic Systems, IEEE Computer, vol. 35, no. 1, pp , Jan
R. Isermann, R. Schwarz, and S. Stoelzl, Fault-Tolerant Drive-by-Wire Systems, IEEE Control Systems, vol. 22, no. 5, pp , Oct
N. Navet and F. Simonot-Lion, Fault Tolerant Services For Safe In-Car Embedded Systems, in The Embedded Systems Handbook, CRC Press, Aug
M. Baleani, A. Ferrari, L. Mangeruca, A. Sangiovanni-Vincentelli, M. Peri, and S. Pezzini, Fault-Tolerant Platforms for Automotive Safety-Critical Applications, in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp ,
D. Wanner, A. Trigell, L. Drugge, and J. Jerrelind, Survey on Fault-Tolerant Vehicle Design, in Proceedings of the 26th Electric Vehicle Symposium, May 2012.