1 © H. Kopetz 8/13/2015 Twelve Principles for the Design of Safety-Critical Real-Time Systems H. Kopetz TU Vienna April 2004

2 © H. Kopetz 8/13/2015 Outline
• Introduction
• Design Challenges
• The Twelve Design Principles
• Conclusion

3 © H. Kopetz 8/13/2015 Examples of Safety-Critical Systems--No Backup
Fly-by-wire airplane: There is no mechanical or hydraulic connection between the pilot controls and the control surfaces.
Drive-by-wire car: There is no mechanical or hydraulic connection between the steering wheel and the wheels.

4 © H. Kopetz 8/13/2015 What are the Alternatives in Case of Failure?
• Fall back to human control in case of a component failure. But can humans manage the functional difference between the computer control system and the manual backup system?
• Design an architecture that will tolerate the failure of any one of its components.

5 © H. Kopetz 8/13/2015 Design Challenges in Safety-Critical Applications
In safety-critical applications, where the safety of the system-at-large (e.g., an airplane or a car) depends on the correct operation of the computer system (e.g., the primary flight control system or the by-wire system in a car), the following challenges must be addressed:
• The Challenge
• The Process of Abstracting
• Physical Hardware Faults
• Design Faults
• Human Failures

6 © H. Kopetz 8/13/2015 The Challenge
• The system as a whole must be more reliable than any one of its components: e.g., system dependability 1 FIT versus component dependability 1000 FIT (1 FIT: 1 failure in 10^9 hours).
• The architecture must support fault tolerance to mask component failures.
• The system as a whole is not testable to the required level of dependability.
• The safety argument is therefore based on a combination of experimental evidence and formal reasoning using an analytical dependability model.
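To make these orders of magnitude concrete, here is a worked conversion (not on the slide) that follows directly from the FIT definition:

$$\lambda_{\text{component}} = 1000\,\text{FIT} = 10^{-6}\,\mathrm{h}^{-1} \;\Rightarrow\; \text{MTTF} = 10^{6}\,\mathrm{h} \approx 114\ \text{years}$$
$$\lambda_{\text{system}} = 1\,\text{FIT} = 10^{-9}\,\mathrm{h}^{-1} \;\Rightarrow\; \text{MTTF} = 10^{9}\,\mathrm{h} \approx 114{,}000\ \text{years}$$

No single component can be validated to the system-level figure by testing alone, which is why the safety argument must combine experimental evidence with an analytical model.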

7 © H. Kopetz 8/13/2015 The Process of Abstracting
• The behavior of a safety-critical computer system must be explainable by a hierarchically structured set of behavioral models, each of a cognitive complexity that can be handled by the human mind.
• Establish a clear relationship between the behavioral model and the dependability model at such a high level of abstraction that the analysis of the dependability model becomes tractable. Example: any migration of a function from one ECU to another ECU changes the dependability model and requires a new dependability analysis.
• From the hardware point of view, a complete chip forms a single fault-containment region (FCR) that can fail in an arbitrary failure mode.

8 © H. Kopetz 8/13/2015 Physical Hardware Faults of SoCs
Assumed behavioral hardware failure rates (orders of magnitude):

Type of Failure                            | Failure Rate in FIT                      | Source
Transient node failures (fail-silent)      | 10^6 FIT (MTTF = 1000 hours)             | Neutron bombardment, aerospace
Transient node failures (non-fail-silent)  | … FIT (MTTF = …); tendency: increasing   | Fault-injection experiments
Permanent hardware failures                | 100 FIT (MTTF = 10^7 hours)              | Automotive field data

Design assumption in aerospace: a chip can fail in an arbitrary failure mode with a probability of the order of 10^-6 per hour.
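A minimal helper, purely illustrative, that performs the FIT/MTTF conversions used in this table:

```c
#include <stdio.h>

/* 1 FIT = 1 failure per 10^9 device-hours, so rate and MTTF are reciprocal. */
static double fit_to_mttf_hours(double fit)  { return 1e9 / fit; }
static double mttf_hours_to_fit(double mttf) { return 1e9 / mttf; }

int main(void) {
    /* Fail-silent transients: MTTF = 1000 h corresponds to 10^6 FIT. */
    printf("MTTF 1000 h -> %.0f FIT\n", mttf_hours_to_fit(1000.0));
    /* Permanent failures: 100 FIT corresponds to MTTF = 10^7 h. */
    printf("100 FIT -> %.0f h MTTF\n", fit_to_mttf_hours(100.0));
    return 0;
}
```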

9 © H. Kopetz 8/13/2015 Design Faults
No silver bullet has been found yet--and this is no silver bullet either: interface-centric design!
• Partition the system along well-specified linking interfaces (LIFs) into nearly independent software units.
• Provide a hierarchically structured set of ways-and-means models of the LIFs, each of a cognitive complexity that is commensurate with human cognitive capabilities.
• Design and validate the components in isolation w.r.t. the LIF specification and make sure that the composition is free of side effects (composability of the architecture).
• Beware of Heisenbugs!

10 © H. Kopetz 8/13/2015 The Twelve Design Principles
1. Regard the Safety Case as a Design Driver
2. Start with a Precise Specification of the Design Hypotheses
3. Ensure Error Containment
4. Establish a Consistent Notion of Time and State
5. Partition the System along Well-Specified LIFs
6. Make Certain that Components Fail Independently
7. Follow the Self-Confidence Principle
8. Hide the Fault-Tolerance Mechanisms
9. Design for Diagnosis
10. Create an Intuitive and Forgiving Man-Machine Interface
11. Record Every Single Anomaly
12. Provide a Never-Give-Up Strategy

11 © H. Kopetz 8/13/2015 Regard the Safety Case as a Design Driver (I)
• A safety case is a set of documented arguments intended to convince experts in the field (e.g., a certification authority) that the provided system as a whole is safe to deploy in a given environment.
• The safety case, which considers the system as a whole, determines the criticality of the computer system and analyzes the impact of the computer-system failure modes on the safety of the application. Example: driver assistance versus automatic control of a car.
• The safety case should be regarded as a design driver since it establishes the critical failure modes of the computer system.

12 © H. Kopetz 8/13/2015 Regard the Safety Case as a Design Driver (II)
• In the safety case, the multiple defenses between a subsystem failure and a potential catastrophic system failure must be meticulously analyzed.
• The distributed computer system should be structured such that the required experimental evidence can be collected with reasonable effort and the dependability models that are needed to arrive at the system-level safety argument remain tractable.

13 © H. Kopetz 8/13/2015 Start with a Precise Specification of the Design Hypotheses
The design hypothesis is a statement of the assumptions that are made in the design of the system. Of particular importance for safety-critical real-time systems is the fault hypothesis: a statement about the number and types of faults that the system is expected to tolerate.
• Determine the fault-containment regions (FCRs): a fault-containment region is the set of subsystems that share one or more common resources and can be affected by a single fault.
• Specify the failure modes of the FCRs and their probabilities.
• Be aware of scenarios that are not covered by the fault hypothesis. Example: total loss of communication for a certain duration.

14 © H. Kopetz 8/13/2015 Contents of the Fault Hypothesis
i. Unit of Failure: What is the fault-containment region (FCR)? A complete chip?
ii. Failure Modes: What are the failure modes of the FCR?
iii. Frequency of Failures: What is the assumed MTTF for the different failure modes, e.g., transient versus permanent failures?
iv. Detection: How are failures detected? How long is the detection latency?
v. State Recovery: How long does it take to repair corrupted state (in the case of a transient fault)?
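One way to keep these five items precise is to record them in machine-readable form. The following C sketch is hypothetical--the type and field names are illustrative, not part of any particular architecture:

```c
#include <stdint.h>

/* Failure modes an FCR is assumed to exhibit (item ii). */
typedef enum {
    FM_FAIL_SILENT,      /* node stops sending                       */
    FM_NON_FAIL_SILENT,  /* node sends detectably erroneous messages */
    FM_ARBITRARY         /* no restriction (arbitrary/Byzantine)     */
} failure_mode_t;

/* One entry of a fault hypothesis, covering items i-v of the slide. */
typedef struct {
    const char    *fcr_name;             /* i.  unit of failure, e.g. "chip_0"  */
    failure_mode_t mode;                 /* ii. assumed failure mode            */
    double         mttf_hours;           /* iii. assumed MTTF for this mode     */
    uint32_t       detection_latency_us; /* iv. worst-case detection latency    */
    uint32_t       state_recovery_ms;    /* v.  time to repair corrupted state  */
} fault_hypothesis_entry_t;
```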

15 © H. Kopetz 8/13/2015 Failure Modes of an FCR--Are there Restrictions?
A: fail-silent assumption: k + 1 FCRs tolerate k faulty FCRs
B: synchronized assumption: 2k + 1 FCRs tolerate k faulty FCRs
C: no assumption (arbitrary failures): 3k + 1 FCRs tolerate k faulty FCRs
What is the assumption coverage in cases A and B?
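The replication arithmetic behind the three cases fits in a few lines; the formulas are from the slide, the code around them is only a sketch:

```c
/* Minimum number of FCRs needed to tolerate k faulty ones,
   depending on the failure-mode assumption (cases A, B, C). */
typedef enum { ASSUME_FAIL_SILENT, ASSUME_SYNCHRONIZED, ASSUME_ARBITRARY } assumption_t;

int min_fcrs(assumption_t a, int k) {
    switch (a) {
    case ASSUME_FAIL_SILENT:  return k + 1;      /* case A                      */
    case ASSUME_SYNCHRONIZED: return 2 * k + 1;  /* case B: majority voting     */
    case ASSUME_ARBITRARY:    return 3 * k + 1;  /* case C: Byzantine agreement */
    }
    return -1; /* unreachable */
}
```

The weaker the restriction on the failure mode, the more replicas are needed--but the stronger the restriction, the lower its assumption coverage, which is exactly the question the slide poses for cases A and B.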

16 © H. Kopetz 8/13/2015 Example: Slightly-out-of-Specification (SOS) Failure
The following is an example of the type of asymmetric non-fail-silent failures that have been observed during the experiments:
[Figure: a message that arrives slightly outside the specified receive window, so that some receivers accept it while others reject it.]
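A toy model (with hypothetical window values) of why an SOS failure is asymmetric: each receiver checks the arrival time against its own, slightly different, receive window, so a marginal message splits the receivers into accepters and rejecters:

```c
#include <stdio.h>

/* Each node accepts a message only if it arrives inside its local
   receive window [start, end] (microseconds; values are invented). */
typedef struct { double start_us, end_us; } receive_window_t;

static int accepts(receive_window_t w, double arrival_us) {
    return arrival_us >= w.start_us && arrival_us <= w.end_us;
}

int main(void) {
    /* Four receivers whose windows differ by tiny clock/parameter offsets. */
    receive_window_t node[4] = {
        {0.0, 10.0}, {0.1, 10.1}, {-0.1, 9.9}, {0.0, 9.95}
    };
    double arrival = 10.05; /* slightly out of specification */
    for (int i = 0; i < 4; i++)
        printf("node %d: %s\n", i, accepts(node[i], arrival) ? "accept" : "reject");
    return 0; /* node 1 accepts, the others reject: an asymmetric failure */
}
```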

17 © H. Kopetz 8/13/2015 Example: Brake-by-Wire Application
Consider the scenario where the two right brakes (RF, RB) do not accept an SOS-faulty brake-command message, while the two left brakes (LF, LB) do accept this message and brake.
If the two left wheels brake while the two right wheels do not, the car will turn.

18 © H. Kopetz 8/13/2015 Ensure Error Containment
In a distributed computer system the consequences of a fault, the ensuing error, can propagate outside the originating FCR (fault-containment region) either by an erroneous message or by an erroneous output action of the faulty node to the environment that is under the node's control.
• A propagated error invalidates the independence assumption.
• The error detector must be in a different FCR than the faulty unit.
• Distinguish between architecture-based and application-based error detection.
• Distinguish between error detection in the time domain and error detection in the value domain.

19 © H. Kopetz 8/13/2015 Fault Containment vs. Error Containment
[Figure: an FCR without and with a separate error-detecting FCR.]
We do not need an error detector if we assume fail-silence. The error-detecting FCR must be independent of the FCR that has failed--at least two FCRs are required if a restricted failure mode is assumed.

20 © H. Kopetz 8/13/2015 Establish a Consistent Notion of Time and State
A system-wide consistent notion of a discrete time is a prerequisite for a consistent notion of state, since the notion of state is introduced in order to separate the past from the future:
"The state enables the determination of a future output solely on the basis of the future input and the state the system is in. In other words, the state enables a 'decoupling' of the past from the present and future. The state embodies all past history of a system. Knowing the state 'supplants' knowledge of the past. Apparently, for this role to be meaningful, the notion of past and future must be relevant for the system considered." (Mesarovic, Abstract Systems Theory, p. 45)
Fault masking by voting requires a consistent notion of state in the distributed fault-containment regions (FCRs).

21 © H. Kopetz 8/13/2015 Fault-Tolerant Sparse Time Base
If the occurrence of events is restricted to some active intervals of duration π, with an interval of silence of duration Δ between any two active intervals, then we call the time base π/Δ-sparse, or sparse for short.
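A small sketch of the sparse-time rule with illustrative constants: an event timestamp is only significant if it falls inside one of the active π-intervals of the π/Δ grid.

```c
#include <stdbool.h>
#include <stdint.h>

/* A pi/delta-sparse time base: active intervals of PI_TICKS granules,
   separated by silence intervals of DELTA_TICKS granules (values invented). */
#define PI_TICKS    2
#define DELTA_TICKS 8

/* True if a global timestamp lies inside an active interval of the grid. */
bool in_active_interval(uint64_t timestamp) {
    return (timestamp % (PI_TICKS + DELTA_TICKS)) < PI_TICKS;
}
```

Because all correct nodes agree on the grid, they also agree on the temporal order of sparse events--the consistency that fault masking by voting depends on.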

22 © H. Kopetz 8/13/2015 Need for Determinism in TMR Systems
[Figure: a fault-tolerant smart sensor feeds three TMR replicas (FCUs); a voter masks a faulty replica before the actuator.]
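A minimal bit-exact majority voter for such a TMR stage (a sketch, not any particular product's implementation):

```c
#include <stdint.h>

/* 2-out-of-3 majority vote over bit-exact replica outputs.
   Returns 0 on success, -1 if all three replicas disagree. */
int tmr_vote(uint32_t a, uint32_t b, uint32_t c, uint32_t *out) {
    if (a == b || a == c) { *out = a; return 0; }
    if (b == c)           { *out = b; return 0; }
    return -1; /* no majority: the fault hypothesis is violated */
}
```

Bit-exact voting only works if the replicas are deterministic: given the same state and the same input at the same (sparse) point in time, they must produce identical outputs--hence the need for determinism.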

23 © H. Kopetz 8/13/2015 Partition the System along Well-Specified LIFs
"Divide and conquer" is a well-proven method to master complexity. A linking interface (LIF) is an interface of a component that is used to integrate the component into a system of components.
• We have identified two different types of LIFs: time-sensitive LIFs and non-time-sensitive LIFs.
• Within an architecture, all LIFs of a given type should have the same generic structure.
• Avoid concurrency at the LIF level.
The architecture must support the precise specification of LIFs in the domains of time and value and provide a comprehensible interface model.

24 © H. Kopetz 8/13/2015 The LIF Specification hides the Implementation
[Figure: the component internals--operating system, middleware, programming language, WCET, scheduling, memory management, etc.--are hidden behind the linking interface specification: in-messages, out-messages, temporal properties, and meaning (the interface model).]
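To illustrate the idea, a hypothetical LIF declaration in C: it fixes the value syntax, meaning, and temporal properties of a message while revealing nothing about the implementation behind it (all names and numbers are invented):

```c
#include <stdint.h>

/* Value domain of a hypothetical "wheel speed" out-message. */
typedef struct {
    uint16_t speed_rpm;     /* meaning: measured wheel speed, 0..6000 rpm   */
    uint8_t  sensor_status; /* meaning: 0 = ok, 1 = degraded, 2 = failed    */
} wheel_speed_msg_t;

/* Temporal domain: the message is sent time-triggered. */
#define WHEEL_SPEED_PERIOD_US 1000  /* sent every 1 ms                      */
#define WHEEL_SPEED_PHASE_US   200  /* offset within the cluster cycle      */

/* Nothing here reveals the OS, middleware, language, WCET, or scheduling
   of the sending component: the LIF specification hides the implementation. */
```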

26 © H. Kopetz 8/13/2015 Composability in Distributed Systems
[Figure: two components with interface specifications A and B, coupled by a communication system that is characterized by its delay and dependability.]

27 © H. Kopetz 8/13/2015 A Component may support many LIFs
[Figure: one component offering services X, Y, and Z through three separate LIFs--fault isolation in mixed-criticality components.]

28 © H. Kopetz 8/13/2015 Make Certain that Components Fail Independently
Any dependence of FCR failures must be reflected in the dependability model--a challenging task! Independence is a system property. Independence of FCRs can be compromised by:
• shared physical resources (hardware, power supply, time base, etc.)
• external faults (EMI, heat, shock, spatial proximity)
• design
• the flow of erroneous messages

29 © H. Kopetz 8/13/2015 Follow the Self-Confidence Principle
The self-confidence principle states that an FCR should consider itself correct unless two or more independent FCRs classify it as incorrect. If the self-confidence principle is observed, then:
• a correct FCR will always make the correct decision under the assumption of a single faulty FCR;
• only a faulty FCR will make false decisions.
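The decision rule itself is tiny; a hedged sketch with an assumed peer count:

```c
#include <stdbool.h>

#define N_PEERS 4  /* number of independent peer FCRs (invented) */

/* accused[i] is true if independent peer i classifies this FCR as faulty. */
bool consider_self_correct(const bool accused[N_PEERS]) {
    int accusations = 0;
    for (int i = 0; i < N_PEERS; i++)
        if (accused[i]) accusations++;
    /* Under a single-fault hypothesis, one accusation may itself come from
       the (single) faulty FCR; only two or more independent accusations
       may override an FCR's confidence in its own correctness. */
    return accusations < 2;
}
```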

30 © H. Kopetz 8/13/2015 Hide the Fault-Tolerance Mechanisms
• The complexity of the fault-tolerance algorithms can increase the probability of design faults and defeat their purpose.
• Fault-tolerance mechanisms (such as voting and recovery) are generic mechanisms that should be separated from the application in order not to increase the complexity of the application.
• Any fault-tolerant system requires a capability to detect faults that are masked by the fault-tolerance mechanisms--this is a generic diagnostic requirement that should be part of the architecture.

31 © H. Kopetz 8/13/2015 Design for Diagnosis
The architecture and the application of a safety-critical system must support the identification of a field-replaceable unit that violates the specification:
• Diagnosis must be possible on the basis of the LIF specification and the information that is accessible at the LIF.
• Transient errors pose the biggest problems--condition-based maintenance.
• Determinism of the architecture helps!
• Avoid diagnostic deficiencies.
• Scrubbing--ensure that the fault-tolerance mechanisms work.
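One way to picture the scrubbing bullet: periodically re-run the voter over replicated state and report masked deviations instead of out-voting them silently. This sketch reuses the tmr_vote function shown earlier; diag_report is an assumed logging hook:

```c
#include <stddef.h>
#include <stdint.h>

extern int  tmr_vote(uint32_t a, uint32_t b, uint32_t c, uint32_t *out);
extern void diag_report(size_t word_index, int faulty_replica); /* assumed hook */

/* Periodic scrubbing task (illustrative): walk through replicated state,
   re-run the voter, and report any masked single-replica deviation so that
   a latent fault cannot silently erode the redundancy. */
void scrub(const uint32_t *r0, const uint32_t *r1, const uint32_t *r2, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t majority;
        if (tmr_vote(r0[i], r1[i], r2[i], &majority) == 0) {
            if (r0[i] != majority) diag_report(i, 0);
            if (r1[i] != majority) diag_report(i, 1);
            if (r2[i] != majority) diag_report(i, 2);
        }
    }
}
```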

32 © H. Kopetz 8/13/2015 Diagnostic Deficiency in a CAN Driver Interface
[Figure: engine control, assistant system, steering manager, body gateway, suspension, and brake manager nodes, each with a communication controller (CC) and I/O, attached to a single CAN bus; an erroneous CAN message with a wrong identifier appears on the bus.]
Even an expert cannot decide who sent the erroneous message.

33 © H. Kopetz 8/13/2015 Create an Intuitive and Forgiving Man-Machine Interface
• The system designer must assume that human errors will occur and must provide mechanisms that mitigate the consequences of human errors.
• Three levels of human error:
  - mistakes (misconception at the cognitive level)
  - lapses (wrong rule retrieved from memory)
  - slips (error in the execution of a rule)

34 © H. Kopetz 8/13/2015 Record Every Single Anomaly
• Every single anomaly that is observed during the operation of a safety-critical computer system must be investigated until an explanation can be given.
• This requires a well-structured design with precise external interface (LIF) specifications in the domains of time and value.
• Since in a fault-tolerant system many anomalies are masked from the application by the fault-tolerance mechanisms, the observation mechanisms must access the non-fault-tolerant layer; observation cannot be performed at the application level.
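A hypothetical anomaly record captured below the fault-tolerance layer (every field name here is illustrative):

```c
#include <stdint.h>

/* One anomaly record, logged at the non-fault-tolerant layer where
   masked deviations are still visible. */
typedef struct {
    uint64_t global_time;  /* sparse-time timestamp of the observation      */
    uint16_t fcr_id;       /* FCR at whose LIF the anomaly was observed     */
    uint8_t  domain;       /* 0 = time domain, 1 = value domain             */
    uint8_t  masked;       /* 1 if the FT mechanisms masked the effect      */
    uint32_t detail;       /* e.g., message identifier or deviation size    */
} anomaly_record_t;
```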

35 © H. Kopetz 8/13/2015 Provide a Never-Give-Up Strategy
• There will be situations when the fault hypothesis is violated and the fault-tolerant system will fail.
• Chances are good that the faults are transient and a restart of the whole system will succeed.
• Provide algorithms that detect the violation of the fault hypothesis and initiate the restart.
• Ensure that the environment is safe (e.g., by freezing the actuators) while the system restart is in progress.
• Provide an upper bound on the restart duration as a parameter of the architecture.
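A minimal never-give-up handler, sketched on top of assumed platform services (fault_hypothesis_violated, freeze_actuators, and cold_restart are hypothetical, as is the bound):

```c
#include <stdbool.h>

/* All helpers below are assumed platform services, not a real API. */
extern bool fault_hypothesis_violated(void); /* e.g., no majority in the voter */
extern void freeze_actuators(void);          /* drive outputs to a safe state  */
extern void cold_restart(void);              /* restart the whole system       */

#define MAX_RESTART_MS 50  /* architectural upper bound on restart duration */

void never_give_up_check(void) {
    if (fault_hypothesis_violated()) {
        freeze_actuators(); /* keep the environment safe during the restart   */
        cold_restart();     /* transient faults: a restart will often succeed */
        /* The architecture must guarantee that the restart completes within
           MAX_RESTART_MS, so the frozen state is never held for too long. */
    }
}
```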

36 © H. Kopetz 8/13/2015 Approach to Safety: The Swiss-Cheese Model
[Figure: multiple layers of defenses--normal function, fault tolerance, never-give-up strategy--between a subsystem failure and a catastrophic system event.]
Independence of the layers of error detection is important. (From Reason, J., Managing the Risks of Organizational Accidents, 1997.)

37 © H. Kopetz 8/13/2015 Conclusion Every one of these twelve design principles can be the topic of a separate talk! Thank you