Software Fault Tolerance – The big Picture. RTS, April 2008. Anders P. Ravn, Aalborg University.



Fault Tolerance: Means to isolate component faults. Prevents system failures. May increase system dependability.

Dependability - attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability (BW p. 129)

Dependability - impairments: Faults, Errors, Failures (BW p. 103,...,130)
Chain: Fault → Error → Failure → ... → Fault

System and Component

Dependability - means: Fault prevention, Fault tolerance, Error Removal, Failure Forecasting (BW p. 106,..., 130)

Fault classification
Origin: physical (internal/external), logical (design/interaction)
Kind: omission, value, timing, byzantine
Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent)

Error Classification (Fault → Error)
Effect: latent, effective
Extent: local, distributed

Failure Classification (Fault → Error → Failure)
Consequence: benign, malign (a mishap)
BW (Failure modes) p. 105

Fault Avoidance
Careful Design: process (activities), notations, tools
Conservative Design: robust functionality, testability, traceability

Error Removal
Verification: analysis of design
Test: analysis of implementation

Failure Forecasting
Calculation: analysis of design
Simulation: measurement on design
Test: measurement on implementation

Fault Tolerance: Means to isolate component faults ... and mask them. Prevents system failures. May increase system dependability.


Fault Tolerance

FT - levels: Full tolerance, Graceful Degradation, Fail safe (BW p. 107)

FT basis: Redundancy (BW p. 109)
Time: Try, Retry, ...
Space: Try, ...
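Redundancy in time can be sketched as a retry loop. This is a minimal illustration, not taken from the slides: the names `run_with_retry` and `make_flaky` are hypothetical, and it assumes a transient fault shows up as an exception.

```python
def run_with_retry(operation, attempts=3):
    """Time redundancy: repeat the same operation until it succeeds."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # assumption: a transient fault raises an exception
            last_error = exc
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


def make_flaky(failures):
    """Hypothetical operation that fails `failures` times, then returns 42."""
    remaining = {"count": failures}

    def operation():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise IOError("transient fault")
        return 42

    return operation


result = run_with_retry(make_flaky(2))  # succeeds on the third try
```

A retry only helps against transient faults; a permanent fault fails every attempt, which is where redundancy in space comes in.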

N-version programming (BW p. 109)
Versions V1, V2, V3 report to a Driver (comparator)
Comparison vectors (votes)
Comparison status indicators
Comparison points
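The driver's role can be sketched as a majority voter. This is a simplified illustration under stated assumptions: results are hashable and compared for exact equality, a crashed version simply casts no vote, and the function name `n_version_vote` and the three sample versions are hypothetical.

```python
from collections import Counter

_NO_VOTE = object()  # marker for a version that raised instead of answering


def n_version_vote(versions, x):
    """Run every version on the same input and return (majority value, status list)."""
    votes = []
    for version in versions:
        try:
            votes.append(version(x))
        except Exception:
            votes.append(_NO_VOTE)
    usable = [v for v in votes if v is not _NO_VOTE]
    if not usable:
        raise RuntimeError("no version produced a result")
    value, count = Counter(usable).most_common(1)[0]
    if count * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    # Status indicator per version: did it agree with the majority?
    status = [v is not _NO_VOTE and v == value for v in votes]
    return value, status


def v1(x):
    return x * x


def v2(x):
    return x ** 2


def v3(x):
    return x * x + 1  # deliberately faulty version


value, status = n_version_vote([v1, v2, v3], 4)
```

Exact comparison is the simplest choice; with floating-point results a real comparator would need a tolerance, which this sketch leaves out.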

Fault classification (scope of N-VP)
Origin: physical (internal/external), logical (design/interaction)
Kind: omission, value, timing, byzantine
Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent)
N-VP coverage marks: + (+) ++ (+) + / (+) + / +

Dynamic Redundancy (BW p. 114)
1. Error detection
2. Damage confinement and assessment
3. Error recovery
4. Fault treatment and continued service

Error Detection (BW p. 115)
f: State x Input → State x Output
Detected by: Environment (exception), Application
Assertion: precondition(input), postcondition(input, output), invariant(state, state')
Timing: WCET(f, input), Deadline(f, input) D
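Application-level detection can be sketched as a wrapper around f that checks the assertions and the deadline. The names `with_error_detection`, `DetectedError` and `accumulate` are hypothetical, and the timing check is an assumption of this sketch: it measures elapsed time after the call, so it detects a missed deadline but does not preempt f.

```python
import time


class DetectedError(Exception):
    """Hypothetical exception type signalling a detected error."""


def with_error_detection(f, pre, post, inv, deadline_s):
    """Wrap f: State x Input -> State x Output with assertion and timing checks."""

    def checked(state, inp):
        if not pre(inp):
            raise DetectedError("precondition violated")
        start = time.monotonic()
        new_state, out = f(state, inp)
        elapsed = time.monotonic() - start
        if not post(inp, out):
            raise DetectedError("postcondition violated")
        if not inv(state, new_state):
            raise DetectedError("invariant violated")
        if elapsed > deadline_s:
            raise DetectedError("deadline missed")
        return new_state, out

    return checked


def accumulate(total, amount):
    """Example f: add a non-negative amount to a running total."""
    new_total = total + amount
    return new_total, new_total


safe_accumulate = with_error_detection(
    accumulate,
    pre=lambda amount: amount >= 0,
    post=lambda amount, out: out >= amount,
    inv=lambda old, new: new >= old,
    deadline_s=1.0,
)
```

Calling `safe_accumulate(2, 3)` passes all checks, while a negative input is rejected at the precondition before f runs.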

Damage Confinement (BW p. 117): Static structure, Dynamic structure
Diagram labels: object, I, I

Error Recovery (BW p. 118)
Forward: Repair the state – if you can!
Backward: define recovery points, checkpoint state at recovery point, roll back, retry
Domino effect
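Backward recovery can be sketched with an in-memory checkpoint. This is a minimal single-process illustration; the class `Recoverable` and the `transfer` operation are hypothetical, and the state is assumed to be copyable with `copy.deepcopy`.

```python
import copy


class Recoverable:
    """Holds a state together with the checkpoints taken at recovery points."""

    def __init__(self, state):
        self.state = state
        self._checkpoints = []

    def checkpoint(self):
        """Define a recovery point by saving a deep copy of the state."""
        self._checkpoints.append(copy.deepcopy(self.state))

    def roll_back(self):
        """Restore the latest recovery point; it is kept so a retry can roll back again."""
        self.state = copy.deepcopy(self._checkpoints[-1])


def transfer(accounts, src, dst, amount):
    """Example operation that can fail halfway, leaving the state erroneous."""
    accounts[src] -= amount
    if accounts[src] < 0:
        raise ValueError("insufficient funds")
    accounts[dst] += amount


ledger = Recoverable({"a": 10, "b": 0})
ledger.checkpoint()
try:
    transfer(ledger.state, "a", "b", 50)  # fails after debiting "a"
except ValueError:
    ledger.roll_back()                    # undo the partial update
    transfer(ledger.state, "a", "b", 5)   # retry with a smaller amount
```

With several communicating processes, rolling one back may force its partners back to earlier recovery points too, which is the domino effect named above; this single-process sketch does not show that.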

Recovery blocks (BW p. 120)
ENSURE acceptance_test
BY { module_1 }
ELSE BY { module_2 }
...
ELSE BY { module_m }
ELSE ERROR
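The ENSURE ... BY ... ELSE BY scheme can be sketched as a function that tries each module from the same recovery point and applies the acceptance test to its result. The function `recovery_block` and the two sorting modules are hypothetical; the sketch assumes the state can be deep-copied and that a module signals failure either by raising or by failing the test.

```python
import copy


def recovery_block(state, acceptance_test, modules):
    """Try modules in order; return the first result that passes the acceptance test."""
    recovery_point = copy.deepcopy(state)
    for module in modules:
        working = copy.deepcopy(recovery_point)  # every alternative starts from the recovery point
        try:
            result = module(working)
        except Exception:
            continue  # treat an exception like a failed acceptance test
        if acceptance_test(recovery_point, result):
            return result
    raise RuntimeError("recovery block failed: no module passed the acceptance test")


def is_sorted_same_length(original, result):
    """Acceptance test: result is ordered and has as many items as the input."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(original)


def faulty_sort(data):
    """Primary module with a design fault: it reverses instead of sorting."""
    data.reverse()
    return data


def simple_sort(data):
    """Alternate module."""
    return sorted(data)


result = recovery_block([3, 1, 2], is_sorted_same_length, [faulty_sort, simple_sort])
```

The acceptance test here is deliberately weaker than a full specification check, as acceptance tests usually are; it would accept a sorted list of the wrong elements.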

The ideal FT-component (BW p. 126)
Parts: Normal mode, Exception Handler
Signals: Request/response, Interface exception (x2), Failure exception (x2)
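The split between normal mode and exception handler can be sketched as a class with two kinds of outgoing exception. This is one possible reading of the diagram, not the slide's own design: the class `FTComponent`, its constructor parameters, and the use of an alternate routine as the handler's masking strategy are assumptions.

```python
import math


class InterfaceException(Exception):
    """The request was invalid; the caller is at fault."""


class FailureException(Exception):
    """The component detected an internal error it could not mask."""


class FTComponent:
    def __init__(self, valid, primary, alternate):
        self._valid = valid          # check on incoming requests
        self._primary = primary      # normal-mode service
        self._alternate = alternate  # used by the exception handler

    def request(self, x):
        if not self._valid(x):
            raise InterfaceException(f"invalid request: {x!r}")
        try:
            return self._primary(x)  # normal mode
        except Exception:
            return self._handle(x)   # switch to the exception handler

    def _handle(self, x):
        try:
            return self._alternate(x)  # try to mask the error
        except Exception as exc:
            raise FailureException("error could not be masked") from exc


def broken_sqrt(x):
    """Hypothetical primary routine with a design fault."""
    raise ArithmeticError("design fault")


component = FTComponent(lambda x: x >= 0, broken_sqrt, math.sqrt)
answer = component.request(9)  # primary fails, handler masks it
```

Exceptions raised by `primary` and `alternate` play the role of the interface and failure exceptions arriving from lower-level components in the diagram.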

Safety Assessment: Find faults that may lead to mishaps, analyze their relations, and estimate their consequences. May involve probabilistic reasoning (Reliability Engineering).

Fault Tree - Events
Primary events:
  Basic event: fault in atomic component
  Undeveloped event: fault in composite component (may be analyzed later)
  External event: expected event from environment
Intermediate event: node inside a fault tree

Fault Tree - Gates: ..., Inhibit gate (with condition)

Example – "Wake too late"
Top event: Wake too late
Inputs: Alarm clock fails, Phone fails, "Inner clock" fails

Example "Alarm clock fails"
Top event: Alarm clock fails
Events below it: Beeper fails, Button fails, electronics fail, SW fails, Power fails, Button read fails, Beeper not set

Cut Set: A cut set is a set of events that together cause the top-level event. A singleton cut set is a single point of failure.
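Minimal cut sets can be computed by expanding the gates of a fault tree. The tree below reuses event names from the alarm-clock example, but its shape is an assumption: the transcript does not say which gates connect the events, so the AND at the top, the OR gates below it, and the reduced set of events are chosen only for illustration.

```python
from itertools import product


def minimal(sets):
    """Keep only cut sets that have no proper subset among the others."""
    return {s for s in sets if not any(t < s for t in sets)}


def cut_sets(node):
    """Return the minimal cut sets of a node as a set of frozensets of basic events."""
    if isinstance(node, str):  # basic event
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":    # any one child suffices
        combined = set().union(*child_sets)
    elif gate == "AND":  # one cut set from every child is needed
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimal(combined)


# Assumed structure, for illustration only.
sw_fails = ("OR", ["Button read fails", "Beeper not set"])
alarm_clock_fails = ("OR", ["Beeper fails", "Button fails", "Power fails", sw_fails])
wake_too_late = ("AND", [alarm_clock_fails, "Phone fails", "Inner clock fails"])

top_cut_sets = cut_sets(wake_too_late)
single_points = {s for s in top_cut_sets if len(s) == 1}
```

Under these assumed gates every cut set of the top event has three events, so there is no single point of failure, whereas each basic event under "Alarm clock fails" is a singleton cut set of that subtree.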


Extensions etc.
Probabilities on edges
Event tree (forward analysis from initiating event)
Combinations (cause-consequence diagrams)
Many tools
Kirsten M. Hansen, Anders P. Ravn and Victoria Stavridou, From Safety Analysis to Formal Specification, IEEE Trans. Softw. Eng. 24, pp , July 1998
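When probabilities are attached to a fault tree, the standard gate formulas apply under the assumption that the input events are statistically independent. The symbols are assumptions of this note, not of the slides: p_i is the probability of the i-th input event of a gate with n inputs.

```latex
P(\text{AND gate output}) = \prod_{i=1}^{n} p_i
\qquad
P(\text{OR gate output}) = 1 - \prod_{i=1}^{n} (1 - p_i)
```

If the inputs share a common cause, for example one power supply feeding two components, independence fails and these formulas can badly underestimate the probability of the top event.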

Example

Fault Hypotheses

Fault-Tolerant System

Impulse Generator

CU

Voter and Arbiter

Parameters

Properties

Procedure
1. Model the correct component and check that it has the desired properties.
2. Model relevant faults and introduce them as internal transitions to error states. Check that this fault-affected ...
3. Introduce into the model the mechanisms for fault detection, error recovery and masking and check that the desired properties are valid for this design.
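The three steps can be imitated on a toy transition system with a reachability check. Everything here is hypothetical: the states, the property "the failed state is never reached", and in particular the second check, which assumes that step 2 is meant to show that the fault-affected model violates the property (the step is truncated in the transcript).

```python
def reachable(initial, transitions):
    """Return every state reachable from `initial` in a transition relation."""
    seen = {initial}
    frontier = [initial]
    while frontier:
        state = frontier.pop()
        for successor in transitions.get(state, ()):
            if successor not in seen:
                seen.add(successor)
                frontier.append(successor)
    return seen


def never_fails(transitions):
    """Desired property: the state 'failed' is unreachable from 'idle'."""
    return "failed" not in reachable("idle", transitions)


# Step 1: the correct component.
correct = {"idle": {"working"}, "working": {"idle"}}

# Step 2: a fault adds an internal transition to an error state,
# and an untreated error leads to failure.
faulty = {**correct, "working": {"idle", "error"}, "error": {"failed"}}

# Step 3: detection and recovery replace the path from error to failure.
tolerant = {**faulty, "error": {"idle"}}

results = (never_fails(correct), never_fails(faulty), never_fails(tolerant))
```

A real analysis would use a model checker and timed or probabilistic models; the sketch only shows how the three models relate to one another.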