SENG521 (Fall SENG 521 Software Reliability & Testing Fault Tolerant Software Systems: Techniques (Part 4a) Department of Electrical.


SENG 521 Software Reliability & Testing. Fault Tolerant Software Systems: Techniques (Part 4a). Department of Electrical & Computer Engineering, University of Calgary. B.H. Far

What Is Fault Tolerance?
A fault-tolerant computing system must be capable of providing specified services in the presence of a bounded number of failures. These failures could occur because of faults present either in the components of the system or in the system’s design. Building large computing systems is a complex task; fault-tolerance requirements could make the task even more difficult unless appropriate system structuring concepts are utilized.

Problems …
The traditional approaches to fault tolerance in hardware systems have been based on coping with the effects of well-understood failure modes of physical components. Conventional hardware fault tolerance methods are rarely powerful enough to cope with deficiencies of design. Consequently, most hardware fault tolerance techniques cannot be applied in software, where almost all faults are design faults.

History …
Defensive programming: relatively ad hoc methods are used to minimize the damage that could arise from the presence of residual bugs.
Dual software technique: two distinct versions of the same software are implemented and executed. Any discrepancy in the outputs of the two versions may trigger an alarm.
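The dual software technique can be sketched in a few lines. This is only an illustration, not code from the course: the two square-root versions, the name `dual_run`, the `DiscrepancyAlarm` exception, and the tolerance are all assumptions made for the example.

```python
import math


class DiscrepancyAlarm(Exception):
    """Hypothetical alarm raised when the two versions disagree."""


def sqrt_newton(x: float) -> float:
    """Version A (illustrative): square root by Newton's iteration."""
    if x < 0:
        raise ValueError("negative input")
    if x == 0:
        return 0.0
    guess = x
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def dual_run(version_a, version_b, x: float, tolerance: float = 1e-9) -> float:
    """Execute both versions and compare their outputs."""
    a = version_a(x)
    b = version_b(x)
    # Any discrepancy beyond the (assumed) tolerance triggers the alarm.
    if abs(a - b) > tolerance * max(1.0, abs(a), abs(b)):
        raise DiscrepancyAlarm(f"versions disagree: {a!r} vs {b!r}")
    return a


if __name__ == "__main__":
    # Version B is the library routine, standing in for an independent implementation.
    print(dual_run(sqrt_newton, math.sqrt, 2.0))
```

Note that two versions can only detect a discrepancy; they cannot tell which version is wrong.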

Fault Tolerance Phases /1
Phase 1: Error detection. For a fault to be tolerated, it must first be detected. Thus the starting point for fault-tolerance techniques is the detection of an erroneous state.
Phase 2: Damage assessment. It is necessary to assess the extent to which the system state has been damaged or corrupted. If the delay between the manifestation of a fault and the detection of the resulting error is large, the damage to the system state is likely to be more severe than if this latency interval were shorter.

Fault Tolerance Phases /2
Phase 3: Error recovery. Error recovery techniques must be utilized in order to obtain a normal, error-free system state. There are two different kinds of recovery technique.
Backward recovery consists of discarding the current (corrupted) state in favor of an earlier state. Therefore, mechanisms are needed to record and store system states.
Forward recovery involves making use of the current (corrupted) state to construct an error-free state.
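The difference between the two recovery techniques can be illustrated with a small sketch. Everything here is assumed for the example: the state layout (a list of items plus a cached total), the invariant `total == sum(items)`, and the deliberately faulty update `faulty_add`.

```python
import copy


def is_consistent(state: dict) -> bool:
    """Error detection: the cached total must equal the sum of the items."""
    return state["total"] == sum(state["items"])


def faulty_add(state: dict, item: int) -> None:
    """Illustrative faulty operation: it corrupts the cached total."""
    state["items"].append(item)
    state["total"] += item + 1  # deliberate bug


def add_with_backward_recovery(state: dict, item: int) -> bool:
    """Discard the corrupted state in favor of a recorded earlier state."""
    checkpoint = copy.deepcopy(state)  # record the state before the operation
    faulty_add(state, item)
    if not is_consistent(state):
        state.clear()
        state.update(checkpoint)  # roll back; the new item is lost
        return False
    return True


def add_with_forward_recovery(state: dict, item: int) -> bool:
    """Use the current (corrupted) state to construct an error-free state."""
    faulty_add(state, item)
    if not is_consistent(state):
        state["total"] = sum(state["items"])  # repair; the new item is kept
        return False
    return True


if __name__ == "__main__":
    s1 = {"items": [3, 5, 2], "total": 10}
    add_with_backward_recovery(s1, 4)
    print(s1)
    s2 = {"items": [3, 5, 2], "total": 10}
    add_with_forward_recovery(s2, 4)
    print(s2)
```

Backward recovery needs no knowledge of what went wrong but requires stored checkpoints; forward recovery needs enough redundancy in the current state to repair it.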

Fault Tolerance Phases /3
Phase 4: Fault treatment & continued service. Once recovery has been undertaken, it is essential to ensure that the normal operation of the system will continue without the fault immediately manifesting itself once more. The first aspect of fault treatment is to attempt to locate the fault. Following this, steps can be taken either to repair the fault or to reconfigure the rest of the system to avoid the fault.

Recovery Block Mechanism
Syntax of a recovery block construct:
    ensure <acceptance test> by P0 else-by P1 else fail
It depicts a software system with three components: the two procedures P0 (the primary) and P1 (the alternative), and the acceptance test. The design of the system is the control structure implied by the syntax. If the acceptance test is assumed to be perfect (i.e., it detects all violations of the specification), then the recovery block will tolerate all the faults of procedure P0 that could lead to its failure, by falling back on P1.

Example
Fault tolerance phases in a recovery block:
Error detection: an acceptance test (a Boolean expression) is used.
Damage assessment: only the program in execution is assumed to be affected.
Error recovery: (backward in this case) consists of restoring the state of the executing program to that at the beginning of the recovery block.
Fault treatment: the program in execution (primary or alternative) is assumed to be faulty, so its faults are avoided by executing the next alternative (if any).
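The control structure of a recovery block and its four phases can be sketched as follows. This is a minimal sketch under stated assumptions: the function `recovery_block`, the exception `RecoveryBlockFailure`, the dictionary-based state, and the sorting example (with a deliberately faulty primary) are all hypothetical names chosen for illustration.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Signalled when every alternate has been rejected ('else fail')."""


def recovery_block(state: dict, acceptance_test, alternates):
    """ensure acceptance_test by alternates[0] else-by alternates[1] ... else fail."""
    checkpoint = copy.deepcopy(state)  # state at the beginning of the block
    for alternate in alternates:  # fault treatment: try the next alternate
        try:
            alternate(state)
            accepted = acceptance_test(checkpoint, state)  # error detection
        except Exception:
            accepted = False  # an exception counts as a failed acceptance test
        if accepted:
            return state
        # Backward error recovery: restore the state saved on entry.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockFailure("all alternates failed the acceptance test")


def sorted_same_length(before: dict, after: dict) -> bool:
    """Acceptance test: output is ordered and has as many elements as the input."""
    data = after["data"]
    ordered = all(a <= b for a, b in zip(data, data[1:]))
    return ordered and len(data) == len(before["data"])


def p0_faulty(state: dict) -> None:
    """Primary with a deliberate design fault: it drops the largest element."""
    state["data"] = sorted(state["data"])[:-1]


def p1_insertion_sort(state: dict) -> None:
    """Alternative: a simpler, independently written insertion sort."""
    result = []
    for x in state["data"]:
        pos = 0
        while pos < len(result) and result[pos] <= x:
            pos += 1
        result.insert(pos, x)
    state["data"] = result


if __name__ == "__main__":
    state = {"data": [3, 1, 2]}
    recovery_block(state, sorted_same_length, [p0_faulty, p1_insertion_sort])
    print(state["data"])
```

The acceptance test used here is not perfect: it would accept an ordered output of the right length that contains the wrong elements.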

Design Technique /1
Robust Software Systems (Anderson and Lee 1981, etc.): Construction of a robust module requires (a) exception handlers for coping with exceptions propagated from lower levels; and (b) Boolean expressions for detecting exceptions arising in the module itself, together with their exception handlers. It is often possible (and desirable for the sake of simplicity) to map several exceptions onto a single handler.

Design Technique /2
Assume the use of data abstractions (abstract data types) in program development. The software system is structured into a hierarchy of modules represented by an acyclic graph. Modules are represented by nodes, and an arrow from node A to node B means that the successful completion of one or more operations in A depends on the successful completion of some operation provided by B; in other words, B provides certain services to A.

Design Technique /3
A normal chain of events consists of some procedure of ‘A’ making a call on ‘B’; ‘B’ in turn calls a lower level module (say ‘F’); this call returns normally, and subsequently A’s call on ‘B’ also returns normally.

Design Technique /4
Exception cases:
1. A call from ‘B’ to a lower level module returns an exception and this is passed to ‘A’.
2. A call from ‘B’ to a lower level module returns an exception, but ‘B’ has exception handlers that can handle this and provides a normal service to ‘A’.
3. A Boolean expression in ‘B’ (inserted specifically for detecting an error, i.e. an exception) evaluates to false. This is handled in one of two ways: either the exception is masked, in which case ‘B’ returns normally to ‘A’, or an exceptional return is obtained by ‘A’.
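The three exception cases can be sketched with ordinary exceptions. The modules here are hypothetical stand-ins: `f_read` plays the lower level module ‘F’, the `b_*` functions play operations of ‘B’, and `a_report` plays the caller ‘A’. The exception names, the key-value store, and the non-negativity check are assumptions made for the example; in case 1 the sketch converts the lower level exception into B's own exception before passing it up.

```python
class LowLevelError(Exception):
    """Exception signalled by the lower level module F (illustrative)."""


class ServiceError(Exception):
    """Exceptional return that module B gives to its caller A (illustrative)."""


def f_read(store: dict, key: str) -> int:
    """Module F: lowest level operation."""
    if key not in store:
        raise LowLevelError(key)
    return store[key]


def b_read_propagating(store: dict, key: str) -> int:
    """Case 1: F's exception is passed on to A as an exceptional return."""
    try:
        return f_read(store, key)
    except LowLevelError as exc:
        raise ServiceError(f"cannot read {key!r}") from exc


def b_read_masking(store: dict, key: str, default: int = 0) -> int:
    """Case 2: B handles F's exception and still gives A a normal return."""
    try:
        return f_read(store, key)
    except LowLevelError:
        return default


def b_read_checked(store: dict, key: str, mask: bool = False) -> int:
    """Case 3: a Boolean expression inside B detects the error."""
    value = f_read(store, key)
    if not value >= 0:  # the run-time check evaluates to false
        if mask:
            return 0  # exception masked: normal return to A
        raise ServiceError(f"negative value for {key!r}")  # exceptional return
    return value


def a_report(store: dict, key: str) -> str:
    """Module A: invokes B and defines a handler for B's exceptional return."""
    try:
        return f"{key}={b_read_propagating(store, key)}"
    except ServiceError:
        return f"{key} unavailable"


if __name__ == "__main__":
    store = {"x": 5, "y": -1}
    print(a_report(store, "x"), a_report(store, "z"))
```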

Notation
A procedure P, besides the normal return, also provides an exceptional return E:
    procedure P(--) signals E
The invoker of P can define the exceptional continuation to be some operation H, which is called the handler of E:
    P(--) [E ⇒ H]
In P the following constructs can be inserted:
    [T ⇒ .. ; {signal E}]      (1)
    O [L ⇒ .. ; {signal E}]    (2)
Construct (1) represents the case where an exception is detected by a run-time test T. Construct (2) represents the case where invocation of an operation O results in an exceptional return L, which in turn could lead to the signaling of exception E. When an exception is signaled using construct (1) or (2), control passes to the handler of that exception (H).

Example: Expected Events
Design of a procedure P which adds three positive integers. The procedure uses the operation '+', which can signal the overflow exception OV; P itself signals the exception OW to its caller.
    procedure P (var i, j, k: integer) signals OW;
    begin
        i := i + j [OV ⇒ signal OW];
        i := i + k [OV ⇒ i := i - j; signal OW];
    end;
An important aspect of exception handling is the clean-up operation: in the second handler, i is restored to its original value before OW is signaled. If all the procedures of a module follow this strategy, we get a module with the following highly desirable property: either the module produces results that reflect the desired normal service to the caller, or no results are produced and an exceptional return is obtained by the caller.
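A rough Python rendering of procedure P is given here as a sketch. Python integers do not overflow, so a bounded `add` with an assumed limit `MAX_INT` stands in for the '+' operation, and the `var` parameters are simulated by a mutable list `v = [i, j, k]`; both are assumptions of this example.

```python
MAX_INT = 2**31 - 1  # assumed machine limit for this illustration


class OV(Exception):
    """Overflow exception signalled by the bounded '+' operation."""


class OW(Exception):
    """Exception signalled by P to its caller."""


def add(a: int, b: int) -> int:
    """Bounded addition: signals OV instead of producing an oversized result."""
    result = a + b
    if result > MAX_INT:
        raise OV()
    return result


def p(v: list) -> None:
    """Adds j and k to i in place, where v = [i, j, k]."""
    try:
        v[0] = add(v[0], v[1])  # i := i + j [OV => signal OW]
    except OV:
        raise OW("overflow in i + j") from None
    try:
        v[0] = add(v[0], v[2])  # i := i + k [OV => i := i - j; signal OW]
    except OV:
        v[0] -= v[1]  # clean-up: undo the first addition
        raise OW("overflow in i + k") from None


if __name__ == "__main__":
    ok = [1, 2, 3]
    p(ok)
    print(ok)
    bad = [MAX_INT - 1, 1, 5]
    try:
        p(bad)
    except OW:
        print("OW signalled, state unchanged:", bad)
```

In both failure cases the caller sees either the complete sum or the original value of i, never an intermediate result.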

Unexpected Events /1
1. The execution of P does not terminate.
2. A lower level exception is detected for which there is no exception handler in P.
3. The execution of P terminates normally (the invoker obtains a normal return) but the results produced by P are not in accordance with the specification.
Situations (1) and (2) will eventually cause a failure of the module; situation (3) represents the case where the module has failed but this event has not yet been detected by the system.

Unexpected Events /2
To cope with such cases, we can employ a default exception handler:
    procedure P (--) signals E;
    begin
        … … …
    end [ ⇒ "default handler"];
Control goes to this handler during the execution of P whenever an exception is detected for which there is no handler.

Unexpected Events /3
Case (1): It is possible to start a ‘timer’ concurrently with the invocation of P; the ‘time out’ exception will then be handled by the default handler.
Case (2): All the lower level exceptions with no programmed handlers will similarly be handled by the default handler.
Case (3): Make use of run-time checks to detect possible violations of specifications, to minimize the danger of undetected failures.

Unexpected Events /4
What strategy should be adopted by the default handler? The simplest thing to do is to undo any side effects produced by the procedure and to signal a fail exception. When the invoker receives a fail exception, it means that the called module has failed to provide the specified service.
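One way to approximate a default handler in Python is a decorator that snapshots the state, catches any exception for which the procedure has no handler, undoes the side effects, and signals a fail exception. This is a sketch only: the names `Fail`, `with_default_handler`, and `transfer` are hypothetical, the state is assumed to be a dictionary passed as the first argument, and the timer for non-terminating procedures (case 1) is not covered.

```python
import copy
import functools


class Fail(Exception):
    """Signalled when the module could not provide the specified service."""


def with_default_handler(procedure):
    """Wrap a procedure whose first argument is a mutable dict of state."""

    @functools.wraps(procedure)
    def wrapper(state: dict, *args, **kwargs):
        snapshot = copy.deepcopy(state)  # needed to undo side effects later
        try:
            return procedure(state, *args, **kwargs)
        except Exception as exc:  # any exception with no programmed handler
            state.clear()
            state.update(snapshot)  # undo the side effects
            raise Fail(f"{procedure.__name__} failed") from exc

    return wrapper


@with_default_handler
def transfer(accounts: dict, src: str, dst: str, amount: int) -> None:
    """Illustrative procedure with no handler for an unknown destination."""
    accounts[src] -= amount
    accounts[dst] += amount  # KeyError if dst does not exist


if __name__ == "__main__":
    accounts = {"a": 10, "b": 0}
    try:
        transfer(accounts, "a", "c", 5)
    except Fail:
        print("fail signalled, accounts restored:", accounts)
```

Without the default handler, the failed transfer would leave account "a" debited even though nothing was credited.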

Design Guidelines
For a given module, carefully analyze the cases that could prevent the module from providing the desired normal services. Make use of exception handlers either to mask the effects of such undesired, but expected, exceptions or to signal an appropriate exception to the caller of the module. Make use of default exception handlers or recovery blocks to obtain a measure of tolerance against design faults.

Discussion
The capability of tolerating design faults rests largely on the ‘coverage’ of run-time checks (i.e. acceptance tests) for detecting errors. Often, it is not possible to check completely within a procedure that the results produced are in accordance with the specification (e.g. for a routine that sorts its input, a complete check that the output is a sorted rearrangement of the input could be nearly as complex as the routine itself). Hence run-time checks are often limited to checking certain critical aspects of the specification. This means that the possibility of undetected failures cannot be ruled out entirely.
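The limited coverage of a partial run-time check can be illustrated on the sorting example. The function names and the deliberately faulty sort are assumptions made for this sketch: `cheap_check` tests only ordering and length, while `full_check` additionally compares the elements.

```python
from collections import Counter


def cheap_check(inp: list, out: list) -> bool:
    """Partial acceptance test: output is ordered and has the same length."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    return ordered and len(out) == len(inp)


def full_check(inp: list, out: list) -> bool:
    """Complete check: also requires the same elements with the same counts."""
    return cheap_check(inp, out) and Counter(inp) == Counter(out)


def faulty_sort(xs: list) -> list:
    """Deliberately faulty routine: replaces every element by the minimum."""
    return [min(xs)] * len(xs) if xs else []


if __name__ == "__main__":
    data = [3, 1, 2]
    result = faulty_sort(data)
    print(cheap_check(data, result))  # the partial check accepts the wrong result
    print(full_check(data, result))   # the complete check rejects it
```

The faulty output passes the partial check, which is exactly the kind of undetected failure the slide warns about.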