SENG 521 Software Reliability & Testing (Fall 2002)
Fault Tolerant Software Systems: Techniques (Part 4a)
Department of Electrical & Computer Engineering, University of Calgary
B.H. Far (far@enel.ucalgary.ca)
http://www.enel.ucalgary.ca/~far/Lectures/SENG521/04a/

What Is Fault Tolerance
A fault-tolerant computing system must be capable of providing specified services in the presence of a bounded number of failures. These failures could occur because of faults present either in the components of the system or in the system's design.
Building large computing systems is a complex task; fault-tolerance requirements could make the task even more difficult unless appropriate system structuring concepts are used.

Problems
The traditional approaches to fault tolerance in hardware systems have been based on coping with the effects of well-understood failure modes of physical components.
Conventional hardware fault tolerance methods are rarely powerful enough to cope with deficiencies of design.
Consequently, most hardware fault tolerance techniques cannot be applied in software, where almost all faults are design faults.

History
Defensive programming: relatively ad hoc methods are used to minimize the damage that could arise from the presence of residual bugs.
Dual software technique: two distinct versions of the same software are implemented and executed. Any discrepancy between the outputs of the two versions may trigger an alarm.

Fault Tolerance Phases /1
Phase 1: Error detection
For a fault to be tolerated, it must first be detected. Thus the starting point for fault-tolerance techniques is detecting the erroneous state that a fault produces.
Phase 2: Damage assessment
It is necessary to assess the extent to which the system state has been damaged or corrupted. If the delay between the manifestation of a fault and the detection of the resulting error (the latency interval) is large, the damage to the system state is likely to be more severe than if the interval were shorter.

Fault Tolerance Phases /2
Phase 3: Error recovery
Error recovery techniques must be used to obtain a normal, error-free system state. There are two kinds of recovery technique.
Backward recovery consists of discarding the current (corrupted) state in favor of an earlier state. Mechanisms are therefore needed to record and store system states.
Forward recovery involves making use of the current (corrupted) state to construct an error-free state.

Fault Tolerance Phases /3
Phase 4: Fault treatment & continued service
Once recovery has been undertaken, it is essential to ensure that the normal operation of the system will continue without the fault immediately manifesting itself once more.
The first aspect of fault treatment is to attempt to locate the fault. Following this, steps can be taken either to repair the fault or to reconfigure the rest of the system to avoid the fault.

Recovery Block Mechanism
Syntax of a recovery block construct:
    ensure <acceptance test> by P0 else-by P1 else fail
It depicts a software system with three components: the two procedures P0 (the primary) and P1 (the alternative), and the acceptance test. The design of the system is the control structure implied by the syntax.
If the acceptance test is perfect (i.e., it detects all violations of the specification), then the alternative P1 lets the recovery block tolerate all the faults of procedure P0 that could lead to its failure.

Example
Fault tolerance phases in a recovery block:
Error detection: an acceptance test (a Boolean expression) is used.
Damage assessment: only the program in execution is assumed to be affected.
Error recovery: (backward in this case) consists of restoring the state of the executing program to what it was at the beginning of the recovery block.
Fault treatment: the program in execution (primary or alternative) is assumed to be faulty, so its faults are avoided by executing the next alternative (if any).
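The four phases above can be sketched in Python. This is a minimal illustration under stated assumptions, not the construct from the lecture: the names `recovery_block`, `RecoveryBlockFailure`, `faulty_primary`, `alternate`, and `acceptance` are hypothetical, the program state is assumed to be a dictionary, and the recovery point is taken with a deep copy.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when no alternative passes the acceptance test ('else fail')."""


def recovery_block(state, acceptance_test, alternatives):
    """Run alternatives in order until one passes the acceptance test.

    state is a dict that is modified in place; acceptance_test(before, after)
    returns True when the new state is acceptable.
    """
    checkpoint = copy.deepcopy(state)  # recovery point at block entry
    for alternative in alternatives:
        try:
            alternative(state)
            passed = acceptance_test(checkpoint, state)  # error detection
        except Exception:
            passed = False  # an exception also counts as a detected error
        if passed:
            return state
        # Backward recovery: discard the corrupted state, restore the checkpoint.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
        # Fault treatment: the loop moves on to the next alternative.
    raise RecoveryBlockFailure("all alternatives failed the acceptance test")


def faulty_primary(state):
    # Hypothetical design fault: the last element is lost.
    state["data"] = sorted(state["data"])[:-1]


def alternate(state):
    state["data"] = sorted(state["data"])


def acceptance(before, after):
    data = after["data"]
    ordered = all(a <= b for a, b in zip(data, data[1:]))
    return ordered and len(data) == len(before["data"])
```

Calling `recovery_block({"data": [3, 1, 2]}, acceptance, [faulty_primary, alternate])` rejects the primary's result, restores the checkpoint, and accepts the alternative's result. Note that this acceptance test is not perfect; it only checks ordering and length.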

Design Technique /1
Robust Software Systems (Anderson and Lee 1981, etc.)
Construction of a robust module requires:
    Exception handlers for coping with exceptions propagated from lower levels; and
    Boolean expressions for detecting exceptions arising in the module itself, together with their exception handlers.
It is often possible (and desirable for the sake of simplicity) to map several exceptions onto a single handler.

Design Technique /2
Assuming the use of data abstractions (abstract data types) in program development, the software system is structured into a hierarchy of modules represented by an acyclic graph.
Modules are represented by nodes. An arrow from node A to node B means that the successful completion of one or more operations in A depends on the successful completion of some operation provided by B; in other words, B provides certain services to A.

Design Technique /3
A normal chain of events consists of some procedure of 'A' making a call on 'B', 'B' calling a lower-level module (say 'F'), that call returning normally, and A's call subsequently returning normally.

Design Technique /4
Exception cases:
1. A call from 'B' to a lower-level module returns an exception, and this is passed to 'A'.
2. A call from 'B' to a lower-level module returns an exception, but 'B' has exception handlers that can handle it and provides a normal service to 'A'.
3. A Boolean expression in 'B', inserted specifically for detecting an error (exception), evaluates to false. This is handled in one of two ways:
    The exception is masked, in which case 'B' returns normally to 'A'.
    An exceptional return is obtained by 'A'.
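The three exception cases can be illustrated with Python exceptions. This is only a sketch: the module operations `f_sqrt`, `b_propagate`, `b_mask`, and `b_normalize`, and the exception classes, are hypothetical names chosen for the example, with 'F' as the lower-level module and the `b_*` functions playing the role of operations in 'B'.

```python
class LowerLevelError(Exception):
    """Exception signaled by the lower-level module F."""


class ServiceError(Exception):
    """Exceptional return from B to its caller A."""


def f_sqrt(x):
    # Operation of module F: signals an exception for invalid input.
    if x < 0:
        raise LowerLevelError("negative input")
    return x ** 0.5


def b_propagate(x):
    # Case 1: the exception from F is passed on to A as B's own exception.
    try:
        return f_sqrt(x)
    except LowerLevelError as exc:
        raise ServiceError("B could not provide its service") from exc


def b_mask(x):
    # Case 2: B handles the exception from F and still returns normally.
    try:
        return f_sqrt(x)
    except LowerLevelError:
        return 0.0  # assumed fallback value, chosen only for illustration


def b_normalize(values, mask=False):
    # Case 3: a run-time check inside B evaluates to false.
    total = sum(values)
    if not total > 0:
        if mask:
            return [0.0 for _ in values]  # masked: normal return to A
        raise ServiceError("cannot normalize")  # exceptional return to A
    return [v / total for v in values]
```

From A's point of view, `b_mask(-1)` and `b_normalize([0, 0], mask=True)` look like normal returns, while `b_propagate(-1)` and `b_normalize([0, 0])` produce an exceptional return.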

Notation
A procedure P, besides the normal return, also provides an exceptional return E:
    procedure P(--) signals E
The invoker of P can define the exceptional continuation to be some operation H, which is called the handler of E:
    P(--) [E ⇒ H]
In P the following constructs can be inserted:
    [T ⇒ .. ; {signal E}]      (1)
    O[L ⇒ .. ; {signal E}]     (2)
(1) represents the case where an exception is detected by a run-time test T.
(2) represents the case where the invocation of an operation 'O' results in an exceptional return L, which in turn could lead to the signaling of exception E.
When an exception is signaled using construct (1) or (2), control passes to the handler of that exception (H).

Example: Expected Events
Design of a procedure P which adds three positive integers. The procedure uses the operation '+', which signals the overflow exception 'OV'; P itself signals 'OW' to its caller.
    procedure P (var i, j, k: integer) signals OW;
    begin
      i := i+j [OV ⇒ signal OW];
      i := i+k [OV ⇒ i := i-j; signal OW];
    end;
An important aspect of exception handling is the clean-up operation: in the second handler, i := i-j restores i to its original value before OW is signaled.
If all the procedures of a module follow this strategy, we get a module with the following highly desirable property: either the module produces results that reflect the desired normal service to the caller, or no results are produced and an exceptional return is obtained by the caller.
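A rough Python rendering of the procedure P is given below. Python integers do not overflow, so the sketch assumes a 32-bit signed bound (`INT_MAX`) and a hypothetical `checked_add` that plays the role of '+' signaling OV. The `var` parameter i is modeled as a dictionary entry, which is also an assumption of this example.

```python
INT_MAX = 2**31 - 1  # assumed word size; not specified in the lecture


class Overflow(Exception):
    """Plays the role of OV, signaled by the '+' operation."""


class SumOverflow(Exception):
    """Plays the role of OW, signaled by P to its caller."""


def checked_add(a, b):
    # Stand-in for '+' on bounded positive integers.
    result = a + b
    if result > INT_MAX:
        raise Overflow()
    return result


def add_three(state, j, k):
    """Add j and k to state['i'], or leave state['i'] unchanged and signal."""
    try:
        state["i"] = checked_add(state["i"], j)
    except Overflow:
        raise SumOverflow() from None  # nothing was changed yet
    try:
        state["i"] = checked_add(state["i"], k)
    except Overflow:
        state["i"] -= j  # clean-up: undo the first addition
        raise SumOverflow() from None
```

With `state = {"i": INT_MAX - 5}`, the call `add_three(state, 5, 1)` overflows on the second addition, restores `state["i"]` to `INT_MAX - 5`, and raises `SumOverflow`, so the caller sees either the full result or no change at all.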

Unexpected Events /1
1. The execution of P does not terminate.
2. A lower-level exception is detected for which there is no exception handler in P.
3. The execution of P terminates normally (the invoker obtains a normal return), but the results produced by P are not in accordance with the specification.
Situations (1) and (2) will eventually cause a failure of the module; situation (3) represents the case where the module has failed but this event has not yet been detected by the system.

Unexpected Events /2
To cope with such cases, we can employ a default exception handler:
    procedure P (--) signals E;
    begin
      … … …
    end [ ⇒ "default handler"];
Control goes to this handler during the execution of P whenever an exception is detected for which there is no handler.

Unexpected Events /3
Case (1): It is possible to start a 'timer' concurrently with the invocation of P; the 'time out' exception will then be handled by the default handler.
Case (2): All the lower-level exceptions with no programmed handlers will similarly be handled by the default handler.
Case (3): Make use of run-time checks to detect possible violations of the specification, to minimize the danger of undetected failures.

Unexpected Events /4
What strategy should be adopted by the default handler?
The simplest thing to do is to undo any side effects produced by the procedure and to signal a fail exception.
When the invoker receives a fail exception, it means that the called module has failed to provide the specified service.
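One way to picture this default-handler strategy is a Python decorator that snapshots the state, lets declared exceptions through, and turns everything else into a fail exception after undoing side effects. The names `with_default_handler`, `Fail`, `InsufficientFunds`, and `transfer` are hypothetical, and the sketch covers only case (2); it does not implement the timer of case (1).

```python
import copy
import functools


class Fail(Exception):
    """Signaled by the default handler: the module could not provide its service."""


class InsufficientFunds(Exception):
    """An expected exception that the procedure declares and signals itself."""


def with_default_handler(*expected):
    """Wrap a procedure whose first argument is a dict holding its state."""
    def decorate(proc):
        @functools.wraps(proc)
        def wrapper(state, *args, **kwargs):
            snapshot = copy.deepcopy(state)
            try:
                return proc(state, *args, **kwargs)
            except expected:
                raise  # expected exceptions keep their programmed handling
            except Exception as exc:
                # Default handler: undo side effects, then signal fail.
                state.clear()
                state.update(snapshot)
                raise Fail(f"{proc.__name__} failed") from exc
        return wrapper
    return decorate


@with_default_handler(InsufficientFunds)
def transfer(accounts, src, dst, amount):
    if accounts[src] < amount:
        raise InsufficientFunds(src)
    accounts[src] -= amount
    accounts[dst] += amount  # raises KeyError if dst does not exist
```

A transfer to an unknown account raises `KeyError` after the source has already been debited; the default handler restores the snapshot and the caller receives `Fail` with the accounts unchanged.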

Design Guidelines
For a given module, carefully analyze the cases that could prevent the module from providing the desired normal services.
Make use of exception handlers either to mask the effects of such undesired, but expected, exceptions or to signal an appropriate exception to the caller of the module.
Make use of default exception handlers or recovery blocks to obtain a measure of tolerance against design faults.

Discussion
The capability of tolerating design faults rests largely on the 'coverage' of run-time checks (i.e., acceptance tests) for detecting errors.
Often, it is not possible to check completely within a procedure that the results produced are in accordance with the specification (e.g., for a routine that sorts its input, a complete check that the output is a sorted rearrangement of the input can be about as complex as the routine itself).
Hence run-time checks are often limited to checking certain critical aspects of the specification. This means that the possibility of undetected failures cannot be ruled out entirely.
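The coverage problem can be made concrete with the sorting example. The sketch below contrasts a cheap partial check with a fuller one; `partial_check`, `full_check`, and the deliberately wrong `faulty_sort` are hypothetical names introduced only for this illustration.

```python
from collections import Counter


def is_ordered(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))


def partial_check(inp, out):
    # Cheap acceptance test: checks only ordering and length.
    return len(inp) == len(out) and is_ordered(out)


def full_check(inp, out):
    # Fuller test: the output must also contain exactly the input's elements.
    return is_ordered(out) and Counter(inp) == Counter(out)


def faulty_sort(xs):
    # Hypothetical design fault: every element is replaced by the minimum.
    return [min(xs)] * len(xs) if xs else []
```

For the input `[3, 1, 2]`, `faulty_sort` returns a list that passes `partial_check` but not `full_check`, which is the kind of undetected failure a limited run-time check leaves open.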
