SENG521 (Fall
SENG 521 Software Reliability & Testing
Fault Tolerant Software Systems: Techniques (Part 4b)
Department of Electrical & Computer Engineering, University of Calgary
B.H. Far
Fault Tolerance (Review)
A fault-tolerant computing system must be capable of providing its specified services in the presence of a bounded number of failures. These failures can be caused by faults in the system's components or in its design. Most software faults are design deficiencies, so almost none of the hardware fault-tolerance techniques can be applied directly to software.
Acceptance Testing
A program-specific error detection mechanism that checks the results of program execution. It usually evaluates to either "true" or "false".
Usage pattern: ensure T by P0 else-by P1 else fail (T is the acceptance test; P0 and P1 are alternate versions).
Examples:
- Checksums for program parts
- Internal check points: ABS[(SQRT(x)*SQRT(x)) - x] < E
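The internal check point above can be written as a small predicate. This is a minimal sketch, not a definitive implementation: the function name `sqrt_acceptance_test` and the default tolerance are hypothetical, and scaling the tolerance by the size of `x` is an assumption added here so the check stays meaningful for large inputs.

```python
import math

def sqrt_acceptance_test(x: float, result: float, eps: float = 1e-9) -> bool:
    """Return True when `result` is an acceptable square root of `x`.

    Mirrors the check ABS[(SQRT(x)*SQRT(x)) - x] < E. Scaling the tolerance
    by max(1, |x|) is an assumption of this sketch, not part of the slide.
    """
    return abs(result * result - x) < eps * max(1.0, abs(x))

print(sqrt_acceptance_test(2.0, math.sqrt(2.0)))  # True: a correct result
print(sqrt_acceptance_test(2.0, 1.5))             # False: a faulty result
```

The test evaluates to true or false without knowing how the result was computed, which is what lets the same test guard several alternate versions.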
External Consistency
A kind of external error detection mechanism used to judge whether a program executed correctly.
Examples:
- Exception signal when dividing by zero
- Integer overflow signal
- Interrupt signal for a program loop
- Floating-point numerical failure check
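Some of these signals can be illustrated with the exceptions the Python run-time raises. The following is a rough sketch under stated assumptions: the helper names `guarded` and `checked_float` are hypothetical, and the loop-interrupt example is not covered because it needs a timer or watchdog outside the program logic.

```python
import math

def guarded(operation, *args):
    """Run `operation` and turn run-time exception signals into an error report."""
    try:
        return ("ok", operation(*args))
    except ZeroDivisionError:
        return ("error", "division by zero")
    except OverflowError:
        return ("error", "numeric overflow")

def checked_float(value: float) -> float:
    """Floating-point failure check: reject NaN and infinity."""
    if math.isnan(value) or math.isinf(value):
        raise ArithmeticError("non-finite floating-point result")
    return value

print(guarded(lambda a, b: a / b, 1.0, 0.0))  # division-by-zero signal
print(guarded(math.exp, 1000.0))              # overflow signal
```

Unlike an acceptance test, these checks come from the execution environment and need no knowledge of what the program is supposed to compute.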
Example
[Expression missing from the source slide.]
The correct answer is [value missing from the source slide], but an ordinary implementation returns zero because of rounding and the large differences in the order of magnitude of the summands.
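The expression used on the slide is not available in this text, so the sketch below reproduces the same effect with hypothetical summands chosen for illustration. It contrasts a plain accumulation loop with `math.fsum`, which keeps track of exact partial sums.

```python
import math

summands = [1e20, 1.0, -1e20]  # hypothetical values, not the slide's example

naive = 0.0
for s in summands:
    naive += s  # 1e20 + 1.0 rounds back to 1e20, so the 1.0 is lost

accurate = math.fsum(summands)  # tracks exact partial sums

print(naive)     # 0.0
print(accurate)  # 1.0
```

No exception is raised in the naive case, which is why this kind of numerical failure needs an explicit check rather than a run-time signal.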
Redundancy
Dual software technique: implement two (or more) distinct versions of the same software and execute them on the same set of inputs. Any discrepancy between the outputs of the versions may trigger an alarm.
The effectiveness of redundancy techniques depends on the presence of coincident, correlated and dependent faults.
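A minimal sketch of the dual software technique follows. The two versions, the `run_dual` helper and the `DiscrepancyAlarm` exception are all hypothetical names introduced for illustration.

```python
class DiscrepancyAlarm(Exception):
    """Raised when the two versions disagree on the same input."""

def version_a(values):
    # Hypothetical version 1: built-in sum.
    return sum(values)

def version_b(values):
    # Hypothetical version 2: explicit accumulation loop.
    total = 0
    for v in values:
        total += v
    return total

def run_dual(first, second, inputs):
    """Execute both versions on the same inputs and compare their outputs."""
    out_first = first(inputs)
    out_second = second(inputs)
    if out_first != out_second:
        raise DiscrepancyAlarm(f"{out_first!r} != {out_second!r}")
    return out_first

print(run_dual(version_a, version_b, [1, 2, 3]))  # 6
```

Agreement between the two versions only shows that they did not fail differently. If both fail in the same way on the same input, the comparison raises no alarm.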
Coincident Faults
Coincident faults: two or more functionally equivalent software components fail on the same input.
When two or more software versions give the same incorrect response, an identical-and-wrong (IAW) answer is obtained.
Correlated & Dependent Faults
Correlated faults: two faults are correlated when the measured probability of their coincident failures is significantly higher than would be expected by chance.
If [condition missing from the source slide], there is no failure independence.
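The condition after "If" did not survive in this text. A standard formulation, assumed here rather than taken from the slide, is that failure events A and B of two versions are independent when P(A and B) = P(A) P(B). The sketch below compares the observed coincident-failure rate with that prediction. The function name and the failure records are hypothetical, and no statistical significance test is included.

```python
def coincidence_report(fails_a, fails_b):
    """Compare the observed coincident-failure rate with the independence prediction.

    fails_a[i] and fails_b[i] are True when version A or B failed on test input i.
    """
    if len(fails_a) != len(fails_b) or not fails_a:
        raise ValueError("need two equally long, non-empty failure records")
    n = len(fails_a)
    p_a = sum(fails_a) / n
    p_b = sum(fails_b) / n
    p_ab = sum(a and b for a, b in zip(fails_a, fails_b)) / n
    return {"observed": p_ab, "if_independent": p_a * p_b}

# Hypothetical records over 10 inputs: both versions fail on inputs 0 and 1.
a = [True, True, False, False, False, False, False, False, False, False]
b = [True, True, False, False, False, False, False, False, False, True]
report = coincidence_report(a, b)
print(report)
```

In this made-up record the observed coincident rate (0.2) is well above the product of the individual rates (about 0.06), which is the pattern the definition of correlated faults describes.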
Possible Failure Scenario
What if the software components produce doublet or triplet IAW responses?
[Figure: versions P1, P2 and P3 feed an adjudication algorithm; doublet and triplet IAW faults are marked.]
Adjudication by Voting
A "voter" compares the results from two or more functionally equivalent software components and decides which of the answers provided by those components is correct.
Various versions of the voting algorithm:
- Majority voting
- 2-of-N voting
- Consensus voting
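The three voters can be sketched as follows. The slide only names them, so the definitions used here are assumptions based on common usage: majority voting requires agreement among at least ceil((N+1)/2) versions, 2-of-N voting requires any two versions to agree, and consensus voting picks the output with the largest agreement. Returning `None` on a consensus tie is a simplification of this sketch. Outputs are assumed to be hashable and the list non-empty.

```python
from collections import Counter
from math import ceil

def majority_vote(outputs):
    """Return the output produced by at least ceil((N+1)/2) versions, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= ceil((len(outputs) + 1) / 2) else None

def two_of_n_vote(outputs):
    """Return the most frequent output if at least two versions agree on it, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None

def consensus_vote(outputs):
    """Return the output with the largest agreement; None on a tie (a simplification)."""
    ranked = Counter(outputs).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

outputs = [7, 7, 9, 4, 5]       # hypothetical outputs of five versions
print(majority_vote(outputs))   # None: 2 agreeing versions, 3 needed
print(two_of_n_vote(outputs))   # 7
print(consensus_vote(outputs))  # 7
```

None of these voters can detect an identical-and-wrong answer that reaches the required level of agreement; they only decide among the answers they are given.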
Techniques
- Recovery blocks
- N-version programming
- Consensus recovery block
- Acceptance voting
- N self-checking programming
Recovery Blocks (RB)
Uses multiple versions of a software module together with an acceptance test. The output of the 1st module is tested for acceptability; if it fails, the 2nd module is executed after backward state recovery.
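A minimal sketch of the recovery block scheme is shown below, assuming the program state is a dictionary that can be checkpointed with `copy.deepcopy`. The names `recovery_block`, `primary`, `secondary` and `is_ascending` are hypothetical, and treating a crash as a failed alternate is a design choice of this sketch.

```python
import copy

class RecoveryBlockFailure(Exception):
    """Raised when every alternate fails the acceptance test."""

def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order; restore the checkpoint after each failure."""
    checkpoint = copy.deepcopy(state)            # establish the recovery point
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(result):
                return result
        except Exception:
            pass                                 # a crash counts as a failed alternate
        state.clear()                            # backward recovery: roll the state
        state.update(copy.deepcopy(checkpoint))  # back to the checkpoint
    raise RecoveryBlockFailure("all alternates rejected")

def primary(state):
    # Hypothetical faulty primary: corrupts the state and returns a wrong result.
    state["data"].sort(reverse=True)
    return state["data"]

def secondary(state):
    # Hypothetical alternate: sorts a copy in ascending order.
    return sorted(state["data"])

def is_ascending(result):
    return all(a <= b for a, b in zip(result, result[1:]))

state = {"data": [3, 1, 2]}
print(recovery_block(state, [primary, secondary], is_ascending))  # [1, 2, 3]
print(state)                                                      # {'data': [3, 1, 2]}
```

The second print shows why backward recovery matters: the faulty primary modified the shared state, and the alternate ran against the restored checkpoint rather than the corrupted data.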
N-Version Programming (NVP)
Parallel execution of N independently developed, functionally equivalent modules. Adjudication is via voting: the voter receives all N outputs and selects the one it judges to be correct.
Advantage of NVP: no service interruption.
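The following sketch runs the versions through a thread pool and adjudicates with a strict majority vote. It makes several assumptions: the versions are hypothetical, outputs are hashable, an exception in any version simply propagates, and threads stand in for whatever parallel hardware or processes a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def n_version_execute(versions, x):
    """Run all versions on the same input in parallel and majority-vote the outputs."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        outputs = list(pool.map(lambda version: version(x), versions))
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(versions):
        raise RuntimeError(f"no majority among outputs {outputs}")
    return value

# Three hypothetical versions of "square x"; the third contains a fault.
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
print(n_version_execute(versions, 3))  # 9
```

Because every version always runs, the faulty third version is masked by the vote without any rollback or retry, which is the sense in which service is not interrupted.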
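Consensus Recovery Block (CRB)
A combination of NVP and RB. If NVP fails, the system reverts to RB using the same blocks.
[Figure: the input goes to NVP; on success NVP delivers the correct output; on failure control passes to RB, which either delivers the correct output or ends in system failure.]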
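The two stages can be sketched as below. This is one possible reading, not a definitive implementation: the NVP stage uses a strict majority vote, the RB stage reuses the outputs already computed instead of re-running the versions, and the versions run sequentially for brevity. All names are hypothetical.

```python
import math
from collections import Counter

def consensus_recovery_block(versions, acceptance_test, x):
    """NVP stage first; if the vote fails, fall back to an RB stage over the same versions."""
    outputs = []
    for version in versions:             # executed sequentially here for brevity
        try:
            outputs.append(version(x))
        except Exception:
            outputs.append(None)         # a crashed version contributes no vote
    votes = Counter(o for o in outputs if o is not None)
    if votes:
        value, count = votes.most_common(1)[0]
        if count * 2 > len(versions):    # NVP stage succeeded: majority agreement
            return value
    for output in outputs:               # RB stage: first output passing the test
        if output is not None and acceptance_test(output):
            return output
    raise RuntimeError("system failure: no agreement and no acceptable output")

def is_perfect_square(y):
    # Hypothetical acceptance test for a "square x" specification.
    return y >= 0 and math.isqrt(y) ** 2 == y

# Hypothetical versions of "square x"; only the second one is correct.
versions = [lambda x: x + x, lambda x: x * x, lambda x: -x]
print(consensus_recovery_block(versions, is_perfect_square, 3))  # 9
```

For input 3 the three outputs all differ, so the vote fails and the acceptance test picks the second version's output.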
Acceptance Voting
As in NVP, all versions are executed in parallel. The output of each module goes to an acceptance test; if the acceptance test is successful, the output goes to a voter.
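A rough sketch of acceptance voting follows. The voter must cope with a varying number of inputs, since only accepted outputs reach it. The choice of voter (largest agreement, failure on a tie) is an assumption of this sketch, the versions run sequentially for brevity, and all names are hypothetical.

```python
from collections import Counter

def acceptance_voting(versions, acceptance_test, x):
    """Vote only among the outputs that pass the acceptance test."""
    accepted = []
    for version in versions:            # run sequentially here for brevity
        try:
            output = version(x)
        except Exception:
            continue                    # a crashed version supplies no output
        if acceptance_test(output):
            accepted.append(output)
    if not accepted:
        raise RuntimeError("no output passed the acceptance test")
    ranked = Counter(accepted).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        raise RuntimeError("accepted outputs disagree without a clear winner")
    return ranked[0][0]

# Hypothetical versions of "absolute value"; the third one is faulty.
versions = [abs, lambda x: x if x >= 0 else -x, lambda x: x]
print(acceptance_voting(versions, lambda y: y >= 0, -4))  # 4
```

Here the faulty output -4 is filtered out by the acceptance test before the vote, so the voter only sees the two agreeing outputs.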
N Self-Check Programming
In N Self-Check Programming (NSCP), N modules are executed in pairs. The outputs of each pair can be compared or assessed for correctness.
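One way to read this scheme is sketched below: each pair checks itself by comparing its two outputs, and the first pair that agrees delivers the result. The ordering of pairs, the handling of crashes and all names are assumptions made for illustration.

```python
def n_self_checking(pairs, x):
    """Return the output of the first pair whose two versions agree."""
    for first, second in pairs:
        try:
            out_first, out_second = first(x), second(x)
        except Exception:
            continue                    # a crash inside a pair disqualifies that pair
        if out_first == out_second:     # the pair's comparator accepts the result
            return out_first
    raise RuntimeError("no self-checking pair produced an agreed result")

# Hypothetical N = 4 configuration: the first pair hides a faulty version.
pairs = [
    (lambda x: x * 2, lambda x: x + 1),   # disagree for x != 1
    (lambda x: x + x, lambda x: 2 * x),   # agree
]
print(n_self_checking(pairs, 5))  # 10
```

A pair whose two versions produce an identical-and-wrong answer would still pass its own comparison, so this scheme shares the sensitivity to coincident faults discussed earlier.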