20 minutes lecture + 10 min QnA Francis Palma Lakehead University

Software Fault Tolerance and Recovery
Introduction to Software Fault Tolerance
20 minutes lecture + 10 min Q&A
Francis Palma, Lakehead University, Thunder Bay, ON
19th June 2017

Reference Books

Types of Systems

Common requirement for the software used in the above systems?

Safe and Reliable Software

Safe and reliable software operation is a significant requirement. Why? Because the cost and consequences of these systems failing can range from annoying to catastrophic: serious injury or loss of life, breached security, failed businesses, and so on.

©Francis Palma, Lakehead University, 2017

Examples of Events: Software-Related Incidents
- Problems in the backup tracking software delayed the launch of Atlantis for 3 days
- The AT&T long-distance network suffered a 9-hour US-wide blockade when one switch experienced abnormal behavior and attempted recovery, because of a flaw in the recovery-recognition software
- In the Gulf War, clock drift in the Patriot system caused it to miss a Scud missile that hit an American barracks, killing 29 and injuring 97. The clock drift was caused by the use of two different representations (24-bit and 48-bit) of the value 0.1 in software
- In 2016, Qatar Airways walked away from 4-6 aircraft orders after problems affected the A320's hydraulics and software

Topics Covered
- Fault, Failure, and Error
- Dependable Software and Means to Achieve Dependability
- Types of Error Recovery in Software Fault Tolerance
- Types of Redundancy for Software Fault Tolerance
- Checking the Understanding

An Introduction to Fault, Failure, and Error
Fault is a noun: responsibility for an accident or misfortune

Fault, Failure, and Error
- A fault is the identified cause of an error, also known as a 'bug': the actual 'mistake' in the code
- An error is the part of the system state that is liable to lead to a failure: the bad state in the system that results from the fault
- A failure is the deviation from expected behaviour observed by the user as a result of an error

Software Fault Tolerance: to prevent failures by tolerating faults whose occurrences are known when errors are detected

Fault, Failure, and Error: The Relationship
Fault (a bug or mistake in the code) -> results in -> Error (a state that may lead to failure; represented by an exception) -> causes -> Failure (the actual result deviates from the expected result)

Fault, Failure, and Error: An Example
int doubleValue (int i) {
    int result;
    result = i * i;
    print result;
}

Fault: Line 3 should be result = 2 * i
Error: If i = 1 then result = 1 * 1, which is 1 (expected 2); if i = 3 then result = 3 * 3, which is 9 (expected 6), ...
Failure: If i = 1 then result is printed as 1; if i = 3 then result is printed as 9, ...
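The same example can be replayed in Python to show that a fault does not always produce an error. This is only a sketch: the names `double_value` and `expected` and the input range are assumptions made for illustration, not part of the slide.

```python
def double_value(i: int) -> int:
    """Deliberately faulty, mirroring the slide: squares instead of doubling."""
    result = i * i  # fault: should be 2 * i
    return result


def expected(i: int) -> int:
    """The specified behaviour: twice the input."""
    return 2 * i


# Inputs for which the fault produces an erroneous state and a visible failure.
failing_inputs = [i for i in range(-3, 4) if double_value(i) != expected(i)]

# Inputs for which the fault stays dormant because i * i happens to equal 2 * i.
masked_inputs = [i for i in range(-3, 4) if double_value(i) == expected(i)]

print("failing:", failing_inputs)
print("masked:", masked_inputs)
```

For i = 0 and i = 2 the faulty line still yields the right value, so the fault is present but no error or failure is observed. This is one reason testing alone cannot guarantee that all faults are found.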

An Introduction to Dependable Software

Dependable Software

For dependable software:
- the users will have full trust in it
- the users will have confidence that it will operate as expected and that it will not 'fail' in normal use

Dependability (diagram):
- Impairments: faults, errors, failures
- Means, during construction: (1) fault avoidance, (4) fault tolerance
- Means, during validation: (2) fault removal, (3) fault forecasting
- Attributes: availability, reliability, safety, confidentiality, integrity, maintainability

Notes:
- The impairments, or those things that stand in the way of dependability, are faults, errors, and failures.
- The attributes of dependability enable the properties of dependability and provide a way to assess achievement of those properties.

The means to achieve dependability fall into two major groups: (1) those that are employed during the software construction process (fault avoidance and fault tolerance), and (2) those that contribute to validation of the software after it is developed (fault removal and fault forecasting). Briefly, the techniques are:
• Fault avoidance or prevention: to avoid or prevent fault introduction and occurrence;
• Fault removal: to detect the existence of faults and eliminate them;
• Fault/failure forecasting: to estimate the presence of faults and the occurrence and consequences of failures;
• Fault tolerance: to provide service complying with the specification in spite of faults.

The attributes are measured as follows:
• reliability: a measure of the continuous delivery of correct service or, equivalently, of the time to failure;
• availability: a measure of the delivery of correct service with respect to the alternation of correct and incorrect service;
• maintainability: a measure of the time to service restoration since the last failure occurrence or, equivalently, a measure of the continuous delivery of incorrect service;
• safety: an extension of reliability. When the state of correct service and the states of incorrect service due to non-catastrophic failure are grouped into a safe state, safety is a measure of continuous safeness or, equivalently, of the time to catastrophic failure.
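The availability attribute is described above only in words. One common quantitative reading, which the slides do not state and is added here as an assumption, is steady-state availability: mean time to failure (MTTF) divided by the sum of MTTF and mean time to repair (MTTR). A minimal sketch with made-up numbers:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time correct service is delivered.

    Assumes the system alternates between correct service (mean length MTTF)
    and incorrect service or repair (mean length MTTR).
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative figures only: 999 hours of correct service per 1 hour of repair.
a = availability(mttf_hours=999.0, mttr_hours=1.0)
print(f"availability = {a:.3f}")
```

Under this reading, availability improves either by failing less often (higher MTTF, the reliability side) or by restoring service faster (lower MTTR, the maintainability side).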

Fault Avoidance

Fault avoidance techniques that contribute to system dependability include:
- Rigorous System Requirements Specification: System failure may occur due to logic errors incorporated into the requirements specification
- Structured Design and Programming Methods: The principles of decoupling, modularization, and encapsulation (e.g., information hiding) reduce the overall complexity of the software, making it easier to understand and implement
- Software Reuse: Reduces the number of components that must be originally developed (object-oriented principles)

Despite fault prevention efforts, faults are created, so Fault Removal is required

Notes:
- Engineering is the discipline that deals with the application of science, mathematics, and other types of knowledge to design and develop products and services that improve the quality of life.
- Software engineering deals with designing and developing software of the highest quality. A software engineer analyzes, designs, develops, and tests software. Software engineers carry out software engineering projects, which usually have a standard software life cycle.
- System engineering is the sub-discipline of engineering that deals with the overall management of engineering projects during their life cycle (focusing more on physical aspects). It deals with logistics, team coordination, automatic machinery control, work processes, and similar tools.

The difference between system engineering and software engineering is not very clear. However, it can be said that system engineers focus more on users and domains, while software engineers focus more on implementing quality software.

Fault Removal

Fault removal techniques contribute to system dependability during software verification and validation:
- Software Testing
- Formal Inspection: A rigorous process to examine source code to find and correct the faults, and then verify the corrections (widely applied in industry)
- Formal Design Proofs: Using executable specifications, test cases can be automatically generated to improve the software verification process

Fault removal is not perfect, so Fault Forecasting and Fault Tolerance are needed

What is the most common fault removal technique?

Fault Forecasting

Fault forecasting is done during the validation of software to estimate the presence of faults and usually focuses on the reliability measure of dependability:
- Reliability Estimation: Determines current software reliability by applying statistical inference techniques to failure data obtained during system testing (or system operation)
- Reliability Prediction: Determines future software reliability based upon available software metrics and measures

Fault forecasting can indicate the need for Fault Tolerance

Notes: Fault forecasting is conducted by performing an evaluation of the system behavior with respect to fault occurrence or activation. Evaluation has two aspects:
• qualitative, or ordinal, evaluation, which aims to identify, classify, and rank the failure modes, or the event combinations (component failures or environmental conditions) that would lead to system failures;
• quantitative, or probabilistic, evaluation, which aims to evaluate in terms of probabilities the extent to which some of the attributes of dependability are satisfied; those attributes are then viewed as measures of dependability.

The methods for qualitative and quantitative evaluation are either specific (e.g., failure mode and effect analysis for qualitative evaluation, or Markov chains and stochastic Petri nets for quantitative evaluation), or they can be used to perform both forms of evaluation (e.g., reliability block diagrams, fault trees).
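As a rough illustration of reliability estimation from failure data, the sketch below assumes a constant failure rate (an exponential model), which is a simplifying assumption and not something the slides prescribe. The function names and the test figures are made up for the example.

```python
import math


def estimate_failure_rate(failure_count: int, total_test_hours: float) -> float:
    """Point estimate of the failure rate: observed failures per hour of testing."""
    if failure_count < 0 or total_test_hours <= 0:
        raise ValueError("need a non-negative count and positive test time")
    return failure_count / total_test_hours


def reliability(failure_rate: float, mission_hours: float) -> float:
    """Probability of failure-free operation for mission_hours,
    assuming a constant failure rate: R(t) = exp(-rate * t)."""
    return math.exp(-failure_rate * mission_hours)


# Illustrative data: 4 failures observed over 2000 hours of system testing.
rate = estimate_failure_rate(failure_count=4, total_test_hours=2000.0)
mttf = 1.0 / rate
r_100 = reliability(rate, mission_hours=100.0)
print(f"rate = {rate} per hour, MTTF = {mttf:.0f} h, R(100 h) = {r_100:.3f}")
```

Real reliability estimation usually relies on reliability growth models fitted to the failure history; the constant-rate model here is only the simplest possible starting point.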

Fault Tolerance

Fault tolerance techniques that contribute to system dependability during software development include:
- Single Version Software Environment: Partially tolerates software design faults through monitoring techniques or exception handling
- Multiple Version Software Environment (design diverse): Functionally equivalent, independently developed software versions can provide tolerance to faults. Examples: Recovery Blocks (RcB), N-version programming (NVP), and N self-checking programming (NSCP)
- Multiple Data Representation Environment (data diverse): Different representations of input data are utilized to provide tolerance to software design faults. Examples: Retry Blocks (RtB) and N-Copy Programming (NCP)

Notes: In the simplest form of N-version programming, N implementations (i.e., N versions) of an application are developed separately. In operation, these N versions are executed in parallel and their outputs are compared. Each version in an N-version system is a complete implementation of the specification, developed separately from the other N−1 versions. By developing versions separately, it is assumed that they will be based on designs that are different, a property called design diversity. Although the basic technique is quite simple, a number of issues must be kept in mind when deciding whether to use N-version programming and how to implement it.
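The N-version idea described in the notes can be sketched in a few lines. Everything here is hypothetical: three toy "versions" of a doubling function (one of them faulty) and a simple majority voter. The versions run one after another for simplicity, whereas the notes describe parallel execution.

```python
from collections import Counter


def version_a(x: int) -> int:
    return 2 * x


def version_b(x: int) -> int:
    return x + x


def version_c(x: int) -> int:
    return x * x  # faulty version: squares instead of doubling


def majority_vote(results):
    """Return the value produced by a strict majority of versions."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among versions")
    return value


def n_version_execute(versions, x):
    """Run every version on the same input and vote on the outputs."""
    return majority_vote([version(x) for version in versions])


VERSIONS = (version_a, version_b, version_c)
print(n_version_execute(VERSIONS, 3))
```

With input 3 the versions return 6, 6, and 9, so the voter outvotes the faulty version and returns 6. The scheme only helps while the versions do not share the same fault, which is exactly the design-diversity assumption noted above.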

The Fault Tolerance Process and Types of Error Recovery

The Fault Tolerance Process
A set of activities with the goal to remove errors and their effects from the computational state, before a failure occurs:
- Error Detection: An erroneous state is identified
- Error Diagnosis: The damage is assessed and the cause of the error is determined
- Error Containment/Isolation: Further damage is prevented, i.e., the error is prevented from propagating
- Error Recovery: The erroneous state is replaced with an error-free state

Give an example from daily life: a car accident on a highway.

Types of Error Recovery
Backward Recovery
- Attempts to return the system to a previously saved error-free state by restoring or rolling back the system
- System states are saved at predetermined recovery points called checkpoints

Advantages:
- Can handle unpredictable errors caused by unresolved design faults
- Requires no knowledge of the errors in the system state

Disadvantages:
- Requires significant resources (e.g., time, computation, and stable storage)
- The system might need to be halted temporarily
- A domino effect may occur, i.e., a series of interdependent roll-backs
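Checkpointing and rollback can be illustrated with a small in-memory sketch. The `CheckpointedAccount` class, its validity rule (the balance must not go negative), and `guarded_withdraw` are all hypothetical names chosen for this example; a real system would save checkpoints to stable storage.

```python
import copy


class CheckpointedAccount:
    """Toy component whose state can be saved and restored."""

    def __init__(self, balance: int) -> None:
        self.state = {"balance": balance, "history": []}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        """Save an error-free copy of the current state (the recovery point)."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Backward recovery: replace the state with the saved copy."""
        self.state = copy.deepcopy(self._checkpoint)

    def withdraw(self, amount: int) -> None:
        self.state["balance"] -= amount
        self.state["history"].append(-amount)

    def state_is_valid(self) -> bool:
        """Error detection: a negative balance is an erroneous state here."""
        return self.state["balance"] >= 0


def guarded_withdraw(account: CheckpointedAccount, amount: int) -> bool:
    """Checkpoint, run the operation, and roll back if an error is detected."""
    account.checkpoint()
    account.withdraw(amount)
    if not account.state_is_valid():
        account.rollback()
        return False
    return True


account = CheckpointedAccount(100)
print(guarded_withdraw(account, 30), account.state)
print(guarded_withdraw(account, 500), account.state)
```

Note that the rollback needs no knowledge of what went wrong, only that the state is erroneous, which matches the advantage listed above. The deep copies are the resource cost listed under the disadvantages.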

Backward Recovery (figure)
*Source: Software Fault Tolerance Techniques and Implementation, book by Laura L. Pullum, 2001.

Types of Recovery

Forward Recovery
- With a full backup image, roll forward through the archive logs to recover to a specific System Change Number (SCN), date/time, or until an administrator cancels the recovery process
- Alternatively, error compensation based on a redundancy model can be used, where redundant software processes are executed in parallel and a Fault Detection and Handling Unit (a.k.a. Adjudicator) selects the one with the correct result, e.g., the NVP fault tolerance technique

Advantages:
- Fairly efficient in terms of overhead (time and memory)
- Anticipated faults or potential loss of data can be well handled using redundancy and forward recovery

Disadvantages:
- Requires thorough knowledge of the error
- Application-specific and must be tailored to each situation or program
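The roll-forward idea in the first bullet can be sketched generically. This is not any database's actual recovery interface: the backup is a plain dictionary, the log is a list of hypothetical `(sequence_number, key, value)` records, and the sequence number merely stands in for something like an SCN.

```python
def roll_forward(backup: dict, log: list, target_seq: int) -> dict:
    """Replay logged changes onto a copy of the backup, up to and
    including target_seq, and return the recovered state."""
    state = dict(backup)  # never modify the backup image itself
    for seq, key, value in sorted(log):
        if seq > target_seq:
            break
        state[key] = value
    return state


# Illustrative data: a backup image and three logged changes made after it.
BACKUP = {"a": 1, "b": 2}
LOG = [(101, "a", 10), (102, "c", 3), (103, "b", 20)]

print(roll_forward(BACKUP, LOG, target_seq=102))
```

Recovering to sequence number 102 applies the first two changes and skips the third, giving `{'a': 10, 'b': 2, 'c': 3}`. The approach depends on knowing exactly which changes are safe to replay, which reflects the "thorough knowledge of the error" disadvantage listed above.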

Forward Recovery (figure)
*Source: Software Fault Tolerance Techniques and Implementation, book by Laura L. Pullum, 2001.

Types of Redundancy and Software Fault Tolerance

Types of Redundancy

A key concept for fault tolerance is redundancy. Redundancy can take several forms: hardware, software, information, and time.
- Hardware redundancy includes replicated and supplementary hardware added to the system to support fault tolerance. It is the most common use of redundancy
- Software redundancy includes the additional programs, modules, or objects used in the system to support fault tolerance
- Information or data redundancy uses additional information with data to assist in hardware or software fault tolerance
- Temporal redundancy involves the use of additional time to perform the tasks required to support fault tolerance

Software Redundancy
- Software faults cannot be detected by simple replication of identical software units: the same fault will exist in each copy
- Solution: Introduce diversity into the software replicas
- Basic Approach: Start with the same specification and have different programming teams develop the variants independently, which will result in functionally equivalent, design-diverse software components
- However, we need to decide on the acceptability of the results obtained by the variants. The component that performs this task is called the Adjudicator

The Adjudicator is a decision mechanism
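One possible form of adjudicator is an acceptance test, as used in recovery blocks: variants are tried in order and the first result that passes the test is accepted. The sketch below uses two hypothetical square-root variants (the primary is deliberately faulty) and a made-up acceptance test. The variants are pure functions, so there is no state to restore between attempts.

```python
import math


def primary_sqrt(x: float) -> float:
    return x / 2  # faulty variant: only right for a few inputs


def alternate_sqrt(x: float) -> float:
    return math.sqrt(x)


def acceptance_test(x: float, result: float) -> bool:
    """Adjudicator: accept a result only if squaring it gives back x."""
    return result >= 0 and math.isclose(
        result * result, x, rel_tol=1e-9, abs_tol=1e-12
    )


def recovery_block(x: float, variants=(primary_sqrt, alternate_sqrt)) -> float:
    """Try each variant in turn and return the first acceptable result."""
    for variant in variants:
        try:
            result = variant(x)
        except Exception:
            continue  # a crashing variant counts as a rejected result
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


print(recovery_block(9.0))
```

For 9.0 the primary returns 4.5, which the adjudicator rejects, and the alternate's 3.0 is accepted. The quality of the scheme rests on the acceptance test: a test that is too weak lets wrong results through, however diverse the variants are.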

Review of the Lecture
(1) In practice, software development is not error-free even if the best people, practices, and tools are used.
(2) The goal of software fault tolerance is to prevent failures by tolerating faults whose occurrences are known when errors are detected.
(3) Four means to achieve dependable software are: fault avoidance, fault removal, fault/failure forecasting, and fault tolerance.
(4) The fault tolerance process consists of four activities: error detection, error diagnosis, error containment/isolation, and error recovery.
(5) There are two types of recovery: backward recovery and forward recovery.
(6) There are four types of redundancy for fault tolerance: hardware redundancy, software redundancy, information or data redundancy, and temporal redundancy.

Checking the Understanding

Right or Wrong?
- An incorrect statement in a requirements document is often caused by a human mistake.
- A defect is also known as an error.
- Bug and fault are synonyms.
- Design errors can lead to the wrong data stored in a database.
- A coding mistake is one example of a software failure.
- An incorrect total in a printed report is an example of an error.
- Incorrect logic statements in a program are examples of defects.
- When a system stops unexpectedly, it is called a failure.

Questions?

Information or Data Redundancy
- Diverse data (not simple redundant copies) can be used for tolerating software faults
- A data re-expression algorithm (DRA) produces different representations of a module's input data
- This transformed data is input to copies of the module in data-diverse software fault tolerance techniques
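A minimal sketch of the data-diverse idea, in the spirit of N-copy programming: the same (faulty) module runs on several re-expressions of its input and the results are voted on. The module `faulty_total`, its fault, and the choice of list rotation as the DRA are assumptions made for illustration; rotation works here only because a sum does not depend on element order.

```python
from collections import Counter


def faulty_total(values: list) -> int:
    """Hypothetical module with a design fault: a leading 0 is
    mistaken for 'empty input' and the total is reported as 0."""
    if values and values[0] == 0:
        return 0
    return sum(values)


def rotate(values: list, k: int) -> list:
    """Data re-expression: rotating the list preserves its sum."""
    if not values:
        return list(values)
    k %= len(values)
    return values[k:] + values[:k]


def n_copy_total(values: list, copies: int = 3) -> int:
    """Run the same module on re-expressed inputs and take a majority vote."""
    results = [faulty_total(rotate(values, k)) for k in range(copies)]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among copies")
    return value


print(n_copy_total([0, 5, 7]))
```

On `[0, 5, 7]` the original input triggers the fault and yields 0, but the two rotated inputs both yield 12, so the vote returns 12. Unlike design diversity, only one implementation is needed; the diversity comes entirely from the input data.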

Temporal Redundancy
- Temporal redundancy commonly comprises repeating an execution using the same software and hardware resources involved in the initial, failed execution
- Backward recovery schemes typically use a combination of temporal and software redundancy
- Temporal redundancy is mainly used in human-interactive programs
- Applications with hard real-time constraints are not suitable for using temporal redundancy
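Repeating a failed execution with the same resources amounts to a retry loop. The sketch below simulates a transient fault with a hypothetical `flaky_read` operation that times out on its first two calls; the names and the attempt limit are illustrative assumptions.

```python
def flaky_read(attempts_log: list) -> str:
    """Simulated operation that fails transiently on its first two calls."""
    attempts_log.append(1)
    if len(attempts_log) < 3:
        raise TimeoutError("transient fault")
    return "sensor-ok"


def retry(operation, max_attempts: int):
    """Temporal redundancy: spend extra time re-running the same operation."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TimeoutError as error:
            last_error = error
    raise RuntimeError("operation failed after retries") from last_error


log = []
print(retry(lambda: flaky_read(log), max_attempts=5), "after", len(log), "attempts")
```

The third attempt succeeds, at the cost of the time spent on the two failed ones. That unpredictable extra time is why the slide rules this approach out for applications with hard real-time constraints, and it only helps against transient faults: a deterministic software fault would fail identically on every retry.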
