Software Reliability Engineering: A Roadmap

Future of Software Engineering, ICSE’2007
Minneapolis, Minnesota, May 24, 2007
Michael R. Lyu
Dept. of Computer Science & Engineering
The Chinese University of Hong Kong

Introduction
Software reliability is the probability of failure-free operation over a specified execution time in a specified environment.
Software reliability engineering (SRE) is the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability.
SRE has been adopted by more than 50 companies as a standard or best current practice.
Creditable software reliability techniques are still urgently needed.

Historical SRE Techniques: Fault Lifecycle
Fault prevention: to avoid, by construction, fault occurrences.
Fault removal: to detect, by verification and validation, the existence of faults and eliminate them.
Fault tolerance: to provide, by redundancy and diversity, service complying with the specification in spite of manifested faults.
Fault/failure forecasting: to estimate, by statistical modeling, the presence of faults and occurrence of failures.

Fault Lifecycle Technique
Figure: fault manifestation and modeling process.
Attributes: reliability, availability, safety, security.
Techniques: fault prevention, fault removal, fault tolerance, fault/failure forecasting.

Software Reliability Modeling
$R = e^{-\lambda t}$
Figure axis: testing time
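
Reading the slide's formula as the constant-failure-rate exponential model, reliability over an interval can be computed directly. The following is a minimal sketch under that assumption; the function name `reliability` and the numbers used (0.002 failures per hour, 100 hours) are hypothetical and chosen only for illustration.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Probability of failure-free operation over execution time t,
    assuming a constant failure rate (exponential model)."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


# Hypothetical values: 0.002 failures per hour over 100 hours of execution.
r = reliability(0.002, 100.0)
print(f"R = {r:.4f}")  # prints "R = 0.8187"
```

A lower failure rate or a shorter interval moves R toward 1; a zero failure rate gives R = 1 for any interval.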

Current SRE Process Overview

Current Trends and Problems
The theoretical foundation of software reliability comes from hardware reliability techniques.
Software failures do not happen independently.
Software failures seldom repeat in exactly the same or predictable pattern.
Failure mode and effect analysis (FMEA) for software is still controversial and incomplete.
There is currently a need for a creditable end-to-end software reliability paradigm that can be directly linked to reliability prediction from the very beginning.

Future Direction 1: Reliability-Centric Software Architectures
The product view: achieve a failure-resilient software architecture
 - Fault prevention
 - Fault tolerance
The process view: explore component-based software engineering
 - Component identification, construction, protection, integration and interaction
 - Reliability modeling based on software structure

Future Direction 2: Design for Reliability Achievement
Fault confinement
Fault detection
Diagnosis
Reconfiguration
Recovery
Restart
Repair
Reintegration

Figure: stages from fault confinement through reintegration; the figure also carries the labels Failover, Online, and Offline.

1. Fault confinement. This stage limits the spread of fault effects to one area of the Web service, preventing contamination of other areas. Fault confinement can be achieved through fault detection within the Web services, consistency checks, and multiple requests/confirmations.
2. Fault detection. This stage recognizes that something unexpected has occurred in the Web services. Fault latency is the period of time between the occurrence of a fault and its detection. Techniques fall into two classes: off-line and on-line. With off-line techniques, such as diagnostic programs, the service cannot perform useful work while under test. On-line techniques, such as duplication, provide real-time detection that runs concurrently with useful work.
3. Diagnosis. This stage is necessary if the fault detection technique does not provide information about the failure location and/or properties.
4. Reconfiguration. This stage occurs when a fault is detected and a permanent failure is located. Web services can be composed of different components, and individual components may fail while the service is being provided. The system may reconfigure its components either to replace the failed component or to isolate it from the rest of the system.
5. Recovery. This stage uses techniques to eliminate the effects of faults. The basic recovery approaches are fault masking, retry, and rollback. Fault-masking techniques hide the effects of failures by allowing redundant information to outweigh the incorrect information. Web services can be replicated or implemented as different versions (N-version programming, NVP). Retry makes a second attempt at an operation, on the premise that many faults are transient in nature. Because Web services are provided over a network, retry is practical: requests and replies may be affected by the state of the network. Rollback relies on the Web service operation having been backed up (checkpointed) at some point in its processing prior to fault detection; operation recommences from that point. Fault latency is important here because the rollback must go back far enough to avoid the effects of undetected errors that occurred before the detected error.
6. Restart. This stage occurs after the recovery of undamaged information.
 - Hot restart: all operations resume from the point of fault detection; this is possible only if no damage has occurred.
 - Warm restart: only some of the processes can be resumed without loss.
 - Cold restart: a complete reload of the system, with no processes surviving.
 Web services can be restarted by rebooting the server.
7. Repair. At this stage, a failed component is replaced. Repair can be off-line or on-line. Web services can be component-based and consist of other Web services. In off-line repair, either the Web service continues if the failed component/sub-Web service is not necessary for operation, or the Web service must be brought down to perform the repair. In on-line repair, the component/sub-Web service may be replaced immediately with a backup spare, or operation may continue without the component. With on-line repair, Web service operation is not interrupted.
8. Reintegration. In this stage the repaired module must be reintegrated into the Web service. For on-line repair, reintegration must be performed without interrupting Web service operation.
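
Two of the recovery approaches named in stage 5, retry and fault masking, can be sketched in a few lines. This is an illustrative sketch only, not the presenter's implementation: the names `retry`, `masked_result`, and `flaky` are hypothetical, replicas are modeled as plain zero-argument callables rather than real Web service calls, and fault masking is assumed to mean a strict-majority vote over replica results.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3) -> T:
    """Re-run an operation, on the premise that many faults are transient."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error


def masked_result(replicas: Sequence[Callable[[], T]]) -> T:
    """Fault masking: call every replica and return the strict-majority answer."""
    results = []
    for replica in replicas:
        try:
            results.append(replica())
        except Exception:
            continue  # a crashed replica simply casts no vote
    if not results:
        raise RuntimeError("all replicas failed")
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise RuntimeError("no majority among replicas")
    return value


# Hypothetical transient fault: the first two calls fail, the third succeeds.
flaky_calls = {"n": 0}


def flaky() -> str:
    flaky_calls["n"] += 1
    if flaky_calls["n"] < 3:
        raise ConnectionError("transient network fault")
    return "ok"


print(retry(flaky))                                         # prints "ok"
print(masked_result([lambda: 42, lambda: 42, lambda: 41]))  # prints 42
```

The majority vote requires hashable results; a real service would more likely compare normalized response payloads. Rollback is not covered here because it depends on how the service checkpoints its state.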

Future Direction 3: Testing for Reliability Assessment
Establish the link between software testing and reliability
Study the effect of code coverage on fault coverage
Evaluate the impact of various testing metrics on reliability
Assess competing testing schemes quantitatively

Positive vs. negative evidence for coverage-based software testing
Source: Finding

Positive
Frankl (1988), Horgan (1994), Weyuker (1988): High code coverage brings high software reliability and a low failure rate.
Chen (1992): A correlation between code coverage and software reliability is observed.
Wong (1994): The correlation between test effectiveness and block coverage is higher than that between test effectiveness and the size of the test set.
Frate (1995): An increase in reliability comes with an increase in at least one code coverage measure.
Cai (2005): Code coverage contributes to a noticeable amount of fault coverage.

Negative
Briand (2000): The testing result on published data did not support a causal dependency between code coverage and defect coverage.

RSDIMU test cases description
This is the description of the test set, containing the detailed testing purpose of each test case.
The test cases can be classified as functional testing (1-800) and random testing ( ).
They can also be classified into six regions according to their different patterns.

The correlation: various test regions
Linear regression relationship between block coverage and fault coverage in the whole test set
Goodness of linear fit in the various test case regions
Overall the fit is moderate; it is highest in region IV and lowest in region VI.

The correlation: normal operational testing vs. exceptional testing
Testing profile (size)        R-square
Whole test set (1200)         0.781
Normal testing (827)          0.045
Exceptional testing (373)     0.944

Normal operational testing: very weak correlation
Exceptional testing: strong correlation
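
For readers who want to see how an R-square value like those above is obtained, the sketch below computes the coefficient of determination for a simple least-squares line. It is not the analysis behind the slide: the function name `r_squared` is hypothetical, and both data sets are made-up numbers chosen only to contrast a wide coverage range with a narrow one.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Coefficient of determination for a least-squares line y = a + b*x.
    For simple linear regression this equals the squared correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("need variation in both variables")
    return (sxy * sxy) / (sxx * syy)


# Made-up (block coverage, fault coverage) pairs, NOT the RSDIMU data.
wide_cov = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
wide_fault = [0.12, 0.22, 0.45, 0.50, 0.74, 0.80]
narrow_cov = [0.48, 0.49, 0.50, 0.51, 0.52]
narrow_fault = [0.60, 0.42, 0.55, 0.47, 0.58]

print("wide coverage range:  ", round(r_squared(wide_cov, wide_fault), 3))
print("narrow coverage range:", round(r_squared(narrow_cov, narrow_fault), 3))
```

When coverage varies over only a few percentage points, a linear fit has little to work with and R-square tends to be low, even if coverage and fault detection are related over a wider range.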

The correlation: normal operational testing vs. exceptional testing
Normal testing: small code coverage range (48%-52%), covering the main control flow and data flow.
Exceptional testing: two main clusters. In some cases part of the large-scale computational functions is executed while the rest is skipped; in other cases all of this computational code is skipped.

The Spectrum in Software Testing and Reliability
Time-based models                     Coverage testing
 - user oriented                       - tester oriented
 - more physical meaning               - less physical meaning
 - abundant models                     - lack of models
 - easy data collection                - hard data collection
 - less relevance to testing           - more relevance to testing
Software reliability growth models    Coverage-based analysis

A new model is needed to combine execution time and testing coverage.

A New Coverage-Based Reliability Model
λ(t,c): joint failure intensity function
λ1(t): failure intensity function with respect to time
λ2(c): failure intensity function with respect to coverage
α1, γ1, α2, γ2: parameters, with the constraint α1 + α2 = 1
Equation annotations: dependency factors, integral, derivative
Since λ1(t) is the failure intensity function with respect to time, any existing distribution from well-known reliability models can be used, e.g., the NHPP, Weibull, S-shaped, or logarithmic Poisson models.
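
As one concrete candidate for the time component λ1(t), the sketch below uses the Goel-Okumoto NHPP model. That choice is an assumption made for illustration; the slide only says that an NHPP model may be used. The parameter values `a` and `b` are hypothetical. The sketch covers the time component only and does not attempt the joint model λ(t,c).

```python
import math


def mean_failures(t: float, a: float, b: float) -> float:
    """Goel-Okumoto mean value function m(t) = a * (1 - exp(-b*t)):
    the expected number of failures observed by time t."""
    return a * (1.0 - math.exp(-b * t))


def failure_intensity(t: float, a: float, b: float) -> float:
    """Failure intensity lambda(t) = dm/dt = a * b * exp(-b*t)."""
    return a * b * math.exp(-b * t)


# Hypothetical parameters: a = expected total failures, b = detection rate.
a, b = 100.0, 0.05
t = 10.0

# The intensity is the derivative of the mean value function, which a
# central finite difference confirms numerically.
h = 1e-6
numeric_derivative = (mean_failures(t + h, a, b) - mean_failures(t - h, a, b)) / (2 * h)
print(failure_intensity(t, a, b), numeric_derivative)
```

The intensity function and the mean value function are related by differentiation and integration, which is the same relationship the slide's "integral" and "derivative" annotations point to.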

Estimation Accuracy NHPP model

Future Direction 4: Metrics for Reliability Prediction
New models (e.g., BBN) to explore rich software metrics
Data mining approaches
Machine learning techniques
Bridging the gap of the one-way function: feedback to building reliable software
Continuous industrial data collection efforts: demonstration of cost-effectiveness

Future Direction 5: Reliability for Emerging Software Applications
“The Internet changes everything”
On-demand customizable software
Service-oriented architecture, composition, integration
Customization by middleware: from metadata to metacode
A common infrastructure delivers reliability to all customers

A Paradigm for Reliable Web Service
Figure components: Replication Manager, Web service selection algorithm, WatchDog, UDDI Registry, WSDL, Web Service, IIS, Application, Database, Client, Port

1. Create Web services
2. Select primary Web service (PWS)
3. Register
4. Look up
5. Get WSDL
6. Invoke Web service
7. Keep checking the availability of the PWS
8. If the PWS fails, reselect the PWS
9. Update the WSDL
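
The selection and monitoring steps of this paradigm (select the PWS, keep checking its availability, reselect on failure) can be sketched as follows. This is an assumption-laden illustration, not the system from the slide: the class `ReplicationManager`, its methods, and the replica names `ws-a` and `ws-b` are hypothetical, availability probes are plain callables, "first available replica" stands in for the unspecified Web service selection algorithm, and the UDDI registration and WSDL update steps are not modeled.

```python
from typing import Callable, Dict, Optional


class ReplicationManager:
    """Tracks replicas and keeps one of them selected as the primary (PWS)."""

    def __init__(self, health_checks: Dict[str, Callable[[], bool]]):
        # Maps a replica name to a zero-argument availability probe.
        self.health_checks = health_checks
        self.primary: Optional[str] = None

    def select_primary(self) -> str:
        """Pick the first replica whose probe reports it as available."""
        for name, is_up in self.health_checks.items():
            if is_up():
                self.primary = name
                return name
        self.primary = None
        raise RuntimeError("no replica is available")

    def watchdog_tick(self) -> str:
        """One watchdog pass: keep the primary if it is up, else reselect."""
        if self.primary is not None and self.health_checks[self.primary]():
            return self.primary
        return self.select_primary()


# Simulated replica status; a real probe would call the service endpoint.
status = {"ws-a": True, "ws-b": True}
manager = ReplicationManager({name: (lambda n=name: status[n]) for name in status})

print(manager.select_primary())  # prints "ws-a"
status["ws-a"] = False           # simulate failure of the primary
print(manager.watchdog_tick())   # prints "ws-b"
```

In a deployed system the watchdog pass would run on a timer, and a successful reselection would be followed by updating the published WSDL so that clients reach the new primary.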

Conclusions
Software reliability is receiving higher attention as it becomes an important economic consideration for businesses.
New SRE paradigms need to consider software architectures, testing techniques, data analyses, and creditable reliability modeling procedures.
Domain-specific approaches to emerging software applications are worthy of investigation.
There is still a long way to go, but the directions are clear.
