Developing Dependable Systems CIS 376 Bruce R. Maxim UM-Dearborn.


Software Dependability
Customers expect all software to be dependable.
They may accept some system failures in non-critical applications.
Applications having high dependability requirements require special programming techniques.

Achieving Dependability
Fault avoidance
  – software is developed to minimize the impact of human error
  – the development process is organized so that faults in the software are detected and repaired before customer delivery
Fault tolerance
  – software is designed so that faults in delivered software do not cause system failure

Fault Minimization
Current SE methods can produce fault-free software.
Fault-free software merely conforms to its specification; it may still behave incorrectly, since the specification itself may be flawed.
Producing fault-free software is very expensive and may only be justified in exceptional situations.
It may be cheaper to accept some software faults.

Developing Fault-Free Software
Needs a precise (preferably formal) specification.
Requires an organizational commitment to quality.
Information hiding and encapsulation in software design are essential.
A programming language with strict type checking and run-time checking should be used.
Needs a dependable and repeatable development process.

Error Prone Constructs - part 1
Floating-point numbers
  – inherently imprecise, so comparisons frequently give wrong results
Pointers
  – dangling references and aliases are possible
Dynamic memory allocation
  – memory overflow and garbage (unreclaimed memory) problems
Parallelism
  – race conditions and deadlocks are possible
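
The floating-point comparison problem can be illustrated with a minimal sketch. Python is used here only for brevity; the helper name `nearly_equal` and the tolerance value are illustrative assumptions, not part of the slides.

```python
import math


def nearly_equal(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Compare two floats within a relative tolerance instead of with ==."""
    return math.isclose(a, b, rel_tol=rel_tol)


total = 0.1 + 0.2                          # neither operand is exactly representable in binary
exact_match = (total == 0.3)               # False on IEEE 754 doubles
tolerant_match = nearly_equal(total, 0.3)  # True
```

A suitable tolerance depends on the application, so the default above should be treated as a placeholder.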

Error Prone Constructs - part 2
Recursion
  – memory overflow when errors occur
Interrupts
  – errors are difficult to trace
Inheritance
  – code is no longer localized; unexpected results can arise when changes are made
Note: You can use these constructs as needed, but you must be careful to use them correctly.

Information Hiding
Information should only be available to program components on a need-to-know basis
  – reduces the probability of accidental corruption of information
  – information is encapsulated to prevent error propagation to the rest of the program
  – since information is localized, the programmer is less likely to make errors and reviewers are more likely to find them

Reliable Software Processes
Having a well-defined, repeatable software process will reduce the number of software faults.
A well-defined, repeatable process is one that does not depend entirely on individual skills but can be carried out by a team.
Significant verification and validation activities must be included in the process to minimize the number of software faults.

Process Validation Activities
Requirements inspections
Requirements management
Model checking
Design inspections
Code inspections
Static code analysis
Test planning and management
Configuration management

Fault Tolerance
Required in critical applications (high reliability needed and high failure costs).
The system can continue operating despite software failure.
Even a system that seems to be fault-free must also be fault tolerant, in case specification errors exist or the validation is incorrect.

Fault Tolerant Actions
Fault detection
  – system determines that an incorrect system state has occurred
Damage assessment
  – determine which system parts are affected by the fault
Fault recovery
  – system must restore its state to a known safe state
Fault repair
  – for a non-transitory fault, the system is modified to prevent repetition

Approaches
Defensive programming
  – programmers assume faults exist in system code
  – redundant code is written to check the system state for consistency after modifications are made
Fault tolerant architectures
  – HW and SW architectures that support redundancy are used
  – a fault tolerance controller detects problems and supports recovery
Both approaches are important.
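
The defensive-programming idea of redundant consistency checks after each modification might look like the sketch below. The `Inventory` class and its invariant (no negative quantities) are hypothetical and exist only for illustration.

```python
class Inventory:
    """Hypothetical stock record that re-checks its invariant after every change."""

    def __init__(self) -> None:
        self._stock = {}

    def _check_consistent(self) -> None:
        # Redundant check: no item may ever have a negative quantity.
        for item, qty in self._stock.items():
            if qty < 0:
                raise RuntimeError(f"inconsistent state: {item} has quantity {qty}")

    def add(self, item: str, qty: int) -> None:
        self._stock[item] = self._stock.get(item, 0) + qty
        self._check_consistent()

    def remove(self, item: str, qty: int) -> None:
        self._stock[item] = self._stock.get(item, 0) - qty
        self._check_consistent()

    def quantity(self, item: str) -> int:
        return self._stock.get(item, 0)
```

Note that the check runs after the change has been made, so it only detects the inconsistency; restoring a safe state would be a separate recovery step.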

Exception Management
An exception could be a program error or an event like a power failure.
Exception handling facilities in programming languages allow exceptions to be handled without constant checking to detect them.
Using normal control constructs to detect exceptions in a sequence of procedure calls adds considerable timing overhead to a program.
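
As a rough illustration of handling an exception once rather than checking a status after every call, consider the sketch below. The `SensorError` class and the three functions are hypothetical names introduced for this example.

```python
class SensorError(Exception):
    """Hypothetical exception raised when a reading cannot be obtained."""


def read_sensor(raw: str) -> float:
    try:
        return float(raw)
    except ValueError as exc:
        raise SensorError(f"bad reading: {raw!r}") from exc


def to_celsius(raw: str) -> float:
    # No status check is needed here: an exception simply propagates upward.
    return (read_sensor(raw) - 32.0) * 5.0 / 9.0


def report(raw: str) -> str:
    # Handle the exception once, at the level that knows what to do about it.
    try:
        return f"{to_celsius(raw):.1f} C"
    except SensorError:
        return "sensor fault"
```

The intermediate function `to_celsius` contains no error-checking code at all, which is the saving the slide refers to.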

Fault Detection
Languages with strict type checking allow many errors to be trapped during program compilation.
Some types of errors can only be caught at run-time (e.g. cin >> I; cin >> A[I]; where the value read into I may lie outside the bounds of A).
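
The hazard in the fragment above, an externally supplied value used as an array index, can only be checked while the program runs. A minimal Python analogue of such a check is sketched here; the function name `read_element` is an assumption made for illustration.

```python
def read_element(values: list, index_text: str):
    """Return values[index] only after validating the externally supplied index."""
    try:
        index = int(index_text)
    except ValueError:
        raise ValueError(f"not an integer: {index_text!r}") from None
    # Reject negative values explicitly: Python would otherwise count from the end.
    if not 0 <= index < len(values):
        raise IndexError(f"index {index} outside 0..{len(values) - 1}")
    return values[index]
```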

Fault Detection Approaches
Preventative fault detection
  – fault detection mechanism is activated before a state change is committed
  – if an erroneous state is detected, the change is cancelled
Retrospective fault detection
  – fault detection mechanism is initiated after the system state change has been made
  – used when a sequence of individually correct actions can lead to an erroneous system state, or when preventative fault detection has too much overhead
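
The two approaches can be contrasted in a small sketch. The `Account` class, its non-negative balance rule, and the `audit` function are hypothetical and chosen only to keep the example short.

```python
class Account:
    """Hypothetical account whose balance must never be negative."""

    def __init__(self, balance: int = 0) -> None:
        self.balance = balance

    def withdraw_preventative(self, amount: int) -> bool:
        # Check the proposed state BEFORE committing; cancel if it is erroneous.
        proposed = self.balance - amount
        if proposed < 0:
            return False          # change cancelled, state untouched
        self.balance = proposed   # commit
        return True

    def withdraw_unchecked(self, amount: int) -> None:
        # Commits immediately; any error must be found later by an audit.
        self.balance -= amount


def audit(account: Account) -> bool:
    """Retrospective check run after state changes: True if the state is valid."""
    return account.balance >= 0
```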

Type System Extension
Preventative fault detection really involves extending the current type system by including additional constraints as part of the type definition.
These constraints are typically implemented by defining basic operations within a class definition.
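
One way to picture a constraint built into a type is a class whose basic operations refuse any value that violates it. The `PositiveEvenInteger` class below is an illustrative assumption, not code from the slides.

```python
class PositiveEvenInteger:
    """Hypothetical constrained type: holds only positive even integers."""

    def __init__(self, value: int) -> None:
        self._value = self._validated(value)

    @staticmethod
    def _validated(value: int) -> int:
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError("value must be an int")
        if value <= 0 or value % 2 != 0:
            raise ValueError(f"{value} is not a positive even integer")
        return value

    @property
    def value(self) -> int:
        return self._value

    def assign(self, value: int) -> None:
        # Validate first, so a rejected assignment leaves the old value intact.
        self._value = self._validated(value)

    def __add__(self, other: "PositiveEvenInteger") -> "PositiveEvenInteger":
        # Routing the sum through the constructor re-checks the constraint.
        return PositiveEvenInteger(self._value + other._value)
```

Because every change goes through `_validated`, an erroneous value is rejected before the state is committed, which is the preventative style of detection.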

Damage Assessment
The system state is analyzed to judge the extent of corruption caused by a system failure.
Must determine which parts of the state space have been affected by the failure.
Generally based on “validity functions”, which are applied to the state elements to assess whether their values are within an allowed range.
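
Validity functions of this kind could be sketched as follows. The state element names and their allowed ranges are invented for illustration.

```python
def in_range(low: float, high: float):
    """Build a validity function that accepts values in [low, high]."""
    return lambda value: low <= value <= high


# Hypothetical state elements and their allowed ranges.
VALIDITY = {
    "temperature_c": in_range(-40.0, 125.0),
    "pressure_kpa": in_range(0.0, 500.0),
    "valve_percent": in_range(0.0, 100.0),
}


def assess_damage(state: dict) -> list:
    """Return the names of state elements that are missing or out of range."""
    damaged = []
    for name, is_valid in VALIDITY.items():
        if name not in state or not is_valid(state[name]):
            damaged.append(name)
    return damaged
```

The returned list identifies which parts of the state need recovery; an empty list means no damage was detected by these checks.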

Damage Assessment Techniques
Checksums are used to check for data transmission errors.
Redundant pointers can be used to check the integrity of data structures.
Watchdog timers can help check for non-terminating processes (e.g. after a long time with no response, assume the worst).
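
Two of these techniques are sketched below: a CRC-32 checksum appended to a message, and a simple watchdog. The function and class names are assumptions; the watchdog takes the current time as an argument so the example stays deterministic, whereas a real one would read a monotonic clock.

```python
import zlib


def attach_checksum(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(frame: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the stored one."""
    if len(frame) < 4:
        return False
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


class Watchdog:
    """Hypothetical watchdog: the monitored task must call kick() regularly."""

    def __init__(self, timeout_s: float, now: float) -> None:
        self.timeout_s = timeout_s
        self.last_kick = now

    def kick(self, now: float) -> None:
        self.last_kick = now

    def expired(self, now: float) -> bool:
        # No kick within the timeout: assume the task is hung.
        return now - self.last_kick > self.timeout_s
```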

Fault Recovery
Forward recovery
  – apply repairs to the corrupted system state
  – usually application specific; requires domain knowledge
  – e.g. redundant error coding (such as checksums or error-correcting codes) added to data
Backward recovery
  – restore the system to a known safe state
  – simpler, since an archived safe state is used to replace the erroneous state
  – e.g. use of checkpoints in a word processor
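
Backward recovery with checkpoints, as in the word-processor example, might be sketched like this. The `Document` class is hypothetical, and the sketch covers backward recovery only.

```python
import copy


class Document:
    """Hypothetical editor buffer with checkpoint-based backward recovery."""

    def __init__(self) -> None:
        self.lines = []
        self._checkpoint = []

    def checkpoint(self) -> None:
        # Archive a deep copy so later edits cannot alter the saved state.
        self._checkpoint = copy.deepcopy(self.lines)

    def restore(self) -> None:
        # Backward recovery: discard the current state, reinstate the archive.
        self.lines = copy.deepcopy(self._checkpoint)
```

Any work done since the last checkpoint is lost on restore, which is the usual price of the simpler backward approach.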

Fault Tolerant Architecture
Defensive programming cannot cope with faults caused by HW and SW interactions.
If requirements are not understood, then SW checks are not likely to be correct.
Systems with high availability requirements often require fault tolerant architectures.
Must tolerate both HW and SW failure.

Hardware Fault Tolerance
Triple-modular redundancy (TMR)
Three replicated components are included in the system.
If one component produces a different output from the other two, that component is assumed to have failed.
This idea is based on the notion that most failures result from component failures, not design faults.
Component failures should be low-probability events.
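
The comparison step in TMR amounts to a majority vote over three outputs. A minimal software sketch of such a voter follows; the function name `tmr_vote` and its return convention are assumptions, and the outputs are assumed to be hashable values.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return (majority_output, index_of_disagreeing_unit or None).

    Raises RuntimeError when all three outputs differ, since no majority exists.
    """
    outputs = [a, b, c]
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        # Exactly one unit disagrees with the majority; report its position.
        faulty = next(i for i, out in enumerate(outputs) if out != value)
        return value, faulty
    raise RuntimeError("no majority: all three units disagree")
```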

Software Fault Tolerance
TMR is based on two assumptions
  – HW components do not include common design flaws
  – simultaneous component failures are not likely
Neither assumption is valid for software components
  – it isn’t possible to replicate SW components without replicating their design flaws
  – simultaneous failure of identical copies is therefore inevitable
Software systems must be diverse.

Design Diversity
Different versions of the system are designed and implemented in different ways (so they should have different failure modes).
Different approaches to design
  – object-oriented and function-oriented
  – different implementation languages
  – different algorithms in the implementation
  – different tools or environments

Software Analogies to TMR
N-version programming
  – the same specification is implemented in a number of different versions by several teams
  – all versions compute simultaneously; the majority output is presumed correct
Recovery blocks
  – a number of explicitly distinct versions of a program are written for the same specification and executed in sequence
  – an acceptance test is used to select the output to keep
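
A recovery block can be sketched as a loop that runs alternates in sequence and keeps the first result that passes the acceptance test. In the example below the task (sorting), the deliberately faulty primary version, and all function names are assumptions chosen for illustration.

```python
from collections import Counter


def acceptable(original: list, result: list) -> bool:
    """Acceptance test: result is ordered and is a permutation of the input."""
    ordered = all(x <= y for x, y in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


def primary_sort(data: list) -> list:
    # Deliberately faulty version for illustration: it drops duplicates.
    return sorted(set(data))


def alternate_sort(data: list) -> list:
    # Independent alternate: a plain insertion sort.
    out = []
    for item in data:
        pos = len(out)
        while pos > 0 and out[pos - 1] > item:
            pos -= 1
        out.insert(pos, item)
    return out


def recovery_block(data: list, versions=(primary_sort, alternate_sort)) -> list:
    """Run versions in sequence and keep the first output that passes the test."""
    for version in versions:
        try:
            result = version(list(data))   # each version gets a fresh copy
        except Exception:
            continue                       # a crash counts as a failed version
        if acceptable(data, result):
            return result
    raise RuntimeError("all versions failed the acceptance test")
```

N-version programming differs in that all versions run and a voter compares their outputs, rather than one acceptance test judging each output in turn.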

Problems with Design Diversity
Teams tend to tackle the same problems in the same ways, so the resulting implementations may not be diverse.
Characteristic errors
  – different teams are likely to make the same mistakes, since some parts of the implementation are more difficult than others
  – specification errors may cause the same errors to appear in all implementations (an argument for developing multiple specifications)

Is software redundancy needed?
Unlike HW faults, SW faults are not an inevitable consequence of the real world.
Some people believe that a higher level of reliability can be achieved by reducing software complexity instead.
The existence of fault-tolerance controllers increases program complexity considerably and adds sources of errors that affect reliability.