1 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Critical Systems Development IS301.

1 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Critical Systems Development IS301 – Software Engineering Lecture #27 – 2004-11-03 M. E. Kabay, PhD, CISSP Assoc. Prof. Information Assurance Division of Business & Management, Norwich University mailto:mkabay@norwich.edumailto:mkabay@norwich.edu V: 802.479.7937

3 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Objectives To explain how fault tolerance and fault avoidance contribute to the development of dependable systems To describe characteristics of dependable software processes To introduce programming techniques for fault avoidance To describe fault tolerance mechanisms and their use of diversity and redundancy

4 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Topics covered Dependable processes Dependable programming Fault tolerance Fault tolerant architectures

5 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Dependable Software Development Programming techniques for building dependable software systems.

6 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Software Dependability In general, software customers expect all software to be dependable For non-critical applications, may be willing to accept some system failures Some applications have very high dependability requirements Special programming techniques reqd

7 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Dependability Achievement Fault avoidance Software developed so Human error avoided and System faults minimized Development process organized so Faults in software detected and Repaired before delivery to customer Fault tolerance Software designed so Faults in delivered software do not result in system failure

8 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Diversity and Redundancy Redundancy Keep more than 1 version of a critical component available so that if one fails then a backup is available. Diversity Provide the same functionality in different ways so that they will not fail in the same way. However, adding diversity and redundancy adds complexity and this can increase the chances of error. Some engineers advocate simplicity and extensive verification & validation (V&V) as a more effective route to software dependability.

9 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Diversity and Redundancy Examples Redundancy Where availability is critical (e.g. in e- commerce systems), companies normally keep backup servers and switch to these automatically if failure occurs. Diversity. To provide resilience against external attacks, different servers may be implemented using different operating systems (e.g. Windows and Linux)

10 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault Minimization Current methods of software engineering now allow for production of fault-free software Fault-free software means it conforms to its specification Does NOT mean software which will always perform correctly Why not? Because of specificatio n errors.

11 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Cost of Producing Fault-Free Software (1) Very high Cost-effective only in exceptional situations Which? May be cheaper to accept software faults But who will bear costs? Users? Manufacturers? Both? Will the risk-sharing be with full knowledge?

12 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Cost of Producing Fault-Free Software (2) The Pareto Principle Costs Total % of Errors Fixed 20% 80% 100% If curve really is asymptotic to 100%, cost may approach

13 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Cost of Producing Fault-Free Software (3) Many Few Very few Number of residual errors Cost per error detected Just a different way of looking at it.

14 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Dependable Processes To ensure a minimal number of software faults, it is important to have a well-defined, repeatable software process. A well-defined repeatable process is one that does not depend entirely on individual skills; rather can be enacted by different people. For fault detection, it is clear that the process activities should include significant effort devoted to verification and validation.

16 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Validation activities Requirements inspections. Requirements management. Model checking. Design and code inspection. Static analysis. Test planning and management. Configuration management, discussed in Chapter 29, is also essential.

17 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Dependable programming Use programming constructs and techniques that contribute to fault avoidance and fault tolerance Design for simplicity; Protect information from unauthorized access; Minimize the use of unsafe programming constructs.

18 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Information Hiding Information should only be exposed to those parts of program which need to access it. Create objects or abstract data types which maintain state and operations on state Reduces faults: Less accidental corruption of information Firewalls make problems less likely to spread to other parts of program All information localized: Programmer less likely to make errors Reviewers more likely to find errors

19 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Safe Programming Faults in programs are usually a consequence of programmers making mistakes. These mistakes occur because people lose track of the relationships among program variables. Some programming constructs are more error-prone than others so avoiding their use reduces programmer mistakes.

20 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault-Free Software Development Needs precise (preferably formal) specification Requires organizational commitment to quality Information hiding and encapsulation in software design essential Use programming language with strict typing and run-time checking Avoid error-prone constructs Use dependable and repeatable development process

21 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Structured Programming First discussed in 1970's Programming without goto While loops and if statements as only control statements Top-down design Important because it promoted thought and discussion about programming Programs easier to read and understand than old spaghetti code

22 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Error-Prone Constructs (1) Floating-point numbers Inherently imprecise – and machine- dependent Imprecision may lead to invalid comparisons Pointers Pointers referring to wrong memory as can corrupt data Aliasing can make programs difficult to understand and change Dynamic memory allocation Run-time allocation can cause memory overflow

23 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Error-Prone Constructs (2) Parallelism Can result in subtle timing errors (race conditions) because of unforeseen interaction between parallel processes Recursion Errors in recursion can cause memory overflow Interrupts Interrupts can cause critical operation to be terminated and make program difficult to understand Similar to goto statements

24 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Error-Prone Constructs (3) Inheritance Code not localized Can result in unexpected behavior when changes made Can be hard to understand Difficult to debug problems All of these constructs dont have to be absolutely eliminated But must be used with great care

25 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Reliable Software Processes Well-defined, repeatable software process: Reduces software faults Does not depend entirely on individual skills – can be enacted by different people Process activities should include significant verification and validation

26 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Process Validation Activities Requirements inspections Requirements management Model checking Design and code inspection Static analysis Test planning and management Configuration management also essential

27 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault Tolerance Critical software systems must be fault tolerant System can continue operating in spite of software failure Fault tolerance required in High availability requirements or System failure costs very high Even fault-free systems need fault tolerance May be specification errors or Validation may be incorrect

28 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault Tolerance Actions Fault detection Incorrect system state has occurred Damage assessment Identify parts of system state affected by fault Fault recovery Return to known safe state Fault repair Prevent recurrence of fault Identify underlying problem If not transient*, then fix errors of design, implementation, documentation or training that led to error E.g., hardwar e failure *

29 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Approaches to Fault Tolerance Defensive programming Programmers assume faults in code Check state after modifications to ensure consistency Fault-tolerant architectures HW & SW system architectures support redundancy and fault tolerance Controller detects problems and supports fault recovery Complementary rather than opposing techniques

30 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault Detection (1) Strictly-typed languages E.g., Java and Ada Many errors trapped at compile-time Some classes of error can only be discovered at run-time Fault detection: Detecting erroneous system state Throwing exception To manage detected fault

31 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault Detection (2) Preventative fault detection Check conditions before making changes If bad state detected, dont make change Retrospective fault detection Check validity after system state has been changed Used when Incorrect sequence of correct actions leads to erroneous state or When preventative fault detection involves too much overhead

32 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Damage Assessment Analyze system state Judge extent of corruption caused by system failure Assess what parts of state space have been affected by failure Generally based on validity functions Can be applied to state elements Assess if their value within allowed range

33 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Damage Assessment Techniques Checksums Used for damage assessment in data transmission Verify integrity after transmission Redundant pointers Check integrity of data structures E.g., databases Watch-dog timers Check for non-terminating processes If no response after certain time, theres a problem

34 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault Recovery Forward recovery Apply repairs to corrupted system state Domain knowledge required to compute possible state corrections Forward recovery usually application specific Backward recovery Restore system state to known safe state Simpler than forward recovery Details of safe state maintained and replaces corrupted system state

35 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Forward Recovery Data communications Add redundancy to coded data Use to repair data corrupted during transmission Redundant pointers E.g., doubly-linked lists Damaged list / file may be repaired if enough links are still valid Often used for database and file system repair

36 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Backward Recovery Transaction processing often uses conservative methods to avoid problems Complete computations, then apply changes Keep original data in buffers Periodic checkpoints allow system to 'roll- back' to correct state

37 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault Tolerant Architecture Defensive programming cannot cope with faults involve interactions between hardware and software Misunderstandings of requirements may mean checks and associated code incorrect Where systems have high availability requirements, specific architecture designed to support fault tolerance may be required. Must tolerate both hardware and software failure

38 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Hardware Fault Tolerance Depends on triple-modular redundancy (TMR) 3 replicated identical components Receive same input Outputs compared If 1 output different, is discarded Component failure assumed Assumes Most faults result from component failures Few design faults Low probability of simultaneous component failures

40 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Fault Tolerant Software Architectures Assumptions of TMR for HW not true for software SW design flaws more common than in HW Cannot replicate same component in software: Would have common design faults Simultaneous component failure therefore virtually inevitable Software systems must therefore be diverse

41 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Software Analogies to TMR (1) N-version programming Same specification implemented In number of different versions By different teams All versions compute simultaneously Majority output selected Using voting system Most commonly-used approach E.g. in Airbus 320 control systems

43 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. N-version Programming (2) Different system versions Designed and implemented by different teams Assume low probability of same mistakes Algorithms used should be different Problem: may not be different enough Some empirical evidence (research) Teams commonly misinterpret specifications in same way Choose same algorithms for their systems

44 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Software Analogies to TMR (2) Recovery blocks Explicitly different versions of same specification written and executed in sequence Acceptance test used to select output to be transmitted

46 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Recovery Blocks (2) Force different algorithm to be used for each version so they reduce probability of common errors However, design of acceptance test difficult as it must be independent of computation used Problems with approach for real-time systems because of sequential operation of redundant versions

47 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Problems with Design Diversity Teams not culturally diverse so they tend to tackle problems in same way Characteristic errors Different teams make same mistakes Some parts of implementation more difficult than others so all teams tend to make mistakes in same place. Specification errors If error in specification, then reflected in all implementations Can be addressed to some extent by using multiple specification representations

48 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Specification Dependency Both approaches to software redundancy susceptible to specification errors. If specification incorrect, system could fail Also problem with hardware but software specifications usually more complex than hardware specifications and harder to validate Has been addressed in some cases by developing separate software specifications from same user specification

49 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Software Redundancy Needed? Software faults not inevitable (unlike hardware faults = inevitable consequence of physical world) (WHY?) Reducing software complexity may improve reliability and availability (WHY?) Redundant software much more complex Scope for range of additional errors Affect system reliability Ironically, caused by existence of fault- tolerance controllers

50 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Key Points Dependability in system can be achieved through fault avoidance and fault tolerance Some programming language constructs such as gotos, recursion and pointers inherently error-prone Data typing allows many potential faults to be trapped at compile time.

51 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Key Points Fault tolerant architectures rely on replicated hard and software components include mechanisms to detect faulty component and to switch it out of system N-version programming and recovery blocks two different approaches to designing fault- tolerant software architectures Design diversity essential for software redundancy

52 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Homework Study Chapter 18 in detail using SQ3R Required By next Wed 10 Nov 2004 For 30 points 20.1, 20.2, 20.4 – 20.6, 20.9 (@5) and pay attention to demands for examples OPTIONAL By Wed 17 Nov 2004 For up to 14 extra points, any or all of 20.10 (@3), 20.11 (@3) – details please 20.12 (@8) – detailed answers to all parts of this question

1 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Critical Systems Development IS301.

Similar presentations

Presentation on theme: "1 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Critical Systems Development IS301."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Critical Systems Development IS301.

Similar presentations

Presentation on theme: "1 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved. Critical Systems Development IS301."— Presentation transcript:

Similar presentations

About project

Feedback