Henrik Bærbak Christensen, CS, AU: Fault Tolerant Architectures. Lyu Chapter 14. Sommerville Chapter 20. Part II.

Presentation transcript:


Slide 2: Application domains
Fault tolerant systems are used in various domains, primarily for safety-critical systems
- First documented example is for railway systems (1978)
- Nuclear power plants
- Airplanes (Airbus)
- Space program

Slide 3: Experience
Somewhat mixed, actually… Why?
- Redundancy works for hardware because hardware often fails randomly
  - Due to wearing out (component failure, not design failure)
- Software fails due to specification and design errors
  - Thus simple replication does not provide protection…
  - Review the Ariane failure reported by Tan
- Redundant software units require diversity
  - However, there is evidence of failure correlations even across diverse implementations…

Slide 4: The origins: Hardware
Triple modular redundancy (TMR)
- Three identical hardware units process the input
- The outputs are compared for equality
  - A deviating output is ignored
  - A fault manager may try to repair the unit, or reconfigure to take it out of service
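The TMR comparison step can be illustrated in software. The following is only a sketch: the function name `tmr_vote` and the choice to report deviating units by index are assumptions made for illustration, not part of the slides.

```python
from collections import Counter

def tmr_vote(outputs):
    """Compare the outputs of three units (hypothetical helper).

    Returns the value at least two units agree on, plus the indices of
    deviating units so a fault manager could repair or retire them.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two units agree")
    deviating = [i for i, out in enumerate(outputs) if out != value]
    return value, deviating
```

For example, `tmr_vote([5, 5, 7])` yields the agreed value 5 and flags unit 2 as deviating.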

Slide 5: Terminology for Software FT
- Principal requirement
  - Redundancy of functionally equivalent but different software units
  - Different teams, tools, processes, …
- Oracle
  - A method to dynamically determine whether output is correct or within acceptable limits
- Recovery
  - A defect leads to an error state, which leads to failure if not handled
  - A detected error state results in recovery being initiated

Slide 6: Terminology for Software FT
Recovery
- Backward recovery
  - Recovery points are stored during normal execution
  - The system is rolled back (restored) to a previous recovery point and restarted from there
- Forward recovery
  - Transition into a degraded mode: a state which is functional but of lower quality
  - Or: error compensation, in which algorithms derive the correct answer
Exercise:
- Give examples of each type
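Backward recovery can be sketched as follows. The class `CheckpointedSystem`, the function `run_step`, and the dictionary-based state are all hypothetical names chosen for this illustration; a real system would persist recovery points rather than keep them in memory.

```python
import copy

class CheckpointedSystem:
    """Holds mutable state plus recovery points (illustrative only)."""

    def __init__(self, state):
        self.state = state
        self._recovery_points = []

    def checkpoint(self):
        # Store a recovery point during normal execution.
        self._recovery_points.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the most recent recovery point.
        if not self._recovery_points:
            raise RuntimeError("no recovery point stored")
        self.state = copy.deepcopy(self._recovery_points[-1])

def run_step(system, step, is_error_state):
    """Run one step; roll back if an error state is detected."""
    system.checkpoint()
    step(system.state)
    if is_error_state(system.state):
        system.rollback()
        return False
    return True
```

A step that drives the state into an error (say, a negative balance) is undone, and the system continues from the stored recovery point.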

Slide 7: Oracles
Result verification / dynamic self-checking
Acceptance test
- Internal acceptance test
  - Tests for correctness, or whether the answer is within limits or bounds
  - Requires that testing correctness is easier than calculating the result, like |sqrt(x)*sqrt(x) - x| < E
Examples
- Checksums, used to acceptance-test datagram contents
- Data structure validation methods
- Hardware self-tests
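The square-root acceptance test from the slide can be written down directly. This is a minimal sketch: the function name `accept_sqrt` and the default tolerance are assumptions, and an absolute tolerance is used as on the slide (a relative tolerance may suit large inputs better).

```python
def accept_sqrt(x, result, eps=1e-9):
    """Acceptance test for a square-root routine.

    Checking is cheaper than computing: square the candidate result
    and compare it with the input, within tolerance eps.
    """
    return result >= 0 and abs(result * result - x) < eps
```

The oracle accepts `accept_sqrt(9.0, 3.0)` and rejects `accept_sqrt(2.0, 1.5)` without ever computing a square root itself.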

Slide 8: Oracles
Result verification
External consistency:
- Uses additional knowledge outside of the unit producing the results
Examples
- Watchdogs (heartbeat in Bass) use timing to detect and resolve problems
- Exceptions: for instance floating point errors
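A watchdog is an external oracle based on timing alone. The sketch below assumes a hypothetical `Watchdog` class that is handed timestamps explicitly, which keeps it deterministic; in practice the caller might pass `time.monotonic()` values.

```python
class Watchdog:
    """Detects a silent unit from missing heartbeats (illustrative)."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_beat = now  # treat start-up as the first heartbeat

    def heartbeat(self, now):
        # Called by the monitored unit to signal that it is alive.
        self.last_beat = now

    def expired(self, now):
        # True if no heartbeat arrived within the timeout window.
        return now - self.last_beat > self.timeout
```

The watchdog knows nothing about the unit's results; it only uses timing knowledge from outside the unit, which is what makes it an external consistency check.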

Slide 9: Diversity
The rationale behind diversity: modules fail on disjoint subsets of the input space, so for any given input at least one module will process it correctly!
[Figure: Program 1 and Program 2 execution states; labels: input space, I_e, error states]

Slide 10: Redundancy
Requires a software unit to judge the acceptability of the redundant modules' outputs: the adjudicator
- As it is a software unit, it may itself contain defects
- Techniques
  - Voting
  - Median value
  - Acceptance testing
  - And more…

Slide 11: Failure classes
In unit testing, failures occur because of defects in the software unit.
- A test case either fails or passes
A redundant system (= N functionally equivalent but different units) introduces more types/classes of failures
- k-fold coincident failures: k out of N units fail on the same test case
  - U1 says 7, U2 says 13, but the answer is 42.
- Identical-and-wrong (IAW) answers
  - U1 and U2 both say 7, but the answer is 42
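The two failure classes can be told apart mechanically when the correct answer is known, as in a test setting. The helper `classify_failures` below is a hypothetical illustration, not something taken from Lyu or the slides.

```python
from collections import Counter

def classify_failures(outputs, expected):
    """Return (k, iaw_values) for one test case.

    k is the number of units that failed (k-fold coincident failure);
    iaw_values lists wrong answers given by two or more units
    (identical-and-wrong answers).
    """
    wrong = [out for out in outputs if out != expected]
    k = len(wrong)
    iaw_values = [value for value, count in Counter(wrong).items() if count >= 2]
    return k, iaw_values
```

With the slide's numbers, outputs 7 and 13 against the answer 42 give a 2-fold coincident failure with no IAW answer, whereas two units both saying 7 give a 2-fold failure that is also IAW.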

Slide 12: Failure classes
Correlated/dependent failures
- P(U1 fails | U2 fails) ≠ P(U1 fails)
- That is, failures are correlated if the probability that U1 fails on a test case, given that we know U2 fails on that test case, differs from the probability that U1 fails when we do not know whether U2 fails on it.
- Think of the case where U1 is identical to U2. If we know that U2 fails, then we know for certain that U1 fails: P(U1 fails | U2 fails) = 1. But if we do not know whether U2 will fail, we only know the distribution, which might be that the probability of U1 failing is 0.1%.
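The inequality can be checked empirically from per-test-case results. This is a sketch under the assumption that each unit's outcome is recorded as a list of booleans (True = failed) over the same test cases; the function name is hypothetical.

```python
def failure_probabilities(u1_fails, u2_fails):
    """Estimate P(U1 fails) and P(U1 fails | U2 fails) from test results.

    Both arguments are equally long lists of booleans, one entry per
    test case. The conditional estimate is None if U2 never fails.
    """
    n = len(u1_fails)
    p_u1 = sum(u1_fails) / n
    both = sum(1 for a, b in zip(u1_fails, u2_fails) if a and b)
    n_u2 = sum(u2_fails)
    p_u1_given_u2 = both / n_u2 if n_u2 else None
    return p_u1, p_u1_given_u2
```

For two identical units that fail on 1 of 1000 test cases, the estimates are 0.001 and 1.0, matching the identical-units argument; the large gap between the two numbers is what signals correlated failures.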

Slide 13: Failure classes

Slide 14: Adjudication Techniques

Slide 15: Voting
Majority voting
- m = number of matching outputs required
- m = ceil[(N+1)/2]
- Usually N = 3, which means m = ?
Two-out-of-N voting
- Actually m = 2 is enough regardless of N (usually)
- Note: agreement ≠ correctness
- Best argument I have: Hitler was democratically elected
Median voting
- Sort the answers and select the middle element
- Used in aerospace …
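Majority and median voting are short enough to sketch. The function names are assumptions; outputs are compared for exact equality, which suits discrete results, while median voting is the usual choice for numeric results that differ slightly between versions.

```python
import math
from collections import Counter

def required_matches(n):
    """Number of matching outputs needed for a majority among n units."""
    return math.ceil((n + 1) / 2)

def majority_vote(outputs):
    """Return the majority value, or None if no value reaches m matches."""
    m = required_matches(len(outputs))
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= m else None

def median_vote(outputs):
    """Sort the answers and select the middle element.

    For an even number of answers this picks the upper middle one.
    """
    ordered = sorted(outputs)
    return ordered[len(ordered) // 2]
```

Note that both voters return whatever the units agree on: if two of three units give the same wrong answer, `majority_vote` returns it, which is the point behind "agreement ≠ correctness".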

Slide 16: Voting
Consensus voting

Slide 17: Redundancy Techniques

Slide 18: Recovery Blocks
Failed acceptance test
- Often roll-back / recovery of system state
- Single processors suffer from sequential processing
  - A ‘core dumped’ in the first unit is bad…
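The recovery block scheme can be sketched as sequential alternates guarded by an acceptance test. The function `recovery_block`, the dictionary state, and the decision to treat an exception like a failed acceptance test are assumptions for illustration; note that a hard crash of the whole process (the ‘core dumped’ case) cannot be caught this way.

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Try alternates in order until one passes the acceptance test.

    state is a dict that alternates may modify; it is rolled back to
    the checkpoint after every failed attempt.
    """
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # an exception counts as a failed acceptance test
        # Roll back before the next alternate runs.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternates failed the acceptance test")
```

Because the alternates run one after another on a single processor, the time to a result grows with every failed attempt, which is the sequential-processing drawback the slide mentions.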

Slide 19: N-version Programming
Executed in parallel
- Voting is used to select the proper answer
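A rough sketch of N-version execution follows. It assumes the versions are plain Python callables and uses a thread pool for concurrency (threads in CPython interleave rather than run truly in parallel); the name `n_version_execute` and the majority adjudicator are illustrative choices.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def n_version_execute(versions, x):
    """Run all versions on input x and vote on the outputs.

    Returns the value produced by a majority of the N versions, or
    None if no majority exists. A version that raises casts no vote.
    """
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(version, x) for version in versions]
        outputs = []
        for future in futures:
            try:
                outputs.append(future.result())
            except Exception:
                pass  # crashed version: no output to vote on
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(versions) / 2 else None
```

Unlike recovery blocks, no version waits for another to fail first, at the cost of running every version on every input.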

Slide 20: Variants
Lyu discusses various variants and combinations. One I find interesting is acceptance voting
- N versions execute in parallel, and the answers are subjected to acceptance testing.
- Only accepted answers are then fed to the voter
- The voter must be dynamic, as the number of inputs to the voter, Ni <= N, varies according to the number of accepted outputs.
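Acceptance voting combines the two adjudicators. In the sketch below, the function name and the use of a majority rule over the accepted outputs are assumptions; the slides only require that the voter adapt to the number of accepted answers, Ni.

```python
import math
from collections import Counter

def acceptance_vote(outputs, acceptance_test):
    """Filter outputs by the acceptance test, then vote dynamically.

    The majority threshold is computed from Ni, the number of accepted
    outputs, not from N. Returns None if nothing is accepted or no
    accepted value reaches the threshold.
    """
    accepted = [out for out in outputs if acceptance_test(out)]
    if not accepted:
        return None
    needed = math.ceil((len(accepted) + 1) / 2)
    value, count = Counter(accepted).most_common(1)[0]
    return value if count >= needed else None
```

With four versions returning 4, 4, -1 and 100 and an acceptance range of 0 to 10, only the two 4s reach the voter, so Ni = 2 and both must agree.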