CprE 458/558: Real-Time Systems

Slides:



Advertisements
Similar presentations
Computer Systems & Architecture Lesson 2 4. Achieving Qualities.
Advertisements

Principles of Engineering System Design Dr T Asokan
CprE 458/558: Real-Time Systems (G. Manimaran)1 CprE 458/558: Real-Time Systems Fault-Tolerant Scheduling Techniques.
Fault-Tolerant Systems Design Part 1.
COE 444 – Internetwork Design & Management Dr. Marwan Abu-Amara Computer Engineering Department King Fahd University of Petroleum and Minerals.
3. Hardware Redundancy Reliable System Design 2010 by: Amir M. Rahmani.
Making Services Fault Tolerant
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Dependability TSW 10 Anders P. Ravn Aalborg University November 2009.
Software Fault Tolerance – The big Picture RTS April 2008 Anders P. Ravn Aalborg University.
7. Fault Tolerance Through Dynamic (or Standby) Redundancy The lowest-cost fault-tolerance technique in multiprocessors. Steps performed: When a fault.
© Burns and Welling, 2001 Characteristics of a RTS n Large and complex n Concurrent control of separate system components n Facilities to interact with.
Fault Tolerance: Basic Mechanisms mMIC-SFT September 2003 Anders P. Ravn Aalborg University.
8. Fault Tolerance in Software 8.1 Introduction Is it true that a program that has once performed a given task as specified will continue to do so? Yes,
1 Chapter Fault Tolerant Design of Digital Systems.
DS -V - FDT - 1 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Zuverlässige Systeme für Web und E-Business (Dependable Systems for Web and E-Business)
2. Introduction to Redundancy Techniques Redundancy Implies the use of hardware, software, information, or time beyond what is needed for normal system.
Modified from Sommerville’s originals Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.
Dependability ITV Real-Time Systems Anders P. Ravn Aalborg University February 2006.
7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.
Developing Dependable Systems CIS 376 Bruce R. Maxim UM-Dearborn.
Chapter 2: Reliability and Fault Tolerance
1 Making Services Fault Tolerant Pat Chan, Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Miroslaw Malek.
Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.
Design of SCS Architecture, Control and Fault Handling.
What Exactly are the Techniques of Software Verification and Validation A Storehouse of Vast Knowledge on Software Testing.
Software Dependability CIS 376 Bruce R. Maxim UM-Dearborn.
Software faults & reliability Presented by: Presented by: Pooja Jain Pooja Jain.
1 Fault-Tolerant Computing Systems #2 Hardware Fault Tolerance Pattara Leelaprute Computer Engineering Department Kasetsart University
University of Palestine software engineering department Testing of Software Systems Fundamentals of testing instructor: Tasneem Darwish.
Critical systems development. Objectives l To explain how fault tolerance and fault avoidance contribute to the development of dependable systems l To.
CS, AUHenrik Bærbak Christensen1 Fault Tolerant Architectures Lyu Chapter 14 Sommerville Chapter 20 Part II.
Part.1.1 In The Name of GOD Welcome to Babol (Nooshirvani) University of Technology Electrical & Computer Engineering Department.
Fault-Tolerant Systems Design Part 1.
Building Dependable Distributed Systems Chapter 1 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
SENG521 (Fall SENG 521 Software Reliability & Testing Fault Tolerant Software Systems: Techniques (Part 4b) Department of Electrical.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development 3.
Adaptive control and process systems. Design and methods and control strategies 1.
Fault Tolerance Mechanisms ITV Model-based Analysis and Design of Embedded Software Techniques and methods for Critical Software Anders P. Ravn Aalborg.
Quality Assurance.
ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2004 Daniel J. Sorin Duke University.
5 May CmpE 516 Fault Tolerant Scheduling in Multiprocessor Systems Betül Demiröz.
Safety-Critical Systems 7 Summary T V - Lifecycle model System Acceptance System Integration & Test Module Integration & Test Requirements Analysis.
FTC (DS) - V - TT - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Vorlesung 5 FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM.
Fault-Tolerant Systems Design Part 1.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development.
Chapter 11 Fault Tolerance. Topics Introduction Process Resilience Reliable Group Communication Recovery.
Middleware for Fault Tolerant Applications Lihua Xu and Sheng Liu Jun, 05, 2003.
CSE 8377 Software Fault Tolerance. CSE 8377 Motivation Software is becoming central to many life- critical systems Software is created by error-prone.
A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.
Fault Tolerance
Testing Overview Software Reliability Techniques Testing Concepts CEN 4010 Class 24 – 11/17.
SENG521 (Fall SENG 521 Software Reliability & Testing Fault Tolerant Software Systems: Techniques (Part 4a) Department of Electrical.
Week#2 Software Quality Assurance Software Quality Engineering.
18/05/2006 Fault Tolerant Computing Based on Diversity by Seda Demirağ
Week#3 Software Quality Engineering.
Software Quality Assurance
ECE 753: FAULT-TOLERANT COMPUTING
Chapter 2: Reliability and Fault Tolerance
Fault Tolerance & Reliability CDA 5140 Spring 2006
Verification and Testing
Fault Tolerance In Operating System
Critical systems development
Multi-version approach (with error detection and recovery)
Software Verification and Validation
Software Verification and Validation
Hardware Assisted Fault Tolerance Using Reconfigurable Logic
Software Verification and Validation
Seminar on Enterprise Software
Presentation transcript:

CprE 458/558: Real-Time Systems Lecture 17 Fault-tolerant design techniques

Fault Tolerant Strategies Fault tolerance in computer system is achieved through redundancy in hardware, software, information, and/or computations. Such redundancy can be implemented in static, dynamic, or hybrid configurations. Fault tolerance can be achieved by many techniques: Fault masking is any process that prevents faults in a system from introducing errors. Example: Error correcting memories and majority voting. Reconfiguration is the process of eliminating faulty component from a system and restoring the system to some operational state.

Reconfiguration Approach Fault detection is the process of recognizing that a fault has occurred. Fault detection is often required before any recovery procedure can be initiated. Fault location is the process of determining where a fault has occurred so that an appropriate recovery can be initiated. Fault containment is the process of isolating a fault and preventing the effects of that fault from propagating throughout the system. Fault recovery is the process of remaining operational or regaining operational status via reconfiguration even in the presence of faults.

The Concept of Redundancy Redundancy is simply the addition of information, resources, or time beyond what is needed for normal system operation. Hardware redundancy is the addition of extra hardware, usually for the purpose either detecting or tolerating faults. Software redundancy is the addition of extra software, beyond what is needed to perform a given function, to detect and possibly tolerate faults. Information redundancy is the addition of extra information beyond that required to implement a given function; for example, error detection codes.

The Concept of Redundancy (Cont’d) Time redundancy uses additional time to perform the functions of a system such that fault detection and often fault tolerance can be achieved. Transient faults are tolerated by this. The use of redundancy can provide additional capabilities within a system. But, redundancy can have very important impact on a system's performance, size, weight, power consumption, and reliability.

Hardware Redundancy Passive techniques use the concept of fault masking. These techniques are designed to achieve fault tolerance without requiring any action on the part of the system. Relies on voting mechanisms. Active techniques achieve fault tolerance by detecting the existence of faults and performing some action to remove the faulty hardware from the system. That is, active techniques use fault detection, fault location, and fault recovery in an attempt to achieve fault tolerance.

Hardware Redundancy (Cont’d) Hybrid techniques combine the attractive features of both the passive and active approaches. Fault masking is used in hybrid systems to prevent erroneous results from being generated. Fault detection, location, and recovery are also used to improve fault tolerance by removing faulty hardware and replacing it with spares.

Hardware Redundancy - A Taxonomy

Triple Modular Redundancy (TMR)

Software Redundancy - to Detect Software Faults There are two popular approaches: N-Version Programming (NVP) and Recovery Blocks (RB). NVP is a forward recovery scheme - it masks faults. NVP: multiple versions of the same task is executed concurrently. NVP relies on voting. RB is a backward error recovery scheme. RB: the versions of a task are executed serially. RB relies on acceptance test.

N-Version Programming (NVP) NVP is based on the principle of design diversity, that is coding a software module by different teams of programmers, to have multiple versions. Diversity can also be introduced by employing different algorithms for obtaining the same solution or by choosing different programming languages. NVP can tolerate both hardware and software faults. Correlated faults are not tolerated by the NVP. In NVP, deciding the number of versions required to ensure acceptable levels of software reliability is an important design consideration.

N-Version Programming (Cont’d)

Recovery Blocks (RB) RB uses multiple alternates (backups) to perform the same function; one module (task) is primary and the others are secondary. The primary task executes first. When the primary task completes execution, its outcome is checked by an acceptance test. If the output is not acceptable, a secondary task executes after undoing the effects of primary (i.e., rolling back to the state at which primary was invoked) until either an acceptable output is obtained or the alternates are exhausted.

Recovery Blocks (Cont’d)

Recovery Blocks (Cont’d) The acceptance tests are usually sanity checks; these consist of making sure that the output is within a certain acceptable range or that the output does not change at more than the allowed maximum rate. Selecting the range for acceptance test is crucial. If the allowed ranges are too small, the acceptance tests may label correct outputs as bad. If they are too large, the probability that incorrect outputs will be accepted is more. RB can tolerate software faults because the alternates are usually implemented with different approaches; RB is also known as Primary-Backup approach.