Principles of Engineering System Design Dr T Asokan

Implementing Fault Tolerance in Physical Architecture Development

Case Study: Aircraft Crash, Iowa

United Airlines Flight 232, a three-engine aircraft, crashed on 19 July 1989 while making an emergency landing after losing one of its three engines. Of the 296 people on board, 112 died and 184 survived. Three redundant hydraulic systems, each powered by a different engine, were available for aircraft stabilisation. All three hydraulic systems converged near the tail, at the location where the fan disk ripped out, which made that location a single point of failure for all of them.

Error Detection Functions
Failure: a deviation in behavior between the system and its requirements.
Error: a subset of the system state that may lead to system failure.
Fault: a defect in the system that can cause an error.
Fault tolerance is the ability of a system to tolerate faults and continue performing.
Fault tolerance can be achieved only for those errors that are observed.
Functions associated with fault tolerance are:
- Error detection
- Damage confinement
- Error recovery
- Fault isolation and reporting

Error detection means defining the possible errors (deviations of a subset of the system's state from the desired state) during the design phase, before they occur, and establishing a set of functions that check for the occurrence of each error.
- Examples: type checks, range checks, timing checks
Damage confinement means protecting the system from the possible spread of a failure to other parts of the system.
- Example: firewalls
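The type, range, and timing checks named above can be sketched in a few lines of Python. This is only an illustration: the function names, the temperature limits, and the deadline values are assumptions made for the example and do not come from the slides.

```python
import time


def type_check(value, expected_type):
    """Type check: does the value have the type the design expects?"""
    return isinstance(value, expected_type)


def range_check(value, low, high):
    """Range check: does the value lie inside the allowed interval?"""
    return low <= value <= high


def timing_check(func, deadline_s):
    """Timing check: run func and report whether it met its deadline."""
    start = time.monotonic()
    result = func()
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_s


def detect_errors(reading, low=-40.0, high=125.0):
    """Return the list of errors found in a (hypothetical) sensor reading."""
    if not type_check(reading, float):
        return ["type"]
    if not range_check(reading, low, high):
        return ["range"]
    return []
```

Each check corresponds to one error that was defined at design time; a reading that passes all of them is not guaranteed to be correct, only free of the errors that were anticipated.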

Error recovery attempts to correct the error after it has been detected and its extent has been defined.
- Backward recovery, forward recovery
Fault isolation and reporting attempts to determine where in the system the fault that generated the error occurred.
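As a rough illustration of backward recovery, the sketch below keeps a checkpoint of the last known-good state and rolls back to it when a validity check fails. It assumes the common reading of the term (return to an earlier good state, as opposed to forward recovery, which moves on to a new good state); the class and function names are hypothetical.

```python
import copy


class CheckpointedState:
    """Holds a working state plus a copy of the last known-good state."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def commit(self):
        # Accept the current state as the new known-good checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: discard the current state, restore the checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


def apply_update(cs, update, is_valid):
    """Apply update in place; keep it only if the resulting state is valid."""
    update(cs.state)
    if is_valid(cs.state):
        cs.commit()
        return True
    cs.rollback()
    return False
```

The validity check plays the role of error detection; recovery is only as good as the errors that check can observe.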

Redundancy to Achieve Fault Tolerance
A primary source of high availability and fault tolerance is redundancy: hardware, software, information, and time.
Hardware redundancy uses extra hardware to enable the detection of errors and to provide additional operational hardware components after errors have occurred.
Hardware redundancy can be implemented in passive, active, and hybrid forms.

Passive hardware redundancy masks or hides the occurrence of errors rather than detecting them.
Recovery is achieved by having extra hardware available when needed.
The most common implementation is Triple Modular Redundancy (TMR), which relies on a majority voting scheme to mask an error in one of the three hardware units.
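The majority voting at the heart of TMR can be illustrated with a small software voter. This is a minimal sketch, not the deck's implementation: `majority_vote` is a hypothetical name, and the module outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(a, b, c):
    """Return the value that at least two of the three modules agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    # Two or more modules are wrong: the error can no longer be masked.
    raise ValueError("no majority: all three module outputs disagree")
```

A single faulty module is masked silently, which is why passive redundancy hides errors rather than detecting them. Two simultaneous faults defeat the scheme.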

Triplicated TMR

Software implementation of voting for TTMR
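The figure for this slide is not part of the transcript. As a stand-in, the sketch below shows one way software voting for a triplicated TMR stage might look: three modules each produce an output, and three voters each see all three outputs, so that a single voter is not a single point of failure. All names are hypothetical, and in a real system each voter would run on separate hardware rather than as three calls to the same function.

```python
def vote(values):
    """Majority vote over exactly three comparable values."""
    a, b, c = values
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority among the three values")


def triplicated_stage(modules, inputs):
    """Run three modules on their own inputs, then vote three times.

    Returns three voted outputs, one per voter, which can feed the
    three modules of the next stage.
    """
    outputs = [module(x) for module, x in zip(modules, inputs)]
    return [vote(outputs) for _ in range(3)]
```

Chaining stages is then a matter of passing the three voted outputs of one stage as the three inputs of the next.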

Active hardware redundancy
Active hardware redundancy attempts to perform all four functions: detect errors, confine damage, recover from errors, and isolate and report faults.
- Hardware duplication with comparison
- Hot standby sparing
- Cold standby sparing
- Pair-and-a-spare

Hardware duplication with comparison
Hardware duplication with comparison is the basic building block for active redundancy.
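Duplication with comparison can be sketched as two modules computing the same result and a comparator checking that they agree. The names below are hypothetical. Note what the scheme can and cannot do: a mismatch shows that an error has occurred, but with only two copies it does not show which copy is wrong.

```python
def duplicate_and_compare(module_a, module_b, x):
    """Run both modules on x and return (output, error_detected).

    When the outputs differ, no output can be trusted, so None is returned
    together with the error flag.
    """
    out_a = module_a(x)
    out_b = module_b(x)
    if out_a == out_b:
        return out_a, False
    return None, True
```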

Hot standby sparing and cold standby sparing
These are the most common approaches to hardware redundancy.
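The difference between the two can be sketched under the usual assumption that a hot spare runs in parallel with the primary and keeps its state current, while a cold spare stays idle until it is switched in. The `Unit` and `StandbySystem` classes are hypothetical, and the accumulating unit is only a toy workload.

```python
class Unit:
    """A hypothetical processing unit that accumulates its inputs."""

    def __init__(self):
        self.total = 0
        self.failed = False

    def process(self, x):
        if self.failed:
            raise RuntimeError("unit failed")
        self.total += x
        return self.total


class StandbySystem:
    """One active unit plus one spare, used as a hot or a cold standby."""

    def __init__(self, hot):
        self.hot = hot
        self.active = Unit()
        self.spare = Unit()
        self.switched = False

    def process(self, x):
        if self.hot and not self.switched:
            self.spare.process(x)  # hot spare shadows every input
        try:
            return self.active.process(x)
        except RuntimeError:
            if self.switched:
                raise  # the spare has failed too; nothing left to switch to
            self.active, self.switched = self.spare, True
            if self.hot:
                return self.active.total  # spare has already processed x
            return self.active.process(x)  # cold spare starts from scratch
```

In this toy model the hot spare takes over with the accumulated state intact, whereas the cold spare begins from its initial state. The trade-off is that the hot spare is powered and working the whole time.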

Pair and a spare active redundancy

Hybrid Hardware Redundancy
Hybrid redundancy combines N-modular redundancy with spares, or TMR with duplication with comparison.
Critical computation systems normally use active or hybrid redundancy.
- Active redundancy reduces the life of the system.
- Hybrid redundancy is the costliest.

Software redundancy
- N-version programming
- Capability checks: periodic hardware tasks with known answers
- Consistency checks: compare the output of a component with known characteristics
Information redundancy: achieved by adding extra bits of information to enable error detection
- Helps to catch system-induced errors
- Parity checks
Time redundancy
- Standby systems
- Error detection
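A parity check is the simplest form of information redundancy: one extra bit is stored with the data so that a single flipped bit becomes detectable. The sketch below uses even parity; the function names are illustrative only.

```python
def add_even_parity(bits):
    """Append one parity bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True if the word (data bits plus parity bit) has even parity."""
    return sum(word) % 2 == 0
```

A single parity bit detects any odd number of flipped bits but misses an even number of flips, and it cannot locate or correct the error.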

Design Flexibility
The mark of a long-lived system is that it has been upgraded successfully many times.
A system should have an adaptable platform for such upgrades (e.g. the Windows NT operating system).
Engineering systems should be designed to be changeable in the future.
Four aspects of changeability are:
- Flexibility
- Agility
- Robustness
- Adaptability

Flexibility represents the property of a system to be changed easily.
- Changes from outside have to be incorporated to cope with changing environments.
- Computers with various interface ports can interface with many external systems.
- Flexibility is important for future upgrades.
Agility characterizes a system's ability to be changed rapidly.
- Race cars are designed to be agile, so that they can be modified easily to suit different tracks.
Robustness represents a system's ability to be insensitive to changing environments.
- An all-terrain vehicle such as a Jeep is robust enough to run on different terrains.
Adaptability characterizes a system's ability to adapt itself to changing environments.
- No changes from outside have to be incorporated to cope with changing environments.
- Some intelligent software and operating systems are designed to learn and adapt to different users.

Summary
- Development of physical architecture from functional architecture
- Generic and instantiated architecture
- Morphological box
- Fault tolerance in physical architecture
- Redundancy for fault tolerance: hardware, software, information, time
- Passive and active redundancy
- Hot standby, cold standby, pair-and-a-spare
- Design flexibility

QUIZ DATE: 7th October 2010
Part I: Open book (group)
Part II: Closed book (individual)
Open-book question(s) can be collected on 22nd September; the answer sheet is to be submitted on 27th September.
Presentations on 28th and 29th September.
T Asokan, ED309