Fault Tolerance & Reliability CDA 5140 Spring 2006

Chapter 1 Overview & Definitions

Topics
basic concepts of Fault Tolerance (FT)
reliability & availability of systems, both hardware & software
tools to compare & contrast FT designs

What is FT?
Computing in presence of errors
Some techniques date from analog systems of the 1940s
Digital technology adds to these to be faster, better & cheaper
Investigate architecture keeping in mind tradeoff of cost, weight & volume
Becoming more important as digital systems become more & more prevalent

Why Have FT?
Needed more in 21st century since:
Harsher environments
Many novice users
Increasing repair costs
Larger systems
Digital systems more prevalent
More users dependent on digital systems, from business to government to home to school

How is FT Obtained?
Add redundancy in form of:
Hardware, e.g. RAID
Software, e.g. 2 algorithms for same task
Information, e.g. coding theory
Time, e.g. on Internet if fault, then new route
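The software redundancy item above ("2 algorithms for same task") can be sketched roughly as follows. This is a minimal illustration only: the task (summing 1..n) and the function names are hypothetical and not from the course material. With just two versions a disagreement can be detected but not corrected.

```python
def sum_iterative(n: int) -> int:
    """First algorithm: add the integers 1..n one at a time."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_closed_form(n: int) -> int:
    """Second, independently written algorithm: closed form n(n+1)/2."""
    return n * (n + 1) // 2


def redundant_sum(n: int) -> int:
    """Run both versions and compare; a mismatch signals a fault."""
    a = sum_iterative(n)
    b = sum_closed_form(n)
    if a != b:
        # Two versions only allow detection, not correction.
        raise RuntimeError(f"versions disagree: {a} != {b}")
    return a


if __name__ == "__main__":
    print(redundant_sum(10))
```

Correcting the fault, rather than only detecting it, would need a third version or some other way to decide which result to trust.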

Definitions & Terminology
Failure - departure from correct operation
Fault - flaw in hardware or software resulting in failure, e.g. physical problems, design flaws or defects in hardware; design or implementation flaws in software
Error - incorrect response from module, leading to system failure if no FT
Fault type - hardware or software
Fault cause - improper design, hardware failure, external disturbance

Definitions continued
Permanent fault - always present, needs repair to remove
Intermittent fault - not always present but still needs repair to remove
Transient fault - will disappear without repair
Fault latency - fault can go undetected & does not cause error
Fault-avoidance - use of high quality components & careful design to avoid faults
Fault-tolerance - use of redundancy (hardware, software, information or time) to keep system operation correct after fault occurs

Graceful degradation - system still performs correctly after faults, but with degraded performance
Fail-safe - system can fail but only to safe state to avoid catastrophes
Reliability - probability of not failing within time t, given operating correctly at time 0
Availability - probability system is operating correctly at time t
Maintainability - probability that system can be restored to operation by time t, given not operational at time 0

Mean-time-to-failure (MTTF) - expected value of system failure time
Mean-time-to-repair (MTTR) - expected value of system repair time
Mean-time-between-failure (MTBF) - expected time between successive system failures, MTBF = MTTF + MTTR
Fault detection - method used to detect presence of fault
Fault confinement - technique to confine damage of fault to as small an area as possible
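The MTTF, MTTR and MTBF definitions above can be tied together with a small calculation. This sketch assumes the long-run view in which average uptime per failure cycle is MTTF and average downtime is MTTR, so availability is MTTF/(MTTF + MTTR). The numbers are hypothetical and chosen only for illustration.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: one full up/down cycle."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up (assumed model)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical system: runs 999 hours on average, repair takes 1 hour.
    print(mtbf(999.0, 1.0))
    print(steady_state_availability(999.0, 1.0))
```

Note that availability can be raised either by increasing MTTF or by shortening MTTR, which is why the two measures are tracked separately.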

Fault diagnosis - automatic identification of faulty modules
Recovery - system put into operating state, possibly degraded
Hardware redundancy - extra hardware to detect, mask or diagnose faults
Passive hardware redundancy - fault masking to hide faults & prevent faults from resulting in errors; no action by system
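One common way to realise fault masking is a majority voter over replicated modules. The slides do not name a specific scheme at this point, so the three-module setup below is an assumed example, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the modules.

    A faulty minority is masked: callers see only the agreed value.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        # No strict majority, so the fault cannot be masked.
        raise ValueError("no majority among module outputs")
    return value


if __name__ == "__main__":
    # Three replicated modules; the third one is faulty (hypothetical values).
    print(majority_vote([7, 7, 3]))
```

The voter takes no corrective action and does not identify the faulty module, which matches the "no action by system" property of passive redundancy.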

Information redundancy - use of coding theory techniques (addition of bits)
Software redundancy - use of diagnostic software or extra modules, each with distinct algorithm
Temporal redundancy - repeating bus cycles or whole programs, new route on Internet
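As a small illustration of information redundancy by "addition of bits", a single even-parity bit is about the simplest code available. It is used here only as an assumed example, and the function names are hypothetical. One parity bit detects any single-bit error but cannot locate or correct it.

```python
from typing import List


def add_even_parity(bits: List[int]) -> List[int]:
    """Append one bit so the total number of 1s in the word is even."""
    parity = sum(bits) % 2
    return bits + [parity]


def parity_ok(word: List[int]) -> bool:
    """True if the word (data plus parity bit) still has an even number of 1s."""
    return sum(word) % 2 == 0


if __name__ == "__main__":
    word = add_even_parity([1, 0, 1, 1])
    print(word, parity_ok(word))

    corrupted = word.copy()
    corrupted[2] ^= 1  # flip one bit to simulate a fault
    print(corrupted, parity_ok(corrupted))
```

Two flipped bits would cancel out and go unnoticed, which is the usual motivation for stronger codes.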

Microelectronic Growth
Density of chips dramatically increased & concomitantly, use of digital systems
Obvious need for FT in space shuttle, nuclear power plants, but with increased use in homes, more faults likely so will need FT there too
Interesting observations:
1999 typical home had microprocessors
2004 expected to be 280

Reliability & Availability
Goal: high reliability & availability based on sound analysis & not conjecture!
Use both reliability & availability as measures

Air Traffic Control Example
ATC fails once/year, so MTTF = 8766 hours
Airline Reservation System (ARS) down 5 times/year, so MTTF = 1753 hours
Availability (A) = uptime/(uptime + downtime)
ATC down 1 hour, so A = 8765/8766 = 0.999886
ARS down for 1 minute, 5 times, or 0.0833 hours, so A = 8765.9167/8766 = 0.9999905

Air Traffic Control Example cont’d
Unavailability U = 1 - A
So, comparing the two systems for U: U(ATC)/U(ARS) = (1/8766)/(0.0833/8766) = 12
The ARS is 12 times better than the ATC in terms of unavailability.
Homework 1: 1.13, 1.14, 1.17 (3 examples)
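The availability comparison in the air traffic control example can be checked numerically. The sketch below uses only the figures stated in the example (an 8766-hour year, 1 hour of ATC downtime, five 1-minute ARS outages). The function and variable names are illustrative.

```python
HOURS_PER_YEAR = 8766.0  # 365.25 days * 24 hours, as used in the example


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """A = uptime / (uptime + downtime) over one observation period."""
    uptime = period_hours - downtime_hours
    return uptime / period_hours


atc_a = availability(1.0)           # ATC: down 1 hour per year
ars_a = availability(5 * 1 / 60.0)  # ARS: down 1 minute, 5 times per year

# Unavailability U = 1 - A; compare the two systems by the ratio of U values.
ratio = (1.0 - atc_a) / (1.0 - ars_a)

if __name__ == "__main__":
    print(f"ATC availability: {atc_a:.6f}")
    print(f"ARS availability: {ars_a:.7f}")
    print(f"U(ATC) / U(ARS): {ratio:.2f}")
```

Although the ARS fails five times as often as the ATC, its much shorter outages give it the lower unavailability.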
