Fault Tolerance & Reliability
CDA 5140, Spring 2006
Chapter 1: Overview & Definitions
Topics
  Basic concepts of Fault Tolerance (FT)
  Reliability & availability of systems, both hardware & software
  Tools to compare & contrast FT designs
What is FT?
  Computing in the presence of errors
  Some techniques date from analog systems of the 1940s-1960s
  Digital technology adds to these, making FT faster, better & cheaper
  Investigate architecture keeping in mind the tradeoff of cost, weight & volume
  Becoming more important as digital systems become more & more prevalent
Why Have FT?
Needed more in the 21st century since:
  Harsher environments
  Many novice users
  Increasing repair costs
  Larger systems
  Digital systems more prevalent
  More users dependent on digital systems, from business to government to home to school
How is FT Obtained?
Add redundancy in the form of:
  Hardware, e.g. RAID
  Software, e.g. 2 algorithms for the same task
  Information, e.g. coding theory
  Time, e.g. on the Internet, taking a new route after a fault
Definitions & Terminology
  Failure - departure from correct operation
  Fault - flaw in hardware or software that can result in failure, e.g. physical problems, design flaws or defects in hardware; design or implementation flaws in software
  Error - incorrect response from a module, leading to system failure if there is no FT
  Type - hardware or software
  Cause - improper design, hardware failure, external disturbance
Definitions continued
  Permanent fault - always present, needs repair to remove
  Intermittent fault - not always present but still needs repair to remove
  Transient fault - will disappear without repair
  Fault latency - time during which a fault goes undetected & does not cause an error
  Fault-avoidance - use of high-quality components & careful design to avoid faults
  Fault-tolerance - use of redundancy (hardware, software, information or time) to maintain correct system operation after a fault occurs
Definitions continued
  Graceful degradation - after faults, system still operates correctly but at reduced performance
  Fail-safe - system can fail but only to a safe state, to avoid catastrophes
  Reliability - probability that system does not fail within time t, given it is operating correctly at time 0
  Availability - probability that system is operating correctly at time t
  Maintainability - probability that system can be restored to operation by time t, given it is not operational at time 0
Definitions continued
  Mean-time-to-failure (MTTF) - expected value of system failure time
  Mean-time-to-repair (MTTR) - expected value of system repair time
  Mean-time-between-failure (MTBF) - expected time between successive system failures; MTBF = MTTF + MTTR
  Fault detection - method used to detect the presence of a fault
  Fault confinement - technique to confine the damage of a fault to as small an area as possible
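The relation MTBF = MTTF + MTTR can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the 999-hour / 1-hour figures are hypothetical, and the steady-state availability formula MTTF / (MTTF + MTTR) is the standard long-run relation, assumed here rather than taken from the slide.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: MTTF + MTTR, per the definition above."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up (assumed standard relation)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical system: runs 999 hours on average, then takes 1 hour to repair.
example_mtbf = mtbf(999.0, 1.0)                               # 1000.0 hours
example_availability = steady_state_availability(999.0, 1.0)  # 0.999
```

Note that a short MTTR raises availability even when MTTF is unchanged, which is why both measures are tracked.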
Definitions continued
  Fault diagnosis - automatic identification of faulty modules
  Recovery - system put into an operating state, possibly degraded
  Hardware redundancy - extra hardware to detect, mask or diagnose faults
  Passive hardware redundancy - fault masking to hide faults & prevent faults from resulting in errors; no action by system
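Fault masking can be illustrated with a majority voter over replicated modules. The sketch below assumes three replicas computing the same value (triple modular redundancy, a common arrangement that the slides do not name); `majority_vote` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError if the list is empty or no value has a strict majority,
    i.e. when the fault cannot be masked.
    """
    if not outputs:
        raise ValueError("no module outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


# One of three modules is faulty and returns 7; the voter hides the fault.
masked_result = majority_vote([42, 7, 42])  # 42
```

The system takes no corrective action here: the faulty output is simply outvoted, which is what makes the redundancy passive.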
Definitions continued
  Information redundancy - use of coding theory techniques (addition of bits)
  Software redundancy - use of diagnostic software or extra modules, each with a distinct algorithm
  Temporal redundancy - repeating bus cycles or whole programs, or taking a new route on the Internet
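As a small instance of information redundancy, a single even-parity bit can be appended to a data word so that any one-bit error is detectable. This is only a sketch of the simplest such code, chosen for illustration; the function names are hypothetical.

```python
def add_even_parity(data_bits):
    """Append one bit so the total number of 1s in the codeword is even."""
    parity = sum(data_bits) % 2
    return list(data_bits) + [parity]


def parity_ok(codeword):
    """True if the codeword still has an even number of 1s."""
    return sum(codeword) % 2 == 0


codeword = add_even_parity([1, 0, 1, 1])  # [1, 0, 1, 1, 1]

# Simulate a single-bit error by flipping the second bit.
corrupted = list(codeword)
corrupted[1] ^= 1

intact_passes = parity_ok(codeword)      # True
error_detected = not parity_ok(corrupted)  # True
```

A single parity bit detects any odd number of bit flips but cannot detect an even number, and it cannot locate or correct the error; stronger codes add more bits.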
Microelectronic Growth
  Chip density has increased dramatically &, concomitantly, so has the use of digital systems
  Obvious need for FT in the space shuttle & nuclear power plants, but with increased use in homes, more faults are likely, so FT will be needed there too
  Interesting observations:
    1999: typical home had 40-60 microprocessors
    2004: expected to be 280
Reliability & Availability
  Goal: high reliability & availability based on sound analysis & not conjecture!
  Use both reliability & availability as measures
Air Traffic Control Example
  ATC fails once/year, so MTTF = 8766 hours
  Airline Reservation System (ARS) down 5 times/year, so MTTF = 1753 hours
  Availability (A) = uptime/(uptime + downtime)
  ATC down 1 hour, so A = 8765/(8765 + 1) = 0.999886
  ARS down for 1 minute, 5 times, or 0.083333 hours, so A = 8765.916667/8766 = 0.9999905
Air Traffic Control Example cont’d
  Unavailability U = 1 - A
  So, comparing the two systems for U: (1 - 0.999886)/(1 - 0.9999905) = 12
  The unavailability of the ARS is 12 times lower than that of the ATC, so the ARS is 12 times better by this measure.
Homework 1: 1.13, 1.14, 1.17 (3 examples)
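The ATC/ARS comparison above can be reproduced numerically. This is a minimal sketch using the example's own figures (an 8766-hour year, 1 hour of ATC downtime, five 1-minute ARS outages); the function and variable names are hypothetical.

```python
HOURS_PER_YEAR = 8766.0  # 365.25 days * 24 hours, as used in the example


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """A = uptime / (uptime + downtime) over one observation period."""
    uptime = period_hours - downtime_hours
    return uptime / (uptime + downtime_hours)


atc_availability = availability(1.0)           # about 0.999886
ars_availability = availability(5 * (1 / 60))  # about 0.9999905

# Unavailability U = 1 - A for each system.
atc_unavailability = 1 - atc_availability
ars_unavailability = 1 - ars_availability

# How many times larger the ATC's unavailability is: about 12.
unavailability_ratio = atc_unavailability / ars_unavailability
```

The two availabilities differ only in the fifth decimal place, which is why the ratio of unavailabilities is the more informative comparison.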