Presentation on theme: "1 A case study: SIFT From: D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and Evaluation) Prentice Hall, 1998. Chapter 10 – “The SIFT Case:"— Presentation transcript:
1 A case study: SIFT From: D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysis of a Fault Tolerant Computer for Aircraft Control”.
2 SIFT SIFT is a Fault-Tolerant Computer for Aircraft Control “a system capable of carrying out the calculations required for the control of an advanced commercial transport aircraft” developed for NASA as an experimental case study for fault tolerant system research The safety of the flight depends on the computer functions (controls derived from computer outputs). Reliability requirement: probability of failure less than per hour in a flight of ten hours' duration. Reliability requirement similar to that demanded for manned space- flight systems.
3 SRI International: responsable of the overall design, the software and the testing Bendix Corporation: reponsable for the design and the construction of the hardware. SIFT was delivered to NASA's Avionics Integration Research Laboratory in April 1982 ( ).
4 SIFT -Software-Implemented Fault Tolerance The SIFT effort began with broad, in-depth studies stating the reliability and processing requirements for digital computers which would control flight-critical functions. Detailed design studies were made of fault-tolerant architectures that could meet reliability and processing requirements.
5 SIFT A major objective of the SIFT design was to reduce the hardware failure rate by implementing as much of the system as possible in software (i.e., keeping the hardware component count to a minimum). This software-intensive design philosophy deliberately sacrificed performance to maximize reliability. It was implicitly assumed that failure due to software error would be eliminated by formal proof of correctness.
6 -Fully distributed configuration of Bendix BDX930 processors (up to 8 processors) -Transparent fault tolerance (replication and voting) -Assignment of tasks to processors predetermined by a task schedule table defined by the designer -As processors fail, the available hardware changes. A new task schedule must be defined for each configuration level -Processor synchronization fundamental to the correct functioning -Validity of data by task schedule and process synchronization -To accomplish reconfiguration, the SIFT OS selects appropriate schedule -The decision to reconfigure is based on error information obtained when replicated data are voted SIFT
7 Fault tolerance includes: - error detection and correction, - diagnosis, - reconfiguration, and - the prevention of a faulty unit from having an adverse effect on the system -Use of interactive consistency algorithm to eliminate certain “malicious process” failure mode. -Two level of authority: -The “local executive” contains procedures that support scheduling, voting and communications -The “global executive” consists of tasks that cooperate to provide synchronization, redundancy management, (fault isolation and reconfiguration) and interactive consistency. SIFT
8 1) The SIFT design approach makes no assumptions about the failure modes, distinguishing only between failed and nonfailed units. Prior work on permanent stuck-at-1 or stuck-at-0 faults on single lines considered not appropriate for the possible failure modes of LSI circuit components, which can be very complex and affect the performance of units in very subtle ways. 2) The primary method of detecting errors is the corruption of data : the particular manner in which the data are corrupted is of no importance. 3) R igorous formal specification of interfaces that enables us to deduce the effects on one unit of improper signals from a faulty unit. This has important consequences for failure-modes-and-effects analysis (FMEA), which is only required at the interface between units.
9 SIFT The unit of fault detection and of reconfiguration in SIFT is a processor/memory module or a bus. Early work on fault-tolerant computer systems used fault detection and reconfiguration at the level of simple devices, such as flip-flops and adders. Later work considered units such as registers or blocks of memory. No longer appropriate to be concerned with such small subunits. Several low-level techniques for fault tolerance, such as error- detection and –correction codes in memory, are not included in the design of SIFT. Such techniques could be incorporated in SIFT but would provide only a slight improvement in reliability.
10 System overview Main processors: Computing is carried out by the main processors. Each processor's results are stored in a main memory that is uniquely associated with the processor. A processor and its memory are connected by a conventional high bandwidth connection. I/O processors: The I/O processors and memories are structurally similar to the main processors and memories but are of much smaller computational and memory capacity. They connect to the input and output units of the system which are the sensors and actuators of the aircraft. Processing module: Each processor and its associated memory form a processing module, and each of the modules is connected to a multiple bus system. A faulty module or bus is prevented from causing faulty behavior in a nonfaulty module by the fault isolation methods.
11 Structure of SIFT hardware D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysis of a Fault Tolerant Computer for Aircraft Control”.
12 System overview: execution of tasks The SIFT system executes a set of tasks, each of which consists of a sequence of iterations. The input data to each iteration of a task are the output data produced by the previous iteration of some collection of tasks (which may include the task itself). The input and output of the entire system is accomplished by tasks executed in the I/O processors. Reliability is achieved by having each iteration of a task independently executed by a number of modules. After executing the iteration, a processor places the iteration's output in the memory associated with the processor.
13 System overview: synchronization SIFT uses the iterative nature of the tasks to economize on the amount of voting by voting on the state data of the aircraft (or the computer system) only at the beginning of each iteration. This produces less data flow along the buses than with schemes that vote on the results of all calculations performed by the program. It also has important implications for the problem of synchronizing the different processors. We must ensure only that the different processors allocated to a task are executing the same iteration. This means that the processors need be only loosely synchronized, so we do not need tight synchronization to the instruction or clock interval. An important benefit of this loose synchronization is that an iteration of a task can be scheduled for execution at slightly different times by different processors. Simultaneous transient failures of several processors will be less likely to produce correlated failures in the replicated versions of a task.
14 System overview: task replication The number of processors executing a task 1) can vary with the task a non critical task may be simplex; critical tasks may be replicated (3 or 5 replicas) 2) can be different for the same task at different times—for example, if a task that is not critical at one time becomes critical at another time. The allocation of tasks to modules is, in general, different for each module. It is determined dynamically by a task called the global executive, which diagnoses errors to determine which modules and buses are faulty. When the global executive decides that a module has become faulty, it reconfigures the system by appropriately changing the allocation of tasks to modules.
15 Fault isolation preventing a faulty unit from causing incorrect behavior in a nonfaulty unit Fault isolation in SIFT requires 1)isolating damage 2)preventing a faulty unit from causing incorrect behavior either by 2a) corrupting the data of the nonfaulty unit or 2b) by providing invalid control signals The control signals include those that request service, grant service, effect timing synchronization between units, ….
16 Fault isolation 1) Damage isolation means preventing physical damage from spreading beyond carefully prescribed boundaries. Techniques for damage isolation include physical barriers to prevent propagation of mechanical and thermal effects and electrical barriers (for example, high-impedance electrical connections and optical couplers). In SIFT, such damage isolation is provided at the boundaries between processing modules and buses.
17 Fault isolation 2a) Protection against the corruption of data is provided by the way in which units can communicate. - A processing module can read data from any processing module's memory, but it can write only into its own memory. Thus a faulty processor can corrupt the data only in its own memory and not in that of any other processing modules. All faults within a module are treated as if they have the same effect: namely, that they produce bad data in that module's memory. The system does not attempt to distinguish the nature of a module fault. In particular, it does not distinguish between a faulty memory and a processor that puts bad data into an otherwise nonfaulty memory. - A processor can obtain bad data if those data are read from a faulty processing module or over a faulty bus. Preventing these bad data from causing the generation of incorrect results is solved by fault masking (voting).
18 Fault isolation 2b) Fault isolation also requires that invalid control signals not produce incorrect behavior in a nonfaulty unit. In general, a faulty set of control signals can cause two types of faulty behavior in another unit: (1) The unit carries out the wrong action (possibly by doing nothing), and (2) the unit does not provide service to other units. In SIFT these two types of fault propagation are prevented by making each unit autonomous, with its own control. Improper control signals are ignored, and time-outs are used to prevent the unit from hanging up, waiting for a signal that never arrives.
19 Reconfiguration After identifying the faulty unit, the system is reconfigured to prevent that unit from having any further effect. If the faulty unit is a processing module, then the tasks that were assigned to it are reassigned to other modules. If it is a bus, then processors request their data over other buses. After reconfiguration, the system is able to withstand a new failure, assuming that there are enough nonfaulty units remaining. Because the number of processors executing a task can vary with the task and can be changed dynamically, SIFT has a flexibility not present in most fault-tolerant systems. The particular application field, aircraft control, is one in which different computations are critical to different degrees, and the design takes advantage of this.
20 Scheduling The types of timing requirements on the SIFT system: Output to the actuators must be generated with specified frequency. Transport delay (the delay between the reading of sensors and the generation of output to the actuators based upon those readings) must be kept below specified limits. To fulfill these requirements, an iteration rate is specified for each task. The scheduling strategy must guarantee that the processing of each iteration of the task will be completed within the time frame of that iteration. It does not matter when the processing is performed, provided that it is completed by the end of the frame. Moreover, the time needed to execute an iteration of a task is highly predictable. The iteration rates required by different tasks differ, but they can be adjusted somewhat to simplify the scheduling. Various scheduling strategies were considered for SIFT: (1) Fixed preplanned (nonpreemptive) scheduling, (2) priority scheduling, (3) deadline scheduling, and (4) simply periodic scheduling.
21 Scheduling The scheduling strategy chosen for the SIFT system is a slight variant of the simply periodic method. Each task is assigned to one of several priority levels. Each priority level corresponds to an iteration rate. Each iteration rate is an integral multiple of the next lower iteration rate. The time frame corresponding to highest priority level is divided into a number of subframes. The highest priority tasks are run in specific subframes so that their results can be available to other tasks run in the next subframe and they are required to complete within one subframe.
22 Processor synchronization The SIFT intertask and interprocessor communication mechanism allows a degree of asynchronism between processors and avoids the lockstep traditional in ultrareliable systems. However the processors must periodically resynchronize their clocks to ensure that no clock drifts too far from any other. For reliability, the resynchronization procedure must be immune to the failure of any one clock or processor and to a succession of failures over a period of time.
23 Processor synchronization The synchronization procedure must be reliable in the face of the worst possible behavior of the failing component, even though that behavior may seem unrealistically malicious. The traditional clock synchronization algorithm for reliable systems is the median clock algorithm, requiring at least three clocks. In this algorithm, each clock observes every other clock and sets itself to the median of the values that it sees. The justification for this algorithm is that, in the presence of only a single fault, either the median value must be the value of one of the valid clocks or else it must lie between a pair of valid clock values. In either case, the median is an acceptable value for resynchronization. The weakness of this argument is that the worst possible failure modes of the clock may cause other clocks to observe different values for the failing clock (Byzantine fault)
24 Processor synchronization (Consensus problem) In the presence of a fault that results in other clocks seeing different values for the failing clock, the median resynchronization algorithm can lead to a system failure. Consider a system of three clocks A, B, and C, of which C is faulty. Clock A runs slightly faster than clock B. Assume the failure mode of clock C is such that clock A sees a value for clock C that is slightly earlier than its own value, while clock B sees a value for clock C that is slightly later than its own value (Byzantine faults). Clock A=10: A:10B: 20C: 8 Clock B=20: A:10B:20C: 22 Clock C: faulty Median clock algorithm: Clock A=10Clock B= 20 Clocks A and B will both see their own value as the median value, and therefore not change it.
25 Processor synchronization (Consensus problem) An algorithm exists that makes the system immune to the failure of M out of N clocks under the assumption that N >= 3M+1. The general algorithm, and the proof of its correctness, can be found in Pease, Shostak, and Lamport . Let us consider the single-failure algorithm. The algorithm is carried out in two parts. In the first part, each clock computes a vector of clock values, called the interactive consistency vector, having an entry for every clock (Consensus problem). In the second part, each clock uses the interactive consistency vector to compute its new value. Each SIFT processor reads the value of its own clock directly and reads the value of another processor's clock over a bus. It obtains the value that processor r read for processor q's clock by reading from processor r's memory over a bus.
26 SIFT algorithm to synchronise clocks A clock p computes its interactive consistency vector as follows. The entry of the vector corresponding to p itself is set equal to p's clock value. The value for the entry corresponding to another processor q is obtained by p as follows: 1. Read q's value from q. 2. Obtain from each other clock r the value of q that r read from q. 3. If a majority of these values agree, then the majority value is used. Otherwise, the default value NIL (indicating that q is faulty) is used. One can show that if at most one of the clocks is faulty, then (1) each nonfaulty clock computes exactly the same interactive consistency vector, and (2) the component of this vector corresponding to any nonfaulty clock q is q's actual value.
27 SIFT algorithm to synchronise clocks Having computed the interactive consistency vector, each clock computes its new value as follows. Let d be the maximum amount by which the values of nonfaulty processors may disagree. (The value of d is known in advance and depends upon the synchronization interval and the rate of clock drift.) Any component that is not within d of at least two other components is ignored, and any NIL component is ignored. The clock then takes the median value of the remaining components as its new value. Since each nonfaulty clock computes exactly the same interactive consistency vector, each will compute exactly the same median value. Moreover, this value must be within d of the original value of each nonfaulty clock.
28 SIFT hardware The SIFT system attempts to use standard units whenever possible. Special design is needed only in the bus system and in the interfaces between the buses and the processing modules. The fault-tolerant properties of SIFT are based on the interconnection system between units and on the software system. The particular design of the processors and memories is irrelevant to our discussion of fault tolerance. The I/O processors are based on the well-known 8080 microprocessor architecture.
29 An abstract view of data transfer Connections among processors, buses, and memories. The varying replications of these connections are shown for each type of unit. Within each unit are shown a number of abstract registers that contain data or control information. Arrows that terminate at a register indicate the flow of data to the register. Arrows that terminate at the boundary of a unit indicate control signals for that unit. D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysisof a Fault Tolerant Computer for Aircraft Control”.
30 How a process p reads a word of data from location w of memory m via bus b (we assume normal operation, in which no errors or time-outs occur) Processor p initiates the Read operation by putting m and w into the register Prequest (p, b). Every processor has a separate Prequest register for each bus to which it is connected. When this register is loaded, a Busrequest line is set to request attention from the appropriate bus. The processor must now wait until the requested bus and memory units have completed their part of the operation. Each bus unit contains a counter-driven scanner that continuously scans the Prequest and Busrequest lines from processors. When the scanner finds a processor that requires its attention (Busrequest high), it stops, and the bus is said to have been seized by that processor. The bus's counter then contains the identifying number of the processor that has seized it.
31 When seized, the bus transfers the value w from the processor to a register connected to memory m. When this transfer has been completed, the Memrequest line is raised, calling for attention from the memory. The bus then waits for the memory to complete its actions. Memory units contain counter-driven scanners that operate in the same manner as those in bus units, that is, they continuously scan all buses to determine which of them (if any) is requesting service. When a request is detected, the memory is said to be seized, and it reads the value w from the bus. The memory then reads the contents of its location w into Memdata register and raises the Memread line to inform the bus that the data are available. The memory leaves the state of Memdata and Memread unchanged until it detects that the Memrequest line from the bus has dropped, indicating that the bus has received the data from the Memdata register. The memory then drops the Memread line and resumes scanning the buses for further requests.
32 When the bus detects that the Memread line from the memory is up, it transfers the data in the Memdata register to the Busdata register, drops the Memrequest line, and raises the Dataready line, indicating to the processor that the data are available. The bus leaves the state of the Busdata and Dataready lines unchanged until it detects that the Busrequest line from the processor has dropped, indicating that the processor has received the data word. The bus then drops the Dataready line and resumes scanning the processors for further requests. Meanwhile, the processor that made the original request has been waiting for the Dataready line to be raised by the bus, at which time it reads the data from the Busdata register. After completing this read, it drops the Busrequest line and continues with other operations. These actions have left the units in their original states. They are therefore ready to take part in other data transfer operations.
33 Abstract Program: REQUEST(p,b) := m,w D:= REALTIME WAIT ON (DATAREADY (b) OR REALTIME >(D )) IF DATA READY (b) THEN BEGIN READIN(p,b) := BUSDATA(b) BUSREQUEST(p,b) := FALSE WAIT ON ((DATA READY = FALSE) OR (REALTIME >(D )) END ELSE BEGIN BUS REQUEST := FALSE BUSFAIL(p,b) := TRUE END time-outs are used to prevent any kind of bus or memory failure from hanging up the unit
34 Interconnection units The interconnection system units designed especially for the SIFT system are (1)the processor-to-bus interfaces, (2)the buses, and (3)the bus-to-memory interfaces. These units all operate autonomously and contain their own control, which is implemented as a simple microprogrammed controller.
35 Interfaces The design of the interfaces permits simultaneous operation of all units. For example, a processor can simultaneously read data from its memory and from another memory, while at the same time another processor is reading from the first processor's memory. Since several processors may attempt to seize the same bus, or several buses may attempt to seize the same memory, a processor may have to wait for the completion of one or more other operations before receiving service. Such waiting should be insignificant because of the small amount of data that is transmitted over the buses.
36 The software system The software of SIFT consists of the application software and the executive software. The application software performs the actual flight-control computations. The executive software is responsible for the reliable execution of the application tasks and implements the error-detection and reconfiguration mechanisms. Formal specifications of the executive software have been written in a rigorous form using the SPECIAL language [Robinson and Roubine, 1977] developed at SRI. These formal specifications are needed for the proof of the correctness of the system. Moreover, they are also intended to force the designer to produce a well-structured system. From the point of view of the software, a processing module, with its processor, memory, and associated registers, is a single logical unit.
37 Application software The application software is structured as a set of iterative tasks. Each task is run with a fixed iteration rate that depends on its priority. The iteration rate of a higher-priority task is an integral multiple of the iteration rate of any lower-priority task. Every task's iteration rate is a simple fraction of the main clock frequency. The fact that a task is executed by several processors is invisible to the application software. In each iteration, an application task obtains its inputs by executing calls to the executive software. After computing its outputs, it makes them available as inputs to the next iteration of tasks by executing calls to the executive software. The input and output of a task iteration will consist of at most a few words of data.
38 Executive software The SIFT executive software performs the following functions: 1. Run each task at the required iteration rate. 2. Provide correct input values for each iteration of a critical task (masking any errors). 3. Detect errors and diagnose their cause. 4. Reconfigure the system to avoid the use of failed components. To perform the last three functions, the executive software implements the techniques of redundant execution and majority voting. The executive software is structured into three parts: the global executive task, the local executive, and the local-global communicating tasks.
39 Executive software One global executive task is provided for the whole system. It is run just like a highly critical application task, being executed by several processors and using majority voting to obtain the output of each iteration. It diagnoses errors to decide which units have failed and determines the appropriate allocation of tasks to processors. Each processing module has its own local executive and local- global communicating tasks. The local-global communicating tasks are the error-reporting task and the local reconfiguration task. Each of these tasks is regarded as a separate task executed on a single processor, so there are as many separate error-reporting tasks and local reconfiguration tasks as there are processors.
40 Logical structure of the SIFT software system D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysisof a Fault Tolerant Computer for Aircraft Control”.
41 Arrangement of application tasks within SIFT configuration critical tasks are replicated on several processors D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysisof a Fault Tolerant Computer for Aircraft Control”.
42 Arrangement of executive tasks within SIFT configuration D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysisof a Fault Tolerant Computer for Aircraft Control”.
43 The Global Executive Task. The global executive task uses the results of every processor's error task to determine which processing modules and buses are faulty. When the global executive decides that a component has failed, it initiates a reconfiguration by sending the appropriate information to the local reconfiguration task of each processor. The global executive may also reconfigure the system as a result of directives from the application tasks. For example, an application task may report a change of flight phase that changes the criticality of various tasks. To permit rapid reconfiguration, we require that the program for executing a task must reside in a processor's memory before the task can be allocated to that processor. In the initial version of SIFT, there is a static assignment of programs to memories. The program for a critical task usually resides in all main processor memories, so the task can be executed by any main processor.
44 The Local Executive. The local executive is a collection of routines to perform the following functions: (1) run each task allocated to it at the task's specified iteration rate; (2) provide input values to and receive output values from each task iteration; and (3) report errors to the local executive task.
45 Fault Detection Fault detection is the analysis of errors to determine which components are faulty. Processor/bus error table, an m by n matrix, where m is the number of processors and n the number of buses in the system. Each processor p has its own processor/ bus error table Xp that is maintained by its local executive's error handler. Xp[i,j] represents the number of errors detected by processor p's local executive that involve processor i and bus j. Suppose that processor p is reading from processor i using bus j. There are five distinct kinds of errors: 1. The connection from bus j to processor i is faulty. 2. The connection from processor p to bus j is faulty. 3. Bus j is faulty. 4. Processor i is faulty. 5. Processor p is faulty.
46 Fault Detection Processor p's error-reporting task analyzes the processor/bus error table If the number of errors is greater than a given threshold, an appropriate action can be taken. In case 1, processor p will stop using bus j to talk to processor i. In cases 2 and 3, processor p will stop using bus j, and will report to the global executive that bus j is faulty. In case 4, processor p will report to the global executive that processor i is faulty. The global executive task makes the final decision about which unit is faulty. It reads the faulty processor reports provided by the error-reporting task. - If two or more processors report that another processor is faulty, then the global executive decides that this other processor has indeed failed. - If two or more processors report that a bus is faulty, then the global executive decides that the bus has failed.
47 Fault Detection The global executive may know that some unit produced errors but be unable to determine which is the faulty unit. In that case, it must await further information. It can obtain such information by allocating the appropriate diagnostic tasks. If there is a faulty unit (and the error reports were not due to transient faults), then it should obtain the necessary information in a short time. It can be shown that in the presence of a single fault, the above procedure cannot cause the global executive to declare a nonfaulty unit to be faulty. With the appropriately malicious behavior, a faulty unit may generate error reports without giving the global executive enough information to determine that it is faulty. For example, if processor p fails in such a way that it gives incorrect results only to processor q, then the global executive cannot decide whether it is p or q that is faulty. However, the majority-voting technique will mask these errors and prevent a system failure.
48 SIFT SIFT (Software-Implemented Fault Tolerance): a Fault-Tolerant Computer for Aircraft Control Reliability requirement is that the probability of failure should be less than per hour in a flight of 10 hours' duration. Structure of SIFT hardware D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysis of a Fault Tolerant Computer for Aircraft Control”.
49 Reliability prediction High reliability of survival for a short period time (10-hour flight). For a flight of T duration survival will occurr unless certain combination of failure events occurr within the interval T or have already occurred prior to the interval T and were undetected by the initial checkout of the system. Show that the probability of a more catastrophic sequence of failures is sufficient small. Finite state Markov process. The combined probability of all event sequences that lead to a failed state is the system failure probability. Failure rate of for a 10-hour period T
50 Assumptions: - Hardware-fault events are independent and exponentialy distributed in time (constant failure rate) - All failures are permanent for the duration of the flight Accurate because: - the physical design of the system prevents fault propagation between functional units - a mutiple fault in a functional unit is no more serious than a single fault Effects of transient errors are masked by the executive system which requires a unit to make multiple errors before it consideres the unit to be faulty. The execution of critical tasks in loose synchronism also helps protect against correlation of fast transient errors. Failure rates for hardware have been estimated on the basis of active component counts, using typical figures for similar hardware : main processors per hour failure rate
51 State of the system in the reliability model (h, d, f) with h<=d<=f (h, d, f) represents a situation in which: f failures of individual components have occurred d of those failures have been detected h of these detected failures have been handled in reconfiguration Three types of possible transitions: The first two types of transitions are represented by straight arrorws (costant probabilities for unit of time) The third type of transition is represented by wave arrows, represents the completion of a reconfiguration procedure. (these transitions are assumed to occurr within some fixed length of time
52 D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysis of a Fault Tolerant Computer for Aircraft Control”.
53 A state (h,d,f) with h
54 Designers calculated the probability of system failure through a double- fault transition and also through reaching a state with fewer than two nonfaulty processors, for which they said the system has failed because it has run out of spares. Failure probabilitie for a 5 processor system and T= 10 hours D.P. Siewiorek, R. S. Swarz Reliable Computer Systems (Design and Evaluation) Prentice Hall, Chapter 10 – “The SIFT Case: Design and Analysis of a Fault Tolerant Computer for Aircraft Control”.
55 Summary SIFT basic approach to fault tolerance involves the replication of standard components, relying upon the software to detect and analyze errors and to dyamically reconfigure the system to bypass faulty units. Special hardware is needed only to isolate the units from one another, so a faulty unit does not cause the failure of a nonfaulty one. Processor/memory modules and bus modules as the basic units of fault detection and reconfiguration have been used. These units make system reconfiguration easy and are small and inexpensive enough to allow sufficient replication to achieve the desired reliability. By using software to achieve fault tolerance, SIFT allows considerable flexibility in the choice of error handling policies and mechanisms. For example, algorithms for fault masking and reconfiguration can be easily modified on the basis of operational experience. Novel approaches to the tolerance of programming errors can be incorporated. Moreover, it is fairly easy to enhance the performance of the system by adding more hardware.
56 Fault tolerant systems
57 From “The standard IEC Nuclear Power Plant – I&C system important to safety – requirements for coping with CCF”, A. Lindner, H-W Bock, NPIC&HMIT 2009, American Nuclear Society. Scheme of a digital safety I&C system
58 Redundancy, diversity and independence are basic concepts of a defense-in-depth approach: 1. redundancy alternative systems and components are included, so that any one can perform the required function if the others fail 2. diversity for a particular function, two or more redundant systems or components with different attributes are included in the design. In practice, it may mean using different components based on different designs and principles, from different vendors 3. independence prevents the propagation of failures and common cause failures due to common internal plant hazards. Independence is achieved by electrical isolation, physical separation and independence of communications between systems Defense-in-depth approach Common Cause Failure a failure of two or more structures, systems or components due to a single specific event or cause (fault)
59 Scheme of a fourfold redundant safety I&C system comprising two independent I&C systems A and B The channels of independent systems and the systems are spatially separated The systems A and B can be designed by different Programmable Logical Controllers different set signals different I&C functions Replicas have different starting time to avoid failures triggered by the same runtime Functional diversity Existence of diverse measurement signals, actuators and functions in the plant system Diverse functions: functions that ensure independently that the plant safety targets are met Diverse I&C functions have to be assigned to independent I&C systems
60 Objective: decrease the lifecycle costs of fault tolerant systems Generic fault-tolerant architecture with two dimensions of redundancy a multi-level integrity dimension for accomodating software components of different levels of criticality Architecture based on COTS (commercial off-the-shelf) components Software implemented approach to minimize the need for special hw Architecture that can be configured into a wide variety of instances Considered applications: nuclear propulsion (*), railway and space systems Generic Upgradable Architecture for Real-time Dependable Systems (GUARDS, ESPRIT Project 20716) (*) I&C for submarine propulsion – subsystem related to process regulation, to control and monitoring of actuators for the availability of power generation and the man-machine interface in the control room (equipment classified B-IEC 61226)
61 GUARDS Lanes From: D. Powell Ed, A Generic Fault Tolerant Architecture for Real-Time Dependable Systems, Kluwer Academic Publishers, 2001
62 Three dimensions of fault containment: Channels (primary physical-fault containment regions) Lanes (secondary physical-fault containment regions) Integrity levels (design–fault containment regions) An instance of the architecture is defined by - the dimensional parameters (C, L, I) - a fault-tolerant strategy leading to a selction of components that implement mechanisms for: Inter-channel communication Output data consolidation Fault tolerance and integrity management GUARDS
63 Aims to provide containment regions with respect to software design faults -> to protect critical components from the propagation of errors due to residual design faults in less-critical components Each application object is assigned an integrity level Enforce an integrity policy to mediate the communication between objects of different levels -> prohibit flows of information from low to high integrity Validation Objects: special objects that apply fault tolerant mechanisms to upgrade the trustworthiness of data output reliable information by using possibly corrupted data as input (with a low integrity level) The Integrity dimension
64 Multiple processors or lanes are used: to define secondary physical fault containment regions. Improve the capability of fault diagnosis within a channel (e.g. by comparison of computation replicated on several processors) Improve Coverage with respect to design faults by using intra- channel diversification to improve the availability of a channel (by passivating a node that is diagnosed to be permanently faulty) The Lane dimension
65 Fault tolerance is based on active replication of application tasks over the set of channels: 2-channels: in case of requirements of improving safety or reliability 3-channels: TMR strategy that enables most faults in one channel to be masked 4-channels: allow arbitrary faults in one channel to be masked or allow a channel to be isolated for off-line testing while still guarantee TMR Replicas on fault-free channels behave deterministically and must produce the same outputs Error processing based on comparison or voting CONSTRAINT: replicas are supplied with the same inputs in the same order, despit the occurrence of faults The Channel dimension
66 Provides a global clock to all channels Each node has a physical clock and computes a global logical clock time through a fault tolerant synchronisation algorithm Allows channels to achieve interactive consistency (consensus) on non-replicated data Allows exchange of private data between channels and agreeing on a common value in the presence of arbitrary faults (Byzantine fault) Inter-Channel Communication Network- ICN Consensus between multiple processes is a basic building block in fault-tolerant distributed computing Byzantine fault an arbitrary or malicious fault, such as when a process sends corrupt messages or differing messages during a broadcast to its neighbors
67 Interactive consistency protocol used in GUARDS assume authenticated messages (signed messages) Uses checksum scheme for message authentication rather than criptographic features signatures for performance reasons An example of application: Agreement on sensor inputs: even when replicated sensors are provided, the values read from them must be considered as single-source values and thus potentially different Interactive Consistency Network
68 read sensors consolidate outputs execute application vote results write actuators Error reports Replicated execution of a task in the case of a three channel configuration - Replicated sensor data are read - Input values are consolidated across all channels over the ICN - Application tasks are executed asynchronously - Output values are exhanged over the ICN and voted - Voted results are written to actuators via output consolidation hw which allows physical values to be voted
69 References B. Littlewood, L. Strigini, “Validation of ultra-high dependability for software-based systems”, ACM, T-Book, Reliability Data of Components in Nordic Nuclear Plants ATV Office, Vallingby, Sweden, 1991 U.S. Commercial Nuclear Power Plant Digital I&C System Operating Experience, Electric Power Research Institute, Revision 0, Palo Alto, CA, D. Powell Ed, A Generic Fault Tolerant Architecture for Real-Time Dependable Systems, Kluwer Academic Publishers, 2001