Safety-Critical Systems Department of Computer Science

CM30072 Safety-Critical Systems Department of Computer Science 2004

Safety-Critical Computer Systems
Recommended Book. Safety-Critical Computer Systems By Neil Storey ISBN:

Course outline:
• Accidents and Risk
• Safety Terminology
• Computer and Software Safety
• Hazard Identification and Analysis
• Risk Assessment and Reduction
• Verification of Safety
• Managing Safety

1. Accidents and Risk
Many of the most serious industrial and commercial accidents this century have occurred in the past 25-odd years:
• 1979 Three Mile Island nuclear power plant accident, costing over $1.5 billion, plus the loss of the plant.
• 1984 Bhopal: toxic chemical escape killed 2500 people, and seriously injured 200,000.
• 1986 NASA's Challenger space shuttle exploded after take-off.
• 1986 Chernobyl nuclear power plant explosion and fire caused a huge radioactive cloud, which affected much of Europe.
• 1988 Piper Alpha oil platform fire. 167 of the 225 people on the platform were killed.
• 1996 Channel Tunnel fire.
• 1996 ValuJet DC-9 crashed into the Florida Everglades, killing all 110 people on board.
• 1999 Ladbroke Grove Junction rail crash. 31 killed.

• 2001 Explosion in a fertilizer factory in France. 29 killed, over 2400 injured.
• 2002 The Senegalese ferry Joola sinks in a fierce gale off the Gambian coast; over 1,800 passengers are killed, 67 survive.
• 2002 The Zeyzoun Dam in Syria collapses, flooding at least three neighbouring villages. 20 deaths are reported.
• 2003 In Bangladesh an overloaded ferry capsizes at Chandpur. At least 400 are killed.
• 2003 The space shuttle Columbia breaks up as it descends over central Texas toward a planned landing at Kennedy Space Center in Florida. All 7 crew are killed.

1.1 Risks faced by technological societies
Increasing complexity
High-tech systems are often made up of networks of closely related subsystems. A problem in one subsystem may cause problems in other subsystems. Analyses of major industrial accidents invariably reveal highly complex sequences of events leading up to accidents, rather than single component failures. In the past, component failure was cited as the major factor in accidents. Today, more accidents result from dangerous design characteristics and interactions among components.

Increasing exposure
Both technology and society are becoming more complex. More people today may be exposed to a given hazard than in the past, e.g.
• More passengers can be carried in one aircraft.
• Faster trains mean more people can be carried every year.
• Dangerous chemical plants may have large populations near them.

Increasing automation
Automation might appear to reduce the risk of human operator error. However, it just moves the humans to other functions: maintenance, supervision, higher-level decision making, etc. The effects of human decisions and actions can then be extremely serious.
Example: In 1985, a China Airlines 747 flying on autopilot suffered a slow loss of power from its outer right engine. This would have caused the plane to yaw to the right, but the autopilot kept correcting for the yaw until it reached the limits of what it could do and could no longer keep the plane stable. The plane went into a vertical dive of 31,000 feet before it could be recovered. It was severely damaged and nearly lost.

Increasing centralisation and scale
Increasing automation has led to centralisation of industrial production in very large plants, giving the potential for great loss and damage in the event of an accident.
e.g. Nuclear power
In 1968, manufacturers were taking orders for plants 6 times larger than the largest one in operation. In 1975, the Browns Ferry Nuclear Power Plant was the site of a very serious accident. It was 10 times larger than any plant already in operation. A technician checking for air leaks with a lighted candle caused $100 million in damage when insulation caught fire. The fire burned out electrical controls, lowering the cooling water to dangerous levels, before the plant could be shut down.

Increasing pace of technological change
The average time to translate a basic technical discovery into a commercial product was, in the early part of the twentieth century, 30 years. Nowadays, it takes less than 5 years.
Economic pressure to keep up with changing technology:
• May lead to less extensive testing; testing may end up being done by the customer.
• Lessens the opportunity to learn from experience (trial and error). Learning is difficult if products are brought to market and then replaced by new products very quickly.
In many cases we cannot learn by trial and error at all, e.g. nuclear power!

1.2 How accidents happen
In order to devise effective ways to prevent accidents, we must first understand what causes them. This is a complex task.
Example: The vessel Baltic Star, registered in Panama, ran aground at full speed on the shore of an island in the Stockholm waters, on account of thick fog. One of the boilers had broken down, the steering system reacted only slowly, the compass was maladjusted, the captain had gone down into the ship to telephone, the lookout man on the prow took a coffee break, and the pilot had given an erroneous order in English to the sailor who was tending the rudder. The latter was hard of hearing and understood only Greek.
The complexity of events and conditions contributing to this accident is not unusual.


Cause and effect
Causality is the notion of how things happen: the relationship between cause and effect.
A cause must precede a related effect, but problems finding causes arise because:
• A condition or event may precede another event without causing it.
• A condition may be considered to cause an event without the event occurring every time the condition holds.
e.g. A plane may crash in bad weather, but the bad weather may not be the cause of the crash.
The cause of an event (such as an accident) is composed of a set of conditions, each of which is necessary, and which together are sufficient for the event to occur. Thus if one condition is missing, the event won’t occur. If all conditions are there, then the event will occur.

Example: What are the factors required for combustion?
• Flammable material
• Source of ignition
• Oxygen
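The idea that a cause is a set of individually necessary and jointly sufficient conditions can be sketched in a few lines of Python. This is only an illustration of the combustion example: the function names and condition labels are hypothetical, chosen for this sketch.

```python
# Illustrative sketch: an event occurs only when EVERY necessary
# condition holds. Names and labels are hypothetical.
COMBUSTION_CONDITIONS = ("flammable material", "source of ignition", "oxygen")


def event_occurs(present, required=COMBUSTION_CONDITIONS):
    """Return True only if all required conditions are present."""
    return all(condition in present for condition in required)


def missing_conditions(present, required=COMBUSTION_CONDITIONS):
    """List the required conditions that are absent, in their listed order."""
    return [condition for condition in required if condition not in present]
```

Removing any single condition (for example, the source of ignition) makes `event_occurs` return False, which is the sense in which each condition is necessary.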

A Common Problem — Oversimplification of causes
Interpreting statements about the "cause" of an accident requires care: the cause may be oversimplified or biased. Out of a large number of necessary conditions, one is often chosen as the cause, even though all factors were indispensable.
Example: A car skidding and crashing in the rain
• Wet road
• No ABS
• Visibility
• Worn tires
• Lack of driver’s skill
• The curvature of the road
• Driving too fast
• The camber of the road

Common oversimplifications:
Assuming human error (a favourite!)
• Is applicable to virtually any accident.
• Used far too freely.
• Removing the human as a means of preventing future accidents is not particularly useful!
There are many reasons why accident reports which attempt to place the blame on a human should be viewed with skepticism. Here are just some:
• The data may be biased or incomplete.
• Positive actions are not usually recorded.
• Blame may be based on the assumption that the human operator can overcome every emergency.
• Operators often have to intervene at the limits.
• Hindsight is always 20/20.
• Separating operator error from design error is very difficult.

Assuming technical failures
i.e. A component broke! e.g. Flixborough Chemical Works, UK, 1974. A makeshift bypass pipe exploded, killing 28 workers.

Ignoring organisational factors
Accidents are often blamed on operator or technical failure without recognising that there is an organisation and/or management which may have made an accident inevitable. e.g. Three Mile Island nuclear power plant: the accident report had 19 pages of recommendations, 2 on technical issues and the rest on management and training.

1.3 Finding root causes of accidents
Finding the root causes of an accident is key to preventing similar accidents. The root level causes of accidents can be divided into three categories: (i) Deficiencies in the safety culture of the industry or organisation. (ii) Flawed organisational structures. (iii) Superficial or ineffective technical activities.

(i) Deficiencies in the safety culture
Safety culture: the general attitude and approach to safety reflected by those who participate in that industry, e.g. management, workers, industry regulators, government regulators. In an ideal world all participants are equally concerned about safety, both in the processes they use and in the final product. But this is not always the case, which is why we need industry and government regulators.

Major accidents often stem from flaws in this culture, especially:
• Overconfidence and complacency:
  - Discounting risk.
  - Over-reliance on redundancy.
  - Unrealistic risk assessment.
  - Ignoring high-consequence, low-probability events.
  - Assuming risk decreases over time.
  - Underestimating software-related risks.
  - Ignoring warning signs.
• Disregard or low priority for safety.
• Flawed resolution of conflicting goals.

Overconfidence and complacency
Discounting risk. Major accidents are often preceded by the belief that they cannot happen, e.g. Challenger (1986), Titanic (1912).
Over-reliance on redundancy. Redundant systems use extra components to ensure that failure of one component doesn’t result in failure of the whole system. Many accidents can be traced back to common-cause failures in redundant systems. Common-cause failure happens when multiple redundant components fail at the same time for the same reason. Providing redundancy may help if a component fails, BUT we must be aware that all redundant components may fail.
Paradox: providing redundancy may lead to complacency, which defeats the redundancy!
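A rough numerical sketch shows why common-cause failures undermine redundancy. It uses the simple "beta-factor" approximation, which is an assumption brought in for illustration and not part of these notes: a fraction beta of each component's failure probability is taken to come from a shared cause that defeats every redundant component at once. All numbers are hypothetical.

```python
def independent_failure(p, n):
    """Probability that all n redundant components fail, assuming
    each fails independently with probability p."""
    return p ** n


def with_common_cause(p, n, beta):
    """Beta-factor approximation (an illustrative assumption):
    a fraction beta of p is a shared cause that fails every component
    together; the remaining (1 - beta) * p is independent."""
    independent_part = ((1.0 - beta) * p) ** n
    common_part = beta * p
    return independent_part + common_part
```

With p = 0.01 and two components, the independent estimate is 0.0001, while a 10% common-cause fraction gives about 0.00108, roughly ten times worse. Adding more components shrinks only the independent part; the common-cause term stays.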


Unrealistic risk assessment. It is quite common for developers to state that the probability of a software fault occurring in a piece of software is 10^-4, usually with little or no justification. Therac-25 software risk assessment was for “computer selects wrong energy level”.
Ignoring high-consequence, low-probability events. A common discovery after accidents is that the events involved were recognised as being very hazardous before the accident, but were dismissed as incredible. e.g. Apollo 13: The ground control engineers didn’t believe what the instruments were telling them.

Assuming risk decreases over time. A common thread in accidents is the belief that a system must be safe because it has operated without any accidents for many years.
• Therac-25 operated thousands of times before the first accident.
• For 18 years, nitromethane was considered to be non-explosive, and so safe to transport in railway tanks, until 2 tanks exploded in separate incidents.
Risk remains constant over time, and it may actually increase, e.g. when operators become over-familiar with safety procedures and hence become lax or even skip them.
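A small calculation illustrates why a long accident-free record is weak evidence of safety. The sketch assumes a constant, independent per-use accident probability; the figures are hypothetical and are not taken from any of the cases above.

```python
def prob_no_accident(p_per_use, uses):
    """Probability of seeing no accident in `uses` operations, assuming a
    constant, independent accident probability per use (hypothetical model)."""
    return (1.0 - p_per_use) ** uses


# Hypothetical figures: a 1-in-100,000 chance per use, 10,000 uses.
record = prob_no_accident(1e-5, 10_000)
```

Under these assumed figures `record` is about 0.90, so a system this dangerous would most likely run 10,000 times without an accident. The clean record says little about whether the per-use risk is acceptable.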

Underestimating software related risks: There is a pervading belief that software cannot fail, and that all errors will be removed by testing. Hardware backups, interlocks, etc, are being replaced by software. Many basic mechanical safety devices are well tested, cheap, reliable and failsafe (based on physical principles to fail in a safe state). It is therefore probably misguided to replace them by software that displays none of these properties!!

Ignoring warning signs: Accidents are frequently preceded by public warnings or a series of minor occurrences. e.g. The King’s Cross underground fire of 1987, in which people were killed. It was apparently caused by a lighted match dropped onto an escalator, which fell through and set fire to dust and grease on the track beneath. There had actually been an average of 20 fires a year from 1958 to 1987, which were called “smoulderings”, presumably to make them sound less serious. They caused some damage but no one was killed. This led to the view that no fire could become serious!

Disregard or low priority for safety:
Problems will occur if management is not interested in safety, because the workers will not be encouraged to think about safety. The Government may disregard safety and ignore the need for government/industry watchdogs and standards committees. In fact these often only appear after major accidents.

Flawed resolution of conflicting goals
Occurs if there are deficiencies in the safety culture of an organisation/industry. The most common one is the “cost-safety” trade-off. Safety costs money! Or at least it appears to cost more money at the time of development/manufacture. Often cost becomes more important, and safety may therefore be compromised in the rush for greater profits.

We are looking at finding root causes of accidents. The root causes of major accidents usually reside in:
• Overconfidence and complacency
• Flawed organisational structures
• Ineffective technical activities
• Or a combination of the above.

(ii) Flawed organisational structures
Many accident investigations uncover a sincere concern for safety in the organisation, but find organisational structures in place that were ineffective in implementing this concern.
Diffusion of responsibility and authority.
Accidents are often associated with ill-defined responsibility and authority for safety matters. There should be at least one person with overall responsibility for safety, and they must have real power within the company.
Lack of independence and low status of safety personnel.
This leads to their inability or unwillingness to bring up safety issues. e.g. Safety officers should not be under the supervision of the groups whose activities they must check.

Poor and limited communications channels.
In some industries, strict line management means that workers report only to their direct superiors. Problems with safety may not be reported to interested parties. Thus, safety decisions may not be reported back to the workers. All staff should have direct access to safety personnel and vice versa.

The start of the problem is that companies don’t factor in the possibility of catastrophes or their potential costs when making decisions. This results in insufficient resources being allocated to managing safety. Once this is understood, it clears the way for initiatives which can dramatically reduce the risk of catastrophes, e.g.
• Giving those closest to the problem the decision-making power to respond
• Improving internal communications
• Encouraging employees to come forward with ‘bad news’
• Not overworking employees
• Providing special training to alert workers to potential dangers
• Making sure that sophisticated technology does not diminish a worker’s ability to assess a situation


Herald of Free Enterprise (1987)
Sank off the Belgian coast at Zeebrugge harbour, with the loss of 193 lives. Investigations revealed causal links in a chain that stretched from a negligent seaman, through every level of management, up to the Board of Directors.
• The ferry sank because it left port without closing its bow doors.
• The seaman responsible for closing the doors was asleep.
• The officer responsible for monitoring the door-closing said he saw someone preparing to close them.
• The captain followed the standing order which said that if no problems were reported “the master will assume, at the due sailing time, that the vessel is ready for sea in all respects”.
• There was a great deal of pressure placed on the crew by management to cut turnaround time. One company memo read “Sailing late out of Zeebrugge is not on.”
• The ferry’s design, optimised for rapid loading, meant that it took only 90 seconds to sink, leaving virtually no time at all to evacuate the passengers.

(iii) Ineffective Technical Activities
This is concerned with poor implementation of all the activities necessary to achieve an acceptable level of safety.
Superficial safety efforts
• Hazard logs kept, but with no description of design decisions taken or trade-offs made to mitigate/control the recognised hazards.
• No follow-ups to ensure hazards have actually been controlled.
• No follow-ups to ensure safety devices are kept in working order.
Ineffective risk control
The majority of accidents do not result from a lack of knowledge about how to prevent them! They result from a failure to use that knowledge effectively when trying to fix the problem(s).

Failure to evaluate changes
Accidents often involve a failure to re-evaluate safety after changes are made. Any changes in hardware or software MUST be re-evaluated to determine whether safety has been compromised. Quick fixes often affect safety because they are not evaluated properly.
Information deficiencies
Feedback of operational experience is one of the most important sources of information for designing, maintaining and improving safety, but is often overlooked! There are 2 types of data that are important:
• Information about accidents/incidents for the system itself
• Information about accidents/incidents for similar systems

1.4 Modelling Accidents
Design and analysis methods used in safety-critical systems are based on underlying models of the accident process. Accident models attempt to reduce an accident description to a series of events and conditions that account for the outcome. Such models are used to:
(i) understand past accidents,
(ii) learn how to prevent future accidents.
To design and implement safe systems we must understand these underlying models and the assumptions they make about accidents and human errors.

Domino models (Developed in 1931 by H. Heinrich.) Emphasise unsafe acts above unsafe conditions. The general accident sequence is mapped onto five "dominoes". Called a domino model because once one domino "falls", it causes a chain of falling dominoes until the accident occurs. Removing any of the dominoes will break the sequence.

Although removing any domino will prevent the accident, it is generally considered that the easiest and most effective domino to remove is #3 – unsafe act or condition. This model has been very influential in accident investigations. BUT has often been wrongly used to look for a single unsafe act or condition, when causes were actually more complex.

Example: A worker climbs a defective ladder. The ladder collapses and the worker falls and injures himself.
[Figure: the five dominoes in sequence, linked by arrows labelled “leads to”, “results in” and “reason for”: Ancestry, social environment; Fault of person; Unsafe act or condition; Accident; Injury.]
Tracing the example back from the injury:
• Worker breaks leg
• Worker falls when ladder breaks
• Worker climbs defective ladder
• Worker doesn’t notice that ladder is defective
• Worker unobservant
• Ignores safety

Normal solution: Remove domino #3, the unsafe act or condition.
• Replace the ladder!
• Stop people from climbing the defective ladder.
(Notice how we can give a very particular slant to the accident: blame the worker!)
However, replacing one defective ladder will not prevent accidents with other defective ladders. The analysis does not address why:
• Safety inspections didn’t spot the faulty ladder
• Management didn’t notice that safety inspections were defective


Chain-of-events models
Organise causal factors into chains of events. Events are chained in chronological order, but there is often no obvious stopping point when tracing back from the cause of an accident. This model is very close to our view of accidents, where we often try to rationalise it into a series of events. As with the domino model, if the chain is broken, the accident won’t happen. Thus, accident prevention measures concentrate on either: Eliminating certain events, Intervening between events in the chain.

Simple chain of events for a pressurised tank rupture:
In this chain of events, we are modelling one of the many ways in which a pressurised tank might rupture at normal operating pressure. In this particular chain, we consider the possibility of the structure of the tank being weakened by corrosion. The resulting accident could cause equipment damage and/or injury to nearby personnel.
NB: There are many other forms of accident model, many of which deal with specific aspects of an accident, e.g. energy models, task models, cognitive models.

How could we prevent/mitigate against the effects of such an accident?
Prevent moisture getting to the tank in the first place. Prevent corrosion by using a special metal, or a special coating for the tank. You would then have to decide (NB: cost vs safety!) on which is the most appropriate. How about this way: Prevent projectiles (fragments) from the tank by enclosing it in a structure capable of withstanding the blast. However, this option doesn’t prevent the accident but it does reduce the effects (from the blast).
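The chain-of-events view can be sketched as a list of events in which blocking any one event stops everything after it. The event labels below are reconstructed from the surrounding text and are hypothetical, as are the function names; this is an illustration, not a reproduction of the original diagram.

```python
# Hypothetical event labels, reconstructed from the surrounding text.
TANK_RUPTURE_CHAIN = [
    "moisture reaches tank",
    "corrosion",
    "weakened metal",
    "tank ruptures at operating pressure",
    "fragments projected",
    "equipment damage or personnel injury",
]


def propagate(chain, blocked=frozenset()):
    """Return the events that occur, stopping at the first blocked event."""
    occurred = []
    for event in chain:
        if event in blocked:
            break
        occurred.append(event)
    return occurred


def loss_occurs(chain, blocked=frozenset()):
    """The final loss happens only if no event in the chain is blocked."""
    return propagate(chain, blocked) == list(chain)
```

Blocking "corrosion" (a special metal or coating) stops the chain before the rupture. Blocking "fragments projected" (a blast enclosure) still lets the tank rupture but prevents the final loss, which matches the distinction between preventing the accident and reducing its effects.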

2. Safety Terminology
2.1 Some definitions
Fault
A fault is a hardware or software defect which resides temporarily or permanently in the system.
Error
An error is a deviation from a desired or intended state. An error is a static condition, a state, which remains until it is removed (usually through human intervention).
Failure
A failure is the inability of a system or component to fulfil its operational requirement. Failure is an event or behaviour which occurs at a particular instant in time.

Example:
Fault: A faulty piece of software may cause a valve to remain open even when it has been commanded to close by the system operator.
Error: Once the valve remains open, even though commanded to close, the system is in an error state, because it has an open valve that should be closed.
Failure: The valve control software fails to control the valve correctly, and the failure is due to a fault in the software.
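The valve example can be made concrete with a toy controller. This is a sketch only: the class, its deliberately planted defect and the helper function are all hypothetical, written to show where the fault, the error and the failure each appear.

```python
class FaultyValveController:
    """Toy valve software containing a deliberate FAULT:
    close commands are accepted but never acted on."""

    def __init__(self):
        self.valve_open = True       # actual state of the valve
        self.commanded_open = True   # state the operator asked for

    def command(self, open_valve):
        self.commanded_open = open_valve
        if open_valve:               # FAULT: no branch handles 'close'
            self.valve_open = True

    def in_error(self):
        """ERROR: the actual state deviates from the intended state.
        It persists until someone removes it."""
        return self.valve_open != self.commanded_open


def close_and_check(controller):
    """Issue a close command. A False result marks the FAILURE event:
    the software did not fulfil its operational requirement."""
    controller.command(open_valve=False)
    return not controller.valve_open
```

The fault exists from the moment the code is written, the failure is the event at the instant the close command is not carried out, and the error is the state that remains afterwards.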

Near-miss or incident
An event which involves no loss (or only minor loss) but with the potential for loss under different circumstances. e.g. Two aircraft coming within a pre-defined distance of each other is a near-miss incident.
Accident
An undesired and unplanned (but not necessarily unexpected) event that results in (at least) a specified level of loss. e.g. Two aircraft coming within a pre-defined distance of each other and colliding is an accident.

Hazard A hazard is a state of the system that may give rise to an accident. More generally, a hazard is a situation in which there is actual or potential danger to people or to the environment. NB: Hazard is specific to a particular system and is defined w.r.t the environment of the system/object. e.g. Water alone is not hazardous, but it is easy to think of combinations of conditions where it could lead to death.

e.g. Chlorine is not hazardous if properly contained.
See handout example. e.g. A railway sleeper In its normal location (holding up railway tracks), it is not dangerous. However, placed across the top of the rails it is a hazard and will lead to an accident if the train is unable to stop in time.

The need to define a hazard w.r.t. its environment is particularly true for software. Software is an abstraction, not an object, so we can only talk about the safety of software and its hazards within the context of the system in which it is used. We must define boundaries for the system we are considering, otherwise everything around the system may be viewed as a hazard. The boundaries are chosen by the designer and define which hazards are part of the system (and thus controllable by the designer) and which are outside (and thus outside the control of the designer).

Example 1
An air traffic control (ATC) system where an accident is defined as a collision between two aircraft.
An appropriate hazard: lack of minimum separation between aircraft.
The designer chooses this as the hazard because the ATC system will theoretically have control over all aircraft separation, since its job is to control the movements of these aircraft by:
• Directing pilots to specific, well-separated routes
• Warning pilots if they get too close to another plane
However, remember that the designer of the ATC system may be unable to control other factors that could determine whether two aircraft get close enough to collide.
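Treating the hazard as a system state, rather than as the accident itself, can be sketched as a simple check on two aircraft positions. The separation minima, the coordinate scheme and all names below are illustrative assumptions, not values from these notes or from any real ATC rule set.

```python
import math
from dataclasses import dataclass

# Illustrative minima only (assumed for this sketch).
MIN_HORIZONTAL_NM = 5.0
MIN_VERTICAL_FT = 1000.0


@dataclass
class Aircraft:
    x_nm: float         # east-west position, nautical miles
    y_nm: float         # north-south position, nautical miles
    altitude_ft: float


def separation_lost(a, b):
    """Hazard state: the pair is closer than BOTH minima.
    No collision has happened; the system is merely in a state
    from which an accident could follow."""
    horizontal = math.hypot(a.x_nm - b.x_nm, a.y_nm - b.y_nm)
    vertical = abs(a.altitude_ft - b.altitude_ft)
    return horizontal < MIN_HORIZONTAL_NM and vertical < MIN_VERTICAL_FT
```

The check depends only on quantities the ATC system can observe and influence, which is why this state is a workable hazard definition for the designer.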


Example 2
An industrial plant handling flammable gases where an accident is defined as the gases catching fire or exploding. (A flammable mixture will catch fire or explode when there is air and a source of ignition.)
Research has shown that when flammable gases are mixed with air in a flammable concentration, the likelihood is that a source of ignition will turn up! Hence the only safe rule is never to permit such mixtures except under carefully defined circumstances where the risk is acceptable.
Thus an appropriate hazard might be defined as “a mixture of gas in air”. The ignition source is NOT an appropriate hazard, since once there is one, an explosion will occur immediately.

Risk: the probable rate of occurrence of a hazard causing harm and the degree of severity of the hazard.
A specific risk always has these two elements:
• The probability of the hazardous event
• The consequences of the hazardous event
When we look at the risk associated with hazardous events, the various factors which constitute the overall risk can be very complex. This can make it very difficult, if not impossible, to come up with a meaningful value for risk. As a first step, it is common to informally state all the components that go to make up the risks in a particular system, since this can give insight into the overall risk.

Some examples of determining risk factors
(1) A computer controlling the movement of a robot.
• Probability that the computer causes incorrect/unexpected robot movement
• Probability that a human is within the field of movement
• Probability that the human has no time to move or fails to notice the robot’s movement
• Severity of the worst-case consequence, i.e. robot hits human with enough force to kill them.

(2) A computer monitoring a piece of equipment, with a requirement to initiate some safety function upon detection of a potentially hazardous condition, e.g. in a chemical plant:
• Probability of hazardous plant condition arising
• Probability that the computer fails to detect it
• Probability of computer not initiating its safety functions even if the hazardous condition is detected
• Probability that the safety function, though initiated, doesn’t remove the hazard
• Severity of worst-case consequences: a chemical spill leading to injury/death
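Once the factors are written down, a first rough estimate can be formed by combining them. The sketch below does this for the robot example, multiplying the probabilities on the assumption that the factors are independent. That assumption, and every number used, is hypothetical and chosen only to show the arithmetic.

```python
import math


def accident_probability(factor_probabilities):
    """Multiply the listed factors, ASSUMING they are independent.
    Real factors are often correlated, so treat this as a first guess."""
    return math.prod(factor_probabilities)


# Hypothetical values for the robot example.
robot_factors = {
    "computer causes unexpected movement": 1e-3,
    "human within field of movement": 1e-1,
    "human cannot avoid the movement": 5e-1,
}

p_accident = accident_probability(robot_factors.values())
```

With these assumed values `p_accident` is 5e-5. The severity (here a potential fatality) is stated alongside the probability rather than folded into one number, in keeping with the two elements of risk. Writing the factors out also shows where reduction is possible, for example keeping humans out of the robot's field of movement.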

Safety (a general definition)
Freedom from accidents or losses. It is often argued that there is no such thing as absolute safety, but instead a thing is safe if the attendant risks are judged acceptable. e.g. Cars cannot be made completely safe, but we can improve safety by having: Crumple zones Side impact bars Airbags ABS Indicators (e.g. reversing)

2.2 The Safety Lifecycle
The main concern of system safety is the management of hazards: their identification, evaluation, elimination and control. The safety lifecycle helps us to do this. When developing a safety-critical system we must:
• Identify the hazards and the accidents they may lead to.
• Assess the risk of these accidents.
• Reduce the risk, where possible and/or appropriate, by controlling the hazards that lead to the accidents.
• Set safety requirements for the system, and define how they will be met in the system design and implementation.
• Show that the safety requirements have been met in the system.
• Provide a safety case for the system.
The objective of the Safety Lifecycle is a systematic approach to safety.

IEE Code of Practice
Engineers and managers working on safety-related systems should:
• at all times take all reasonable care to ensure that their work and the consequences of their work cause no unacceptable risk to safety
• not make claims for their work which are untrue, or misleading, or which are not supported by a line of reasoning which is recognised in the particular field of application
• accept personal responsibility for all work done by them or under their supervision or direction
• take all reasonable steps to maintain and develop their competence by attention to new developments in science and engineering relevant to their field of activity; and encourage others working under their supervision to do the same
• declare their limitations if they do not believe themselves to be competent to undertake certain tasks, and declare such limitations should they become apparent after a task has begun
• take all reasonable steps to make their own managers, and those to whom they have a duty of care, aware of risks which they identify; and make anyone overruling or neglecting their professional advice formally aware of the consequent risks
• take all reasonable steps to ensure that those working under their supervision or direction are competent; that they are made aware of their own responsibilities; and that they accept personal responsibility for work delegated to them
Anyone responsible for human resource allocation should:
• take all reasonable care to ensure that allocated staff will be competent for the tasks to which they will be assigned
• ensure that human resources are adequate to undertake the planned tasks
• ensure that sufficient resources are available to provide the level of diversity appropriate to the intended integrity.

MOD Def Stan 00-55 and 00-56
• Requirements for safety-related software in defence equipment
• Safety management requirements for defence systems

The basic concepts of the safety lifecycle are:
• Build in safety, don’t add it on to a completed design. Safety considerations must be part of the initial stages of concept development and requirements.
• Deal with whole systems. Safety issues in systems tend to arise at component boundaries as components interact.
• Try to analyse the current system rather than relying on past experience and standards. The pace of change doesn’t allow for experience to accumulate or for proven designs to be used. The safety lifecycle attempts to anticipate or prevent accidents before they occur!
• Take a wider view of hazards than just component failures. Serious accidents have occurred while all components in a system were working exactly as they should. Many hazards arise in the requirements and design stages of development.
• Recognise the importance of trade-offs and conflicts in system design. Nothing is absolutely safe! Safety is not the only goal in system building. Safety acts as a constraint on possible system designs and interacts with other constraints such as cost, size, development time, etc.


3. Computer and Software Safety
3.1 Computer safety issues
Computer technology has revolutionised the design and development of systems.
• Few systems today are built without computers to provide control functions or to support design or both.
• Computers now control most safety-critical devices, replacing traditional hardware control systems.
• Many important computer systems are extremely complex.
• There is little legal control over who can build safety-critical systems.
• Accidents and financial disasters have already occurred.

Safety-critical software
• Performs or controls functions which, if executed erroneously or if they fail to execute, could directly inflict serious injury to people and/or the environment and cause loss of life.
• Performs or controls functions which are activated to prevent or minimise the effect of a failure of a safety-critical system. (This is sometimes called safety-related software.)
Software safety can be defined as:
• Features which ensure that
  - a product performs predictably under normal and abnormal conditions;
  - the likelihood of an unplanned event occurring is minimised and its consequences controlled and contained;
  thereby preventing accidental injury and death.

3.1.1 Computer-based control systems
Many complex systems use computers as part of the mechanism for controlling the system.
(1) Providing information or advice to a human controller upon request, perhaps by reading sensors directly.
[Diagram: Computer, Displays, Sensors, Operator, Controls, Actuators, Process being controlled]

[Diagram with annotations: Computer, Displays, Sensors, Operator, Controls, Actuators, Process being controlled]
• Sensors: monitor various important aspects of the process.
• Computer / Displays: give the info from sensors to the user in a readable form.
• Controls: the operator uses these to control the process.
• Actuators: the actual mechanisms which control the process.

e.g. Some control system for the electricity grid.
The operator has direct access to controls. The computer is there to recognise problems. What happens if the computer gives incorrect information? In this case, it may not matter if the computer gives the operator inappropriate advice. The operator has complete control of the system and has direct access to sensor information. However, you need an experienced, knowledgeable operator!

A 345 kV Three Phase Disconnect Switch Opening
(This is the world's 2nd largest Jacob's Ladder.)

A 500 kV Disconnect Switch (This is the NEW record holder for the world's biggest Jacob's Ladder!)

The operator has lost direct feedback on the state of the process.
(2) Interpreting data and displaying it to the controller, who makes the control decisions.
[Diagram: Displays, Sensors, Process being controlled, Actuators, Controls, Operator, Computer]
The operator no longer has direct access to the sensor output; the operator only has access to what the computer displays. The computer reads the sensor outputs, interprets them and displays its interpretation to the operator. Now, any misinterpretation by the computer will affect how the operator controls the process. The operator has lost direct feedback on the state of the process.

The computer is interpreting both sides of the control loop.
(3) Issuing commands directly, but with a human monitor of the computer's actions providing varying levels of input.
[Diagram: Displays, Controls, Operator, Sensors, Process being controlled, Actuators, Computer]
Here the operator controls the process and receives sensor information only via the computer. The computer is interpreting both sides of the control loop. Thus it is even more important that the computer works correctly, since the operator is totally isolated from the process.

(4) Eliminating the human from the control loop completely, and assuming complete control of the process. The operator may provide advice or high-level direction.
[Diagram: Operator, Computer, Sensors, Actuators, Process being controlled]
The operator only knows what the computer reveals about the state of the process (which may be nothing at all!), and has no access to the controls or sensors. In such automated systems the operator usually has to set the initial parameters for the process and start it running, but the computer then takes over control. The implications of computer error or failure are now very serious. Upon error/failure, the operator may have to step in and control the process with little or no information to deal with the situation.
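The difference between configurations (1) and (2) can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `Sensor` class, the 100-degree limit, and the 20-degree under-reporting bug in `faulty_interpret` do not come from any real system.

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    """Hypothetical temperature sensor returning degrees Celsius."""
    true_value: float

    def read(self) -> float:
        return self.true_value


def faulty_interpret(raw_celsius: float) -> float:
    """Display computer with an assumed bug: it under-reports by 20 degrees."""
    return raw_celsius - 20.0


def operator_decision(displayed: float, limit: float = 100.0) -> str:
    """The operator shuts down only if the value they can see exceeds the limit."""
    return "shut down" if displayed > limit else "continue"


sensor = Sensor(true_value=110.0)

# Configuration (1): the operator reads the sensor directly; the computer only advises.
direct = operator_decision(sensor.read())

# Configuration (2): the operator sees only the computer's interpretation.
via_computer = operator_decision(faulty_interpret(sensor.read()))

print(direct)        # shut down
print(via_computer)  # continue
```

With the same underlying process state, the operator in configuration (2) makes the wrong decision because the only view of the process is the computer's faulty interpretation.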

End 1/11

Computer-based control systems are obvious ways in which computers are used. But there are many circumstances where computers and software are involved in a much less obvious way.
• Software-generated data used to make decisions, e.g. air traffic control systems calculate plane headings for the controller.
• Software used for design, e.g. CAD systems to build bridges, electronic circuits, etc.
• Database software that stores and retrieves information, e.g. medical records, credit ratings.

3.2 Computer and Software Myths
If there are problems, why are computers so widely used? Because they are seen as providing numerous advantages over the mechanical and analogue systems they replace. Some supposed advantages of using computers are myths.
Myth 1 The cost of computers is lower than that of analogue or electromechanical devices.
There is a little truth in this – computer hardware is often cheaper than analogue/electromechanical devices. BUT the cost of designing, writing and certifying reliable and safe software, plus software maintenance costs, can be enormous.

Myth 2 Software is easy to change.
Superficially true: making changes to software is easy. BUT making changes without introducing errors is very hard.
• Any changes mean the software must be re-tested, re-verified (to check that it still meets its requirements) and re-certified for safety (as far as possible).
• Software tends to become more “brittle” as changes are made: i.e. the difficulty of making a change without introducing errors may actually increase over the lifetime of the software.

Myth 3 Computers provide greater reliability than the devices they replace.
In theory this is true. Computer hardware is highly reliable and computer software doesn’t “fail” in the normal engineering sense. BUT errors in software are (design) errors in the logic of what the software does, and they will affect system behaviour.
[Figure: bathtub curve]

Myth 4 Increasing software reliability will increase safety.
There are two views of reliability:
(1) Reliability in software is frequently defined as “compliance with the original requirements specification”. BUT in this case the myth is untrue because:
• Most safety-critical software errors can be traced back to errors in the requirements.
• Many software-related accidents have occurred whilst the software was doing exactly what the requirements specification said it should.
SO we might increase compliance with the specification, but that doesn’t do anything for safety if the specification is wrong!

(2) Reliability can also be viewed as “running without errors.”
This myth is untrue because we can remove errors that have no effect on safety, so the system is more reliable but not any safer. Safety and reliability, while partially overlapping, are not the same thing!
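View (1) of Myth 4 can be made concrete with a small sketch. It assumes a hypothetical pressure relief valve whose requirements specification states the wrong threshold; the names and both numeric limits are invented for illustration only.

```python
SPEC_RELIEF_THRESHOLD = 150.0  # value written in the (hypothetical) requirements spec
ACTUAL_SAFE_LIMIT = 120.0      # assumed true physical limit, which the spec got wrong


def relief_valve_open(pressure: float) -> bool:
    """Implements the specification exactly: open the valve above the threshold."""
    return pressure > SPEC_RELIEF_THRESHOLD


def complies_with_spec(pressure: float) -> bool:
    """Reliability in the sense of view (1): behaviour matches the specification."""
    return relief_valve_open(pressure) == (pressure > SPEC_RELIEF_THRESHOLD)


def is_safe(pressure: float) -> bool:
    """Safe if the valve is open whenever pressure exceeds the real limit."""
    return relief_valve_open(pressure) or pressure <= ACTUAL_SAFE_LIMIT


readings = [100.0, 130.0, 160.0]
print([complies_with_spec(p) for p in readings])  # [True, True, True]
print([is_safe(p) for p in readings])             # [True, False, True]
```

The software is fully compliant with its specification at every reading, yet at 130.0 the system is in an unsafe state: no amount of additional compliance testing would reveal this, because the error lies in the requirements.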

Myth 5 Testing and/or proving software (using formal verification techniques) can remove all the errors.
This is completely untrue!
• The large number of possible paths through most realistically sized programs makes exhaustive testing impossible.
• We can try to verify that software meets its original specification, but this won’t tell us if the specification itself is wrong. We have no formal techniques for checking the correctness of requirements specifications.
• We can do some formal verification of code (mostly mathematical proofs), but can only do this for small pieces of software.
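The path-explosion point can be illustrated with some rough arithmetic. This is only a sketch under simplifying assumptions: the program is modelled as a straight sequence of independent two-way branches (so n branches give 2^n paths), and the rate of one million tests per second is an arbitrary, optimistic figure.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def path_count(independent_branches: int) -> int:
    """Number of distinct paths through a sequence of independent two-way branches."""
    return 2 ** independent_branches


def years_to_test(paths: int, tests_per_second: int = 1_000_000) -> float:
    """Time to execute every path once at the assumed test rate, in years."""
    return paths / tests_per_second / SECONDS_PER_YEAR


for n in (10, 30, 64):
    paths = path_count(n)
    print(f"{n} branches: {paths} paths, {years_to_test(paths):.3g} years")
```

Even under these generous assumptions, 64 independent branches would take hundreds of thousands of years to cover exhaustively, and real programs also contain loops and data-dependent behaviour that multiply the number of paths further.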

Myth 6 Reusing software increases safety.
This is basically untrue. Unfortunately reusing software may actually reduce safety because:
• It gives rise to complacency – “It worked in the previous system so it’ll work in this new system”.
• Specific safety issues of the new system cannot have been considered in the original design of the reused software.

Myth 7 Computers reduce risk over mechanical systems.
This is partially true. Computers have the potential to reduce risk, particularly where they are used to automate hazardous/tedious jobs. Other debatable arguments that computers reduce risk are:
• They allow finer, more accurate control of a process/machine. Yes – they can check process parameters more often and perform calculations more quickly. But – this could lead to processes running with reduced safety margins.
• Automated systems allow operators to work further away from hazardous areas. Yes – in normal running of the process they do allow this. But – operators end up having to enter unfamiliar hazardous areas.
• Eliminating operators eliminates human errors. Yes – operator errors are reduced. But – we still have human design and maintenance errors.

End 4/11
