1 Developing Safety Critical Software: Fact and Fiction
John A McDermid

2 Overview
- Fact – costs and distributions
- Fiction – get the requirements right
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

3 Part 1
- Fact – costs and distributions
- Fiction – get the requirements right

4 Overview
- Fact – costs and distributions
- Fiction – get the requirements right
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

5 Costs and Distributions
- Examples of industrial experience
  - Specific example
  - Some more general observations
- Example covers
  - Cost by phase
  - Where errors are introduced
  - Where errors are detected, and their relationships

6 Process Phases
(chart: effort/cost by phase, from system specification via software engineering to system integration)

7 Error Introduction
(chart: where errors are introduced; FE = Functional Effect, min FE typically a data change)

8 Finding Requirements Errors
(pie chart: phases in which errors are detected, including system validation)
Requirements testing tends to find requirements errors.

9 Errors Introduced Here...
(chart annotation: result – high development cost)

10 Errors Introduced Here...
(chart annotation: ...are not found until here; result – high development cost)

11 Errors Introduced Here...
(chart annotation: ...are not found until here, even after following a safety critical development process)

12 Software and Money
- Typical productivity
  - 5 Lines of Code (LoC) per person day ≈ 1 kLoC per person year
  - Requirements to end of module test
- Typical avionics “box”
  - 100 kLoC
  - 100 person years of effort
  - Circa £10M for software, so ≈ £500M on a modern aircraft? (see the sketch below)
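
A quick back-of-envelope check of these numbers as a C sketch. The working-days and cost-per-person-year figures are assumptions implied by the slide, not stated in it:

    #include <stdio.h>

    int main(void)
    {
        const double loc_per_day   = 5.0;       /* requirements to end of module test */
        const double days_per_year = 200.0;     /* assumed working days: ~1 kLoC/person-year */
        const double box_loc       = 100000.0;  /* typical avionics "box": 100 kLoC */
        const double gbp_per_py    = 100000.0;  /* assumed cost per person-year */

        double person_years = box_loc / (loc_per_day * days_per_year);
        printf("effort: %.0f person-years\n", person_years);                   /* 100 */
        printf("cost:   about GBP %.0fM\n", person_years * gbp_per_py / 1e6);  /* ~10 */
        return 0;
    }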

13 US Aircraft Software Dependence
(chart: year vs % of functions performed by software)
Source: DoD Defense Science Board Task Force on Defense Software, November 2000

14 Increasing Dependence
- Software is often the determinant of function
- Software operates autonomously
  - Without opportunity for human intervention, e.g. Mercedes Brake Assist
- Software is affected by other changes
  - e.g. new weapons fit on EuroFighter
- Software has high levels of authority

15 (figure caption) Inappropriate CofG control in the fuel system can reduce the fatigue life of the wings

16 Growing Dependency
- The problem is growing
  - Now about a third of aircraft development costs
  - An increasing proportion of car development costs
    - Around 25% of the capital cost of new cars is in electronics
  - The problem is made more visible by the rate of improvement in tools for “mainstream” software development

17 Growth of Airborne Software
(chart: projected growth; approx £1.5B at current productivity and costs)

18 The Problem – Size Matters
(chart: probability of a software project being cancelled vs size in function points)
1 function point = 80 SLOC of Ada; 1 function point = 128 SLOC of C
Source: Capers Jones, Becoming Best in Class, Software Productivity Research, 1995 briefing

19 Is Software Safety an Issue?
- Software has a good track record
  - A few high profile accidents
    - Therac 25
    - Ariane 501
    - Cali (strictly data, not software)
  - Analysis of 1,100 “computer related deaths”
    - Only 34 attributed to software

20 Chinook – Mull of Kintyre
(photo) Was this caused by FADEC software?

21 But Don’t be Complacent
- Many instances of “pilot error” are system assisted
- Software failures typically leave no trace
- Increasing software complexity and authority
- Can’t measure software safety (no agreement)
- Unreliability of commercial software
- Cost of safety critical software

22 Summary
- Safety critical software is a growing issue
  - Software-based systems are the dominant source of product differentiation
  - Starting to become a major cost driver
  - Starting to become the driver (or drag) on product development
    - Can’t cancel, have to keep on spending!
  - Not a major contributor to fatal accidents
    - Although many incidents

23 Overview
- Fact – costs and distributions
- Fiction – get the requirements right
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

24 Requirements Fiction
- Fiction stated
  - Get the requirements right, and the development will be easy
- Facts
  - Getting requirements right is difficult
  - Requirements are the biggest source of errors
  - Requirements change
  - Errors occur at organisational boundaries

25 Embedded Systems
- Computer system embedded in a larger engineering system
- Requirements come from
  - “Flow down” from the system
  - Design decisions (commitments)
  - Safety and reliability analyses
    - Derived safety requirements (DSRs)
    - Fault management/accommodation
      - As much as 80% for control applications

26 Almost Everything on One Picture
(diagram; NB based on Parnas’ four variable model)

27 Almost Everything on One Picture
(diagram continued)

28 (figure-only slide)

29 (figure-only slide)

30 Types of Layer
- Some layers have design meaning (interface sketch below)
  - Abstraction from computing hardware
    - Time in ms from reference, or ...
    - Not interrupts or bit patterns from clock hardware
  - The “System” HAL
    - “Raw” sensed values, e.g. pressure in psia
    - Not bit patterns from analogue to digital converters
  - FMAA to Application
    - Validated values of platform properties
  - May also have computational meaning
    - e.g. a call to the HAL forces a scheduling action
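
A minimal sketch of what “design meaning” at each layer might look like as C interfaces. All names and types here are invented for illustration (stubs stand in for real hardware), not taken from the deck:

    #include <stdint.h>
    #include <stdio.h>

    /* "System" HAL: time in ms from a reference point, not clock bit patterns */
    static uint32_t hal_time_ms(void) { return 0; /* stub for clock hardware */ }

    /* Sensor access via the HAL: "raw" value in psia, not A/D bit patterns */
    static double hal_p0_raw_psia(void) { return 14.7; /* stub for the A/D */ }

    /* FMAA-to-application boundary: a validated platform property */
    typedef struct { double psia; int valid; } validated_p0_t;

    static validated_p0_t fmaa_p0(void)
    {
        validated_p0_t v = { hal_p0_raw_psia(), 1 };  /* validation logic elided */
        return v;
    }

    int main(void)
    {
        validated_p0_t p0 = fmaa_p0();
        printf("t=%u ms, P0=%.2f psia, valid=%d\n", hal_time_ms(), p0.psia, p0.valid);
        return 0;
    }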

31 Commitments
- Development proceeds via a series of commitments
  - A design decision which can only be revoked at significant cost
  - Often associated with an architectural decision or choice of component
    - Use of triplex redundancy, choice of pump, power supply, etc.
  - Commitments can be functional or physical
    - Most common to make physical commitments

32 Derived Requirements
- Commitments introduce derived requirements (DRs)
  - Choice of pump gives DRs for the control algorithm and iteration rate, and also requirements for initialisation, etc.
  - Also derived safety requirements (DSRs), e.g. detection and management of sensor failure for safety

33 System Level Requirements
- Allocated requirements
  - System level requirements which come from the platform
  - May be (slightly) modified due to design commitments, e.g.
    - Platform: control engine thrust to within ± 0.5% of demanded
    - System: control EPR or N1 to within ± 0.5% of demanded

34 Stakeholder Requirements
- Direct requirements from stakeholders, e.g.
  - The radar shall be able to detect targets travelling at up to Mach 2.5 at 200 nautical miles, with 98% probability
  - In principle allocated from the platform
    - In practice often stated in system terms
  - Need to distinguish legitimate requirements from “solutioneering”
    - Legitimacy depends on the stakeholder, e.g. CESG and cryptos

35 Requirements Types
- Main requirement types (examples sketched below)
  - Invariants, e.g.
    - Forward and reverse thrust will not be commanded at the same time
  - Functional – transform inputs to outputs, e.g.
    - Thrust demand from thrust-lever resolver angle
  - Event response – action on event, e.g.
    - Activate ATP on passing a signal at danger
  - Non-functional (NFR) – constraints, e.g.
    - Timing, resource usage, availability
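
Hypothetical C fragments showing how the first three types might surface in code; the transform and all names are placeholders invented for illustration, not real engine or rail data:

    #include <assert.h>

    /* Invariant: forward and reverse thrust never commanded together */
    static void check_thrust_invariant(int fwd_cmd, int rev_cmd)
    {
        assert(!(fwd_cmd && rev_cmd));
    }

    /* Functional: thrust demand from thrust-lever resolver angle
       (placeholder linear transform; real maps are engine-specific) */
    static double thrust_demand(double resolver_angle_deg)
    {
        return resolver_angle_deg / 90.0;   /* fraction of max thrust, say */
    }

    /* Event response: action on passing a signal at danger */
    static void on_signal_passed_at_danger(void)
    {
        /* activate ATP (automatic train protection) here */
    }

    int main(void)
    {
        check_thrust_invariant(1, 0);
        (void)thrust_demand(45.0);
        on_signal_passed_at_danger();
        return 0;
    }

NFRs, by contrast, constrain the whole design (timing, resources, availability) and do not reduce to a single code fragment.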

36 Changes to Types
- Note requirement types can change – NFR to functional
  - System: achieve < 10⁻⁵ per hour unsafe failures
  - Software: detect failure modes x, y and z of the pressure sensor P30 with 99% coverage, and mitigate by ...
- Requirements notations/methods must be able to reflect requirement types

37 Requirements Challenges
- Even if systems requirements are clear, software requirements
  - Must deal with quantisation (sensors)
  - Must deal with temporal constraints (iteration rates, jitter)
  - Must deal with failures
- Systems requirements are often tricky
  - Open-loop control under failure
  - Incomplete understanding of physics

38 Requirements Errors
- Project data suggests
  - Typically more than 70% of errors found after unit test are requirements errors
  - F22 (and other data sets) put requirements errors at 85%
- Finding errors drives change
  - The later they are found, the greater the cost
  - Some data, e.g. F22, suggests 3 LoC are written for every one delivered

39 The Certainty of Change
- Change is mainly due to requirements errors
  - High cost due to reverification in the presence of dependencies
(chart: % change per module; the majority of modules are stable, but cumulative change is ~20% – may verify all code 3 times!)

40 Requirements and Organisations
- Requirements errors are often based on misinterpretations (“it’s obvious that ...”)
  - Thus errors are more likely to happen at organisational/cultural boundaries
    - Systems to software, safety to software ...
  - Study at NASA by Robyn Lutz
    - 85% of requirements errors arose at organisational boundaries

41 Summary
- Getting requirements right is a major challenge
  - Software is deeply embedded
    - Discretisation, timing etc. are an issue
  - Physics not always understood
- Requirements (genuinely) change
  - The notion that one can get requirements right is simplistic
  - The notion of “correct by construction” is optimistic

42 Part 2
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

43 Overview
- Fact – costs and distributions
- Fiction – get the requirements right
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

44 Functionality Fiction
- Fiction stated
  - Get the functionality right, and the rest is easy
- Facts
  - Functionality doesn’t drive design
    - Non-Functional Requirements (NFRs) are critical
    - Functionality isn’t independent of NFRs
  - Fault management is a major aspect of complexity

45 Functionality and Design
- Functionality
  - System functions allocated to software
  - Elements of REQ which end up in SOFTREQ
    - NB, most of them
  - At the software level, requirements have to allow for properties of sensors, etc.
    - Consider an aero engine example

46 Engine Pressure Block (diagram)

47 Engine Pressure Sensor
- The aero engine measures P0
  - Atmospheric pressure
  - A key input to fuel control, etc.
- Example input: P0 Sens
  - Byte from the A/D converter
  - Resolution: 1 bit ≈ 0.055 psia
  - Base = 2, 0 = low (high value ≈ 16)
  - Update rate = 50 ms

48 Pressure Sensing Example
- Simple requirement
  - Provide a validated P0 value to other functions and the aircraft
- Output data item: P0 Val
  - 16 bits
  - Resolution: 1 bit ≈ 0.00025 psia
  - Base = 0, 0 = low (high value ≈ 16.4)

49 Example Requirements
- Simple functional requirement
  - RS1: P0 Val shall be provided within 0.03 bar of the sensed value
  - R1: P0 Val = P0 Sens [± 0.03] (software level)
  - Note the simple algorithm (sketched in C below):
    P0 Val = (P0 Sens * 0.055 + 2)/0.00025
    P0 Sens = 0 → P0 Val = 8000 = 0001 1111 0100 0000 binary
    P0 Sens = 1111 1111 (= 16.025 psia) → P0 Val = 64100 = 1111 1010 0110 0100 binary
  - Does R1 meet RS1? Does the algorithm meet R1?
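
The scaling algorithm above as a minimal C sketch. The encoding constants are the slide’s; the round-to-nearest choice is an assumption:

    #include <stdint.h>
    #include <stdio.h>

    /* input:  8-bit,  1 bit = 0.055 psia,   base 2 psia
       output: 16-bit, 1 bit = 0.00025 psia, base 0 psia */
    static uint16_t p0_val_from_sens(uint8_t p0_sens)
    {
        double psia = (double)p0_sens * 0.055 + 2.0;
        return (uint16_t)(psia / 0.00025 + 0.5);   /* round to nearest output bit */
    }

    int main(void)
    {
        printf("%u\n", p0_val_from_sens(0));    /* 8000  = 0001 1111 0100 0000 */
        printf("%u\n", p0_val_from_sens(255));  /* 64100 = 1111 1010 0110 0100 */
        return 0;
    }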

50 A Non-Functional Requirement
- Assume duplex sensors
  - P0 Sens1 and P0 Sens2
- System level
  - RS2: no single point of failure shall lead to loss of function (assume P0 Val is covered by this requirement)
    - This will be a safety or availability requirement
    - NB in practice there may be different sensors wired to different channels, and cross-channel comms

51 Software Level NFR
- Software level
  - R2: if |P0 Sens1 - P0 Sens2| < 0.06
        then P0 Val = (P0 Sens1 + P0 Sens2)/2
        else P0 Val = 0
- Is R2 a valid requirement? (see the sketch below)
  - In other words, have we stated the right thing?
- Does R2 satisfy RS2?
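
R2 as a C fragment, working directly in psia. The 0.06 threshold is from the slide; returning 0 as an out-of-band “failed” value is exactly the design choice the validity question is probing:

    #include <math.h>

    double p0_val_r2(double p0_sens1, double p0_sens2)
    {
        if (fabs(p0_sens1 - p0_sens2) < 0.06)
            return (p0_sens1 + p0_sens2) / 2.0;   /* sensors agree: average them */
        return 0.0;   /* "failed" marker - but 0 psia is also a physical pressure */
    }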

52 Temporal Requirements
- Timing is often an important system property
  - It may be a safety property, e.g. sequencing in weapons release
- System level
  - RS3: the validated pressure value shall never lag the sensed value by more than 100 ms
    - NB not uncommon, to ensure quality of control

53 Software Level Timing
- Software level requirement, assuming scheduling on 50 ms cycles
  - R3: P0 Val(t) = P0 Sens(t-2) [± 0.03]
  - t is quantised in units of 50 ms, representing cycles
- Is R3 a valid requirement?
- Does R3 satisfy RS3?
  - NB data on processor timing is needed to validate (see the sketch below)
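
A sketch of how R3 might be met with a two-deep history buffer on a 50 ms cyclic schedule. Whether the 100 ms bound (RS3) actually holds still depends on processor timing data:

    static double p0_hist_t1, p0_hist_t2;   /* samples from cycles t-1 and t-2 */

    /* called exactly once per 50 ms minor cycle */
    double p0_val_r3(double p0_sens_t)
    {
        double out = p0_hist_t2;             /* P0 Val(t) = P0 Sens(t-2) */
        p0_hist_t2 = p0_hist_t1;
        p0_hist_t1 = p0_sens_t;
        return out;
    }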

54 Timing and Safety
- Software level
  - R4: if |P0 Sens1(t) - P0 Sens2(t)| < 0.06
        then P0 Val(t+1) = (P0 Sens1(t) + P0 Sens2(t))/2
        else if |P0 Sens1(t) - P0 Sens1(t-1)| < |P0 Sens2(t) - P0 Sens2(t-1)|
             then P0 Val(t+1) = P0 Sens1(t)
             else P0 Val(t+1) = P0 Sens2(t)
- What does R4 respond to (can you think of an RS4)? (sketched below)
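
R4 as a C fragment: on disagreement it trusts the channel whose reading changed least since the previous cycle, which requires the t-1 values to be held as state (as slide 56 notes):

    #include <math.h>

    double p0_val_r4(double s1_t, double s2_t, double s1_prev, double s2_prev)
    {
        if (fabs(s1_t - s2_t) < 0.06)
            return (s1_t + s2_t) / 2.0;                  /* sensors agree: average */
        if (fabs(s1_t - s1_prev) < fabs(s2_t - s2_prev))
            return s1_t;                                 /* sensor 1 moved less */
        return s2_t;                                     /* otherwise trust sensor 2 */
    }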

55 Requirements Validation
- Is R4 a valid requirement?
  - Is R4 “safe” in the system context? (Assume that misleading values of P0 could lead to a hazard, e.g. a thrust roll-back on take off)
- Does R4 satisfy RS3?
- Does R4 satisfy RS2?
- Does R4 satisfy RS1?

56 Real Requirements
- The example is still somewhat simplistic
  - Need to store sensor state, i.e. knowledge of what has failed
- Typically timing, safety, etc. drive the detailed design
  - Aspects of requirements, e.g. error bands, depend on the timing of code
  - Requirements involve trade-offs between, say, safety and availability

57 Requirements and Architecture
- NFRs also drive the architecture
  - Failure rate 10⁻⁶ per hour
    - Probably just duplex (especially if fail stop)
    - Functions for cross comms and channel change
  - Failure rate 10⁻⁹ per hour
    - Probably triplex or quadruplex
    - Changes in redundancy management
- NB a change in failure rate affects low level functions

58 Quantification
- The “system level” functionality is in the minority
  - Typically over half is fault management
  - EuroFighter example
    - FCS ≈ 1/3 MLoC
    - Control laws ≈ 18 kLoC
- Note: very hard to validate
  - 777 flight incident in Australia due to an error in fault management, plus a software change

59 Boeing 777 Incident near Perth
- Problem caused by the Air Data Inertial Reference Unit (ADIRU)
  - The software contained a latent fault which was revealed by a change
- Timeline
  - June 2001: accelerometer #5 fails with erroneous high output values; the ADIRU discards its output values
  - A power cycle of the ADIRU occurs each time the aircraft electrical system is restarted
  - Aug 2005: accelerometer #6 fails; a latent software error allows use of the previously failed accelerometer #5

60 (figure-only slide)

61 Summary
- Functionality is important
  - But not the primary driver of design
- Key drivers of design
  - Safety and availability
    - Turn into fault management at the software level
  - Timing behaviour
- Functionality is not independent of NFRs
  - Requirements change to reflect NFRs

62 Overview
- Fact – costs and distributions
- Fiction – get the requirements right
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

63 Abstraction Fiction
- Fiction stated
  - Careful use of abstraction will address the problems of requirements etc.
- Facts
  - Most forms of abstraction don’t work in embedded control systems
    - State abstraction is of some use
  - The devil is in the detail

64 Data Abstraction
- Most data is simple
  - Boolean, integer, floating point
  - Complex data structures are rare
    - May exist in a maintenance subsystem (e.g. records of fault events)
  - Systems engineers work in low-level terms, e.g. pressures, temperatures, etc.
    - Hence requirements are in these terms

65 Control Models are Low Level (diagram)

66 Looseness
- A key objective is to ensure that requirements are complete
  - Specify behaviour under all conditions
    - Normal behaviour (everything working)
    - Fault conditions (single faults, and combinations)
    - Impossible conditions
  - So the design is robust against incompletely understood requirements/environment

67 Despatch Requirements
- Can despatch (use) the system “carrying” failures
  - Despatch analysis is based on a Markov model (toy example below)
  - Evaluate the probability of being in a non-despatchable state, e.g. only one failure from a hazard
  - A link between the safety/availability process and software design
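
A toy despatch analysis in C; the states and failure rate are invented for illustration. States: 0 = no failures, 1 = one failure (despatchable, but one failure from hazard), 2 = non-despatchable. Stepping the chain per flight hour:

    #include <stdio.h>

    int main(void)
    {
        double p[3] = { 1.0, 0.0, 0.0 };   /* start with no failures */
        const double f = 1e-4;             /* assumed failure rate per hour */

        for (int hour = 0; hour < 1000; hour++) {
            double p0 = p[0] * (1.0 - f);
            double p1 = p[0] * f + p[1] * (1.0 - f);
            double p2 = p[1] * f + p[2];   /* non-despatchable is absorbing here;
                                              repair between flights is ignored */
            p[0] = p0; p[1] = p1; p[2] = p2;
        }
        printf("P(non-despatchable after 1000 h) = %.2e\n", p[2]);
        return 0;
    }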

68 Fault Management Logic
- Fault-accommodation requirements may use a four valued logic
  - Working (w), undetected (u), detected (d), and confirmed (c)
  - The table illustrates “logical and” ([.]); used for analysis (lookup-table sketch below)

    .  w  u  d  c
    w  w  u  d  c
    u  u  u  d  c
    d  d  d  d  c
    c  c  c  c  c
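
The table drops straight into a lookup array; a C sketch (the enum ordering is an assumption):

    #include <stdio.h>

    typedef enum { W, U, D, C } fstatus;   /* working, undetected, detected, confirmed */

    static const fstatus f_and[4][4] = {
        /*        W  U  D  C */
        /* W */ { W, U, D, C },
        /* U */ { U, U, D, C },
        /* D */ { D, D, D, C },
        /* C */ { C, C, C, C },
    };

    int main(void)
    {
        printf("%d\n", f_and[U][D]);   /* undetected AND detected -> D (2) */
        return 0;
    }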

69 Example Implementation

    .  w  d  c
    w  w  d  c
    d  d  d  c
    c  c  c  c

70 State Abstraction
- Some state abstraction is possible
  - Mainly low-level state to operational modes
- Aero engine control
  - Want to produce thrust proportional to demand (thrust lever angle in the cockpit)
  - Can’t measure thrust directly
  - Can use various “surrogates” for thrust
    - Work with the best value, but reversionary models

71 Thrust Control
- Engine pressure ratio (EPR) – ratio between atmosphere and exhaust pressures
  - The best approximation to thrust
  - Depends on P0
- Low level state models the “health” of the P0 sensor
  - If P0 fails, revert to using N1 (fan speed)
- Have control modes
  - EPR, N1, etc., which abstract away from details of sensor fault state (sketched below)
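
A minimal fragment of the reversion logic, with invented names: the application works in terms of a control mode, not the underlying sensor fault state:

    typedef enum { MODE_EPR, MODE_N1 } thrust_mode_t;

    /* EPR needs a valid P0; on P0 sensor failure revert to the N1 surrogate */
    thrust_mode_t select_thrust_mode(int p0_sensor_healthy)
    {
        return p0_sensor_healthy ? MODE_EPR : MODE_N1;
    }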

72 Summary
- The opportunity for abstraction is much more limited than in “IT” systems
  - Hinders many classical approaches
- Abstraction is of some value
  - Mainly state abstraction, relating low-level state information, e.g. sensor “health”, to system level control modes
  - NB formal refinement, a la B, is helped by this, as there is little data refinement

73 Overview
- Fact – costs and distributions
- Fiction – get the requirements right
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

74 “Bug Free” Fiction
- Fiction stated
  - Safety critical code must be “bug free”
- Facts
  - It is hard to correlate fault density and failure rate
  - <1 fault per kLoC is pretty good!
  - Being “bug free” is unrealistic, and there is a need to “sentence” faults

75 Close to Fault Free?
- DO 178A Level 1 software (engine controller) – now would be DAL A
  - Natural language specifications and macro-assembler
  - Over 20,000,000 hours without hazardous failure
  - But on version 192 (last time I knew)
    - Changes “trims” to reflect hardware properties

76 Pretty Buggy
- DO 178B Level A software (aircraft system)
  - Natural language, control diagrams and a high level language
  - 118 “bugs” found in the first 18 months, 20% critical
  - Flight incidents but no accidents
  - Informally “less safe” than the other example, but still flying, still no accidents

77 Fault Density
- So far as one can get data
  - <1 flaw per kLoC for safety critical software is pretty good
  - Commercial is much worse, may be as high as 30 faults per kLoC
  - Some “extreme” cases
    - Space Shuttle – 0.1 per kLoC
    - Praxis system – 0.04 per kLoC
  - But will a hazardous situation arise?

78 Faults and Failures
- Why doesn’t software “crash” more often?
  - Paths miss “bugs” as they don’t get critical data
  - Testing “cleans up” common paths
  - Also “subtle faults” which don’t cause a crash
- NB IBM OS data
  - 1/3 of failures were “3,000 year events”

79 Commercial Software
- Examples of data dependent faults?
  - Loss of availability is acceptable
  - Most SCS have to operate through faults
    - Can’t “fail stop” – even reactor protection software needs to run circa 24 hours for heat removal

80 Retrospective Analysis
- Retrospective analysis of a US civil product for UK military use
  - Analysis of over 500 kLoC, in several languages
  - Found 23 faults per kLoC, 3% safety critical
  - The vast majority not safety critical
    - NB most of the 3% related to assumptions, i.e. were requirements issues

81 Find and Fix
- If a fault is found it may not be fixed
  - First it will be “sentenced”
    - If not critical, it probably won’t be fixed
  - Potentially critical faults will be analysed
    - Can it give rise to a problem in practice?
    - If the decision is not to change, document the reasons
  - Note: changes may bring (unknown) faults
    - e.g. Boeing 777 near Perth

82 Perils of Change
(diagram: module dependency)

83 Summary
- Probably no safety critical software is fault free
  - Less than 1 fault per kLoC is good
  - Hard to correlate fault density with failure rate (especially unsafe failures)
- In practice
  - Sentence faults, and change only if there is a net benefit
  - Need to show the presence of faults
    - To decide if there is a need to remove them

84 Overview
- Fact – costs and distributions
- Fiction – get the requirements right
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

85 Summary of the Summaries
- Safety critical software
  - Has a good track record
  - Increased dependency, complexity, etc. mean that this may not continue
- Much of the difficulty is in requirements
  - Partly a systems engineering issue
  - Many of the problems arise from errors in communication
  - Classical CS approaches have limited utility

86 Research Directions (1)
- Advances may come at architecture
  - Improve notations to work at architecture level and implement via code generation
  - Develop approaches, e.g. good interfaces, product lines, to ease change
  - Focus on V&V, recognising that the aim is fault-finding
- AADL is an interesting development

87 Research Directions (2)
- Advances may come at requirements
  - Work with systems engineering notations
    - Improve them to address issues needed for software design and assessment, NB PFS
    - Produce better ways of mapping to architecture
    - Try to find ways of modularising, to bound the impact of change, e.g. contracts
  - Focus on V&V, e.g. simulation
    - Developments of Parnas/Jackson ideas?

88 Research Directions (3)
- Work on automation, especially for V&V
  - Design remains creative
  - V&V is 50% of life-cycle cost, and can be automated
  - Examples include
    - Auto-generation of test data and test oracles
    - Model-checking consistency/completeness
  - The best way to apply “classical” CS?

89 Coda
- Safety critical software research
  - Always “playing catch up”
  - Aspirations for applications are growing fast
- To be successful
  - Focus on the “right problems”, i.e. where the difficulties arise in practice
  - If possible work with industry, to try to provide solutions to their problems

