Avoiding the Destiny of Failure in Large Software Systems
  John Cosgrove, PE, CDP, CFC
  Cosgrove Computer Systems Inc.
  (310) 823-9448
  JCosgrove@Computer.org
  www.CosgroveComputer.com
  Los Angeles ACM
  Loyola Marymount University – University Hall
  December 7, 2005
  Responding to Risk in Software Systems
  Copyright 2001-2005 CCS Inc.
Contents
  The Problem
  Seeking a Solution
  Lessons-Learned
  Summary
  Bibliography
The Problem
  Most Software Systems Fail
  Planning Revisited
  Integrated Risk Management
  New World of Regulation
  Future of Software Engineering
Most Software Systems Fail
  Most SW projects fail
    – Bigger fail more
  Failure is not inevitable
    – Notable exceptions exist
  Poor natural visibility typical w/ SW
    – Effective planning & status assessment critical
  Risk management integral to planning
    – Risk assessment must include economics of failure
  Source: Humphrey, CrossTalk 2005
Planning Revisited
  Plans must involve all the responsible stakeholders
    – Developers, customers, end users, etc.
    – Win-Win or Lose-Lose
  Development cycle policy must be explicit
    – Critical drivers must be stated
    – Independent variables: schedule, cost, performance or quality
        Choose one or two; the others are dependent variables
        Dependent variables vary!
  Planning is never complete
    – Rule: never fail a plan, because the plan changes to reality 1st
  Source: Boehm
Integrated Risk Management
  True risk management is an element of planning
  Flows from unknowns identified in planning
  Two broad categories
    – Catastrophic or unacceptable risk
        Treat as requiring insurance in some form
    – Conventional risk exposure
        Classical risk mitigation steps
  Both demand $$$ quantification of failure
    – Cost of failure drives budgets
New World of Regulation
  Sarbanes-Oxley (SOX)
    – Enforces accountability for reporting correctness
    – Software projects are investment assets
    – Correctness, control mechanisms, security are auditable
    – Non-compliance penalties include criminal & civil
  "If we managed finances in companies the way we manage software then somebody would go to prison." -- Armour
Future of Software Engineering
  Functional size & complexity increasing rapidly
    – Size increase ~ 10x every 5 years
    – Scale matters in all engineered systems
        Humphrey's analogy with the speed of transportation systems
  "Increasingly software [i.e., computer systems].. crucial part of the products and services in almost all industries. Most computer systems.. interconnected.... more internal and external threats … In.. past,.. assumed a friendly.. environment."
  Source: Humphrey, SEI/CMU 2002
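To make the "10x every 5 years" figure concrete, the short sketch below converts it to an equivalent compound annual rate and projects it forward. It assumes smooth exponential growth, which the slide does not claim; the function names are illustrative only.

```python
def annual_growth_factor(multiple: float, years: float) -> float:
    """Per-year factor that compounds to `multiple` over `years` years."""
    return multiple ** (1.0 / years)


def size_multiple(years_elapsed: float, multiple: float = 10.0,
                  period: float = 5.0) -> float:
    """Relative size after `years_elapsed`, given `multiple` growth per `period`."""
    return multiple ** (years_elapsed / period)


# 10x every 5 years compounds at roughly 58% per year.
print(f"Annual growth factor: {annual_growth_factor(10, 5):.3f}")
# Under the same assumption, a decade gives a 100-fold increase.
print(f"Growth over 10 years: {size_multiple(10):.0f}x")
```

Under this assumption a system roughly doubles in functional size every year and a half, which is why practices that worked at one scale may not survive the next.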
Seeking a Solution
  Significant Differences – Software
  Why Software is Valuable
  Software Creation
  Failure Management
  Minimizing Failure Costs
Significant Differences - Software
  Requirements are seldom complete - IKIWISI (I'll Know It When I See It)
    – "With software the challenge is to balance the unknowable nature of the requirements with the business need for a firm contractual relationship." -- Watts Humphrey
  Most engineered systems are defined by comprehensive plans and specifications prior to startup. Few software-intensive systems are.
  Most software projects are challenged or fail completely*
    – Over $6M: less than 10% succeed; $1M: ~50%
    – Primary cause: no realistic planning by developers
    – No natural visibility of progress or completion status
  * Humphrey, Why Big Software Projects Fail
Why Software is Valuable
  Value is created by the abstraction of productive knowledge
    – Development is a social learning process
  Economic value comes from impact on useful activity
    – Efficient automotive ignitions
  Value is increased when the knowledge is readily adaptable
    – McDonald's hamburger franchises also work well in China
        Franchises show how preserved abstractions can be valuable
  Software engineers are ethically obligated to optimize value
  Source: Baetjer
Software Creation
  What is a social learning process?
  Ignorance -> useful, reproducible knowledge
  Orders of Ignorance (OI) – five levels
    – 0th – Useful knowledge, have the answer
    – 1st – Know the question, but not the answer
    – 2nd – Unknown number of unknowns, apply process
    – 3rd – 2-OI but no process to begin
    – 4th – 3-OI plus ignorance of ignorance (meta-ignorance)
  Source: Armour, Five Orders of Ignorance, C-ACM 10/00
Failure Management
  "..as if the concepts of risk and failure are somehow disconnected... purpose of development.. do something not done before."
  90% success means 1 in 10 failure
    – Is the failure tolerable? Must make it tolerable (e.g., insurance)?
    – Calculate $ likelihood of failure (e.g., 10% of cost)
  Source: Armour, Management of Risk, C-ACM 3/05
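One way to read "calculate $ likelihood of failure" is as the classical risk-exposure product: probability of failure times the cost incurred if the failure happens. The sketch below is a minimal illustration of that reading; the function name and the dollar figure are hypothetical and do not come from the slides.

```python
def expected_failure_cost(p_failure: float, cost_of_failure: float) -> float:
    """Risk exposure in dollars: probability of failure x cost if it fails."""
    if not 0.0 <= p_failure <= 1.0:
        raise ValueError("p_failure must be between 0 and 1")
    return p_failure * cost_of_failure


# Hypothetical project: 90% chance of success, $2M lost if it fails.
exposure = expected_failure_cost(0.10, 2_000_000)
print(f"Risk exposure: ${exposure:,.0f}")
```

The resulting figure is a planning number rather than a prediction: it gives the 1-in-10 failure a dollar value that can be weighed against the cost of insurance or mitigation.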
Minimizing Failure Costs
  Failure costs are never zero
    – Making costs explicit improves planning
  Steps to minimize
    – Make all catastrophic risks tolerable
        Rationale behind insurance: life, property, etc.
        Project example: alternate, plan-B solution
    – Quantify risk exposure in terms of failure costs
        Rationale behind testing to avoid costly field retrofits
        Failure cost exposure drives budgets for mitigation
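The claim that failure cost exposure drives mitigation budgets can be sketched as a simple comparison: a mitigation step such as extra testing pays for itself when the exposure it removes exceeds what it costs. All names and figures below are hypothetical, chosen only to illustrate the arithmetic.

```python
def exposure(p_failure: float, failure_cost: float) -> float:
    """Risk exposure in dollars: probability of failure x cost of failure."""
    return p_failure * failure_cost


def mitigation_payoff(p_before: float, p_after: float,
                      failure_cost: float, mitigation_cost: float) -> float:
    """Net saving from a mitigation: exposure removed minus what it costs."""
    removed = exposure(p_before, failure_cost) - exposure(p_after, failure_cost)
    return removed - mitigation_cost


# Hypothetical: a field retrofit would cost $5M; $150k of additional testing
# is assumed to cut the chance of needing one from 10% to 2%.
net = mitigation_payoff(0.10, 0.02, 5_000_000, 150_000)
print(f"Net expected saving from testing: ${net:,.0f}")
```

A positive result argues for funding the mitigation; a negative one suggests the budget is better spent elsewhere, or that the risk should be carried or insured instead.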
Lessons-Learned
  Air Traffic Control Failure
  New FBI Software Unusable
  Unsafe Automotive Ignition
  Framework for Dependable Designs
  Dependable Ignition System Example
Air Traffic Control Failure
    – LA regional system failed on 9/14/2004 for 3.5 hours
        Backup system also failed
    – Many mid-air collision near misses with 800+ A/C
    – Improperly blamed on human error
        Fault lay with known glitch avoided by manual Ops
        Fault introduced with year-ago system re-host
        Only 1 of 21 centers has fault corrected
    – Questions: testing, fault tolerance policy, etc.
        Backup system failed immediately???
New FBI Software Unusable
  New anti-terrorism software – Virtual Case File
    ".. further delays in four-year effort.. $half-billion Upgrade … will not work.."
    – ".. render worthless much of current $170M contract... may have outlived its usefulness.. before.. it was.. implemented"
    "..officials thought..get it right the 1st time.. That never happens with anybody."
  Source: LA Times, 1/13/05
Unsafe Automotive Ignition
  Engine died when accelerating into traffic
    – Intermittent sensor wire
    – Ignition control software failed with open circuit
  Hazard analysis missed HW-SW interaction
  Incomplete SW system safety requirements
    – Interface failure protection, from hazard analysis
        Deterministic values for common failures: open, short
    – Control algorithm must be protected
    – Detect failures and substitute safe values
  Recent examples: LA Times 5/05 – Prius..
Framework for Dependable Designs
  Defend engineering process in court*
  Set bounds for system - three states
    – Operating -- Envelope for normal operations
    – Non-Operating -- Normal not possible
    – Exception -- Recover to normal after anomaly
        Normal may be degraded-normal
  Mishaps occur during state transitions
    – IDs SW system dependability requirements
    – Suggests mishap mitigation -- HW or SW
  * Source: Lawson
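The three-state framing lends itself to an explicit state machine in which every transition is checked and recorded, since transitions are where the slide says mishaps occur. The sketch below is one possible rendering; the allowed-transition table and class names are assumptions made for illustration, not part of Lawson's methodology as cited.

```python
from enum import Enum


class State(Enum):
    OPERATING = "operating"          # within the envelope for normal operations
    NON_OPERATING = "non-operating"  # normal operation not possible
    EXCEPTION = "exception"          # recovering to normal after an anomaly


# Assumed transition table (hypothetical): each edge is a point where a
# dependability requirement and a mishap mitigation should be identified.
ALLOWED = {
    State.NON_OPERATING: {State.OPERATING},
    State.OPERATING: {State.EXCEPTION, State.NON_OPERATING},
    State.EXCEPTION: {State.OPERATING, State.NON_OPERATING},
}


class BoundedSystem:
    def __init__(self) -> None:
        self.state = State.NON_OPERATING
        self.history: list[tuple[State, State]] = []

    def transition(self, new_state: State) -> None:
        """Move to new_state, rejecting transitions the design never analyzed."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"unanalyzed transition {self.state.value} -> {new_state.value}")
        self.history.append((self.state, new_state))
        self.state = new_state


system = BoundedSystem()
system.transition(State.OPERATING)   # startup
system.transition(State.EXCEPTION)   # anomaly detected
system.transition(State.OPERATING)   # recovered, possibly degraded-normal
print([f"{a.value} -> {b.value}" for a, b in system.history])
```

Listing the edges explicitly makes the review question concrete: for each allowed transition, what can go wrong, and is the mitigation in hardware or software?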
Dependable Ignition System Example
  Automotive ignition -- Hazard identified
    – Sensor wiring may fail from constant movement
    – Ignition control failure may cause traffic emergency
  Requirement - Recover safely from faulty wiring
  Allocation of requirement – What if
    – HW - Terminate inputs for predictable open/short values
    – SW - Detect open/short values, use last or known safe value
  Requirements identification before design is best
    – More options, usually less costly
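The software half of that allocation can be sketched as a small input filter: readings at the values that hardware termination produces for an open or shorted wire are treated as faults, and the last good reading (or a known safe default) is used instead. The voltage thresholds, default value, and class name below are hypothetical; production ignition code would normally be written in an embedded language, and Python is used here only to show the logic.

```python
from dataclasses import dataclass

# Assumed hardware behavior (hypothetical values): input termination pulls an
# open wire to the high rail and a shorted wire to the low rail.
OPEN_CIRCUIT_MIN_V = 4.8   # at or above this, treat the wire as open
SHORT_CIRCUIT_MAX_V = 0.2  # at or below this, treat the wire as shorted
KNOWN_SAFE_V = 2.5         # used if a fault occurs before any good reading


@dataclass
class SensorFilter:
    last_good: float = KNOWN_SAFE_V
    fault: bool = False

    def read(self, raw_volts: float) -> float:
        """Return a value that is safe to feed to the control algorithm."""
        if raw_volts <= SHORT_CIRCUIT_MAX_V or raw_volts >= OPEN_CIRCUIT_MIN_V:
            self.fault = True          # flag for diagnostics / degraded mode
            return self.last_good      # substitute last or known safe value
        self.fault = False
        self.last_good = raw_volts
        return raw_volts


sensor = SensorFilter()
for raw in (1.7, 5.0, 0.0, 2.1):       # good, open, short, good again
    value = sensor.read(raw)
    print(f"raw={raw:.1f}V used={value:.1f}V fault={sensor.fault}")
```

The control algorithm only ever sees in-range values, so an intermittent wire degrades engine performance rather than stopping the engine in traffic.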
Summary
  Most large SW-intensive system developments fail
  Public safety and economic security are forcing government & legal systems to recognize the importance of software
  Planning and risk management practices are key to any solution
  Good systems engineering practices must be adapted to software's special characteristics
Bibliography - I
  Armour, Phillip, The Five Orders of Ignorance, Communications of the ACM, October 2000
  Armour, Phillip, Project Portfolios: Organizational Management of Risk, Communications of the ACM, March 2005
  Armour, Phillip, Sarbanes-Oxley and Software Projects, Communications of the ACM, June 2005
  Baetjer, H., Software as Capital - An Economic Perspective on Software Engineering, IEEE Computer Society Press, 1997
  Boehm, Barry, Win-Win Negotiation Tool, Center for Software Engineering-USC, http://sunset.usc.edu
  Cosgrove, J., Software Engineering & Law, IEEE Software, May-June 2001
  Humphrey, W. S., Managing the Software Process, Addison Wesley, 1990
  Humphrey, Watts, The Future of Software Engineering: V, SEI Interactive, Software Engineering Institute, Carnegie Mellon University, Vol. 5, Num. 1, 1Q 2002, http://interactive.sei.cmu.edu/news@sei/columns/watts_new/watts-new-compiled.pdf
  Humphrey, Watts, Why Big Software Projects Fail – The 12 Key Questions, CrossTalk Magazine, March 2005, www.stsc.hill.af.mil
Bibliography - II
  Lawson, Harold W., An Assessment Methodology for Safety Critical Systems, Lidingo, Sweden, Bud@damek.kth.se
  Los Angeles Times, System Failure Snarls Air Traffic in the Southland, 9/15/2004
  Los Angeles Times, Human Errors Silenced Airports, 9/16/2004
  Los Angeles Times, New FBI Software May Be Unusable, 1/13/2005
  Los Angeles Times, Prius Glitches Highlight Problems of Car Computers, 5/18/2005
  Lister, T. & DeMarco, T., Both Sides Always Lose: Litigation of Software-Intensive Contracts, CrossTalk, 2/2000, www.stsc.hill.af.mil/Crosstalk/2000/feb/demarco.asp
  Parnas, David L., Licensing Software Engineers in Canada, Communications of the ACM, 11/2002
  Poore, Jesse H., A Tale of Three Disciplines … and a Revolution, IEEE Computer, 1/2004
  Research Triangle Institute, The Economic Impacts of Inadequate Infrastructure for Software Testing, www.nist.gov, NIST Planning Report 02-3