CSC444 Lec01 1 University of Toronto Department of Computer Science Lecture 1: Why Does Software Fail? Some background What is Software Engineering? What.

Slides:



Advertisements
Similar presentations
CSCI 5230: Project Management Software Reuse Disasters: Therac-25 and Ariane 5 Flight 501 David Sumpter 12/4/2001.
Advertisements

©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development 2.
Introduction to Engineering Ethics – 2 Engineering Ethics Agenda Review Ethics I Introduce resources for ethical decisions in engineering References Challenger.
Computer Engineering 203 R Smith Project Tracking 12/ Project Tracking Why do we want to track a project? What is the projects MOV? – Why is tracking.
Solid Rocket Boosters Overview Two solid rocket boosters provide the main thrust to lift the Space Shuttle off the pad. They are the largest solid- propellant.
Space Shuttle Challenger Disaster
The Normalization of Deviance at NASA. Background January 28, 1986 Shuttle engineers were worried about launching at the predicted temperature of 31 degrees.
Three Ethical Case Studies
23/05/2015Dr Andy Brooks1 FOR0383 Software Quality Assurance Lecture 2 ESA Ariane 5 Rocket Flight 501.
1 COMS 161 Introduction to Computing Title: Numeric Processing Date: November 10, 2004 Lecture Number: 31.
Faiz Almansour Alemu Azanaw Rachel Downen Timothy Herbig Angie Schneider.
CS540 Software Design Lecture 1 1 Lecture 1: Introduction to Software Design Anita S. Malik Adapted from Budgen (2003) Chapters 1.
Challenger Accident or what you think you know can still hurt you CAPT RT Soule Supervisor of Shipbuilding Newport News “engineers and managers together.
An Accident Rooted in History NASA Culture History of the flawed joint Events leading up to the disaster.
Comprehend the Challenger accident Comprehend the Columbia accident The Space Shuttle Program: Challenger and Columbia Accidents.
INFORMATION TECHNOLOGIES SAFETY AND QUALITY THROUGH INFORMATION TECHNOLOGY WSRS Ulm – 20 Sept St. Ramberger / Th.Gruber 1 Experience Report: Error.
ICS Management Poor management is the downfall of many software projects Software project management is different from other engineering management.
©Ian Sommerville 2000CS 365 Ariane 5 launcher failureSlide 1 The Ariane 5 Launcher Failure June 4th 1996 Total failure of the Ariane 5 launcher on its.
The Challenger Disaster By Diana Clarke. The Orbiter Dimensions: 122’ L x 78’ W x 57’ H Dimensions: 122’ L x 78’ W x 57’ H Crew size: Up to 8 people Crew.
The Challenger Disaster
ARIANE 5 FAILURE ► BACKGROUND:- ► European space agency’s re-useable launch vehicle. ► Ariane-4 was a major success ► Ariane -5 was developed for the larger.
Roger Boisjoly and the Challenger Disaster
Notes on Challenger DisasterChallenger Disaster Stephen Scott March 12, 2003.
The Challenger Accident Magnus Jansson, Electrical Engineering Fredrik Mannesson, Engineering and Industrial Management Per Martinell, Civil Engineering.
Project Closure CHAPTER FOURTEEN Student Version Copyright © 2011 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin.
SPACECRAFT ACCIDENTS: EXAMINING THE PAST, IMPROVING THE FUTURE Overview and Challenger Case Study Bryan Palaszewski working with the Digital Learning Network.
©Ian Sommerville 2004Software Engineering Case Studies Slide 1 The Ariane 5 Launcher Failure June 4th 1996 Total failure of the Ariane 5 launcher on its.
LSU 01/18/2005Project Life Cycle1 The Project Life Cycle Project Management Unit, Lecture 2.
An Example Through When Things Go Wrong. Technical Communication Interactive and Adaptable Reader-Centered Produced In Teams Visual Influenced by Ethics,
CPSC 372 John D. McGregor Module 0 Session 1 Introduction.
University of Toronto Department of Computer Science CSC444 Lec04- 1 Lecture 4: Software Lifecycles The Software Process Waterfall model Rapid Prototyping.
Principles of Engineering System Design Dr T Asokan
University of Toronto Department of Computer Science © 2001, Steve Easterbrook CSC444 Lec22 1 Lecture 22: Software Measurement Basics of software measurement.
USS Yorktown (1998) A crew member of the guided-missile cruiser USS Yorktown mistakenly entered a zero for a data value, which resulted in a division by.
The Ariane 5 Launcher Failure
CompSci 230 Software Design and Construction
CRASH AND BURN ARIANE 5 Kristen Hieronymus SYSM6309 Advanced Requirements Engineering
CRASH AND BURN ARIANE 5 Kristen Hieronymus SYSM6309 Advanced Requirements Engineering
Challenger: Case Study in Engineering Ethics and Communications Tom Rebold Adapted from Tufte, Visual Explanations And
CPSC 871 John D. McGregor Module 0 Session 1 Introduction.
Software Quality Assurance Activities
Chapter 1 Software and Software Engineering. A Quick Quiz 1. What percentage of large projects have excess schedule pressure? 25% 50% 75% 100% 2. What.
Project Tracking. Questions... Why should we track a project that is underway? What aspects of a project need tracking?
The Ariane 5 Launcher Failure June 4th 1996 Total failure of the Ariane 5 launcher on its maiden flight.
1 University of Toronto Department of Computer Science © 2001, Steve Easterbrook CSC444F Software Engineering I Prof. Paulo Pacheco
1 Chapter 5 Project management. 2 Project management : Is Organizing, planning and scheduling software projects.
THE PROJECT LIFE CYCLE PROJECT MANAGEMENT LIFE CYCLE LSU 01/18/2005 PROJECT LIFE CYCLE 1.
Chapter 1 Quality terminology Error: human mistake Fault: result of mistake, evidenced in some development or maintenance product Failure: departure from.
RELIABILITY ENGINEERING 28 March 2013 William W. McMillan.
Software Defects.
SEN 460 Software Quality Assurance. Bahria University Karachi Campus Waseem Akhtar Mufti B.E(UIT), M.S(S.E) AAU Denmark Assistant Professor Department.
Learning Goals  I will be able to identify the names of the space shuttles in NASA’s program.  I will be able to identify two shuttle disasters.
Software Quality Assurance SOFTWARE DEFECT. Defect Repair Defect Repair is a process of repairing the defective part or replacing it, as needed. For example,
CSCI 3428: Software Engineering Tami Meredith Chapter 11 Maintaining the System.
Topic 10Summer Ariane 5 Some slides based on talk from Sommerville.
Dr. Elizabeth Hoppe Lewis University June Overview of Ethics Deontology (ethics based on duty or obligation) Immanuel Kant ( ) Focus on.
CHALLENGER DISASTER : CASE STUDY – TO BE
CHALLENGER DISASTER : CASE STUDY
DEGRADED MODES OF OPERATION: ANTECEDENTS FOR RAILWAY ACCIDENTS
Failure of Process / Failure of Mission
Chapter 17 - Component-based software engineering
Fault Tolerant Computing
Ariane 5 Software error Integer overflow.
SPACE SHUTTLES.
Operations Management
Space Travel Present & Future
Operations Management
Chapter 17 - Component-based software engineering
Operations Management
STS-114 Return to Flight Lessons Learned Bill Parsons
Presentation transcript:

CSC444 Lec01 1 University of Toronto Department of Computer Science Lecture 1: Why Does Software Fail? Some background What is Software Engineering? What causes system failures? The role of good engineering practice Are software failures like hardware failures? Shuttle flight STS51-L (Challenger) Ariane-5 flight 501 Some conclusions e.g. Reliable software has very little to do with writing good programs e.g. Humans make mistakes, but good engineering practice catches them! © 2001, Steve Easterbrook

CSC444 Lec01 2 University of Toronto Department of Computer Science Defining Software Engineering “Engineering… …creates cost-effective solutions to practical problems by applying scientific knowledge to building things in the service of humankind” Software Engineering: the “things” contain software (??) BUT: pure software is useless! …software exists only as part of a system software is invisible, intangible, abstract there are no physical laws underlying software behaviour there are no physical constraints on software complexity software never wears out …traditional reliability measures don’t apply software can be replicated perfectly …no manufacturing variability © 2001, Steve Easterbrook

CSC444 Lec01 3 University of Toronto Department of Computer Science Failures and Catastrophes System Components often fail Parts wear out Wires and joints come loose Cosmic rays scramble your circuits! Components get used for things they weren’t designed for Designs don’t work the way they should Point failures typically don’t lead to catastrophe backup systems fault tolerant designs redundancy certification using safety factors (eg 2x) Good Engineering Practice prevents accidents failure analysis reliability estimation checks and balances But how does this work in Software Engineering??? © 2001, Steve Easterbrook

CSC444 Lec01 4 University of Toronto Department of Computer Science Shuttle Flight 51-L (Challenger) Contracts for shuttle awarded 1972: Rockwell - Orbiter Martin Marietta - external tank Morton Thiokol - Solid Rocket Boosters (SRBs) Rocketdyne - Orbiter Main engines 3 NASA centers provide management: JSC - Manage the orbiter Marshall - Manage engines, tank and SRBs KSC - Assembly, checkout and launch 4 orbiters were built: flights began in ‘81; declared operational July ‘82 after STS-4 24 flights over 57 months up to Dec 1995 © 2001, Steve Easterbrook

CSC444 Lec01 5 University of Toronto Department of Computer Science Challenger Disaster Technical cause: failure of a pressure seal (“O-ring”) in the aft field joint of the right solid rocket motor Solid rocket motor assembled from four cylindrical sections, 25 feet long, 12 feet diameter, containing 100 tons of fuel 2 O-rings seal gaps in the joints caused by pressure at ignition Factors: temperature: cold reduces resiliency of the O-ring chance of O-ring failure increased by test procedures causing blow holes in the putty used to pack the joint But this was just the point failure… © 2001, Steve Easterbrook

CSC444 Lec01 6 University of Toronto Department of Computer Science What really happened? 1977: Tests show rotation of joints causes loss of secondary O-ring as a backup seal 1980: SRB joint classified as criticality 1R Anomalies in O-rings found in initial flights but not entered into Marshall's problem assessment system Dec 82: Tests show secondary O-ring no longer functional under 40% of max operating pressure. Criticality changed to 1 Paperwork after this time still shows SRB joints as 1R 1985 Jan 24: STS 51-C launched in lowest ever temperature: 53°F (≈11ºC) O-ring erosion worst yet. Feb 8: Analysis by Thiokol noted risk of O-ring failure concluded risk should be accepted because of secondary O-ring. © 2001, Steve Easterbrook

CSC444 Lec01 7 University of Toronto Department of Computer Science Leading up to the launch 1985 (cont.) April 29: STS 51-B: primary O-ring never sealed, secondary eroded beyond predicted limits as a result, Marshall placed a launch constraint on 51-F and all subsequent flights Thiokol were unaware of this constraint (which was waived for each flight thereafter) July: Thiokol engineers set up task force to solve the O-ring problem Oct: task force complains of lack of cooperation from management. Dec: Thiokol management recommends closure of O-ring problem Oct/Nov: 61-A & 61-B both experience O-ring problems L Launch originally scheduled for Jan 23rd Jan 23: Flight 51-L re-scheduled for 25th Jan 25: Unacceptable weather forecast Jan 27: countdown halted - jammed exit hatch Launch re-scheduled for Jan 28th, at 9:38am temperature of 27°F (≈-3ºC) predicted for launch time previous coldest launch: 53°F (≈11ºC) © 2001, Steve Easterbrook

CSC444 Lec01 8 University of Toronto Department of Computer Science The Launch decision Jan 27, :30pm Thiokol engineers express concern at predicted low temp. 5:45pm Thiokol presents its concerns to Marshal recommends launch should be delayed 8:45pm Thiokol re-presents its conclusions to larger meeting Marshall criticizes it for changing the launch criteria 10:30pm meeting recessed for Thiokol discussion engineers express strong objections to launch 11:00pm meeting reconvened Thiokol management withdrew objections to launch Jan 28, :39am: flight 51-L launched 73 seconds later, Challenger explodes © 2001, Steve Easterbrook

CSC444 Lec01 9 University of Toronto Department of Computer Science Report of the Presidential Commission on the Space Shuttle Challenger Accident William P. Rogers, Chairman (Former Secretary of State under President Nixon , and Attorney General under President Eisenhower ) Lack of trend analysis Management Structure: safety, reliability and QA placed under the organizations they were to check organizational responsibility for safety was not adequately integrated with decision-making No safety representative at the meetings on 27 Jan. Problem reporting and tracking Complacency: Escalating risk accepted Perception that less safety reliability and QA activity needed once Shuttle missions became routine Program Pressures were a factor Pressure on NASA to build up to 24 missions per year Shortened training schedules, lack of spare parts, and dilution of human resources. Customer commitments may have obscured engineering concerns Reduction of skilled personnel © 2001, Steve Easterbrook

CSC444 Lec01 10 University of Toronto Department of Computer Science Ariane-5 flight 501 Background European Space Agency’s reusable launch vehicle Ariane-4 a major success Ariane-5 developed for larger payloads Launched 4 June 1996 Mission $500 million payload to be delivered to orbit Fate: Veered off course during launch Self-destructed 40 seconds after launch Cause: Unhandled floating point exception in Ada code © 2001, Steve Easterbrook

CSC444 Lec01 11 University of Toronto Department of Computer Science Ariane-5 Events Locus of error: Platform alignment software (part of the Inertial Reference System, SRI) This software only produces meaningful results prior to launch Still operational for 40 seconds after launch Cause of error: Ada exception raised and not handled: Converting 64-bit floating point to 16-bit signed integer for Horizontal Bias (BH) Requirements state that computer should shut down if unhandled exception occurs Launch+30s: Inertial Reference Systems fail Backup SRI shuts down first Active SRI shuts down 50ms later for same reason Launch+31s: On-board Computer receives data from active SRI Diagnostic bit pattern interpreted as flight data OBC commands full nozzle deflections Rocket veers off course Launch+33s: Launcher starts to disintegrate Self-destruct triggered © 2001, Steve Easterbrook

CSC444 Lec01 12 University of Toronto Department of Computer Science Why did this failure occur? Why was Platform Alignment still active after launch? SRI Software reused from Ariane-4 40 sec delay introduced in case of a hold between -9s and -5s Saves having to reset everything Feature used once in 1989 Why was there no exception handler? An attempt to reduce processor workload to below 80% Analysis for Ariane-4 indicated overflow was not physically possible Ariane-5 had a different trajectory Why wasn’t the design modified for Ariane-5? Not considered wise to change software that worked well on Ariane-4 Why did the SRIs shut down? Assumed faults are random hardware failures, hence should switch to backup Why was the error not caught in unit testing? No trajectory data for Ariane-5 was provided in the requirements for SRIs Why was the error not caught in integration testing? Full integration testing considered too difficult/expensive SRIs were considered to be fully certified Integration testing used simulations of the SRIs Why was the error not caught by inspection? The implementation assumptions weren’t documented Why did the OBC use diagnostic data as flight data? They assumed this couldn’t happen??? © 2001, Steve Easterbrook

CSC444 Lec01 13 University of Toronto Department of Computer Science Summary Failures can usually be traced to a single root cause System of testing and validation designed to catch such problems Catastrophes occur when this system fails In most cases, it takes a failure of both engineering practice and of management Reliable software depends not on writing flawless programs but on how good we are at: Communication (sharing information between teams) Management (of Resources and Risk) Verification and Validation Risk Identification and tracking Questioning assumptions © 2001, Steve Easterbrook

CSC444 Lec01 14 University of Toronto Department of Computer Science Readings Van Vliet, chapter 1 Read all of it, especially the part about a code of ethics Challenger (& Space Shuttle in general) Current info about the shuttle: Info about Challenger: Rogers Commission Report (see especially appendix F, by Richard Feynman) A Succinct summary of the key factors and issues with Challenger: Ariane-5 Info about ESA’s launchers: Flight 501 inquiry report & Press release: © 2001, Steve Easterbrook