Loss of the Columbia Lessons Not Learned Steve Fairfax MTechnology, Inc.
Outline Background 7x24 Perspective Review of the Challenger Loss PRA at NASA after Challenger Details of the Columbia Loss NASA Resistance to PRA Lessons for 7x24 The Future of Human Space Flight
Background Challenger Loss January 1986 Presentation to 7x24 Exchange Spring 1998 –12 years hindsight –100+ books published –Many obvious parallels to 7x24 industry Columbia Loss February 1, 2003 –Asked to present analysis to 7x24 February 4 –1200+ documents reviewed 2.5 GB digital files, primarily from internet searches 4 cubic feet hardcopy –Shuttle design and test information from early 1980s –Books and articles on the Challenger –About 300 hours effort
7x24 Perspective: Many similarities between NASA and 7x24 firms Failures can produce catastrophic damages –Billions of dollars –Loss of human life The role of the organization in technical failure –“Human error is responsible for 67% of all downtime” Why is this tolerated? What has been done to change this? Culture of secrecy –Failures and concerns about potential failure are rarely explored –NASA does a much better job than any 7x24 firm, still fails Normalization of deviance –Near-misses and heroic “saves” are signs of impending failure, not factors of safety, not validation of design or operations Production culture, time pressures –No time for initial commissioning –No scheduled maintenance windows –Dramatic failure required to get management attention
7x24 Perspective: PRA Reliance on redundancy, factors of safety, anecdote, and appeals to experience Lack of acceptance of PRA techniques, results –“Lack of data” used to excuse sloppy thinking –Results that gore sacred cows are unwelcome –Practices that clearly reduce reliability enshrined as “years of experience” –Inability to quantify risks makes optimal allocation of scarce resources impossible
7x24 Perspective: Important Differences NASA Government monopoly Resources determined by political process Formerly used to advance political goals Large bureaucracy, very difficult to change 7x24 Industries Varied, competitive Resources determined by market success or failure Mission Critical, not political Range of sizes, change management for survival Look for the 7x24-relevant points
Summary of the Challenger Loss Engineers at Solid Rocket Booster (SRB) manufacturer recommended delay in launch –Cold temperatures stiffen O-rings –Seal failures observed 12 times on previous missions –Worst failure, 1/3 O-ring diameter erosion, on coldest launch at 53 °F –Predicted launch temperature 29 °F 2 conference calls among multiple NASA centers and SRB manufacturer the evening before launch NASA “appalled” at request for delay SRB Manufacturer management certified safe for launch O-ring seals on solid rocket boosters (SRB) failed at ignition Shuttle destroyed by aerodynamic forces 73 seconds after launch –Hot gas from failed O-ring seal caused SRB – ET attachments to fail –Loss of ET internal pressure caused tank to fail
Investigation of the Challenger Loss Rogers Commission ( reports to President ) Testimony under oath Misleading Testimony by NASA managers –Disputed O-rings as cause of failure –Testified that Thiokol had certified system safe –Failed to mention initial recommendation to delay Commission attitude shifts from co-operative to confrontational
Causes of the Challenger Loss Immediate Cause: O-ring failure due to improper design, cold temperatures Contributing causes –Excessive schedule pressures –Normalization of deviance Accepting failures as evidence of safety –Gradual shift from requirement to prove that flight is safe to proving it is unsafe Reliance on experience to approve launch Demand for “hard” data to halt launch
Recommendations from the Challenger Loss Change design with independent oversight and review Change in management practices Criticality Review and Hazard Analysis Establish independent safety organization Improved communications Avoid reliance on shuttle alone
Risk Assessment at NASA Emphasis on Redundancy, Safety Factors, Criticality –4N redundant avionics and computer systems –1.4 Safety Factor in structure –Single failure of “Criticality 1” component leads to Loss of Vehicle and Crew (LOV/C) Absent quantitative risk data, impossible to allocate limited resources most effectively NASA forbade use of formal probabilistic risk analysis (PRA) after Apollo consultant found small probability of successful moon missions NASA reluctantly started PRA in 1987, after Challenger loss
Risk Assessment at NASA: Tile PRA “Risk Assessment is a management tool.” –Risk Management for the Tiles of the Space Shuttle, Paté-Cornell & Fischbeck, 1994 15% of tiles account for 85% of the risk of LOV/C –Combination of 3 factors Probability of tile loss/damage Heat load during re-entry Criticality of underlying structures and systems Tile PRA allowed NASA to use same amount of maintenance resources (time, money, attention) to provide more safety “NASA seems to have grown from a can-do organization to a large bureaucracy.” Ibid.
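The three factors on this slide can be combined into a per-zone risk score and rank-ordered. The following is a minimal sketch only: it assumes the factors simply multiply, the zone names and numbers are invented for illustration (they are not data from the 1994 study), and `TileZone` and `rank_by_risk` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TileZone:
    name: str
    p_loss: float       # probability of tile loss or damage (illustrative)
    heat_load: float    # relative re-entry heat load, 0..1 (illustrative)
    criticality: float  # relative criticality of what lies underneath, 0..1

    @property
    def risk(self) -> float:
        # Assumption: the three factors combine multiplicatively.
        return self.p_loss * self.heat_load * self.criticality


def rank_by_risk(zones):
    """Return (zone, share_of_total_risk) pairs, highest risk first."""
    total = sum(z.risk for z in zones)
    ranked = sorted(zones, key=lambda z: z.risk, reverse=True)
    return [(z, z.risk / total) for z in ranked]


# Invented example zones; only the ranking idea matters here.
zones = [
    TileZone("upper surface", 0.05, 0.1, 0.1),
    TileZone("mid fuselage belly", 0.03, 0.4, 0.2),
    TileZone("main gear door", 0.01, 0.6, 1.0),
    TileZone("near wing leading edge", 0.02, 0.9, 1.0),
]

ranking = rank_by_risk(zones)
for zone, share in ranking:
    print(f"{zone.name:24s} {share:6.1%} of total risk")
```

Even with made-up inputs, the ranking shows the slide's point: the zone most likely to lose tiles is not necessarily the one that dominates the risk, so maintenance attention is better allocated by the product than by any single factor.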
Tile PRA Applied to NASA Management Practices New techniques allowed PRA to account for effects of: –Lower pay scale for tile technicians –Lack of training and experience –No sense of priorities, procedures meant to ensure that everything was done “perfectly.” –Fixed daily quota of tile inspections –Schedule pressure effects on technician error rates High Expectations set for flight frequency, safety “High visibility makes it difficult for the organization to learn.” Ibid.
Tile inspection and repair is difficult, painstaking work.
PRA of the Entire Shuttle System “Top-Down” PRA would allow NASA to –Identify systems contributing most to risk –Determine the uncertainty in the findings –Assess the benefits of potential improvements –Select systems with best risk/reward ratio for further work –Track effects and continuously improve NASA opted for piecemeal PRA of subsystems –Goal was to show that risks were acceptable
Brief History of the Tiles Early shuttle concepts included Titanium alloy structure, skin, tiles –Excellent high-temperature strength, corrosion resistance –Difficult to form, new alloy development required –Expensive, primary source is Russia (then USSR) Switch to Aluminum –Much lower cost –Large experience base from previous aerospace work –Loses 95% of room-temperature strength at 800 °F –Places much greater demands on tiles –Politically beneficial, domestic suppliers –Lowest capital cost, highest operating cost
Tile History Continued Tiles developed, manufactured by NASA –Ceramic/glass composite –25,000 unique shapes, 14+ material mixes Orbiter Skin must be smooth to control heating –Gap fillers used to fill voids –Early Columbia wing roughness led to “hottest” re-entry Fragility not understood at first –Glass-like layer added to surface for protection –“Densified” layer added to base for strength
Tile Fragility Long history of bonding and debris damage –40% loss of tiles on 1st ferry flight to KSC –Average of 179 tiles damaged on each of 1st 33 flights High of 707 Low of 53 Design Specification for Tile Damage: 0 “The reusable surface insulation (RSI) material used in the shuttle thermal protection system is susceptible to damage. If any RSI tiles are damaged or lost during ascent, they must be repaired or replaced prior to entry.” - NASA-TM-81822 (1980) NASA cancelled on-orbit tile repair development program
Brief History of the External Tank Initial RFP did not require insulation –Assumed ice would form, be shed as in Apollo –Extreme fragility of tiles added new requirement No ice on external tank “Orbiter tiles were so fragile that an ice cube dropped four inches would crack the tile glass coating.” –Lessons Learned from Space Shuttle External Tank Development by Myron Pessin, 2002 Extensive effort to develop Spray-On Foam Insulation (SOFI) –1st material withdrawn from market after Univ. Utah professor fed burned residue to rats, showed possible toxicity
Brief History of the External Tank 2nd SOFI material used Freon blowing agent –Banned by EPA for ozone damage during post-Challenger launch hiatus –NASA 1987 press releases tout “Environmentally Friendly” foam 1st flight with 3rd SOFI produces 308 hits, 132 larger than 1 inch. Some gouges 15 inches long, depths up to 1.5 inches in 2-inch thick tile. –EPA granted NASA waiver to use Freon in 2001 –NASA continued to use new SOFI 3rd foam material uses HCFC 141b blowing agent –15% as harmful as Freon for ozone damage –Banned by EPA effective 2004 –No replacement found for SOFI
Focus on the Failure: Leading-Edge RCC Front edge of shuttle wings, shuttle nose get the most heating –2500 °F to 3000 °F in typical re-entry Reinforced Carbon-Carbon used in these areas –Dense matrix of carbon fiber cloth, carbonized epoxy resin –Approximately ¼” thick –Strong but brittle –SiC coating prevents oxidation (burning) of carbon Nose RCC shows no signs of corrosion –Protected by cap on launch pad Wing RCC panels (22 per wing) subject to corrosion –Salt spray creates pinholes in SiC protective coating –NASA inspects and repairs coating after each flight –NASA refuses to place canvas covers over RCC before flight
RCC Panels [Figures: wing box structure during assembly; IR photo showing wing heating; wing bulkhead that holds the RCC panels, with panels 8 and 9 marked]
Focus on the Failure: T-seals RCC panels on wing expand and move with large temperature changes T-seals between each RCC panel seal gaps T-seals constructed of RCC Location, shape of T-seals makes inspection difficult Columbia has unique attachment hardware, subject to corrosion
Focus on the Failure: Bipod Ramp 2 struts attach shuttle nose to ET: the Bipod Hand-applied foam ramp used to smooth airflow near ET attach points Bipod ramp foam failed on previous flight, at least 4 other occasions –Previous flight bipod ramp SOFI debris hit SRB aft skirt –NASA did not order additional inspection of STS-107 ET bipod ramp for defects –Dissection of next ET bipod ramp showed multiple defects Several large voids that reduced strength, trapped water or liquid air Duct tape embedded in foam; increases chance of shear failure
Bipod Ramp Foam on next (after Columbia STS-107) external tank.
Voids and duct tape found in bipod foam on next ET
Previous Bipod Ramp Foam Failure: External Tank photographed after jettison STS-32 Jan. 9, 1990 Missing Foam Left Bipod Ramp, many other areas
Missing Foam Left Bipod Ramp Foam Intact Right Bipod Ramp External Tank Photograph from orbiter wheel well camera after ET jettison Flight STS-50, June 25, 1992
Details of the Failure: The Launch on January 16, 2003 Left bipod ramp foam broke off 81 seconds after launch –Estimated size 21” x 16” x 6” –Shuttle velocity Mach 2.4 (1800 MPH) Foam impacted lower portion left wing at ~750 FPS (500 MPH) Cloud of debris observed below wing after strike Strike appeared to occur on or near forward edge of wing
Details of the Failure: NASA-Boeing Analysis Launch on January 16, 2003 Reports presented January 21, 23, 24 All reports assumed no water in foam –4% water content would double weight of foam –10.6 inches of rain fell while Columbia sat on launch pad –NASA pokes thousands of holes in foam to reduce debris shedding CRATER program used to predict tile damage –Based on testing with 3 mm SOFI pellets RCC damage predicted by comparison with ice impact database –RCC not designed for ice impact –Size of ice debris not revealed
SOFI debris damage to the shuttle tiles Design requirement: No debris allowed to hit shuttle History: more than 100 impacts on many flights –More than 25 hits larger than 1 inch on multiple flights CRATER program developed to predict effects of SOFI impact on tiles –Based on study using 1/8” diameter SOFI pellets –CRATER program used to justify safety of Columbia after damage known –CRATER prediction exactly matches depth of gouge on STS-50 tiles Inconsistent with NASA characterization of results as “conservative” Possible that STS-50 single datum used to calibrate CRATER
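The size of the extrapolation from the CRATER test pellets to the STS-107 debris can be put in numbers. This is a rough sketch under stated assumptions: the pellet is treated as a sphere of 1/8 inch diameter (the slides give only the diameter), the debris is treated as a full 21 x 16 x 6 inch block (the estimated size from the launch slide, so an upper bound on volume), and both are assumed to have the same density.

```python
import math


def sphere_volume(diameter_in: float) -> float:
    """Volume of a sphere in cubic inches, given its diameter in inches."""
    return math.pi / 6.0 * diameter_in ** 3


# Assumption: test pellet modeled as a sphere of 1/8 inch diameter.
pellet_volume = sphere_volume(1 / 8)

# Assumption: debris modeled as a solid rectangular block, 21 x 16 x 6 inches.
debris_volume = 21 * 16 * 6

# At equal density the mass ratio equals the volume ratio, and at equal
# impact speed the kinetic energy ratio equals the mass ratio.
ratio = debris_volume / pellet_volume

print(f"pellet volume: {pellet_volume:.5f} cubic inches")
print(f"debris volume: {debris_volume} cubic inches")
print(f"volume (and mass) ratio: about {ratio:,.0f} to 1")
```

Under these assumptions the debris is roughly two million times the volume of the test article, which is the scale of extrapolation the analysis relied on.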
Notes on NASA/Boeing Analysis No mention of 1994 Tile PRA No acknowledgement of different risk levels –Wing RCC loss = vehicle loss! “Criticality One” –Tile near front of wing: Criticality One –Main landing gear doors: Criticality One –Main landing gear door seals: Criticality One –T-seal loss: Criticality One –Uses close call on STS-50 to predict safety on STS-107 Same flawed logic as Challenger pre-flight debate –Ignores huge extrapolation in energy, damage potential
NASA/Boeing Analysis of RCC Strike, Thermal Predictions Impact angles greater than 15 degrees show penetration of RCC –Anticipated impact angle: 21 degrees –Penetration of 120% of RCC thickness at best estimate of impact angle –“softness of SOFI” not quantified, ignores impact energy
NASA Discussion of Actual Damage Scenario: Shades of Challenger Strike on lower half of RCC panels 8-9 No allowance for known corrosion of RCC No quantification of uncertainty –2 degree change in impact angle increases energy over 2X –4% humidity doubles weight of foam, hardens it No mention of tile PRA, criticality of various sections Extrapolation from ice damage to RCC –Size of ice particles not specified –Use of previous failures to infer safety Predicted erosion of 47% of RCC thickness “no issue”
Records of the Columbia Failure February 1, 2003 Telemetry drop-outs very early in re-entry –Probably caused by abnormal plasma flow Sensors show rising temperatures, then sensors fail –Multiple locations on left wing, left wheel well –Consistent with large plasma stream entering wing interior Not due to missing tile on wing lower surface
Photograph of Columbia early in re-entry –Taken through an amateur telescope at an Air Force telescope installation –Shows damage to wing leading edge –Disturbed plasma flow at left wing trailing edge
1 inch gap from missing T-seal Plasma cuts aluminum bulkhead behind RCC, flows along path of least resistance
Records of the Columbia Failure Guidance System correcting for excessive drag –Unusual tendency for shuttle to roll –Unusual amount of left yaw Reconstruction of final 2 seconds of data –All 3 “redundant” hydraulic control systems at 0 pressure –Shuttle yawing rapidly (over 20 degrees per second) to left –Pilot attempts remedy by switch to manual control Shuttle Breakup at Mach 20, 205,000 feet
After the Failure: New Revelations Extensive e-mail exchanges between NASA engineers –Obvious concern about tile damage –Debate regarding survival if main landing gear, tires damaged “Why are we talking about this the day before landing, and not the day after launch?” “Any more activity on the tile damage or are people relegated to crossing their fingers and hoping for the best?”
After the Failure: New Data Air Force radar tracked object moving away from shuttle on second day in orbit 3,100 radar readings as object tumbles Tracked during re-entry, destruction in atmosphere Testing at Wright-Patterson AFB shows radar returns consistent with tumbling RCC T-seal Ballistic data during re-entry confirms RCC T-seal
After the Failure: More New Data OEX recorder on Columbia recorded temperatures, strains Left over from 1st four “development flights” Recovered in Texas March 16 Tape intact, showed abnormal readings 270 seconds after crossing 400,000 ft - 206 seconds before NASA telemetry deviations 1st indication was strain gauge behind RCC panel 9 Temperature sensor behind RCC panel 9 went to 450 °F, then failed, 63 seconds prior to 1st NASA indications OEX may have recorded abnormal aerodynamic forces during launch, analysis underway
Probable Sequence of Events: Launch Defect in SOFI application left voids, perhaps duct tape, buried in foam. Large, possibly waterlogged and frozen SOFI fragment broke free 82 seconds after launch SOFI impacted Shuttle wing between RCC panels 8 and 9 RCC T-seal damaged by impact, possibly weakened by corrosion. Lower RCC panel 8 or 9 possibly damaged or destroyed. T-seal floats away from shuttle on second day in orbit, leaving 1-inch gap in wing leading edge
Probable Sequence of Events: Re-entry Plasma enters 1” gap in wing leading edge Plasma stream, at least 3,000 °F, melts exposed aluminum bulkhead at front of wing structure Plasma stream enters wing interior Sensors, hydraulics, attitude control systems fail Vehicle begins uncontrolled yaw to left at >20 degrees per second Vehicle torn apart by aerodynamic forces, breaks up over Texas
Calls for NASA to Use Probabilistic (Quantitative) Risk Assessment Techniques Rogers Commission Aerospace Safety Advisory Panel (ASAP) –1986, 1987, 88, 89, 90 annual reports –"Reliability and Probabilistic Risk Assessment" section added in 1986, after Challenger –“All engineers involved in any aspect of design, test, or operations of any aerospace system should be given at least a minimal grounding in these valuable tools.” (Fault Tree Analysis and Failure Modes and Effects Analysis.) - ASAP –Safety impact of NASA downsizing impossible to predict due to lack of objective assessment
NASA Resistance to PRA/QRA “NASA has taken the position that a lack of maturity, insufficient data base, and lack of funds associated with quantitative risk assessment limits its usefulness.” ASAP Annual Report 1989 “Risk Management” chapters dropped from ASAP 1991 and subsequent reports “NASA has a culture of fixing the immediate symptom or problem rather than a learning orientation in which all factors (cultural, organizational, and technical) are included in the search for the ultimate cause. The results are continued safety risk and increased cost.” ASAP Annual Report 2002, underline added
Déjà Vu Challenger –Multiple near misses: 12 O-rings damaged, worst damage when cold –Normalization of deviance: “safety factor” –Cursory analysis just prior to accident –Initial denial by NASA management –Failure to use all available resources –Ignorance of relative risk Columbia –Multiple near misses: 112 flights with damage, 6 from same ET area –Normalization of deviance: CRATER –Cursory analysis just prior to accident –Initial denial by NASA management –Failure to use all available resources –Ignorance of relative risk
Lessons for 7x24 Not all risks are equal –Use PRA to set relative priority Interactions between systems cause failures Redundancy and safety factors don’t quantify, or effectively control, risk PRA can’t prove risks are acceptable –Acceptability is based on judgment, not math –PRA best used to understand and rank-order risks –MTTF or “nines” are not enough, need to know worst-case Treating near-misses as proof of design safety invites disaster Mistakes arise from the structure of the organization –Beating on the operators won’t fix the problem
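The point that redundancy alone does not control risk when systems interact can be illustrated with a simplified beta-factor common-cause model. This is a textbook-style sketch, not a method taken from the presentation: `beta` is the assumed fraction of each unit's failure probability that comes from a shared cause (a common fuel supply, control system, or operator action), and all numbers are invented.

```python
def p_system_failure(p_unit: float, n_units: int, beta: float = 0.0) -> float:
    """Probability that all n redundant units are lost.

    Simplified beta-factor model (an assumption of this sketch):
    a fraction `beta` of each unit's failure probability is a common
    cause that takes out every unit at once; the rest is independent.
    """
    independent = ((1.0 - beta) * p_unit) ** n_units  # every unit fails on its own
    common = beta * p_unit                            # one shared cause fails them all
    return 1.0 - (1.0 - independent) * (1.0 - common)


P_UNIT = 0.01  # illustrative per-unit failure probability

for n in (1, 2, 4):
    naive = p_system_failure(P_UNIT, n)              # assumes full independence
    coupled = p_system_failure(P_UNIT, n, beta=0.05) # 5% common-cause share
    print(f"{n} unit(s): independent {naive:.2e}, with common cause {coupled:.2e}")
```

With full independence, each added unit multiplies the failure probability by the per-unit value. With even a small common-cause share, the result flattens out near `beta * p_unit`, so going from two units to four buys almost nothing. That floor is the quantity a PRA is meant to expose and a count of redundant units hides.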
The Future of Human Space Flight NASA did not learn from Challenger. –Overwhelming similarity in Columbia decision process –Pervasive evidence of an organization that can not or will not quantify risk NASA is a large, political bureaucracy. –No longer used to achieve national political goals –Competes with other political programs for resources –Political decisions are based on influence, not physics –Survival of the bureaucracy above all else We are getting the space program we pay for.
The Future of Human Space Flight NASA has a monopoly on US space flight. –Has driven new rocket designers out of business Economic laws for monopolies: –Price invariably increases –Quality, quantity invariably decrease The shuttle is an economic disaster. –100x projected cost per launch –100x less reliable than promised –10x less frequent flights More shuttle flights will lead to more losses.
One Man’s Vision: The Future of Human Space Flight Burt Rutan’s 3-person SpaceShipOne, the White Knight carrier aircraft, and all controls, simulators, and program facilities “I believe the government is the reason it’s unaffordable to fly into space. We didn’t want them to know, because their ‘help’ causes cost problems.” -Burt Rutan
3 people to the edge of space for $100,000 per flight.
“We use the lowest technology possible, not the highest.” - Burt Rutan
“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” Richard Feynman, Challenger Report, 1986
Could the crew of Columbia have been saved? Inspection of tiles on orbit –Robot arm removed – can see underside with TV –MMU (jetpack) not on board –SAFER (small emergency jetpack) not on board Policy questioned by Safety Advisory Panel for years Astronaut could have left shuttle, then shuttle flown around them while tumbling astronaut takes digital pictures "National Assets" (spy satellites) could have photographed orbiter –Halted by NASA management after requested by NASA engineers Irrational behavior viewed from technical viewpoint: zero risk to shuttle Perfectly rational behavior viewed from political viewpoint.
Unclassified Air Force image of Columbia in orbit. Taken from Maui. Payload bay doors block view of wing leading edge.
Could the crew of Columbia have been saved? Repair of tiles in orbit –In-orbit repair project abandoned by NASA –NASA technical memo published 1980 “The reusable surface insulation (RSI) material used in the shuttle thermal protection system is susceptible to damage. If any RSI tiles are damaged or lost during ascent, they must be repaired or replaced prior to entry.” –New proposals since loss Chemical patch similar to fireworks "snakes"
Could the crew of Columbia have been saved? Rendezvous with space station –No docking mechanism on Columbia –Orbital mechanics make it physically impossible Velocity change of ~5700 feet per second required Fuel on board sufficient for ~1000 FPS change Change in re-entry procedure –Jettison all cargo and expendables to reduce loading NASA estimates 5-7% reduction in RCC heating rate No help for unprotected aluminum –Cold-soak wings for 2 days NASA estimates 37 seconds additional lifetime
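The order of magnitude of the rendezvous velocity change can be checked with the standard formula for a pure plane change on a circular orbit, Δv = 2·v·sin(Δi/2). This is a back-of-envelope sketch, not the calculation behind the slide: the orbital speed and the two inclinations below are assumed values that do not appear in the presentation, and altitude and phasing differences are ignored.

```python
import math


def plane_change_delta_v(speed_fps: float, delta_incl_deg: float) -> float:
    """Delta-v (ft/s) for a pure plane change on a circular orbit."""
    return 2.0 * speed_fps * math.sin(math.radians(delta_incl_deg) / 2.0)


# Assumed values, not taken from the slides:
ORBITAL_SPEED_FPS = 25_400.0  # typical low-Earth-orbit speed
COLUMBIA_INCL_DEG = 39.0      # orbital inclination of the Columbia mission
STATION_INCL_DEG = 51.6       # orbital inclination of the space station

# From the slide: fuel on board sufficient for about 1000 FPS.
AVAILABLE_FPS = 1_000.0

required = plane_change_delta_v(
    ORBITAL_SPEED_FPS, STATION_INCL_DEG - COLUMBIA_INCL_DEG
)
print(f"required  ~{required:,.0f} ft/s")
print(f"available ~{AVAILABLE_FPS:,.0f} ft/s")
print(f"shortfall factor ~{required / AVAILABLE_FPS:.1f}x")
```

Under these assumptions the plane change alone comes out on the order of 5,500 ft/s, in line with the ~5700 FPS figure on the slide and more than five times the fuel available.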
Could the crew of Columbia have been saved? Launch Rescue Shuttle –Columbia Spacehab mission very long: 16 days –Extra oxygen and hydrogen for fuel cells provide water –Power-down Spacehab, conserve food and water, and wait for rescue –CO2 removal canisters are limiting item, probably 30 days –High risks from rushed shuttle launch, subject to same failure –We’ll never know Other options (à la Apollo 13) –Patch wing leading edge with ice bag, other material? –Not explored