FAILURE ANALYSIS A Brief Discussion of Failure in Design 1 prgodin @ gmail.com January 2015
Laws of Processes
If something can go wrong, it will go wrong. (Murphy’s law)
Additional observations related to Murphy’s Law:
– If nothing can go wrong, it will.
– If something goes wrong, it will be the flaw that causes the most damage.
– Sooner or later the worst possible combination of events will happen.
– Discoveries are made by mistake.
– The one who predicts the greatest cost and the longest time for a project is usually the one with the most experience.
– Perfect science is hindsight.
Processes
Not all failures are directly caused by hardware or software flaws.
– Groupthink: When the members of a group naturally want to agree and reach a consensus, they ignore and discourage dissenting viewpoints.
– Ineffective communication.
– Ineffective leadership: Leaders must maintain a vision of the project and ensure that all teams function together and all required resources are available. Poor leadership leads to conflict, inability to get the work completed, inability to delegate, lower morale and a poor work ethic.
– Cultural constraints: Some cultures do not permit a team member to question a leader’s decisions or to make decisions, even in emergencies. Effective and experienced managers may be forced to resign due to an event that was out of their control.
– Responsibility: Poorly defined responsibilities can lead to a lack of ownership, which leads to overlooked problems and errors and to divisions between groups that should be working together.
– Incompetence: Leaders and team members may be personally incapable of, or unwilling to, follow protocols and procedures, make important decisions, foresee problems or find appropriate solutions. Some are willing to gamble and take risks that should not be taken.
Space Missions
Space missions are particularly vulnerable to faults because they are unique, expensive and complex systems operating in the harshest environment, with no hope of rescue or repair. The advantage of analyzing these faults is that they are often thoroughly investigated.
NASA Stardust Mission: Genesis Probe
– 3-year mission
– 5 months collecting space dust
– 1.8 billion miles
– $260 million
Crashed to Earth because the deceleration sensor that was to trigger the parachute release was installed backward.
NASA WIRE Satellite
The Wide-field Infrared Explorer (WIRE) was a satellite meant to explore molecular space gases at the infrared scale.
– Cost: $73 million.
– Lost all of its required coolant within minutes of reaching orbit, seriously affecting its mission.
– Problem attributed to timing errors and failsafe design flaws within the digital logic circuit.
NASA WIRE Satellite
On power-up, the FPGA did not receive a reset before all the systems were operational. This resulted in a random signal that ejected the telescope cover at the wrong time, and all the coolant was lost. The mission failed because of multiple errors in the digital circuit design, including a flawed power-up circuit and propagation delays that were not taken into account.
Source: http://klabs.org
The factors in the electronic circuit design that proved problematic included:
– FPGAs do not start in a predictable manner, so there must be allowances for incorrect outputs on start-up.
– Other associated circuits produced glitch states on start-up due to start-up delays, compounding the problem.
– Oscillators take time to start up and will not provide reliable edges for some time, which compounded the problem further.
– Simulation software does not take start-up into account and was a poor substitute for testing the functionality of the actual circuit.
The design on the left demonstrates the 14-bit counter that provides a 100 ms pulse to “arm” the pyrotechnic bolts on the cover. The counter relies on the reset inputs to turn it on or off. In the absence of a power-up reset, the counter will start in a random state and may immediately send an “arm” signal. This flaw fired the cover prematurely and damaged the main satellite instrument through loss of coolant.
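
The effect of a missing power-up reset can be illustrated with a small simulation. This is only a sketch and not the actual WIRE logic: the `enable` flip-flop, the terminal count and the rule that "arm" is asserted while the counter is enabled and still running are all assumptions made for illustration. Only the 14-bit counter width comes from the slide.

```python
import random

COUNTER_BITS = 14                       # counter width, from the slide
TERMINAL_COUNT = 2**COUNTER_BITS - 1    # assumed end-of-pulse count


def arm_after_power_up(rng, has_power_on_reset):
    """Return True if the hypothetical 'arm' output is asserted at power-up.

    Assumed model: without a reset, every flip-flop settles to an
    arbitrary 0 or 1. With a power-on reset, all state is forced to 0.
    """
    if has_power_on_reset:
        enable, counter = False, 0
    else:
        enable = bool(rng.getrandbits(1))
        counter = rng.getrandbits(COUNTER_BITS)
    # Assumed rule: arm is active while enabled and still counting.
    return enable and counter < TERMINAL_COUNT


def spurious_arm_rate(trials, has_power_on_reset, seed=0):
    """Fraction of simulated power-ups that assert 'arm' immediately."""
    rng = random.Random(seed)
    hits = sum(arm_after_power_up(rng, has_power_on_reset) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("with reset:   ", spurious_arm_rate(1000, True))
    print("without reset:", spurious_arm_rate(1000, False))
```

Under these assumptions, roughly half of all power-ups without a reset assert "arm" straight away, while a design with a power-on reset never does. A logic simulator that starts every register at zero would hide this behaviour, which matches the slide's point about simulation being a poor substitute for testing the real circuit.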
Other NASA Failures
– DART mission (2005): A software bug (calculation and data acquisition errors) caused the vehicle to burn excess fuel and collide with its rendezvous vehicle.
– SBIRS satellites (2007): Classified mission, but it appears there was a timing problem with the communication bus on the circuit boards, and the backup system (“safehold”) failed, locking out any possibility of re-establishing control.
– Mars Polar Lander (1999): Communication was lost on final approach to Mars. The software likely misinterpreted vibrations on descent as contact with the surface, and the descent engines were shut down prematurely.
– Mars Climate Orbiter (1998): Well known for its engineering fault: thruster impulse data supplied in pound-force seconds was used as if it were in newton-seconds, and the missing conversion led to the loss of the spacecraft at Mars.
These and other losses were partially attributed to using simulation software instead of physical tests. For instance, simulation software typically did not include vibration as a possible problem-causing event.
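
One common defence against a Mars Climate Orbiter style unit mix-up is to carry the unit with the value rather than passing bare numbers between teams. The sketch below assumes impulse measured in pound-force seconds and newton-seconds, the units generally reported for this failure. The `Impulse` class and its method names are hypothetical and are not taken from any flight software.

```python
from dataclasses import dataclass

# 1 lbf is defined as exactly 4.4482216152605 N.
NEWTONS_PER_POUND_FORCE = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Impulse stored internally in newton-seconds (hypothetical type)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens once, at the boundary where data enters.
        return cls(value * NEWTONS_PER_POUND_FORCE)


def total_impulse(burns: list[Impulse]) -> Impulse:
    """Sum impulses; every input is already in SI units by construction."""
    return Impulse(sum(b.newton_seconds for b in burns))


if __name__ == "__main__":
    ground_data = Impulse.from_pound_force_seconds(10.0)
    onboard_data = Impulse.from_newton_seconds(5.0)
    print(total_impulse([ground_data, onboard_data]).newton_seconds)
```

Because callers must choose a named constructor, a raw number cannot enter the calculation without someone stating which unit it is in.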
In 2003, a technician removed the 24 bolts that held the adapter plate to the cart but did not record the action. The team that later attached the satellite to the adapter plate did not check for the bolts. As the satellite was moved to a horizontal position, it toppled. A complete lack of procedural discipline was to blame: the processes were in place but were ignored. The repair cost $135 million.
NASA was not the only organization with problems and challenges. China, Russia, Japan, Korea, the EU, India, Iran and other countries (and companies) have experienced failures with satellites. Many were mechanical, and some are simply unknown or unreported.
Wenzhou Train Collision
In July 2011, two high-speed passenger trains collided in China, killing 40 and injuring more than 200 people. One train received a green “go” signal instead of a red “stop” signal, allowing it to occupy the same track as another train. A lightning strike blew the fuses for the logic circuits that detected trains and controlled the signaling. The design flaw was that the deactivated circuit showed a green light even though it could no longer detect traffic on the rail. The wrong “failsafe” was incorporated into the design.
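
The failsafe principle at stake can be sketched in a few lines: a signal should show green only on positive evidence that the track is clear, and every other condition, including a dead detection circuit, should default to red. The names below are hypothetical and this is not the actual signaling logic. `None` stands in for a detector that has lost power or cannot be read.

```python
from enum import Enum
from typing import Optional


class Signal(Enum):
    RED = "stop"
    GREEN = "go"


def signal_aspect(track_occupied: Optional[bool]) -> Signal:
    """Choose the signal aspect from the track detector reading.

    track_occupied is True (train detected), False (track confirmed
    clear) or None (detector unpowered or unreadable, an assumption
    used here to model the blown fuses).
    """
    if track_occupied is False:
        return Signal.GREEN
    # Occupied track and unknown state both fail to the safe side.
    return Signal.RED


if __name__ == "__main__":
    for reading in (True, False, None):
        print(reading, "->", signal_aspect(reading).name)
```

The key design choice is that green is the exception that must be earned, so any fault not anticipated by the designer still ends in a stop signal.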
Fukushima
One of the worst nuclear accidents ever recorded released massive amounts of radioactive material following a major earthquake and tsunami on March 11, 2011 in Japan.
– The earthquake shifted Japan a few meters to the east and dropped the landmass by 0.5 meters.
– The 14-meter tsunami that followed killed 19,000 people and destroyed whole towns.
– There were 11 nuclear reactors in the affected area. Several were in maintenance mode; all shut down automatically.
– Fukushima Daiichi units 1 to 3 had problems with the shutdown.
– Pool storage for spent, highly radioactive fuel was damaged.
– Unit 1 would eventually fail completely and breach containment.
Known Risk
The Japanese and Americans knew of the risks involved in building a nuclear plant at this location. In 1896 an earthquake of estimated magnitude 8.5 occurred in the area, with a 10.5-meter tsunami. Several studies before the accident had led the Japanese to anticipate a possible disaster, but nothing was done about it.
Image: ancient stone in Japan warning people not to build below this point, a warning ignored.
Nuclear Plant
Based on the early General Electric boiling-water reactor design:
– Water is used to cool the radioactive fuel rods contained within a pressure vessel.
– Water levels and temperature must be maintained.
All 3 vessels had problems:
– Lost the electrical power needed to control recirculation and other systems.
– All backup generators were flooded. Batteries ran out after one day. Other battery banks were flooded.
– All “failsafe” backup cooling systems were non-functional.
– Reactors 1, 2 and 3 overheated and all 3 received damage to their fuel rods.
– The storage pool for waste fuel developed a crack and was also losing coolant. This emergency was the most pressing, and the reactors were initially ignored.
Number 1
– Without power, temperature and pressure climbed in units 1, 2 and 3.
– Coolant levels dropped below the fuel rods. Rods exposed to air disintegrated due to heat.
– Pressure release also released hydrogen, which exploded.
– Likely coolant leak at the base of unit 1.
– Water was pumped continuously into the system to avert a more serious explosion and release.
– Radioactive material at the bottom of the unit 1 vessel eventually caused it to rupture, releasing highly radioactive particles into the environment, and melted over 2 ft into the concrete below.
Fukushima Lessons
– The tsunami was predictable, and one had happened in recent history. Several reports had determined that a serious risk existed, yet no action was taken.
– The initial report by the electrical company stated that the accident was due to “natural causes”, but subsequent government analysis demonstrated that it was caused by a complete lack of concern, foresight and will. The initial report itself was singled out as a further example of these problems.
– Cultural conventions contributed significantly to the failure to prevent and properly handle the disaster.
– Essential generators could have been placed a few dozen meters higher, which would have prevented the catastrophe. Instead, they were placed below the pressure vessel and remained flooded. All other backup systems failed due to flooding.
– Emergency processes were mismanaged due to confusion over authority, such as who could authorize airlifting additional backup generators.
– Too much nuclear waste was stored at the facility, and the containment had been breached by the earthquake.
– The cost of the disaster is estimated at more than $127 billion; some say $250 billion. Land and ocean areas will never be occupied again.
Three Mile Island
– In 1979 a critical valve stuck open, yet a panel light indicated it was closed. The indicator was wired to the signal going to the valve and did not detect whether the valve was actually closed.
– The open valve reduced the amount of coolant in the reactor core. There was no sensor for water levels.
– Other protocols and regulations were not followed. Coolant backup systems were off-line for maintenance.
– The operators did not reason correctly when analyzing multiple instrument readings that indicated low coolant.
A combination of errors led to a loss of coolant, a partial meltdown and a release of radioactive material outside the plant. Only small amounts of radiation were leaked to the outside environment, but the accident created serious concern among the population and led to a heightened awareness of managing risk.
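
The indicator flaw can be shown with a toy model: a light driven by the command sent to a valve reports "closed" even when the valve is stuck open, while a light driven by a position sensor does not. Everything in this sketch is hypothetical (the class, the field names and the idea of a separate position sensor as the fix). It illustrates the principle and does not describe the plant's actual instrumentation.

```python
from dataclasses import dataclass


@dataclass
class Valve:
    """Toy relief valve that may fail to respond to commands."""
    commanded_closed: bool = False
    actually_closed: bool = False
    stuck: bool = False

    def command_close(self) -> None:
        self.commanded_closed = True
        if not self.stuck:
            # A healthy valve follows the command; a stuck one does not.
            self.actually_closed = True


def light_from_command(valve: Valve) -> bool:
    """Indicator wired to the command signal (the flawed arrangement)."""
    return valve.commanded_closed


def light_from_position(valve: Valve) -> bool:
    """Indicator wired to an assumed position sensor on the valve itself."""
    return valve.actually_closed


if __name__ == "__main__":
    valve = Valve(stuck=True)
    valve.command_close()
    print("command-based light says closed: ", light_from_command(valve))
    print("position-based light says closed:", light_from_position(valve))
```

With a stuck valve the two indicators disagree, and only the position-based one tells the operator the truth.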
Consumer Products
There is a long list of consumer items that have been recalled because of electrical or electronic design flaws. In addition to possible lawsuits due to secondary damage, these recalls cost companies a substantial amount of money in logistics, labour and market image. In the US, it is estimated that product recalls cost $1 trillion per year.
Consumer Product Flaws
Countries like Canada require specific, third-party accredited testing for electrical products, in accordance with the Standards Council of Canada (SCC) and provincial regulator guidelines. The standards are set by the Canadian Standards Association (CSA). Some companies, such as Nemko, specialize in testing consumer products to ensure they are safe and functional over time.
One analysis of the latest recalls stated that 80% were caused by design and 20% by production. Many of the flaws were due to incomplete product testing across the range of operating conditions.
Source: http://www.ecnmag.com
Consumer Recalls: Fire and Overheating
Most of these products have a defective electronic design or component that may cause overheating:
– Ryobi battery chargers: 550,000 units
– Genie garage door openers: 18,000 units
– GE humidifiers: 2,700,000 units
– Gree dehumidifiers: 2,200,000 units
– Maytag dishwashers: 1,700,000 units
– HP Chromebook chargers: 145,000 units
– Schneider surge protectors: 15,000,000 units
– many more…
Image: GE ADEW30LN humidifier, before and after
Noteworthy Consumer Product Failures 1
(How were these failures detected?)
– Tyco smoke detector (2006): A faulty sensor rendered the units unable to detect smoke in high humidity. Recall of 150,000 units.
– Tyco Simplex fire alarm control (2014): Defective chip, 750 units.
– Tyco Simplex Grinnell fire alarm control (2011): Defective software fails to alert monitoring centers, 540 units.
– Kidde residential smoke/combo CO alarms (2014): Defective design. If a power outage occurs while the device performs its once-per-minute health check, the device goes into a “latched” mode and will not alarm.
Noteworthy Consumer Product Failures 2
(How were these failures detected?)
– Visonic Personal Emergency Response kit: Recalled because after a reboot it may fail to reconnect with the personal pendant. Defective design (1,700 units). A previous recall for the same model involved a failure of the low-battery indicator (24,000 units).
– Bosch Corporate Security Systems: Fail to activate an alarm in an alarm condition due to a design defect (2,000 units).
– Honda vehicles (2013): Brake unexpectedly due to an electronic defect in the stability control system, which fails to boot properly on start-up (344,000+ vehicles).
Conclusion
There will always be design flaws, inadequate processes and poor decisions made by people. Design and simulation tools make it possible to develop and evaluate solutions quickly. Ultimately, though, the hardware must function over its anticipated lifetime, so it is important to prototype and to anticipate the physical factors that may affect its operation.
Testing Products
A good manufacturing practice is to sample-test a batch of product for durability and functionality. Many electronic component failures occur within the first few hours of use. Environmental chambers that apply hot-cold and humid-dry cycles, different atmospheric pressures, electrical noise, vibration, UV light and other environmental factors are used to test circuit boards.
Image: environmental test chamber, http://www.candctechinc.com
Counterfeit Parts
Image: a counterfeit 500 GB hard drive that was actually a 128 MB flash device. It reports 500 GB but overwrites earlier data once the real capacity is exceeded. Purchased in a shop in China in 2011.
Source: https://www.jitbit.com + other sources
Counterfeit Parts
A major issue in the electronics industry:
– Significant financial risk and costs for manufacturers and consumers. Replacing a 10¢ part can cost hundreds in labour to identify and rectify the fault, or in downtime, replacement, recalls and lawsuits. According to the US Government, counterfeiting accounts for 8% of all merchandise trade ($1T).
– Time-consuming verification that genuine product has been used.
– Safety risk due to failed components. Fake parts have been found in aircraft control boards, military equipment, NASA equipment, medical equipment, consumer items, electrical devices, etc.
Source: multiple sources
Counterfeit Parts
Techniques:
– Remarking (black-topped or sanded): substituted part, inferior part or used part
– Used parts sold as new (leads cleaned)
– Defective or substandard batches sold as good
– Dies of older versions packaged as new versions
– Some fakes completely lack the electronics inside
Image: note the faint previous label (multiple sources)
Identifying Counterfeit Parts
Counterfeits are most likely to be more expensive parts such as ICs (81%) and transistors (8%).
Check markings for:
– poor, offset or uneven printing
– date codes that are impossible
– incorrect package codes
– codes that do not match the box they came in
– topping that can be removed with a standards-based solvent test
– parts the manufacturer could not have manufactured
Check the surface of the parts for:
– inconsistency between parts in the same batch
– filled-in cavities that manufacturers normally leave on the chip
– scratches from sanding
– inconsistent finish
– inconsistencies on the underside
Sources: http://www.aeri.com/counterfeit-electronic-component-detection/ http://spectrum.ieee.org/
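
The "impossible date code" check lends itself to automation at incoming inspection. The sketch below assumes a four-digit YYWW code (two-digit year, then week number) and parts made in 2000 or later. Real manufacturers use a variety of formats, so the function name, the format and the `introduced_year` parameter are illustrative assumptions only.

```python
from datetime import date


def date_code_is_plausible(code: str, introduced_year: int, today: date) -> bool:
    """Return False for YYWW date codes that cannot be genuine.

    Assumptions: the code is exactly four digits, the year is 20YY,
    and introduced_year is the year the part first entered production.
    """
    if len(code) != 4 or not code.isdigit():
        return False
    year = 2000 + int(code[:2])
    week = int(code[2:])
    if not 1 <= week <= 53:
        return False                      # no such week
    current_year, current_week, _ = today.isocalendar()
    if (year, week) > (current_year, current_week):
        return False                      # claims to be made in the future
    return year >= introduced_year        # made before the part existed?


if __name__ == "__main__":
    inspection_day = date(2015, 1, 15)
    for code in ("1452", "1554", "1601", "0810"):
        print(code, date_code_is_plausible(code, 2010, inspection_day))
```

A passing result proves nothing about authenticity, since counterfeiters can print a plausible code. The check only catches the careless cases the slide describes.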
How do they enter the market?
– Parts purchased on the “open market”, such as eBay
– Unethical or inexperienced brokers, suppliers or vendors
– Genuine parts are often sent as “samples” for testing; subsequent orders are fakes
– The fakes often function and may not be noticed
– Fake performance reports, certificates, specification and safety sheets
– Counterfeiters are getting better
Source of the materials:
– Parts stripped from used electronics
– Remarked cheaper, lower-quality or lower-performance parts
– Obsolete or defective parts and dies purchased in bulk
– Most come from China (over 60%), where there is no legal enforcement preventing counterfeiting
Reclamation & Repackaging: from waste to “new”
Images: multiple sources
Conclusion
– Only order parts from authorized distributors
– Train staff to look for counterfeit components
– Require certificates of authenticity when necessary
– Don’t order parts from eBay or questionable suppliers
Video: Power supply from eBay (funny too): https://www.youtube.com/watch?v=DZDh8z9UDTo
Image: Copper Clad Aluminum (CCA) Cat 5e wire: not acceptable. https://www.cablewholesale.com