Hardware-Integrated Approaches to Failure Advanced Warning Ralph H. Castain, Ph.D. Los Alamos National Laboratory.

1 Hardware-Integrated Approaches to Failure Advanced Warning Ralph H. Castain, Ph.D. Los Alamos National Laboratory

2 Outline
- Little history and perspective
  - What do we mean by “resilient”?
  - Traditional vs embedded approach
  - DARPA “built-in-test” program
- Cisco resilient router project
  - Brief overview of project
  - Our approach and partnership with OpenMPI
- Open Cluster Manager (OpenCM)

3 Motivation
- Head of new business unit for integrated diagnostics and control
- World’s largest customer
  - If system fails, will search out root cause
  - If your system, you pay cost of lost batch!
  - Rough cost/failure: $10M
  - Rough value of system: $200k

4 Resiliency
- Fault
  - Events that hinder the correct operation of a process. May not actually be a “failure” of a component, but can cause system-level failure or performance degradation below specified level
  - Effect may be immediate or some time in the future.
  - Usually are rare. May not have many data examples.
- Fault prediction
  - Estimate probability of incipient fault within some time period in the future
- Fault tolerance (reactive, static)
  - Ability to recover from a fault
- Robustness (metric)
  - How much can the system absorb without catastrophic consequences
- Resilience (proactive, dynamic)
  - Dynamically configure system to minimize impact of potential faults

5 Traditional Approach to Faults: The “Bathtub”
[Figure: bathtub failure-rate curve. Labels: Infant Mortality, “Floor” Region, MTBF, Defined Lifetime?, B]

6 What’s Wrong With That?
- Infant mortality
  - Resolved by extensive burn-in: costly
- Where to define “lifetime”?
  - A: Units decommissioned with considerable unused life
  - B: High probability of failures in advance
  - MTBF: ~50% of units fail before
- Bathtub floor does not sit at “zero”
  - Still significant probability of failure
- Can’t reliably estimate system lifetime due to multi-component degradation
  - Component-component interactions not reflected in individual component lifetime statistics
- Failures can be costly
  - Operational impact
  - Replacement costs
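The "~50% of units fail before MTBF" figure holds when lifetimes are spread symmetrically around the mean. As a rough check under a different assumption, the sketch below uses a constant failure rate (the flat "floor" of the bathtub, i.e. exponential lifetimes), where the fraction failed by the MTBF is closer to 63%. The function name and the MTBF value are hypothetical and chosen only for illustration.

```python
import math


def fraction_failed_by(t: float, mtbf: float) -> float:
    """Fraction of units failed by time t, assuming a constant failure rate.

    With exponential lifetimes the survival function is exp(-t / mtbf),
    so the failed fraction is its complement.
    """
    return 1.0 - math.exp(-t / mtbf)


# Hypothetical MTBF, for illustration only (not a figure from the slides).
mtbf_hours = 50_000.0

print(f"failed by MTBF:      {fraction_failed_by(mtbf_hours, mtbf_hours):.1%}")
print(f"failed by MTBF / 2:  {fraction_failed_by(mtbf_hours / 2, mtbf_hours):.1%}")
```

Either way, the point of the slide stands: retiring units at the MTBF still leaves a large share of failures in service.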

7 DARPA BIT Program
- Multi-year program in 1990s
  - Focus on electronic, mechanical failures
  - Create a “resilient war fighting” capability
  - Enable better maintenance support of increasingly complex systems
- Objectives
  - Push-button “good box/bad box” readout
    - Eliminate diagnostic “carts”, “toolboxes”,…
  - Pre-emptive switch from failing systems
  - “Okay for mission” test
    - Reduce probability of failures during mission

8 Results Encouraging
- Vibration signatures
  - Impending bearing failures
    - Fans, axles, transmissions
- Thermal patterns
  - Mechanical failures
    - Existence of hot spots
    - Patterns revealed root causes, better prediction
  - Electronic failures
    - Patterns across boards, surface of chips
- Electrical frequency composition
  - Breakdowns in power transistors, other devices
  - IC internal wire connection degradation

9 General Conclusions
- Exploit access to internals
  - Investigate optimal location, number of sensors
  - Embed intelligence, communications capability
- Integrate data from all available sources
  - Engineering design tests
  - Reliability life tests
  - Production qualification tests
- Utilize learning algorithms to improve performance
  - Both embedded, post process
  - Seed with expert knowledge

10 Objective

12 Questions
- Can we develop technologies that would…
  - Warn of impending failure
    - Provide time to reconfigure, respond
    - Allow switch to backup systems for continuous operation
    - Provide an opportunity to pace ourselves
  - “Stretch” life of system
  - With minimal overhead
    - Cannot significantly impact performance
- How would we use them?

13 Direct Detection
[Diagram. Labels: sensors (PZT, Temp, Voltage, Current), Spectral Filter, ADC, FDDP Analyzer; outputs: Good Box/Bad Box, Problem Diagnosis, Fault Prediction]

14 Integrate All Factors

15 Results (generalized)
- Prediction
  - Better than 97% faults predicted within specified response time (hours)
  - Less than 5% “bad” prediction rate
- Diagnosis
  - Better than 80% correct localization
- Detection (good/bad box)
  - Better than 99% correct identification
  - Less than 5% false positive rate
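The slide does not say how the detection rates were computed. One common reading, assumed here, is a confusion matrix over good-box/bad-box calls, with the false positive rate taken over healthy units. The sketch below shows that arithmetic; the class, the function names, and the unit counts are all hypothetical and are not data from the program.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    true_pos: int   # faulty units reported as "bad box"
    false_neg: int  # faulty units reported as "good box"
    false_pos: int  # healthy units reported as "bad box"
    true_neg: int   # healthy units reported as "good box"


def correct_identification_rate(c: Counts) -> float:
    """Share of all units whose good/bad call was right."""
    total = c.true_pos + c.false_neg + c.false_pos + c.true_neg
    return (c.true_pos + c.true_neg) / total


def false_positive_rate(c: Counts) -> float:
    """Share of healthy units wrongly reported as bad."""
    return c.false_pos / (c.false_pos + c.true_neg)


# Hypothetical counts, for illustration only.
counts = Counts(true_pos=48, false_neg=2, false_pos=3, true_neg=947)
print(f"correct identification: {correct_identification_rate(counts):.1%}")
print(f"false positive rate:    {false_positive_rate(counts):.2%}")
```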

17 Problem Statements
© 2006 Cisco Systems, Inc. All rights reserved.
1) Internet traffic growth and interconnect requirements are growing faster than the available power of silicon and software.
2) One approach is to build a larger, more distributed system.
3) The result is increased requirements on system software in terms of:
   a) High availability across a multi-component system
   b) Coherent view of intra-component messaging
   c) Fast convergence amongst components during change
   d) Distributed failover and effective sharing of load
   e) SW/HW maintenance without service impact

18 Problem Drivers
The demand for increased network system performance/scale is relentless...
- Moore’s law: x2 per 18 months
- DRAM access rate: x1.1 per 18 months
- Silicon speed: x1.5 per 18 months
- Router capacity: x2.9 per 18 months
[Chart: log-scale growth (1 to 10000) by year, 1993 to 2005]
Growth driven by increased user demand
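To see how quickly these rates diverge, the sketch below compounds each one over the 1993 to 2005 span shown on the chart axis, reading "x2/18m" as a factor of 2 every 18 months. The function name and the choice of span are assumptions made for illustration; the four factors are the ones on the slide.

```python
def growth(factor_per_period: float, months: float, period_months: float = 18.0) -> float:
    """Cumulative growth after `months`, compounding once per period."""
    return factor_per_period ** (months / period_months)


# Per-18-month factors as listed on the slide.
RATES = {
    "Moore's law": 2.0,
    "DRAM access rate": 1.1,
    "Silicon speed": 1.5,
    "Router capacity": 2.9,
}

months = 12 * 12  # 1993 to 2005, i.e. eight 18-month periods

for name, factor in RATES.items():
    print(f"{name:18s} x{growth(factor, months):,.1f}")

# Ratio of router capacity demand to Moore's-law supply over the same span.
shortfall = growth(RATES["Router capacity"], months) / growth(RATES["Moore's law"], months)
print(f"demand outpaces Moore's law by about x{shortfall:.1f}")
```

Under these assumptions, router capacity demand outgrows transistor density by roughly a factor of twenty over the period.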

19 Problem Drivers
Technology is falling behind demand curve
- Shortfall!
- Shortfall is overcome by architectural innovation and trading off: performance, functionality, programmability, physical size/density
  - Very hard to sustain long-term

20 Hardware Details
- Product example: largest routing system available today
- Each linecard chassis: 1.28Tbps, 13.6kW
- Switch fabric chassis: 8kW

21 Hardware Details
- Product example
- Maximum HW configuration: 92Tbps switching capacity across millions of interfaces.
  - 48 x LC chassis + 8 x Fabric chassis
- => System messaging across all control CPUs to manage switch fabric and interface control

22 Software Details
System Software Requirements
1) Turn on once with remote access thereafter
2) Non-Stop == max 20 events/day lasting < 200ms each
3) Hitless SW upgrades and downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field patchable
6) Beta test new features in situ
7) Extensive trace facilities: on routes, tunnels, subscribers,…
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance and software installations
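The "Non-Stop" requirement can be turned into a worst-case downtime budget. The sketch below does that arithmetic for 20 events per day at 200 ms each, treating every event as a full outage, which is an assumption on our part; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def worst_case_downtime_s(events_per_day: int, max_event_ms: float) -> float:
    """Upper bound on daily downtime if every event hits its maximum duration."""
    return events_per_day * max_event_ms / 1000.0


def availability(downtime_s: float) -> float:
    """Fraction of the day the system is up, given total downtime in seconds."""
    return 1.0 - downtime_s / SECONDS_PER_DAY


downtime = worst_case_downtime_s(events_per_day=20, max_event_ms=200)
print(f"worst-case downtime: {downtime:.1f} s/day")
print(f"implied availability: {availability(downtime):.4%}")
```

Read this way, the requirement allows at most 4 seconds of disruption per day.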

23 Our Approach: Use OpenRTE
- Setup for new frameworks
  - Sensor - monitor hardware, software
  - FDDP - use sensor inputs to compute sliding window or probabilities
- Contribute back to OpenMPI
  - Proprietary modules as binary plug-ins
- Write new cluster manager
  - Exploit new capabilities
  - Create as non-centralized application
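The slides do not describe how the FDDP framework computes its sliding window. As a minimal sketch of the general idea, the class below keeps the most recent sensor readings and reports the fraction above a threshold as a crude fault indicator. The class name, the threshold rule, and the sample temperatures are all assumptions for illustration, not the OpenRTE implementation.

```python
from collections import deque


class SlidingWindowEstimator:
    """Track the last `window_size` readings and score how many exceed a threshold."""

    def __init__(self, window_size: int, threshold: float) -> None:
        # deque with maxlen drops the oldest reading automatically.
        self.readings = deque(maxlen=window_size)
        self.threshold = threshold

    def add(self, value: float) -> float:
        """Record a reading and return the fraction of the window above threshold."""
        self.readings.append(value)
        exceeding = sum(1 for r in self.readings if r > self.threshold)
        return exceeding / len(self.readings)


# Hypothetical temperature readings (degrees C) trending upward.
estimator = SlidingWindowEstimator(window_size=4, threshold=80.0)
for reading in [70.0, 75.0, 85.0, 90.0, 95.0]:
    score = estimator.add(reading)
    print(f"reading={reading:5.1f}  window score={score:.2f}")
```

A real predictor would map such a score to a calibrated probability; the fraction here is only a stand-in.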

24 ORTE Extensions
- Software sensors
  - Memory footprint, cpu utilization (upper and lower), output file size
- Hardware sensors
  - Temperature, vibration
- FDDP
  - B-spline trend fit
- Resilient mapper
  - Fault groups
    - Nodes with common failure mode
    - Node can belong to multiple fault groups
  - Map replicas across fault groups
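One way to picture "map replicas across fault groups" is a greedy placement that, for each replica, picks the node sharing the fewest fault groups with nodes already chosen. The sketch below assumes that rule; the function name, the tie-break by node name, and the rack and power-supply groups are hypothetical and are not the ORTE mapper's actual algorithm.

```python
from typing import Dict, List, Set


def map_replicas(node_groups: Dict[str, Set[str]], n_replicas: int) -> List[str]:
    """Greedily place replicas on nodes that overlap least with used fault groups."""
    chosen: List[str] = []
    used_groups: Set[str] = set()
    candidates = sorted(node_groups)

    for _ in range(min(n_replicas, len(candidates))):
        remaining = [n for n in candidates if n not in chosen]
        # Prefer the node with the fewest fault groups already in use;
        # break ties by node name so the result is deterministic.
        best = min(remaining, key=lambda n: (len(node_groups[n] & used_groups), n))
        chosen.append(best)
        used_groups |= node_groups[best]
    return chosen


# Hypothetical cluster: each node belongs to a rack group and a power-supply group.
nodes = {
    "n0": {"rack1", "psuA"},
    "n1": {"rack1", "psuB"},
    "n2": {"rack2", "psuA"},
    "n3": {"rack2", "psuB"},
}
print(map_replicas(nodes, 2))
```

With two replicas, the placement lands on nodes that share neither a rack nor a power supply, so no single group failure takes out both.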

25 Cluster Manager
- Orted auto-starts upon node power-up
  - Auto-detect and connect to CM
- CM launches specified number of replicas of each application
  - Resilient mapper => minimize single point failures
- Applications auto-wireup
  - Plug-and-play inspired approach
  - Application decides which input to declare “leader”

26 Application Failure
- Orted detects (or predicts) failure and notifies CM
- CM utilizes resilient mapper to determine location of replacement
  - Future extension: probability of failure modes to help drive fault group selection
  - New replica is launched, does auto-wireup
- Connected applications
  - Loss of communication from “leader”
  - Independently select new “leader”
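The slides do not say how a connected application picks its new "leader". A minimal sketch, assuming each application tracks when it last heard from every input and applies the same deterministic rule (lowest name among inputs heard from recently), is shown below. The function name, the timeout, and the replica names are hypothetical.

```python
from typing import Dict, Optional


def select_leader(last_heard: Dict[str, float], now: float, timeout_s: float) -> Optional[str]:
    """Pick a leader among inputs heard from within `timeout_s` seconds.

    Every application applies the same rule to its own observations, so
    applications that see the same live inputs choose the same leader
    without coordinating.
    """
    alive = [name for name, t in last_heard.items() if now - t <= timeout_s]
    return min(alive) if alive else None


# Hypothetical timestamps (seconds): replica-0 was leader but has gone silent.
last_heard = {"replica-0": 10.0, "replica-1": 14.5, "replica-2": 14.8}
print(select_leader(last_heard, now=15.0, timeout_s=1.0))
```

Here the silent input is skipped and the next live one takes over.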

28 OpenCM
- Transition Cisco work to open source
- Broaden mission
  - Extend to HPC, other embedded operations
  - Manage any collection of nodes
  - Resilient operation with hooks
    - MPI
    - Other application layers
- Released under the OpenMPI license
  - BSD-like, open use

29 Concluding Remarks
http://www.open-mpi.org/

