Failure Prediction in Hardware Systems. Douglas Turnbull, Neil Alldrin. CSE 221: Operating System Final Project, Fall 2003.
2 Background Using sensors from a high-end server, can we predict system board failures? If we can predict a failure, we can take preventive action to avoid costly failures. System Specifications: 18 hot-swappable system boards; 4 processors per board; 18 sensors per board, measuring various temperatures and voltages.
3 Sensor Logs Each board has an associated sensor log. About every minute, the sensors are sampled and the measurements are stored in the sensor log. System board failures are also recorded in the sensor log. We need to extract a data set from these logs to represent failure events (positive examples) and normal operating conditions (negative examples). We accomplish this using a Windowing Abstraction.
4 Windowing Abstraction Sensor Window – adjacent entries in the sensor log that are used to predict failures. Potential Failure Window – the window that determines an example's label: the example is positive if a failure occurs in the potential failure window, and negative otherwise.
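The windowing abstraction can be sketched as a sliding pass over one board's log. This is a minimal illustration, not the authors' implementation: the log layout (a list of `(measurements, failed)` pairs), the function name `make_examples`, and the choice to place the potential failure window immediately after the sensor window are all assumptions.

```python
def make_examples(log, sensor_len, failure_len):
    """Build (sensor_window, label) examples from one board's log.

    Assumed layout: `log` is a time-ordered list of (measurements, failed)
    pairs, where `measurements` is a list of sensor readings and `failed`
    is True if a board failure was recorded at that entry.
    Assumed geometry: the potential failure window starts right after
    the sensor window ends.
    """
    examples = []
    last_start = len(log) - sensor_len - failure_len
    for start in range(last_start + 1):
        split = start + sensor_len
        # Readings the classifier is allowed to see.
        sensor_window = [m for m, _ in log[start:split]]
        # Entries that decide the label.
        failure_window = log[split:split + failure_len]
        label = any(failed for _, failed in failure_window)
        examples.append((sensor_window, label))
    return examples
```

Each returned example pairs the readings inside a sensor window with a positive (`True`) or negative (`False`) label.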
5 Feature Vectors Feature vectors are created from the data in a sensor window. There are two types of feature vectors: Raw Feature Vectors – a vector of all the sensor measurements in a sensor window. Summary Feature Vectors – the mean, standard deviation, range and slope for each of the sensors in a sensor window.
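A rough sketch of the two feature types follows. It assumes a sensor window is a list of rows, one row of readings per log entry. Two details are choices made here, not stated on the slide: the population standard deviation, and a slope taken from a least-squares line fit against the entry index.

```python
from statistics import mean, pstdev


def slope(values):
    """Least-squares slope of `values` against their index 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den if den else 0.0


def raw_features(window):
    """Concatenate every measurement in the window into one flat vector."""
    return [x for row in window for x in row]


def summary_features(window):
    """Mean, standard deviation, range and slope for each sensor."""
    features = []
    for series in zip(*window):  # one time series per sensor
        features += [
            mean(series),
            pstdev(series),
            max(series) - min(series),
            slope(series),
        ]
    return features
```

With 18 sensors, the summary vector always has 72 entries, while the raw vector grows with the window length.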
6 Classification A classifier assigns labels (positive or negative) to novel feature vectors after it has been trained using a set of feature vectors with known labels. Many classifiers can be used, such as SVMs, Bayesian mixture models, and neural networks. We use a Radial Basis Function (RBF) network, a special form of a neural network, because it is computationally efficient.
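To make the idea of an RBF network concrete, here is a toy sketch: Gaussian hidden units centered on fixed points, followed by a linear output layer. The slide does not say how the authors chose centers or trained the weights, so the fixed `centers`, the width `sigma`, the delta-rule training loop, and the 0.5 decision threshold are all assumptions for illustration.

```python
import math


def rbf_activations(x, centers, sigma):
    """Gaussian response of each hidden unit to the input vector x."""
    return [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * sigma ** 2))
        for c in centers
    ]


def rbf_output(x, weights, centers, sigma):
    """Linear combination of the hidden-unit activations."""
    phi = rbf_activations(x, centers, sigma)
    return sum(w * p for w, p in zip(weights, phi))


def train_rbf(X, y, centers, sigma, lr=0.5, epochs=200):
    """Fit the output weights with the delta rule; centers stay fixed."""
    weights = [0.0] * len(centers)
    for _ in range(epochs):
        for x, target in zip(X, y):
            phi = rbf_activations(x, centers, sigma)
            error = target - sum(w * p for w, p in zip(weights, phi))
            weights = [w + lr * error * p for w, p in zip(weights, phi)]
    return weights


def predict(x, weights, centers, sigma, threshold=0.5):
    """Label x positive (True) when the network output reaches the threshold."""
    return rbf_output(x, weights, centers, sigma) >= threshold
```

Because only the output layer is trained, fitting reduces to a linear problem, which is one common reason RBF networks are considered cheap to train.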
7 Evaluation Predictions are compared with the ground truth in a two-by-two table (rows: prediction; columns: ground truth). Predicted failure: true positives (ground truth failure), false positives (ground truth non-failure). Predicted non-failure: false negatives (ground truth failure), true negatives (ground truth non-failure). We must consider two rates when evaluating our prediction system. True Positive Rate (tpr) – a measure of our ability to correctly predict true failures. tpr = correctly predicted failures / total number of true failures. False Positive Rate (fpr) – a measure of the number of mispredictions. fpr = incorrectly predicted failures / total number of non-failures.
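Both rates can be computed directly from the four cells of the table. The sketch below assumes labels are given as booleans, with `True` meaning failure; the function name `rates` is illustrative.

```python
def rates(truth, predicted):
    """Return (tpr, fpr) for parallel lists of boolean labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # failures predicted
    fn = sum(1 for t, p in pairs if t and not p)      # failures missed
    fp = sum(1 for t, p in pairs if not t and p)      # false alarms
    tn = sum(1 for t, p in pairs if not t and not p)  # correctly quiet
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

Note that the denominators differ: tpr is normalized by the number of true failures, fpr by the number of non-failures.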
8 Preliminary Results Observations: 1. Summary feature vectors have lower false positive rates than raw feature vectors. 2. Window size does not seem to matter. How can we improve these results?
9 Feature Subset Selection We can further improve prediction accuracy (and reduce computation) by reducing the number of features used by our classifier. Features are selected automatically using Forward Stepwise Selection.
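Forward stepwise selection is a greedy search: start with no features, and at each step add the single feature that most improves a score. The sketch below shows the generic procedure. The `score` callback is a stand-in (for example, validation accuracy of a classifier trained on the chosen features) and the stopping rule is an assumption, since the slide does not give the authors' criterion.

```python
def forward_stepwise(n_features, score, max_features=None):
    """Greedily pick feature indices that maximize `score(subset)`.

    `score` takes a list of feature indices and returns a number where
    higher is better. Selection stops when no remaining feature improves
    the score or when `max_features` have been chosen.
    """
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    while remaining and (max_features is None or len(selected) < max_features):
        candidate, candidate_score = None, best_score
        for f in sorted(remaining):
            s = score(selected + [f])
            if s > candidate_score:
                candidate, candidate_score = f, s
        if candidate is None:  # no feature improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score
```

Each round evaluates every remaining feature once, so selecting k of n features costs on the order of k times n score evaluations, far fewer than trying every subset.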
11 Best Results We find the best prediction results with Summary Feature Vectors using 2/3 of the summary features: 0.87 True Positive Rate (tpr), 0.10 False Positive Rate (fpr). Our data set assumes that a failure is as likely as a non-failure. Because there are very few failures in most hardware systems, even a low false positive rate will produce many false positives.
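The effect of rare failures can be illustrated with Bayes' rule: given the reported tpr and fpr, what fraction of predicted failures are real? The failure rate of 1 in 1000 used in the usage note is a made-up figure for illustration, not a measurement from the project.

```python
def fraction_of_alarms_real(tpr, fpr, failure_rate):
    """Probability that a predicted failure is a true failure.

    `failure_rate` is the assumed fraction of examples that are failures.
    """
    true_alarms = tpr * failure_rate
    false_alarms = fpr * (1.0 - failure_rate)
    return true_alarms / (true_alarms + false_alarms)


# Balanced data, as in the data set described above.
balanced = fraction_of_alarms_real(0.87, 0.10, 0.5)

# Hypothetical deployment where only 1 in 1000 windows precedes a failure.
rare = fraction_of_alarms_real(0.87, 0.10, 0.001)
```

Under the balanced assumption roughly 90% of alarms are real; under the hypothetical 1-in-1000 rate fewer than 1% are, even though tpr and fpr are unchanged.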
12 Future Work Implement other classifiers (SVMs, Bayesian mixture models). Develop a larger data set with more examples of failures. Apply the framework to other hardware systems, such as personal computers. Modify the operating system to take advantage of failure prediction: migrate processes to other system boards, run diagnostic tests, turn off suspect system boards, back up data.