
1 A Proactive Resiliency Approach for Large-Scale HPC Systems. System Research Team, presented by Geoffroy Vallee, Oak Ridge National Laboratory. Welcome to HPCVirt 2009

2 Goal of the Presentation Can we anticipate failures and avoid their impact on application execution?

3 Introduction  Traditional fault tolerance policies in HPC systems  Reactive policies  Other approach: proactive fault tolerance  Two critical capabilities to make proactive FT successful  Failure prediction  Anomaly detection  Application migration  Proactive policy  Testing / Experimentation  Is proactive fault tolerance the solution?

4 Failure Detection & Prediction  System monitoring  Live monitoring  Study non-intrusive monitoring techniques  Postmortem failure analysis  System log analysis  Live analysis for failure prediction  Postmortem analysis  Anomaly analysis  Collaboration with George Ostrouchov  Statistical tool for anomaly detection
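As a concrete illustration of the live log-analysis idea above, the sketch below (Python, not part of the original work) raises a per-node prediction once warning-class messages accumulate within a short window; the message patterns, window size, and threshold are illustrative assumptions.

    import re
    from collections import defaultdict, deque

    # Illustrative warning-class patterns that often precede hardware failures.
    WARNING_PATTERNS = [re.compile(p, re.I) for p in
                        (r'ecc.*corrected', r'scsi.*retry', r'temperature above threshold')]

    class LogPredictor:
        def __init__(self, window_s=600, threshold=5):
            self.window_s, self.threshold = window_s, threshold
            self.events = defaultdict(deque)   # node -> timestamps of recent warnings

        def feed(self, timestamp, node, message):
            """Consume one syslog line; return a prediction string or None."""
            if any(p.search(message) for p in WARNING_PATTERNS):
                q = self.events[node]
                q.append(timestamp)
                while q and timestamp - q[0] > self.window_s:
                    q.popleft()                 # keep only warnings inside the window
                if len(q) >= self.threshold:
                    return f'predict failure on {node}'
            return None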

5 Anomaly Detection  Anomaly Analyzer (George Ostrouchov)  Ability to view groups of components as statistical distributions  Identify anomalous components  Identify anomalous time periods  Based on numeric data with no expert knowledge for grouping  Scalable approach: only statistical properties of simple summaries  Statistical power comes from examining high-dimensional relationships  Visualization utility used to explore data  Implementation uses  R project for statistical computing  GGobi visualization tool for high-dimensional data exploration  With good failure data, could be used for failure prediction
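The actual analyzer is built on R and GGobi; the minimal Python sketch below only illustrates the underlying idea of summarizing each node/hour window and flagging the ones that sit far from the group, here with a robust z-score. The array shape and threshold are assumptions, not details from the slides.

    import numpy as np

    def flag_anomalies(summaries, threshold=3.5):
        """summaries: (n_nodes, n_hours, n_metrics) array of per-hour metric
        means (e.g., CPU temperature, memory use) for each node."""
        n_nodes, n_hours, n_metrics = summaries.shape
        flat = summaries.reshape(-1, n_metrics)
        # robust z-score: distance from the population median in units of MAD
        med = np.median(flat, axis=0)
        mad = np.median(np.abs(flat - med), axis=0) + 1e-9
        z = np.abs(summaries - med) / (1.4826 * mad)
        # a (node, hour) cell is anomalous if any metric is far from the group
        mask = (z > threshold).any(axis=-1)
        return [(n, h) for n in range(n_nodes) for h in range(n_hours) if mask[n, h]]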

6 Anomaly Detection Prototype  Monitoring / Data collection  Prototype developed using XTORC  Ganglia monitoring system  Standard metrics, e.g., memory/CPU utilization  lm_sensors data, e.g., CPU/motherboard temperature  Leveraged the RRD reader from Ovis v1.1.1
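A hedged sketch of how such a prototype might read samples back out of Ganglia's round-robin databases; it assumes the python-rrdtool bindings and Ganglia's usual one-file-per-metric-per-host layout, and the example path is hypothetical rather than taken from the slides.

    import rrdtool

    def fetch_metric(rrd_path, start='-48h', end='now'):
        """Return (timestamps, values) for the AVERAGE consolidation function."""
        (t0, t1, step), names, rows = rrdtool.fetch(
            rrd_path, 'AVERAGE', '--start', start, '--end', end)
        times = range(t0, t1, step)
        values = [r[0] for r in rows]   # Ganglia keeps one data source per rrd
        return list(times), values

    # e.g. fetch_metric('/var/lib/ganglia/rrds/xtorc/node13/cpu_user.rrd')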

7 Proactive Fault Tolerance Mechanisms  Goal: move the application away from the component that is about to fail  Migration  Pause/unpause  Major proactive FT mechanisms  Process-level migration  Virtual machine migration  In our context  The policy does not depend on the underlying mechanism  We can easily switch between solutions

8 System and Application Resilience  What policy to use for proactive FT?  Modular framework  Virtual machine checkpoint/restart and migration  Process-level checkpoint/restart and migration  Implementation of new policies via our SDK  Feedback loop  Policy simulator  Eases the initial study of new policies  Simulation results match experimental virtualization results
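A minimal sketch of what a policy plug-in interface could look like: a policy consumes health updates and returns a decision, independent of whether the underlying mechanism is Xen live migration or BLCR process migration. Class and method names are illustrative, not the project's actual SDK.

    from abc import ABC, abstractmethod

    class ProactivePolicy(ABC):
        @abstractmethod
        def decide(self, node, health):
            """health: dict of metric -> value; return 'migrate' or 'stay'."""

    class ThresholdPolicy(ProactivePolicy):
        def __init__(self, limits):
            self.limits = limits              # e.g. {'cpu_temp': 85.0}
        def decide(self, node, health):
            over = any(health.get(m, 0) > lim for m, lim in self.limits.items())
            return 'migrate' if over else 'stay'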

9 Type 1 Feedback-Loop Control Architecture  Alert-driven coverage  Basic failures  No evaluation of application health history or context  Prone to false positives  Prone to false negatives  Prone to missing the real-time window  Prone to decreasing application health through migration  No correlation of health context or history

10 Type 2 Feedback-Loop Control Architecture  Trend-driven coverage  Basic failures  Fewer false positives/negatives  No evaluation of application reliability  Prone to missing the real-time window  Prone to decreasing application health through migration  No correlation of health context or history
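To make the type 1 versus type 2 distinction concrete, the sketch below contrasts an alert-driven trigger, which fires on any single reading over a limit, with a trend-driven one, which fires only when a sliding-window average is high and still rising. The window size and limit are made-up parameters.

    from collections import deque

    class TrendTrigger:
        def __init__(self, limit, window=12):
            self.limit = limit
            self.history = deque(maxlen=window)

        def alert_driven(self, reading):
            # type 1: reacts to any spike, hence prone to false positives
            return reading > self.limit

        def trend_driven(self, reading):
            # type 2: smooth over a window and require a rising trend
            self.history.append(reading)
            if len(self.history) < self.history.maxlen:
                return False
            avg = sum(self.history) / len(self.history)
            rising = self.history[-1] > self.history[0]
            return avg > self.limit and rising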

11 Type 3 Feedback-Loop Control Architecture  Reliability-driven coverage  Basic and correlated failures  Fewer false positives/negatives  Able to maintain the real-time window  Does not decrease application health through migration  Correlation of short-term health context and history  No correlation of long-term health context or history  Unable to match system and application reliability patterns

12 Type 4 Feedback-Loop Control Architecture  Reliability-driven coverage of failures and anomalies  Basic and correlated failures, anomaly detection  Less prone to false positives  Less prone to false negatives  Able to maintain the real-time window  Does not decrease application health through migration  Correlation of short- and long-term health context & history

13 Testing and Experimentation  How to evaluate a failure prediction mechanism?  Failure injection  Anomaly detection  How to evaluate the impact of a given proactive policy?  Simulation  Experimentation

14 Fault Injection / Testing  First purpose: testing our research  Inject failures at different levels: system, OS, application  Framework for fault injection  Controller: Analyzer, Detector & Injector  Targets at both system and user level  Testing of failure prediction/detection mechanisms  Mimic the behavior of other systems  “Replay” a failure sequence on another system  Based on system logs, we can evaluate the impact of different policies
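One way to picture the “replay” use of the framework, under an assumed record format and detector interface (not the project's actual APIs): feed a logged failure sequence into a prediction mechanism and score how often it warned early enough.

    def replay(log_events, detector, warn_window=300):
        """log_events: list of (timestamp, node, kind), kind in {'warning', 'failure'}.
        Count failures for which the detector raised a prediction within
        warn_window seconds before the failure."""
        hits = misses = 0
        for t, node, kind in sorted(log_events):
            if kind != 'failure':
                detector.observe(t, node, kind)          # hypothetical interface
                continue
            predicted_at = detector.last_prediction(node) # hypothetical interface
            if predicted_at is not None and 0 < t - predicted_at <= warn_window:
                hits += 1
            else:
                misses += 1
        return hits, misses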

15 Fault Injection  Example faults/errors  Bit-flips - CPU registers/memory  Memory errors - memory corruption/leaks  Disk faults - read/write errors  Network faults - packet loss, etc.  Important characteristics  Representative failures (fidelity)  Transparency and low overhead  Detection/Injection are linked  Existing Work  Techniques: Hardware vs. Software  Software FI can leverage performance/debug hardware  Not many publicly available tools
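A minimal user-level example of one fault class, a single bit flip: it flips a bit in a buffer owned by the injector itself. Injecting into another process, CPU registers, or devices requires the kernel or hardware support discussed on the next slide; this is only a sketch of the idea.

    import random

    def flip_random_bit(buf: bytearray) -> int:
        """Flip one randomly chosen bit in buf and return its bit offset."""
        bit = random.randrange(len(buf) * 8)
        buf[bit // 8] ^= 1 << (bit % 8)
        return bit

    data = bytearray(b'sensor frame 0042')
    offset = flip_random_bit(data)
    print(f'flipped bit {offset}:', data)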

16 Simulator  Driven by system logs  Currently based on logs from LLNL ASCI White  Evaluate the impact of  Alternate policies  System/FT mechanism parameters (e.g., checkpoint cost)  Enables studies & evaluation of different configurations before actual deployment
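The slides do not spell out the simulator's internal model; as a stand-in, this sketch uses Young's first-order approximation to show the kind of trade-off such a simulator explores: how checkpoint cost and MTBF set the checkpoint interval and the expected overhead. The numeric values are illustrative only.

    import math

    def young_interval(ckpt_cost, mtbf):
        # optimal interval between checkpoints, T ~ sqrt(2 * C * MTBF)
        return math.sqrt(2.0 * ckpt_cost * mtbf)

    def expected_overhead(ckpt_cost, mtbf):
        t = young_interval(ckpt_cost, mtbf)
        ckpt_frac = ckpt_cost / t                      # time spent writing checkpoints
        rework_frac = (t + ckpt_cost) / (2.0 * mtbf)   # expected lost work per failure
        return ckpt_frac + rework_frac

    for cost in (60, 300, 900):                        # checkpoint cost in seconds
        mtbf = 6 * 3600                                # assumed 6-hour system MTBF
        print(cost, round(young_interval(cost, mtbf)), round(expected_overhead(cost, mtbf), 3))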

17 Anomaly Detection: Experimentation on “XTORC”  Hardware  Compute nodes: ~45-60 (P4 @ 2 GHz)  Head node: 1 (P4 @ 1.7 GHz)  Service/log server: 1 (P4 @ 1.8 GHz)  Network: 100 Mb Ethernet  Software  Operating systems span Red Hat 9, Fedora Core 4 & 5  RH9: node 53  FC4: nodes 4, 58, 59, 60  FC5: nodes 1-3, 5-52, 61  RH9 uses Linux 2.4  FC4/5 use Linux 2.6  NFS exports ‘/home’

18 XTORC Idle 48-hr Results  Data classified and grouped automatically  However, these results were manually interpreted (admin & statistician)  Observations  Node 0 is the most different from the rest, particularly hours 13, 37, 46, and 47. This is the head node where most services are running.  Node 53 runs the older Red Hat 9 (all others run Fedora Core 4/5).  It turned out that nodes 12, 31, 39, 43, and 63 were all down.  Node 13 … and particularly its hour 47!  Node 30 hour 7 … ?  Node 1 & Node 5 … ?  Three groups emerged in the data clustering  1. temperature/memory-related, 2. CPU-related, 3. I/O-related

19 Anomaly Detection - Next Steps  Data  Reduce overhead in data gathering  Monitor more fields  Investigate methods to aid data interpretation  Identify significant fields for given workloads  Heterogeneous nodes  Different workloads  Base (no/low work)  Loaded (benchmark/app work)  Loaded + Fault Injection  Working toward links between anomalies and failures

20 Prototypes - Overview  Proactive & reactive fault tolerance  Process level: BLCR + LAM-MPI  Virtual machine level: Xen + any kind of MPI implementation  Detection  Monitoring framework: based on Ganglia  Anomaly detection tool  Simulator  System log based  Enable customization of policies and system/application parameters

21 Is proactive the answer?  Most of the time, prediction accuracy is not good enough; we may lose all the benefit of proactive FT  No “one-size-fits-all” solution  Combination of different policies  “Holistic” fault tolerance  Example: decrease the checkpoint frequency by combining proactive and reactive FT policies  Optimization of existing policies  Leverage existing techniques/policies  Tuning  Customization
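A first-order illustration of the checkpoint-frequency example above (not the authors' model): if proactive FT avoids a fraction p of failures, the reactive checkpointer effectively sees MTBF / (1 - p), so the Young interval grows by 1/sqrt(1 - p) and checkpoints can be taken less often.

    import math

    def adjusted_interval(ckpt_cost, mtbf, predicted_fraction):
        effective_mtbf = mtbf / (1.0 - predicted_fraction)
        return math.sqrt(2.0 * ckpt_cost * effective_mtbf)

    base = adjusted_interval(300, 6 * 3600, 0.0)
    with_proactive = adjusted_interval(300, 6 * 3600, 0.5)   # 50% of failures avoided
    print(round(with_proactive / base, 2))                    # ~1.41x longer interval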

22 Resources  http://www.csm.ornl.gov/srt/  Contact: Geoffroy Vallee

23 Performance Prediction  Significant variance between different runs of the same experiment  Only a few studies address the problem  “System noise”  Critical for scaling up  Scientists want strict answers  What are the problems?  Lack of tools?  VMMs are too big/complex?  Not enough VMM-bypass/optimization?

24 Fault Tolerance Mechanisms  FT mechanisms are not yet mainstream (out-of-the-box)  But different solutions are becoming available (BLCR, Xen, etc.)  Support as many mechanisms as possible  Reactive FT mechanisms  Process-level checkpoint/restart  Virtual machine checkpoint/restart  Proactive FT mechanisms  Process-level migration  Virtual machine migration

25 Existing System-Level Fault Injection  Virtual Machines  FAUmachine  Pro: focused on FI & experiments, code available  Con: older project, lots of dependencies, slow  FI-QEMU (patch)  Pro: works with the ‘qemu’ emulator, code available  Con: patch for the ARM architecture, limited capabilities  Operating System  Linux (>= 2.6.20)  Pro: extensible, kernel- & user-level targets, maintained by the Linux community  Con: immature, focused on testing Linux

26 Future Work  Implementation of the RAS framework  Ultimately have an “end-to-end” solution for system resilience  From initial studies based on the simulator  To deployment and testing on computing platforms  Using different low-level mechanisms (process level versus virtual machine level mechanisms)  Adapting the policies to both the platform and the applications

