
1 A Proactive Resiliency Approach for Large-Scale HPC Systems. System Research Team, presented by Geoffroy Vallee, Oak Ridge National Laboratory. Welcome to HPCVirt 2009

2 Goal of the Presentation Can we anticipate failures and avoid their impact on application execution?

3 Introduction  Traditional fault tolerance policies in HPC systems  Reactive policies  Other approach: proactive fault tolerance  Two critical capabilities to make proactive FT successful  Failure prediction  Anomaly detection  Application migration  Proactive policy  Testing / Experimentation  Is proactive fault tolerance the solution?

4 Failure Detection & Prediction  System monitoring  Live monitoring  Study non-intrusive monitoring techniques  Postmortem failure analysis  System log analysis  Live analysis for failure prediction  Postmortem analysis  Anomaly analysis  Collaboration with George Ostrouchov  Statistical tool for anomaly detection
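As a concrete illustration of the live log-analysis idea above, the sketch below (Python, not part of the original work) raises a per-node prediction once warning-class messages accumulate within a short window; the message patterns, window size, and threshold are illustrative assumptions.

    import re
    from collections import defaultdict, deque

    # Illustrative warning-class patterns that often precede hardware failures.
    WARNING_PATTERNS = [re.compile(p, re.I) for p in
                        (r'ecc.*corrected', r'scsi.*retry', r'temperature above threshold')]

    class LogPredictor:
        def __init__(self, window_s=600, threshold=5):
            self.window_s, self.threshold = window_s, threshold
            self.events = defaultdict(deque)   # node -> timestamps of recent warnings

        def feed(self, timestamp, node, message):
            """Consume one syslog line; return a prediction string or None."""
            if any(p.search(message) for p in WARNING_PATTERNS):
                q = self.events[node]
                q.append(timestamp)
                while q and timestamp - q[0] > self.window_s:
                    q.popleft()                 # keep only warnings inside the window
                if len(q) >= self.threshold:
                    return f'predict failure on {node}'
            return None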

5 Anomaly Detection  Anomaly Analyzer (George Ostrouchov)  Ability to view groups of components as statistical distributions  Identify anomalous components  Identify anomalous time periods  Based on numeric data with no expert knowledge for grouping  Scalable approach: only statistical properties of simple summaries  Statistical power comes from examining high-dimensional relationships  Visualization utility used to explore data  Implementation uses  R project for statistical computing  GGobi visualization tool for high-dimensional data exploration  With good failure data, could be used for failure prediction
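The actual analyzer is built on R and GGobi; the minimal Python sketch below only illustrates the underlying idea of summarizing each node/hour window and flagging the ones that sit far from the group, here with a robust z-score. The array shape and threshold are assumptions, not details from the slides.

    import numpy as np

    def flag_anomalies(summaries, threshold=3.5):
        """summaries: (n_nodes, n_hours, n_metrics) array of per-hour metric
        means (e.g., CPU temperature, memory use) for each node."""
        n_nodes, n_hours, n_metrics = summaries.shape
        flat = summaries.reshape(-1, n_metrics)
        # robust z-score: distance from the population median in units of MAD
        med = np.median(flat, axis=0)
        mad = np.median(np.abs(flat - med), axis=0) + 1e-9
        z = np.abs(summaries - med) / (1.4826 * mad)
        # a (node, hour) cell is anomalous if any metric is far from the group
        mask = (z > threshold).any(axis=-1)
        return [(n, h) for n in range(n_nodes) for h in range(n_hours) if mask[n, h]]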

6 Anomaly Detection Prototype  Monitoring / Data collection  Prototype developed using XTORC  Ganglia monitoring system  Standard metrics, e.g., memory/CPU utilization  lm_sensors data, e.g., CPU/motherboard temperature  Leveraged the RRD reader from Ovis v1.1.1
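A hedged sketch of how such a prototype might read samples back out of Ganglia's round-robin databases; it assumes the python-rrdtool bindings and Ganglia's usual one-file-per-metric-per-host layout, and the example path is hypothetical rather than taken from the slides.

    import rrdtool

    def fetch_metric(rrd_path, start='-48h', end='now'):
        """Return (timestamps, values) for the AVERAGE consolidation function."""
        (t0, t1, step), names, rows = rrdtool.fetch(
            rrd_path, 'AVERAGE', '--start', start, '--end', end)
        times = range(t0, t1, step)
        values = [r[0] for r in rows]   # Ganglia keeps one data source per rrd
        return list(times), values

    # e.g. fetch_metric('/var/lib/ganglia/rrds/xtorc/node13/cpu_user.rrd')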

7 Proactive Fault Tolerance Mechanisms  Goal: move the application away from the component that is about to fail  Migration  Pause/unpause  Major proactive FT mechanisms  Process-level migration  Virtual machine migration  In our context  The policy does not depend on the underlying mechanism  We can easily switch between solutions

8 System and Application Resilience  What policy to use for proactive FT?  Modular framework  Virtual machine checkpoint/restart and migration  Process-level checkpoint/restart and migration  Implementation of new policies via our SDK  Feedback loop  Policy simulator  Eases the initial study of new policies  Simulation results match experimental virtualization results
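A minimal sketch of what a policy plug-in interface could look like: a policy consumes health updates and returns a decision, independent of whether the underlying mechanism is Xen live migration or BLCR process migration. Class and method names are illustrative, not the project's actual SDK.

    from abc import ABC, abstractmethod

    class ProactivePolicy(ABC):
        @abstractmethod
        def decide(self, node, health):
            """health: dict of metric -> value; return 'migrate' or 'stay'."""

    class ThresholdPolicy(ProactivePolicy):
        def __init__(self, limits):
            self.limits = limits              # e.g. {'cpu_temp': 85.0}
        def decide(self, node, health):
            over = any(health.get(m, 0) > lim for m, lim in self.limits.items())
            return 'migrate' if over else 'stay'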

9 Type 1 Feedback-Loop Control Architecture  Alert-driven coverage  Basic failures  No evaluation of application health history or context  Prone to false positives  Prone to false negatives  Prone to missing the real-time window  Prone to decreasing application health through migration  No correlation of health context or history

10 Type 2 Feedback-Loop Control Architecture  Trend-driven coverage  Basic failures  Fewer false positives/negatives  No evaluation of application reliability  Prone to missing the real-time window  Prone to decreasing application health through migration  No correlation of health context or history
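To make the type 1 versus type 2 distinction concrete, the sketch below contrasts an alert-driven trigger, which fires on any single reading over a limit, with a trend-driven one, which fires only when a sliding-window average is high and still rising. The window size and limit are made-up parameters.

    from collections import deque

    class TrendTrigger:
        def __init__(self, limit, window=12):
            self.limit = limit
            self.history = deque(maxlen=window)

        def alert_driven(self, reading):
            # type 1: reacts to any spike, hence prone to false positives
            return reading > self.limit

        def trend_driven(self, reading):
            # type 2: smooth over a window and require a rising trend
            self.history.append(reading)
            if len(self.history) < self.history.maxlen:
                return False
            avg = sum(self.history) / len(self.history)
            rising = self.history[-1] > self.history[0]
            return avg > self.limit and rising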

11 Type 3 Feedback-Loop Control Architecture  Reliability-driven coverage  Basic and correlated failures  Fewer false positives/negatives  Able to maintain the real-time window  Does not decrease application health through migration  Correlation of short-term health context and history  No correlation of long-term health context or history  Unable to match system and application reliability patterns

12 Type 4 Feedback-Loop Control Architecture  Reliability-driven coverage of failures and anomalies  Basic and correlated failures, anomaly detection  Less prone to false positives  Less prone to false negatives  Able to maintain the real-time window  Does not decrease application health through migration  Correlation of short- and long-term health context & history

13 Testing and Experimentation  How to evaluate a failure prediction mechanism?  Failure injection  Anomaly detection  How to evaluate the impact of a given proactive policy?  Simulation  Experimentation

14 Fault Injection / Testing  First purpose: testing our research  Inject failures at different levels: system, OS, application  Framework for fault injection  Controller: Analyzer, Detector & Injector  Targets at both system and user level  Testing of failure prediction/detection mechanisms  Mimic the behavior of other systems  “Replay” a failure sequence on another system  Based on system logs, we can evaluate the impact of different policies
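One way to picture the “replay” use of the framework, under an assumed record format and detector interface (not the project's actual APIs): feed a logged failure sequence into a prediction mechanism and score how often it warned early enough.

    def replay(log_events, detector, warn_window=300):
        """log_events: list of (timestamp, node, kind), kind in {'warning', 'failure'}.
        Count failures for which the detector raised a prediction within
        warn_window seconds before the failure."""
        hits = misses = 0
        for t, node, kind in sorted(log_events):
            if kind != 'failure':
                detector.observe(t, node, kind)          # hypothetical interface
                continue
            predicted_at = detector.last_prediction(node) # hypothetical interface
            if predicted_at is not None and 0 < t - predicted_at <= warn_window:
                hits += 1
            else:
                misses += 1
        return hits, misses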

15 Fault Injection  Example faults/errors  Bit-flips - CPU registers/memory  Memory errors - memory corruption/leaks  Disk faults - read/write errors  Network faults - packet loss, etc.  Important characteristics  Representative failures (fidelity)  Transparency and low overhead  Detection/Injection are linked  Existing Work  Techniques: Hardware vs. Software  Software FI can leverage performance/debug hardware  Not many publicly available tools
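A minimal user-level example of one fault class, a single bit flip: it flips a bit in a buffer owned by the injector itself. Injecting into another process, CPU registers, or devices requires the kernel or hardware support discussed on the next slide; this is only a sketch of the idea.

    import random

    def flip_random_bit(buf: bytearray) -> int:
        """Flip one randomly chosen bit in buf and return its bit offset."""
        bit = random.randrange(len(buf) * 8)
        buf[bit // 8] ^= 1 << (bit % 8)
        return bit

    data = bytearray(b'sensor frame 0042')
    offset = flip_random_bit(data)
    print(f'flipped bit {offset}:', data)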

16 Simulator  Driven by system logs  Currently based on logs from LLNL ASCI White  Evaluate the impact of  Alternate policies  System/FT mechanism parameters (e.g., checkpoint cost)  Enables studies & evaluation of different configurations before actual deployment
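The slides do not spell out the simulator's internal model; as a stand-in, this sketch uses Young's first-order approximation to show the kind of trade-off such a simulator explores: how checkpoint cost and MTBF set the checkpoint interval and the expected overhead. The numeric values are illustrative only.

    import math

    def young_interval(ckpt_cost, mtbf):
        # optimal interval between checkpoints, T ~ sqrt(2 * C * MTBF)
        return math.sqrt(2.0 * ckpt_cost * mtbf)

    def expected_overhead(ckpt_cost, mtbf):
        t = young_interval(ckpt_cost, mtbf)
        ckpt_frac = ckpt_cost / t                      # time spent writing checkpoints
        rework_frac = (t + ckpt_cost) / (2.0 * mtbf)   # expected lost work per failure
        return ckpt_frac + rework_frac

    for cost in (60, 300, 900):                        # checkpoint cost in seconds
        mtbf = 6 * 3600                                # assumed 6-hour system MTBF
        print(cost, round(young_interval(cost, mtbf)), round(expected_overhead(cost, mtbf), 3))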

17 Anomaly Detection: Experimentation on “XTORC”  Hardware  Compute nodes: ~45-60 (P4 @ 2 GHz)  Head node: 1 (P4 @ 1.7 GHz)  Service/log server: 1 (P4 @ 1.8 GHz)  Network: 100 Mb Ethernet  Software  Operating systems span Red Hat 9, Fedora Core 4 & 5  RH9: node 53  FC4: nodes 4, 58, 59, 60  FC5: nodes 1-3, 5-52, 61  RH9 uses Linux 2.4  FC4/5 use Linux 2.6  NFS exports ‘/home’

18 XTORC Idle 48-hr Results  Data classified and grouped automatically  However, these results were manually interpreted (admin & statistician)  Observations  Node 0 is the most different from the rest, particularly hours 13, 37, 46, and 47. This is the head node where most services are running.  Node 53 runs the older Red Hat 9 (all others run Fedora Core 4/5).  It turned out that nodes 12, 31, 39, 43, and 63 were all down.  Node 13 … and particularly its hour 47!  Node 30 hour 7 … ?  Node 1 & Node 5 … ?  Three groups emerged in the data clustering  1. temperature/memory-related, 2. CPU-related, 3. I/O-related

19 Anomaly Detection - Next Steps  Data  Reduce overhead in data gathering  Monitor more fields  Investigate methods to aid data interpretation  Identify significant fields for given workloads  Heterogeneous nodes  Different workloads  Base (no/low work)  Loaded (benchmark/app work)  Loaded + Fault Injection  Working toward links between anomalies and failures

20 Prototypes - Overview  Proactive & reactive fault tolerance  Process level: BLCR + LAM-MPI  Virtual machine level: Xen + any kind of MPI implementation  Detection  Monitoring framework: based on Ganglia  Anomaly detection tool  Simulator  System log based  Enable customization of policies and system/application parameters

21 Is proactive the answer?  Most of the time, prediction accuracy is not good enough; we may lose all the benefit of proactive FT  No “one-size-fits-all” solution  Combination of different policies  “Holistic” fault tolerance  Example: decrease the checkpoint frequency by combining proactive and reactive FT policies  Optimization of existing policies  Leverage existing techniques/policies  Tuning  Customization
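A first-order illustration of the checkpoint-frequency example above (not the authors' model): if proactive FT avoids a fraction p of failures, the reactive checkpointer effectively sees MTBF / (1 - p), so the Young interval grows by 1/sqrt(1 - p) and checkpoints can be taken less often.

    import math

    def adjusted_interval(ckpt_cost, mtbf, predicted_fraction):
        effective_mtbf = mtbf / (1.0 - predicted_fraction)
        return math.sqrt(2.0 * ckpt_cost * effective_mtbf)

    base = adjusted_interval(300, 6 * 3600, 0.0)
    with_proactive = adjusted_interval(300, 6 * 3600, 0.5)   # 50% of failures avoided
    print(round(with_proactive / base, 2))                    # ~1.41x longer interval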

22 Resources  http://www.csm.ornl.gov/srt/  Contact: Geoffroy Vallee

23 Performance Prediction  Significant variance between different runs of the same experiment  Only a few studies address the problem  “System noise”  Critical for scaling up  Scientists want strict answers  What are the problems?  Lack of tools?  VMMs are too big/complex?  Not enough VMM-bypass/optimization?

24 Fault Tolerance Mechanisms  FT mechanisms are not yet mainstream (out-of-the-box)  But different solutions are becoming available (BLCR, Xen, etc.)  Support as many mechanisms as possible  Reactive FT mechanisms  Process-level checkpoint/restart  Virtual machine checkpoint/restart  Proactive FT mechanisms  Process-level migration  Virtual machine migration

25 Existing System-Level Fault Injection  Virtual Machines  FAUmachine  Pro: focused on FI & experiments, code available  Con: older project, lots of dependencies, slow  FI-QEMU (patch)  Pro: works with the ‘qemu’ emulator, code available  Con: patch for the ARM architecture, limited capabilities  Operating System  Linux (>= 2.6.20)  Pro: extensible, kernel- & user-level targets, maintained by the Linux community  Con: immature, focused on testing Linux

26 Future Work  Implementation of the RAS framework  Ultimately have an “end-to-end” solution for system resilience  From initial studies based on the simulator  To deployment and testing on computing platforms  Using different low-level mechanisms (process level versus virtual machine level mechanisms)  Adapting the policies to both the platform and the applications

