
Slide 1: Application Resilience: Making Progress in Spite of Failure
Nathan A. DeBardeleben and John T. Daly, High Performance Computing Division, Los Alamos National Laboratory
William M. Jones, Electrical and Computer Engineering Department, United States Naval Academy
LA-UR-08-3236
Resilience 2008: Workshop on Resiliency in High Performance Computing

Slide 2: Applications WILL Fail
- In spite of improved fault tolerance, failures will inevitably occur:
  - Hardware failures
  - Application and system software bugs
- We are moving to petaflop-scale supercomputers:
  - More software layers mean more points of failure
  - Extreme temperature, extreme power, extreme scale
- With more computing power comes more potential for wasted money when the machine is not utilized as well as possible

Slide 3: Should We Even Try to Avoid Failure?
- Failure: how do we avoid it?
  - Dynamic process creation to recover from node failures
  - Fault-tolerant MPI
  - Periodic checkpoints, but how often? (see the interval sketch after this slide)
  - System support to advise the application of imminent failure
  - Spare processors set aside for use after a failure
- Costly. Complex.
- Let us instead ask a simple question: is my application performing useful work (making progress)?
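
As a point of reference for "how often?", below is a minimal sketch of the widely cited first-order checkpoint-interval estimate attributed to Daly, tau_opt ≈ sqrt(2*delta*M) − delta, where delta is the checkpoint write time and M is the MTBF seen by the allocation. The numeric values are illustrative assumptions, not figures from the talk.

```python
import math

def optimal_checkpoint_interval(delta, mtbf):
    """First-order estimate of the compute time to allow between checkpoints.

    delta : time to write one checkpoint (same units as mtbf)
    mtbf  : mean time between failures seen by the job's allocation
    """
    # tau_opt ~ sqrt(2 * delta * M) - delta, valid when delta << M.
    return math.sqrt(2.0 * delta * mtbf) - delta

# Example: 5-minute checkpoint dumps on an allocation with a 12-hour MTBF.
print(optimal_checkpoint_interval(delta=5.0, mtbf=12 * 60))  # ~80 minutes
```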

Slide 4: Is My Application Making Progress?
- How do we ensure progress is made?
  - Application monitoring frameworks
  - Intelligent application checkpointing
  - Analysis of checkpoint overhead
- So, what's the main problem?

Slide 5: Failures May Go Unnoticed. The application stops making progress, and all of the time until the failure is noticed is wasted.

Slide 6: There Are Many Ways to Monitor Application Progress
- It is surprisingly hard to determine whether an application has stopped making progress!
  - Maybe it is just waiting on the network or disk
  - Maybe it is computing, or maybe it is just spinning in an infinite loop
  - Maybe a node is not responding, or maybe another task is simply switched in
- Let's take a look at a layered approach to monitoring progress

Slide 7: Node-Level System Monitoring
- Daemons
- Heartbeat mechanisms, sometimes coupled with useful performance data (a sketch follows this slide)
- Are we willing to pay for daemon processing time? System "noise" is already considered too high
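
Purely to illustrate the heartbeat idea, here is a minimal sketch of a node-level daemon that periodically reports to a collector and piggybacks a cheap performance datum. The collector address, port, payload format, and interval are all hypothetical choices, not details from the talk.

```python
import socket
import time

COLLECTOR = ("monitor.example.org", 9999)  # hypothetical collector host/port
BEAT_INTERVAL = 30                         # seconds; longer intervals mean less node "noise"

def read_load_average():
    """Cheap performance datum to piggyback on the heartbeat (Linux /proc)."""
    with open("/proc/loadavg") as f:
        return f.read().split()[0]

def main():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    hostname = socket.gethostname()
    while True:
        # One small UDP datagram per beat: hostname, timestamp, load average.
        payload = f"{hostname} {time.time():.0f} load={read_load_average()}"
        sock.sendto(payload.encode(), COLLECTOR)
        time.sleep(BEAT_INTERVAL)

if __name__ == "__main__":
    main()
```

The tradeoff the slide raises shows up in BEAT_INTERVAL: beating more often detects silence sooner but adds more background activity to every node.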

Slide 8: Subsystem-Level System Monitoring
- Network heartbeat (InfiniBand)
- Fault-tolerant MPI
- Parallel file system fault tolerance: failover nodes, redundancy
- Kernel: power and heat; degrade performance but try to recover in some cases
- Helps pinpoint failures to specific subsystems

Slide 9: Application-Level System Monitoring
- Who better to know whether an application is making progress than the application itself?
- Source or binary instrumentation to emit heartbeats
- Kernel modifications to look at system call usage: does the application appear to be in a wait loop?
- Watch application output: is it producing any at a regular interval?
- How does one determine these intervals? (a sketch follows this slide)
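
A minimal sketch of the "watch the output interval" idea: the application marks progress from its main loop, and an external watcher compares the age of that mark against an expected interval. The file path and threshold are hypothetical and would have to be tuned per application, which is exactly the open question the slide raises.

```python
import os
import time

PROGRESS_FILE = "/tmp/app_progress"  # hypothetical location on a shared or local file system
EXPECTED_INTERVAL = 600              # seconds between progress marks; application-specific

def mark_progress(step):
    """Called from the application's main loop: record the latest completed step."""
    with open(PROGRESS_FILE, "w") as f:
        f.write(f"{step} {time.time():.0f}\n")

def appears_hung():
    """Watcher side: has the application gone quiet for longer than expected?"""
    try:
        age = time.time() - os.path.getmtime(PROGRESS_FILE)
    except FileNotFoundError:
        return False  # the application has not started marking progress yet
    return age > EXPECTED_INTERVAL
```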

Slide 10: Suppose you could detect that an error occurred, migrate the job, and restart it from the last checkpoint. How quickly would you need to determine that an interrupt occurred?

Slide 11: Our Assumptions
- Coupled checkpoint/restart application
- A tradeoff exists between checkpoint frequency and how far we have to back up after an interrupt
- R = f(detection latency + restart overhead) (a worked reading follows this slide)
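
One concrete reading of the last bullet, as a minimal sketch: the time charged to a single interrupt is the rework back to the last checkpoint plus the latency to detect the failure plus the restart overhead. The numbers are illustrative assumptions, not measurements from the talk.

```python
def time_lost_per_interrupt(rework, detection_latency, restart_overhead):
    """Wall-clock time wasted by one interrupt under coupled checkpoint/restart:
    work done since the last checkpoint must be redone (rework), plus the time
    to notice the failure, plus the cost of restarting the job. On average the
    rework is about half of the checkpoint interval."""
    return rework + detection_latency + restart_overhead

# Illustrative values in hours:
print(time_lost_per_interrupt(rework=0.5, detection_latency=0.25, restart_overhead=0.25))  # 1.0
```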

Slide 12: Analytical Model [the equations on this slide did not transcribe]
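
Since the slide's equations are missing from the transcript, the following is only a reconstruction from the published checkpoint/restart literature associated with Daly, not the slide itself. With T_s the failure-free solve time, tau the compute time between checkpoints, delta the checkpoint write time, R the restart time, and M the MTBF, the commonly cited first-order model and its optimum are:

```latex
% First-order expected wall-clock time with checkpoint/restart
T_w \;\approx\; T_s
   \;+\; \Bigl(\tfrac{T_s}{\tau} - 1\Bigr)\,\delta
   \;+\; \frac{T_w}{M}\Bigl(\tfrac{\tau + \delta}{2} + R\Bigr),
\qquad
\tau_{\mathrm{opt}} \;\approx\; \sqrt{2\,\delta\,(M + R)} \;-\; \delta .
```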

Slide 13: [content did not transcribe]

Slide 14: [content did not transcribe]

Slide 15: Compare Theory to Simulation
- How closely does real supercomputer usage match the theory?
- We need a simulator: BeoSim
- We need real data: Pink at Los Alamos

Slide 16: Workload Distribution. Event-driven simulation of 4,000,000 jobs using BeoSim on a 1,926-node cluster. [distribution plot not transcribed]

Slide 17: BeoSim: A Computational Grid Simulator
- Java front-end, C back-end
- Discrete event simulator: single-threaded, with parameter studies run in parallel
- Parallel job scheduling research: single and multiple clusters
- Checkpointing studies (an illustrative sketch of this kind of experiment follows this slide)
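
BeoSim itself is the Java/C tool named above; purely to illustrate the kind of event-driven checkpoint/restart experiment described here, below is a small, self-contained sketch. Every parameter name and value is an assumption for illustration, not a BeoSim interface or a number from the study.

```python
import random

def simulate_job(work, mtbf, tau, delta, restart, detect):
    """Wall-clock hours needed to finish `work` hours of compute when failures
    arrive with exponential inter-arrival times of mean `mtbf`. Checkpoints are
    taken every `tau` compute hours at cost `delta`; each failure costs `detect`
    hours to notice and `restart` hours to relaunch, and loses the work done
    since the last checkpoint. Checkpoint writes are assumed failure-free."""
    clock, done, since_ckpt = 0.0, 0.0, 0.0
    time_to_fail = random.expovariate(1.0 / mtbf)
    while done < work:
        segment = min(tau - since_ckpt, work - done)
        if time_to_fail <= segment:
            # Failure mid-segment: pay detection and restart, then roll back
            # to the last checkpoint.
            clock += time_to_fail + detect + restart
            done -= since_ckpt
            since_ckpt = 0.0
            time_to_fail = random.expovariate(1.0 / mtbf)
            continue
        clock += segment
        done += segment
        since_ckpt += segment
        time_to_fail -= segment
        if since_ckpt >= tau and done < work:
            clock += delta  # pay the checkpoint write
            since_ckpt = 0.0
    return clock

# Monte Carlo estimate for a 100-hour job on an allocation with a 24-hour MTBF,
# hourly checkpoints, 6-minute dumps, 15-minute restarts, 30-minute detection.
runs = [simulate_job(work=100, mtbf=24, tau=1.0, delta=0.1, restart=0.25, detect=0.5)
        for _ in range(1000)]
print(f"mean wall-clock time: {sum(runs) / len(runs):.1f} hours")
```

Sweeping `detect` in such a sketch poses the same question the following slides answer with BeoSim: how much does slow interrupt detection actually cost in throughput and execution time?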

Slide 18: BeoSim Framework [framework diagram not transcribed]. BeoSim: http://www.parl.clemson.edu/beosim

Slide 19: Impact of Increasing Failure Rates. The effect may seem negligible, but the quantity of interest is the impact of multiple interrupts on throughput, not the total number of failures. [plot not transcribed]

Slide 20: Impact on Throughput for All Jobs [plot not transcribed; x-axis: CPdelta, the time in minutes to determine that an interrupt occurred; annotation: significant reduction in queueing delays]

Slide 21: Impact on Execution Time [plot not transcribed; x-axis: CPdelta, the time in minutes to determine that an interrupt occurred; annotations: marginal impact (1.8%) versus significant impact (13.5%)]

Slide 22: Keep in Mind That... [plot not transcribed; x-axis: CPdelta, the time in minutes to determine that an interrupt occurred; annotations: 6.5% versus 1.5% of total jobs interrupted] While the averages are relatively close for both scenarios, an increasing number of jobs are affected as the MTBF decreases, and therefore more resources are tied up in applications that are not making progress.
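
One simple way to see why the interrupted fraction grows as the MTBF shrinks (assuming exponentially distributed, independent failures; the 6.5% and 1.5% figures above come from the simulation, not from this formula): the probability that a job needing T hours of wall-clock time sees at least one interrupt is

```latex
P(\text{at least one interrupt}) \;=\; 1 - e^{-T/M},
```

which rises quickly as M falls toward the job length T.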

Slide 23: Conclusions
- The simulation matches the theoretical approximation relatively closely
- The theory is simple, yet it was applied to a complex system that the theory does not capture, and it still matches closely
- Could it extend to more complex systems?
- Application monitoring is paramount
- Immediate detection is not necessarily a hard requirement (for this system)
- This helps decision makers:
  - With $100 million to spend, do I need to pay 5x the cost for a better detection system?
  - What is my expected workload? Put it into the simulation!
- Pink is a general-purpose cluster with many different jobs of different runtimes and widths; we use averages, which tend to make the results "murky"

Slide 24: Future Work
- Factor in the time to fix a failure; hardware takes time to repair, which is not yet modeled
- We assumed completely independent failures
- Look at different "classes" of jobs, or at a system that is less diverse than Pink
- How to determine the MTBF, and how it affects the optimal checkpointing intervals
- More work on determining the parameter M for systems where a job does not run across the entire machine

Slide 25: Thank you! Questions? Nathan A. DeBardeleben

