
1 Fault Tolerance in Charm++
Gengbin Zheng, 10/11/2005
Parallel Programming Lab, University of Illinois at Urbana-Champaign

2 Motivation
- As machines grow in size, MTBF decreases
- Applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Fault-tolerant runtime for: Charm++, Adaptive MPI

3 Outline
- Disk checkpoint/restart
- FTC-Charm++: in-memory checkpoint/restart
- Proactive fault tolerance
- FTL-Charm++: message logging

4 Disk Checkpoint/Restart

5 Checkpoint/Restart
- Simplest scheme for application fault tolerance: any long-running application saves its state to disk periodically, at certain points (a coordinated checkpointing strategy, using a barrier)
- State information is saved in a directory of your choosing
- Checkpointing the application data is done by invoking the pup routine of all objects
- Restore also uses pup, so no additional application code is needed (pup is all you need)
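As a hedged illustration of the pup mechanism, here is a minimal sketch of a chare's pup routine; the MyChare class and its fields are hypothetical, not from the slides:

    // Hypothetical chare whose state is captured by pup().
    // The same routine packs state at checkpoint time and
    // unpacks (restores) it on restart.
    class MyChare : public CBase_MyChare {
      int step;       // current iteration number
      int n;          // number of data elements
      double *data;   // simulation data
    public:
      void pup(PUP::er &p) {
        CBase_MyChare::pup(p);   // pup the superclass first
        p | step;
        p | n;
        if (p.isUnpacking())     // restoring: allocate before reading
          data = new double[n];
        PUParray(p, data, n);    // pack or unpack the array contents
      }
    };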

6 Checkpointing a Job
- In Charm++, use:
    void CkStartCheckpoint(char *dirname, const CkCallback &cb)
  Called on one processor; the callback is invoked when the checkpoint is complete
- In AMPI, use:
    MPI_Checkpoint();
  Collective call; returns when the checkpoint is complete
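A sketch of how a driver might trigger the Charm++ call; the Main chare, its mainProxy, and the resumeIteration entry method are illustrative assumptions, not from the slides:

    // Checkpoint every 100 steps; the runtime invokes the callback
    // (here, an entry method of the main chare) once the checkpoint
    // directory has been written, and also after a restart.
    void Main::nextStep() {
      if (step % 100 == 0) {
        CkCallback cb(CkIndex_Main::resumeIteration(), mainProxy);
        CkStartCheckpoint("log", cb);   // "log" is the checkpoint directory
      } else {
        resumeIteration();
      }
    }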

7 Restarting a Job from a Checkpoint
- The charmrun option ++restart is used to restart:
    ./charmrun +p4 ./pgm ++restart log
- The number of processors need not be the same
- Parallel objects are redistributed when needed

8 FTC-Charm++: In-Memory Checkpoint/Restart

9 Disk vs. In-Memory Scheme
- Disk checkpointing has drawbacks:
  - Needs user intervention to restart a job
  - Assumes reliable storage (disk)
  - Disk I/O is slow
- In-memory checkpoint/restart scheme:
  - Online version of the previous scheme
  - Low impact on fault-free execution
  - Provides fast and automatic restart capability
  - Does not rely on extra processors
  - Maintains execution efficiency after restart
  - Does not rely on any fault-free component
  - Does not assume stable storage

10 Overview
- Coordinated checkpointing scheme
  - Simple, low overhead on fault-free execution
  - Targets iterative scientific applications
- Double checkpointing
  - Tolerates one failure at a time
- In-memory checkpointing
  - Diskless checkpointing; efficient for applications with a small memory footprint
- When there are no extra processors, the program continues to run on the remaining processors
- Load balancing for restart

11 Checkpoint Protocol
- Similar to the previous scheme: a coordinated checkpointing strategy
- Programmers decide what to checkpoint:
    void CkStartMemCheckpoint(CkCallback &cb)
- Each object packs its data and sends it to two different (buddy) processors
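Usage mirrors the disk-based call; a minimal sketch (the entry method and proxy names are again hypothetical):

    // Trigger a coordinated double in-memory checkpoint. The callback
    // fires when the checkpoint completes, and again after a restart
    // recovers from these checkpoints.
    CkCallback cb(CkIndex_Main::checkpointDone(), mainProxy);
    CkStartMemCheckpoint(cb);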

12 Restart Protocol
- Initiated by the failure of a physical processor
- Every object rolls back to the state preserved in the most recent checkpoint
- Combined with the load balancer to sustain performance

13 Checkpoint/Restart Protocol
[Figure: objects A-J distributed across PE0-PE3; each object's checkpoints 1 and 2 are stored on two buddy processors. When PE1 crashes (1 processor lost), its objects are restored on the remaining processors from the surviving checkpoint copies.]

14 Local Disk-Based Protocol
- Double in-memory checkpointing
  - Memory concern: pick a checkpointing time when the global state is small
- Double in-disk checkpointing
  - Makes use of the local disk
  - Also does not rely on any reliable storage
  - Useful for applications with a very big memory footprint

15 Compiling FTC-Charm++
- Build Charm++ with the "syncft" option:
    ./build charm++ net-linux syncft -O
- The command-line switch +ftc_disk selects disk instead of memory checkpointing:
    charmrun ./pgm +ftc_disk

16 Performance Evaluation
- IA-32 Linux cluster at NCSA
- 512 dual 1 GHz Intel Pentium III processors
- 1.5 GB RAM each
- Connected by both Myrinet and 100 Mbit Ethernet

17 Performance Comparisons with Traditional Disk-Based Checkpointing

18 Recovery Performance
- Molecular dynamics simulation application: LeanMD
- Apoa1 benchmark (92K atoms)
- 128 processors
- Crash simulated by killing processes
- No backup processors
- With load balancing

19 Performance Improvement with Load Balancing
- LeanMD, Apoa1, 128 processors

20 Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps

21 LeanMD with Apoa1 Benchmark
- 90K atoms
- 8498 objects

22 Proactive Fault Tolerance

23 Motivation
- The run-time currently reacts to a failure; instead, proactively migrate work off a processor that is about to fail
- Modern hardware supports early fault indication: the SMART protocol, motherboard temperature sensors, Myrinet interface cards
- Possible to create a mechanism for fault prediction

24 Requirements
- Response time should be as low as possible
- No new processes should be required
- Collective operations should still work
- Efficiency loss should be proportional to the computing power lost

25 System
- The application is warned of an impending fault via a signal
- The processor, memory, and interconnect should continue to work correctly for some time after the warning
- The run-time ensures that the application continues to run on the remaining processors even if one processor crashes

26 Solution Design
- Migrate Charm++ objects off the warned processor
- Point-to-point message delivery should continue to work
- Collective operations should cope with the possible loss of multiple processors: modify the runtime system's reduction tree to remove the warned processor
- A minimal number of processors should be affected
- The runtime system should remain load balanced after a processor has been evacuated

27 Proactive FT: Current Status
- Support for multiple faults is ready; currently testing support for simultaneous faults
- Faults simulated via a signal sent to the process
- Current version fully integrated into Charm++ and AMPI
- Example: sweep3d (MPI code) on NCSA's tungsten
[Figure: processor utilization before the fault, after the fault, and after load balancing.]

28 How to Use
- Part of the default version of Charm++; no extra compiler flags required
- This code does not get executed until a warning
- Any detection system can be plugged in:
  - It can send a signal (USR1) to the process on the compute node
  - It can call a method (CkDecideEvacPe) to evacuate a processor
- Works with any Charm++ or AMPI program
- AMPI programs must be built with -memory isomalloc
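To make the two trigger paths concrete, here is a hedged sketch; it assumes CkDecideEvacPe() takes no arguments and evacuates the calling processor, and the detector hook is invented for illustration:

    // Option 1: an external detector sends a signal to the process on
    // the compute node, e.g. kill -USR1 <pid> from a monitoring script.
    //
    // Option 2: a plugged-in detection component running on the warned
    // processor calls into the runtime directly:
    void onFaultWarning() {
      CkDecideEvacPe();   // migrate this PE's objects to other processors
    }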

29 FTL-Charm++: Message Logging

30 Motivation
- Checkpointing is not fully automatic
- Coordinated checkpointing is expensive
- Checkpoint/rollback doesn't scale: all nodes are rolled back just because one crashed; even nodes independent of the crashed node are restarted

31 Design
- Message logging: sender-side message logging
- Asynchronous checkpoints
  - Each processor has a buddy processor and stores its checkpoint in the buddy's memory
  - Each processor checkpoints on its own (no barrier)

32 Messages to Remote Chares
- Chare P is the sender; chare Q is the receiver
- Each message carries a sender-side sequence number (SN) and is assigned a ticket number (TN) by the receiver
- If the <sender, SN> pair has been seen earlier, its TN is marked as received
- Otherwise, create a new TN and store the <sender, SN, TN> entry
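A rough sketch of the receiver-side ticketing logic described above; the data structures and flat integer IDs are invented for illustration, and the real FTL-Charm++ implementation differs in detail:

    #include <map>
    #include <utility>

    // A message is identified by its sender and the sender-side
    // sequence number (SN); the receiver assigns it a ticket number (TN).
    typedef std::pair<int, int> MsgKey;   // <sender id, SN>

    struct TicketTable {
      std::map<MsgKey, int> tickets;      // <sender, SN> -> TN
      int nextTN;
      TicketTable() : nextTN(1) {}

      // Receiver Q handles a ticket request from sender P.
      int requestTicket(int sender, int sn) {
        MsgKey key(sender, sn);
        std::map<MsgKey, int>::iterator it = tickets.find(key);
        if (it != tickets.end())
          return it->second;              // seen earlier: TN marked as received
        int tn = nextTN++;                // otherwise create a new TN
        tickets[key] = tn;                // and store the <sender, SN, TN> entry
        return tn;
      }
    };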

33 Status
- Most of Charm++ and AMPI has been ported
- Support for migration has not yet been implemented in the fault-tolerant protocol
- Parallel restart not yet implemented
- Not in the Charm++ main branch

34 Thank You!
Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/
Parallel Programming Lab at the University of Illinois

