
1 Fault Tolerant Extensions to Charm++ and AMPI
Sayantan Chakravorty (presenter), Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

2 Outline
- Motivation
- Background
- Solutions:
  - Coordinated checkpointing (disk-based)
  - In-memory double checkpoint
  - Sender-based message logging
  - Processor evacuation in response to fault prediction (new work)

3 Motivation
- As machines grow in size, MTBF decreases, so applications have to tolerate faults.
- Applications need fast, low-cost, and scalable fault tolerance support.
- Modern hardware is making fault prediction possible: temperature sensors, PAPI-4, SMART.
- Paper on detection tomorrow.

4 Background
- Checkpoint-based methods:
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]; Co-check, Starfish, Clip (fault-tolerant MPI)
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced: [Briatico84], doesn't scale well
- Log-based methods:
  - Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
  - Optimistic: [Strom85], unbounded rollback, complicated recovery
  - Causal logging: [Elnozahy93], complicated causality tracking and recovery; Manetho, MPICH-V3

5 Multiple Solutions in Charm++
- Reactive (react to a fault):
  - Disk-based checkpoint/restart
  - In-memory double checkpointing/restart
  - Sender-based message logging
- Proactive (react to a fault prediction):
  - Evacuate processors that are warned

6 Checkpoint/Restart Mechanism
- Blocking coordinated checkpoint: the state of the chares is checkpointed to disk.
- Collective call: MPI_Checkpoint(DIRNAME)
- The entire job is restarted; virtualization allows restarting on a different number of PEs.
- Runtime option: ./charmrun pgm +p4 +vp16 +restart DIRNAME
- Simple but effective for common cases. (A minimal usage sketch follows.)
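A minimal sketch of how an AMPI application might use this, assuming the collective MPI_Checkpoint(DIRNAME) call named on the slide; the iteration loop, checkpoint period, and directory name are illustrative only:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int step = 0; step < 1000; step++) {
            /* ... compute and exchange data for this step ... */

            /* Collective: every rank must reach this call.
               Signature assumed from the slide's MPI_Checkpoint(DIRNAME). */
            if (step % 100 == 0)
                MPI_Checkpoint("ckpt_dir");
        }

        MPI_Finalize();
        return 0;
    }

On restart, the same binary is relaunched with the +restart option shown above; virtualization lets the 16 virtual processors be mapped onto however many physical PEs are available.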

7 Drawbacks
- Disk-based coordinated checkpointing is slow.
- The job needs to be restarted, which requires user intervention.
- Impractical in the case of frequent faults.

8 In-memory Double Checkpoint
- In-memory checkpointing is faster than disk.
- Coordinated checkpoint: simple, and the user can decide what makes up useful state.
- Double checkpointing: each object maintains 2 checkpoints, on:
  - the local physical processor
  - a remote "buddy" processor
(A sketch of triggering such a checkpoint follows.)
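A minimal sketch of triggering an in-memory double checkpoint from a Charm++ main chare, assuming the callback-based CkStartMemCheckpoint entry point; the chare, entry-method names, and checkpoint period are hypothetical:

    #include "charm++.h"

    class Main : public CBase_Main {
      int step;
    public:
      Main(CkArgMsg *m) : step(0) { /* create worker arrays, start iterating */ }

      /* Called at the end of every iteration of the application driver. */
      void iterationDone() {
        if (++step % 10 == 0) {
          /* Ask the runtime to checkpoint every object's state into local
             memory and into its buddy processor's memory, then resume. */
          CkCallback cb(CkIndex_Main::resume(), thisProxy);
          CkStartMemCheckpoint(cb);
        } else {
          resume();
        }
      }

      void resume() { /* launch the next iteration */ }
    };

The "useful state" each object contributes is whatever its pup routine serializes, e.g. void Worker::pup(PUP::er &p) { p | grid; p | timestep; }.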

9 Restart
- A "dummy" process is created:
  - It need not have application data or a checkpoint.
  - It is necessary for the runtime.
  - It starts recovery on all other PEs.
- The other processors:
  - Remove all chares.
  - Restore the checkpoints lost on the crashed PE.
  - Restore chares from local checkpoints.
- Load balance after restart.

10 Overhead Evaluation
Jacobi (200 MB data size) on up to 128 processors; 8 checkpoints in 100 steps.

11 Recovery Performance
LeanMD application, 10 crashes, 128 processors, checkpoint every 10 time steps.

12 Drawbacks
- High memory overhead.
- Checkpoint/rollback doesn't scale:
  - All nodes are rolled back just because 1 crashed.
  - Even nodes independent of the crashed node are restarted.
- Restart cost is similar to the checkpoint period.
- Blocking coordinated checkpoint requires user intervention.

13 Sender-based Message Logging
- Message logging: store message logs on the sender.
- Asynchronous checkpoints: each processor has a buddy processor and stores its checkpoint in the buddy's memory.
- Restart:
  - A processor from an extra pool takes over.
  - Only the objects that lived on the crashed processor are recreated.
  - Logged messages are played back, restoring the state to that after the last processed message.
  - Processor virtualization can speed this up.

14 Message Logging
- The state of an object is determined by the messages it has processed and the sequence in which it processed them.
- Protocol (a code sketch follows):
  1. The sender logs the message and requests a ticket number (TN) from the receiver.
  2. The receiver sends back the TN.
  3. The sender stores the TN with the log and sends the message.
  4. The receiver processes messages in order of TN.
- Processor virtualization complicates message logging: messages to an object on the same processor need to be logged remotely.
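A minimal sketch of the sender-side bookkeeping for this ticket-number handshake; every type and function name here is hypothetical, since the real protocol lives inside the Charm++ runtime:

    #include <cstdint>
    #include <map>
    #include <vector>

    struct LoggedMessage {
        uint64_t tn = 0;            // ticket number assigned by the receiver
        int destObject = -1;        // destination object id
        std::vector<char> payload;  // message body kept for possible replay
    };

    class SenderLog {
        std::map<uint64_t, LoggedMessage> log_;  // keyed by a local sequence id
        uint64_t nextId_ = 0;
    public:
        // Step 1: log the message locally and ask the receiver for a TN.
        uint64_t logAndRequestTN(int destObject, std::vector<char> payload) {
            uint64_t id = nextId_++;
            log_[id] = LoggedMessage{0, destObject, std::move(payload)};
            // sendTNRequest(destObject, id);   // runtime-level request (assumed)
            return id;
        }

        // Steps 2-3: the receiver's TN reply arrives; record it and send the
        // real message tagged with that TN.
        void onTNReply(uint64_t id, uint64_t tn) {
            log_[id].tn = tn;
            // sendTaggedMessage(log_[id]);     // runtime-level send (assumed)
        }

        // On a receiver crash: replay every logged message destined for the
        // recreated objects, in increasing TN order, to rebuild their state.
    };

The receiver side, not shown, hands out TNs per object and processes incoming messages strictly in TN order, which is what makes the replay deterministic.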

15 Parallel Restart
- Message logging allows fault-free processors to continue with their execution.
- However, sooner or later some processors start waiting for the crashed processor.
- Virtualization allows us to move work from the restarted processor to the waiting processors, so chares are restarted in parallel and the restart cost is reduced.

16 Present Status
- Most of Charm++ has been ported; support for migration has not yet been implemented in the fault-tolerant protocol.
- AMPI has been ported; parallel restart is not yet implemented.

17 Recovery Performance
Execution time with an increasing number of faults on 8 processors (checkpoint period 30 s).

18 Pros and Cons
- Low overhead for jobs with little communication.
- Currently high overhead for jobs with heavy communication.
- Should be tested with a high virtualization ratio to reduce the message-logging overhead.

19 Processor Evacuation
- Modern hardware can be used to predict faults.
- Requirements on the runtime system's response:
  - Low response time.
  - No new processors should be required.
  - Efficiency loss should be proportional to the loss in computational power.

20 Solution
- Migrate Charm++ objects off the warned processor (a small sketch follows this list).
  - Requires remapping the "home" PEs of objects.
  - Point-to-point message delivery continues to work efficiently.
- Collective operations cope with the loss of processors:
  - Rewire the reduction tree around a warned processor.
  - Can deal with multiple simultaneous warnings.
- Load balance after an evacuation.
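A minimal sketch of the evacuation idea expressed at the application level, using the standard Charm++ array-element migration call migrateMe(); the warning broadcast, the destination-choice policy, and all names here are hypothetical, and the real runtime additionally handles home-PE remapping and reduction-tree rewiring transparently:

    #include "charm++.h"

    class Worker : public CBase_Worker {
    public:
        Worker() {}
        Worker(CkMigrateMessage *m) {}

        /* Broadcast to the whole array when a processor gets a fault warning. */
        void processorWarned(int warnedPe) {
            if (CkMyPe() == warnedPe) {
                /* Pick any healthy destination; a real policy would let the
                   load balancer redistribute work after the evacuation. */
                int dest = (warnedPe + 1) % CkNumPes();
                migrateMe(dest);   /* pack this element via pup() and move it */
            }
        }

        void pup(PUP::er &p) { /* serialize the element's state */ }
    };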

21 Rearrange the Reduction Tree
- Do not rewire the tree while a reduction is going on:
  1. Stop reductions.
  2. Rewire the tree.
  3. Continue reductions.
- Affects only the parent and children of the warned node (see the sketch below).
- Unbalances the tree, which could be solved by recreating the tree.
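A minimal sketch of splicing a warned node out of a reduction tree: its children are re-parented to its parent, so only the parent and children are touched. The node layout and function name are hypothetical:

    #include <algorithm>
    #include <vector>

    struct TreeNode {
        int pe = -1;
        TreeNode *parent = nullptr;
        std::vector<TreeNode*> children;
    };

    void rewireAround(TreeNode *warned) {
        TreeNode *p = warned->parent;   // assumes the warned node is not the root
        // Detach the warned node from its parent.
        auto &sib = p->children;
        sib.erase(std::remove(sib.begin(), sib.end(), warned), sib.end());
        // Re-parent the warned node's children. This may unbalance the tree;
        // recreating the tree later restores balance.
        for (TreeNode *c : warned->children) {
            c->parent = p;
            sib.push_back(c);
        }
        warned->children.clear();
    }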

22 Response Time
Evacuation time for a Sweep3d execution on the 150^3 case (about 500 MB of data in total); this is a pessimistic estimate of the evacuation time.

23 Performance after Evacuation
Iteration time of Sweep3d on 32 processors for the 150^3 problem with 1 warning.

24 Processor Utilization after Evacuation
Iteration time of Sweep3d on 32 processors for the 150^3 problem, with both processors on node 3 (processors 4 and 5) warned simultaneously.

25 Conclusions
- Available in Charm++ and AMPI:
  - Checkpoint/restart (disk-based)
  - In-memory checkpoint/restart
  - Proactive fault tolerance
- Under development:
  - Sender-based message logging: deal with migration and deletion; parallel restart
- The abstraction layers in Charm++/AMPI make it suitable for implementing fault tolerance protocols.

