Use Cases for Fault Tolerance Support in MPI Rich Graham Oak Ridge National Laboratory.


1 Use Cases for Fault Tolerance Support in MPI Rich Graham Oak Ridge National Laboratory

2 Working Assumption
- MPI provides the hooks into the communications and process control system to allow others to implement fault-tolerant algorithms
- This working group will address process fault tolerance (FT), not network FT (network FT does not need changes to the standard)

3 Process Failure
- Scenario:
  - A running parallel application with process count N loses one or more processes due to a failure not related to the application
- Recovery scenarios:
  - Communicator(s) abort
  - Application uses some sort of checkpoint/restart (CPR) to continue
    - May want to quiet the communications system
    - May want to log messages
  - Application continues to run with M processes, where M <= N (the application chooses whether it can continue with M < N)
    - Application expects a dense rank index
    - Application expects a sparse rank index
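
The difference between a dense and a sparse rank index can be illustrated with a small sketch. This is not MPI code and does not reflect any API in the standard; the function names `dense_index` and `sparse_index` are hypothetical, and the assumption is simply that "dense" means survivors are renumbered 0..M-1 while "sparse" means survivors keep their original ranks and failed ranks become holes.

```python
def surviving_ranks(n, failed):
    """Return the original ranks that are still alive, in ascending order."""
    return [r for r in range(n) if r not in failed]


def dense_index(n, failed):
    """Map each surviving original rank to a new contiguous rank 0..M-1."""
    return {old: new for new, old in enumerate(surviving_ranks(n, failed))}


def sparse_index(n, failed):
    """Keep original rank numbers; a failed rank is left as a hole (None)."""
    return [None if r in failed else r for r in range(n)]


if __name__ == "__main__":
    n, failed = 6, {1, 4}           # N = 6 processes, ranks 1 and 4 fail
    print(dense_index(n, failed))   # {0: 0, 2: 1, 3: 2, 5: 3}
    print(sparse_index(n, failed))  # [0, None, 2, 3, None, 5]
```

With a dense index the application sees M = 4 consecutive ranks but must tolerate its peers being renumbered; with a sparse index rank numbers stay stable but the application must skip the holes.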

4 MPI Implications
- Scenario: Communicator abort (current state in MPI)
  - Process control:
    - Terminate processes in the failed intra-communicator
  - Communications:
    - Discard traffic associated with the failed communicators

5 MPI Implications
- Scenario: Some sort of CPR method in use
  - Process control:
    - Need to re-establish communications with the restarted processes
  - Communications:
    - May need to quiet the communications system to get into a state the CPR system can handle
    - May need to replay messages
    - May need to quiet communications until the parallel application is fully restarted (after failure)
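
One way to picture "may need to replay messages" is a sender-side message log. The sketch below is an assumption made for illustration: the slides do not say where messages are logged or how replay is driven, and the class `SenderLog` and its methods are hypothetical names, not part of MPI or any library.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical sender-side log: remembers messages sent to each rank."""

    def __init__(self):
        self._log = defaultdict(list)      # dest rank -> [(seq, payload), ...]
        self._next_seq = defaultdict(int)  # dest rank -> next sequence number

    def record(self, dest, payload):
        """Log a message at send time and return its sequence number."""
        seq = self._next_seq[dest]
        self._next_seq[dest] += 1
        self._log[dest].append((seq, payload))
        return seq

    def replay(self, dest, first_missing_seq):
        """Return, in send order, payloads the restarted rank has not seen."""
        return [p for seq, p in self._log[dest] if seq >= first_missing_seq]

    def truncate(self, dest, checkpointed_seq):
        """Forget messages already covered by the destination's checkpoint."""
        self._log[dest] = [
            (seq, p) for seq, p in self._log[dest] if seq >= checkpointed_seq
        ]
```

After a process restarts from a checkpoint, each peer would call `replay` with the first sequence number the checkpoint does not cover; `truncate` bounds the log once a newer checkpoint exists.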

6 MPI Implications
- Scenario: Application continues to run with M processes (M <= N)
  - Process control:
    - May need to re-index processes within affected communicators
  - Communications:
    - May need to (re-)establish communications
    - May need to handle communications to non-existent processes
    - May need to discard data
      - All outstanding traffic
      - Only traffic associated with failed processes
  - Groups and communicators:
    - May change during the life cycle of these objects
  - Collective communications:
    - How are collective optimizations impacted?
    - What happens with outstanding collective operations?
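
The two discard options listed under "May need to discard data" can be contrasted with a toy model of the outstanding-message queue. This is only a sketch under stated assumptions: `Pending`, `discard_all`, and `discard_failed_only` are hypothetical names, and a real implementation would operate on internal MPI queues rather than a Python list.

```python
from typing import NamedTuple


class Pending(NamedTuple):
    """A hypothetical outstanding message: source rank, destination rank, tag."""
    src: int
    dst: int
    tag: int


def discard_all(pending):
    """Policy 1: drop all outstanding traffic after a failure."""
    return []


def discard_failed_only(pending, failed):
    """Policy 2: drop only messages sent by or addressed to a failed rank."""
    return [m for m in pending if m.src not in failed and m.dst not in failed]


if __name__ == "__main__":
    queue = [Pending(0, 1, 7), Pending(2, 3, 7), Pending(4, 0, 9)]
    print(discard_failed_only(queue, failed={1, 4}))
    # [Pending(src=2, dst=3, tag=7)]
```

The first policy is simpler but forces survivors to resend everything; the second preserves traffic between survivors at the cost of tracking which messages involve a failed process.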

