Presentation on theme: "Lessons Learned Implementing User-Level Failure Mitigation in MPICH"— Presentation transcript:

1 Lessons Learned Implementing User-Level Failure Mitigation in MPICH
Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji
Argonne National Laboratory, {wbland, huiweilu, sseo, balaji}@anl.gov
CCGrid 2015, May 5, 2015

2 User-Level Failure Mitigation (ULFM)
What is ULFM?
– A proposed, standardized way of handling fail-stop process failures in MPI
– The mechanisms necessary to implement fault tolerance in applications and libraries, so that applications can continue execution after failures
ULFM introduces semantics to define failure notification, failure propagation, and failure recovery within MPI.
[Slide diagram: one process fails; an MPI_Recv at a surviving rank returns MPI_ERR_PROC_FAILED (failure notification); the failure is propagated to the remaining ranks; MPI_COMM_SHRINK() performs failure recovery.]
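
The three phases on the slide correspond to a small, recurring code pattern. Below is a minimal sketch of that notification/propagation/recovery cycle, assuming the MPIX_-prefixed names exposed by the ULFM prototype (MPIX_ERR_PROC_FAILED, MPIX_ERR_REVOKED, MPIX_Comm_revoke, MPIX_Comm_shrink); the proposal text itself uses plain MPI_ names such as MPI_ERR_PROC_FAILED.

/* Minimal sketch of the ULFM recovery cycle, using the MPIX_-prefixed
 * names from the ULFM prototype. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm comm = MPI_COMM_WORLD, newcomm;
    int rc, errclass;

    MPI_Init(&argc, &argv);
    /* Get error codes back instead of aborting the whole job. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    /* Failure notification: an operation that involves a failed
     * process returns an error instead of hanging. */
    rc = MPI_Barrier(comm);
    MPI_Error_class(rc, &errclass);

    if (errclass == MPIX_ERR_PROC_FAILED || errclass == MPIX_ERR_REVOKED) {
        /* Failure propagation: force the failure to be visible at
         * every surviving rank, so all of them enter recovery. */
        MPIX_Comm_revoke(comm);
        /* Failure recovery: collectively build a new communicator
         * that excludes the failed processes, then continue on it. */
        MPIX_Comm_shrink(comm, &newcomm);
        comm = newcomm;
    }

    MPI_Finalize();
    return 0;
}

Because every survivor runs the same branch after the revoke, the collective shrink can complete even though some ranks first learned of the failure directly and others only via the revocation.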

3 Motivation & Goal ULFM is becoming the front-running solution for process fault tolerance in MPI – Not yet adopted into the MPI standard – Being used by applications and libraries and is being Introduce an implementation of ULFM in MPICH – MPICH is a high-performance and widely portable implementation of the MPI standard – Implementing ULFM in MPICH will expedite adoption more widely by other MPI implementations Demonstrate that while still a reference implementation, the runtime cost of the new API calls introduced is relatively low CCGrid 2015 2

4 Implementation & Evaluation ULFM Implementation in MPICH Failure Detection – Local failures detected by Hydra and netmods – Error codes are returned back to the user from the API calls Agreement – Uses two group-based allreduce operations – If either fails, an error is returned to the user Revocation – Non-optimized implementation done with message flood Shrinking – All processes construct consistent group of failed procs via allreduce CCGrid 2015 3 Shrinking Slower than MPI_COMM_DUP because of failure detection. As expected, introducing failures in the middle of the algorithm causes large runtime increase as the algorithm must restart.

