
1 FT-MPICH: Providing Fault Tolerance for MPI Parallel Applications
Prof. Heon Y. Yeom, Distributed Computing Systems Lab., Seoul National University
Condor Week 2006  yeom@snu.ac.kr

2 Motivation
 Condor supports a Checkpoint/Restart (C/R) mechanism only in the Standard Universe, i.e., for single-process jobs.
 C/R for parallel jobs is not provided in any of the current Condor universes.
 We would like to make C/R available for MPI programs.

3 Introduction
 Why the Message Passing Interface (MPI)?
 Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems, so we have chosen the MPICH series.
 MPI is the most popular programming model in cluster computing.
 Providing fault tolerance to MPI is more cost-effective than providing it in the OS or hardware.

4 Architecture: Concept
 (Diagram) FT-MPICH is built from three elements: monitoring, failure detection, and a C/R protocol.

5 Architecture: Overall System
 (Diagram) A Management System and several MPI processes, each with its own communication module, are connected over Ethernet; within each node, communication with the local process goes through IPC (a message queue).

6 Management System
 The Management System makes MPI more reliable. Its responsibilities:
 Failure detection
 Checkpoint coordination
 Recovery
 Initialization coordination
 Output management
 Checkpoint transfer

7 Manager System
 (Diagram) A Leader Manager coordinates one Local Manager per MPI process; checkpoints are kept on stable storage.
 Leader Manager ↔ Local Managers: initialization, CKPT command, CKPT transfer, failure notification and recovery.
 MPI process ↔ MPI process: communication to exchange data.
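As a rough illustration, the control traffic between the Leader Manager and the Local Managers can be modeled as a small set of typed messages. The type and field names below are hypothetical, chosen only for this sketch; they are not taken from the FT-MPICH source.

    /* Hypothetical sketch of the manager control protocol.
     * Message and field names are illustrative, not FT-MPICH's own. */
    #include <stdint.h>
    #include <stdio.h>

    enum ctl_type {
        CTL_INIT_INFO,   /* Leader -> Local: initial endpoint table          */
        CTL_CKPT_CMD,    /* Leader -> Local: take a coordinated checkpoint   */
        CTL_CKPT_DONE,   /* Local  -> Leader: checkpoint image written       */
        CTL_FAIL_NOTIFY, /* Local  -> Leader: the local MPI process died     */
        CTL_RECOVER_CMD  /* Leader -> Local: restart from a checkpoint image */
    };

    struct ctl_msg {
        uint32_t type;     /* one of enum ctl_type              */
        uint32_t rank;     /* MPI rank this message refers to   */
        uint32_t ckpt_ver; /* checkpoint version, when relevant */
    };

    int main(void)
    {
        /* Example: the Leader asks rank 2 to checkpoint as version 3. */
        struct ctl_msg m = { CTL_CKPT_CMD, 2, 3 };
        printf("type=%u rank=%u ver=%u\n",
               (unsigned)m.type, (unsigned)m.rank, (unsigned)m.ckpt_ver);
        return 0;
    }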

8 Fault-Tolerant MPICH-P4
 (Diagram) The FT-MPICH software stack over Ethernet:
 Collective operations and P2P operations
 FT module: checkpoint toolkit, atomic message transfer
 Recovery module: connection re-establishment
 ADI (Abstract Device Interface)
 ch_p4 (Ethernet)

9 Startup in Condor
 Preconditions:
 The Leader Manager already knows, from user input, the machines where the MPI processes will execute and the number of MPI processes.
 The Local Manager and MPI process binaries are located at the same path on each machine.

10 Startup in Condor
 Job submission description file:
 The Vanilla universe is used.
 The executable entry points to a shell script, and that script only executes the Leader Manager.

 exe.sh (shell script):
    #!/bin/sh
    Leader_manager …

 Example.cmd (submit description file):
    universe   = Vanilla
    executable = exe.sh
    output     = exe.out
    error      = exe.err
    log        = exe.log
    queue

11 Startup in Condor
 The user submits the job using condor_submit; normal Condor job startup follows.
 (Diagram) In the Condor pool, the Central Manager runs the Negotiator and Collector, the Submit Machine runs the Schedd and Shadow, and the Startd/Starter on the Execute Machine launches the job, which is the Leader Manager.

12 Startup in Condor
 The Leader Manager executes a Local Manager on each execute machine, and each Local Manager executes its MPI process via fork() and exec().
 (Diagram) Execute Machines 1-3 each run a Local Manager and an MPI process.
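A minimal sketch, assuming a placeholder binary name, of the fork()/exec() step a Local Manager could perform to start its MPI process; this is not the actual FT-MPICH code.

    /* Minimal fork()/exec() sketch: a Local Manager launching its MPI
     * process.  The binary path "./mpi_app" is a placeholder. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* Child: become the MPI process. */
            execl("./mpi_app", "mpi_app", (char *)NULL);
            perror("execl");          /* reached only if exec fails */
            _exit(127);
        }
        /* Parent (Local Manager): supervise the MPI process.  An abnormal
         * exit here is where a failure notification would be sent to the
         * Leader Manager. */
        int status;
        waitpid(pid, &status, 0);
        printf("MPI process exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }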

13 Startup in Condor
 Each MPI process sends its communication info, and the Leader Manager aggregates this info.
 The Leader Manager then broadcasts the aggregated info to all processes; a sketch of this pattern follows.
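The gather-then-broadcast pattern for the endpoint table can be sketched as below; the struct layout, array size, and sample addresses are assumptions for illustration, not FT-MPICH's wire format.

    /* Sketch of aggregating per-rank endpoint info and handing the full
     * table to every process.  Layout and values are illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    #define NPROCS 4

    struct endpoint {
        int32_t  rank;
        char     ip[16];   /* dotted-quad address of the MPI process */
        uint16_t port;     /* listening port of the MPI process      */
    };

    int main(void)
    {
        struct endpoint table[NPROCS];

        /* In the real system each entry arrives over a control channel;
         * here we just fill in sample values, indexed by rank. */
        for (int r = 0; r < NPROCS; r++) {
            table[r].rank = r;
            table[r].port = (uint16_t)(5000 + r);
            snprintf(table[r].ip, sizeof table[r].ip, "10.0.0.%d", r + 1);
        }

        /* "Broadcast": every MPI process would receive this complete table
         * and use it to connect to its peers. */
        for (int r = 0; r < NPROCS; r++)
            printf("rank %d -> %s:%u\n",
                   (int)table[r].rank, table[r].ip, (unsigned)table[r].port);
        return 0;
    }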

14 Fault-Tolerant MPI
 To provide MPI fault tolerance, we have adopted:
 A coordinated checkpointing scheme (vs. an independent scheme): the Leader Manager is the coordinator.
 Application-level checkpointing (vs. kernel-level checkpointing): this method requires no effort on the part of cluster administrators.
 A user-transparent checkpointing scheme (vs. user-aware): this method requires no modification of the MPI application source code.
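One common way to obtain application-level, user-transparent checkpointing is to hide a signal handler inside the re-linked MPI library, so the user's source is never touched. The sketch below shows only that triggering mechanism; the signal number and take_checkpoint() are placeholders, not the FT-MPICH implementation.

    /* Sketch: user-transparent checkpoint triggering via a signal handler
     * hidden in the library.  SIGUSR1 and take_checkpoint() are assumed. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t ckpt_requested = 0;

    static void ckpt_handler(int sig)
    {
        (void)sig;
        ckpt_requested = 1;        /* defer the real work to a safe point */
    }

    static void take_checkpoint(void)
    {
        /* Placeholder: dump text/data/heap/stack to stable storage here. */
        printf("checkpointing process %d\n", (int)getpid());
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = ckpt_handler;
        sigaction(SIGUSR1, &sa, NULL);

        for (;;) {                 /* stands in for the MPI progress loop */
            if (ckpt_requested) {
                ckpt_requested = 0;
                take_checkpoint();
            }
            sleep(1);              /* the application's work goes here    */
        }
    }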

15 Atomic Message Passing
 Coordination between MPI processes; assumption: the communication channel is FIFO.
 Lock() and Unlock() delimit an atomic region of message transfer.
 (Diagram) If the CKPT SIG arrives while a process is inside the atomic region, the checkpoint is delayed until Unlock(); otherwise the checkpoint is performed immediately.
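A minimal sketch of how Lock()/Unlock() can delay the checkpoint until the atomic region has been left; the flag names and the use of SIGUSR1 are assumptions, not FT-MPICH internals.

    /* Sketch: delaying a checkpoint while inside a Lock()/Unlock() region.
     * Names and the choice of SIGUSR1 are illustrative assumptions. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static volatile sig_atomic_t in_atomic_region = 0;
    static volatile sig_atomic_t ckpt_pending     = 0;

    /* Placeholder; real code would not do I/O inside a signal handler. */
    static void do_checkpoint(void) { puts("checkpoint taken"); }

    static void ckpt_handler(int sig)
    {
        (void)sig;
        if (in_atomic_region)
            ckpt_pending = 1;      /* inside the atomic region: delay it */
        else
            do_checkpoint();       /* outside: perform it immediately    */
    }

    static void Lock(void)   { in_atomic_region = 1; }

    static void Unlock(void)
    {
        in_atomic_region = 0;
        if (ckpt_pending) {        /* take the checkpoint we postponed   */
            ckpt_pending = 0;
            do_checkpoint();
        }
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = ckpt_handler;
        sigaction(SIGUSR1, &sa, NULL);

        Lock();                    /* e.g. around one message transfer    */
        raise(SIGUSR1);            /* CKPT SIG arrives inside the region  */
        Unlock();                  /* the checkpoint happens here instead */
        return 0;
    }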

16 Atomic Message Passing (Case 1)
 When an MPI process receives the CKPT SIG, it sends and receives a barrier message on its channels and then takes its checkpoint.

17 Atomic Message Passing (Case 2)
 By sending and receiving the barrier messages, any in-transit message is pushed through to its destination; the checkpoint is delayed until the data arrives.

18 Atomic Message Passing (Case 3)
 Once the barrier has been received, the communication channel between the MPI processes has been flushed, so the dependency between the processes is removed and the delayed checkpoint can proceed. A sketch of this barrier-based flush follows.
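The sketch below demonstrates the flush on a single FIFO channel, with a pipe standing in for the socket between two MPI processes: the sender emits its pending data followed by a barrier marker, and the receiver drains everything up to the marker before the checkpoint is safe. The marker value and framing are assumptions.

    /* Sketch: flushing a FIFO channel with a barrier marker before a
     * checkpoint.  A pipe stands in for the inter-process socket. */
    #include <stdio.h>
    #include <unistd.h>

    #define BARRIER_MARKER (-1)    /* assumed out-of-band tag */

    int main(void)
    {
        int ch[2];                 /* ch[0] = read end, ch[1] = write end */
        if (pipe(ch) != 0) { perror("pipe"); return 1; }

        /* "Sender" side: in-transit application data, then the barrier. */
        int payload[] = { 7, 42, 99 };
        for (size_t i = 0; i < sizeof payload / sizeof payload[0]; i++)
            write(ch[1], &payload[i], sizeof payload[i]);
        int barrier = BARRIER_MARKER;
        write(ch[1], &barrier, sizeof barrier);

        /* "Receiver" side: drain the channel until the barrier arrives.
         * Only then is the channel empty and the checkpoint safe. */
        int msg;
        while (read(ch[0], &msg, sizeof msg) == (ssize_t)sizeof msg) {
            if (msg == BARRIER_MARKER)
                break;
            printf("drained in-transit message: %d\n", msg);
        }
        puts("channel flushed; safe to checkpoint");
        return 0;
    }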

19 Checkpointing
 Coordinated checkpointing: the Leader Manager issues the checkpoint command to ranks 0-3; each rank writes its image (text, data, heap, and stack) to stable storage, producing checkpoint version 2 alongside the previous version 1.
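A rough sketch of the coordinator's side of a coordinated checkpoint: broadcast the command for the new version, collect the acknowledgements, and only then treat that version as stable. The transport is stubbed out with prints and the function names are assumptions.

    /* Coordinator-side sketch of coordinated checkpointing.  The control
     * messages are stubbed with prints; names and flow are illustrative. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NPROCS 4

    static void send_ckpt_cmd(int rank, int ver)
    {
        printf("-> rank %d: checkpoint version %d\n", rank, ver);
    }

    static bool wait_ckpt_done(int rank)
    {
        printf("<- rank %d: checkpoint done\n", rank);
        return true;
    }

    int main(void)
    {
        int ver = 2;                        /* new version being written   */

        for (int r = 0; r < NPROCS; r++)    /* 1. broadcast the command    */
            send_ckpt_cmd(r, ver);

        bool all_done = true;               /* 2. collect acknowledgements */
        for (int r = 0; r < NPROCS; r++)
            all_done = wait_ckpt_done(r) && all_done;

        if (all_done)                       /* 3. version 2 is now stable  */
            printf("version %d committed; version %d may be discarded\n",
                   ver, ver - 1);
        return 0;
    }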

20 Failure Recovery
 MPI process recovery: a new process is created and its address space (text, data, heap, and stack) is restored from the CKPT image, yielding the restarted process.
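A minimal sketch, assuming a hypothetical --restart flag and image naming scheme, of how a Local Manager might relaunch a failed rank from its checkpoint image; the actual restore of the text/data/heap/stack segments is not shown.

    /* Sketch: relaunching a failed MPI process from its checkpoint image.
     * The --restart flag and the image name format are hypothetical. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static pid_t launch(int rank, int ckpt_ver)
    {
        char image[64];
        snprintf(image, sizeof image, "ckpt_rank%d_v%d.img", rank, ckpt_ver);

        pid_t pid = fork();
        if (pid == 0) {
            if (ckpt_ver > 0)     /* recovery: restore from the image */
                execl("./mpi_app", "mpi_app", "--restart", image,
                      (char *)NULL);
            else                  /* first launch: start fresh        */
                execl("./mpi_app", "mpi_app", (char *)NULL);
            _exit(127);
        }
        return pid;
    }

    int main(void)
    {
        pid_t pid = launch(0, 0); /* initial run of rank 0            */
        int status;
        waitpid(pid, &status, 0);

        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
            puts("rank 0 failed; restarting from checkpoint version 2");
            launch(0, 2);         /* recovery path                    */
        }
        return 0;
    }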

21 Failure Recovery
 Connection re-establishment: each MPI process re-opens its socket and sends its IP and port info to its Local Manager.
 This is the same exchange we performed earlier at initialization time.
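A sketch of the re-open step: bind a fresh listening socket to an ephemeral port and look up the address that must then be reported to the Local Manager; the reporting call itself is left as a comment.

    /* Sketch: re-opening a listening socket after recovery and finding
     * the (IP, port) pair to report to the Local Manager. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = 0;            /* let the kernel pick a port */

        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
            listen(fd, 16) < 0) {
            perror("bind/listen");
            return 1;
        }

        /* Discover which port was assigned, so it can be reported. */
        socklen_t len = sizeof addr;
        getsockname(fd, (struct sockaddr *)&addr, &len);
        printf("report to Local Manager: port %u\n",
               (unsigned)ntohs(addr.sin_port));

        /* ...send the (IP, port) pair over the control channel here... */
        close(fd);
        return 0;
    }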

22 Fault-Tolerant MPI
 Recovery from failure: once the failure is detected, the Leader Manager directs ranks 0-3 to roll back, and the processes are restored from checkpoint version 1 held in stable storage.

23 Fault-Tolerant MPI in Condor
 The Leader Manager controls the MPI processes by issuing checkpoint commands and by monitoring them; Condor itself is not aware of the failure incident.
 (Diagram) The Condor pool sees only the Leader Manager job; the Local Managers and MPI processes on Execute Machines 1-3 are managed beneath it.

24 Fault-Tolerant MPICH Variants (Seoul National University)
 The same FT module (checkpoint toolkit, atomic message transfer) and recovery module (connection re-establishment), providing collective and P2P operations on top of the ADI (Abstract Device Interface), are used in three variants:
 MPICH-GF: Globus2 device (Ethernet)
 M3: GM device (Myrinet)
 SHIELD: MVAPICH (InfiniBand)

25 Summary
 We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand.
 Currently, only the P4 (Ethernet) version works with Condor.
 We look forward to working with the Condor team.


