CiFTS Coordinated Infrastructure for Fault Tolerant Systems.

Presentation on theme: "CiFTS Coordinated Infrastructure for Fault Tolerant Systems."— Presentation transcript:

1 CiFTS Coordinated Infrastructure for Fault Tolerant Systems

2 Agenda
The problem and the purpose
The CIFTS framework
The CIFTS team
Call for action

3 The Problem
MPI detects "communication failure" with node X. MPI aborts! Application aborts!
JS/RM continues to schedule jobs on the same resources, leading to more failures.
No knowledge of this failure is shared between system resources:
1. Cluster system software is agnostic of this MPI job failure
2. Cluster system software is agnostic of the reason for this MPI job failure
Very little fault information sharing! No mechanism for global system knowledge!

4 The Purpose
MPI: detects "communication failure" with node X and shares this failure knowledge with the system
JS/RM: does not launch jobs on node X until further diagnosis
Diagnostics Utility: runs scripts for root-causing the node X problem
Application (three shown): each checkpoints itself
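As a rough illustration of the JS/RM behaviour described on this slide, the sketch below keeps a per-node "suspect" flag that is set when a fault event names a node and cleared once diagnostics pass. Everything here (`sched_state_t`, `on_node_fault`, `on_node_cleared`, `pick_node`, `MAX_NODES`) is hypothetical and is not part of CIFTS or of any real resource manager; it only shows the idea of withholding a node until further diagnosis.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64  /* hypothetical cluster size */

/* Hypothetical scheduler state: one "suspect" flag per node. */
typedef struct {
    bool suspect[MAX_NODES];
} sched_state_t;

/* Called when a fault event naming `node` is received. */
void on_node_fault(sched_state_t *s, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    s->suspect[node] = true;
}

/* Called once a diagnostics run has cleared `node`. */
void on_node_cleared(sched_state_t *s, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    s->suspect[node] = false;
}

/* Returns the first non-suspect node at or after `start`, or -1 if none. */
int pick_node(const sched_state_t *s, int start)
{
    for (int i = start; i < MAX_NODES; i++) {
        if (!s->suspect[i]) {
            return i;
        }
    }
    return -1;
}
```

With this shape, a scheduler would call `pick_node` before each launch, so a node flagged by MPI is skipped until `on_node_cleared` is called for it.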

5 The Purpose
PVFS: IO node failure, PVFS down. PVFS shares this information
JS/RM: launches jobs with the NFS file system
MPI-IO: prints a coherent error message
Application (two shown): each checkpoints itself
Existing jobs are migrated

6 The CIFTS Framework
Diagram labels: Fault Tolerant Backplane; Libraries and Applications; System Components; Autonomics; Linear Algebra Libraries; Middleware like MPI; MPI-IO; Universal Logger; Automatic Actions; Diagnostics Tools; Event Analysis; Checkpoint Restart System; PVFS; Resource Manager/JS

7 A little deeper…
Each component instance interacts with the distributed Fault Tolerant Backplane in three steps:
1. Register with FTB
2. Subscribe for events
3. Publish events
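The register / subscribe / publish cycle on this slide can be sketched with a tiny in-memory event bus. This is only a single-process toy, not the distributed FTB, and none of the names (`backplane_t`, `bp_init`, `bp_subscribe`, `bp_publish`, `count_event`) come from the FTB API; a `NULL` event name is assumed here to mean "all events".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SUBS 16  /* hypothetical limit on subscriptions */

typedef void (*event_cb_t)(const char *event_name, void *arg);

typedef struct {
    const char *event_name;  /* NULL subscribes to every event */
    event_cb_t  cb;
    void       *arg;
} subscription_t;

typedef struct {
    subscription_t subs[MAX_SUBS];
    size_t         nsubs;
} backplane_t;

/* Step 1 (register): prepare an empty backplane. */
void bp_init(backplane_t *bp)
{
    bp->nsubs = 0;
}

/* Step 2 (subscribe): returns 0 on success, -1 if the table is full. */
int bp_subscribe(backplane_t *bp, const char *event_name,
                 event_cb_t cb, void *arg)
{
    assert(cb != NULL);
    if (bp->nsubs == MAX_SUBS) {
        return -1;
    }
    bp->subs[bp->nsubs++] = (subscription_t){ event_name, cb, arg };
    return 0;
}

/* Step 3 (publish): invokes matching callbacks, returns how many ran. */
int bp_publish(const backplane_t *bp, const char *event_name)
{
    int delivered = 0;
    for (size_t i = 0; i < bp->nsubs; i++) {
        const subscription_t *s = &bp->subs[i];
        if (s->event_name == NULL ||
            strcmp(s->event_name, event_name) == 0) {
            s->cb(event_name, s->arg);
            delivered++;
        }
    }
    return delivered;
}

/* Example callback: increments the int counter passed as `arg`. */
void count_event(const char *event_name, void *arg)
{
    (void)event_name;
    (*(int *)arg)++;
}
```

A component would call `bp_init` once, subscribe with a callback such as `count_event`, and then publish by name; in the real backplane the delivery step crosses process and node boundaries.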

8 CIFTS API - some primitives

FTB_Init (IN FTB_comp_info_t *comp_info,
          OUT FTB_client_handle_t *client_handle,
          OUT char *error_msg)

FTB_Publish_event (IN FTB_client_handle_t handle,
                   IN char *event_name,
                   IN FTB_event_data_t *datadetails,
                   OUT char *error_msg)

FTB_Create_mask (INOUT FTB_event_mask_t *evt_mask,
                 IN char *field_name,
                 IN char *field_val,
                 OUT char *error_msg)

FTB_Subscribe (IN FTB_client_handle_t chandle,
               IN FTB_event_mask_t *event_mask,
               OUT FTB_subscribe_handle_t *shandle,
               OUT char *error_msg,
               IN int (*callback)(OUT FTB_catch_event_info_t *, OUT void *),
               IN void *arg)

FTB_Poll_for_event (IN FTB_subscribe_handle_t shandle,
                    OUT FTB_catch_event_info_t *catch_event,
                    OUT char *error_msg)

FTB_Finalize (IN FTB_client_handle_t handle)

9 Supported components
Connected through the Fault Tolerant Backplane: BLCR, FT-LA, SWIM, IPS, LAMMPS, OpenMPI, PVFS, MPICH2, MVAPICH2, LAM/MPI, Cobalt, ScaLAPACK, ROMIO, NWChem, ZeptoOS

10 Status Quo
Alpha version in the works
–Demos available on SC exhibit floor
Client API to be finalized by Q1 FY08
First release target: ?
–Platforms supported: Linux clusters, IBM BGL
Discuss more with Pete

11 CIFTS team
Argonne National Laboratory
–Pete Beckman, Rinku Gupta, Ewing Lusk, Rob Ross, Rajeev Thakur
Indiana University
–Andrew Lumsdaine & team
Lawrence Berkeley National Laboratory
–Paul Hargrove
Oak Ridge National Laboratory
–Al Geist, David Bernholdt
Ohio State University
–D.K. Panda
University of Tennessee, Knoxville
–Jack Dongarra

12 Call for Action
Connected through the Fault Tolerant Backplane: BLCR, FT-LA, SWIM, IPS, PBS/Pro, LAMMPS, OpenMPI, Lustre, PVFS, Scali MPI, Global Arrays, Intel MPI, MPICH2, Polyserv, GPFS, GFS, IBRIX, MVAPICH2, MPICH-MX, Panasas, LAM/MPI, Other Applications, SGE, MAUI, Condor, LSF, Cobalt, Intel MKL, ScaLAPACK, ROMIO, SLURM, NWChem, Fluent, MM5, LS-Dyna, ZeptoOS, Linux, Eclipse, BLAST, Star-CD

13 Need more information?
SC’07 Exhibit floor
–Demos and/or talks at ANL, ORNL and LBNL booth
CIFTS website
–http://www.mcs.anl.gov/research/cifts/
CIFTS wiki
–http://wiki.mcs.anl.gov/cifts
CIFTS mailing list
–cifts_discuss@googlegroups.com

14 Do we need a slide on timeline? Do we need to go into more details on the design?

15 Backup

16 CIFTS - The working view
Diagram labels: Bootstrap Server; Libraries and Applications; System Components; Autonomics; Linear Algebra Libraries; Middleware like MPI; MPI-IO; Universal Logger; Automatic Actions; Diagnostics Tools; Event Analysis; Checkpoint Restart System; PVFS; Resource Manager/JS

17 FTB Internal Architecture Layers
Diagram labels: Component software stack; FTB Agent software stack; Component 1; Component n; FTB Agent; FTB Client API; Client Library; FTB Manager API; Manager Library; Network; Network Module 1; Network Module 2; Linux; BGL; Cray

18 What you need to know! Just the FTB Client API
Diagram labels: Component software stack; FTB Agent software stack; Component 1; Component n; FTB Agent; Client Library; FTB Manager API; Manager Library; Network; Network Module 1; Network Module 2; Linux; BGL; Cray

19 Building an FTB-enabled sample component
1. List the events you may want to publish in an XML file (for convenience)
2. Use the API to make the component FTB-enabled
3. Publish and subscribe to events

20 FTB-Enabled Component Development (Step 1)
STEP 1: Create an XML file outlining the publishable events
(XML markup was lost in the transcript; the surviving values are:)
ftb.ftb_examples.watchdog
WATCH_DOG_EVENT
Info
This event is used by watchdog …

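21 Developing an FTB-enabled component (Step 2)
STEP 2: Enabling your FTB component!

#include <string.h>
#include "libftb.h"
#include "ftb_event_def.h"
#include "ftb_throw_events.h"

int main (int argc, char *argv[])
{
    /* The slide omits declarations; the types below follow the API
       primitives on slide 8. The err_msg size is an assumption. */
    FTB_comp_info_t cinfo;
    FTB_client_handle_t handle;
    FTB_event_mask_t mask;
    FTB_subscribe_handle_t shandle;
    FTB_catch_event_info_t caught_event;
    FTB_event_data_t *publish_event_data = NULL; /* placeholder: payload setup is not shown on the slide */
    char err_msg[1024];

    /* Describe this component to the backplane */
    strcpy(cinfo.comp_namespace, "FTB.FTB_EXAMPLES.Watchdog");
    strcpy(cinfo.schema_ver, "0.5");
    strcpy(cinfo.inst_name, "watchdog");
    strcpy(cinfo.jobid, "watchdog-111");
    strcpy(cinfo.catch_style, "FTB_POLLING_CATCH");

    /* Register with FTB and declare the events this component can publish */
    FTB_Init(&cinfo, &handle, err_msg);
    FTB_Register_publishable_events(handle, ftb_ftb_examples_watchdog_events,
                                    FTB_FTB_EXAMPLES_WATCHDOG_TOTAL_EVENTS, err_msg);

    /* Subscribe to all events; no callback, since events are polled */
    FTB_Create_mask(&mask, "all", "init", err_msg);
    FTB_Subscribe(handle, &mask, &shandle, err_msg, NULL, NULL);

    /* Publish one event, then poll for a caught event */
    FTB_Publish_event(handle, "WATCH_DOG_EVENT", publish_event_data, err_msg);
    FTB_Poll_for_event(shandle, &caught_event, err_msg);

    FTB_Finalize(handle);
    return 0;
}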

22 Developing an FTB-enabled component (Step 2, contd.)
STEP 2: Enabling your FTB component!
Creating your subscribe event mask

Create a mask to catch all events
1. FTB_Create_mask(&mask, "all", "init", err_msg);

Create a mask to catch "WATCH_DOG_EVENT"
1. FTB_Create_mask(&mask, "all", "init", err_msg);
2. FTB_Create_mask(&mask, "event_name", "WATCH_DOG_EVENT", err_msg);

Create a mask to catch events of severity fatal
1. FTB_Create_mask(&mask, "all", "init", err_msg);
2. FTB_Create_mask(&mask, "severity", "FTB_FATAL", err_msg);
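The mask calls on this slide follow an "initialise to match everything, then narrow one field" pattern. The sketch below models that pattern with a hypothetical `mask_t`; it is not how `FTB_event_mask_t` is implemented, and the names `mask_set`, `mask_matches`, `event_t`, the event `NODE_DOWN`, and the severity string `FTB_INFO` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define FIELD_LEN 64  /* hypothetical maximum field length */

/* Hypothetical mask: an empty string means "match any value". */
typedef struct {
    char event_name[FIELD_LEN];
    char severity[FIELD_LEN];
} mask_t;

/* Hypothetical event carrying the two fields the slide filters on. */
typedef struct {
    const char *event_name;
    const char *severity;
} event_t;

/* ("all", "init") resets the mask; other fields narrow it.
   Returns 0 on success, -1 for an unknown field name. */
int mask_set(mask_t *m, const char *field, const char *value)
{
    if (strcmp(field, "all") == 0 && strcmp(value, "init") == 0) {
        m->event_name[0] = '\0';
        m->severity[0] = '\0';
        return 0;
    }
    if (strcmp(field, "event_name") == 0) {
        snprintf(m->event_name, FIELD_LEN, "%s", value);
        return 0;
    }
    if (strcmp(field, "severity") == 0) {
        snprintf(m->severity, FIELD_LEN, "%s", value);
        return 0;
    }
    return -1;
}

/* An event matches when every non-empty mask field equals the event's. */
bool mask_matches(const mask_t *m, const event_t *e)
{
    bool name_ok = m->event_name[0] == '\0' ||
                   strcmp(m->event_name, e->event_name) == 0;
    bool sev_ok = m->severity[0] == '\0' ||
                  strcmp(m->severity, e->severity) == 0;
    return name_ok && sev_ok;
}
```

Under this model, the first call in each recipe on the slide clears any earlier filter, which is why every recipe starts with `"all", "init"` before narrowing.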

23 Developing an FTB-enabled component (Step 3)
STEP 3: Provide options for the end user to compile your code with FTB
Modify configure.in and makefiles so that you can compile your code:
./configure --with-ftb=

24 Setting up FTB environment
Compiling FTB
Download FTB
1. ./configure --with-platform=linux --with-bstrap-name=bucco
2. make
3. make install

25 Using FTB
Starting FTB
1. ./ftb_database_server
2. ./ftb_agent on all Linux nodes
3. Run your component executables
Connection Topology (diagram): each FTB Agent contacts the bootstrap DB server (BS); the server provides a parent address, which connects the agents to one another.
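The slide says only that an agent contacts the bootstrap server and receives a parent address; it does not say how the server chooses the parent. The sketch below assumes, purely for illustration, that agents are numbered in arrival order and arranged in a tree with a fixed fan-out. The names `bootstrap_t`, `bootstrap_register`, `FANOUT`, `MAX_AGENTS`, and `NO_PARENT` are hypothetical and are not taken from FTB.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_AGENTS 128   /* hypothetical capacity */
#define FANOUT     2     /* assumed children per agent */
#define NO_PARENT  (-1)  /* the first agent becomes the root */

/* Hypothetical bootstrap server state. */
typedef struct {
    int count;  /* number of agents registered so far */
} bootstrap_t;

/* Registers a new agent. Returns its id and writes the id of its parent
   to *parent (NO_PARENT for the root). Returns -1 when the server is full. */
int bootstrap_register(bootstrap_t *bs, int *parent)
{
    assert(parent != NULL);
    if (bs->count >= MAX_AGENTS) {
        return -1;
    }
    int id = bs->count++;
    *parent = (id == 0) ? NO_PARENT : (id - 1) / FANOUT;
    return id;
}
```

In a real deployment the returned parent would be a network address rather than an integer id, and the server would also have to handle agents that leave or fail.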

26 Open Issues
Policy management
–Global knowledge of component prioritization for handling events
How can components announce their FT capabilities?
How can components request action from other components?
How do we establish scoping of events?

