A Proposal of Application Failure Detection and Recovery in the Grid Marian Bubak 1,2, Tomasz Szepieniec 2, Marcin Radecki 2 1 Institute of Computer Science,

A Proposal of Application Failure Detection and Recovery in the Grid
Marian Bubak 1,2, Tomasz Szepieniec 2, Marcin Radecki 2
1 Institute of Computer Science, AGH
2 Academic Computer Centre - CYFRONET

Outline
- Motivation & introduction
- Services useful in fault recovery approach
- Overview of our proposal
- Problems & workflow approach
- Summary

Motivation
- Fault tolerance is becoming an important problem in the Grid
- Environment and application sizes increase steadily
- Reliability of a single component does not rise considerably
- The risk that an application crashes is higher
- A crash is more expensive for a large application

Demands vs. Reality
Demands:
- Minimal overhead
- Automatic, quick recovery
- Scalability
- Transparent
- Porting to any kind of application
Reality:
- Checkpointing is costly
- Often restarting whole application
- Many global operations
- Additional developer's effort is required
- Application-specific methods

Two classes of FT approaches
Application built-in FT
- Algorithm/structure profile can be exploited
- FT activity can be done more efficiently, e.g. checkpointing
- Naturally fault-tolerant problem classes, e.g. genetic algorithms
- Fault Tolerant MPI
- but... all must be done by the developer
FT realized by external services
- automatic middleware services
- no developer effort required
- but... limited functionality
It would be beneficial to combine these two

Services useful in FT approach
Monitoring services
- For fault detection in hardware and software
- e.g. check if a process is still running
Checkpointing, logging, redundancy services
- For preparing recovery
- e.g. store the current state of the application
Recovery services
- In case of failure
- e.g. roll back to the last checkpoint
Scheduler and resource broker
- For knowledge about the started application
- For re-scheduling or re-brokering a job or its part

How to make it work together?
- A component that manages these services is needed
  - part of middleware
  - job companion
  - co-ordinates actions of FT services
- The recovery action taken is more appropriate, because:
  - the whole job state is considered
  - the most suitable of the available services can be used
[Diagram labels: Fault Tolerant Manager; Checkpointing Services; Application Mon. Services; Recovery Services; Scheduler Services; Infrastructure Mon. Services]

FT Manager – Architecture
[Diagram labels: Fault Tolerant Manager (Job Supervisor, Decision Maker, Recovery Scenario Executor); Checkpointing Services; Application Mon. Services; Recovery Services; Scheduler Services; Infrastructure Mon. Services; Infrastructure; Application; Monitoring; Checkpointing; Recovery]

Job Supervisor (1)
Main functionality:
- Monitors job execution
- Manages (or stores information about) checkpointing
- Generates a Fault Alarm when something is wrong
The Fault Alarm contains not only information about what is wrong, but also the status of the job (e.g. last checkpoint)
The Job Supervisor can be asked by the Decision Maker to perform more checking
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario Executor; Fault Alarm]
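The Fault Alarm described on this slide could be represented roughly as follows. This is a minimal sketch, not the authors' design: the names JobStatus, FaultAlarm, record_checkpoint and check, and all field names, are hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobStatus:
    """Job state attached to every alarm (hypothetical fields)."""
    job_id: str
    last_checkpoint: Optional[str]  # id of the most recent checkpoint, if any


@dataclass
class FaultAlarm:
    """What is wrong, plus the status of the job when it went wrong."""
    fault: str          # e.g. "process crash"
    component: str      # where the fault was observed
    status: JobStatus
    raised_at: float = field(default_factory=time.time)


class JobSupervisor:
    """Stores checkpoint information and raises alarms (illustrative only)."""

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        self.last_checkpoint: Optional[str] = None

    def record_checkpoint(self, checkpoint_id: str) -> None:
        self.last_checkpoint = checkpoint_id

    def check(self, component: str, is_alive: bool) -> Optional[FaultAlarm]:
        # A real supervisor would query monitoring services; here the
        # liveness result is passed in directly to keep the sketch small.
        if is_alive:
            return None
        status = JobStatus(self.job_id, self.last_checkpoint)
        return FaultAlarm("process crash", component, status)
```

Carrying the job status inside the alarm means the Decision Maker does not need a second round trip to learn which checkpoint a rollback would use.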

Job Supervisor (2) – Faults
Typical examples of faults:
- process crash
- node is not responding
- lost connection (link is down)
Extended fault characteristics:
- Occurrence and duration characteristics
- Severity for the application, e.g. a master fault is more dangerous than a slave fault
- A fault is not only a lost connection, but also a dramatic decrease in performance
- Sophisticated performance monitoring is required
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario Executor; Fault Alarm]
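The extended fault characteristics above can be illustrated with a small classifier that treats a severe slowdown as a fault and rates severity by process role. This is a sketch under stated assumptions: the function name classify, the Severity levels and the 25% slowdown threshold are all hypothetical, and the slides do not specify how performance degradation is measured.

```python
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    LOW = 1
    HIGH = 2


def classify(
    role: str,
    responding: bool,
    throughput: float,
    baseline: float,
    slowdown_threshold: float = 0.25,  # assumed value, not from the slides
) -> Optional[Tuple[str, Severity]]:
    """Return (fault kind, severity), or None if the process looks healthy."""
    if not responding:
        kind = "not responding"
    elif baseline > 0 and throughput < slowdown_threshold * baseline:
        # Performance has dropped dramatically: treat it as a fault too.
        kind = "performance degradation"
    else:
        return None
    # A master fault is more dangerous than a slave fault.
    severity = Severity.HIGH if role == "master" else Severity.LOW
    return kind, severity
```

For example, a slave running at 10% of its baseline throughput is reported as a low-severity performance fault, while an unresponsive master is reported as high severity.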

Decision Maker
Main functionality:
- Analyzes the situation when it receives a Fault Alarm
- Prepares recovery scenarios and sends the best of them for execution
Issues to be considered:
- What is possible
- The cost of each recovery scenario
- A do-nothing or wait scenario is always possible and sometimes beneficial, e.g. when a network link fails and the only recovery is to restart the whole application
- Historical data and probabilistic methods should be used
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario Executor; Fault Alarm; Recovery Scenario]
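One way to read the cost and probability considerations above is as an expected-cost comparison in which the wait scenario is always a candidate. The sketch below assumes a simple model that the slides do not define: each scenario has an execution cost and a success probability (imagined as estimated from historical data), and a failed scenario incurs a fixed penalty. The names Scenario, expected_cost and choose are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Scenario:
    name: str
    cost: float          # estimated cost of executing it, e.g. seconds lost
    success_prob: float  # assumed to come from historical data


def expected_cost(scenario: Scenario, failure_penalty: float) -> float:
    """Execution cost plus the penalty weighted by the chance of failure."""
    return scenario.cost + (1.0 - scenario.success_prob) * failure_penalty


def choose(
    candidates: Iterable[Scenario],
    wait_cost: float,
    wait_success_prob: float,
    failure_penalty: float,
) -> Scenario:
    """Pick the cheapest scenario; waiting is always among the options."""
    wait = Scenario("wait", wait_cost, wait_success_prob)
    options = list(candidates) + [wait]
    return min(options, key=lambda s: expected_cost(s, failure_penalty))
```

With a one-hour restart as the only alternative, waiting two minutes for a link that recovers 80% of the time has the lower expected cost (840 against 3636 under this model), which matches the network-link example on the slide.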

Recovery Scenario Executor
Main functionality:
- Executes actions from the scenario
- Supervises the recovery process
A Recovery Scenario contains several actions that could be performed by different recovery services
If scenario execution fails, the Decision Maker is alarmed
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario Executor; Recovery Scenario]
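The executor behaviour described above (run the actions in order, alarm the Decision Maker if one fails) might look like the following sketch. It assumes a scenario is an ordered list of named callables that return True on success; that representation, the class layout and the callback signature are hypothetical.

```python
from typing import Callable, List, Tuple

# A recovery action: a label plus a callable that returns True on success.
Action = Tuple[str, Callable[[], bool]]


class RecoveryScenarioExecutor:
    """Runs scenario actions in order and reports the first failure."""

    def __init__(self, alarm_decision_maker: Callable[[str], None]) -> None:
        # Callback standing in for the channel back to the Decision Maker.
        self._alarm = alarm_decision_maker

    def execute(self, scenario: List[Action]) -> bool:
        for name, run in scenario:
            try:
                ok = run()
            except Exception:
                # An action that raises counts as a failed action.
                ok = False
            if not ok:
                self._alarm(name)  # let the Decision Maker re-plan
                return False
        return True
```

Stopping at the first failed action keeps later steps, such as resubmitting a job, from running against a state that an earlier step, such as a rollback, did not manage to restore.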

Problems
- Many classes of services to cooperate with
- Many interfaces
- How to obtain information about the application?
- Which services are available?
- A semantic specification for monitoring and recovery services is needed

Feasibility – WorkFlows
- A Grid-Services-based approach could help to solve our problems
- Knowledge about the application architecture is accessible
- Workflow description details are welcome
- Exchanging a single component is better than restarting the whole application
- Directives for the FT Manager could be included in the job description
- Interfaces are unified

Summary
- Fault tolerance issues are becoming more and more important in the Grid
- A service for fault tolerance management has been proposed...
- ...which enables more sophisticated fault tolerance for the Grid
- A workflow-based framework facilitates the task
- But this is only a proposal...
- You are invited to comment!
