A Proposal of Application Failure Detection and Recovery in the Grid Marian Bubak 1,2, Tomasz Szepieniec 2, Marcin Radecki 2 1 Institute of Computer Science,

Presentation transcript:

A Proposal of Application Failure Detection and Recovery in the Grid
Marian Bubak (1,2), Tomasz Szepieniec (2), Marcin Radecki (2)
(1) Institute of Computer Science, AGH
(2) Academic Computer Centre -- CYFRONET
Institute of Computer Science AGH

Outline
- Motivation & introduction
- Services useful in fault recovery approach
- Overview of our proposal
- Problems & workflow approach
- Summary

Motivation
- The fault tolerance problem is becoming important in the Grid
- Environment and application sizes increase steadily
- The reliability of a single component does not rise considerably
- The risk that an application crashes is higher
- A crash is more expensive for a large application

Demands vs. Reality
Demands:
- Minimal overhead
- Automatic, quick recovery
- Scalability
- Transparent
- Porting to any kind of application
Reality:
- Checkpointing is costly
- Often restarting the whole application
- Many global operations
- Additional developer's effort is required
- Application-specific methods

Two classes of FT approaches
Application built-in FT:
- The algorithm/structure profile can be exploited
- FT activity, e.g. checkpointing, can be done more efficiently
- Naturally fault-tolerant problem classes, e.g. genetic algorithms
- Fault Tolerant-MPI
- but... everything must be done by the developer
FT realized by external services:
- automatic
- middleware services
- no developer effort required
- but... limited functionality
It would be beneficial to combine these two.

Services useful in FT approach
- Monitoring services: for fault detection in hardware and software, e.g. checking whether a process is still running
- Checkpointing, logging, redundancy services: for preparing recovery, e.g. storing the current state of the application
- Recovery services: used in case of failure, e.g. rolling back to the last checkpoint
- Scheduler and resource broker: for knowledge about the started application, and for re-scheduling or re-brokering a job or a part of it
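The monitoring example above (checking whether a process is still running) could be approximated with heartbeats. This is only a minimal sketch under stated assumptions: the `HeartbeatMonitor` class, the timeout value, and the process names are hypothetical and are not part of the proposal. Timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor: a process is suspected to have failed when its
    last heartbeat is older than `timeout` seconds."""
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, process_id: str, now: float) -> None:
        # Record the time at which the process last reported in.
        self.last_seen[process_id] = now

    def suspected_failures(self, now: float) -> List[str]:
        # Return, in sorted order, every process that has been silent too long.
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("master", now=0.0)
monitor.heartbeat("worker-1", now=0.0)
monitor.heartbeat("master", now=4.0)    # only the master reports again
print(monitor.suspected_failures(now=6.0))
```

A silent process is only suspected, not proven, to have failed; a slow network can delay heartbeats as well.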

How to make it work together?
A component that manages these services is needed:
- part of the middleware
- a job companion
- co-ordinates the actions of FT services
The recovery action taken is more appropriate, because:
- the whole job state is considered
- the most suitable of the available services can be used
[Diagram labels: Fault Tolerant Manager; Checkpointing Services; Application Mon. Services; Recovery Services; Scheduler Services; Infrastructure Mon. Services]

FT Manager – Architecture
[Diagram labels: Fault Tolerant Manager (Job Supervisor, Decision Maker, Recovery Scenario Executor); Checkpointing Services; Application Mon. Services; Recovery Services; Scheduler Services; Infrastructure Mon. Services; Infrastructure; Application; Monitoring; Checkpointing; Recovery]

Job Supervisor (1)
Main functionality:
- Monitors job execution
- Manages (or stores information about) checkpointing
- Generates a Fault Alarm when something is wrong
The Fault Alarm contains not only information about what is wrong, but also the status of the job (e.g. the last checkpoint).
The Decision Maker can ask the Job Supervisor to perform more checking.
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario Executor; Fault Alarm]
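A Fault Alarm that carries both the fault and the job status could look roughly like the sketch below. The class names follow the slide, but every field and method name (`job_id`, `last_checkpoint`, `record_checkpoint`, `report_fault`) is an assumption made for illustration; the slides do not define an interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FaultAlarm:
    """Hypothetical alarm: what went wrong plus the job status."""
    job_id: str
    fault: str
    last_checkpoint: Optional[str]  # None if no checkpoint was taken yet


class JobSupervisor:
    """Hypothetical supervisor that tracks checkpoints for one job."""

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        self.checkpoints: List[str] = []

    def record_checkpoint(self, checkpoint_id: str) -> None:
        # Store information about checkpointing as it happens.
        self.checkpoints.append(checkpoint_id)

    def report_fault(self, fault: str) -> FaultAlarm:
        # Attach the most recent checkpoint so the receiver sees the job state.
        last = self.checkpoints[-1] if self.checkpoints else None
        return FaultAlarm(self.job_id, fault, last)


supervisor = JobSupervisor("job-42")
supervisor.record_checkpoint("ckpt-001")
supervisor.record_checkpoint("ckpt-002")
alarm = supervisor.report_fault("process crash")
print(alarm)
```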

Job Supervisor (2) – Faults
Typical examples of faults:
- process crash
- node is not responding
- lost connection (link is down)
Extended fault characteristics:
- Occurrence and duration characteristics
- Severity for the application, e.g. a master fault is more dangerous than a slave fault
- A fault occurs not only when a connection is lost, but also when performance decreases dramatically
- Sophisticated performance monitoring is required
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario Executor; Fault Alarm]
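The extended fault characteristics above can be illustrated with a small classifier that treats a sharp slowdown as a fault and rates a master fault as more severe than a slave fault. This is a sketch only: the `Observation` fields, the two severity levels, and the 25% slowdown threshold are all assumptions chosen for the example, not values from the proposal.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class Severity(IntEnum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Observation:
    """Hypothetical monitoring sample for one process."""
    role: str          # "master" or "slave"
    responding: bool   # did the process answer the last probe?
    throughput: float  # currently measured work rate
    baseline: float    # work rate expected under normal conditions


def classify(obs: Observation,
             slowdown_threshold: float = 0.25) -> Optional[Tuple[str, Severity]]:
    """Return (fault kind, severity), or None if no fault is detected."""
    # Assumption: a master fault is rated HIGH, any other role LOW.
    severity = Severity.HIGH if obs.role == "master" else Severity.LOW
    if not obs.responding:
        return ("not_responding", severity)
    # A dramatic performance drop counts as a fault even if the link is up.
    if obs.throughput < slowdown_threshold * obs.baseline:
        return ("performance_degradation", severity)
    return None


print(classify(Observation("master", False, 0.0, 10.0)))
print(classify(Observation("slave", True, 2.0, 10.0)))
print(classify(Observation("slave", True, 9.0, 10.0)))
```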

Decision Maker
Main functionality:
- Analyzes the situation when it receives a Fault Alarm
- Prepares recovery scenarios and sends the best of them for execution
Issues to be considered:
- What is possible
- The cost of each recovery scenario
- A do-nothing or wait scenario is always possible and sometimes beneficial, e.g. for a network link problem when the only recovery is to restart the whole application
- Historical data and probabilistic methods should be used
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario Executor; Fault Alarm; Recovery Scenario]
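One way to compare scenarios by cost, including the wait scenario, is an expected-cost estimate. The sketch below is an assumption, not the authors' method: the `Scenario` fields, the formula `cost + (1 - success_probability) * failure_penalty`, and all numbers are hypothetical. In practice the probabilities might come from the historical data mentioned above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Scenario:
    """Hypothetical recovery scenario with rough estimates."""
    name: str
    cost: float                 # estimated cost of carrying it out (e.g. minutes)
    success_probability: float  # estimated chance that it resolves the fault


def expected_cost(scenario: Scenario, failure_penalty: float) -> float:
    # Cost of trying the scenario plus the weighted cost of it not working.
    return scenario.cost + (1.0 - scenario.success_probability) * failure_penalty


def choose_scenario(scenarios: Iterable[Scenario], failure_penalty: float) -> Scenario:
    # Pick the scenario with the lowest expected cost.
    return min(scenarios, key=lambda s: expected_cost(s, failure_penalty))


# Network link problem: the only active recovery is a full restart.
restart = Scenario("restart whole application", cost=60.0, success_probability=0.99)
likely_wait = Scenario("wait", cost=5.0, success_probability=0.8)
unlikely_wait = Scenario("wait", cost=5.0, success_probability=0.05)

print(choose_scenario([likely_wait, restart], failure_penalty=60.0).name)
print(choose_scenario([unlikely_wait, restart], failure_penalty=60.0).name)
```

With these made-up numbers, waiting wins when the link is likely to come back on its own, and the restart wins when it is not.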

Recovery Scenario Executor
Main functionality:
- Executes the actions from a scenario
- Supervises the recovery process
A Recovery Scenario contains several actions that can be performed by different recovery services.
If scenario execution fails, the Decision Maker is alarmed.
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario Executor; Recovery Scenario]
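The executor behaviour described above (run the actions in order, alarm the Decision Maker on failure) might be sketched as follows. The function name `execute_scenario`, the callback signature, and the example actions are hypothetical; the slides do not specify how actions or alarms are represented.

```python
from typing import Callable, List, Tuple

# A hypothetical action: a label plus a callable that raises on failure.
Action = Tuple[str, Callable[[], None]]


def execute_scenario(actions: List[Action],
                     alarm_decision_maker: Callable[[str, Exception], None]) -> bool:
    """Run actions in order; stop and raise an alarm at the first failure."""
    for name, action in actions:
        try:
            action()
        except Exception as error:
            # Hand the failed step back so a new scenario can be prepared.
            alarm_decision_maker(name, error)
            return False
    return True


log: List[str] = []
alarms: List[str] = []


def failing_restore() -> None:
    raise RuntimeError("checkpoint unreadable")


scenario: List[Action] = [
    ("reserve node", lambda: log.append("reserve node")),
    ("restore checkpoint", failing_restore),
    ("resume job", lambda: log.append("resume job")),
]

succeeded = execute_scenario(scenario, lambda name, err: alarms.append(f"{name}: {err}"))
print(succeeded, log, alarms)
```

Stopping at the first failed action is a design choice of this sketch; an executor could also attempt compensating actions before alarming.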

Problems
- Many classes of services to cooperate with
- Many interfaces
- How to obtain information about the application?
- Which services are available?
- A semantic specification for monitoring and recovery services is needed

Feasibility – Workflows
A Grid-Services-based approach could help to solve our problems:
- Knowledge about the application architecture is accessible
- Workflow description details are welcome
- Exchanging a single component is better than restarting the whole application
- Directives for the FT Manager could be included in the job description
- Interfaces are unified

Summary
- Fault tolerance issues are becoming more and more important in the Grid
- A service for fault tolerance management has been proposed...
- ...which enables more sophisticated fault tolerance for the Grid
- A workflow-based framework facilitates the task
- But this is a proposal only...
You are invited to comment!