ICPADS 2007 1 Extending GridSim with an Architecture for Failure Detection. Agustín Caminero (1), Anthony Sulistio (2), Blanca Caminero (1), Carmen Carrión (1), and Rajkumar Buyya (2). (1) Dept. of Computing Systems, The University of Castilla La Mancha, Spain. (2) Grid Computing and Distributed Systems (GRIDS) Lab, The University of Melbourne, Australia.
ICPADS 2007 2 Agenda: Introduction; Contribution of Our Work; Design and Implementation; Experiments and Results; Conclusion and Further Work; Questions and Answers.
ICPADS 2007 3 Grid as Cyberinfrastructure for e-Science and e-Business Applications [Figure: applications submit work through Grid Resource Brokers, which consult the Grid Information Service and a database to reach resources R1 to RN.]
ICPADS 2007 4 Grids as Variable Environments Grids are variable environments: each organization sets its own policies and can join or leave a VO at any time. The number of resources can fluctuate significantly over time. Availability of resources may vary due to: –changes in network conditions, –partial failures, –the connection or disconnection of resources, … With so many resources in a Grid, resource or network failures are the rule rather than the exception.
ICPADS 2007 5 The Importance of Dealing with Failures Supporting fault tolerance is one of the main technical challenges in designing Grid environments. This is because production Grid systems must be able to tolerate resource failures, while at the same time effectively exploiting the resources in a scalable and transparent manner. Thus, both detection and recovery schemes must be an integral part of the Grid computing infrastructure.
ICPADS 2007 7 Grids as a Research Area To test new detection and recovery schemes in a real Grid environment, a lot of work is required to set up testbeds across many distributed sites. It is very difficult to produce performance evaluations in a repeatable and controlled manner, due to the inherent heterogeneity of the Grid. In addition, Grid testbeds are limited, and creating an adequately sized testbed is expensive and time-consuming. Therefore, it is easier to use simulation as a means of studying complex scenarios.
ICPADS 2007 8 Contribution of the Paper The existing Grid simulation tools include GridSim, SimGrid, OptorSim, and MicroGrid. None of them provides support for computing resource failures. To address this, we have incorporated a failure detection and recovery scheme into GridSim. This extension allows GridSim to simulate the failure of computing resources. Most of the parameters of this extension are configurable, allowing researchers to simulate a wide variety of failure patterns.
ICPADS 2007 9 Existing Resource Failure Detection Methods Computing resource failures can occur in hardware, operating systems, and Grid middleware components, as well as in network connections. There are two methods for detecting resource failures: Push: –Each monitored resource periodically sends a message to a central server indicating its availability. –If no message arrives within a certain time interval, the resource is considered to have failed. Pull: –The resource monitor sends polling requests to the monitored resources. –On receiving a request, each resource replies, so that the sender knows it is alive. –A missing reply indicates a resource failure.
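The two detection methods can be contrasted with a minimal sketch. This is not GridSim code: the class names `Resource`, `PushMonitor` and `PullMonitor`, the `timeout` parameter, and the use of an `alive` flag to stand in for real health are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical monitored resource; `alive` stands in for its real health."""
    name: str
    alive: bool = True

    def reply_to_poll(self) -> bool:
        # A failed resource never answers, which the monitor sees as a miss.
        return self.alive


@dataclass
class PushMonitor:
    """Push: resources report in; silence longer than `timeout` means failure."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, name: str, now: float) -> None:
        self.last_seen[name] = now

    def failed(self, now: float) -> list:
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


class PullMonitor:
    """Pull: the monitor polls; an unanswered poll means failure."""

    def __init__(self, resources):
        self.resources = list(resources)

    def poll(self) -> list:
        return sorted(r.name for r in self.resources if not r.reply_to_poll())


r1, r2 = Resource("R1"), Resource("R2")

# Push: both resources report at t=0, but only R1 reports again at t=4.
push = PushMonitor(timeout=5.0)
push.heartbeat("R1", now=0.0)
push.heartbeat("R2", now=0.0)
push.heartbeat("R1", now=4.0)
push_failed = push.failed(now=6.0)

# Pull: R2 goes down, so it does not answer the next poll.
pull = PullMonitor([r1, r2])
r2.alive = False
pull_failed = pull.poll()
```

In both cases only R2 is reported as failed. The difference is who initiates the traffic: with push the monitor waits for messages and needs a timeout, while with pull the monitor decides when to ask.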
ICPADS 2007 10 Designing Resource Failures We implement the pull method. Two types of entities perform polling: –The Grid Information Service (GIS) entity polls the resources registered with it. –Users poll the resources running their jobs.
ICPADS 2007 11 Scenario of failure detection (I)
ICPADS 2007 12 Scenario of failure detection (II)
ICPADS 2007 13 Scenario of failure detection (III)
ICPADS 2007 14 Main classes RegionalGISWithFailure: Keeps a list of available resources and polls them. Supports resource failures: –Decides how many resources will fail, when, for how long, and how many machines at each resource will fail. –These parameters are based on continuous, discrete or variate distributions, allowing a wide variety of failure patterns. GridUserFailure: Submits jobs to resources; polls the resources running its jobs; and, on the failure of a job, chooses another resource and re-submits the job.
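The division of labour between the GIS and the user can be sketched as follows. This is an illustration of the behaviour described on the slide, not the GridSim classes themselves: `RegionalGIS`, `GridUser`, their methods, and the rule of picking the first available resource are all hypothetical.

```python
class Resource:
    """Hypothetical resource; `alive` stands in for answering a poll."""

    def __init__(self, name):
        self.name = name
        self.alive = True


class RegionalGIS:
    """Tracks which of its registered resources currently answer polls."""

    def __init__(self, resources):
        self.registered = list(resources)
        self.available = list(resources)

    def poll(self):
        # Rebuild the list on every poll, so a recovered resource rejoins it.
        self.available = [r for r in self.registered if r.alive]
        return self.available


class GridUser:
    """Submits jobs, polls the resource running each job, resubmits on failure."""

    def __init__(self, gis):
        self.gis = gis
        self.history = []  # (job, resource name) per submission, retries included

    def submit(self, job):
        candidates = self.gis.poll()
        if not candidates:
            return None  # nothing available right now
        resource = candidates[0]  # assumption: take the first available one
        self.history.append((job, resource.name))
        return resource

    def check(self, job, resource):
        # Poll the resource running the job; move the job if it has failed.
        if resource.alive:
            return resource
        return self.submit(job)


r1, r2 = Resource("R1"), Resource("R2")
user = GridUser(RegionalGIS([r1, r2]))

running_on = user.submit("job_0")      # lands on R1
r1.alive = False                       # R1 fails while the job is running
running_on = user.check("job_0", running_on)
```

After the failure, `user.history` holds two entries for `job_0`, one per submission, which is the kind of per-user time-line a failure experiment would record.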
ICPADS 2007 15 Main classes (II) SpaceSharedWithFailure: Implements the AllocPolicyWithFailure interface. Behaves like FCFS (first come, first served). TimeSharedWithFailure: Implements the AllocPolicyWithFailure interface. Behaves like round-robin.
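The slide only states that the two policies behave like FCFS and round-robin. The sketch below contrasts those two generic disciplines on a single CPU, with no failures modelled; the function names and the `quantum` parameter are hypothetical and are not part of GridSim.

```python
from collections import deque


def fcfs_finish_times(lengths):
    """Space-shared on one CPU: each job runs to completion in arrival order."""
    clock, finish = 0, []
    for length in lengths:
        clock += length
        finish.append(clock)
    return finish


def round_robin_finish_times(lengths, quantum=1):
    """Time-shared on one CPU: jobs take turns, `quantum` time units per turn."""
    remaining = list(lengths)
    finish = [0] * len(lengths)
    queue = deque(i for i, length in enumerate(lengths) if length > 0)
    clock = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)      # not done yet: back of the queue
        else:
            finish[i] = clock    # done: record its completion time
    return finish


jobs = [3, 1, 2]                         # job lengths in time units
fcfs = fcfs_finish_times(jobs)           # [3, 4, 6]
rr = round_robin_finish_times(jobs)      # [6, 2, 5]
```

Both policies finish all work at time 6, but under round-robin the short job completes at time 2 instead of waiting behind the long one, at the cost of delaying the long job.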
ICPADS 2007 21 Experiment Parameters We simulated failures based on the hyper-exponential distribution, with a mean equal to half of the number of CPUs of the VO. Each user has 10 jobs, each of which would take 10 min to run at CERN. Users choose a resource to run each job from among the resources in their primary VO. If no resource is available, they choose a resource from their secondary VO.
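A hyper-exponential distribution is a mixture of exponential distributions: one branch is chosen at random, then a value is drawn from that branch. The sketch below shows how such a generator can be tuned so that its mean is half the number of CPUs in a VO. The VO size, the branch probabilities and the branch means are made-up values, since the slide gives only the overall mean.

```python
import random


def hyperexp_sample(rng, probs, means):
    """Choose branch i with probability probs[i], then draw Exp(mean=means[i])."""
    branch = rng.choices(range(len(means)), weights=probs, k=1)[0]
    return rng.expovariate(1.0 / means[branch])


def hyperexp_mean(probs, means):
    """Mean of the mixture: the probability-weighted sum of the branch means."""
    return sum(p * m for p, m in zip(probs, means))


cpus_in_vo = 40                    # hypothetical VO size
target_mean = cpus_in_vo / 2       # half the number of CPUs, as on the slide

# Hypothetical two-branch mixture whose mean equals target_mean.
probs = [0.5, 0.5]
means = [0.5 * target_mean, 1.5 * target_mean]

rng = random.Random(42)            # fixed seed keeps the run repeatable
samples = [hyperexp_sample(rng, probs, means) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
```

Here `hyperexp_mean(probs, means)` equals `target_mean` exactly, and `empirical_mean` approaches it as the number of samples grows. Compared with a single exponential of the same mean, the mixture produces more very small and very large values, which is why it suits bursty failure patterns.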
ICPADS 2007 22 Results: Availability and period of failure Fig 1. Availability of computing resources per VO. Fig 2. Failed machines per VO. VO_0 and VO_1 suffered a big drop in their available MIPS because the machines that failed had powerful CPUs.
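Why the loss of a powerful CPU matters more than the loss of a slow one can be seen with a small calculation. The machine ratings below are invented for illustration and are not taken from the experiment; `available_fraction` is a hypothetical helper.

```python
def available_fraction(machines):
    """machines: list of (mips, is_up) pairs. Returns working MIPS / total MIPS."""
    total = sum(mips for mips, _ in machines)
    working = sum(mips for mips, is_up in machines if is_up)
    return working / total


# Hypothetical VO with one fast machine and four slow ones (6000 MIPS in total).
one_slow_down = [(4000, True), (500, False), (500, True), (500, True), (500, True)]
one_fast_down = [(4000, False), (500, True), (500, True), (500, True), (500, True)]

after_slow_failure = available_fraction(one_slow_down)   # 5500 / 6000
after_fast_failure = available_fraction(one_fast_down)   # 2000 / 6000
```

In both cases exactly one machine has failed, yet the VO keeps about 92% of its MIPS in the first case and only about 33% in the second.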
ICPADS 2007 23 Results: Failed jobs for a user Fig 3. Time-line for User_0. Jobs submitted to different resources have different execution times
ICPADS 2007 25 Conclusion Grids are an active research topic, and simulation is essential for studying them. New features allow GridSim to support computing resource failures based on fully configurable mathematical patterns. Our experiment has shown that the new extension can be used to simulate the failure of computing resources. Support for network link failures and finite network buffers is left as future work. GridSim is available to download: www.gridbus.org/gridsim/
ICPADS 2007 26 Thank you. 5th December, 2007
ICPADS 2007 27 Acknowledgement This work has been jointly supported by the Spanish MEC and European Commission FEDER funds under grants Consolider Ingenio-2010 CSD2006-00046 and TIN2006-15516-C04-02; by JCCM under grants PBC-05-007-01, PBC-05-005-01 and José Castillejo. This research is also partially funded by the Australian Research Council and the Department of Education, Science and Training. We would like to thank Chee Shin Yeo and anonymous reviewers for their comments on the paper.