Extending GridSim with an Architecture for Failure Detection

Presentation transcript:

Extending GridSim with an Architecture for Failure Detection
Agustín Caminero (1), Anthony Sulistio (2), Blanca Caminero (1), Carmen Carrión (1), and Rajkumar Buyya (2)
(1) Dept. of Computing Systems, The University of Castilla La Mancha, Spain
(2) Grid Computing and Distributed Systems (GRIDS) Lab, The University of Melbourne, Australia

ICPADS Agenda
– Introduction
– Contribution of Our Work
– Design and Implementation
– Experiments and Results
– Conclusion and Further Work
– Questions and Answers

Grid as Cyberinfrastructure for e-Science and e-Business Applications
[Figure: diagram showing an Application, Grid Resource Brokers, a Grid Information Service, a database, and resources R1, R2, R3, R4, R5, R6, …, RN]

Grids as Variable Environments
Grids are variable environments: each organization sets its own policies and can join or leave a virtual organization (VO) at any time.
The number of resources can fluctuate significantly over time.
The availability of resources may vary due to:
– changes in network conditions,
– partial failures,
– the connection or disconnection of resources, …
With so many resources in a Grid, resource and network failures are the rule rather than the exception.

The Importance of Dealing with Failures
Supporting fault tolerance is one of the main technical challenges in designing Grid environments.
This is because production Grid systems must be able to tolerate resource failures while still exploiting the resources effectively, in a scalable and transparent manner.
Thus, both detection and recovery schemes must be an integral part of the Grid computing infrastructure.

Grid Resource Failure Scenario [Figure]

Grids as a Research Area
Testing new detection and recovery schemes in a real Grid environment, such as the scenario above, requires a lot of work to set up testbeds across many distributed sites.
Performance evaluations are very difficult to carry out in a repeatable and controlled manner because of the inherent heterogeneity of the Grid.
In addition, Grid testbeds are limited, and creating an adequately sized testbed is expensive and time-consuming.
Simulation is therefore an easier means of studying complex scenarios.

Contribution of the Paper
Existing Grid simulation tools include GridSim, SimGrid, OptorSim, and MicroGrid. None of them provides support for computing resource failures.
To address this, we have incorporated a failure detection and recovery scheme into GridSim.
This extension allows GridSim to simulate the failure of computing resources.
Most of the parameters of this extension are configurable, allowing researchers to simulate a wide variety of failure patterns.

Existing Resource Failure Detection Methods
Computing resource failures can occur in hardware, operating systems, and Grid middleware components, as well as in network connections.
There are two methods for detecting resource failures:
Push:
– Each monitored resource periodically sends a message to a central server indicating its availability.
– If no message arrives within a certain time interval, the resource is considered to have failed.
Pull:
– The resource monitor sends polling requests to the monitored resources.
– On receiving a request, each resource sends a reply, so that the monitor knows it is alive.
– A missed reply indicates a resource failure.
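The push method can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not GridSim code: the class HeartbeatMonitor, its method names, and the timeout parameter are all hypothetical, and simulated time is passed in explicitly as a number of seconds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical push-style (heartbeat) monitor; not part of the GridSim API. */
final class HeartbeatMonitor {
    private final Map<String, Double> lastHeartbeat = new HashMap<>();
    private final double timeout; // assumed: longest tolerated silence, in seconds

    HeartbeatMonitor(double timeout) {
        this.timeout = timeout;
    }

    /** Called whenever a resource pushes an "I am alive" message. */
    void onHeartbeat(String resource, double now) {
        lastHeartbeat.put(resource, now);
    }

    /** Returns, in sorted order, the resources that have been silent for longer than the timeout. */
    List<String> failedAt(double now) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Double> entry : lastHeartbeat.entrySet()) {
            if (now - entry.getValue() > timeout) {
                failed.add(entry.getKey());
            }
        }
        Collections.sort(failed);
        return failed;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(5.0);
        monitor.onHeartbeat("R1", 0.0);
        monitor.onHeartbeat("R2", 0.0);
        monitor.onHeartbeat("R1", 10.0); // R2 stays silent after time 0
        System.out.println(monitor.failedAt(12.0));
    }
}
```

The central server only needs the time of the last message from each resource, which is why the push method is cheap for the monitor but depends on choosing a sensible timeout.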

Designing Resource Failures
We implement the pull method. Two types of entities perform polling:
– The Grid Information Service (GIS) entity polls the resources registered with it.
– Users poll the resources running their jobs.
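A minimal Java sketch of pull-based polling follows. It is an assumption-laden illustration rather than the GridSim implementation: PullMonitor and its methods are hypothetical names, and the reply of a resource is modelled as a simple predicate. The same class can stand in for either poller, since a GIS would monitor its registered resources and a user would monitor only the resources running its jobs.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical pull-style monitor; not part of the GridSim API. */
final class PullMonitor {
    private final Set<String> alive = new LinkedHashSet<>();

    PullMonitor(Collection<String> monitoredResources) {
        alive.addAll(monitoredResources);
    }

    /**
     * Sends one polling request to every resource still believed alive.
     * A resource that does not reply is declared failed and is no longer polled.
     */
    List<String> pollOnce(Predicate<String> repliesToPoll) {
        List<String> failed = new ArrayList<>();
        for (String resource : alive) {
            if (!repliesToPoll.test(resource)) {
                failed.add(resource);
            }
        }
        alive.removeAll(failed);
        return failed;
    }

    Set<String> alive() {
        return Collections.unmodifiableSet(alive);
    }

    public static void main(String[] args) {
        // A GIS-like poller watches every registered resource.
        PullMonitor gis = new PullMonitor(List.of("R1", "R2", "R3"));
        // A user-like poller watches only the resource running its job.
        PullMonitor user = new PullMonitor(List.of("R2"));

        Predicate<String> r2IsDown = resource -> !resource.equals("R2");
        System.out.println("GIS detected: " + gis.pollOnce(r2IsDown));
        System.out.println("User detected: " + user.pollOnce(r2IsDown));
    }
}
```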

Scenario of failure detection (I) [Figure]

Scenario of failure detection (II) [Figure]

Scenario of failure detection (III) [Figure]

Main classes
RegionalGISWithFailure:
– Keeps a list of available resources and polls them.
– Supports resource failures: it decides how many resources will fail, when, for how long, and how many machines at each resource will fail.
– These parameters are drawn from continuous, discrete, or variate distributions, allowing a wide variety of failure patterns.
GridUserFailure:
– Submits jobs to resources.
– Polls the resources running its jobs.
– On the failure of a job, chooses another resource and resubmits the job.
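The four failure decisions listed above (how many resources, when, for how long, and how many machines) can be sketched as follows. This is a hypothetical illustration, not the RegionalGISWithFailure code: FailurePlanner, Failure, and the use of DoubleSupplier to stand in for a configurable distribution are all assumptions, as is the rule of rounding and clamping the sampled values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleSupplier;

/** Hypothetical failure planner; not part of the GridSim API. */
final class FailurePlanner {
    /** One planned failure: which resource, when, for how long, how many machines. */
    static final class Failure {
        final int resourceIndex;
        final double startTime;
        final double duration;
        final int machines;

        Failure(int resourceIndex, double startTime, double duration, int machines) {
            this.resourceIndex = resourceIndex;
            this.startTime = startTime;
            this.duration = duration;
            this.machines = machines;
        }
    }

    /** Each DoubleSupplier stands in for one configurable distribution. */
    static List<Failure> plan(int[] machinesPerResource,
                              DoubleSupplier numResources,
                              DoubleSupplier startTime,
                              DoubleSupplier duration,
                              DoubleSupplier numMachines,
                              Random rng) {
        int total = machinesPerResource.length;
        int failing = clamp((int) Math.round(numResources.getAsDouble()), 0, total);

        // Pick which resources fail by shuffling their indices.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            order.add(i);
        }
        Collections.shuffle(order, rng);

        List<Failure> failures = new ArrayList<>();
        for (int k = 0; k < failing; k++) {
            int r = order.get(k);
            // At least one machine fails, and never more than the resource owns.
            int machines = clamp((int) Math.round(numMachines.getAsDouble()), 1, machinesPerResource[r]);
            failures.add(new Failure(r,
                    Math.max(0.0, startTime.getAsDouble()),
                    Math.max(0.0, duration.getAsDouble()),
                    machines));
        }
        return failures;
    }

    private static int clamp(int value, int low, int high) {
        return Math.max(low, Math.min(high, value));
    }
}
```

Passing a different supplier for each parameter is one way to obtain the variety of failure patterns the slide mentions: a constant supplier gives a fixed pattern, while a supplier backed by a random distribution gives a stochastic one.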

Main classes (II)
SpaceSharedWithFailure: implements the AllocPolicyWithFailure interface and behaves like FCFS (first come, first served).
TimeSharedWithFailure: implements the AllocPolicyWithFailure interface and behaves like round-robin.
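The difference between the two behaviours can be seen in a small Java sketch. It does not reproduce the GridSim classes: AllocPolicySketch is a hypothetical name, and the sketch assumes a single CPU, jobs that all arrive at time 0, job lengths in millions of instructions (MI), and a fixed round-robin time slice (quantum).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical single-CPU comparison of FCFS and round-robin; not GridSim code. */
final class AllocPolicySketch {
    /** Space-shared behaviour: jobs run to completion one after another. */
    static double[] fcfs(double[] lengthsMi, double mips) {
        double[] finish = new double[lengthsMi.length];
        double clock = 0.0;
        for (int i = 0; i < lengthsMi.length; i++) {
            clock += lengthsMi[i] / mips;
            finish[i] = clock;
        }
        return finish;
    }

    /** Time-shared behaviour: jobs take turns, each running for at most one quantum. */
    static double[] roundRobin(double[] lengthsMi, double mips, double quantum) {
        if (quantum <= 0.0) {
            throw new IllegalArgumentException("quantum must be positive");
        }
        int n = lengthsMi.length;
        double[] remaining = new double[n]; // seconds of CPU time still needed
        double[] finish = new double[n];
        Deque<Integer> queue = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            remaining[i] = lengthsMi[i] / mips;
            queue.add(i);
        }
        double clock = 0.0;
        while (!queue.isEmpty()) {
            int job = queue.poll();
            double slice = Math.min(quantum, remaining[job]);
            clock += slice;
            remaining[job] -= slice;
            if (remaining[job] > 1e-12) {
                queue.add(job); // not done yet: back to the end of the queue
            } else {
                finish[job] = clock;
            }
        }
        return finish;
    }

    public static void main(String[] args) {
        double[] jobs = {300.0, 100.0}; // MI
        System.out.println(java.util.Arrays.toString(fcfs(jobs, 100.0)));
        System.out.println(java.util.Arrays.toString(roundRobin(jobs, 100.0, 1.0)));
    }
}
```

With a long job queued ahead of a short one, FCFS makes the short job wait for the long one, whereas round-robin lets the short job finish first at the cost of delaying the long job.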

GIS and user failure detection algorithms
Fig 1. User detection algorithm.
Fig 2. GIS detection algorithm.

EU DataGrid Testbed and Grid Modelling [Figure]


Resource Characteristics
Resource (Location)      # Nodes  CPU Rating*  Policy        VO
RAL (UK)                 41       49,000       Space-shared  2
Imperial College (UK)    52       62,000       Space-shared  2
NorduGrid (Norway)       17       20,000       Space-shared  3
NIKHEF (Netherlands)     18       21,000       Space-shared  3
Lyon (France)            12       14,000       Space-shared  0
CERN (Switzerland)       59       70,000       Space-shared  0
Milano (Italy)           5        70,000       Space-shared  1
Torino (Italy)           2        3,000        Time-shared   1
Rome (Italy)             5        6,000        Space-shared  1
Padova (Italy)           1        1,000        Time-shared   4
Bologna (Italy)          67       80,000       Space-shared  4
* CPU Rating is measured in MIPS.

Users Characteristics
Resource (Location)      # Users  Primary VO  Secondary VO
RAL (UK)                 12       2           4
Imperial College (UK)    16       2           0
NorduGrid (Norway)       4        3           2
NIKHEF (Netherlands)     8        3           4
Lyon (France)            12       0           1
CERN (Switzerland)       24       0           1
Milano (Italy)           4        1           2
Torino (Italy)           2        1           3
Rome (Italy)             4        1           4
Padova (Italy)           2        4           3
Bologna (Italy)          12       4           0

Experiment Parameters
We simulated failures using a hyper-exponential distribution with a mean equal to half the number of CPUs in the VO.
Each user has 10 jobs, each of which would take 10 minutes to run at CERN.
For each job, users choose a resource from those in their primary VO. If none is available, they choose a resource from their secondary VO.
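Both ideas on this slide can be illustrated with a short Java sketch. Everything in it is an assumption made for illustration: ExperimentSketch and its methods are hypothetical, the hyper-exponential is taken to be a two-branch mixture of exponentials with equal branch probability and branch means of 0.5 and 1.5 times the target (so that the overall mean equals the target), and a user is assumed to pick uniformly at random among the available resources of a VO. The slides do not state the branch parameters or the selection rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.Set;

/** Hypothetical sketch of the experiment parameters; not GridSim code. */
final class ExperimentSketch {
    /** Draws one value from an exponential distribution with the given mean. */
    static double sampleExponential(Random rng, double mean) {
        // nextDouble() is in [0, 1), so the argument of log is in (0, 1].
        return -mean * Math.log(1.0 - rng.nextDouble());
    }

    /**
     * Two-branch hyper-exponential with overall mean equal to targetMean:
     * 0.5 * (0.5 * targetMean) + 0.5 * (1.5 * targetMean) = targetMean.
     */
    static double sampleWithMean(Random rng, double targetMean) {
        double branchMean = rng.nextDouble() < 0.5 ? 0.5 * targetMean : 1.5 * targetMean;
        return sampleExponential(rng, branchMean);
    }

    /** Prefers an available resource in the primary VO, then in the secondary VO. */
    static Optional<String> chooseResource(List<String> primaryVo,
                                           List<String> secondaryVo,
                                           Set<String> available,
                                           Random rng) {
        Optional<String> choice = pickAvailable(primaryVo, available, rng);
        return choice.isPresent() ? choice : pickAvailable(secondaryVo, available, rng);
    }

    private static Optional<String> pickAvailable(List<String> vo, Set<String> available, Random rng) {
        List<String> candidates = new ArrayList<>();
        for (String resource : vo) {
            if (available.contains(resource)) {
                candidates.add(resource);
            }
        }
        if (candidates.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(candidates.get(rng.nextInt(candidates.size())));
    }
}
```

For example, a user whose primary VO contains CERN and Lyon would be given Lyon if CERN is down, and would fall back to a resource from the secondary VO only when both are unavailable.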

Results: Availability and period of failure
Fig 1. Availability of computing resources per VO.
Fig 2. Failed machines per VO.
VO_0 and VO_1 suffered a big drop in their available MIPS because powerful CPUs failed.

Results: Failed jobs for a user
Fig 3. Time-line for User_0.
Jobs submitted to different resources have different execution times.

Results: Resource failure statistics
[Table with columns: VO, # CPUs, # Failed CPUs, # Jobs, # Failed jobs, MFT*. Surviving entries: VO_0, MFT 2.76 hours; VO_…, MFT 9.5 hours.]
* MFT: mean failure time.

Conclusion
Grids are currently an active research topic, and simulation is essential to that research.
The new features allow GridSim to support computing resource failures based on fully configurable mathematical patterns.
Our experiment has shown that the new extension can be used to simulate the failure of computing resources.
Support for network link failures and finite network buffers is left as future work.
GridSim is available to download:

Thank you. 5th December, 2007

Acknowledgement
This work has been jointly supported by the Spanish MEC and European Commission FEDER funds under grants Consolider Ingenio-2010 CSD and TIN C04-02; by JCCM under grants PBC , PBC and José Castillejo.
This research is also partially funded by the Australian Research Council and the Department of Education, Science and Training.
We would like to thank Chee Shin Yeo and the anonymous reviewers for their comments on the paper.