Checkpointing Mechanism for the Grid Environment. K Sajadah, G Terstyanszky, S Winter, P. Kacsuk. University of Westminster – www.cpc.wmin.ac.uk

Presentation transcript:

University of Westminster – Checkpointing Mechanism for the Grid Environment K Sajadah, G Terstyanszky, S Winter, P. Kacsuk University of Westminster

Checkpointing of Parallel Applications in a Grid Environment
The Grid Environment
• Nature of the Grid environment:
  – Generic, heterogeneous, and dynamic, with many unreliable resources that expose it to failures.
• Solution:
  – Fault-tolerant mechanisms should ensure successful execution of applications.

Fault Tolerant Solutions
• Retrying
  – When a job fails, it is re-executed a certain number of times.
  – The expected job completion time can become very long.
• Replication
  – Replicas of a job are executed on different Grid resources simultaneously.
  – It requires extra processing power.
• Checkpointing
  – It stores a snapshot of the application state and uses it to restart the execution in case of failure.
  – It is very efficient in environments where the failure rate is high.

Checkpointing
• Transparent Checkpointing
  – The system, not the programmer, orchestrates the checkpointing process.
  – Message synchronisation is performed.
  – The checkpointing and recovery process is transparent to the programmer.
• Non-Transparent Checkpointing
  – The mechanism provides support for checkpointing through run-time libraries.
  – The programmer can specify the data that should be included in the checkpoint file.
  – The approach is not transparent to the programmer.

Challenges in Checkpointing
• When to take the checkpoint
• How to synchronise (or how to minimise inter-process communication)
• What kind of information to store at the checkpoint
• Where to store the checkpoint's information
• How to restore the execution after a fault

Checkpointing (2)
• Performance constraints in existing solutions:
  – Overheads due to synchronisation of messages.
  – Checkpoint intervals are either user-defined with no regular pattern or periodic.
• Proposed solution:
  – Take checkpoints at the best possible pre-defined intervals.
  – Minimise (or optimise) inter-process communication as much as possible.

Checkpointing (3)
• Inter-process communication can cause inconsistent checkpoints due to lost messages or orphan messages.
  – To achieve a globally consistent checkpoint, synchronisation should be performed.
• Synchronisation introduces extra communication among processes.
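The distinction between lost and orphan messages can be made concrete with a small sketch. It assumes, purely for illustration, that every send and receive carries a timestamp on a shared clock and that each process has one checkpoint time; the `Message` type and the function names are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    sent_at: float      # time the sender issued the message
    received_at: float  # time the receiver consumed the message

def classify(msg, checkpoint_time):
    """Classify one message against per-process checkpoint times."""
    sent_before = msg.sent_at <= checkpoint_time[msg.sender]
    received_before = msg.received_at <= checkpoint_time[msg.receiver]
    if sent_before and not received_before:
        return "lost"        # send is recorded, receive is not
    if received_before and not sent_before:
        return "orphan"      # receive is recorded, send is not
    return "consistent"

def is_consistent(messages, checkpoint_time):
    """True when no message is lost or orphaned by this set of checkpoints."""
    return all(classify(m, checkpoint_time) == "consistent" for m in messages)

# Hypothetical checkpoint times and message log.
checkpoint_time = {"P1": 10.0, "P2": 12.0}
log = [
    Message("P2", "P1", 3.0, 4.0),    # both events before the checkpoints
    Message("P1", "P2", 9.0, 13.0),   # sent before, received after
    Message("P1", "P2", 11.0, 11.5),  # sent after, received before
]
for m in log:
    print(m.sender, "->", m.receiver, classify(m, checkpoint_time))
```

With uncoordinated checkpoint times such as these, the second message is lost and the third is an orphan, which is the situation that synchronisation is meant to rule out.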

Approaches Used
• Combination of:
  – First Order Approximation.
  – Natural Synchronisation Points.
• First Order Approximation
  – Calculates the optimal checkpointing interval.
  – Based on the Poisson process: failures occur at random with a constant failure rate.

First Order Approximation
• The optimal checkpoint interval $T_c$ is:
  – $T_c = \sqrt{2 T_s T_f}$, where $T_s$ is the time required to save information at a checkpoint and $T_f$ is the mean time between failures, with $T_f = T_h / k$.
• The following data are needed:
  – The number of hours the program will run on the machines ($T_h$).
  – The known failure rate during that time ($k$).
  – The time required to save information at a checkpoint ($T_s$).
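As a minimal sketch of the First Order Approximation, the function below computes T_c from T_h, k and T_s. The function name and the sample numbers are illustrative assumptions, not values from the presentation. k is treated here as the number of failures observed over T_h, so that T_f = T_h / k is a time.

```python
import math

def optimal_checkpoint_interval(t_h, k, t_s):
    """First order approximation: T_c = sqrt(2 * T_s * T_f), with T_f = T_h / k.

    All times must use the same unit. k is assumed to be the number of
    failures observed during t_h (an interpretation, not stated in the slides).
    """
    if t_h <= 0 or k <= 0 or t_s <= 0:
        raise ValueError("t_h, k and t_s must all be positive")
    t_f = t_h / k  # mean time between failures
    return math.sqrt(2 * t_s * t_f)

# Hypothetical inputs: 100 hours of run time, 2 failures, 0.25 hours per save.
t_c = optimal_checkpoint_interval(t_h=100.0, k=2, t_s=0.25)
print(f"checkpoint every {t_c:.2f} hours")
```

Because T_c grows only with the square root of T_s and T_f, halving the mean time between failures shortens the interval by a factor of about 1.41 rather than 2.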

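First Order Approximation (2)
[Figure: execution timeline from t = 0 showing checkpoint intervals T_c and save times T_s, with the point of failure, the restarting point, and the rerun time t_r marked.]
T_c = checkpoint interval; T_s = time to save a checkpoint; t_r = rerun time of a failed application.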

First Order Approximation (3)
• Using the PROVE toolset, we can measure both the execution time and the checkpointing time of an application.
• Nagios can be used to determine the failure rate of Grid resources.

Natural Synchronisation Points
• Examples of natural synchronisation points:
  – Barriers.
  – Top or bottom of a main loop.
  – Collective operations (broadcast, gather, scatter, etc.).
• There is no inter-process communication at these points.
  – Therefore, there is no need to be concerned with the state of the communication channels or possible in-transit messages.
  – This eliminates the overhead of the synchronisation process otherwise involved in checkpointing.

Natural Synchronisation Points (2)
[Figure, two panels, each showing processes P1, P2 and P3: (a) application execution with processes interacting; (b) coordinated checkpoint, waiting for in-transit messages.]

Natural Synchronisation Points (4)
[Figure, two panels, each showing processes P1, P2 and P3: (a) coordinated checkpoint, logging in-transit messages; (b) checkpointing at natural synchronisation points, with checkpoints Ckpt1 and Ckpt2 taken at N.S.P 1 and N.S.P 2.]

New Checkpointing Approach
• Using the First Order Approximation only:
  – Involves synchronisation of messages and capturing in-transit messages.
• Checkpointing at natural synchronisation points only:
  – May not be very effective because there is no pattern in their occurrence.

New Checkpointing Approach (2)
• Use a combination of both the Natural Synchronisation Points and the First Order Approximation.
• Take checkpoints at the natural synchronisation points that are closest to the optimal checkpoint intervals.
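The idea of snapping each optimal checkpoint time to its nearest natural synchronisation point can be sketched as follows. This is an offline illustration only: it assumes the synchronisation times are known in advance and that the optimal times are fixed multiples of the interval, whereas a run-time mechanism would not know future synchronisation points. The function name and the sample times are hypothetical.

```python
def pick_checkpoints(sync_points, interval, run_time):
    """For each multiple of `interval` up to `run_time`, pick the nearest sync point."""
    if not sync_points:
        return []
    chosen = []
    target = interval
    while target <= run_time:
        # Nearest natural synchronisation point to this optimal time.
        nearest = min(sync_points, key=lambda t: abs(t - target))
        if nearest not in chosen:  # one sync point may serve two targets
            chosen.append(nearest)
        target += interval
    return chosen

# Hypothetical synchronisation times in minutes, with an 8 minute interval.
sync_points = [3, 7, 10, 15, 18, 25, 31]
print(pick_checkpoints(sync_points, interval=8, run_time=32))
```

Here the optimal times 8, 16, 24 and 32 map to the synchronisation points at 7, 15, 25 and 31, so no message synchronisation is needed for any of the four checkpoints.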

Choosing Checkpoint Intervals
[Figure: choosing appropriate checkpointing intervals. Timeline showing the First Order Approximation times (Op1 to Op6), the natural synchronisation points (Ns1 to Ns10), and the critical regions, marked with braces.]

Choosing Checkpoint Intervals (2)
• The decision to select a checkpoint is based on:
  – the optimal checkpoint interval,
  – the natural synchronisation points, and
  – the critical region.
• The checkpointing process is triggered by signals sent to the coordinated process whenever synchronisation points are encountered.

The Checkpointing Process
• When the coordinated process receives a signal, it checks whether the signal falls within the critical region.
  – If so, a checkpoint is taken and the clock is reset.
  – If not, no checkpointing is performed.
• If no natural synchronisation point is met within the critical region, a checkpoint has to be forced at the end of the critical region.
  – In such cases, the checkpointing mechanism performs synchronisation to ensure there are no lost or orphan messages.
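The decision procedure above can be sketched as a small simulation. It is an illustration under stated assumptions, not the authors' implementation: the critical region is taken to be the window from `interval - half_width` to `interval + half_width` after the last checkpoint, the clock restarts at every checkpoint (natural or forced), and the function name and synchronisation times are hypothetical.

```python
def simulate_checkpoints(sync_points, interval, half_width, run_time):
    """Return (time, kind) pairs, where kind is "natural" or "forced"."""
    checkpoints = []
    last = 0  # time of the last checkpoint; the clock restarts from here

    def force_until(now):
        nonlocal last
        # A critical region that closed before `now` without any sync
        # signal ends with a forced (synchronised) checkpoint.
        while last + interval + half_width < now:
            last += interval + half_width
            checkpoints.append((last, "forced"))

    for t in sorted(sync_points):
        if t > run_time:
            break
        force_until(t)
        # After force_until, t is not past the end of the current region,
        # so it is inside the region exactly when it is past the start.
        if t >= last + interval - half_width:
            checkpoints.append((t, "natural"))
            last = t
        # Otherwise the signal arrived too early and is ignored.
    force_until(run_time)
    return checkpoints

# Hypothetical sync times in minutes; 8 minute interval, 2 minute half width.
result = simulate_checkpoints([7, 12, 15, 24, 40, 47], 8, 2, run_time=50)
for time, kind in result:
    print(time, kind)
```

In this run the signal at minute 12 is ignored because it arrives before the critical region opens, and the gap between minutes 24 and 40 produces one forced checkpoint at minute 34.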

The Testbed
• The MadCity traffic simulation tool was used.
  – It simulates traffic on a road network and shows how individual vehicles behave on roads and at junctions.
• The MadCity traffic simulator can be parallelised using PGRADE.

The Testbed (2)
[Figure: proposed checkpointing solution. Legend: First Order Approximation (Op), natural synchronisation points (Ns), forced synchronisation points (Fs), critical region (marked with braces, labelled 4 min), saved checkpoints. The timeline shows Op1 to Op6, Ns1 to Ns9, and Fs1.]

The Testbed (3)
• Through the First Order Approximation, the calculated optimal checkpoint interval was 8 minutes.
• A critical region within 2 minutes of the optimal checkpoint interval was defined.
• Checkpoints were taken at: Ns1, Ns2, Ns5, Fs1, Ns6, Ns9.
• Overall average time between checkpoints: 8.2 minutes.

Conclusion
• The proposed checkpointing mechanism provides a more efficient way to save checkpoint images.
  – It minimises the need to synchronise messages.
  – It ensures that the average checkpointing interval is close to the optimal checkpointing interval defined by the First Order Approximation.

Future Work
• Integrate the checkpointing solution into PGRADE to provide an efficient fault-tolerant solution for applications executed as Grid workflows.
• Provide an efficient and reliable storage mechanism.

Questions