University of Westminster – www.cpc.wmin.ac.uk Checkpointing Mechanism for the Grid Environment K Sajadah, G Terstyanszky, S Winter, P. Kacsuk University.


1 University of Westminster – www.cpc.wmin.ac.uk Checkpointing Mechanism for the Grid Environment K Sajadah, G Terstyanszky, S Winter, P. Kacsuk University of Westminster

2 Checkpointing of Parallel Applications in a Grid Environment The Grid Environment  Nature of the Grid environment: –Generic, heterogeneous, and dynamic, with many unreliable resources, which exposes it to failures.  Solution: –Fault tolerant mechanisms should ensure successful execution of applications.

3 Checkpointing of Parallel Applications in a Grid Environment Fault Tolerant Solutions  Retrying –When a job fails, it is re-executed a certain number of times. –The job's expected completion time can become very long.  Replication –Replicas of a job are executed on different Grid resources simultaneously. –It requires extra processing power.  Checkpointing –It stores a snapshot of an application's state and uses it to restart the execution in case of failure. –It is very efficient in environments where the failure rate is high.

4 Checkpointing of Parallel Applications in a Grid Environment Checkpointing  Transparent Checkpointing –The mechanism, not the programmer, orchestrates the checkpointing process. –Message synchronisation is performed. –The checkpointing and recovery process is transparent to the programmer.  Non-Transparent Checkpointing –The mechanism provides support for checkpointing through run-time libraries. –The programmer can specify the data that should be included in the checkpoint file. –The approach is not transparent to the programmer.

5 Challenges in Checkpointing  When to take the checkpoint  How to synchronise (or how to minimise inter-process communication)  What kind of info to store at the checkpoint  Where to store the checkpoint’s info  How to restore the execution after a fault

6 Checkpointing of Parallel Applications in a Grid Environment Checkpointing (2)  Performance constraints in existing solutions: –Overheads due to synchronisation of messages. –Checkpoint intervals are either user-defined with no regular pattern or are periodic.  Proposed solution: –Take checkpoints at the best possible pre-defined intervals. –Minimise (or optimise) inter-process communication as much as possible.

7 Checkpointing of Parallel Applications in a Grid Environment Checkpointing (3)  Inter-process communication can cause inconsistent checkpoints due to lost messages or orphan messages. –To achieve a globally consistent checkpoint, synchronization should be performed.  Synchronization introduces extra communication among processes.
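
To make the two failure modes on this slide concrete, here is a minimal sketch that classifies messages against a set of per-process checkpoint times. It assumes the usual definitions: a lost (in-transit) message was sent before the sender's checkpoint but received after the receiver's, and an orphan message was received before the receiver's checkpoint although its send happened after the sender's. The `Message` type, the `classify` function, and the sample timings are hypothetical and are not taken from the presentation.

```python
from typing import NamedTuple


class Message(NamedTuple):
    sender: str
    sent_at: float       # time on the sender's clock
    receiver: str
    received_at: float   # time on the receiver's clock


def classify(messages, checkpoint_at):
    """Split messages into (orphans, lost) for the given checkpoint times.

    checkpoint_at maps each process name to the time of its local checkpoint.
    Messages that fall entirely before or entirely after the checkpoints are
    consistent and are not reported.
    """
    orphans, lost = [], []
    for m in messages:
        sent_before = m.sent_at <= checkpoint_at[m.sender]
        received_before = m.received_at <= checkpoint_at[m.receiver]
        if received_before and not sent_before:
            orphans.append(m)   # receive is recorded, send is not
        elif sent_before and not received_before:
            lost.append(m)      # send is recorded, receive is not
    return orphans, lost


# Hypothetical scenario: P1 checkpoints at t = 5, P2 at t = 3.
checkpoint_at = {"P1": 5.0, "P2": 3.0}
messages = [
    Message("P1", 4.0, "P2", 6.0),   # sent before P1's checkpoint, received after P2's
    Message("P2", 4.0, "P1", 4.5),   # sent after P2's checkpoint, received before P1's
    Message("P1", 1.0, "P2", 2.0),   # entirely before both checkpoints
]
orphans, lost = classify(messages, checkpoint_at)
```

A set of checkpoints for which both lists are empty needs no message logging or extra coordination, which is the property the synchronisation step is meant to guarantee.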

8 Checkpointing of Parallel Applications in a Grid Environment Approaches Used  Combination of: –First Order Approximation. –Natural Synchronisation Points.  First Order Approximation –Calculates the optimal checkpointing interval. –Based on the Poisson process: failures occur at random with a given failure rate.

9 Checkpointing of Parallel Applications in a Grid Environment First Order Approximation  The optimal checkpoint interval $T_c$ is: –$T_c = \sqrt{2 T_s T_f}$, where: $T_s$ is the time required to save information at a checkpoint; $T_f$ is the mean time between failures, with $T_f = T_h / k$.  The following data are needed: –The number of hours the program will run on the machines ($T_h$). –The known failure rate during that time ($k$), that is, the number of failures over $T_h$. –The time required to save information at a checkpoint ($T_s$).
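
As a worked illustration of the formula on this slide, the following sketch computes the optimal interval from the three inputs listed. The function name and the sample numbers are assumptions chosen for illustration, not measurements from the presentation; all times must use the same unit (hours here).

```python
import math


def optimal_checkpoint_interval(t_save, t_hours, failures):
    """First order approximation: T_c = sqrt(2 * T_s * T_f), with T_f = T_h / k.

    t_save   -- T_s, time to save one checkpoint (hours)
    t_hours  -- T_h, planned run time on the machines (hours)
    failures -- k, number of failures expected during T_h
    """
    if t_save <= 0 or t_hours <= 0 or failures <= 0:
        raise ValueError("all inputs must be positive")
    t_f = t_hours / failures            # mean time between failures
    return math.sqrt(2.0 * t_save * t_f)


# Hypothetical inputs: a 100 hour run, 4 expected failures,
# and 0.02 hours (72 seconds) to save a checkpoint.
t_c = optimal_checkpoint_interval(t_save=0.02, t_hours=100.0, failures=4)
```

With these assumed inputs the mean time between failures is 25 hours and the resulting interval is about one hour. Because the interval grows with the square root of both quantities, a save that takes four times longer only doubles the recommended spacing between checkpoints.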

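10 First Order Approximation (2) [Figure: execution timeline starting at t = 0, divided into checkpoint intervals $T_c$, each followed by a save time $T_s$; the point of failure, the restarting point, and the rerun time $t_r$ are marked.] $T_c$ = checkpoint interval; $T_s$ = time to save a checkpoint; $t_r$ = rerun time of a failed application.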

11 Checkpointing of Parallel Applications in a Grid Environment First Order Approximation(3)  Using the PROVE toolset, we can measure both the execution time and the checkpointing time of an application.  Nagios can be used to determine the failure rate of Grid resources.

12 Checkpointing of Parallel Applications in a Grid Environment Natural Synchronisation Points  Examples of natural synchronization points: –Barriers. –Top or bottom of a main loop. –Collective operations (broadcast, gather, scatter, etc.)  No interprocess communication at these points. –Therefore, no need to be concerned with the state of the communication channels or possible in-transit message. –Eliminate the overhead incurred due to the synchronization process involved during checkpointing.
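
The idea of checkpointing at a barrier at the bottom of a main loop can be sketched with Python threads standing in for the parallel processes. This is only an analogy under stated assumptions: the presentation concerns message-passing processes on the Grid, whereas here `threading.Barrier` plays the role of the barrier, `state` is a stand-in for each process's data, and the `checkpoints` dictionary is a hypothetical substitute for checkpoint files.

```python
import threading

N_WORKERS = 3
N_STEPS = 4

barrier = threading.Barrier(N_WORKERS)
checkpoints = {}                 # (rank, step) -> saved local state
checkpoints_lock = threading.Lock()


def worker(rank):
    state = 0
    for step in range(N_STEPS):
        state += rank + step     # stand-in for one iteration of real work
        # Natural synchronisation point: every worker has finished this
        # iteration, so nothing from the iteration is still in flight.
        barrier.wait()
        # Save the local state without any extra coordination messages.
        with checkpoints_lock:
            checkpoints[(rank, step)] = state


threads = [threading.Thread(target=worker, args=(r,)) for r in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker saves its own state right after the barrier, so the saved states for a given step all describe the same point in the computation. A real implementation would decide at each such point whether a checkpoint is actually due rather than saving every time.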

13 Checkpointing of Parallel Applications in a Grid Environment Natural Synchronisation Points (2) [Figure: two diagrams of processes P1, P2, P3. Panel 1: application execution with processes interacting. Panel 2: coordinated checkpoint, waiting for in-transit messages.]

14 Checkpointing of Parallel Applications in a Grid Environment Natural Synchronisation Points (4) [Figure: two diagrams of processes P1, P2, P3. Panel 1: coordinated checkpoint, logging in-transit messages. Panel 2: checkpointing at natural synchronisation points, with labels N.S.P 1, N.S.P 2, Ckpt1, Ckpt2.]

15 Checkpointing of Parallel Applications in a Grid Environment New Checkpointing Approach  Using First Order Approximation only: –Involves synchronisation of messages and capturing in-transit messages.  Checkpointing at natural synchronisation points only: –May not be very effective because these points do not occur in a regular pattern.

16 Checkpointing of Parallel Applications in a Grid Environment New Checkpointing Approach(2)  Use a combination of both the Natural Synchronisation Points and the First Order Approximation.  Take checkpoints at natural synchronization points which are closest to the optimal checkpoint intervals.

17 Checkpointing of Parallel Applications in a Grid Environment Choosing Checkpoint Intervals [Figure: choosing appropriate checkpointing intervals. Timeline showing optimal points Op1 to Op6 and natural synchronisation points Ns1 to Ns10. Legend: First Order approximation (Op), Natural Synchronisation pts (Ns), Critical Region { }.]

18 Checkpointing of Parallel Applications in a Grid Environment Choosing Checkpoint Intervals(2)  The decision to select a checkpoint is based on: –Optimal checkpoint interval, –Natural synchronisation points and –Critical Region.  The checkpointing process is triggered by signals sent to the coordinated process whenever synchronization points are encountered.

19 Checkpointing of Parallel Applications in a Grid Environment The Checkpointing Process  When the coordinated process receives a signal, it checks whether the signal falls within the critical region. –If so, a checkpoint is taken and the clock is reset. –If not, no checkpointing is performed.  If no natural synchronization point is met within the critical region, a checkpoint is forced at the end of the critical region. –In such cases, the checkpointing mechanism performs synchronization to ensure there are no lost or orphan messages.
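
The decision rule on this slide can be sketched as a small offline simulation. It assumes the critical region is a window of `half_width` on either side of the optimal interval, measured on a clock that restarts at each checkpoint; the function name, the window definition, and the sample synchronisation times are hypothetical. The message synchronisation that a forced checkpoint requires is not modelled.

```python
def plan_checkpoints(sync_points, t_opt, half_width, end_time):
    """Return a list of (time, kind) checkpoints, kind being 'natural' or 'forced'.

    sync_points -- times at which natural synchronisation points occur
    t_opt       -- optimal checkpoint interval from the first order approximation
    half_width  -- assumed half-width of the critical region around t_opt
    end_time    -- time at which the application finishes
    """
    if not 0 < half_width < t_opt:
        raise ValueError("need 0 < half_width < t_opt")
    points = sorted(sync_points)
    checkpoints = []
    last = 0.0          # time of the last checkpoint (clock reset point)
    i = 0
    while last + t_opt - half_width < end_time:
        region_start = last + t_opt - half_width
        region_end = last + t_opt + half_width
        # Signals that arrive before the critical region are ignored.
        while i < len(points) and points[i] < region_start:
            i += 1
        if i < len(points) and points[i] <= min(region_end, end_time):
            # First natural synchronisation point inside the region.
            checkpoints.append((points[i], "natural"))
            last = points[i]
            i += 1
        elif region_end <= end_time:
            # No natural point was met: force one at the end of the region.
            checkpoints.append((region_end, "forced"))
            last = region_end
        else:
            break
    return checkpoints


# Hypothetical run: optimal interval 8 min, region of 2 min either side.
plan = plan_checkpoints([3, 7, 12, 16, 20, 35, 41], t_opt=8, half_width=2, end_time=45)
```

In this made-up run the points at 3, 12, and 20 minutes arrive outside a critical region and are skipped, and because no natural point falls between 22 and 26 minutes a checkpoint is forced at 26.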

20 Checkpointing of Parallel Applications in a Grid Environment The TestBed  The MadCity traffic simulation tool was used. –It simulates traffic on a road network and shows how individual vehicles behave on roads and at junctions.  The MadCity traffic simulator can be parallelised using PGRADE.

21 Checkpointing of Parallel Applications in a Grid Environment The Testbed(2) [Figure: proposed checkpointing solution. Timeline showing optimal points Op1 to Op6, natural synchronisation points Ns1 to Ns9, forced synchronisation point Fs1, and the saved checkpoints; a span of 4 min is marked. Legend: First Order approximation (Op), Natural Synchronisation pts (Ns), Forced Synchronisation pts (Fs), Critical Region { }.]

22 Checkpointing of Parallel Applications in a Grid Environment The Testbed(3)  Using the First Order Approximation, the calculated optimal checkpoint interval was 8 minutes.  A critical region with a range of 2 minutes around the optimal checkpoint interval was defined.  Checkpoints were taken at: Ns1, Ns2, Ns5, Fs1, Ns6, Ns9.  Overall average time between checkpoints: 8.2 minutes.

23 Checkpointing of Parallel Applications in a Grid Environment Conclusion  The proposed checkpointing mechanism provides a more efficient way to save checkpoint images. –It minimises the need to synchronise messages. –It keeps the average checkpointing interval close to the optimal checkpointing interval defined by the First Order Approximation.

24 Checkpointing of Parallel Applications in a Grid Environment Future Work  Integrate the checkpointing solution into PGRADE to provide an efficient fault tolerant solution for applications executed as Grid workflows.  Provide an efficient and reliable storage mechanism.

25 Checkpointing of Parallel Applications in a Grid Environment Questions

