Presentation is loading. Please wait.

Presentation is loading. Please wait.

New cluster capabilities through Checkpoint/ Restart

Similar presentations


Presentation on theme: "New cluster capabilities through Checkpoint/ Restart"— Presentation transcript:

1 New cluster capabilities through Checkpoint/ Restart
M. Rodríguez-Pascual, J.A Moríñigo, R. Mayo-García CIEMAT 1

2 Outline What have we done? How have we done it?
New cluster capabilities through Checkpoint/ Restart

3 What have we done? Provide support for DMTCP checkpoint library in Slurm Explore the functionalities of Slurm that get a boost with this new capabilities Create some new cool tools

4 Background: checkpoint/restart
Checkpoint/restart: being able to save the status of a running job and restart it somewhere else Why is it useful/important? Straightforward answer: provide fault tolerance, basic in highly parallel systems with long running jobs (exascale era) Not so straightforward usage: take scheduling decisions on running jobs. Interrupt a running job and continue it in the future Move a job from a particular resource to another

5 Software Stack Resource Manager: Slurm Open source Wide adoption
Checkpoint library: DMTCP User-level Support for MPI and OpenMP Robust and reliable MPI: mvapich2 Any should work

6 How have we done it? Implement Slurm API for checkpoint libraries
Modify Slurm to provide some required new functionalities Technical stuff related to access to information Find some bugs related to C/R management in Slurm Very difficult to detect before, as C/R wasn't really available Explore how to use C/R on different ways Existing scheduling algorithms Design of new ones Implementation of new tools

7 New cluster capabilities through Checkpoint/ Restart
Job Preemption New Scheduling algorithm Job migration for cluster maintenance and administration

8 Job Preemption Basic idea: when a job with high priority comes in, any job with lower priority gets temporarily out of the way How to do it: 2 job queues in Slurm: high priority and low priority Configure Slurm to use CHECKPOINT to “get the jobs out of the way” Jobs are restarted in any resource, not on the one they were originally running Conceptually pretty straightforward, but required quite a lot of developments!

9 New scheduling algorithms using C/R
The movement of running tasks to different places is now available Still analyzing how to use this to increase performance and/or reduce energy consumption “Dummy” prototype already implemented: all the technical problems are solved, now fighting with the maths!!

10 Tool for system management
Scenario: you have to update whatever in a node, but there is a job running there. Even marking the node as “DRAINING”, it might take hours or days before the job ends. Our idea: checkpoint that job, move it somewhere else, work on the node, and put it back as “AVAILABLE” Already implemented and working fine :)

11 Try it, it's free! Installation guide and testbed

12 Thanks for your attention!!
Questions?


Download ppt "New cluster capabilities through Checkpoint/ Restart"

Similar presentations


Ads by Google