Office of Science U.S. Department of Energy Evaluating Checkpoint/Restart on the IBM SP Jay Srinivasan


1 Office of Science U.S. Department of Energy Evaluating Checkpoint/Restart on the IBM SP Jay Srinivasan jay@nersc.gov

2 Outline
- Motivation for Checkpoint/Restart (CPR)
- CPR considerations
- CPR on the IBM SP
- Evaluation of CPR on the IBM SP
- Results
- Putting CPR into production

3 Motivation for Checkpoint/Restart
- Large HPC systems typically run large parallel or long-running jobs.
- Periodically saving the running state of such jobs means that an interruption does not cost too much work.
- Decrease the impact of single-node failures on the overall usability of the machine.
- Perform maintenance on the system with minimal impact on running jobs.
- Better utilization of resources.

4 Checkpoint/Restart considerations
- User initiated (not from within the program)
- System (administrator) initiated
- HPC systems are usually used through a batch system (such as LoadLeveler)
- Both serial and parallel jobs are run on the machine
- Parallel jobs use message passing, and we should be able to checkpoint these as well
- The CPR mechanism can be used from inside the code as well as externally

5 Checkpoint/Restart Users
- System administrators and operators: a checkpoint is used to clear a node for maintenance work.
- End users of HPC systems (scientists, students, researchers)
- Programmers writing code that uses the CPR mechanism internally (or utility programs that use CPR functionality for the system)

6 Checkpoint/Restart mechanism
For parallel programs:
- Stop-and-discard mechanism (K. Z. Meth and W. G. Tuel)
- On receiving a checkpoint request, the task stops sending messages and is checkpointed.
- In-transit message information is saved, so we know which messages have been sent but not acknowledged.
- These messages are resent on restart.
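The stop-and-discard idea on this slide can be illustrated with a toy, single-process sketch. It is not the IBM implementation: the `Task` and `Channel` classes, the sequence numbers, and the `deliver_one` helper are hypothetical names chosen for this example, and the receiver's inbox is assumed to survive the checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical link holding messages that are still in transit."""
    in_transit: list = field(default_factory=list)

    def discard(self):
        # At checkpoint time, in-flight messages are simply dropped.
        self.in_transit.clear()


@dataclass
class Task:
    """Hypothetical sending task that remembers unacknowledged messages."""
    name: str
    next_seq: int = 0
    unacked: dict = field(default_factory=dict)  # seq -> payload

    def send(self, channel, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload  # remembered until acknowledged
        channel.in_transit.append((seq, payload))
        return seq

    def ack(self, seq):
        self.unacked.pop(seq, None)

    def checkpoint(self):
        # Saved state includes which messages were sent but not acknowledged.
        return {"next_seq": self.next_seq, "unacked": dict(self.unacked)}

    @classmethod
    def restart(cls, name, saved, channel):
        task = cls(name, saved["next_seq"], dict(saved["unacked"]))
        # Resend every message that was never acknowledged.
        for seq in sorted(task.unacked):
            channel.in_transit.append((seq, task.unacked[seq]))
        return task


def deliver_one(channel, sender, inbox):
    """Deliver the oldest in-transit message and acknowledge it."""
    seq, payload = channel.in_transit.pop(0)
    inbox[seq] = payload  # keyed by seq, so a resend cannot duplicate
    sender.ack(seq)


def demo():
    channel, inbox = Channel(), {}
    sender = Task("task0")
    for payload in ("a", "b", "c"):
        sender.send(channel, payload)
    deliver_one(channel, sender, inbox)  # only "a" arrives before the checkpoint
    saved = sender.checkpoint()          # task stops sending and is checkpointed
    channel.discard()                    # "b" and "c" are lost in transit
    restarted = Task.restart("task0", saved, channel)
    while channel.in_transit:
        deliver_one(channel, restarted, inbox)
    return inbox, restarted.unacked
```

Calling `demo()` shows the two messages that were in flight at checkpoint time arriving after the restart, because the saved state recorded them as unacknowledged.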

7 Checkpoint/Restart methods
- Utility program as part of system software
- CPR API via system calls (ll_init_ckpt, etc.)
- Batch system software can use the API to implement the CPR mechanism.

8 CPR on the IBM SP
Done via the LL command llckpt.
Once a process is checkpointed:
1. Process can continue running.
2. Process is killed.
Within LL:
1. Job can be deleted from the queuing system.
2. Job can be resubmitted for consideration by the scheduler.
3. Job can be resubmitted and “held”.

9 Checkpoint/Restart on the IBM SP
Job command file keywords

To be able to checkpoint an LL job:
#@ checkpoint = [yes|no|interval]
#@ ckpt_time_limit = [time to checkpoint]
#@ ckpt_dir = [path to checkpoint files]
#@ ckpt_file = [basename of checkpoint files]

To be able to restart an LL job:
#@ checkpoint = [yes|no|interval]
#@ ckpt_dir = [path to checkpoint files]
#@ ckpt_file = [basename of checkpoint files]
#@ restart_from_ckpt = [yes|no]
#@ restart_on_same_nodes = [yes|no]
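As an illustration of how the keywords on this slide fit together, here is a minimal Python sketch that renders the checkpoint-related lines of a job command file. The function name `ckpt_keywords` and the example path and basename are hypothetical; only the keyword names and their yes/no/interval values come from the slide, and a real job command file needs further keywords that are not shown here.

```python
VALID_CHECKPOINT = ("yes", "no", "interval")


def _yes_no(flag):
    return "yes" if flag else "no"


def ckpt_keywords(ckpt_dir, ckpt_file, checkpoint="yes", ckpt_time_limit=None,
                  restart_from_ckpt=None, restart_on_same_nodes=None):
    """Render the checkpoint/restart keyword lines (hypothetical helper)."""
    if checkpoint not in VALID_CHECKPOINT:
        raise ValueError(f"checkpoint must be one of {VALID_CHECKPOINT}")
    lines = [f"#@ checkpoint = {checkpoint}"]
    if ckpt_time_limit is not None:
        # Value is passed through unchanged; its format is not given on the slide.
        lines.append(f"#@ ckpt_time_limit = {ckpt_time_limit}")
    lines.append(f"#@ ckpt_dir = {ckpt_dir}")
    lines.append(f"#@ ckpt_file = {ckpt_file}")
    # The restart keywords are only emitted when the caller sets them.
    if restart_from_ckpt is not None:
        lines.append(f"#@ restart_from_ckpt = {_yes_no(restart_from_ckpt)}")
    if restart_on_same_nodes is not None:
        lines.append(f"#@ restart_on_same_nodes = {_yes_no(restart_on_same_nodes)}")
    return "\n".join(lines) + "\n"


# Example: keywords for restarting from an earlier checkpoint.
# "/scratch/ckpt" and "myjob" are placeholder values.
restart_block = ckpt_keywords("/scratch/ckpt", "myjob",
                              restart_from_ckpt=True,
                              restart_on_same_nodes=False)
```

The same `ckpt_dir` and `ckpt_file` values would be used for the checkpointing run and for the restart run, so that the restart finds the files the checkpoint wrote.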

10 CPR Evaluation on the IBM SP
We evaluated the use of CPR with LoadLeveler on the SP using both a 4-node development system (dev2) and the 416-node production system (seaborg).
We evaluated:
(a) System requirements
(b) Configuration changes
(c) Viability/Ease of use

11 CPR Evaluation on the IBM SP
Two kinds of programs:
- Serial code that allocates a certain amount of memory (an integer array) and initializes the array
- MPI code that starts a certain number of processes, allocates a certain amount of memory, and does simple message passing
User checkpoint:
- Submit a job using llsubmit, let it run, use llckpt -u to checkpoint, and resume the job using llhold -r
- User can also use llckpt -k and resubmit the job
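The two user-checkpoint workflows on this slide can be written down as command sequences. The Python sketch below only builds the argument lists and does not run anything. The function name, the job command file name, and the job step identifier are hypothetical. Placing the job step as the last argument, and resubmitting with llsubmit after `llckpt -k`, are assumptions rather than details stated on the slide.

```python
def user_checkpoint_commands(job_command_file, job_step, kill=False):
    """Return the command sequence for a user-initiated checkpoint.

    kill=False: checkpoint with -u, then resume with llhold -r.
    kill=True:  checkpoint with -k, then resubmit the job
                (assumed here to be done with llsubmit).
    """
    commands = [["llsubmit", job_command_file]]  # submit and let the job run
    if kill:
        commands.append(["llckpt", "-k", job_step])
        commands.append(["llsubmit", job_command_file])
    else:
        commands.append(["llckpt", "-u", job_step])
        commands.append(["llhold", "-r", job_step])
    return commands


# Placeholder names, for illustration only.
hold_flow = user_checkpoint_commands("job.cmd", "JOBSTEP_ID")
kill_flow = user_checkpoint_commands("job.cmd", "JOBSTEP_ID", kill=True)
```

Each inner list could be handed to `subprocess.run` on a machine where LoadLeveler is installed; keeping the sequence as data makes it easy to print or review before anything is executed.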

12 Results – Dev2 (each task uses approximately 200 MB of memory)

13 Results – Dev2 (each task uses approximately 200 MB of memory)

14 Results – Dev2 (serial job)

15 Results – Dev2 (serial job)

16 Results – Dev2 (each task uses approximately 200 MB of memory)

17 Results – Dev2 (each task uses approximately 200 MB of memory)

18 Results – Seaborg (16 tasks per node; each task uses approximately 260 MB of memory)

19 Results – Seaborg (each task uses approximately 260 MB of memory)

20 Using CPR
- What about restart? Times to restart are on the order of the time to checkpoint.
- Disk usage and user quotas matter (checkpoint files are owned by the job owner).
- The #@ restart = yes keyword is implied if checkpoint = yes.
- Priority issues: checkpointed and held jobs retain their priority.
- Not all jobs can be checkpointed. The list of exceptions is documented in the LL manual.
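Because checkpoint files count against the job owner's quota, a rough size estimate is useful before enabling CPR. The sketch below ASSUMES one checkpoint file per task, each about the size of that task's memory footprint; the slides do not state actual file sizes, and both function names are hypothetical.

```python
def estimate_ckpt_mb(tasks, mb_per_task):
    """Rough checkpoint footprint in MB.

    ASSUMPTION: one file per task, about as large as the task's memory.
    """
    return tasks * mb_per_task


def fits_quota(tasks, mb_per_task, quota_mb, used_mb=0):
    """True if the estimated checkpoint files fit in the remaining quota."""
    return used_mb + estimate_ckpt_mb(tasks, mb_per_task) <= quota_mb


# Seaborg figures from the results slides: 16 tasks per node, about 260 MB each.
per_node_mb = estimate_ckpt_mb(16, 260)
```

Under this assumption, one fully loaded Seaborg node (16 tasks at about 260 MB each) would need roughly 4160 MB of checkpoint space.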

21 Acknowledgements
- NERSC SP Systems Staff (N. Cardo, D. Paul, T. Stone)
- IBM Staff (S. Burrow)
- NERSC USG Staff (D. Skinner)
- NERSC ASG Staff (A. Wong)
