Office of Science, U.S. Department of Energy
Evaluating Checkpoint/Restart on the IBM SP
Jay Srinivasan


Outline
- Motivation for Checkpoint/Restart (CPR)
- CPR considerations
- CPR on the IBM SP
- Evaluation of CPR on the IBM SP
- Results
- Putting CPR into production

Motivation for Checkpoint/Restart
- Large HPC systems typically run large parallel or long-running jobs
- Save the running state of such jobs periodically, so that an interruption does not lose too much work
- Decrease the impact of single-node failures on the overall usability of the machine
- Perform maintenance on the system with minimal impact on running jobs
- Better utilization of resources

Checkpoint/Restart considerations
- User initiated (not from within the program)
- System (administrator) initiated
- Use of HPC systems is usually via a batch system (such as LoadLeveler)
- Both serial and parallel jobs run on the machine
- Parallel jobs use message passing, and we should be able to checkpoint these as well
- The CPR mechanism may be used internally by the code as well as externally

Checkpoint/Restart users
- System administrators and operators: checkpoint used to clear a node for maintenance work
- End users of HPC systems (scientists, students, researchers)
- Programmers writing code that uses the CPR mechanism internally (or utility programs that use the system's CPR functionality)

Checkpoint/Restart mechanism for parallel programs
- Stop-and-discard mechanism (K. Z. Meth and W. G. Tuel)
- On receiving a checkpoint request, each task stops sending messages and is checkpointed
- In-transit message information is saved, so we know which messages have been sent but not acknowledged
- These messages are resent on restart
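The bookkeeping behind stop-and-discard can be sketched as follows. This is an illustrative simulation, not IBM's implementation: each task keeps copies of messages that have been sent but not yet acknowledged, saves that log with its checkpoint, and resends the logged messages on restart. All names and buffer sizes here are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_UNACKED 64
#define MAX_LEN     256

typedef struct {
    char payload[MAX_LEN];
    int  len;
} Message;

typedef struct {
    Message unacked[MAX_UNACKED];  /* sent but not yet acknowledged */
    int     count;
} SendLog;

/* Record a message at send time; it stays logged until acknowledged. */
static int log_send(SendLog *log, const char *data, int len)
{
    if (log->count >= MAX_UNACKED || len > MAX_LEN)
        return -1;
    memcpy(log->unacked[log->count].payload, data, (size_t)len);
    log->unacked[log->count].len = len;
    log->count++;
    return 0;
}

/* Drop the oldest logged message once its acknowledgement arrives. */
static void ack_oldest(SendLog *log)
{
    if (log->count > 0) {
        memmove(&log->unacked[0], &log->unacked[1],
                (size_t)(log->count - 1) * sizeof(Message));
        log->count--;
    }
}

/* At checkpoint time the SendLog is written out with the task state;
 * on restart, every still-logged message is resent. Returns the
 * number of messages resent. */
static int resend_on_restart(const SendLog *log,
                             void (*send_fn)(const char *, int))
{
    for (int i = 0; i < log->count; i++)
        send_fn(log->unacked[i].payload, log->unacked[i].len);
    return log->count;
}
```

Acknowledged messages are discarded from the log, so only genuinely in-transit messages are replayed after a restart.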

Checkpoint/Restart methods
- Utility program as part of system software
- CPR API via system calls (ll_init_ckpt, etc.)
- Batch system software can use the API to implement a CPR mechanism

CPR on the IBM SP
Done via an LL command (llckpt). Once a process is checkpointed, either:
1. The process can continue running, or
2. The process is killed.
Within LL:
1. The job can be deleted from the queuing system,
2. The job can be resubmitted for consideration by the scheduler, or
3. The job can be resubmitted and "held".

Checkpoint/Restart on the IBM SP
Job command file keywords. To be able to checkpoint an LL job:
- checkpoint = [yes|no|interval]
- ckpt_time_limit = [time to checkpoint]
- ckpt_dir = [path to checkpoint files]
- ckpt_file = [basename of checkpoint files]
To be able to restart an LL job:
- checkpoint = [yes|no|interval]
- ckpt_dir = [path to checkpoint files]
- ckpt_file = [basename of checkpoint files]
- restart_from_ckpt = [yes|no]
- restart_on_same_nodes = [yes|no]
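A minimal sketch of a LoadLeveler job command file using the checkpoint keywords above. The class name, node counts, paths, and executable name are hypothetical placeholders, not values from the original evaluation.

```
# @ job_type        = parallel
# @ class           = regular
# @ node            = 2
# @ tasks_per_node  = 16
# @ checkpoint      = interval
# @ ckpt_time_limit = 00:10:00
# @ ckpt_dir        = /scratch/ckpt
# @ ckpt_file       = myjob.ckpt
# @ output          = myjob.out
# @ error           = myjob.err
# @ queue
./my_mpi_program
```

To restart from a previous checkpoint, a similar file would point ckpt_dir and ckpt_file at the existing checkpoint files and add restart_from_ckpt = yes (and optionally restart_on_same_nodes = yes).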

CPR Evaluation on the IBM SP
We evaluated the use of C/R with LoadLeveler on the SP using both a 4-node development system (dev2) and the 416-node production system (seaborg). We evaluated:
(a) System requirements
(b) Configuration changes
(c) Viability/ease of use

CPR Evaluation on the IBM SP
Two kinds of test programs:
- Serial code that allocates a certain amount of memory (an integer array) and initializes it
- MPI code that starts a certain number of processes, allocates a certain amount of memory per task, and does simple message passing
User checkpoint:
- Submit a job using llsubmit, let it run, use llckpt -u to checkpoint, and resume the job using llhold -r
- The user can also use llckpt -k and resubmit the job
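The serial test program can be sketched along these lines: allocate an integer array of a requested size and fill it, so there is real memory state for llckpt to capture and a checksum to verify after restart. The array size, fill pattern, and function names are illustrative choices, not taken from the original evaluation codes.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate `n` ints and fill them with a recognizable pattern. */
static int *make_state(size_t n)
{
    int *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        a[i] = (int)(i % 1000);
    return a;
}

/* Checksum the array; comparing this value before checkpoint and
 * after restart confirms the memory image was restored intact. */
static long checksum(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

A real test job would size the array to the per-task memory target (e.g. about 200 MB on dev2) and then idle or loop so the checkpoint can be taken while the state is resident.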

Results – Dev2 (charts omitted): parallel jobs in which each task uses approximately 200 MB of memory, and a serial job.

Results – Seaborg (charts omitted): 16 tasks per node; each task uses approximately 260 MB of memory.

Using CPR
- What about restart? Times to restart are on the order of the time to checkpoint.
- Disk usage and user quotas matter (checkpoint files are owned by the job owner).
- The restart = yes keyword is implied if checkpoint = yes.
- Priority issues: checkpointed and held jobs retain their priority.
- Not all jobs can be checkpointed; the list of exceptions is documented in the LL manual.

Acknowledgements
- NERSC SP Systems Staff (N. Cardo, D. Paul, T. Stone)
- IBM Staff (S. Burrow)
- NERSC USG Staff (D. Skinner)
- NERSC ASG Staff (A. Wong)