An Overview of Berkeley Lab’s Linux Checkpoint/Restart (BLCR) Paul Hargrove with Jason Duell and Eric.

Slides:



Advertisements
Similar presentations
OPERATING SYSTEMS Threads
Advertisements

Chapter 4: Threads. Overview Multithreading Models Threading Issues Pthreads Windows XP Threads.
3.5 Interprocess Communication Many operating systems provide mechanisms for interprocess communication (IPC) –Processes must communicate with one another.
3.5 Interprocess Communication
Threads CSCI 444/544 Operating Systems Fall 2008.
Operating System Concepts with Java – 7 th Edition, Nov 15, 2006 Silberschatz, Galvin and Gagne ©2007 Chapter 4: Threads.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 3: Processes.
Process Concept An operating system executes a variety of programs
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Condor Overview Bill Hoagland. Condor Workload management system for compute-intensive jobs Harnesses collection of dedicated or non-dedicated hardware.
Experiences Deploying Xrootd at RAL Chris Brew (RAL)
Wind River VxWorks Presentation
SSI-OSCAR A Single System Image for OSCAR Clusters Geoffroy Vallée INRIA – PARIS project team COSET-1 June 26th, 2004.
Yavor Todorov. Introduction How it works OS level checkpointing Application level checkpointing CPR for parallel programing CPR functionality References.
Prof. Heon Y. Yeom Distributed Computing Systems Lab. Seoul National University FT-MPICH : Providing fault tolerance for MPI parallel applications.
Self Adaptivity in Grid Computing Reporter : Po - Jen Lo Sathish S. Vadhiyar and Jack J. Dongarra.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 4: Threads.
1 Lecture 4: Threads Operating System Fall Contents Overview: Processes & Threads Benefits of Threads Thread State and Operations User Thread.
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 5: Threads Overview Multithreading Models Threading Issues Pthreads Solaris.
Silberschatz, Galvin and Gagne ©2011Operating System Concepts Essentials – 8 th Edition Chapter 4: Threads.
Tools and Utilities for parallel and serial codes in ENEA-GRID environment CRESCO Project: Salvatore Raia SubProject I.2 C.R. ENEA-Portici. 11/12/2007.
1 Apache. 2 Module - Apache ♦ Overview This module focuses on configuring and customizing Apache web server. Apache is a commonly used Hypertext Transfer.
Process Introspection: A Checkpoint Mechanism for High Performance Heterogeneous Distributed Systems. University of Virginia. Author: Adam J. Ferrari.
Process Management Working Group Process Management “Meatball” Dallas November 28, 2001.
Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx.
SUMA: A Scientific Metacomputer Cardinale, Yudith Figueira, Carlos Hernández, Emilio Baquero, Eduardo Berbín, Luis Bouza, Roberto Gamess, Eric García,
Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting October 10-11, 2002.
G-JavaMPI: A Grid Middleware for Distributed Java Computing with MPI Binding and Process Migration Supports Lin Chen, Cho-Li Wang, Francis C. M. Lau and.
Crystal Ball Panel ORNL Heterogeneous Distributed Computing Research Al Geist ORNL March 6, 2003 SOS 7.
Evaluation of Agent Teamwork High Performance Distributed Computing Middleware. Solomon Lane Agent Teamwork Research Assistant October 2006 – March 2007.
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
Derek Wright Computer Sciences Department University of Wisconsin-Madison MPI Scheduling in Condor: An.
Process Management & Monitoring WG Quarterly Report January 25, 2005.
© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Cruz: Application-Transparent Distributed.
Scalable Systems Software for Terascale Computer Centers Coordinator: Al Geist Participating Organizations ORNL ANL LBNL.
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance Chao Wang, Frank Mueller North Carolina State University Christian Engelmann, Stephen.
Welcome to the PVFS BOF! Rob Ross, Rob Latham, Neill Miller Argonne National Laboratory Walt Ligon, Phil Carns Clemson University.
Cluster Software Overview
How to for compiling and running MPI Programs. Prepared by Kiriti Venkat.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition Chapter 4: Threads Modified from the slides of the text book. TY, Sept 2010.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 4: Threads.
Process Manager Specification Rusty Lusk 1/15/04.
FTOP: A library for fault tolerance in a cluster R. Badrinath Rakesh Gupta Nisheeth Shrivastava.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Process Management & Monitoring WG Quarterly Report August 26, 2004.
1 Chapter 5: Threads Overview Multithreading Models & Issues Read Chapter 5 pages
Apache Ignite Compute Grid Research Corey Pentasuglia.
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Introduction to threads
TensorFlow– A system for large-scale machine learning
Chapter 5: Threads Overview Multithreading Models Threading Issues
Chapter 5: Threads Overview Multithreading Models Threading Issues
Chapter 4 Threads.
Chapter 4: Threads.
Chapter 5: Threads Overview Multithreading Models Threading Issues
Chapter 4: Threads.
Chapter 4: Threads.
OPERATING SYSTEMS Threads
Modified by H. Schulzrinne 02/15/10 Chapter 4: Threads.
Chapter 4: Threads.
Chapter 4: Threads.
MPJ: A Java-based Parallel Computing System
Chapter 4: Threads.
New cluster capabilities through Checkpoint/ Restart
Thomas E. Anderson, Brian N. Bershad,
Chapter 4: Threads.
Chapter 5: Threads Overview Multithreading Models Threading Issues
Chapter 4: Threads.
CS Introduction to Operating Systems
Presentation transcript:

An Overview of Berkeley Lab’s Linux Checkpoint/Restart (BLCR) Paul Hargrove with Jason Duell and Eric Roman January 13 th, 2004 (Based on slides by Jason Duell)

Linux checkpoint/restart Outline Project goals System design Entension interface Current status Future work

Uses of Checkpoint/Restart Gang scheduling ● No queue drain for maintenance, policy change ● Higher utilization and/or more flexible scheduling Process migration ● Save job if node failure imminent ● Pack jobs for optimal network performance Periodic backup ● Not our main focus ● Application can always do more efficiently ● But may be useful for systems with long jobs, fast I/O, and/or high node failure rates

Implementation Strategies Application-based checkpointing ● Efficient: save only needed data as step completes ● Good for fault tolerance: bad for preemption ● Requires per-application effort by programmer Library-based checkpointing ● Portable across operating systems ● Transparent to application (but may require relink, etc.) ● Can't (generally) restore all resources (ex: process IDs) ● Can’t checkpoint shell scripts Kernel-based checkpointing ● Not portable, and harder to implement ● Can save/restore all resources

Design Goals Target: parallel scientific applications ● MPI is a must ● But allow support for other programs/models, too ● Esoteric features (ptrace, Unix domain sockets) have lower implementation priority Implemention: Linux kernel module ● lower barrier to adoption than kernel patch ● Allows upgrades, bug fixes, without reboot

Design Goals II Provide ‘toolkit’ for distributed C/R ● We provide single node checkpoint/restart ● We don’t support distributed operating system features No built-in support for TCP sockets, bproc namespaces, etc. ● We provide hooks to allow parallel runtimes/libraries to implement distributed checkpoint/restart So the MPI library needs to know about checkpointing, but user applications don’t

Extension Interface Callback functions ● Registered at startup (or as needed) ● Run at checkpoint time, then resume at restart/continue ● Handle parallel coordination and/or unsupported objects Two types of callbacks ● Signal handler context ● Run with same PID (LinuxThreads); no thread-safety needed ● But callback limited to calling signal-safe functions (small subset of POSIX) ● Separate thread context ● Can call any function ● But code needs to be thread-safe, and separate PID (LinuxThreads) Critical sections ● Use to protect uncheckpointable sections of code

Current Status Support LAM-MPI jobs ● Both TCP and Myrinet supported ● Infrastructure in place for Infiniband, Quadrics ● Process migration: currently must restart whole job Simple semantics for open files ● Reopen and seek to original position ● Must be regular files (pipe support coming soon) ● Files must exist in same location on filesystem Single- and multi-threaded processes ● checkpoint of ‘mpirun’ checkpoints whole MPI job ● Will support process groups, sessions in future ● Restore original PID

Current status II Work with wide variety of 2.4 kernels ● kernel.org versions onwards ● RedHat: 7.2 through 9 ● SuSE: 7.2 through 9.0 ● autoconf feature probing, so support of custom patched kernels likely to be automatic ● we’ll maintain 2.4 support once 2.6 comes out Support both new and old pthreads ● I.e., old “LinuxThreads”, plus new 2.6 pthreads (backported to 2.4 by Red Hat)

Future Work Support for sessions & process groups ● Including pipes, mmaps, etc., shared within group ● Full restoration of parent/child tree, with original PIDs More semantics for files ● Allow checksum of file, with restart error if it has changed ● Allow saving contents of file (restore either clobbers, or opens anonymously) ● Support files that are not open at checkpoint time, but are specified as being part of the checkpoint Laundry list of other resources to support ● Page 4 of “Design and Implementation” paper

Future Work II Integration with parallel job systems ● Funded to work within suite from DOE Scalable systems software SciDAC. Work is in progress. ● Possibility of OpenPBS, PBSPro support ● Interested in others (LSF, SGE, SLURM, etc.) More MPI implementations ● MPICH 2 support anticipated ● Vendor support (Quadrics)? ● LAM/MPI support for partial/live migration

Conclusion Papers (available from website): ● “Design and Implementation of BLCR”: high-level system design, including description of user API ● “Requirements for Linux Checkpoint/Restart”: exhaustive list of Unix features we will support (or not). ● “A Survey of Checkpoint/Restart Implementations”: focusing on open source versions that run on Linux ● “The LAM/MPI Checkpoint/Restart Framework: System- Initiated Checkpointing”: implementation with LAM/MPI