New cluster capabilities through Checkpoint/Restart


New cluster capabilities through Checkpoint/Restart
M. Rodríguez-Pascual, J.A. Moríñigo, R. Mayo-García (CIEMAT)

Outline
- What have we done?
- How have we done it?
- New cluster capabilities through Checkpoint/Restart

What have we done?
- Added support for the DMTCP checkpoint library in Slurm
- Explored the Slurm functionalities that benefit from this new capability
- Created some new tools

Background: checkpoint/restart
Checkpoint/restart (C/R): saving the state of a running job so that it can be restarted later, possibly somewhere else.
Why is it useful/important?
- Straightforward answer: fault tolerance, essential in highly parallel systems running long jobs (the exascale era)
- Less obvious use: making scheduling decisions on running jobs
  - Interrupt a running job and resume it later
  - Move a job from one resource to another
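The generic DMTCP workflow the talk builds on can be sketched as below. This is a minimal sketch, assuming DMTCP is installed on the cluster; `./my_app` is a hypothetical application binary, and the commands are built as strings and printed so the sequence is clear without a live system.

```shell
#!/bin/sh
# Sketch of the basic DMTCP workflow (outside Slurm).
# './my_app' is a hypothetical binary; the dmtcp_* tools must be installed.
# Commands are assembled as strings and echoed rather than executed.

LAUNCH_CMD="dmtcp_launch ./my_app"       # run the app under DMTCP control
CKPT_CMD="dmtcp_command --checkpoint"    # ask the coordinator to write a checkpoint image
RESTART_CMD="./dmtcp_restart_script.sh"  # restart script generated at checkpoint time

echo "$LAUNCH_CMD"
echo "$CKPT_CMD"
echo "$RESTART_CMD"
```

The restart step can run on a different node, which is exactly what the scheduling use cases below exploit.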

Software stack
- Resource manager: Slurm — open source, widely adopted
- Checkpoint library: DMTCP — user-level, supports MPI and OpenMP, robust and reliable
- MPI: MVAPICH2 (any implementation should work)

How have we done it?
- Implemented the Slurm API for checkpoint libraries
- Modified Slurm to provide some required new functionality (technical work related to information access)
- Found and fixed some bugs in Slurm's C/R management — very difficult to detect before, since C/R wasn't really usable
- Explored different ways of using C/R: existing scheduling algorithms, design of new ones, implementation of new tools
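On the user side, the checkpoint API the slide refers to surfaces through `scontrol` sub-commands, which exist in Slurm releases that still ship the checkpoint plugin interface (it was removed in later versions). The job id below is a placeholder, and the commands are printed rather than executed, since they need a live cluster.

```shell
#!/bin/sh
# scontrol sub-commands from Slurm's checkpoint plugin interface
# (available in older Slurm releases; removed in recent ones).
# Job id 1234 is a placeholder; commands are echoed, not executed.
JOBID=1234

CREATE_CMD="scontrol checkpoint create $JOBID"    # save an image, job keeps running
VACATE_CMD="scontrol checkpoint vacate $JOBID"    # save an image, then stop the job
RESTART_CMD="scontrol checkpoint restart $JOBID"  # restart from the saved image

echo "$CREATE_CMD"
echo "$VACATE_CMD"
echo "$RESTART_CMD"
```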

New cluster capabilities through Checkpoint/Restart
- Job preemption
- New scheduling algorithms
- Job migration for cluster maintenance and administration

Job preemption
Basic idea: when a high-priority job arrives, any lower-priority job temporarily gets out of the way.
How to do it:
- Two job queues in Slurm: high priority and low priority
- Configure Slurm to use CHECKPOINT to "get jobs out of the way"
- Preempted jobs can be restarted on any resource, not only the one they were originally running on
Conceptually pretty straightforward, but it required quite a lot of development!
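The two-queue setup above can be sketched as a slurm.conf fragment. This is only an illustrative sketch: node and partition names are hypothetical, CHECKPOINT as a preemption mode requires a Slurm version with checkpoint plugin support, and the exact partition priority keyword varies across Slurm releases.

```
# slurm.conf fragment (hypothetical node names).
# PreemptMode=CHECKPOINT needs a checkpoint plugin (e.g. the DMTCP support
# described in this talk) and is only present in older Slurm releases.
PreemptType=preempt/partition_prio
PreemptMode=CHECKPOINT

# Two partitions over the same nodes; jobs in "high" preempt jobs in "low".
PartitionName=high Nodes=node[01-16] Priority=10 Default=NO
PartitionName=low  Nodes=node[01-16] Priority=1  Default=YES
```

With this in place, submitting to `high` checkpoints a running `low` job instead of killing it, and the checkpointed job can later be restarted on whatever resources are free.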

New scheduling algorithms using C/R
- Moving running tasks to different resources is now possible
- We are still analyzing how to use this to increase performance and/or reduce energy consumption
- A "dummy" prototype is already implemented: the technical problems are solved, now we are fighting with the maths!

Tool for system management
Scenario: you have to update something on a node, but a job is running there. Even after marking the node as "DRAINING", it may take hours or days before the job ends.
Our idea: checkpoint the job, move it somewhere else, work on the node, and then put it back as "AVAILABLE".
Already implemented and working fine :)
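The maintenance workflow can be sketched as the following command sequence. Node name and job id are placeholders (the job id would normally be found with `squeue`), and since the commands only make sense on a live Slurm cluster with C/R enabled, they are printed here rather than executed.

```shell
#!/bin/sh
# Node-maintenance workflow: drain, checkpoint, migrate, resume.
# node07 and job 1234 are placeholders; on a real cluster the job id
# would come from: squeue -h -w "$NODE" -o %A
# Commands are echoed, not executed.
NODE=node07
JOBID=1234

DRAIN_CMD="scontrol update NodeName=$NODE State=DRAIN Reason=maintenance"
VACATE_CMD="scontrol checkpoint vacate $JOBID"    # save the job's state and stop it
RESTART_CMD="scontrol checkpoint restart $JOBID"  # resume the job on another node
RESUME_CMD="scontrol update NodeName=$NODE State=RESUME"

echo "$DRAIN_CMD"
echo "$VACATE_CMD"   # ...do the maintenance work between these two steps...
echo "$RESTART_CMD"
echo "$RESUME_CMD"
```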

Try it, it's free!
Installation guide and testbed: https://github.com/ciemat-tic/codec/wiki/tests-on-task-migration

Thanks for your attention!! Questions?