A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance Chao Wang, Frank Mueller North Carolina State University Christian Engelmann, Stephen L. Scott Oak Ridge National Laboratory

2 Outline
Problem vs. Our Solution
Overview of LAM/MPI and BLCR
Our Design and Implementation
Experimental Framework
Performance Evaluation
Related Work
Conclusion

3 Problem Statement
Trends in HPC: high-end systems with thousands of processors
— Increased probability of node failure: MTTF becomes shorter
MPI widely accepted in scientific computing
— But no fault recovery method in the MPI standard
Extensions to MPI for FT exist, but…
— Cannot dynamically add/delete nodes transparently at runtime
— Must reboot the LAM RTE
— Must restart the entire job
  - Inefficient if only one/few node(s) fail
  - Staging overhead
— Requeuing penalty

4 Our Solution – Job-pause Service
Integrate group communication
— Add/delete nodes
— Detect node failures automatically
Processes on live nodes remain active (roll back to the last checkpoint)
Only processes on failed nodes are dynamically replaced by spares resumed from the last checkpoint
Hence:
— No restart of the entire job
— No staging overhead
— No job requeue penalty
— No LAM RTE reboot

5 Outline
Problem vs. Our Solution
Overview of LAM/MPI and BLCR
Our Design and Implementation
Experimental Framework
Performance Evaluation
Related Work
Conclusion

6 LAM/MPI Overview
Modular, component-based architecture with 2 major layers:
— Daemon-based RTE: lamd
— SSI framework: "plugs" C/R into MPI, with coordinated C/R and BLCR support
Example: a 2-node MPI job (figure)
RTE: Run-Time Environment, SSI: System Services Interface, RPI: Request Progression Interface
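To make the transparency point concrete, below is an ordinary MPI program of the kind this stack targets: it contains no checkpoint/restart code at all, because under LAM/MPI with the BLCR-backed cr SSI module the coordination happens inside the library and lamd. The program itself is only an illustrative stand-in (its step loop and names are made up), not code from the talk.

```c
/* Plain MPI program: under LAM/MPI + BLCR's cr SSI module this unmodified
 * code can be checkpointed, paused, and migrated; the C/R coordination is
 * done by the library and lamd, not by the application. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int step = 0; step < 1000; step++) {
        double local = rank + step, sum = 0.0;
        /* A collective per step: in-flight messages like this are what the
         * coordinated C/R layer must drain before BLCR takes a checkpoint. */
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0 && step % 100 == 0)
            printf("step %d: sum = %g (%d ranks)\n", step, sum, size);
        sleep(1);
    }
    MPI_Finalize();
    return 0;
}
```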

7 BLCR Overview
Process-level C/R facility: for a single MPI application process
Kernel-based: saves/restores most/all resources
Implementation: Linux kernel module, allows upgrades & bug fixes w/o reboot
Provides hooks used for distributed C/R of LAM/MPI jobs
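The "hooks" are callbacks registered with BLCR's user-level library, libcr. Below is a minimal sketch of such a callback, assuming the documented libcr entry points cr_init, cr_register_callback, and cr_checkpoint; the quiesce/reopen helpers are hypothetical stand-ins for what LAM/MPI's cr module actually does, and the exact constants and return-code conventions should be checked against the BLCR version in use.

```c
/* Sketch of a BLCR callback hook via libcr (link with -lcr). */
#include <libcr.h>
#include <stdio.h>

static void quiesce_communication(void) { /* drain in-flight messages (stub) */ }
static void reopen_communication(void)  { /* rebuild connections (stub) */ }

/* Runs in the callback thread spawned at registration time (step 1 on the
 * next slide); it blocks in the kernel until a checkpoint -- or, with the
 * authors' extension, a pause -- is requested. */
static int cr_cb(void *arg) {
    quiesce_communication();                      /* pre-checkpoint work */

    int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    if (rc > 0) {
        /* Positive return: resuming from a saved image (restart, or state
         * restored after a pause/migration). */
        reopen_communication();
    } else if (rc == 0) {
        /* Zero: checkpoint taken, the same process continues. */
        reopen_communication();
    } else {
        fprintf(stderr, "checkpoint failed\n");
    }
    return 0;
}

int register_blcr_hooks(void) {
    if (cr_init() < 0)                            /* connect to the kernel module */
        return -1;
    cr_register_callback(cr_cb, NULL, CR_THREAD_CONTEXT);
    return 0;
}
```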

8 Outline
Problem vs. Our Solution
Overview of LAM/MPI and BLCR
Our Design and Implementation
Experimental Framework
Performance Evaluation
Related Work
Conclusion

9 Our Design & Implementation – LAM/MPI
Decentralized, scalable membership and failure detector (ICS'06)
— Radix tree → scalability
— Dynamically detects node failures
— NEW: integrated into lamd
NEW: Decentralized scheduler
— Integrated into lamd
— Periodic coordinated checkpointing
— Node failure → triggers
  1. Process migration (failed nodes)
  2. Job-pause (operational nodes)
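As an illustration of why a radix tree helps scalability, the sketch below computes each node's parent and children in a radix-k overlay, so every node exchanges heartbeats with only O(k) neighbors instead of all n nodes. This is a generic illustration under assumed constants (RADIX, node numbering), not the authors' ICS'06 membership protocol.

```c
/* Illustrative radix-k tree neighbor computation for a decentralized
 * membership/failure-detection overlay. */
#include <stdio.h>

#define RADIX 4   /* branching factor of the overlay tree (assumed) */

/* Parent of a node id in a radix-RADIX tree rooted at node 0. */
static int parent_of(int id) {
    return (id == 0) ? -1 : (id - 1) / RADIX;
}

/* Children of a node id; returns the number of children written to out[]. */
static int children_of(int id, int nnodes, int out[RADIX]) {
    int count = 0;
    for (int c = RADIX * id + 1; c <= RADIX * id + RADIX && c < nnodes; c++)
        out[count++] = c;
    return count;
}

int main(void) {
    int nnodes = 16, kids[RADIX];
    for (int id = 0; id < nnodes; id++) {
        int n = children_of(id, nnodes, kids);
        printf("node %2d: parent %2d, %d children\n", id, parent_of(id), n);
        /* Each node heartbeats only its parent and children, so failure
         * detection traffic per node stays O(RADIX) rather than O(nnodes) --
         * the scalability argument on this slide. */
    }
    return 0;
}
```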

10 New Job Pause Mechanism – LAM/MPI & BLCR
Operational nodes: Pause
— BLCR: reuse processes; restore part of the process state from the checkpoint
— LAM: reuse existing connections
Failed nodes: Migrate
— Restart on a new node from the checkpoint file
— Connect with the paused tasks

11 New Job Pause Mechanism – BLCR
(In the figure, in-kernel steps are shown with dashed lines/boxes; the callback kernel thread coordinates the user command process and the application process)
1. App registers threaded callback → spawns callback thread
2. Thread blocks in kernel
3. Pause utility calls ioctl(), unblocks callback thread
4. All threads complete callbacks & enter kernel
5. NEW: all threads restore part of their states
6. Run regular application code from restored state

12 Process Migration – LAM/MPI
Change addressing information of the migrated process
— in the process itself
— in all other processes
Use node id (not IP) for addressing information
No change to BLCR for process migration
Update addressing information at run time:
1. Migrated process tells the coordinator (mpirun) about its new location
2. Coordinator broadcasts the new location
3. All processes update their process list
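The run-time address update can be pictured as a small table keyed by rank, as sketched below. All names here (proc_entry, proc_list, apply_location_update) are hypothetical; in LAM/MPI the equivalent information lives in the library's internal process tables, and the update is driven by mpirun and the MPI library rather than by application code.

```c
/* Illustrative sketch of step 3 of the address update above. */
#include <stddef.h>

#define MAX_PROCS 1024

struct proc_entry {
    int rank;      /* MPI rank of the process             */
    int node_id;   /* logical node id used for addressing */
};

static struct proc_entry proc_list[MAX_PROCS];
static int nprocs;

/* Applied by every process after the coordinator (mpirun) broadcasts the
 * migrated process's new location (steps 1 and 2). Because addressing uses
 * the logical node id rather than an IP, only this one field changes. */
static void apply_location_update(int rank, int new_node_id) {
    for (int i = 0; i < nprocs; i++) {
        if (proc_list[i].rank == rank) {
            proc_list[i].node_id = new_node_id;
            return;
        }
    }
}
```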

13 Outline
Problem vs. Our Solution
Overview of LAM/MPI and BLCR
Our Design and Implementation
Experimental Framework
Performance Evaluation
Related Work
Conclusion

14 Experimental Framework
Experiments conducted on
— Opt cluster: 16 nodes, 2-core dual Opteron 265, 1 Gbps Ethernet
— Fedora Core 5 Linux x86_64
— LAM/MPI + BLCR with our extensions
Benchmarks
— NAS V3.2.1 (MPI version)
  – Run 5 times; results report the average
  – Class C (large problem size) used
  – BT, CG, EP, FT, LU, MG and SP benchmarks
  – IS run is too short

15 Relative Overhead (Single Checkpoint)
Checkpoint overhead < 10%, except FT and MG (explained later)

16 Absolute Overhead (Single Checkpoint)
Short: ~10 secs
Checkpoint times increase linearly with checkpoint file size
EP: small + constant checkpoint file size → increased communication overhead
Except FT (explained next)

17 Analysis of Outliers
(Figure: size of checkpoint files [MB] for BT, CG, EP, FT, LU, MG and SP on 4, 8 and 16 nodes)
MG: large checkpoint files, but short overall execution time
FT: large checkpoint files, thrashing/swap (BLCR problem)

18 Job Migration Overhead
69.6% < job restart + LAM reboot
No LAM reboot
No requeue penalty
Transparent continuation of execution (on 16 nodes)
— Less staging overhead

19 Related Work
FT – Reactive approaches
Transparent
— Checkpoint/restart
  – LAM/MPI w/ BLCR [S. Sankaran et al., LACSI '03]
  – Process migration: scan & update checkpoint files [J. Cao, Y. Li and M. Guo, ICPADS 2005] → still requires restart of the entire job
  – CoCheck [G. Stellner, IPPS '96]
— Log-based (message logging + temporal ordering)
  – MPICH-V [G. Bosilca et al., Supercomputing 2002]
Non-transparent
— Explicit invocation of checkpoint routines
  – LA-MPI [R. T. Aulwes et al., IPDPS 2004]
  – FT-MPI [G. E. Fagg and J. J. Dongarra, 2000]

20 Conclusion
Job-pause for fault tolerance in HPC
Design generic for any MPI implementation / process C/R
Implemented over LAM/MPI with BLCR
Decentralized P2P scalable membership protocol & scheduler
High-performance job-pause for operational nodes
Process migration for failed nodes
Completely transparent
Low overhead: 69.6% < job restart + LAM reboot
— No job requeue overhead
— Less staging cost
— No LAM reboot
Suitable for proactive fault tolerance with diskless migration

21 Questions? Thank you!