Lessons Learned Implementing User-Level Failure Mitigation in MPICH
Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji
Argonne National Laboratory
{wbland, huiweilu, sseo,
May 5, 2015
CCGrid 2015

User-Level Failure Mitigation (ULFM)

What is ULFM?
– A proposed, standardized way of handling fail-stop process failures in MPI
– The mechanisms necessary to implement fault tolerance in applications and libraries, so that applications can continue execution after failures

ULFM introduces semantics to define failure notification, propagation, and recovery within MPI.

[Figure: a failed MPI_Recv returns MPI_ERR_PROC_FAILED (failure notification); the error is propagated to the other processes (failure propagation); the application rebuilds its communicator with MPI_COMM_SHRINK() (failure recovery).]
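To make these semantics concrete, below is a minimal sketch of the notify/propagate/recover cycle. It assumes the MPIX_-prefixed extension names (MPIX_ERR_PROC_FAILED, MPIX_ERR_REVOKED, MPIX_Comm_revoke, MPIX_Comm_shrink, declared in mpi-ext.h) used by ULFM prototypes such as the one in MPICH, rather than the unprefixed names shown on the slide; the receive is expected to fail because the peer process has died.

#include <mpi.h>
#include <mpi-ext.h>  /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink, ... */

int main(int argc, char **argv)
{
    MPI_Comm survivors;
    int rank, size, rc, ec, buf = 0;

    MPI_Init(&argc, &argv);
    /* Return errors instead of aborting, so failures are survivable. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Failure notification: communication with a failed peer returns an
     * error of class MPIX_ERR_PROC_FAILED instead of completing. */
    rc = MPI_Recv(&buf, 1, MPI_INT, (rank + 1) % size, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Error_class(rc, &ec);
    if (ec == MPIX_ERR_PROC_FAILED || ec == MPIX_ERR_REVOKED) {
        /* Failure propagation: mark the communicator unusable everywhere,
         * so ranks blocked elsewhere also learn about the failure. */
        MPIX_Comm_revoke(MPI_COMM_WORLD);
        /* Failure recovery: build a new communicator of the survivors. */
        MPIX_Comm_shrink(MPI_COMM_WORLD, &survivors);
        /* ... continue the application on 'survivors' ... */
        MPI_Comm_free(&survivors);
    }

    MPI_Finalize();
    return 0;
}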

Motivation & Goal ULFM is becoming the front-running solution for process fault tolerance in MPI – Not yet adopted into the MPI standard – Being used by applications and libraries and is being Introduce an implementation of ULFM in MPICH – MPICH is a high-performance and widely portable implementation of the MPI standard – Implementing ULFM in MPICH will expedite adoption more widely by other MPI implementations Demonstrate that while still a reference implementation, the runtime cost of the new API calls introduced is relatively low CCGrid

Implementation & Evaluation ULFM Implementation in MPICH Failure Detection – Local failures detected by Hydra and netmods – Error codes are returned back to the user from the API calls Agreement – Uses two group-based allreduce operations – If either fails, an error is returned to the user Revocation – Non-optimized implementation done with message flood Shrinking – All processes construct consistent group of failed procs via allreduce CCGrid Shrinking Slower than MPI_COMM_DUP because of failure detection. As expected, introducing failures in the middle of the algorithm causes large runtime increase as the algorithm must restart.