Use Cases for Fault Tolerance Support in MPI
Rich Graham, Oak Ridge National Laboratory

Working Assumption
 MPI provides the hooks into the communications and process-control system to allow others to implement fault-tolerant algorithms
 This working group will address process fault tolerance (FT), not network FT (network FT does not require changes to the standard)

Process Failure
 Scenario:
    A running parallel application with process count N loses one or more processes due to a failure not related to the application
 Recovery scenarios:
    The communicator(s) abort
    The application uses some sort of checkpoint/restart (CPR) to continue
       May want to quiet the communications system
       May want to log messages
    The application continues to run with M processes, where M <= N (the application chooses whether it can continue with M < N)
       The application expects a dense rank index
       The application expects a sparse rank index
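Concretely, the first thing any of these recovery paths requires is that errors be reported to the application instead of aborting the job. Below is a minimal sketch, not from the slides, using only standard MPI calls; the recovery action itself is a hypothetical placeholder, and note that the current standard leaves MPI's state undefined after such an error, which is exactly the gap this working group is meant to close.

```c
/* Minimal sketch (not from the slides): let the application, rather than
 * MPI, decide how to react when a peer fails.  Uses only standard MPI
 * calls; the recovery action itself is a hypothetical placeholder. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, right, left, rc;
    double sendval = 1.0, recvval = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Default handler (MPI_ERRORS_ARE_FATAL) aborts the whole job.
     * MPI_ERRORS_RETURN lets error codes reach the caller instead. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    right = (rank + 1) % size;           /* simple ring exchange */
    left  = (rank + size - 1) % size;
    rc = MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, right, 0,
                      &recvval, 1, MPI_DOUBLE, left,  0,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rc != MPI_SUCCESS) {
        /* Application-chosen recovery path (hypothetical): abort the
         * communicator, roll back to a checkpoint, or continue with
         * fewer processes.  The simplest portable choice today: */
        fprintf(stderr, "rank %d: communication failed, giving up\n", rank);
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Finalize();
    return 0;
}
```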

MPI Implications
 Scenario: Communicator abort (current state in MPI)
    Process control:
       Terminate processes in the failed intra-communicator
    Communications:
       Discard traffic associated with the failed communicators
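A sketch of what the communicator-abort path looks like from the application side with today's MPI, assuming an application-installed error handler whose only portable reaction is to abort; the library then terminates the affected processes and discards their pending traffic. The callback name is illustrative.

```c
/* Sketch: custom error handler for the "communicator abort" scenario.
 * Only the MPI_* calls are standard; the callback name is illustrative. */
#include <mpi.h>
#include <stdio.h>

static void comm_err_cb(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;

    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "communicator error: %s -- aborting\n", msg);
    /* With current MPI the portable reaction is to abort; the library then
     * terminates the processes and discards the communicator's traffic. */
    MPI_Abort(*comm, *errcode);
}

int main(int argc, char **argv)
{
    MPI_Errhandler errh;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_errhandler(comm_err_cb, &errh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);

    /* ... application communication on MPI_COMM_WORLD ... */

    MPI_Errhandler_free(&errh);
    MPI_Finalize();
    return 0;
}
```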

MPI Implications
 Scenario: Some sort of CPR method in use
    Process control:
       Need to re-establish communications with the restarted processes
    Communications:
       May need to quiet the communications system to get into a state the CPR system can handle
       May need to replay messages
       May need to quiet communications until the parallel application is fully restarted (after the failure)
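For illustration, a sketch of application-level checkpointing at a quiet point, assuming hypothetical save_state()/load_state() helpers and a fixed checkpoint interval. A system-level CPR tool would do this transparently and would additionally need the quiesce and replay support from MPI listed above.

```c
/* Application-level sketch of the CPR scenario, assuming hypothetical
 * save_state()/load_state() helpers and a fixed checkpoint interval. */
#include <mpi.h>
#include <stdio.h>

#define N_ELEMS       1024
#define N_STEPS       1000
#define CKPT_INTERVAL  100   /* iterations between checkpoints (illustrative) */

static void save_state(int rank, int step, const double *state)
{
    char name[64];
    FILE *f;

    snprintf(name, sizeof(name), "ckpt_rank%d.dat", rank);
    f = fopen(name, "wb");
    if (!f)
        return;
    fwrite(&step, sizeof(step), 1, f);
    fwrite(state, sizeof(double), N_ELEMS, f);
    fclose(f);
}

static int load_state(int rank, int *step, double *state)
{
    char name[64];
    FILE *f;

    snprintf(name, sizeof(name), "ckpt_rank%d.dat", rank);
    f = fopen(name, "rb");
    if (!f)
        return 0;                              /* no checkpoint: cold start */
    if (fread(step, sizeof(*step), 1, f) != 1 ||
        fread(state, sizeof(double), N_ELEMS, f) != (size_t)N_ELEMS) {
        fclose(f);
        return 0;
    }
    fclose(f);
    return 1;
}

int main(int argc, char **argv)
{
    int rank, step = 0;
    double state[N_ELEMS] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    load_state(rank, &step, state);            /* resume if restarted */

    for (; step < N_STEPS; step++) {
        /* ... compute and exchange data for this step ... */
        if (step % CKPT_INTERVAL == 0) {
            /* Quiet point: all blocking communication for this step has
             * completed and every rank is at the same place in the code. */
            MPI_Barrier(MPI_COMM_WORLD);
            save_state(rank, step, state);
        }
    }

    MPI_Finalize();
    return 0;
}
```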

MPI Implications
 Scenario: Application continues to run with M processes (M <= N)
    Process control:
       May need to re-index processes within affected communicators
    Communications:
       May need to (re-)establish communications
       May need to handle communications to non-existent processes
       May need to discard data:
          All outstanding traffic
          Only traffic associated with failed processes
    Groups and communicators:
       May change during the life-cycle of these objects
    Collective communications:
       How are collective optimizations impacted?
       What happens with outstanding collective operations?
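To make the re-indexing and dense-rank question concrete, here is a hedged sketch assuming the list of surviving ranks is somehow known; how to obtain that list, and how to make the communicator-creation call survivable, is precisely what would have to be standardized. The later ULFM proposal provides this operation directly as MPIX_Comm_shrink().

```c
/* Sketch of the re-indexing step, assuming the list of surviving world
 * ranks is already known (obtaining it is what must be standardized).
 * Standard group calls give a dense 0..M-1 index, but note the catch
 * in the comment below. */
#include <mpi.h>

/* 'survivors' holds the M ranks of old_comm that are still alive. */
MPI_Comm rebuild_dense_comm(MPI_Comm old_comm, const int *survivors, int m)
{
    MPI_Group old_grp, new_grp;
    MPI_Comm new_comm;

    MPI_Comm_group(old_comm, &old_grp);
    MPI_Group_incl(old_grp, m, survivors, &new_grp);

    /* Ranks in new_comm are dense: 0..m-1 in the order of 'survivors'.
     * Catch: MPI_Comm_create is collective over old_comm, so with today's
     * MPI it cannot complete once members have actually died -- the gap
     * the later ULFM proposal closes with MPIX_Comm_shrink(). */
    MPI_Comm_create(old_comm, new_grp, &new_comm);

    MPI_Group_free(&new_grp);
    MPI_Group_free(&old_grp);
    return new_comm;
}
```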