Minaashi Kalyanaraman Pragya Upreti CSS 534 Parallel Programming

Fault Tolerance in MPI

OVERVIEW
- Fault tolerance in MPI
- Levels of survival in MPI
- Approaches to fault tolerance in MPI
- Advantages & disadvantages of implementing fault tolerance in MPI
- Extending MPI to HARNESS: why FT-MPI, implementation, comparison of MPI and FT-MPI
- Performance considerations
- Conclusion
- Future scope

"MPI is not fault tolerant!" - Is that true? It is a common misconception about MPI. MPI provides considerable flexibility in the handling of errors. FAULT TOLERANCE IS A PROPERTY OF AN MPI PROGRAM!
[Diagram: Job1's MPI_COMM_WORLD contains processes P1-P4. While all are alive, sends return MPI_SUCCESS. When process P2 dies, under the default MPI_ERRORS_ARE_FATAL handler the other processes detect the error and abort.]

Approaches to Achieve Fault Tolerance in MPI
Levels of survival of an MPI implementation:
- Level 1: The MPI implementation automatically recovers from failure and continues without significant change to its behavior. The program state of the failed process is retained so that the overall computation can proceed. This is the highest level of survival and the most difficult to implement.
- Level 2: The MPI implementation is notified of the problem and is prepared to take corrective action. Example: using intercommunicators.
- Level 3: In case of failure, certain MPI operations, although not all, become invalid. Examples: modifying MPI semantics, extending MPI.
- Level 4: In case of failure, the MPI program can abort and be restarted from a checkpoint. Example: checkpointing.

The MPI Standard and Fault Tolerance
Reliable communication: the MPI implementation is responsible for detecting and handling network faults. It can retransmit the message, or inform the application that an error has occurred and let the application take its own corrective action.
Error handlers: error handlers are set on communicators with MPI_Comm_set_errhandler. The default is MPI_ERRORS_ARE_FATAL; it can be changed to MPI_ERRORS_RETURN. Users can also define their own error handlers and attach them to communicators.

ERROR HANDLING (CONTINUED)
In C++, MPI::ERRORS_THROW_EXCEPTIONS is defined to handle errors. If an error is returned, the standard does not require that subsequent operations succeed, nor that they fail. The standard thus allows implementations to take various approaches to the fault tolerance issue.

Approaches to Fault Tolerance in MPI Programs
1. Checkpointing: a common technique that periodically saves the state of a computation, allowing the computation to be restarted from that point in the event of a failure.
The cost of checkpointing is determined by:
- the cost to create and write a checkpoint
- the cost to read and restore a checkpoint
- the probability of failure
- the time between checkpoints
- the total time to run without checkpoints
Types of checkpointing: user-directed and system-directed.
Advantage & disadvantage: it is easy to implement, but the cost of saving and restoring checkpoints must be kept relatively small.

Approaches to Fault Tolerance in MPI Programs
2. Using intercommunicators: an intercommunicator contains two groups of processes; all communication occurs between processes in one group and processes in the other group.
Example: manager-worker. The manager process keeps track of a pool of tasks and dispatches them to worker processes for completion. Workers return results to the manager, simultaneously requesting a new task.
Advantages & disadvantage: the manager can easily recognize that a particular worker has failed and communicate this to the other processes, and each group can keep track of the state held by the other group; however, this is difficult to implement in complex systems.

Approaches to Fault Tolerance in MPI Programs
3. Modifying MPI semantics: takes advantage of existing MPI objects, which contain extra state, and of MPI functions defined in the standard.
Example: MPI guarantees that the number of processes in a communicator, and each process's rank, remain constant. A program can use this property to decompose data according to a communicator's size and to calculate the data assigned to a process from its rank.
Advantage & disadvantage: fault-tolerant programs can be written for a wider set of algorithms, but because this approach relies on already-existing semantics, it provides fewer fault-tolerance features than the other approaches.

Approaches to Fault Tolerance in MPI Programs
4. Extending MPI: this approach was developed to address the difficulty of using MPI communicators when processes may fail. It is difficult to construct a communicator consisting of just two individual processes; if the manager group has failed, it is even more difficult, because of the collective semantics of communicator construction in MPI.

Advantages of Using MPI's Fault Tolerance Features
- It is simple and easy to use the existing error-handling features in MPI.
- Users can extend MPI_ERRORS_RETURN to define errors specific to their needs.
- Error handling is purely local: every process can have a different handler.
- The ability to attach error handlers to a communicator increases the modularity of MPI.
- MPI provides the ability to define one's own application-specific error handler, which is an important approach to fault tolerance.

Limitations of Fault Tolerance in MPI
- The specification makes no demands on MPI to survive failures.
- The defined MPI error classes serve only to clarify the source of an error to the user.
- It is difficult for MPI to notify users of the failure of a function after that function has already returned.
- There is no description of when error notification will happen relative to the occurrence of the error.
- It is not possible for one application process to ask to be informed of errors on other processes, or for the application to be informed of specific classes of errors.

HARNESS / Fault-Tolerant MPI: An Extension to MPI
HARNESS (Heterogeneous Adaptive Reconfigurable Networked SyStem) is an experimental system that provides a highly dynamic, fault-tolerant computing environment for high-performance computing applications.
HARNESS is a joint DOE-funded project involving Oak Ridge National Laboratory (ORNL), the University of Tennessee at Knoxville (UTK/ICL), and Emory University in Atlanta, GA.

HARNESS: An Extension to MPI
- Current MPI implementations either abort on failure or rely on checkpointing.
- Communication occurs only via communicators.
- MPI communicators are based on a static process model.

Implementation
FT-MPI (HARNESS) extends MPI to allow applications to decide what happens when errors occur: restart the failed node, or continue with a smaller number of nodes.
When a member of a communicator fails:
- the communicator's state changes to indicate the problem
- message transfers continue if safe, or are stopped or ignored
- the user application can fix or abort the communicator in order to continue.

Comparison of FT-MPI and MPI: Communicator and Process States
- FT-MPI communicator states: FT_OK, FT_DETECTED, FT_RECOVER, FT_RECOVERED, FT_FAILED
- MPI communicator states: VALID, INVALID
- Process states: OK, UNAVAILABLE, FAILED, JOINING

Implementation: Extending MPI
When running an FT-MPI application, two parameters specify the modes in which the application runs. The first parameter, the "communicator mode", indicates the status of an MPI object after recovery and can be specified when starting the application:
- ABORT: like MPI, FT-MPI can abort on an error.
- BLANK: failed processes are not replaced; gaps are left in the lists of processes.
- SHRINK: failed processes are not replaced; there are no gaps in the lists of processes.
- REBUILD: failed processes are respawned; surviving processes keep the same rank. This is the default mode.

FT-MPI: The second parameter, the "communication mode", has two settings:
- CONT (CONTINUE): all operations that returned an MPI_SUCCESS code will finish properly.
- NOOP (RESET): all ongoing messages are dropped; an error resets the application to its last consistent state.

FT-MPI: Communicator Failure Handling
- A communicator is invalidated if a failure is detected.
- The underlying system sends a state update to all processes for that communicator.
- System behavior depends on the communicator mode chosen.
- Communicators are not updated for communication errors, only for process exits.

FT-MPI Usage
Error handling takes the form of an error check followed by some corrective action, such as a communicator rebuild. For example (simple FT-MPI send usage):

  rc = MPI_Send(-------, com);
  if (rc == MPI_ERR_OTHER) {
      MPI_Comm_dup(com, &newcom);
      com = newcom;
  }

In an SPMD master-worker code, only the master needs to check for errors if the user treats the master as the only point of failure.

Example : MPI Error handling

Example of Error Handling Using FT-MPI

Performance Considerations
- The fault-free overhead of point-to-point communication in MPI/FT is negligible in long-running applications.
- Checkpointing increases communication overhead considerably, so the user must choose a suitably low checkpoint frequency.

Conclusions
- FT-MPI is a tool that provides methods of dealing with failures within MPI applications.
- FT-MPI is useful for experimenting with self-tuning collective communications, distributed control algorithms, and dynamic library download methods.

Future Scope
- Developing further implementations that support more restrictive environments (e.g. embedded clusters).
- Creating a number of drop-in library templates to simplify the construction of fault-tolerant applications.
- High performance and survivability.

References
- "Fault Tolerance in MPI Programs": http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf
- LEGION: http://legion.virginia.edu/documentation/FAQ_mpi_run.html
- HARNESS: http://icl.cs.utk.edu/ftmpi/index.html
- MPI 3.0 Fault Tolerance Working Group: http://meetings.mpi-forum.org/mpi3.0_ft.php
- Graham E. Fagg, George Bosilca, Thara Angskun, Zhizhong Chen, Jelena Pjesivac-Grbovic, Kevin London, and Jack J. Dongarra, "Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems", HARNESS manual.
- Graham E. Fagg, Antonin Bukovsky, and Jack J. Dongarra, "HARNESS and fault tolerant MPI", Parallel Computing 27 (2001), pp. 1479-1495.
- Graham E. Fagg and Jack J. Dongarra, "Building and Using a Fault-Tolerant MPI Implementation", The International Journal of High Performance Computing Applications, Vol. 18, No. 3, Fall 2004, pp. 353-361, conference proceedings.
- Graham E. Fagg and Jack J. Dongarra, FT-MPI presentation.

Q & A ?

FAQs
1. MPI vs. TCP sockets: Arguably, one of the biggest weaknesses of MPI is its lack of resilience: most (if not all) MPI implementations will kill an entire MPI job if any individual process dies. This is in contrast to the reliability of TCP sockets, for example: if a process on one side of a socket suddenly goes away, the peer just gets a stale socket.
2. Does MPI guarantee that a user-defined handler behaves like MPI_ERRORS_RETURN? The specification does not state whether an error that would cause MPI functions to return an error code under the MPI_ERRORS_RETURN error handler would cause a user-defined error handler to be called during the same MPI function, or at some earlier or later point in time.
3. Relation between checkpointing and I/O: The practicality of checkpointing is tied to the performance of parallel I/O, since checkpoint data is saved to a parallel file system.

FAQs
4. Usability of HARNESS FT-MPI: The fault tolerance provided by HARNESS depends on its implementation. The HARNESS team actively works on reported bugs and releases new versions.
5. Data recovery in MPI: The MPI standard does not provide a way to recover data; recovery depends on the implementation of the MPI program.
6. Can fault tolerance in MPI be made transparent? It is very difficult to make fault tolerance in MPI transparent, because of the complexity involved in communication between processes.

Reference Slides

Reference: Structure of FT-MPI

Derived Datatype Handling
- Reduces memory copies while allowing overlapping.
- Three stages of data handling: gather/scatter, encoding/decoding, send/receive package.

Handling of compacted datatypes: only MPI_Send and MPI_Recv were used.

Performance Considerations
- Tests show that compacted data handling gives a 10% to 19% improvement.
- The benefits of buffer reuse and reordering of data elements lead to considerable improvements on heterogeneous networks.