
FT-MPI Survey Alan & Nathan

FT-MPI – Overview
- Background
- Architecture
- Modes
- Drawbacks
- Commercial Applications

FT-MPI – Background [1]
- FT-MPI is a full MPI 1.2 implementation
- C and Fortran interfaces
- Provides process-level fault tolerance
- Able to survive n-1 process crashes in an n-process job
- Does not recover data on a crashed node

FT-MPI – Architecture
Layered design: the MPI library sits on top of FT-MPI, which runs on the HARNESS framework; FT-MPI supplies dynamic process management and fault tolerance.

FT-MPI – Background [1]
- HARNESS: Heterogeneous Adaptive Reconfigurable Networked SyStem
- Underlying framework that provides highly dynamic and fault-tolerant high-performance computing
- FT-MPI is a HARNESS MPI API

FT-MPI – Background [3]
Communicator states:
- MPI communicator: {valid, invalid}
- FT-MPI communicator: {OK, Problem, Failed}
- Problem = detected, recover, recovered
Process states:
- MPI: {OK, Failed}
- FT-MPI: {OK, Unavailable, Joining, Failed}
(A minimal detection sketch follows below.)
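
At the application level these states are observed through MPI return codes: the deck's own example code treats a non-MPI_SUCCESS result such as MPI_ERR_OTHER as the signal that the communicator has left the OK state. The fragment below is a minimal, non-authoritative sketch under that assumption, using only standard MPI calls.

    #include <stdio.h>
    #include "mpi.h"

    /* Sketch: detect that a communicator has entered a Problem/Failed state
     * by turning off MPI's default abort-on-error behaviour and inspecting
     * return codes. MPI_Comm_set_errhandler is the MPI-2 name; MPI 1.2
     * codes used MPI_Errhandler_set for the same purpose. */
    void probe_comm_state(MPI_Comm comm)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int rc, len;

        /* Return error codes instead of aborting the job. */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        rc = MPI_Barrier(comm);            /* any collective will do */
        if (rc == MPI_SUCCESS) {
            /* Communicator is in the OK state. */
        } else {
            /* Communicator is in the Problem/Failed state; report it. */
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "communicator problem: %s\n", msg);
            /* The recovery action depends on the selected error mode. */
        }
    }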

FT-MPI – Architecture [4]
- MPI 1.2 API / MPI objects
- Highly tuned, including tuned collective routines
- OS interaction via Hlib
- Startup, recovery, and shutdown
- Inter-node communication

FT-MPI – Architecture [3]
- MPI collective operations tuning: broadcast, gather
- Three options for buffering Derived Data Types (DDT):
  - Zero padding
  - Minimal padding
  - Re-ordering pack (encoding/decoding)

FT-MPI – Architecture [3]
DDT and buffer management:
- Reorders data and compresses
- 10-19% improvement for small messages (~12 KB)*
- 78-81% improvement for large messages (~95 KB)*
*Compared to MPICH (1.3.1) on a 93-element DDT
(An illustrative DDT sketch follows below.)
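
For readers unfamiliar with derived datatypes, the fragment below is a purely illustrative sketch of what a DDT describes at the application level; the particle struct and its field layout are assumptions and are not the 93-element type used in the measurement above.

    #include <stddef.h>
    #include "mpi.h"

    /* Illustrative non-contiguous record; FT-MPI's DDT handling reorders
     * and packs layouts like this before sending. */
    struct particle {
        double pos[3];
        double vel[3];
        int    id;
        char   tag[4];
    };

    /* Build an MPI derived datatype describing struct particle. */
    MPI_Datatype make_particle_type(void)
    {
        MPI_Datatype ptype;
        int          blocklens[4] = { 3, 3, 1, 4 };
        MPI_Aint     displs[4];
        MPI_Datatype types[4] = { MPI_DOUBLE, MPI_DOUBLE, MPI_INT, MPI_CHAR };

        displs[0] = offsetof(struct particle, pos);
        displs[1] = offsetof(struct particle, vel);
        displs[2] = offsetof(struct particle, id);
        displs[3] = offsetof(struct particle, tag);

        MPI_Type_create_struct(4, blocklens, displs, types, &ptype);
        MPI_Type_commit(&ptype);
        return ptype;
    }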

FT-MPI – Architecture [3]
- The HARNESS kernel allows dynamic code insertion, both directly and indirectly
- Crucial impact of this: services such as
  - Spawn and Notify service (remote processes)
  - Naming service
  - Distributed Replicated Database (DRD): system state & metadata

FT-MPI – Modes
Provides 4 error modes [1]:
- Abort – quits if a process crashes
- Blank – continue execution with missing data
- Shrink – continue running and shrink the number of nodes
- Rebuild – restart the crashed process
Message modes [3]:
- NOP – no operations on error
- CONT – all other operations continue on error
(A recovery sketch for the non-abort modes follows below.)
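
What "continue" means in practice differs per mode. The fragment below is a minimal, non-authoritative sketch of the recovery branch an application might run after a collective reports an error; the ft_mode enum, the redistribute_data() helper, and the exact per-mode semantics are assumptions for illustration. Only the Rebuild branch follows directly from the example-code slide, which repairs the communicator with MPI_Comm_dup.

    #include "mpi.h"

    enum ft_mode { FT_ABORT, FT_BLANK, FT_SHRINK, FT_REBUILD };  /* hypothetical names */

    void redistribute_data(MPI_Comm comm);  /* hypothetical application helper */

    /* Sketch: possible recovery actions once an MPI call has reported a failure. */
    void recover(MPI_Comm *comm, enum ft_mode mode)
    {
        MPI_Comm newcomm;
        int size, rank;

        switch (mode) {
        case FT_ABORT:
            /* Abort mode: the runtime shuts the whole job down; nothing to do here. */
            break;
        case FT_REBUILD:
            /* Rebuild mode: duplicating the damaged communicator yields a repaired
             * one with the crashed rank respawned (cf. the example-code slide);
             * the application must then redistribute lost data itself. */
            MPI_Comm_dup(*comm, &newcomm);
            *comm = newcomm;
            redistribute_data(*comm);
            break;
        case FT_BLANK:
        case FT_SHRINK:
            /* Blank/Shrink modes: survivors keep running but must re-learn the
             * communicator layout (Shrink renumbers ranks, Blank leaves a gap);
             * size and rank feed the application's own data redistribution. */
            MPI_Comm_size(*comm, &size);
            MPI_Comm_rank(*comm, &rank);
            break;
        }
    }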

FT-MPI – Abort mode: from an initial configuration of 4 ranks exchanging scatter/gather operations, a failure causes a graceful shutdown of all ranks.

FT-MPI – Shrink mode: from an initial configuration of 4 ranks, the failed rank is dropped and the communicator shrinks to 3 ranks; surviving ranks are renumbered (e.g., the old rank 3 continues as rank 2) and execution continues.

FT-MPI – Blank mode: from an initial configuration of 4 ranks (MPI_COMM_SIZE = 4), the failed rank is left blank; execution continues with 3 valid ranks while MPI_COMM_SIZE remains 4.

FT-MPI – Rebuild mode: starting from the blank configuration (3 valid ranks, MPI_COMM_SIZE = 4), the crashed rank is respawned in place of the blank slot, restoring 4 ranks with MPI_COMM_SIZE = 4, and execution continues.

FT-MPI – Example Code

    #include "mpi.h"

    void wave2D(int argc, char *argv[]) {
        int rc, t;
        // start MPI
        rc = MPI_Init(&argc, &argv);
        if (rc == MPI_ERR_OTHER) {
            // handle error and restart
        }
        // compute array for t = 0 and t = 1
        // compute t = 2+
        for (t = 2; t < max; t++) {
            rc = MPI_Allgather(array, …);
            if (rc == MPI_ERR_OTHER) {
                // restart lost node
                MPI_Comm_dup(oldcomm, &newcomm);
                // redistribute data to the crashed node
                // revert to last completed t
                t = last_complete;
            }
            // calculate array
        }
    }
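
The slide elides the MPI_Allgather arguments and the surrounding setup. The following is a self-contained, hedged expansion of the same pattern; the buffer sizes, timestep count, rank limit, and rollback bookkeeping are illustrative assumptions, while the detect-then-MPI_Comm_dup recovery step mirrors the slide.

    #include <stdio.h>
    #include "mpi.h"

    #define N     64      /* elements owned per rank (assumption) */
    #define MAX_T 100     /* number of timesteps (assumption) */

    int main(int argc, char *argv[])
    {
        MPI_Comm comm = MPI_COMM_WORLD;
        double   local[N], global[N * 64];   /* sized for up to 64 ranks (assumption) */
        int      rc, t, size, last_complete = 1;

        MPI_Init(&argc, &argv);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);   /* report errors, don't abort */
        MPI_Comm_size(comm, &size);
        if (size > 64)
            MPI_Abort(comm, 1);               /* keep the fixed-size buffer valid */

        /* compute local array for t = 0 and t = 1 (omitted) */
        for (t = 2; t < MAX_T; t++) {
            /* exchange state with all ranks */
            rc = MPI_Allgather(local, N, MPI_DOUBLE,
                               global, N, MPI_DOUBLE, comm);
            if (rc != MPI_SUCCESS) {
                /* A peer failed: in Rebuild mode, duplicating the damaged
                 * communicator yields a repaired one (per the slide). */
                MPI_Comm newcomm;
                MPI_Comm_dup(comm, &newcomm);
                comm = newcomm;
                /* redistribute data to the restarted rank (omitted) */
                t = last_complete;    /* roll back to the last completed step */
                continue;
            }
            /* compute local array for step t (omitted) */
            last_complete = t;
        }

        MPI_Finalize();
        return 0;
    }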

FT-MPI – Tool [1]: COMINFO

FT-MPI – Drawbacks
- Modifies MPI standard semantics (Gropp and Lusk)
  - Communicator ranks can enter undefined states
  - Not realistic for writing production applications
- What FT-MPI does not do (Gabriel et al., EuroPVM/MPI 2003):
  - Recover user data (e.g., automatic checkpointing)
  - Provide transparent fault tolerance
  - Minimal checkpoint support – checkpoints must be manually coded (see the sketch below)
- Known bugs and problems (2006):
  - MPI_Ssend, MPI_Issend, and MPI_Ssend_init do not provide a truly synchronous send mode
  - MPI_Cancel on Isend operations may fail where the MPI standard requires the cancel to succeed
  - Mixed-mode communication on heterogeneous platforms requires XDR format
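
Since FT-MPI restores processes but not their data, checkpointing is left to the application, as noted above. The sketch below shows one hedged way to hand-code it: each rank writes its slice of state to a file and reads it back after recovery. The file names, data layout, and the assumption that the checkpoint location survives the failure (in practice it would need to be shared or remote storage, since a rebuilt rank may land on a different node) are all illustrative.

    #include <stdio.h>

    /* Write this rank's slice of application state to a per-rank file.
     * Illustrative only: a real code would also record the timestep and
     * fsync/rename for atomicity. */
    static int save_checkpoint(const double *data, int n, int rank)
    {
        char  name[64];
        FILE *f;

        snprintf(name, sizeof name, "ckpt_rank%d.bin", rank);
        f = fopen(name, "wb");
        if (!f) return -1;
        fwrite(data, sizeof(double), (size_t)n, f);
        fclose(f);
        return 0;
    }

    /* Reload the slice after a rank has been rebuilt. */
    static int load_checkpoint(double *data, int n, int rank)
    {
        char  name[64];
        FILE *f;

        snprintf(name, sizeof name, "ckpt_rank%d.bin", rank);
        f = fopen(name, "rb");
        if (!f) return -1;
        if (fread(data, sizeof(double), (size_t)n, f) != (size_t)n) {
            fclose(f);
            return -1;
        }
        fclose(f);
        return 0;
    }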

FT-MPI – Drawbacks – Based on Outdated Tech
2001:
- Single-core CPUs
- Low-speed internet links ("high-speed information highway" era)
- No/little virtualization
- Limited redundancy, no RAS
Today and future:
- 100x CPU performance
- 1000x node density
- Disaggregated architecture
- 100/400 GbE
- Failure rates 100x greater
- NVMeoF (non-volatile memory express over fabric)
- SDN (software-defined networking)
Reference: http://www.intel.com/content/www/us/en/architecture-and-technology/rack-scale-design-overview.html

Comparative FT Approaches for MPI [4]
(Comparison chart of fault-tolerance approaches for MPI, reproduced from Gabriel et al., Fault Tolerant MPI presentation, EuroPVM/MPI 2003.)

FT-MPI – Commercial Applications
- No commercial applications were based directly on FT-MPI (2001-2006)
- No development has been done on FT-MPI since it was integrated into Open MPI (~2006)
- Fault tolerance for Open MPI is based on FT-MPI, with extended functionality and support
- Commercial implementations of Open MPI include: IBM, Oracle, Fujitsu
- ISV libraries/applications sit on top of Open MPI
References: George Bosilca, Jack Dongarra, Innovative Computing Laboratory, University of Tennessee

References
[1] Dongarra, J. (n.d.). FT-MPI. Retrieved November 22, 2016, from http://icl.cs.utk.edu/ftmpi/index.html
[2] Gabriel, E., Fagg, G., Bukovsky, A., Angskun, T., & Dongarra, J. (n.d.). A Fault-Tolerant Communication Library for Grid Environments (pp. 1-10, Tech.).
[3] Fagg, G., Bukovsky, A., & Dongarra, J. (2001). HARNESS and fault tolerant MPI. Parallel Computing, 27, 1479-1495.
[4] Gabriel, E., Fagg, G., Angskun, T., Bosilca, G., Bukovsky, A., & Dongarra, J. (2003). Fault Tolerant MPI. Retrieved December 01, 2016, from http://www.dsi.unive.it/pvmmpi03/post/epvm03tutb2.pdf
[5] Bosilca, G., et al. (2002). MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. Supercomputing, ACM/IEEE 2002 Conference. IEEE.
[6] Gropp, W., & Lusk, E. (2004). Fault tolerance in message passing interface programs. International Journal of High Performance Computing Applications, 18(3), 363-372.
[7] Lemarinier, P., et al. (2004). Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. Cluster Computing, 2004 IEEE International Conference. IEEE.