Fault Tolerant Extensions to Charm++ and AMPI. Presented by Sayantan Chakravorty, with Chao Huang, Celso Mendes, Gengbin Zheng, and Lixia Shi.


Slide 1: Fault Tolerant Extensions to Charm++ and AMPI
Presented by Sayantan Chakravorty, with Chao Huang, Celso Mendes, Gengbin Zheng, and Lixia Shi

Slide 2: Outline
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
- Motivation
- Background
- Solutions:
  - Coordinated checkpointing
  - In-memory double checkpoint
  - Sender-based message logging
  - Processor evacuation in response to fault prediction (new work)

Slide 3: Motivation
- As machines grow in size, MTBF decreases
- Applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Modern hardware is making fault prediction possible: temperature sensors, PAPI-4, SMART
- Paper on detection tomorrow

Slide 4: Background
- Checkpoint-based methods:
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]; CoCheck, Starfish, Clip (fault-tolerant MPI)
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced: [Briatico84], doesn't scale well
- Log-based methods:
  - Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
  - Optimistic: [Strom85]; unbounded rollback, complicated recovery
  - Causal logging: [Elnozahy93]; complicated causality tracking and recovery; Manetho, MPICH-V3

Slide 5: Multiple Solutions in Charm++
- Reactive (react to a fault):
  - Disk-based checkpoint/restart
  - In-memory double checkpointing/restart
  - Sender-based message logging
- Proactive (react to a fault prediction):
  - Evacuate processors that are warned

Slide 6: Checkpoint/Restart Mechanism
- Blocking coordinated checkpoint
- The state of chares is checkpointed to disk
- Collective call: MPI_Checkpoint(DIRNAME)
- The entire job is restarted
- Virtualization allows restarting on a different number of PEs
- Runtime option: ./charmrun pgm +p4 +vp16 +restart DIRNAME
- Simple but effective for common cases

Slide 7: Drawbacks
- Disk-based coordinated checkpointing is slow
- The job needs to be restarted, which requires user intervention
- Impractical in the case of frequent faults

Slide 8: In-memory Double Checkpoint
- In-memory checkpoint: faster than disk
- Coordinated checkpoint: simple; the user can decide what makes up useful state
- Double checkpointing: each object maintains 2 checkpoints, on:
  - the local physical processor
  - a remote "buddy" processor

Slide 9: Restart
- A "dummy" process is created:
  - Need not have application data or checkpoints
  - Necessary for the runtime
  - Starts recovery on all other PEs
- Other processors:
  - Remove all chares
  - Restore checkpoints lost on the crashed PE
  - Restore chares from local checkpoints
- Load balance after restart
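A minimal sketch of how the double checkpoint and the buddy-based restore fit together. The buddy mapping, string-valued object state, and function names are illustrative assumptions, not the Charm++ API:

```cpp
#include <map>
#include <string>
#include <vector>

// Each PE holds live objects plus two checkpoint stores: one for its
// own objects and one for objects it is the "buddy" of.
struct PE {
    std::map<int, std::string> objects;    // live object states
    std::map<int, std::string> localCkpt;  // checkpoints of my objects
    std::map<int, std::string> buddyCkpt;  // checkpoints held for my buddy
};

// Assumed buddy mapping: the next PE in a ring.
int buddyOf(int pe, int numPEs) { return (pe + 1) % numPEs; }

// Coordinated checkpoint: every object saves its state both on its own
// PE and in its buddy's memory.
void doubleCheckpoint(std::vector<PE>& pes) {
    int n = (int)pes.size();
    for (int p = 0; p < n; ++p)
        for (auto& kv : pes[p].objects) {
            pes[p].localCkpt[kv.first] = kv.second;
            pes[buddyOf(p, n)].buddyCkpt[kv.first] = kv.second;
        }
}

// Recovery: the crashed PE's checkpoints were lost with it, but its
// buddy still holds copies, so the lost objects are rebuilt from the
// buddy while surviving PEs restore from their local checkpoints.
void recover(std::vector<PE>& pes, int crashed) {
    int n = (int)pes.size();
    pes[crashed].objects.clear();
    pes[crashed].localCkpt.clear();
    for (auto& kv : pes[buddyOf(crashed, n)].buddyCkpt)
        pes[crashed].objects[kv.first] = kv.second;  // restore lost objects
    for (int p = 0; p < n; ++p) {
        if (p == crashed) continue;
        for (auto& kv : pes[p].localCkpt)
            pes[p].objects[kv.first] = kv.second;    // local restore
    }
}
```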

Slide 10: Overhead Evaluation
- Jacobi (200 MB data size) on up to 128 processors
- 8 checkpoints in 100 steps

Slide 11: Recovery Performance
- LeanMD application, 10 crashes, 128 processors
- Checkpoint every 10 time steps

Slide 12: Drawbacks
- High memory overhead
- Checkpoint/rollback doesn't scale:
  - All nodes are rolled back just because 1 crashed
  - Even nodes independent of the crashed node are restarted
- Restart cost is similar to the checkpoint period
- Blocking coordinated checkpoint requires user intervention

Slide 13: Sender-based Message Logging
- Message logging: store message logs on the sender
- Asynchronous checkpoints: each processor has a buddy processor and stores its checkpoint in the buddy's memory
- Restart on a processor from an extra pool:
  - Recreate only the objects that were on the crashed processor
  - Play back the logged messages, restoring the state to that after the last processed message
- Processor virtualization can speed it up

Slide 14: Message Logging
- The state of an object is determined by the messages processed and the sequence in which they were processed
- Protocol:
  - The sender logs the message and requests a ticket number (TN) from the receiver
  - The receiver sends back a TN
  - The sender stores the TN with the log and sends the message
  - The receiver processes messages in order of TN
- Processor virtualization complicates message logging: messages to an object on the same processor need to be logged remotely
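The four protocol steps above can be sketched as a toy model in which the sender keeps the log and the receiver hands out TNs and processes messages strictly in TN order. The class names and the replay helper are illustrative, not the Charm++ implementation:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

struct Receiver {
    int nextTN = 0;
    std::vector<std::pair<int, std::string>> pending;  // (TN, payload)
    std::vector<std::string> processed;

    int assignTN() { return nextTN++; }  // step 2: send a TN back

    void deliver(int tn, const std::string& msg) {
        pending.emplace_back(tn, msg);
        std::sort(pending.begin(), pending.end());
        // Step 4: process only messages that are next in TN sequence.
        while (!pending.empty() &&
               pending.front().first == (int)processed.size()) {
            processed.push_back(pending.front().second);
            pending.erase(pending.begin());
        }
    }
};

struct Sender {
    std::vector<std::pair<int, std::string>> log;  // survives receiver crash

    void send(Receiver& r, const std::string& msg) {
        int tn = r.assignTN();      // steps 1-2: request and receive a TN
        log.emplace_back(tn, msg);  // step 3: store TN with the log...
        r.deliver(tn, msg);         // ...then send the message
    }

    // Recovery: replay logged messages in TN order into a fresh receiver,
    // reproducing its state after the last processed message.
    void replay(Receiver& fresh) {
        std::sort(log.begin(), log.end());
        for (auto& e : log) {
            fresh.deliver(e.first, e.second);
            fresh.nextTN = std::max(fresh.nextTN, e.first + 1);
        }
    }
};
```

Because processing order is fixed by TNs, replaying the sender-side log is enough to drive the recreated objects back to a consistent state.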

Slide 15: Parallel Restart
- Message logging allows fault-free processors to continue with their execution
- However, sooner or later some processors start waiting for the crashed processor
- Virtualization allows us to move work from the restarted processor to the waiting processors
- Chares are restarted in parallel, so the restart cost can be reduced

Slide 16: Present Status
- Most of Charm++ has been ported; support for migration has not yet been implemented in the fault-tolerant protocol
- AMPI has been ported
- Parallel restart is not yet implemented

Slide 17: Recovery Performance
- Execution time with an increasing number of faults on 8 processors (checkpoint period 30 s)

Slide 18: Pros and Cons
- Low overhead for jobs with low communication
- Currently high overhead for jobs with high communication
- Should be tested with a high virtualization ratio to reduce the message-logging overhead

Slide 19: Processor Evacuation
- Modern hardware can be used to predict faults
- Requirements on the runtime system's response:
  - Low response time
  - No new processors should be required
  - Efficiency loss should be proportional to the loss in computational power

Slide 20: Solution
- Migrate Charm++ objects off the processor; this requires remapping the "home" PEs of objects
- Point-to-point message delivery continues to work efficiently
- Collective operations cope with the loss of processors: rewire the reduction tree around a warned processor
- Can deal with multiple simultaneous warnings
- Load balance after an evacuation
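The home-PE remapping can be sketched as a lookup that skips warned processors, so point-to-point delivery keeps finding objects without any replacement hardware. The modular hash and linear probing here are illustrative assumptions, not the actual Charm++ location manager:

```cpp
#include <vector>

// Map object IDs to home PEs, routing around evacuated processors.
struct HomeMap {
    int numPEs;
    std::vector<bool> evacuated;

    explicit HomeMap(int n) : numPEs(n), evacuated(n, false) {}

    void evacuate(int pe) { evacuated[pe] = true; }

    int homeOf(int objId) const {
        int pe = objId % numPEs;                       // static mapping
        while (evacuated[pe]) pe = (pe + 1) % numPEs;  // skip warned PEs
        return pe;
    }
};
```

Because every PE applies the same remapping rule after a warning, messages for an evacuated object's home are redirected consistently, including when several processors are warned at once.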

Slide 21: Rearranging the Reduction Tree
- Do not rewire the tree while a reduction is in progress: stop reductions, rewire the tree, then continue reductions
- Rewiring affects only the parent and children of the evacuated node
- It unbalances the tree; this could be solved by recreating the tree

Slide 22: Response Time
- Evacuation time for a Sweep3d execution on the 150^3 case
- Total of ~500 MB of data
- Pessimistic estimate of evacuation time

Slide 23: Performance after Evacuation
- Iteration time of Sweep3d on 32 processors for the 150^3 problem with 1 warning

Slide 24: Processor Utilization after Evacuation
- Iteration time of Sweep3d on 32 processors for the 150^3 problem with both processors on node 3 (processors 4 and 5) warned simultaneously

Slide 25: Conclusions
- Available in Charm++ and AMPI:
  - Checkpoint/restart (disk-based)
  - In-memory checkpoint/restart
  - Proactive fault tolerance
- Under development:
  - Sender-based message logging: dealing with migration and deletion
  - Parallel restart
- The abstraction layers in Charm++/AMPI make them well suited to implementing fault tolerance protocols