Fault Tolerance BOF


Fault Tolerance BOF
Possible CBHPC paper
  – Co-authors wanted
  – Tammy, Rob, Bruce, Daniel, Nanbor, Sameer, Jim, Doug, David
What infrastructure is needed to enable application-level FT in (component) applications?
  – Little experience with anything beyond checkpoint/restart (CR) in general
Assume an FT-friendly lower-level environment
  – Event service for awareness of faults
  – Ability to request certain behavior from lower-level software
      – Example: scheduler shouldn’t automatically kill an FT job

Use Cases
MCMD app, 1 node fails
  – Recovery: restart the failed task, or ignore the failure and go on (self-healing)
MCMD C/R

Checkpoint/Restart Taxonomy
System-level
  – E.g., BLCR, Cray XT (site option?), but not universally available
  – Store (complete) memory image to stable storage
  – Daemon schedules checkpoints
  – No application (or framework) involvement
  – Possibly app can request checkpoint
  – Potential problems: open files, driver state, in-flight messages, etc.
  – Component i/f to system C/R API
  – Component support for intelligent reduced checkpointing (MyState interface)

Application-level
  – Coordinated
  – Uncoordinated
  – Causal, Message Logging, etc.
  – Incremental checkpointing support
  – Capture component assembly
  – Checkpoint data component
      – In-memory (copy or RAID), disk, write-behind, etc., special system services
      – Quality of Fault Tolerance
      – What does interface look like? Like RMI
  – Components to detect faults
  – Reduced storage (satisfying stability criteria, but not all available data)
      – Checkpoint-free FT data holders
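One way to picture the "incremental checkpointing support" item is a checkpoint data holder that rewrites only the state entries that changed since the previous checkpoint. The sketch below is an illustration under stated assumptions, not anything proposed in the BOF: the class name `IncrementalCheckpointer`, the in-memory dictionary standing in for stable storage, and the use of a hash of the pickled bytes to detect changes are all hypothetical choices.

```python
import hashlib
import pickle
from typing import Any, Dict, List


class IncrementalCheckpointer:
    """Hypothetical checkpoint data holder that stores only changed entries."""

    def __init__(self) -> None:
        self._digests: Dict[str, str] = {}   # key -> digest of last stored blob
        self._store: Dict[str, bytes] = {}   # stand-in for stable storage

    def checkpoint(self, state: Dict[str, Any]) -> List[str]:
        """Store entries whose serialized form changed; return their keys."""
        written = []
        for key, value in state.items():
            blob = pickle.dumps(value)
            digest = hashlib.sha256(blob).hexdigest()
            # Assumption: identical values pickle to identical bytes, which
            # holds for simple data but is only a heuristic in general.
            if self._digests.get(key) != digest:
                self._store[key] = blob
                self._digests[key] = digest
                written.append(key)
        # Forget entries that no longer exist in the application state.
        for key in list(self._store):
            if key not in state:
                del self._store[key]
                del self._digests[key]
        return sorted(written)

    def restore(self) -> Dict[str, Any]:
        """Rebuild the full state from the most recent stored entries."""
        return {key: pickle.loads(blob) for key, blob in self._store.items()}
```

The first checkpoint writes every entry; later checkpoints write only what changed, which is the storage saving the bullet is after. A disk or write-behind backend would replace the dictionary without changing the interface.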

How to capture/restore state of blackbox components?
Components implement SaveYourself method
Central service invokes SaveYourself on all components that implement it
Serialized data sent to central service for storage
How to restart and restore state?
Not all components will implement SaveYourself
  – May not have state to store
  – May be an error
  – Check at start of execution and notify user
RestoreYourself
Specify state (in SIDL file?) and auto-gen serialize/unserialize methods
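The SaveYourself/RestoreYourself flow on this slide can be sketched as follows. This is a minimal illustration, assuming Python method names `save_yourself` and `restore_yourself` in place of the slide's SaveYourself and RestoreYourself; the `CheckpointService`, `Counter`, and `Logger` classes and the use of `pickle` for serialization are hypothetical and not part of any CCA framework.

```python
import pickle
from typing import Dict, List


def implements_save(component: object) -> bool:
    """True if the component offers both save and restore methods."""
    return all(
        callable(getattr(component, name, None))
        for name in ("save_yourself", "restore_yourself")
    )


class Counter:
    """Hypothetical stateful component that opts in to checkpointing."""

    def __init__(self) -> None:
        self.count = 0

    def save_yourself(self) -> bytes:
        return pickle.dumps({"count": self.count})

    def restore_yourself(self, blob: bytes) -> None:
        self.count = pickle.loads(blob)["count"]


class Logger:
    """Hypothetical component without save/restore methods."""


class CheckpointService:
    """Hypothetical central service holding serialized component state."""

    def __init__(self, components: Dict[str, object]) -> None:
        self.components = components
        self.store: Dict[str, bytes] = {}

    def unsupported(self) -> List[str]:
        """Check at start of execution: which components cannot be saved?"""
        return sorted(
            name for name, comp in self.components.items()
            if not implements_save(comp)
        )

    def checkpoint(self) -> None:
        """Invoke save_yourself on every component that implements it."""
        self.store = {
            name: comp.save_yourself()
            for name, comp in self.components.items()
            if implements_save(comp)
        }

    def restore(self) -> None:
        """Hand each saved blob back to the component that produced it."""
        for name, blob in self.store.items():
            self.components[name].restore_yourself(blob)
```

A framework would call `unsupported()` before the run starts and report the result to the user, since a missing method may mean either "no state to store" or an oversight. Generating the two methods from a state description in the SIDL file would remove the hand-written serialization shown in `Counter`.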

Another idea
Components register their state data with a central service
  – Breaks encapsulation and OO-ness, but not a major violation
  – Could be higher performance
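The registration alternative might look like the sketch below, in which a component hands the central service a reference to its state once and the service copies that data directly at checkpoint time. The `StateRegistry` and `Solver` names, and the choice of a plain dictionary as the registered state, are assumptions made for illustration.

```python
import copy
from typing import Any, Dict


class StateRegistry:
    """Hypothetical central service that reads registered state directly."""

    def __init__(self) -> None:
        self._live: Dict[str, Dict[str, Any]] = {}
        self._snapshot: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, state: Dict[str, Any]) -> None:
        # The service keeps a reference to the component's own data,
        # which is the encapsulation break noted on the slide.
        self._live[name] = state

    def checkpoint(self) -> None:
        # No per-component method calls: copy everything in one pass.
        self._snapshot = copy.deepcopy(self._live)

    def restore(self) -> None:
        # Restore in place so components keep their existing references.
        for name, saved in self._snapshot.items():
            live = self._live[name]
            live.clear()
            live.update(copy.deepcopy(saved))


class Solver:
    """Hypothetical component that registers its state at construction."""

    def __init__(self, registry: StateRegistry) -> None:
        self.state = {"iteration": 0, "residual": 1.0}
        registry.register("solver", self.state)

    def step(self) -> None:
        self.state["iteration"] += 1
        self.state["residual"] *= 0.5
```

Compared with a SaveYourself call per component, the service here never calls into component code during a checkpoint, which is where the possible performance gain comes from. The price is that the component's internal data layout becomes visible to the service.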

Recovery
Local restore vs global restore?
Rollback
Need to save execution path at which checkpoint was taken?
  – Put responsibility on app components?
  – Extend GoPort abstraction to include save/restore/restart?
Framework tells components to restart
  – What order? Shouldn’t matter
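To make the "extend GoPort" question concrete, here is a rough sketch of a go-style port that also exposes save, restore, and restart, with a framework loop that rolls back to the last checkpoint after a failure. Everything here is hypothetical: `RestartableGoPort`, `TimeStepper`, `NodeFailure`, the `safe_point` hook, and `run_with_rollback` are illustrative names, and the failure is injected artificially. The component stores its step counter in the checkpoint, which is one simple answer to saving the execution point at which the checkpoint was taken.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional


class NodeFailure(Exception):
    """Stand-in for a fault reported by the lower-level environment."""


class RestartableGoPort(ABC):
    """Hypothetical go-style port extended with save/restore/restart."""

    @abstractmethod
    def go(self) -> int: ...
    @abstractmethod
    def save(self) -> dict: ...
    @abstractmethod
    def restore(self, state: dict) -> None: ...
    @abstractmethod
    def restart(self) -> int: ...


class TimeStepper(RestartableGoPort):
    def __init__(self, total_steps: int, fail_at: Optional[int] = None) -> None:
        self.total_steps = total_steps
        self.fail_at = fail_at            # inject one failure at this step
        self.step = 0
        self.value = 0
        self.safe_point: Callable[[], None] = lambda: None

    def _run(self) -> int:
        while self.step < self.total_steps:
            if self.step == self.fail_at:
                self.fail_at = None       # fail only once
                raise NodeFailure(f"failure at step {self.step}")
            self.step += 1
            self.value += self.step
            self.safe_point()             # tell the framework state is consistent
        return 0

    def go(self) -> int:
        self.step, self.value = 0, 0
        return self._run()

    def restart(self) -> int:
        return self._run()                # resume from the restored step

    def save(self) -> dict:
        return {"step": self.step, "value": self.value}

    def restore(self, state: dict) -> None:
        self.step, self.value = state["step"], state["value"]


def run_with_rollback(port: TimeStepper, every: int, max_restarts: int = 3) -> int:
    """Framework side: checkpoint every `every` safe points, roll back on failure."""
    saved = port.save()
    calls = 0

    def safe_point() -> None:
        nonlocal saved, calls
        calls += 1
        if calls % every == 0:
            saved = port.save()

    port.safe_point = safe_point
    try:
        return port.go()
    except NodeFailure:
        pass
    for _ in range(max_restarts):
        port.restore(saved)
        try:
            return port.restart()
        except NodeFailure:
            continue
    raise RuntimeError("too many restarts")
```

With a single driver component this is a global restore. A multi-component version would restore every component from the same checkpoint before calling `restart`, which is why the restart order should not matter.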

Paper themes
How components can help with FT
  – Abstracting FT services into reusable components
  – Abstracting FT requirements into ports
How components make FT more complicated
  – Don’t have monolithic view of application (state)
Going beyond CR