FTOP: A library for fault tolerance in a cluster. R. Badrinath, Rakesh Gupta, Nisheeth Shrivastava



Why FTOP?
FTOP is a fault-tolerant environment built for PVM. It implements transparent fault tolerance using checkpointing and rollback recovery (C&RR) for PVM-based distributed applications. It handles in-transit messages, routing of messages to migrated tasks, and open files. It runs entirely at user level; no kernel changes are needed. It is intended to be extensible to other C&RR schemes.

FTOP assumptions
Assumes a homogeneous Linux cluster with PVM running on every host. One of the hosts is configured as the Global Resource Manager (GRM), which is assumed to be fault free.... (impl.!) Another host, also assumed to be fault free, is configured as the stable storage. The file system of the stable storage is NFS-mounted on all other hosts. (Using NFS has problems?) Assumes reliable FIFO channels between hosts in the cluster. Handles task/node crash failures only.

System and Fault Model
System consists of:
– A set of workstations.
– Connected through a high-speed LAN.
– Stable storage accessible to all workstations (assumed to be fault proof).
Fault can be:
– Network failure.
– Node failure.
Fail-stop model.

Implementation: Checkpointing
Non-blocking coordinated checkpointing.
What is checkpointed?
– The process context (pc value, registers, etc.).
– The process control state (pid, parent pid, fds of open files, etc.).
– The process address space (the text area, data area and stack area).
Where are the checkpoints stored?
– On stable storage (assumed to be failure proof).
– Two checkpoint files for each process.

How we checkpoint
The process context (pc, register values, etc.)
– Signal mechanism. On receiving a signal, a process saves its state on the stack, which can then be checkpointed. Use of setjmp() and longjmp().
The process memory regions
– "RO" sections are not checkpointed. Other sections are checkpointed by writing them to a file.
– The /proc file system provides section boundaries.
The process control state
– Written to a regular file named after the taskid.
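As a rough illustration of the memory-region step, the sketch below parses text in the format of the Linux /proc/<pid>/maps file and keeps only the writable regions. It assumes that the "RO" sections correspond to mappings without write permission; the slides do not give FTOP's actual selection logic. The names Region, writable_regions and SAMPLE are hypothetical, and the sample input is made up for the example.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int   # first address of the mapping
    end: int     # one past the last address
    perms: str   # permission string such as "rw-p"
    path: str    # backing file or pseudo-name, "" for anonymous mappings


def writable_regions(maps_text: str) -> List[Region]:
    """Return the mappings that carry write permission.

    Assumption: read-only mappings are the "RO" sections that need not
    be saved, because they can be reloaded from the executable.
    """
    regions = []
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, device, inode, [path]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_hex, end_hex = fields[0].split("-")
        perms = fields[1]
        path = fields[5].strip() if len(fields) > 5 else ""
        if "w" in perms:
            regions.append(Region(int(start_hex, 16), int(end_hex, 16), perms, path))
    return regions


# Made-up input in the /proc/<pid>/maps layout.
SAMPLE = """\
08048000-08056000 r-xp 00000000 03:0c 64593   /usr/bin/app
08056000-08058000 rw-p 0000d000 03:0c 64593   /usr/bin/app
08058000-0805b000 rwxp 00000000 00:00 0
bfffe000-c0000000 rwxp 00000000 00:00 0       [stack]
"""
```

On a Linux host the same function could be applied to the running process with writable_regions(open("/proc/self/maps").read()).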

Checkpoint Protocol
[Time diagram of the checkpointing protocol between GRM, PVMd and TASK. Labels, in order: SIGALARM, SM_CKPTSIGNAL, SIGUSR1, TM_CKPTDONE, SM_CKPTDONE, SM_CKPTCOMMIT, SIGUSR1.]
It is based on the two-phase commit protocol.

Checkpoint Protocol (contd.)
[Diagram: GRM, PVMd, and Tasks 1-3 on Host 1 and Host 2. Labels: SM_CKPTSIGNAL, SM_CKPTDONE, SM_CKPTCOMMIT, SIGUSR1, TM_CKPTDONE.]
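The two-phase structure of the protocol can be sketched as a small in-process simulation. The message and signal names come from the slides. The direction of each message (GRM to daemon, daemon to task, and back) is inferred from the label order and is an assumption, as is the abort path, which the slides do not describe. The classes Task and Daemon and the function run_checkpoint_round are hypothetical.

```python
from typing import List, Optional


class Task:
    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.tentative: Optional[int] = None  # checkpoint written, not yet committed
        self.committed: Optional[int] = None  # last committed checkpoint number

    def take_tentative_checkpoint(self, number: int) -> bool:
        """Phase 1 on the task: save state; True stands for TM_CKPTDONE."""
        if not self.healthy:
            return False
        self.tentative = number
        return True

    def commit(self) -> None:
        """Phase 2 on the task: the tentative checkpoint becomes permanent."""
        self.committed, self.tentative = self.tentative, None

    def abort(self) -> None:
        """Assumed abort path: drop the tentative checkpoint."""
        self.tentative = None


class Daemon:
    def __init__(self, tasks: List[Task]) -> None:
        self.tasks = tasks

    def on_sm_ckptsignal(self, number: int, log: List[str]) -> bool:
        log.append("SM_CKPTSIGNAL")
        all_done = True
        for task in self.tasks:
            log.append("SIGUSR1")
            if task.take_tentative_checkpoint(number):
                log.append("TM_CKPTDONE")
            else:
                all_done = False
        if all_done:
            log.append("SM_CKPTDONE")
        return all_done

    def on_sm_ckptcommit(self, log: List[str]) -> None:
        log.append("SM_CKPTCOMMIT")
        for task in self.tasks:
            log.append("SIGUSR1")
            task.commit()


def run_checkpoint_round(daemons: List[Daemon], number: int, log: List[str]) -> bool:
    """GRM side: collect votes from every daemon, then commit or abort."""
    log.append("SIGALARM")  # timer expiry at the GRM, spelled as on the slide
    votes = [d.on_sm_ckptsignal(number, log) for d in daemons]
    if not all(votes):
        for d in daemons:
            for task in d.tasks:
                task.abort()
        return False
    for d in daemons:
        d.on_sm_ckptcommit(log)
    return True
```

With one daemon and one task, the log produced by a successful round lists the labels in the same order as the time diagram.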

Other Messages
Two more messages are required for the consistency of the checkpoints taken:
– TM_Ckptsignal (from a task to its daemon)
– DM_Ckptsignal (from a daemon to another daemon)
To allow checkpointing to be partly non-blocking, these messages precede any application message sent while the checkpoint protocol is in progress, i.e. after a process has taken a checkpoint and before the checkpoint is committed.

Other Messages (contd.)
For TM_Ckptsignal, if the application message is destined to a local task, the daemon determines the status of the task and delivers the message only if the task has completed its checkpoint. If the application message is bound for a foreign task, the daemon sends DM_Ckptsignal to the destination before sending the application message.
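The daemon-side rule for a message preceded by TM_Ckptsignal can be sketched as follows. The slides state only the delivery condition for local tasks. Queueing a message until the destination finishes its checkpoint is an assumption made here, and the class DaemonSketch with its method names is hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple


class DaemonSketch:
    def __init__(self, local_tasks: Set[str]) -> None:
        self.local_tasks = set(local_tasks)
        self.checkpointed: Set[str] = set()              # local tasks done with this checkpoint
        self.held: Dict[str, Deque[str]] = {}            # messages waiting per destination
        self.delivered: List[Tuple[str, str]] = []       # (destination, payload)
        self.sent_remote: List[Tuple[str, str, str]] = []  # (kind, destination, payload)

    def on_task_checkpointed(self, task: str) -> None:
        """A local task finished its checkpoint: release anything held for it."""
        self.checkpointed.add(task)
        for payload in self.held.pop(task, deque()):
            self.delivered.append((task, payload))

    def on_tm_ckptsignal(self, dest: str, payload: str) -> None:
        """Handle an application message that was preceded by TM_Ckptsignal."""
        if dest in self.local_tasks:
            if dest in self.checkpointed:
                self.delivered.append((dest, payload))
            else:
                # Assumption: hold the message until the checkpoint completes.
                self.held.setdefault(dest, deque()).append(payload)
        else:
            # Foreign task: the marker travels ahead of the application message.
            self.sent_remote.append(("DM_Ckptsignal", dest, ""))
            self.sent_remote.append(("APP", dest, payload))
```

Because channels are assumed to be reliable and FIFO, sending the marker first is enough for the remote side to see it before the application message.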

Recovery
Fault detection
– Daemons detect node failure.
– They inform the GRM through an SM_HOSTX message.
Fault assessment
– GRM finds all the failed tasks.
Fault recovery
– GRM spawns the failed tasks on appropriate hosts. Each failed task starts from the beginning and then copies its last checkpoint into its own address space.

Recovery (contd.)
Recovering tasks
– The local state of a task is restored using setjmp() and longjmp() calls. setjmp() is called before checkpointing begins, and longjmp() is called after the address space is restored from the checkpoint file.
– Note the issues related to processes which started after the recovery-line and processes which exited normally after the recovery-line.

Recovery (contd.)
The GRM starts the recovery protocol and calculates the recovery-line. It transmits to every process the file-id of the last committed checkpoint (integer 1 or 2). Each process restores its checkpointed image. Processes are not allowed to send or receive application messages during the recovery stage.
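One way the two checkpoint files per process and the file-id (1 or 2) could fit together is sketched below. It assumes that the two files alternate, so a new checkpoint never overwrites the last committed one; the slides do not spell this out. In FTOP the committed file-id is transmitted by the GRM, whereas this sketch keeps it in a local object for brevity. The class TwoSlotCheckpoint, the file naming and demo are hypothetical.

```python
import os
import tempfile


class TwoSlotCheckpoint:
    def __init__(self, directory: str, task_id: str) -> None:
        self.directory = directory
        self.task_id = task_id
        self.committed_id = 0  # 0 means nothing has been committed yet

    def _path(self, file_id: int) -> str:
        return os.path.join(self.directory, f"{self.task_id}.ckpt{file_id}")

    def tentative_id(self) -> int:
        """The slot that does not hold the last committed checkpoint."""
        return 2 if self.committed_id == 1 else 1

    def write_tentative(self, image: bytes) -> int:
        file_id = self.tentative_id()
        with open(self._path(file_id), "wb") as f:
            f.write(image)
        return file_id

    def commit(self, file_id: int) -> None:
        self.committed_id = file_id

    def restore(self, file_id: int) -> bytes:
        with open(self._path(file_id), "rb") as f:
            return f.read()


def demo():
    with tempfile.TemporaryDirectory() as d:
        ck = TwoSlotCheckpoint(d, "task42")
        ck.commit(ck.write_tentative(b"state-A"))  # goes to file 1
        ck.commit(ck.write_tentative(b"state-B"))  # goes to file 2
        ck.write_tentative(b"state-C")             # overwrites file 1, never committed
        # A failure at this point rolls back to the committed file.
        return ck.committed_id, ck.restore(ck.committed_id)
```

In this sketch a crash during the third checkpoint leaves file 2 intact, so recovery still finds a complete committed image.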

Recovery Protocol
[Time diagram of the recovery protocol between GRM, PVMd and TASK. Labels, in order: HOSTX, SM_RECOVER, SIGUSR2, TM_RECOVERYDONE, SM_RECOVERYDONE, SM_RECOVERYCOMMIT, SIGUSR2.]

Other Issues
In-transit messages.
– Logging: reliable comm. model, part of checkpoint.
– Replaying: before future interaction.
Routing.
– Why a problem?
– Maintain route table: what to keep…
Open files.
– Why a problem?
– How to handle…
Reconnecting with daemon.

Handling Routing
The tid (task identifier) is used as the address of a message in PVM. Failed tasks get a new tid when they recover. Other tasks do not know about this change, which causes routing problems. A mapping table of the oldest and the most recent tid of each task is maintained. The header of each message is parsed; if the message is destined to one of the failed tasks, the address field is replaced with the most recent tid of that task.
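A minimal sketch of such a mapping table follows. It assumes that a task may fail more than once, so any earlier tid must resolve to the most recent one. The message is modelled as a plain dictionary with a "dst" field, which stands in for the real PVM message header. TidMap and reroute are hypothetical names.

```python
from typing import Dict


class TidMap:
    def __init__(self) -> None:
        self._latest: Dict[int, int] = {}  # oldest tid -> most recent tid
        self._oldest: Dict[int, int] = {}  # any later tid -> oldest tid

    def record_respawn(self, old_tid: int, new_tid: int) -> None:
        """Note that the task known as old_tid now runs under new_tid."""
        oldest = self._oldest.get(old_tid, old_tid)
        self._latest[oldest] = new_tid
        self._oldest[new_tid] = oldest

    def resolve(self, tid: int) -> int:
        """Return the most recent tid for a task; unknown tids map to themselves."""
        oldest = self._oldest.get(tid, tid)
        return self._latest.get(oldest, tid)


def reroute(message: dict, tids: TidMap) -> dict:
    """Return a copy of the message whose destination is the current tid."""
    rewritten = dict(message)
    rewritten["dst"] = tids.resolve(message["dst"])
    return rewritten
```

Keying the table on the oldest tid means that senders holding any stale tid, from the first or a later incarnation, are redirected with one lookup chain of fixed length.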

Handling Open files
lsof, a Linux utility, provides a list of all open files, their descriptors and modes. An lseek call provides the file pointer. All this information (file name, descriptor, mode and file pointer) is stored with the checkpoint image of the process. The state of the file is restored using this information at the time of recovery. It may be necessary to actually checkpoint the file content.
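The save and restore of a file descriptor's state can be sketched with standard POSIX calls exposed by Python's os and fcntl modules. This is an illustration only: the file name is passed in by the caller here, whereas FTOP obtains it from lsof, and the sketch does not save file contents. FileState, capture_file_state, restore_file_state and demo are hypothetical names.

```python
import fcntl
import os
import tempfile
from typing import NamedTuple


class FileState(NamedTuple):
    path: str    # file name
    fd: int      # descriptor number the application was using
    flags: int   # access mode (plus O_APPEND if it was set)
    offset: int  # file pointer at checkpoint time


def capture_file_state(fd: int, path: str) -> FileState:
    flags = fcntl.fcntl(fd, fcntl.F_GETFL) & (os.O_ACCMODE | os.O_APPEND)
    offset = os.lseek(fd, 0, os.SEEK_CUR)  # lseek by 0 reports the current position
    return FileState(path, fd, flags, offset)


def restore_file_state(state: FileState) -> int:
    fd = os.open(state.path, state.flags)
    if fd != state.fd:
        # Put the file back on the descriptor number the application expects.
        os.dup2(fd, state.fd)
        os.close(fd)
    os.lseek(state.fd, state.offset, os.SEEK_SET)
    return state.fd


def demo() -> bytes:
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.close(fd)
        fd = os.open(path, os.O_RDONLY)
        os.read(fd, 4)                    # advance the file pointer to offset 4
        state = capture_file_state(fd, path)
        os.close(fd)                      # stands in for the task failure
        fd = restore_file_state(state)
        data = os.read(fd, 3)             # reading resumes where it left off
        os.close(fd)
        return data
    finally:
        os.unlink(path)
```

If the file was modified after the checkpoint, restoring only the pointer is not enough, which is why the file content itself may also need to be checkpointed.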

Reconnecting with the Daemon
A task is connected to the virtual machine through the PVM daemon. A failed task, when spawned on a new host, needs to reconnect to a daemon. It connects to the new daemon through the Unix domain socket name advertised by the daemon in a host-specific file. It also cleans up old socket information.
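The reconnect step can be illustrated with ordinary Unix domain sockets. The sketch below is not PVM's actual mechanism or file layout: the advertisement file name, the socket name and the functions advertise, reconnect and demo are all hypothetical, and a single process plays both the daemon and the task.

```python
import os
import socket
import tempfile


def advertise(sock_path: str, advert_file: str) -> socket.socket:
    """Daemon side: listen on a Unix domain socket and publish its name in a file."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)
    server.listen(1)
    with open(advert_file, "w") as f:
        f.write(sock_path)
    return server


def reconnect(advert_file: str) -> socket.socket:
    """Task side: read the advertised socket name and connect to it."""
    with open(advert_file) as f:
        sock_path = f.read().strip()
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(sock_path)
    return client


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as d:
        advert = os.path.join(d, "daemon.addr")
        server = advertise(os.path.join(d, "daemon.sock"), advert)
        client = reconnect(advert)        # the recovered task finds the new daemon
        conn, _ = server.accept()
        client.sendall(b"hello")
        data = conn.recv(5)
        for s in (conn, client, server):
            s.close()
        return data
```

A recovered task would also discard whatever socket information it restored from its checkpoint, since that refers to the daemon on the failed host.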

Testing
Testing environment:
– The hosts: 3-5 Pentium III machines running Red Hat Linux 7.1.
– The channel: 100 Mbps Ethernet LAN.
Failure simulation: by removing a host from the virtual machine.
Test cases:
– Matrix multiplication.
– PVMPOV (a full-featured distributed ray tracer built on PVM).
– Others for correctness: simple file I/O, "ping-pong", etc.

Overheads
[Two plots: checkpointing overhead for the matrix multiplication program, and checkpointing overhead for the PVMPOV program. Axes: checkpointing interval (secs) against running time (secs). Values surviving from the plot: 10, 36; 20, 18; 30, 11; 40, 8; infinity, infinity.]

Conclusion and future work
FTOP builds fault tolerance into standard PVM while staying entirely at the user level. It is able to roll back open files and in-transit messages. In future, we wish to handle device association, which may require explicit OS support. We also intend to integrate well-known optimizations into the checkpointing protocol, and we aim to support other C&RR schemes.