Fault Tolerant Distributed Computing system.


Fundamentals
- What is a fault?
  - A fault is a blemish, weakness, or shortcoming of a particular hardware or software component.
  - Fault, error, and failure
- Why fault tolerance?
  - Availability, reliability, dependability, ...
- How to provide fault tolerance?
  - Replication
  - Checkpointing and message logging
  - Hybrid

Message Logging
- Tolerates crash failures.
- Each process periodically records its local state and logs the messages it receives afterwards.
- Once a crashed process recovers, its state must be consistent with the states of the other processes.
- Orphan processes: surviving processes whose states are inconsistent with the recovered state of a crashed process.
- Message logging protocols guarantee that upon recovery no process is an orphan.
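The checkpoint-plus-log idea can be illustrated with a minimal sketch. The class name `LoggedProcess` and the integer-sum state are assumptions chosen for illustration, and the sketch assumes the process is deterministic, so replaying the logged messages on top of the last checkpoint reproduces the pre-crash state.

```python
class LoggedProcess:
    """Toy process (hypothetical) whose state is the sum of received integers."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0   # last recorded local state
        self.log = []         # messages received since the checkpoint

    def receive(self, value):
        self.log.append(value)   # log the message, then apply it
        self.state += value

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()         # earlier messages are covered by the checkpoint

    def recover(self):
        """Rebuild the pre-crash state from the checkpoint plus the log."""
        self.state = self.checkpoint
        for value in self.log:
            self.state += value
        return self.state
```

In a real system the checkpoint and the log would live on stable storage or on other nodes; here they are ordinary attributes so the replay step is easy to follow.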

Message Logging Protocols: Pessimistic Message Logging
- Avoids the creation of orphans during execution.
- No process p sends a message m until it knows that all messages delivered before sending m are logged; recovery is quick.
- Can block a process for each message it receives, which reduces throughput.
- Allows processes to communicate only from recoverable states: any information that may be needed for recovery is synchronously logged to stable storage before the process is allowed to communicate.
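The send rule above can be sketched as follows. `PessimisticProcess` and its list-based "stable storage" are hypothetical stand-ins, not part of any protocol implementation; the point is only that `send` cannot complete while delivered messages remain unlogged.

```python
class PessimisticProcess:
    """Illustrative only: lists stand in for stable storage and the network."""

    def __init__(self):
        self.stable_log = []   # messages already on (simulated) stable storage
        self.unlogged = []     # delivered messages not yet logged
        self.sent = []

    def deliver(self, msg):
        self.unlogged.append(msg)

    def flush(self):
        # Synchronous write to stable storage (simulated).
        self.stable_log.extend(self.unlogged)
        self.unlogged.clear()

    def send(self, msg):
        # Pessimistic rule: block until every delivered message is logged.
        if self.unlogged:
            self.flush()
        self.sent.append(msg)
```

The blocking `flush` inside `send` is where the throughput cost mentioned above comes from.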

Message Logging Protocols: Optimistic Message Logging
- Takes appropriate actions during recovery to eliminate all orphans.
- Better performance during failure-free runs.
- Allows processes to communicate from non-recoverable states; failures may cause these states to be permanently unrecoverable, forcing rollback of any process that depends on such states.
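A minimal sketch of orphan detection during optimistic recovery, under an assumed bookkeeping scheme: each process tracks, per peer, the highest state-interval index of that peer its own state depends on. The function name `find_orphans` and the dictionary layout are illustrative assumptions.

```python
def find_orphans(dependencies, failed, recovered_index):
    """Return surviving processes that depend on a lost state of `failed`.

    dependencies: {process: {peer: highest state index of peer depended on}}
    recovered_index: last state index of `failed` that could be recovered
    """
    return sorted(
        p for p, dep in dependencies.items()
        if p != failed and dep.get(failed, 0) > recovered_index
    )
```

For example, if q can only be recovered up to state 4, a process that depends on q's state 5 is an orphan and must roll back, while one that depends on state 3 is unaffected.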

Causal Message Logging
- No orphans when failures happen, and processes are not blocked when failures do not occur.
- Weakens the condition imposed by pessimistic protocols.
- Allows the possibility that the state from which a process communicates is unrecoverable because of a failure, but only if this does not affect consistency.
- Appends to every communication the information needed to recover the state from which the communication originates; this information is replicated in the memory of the processes that causally depend on the originating state.
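The piggybacking step can be sketched as below. The class `CausalProcess`, the message dictionary, and the `(source, dest, delivery_order)` tuple used as a determinant are assumptions made for illustration only.

```python
class CausalProcess:
    """Illustrative only: determinants are kept in volatile memory."""

    def __init__(self, name):
        self.name = name
        self.determinants = set()   # (source, dest, delivery_order) tuples
        self.delivered = 0

    def send(self, dest_name, payload):
        # Attach every determinant this state causally depends on.
        return {
            "source": self.name,
            "dest": dest_name,
            "payload": payload,
            "piggyback": set(self.determinants),
        }

    def receive(self, msg):
        self.delivered += 1
        self.determinants |= msg["piggyback"]
        # Record the determinant of this delivery itself.
        self.determinants.add((msg["source"], self.name, self.delivered))
```

If a sends to b and b then sends to c, process c ends up holding the determinant of b's delivery, so that delivery can be replayed if b crashes.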

KAN: A Reliable Distributed Object System
- Developed at UC Santa Barbara.
- Project goal: language support for parallelism and distribution
  - Transparent location/migration/replication
  - Optimized method invocation
  - Fault tolerance
  - Composition and proof reuse

System Description
[Figure: Kan source, Kan compiler, Java bytecode + Kan run-time libraries, three JVMs, UNIX sockets]

Fault Tolerance in Kan
- Log-based forward recovery scheme:
  - The log of recovery information for a node is maintained externally on other nodes.
  - Failed nodes are recovered to their pre-failure states, and correct nodes keep the states they had at the time of the failures.
- Only node crash failures are considered:
  - A processor stops taking steps, and failures are eventually detected.

Basic Architecture of the Fault Tolerance Scheme
[Figure: physical node i hosting logical nodes x and y, with fault detector, failure handler, request handler, communication layer, and external log; IP address; network]

Logical Ring
- A logical ring is used to minimize the need for global synchronization and recovery.
- The ring is only used for logging (remote method invocations).
- Two parts:
  - A static part containing the active correct nodes. It has a leader and a sense of direction: upstream and downstream.
  - A dynamic part containing nodes that are trying to join the ring.
- A logical node is logged at the next T physical nodes in the ring, where T is the maximum number of node failures to tolerate.
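The placement rule in the last bullet can be sketched as a small helper. The function name `log_sites`, the list representation of the ring, and the wrap-around from the last node back to the first are assumptions, not details confirmed by the slides.

```python
def log_sites(ring, host, t):
    """Return the t physical nodes that follow `host` in ring order.

    ring: physical nodes in downstream order (assumed to wrap around)
    host: physical node on which the logical node resides
    t:    maximum number of node failures to tolerate
    """
    if t >= len(ring):
        raise ValueError("need more than t physical nodes in the ring")
    start = ring.index(host)
    return [ring[(start + k) % len(ring)] for k in range(1, t + 1)]
```

With four physical nodes and T = 2, a logical node hosted on the third node would be logged at the fourth and, by wrap-around, the first.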

Logical Ring Maintenance
Each node i participating in the protocol maintains the following variables:
- Failed_i(j): true if i has detected the failure of j
- Map_i(x): the physical node on which logical node x resides
- Leader_i: i's view of the leader of the ring
- View_i: i's view of the logical ring (membership and order)
- Pending_i: the set of physical nodes that i suspects of failing
- Recovery_count_i: the number of logical nodes that need to be recovered
- Ready_i: records whether i is active. There is an initial set of ready nodes; new nodes become ready when they are linked into the ring.

Failure Handling
When node i is informed of the failure of node j:
- If every node upstream of i has failed, then i must become the new leader. It remaps all logical nodes from the upstream physical nodes and informs the other correct nodes by sending a remap message. It then recovers the logical nodes.
- If the leader has failed but there is some upstream node k that will become the new leader, then i just updates its map and leader variables to reflect the new situation.
- If the failed node j is upstream of i, then i just updates its map. If i is the next downstream node from j, it also recovers the logical nodes from j.
- If j is downstream of i and there is some node k downstream of j, then i just updates its map.
- If j is downstream of i and there is no node downstream of j, then i waits for the leader to update the map.
- If i is the leader and must recover j, then i changes its map, sends a remap message to change the correct nodes' maps, and recovers all logical nodes that are mapped locally.
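The case analysis above can be sketched as a decision function. This is one possible reading of the rules, not the Kan implementation: the function name `on_failure`, the action labels, and the representation of the ring as a leader-first list are all assumptions, and the sketch assumes the leader "must recover j" exactly when j has no live downstream node.

```python
def on_failure(view, failed, i, j):
    """Return the actions node i takes on learning that node j failed.

    view:   physical nodes in ring order, original leader first (assumed)
    failed: set of nodes known to have failed, including j
    """
    pos = {n: k for k, n in enumerate(view)}
    live_upstream_of_i = [n for n in view[:pos[i]] if n not in failed]
    live_downstream_of_j = [n for n in view[pos[j] + 1:] if n not in failed]
    j_was_leader = all(n in failed for n in view[:pos[j]])

    if not live_upstream_of_i:
        # i is the most upstream live node: it is, or now becomes, the leader.
        if pos[j] < pos[i]:
            return ["become_leader", "update_map", "send_remap", "recover"]
        if not live_downstream_of_j:
            return ["update_map", "send_remap", "recover"]
        return ["update_map"]

    actions = ["update_leader"] if j_was_leader else []
    if pos[j] < pos[i]:
        actions.append("update_map")
        if live_downstream_of_j[0] == i:   # i is j's next live downstream node
            actions.append("recover")
    elif live_downstream_of_j:
        actions.append("update_map")
    else:
        actions.append("wait_for_leader")
    return actions
```

For a ring a, b, c, d with a as leader: if a fails, b becomes the leader and recovers a's logical nodes, while c only updates its leader and map variables.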

Physical Node and Leader Recovery
- When a physical node comes back up:
  - It sends a join message to the leader.
  - The leader tries to link this node into the ring:
    - Acquire <-> Grant
    - Add, Ack_add
    - Release
- When the leader fails, the next downstream node in the ring becomes the new leader.

AQuA Fault Tolerance
- AQuA: Adaptive Quality of Service Availability
- Developed at UIUC and BBN.
- Goal: allow distributed applications to request and obtain a desired level of availability.
- Fault tolerance:
  - Replication
  - Reliable messaging

Features of AQuA
- Uses the QuO runtime to process and make availability requests.
- Uses the Proteus dependability manager to configure the system in response to faults and availability requests.
- Uses Ensemble to provide group communication services.
- Provides a CORBA interface to application objects through the AQuA gateway.

Proteus Functionality
- How to provide fault tolerance for an application:
  - Style of replication (active, passive)
  - Voting algorithm to use
  - Degree of replication
  - Type of faults to tolerate (crash, value, or time)
  - Location of replicas
- How to implement the chosen fault tolerance scheme:
  - Dynamic configuration modification
  - Start/kill replicas, activate/deactivate monitors and voters

Group Structure
For reliable multicast and point-to-point communication:
- Replication groups
- Connection groups
- Proteus Communication Service Group
  - For the replicated Proteus manager replicas and the objects that communicate with the manager
  - e.g. notification of a view change, a new QuO request
  - Ensures that all replica managers receive the same information
- Point-to-point groups
  - Proteus manager to object factory

AQuA Architecture

Fault Model, Detection, and Handling
- Object fault model:
  - Object crash failure: occurs when an object stops sending out messages; its internal state is lost.
    - The crash failure of an object is due to the crash of at least one of the elements composing the object.
  - Value faults: a message arrives in time but with the wrong content (caused by the application or the QuO runtime).
    - Detected by the voter
  - Time faults
    - Detected by the monitor
- Leaders report faults to Proteus; Proteus kills faulty objects if necessary and generates new objects.
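Value-fault detection by a voter can be illustrated with a simple majority vote over replica replies. This is a generic sketch, not AQuA's voter: the slides say the voting algorithm is configurable, and the function name `vote` and the reply dictionary are assumptions.

```python
from collections import Counter


def vote(replies):
    """Majority vote over replica replies.

    replies: {replica name: reply value} (values must be hashable)
    Returns (majority value, sorted list of replicas that disagreed).
    """
    counts = Counter(replies.values())
    value, n = counts.most_common(1)[0]
    if n <= len(replies) // 2:
        raise ValueError("no majority among replies")
    suspects = sorted(r for r, v in replies.items() if v != value)
    return value, suspects
```

The replicas returned as suspects are the ones a leader would report to the dependability manager as having a value fault.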

AQuA Gateway Structure

Egida
- Developed at UT Austin.
- An object-oriented, extensible toolkit for low-overhead fault tolerance.
- Provides a library of objects that can be used to compose log-based rollback-recovery protocols.
- Provides a specification language to express arbitrary rollback-recovery protocols.

Log-based Rollback Recovery
- Checkpointing
  - Independent, coordinated, or induced by specific patterns of communication
- Message logging
  - Pessimistic, optimistic, causal

Core Building Blocks
- Almost all log-based rollback-recovery protocols share event-driven structures.
- The common events are:
  - Non-deterministic events
    - Orphans, determinant
  - Dependency-generating events
  - Output-commit events
  - Checkpointing events
  - Failure-detection events

A Grammar for Specifying Rollback-Recovery Protocols
  Protocol := <non-det-event-stmt>* <output-commit-event-stmt>* <dep-gen-event-stmt> <ckpt-stmt>opt <recovery-stmt>opt
  <non-det-event-stmt> := <event> : determinant : <determinant-structure> <Log <event-info-list> <how-to-log> on <stable-storage>>opt
  <output-commit-event-stmt> := <output-commit-proto> output commit on <event-list>
  <event> := send | receive | read | write
  <determinant-structure> := {source, sesn, dest, dest}
  <output-commit-proto> := independent | co-ordinated
  <how-to-log> := synchronously | asynchronously
  <stable-storage> := local disk | volatile memory of self
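To make the terminal alternatives of the grammar concrete, the sketch below encodes them as sets and checks one non-deterministic event statement against them. It is not Egida's API or its specification language; the class `NonDetEventStmt` and its fields are hypothetical and cover only the `<event>`, `<how-to-log>`, and `<stable-storage>` choices listed above.

```python
from dataclasses import dataclass

# Terminal alternatives copied from the grammar.
EVENTS = {"send", "receive", "read", "write"}
HOW_TO_LOG = {"synchronously", "asynchronously"}
STABLE_STORAGE = {"local disk", "volatile memory of self"}


@dataclass
class NonDetEventStmt:
    """Hypothetical in-memory form of a <non-det-event-stmt>."""
    event: str
    how_to_log: str
    stable_storage: str

    def is_valid(self):
        # Each field must be one of the alternatives the grammar allows.
        return (
            self.event in EVENTS
            and self.how_to_log in HOW_TO_LOG
            and self.stable_storage in STABLE_STORAGE
        )
```

For instance, logging `receive` events `synchronously` on `local disk` corresponds to a pessimistic receiver-based configuration and passes the check, while an unknown logging mode does not.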

Egida Modules
- EventHandler
- Determinant
- HowToOutputCommit
- LogEventDeterminant
- LogEventInfo
- HowToLog
- WhereToLog
- StableStorage
- VolatileStorage
- Checkpointing
- …