Classifying fault-tolerance Masking tolerance. Application runs as it is. The failure does not have a visible impact. All properties (both liveness & safety)

Slides:



Advertisements
Similar presentations
CSE 486/586, Spring 2014 CSE 486/586 Distributed Systems Reliable Multicast Steve Ko Computer Sciences and Engineering University at Buffalo.
Advertisements

Leader Election Let G = (V,E) define the network topology. Each process i has a variable L(i) that defines the leader.  i,j  V  i,j are non-faulty.
6.852: Distributed Algorithms Spring, 2008 Class 7.
Sliding window protocol The sender continues the send action without receiving the acknowledgements of at most w messages (w > 0), w is called the window.
Self-stabilizing Distributed Systems Sukumar Ghosh Professor, Department of Computer Science University of Iowa.
Improving TCP Performance over Mobile Ad Hoc Networks by Exploiting Cross- Layer Information Awareness Xin Yu Department Of Computer Science New York University,
Faults and fault-tolerance One of the selling points of a distributed system is that the system will continue to perform even if some components / processes.
Distributed systems Module 2 -Distributed algorithms Teaching unit 1 – Basic techniques Ernesto Damiani University of Bozen Lesson 3 – Distributed Systems.
Department of Electronic Engineering City University of Hong Kong EE3900 Computer Networks Transport Protocols Slide 1 Transport Protocols.
Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.
R R R Fault Tolerant Computing. R R R Acknowledgements The following lectures are based on materials from the following sources; –S. Kulkarni –J. Rushby.
Error Checking continued. Network Layers in Action Each layer in the OSI Model will add header information that pertains to that specific protocol. On.
Distributed Mutual Exclusion
Faults and fault-tolerance
ARQ Mechanisms Rudra Dutta ECE/CSC Fall 2010, Section 001, 601.
Wireless TCP Prasun Dewan Department of Computer Science University of North Carolina
Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey (Paper by X. Défago, A. Schiper, and P. Urbán) ACM computing Surveys, Vol. 36,No 4,
Review for Exam 2. Topics included Deadlock detection Resource and communication deadlock Graph algorithms: Routing, spanning tree, MST, leader election.
CS425 /CSE424/ECE428 – Distributed Systems – Fall 2011 Material derived from slides by I. Gupta, M. Harandi, J. Hou, S. Mitra, K. Nahrstedt, N. Vaidya.
Coordination and Agreement. Topics Distributed Mutual Exclusion Leader Election.
12. Recovery Study Meeting M1 Yuuki Horita 2004/5/14.
Group Communication Group oriented activities are steadily increasing. There are many types of groups:  Open and Closed groups  Peer-to-peer and hierarchical.
Faults and fault-tolerance
CSE 461 University of Washington1 Where we are in the Course Finishing off the Link Layer! – Builds on the physical layer to transfer frames over connected.
Program correctness The State-transition model The set of global states = so x s1 x … x sm {sk is the set of local states of process k} S0 ---> S1 --->
Hwajung Lee. A group is a collection of users sharing some common interest.Group-based activities are steadily increasing. There are many types of groups:
Program correctness The State-transition model A global states S  s 0 x s 1 x … x s m {s k = set of local states of process k} S0  S1  S2  Each state.
Hwajung Lee. One of the selling points of a distributed system is that the system will continue to perform even if some components / processes fail.
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
CS603 Fault Tolerance - Communication April 17, 2002.
Sliding window protocol The sender continues the send action without receiving the acknowledgements of at most w messages (w > 0), w is called the window.
Physical clock synchronization Question 1. Why is physical clock synchronization important? Question 2. With the price of atomic clocks or GPS coming down,
Hwajung Lee. One of the selling points of a distributed system is that the system will continue to perform even if some components / processes fail.
1 Computer Networks Congestion Avoidance. 2 Recall TCP Sliding Window Operation.
TCP OVER ADHOC NETWORK. TCP Basics TCP (Transmission Control Protocol) was designed to provide reliable end-to-end delivery of data over unreliable networks.
Hwajung Lee. Mutual Exclusion CS p0 p1 p2 p3 Some applications are:  Resource sharing  Avoiding concurrent update on shared data  Controlling the.
Building Dependable Distributed Systems, Copyright Wenbing Zhao
Hwajung Lee. Mutual Exclusion CS p0 p1 p2 p3 Some applications are: 1. Resource sharing 2. Avoiding concurrent update on shared data 3. Controlling the.
SysRép / 2.5A. SchiperEté The consensus problem.
CS 542: Topics in Distributed Systems Self-Stabilization.
ITEC452 Distributed Computing Lecture 2 – Part 2 Models in Distributed Systems Hwajung Lee.
Failure detection The design of fault-tolerant systems will be easier if failures can be detected. Depends on the 1. System model, and 2. The type of failures.
Alternating Bit Protocol S R ABP is a link layer protocol. Works on FIFO channels only. Guarantees reliable message delivery with a 1-bit sequence number.
Lecture 12-1 Computer Science 425 Distributed Systems CS 425 / CSE 424 / ECE 428 Fall 2012 Indranil Gupta (Indy) October 4, 2012 Lecture 12 Mutual Exclusion.
A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.
Faults and fault-tolerance One of the selling points of a distributed system is that the system will continue to perform even if some components / processes.
© Janice Regan, CMPT 128, CMPT 371 Data Communications and Networking Congestion Control 0.
Hwajung Lee. Mutual Exclusion CS p0 p1 p2 p3 Some applications are:  Resource sharing  Avoiding concurrent update on shared data  Controlling the.
1 The utopia protocol  Unrealistic assumptions: –processing time ignored –infinite buffer space available –simplex: data transmitted in one direction.
TCP/IP1 Address Resolution Protocol Internet uses IP address to recognize a computer. But IP address needs to be translated to physical address (NIC).
Computer Networking Lecture 16 – Reliable Transport.
Group Communication A group is a collection of users sharing some common interest.Group-based activities are steadily increasing. There are many types.
Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.
1 Fault Tolerance and Recovery Mostly taken from
Chapter 3: The Data Link Layer –to achieve reliable, efficient communication between two physically connected machines. –Design issues: services interface.
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Faults and fault-tolerance
Faults and fault-tolerance
Faults and fault-tolerance
Alternating Bit Protocol
Faults and fault-tolerance
Faults and fault-tolerance
Mutual Exclusion CS p0 CS p1 p2 CS CS p3.
EEC 688/788 Secure and Dependable Computing
Physical clock synchronization
ITEC452 Distributed Computing Lecture 7 Mutual Exclusion
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
CSE 486/586 Distributed Systems Reliable Multicast --- 1
M. Mock and E. Nett and S. Schemmer
Presentation transcript:

Classifying fault-tolerance Masking tolerance. Application runs as it is. The failure does not have a visible impact. All properties (both liveness & safety) continue to hold. Non-masking tolerance. Safety property is temporarily affected, but not liveness. Example 1. Clocks lose synchronization, but recover soon thereafter. Example 2. Multiple processes temporarily enter their critical sections, but thereafter, the normal behavior is restored. Backward error-recovery vs. forward error-recovery

Backward vs. forward error recovery Backward error recovery When safety property is violated, the computation rolls back and resumes from a previous correct state. time rollback Forward error recovery Computation does not care about getting the history right, but moves on, as long as eventually the safety property is restored. True for self-stabilizing systems.

Classifying fault-tolerance Fail-safe tolerance Given safety predicate is preserved, but liveness may be affected Example. Due to failure, no process can enter its critical section for an indefinite period. In a traffic crossing, failure changes the traffic in both directions to red. Graceful degradation Application continues, but in a “degraded” mode. Much depends on what kind of degradation is acceptable. Example. Consider message-based mutual exclusion. Processes will enter their critical sections, but not in timestamp order.

Failure detection The design of fault-tolerant systems will be easier if failures can be detected. Depends on the 1. System model, and 2. the type of failures. Asynchronous systems are more tricky. We first focus on synchronous systems only

Detection of crash failures Failure can be detected using heartbeat messages (periodic “I am alive” broadcast) and timeout - if processors speed has a known lower bound - channel delays have a known upper bound.

Detection of omission failures For FIFO channels: Use sequence numbers with messages. Non-FIFO channels and bounded propagation delay - use timeout What about non-FIFO channels for which the upper bound of the delay is not known? Use unbounded sequence numbers and acknowledgments. But acknowledgments may be lost too, causing unnecessary re-transmission of messages :- ( Let us look how a real protocol deals with omission ….

Tolerating crash failures Triple modular redundancy (TMR) for masking any single failure. N-modular redundancy masks up to m failures, when N = 2m +1 Take a vote What if the voting unit fails?

Tolerating omission failures A central issue in networking A B router Routers may drop messages, but reliable end-to-end transmission is an important requirement. This implies, the communication must tolerate Loss, Duplication, and Re-ordering of messages

Stenning’s protocol {program for process S} define ok : boolean; next : integer; initially next = 0, ok = true, both channels are empty; dook  send (m[next], next); ok:= false ÿ(ack, next) is received  ok:= true; next := next + 1  timeout (R,S)  send (m[next], next) od {program for process R} define r : integer; initially r = 0; do(m[ ], s) is received  s = r  accept the message; send (ack, r); r:= r+1  (m[ ], s) is received  s≠r   (ack, r-1) od Sender S Receiver R m[0], 0 ack

Observations on Stenning’s protocol Both messages and acks may be lost Q. Why is the last ack reinforced by R when s≠r? A. Needed to guarantee progress. Progress is guaranteed, but the protocol is inefficient due to low throughput. Sender S Receiver R m[0], 0 ack

Sliding window protocol The sender continues the send action without receiving the acknowledgements of at most w messages (w > 0), w is called the window size.

Sliding window protocol {program for process S} define next, last, w : integer; initially next = 0, last = -1, w > 0 do last+1 ≤ next ≤ last + w  send (m[next], next); next := next + 1 ÿ(ack, j) is received  if j > last  last := j  j ≤ last  skip fi ÿtimeout (R,S)  next := last+1 {retransmission begins} od {program for process R} define j : integer; initially j = 0; do(m[next], next) is received  if j = next  accept message; send (ack, j); j:= j+1  j ≠ next  send (ack, j-1) fi; od