CS 542: Topics in Distributed Systems Self-Stabilization.

Slides:



Advertisements
Similar presentations
CS3771 Today: deadlock detection and election algorithms  Previous class Event ordering in distributed systems Various approaches for Mutual Exclusion.
Advertisements

Chapter 6 - Convergence in the Presence of Faults1-1 Chapter 6 Self-Stabilization Self-Stabilization Shlomi Dolev MIT Press, 2000 Shlomi Dolev, All Rights.
Program correctness The State-transition model A global state S  s 0 x s 1 x … x s m {s k = local state of process k} S0  S1  S2  … Each state transition.
Self Stabilizing Algorithms for Topology Management Presentation: Deniz Çokuslu.
Self-stabilizing Distributed Systems Sukumar Ghosh Professor, Department of Computer Science University of Iowa.
Self-Stabilization in Distributed Systems Barath Raghavan Vikas Motwani Debashis Panigrahi.
CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS Fall 2011 Prof. Jennifer Welch CSCE 668 Self Stabilization 1.
UBE529 Distributed Algorithms Self Stabilization.
Introduction to Self-Stabilization Stéphane Devismes.
Safety and Liveness. Defining Programs Variables with respective domain –State space of the program Program actions –Guarded commands Program computation.
From Self- to Snap- Stabilization Alain Cournier, Stéphane Devismes, and Vincent Villain SSS’2006, November 17-19, Dallas (USA)
CPSC 668Set 9: Fault Tolerant Consensus1 CPSC 668 Distributed Algorithms and Systems Fall 2006 Prof. Jennifer Welch.
CPSC 668Set 9: Fault Tolerant Consensus1 CPSC 668 Distributed Algorithms and Systems Spring 2008 Prof. Jennifer Welch.
CS5371 Theory of Computation
CPSC 668Self Stabilization1 CPSC 668 Distributed Algorithms and Systems Spring 2008 Prof. Jennifer Welch.
LSRP: Local Stabilization in Shortest Path Routing Anish Arora Hongwei Zhang.
CS294, YelickSelf Stabilizing, p1 CS Self-Stabilizing Systems
Design of Fault Tolerant Data Flow in Ptolemy II Mark McKelvin EE290 N, Fall 2004 Final Project.
CS294, YelickConsensus, p1 CS Consensus
Self-Stabilization An Introduction Aly Farahat Ph.D. Student Automatic Software Design Lab Computer Science Department Michigan Technological University.
Distributed systems Module 2 -Distributed algorithms Teaching unit 1 – Basic techniques Ernesto Damiani University of Bozen Lesson 2 – Distributed Systems.
CS 603 Communication and Distributed Systems April 15, 2002.
Chapter 3 - Motivating Self-Stabilization3-1 Chapter 3 Self-Stabilization Self-Stabilization Shlomi Dolev MIT Press, 2000 Shlomi Dolev, All Rights Reserved.
On Probabilistic Snap-Stabilization Karine Altisen Stéphane Devismes University of Grenoble.
Self stabilizing Linux Kernel Mechanism Doron Mishali, Alex Plits Supervisors: Prof. Shlomi Dolev Dr. Reuven Yagel.
Selected topics in distributed computing Shmuel Zaks
On Probabilistic Snap-Stabilization Karine Altisen Stéphane Devismes University of Grenoble.
Distributed Algorithms – 2g1513 Lecture 9 – by Ali Ghodsi Fault-Tolerance in Distributed Systems.
Fault-containment in Weakly Stabilizing Systems Anurag Dasgupta Sukumar Ghosh Xin Xiao University of Iowa.
Snap-Stabilizing PIF and Useless Computations Alain Cournier, Stéphane Devismes, and Vincent Villain ICPADS’2006, July , Minneapolis (USA)
CS4231 Parallel and Distributed Algorithms AY 2006/2007 Semester 2 Lecture 10 Instructor: Haifeng YU.
Diffusing Computation. Using Spanning Tree Construction for Solving Leader Election Root is the leader In the presence of faults, –There may be multiple.
Review for Exam 2. Topics included Deadlock detection Resource and communication deadlock Graph algorithms: Routing, spanning tree, MST, leader election.
Defining Programs, Specifications, fault-tolerance, etc.
Fault-containment in Weakly Stabilizing Systems Anurag Dasgupta Sukumar Ghosh Xin Xiao University of Iowa.
By J. Burns and J. Pachl Based on a presentation by Irina Shapira and Julia Mosin Uniform Self-Stabilization 1 P0P0 P1P1 P2P2 P3P3 P4P4 P5P5.
Faults and fault-tolerance
Dissecting Self-* Properties Andrew Berns & Sukumar Ghosh University of Iowa.
Program correctness The State-transition model The set of global states = so x s1 x … x sm {sk is the set of local states of process k} S0 ---> S1 --->
Self Stabilizing Smoothing and Counting Maurice Herlihy, Brown University Srikanta Tirthapura, Iowa State University.
Autonomic distributed systems. 2 Think about this Human population x10 9 computer population.
Program correctness The State-transition model A global states S  s 0 x s 1 x … x s m {s k = set of local states of process k} S0  S1  S2  Each state.
Diffusing Computation. Using Spanning Tree Construction for Solving Leader Election Root is the leader In the presence of faults, –There may be multiple.
Hwajung Lee. One of the selling points of a distributed system is that the system will continue to perform even if some components / processes fail.
Stabilization Presented by Xiaozhou David Zhu. Contents What-is Motivation 3 Definitions An Example Refinements Reference.
Weak vs. Self vs. Probabilistic Stabilization Stéphane Devismes (CNRS, LRI, France) Sébastien Tixeuil (LIP6-CNRS & INRIA, France) Masafumi Yamashita (Kyushu.
University of Iowa1 Self-stabilization. The University of Iowa2 Man vs. machine: fact 1 An average household in the developed countries has 50+ processors.
Self-stabilization. What is Self-stabilization? Technique for spontaneous healing after transient failure or perturbation. Non-masking tolerance (Forward.
1 Fault tolerance in distributed systems n Motivation n robust and stabilizing algorithms n failure models n robust algorithms u decision problems u impossibility.
Fault tolerance and related issues in distributed computing Shmuel Zaks GSSI - Feb
Self-stabilization. Technique for spontaneous healing after transient failure or perturbation. Non-masking tolerance (Forward error recovery). Guarantees.
Hwajung Lee.  Technique for spontaneous healing.  Forward error recovery.  Guarantees eventual safety following failures. Feasibility demonstrated.
DISTRIBUTED ALGORITHMS Spring 2014 Prof. Jennifer Welch Set 9: Fault Tolerant Consensus 1.
Program Correctness. The designer of a distributed system has the responsibility of certifying the correctness of the system before users start using.
“Towards Self Stabilizing Wait Free Shared Memory Objects” By:  Hopeman  Tsigas  Paptriantafilou Presented By: Sumit Sukhramani Kent State University.
Superstabilizing Protocols for Dynamic Distributed Systems Authors: Shlomi Dolev, Ted Herman Presented by: Vikas Motwani CSE 291: Wireless Sensor Networks.
Faults and fault-tolerance One of the selling points of a distributed system is that the system will continue to perform even if some components / processes.
ITEC452 Distributed Computing Lecture 15 Self-stabilization Hwajung Lee.
Design of Tree Algorithm Objectives –Learning about satisfying safety and liveness of a distributed program –Apply the method of utilizing invariants and.
Classifying fault-tolerance Masking tolerance. Application runs as it is. The failure does not have a visible impact. All properties (both liveness & safety)
CSE-591: Term Project Self-stabilizing Network Algorithms by Tridib Mukherjee ASU ID :
Computer Science 425/ECE 428/CSE 424 Distributed Systems (Fall 2009) Lecture 20 Self-Stabilization Reading: Chapter from Prof. Gosh’s book Klara Nahrstedt.
Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.
Faults and fault-tolerance
第1部: 自己安定の緩和 すてふぁん どぅゔぃむ ポスドク パリ第11大学 LRI CNRS あどばいざ: せばすちゃ てぃくそい
CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS
Faults and fault-tolerance
CS60002: Distributed Systems
CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS
Presentation transcript:

CS 542: Topics in Distributed Systems Self-Stabilization

Motivation As the number of computing elements increases in distributed systems failures become more common We desire that fault-tolerance should be automatic, without external intervention Two kinds of fault tolerance –masking: application layer does not see faults, e.g., redundancy and replication –non-masking: system deviates, deviation is detected and then corrected: e.g., roll back and recovery Self-stabilization is a general technique for non-masking distributed systems We deal only with transient failures which corrupt data, but not crash-stop failures

Self-stabilization Technique for spontaneous healing Guarantees eventual safety following failures Feasibility demonstrated by Dijkstra (CACM `74) E. Dijkstra

Self-stabilizing systems Recover from any initial configuration to a legitimate configuration in a bounded number of steps, as long as the processes are not further corrupted Assumption: Failures affect the state (and data) but not the program code

Self-stabilizing systems The ability to spontaneously recover from any initial state implies that no initialization is ever required. Such systems can be deployed ad hoc, and are guaranteed to function properly within bounded number of steps Guarantees-fault tolerance when the mean time between failures (MTBF) >> mean time to recovery (MTTR)

Self-stabilizing systems Self-stabilizing systems exhibit non-masking fault-tolerance They satisfy the following two criteria –Convergence –Closure Not L L convergence closure fault

Example 1: Stabilizing mutual exclusion in unidirectional ring N-1 Consider a unidirectional ring of processes. Counter-clockwise ring. One special process (yellow above) is process with id=0 Legal configuration = exactly one token in the ring (Safety) Desired “normal” behavior: single token circulates in the ring

Dijkstra’s stabilizing mutual exclusion 0 p 0 if x[0] = x[N-1] then x[0] := x[0] + 1 p j j > 0 if x[j] ≠ x[j -1] then x[j] := x[j-1] Wrap-around after K-1 N processes: 0, 1, …, N-1 state of process j is x[j]  {0, 1, 2, K-1}, where K > N TOKEN a process p = “if” condition is process p Legal configuration: only one process has token Can start the system from an arbitrary initial configuration

Example execution K-1 p 0 if x[0] = x[N-1] then x[0] := x[0] + 1 p j j > 0 if x[j] ≠ x[j -1] then x[j] := x[j-1]

Stabilizing execution p 0 if x[0] = x[N-1] then x[0] := x[0] + 1 p j j > 0 if x[j] ≠ x[j -1] then x[j] := x[j-1]

What Happens Legal configuration = a configuration with a single token Perturbations or failures take the system to configurations with multiple tokens –e.g. mutual exclusion property may be violated Within finite number of steps, if no further failures occur, then the system returns to a legal configuration Not L L convergence closure fault

Why does it work ? 1.At any configuration, at least one process can make a move (has token) 2.Set of legal configurations is closed under all moves 3.Total number of possible moves from (successive configurations) never increases 4.Any illegal configuration C converges to a legal configuration in a finite number of moves

Why does it work ? 1.At any configuration, at least one process can make a move (has token), i.e., if condition is false at all processes –Proof by contradiction: suppose no one can make a move –Then p 1,…,p N-1 cannot make a move –Then x[N-1] = x[N-2] = … x[0] –But this means that p 0 can make a move => contradiction p 0 if x[0] = x[N-1] then x[0] := x[0] + 1 p j j > 0 if x[j] ≠ x[j -1] then x[j] := x[j-1]

Why does it work ? 1.At any configuration, at least one process can make a move (has token) 2.Set of legal configurations is closed under all moves –If only p 0 can make a move, then for all i,j: x[i] = x[j]. After p 0 ’s move, only p 1 can make a move –If only pi (i≠0) can make a move »for all j < i, x[j] = x[i-1] »for all k ≥ i, x[k] = x[i], and »x[i-1] ≠ x[i] »x[0] ≠ x[N-1] in this case, after p i ‘s move only p i+1 can move p 0 if x[0] = x[N-1] then x[0] := x[0] + 1 p j j > 0 if x[j] ≠ x[j -1] then x[j] := x[j-1]

Why does it work ? 1.At any configuration, at least one process can make a move (has token) 2.Set of legal configurations is closed under all moves 3.Total number of possible moves from (successive configurations) never increases –any move by p i either enables a move for p i+1 or none at all p 0 if x[0] = x[N-1] then x[0] := x[0] + 1 p j j > 0 if x[j] ≠ x[j -1] then x[j] := x[j-1]

Why does it work ? 1.At any configuration, at least one process can make a move (has token) 2.Set of legal configurations is closed under all moves 3.Total number of possible moves from (successive configurations) never increases 4.Any illegal configuration C converges to a legal configuration in a finite number of moves –There must be a value, say v, that does not appear in C (since K > N) –Except for p 0, none of the processes create new values (since they only copy values) –Thus p 0 takes infinitely many steps, and since it only self-increments, it eventually sets x[0] = v (within K steps) –Soon after, all other processes copy value v and a legal configuration is reached in N-1 steps p 0 if x[0] = x[N-1] then x[0] := x[0] + 1 p j j > 0 if x[j] ≠ x[j -1] then x[j] := x[j-1]

Putting it All Together Legal configuration = a configuration with a single token Perturbations or failures take the system to configurations with multiple tokens –e.g. mutual exclusion property may be violated Within finite number of steps, if no further failures occur, then the system returns to a legal configuration Not L L convergence closure fault

Summary Many more self-stabilizing algorithms –Self-stabilizing distributed spanning tree –Self-stabilizing distributed graph coloring –Not covered in the course – look them up on the web!