Fault-containment in Weakly Stabilizing Systems Anurag Dasgupta Sukumar Ghosh Xin Xiao University of Iowa

Preview. Weak stabilization (Gouda 2001) guarantees reachability and closure of the legal configuration. Once "stable", a minor perturbation comes with apparently no guarantee of recovery, let alone of "efficient recovery". We take a weakly stabilizing leader election algorithm and add fault-containment to it.

Our contributions. An exercise in adding fault-containment to a weakly stabilizing leader election algorithm on a line topology. Processes are anonymous. Containment time = O(1) from all single failures. As m → ∞, the contamination number is O(1) (precisely 4), where m is a tuning parameter. (Contamination number = maximum number of non-faulty processes that change their states during recovery.)

The big picture

Model and Notations. Consider n processes in a line topology. N(i) = neighbors of process i. Variable P(i) ∈ N(i) ∪ {⊥} (parent of i). Macro C(i) = {q ∈ N(i): P(q) = i} (children of i). Predicate Leader(i) ≡ (P(i) = ⊥). Legal configuration: (1) for exactly one process i: P(i) = ⊥; (2) ∀ j ≠ i: P(j) = k ⇒ P(k) ≠ j. [Figure: a node i with its parent P(i), its children C(i), and the leader.]
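
To make the legal-configuration predicate concrete, here is a minimal sketch (ours, not from the paper) that checks both conditions on a line of n processes; the names neighbors, P and is_legal are illustrative, and None plays the role of ⊥.

```python
def neighbors(i, n):
    """Neighbors of process i on a line topology 0 .. n-1."""
    return [j for j in (i - 1, i + 1) if 0 <= j < n]

def is_legal(P):
    """Legal configuration: exactly one leader, and no mutual parent pointers."""
    n = len(P)
    leaders = [i for i in range(n) if P[i] is None]    # P(i) = ⊥
    if len(leaders) != 1:
        return False                                   # condition (1) violated
    for j in range(n):
        k = P[j]
        if k is not None and P[k] == j:                # P(j) = k and P(k) = j
            return False                               # condition (2) violated
    return True

print(is_legal([1, None, 1]))   # True: process 1 is the unique leader
```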

Model and Notations (continued). Shared memory model and a central scheduler. Weak fairness of the scheduler. Guarded action by a process: g → A. A computation is a sequence of (global) states and state transitions.

Stabilization. A stable (or legal) configuration satisfies a predicate LC defined in terms of the primary variables p that are observable by the application. However, fault-containment often needs secondary variables (a.k.a. auxiliary or state variables) s. Thus, the local state of process i = (p_i, s_i), and the global state of the system = (p, s), where p = the set of all p_i and s = the set of all s_i. (p, s) ∈ LC ⇔ p ∈ LC_p and s ∈ LC_s.

Definitions. Containment time is the maximum time needed to establish LC_p from a 1-faulty configuration. Containment in space means that the primary variables of only O(1) processes change their state during recovery from any 1-faulty configuration. Fault-gap is the time to reach LC (both LC_p and LC_s) from any 1-faulty configuration. [Figure: timeline — LC_p restored, then LC_s restored; the fault gap extends until LC_s is restored.]

Weakly stabilizing leader election. We start from the weakly stabilizing leader election algorithm by Devismes, Tixeuil, Yamashita [ICDCS 2007], and then modify it to add fault-containment. Here is the DTY algorithm for an array of processes.
DTY algorithm: program for any process i in the array. Guarded actions:
R1 :: not leader ∧ N(i) = C(i) → be a leader
R2 :: not leader ∧ N(i) \ (C(i) ∪ {P(i)}) ≠ ∅ → switch parent
R3 :: leader ∧ N(i) ≠ C(i) → parent := k, k ∈ N(i) \ C(i)
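
Below is a hedged, executable sketch (ours) of these three rules, reusing the neighbors helper from the earlier sketch; None again stands for ⊥, and all function names are illustrative.

```python
import random

def children(i, P, n):
    """C(i): the neighbors of i whose parent is i."""
    return {q for q in neighbors(i, n) if P[q] == i}

def dty_step(i, P, n):
    """Apply one enabled DTY rule at process i (the central scheduler picked i)."""
    N = set(neighbors(i, n))
    C = children(i, P, n)
    if P[i] is not None and N == C:                      # R1: every neighbor is a child
        P[i] = None                                      #     become the leader
    elif P[i] is not None and N - (C | {P[i]}):          # R2: some neighbor is neither
        P[i] = random.choice(sorted(N - (C | {P[i]})))   #     child nor parent: switch
    elif P[i] is None and N != C:                        # R3: leader with a non-child
        P[i] = random.choice(sorted(N - C))              #     neighbor: resign to it
```

On the legal configuration P = [1, None, 1] from the earlier sketch, no rule is enabled at any process, which illustrates closure.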

Effect of a single failure. With a randomized scheduler, the weakly stabilizing system will recover to a legal configuration with probability 1. However, if a single failure occurs, the recovery time can be as large as n (by an argument similar to Gambler's ruin). For fault-containment, we need something better: we bias a randomized scheduler to achieve our goal. The technique is borrowed from [Dasgupta, Ghosh, Xiao: SSS 2007]. Here we show that the technique is indeed powerful enough to solve a larger class of problems.
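
A toy illustration (ours, not from the paper) of the Gambler's-ruin intuition: a symmetric ±1 random walk on {0, …, n} started one step from an endpoint still takes about n − 1 steps on average before it is absorbed, so recovery driven by an unbiased random scheduler can be comparably slow.

```python
import random

def absorption_time(n, trials=10_000):
    """Average number of steps for a symmetric ±1 walk started at 1 to hit 0 or n."""
    total = 0
    for _ in range(trials):
        pos, steps = 1, 0
        while 0 < pos < n:
            pos += random.choice((-1, 1))
            steps += 1
        total += steps
    return total / trials

print(absorption_time(20))   # about 19, i.e. n - 1 on average
```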

Biasing a random scheduler. For fault-containment, each process i uses a secondary variable x(i). A node i updates its primary variable P(i) when the following conditions hold: (1) the guard involving the primary variables is true; (2) the randomized scheduler chooses i; (3) x(i) ≥ x(k), where k ∈ N(i).

Biasing a random scheduler (continued). After the action, x(i) is updated as x(i) := max_{q ∈ N(i)} x(q) + m, where m ∈ Z+ (call this update x(i); m is a tuning parameter). When x(i) < x(k) but conditions 1–2 hold, the primary variable P(i) remains unchanged; only x(i) is incremented by 1 (increment x(i)). [Figure, with m = 5: a node with x = 10 whose largest neighboring x value is 8 UPDATEs to 8 + 5 = 13; a node with x = 7 next to a neighbor with x = 8 merely INCREMENTs to 8.]
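
A minimal executable sketch (ours) of this biasing rule; the actual action on the primary variable is abstracted as a callback, and all names and values are illustrative.

```python
def try_move(i, x, neighbor_ids, k, m, apply_primary_action):
    """Biased move at node i against candidate neighbor k.

    Assumes the primary guard is already true and the randomized scheduler
    has chosen i (conditions 1-2 on the previous slide).
    """
    if x[i] >= x[k]:
        apply_primary_action(i)                      # e.g. P(i) := k
        x[i] = max(x[q] for q in neighbor_ids) + m   # UPDATE x(i)
    else:
        x[i] += 1                                    # INCREMENT x(i)

# Example with m = 5 and the x values from the slide: x(i)=10, x(j)=8, x(k)=7.
x = {"i": 10, "j": 8, "k": 7}
try_move("i", x, ["j", "k"], "k", 5, lambda i: None)
print(x["i"])   # 13 = max(8, 7) + 5
```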

The Algorithm. Algorithm 1 (containment): program for process i. Guarded actions:
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
R2 :: (P(i) = ⊥) ∧ (∃ k ∈ N(i) \ C(i)) → P(i) := k
R3a :: (P(i) = j) ∧ (∃ k ∈ N(i): P(k) ∉ {i, ⊥}) ∧ x(i) ≥ x(k) → P(i) := k; update x(i)
R3b :: (P(i) = j) ∧ (∃ k ∈ N(i): P(k) ∉ {i, ⊥}) ∧ x(i) < x(k) → increment x(i)
R4a :: (P(i) = j) ∧ (∃ k ∈ N(i): P(k) = ⊥) ∧ x(i) ≥ x(k) → P(i) := k
R4b :: (P(i) = j) ∧ (∃ k ∈ N(i): P(k) = ⊥) ∧ x(i) < x(k) → increment x(i)
R5 :: (P(i) = j) ∧ (P(j) = ⊥) ∧ (∃ k ∈ N(i): P(k) ∉ {i, ⊥}) → P(i) := k

Analysis of containment. Consider six cases: 1. Fault at the leader. 2. Fault at distance-1 from the leader. 3. Fault at distance-2 from the leader. 4. Fault at distance-3 from the leader. 5. Fault at distance-4 from the leader. 6. Fault at distance-5 or greater from the leader.

Case 1: fault at the leader node. [Figure: R1 applied by node 5, then R1 applied by node 4; node 4 is the new leader.] Rule used: R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥

Case 2: fault at distance-1 from the leader node. [Figure: recovery sequence applying R1 and then R2 (at node 5).] Rule used: R2 :: (P(i) = ⊥) ∧ (∃ k ∈ N(i) \ C(i)) → P(i) := k

Case 5: fault at distance-4 from the leader node. [Figure: recovery sequence R4a at node 2 (x(2) > x(1)), R5 at node 4, R2 at node 5, R3a at node 3 (x(3) > x(2)); then stable. Non-faulty processes up to distance 4 from the faulty node are affected.] Rule used: R4a :: (P(i) = j) ∧ (∃ k ∈ N(i): P(k) = ⊥) ∧ x(i) ≥ x(k) → P(i) := k

Case 6: fault at distance ≥ 5 from the leader node. [Figure: recovery sequence R4a at node 2 (x(2) > x(1)); R3a at node 3 and R5 at node 2; R2 at node 1; R3a at node 3 (x(3) > x(2), x(4)); recovery complete, with the current leader marked. With a high m it is difficult for node 4 to change its parent, but node 3 can easily do it.]

Fault-containment in space. Theorem 1. As m → ∞, the effect of a single failure is restricted to within distance-4 from the faulty process, i.e., the algorithm is spatially fault-containing. Proof idea. Exhaustive case-by-case analysis. The worst case occurs when a node at distance-4 from the leader fails, as shown earlier.

Fault-containment in time. Theorem 2. The expected number of steps needed to contain a single fault is independent of n; hence Algorithm 1 (containment) is fault-containing in time. Proof idea. Case-by-case analysis. When a node beyond distance-4 from the leader fails, the distance no longer affects the expected containment time.

Fault-containment in time (continued). Case 1: the leader fails. Recovery is completed in a single move, regardless of whether node 3 or node 4 executes the move. Case 2: a node i at distance-1 from the leader fails. (a) P(i) becomes ⊥: recovery is completed in one step. (b) P(i) switches to a new parent: recovery time = 2 + ∑_{n=1}^{∞} n/2^n = 4.
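
For completeness, the series in case 2(b) evaluates to 2 by a standard identity (this intermediate step is ours, not on the slide):

∑_{n=1}^{∞} n/2^n = ∑_{j=1}^{∞} ∑_{n=j}^{∞} 1/2^n = ∑_{j=1}^{∞} 1/2^{j-1} = 2, hence 2 + ∑_{n=1}^{∞} n/2^n = 2 + 2 = 4.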

Fault-containment in time: summary of expected containment times.

Fault location    P(i) → ⊥    P(i) switches
Leader            –           1
Distance-1        1           4
Distance-2        2           151/108
Distance-3        131/54      115/36
Distance-4        10/9        29/27
Distance ≥ 4      33/32       115/36

Thus, the expected containment time is O(1).

Another proof of convergence. Theorem 3. The proposed algorithm recovers from all single faults to a legal configuration in O(1) time. Proof (using the martingale convergence theorem). A martingale is a sequence of random variables X_1, X_2, X_3, … such that, for all n: (1) E(|X_n|) < ∞, and (2) E(X_{n+1} | X_1, …, X_n) = X_n (for a super-martingale, replace = with ≤; for a sub-martingale, replace = with ≥). We use the following corollary of the martingale convergence theorem. Corollary. If X_n ≥ 0 is a super-martingale, then as n → ∞, X_n converges to a limit X with probability 1, and E(X) ≤ E(X_0).

Proof of convergence (continued). Let X_i be the number of processes with enabled guards in step i. After 0 or 1 failure, X can be 0, 2, or 3 (by exhaustive enumeration). When X_i = 0: X_{i+1} = 0 (already stable). When X_i = 2: E(X_{i+1}) = 1/2 × 1 + 1/2 × 2 = 3/2 ≤ 2. When X_i = 3: E(X_{i+1}) = 1/3 × 0 + 1/3 × 2 + 1/3 × 4 = 2 ≤ 3. Thus X_1, X_2, X_3, … is a super-martingale. Using the Corollary, as n → ∞, E(X_n) ≤ E(X_0). Since X is non-negative by definition, X_n converges to 0 with probability 1, and the system stabilizes.
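
A quick numeric check (ours) of the one-step expectations above, assuming the transition probabilities read off the slide; the dictionary below is illustrative, not part of the original analysis.

```python
# Super-martingale check: E(X_{i+1} | X_i = x) <= x for each assumed distribution.
transitions = {
    0: {0: 1.0},                   # stable stays stable
    2: {1: 0.5, 2: 0.5},           # two enabled guards
    3: {0: 1/3, 2: 1/3, 4: 1/3},   # three enabled guards
}
for x, dist in transitions.items():
    expected_next = sum(value * prob for value, prob in dist.items())
    assert expected_next <= x
    print(f"X_i = {x}: E(X_(i+1)) = {expected_next}")
```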

Proof idea of weak stabilization. [Figure: DTY algorithm (rules R1, R2, R3) side by side with our algorithm (rules R1–R5).] DTY is weakly stabilizing. Our algorithm executes the same action (P(i) := k) as DTY, but the guards are biased differently; hence it is also weakly stabilizing.

Stabilization from multiple failures. Theorem 4. When m → ∞, the expected recovery time from multiple failures is O(1) if the faults occur at distance 9 or more apart. Proof sketch. Since the contamination number is 4, no non-faulty process is influenced by both failures. [Figure: two faults at distance ≥ 9 apart; each affects processes only within distance 4 on either side.]

Conclusion. 1. With increasing m, containment in space is tighter, but stabilization from arbitrary initial configurations slows down. 2. LC_s = true, so the system is ready to deal with the next single failure as soon as LC_p holds; this reduces the fault-gap and increases system availability. 3. The unbounded secondary variable x can be bounded using the technique discussed in the [Dasgupta, Ghosh, Xiao: SSS 2007] paper. 4. It is possible to extend this algorithm to a tree topology (but we do not do that here).