
Published by Jair Jaquith. Modified about 1 year ago.

Chapter 7 - Local Stabilization
Chapter 7: roadmap
7.1 Super stabilization
7.2 Self-Stabilizing Fault-Containing Algorithms
7.3 Error-Detection Codes and Repair

Introduction
We present a scheme that can be used to correct the state of algorithms for ongoing, long-lived tasks.
The scheme converts non-stabilizing algorithms for such tasks into self-stabilizing algorithms for the same tasks.

The Malicious Fault Model
Starting from a safe configuration c, k processors experience transient faults, and a new configuration c' is reached.
The states of the faulty processors can be chosen as the states that result in the longest convergence time.

The Malicious Fault Model (2)
Designing for this worst-case measure minimizes the convergence time in the worst-case scenario.
However, algorithms designed for the worst-case measure may have a larger average convergence time than other algorithms.

The Non-malicious Fault Model
In this model, a transient fault assigns to a processor a state chosen with equal probability from the state space of the processor.

Average Convergence Time
Pr(c, k, c'): the probability of reaching a particular configuration c' from a safe configuration c due to the occurrence of k faults.
WorstCase(c): the maximal number of cycles before the system reaches a safe configuration when it starts in c.

Average Convergence Time (2)
The average convergence time following the occurrence of k non-malicious transient faults is:
Σ [Pr(c, k, c') · WorstCase(c')]
computed over all possible configurations c'.
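
The sum above can be evaluated directly once Pr(c, k, c') and WorstCase(c') are tabulated. The following is a minimal sketch; the function name, the dictionary layout, and the toy numbers are assumptions made for illustration and do not come from the slides.

```python
from fractions import Fraction

def average_convergence_time(pr, worst_case):
    """Sum Pr(c, k, c') * WorstCase(c') over all configurations c'.

    pr:         dict mapping each configuration c' to Pr(c, k, c')
    worst_case: dict mapping each configuration c' to WorstCase(c')
    """
    if sum(pr.values()) != 1:
        raise ValueError("probabilities over c' must sum to 1")
    return sum(p * worst_case[c_prime] for c_prime, p in pr.items())

# Hypothetical toy values: three configurations reachable after k faults.
pr = {"c1": Fraction(1, 2), "c2": Fraction(1, 4), "c3": Fraction(1, 4)}
worst_case = {"c1": 0, "c2": 4, "c3": 8}
avg = average_convergence_time(pr, worst_case)  # 0/2 + 4/4 + 8/4
```

Exact fractions are used so that the check that the probabilities sum to 1 is not disturbed by floating-point rounding.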

Error Detection Codes
We use error-detection codes to reduce the average convergence time.
For each processor we maintain a variable ErrorDetect holding the error-detection code ed of its current state s.
The error-detecting function computes the pair (s, ed) given s.

Converting the Algorithm
Replace every step a by a step a' that does the following:
1. Examines whether the value of ErrorDetect fits the current state.
2. If (1) holds, executes a.
3. Otherwise, executes a special repair step a''.
4. Computes the new ed' by applying the error-detecting function to the resulting state s'.
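
The four steps above can be sketched as a wrapper around the original step. This is only an illustration: the slides do not name a specific error-detecting function, so the truncated SHA-256 used here, along with the names `error_detect` and `converted_step`, is an assumption.

```python
import hashlib

def error_detect(state: bytes, bits: int = 32) -> int:
    """Hypothetical error-detecting function: the top `bits` bits of SHA-256(state)."""
    digest = hashlib.sha256(state).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

def converted_step(state: bytes, ed: int, step, repair):
    """Step a': run `step` (a) if ed fits the state, otherwise `repair` (a'').

    Returns the resulting state s' together with its new code ed'.
    """
    if error_detect(state) == ed:      # (1) does ErrorDetect fit the state?
        new_state = step(state)        # (2) yes: execute a
    else:
        new_state = repair(state)      # (3) no: execute the repair step a''
    return new_state, error_detect(new_state)  # (4) compute ed' from s'
```

A caller supplies `step` and `repair` as functions from state to state, so the wrapper stays independent of the algorithm being converted.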

Converting the Algorithm (2)
A transient fault can corrupt all the memory bits of a processor.
Thus, the probability that the value of ErrorDetect fits the state of the faulty processor decreases as the number of bits in ErrorDetect increases.
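
As a rough worked example, assume the non-malicious model applies to every memory bit, so that a fault sets the b bits of ErrorDetect uniformly at random and independently of the state bits. Under that assumption (which is ours, not stated on the slide) the corrupted ErrorDetect fits the corrupted state with probability 2^-b. The function name below is hypothetical.

```python
from fractions import Fraction

def undetected_fault_probability(ed_bits: int) -> Fraction:
    """Probability that a uniformly random ErrorDetect value of `ed_bits`
    bits equals the code of the (also random) corrupted state.
    Assumes the fault randomizes all bits uniformly and independently."""
    if ed_bits < 0:
        raise ValueError("ed_bits must be non-negative")
    return Fraction(1, 2 ** ed_bits)

# Each extra bit in ErrorDetect halves the chance of a missed fault.
p8 = undetected_fault_probability(8)
p16 = undetected_fault_probability(16)
```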

Pyramids
A pyramid Δ_i = v_i[0], v_i[1], v_i[2], …, v_i[d] of views is maintained by every processor P_i, where v_i[h] is a view of all the processors that are within a distance of no more than h from P_i, h time units ago.
In particular, v_i[d] is a view of the entire system, d time units ago.
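
To make the definition concrete, the sketch below builds the pyramid of one processor from a global record of past configurations. A real processor has no such global record; the `history` list, the adjacency-dictionary layout, and the function names are assumptions used only to illustrate which states each view contains. The graph is assumed to be connected.

```python
from collections import deque

def distances_from(adj, source):
    """Breadth-first search distances from `source` in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def build_pyramid(adj, history, i, d):
    """Return [v_i[0], ..., v_i[d]].

    history[h] is the configuration h time units ago (dict: processor -> state).
    v_i[h] holds the states, h time units ago, of processors within distance h of i.
    """
    dist = distances_from(adj, i)
    return [
        {p: history[h][p] for p in adj if p in dist and dist[p] <= h}
        for h in range(d + 1)
    ]

# Path 0 - 1 - 2; the state of p recorded h time units ago is the tuple (p, h).
adj = {0: [1], 1: [0, 2], 2: [1]}
history = [{p: (p, h) for p in adj} for h in range(3)]
pyramid = build_pyramid(adj, history, i=0, d=2)
```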

Pyramids: example for vertex V1 (the figures, which colored the vertices covered by each view, are not reproduced)
v_1[0]: view of V1 now.
v_1[1]: view of the colored vertices, one time unit ago.
v_1[2]: view of the colored vertices, two time units ago.
v_1[3]: view of the colored vertices, three time units ago.
v_1[4]: view of the entire system, four time units ago.
v_1[5] and v_1[6] are views of the entire system as well; they differ only in the time at which the views were taken.

Neighboring Pyramids
Neighboring processors exchange pyramids and check that they agree on the shared portions.
If the shared portions are equal, then all the v[d] views are equal.
In addition, every processor checks that v_i[d] is a consistent configuration for the input algorithm AL and the current task (that is, the configuration is reachable from the initial state of AL).
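
The agreement check between two neighbors can be sketched as follows, assuming each view is stored as a dictionary from processor identifier to state (a representation chosen here for illustration; the slides do not fix one). The shared portion of level h is then the set of processors that appear in both v_i[h] and v_j[h].

```python
def pyramids_agree(pyr_i, pyr_j):
    """Return True iff two pyramids agree on every processor they share
    at every level. Each pyramid is a list of dicts: processor -> state."""
    if len(pyr_i) != len(pyr_j):
        return False
    for view_i, view_j in zip(pyr_i, pyr_j):
        shared = view_i.keys() & view_j.keys()
        if any(view_i[p] != view_j[p] for p in shared):
            return False
    return True
```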

Checking Consistent Configuration
P_i checks that its state in the view v_i[h], for 0 ≤ h ≤ d-1, is obtained by executing AL using the state of P_i and its neighbors in v_i[h+1].
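
A minimal sketch of this local check is shown below. It assumes views are dictionaries from processor identifier to state and that AL is available as a function `al_step(own_state, neighbor_states)`; both the representation and the toy AL (each processor adopts the maximum value it sees) are assumptions for illustration.

```python
def pyramid_is_consistent(pyramid, i, neighbors, al_step):
    """Check that P_i's state in v_i[h] follows, by one step of AL, from the
    states of P_i and its neighbors in v_i[h+1], for 0 <= h <= d-1."""
    for h in range(len(pyramid) - 1):
        older = pyramid[h + 1]  # one time unit earlier than pyramid[h]
        expected = al_step(older[i], [older[q] for q in neighbors])
        if pyramid[h][i] != expected:
            return False
    return True

# Toy AL, hypothetical: adopt the maximum of the own and neighboring values.
def max_step(own, neighbor_states):
    return max([own] + neighbor_states)

# Pyramid of processor 0 on the path 0 - 1 - 2 (d = 2).
good = [{0: 9}, {0: 5, 1: 9}, {0: 1, 1: 5, 2: 9}]
bad = [{0: 7}, {0: 5, 1: 9}, {0: 1, 1: 5, 2: 9}]
```

Here `good` passes the check, while `bad` fails at h = 0 because one step of the toy AL from v_0[1] yields 9, not 7.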

Updating the Pyramids
In every time unit, P_i receives the pyramid Δ_j = v_j[0], v_j[1], v_j[2], …, v_j[d] of every neighbor P_j, and uses the values of v_j[d-1] to construct the value of the new v_i[d].
Together, the values of v_j[d-1] contain information about every processor within distance d of P_i, d-1 time units ago.
In the same way, P_i uses the received values of v_j[k-1], for 1 ≤ k ≤ d-1 (together with v_i[k-1]), to compute v_i[k].
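
The update can be sketched as a shift by one level: what was a view k-1 time units ago becomes, one time unit later, part of the view k time units ago. The code below assumes views are dictionaries from processor identifier to state; for simplicity it also merges v_i[d-1] into the new v_i[d], which is harmless but is our choice, not something the slide states.

```python
def update_pyramid(own_pyramid, neighbor_pyramids, i, new_state):
    """Build the new pyramid of P_i after one time unit.

    own_pyramid:       [v_i[0], ..., v_i[d]] before the update
    neighbor_pyramids: list of pyramids received from the neighbors
    new_state:         the current state of P_i (becomes the new v_i[0])
    """
    d = len(own_pyramid) - 1
    new_pyramid = [{i: new_state}]
    for k in range(1, d + 1):
        view = dict(own_pyramid[k - 1])   # own older view, shifted one level
        for pyr_j in neighbor_pyramids:
            view.update(pyr_j[k - 1])     # extends the radius by one hop
        new_pyramid.append(view)
    return new_pyramid

# Path 0 - 1 - 2 at time 10; the state of p at time t is the tuple (p, t).
pyr_0 = [{0: (0, 10)}, {0: (0, 9), 1: (1, 9)}, {0: (0, 8), 1: (1, 8), 2: (2, 8)}]
pyr_1 = [{1: (1, 10)}, {0: (0, 9), 1: (1, 9), 2: (2, 9)},
         {0: (0, 8), 1: (1, 8), 2: (2, 8)}]
new_pyr_0 = update_pyramid(pyr_0, [pyr_1], i=0, new_state=(0, 11))
```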

The Repair Scheme
First, we assume that the error-detection code identifies all the faults.
In general, the faulty processors initialize their states and collect state information from non-faulty processors to reconstruct their pyramids.

The Repair Scheme (2)
Let c' be a configuration reached after several faults.
There are three groups of processors: faulty, border-non-faulty, and operating.
A processor that identifies an error assigns faulty to its local status variable and resets its pyramid.

Border-Non-Faulty and Operating
The pyramid of a non-faulty processor that is a neighbor of a faulty processor has almost all the information that was stored in the faulty processor before the fault.
Such a processor assigns the value border-non-faulty to its local status variable.
The remaining non-faulty processors are defined as operating.

Figure (not reproduced): a system whose processors are marked faulty, border-non-faulty, or operating.

Freezing the Pyramids
A border-non-faulty processor does not change its pyramid until all the faulty processors have finished reconstructing theirs.
The Topology Collection procedure is used to verify that.

Topology Collection
Every faulty and border-non-faulty processor sends the topology it knows at that moment to its neighbors.
After several rounds (the diameter of the corrupted region + 1), all the information in the pyramids of the processors next to a faulty one has arrived.

Topology Collection (2)
Every processor checks whether there exists a faulty processor that has an edge connected to a processor with an unknown state.
When this test returns false, the processors' pyramids can be reconstructed.
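
The test described above can be sketched as a scan over the collected edges. The representation is an assumption: `edges` holds the undirected edges collected so far, and `status` maps each processor whose state is already known to its status, so a processor missing from `status` counts as unknown. The function returns the negation of the slide's test, that is, True when reconstruction may start.

```python
def has_all_topology(edges, status):
    """Return True iff no collected edge joins a faulty processor to a
    processor whose state is still unknown.

    edges:  iterable of (u, v) pairs, treated as undirected
    status: dict mapping every processor with a known state to its status
    """
    for u, v in edges:
        if status.get(u) == "faulty" and v not in status:
            return False
        if status.get(v) == "faulty" and u not in status:
            return False
    return True
```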

Reconstruction
The faulty processors reconstruct their pyramids using the information collected from the other pyramids and the transition functions of the processors.

Back to Operating
Using a local counter and the collected topology, the faulty and border-non-faulty processors determine when the others have finished reconstructing their pyramids.
At the end of the repair process, all the processors change their status to operating.

The algorithm
State variables:
Status = {operating, faulty, border-non-faulty}
Topology = {V, E}
Pyramid (explained before)
RoundCounter: counts the number of rounds since the occurrence of the most recent fault.

The algorithm (cont.)
Upon a clock tick:
1. If (status = operating)
   1.1 if (DetectError())
          status = faulty
          Pyramid = nil
          RoundCounter =
       else if (HaveFaultyNeighbor())
          status = border-non-faulty
          RoundCounter =
       else
          UpdatePyramid()
2. Else
   2.1 ExchangeLocalTopologyInformation()
   2.2 if (HasAllTopology() & status = faulty)
          ReconstructPyramid()
   2.3 RoundCounter = RoundCounter + 1
       if (Diameter(Topology) = RoundCounter)
          status = operating
Notes:
DetectError(): detects whether a transient error occurred (see Error Detection Codes).
HaveFaultyNeighbor(): true if one of the neighbors is faulty.
ExchangeLocalTopologyInformation(): sends the immediate neighbors' information and receives information from the neighbors.
HasAllTopology(): returns true iff there is no edge from a faulty processor to a processor with an unknown state.

Undetected Faults
What happens if the faults are not detected?
Transient fault detectors and watchdog counters are used in this situation.
When an error is detected by the transient fault detector, the faulty processor starts counting while the repair scheme tries to fix the problem.

Undetected Faults (2)
When the counter reaches its upper bound, the system is examined again.
If the repair failed, a reset of the system is triggered.
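
The interplay between the transient fault detector and the watchdog counter can be sketched as a small state machine. The class name, the returned labels, and the idea of passing the detector as a callable are assumptions made for illustration; the slides do not specify an interface or a value for the upper bound.

```python
class Watchdog:
    """Hypothetical watchdog counter for one processor."""

    def __init__(self, upper_bound):
        self.upper_bound = upper_bound
        self.count = None  # None: not counting (no pending repair)

    def on_tick(self, detector_reports_fault):
        """Advance one time unit; `detector_reports_fault` is a zero-argument
        callable standing in for the transient fault detector."""
        if self.count is None:
            if detector_reports_fault():
                self.count = 0          # start counting; repair runs meanwhile
                return "repairing"
            return "idle"
        self.count += 1
        if self.count < self.upper_bound:
            return "repairing"          # give the repair scheme more time
        self.count = None               # bound reached: examine the system again
        return "reset" if detector_reports_fault() else "repaired"
```

While the counter is running the detector is deliberately not consulted, so the repair scheme gets the full `upper_bound` time units before a reset is considered.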

