Fault tolerance and related issues in distributed computing Shmuel Zaks GSSI - Feb 2016

1 Fault tolerance and related issues in distributed computing Shmuel Zaks zaks@cs.technion.ac.il GSSI - Feb 2016

2 Part 0: An overview
Part 1: Lower bounds
Part 2: Computing in spite of faults
Part 3: Detecting faults
Part 4: Self-stabilization

3 The snapshot algorithm (Chandy and Lamport)

6 Goal: design a snapshot (= global-state-detection) algorithm that:
• will record a collection of states of all system components (which forms a global system state),
• will not change the underlying computation,
• will not freeze the underlying computation

7 A Process Can…
• record its own state,
• send and receive messages,
• record messages it sends and receives,
• cooperate with other processes
• Processes do not share clocks or memory
• Processes cannot record their state precisely at the same instant

8 Motivation
Many problems in distributed systems can be stated in terms of the problem of detecting global states: stable property detection problems such as termination detection, deadlock detection, etc.

9 Stable Property Detection Problem
D - a distributed system
y - a predicate function defined on the set of global states of D
S, S' - global states of D
y is stable if y(S) implies y(S') for all S' reachable from S
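The stability definition can be checked mechanically on a small, explicitly listed state graph. The sketch below is only an illustration: the graph encoding (a dict from each global state to its successor states), the toy states, and the names `reachable` and `is_stable` are assumptions, not part of the slides.

```python
def reachable(graph, start):
    """Return every global state reachable from `start` (including itself)."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for t in graph[s]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def is_stable(graph, y):
    """y is stable iff y(S) implies y(S') for every S' reachable from S."""
    return all(y(t) for s in graph if y(s) for t in reachable(graph, s))


# Hypothetical toy system: once "terminated", it stays terminated.
graph = {
    "running": ["running", "terminated"],
    "terminated": ["terminated"],
}

print(is_stable(graph, lambda s: s == "terminated"))  # True
print(is_stable(graph, lambda s: s == "running"))     # False
```

"Terminated" is stable here because no transition leaves it, whereas "running" is not, since a running system can reach the terminated state.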

10 • Many distributed algorithms are structured as a sequence of phases
• A phase: a transient part, then a stable part
• Phase termination vs. computation termination
• Our view on the problem: (i) detect the termination of a phase, (ii) initiate a new phase
Notice that "the kth phase has terminated" is a stable property

11 Model
• Distributed system D is a finite, labeled, directed graph (figure: processes p and q connected by channels C1 and C2)
• Channels have infinite buffers, are error-free and preserve FIFO order
• Message delay is bounded, but unknown

12 State of a Channel (figure: processes p and q connected by channels C1 and C2, with messages 1, 2, 3 on C1)
• [1, 2, 3] - sequence X of messages that were sent
• [1] - sequence Y of received messages (a prefix of X)
• [2, 3] - state of C1: X \ Y
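The channel state X \ Y can be computed directly from the two sequences. This is a minimal sketch; the function name `channel_state` and the list encoding of message sequences are assumptions made for illustration.

```python
def channel_state(sent, received):
    """Return X \\ Y: the messages sent but not yet received on a FIFO channel."""
    # On a FIFO, error-free channel the received sequence Y is a prefix of X.
    if sent[:len(received)] != received:
        raise ValueError("received sequence must be a prefix of the sent sequence")
    return sent[len(received):]


print(channel_state([1, 2, 3], [1]))  # [2, 3]
```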

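13 Example: System
Distributed system (figure): processes p and q connected by channels C1 and C2.
Initial global state (figure): one process in state A, the other in state B, both channels empty (Ø).
State transitions (same for p and q): states A and B, with transitions labeled send and receive.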

14 Global state transition diagram (figure): the events are p sends, q receives, q sends, p receives. The system is deterministic.
A computation corresponds to a path in the diagram.

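15 Example: System
Distributed system (figure): processes p and q connected by channels C1 and C2.
State transitions (figure): p and q have separate transition diagrams, one over states A and B and one over states C and D, each with transitions labeled send and receive.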

16 Global state transition diagram (figure): the events are p sends, q sends, p receives, q receives. The system is non-deterministic.

17 In the snapshot algorithm (channel C from p to q):
Each process records its own state.
p and q cooperate to record the state of C.

18 Example: System (figure): p, C and q are recorded at different moments (Record p, Record C, Record q). Recorded state: no token.

19 Example: System (figure): p, C and q are recorded at different moments (Record p, Record C, Record q). Recorded state: two tokens.

20 q will record the state of C (channel C from p to q).
p and q have to coordinate, using a special marker:
q starts recording C after it records its state;
q stops when receiving the marker from p.
But: how does q know when to record its state?

21 The snapshot algorithm
Who starts? We assume one process.
Homework: extend the discussion and the proof to any number of starters.

22 (Intuition for the Algorithm) (channel C from p to q)
• Who will record the state of channel C? q
• How does q know when to stop recording? p sends a marker right after it records its state, and before sending any other message
• q starts recording after it records its state

23 The snapshot algorithm: channel recording (channel C from p to q)
Starts when q records its own state.
Ends when q receives a marker along C.
Note: for any q ≠ p_0, the channel along which the marker arrived first is recorded as Ø.

24 The snapshot algorithm: state recording
p_0 starts.
p_0 records its state, and then broadcasts a marker.
Shout-algorithm = PI (Propagation of Information) = hot potato = …
When q receives a marker for the first time, it records its own state.

25 The snapshot algorithm
Marker-Sending Rule for a process p, for each outgoing channel c:
1. record the state of p
2. send a marker along c before sending any other message
Marker-Receiving Rule for a process q, on receiving a marker along channel c:
if q's state is not recorded:
1. record q's state;
2. record c's state = Ø;
else: c's state is the sequence of messages received along c since q recorded its state
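The two rules can be tried out on a two-process system. The sketch below is an illustration under stated assumptions, not the slides' own code: the application (processes exchanging amounts of money, so that the total is conserved), the class `Process`, the helper `connect`, and the `MARKER` value are all hypothetical.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical stand-in for the special marker message


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance      # application state (an assumed example)
        self.out = {}               # neighbour name -> outgoing FIFO channel
        self.inc = {}               # neighbour name -> incoming FIFO channel
        self.recorded = False
        self.recorded_state = None
        self.open = {}              # incoming channels still being recorded
        self.channel_state = {}     # neighbour name -> recorded channel state

    def record(self):
        # Marker-sending rule: record own state, then send a marker on every
        # outgoing channel before any other message.
        self.recorded = True
        self.recorded_state = self.balance
        self.open = {src: [] for src in self.inc}
        for channel in self.out.values():
            channel.append(MARKER)

    def send(self, dst, amount):
        self.balance -= amount
        self.out[dst].append(amount)

    def receive(self, src):
        msg = self.inc[src].popleft()
        if msg == MARKER:
            # Marker-receiving rule.
            if not self.recorded:
                self.record()       # this channel is then recorded as empty
            self.channel_state[src] = self.open.pop(src)
        else:
            self.balance += msg
            if src in self.open:    # arrived after recording, before the marker
                self.open[src].append(msg)


def connect(a, b):
    """Create a FIFO channel directed from a to b."""
    channel = deque()
    a.out[b.name] = channel
    b.inc[a.name] = channel


p, q = Process("p", 10), Process("q", 5)
connect(p, q)
connect(q, p)

p.send("q", 3)     # p: 7, channel p->q: [3]
p.record()         # p records 7 and sends a marker to q
q.send("p", 2)     # q: 3, channel q->p: [2]
q.receive("p")     # q receives 3, q: 6
q.receive("p")     # q receives the marker: records 6, channel p->q = []
p.receive("q")     # p receives 2 while recording channel q->p
p.receive("q")     # p receives the marker: channel q->p = [2]

print(p.recorded_state, q.recorded_state)  # 7 6
print(p.channel_state, q.channel_state)    # {'q': [2]} {'p': []}
```

In this run the recorded process states (7 and 6) plus the recorded in-transit message (2) add up to the initial total of 15, although no single instant of the run had p at 7 and q at 6 simultaneously.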

26 Termination
Assumption: no marker remains forever in an input channel.
Claim: if the graph is strongly connected and at least one process records its state, then all processes will record their state in finite time.
Proof: by induction.
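The induction follows the spread of markers through the graph: every process that records sends a marker on each outgoing channel, and each neighbour records on its first marker. A minimal sketch of that spread, assuming the graph is given as a dict from each process to the processes it has channels to (the function name `processes_that_record` is hypothetical):

```python
from collections import deque


def processes_that_record(graph, initiator):
    """Return the set of processes that eventually record their state."""
    recorded = {initiator}
    frontier = deque([initiator])
    while frontier:
        p = frontier.popleft()
        for q in graph[p]:            # p sends a marker on every outgoing channel
            if q not in recorded:     # first marker received: q records its state
                recorded.add(q)
                frontier.append(q)
    return recorded


# Strongly connected ring: every process records.
ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(processes_that_record(ring, "b") == {"a", "b", "c"})  # True

# Not strongly connected: nothing leads back to "a", so it never records.
chain = {"a": ["b"], "b": ["c"], "c": []}
print(processes_that_record(chain, "b") == {"b", "c"})      # True
```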

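27 The Recorded Global State
Example: System (figure): processes p and q connected by channels C1 and C2; p and q have separate transition diagrams, one over states A and B and one over states C and D, each with transitions labeled send and receive.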

28 (Figure: a computation of the example system on the global state transition diagram, with the events p sends, q sends, p receives.)

29 What did we get?

30 • Event e in process p is an atomic action: it can change the state of p, and the state of at most one channel c incident on p (by sending or receiving a message M along c)
• e is defined by e = ⟨p, s, s', M, c⟩
• e may occur in global state S if:
1. the state of p in S is s;
2. a. if c is directed towards p: M is at the head of c's state, and it is deleted by applying e;
   b. if c is directed away from p: M is at the tail of c's state after applying e;
3. the state of p after applying e is s'.

31 Process State and Global State
• A process: a set of states, an initial state, and a set of events
• A global state S: a collection of process states and channel states; initially, each process is in its initial state and all channels are empty
next(S, e) is the global state after event e is applied to global state S

32 Process State and Global State
• seq = (e_i : i = 0…n) is a computation of the system iff, for every i, e_i may occur in S_i and S_{i+1} = next(S_i, e_i) (S_0 is the initial global state)
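The definitions of events, next(S, e) and computations translate almost literally into code. The sketch below is an illustration only: the encoding of a global state as a pair (process states, channel states), the channel key (sender, receiver), and the names `Event`, `may_occur`, `next_state` and `run` are assumptions, as are the state and message names in the example.

```python
from typing import NamedTuple, Optional, Tuple


class Event(NamedTuple):
    p: str                               # process in which the event occurs
    s: str                               # state of p before the event
    s_next: str                          # state of p after the event (s')
    M: Optional[str] = None              # message sent or received, if any
    c: Optional[Tuple[str, str]] = None  # channel as (sender, receiver), if any


def may_occur(S, e):
    states, channels = S
    if states[e.p] != e.s:
        return False
    if e.c is not None and e.c[1] == e.p:       # c is directed towards p
        return channels[e.c][:1] == (e.M,)      # M must be at the head of c
    return True


def next_state(S, e):
    states, channels = S
    states = {**states, e.p: e.s_next}
    channels = dict(channels)
    if e.c is not None:
        if e.c[1] == e.p:
            channels[e.c] = channels[e.c][1:]       # receive: delete M from the head
        else:
            channels[e.c] = channels[e.c] + (e.M,)  # send: append M at the tail
    return states, channels


def run(S0, seq):
    """Apply seq to S0; raise ValueError if seq is not a computation."""
    S = S0
    for i, e in enumerate(seq):
        if not may_occur(S, e):
            raise ValueError(f"event {i} may not occur in the current global state")
        S = next_state(S, e)
    return S


# Hypothetical two-process system passing a single message.
S0 = ({"p": "A", "q": "B"}, {("p", "q"): (), ("q", "p"): ()})
seq = [
    Event("p", "A", "B", "token", ("p", "q")),  # p sends the token
    Event("q", "B", "A", "token", ("p", "q")),  # q receives it
]
print(run(S0, seq))
```

Reversing `seq` is rejected by `run`, because the receive event may not occur while the channel is still empty.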

33 The Recorded Global State
seq = (e_i : i ≥ 0) - a distributed computation
S_i - the state of the system right before e_i occurs
S_0 - the initial state of the system
S_t - the state of the system at the termination of the algorithm
S* - the recorded global state

34 Definition
Event e_j is called pre-recording if e_j is in a process p and p records its state after e_j in seq.
Event e_j is called post-recording if e_j is in a process p and p records its state before e_j in seq.
Assume that e_{j-1} is a post-recording event that immediately precedes a pre-recording event e_j in seq.

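35 Lemma: the post-recording event e_{j-1} and the pre-recording event e_j can be interchanged in seq, and the global state reached after both events is the same in either order.
Proof: e_{j-1} occurs in p and e_j in q, and q ≠ p (since e_{j-1} is post-recording and e_j is pre-recording).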

36 The only scenario that might prevent interchanging the two events is that a message M is sent at e_{j-1} and received at e_j. But this is not possible: if M is sent at e_{j-1}, then M is sent after p recorded its state, so a marker was sent to q before M; hence when M is received at e_j, q has already recorded its state, so e_j is post-recording, a contradiction!

37 Hence, event e_j can occur in global state S_{j-1}. The state of process p is not altered by e_j, hence e_{j-1} can occur after e_j.

38 We have to show that the states of all processes and channels are the same in S_2 and S_4. This clearly holds for processes and channels that do not take part in e_{j-1} and e_j.

39 States: the states of p and q in S_2 and in S_4 are the same.
Channels: whether e_{j-1} and e_j send or receive a message along a channel (or do neither), the same is done in both scenarios, so the states of the channels in S_2 and S_4 are the same. (End of proof.)

40 (The Recorded Global State)

41 Proof
Using the lemma, swap events until all post-recording events appear after all pre-recording events. The resulting computation is seq'.
All that is left to show: S* is the global state after all pre-recording events and before all post-recording events.
1. Process states
2. Channel states

42 Claim: the state of a channel in S* is (the sequence of messages corresponding to pre-recording sends) minus (the sequence of messages corresponding to pre-recording receives).
Proof: the state of channel c from process p to process q recorded in S* is the sequence of messages received on c by q after q records its state and before q receives a marker on c. The sequence of messages sent by p before it sends the marker is the sequence corresponding to pre-recording sends on c.

43 (Figure: a computation of the example system with the events p sends, q sends, p receives, each marked as pre-recording or post-recording.)

44 (Another execution) (Figure: the example system with the events q sends, p sends, p receives, each marked as pre-recording or post-recording.)

45 What did we get? A configuration that could have happened.

46 seq = (e_i : i ≥ 0) - a distributed computation
S_i - the state of the system right before e_i occurs
S_0 - the initial state of the system
S_t - the state of the system at the termination of the algorithm
S* - the recorded global state

47 Stable Detection
D - a distributed system
y - a predicate function defined on the set of global states of D
S, S' - global states of D
y is a stable property of D if y(S) implies y(S') for all S' reachable from S

48 Algorithm
Input: a stable property y
Output: a boolean value b with the property: y(S_0) ⇒ b and b ⇒ y(S_t)
Algorithm:
begin
  record a global state S*
  b := y(S*)
end
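The detection algorithm is a thin wrapper around the snapshot. In the sketch below, `record_global_state` stands for any procedure that returns a recorded global state S*; that parameter, the state encoding, and the example predicate `terminated` are assumptions made for illustration.

```python
def detect(record_global_state, y):
    """Record a global state S* and return b := y(S*)."""
    S_star = record_global_state()
    return y(S_star)


def terminated(S):
    # Hypothetical stable property: every process is passive and no
    # message is in transit on any channel.
    states, channels = S
    return all(s == "passive" for s in states.values()) and not any(channels.values())


quiet = ({"p": "passive", "q": "passive"}, {("p", "q"): [], ("q", "p"): []})
busy = ({"p": "passive", "q": "passive"}, {("p", "q"): ["work"], ("q", "p"): []})

print(detect(lambda: quiet, terminated))  # True
print(detect(lambda: busy, terminated))   # False
```

The second snapshot has all processes passive but a message still in transit, which is why channel states must be part of the recorded global state.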

49 Correctness
1. S* is reachable from S_0
2. S_t is reachable from S*
3. y(S) ⇒ y(S') for all S' reachable from S
S_0 → S* → S_t
y(S*) = true ⇒ y(S_t) = true
y(S*) = false ⇒ y(S_0) = false

50 References
K. M. Chandy and L. Lamport, Distributed Snapshots: Determining Global States of Distributed Systems, ACM Trans. on Computer Systems, 1985.

