PROCESS RESILIENCE By Ravalika Pola. Outline: Process Resilience, Design Issues, Failure Masking and Replication, Agreement in Faulty Systems, Failure.


1 PROCESS RESILIENCE By Ravalika Pola

2 Outline
– Process Resilience
– Design Issues
– Failure Masking and Replication
– Agreement in Faulty Systems
– Failure Detection

3 Process Resilience
Problem:
– How is fault tolerance achieved in a distributed system, especially against process failures?
Solution:
– Replicate processes into groups.
– Treat a collection of processes as a single abstraction.
– All members of a group receive the same message; if one process fails, the others can take over for it.
– Process groups are dynamic, and a process can be a member of several groups.
– Hence we need mechanisms to manage the groups.

4 Design Issues: Flat Group vs. Hierarchical Group

5 Group Membership
Processes can join and leave groups, and groups can be created and destroyed, so we need to keep track of membership.
Group server
– Straightforward, simple, and easy to implement
– Major disadvantage: single point of failure
Distributed approach
– Broadcast a message to join or leave the group
– In case of a fault, how do we distinguish a member that is really dead from one that is merely very slow?
– Joining and leaving must be synchronized: on joining, all previous messages are sent to the new member
– Another issue is how to create a new group
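The group-server approach can be sketched in a few lines. This is only an illustration under stated assumptions: the class `GroupServer` and its method names are hypothetical, everything runs in one process, and "sending all previous messages" is modelled by returning the group's message history from `join`.

```python
class GroupServer:
    """Hypothetical central group server that tracks membership for all groups."""

    def __init__(self):
        self.members_of = {}   # group name -> set of process ids
        self.history = {}      # group name -> messages multicast so far

    def create_group(self, group):
        self.members_of.setdefault(group, set())
        self.history.setdefault(group, [])

    def destroy_group(self, group):
        self.members_of.pop(group, None)
        self.history.pop(group, None)

    def join(self, group, pid):
        """Add pid and return the earlier messages so the newcomer can catch up."""
        self.members_of[group].add(pid)
        return list(self.history[group])

    def leave(self, group, pid):
        self.members_of[group].discard(pid)

    def multicast(self, group, message):
        """Record the message and return the members that should receive it."""
        self.history[group].append(message)
        return sorted(self.members_of[group])

    def members(self, group):
        return sorted(self.members_of[group])
```

Because every join, leave, and multicast goes through this one object, the ordering between them is trivially synchronized, but the object is also the single point of failure named on the slide.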

6 Failure Masking & Replication
– Replicate processes and organize them into groups.
– Replace a single vulnerable process with a whole fault-tolerant group.
– A system is said to be K fault tolerant if it can survive faults in K components and still meet its specifications.
How much replication is needed to support K fault tolerance: K+1 or 2K+1?
1) If K processes simply stop, the answer from the remaining one can be used, so K+1 processes suffice.
2) If processes exhibit Byzantine failures, 2K+1 processes are needed, so that the K+1 correct replies outvote the K faulty ones.
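The two cases can be made concrete with a small sketch. The function names `replicas_needed` and `vote` are hypothetical, and the voting rule (accept a value only if at least K+1 replicas report it) is one simple way a client could mask up to K arbitrary replies.

```python
from collections import Counter


def replicas_needed(k, byzantine):
    """Group size needed for K fault tolerance under the two failure models."""
    return 2 * k + 1 if byzantine else k + 1


def vote(replies, k):
    """Return the value reported by at least k + 1 replicas, or None if there is none."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= k + 1 else None


# With K = 2 Byzantine replicas out of 5, the three correct replies still win.
answer = vote([7, 7, 7, 9, 3], k=2)
```

With only K+1 replicas and Byzantine behaviour, a single correct reply cannot be told apart from the K wrong ones, which is why `vote` returns `None` in that situation.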

7 Agreement in Faulty Systems
Why do we need agreement?
Goal of agreement:
– Have all the non-faulty processes reach consensus on some issue
– Establish that consensus within a finite number of steps
A process group typically needs to reach agreement when:
– Electing a coordinator
– Deciding whether or not to commit a transaction
– Dividing tasks among workers
– Synchronizing

8 When the communication and processes:
– are perfect, reaching agreement is often straightforward
– are not perfect, reaching agreement becomes problematic
Two problem cases:
– Good processes, but unreliable communication. Example: two-army problem
– Good communication, but faulty processes. Example: Byzantine generals problem

9 Two-army problem
Reaching agreement over an unreliable channel is classically stated as the two-army problem, and it is insoluble. The agreed-upon action will never take place, because the last sender can never be certain that the last confirmation got through (due to unreliable communication).

10 Byzantine generals problem
The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 thousand soldiers).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives in step 3.

11 Cont.
Step 4:
– Each process examines the ith element of each of the newly received vectors.
– If any value has a majority, that value is put into the result vector.
– If no value has a majority, the corresponding element of the result vector is marked UNKNOWN.
Result vector: (1, 2, UNKNOWN, 4)
The algorithm reaches agreement.
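Step 4 is an element-wise majority vote and can be sketched as follows. The function name `majority_vector` is hypothetical, and the three input vectors are illustrative stand-ins for what general 1 might receive when general 3 is the traitor (the letters represent arbitrary values the traitor reported or sent); the slide's figure itself is not reproduced here.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"


def majority_vector(received_vectors):
    """Element-wise majority over the vectors received from the other generals."""
    result = []
    for column in zip(*received_vectors):
        value, count = Counter(column).most_common(1)[0]
        # A value wins only with a strict majority of the received vectors.
        result.append(value if count > len(column) / 2 else UNKNOWN)
    return tuple(result)


# Assumed example: vectors received by general 1 from generals 2, 3 (traitor), and 4.
received_by_general_1 = [
    (1, 2, "y", 4),        # from general 2; "y" is what the traitor told general 2
    ("a", "b", "c", "d"),  # from the traitor: arbitrary values
    (1, 2, "z", 4),        # from general 4; "z" is what the traitor told general 4
]
result = majority_vector(received_by_general_1)
```

The traitor's own entry ends up UNKNOWN, but the loyal generals' values each appear in two of the three vectors and therefore survive the vote.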

12 Cont.
The same as on the previous slides, except now with 2 loyal generals and 1 traitor.

13 Cont.
Step 4:
– Each process examines the ith element of each of the newly received vectors.
– If any value has a majority, that value is put into the result vector.
– If no value has a majority, the corresponding element of the result vector is marked UNKNOWN.
Result vector: (UNKNOWN, UNKNOWN, UNKNOWN)
The algorithm fails to reach agreement.
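To see why one traitor is tolerated among four generals but not among three, the four steps can be simulated end to end. This is a sketch under assumptions: `run_agreement` is a hypothetical name, generals are indexed from 0, and the traitor is modelled as sending a distinct made-up string to every receiver so that its lies never accidentally form a majority.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"


def run_agreement(values, traitor):
    """Return the result vector computed by each loyal general (keyed by index)."""
    n = len(values)

    # Steps 1-2: everyone announces a value; each general assembles a vector.
    # The traitor tells every other general something different.
    assembled = {}
    for receiver in range(n):
        assembled[receiver] = tuple(
            values[sender] if sender != traitor or sender == receiver
            else f"lie-{sender}->{receiver}"
            for sender in range(n)
        )

    results = {}
    for receiver in range(n):
        if receiver == traitor:
            continue
        # Step 3: collect the vectors forwarded by the other generals.
        received = []
        for sender in range(n):
            if sender == receiver:
                continue
            if sender == traitor:
                received.append(tuple(f"junk-{receiver}-{i}" for i in range(n)))
            else:
                received.append(assembled[sender])
        # Step 4: element-wise strict majority.
        result = []
        for column in zip(*received):
            value, count = Counter(column).most_common(1)[0]
            result.append(value if count > len(column) / 2 else UNKNOWN)
        results[receiver] = tuple(result)
    return results


four = run_agreement([1, 2, 3, 4], traitor=2)
three = run_agreement([1, 2, 3], traitor=2)
```

In the four-general run every loyal general ends with the same vector, while in the three-general run each loyal general receives only two vectors, so no element can reach a strict majority.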

14 Concluding Remarks on the Byzantine Agreement Problem
– Lamport et al. (1982) proved that in a system with k faulty processes, agreement can be achieved only if 2k+1 correctly functioning processes are present, for a total of 3k+1.
– That is, agreement is possible only if more than two-thirds of the processes are working properly.
– Fischer et al. (1985) proved that in a distributed system in which messages cannot be guaranteed to be delivered within a known, finite time, no agreement is possible even if only one process is faulty.
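The 3k+1 bound is easy to check numerically. The helper names below are hypothetical; the sketch only encodes the inequality stated above and applies it to the two group sizes from the generals example.

```python
def min_group_size(k):
    """Smallest total number of processes that tolerates k Byzantine processes."""
    return 3 * k + 1


def agreement_possible(total, faulty):
    """True if at least 2 * faulty + 1 of the processes are correct."""
    return total - faulty >= 2 * faulty + 1


# Four generals with one traitor satisfy the bound; three generals do not.
four_ok = agreement_possible(total=4, faulty=1)
three_ok = agreement_possible(total=3, faulty=1)
```

Note that `total - faulty >= 2 * faulty + 1` is the same condition as `total >= 3 * faulty + 1`, i.e., strictly more than two-thirds of the processes must be correct.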

15 Process Failure Detection
– Before we can properly mask failures, we generally need to detect them.
– For a group of processes, non-faulty members should be able to decide who is still a member and who is not.
Two policies:
– Processes actively send “are you alive?” messages to each other (i.e., they ping each other)
– Processes passively wait until messages come in from different processes
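Both policies typically reduce to a timeout on the last time a process was heard from. The sketch below is an assumption-laden illustration: `TimeoutDetector` and its methods are hypothetical names, and `record_message` would be called either for a ping reply (active policy) or for any incoming message (passive policy). The clock is injectable so the behaviour can be checked without waiting.

```python
import time


class TimeoutDetector:
    """Suspects a process once nothing has been heard from it for `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # process id -> time of the last message or ping reply

    def record_message(self, pid):
        self.last_seen[pid] = self.clock()

    def suspected(self):
        now = self.clock()
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Usage with a fake clock: p1 goes quiet, p2 keeps sending.
now = [0.0]
detector = TimeoutDetector(timeout=3.0, clock=lambda: now[0])
detector.record_message("p1")
detector.record_message("p2")
now[0] = 2.0
detector.record_message("p2")
now[0] = 4.0
quiet = detector.suspected()
```

A timeout can only suspect a process; on its own it cannot tell a crashed member from a very slow one or from a broken network link.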

16 Failure Considerations
There are various issues that need to be taken into account when designing a failure detection subsystem:
– Failure detection can be done as a side effect of regularly exchanging information with neighbors (e.g., gossip-based information dissemination).
– A failure detection subsystem should ideally be able to distinguish network failures from node failures.
– When a member failure is detected, how should the other non-faulty processes be informed?
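One common way to get failure detection as a side effect of gossip is to piggyback heartbeat counters on the exchanged state. The sketch below assumes that scheme; it is not taken from the slides, and `GossipNode`, `tick`, `merge`, and `fail_after` are hypothetical names. Time is passed in explicitly to keep the example deterministic.

```python
class GossipNode:
    """Keeps a heartbeat table and suspects members whose counter stops rising."""

    def __init__(self, name, fail_after):
        self.name = name
        self.fail_after = fail_after
        # member -> (heartbeat counter, local time the counter last increased)
        self.table = {name: (0, 0)}

    def tick(self, now):
        """Increase this node's own heartbeat counter."""
        counter, _ = self.table[self.name]
        self.table[self.name] = (counter + 1, now)

    def merge(self, other_table, now):
        """Fold in a table gossiped by a neighbor, keeping the highest counters."""
        for member, (counter, _) in other_table.items():
            known, _ = self.table.get(member, (-1, now))
            if counter > known:
                self.table[member] = (counter, now)

    def suspected(self, now):
        return sorted(
            member for member, (_, seen) in self.table.items()
            if member != self.name and now - seen > self.fail_after
        )


a, b, c = (GossipNode(n, fail_after=3) for n in "abc")
for node in (a, b, c):
    node.tick(now=1)
a.merge(b.table, now=1)
a.merge(c.table, now=1)
# c crashes; a and b keep gossiping.
for t in (2, 5):
    a.tick(now=t)
    b.tick(now=t)
    a.merge(b.table, now=t)
down = a.suspected(now=5)
```

Because the tables themselves are gossiped, news of a stalled counter spreads to the other non-faulty members without a separate notification protocol, although this sketch still cannot distinguish a crashed node from a partitioned one.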

17 THANK YOU

