EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org

2 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Outline Midterm#2 result Group communication systems –Membership protocols –Agreed and safe delivery Checkpointing and recovery Reference: –Reliable distributed systems, by K. P. Birman, Springer; Chapter 14-16

3 Midterm#2 Result High 98, low 79, mean 92.7 Average Q1-18.9, Q2-17.6, Q3-18.3, Q4-19.1, Q5-18.9 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao

4 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Unreliable Failure Detection Recall that failures are hard to distinguish from network delay –So we accept risk of mistake –If p is running a protocol to exclude q because “q has failed”, all processes that hear from p will cut channels to q Avoids “messages from the dead” –q must rejoin to participate in GMS again

5 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Basic GMP Someone reports that “q has failed” Leader (process p) runs a 2-phase commit protocol –Announces a “proposed new GMS view” Excludes q, or might add some members who are joining, or could do both at once –Waits until a majority of members of current view have voted “ok” –Then commits the change

6 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao GMP Example Proposes new view: {p,r} [-q] Needs majority consent: p itself, plus one more (“current” view had 3 members) Can add members at the same time p q r Proposed V 1 = {p,r} V 0 = {p,q,r} OK Commit V 1 V 1 = {p,r}

7 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Special Concerns? What if someone doesn’t respond? –P can tolerate failures of a minority of members of the current view New first-round “overlaps” its commit: –“Commit that q has left. Propose add s and drop r” –P must wait if it can’t contact a majority Avoids risk of partitioning

8 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao What If Leader Fails? Here we do a 3-phase protocol –New leader identifies itself based on age ranking (oldest surviving process) –It runs an inquiry phase “The adored leader has died. Did he say anything to you before passing away?” Note that this causes participants to cut connections to the adored previous leader –Then run normal 2-phase protocol but “terminate” any interrupted view changes leader had initiated

9 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao GMP Example New leader first sends an inquiry Then proposes new view: {r,s} [-p] Needs majority consent: q itself, plus one more (“current” view had 3 members) Again, can add members at the same time p q r Proposed V 1 = {r,s} V 0 = {p,q,r} OK Commit V 1 V 1 = {r,s} Inquire [-p] OK: nothing was pending

10 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Safe and Agreed Delivery For totally ordered reliable multicast, there are two delivery policies –Safe delivery: a message is delivered only when all correct processes have received it –Agreed delivery: a message is delivered as long as it is the next message in total order

11 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Safe and Agreed Delivery Safe delivery guarantees the uniformity of multicast: –If a message is delivered to any process, it is delivered by all correct processes Agreed delivery does not: –It is possible that a message is delivered in one (or more) process, but is not delivered by some correct process

12 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Checkpointing Checkpointing: the act of taking a snapshot of an entity so that we can restore it later A replica is a process running in an operating system. The state of a process –Processes' memory, stack and registers –Threads –Open or mmap'ed files –Current working directory –Interprocess communication: Semaphores, shared memory, pipes, sockets –Dynamic Load Libraries –…

13 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Checkpointing Many tools are available to perform checkpointing transparently or semi- transparently –http://www.checkpointing.org/ –Condor, libckpt, etc. –Checkpoints taken in general are not portable –Checkpoint size might be big

14 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Checkpointing of Application State Sometimes it is more efficient to save and store the application state only –Checkpoints can be very portable and compact in size –class Counter { int counter; Counter(int initVal) { counter = initVal; } void increment() {counter++; } void decrement() {counter--; } void setState(int c) {counter = c; } int getState() { return counter;}| }

15 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Logging Logging of messages –Checkpointing in general is expensive –Logging of messages is cheaper => we can periodically do checkpointing, or do checkpointing on demand and log all messages in between Logging of other non-deterministic activities –Access order to shared data

16Recovery Roll-backward recovery –Used primarily by transaction processing –When a failure occurs, roll back using the most recent checkpoint (and retry) Roll-forward recovery –Used primarily in space redundancy –To recover a repaired replica, transfer the state from a current replica to the recovering replica Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao

17 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Roll-Forward Recovery With replication in space, it is possible to recover a fault while the system is progressing ahead Roll-forward recovery is made possible by –Checkpointing of replica state –Logging of incoming messages –Reliable, totally ordered group communication system

18 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Roll-Forward Recovery We want to ensure the newly admitted replica to have a consistent state with others when it starts Steps of adding a new replica into a group (with on-demand checkpointing) –A recovered (or a new) replica joins a group –A join message is multicast in total order –On receiving the join message, it is put into incoming message queue and wait for processing –When the join message is at the head of the queue, a checkpoint is taken and it is transferred to the new replica

19 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Roll-Forward Recovery –At the new replica, it starts queueing messages after it receives the join messages (sent by itself) –When the checkpoint is received by the new replica, its state is restored using the received checkpoint (the checkpoint is delivered out of order!) –The queued messages are delivered in order, at the new replica –Other replicas do not stop and wait for the new replica Steps of adding a new replica into a group with periodic checkpointing is similar

20 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Steps of Roll-Forward Recovery

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering.

Similar presentations

Presentation on theme: "EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering.

Similar presentations

Presentation on theme: "EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering."— Presentation transcript:

Similar presentations

About project

Feedback