EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering.

Slides:

Advertisements

Similar presentations

Reliable Communication in the Presence of Failures Kenneth Birman, Thomas Joseph Cornell University, 1987 Julia Campbell 19 November 2003.

Advertisements

CS 542: Topics in Distributed Systems Diganta Goswami.

CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Byzantine Fault Tolerance Steve Ko Computer Sciences and Engineering University at Buffalo.

CS514: Intermediate Course in Operating Systems Professor Ken Birman Vivek Vishnumurthy: TA.

Computer Science Lecture 18, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.

EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Distributed Systems Fall 2010 Replication Fall 20105DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.

Virtual Synchrony Jared Cantwell. Review Multicast Causal and total ordering Consistent Cuts Synchronized clocks Impossibility of consensus Distributed.

Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.

EEC 688/788 Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 16 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 15 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 16 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 13 Wenbing Zhao Department of Electrical and Computer Engineering.

Distributed Systems Fall 2009 Replication Fall 20095DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.

EEC 688/788 Secure and Dependable Computing Lecture 13 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Distributed Systems 2006 Group Membership * *With material adapted from Ken Birman.

EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

EEC 688/788 Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Ken Birman Cornell University. CS5410 Fall

EEC-681/781 Distributed Computing Systems Lecture 11 Wenbing Zhao Cleveland State University.

16: Distributed Systems1 DISTRIBUTED SYSTEM STRUCTURES NETWORK OPERATING SYSTEMS The users are aware of the physical structure of the network. Each site.

EEC 688 Secure and Dependable Computing Lecture 16 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 13 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 688/788 Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Replication with View Synchronous Group Communication Steve Ko Computer Sciences and Engineering.

Distributed Transactions Chapter 13

EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Fault Tolerance CSCI 4780/6780. Distributed Commit Commit – Making an operation permanent Transactions in databases One phase commit does not work !!!

Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.

EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

EEC 688/788 Secure and Dependable Computing Lecture 15 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Chapter 4 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.

Building Dependable Distributed Systems, Copyright Wenbing Zhao

Revisiting failure detectors Some of you asked questions about implementing consensus using S - how does it differ from reaching consensus using P. Here.

EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Multi-phase Commit Protocols1 Based on slides by Ken Birman, Cornell University.

EEC 688/788 Secure and Dependable Computing Lecture 9 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

1 Fault Tolerance and Recovery Mostly taken from

EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

Last Class: Fault Tolerance

Presentation transcript:

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

2 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Outline Midterm#2 result Group communication systems –Membership protocols –Agreed and safe delivery Checkpointing and recovery Reference: –Reliable distributed systems, by K. P. Birman, Springer; Chapter 14-16

3 Midterm#2 Result High 98, low 79, mean 92.7 Average Q1-18.9, Q2-17.6, Q3-18.3, Q4-19.1, Q Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao

4 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Unreliable Failure Detection Recall that failures are hard to distinguish from network delay –So we accept risk of mistake –If p is running a protocol to exclude q because “q has failed”, all processes that hear from p will cut channels to q Avoids “messages from the dead” –q must rejoin to participate in GMS again

5 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Basic GMP Someone reports that “q has failed” Leader (process p) runs a 2-phase commit protocol –Announces a “proposed new GMS view” Excludes q, or might add some members who are joining, or could do both at once –Waits until a majority of members of current view have voted “ok” –Then commits the change

6 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao GMP Example Proposes new view: {p,r} [-q] Needs majority consent: p itself, plus one more (“current” view had 3 members) Can add members at the same time p q r Proposed V 1 = {p,r} V 0 = {p,q,r} OK Commit V 1 V 1 = {p,r}

7 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Special Concerns? What if someone doesn’t respond? –P can tolerate failures of a minority of members of the current view New first-round “overlaps” its commit: –“Commit that q has left. Propose add s and drop r” –P must wait if it can’t contact a majority Avoids risk of partitioning

8 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao What If Leader Fails? Here we do a 3-phase protocol –New leader identifies itself based on age ranking (oldest surviving process) –It runs an inquiry phase “The adored leader has died. Did he say anything to you before passing away?” Note that this causes participants to cut connections to the adored previous leader –Then run normal 2-phase protocol but “terminate” any interrupted view changes leader had initiated

9 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao GMP Example New leader first sends an inquiry Then proposes new view: {r,s} [-p] Needs majority consent: q itself, plus one more (“current” view had 3 members) Again, can add members at the same time p q r Proposed V 1 = {r,s} V 0 = {p,q,r} OK Commit V 1 V 1 = {r,s} Inquire [-p] OK: nothing was pending

10 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Safe and Agreed Delivery For totally ordered reliable multicast, there are two delivery policies –Safe delivery: a message is delivered only when all correct processes have received it –Agreed delivery: a message is delivered as long as it is the next message in total order

11 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Safe and Agreed Delivery Safe delivery guarantees the uniformity of multicast: –If a message is delivered to any process, it is delivered by all correct processes Agreed delivery does not: –It is possible that a message is delivered in one (or more) process, but is not delivered by some correct process

12 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Checkpointing Checkpointing: the act of taking a snapshot of an entity so that we can restore it later A replica is a process running in an operating system. The state of a process –Processes' memory, stack and registers –Threads –Open or mmap'ed files –Current working directory –Interprocess communication: Semaphores, shared memory, pipes, sockets –Dynamic Load Libraries –…

13 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Checkpointing Many tools are available to perform checkpointing transparently or semi- transparently – –Condor, libckpt, etc. –Checkpoints taken in general are not portable –Checkpoint size might be big

14 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Checkpointing of Application State Sometimes it is more efficient to save and store the application state only –Checkpoints can be very portable and compact in size –class Counter { int counter; Counter(int initVal) { counter = initVal; } void increment() {counter++; } void decrement() {counter--; } void setState(int c) {counter = c; } int getState() { return counter;}| }

15 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Logging Logging of messages –Checkpointing in general is expensive –Logging of messages is cheaper => we can periodically do checkpointing, or do checkpointing on demand and log all messages in between Logging of other non-deterministic activities –Access order to shared data

16Recovery Roll-backward recovery –Used primarily by transaction processing –When a failure occurs, roll back using the most recent checkpoint (and retry) Roll-forward recovery –Used primarily in space redundancy –To recover a repaired replica, transfer the state from a current replica to the recovering replica Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao

17 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Roll-Forward Recovery With replication in space, it is possible to recover a fault while the system is progressing ahead Roll-forward recovery is made possible by –Checkpointing of replica state –Logging of incoming messages –Reliable, totally ordered group communication system

18 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Roll-Forward Recovery We want to ensure the newly admitted replica to have a consistent state with others when it starts Steps of adding a new replica into a group (with on-demand checkpointing) –A recovered (or a new) replica joins a group –A join message is multicast in total order –On receiving the join message, it is put into incoming message queue and wait for processing –When the join message is at the head of the queue, a checkpoint is taken and it is transferred to the new replica

19 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Roll-Forward Recovery –At the new replica, it starts queueing messages after it receives the join messages (sent by itself) –When the checkpoint is received by the new replica, its state is restored using the received checkpoint (the checkpoint is delivered out of order!) –The queued messages are delivered in order, at the new replica –Other replicas do not stop and wait for the new replica Steps of adding a new replica into a group with periodic checkpointing is similar

20 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Steps of Roll-Forward Recovery

21 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Steps of Roll-Forward Recovery

22 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Steps of Roll-Forward Recovery

23 Spring 2008EEC693: Secure & Dependable ComputingWenbing Zhao Steps of Roll-Forward Recovery