
EEC 688/788 Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University


1 EEC 688/788 Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org

2 Outline
Midterm #2 result
Group communication systems
– Agreed and safe delivery
Checkpointing and recovery
Spring 2009, EEC693: Secure & Dependable Computing, Wenbing Zhao

3 Midterm #2 Result
Scores: 98, 96, 90, 90
Average per question: Q1 29, Q2 26, Q3 18.5, Q4 20

4 Safe and Agreed Delivery
For totally ordered reliable multicast, there are two delivery policies:
– Safe delivery: a message is delivered only when all correct processes have received it
– Agreed delivery: a message is delivered as soon as it is the next message in the total order

5 Safe and Agreed Delivery
Safe delivery guarantees the uniformity of multicast:
– If a message is delivered by any process, it is delivered by all correct processes
Agreed delivery does not:
– A message may be delivered by one (or more) processes but not by some correct process, for example when the delivering process fails before the others have received the message
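
The difference between the two policies can be sketched as a single delivery test. This is a minimal illustration, not the protocol of any particular group communication system: the class OrderedDelivery, its method names, and the idea that each process learns through explicit acknowledgements which members hold a message are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: decides when a totally ordered message may be delivered.
class OrderedDelivery {
    enum Policy { AGREED, SAFE }

    private final Policy policy;
    private final Set<String> members;                  // processes assumed correct (including this one)
    private final Set<Integer> received = new HashSet<>();
    private final Map<Integer, Set<String>> acks = new HashMap<>(); // seq -> members known to hold it
    private int nextSeq = 0;                            // next position in the total order
    final List<Integer> delivered = new ArrayList<>();  // sequence numbers handed to the application

    OrderedDelivery(Policy policy, Set<String> members) {
        this.policy = policy;
        this.members = members;
    }

    // The message with this sequence number has arrived locally.
    void onReceive(int seq) {
        received.add(seq);
        tryDeliver();
    }

    // Member "from" has reported that it holds the message.
    void onAck(int seq, String from) {
        acks.computeIfAbsent(seq, k -> new HashSet<>()).add(from);
        tryDeliver();
    }

    private boolean deliverable(int seq) {
        if (!received.contains(seq)) return false;
        if (policy == Policy.AGREED) return true;       // agreed: next in order is enough
        // safe: additionally wait until every member is known to hold the message
        return acks.getOrDefault(seq, Collections.<String>emptySet()).containsAll(members);
    }

    private void tryDeliver() {
        while (deliverable(nextSeq)) {
            delivered.add(nextSeq);
            nextSeq++;
        }
    }
}
```

With the AGREED policy a message is delivered on receipt if it is next in order. With SAFE the same message stays pending until acknowledgements from every member have arrived, which is the price paid for uniformity.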

6 Checkpointing and Recovery
Faults occur over time. How can a fault-tolerant system remain operational for an extended period of time?
– Recover failed replicas, or replace failed replicas with new ones => recovery is needed
How do we recover a failed replica or install a new replica?
– Checkpoint a correct replica and transfer its state to the recovering replica

7 Checkpointing
Checkpointing: the act of taking a snapshot of an entity so that we can restore it later
A replica is a process running in an operating system. The state of a process includes:
– The process's memory, stack, and registers
– Threads
– Open or mmap'ed files
– Current working directory
– Interprocess communication: semaphores, shared memory, pipes, sockets
– Dynamically loaded libraries
– …

8 Checkpointing
Many tools are available to perform checkpointing transparently or semi-transparently
– http://www.checkpointing.org/
– Condor, libckpt, etc.
– Checkpoints taken by such tools are generally not portable
– Checkpoint size might be big

9 Checkpointing of Application State
Sometimes it is more efficient to save and store only the application state
– Checkpoints can be very portable and compact in size
    class Counter {
        int counter;                                // the entire application state
        Counter(int initVal) { counter = initVal; }
        void increment() { counter++; }
        void decrement() { counter--; }
        void setState(int c) { counter = c; }       // restore from a checkpoint
        int getState() { return counter; }          // take a checkpoint
    }
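
As a rough illustration of why application-level checkpoints are compact and portable, the sketch below encodes a counter's state as four big-endian bytes and restores a second replica from them. The class CheckpointableCounter and its checkpoint/restore methods are hypothetical additions for this example; the slide's Counter only defines getState and setState.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a counter whose checkpoint is just its application state.
class CheckpointableCounter {
    private int counter;

    CheckpointableCounter(int initVal) { counter = initVal; }

    void increment() { counter++; }
    void decrement() { counter--; }
    int getState() { return counter; }
    void setState(int c) { counter = c; }

    // Encodes the state as 4 bytes (ByteBuffer defaults to big-endian),
    // independent of the machine the replica runs on.
    byte[] checkpoint() {
        return ByteBuffer.allocate(Integer.BYTES).putInt(getState()).array();
    }

    // Overwrites the current state with the one stored in the checkpoint.
    void restore(byte[] checkpoint) {
        setState(ByteBuffer.wrap(checkpoint).getInt());
    }
}
```

A recovering replica would call restore with the bytes produced by checkpoint on a correct replica. Nothing about threads, files, or sockets is captured, which is exactly why the checkpoint stays small.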

10 Logging
Logging of messages
– Checkpointing in general is expensive
– Logging of messages is cheaper => we can checkpoint periodically or on demand, and log all messages in between
Logging of other non-deterministic activities
– Access order to shared data
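
The combination of occasional checkpoints and a message log can be sketched as follows. This is an in-memory illustration under simplifying assumptions: the class LoggingReplica and its Op messages are invented for the example, processing is assumed deterministic, and a real system would keep the checkpoint and log on stable storage or at another replica.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a counter replica that checkpoints occasionally
// and logs every message delivered since the last checkpoint.
class LoggingReplica {
    enum Op { INCREMENT, DECREMENT }

    private int counter;
    private int lastCheckpoint;                                     // saved application state
    private final List<Op> logSinceCheckpoint = new ArrayList<>();  // cheap per-message log

    void deliver(Op op) {
        logSinceCheckpoint.add(op);       // log first, then process
        counter = apply(counter, op);
    }

    void takeCheckpoint() {
        lastCheckpoint = counter;
        logSinceCheckpoint.clear();       // older messages are covered by the checkpoint
    }

    // Rebuilds the current state from the checkpoint plus the log,
    // as a recovering replica would after losing its in-memory state.
    int recoverState() {
        int state = lastCheckpoint;
        for (Op op : logSinceCheckpoint) {
            state = apply(state, op);
        }
        return state;
    }

    int getState() { return counter; }

    private static int apply(int state, Op op) {
        return op == Op.INCREMENT ? state + 1 : state - 1;
    }
}
```

Replaying the log only reproduces the state because apply is deterministic; non-deterministic activities such as the access order to shared data would have to be logged as well.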

11 Roll-Forward Recovery
With replication in space, it is possible to recover from a fault while the system continues to make progress
Roll-forward recovery is made possible by
– Checkpointing of replica state
– Logging of incoming messages
– A reliable, totally ordered group communication system

12 Roll-Forward Recovery
We want to ensure that a newly admitted replica has a state consistent with the others when it starts
Steps of adding a new replica to a group (with on-demand checkpointing)
– A recovered (or new) replica joins the group
– A join message is multicast in total order
– On receiving the join message, an existing replica places it in its incoming message queue to await processing
– When the join message reaches the head of the queue, a checkpoint is taken and transferred to the new replica

13 Roll-Forward Recovery
– The new replica starts queueing messages after it receives the join message (sent by itself)
– When the new replica receives the checkpoint, it restores its state from it (the checkpoint is delivered out of order!)
– The queued messages are then delivered in order at the new replica
– Other replicas do not stop and wait for the new replica
The steps for adding a new replica to a group with periodic checkpointing are similar
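
The join steps can be sketched with two small classes. This is an illustration under assumptions, not the lecture's implementation: the names ExistingReplica and JoiningReplica are invented, replica state is a single integer, each ordered message carries a hypothetical integer delta, and the caller plays the role of the totally ordered group communication system by invoking the methods in delivery order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a replica that is already a group member.
class ExistingReplica {
    private int state;

    void onOrderedMessage(int delta) { state += delta; }

    // Called when the join message reaches the head of the incoming queue:
    // the checkpoint reflects exactly the messages ordered before the join.
    int onJoin() { return state; }

    int getState() { return state; }
}

// Hypothetical sketch: a new or recovered replica being admitted to the group.
class JoiningReplica {
    private int state;
    private boolean joinSeen = false;   // has our own join message been delivered?
    private boolean restored = false;   // has the checkpoint been applied?
    private final Deque<Integer> queued = new ArrayDeque<>();

    void onOwnJoin() { joinSeen = true; }

    void onOrderedMessage(int delta) {
        if (!joinSeen) return;          // ordered before the join: covered by the checkpoint
        if (!restored) {
            queued.add(delta);          // ordered after the join: hold until restored
        } else {
            state += delta;
        }
    }

    // The checkpoint arrives outside the total order; apply it, then drain the queue in order.
    void onCheckpoint(int checkpoint) {
        state = checkpoint;
        restored = true;
        while (!queued.isEmpty()) {
            state += queued.poll();
        }
    }

    int getState() { return state; }
}
```

The existing replica keeps processing messages while the checkpoint is in transit, and the two states agree again once the new replica has drained its queue.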

14-17 Steps of Roll-Forward Recovery [four figure slides; the figures are not reproduced in this transcript]

18 Roll-backward Recovery
Roll-backward recovery is used for systems that rely on replication in time for fault tolerance
– When a failure occurs, roll back to the most recent checkpoint (and retry)
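
A minimal sketch of the roll-back-and-retry idea for a single process follows. It assumes the failure is transient and shows up as a RuntimeException; the class RollbackExecutor, its method, and the retry limit are hypothetical choices for the example.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch: re-executes a computation from its last checkpoint after a failure.
class RollbackExecutor {
    static int runWithRollback(int checkpoint, IntUnaryOperator computation, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            int state = checkpoint;                    // roll back: discard any partial work
            try {
                return computation.applyAsInt(state);  // retry from the checkpointed state
            } catch (RuntimeException fault) {
                // assumed transient; fall through and retry from the checkpoint
            }
        }
        throw new IllegalStateException(
            "computation failed after " + (maxRetries + 1) + " attempts");
    }
}
```

Every attempt starts from the same checkpointed value, so work done by a failed attempt never leaks into the result.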

19 Roll-backward Recovery in a Distributed System
Performing roll-backward recovery in a distributed system is non-trivial
– Need to solve the distributed snapshot problem
– It is easy to take a local checkpoint of a process, but in a distributed system, when one process rolls back, other processes must also roll back to a consistent state

20 Distributed Snapshot Problem
Goal: determine the global system state
– e.g., the total amount of money
Assumptions
– Each process records its own state
– No shared clock/memory
Imagine a group of photographers, each taking a snapshot of a different portion of a scene, trying to combine their pictures into the overall picture

21 Distributed Snapshot
A distributed snapshot reflects a state in which the distributed system might have been
What constitutes a consistent global state?
– If we have recorded that process P has received a message from another process Q, then we should also have recorded that process Q had actually sent the message
– The reverse condition (Q has sent a message that P has not yet received) is allowed

22 Distributed Snapshot
A pair of mutually consistent checkpoints

23 Distributed Snapshot
A missing message => need to log messages (i.e., consider channel state in addition to process state)

24 Distributed Snapshot
An orphan message
The two checkpoints are definitely not consistent
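
The three cases above (consistent, missing message, orphan message) can be told apart with simple set arithmetic. The sketch assumes each checkpoint records the IDs of the messages its process has sent and received, and that these records have been merged into one "sent" set and one "received" set; the class CutChecker and the message IDs are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: classifies messages relative to a set of checkpoints.
class CutChecker {
    // Orphans: recorded as received, but no checkpoint records them as sent.
    static Set<String> orphans(Set<String> sent, Set<String> received) {
        Set<String> result = new TreeSet<>(received);
        result.removeAll(sent);
        return result;
    }

    // Missing (in-transit) messages: recorded as sent, not yet recorded as received.
    // These are allowed, but must be logged as channel state.
    static Set<String> inTransit(Set<String> sent, Set<String> received) {
        Set<String> result = new TreeSet<>(sent);
        result.removeAll(received);
        return result;
    }

    // The checkpoints are mutually consistent only if there is no orphan message.
    static boolean isConsistent(Set<String> sent, Set<String> received) {
        return orphans(sent, received).isEmpty();
    }
}
```

A non-empty inTransit result does not make the checkpoints inconsistent; it only means the channel state has to be saved along with the process states.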

25 Chandy and Lamport's Algorithm
Assumptions
– FIFO, unidirectional, reliable channels (a bidirectional channel is modelled as two unidirectional channels)
– No process fails during the snapshot
– System state consists of process state and channel state (messages sent but not yet received)
– Any process P can initiate taking a distributed snapshot

26 Chandy and Lamport's Algorithm
P starts by recording its own local state and sending a marker along each of its outgoing channels
When Q receives a marker through channel C, its action depends on whether it has already recorded its local state:
– Not yet recorded: it records its local state and sends the marker along each of its outgoing channels; it then starts recording incoming messages on its OTHER incoming channels (the state of C itself is recorded as empty)
– Already recorded: the marker on C indicates that the state of channel C should now be recorded as all messages received on C after Q recorded its own state and before this marker

27 Chandy and Lamport's Algorithm
Q is finished when it has received a marker along each of its incoming channels
The recorded local state, together with the state recorded for each incoming channel, can be collected and sent to the process that initiated the snapshot
The global state can subsequently be constructed

28 Chandy and Lamport's Algorithm
[Figure: process Q with incoming channels C1 and C2 and a marker M]
Process Q receives a marker for the first time (from C1) and records its local state
Q records all incoming messages on C2 (and on other incoming channels except C1, if any)
Q receives a marker on its incoming channel C2 and finishes recording the state of C2
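
The marker rules can be exercised in a small single-threaded simulation of the "total amount of money" example. This is a sketch under assumptions rather than a faithful distributed implementation: the classes SnapshotProcess and SnapshotNetwork, the "from->to" channel naming, and the MARKER sentinel value are all invented for the example, and the caller decides the order in which channel heads are delivered.

```java
import java.util.*;

// One process: a balance plus the bookkeeping needed for a snapshot.
class SnapshotProcess {
    final String name;
    int balance;
    Integer recordedBalance;                                          // null until local state is recorded
    final Map<String, List<Integer>> channelState = new HashMap<>();  // recorded in-flight amounts
    final Set<String> recording = new HashSet<>();                    // incoming channels still being recorded

    SnapshotProcess(String name, int balance) { this.name = name; this.balance = balance; }
}

class SnapshotNetwork {
    static final int MARKER = Integer.MIN_VALUE;                      // sentinel, assumed never a real amount
    private final Map<String, SnapshotProcess> procs = new LinkedHashMap<>();
    private final Map<String, Deque<Integer>> channels = new LinkedHashMap<>(); // "from->to" -> FIFO queue

    void addProcess(String name, int balance) { procs.put(name, new SnapshotProcess(name, balance)); }
    void connect(String from, String to) { channels.put(from + "->" + to, new ArrayDeque<>()); }

    // Transfers money: debit the sender now, credit the receiver on delivery.
    void send(String from, String to, int amount) {
        procs.get(from).balance -= amount;
        channels.get(from + "->" + to).add(amount);
    }

    void startSnapshot(String initiator) { recordLocalState(procs.get(initiator)); }

    // Record local state, send a marker on every outgoing channel,
    // and start recording every incoming channel.
    private void recordLocalState(SnapshotProcess p) {
        p.recordedBalance = p.balance;
        for (Map.Entry<String, Deque<Integer>> ch : channels.entrySet()) {
            if (ch.getKey().startsWith(p.name + "->")) ch.getValue().add(MARKER);
            if (ch.getKey().endsWith("->" + p.name)) {
                p.channelState.put(ch.getKey(), new ArrayList<>());
                p.recording.add(ch.getKey());
            }
        }
    }

    // Delivers the message at the head of channel from->to.
    void deliver(String from, String to) {
        String channel = from + "->" + to;
        SnapshotProcess q = procs.get(to);
        int msg = channels.get(channel).remove();
        if (msg == MARKER) {
            if (q.recordedBalance == null) recordLocalState(q);  // first marker seen by q
            q.recording.remove(channel);  // this channel's state is now final (empty on first marker)
        } else {
            if (q.recording.contains(channel)) q.channelState.get(channel).add(msg);
            q.balance += msg;
        }
    }

    // True once every process has recorded its state and seen a marker on every incoming channel.
    boolean complete() {
        for (SnapshotProcess p : procs.values()) {
            if (p.recordedBalance == null || !p.recording.isEmpty()) return false;
        }
        return true;
    }

    // Global state: recorded balances plus recorded in-flight amounts. Call only when complete().
    int snapshotTotal() {
        int total = 0;
        for (SnapshotProcess p : procs.values()) {
            total += p.recordedBalance;
            for (List<Integer> msgs : p.channelState.values()) {
                for (int m : msgs) total += m;
            }
        }
        return total;
    }
}
```

If P starts with 100 and Q with 50, and transfers are in flight when P initiates the snapshot, the recorded balances alone may not add up to 150; the recorded channel states account for the difference.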

