EEC 688/788 Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University


2 Outline
Spring 2009, EEC693: Secure & Dependable Computing, Wenbing Zhao
Midterm #2 result
Group communication systems
– Agreed and safe delivery
Checkpointing and recovery

3 Midterm #2 Result
Scores: 98, 96, 90, 90
Average: Q1-29, Q2-26, Q3-18.5, Q4-20

4 Safe and Agreed Delivery
For totally ordered reliable multicast, there are two delivery policies:
– Safe delivery: a message is delivered only when all correct processes have received it
– Agreed delivery: a message is delivered as soon as it is the next message in the total order

5 Safe and Agreed Delivery
Safe delivery guarantees the uniformity of multicast:
– If a message is delivered to any process, it is delivered to all correct processes
Agreed delivery does not:
– It is possible that a message is delivered at one (or more) processes but is not delivered at some correct process
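The difference between the two policies can be sketched as a single delivery test. This is a minimal illustration, not the interface of any particular group communication system: the class `DeliveryCheck`, its parameters, and the use of process names as strings are all assumptions made for the example.

```java
import java.util.Set;

// Hypothetical sketch: decides whether a totally ordered message may be delivered.
class DeliveryCheck {
    enum Policy { AGREED, SAFE }

    // seq:     position of the message in the total order
    // nextSeq: next position this process expects to deliver
    // ackedBy: processes known to have received the message (assumed to be tracked elsewhere)
    // correct: processes currently considered correct
    static boolean canDeliver(Policy policy, long seq, long nextSeq,
                              Set<String> ackedBy, Set<String> correct) {
        boolean nextInOrder = (seq == nextSeq);
        if (policy == Policy.AGREED) {
            // Agreed delivery: being next in the total order is enough.
            return nextInOrder;
        }
        // Safe delivery: additionally wait until every correct process has received it.
        return nextInOrder && ackedBy.containsAll(correct);
    }
}
```

With the same message and the same acknowledgements, the agreed policy can deliver earlier than the safe policy, which is the price paid for uniformity.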

6 Checkpointing and Recovery
Faults occur over time. How do we ensure that a fault-tolerant system remains operational for an extended period of time?
– Recover failed replicas, or replace failed replicas with new ones => recovery is needed
How do we recover a failed replica or install a new replica?
– Checkpoint a correct replica and transfer the state to the recovering replica

7 Checkpointing
Checkpointing: the act of taking a snapshot of an entity so that we can restore it later
A replica is a process running in an operating system. The state of a process includes:
– The process's memory, stack, and registers
– Threads
– Open or mmap'ed files
– Current working directory
– Interprocess communication: semaphores, shared memory, pipes, sockets
– Dynamically loaded libraries
– …

8 Checkpointing
Many tools are available to perform checkpointing transparently or semi-transparently
– Condor, libckpt, etc.
– Checkpoints taken this way are generally not portable
– Checkpoint size might be large

9 Checkpointing of Application State
Sometimes it is more efficient to save and store only the application state
– Checkpoints can be very portable and compact in size
class Counter {
    int counter;
    Counter(int initVal) { counter = initVal; }
    void increment() { counter++; }
    void decrement() { counter--; }
    void setState(int c) { counter = c; }   // restore from a checkpoint
    int getState() { return counter; }      // take a checkpoint
}
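As a rough usage sketch, the `getState`/`setState` pair is enough to move a checkpoint from a correct replica to a recovering one. `Counter` mirrors the class on the slide; `CheckpointDemo` and `recoverFrom` are hypothetical names introduced only for this example, and the state transfer is modelled as a plain method call rather than a network message.

```java
// Counter mirrors the class on the slide.
class Counter {
    private int counter;
    Counter(int initVal) { counter = initVal; }
    void increment() { counter++; }
    void decrement() { counter--; }
    void setState(int c) { counter = c; }
    int getState() { return counter; }
}

// Hypothetical helper: installs a correct replica's application state in a fresh replica.
class CheckpointDemo {
    static Counter recoverFrom(Counter correctReplica) {
        // The checkpoint is a single int, so it is compact and portable.
        int checkpoint = correctReplica.getState();
        Counter recovering = new Counter(0);
        recovering.setState(checkpoint);
        return recovering;
    }
}
```

Compared with a process-level checkpoint, nothing about memory layout, threads, or open files needs to be captured here, which is why the checkpoint stays small.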

10 Logging
Logging of messages
– Checkpointing in general is expensive
– Logging of messages is cheaper => we can checkpoint periodically or on demand, and log all messages in between
Logging of other non-deterministic activities
– Access order to shared data
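One way to picture the combination of occasional checkpoints and a message log is the sketch below. It assumes a deterministic counter whose only messages are increment and decrement; `LoggedCounter`, `Op`, and the in-memory log are illustrative assumptions, whereas a real system would keep the log on stable storage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: checkpoint occasionally, log every message in between.
class LoggedCounter {
    enum Op { INC, DEC }

    private int value;
    private int checkpoint;                          // last saved state
    private final List<Op> log = new ArrayList<>();  // messages since the last checkpoint

    LoggedCounter(int initVal) { value = initVal; checkpoint = initVal; }

    void handle(Op op) {
        log.add(op);                 // logging a message is cheap
        value = apply(value, op);
    }

    void takeCheckpoint() {
        checkpoint = value;          // expensive in general, so done rarely
        log.clear();                 // older messages are now covered by the checkpoint
    }

    // Rebuild the state from the last checkpoint plus the logged messages.
    int recover() {
        int rebuilt = checkpoint;
        for (Op op : log) rebuilt = apply(rebuilt, op);
        value = rebuilt;
        return value;
    }

    int getState() { return value; }

    private static int apply(int state, Op op) {
        return op == Op.INC ? state + 1 : state - 1;
    }
}
```

Replay only reproduces the original state because the message handling is deterministic; other sources of non-determinism would have to be logged as well.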

11 Roll-Forward Recovery
With replication in space, it is possible to recover from a fault while the system continues to make progress
Roll-forward recovery is made possible by:
– Checkpointing of replica state
– Logging of incoming messages
– A reliable, totally ordered group communication system

12 Roll-Forward Recovery
We want to ensure that a newly admitted replica starts with a state consistent with the others
Steps for adding a new replica to a group (with on-demand checkpointing):
– A recovered (or new) replica joins the group
– A join message is multicast in total order
– On receiving the join message, an existing replica puts it into its incoming message queue, where it waits to be processed
– When the join message reaches the head of the queue, a checkpoint is taken and transferred to the new replica

13 Roll-Forward Recovery
– The new replica starts queueing messages after it receives the join message (sent by itself)
– When the new replica receives the checkpoint, it restores its state from it (the checkpoint is delivered out of order!)
– The queued messages are then delivered in order at the new replica
– Other replicas do not stop and wait for the new replica
The steps for adding a new replica to a group with periodic checkpointing are similar
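The new replica's side of these steps might look like the following sketch. It assumes the application state is a single integer and each application message is an integer delta; `JoiningReplica` and its callback names are hypothetical, and the group communication layer that invokes them in total order is not shown.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of a replica that is being added to a running group.
class JoiningReplica {
    private int state;
    private boolean joinSeen = false;   // own join message delivered?
    private boolean restored = false;   // checkpoint applied?
    private final Queue<Integer> pending = new ArrayDeque<>();

    // Called when this replica's own join message is delivered in total order.
    void onOwnJoin() { joinSeen = true; }

    // Called for each totally ordered application message.
    void onMessage(int delta) {
        if (!joinSeen) return;           // ordered before the join: already reflected in the checkpoint
        if (!restored) {
            pending.add(delta);          // queue until the checkpoint arrives
        } else {
            state += delta;
        }
    }

    // Called when the checkpoint arrives, outside the total order.
    void onCheckpoint(int checkpoint) {
        state = checkpoint;
        restored = true;
        while (!pending.isEmpty()) {
            state += pending.remove();   // deliver queued messages in order
        }
    }

    int getState() { return state; }
}
```

Because the checkpoint is taken exactly at the join message's position in the total order, the checkpoint plus the queued messages yields the same state as the replicas that never stopped.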

14-17 Steps of Roll-Forward Recovery (four figure slides; the figures are not part of the transcript)

18 Roll-Backward Recovery
Roll-backward recovery is used for systems that rely on replication in time for fault tolerance
– When a failure occurs, roll back to the most recent checkpoint (and retry)
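For a single process, roll back and retry can be sketched as a loop that restores the checkpoint before every attempt. This is only an illustration under simplifying assumptions: the state is one integer, faults are transient and surface as a `RuntimeException`, and `RollbackRetry`, `runWithRollback`, and `maxAttempts` are names invented for the example.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of roll-backward recovery for one process.
class RollbackRetry {
    // Runs `step` starting from the checkpointed state. If the step fails,
    // its partial work is discarded and the step is retried from the checkpoint.
    static int runWithRollback(int checkpoint, IntUnaryOperator step, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int state = checkpoint;              // roll back to the most recent checkpoint
            try {
                return step.applyAsInt(state);   // retry the computation
            } catch (RuntimeException failure) {
                // Assumed to be a transient fault; the loop retries from the checkpoint.
            }
        }
        throw new IllegalStateException("still failing after " + maxAttempts + " attempts");
    }
}
```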

19 Roll-Backward Recovery in a Distributed System
Performing roll-backward recovery in a distributed system is non-trivial
– We need to solve the distributed snapshot problem
– It is easy to take a local checkpoint of a process, but in a distributed system, when one process rolls back, other processes must also roll back to a consistent state

20 Distributed Snapshot Problem
Goal: determine the global system state
– e.g., the total amount of money
Assumptions
– Each process records its own state
– No shared clock or memory
Imagine a group of photographers taking snapshots of different portions of a scene and trying to combine them to get the overall picture

21 Distributed Snapshot
A distributed snapshot reflects a state in which the distributed system might have been
What constitutes a consistent global state?
– If we have recorded that process P has received a message from another process Q, then we should also have recorded that process Q had actually sent the message
– The reverse condition (Q has sent a message that P has not yet received) is allowed

22 Distributed Snapshot
A pair of mutually consistent checkpoints

23 Distributed Snapshot
A missing message => need to log messages (i.e., consider channel state in addition to process state)

24 Distributed Snapshot
An orphan message
The two checkpoints are definitely not consistent
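The three cases (consistent, missing message, orphan message) can be checked mechanically for one channel from Q to P. The sketch below assumes each message carries a unique integer id and that each checkpoint records the ids sent or received so far; `CutCheck` and its method names are illustrative only.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: compares Q's and P's checkpoints for the channel Q -> P.
class CutCheck {
    // sentByQ:     ids of messages Q had sent to P when Q's checkpoint was taken
    // receivedByP: ids of messages from Q that P had received when P's checkpoint was taken
    static boolean isConsistent(Set<Integer> sentByQ, Set<Integer> receivedByP) {
        // No orphan message: every recorded receive has a recorded send.
        return sentByQ.containsAll(receivedByP);
    }

    // Messages sent but not yet received: the channel state that must be logged.
    static Set<Integer> inTransit(Set<Integer> sentByQ, Set<Integer> receivedByP) {
        Set<Integer> missing = new HashSet<>(sentByQ);
        missing.removeAll(receivedByP);
        return missing;
    }
}
```

A non-empty `inTransit` result is still a consistent pair of checkpoints, provided those messages are saved as channel state; a received id with no matching send is an orphan and makes the pair inconsistent.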

25 Chandy and Lamport's Algorithm
Assumptions
– FIFO, unidirectional, reliable channels (a bidirectional channel is modelled as two unidirectional channels)
– No process fails during the snapshot
– System state consists of process state and channel state (messages sent but not received)
– Any process P can initiate a distributed snapshot

26 Chandy and Lamport's Algorithm
P starts by recording its own local state and sending a marker along each of its outgoing channels
When Q receives a marker through channel C, its action depends on whether it has already recorded its local state:
– Not yet recorded: it records its local state and sends the marker along each of its outgoing channels. It starts recording incoming messages on OTHER channels
– Already recorded: the marker on C indicates that the channel’s state should be recorded: all messages received on C before this marker and after Q recorded its own state

27 Chandy and Lamport's Algorithm
Q is finished when it has received a marker along each of its incoming channels
The recorded local state, together with the state recorded for each incoming channel, can be collected and sent to the process that initiated the snapshot
The global state can subsequently be constructed

28 Chandy and Lamport's Algorithm
Figure labels: M, C1, C2
– Process Q receives a marker for the first time (from C1) and records its local state
– Q records all incoming messages on C2 (and other incoming channels except C1, if any)
– Q receives a marker on its incoming channel C2 and finishes recording the state of C2
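One process's side of the marker rules could be sketched as follows. This is a simplified illustration, not a complete implementation: the local state is a single integer balance, channels are identified by name, and sending a marker is modelled by appending the channel name to `markersSent` instead of transmitting anything. `SnapshotProcess` and all of its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the marker rules at a single process.
class SnapshotProcess {
    private int balance;                                    // local state
    private final List<String> incoming;
    private final List<String> outgoing;
    private Integer recordedState = null;                   // null until the local state is recorded
    private final Map<String, List<Integer>> channelState = new HashMap<>();
    private final Set<String> recording = new HashSet<>();  // incoming channels still being recorded
    final List<String> markersSent = new ArrayList<>();     // stands in for real marker sends

    SnapshotProcess(int balance, List<String> incoming, List<String> outgoing) {
        this.balance = balance;
        this.incoming = new ArrayList<>(incoming);
        this.outgoing = new ArrayList<>(outgoing);
    }

    // Initiator: record the local state and send markers on all outgoing channels.
    void startSnapshot() { recordLocalState(null); }

    void onMarker(String channel) {
        if (recordedState == null) {
            recordLocalState(channel);   // first marker: this channel's state is empty
        } else {
            recording.remove(channel);   // later marker: this channel's state is complete
        }
    }

    void onMessage(String channel, int amount) {
        balance += amount;
        if (recording.contains(channel)) {
            channelState.get(channel).add(amount);  // received after recording, before the marker
        }
    }

    private void recordLocalState(String markerChannel) {
        recordedState = balance;
        markersSent.addAll(outgoing);
        for (String c : incoming) {
            channelState.put(c, new ArrayList<>());
            if (!c.equals(markerChannel)) recording.add(c);
        }
    }

    // Finished once a marker has arrived on every incoming channel.
    boolean isFinished() { return recordedState != null && recording.isEmpty(); }
    int recordedLocalState() { return recordedState; }
    List<Integer> recordedChannel(String channel) { return channelState.get(channel); }
}
```

Replaying the scenario from the slide, Q gets a marker on C1, then a message on C2, then a marker on C2. Its recorded local state excludes that message, while the recorded state of C2 contains it, so the amount is counted exactly once in the global state.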