Distributed Snapshot Distributed Systems.

Introduction What is a distributed system? A network of processes. The nodes are processes, and the edges are communication channels.

Introduction A computation is a sequence of atomic actions that transform a given initial state to the final state. While such actions are totally ordered in a sequential process, they are only partially ordered in a distributed system.

Introduction In this context, the state (also known as global state) of a distributed system is the set of local states of all the component processes, as well as the states of every channel through which messages flow.

Introduction So the important question is: when or how do we record the states of the processes and the channels? Depending on when the states of the individual components are recorded, the value of the global state can vary widely.

Difficulties Recording the global state may look simple to an external observer who looks at the system from outside. The same problem is surprisingly challenging when one takes a snapshot from inside the system.

Difficulties Consider a system of three processes numbered 0, 1, and 2 connected by FIFO channels, and assume that an unknown number of indistinguishable tokens are circulating indefinitely through this network. We want the processes to cooperate with one another to count the exact number of tokens circulating in the system (without ever stopping the system).
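
To see why the count is hard to get right, here is a minimal Python sketch of the three-process ring with a single token. The helper names (`make_ring`, `send`, `receive`) and the polling "observer" are assumptions made for illustration; they are not part of the slides. The observer reads the processes one at a time while the token keeps moving, and ends up counting the same token three times.

```python
from collections import deque

def make_ring(tokens_at):
    """holding[i] = tokens held by process i; chan[i] = FIFO channel i -> (i+1) % 3."""
    return list(tokens_at), [deque() for _ in range(3)]

def send(holding, chan, i):
    """Process i puts one token on its outgoing channel."""
    holding[i] -= 1
    chan[i].append("token")

def receive(holding, chan, i):
    """Process i takes one token from its incoming channel (i-1) -> i."""
    chan[(i - 1) % 3].popleft()
    holding[i] += 1

holding, chan = make_ring([1, 0, 0])   # exactly one token in the system

# A naive observer polls the processes one at a time while the token moves.
seen = []
seen.append(holding[0])                # sees the token at process 0
send(holding, chan, 0)
receive(holding, chan, 1)
seen.append(holding[1])                # sees the same token again at process 1
send(holding, chan, 1)
receive(holding, chan, 2)
seen.append(holding[2])                # and once more at process 2

naive_count = sum(seen)
true_count = sum(holding) + sum(len(c) for c in chan)
print(naive_count, true_count)         # 3 1
```

Polling in the opposite direction could just as easily miss the token altogether, which is why the local recordings have to be coordinated.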

Difficulties
Deadlock detection. Any process that does not have an eligible action for a prolonged period would like to find out if the system has reached a deadlock configuration.
Termination detection. Before beginning the computation of a certain phase, a process must know whether every other process has finished its computation in the previous phase.
Network reset. In case of a malfunction or a loss of coordination, a distributed system will need to roll back to a consistent global state and initiate a recovery. Previous snapshots may be helpful.

Properties of Consistent Snapshots A snapshot state (SSS) consists of a set of local states, where each local state is the outcome of a recording event that follows a send, or a receive, or an internal action. The important notion here is that of a consistent cut.

Properties of Consistent Snapshots A cut is a set of events that contains at least one event per process. A cut is called consistent if, for each event that it contains, it also includes all events causally ordered before it.
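
The definition can be checked mechanically. The sketch below is one possible encoding, not something taken from the slides: events are hypothetical `(process, sequence_number)` pairs numbered from 0 on each process, and `messages` lists `(send_event, receive_event)` pairs. It is enough to check the direct predecessors of every event (the previous local event and, for a receive, the matching send), because a cut closed under direct predecessors is closed under the whole causal order.

```python
def is_consistent(cut, messages):
    """cut: set of (process, seq) events; messages: list of (send_event, recv_event)."""
    recv_to_send = {recv: send for send, recv in messages}
    for (p, k) in cut:
        # The previous event on the same process must be in the cut.
        if k > 0 and (p, k - 1) not in cut:
            return False
        # If this event is a receive, its matching send must be in the cut.
        send = recv_to_send.get((p, k))
        if send is not None and send not in cut:
            return False
    return True

# Process 0: event (0, 1) sends a message that process 1 receives as event (1, 1).
messages = [((0, 1), (1, 1))]

good_cut = {(0, 0), (0, 1), (1, 0)}   # send included, receive not yet: message in transit
bad_cut = {(0, 0), (1, 0), (1, 1)}    # receive included without its send

print(is_consistent(good_cut, messages))  # True
print(is_consistent(bad_cut, messages))   # False
```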

Properties of Consistent Snapshots The set of local states following the most recent events of a consistent cut forms a consistent snapshot. In a distributed system, many consistent snapshots can be recorded. A snapshot that is often of practical interest is the one that is most recent.

The Chandy-Lamport Algorithm Let the topology of a distributed system be represented by a strongly connected graph. Each node represents a process and each directed edge represents a FIFO channel. A process called the initiator initiates the distributed snapshot algorithm. Any process can be an initiator. The initiator process sends a special message, called a marker (*) that prompts other processes in the system to record their states. The global state consists of the states of the processes as well as the channels. However, channels are passive entities — so the responsibility of recording the state of a channel lies with the process on which the channel is incident.

The Chandy-Lamport Algorithm
DS1. The initiator process, in one atomic action, turns red, records its own state, and sends a marker along all its outgoing channels.
DS2. Every process, upon receiving a marker for the first time and before doing anything else, does the following in one atomic action: turns red, records its state, and sends markers along all its outgoing channels.

The Chandy-Lamport Algorithm The snapshot algorithm terminates when (1) every process has turned red, and (2) every process has received a marker through each of its incoming channels.
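
The rules DS1 and DS2 and the termination condition can be sketched as a small deterministic simulation in Python. Everything below is an illustrative assumption rather than the slides' own code: the `Process` class, the token-passing workload, the three-process ring, and the hand-written delivery order. The channel bookkeeping follows the standard Chandy-Lamport rule, which the slides only hint at: a red process records every message that arrives on an incoming channel before the marker arrives on that channel.

```python
from collections import deque

MARKER = "*"  # the special marker message

class Process:
    def __init__(self, pid, tokens, incoming, outgoing):
        self.pid, self.tokens = pid, tokens
        self.incoming, self.outgoing = incoming, outgoing  # lists of neighbour ids
        self.red = False
        self.recorded_state = None
        self.channel_state = {}   # src -> tokens recorded in transit on channel src -> self
        self.marker_seen = set()  # incoming channels on which a marker has arrived

    def record(self, net):
        """DS1 / DS2: turn red, record own state, send markers, as one atomic step."""
        self.red = True
        self.recorded_state = self.tokens
        self.channel_state = {src: 0 for src in self.incoming}
        for dst in self.outgoing:
            net[(self.pid, dst)].append(MARKER)

    def receive(self, src, msg, net):
        if msg == MARKER:
            if not self.red:
                self.record(net)       # DS2: first marker received
            self.marker_seen.add(src)  # recording of channel src -> self is complete
        else:
            if self.red and src not in self.marker_seen:
                self.channel_state[src] += msg  # message was in transit at snapshot time
            self.tokens += msg

    def done(self):
        return self.red and self.marker_seen == set(self.incoming)

def send(net, procs, src, dst, n):
    procs[src].tokens -= n
    net[(src, dst)].append(n)

def deliver(net, procs, src, dst):
    procs[dst].receive(src, net[(src, dst)].popleft(), net)

ring = [(0, 1), (1, 2), (2, 0)]                  # strongly connected, FIFO channels
net = {edge: deque() for edge in ring}
procs = {i: Process(i, t,
                    incoming=[s for s, d in ring if d == i],
                    outgoing=[d for s, d in ring if s == i])
         for i, t in enumerate([2, 1, 0])}       # three tokens in total

send(net, procs, 0, 1, 1)      # ordinary traffic before the snapshot
procs[0].record(net)           # DS1: process 0 is the initiator
send(net, procs, 1, 2, 1)      # the system keeps running during the snapshot
deliver(net, procs, 1, 2)
send(net, procs, 2, 0, 1)
deliver(net, procs, 0, 1)      # token sent before the marker (FIFO)
deliver(net, procs, 0, 1)      # marker: process 1 records its state
deliver(net, procs, 2, 0)      # token reaches red process 0: counted as channel state
deliver(net, procs, 1, 2)      # marker: process 2 records its state
deliver(net, procs, 2, 0)      # marker closes channel 2 -> 0

terminated = all(p.done() for p in procs.values())
snapshot_total = sum(p.recorded_state + sum(p.channel_state.values())
                     for p in procs.values())
print(terminated, snapshot_total)  # True 3
```

The recorded local states add up to only two tokens; the third one is found in the recorded state of channel 2 -> 0, which is why channel states are part of the global state.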

The Chandy-Lamport Algorithm The individual processes only record the fragments of a snapshot state SSS. It requires another phase of activity to collect these fragments and form a composite view of SSS. Global state collection is not a part of the snapshot algorithm.

The Lai-Yang Algorithm Lai and Yang proposed an algorithm for distributed snapshot on a network of processes where the channels need not be FIFO. A message is white if it is sent by a process that has not recorded its state, and a message is red if the sender has already recorded its state. Unlike the Chandy-Lamport algorithm, there are no markers: processes are allowed to record their local states spontaneously.

The Lai-Yang Algorithm LY1. The initiator records its own state. When it needs to send a message m to another process, it sends (m, red). LY2. When a process receives a message (m, red), it records its state if it has not already done so, and then accepts the message m.
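
A minimal Python sketch of LY1 and LY2 follows, again with a hypothetical token-passing workload and a made-up `LYProcess` class. The `in_transit` counter is an assumption that goes beyond the two rules above: it counts white messages that reach a process after it has recorded its state, which are the messages that cross the cut. The two messages are deliberately delivered out of order to show that FIFO channels are not needed.

```python
class LYProcess:
    def __init__(self, tokens):
        self.tokens = tokens
        self.red = False
        self.recorded_state = None
        self.in_transit = 0   # white tokens that arrive after recording (assumed bookkeeping)

    def record(self):
        """Record the local state once; the process is red from then on."""
        if not self.red:
            self.red = True
            self.recorded_state = self.tokens

    def send(self, n):
        """Tag every outgoing message with the sender's current color (LY1)."""
        self.tokens -= n
        return (n, "red" if self.red else "white")

    def receive(self, message):
        n, color = message
        if color == "red":
            self.record()          # LY2: record before accepting a red message
        elif self.red:
            self.in_transit += n   # white message received by a red process
        self.tokens += n

p0, p1 = LYProcess(2), LYProcess(0)   # two tokens in total

early = p0.send(1)    # white message, sent before p0 records
p0.record()           # LY1: the initiator records its state spontaneously
late = p0.send(1)     # red message

p1.receive(late)      # overtakes the white message; p1 records first, then accepts
p1.receive(early)     # white message crossing the cut

total = p0.recorded_state + p1.recorded_state + p1.in_transit
print(p0.recorded_state, p1.recorded_state, p1.in_transit, total)  # 1 0 1 2
```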

The Lai-Yang Algorithm The approach is "lazy" inasmuch as processes do not send or use any control message for the sake of recording a consistent snapshot. The good thing is that if a complete snapshot is taken, then it will be consistent. However, there is no guarantee that a complete snapshot will eventually be taken: if a process i wants to detect termination, then i will record its own state following its last action but send no message, so the other processes may never record their states. Dummy control messages can be used to overcome this.