Consistency and Replication (3). Topics Consistency protocols.

Consistency and Replication (3)

Topics: Consistency protocols

Readings: Van Steen and Tanenbaum, 6.5; Coulouris, 11 and 14

Introduction
A consistency protocol describes an implementation of a specific consistency model. We will look at different architectures that can be used to support different consistency models, but first we look at a basic architectural model.

A Basic Architectural Model for the Management of Replicated Data
[Figure: clients (C) send requests to front ends (FE), which exchange requests and replies with the replica managers (RM) that together provide the service.]

A Basic Architectural Model for the Management of Replicated Data
A collection of replica managers (RMs) provides a service to clients. The clients see a service that gives them access to objects (e.g., calendars or bank accounts) that are replicated. Each client's requests are handled by a component called a front end (FE).

A Basic Architectural Model for the Management of Replicated Data
The purpose of the front end is to hide replication from the client process: client processes do not know how many replicas there are. A front end may be implemented in the client's address space, or it may be a separate process. The replica managers coordinate in preparation for executing the request consistently.

A Basic Architectural Model for the Management of Replicated Data
The replica managers execute the request. One or more replicas may then respond to the application (through the front end).
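To make the front end's role concrete, here is a minimal single-process sketch in Python. The class names `ReplicaManager` and `FrontEnd`, and the policy of sending every write to all RMs while reading from just one, are assumptions for illustration only; they skip the coordination phase that the protocols below provide.

```python
from typing import Dict, List, Optional


class ReplicaManager:
    """Hypothetical RM holding one local copy of the replicated objects."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, int] = {}

    def write(self, key: str, value: int) -> None:
        self.store[key] = value

    def read(self, key: str) -> Optional[int]:
        return self.store.get(key)


class FrontEnd:
    """Hides replication: the client never learns how many RMs exist."""

    def __init__(self, rms: List[ReplicaManager]) -> None:
        self._rms = rms

    def write(self, key: str, value: int) -> None:
        # Assumed policy: forward every write to all replica managers.
        for rm in self._rms:
            rm.write(key, value)

    def read(self, key: str) -> Optional[int]:
        # Assumed policy: a single replica manager answers the read.
        return self._rms[0].read(key)


# The client only ever talks to the front end.
rms = [ReplicaManager("RM1"), ReplicaManager("RM2"), ReplicaManager("RM3")]
fe = FrontEnd(rms)
fe.write("balance", 100)
print(fe.read("balance"))  # prints 100
```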

Primary-Based Protocols
In primary-based protocols, each data item x in the data store has an associated primary, which is responsible for coordinating write operations on x.
Primary-Backup Protocols
- Read operations are performed on a locally available copy.
- Write operations are done at a fixed primary copy.
- The primary performs the update on its local copy of x and then forwards the update to all the other replicas (which are considered to be backups).

Primary-Based Protocols
Primary-Backup Protocols (cont)
- Each backup server performs the update as well and sends an acknowledgement back to the primary.
- When all backup servers have updated their local copy, the primary sends an acknowledgement back to the initial process.
This implements sequential consistency.
- The primary RM is a performance bottleneck.
- With F+1 RMs, the protocol can tolerate F crash failures.
- Sun NIS (Yellow Pages) uses passive replication: clients can contact primary or backup servers for reads, but only primary servers for updates.
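The blocking primary-backup write path described above can be sketched in Python as follows. It runs in a single process, with method calls standing in for network messages and return values standing in for acknowledgements; the class and method names are hypothetical, and failures are not modelled.

```python
from typing import Dict, List


class Backup:
    def __init__(self) -> None:
        self.store: Dict[str, int] = {}

    def apply(self, key: str, value: int) -> bool:
        # Perform the update locally, then acknowledge to the primary.
        self.store[key] = value
        return True

    def read(self, key: str) -> int:
        # Reads are served from the locally available copy.
        return self.store[key]


class Primary:
    def __init__(self, backups: List[Backup]) -> None:
        self.store: Dict[str, int] = {}
        self.backups = backups

    def write(self, key: str, value: int) -> bool:
        # 1. Update the primary's local copy of the data item.
        self.store[key] = value
        # 2. Forward the update to every backup and collect acknowledgements.
        acks = [b.apply(key, value) for b in self.backups]
        # 3. Acknowledge the client only after all backups have acknowledged.
        return all(acks)

    def read(self, key: str) -> int:
        return self.store[key]


backups = [Backup(), Backup()]
primary = Primary(backups)
ok = primary.write("x", 42)
print(ok, [b.read("x") for b in backups])  # True [42, 42]
```

Because `write` returns only after every backup has applied the update, any later read at any copy sees it, which is why this blocking variant is associated with sequential consistency; it is also why the primary becomes the bottleneck.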

The Primary-Backup Protocol
[Figure: clients (C) send requests through front ends (FE) to the primary RM, which propagates updates to the backup RMs.]

Replicated-Write Protocols
In replicated-write protocols, write operations can be carried out at multiple replicas instead of only one (as in the case of primary-based protocols). Operations need to be carried out in the same order everywhere. We discussed one approach for doing so that uses Lamport timestamps; however, using Lamport timestamps does not scale well in large distributed systems.

Replicated-Write Protocols
An alternative approach to achieving total order is to use a central coordinator, which is sometimes called a sequencer.
- Forward each operation to the sequencer.
- The sequencer assigns a unique sequence number and subsequently forwards the operation to all replicas.
- Operations are carried out in the order of their sequence numbers.
- Hmm. This resembles the primary-based consistency protocols.
Useful for sequential consistency.
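A minimal Python sketch of the sequencer approach, assuming in-process method calls in place of messages: the sequencer stamps each operation with the next number, and each replica uses a hold-back queue so that operations forwarded out of order are still applied in sequence-number order. The names `Sequencer`, `Replica`, and `deliver` are hypothetical.

```python
import itertools
from typing import Dict, List, Tuple


class Sequencer:
    """Central coordinator: stamps each operation with the next sequence number."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def order(self, op: str) -> Tuple[int, str]:
        return next(self._counter), op


class Replica:
    """Applies operations strictly in sequence-number order."""

    def __init__(self) -> None:
        self.next_seq = 1
        self.holdback: Dict[int, str] = {}
        self.applied: List[str] = []

    def deliver(self, seq: int, op: str) -> None:
        # Messages may arrive out of order; hold an operation back until
        # every lower sequence number has been applied.
        self.holdback[seq] = op
        while self.next_seq in self.holdback:
            self.applied.append(self.holdback.pop(self.next_seq))
            self.next_seq += 1


sequencer = Sequencer()
msgs = [sequencer.order(op) for op in ("w1", "w2", "w3")]
r1, r2 = Replica(), Replica()
for seq, op in msgs:
    r1.deliver(seq, op)
for seq, op in reversed(msgs):  # r2 receives the forwards in reverse order
    r2.deliver(seq, op)
print(r1.applied, r2.applied)  # ['w1', 'w2', 'w3'] ['w1', 'w2', 'w3']
```

Every operation passes through the single `Sequencer`, which is exactly the scalability concern raised next.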

Replicated-Write Protocols
The use of a sequencer does not solve the scalability problem. A combination of Lamport timestamps and sequencers may be necessary. The approach is summarized as follows:
- Each process has a unique identifier, p_i, and keeps a sent-message counter, c_i. Together, the process identifier and message counter uniquely identify a message.
- Active processes (i.e., sequencers) keep an extra counter, t_i, called the ticket number. A ticket is a triplet (p_i, t_i, (p_j, c_j)): sequencer p_i assigns ticket number t_i to the message (p_j, c_j).

Replicated-Write Protocols
Approach Summary (cont)
- An active process issues tickets for its own messages and for messages from its associated passive processes (processes that are not sequencers).
- Passive processes multicast their messages to all group processes, which then wait for a ticket stating the total order of each message.
- The ticket is sent by each passive process's sequencer.
- Lamport's totally ordered multicast algorithm is used among the sequencers to determine the order of update operations.
- When an operation is allowed, each sequencer sends the ticket to its associated passive processes. It is assumed that a passive process receives these tickets in the order in which they were sent.

Replicated-Write Protocols
Approach Summary (cont)
- If a sequencer terminates abnormally, then one of the passive processes associated with it can become the new sequencer.
- An election algorithm may be used to choose the new sequencer.

Replicated-Write Protocols
Suppose we have 6 processes: p1, p2, p3, p4, p5, p6. Assume that p1 and p2 are sequencers; p3 and p4 are associated with p1, and p5 and p6 are associated with p2.
Suppose p3 sends a message, which is identified by (p3, 1). p1 generates the ticket (p1, 1, (p3, 1)). The ticket number is generated using the Lamport clock algorithm.

Replicated-Write Protocols
Suppose p5 sends a message, which is identified by (p5, 1). p2 generates the ticket (p2, 1, (p5, 1)). Which update gets done first? Basically, p1 and p2 apply Lamport's algorithm for totally ordered multicast to decide. When an update operation is allowed to proceed, the sequencers send the tickets to their associated processes.
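The ticket example above can be traced with the Python sketch below. It covers only ticket issuance from each sequencer's Lamport clock and the resulting ordering rule; the acknowledgement exchange that Lamport's totally ordered multicast uses before a sequencer may release an operation is omitted. Process identifiers are plain integers (p1 becomes 1), and breaking equal ticket numbers by sequencer identifier is the usual Lamport convention, assumed here.

```python
from dataclasses import dataclass
from typing import Tuple

MessageId = Tuple[int, int]  # (j, c_j): sender p_j and its sent-message counter


@dataclass(frozen=True)
class Ticket:
    sequencer: int      # i: the sequencer p_i that issued the ticket
    number: int         # t_i: ticket number taken from p_i's Lamport clock
    message: MessageId  # (p_j, c_j): the message being ordered

    def order_key(self) -> Tuple[int, int]:
        # Lamport-style total order: lower ticket number first,
        # ties broken by the sequencer's identifier.
        return (self.number, self.sequencer)


class SequencerProcess:
    def __init__(self, pid: int) -> None:
        self.pid = pid
        self.clock = 0  # Lamport clock that supplies ticket numbers

    def issue(self, message: MessageId) -> Ticket:
        self.clock += 1  # local event: tick before timestamping
        return Ticket(self.pid, self.clock, message)

    def receive(self, ticket: Ticket) -> None:
        # Lamport receive rule: move past the sender's timestamp.
        self.clock = max(self.clock, ticket.number) + 1


# The example above: p1 serves p3 and p4, p2 serves p5 and p6.
p1, p2 = SequencerProcess(1), SequencerProcess(2)
t_a = p1.issue((3, 1))  # (p1, 1, (p3, 1))
t_b = p2.issue((5, 1))  # (p2, 1, (p5, 1))
p2.receive(t_a)
p1.receive(t_b)
order = sorted([t_b, t_a], key=Ticket.order_key)
print([t.message for t in order])  # [(3, 1), (5, 1)]
```

Both tickets carry number 1, so under this tie-breaking rule the message from p3 (ticketed by p1) is ordered first at every sequencer.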

Gossip Architecture
We just studied some architectures for sequential consistency. What about causal consistency? The Gossip Architecture supports causally consistent lazy replication, which, in essence, respects the potential causality between read and write operations. Clients are allowed to communicate with each other, but must then also exchange information on the operations they performed on the data store. Replicas propagate updates to one another lazily through gossip messages.

Gossip Architecture
Each RM i maintains, for its local copy, the vector timestamp VAL(i):
- VAL(i)[i]: the total number of completed write requests that have been sent from a client to RM i.
- VAL(i)[j]: the total number of completed write requests that have been sent from RM j to RM i.
- This is referred to as the value timestamp; it reflects the updates that have been completed at the replica.
- This timestamp is attached to the reply of a read operation.

Gossip Architecture
Each RM i also maintains, for its local copy, the vector timestamp WORK(i), which represents the write operations that have been received (but not necessarily processed) at RM i:
- WORK(i)[i]: the total number of write requests that have been sent from a client to RM i, including those that have been completed by RM i.
- WORK(i)[j]: the total number of write requests that have been sent from RM j to RM i, including those that have been completed by RM i.
- This is referred to as the replica timestamp.
- This timestamp is attached to the reply of a write operation.
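The difference between the two timestamps can be seen in a tiny Python sketch: receiving a client write moves only WORK(i)[i], and completing it later moves VAL(i)[i]. The class `ReplicaTimestamps` and its methods are hypothetical, and only the ith entries (writes from clients) are exercised; entries for writes relayed by other RMs are left out.

```python
from typing import List


class ReplicaTimestamps:
    """Hypothetical holder for RM i's value and replica timestamps."""

    def __init__(self, i: int, n: int) -> None:
        self.i = i
        self.val: List[int] = [0] * n   # VAL(i): writes completed at RM i
        self.work: List[int] = [0] * n  # WORK(i): writes received at RM i

    def receive_client_write(self) -> None:
        # Received, but not necessarily processed: only WORK(i)[i] moves.
        self.work[self.i] += 1

    def complete_client_write(self) -> None:
        # The write has been completed: VAL(i)[i] catches up by one.
        self.val[self.i] += 1

    def pending(self) -> int:
        # Client writes received at RM i that are not yet completed.
        return self.work[self.i] - self.val[self.i]


rm = ReplicaTimestamps(i=0, n=3)
rm.receive_client_write()
rm.receive_client_write()
rm.complete_client_write()
print(rm.val, rm.work, rm.pending())  # [1, 0, 0] [2, 0, 0] 1
```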

Gossip Architecture
Each client keeps track of the writes that it has seen so far. Client C maintains a vector timestamp LOCAL(C), with LOCAL(C)[i] set to the most recent number of writes seen at RM i (from C's viewpoint). This vector timestamp is attached to every request sent to a replica. Note that the client can contact a different replica each time it wants to read or write data. Two front ends may also exchange messages directly; these messages also carry the timestamp LOCAL(C).

Gossip Architecture
Write log (queue)
- Every write operation, when received by a replica, is recorded in the replica's update log.
- There are two reasons for this:
  - The update may not be applicable yet, in which case it is held back.
  - It is uncertain whether the update has been received by all replicas.
- The entries are sorted by timestamp.
A similar log is needed for read operations. This is referred to as the read log (or queue).

Gossip Architecture
The Executed Operation table
- The same write operation may arrive at a replica both from a front end and in a gossip message from another replica.
- To prevent an update from being applied twice, the replica keeps a list of identifiers of the write operations that have been applied so far.
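The write log and the executed operation table can be sketched together in Python. The operation identifiers (`"W0"`, `"W1"`) and the names `LogEntry`, `ReplicaLogs`, `record`, and `apply_once` are hypothetical. Entries are kept sorted lexicographically by ts(W), which is one linear extension of the component-wise order on vector timestamps; entries are not removed here, since the log must also remember updates that other replicas may not have received yet.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Vector = Tuple[int, ...]


@dataclass
class LogEntry:
    op_id: str   # hypothetical unique identifier of the write
    ts: Vector   # ts(W)
    dep: Vector  # DEP(W)


@dataclass
class ReplicaLogs:
    write_log: List[LogEntry] = field(default_factory=list)
    executed: Set[str] = field(default_factory=set)  # executed operation table

    def record(self, entry: LogEntry) -> bool:
        """Add a write to the log unless an entry with the same id is already there."""
        if any(e.op_id == entry.op_id for e in self.write_log):
            return False
        self.write_log.append(entry)
        # A write whose timestamp is component-wise smaller sorts first.
        self.write_log.sort(key=lambda e: e.ts)
        return True

    def apply_once(self, op_id: str) -> bool:
        """Consult the executed operation table before applying a write."""
        if op_id in self.executed:
            return False
        self.executed.add(op_id)
        return True


logs = ReplicaLogs()
w0 = LogEntry("W0", ts=(1, 0, 0), dep=(0, 0, 0))
w1 = LogEntry("W1", ts=(0, 0, 1), dep=(0, 0, 0))
logs.record(w0)
logs.record(w1)
duplicate = logs.record(w0)      # same write again, e.g. via gossip: False
first = logs.apply_once("W0")    # True: applied now
second = logs.apply_once("W0")   # False: already in the executed table
print([e.op_id for e in logs.write_log], duplicate, first, second)
```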

Gossip Architecture
Processing a read request R from client C
- Let DEP(R) be the timestamp associated with R. It is set to LOCAL(C).
- The request is sent to RM i (together with DEP(R)), which stores the request in its read queue.
- The read request is processed once DEP(R)[j] <= VAL(i)[j] for all j. This indicates that RM i has seen at least all the writes that the client has seen.
- As soon as the read operation can be carried out, RM i returns the value of the requested data item to the client, along with VAL(i).
- LOCAL(C)[j] is then set to max{LOCAL(C)[j], VAL(i)[j]} for all j.
  - This makes sense, since the value returned by the read is potentially the cumulative result of all previous writes.
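The read path above can be sketched in Python as three small functions: the readiness test DEP(R) <= VAL(i), the reply carrying the value together with VAL(i), and the client-side merge into LOCAL(C). The function names and the data item `"x"` with value `"v2"` are hypothetical; returning `None` stands in for leaving R in the read queue.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[int]


def read_is_ready(dep_r: Vector, val_i: Vector) -> bool:
    # R may be processed once DEP(R)[j] <= VAL(i)[j] for all j.
    return all(d <= v for d, v in zip(dep_r, val_i))


def process_read(dep_r: Vector, val_i: Vector,
                 data: Dict[str, str], key: str) -> Optional[Tuple[str, Vector]]:
    """Return (value, VAL(i)) if R can be served now, else None (R stays queued)."""
    if not read_is_ready(dep_r, val_i):
        return None
    return data[key], list(val_i)


def update_local(local_c: Vector, val_i: Vector) -> Vector:
    # LOCAL(C)[j] := max(LOCAL(C)[j], VAL(i)[j]) for all j.
    return [max(lc, v) for lc, v in zip(local_c, val_i)]


local_c = [1, 0, 2]
print(process_read(local_c, [0, 0, 1], {"x": "v2"}, "x"))  # None: RM i lags behind C
reply = process_read(local_c, [1, 0, 2], {"x": "v2"}, "x")
assert reply is not None
print(reply)  # ('v2', [1, 0, 2])
local_c = update_local(local_c, reply[1])
```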

Gossip Architecture
[Figure: performing a read operation at a local copy.]

Gossip Architecture
Processing a write operation W from client C
- Let DEP(W) be the timestamp associated with W. It is set to LOCAL(C).
- When the request is received by RM i, it increments WORK(i)[i] by 1 but leaves the other entries intact.
  - This is done so that WORK reflects that RM i has received the latest write request. At this point it is not known whether the write can be carried out.
- A timestamp ts(W) is derived from DEP(W) by setting ts(W)[i] to WORK(i)[i]; the remaining entries are as in DEP(W).
- This timestamp is sent back to the client as an acknowledgement; the client then adjusts LOCAL(C) by setting each kth entry to max{LOCAL(C)[k], ts(W)[k]}.

Gossip Architecture
Processing Write Operations (cont)
- The write request W is processed once DEP(W)[j] <= VAL(i)[j] for all j.
- This indicates that RM i has seen at least all the writes that the client had seen when it issued W. This is referred to as the stability condition.
- The write operation then takes place.
- What if there exists a j such that DEP(W)[j] > VAL(i)[j]?
  - This indicates that the client has seen a write that RM i has not yet seen, so W is held back in the write log until that write arrives.

Gossip Architecture
Processing Write Operations (cont)
- VAL(i) is adjusted by setting each jth entry to max{VAL(i)[j], ts(W)[j]}.
  - Recall that ts(W)[j] equals DEP(W)[j] for all j != i, and equals WORK(i)[i] for j = i (which was incremented when the write request was received); the net effect is that VAL(i)[i] is incremented by 1.
- When W is applied, the following two conditions are satisfied:
  - All operations sent directly to RM i from other clients that preceded W have been processed: ts(W)[i] = VAL(i)[i] + 1.
  - All write operations that W depends on have been processed: ts(W)[j] <= VAL(i)[j] for all j != i.
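The client-write path at a single RM can be sketched in Python. `GossipReplica`, `receive_write`, and the key/value pairs are hypothetical; the readiness test uses the two conditions listed above, and gossip is deliberately not modelled, so a write that depends on an unseen write simply stays held back. The usage lines replay what replica 2 does in the example that follows.

```python
from typing import Dict, List, Tuple

Vector = List[int]


class GossipReplica:
    """Hypothetical RM i that processes client writes as described above."""

    def __init__(self, i: int, n: int) -> None:
        self.i = i
        self.val: Vector = [0] * n   # VAL(i)
        self.work: Vector = [0] * n  # WORK(i)
        self.held_back: List[Tuple[Vector, str, str]] = []  # (ts(W), key, value)
        self.data: Dict[str, str] = {}

    def receive_write(self, dep: Vector, key: str, value: str) -> Vector:
        self.work[self.i] += 1           # WORK(i)[i] := WORK(i)[i] + 1
        ts = list(dep)
        ts[self.i] = self.work[self.i]   # ts(W): DEP(W) with entry i replaced
        self.held_back.append((ts, key, value))
        self.apply_ready()
        return ts                        # acknowledgement for the client

    def _ready(self, ts: Vector) -> bool:
        # Condition 1: earlier writes sent directly to RM i are done.
        # Condition 2: every write that W depends on is done.
        return ts[self.i] == self.val[self.i] + 1 and all(
            ts[j] <= self.val[j] for j in range(len(ts)) if j != self.i)

    def apply_ready(self) -> None:
        progress = True
        while progress:
            progress = False
            for entry in list(self.held_back):
                ts, key, value = entry
                if self._ready(ts):
                    self.data[key] = value
                    # VAL(i)[j] := max(VAL(i)[j], ts(W)[j]); VAL(i)[i] grows by 1.
                    self.val = [max(v, t) for v, t in zip(self.val, ts)]
                    self.held_back.remove(entry)
                    progress = True


rm2 = GossipReplica(i=2, n=3)
ack_w1 = rm2.receive_write([0, 0, 0], "y", "w1")  # applied immediately
ack_w2 = rm2.receive_write([1, 0, 0], "x", "w2")  # held back: a write from RM 0 is unseen
print(ack_w1, ack_w2, rm2.val)  # [0, 0, 1] [1, 0, 2] [0, 0, 1]
```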

Gossip Architecture
[Figure: performing a write operation at a local copy.]

Gossip Architecture
For every gossip message received by RM j from RM i, RM j does the following:
- RM j adjusts WORK(j) by setting each kth entry to max{WORK(i)[k], WORK(j)[k]}.
- RM j merges the write operations sent by RM i into its own log.
- RM j applies those writes that have become stable, i.e., a write request W is processed once DEP(W)[k] <= VAL(j)[k] for all k. Applying a write that a client originally submitted to RM i increments VAL(j)[i] by 1.
A gossip message need not contain the entire log if it is certain that some of the updates have already been seen by the receiving replica.
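The three gossip-handling steps can be sketched in Python as a single function `on_gossip`, where each log entry maps a hypothetical operation identifier to (DEP(W), ts(W)). The stability test is the one stated above, the executed operation table prevents a write from being applied twice, and log garbage collection is not modelled. The usage is a hypothetical two-replica case: write A was accepted and applied at RM 0, and write B, which depends on A, is held back at RM 1 until gossip from RM 0 arrives.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Vector = List[int]
Log = Dict[str, Tuple[Vector, Vector]]  # op_id -> (DEP(W), ts(W))


@dataclass
class RMState:
    val: Vector   # VAL
    work: Vector  # WORK
    log: Log = field(default_factory=dict)
    executed: Set[str] = field(default_factory=set)  # executed operation table


def on_gossip(receiver: RMState, sender_work: Vector, sender_log: Log) -> None:
    # 1. WORK(j)[k] := max(WORK(i)[k], WORK(j)[k]) for every k.
    receiver.work = [max(a, b) for a, b in zip(sender_work, receiver.work)]
    # 2. Merge the sender's writes into the receiver's log (no duplicates).
    for op_id, entry in sender_log.items():
        receiver.log.setdefault(op_id, entry)
    # 3. Apply every write that has become stable: DEP(W) <= VAL(j).
    progress = True
    while progress:
        progress = False
        for op_id, (dep, ts) in sorted(receiver.log.items(), key=lambda kv: kv[1][1]):
            if op_id not in receiver.executed and all(
                    d <= v for d, v in zip(dep, receiver.val)):
                receiver.val = [max(v, t) for v, t in zip(receiver.val, ts)]
                receiver.executed.add(op_id)
                progress = True


rm0 = RMState(val=[1, 0], work=[1, 0], log={"A": ([0, 0], [1, 0])}, executed={"A"})
rm1 = RMState(val=[0, 0], work=[0, 1], log={"B": ([1, 0], [1, 1])})  # B waits for A
on_gossip(rm1, rm0.work, rm0.log)
print(rm1.val, rm1.work, sorted(rm1.executed))  # [1, 1] [1, 1] ['A', 'B']
```

Applying A at RM 1 raises VAL(1)[0] from 0 to 1, which is what finally makes B stable.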

Gossip Architecture (Example)
Initial state (three replicas, numbered 0, 1, 2; two clients, numbered 0 and 1):
- Replicas 0, 1, 2: VAL = (0,0,0), WORK = (0,0,0)
- Clients 0 and 1: LOCAL = (0,0,0)

Gossip Architecture (Example)
Client 0 sends a write, W0, to replica 0, with DEP(W0) = (0,0,0).
- Replicas 0, 1, 2: VAL = (0,0,0), WORK = (0,0,0)
- Clients 0 and 1: LOCAL = (0,0,0)

Gossip Architecture (Example)
WORK is updated at replica 0, which derives ts(W0) = (1,0,0) from DEP(W0) = (0,0,0).
- Replica 0: VAL = (0,0,0), WORK = (1,0,0)
- Replicas 1, 2: VAL = (0,0,0), WORK = (0,0,0)
- Clients 0 and 1: LOCAL = (0,0,0)

Gossip Architecture (Example)
Client 0 receives an acknowledgement, ack(ts(W0)), from replica 0 for its write; LOCAL changes from (0,0,0) to (1,0,0).
- Replica 0: VAL = (0,0,0), WORK = (1,0,0)
- Replicas 1, 2: VAL = (0,0,0), WORK = (0,0,0)
- Client 0: LOCAL = (1,0,0); client 1: LOCAL = (0,0,0)

Gossip Architecture (Example)
W0 is applied at replica 0, since DEP(W0) = (0,0,0) <= VAL; VAL changes to (1,0,0).
- Replica 0: VAL = (1,0,0), WORK = (1,0,0)
- Replicas 1, 2: VAL = (0,0,0), WORK = (0,0,0)
- Client 0: LOCAL = (1,0,0); client 1: LOCAL = (0,0,0)

Gossip Architecture (Example)
State after client 1 sends a write, W1, to replica 2, with DEP(W1) = (0,0,0); replica 2 derives ts(W1) = (0,0,1) and applies W1.
- Replica 0: VAL = (1,0,0), WORK = (1,0,0)
- Replica 1: VAL = (0,0,0), WORK = (0,0,0)
- Replica 2: VAL = (0,0,1), WORK = (0,0,1)
- Client 0: LOCAL = (1,0,0); client 1: LOCAL = (0,0,1)

Gossip Architecture (Example)
Client 0 sends a write, W2, to replica 2, with DEP(W2) = (1,0,0); replica 2 derives ts(W2) = (1,0,2). W2 cannot be applied yet, since replica 2 has not seen the write W0 done at replica 0.
- Replica 0: VAL = (1,0,0), WORK = (1,0,0)
- Replica 1: VAL = (0,0,0), WORK = (0,0,0)
- Replica 2: VAL = (0,0,1), WORK = (0,0,2)
- Client 0: LOCAL = (1,0,0); client 1: LOCAL = (0,0,1)

Gossip Architecture (Example)
An acknowledgement, ack(ts(W2)), is returned to client 0, which then updates LOCAL from (1,0,0) to (1,0,2).
- Replica 0: VAL = (1,0,0), WORK = (1,0,0)
- Replica 1: VAL = (0,0,0), WORK = (0,0,0)
- Replica 2: VAL = (0,0,1), WORK = (0,0,2)
- Client 0: LOCAL = (1,0,2); client 1: LOCAL = (0,0,1)

Gossip Architecture (Example)
Replicas 0 and 2 exchange update propagation messages (gossip), and WORK at both replicas is adjusted to the component-wise maximum, (1,0,2).
- Replica 0: VAL = (1,0,0), WORK = (1,0,2)
- Replica 1: VAL = (0,0,0), WORK = (0,0,0)
- Replica 2: VAL = (0,0,1), WORK = (1,0,2)
- Client 0: LOCAL = (1,0,2); client 1: LOCAL = (0,0,1)

Gossip Architecture (Example)
Replica 0 has one write operation, W0, which it sends to replica 2 together with DEP(W0). Replica 2 has the write operation W1, which it sends to replica 0 together with DEP(W1). Replica 2 also sends W2 with DEP(W2). The VAL and WORK values are unchanged from the previous step.

Gossip Architecture (Example)
Replica 2 can carry out W0, since DEP(W0) = (0,0,0) <= VAL = (0,0,1). Replica 0 can carry out W1, since DEP(W1) = (0,0,0) <= VAL = (1,0,0). The VAL and WORK values are still unchanged at this point.

Gossip Architecture (Example)
VAL at replica 0 and at replica 2 is updated.
- Replica 0: VAL = (1,0,1), WORK = (1,0,2)
- Replica 1: VAL = (0,0,0), WORK = (0,0,0)
- Replica 2: VAL = (1,0,1), WORK = (1,0,2)
- Client 0: LOCAL = (1,0,2); client 1: LOCAL = (0,0,1)

Gossip Architecture (Example)
W2 can now be executed at replica 2, since DEP(W2) = (1,0,0) <= VAL = (1,0,1); W2 can also be applied at replica 0. After W2 is applied, VAL at both replicas becomes (1,0,2).
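The whole example can be replayed with the Python sketch below. It is a simplified model under stated assumptions: a replica applies a write as soon as it is stable (so W0 is applied on receipt rather than after the acknowledgement step), stability is the DEP(W) <= VAL test used in these slides, and the names `Replica`, `write`, `gossip_to`, and `merge` are hypothetical. Replica 1 takes no part and stays at its initial state.

```python
from typing import Dict, List, Set, Tuple

Vector = List[int]


def merge(a: Vector, b: Vector) -> Vector:
    """Component-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]


class Replica:
    def __init__(self, i: int, n: int = 3) -> None:
        self.i = i
        self.val: Vector = [0] * n                       # VAL
        self.work: Vector = [0] * n                      # WORK
        self.log: Dict[str, Tuple[Vector, Vector]] = {}  # op_id -> (DEP(W), ts(W))
        self.executed: Set[str] = set()                  # executed operation table

    def write(self, op_id: str, dep: Vector) -> Vector:
        self.work[self.i] += 1
        ts = list(dep)
        ts[self.i] = self.work[self.i]
        self.log[op_id] = (list(dep), ts)
        self.apply_stable()
        return ts  # ack(ts(W)) returned to the client

    def apply_stable(self) -> None:
        progress = True
        while progress:
            progress = False
            for op_id, (dep, ts) in sorted(self.log.items(), key=lambda kv: kv[1][1]):
                if op_id not in self.executed and all(
                        d <= v for d, v in zip(dep, self.val)):
                    self.val = merge(self.val, ts)
                    self.executed.add(op_id)
                    progress = True

    def gossip_to(self, other: "Replica") -> None:
        other.work = merge(self.work, other.work)
        for op_id, entry in self.log.items():
            other.log.setdefault(op_id, entry)
        other.apply_stable()


replicas = [Replica(0), Replica(1), Replica(2)]
local = {0: [0, 0, 0], 1: [0, 0, 0]}  # LOCAL(C) for clients 0 and 1

local[0] = merge(local[0], replicas[0].write("W0", local[0]))  # client 0 -> replica 0
local[1] = merge(local[1], replicas[2].write("W1", local[1]))  # client 1 -> replica 2
local[0] = merge(local[0], replicas[2].write("W2", local[0]))  # client 0 -> replica 2
held_back = "W2" not in replicas[2].executed                   # replica 2 has not seen W0

replicas[0].gossip_to(replicas[2])
replicas[2].gossip_to(replicas[0])

for r in replicas:
    print(r.i, r.val, r.work)
print(local)  # {0: [1, 0, 2], 1: [0, 0, 1]}
```

The printed VAL and WORK for replicas 0 and 2 end at (1,0,2), matching the final step of the example.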

Summary
There are good reasons to introduce replication. However, replication introduces consistency problems, and keeping replicas strictly consistent may severely degrade performance, especially in large-scale systems. Consistency is therefore relaxed. We have studied consistency models and the protocols that implement them.
