Distributed Systems Topic 5: Time, Coordination and Agreement


1 Distributed Systems Topic 5: Time, Coordination and Agreement
Dr. Michael R. Lyu, Computer Science & Engineering Department, The Chinese University of Hong Kong. We discuss time, coordination and agreement issues in distributed systems.

2 Outline
1 Time: physical time, logical time
2 Coordination and agreement
3 Multicast communication
4 Summary
We explore the notion of time, describe the "happened-before" relation between events, and the notion of logical clocks. We examine algorithms to achieve mutual exclusion among a collection of processes, so as to coordinate their access to shared resources, including how elections can be implemented in a distributed system. We then discuss the issues of multicast communication in distributed systems.

3 1 Time
The notion of time. Example: 12/9/1949. External synchronization. Internal synchronization. Physical clocks and their synchronization. Logical time and logical clocks. We define the notion of time. We examine the problem of how to synchronize clocks in different computers, so that events occurring at them can be timestamped consistently and we can determine the order in which events occurred. External synchronization: synchronizing a clock with an authoritative, external source of time. Internal synchronization: synchronizing computer clocks with one another to a known degree of accuracy, so that the interval between two events occurring at different computers can be measured by appealing to their local clocks. We describe the problem of physical clock synchronization in distributed systems and examine methods whereby computer clocks can be approximately synchronized using message passing. We go on to introduce logical clocks, which are used to define an order on events without measuring the physical time at which they occurred.

4 1.1 Synchronizing Physical Clocks
Each computer contains its own physical clock. A physical clock is limited by its resolution - the period between updates of the clock register. Clock drift often happens to physical clocks. To compensate for clock drift, computers are synchronized to a time service, e.g., UTC - Coordinated Universal Time. Several other algorithms exist for synchronization. Computer physical clocks are electronic devices that count oscillations occurring in a crystal at a definite frequency. This count can be stored in a counter register. The clock output can be read by software and scaled into a suitable time unit, and this value can be used to timestamp any event of interest. Successive events will correspond to different timestamps only if the clock resolution is smaller than the interval between events. The rate at which events occur depends on such factors as the length of the processor instruction cycle. Applications running at a given computer require only the value of the counter to timestamp events; the date and time-of-day can be calculated from the counter value. Clock drift happens because computer clocks count time at slightly different rates and so diverge. Coordinated Universal Time (UTC) is an international standard that is based on atomic time. UTC signals are synchronized and broadcast regularly from land-based radio stations and satellites. If the computer clock is behind the time service's, it is acceptable to set the computer clock to the time service's time. However, when the computer clock runs fast, it should be slowed down for a period instead of being set back to the time service's time directly. Causing a computer's clock to run slow for a period can be achieved in software, without changing the rate of the hardware clock.

5 1.1 Skew Between Computer Clocks in a Distributed System

6 1.1 Compensating for Clock Drift
S(t) = H(t) + δ(t); S = application (software) clock time, H = hardware clock time, δ = compensating factor. Assuming a linear relation δ(t) = aH(t) + b, we have S(t) = (1 + a)H(t) + b. Let the value of the software clock be Tskew when H = h, and let the actual time be Treal. If S is to give the actual time after N further ticks, we have Tskew = (1 + a)h + b, and Treal + N = (1 + a)(h + N) + b. Hence a = (Treal - Tskew) / N and b = Tskew - (1 + a)h. Let the application time be S and the hardware clock time be H. Let the compensating factor be δ, so that S(t) = H(t) + δ(t). Assuming δ(t) = aH(t) + b, we have S(t) = (1 + a)H(t) + b. Let the value of the software clock be Tskew when H = h, and let the actual time be Treal. We may have Tskew > Treal or Treal > Tskew. If S is to give the actual time after N further ticks, we have Tskew = (1 + a)h + b, and Treal + N = (1 + a)(h + N) + b. Solving these, we get a = (Treal - Tskew) / N and b = Tskew - (1 + a)h. If H is running too fast, Tskew > Treal, a will be negative and S readings will be gradually reduced to slow the clock down. If H is running too slow, Tskew < Treal, a will be positive and S readings will be gradually increased to advance the clock.
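As a rough illustration of this compensation, here is a minimal sketch (function and variable names are illustrative, not from the slides) that computes a and b from the formulas above and applies them to the hardware reading:

```python
def drift_compensation(t_skew, t_real, h, n_ticks):
    """Compute the linear compensation S(t) = (1 + a)*H(t) + b that makes the
    software clock agree with the reference after n_ticks further hardware ticks."""
    a = (t_real - t_skew) / n_ticks      # a < 0 slows the clock, a > 0 speeds it up
    b = t_skew - (1 + a) * h             # anchors S(h) == t_skew
    return a, b

def software_clock(hardware_time, a, b):
    return (1 + a) * hardware_time + b

# Example: the hardware clock is 5 ticks fast; spread the correction over
# the next 100 ticks instead of stepping the clock backwards.
a, b = drift_compensation(t_skew=1005.0, t_real=1000.0, h=1005.0, n_ticks=100)
print(software_clock(1005.0, a, b))   # ~1005.0 now
print(software_clock(1105.0, a, b))   # ~1100.0 after 100 ticks (== t_real + N)
```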

7 1.1 Cristian’s Clock Synchronization
Let the time returned in S's message mt be t. P should set its clock to t + Tround/2. min: the minimum transmission time, obtained when there is no other traffic. The time by S's clock when the reply message arrives lies in the range [t + min, t + Tround - min], with width Tround - 2·min and accuracy ±(Tround/2 - min). Figure: process p exchanges request mr and reply mt with the time server S. Cristian suggested the use of a central time server process S to supply the time, using signals from a source of UTC. A process P wishing to learn the time from S records the total round-trip time Tround taken to send the request mr and receive the reply mt. Let the time returned in S's message mt be t. A simple estimate of the time to which P should set its clock is t + Tround/2. The earliest point at which S could have placed the time in mt was min after P dispatched mr; the latest point at which it could have done so was min before mt arrived at P. The time by S's clock when the reply message arrives therefore lies in [t + min, t + Tround - min], with width Tround - 2·min and accuracy ±(Tround/2 - min). The minimum value min is the transmission time that would be obtained if no other processes executed and no other network traffic existed. For fault tolerance, the time service can be provided by a group of synchronized time servers, each with a receiver for UTC time signals.
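A minimal sketch of this estimate, assuming a hypothetical request_server_time() call that performs the mr/mt exchange and returns the server timestamp t (names are illustrative):

```python
import time

def cristian_estimate(request_server_time, t_min=0.0):
    """Estimate the server's current time using Cristian's method."""
    t0 = time.monotonic()
    t = request_server_time()                  # server time carried in reply mt
    t_round = time.monotonic() - t0
    estimate = t + t_round / 2                 # midpoint of [t + min, t + Tround - min]
    accuracy = t_round / 2 - t_min             # +/- (Tround/2 - min)
    return estimate, accuracy
```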

8 1.1 The Berkeley Algorithm
A coordinator computer is chosen to act as the master. The master periodically polls the slaves whose clocks are to be synchronized. The master estimates their local clock times by observing the round-trip times, and it averages the values obtained. The master takes a fault-tolerant average. Should the master fail, another can be elected to take over. In the Berkeley algorithm, a coordinator computer is chosen to act as the master. This computer periodically polls the other computers whose clocks are to be synchronized, called slaves. The master estimates their local clock times by observing the round-trip times, and it averages the values obtained. The balance of probabilities is that this average cancels out the individual clocks' drift. The accuracy of the protocol depends upon a nominal maximum round-trip time between the master and the slaves. The algorithm eliminates readings from clocks that have drifted badly, or that have failed and provide spurious readings: the master takes a fault-tolerant average. That is, a subset of clocks is chosen that do not differ from one another by more than a specified amount, and the average is taken only of readings from these clocks. Should the master fail, another can be elected to take over and function exactly as its predecessor.
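A minimal sketch of the fault-tolerant averaging step, under the assumption that the slave readings have already been corrected for round-trip delay (names and the agreement test are illustrative):

```python
import statistics

def berkeley_adjustments(master_time, slave_times, max_diff):
    """Fault-tolerant averaging step of the Berkeley algorithm (sketch).
    Returns the adjustment each clock should apply, master first,
    then the slaves in the order given."""
    readings = [master_time] + list(slave_times)
    centre = statistics.median(readings)
    # Keep only readings that do not differ from the rest by too much;
    # fall back to all readings if none agree closely enough.
    good = [t for t in readings if abs(t - centre) <= max_diff] or readings
    avg = sum(good) / len(good)
    return [avg - t for t in readings]
```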

9 1.1 The Network Time Protocol (NTP)
NTP distributes time information to provide: a service to synchronize clients in the Internet; a reliable service that survives loss of connectivity; sufficiently frequent resynchronization to offset client clock drift; protection against interference with the time service. NTP service is provided by various servers: primary servers, secondary servers, and servers of other levels (called strata). Synchronization subnet: the servers connected in a logical hierarchy. The Network Time Protocol (NTP) defines an architecture for a time service and a protocol to distribute time information over a wide variety of interconnected networks. NTP's chief design aims and features are: to provide a service enabling clients across the Internet to be synchronized accurately to UTC; to provide a reliable service that can survive lengthy losses of connectivity - there are redundant servers and redundant paths between the servers; to enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most computers; to provide protection against interference with the time service, whether malicious or accidental. NTP service is provided by a network of servers located across the Internet. These include: primary servers, directly connected to a source of UTC; secondary servers, synchronized to primary servers; and servers of other levels (called strata). Synchronization subnet: the servers connected in a logical hierarchy.

10 1.1 An Example Synchronization Subnet in an NTP Implementation
Figure: an example synchronization subnet. Note: arrows denote synchronization control; numbers denote strata.

11 1.1 NTP Synchronization Modes
NTP servers synchronize in three modes: multicast mode, procedure-call mode, and symmetric mode. Estimating delay and offset in the NTP protocol: a = Ti-2 - Ti-3, b = Ti - Ti-1, di = a + b, oi = (a - b)/2. Multicast mode is intended for use on a high-speed LAN. One or more servers periodically multicast the time to the servers running in other computers connected by the LAN. This mode can only achieve relatively low accuracy. Procedure-call mode is similar to Cristian's algorithm: one server accepts requests from other computers, which it processes by replying with its timestamp. This mode is suitable where higher accuracy is required than can be achieved with multicast. Symmetric mode is intended for use by the master servers that supply time information in LANs and by the higher levels (lower strata) of the synchronization subnet, where the highest accuracy is to be achieved. A pair of servers operating in symmetric mode exchange messages bearing timing information. For each pair of messages sent between two servers, NTP computes: the offset oi, an estimate of the actual offset between the two clocks, and the delay di, the total transmission time for the two messages. If the true offset of the clock at B relative to that at A is o, and the actual transmission times for m and m' are t and t' respectively, then Ti-2 = Ti-3 + t + o and Ti = Ti-1 + t' - o. Letting a = Ti-2 - Ti-3 and b = Ti - Ti-1, we get di = t + t' = a + b and o = oi + (t' - t)/2, where oi = (a - b)/2. Among a number of samples, the offset oj that corresponds to the minimum delay dj is chosen to estimate o.
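A minimal sketch of this per-exchange computation (timestamps named after the slide's Ti-3 ... Ti; the helper names are illustrative):

```python
def ntp_offset_delay(t_i3, t_i2, t_i1, t_i):
    """Per-exchange NTP estimates (sketch). t_i3 = Ti-3: client sends m,
    t_i2 = Ti-2: server receives m, t_i1 = Ti-1: server sends m',
    t_i = Ti: client receives m'."""
    a = t_i2 - t_i3
    b = t_i - t_i1
    delay = a + b              # d_i = t + t'
    offset = (a - b) / 2       # o_i, estimate of server clock minus client clock
    return offset, delay

def best_offset(samples):
    """Among several (offset, delay) samples, keep the offset whose delay is smallest."""
    return min(samples, key=lambda od: od[1])[0]
```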

12 1.2 Logical Time and Logical Clocks
The order of events: two events occur in the order in which they appear in a process; the event of sending a message occurs before the event of receiving it. The happened-before relation, denoted by →. HB1: If ∃ process p: x →p y, then x → y. HB2: For any message m, send(m) → rcv(m). HB3: If x, y and z are events such that x → y and y → z, then x → z. The order of events occurring at different processes can be critical in a distributed application. To determine the order of the events that occur at different processes, two obvious points apply: if two events occurred at the same process, then they occurred in the order in which it observes them; whenever a message is sent between processes, the event of sending the message occurred before the event of receiving it. The ordering obtained by generalizing these two relationships is called the happened-before relation. We write x →p y if two events x and y occurred at a single process p and x occurred before y. Using this restricted order we can define the happened-before relation, denoted by →, as follows. HB1: If ∃ process p: x →p y, then x → y. HB2: For any message m, send(m) → rcv(m), where send(m) is the event of sending the message and rcv(m) is the event of receiving it. HB3: If x, y and z are events such that x → y and y → z, then x → z.

13 1.2 Logical Timestamps Example
Events occurring at three processes. If x → y, then we can find a series of events occurring at one or more processes such that either HB1 or HB2 applies between each successive pair. The sequence of events need not be unique. If two events are not related by the → relation (i.e., neither a → b nor b → a), then they are concurrent (a || b). Example: a → b → c → d → f; e → f but a || e.

14 1.2 Lamport Logical Timestamps
Logical clock - a monotonically increasing software counter. Cp: logical clock for process p; Cp(a): timestamp of event a at p; C(b): timestamp of event b at whatever process it occurred. LC1: before each event issued at process p: Cp := Cp + 1. LC2: a) p sends message m to q with value t = Cp; b) q sets Cq := max(Cq, t) and applies LC1 before timestamping rcv(m). If a → b then C(a) < C(b), but not vice versa! Total-order logical clocks and vector clocks. Lamport invented a simple mechanism by which the happened-before ordering can be captured numerically, called a logical clock. A logical clock is a monotonically increasing software counter. Each process p keeps its own logical clock, Cp, which it uses to timestamp events. Cp(a) denotes the timestamp of event a at p, and C(b) denotes the timestamp of event b at whatever process it occurred. To capture the happened-before relation →, processes update their logical clocks and transmit the values of their logical clocks in messages: LC1: Cp is incremented before each event is issued at process p: Cp := Cp + 1. LC2: a) When a process p sends a message m, it piggybacks on m the value t = Cp. b) On receiving (m, t), a process q computes Cq := max(Cq, t) and then applies LC1 before timestamping the event rcv(m). If a → b then C(a) < C(b), but not vice versa! Logical clocks impose only a partial order on the set of all events. We can extend this to a total order in which all pairs of distinct events are ordered. If a is an event occurring at pa with local timestamp Ta, and b is an event occurring at pb with local timestamp Tb, we define the global logical timestamps for these events to be (Ta, pa) and (Tb, pb) respectively, and define (Ta, pa) < (Tb, pb) if and only if either Ta < Tb, or (Ta = Tb and pa < pb). A vector clock for a system of N processes is an array of N integers, with which a → b if and only if vector_clock(a) < vector_clock(b).
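A minimal sketch of rules LC1 and LC2 (class and method names are illustrative):

```python
class LamportClock:
    """Lamport logical clock: a monotonically increasing software counter."""
    def __init__(self):
        self.c = 0

    def tick(self):                 # LC1: increment before each local event
        self.c += 1
        return self.c

    def send(self):                 # LC2a: piggyback the current value t = Cp on the message
        return self.tick()

    def receive(self, t):           # LC2b: Cq := max(Cq, t), then apply LC1 to rcv(m)
        self.c = max(self.c, t)
        return self.tick()
```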

15 1.2 Lamport Timestamps Example
Events occurring at three processes

16 1.3 Vector Clocks Vector clock
A vector clock for a system of N processes is an array of N integers. Each process keeps its own vector clock Vi, which it uses to timestamp local events. VC1: Initially, Vi[j] = 0, for i, j = 1, 2, ..., N. VC2: Just before pi timestamps an event, it sets Vi[i] := Vi[i] + 1. VC3: pi includes the value t = Vi in every message it sends. VC4: When pi receives a timestamp t in a message, it sets Vi[j] := max(Vi[j], t[j]), for j = 1, 2, ..., N. Taking the component-wise maximum of two vector timestamps in this way is known as a merge operation. Mattern [1989] and Fidge [1991] developed vector clocks to overcome the shortcoming of Lamport's clocks: the fact that from C(e) < C(e') we cannot conclude that e → e'. Like Lamport timestamps, processes piggyback vector timestamps on the messages they send to one another, and there are simple rules for updating the clocks, as given above (VC1-VC4). For a vector clock Vi, Vi[i] is the number of events that pi has timestamped, and Vi[j] (j ≠ i) is the number of events that have occurred at pj by which pi has potentially been affected. (Process pj may have timestamped more events by this point, but no information about them has yet flowed to pi in messages.)
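A minimal sketch of rules VC1-VC4 (class and method names are illustrative):

```python
class VectorClock:
    """Vector clock for process i in a system of n processes."""
    def __init__(self, n, i):
        self.v = [0] * n            # VC1: all entries start at zero
        self.i = i

    def event(self):                # VC2: increment own entry before timestamping
        self.v[self.i] += 1
        return list(self.v)

    def send(self):                 # VC3: piggyback t = Vi on the outgoing message
        return self.event()

    def receive(self, t):           # VC4: component-wise merge, then VC2 for rcv(m)
        self.v = [max(a, b) for a, b in zip(self.v, t)]
        return self.event()
```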

17 1.3 Vector Clocks Example Events occurring at three processes

18 1.5 Lamport Timestamps Exercise

19 1.5 Vector Clocks Exercise

20 1.4 Comparison
In Lamport's logical clocks, C(a) < C(b) does not imply a → b; with vector timestamps, V(a) < V(b) does imply a → b. Vector timestamps take up an amount of storage and message payload that is proportional to N, the number of processes, while Lamport's clocks do not. For vector timestamps, let V(a) be the vector timestamp applied by the process at which a occurs. We can compare vector timestamps as follows: V = V' iff V[j] = V'[j] for j = 1, 2, ..., N; V ≤ V' iff V[j] ≤ V'[j] for j = 1, 2, ..., N; V < V' iff V ≤ V' and V ≠ V'. It is straightforward to show, by induction on the length of any sequence of events relating two events a and b, that a → b ⇒ V(a) < V(b). However, with Lamport's clocks, from C(a) < C(b) we cannot conclude that a → b. Vector timestamps have the disadvantage, compared with Lamport timestamps, of taking up an amount of storage and message payload that is proportional to N, the number of processes.
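A minimal sketch of these comparisons on vector timestamps represented as Python lists (function names are illustrative):

```python
def vt_leq(u, v):
    """V <= V' iff V[j] <= V'[j] for every j."""
    return all(a <= b for a, b in zip(u, v))

def vt_less(u, v):
    """V < V' iff V <= V' and V != V'; then a -> b iff V(a) < V(b)."""
    return vt_leq(u, v) and u != v

def concurrent(u, v):
    """Neither timestamp is less than the other: the events are concurrent."""
    return not vt_less(u, v) and not vt_less(v, u)
```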

21 2 Coordination
Distributed processes need to coordinate their activities. Distributed mutual exclusion is required, with safety, liveness, and ordering properties. Election algorithms: methods for choosing a unique process for a particular role. Distributed processes often need to coordinate their activities. For example, if a collection of processes share a resource or collection of resources managed by a server, then mutual exclusion is often required to prevent interference and ensure consistency when accessing the resources. A separate, generic mechanism for distributed mutual exclusion is required, whereby a single process is temporarily given a privilege (the right to access the shared resources) before another process is granted it. A method for choosing a unique process to play a particular role is called an election algorithm; for example, a new time server needs to be elected if the previous one fails.

22 2.1 Distributed Mutual Exclusion
The basic requirements for mutual exclusion: ME1 (safety): at most one process may execute in the critical section (CS) at a time. ME2 (liveness): a process requesting entry to the CS is eventually granted it. ME3 (ordering): entry to the CS should be granted in happened-before order. The central server algorithm. A ring-based algorithm. A distributed algorithm using logical clocks. Our basic requirements for mutual exclusion concerning some resource are as follows. ME1 (safety): At most one process may execute in the critical section (CS) at a time. ME2 (liveness): A process requesting entry to the CS is eventually granted it (so long as any process executing in the CS eventually leaves it). ME2 implies that the implementation is deadlock-free and that starvation does not occur. ME3 (ordering): Entry to the CS should be granted in happened-before order (causal ordering). The central server algorithm: the simplest way to achieve mutual exclusion is to employ a server that grants permission to enter a critical section; a server bottleneck and a single point of failure are two problems with this approach. A ring-based algorithm makes use of a ring topology of processes for mutual exclusion. A distributed algorithm using logical clocks is based on distributed agreement, instead of using a central server. A sketch of the central server approach follows.
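A minimal sketch of the central server approach, with in-process callbacks standing in for grant messages (all names are illustrative; this sketch addresses ME1 and ME2 but not the happened-before ordering of ME3):

```python
import queue
import threading

class LockServer:
    """Central server granting mutual exclusion: at most one holder at a time;
    requests made while the CS is occupied wait in a FIFO queue."""
    def __init__(self):
        self.holder = None
        self.waiting = queue.Queue()
        self.mutex = threading.Lock()       # protects the server's own state

    def request(self, pid, grant):          # grant(pid) notifies the requester
        with self.mutex:
            if self.holder is None:
                self.holder = pid
                grant(pid)
            else:
                self.waiting.put((pid, grant))

    def release(self, pid):
        with self.mutex:
            assert self.holder == pid
            if self.waiting.empty():
                self.holder = None
            else:
                self.holder, grant = self.waiting.get()
                grant(self.holder)
```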

23 2.2 Elections An election is a procedure carried out to choose a process from a group. A ring-based election algorithm. The bully algorithm. An election is a procedure carried out to choose a process from a group, for example to take over the role of a process that has failed. The main requirement is for the choice of elected process to be unique, even if several processes call elections concurrently. A ring-based election algorithm is suitable for a collection of processes that are arranged in a logical ring. Each process only knows how to communicate with its neighbour in, say, the clockwise direction. The goal of this algorithm is to elect a single coordinator which is the process with the largest identifier. The bully algorithm can be used when the members of the group know the identities and addresses of the other members. The algorithm selects the surviving member with the largest identifier to function as the coordinator.

24 2.2.1 Ring-Based Election Algorithm
Each process P(i) has a communication channel to the next process, P((i+1) mod N). Messages are sent clockwise. The goal is to elect a single process called the coordinator, which is the process with the largest identifier.

25 2.2.1 Ring-Based Algorithm
Figure: a ring of processes with identifiers 1, 3, 5, 7, 12 and 34, each initially with status non-participant; arrows show the direction of message flow.

26 2.2.1 Ring-Based Algorithm
Figure: the same ring (identifiers 1, 3, 5, 7, 12 and 34), with processes marked participant as election messages circulate; suppose process 7 begins the election.
Any process can begin an election. A process begins an election by marking itself as a participant and sending an election message containing its identifier to its neighbour. When a process receives an election message, it compares the identifier in the message with its own. If the arrived identifier is greater, it forwards the message to its neighbour and marks itself a participant. If the arrived identifier is smaller and the receiver is not a participant, it substitutes its own identifier in the message and forwards it (if the receiver is already a participant, it does not forward the message). If the received identifier is that of the receiver itself, then this process's identifier must be the greatest, and it becomes the coordinator. Finally, the coordinator marks itself as a non-participant once more and sends an elected message to its neighbour, announcing its election and enclosing its identity. This message travels one more round to announce the coordinator to all the participants. A sketch of these rules follows.
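A minimal sketch of the receive-side rules, assuming a send_to_neighbour message-passing helper (all names are illustrative):

```python
def ring_election_receive(self_id, state, msg, send_to_neighbour):
    """Handle one incoming ring-election message. msg is ('election', id) or
    ('elected', id); state holds 'participant' (bool) and 'coordinator'."""
    kind, ident = msg
    if kind == 'election':
        if ident > self_id:
            state['participant'] = True
            send_to_neighbour(('election', ident))
        elif ident < self_id and not state['participant']:
            state['participant'] = True
            send_to_neighbour(('election', self_id))
        elif ident == self_id:                      # our own id went all the way round
            state['participant'] = False
            state['coordinator'] = self_id
            send_to_neighbour(('elected', self_id))
    else:                                           # 'elected' announcement
        state['coordinator'] = ident
        state['participant'] = False
        if ident != self_id:
            send_to_neighbour(('elected', ident))
```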

27 2.2.2 Bully Algorithm
The system is synchronous, i.e., processes use timeouts to detect a process failure. Unlike the ring-based algorithm, in which processes only know their neighbours, the bully algorithm assumes that each process knows which processes have higher identifiers and can communicate with them. There are three types of message: election, answer, and coordinator.
Figure: processes with identifiers 1, 5, 12 and 13; the coordinator is 13, because it has the highest identifier.

28 2.2.2 Bully Algorithm
The election begins when a process notices that the coordinator has failed; several processes may discover this concurrently. A process that detects the failure sends an election message to the processes with higher identifiers. When a process receives an election message, it sends back an answer message and begins another election.
Figure: processes 1, 5, 12 and 13, where 13 (the coordinator) has failed; election and answer messages are exchanged. Process 12 learns that it now has the highest identifier, because all processes with higher identifiers (i.e., process 13) have failed to answer; it then sends a coordinator message to all processes with lower identifiers. A sketch of this behaviour follows.
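A minimal sketch of one process starting a bully election, assuming message-passing helpers send(dest, msg) and wait_for_answers(), which returns True if any answer message arrives before the timeout (all names are illustrative):

```python
def bully_start_election(self_id, all_ids, send, wait_for_answers):
    """Start a bully election at process self_id; returns the new coordinator
    id if this process wins, or None if a higher process took over."""
    higher = [p for p in all_ids if p > self_id]
    if not higher:
        # No higher identifier exists: announce ourselves as coordinator.
        for p in all_ids:
            if p != self_id:
                send(p, ('coordinator', self_id))
        return self_id
    for p in higher:
        send(p, ('election', self_id))
    if not wait_for_answers():
        # No live higher-identifier process answered within the timeout: we win.
        for p in all_ids:
            if p < self_id:
                send(p, ('coordinator', self_id))
        return self_id
    return None   # a higher process answered; await its coordinator message
```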

29 3 Multicast Communication
Group (multicast) communication requires coordination and agreement. One multicast operation is much better than multiple send operations in terms of efficiency and delivery guarantees. Basic multicast: guarantees that a correct process will eventually deliver the message. Reliable multicast: requires that all correct processes in the group receive a message if any of them does. Group, or multicast, communication requires coordination and agreement. The aim is for each of a group of processes to receive copies of the messages sent to the group, with delivery guarantees. The guarantees include agreement on the set of messages that every process in the group should receive and on the delivery ordering across the group members. Group communication systems are extremely sophisticated even for static groups of processes; the problems are multiplied when processes can join and leave groups at arbitrary times. The essential feature of multicast communication is that a process issues only one multicast operation to send a message to each of a group of processes, instead of issuing multiple send operations to individual processes. Communication to all processes in the system, as opposed to a sub-group of them, is known as broadcast. A single multicast operation instead of multiple send operations can provide better efficiency and delivery guarantees. Efficiency refers to efficient utilization of bandwidth, because the message can be sent over a distribution tree with shared communication links and network hardware support; the total time taken to deliver the message to all destinations is also minimized. Delivery guarantees mean reliable message delivery with the right ordering. A basic multicast primitive guarantees that a correct process will eventually deliver the message, as long as the multicaster does not crash. A reliable multicast requires that all correct processes in the group receive a message if any of them does. That is, in addition to the basic one-to-one multicast guarantee, it also guarantees liveness (the message is eventually delivered) and the 'all or nothing' (atomicity) property.

30 3.1 Open and Closed Groups A group is said to be closed if only members of the group may multicast to it. A process in a closed group delivers to itself any message that it multicasts to the group. A group is open if processes outside the group may send to it. Closed groups of processes are useful for cooperating servers to send messages to one another that only they should receive. Open groups are useful for delivering events to groups of interested processes.

31 3 Bulletin Board Example
The same bulletin board may appear differently at two different universities. Reliable multicast is required if every user is to receive every posting eventually. The users also have ordering requirements: a FIFO ordering is desirable, since then every posting from a given user - 'A. Hanlon', say - will be received in the same order, and users can talk consistently about A. Hanlon's second posting. A causally ordered multicast is needed to guarantee that the messages whose subjects are 'Re: Microkernels' (25) and 'Re: Mach' (27) appear after the messages to which they refer. If the multicast delivery were totally ordered, then the numbering in the left-hand column would be consistent between users. This example shows that some of the above orderings are violated: A. Sahiner's posting arrives in the second table but not the first - the multicast is not reliable. Lack of total ordering - the numbering and the ordering of the items differ between the two. Since the propagation of updates is asynchronous and items are posted from different universities, items may not arrive in the same order at any two universities; this is particularly so where a site fails after it has sent an item to some sites but not others. Lack of causal ordering - note in particular items 22 and 24 in the second table, where the natural ordering may be reversed.

32 3.2 Consistency and Request Ordering
Criteria: correctness vs. expense. Total, causal, and FIFO ordering requirements. Implementing request ordering. Implementing total ordering. Implementing causal ordering with vector timestamps. The chief orderings considered here are total and causal ordering. The order in which requests are processed at different replicas is important, not only because particular ordering constraints often must be obeyed for correctness, but also because meeting ordering requirements carries certain expenses. First, the processing of a request may be delayed because a 'prior' request has yet to be processed. Secondly, protocols designed to guarantee a particular ordering can be expensive to implement in terms of the number of rounds of messages that have to be transmitted. It is therefore advisable to avoid request ordering wherever possible. Requests whose effect is the same regardless of processing order are said to commute. Any two read-only operations commute, and any two operations that do not read but write distinct data also commute. A system for managing replicated data may be able to use knowledge of commutativity to avoid the expense of request ordering. We discuss various consistency and request ordering requirements.

33 3.2.1 Total, FIFO, Causal Ordering
Let m1 and m2 be messages delivered to the group. Total ordering: either m1 is delivered before m2 or m2 is delivered before m1, at all processes. Causal ordering: if m1 happened-before m2, then m1 is delivered before m2 at all processes. FIFO ordering: if m1 is issued before m2 by the same sender, then m1 is delivered before m2 at all processes. A requirement for total ordering is exemplified by the bulletin board example, where it would be convenient if all group members could label items with the same numbers. Under totally ordered message processing, if m1 and m2 are multicast messages then either m1 is delivered before m2 at all processes or m2 is delivered before m1 at all processes. Total ordering is a general relation over events in a distributed system. Under causal ordering, if m1 and m2 are multicast messages and m1 happened-before m2, then m1 is delivered before m2 at all processes. Under FIFO ordering, if m1 is issued before m2 by the same sender, then m1 is delivered before m2 at all processes. This is a special case of causal ordering, since if m1 is issued before m2 then m1 happened-before m2 (but not vice versa). Another ordering, not discussed in the textbook, is sync ordering. A sync-ordered message forces the order of multicast messages delivered at processes to be 'in sync', in the sense that every other message is consistently delivered before it or after it at all of them. Sync ordering is needed because a causally ordered message and a totally ordered message can be delivered in an arbitrary order unless they are causally related. A sync-ordered message effectively flushes any outstanding multicast messages that have been issued but not yet delivered everywhere, so that they are delivered before it; all later multicast messages are delivered after it. It thus draws a conceptual line across the system, dividing all message processing consistently into a 'past' and a 'future'. Under sync ordering, if m1 and m2 are multicast messages and m1 is sync-ordered, then either m1 is delivered before m2 at all processes or m2 is delivered before m1 at all processes, regardless of the declared ordering of m2.

34 3.2.1 Ordering of Multicast Messages
Notice the consistent ordering of totally ordered messages T1 and T2; FIFO-related messages F1 and F2, and C1 and C2; causally related messages C1 and C3 (assuming C3 is a reply to C1 at P3); and the otherwise arbitrary delivery ordering of messages. This slide shows three sets of multicast messages, {T1, T2}, {F1, F2, F3}, and {C1, C2, C3}. Totally ordered messages, FIFO-related messages, and causally related messages can be seen in this example.

35 3.2.2 Implementing Message Ordering
Hold-back: a received message is not delivered until ordering constraints can be met. Stable message: all prior messages have been processed. Hold-back queue vs. delivery queue. Safety property: no message will be delivered out of order by being prematurely transferred. Liveness property: no message should wait on the hold-back queue forever. A received message is not delivered for processing until the ordering constraints can be met; that is, it is held back. The bulletin board item 'Re: Microkernels' should be held back until an item concerning Microkernels has already appeared. A message is said to be stable at a process if all prior messages (defined according to the type of ordering) have been processed, i.e., if it is ready to be processed next. Incoming messages are placed initially on a hold-back queue, where they remain until their order has been determined; when stable, they are placed on a processing (or delivery) queue. The safety property requires that no message be delivered out of order by being transferred from the hold-back queue to the delivery queue prematurely; that is, the implementation must guarantee that once a message has been processed, it is impossible for a 'prior' message to arrive. The liveness property is that no message should wait on the hold-back queue indefinitely; an incorrect implementation might await some 'prior' message that will never in fact arrive, and so would never transfer an existing message to the delivery queue.

36 3.2.2 The Hold-Back Queue The hold-back queue, as shown here, retains any message that cannot yet be delivered. Such queues are often needed to meet message delivery guarantees.
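A minimal sketch of a hold-back queue for per-sender FIFO ordering, using sequence numbers (names are illustrative; the stability test would differ for causal or total ordering):

```python
class FifoHoldBack:
    """Hold-back queue for FIFO ordering: a message from a sender is stable
    (transferable to the delivery queue) only when it carries the next
    expected per-sender sequence number."""
    def __init__(self):
        self.next_seq = {}       # sender -> next expected sequence number
        self.held = {}           # sender -> {seq: message} held back so far

    def receive(self, sender, seq, msg):
        expected = self.next_seq.setdefault(sender, 0)
        self.held.setdefault(sender, {})[seq] = msg
        delivered = []
        # Transfer every message that is now stable to the delivery queue.
        while expected in self.held[sender]:
            delivered.append(self.held[sender].pop(expected))
            expected += 1
        self.next_seq[sender] = expected
        return delivered
```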

37 3.2.3 Implementing Total Ordering
Basic approach: assign totally ordered identifiers (sequence numbers) to messages. Method 1: a sequencer. Method 2: distributed agreement in assigning sequence numbers.
Figure: (1) P1 multicasts the message to the group, (2) the receivers return proposed sequence numbers, (3) P1 multicasts the agreed sequence number.
The basic approach to implementing total ordering is to assign totally ordered identifiers to messages so that each process makes the same ordering decision based upon these identifiers. The first method is to use a sequencer to assign the identifiers. All messages are sent to the sequencer as well as to the RM sites. The sequencer assigns consecutive increasing identifiers to messages as it receives them, and forwards the assigned identifiers to the RM sites. Messages arriving at an RM site are held back until they are next in the sequence. Note that the sequencer can become a bottleneck. The second method is to achieve distributed agreement in assigning message sequence numbers. A process, P1, multicasts its message to the members of the group; the group may be open or closed. The receiving processes propose sequence numbers for messages as they arrive and return these to the sender, which uses them to generate the agreed sequence number. Each process stores Fmax, the largest sequence number agreed so far, and Pmax, its own largest proposed sequence number. (This is a simplified explanation; details can be found in the textbook.) P1 sends the message bearing a temporary id to all other processes; the id is larger than any other id used by P1. Each process replies with a proposed final identifier of max(Fmax, Pmax) + i/N, where i is the process id and N is the total number of processes. The sender process (P1) collects all the proposed ids and selects the largest one. P1 then notifies all processes of the final id. When the message at the front of the hold-back queue has this final id, it is stable and can be transferred to the delivery queue. Advantages of this approach include: simple to implement, no bottleneck, no single point of failure, and adaptable. The disadvantage is that the algorithm is expensive in terms of the number of messages sent. This total ordering algorithm guarantees neither causal ordering nor FIFO ordering.
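A minimal sketch of the first (sequencer) method: the sequencer numbers each message and receivers hold messages back until they are next in sequence (class names are illustrative, and the separate order message is folded into one call for brevity):

```python
class Sequencer:
    """Assigns consecutive global sequence numbers to multicast messages."""
    def __init__(self):
        self.next_id = 0

    def order(self, msg_id):
        seq, self.next_id = self.next_id, self.next_id + 1
        return msg_id, seq               # broadcast (msg_id, seq) to the group

class TotalOrderReceiver:
    """Holds messages back until their sequence number is the next expected one."""
    def __init__(self):
        self.expected = 0
        self.held = {}                   # seq -> message (the hold-back queue)

    def on_order(self, msg, seq):
        self.held[seq] = msg
        delivered = []
        while self.expected in self.held:
            delivered.append(self.held.pop(self.expected))
            self.expected += 1
        return delivered
```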

38 3.2.4 Implementing Causal Ordering
Vector timestamp: a list of counts of update events, one for each of the processes. Merging vector timestamps: choose the larger value from the two vectors, component-wise.
Figure: processes P1, P2 and P3 exchanging updates bearing vector timestamps.
The timestamp of the bulletin board can be represented by a list of counts of update events, one for each of the processes. Such an array of event counts is called a vector timestamp or a multipart timestamp. In order to maintain causal ordering it is necessary to ensure that each front end reads from a version of the bulletin board that is at least as advanced as the version from which it last read. Also, a new item should be added at a process only when the process already reflects all causally prior updates. Vector clock update algorithm (again a simplified version; more details can be found in the textbook): all processes pi initialize VTi to 0. When pi generates a new event, it increments VTi[i] by 1; the value vt = VTi is piggybacked on the message. When pj processes a request bearing a timestamp vt, it updates VTj := merge(VTj, vt), where merge(u, v)[k] = max(u[k], v[k]), for k = 1, 2, ..., n. For the partial order, u ≤ v iff u[k] ≤ v[k] for all k, and u < v iff u ≤ v and u ≠ v. It can be shown that, if e and f are events and u and v are their timestamps, then u < v iff e happened-before f. This scheme is used in the gossip architecture and in ISIS causally-ordered multicast.
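A minimal sketch of causally ordered delivery using these vector timestamps, under the common assumption that each sender increments only its own entry before multicasting; the deliverability test (one way of deciding when a held-back message reflects all causally prior updates) and the names are illustrative:

```python
class CausalReceiver:
    """Delivers multicast messages in causal order using vector timestamps."""
    def __init__(self, n, i):
        self.vt = [0] * n       # VT_i, one entry per process
        self.i = i
        self.held = []          # hold-back queue of (sender, vt, msg)

    def deliverable(self, sender, vt):
        # Next message from this sender, and nothing causally prior is missing.
        return (vt[sender] == self.vt[sender] + 1 and
                all(vt[k] <= self.vt[k] for k in range(len(vt)) if k != sender))

    def receive(self, sender, vt, msg):
        self.held.append((sender, vt, msg))
        delivered, progress = [], True
        while progress:
            progress = False
            for item in list(self.held):
                s, t, m = item
                if self.deliverable(s, t):
                    self.held.remove(item)
                    self.vt = [max(a, b) for a, b in zip(self.vt, t)]   # merge
                    delivered.append(m)
                    progress = True
        return delivered
```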

39 4 Summary
Timing issues: synchronizing physical clocks; logical time and logical clocks. Distributed coordination and mutual exclusion. Elections: ring-based algorithm, bully algorithm. Multicast communication. Read textbook Chapters 14 and 15. We address timing issues in distributed systems, including synchronizing physical clocks and the concepts of logical time and logical clocks. We further discuss the concepts and techniques for distributed coordination and mutual exclusion. We then describe two election algorithms: a ring-based algorithm and the bully algorithm. Finally, we address techniques and issues relating to multicast communication, in which basic multicast, reliable multicast, and ordered multicast are discussed. Read textbook Chapter 14 and Chapter 15.

