Operating Systems & Concurrent Programming Distributed Operating Systems & Algorithms Lecturer: Xu Qiwen Textbook: Randy Chow Theodore Johnson.

Operating Systems & Concurrent Programming Distributed Operating Systems & Algorithms Lecturer: Xu Qiwen (qwxu@umac.mo) Textbook: Randy Chow, Theodore Johnson, Addison Wesley 1997

In this course, we study
Operating systems (centralized, network and distributed), in particular the algorithms used in these systems
Concurrent programming, mainly the analysis of distributed algorithms, such as simulation and verification of the algorithms

Spin system
Modelling language Promela
  concurrent processes
  communication via message channels, either synchronous (hand-shaking) or asynchronous (buffered)
Simulation
Verification by model checking: exhaustive search of the state space to check whether properties are satisfied or not
  - system invariants
  - progress
  - Linear Temporal Logic

Spin is developed by G.J. Holzmann at AT&T
http://netlib.bell-labs.com/netlib/spin/whatisspin.html
Formal methods library: www.afm.sbu.ac.uk/fm/

A spectrum of operating systems, in order of decreasing degree of hardware and software coupling
  centralized operating system (1st generation)
  distributed operating system (3rd generation)
  network operating system (2nd generation)
  cooperative autonomous system (4th generation)

A comparison of features in modern operating systems
Generation: first
  System: centralized operating system
  Characteristics: process management, memory management, I/O management, file management
  Goals: resource management, extended machine (virtuality)
Generation: second
  System: network operating system
  Characteristics: remote access, information exchange, network browsing
  Goals: resource sharing (interoperability)
Generation: third
  System: distributed operating system
  Characteristics: global view of file system, name space, time, security, computational power
  Goals: single computer view of multiple computer system (transparency)
Generation: fourth
  System: cooperative autonomous system
  Characteristics: open and cooperative distributed applications
  Goals: cooperative work (autonomicity)

Causality
A fundamental property of a distributed system: lack of a global system state
This is due to
  - Noninstantaneous communication: propagation delay, contention for network resources, lost messages
  - Clock synchronization: clock drift
  - Unpredictable execution: CPU contention, interrupts, page faults, garbage collection
Therefore, in distributed systems, we can only talk about causality

Causal: the cause precedes the effect, sending precedes receipt
E: the set of all events
Ep: the set of all events that occur at processor p
e1 <p e2: e1 precedes e2 at processor p
  for e1, e2 in Ep, either e1 <p e2 or e2 <p e1
e1 <m e2: e1 is the sending of message m, e2 is the receipt of message m
Happens-before e1 <H e2
  1. if e1 <p e2, then e1 <H e2
  2. if e1 <m e2, then e1 <H e2
  3. if e1 <H e2 and e2 <H e3, then e1 <H e3

A happens-before relation, H-DAG
(Figure: events e1 to e8 on processors p1, p2, p3, with message arrows between processors.)
Relations shown in the figure:
  e1 <p1 e4 <p1 e7
  e3 <p2 e5
  e1 <m e3
  e5 <m e8
  e1 <H e8
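The three rules of the happens-before definition can be checked mechanically. The sketch below is illustrative only: it encodes just the direct relations listed on the slide (the positions of e2 and e6 are not recoverable from the transcript, so they are left out) and computes the transitive closure; the function name `happens_before` is an assumption, not from the course material.

```python
from itertools import product

# Direct precedence edges stated on the slide (rules 1 and 2).
edges = {
    ("e1", "e4"), ("e4", "e7"),   # e1 <p1 e4 <p1 e7
    ("e3", "e5"),                 # e3 <p2 e5
    ("e1", "e3"), ("e5", "e8"),   # e1 <m e3, e5 <m e8
}

def happens_before(direct):
    """Close the direct edges under transitivity (rule 3)."""
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

hb = happens_before(edges)
e1_before_e8 = ("e1", "e8") in hb                              # via e3 and e5
e4_e8_concurrent = ("e4", "e8") not in hb and ("e8", "e4") not in hb
print(e1_before_e8, e4_e8_concurrent)
```

With only these edges, e1 <H e8 follows from the chain e1, e3, e5, e8, while e4 and e8 are unrelated in either direction, i.e. concurrent.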

Lamport Timestamps Algorithm
global time does not exist
global "clock": a total order on the events
  must be consistent with the happens-before relation <H
  algorithm works on the fly
e.TS: timestamp attached to e
my_TS: local timestamp of the processor
Initially my_TS = 0
On event e
  if e is the receipt of message m
    my_TS = max(m.TS, my_TS)
  my_TS++
  e.TS = my_TS
  if e is the sending of message m
    m.TS = my_TS

if e1 <H e2, then e1.TS < e2.TS
to break ties of identical timestamps, Lamport suggests using the processor address for the lower order bits of the timestamp
the converse is not guaranteed: e1.TS < e2.TS does not imply e1 <H e2. Therefore, Lamport timestamps cannot be used to detect, for example, causality violations
Causality violation
  s(m): the event of sending m
  r(m): the event of receiving m
  a violation occurs if s(m1) <H s(m2), but r(m2) <H r(m1)
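A minimal sketch of the Lamport timestamp rule, assuming one `LamportClock` object per processor (the class and method names are illustrative, not from the textbook). Ties are broken here by pairing the timestamp with the processor id, which plays the role of the lower order bits.

```python
class LamportClock:
    def __init__(self, pid):
        self.pid = pid
        self.my_ts = 0                      # Initially my_TS = 0

    def event(self, m_ts=None):
        """Timestamp one event; m_ts is m.TS if the event is a receipt."""
        if m_ts is not None:
            self.my_ts = max(m_ts, self.my_ts)
        self.my_ts += 1
        return self.my_ts                   # e.TS (also m.TS for a send)

    def key(self, ts):
        """Total-order key: timestamp first, processor id as tie-breaker."""
        return (ts, self.pid)

p1, p2 = LamportClock(1), LamportClock(2)
a = p1.event()             # local event on p1
m_ts = p1.event()          # p1 sends m, carrying m.TS
b = p2.event()             # local event on p2, concurrent with a
c = p2.event(m_ts)         # p2 receives m
print(a, m_ts, b, c)
```

Here the send precedes the receipt and `m_ts < c` as required, while `a` and `b` carry the same timestamp although neither happens before the other, which illustrates why timestamps alone cannot decide causality.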

Vector timestamps have the property
  e1.VT <v e2.VT iff e1 <H e2
Must be able to tell which events of every processor an event causally follows
VT: an array of integers
  VT[i] = k: causally follows the first k events at processor i
e1.VT <=v e2.VT: e1.VT[i] <= e2.VT[i] for every i = 1…M
e1.VT <v e2.VT: e1.VT <=v e2.VT and e1.VT ≠ e2.VT

Vector timestamp algorithm
Initially my_VT = [0,…,0]
On event e
  if e is the receipt of message m
    for i = 1 to M
      my_VT[i] = max(m.VT[i], my_VT[i])
  my_VT[self]++
  e.VT = my_VT
  if e is the sending of message m
    m.VT = my_VT
We show that if e1.VT <v e2.VT, then e1 <H e2. Suppose e1 is at processor i and e2 is at processor j. From e1.VT <v e2.VT, e1.VT[i] <= e2.VT[i]. The value of e2.VT[i] is obtained from an event at processor i, therefore e1 <H e2.
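The vector timestamp rule and the comparison <v can be sketched as follows. This is an illustration under simple assumptions: processors are numbered from 0, timestamps are returned as tuples, and the helper names (`vt_less`, `concurrent`) are hypothetical.

```python
def vt_leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def vt_less(a, b):
    """a <v b: componentwise <= and not equal."""
    return vt_leq(a, b) and a != b

def concurrent(a, b):
    return not vt_less(a, b) and not vt_less(b, a)

class VectorClock:
    def __init__(self, self_index, m):
        self.self_index = self_index
        self.my_vt = [0] * m                # Initially my_VT = [0,...,0]

    def event(self, m_vt=None):
        """Timestamp one event; m_vt is m.VT if the event is a receipt."""
        if m_vt is not None:
            self.my_vt = [max(x, y) for x, y in zip(m_vt, self.my_vt)]
        self.my_vt[self.self_index] += 1
        return tuple(self.my_vt)            # e.VT (also m.VT for a send)

p0, p1 = VectorClock(0, 3), VectorClock(1, 3)
e1 = p0.event()        # p0 sends m with m.VT = e1
e2 = p1.event()        # local event on p1
e3 = p1.event(e1)      # p1 receives m
print(e1, e2, e3)
```

Unlike Lamport timestamps, the comparison works in both directions: `vt_less(e1, e3)` holds because the send precedes the receipt, and `concurrent(e1, e2)` holds because neither vector dominates the other.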

Causal communication
ensure no causality violation
assume point-to-point messages are delivered in the order sent
main idea: hold back message m until no message m' <H m will be delivered from any other processor
earliest[1,…,M]
  earliest[k]: the timestamp of the earliest message that can be delivered from processor k
  initially the smallest timestamp 1_k (1 in Lamport timestamps, (0…010…0) in vector timestamps)
blocked[1,…,M]
  blocked[k]: queue of blocked messages from processor k

Causal Message delivery algorithm
Initially
  each earliest[k] is set to 1_k, k = 1,…,M
  each blocked[k] is set to {}, k = 1,…,M
On the receipt of message m from processor p
  delivery_list = {}
  if (blocked[p] is empty)
    earliest[p] = m.timestamp
  Add m to the tail of blocked[p]
  while (there is k such that blocked[k] is not empty, and
         for every i ≠ k, self: not-earlier(earliest[i], earliest[k], i))
    remove message at head of blocked[k], put in delivery_list
    if blocked[k] is not empty
      set earliest[k] to m'.timestamp, where m' is the head of blocked[k]
    else
      increment earliest[k] by 1_k
  Deliver messages in delivery_list, in causal order

Deadlock in the algorithm
  if one processor does not send messages, the other processors will block waiting to receive
Multicast communication
  Every processor receives the same set of messages
  if p receives m1 and m2 <H m1, then p will eventually receive m2

Distributed Snapshots
no global state
distributed snapshot: a global view of the system that is consistent with causality
Si: state of processor Pi
  S = (S1, S2,…,SM)
channel Cij: communication channel from Pi to Pj
  C = {Cij | i,j ∈ 1,…,M}
Lij = (m1, m2,…,mk): messages sent by Pi but yet to be received by Pj
  L = {Lij | i,j ∈ 1,…,M}
Global state G = (S, L)

Consistent Cut
observations of different processors should be concurrent (in the original system, i.e. without the snapshot tokens)
snapshot token: special message indicating a state to be recorded
(Figure: processors p and q with observations O1, O2, O3 and snapshot tokens t. O1 and O2 are concurrent; O1 and O3 are not concurrent.)

Distributed Snapshot Algorithm
Variables
  integer my_version
  integer current_snap[1…M]
  integer tokens_received[1…M]
  processor_state S[1…M]
  channel_state L[1…M][1…M]
S[r] contains processor self's state, L[r][q] contains L(q,self) in the snapshot requested by processor r
Initially
  my_version = 0
  for each processor p
    current_snap[p] = 0

execute_snapshot()
  Wait for a snapshot request or a token
  Snapshot_Request:
    my_version++
    S[self] = current state
    current_snap[self] = my_version
    for each q in O_self
      send(q, TOKEN, self, my_version)
    tokens_received[self] = 0
  TOKEN(q; r, version): ……

TOKEN(q; r, version):
  if current_snap[r] < version
    S[r] = current state
    current_snap[r] = version
    L[r][q] = ()
    for every p in O_self
      send(p, TOKEN, r, version)
    tokens_received[r] = 1
  else if (current_snap[r] = version)
    tokens_received[r]++
    put messages received from q since first receiving token (r, version) into L[r][q]
  if tokens_received[r] = |I_self|
    the local snapshot for (r, version) is finished

Distributed Mutual Exclusion
Timestamp Algorithms
  record timestamp
  send requests to other processors; other processors grant / deny the request using timestamp info
Variables
  timestamp current_time
  timestamp my_timestamp
  integer reply_pending
  boolean is_requesting
  boolean reply_deferred[1…M]

Requesting the critical section
Request_CS()
  my_timestamp = current_time
  is_requesting = True
  reply_pending = M-1
  for every other processor q
    send(q, remote_request, my_timestamp)
  wait until reply_pending = 0
  (CS)

Monitoring
CS_monitor()
  Wait for a remote_request or a reply message
  remote_request(q, request_time):
    if (not is_requesting or my_timestamp > request_time)
      send(q, reply)
    else
      reply_deferred[q] = True
  reply(q):
    reply_pending--

Releasing critical section
Release_CS()
  (leave CS)
  is_requesting = False
  for q = 1 to M
    if reply_deferred[q] = True
      send(q, reply)
      reply_deferred[q] = False
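The request, monitor and release procedures above can be exercised in a small single-threaded simulation. This is a sketch under stated assumptions: the network is modelled as one FIFO queue (`network`, hypothetical), processors are numbered from 0, and timestamps are (time, pid) pairs so that ties cannot occur; none of these details come from the slides.

```python
from collections import deque

class Processor:
    def __init__(self, pid, m, network):
        self.pid, self.m, self.network = pid, m, network
        self.is_requesting = False
        self.my_timestamp = None
        self.reply_pending = 0
        self.reply_deferred = [False] * m

    def request_cs(self, current_time):
        self.my_timestamp = (current_time, self.pid)   # pid breaks ties
        self.is_requesting = True
        self.reply_pending = self.m - 1
        for q in range(self.m):
            if q != self.pid:
                self.network.append((q, "remote_request", self.pid, self.my_timestamp))

    def can_enter(self):
        return self.is_requesting and self.reply_pending == 0

    def handle(self, kind, sender, request_time):
        if kind == "remote_request":
            if not self.is_requesting or self.my_timestamp > request_time:
                self.network.append((sender, "reply", self.pid, None))
            else:
                self.reply_deferred[sender] = True     # answer after release
        else:                                          # reply
            self.reply_pending -= 1

    def release_cs(self):
        self.is_requesting = False
        for q in range(self.m):
            if self.reply_deferred[q]:
                self.network.append((q, "reply", self.pid, None))
                self.reply_deferred[q] = False

def drain(network, procs):
    """Deliver queued messages until the network is quiet."""
    while network:
        dest, kind, sender, ts = network.popleft()
        procs[dest].handle(kind, sender, ts)

network = deque()
procs = [Processor(i, 3, network) for i in range(3)]
procs[0].request_cs(current_time=5)
procs[1].request_cs(current_time=3)     # earlier timestamp wins
drain(network, procs)
first = [p.can_enter() for p in procs]
procs[1].release_cs()
drain(network, procs)
second = [p.can_enter() for p in procs]
print(first, second)
```

Processor 1 holds the smaller timestamp, so it defers its reply to processor 0 and enters first; processor 0 can enter only after processor 1 releases and sends the deferred reply.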

Voting
Processors compete for votes to enter critical sections
Naive Voting Algorithm
Naive_Voting_Enter_CS()
  Send a vote request to all processors
  Wait until ⌈(M+1)/2⌉ votes are received
  (CS)

Voting with districts
S_p: voting district of processor p
S_i ∩ S_j ≠ {} for 1 <= i,j <= M

Variables used in voting based algorithm
  S_self: voting district
  current_timestamp
  my_timestamp
  yes_votes
  have_voted
  candidate: candidate voted for
  candidate_timestamp
  have_inquired: true if have tried to recall a vote
  waitingQ

Requesting the critical section
Request_CS()
  my_timestamp = current_timestamp
  for every processor r in S_self
    send(r, REQUEST, my_timestamp)
  while (yes_votes < |S_self|)
    Wait until a YES, NO or INQUIRE message
    YES(q):
      yes_votes++
    INQUIRE(q, inquire_timestamp):
      if my_timestamp = inquire_timestamp
        send(q, RELINQUISH)
        yes_votes--

Monitor the critical section
Voter()
  while True
    wait until a REQUEST, RELEASE, or RELINQUISH
    REQUEST(q, request_timestamp):
      if have_voted is False
        send(q, YES)
        candidate_timestamp = request_timestamp
        candidate = q
        have_voted = True
      else
        add (q, request_timestamp) to waitingQ
        if request_timestamp < candidate_timestamp and not have_inquired
          have_inquired = True
          send(candidate, INQUIRE, candidate_timestamp)
    RELINQUISH(q): ……
    RELEASE(q): ……

RELINQUISH(q):
  add (candidate, candidate_timestamp) to waitingQ
  remove the (s, timestamp) from waitingQ such that timestamp is the minimum
  send(s, YES)
  candidate_timestamp = timestamp
  candidate = s
  have_inquired = False
RELEASE(q):
  if waitingQ is not empty
    remove the (s, timestamp) from waitingQ such that timestamp is the minimum
    send(s, YES)
    candidate_timestamp = timestamp
    candidate = s
  else
    have_voted = False
  have_inquired = False

Fixed Logical Structure
A processor can enter the critical section if it possesses a token
  Ring structure
  Tree structure

Variables used by the fixed structure algorithm (Raymond's algorithm)
  Token_hldr
  Incs
  Current_dir
  Request_Q
operations on Request_Q
  Nq(q): enqueue q
  Dq(): dequeue
  ismt(): is the queue empty

Requesting and releasing the critical section
Request_CS()
  if not Token_hldr
    if ismt()
      send(current_dir, REQUEST)
    Nq(self)
    wait until Token_hldr is True
  Incs = True
Release_CS()
  Incs = False
  if not ismt()
    current_dir = Dq()
    send(current_dir, TOKEN)
    Token_hldr = False
    if not ismt()
      send(current_dir, REQUEST)

Monitor_CS()
  while True
    wait for a REQUEST or a TOKEN
    REQUEST(q):
      if Token_hldr
        if Incs
          Nq(q)
        else
          current_dir = q
          send(current_dir, TOKEN)
          Token_hldr = False
      else
        if ismt()
          send(current_dir, REQUEST)
        Nq(q)
    TOKEN:
      current_dir = Dq()
      if current_dir = self
        Token_hldr = True
      else
        send(current_dir, TOKEN)
        if not ismt()
          send(current_dir, REQUEST)

Path compression
Variables
  Token_hldr
  Incs
  IsRequesting
  current_dir
  next: the next processor to receive the token; NIL if the processor is at the end of the waiting list (if the processor just requested)

Request_CS()
  IsRequesting = True
  if not Token_hldr
    send(current_dir, REQUEST, self)
    current_dir = self
    next = NIL
    wait until Token_hldr is True
  Incs = True
Release_CS()
  Incs = False
  IsRequesting = False
  if next ≠ NIL
    Token_hldr = False
    send(next, TOKEN)
    next = NIL

Monitor_CS()
  while True
    wait for a REQUEST or a TOKEN
    REQUEST(requester):
      if IsRequesting
        if next = NIL
          next = requester
        else
          send(current_dir, REQUEST, requester)
      else if Token_hldr
        Token_hldr = False
        send(requester, TOKEN)
      else
        send(current_dir, REQUEST, requester)
      current_dir = requester
    TOKEN:
      Token_hldr = True

Leader Election
coordinator / participant(s)
The Bully Algorithm
Assumptions
  1. message propagation time Tm
  2. message handling time Tp
Failure detector: timeout T = 2Tm + Tp
Variables
  state: {Down, Election, Reorganization, Normal}
  coordinator
  definition
  up
  halted

Correctness Assertions
1. If state_i ∈ {Normal, Reorganization} and state_j ∈ {Normal, Reorganization}, then coordinator_i = coordinator_j
2. If state_i = state_j = Normal, then definition_i = definition_j
3. (liveness) Eventually the following is true: for some node i, state_i = Normal and coordinator_i = i, and for every other nonfailed node j, state_j = Normal and coordinator_j = i

Idea of the Bully Algorithm
Each node has a priority
In an election, a node first checks whether the higher priority nodes have failed; if so, the node knows it should be the leader
The leader "bullies" the other nodes into accepting its leadership
An election is initiated
  by Coordinator_Timeout, if a node does not hear from the coordinator for a long time, or
  by Recovery, when the node recovers from a failure
The leader calls an election if it detects that a processor has failed or a failed processor has recovered

Algorithm to initiate an election by a node
Coordinator_Timeout()
  if state = Normal or state = Reorganization
    send(coordinator, AreYouUp)
    wait until coordinator sends (AYU_answer), timeout = T
      on timeout
        Election()
Recovery()
  state = Down
  Election()

Algorithm by the coordinator to check the state of other processors
Check()
  if state = Normal and coordinator = self
    for every other node j
      send(j, AreYouNormal)
      wait until j sends (AYN_answer, status), timeout = T
      if (j ∈ up and status = False) or j ∉ up
        Election()
        return()

Bully election algorithm
Election()
  highest = True
  for every higher priority processor p
    send(p, AreYouUp)
  wait up to T seconds for (AYU_answer)
    AYU_answer(sender):
      highest = False
  if highest = False
    return()
  state = Election
  halted = self
  up = {}
  for every lower priority processor p
    send(p, Enter_Election)
  wait up to T for (EE_answer)
    EE_answer(sender):
      up = up ∪ {sender}

Bully election algorithm continued
Election()
  ……
  num_answers = 0
  coordinator = self
  state = Reorganization
  for each p in up
    send(p, Set_Coordinator, self)
  wait up to T for (SC_answer)
    SC_answer(sender):
      num_answers++
  if num_answers < |up|
    Election()
    return()

Bully election algorithm continued
Election()
  ……
  num_answers = 0
  for each p in up
    send(p, New_State, definition)
  wait up to T for (NS_answer)
    NS_answer(sender):
      num_answers++
  if num_answers < |up|
    Election()
    return()
  state = Normal

Monitoring the election
Monitor_Election()
  while (true)
    wait for a message
    case AreYouUp(sender)
      send(sender, AYU_answer)
    case AreYouNormal(sender)
      if state = Normal
        send(sender, AYN_answer, True)
      else
        send(sender, AYN_answer, False)
    case Enter_Election(sender)
      state = Election
      stop_processing()
      stop the election procedure if it is executing
      halted = sender
      send(sender, EE_answer)

Monitoring the election continued
Monitor_Election()
  ……
    case Set_Coordinator(sender, newleader)
      if state = Election and halted = newleader
        coordinator = newleader
        state = Reorganization
        send(sender, SC_answer)
    case New_State(sender, newdef)
      if coordinator = sender and state = Reorganization
        definition = newdef
        state = Normal

The Invitation Algorithm
Assumption: delay can be arbitrary, no global coordinator
Processors form groups; different groups have different coordinators; groups are merged into larger groups.
Correctness assertions
1. If state_i ∈ {Normal, Reorganization}, state_j ∈ {Normal, Reorganization}, and Group_i = Group_j, then Coordinator_i = Coordinator_j
2. If state_i = state_j = Normal and Group_i = Group_j, then Definition_i = Definition_j

Check()
  if state = Normal and Coordinator = self
    others = {}
    for every other node p
      send(p, AreYouCoordinator)
    wait up to T seconds for (AYC_answer) messages
      AYC_answer(sender, is_coordinator):
        if is_coordinator
          others = others ∪ {sender}
    if others = {}
      return()
    wait for a time inversely proportional to your priority
    Merge(others)

Timeout()
  if Coordinator = self
    return()
  send(Coordinator, AreYouThere, Group)
  wait for AYT_answer, timeout is T
    on timeout
      is_coordinator = False
    AYT_answer(sender, is_coordinator):
      skip
  if is_coordinator = False
    Recovery()

Merge(Coordinator_set)
  if Coordinator = self and state = Normal
    state = Election
    stop_processing()
    counter++
    Group = (self | counter)
    Coordinator = self    { not necessary, or problem with interleaving with Invitation()? * }
    UpSet = Up
    Up = {}
    For each p in Coordinator_set
      send(p, Invitation, self, Group)
    For each p in UpSet
      send(p, Invitation, self, Group)
    Wait for T seconds    /* Answers are collected by the Monitor_Election thread */
* Invitation() contains Coordinator = new_coordinator

Merge(Coordinator_set)
  ……
    state = Reorganization
    num_answer = 0
    For each p in Up
      send(p, Ready, Group, Definition)
    wait up to T seconds for Ready_answer messages
      Ready_answer(sender, in_group, new_group):
        if in_group and new_group = Group
          num_answer++
    if num_answer < |Up|
      Recovery()
    else
      state = Normal

Invitation()
  while True
    wait for Invitation(new_coordinator, new_group)
    if state = Normal
      stop_processing()
      old_coordinator = Coordinator
      UpSet = Up
      state = Election
      Coordinator = new_coordinator
      Group = new_group
      if old_coordinator = self
        for each p in UpSet
          send(p, Invitation, Coordinator, Group)
      send(Coordinator, Accept, Group)
      ……
Question: is this better put in the Monitor thread?

Invitation()
  ……
      wait up to T seconds for an Accept_answer(sender, accepted)
        on timeout
          accepted = False
      if accepted = False
        Recovery()
      state = Reorganization

Election_Monitor()
  while True
    wait for a message
    Ready(sender, new_group, new_definition):
      if Group = new_group and state = Reorganization
        Definition = new_definition
        state = Normal
        send(Coordinator, Ready_answer, True, Group)
      else
        send(sender, Ready_answer, False)

    ……
    AreYouCoordinator(sender):
      if state = Normal and Coordinator = self
        send(sender, AYC_answer, True)
      else
        send(sender, AYC_answer, False)
    AreYouThere(sender, old_group):
      if Group = old_group and Coordinator = self and sender in Up
        send(sender, AYT_answer, True)
      else
        send(sender, AYT_answer, False)
    Accept(sender, new_group):
      if state = Election and Coordinator = self and Group = new_group
        Up = Up ∪ {sender}
        send(sender, Accept_answer, True)
      else
        send(sender, Accept_answer, False)

Recovery()
  state = Election
  stop_processing()
  Counter++
  Group = (self | Counter)
  Coordinator = self
  Up = {}
  state = Reorganization
  Definition = {a single node task description}
  state = Normal

Data Management
The ACID properties
Atomicity: Either all of the operations in a transaction are performed or none of them, in spite of failures
Consistency (serializability): The execution of interleaved transactions is equivalent to a serial execution of the transactions in some order
Isolation: Partial results of an incomplete transaction are not visible to others before the transaction is successfully committed
Durability: The system guarantees that the results of a committed transaction will be made permanent even if a failure occurs after the commitment

Data Replication ACID properties more difficult to ensure

Atomicity
All processors involved in the transaction agree to either commit or abort the transaction
Naive protocol: the coordinator completes its execution, commits, and sends commit messages to the other processors
Problem of the naive protocol: if a participant processor fails, it will not successfully commit (therefore, not all processors commit)
Database technique: Two-phase Commit

2PC_Coordinator()
  precommit the transaction
  for every participant p
    send(p, VOTE_REQ)
  wait up to T for VOTE messages
    VOTE(sender, vote_response):
      if vote_response = YES
        increment the number of yes votes
  if each participant responded with a YES vote
    commit the transaction
    for every participant p
      send(p, COMMIT)
  else
    abort the transaction
    for every participant p
      send(p, ABORT)

2PC_Participant()
  while True
    wait for a message from the coordinator
    VOTE_REQ(coordinator):
      if I can commit the transaction
        precommit the transaction
        write a YES vote to the log
        send(coordinator, YES)
      else
        abort the transaction
        send(coordinator, NO)
    COMMIT(coordinator):
      commit the transaction
    ABORT(coordinator):
      abort the transaction
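The coordinator's decision rule in two-phase commit can be sketched in a few lines. This is a simplified illustration, not a fault-tolerant implementation: each participant is modelled as a function returning "YES", "NO", or None (standing for no vote before the timeout T), and the names `two_phase_commit`, `all_yes`, `one_no` and `one_silent` are hypothetical.

```python
def two_phase_commit(participants):
    """Run both phases and return (decision, log of protocol steps)."""
    log = ["coordinator: precommit"]
    yes_votes = 0
    for name, vote in participants.items():        # phase 1: VOTE_REQ
        response = vote()
        log.append(f"{name}: vote {response}")
        if response == "YES":
            yes_votes += 1
    # Commit only if every participant answered YES within the timeout.
    decision = "COMMIT" if yes_votes == len(participants) else "ABORT"
    for name in participants:                      # phase 2: COMMIT / ABORT
        log.append(f"{name}: {decision}")
    return decision, log

all_yes = {"p1": lambda: "YES", "p2": lambda: "YES"}
one_no = {"p1": lambda: "YES", "p2": lambda: "NO"}
one_silent = {"p1": lambda: "YES", "p2": lambda: None}   # p2 timed out

decision_yes, log_yes = two_phase_commit(all_yes)
decision_no, _ = two_phase_commit(one_no)
decision_silent, _ = two_phase_commit(one_silent)
print(decision_yes, decision_no, decision_silent)
```

A missing vote is treated exactly like a NO vote: the transaction commits only when all participants have voted YES.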

Failure of any processor prior to the vote request: abort.
If the coordinator fails after precommitting but before committing: abort after recovery (the textbook also says "in practice, the coordinator will attempt to commit". My understanding of this is that the coordinator will perform another round of vote requests).
If a participant fails after precommitting but before committing: after recovery, contact other processors to decide (the transaction may or may not have committed).

Disadvantage of two-phase commit
  if the coordinator fails after a participant has voted YES, the participant must wait until the coordinator recovers
  The protocol cannot complete: it is blocked
Three Phase Commit
  avoids blocking if a majority of processors agree on the action

Serializability (consistency)
An execution is serializable if its result is equivalent to that of a serial execution
Example
  t0: bt Write A=100, Write B=20 et
  t1: bt Read A, Read B
         1: Write sum in C
         2: Write diff in D et
  t2: bt Read A, Read B
         3: Write diff in C
         4: Write sum in D et
Conflict: Write-Write, Write-Read, Read-Write

Interleaving schedules (t0 < t1 < t2)
Schedule   log in C         log in D         Result (C,D)            2PL           Timestamp
1,2,3,4    W1=120, W2=80    W1=80, W2=120    (80,120) consistent     feasible      feasible
3,4,1,2    W2=80, W1=120    W2=120, W1=80    (120,80) consistent     feasible      t1 aborts and restarts
1,3,2,4    W1=120, W2=80    W1=80, W2=120    (80,120) consistent     not feasible  feasible
3,1,4,2    W2=80, W1=120    W2=120, W1=80    (120,80) consistent     not feasible  t1 aborts and restarts
1,3,4,2    W1=120, W2=80    W2=120, W1=80    (80,80) inconsistent    not feasible  cascade aborts
3,1,2,4    W2=80, W1=120    W1=80, W2=120    (120,120) inconsistent  not feasible  t1 aborts and restarts

Two Phase Locking (2PL)
A growing phase of locking, a shrinking phase of releasing
An extreme case: lock all objects at the beginning, release all at the end. Serialization is trivial, but there is no concurrency; suitable for simple applications
2PL:
1. A transaction must obtain a read or a write lock on data d before reading d, and must obtain a write lock on d before updating d
2. After a transaction relinquishes a lock, it may not acquire any new locks
* Many transactions can hold read locks on the same data item, but if one transaction holds a write lock, no other transaction can hold any lock on it

2PL
  concurrency is limited
  deadlock is possible (e.g., if t2 writes D and then writes C)
strict 2PL: locks are released together, usually at the commit or abort point
  non-strict 2PL is difficult to implement, since it is difficult to know when the last lock has been requested
  strict 2PL sacrifices some concurrency

Timestamp ordering
1. When an operation on a shared object is invoked, the object records the timestamp of the invoking transaction
2. When a (different) transaction invokes a conflicting operation on the object: if it has a larger timestamp than the one recorded by the object, the transaction proceeds (and the new timestamp is recorded); otherwise the transaction is aborted (and restarts with a larger timestamp)
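The timestamp ordering rule can be tried on the t0, t1, t2 example. The sketch below is an illustration under assumptions not stated on the slides: each object keeps separate read and write timestamps (a common refinement of the single recorded timestamp), t1 and t2 carry timestamps 1 and 2, and each write step re-reads A and B instead of reading them once at the start of the transaction. The names `TimestampedObject`, `Abort` and `run` are hypothetical.

```python
class Abort(Exception):
    pass

class TimestampedObject:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0      # largest timestamp that has read the object
        self.write_ts = 0     # timestamp of the last writer (t0 has timestamp 0)

    def read(self, ts):
        if ts < self.write_ts:                  # Write-Read conflict, too late
            raise Abort(f"t{ts} aborts")
        self.read_ts = max(self.read_ts, ts)
        return self.value

    def write(self, ts, value):
        if ts < self.read_ts or ts < self.write_ts:   # Read-Write / Write-Write
            raise Abort(f"t{ts} aborts")
        self.write_ts = ts
        self.value = value

def run(schedule):
    """Execute write steps 1..4 in the given order; return (C, D) or the abort."""
    A, B = TimestampedObject(100), TimestampedObject(20)    # written by t0
    C, D = TimestampedObject(), TimestampedObject()
    steps = {
        1: lambda: C.write(1, A.read(1) + B.read(1)),   # t1: sum in C
        2: lambda: D.write(1, A.read(1) - B.read(1)),   # t1: diff in D
        3: lambda: C.write(2, A.read(2) - B.read(2)),   # t2: diff in C
        4: lambda: D.write(2, A.read(2) + B.read(2)),   # t2: sum in D
    }
    try:
        for step in schedule:
            steps[step]()
    except Abort as abort:
        return str(abort)
    return (C.value, D.value)

serial = run([1, 2, 3, 4])
interleaved = run([1, 3, 2, 4])
late_t1 = run([3, 4, 1, 2])
print(serial, interleaved, late_t1)
```

Schedules in which every conflicting pair is executed in timestamp order go through, while a schedule in which t1 writes C after t2 makes t1 abort, in line with the Timestamp column of the schedule table.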

Optimistic Concurrency Control execution phase validation phase update phase

One-copy serializability
The result of execution is equivalent to a serial one on nonreplicated objects
Read operations: Read-one-primary, Read-one, Read-quorum
Write operations: Write-one-primary, Write-all, Write-all-available, Write-quorum, Write-gossip

Read-one / Write-all-available
Example
  t0: bt W(X) W(Y) et
  t1: bt R(X) W(Y) et
  t2: bt R(Y) W(X) et
t0 is the initialization, followed by t1 and t2. Only serial schedules (t1 t2 or t2 t1) are consistent.
Now replicate X to Xa and Xb, Y to Yc and Yd; Xa and Yd fail
  t1: bt R(Xa) (Yd fails) W(Yc) et
  t2: bt R(Yd) (Xa fails) W(Xb) et
No conflict is detected, but the execution is not one-copy serializable

Quorum Voting
Read-quorum: each read operation on a replicated object d must obtain a read quorum R(d)
Write-quorum: each write operation on d must obtain a write quorum W(d)
Quorums must overlap
V(d): total number of copies
  Write-Write conflict: 2W(d) > V(d)
  Read-Write conflict: R(d) + W(d) > V(d)
R(d) = 1, W(d) = V(d): Read-one, Write-all
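The two overlap conditions are easy to check numerically. A small sketch, where the function name `valid_quorum` and the choice of V = 5 copies are illustrative assumptions:

```python
def valid_quorum(v, r, w):
    """True if quorum sizes r and w satisfy both overlap conditions for v copies."""
    write_write = 2 * w > v      # any two write quorums intersect
    read_write = r + w > v       # any read quorum intersects any write quorum
    return write_write and read_write

V = 5
choices = [(r, w) for r in range(1, V + 1) for w in range(1, V + 1)
           if valid_quorum(V, r, w)]
smallest_read = {w: min(r for r, w2 in choices if w2 == w)
                 for w in sorted({w for _, w in choices})}
print(smallest_read)
```

For five copies, a write quorum of 3 needs a read quorum of at least 3, while the read-one, write-all setting (R = 1, W = 5) sits at the other end of the trade-off.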

Gossip Update Propagation
Many applications do not need one-copy serializability
Basic Gossip Protocol
  TSi: last update time of the data object (maintained by Replica Manager, RM i)
  TSf: timestamp of the last successful access operation (maintained by File Service Agent, FSA)
Read: TSf is compared with TSi
  if TSf ≤ TSi (the data at the replica is more recent)
    return the value
    TSf is set to TSi
  else
    wait until the data is updated by gossip

Update: TSf++
  if TSf > TSi
    the update is executed
    TSi = TSf
    propagate the new data by gossip
  else
    (the update is too late; possible actions: overwrite, or become more up to date by a read)
Gossip: A gossip message carrying a data value from replica j to replica i is accepted if TSj > TSi

In the Basic Gossip Protocol, updates are simple overwrites (they do not depend on the current state). To handle read-modify updates (which depend on the current state), use the Causal Order Gossip Protocol. Example of causal order gossip: Figure 6.12

Distributed Agreement A number of processors, some of them faulty, try to agree on a value. Assumption: faulty processors may do anything, including the worst (Byzantine). Aim: a protocol which allows all the non faulty processors to reach the agreement.

Byzantine agreement
In an ancient war in Byzantium, some Byzantine generals are loyal, but some are disloyal. The loyal generals need to decide whether to attack together or retreat.
Question: Suppose there are 3 generals, 2 loyal and 1 disloyal. Can the loyal generals reach agreement?
(Figure: two three-general scenarios. In each, a loyal general receives 1 attack and 1 retreat, so it cannot decide.)

Question: Can the loyal generals reach agreement if there are 4 generals, 3 loyal and 1 disloyal?
(Figure: two four-general scenarios with orders A (attack) and R (retreat). In each, the loyal generals collect 2A, 1R.)

Theorem Suppose there are M generals, t disloyal ones. If M≤3t, then the generals cannot reach agreement. Proof idea: Suppose the theorem is not true. Let one general simulate t generals, then the three general problem can also be solved. Contradiction.

Byzantine general's broadcast
BG_Send(k, v, I)
  send v to every general in I
BG_Receive(k)
  Let v be the value received, or "Retreat" if no value is received before the timeout
  Let I be the set of generals who have never broadcast v (the delivery list for this message)
  BG_Send(k-1, v, I-self)
  Use BG_Receive(k-1) to receive v(i) for every i in I-self
  return majority(v, v(1),…,v(|I|-1))

Majority and default decision
Majority(v1, v2,…,vn)
  Return the majority v among v1, v2,…,vn, or "Retreat" if no majority exists
Base case
BG_Send(0, v, I)
  The commanding general broadcasts v to every other general in I
BG_Receive(0)
  Return the value received, or "Retreat" if no message is received
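The recursive broadcast can be simulated to replay the three-general and four-general questions. This is a sketch under assumptions that go beyond the slides: message passing is replaced by a recursive function call, "majority" means a strict majority, and traitors follow a fixed lying strategy (`flip` sends the opposite of the true value, `split` tells odd-numbered receivers to attack and the others to retreat). All function names are hypothetical.

```python
from collections import Counter

def majority(values, default="Retreat"):
    """Strict majority of the values, or the default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else default

def bg(k, commander, value, receivers, traitors, lie):
    """Return {receiver: value decided for commander's broadcast} after BG(k)."""
    received = {i: lie(i, value) if commander in traitors else value
                for i in receivers}
    if k == 0:
        return received
    # Every receiver j rebroadcasts what it received with BG(k-1).
    relayed = {j: bg(k - 1, j, received[j],
                     [i for i in receivers if i != j], traitors, lie)
               for j in receivers}
    return {i: majority([received[i]] +
                        [relayed[j][i] for j in receivers if j != i])
            for i in receivers}

def flip(receiver, value):
    return "Retreat" if value == "Attack" else "Attack"

def split(receiver, value):
    return "Attack" if receiver % 2 else "Retreat"

# 4 generals: loyal commander 0, traitor 3.
loyal_commander = bg(1, 0, "Attack", [1, 2, 3], {3}, flip)
# 4 generals: traitorous commander 0 sends conflicting orders.
traitor_commander = bg(1, 0, "Attack", [1, 2, 3], {0}, split)
# 3 generals: loyal commander 0, traitor 2.
three_generals = bg(1, 0, "Attack", [1, 2], {2}, flip)
print(loyal_commander, traitor_commander, three_generals)
```

With four generals the loyal ones agree in both scenarios (and follow a loyal commander's order). With three generals, loyal general 1 sees one "Attack" and one "Retreat", falls back to the default, and so fails to follow the loyal commander.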

(Figure: commanding general C sends orders O1,…,O6 to generals 1 to 6; each general then relays the order it received to the others, e.g. general 1 sends L1:O1 to generals 2, 3, 4, 5, 6, and general 6 relays it as L6:L1:O1.)
General 2 decides the value from general 1 by majority(L1:O1, L3:L1:O1, L4:L1:O1, L5:L1:O1, L6:L1:O1)

In a similar way, 2 decides the value from generals 3, 4, 5, 6. Finally, general 2 decides the value by taking the majority of these values together with the one received from C.

Lemma For any t and k, if the commanding general is loyal, the BG ( k ) protocol is correct if there are no more than t traitors and at least 2t+k+1 generals (2t+k in textbook, mistake?). Proof. By induction on k. Base case k=0, BG (k) works because the loyal generals just accept the orders from the commanding general which is assumed to be loyal.

Assume BG(k-1) works for 2t+k generals and t traitors. Consider the case of 2t+k+1 generals and t traitors.
(Figure: commanding general C sends orders O1 = O2 = … = O(t+k) = … = O(2t+k) to the other 2t+k generals.)
After receiving the command from the commanding general, each loyal general among the other 2t+k generals (at least t+k of them) will broadcast the correct command. There are at most t traitors. By the assumption, a loyal general will decide on the correct values of the other t+k-1 loyal generals. Together with the order from the commanding general, there are t+k > t correct orders and at most t incorrect orders, so the loyal general will decide on the right order.

Theorem For any k, the BG(k) protocol is correct if there are more than 3k generals and no more than k traitors.
Proof: Induction on k. Base case k = 0: the protocol is correct, because there are no traitors. Assume BG(k-1) works if there are more than 3(k-1) generals and no more than k-1 traitors. Consider the case of 3k+1 generals and k traitors. If the commander is loyal, then the Lemma says the protocol is correct, because there are 3k+1 = 2k+k+1 generals. If the commander is disloyal, then when any other general rebroadcasts, there are 3k > 3(k-1) generals and k-1 traitors, so the loyal generals agree on the rebroadcast orders, and therefore will agree on the final order.

Distributed Shared Memory (DSM)
Process Communication Paradigms
  message passing
  remote procedure call (RPC)
  distributed shared memory, first introduced by K. Li in his PhD thesis, 1986
RPC and DSM provide abstraction, and they are implemented by message passing in distributed systems.
DSM has mapping and management software between the DSM and the message passing mechanism

Shared Memory tightly coupled systems memory accessed via a common bus or network direct information sharing programming is similar to conventional shared memory programming (a logical shared memory) memory management problems: efficiency, coherence/consistency

A generic NUMA architecture
(Figure: processors, each with a local memory and a memory coherence controller, connected by buses or a network.)
NUMA: Nonuniform Memory Access
  local/remote accesses, not uniform

NUMA Architectures

Memory Consistency Models
Process viewpoint (compared to the data viewpoint of a distributed file system)
  weak consistency: more concurrency, difficult to program
  strong consistency: less concurrency, easy to program

General Access Consistency Models
R(X)v: read variable X, value v
W(X)v: write variable X with value v
Atomic (strict) consistency: All reads and writes must appear to be executed atomically and sequentially. All processors observe the same ordering of event execution, which coincides with the real-time occurrence.
Example, atomically consistent:
  P1: W(X)1
  P2:       R(X)1
Example, not atomically consistent:
  P1: W(X)1
  P2:       R(X)0 R(X)1
This is the strictest consistency model. High complexity in implementation. Usually used only as a baseline to evaluate the performance of other consistency models.

Sequential consistency
Defined by Lamport: The result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
Interleaving; real-time order is not required.
Example, atomically consistent:
  P1: W(X)1
  P2:       R(X)1 R(X)1
Example, not atomically consistent:
  P1: W(X)1
  P2:       R(X)0 R(X)1
Both are sequentially consistent.
Programming friendly, but poor performance.

Causal consistency
Writes that are potentially causally related must be seen by all processors in the same order. Concurrent writes may be seen in a different order on different processors (and therefore may not lead to a global sequential order).
  P1: W(X)1             W(X)3
  P2:       R(X)1 W(X)2
  P3:       R(X)1             R(X)3 R(X)2
  P4:       R(X)1             R(X)2 R(X)3
causally consistent, not sequentially consistent

Causal consistency (continued)
  P1: W(X)1
  P2:       R(X)1 W(X)2
  P3:                   R(X)2 R(X)1
  P4:                   R(X)1 R(X)2
not causally consistent
If we remove R(X)1 at P2, W(X)1 and W(X)2 are concurrent
  P1: W(X)1
  P2:       W(X)2
  P3:             R(X)2 R(X)1
  P4:             R(X)1 R(X)2
causally consistent

Processor consistency
Writes from the same processor are performed and observed in the order they were issued. Writes from different processors can be observed in any order.
  P1: W(X)1
  P2:       R(X)1 W(X)2
  P3:                   R(X)1 R(X)2
  P4:                   R(X)2 R(X)1
processor consistent, not causally consistent

Slow memory consistency
Writes to the same location by the same processor must be observed in order.
  P1: W(X)1 W(Y)2 W(X)3
  P2:             R(Y)2 R(X)1 R(X)3
slow memory consistent

Consistency models with synchronization access
Use information supplied by the user to relax consistency
Synchronization access: read/write operations on synchronization variables, only by special instructions
Weak consistency
  Accesses to synchronization variables are sequentially consistent
  No access to a synchronization variable is issued by a processor before all previous read/write operations have been performed
  No read/write data access is issued by a processor before a previous access to a synchronization variable has been performed
Example, weakly consistent:
  P1: W(X)1 W(X)2 S
  P2:               R(X)1 R(X)2 S
  P3:               R(X)2 R(X)1 S
Example, not weakly consistent:
  P1: W(X)1 W(X)2 S
  P2:               S R(X)1

Release consistency
Use a pair of synchronization operations: acquire(S) and release(S)
  No future access can be performed until the acquire operation is completed
  All previous operations must have been performed before the completion of the release operation
  The order of synchronization accesses follows the processor consistency model (acquire: read, release: write)

Entry consistency
Locking objects, instead of locking critical sections
For each shared variable X, associate acquire(X) and release(X)
acquire(X) locks the shared variable X for the subsequent exclusive operations on X, until X is unlocked by a release(X)
