State Machines
Sabina Petride
General Problems
- Consensus
  - a particular problem
  - algorithms and different formulations
  - correctness and time analysis
- Application to Data Replication
  - replica coordination
  - group membership; reintegration
  - unique identifiers using logical/real clocks
The Paxos Parliament and the Consensus Problem
- The Paxos Parliament
  - determined the law of the land, defined by the sequence of decrees passed
  - each legislator had his own ledger with decrees, their unique numbers and their contents
  - entries in ledgers could not be modified or deleted
  - legislators could leave the court for very long periods of time and return later
  - communication only by messengers (who could lose a message, deliver it many times, or lose all the messages)
- Requirements
  - consistency of the ledgers
  - progress, to ensure that some decree will eventually be passed
- The Synod
  - basically the same problem as with the Parliament, except that a single decree had to be passed
  - the group of priests/legislators asked to vote on a decree was called the quorum
This can be modeled as a consensus problem:
- Agreement: no two ledgers should contain different decrees with the same number (no conflicts among ledgers)
- Validity: any decree should be written in the standard form
- Termination (the progress condition)

Agreement and validity are guaranteed, and progress is possible, if three conditions are satisfied:
- B1: Each ballot has a unique number.
- B2: The quorums of any two ballots have at least one priest in common.
- B3: For every ballot, if any priest in the quorum has voted in an earlier ballot, then the decree equals the decree of the latest of those earlier ballots.
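Condition B3 amounts to a small selection rule for the new ballot's decree. A minimal sketch (function and argument names are hypothetical, not from the source):

```python
def choose_decree(prior_votes, default_decree):
    """B3: if any quorum member voted in an earlier ballot, the new
    decree must equal the decree of the latest such vote; otherwise
    any decree (here, the leader's own proposal) may be chosen."""
    # prior_votes: list of (ballot_number, decree) pairs, possibly empty
    if not prior_votes:
        return default_decree
    _, decree = max(prior_votes, key=lambda vote: vote[0])
    return decree
```

B1 (unique ballot numbers) guarantees that the maximum is well defined: no two prior votes can tie on the ballot number with different decrees.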
Assumptions About the System
- partially synchronous distributed system in which processes take actions within time l and messages are delivered within time d
- the system does not necessarily exhibit this "normal" timing behavior
- each process has a direct communication channel with every other process
- allowed failures:
  - timing failures (the bounds l and d can occasionally be exceeded)
  - loss, duplication, or reordering of messages
  - process stopping
- some stable storage is needed
- process recovery is considered
The Synod Algorithm
(1) Priest p chooses a new ballot number b and sends a NextBallot(b) message to some set of priests.
(2) When a priest q receives NextBallot(b), he checks the notes in the back of his ledger and determines the vote v with the largest ballot number less than b that he has cast. If no such vote exists, a default value null(q) is used. q sends p a LastVoted(b,v) message.
(3) After p receives a LastVoted(b,v) message from every priest in a majority set Q, he initiates a new ballot with number b, quorum Q, and decree d chosen according to B3. p records the new ballot and sends BeginBallot(b,d) to Q.
(4) If q receives BeginBallot(b,d) and decides to vote, he records the vote in the back of his ledger and sends Voted(b,q) to p.
(5) If p has received Voted(b,q) from every q in Q, he writes d in his ledger and sends Success(d) to all priests.
(6) After receiving Success(d), a priest enters d in his ledger.
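The six steps above can be simulated in memory for the failure-free case. This is only a sketch under simplifying assumptions (synchronous calls, no message loss, the whole quorum responds); class and method names are hypothetical:

```python
class Priest:
    """In-memory priest for a single-ballot simulation (no failures)."""
    def __init__(self):
        self.last_vote = None    # (ballot, decree) of the latest vote; None plays null(q)
        self.promised = 0        # highest NextBallot number seen so far

    def on_next_ballot(self, b):
        # Step 2: report the latest vote with ballot number less than b.
        self.promised = max(self.promised, b)
        return self.last_vote

    def on_begin_ballot(self, b, d):
        # Step 4: vote unless a later NextBallot was promised meanwhile.
        if b >= self.promised:
            self.last_vote = (b, d)
            return True
        return False

def conduct_ballot(b, decree, quorum):
    """Steps 1-6 run by leader p, with every priest in the quorum responding."""
    last_votes = [q.on_next_ballot(b) for q in quorum]     # steps 1-2
    prior = [v for v in last_votes if v is not None]
    d = max(prior)[1] if prior else decree                 # decree chosen per B3
    votes = [q.on_begin_ballot(b, d) for q in quorum]      # steps 3-4
    return d if all(votes) else None                       # steps 5-6: success or not
```

Running two ballots back to back shows B3 at work: once a decree has been voted on, a later ballot with a different proposal still ends up passing the earlier decree.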
Notes on the Synod Algorithm
- to maintain B1, each ballot has to receive a unique number; this can be done by
  - having each priest note the ballots in his ledger
  - partitioning the set of possible ballot numbers among the priests (later we will talk about different implementations)
- a priest should not cast a vote after receiving BeginBallot(b,d) if he has already sent a LastVote(b',v') message for some other ballot with b' > b, since casting that vote would invalidate the information carried by the LastVote message
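The partitioning scheme for B1 can be sketched in one line: priest number pid (out of n) draws its ballot numbers from the residue class pid mod n, so no two priests can ever collide. Names here are hypothetical:

```python
def next_ballot(counter, pid, n):
    """Partition the ballot numbers among n priests without coordination:
    priest pid uses pid, pid + n, pid + 2n, ... - unique across priests (B1)
    and strictly increasing per priest."""
    return counter * n + pid
```

Each priest only needs to persist its own counter in stable storage to never reuse a number after a crash.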
"name": "Notes on The Synod Algorithm zto maintain B1, each ballot has to receive a unique number; this can be done by yhaving each priest noting the ballots in his ledger ypatitioning the set of possible ballots among the priests ( later we will talk about different implementations) za priest should not cast the vote after receiving BeginBallot(b,d) if he has already sent a LastVote(b’,v’) message for some other ballot and v.bal’
Stating the Problem in Terms of State Machines
- a state machine consists of
  - state variables (encoded in states)
  - commands (which transform the states)
    - each command is implemented by a deterministic program, and its execution is atomic with respect to other commands
- clock I/O automaton: a specific state machine devised by Lynch and Tuttle for modeling, verifying, and analyzing time-based systems
Clock I/O Automata
A clock I/O automaton A consists of
- a set of states: states(A)
- a nonempty set start(A) of start states
- a set of actions, partitioned into input, output, internal, and time-passage actions, and specified in the signature of A
- a transition relation steps(A), a subset of states(A) x acts(A) x states(A)

No input action can be blocked: for every state s and every input action a, there is a state s' such that (s,a,s') is a step of A. A time-passage action ν(t) models the passage of real time t. A special real-valued variable Clock is included in each state to model the local clock of the process; Clock need not track real time exactly.
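The definition can be made concrete with a toy automaton: states are dictionaries carrying a Clock variable, time-passage actions advance the clock, and input actions are always enabled. This is a hypothetical illustration, not the formal model:

```python
class ToyClockAutomaton:
    """Bare-bones clock I/O automaton: each state contains a real-valued
    Clock, and a time-passage action nu(t) advances it.  Input actions
    are never blocked - step() accepts them in every state."""
    def __init__(self):
        self.state = {"Clock": 0.0, "received": 0}   # a start state

    def step(self, action):
        kind = action[0]
        if kind == "time-passage":        # nu(t): local clock advances by t
            self.state["Clock"] += action[1]
        elif kind == "input":             # input-enabled: always applicable
            self.state["received"] += 1
        return self.state
```

Note that nothing forces Clock to match real time: a timing failure is modeled by a ν(t) step whose clock increment differs from the real-time increment.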
The Synod Algorithm in Terms of Clock GTA
- The Distributed Setting
  - relation to the Paxos setting: priest/process, law book/state, passing a decree/executing a command
  - complete network of n processes with unique identifiers from a totally ordered set known to all processes
  - clock GT automata are used to model both processes and channels; each automaton has a local clock, and the local clock of a channel is used to detect timing failures
- The Algorithm
  - idea: propose values until one of them is accepted by a majority of processes
  - any process may propose a value by initiating a round for that value; it becomes the leader of that round
  - the leader and the other processes are agents
(1) The leader sends a Collect message to all agents.
(2) If an agent receives a Collect message and is already committed to a round with a bigger round number, it sends an OldRound message; otherwise, it sends a Last message with its information about rounds previously conducted.
(3) If the leader receives more than n/2 Last messages, it initiates a new round and sends a Begin message to all agents.
(4) If an agent receives the Begin message and is committed to a higher round, it sends an OldRound message; otherwise, it accepts the proposed value and responds with an Accept message.
(5) If the leader receives more than n/2 Accept messages, the round is successful and the leader's output value is the value of the round.
(6) The leader broadcasts the reached decision.

Note: the set of agents from which Last (respectively Accept) messages are received is called the info-quorum (respectively the accepting-quorum).
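The agent side of steps (2) and (4) is a pair of small handlers. A sketch, with a dictionary standing in for the agent's state (field names are hypothetical):

```python
def on_collect(agent, r):
    """Step 2: refuse with OldRound if committed to a higher round;
    otherwise commit to r and report prior-round information (Last)."""
    if agent["commit"] > r:
        return ("OldRound", agent["commit"])
    agent["commit"] = r
    return ("Last", agent["last"])          # (round, value) of latest acceptance, or None

def on_begin(agent, r, v):
    """Step 4: refuse with OldRound if committed to a higher round;
    otherwise accept the proposed value."""
    if agent["commit"] > r:
        return ("OldRound", agent["commit"])
    agent["last"] = (r, v)
    return ("Accept", r)
```

The leader then simply counts: more than n/2 Last replies let it begin the round, and more than n/2 Accept replies make the round successful.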
Implementation (2)
- BPLEADER(i) (clock GTA running the leader at process i)
  - Input: NewRound(i), Leader(i), NotLeader(i), Receive(m)(j,i) for m = Last, Accept, Success, OldRound
  - Output: Send(m)(i,j) for m = Collect, Begin; BeginCast(i); RndSuccess(v)(i)
  - Internal: Collect(i), GatherLast(i), ...
  - Time-passage: ...
- BPAGENT(i) (clock GTA running an agent at process i)
  - Input: Receive(m)(j,i) for m = Collect, Begin
  - Output: Send(m)(i,j) for m = Last, Accept, OldRound
  - Internal: LastAccept(i), Accept(i), ...
  - Time-passage: ...
Correctness Proof
- execution fragment: an alternating sequence of states and actions, following the steps of the automaton
- problem specification: a set of allowable behaviors (a behavior is the sequence of external actions of an execution fragment)
- an automaton A solves the problem if each of its behaviors is contained in the problem specification
- safety properties: must hold in every state of a computation
- liveness properties: specify events that must eventually be performed
Safety/Liveness Properties
- safety property: in any execution of the system, agreement and validity are guaranteed
- liveness property: under some conditions, termination is guaranteed
  - an execution fragment is nice if
    - no loss or duplication takes place
    - at each time-passage action the local clock advances by exactly the real-time increment
    - every process is either stopped or alive
    - a majority of processes are alive
- Theorem: If a nice execution fragment starts in a reachable state, has a unique leader, and lasts for more than 16l+8nl+9d time units, then by time 16l+8nl+9d the leader has reached a decision.

Note: the proofs are based on invariants.
Other Results on Time Performance
- If a nice execution fragment starts in a reachable state and lasts more than 24l+10nl+13d, then:
  - the leader decides by time 21l+8nl+11d, and at most 8n messages are sent
  - all alive processes decide by time 24l+10nl+13d, and at most 2n additional messages are sent
Generalization of the Synod Protocol: MULTIPAXOS
- consensus has to be reached on a sequence of values
- for each value we run BASICPAXOS
- the automata used for each instance of the algorithm are like the automata in BASICPAXOS, except that an additional parameter (the index of the proposed value) is present in each action
- concurrency: several leaders may concurrently initiate rounds, and these rounds are carried out concurrently
  - several leaders initiating values concurrently is an important difference between the Paxos algorithm and the three-phase commit protocol
Data Replication
- problem: providing distributed and concurrent access to data objects
- simple implementation: maintain the object at a single process accessed by multiple clients
  - disadvantages: does not scale well as the number of clients increases; not fault-tolerant
- other solution: data replication
  - servers are replicated: each server runs the same state machine
  - clients make requests, which are redirected to specific servers
Replica Coordination (1)
- Requirements
  - requests should be processed by a state machine one at a time
  - the order of processing should be consistent with potential causality
  - outputs are determined only by the sequence of requests, independent of time or any other activity in the system
- Replica coordination
  - agreement: every nonfaulty state machine replica receives every request
  - order: every nonfaulty state machine replica processes the requests it receives in the same relative order
  - issues to be considered: fault tolerance and reconfiguration
- MULTIPAXOS: a possible solution to the problem
Replica Coordination (2): MULTIPAXOS for Replica Coordination
- each process in the system maintains a copy of the data object
- a client requests an update operation
  - a process proposes the operation in an instance of MULTIPAXOS
  - after some time, the update operation is the output value of that instance of MULTIPAXOS
  - the leader of the round updates its local copy; by correctness, all the alive processes update their copies, too
  - a report is given to the client
- a client requests a read operation
  - the request is immediately satisfied based on the local copy

Note: a majority is used to achieve consistency -> majority voting; a unique leader is required to achieve termination -> primary copy replication
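The update path can be sketched by abstracting the per-index MULTIPAXOS instances into a shared, append-only decision log: every replica applies the decided updates in index order, so all local copies agree. Class and method names are hypothetical:

```python
class Replica:
    """Replicated object: updates reach the shared decision log through
    consensus (one MULTIPAXOS instance per index, abstracted away here);
    reads are answered from the local copy after catching up."""
    def __init__(self, decision_log):
        self.log = decision_log   # shared, append-only list of decided updates
        self.applied = 0          # index of the next undecided-for-me entry
        self.value = 0            # the local copy of the data object

    def catch_up(self):
        # Apply decided updates in index order - the same order at every replica.
        while self.applied < len(self.log):
            self.value = self.log[self.applied](self.value)
            self.applied += 1

    def read(self):
        self.catch_up()
        return self.value
```

Because each index of the log holds one consensus output, two replicas that have applied the same prefix necessarily hold the same local copy.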
Replica Coordination (3): Order and Stability
- unique identifiers for requests (total order)
- implementation: a replica next processes the stable request with the smallest unique identifier (a request is stable if no request from a correct client with a lower uid can subsequently be delivered to that state machine)
- using logical clocks to ensure order and stability:
  - each process has a local counter
  - the local counter is incremented after each event at that process
  - each message sent is timestamped with the local clock
  - upon receipt of a message, the local clock of the receiver becomes 1 + the maximum of the timestamp and the local clock
  - a uid for each event is obtained by appending a fixed-length bit string (encoding the process id) to the counter value of the process where the event takes place
- using real clocks to ensure order and stability
  - assumptions:
    - the degree of clock synchronization is better than the minimum message delivery time
    - a request r will be received by every correct process no later than uid(r)+Δ
  - stability test: a request r is stable at a state machine if the local clock reads time t and t > uid(r)+Δ
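The logical-clock rules and the real-clock stability test above can be sketched directly; representing a uid as the pair (counter, pid) gives the same total order as appending the process id bits (names are hypothetical):

```python
class LamportClock:
    """Logical clock: increment on each local event; on receipt, take
    1 + max(local counter, message timestamp).  The uid (counter, pid)
    compares lexicographically, yielding a total order on events."""
    def __init__(self, pid):
        self.pid = pid
        self.counter = 0

    def tick(self):
        # A local event (e.g. issuing a request); returns the event's uid.
        self.counter += 1
        return (self.counter, self.pid)

    def receive(self, timestamp):
        # Receipt of a message carrying the sender's counter value.
        self.counter = 1 + max(self.counter, timestamp)

def is_stable(uid_time, local_clock, delta):
    """Real-clock stability test: request r is stable once the local
    clock has passed uid(r) + delta, since no correct client's request
    with a smaller uid can still arrive."""
    return local_clock > uid_time + delta
```

The receive rule guarantees that causally later events get strictly larger uids, which is exactly the "consistent with potential causality" requirement on request ordering.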
Replica Coordination (4): Reconfiguration
- at time t there are P(t) processes, F(t) of them faulty
  - necessary condition for correct output:
    - P(t) > 2F(t) if Byzantine failures are possible
    - P(t) > F(t) if only fail-stop failures are possible
- the system is described by 3 sets: clients (C), state machines (S), and output devices (O); information about them is stored in state variables and changed by commands
  - C and O make periodic queries -> better processor sharing
  - messages sent by S always contain information about future reconfigurations -> permanent communication between S and C and between S and O
- requests to change the configuration of the system are made by a failure/recovery detection mechanism
Replica Coordination (6): Integrating a Repaired Object
- goal: integrate element e at request r
- notation: e[r] is the state a non-faulty system element e should be in after processing all the requests up to r
- if processors are fail-stop and logical clocks are implemented, then the cooperation of only one state machine replica is needed (if the sm has not failed, then it is correct, and because of consensus among replicas, its information about the system is correct and complete with respect to the other sm's) -> the sm used should have access to enough information
- implementation: e[r] is sent to e before the output produced by processing any request with uid larger than uid(r)
  - e in O: e[r] is usually device-specific setup information
    - can be stored in state variables of the sm
  - e in C: e[r] is usually based on sensor values read
    - use information passed from C to the sm
Replica Coordination (7): Integrating a Repaired State Machine
- try to reuse the algorithm: the sm sends to e the values of all its state variables before the output produced by processing any request with uid larger than uid(r)
  - problem: some client request might be received by the sm after sending e[r], but delivered to e before its repair
- solution: the sm must relay to e the requests it receives from clients
  - for how long: as soon as e has received a request directly from a client c, requests from the same c with larger uid need not be relayed to e
  - so e should inform the sm of the uid of requests received directly from c
- algorithm:
  (1) the sm sends e the values of its state variables and copies of all pending requests
  (2) the sm sends to e every subsequent request r received from a client c such that uid(r) is smaller than the uid of the first request e received directly from c
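The relay rule in step (2) can be sketched as a filter: the sm keeps relaying a client's requests only until e has heard from that client directly, and then only those with smaller uid. A hypothetical helper (names and the dictionary shape are assumptions):

```python
def requests_to_relay(pending, first_direct_uid):
    """Which requests sm still relays to the repaired replica e.
    pending: requests as {"client": ..., "uid": ...} dicts.
    first_direct_uid: for each client c, the uid of the first request
    e received directly from c (reported back by e); clients absent
    from the map have not yet reached e directly, so everything from
    them is still relayed."""
    keep = []
    for r in pending:
        bound = first_direct_uid.get(r["client"])
        if bound is None or r["uid"] < bound:
            keep.append(r)
    return keep
```

Once every client with pending requests appears in the map, the relay phase is over and e processes requests on its own, in uid order like any other replica.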
"name": "Replica Coordination(7) Integrating A Repaired State Machine ztry to use the algorithm: sm sends to e the values of all its state variables before the output produced by processing any request with uid larger than uid(r)....",
"description": "problem: some client request might be recieved by sm after sending e[r], but delivered to e before its repair zsolution: sm must relay to e requests received from clients yhow long: as soon as e has received a request directly from a client c, requests from the same c with larger uid need not be relayed to e yso, e should inform sm of the uid of requests received directly from c zalgorithm: (1) sm sends e the values of its state variables and copies of pending requests (2) sm sends to e every subsequent request r received from client c s.t. uid(r)