Replicated state machine and Paxos

Replicated state machine and Paxos Jinyang Li (NYU) and Frans Kaashoek

Fault tolerance => replication
How to recover a single node from a power failure?
- Wait for reboot: data is durable, but the service is temporarily unavailable
- Use multiple nodes to provide the service: another node takes over
How to make sure the nodes respond in the same way?

Replicated state machine (RSM)
- RSM is a general replication method
- Lab 8: apply RSM to the lock service
RSM rules:
- All replicas start in the same initial state
- Every replica applies operations in the same order
- All operations must be deterministic
=> All replicas end up in the same state
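
To make these rules concrete, here is a small Python sketch (not from the slides; the toy lock-service operations are made up): three replicas apply the same deterministic operations in the same order and end up in identical states.

# Toy replicated state machine: each replica applies the same log of
# deterministic operations in the same order, so all replicas end up
# in the same state.

class LockStateMachine:
    def __init__(self):
        self.locks = {}               # lock name -> holder (same initial state)

    def apply(self, op):
        """Ops must be deterministic: result depends only on current state + op."""
        kind, lock, client = op
        if kind == "acquire" and lock not in self.locks:
            self.locks[lock] = client
            return "OK"
        if kind == "release" and self.locks.get(lock) == client:
            del self.locks[lock]
            return "OK"
        return "RETRY"

log = [("acquire", "a", "c1"), ("acquire", "a", "c2"), ("release", "a", "c1")]

replicas = [LockStateMachine() for _ in range(3)]
for r in replicas:
    for op in log:                    # every replica applies ops in the same order
        r.apply(op)

assert all(r.locks == replicas[0].locks for r in replicas)   # identical state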

RSM
[Diagram: clients submit opA and opB concurrently; different replicas may receive them in different orders]
How to maintain a single order in the face of concurrent client requests?

RSM using primary/backup
[Diagram: clients send opA and opB to the primary, which forwards them to the backup in a single agreed order]
Primary/backup ensures a single order of ops:
- The primary orders operations
- Backups execute operations in that order

When does the primary respond?
[Diagram: 1. client sends opA to the primary; 2. primary sends Prepare opA to the backups; 3. backups reply OK; 4. primary sends Commit opA; 5. primary returns the result of opA to the client]
- The primary responds after a majority of backups have committed the op
- This runs two-phase commit
- Lab 8: no persistent state; messages 2 and 3 can be avoided
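
A minimal sketch of the primary-side rule (my own illustration, not the lab's RPC code): forward the op to the backups, count acknowledgements, and only reply once a majority of backups have prepared it.

# Sketch of the primary's reply rule: forward the op to the backups, wait for
# a majority of acknowledgements, commit, then return the result to the client.
# Backup here is an in-memory stand-in for a real replica reached over RPC.

class Backup:
    def __init__(self):
        self.log = []

    def prepare(self, op):            # message 2; the reply is message 3
        self.log.append(op)
        return True

    def commit(self, op):             # message 4
        pass

def primary_execute(op, backups, apply_locally):
    acks = sum(1 for b in backups if b.prepare(op))
    if acks < len(backups) // 2 + 1:              # need a majority of backups
        raise RuntimeError("too few backups reachable; cannot commit")
    for b in backups:
        b.commit(op)
    return apply_locally(op)                      # message 5: result to client

print(primary_execute("opA", [Backup(), Backup()], lambda op: op + " done"))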

Caveats in Hypervisor RSM
Hypervisor assumes failure detection is perfect
- What if the network between primary and backup fails?
  - The primary is still running
  - The backup becomes a new primary
  - Two primaries at the same time!
- Can timeouts detect failures correctly?
  - Pings from backup to primary are lost
  - Pings from backup to primary are delayed

Paxos: fault-tolerant agreement
- Paxos lets all nodes agree on the same value despite node failures, network failures, and delays
- Extremely useful, e.g.:
  - Nodes agree that X is the primary
  - Nodes agree that Y is the last operation executed

Paxos: general approach
- One (or more) node decides to be the leader
- The leader proposes a value and solicits acceptance from the others
- The leader announces the result, or tries again

Paxos requirements
Correctness (safety):
- All nodes agree on the same value
- The agreed value X has been proposed by some node
Fault tolerance:
- If fewer than N/2 nodes fail, the remaining nodes reach agreement eventually w.h.p.
- Liveness is not guaranteed

Why is agreement hard?
- What if more than one node becomes leader simultaneously?
- What if there is a network partition?
- What if a leader crashes in the middle of solicitation?
- What if a leader crashes after deciding but before announcing the result?
- What if the new leader proposes values different from the already-decided value?

Paxos setup
- Each node runs as a proposer, acceptor, and learner
- The proposer (leader) proposes a value and solicits acceptance from acceptors
- The leader announces the chosen value to learners

Strawman
- Designate a single node X as acceptor (e.g. the one with the smallest ID)
- Each proposer sends its value to X
- X decides on one of the values
- X announces its decision to all learners
Problem? Failure of the single acceptor halts the decision. Need multiple acceptors!

Strawman 2: multiple acceptors
- Each proposer (leader) proposes to all acceptors
- Each acceptor accepts the first proposal it receives and rejects the rest
- If the leader receives positive replies from a majority of acceptors, it chooses its own value
- There is at most one majority, hence only a single value is chosen
- The leader sends the chosen value to all learners
Problem: what if multiple leaders propose simultaneously, so that no value is accepted by a majority?

Paxos solution
- Proposals are ordered by proposal #
- Each acceptor may accept multiple proposals
- If a proposal with value v is chosen, all higher-numbered proposals have value v

Paxos state
Acceptor maintains across reboots:
- np: highest proposal # seen
- na, va: highest proposal # accepted and its corresponding value
Proposer maintains:
- myn: my proposal # in the current Paxos instance
Each round of Paxos has an instance #

Proposer
PROPOSE(v):
    choose myn > np (myn must be unique across nodes)
    send PREPARE(myn) to all nodes
    if PREPARE_OK(na, va) from a majority then
        va = va with highest na, or choose own v if no value was accepted
        send ACCEPT(myn, va) to all
        if ACCEPT_OK(myn) from a majority then
            send DECIDED(va) to all

Acceptor
PREPARE(n):
    if n > np then
        np = n          (this node will not accept any proposal lower than n)
        reply <PREPARE_OK, na, va>
ACCEPT(n, v):
    if n >= np then
        na = n
        va = v
        reply <ACCEPT_OK>
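
The two roles above can be exercised end to end with a small single-process simulation. This is my own sketch under simplifying assumptions (in-memory acceptors, no failures, no persistence, proposal numbers as (round, node id) pairs), not the lab code; it runs the prepare/accept/decide phases of one Paxos instance.

# Single-process simulation of one Paxos instance: a proposer runs PREPARE,
# ACCEPT and DECIDED against in-memory acceptors. Proposal numbers are
# (round, node id) tuples, so they are unique and totally ordered.

class Acceptor:
    def __init__(self):
        self.np = (0, 0)       # highest proposal # seen in a PREPARE
        self.na = None         # highest proposal # accepted
        self.va = None         # value of that accepted proposal

    def prepare(self, n):
        if n > self.np:
            self.np = n
            return ("PREPARE_OK", self.na, self.va)
        return ("REJECT",)

    def accept(self, n, v):
        if n >= self.np:
            self.na, self.va = n, v
            return ("ACCEPT_OK",)
        return ("REJECT",)

def propose(v, my_id, round_no, acceptors):
    myn = (round_no, my_id)
    majority = len(acceptors) // 2 + 1

    # Phase 1: PREPARE(myn) to all
    oks = [r for r in (a.prepare(myn) for a in acceptors) if r[0] == "PREPARE_OK"]
    if len(oks) < majority:
        return None                       # retry later with a higher round_no
    # Adopt the value of the highest-numbered proposal already accepted, if any
    accepted = [(na, va) for _, na, va in oks if na is not None]
    va = max(accepted)[1] if accepted else v

    # Phase 2: ACCEPT(myn, va) to all
    acks = sum(a.accept(myn, va)[0] == "ACCEPT_OK" for a in acceptors)
    if acks < majority:
        return None
    return va                             # Phase 3: send DECIDED(va) to all

acceptors = [Acceptor() for _ in range(3)]
print(propose("N1 is primary", my_id=1, round_no=1, acceptors=acceptors))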

Paxos operation: 3-phase example (nodes N0, N1, N2)
- Initially each node has np = <self>:0, na = va = null
- N1 sends Prepare(N1:1) to N0 and N2; both set np = N1:1 and reply ok with na = va = null
- N1 sends Accept(N1:1, val1) to both; each sets na = N1:1, va = val1 and replies ok
- N1 sends Decide(val1) to all

Paxos properties
When is the value V chosen?
- When the leader receives a majority of prepare-ok and proposes V
- When a majority of nodes accept V
- When the leader receives a majority of accept-ok for value V

Understanding Paxos
What if more than one leader is active?
- Suppose two leaders use different proposal numbers, N0:10 and N1:11
- Can both leaders see a majority of prepare-ok?

Understanding Paxos
- What if the leader fails while sending accept?
- What if a node fails after receiving accept?
  - If it doesn’t restart …
  - If it reboots …
- What if a node fails after sending prepare-ok?

Using Paxos for RSM
- Fault-tolerant RSM requires consistent replica membership
- Membership: <primary, backups>
- RSM goes through a series of membership changes: <vid-0, primary, backups> <vid-1, primary, backups> ..
- Use Paxos to agree on the <primary, backups> for a particular vid
- vid == paxos instance #
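
One way to read "vid == paxos instance #" is sketched below (my illustration, not lab code): each new view is agreed on by running a separate Paxos instance among the members of the current view, with run_paxos_instance standing in for a full round like the one sketched earlier.

# Each view change is its own Paxos instance: the members of view `vid` run
# Paxos (instance # == vid + 1) to agree on the next <primary, backups>.
# run_paxos_instance() is a placeholder for a real prepare/accept/decide round.

def run_paxos_instance(instance_no, participants, proposed_value):
    # Real code would run Paxos among `participants`; here the proposal just wins.
    return proposed_value

views = {1: ("N1", [])}                       # vid -> (primary, backups)

def propose_view_change(cur_vid, new_members):
    primary, backups = views[cur_vid]
    participants = [primary] + backups        # only current members vote
    new_primary = primary if primary in new_members else min(new_members)
    proposal = (new_primary, [n for n in new_members if n != new_primary])
    views[cur_vid + 1] = run_paxos_instance(cur_vid + 1, participants, proposal)

propose_view_change(1, ["N1", "N2"])          # N2 joins -> vid2: N1, N2
propose_view_change(2, ["N1", "N2", "N3"])    # N3 joins -> vid3: N1, N2, N3
propose_view_change(3, ["N1", "N2"])          # N3 fails -> vid4: N1, N2
print(views)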

Lab 8: Using Paxos for RSM
- All nodes start with the static config vid1: N1
- N2 joins: a majority in vid1 (N1) accepts vid2: N1, N2
- N3 joins: a majority in vid2 (N1, N2) accepts vid3: N1, N2, N3
- N3 fails: a majority in vid3 (N1, N2, N3) accepts vid4: N1, N2

Lab 7: Using Paxos for RSM
[Diagram: N3 joins knowing only vid1: N1 and sends Prepare(vid2, N3:1) to N1. N1 and N2 are already in vid2: N1, N2, so N1 replies oldview, vid2 = N1, N2]

Lab 7: Using Paxos for RSM
[Diagram: having learned vid2: N1, N2 from the oldview reply, the joining node N3 retries with Prepare(vid3, N3:1) to the members of vid2]

Lab 8: re-configurable RSM
- Use RSM to replicate lock_server
- The primary in each view assigns a viewstamp to each client request
- A viewstamp is a tuple (vid:seqno), e.g. (0:0)(0:1)(0:2)(0:3)(1:0)(1:1)(1:2)(2:0)(2:1)
- All replicas execute client requests in viewstamp order

Lab 8: Viewstamp replication
- To execute an op with viewstamp vs, a replica must have executed all ops < vs
- A newly joined replica needs a state transfer to ensure its state reflects the execution of all ops < vs
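
A small sketch of the viewstamp ordering rule (illustrative names, not the lab's types): viewstamps compare as (vid, seqno) pairs, and a replica buffers an op until everything smaller has been executed.

# Viewstamps are (vid, seqno) pairs ordered lexicographically; a replica may
# execute the op with viewstamp vs only after executing every op < vs.

class Replica:
    def __init__(self):
        self.last_executed = (0, -1)      # highest viewstamp executed so far
        self.pending = {}                 # viewstamp -> op, waiting for its turn

    def is_next(self, vs):
        vid, seq = self.last_executed
        # next op is either the next seqno in this view, or seqno 0 of a later view
        return vs == (vid, seq + 1) or (vs[0] > vid and vs[1] == 0)

    def receive(self, vs, op):
        self.pending[vs] = op
        while self.pending:
            nxt = min(self.pending)
            if not self.is_next(nxt):
                break                     # a smaller viewstamp is still missing
            print("execute", self.pending.pop(nxt), "at", nxt)
            self.last_executed = nxt

r = Replica()
r.receive((0, 1), "opB")                  # arrives early: buffered
r.receive((0, 0), "opA")                  # now both run, in viewstamp order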

Lab 8: Using Paxos for RSM
[Diagram: N2 joins; N1 (currently at viewstamp (1:50)) and N2 run Paxos (prepare, accept, decide, …) to agree on the new view]
- Transfer state to bring N2 up to date
- The primary in the new view is the last primary, if still alive; otherwise the backup with the lowest ID
- Resume responding to clients only after the backups and primary are in sync