Strong Consistency and Agreement
COS 461: Computer Networks, Spring 2011
Mike Freedman

[Cartoon caption: "Jenkins, if I want another yes-man, I'll build one!" – Lee Lorenz, Brent Sheppard]

What consistency do clients see?
Distributed stores may store data on multiple servers
– Replication provides fault tolerance if servers fail
– Allowing clients to access different servers potentially increases scalability (max throughput)
– Does replication necessitate inconsistencies? Harder to program and reason about, confusing for clients, …?

Consistency models
Strict → Strong (Linearizability) → Sequential → Causal → Eventual (from stronger to weaker consistency models)
These models describe when and how different nodes in a distributed system / network view the order of messages / operations

Strict Consistency
Strongest consistency model we'll consider
– Any read on a data item x returns the value corresponding to the result of the most recent write on x
Need an absolute global time
– "Most recent" needs to be unambiguous
– Corresponds to when the operation was issued
– Impossible to implement in practice on multiprocessors
[Timing diagrams: Host 1 writes W(x,a) while Host 2 reads x. In the first execution the read returns a; in the second, a read issued after the write still returns the old value 0 before a later read returns a, violating strict consistency.]

Sequential Consistency
Definition: All (read and write) operations on the data store were executed in some sequential order, and the operations of each individual process appear in this sequence
Definition: When processes are running concurrently:
– Interleaving of read and write operations is acceptable, but all processes see the same interleaving of operations
Difference from strict consistency
– No reference to the most recent time
– Absolute global time does not play a role

Implementing Sequential Consistency
Nodes use vector clocks to determine whether two events have a distinct happens-before relationship
– If timestamp(a) < timestamp(b), then a → b
If ops are concurrent (∃ i,j such that a[i] < b[i] and a[j] > b[j])
– Hosts can order ops a, b arbitrarily, but consistently
Example orderings of OP1–OP4:
– Valid: Host 1: OP 1, 2, 3, 4 / Host 2: OP 1, 2, 3, 4
– Valid: Host 1: OP 1, 3, 2, 4 / Host 2: OP 1, 3, 2, 4
– Invalid: Host 1: OP 1, 2, 3, 4 / Host 2: OP 1, 3, 2, 4
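As a concrete illustration of the vector-clock test above, here is a minimal Python sketch (not from the lecture; the dict-based timestamps and function names are illustrative):

# Vector-clock comparison: timestamps are dicts mapping host id -> counter;
# missing entries count as 0.

def happens_before(a, b):
    """True iff timestamp a < timestamp b component-wise (a -> b)."""
    hosts = set(a) | set(b)
    leq = all(a.get(h, 0) <= b.get(h, 0) for h in hosts)
    strictly_less = any(a.get(h, 0) < b.get(h, 0) for h in hosts)
    return leq and strictly_less

def concurrent(a, b):
    """True iff neither a -> b nor b -> a, i.e. exists i,j with a[i] < b[i] and a[j] > b[j]."""
    return not happens_before(a, b) and not happens_before(b, a) and a != b

# Example: OP2 and OP3 issued concurrently on different hosts
op2 = {"host1": 2, "host2": 0}
op3 = {"host1": 1, "host2": 1}
assert concurrent(op2, op3)                 # hosts may order them either way, but consistently
assert happens_before({"host1": 1}, op2)    # OP1 -> OP2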

Examples: Sequential Consistency?
[Timing diagrams: Hosts 1 and 2 write W(x,a) and W(x,b) concurrently while Hosts 3 and 4 read x. In the first execution both readers observe b then a, so it is sequentially consistent; in the second the readers observe the writes in different orders, so it is not sequentially consistent (but it is valid causal consistency).]
Sequential consistency is what allows databases to reorder "isolated" (i.e., non-causal) queries
But all DB replicas see the same trace, a.k.a. a "serialization"

Strong Consistency / Linearizability
Strict > Linearizability > Sequential
All operations (OP = read, write) receive a global timestamp using a synchronized clock sometime during their execution
Linearizability:
– Requirements for sequential consistency, plus
– If ts_OP1(x) < ts_OP2(y), then OP1(x) should precede OP2(y) in the sequence
– "Real-time requirement": an operation "appears" as if it showed up everywhere at the same time
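As a rough illustration of the slide's timestamp condition, a tiny Python sketch (illustrative only; it uses the slide's simplified point timestamps, and the operation names and function are made up):

# Check the extra "real-time" rule linearizability adds on top of sequential
# consistency: the agreed sequence must be ordered by the ops' global timestamps.

def respects_real_time(sequence, ts):
    """True iff ops appear in the sequence in nondecreasing timestamp order."""
    return all(ts[a] <= ts[b] for a, b in zip(sequence, sequence[1:]))

ts = {"W(x,a)": 1, "R(x)": 2}
print(respects_real_time(["W(x,a)", "R(x)"], ts))  # True: the later read must see a
print(respects_real_time(["R(x)", "W(x,a)"], ts))  # False: violates the real-time rule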

Linearizability
[Diagram: Client 1 issues W(x,a) to Server 1; the write is acked only when "committed". Before the commit point a read may still return 0; once committed, the write appears everywhere, so a subsequent read at either server returns a.]

Implications of Linearizability
[Diagram: Client 1 writes x (a wall post) and, after the ack, sends an out-of-band message to Client 2: "Check out my wall post x". Client 2's subsequent read, even at a different server, must return the committed value a.]

Implementing Linearizability
[Diagram: Client's W(x,a) is replicated from Server 1 to Server 2 before the ack when "committed".]
If an OP must appear everywhere after some time (the conceptual "timestamp" requirement), then "all" locations must locally commit the op before the server acknowledges the op as committed
Implication: linearizability and "low" latency are mutually exclusive
– e.g., might involve wide-area writes
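To make the "commit everywhere before acking" rule concrete, here is a minimal Python sketch of a primary that acknowledges a write only after every replica has applied it (illustrative only; the Replica/Primary classes and their methods are hypothetical, and failures are ignored):

# Linearizable-style write path: the primary acks only after all replicas have
# locally committed the operation. Names are illustrative.

class Replica:
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value        # local commit
        return True                    # ack back to the primary

class Primary:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # The write is only "committed" (and acked to the client) once every
        # replica has applied it -- this is why linearizability costs latency.
        acks = [r.apply(key, value) for r in self.replicas]
        if all(acks):
            return "ACK"               # the client now knows the write is visible everywhere
        raise RuntimeError("write not committed at all replicas")

replicas = [Replica(), Replica()]
primary = Primary(replicas)
print(primary.write("x", "a"))            # ACK
print([r.store["x"] for r in replicas])   # ['a', 'a'] -- any replica can now serve a=R(x)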

Implementing Linearizability
[Diagram: two writes, W(x,a) and W(x,b), arrive concurrently at different servers.]
The algorithm is not quite as simple as just copying to the other server before replying with an ACK:
– Recall that all servers must agree on the ordering
– Both see either a → b or b → a, but not mixed
– Both a and b appear everywhere as soon as they are committed

Consistency + Availability

Data replication with linearizability
Master replica model
– All ops (& ordering) happen at a single master node
– Master replicates data to secondaries
Multi-master model
– Read/write anywhere
– Replicas order and replicate content before returning

Single-master: Two-phase commit
Analogies to the two phases (prepare, then commit):
– Marriage ceremony: "Do you? I do. Do you? I do." → "I now pronounce…"
– Theater: "Ready on the set? Ready!" → "Action!"
– Contract law: Offer, signature → Deal / lawsuit
– 2PC: Prepare → Commit

Two-phase commit (2PC) protocol
[Message flow: the client sends WRITE to the leader. The leader sends PREPARE to the acceptors (replicas); each replies READY. Once all are prepared, the leader sends COMMIT; each replica replies ACK. Once all have acked, the leader returns ACK to the client.]
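A minimal, failure-free Python sketch of this message flow (illustrative only; the Acceptor class and method names are made up, and a real leader would also log its decision durably and handle timeouts and aborts):

# Two-phase commit: PREPARE/READY, then COMMIT/ACK. Single-process for clarity.

class Acceptor:
    def __init__(self, name):
        self.name = name
        self.staged = None
        self.committed = {}

    def prepare(self, op):
        self.staged = op               # stage the write, promise to commit it
        return "READY"

    def commit(self):
        key, value = self.staged
        self.committed[key] = value    # make the staged write durable
        self.staged = None
        return "ACK"

def two_phase_commit(acceptors, op):
    # Phase 1: every acceptor must answer READY
    if not all(a.prepare(op) == "READY" for a in acceptors):
        return "ABORT"
    # Phase 2: tell everyone to commit, wait for all ACKs
    if all(a.commit() == "ACK" for a in acceptors):
        return "ACK"                   # reply to the client
    return "ABORT"

acceptors = [Acceptor("A"), Acceptor("B"), Acceptor("C")]
print(two_phase_commit(acceptors, ("x", "a")))   # ACK
print([a.committed for a in acceptors])          # every replica now stores x = a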

What about failures?
If one or more acceptors (≤ F) fail:
– Can still ensure linearizability if |R| + |W| > N + F
– "Read" and "write" quorums of acceptors overlap in at least 1 non-failed node
If the leader fails?
– Lose availability: the system is no longer "live"
Pick a new leader?
– Need to make sure everybody agrees on the leader!
– Need to make sure that the "group" is known
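A quick numeric check of the quorum condition above, as a sketch with made-up parameters:

# |R| + |W| > N + F guarantees that any read quorum and any write quorum still
# intersect in at least one non-failed acceptor.

def quorums_overlap_despite_failures(N, R, W, F):
    return R + W > N + F

# N = 5 acceptors, read quorum 3, write quorum 4: tolerates F = 1 failure
print(quorums_overlap_despite_failures(N=5, R=3, W=4, F=1))  # True (7 > 6)
# The same quorums are not enough if 2 acceptors may fail
print(quorums_overlap_despite_failures(N=5, R=3, W=4, F=2))  # False (7 > 7 fails)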

Consensus / Agreement Problem
Goal: N processes want to agree on a value
Desired properties:
– Correctness (safety): All N nodes agree on the same value, and the agreed value was proposed by some node
– Fault tolerance: If ≤ F faults occur in a window, consensus is reached eventually; liveness is not guaranteed if > F failures
Given a goal of tolerating F faults, what is N?
– "Crash" faults need 2F+1 processes
– "Malicious" faults (called Byzantine) need 3F+1 processes
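A tiny sketch of the sizing rules above (illustrative only; the function name is made up):

# How many processes are needed to tolerate F faults, per the rules above.

def required_nodes(F, byzantine=False):
    return 3 * F + 1 if byzantine else 2 * F + 1

print(required_nodes(1))                  # 3 nodes tolerate 1 crash fault
print(required_nodes(1, byzantine=True))  # 4 nodes tolerate 1 Byzantine fault
print(required_nodes(2))                  # 5 nodes tolerate 2 crash faults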

Paxos Algorithm
Setup
– Each node runs a proposer (leader), an acceptor, and a learner
Basic approach
– One or more nodes decide to act as leader
– Leader proposes a value, solicits acceptance from acceptors
– Leader announces the chosen value to learners

Why is agreement hard? (Don't we learn that in kindergarten?)
– What if >1 nodes think they're leaders simultaneously?
– What if there is a network partition?
– What if a leader crashes in the middle of solicitation?
– What if a leader crashes after deciding but before broadcasting the commit?
– What if the new leader proposes different values than an already committed value?

Strawman solutions
Designate a single node X as acceptor
– Each proposer sends its value to X
– X decides on one of the values, announces it to all learners
– Problem! Failure of the acceptor halts the decision ⇒ need multiple acceptors
Each proposer (leader) proposes to all acceptors
– Each acceptor accepts the first proposal received, rejects the rest
– If a leader receives ACKs from a majority, it chooses its value
  (There is at most 1 majority, hence a single value is chosen)
– Leader sends the chosen value to all learners
– Problems! With multiple simultaneous proposals, there may be no majority
  What if the winning leader dies before sending the chosen value?

Paxos' solution
Each acceptor must be able to accept multiple proposals
Order proposals by proposal #
– If a proposal with value v is chosen, all higher proposals will also have value v
Each node maintains:
– t_a, v_a: highest proposal # accepted and its corresponding accepted value
– t_max: highest proposal # seen
– t_my: my proposal # in the current Paxos round

Paxos (Three phases)
Phase 1 (Prepare)
– A node decides to become leader
– Chooses t_my > t_max
– Sends <prepare, t_my> to all nodes
– Acceptor, upon receiving <prepare, t>:
  If t < t_max, reply <prepare-reject>
  Else t_max = t; reply <prepare-ok, t_a, v_a>
Phase 2 (Accept)
– If the leader gets <prepare-ok, t_a, v_a> from a majority:
  If v_a == null, the leader picks v_my; else v_my = v_a
  Send <accept, t_my, v_my> to all nodes
– If the leader fails to get a majority: delay, restart
– Acceptor, upon receiving <accept, t, v>:
  If t < t_max, reply with <accept-reject>
  Else t_a = t; v_a = v; t_max = t; reply with <accept-ok>
Phase 3 (Decide)
– If the leader gets <accept-ok> from a majority:
  Send <decide, v_a> to all nodes
– If the leader fails to get <accept-ok> from a majority: delay and restart
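Below is a minimal, failure-free Python sketch of the three phases for a single Paxos round (illustrative only; the Acceptor class, run_leader function, and tuple-encoded messages are assumptions, not the lecture's code). The key safety property it demonstrates: a second leader running after a value is chosen adopts that value rather than its own.

class Acceptor:
    def __init__(self):
        self.t_max = 0          # highest proposal # seen
        self.t_a = None         # highest proposal # accepted
        self.v_a = None         # value accepted with t_a

    def on_prepare(self, t):
        if t < self.t_max:
            return ("prepare-reject",)
        self.t_max = t
        return ("prepare-ok", self.t_a, self.v_a)

    def on_accept(self, t, v):
        if t < self.t_max:
            return ("accept-reject",)
        self.t_a, self.v_a, self.t_max = t, v, t
        return ("accept-ok",)

def run_leader(acceptors, t_my, v_my):
    majority = len(acceptors) // 2 + 1

    # Phase 1: prepare
    oks = [r for r in (a.on_prepare(t_my) for a in acceptors) if r[0] == "prepare-ok"]
    if len(oks) < majority:
        return None                                  # delay, restart with a higher t_my
    # Adopt the value of the highest-numbered accepted proposal, if any
    accepted = [(t, v) for _, t, v in oks if v is not None]
    if accepted:
        v_my = max(accepted)[1]

    # Phase 2: accept
    acks = [a.on_accept(t_my, v_my) for a in acceptors]
    if sum(r[0] == "accept-ok" for r in acks) < majority:
        return None                                  # delay, restart
    # Phase 3: decide -- the leader would now send <decide, v_my> to all learners
    return v_my

acceptors = [Acceptor(), Acceptor(), Acceptor()]
print(run_leader(acceptors, t_my=1, v_my="val1"))    # val1 is chosen
print(run_leader(acceptors, t_my=2, v_my="val2"))    # still val1: later proposals keep the chosen value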

Paxos operation: an example
[Message diagram, reconstructed as steps:]
– Initially: Node 0 has t_max = N0:0, Node 1 has t_max = N1:0, Node 2 has t_max = N2:0; all have t_a = v_a = null
– Node 1 (the leader) sends <Prepare, N1:1> to Nodes 0 and 2
– Nodes 0 and 2 set t_max = N1:1 and reply <ok, t_a = null, v_a = null>
– Node 1 sends <Accept, N1:1, val1>; the acceptors set t_a = N1:1, v_a = val1 and reply <ok>
– Node 1 sends <Decide, val1>

Combining Paxos and 2PC
Use Paxos for view change
– If anybody notices the current master unavailable, or one or more replicas unavailable
– Propose a view change: Paxos to establish the new group; the value agreed upon is the new view (master + set of replicas)
Use 2PC for actual data
– Writes go to the master for two-phase commit
– Reads go to acceptors and/or the master
Note: no liveness if we can't communicate with a majority of nodes from the previous view

CAP Conjecture
Systems can have two of:
– C: Strong consistency
– A: Availability
– P: Tolerance to network partitions
…but not all three
Two-phase commit: CA; Paxos: CP; Eventual consistency: AP