IS 651: Distributed Systems Consensus


1 IS 651: Distributed Systems Consensus
Sisi Duan Assistant Professor Information Systems

2 Recall
Consistency challenges: replication, concurrency, machine failures, network failures
Replication: state machine replication, primary-backup replication

3 Recall: Primary-Backup Replication
Deal with backup failures
Deal with primary failures

4 Recall: Viewstamped Replication
Key to correctness:
View number → a deterministic node serves as the primary
No "committed" requests will be lost
Quorum: a majority of nodes, if we consider benign failures

5 Recall
Committed: a request is committed once a majority of nodes have said yes
Backups monitor the primary

6 Exercise #1

7 Exercise #2

8 Today
The crown jewel of distributed systems problems, a.k.a. consensus
We have covered a lot about why we need consensus:
Data replication – consistency
Distributed storage

9 Consensus
Despite having different views of the world, all nodes in a distributed system must act in concert, e.g.:
All replicas that store the same object O must apply all updates to O in the same order (consistency)
All nodes involved in a transaction must either commit or abort their portion of the transaction (atomicity)
All that, despite FAILURES:
Nodes can restart, die, or be slow
Networks can be slow; for simplicity, we assume the network is reliable: messages are eventually received

10 The Consensus/Agreement Problem
Some nodes propose something to other nodes
All nodes must decide whether to accept or reject those values
Examples of "something":
A value
Whether or not to commit a transaction to a database
Whether to execute a client request
Who holds the lock in a distributed lock service, among multiple clients that request it almost simultaneously

11 Timing Assumptions: Reminder
Synchronous system: known bound on processing delays and transmission delays
Asynchronous system: no assumption on the delays
Partially synchronous system: there is such a bound, but its value is unknown to the nodes

12 Consensus Correctness in General
Safety (consistency):
All nodes agree on the same value
The agreed value X has been proposed by some node
Liveness (timing, performance):
All nodes eventually agree on a value
If fewer than some fraction of the nodes crash, the rest should still reach agreement
The replicated system then behaves just like a centralized one!

13 Many forms of consensus/agreement
Today: transactions (e.g., distributed databases)
Applications perform multi-operation, multi-object data access:
Transfer money from one account to another
Insert Alice into Bob's friend list and insert Bob into Alice's friend list
What if:
Failures occur in the middle of writing objects?
Concurrent operations race with each other?

14 ACID transactions
A (Atomicity): all-or-nothing w.r.t. failures
C (Consistency): transactions maintain any internal storage state invariants
I (Isolation): concurrently executing transactions do not interfere
D (Durability): effects of transactions survive failures

15 ACID Example
T1: Transfer $100 from A to B
X = read(A)
Y = read(B)
Write(A, x-100)
Write(B, y+100)
A: T1 fully completes or leaves nothing
D: once T1 commits, T1's writes are not lost
I: no races, as if T1 happens either before or after T2
C: preserves invariants, e.g., account balance ≥ 0
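The T1 transfer above can be sketched in a few lines. This is an illustration only, assuming an in-memory dict as the "database"; the function name and balances are made up for the example.

```python
# Sketch of T1: transfer `amount` from src to dst, preserving the
# invariant that balances never go negative (the "C" in ACID).
def transfer(db, src, dst, amount):
    x = db[src]               # X = read(A)
    y = db[dst]               # Y = read(B)
    if x - amount < 0:        # preserve invariant: balance >= 0
        raise ValueError("insufficient funds")
    db[src] = x - amount      # Write(A, x-100)
    db[dst] = y + amount      # Write(B, y+100)

db = {"A": 500, "B": 200}
transfer(db, "A", "B", 100)
print(db)  # {'A': 400, 'B': 300}
```

Note this sketch gives none of the ACID guarantees under failures or concurrency; that is exactly what the WAL and locking machinery on the following slides adds.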

16 Solution: Write-Ahead Logging (WAL)
All state modifications must be written to the log before they are applied
Simplest WAL: REDO logging
Only stores REDO information in log entries
Transactions buffer writes during execution
This requirement is easy to satisfy now, but not in the 80s or 90s, when memory capacity was very low
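A minimal sketch of REDO logging as described above: writes are buffered during execution, and at commit the REDO records reach the log before the store is touched. The class and record shapes are assumptions of this sketch; a real system would force the log to stable storage.

```python
# REDO-logging sketch: buffer writes, log them first, then apply.
class RedoTxn:
    def __init__(self, store):
        self.store = store
        self.buffer = {}              # writes buffered during execution

    def write(self, key, value):
        self.buffer[key] = value      # nothing touches the store yet

    def commit(self, log):
        # Write-ahead: append REDO records to the log first...
        log.append(("REDO", dict(self.buffer)))
        log.append(("COMMIT",))
        # ...and only then apply the buffered writes to the store.
        self.store.update(self.buffer)

store, log = {"A": 500, "B": 200}, []
t = RedoTxn(store)
t.write("A", 400)
t.write("B", 300)
t.commit(log)
print(store)  # {'A': 400, 'B': 300}
```

On recovery, replaying the REDO records of every logged COMMIT reproduces the committed state, which is why the log must be written before the writes are applied.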

17 The Concurrency Control Challenge
T1: Transfer $100 from A to B
T2: Transfer $100 from A to C
A: T1 fully completes or leaves nothing
D: once T1 commits, T1's writes are not lost
I: no races, as if T1 happens either before or after T2
C: preserves invariants, e.g., account balance ≥ 0
Remember the differences between linearizability and sequential consistency?

18 Problem of interleaving

19 Ideal isolation semantics: serializability
Execution of a set of transactions is equivalent to some serial order Two executions are equivalent if they have the same effect on database and produce same output

Conflict serializability
An execution schedule is the ordering of read/write/commit/abort operations
T1: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
T2: X = read(A); Y = read(B); Print(x+y); commit
A (serial) schedule: R(A),R(B),W(A),W(B),C, R(A),R(B),C

Conflict serializability
Two schedules are equivalent if they:
Contain the same operations
Order conflicting operations the same way
Two ops conflict if they access the same data item and at least one is a write
A schedule is serializable if it is equivalent to some serial schedule
Strict (order-preserving) serializability:
If T finishes before T' starts, T must be ordered before T' in the equivalent serial schedule
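The definition above can be checked mechanically with a precedence graph: draw an edge T→T' whenever an operation of T conflicts with a later operation of T', and the schedule is conflict-serializable iff the graph is acyclic. The schedule encoding below (lists of (txn, op, item) tuples) is an assumption of this sketch.

```python
# Conflict-serializability check via a precedence graph + cycle detection.
def conflict_serializable(schedule):
    # Build edges t1 -> t2 for each conflicting pair (same item, >=1 write).
    edges = {}
    for i, (t1, op1, x1) in enumerate(schedule):
        for (t2, op2, x2) in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges.setdefault(t1, set()).add(t2)
    # Serializable iff the graph is acyclic (DFS with gray/black coloring).
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            c = color.get(m, WHITE)
            if c == GRAY or (c == WHITE and not dfs(m)):
                return False          # back edge => cycle
        color[n] = BLACK
        return True
    nodes = {t for t, _, _ in schedule}
    return all(dfs(n) for n in nodes if color.get(n, WHITE) == WHITE)

# Serial schedule (T1 then T2): serializable.
s1 = [("T1","R","A"),("T1","R","B"),("T1","W","A"),("T1","W","B"),
      ("T2","R","A"),("T2","R","B")]
print(conflict_serializable(s1))  # True
# T2 reads A after T1's write but reads B before it: cycle, not serializable.
s2 = [("T1","R","A"),("T1","R","B"),("T1","W","A"),
      ("T2","R","A"),("T2","R","B"),("T1","W","B")]
print(conflict_serializable(s2))  # False
```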

22 Serializability Example
T1: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
T2: X = read(A); Y = read(B); Print(x+y); commit
Serializable? R(A),R(B), R(A),R(B),C, W(A),W(B),C
Yes – equivalent to the serial schedule R(A),R(B),C, R(A),R(B),W(A),W(B),C

23 Serializability Example
T1: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
T2: X = read(A); Y = read(B); Print(x+y); commit
Serializable? R(A),R(B), W(A), R(A),R(B),C, W(B),C
(No: T2 reads A after T1's write but reads B before it, so no serial order is equivalent)

24 Realize a serializable schedule
Locking-based approach (remember mutual exclusion?)
Solution 1: grab a global lock before the transaction starts; release it after the transaction commits
Solution 2: grab short-term fine-grained locks on an item before access:
Lock(A) Read(A) Unlock(A) Lock(B) Write(B) Unlock(B) …
Discussion: which one is better? What are the potential issues?

25 Solution 2 Problem
T1: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
T2: X = read(A); Y = read(B); Print(x+y); commit
Locks on writes should be held until the end of the transaction
Otherwise we may read an uncommitted value
Serializable? R(A),R(B), W(A), R(A),R(B),C, W(B),C

26 A better solution
Solution 3: fine-grained locks
Long-term locks for writes: grab the lock before the write, release it after the transaction commits/aborts
Short-term locks for reads

27 Solution 3 Problem
T1: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
T2: X = read(A); Y = read(B); Print(x+y); commit
Long-term locks for writes, short-term locks for reads?
R(A),R(B), W(A), R(A),R(B),C, W(B),C
Read locks must be held until commit time
Otherwise: non-repeatable reads, e.g., R(A), R(A),R(B), W(A), W(B),C, R(B),C

28 Realize a serializable schedule
Two-phase locking (2PL):
A growing phase, in which the transaction acquires locks
A shrinking phase, in which locks are released
In practice:
The growing phase is the entire transaction
The shrinking phase is at commit time
Optimization: use read/write locks instead of exclusive locks
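A toy sketch of the practical 2PL variant above: all locks are acquired during execution (the growing phase) and everything is released only at commit (the shrinking phase). It is single-threaded and uses exclusive locks only; the class and balances are assumptions of this sketch, not code from the slides.

```python
# Strict-2PL sketch: acquire during the transaction, release all at commit.
class Txn:
    def __init__(self, name, locks):
        self.name, self.locks, self.writes = name, locks, {}

    def lock(self, item):
        holder = self.locks.get(item)
        # In a real system the transaction would block here instead.
        assert holder in (None, self.name), f"{item} held by {holder}"
        self.locks[item] = self.name          # growing phase

    def commit(self, store):
        store.update(self.writes)             # apply buffered writes
        for item in [k for k, v in self.locks.items() if v == self.name]:
            del self.locks[item]              # shrinking phase: release all

locks, store = {}, {"A": 500, "B": 200}
t1 = Txn("T1", locks)
t1.lock("A"); t1.lock("B")                    # locks precede every access
t1.writes = {"A": store["A"] - 100, "B": store["B"] + 100}
t1.commit(store)
print(store, locks)  # {'A': 400, 'B': 300} {}
```

Because no lock is released before commit, a concurrent reader like T2 can never observe A after the debit but B before the credit.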

29 2PL: An Example
T1:
Rlock(A); X = read(A)
Rlock(B); Y = read(B)
Wlock(A); buffer A = x-100
Wlock(B); buffer B = y+100
log(A=0, B=200)
Write(A, 0); Unlock(A)
Write(B, 200); Unlock(B)
T2:
Rlock(A); X = read(A)
Rlock(B); Y = read(B)
Print(x+y)
Unlock(A); Unlock(B)

30 How to relate this to distributed databases/storage?
Storage is sharded across multiple machines
Different machines store different subsets of the data

31 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Client sends a request to the coordinator

32 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Client sends a request to the coordinator
Coordinator locks the items and the machines apply the operations
Discussion: what could go wrong?

33 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Client sends a request to the coordinator
Coordinator locks the items and the machines apply the operations
Discussion: what could go wrong?
A does not have enough money
B crashes
Coordinator crashes

34 What we need
TC (the coordinator), A, and B need to agree on whether they should execute the operations
Safety:
If one commits, no one aborts
If one aborts, no one commits
Liveness:
If there are no failures and A and B can commit, the action commits
If there are failures, reach a conclusion ASAP

35 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Client sends a request to the coordinator

36 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Client sends a request to the coordinator
Coordinator sends a PREPARE message

37 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Client sends a request to the coordinator
Coordinator sends a PREPARE message
A and B reply YES or NO
If A does not have enough balance, it replies NO

38 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Client sends a request to the coordinator
Coordinator sends a PREPARE message
A and B reply YES or NO
Coordinator sends a COMMIT or ABORT message:
COMMIT if both say YES
ABORT if either says NO

39 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Client sends a request to the coordinator
Coordinator sends a PREPARE message
A and B reply YES or NO
Coordinator sends a COMMIT or ABORT message:
COMMIT if both say YES
ABORT if either says NO
Coordinator replies to the client
A and B commit on receipt of the COMMIT message
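The full 2PC flow can be condensed into a single-process sketch. Timeouts, logging, and crash recovery are deliberately omitted; the `Account` participant and its vote rule (reject if the balance would go negative) are assumptions of this example.

```python
# Two-phase commit sketch: PREPARE collects votes, then the decision
# (COMMIT only if every participant voted YES) is broadcast.
def two_phase_commit(participants):
    # Phase 1: PREPARE -> collect YES/NO votes.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    # Phase 2: broadcast COMMIT or ABORT.
    for p in participants:
        p.finish(decision)
    return decision

class Account:
    def __init__(self, balance, delta):
        self.balance, self.delta = balance, delta
    def prepare(self):
        # Vote NO if applying delta would violate balance >= 0.
        return "YES" if self.balance + self.delta >= 0 else "NO"
    def finish(self, decision):
        if decision == "COMMIT":
            self.balance += self.delta

a, b = Account(500, -100), Account(200, +100)
print(two_phase_commit([a, b]))    # COMMIT
print(a.balance, b.balance)        # 400 300

a2, b2 = Account(50, -100), Account(200, +100)
print(two_phase_commit([a2, b2]))  # ABORT
print(a2.balance, b2.balance)      # 50 200
```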

40 2PC

41 Why 2PC is correct
Neither can commit unless both agree to commit
Performance?
Timeout: if an expected message is not received…
Reboot: a node crashed, is rebooting, and must clean up

42 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Coordinator waits for YES or NO:
If it hasn't heard back from A or B in time, it can abort
It might be a network problem, but there are no safety issues
A and B wait for COMMIT or ABORT from the coordinator:
If a participant voted NO, it can abort directly
Discussion: if it voted YES, can it safely commit?

43 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
A and B wait for COMMIT or ABORT from the coordinator
If a participant voted NO, it can abort directly
Discussion: if it voted YES, can it safely commit?
What if the coordinator fails after it sends only one COMMIT message?
A and B may wait forever
Discussion: how to solve the problem?

44 2PC
T: X = read(A); Y = read(B); Write(A, x+100); Write(B, y-100); commit
Suppose A and B can talk to each other
B → A: status? A can reply to B
Four cases:
No reply from A: no decision
A received COMMIT or ABORT from the coordinator: agree with the coordinator's decision
A hasn't voted yet or voted NO: both abort
A voted YES: both must wait for the coordinator

45 3PC
Three-Phase Commit (3PC)
Never blocks on node failures as 2PC does
Split the COMMIT/ABORT phase into two phases:
First communicate the outcome to everyone
Let them commit only after everyone knows the outcome
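The split described above gives 3PC its three-phase shape, which can be sketched as follows. The participant API (`can_commit`, `pre_commit`, `do_commit`) and states are assumptions of this sketch; timeout and recovery logic, which is where 3PC's real complexity lives, is omitted.

```python
# Three-phase commit sketch: the outcome is disseminated (PRE-COMMIT)
# before anyone is allowed to actually commit (DO-COMMIT).
def three_phase_commit(participants):
    # Phase 1: CAN-COMMIT? -> collect votes.
    if not all(p.can_commit() for p in participants):
        for p in participants:
            p.abort()
        return "ABORT"
    # Phase 2: PRE-COMMIT -> everyone learns the outcome first.
    for p in participants:
        p.pre_commit()
    # Phase 3: DO-COMMIT -> commit only after all know the outcome.
    for p in participants:
        p.do_commit()
    return "COMMIT"

class Node:
    def __init__(self, vote):
        self.vote, self.state = vote, "INIT"
    def can_commit(self): return self.vote
    def pre_commit(self): self.state = "PRE-COMMIT"
    def do_commit(self):  self.state = "COMMITTED"
    def abort(self):      self.state = "ABORTED"

nodes = [Node(True), Node(True)]
print(three_phase_commit(nodes))  # COMMIT
print([n.state for n in nodes])   # ['COMMITTED', 'COMMITTED']
```

The extra phase is what lets a recovery protocol decide safely after a crash: seeing any node in PRE-COMMIT proves everyone voted YES.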

46 3PC

47 3PC
If one of them has received PRE-COMMIT, they can all commit
We know for sure that both voted YES
If none of them has received PRE-COMMIT, they can all abort
Safe for node crashes (both coordinator and participants)

48 3PC
Safety and liveness?
Liveness (availability)? Yes
It doesn't block; it always makes progress by timing out
Safety (correctness)? No
Can you think of scenarios in which 3PC would result in inconsistent states between the nodes?
Two examples:
A hasn't crashed; it's just offline
The coordinator hasn't crashed; it's just offline

49 3PC with Network Partitions
Coordinator crashes after it sends PRE-COMMIT to A
A is partitioned later (or crashes and recovers later)
None of B, C, D have received PRE-COMMIT, so they will abort
A comes back and decides to commit…

50 Solution? Next week!

51 Reading List
Optional: Michael J. Franklin. Concurrency Control and Recovery. SIGMOD 1992.

