1 Detour: Distributed Systems Techniques
Paxos overview (based on Lampson’s talk)
Google: Paxos Made Live (only briefly)
ZooKeeper: a wait-free coordination system by Yahoo!
CSci8211: Distributed Systems: Paxos & ZooKeeper

2 Paxos: Basic Ideas

3 Paxos: Agent States & Invariants

4 Paxos: Leaders

5 Paxos Algorithm

6 Paxos Algorithm in Plain English
Phase 1 (prepare): A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors. If an acceptor receives a prepare request with a number n greater than that of any prepare request it has already seen, it responds YES to that request with a promise not to accept any more proposals numbered less than n, and includes the highest-numbered proposal (if any) that it has accepted.
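
To make Phase 1 concrete, here is a minimal Python sketch of the acceptor side; the state names (promised_n, accepted_n, accepted_v) are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of an acceptor's Phase 1 (prepare) handling.
# State names are illustrative assumptions, not from the slides.

class Acceptor:
    def __init__(self):
        self.promised_n = -1    # highest prepare number this acceptor has promised
        self.accepted_n = -1    # number of the highest proposal accepted so far
        self.accepted_v = None  # value of that proposal, if any

    def on_prepare(self, n):
        """Respond YES iff n is greater than any prepare number seen so far."""
        if n > self.promised_n:
            self.promised_n = n
            # Promise not to accept proposals numbered < n, and report the
            # highest-numbered proposal (if any) accepted so far.
            return ("YES", self.accepted_n, self.accepted_v)
        return ("NO", self.promised_n, None)
```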

7 Paxos Algorithm in Plain English …
Phase 2 (accept): If the proposer receives YES responses to its prepare request from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses (or any value of the proposer’s choosing if the responses reported no accepted proposal). If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
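
Continuing the same illustrative sketch (same assumed names, not the paper’s code), Phase 2 on the acceptor side plus the proposer’s value-selection rule might look like this:

```python
# Phase 2 (accept) on the acceptor, continuing the Acceptor sketch above,
# plus the proposer-side value-selection rule. Still an illustrative sketch.

class AcceptorWithPhase2(Acceptor):
    def on_accept(self, n, v):
        """Accept (n, v) unless a prepare with a number > n was already promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return "ACCEPTED"
        return "REJECTED"

def choose_value(phase1_responses, my_value):
    """Pick v for Phase 2: the value of the highest-numbered proposal reported
    in the YES responses, or the proposer's own value if none was reported."""
    reported = [(n, v) for (ok, n, v) in phase1_responses if ok == "YES" and n >= 0]
    if reported:
        return max(reported, key=lambda nv: nv[0])[1]
    return my_value
```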

8 Paxos’s Properties (Invariants)
P1: Any proposal number is unique.
P2: Any two sets (majorities) of acceptors have at least one acceptor in common.
P3: The value sent out in Phase 2 is the value of the highest-numbered proposal among all the responses in Phase 1.
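
P2 holds automatically when the acceptor sets are majorities; P1 must be enforced by construction. One common scheme (an assumption here, not stated on the slide) interleaves a per-proposer round counter with a unique server id:

```python
# One common way to guarantee unique, totally ordered proposal numbers (P1):
# interleave a per-proposer round counter with the proposer's unique server id.
# num_servers and server_id are assumed to be known configuration values.

def next_proposal_number(current_round, server_id, num_servers):
    new_round = current_round + 1
    return new_round * num_servers + server_id

# Example: with 5 servers, server 2's successive numbers are 7, 12, 17, ...
# and can never collide with another server's numbers.
```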

9 The Paxos Atomic Broadcast Algorithm
Leader-based: each process has an estimate of who the current leader is
To order an operation, a process sends it to its current leader
The leader sequences the operation and launches a consensus algorithm (Synod) to fix the agreement
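
As a rough Python sketch of this ordering path (the class and the run_synod helper are hypothetical placeholders, not from the slides):

```python
# Illustrative sketch: a process forwards an operation to its current leader
# estimate; the leader assigns it the next log slot and runs one consensus
# (Synod) instance for that slot. run_synod is a hypothetical placeholder.

class Replica:
    def __init__(self, my_id, leader_estimate, send):
        self.my_id = my_id
        self.leader_estimate = leader_estimate  # this replica's guess of the leader
        self.send = send                        # send(dest_id, message)
        self.next_slot = 0                      # used only while acting as leader

    def submit(self, op):
        """Client-facing entry point: route the operation to the current leader."""
        self.send(self.leader_estimate, ("order", op))

    def on_order(self, op):
        """Leader side: sequence the operation, then fix agreement for that slot."""
        slot = self.next_slot
        self.next_slot += 1
        run_synod(slot, op)  # one Paxos (Synod) instance per slot
```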

10 Failure-Free Message Flow
[Figure: failure-free message flow between client C and servers S1…Sn: Phase 1 (“prepare” request, “ack” response) followed by Phase 2 (“accept”).]

11 Message Flow: Take 2 w/ Optimization
[Figure: message flow between client C and servers S1…Sn with the optimization applied: Phase 1 (“prepare”/“ack”) and Phase 2 (“accept”) request/response.]

12 Highlights of Paxos Made Live
Implementing Paxos in a large, practical distributed system means having to consider many practical failure scenarios (e.g., disk failures) as well as efficiency issues, and to “prove” the implementation correct!
Key Features/Mechanisms:
Multi-Paxos: run multiple instances of Paxos to achieve consensus on a series of values, e.g., in a replicated log (sketched below)
Master & master leases
(Global) epoch numbers (to handle master crashes)
Group membership: handle dynamic changes in the # of servers
Snapshots to enable faster recovery (& catch-up)
Handling disk corruption: a replica w/ a corrupted disk rebuilds its log by participating as a non-voting member until it catches up
Good software engineering: runtime checking & testing, etc.
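
A very rough Python sketch of the Multi-Paxos / replicated-log idea, not Google’s implementation; the PaxosInstance objects and the skip_phase1 flag are assumed placeholders:

```python
# Multi-Paxos sketch: one Paxos instance per log slot. While the master holds a
# valid lease, it can skip Phase 1 and go straight to Phase 2 for new slots.
# PaxosInstance and its run(...) signature are illustrative placeholders;
# epoch numbers and lease renewal are omitted.

import time

class ReplicatedLog:
    def __init__(self, paxos_instance_factory):
        self.chosen = {}                   # slot -> agreed value
        self.new_instance = paxos_instance_factory
        self.master_lease_expiry = 0.0     # wall-clock expiry of the master lease

    def propose(self, value):
        slot = len(self.chosen)
        instance = self.new_instance(slot)
        has_lease = time.time() < self.master_lease_expiry
        self.chosen[slot] = instance.run(value, skip_phase1=has_lease)
        return slot, self.chosen[slot]
```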

13 Paxos Made Live: Architecture

14 Paxos Made Live: Client API

15 Highlights of ZooKeeper
ZooKeeper: a wait-free coordination service for processes of distributed applications
wait-free: asynchronous (no blocking) and no locking, with guaranteed FIFO client ordering and linearizable writes
provides a simple & high-performance kernel for building more complex primitives at the client, e.g., rendezvous, read/write locks, etc.
this is in contrast to Google’s Chubby (distributed lock) service, or Amazon’s Simple Queue Service, …
For target workloads (2:1 to 100:1 read/write ratios), it can handle 10^4 – 10^5 transactions per second
Key Ideas & Mechanisms:
a distributed-file-system-like hierarchical namespace to store data objects (“shared states”): a tree of znodes, but with simpler APIs for clients to coordinate processes

16 ZooKeeper Service Overview
server: a process providing the ZooKeeper service
client: a user of the ZooKeeper service
clients establish a session when they connect to ZooKeeper and obtain a handle through which to issue requests
znode: each is associated with a version #, and can be of two types:
  regular: created/deleted explicitly
  ephemeral: deleted explicitly, or automatically when the session that created it terminates
a znode may have a sequential flag: it is created with a monotonically increasing counter appended to its name
watch (on a znode): a one-time trigger, associated with a session, that notifies the client of a change to the znode (or its child subtree)
ZooKeeper’s hierarchical namespace forms a data tree
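
To make znode types and watches concrete, here is a small sketch using the third-party kazoo Python client (the slides do not prescribe a client library, and the ensemble address is an assumption):

```python
# Sketch using the kazoo Python client; the ensemble address is an assumption.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()  # establishes a session and returns once connected

# Regular znode: created and deleted explicitly.
zk.create("/app", b"config-v1", makepath=True)

# Ephemeral + sequential znode: removed automatically when the creating
# session terminates; the sequential flag appends a monotonically
# increasing counter to the name (e.g., /app/worker-0000000007).
zk.create("/app/worker-", b"", ephemeral=True, sequence=True)

# Watch: a one-time trigger fired when /app's children change.
def on_children_change(event):
    print("children of /app changed:", event)

zk.get_children("/app", watch=on_children_change)

zk.stop()
```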

17 ZooKeeper Client API
Each client runs a ZooKeeper client library:
exposes the ZooKeeper service interface through client APIs
manages the network connection (“session”) between the client & a server
ZooKeeper APIs: each API has both a synchronous and an asynchronous version
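
For example, with the same assumed kazoo client, the synchronous and asynchronous variants of a read look roughly like this:

```python
# Synchronous vs. asynchronous call styles (kazoo client, assumed setup as above).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Synchronous version: blocks until the server replies.
data, stat = zk.get("/app")

# Asynchronous version: returns immediately; requests from one client are
# still processed in FIFO order, and the result is collected when needed.
async_result = zk.get_async("/app")
data, stat = async_result.get()

zk.stop()
```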

18 ZooKeeper Primitive Examples
Configuration Management: e.g., two clients A & B share a configuration and can communicate directly with each other; A makes a change to the configuration & notifies B (but the two servers’ configuration replicas may be out of sync!)
Rendezvous
Group Membership
Simple Lock (with & without the herd effect; the herd-free version is sketched below)
Read/Write Locks
Double Barrier
Yahoo! and other services using ZooKeeper:
Fetch Service (“Yahoo crawler”)
Katta: a distributed indexer
Yahoo! Message Broker (YMB)
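
As one example, the simple lock recipe that avoids the herd effect can be sketched with the same assumed kazoo client (error and retry handling omitted; the lock path is illustrative):

```python
# Sketch of a simple lock that avoids the herd effect: each contender creates
# an ephemeral sequential znode and watches only its immediate predecessor.
import threading
from kazoo.client import KazooClient

def acquire_lock(zk, lock_dir="/lock"):
    zk.ensure_path(lock_dir)
    # Ephemeral so the lock is released if this session dies; sequential to order contenders.
    my_path = zk.create(lock_dir + "/lock-", b"", ephemeral=True, sequence=True)
    my_name = my_path.rsplit("/", 1)[-1]
    while True:
        children = sorted(zk.get_children(lock_dir))
        if my_name == children[0]:
            return my_path                       # lowest sequence number holds the lock
        predecessor = children[children.index(my_name) - 1]
        released = threading.Event()
        # Watch only the znode just before ours: no herd of waiters wakes up at once.
        if zk.exists(lock_dir + "/" + predecessor, watch=lambda event: released.set()):
            released.wait()

def release_lock(zk, my_path):
    zk.delete(my_path)
```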

19 ZooKeeper Implementation
converts writes into idempotent transactions
ensures linearizable writes
ensures client FIFO ordering via a pipelined architecture that allows multiple pending requests
each write is handled by a leader, which broadcasts the change to the other servers via Zab, an atomic broadcast protocol
the server handling a client request relies on a simple majority quorum to decide on a proposal, and then delivers the state change to the client
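
A rough illustration of the idempotent-transaction idea (the tuple shapes and txn names here are assumptions for the sketch): the leader evaluates a conditional write against its current state and emits a transaction recording the resulting data and version, so re-applying the same transaction leaves the state unchanged.

```python
# Sketch: turn a conditional setData request into an idempotent transaction.
# Tuple shapes and txn names are assumptions for illustration.

def to_idempotent_txn(request, current_version):
    op, path, data, expected_version = request   # ("setData", path, data, expected_version)
    if expected_version != -1 and expected_version != current_version:
        return ("errorTXN", path)                # failed check still yields a deterministic txn
    return ("setDataTXN", path, data, current_version + 1)

def apply_txn(txn, state):
    """Applying the same txn twice has the same effect as applying it once."""
    if txn[0] == "setDataTXN":
        _, path, data, new_version = txn
        state[path] = (data, new_version)
    return state
```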

20 ZooKeeper and Zab
Zab: the atomic broadcast protocol used by ZooKeeper to ensure transaction integrity, primary-order (PO) causality, total order, and agreement (among replicated processes)
Leader (primary instance)-based: only the leader can abcast
Atomic 2-phase broadcast: abcast + abdeliver => transaction committed; otherwise the transaction is considered “aborted”

21 More on Zab
Zab atomic broadcast ensures primary-order causality: “causality” is defined only w.r.t. the primary instance
Zab also ensures strict causality (total ordering): if a process delivers two transactions, one must precede the other in the PO causality order
Zab assumes a separate leader election/selection process (with a leader selection oracle)
processes: a leader (starting with a new epoch #) and followers
Zab uses a 3-phase protocol with quorum (similar to Raft):
Phase 1 (Discovery): agree on a new epoch # and discover the history
Phase 2 (Synchronization): synchronize the history of all processes using a 2PC-like protocol; commit based on quorum
Phase 3 (Broadcast): commit a new transaction via a 2PC-like protocol; commit based on quorum (see the sketch below)
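
A minimal Python sketch of Phase 3 (Broadcast) only, with illustrative class and message names:

```python
# Sketch of Zab's broadcast phase: the leader tags each transaction with a
# zxid = (epoch, counter), followers ack proposals, and the leader commits a
# transaction once a majority quorum (counting itself) has acked it.
# Class and method names are illustrative assumptions.

class ZabLeader:
    def __init__(self, epoch, followers):
        self.epoch = epoch
        self.followers = followers
        self.counter = 0
        self.pending = {}                      # zxid -> set of follower ids that acked

    def abcast(self, txn):
        self.counter += 1
        zxid = (self.epoch, self.counter)      # totally ordered within the epoch
        self.pending[zxid] = set()
        for f in self.followers:
            f.on_proposal(zxid, txn, leader=self)
        return zxid

    def on_ack(self, zxid, follower_id):
        if zxid not in self.pending:
            return                             # already committed
        self.pending[zxid].add(follower_id)
        cluster_size = len(self.followers) + 1
        if len(self.pending[zxid]) + 1 > cluster_size // 2:   # +1 counts the leader itself
            del self.pending[zxid]
            for f in self.followers:
                f.on_commit(zxid)              # abdeliver => transaction committed
```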

22 PO Causality & Strict Causality
(a) In PO causality order, but not in “causal order”
(b) In PO causality order, but not in “strict causality” order

