
1 Principles of Reliable Distributed Systems. Lecture 11: Disk Paxos, Quorum Systems, and Frangipani. Spring 2008, Prof. Idit Keidar

2 Today’s Material: Shared-memory Paxos from Sec. 5 of Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory, Abraham, Chockler, Keidar, & Malkhi, PODC 2004. Disk Paxos, Gafni & Lamport, DISC 2000. Frangipani: A Scalable Distributed File System, Thekkath, Mann, & Lee, SOSP 1997.

3 Reminder: Asynchronous R/W Shared Memory Model. Shared memory registers –Simple read/write (R/W) objects. Accessed by processes with ids 1,2,… All communication is through shared memory! Algorithms must be wait-free –Must tolerate any number of process (client) failures –Possible thanks to reliable shared memory

4 Consensus in Shared Memory. A shared object supporting a method decide(v_i), invoked by each process i and returning a value d_i, satisfying: –Agreement: for all i and j, d_i = d_j –Validity: d_i = v_j for some j –Termination: decide returns
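
To make the interface concrete, here is a minimal Python sketch of what such a consensus object looks like to its clients. It is a toy: it lives in one address space and hides agreement behind a lock, which is exactly the kind of blocking the wait-free algorithms below avoid; the class and method names are illustrative assumptions, not part of the lecture.

```python
import threading

class ConsensusObject:
    """Toy consensus object: decide(v) returns the same value at every caller
    (agreement), that value was proposed by some caller (validity), and every
    call returns (termination)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._decision = None
        self._decided = False

    def decide(self, v):
        with self._lock:
            if not self._decided:          # first proposal wins
                self._decision, self._decided = v, True
        return self._decision

c = ConsensusObject()
print(c.decide("red"), c.decide("blue"))   # both calls return "red"
```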

5 Solving Consensus in/with Shared Memory. Assume an asynchronous shared memory system with atomic R/W registers. Can we solve consensus? –Consensus is not solvable if even one process can fail: shared-memory version of [FLP], where write stands for send and read for receive. –Yes, if no process can fail –Yes, with eventual synchrony or Ω

6 Shared Memory (SM) Paxos. Consensus –In asynchronous shared memory –Using wait-free regular R/W registers –And Ω (why?) Wait-free –Any number of processes may fail (t < n), unlike in the message-passing model (why?) –Only the leader takes steps

7 Regular Registers. SM Paxos can use registers that provide weaker semantics than atomicity. SWMR regular register: a read returns –Either a value written by an overlapping write, or –The register’s value before the first write that overlaps the read

8 Regular versus Atomic. (Timeline figure: write(0) completes, then write(1) is in progress; a read overlapping write(1) returns 1, and a later overlapping read returns 0. A regular register may return 0 even though write(1) already happened at the earlier read, so the execution is not linearizable.)

9 Variables. Reminder: Paxos variables are –BallotNum, AcceptVal, AcceptNum. SM version uses shared SWMR regular registers: –x_i = ⟨bal, val, num, decision⟩ for each process i –Initially ⟨⟨0,0⟩, ⊥, ⟨0,0⟩, ⊥⟩ –Writeable by i, readable by all. Each process keeps local variables b, v, n –Initially ⟨⟨0,0⟩, ⊥, ⟨0,0⟩⟩

10 Reminder: Paxos Phase I. if leader (by Ω) then BallotNum ← choose new unique ballot; send (“prepare”, BallotNum) to all. Upon receive (“prepare”, bal) from i: if bal ≥ BallotNum then BallotNum ← bal; send (ack, bal, AcceptNum, AcceptVal) to i. Upon receive (ack, BallotNum, num, val) from n-t: if all vals = ⊥ then myVal ← initial value, else myVal ← received val with highest num. [The n-t responders must not have moved on.]

11 SM Paxos: Phase I. if leader (by Ω) then b ← choose new unique ballot; write ⟨b, v, n, ⊥⟩ to x_i; read all x_j’s; if some x_j.bal > b then start over; if all read x_j.val’s = ⊥ then v ← my initial value, else v ← read val with highest num. [Write is like sending to all. Read instead of waiting for acks. No ack: someone moved on! Only b changed in this phase.]

12 Phase I Summary. Classical Paxos: –Leader chooses new ballot, sends to all –Others ack if they did not move on to a later ballot –If leader cannot get a majority, try again –Otherwise, move to Phase 2. SM Paxos: –Leader chooses new ballot, writes its variable –Leader reads to check if anyone moved on to a later ballot –If anyone did move on, try again –Otherwise, move to Phase 2

13 Reminder: Paxos Phase II. send (“accept”, BallotNum, myVal) to all. Upon receive (“accept”, b, v) with b ≥ BallotNum: AcceptNum ← b; AcceptVal ← v; send (“accept”, b, v) to all (first time only). Upon receive (“accept”, b, v) from n-t: decide v; send (“decide”, v) to all. [Accept messages change AcceptNum and AcceptVal, but only if the receiver did not move on yet.]

14 SM Paxos: Phase II (Leader, Cont’d). n ← b; write ⟨b, v, n, ⊥⟩ to x_i; read all x_j’s; if some x_j.bal > b then start over; write ⟨b, v, n, v⟩ to x_i (decide); return v. [Writing is like sending “accept” to all; the read checks whether all would have accepted this proposal (when don’t they?). Only v, n changed in this phase.]

15 Why Read Twice? (Timeline figure: one leader performs read, write(b), write, read, overlapping a write(b’>b) by a competing leader; the write(b’) did not complete, and the read does not see b’.)

16 Adding the Non-Leader Code. while (true): if leader (by Ω) then [leader code from previous slides] else read x_ld, where ld is the leader; if x_ld.decision ≠ ⊥ then return x_ld.decision. [“start over” means go here]
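
Putting slides 9, 11, 14, and 16 together, here is a hedged Python sketch of SM Paxos. The Register class, the omega() and new_ballot() arguments, and the single-threaded demo are illustrative assumptions; in particular, a plain in-memory object only stands in for the SWMR regular registers the algorithm actually requires.

```python
import itertools

BOTTOM = None          # stands in for ⊥
ZERO = (0, 0)          # initial ballot ⟨0,0⟩

class Register:
    """Stand-in for a SWMR regular register x_i = (bal, val, num, decision)."""
    def __init__(self):
        self.value = (ZERO, BOTTOM, ZERO, BOTTOM)
    def read(self):
        return self.value
    def write(self, value):
        self.value = value

def sm_paxos(i, initial_value, x, omega, new_ballot):
    """Code of process i. x is the list of shared registers (process i writes
    only x[i]); omega() names the current leader; new_ballot() returns a fresh
    unique ballot for i, e.g. (counter, i)."""
    v, n = BOTTOM, ZERO
    while True:
        ld = omega()
        if ld != i:
            # Non-leader (slide 16): wait for the leader's decision.
            _, _, _, decision = x[ld].read()
            if decision is not BOTTOM:
                return decision
            continue
        # Phase I (slide 11): only the ballot changes.
        b = new_ballot()
        x[i].write((b, v, n, BOTTOM))
        views = [r.read() for r in x]
        if any(bal > b for bal, _, _, _ in views):
            continue                                # someone moved on: start over
        accepted = [(num, val) for _, val, num, _ in views if val is not BOTTOM]
        v = max(accepted, key=lambda t: t[0])[1] if accepted else initial_value
        # Phase II (slide 14): write the proposal, re-read, then decide.
        n = b
        x[i].write((b, v, n, BOTTOM))
        views = [r.read() for r in x]
        if any(bal > b for bal, _, _, _ in views):
            continue                                # start over
        x[i].write((b, v, n, v))                    # record the decision
        return v

# Single-threaded demo: process 0 is the only leader and decides its own value.
registers = [Register() for _ in range(3)]
ballots = itertools.count(1)
print(sm_paxos(0, "blue", registers, omega=lambda: 0,
               new_ballot=lambda: (next(ballots), 0)))   # -> "blue"
```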

17 Liveness. The shared memory is reliable. The non-leaders don’t write –They don’t even need to be “around”. The leader only fails to make progress if another leader competes with it –Contention –By Ω, eventually only one leader will compete –In shared memory systems, Ω is called a contention manager

18 Validity. The leader always proposes its own value or one previously proposed by an earlier leader –Regular registers suffice

19 Agreement. (Timeline figure: the deciding leader performs read, write(b), write(v), read, write decision; no write(b’) with b’>b completed, and its read does not see any b’>b. A competing leader later performs write(b’>b) and then a read; that read sees ⟨b,v⟩, so it writes v.)

20 Agreement Proof Idea. Look at the lowest ballot b in which some process decides; call the decided value v. By uniqueness of ballots, no other value is decided with b. Prove by induction that every decision with ballot b’ > b is also v. Homework: complete the proof –See the argument in the previous slide –See the Byzantine Disk Paxos paper

21 Termination. When one correct leader exists –It eventually chooses a higher b than all those written before –No other process writes a higher ballot –So it does not start over, and hence decides. Any number of processes can fail. How can this be possible? Didn’t we show that a majority of correct processes is needed?

22 Optimization. As in the message-passing case… The first write does not write consensus values. A leader running multiple consensus instances can perform the first write once and for all, and then perform only the second write for each consensus instance.

23 Leases. We need an eventually accurate leader (Ω) –But what does this mean in shared memory? We would like to have mutual exclusion –Not fault-tolerant! Lease: fault-tolerant, time-based mutual exclusion –Live but not safe in the eventual synchrony model

24 Using Leases. A client that has something to write tries to obtain the lease –Lease holder = leader –May fail… Example implementation: –Upon failure, back off for a period and retry. Leases have a limited duration and expire. When is mutual exclusion guaranteed?
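
A hedged sketch of the lease mechanism described on this slide. The 30-second duration matches the Frangipani example later in the lecture; the table structure, the monotonic clock, and the random backoff are illustrative assumptions. Mutual exclusion holds only while the timing assumptions hold: a client whose lease has silently expired may still believe it is the leader, which is acceptable for indulgent algorithms such as Paxos (next slide).

```python
import random
import time

LEASE_SECONDS = 30.0            # example duration (Frangipani uses 30s leases)

class LeaseGranter:
    """Toy lease granter kept in one place; a real system would replicate it."""
    def __init__(self):
        self.holder, self.expires = None, 0.0

    def try_acquire(self, client):
        now = time.monotonic()
        if self.holder is None or now >= self.expires:   # free or expired
            self.holder, self.expires = client, now + LEASE_SECONDS
        return self.holder == client      # True iff `client` now holds the lease

def acquire_with_backoff(granter, client, attempts=5):
    """Upon failure, back off for a random period and retry (slide 24)."""
    for _ in range(attempts):
        if granter.try_acquire(client):
            return True
        time.sleep(random.uniform(0.1, 1.0))
    return False

g = LeaseGranter()
print(acquire_with_backoff(g, "client-A"))   # True: the lease was free
print(g.try_acquire("client-B"))             # False until A's lease expires
```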

25 Lock versus Lease. Lock is blocking –Using locks is not wait-free –If the lock holder fails, we’re in trouble. Lease is non-blocking –A lease expires regardless of whether its holder fails. Lock is always safe –Never two lock-holders. Lease is not –Two lease-holders are possible due to asynchrony –OK for indulgent algorithms, like Paxos

26 Disk Paxos [Gafni, Lamport 00]

27 Data-Centric Replication. A fixed collection of persistent data items accessed by transient clients. Data items have limited functionality –E.g., R/W registers, or –An object of a certain type. Data items can fail and cannot communicate with one another.

28 System Model: Fault-Prone Memory. n fault-prone shared-memory objects –Called base objects –Can be n servers or disks storing base objects –t out of n can fail. m processes (clients) –Any number can fail (wait-free)

29 What Is It Good For? Storage Area Networks (SAN) –“Brick” storage –Disk functionality is limited (R/W) –Disks cannot communicate with each other –Disks and disk servers can fail. Large-scale client/server systems –Simple servers that do not communicate with each other scale better and manage load better –Servers can fail

30 Disk Paxos. Consensus using n ≥ 2t+1 fault-prone disks –Disks can incur crash failures. The solution combines: –m-process shared-memory Paxos, and –ABD-like emulation of shared registers from fault-prone ones

31 Disk Paxos Setting. (Figure: client processes performing R/W on a replicated data store.)

32 Disk Paxos Data Structures. (Figure: m processes and n disks; each disk j holds a block ⟨b,v,n,d⟩ per process.) Process i can write block[i][j] for each disk j, and can read all blocks.

33 Read Emulation. In order to read x_i –Issue read block[i][j], for each disk j –Wait for a majority of disks to respond –Choose the block with the largest ⟨b,n⟩. Is this enough? How did ABD’s read emulation work?
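
A hedged Python sketch of this read emulation. Disks are modeled as plain dictionaries and the set of responsive disks is passed in explicitly; both are illustrative assumptions standing in for real disk I/O and for waiting until a majority answers.

```python
def read_block(disks, i, responsive):
    """Emulated read of x_i (slide 33): read block[i][j] from every disk j,
    wait for a majority, and choose the block with the largest <b, n>.
    disks[j][i] holds block[i][j] = (b, v, n, d)."""
    assert len(responsive) > len(disks) // 2, "need a majority of disks"
    blocks = [disks[j][i] for j in responsive]
    return max(blocks, key=lambda blk: (blk[0], blk[2]))
```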

34 One Read Round Is Enough for Regular Semantics. (Timeline figure: write(0) completes, then write(1) is in progress; a read that finds a copy that was written returns 1, while a concurrent read that does not find a written copy returns 0. Returning 0 is OK for a regular register.)

35 Write Emulation. In order to write x_i –Issue write block[i][j], for each disk j –Wait for a majority of disks to respond. Is this enough? Homework: put everything together –Write complete Disk Paxos pseudo-code based on SM Paxos and the R/W emulations
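
A matching hedged sketch of the write emulation, reusing the simulated-disk layout (and read_block) from the previous sketch. The demo shows why the majorities matter: a write acknowledged by a majority is seen by any later read that also reaches a majority, because the two majorities intersect.

```python
def write_block(disks, i, block, responsive):
    """Emulated write of x_i (slide 35): write block[i][j] to every disk j and
    return once a majority has acknowledged; slow or crashed disks (those not
    in `responsive`) simply keep their old block."""
    assert len(responsive) > len(disks) // 2, "need a majority of disks"
    for j in responsive:
        disks[j][i] = block

# Demo with 3 disks: disk 2 is slow during the write, disk 0 during the read.
initial = ((0, 0), None, (0, 0), None)
disks = {j: {0: initial} for j in range(3)}
write_block(disks, 0, ((1, 0), "blue", (1, 0), None), responsive={0, 1})
print(read_block(disks, 0, responsive={1, 2}))   # sees the new block via disk 1
```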

36 Quorum Systems: A Generalization of Majority

37 Why Majority? In indulgent algorithms (e.g., Paxos) we assumed a majority of the processes are correct. But what we really need is: if Q_1, Q_2 are sets of processes s.t. liveness is guaranteed whenever all processes in P−Q_1 or all processes in P−Q_2 crash, then Q_1 and Q_2 must intersect.

38 1st Generalization: Weighted Voting [Gifford 79]. Each process has a weight –Like shareholders in a corporation. In order to make progress, we need “votes” from a set of processes that hold a majority of the weights (shares). Special cases: –Each process has weight 1: majority –One process has all the weight: singleton

39 Definition of a Quorum System. A quorum system over a universe U of n processes is a collection of subsets of U (called quorums) such that every two quorums intersect. Examples: –Singleton: QS = {{p_i}} –Majority: QS = {Q ⊆ U: |Q| > n/2}
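
A hedged sketch that spells out the definition and the two examples; the set-based representation and the restriction to minimal majorities are illustrative choices.

```python
from itertools import combinations

def is_quorum_system(quorums):
    """Every two quorums must intersect (the defining property)."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))

def singleton(p):
    return [{p}]

def minimal_majorities(universe):
    """The quorums of size floor(n/2)+1; adding supersets keeps the property."""
    n = len(universe)
    return [set(c) for c in combinations(universe, n // 2 + 1)]

U = {1, 2, 3, 4, 5}
print(is_quorum_system(singleton(1)))           # True (trivially)
print(is_quorum_system(minimal_majorities(U)))  # True: any two majorities meet
```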

40 The Grid Quorum System. A quorum consists of one row plus one cell from each row above it. (Figure: a 5×5 grid of processes p1 … p25, laid out row by row.)
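
A hedged sketch of the grid construction, laying the processes out row-major in a k x k grid; which row and which cells are picked is arbitrary, and the example reproduces the 5x5 grid of p1 … p25 from the slide.

```python
import math

def grid_quorum(universe, row, picks):
    """One quorum: all of `row` plus the picked cell from each row above it.
    `universe` is listed row-major in a k x k grid; picks[r] is the column
    chosen in row r, for every r < row."""
    k = math.isqrt(len(universe))
    assert k * k == len(universe), "sketch assumes a perfect-square universe"
    quorum = {universe[row * k + c] for c in range(k)}           # the full row
    quorum |= {universe[r * k + picks[r]] for r in range(row)}   # one cell per row above
    return quorum

U = [f"p{i}" for i in range(1, 26)]                  # the 5x5 grid on the slide
q1 = grid_quorum(U, row=2, picks={0: 3, 1: 0})       # p11..p15 plus p4, p6
q2 = grid_quorum(U, row=4, picks={0: 0, 1: 1, 2: 2, 3: 3})
print(sorted(q1), q1 & q2)                           # any two quorums intersect
```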

41 Advantages of Quorum Systems. Availability –Allows faulty/slow servers to be avoided (up to a certain threshold). Load balancing –Each server participates in only a fraction of the quorums and is therefore involved in only a fraction of the overall accesses. Fundamental tradeoff: load vs. availability

42 Coteries and Domination. A coterie is a quorum system in which no quorum is a subset of another quorum –Obtained from a quorum system by removing supersets and keeping only minimal quorums. A coterie QS dominates a coterie QS’ if every quorum Q’ ∈ QS’ is a superset of some quorum Q ∈ QS. A non-dominated coterie is one that is not dominated by any other coterie.
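
A hedged sketch of these two definitions using frozensets (an illustrative representation); the comment notes the usual extra condition QS ≠ QS’ in the standard definition of domination.

```python
def coterie(quorums):
    """Keep only minimal quorums: drop every quorum that strictly contains
    another quorum of the system."""
    qs = {frozenset(q) for q in quorums}
    return {q for q in qs if not any(other < q for other in qs)}

def dominates(qs, qs_prime):
    """QS dominates QS' if every quorum Q' in QS' is a superset of some quorum
    in QS (the standard definition also requires QS != QS')."""
    return all(any(q <= frozenset(q_prime) for q in qs) for q_prime in qs_prime)

maj3 = coterie([{1, 2}, {2, 3}, {1, 3}, {1, 2, 3}])   # minimal majorities of {1,2,3}
print(sorted(map(sorted, maj3)))                      # [[1, 2], [1, 3], [2, 3]]
print(dominates(maj3, [{1, 2, 3}]))                   # True: {1,2,3} contains {1,2}
```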

43 Quorum Sizes. Majority: O(n). Grid: O(√n). Primary Copy: O(1). Weighted Majority: varies.

44 The Load of a Quorum System. The probability of accessing the busiest server in the best case, i.e., under a strategy that minimizes this probability, and when no failures occur. An access strategy for QS is a probability distribution over the quorums in QS. The load of a server under a strategy is the probability that this server is in the accessed quorum.
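
A hedged sketch of these definitions. Given a strategy (one probability per quorum), it computes each server's load and the load of the busiest server; minimizing over all strategies, which defines the load of the system itself, is left out.

```python
def server_loads(quorums, strategy):
    """strategy[k] is the probability of accessing quorums[k] (must sum to 1).
    A server's load is the probability that it is in the accessed quorum."""
    loads = {}
    for quorum, prob in zip(quorums, strategy):
        for server in quorum:
            loads[server] = loads.get(server, 0.0) + prob
    return loads

def busiest_load(quorums, strategy):
    """Load of the busiest server under this strategy; the load of the quorum
    system is the minimum of this quantity over all access strategies."""
    return max(server_loads(quorums, strategy).values())

# Uniform strategy over the minimal majorities of {1,2,3}: each server is in
# two of the three quorums, so every load is 2/3 (close to 1/2, as on slide 46).
print(busiest_load([{1, 2}, {2, 3}, {1, 3}], [1/3, 1/3, 1/3]))
```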

45 Availability of a Quorum System. The resilience f of QS is the number of failures QS is guaranteed to survive –After any f failures there is always a live quorum. Failure probability –Assume that each server fails independently with probability p –F_p(QS) is the probability that all quorums in QS are hit, i.e., no quorum survives
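
A hedged sketch of both measures by brute-force enumeration of failure patterns, which is only feasible for tiny universes; it reproduces the numbers you would expect for the 3-process majority system.

```python
from itertools import combinations

def resilience(universe, quorums):
    """Largest f such that after ANY f crashes some quorum is still live."""
    servers = sorted(universe)
    for f in range(len(servers) + 1):
        for failed in combinations(servers, f):
            if not any(q.isdisjoint(failed) for q in quorums):
                return f - 1          # this failure pattern kills every quorum
    return len(servers)

def failure_probability(universe, quorums, p):
    """F_p(QS): probability that every quorum is hit when each server fails
    independently with probability p."""
    servers = sorted(universe)
    total = 0.0
    for f in range(len(servers) + 1):
        for failed in combinations(servers, f):
            if not any(q.isdisjoint(failed) for q in quorums):
                total += p ** f * (1 - p) ** (len(servers) - f)
    return total

maj = [{1, 2}, {2, 3}, {1, 3}]
print(resilience({1, 2, 3}, maj))                          # 1
print(round(failure_probability({1, 2, 3}, maj, 0.1), 3))  # 0.028 = 3p^2(1-p) + p^3
```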

46 Examples. Majority –Best availability (smallest failure probability) for p < ½ –Worst availability for p > ½ –Load is close to ½. Singleton –F_p = p (optimal when p > ½) –Load is 1. Grid –Load O(1/√n) –Resilience of √n − 1 –Failure probability goes to 1 as n grows

47 Quorum Replication. Each operation accesses a quorum of replicas. Generalization: Byzantine quorums –Larger intersection

48 Frangipani File System. Thekkath, Mann, and Lee, SOSP 1997

49 Frangipani. A scalable file system built at DEC SRC. Published at SOSP’97. Uses failure detection, Paxos, leases, … Two layers: –Petal: a virtual disk built from many “storage bricks” –The Frangipani file system and lock service

50 Motivation. Large-scale distributed file systems are hard to administer –Hard to add/remove machines (servers) –Hard to add/remove disks (storage space) –Hard to manage the set of current components –Hard to manage locks

51 Petal: Distributed Virtual Disks. C. A. Thekkath and E. K. Lee, Systems Research Center, Digital Equipment Corporation, ASPLOS’96

52 Client’s View

53 Petal Overview. Petal provides virtual disks –Large (2^64 bytes), sparse virtual space –Disk storage allocated on demand –Accessible to all file servers over a network. Virtual disks are implemented by –Cooperating CPUs executing Petal software –Ordinary disks attached to the CPUs –A scalable interconnection network
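
A hedged sketch of the "large, sparse, allocated on demand" idea: a 2^64-byte virtual address space whose blocks get backing storage only when first written. The block size and the dictionary-backed storage are illustrative assumptions, not Petal's actual design.

```python
BLOCK_SIZE = 64 * 1024                 # illustrative block size

class SparseVirtualDisk:
    """Toy virtual disk: 2**64 bytes of address space, with physical storage
    (here a dict entry) allocated only for blocks that have been written."""
    SIZE = 2 ** 64

    def __init__(self):
        self.blocks = {}               # block index -> bytes

    def write_block(self, index, data):
        assert 0 <= index < self.SIZE // BLOCK_SIZE and len(data) == BLOCK_SIZE
        self.blocks[index] = data      # storage allocated on first write

    def read_block(self, index):
        # Unwritten blocks read as zeros and consume no storage.
        return self.blocks.get(index, bytes(BLOCK_SIZE))

disk = SparseVirtualDisk()
disk.write_block(10**9, b"x" * BLOCK_SIZE)       # write far into the address space
print(len(disk.blocks), disk.read_block(0)[:4])  # 1 allocated block; zeros elsewhere
```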

54 Petal Prototype

55 Global State Management. Uses Paxos –Global state is replicated across all servers: metadata (disk allocation) only! –Consistent in the face of server and network failures –A majority is needed to update the global state –Any server can be added/removed even in the presence of failed servers

56 Key Petal Features. Storage is incrementally expandable. Data is optionally mirrored over multiple servers. Metadata is replicated on all servers. Transparent addition and deletion of servers. Supports read-only snapshots of virtual disks. The client API looks like a block-level disk device. Throughput –Scales linearly with additional servers –Degrades gracefully with failures

57 Frangipani: A Scalable Distributed File System. C. A. Thekkath, T. Mann, and E. K. Lee, Systems Research Center, Digital Equipment Corporation, SOSP’97

58 Frangipani Features. Behaves like a local file system –Multiple machines cooperatively manage a Petal disk –Users on any machine see a consistent view of the data. Exhibits good performance, scaling, and load balancing. Easy to administer.

59 Ease of Administration. Frangipani machines are modular –Can be added and deleted transparently. Common free-space pool –Users don’t have to be moved. Automatically recovers from crashes. Consistent backup without halting the system.

60 Frangipani Structure. A distributed file system built atop a shared virtual disk (Petal). Frangipani servers do not communicate with each other directly –Only through Petal. This simplifies management –Addition/removal of servers

61 Frangipani Layering

62 Standard Organization

63 Components of Frangipani. File system core –Implements the file system (FS) interface –Uses FS mechanisms (buffer cache, etc.) –Exploits Petal’s large virtual space. Locks with leases –Granted for a finite time, must be refreshed. Write-ahead redo log –Performance optimization + failure recovery

64 Locks. Multiple readers/single writer. Granularity: one lock per entire file or directory. A lock is really a lease – it expires –After 30 seconds in their implementation. What assumption does this rely on?

65 Using Locks. Frangipani servers are clients of the lock service. Dirty data is written to disk (Petal) before the lock is given to another machine. Locks are cached by the servers that acquire them –Soft state: no need to explicitly release locks –Uses lease timeouts for lock recovery

66 Distributed Lock Management. A set of lock servers collaboratively manages the locks –They run Paxos among themselves –Consensus on global state: the set of locks each server is responsible for, the list of current lock servers, and the allocation of locks to clients –A majority is needed to make progress. Using leases requires assuming loosely synchronized clocks –Expired leases should not be accepted. Why Paxos then? –To overcome network partitions

67 Logging. Frangipani uses a write-ahead redo log for metadata –Log records are kept on Petal (why?). Data is written to Petal –On sync, fsync, or every 30 seconds –On lock revocation or when the log wraps. Each server has a separate log –Reduces contention –Allows independent recovery

68 Recovery. Recovery is initiated upon failure detection –By the lock service –Failure detection is implemented using heartbeats. Any server can recover (replay) the operations of a failed server –The log is available via Petal
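
A hedged sketch of the write-ahead redo log and replay described on slides 67-68. The record format, the in-memory list standing in for a Petal log region, and the idempotence assumption are all illustrative; they are not Frangipani's actual on-disk layout.

```python
import json

class RedoLog:
    """Per-server metadata redo log; in Frangipani the records live on Petal,
    so any server can read and replay them if this server fails."""
    def __init__(self):
        self.records = []
        self.seq = 0

    def log_update(self, update):
        # Append the record BEFORE applying the metadata update itself.
        self.seq += 1
        self.records.append(json.dumps({"seq": self.seq, "update": update}))
        return self.seq

def replay(records, apply_update):
    """Recovery: reapply a failed server's logged updates in sequence order
    (assumes updates are idempotent, so partial prior application is harmless)."""
    for record in sorted(records, key=lambda r: json.loads(r)["seq"]):
        apply_update(json.loads(record)["update"])

log = RedoLog()
log.log_update({"op": "create", "name": "/a"})
log.log_update({"op": "rename", "from": "/a", "to": "/b"})
replay(log.records, apply_update=print)     # a recovering server replays the log
```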

69 Conclusions. Fault tolerance in the real world. Overcome crashes and network partitions using consensus-based replication –Paxos. Good performance in the uncontended case –Using locks. Implement locks as leases for robustness. Logging for recovery.

