1 Distributed k -ary System Algorithms for Distributed Hash Tables Ali Ghodsi PhD Defense, 7th December 2006,

1 Distributed k -ary System Algorithms for Distributed Hash Tables Ali Ghodsi aligh@kth.se http://www.sics.se/~ali/thesis/ PhD Defense, 7th December 2006, KTH/Royal Institute of Technology


3 Presentation Overview Gentle introduction to DHTs Contributions The future

4 What's a Distributed Hash Table (DHT)? An ordinary hash table, which is distributed. Every node provides a lookup operation: provide the value associated with a key. Nodes keep routing pointers: if an item is not found, route to another node. Example table (Key: Value): Alexander: Berlin; Ali: Stockholm; Marina: Gothenburg; Peter: Louvain la neuve; Seif: Stockholm; Stefan: Stockholm.

5 So what? Characteristic properties. Scalability: the number of nodes can be huge; the number of items can be huge. Self-management in the presence of joins/leaves/failures: of routing information and of data items. Time to find data is logarithmic; size of routing tables is logarithmic. Example: log_2(1000000) ≈ 20. EFFICIENT! Store a number of items proportional to the number of nodes. Typically, with D items and n nodes: store D/n items per node; move D/n items when nodes join/leave/fail. EFFICIENT! Self-management of routing info: ensure routing information is up to date. Self-management of items: ensure that data is always replicated and available.
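As a quick check of the figures on this slide, the sketch below computes the logarithmic bound and the per-node load. The item count D = 10,000,000 is an assumption made purely for illustration; only the log_2(1000000) example comes from the slide.

```python
import math

n = 1_000_000          # number of nodes (the slide's example)
D = 10_000_000         # number of items (hypothetical value, for illustration only)

hops = math.log2(n)    # lookup time and routing-table size both grow as log2(n)
items_per_node = D / n # expected items stored per node, and moved on a join/leave/failure

print(f"log2({n}) = {hops:.2f}, about {math.ceil(hops)} hops")
print(f"items per node = {items_per_node:.0f}")
```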

6 Presentation Overview … What's been the general motivation for DHTs? …

7 Traditional Motivation (1/2) Peer-to-peer file sharing is very popular. Napster (central index): completely centralized; a central server knows who has what; judicial problems. Gnutella (decentralized index): completely decentralized; ask everyone you know to find data; very inefficient.

8 Traditional Motivation (2/2) Grand vision of DHTs: provide efficient file sharing. Quote from Chord: "In particular, [Chord] can help avoid single points of failure or control that systems like Napster possess, and the lack of scalability that systems like Gnutella display because of their widespread use of broadcasts." [Stoica et al. 2001] Hidden assumptions: millions of unreliable nodes; a user can switch off the computer at any time (leave=failure); extreme dynamism (nodes joining/leaving/failing); heterogeneity of computers and latencies; untrusted nodes.

9 Our philosophy DHT is a useful data structure Assumptions might not be true Moderate amount of dynamism Leave not same thing as failure Dedicated servers Nodes can be trusted Less heterogeneity Our goal is to achieve more given stronger assumptions

10 Presentation Overview … How to construct a DHT? …

11 How to construct a DHT (Chord)? Use a logical name space, called the identifier space, consisting of identifiers {0,1,2,…, N-1}. The identifier space is a logical ring modulo N. Every node picks a random identifier. Example: space N=16 {0,…,15}, five nodes a, b, c, d, e: a picks 6, b picks 5, c picks 0, d picks 11, e picks 2. [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11]

12 Definition of Successor. The successor of an identifier is the first node met going in clockwise direction, starting at the identifier. Example: succ(12)=0, succ(15)=0, succ(6)=6. [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11]

13 Where to store data (Chord)? Use a globally known hash function, H. Each item gets identifier H(key). Store each item at its successor: node n = succ(H(k)) is responsible for item k. Example: H(Marina)=12, H(Peter)=2, H(Seif)=9, H(Stefan)=14. Table (Key: Value): Alexander: Berlin; Marina: Gothenburg; Peter: Louvain la neuve; Seif: Stockholm; Stefan: Stockholm. Store a number of items proportional to the number of nodes. Typically, with D items and n nodes: store D/n items per node; move D/n items when nodes join/leave/fail. EFFICIENT! [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11]

14 Where to point (Chord)? Each node points to its successor: the successor of a node n is succ(n+1), known as a node's succ pointer. Each node points to its predecessor: the first node met in anti-clockwise direction starting at n-1, known as a node's pred pointer. Example: 0's successor is succ(1)=2; 2's successor is succ(3)=5; 5's successor is succ(6)=6; 6's successor is succ(7)=11; 11's successor is succ(12)=0. [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11]
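The succ pointers in this example can be reproduced with a few lines of code. This is a minimal sketch, assuming the five-node ring {0, 2, 5, 6, 11} in a space of 16 identifiers; the function name `succ` follows the slides, while the sorted-list representation is just one convenient choice.

```python
from bisect import bisect_left

SPACE = 16                          # identifier space {0, ..., 15}
NODES = sorted([0, 2, 5, 6, 11])    # node identifiers from the example

def succ(identifier):
    """First node met going clockwise, starting at the identifier itself."""
    idx = bisect_left(NODES, identifier % SPACE)
    return NODES[idx % len(NODES)]  # wrap around past the largest node

# Each node's succ pointer is succ(n + 1).
succ_pointer = {n: succ(n + 1) for n in NODES}
print(succ_pointer)
```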

15 DHT Lookup. To look up a key k: calculate H(k); follow succ pointers until item k is found. Example: lookup Seif at node 2; H(Seif)=9; traverse nodes 2, 5, 6, 11 (BINGO); return Stockholm to the initiator. Table (Key: Value): Alexander: Berlin; Marina: Gothenburg; Peter: Louvain la neuve; Seif: Stockholm; Stefan: Stockholm. [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11]
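The lookup walk can be sketched as follows. The hash values are the ones given on the slides, hard-coded in a dictionary instead of computed by a real hash function (Alexander is omitted because his identifier is not given). The helper names `store` and `lookup` are illustrative assumptions, not the dissertation's API.

```python
from bisect import bisect_left

SPACE = 16
NODES = sorted([0, 2, 5, 6, 11])

# Hash values as stated on the slides (stand-in for a real hash function H).
H = {"Marina": 12, "Peter": 2, "Seif": 9, "Stefan": 14}
VALUES = {"Marina": "Gothenburg", "Peter": "Louvain la neuve",
          "Seif": "Stockholm", "Stefan": "Stockholm"}

def succ(identifier):
    idx = bisect_left(NODES, identifier % SPACE)
    return NODES[idx % len(NODES)]

# Store each item at the successor of its identifier.
store = {n: {} for n in NODES}
for key, ident in H.items():
    store[succ(ident)][key] = VALUES[key]

def lookup(key, start):
    """Follow succ pointers from `start` until the item is found."""
    node, path = start, [start]
    while key not in store[node]:
        node = succ(node + 1)
        if node == start:           # went all the way around the ring
            raise KeyError(key)
        path.append(node)
    return store[node][key], path

value, path = lookup("Seif", start=2)
print(value, path)
```

Running it from node 2 visits 2, 5, 6, 11, matching the traversal on the slide.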

16 Speeding up lookups. If only the pointer to succ(n+1) is used, worst-case lookup time is N, for N nodes. Improving lookup time: point to succ(n+1), succ(n+2), succ(n+4), succ(n+8), …, succ(n+2^M). The distance to the destination is always halved. Time to find data is logarithmic; size of routing tables is logarithmic. Example: log_2(1000000) ≈ 20. EFFICIENT! [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11]
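A minimal sketch of routing with these extra pointers, assuming the example ring {0, 2, 5, 6, 11} with 16 identifiers and pointers at distances 1, 2, 4, and 8. The names `fingers` and `route` are hypothetical, and the greedy rule (take the farthest pointer that does not pass the target) is the usual Chord-style choice rather than something spelled out on this slide.

```python
from bisect import bisect_left

SPACE = 16
NODES = sorted([0, 2, 5, 6, 11])

def succ(identifier):
    idx = bisect_left(NODES, identifier % SPACE)
    return NODES[idx % len(NODES)]

def dist(a, b):
    """Clockwise distance from a to b on the ring."""
    return (b - a) % SPACE

def fingers(n):
    # succ(n+1), succ(n+2), succ(n+4), succ(n+8)
    return [succ(n + 2**i) for i in range(SPACE.bit_length() - 1)]

def route(start, target_id):
    """Nodes visited until the node responsible for target_id is reached."""
    node, path = start, [start]
    while dist(node, target_id) != 0:
        nxt = succ(node + 1)
        if nxt == node:                       # single-node ring
            break
        if dist(node, target_id) <= dist(node, nxt):
            path.append(nxt)                  # target lies in (node, succ]
            break
        # Farthest pointer that does not overshoot the target.
        candidates = [f for f in fingers(node)
                      if 0 < dist(node, f) <= dist(node, target_id)]
        node = max(candidates, key=lambda f: dist(node, f))
        path.append(node)
    return path

print(route(2, 9))    # compare with the succ-only walk 2, 5, 6, 11
```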

17 Dealing with failures. Each node keeps a successor-list: pointers to the f closest successors, succ(n+1), succ(succ(n+1)+1), succ(succ(succ(n+1)+1)+1), ... If the successor fails, replace it with the closest alive successor. If the predecessor fails, set pred to nil. [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11]

18 Handling Dynamism Periodic stabilization used to make pointers eventually correct Try pointing succ to closest alive successor Try pointing pred to closest alive predecessor

20 Outline … Lookup consistency …

21 Problems with periodic stabilization. Joins and leaves can result in inconsistent lookup results: at node 12, lookup(14)=14; at node 10, lookup(14)=15. [Figure: ring segment with nodes 10, 12, 14, 15]

22 Problems with periodic stabilization. Leaves can result in routing failures. [Figure: ring segment with nodes 10, 13, 16]

23 Problems with periodic stabilization. Too many leaves destroy the system; the requirement is #leaves+#failures per round < |successor-list|. [Figure: ring segment with nodes 10, 11, 12, 14, 15]

24 Outline … Atomic Ring Maintenance …

25 Atomic Ring Maintenance Differentiate leaves from failures Leave is a synchronized departure Failure is a crash-stop Initially assume no failures Build a ring initially

26 Atomic Ring Maintenance Separate parts of the problem Concurrency control Serialize neighboring joins/leaves Lookup consistency

27 Naïve Approach. Each node i hosts a lock called L_i. For p to join or leave: first acquire L_{p.pred}; second acquire L_p; third acquire L_{p.succ}; thereafter update relevant pointers. Can lead to deadlocks.

28 Our Approach to Concurrency Control. Each node i hosts a lock called L_i. For p to join or leave: first acquire L_p; thereafter acquire L_{p.succ}; thereafter update relevant pointers. Each lock has a lock queue: the nodes waiting to acquire the lock.

29 Safety. Non-interference theorem: when node p acquires both locks, node p's successor cannot leave, node p's predecessor cannot leave, and other joins cannot affect relevant pointers.

30 Dining Philosophers Problem similar to the Dining philosophers problem Five philosophers around a table One fork between each philosopher (5) Philosophers eat and think To eat: grab left fork then grab right fork

31 Deadlocks. The locking can result in a deadlock: if all nodes acquire their first lock, every node waits indefinitely for its second lock. Solution from Dining philosophers: introduce asymmetry. One node acquires locks in reverse order; the node with the highest identifier reverses. If n>n.succ, then n has the highest identifier (its successor lies past the wrap-around point of the ring).
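The effect of the asymmetry can be illustrated with a short sketch. It assumes that the highest-identifier node is recognised by having a successor with a smaller identifier (the ring wraps around there); `lock_order` is a hypothetical helper, and the ring {10, 12, 14, 15} is taken from the figures in this part of the talk. With the reversal in place, every node requests its two locks in ascending identifier order, so no cycle of waiting nodes can form.

```python
def lock_order(p, p_succ):
    """Order in which node p acquires the locks L_p and L_{p.succ}."""
    if p > p_succ:             # p has the highest identifier: reverse the order
        return (p_succ, p)
    return (p, p_succ)         # default: own lock first, then successor's lock

ring = [10, 12, 14, 15]
orders = {p: lock_order(p, ring[(i + 1) % len(ring)])
          for i, p in enumerate(ring)}

for p, (first, second) in orders.items():
    print(f"node {p}: acquire L_{first}, then L_{second}")
```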

32 Pitfalls. Join adds a node/philosopher. Solution: some requests in the lock queue are forwarded to the new node. [Figure: ring segment with nodes 10, 12, 14, 15 and their lock queues, with node 12 joining]

33 Pitfalls Leave removes a node/philosopher Problem: if leaving node gives lock queue to its successor, nodes can get worse position in queue: starvation Use forwarding to avoid starvation Lock queue empty after local leave request

34 Correctness Liveness Theorem: Algorithm is starvation free Also free from deadlocks and livelocks Every joining/leaving node will eventually succeed getting both locks

35 Performance drawbacks If many neighboring nodes leaving All grab local lock Sequential progress Solution Randomized locking Release locks and retry Liveness with high probability 10 12 14 15

36 Lookup consistency: leaves So far dealt with concurrent joins/leaves Look at concurrent join/leaves/lookups Lookup consistency (informally): At any time, only one node responsible for any key Joins/leaves should not affect functionality of lookups

37 Lookup consistency Goal is to make joins and leaves appear as if they happened instantaneously Every leave has a leave point A point in global time, where the whole system behaves as if the node instantaneously left Implemented with a LeaveForward flag The leaving node forwards messages to successor if LeaveForward is true

38 Leave Algorithm. [Figure: message sequence diagram between Node p, Node q (leaving), and Node r, with the labels LeaveForward=true, leave point, pred:=p, succ:=r, LeaveForward=false]

39 Lookup consistency: joins Every join has a join point A point in global time, where the whole system behaves as if the node instantaneously joined Implemented with a JoinForward flag The successor of a joining node forwards messages to new node if JoinForward is true

40 Join Algorithm. [Figure: message sequence diagram between Node p, Node q (joining), and Node r, with the labels pred:=p, succ:=r, JoinForward=true, oldpred=pred, pred=q, Join Point, succ:=q, JoinForward=false]

41 Outline … What about failures? …

42 Dealing with Failures. We prove it is impossible to provide lookup consistency on the Internet. Assumptions: availability (always eventually answer); lookup consistency; partition tolerance. Failure detectors can behave as if the network partitioned.

43 Dealing with Failures We provide fault-tolerant atomic ring Locks leased Guarantees locks are always released Periodic stabilization ensures Eventually correct ring Eventual lookup consistency

44 Contributions Lookup consistency in presence of joins/leaves System not affected by joins/leaves Inserts do not disappear No routing failures when nodes leave Number of leaves not bounded

45 Related Work Li, Misra, Plaxton (04, 06) have a similar solution Advantages Assertional reasoning Almost machine verifiable proofs Disadvantages Starvation possible Not used for lookup consistency Failure-free environment assumed

46 Related Work Lynch, Malkhi, Ratajczak (02), position paper with pseudo code in appendix Advantages First to propose atomic lookup consistency Disadvantages No proofs Message might be sent to a node that left Does not work for both joins and leaves together Failures not dealt with

47 Outline … Additional Pointers on the Ring …

48 Routing. Generalization of Chord to provide arbitrary arity. Provide log_k(n) hops per lookup, k being a configurable parameter and n being the number of nodes, instead of only log_2(n).

49 Achieving log_k(n) lookup. Each node has log_k(N) levels, N=k^L. Each level contains k intervals. Example: k=4, N=64 (4^3), node 0. Level 1: I0 = 0…15, I1 = 16…31, I2 = 32…47, I3 = 48…63. [Figure: ring with marks at 0, 4, 8, 12, 16, 32, 48 showing Interval 0 to Interval 3]

50 Achieving log_k(n) lookup (continued). Node 0, Level 2: I0 = 0…3, I1 = 4…7, I2 = 8…11, I3 = 12…15.

51 Achieving log_k(n) lookup (continued). Node 0, Level 3: I0 = 0, I1 = 1, I2 = 2, I3 = 3.

52 Arity important Maximum number of hops can be configured Example, a 2-hop system

53 Placing pointers. Each node has (k-1)log_k(N) pointers. Node p's pointers point at: [formula not preserved in the transcript]. Node 0's pointers: f(1)=1, f(2)=2, f(3)=3, f(4)=4, f(5)=8, f(6)=12, f(7)=16, f(8)=32, f(9)=48. [Figure: ring with marks at 0, 4, 8, 12, 16, 32, 48]
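Since the slide's formula is missing from the transcript, the sketch below uses one formula that reproduces the listed values for node 0 with k=4 and N=64: f(i) = p + (1 + (i-1) mod (k-1)) * k^floor((i-1)/(k-1)), taken modulo N. Treat it as an assumption consistent with the example, not necessarily the exact form used in the dissertation; the function name `pointers` is hypothetical.

```python
def pointers(p, k, N):
    """Pointer targets of node p in a k-ary system with N = k**L identifiers."""
    levels, size = 0, 1
    while size < N:                  # compute L = log_k(N) with integers only
        size *= k
        levels += 1
    if size != N:
        raise ValueError("N must be a power of k")
    targets = []
    for i in range(1, (k - 1) * levels + 1):
        step = 1 + (i - 1) % (k - 1)     # 1 .. k-1 within a level
        scale = k ** ((i - 1) // (k - 1))  # interval width at that level
        targets.append((p + step * scale) % N)
    return targets

print(pointers(0, k=4, N=64))
```

With k=2 the same function yields the Chord-style distances 1, 2, 4, 8, …, which fits the description of the scheme as a generalization of Chord.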

54 Greedy Routing. lookup(i) algorithm: use the pointer closest to i, without overshooting i. If no such pointer exists, succ is responsible for i.

55 Routing with Atomic Ring Maintenance Invariant of lookup Last hop is always predecessor of responsible node Last step in lookup If JoinForward is true, forward to pred If LeaveForward is true, forward to succ

56 Avoiding Routing Failures If nodes leave, routing failures can occur Accounting algorithm Simple Algorithm No routing failures of ordinary messages Fault-free Algorithm No routing failures Many cases and interleavings Concurrent joins and leaves, pointers in both directions

57 General Routing Three lookup styles Recursive Iterative Transitive

58 Reliable Routing. Reliable lookup for each style: if the initiator doesn't crash, the responsible node is reached; no redundant delivery of messages. General strategy: repeat the operation until success; filter duplicates using unique identifiers. Iterative lookup: reliability easy to achieve. Recursive lookup: several algorithms possible. Transitive lookup: efficient reliability hard to achieve.

59 Outline … One-to-many Communication …

60 Group Communication on an Overlay Use existing routing pointers Group communication DHT only provides key lookup Complex queries by searching the overlay Limited horizon broadcast Iterative deepening More efficient than Gnutella-like systems No unintended graph partitioning Cheaper topology maintenance [castro04]

61 Group Communication on an Overlay DHT builds a graph Why not use general graph algorithms? Can use the specific structure of DHTs More efficient Avoids redundant messages

62 Broadcast Algorithms Correctness conditions: Termination Algorithm should eventually terminate Coverage All nodes should receive the broadcast message Non-redundancy Each node receives the message at most once Initially assume no failures

63 Naïve Broadcast. Naïve broadcast algorithm: send the message to succ until the initiator is reached or overshot. [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11 and the initiator marked]

64 Naïve Broadcast. Naïve broadcast algorithm: send the message to succ until the initiator is reached or overshot. Improvement: the initiator delegates half the space to a neighbor; the idea is applied recursively; log(n) time and n messages. [Figure: identifier ring 0…15 with nodes at 0, 2, 5, 6, 11 and the initiator marked]

65 Simple Broadcast in the Overlay. The dissertation assumes a general DHT model.
event n.SimpleBcast(m, limit)    % initially limit = n
  for i := M downto 1 do
    if u(i) ∈ (n, limit) then
      sendto u(i) : SimpleBcast(m, limit)
      limit := u(i)
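The pseudocode can be simulated on a small ring to check coverage and non-redundancy. This sketch assumes Chord-style pointers u(i) = succ(n + 2^(i-1)) for i = 1…M with M = 4, and the example ring {0, 2, 5, 6, 11}; message sending is modelled as a recursive call, and the interval (n, n) is read as the whole ring except n.

```python
from bisect import bisect_left

SPACE, M = 16, 4
NODES = sorted([0, 2, 5, 6, 11])

def succ(identifier):
    idx = bisect_left(NODES, identifier % SPACE)
    return NODES[idx % len(NODES)]

def in_open(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    if a == b:                          # (a, a) covers everything except a
        return x != a
    return 0 < (x - a) % SPACE < (b - a) % SPACE

def simple_bcast(n, limit, delivered):
    delivered.append(n)                 # node n receives the message
    for i in range(M, 0, -1):           # for i := M downto 1
        u = succ(n + 2 ** (i - 1))      # pointer u(i), assumed Chord-style
        if in_open(u, n, limit):
            simple_bcast(u, limit, delivered)   # u covers the interval (u, limit)
            limit = u                   # n keeps responsibility for (n, u)

delivered = []
simple_bcast(0, 0, delivered)           # initially limit = n
print(delivered)
```

Each node appears exactly once in `delivered`, so the run uses n - 1 messages for n nodes.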

66 Advanced Broadcast Old algorithm on k -ary trees

67 Getting responses Getting a reply Nodes send directly back to initiator Not scalable Simple Broadcast with Feedback Collect responses back to initiator Broadcast induces a tree, feedback in reverse direction Similar to simple broadcast algorithm Keeps track of parent ( par ) Keeps track of children ( Ack ) Accumulate feedback from children, send to parent Atomic ring maintenance Acquire local lock to ensure nodes do not leave

68 Outline … Advanced One-to-many Communication …

69 Motivation for Bulk Operation. Building MyriadStore in 2005: distributed backup using the DKS DHT. Restoring a 4 MB file: each block (4 KB) is indexed in the DHT, which requires 1000 items in the DHT. Expensive: one node making 1000 lookups; marshaling/unmarshaling 1000 requests.

70 Bulk Operation. Define a bulk set I: a set of identifiers. bulk_operation(m, I): send message m to every node i ∈ I. Similar correctness to broadcast: coverage (all nodes with an identifier in I), termination, non-redundancy.

71 Bulk Owner Operation. Define a bulk set I: a set of identifiers. bulk_own(m, I): send m to every node responsible for an identifier i ∈ I. Example: bulk set I={4}; node 4 might not exist, but some node is responsible for identifier 4.

72 Bulk Operation with Feedback. Define a bulk set I: a set of identifiers. bulk_feed(m, I): send message m to every node i ∈ I and accumulate responses back to the initiator. bulk_own_feed(m, I): send message m to every node responsible for an identifier i ∈ I and accumulate responses back to the initiator.
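To make the difference between the bulk and bulk owner operations concrete, the sketch below computes which nodes each of them has to reach for a given bulk set. It only illustrates the target sets, not the routing algorithm itself; the ring {0, 2, 5, 6, 11} and the helper names `bulk_targets` and `bulk_owners` are assumptions for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

SPACE = 16
NODES = sorted([0, 2, 5, 6, 11])

def succ(identifier):
    idx = bisect_left(NODES, identifier % SPACE)
    return NODES[idx % len(NODES)]

def bulk_targets(bulk_set):
    """bulk_operation: only nodes whose own identifier is in the bulk set."""
    return sorted(n for n in NODES if n in bulk_set)

def bulk_owners(bulk_set):
    """bulk_own: group identifiers by the node responsible for them."""
    owners = defaultdict(list)
    for i in sorted(bulk_set):
        owners[succ(i)].append(i)
    return dict(owners)

print(bulk_targets({4}))   # node 4 does not exist, so no node is reached
print(bulk_owners({4}))    # some node is still responsible for identifier 4
```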

73 Bulk Properties (1/2) No redundant messages Maximum log(n) messages per node

74 Bulk Properties (2/2) Two extreme cases. Case 1: the bulk set is all identifiers; identical to simple broadcast; message complexity is n; time complexity is log(n). Case 2: the bulk set is a singleton with one identifier; identical to ordinary lookup; message complexity is log(n); time complexity is log(n).

75 Pseudo Reliable Broadcast Pseudo-reliable broadcast to deal with crash failures Coverage property If initiator is correct, every node gets the message Similar to broadcast with feedback Use failure detectors on children If child with responsibility to cover I fails Use bulk to retry covering interval I Filter redundant messages using unique identifiers Eventually perfect failure detector for termination Inaccuracy results in redundant messages

76 Applications of bulk operation Bulk operation Topology maintenance: update nodes in bulk set Pseudo-reliable broadcast: re-covering intervals Bulk owner Multiple inserts into a DHT Bulk owner with feedback Multiple lookups in a DHT Range queries

77 Outline … Replication …

78 Successor-list replication. Replicate a node's items on its f successors. Used in DKS, Chord, Pastry, Koorde, etcetera. Was abandoned in favor of symmetric replication because …

79 Motivation: successor-lists If a node joins or leaves f replicas need to be updated Color represents data item Replication degree 3 Every color replicated three times

80 Motivation: successor-lists If a node joins or leaves f replicas need to be updated Color represents data item Node leaves Yellow, green, red, blue need to be re-distributed

81 Multiple hashing Rehashing Store each item at succ( H(k) ) succ( H(H(k)) ) succ( H(H(H(k))) ) … Multiple hash functions Store each item at succ( H 1 (k) ) succ( H 2 (k) ) succ( H 3 (k) ) … Advocated by CAN and Tapestry

82 Motivation: multiple hashing. Example: item H(Seif)=7, succ(7)=9. Node 9 crashes. Node 12 should get the item from a replica. Needs the hash inverse H^-1(7)=Seif (impossible). Items dispersed all over the nodes (inefficient). [Figure: ring segment with nodes 5, 9, 12 and the item (Seif, Stockholm) at identifier 7]

83 Symmetric Replication Basic Idea Replicate identifiers, not nodes Associate each identifier i with f other identifiers: Identifier space partitioned into m equivalence classes Cardinality of each class is f, m=N/f Each node replicates the equivalence class of all identifiers it is responsible for
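The association formula is not preserved in the transcript. A minimal sketch, assuming identifier i is associated with the identifiers i + x·N/f (mod N) for x = 0…f-1, gives f identifiers per class and m = N/f classes, as stated on the slide. The function names are hypothetical, and N=16 with f=4 is chosen as a small example.

```python
def replica_ids(i, N=16, f=4):
    """Identifiers in the same equivalence class as i (i itself included)."""
    step = N // f                     # m = N/f, the number of classes
    return [(i + x * step) % N for x in range(f)]

def equivalence_classes(N=16, f=4):
    return sorted({tuple(sorted(replica_ids(i, N, f))) for i in range(N)})

print(replica_ids(7))
for cls in equivalence_classes():
    print(cls)
```

Because the class of an identifier is computed directly from the identifier, a node that needs a replica of identifier i can look up any other member of the class without inverting a hash function.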

84 Symmetric replication. Replication degree f=4, Space={0,…,15}. Congruence classes modulo 4: {0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}. [Figure: identifier ring 0…15; each node is labelled with the data identifiers it is responsible for: 15, 0; 1, 2, 3; 4, 5; 6, 7, 8, 9, 10; 11, 12, 13, 14]

85 Ordinary Chord. Replication degree f=4, Space={0,…,15}. Congruence classes modulo 4: {0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}. [Figure: the same ring, with every node additionally labelled with the replicated identifier ranges it stores]

86 Cheap join/leave. Replication degree f=4, Space={0,…,15}. Congruence classes modulo 4: {0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}. [Figure: the same ring after a join; the new node receives Data: 11, 12, 7, 8, 3, 4, 0, 15]

87 Contributions Message complexity for join/leave O(1) Bit complexity remains unchanged Handling failures more complex Bulk operation to fetch data On average log(n) complexity Can do parallel lookups Decreasing latencies Increasing robustness Distributed voting Erasure codes

88 Presentation Overview … Summary …

89 Summary (1/3) Atomic ring maintenance Lookup consistency for j/l No routing failures as nodes j/l No bound on number of leaves Eventual consistency with failures Additional routing pointers k -ary lookup Reliable lookup No routing failures with additional pointers

90 Summary (2/3) Efficient Broadcast log(n) time and n message complexity Used in overlay multicast Bulk operations Efficient parallel lookups Efficient range queries

91 Summary (3/3) Symmetric Replication Simple, O(1) message complexity for j/l O(log f) for failures Enables parallel lookups Decreasing latencies Increasing robustness Distributed voting

93 Future Work (1/2) Periodic stabilization Prove it is self-stabilizing

94 Future Work (2/2) Replication Consistency Atomic consistency impossible in asynchronous systems Assume partial synchrony Weaker consistency models? Using virtual synchrony

95 Speculative long-term agenda Overlay today provides Dynamic membership Identities (max/min avail) Only know subset of nodes Shared memory registers Revisit distributed computing Assuming an overlay as basic primitive Leader election Consensus Shared memory consistency (started) Transactions Wave algorithms (started) Implement middleware providing these…

96 Acknowledgments Seif Haridi Luc Onana Alima Cosmin Arad Per Brand Sameh El-Ansary Roland Yap

97 THANK YOU

99 Handling joins. When n joins: find n's successor with lookup(n); set succ to n's successor; stabilization fixes the rest. [Figure: ring segment with nodes 11, 13, 15]
Periodically at n:
  1. set v := succ.pred
  2. if v ≠ nil and v is in (n, succ]
  3.   set succ := v
  4. send a notify(n) to succ
When receiving notify(p) at n:
  1. if pred = nil or p is in (pred, n]
  2.   set pred := p
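The join and stabilization steps above can be simulated in a few lines. This is a sketch under simplifying assumptions: no failures, synchronous rounds in a fixed order, a 16-identifier space, and the three nodes 11, 13, 15 from the figure. The class `Node` and the helpers `between` and `find_successor` are hypothetical names.

```python
SPACE = 16

def between(x, a, b):
    """True if x lies in the ring interval (a, b]; (a, a] is the whole ring."""
    if a == b:
        return True
    return 0 < (x - a) % SPACE <= (b - a) % SPACE

class Node:
    def __init__(self, ident):
        self.id, self.succ, self.pred = ident, self, None

    def stabilize(self):                    # "Periodically at n"
        v = self.succ.pred
        if v is not None and between(v.id, self.id, self.succ.id):
            self.succ = v
        self.succ.notify(self)

    def notify(self, p):                    # "When receiving notify(p) at n"
        if self.pred is None or between(p.id, self.pred.id, self.id):
            self.pred = p

def find_successor(start, ident):
    """Stand-in for lookup(n): walk succ pointers to the responsible node."""
    node = start
    while not between(ident, node.id, node.succ.id):
        node = node.succ
    return node.succ

n11, n13, n15 = Node(11), Node(13), Node(15)
n15.succ = find_successor(n11, 15)          # both joiners initially point at 11
n13.succ = find_successor(n11, 13)

for _ in range(5):                          # a few stabilization rounds
    for n in (n11, n13, n15):
        n.stabilize()

for n in (n11, n13, n15):
    print(n.id, "succ:", n.succ.id, "pred:", n.pred.id)
```

In this run the pointers settle into the ring 11, 13, 15 within a few rounds, even though both joining nodes started out pointing at node 11.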

100 Handling leaves. When n leaves: just disappear (like a failure). When pred is detected as failed: set pred to nil. When succ is detected as failed: set succ to the closest alive node in the successor list. [Figure: ring segment with nodes 11, 13, 15]
Periodically at n:
  1. set v := succ.pred
  2. if v ≠ nil and v is in (n, succ]
  3.   set succ := v
  4. send a notify(n) to succ
When receiving notify(p) at n:
  1. if pred = nil or p is in (pred, n]
  2.   set pred := p
