
1 Distributed Data Structures for a Peer-to-peer system
Gauri Shah
Advisor: James Aspnes
Committee: Joan Feigenbaum, Arvind Krishnamurthy, Antony Rowstron [MSR, Cambridge, UK]

2 P2P system
Very large number of peers (nodes). Peers store resources identified by keys. Peers are subject to crash failures.
Question: how do we locate resources efficiently?
[Figure: peers and the keyed resources they store.]

3 A brief history
June 1999: Shawn Fanning starts Napster. Dec. 1999: RIAA sues Napster for copyright infringement. July 2001: Napster is shut down!
Napster clones: KaZaA, Gnutella, Morpheus, MojoNation, ...
Academic research: CAN, Chord, Pastry, Tapestry, skip graphs, ...
Distributed computing: SETI@home, folding@home, ...

4 Answer: Central server? (Napster)
Central server bottleneck, wasted power at clients, no fault tolerance. Using server farms?

5 Answer: Flooding? (Gnutella)
Too much traffic; available resources can be 'out of reach'.

6 Answer: Super-peers? (KaZaA/Morpheus)
Inherently unscalable.

7 What would we like?
Data availability, decentralization, scalability, load balancing, fault tolerance, network maintenance, dynamic node addition/deletion, repair mechanism, efficient searching, incorporating proximity, incorporating locality.

8 Distributed hash tables
[Figure: node IDs and resource keys are hashed onto a virtual overlay network; virtual links and virtual routes in the overlay map onto actual routes in the physical network.]

9 Existing DHT systems
CAN [RFHKS '01] (d-dimensional coordinate space, d=2 in the example), Chord [SMKKB '01], Pastry [RD '01], Tapestry [ZKJ '01].
O(log n) time per search, O(log n) space per node.
[Figure: example routing geometries of the four systems.]

10 What does this give us?
Data availability, decentralization, scalability, load balancing, fault tolerance, network maintenance, dynamic node addition/deletion, repair mechanism, efficient searching, incorporating proximity, incorporating locality.

11 Analytical model [Aspnes-Diamadi-Shah, PODC 2002]
Questions: Performance with failures? Optimal link distribution for greedy routing? Construction and dynamic maintenance?

12 Our approach (based on [Kleinberg 1999])
Simple metric space: 1D line. Hash(key) = metric-space location.
2 short-hop links: immediate neighbors. k long-hop links: inverse-distance distribution, Pr[edge(u,v)] = (1/d(u,v)) / Σ_{v'} (1/d(u,v')).
Greedy routing: forward the message to the neighbor closest to the target in the metric space.
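To make the model concrete, here is a minimal Python sketch (my own toy, not the talk's code: the names build_links and greedy_route are hypothetical, and it uses ring distance over n integer positions rather than a 1D line) of inverse-distance link selection plus greedy routing:

```python
import random

def build_links(n, k):
    """Each node keeps its two immediate neighbors plus k long hops drawn with
    probability proportional to 1/d(u, v) (ring distance, for simplicity)."""
    dist = lambda a, b: min(abs(a - b), n - abs(a - b))
    links = {}
    for u in range(n):
        nbrs = {(u - 1) % n, (u + 1) % n}                            # short-hop links
        others = [v for v in range(n) if v != u]
        weights = [1.0 / dist(u, v) for v in others]
        nbrs.update(random.choices(others, weights=weights, k=k))    # long-hop links
        links[u] = nbrs
    return links

def greedy_route(links, n, src, dst):
    """Forward to the neighbor closest to the target until the target is reached."""
    dist = lambda a, b: min(abs(a - b), n - abs(a - b))
    path, cur = [src], src
    while cur != dst:
        cur = min(links[cur], key=lambda v: dist(v, dst))
        path.append(cur)
    return path

links = build_links(256, k=4)
print(len(greedy_route(links, 256, 0, 137)) - 1, "hops")
```

Because the two short hops are always present, every greedy step strictly decreases the distance to the target, so the routing loop terminates.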

13 Performance with failures
Each node has k ∈ [1, log n] long-hop links.
Without failures: routing time O((log² n)/k).
With failures: each node/link fails with probability p; routing time O((log² n)/((1-p)·k)).

14 Search with random failures
[Figure: fraction of failed searches vs. probability of node failure; n = 131072 nodes, log n = 17 links per node; non-faulty source and target.]

15 Lower bounds?
Is it possible to design a link distribution that beats the O(log² n) routing bound given by the 1/d distribution?
Lower bound on routing time as a function of the number of links per node.

16 Lower bounds
Random graph G: node x has k links on average, each chosen independently; x links to (x-1) and (x+1). Let the target be 0.
Expected time to reach 0 from any point chosen uniformly from 1..n: Ω(log² n)*, worse than O(log n) for a tree; this is the cost of assuming symmetry between nodes.
* For link distributions symmetric about 0 and unimodal: routing time Ω(log² n / (k log log n)).

17 Heuristic for construction
The new node chooses neighbors using the inverse-distance distribution, links to the live nodes closest to the chosen ones, and selects older nodes to point to it. The same strategy is used for repairing broken links.
[Figure: new node x adjusts each ideal/initial link to the nearest live node when the ideal target is absent; older node y adds a new link to x.]
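A sketch of the joining side of this heuristic, assuming a centrally known set of live positions (hypothetical names; the step where older nodes are asked to point back at the new node, and the messaging needed to locate the nearest live node in a real deployment, are omitted):

```python
import bisect
import random

def choose_long_hops(new_pos, live_positions, k, n):
    """Draw k ideal targets from the 1/d distribution, then link to the live
    node closest to each ideal target."""
    live = sorted(live_positions)
    candidates = [p for p in range(n) if p != new_pos]
    weights = [1.0 / abs(new_pos - p) for p in candidates]
    chosen = []
    for ideal in random.choices(candidates, weights=weights, k=k):
        i = bisect.bisect_left(live, ideal)
        window = live[max(i - 1, 0):i + 1]   # live nodes bracketing the ideal target
        chosen.append(min(window, key=lambda p: abs(p - ideal)))
    return chosen

live = random.sample(range(1, 1000), 200)
print(choose_long_hops(0, live, k=5, n=1000))
```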

18 [Figure: comparison of the heuristically derived network with the ideal one; n = 16384 nodes, log n = 14 links per node.]

19 So far...
Data availability, decentralization, scalability, load balancing, fault tolerance, network maintenance, dynamic node addition/deletion, repair mechanism, efficient searching, incorporating proximity, incorporating locality.

20 Disadvantages of DHTs
No support for locality: a user who requests www.cnn.com is likely to request www.cnn.com/weather next, and the system should use information from the first search to improve the performance of the second.
No support for complex queries.
DHTs cannot do this because hashing destroys locality.

21 Skip list [Pugh '90]
A data structure based on a linked list. Each element is linked in at the next higher level with probability 1/2.
[Figure: level 0 holds A G J M R W between HEAD and TAIL; level 1 holds A J M; level 2 holds J.]
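The promotion rule is just a sequence of fair coin flips per element; a tiny standard Pugh-style helper (not from the talk) as a sketch:

```python
import random

def random_level(p=0.5, max_level=32):
    """Promote an element to the next level with probability p (1/2 here),
    so the expected height of an m-element skip list is O(log m)."""
    level = 0
    while level < max_level and random.random() < p:
        level += 1
    return level

print(sorted(random_level() for _ in range(10)))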

22 Searching in a skip list
Search for key 'R': start at the top level and drop down whenever the next key would overshoot.
Time for search: O(log m) on average. Number of pointers per element: O(1) on average. [m = number of elements in the skip list]
[Figure: search path for 'R' through levels 2, 1, 0 of the example list.]
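For concreteness, a compact in-memory skip list showing the top-down search walk the slide traces for 'R' (an illustrative, roughly Pugh-style implementation; class and method names are mine):

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)   # one right pointer per level

class SkipList:
    MAX = 16

    def __init__(self):
        self.head = Node(None, self.MAX)
        self.level = 0

    def insert(self, key):
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX:
            lvl += 1
        self.level = max(self.level, lvl)
        node, cur = Node(key, lvl), self.head
        for i in range(self.level, -1, -1):
            while cur.forward[i] and cur.forward[i].key < key:
                cur = cur.forward[i]
            if i <= lvl:                       # splice the new node in at this level
                node.forward[i] = cur.forward[i]
                cur.forward[i] = node

    def search(self, key):
        cur = self.head
        for i in range(self.level, -1, -1):    # start at the top level, drop down
            while cur.forward[i] and cur.forward[i].key < key:
                cur = cur.forward[i]
        cur = cur.forward[0]
        return cur is not None and cur.key == key

sl = SkipList()
for k in "AGJMRW":
    sl.insert(k)
print(sl.search("R"), sl.search("B"))          # True False
```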

23 Skip lists for P2P?
Advantages: O(log m) expected search time; retains locality; supports dynamic additions/deletions.
Disadvantages: cannot reduce the load on top-level elements; cannot survive partitioning by failures.
Problem: lack of redundancy.

24 A skip graph [Aspnes-Shah, SODA 2003]
Each element has a random membership vector; it links at level i to elements whose membership vectors match its own in a prefix of length i.
Average O(log m) pointers per element. [m = number of resources]
[Figure: elements A, J, M, G, W, R with membership vectors 000, 001, 011, 100, 101, 110; level 0 is a single list of all elements, level 1 splits into the lists {A, J, M} and {G, R, W}, level 2 splits further by two-bit prefixes.]
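How the membership vectors induce the level lists can be shown with the slide's own six elements (a toy, centralized grouping for illustration, not a distributed implementation):

```python
from itertools import groupby

# Membership vectors from the slide's example.
members = {"A": "000", "J": "001", "M": "011", "G": "100", "W": "101", "R": "110"}

def level_lists(members, level):
    """Group elements into the level-`level` lists: all elements whose
    membership vectors share the same first `level` bits."""
    keyed = sorted(members, key=lambda k: (members[k][:level], k))
    return {prefix: list(group)
            for prefix, group in groupby(keyed, key=lambda k: members[k][:level])}

for i in range(3):
    print(i, level_lists(members, i))
```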

25 Search: expected O(log m)
Same performance as skip lists and DHTs. Restricting attention to the lists containing the starting element of the search, we get a skip list.
[Figure: search path through levels 2, 1, 0 of the example skip graph.]
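A sketch of the search walk over the same toy data (it recomputes each level list centrally for brevity, so only the walk, not the message cost, reflects the real protocol; function names are hypothetical):

```python
members = {"A": "000", "J": "001", "M": "011", "G": "100", "W": "101", "R": "110"}
max_level = 3

def neighbors(x, level):
    """Elements sharing x's first `level` membership bits, in key order."""
    prefix = members[x][:level]
    return sorted(k for k in members if members[k][:level] == prefix)

def skipgraph_search(start, target):
    cur = start
    for level in range(max_level, -1, -1):     # top level down
        row = neighbors(cur, level)
        i = row.index(cur)
        if cur < target:                       # move right without overshooting
            while i + 1 < len(row) and row[i + 1] <= target:
                i += 1
        else:                                  # move left without overshooting
            while i - 1 >= 0 and row[i - 1] >= target:
                i -= 1
        cur = row[i]
        if cur == target:
            return cur
    return cur                                 # closest element if target is absent

print(skipgraph_search("A", "R"))
```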

26 Resources vs. nodes
Skip graphs: elements are resources. DHTs: elements are nodes.
This does not affect search performance or load balancing, but it increases the number of pointers at each node.
[Figure: the same physical network viewed as a DHT over nodes and as a skip graph over resources A..E at level 0.]

27 SkipNet [HJSTW '03]
[Figure: a level-0 list ordered by machine name (com.apple, com.sun, com.ibm/m1..m4, com.microsoft, ...), combined with a distributed hash table over the stored documents (a.htm, f.htm, g.htm, r.htm, ...).]

28 So far...
Data availability, decentralization, scalability, load balancing, fault tolerance, network maintenance, dynamic node addition/deletion, repair mechanism, efficient searching, incorporating proximity, incorporating locality.

29 Insertion - 1
Starting at the buddy element, find the nearest key at level 0: a range query looking for the key closest to the new key. Takes O(log m) on average.
[Figure: new element J (001) locates its level-0 position among A, G, M, R, W via its buddy.]

30 Insertion - 2
At each successive level, search for elements with a matching membership-vector prefix of increasing length and link in. Adds O(1) time per level; total time for insertion is O(log m), the same as most DHTs.
[Figure: J (001) links into the level-1 and level-2 lists that match its prefix.]
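A dictionary-model sketch of the insertion just described (the buddy-based level-0 search of the previous slide is elided, and the real structure splices doubly linked lists via messages; the names and the printout are mine):

```python
import random

# Existing elements and their membership vectors (slide's example, J about to join).
members = {"A": "000", "G": "100", "M": "011", "R": "110", "W": "101"}

def insert(key, bits=3):
    """Phase 1 (not shown): use the buddy to find key's level-0 position.
    Phase 2: at each level i, link next to elements sharing the first i
    membership-vector bits."""
    vec = "".join(random.choice("01") for _ in range(bits))
    members[key] = vec
    for i in range(bits + 1):
        peers = sorted(k for k in members if members[k][:i] == vec[:i])
        j = peers.index(key)
        left = peers[j - 1] if j > 0 else None
        right = peers[j + 1] if j + 1 < len(peers) else None
        print(f"level {i}: {left} <- {key} ({vec[:i] or 'any'}) -> {right}")

insert("J")
```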

31 So far...
Data availability, decentralization, scalability, load balancing, fault tolerance, network maintenance, dynamic node addition/deletion, repair mechanism, efficient searching, incorporating proximity, incorporating locality.

32 Locality and range queries
Find any key F. Find the largest key < F. Find the least key > F. Find all keys in the interval [D..O]. Initial element insertion is at level 0.
[Figure: level-0 list A D F I L O S.]
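A sketch of the interval query on the slide's keys, with bisect standing in for the O(log m) skip-graph search that locates the left end of the range:

```python
import bisect

level0 = ["A", "D", "F", "I", "L", "O", "S"]   # slide's keys in level-0 order

def range_query(keys, lo, hi):
    """Locate the left end of [lo..hi] (stand-in for the O(log m) search),
    then walk right along level 0 collecting keys until hi is passed."""
    i = bisect.bisect_left(keys, lo)
    out = []
    while i < len(keys) and keys[i] <= hi:
        out.append(keys[i])
        i += 1
    return out

print(range_query(level0, "D", "O"))           # ['D', 'F', 'I', 'L', 'O']
```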

33 Further applications of locality
1. Version control: e.g. find the latest news before today by finding the largest key < news:05/14.
[Figure: level-0 keys news:01/31, news:03/01, news:03/18, news:04/03.]

34 2. Data replication: e.g. find any copy of some Britney Spears song by searching for britney*. Provides hot-spot management and survivability.
[Figure: replicas britney02, britney03, britney04 adjacent at level 0 and spread across the higher levels.]
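One way to realize the britney* lookup is a "smallest key >= prefix" search followed by a prefix check, since the replicas sit next to each other at level 0 (the surrounding key names below are made up around the slide's example):

```python
import bisect

level0 = ["abba01", "britney02", "britney03", "britney04", "coldplay01"]

def find_any(keys, prefix):
    """Search for the smallest key >= prefix; if it starts with the prefix,
    it is one of the (adjacent) replicas."""
    i = bisect.bisect_left(keys, prefix)
    if i < len(keys) and keys[i].startswith(prefix):
        return keys[i]
    return None

print(find_any(level0, "britney"))   # one replica, here 'britney02'
```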

35 What's left?
Data availability, decentralization, scalability, load balancing, fault tolerance, network maintenance, dynamic node addition/deletion, repair mechanism, efficient searching, incorporating proximity, incorporating locality.

36 Fault tolerance
How do failures affect skip graph performance?
Random failures: randomly chosen elements fail; experimental results. [Experiments may not give the worst failure pattern.]
Adversarial failures: an adversary carefully chooses the elements that fail; theoretical results.

37 Random failures
[Figure: effect of random element failures on the skip graph; 131072 elements.]

38 Searches with random failures
[Figure: fraction of failed searches under random failures; 131072 elements, 10000 messages, non-faulty source and target.]

39 Adversarial failures
Theorem: a skip graph with m elements has expansion ratio Ω(1/log m) whp.
δA = elements adjacent to A but not in A; expansion ratio = min |δA|/|A| over 1 <= |A| <= m/2.
Hence f failures can isolate only O(f log m) elements: to isolate a set A the adversary needs at least |δA| >= |A|·(1/log m) failures.

40 Need for a repair mechanism
Node failures can leave the skip graph in an inconsistent state.
[Figure: example skip graph levels 0-2 after failures.]

41 Basic repair action
If an element detects a missing neighbor, it tries to patch the link using the other levels, and also relinks at the lower levels. Eventually each connected component of the disrupted skip graph reorganizes itself into an ideal skip graph.
[Figure: an element patching a broken link around a failed neighbor.]
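A toy, centralized sketch of one such patch, assuming a table of right pointers per level and a liveness test (hypothetical data model: the walk below reads the failed element's old pointer, which a real distributed implementation would have to obtain differently, e.g. from its own links at other levels and by messaging live elements):

```python
def patch_right(x, i, right, vectors, alive):
    """right[level] maps element -> its right neighbor at that level (None at the end).
    Walk right along level i-1 until a live element sharing x's first i
    membership-vector bits is found, and point x's level-i link at it."""
    cur = right[i - 1].get(x)
    while cur is not None and not (alive(cur) and vectors[cur][:i] == vectors[x][:i]):
        cur = right[i - 1].get(cur)
    right[i][x] = cur
    return cur

vectors = {"A": "000", "J": "001", "M": "011", "G": "100"}
right = {
    0: {"A": "G", "G": "J", "J": "M", "M": None},   # toy level-0 order: A G J M
    1: {"A": "J", "J": "M", "M": None},             # level-1 list with prefix '0': A J M
}
alive = lambda x: x != "J"                          # suppose J has failed
print(patch_right("A", 1, right, vectors, alive))   # A's level-1 right neighbor becomes 'M'
```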

42 Ideal skip graph
Let xR_i (xL_i) be the right (left) neighbor of x at level i.
Invariant: if xL_i and xR_i exist, then xL_i < x < xR_i, and (xL_i)R_i = (xR_i)L_i = x.
Successor constraints: xR_i = (xR_{i-1})^k for some k, and xL_i = (xL_{i-1})^{k'} for some k' (a level-i neighbor is reached by repeating level-(i-1) hops).
[Figure: x, xR_{i-1}, and xR_i at levels i and i-1, with membership-vector prefixes ..00.., ..01.., ..00..]

43 Constraint violation
A neighbor at level i is not present at level (i-1); the repair action merges the affected lists.
[Figure: lists at levels i-1, i, i+1 with prefixes ..00.. and ..01.. before and after the merge.]

44 Additional properties
1. Low network congestion. 2. No need to know the key space.

45 Network congestion
We are interested in the average traffic through any element u, i.e. the number of searches from a source s to a destination t that use u.
Theorem: let dist(u, t) = d. Then the probability that a search from s to t passes through u is < 2/(d+1), where V = {elements v : u <= v <= t} and |V| = d+1.
Elements near a popular target get loaded, but the effect drops off rapidly.

46 Predicted vs. real load
[Figure: fraction of messages vs. element location (76400-76600) for destination 76500; predicted load vs. actual load.]

47 Knowledge of key space
DHTs require knowledge of the key-space size initially; skip graphs do not. Old elements extend their membership vectors as required when new elements arrive.
[Figure: inserting J between E and Z makes the existing elements generate additional membership-vector bits for the new levels.]
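A small sketch of lazily growing membership vectors (hypothetical Element class; the point is only that no global key-space size or vector length needs to be fixed up front):

```python
import random

class Element:
    """Membership-vector bits are generated lazily: an element appends a fresh
    random bit only when asked for a longer prefix than it has stored."""
    def __init__(self, key):
        self.key, self.bits = key, ""

    def prefix(self, length):
        while len(self.bits) < length:     # extend on demand with a new random bit
            self.bits += random.choice("01")
        return self.bits[:length]

e, z = Element("E"), Element("Z")
print(e.prefix(1), z.prefix(1))            # existing levels
print(e.prefix(3))                         # E grows its vector when a deeper level is needed
```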

48 Similarities with DHTs
Data availability, decentralization, scalability, load balancing, fault tolerance [random failures], network maintenance, dynamic node addition/deletion, repair mechanism, efficient searching, incorporating proximity, incorporating locality.

49 Differences
Property | DHTs | Skip graphs
Tolerance of adversarial faults | Not yet | Yes
Locality | No | Yes
Key space size | Required | Not required
Proximity | Partially | No

50 Open problems
Design a more efficient repair mechanism. Incorporate proximity. Study the effect of Byzantine/selfish behavior. Provide locality and state minimization.
Some promising approaches: composition of data structures [AS'03, ZSZ'03]; locality-sensitive hashing as a tool [LS'96, IMRV'97].

51 Questions, comments, criticisms

