
1 Sylvia Ratnasamy (UC Berkeley Dissertation 2002) Paul Francis Mark Handley Richard Karp Scott Shenker A Scalable, Content Addressable Network Slides by the authors above with additions/modifications by Michael Isaacs (UCSB) Tony Hannan (Georgia Tech)

2 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

3 Hash tables – essential building block in software systems Internet-scale distributed hash tables – equally valuable to large-scale distributed systems peer-to-peer systems – Napster, Gnutella, Groove, FreeNet, MojoNation… large-scale storage management systems – Publius, OceanStore, PAST, Farsite, CFS... mirroring on the Web Internet-scale hash tables

4 Content-Addressable Network (CAN) CAN: Internet-scale hash table Interface – insert (key, value) – value = retrieve (key) Properties – scalable – operationally simple – good performance Related systems: Chord, Plaxton/Tapestry...

5 Problem Scope provide the hashtable interface scalability robustness performance

6 Not in scope security anonymity higher level primitives – keyword searching – mutable content

7 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

8 CAN: basic idea [figure: a (K, V) hash table distributed across the nodes]

9 CAN: basic idea: insert(K1, V1) [figure]

10 CAN: basic idea: insert(K1, V1) routed through the system [figure]

11 CAN: basic idea: (K1, V1) now stored at the responsible node [figure]

12 CAN: basic idea: retrieve(K1) [figure]

13 CAN: solution Virtual Cartesian coordinate space Entire space is partitioned amongst all the nodes – every node “owns” a zone in the overall space Abstraction – can store data at “points” in the space – can route from one “point” to another Point = node that owns the enclosing zone

14–18 CAN: simple example [figures: the coordinate space is progressively partitioned among nodes 1, 2, 3, 4, ... as they join]

19–20 CAN: simple example: node I::insert(K,V) [figures: node I, which initiates the insert]

21–24 CAN: simple example: node I::insert(K,V): (1) a = hx(K), b = hy(K) (2) route (K,V) toward the point (a,b) (3) the node owning the zone containing (a,b) stores (K,V) [figures]

25 CAN: simple example: node J::retrieve(K): (1) a = hx(K), b = hy(K) (2) route "retrieve(K)" to (a,b); the node owning that zone returns V [figure]
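As a concrete illustration of slides 21–25, here is a minimal Python sketch of the key-to-point mapping that insert and retrieve share; the SHA-1-based hash and the unit square are assumptions for illustration, not the paper's exact functions.

```python
import hashlib

def _h(key: str, axis: str) -> float:
    """Hash a key onto one axis of the unit coordinate space, in [0, 1)."""
    digest = hashlib.sha1((axis + ":" + key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def key_to_point(key: str) -> tuple[float, float]:
    """Deterministic (hx(K), hy(K)) point used by both insert and retrieve."""
    return (_h(key, "x"), _h(key, "y"))

# insert(K, V): route (K, V) toward key_to_point(K); the zone owner stores it.
# retrieve(K): route a lookup to the same point; the zone owner returns V.
print(key_to_point("example-key"))
```

Because both operations hash the key the same way, any node can look up a value without knowing which node stored it.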

26 Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)

27 CAN: routing table

28 CAN: routing [figure: a message routed hop by hop from a node at (x,y) toward the point (a,b)]

29 A node only maintains state for its immediate neighboring nodes CAN: routing
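A rough Python sketch of the greedy forwarding rule these figures illustrate: each hop passes the message to the neighbor nearest the destination coordinates. Representing each neighbor's zone by its midpoint is a simplification, not the paper's exact rule.

```python
import math

def next_hop(my_point, neighbors, dest):
    """One greedy CAN routing step (sketch): forward to the neighbor whose
    zone lies closest to the destination point.

    `neighbors` maps a neighbor id to the midpoint of its zone; comparing
    zone midpoints (rather than full zones) is an assumed simplification.
    """
    best = min(neighbors, key=lambda n: math.dist(neighbors[n], dest))
    # Forward only if that neighbor is actually closer to dest than we are.
    if math.dist(neighbors[best], dest) < math.dist(my_point, dest):
        return best
    return None  # we own the destination zone (or cannot make progress)
```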

30–31 CAN: node insertion: 1) the new node discovers some node "I" already in the CAN (via a bootstrap node) [figures]

32 CAN: node insertion: 2) the new node picks a random point (p,q) in the space [figure]

33 CAN: node insertion: 3) I routes to (p,q) and discovers node J, the owner of that zone [figure]

34 CAN: node insertion: 4) split J's zone in half; the new node owns one half [figure]

35 Inserting a new node affects only a single other node and its immediate neighbors CAN: node insertion
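A simplified Python sketch of step 4, splitting J's zone in half for the new node; the Zone representation and the explicit split dimension are illustrative assumptions (in the paper the split dimension follows a fixed ordering so zones can later be re-merged).

```python
from dataclasses import dataclass

@dataclass
class Zone:
    lo: tuple[float, float]  # lower-left corner
    hi: tuple[float, float]  # upper-right corner

def split_zone(zone: Zone, dim: int) -> tuple[Zone, Zone]:
    """Split a 2-d zone in half along dimension `dim` (0 = x, 1 = y).

    The existing node J keeps one half; the joining node takes the other,
    along with the (key, value) pairs whose hashed points fall inside it.
    """
    mid = (zone.lo[dim] + zone.hi[dim]) / 2
    hi_a = list(zone.hi); hi_a[dim] = mid   # J's half: shrink the upper bound
    lo_b = list(zone.lo); lo_b[dim] = mid   # new node's half: raise the lower bound
    return Zone(zone.lo, tuple(hi_a)), Zone(tuple(lo_b), zone.hi)

print(split_zone(Zone((0.0, 0.0), (1.0, 1.0)), dim=0))
```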

36 CAN: node failures Need to repair the space – recover database state: updates use replication, rebuild database from replicas – repair routing: takeover algorithm

37 CAN: takeover algorithm Nodes periodically send update messages to neighbors An absent update message indicates a failure – Each node sends a recover message to the takeover node – The takeover node takes over the failed node's zone

38 CAN: takeover node The takeover node is the node whose VID is numerically closest to the failed node's VID. The VID is a binary string indicating the path from the root to the zone in the binary partition tree. [figure: zones labeled with VIDs such as 0, 10, 11, 100, 101, and the corresponding partition tree]
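A hedged Python sketch of picking the takeover node by VID; padding the binary strings to a common width before comparing them numerically is an assumption about how "numerically closest" is evaluated, not a detail given on the slide.

```python
def takeover_node(failed_vid: str, live_vids: list[str]) -> str:
    """Pick the live node whose VID is numerically closest to the failed
    node's VID. VIDs are binary strings ('0', '11', '100', ...) naming the
    path from the root of the binary partition tree down to the zone."""
    width = max(len(v) for v in live_vids + [failed_vid])
    value = lambda v: int(v.ljust(width, "0"), 2)  # pad to common width (assumed)
    return min(live_vids, key=lambda v: abs(value(v) - value(failed_vid)))

print(takeover_node("101", ["0", "100", "111"]))  # -> '100' in this toy example
```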

39 CAN: VID predecessor / successor To handle multiple neighbors failing simultaneously, each node maintains its predecessor and successor in the VID ordering (idea borrowed from Chord). This guarantees that the takeover node can always be found.

40 Only the failed node’s immediate neighbors are required for recovery CAN: node failures

41 Basic Design recap Basic CAN – completely distributed – self-organizing – nodes only maintain state for their immediate neighbors

42 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

43 Design Improvement Objectives Reduce latency of CAN routing – Reduce CAN path length – Reduce per-hop latency (IP length/latency per CAN hop) Improve CAN robustness – data availability – routing

44 Design Improvement Tradeoffs Routing performance and system robustness vs. increased per-node state and system complexity

45 Design Improvement Techniques 1. Increase number of dimensions in coordinate space 2. Multiple coordinate spaces (realities) 3. Better CAN routing metrics (measure & consider time) 4. Overloading coordinate zones (peer nodes) 5. Multiple hash functions 6. Geographic-sensitive construction of CAN overlay network (distributed binning) 7. More uniform partitioning (when new nodes enter) 8. Caching & Replication techniques for “hot-spots” 9. Background load balancing algorithm

46 1. Increase dimensions

47 2. Multiple realities Multiple (r) coordinate spaces Each node maintains a different zone in each reality Contents of hash table replicated on each reality Routing chooses neighbor closest in any reality (can make large jumps towards target) Advantages: – Greater data availability – Fewer hops

48 Multiple realities

49 Multiple dimensions vs realities

50 3. Time-sensitive routing Each node measures round-trip time (RTT) to each of its neighbors Routing message is forwarded to the neighbor with the maximum ratio of progress to RTT Advantage: – reduces per-hop latency (25-40% depending on # dimensions)
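A Python sketch of the RTT-weighted rule: among neighbors that make progress toward the destination, forward to the one with the best ratio of progress to measured RTT. Measuring progress as the reduction in Euclidean distance is an assumption.

```python
import math

def rtt_weighted_next_hop(my_point, neighbors, dest):
    """Forward to the neighbor maximizing progress / RTT (sketch).

    `neighbors` maps a neighbor id to (zone_midpoint, rtt_seconds);
    progress = how much closer that neighbor is to dest than we are.
    """
    here = math.dist(my_point, dest)
    best, best_ratio = None, 0.0
    for n, (point, rtt) in neighbors.items():
        progress = here - math.dist(point, dest)
        if progress > 0 and progress / rtt > best_ratio:
            best, best_ratio = n, progress / rtt
    return best  # None if no neighbor makes progress toward dest
```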

51 4. Zone overloading Node maintains list of "peers" in same zone Zone not split until MAXPEERS nodes are there Only the lowest-RTT node in each neighboring zone is maintained as a neighbor Replication or splitting of data across peers Advantages – Reduced per-hop latency – Reduced path length (because sharing zones has the same effect as reducing the total number of nodes) – Improved fault tolerance (a zone is vacant only when all its nodes fail)

52 Zone overloading [figure: results averaged over CANs of 2^8 to 2^18 nodes]

53 5. Multiple hash functions (key, value) mapped to k different points (and nodes) Parallel query execution on the k paths Or choose the closest one in the space Advantages: – improve data availability – reduce path length
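A sketch of the k-hash-function idea, reusing the salted-hash style from the earlier insert example; the salting scheme and the k = 3 default are assumptions.

```python
import hashlib

def replica_point(key: str, i: int) -> tuple[float, float]:
    """Map a key to its i-th replica point by salting the hash (assumed scheme)."""
    def h(axis: str) -> float:
        d = hashlib.sha1(f"{i}:{axis}:{key}".encode()).digest()
        return int.from_bytes(d[:8], "big") / 2**64
    return (h("x"), h("y"))

def replica_points(key: str, k: int = 3) -> list[tuple[float, float]]:
    """All k points at which (key, value) is stored: a query can be sent to
    all of them in parallel, or routed to whichever point lies closest to
    the querying node's own coordinates."""
    return [replica_point(key, i) for i in range(k)]

print(replica_points("example-key"))
```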

54 Multiple hash functions

55 6. Geographically distributed binning Distributed binning of CAN nodes based on their relative distances from m IP landmarks (such as root DNS servers) m! partitions of coordinate space New node joins partition associated with its landmark ordering Topologically close nodes will be assigned to the same partition Nodes in same partition divide the space up randomly Disadvantage: coordinate space is no longer uniformly populated – background load balancing could alleviate this problem
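A Python sketch of the binning step: a joining node orders the m landmarks by measured RTT and joins the portion of the space assigned to that ordering. The landmark names and RTT values below are placeholders, not measurements.

```python
def landmark_bin(rtts_ms: dict[str, float]) -> tuple[str, ...]:
    """Return this node's bin: the landmarks sorted by measured RTT.

    Nodes that produce the same ordering fall into the same bin and join the
    same partition of the coordinate space, so topologically close nodes tend
    to own nearby zones. With m landmarks there are m! possible bins.
    """
    return tuple(sorted(rtts_ms, key=rtts_ms.get))

# Example with m = 3 landmarks (3! = 6 possible orderings / partitions):
print(landmark_bin({"L1": 42.0, "L2": 8.5, "L3": 110.0}))  # ('L2', 'L1', 'L3')
```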

56 Distributed binning

57 7. More uniform partitioning When adding a node to a zone, choose neighbor with largest volume to split Advantage – 85% of nodes have the average load as opposed to 40% without, and largest zone volume dropped from 8x to 2x the average

58 8. Caching & Replication A node can cache values of keys it has recently requested. A node can push out "hot" keys to neighbors.

59 9. Load balancing Merge and split zones to balance loads

60 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

61 Evaluation Metrics 1. Path length 2. Neighbor state 3. Latency 4. Volume 5. Routing fault tolerance 6. Data availability

62 CAN: scalability For a uniformly partitioned space with n nodes and d dimensions – per node, number of neighbors is 2d – average routing path is (d/4)(n^(1/d)) hops Simulations show that the above results hold in practice (Transit-Stub topology: GT-ITM gen) Can scale the network without increasing per-node state
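A quick worked instance of these bounds; the numbers are chosen only for illustration.

```latex
d = 10,\quad n = 2^{20}\ \text{nodes}
\;\Rightarrow\;
\text{neighbors per node} = 2d = 20,
\qquad
\text{avg. path length} = \frac{d}{4}\,n^{1/d} = \frac{10}{4}\cdot 2^{2} = 10\ \text{hops}.
```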

63 CAN: latency latency stretch = (CAN routing delay) / (IP routing delay) To reduce latency, apply design improvements: – increase dimensions (d) – multiple nodes per zone (peer nodes) (p) – multiple "realities", i.e. coordinate spaces (r) – multiple hash functions (k) – use RTT-weighted routing – distributed binning – uniform partitioning – caching & replication (for "hot spots")

64 Overall improvement

65 CAN: Robustness Completely distributed – no single point of failure Not exploring database recovery Resilience of routing – can route around trouble

66–69 Routing resilience [figures: a message routed from a source node to a destination; when nodes along the path fail, the message is routed around the failed zones]

70 Routing resilience Node X::route(D): if X cannot make progress towards D – check if any neighbor of X can make progress – if yes, forward the message to one such neighbor
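A Python sketch of this fallback, under one interpretation of the slide: if none of X's live neighbors is closer to D (for example because the natural next hop has failed), X hands the message to any neighbor that reports it can still make progress. The can_make_progress callback is a placeholder for that query.

```python
import math

def resilient_next_hop(my_point, dest, neighbors, can_make_progress):
    """Node X::route(D) with the fallback from the slide (sketch).

    `neighbors` maps live neighbor ids to their zone midpoints;
    `can_make_progress(n)` stands in for asking neighbor n whether it has a
    live neighbor closer to dest.
    """
    here = math.dist(my_point, dest)
    closer = [n for n, p in neighbors.items() if math.dist(p, dest) < here]
    if closer:  # normal greedy case: some live neighbor makes progress
        return min(closer, key=lambda n: math.dist(neighbors[n], dest))
    # X cannot make progress itself: try any neighbor that says it can.
    for n in neighbors:
        if can_make_progress(n):
            return n
    return None  # no way around the failure from this node
```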

71

72 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

73 Related Algorithms Distance Vector (DV) and Link State (LS) in IP routing – require widespread dissemination of topology – not well suited to dynamic topology like file-sharing Hierarchical routing algorithms – too centralized, single points of stress Other DHTs – Plaxton, Tapestry, Pastry, Chord

74 Related Systems DNS OceanStore (uses Tapestry) Publius: Web-publishing system with high anonymity P2P File sharing systems – Napster, Gnutella, KaZaA, Freenet

75 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

76 Conclusion: Purpose CAN – an Internet-scale hash table – potential building block in Internet applications

77 Conclusion: Scalability O(d) per-node state O(d * n^(1/d)) path length For d = (log n)/2, path length and per-node state are O(log n) like other DHTs
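Spelling out that last step (logs base 2):

```latex
d = \tfrac{1}{2}\log_2 n
\;\Rightarrow\;
n^{1/d} = 2^{(\log_2 n)/d} = 2^{2} = 4,
\qquad
\text{path length} = O\!\left(d\,n^{1/d}\right) = O(\log n),
\quad
\text{per-node state} = 2d = O(\log n).
```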

78 Conclusion: Latency With the main design improvements latency is less than twice the IP path latency

79 Conclusion: Robustness Decentralized, can route around trouble With certain design improvements (replication), high data availability

80 Weaknesses / Future Work Security – Denial of service attacks hard to combat because a malicious node can act as a malicious server as well as a malicious client Mutable content Search

