
1 Sylvia Ratnasamy (UC Berkeley Dissertation 2002) Paul Francis Mark Handley Richard Karp Scott Shenker A Scalable, Content Addressable Network Slides by the authors above with additions/modifications by Michael Isaacs (UCSB) Tony Hannan (Georgia Tech)

2 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

3 Hash tables – essential building block in software systems Internet-scale distributed hash tables – equally valuable to large-scale distributed systems peer-to-peer systems – Napster, Gnutella, Groove, FreeNet, MojoNation… large-scale storage management systems – Publius, OceanStore, PAST, Farsite, CFS... mirroring on the Web Internet-scale hash tables

4 Content-Addressable Network (CAN) CAN: Internet-scale hash table Interface – insert (key, value) – value = retrieve (key) Properties – scalable – operationally simple – good performance Related systems: Chord, Plaxton/Tapestry...

5 Problem Scope provide the hashtable interface scalability robustness performance

6 Not in scope security anonymity higher level primitives – keyword searching – mutable content

7 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

8 CAN: basic idea [figure: a (K, V) hash table distributed across the nodes]

9 CAN: basic idea: insert(K1, V1) [figure]

10 CAN: basic idea: insert(K1, V1) routed through the system [figure]

11 CAN: basic idea: (K1, V1) now stored at the responsible node [figure]

12 CAN: basic idea: retrieve(K1) [figure]

13 CAN: solution Virtual Cartesian coordinate space Entire space is partitioned amongst all the nodes – every node “owns” a zone in the overall space Abstraction – can store data at “points” in the space – can route from one “point” to another Point = node that owns the enclosing zone

14–18 CAN: simple example [figures: the coordinate space is progressively partitioned among nodes 1, 2, 3, 4, ... as they join]

19–20 CAN: simple example: node I::insert(K,V) [figures: node I, which initiates the insert]

21–24 CAN: simple example: node I::insert(K,V): (1) a = hx(K), b = hy(K) (2) route (K,V) toward the point (a,b) (3) the node owning the zone containing (a,b) stores (K,V) [figures]

25 CAN: simple example: node J::retrieve(K): (1) a = hx(K), b = hy(K) (2) route "retrieve(K)" to (a,b); the node owning that zone returns V [figure]
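As a concrete illustration of slides 21–25, here is a minimal Python sketch of the key-to-point mapping that insert and retrieve share; the SHA-1-based hash and the unit square are assumptions for illustration, not the paper's exact functions.

```python
import hashlib

def _h(key: str, axis: str) -> float:
    """Hash a key onto one axis of the unit coordinate space, in [0, 1)."""
    digest = hashlib.sha1((axis + ":" + key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def key_to_point(key: str) -> tuple[float, float]:
    """Deterministic (hx(K), hy(K)) point used by both insert and retrieve."""
    return (_h(key, "x"), _h(key, "y"))

# insert(K, V): route (K, V) toward key_to_point(K); the zone owner stores it.
# retrieve(K): route a lookup to the same point; the zone owner returns V.
print(key_to_point("example-key"))
```

Because both operations hash the key the same way, any node can look up a value without knowing which node stored it.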

26 Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)

27 CAN: routing table

28 CAN: routing [figure: a message routed hop by hop from a node at (x,y) toward the point (a,b)]

29 A node only maintains state for its immediate neighboring nodes CAN: routing
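A rough Python sketch of the greedy forwarding rule these figures illustrate: each hop passes the message to the neighbor nearest the destination coordinates. Representing each neighbor's zone by its midpoint is a simplification, not the paper's exact rule.

```python
import math

def next_hop(my_point, neighbors, dest):
    """One greedy CAN routing step (sketch): forward to the neighbor whose
    zone lies closest to the destination point.

    `neighbors` maps a neighbor id to the midpoint of its zone; comparing
    zone midpoints (rather than full zones) is an assumed simplification.
    """
    best = min(neighbors, key=lambda n: math.dist(neighbors[n], dest))
    # Forward only if that neighbor is actually closer to dest than we are.
    if math.dist(neighbors[best], dest) < math.dist(my_point, dest):
        return best
    return None  # we own the destination zone (or cannot make progress)
```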

30–31 CAN: node insertion: 1) the new node discovers some node "I" already in the CAN (via a bootstrap node) [figures]

32 CAN: node insertion: 2) the new node picks a random point (p,q) in the space [figure]

33 CAN: node insertion: 3) I routes to (p,q) and discovers node J, the owner of that zone [figure]

34 CAN: node insertion: 4) split J's zone in half; the new node owns one half [figure]

35 Inserting a new node affects only a single other node and its immediate neighbors CAN: node insertion
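A simplified Python sketch of step 4, splitting J's zone in half for the new node; the Zone representation and the explicit split dimension are illustrative assumptions (in the paper the split dimension follows a fixed ordering so zones can later be re-merged).

```python
from dataclasses import dataclass

@dataclass
class Zone:
    lo: tuple[float, float]  # lower-left corner
    hi: tuple[float, float]  # upper-right corner

def split_zone(zone: Zone, dim: int) -> tuple[Zone, Zone]:
    """Split a 2-d zone in half along dimension `dim` (0 = x, 1 = y).

    The existing node J keeps one half; the joining node takes the other,
    along with the (key, value) pairs whose hashed points fall inside it.
    """
    mid = (zone.lo[dim] + zone.hi[dim]) / 2
    hi_a = list(zone.hi); hi_a[dim] = mid   # J's half: shrink the upper bound
    lo_b = list(zone.lo); lo_b[dim] = mid   # new node's half: raise the lower bound
    return Zone(zone.lo, tuple(hi_a)), Zone(tuple(lo_b), zone.hi)

print(split_zone(Zone((0.0, 0.0), (1.0, 1.0)), dim=0))
```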

36 CAN: node failures Need to repair the space – recover database state: updates use replication, rebuild database from replicas – repair routing: takeover algorithm

37 CAN: takeover algorithm Nodes periodically send update messages to neighbors An absent update message indicates a failure – Each node sends a recover message to the takeover node – The takeover node takes over the failed node's zone

38 CAN: takeover node The takeover node is the node whose VID is numerically closest to the failed node's VID. The VID is a binary string indicating the path from the root to the zone in the binary partition tree. [figure: zones labeled with VIDs such as 0, 10, 11, 100, 101, and the corresponding partition tree]
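A hedged Python sketch of picking the takeover node by VID; padding the binary strings to a common width before comparing them numerically is an assumption about how "numerically closest" is evaluated, not a detail given on the slide.

```python
def takeover_node(failed_vid: str, live_vids: list[str]) -> str:
    """Pick the live node whose VID is numerically closest to the failed
    node's VID. VIDs are binary strings ('0', '11', '100', ...) naming the
    path from the root of the binary partition tree down to the zone."""
    width = max(len(v) for v in live_vids + [failed_vid])
    value = lambda v: int(v.ljust(width, "0"), 2)  # pad to common width (assumed)
    return min(live_vids, key=lambda v: abs(value(v) - value(failed_vid)))

print(takeover_node("101", ["0", "100", "111"]))  # -> '100' in this toy example
```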

39 CAN: VID predecessor / successor To handle multiple neighbors failing simultaneously, each node maintains its predecessor and successor in the VID ordering (idea borrowed from Chord). This guarantees that the takeover node can always be found.

40 Only the failed node’s immediate neighbors are required for recovery CAN: node failures

41 Basic Design recap Basic CAN – completely distributed – self-organizing – nodes only maintain state for their immediate neighbors

42 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

43 Design Improvement Objectives Reduce latency of CAN routing – Reduce CAN path length – Reduce per-hop latency (IP length/latency per CAN hop) Improve CAN robustness – data availability – routing

44 Design Improvement Tradeoffs Routing performance and system robustness vs. increased per-node state and system complexity

45 Design Improvement Techniques 1. Increase number of dimensions in coordinate space 2. Multiple coordinate spaces (realities) 3. Better CAN routing metrics (measure & consider time) 4. Overloading coordinate zones (peer nodes) 5. Multiple hash functions 6. Geographic-sensitive construction of CAN overlay network (distributed binning) 7. More uniform partitioning (when new nodes enter) 8. Caching & Replication techniques for “hot-spots” 9. Background load balancing algorithm

46 1. Increase dimensions

47 2. Multiple realities Multiple (r) coordinate spaces Each node maintains a different zone in each reality Contents of hash table replicated on each reality Routing chooses neighbor closest in any reality (can make large jumps towards target) Advantages: – Greater data availability – Fewer hops

48 Multiple realities

49 Multiple dimensions vs realities

50 3. Time-sensitive routing Each node measures round-trip time (RTT) to each of its neighbors Routing message is forwarded to the neighbor with the maximum ratio of progress to RTT Advantage: – reduces per-hop latency (25-40% depending on # dimensions)
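A Python sketch of the RTT-weighted rule: among neighbors that make progress toward the destination, forward to the one with the best ratio of progress to measured RTT. Measuring progress as the reduction in Euclidean distance is an assumption.

```python
import math

def rtt_weighted_next_hop(my_point, neighbors, dest):
    """Forward to the neighbor maximizing progress / RTT (sketch).

    `neighbors` maps a neighbor id to (zone_midpoint, rtt_seconds);
    progress = how much closer that neighbor is to dest than we are.
    """
    here = math.dist(my_point, dest)
    best, best_ratio = None, 0.0
    for n, (point, rtt) in neighbors.items():
        progress = here - math.dist(point, dest)
        if progress > 0 and progress / rtt > best_ratio:
            best, best_ratio = n, progress / rtt
    return best  # None if no neighbor makes progress toward dest
```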

51 4. Zone overloading Node maintains list of "peers" in same zone Zone not split until MAXPEERS nodes are there Only the lowest-RTT node in each neighboring zone is maintained as a neighbor Replication or splitting of data across peers Advantages – Reduced per-hop latency – Reduced path length (because sharing zones has the same effect as reducing the total number of nodes) – Improved fault tolerance (a zone is vacant only when all its nodes fail)

52 Zone overloading [figure: results averaged over CANs of 2^8 to 2^18 nodes]

53 5. Multiple hash functions (key, value) mapped to k different points (and nodes) Parallel query execution on the k paths Or choose the closest one in the space Advantages: – improve data availability – reduce path length
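A sketch of the k-hash-function idea, reusing the salted-hash style from the earlier insert example; the salting scheme and the k = 3 default are assumptions.

```python
import hashlib

def replica_point(key: str, i: int) -> tuple[float, float]:
    """Map a key to its i-th replica point by salting the hash (assumed scheme)."""
    def h(axis: str) -> float:
        d = hashlib.sha1(f"{i}:{axis}:{key}".encode()).digest()
        return int.from_bytes(d[:8], "big") / 2**64
    return (h("x"), h("y"))

def replica_points(key: str, k: int = 3) -> list[tuple[float, float]]:
    """All k points at which (key, value) is stored: a query can be sent to
    all of them in parallel, or routed to whichever point lies closest to
    the querying node's own coordinates."""
    return [replica_point(key, i) for i in range(k)]

print(replica_points("example-key"))
```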

54 Multiple hash functions

55 6. Geographically distributed binning Distributed binning of CAN nodes based on their relative distances from m IP landmarks (such as root DNS servers) m! partitions of coordinate space New node joins partition associated with its landmark ordering Topologically close nodes will be assigned to the same partition Nodes in same partition divide the space up randomly Disadvantage: coordinate space is no longer uniformly populated – background load balancing could alleviate this problem
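A Python sketch of the binning step: a joining node orders the m landmarks by measured RTT and joins the portion of the space assigned to that ordering. The landmark names and RTT values below are placeholders, not measurements.

```python
def landmark_bin(rtts_ms: dict[str, float]) -> tuple[str, ...]:
    """Return this node's bin: the landmarks sorted by measured RTT.

    Nodes that produce the same ordering fall into the same bin and join the
    same partition of the coordinate space, so topologically close nodes tend
    to own nearby zones. With m landmarks there are m! possible bins.
    """
    return tuple(sorted(rtts_ms, key=rtts_ms.get))

# Example with m = 3 landmarks (3! = 6 possible orderings / partitions):
print(landmark_bin({"L1": 42.0, "L2": 8.5, "L3": 110.0}))  # ('L2', 'L1', 'L3')
```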

56 Distributed binning

57 7. More uniform partitioning When adding a node to a zone, choose neighbor with largest volume to split Advantage – 85% of nodes have the average load as opposed to 40% without, and largest zone volume dropped from 8x to 2x the average

58 8. Caching & Replication A node can cache values of keys it has recently requested. A node can push out "hot" keys to neighbors.

59 9. Load balancing Merge and split zones to balance loads

60 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

61 Evaluation Metrics 1. Path length 2. Neighbor state 3. Latency 4. Volume 5. Routing fault tolerance 6. Data availability

62 CAN: scalability For a uniformly partitioned space with n nodes and d dimensions – per node, number of neighbors is 2d – average routing path is (d/4)(n^(1/d)) hops Simulations show that the above results hold in practice (Transit-Stub topology: GT-ITM gen) Can scale the network without increasing per-node state
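A quick worked instance of these bounds; the numbers are chosen only for illustration.

```latex
d = 10,\quad n = 2^{20}\ \text{nodes}
\;\Rightarrow\;
\text{neighbors per node} = 2d = 20,
\qquad
\text{avg. path length} = \frac{d}{4}\,n^{1/d} = \frac{10}{4}\cdot 2^{2} = 10\ \text{hops}.
```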

63 CAN: latency latency stretch = (CAN routing delay) / (IP routing delay) To reduce latency, apply design improvements: – increase dimensions (d) – multiple nodes per zone (peer nodes) (p) – multiple "realities", i.e. coordinate spaces (r) – multiple hash functions (k) – use RTT-weighted routing – distributed binning – uniform partitioning – caching & replication (for "hot spots")

64 Overall improvement

65 CAN: Robustness Completely distributed – no single point of failure Not exploring database recovery Resilience of routing – can route around trouble

66–69 Routing resilience [figures: a message routed from a source node to a destination; when nodes along the path fail, the message is routed around the failed zones]

70 Routing resilience Node X::route(D): if X cannot make progress towards D – check if any neighbor of X can make progress – if yes, forward the message to one such neighbor
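A Python sketch of this fallback, under one interpretation of the slide: if none of X's live neighbors is closer to D (for example because the natural next hop has failed), X hands the message to any neighbor that reports it can still make progress. The can_make_progress callback is a placeholder for that query.

```python
import math

def resilient_next_hop(my_point, dest, neighbors, can_make_progress):
    """Node X::route(D) with the fallback from the slide (sketch).

    `neighbors` maps live neighbor ids to their zone midpoints;
    `can_make_progress(n)` stands in for asking neighbor n whether it has a
    live neighbor closer to dest.
    """
    here = math.dist(my_point, dest)
    closer = [n for n, p in neighbors.items() if math.dist(p, dest) < here]
    if closer:  # normal greedy case: some live neighbor makes progress
        return min(closer, key=lambda n: math.dist(neighbors[n], dest))
    # X cannot make progress itself: try any neighbor that says it can.
    for n in neighbors:
        if can_make_progress(n):
            return n
    return None  # no way around the failure from this node
```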

71

72 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

73 Related Algorithms Distance Vector (DV) and Link State (LS) in IP routing – require widespread dissemination of topology – not well suited to dynamic topology like file-sharing Hierarchical routing algorithms – too centralized, single points of stress Other DHTs – Plaxton, Tapestry, Pastry, Chord

74 Related Systems DNS OceanStore (uses Tapestry) Publius: Web-publishing system with high anonymity P2P File sharing systems – Napster, Gnutella, KaZaA, Freenet

75 Outline Introduction Design Design Improvements Evaluation Related Work Conclusion

76 Conclusion: Purpose CAN – an Internet-scale hash table – potential building block in Internet applications

77 Conclusion: Scalability O(d) per-node state O(d * n^(1/d)) path length For d = (log n)/2, path length and per-node state are O(log n) like other DHTs
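Spelling out that last step (logs base 2):

```latex
d = \tfrac{1}{2}\log_2 n
\;\Rightarrow\;
n^{1/d} = 2^{(\log_2 n)/d} = 2^{2} = 4,
\qquad
\text{path length} = O\!\left(d\,n^{1/d}\right) = O(\log n),
\quad
\text{per-node state} = 2d = O(\log n).
```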

78 Conclusion: Latency With the main design improvements latency is less than twice the IP path latency

79 Conclusion: Robustness Decentralized, can route around trouble With certain design improvements (replication), high data availability

80 Weaknesses / Future Work Security – Denial of service attacks hard to combat because a malicious node can act as a malicious server as well as a malicious client Mutable content Search

