Building Peer-to-Peer Systems with Chord, a Distributed Lookup Service
Robert Morris
F. Dabek, E. Brunskill, M. F. Kaashoek, D. Karger, I. Stoica, H. Balakrishnan
MIT / LCS
Notes: Motivation: every p2p system needs a lookup plan to find documents.
Goal: Better Peer-to-Peer Storage
- Lookup is the key problem, and lookup is not easy:
  - GNUtella scales badly
  - Freenet is imprecise
- Chord lookup provides:
  - Good naming semantics and efficiency
  - An elegant base for layered features
[Diagram: an Author calls Insert(name, document) and a Consumer calls Fetch(name) against a cloud of nodes N1..N6 -- which node holds the document?]
Notes: 1000s of nodes; some or all may have crashed. This is a naming problem, and it turns out a good naming solution solves all kinds of other problems too. There is no natural hierarchy.
Chord Architecture
- Interface: lookup(DocumentID) -> NodeID, IP-Address
- Chord consists of:
  - Consistent hashing
  - Small routing tables: log(n)
  - A fast join/leave protocol
Consistent Hashing
[Diagram: circular 7-bit ID space starting at 0, with nodes N32, N90, N105 and documents D20, D80, D120.]
- Not all hash slots exist; store each document at its "successor" node.
- Example: in a 7-bit ID space, the document with ID 80 is stored at the node with ID 90. Node 90 is the "successor" of document 80.
Notes: maybe two slides: 1. circle, 2. successor. Animation?
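The successor rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes a global list of node IDs (real Chord has no such global view), and the helper name `successor` is mine.

```python
M = 7                  # bits in the identifier space
RING = 2 ** M          # 128 slots on the circle

def successor(doc_id, node_ids):
    """First node at or clockwise after doc_id on the circle."""
    for n in sorted(node_ids):
        if n >= doc_id % RING:
            return n
    return min(node_ids)       # wrap around past 0

nodes = [32, 90, 105]          # the slide's N32, N90, N105
print(successor(80, nodes))    # -> 90: D80 is stored at its successor N90
print(successor(120, nodes))   # -> 32: D120 wraps past 0 to N32
```

Note the wrap-around case: D120 has no node at or after it, so it lands on the lowest node, N32.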
Chord Uses log(N) "Fingers"
[Diagram: circular 7-bit ID space; N80's fingers point 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the circle. N80 knows of only seven other nodes.]
- Small tables, but multi-hop lookup.
- Table entries: IP address and Chord ID.
- Navigate in ID space; route queries closer to the successor.
- log(n) table entries, log(n) hops.
Notes: route to a document between 1/4 and 1/2 of the way around...
Chord Finger Table
- Node n's i-th entry: the first node at or after n + 2^(i-1).
N32's Finger Table:
  33..33  N40
  34..35  N40
  36..39  N40
  40..47  N40
  48..63  N52
  64..95  N70
  96..31  N102
[Diagram: nodes N32, N40, N52, N60, N70, N79, N80, N85, N102, N113 on the circle.]
Notes: the finger table actually contains both the ID and the IP address. Explain the ranges in the finger table.
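The rule "entry i is the first node at or after n + 2^(i-1)" can be checked against the slide's table. As before, this is a sketch with a hypothetical global node list and helper names of my own choosing.

```python
M = 7                      # bits in the ID space
RING = 2 ** M

def successor(k, node_ids):
    """First node at or clockwise after ID k."""
    for n in sorted(node_ids):
        if n >= k % RING:
            return n
    return min(node_ids)

def finger_table(n, node_ids):
    """Entry i (1-based) is the first node at or after n + 2^(i-1)."""
    return [successor((n + 2 ** (i - 1)) % RING, node_ids)
            for i in range(1, M + 1)]

# The node set from the slide reproduces N32's table:
# ranges 33, 34, 36, 40, 48, 64, 96 -> N40 x4, N52, N70, N102.
nodes = [32, 40, 52, 60, 70, 79, 80, 85, 102, 113]
print(finger_table(32, nodes))   # -> [40, 40, 40, 40, 52, 70, 102]
```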
Chord Lookup
- Forward to the highest known node preceding the target.
- Done if the document lies between the node and finger[1] (its successor).
- Example: lookup(82) from node 32 finds the successor of document 82 by routing 32 -> 70 -> 80 -> 85.
N32's Finger Table:
  33..33  N40
  34..35  N40
  36..39  N40
  40..47  N40
  48..63  N52
  64..95  N70
  96..31  N102
N70's Finger Table:
  71..71   N79
  72..73   N79
  74..77   N79
  78..85   N80
  86..101  N102
  102..5   N102
  6..69    N32
N80's Finger Table:
  81..81   N85
  82..83   N85
  84..87   N85
  88..95   N102
  96..111  N102
  112..15  N113
  16..79   N32
Notes: keep mentioning the successor.
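The routing rule can be sketched end to end. This is a hypothetical, centralized simulation (real Chord does this with RPCs between nodes, not a shared list); the node set here is slightly reduced from the diagram, chosen so the computed route matches the slide's 32 -> 70 -> 80 -> 85 example.

```python
M = 7
RING = 2 ** M

def in_interval(x, a, b):
    """x in the circular interval (a, b]."""
    return a < x <= b if a < b else x > a or x <= b

def successor(k, node_ids):
    for n in sorted(node_ids):
        if n >= k % RING:
            return n
    return min(node_ids)

def fingers(n, node_ids):
    return [successor((n + 2 ** (i - 1)) % RING, node_ids)
            for i in range(1, M + 1)]

def lookup(start, key, node_ids):
    """Return the path of nodes visited while finding key's successor."""
    n, path = start, [start]
    while True:
        if n == key:                          # n itself stores the key
            return path
        succ = successor((n + 1) % RING, node_ids)
        if in_interval(key, n, succ):         # done: succ stores the key
            path.append(succ)
            return path
        # forward to the highest finger that precedes key
        nxt = n
        for f in fingers(n, node_ids):
            if in_interval(f, n, key) and (nxt == n or in_interval(f, nxt, key)):
                nxt = f
        n = nxt
        path.append(n)

nodes = [32, 40, 52, 70, 80, 85, 102, 113]
print(lookup(32, 82, nodes))   # -> [32, 70, 80, 85]
```

Each hop at least halves the remaining ID-space distance to the target, which is where the log(n) hop count comes from.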
New Node Join Procedure
N20's Finger Table (successors not yet filled in):
  21..21
  22..23
  24..27
  28..35
  36..51
  52..83
  84..19
- Assume N20 knows of *some* existing node.
[Diagram: existing nodes N32, N40, N52, N60, N70, N80, N102, N113.]
New Node Join Procedure (2)
- Node 20 asks any node for the successor of 21, 22, 24, 28, 36, 52, 84.
N20's Finger Table:
  21..21  N32
  22..23  N32
  24..27  N32
  28..35  N32
  36..51  N40
  52..83  N52
  84..19  N102
Notes: similarly, notify the nodes that must include N20 in their tables; e.g. N113's first finger becomes N20, not N32. N20 ONLY needs to talk to one other node to transfer values.
New Node Join Procedure (3)
- Node 20 moves documents from node 32: those with IDs in 114..20 (D114..20).
N20's Finger Table:
  21..21  N32
  22..23  N32
  24..27  N32
  28..35  N32
  36..51  N40
  52..83  N52
  84..19  N102
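The three join slides can be condensed into one sketch. Again this is a hypothetical simulation with my own helper names and a global node list standing in for the real protocol's RPCs: the new node fills its fingers by asking successor queries, then takes over exactly the keys in (predecessor, new_id] from its successor, so only one node-to-node transfer is needed.

```python
M = 7
RING = 2 ** M

def successor(k, node_ids):
    """First existing node at or clockwise after k."""
    for n in sorted(node_ids):
        if n >= k % RING:
            return n
    return min(node_ids)

def in_interval(x, a, b):
    """x in the circular interval (a, b]."""
    return a < x <= b if a < b else x > a or x <= b

def join(new_id, node_ids, store):
    """store maps node ID -> set of document IDs it currently holds."""
    # 1. Ask any existing node for successor(new_id + 2^(i-1)), i = 1..M.
    table = [successor((new_id + 2 ** (i - 1)) % RING, node_ids)
             for i in range(1, M + 1)]
    # 2. Take over the keys in (predecessor, new_id] from the successor.
    succ = successor(new_id, node_ids)
    pred = max((n for n in node_ids if n < new_id), default=max(node_ids))
    moved = {d for d in store[succ] if in_interval(d, pred, new_id)}
    store[succ] -= moved
    store[new_id] = moved
    node_ids.append(new_id)
    return table

nodes = [32, 40, 52, 60, 70, 80, 102, 113]
store = {n: set() for n in nodes}
store[32] = {114, 10, 25}           # N32 holds D114, D10, D25
table = join(20, nodes, store)
print(table)                        # -> [32, 32, 32, 32, 40, 52, 102]
print(sorted(store[20]))            # D114 and D10 (IDs in 114..20) moved to N20
```

The computed table matches the slide's filled-in finger table for N20, and only D114 and D10 (the IDs in the circular range 114..20) move from N32.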
Chord Properties
- log(n) lookup messages and table space.
- Well-defined location for each ID; no search required.
- Natural load balance.
- No name structure imposed.
- Minimal join/leave disruption.
- Does not store documents...
Building Systems with Chord
[Diagram: a Client App (e.g. a browser) calls get(key) / put(k, v) on a Key/Value layer, which calls lookup(id) on Chord; each of the Client and Servers runs both a Key/Value layer and Chord.]
- Features layered on top: data authentication, update authentication, storage, fault tolerance, load balance.
Notes: K/V software runs in every Chord server; the K/V layer talks to Chord and to other K/V servers. The needed features are layered on top of Chord lookup(), often by thinking in terms of naming.
Naming and Authentication
- Name could be a hash of the file content:
  - Easy for the client to verify
  - But an update requires a new file name
- Name could be a public key:
  - Document contains a digital signature
  - Allows verified updates with the same name
- Names probably come from links.
- Inspiration from SFSRO and Freenet.
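The first scheme, content-hash naming, is easy to illustrate: the name is the hash of the bytes, so a client can check anything it fetches against the name it asked for. A minimal sketch, with hypothetical helper names and SHA-1 as an assumed hash choice:

```python
import hashlib

def name_of(content: bytes) -> str:
    """A document's name is the hash of its content."""
    return hashlib.sha1(content).hexdigest()

def verified_fetch(name, fetch):
    """fetch(name) -> bytes; raise if the data does not match its name."""
    data = fetch(name)
    if name_of(data) != name:
        raise ValueError("content does not match its name")
    return data

doc = b"hello, chord"
store = {name_of(doc): doc}        # stand-in for the DHT
print(verified_fetch(name_of(doc), store.get))   # -> b'hello, chord'
```

The downside on the slide follows directly: changing the content changes the hash, hence the name, so updates need a new name (which is what the public-key scheme fixes).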
Naming and Fault Tolerance
[Diagram: two placements of Replica1, Replica2, Replica3 on the circle.]
- ID_i = hash(name + i): replicas land at unrelated points on the circle.
- ID_i = successor^i(hash(name)): replicas on consecutive successor nodes.
Notes: need to notice when a replica dies and find a new one. hash(name + i) is easy to load balance, while the successor scheme sends lookups to the first successor.
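The two placement formulas can be sketched side by side. This is an illustrative sketch only: the string concatenation in `name + str(i)`, the SHA-1 truncation, and the global node list are my assumptions, not the paper's exact construction.

```python
import hashlib

M = 7

def chord_id(s: str, m: int = M) -> int:
    """Map a string into the m-bit circular ID space (truncated SHA-1)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big") % (2 ** m)

def replica_ids_hashed(name, k):
    """ID_i = hash(name + i): replicas at unrelated points (good balance)."""
    return [chord_id(name + str(i)) for i in range(k)]

def replica_ids_successors(name, node_ids, k, m=M):
    """ID_i = successor^i(hash(name)): replicas on k consecutive nodes."""
    ring = sorted(node_ids)
    start = chord_id(name, m)
    idx = next((j for j, n in enumerate(ring) if n >= start), 0)
    return [ring[(idx + i) % len(ring)] for i in range(k)]

print(replica_ids_hashed("mydoc", 3))                     # 3 scattered IDs
print(replica_ids_successors("mydoc", [32, 90, 105], 3))  # 3 consecutive nodes
```

The trade-off on the slide shows up directly: the hashed scheme spreads both storage and lookups, while the successor scheme keeps replicas adjacent (easy to find after a failure) but concentrates lookups on the first successor.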
Naming and Load Balance
[Diagram: a file is split into blocks B1, B2, B3; an inode lists ID_B1, ID_B2, ID_B3.]
- ID_B1 = hash(B1), ID_B2 = hash(B2), ID_B3 = hash(B3)
- ID_inode = hash(name)
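The block-splitting scheme can be sketched as follows, with a plain dict standing in for the DHT; the helper names, tiny block size, and inode representation are hypothetical.

```python
import hashlib

def bid(data: bytes) -> str:
    """A block's ID is the hash of its content: ID_Bi = hash(Bi)."""
    return hashlib.sha1(data).hexdigest()

def insert_file(name, content, store, block_size=2):
    blocks = [content[i:i + block_size]
              for i in range(0, len(content), block_size)]
    inode = []
    for b in blocks:
        store[bid(b)] = b          # blocks scatter over the ring: load balance
        inode.append(bid(b))
    # the inode is found by name: ID_inode = hash(name)
    store[hashlib.sha1(name.encode()).hexdigest()] = inode

def fetch_file(name, store):
    inode = store[hashlib.sha1(name.encode()).hexdigest()]
    return b"".join(store[i] for i in inode)

store = {}
insert_file("readme", b"abcdef", store)
print(fetch_file("readme", store))   # -> b'abcdef'
```

Because each block ID is an independent hash, a large file's blocks land on many different nodes instead of all on the successor of hash(name).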
Naming and Caching
[Diagram: Client 1 looks up D30 and learns "D30 @ N32"; Client 2's later lookup for the same document follows a similar path.]
Notes: good performance for small files? Cache, but we need to put copies where clients will naturally encounter them.
Open Issues
- Network proximity
- Malicious data insertion
- Malicious Chord table information
- Anonymity
- Keyword search and indexing
Related Work
- CAN (Ratnasamy, Francis, Handley, Karp, Shenker)
- Pastry (Rowstron and Druschel)
- Tapestry (Zhao, Kubiatowicz, Joseph)
Chord Status
- Working Chord implementation
- SFSRO file system layered on top
- Prototype deployed at 12 sites around the world
- Understand design tradeoffs
Chord Summary
- Chord provides distributed lookup
- Efficient, low-impact join and leave
- Flat key space allows flexible extensions
- Good foundation for peer-to-peer systems
http://www.pdos.lcs.mit.edu/chord
Notes: just the right lookup for peer-to-peer storage systems. NATs? Mogul. What if most nodes are flaky? Details of noticing and reacting to failures? How to evaluate with a huge number of nodes?