Reliable Distributed Systems Distributed Hash Tables.

Reminder: Distributed Hash Table (DHT) A service, distributed over multiple machines, with hash table semantics Insert(key, value), Value(s) = Lookup(key) Designed to work in a peer-to-peer (P2P) environment No central control Nodes under different administrative control But of course can operate in an “infrastructure” sense
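
To make those semantics concrete, here is a minimal toy sketch in Python (not from the lecture): keys are hashed onto a small ring and each key is stored at the node that owns its arc. The class, node IDs, and sample data are invented for illustration; a real DHT routes over the network and has no global membership list.

import hashlib

class ToyNode:
    def __init__(self):
        self.data = {}                      # key -> list of values

class ToyDHT:
    """Hash-table semantics: insert(key, value) and lookup(key) -> values."""
    def __init__(self, node_ids):
        self.nodes = {nid: ToyNode() for nid in node_ids}
    def _owner(self, key):
        # Hash the key onto a 0..127 ring; the owner is the next node clockwise.
        point = int(hashlib.sha1(key.encode()).hexdigest(), 16) % 128
        ring = sorted(self.nodes)
        nid = next((n for n in ring if n >= point), ring[0])
        return self.nodes[nid]
    def insert(self, key, value):
        self._owner(key).data.setdefault(key, []).append(value)
    def lookup(self, key):
        return self._owner(key).data.get(key, [])

dht = ToyDHT([10, 32, 60, 80, 100])
dht.insert("foo.htm", "a copy of foo.htm")
print(dht.lookup("foo.htm"))                # ['a copy of foo.htm']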

P2P “environment” Nodes come and go at will (possibly quite frequently---a few minutes) Nodes have heterogeneous capacities Bandwidth, processing, and storage Nodes may behave badly Promise to do something (store a file) and not do it (free-loaders) Attack the system

Several flavors, each with variants Tapestry (Berkeley) Based on Plaxton trees---similar to hypercube routing The first* DHT Complex and hard to maintain (hard to understand too!) CAN (ACIRI), Chord (MIT), and Pastry (Rice/MSR Cambridge) Second wave of DHTs (contemporary with and independent of each other) * Landmark Routing, 1988, used a form of DHT called Assured Destination Binding (ADB)

Basics of all DHTs Goal is to build some “structured” overlay network with the following characteristics: Node IDs can be mapped to the hash key space Given a hash key as a “destination address”, you can route through the network to a given node Always route to the same node no matter where you start from

Simple example (doesn’t scale) Circular number space 0 to 127. Routing rule: move clockwise until current node ID ≥ key and last-hop node ID < key. Example: key = 42 routes to node 58 no matter where you start. Node 58 “owns” keys in [34,58].
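
A small sketch of this routing rule; the node set is illustrative (the slide only pins down that node 58’s predecessor is 33):

RING = 128
NODES = sorted([10, 33, 58, 81, 100, 120])       # hypothetical membership

def owner(key):
    """First node clockwise whose ID is >= key, wrapping past 127 back to 0."""
    return next((n for n in NODES if n >= key), NODES[0])

print(owner(42))                                   # 58, wherever routing starts
print([k for k in range(RING) if owner(k) == 58])  # keys 34..58 belong to node 58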

Building any DHT Newcomer always starts with at least one known member Newcomer searches for “self” in the network hash key = newcomer’s node ID Search results in a node in the vicinity where newcomer needs to be Links are added/removed to satisfy properties of network Objects that now hash to new node are transferred to new node

Insertion/lookup for any DHT Hash the name of the object to produce a key (well-known way to do this). Use the key as the destination address to route through the network; this routes to the target node. Insert the object, or retrieve it, at the target node. Example from the figure: foo.htm → key 93.

Properties of all DHTs Memory requirements grow (something like) logarithmically with N (exception: Kelips) Routing path length grows (something like) logarithmically with N (several exceptions) Cost of adding or removing a node grows (something like) logarithmically with N Has caching, replication, etc…

DHT Issues Resilience to failures Load Balance Heterogeneity Number of objects at each node Routing hot spots Lookup hot spots Locality (performance issue) Churn (performance and correctness issue) Security

We’re going to look at four DHTs At varying levels of detail… Chord: MIT (Stoica et al); Kelips: Cornell (Gupta, Linga, Birman); Pastry: Rice/Microsoft Cambridge (Druschel, Rowstron); Beehive: Cornell (Ramasubramanian, Sirer)

Things we’re going to look at What is the structure? How does routing work in the structure? How does it deal with node departures? How does it scale? How does it deal with locality? What are the security issues?

Chord uses a circular ID space Successor: node with next highest ID. [Figure: nodes N10, N32, N60, N80, N100 on the ring; each key is stored at its successor, e.g. K5, K10 at N10; K11, K30 at N32; K33, K40, K52 at N60; K65, K70 at N80; K100 at N100.] Chord slides care of Robert Morris, MIT

Basic Lookup Lookups find the ID’s successor. Correct if predecessor is correct. [Figure: a lookup for key 50 is forwarded clockwise around the ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110 (“Where is key 50?”) and answered “Key 50 is at N60”.]
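
A sketch of this basic lookup, assuming the successor pointers from the figure (the SUCC dict stands in for per-node state; a real node would issue an RPC at each hop):

SUCC = {5: 10, 10: 20, 20: 32, 32: 40, 40: 60, 60: 80, 80: 99, 99: 110, 110: 5}

def on_arc(x, lo, hi):
    """True if x lies on the clockwise arc (lo, hi]."""
    return (lo < x <= hi) if lo < hi else (x > lo or x <= hi)

def find_successor(start, key):
    n = start
    while not on_arc(key, n, SUCC[n]):
        n = SUCC[n]                     # keep walking clockwise
    return SUCC[n]

print(find_successor(32, 50))           # 60: "key 50 is at N60"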

Successor Lists Ensure Robust Lookup Each node remembers r successors. Lookup can skip over dead nodes to find blocks. Periodic check of successor and predecessor links. [Figure: with r = 3, every node on the N5…N110 ring stores its next three successors, e.g. N5 stores 10, 20, 32 and N110 stores 5, 10, 20.]

Chord “Finger Table” Accelerates Lookups To build finger tables, a new node searches for the key values for each finger. To do it efficiently, new nodes obtain the successor’s finger table and use it as a hint to optimize the search. [Figure: node N80’s fingers point 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring.]

Chord lookups take O(log N) hops [Figure: Lookup(K19) hops across the ring of nodes N5…N110, roughly halving the remaining distance each time, and ends at N20, the successor of K19.]
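
A sketch of finger tables plus the greedy lookup, using the ring from these slides. The global NODES list is purely for illustration; a real Chord node knows only its own fingers, successor, and predecessor, and contacts other nodes by RPC.

M = 7                                    # 2**7 = 128-position identifier ring
NODES = sorted([5, 10, 20, 32, 40, 60, 80, 99, 110])

def successor(ident):
    """First node whose ID is at or clockwise after this identifier."""
    return next((n for n in NODES if n >= ident), NODES[0])

def node_successor(n):
    return successor((n + 1) % 2**M)

def fingers(n):
    # finger[i] = successor(n + 2**i): targets at distances 1, 2, 4, ..., 64
    # (up to halfway) around the ring.
    return [successor((n + 2**i) % 2**M) for i in range(M)]

def on_arc(x, lo, hi):
    """True if x lies strictly inside the clockwise arc (lo, hi)."""
    return (lo < x < hi) if lo < hi else (x > lo or x < hi)

def lookup(start, key):
    """Route greedily via fingers; returns the hop list ending at key's owner."""
    hops, n = [start], start
    while not (on_arc(key, n, node_successor(n)) or key == node_successor(n)):
        # Closest preceding finger: farthest known node that does not pass the key.
        preceding = [f for f in fingers(n) if on_arc(f, n, key)]
        n = preceding[-1] if preceding else node_successor(n)
        hops.append(n)
    hops.append(node_successor(n))       # the key's successor stores it
    return hops

print(lookup(80, 19))                    # [80, 5, 10, 20]: N20 owns K19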

Drill down on Chord reliability Interested in maintaining a correct routing table (successors, predecessors, and fingers) Primary invariant: correctness of successor pointers Fingers, while important for performance, do not have to be exactly correct for routing to work Algorithm is to “get closer” to the target Successor nodes always do this

Maintaining successor pointers Periodically run “stabilize” algorithm Finds successor’s predecessor Repair if this isn’t self This algorithm is also run at join Eventually routing will repair itself Fix_finger also periodically run For randomly selected finger

Initial: 25 wants to join the correct ring (between 20 and 30). 25 finds its successor and tells that successor (30) of itself. 20 runs “stabilize”: 20 asks 30 for 30’s predecessor; 30 returns 25; 20 tells 25 of itself.

This time, 28 joins before 20 runs “stabilize”. 28 finds its successor and tells that successor (30) of itself. 20 runs “stabilize”: 20 asks 30 for 30’s predecessor; 30 returns 28; 20 tells 28 of itself.

Then 25 runs “stabilize” and discovers 28 sitting between itself and 30, and 20 runs “stabilize” and discovers 25; the successor pointers converge to 20 → 25 → 28 → 30.
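
A sketch of the stabilize/notify repair loop described above, assuming in-process node objects instead of RPCs and a join simplified to “tell your successor about yourself”. It replays the scenario from these slides, where 25 and 28 join between 20 and 30.

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None
    def join(self, known):
        # Real Chord would look up its own ID; here the caller hands over the
        # correct successor directly to keep the sketch short.
        self.successor = known
        self.successor.notify(self)
    def stabilize(self):
        p = self.successor.predecessor
        if p is not None and between(p.id, self.id, self.successor.id):
            self.successor = p           # someone slipped in between us
        self.successor.notify(self)
    def notify(self, candidate):
        if self.predecessor is None or between(candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate

def between(x, lo, hi):
    """x strictly inside the clockwise arc (lo, hi)."""
    return (lo < x < hi) if lo < hi else (x > lo or x < hi)

# Ring initially 20 <-> 30; then 25 and 28 both join before 20 stabilizes.
n20, n30 = Node(20), Node(30)
n20.successor, n30.successor = n30, n20
n20.predecessor, n30.predecessor = n30, n20
n25, n28 = Node(25), Node(28)
n25.join(n30)
n28.join(n30)
for _ in range(3):                       # a few stabilization rounds suffice
    for n in (n20, n25, n28, n30):
        n.stabilize()
print([(n.id, n.successor.id) for n in (n20, n25, n28, n30)])
# [(20, 25), (25, 28), (28, 30), (30, 20)] once the pointers have converged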

Chord problems? With intense “churn” the ring may be badly disrupted. Worst case: a partition can provoke formation of two distinct rings (think “Cornell ring” and “MIT ring”), and they could have finger pointers to each other, but not link pointers. This might never heal itself… but the scenario would be hard to induce. Chord is also unable to resist Sybil attacks.

Pastry also uses a circular number space The difference is in how the “fingers” are created: Pastry uses prefix-match overlap rather than binary splitting, which gives more flexibility in neighbor selection. [Figure: Route(d46a1c) from node 65a1fc passes through nodes such as d13da3, d4213f, d462ba, and d467c4, matching a longer prefix of the key at each hop.]
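
A sketch of prefix routing over the figure’s node IDs. Each routing-table lookup is simulated here by “pick some known node matching the next digit of the key”; the leaf-set step that would finish at the numerically closest node is noted but omitted.

NODES = ["65a1fc", "d13da3", "d4213f", "d462ba", "d467c4", "d471f1"]

def shared_prefix(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def table_entry(node, row, digit):
    """Row `row`, column `digit` of node's routing table: some node that shares
    the first `row` digits with `node` and has `digit` next (chosen arbitrarily)."""
    prefix = node[:row] + digit
    matches = [n for n in NODES if n != node and n.startswith(prefix)]
    return matches[0] if matches else None

def route(start, key):
    path, current = [start], start
    while current != key:
        p = shared_prefix(current, key)
        nxt = table_entry(current, p, key[p])
        if nxt is None:
            break                        # no better prefix match is known
        path.append(nxt)
        current = nxt
    return path

print(route("65a1fc", "d46a1c"))   # ['65a1fc', 'd13da3', 'd4213f', 'd462ba']
# A real node would now use its leaf set to finish at d467c4, the node
# numerically closest to the key.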

Pastry routing table (for node 65a1fc) Pastry nodes also have a “leaf set” of immediate neighbors up and down the ring Similar to Chord’s list of successors

Pastry join X = new node, A = bootstrap, Z = nearest node. A finds Z for X; in the process A, Z, and all nodes on the path send their state tables to X. X settles on its own table, possibly after contacting other nodes, then tells everyone who needs to know about itself. The Pastry paper doesn’t give enough information to understand how concurrent joins work. (18th IFIP/ACM, Nov 2001)

Pastry leave Noticed by leaf set neighbors when leaving node doesn’t respond Neighbors ask highest and lowest nodes in leaf set for new leaf set Noticed by routing neighbors when message forward fails Immediately can route to another neighbor Fix entry by asking another neighbor in the same “row” for its neighbor If this fails, ask somebody a level up

For instance, suppose this neighbor (the routing table’s 655x entry) fails

Ask other neighbors Try asking some neighbor in the same row for its 655x entry If it doesn’t have one, try asking some neighbor in the row below, etc.
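
A sketch of that repair rule. The routing tables, node IDs, and the get_entry helper are all hypothetical; the point is only the order in which peers are asked.

# Hypothetical routing tables: node -> {(row, digit): node_id}.
TABLES = {
    "65a1fc": {(0, "d"): "d13da3"},      # its (2, "5") "655x" entry just failed
    "651234": {(2, "5"): "655abc"},      # a same-row neighbor that still has one
    "65a9aa": {(2, "5"): "655def"},      # a neighbor from the row below
}

def get_entry(peer, row, digit):
    """Stands in for the RPC 'send me your (row, digit) routing-table entry'."""
    return TABLES.get(peer, {}).get((row, digit))

def repair(row, digit, same_row_peers, lower_row_peers):
    """Ask same-row neighbors first, then neighbors from the row below, etc."""
    for peer in same_row_peers + lower_row_peers:
        candidate = get_entry(peer, row, digit)
        if candidate is not None:
            return candidate
    return None                          # give up; fall back to the leaf set

print(repair(2, "5", same_row_peers=["651234"], lower_row_peers=["65a9aa"]))  # '655abc'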

CAN, Chord, Pastry differences CAN, Chord, and Pastry have deep similarities Some (important???) differences exist We didn’t look closely at it, but CAN nodes tend to know of multiple nodes that allow equal progress Can therefore use additional criteria (RTT) to pick next hop Pastry allows greater choice of neighbor Can thus use additional criteria (RTT) to pick neighbor In contrast, Chord has more determinism Some recent work exploits this to make Chord tolerant of Byzantine attacks (but cost is quite high)

Kelips takes a different approach The network is partitioned into √N “affinity groups”. The hash of a node’s ID determines which affinity group the node is in. Each node knows: one or more nodes in each group; all objects and nodes in its own group. But this knowledge is soft state, spread through peer-to-peer “gossip” (epidemic multicast)!

Kelips Take a collection of “nodes”.

Kelips Affinity groups: peer membership through consistent hashing. Nodes are mapped into affinity groups numbered 0 … √N − 1, with roughly N/√N = √N members per affinity group.

Kelips Affinity group view: each node keeps a table of the other members of its own group, with a heartbeat count and round-trip time (RTT) estimate per entry. For example, node 110 knows about other members of its group, such as 230 and 30.

Kelips Contact pointers: for every other affinity group, a node also keeps a small table of contacts (group → contact node). For example, 202 is a “contact” for 110 in group 2. In practice, keep k contacts for some small constant k.

Kelips A gossip protocol replicates data cheaply. Nodes also keep resource tuples (resource → node that holds it). “cnn.com” maps to group 2, so node 110 tells group 2 to “route” inquiries about cnn.com to it, and the tuple (cnn.com, 110) spreads through group 2 by gossip.

Kelips To look up “cnn.com”, just ask some contact in group 2. It returns “110” (or forwards your request). (IP2P, ACM TOIS, submitted)
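
A sketch of a Kelips-style lookup, assuming gossip has already populated the contacts and resource tuples (the node names and the in-process stand-in for an RPC are invented):

import hashlib

N_GROUPS = 3                                  # about sqrt(N) for a small system

def group_of(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % N_GROUPS

class KelipsNode:
    def __init__(self, node_id):
        self.id = node_id
        self.group = group_of(node_id)
        self.contacts = {}        # other group -> list of known contact nodes
        self.tuples = {}          # resource name -> node holding it (own group only)
    def lookup(self, resource):
        g = group_of(resource)
        if g == self.group:
            return self.tuples.get(resource)          # zero extra hops
        contact = self.contacts[g][0]
        return contact.tuples.get(resource)           # one hop, via a contact

home = KelipsNode("node-b")
home.group = group_of("cnn.com")              # place node-b in cnn.com's group for the demo
home.tuples["cnn.com"] = "node-110"           # tuple spread to the group by gossip

querier = KelipsNode("node-a")
querier.group = (home.group + 1) % N_GROUPS   # ensure the querier is in another group
querier.contacts[home.group] = [home]
print(querier.lookup("cnn.com"))              # 'node-110'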

Kelips gossip Operates at constant “background” rate Independent of frequency of changes in the system Average overhead may be higher than other DHTs, but not bursty If churn too high, system performs poorly (failed lookups), but does not collapse…

Beehive A DHT intended to support high-performance infrastructure services with proactive caching. The focus of the “real system” is on DNS, but it has other applications. Proactive caching: a form of replication. DNS already caches… Beehive pushes updates to improve hit rates.

Domain Name Service Translates textual names to Internet addresses, e.g. “www.cnn.com” to an IP address. Relies on a static hierarchy of servers. Prone to DoS attacks; fragile and expensive to maintain; slow and sluggish. In 2002, a DDoS attack almost disabled the root name servers.

A Self-Organizing Solution… [Figure: a DNS name is hashed to an object id (oid) and resolved to an IP address by routing through the overlay.] Great idea, but O(log N) is too slow on the Internet.

Beehive Intuition Optimization problem: minimize the total number of replicas such that average lookup performance ≤ C. By replicating a (key, value) tuple we can shorten the worst-case search by one hop. With a greater degree of replication the search goes even faster.

Beehive Summary general replication framework suitable for structured DHTs decentralization, self-organization, resilience properties high performance: O(1) average lookup time scalable: minimize number of replicas and reduce storage, bandwidth, and network load adaptive: promptly respond to changes in popularity – flash crowds

Analytical Problem
Minimize (storage/bandwidth): x_0 + x_1/b + x_2/b^2 + … + x_{K-1}/b^{K-1}
such that (average lookup time is ≤ C hops): x_0^{1-α} + x_1^{1-α} + x_2^{1-α} + … + x_{K-1}^{1-α} ≥ K − C
and x_0 ≤ x_1 ≤ x_2 ≤ … ≤ x_{K-1} ≤ 1
b: base; K = log_b(N); x_j: fraction of objects replicated at level j or lower; α: Zipf parameter of the query popularity distribution

Optimal Solution
x*_j = [ d^j (K' − C) / (1 + d + … + d^{K'-1}) ]^{1/(1-α)} for 0 ≤ j ≤ K' − 1, where d = b^{(1-α)/α}
x*_j = 1 for K' ≤ j ≤ K
K' (typically 2 or 3) is determined by requiring x*_{K'-1} ≤ 1, i.e. d^{K'-1} (K' − C) / (1 + d + … + d^{K'-1}) ≤ 1
optimal per-node storage: [ (1 − 1/b) / (1 + d + … + d^{K'-1}) ]^{α/(1-α)} + 1/b^{K'}
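
A small numeric sketch of this closed form (as reconstructed above), with illustrative parameter values; α is the Zipf parameter of the query distribution, b the routing base, and C the target average hop count.

import math

alpha, b, C = 0.9, 16, 1.0          # Zipf parameter, routing base, target hops
N = 10_000
K = math.ceil(math.log(N, b))       # number of prefix levels
d = b ** ((1 - alpha) / alpha)

def x_star(Kp):
    """Replication fractions per level if the first Kp levels are fractional."""
    D = sum(d ** i for i in range(Kp))
    frac = [(d ** j * (Kp - C) / D) ** (1 / (1 - alpha)) for j in range(Kp)]
    return frac + [1.0] * (K - Kp)

# K' = the largest number of fractional levels for which x*_{K'-1} stays <= 1.
Kp = max(k for k in range(1, K + 1) if x_star(k)[k - 1] <= 1)
print(Kp, [round(x, 4) for x in x_star(Kp)])
# prints something like: 3 [0.0006, 0.0126, 0.2758, 1.0] -- a tiny, popular
# fraction of objects is replicated widely; everything else stays at its home node.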

Analytical model Optimization problem: minimize the total number of replicas s.t. average lookup performance ≤ C. The target lookup performance is configurable over a continuous range, down to sub-one-hop. Minimizing the number of replicas decreases storage and bandwidth overhead.

CoDoNS Sirer’s group built an alternative to DNS Safety net and replacement for legacy DNS Self-organizing, distributed Deployed across the globe Achieves better average response time than DNS See NSDI ‘04, SIGCOMM ‘04 for details

Control traffic load generated by churn
Chord: O(changes × nodes)?
Pastry: O(changes)
Kelips: none (constant background gossip)

DHT applications A company could use a DHT to create a big index of “our worldwide stuff”. Beehive rebuilds DNS as a DHT application: no structure required in names, fewer administration errors, no DoS target. But the benefits aren’t automatic: contrast with DNS over Chord, where the median lookup time increased from 43ms (DNS) to 350ms.

To finish up Various applications have been designed over DHTs File system, DNS-like service, pub/sub system DHTs are elegant and promising tools Concerns about churn and security