Gnutella Message Header Length of message after header. Next message header located this many bytes from this header. Messages <= 4 kB. Data Length19-22 Incremented at each node. Note that (TTL + Hops Taken)= TTL initial Hops Taken18 Time To Live. Each node decrements this value when receiving the message. It gets dropped when the value is zero. TTL Remaining 17 What type of message it is. Ping, pong (ping response), query, query response, push, bye Function ID16 A randomly chosen globally unique (in theory) 16 byte identifier generated by the client for each message it sends. Message ID0-15 FunctionNameBytes
Design Goals Allow the Gnutella-like p2p to handle higher amount of queries Make it scalable Utilize heterogeneity of machines
Search Protocol GIA search is based on random walks –Like floods, but less messages But because random walks are blind, there are scaling issues So GIA uses biased walks –Toward high degree or high capacity?
High degree AND high capacity Using a dynamic topology adaptation algorithm Ensures: –high capacity nodes have a high degree –low capacity nodes are close to high capacity nodes Level of satisfaction S Measures how close the sum of capacities of a node's neighbors normalized by degrees is to that node's own capacity The lower the satisfaction level, the shorter the adaptation interval
To add a node n... Accept if num_nabr is still <= max_nbrs Select the node with the highest degree out of the subset of neighbors with a capacity less than that of n Only drop that node if it has less neighbors than n
One-hop replication Each node records an index of neighbor nodes' content Ensures that high capacity nodes can respond to a greater number of queries
Flow control To avoid overloading a node Only can direct queries to a neighbor if the neighbor is ready Node uses tokens to signify it can handle queries Node gives out tokens at the rate it can process queries If queries are being queued, decrease allocation rate Weights the allocation of tokens for capacity If a node isn't using tokens, they are allocated to other neighbors Can be piggy backed
Search Protocol (again) Biased random walks aren't random Send queries to highest capacity neighbor with tokens Time To Live decremented at each node Book-keeping limits same path traversal MAX_RESPONSES decremented for each found answer Append address of owning node to the forwarded query
Query Resilience Can't let a query die with a node Keep-alive messages –query responses –dummy query responses Originator can resend query if no keep- alive messages arrive for a while When the topology adapts, the previous connections are maintained for a while
Simulations GIA compared to: –Flood –Random Walks over Random Topologies –Supernode mechanisms queries only flooded between supernodes
Assumptions All nodes produce queries at same rate Capacity = number of messages processed per unit time queues have infinite length specific keyword searches min_alloc = min(C/num_nbrs) = 4 For Flood and Super –average diameter is 7 –TTL is 10 Look at relative performance, not absolute
Metrics Success rate = fraction of queries issued that reach the file hop-count delay = time taken from query's start to finish Collapse Point (CP) node query rate at point beyond which success rate drops below 90% Average hop-count before collapse
Performance Comparison Search terminates after finding a single answer 5000 and 10,000 nodes for each system.01,.05,.1,.5, 1 replication rates
Performance results RWRT better than Flood at high replication, equal at low replication GIA has higher hop counts than Flood and Super GIA hop counts lower as replication goes up Flood and Super aren't scalable... duh.
Multiple Search Responses Same tests, MAX_RESPONSES 1, 10, 20 Flood and Super unchanged GIA and RWRT decline as M_R increases
Node Failure model Force nodes to fail at a uniformly random time between 0 and MAXLIFETIME MAXLIFTIME = 10s, 100s, 1000s, forever Even at 10s, GIA is 2-4 orders of magnitude better than RWRT, Super, and Flood when they aren't fialing.
Types of P2P searching Centralized (Napster) –Based on user provided file lists Decentralized –Queries are distributed to peers –Unstructured (Gnutella) –Structured (Chord)
Distributed Hash Tables Pros: –Scalable –Quick lookup O(log n) steps O(n) steps for Gnutella –Can find needles Cons (why not DHTs): –P2P Clients are transient Require O(log n) repair operations after each failure –DHTs only support exact match searches –P2Ps look for hay (Not really a con)
Analysis and Comparison of P2P Search Models Dimitrios Tsoumakos Nick Roussopoulos
Blind Gnutella (flood) Modified-BFS Iterative Deepening Random Walks
Informed Gnutella2 (Super-peer) Intelligent-BFS APS Local Indices Routing Indices Distributed Resource Location Protocol Gnutella with Shortcuts GIA
Gnutella2 Uses super-peers (hubs) They act as local servers for their peers Hubs are connected Queries the hubs sequentially
Intelligent-BFS Query similarity metric to find similar queries Forwards to neighbors most likely to answer that query Focuses on object discovery rather than message reduction Increased number of hits Does not handle node departures well Assumes a node specializes in one file type
APS Uses indices to weight random walks Each index value represents a query for a specific object directed toward a specific node Index value is raised or lowered based on outcome of query Optimistic and pessimistic update approaches Originating node sends query to all neighbors, those neighbors send query to one neighbor
Local Indices Each node indexes objects stored on nodes within a radius r and can answer queries for them A BFS like search is performed Queries hop a distance of 2r+1 nodes Accuracy and hits are very high Decreases actual processing time Floods the network with messages Churn is very costly b/c flooding is used to update the repository for all joins/leaves
Routing Indices Files are assumed to fall into themes Each node stores the number of files of each theme reachable from each outgoing path Three functions used to determine best outgoing path Queries forwarded to best outgoing path Flooding is used for creation and update, so serious issues with dynamic networks Bloom filters...
Distributed Resource Location Protocol Initially, random flooding is used to find objects When an object is discovered, the query backtracks, storing the location of the found object on those nodes If a node knows where a queried object is located, it can directly contact that node Depending on specificity of queries, only one replica of a certain object is ever found In a dynamic network, much flooding
Gnutella with Shortcuts Uses standard flooding initially If a peer provides an answer, it is indexed on the requesting nodes Nodes forward queries to the ranked shortcuts first, then flood if necessary Shortcuts ranked by success rate Very high success rate Works well when users make related queries
Results All algorithms that implement flooding in some fashion have high success rates Systems that use shortcuts aren't hurt as badly by departures as expected, because the more flood searches utilized, the more accurate the shortcuts GIA is middle of the pack No collapse point test