Presentation on theme: "Topics in Database Systems: Data Management in Peer-to-Peer Systems"— Presentation transcript:
1 Topics in Database Systems: Data Management in Peer-to-Peer Systems. Search in Unstructured P2P
2 Topics in Database Systems: Data Management in Peer-to-Peer Systems. D. Tsoumakos and N. Roussopoulos, “A Comparison of Peer-to-Peer Search Methods”, WebDB 2003
3 Overview. Centralized: a constantly updated directory hosted at central locations (does not scale well, updates, single point of failure). Decentralized but structured: the overlay topology is highly controlled, and files (or metadata/indexes) are not placed at random nodes but at specified locations. Decentralized and unstructured: peers connect in an ad hoc fashion, and the location of documents/metadata is not controlled by the system; no guarantee for the success of a search; no bounds on search time; no maintenance cost; any kind of query (not just single-key or range queries).
8 Search in Unstructured P2P. Must find a way to stop the search: Time-to-Live (TTL). Exponential number of messages. Cycles (?). Note: cycles can be detected but not avoided.
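The TTL and cycle points above can be sketched in a few lines of Python. This is a minimal illustration, not the Gnutella implementation: the overlay, the node names, and the simplification that a node forwards to all of its neighbors (including the one it received the query from) are assumptions made for the example. It shows that duplicates arriving over a cycle are detected and dropped, yet the duplicate messages are still sent.

```python
from collections import deque

def flood(graph, source, ttl):
    """Flood a query from `source`; return (nodes reached, messages sent)."""
    seen = {source}                      # nodes that already handled this query
    frontier = deque([(source, ttl)])
    messages = 0
    while frontier:
        node, remaining = frontier.popleft()
        if remaining == 0:
            continue                     # TTL exhausted: stop forwarding here
        for neighbor in graph[node]:
            messages += 1                # sent even if it turns out to be a duplicate
            if neighbor not in seen:     # duplicate detection on arrival
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return seen, messages

# Hypothetical 4-node overlay containing the cycle a-b-c.
overlay = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
reached, sent = flood(overlay, "a", ttl=2)
```

In this toy overlay only three messages reach a new node; the remaining ones are duplicates caused by the cycle.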
9 Search in Unstructured P2P. BFS vs. DFS: BFS gives better response time but contacts a larger number of nodes (message overhead per node and overall). Note: a BFS search continues (if the TTL has not been reached) even if the object has already been located on a different path. Recursive vs. iterative: during search, whether the node issuing the query contacts the other nodes directly (iterative) or the query is forwarded hop by hop (recursive). Does the result follow the same path?
10 Iterative vs. Recursive Routing. Iterative: the originator requests the IP address of each hop; message transport is actually done via direct IP. Recursive: the message is transferred hop by hop. [Figure: nodes storing key-value (K, V) pairs; example request: retrieve (K1)]
11 Search in Unstructured P2P. Two general types of search in unstructured P2P: Blind: try to propagate the query to a sufficient number of nodes (example: Gnutella). Informed: utilize information about document locations (example: Routing Indexes). Informed search increases the cost of join in exchange for an improved search cost.
12 Blind Search Methods. Gnutella: uses flooding (BFS) to contact all accessible nodes within the TTL value. Huge overhead for a large number of peers, plus overall network traffic. Hard to find unpopular items. Up to 60% bandwidth consumption of the total Internet traffic.
13 Free-riding on Gnutella [Adar00]. 24-hour sampling period: 70% of Gnutella users share no files; 50% of all responses are returned by the top 1% of sharing hosts. A social problem, not a technical one. Problems: degradation of system performance (collapse?); increase of system vulnerability; “centralized” (“backbone”) Gnutella: copyright issues? Verified hypotheses: H1: a significant portion of Gnutella peers are free riders. H2: free riders are distributed evenly across domains. H3: hosts often share files nobody is interested in (files that are not downloaded).
14 Free-riding Statistics - 1 [Adar00]. H1: most Gnutella users are free riders. Of 33,335 hosts: 22,084 (66%) of the peers share no files; 24,347 (73%) share ten or fewer files; the top 1 percent (333) of hosts share 37% (1,142,645) of the total files shared; the top 5 percent (1,667) of hosts share 70% of the total files shared; the top 10 percent (3,334) of hosts share 87% (2,692,082) of the total files shared.
15 Free-riding Statistics - 2 [Adar00]. H3: many servents share files nobody downloads. Of 11,585 sharing hosts: the top 1% of sites provide nearly 47% of all answers; the top 25% of sites provide 98% of all answers; 7,349 (63%) never provide a query response.
16 Free Riders. File-sharing studies: lots of people download; few people serve files. Is this bad? If there’s no incentive to serve, why do people do so? What if there are strong disincentives to being a major server?
17 Simple Solution: Thresholds. Many programs allow a threshold to be set: don’t upload a file to a peer unless it shares > k files. Problems: What’s k? How to ensure the shared files are interesting?
18 Categories of Queries [Sripanidkulchai01]. Categorized top 20 queries.
19 Popularity of Queries [Sripanidkulchai01]. Very popular documents are approximately equally popular. Less popular documents follow a Zipf-like distribution (i.e., the probability of seeing a query for the i-th most popular document is proportional to 1/i^α). Access frequency of web documents also follows Zipf-like distributions, so caching might also work for Gnutella.
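To see why a Zipf-like popularity curve makes caching attractive, the following Python sketch normalizes the weights 1/i^α into probabilities. The catalogue size (1000 documents) and α = 1 are illustrative assumptions, not values from the study, and the sketch ignores the flattened head (the equally popular top documents) mentioned above.

```python
def zipf_probabilities(n, alpha):
    """Probability of a query for the i-th most popular of n documents."""
    weights = [1.0 / (i ** alpha) for i in range(1, n + 1)]
    total = sum(weights)                 # normalizing constant
    return [w / total for w in weights]

# Assumed parameters, chosen only for illustration.
probs = zipf_probabilities(n=1000, alpha=1.0)
top_10_share = sum(probs[:10])           # fraction of queries hitting the top 10
```

Under these assumptions the ten most popular documents attract roughly 39% of all queries, so even a small cache would absorb a large share of the traffic.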
20 Caching in Gnutella [Sripanidkulchai01]. Average bandwidth consumption in tests: 3.5 Mbps. Best case: trace 2 (73% hit rate = 3.7 times traffic reduction).
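The link between the hit rate and the traffic reduction can be checked with a short calculation. It assumes that traffic is proportional to the fraction of queries that miss the cache, which is a simplification made here for illustration; h denotes the hit rate.

```latex
\text{remaining traffic} = 1 - h = 1 - 0.73 = 0.27,
\qquad
\text{reduction factor} = \frac{1}{1 - h} = \frac{1}{0.27} \approx 3.7
```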
21 Topology of Gnutella [Jovanovic01]. Power-law properties verified (“find everything close by”). Backbone + outskirts. Power-Law Random Graph (PLRG): the node degrees follow a power-law distribution: if one ranks all nodes from the most connected to the least connected, then the i-th most connected node has ω/i^a neighbors, where ω is a constant.
23 Why does it work? It’s a small world! [Hong01]. Milgram: 42 out of 160 letters from Nebraska to Boston (~6 hops). Watts: between order and randomness; short-distance clustering + long-distance shortcuts.
In 1967, Stanley Milgram conducted a classic experiment in which he instructed randomly chosen people in Nebraska to pass letters to a selected target person in Boston, using only intermediaries who were known to one another on a first-name basis. He found that it required a median of only six steps for the letters to reach their destination, giving rise to “six degrees of separation” and the “small-world effect.” Duncan Watts and Steven Strogatz extended this work in 1998 with an influential paper in Nature that described small-world networks as an intermediate state between regular graphs and random graphs. Small-world graphs maintain the high local clustering of regular graphs (as measured by the clustering coefficient, the proportion of nodes linked to a given node that are also linked to each other) but also have the short path lengths of random graphs. They can be regarded as locally clustered graphs with shortcuts scattered in. Freenet networks can be shown to be small-world graphs (next slide).
Regular graph (n nodes, k nearest neighbors): path length ~ n/2k; 4096/16 = 256. Rewired graph (1% of nodes): path length ~ random graph; clustering ~ regular graph. Random graph: path length ~ log(n)/log(k) ~ 4.
24 Links in the small world [Hong01]. “Scale-free” link distribution. Scale-free: independent of the total number of nodes. Characteristic of small-world networks. The proportion of nodes having a given number of links n is P(n) = 1/n^k. Most nodes have only a few connections. Some have a lot of links: important for binding disparate regions together.
A key characteristic of small-world graphs is the “scale-free” link distribution, which has no term related to the size of the network, and thus applies at all scales from small to large. This distribution can be seen in Freenet. The nodes at the top left, with few connections, are the local clusters, while the nodes at the bottom right, with lots of connections, provide the shortcuts that tie the network together. The outlier at far right is the group of nodes whose datastores are completely filled, with 250 entries; with larger datastores, this column shifts further to the right.
25 Freenet: Links in the small world [Hong01]. P(n) ~ 1/n^1.5
26 Gnutella: “New” Measurements.
Stefan Saroiu, P. Krishna Gummadi, Steven D. Gribble: A Measurement Study of Peer-to-Peer File Sharing Systems. Proceedings of Multimedia Computing and Networking (MMCN) 2002, San Jose, CA, USA, January 2002.
M. Ripeanu, I. Foster, and A. Iamnitchi: Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. IEEE Internet Computing Journal, 6(1), 2002.
Evangelos P. Markatos: Tracing a Large-Scale Peer to Peer System: An Hour in the Life of Gnutella. 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2002.
Y. Chawathe, S. Ratnasamy, L. Breslau, and S. Shenker: Making Gnutella-like P2P Systems Scalable. In Proc. ACM SIGCOMM (Aug. 2003).
Qin Lv, Pei Cao, Edith Cohen, Kai Li, Scott Shenker: Search and Replication in Unstructured Peer-to-Peer Networks. ICS 2002: 84-95.
27 Gnutella: Bandwidth Barriers. Clip2 measured Gnutella over 1 month: a typical query is 560 bits long (including TCP/IP headers); 25% of the traffic is queries, 50% pings, 25% other; on average each peer seems to have 3 other peers actively connected. Clip2 found a scalability barrier with substantial performance degradation if queries/sec > 10: 10 queries/sec × 560 bits/query × 4 (to account for the other 3 quarters of message traffic) × 3 simultaneous connections = 67,200 bps. 10 queries/sec is the maximum in the presence of many dial-up users; this won’t improve (more bandwidth, larger files).
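The Clip2 estimate can be reproduced from the figures quoted on this slide. The sketch below is only a restatement of that arithmetic; the function name and the 56 kbps dial-up figure are assumptions added here for comparison.

```python
def required_bps(queries_per_sec, bits_per_query, query_share, connections):
    """Bandwidth a peer needs when queries are only `query_share` of traffic."""
    total_traffic_factor = 1 / query_share   # 25% queries -> factor 4
    return queries_per_sec * bits_per_query * total_traffic_factor * connections

load_bps = required_bps(queries_per_sec=10, bits_per_query=560,
                        query_share=0.25, connections=3)

DIALUP_BPS = 56_000   # assumption: nominal capacity of a 56k modem
exceeds_dialup = load_bps > DIALUP_BPS
```

At 10 queries per second the required bandwidth (67,200 bps) already exceeds what such a modem can carry, which is consistent with the barrier reported for networks with many dial-up users.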
28 Gnutella: Summary. Completely decentralized. Hit rates are high. High fault tolerance. Adapts well and dynamically to changing peer populations. Protocol causes high network traffic (e.g., 3.5 Mbps). For example: with 4 connections C per peer and TTL = 7, 1 ping packet can cause a very large number of packets. No estimates on the duration of queries can be given. No probability for successful queries can be given. Topology is unknown, so algorithms cannot exploit it. Free riding is a problem. Reputation of peers is not addressed. Simple, robust, and scalable (at the moment).
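The traffic caused by a single ping can be estimated with a simple model. The sketch below is an assumption-laden upper bound, not the slide's own formula: it assumes a tree-like overlay with no shared neighbors, counts only forwarded pings (no pong replies), and lets the origin send to all C neighbors while every other node forwards to its remaining C - 1 neighbors for TTL hops. The exact figure depends on how hops are counted and on whether replies are included.

```python
def ping_messages(connections, ttl):
    """Upper bound on ping transmissions caused by one flooded ping.

    Hop 1 carries `connections` messages; each later hop multiplies the
    previous hop's count by `connections - 1`.
    """
    return sum(connections * (connections - 1) ** hop for hop in range(ttl))

pings = ping_messages(connections=4, ttl=7)
```

Under this model one ping already produces several thousand transmissions, which illustrates why pings make up such a large share of the measured traffic.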
29 Lessons and Limitations. Client-server performs well, but is not always feasible; ideal performance is often not the key issue! Things that flood-based systems do well: organic scaling; decentralization of visibility and liability; finding popular stuff (e.g., caching); fancy local queries. Things that flood-based systems do poorly: finding unpopular stuff [Loo et al., VLDB 04]; fancy distributed queries; vulnerabilities: data poisoning, tracking, etc.; guarantees about anything (answer quality, privacy, etc.).
32 Security & Privacy. Issues: anonymity; reputation; accountability; information preservation; information quality; trust; denial-of-service attacks.
33 Authenticity. ? (title: origin of species; author: charles darwin; date: 1859). Document: (title: origin of species; author: charles darwin?; date: 1859; body: In an island far, far away ......)
34 More than Just File Integrity. title: origin of species; author: charles darwin?; date: 185900; body: In an island far, far away ...; checksum
35 More than Fetching One File. (T=origin, Y=?, A=darwin, B=?); (T=origin, Y=1859, A=darwin, B=abcd); (T=origin, Y=1800, A=darwin); (Y=1859)
36 Solutions. Authenticity function A(doc): T or F; at expert sites, or at all sites?; can use signatures (expert → sig(doc) → user). Voting based: authentic is what the majority says. Time based: e.g., the oldest (available) version is authentic.
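The voting-based and time-based rules can be sketched as follows. This is a toy illustration: the function names, the strict-majority rule, and the sample data (publication years reported by three peers, each with a hypothetical year in which that copy was first seen) are assumptions made for the example.

```python
from collections import Counter

def authentic_by_vote(values):
    """Return the value reported by a strict majority, or None if there is none."""
    if not values:
        return None
    value, votes = Counter(values).most_common(1)[0]
    return value if votes > len(values) / 2 else None

def authentic_by_age(versions):
    """versions: list of (first_seen, value); the oldest available version wins."""
    if not versions:
        return None
    return min(versions, key=lambda version: version[0])[1]

reported_years = [1859, 1859, 1800]                    # hypothetical peer answers
by_vote = authentic_by_vote(reported_years)

dated = [(2003, 1859), (2001, 1800), (2002, 1859)]     # (first_seen, value)
by_age = authentic_by_age(dated)
```

On this sample the two rules disagree (the majority says 1859, the oldest copy says 1800), which shows that the choice of authenticity function matters.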
37 Issues. Trust computations in a dynamic system. Overloading good nodes. Bad nodes can provide good content sometimes. Bad nodes can build up reputation. Bad nodes can form collectives. ...
39 Blind Search Methods. Modified-BFS: choose only a ratio of the neighbors (some random subset). Iterative Deepening: start BFS with a small TTL and repeat the BFS at increasing depths if the first BFS fails. Works well when there is some stop condition and a “small” flood will satisfy the query; otherwise it causes even bigger loads than standard flooding (more later …).
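Iterative deepening can be sketched as repeated depth-limited floods. The code below is a simplified illustration: the chain overlay, the depth schedule, and the choice to count contacted nodes rather than individual messages are assumptions, and each round restarts from the source instead of resuming from the previous frontier.

```python
from collections import deque

def nodes_within(graph, source, depth):
    """Return all nodes reachable from `source` in at most `depth` hops."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen

def iterative_deepening(graph, source, holders, depths):
    """Flood at increasing depths until some holder of the object is reached."""
    contacted = 0
    for depth in depths:
        reached = nodes_within(graph, source, depth)
        contacted += len(reached) - 1        # every round contacts nodes again
        hits = reached & holders
        if hits:
            return depth, hits, contacted
    return None, set(), contacted

# Hypothetical chain overlay a-b-c-d-e; only node d holds the object.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
result = iterative_deepening(chain, "a", holders={"d"}, depths=(1, 2, 3))
```

Here the object is found at depth 3 after contacting 6 nodes in total, whereas a single flood with TTL = 3 would have contacted 3. This is the case where iterative deepening costs more than standard flooding.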
40 Blind Search Methods. Random Walks: the node that poses the query sends out k query messages to an equal number of randomly chosen neighbors. Each message follows its own path, at each step randomly choosing one neighbor to forward it to. Each path is a walker. Two methods to terminate each walker: TTL-based, or a checking method (the walkers periodically check with the query source whether the stop condition has been met). It reduces the number of messages to k × TTL in the worst case. Some kind of local load balancing.
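A TTL-terminated k-walker search can be sketched as below. This is a simplified simulation under stated assumptions: walkers are simulated one after another rather than in parallel, each walker stops on its own hit or when its TTL runs out, the checking method is not modeled, and the ring overlay is hypothetical.

```python
import random

def random_walks(graph, source, holders, k, ttl, rng):
    """Send up to k walkers from `source`; return (found, messages sent)."""
    found = False
    messages = 0
    first_hops = rng.sample(graph[source], min(k, len(graph[source])))
    for node in first_hops:                  # one walker per chosen neighbor
        messages += 1                        # source -> first hop
        hops = 1
        while node not in holders and hops < ttl:
            node = rng.choice(graph[node])   # forward to one random neighbor
            messages += 1
            hops += 1
        found = found or node in holders
    return found, messages

# Hypothetical ring of six peers; peer 3 holds the object.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
found, messages = random_walks(ring, 0, holders={3}, k=2, ttl=4,
                               rng=random.Random(0))
```

Whatever path the walkers take, the message count never exceeds k × TTL, in contrast to the exponential growth of flooding.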
41 Blind Search Methods. Random Walks: in addition, the protocol biases its walks towards high-degree nodes (choose the highest-degree neighbor).
42 Blind Search Methods. Using Super-nodes: super (or ultra) peers are connected to each other. Each super-peer is also connected with a number of leaf nodes. Routing takes place among the super-peers. The super-peers then contact their leaf nodes.
43 Blind Search Methods. Using Super-nodes: Gnutella2. When a super-peer (or hub) receives a query from a leaf, it forwards it to its relevant leaves and to neighboring super-peers. The hubs process the query locally and forward it to their relevant leaves. Neighboring super-peers regularly exchange local repository tables to filter out traffic between them.
44 Blind Search Methods. Ultrapeers can be installed (KaZaA) or self-promoted (Gnutella). Interconnection between the superpeers.
45 Informed Search Methods. Local Index: each node indexes all files stored at all nodes within a certain radius r and can answer queries on behalf of them. The search proceeds in steps of r; the hop distance between two consecutive searches is 2r+1. Increased cost for join/leave: flood inside each r with TTL = r when joining or leaving the network.
46 Informed Search Methods. Intelligent BFS: nodes store simple statistics on their neighbors: (query, NeighborID) tuples for recently answered requests from or through their neighbors, so they can rank them. For each query, a node finds similar ones and selects a direction. How?
47 Informed Search Methods. Intelligent or Directed BFS. Heuristics for selecting direction: >RES: returned most results for previous queries; <TIME: shortest satisfaction time; <HOPS: min hops for results; >MSG: forwarded the largest number of messages (all types), which suggests that the neighbor is stable; <QLEN: shortest queue; <LAT: shortest latency; >DEG: highest degree.
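The heuristics above amount to picking the neighbor that maximizes or minimizes one locally stored statistic. A minimal sketch follows; the field names and the sample statistics are hypothetical and are not taken from the Directed BFS paper.

```python
# Each heuristic maps to (statistic field, selection function).
HEURISTICS = {
    ">RES": ("results", max),      # most results for previous queries
    "<TIME": ("avg_time", min),    # shortest satisfaction time
    "<HOPS": ("avg_hops", min),    # fewest hops for results
    ">MSG": ("messages", max),     # most messages forwarded
    "<QLEN": ("queue_len", min),   # shortest queue
    "<LAT": ("latency", min),      # shortest latency
    ">DEG": ("degree", max),       # highest degree
}

def pick_neighbor(stats, heuristic):
    """Return the neighbor that ranks best under the given heuristic."""
    field, choose = HEURISTICS[heuristic]
    return choose(stats, key=lambda neighbor: stats[neighbor][field])

# Hypothetical per-neighbor statistics kept by one node.
stats = {
    "n1": {"results": 12, "avg_time": 0.8, "avg_hops": 3, "messages": 400,
           "queue_len": 5, "latency": 40, "degree": 4},
    "n2": {"results": 7, "avg_time": 0.3, "avg_hops": 2, "messages": 900,
           "queue_len": 1, "latency": 25, "degree": 9},
}
best_by_results = pick_neighbor(stats, ">RES")
best_by_time = pick_neighbor(stats, "<TIME")
```

Different heuristics can select different neighbors for the same statistics, so the choice of heuristic determines the direction of the search.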
48 Informed Search Methods. Intelligent or Directed BFS: no negative feedback. Depends on the assumption that nodes specialize in certain documents.
49 Informed Search Methods. APS (Adaptive Probabilistic Search): again, each node keeps a local index with one entry per neighbor for each object it has requested; this entry reflects the relative probability of that neighbor being chosen to forward the query. k independent walkers and probabilistic forwarding: each node forwards the query to one of its neighbors based on the local index (for each object, choose a neighbor using the stored probability). If a walker succeeds, the probability is increased; otherwise it is decreased. The update takes the reverse path to the requestor, either after a walker miss (optimistic update) or after a hit (pessimistic update).
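The per-node state that APS maintains can be sketched as a small class. This is an illustrative sketch rather than the published algorithm: the class name, the initial weight of 30, the step of 10, and the floor of 1 are assumptions, and the reverse-path propagation of updates is reduced to a direct call to `update`.

```python
import random

class ApsNode:
    """Local APS index: for each object, one weight per neighbor."""

    def __init__(self, neighbors, initial=30, step=10):
        self.neighbors = list(neighbors)
        self.initial = initial            # assumed starting weight
        self.step = step                  # assumed increment/decrement
        self.index = {}                   # object -> {neighbor: weight}

    def _entry(self, obj):
        # Create the entry lazily the first time an object is requested.
        return self.index.setdefault(
            obj, {n: self.initial for n in self.neighbors})

    def choose(self, obj, rng):
        """Pick the next hop with probability proportional to its weight."""
        entry = self._entry(obj)
        names = list(entry)
        return rng.choices(names, weights=[entry[n] for n in names], k=1)[0]

    def update(self, obj, neighbor, hit):
        """Raise the weight after a walker hit, lower it after a miss."""
        entry = self._entry(obj)
        delta = self.step if hit else -self.step
        entry[neighbor] = max(1, entry[neighbor] + delta)

node = ApsNode(["x", "y"])
node.update("doc", "x", hit=True)         # a walker through x succeeded
node.update("doc", "y", hit=False)        # a walker through y failed
next_hop = node.choose("doc", random.Random(0))
```

After these two updates neighbor x is twice as likely as y to be chosen for the same object, so later walkers drift towards paths that have succeeded before.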