Topics in Database Systems: Data Management in Peer-to-Peer Systems

Presentation transcript:

Search in Unstructured P2P

D. Tsoumakos and N. Roussopoulos, “A Comparison of Peer-to-Peer Search Methods”, WebDB03

Overview
Centralized: a constantly updated directory hosted at central locations (does not scale well, update cost, single point of failure)
Decentralized but structured: the overlay topology is highly controlled, and files (or metadata/index) are not placed at random nodes but at specified locations
Decentralized and unstructured: peers connect in an ad-hoc fashion, and the location of documents/metadata is not controlled by the system
No guarantee for the success of a search
No bounds on search time
No maintenance cost
Any kind of query (not just single-key or range queries)

Flooding on Overlays [figure sequence: a query “xyz.mp3 ?” is flooded hop by hop across the overlay until it reaches the peer that stores xyz.mp3]

Search in Unstructured P2P Must find a way to stop the search: Time-to-Live (TTL) Exponential number of messages Cycles (?) Note: cycles can be detected but not avoided
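
The TTL and cycle remarks above can be illustrated with a minimal sketch. It assumes an overlay given as an adjacency dict and a hypothetical `holdings` map from node to stored keys; neither the names nor the graph come from the slides. Each node remembers queries it has already seen, so a duplicate arriving over a cycle is dropped, but the duplicate message has still been sent and is still counted.

```python
from collections import deque

def flood(graph, source, key, holdings, ttl):
    """Flood a query breadth-first; return (nodes holding key, messages sent)."""
    seen = {source}                      # nodes that already processed the query
    hits = set()
    messages = 0
    frontier = deque([(source, None, ttl)])
    while frontier:
        node, sender, remaining = frontier.popleft()
        if key in holdings.get(node, ()):
            hits.add(node)
        if remaining == 0:
            continue                     # TTL exhausted: stop forwarding here
        for neighbor in graph[node]:
            if neighbor == sender:
                continue                 # do not send the query straight back
            messages += 1                # sent even if the receiver drops it
            if neighbor not in seen:     # cycle detected at the receiver
                seen.add(neighbor)
                frontier.append((neighbor, node, remaining - 1))
    return hits, messages

# Hypothetical five-node overlay with a cycle A-B-C and another B-C-D.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
holdings = {"E": {"xyz.mp3"}}
```

With this overlay, a TTL of 2 stops one hop short of the file, while a TTL of 3 finds it; in both cases several of the counted messages are duplicates that the receivers discard.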

Search in Unstructured P2P BFS vs DFS BFS gives better response time but contacts a larger number of nodes (message overhead per node and overall) Note: search in BFS continues (if TTL is not reached), even if the object has been located on a different path Recursive vs Iterative During search, whether the node issuing the query contacts the other nodes directly (iterative) or the query is forwarded from node to node (recursive) Does the result follow the same path?

Iterative vs. Recursive Routing Iterative: Originator requests IP address of each hop Message transport is actually done via direct IP Recursive: Message transferred hop-by-hop [figure: nodes holding key/value (K, V) pairs; the originator issues retrieve(K1)]

Search in Unstructured P2P Two general types of search in unstructured p2p: Blind: try to propagate the query to a sufficient number of nodes (example Gnutella) Informed: utilize information about document locations (example Routing Indexes) Informed search increases the cost of join for an improved search cost

Blind Search Methods Gnutella: Use flooding (BFS) to contact all accessible nodes within the TTL value Huge overhead to a large number of peers + Overall network traffic Hard to find unpopular items Up to 60% bandwidth consumption of the total Internet traffic

Free-riding on Gnutella [Adar00]
24-hour sampling period:
70% of Gnutella users share no files
50% of all responses are returned by the top 1% of sharing hosts
A social problem, not a technical one
Problems:
Degradation of system performance: collapse?
Increase of system vulnerability
“Centralized” (“backbone”) Gnutella → copyright issues?
Verified hypotheses:
H1: A significant portion of Gnutella peers are free riders
H2: Free riders are distributed evenly across domains
H3: Hosts often share files nobody is interested in (files that are not downloaded)

Free-riding Statistics - 1 [Adar00]
H1: Most Gnutella users are free riders
Of 33,335 hosts:
22,084 (66%) of the peers share no files
24,347 (73%) share ten or fewer files
Top 1 percent (333) hosts share 37% (1,142,645) of total files shared
Top 5 percent (1,667) hosts share 70% of total files shared
Top 10 percent (3,334) hosts share 87% (2,692,082) of total files shared

Free-riding Statistics - 2 [Adar00] H3: Many servents share files nobody downloads Of 11,585 sharing hosts: Top 1% of sites provide nearly 47% of all answers Top 25% of sites provide 98% of all answers 7,349 (63%) never provide a query response

Free Riders File sharing studies Lots of people download Few people serve files Is this bad? If there’s no incentive to serve, why do people do so? What if there are strong disincentives to being a major server?

Simple Solution: Thresholds Many programs allow a threshold to be set Don’t upload a file to a peer unless it shares > k files Problems: What’s k? How to ensure the shared files are interesting?

Categories of Queries [Sripanidkulchai01] Categorized top 20 queries

Popularity of Queries [Sripanidkulchai01] Very popular documents are approximately equally popular Less popular documents follow a Zipf-like distribution (i.e., the probability of seeing a query for the $i$-th most popular query is proportional to $1/i^{\alpha}$) Access frequency of web documents also follows Zipf-like distributions → caching might also work for Gnutella
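
As a small illustration of the Zipf-like form, the sketch below normalizes the weights $1/i^{\alpha}$ over a finite catalogue of `n` ranked queries. The function name and the choice of `n` and `alpha` are assumptions for illustration; the sketch also ignores the slide's observation that the most popular documents are roughly equally popular.

```python
def zipf_probabilities(n, alpha):
    """Return query probabilities for ranks 1..n, proportional to 1 / i**alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# With alpha = 1 the most popular query is twice as likely as the second one.
probs = zipf_probabilities(4, 1.0)
```

The heavier the head of this distribution, the larger the share of queries that a small cache of popular answers can absorb.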

Caching in Gnutella [Sripanidkulchai01] Average bandwidth consumption in tests: 3.5Mbps Best case: trace 2 (73% hit rate = 3.7 times traffic reduction)
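
The relation between the two best-case figures can be checked with a one-line model. It assumes, as a simplification not stated on the slide, that every cache hit removes the corresponding query traffic entirely, so only the fraction 1 − h of queries is still forwarded.

```python
def traffic_reduction_factor(hit_rate):
    """Factor by which forwarded query traffic shrinks for a given cache hit rate."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)

# A 73% hit rate leaves 27% of the traffic: 1 / 0.27 is about 3.7.
reduction = traffic_reduction_factor(0.73)
```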

Topology of Gnutella [Jovanovic01] Power-law properties verified (“find everything close by”) Backbone + outskirts Power-Law Random Graph (PLRG): The node degrees follow a power law distribution: if one ranks all nodes from the most connected to the least connected, then the $i$-th most connected node has $\omega / i^{a}$ neighbors, where $\omega$ is a constant.

Gnutella Backbone [Jovanovic01]

Why does it work? It’s a small World! [Hong01]
Milgram: 42 out of 160 letters from Nebraska to Boston (~ 6 hops)
Watts: between order and randomness; short-distance clustering + long-distance shortcuts
Regular graph: n nodes, k nearest neighbors → path length ~ n/2k (4096/16 = 256)
Rewired graph (1% of nodes): path length ~ random graph, clustering ~ regular graph
Random graph: path length ~ log(n)/log(k) ~ 4
In 1967, Stanley Milgram conducted a classic experiment where he instructed randomly chosen people in Nebraska to pass letters to a selected target person in Boston, using only intermediaries who were known to one another on a first-name basis. He found that it only required a median of six steps for the letters to reach their destination, giving rise to “six degrees of separation” and the “small-world effect.” Duncan Watts and Steven Strogatz extended this work in 1998 with an influential paper in Nature that described small-world networks as an intermediate state between regular graphs and random graphs. Small-world graphs maintain the high local clustering of regular graphs (as measured by the clustering coefficient, the proportion of nodes linked to a given node which are also linked to each other) but also have the short pathlengths of random graphs. They can be regarded as locally clustered graphs with shortcuts scattered in. Freenet networks can be shown to be small-world graphs (next slide).
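
The two path-length estimates can be reproduced numerically. The value k = 8 is an inference from the slide's "4096/16", not something the slide states; the function names are illustrative only.

```python
import math

def regular_path_length(n, k):
    """Approximate path length in a regular ring lattice: n / 2k."""
    return n / (2 * k)

def random_path_length(n, k):
    """Approximate path length in a random graph: log(n) / log(k)."""
    return math.log(n) / math.log(k)

n, k = 4096, 8                           # k = 8 is assumed from 4096/16
regular = regular_path_length(n, k)      # 256.0
random_like = random_path_length(n, k)   # approximately 4
```

The gap between 256 and about 4 hops is what a small number of long-distance shortcuts buys while local clustering stays intact.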

Links in the small World [Hong01] “Scale-free” link distribution Scale-free: independent of the total number of nodes Characteristic for small-world networks The proportion of nodes having a given number of links n is: $P(n) = 1/n^{k}$ Most nodes have only a few connections Some have a lot of links: important for binding disparate regions together A key characteristic of small-world graphs is the “scale-free” link distribution, which has no term related to the size of the network, and thus applies at all scales from small to large. This distribution can be seen in Freenet. The nodes at the top left, with few connections, are the local clusters while the nodes at the bottom right, with lots of connections, provide the shortcuts that tie the network together. The outlier at far right is the group of nodes whose datastores are completely filled, with 250 entries – with larger datastores, this column shifts further to the right.

Freenet: Links in the small World [Hong01] $P(n) \sim 1/n^{1.5}$

Gnutella: “New” Measurements
[1] Stefan Saroiu, P. Krishna Gummadi, Steven D. Gribble: A Measurement Study of Peer-to-Peer File Sharing Systems, Proceedings of Multimedia Computing and Networking (MMCN) 2002, San Jose, CA, USA, January 2002.
[2] M. Ripeanu, I. Foster, and A. Iamnitchi: Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. IEEE Internet Computing Journal, 6(1), 2002.
[3] Evangelos P. Markatos: Tracing a Large-Scale Peer to Peer System: An Hour in the Life of Gnutella, 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2002.
[4] Y. Chawathe, S. Ratnasamy, L. Breslau, and S. Shenker: Making Gnutella-like P2P Systems Scalable. In Proc. ACM SIGCOMM (Aug. 2003).
[5] Qin Lv, Pei Cao, Edith Cohen, Kai Li, Scott Shenker: Search and Replication in Unstructured Peer-to-Peer Networks. ICS 2002: 84-95.

Gnutella: Bandwidth Barriers
Clip2 measured Gnutella over 1 month:
typical query is 560 bits long (including TCP/IP headers)
25% of the traffic are queries, 50% pings, 25% other
on average each peer seems to have 3 other peers actively connected
Clip2 found a scalability barrier with substantial performance degradation if queries/sec > 10:
10 queries/sec * 560 bits/query * 4 (to account for the other 3 quarters of message traffic) * 3 simultaneous connections = 67,200 bps
10 queries/sec is the maximum in the presence of many dialup users
This won’t improve with more bandwidth (more bandwidth - larger files)
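
The arithmetic behind the barrier can be written out as a short calculation. The parameter names are illustrative; the values are the ones quoted on the slide.

```python
def required_bandwidth_bps(queries_per_sec, bits_per_query, query_share, connections):
    """Total bits per second a peer must carry, scaled up from query traffic alone."""
    total_traffic_multiplier = 1 / query_share   # queries are 25% of all traffic -> x4
    return queries_per_sec * bits_per_query * total_traffic_multiplier * connections

# 10 * 560 * 4 * 3 = 67,200 bps
barrier = required_bandwidth_bps(10, 560, 0.25, 3)
```

The result is above what a 56 kbps dial-up modem can carry, which is consistent with the slide's remark about dialup users.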

Gnutella: Summary
Completely decentralized
Hit rates are high
High fault tolerance
Adapts well and dynamically to changing peer populations
Protocol causes high network traffic (e.g., 3.5Mbps). For example: 4 connections C / peer, TTL = 7: 1 ping packet can cause [...] packets
No estimates on the duration of queries can be given
No probability for successful queries can be given
Topology is unknown → algorithms cannot exploit it
Free riding is a problem
Reputation of peers is not addressed
Simple, robust, and scalable (at the moment)
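
To get a feel for the order of magnitude in the ping example, the sketch below counts ping messages under an idealized model that is an assumption here, not the slide's own formula: the overlay is a tree without cycles, the originator sends to all C neighbors, and every other peer forwards to its remaining C − 1 neighbors until the TTL runs out. Pong replies would add to this count and are not modeled.

```python
def pings_in_ideal_tree(connections, ttl):
    """Count ping messages when each hop fans out to (connections - 1) new peers."""
    return sum(
        connections * (connections - 1) ** (hop - 1)
        for hop in range(1, ttl + 1)
    )

# C = 4, TTL = 7: 4 * (1 + 3 + 9 + 27 + 81 + 243 + 729) = 4,372 pings
pings = pings_in_ideal_tree(4, 7)
```

Even before replies are counted, a single ping fans out into thousands of messages, which is the growth the slide is pointing at.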

Lessons and Limitations Client-Server performs well But not always feasible Ideal performance is often not the key issue! Things that flood-based systems do well Organic scaling Decentralization of visibility and liability Finding popular stuff (e.g., caching) Fancy local queries Things that flood-based systems do poorly Finding unpopular stuff [Loo, et al VLDB 04] Fancy distributed queries Vulnerabilities: data poisoning, tracking, etc. Guarantees about anything (answer quality, privacy, etc.)

Comparison [comparison table; cell contents not preserved in the transcript]

Security & Privacy Issues: Anonymity Reputation Accountability Information Preservation Information Quality Trust Denial of service attacks

Authenticity [figure: a query for title: origin of species, author: charles darwin, date: 1859 returns a document with the same title, author, and date and the body “In an island far, far away ...”; is it authentic?]

More than Just File Integrity title: origin of species author: charles darwin ? date: 1859 00 body: In an island far, far away ... checksum

More than Fetching One File T=origin Y=? A=darwin B=? T=origin Y=1859 A=darwin B=abcd T=origin Y=1800 A=darwin Y=1859

Solutions Authenticity Function A(doc): T or F at expert sites, at all sites? can use signature expert sig(doc) user Voting Based authentic is what majority says Time Based e.g., oldest version (available) is authentic
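
The voting-based and time-based rules can be sketched in a few lines. Both functions and their input shapes are hypothetical; in particular, requiring a strict majority and returning `None` otherwise is one possible reading of "authentic is what majority says".

```python
from collections import Counter

def authentic_by_vote(versions):
    """Return the version reported by a strict majority of sites, else None."""
    if not versions:
        return None
    value, count = Counter(versions).most_common(1)[0]
    return value if count > len(versions) / 2 else None

def authentic_by_age(timestamped_versions):
    """Return the content of the oldest available (timestamp, content) pair."""
    if not timestamped_versions:
        return None
    return min(timestamped_versions, key=lambda pair: pair[0])[1]

# Three sites answer the publication-year query; two agree on 1859.
year = authentic_by_vote(["1859", "1800", "1859"])
```

Neither rule is robust on its own: colluding peers can win a vote, and a forged timestamp defeats the age rule, which is why the slide also lists signatures from expert sites.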

Issues Trust computations in dynamic system Overloading good nodes Bad nodes can provide good content sometimes Bad nodes can build up reputation Bad nodes can form collectives ...

Back to searching

Blind Search Methods Modified-BFS: Forward the query to only a fraction of the neighbors (a random subset) Iterative Deepening: Start BFS with a small TTL and repeat the BFS at increasing depths if the previous BFS fails Works well when there is some stop condition and a “small” flood will satisfy the query Otherwise, it causes even bigger loads than standard flooding (more later …)
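
A minimal sketch of iterative deepening follows, assuming an adjacency-dict overlay, a hypothetical `holdings` map, and a caller-supplied sequence of TTL values. For simplicity each round restarts the flood from the source rather than resuming at the previous frontier, and every edge traversal counts as one message, including those sent back towards the sender.

```python
from collections import deque

def bfs_within(graph, source, ttl):
    """Return ({node: hop distance} for nodes within ttl hops, messages sent)."""
    dist = {source: 0}
    messages = 0
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == ttl:
            continue                       # depth limit reached on this path
        for neighbor in graph[node]:
            messages += 1
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist, messages

def iterative_deepening(graph, source, key, holdings, depths):
    """Repeat BFS at increasing TTLs; return (hits, successful ttl, total messages)."""
    total = 0
    for ttl in depths:
        reached, messages = bfs_within(graph, source, ttl)
        total += messages
        hits = {n for n in reached if key in holdings.get(n, ())}
        if hits:
            return hits, ttl, total
    return set(), None, total

# Hypothetical line overlay A - B - C - D with the file at the far end.
line = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
```

On this line overlay the file is found only at TTL 3, after 1 + 3 + 5 = 9 messages, whereas a single flood with TTL 3 would have cost 5: the case where iterative deepening is more expensive than plain flooding.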

Blind Search Methods Random Walks: The node that poses the query sends out k query messages to an equal number of randomly chosen neighbors Each message then follows its own path, at each step randomly choosing one neighbor to forward it to Each path is a walker Two methods to terminate each walker: TTL-based, or a checking method (the walkers periodically check with the query source whether the stop condition has been met) It reduces the number of messages to k x TTL in the worst case Some kind of local load-balancing
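
The k-walker scheme with TTL-based termination can be sketched as below. The graph representation, the `holdings` map, and the injected `rng` are assumptions made so the example is self-contained and repeatable; the checking-based termination is not modeled.

```python
import random

def random_walk_search(graph, source, key, holdings, k, ttl, rng):
    """Launch up to k walkers from source; return (hits, messages sent)."""
    neighbors = graph[source]
    # One walker per distinct, randomly chosen neighbor of the source.
    first_hops = rng.sample(neighbors, min(k, len(neighbors)))
    hits, messages = set(), 0
    for node in first_hops:
        messages += 1                      # message from the source to the first hop
        steps = 1
        while key not in holdings.get(node, ()) and steps < ttl:
            node = rng.choice(graph[node]) # forward to one random neighbor
            messages += 1
            steps += 1
        if key in holdings.get(node, ()):
            hits.add(node)
    return hits, messages

# Hypothetical overlay; every walker takes at most ttl steps.
overlay = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["A", "C"],
}
hits, sent = random_walk_search(overlay, "A", "f", {"D": {"f"}}, k=2, ttl=4,
                                rng=random.Random(0))
```

Because each walker sends exactly one message per step, the total never exceeds k × TTL, in contrast to the exponential growth of flooding.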

Blind Search Methods Random Walks: In addition, the protocol biases its walks towards high-degree nodes (choose the highest-degree neighbor)

Blind Search Methods Using Super-nodes: Super (or ultra) peers are connected to each other Each super-peer is also connected with a number of leaf nodes Routing among the super-peers The super-peers then contact their leaf nodes

Blind Search Methods Using Super-nodes: Gnutella2 When a super-peer (or hub) receives a query from a leaf, it forwards it to its relevant leaves and to neighboring super-peers The hubs process the query locally and forward it to their relevant leaves Neighboring super-peers regularly exchange local repository tables to filter out traffic between them

Blind Search Methods Ultrapeers can be installed (KaZaA) or self-promoted (Gnutella) Interconnection between the superpeers

Informed Search Methods Local Index Each node indexes all files stored at all nodes within a certain radius r and can answer queries on behalf of them The search proceeds in steps of r; the hop distance between two consecutive searches is 2r+1 Increased cost for join/leave: when a node joins or leaves the network, it floods inside radius r with TTL = r
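
The index a node would hold under this scheme can be sketched as follows. The sketch builds the index centrally from a global view of the overlay, which is a simplification for illustration: in a real deployment the entries would arrive through the join/leave floods. All names are hypothetical.

```python
from collections import deque

def nodes_within(graph, source, radius):
    """Return the set of nodes at most `radius` hops from source (inclusive)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist)

def build_local_index(graph, holdings, node, radius):
    """Map each key stored within `radius` hops of node to the peers holding it."""
    index = {}
    for peer in nodes_within(graph, node, radius):
        for key in holdings.get(peer, ()):
            index.setdefault(key, set()).add(peer)
    return index

# Hypothetical line overlay A - B - C - D.
line = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
stored = {"C": {"g"}, "D": {"f"}}
index_b = build_local_index(line, stored, "B", 1)
```

With r = 1, node B can answer for key "g" on behalf of C without forwarding the query; raising r to 2 would also cover D, at the cost of a larger index and wider join/leave floods.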

Informed Search Methods Intelligent BFS Each node stores simple statistics on its neighbors: (query, NeighborID) tuples for recently answered requests from or through those neighbors, so that it can rank them For each query, a node finds similar ones and selects a direction How?

Informed Search Methods Intelligent or Directed BFS
Heuristics for Selecting Direction
>RES: Returned most results for previous queries
<TIME: Shortest satisfaction time
<HOPS: Min hops for results
>MSG: Forwarded the largest number of messages (all types), which suggests that the neighbor is stable
<QLEN: Shortest queue
<LAT: Shortest latency
>DEG: Highest degree
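
Selecting a direction with one of these heuristics amounts to ranking neighbors by a single per-neighbor statistic. The sketch below assumes each node keeps a dict of such statistics; the metric field names and the sample numbers are made up for illustration.

```python
# Heuristic label -> (statistic to rank by, True if larger is better).
HEURISTICS = {
    ">RES": ("results", True),
    "<TIME": ("time", False),
    "<HOPS": ("hops", False),
    ">MSG": ("messages", True),
    "<QLEN": ("queue", False),
    "<LAT": ("latency", False),
    ">DEG": ("degree", True),
}

def pick_neighbors(stats, heuristic, count):
    """Return the `count` best neighbors according to the chosen heuristic."""
    metric, prefer_high = HEURISTICS[heuristic]
    ranked = sorted(stats, key=lambda nb: stats[nb][metric], reverse=prefer_high)
    return ranked[:count]

# Hypothetical statistics kept by one node about three neighbors.
stats = {
    "n1": {"results": 5, "latency": 80},
    "n2": {"results": 9, "latency": 20},
    "n3": {"results": 1, "latency": 50},
}
best_by_results = pick_neighbors(stats, ">RES", 2)   # ["n2", "n1"]
```

The query is then forwarded only to the selected neighbors instead of to all of them, which is where the saving over plain BFS comes from.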

Informed Search Methods Intelligent or Directed BFS No negative feedback Depends on the assumption that nodes specialize in certain documents

Informed Search Methods APS Again, each node keeps a local index with one entry per neighbor for each object it has requested; the entry reflects the relative probability of that neighbor being chosen to forward the query k independent walkers and probabilistic forwarding Each node forwards the query to one of its neighbors based on the local index (for each object, choose a neighbor using the stored probability) If a walker succeeds, the probability is increased; otherwise it is decreased The update takes the reverse path back to the requestor and adjusts the probabilities, either after a walker miss (optimistic update) or after a hit (pessimistic update)
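
The per-node bookkeeping described above can be sketched as a small class. The initial weight, the step size, and the floor are arbitrary values chosen for illustration, not parameters taken from the slides, and the propagation of updates along the reverse path is left to the caller.

```python
import random

class ApsNode:
    """Per-object forwarding weights for one node's neighbors (illustrative)."""

    def __init__(self, neighbors, initial=30, step=10, floor=1):
        self.neighbors = list(neighbors)
        self.initial, self.step, self.floor = initial, step, floor
        self.index = {}                      # object -> {neighbor: weight}

    def _weights(self, obj):
        # Create the entry lazily the first time an object is requested.
        return self.index.setdefault(
            obj, {nb: self.initial for nb in self.neighbors})

    def probability(self, obj, neighbor):
        weights = self._weights(obj)
        return weights[neighbor] / sum(weights.values())

    def choose(self, obj, rng):
        """Pick the next hop with probability proportional to its weight."""
        weights = self._weights(obj)
        names = list(weights)
        return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

    def reward(self, obj, neighbor):
        """Raise a neighbor's weight, e.g. after a hit on the reverse path."""
        self._weights(obj)[neighbor] += self.step

    def penalize(self, obj, neighbor):
        """Lower a neighbor's weight after a miss, never below the floor."""
        weights = self._weights(obj)
        weights[neighbor] = max(self.floor, weights[neighbor] - self.step)

node = ApsNode(["x", "y"])
node.reward("file-1", "x")                   # x: 40, y: 30
next_hop = node.choose("file-1", random.Random(0))
```

Keeping a positive floor means a penalized neighbor can still be tried occasionally, so the index can recover when content moves; whether to adjust weights eagerly while forwarding or only on the way back is the optimistic versus pessimistic choice named on the slide.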