PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D.

Presentation transcript:

PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen Speaker: Sergey Lugin Seminar “Peer-to-peer Information Systems” January 2003

Outline
1. Introduction
2. Architecture
   - local and global index
   - gossiping algorithm
   - content search and ranking
3. Performance
   - content search and ranking
   - gossiping algorithm
4. Group extension
5. Related works and summary

Introduction
PlanetP is a content addressable publish/subscribe service for unstructured peer-to-peer (P2P) communities:
- no central management
- resilience to rapid membership changes
- content search and ranking
- scaling up to several thousand peers

PlanetP features:
- a gossiping layer used to globally replicate an extremely compact content index
- a completely distributed content search and ranking algorithm that helps users find the most relevant information

Local index
- The contents of a peer's files (video, music) are described in XML snippets that contain pointers to the corresponding files.
- Each peer runs a simple web server so that other peers can retrieve these files (considered further below).
- Each peer summarizes the set of unique terms in its local index in a Bloom filter.

[Figure: local files are pointed to by XML snippets, which are served by the simple web server; the unique terms of the local index are summarized in the Bloom filter.]

Bloom filter
A Bloom filter is a bit array that allows membership in a large term set to be tested quickly using hash functions.
- compact representation of a large term set
- fast lookups
- only a Boolean answer (yes / no)
- "false positive" problem

[Figure: the terms (paper, ..., cat) are passed through the hash functions to set bits in the filter; the query term "cat" is passed through the same hash functions and the answer is "yes".]
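The membership test described on this slide can be sketched in a few lines. This is a minimal illustration, not PlanetP's implementation: the filter size, the number of hash functions, and the use of salted SHA-256 to derive bit positions are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per term (illustrative sizes)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, term):
        # Derive the bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos] = 1

    def __contains__(self, term):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(term))


bf = BloomFilter()
for term in ["paper", "cat"]:
    bf.add(term)
print("cat" in bf)  # an inserted term always tests positive
```

A term that was never added usually tests negative, but it can test positive when other terms happen to have set all of its bits, which is the "false positive" problem listed above.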

Global index
Global directory (index):

Nickname   Status     IP    Bloom filter
Bob        ON-LINE          BF[…….]
Gera       OFF-LINE         BF[…….]
…          …          …     …
Fred       ON-LINE          BF[…….]
Juri       ON-LINE          BF[…….]
Tan        ON-LINE          BF[…….]

(The IP column values were not preserved in this transcript.)

- The global directory describes all peers and all available information in the community.
- The global directory is replicated everywhere by using gossiping.
- Gossiping is also used to keep peers synchronized:
  - joining or leaving of peers
  - new data

Gossiping algorithm
PlanetP uses gossiping to replicate the global index across the peers of a P2P community:
- robustness to the dynamic joining and leaving of peers
- independence from any particular subset of peers being on-line

PlanetP's gossiping algorithm:
- rumoring algorithm
- anti-entropy algorithm
- partial anti-entropy algorithm
- Demers' algorithm

Rumoring algorithm
Purpose: the algorithm spreads new information across a P2P community.
- A peer P_x has a change (a rumor).
- Every T_g seconds, P_x pushes this change (the rumor) to a peer P_y chosen randomly from the global index.
- If the rumor is new information for P_y, then P_y starts to push the rumor just like P_x.
- P_x stops pushing the rumor after it has contacted n consecutive peers that have already heard the rumor.
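The push rule above can be explored with a small round-based simulation. It is a sketch under stated assumptions: one round stands in for one gossip interval T_g, all peers act in lock step, and the function name `spread_rumor` and its parameters are hypothetical.

```python
import random


def spread_rumor(num_peers, stop_after=2, seed=0, max_rounds=1000):
    """Simulate push rumoring; return (rounds used, number of peers that heard the rumor)."""
    rng = random.Random(seed)
    knows = [False] * num_peers
    knows[0] = True      # peer 0 originates the change
    misses = {0: 0}      # active pusher -> consecutive targets that already knew
    rounds = 0
    while misses and rounds < max_rounds:
        rounds += 1
        for px in list(misses):          # peers activated now start pushing next round
            py = rng.randrange(num_peers - 1)
            if py >= px:
                py += 1                  # pick a random peer other than px
            if knows[py]:
                misses[px] += 1
                if misses[px] >= stop_after:
                    del misses[px]       # n consecutive informed targets: stop pushing
            else:
                knows[py] = True
                misses[px] = 0
                misses[py] = 0           # py now pushes the rumor as well
    return rounds, sum(knows)


rounds, heard = spread_rumor(num_peers=50)
print(rounds, heard)
```

With a small `stop_after`, some runs end before every peer has heard the rumor, which is the gap that the anti-entropy step is meant to close.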

Anti-entropy algorithm
Purpose: the algorithm prevents rumors from dying out before they reach everyone.
All peers:
- Every T_r seconds, a peer P_x attempts to pull information from a peer P_y chosen randomly from the global index.
- P_y returns a summary of its version of the global index.
- P_x can then ask P_y for any new information that it does not have.
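One way to picture the pull exchange is with per-entry version numbers. The slides do not say what the summary contains, so the dictionary layout, the `version` field, and the function names below are assumptions made for illustration.

```python
def summarize(index):
    """Compact summary: peer name -> version number of that peer's entry."""
    return {name: entry["version"] for name, entry in index.items()}


def anti_entropy_pull(local_index, remote_index):
    """P_x pulls from P_y every entry that is missing or stale locally."""
    remote_summary = summarize(remote_index)   # P_y returns its summary
    wanted = [
        name for name, version in remote_summary.items()
        if name not in local_index or local_index[name]["version"] < version
    ]
    for name in wanted:                        # P_x requests only the differences
        local_index[name] = dict(remote_index[name])
    return wanted


px = {"Bob": {"version": 1, "status": "ON-LINE"}}
py = {
    "Bob": {"version": 2, "status": "OFF-LINE"},
    "Fred": {"version": 1, "status": "ON-LINE"},
}
updated = anti_entropy_pull(px, py)
print(sorted(updated))
```

Only the summary and the entries that differ cross the network, so a pull between two peers that are already in sync costs a single summary exchange.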

Partial anti-entropy algorithm
Purpose: the algorithm reduces the time needed to spread new information.
Extension of each push operation:
- A peer P_x pushes a rumor to a peer P_y.
- P_y piggybacks the identifiers of a small number of its most recent rumors on the reply.
- P_x can then pull any recent rumor that did not reach it.

The process requires only one extra message exchange, and only when P_y knows something that P_x does not.
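The piggybacking step can be sketched as follows. The `Peer` class, the rumor identifiers, and the limit of three recent rumors are hypothetical; the slides only say that a small number of recent identifiers is returned.

```python
from collections import deque


class Peer:
    def __init__(self, name, recent_limit=3):
        self.name = name
        self.rumors = {}                          # rumor id -> payload
        self.recent = deque(maxlen=recent_limit)  # ids of the most recent rumors

    def learn(self, rumor_id, payload):
        if rumor_id not in self.rumors:
            self.rumors[rumor_id] = payload
            self.recent.append(rumor_id)

    def receive_push(self, rumor_id, payload):
        """Handle a push and piggyback the ids of recently heard rumors."""
        piggyback = list(self.recent)  # snapshot taken before learning the pushed rumor
        self.learn(rumor_id, payload)
        return piggyback


def push_with_partial_anti_entropy(px, py, rumor_id):
    recent_ids = py.receive_push(rumor_id, px.rumors[rumor_id])
    missing = [rid for rid in recent_ids if rid not in px.rumors]
    for rid in missing:                # the one extra exchange: pull what px missed
        px.learn(rid, py.rumors[rid])
    return missing


px = Peer("Px")
py = Peer("Py")
px.learn("r1", "Bob joined")
py.learn("r2", "Fred updated its Bloom filter")
missing = push_with_partial_anti_entropy(px, py, "r1")
print(missing)
```

When `missing` is empty, no extra exchange is needed, which matches the note that the additional message is only sent if P_y knows something P_x does not.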

Membership
Joining of new peers:
- announced by gossiping (NEW)

Leaving of present peers:
- A peer discovers that another peer is OFF-LINE when an attempt to communicate with it fails:
  - the peer's status is marked as OFF-LINE
  - its information in the global index is not dropped
- If a peer has been marked as OFF-LINE continuously for a time T_Dead, it is assumed that the peer has left the community permanently:
  - all information about the peer is dropped

Rejoining of peers:
- announced by gossiping (the peer's status is marked as ON-LINE)
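The OFF-LINE and T_Dead rules amount to a small state machine per directory entry. The sketch below assumes a dictionary-based directory and an arbitrary one-hour value for T_Dead; neither comes from the slides.

```python
T_DEAD = 3600.0  # hypothetical expiry in seconds; the slides give no value


def record_contact(directory, name, reachable, now):
    """Update one peer's status after a communication attempt at time `now`."""
    entry = directory[name]
    if reachable:
        entry["status"] = "ON-LINE"
        entry["offline_since"] = None
    elif entry["status"] == "ON-LINE":
        entry["status"] = "OFF-LINE"   # keep the entry; the peer may come back
        entry["offline_since"] = now


def expire_dead_peers(directory, now, t_dead=T_DEAD):
    """Drop peers that have been OFF-LINE continuously for at least t_dead."""
    dead = [
        name for name, entry in directory.items()
        if entry["status"] == "OFF-LINE" and now - entry["offline_since"] >= t_dead
    ]
    for name in dead:
        del directory[name]
    return dead


directory = {"Gera": {"status": "ON-LINE", "offline_since": None}}
record_contact(directory, "Gera", reachable=False, now=0.0)
early = expire_dead_peers(directory, now=10.0)     # too soon: entry is kept
late = expire_dead_peers(directory, now=3600.0)    # T_Dead elapsed: entry is dropped
print(early, late)
```

Keeping the entry while a peer is merely OFF-LINE means its Bloom filter does not have to be gossiped again when it rejoins unchanged.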

Content search and ranking
PlanetP implements a content ranking algorithm that uses the vector space ranking model.

Vector space ranking model:
- each document and each query is abstractly represented as a vector
- each dimension is associated with a distinct term
- the value of each component of a vector is the weight representing the importance of that term to the corresponding document or query

Notation: Q is a query, D is a document, t is a term, w_Q,t is the weight of the term t for the query Q, and w_D,t is the weight of the term t for the document D.

Relevance of a document to a query: the cosine of the angle between the query vector and the document vector.
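The cosine measure can be computed directly from sparse term-weight vectors. In this sketch the vectors are plain dictionaries mapping terms to weights; the function name and the example weights are illustrative.

```python
import math


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector has no direction, so treat it as unrelated
    return dot / (norm_q * norm_d)


query = {"gossip": 1.0, "index": 1.0}
doc_a = {"gossip": 2.0, "index": 2.0}
doc_b = {"video": 3.0}
print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```

Because only the angle matters, `doc_a` scores as a perfect match even though its weights are twice those of the query, while `doc_b` shares no terms and scores zero.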

Content ranking algorithm: background
TFxIDF is a popular method for assigning term weights.
- TF is the term frequency: how often the term appears in the document.
- IDF is the inverse document frequency: the inverse of how often the term appears in the entire collection.

This technique balances two facts:
- terms used frequently in a document are likely to be important for describing its meaning
- terms that appear in many documents of a collection are not useful for differentiating between these documents
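TFxIDF comes in several variants. The sketch below uses one common form, raw term count multiplied by log(N / document frequency), which is an assumption and not necessarily the exact weighting used in PlanetP's evaluation.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document; each document is a list of tokens."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


docs = [["cat", "cat", "paper"], ["paper", "dog"]]
weights = tf_idf(docs)
print(weights[0])
```

Here "paper" occurs in every document and gets weight zero, while "cat" is both frequent in the first document and rare in the collection, so it dominates that document's vector.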

Approximating TFxIDF
How can this technique be used in a P2P community?

Problems with computing TFxIDF in a P2P community:
- documents are distributed across the community
- peer bandwidth is limited

PlanetP's proposed solution is an approximation of TFxIDF, split into two sub-problems:
1. Ranking peers according to their likelihood of having relevant documents
2. Deciding on the number of peers to contact and ranking the identified documents

PlanetP's content search
Steps:
1. Ranking peers
2. Querying the most relevant peers

[Figure: the query is evaluated against the Bloom filters in the global directory (Bob, Gera, …, Fred, Juri, Tan), which produces a ranking of peers (Fred, Bob, Tan). The top-ranked peers (Fred, Bob) are then queried and return ranked results.]

Ranking results:
Name    Rank
Doc-A   0.93
Doc-E   0.79
Doc-D   0.77

Ranking peers
1. Ranking peers

PlanetP introduces a new measure similar to IDF, the Inverse Peer Frequency (IPF): a term that is present in the Bloom filter of every peer is not useful for differentiating between the peers for a particular query.

Notation: N is the number of all peers; N_t is the number of peers having the term t.

IPF for a term t: [formula not preserved in this transcript]
Ranking of peers for a query Q: [formula not preserved in this transcript]
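The two formulas on this slide did not survive in the transcript. The sketch below assumes the form usually quoted for PlanetP, IPF_t = log(1 + N / N_t), with a peer's rank being the sum of IPF_t over the query terms found in its Bloom filter. Treat both as assumptions. Plain Python sets stand in for Bloom filters.

```python
import math


def rank_peers(query_terms, peer_filters):
    """Rank peers by the summed IPF of the query terms their filters contain.

    peer_filters maps a nickname to any container that supports `in`.
    """
    n_peers = len(peer_filters)
    ranks = {name: 0.0 for name in peer_filters}
    for term in set(query_terms):
        holders = [name for name, bf in peer_filters.items() if term in bf]
        if not holders:
            continue
        ipf = math.log(1 + n_peers / len(holders))  # assumed form of IPF
        for name in holders:
            ranks[name] += ipf
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)


filters = {
    "Bob": {"cat", "paper"},
    "Fred": {"cat", "gossip"},
    "Tan": {"paper"},
}
ranking = rank_peers(["cat", "gossip"], filters)
print([name for name, _ in ranking])
```

Fred ranks first because it holds both query terms, including the rarer "gossip"; Tan holds neither and ends up with rank zero.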

Querying peers
2. Querying the most relevant peers

Problem: as communities grow, it becomes infeasible to contact large subsets of peers for each query.

Solution:
- For a query Q, the user specifies a limit K on the number of potential documents that should be presented.
- PlanetP sorts the ranked peer list and contacts peers in order, starting with the most relevant ones.
- Each contacted peer returns a set of document URLs together with their relevance.

Querying peers
2. Querying the most relevant peers

Sorted peer list:
Name    Peer rank
Bob     0.84
Gera    0.77
…       …
Fred    0.34

Ranking results (top K ranked documents):
Name    Rank
Doc-A   0.93
Doc-E   0.79
Doc-D   0.77

Stopping heuristic: PlanetP stops contacting peers when the documents identified by p consecutive peers fail to contribute to the top K ranked documents.

Simulation results showed that p should be a function of the community size N and of K, with constant values C_0, C_1, C_2: [formula not preserved in this transcript]
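The stopping rule can be sketched as a loop over the sorted peer list. The formula for p is missing from the transcript, so `stop_threshold` below assumes a simple linear form with made-up placeholder constants; `query_peer` is a hypothetical callback that returns (relevance, URL) pairs for one peer.

```python
import math


def stop_threshold(n_peers, k, c0=2.0, c1=0.01, c2=0.05):
    # Assumed linear form with placeholder constants; not the published formula.
    return max(1, math.floor(c0 + c1 * n_peers + c2 * k))


def search(ranked_peers, query_peer, k, p):
    """Contact peers in rank order until p consecutive peers add nothing to the top k."""
    top = []        # (relevance, url) pairs, best first
    idle = 0        # consecutive peers that did not change the top k
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        merged = sorted(set(top) | set(query_peer(peer)), reverse=True)[:k]
        idle = idle + 1 if merged == top else 0
        top = merged
        if idle >= p:
            break
    return top, contacted


results = {
    "Bob": [(0.93, "Doc-A"), (0.79, "Doc-E")],
    "Gera": [(0.77, "Doc-D")],
    "Fred": [(0.10, "Doc-X")],
    "Juri": [(0.05, "Doc-Y")],
    "Tan": [(0.99, "Doc-Z")],
}
order = ["Bob", "Gera", "Fred", "Juri", "Tan"]
top, contacted = search(order, lambda peer: results[peer], k=3, p=2)
print(top, contacted)
```

In this toy run the search stops before reaching Tan, whose document would have ranked first. That is the trade-off of the heuristic: a small p saves messages but relies on the peer ranking being a good predictor of where the relevant documents are.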

Performance
Performance study:
- Content search
- Gossiping
  - efficacy
  - time
  - bandwidth usage

The performance study was based on a purpose-built simulator. The simulator was validated against measurements taken from a prototype (up to several hundred peers).

Content search and ranking algorithm
Metrics: Recall (R), Precision (P)

Input data: the AP89 collection, extracted from the TREC collection (Associated Press)
- No. Docs and No. Unique Terms: values not preserved in this transcript

Different document-to-peer distributions:
- Uniform (the worst case for a distributed search)
- Weibull (7% of the users in the Gnutella community share more files than all the rest together)

Content search and ranking algorithm
Legend:
- T.W is a search engine using TFxIDF (centralized implementation)
- P.W is PlanetP's search engine (Weibull distribution of documents)
- P.U is PlanetP's search engine (Uniform distribution of documents)

[Figures: recall and precision versus number of documents requested, for T.W, P.W and P.U.]

Content search and ranking algorithm
Legend:
- T.W is a search engine using TFxIDF (centralized implementation)
- P.W is PlanetP's search engine (Weibull distribution of documents)
- P.U is PlanetP's search engine (Uniform distribution of documents)

[Figures: axis labels "No. peers contacted", "Recall", "No. documents requested" and "Number of peers"; annotations "community of 400 peers" and "stopping heuristic".]

Observations
1. PlanetP tracks the performance of the centralized implementation closely. Performance is independent of how the shared documents are distributed. PlanetP's recall and precision are within 11% of those of the TFxIDF implementation.
2. PlanetP scales well for communities of up to 1000 peers, maintaining relatively constant recall and precision.
3. PlanetP's stopping heuristic keeps recall and precision close to the centralized results regardless of how the documents are distributed.

Gossiping algorithm
Measured factor: propagation time
Network: 45 Mbps

- LAN-AE: peers use only push anti-entropy. Each peer periodically pushes a summary of its data structure, and the target requests all new information from this summary.
- LAN: peers use PlanetP's gossiping algorithm.

PlanetP's parameters:
Parameter                 Value
Base gossiping interval   30 sec
Message header size       3 bytes
1000 terms BF             3000 bytes
BF summary                6 bytes
Peer summary              48 bytes

[Figure: time (sec) required to propagate a single Bloom filter everywhere versus number of peers, for LAN-AE and LAN (PlanetP).]

Gossiping algorithm
PlanetP's gossiping algorithm
Measured factors: propagation time, average bandwidth
Network: 512 Kbps

Different gossiping intervals:
- 10 sec (DSL-10)
- 30 sec (DSL-30)
- 60 sec (DSL-60)

[Figures: time (sec) versus number of peers, and average bandwidth (bytes/s) versus number of peers, for DSL-10, DSL-30 and DSL-60.]

Observations
1. The algorithm significantly outperforms algorithms that use only push anti-entropy, in both propagation time and network volume.
2. Propagation time is a logarithmic function of community size.
3. The total number of bytes sent is very modest, implying that gossiping is very scalable.
4. Propagation time can easily be traded off against gossiping bandwidth by increasing or decreasing the gossiping interval.

Group extension: gossiping
- The community is divided into a number of groups.
- Peers within the same group operate as described above.
- Peers from different groups gossip an attenuated Bloom filter that is a summary of the global index for their groups.

[Figure: Group A, Group B and Group C exchanging attenuated Bloom filters.]

Group extension: content search
Search:
- A peer P_A1 (group A) tries to find documents that are relevant to a query Q.
- The Bloom filter of group C contains terms relevant to the query Q.
- P_A1 sends the query to a random peer P_Ci from group C.
- P_Ci returns a ranked list of peers in group C.

Ranked list:
Name    Peer rank
P_C3    0.84
P_C2    0.77

Related works
Tapestry, Pastry, Chord and CAN:
- use distributed hash tables (DHT): key-value
- provide search mechanisms based on the key
- Problem: the high cost of publishing thousands of keys per file

CORI and GlOSS:
- address the problems of database selection and ranking fusion on distributed collections
- use servers to keep a reduced index of the contents
- Problems: the need for centralized resources; the possibility of a single point of failure

Summary
The first work that supports content ranking.

Content addressable publish/subscribe service for unstructured P2P communities.

Gossiping algorithm:
- propagates shared information everywhere
- provides robustness to dynamic peer behavior

Content search and ranking algorithm:
- provides search capabilities comparable with centralized resources
- operates independently of how documents are distributed throughout the community

Questions?