Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities Francisco Matias Cuenca-Acuna, Christopher Peery, Richard P. Martin, Thu D. Nguyen

Presentation transcript:

Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities Francisco Matias Cuenca-Acuna, Christopher Peery, Richard P. Martin, Thu D. Nguyen

Introduction
- P2P technology is emerging as a powerful computing model
  - Building self-healing and self-configuring systems
  - Aggregating heterogeneous resources over WANs
- 1st generation of P2P applications based on ad-hoc solutions
  - File sharing (Kazaa, Gnutella, etc.), spare-cycle usage
- More recently, many projects are focusing on building infrastructure for large-scale key-based object location
  - Chord, Tapestry and others
  - Used to build global file systems (Farsite, OceanStore)
- What about content-based location?

Goals & Challenges
- Provide content addressing and ranking in P2P
  - Similar to what Google and other search engines do
  - Ranking is critical to navigate terabytes of data
- Challenges
  - Resources are divided among a large set of heterogeneous peers
  - No central management and administration
  - Uncontrolled peer behavior
  - Gathering accurate global information is too expensive

The PlanetP Infrastructure
- Compact global index of shared information
  - Supports resource discovery and location
  - Extremely compact to minimize global storage requirement
  - Kept loosely synchronized and globally replicated
- Epidemic-based communication layer
  - Provides efficient and reliable communication despite unpredictable peer behaviors
  - Supports peer discovery (membership), group communication, and update propagation
- Distributed information ranking algorithm
  - Locate highly relevant information in large shared document collections
  - Based on TFxIDF, a state-of-the-art ranking technique
  - Adapted to work with only partial information

Using PlanetP
- Services provided by PlanetP
  - Content addressing and ranking: resource discovery for adaptive applications
  - Group membership management: close collaboration
  - Publish/Subscribe information propagation: decoupled communication and timely propagation
  - Group communication: simplify development of distributed apps
- Example application: Grid information services
  - Resource discovery & location (global information index)
  - Resilience to GIIS and GRIS failure (automatic index replication)
  - Track resource availability (Pub/Sub information propagation)

Global Information Index
- Each node maintains an index of its content
  - Summarize the set of terms in its index using a Bloom filter
- The global index is the set of all summaries
  - Term to peer mappings
  - List of online peers
- Summaries are propagated and kept synchronized using gossiping

[Figure: two peers, each with Local Files, Inverted Index, Bloom filter [K1, ..., Kn], and Global Directory, connected by Gossiping]

Global Directory:
Nickname  Status   IP  Keys
Alice     Online   …   [K1, ..., Kn]
Bob       Offline  …   [K1, ..., Kn]
Charles   Online   …   [K1, ..., Kn]
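The slide does not give the Bloom filter parameters or hash functions, so the following is only a minimal sketch of the idea: each peer summarizes its terms in a Bloom filter, and a directory of those summaries answers "which peers might hold this term?". The class name, `summarize`, `candidate_peers`, the bit-array size, the hash count, and the use of salted SHA-256 are all illustrative assumptions, not PlanetP's actual implementation.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over string terms (parameters are illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Assumption: derive each bit position from SHA-256 salted with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # Can report a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def summarize(terms, num_bits=1024, num_hashes=3):
    """Build the summary a peer would publish for its set of terms."""
    bf = BloomFilter(num_bits, num_hashes)
    for term in terms:
        bf.add(term)
    return bf


def candidate_peers(directory, term):
    """Return the peers whose summary may contain the term."""
    return sorted(name for name, (_status, bf) in directory.items() if term in bf)


# Hypothetical global directory: nickname -> (status, summary).
directory = {
    "Alice": ("Online", summarize(["gossip", "bloom", "filter"])),
    "Bob": ("Offline", summarize(["ranking", "tfxidf"])),
}
```

Because a Bloom filter can return false positives, `candidate_peers` gives a superset of the peers that really hold the term; the peers themselves resolve the query against their local inverted index.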

Epidemic Comm. in P2P
- Nodes push and pull randomly from each other
  - Unstructured communication → resilient to failures
  - Predictable convergence time
- Novel combination of previously known techniques
  - Rumoring, anti-entropy, and partial anti-entropy
- Introduce partial anti-entropy to reduce variance in propagation time for dynamic communities
- Batch updates into communication rounds for efficiency
- Dynamic slow-down in absence of updates to save bandwidth
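To make the push/pull idea concrete, here is a deliberately simplified round-based simulation. It is a sketch under stated assumptions, not PlanetP's protocol: it models a single update, a fixed and fully connected membership, and one random contact per peer per round, and it leaves out partial anti-entropy, batching, and the dynamic slow-down. The function name `gossip_rounds` and its parameters are hypothetical.

```python
import random


def gossip_rounds(num_peers, seed=0, max_rounds=1000):
    """Count rounds until one update reaches every peer (needs num_peers >= 2)."""
    rng = random.Random(seed)
    knows = [False] * num_peers
    knows[0] = True  # peer 0 originates the update
    for round_no in range(1, max_rounds + 1):
        snapshot = list(knows)  # what each peer knew when the round started
        for peer in range(num_peers):
            # Pick a random peer other than ourselves.
            target = rng.randrange(num_peers - 1)
            if target >= peer:
                target += 1
            if snapshot[peer]:
                knows[target] = True  # push: tell the target about the update
            elif snapshot[target]:
                knows[peer] = True  # pull: learn the update from the target
        if all(knows):
            return round_no
    return max_rounds
```

Calling `gossip_rounds(200)` with different seeds gives a feel for how tightly the convergence time clusters in this toy model; the slide's claims about variance refer to the real protocol, which this sketch does not reproduce.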

Content Search in PlanetP
[Figure: query flow at peer Diane. Labels: Query; Global Directory (Nickname and Keys [K1, ..., Kn] for Alice, Bob, Charles, Diane, Edward, Fred, Gary); Local lookup; Rank nodes (Fred, Bob, Diane); Contact candidates; Rank results (File 1, File 2, File 3); STOP]

Results Ranking
- The Vector Space model
  - Documents and queries are represented as k-dimensional vectors
  - Each dimension represents the relevance or weight of the word for the document
  - The angle between a query and a document indicates their similarity
  - Does not require links between documents
- Weight assignment (TFxIDF)
  - Use Term Frequency (TF) to weight terms for documents
  - Use Inverse Document Frequency (IDF) to weight terms for the query
- Intuition
  - TF indicates how relevant a document is to a particular concept
  - IDF gives more weight to terms that are good discriminators between documents
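A small sketch of the vector space model may help. The slide does not fix the exact weighting formula, so this assumes one common variant: raw term counts for TF, `log(N / df)` for IDF, and cosine similarity for the angle between vectors. The function names and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (per-document {term: weight} dicts, idf table)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy collection (hypothetical).
docs = [["gossip", "peer", "peer"], ["ranking", "peer"], ["gossip", "bloom"]]
vectors, idf = tfidf_vectors(docs)

# The query is weighted with IDF only, as on the slide.
query = {"ranking": idf["ranking"]}
scores = [cosine(query, vec) for vec in vectors]
```

Here only the second document contains "ranking", so it is the only one with a non-zero score for that query.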

Using TFxIDF in P2P
- Unfortunately IDF is not suited for P2P
  - Requires term to document mappings
  - Requires a frequency count for every term in the shared collection
- Instead, use a two-phase approximation algorithm
- Replace IDF with IPF (Inverse Peer Frequency)
  - IPF(t) = f(No. Peers / Peers with documents containing term t)
  - Individuals can compute a consistent global ranking of peers and documents without knowing the global frequency count of terms
- Node ranking function
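The slide leaves both `f` and the node ranking function unspecified in this transcript. As a sketch only, the code below assumes `f(x) = log(1 + x)` and scores a peer by summing the IPF of every query term that its summary claims to contain. Plain Python sets stand in for the Bloom filter summaries, and the names `ipf` and `rank_peers` are hypothetical.

```python
import math


def ipf(term, summaries):
    """Inverse peer frequency, assuming f(x) = log(1 + x)."""
    num_peers = len(summaries)
    peers_with_term = sum(1 for terms in summaries.values() if term in terms)
    if peers_with_term == 0:
        return 0.0
    return math.log(1 + num_peers / peers_with_term)


def rank_peers(query_terms, summaries):
    """Order peers by the summed IPF of the query terms they may hold."""
    scores = {}
    for peer, terms in summaries.items():
        score = sum(ipf(t, summaries) for t in query_terms if t in terms)
        if score > 0:
            scores[peer] = score
    # Highest score first; ties broken by name for a deterministic order.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical summaries: peer -> set of terms (stand-in for a Bloom filter).
summaries = {
    "Bob": {"gossip", "ranking"},
    "Fred": {"gossip"},
    "Diane": {"bloom"},
}
ranking = rank_peers(["gossip", "ranking"], summaries)
```

Every peer holds the same replicated directory, so each one can compute this ranking locally without any global term counts, which is the point the slide makes.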

Pruning Searches
- Centralized search engines have an index for the entire collection
  - Can rank the entire set of documents for each query
- In a P2P community, we do not want to contact peers that have only marginally relevant documents
  - Use an adaptive heuristic to limit forwarding of the query in the second phase to only a subset of the most highly ranked peers
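The slide does not spell out the adaptive heuristic. One plausible shape, shown here purely as an assumption, is to contact peers in rank order and stop once several consecutive peers fail to improve the current top-k results. The function `search`, the `fetch` callback, and the fixed `patience` threshold are hypothetical; a real heuristic would likely adapt the threshold to the community size and to k.

```python
import heapq


def search(ranked_peers, fetch, k, patience=2):
    """Contact peers in rank order; stop after `patience` peers in a row add nothing to the top k.

    fetch(peer) must return an iterable of (doc, score) pairs.
    Returns (top results as (score, doc), best first; number of peers contacted).
    """
    top = []  # min-heap of (score, doc), holds at most k entries
    misses = 0
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        improved = False
        for doc, score in fetch(peer):
            if len(top) < k:
                heapq.heappush(top, (score, doc))
                improved = True
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc))
                improved = True
        misses = 0 if improved else misses + 1
        if misses >= patience:
            break
    return sorted(top, reverse=True), contacted


# Hypothetical per-peer results.
results = {
    "A": [("a1", 0.9), ("a2", 0.8)],
    "B": [("b1", 0.1)],
    "C": [],
    "D": [("d1", 0.95)],
}
top, contacted = search(["A", "B", "C", "D"], lambda p: results[p], k=2)
```

In this toy run the search stops after peer C and never sees D's better document, which shows the trade-off: fewer peers contacted in exchange for a small risk of missing relevant results.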

Evaluation
- Answer the following questions
  - What is the efficacy of our distributed ranking algorithm?
  - What is the storage cost for the globally replicated index?
  - How well does gossiping work in P2P communities?
- Evaluation methodology
  - Use a running prototype to validate and collect micro-benchmarks (tested with up to 200 nodes)
  - Use simulation to predict performance on big communities
  - We model peer behavior based on previous work and our own measurements from a local P2P community of 4000 users
- Will show a sampling of results from the paper

Ranking Evaluation I
- We use the AP89 collection from TREC
  - documents, words, 97 queries, 266MB
  - Each collection comes with a set of queries and relevance judgments
- We measure recall (R) and precision (P)

Ranking Evaluation II
- Results intersection is 70% at low recall and gets to 100% as recall increases
- To get 10 documents, PlanetP contacted 20 peers out of 160 candidates

Global Index Space Efficiency
- TREC collection (pure text)
  - Simulate a community of 5000 nodes
  - Distribute documents uniformly
  - 944,651 documents taking up 3GB
  - 36MB of RAM are needed to store the global index
  - This is 1% of the total collection size
- MP3 collection (audio + tags)
  - Using previous result but based on Gnutella measurements
  - 3,000,000 MP3 files taking up 14TB
  - 36MB of RAM are needed to store the global index
  - This is % of the total collection size
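The ratios on this slide are simple divisions, sketched below from the sizes as they appear in this transcript. Decimal units (1 MB = 10^6 bytes) are an assumption, and the MP3 figure is only what the arithmetic yields from those numbers, not a value taken from the slide, whose percentage did not survive in the transcript.

```python
def index_fraction_percent(index_bytes, collection_bytes):
    """Size of the global index as a percentage of the shared collection."""
    return 100.0 * index_bytes / collection_bytes


# Assumption: decimal units.
MB, GB, TB = 10**6, 10**9, 10**12

trec_percent = index_fraction_percent(36 * MB, 3 * GB)   # text collection
mp3_percent = index_fraction_percent(36 * MB, 14 * TB)   # audio collection

print(f"TREC: {trec_percent:.2f}%  MP3: {mp3_percent:.6f}%")
```

The text collection comes out at about 1.2%, consistent with the slide's rounded 1%. For the audio collection the index is several orders of magnitude smaller relative to the data, since the files are large but carry few indexable terms.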

Data Propagation
[Figures: arrival and departure experiment (LAN); propagation speed experiment (DSL)]

Conclusions
- Explored the design of infrastructural support for a rich set of P2P applications
  - Membership, content addressing and ranking
  - Scales well to thousands of peers
  - Extremely tolerant to unpredictable dynamic peer behaviors
- Gossiping with partial anti-entropy is reliable
  - Information always propagates everywhere
  - Propagation time has small variance
- Distributed approximation of TFxIDF
  - Within 11% of centralized implementation
  - Never collects all needed information in one place
- Global index on average is only 1% of the data collection
  - Synchronization of the global index only requires 50 B/sec

Current and Future Work
- WayFinder
  - A file system that provides content addressing in a unified shared namespace across peers
  - Supports disconnected operation
- A distributed registry (UDDI)
  - To complement the Globus Toolkit R3
- We are building support for wildcard searches
- Randomized replication algorithm for P2P
  - Provides predictable data availability
  - See Rutgers DCS-TR-509 (to appear in SRDS 2003)

Related Work
- Tapestry, Pastry, Chord and CAN
  - Implement a distributed hash table for P2P environments
  - Oriented towards large-scale object location
  - They already store all the information needed to implement TFxIPF
- CORI and GlOSS
  - Address the problem of indexing and searching distributed collections of documents
  - They build a centralized index that has total knowledge of word usage so they don't contact unnecessary nodes

PlanetP Questions?