MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine Middleware 2005 Grenoble, France Sebastian Michel Max-Planck-Institut für Informatik.

Slides:



Advertisements
Similar presentations
CAN 1.Distributed Hash Tables a)DHT recap b)Uses c)Example – CAN.
Advertisements

P2P data retrieval DHT (Distributed Hash Tables) Partially based on Hellerstein’s presentation at VLDB2004.
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.
Peer to Peer and Distributed Hash Tables
Digital Library Service – An overview Introduction System Architecture Components and their functionalities Experimental Results.
Scalable Content-Addressable Network Lintao Liu
Kademlia: A Peer-to-peer Information System Based on the XOR Metric Petar Mayamounkov David Mazières A few slides are taken from the authors’ original.
Technische Universität Yimei Liao Chemnitz Kurt Tutschku Vertretung - Professur Rechner- netze und verteilte Systeme Chord - A Distributed Hash Table Yimei.
Technische Universität Chemnitz Kurt Tutschku Vertretung - Professur Rechner- netze und verteilte Systeme Chord - A Distributed Hash Table Yimei Liao.
Chord: A scalable peer-to- peer lookup service for Internet applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashock, Hari Balakrishnan.
On the Usage of Global Document Occurrences (GDO) in P2P Information Systems or… Avoiding overlapping results in P2P searching Odysseas Papapetrou 1,2.
Search and Replication in Unstructured Peer-to-Peer Networks Pei Cao, Christine Lv., Edith Cohen, Kai Li and Scott Shenker ICS 2002.
Common approach 1. Define space: assign random ID (160-bit) to each node and key 2. Define a metric topology in this space,  that is, the space of keys.
Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.
1 Distributed Hash Tables My group or university Peer-to-Peer Systems and Applications Distributed Hash Tables Peer-to-Peer Systems and Applications Chapter.
Topics in Reliable Distributed Systems Lecture 2, Fall Dr. Idit Keidar.
Introduction to Peer-to-Peer (P2P) Systems Gabi Kliot - Computer Science Department, Technion Concurrent and Distributed Computing Course 28/06/2006 The.
Efficient Content Location Using Interest-based Locality in Peer-to-Peer Systems Presented by: Lin Wing Kai.
Distributed Lookup Systems
LSDS-IR’08, October 30, Peer-to-Peer Similarity Search over Widely Distributed Document Collections Christos Doulkeridis 1, Kjetil Nørvåg 2, Michalis.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek and Hari alakrishnan.
Chord-over-Chord Overlay Sudhindra Rao Ph.D Qualifier Exam Department of ECECS.
Topics in Reliable Distributed Systems Fall Dr. Idit Keidar.
1 CS 194: Distributed Systems Distributed Hash Tables Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering and Computer.
Peer To Peer Distributed Systems Pete Keleher. Why Distributed Systems? l Aggregate resources! –memory –disk –CPU cycles l Proximity to physical stuff.
Wide-area cooperative storage with CFS
INTRODUCTION TO PEER TO PEER NETWORKS Z.M. Joseph CSE 6392 – DB Exploration Spring 2006 CSE, UT Arlington.
Roger ZimmermannCOMPSAC 2004, September 30 Spatial Data Query Support in Peer-to-Peer Systems Roger Zimmermann, Wei-Shinn Ku, and Haojun Wang Computer.
Effizientes Routing in P2P Netzwerken Chord: A Scalable Peer-to- peer Lookup Protocol for Internet Applications Dennis Schade.
Other Structured P2P Systems CAN, BATON Lecture 4 1.
Content Overlays (Nick Feamster). 2 Content Overlays Distributed content storage and retrieval Two primary approaches: –Structured overlay –Unstructured.
Parallel and Distributed IR. 2 Papers on Parallel and Distributed IR Introduction Paper A: Inverted file partitioning schemes in Multiple Disk Systems.
Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications Xiaozhou Li COS 461: Computer Networks (precept 04/06/12) Princeton University.
1 Distributed Hash Tables (DHTs) Lars Jørgen Lillehovde Jo Grimstad Bang Distributed Hash Tables (DHTs)
A Scalable Content-Addressable Network (CAN) Seminar “Peer-to-peer Information Systems” Speaker Vladimir Eske Advisor Dr. Ralf Schenkel November 2003.
Super-peer Network. Motivation: Search in P2P Centralised (Napster) Flooding (Gnutella)  Essentially a breadth-first search using TTLs Distributed Hash.
Presentation 1 By: Hitesh Chheda 2/2/2010. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT Laboratory for Computer Science.
An IP Address Based Caching Scheme for Peer-to-Peer Networks Ronaldo Alves Ferreira Joint work with Ananth Grama and Suresh Jagannathan Department of Computer.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications.
SIGCOMM 2001 Lecture slides by Dr. Yingwu Zhu Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications.
1 Peer-to-Peer Technologies Seminar by: Kunal Goswami (05IT6006) School of Information Technology Guided by: Prof. C.R.Mandal, School of Information Technology.
Peer to Peer A Survey and comparison of peer-to-peer overlay network schemes And so on… Chulhyun Park
1 Secure Peer-to-Peer File Sharing Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, Hari Balakrishnan MIT Laboratory.
1 Distributed Hash Table CS780-3 Lecture Notes In courtesy of Heng Yin.
Chord Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google,
Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring Principles of Reliable Distributed Systems Lecture 2: Distributed Hash.
Plethora: Infrastructure and System Design. Introduction Peer-to-Peer (P2P) networks: –Self-organizing distributed systems –Nodes receive and provide.
CSE 486/586, Spring 2014 CSE 486/586 Distributed Systems Distributed Hash Tables Steve Ko Computer Sciences and Engineering University at Buffalo.
Algorithms and Techniques in Structured Scalable Peer-to-Peer Networks
Peer-to-Peer Systems: An Overview Hongyu Li. Outline  Introduction  Characteristics of P2P  Algorithms  P2P Applications  Conclusion.
LOOKING UP DATA IN P2P SYSTEMS Hari Balakrishnan M. Frans Kaashoek David Karger Robert Morris Ion Stoica MIT LCS.
CS Spring 2014 CS 414 – Multimedia Systems Design Lecture 37 – Introduction to P2P (Part 1) Klara Nahrstedt.
Two Peer-to-Peer Networking Approaches Ken Calvert Net Seminar, 23 October 2001 Note: Many slides “borrowed” from S. Ratnasamy’s Qualifying Exam talk.
INTERNET TECHNOLOGIES Week 10 Peer to Peer Paradigm 1.
P2P Search COP6731 Advanced Database Systems. P2P Computing  Powerful personal computer Share computing resources P2P Computing  Advantages: Shared.
P2P Search COP P2P Search Techniques Centralized P2P systems  e.g. Napster, Decentralized & unstructured P2P systems  e.g. Gnutella.
CS Spring 2012 CS 414 – Multimedia Systems Design Lecture 37 – Introduction to P2P (Part 1) Klara Nahrstedt.
P2P Content Search: Give the Web Back to the People Matthias Bender Sebastin Michel Peter Triantafillou Gerhard Weikum Christian Zimmer Mariam John CSE.
Malugo – a scalable peer-to-peer storage system..
Incrementally Improving Lookup Latency in Distributed Hash Table Systems Hui Zhang 1, Ashish Goel 2, Ramesh Govindan 1 1 University of Southern California.
CSE 486/586 Distributed Systems Distributed Hash Tables
Distributed Hash Tables (DHT) Jukka K. Nurminen *Adapted from slides provided by Stefan Götz and Klaus Wehrle (University of Tübingen)
Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications * CS587x Lecture Department of Computer Science Iowa State University *I. Stoica,
CS Spring 2010 CS 414 – Multimedia Systems Design Lecture 24 – Introduction to Peer-to-Peer (P2P) Systems Klara Nahrstedt (presented by Long Vu)
(slides by Nick Feamster)
CHAPTER 3 Architectures for Distributed Systems
Plethora: Infrastructure and System Design
EE 122: Peer-to-Peer (P2P) Networks
Bookmark-driven Query Routing in Peer-to-Peer Web Search
Consistent Hashing and Distributed Hash Table
Presentation transcript:

MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine Middleware 2005 Grenoble, France Sebastian Michel Max-Planck-Institut für Informatik Saarbrücken, Germany Peter Triantafillou University of Patras Rio, Greece Gerhard Weikum Max-Planck-Institut für Informatik Saarbrücken, Germany

MINERVA Infinity: A Scalable Efficient P2P Search Engine 2 Vision P2P approach best suitable –large number of peers –exploit mostly idle resources –intellectual input of user community Today: Web Search is dominated by centralized engines (“to google”) - censorship? - single point of attack/abuse - coverage of the web? Ultimate goal: “Distributed Google” to break information monopolies

MINERVA Infinity: A Scalable Efficient P2P Search Engine 3 Challenges large scale networks –100,000 to 10,000,000 users large collections > 10^10 documents –1,000,000 terms high dynamics

MINERVA Infinity: A Scalable Efficient P2P Search Engine 4 Questions Network Organization –structured? –hierarchical? –unstructured? Data Placement –move data around? –data remains at the owner? Scalability? Query Routing/Execution –Routing indexes? –Message flooding?

MINERVA Infinity: A Scalable Efficient P2P Search Engine 5 Overview Motivation (Vision/Challenges/Questions) Introduction to IR and P2P Systems P2P- IR Minerva Infinity Network Organization Data Placement Query Processing Data Replication Experiments Conclusion

MINERVA Infinity: A Scalable Efficient P2P Search Engine 6 Information Retrieval Basics Document Terms 5 x 7 x 4 x # of terms (term frequency)

MINERVA Infinity: A Scalable Efficient P2P Search Engine 7 Information Retrieval Basics (2) index lists with (DocId: tf*idf) sorted by Score B+ tree on terms Query Execution: Usually using some kind of threshold algorithm*: - sequential scans over the index lists (round-robin) - (random accesses to fetch missing scores) - aggregate scores - stop when the threshold is reached Top-k Query Processing: find k documents with the highest total score e.g. Fagin’s algorithm TA or a variant without random accesses d17: 0.3 d44: d52: 0.1 d53: 0.8 d55: 0.6 d12: 0.5 d14: d28: 0.1 d51: 0.6 d52: 0.3 d28: d17: 0.1 d44: 0.2 d11: 0.6

MINERVA Infinity: A Scalable Efficient P2P Search Engine 8 P2P Systems Peer: –“one that is of equal standing with another” (source: Merriam-Webster Online Dictionary ) Benefits: –no single point of failure –resource/data sharing Problems/Challenges: –authority/trust/incentives –high dynamics –… Applications: – File Sharing – IP Telephony – Web Search – Digital Libraries

MINERVA Infinity: A Scalable Efficient P2P Search Engine 9 Structured P2P Systems based on Distributed Hash Tables (DHTs) “structured” P2P networks provide one simple method: lookup:key->peer CAN [SIGCOMM 2001] CHORD [SIGCOMM 2001] Pastry [Middleware 2001] P-Grid [CoopIS 2001] robustness to load skew, failures, dynamics

MINERVA Infinity: A Scalable Efficient P2P Search Engine 10 p1 p8 p14 p21 p32 p38 p42 p48 p51 p56 Chord Peers and keys are mapped to the same cyclic ID space using a hash function Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k  p and there is no node p‘ with k  p‘ and p‘<p k10 k24 k30k38 k54

MINERVA Infinity: A Scalable Efficient P2P Search Engine 11 Chord (2) Using finger tables to speed up lookup process Store pointers to few distant peers Lookup in O(log n) steps Chord Ring p1p1 p8 p 56 p 51 p 42 p 38 p 32 p 21 p 14 k 54 Lookup(54)

MINERVA Infinity: A Scalable Efficient P2P Search Engine 12 Overview Motivation (Vision/Challenges/Questions) Introduction to IR and P2P Systems P2P- IR Minerva Infinity Network Organization Data Placement Query Processing Data Replication Experiments Conclusion

MINERVA Infinity: A Scalable Efficient P2P Search Engine 13 P2P - IR Share documents (e.g. Web pages) in an efficient and scalable way Ranked retrieval –simple DHT is insufficient

MINERVA Infinity: A Scalable Efficient P2P Search Engine 14 Possible Approaches Each peer is responsible for storing the COMPLETE index list for a subset of terms. p1 p8 p14 p21 p32 p38 p42 p48 p51 p56 Query Routing: DHT lookups Query Execution: Distributed Top-k [TPUT ’04, KLEE ‘05] capacity overload of peers with highly frequent / popular terms (data load AND query load)

MINERVA Infinity: A Scalable Efficient P2P Search Engine 15 Possible Approaches (2) Each peer has its own local index (e.g., created by web crawls) Query Routing: 1. DHT lookups 2. Retrieve Metadata 3. Find most promising peers Query Execution: - Send the complete Query and merge the incoming results capacity overload of peers with - highly frequent terms - high-quality collections a: P1 P6 P4 b: P5 P3 P1 P6... Distributed Directory Term  List of Peers P1 P5 P6P4 P2 P3

MINERVA Infinity: A Scalable Efficient P2P Search Engine 16 Overview Motivation (Vision/Challenges/Questions) Introduction to IR and P2P Systems P2P- IR Minerva Infinity Network Organization Data Placement Query Processing Data Replication Experiments Conclusion

MINERVA Infinity: A Scalable Efficient P2P Search Engine 17 Minerva Infinity Idea: –assign (term, docId, score) triplets to the peers order preserving load balancing –hash(score)+ hash(term) as offset –guarantee 100% recall high low high low high low

MINERVA Infinity: A Scalable Efficient P2P Search Engine 18 Hash Function Requirements: –Load balancing (to avoid overloading peers) –Order preserving (to make the QP work) One without the other is trivial... –Load balancing: apply a pseudo random hash function –Order preserving: Both together is challenging … S-Smin * N Smax - Smin

MINERVA Infinity: A Scalable Efficient P2P Search Engine 19 Hash Function (2) Assume an exponential score distribution Place the first half of the data to the first peer The next quarter to the next peer and so on … … 1 0

MINERVA Infinity: A Scalable Efficient P2P Search Engine 20 Term Index Networks (TINs) Reduce # of hops during QP by reducing the number of peers that maintain the index list for a particular term  Only a small subset of peers is used to store an index list. Global Network B A C

MINERVA Infinity: A Scalable Efficient P2P Search Engine 21 How to Create/Find a TIN Use u Beacon-Peers to bootstrap the TIN for term T Global Network p = 1/u For i=0 to i<n‘ do id = hash(t, i*p) if (i>0) use hash(t,( i-1)*p ) as a gateway to the TIN else node with id creates the TIN End for T Beacon nodes act as gateways to the TIN

MINERVA Infinity: A Scalable Efficient P2P Search Engine 22 Publish Data / Join a TIN Peer with id = hash(t, score) not in the TIN for term t Randomly select a beacon node (Beacon nodes act as gateways to the TIN) Call the join method Store the item (docId, t, score)

MINERVA Infinity: A Scalable Efficient P2P Search Engine 23 Query Processing Data PeersCoordinator 11 2-keyword Query Alternative: Collect data and send in one batch.

MINERVA Infinity: A Scalable Efficient P2P Search Engine 24 QP with Moving Coordinator Data PeersCoordinator keyword Query

MINERVA Infinity: A Scalable Efficient P2P Search Engine 25 Data Replication Vertical: Replicate data inside a TIN via a ‘reverse’ communication. Horizontal: Replicate complete TINs B BB B C C A A A A B B C C B B C C A 1 16 A 31

MINERVA Infinity: A Scalable Efficient P2P Search Engine 26 Experiments Test bed: 10,000 peers Benchmarks: GOV : TREC.GOV collection + 50 TREC-2003 Web queries, e.g. juvenile delinquency XGOV : TREC.GOV collection + 50 manually expanded queries, e.g. juvenile delinquency youth minor crime law jurisdiction offense prevention SCALABILITY : One query executed multiple times ……….

MINERVA Infinity: A Scalable Efficient P2P Search Engine 27 Experiments: Metrics Metrics Network traffic (in KB) Query response time (in s) - network cost (150ms RTT, 800Kb/s data transfer rate) - local I/O cost (8ms rotation latency + 8MB/s transfer delay) - processing cost Number of Hops

MINERVA Infinity: A Scalable Efficient P2P Search Engine 28 Scalability Experiment Measure time for a different query loads. –identical queries –inserted into a queue

MINERVA Infinity: A Scalable Efficient P2P Search Engine 29 Experiments: Results

MINERVA Infinity: A Scalable Efficient P2P Search Engine 30 Conclusion Novel architecture for P2P web search. High level of distribution both in data and processing. Novel algorithms to create the networks, place data, and execute queries. Support of two different data replication strategies.

MINERVA Infinity: A Scalable Efficient P2P Search Engine 31 Future Work Support of different score distributions Adapt TIN sizes to the actual load Different top-k query processing algorithms

MINERVA Infinity: A Scalable Efficient P2P Search Engine 32 Thank you for your attention