Presentation is loading. Please wait.

Presentation is loading. Please wait.

MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine Middleware 2005 Grenoble, France Sebastian Michel Max-Planck-Institut für Informatik.

Similar presentations


Presentation on theme: "MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine Middleware 2005 Grenoble, France Sebastian Michel Max-Planck-Institut für Informatik."— Presentation transcript:

1 MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine Middleware 2005 Grenoble, France Sebastian Michel Max-Planck-Institut für Informatik Saarbrücken, Germany smichel@mpi-inf.mpg.de Peter Triantafillou University of Patras Rio, Greece peter@ceid.upatras.gr Gerhard Weikum Max-Planck-Institut für Informatik Saarbrücken, Germany weikum@mpi-inf.mpg.de

2 MINERVA Infinity: A Scalable Efficient P2P Search Engine 2 Vision P2P approach best suitable –large number of peers –exploit mostly idle resources –intellectual input of user community Today: Web Search is dominated by centralized engines (“to google”) - censorship? - single point of attack/abuse - coverage of the web? Ultimate goal: “Distributed Google” to break information monopolies

3 MINERVA Infinity: A Scalable Efficient P2P Search Engine 3 Challenges large scale networks –100,000 to 10,000,000 users large collections > 10^10 documents –1,000,000 terms high dynamics

4 MINERVA Infinity: A Scalable Efficient P2P Search Engine 4 Questions Network Organization –structured? –hierarchical? –unstructured? Data Placement –move data around? –data remains at the owner? Scalability? Query Routing/Execution –Routing indexes? –Message flooding?

5 MINERVA Infinity: A Scalable Efficient P2P Search Engine 5 Overview Motivation (Vision/Challenges/Questions) Introduction to IR and P2P Systems P2P- IR Minerva Infinity Network Organization Data Placement Query Processing Data Replication Experiments Conclusion

6 MINERVA Infinity: A Scalable Efficient P2P Search Engine 6 Information Retrieval Basics Document Terms 5 x 7 x 4 x # of terms (term frequency)

7 MINERVA Infinity: A Scalable Efficient P2P Search Engine 7 Information Retrieval Basics (2) index lists with (DocId: tf*idf) sorted by Score B+ tree on terms Query Execution: Usually using some kind of threshold algorithm*: - sequential scans over the index lists (round-robin) - (random accesses to fetch missing scores) - aggregate scores - stop when the threshold is reached Top-k Query Processing: find k documents with the highest total score e.g. Fagin’s algorithm TA or a variant without random accesses d17: 0.3 d44: 0.4... d52: 0.1 d53: 0.8 d55: 0.6 d12: 0.5 d14: 0.4... d28: 0.1 d51: 0.6 d52: 0.3 d28: 0.7... d17: 0.1 d44: 0.2 d11: 0.6

8 MINERVA Infinity: A Scalable Efficient P2P Search Engine 8 P2P Systems Peer: –“one that is of equal standing with another” (source: Merriam-Webster Online Dictionary ) Benefits: –no single point of failure –resource/data sharing Problems/Challenges: –authority/trust/incentives –high dynamics –… Applications: – File Sharing – IP Telephony – Web Search – Digital Libraries

9 MINERVA Infinity: A Scalable Efficient P2P Search Engine 9 Structured P2P Systems based on Distributed Hash Tables (DHTs) “structured” P2P networks provide one simple method: lookup:key->peer CAN [SIGCOMM 2001] CHORD [SIGCOMM 2001] Pastry [Middleware 2001] P-Grid [CoopIS 2001] robustness to load skew, failures, dynamics

10 MINERVA Infinity: A Scalable Efficient P2P Search Engine 10 p1 p8 p14 p21 p32 p38 p42 p48 p51 p56 Chord Peers and keys are mapped to the same cyclic ID space using a hash function Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k  p and there is no node p‘ with k  p‘ and p‘<p k10 k24 k30k38 k54

11 MINERVA Infinity: A Scalable Efficient P2P Search Engine 11 Chord (2) Using finger tables to speed up lookup process Store pointers to few distant peers Lookup in O(log n) steps Chord Ring p1p1 p8 p 56 p 51 p 42 p 38 p 32 p 21 p 14 k 54 Lookup(54)

12 MINERVA Infinity: A Scalable Efficient P2P Search Engine 12 Overview Motivation (Vision/Challenges/Questions) Introduction to IR and P2P Systems P2P- IR Minerva Infinity Network Organization Data Placement Query Processing Data Replication Experiments Conclusion

13 MINERVA Infinity: A Scalable Efficient P2P Search Engine 13 P2P - IR Share documents (e.g. Web pages) in an efficient and scalable way Ranked retrieval –simple DHT is insufficient

14 MINERVA Infinity: A Scalable Efficient P2P Search Engine 14 Possible Approaches Each peer is responsible for storing the COMPLETE index list for a subset of terms. p1 p8 p14 p21 p32 p38 p42 p48 p51 p56 Query Routing: DHT lookups Query Execution: Distributed Top-k [TPUT ’04, KLEE ‘05] capacity overload of peers with highly frequent / popular terms (data load AND query load)

15 MINERVA Infinity: A Scalable Efficient P2P Search Engine 15 Possible Approaches (2) Each peer has its own local index (e.g., created by web crawls) Query Routing: 1. DHT lookups 2. Retrieve Metadata 3. Find most promising peers Query Execution: - Send the complete Query and merge the incoming results capacity overload of peers with - highly frequent terms - high-quality collections a: P1 P6 P4 b: P5 P3 P1 P6... Distributed Directory Term  List of Peers P1 P5 P6P4 P2 P3

16 MINERVA Infinity: A Scalable Efficient P2P Search Engine 16 Overview Motivation (Vision/Challenges/Questions) Introduction to IR and P2P Systems P2P- IR Minerva Infinity Network Organization Data Placement Query Processing Data Replication Experiments Conclusion

17 MINERVA Infinity: A Scalable Efficient P2P Search Engine 17 Minerva Infinity Idea: –assign (term, docId, score) triplets to the peers order preserving load balancing –hash(score)+ hash(term) as offset –guarantee 100% recall high low high low high low

18 MINERVA Infinity: A Scalable Efficient P2P Search Engine 18 Hash Function Requirements: –Load balancing (to avoid overloading peers) –Order preserving (to make the QP work) One without the other is trivial... –Load balancing: apply a pseudo random hash function –Order preserving: Both together is challenging … S-Smin ----------------- * N Smax - Smin

19 MINERVA Infinity: A Scalable Efficient P2P Search Engine 19 Hash Function (2) Assume an exponential score distribution Place the first half of the data to the first peer The next quarter to the next peer and so on … … 1 0

20 MINERVA Infinity: A Scalable Efficient P2P Search Engine 20 Term Index Networks (TINs) Reduce # of hops during QP by reducing the number of peers that maintain the index list for a particular term  Only a small subset of peers is used to store an index list. Global Network B A C 2 7 12 15 16 20 24 37 41 45 62 12 24 62 2 41 24 7 16 45

21 MINERVA Infinity: A Scalable Efficient P2P Search Engine 21 How to Create/Find a TIN Use u Beacon-Peers to bootstrap the TIN for term T Global Network p = 1/u For i=0 to i<n‘ do id = hash(t, i*p) if (i>0) use hash(t,( i-1)*p ) as a gateway to the TIN else node with id creates the TIN End for T Beacon nodes act as gateways to the TIN

22 MINERVA Infinity: A Scalable Efficient P2P Search Engine 22 Publish Data / Join a TIN Peer with id = hash(t, score) not in the TIN for term t Randomly select a beacon node (Beacon nodes act as gateways to the TIN) Call the join method Store the item (docId, t, score)

23 MINERVA Infinity: A Scalable Efficient P2P Search Engine 23 Query Processing Data PeersCoordinator 11 2-keyword Query Alternative: Collect data and send in one batch.

24 MINERVA Infinity: A Scalable Efficient P2P Search Engine 24 QP with Moving Coordinator Data PeersCoordinator 111 3-keyword Query

25 MINERVA Infinity: A Scalable Efficient P2P Search Engine 25 Data Replication Vertical: Replicate data inside a TIN via a ‘reverse’ communication. Horizontal: Replicate complete TINs 1 2 3 123 123 123 B 34 1 50 BB 20 8 64 B C 12 62 28 C 12 62 A 49 5 19 A A 417 16 A 417 16 B 24 2 45 B 24 2 45 C 7 55 22 C B 46 11 57 B C 11 64 24 C A 1 16 A 31

26 MINERVA Infinity: A Scalable Efficient P2P Search Engine 26 Experiments Test bed: 10,000 peers Benchmarks: GOV : TREC.GOV collection + 50 TREC-2003 Web queries, e.g. juvenile delinquency XGOV : TREC.GOV collection + 50 manually expanded queries, e.g. juvenile delinquency youth minor crime law jurisdiction offense prevention SCALABILITY : One query executed multiple times ……….

27 MINERVA Infinity: A Scalable Efficient P2P Search Engine 27 Experiments: Metrics Metrics Network traffic (in KB) Query response time (in s) - network cost (150ms RTT, 800Kb/s data transfer rate) - local I/O cost (8ms rotation latency + 8MB/s transfer delay) - processing cost Number of Hops

28 MINERVA Infinity: A Scalable Efficient P2P Search Engine 28 Scalability Experiment Measure time for a different query loads. –identical queries –inserted into a queue

29 MINERVA Infinity: A Scalable Efficient P2P Search Engine 29 Experiments: Results

30 MINERVA Infinity: A Scalable Efficient P2P Search Engine 30 Conclusion Novel architecture for P2P web search. High level of distribution both in data and processing. Novel algorithms to create the networks, place data, and execute queries. Support of two different data replication strategies.

31 MINERVA Infinity: A Scalable Efficient P2P Search Engine 31 Future Work Support of different score distributions Adapt TIN sizes to the actual load Different top-k query processing algorithms

32 MINERVA Infinity: A Scalable Efficient P2P Search Engine 32 Thank you for your attention


Download ppt "MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine Middleware 2005 Grenoble, France Sebastian Michel Max-Planck-Institut für Informatik."

Similar presentations


Ads by Google