Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys (ICDE 2007)

Ivana Podnar Žarko*, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
*FER, University of Zagreb, Croatia
Contact: karl.aberer@epfl.ch

Contents
  – Motivation
  – Indexing and retrieval model (HDKs)
  – Scalability analysis
  – Experimental results
  – Conclusion

Motivation
Clustered retrieval engines are reaching scalability limits
  – Fast-growing public Web
  – Immense volume of privately owned content that will never be indexed by search engines like Google or Yahoo
  – Dynamically changing content
P2P retrieval as a scalable alternative
  – Involve a large number of peer machines (millions)
  – Exploit scalable P2P search techniques
  – Support community-oriented search

P2P full-text retrieval
Goals
  – Retrieval performance comparable to state-of-the-art engines
  – Scalable in terms of generated traffic (indexing and retrieval)
Two basic approaches
  – Document partitioning -> unstructured overlay network for search (e.g. Gnutella)
  – Term partitioning -> structured overlay network for search (e.g. Chord, P-Grid)
Problem: communication cost for search [Li et al., IPTPS 2003]
  – Document partitioning: broadcast search
  – Term partitioning: long posting lists transmitted over the network, in particular when processing multi-term queries

Approach
Some facts about web retrieval
  – Queries are in general short (2 to 3 terms on average)
  – Users pose queries containing frequent terms
  – Users are interested in a few high-precision answers (fast)
Full-text information retrieval engine built over a structured P2P network, specifically considering these observations
ALVIS PEERS
  – EU FP6 research project (2004-2006)

P2PIR architecture
[Figure: three peers, each drawn with the components Ranking, HDK Indexing/Querying, P2P, Web service IF, IR PEER, LI, and GKI]
LI: local single-term index
GKI: global key index, storing pairs (k, postinglist(k))
Structured P2P network with N peers
  – Logarithmic lookup cost for keys
Large document collection D
Each peer P_i
  a) indexes part of the global collection, D(P_i), and
  b) maintains part of the global index
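To picture how the global key index could be spread over the N peers, here is a minimal Python sketch. It assumes a Chord-style identifier ring purely for illustration; the slides do not say how keys are hashed. The names `ring_id`, `canonical_key`, and `Ring` are hypothetical, and the local `bisect` lookup only stands in for the routed lookup that a real overlay such as Chord or P-Grid performs.

```python
import hashlib
from bisect import bisect_left


def ring_id(text, bits=32):
    """Hash a string onto an identifier ring of size 2**bits (assumed scheme)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def canonical_key(terms):
    """A key is a set of terms, so sort them to get an order-independent string."""
    return " ".join(sorted(terms))


class Ring:
    """Toy identifier ring: each key is stored on the first peer at or after its id."""

    def __init__(self, peer_names):
        self.points = sorted((ring_id(name), name) for name in peer_names)
        self.ids = [point for point, _ in self.points]

    def responsible_peer(self, terms):
        key_id = ring_id(canonical_key(terms))
        # Wrap around to the first peer when the key id is past the last peer id.
        position = bisect_left(self.ids, key_id) % len(self.ids)
        return self.points[position][1]


ring = Ring(["peer1", "peer2", "peer3"])
print(ring.responsible_peer({"t1", "t3"}))
```

Because the key string is built from sorted terms, the term set (t1, t3) lands on the same peer regardless of the order in which the terms are supplied.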

Single-term P2P indexing (key = single term)
Local indexes
  Peer1: t1:{d1, d2}, t2:{d1, d3}
  Peer2: t1:{d4, d5}, t2:{d6}
  Peer3: t1:{d7, d8}, t2:{d7}
Global single-term index
  t1:{d1, d2, d4, d5, d7, d8}
  t2:{d1, d3, d6, d7}
Querying peer
  Q = {t1, t2}
  t1,t2:{d1, d4, d7}
Retrieval traffic is not scalable!
  – Grows with D, the collection size in no. of terms (Heaps' law) [growth expression missing in the transcript]
  – Experimentally linear; frequent terms are used frequently in queries
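The traffic problem on this slide can be reproduced with a minimal Python sketch. The posting lists are the ones drawn on the slide; the function names are hypothetical, and traffic is counted simply as the number of postings shipped to the querying peer, which is an assumption about the cost model.

```python
from collections import defaultdict

# Local single-term indexes as drawn on the slide.
LOCAL_INDEXES = {
    "Peer1": {"t1": ["d1", "d2"], "t2": ["d1", "d3"]},
    "Peer2": {"t1": ["d4", "d5"], "t2": ["d6"]},
    "Peer3": {"t1": ["d7", "d8"], "t2": ["d7"]},
}


def build_global_index(local_indexes):
    """Merge every peer's local postings into one posting list per term."""
    global_index = defaultdict(list)
    for postings_by_term in local_indexes.values():
        for term, postings in postings_by_term.items():
            global_index[term].extend(postings)
    return dict(global_index)


def retrieval_traffic(global_index, query):
    """Postings shipped to the querying peer: the full list of every query term."""
    return sum(len(global_index.get(term, [])) for term in query)


global_index = build_global_index(LOCAL_INDEXES)
print(global_index["t1"])                                # six postings for t1
print(retrieval_traffic(global_index, ["t1", "t2"]))     # 6 + 4 = 10 postings
```

Every document added to the collection that contains t1 or t2 lengthens a list that must be transferred in full, which is why the cost keeps growing with the collection.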

HDK-based P2P indexing (key = set of terms)
Local indexes
  Peer1: t1:{d1, d2}, t2:{d1, d3}
  Peer2: t1:{d4, d5}, t2:{d6}, t3:{d5, d6}
  Peer3: t1:{d7, d8}, t2:{d8}, t3:{d7}
Global key index (DF max = 4)
  t1:{d4, d1, d8, d5}   posting list truncated to the top-DF max postings
  t2:{d1, d3, d6, d8}
  k13:{d5, d7}          k13 = (t1, t3)
Querying peer
  Q = {t1, t2, t3}
  k13, k2:{d5, d7, d1, d3, d6, d8}
Retrieval traffic is bounded by DF max and the query size!
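The lookup on this slide can be sketched in Python as follows. The key index is copied from the slide; the query-to-key mapping (largest indexed term sets first, skipping any key that is a proper subset of one already chosen) is an assumption that happens to reproduce the slide's answer, not necessarily the authors' exact procedure. All function names are hypothetical.

```python
from itertools import combinations

DF_MAX = 4

# Global key index as drawn on the slide; lists are already truncated to DF_MAX.
GLOBAL_KEY_INDEX = {
    frozenset({"t1"}): ["d4", "d1", "d8", "d5"],
    frozenset({"t2"}): ["d1", "d3", "d6", "d8"],
    frozenset({"t1", "t3"}): ["d5", "d7"],  # k13
}


def map_query_to_keys(query_terms, index):
    """Pick indexed keys, largest first, skipping keys covered by a chosen key."""
    chosen = []
    terms = sorted(query_terms)
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if key in index and not any(key < bigger for bigger in chosen):
                chosen.append(key)
    return chosen


def retrieve(query_terms, index):
    """Fetch the posting list of each chosen key and merge them without duplicates."""
    keys = map_query_to_keys(query_terms, index)
    results, traffic = [], 0
    for key in keys:
        postings = index[key]
        traffic += len(postings)
        for doc in postings:
            if doc not in results:
                results.append(doc)
    return keys, results, traffic


keys, results, traffic = retrieve({"t1", "t2", "t3"}, GLOBAL_KEY_INDEX)
print(results)   # ['d5', 'd7', 'd1', 'd3', 'd6', 'd8']
print(traffic)   # 6 postings, within the bound DF_MAX * len(keys) = 8
```

The query is answered from k13 and the key for t2 only. The long list for t1 is never transferred, because t1 is already covered by the more specific key k13.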

Single-term vs. HDK-based P2P indexing
  – Comparable retrieval quality (extended vocabulary)
  – Vocabulary size could grow exponentially!

Keys and key filtering
Non-Discriminative Keys (NDKs), e.g. t1 is an NDK iff:
  – t1 appears in more than DF max collection documents
  Posting lists are truncated to the top-DF max documents
Highly Discriminative Keys (HDKs), e.g. (t1, t2) is an HDK iff:
  – t1 and t2 appear together in fewer than DF max collection documents (discriminative w.r.t. the document collection)
  – t1 and t2 are each non-discriminative (redundancy filter)
  – t1 and t2 occur within a window of size w (proximity filter)
  – the number of terms in a key is limited by s max (size filter)
  Posting lists by definition contain at most DF max documents
Key filtering enables scalable indexing!
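The filters above can be sketched in Python for keys of size 1 and 2. This is a simplified illustration, not the authors' implementation: the toy documents and all function names are made up, the size filter is fixed at two terms, and a document frequency exactly equal to DF max is treated as discriminative, a boundary case the slide does not settle.

```python
def single_term_postings(docs):
    """Map each term to the set of documents that contain it."""
    postings = {}
    for doc_id, terms in docs.items():
        for term in set(terms):
            postings.setdefault(term, set()).add(doc_id)
    return postings


def pairs_within_window(terms, w):
    """Unordered pairs of distinct terms that co-occur within w positions."""
    pairs = set()
    for i, term in enumerate(terms):
        for other in terms[i + 1:i + w]:
            if other != term:
                pairs.add(frozenset((term, other)))
    return pairs


def build_keys(docs, df_max, w):
    """Return (discriminative keys with postings, set of non-discriminative terms)."""
    singles = single_term_postings(docs)
    ndk_terms = {t for t, d in singles.items() if len(d) > df_max}
    keys = {frozenset((t,)): d for t, d in singles.items() if len(d) <= df_max}
    candidates = {}
    for doc_id, terms in docs.items():
        for pair in pairs_within_window(terms, w):   # proximity filter
            if pair <= ndk_terms:                    # redundancy filter
                candidates.setdefault(pair, set()).add(doc_id)
    for pair, doc_ids in candidates.items():
        if len(doc_ids) <= df_max:                   # discriminative w.r.t. collection
            keys[pair] = doc_ids
    return keys, ndk_terms


# Hypothetical toy collection: term sequences per document.
DOCS = {
    1: ["a", "b", "c"],
    2: ["a", "b", "d"],
    3: ["a", "c", "b"],
    4: ["b", "x", "x", "a"],
}
keys, ndk_terms = build_keys(DOCS, df_max=2, w=2)
print(sorted(ndk_terms))                # ['a', 'b']
print(keys[frozenset(("a", "b"))])      # {1, 2}
```

In the toy collection, a and b each occur in four documents and are therefore NDKs. The pair (a, b) is adjacent only in documents 1 and 2, so it becomes a key with a two-document posting list. Pairs involving the rare terms c, d, or x are never generated, because those terms are already indexed on their own.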

Scalability analysis (indexing)
What is the upper bound on the index size for a very large document collection?
Notation
  D: collection size in no. of terms
  s: no. of terms comprising a key
  w: window size
  IS_s: index size associated with keys of size s
  P_f,(s-1): probability of NDK occurrences where the NDK size is (s-1)
[Table: index size (location index) per key size 1, 2, ..., s; the formulas are missing in the transcript. Remaining annotation: "constant ?"]

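Scalability analysis (indexing): Zipf model
[Figure: Zipf curve z(r) over rank r, with the frequency thresholds F_f and F_r dividing the terms into very frequent terms, frequent terms, and rare terms]
The threshold F_r is tied to DF max (the relation symbol is missing in the transcript) and separates NDKs from HDKs.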

Scalability analysis (indexing)
[Figure: Zipf curve z(r) over rank r with the thresholds F_f and F_r, shown for increasing D]
In the Zipf model, C increases with increasing collection size; a remains constant.
Theorem: the probability P_f,(s-1) of NDK occurrence remains constant!

Scalability analysis (retrieval)
Retrieval traffic is bounded by DF max and the number of keys a query is mapped to (constant)
Scalability is theoretically guaranteed, but what are the constants?
Experiments!
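A rough feel for the constants can be had from a small Python calculation. It assumes the worst case in which every term subset of the query with at most s max terms is an indexed key and each key returns a full list of DF max postings. The slides do not state this counting rule, so both function names and the rule itself are assumptions, and the redundancy filter would make the real number of keys smaller.

```python
from math import comb


def max_keys_per_query(query_size, s_max):
    """Number of non-empty term subsets of the query with at most s_max terms."""
    largest = min(query_size, s_max)
    return sum(comb(query_size, size) for size in range(1, largest + 1))


def traffic_bound(query_size, s_max, df_max):
    """Worst-case postings transferred: every candidate key returns df_max postings."""
    return max_keys_per_query(query_size, s_max) * df_max


# Parameter values taken from the experiment slide: s max = 3, DF max = 400.
print(max_keys_per_query(3, 3))     # 3 + 3 + 1 = 7 candidate keys
print(traffic_bound(3, 3, 400))     # 7 * 400 = 2800 postings
```

The bound depends only on the query size, s max, and DF max. The collection size D does not appear, which is the point of the slide.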

Experiment
System fully implemented in Java (available on request)
Document collection
  – 20,000, 40,000, ..., 140,000 documents from Wikipedia (www.wikipedia.org)
Query log
  – Wikipedia query log for 2 months (08/2004 and 09/2004)
  – 3,000 randomly chosen queries from 2,000,000 unique queries with more than 20 hits
No. of peers: 4, 8, ..., 28
  – PCs running RedHat Linux with 1 GB memory
  – 100 Mbit Ethernet
  – Each peer indexes 5,000 documents
DF max = 400 or 500, s max = 3, w = 20

Indexing costs
[Figures: average index size per peer; average indexing traffic per peer]
HDK vs. single-term (ST) indexing
  – Experimentally: HDK / ST = 13.9 (for 140,000 documents)
  – Theoretically: HDK / ST = 40.7 (overestimated upper bound!)

Retrieval costs
[Figure: retrieval traffic per query (Wikipedia query log)]
  – Remains constant with growing collection size for the HDK approach (linear for single-term)

Estimated total generated traffic
Assumptions
  – Monthly indexing
  – No. of queries per month: 1.5 * 10^6 (true no. of queries from the Wikipedia log, conservative estimate)
For 1 billion documents, HDK generates 42 times less overall traffic

Retrieval performance
[Figure: overlap on top 20 documents]
  – Performance of the HDK-based approach is comparable to the centralized single-term engine with BM25
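The overlap measure can be sketched in Python as follows. The slides do not define it precisely, so this assumes the common reading: the share of the reference engine's top-k documents that also appear in the other engine's top-k, normalized by k. The function name and the example rankings are hypothetical.

```python
def overlap_at_k(ranking_a, ranking_b, k=20):
    """Fraction of documents shared by the top-k of two ranked result lists."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Hypothetical rankings from a centralized engine and an HDK-based engine.
centralized = ["d1", "d2", "d3", "d4"]
hdk_based = ["d3", "d1", "d9", "d8"]
print(overlap_at_k(centralized, hdk_based, k=4))   # 2 shared documents out of 4 = 0.5
```

The measure ignores the order inside the top-k, so two engines that return the same 20 documents in different order still score 1.0.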

Conclusion
  – Novel indexing model based on indexing terms and term sets
  – Theoretical scalability model proves that the proposed solution scales to large networks in terms of generated traffic, both for indexing and retrieval
  – Running P2P prototype that exhibits retrieval performance fully comparable to a centralized term-based retrieval system
  – Associated resource requirements (storage, bandwidth consumption) grow in a scalable way, as shown by experiments

Ongoing work
Further reduce the number of indexing keys by using query-driven indexing to produce and store only keys that are profitable for query answering

Acknowledgement
The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP6 STREP project ALVIS (002068).
