P2P Content Search: Give the Web Back to the People Christian Zimmer, Matthias Bender, Sebastian Michel, Gerhard Weikum Max-Planck-Institut for Informatics,

1 P2P Content Search: Give the Web Back to the People
Christian Zimmer, Matthias Bender, Sebastian Michel, Gerhard Weikum
Max-Planck-Institut for Informatics, Saarbrücken, Germany
Peter Triantafillou
University of Patras, Greece
IPTPS: The 5th International Workshop on Peer-to-Peer Systems
Santa Barbara, California, USA, February 27-28, 2006
Outline of the Talk
1. Feasibility of P2P Web Search
2. Problem Statement
3. Learning from Queries
4. Exploiting Correlation
5. Experiments

2 P2P and Web Search: Marriage in Heaven
P2P Web Search has potential advantages:
- Highly distributed data
- Better processing power
Li, Loo, Hellerstein, Kaashoek, Karger, and Morris questioned the "Feasibility of Peer-to-Peer Web Indexing and Search" (IPTPS 2003).
But: the authors assume distribution of a full term-document index → non-scalable!
Better: a light-weight approach with a distributed term-peer directory.
A variety of projects follow this line: PlanetP (Rutgers), Pepper (CMU), Galanx (Wisconsin), Odissea (Brooklyn), Minerva (MPII), and others.

3 Architectural Model
- Each peer has a full-fledged local search engine (with crawler / importer, indexer, query processor).
- Each peer has autonomously compiled (e.g. crawled) its own content according to the user's thematic interests → peer-specific collections.
- When a query is issued by a peer, it is first executed locally and then possibly routed to carefully selected other peers.
- Peers are connected by an overlay network (e.g. DHT, random graph) and IP.
- Peers can post summaries / synopses / metadata / QoS info to a (distributed) network-wide directory with efficient per-key lookup.

4 Minerva System Architecture
- Built on top of a scalable, churn-resilient DHT
- Conceptually global but physically distributed metadata directory
- Query routing driven by statistics on peer quality
[Figure: peers P1 to P8, each with a local index. Directory peers hold peer lists with peer ranking and statistics: term a: P1, P4, P8; term b: P3, P5, P8; term c: P2, P4, P6; term d: P1, P3; term e: P1, P2, P5; term f: P2, P4, P6. A query peer issues a query with the terms a, b, c.]
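A minimal sketch of the term-to-peer directory idea, under stated assumptions: the class name `Directory`, the modulo-hash placement, and the numeric scores are all hypothetical and are not Minerva's actual API. A real DHT would use consistent hashing and handle churn; this only illustrates per-key lookup of PeerLists.

```python
import hashlib
from collections import defaultdict


class Directory:
    """Toy stand-in for a DHT-backed term-to-peer directory (hypothetical)."""

    def __init__(self, peer_ids):
        self.peer_ids = sorted(peer_ids)
        # One key-value store per directory peer: term -> {peer: score}.
        self.stores = {p: defaultdict(dict) for p in self.peer_ids}

    def responsible_peer(self, term):
        """Map a term to the peer that maintains its PeerList.

        Assumption: simple modulo hashing, not a real DHT routing scheme.
        """
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return self.peer_ids[int(digest, 16) % len(self.peer_ids)]

    def post(self, term, peer, score):
        """A peer publishes a summary score for one of its terms."""
        self.stores[self.responsible_peer(term)][term][peer] = score

    def peer_list(self, term):
        """Return the peers posted for a term, best score first."""
        posts = self.stores[self.responsible_peer(term)].get(term, {})
        return sorted(posts, key=lambda p: (-posts[p], p))


directory = Directory([f"P{i}" for i in range(1, 9)])
# Invented scores for "term a" from the figure (P1, P4, P8).
for peer, score in [("P1", 0.9), ("P4", 0.7), ("P8", 0.4)]:
    directory.post("a", peer, score)
print(directory.peer_list("a"))
```

The point of the design is that the directory stores one entry per (term, peer) pair rather than per (term, document) pair, which is what keeps it light-weight.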

5 Problem Statement
Example query q: "native american music"
- Ask the global directory for three single-term PeerLists
- Combine them into a single PeerList for the complete query
- Ask the top peers for their best documents
- Combine all documents into a single result
What can happen?
- Great results: the top peers for q are selected!
- Bad results: the selected peers are good for individual terms but mediocre for the complete query.
[Figure: peers Pi, Pj, Pk post the terms native, american, and music to the directory; query peer Pq asks for the PeerLists. native: P27, P4, P8, P112, P36, ...; american: P1, P4, P18, P108, P25, ...; music: P13, P4, P88, P36, ... Local documents: Doc 1 "american music", Doc 2 "native american".]
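The routing steps on this slide can be sketched as follows. This is an illustration only: the function name `combine_peer_lists`, the score values, and the sum-of-scores combination rule are assumptions, since the slide does not say how the single-term PeerLists are merged.

```python
def combine_peer_lists(peer_lists, k):
    """Merge single-term PeerLists (term -> {peer: score}) into one top-k list.

    Assumption: scores are summed per peer; other combination rules exist.
    """
    totals = {}
    for scores in peer_lists.values():
        for peer, score in scores.items():
            totals[peer] = totals.get(peer, 0.0) + score
    # Highest total first; ties broken by peer name for determinism.
    ranked = sorted(totals, key=lambda p: (-totals[p], p))
    return ranked[:k]


# Hypothetical per-term quality scores; peer names follow the slide.
peer_lists = {
    "native": {"P27": 0.9, "P4": 0.5, "P8": 0.4},
    "american": {"P1": 0.9, "P4": 0.5, "P18": 0.4},
    "music": {"P13": 0.9, "P4": 0.5, "P88": 0.4},
}
top_peers = combine_peer_lists(peer_lists, k=2)
print(top_peers)  # ['P4', 'P1']
```

P4 wins because it is decent for every individual term, but nothing in the single-term statistics says that P4 holds documents containing all three terms together. That is the "bad results" case.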

6 Problem: Term Correlations
Architectural compromise:
- Best peers for $q = \{t_1, \dots, t_{|q|}\}$ may not be in $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ and possibly not even in $\bigcup_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$
- Also possible: $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ is empty!
- Name and phrase recognition helps but is insufficient
- Lack of correlation-awareness is standard in IR, but more severe in P2P because of the peer-granularity directory
Queries with correlated or specifically "associated" termsets:
- "Michael Jordan", "Lake Superior", "Bell Labs", "hurricane Katrina", "Native American Music", "PhD admission", "black magic", "ice hockey Honolulu", "Natalya Kournikova"
The solution:
- Special handling of correlated termsets as termset posts in the directory, but efficiency & scalability are critical!
Consider correlated termsets for query routing!
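A small illustration of the top-k intersection and union, using the visible prefixes of the three PeerLists from the Problem Statement slide (the lists are truncated there, so this is not the full data). The helper names are hypothetical.

```python
# Visible prefixes of the single-term PeerLists, best peer first.
peer_lists = {
    "native": ["P27", "P4", "P8", "P112", "P36"],
    "american": ["P1", "P4", "P18", "P108", "P25"],
    "music": ["P13", "P4", "P88", "P36"],
}


def top_k_sets(peer_lists, k):
    """Top-k prefix of every single-term PeerList, as sets."""
    return [set(ranking[:k]) for ranking in peer_lists.values()]


def intersection_top_k(peer_lists, k):
    return set.intersection(*top_k_sets(peer_lists, k))


def union_top_k(peer_lists, k):
    return set.union(*top_k_sets(peer_lists, k))


print(intersection_top_k(peer_lists, 1))  # set(): empty for k = 1
print(sorted(union_top_k(peer_lists, 1)))  # ['P1', 'P13', 'P27']
print(intersection_top_k(peer_lists, 2))  # {'P4'}
```

With k = 1 the intersection is empty, and a peer that is excellent for the whole phrase but ranked low for each single term would appear in neither set.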

7 Critical Issues: What Remains to Be Done?
1. How to decide that a termset is correlated?
2. How to store termset posts in the directory?
3. How to exploit termset posts for queries?

8 Possible Approaches
Possible sources of correlated termsets:
- Names and phrases from dictionaries or thesauri → incomplete!
- Frequent itemset mining on data → computationally expensive!
Extraction of all possible term pairs out of the documents:
- Brute-force precomputation of termset posts
- But: quadratic explosion, and what about triples, quadruples, ...?
Impossible to predict all correlated termsets of interest!
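To make the "quadratic explosion" concrete, here is a short sketch that enumerates term pairs for a tiny document and counts pairs and triples for a larger vocabulary. The function name `term_pairs` and the vocabulary size of 1000 are illustrative assumptions.

```python
from itertools import combinations
from math import comb


def term_pairs(document_terms):
    """All unordered pairs of distinct terms in one document."""
    return list(combinations(sorted(set(document_terms)), 2))


pairs = term_pairs(["native", "american", "music", "flute"])
print(len(pairs))  # 6 pairs from only 4 distinct terms

# Number of candidate termsets for a document with 1000 distinct terms.
print(comb(1000, 2))  # 499500 pairs
print(comb(1000, 3))  # 166167000 triples
```

Pairs grow quadratically and triples cubically in the number of distinct terms, which is why brute-force precomputation of termset posts does not scale.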

9 Our Approach
Exploit query logs to learn correlated termsets, driven by "Give the Web back to the people".
Advantages of query logs:
- Reflect the real behavior of millions of users
- Only termsets of interest need to be learned as correlated
- As we will see: integration into the existing architecture comes for free
Looking at query logs to validate that they are useful for recognizing correlated termsets:
- Excite Search Engine Log (1999) with about 2 million real web queries
Queries are a gold mine!

10 Learning Correlated Termsets from Queries
- PeerList request: piggybacking the complete query
- Directory peers remember the query as termsets
Learning included in Query Routing
[Figure: peers P1 to P8. Directory entries: american: P1, P4, P8; native: P3, P5, P8; music: P2, P4, P6. The query "native american music" is attached to each single-term PeerList request, and each directory peer records the termsets of the query that contain its own term.]
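A minimal sketch of how a directory peer might remember piggybacked queries as termsets. The class `DirectoryPeer`, the choice to record every subset of the query that contains the peer's own term, and the frequency threshold `min_count` are assumptions for illustration; the slide only states that directory peers remember the query as termsets.

```python
from collections import Counter
from itertools import combinations


class DirectoryPeer:
    """Directory peer responsible for one term (hypothetical sketch)."""

    def __init__(self, term):
        self.term = term
        self.termset_counts = Counter()

    def handle_peer_list_request(self, full_query):
        """Remember the complete query that was piggybacked on the request."""
        terms = sorted(set(full_query))
        for size in range(2, len(terms) + 1):
            for subset in combinations(terms, size):
                # Assumption: only termsets containing this peer's term.
                if self.term in subset:
                    self.termset_counts[frozenset(subset)] += 1

    def correlated_termsets(self, min_count):
        """Termsets seen at least min_count times (assumed criterion)."""
        return {ts for ts, n in self.termset_counts.items() if n >= min_count}


peer = DirectoryPeer("native")
peer.handle_peer_list_request(["native", "american", "music"])
peer.handle_peer_list_request(["native", "american", "music"])
peer.handle_peer_list_request(["native", "plants"])
learned = peer.correlated_termsets(min_count=2)
print(len(learned))  # 3
```

Because the full query rides along with a request that has to be sent anyway, learning adds no messages.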

11 Collecting and Storing Termset Posts
- Directory peers manage termset posts
- Posting procedure extended with termset posting
No extra communication protocol needed!
[Figure: peers P1 to P8. Directory entries: american: P1, P4, P8; native: P3, P5, P8; music: P2, P4, P6. A peer posts "american" and "native"; the directory peers, which have learned the termsets "american music", "american native", "music native", and "native american music", receive an additional post for the termset "american native", giving the entry american native: P8.]
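One way the extended posting step could look on the posting peer's side, as a sketch under assumptions: the function name `termsets_to_post`, the representation of documents as sets of terms, and the use of a document count as the posted statistic are all hypothetical. The two documents come from the Problem Statement slide.

```python
def termsets_to_post(local_docs, requested_termsets):
    """Local statistics for the termsets a directory peer is interested in.

    Assumption: the statistic is the number of local documents that
    contain every term of the termset.
    """
    posts = {}
    for termset in requested_termsets:
        hits = sum(1 for doc in local_docs if termset <= doc)
        if hits:
            posts[termset] = hits
    return posts


# Doc 1 "american music" and Doc 2 "native american" as term sets.
local_docs = [{"american", "music"}, {"native", "american"}]
requested = [
    frozenset({"american", "native"}),
    frozenset({"native", "music"}),
]
posts = termsets_to_post(local_docs, requested)
print(posts)
```

Only termsets the directory has already learned from queries are considered, so a peer never has to enumerate all term combinations in its collection.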

12 Exploiting Termset Postings
- Integrated in standard query execution
- Fallback option always possible
No additional communication round!
[Figure: peers P1 to P8. Directory entries: american: P1, P4, P8; native: P3, P5, P8, P2; music: P2, P4, P6, P8; american music native: P8; native music: P8, P4. For the query "native american music", the directory peers return the PeerLists for "american music native", "music native", and "native", which yield the PeerList for the complete query.]
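The lookup with fallback can be sketched as below, using the directory entries from the figure. The function name `peer_list_for_query` and the vote-counting fallback are assumptions; the slide only states that a fallback to single-term PeerLists is always possible.

```python
def peer_list_for_query(query_terms, termset_posts, single_term_posts, k):
    """Prefer a termset PeerList for the whole query, else fall back."""
    query = frozenset(query_terms)
    if termset_posts.get(query):
        return termset_posts[query][:k]
    # Fallback (assumed rule): rank peers by how many query terms list them.
    votes = {}
    for term in query_terms:
        for peer in single_term_posts.get(term, []):
            votes[peer] = votes.get(peer, 0) + 1
    return sorted(votes, key=lambda p: (-votes[p], p))[:k]


single_term_posts = {
    "american": ["P1", "P4", "P8"],
    "native": ["P3", "P5", "P8", "P2"],
    "music": ["P2", "P4", "P6", "P8"],
}
termset_posts = {
    frozenset({"american", "music", "native"}): ["P8"],
    frozenset({"native", "music"}): ["P8", "P4"],
}

full = peer_list_for_query(
    ["native", "american", "music"], termset_posts, single_term_posts, k=2
)
fallback = peer_list_for_query(
    ["american", "music"], termset_posts, single_term_posts, k=2
)
print(full)      # ['P8']
print(fallback)  # ['P4', 'P8']
```

The termset entry and the single-term entries live at the same directory peers, so both can be returned in the reply to a single request.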

13 No Termset for Complete Query
- Especially for large queries
- Covering problem!
Integrated into Query Routing!
[Figure: peers P1 to P8. The query "a b c d e" is sent to the directory peers for a, b, c, d, and e. No termset post exists for the complete query; the directory peers return PeerLists for the partial termsets "a b d", "a b c", "b c e", "c e", "d e", and "e", which must be combined to cover the query.]
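A sketch of the covering problem using the termsets from the figure. The greedy set-cover heuristic shown here is a standard technique swapped in for illustration; the slides do not say which covering strategy is used, and the function name `greedy_termset_cover` is hypothetical.

```python
def greedy_termset_cover(query_terms, available_termsets):
    """Cover the query with known termsets, largest uncovered gain first.

    Terms that no termset covers fall back to single-term entries.
    """
    uncovered = set(query_terms)
    # Only termsets that are subsets of the query are useful.
    candidates = [
        frozenset(ts) for ts in available_termsets if set(ts) <= uncovered
    ]
    cover = []
    while uncovered:
        best = max(candidates, key=lambda ts: len(ts & uncovered), default=None)
        if best is None or not (best & uncovered):
            # Fallback: use single-term PeerLists for what is left.
            cover.extend(frozenset([t]) for t in sorted(uncovered))
            break
        cover.append(best)
        uncovered -= best
    return cover


available = ["abd", "abc", "bce", "ce", "de", "e"]  # one letter per term
cover = greedy_termset_cover("abcde", available)
print([sorted(ts) for ts in cover])  # [['a', 'b', 'd'], ['b', 'c', 'e']]
```

Greedy covering is not guaranteed to find the smallest cover, and it ignores how good each termset's PeerList is; optimizing the choice of covers is listed as future work in the conclusion.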

14 What about Networking Costs?
Big concern: too many messages and too much bandwidth consumption?
All messages are piggybacked, no extra costs!
- Learning correlated termsets is integrated in the query routing process
- Asking for termsets is integrated in the posting process
- Exploiting correlated termsets in query processing is free and includes the fallback option, too
Our approach is still scalable because it's all free!

15 Experimental Evaluation
- Experiments: 750 peers with .Gov partitions (~1.2 million web documents)
- Running 50 expanded queries from the TREC-2003 Web Track (examples: "robots research artificial" or "shipwrecks accident")
Major gain in benefit / cost

16 Conclusion and Future Work
- Reconcile scalability with good search-result quality
- No extra networking costs and greatly improved benefit/cost for query routing and processing
- Consider and benefit from user and community behavior
- Optimization of termset covers for queries with many terms
- Real-life testbed with real users!
Thank You for Your Attention!

