1  Query Routing in Peer-to-Peer Web Search Engine
Speaker: Pavel Serdyukov
Supervisors: Gerhard Weikum, Christian Zimmer, Matthias Bender
International Max Planck Research School for Computer Science

2  Talk Outline
- Motivation
- Proposed search engine architecture
- Query routing and database selection
- Similarity-based measures (example: GlOSS)
- Document-frequency-based measures (example: CORI)
- Evaluation of methods
- Proposals
- Conclusion

3  Problems of present Web search engines
- Size of the indexable Web:
  - the Web is huge; it is difficult to cover it all
  - timely re-crawls are required
  - technical limits
  - Deep Web
- Monopoly of Google:
  - controls 80% of web search requests
  - paid sites get updated more frequently and receive higher ranking
  - sites may be censored by the engine

4  Make use of peer-to-peer technology
- Exploit previously unused CPU/memory/disk power
- Provide up-to-date results for small portions of the Web
- Conquer the Deep Web with personalized and specialized web crawlers
- A global directory gives, for each keyword, a ranking of peer usefulness (richness) for that keyword
- The global directory must be shared among the peers, e.g. distributed over a Chord ring
[Slide figure: keywords such as "cancer", "elephant" and "computer" mapped to ranked peer lists, stored in a global directory over a Chord ring of peers]

5  Query routing
- Goal: find peers with relevant documents
- Previously known as the Database Selection Problem
- Not all existing database-selection techniques are applicable to P2P query routing

6  Database Selection Problem
- 1st inference: is this document relevant?
  - relevance is a subjective user judgment; we model it
  - we use only representations of user needs and documents (keywords, inverted indices)
- 2nd inference: a database has the potential to satisfy the query if it
  - has many documents (size-based, naive approach)
  - has many documents containing all query words
  - has a high number of documents with a given similarity
  - has a high summed similarity over those documents

7  Measuring usefulness
- The number of documents containing all query words is unknown:
  - no full document representations are available, only database summaries (representatives)
- The 3rd inference (usefulness) is built on top of the previous two
- Steps of database selection (see the sketch below):
  i. Rely on sensible 1st and 2nd inferences
  ii. Choose database representatives for the 3rd inference
  iii. Calculate usefulness measures
  iv. Choose the most useful databases
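A minimal sketch of this selection loop, assuming the concrete measure (GlOSS, CORI, ...) is passed in as a usefulness(summary, query) function; the names and the shape of the summaries are illustrative, not taken from the slides:

    def select_databases(summaries, query, usefulness, top_n=10):
        # summaries: database name -> per-term statistics (the database "representative")
        # usefulness: the chosen measure, e.g. a GlOSS- or CORI-style scoring function
        scores = {db: usefulness(summary, query) for db, summary in summaries.items()}  # step iii
        ranked = sorted(scores, key=scores.get, reverse=True)                           # step iv
        return ranked[:top_n]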

8  Similarity-based measures
- Definition: usefulness is the sum of the document similarities that exceed a threshold l
- Simplest case: the summed weight of the query terms across the collection
  - no assumptions about word co-occurrence
  - l = 0
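Written out (a direct transcription of the slide's wording, with sim denoting the engine's similarity function):

    \mathrm{Usefulness}_{l}(q, C) = \sum_{d \in C,\; \mathrm{sim}(q,d) > l} \mathrm{sim}(q, d)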

9  GlOSS
- High-correlation assumption:
  - Sort the n query terms T_i in descending order of their DFs; then
    the DF_n documents with the rarest term contain T_n, T_n-1, ..., T_1,
    the next DF_n-1 - DF_n documents contain T_n-1, T_n-2, ..., T_1,
    ..., and the last DF_1 - DF_2 documents contain only T_1
  - Use averaged term weights to calculate document similarity
- l > 0:
  - l is query dependent
  - l is collection dependent, usually because local IDFs differ
    - proposal: use global term importance
  - usually l is set to 0 in experiments
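A sketch of this bucketed estimate for the l = 0 case, assuming the database summary provides per-term document frequencies df and averaged term weights avg_w (both dicts keyed by term); it follows the slide's description rather than the exact estimator of the GlOSS papers:

    def gloss_high_correlation(df, avg_w, query, l=0.0):
        # Query terms the summary knows about, most frequent first: df[t_1] >= ... >= df[t_n].
        terms = sorted(set(query) & set(df), key=lambda t: df[t], reverse=True)
        usefulness = 0.0
        for i in range(len(terms), 0, -1):
            # Under high correlation, df[t_i] - df[t_{i+1}] documents contain exactly t_1 .. t_i.
            bucket_size = df[terms[i - 1]] - (df[terms[i]] if i < len(terms) else 0)
            est_sim = sum(avg_w[t] for t in terms[:i])  # estimated similarity of those documents
            if est_sim > l:  # with l = 0, every bucket with positive similarity counts
                usefulness += bucket_size * est_sim
        return usefulness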

10  Problems of similarity-based measures
- Is this inference good?
  - a few highly scored documents and a lot of low-scored documents are treated as equal
  - proposal: sum only the first K similarities (see the sketch below)
- Highly scored documents can be a bad indicator of usefulness:
  - most relevant documents have moderate scores
  - highly scored documents can be non-relevant
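A one-function sketch of that top-K proposal, assuming estimated per-document similarities for a database are already available; the function name is illustrative:

    def topk_usefulness(est_sims, k=10):
        # Sum only the K highest estimated similarities, so a database with a few strong
        # documents is not drowned out by one with many weakly matching documents.
        return sum(sorted(est_sims, reverse=True)[:k])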

11  Document-frequency-based measures
- Do not use term frequencies (actual similarities)
- Exploit document frequencies only
- Exploit a global measure of term importance:
  - average IDF
  - ICF (inverse collection frequency), see below
- Main assumption: many documents with rare terms
  - have more meaning for the user
  - most likely contain the other query terms
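The right-hand side of the ICF definition did not survive extraction; by direct analogy with IDF, and using the CF and |C| notation defined on the next slide, it is normally written as:

    \mathrm{ICF}(t) = \log \frac{|C|}{\mathrm{CF}(t)}

where CF(t) is the number of collections containing term t and |C| is the total number of collections in the system.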

12  CORI: using TF·IDF-style normalization
- DF: document frequency of the query term
- DF_MAX: maximum document frequency among all terms in the collection
- CF: number of collections containing the query term
- |C|: number of collections in the system
(The scoring formula itself is sketched below.)
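The formula on this slide was an image and is not recoverable from the transcript. As an assumption consistent with the variables listed above, a CORI-style per-term belief with max-normalized document frequencies could look like this (the 0.4/0.6 defaults and the I component follow the published CORI work of Callan et al.; the DF/DF_MAX normalization reflects this slide's TF·IDF analogy rather than a verified formula):

    T(t, c) = 0.4 + 0.6 \cdot \frac{\mathrm{DF}(t, c)}{\mathrm{DF_{MAX}}(c)}
    I(t) = \frac{\log\big((|C| + 0.5) / \mathrm{CF}(t)\big)}{\log(|C| + 1)}
    \mathrm{belief}(t, c) = 0.4 + 0.6 \cdot T(t, c) \cdot I(t)

The usefulness of collection c for a query is then the sum (or average) of belief(t, c) over the query terms t.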

13  CORI issues
- Pure document frequencies make CORI better:
  - the fewer statistics, the simpler
  - smaller variance
  - better estimates of the ranking, not of the actual database summaries
- No use of document richness
- To normalize or not to normalize?
  - small databases are not necessarily better
  - a collection may specialize well in several topics

14  Using usefulness measures
Example query: "information retrieval", with |C| = 1000, CF(information) = 120, CF(retrieval) = 40.

Per-peer statistics for "information":
  Peer1: DF = 20, avg_tf = 12, DF_max = 60
  Peer2: DF = 60, avg_tf = 6, DF_max = 400
  Peer3: DF = 20, avg_tf = 15, DF_max = 60

Per-peer statistics for "retrieval":
  Peer1: DF = 5, avg_tf = 8, DF_max = 60
  Peer2: DF = 10, avg_tf = 4, DF_max = 400
  Peer3: DF = 5, avg_tf = 10, DF_max = 60

Resulting rankings:
  GlOSS: Peer2 (845), Peer3 (784), Peer1 (627)
  CORI: Peer3 (0.5681), Peer1 (0.5681), Peer2 (0.5634)

15  Analysis of experiments
- CORI is the best, but:
  - only when choosing more than 50 out of 236 databases
  - only 10% better when choosing more than 90 databases
- Test collections are strange:
  - documents separated chronologically or even randomly
  - no topic specificity
  - no actual Web data used
  - no overlap among collections
- The experiments are unrealistic, so it is unclear:
  - which method is better
  - whether any method is satisfactory

16  Possible solutions
- Most of the measures can be unified in a single framework (sketched below); within it we can try:
  - various normalization schemes
  - different notions of term importance (ICF, local IDF)
  - using statistics of the top documents
  - changing the power of the factors: DF·ICF^4 is no worse than CORI
  - changing the form of the expression (GlOSS-like vs CORI-like)
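A minimal sketch of such a parameterized framework, with illustrative names and defaults chosen here rather than taken from the slides; power = 4 with ICF as the importance notion gives the DF·ICF^4 variant mentioned above:

    import math

    def icf(cf, num_collections):
        # Inverse collection frequency of a term, by analogy with IDF.
        return math.log(num_collections / cf)

    def unified_score(df, query, importance, power=1.0, normalizer=1.0):
        # Generic per-term contribution: (normalized DF) * (term importance)^power,
        # summed over the query terms present in the database summary.
        return sum((df[t] / normalizer) * (importance[t] ** power)
                   for t in query if t in df)

    # Example: the DF * ICF^4 variant, with hypothetical statistics df_peer and cf.
    # score = unified_score(df_peer, ["information", "retrieval"],
    #                       {t: icf(cf[t], 1000) for t in cf}, power=4.0)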

17  Conclusion
- What has been done:
  - the measures were evaluated analytically
  - a sensible subset of measures was chosen
  - the measures were implemented
- What could be done next:
  - carry out new, sensible experiments
  - choose an appropriate usefulness measure
  - experiment with database representatives
  - build our own measure
  - try to exploit collection metadata: bookmarks, authoritative documents, collection descriptions

18  Thank you for your attention!

