Query Routing in Peer-to-Peer Web Search Engine
Speaker: Pavel Serdyukov
Supervisors: Gerhard Weikum, Christian Zimmer, Matthias Bender
International Max Planck Research School for Computer Science

2 Talk Outline
- Motivation
- Proposed search engine architecture
- Query routing and database selection
- Similarity-based measures; example: GlOSS
- Document-frequency-based measures; example: CORI
- Evaluation of methods
- Proposals
- Conclusion

3 Problems of present Web search engines
- Size of the indexable Web:
  - the Web is huge; it is difficult to cover all of it
  - timely re-crawls are required
  - technical limits
  - Deep Web
- Monopoly of Google:
  - controls 80% of web search requests
  - paid sites are re-crawled more frequently and receive a higher rating
  - sites may be censored by the engine

4 Make use of Peer-to-Peer technology
- Exploit previously unused CPU/memory/disk power
- Provide up-to-date results for small portions of the Web
- Conquer the Deep Web with personalized and specialized web crawlers
- The global directory must be shared among the peers!
[Figure: a Chord ring implementing the global directory; for each keyword (e.g. "cancer", "elephant", "computer") the directory stores a ranking of peer usefulness (richness) for that keyword.]
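
As a concrete illustration (not the authors' implementation), the sketch below shows how such a keyword-partitioned global directory could be queried: each term maps, via the Chord ring, to per-peer statistics, and the querying peer merges these postings into a ranking of candidate peers. All class, field, and function names are hypothetical, and the usefulness score is the naive summed-DF measure discussed later in the talk.

```python
from collections import defaultdict

class PeerStats:
    """Hypothetical per-peer statistics published to the directory for one term."""
    def __init__(self, peer_id, df, sum_tf):
        self.peer_id = peer_id   # identifier of the peer (its collection)
        self.df = df             # documents of this peer containing the term
        self.sum_tf = sum_tf     # summed term frequency over those documents

class GlobalDirectory:
    """Toy stand-in for the Chord-based directory: term -> list of PeerStats."""
    def __init__(self):
        self.postings = defaultdict(list)

    def publish(self, term, stats):
        self.postings[term].append(stats)

    def lookup(self, term):
        return self.postings.get(term, [])

def route_query(directory, query_terms, top_k=3):
    """Rank peers by a naive usefulness score: summed DF over the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        for stats in directory.lookup(term):
            scores[stats.peer_id] += stats.df
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

In the actual system the `lookup` call would itself be a Chord routing step to the peer responsible for the term's key; the local dictionary above only stands in for that step.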

5 Query routing
- Goal: find peers with relevant documents
- Known previously as the Database Selection Problem
- Not all existing techniques are applicable to P2P query routing

6 Database Selection Problem
- 1st inference: Is this document relevant?
  - a subjective user judgment that we model
  - we use only representations of user needs and documents (keywords, inverted indices)
- 2nd inference: A database can potentially satisfy the query if it
  - has many documents (size-based, naive approach)
  - has many documents containing all query words
  - has a high number of documents with a given similarity to the query
  - has a high summed similarity of its documents to the query

7 Measuring usefulness
- The number of documents containing all query words is unknown:
  - no full document representations are available, only database summaries (representatives)
- The 3rd inference (usefulness) is built on top of the previous two
- Steps of database selection (a sketch of the last two follows below):
  i. Rely on sensible 1st and 2nd inferences
  ii. Choose database representatives for the 3rd inference
  iii. Calculate usefulness measures
  iv. Choose the most useful databases
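
The last two steps are mechanical once a measure is fixed; a minimal sketch of them (the function and parameter names are mine) could look like this:

```python
def select_databases(representatives, query, usefulness_measure, k=10):
    """Steps iii-iv of database selection: score every database representative
    with a usefulness measure and keep the k most useful databases.

    representatives: dict mapping database id -> its summary (whatever structure
    the chosen measure expects); usefulness_measure(summary, query) -> float.
    """
    scored = {db: usefulness_measure(summary, query)
              for db, summary in representatives.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```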

8 Similarity-based measures
- Definition: usefulness is the sum of the document similarities that exceed a threshold l
- Simplest case: the summed weight of the query terms across the collection
  - no assumptions about word co-occurrence
  - l = 0
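
In symbols (a reconstruction from the definitions above, with my notation), the similarity-based family of measures is

$$\mathrm{Usefulness}(q, C, \ell) \;=\; \sum_{\substack{d \in C \\ \mathrm{sim}(q, d) > \ell}} \mathrm{sim}(q, d).$$

With $\ell = 0$ and additive TF.IDF-style similarities this collapses to the summed weight of the query terms over the whole collection, which is why the simplest case needs no co-occurrence assumption.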

9 GlOSS
- High-correlation assumption:
  - Sort all n query terms T_i in descending order of their DF's; then
    DF_n documents contain T_n, T_{n-1}, ..., T_1;
    DF_{n-1} - DF_n documents contain T_{n-1}, T_{n-2}, ..., T_1;
    ...;
    DF_1 - DF_2 documents contain only T_1
  - Use averaged term weights to calculate document similarity
- Choosing the threshold l > 0:
  - l is query dependent
  - l is collection dependent, usually because local IDF's differ
  - Proposal: use global term importance
  - Usually l is set to 0 in experiments
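
A minimal sketch of this estimator, assuming the usual GlOSS database summary of per-term document frequencies and summed term weights (the function name and the example statistics in the comment are mine):

```python
def gloss_usefulness(df, sum_weight, threshold=0.0):
    """Estimate the usefulness of one collection for a query under the
    high-correlation assumption of GlOSS.

    df[t]         -- document frequency of query term t in this collection
    sum_weight[t] -- summed weight of t over the collection's documents
    threshold     -- the similarity threshold l (set to 0 in most experiments)
    """
    terms = sorted(df, key=df.get, reverse=True)      # T_1 .. T_n by descending DF
    avg_w = {t: (sum_weight[t] / df[t]) if df[t] else 0.0 for t in terms}

    usefulness = 0.0
    for j in range(len(terms), 0, -1):
        # Under the assumption, df[T_j] - df[T_{j+1}] documents contain exactly
        # the terms T_1 .. T_j (df[T_{n+1}] is taken to be 0).
        count = df[terms[j - 1]] - (df[terms[j]] if j < len(terms) else 0)
        est_sim = sum(avg_w[t] for t in terms[:j])    # averaged term weights
        if count > 0 and est_sim > threshold:
            usefulness += count * est_sim
    return usefulness

# Example with hypothetical statistics:
# gloss_usefulness({"p2p": 40, "routing": 10}, {"p2p": 80.0, "routing": 35.0})
```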

10 Problems of similarity-based measures
- Is this inference good?
  - a few highly scored documents and many low-scored documents are regarded as equal
  - proposal: sum only the top-K similarities
- Highly scored documents can be a poor indicator of usefulness:
  - most relevant documents have moderate scores
  - highly scored documents can be non-relevant

11 Document-frequency-based measures
- Do not use term frequencies (actual similarities); exploit document frequencies only
- Exploit a global measure of term importance:
  - average IDF
  - ICF (inverse collection frequency)
- Main assumption: many documents containing rare terms mean more to the user and most likely contain the other query terms as well
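
The ICF formula on the original slide did not survive the transcript. The standard definition, given CF(t) collections containing term t out of |C| collections in the system, is

$$\mathrm{ICF}(t) \;=\; \log\frac{|C|}{\mathrm{CF}(t)},$$

sometimes used in the smoothed CORI form $\log\big((|C|+0.5)/\mathrm{CF}(t)\big) / \log(|C|+1)$; "average IDF" presumably denotes the term's IDF averaged over the peers' local collections.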

12 CORI: using TF.IDF-style normalization
- DF: document frequency of the query term in the collection
- DF_MAX: maximum document frequency among all terms in the collection
- CF: number of collections containing the query term
- |C|: number of collections in the system
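
The scoring formula itself appeared as an image on the slide. One plausible reading using the symbols just defined (a reconstruction, not a quote; Callan's original CORI uses T = DF / (DF + 50 + 150 · cw / avg_cw) instead of the DF_MAX normalization):

$$T \;=\; \frac{\mathrm{DF}}{\mathrm{DF}_{\mathrm{MAX}}}, \qquad
I \;=\; \frac{\log\!\big(\tfrac{|C| + 0.5}{\mathrm{CF}}\big)}{\log\!\big(|C| + 1.0\big)}, \qquad
\mathrm{score}(t, C_i) \;=\; b + (1 - b)\, T \cdot I,$$

with a default belief b = 0.4 and the per-term scores summed (or averaged) over the query terms.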

13 CORI issues
- Pure document frequencies make CORI better:
  - the fewer the statistics, the simpler the method
  - smaller variance, better estimates of the ranking (not of actual database summaries)
  - no use of document richness
- To normalize or not to normalize?
  - small databases are not necessarily better
  - a collection may specialize well in several topics

14 Using usefulness measures
[Worked example shown as tables on the original slide: per-peer statistics (DF, DF_max, avg_tf) for the query terms "Information" (CF = 120) and "Retrieval" (CF = 40) with |C| = 1000, together with the resulting per-term and overall peer rankings produced by GlOSS and CORI; the numeric details are not recoverable from the transcript.]
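
To show what such a comparison computes, here is a small sketch that ranks three peers with the simplified CORI-style score from the previous slide. Only CF = 120/40 and |C| = 1000 come from the slide; every per-peer DF and DF_MAX value below is a hypothetical placeholder.

```python
import math

# Directory statistics per peer and term: (DF, DF_MAX). Every number here is a
# hypothetical placeholder; only CF and |C| come from the slide.
peer_stats = {
    "Peer1": {"information": (200, 900), "retrieval": (30, 900)},
    "Peer2": {"information": (350, 800), "retrieval": (90, 800)},
    "Peer3": {"information": (280, 700), "retrieval": (60, 700)},
}
cf = {"information": 120, "retrieval": 40}   # collections containing the term
num_collections = 1000                       # |C|

def cori_like_score(df, df_max, term):
    t = df / df_max if df_max else 0.0       # TF-style normalization by DF_MAX
    i = math.log((num_collections + 0.5) / cf[term]) / math.log(num_collections + 1)
    return 0.4 + 0.6 * t * i                 # default belief b = 0.4

query = ["information", "retrieval"]
ranking = sorted(
    peer_stats,
    key=lambda p: sum(cori_like_score(*peer_stats[p][t], t) for t in query),
    reverse=True,
)
print(ranking)   # peers ordered by estimated usefulness for the query
```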

15 Analysis of experiments
- CORI is the best, but:
  - only when choosing more than 50 out of 236 databases
  - only 10% better when choosing more than 90 databases
- Test collections are strange:
  - documents split chronologically or even randomly
  - no topic specificity
  - no actual Web data used
  - no overlap among collections
- The experiments are unrealistic, so it is unclear:
  - which method is better
  - whether any satisfactory method exists

16 Possible solutions
- Most of the measures can be unified in a single framework; within it we can try:
  - various normalization schemes
  - different notions of term importance (ICF, local IDF)
  - using statistics of the top documents only
  - changing the power of the factors (e.g. DF·ICF^4 is not worse than CORI)
  - changing the form of the expression (GlOSS-like vs. CORI-like)
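
As an illustration of such a framework (my own formulation, not the talk's), one can expose the normalization scheme, the notion of term importance, and the exponents as parameters of a single scoring function:

```python
import math

def usefulness(stats, query, num_collections, cf,
               df_power=1.0, icf_power=1.0, normalize_by_dfmax=False):
    """Parameterized usefulness of one collection for a query.

    stats[t] = (df, df_max) for term t in this collection; cf[t] is the number
    of collections containing t. Setting df_power=1, icf_power=4 gives the
    DF*ICF^4 measure mentioned on the slide; normalize_by_dfmax=True moves the
    measure toward CORI-style TF.IDF normalization.
    """
    score = 0.0
    for t in query:
        df, df_max = stats.get(t, (0, 1))
        df_factor = (df / df_max) if normalize_by_dfmax else df
        icf = math.log(num_collections / max(cf.get(t, 1), 1))
        score += (df_factor ** df_power) * (icf ** icf_power)
    return score
```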

17 Conclusion
- What has been done:
  - measures have been analytically evaluated
  - a sensible subset of measures has been chosen
  - the measures have been implemented
- What could be done next:
  - carry out new, sensible experiments
  - choose an appropriate usefulness measure
  - experiment with database representatives
  - build our own measure
  - try to exploit collection metadata: bookmarks, authoritative documents, collection descriptions

18 Thank you for your attention!