1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.

Presentation transcript:

1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007

2 Introduction (1/2)
A search engine can track which of its search results were clicked for which query
Click records of query-document pairs can be viewed as a weak indication of relevance
–The user decided to at least view the document, based on its description in the search results
We can use the clicks of past users to improve the current search results
–The clicked set of documents is likely to differ from the current user's relevance set

3 Introduction (2/2)
From the perspective of a user conducting a search:
–Documents that are clicked but not relevant constitute noise
–Documents that are relevant but not clicked constitute sparsity in the click data
Power law distribution: most queries in the click log have a small number of clicked documents
This paper focuses on the sparsity problem by giving a Markov random walk model, although the model also has noise reduction properties

4 Algorithm on the Click Graph
The current model uses click data alone, without considering document content or query content
The click graph:
–Bipartite
–Two types of nodes: queries and documents
–An edge connects a query and a document if a click for that query-document pair is observed
–The edge may be weighted according to the total number of clicks from all users
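As a minimal sketch of this structure (the click log below is entirely made up; the queries, URLs and counts are illustrative only), the bipartite click graph can be stored as a symmetric click-count matrix:

```python
import numpy as np

# Hypothetical click log: (query, document, total clicks) triples.
click_log = [
    ("jaguar",     "en.wikipedia.org/wiki/Jaguar", 40),
    ("jaguar",     "jaguar.com",                   25),
    ("jaguar car", "jaguar.com",                   30),
    ("big cats",   "en.wikipedia.org/wiki/Jaguar", 10),
]

# Two node types: queries ("q:") and documents ("d:"). An edge can only
# connect a query to a document, so the graph is bipartite.
nodes = []
for q, d, _ in click_log:
    for n in ("q:" + q, "d:" + d):
        if n not in nodes:
            nodes.append(n)

# Symmetric click-count matrix C: C[j, k] = clicks on the (j, k) edge.
idx = {n: i for i, n in enumerate(nodes)}
C = np.zeros((len(nodes), len(nodes)))
for q, d, c in click_log:
    j, k = idx["q:" + q], idx["d:" + d]
    C[j, k] = C[k, j] = c  # the walk may traverse an edge in either direction
```

Storing the counts symmetrically reflects that the walk in the following slides moves in both directions: query to document and document to query.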

5 Click Graph Example

6 Application Areas for Algorithms on Click Graph
Query-to-document ‘search’
–Given a query, find relevant documents, as in ad hoc search
Query-to-query ‘suggestion’
–Given a query, find other queries that the user might like to run
Document-to-query ‘annotation’
–Given a document, attach related queries to it
Document-to-document ‘relevance feedback’
–Given an example document that is relevant to the user, find additional relevant documents

7 Random Walk Model
A basic query formulation model:
1.Imagine a document (information need)
2.Think of a query associated with the document
3.Issue the query, or imagine another document related to the query
4.Iterate this thought process (a noise process)
–A Markov random walk, which describes a probability distribution over queries
The retrieval model is obtained by inverting the query formulation model
–Starts from an observed query, and attempts to undo the noise, inferring the underlying information need
–Backward walks
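The generative process above can be sketched as a sampled walk on a toy graph (the node names, weights, and step count below are hypothetical, not from the paper):

```python
import random

# Toy bipartite neighbourhoods: each node lists (neighbour, click weight).
graph = {
    "d:doc1":   [("q:query1", 40)],
    "q:query1": [("d:doc1", 40), ("d:doc2", 25)],
    "d:doc2":   [("q:query1", 25)],
}

def formulate(start_doc, steps, s=0.9, seed=0):
    """Simulate the query formulation model: start at an imagined document
    and alternate between documents and queries. With probability s the
    user favours the current node and stays put (self-transition)."""
    rng = random.Random(seed)
    node = start_doc
    for _ in range(steps):
        if rng.random() < s:
            continue  # self-transition: keep the current thought
        nbrs, wts = zip(*graph[node])
        node = rng.choices(nbrs, weights=wts)[0]
    return node  # in the model, a query is issued when the walk stops on one

end = formulate("d:doc1", steps=11)
```

Retrieval then runs this process in reverse: given the issued query, infer where the walk most likely started.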

8 Random Walk Computation
C_jk: click counts associating node j to k
Define transition probabilities P_{t+1|t}(k|j) from j to k:
P_{t+1|t}(k|j) = (1 − s) C_jk ∕ Σ_i C_ji for k ≠ j, and P_{t+1|t}(j|j) = s
–s is the self-transition probability, which corresponds to the user favoring the current query or document
Transition matrix [A]_jk = P_{t+1|t}(k|j), so the t-step probabilities are P_{t|0}(k|j) = [A^t]_jk
–A measure of the volume of paths between j and k
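A sketch of this computation, assuming a toy symmetric click-count matrix (the counts are made up; s = 0.9 and t = 11 echo the settings discussed later in the slides):

```python
import numpy as np

def transition_matrix(C, s):
    """[A]_jk = P_{t+1|t}(k|j): stay at j with probability s, otherwise
    move to a neighbour k with probability proportional to C[j, k]."""
    A = (1.0 - s) * C / C.sum(axis=1, keepdims=True)
    np.fill_diagonal(A, s)  # C has a zero diagonal, so every row sums to 1
    return A

# Toy symmetric click counts for 2 queries (rows 0-1) and 2 documents (2-3).
C = np.array([[ 0,  0, 40, 25],
              [ 0,  0,  0, 30],
              [40,  0,  0,  0],
              [25, 30,  0,  0]], dtype=float)

A = transition_matrix(C, s=0.9)
At = np.linalg.matrix_power(A, 11)  # P_{t|0}(k|j) = [A^t]_jk, here t = 11
```

Each row of `A` (and of `A^t`) is a probability distribution over next (or t-step) destinations, which is the "volume of paths" measure the slide refers to.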

9 Random Walk Model for Retrieval
Backward random walk for retrieval:
–Given that we ended a t-step walk at node j, we find the probability of starting at node k, P_{0|t}(k|j)
–Bayes rule: P_{0|t}(k|j) = P_{t|0}(j|k) P_0(k) ∕ P_t(j); assuming a uniform prior P_0(k) = 1/N, so that P_t(j) = (1/N) Σ_i [A^t]_ij, the 1/N factors cancel and P_{0|t}(k|j) = [A^t Z^{-1}]_kj, where Z is diagonal with Z_jj = Σ_i [A^t]_ij
Forward random walk:
–P_{t|0}(k|j) = [v_j A^t]_k, where v_j is the indicator row vector for node j
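With the uniform prior, the backward probabilities reduce to a column normalisation of A^t. A sketch, rebuilding a toy transition matrix of the same form as before (all counts illustrative):

```python
import numpy as np

# Toy transition matrix: row-normalised click counts with self-transition s.
C = np.array([[ 0,  0, 40, 25],
              [ 0,  0,  0, 30],
              [40,  0,  0,  0],
              [25, 30,  0,  0]], dtype=float)
s = 0.9
A = (1.0 - s) * C / C.sum(axis=1, keepdims=True)
np.fill_diagonal(A, s)

def backward_walk(A, t):
    """B[k, j] = P_{0|t}(k|j): probability the walk *started* at k given it
    *ended* at j after t steps. With a uniform prior over start nodes,
    Bayes' rule reduces to normalising each column of A^t (the Z^{-1})."""
    At = np.linalg.matrix_power(A, t)
    return At / At.sum(axis=0, keepdims=True)  # B = A^t Z^{-1}

B = backward_walk(A, t=11)
```

To rank documents for an observed query j, one would read off column j of `B` and sort the document nodes by their backward probability.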

10 Forward vs. Backward Walks
PageRank: a query-independent forward random walk on the link graph, which proceeds to its stationary distribution
In statistics, the backward walk model is referred to as diagnostic; in contrast, the forward walk model is predictive
When t → ∞:
–The forward random walk approaches the stationary distribution, which gives high probability to nodes with a large number of clicks
–The backward random walk approaches the prior starting distribution, which we have taken to be uniform
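Both limits can be checked numerically on a toy graph (same illustrative click counts as before): for large t the forward walk's rows approach the stationary distribution, which for a symmetric click matrix is proportional to each node's total click volume, while the backward columns approach the uniform prior.

```python
import numpy as np

C = np.array([[ 0,  0, 40, 25],
              [ 0,  0,  0, 30],
              [40,  0,  0,  0],
              [25, 30,  0,  0]], dtype=float)
s = 0.5
A = (1.0 - s) * C / C.sum(axis=1, keepdims=True)
np.fill_diagonal(A, s)

At = np.linalg.matrix_power(A, 500)   # effectively t -> infinity here

stationary = C.sum(axis=1) / C.sum()  # proportional to click volume per node
forward = At[0]                       # forward walk started at node 0
backward = At[:, 0] / At[:, 0].sum()  # backward walk ending at node 0
```

This is why only the backward walk stays query-specific at long walk lengths: the forward walk forgets its start node, whereas the backward walk merely flattens toward the prior.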

11 Clustering Effect
Given an end node that is part of a cluster, we have similar probabilities of having started the walk from any node in the cluster

12 Walk Parameters Figure: Probability distribution of non-self transitions under different combinations of t and s

13 Experiment Data
A 14-day click log of a web image search engine
–Judged images at distance 1 from the query had a precision of 75%
–Pruning: remove URLs connected to only one query, and queries connected to only one URL
–After pruning: 505,000 URLs, 202,000 queries and 1.1 million edges
–Uniformly sampled 45 queries for evaluation
–TREC-style pooled relevance judgments: 2278 judgments identify 818 relevant images
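The pruning step can be sketched as follows (a hypothetical re-implementation, not the paper's code; it is applied iteratively here, since removing one edge can leave another node with only a single connection):

```python
def prune(edges):
    """Drop URLs connected to only one query and queries connected to only
    one URL, repeating until no singleton-degree nodes remain."""
    while True:
        q_deg, d_deg = {}, {}
        for q, d in edges:
            q_deg[q] = q_deg.get(q, 0) + 1
            d_deg[d] = d_deg.get(d, 0) + 1
        kept = [(q, d) for q, d in edges if q_deg[q] > 1 and d_deg[d] > 1]
        if len(kept) == len(edges):
            return kept  # fixed point reached: nothing more to remove
        edges = kept

# Toy click-graph edges (query, url); ("c", "z") is an isolated pair.
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "z")]
pruned = prune(edges)
```

Singleton nodes contribute nothing to the walk (there is only one place to go), so pruning them shrinks the graph without changing the interesting paths.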

14 Experiment Result-1
Table 1. The furthest node from any of our test queries is at distance 41 (‘backward’). ‘dist’ and ‘1-0-forward’ are the baselines.

15 Experiment Result-2
Figure: The number of images retrieved at different distances from the query for each method. The 101-step walk with zero self-transition possibly goes too far, returning too few distance-1 images.

16 Experiment Result-3 Figure: The precision at different distances from the query for each method.

17 Experiment Result-4 Figure: Precision-recall curves of forward and backward walks, with zero self-transition probability (1000 URLs retrieved)

18 Experiment Result-5
Figure: Parameter sensitivity for a backward walk. Each contour shows a 0.01 variation in the evaluation measure; grid intersections indicate the parameter combinations tried. The large plateau has the highest value.

19 Conclusion
We have applied a Markov random walk model to the click graph, giving us a high-quality ranking of documents for a given query, including those as yet unclicked for that query
A backward walk was more effective than a forward walk, which supports the notion underlying our backward walk model
We got the best results from a walk of 11 steps, or 101 steps with a high self-transition probability
We have studied ad hoc retrieval in this paper, and the model could be effective and easily applied in the other applications listed
Given our model, another possible step would be to incorporate document content and query content via a language model, aiming to find documents that are not yet part of the click graph