Random Walks on Graphs Lecture 18 Monojit Choudhury

Slides:



Advertisements
Similar presentations
Markov Models.
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Information Networks Link Analysis Ranking Lecture 8.
Link Analysis. 2 Objectives To review common approaches to link analysis To calculate the popularity of a site based on link analysis To model human judgments.
Graphs, Node importance, Link Analysis Ranking, Random walks
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine Link Analysis.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Link Analysis Ranking. How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would.
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
ICS 278: Data Mining Lecture 15: Mining Web Link Structure
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
15-853Page :Algorithms in the Real World Indexing and Searching III (well actually II) – Link Analysis – Near duplicate removal.
Link Analysis, PageRank and Search Engines on the Web
Link Structure and Web Mining Shuying Wang
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Prestige (Seeley, 1949; Brin & Page, 1997; Kleinberg,1997) Use edge-weighted, directed graphs to model social networks Status/Prestige In-degree is a good.
CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure.
Information Retrieval Link-based Ranking. Ranking is crucial… “.. From our experimental data, we could observe that the top 20% of the pages with the.
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
Link Analysis.
Stochastic Approach for Link Structure Analysis (SALSA) Presented by Adam Simkins.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn.
Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.
1 Random Walks on Graphs: An Overview Purnamrita Sarkar, CMU Shortened and modified by Longin Jan Latecki.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Lectures 6 & 7 Centrality Measures Lectures 6 & 7 Centrality Measures February 2, 2009 Monojit Choudhury
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
Overview of Web Ranking Algorithms: HITS and PageRank
Lecture #10 PageRank CS492 Special Topics in Computer Science: Distributed Algorithms and Systems.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
Chapter 6: Link Analysis
PageRank Algorithm -- Bringing Order to the Web (Hu Bin)
1 CS 430: Information Discovery Lecture 5 Ranking.
1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/ Dec.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
Topics In Social Computing (67810) Module 1 (Structure) Centrality Measures, Graph Clustering Random Walks on Graphs.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
The PageRank Citation Ranking: Bringing Order to the Web
Quality of a search engine
Random Walks on Graphs.
15-499:Algorithms and Applications
Search Engines and Link Analysis on the Web
Link-Based Ranking Seminar Social Media Mining University UC3M
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Information retrieval and PageRank
Link Structure Analysis
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Graph and Link Mining.
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

Random Walks on Graphs Lecture 18 Monojit Choudhury monojitc@microsoft.com

What is a Random Walk Given a graph and a starting point (node), we select a neighbor of it at random, and move to this neighbor; Then we select a neighbor of this node and move to it, and so on; The (random) sequence of nodes selected this way is a random walk on the graph

Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview An example Adjacency matrix A Transition matrix P 1 1/2 A B C 1 Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview

Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview An example t=0, A A B C 1 1/2 Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview

Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview An example 1 1/2 t=1, AB t=0, A A B C 1 1/2 Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview

Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview An example 1 1/2 t=1, AB t=0, A A B C 1 1/2 t=2, ABC 1 1/2 Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview

Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview An example 1 1/2 t=1, AB t=0, A A B C 1 1/2 t=2, ABC 1 1/2 t=3, ABCA ABCB 1 1/2 Slide from Purnamitra Sarkar, Random Walks on Graphs: An Overview

Why are random walks interesting? When the underlying data has a natural graph structure, several physical processes can be conceived as a random walk Data Process WWW Random surfer Internet Routing P2P Search Social network Information percolation

More examples Classic ones Not so obvious ones Brownian motion Electrical circuits (resistances) Lattices and Ising models Not so obvious ones Shuffling and permutations Music Language

Random Walks & Markov Chains A random walk on a directed graph is nothing but a Markov chain! Initial node: chosen from a distribution P0 Transition matrix: M= D-1A When does M exist? When is M symmetric? Random Walk: Pt+1 = MTPt Pt = (MT)tP0

Properties of Markov Chains Symmetric: P(u  v) = P(v  u) Any random walk (v0,…,vt), when reversed, has the same probability if v0= vt Time Reversibility: The reversed walk is also a random walk with initial distribution as Pt Stationary or Steady-state: P* is stationary if P* = MTP*

More on stationary distribution For every graph G, the following is stationary distribution: P*(v) = d(v)/2m For which type of graph, the uniform distribution is stationary? Stationary distribution is unique, when … t  , Pt  P*; but not when …

Revisiting time-reversibility P*[i]M[i][j] = P*[j]M[j][i] However, P*[i]M[i][j] = 1/(2m) We move along every edge, along every given direction with the same frequency What is the expected number of steps before revisiting an edge? What is the expected number of steps before revisiting a node?

Important parameters of random walk Access time or hitting time Hij is the expected number of steps before node j is visited, starting from node i Commute time: i  j  i: Hij + Hji Cover time: Starting from a node/distribution the expected number of steps to reach every node

Problems Compute access time for any pair of nodes for Kn Can you express the cover time of a path by access time? For which kind of graphs, cover time is infinity? What can you infer about a graph which a large number of nodes but very low cover time?

Applications of Random Walks on Graphs Lecture 19 Applications of Random Walks on Graphs Monojit Choudhury monojitc@microsoft.com

Ranking Webpages The problem statement: Similar problems: Given a query word, Given a large number of webpages consisting of the query word Based on the hyperlink structure, find out which of the webpages are most relevant to the query Similar problems: Citation networks, Recommender systems

Mixing rate How fast the random walk converges to its limiting distribution Very important for analysis/usability of algorithms Mixing rates for some graphs can be very small: O(log n)

Mixing Rate and Spectral Gap It can be shown that Smaller the value of 2 larger is the spectral gap, faster is the mixing rate

Recap: Pagerank Simulate a random surfer by the power iteration method Problems Not unique if the graph is disconnected 0 pagerank if there are no incoming links or if there are sinks Computationally intensive? Stability & Cost of recomputation (web is dynamic) Does not take into account the specific query Easy to fool

PageRank The surfer jumps to an arbitrary page with non-zero probability (escape probability) M’ = (1-w)M + wE This solves: Sink problem Disconnectedness Converges fast if w is chosen appropriately Stability and need for recomputation But still ignores the query word

HITS Hypertext Induced Topic Selection By Jon Kleinberg, 1998 For each vertex v Є V in a subgraph of interest: a(v) - the authority of v h(v) - the hubness of v A site is very authoritative if it receives many citations. Citation from important sites weight more than citations from less-important sites Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites

HITS: Constructing the Query graph

Authorities and Hubs 2 5 3 1 1 6 4 7 a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)

Can you prove that it will converge? The Markov Chain Recursive dependency: a(v)  Σ h(w) h(v)  Σ a(w) w Є pa[v] w Є ch[v] Can you prove that it will converge?

HITS: Example Authority and hubness weights Authority Hubness 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Authority and hubness weights

Limitations of HITS Sink problem: Solved Disconnectedness: an issue Convergence: Not a problem Stability: Quite robust You can still fool HITS easily! Tightly Knit Community (TKC) Effect

Applications Random Walks on Graphs - II Lecture 20 Applications Random Walks on Graphs - II Monojit Choudhury monojitc@microsoft.com

Acknowledgements Some slides of these lectures are from: Random Walks on Graphs: An Overview Purnamitra Sarkar “Link Analysis Slides” from the book Modeling the Internet and the Web Pierre Baldi, Paolo Frasconi, Padhraic Smyth

References Basics of Random Walk: L. Lovasz (1993) Random Walks on Graphs: A Survey PageRank: http://en.wikipedia.org/wiki/PageRank K. Bryan and T. Leise, The $25,000,000 Eigenvector: The Linear Algebra Behind Google (www.rose-hulman.edu/~bryan) HITS J. M. Kleinberg (1999) Authorative Sources in a Hyperlinked Environment. Journal of the ACM 46 (5): 604–632.

HITS on Citation Network A = WTW is the co-citation matrix What is A[i][j]? H = WWT is the bibliographic coupling matrix What is H[i][j]? H. Small, Co-citation in the scientific literature: a new measure of the relationship between two documents, Journal of the American Society for Information Science 24 (1973) 265–269. M.M. Kessler, Bibliographic coupling between scientific papers, American Documentation 14 (1963) 10–25.

SALSA: The Stochastic Approach for Link-Structure Analysis Probabilistic extension of the HITS algorithm Random walk is carried out by following hyperlinks both in the forward and in the backward direction Two separate random walks Hub walk Authority walk R. Lempel and S. Moran (2000) The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks 33 387-401

The basic idea Hub walk Authority Walk Follow a Web link from a page uh to a page wa (a forward link) and then Immediately traverse a backlink going from wa to vh, where (u,w) Є E and (v,w) Є E Authority Walk Follow a Web link from a page w(a) to a page u(h) (a backward link) and then Immediately traverse a forward link going back from vh to wa where (u,w) Є E and (v,w) Є E

Analyzing SALSA

Analyzing SALSA Hub Matrix: = Authority Matrix: =

SALSA ranks are degrees!

Is it good? It can be shown theoretically that SALSA does a better job than HITS in the presence of TKC effect However, it also has its own limitations Link Analysis: Which links (directed edges) in a network should be given more weight during the random walk? An active area of research

Limits of Link Analysis (in IR) META tags/ invisible text Search engines relying on meta tags in documents are often misled (intentionally) by web developers Pay-for-place Search engine bias : organizations pay search engines and page rank Advertisements: organizations pay high ranking pages for advertising space With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to high ranking page

Limits of Link Analysis (in IR) Stability Adding even a small number of nodes/edges to the graph has a significant impact Topic drift – similar to TKC A top authority may be a hub of pages on a different topic resulting in increased rank of the authority page Content evolution Adding/removing links/content can affect the intuitive authority rank of a page requiring recalculation of page ranks

Applications Random Walks on Graphs - III Lecture 21 Applications Random Walks on Graphs - III Monojit Choudhury monojitc@microsoft.com

Clustering Using Random Walk

Chinese Whispers Based on the game of “Chinese Whispers” C. Biemann (2006) Chinese whispers - an efficient graph clustering algorithm and its application to natural language processing problems. In Proc of HLT-NAACL’06 workshop on TextGraphs, pages 73–80 Based on the game of “Chinese Whispers”

The Chinese Whispers Algorithm color sky weight 0.9 0.8 light -0.5 0.7 blue 0.9 blood heavy 0.5 red

The Chinese Whispers Algorithm color sky weight 0.9 0.8 light -0.5 0.7 blue 0.9 blood heavy 0.5 red

The Chinese Whispers Algorithm color sky weight 0.9 0.8 light -0.5 0.7 blue 0.9 blood heavy 0.5 red

Properties No parameters! Number of clusters? Does it converge for all graphs? How fast does it converge? What is the basis of clustering?

Affinity Propagation B.J. Frey and D. Dueck (2007) Clustering by Passing Messages Between Data Points. Science 315, 972 Choosing exemplars through real-valued message passing: Responsibilities Availabilities

Input n points (nodes) Similarity between them: s(i,k) How suitable an exemplar k is for i. s(k,k) = how likely it is for k to be an exemplar

Messages: Responsibility Denoted by r(i,k) Sent from i to k The accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars

Messages: Availability Denoted by a(i,k) Sent from k to i The accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar.

The Update Rules Initialization: a(i,k) = 0

Choosing Exemplars After any iteration, choose that k as an exemplar for i for which a(i,k) + r(i,k) is maximum. i is an exemplar itself if a(i,i) + r(i,i) is maximum.

An example

An example

An Example