DATA MINING LECTURE 12 Link Analysis Ranking Random walks.

Presentation transcript:

DATA MINING LECTURE 12 Link Analysis Ranking Random walks

GRAPHS AND LINK ANALYSIS RANKING

Link Analysis Ranking. Use the graph structure to determine the relative importance of the nodes. Applications: ranking on graphs (Web, Twitter, FB, etc.). Intuition: an edge from node p to node q denotes endorsement; node p endorses/recommends/confirms the authority/centrality/importance of node q. Use the graph of recommendations to assign an authority value to every node.

Rank by Popularity. Rank pages according to the number of incoming edges (in-degree, degree centrality): 1. Red Page, 2. Yellow Page, 3. Blue Page, 4. Purple Page, 5. Green Page.

Popularity. It matters not only how many link to you, but also how important those that link to you are. Good authorities are pointed to by good authorities: a recursive definition of importance.

PageRank. Good authorities should be pointed to by good authorities: the value of a node is the value of the nodes that point to it. How do we implement that? Assume that we have one unit of authority to distribute among all nodes. Each node distributes its authority value to its neighbors, and the authority value of each node is the sum of the authority fractions it collects from its neighbors. For the three-node example:
w_1 + w_2 + w_3 = 1
w_1 = w_2 + w_3
w_2 = ½ w_1
Solving the system of equations gives the authority values of the nodes: w_1 = ½, w_2 = ¼, w_3 = ¼.

A more complex example:
w_1 = 1/3 w_4 + 1/2 w_5
w_2 = 1/2 w_1 + w_3 + 1/3 w_4
w_3 = 1/2 w_1 + 1/3 w_4
w_4 = 1/2 w_5
w_5 = w_2
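The system in the more complex example can be solved numerically. The sketch below is one way to do it, assuming NumPy is available and assuming the normalization w_1 + ... + w_5 = 1 from the previous slide; the matrix P is read off the equations (entry P[i][j] is the fraction of node i's authority sent to node j) and the variable names are illustrative only.

```python
import numpy as np

# Row i lists how node i+1 splits its authority among its out-neighbors.
# Read off the equations: e.g. w_1 = 1/3 w_4 + 1/2 w_5 means that
# node 4 sends 1/3 and node 5 sends 1/2 of their authority to node 1.
P = np.array([
    [0,   1/2, 1/2, 0,   0],   # node 1 -> 2, 3
    [0,   0,   0,   0,   1],   # node 2 -> 5
    [0,   1,   0,   0,   0],   # node 3 -> 2
    [1/3, 1/3, 1/3, 0,   0],   # node 4 -> 1, 2, 3
    [1/2, 0,   0,   1/2, 0],   # node 5 -> 1, 4
])
n = P.shape[0]

# Solve w = P^T w together with sum(w) = 1 as one least-squares system.
M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
rhs = np.append(np.zeros(n), 1.0)
w, *_ = np.linalg.lstsq(M, rhs, rcond=None)

print(w)  # authority values w_1 ... w_5
```

Solving by hand gives w = (2/11, 3/11, 3/22, 3/22, 3/11), which the numerical solution should reproduce up to rounding.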

Random Walks on Graphs. What we described is equivalent to a random walk on the graph. Random walk: (1) start from a node chosen uniformly at random; (2) pick one of the outgoing edges uniformly at random; (3) move to the destination of the edge; (4) repeat from step 2.
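The random walk can be simulated directly. The following is a minimal sketch using only the Python standard library; the function name `simulate_random_walk`, the seed, and the number of steps are assumptions made for illustration, and the graph is the five-node graph implied by the equations of the more complex example.

```python
import random
from collections import Counter

def simulate_random_walk(out_links, steps, seed=0):
    """Return the fraction of steps the walk spends at each node."""
    rng = random.Random(seed)
    nodes = sorted(out_links)
    current = rng.choice(nodes)            # start uniformly at random
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        # pick an outgoing edge uniformly at random and move along it
        current = rng.choice(out_links[current])
    return {u: visits[u] / steps for u in nodes}

# Out-links read off the equations w_1 ... w_5 (no sink nodes).
graph = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}
freq = simulate_random_walk(graph, steps=200_000)
print(freq)
```

For a long enough walk the visit frequencies should approach the authority values obtained by solving the equations, roughly (0.18, 0.27, 0.14, 0.14, 0.27).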

Example: the random walk at Step 0, Step 1, Step 2, Step 3, Step 4, … (figure sequence).

Memorylessness

Transition probability matrix

An example

Node Probability vector

An example

Stationary distribution

Computing the stationary distribution: the Power Method. Initialize to some distribution q^0. Iteratively compute q^t = q^(t-1) P. After many iterations q^t ≈ π, regardless of the initial vector q^0. It is called the power method because it computes q^t = q^0 P^t. The rate of convergence is determined by the second eigenvalue: the error shrinks like λ_2^t.
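The power method is short enough to sketch in full. This version assumes NumPy, a row-stochastic matrix P, and an L1 stopping test with tolerance `eps`; the function name, the tolerance, and the two-state example chain are illustrative choices, not part of the lecture.

```python
import numpy as np

def power_method(P, q0, eps=1e-12, max_iter=10_000):
    """Iterate q_t = q_{t-1} P until successive vectors differ by < eps (L1)."""
    P = np.asarray(P, dtype=float)
    q = np.asarray(q0, dtype=float)
    for t in range(1, max_iter + 1):
        q_next = q @ P                      # row vector times matrix
        if np.abs(q_next - q).sum() < eps:
            return q_next, t
        q = q_next
    return q, max_iter

# A two-state chain whose stationary distribution is (1/3, 2/3).
P = [[0.50, 0.50],
     [0.25, 0.75]]
pi, iterations = power_method(P, q0=[1.0, 0.0])
print(pi, iterations)
```

The second eigenvalue of this chain is 0.25, so the iteration should settle within a few dozen steps; a chain with a second eigenvalue close to 1 would need many more.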

The stationary distribution

The PageRank random walk. Vanilla random walk: make the adjacency matrix stochastic and run a random walk.

The PageRank random walk. What about sink nodes? What happens when the random walk moves to a node without any outgoing links?

The PageRank random walk. Replace the rows of the sink nodes with a vector v (typically the uniform vector): P' = P + d v^T, where d_i is 1 if node i is a sink and 0 otherwise.

The PageRank random walk What about loops? Spider traps

The PageRank random walk. Add a random jump: with probability 1-α, jump to a node chosen according to the vector v (typically the uniform vector). The walk restarts after 1/(1-α) steps in expectation. This guarantees irreducibility and convergence. P'' = αP' + (1-α) u v^T, where u is the vector of all 1s. This is a random walk with restarts.

PageRank algorithm [BP98]. The Random Surfer model: start from a page picked at random; then, at each step, with probability 1-α jump to a random page, and with probability α follow a random outgoing link. Rank according to the stationary distribution: 1. Red Page, 2. Purple Page, 3. Yellow Page, 4. Blue Page, 5. Green Page.

Stationary distribution with random jump

Effects of random jump

Random walks on undirected graphs. For undirected graphs, the stationary distribution is proportional to the degrees of the nodes, so in this case ranking by random walk is the same as ranking by degree popularity. This is no longer true if we do random jumps: short paths then play a greater role, and the previous distribution does not hold.
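The claim about undirected graphs can be checked in a few lines. The derivation below is a sketch under the assumption of a connected undirected graph with adjacency matrix A, m edges, and node degrees deg(i), so that the walk has transition probabilities P_ij = A_ij / deg(i); the symbols m and deg are introduced here for illustration.

```latex
\pi_i = \frac{\deg(i)}{2m}
\quad\Longrightarrow\quad
(\pi P)_j
  = \sum_i \frac{\deg(i)}{2m} \cdot \frac{A_{ij}}{\deg(i)}
  = \frac{1}{2m} \sum_i A_{ij}
  = \frac{\deg(j)}{2m}
  = \pi_j ,
\qquad
\sum_i \pi_i = \frac{1}{2m} \sum_i \deg(i) = 1 .
```

So π = πP holds with π proportional to the degrees. The step that uses Σ_i A_ij = deg(j) relies on A being symmetric, which is why the argument does not carry over to directed graphs.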

Pagerank implementation

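A (Matlab-friendly) PageRank algorithm. Performing the vanilla power method is now too expensive: the matrix P'' is not sparse.
q^0 = v
t = 1
repeat
    q^t = (P'')^T q^(t-1)
    δ = ||q^t - q^(t-1)||
    t = t + 1
until δ < ε
Efficient computation of y = (P'')^T x: expanding the definitions below gives y = α P^T x + [α (d^T x) + (1-α)(u^T x)] v, which needs only the sparse matrix P.
P = normalized adjacency matrix
P' = P + d v^T, where d_i is 1 if i is a sink and 0 otherwise
P'' = αP' + (1-α) u v^T, where u is the vector of all 1s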
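The loop above can be sketched in Python instead of Matlab. The version below assumes NumPy, a uniform jump vector v, α = 0.85, and an L1 stopping test; these defaults and the function name `pagerank` are assumptions for illustration. It never forms P'' explicitly: each product (P'')^T q is computed from P, the sink indicator d, and v. A dense array is used for brevity, whereas a real implementation would store P in a sparse format.

```python
import numpy as np

def pagerank(adj, alpha=0.85, eps=1e-10, max_iter=1000):
    """Power iteration on P'' = alpha*(P + d v^T) + (1-alpha)*u v^T."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    d = (out_deg.ravel() == 0).astype(float)       # 1 for sink nodes
    # Normalized adjacency matrix; rows of sink nodes stay all zero.
    P = np.divide(A, out_deg, out=np.zeros_like(A), where=out_deg > 0)
    v = np.full(n, 1.0 / n)                        # uniform jump vector
    q = v.copy()
    for _ in range(max_iter):
        # (P'')^T q = alpha P^T q + [alpha (d.q) + (1-alpha) (u.q)] v
        y = alpha * (P.T @ q)
        y += (alpha * (d @ q) + (1.0 - alpha) * q.sum()) * v
        delta = np.abs(y - q).sum()
        q = y
        if delta < eps:
            break
    return q

# Three-node example: node 0 links to 1 and 2, which both link back to 0.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
scores = pagerank(adj)
print(scores)
```

With the random jump, node 0 no longer receives exactly half of the authority as in the jump-free example; for α = 0.85 its score works out to 0.9/1.85 ≈ 0.486.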

Pagerank history. Huge advantage for Google in the early days: it gave a way to get an idea of the value of a page, which was useful in many different ways, and it put an order to the web. After a while it became clear that the anchor text was probably more important for ranking. Also, link spam became a new (dark) art. Flood of research: numerical analysis got rejuvenated; huge number of variations; efficiency became a great issue; huge number of applications in different fields. Random walk is often referred to as PageRank.

THE HITS ALGORITHM

The HITS algorithm. Another algorithm, proposed around the same time as Pagerank, for using the hyperlinks to rank pages. Kleinberg was then an intern at IBM Almaden; IBM never made anything out of it.

Query dependent input. The root set is obtained from a text-only search engine. (Figure sequence: the Root Set, extended with the IN and OUT pages, forms the Base Set.)

Hubs and Authorities [K98]. Authority is not necessarily transferred directly between authorities. Pages have a double identity: a hub identity and an authority identity. Good hubs point to good authorities; good authorities are pointed to by good hubs.

Hubs and Authorities. Two kinds of weights: hub weight and authority weight. The hub weight is the sum of the authority weights of the authorities pointed to by the hub. The authority weight is the sum of the hub weights of the hubs that point to this authority.

HITS Algorithm. Initialize all weights to 1. Repeat until convergence: (1) O operation: hubs collect the weight of the authorities; (2) I operation: authorities collect the weight of the hubs; (3) normalize the weights under some norm.
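The iteration can be sketched as follows, assuming NumPy and an adjacency matrix in which `A[i][j] = 1` means page i links to page j. Normalizing by the max norm and using the freshly updated hub weights in the I operation are assumptions made for this sketch, as are the function name `hits` and the stopping tolerance.

```python
import numpy as np

def hits(adj, eps=1e-12, max_iter=1000):
    """Return (hub, authority) weights, each scaled so the largest is 1."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(max_iter):
        new_hub = A @ auth            # O operation: sum over pages pointed to
        new_auth = A.T @ new_hub      # I operation: sum over pages pointing in
        new_hub /= new_hub.max()      # max-norm normalization
        new_auth /= new_auth.max()
        change = np.abs(new_hub - hub).sum() + np.abs(new_auth - auth).sum()
        hub, auth = new_hub, new_auth
        if change < eps:
            break
    return hub, auth

# Tiny example: page 0 links to 1 and 2, page 1 links to 2.
adj = [[0, 1, 1],
       [0, 0, 1],
       [0, 0, 0]]
hub, auth = hits(adj)
print(hub, auth)
```

In this example page 0 should come out as the best hub and page 2 as the best authority. The sketch assumes the graph has at least one edge, otherwise the max-norm division would divide by zero.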

HITS and eigenvectors

Example (hubs and authorities), figure sequence:
Initialize.
Step 1: O operation.
Step 1: I operation.
Step 1: Normalization (max norm). Values shown: 1, 5/6, 2/6, 1/6, 1/3, 2/3, 1, 1/3.
Step 2: O step. Values shown: 1, 5/6, 2/6, 1/6, 1, 11/6, 16/6, 7/6, 1/6.
Step 2: I step. Values shown: 33/6, 27/6, 23/6, 7/6, 1/6, 1, 11/6, 16/6, 7/6, 1/6.
Step 2: Normalization. Values shown: 1, 27/33, 23/33, 7/33, 1/33, 6/16, 11/16, 1, 7/16, 1/16.
Convergence.

OTHER ALGORITHMS

The SALSA algorithm [LM00]. Perform a random walk alternating between hubs and authorities. What does this random walk converge to? The graph is essentially undirected, so the stationary distribution is proportional to the degree.

Social network analysis. Evaluate the centrality of individuals in social networks. Degree centrality: the (weighted) degree of a node. Distance centrality: the average (weighted) distance of a node to the rest of the graph. Betweenness centrality: the average number of (weighted) shortest paths that use node v.

Counting paths: Katz 53. The importance of a node is measured by the weighted sum of paths that lead to this node. A^m[i,j] = number of paths of length m from i to j. Compute P = bA + b^2 A^2 + ... + b^m A^m + ... = (I - bA)^(-1) - I, which converges when b < 1/λ_1(A). Rank nodes according to the column sums of the matrix P.
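The Katz computation fits in a few lines. This is a sketch assuming NumPy and the closed form P = (I - bA)^(-1) - I of the weighted path sum; the function name `katz_scores`, the example graph, and the value b = 0.1 are illustrative assumptions.

```python
import numpy as np

def katz_scores(A, b):
    """Return (P, column sums of P) for P = sum_{m>=1} b^m A^m."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    lam1 = max(abs(np.linalg.eigvals(A)))          # spectral radius of A
    if b * lam1 >= 1:
        raise ValueError("series diverges: need b < 1/lambda_1(A)")
    P = np.linalg.inv(np.eye(n) - b * A) - np.eye(n)
    return P, P.sum(axis=0)                        # column j: paths into j

# Example: edges 0->1, 0->2, 1->2, 2->0.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
P, scores = katz_scores(A, b=0.1)
print(scores)
```

Node 2 has two incoming edges and should receive the highest score here. Summing the first few dozen terms b^m A^m explicitly is a simple way to sanity-check the closed form.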

Bibliometrics. Impact factor (E. Garfield 72): counts the number of citations received for papers of the journal in the previous two years. Pinski-Narin 76: perform a random walk on the set of journals, where P_ij = the fraction of citations from journal i that are directed to journal j.