## Presentation on theme: "Link Analysis: PageRank"— Presentation transcript:

Ranking Nodes on the Graph
Web pages are not equally “important” vs. Since there is large diversity in the connectivity of the web graph we can rank the pages by the link structure vs. Slides by Jure Leskovec: Mining Massive Datasets

We will cover the following Link Analysis approaches to computing importances of nodes in a graph: Page Rank Hubs and Authorities (HITS) Topic-Specific (Personalized) Page Rank Web Spam Detection Algorithms Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets

Simple Recursive Formulation
Each link’s vote is proportional to the importance of its source page If page p with importance x has n out-links, each link gets x/n votes Page p’s own importance is the sum of the votes on its in-links p Slides by Jure Leskovec: Mining Massive Datasets

PageRank: The “Flow” Model
A “vote” from an important page is worth more A page is important if it is pointed to by other important pages Define a “rank” rj for node j The web in 1839 y m a a/2 y/2 Flow equations: ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 Slides by Jure Leskovec: Mining Massive Datasets

Solving the Flow Equations
ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 3 equations, 3 unknowns, no constants No unique solution Additional constraint forces uniqueness ry + ra + rm = 1 ry = 2/5, ra = 2/5, rm = 1/5 Gaussian elimination method works for small examples, but we need a better method for large web-size graphs Slides by Jure Leskovec: Mining Massive Datasets

PageRank: Matrix Formulation
Stochastic adjacency matrix M Let page j has dj out-links If j → i, then Mij = 1/dj else Mij = 0 M is a column stochastic matrix Columns sum to 1 Rank vector r: vector with an entry per page ri is the importance score of page i i ri = 1 The flow equations can be written r = M r Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Example Suppose page j links to 3 pages, including i i j M r = i 1/3 r Slides by Jure Leskovec: Mining Massive Datasets

Eigenvector Formulation
The flow equations can be written r = M ∙ r So the rank vector is an eigenvector of the stochastic web matrix In fact, its first or principal eigenvector, with corresponding eigenvalue 1 Slides by Jure Leskovec: Mining Massive Datasets

Example: Flow Equations & M
y a m 1 y a m r = Mr y ½ ½ y a = ½ a m ½ m ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 Slides by Jure Leskovec: Mining Massive Datasets

Power Iteration Method
Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks Power iteration: a simple iterative scheme Suppose there are N web pages Initialize: r(0) = [1/N,….,1/N]T Iterate: r(t+1) = M ∙ r(t) Stop when |r(t+1) – r(t)|1 <  |x|1 = 1≤i≤N|xi| is the L1 norm Can use any other vector norm e.g., Euclidean di …. out-degree of node i Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
PageRank: How to solve? y a m 1 y a m Power Iteration: Set 𝑟 𝑗 =1/N 𝑟 𝑗 = 𝑖→𝑗 𝑟 𝑖 𝑑 𝑖 And iterate ri=j Mij∙rj Example: ry 1/3 1/3 5/12 9/24 6/15 ra = 1/3 3/6 1/3 11/24 … 6/15 rm 1/3 1/6 3/12 1/6 3/15 ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 Iteration 0, 1, 2, … Slides by Jure Leskovec: Mining Massive Datasets

Random Walk Interpretation
Imagine a random web surfer: At any time t, surfer is on some page u At time t+1, the surfer follows an out-link from u uniformly at random Ends up on some page v linked from u Process repeats indefinitely Let: p(t) … vector whose ith coordinate is the prob. that the surfer is at page i at time t p(t) is a probability distribution over pages j Slides by Jure Leskovec: Mining Massive Datasets

The Stationary Distribution
Where is the surfer at time t+1? Follows a link uniformly at random p(t+1) = M · p(t) Suppose the random walk reaches a state p(t+1) = M · p(t) = p(t) then p(t) is stationary distribution of a random walk Our rank vector r satisfies r = M · r So, it is a stationary distribution for the random walk j Slides by Jure Leskovec: Mining Massive Datasets

PageRank: Three Questions
Does this converge? Does it converge to what we want? Are results reasonable? or equivalently Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Does This Converge? Example: ra rb a b = Iteration 0, 1, 2, … Slides by Jure Leskovec: Mining Massive Datasets

Does it Converge to What We Want?
Example: ra rb a b = Iteration 0, 1, 2, … Slides by Jure Leskovec: Mining Massive Datasets

Problems with the “Flow” Model
Some pages are “dead ends” (have no out-links) Such pages cause importance to “leak out” Spider traps (all out links are within the group) Eventually spider traps absorb all importance Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Problem: Spider Traps y a m 1 y a m Power Iteration: Set 𝑟 𝑗 =1 𝑟 𝑗 = 𝑖→𝑗 𝑟 𝑖 𝑑 𝑖 And iterate Example: ry 1/3 2/6 3/12 5/24 0 ra = 1/3 1/6 2/12 3/24 … 0 rm 1/3 3/6 7/12 16/24 1 ry = ry /2 + ra /2 ra = ry /2 rm = ra /2 + rm Iteration 0, 1, 2, … Slides by Jure Leskovec: Mining Massive Datasets

Solution: Random Teleports
The Google solution for spider traps: At each time step, the random surfer has two options: With probability , follow a link at random With probability 1-, jump to some page uniformly at random Common values for  are in the range 0.8 to 0.9 Surfer will teleport out of spider trap within a few time steps y a m y a m Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Problem: Dead Ends y a m y a m Power Iteration: Set 𝑟 𝑗 =1 𝑟 𝑗 = 𝑖→𝑗 𝑟 𝑖 𝑑 𝑖 And iterate Example: ry 1/3 2/6 3/12 5/24 0 ra = 1/3 1/6 2/12 3/24 … 0 rm 1/3 1/6 1/12 2/24 0 ry = ry /2 + ra /2 ra = ry /2 rm = ra /2 Iteration 0, 1, 2, … Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Solution: Dead Ends Teleports: Follow random teleport links with probability 1.0 from dead-ends Adjust matrix accordingly y a m y a m y a m y a m Slides by Jure Leskovec: Mining Massive Datasets

Why Teleports Solve the Problem?
Markov Chains Set of states X Transition matrix P where Pij = P(Xt=i | Xt-1=j) π specifying the probability of being at each state x  X Goal is to find π such that π = P π Slides by Jure Leskovec: Mining Massive Datasets

Why is This Analogy Useful?
Theory of Markov chains Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic. Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Make M Stochastic Stochastic: Every column sums to 1 A possible solution: Add green links ai…=1 if node i has out deg 0, =0 else 1…vector of all 1s y a m 1/3 y a m ry = ry /2 + ra /2 + rm /3 ra = ry /2+ rm /3 rm = ra /2 + rm /3 Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Make M Aperiodic A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k. A possible solution: Add green links y a m Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Make M Irreducible From any state, there is a non-zero probability of going from any one state to any another A possible solution: Add green links y a m Slides by Jure Leskovec: Mining Massive Datasets

Solution: Random Jumps
Google’s solution that does it all: Makes M stochastic, aperiodic, irreducible At each step, random surfer has two options: With probability 1-, follow a link at random With probability , jump to some random page PageRank equation [Brin-Page, 98] 𝑟 𝑗 = 𝑖→𝑗 𝛽 𝑟 𝑖 𝑑 𝑖 +(1−𝛽) 1 𝑛 From now on: We assume M has no dead ends That is, we follow random teleport links with probability 1.0 from dead-ends di … out-degree of node i Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
The Google Matrix PageRank equation [Brin-Page, 98] 𝑟 𝑗 = 𝑖→𝑗 𝛽 𝑟 𝑖 𝑑 𝑖 +(1−𝛽) 1 𝑛 The Google Matrix A: 𝐴=𝛽 𝑆+(1−𝛽) 1 𝑛 𝟏 ⋅𝟏 𝑇 G is stochastic, aperiodic and irreducible, so 𝑟 (𝑡+1) =𝐴⋅ 𝑟 (𝑡) What is  ? In practice  =0.85 (make 5 steps and jump) Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Random Teleports ( = 0.8) S 1/n·1·1T y a m 0.8·½+0.2·⅓ 0.2·⅓ ·⅓ 0.2· ⅓ 1/2 1/2 0 1/ 0 1/2 1 1/3 1/3 1/3 0.8 + 0.2 y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 13/15 A y a = m 1/3 0.33 0.20 0.46 0.24 0.20 0.52 0.26 0.18 0.56 7/33 5/33 21/33 . . . Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Computing Page Rank Key step is matrix-vector multiplication rnew = A ∙ rold Easy if we have enough main memory to hold A, rold, rnew Say N = 1 billion pages We need 4 bytes for each entry (say) 2 billion entries for vectors, approx 8GB Matrix A has N2 entries 1018 is a large number! A = ∙M + (1-) [1/N]NxN ½ ½ 0 ½ 0 ½ 1 1/3 1/3 1/3 A = 0.8 +0.2 7/15 7/15 1/15 7/15 1/15 1/15 1/15 7/15 13/15 = Slides by Jure Leskovec: Mining Massive Datasets

Slides by Jure Leskovec: Mining Massive Datasets
Matrix Formulation Suppose there are N pages Consider a page j, with set of out-links dj We have Mij = 1/|dj| when j→i and Mij = 0 otherwise The random teleport is equivalent to Adding a teleport link from j to every other page with probability (1-)/N Reducing the probability of following each out-link from 1/|dj| to /|dj| Equivalent: Tax each page a fraction (1-) of its score and redistribute evenly Slides by Jure Leskovec: Mining Massive Datasets

Rearranging the Equation
𝑟 =𝐴⋅𝑟, where 𝐴 𝑖𝑗 =𝛽 𝑀 𝑖𝑗 + 1−𝛽 𝑁 𝑟 𝑖 = 𝑗=1 𝑁 𝐴 𝑖𝑗 ⋅ 𝑟 𝑗 𝑟 𝑖 = 𝑗=1 𝑁 𝛽 𝑀 𝑖𝑗 + 1−𝛽 𝑁 ⋅ 𝑟 𝑗 = 𝑗=1 𝑁 𝛽 𝑀 𝑖𝑗 ⋅ 𝑟 𝑗 + 1−𝛽 𝑁 𝑗=1 𝑁 𝑟 𝑗 = 𝑗=1 𝑁 𝛽 𝑀 𝑖𝑗 ⋅ 𝑟 𝑗 + 1−𝛽 𝑁 since ∑ 𝑟 𝑗 =1 So we get: 𝑟=𝛽 𝑀⋅𝑟+ 1−𝛽 𝑁 𝑁 [x]N … a vector of length N with all entries x Slides by Jure Leskovec: Mining Massive Datasets

Sparse Matrix Formulation
We just rearranged the PageRank equation 𝒓=𝜷𝑴⋅𝒓+ 𝟏−𝜷 𝑵 𝑵 where [(1-)/N]N is a vector with all N entries (1-)/N M is a sparse matrix! 10 links per node, approx 10N entries So in each iteration, we need to: Compute rnew =  M ∙ rold Add a constant value (1-)/N to each entry in rnew Slides by Jure Leskovec: Mining Massive Datasets

Sparse Matrix Encoding
Encode sparse matrix using only nonzero entries Space proportional roughly to number of links Say 10N, or 4*10*1 billion = 40GB Still won’t fit in memory, but will fit on disk source node degree destination nodes 3 1, 5, 7 1 5 17, 64, 113, 117, 245 2 13, 23 Slides by Jure Leskovec: Mining Massive Datasets

Basic Algorithm: Update Step
Assume enough RAM to fit rnew into memory Store rold and matrix M on disk Then 1 step of power-iteration is: Initialize all entries of rnew to (1-)/N For each page p (of out-degree n): Read into memory: p, n, dest1,…,destn, rold(p) for j = 1…n: rnew(destj) +=  rold(p) / n rnew rold src degree destination 1 1 3 1, 5, 6 1 4 17, 64, 113, 117 2 13, 23 2 2 3 3 4 4 5 5 Slides by Jure Leskovec: Mining Massive Datasets 6 6

Slides by Jure Leskovec: Mining Massive Datasets
Analysis Assume enough RAM to fit rnew into memory Store rold and matrix M on disk In each iteration, we have to: Read rold and M Write rnew back to disk IO cost = 2|r| + |M| Question: What if we could not even fit rnew in memory? Slides by Jure Leskovec: Mining Massive Datasets

Block-based Update Algorithm
rnew rold src degree destination 4 0, 1, 3, 5 1 2 0, 5 3, 4 1 1 2 3 2 4 3 5 4 5 Slides by Jure Leskovec: Mining Massive Datasets

Analysis of Block Update
Similar to nested-loop join in databases Break rnew into k blocks that fit in memory Scan M and rold once for each block k scans of M and rold k(|M| + |r|) + |r| = k|M| + (k+1)|r| Can we do better? Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration Slides by Jure Leskovec: Mining Massive Datasets

Block-Stripe Update Algorithm
src degree destination rnew 4 0, 1 1 3 2 rold 1 1 2 4 3 2 3 2 4 3 5 4 5 1 3 2 4 5 Slides by Jure Leskovec: Mining Massive Datasets

Block-Stripe Analysis
Break M into stripes Each stripe contains only destination nodes in the corresponding block of rnew Some additional overhead per stripe But it is usually worth it Cost per iteration |M|(1+) + (k+1)|r| Slides by Jure Leskovec: Mining Massive Datasets

Some Problems with Page Rank
Measures generic popularity of a page Biased against topic-specific authorities Solution: Topic-Specific PageRank (next) Uses a single measure of importance Other models e.g., hubs-and-authorities Solution: Hubs-and-Authorities (next) Susceptible to Link spam Artificial link topographies created in order to boost page rank Solution: TrustRank (next) Slides by Jure Leskovec: Mining Massive Datasets