Presentation is loading. Please wait.

Presentation is loading. Please wait.

Link Analysis: PageRank. Ranking Nodes on the Graph Web pages are not equally “important” www.joe-schmoe.comwww.joe-schmoe.com vs. www.stanford.eduwww.stanford.edu.

Similar presentations


Presentation on theme: "Link Analysis: PageRank. Ranking Nodes on the Graph Web pages are not equally “important” www.joe-schmoe.comwww.joe-schmoe.com vs. www.stanford.eduwww.stanford.edu."— Presentation transcript:

1 Link Analysis: PageRank

2 Ranking Nodes on the Graph Web pages are not equally “important” vs. Since there is large diversity in the connectivity of the web graph we can rank the pages by the link structure Slides by Jure Leskovec: Mining Massive Datasets2 vs.

3 Link Analysis Algorithms We will cover the following Link Analysis approaches to computing importances of nodes in a graph: – Page Rank – Hubs and Authorities (HITS) – Topic-Specific (Personalized) Page Rank – Web Spam Detection Algorithms Slides by Jure Leskovec: Mining Massive Datasets3

4 Links as Votes Idea: Links as votes – Page is more important if it has more links In-coming links? Out-going links? Think of in-links as votes: – has 23,400 inlinks – has 1 inlink Are all in-links are equal? – Links from important pages count more – Recursive question! Slides by Jure Leskovec: Mining Massive Datasets4

5 Simple Recursive Formulation Each link’s vote is proportional to the importance of its source page If page p with importance x has n out-links, each link gets x/n votes Page p’s own importance is the sum of the votes on its in-links Slides by Jure Leskovec: Mining Massive Datasets5 p

6 PageRank: The “Flow” Model A “vote” from an important page is worth more A page is important if it is pointed to by other important pages Define a “rank” r j for node j Slides by Jure Leskovec: Mining Massive Datasets6 y y m m a a a/2 y/2 a/2 m y/2 The web in 1839 Flow equations: r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2

7 Solving the Flow Equations 3 equations, 3 unknowns, no constants – No unique solution Additional constraint forces uniqueness – r y + r a + r m = 1 – r y = 2/5, r a = 2/5, r m = 1/5 Gaussian elimination method works for small examples, but we need a better method for large web-size graphs Slides by Jure Leskovec: Mining Massive Datasets7 r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 Flow equations:

8 PageRank: Matrix Formulation Stochastic adjacency matrix M – Let page j has d j out-links – If j → i, then M ij = 1/d j else M ij = 0 M is a column stochastic matrix – Columns sum to 1 Rank vector r : vector with an entry per page – r i is the importance score of page i –  i r i = 1 The flow equations can be written r = M r Slides by Jure Leskovec: Mining Massive Datasets8

9 Example Suppose page j links to 3 pages, including i Slides by Jure Leskovec: Mining Massive Datasets9 i j M r r = i 1/3

10 Eigenvector Formulation The flow equations can be written r = M ∙ r So the rank vector is an eigenvector of the stochastic web matrix – In fact, its first or principal eigenvector, with corresponding eigenvalue 1 Slides by Jure Leskovec: Mining Massive Datasets10

11 Example: Flow Equations & M Slides by Jure Leskovec: Mining Massive Datasets11 r = Mr y ½ ½ 0 y a = ½ 0 1 a m 0 ½ 0 m y y a a m m yam y½½0 a½01 m0½0 r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2

12 Power Iteration Method Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks Power iteration: a simple iterative scheme – Suppose there are N web pages – Initialize: r (0) = [1/N,….,1/N] T – Iterate: r (t+1) = M ∙ r (t) – Stop when |r (t+1) – r (t) | 1 <  |x| 1 =  1 ≤ i ≤ N |x i | is the L 1 norm Can use any other vector norm e.g., Euclidean Slides by Jure Leskovec: Mining Massive Datasets12 d i …. out-degree of node i

13 PageRank: How to solve? Slides by Jure Leskovec: Mining Massive Datasets13 y y a a m m yam y½½0 a½01 m0½0 Iteration 0, 1, 2, … r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2

14 Random Walk Interpretation  Imagine a random web surfer: – At any time t, surfer is on some page u – At time t+1, the surfer follows an out-link from u uniformly at random – Ends up on some page v linked from u – Process repeats indefinitely  Let:  p(t) … vector whose i th coordinate is the prob. that the surfer is at page i at time t – p(t) is a probability distribution over pages Slides by Jure Leskovec: Mining Massive Datasets14 j i1i1 i2i2 i3i3

15 The Stationary Distribution Where is the surfer at time t+1 ? – Follows a link uniformly at random p(t+1) = M · p(t) Suppose the random walk reaches a state p(t+1) = M · p(t) = p(t) then p(t) is stationary distribution of a random walk  Our rank vector r satisfies r = M · r – So, it is a stationary distribution for the random walk Slides by Jure Leskovec: Mining Massive Datasets15 j i1i1 i2i2 i3i3

16 PageRank: Three Questions Does this converge? Does it converge to what we want? Are results reasonable? Slides by Jure Leskovec: Mining Massive Datasets16 or equivalently

17 Does This Converge? Example: r a 1010 r b 0101 Slides by Jure Leskovec: Mining Massive Datasets17 = b b a a Iteration 0, 1, 2, …

18 Does it Converge to What We Want? Example: r a 1000 r b 0100 Slides by Jure Leskovec: Mining Massive Datasets18 = b b a a Iteration 0, 1, 2, …

19 Problems with the “Flow” Model 2 problems: Some pages are “dead ends” (have no out-links) – Such pages cause importance to “leak out” Spider traps (all out links are within the group) – Eventually spider traps absorb all importance Slides by Jure Leskovec: Mining Massive Datasets19

20 Problem: Spider Traps Slides by Jure Leskovec: Mining Massive Datasets20 Iteration 0, 1, 2, … y y a a m m yam y½½0 a½00 m0½1 r y = r y /2 + r a /2 r a = r y /2 r m = r a /2 + r m

21 Solution: Random Teleports The Google solution for spider traps: At each time step, the random surfer has two options: – With probability , follow a link at random – With probability 1- , jump to some page uniformly at random – Common values for  are in the range 0.8 to 0.9 Surfer will teleport out of spider trap within a few time steps Slides by Jure Leskovec: Mining Massive Datasets21 y y a a m m y y a a m m

22 Problem: Dead Ends Slides by Jure Leskovec: Mining Massive Datasets22 Iteration 0, 1, 2, … y y a a m m yam y½½0 a½00 m0½0 r y = r y /2 + r a /2 r a = r y /2 r m = r a /2

23 Solution: Dead Ends Teleports: Follow random teleport links with probability 1.0 from dead-ends – Adjust matrix accordingly Slides by Jure Leskovec: Mining Massive Datasets23 y y a a m m yam y½½⅓ a½0⅓ m0½⅓ yam y½½0 a½00 m0½0 y y a a m m

24 Why Teleports Solve the Problem? Markov Chains Set of states X Transition matrix P where P ij = P(X t =i | X t-1 =j) π specifying the probability of being at each state x  X Goal is to find π such that π = P π Slides by Jure Leskovec: Mining Massive Datasets24

25 Why is This Analogy Useful? Theory of Markov chains Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic. Slides by Jure Leskovec: Mining Massive Datasets25

26 Make M Stochastic Stochastic: Every column sums to 1 A possible solution: Add green links Slides by Jure Leskovec: Mining Massive Datasets26 y y a a m m yam y½½1/3 a½0 m0½ r y = r y /2 + r a /2 + r m /3 r a = r y /2+ r m /3 r m = r a /2 + r m /3 a i …=1 if node i has out deg 0, =0 else 1…vector of all 1s

27 Make M Aperiodic A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k. A possible solution: Add green links Slides by Jure Leskovec: Mining Massive Datasets27 y y a a m m

28 Make M Irreducible From any state, there is a non-zero probability of going from any one state to any another A possible solution: Add green links Slides by Jure Leskovec: Mining Massive Datasets28 y y a a m m

29 Solution: Random Jumps Slides by Jure Leskovec: Mining Massive Datasets29 d i … out-degree of node i From now on: We assume M has no dead ends That is, we follow random teleport links with probability 1.0 from dead-ends

30 The Google Matrix Slides by Jure Leskovec: Mining Massive Datasets30

31 Random Teleports (  ) Slides by Jure Leskovec: Mining Massive Datasets31 y a = m 1/ /33 5/33 21/33... y y a a m m 0.8·½+0.2·⅓ 0.2·⅓ ·⅓ 0.2·⅓ 0.8·½+0.2·⅓ 1/2 1/2 0 1/ /2 1 1/3 1/3 1/3 y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 13/ S 1/n·1·1 T A

32 Computing Page Rank Key step is matrix-vector multiplication – r new = A ∙ r old Easy if we have enough main memory to hold A, r old, r new Say N = 1 billion pages – We need 4 bytes for each entry (say) – 2 billion entries for vectors, approx 8GB – Matrix A has N 2 entries is a large number! Slides by Jure Leskovec: Mining Massive Datasets32 ½ ½ 0 ½ ½ 1 1/3 1/3 1/3 7/15 7/15 1/15 7/15 1/15 1/15 1/15 7/15 13/ A =  ∙M + (1-  ) [1/N] NxN = A =

33 Matrix Formulation Suppose there are N pages Consider a page j, with set of out-links d j We have M ij = 1/|d j | when j → i and M ij = 0 otherwise The random teleport is equivalent to – Adding a teleport link from j to every other page with probability (1-  )/N – Reducing the probability of following each out-link from 1/|d j | to  /|d j | – Equivalent: Tax each page a fraction (1-  ) of its score and redistribute evenly Slides by Jure Leskovec: Mining Massive Datasets33

34 Rearranging the Equation Slides by Jure Leskovec: Mining Massive Datasets34 [x] N … a vector of length N with all entries x

35 Sparse Matrix Formulation Slides by Jure Leskovec: Mining Massive Datasets35

36 Sparse Matrix Encoding Encode sparse matrix using only nonzero entries – Space proportional roughly to number of links – Say 10N, or 4*10*1 billion = 40GB – Still won’t fit in memory, but will fit on disk Slides by Jure Leskovec: Mining Massive Datasets , 5, , 64, 113, 117, , 23 source node degree destination nodes

37 Basic Algorithm: Update Step Assume enough RAM to fit r new into memory – Store r old and matrix M on disk Then 1 step of power-iteration is: Slides by Jure Leskovec: Mining Massive Datasets , 5, , 64, 113, , 23 srcdegreedestination r new r old Initialize all entries of r new to (1-  )/N For each page p (of out-degree n): Read into memory: p, n, dest 1,…,dest n, r old (p) for j = 1…n: r new (dest j ) +=  r old (p) / n

38 Analysis Assume enough RAM to fit r new into memory – Store r old and matrix M on disk In each iteration, we have to: – Read r old and M – Write r new back to disk – IO cost = 2|r| + |M| Question: – What if we could not even fit r new in memory? Slides by Jure Leskovec: Mining Massive Datasets38

39 Block-based Update Algorithm Slides by Jure Leskovec: Mining Massive Datasets , 1, 3, , , 4 srcdegreedestination r new r old

40 Analysis of Block Update Similar to nested-loop join in databases – Break r new into k blocks that fit in memory – Scan M and r old once for each block k scans of M and r old – k(|M| + |r|) + |r| = k|M| + (k+1)|r| Can we do better? – Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration Slides by Jure Leskovec: Mining Massive Datasets40

41 Block-Stripe Update Algorithm Slides by Jure Leskovec: Mining Massive Datasets , srcdegreedestination r new r old

42 Block-Stripe Analysis Break M into stripes – Each stripe contains only destination nodes in the corresponding block of r new Some additional overhead per stripe – But it is usually worth it Cost per iteration – |M|(1+  ) + (k+1)|r| Slides by Jure Leskovec: Mining Massive Datasets42

43 Some Problems with Page Rank Measures generic popularity of a page – Biased against topic-specific authorities – Solution: Topic-Specific PageRank (next) Uses a single measure of importance – Other models e.g., hubs-and-authorities – Solution: Hubs-and-Authorities (next) Susceptible to Link spam – Artificial link topographies created in order to boost page rank – Solution: TrustRank (next) Slides by Jure Leskovec: Mining Massive Datasets43


Download ppt "Link Analysis: PageRank. Ranking Nodes on the Graph Web pages are not equally “important” www.joe-schmoe.comwww.joe-schmoe.com vs. www.stanford.eduwww.stanford.edu."

Similar presentations


Ads by Google