
1 Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis

2 Introduction to Information Retrieval Outline ❶ Anchor Text ❷ PageRank ❸ HITS (Hyperlink-Induced Topic Search): Hubs & Authorities

3 Introduction to Information Retrieval Outline ❶ Anchor Text ❷ PageRank ❸ HITS: Hubs & Authorities

4 Introduction to Information Retrieval The web as a directed graph
- Assumption 1: A hyperlink is a quality signal. The hyperlink d1 → d2 indicates that d1's author deems d2 high-quality and relevant.
- Assumption 2: The anchor text describes the content of d2.
- We use "anchor text" somewhat loosely here for the text surrounding the hyperlink.
- Example: "You can find cheap cars <a href="http://…">here</a>."
- Anchor text: "You can find cheap cars here"
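A minimal sketch of collecting anchor text at index time, using Python's standard html.parser. The class name and example URL are illustrative; a real indexer would also keep a window of surrounding text, which this sketch omits:

    from html.parser import HTMLParser

    class AnchorCollector(HTMLParser):
        """Collect (target URL, anchor text) pairs from one page."""
        def __init__(self):
            super().__init__()
            self.anchors = []   # list of (href, anchor text) pairs
            self._href = None   # target of the <a> tag we are currently inside
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:   # we are inside an <a>...</a> element
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href is not None:
                self.anchors.append((self._href, "".join(self._text).strip()))
                self._href = None

    p = AnchorCollector()
    p.feed('You can find cheap cars <a href="http://example.com">here</a>.')
    print(p.anchors)   # [('http://example.com', 'here')]

Note that this collects only the text inside the <a> element ("here"); the looser definition above would also attach the surrounding words ("You can find cheap cars here") to the link target.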

5 Introduction to Information Retrieval [text of d2] only vs. [text of d2] + [anchor text → d2]
- Searching on [text of d2] + [anchor text → d2] is often more effective than searching on [text of d2] only.
- Example: query "IBM"
  - Matches IBM's copyright page
  - Matches many spam pages
  - Matches the IBM Wikipedia article
  - May not match the IBM home page, if the home page is mostly graphics!
- Searching on [anchor text → d2] is better for the query "IBM": in this representation, the page with the most occurrences of "IBM" is www.ibm.com.

6 Introduction to Information Retrieval Anchor text containing "IBM" pointing to www.ibm.com

7 Introduction to Information Retrieval Indexing anchor text
- Thus: anchor text is often a better description of a page's content than the page itself.
- Anchor text can be weighted more highly than document text.
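A hedged sketch of what "weighted more highly" could mean in a scoring function: keep separate term-frequency fields for anchor text and body text and combine them with a larger anchor weight. The weights 3.0 and 1.0 and the field names are illustrative assumptions, not values from the lecture:

    def score(query_terms, doc, w_anchor=3.0, w_body=1.0):
        """Toy field-weighted score: a term occurrence in anchor text
        pointing at the page counts more than one in the page body."""
        s = 0.0
        for t in query_terms:
            s += w_anchor * doc["anchor_tf"].get(t, 0)
            s += w_body * doc["body_tf"].get(t, 0)
        return s

    # The IBM home page: little on-page text, but many in-links anchored "IBM".
    ibm_home = {"anchor_tf": {"ibm": 10000}, "body_tf": {"ibm": 0}}
    spam_page = {"anchor_tf": {"ibm": 2}, "body_tf": {"ibm": 500}}
    print(score(["ibm"], ibm_home) > score(["ibm"], spam_page))   # True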

8 Introduction to Information Retrieval Outline ❶ Anchor Text ❷ PageRank ❸ HITS: Hubs & Authorities

9 Introduction to Information Retrieval PageRank
- PageRank assigns to every node (page) in the web graph a numerical score between 0 and 1, known as its PageRank.
- The idea behind PageRank: pages visited more often are more important.

10 Introduction to Information Retrieval Model behind PageRank: random walk
- Imagine a web surfer doing a random walk on the web:
  - Start at a random page.
  - At each step, go out of the current page along one of the links on that page, equiprobably.

11 Introduction to Information Retrieval Dead ends
- The web is full of dead ends (pages with no out-links).
- A random walk can get stuck in dead ends.
- The way out: the teleport operation.

12 Introduction to Information Retrieval Teleporting
- At a dead end, jump to a random web page with probability 1/N (N is the total number of nodes in the web graph).
- At a non-dead end, with probability 10%, jump to a random web page (to each with probability 0.1/N).
- We use the teleport operation in two ways:
  1. When at a node with no out-links.
  2. At any node that has outgoing links, the surfer invokes the teleport operation with probability α and the standard random walk with probability 1 - α.

13 Introduction to Information Retrieval Markov chains
- A Markov chain is characterized by an N × N transition probability matrix P such that:
  (a) each entry of P is in the interval [0, 1];
  (b) the entries in each row of P add up to 1.
- The entry P_ij tells us the probability that the state at the next time step is j, conditioned on the current state being i.
- Each entry P_ij is known as a transition probability and depends only on the current state i.
- A matrix with non-negative entries that satisfies (b) is known as a stochastic matrix; its largest eigenvalue is 1.

14 Introduction to Information Retrieval How to get the transition probability matrix P
- The adjacency matrix A of the web graph is defined as follows: if there is a hyperlink from page i to page j, then A_ij = 1, otherwise A_ij = 0.
- We can derive the transition probability matrix P from the matrix A:
  1. If a row of A has no 1's, replace each element by 1/N.
  2. Divide each 1 in A by the number of 1's in its row.
  3. Multiply the resulting matrix by 1 - α.
  4. Add α/N to every entry of the resulting matrix, to obtain P.
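The four steps as a sketch in Python/NumPy, run on the example graph of the next slide with α = 0.21; it reproduces the rounded entries (0.82, 0.43, 0.29, 0.03) shown on slides 15-18:

    import numpy as np

    def transition_matrix(A, alpha):
        """Derive P from a 0/1 adjacency matrix A, following the four steps."""
        A = A.astype(float)
        N = A.shape[0]
        for i in range(N):
            if A[i].sum() == 0:     # step 1: dead-end row, replace by 1/N
                A[i] = 1.0 / N
            else:                   # step 2: divide each 1 by the row's count of 1's
                A[i] /= A[i].sum()
        P = (1 - alpha) * A + alpha / N          # steps 3 and 4: mix in teleporting
        assert np.allclose(P.sum(axis=1), 1.0)   # rows are probability distributions
        return P

    A = np.array([[0,0,1,0,0,0,0],
                  [0,1,1,0,0,0,0],
                  [1,0,1,1,0,0,0],
                  [0,0,0,1,1,0,0],
                  [0,0,0,0,0,0,1],
                  [0,0,0,0,0,1,1],
                  [0,0,0,1,1,0,1]])
    P = transition_matrix(A, alpha=0.21)
    print(P.round(2))   # row d0: [0.03 0.03 0.82 0.03 0.03 0.03 0.03]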

15 Introduction to Information Retrieval Web graph example: adjacency matrix A (rows = source page, columns = target page)

        d0  d1  d2  d3  d4  d5  d6
    d0   0   0   1   0   0   0   0
    d1   0   1   1   0   0   0   0
    d2   1   0   1   1   0   0   0
    d3   0   0   0   1   1   0   0
    d4   0   0   0   0   0   0   1
    d5   0   0   0   0   0   1   1
    d6   0   0   0   1   1   0   1

16 Introduction to Information Retrieval Web graph example: divide each 1 in A by the number of 1's in its row

        d0    d1    d2    d3    d4    d5    d6
    d0  0.00  0.00  1.00  0.00  0.00  0.00  0.00
    d1  0.00  0.50  0.50  0.00  0.00  0.00  0.00
    d2  0.33  0.00  0.33  0.33  0.00  0.00  0.00
    d3  0.00  0.00  0.00  0.50  0.50  0.00  0.00
    d4  0.00  0.00  0.00  0.00  0.00  0.00  1.00
    d5  0.00  0.00  0.00  0.00  0.00  0.50  0.50
    d6  0.00  0.00  0.00  0.33  0.33  0.00  0.33

17 Introduction to Information Retrieval Web graph example (α = 0.21): multiply the resulting matrix by 1 - α = 0.79

        d0    d1    d2    d3    d4    d5    d6
    d0  0.00  0.00  0.79  0.00  0.00  0.00  0.00
    d1  0.00  0.40  0.40  0.00  0.00  0.00  0.00
    d2  0.26  0.00  0.26  0.26  0.00  0.00  0.00
    d3  0.00  0.00  0.00  0.40  0.40  0.00  0.00
    d4  0.00  0.00  0.00  0.00  0.00  0.00  0.79
    d5  0.00  0.00  0.00  0.00  0.00  0.40  0.40
    d6  0.00  0.00  0.00  0.26  0.26  0.00  0.26

18 Introduction to Information Retrieval Web graph example (α = 0.21): add α/N = 0.03 to every entry to obtain the transition probability matrix P

        d0    d1    d2    d3    d4    d5    d6
    d0  0.03  0.03  0.82  0.03  0.03  0.03  0.03
    d1  0.03  0.43  0.43  0.03  0.03  0.03  0.03
    d2  0.29  0.03  0.29  0.29  0.03  0.03  0.03
    d3  0.03  0.03  0.03  0.43  0.43  0.03  0.03
    d4  0.03  0.03  0.03  0.03  0.03  0.03  0.82
    d5  0.03  0.03  0.03  0.03  0.03  0.43  0.43
    d6  0.03  0.03  0.03  0.29  0.29  0.03  0.29

19 Introduction to Information Retrieval Ergodic Markov chains
- A Markov chain is said to be ergodic if there exists a positive integer T0 such that, for all pairs of states i, j: if the chain is started at time 0 in state i, then for all t > T0 the probability of being in state j at time t is greater than 0.
- The teleport operation makes the web graph ergodic.
- For any ergodic Markov chain, there is a unique steady-state probability vector, which is the principal left eigenvector of P.
- The steady-state probability of a state is the PageRank of the corresponding web page.

20 Introduction to Information Retrieval Change in the probability vector
- If the probability vector is x = (x1, ..., xN) at this step, what is it at the next step?
- Recall that row i of the transition probability matrix P tells us where we go next from state i.
- So from x, our next state is distributed as xP.

21 Introduction to Information Retrieval Steady state in vector notation
- The steady state in vector notation is simply a vector π = (π1, π2, ..., πN) of probabilities.
- (We use π to distinguish it from the notation for the probability vector x.)
- πi is the long-term visit rate (or PageRank) of page di.
- So we can think of PageRank as a very long vector, one entry per page.

22 Introduction to Information Retrieval How do we compute the steady-state vector?
- In other words: how do we compute PageRank?
- Recall: π = (π1, π2, ..., πN) is the PageRank vector, the vector of steady-state probabilities...
- ... and if the distribution in this step is x, then the distribution in the next step is xP.
- But π is the steady state! So: π = πP.
- Solving this matrix equation gives us π.
- π is the principal left eigenvector of P, that is, the left eigenvector with the largest eigenvalue.
- All transition probability matrices have largest eigenvalue 1.
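A direct way to solve π = πP, sketched with NumPy: numpy.linalg.eig returns right eigenvectors, so we take the eigenvector of P^T for the eigenvalue closest to 1 and rescale it into a probability vector. The two-state P is the example matrix used a few slides below:

    import numpy as np

    def pagerank_eig(P):
        """Steady state = principal left eigenvector of P, rescaled to sum to 1."""
        eigvals, eigvecs = np.linalg.eig(P.T)   # left eigenvectors of P = right of P^T
        k = np.argmin(np.abs(eigvals - 1.0))    # index of the eigenvalue closest to 1
        pi = np.real(eigvecs[:, k])
        return pi / pi.sum()                    # fixes sign and scale

    P = np.array([[0.1, 0.9],
                  [0.3, 0.7]])
    print(pagerank_eig(P))   # [0.25 0.75]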

23 Introduction to Information Retrieval One way of computing the PageRank π
- Start with any distribution x, e.g., the uniform distribution.
- After one step, we're at xP.
- After two steps, we're at xP^2.
- After k steps, we're at xP^k.
- Algorithm: multiply x by increasing powers of P until convergence.
- This method is called power iteration.
- Recall: regardless of where we start, we eventually reach the steady state π.
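Power iteration as a sketch; the tolerance and iteration cap are arbitrary choices. On the two-state example of the next slides this converges to (0.25, 0.75) in a handful of steps:

    import numpy as np

    def power_iteration(P, tol=1e-10, max_iter=1000):
        """Repeatedly apply x <- xP until x stops changing."""
        N = P.shape[0]
        x = np.full(N, 1.0 / N)   # start anywhere, e.g. the uniform distribution
        for _ in range(max_iter):
            x_next = x @ P
            if np.abs(x_next - x).sum() < tol:   # converged
                break
            x = x_next
        return x_next

    P = np.array([[0.1, 0.9],
                  [0.3, 0.7]])
    print(power_iteration(P).round(4))   # [0.25 0.75]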

24 Introduction to Information Retrieval Power iteration: example
- What is the PageRank / steady state in this example?

25 Introduction to Information Retrieval Computing PageRank: power iteration example
P11 = 0.1, P12 = 0.9, P21 = 0.3, P22 = 0.7

          x1 = Pt(d1)   x2 = Pt(d2)
    t0    0             1             → xP    = (0.3, 0.7)
    t1    0.3           0.7           → xP^2  = (0.24, 0.76)
    t2    0.24          0.76          → xP^3  = (0.252, 0.748)
    t3    0.252         0.748         → xP^4  = (0.2496, 0.7504)
    ...
    t∞    0.25          0.75          → xP^∞  = (0.25, 0.75)

PageRank vector = π = (π1, π2) = (0.25, 0.75)

26 Introduction to Information Retrieval π = (0.25, 0.75) is the principal left eigenvector of P

    P = | 0.1  0.9 |     eigenvalues: 1 and -0.2
        | 0.3  0.7 |

Diagonalizing P and writing the state distribution at step k as (Sk, Tk) = (S0, T0) P^k gives the closed form:

    Sk = (S0 + T0) · 1^k · 0.25 - (-0.75·S0 + 0.25·T0) · (-0.2)^k
    Tk = (S0 + T0) · 1^k · 0.75 + (-0.75·S0 + 0.25·T0) · (-0.2)^k

Since S0 + T0 = 1 and |-0.2| < 1, the second term vanishes as k grows, so (Sk, Tk) → (0.25, 0.75) from any starting distribution.

27 Introduction to Information Retrieval Example web graph

28 Introduction to Information Retrieval Transition matrix with teleporting (α = 0.14)

        d0    d1    d2    d3    d4    d5    d6
    d0  0.02  0.02  0.88  0.02  0.02  0.02  0.02
    d1  0.02  0.45  0.45  0.02  0.02  0.02  0.02
    d2  0.31  0.02  0.31  0.31  0.02  0.02  0.02
    d3  0.02  0.02  0.02  0.45  0.45  0.02  0.02
    d4  0.02  0.02  0.02  0.02  0.02  0.02  0.88
    d5  0.02  0.02  0.02  0.02  0.02  0.45  0.45
    d6  0.02  0.02  0.02  0.31  0.31  0.02  0.31

29 Introduction to Information Retrieval Power method: the sequence of vectors xP^k (intermediate steps omitted; the sequence has converged by xP^13)

         x     xP    xP^2  xP^3  ...  xP^13
    d0   0.14  0.06  0.09  0.07  ...  0.05
    d1   0.14  0.08  0.06  0.04  ...  0.04
    d2   0.14  0.25  0.18  0.17  ...  0.11
    d3   0.14  0.16  0.23  0.24  ...  0.25
    d4   0.14  0.12  0.16  0.19  ...  0.21
    d5   0.14  0.08  0.06  0.04  ...  0.04
    d6   0.14  0.25  0.23  0.25  ...  0.31

30 Introduction to Information Retrieval Example web graph: PageRank

    d0: 0.05   d1: 0.04   d2: 0.11   d3: 0.25   d4: 0.21   d5: 0.04   d6: 0.31

31 Introduction to Information Retrieval Outline ❶ Anchor Text ❷ PageRank ❸ HITS: Hubs & Authorities

32 Introduction to Information Retrieval HITS: Hyperlink-Induced Topic Search
- Premise: there are two different types of relevance on the web.
- Relevance type 1: Hubs. A hub page is a good list of links to pages answering the information need.
- Relevance type 2: Authorities. An authority page is a direct answer to the information need.
- Most approaches to search (including PageRank ranking) don't make a distinction between these two very different types of relevance.

33 Introduction to Information Retrieval Hubs and authorities: definition
- A good hub page for a topic links to many authority pages for that topic.
- A good authority page for a topic is linked to by many hub pages for that topic.
- Circular definition: we will turn this into an iterative computation.

34 Introduction to Information Retrieval How to compute hub and authority scores
- Do a regular web search first.
- Call the search result the root set.
- Find all pages that are linked to or link to pages in the root set.
- Call this larger set the base set.
- Finally, compute hubs and authorities for the base set (which we'll view as a small web graph).

35 Introduction to Information Retrieval Root set and base set (1) The root set root set

36 Introduction to Information Retrieval Root set and base set (1) Nodes that root set nodes link to root set

37 Introduction to Information Retrieval Root set and base set (1) Nodes that link to root set nodes root set

38 Introduction to Information Retrieval Root set and base set (1) The base set root set base set

39 Introduction to Information Retrieval Root set and base set (2)
- The root set typically has 200-1000 nodes; the base set may have up to 5000 nodes.
- Computation of the base set, as shown on the previous slides:
  - Follow out-links by parsing the pages in the root set.
  - Find d's in-links by searching for all pages containing a link to d.
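A sketch of the base-set construction, assuming two precomputed, hypothetical structures: out_links (page → pages it links to, obtained by parsing) and in_links (page → pages linking to it, obtained from a link index):

    def build_base_set(root_set, out_links, in_links):
        """Base set = root set + pages it links to + pages linking into it."""
        base = set(root_set)
        for d in root_set:
            base |= set(out_links.get(d, ()))   # follow out-links parsed from d
            base |= set(in_links.get(d, ()))    # pages containing a link to d
        return base

    out_links = {"r1": ["p1", "p2"], "r2": ["p2"]}
    in_links = {"r1": ["q1"], "r2": ["q2", "p1"]}
    print(sorted(build_base_set({"r1", "r2"}, out_links, in_links)))
    # ['p1', 'p2', 'q1', 'q2', 'r1', 'r2']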

40 Introduction to Information Retrieval Hub and authority scores
- Compute for each page d in the base set a hub score h(d) and an authority score a(d).
- Initialization: for all d: h(d) = 1, a(d) = 1.
- Iteratively update all h(d), a(d).
- After convergence:
  - Output pages with highest h scores as top hubs.
  - Output pages with highest a scores as top authorities.
  - So we output two ranked lists.

41 Introduction to Information Retrieval Iterative update
- For all d: h(d) = Σ_{d→y} a(y), the sum of the authority scores of the pages d links to.
- For all d: a(d) = Σ_{y→d} h(y), the sum of the hub scores of the pages linking to d.
- Iterate these two steps until convergence.
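The two updates as a sketch on a dictionary-based graph. The per-round normalization is a common convention to keep the scores bounded (the slide leaves it unspecified), and the tiny example graph is made up:

    def hits(out_links, pages, iters=50):
        """Iterate the hub/authority updates on a small link graph."""
        h = {d: 1.0 for d in pages}   # initialization: all scores 1
        a = {d: 1.0 for d in pages}
        for _ in range(iters):
            # h(d) = sum of a(y) over pages y that d links to
            h = {d: sum(a[y] for y in out_links.get(d, ())) for d in pages}
            # a(d) = sum of h(y) over pages y that link to d
            a = {d: sum(h[y] for y in pages if d in out_links.get(y, ()))
                 for d in pages}
            norm = sum(h.values())
            h = {d: v / norm for d, v in h.items()}
            norm = sum(a.values())
            a = {d: v / norm for d, v in a.items()}
        return h, a

    out = {"d0": ["d2"], "d1": ["d1", "d2"], "d2": ["d0", "d2", "d3"]}
    h, a = hits(out, pages=["d0", "d1", "d2", "d3"])
    print(max(a, key=a.get))   # d2: it is linked to by d0, d1 and itself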

42 Introduction to Information Retrieval Hubs & Authorities: comments
- HITS can pull together good pages regardless of page content.
- Once the base set is assembled, we only do link analysis, no text matching.
- Pages in the base set often do not contain any of the query words.
- Danger: topic drift. The pages found by following links may not be related to the original query.

43 Introduction to Information Retrieval Proof of convergence
- We define an N×N adjacency matrix A. (We called this the link matrix earlier.)
- For 1 ≤ i, j ≤ N, the matrix entry Aij tells us whether there is a link from page i to page j (Aij = 1) or not (Aij = 0).
- Example:

        d1  d2  d3
    d1   0   1   0
    d2   1   1   1
    d3   1   0   0

44 Introduction to Information Retrieval Write update rules as matrix operations
- Define the hub vector h = (h1, ..., hN) as the vector of hub scores; hi is the hub score of page di.
- Similarly for a, the vector of authority scores.
- Now we can write h(d) = Σ_{d→y} a(y) as a matrix operation: h = Aa ...
- ... and we can write a(d) = Σ_{y→d} h(y) as a = A^T h.
- HITS algorithm in matrix notation:
  - Compute h = Aa.
  - Compute a = A^T h.
  - Iterate until convergence.

45 Introduction to Information Retrieval HITS as an eigenvector problem
- HITS algorithm in matrix notation. Iterate:
  - Compute h = Aa.
  - Compute a = A^T h.
- By substitution we get: h = AA^T h and a = A^T Aa.
- Thus, h is an eigenvector of AA^T and a is an eigenvector of A^T A.
- So the HITS algorithm is actually a special case of the power method, and hub and authority scores are eigenvector components.
- HITS and PageRank both formalize link analysis as eigenvector problems.
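The eigenvector view, sketched with NumPy: AA^T and A^T A are symmetric, so numpy.linalg.eigh applies; an eigenvector's sign and scale are arbitrary, hence the abs() and normalization. The 3×3 matrix is the example from the previous slides:

    import numpy as np

    def hits_eig(A):
        """Hub/authority scores as principal eigenvectors of AA^T and A^T A."""
        def principal(M):
            w, V = np.linalg.eigh(M)   # symmetric eigensolver, eigenvalues ascending
            v = np.abs(V[:, -1])       # eigenvector of the largest eigenvalue
            return v / v.sum()
        return principal(A @ A.T), principal(A.T @ A)   # (h, a)

    A = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [1, 0, 0]])
    h, a = hits_eig(A)
    print(h.round(2), a.round(2))   # d2, which links to all pages, is the top hub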

46 Introduction to Information Retrieval Example web graph: matrix A

        d0  d1  d2  d3  d4  d5  d6
    d0   0   0   1   0   0   0   0
    d1   0   1   1   0   0   0   0
    d2   1   0   1   2   0   0   0
    d3   0   0   0   1   1   0   0
    d4   0   0   0   0   0   0   1
    d5   0   0   0   0   0   1   1
    d6   0   0   0   2   1   0   1

47 Introduction to Information Retrieval Hub vectors: h0, hi = A·ai, i ≥ 1 (intermediate steps omitted)

         h0    h1    h2    h3   ...   h5
    d0   0.14  0.06  0.04  0.04  ...  0.03
    d1   0.14  0.08  0.05  0.04  ...  0.04
    d2   0.14  0.28  0.32  0.33  ...  0.33
    d3   0.14  0.14  0.17  0.18  ...  0.18
    d4   0.14  0.06  0.04  0.04  ...  0.04
    d5   0.14  0.08  0.05  0.04  ...  0.04
    d6   0.14  0.30  0.33  0.34  ...  0.35

48 Introduction to Information Retrieval Authority vectors: ai = A^T·h(i-1), i ≥ 1 (intermediate steps omitted)

         a1    a2    a3    a4   ...   a7
    d0   0.06  0.09  0.10  0.10  ...  0.10
    d1   0.06  0.03  0.01  0.01  ...  0.01
    d2   0.19  0.14  0.13  0.12  ...  0.12
    d3   0.31  0.43  0.46  0.46  ...  0.47
    d4   0.13  0.14  0.16  0.16  ...  0.16
    d5   0.06  0.03  0.02  0.01  ...  0.01
    d6   0.19  0.14  0.13  0.13  ...  0.13

49 Introduction to Information Retrieval Example web graph: converged authority and hub scores

         a     h
    d0   0.10  0.03
    d1   0.01  0.04
    d2   0.12  0.33
    d3   0.47  0.18
    d4   0.16  0.04
    d5   0.01  0.04
    d6   0.13  0.35

50 Introduction to Information Retrieval Top-ranked pages
- Pages with highest in-degree: d2, d3, d6
- Pages with highest out-degree: d2, d6
- Pages with highest PageRank: d6
- Pages with highest hub score: d6 (close: d2)
- Pages with highest authority score: d3

51 Introduction to Information Retrieval PageRank vs. HITS: discussion
- PageRank can be precomputed; HITS has to be computed at query time.
- HITS is too expensive in most application scenarios.
- PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to.
- Claim: On the web, a good hub almost always is also a good authority.

