1 Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 21: Link Analysis

2 Outline: ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & Authorities

3 Outline: ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & Authorities

4 Applications of clustering in IR

  Application                What is clustered?        Benefit
  Search result clustering   search results            more effective information presentation to the user
  Scatter-Gather             (subsets of) collection   alternative user interface: "search without typing"
  Collection clustering      collection                effective information presentation for exploratory browsing
  Cluster-based retrieval    collection                higher efficiency: faster search

5 K-means algorithm

6 Initialization of K-means
▪ Random seed selection is just one of many ways K-means can be initialized.
▪ Random seed selection is not very robust: it is easy to get a suboptimal clustering.
▪ Better ways of computing initial centroids:
  ▪ Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has "good coverage" of the space).
  ▪ Use hierarchical clustering to find good seeds.
  ▪ Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS (see the sketch below).
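A minimal sketch of the last strategy, assuming scikit-learn is available: its KMeans runs n_init random seedings and keeps the clustering with the lowest RSS (which scikit-learn calls inertia). The toy data here is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))  # toy data: 100 points in 2D

# Run K-means from 10 random seed sets; keep the run with the lowest RSS.
km = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)
print(km.inertia_)      # RSS of the best of the 10 runs
print(km.labels_[:10])  # cluster assignments of the first 10 points
```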

7 External criterion: Purity
▪ Ω = {ω1, ω2, …, ωK} is the set of clusters and C = {c1, c2, …, cJ} is the set of classes.
▪ For each cluster ωk: find the class cj with the most members nkj in ωk.
▪ Sum all nkj and divide by the total number of points N:
  purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj|
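A short sketch of this computation, assuming NumPy; the toy cluster and class labels are hypothetical.

```python
import numpy as np

def purity(clusters, classes):
    """Fraction of points assigned to the majority class of their cluster."""
    total = 0
    for k in np.unique(clusters):
        members = classes[clusters == k]           # class labels inside cluster omega_k
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                      # max_j |omega_k ∩ c_j|
    return total / len(clusters)

clusters = np.array([0, 0, 0, 1, 1, 1])
classes  = np.array([0, 0, 1, 1, 1, 0])
print(purity(clusters, classes))  # (2 + 2) / 6 ≈ 0.67
```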

9 Finding the "knee" in the curve
▪ Pick the number of clusters where the curve "flattens". Here: 4 or 9.

14 Take-away today
▪ Anchor text: What exactly are links on the web and why are they important for IR?
▪ Citation analysis: the mathematical foundation of PageRank and link-based ranking
▪ PageRank: the original algorithm that was used for link-based ranking on the web
▪ Hubs & Authorities: an alternative link-based ranking algorithm

15 Outline: ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & Authorities

23 The web as a directed graph
▪ Assumption 1: A hyperlink is a quality signal.
  ▪ The hyperlink d1 → d2 indicates that d1's author deems d2 high-quality and relevant.
▪ Assumption 2: The anchor text describes the content of d2.
  ▪ We use "anchor text" somewhat loosely here for: the text surrounding the hyperlink.
  ▪ Example: "You can find cheap cars <a href=http://…>here</a>."
  ▪ Anchor text: "You can find cheap cars here"

33 [text of d2] only vs. [text of d2] + [anchor text → d2]
▪ Searching on [text of d2] + [anchor text → d2] is often more effective than searching on [text of d2] only.
▪ Example: query IBM
  ▪ Matches IBM's copyright page
  ▪ Matches many spam pages
  ▪ Matches the IBM Wikipedia article
  ▪ May not match the IBM home page!
  ▪ … if the IBM home page is mostly graphics
▪ Searching on [anchor text → d2] is better for the query IBM.
  ▪ In this representation, the page with the most occurrences of IBM is www.ibm.com.

35 Anchor text containing IBM pointing to www.ibm.com

38 Indexing anchor text
▪ Thus: anchor text is often a better description of a page's content than the page itself.
▪ Anchor text can be weighted more highly than document text, based on Assumptions 1 and 2 (see the sketch below).
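A minimal sketch of what "weighting anchor text more highly" could look like at scoring time. The weights and term counts are illustrative assumptions, not values from the lecture:

```python
def score(query_terms, body_tf, anchor_tf, w_body=1.0, w_anchor=3.0):
    """Weighted sum of a term's frequency in the document body and in the
    anchor text of links pointing to the document."""
    return sum(w_body * body_tf.get(t, 0) + w_anchor * anchor_tf.get(t, 0)
               for t in query_terms)

# www.ibm.com: the body barely says "ibm", but thousands of anchors do.
body_tf   = {"ibm": 1}
anchor_tf = {"ibm": 950}
print(score(["ibm"], body_tf, anchor_tf))  # 2851.0
```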

43 Exercise: Assumptions underlying PageRank
▪ Assumption 1: A link on the web is a quality signal – the author of the link thinks that the linked-to page is high-quality.
▪ Assumption 2: The anchor text describes the content of the linked-to page.
▪ Is Assumption 1 true in general?
▪ Is Assumption 2 true in general?

49 Google bombs
▪ A Google bomb is a search with "bad" results due to maliciously manipulated anchor text.
▪ Google introduced a new weighting function in January 2007 that fixed many Google bombs.
▪ Still some remnants: [dangerous cult] on Google, Bing, Yahoo
  ▪ Coordinated link creation by those who dislike the Church of Scientology
▪ Defused Google bombs: [dumb motherf…], [who is a failure?], [evil empire]

50 Outline: ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & Authorities

58 Origins of PageRank: Citation analysis (1)
▪ Citation analysis: the analysis of citations in the scientific literature.
▪ Example citation: "Miller (2001) has shown that physical activity alters the metabolism of estrogens."
▪ We can view "Miller (2001)" as a hyperlink linking two scientific articles.
▪ One application of these "hyperlinks" in the scientific literature:
  ▪ Measure the similarity of two articles by the overlap of other articles citing them.
  ▪ This is called cocitation similarity.
  ▪ Cocitation similarity on the web: Google's "find pages like this" or "Similar" feature.

67 Origins of PageRank: Citation analysis (2)
▪ Another application: citation frequency can be used to measure the impact of an article.
  ▪ Simplest measure: each article gets one vote – not very accurate.
▪ On the web: citation frequency = inlink count
  ▪ A high inlink count does not necessarily mean high quality …
  ▪ … mainly because of link spam.
▪ Better measure: weighted citation frequency or citation rank
  ▪ An article's vote is weighted according to its citation impact.
  ▪ Circular? No: it can be formalized in a well-defined way.

72 Origins of PageRank: Citation analysis (3)
▪ Better measure: weighted citation frequency or citation rank.
▪ This is basically PageRank.
▪ PageRank was invented in the context of citation analysis by Pinski and Narin in the 1970s.
▪ Citation analysis is a big deal: the budget and salary of this lecturer are / will be determined by the impact of his publications!

79 Origins of PageRank: Summary
▪ We can use the same formal representation for
  ▪ citations in the scientific literature
  ▪ hyperlinks on the web
▪ Appropriately weighted citation frequency is an excellent measure of quality …
  ▪ … both for web pages and for scientific publications.
▪ Next: the PageRank algorithm for computing weighted citation frequency on the web.

80 Outline: ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & Authorities

87 Model behind PageRank: Random walk
▪ Imagine a web surfer doing a random walk on the web:
  ▪ Start at a random page.
  ▪ At each step, go out of the current page along one of the links on that page, equiprobably.
▪ In the steady state, each page has a long-term visit rate.
▪ This long-term visit rate is the page's PageRank.
▪ PageRank = long-term visit rate = steady-state probability.

92 Formalization of the random walk: Markov chains
▪ A Markov chain consists of N states, plus an N × N transition probability matrix P.
▪ state = page
▪ At each step, we are on exactly one of the pages.
▪ For 1 ≤ i, j ≤ N, the matrix entry Pij tells us the probability of j being the next page, given that we are currently on page i.
▪ Clearly, for all i: Σj Pij = 1.

93 Example web graph

95 Link matrix for example

      d0  d1  d2  d3  d4  d5  d6
  d0   0   0   1   0   0   0   0
  d1   0   1   1   0   0   0   0
  d2   1   0   1   1   0   0   0
  d3   0   0   0   1   1   0   0
  d4   0   0   0   0   0   0   1
  d5   0   0   0   0   0   1   1
  d6   0   0   0   1   1   0   1

97 Transition probability matrix P for example

       d0    d1    d2    d3    d4    d5    d6
  d0  0.00  0.00  1.00  0.00  0.00  0.00  0.00
  d1  0.00  0.50  0.50  0.00  0.00  0.00  0.00
  d2  0.33  0.00  0.33  0.33  0.00  0.00  0.00
  d3  0.00  0.00  0.00  0.50  0.50  0.00  0.00
  d4  0.00  0.00  0.00  0.00  0.00  0.00  1.00
  d5  0.00  0.00  0.00  0.00  0.00  0.50  0.50
  d6  0.00  0.00  0.00  0.33  0.33  0.00  0.33
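This transition matrix can be obtained from the link matrix by dividing each row by its outdegree; a sketch, assuming NumPy (the example graph has no dead ends, so no row sum is zero):

```python
import numpy as np

# Link matrix A for the example graph (row = source page, column = target).
A = np.array([[0, 0, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 0, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 0, 1]], dtype=float)

P = A / A.sum(axis=1, keepdims=True)  # divide each row by its outdegree
print(P.round(2))                     # reproduces the matrix above
```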

103 Long-term visit rate
▪ Recall: PageRank = long-term visit rate.
▪ The long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time.
▪ Next: what properties must hold of the web graph for the long-term visit rate to be well defined?
▪ The web graph must correspond to an ergodic Markov chain.
▪ First a special case: the web graph must not contain dead ends.

108 Dead ends
▪ The web is full of dead ends (pages with no outgoing links).
▪ A random walk can get stuck in a dead end.
▪ If there are dead ends, long-term visit rates are not well defined (or are nonsensical).

115 Teleporting – to get us out of dead ends
▪ At a dead end, jump to a random web page with probability 1/N.
▪ At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N).
▪ With the remaining probability (90%), go out on a random hyperlink.
  ▪ For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225.
▪ 10% is a parameter, the teleportation rate.
▪ Note: "jumping" from a dead end is independent of the teleportation rate.
▪ A sketch of this adjustment follows below.
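A sketch of the adjustment, continuing the NumPy example. Here alpha is the teleportation rate: 10% in the bullets above, while the worked example later in the lecture uses 0.14, which with N = 7 gives the 0.02 entries.

```python
import numpy as np

def add_teleport(P, alpha=0.14):
    """Mix each row of P with the uniform distribution. Dead-end rows
    (all zeros) become fully uniform, independent of alpha."""
    N = P.shape[0]
    P = P.copy()
    dead = P.sum(axis=1) == 0
    P[dead] = 1.0 / N                              # dead end: jump anywhere with prob 1/N
    P[~dead] = (1 - alpha) * P[~dead] + alpha / N  # teleport + follow a random link
    return P
```

Applied to the example's P with alpha = 0.14, this reproduces the teleporting matrix shown later: 0.88 = 0.86 + 0.02 for a single outlink, 0.45 for two outlinks, 0.31 for three.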

119 Result of teleporting
▪ With teleporting, we cannot get stuck in a dead end.
▪ But even without dead ends, a graph may not have well-defined long-term visit rates.
▪ More generally, we require that the Markov chain be ergodic.

123 Ergodic Markov chains
▪ A Markov chain is ergodic if it is irreducible and aperiodic.
▪ Irreducibility. Roughly: there is a path from any page to any other page.
▪ Aperiodicity. Roughly: the pages cannot be partitioned such that the random walker visits the partitions sequentially.
▪ A non-ergodic Markov chain: for example, two states with transition probability 1.0 to each other, so the walk alternates between them forever (periodic).

130 Ergodic Markov chains
▪ Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
▪ This is the steady-state probability distribution.
▪ Over a long time period, we visit each state in proportion to this rate.
▪ It doesn't matter where we start.
▪ Teleporting makes the web graph ergodic.
▪ ⇒ The web graph + teleporting has a steady-state probability distribution.
▪ ⇒ Each page in the web graph + teleporting has a PageRank.

136 Formalization of "visit": Probability vector
▪ A probability (row) vector x = (x1, …, xN) tells us where the random walk is at any point.
▪ Example: x = (0 0 0 … 1 … 0 0 0), where the single 1 is at position i and the positions run 1, 2, 3, …, i, …, N−2, N−1, N.
▪ More generally: the random walk is on page i with probability xi.
▪ Example: x = (0.05 0.01 0.0 … 0.2 … 0.01 0.05 0.03), with xi = 0.2.
▪ Σi xi = 1.

139 Change in probability vector
▪ If the probability vector is x = (x1, …, xN) at this step, what is it at the next step?
▪ Recall that row i of the transition probability matrix P tells us where we go next from state i.
▪ So from x, our next state is distributed as xP.

144 Steady state in vector notation
▪ The steady state in vector notation is simply a vector π = (π1, π2, …, πN) of probabilities.
▪ (We use π to distinguish it from the notation for the probability vector x.)
▪ πi is the long-term visit rate (or PageRank) of page i.
▪ So we can think of PageRank as a very long vector, with one entry per page.

151 Steady-state distribution: Example
▪ What is the PageRank / steady state in this example?
▪ Two pages d1, d2 with P11 = 0.25, P21 = 0.25, P12 = 0.75, P22 = 0.75.

        x1 = Pt(d1)   x2 = Pt(d2)
  t0       0.25          0.75
  t1       0.25          0.75      (convergence)

  Pt(d1) = Pt−1(d1) · P11 + Pt−1(d2) · P21
  Pt(d2) = Pt−1(d1) · P12 + Pt−1(d2) · P22

▪ PageRank vector = π = (π1, π2) = (0.25, 0.75)

161 How do we compute the steady-state vector?
▪ In other words: how do we compute PageRank?
▪ Recall: π = (π1, π2, …, πN) is the PageRank vector, the vector of steady-state probabilities …
▪ … and if the distribution in this step is x, then the distribution in the next step is xP.
▪ But π is the steady state!
▪ So: π = πP.
▪ Solving this matrix equation gives us π.
▪ π is the principal left eigenvector of P …
▪ … that is, π is the left eigenvector with the largest eigenvalue.
▪ All transition probability matrices have largest eigenvalue 1.
▪ A sketch of this computation follows below.
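One way to solve π = πP directly, as a sketch assuming NumPy: the left eigenvectors of P are the right eigenvectors of the transpose of P, and we take the one for the largest eigenvalue, normalized to sum to 1.

```python
import numpy as np

def pagerank_eig(P):
    """Principal left eigenvector of P, normalized to a probability vector."""
    vals, vecs = np.linalg.eig(P.T)           # left eigenvectors of P
    pi = vecs[:, np.argmax(vals.real)].real   # eigenvector for the largest eigenvalue
    return pi / pi.sum()                      # rescale so the entries sum to 1
```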

170 One way of computing the PageRank π
▪ Start with any distribution x, e.g., the uniform distribution.
▪ After one step, we're at xP.
▪ After two steps, we're at xP².
▪ After k steps, we're at xP^k.
▪ Algorithm: multiply x by increasing powers of P until convergence.
▪ This is called the power method.
▪ Recall: regardless of where we start, we eventually reach the steady state π.
▪ Thus: we will eventually (in asymptotia) reach the steady state.
▪ A sketch follows below.
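A sketch of the power method under these definitions, assuming NumPy; the 2×2 matrix is the example worked on the next slides.

```python
import numpy as np

def pagerank_power(P, tol=1e-10):
    """Iterate x <- xP from the uniform distribution until convergence."""
    x = np.full(P.shape[0], 1.0 / P.shape[0])
    while True:
        x_next = x @ P                      # one step of the walk
        if np.abs(x_next - x).sum() < tol:  # stop when x barely changes
            return x_next
        x = x_next

P = np.array([[0.1, 0.9],
              [0.3, 0.7]])
print(pagerank_power(P).round(4))  # [0.25 0.75]
```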

172 Power method: Example
▪ What is the PageRank / steady state in this example?

185 Computing PageRank: Power Example
▪ Two pages d1, d2 with P11 = 0.1, P21 = 0.3, P12 = 0.9, P22 = 0.7.

        x1 = Pt(d1)   x2 = Pt(d2)
  t0        0             1          xP   = (0.3, 0.7)
  t1        0.3           0.7        xP²  = (0.24, 0.76)
  t2        0.24          0.76       xP³  = (0.252, 0.748)
  t3        0.252         0.748      xP⁴  = (0.2496, 0.7504)
  …
  t∞        0.25          0.75       xP^∞ = (0.25, 0.75)

  Pt(d1) = Pt−1(d1) · P11 + Pt−1(d2) · P21
  Pt(d2) = Pt−1(d1) · P12 + Pt−1(d2) · P22

▪ PageRank vector = π = (π1, π2) = (0.25, 0.75)

186 Power method: Example
▪ What is the PageRank / steady state in this example?
▪ The steady-state probabilities (= the PageRanks) in this example are 0.25 for d1 and 0.75 for d2.

188 Exercise: Compute PageRank using the power method

200 Solution
▪ Two pages d1, d2 with P11 = 0.7, P21 = 0.2, P12 = 0.3, P22 = 0.8.

        x1 = Pt(d1)   x2 = Pt(d2)
  t0        0             1          xP  = (0.2, 0.8)
  t1        0.2           0.8        xP² = (0.3, 0.7)
  t2        0.3           0.7        xP³ = (0.35, 0.65)
  t3        0.35          0.65       xP⁴ = (0.375, 0.625)
  …
  t∞        0.4           0.6

  Pt(d1) = Pt−1(d1) · P11 + Pt−1(d2) · P21
  Pt(d2) = Pt−1(d1) · P12 + Pt−1(d2) · P22

▪ PageRank vector = π = (π1, π2) = (0.4, 0.6)
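A quick check that π = (0.4, 0.6) is indeed the steady state, i.e., that it solves π = πP:

```latex
\pi P = (0.4 \cdot 0.7 + 0.6 \cdot 0.2,\; 0.4 \cdot 0.3 + 0.6 \cdot 0.8)
      = (0.28 + 0.12,\; 0.12 + 0.48) = (0.4,\; 0.6) = \pi
```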

210 PageRank summary
▪ Preprocessing:
  ▪ Given the graph of links, build the matrix P.
  ▪ Apply teleportation.
  ▪ From the modified matrix, compute π.
  ▪ πi is the PageRank of page i.
▪ Query processing:
  ▪ Retrieve pages satisfying the query.
  ▪ Rank them by their PageRank.
  ▪ Return the reranked list to the user.
▪ An end-to-end sketch follows below.
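A minimal end-to-end sketch combining the steps above, reusing A, add_teleport, and pagerank_power from the earlier sketches; the query hits are hypothetical doc ids.

```python
import numpy as np

out = np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # guard against dead-end rows
pi = pagerank_power(add_teleport(A / out, alpha=0.14))  # PageRank of every page

hits = [2, 3, 6]  # hypothetical doc ids matching some query
print(sorted(hits, key=lambda d: pi[d], reverse=True))  # [6, 3, 2]
```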

220 PageRank issues
- Real surfers are not random surfers.
- Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories – and search!
- → The Markov model is not a good model of surfing.
- But it's good enough as a model for our purposes.
- Simple PageRank ranking (as described on the previous slide) produces bad results for many pages.
- Consider the query [video service].
- The Yahoo home page (i) has a very high PageRank and (ii) contains both "video" and "service".
- If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked.
- Clearly not desirable.

222 PageRank issues
- In practice: rank according to a weighted combination of raw text match, anchor text match, PageRank, and other factors.
- → See the lecture on Learning to Rank.

223 Example web graph [figure: the seven-page web graph d0-d6 whose link structure is given by the adjacency matrix on slide 322]

225 Transition (probability) matrix

      d0    d1    d2    d3    d4    d5    d6
d0   0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1   0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2   0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3   0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4   0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5   0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6   0.00  0.00  0.00  0.33  0.33  0.00  0.33
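As a worked example of one row: d2 links to three distinct pages, d0, d2, and d3 (its duplicate link to d3 is counted only once in the transition matrix), so each of those transitions gets probability 1/3 ≈ 0.33, and all other entries in d2's row are 0.00.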

227 Transition matrix with teleporting

      d0    d1    d2    d3    d4    d5    d6
d0   0.02  0.02  0.88  0.02  0.02  0.02  0.02
d1   0.02  0.45  0.45  0.02  0.02  0.02  0.02
d2   0.31  0.02  0.31  0.31  0.02  0.02  0.02
d3   0.02  0.02  0.02  0.45  0.45  0.02  0.02
d4   0.02  0.02  0.02  0.02  0.02  0.02  0.88
d5   0.02  0.02  0.02  0.02  0.02  0.45  0.45
d6   0.02  0.02  0.02  0.31  0.31  0.02  0.31
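These numbers are consistent with a teleportation rate of alpha = 0.14 (an inference: the pure-teleport cells equal alpha/N = 0.14/7 = 0.02). Each entry of the modified matrix is (1 − alpha)·P_ij + alpha/N; for example, d1's transition to d2 becomes 0.86 · 0.50 + 0.02 = 0.45, and every 0.00 entry becomes 0.02.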

229 Power method vectors xP^k

      x     xP^1  xP^2  xP^3  ...  xP^13
d0   0.14  0.06  0.09  0.07  ...  0.05
d1   0.14  0.08  0.06  0.04  ...  0.04
d2   0.14  0.25  0.18  0.17  ...  0.11
d3   0.14  0.16  0.23  0.24  ...  0.25
d4   0.14  0.12  0.16  0.19  ...  0.21
d5   0.14  0.08  0.06  0.04  ...  0.04
d6   0.14  0.25  0.23  0.25  ...  0.31
(The vector changes little after the first few steps and has converged by xP^13.)

230 Example web graph: PageRank

      PageRank
d0   0.05
d1   0.04
d2   0.11
d3   0.25
d4   0.21
d5   0.04
d6   0.31

237 How important is PageRank?
- Frequent claim: PageRank is the most important component of web ranking.
- The reality:
  - There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes...
  - Rumor has it that PageRank in its original form (as presented here) now has a negligible impact on ranking!
  - However, variants of a page's PageRank are still an essential part of ranking.
  - Addressing link spam is difficult and crucial.

238 Outline ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & Authorities

246 HITS – Hyperlink-Induced Topic Search
- Premise: there are two different types of relevance on the web.
- Relevance type 1: Hubs. A hub page is a good list of links to pages answering the information need.
  - E.g., for the query [chicago bulls]: Bob's list of recommended resources on the Chicago Bulls sports team
- Relevance type 2: Authorities. An authority page is a direct answer to the information need.
  - E.g., the home page of the Chicago Bulls sports team
- By definition: links to authority pages occur repeatedly on hub pages.
- Most approaches to search (including PageRank ranking) don't make a distinction between these two very different types of relevance.

250 Hubs and authorities: Definition
- A good hub page for a topic links to many authority pages for that topic.
- A good authority page for a topic is linked to by many hub pages for that topic.
- Circular definition – we will turn this into an iterative computation.

252 Example for hubs and authorities [figure]

258 How to compute hub and authority scores
- Do a regular web search first
- Call the search result the root set
- Find all pages that are linked to or link to pages in the root set
- Call this larger set the base set
- Finally, compute hubs and authorities for the base set (which we'll view as a small web graph)

262 Root set and base set (1)
[figure, built up in four steps: the root set; the nodes that root set nodes link to; the nodes that link to root set nodes; together these form the base set]

268 Root set and base set (2)
- The root set typically has 200-1000 nodes.
- The base set may have up to 5000 nodes.
- Computation of the base set, as shown on the previous slide (see the sketch below):
  - Follow outlinks by parsing the pages in the root set
  - Find d's inlinks by searching for all pages containing a link to d
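A sketch of this construction in Python. The two helpers are hypothetical stand-ins, not real APIs: in a real system, parse_outlinks would parse crawled pages and search_inlinks would query a link index; they are stubbed with canned data here so the sketch runs.

    def parse_outlinks(url):
        # hypothetical stub: parse the page at `url` and return its outlinks
        return {"d3"} if url == "d1" else set()

    def search_inlinks(url):
        # hypothetical stub: find all pages containing a link to `url`
        return {"d4"} if url == "d2" else set()

    def build_base_set(root_set):
        base = set(root_set)
        for url in root_set:
            base |= parse_outlinks(url)   # pages the root set links to
            base |= search_inlinks(url)   # pages that link into the root set
        return base

    print(sorted(build_base_set({"d1", "d2"})))  # ['d1', 'd2', 'd3', 'd4']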

276 Hub and authority scores
- Compute for each page d in the base set a hub score h(d) and an authority score a(d)
- Initialization: for all d: h(d) = 1, a(d) = 1
- Iteratively update all h(d), a(d)
- After convergence:
  - Output pages with highest h scores as top hubs
  - Output pages with highest a scores as top authorities
  - So we output two ranked lists

280 Iterative update
- For all d: h(d) = Σ_{d→y} a(y)   (sum of the authority scores of the pages d links to)
- For all d: a(d) = Σ_{y→d} h(y)   (sum of the hub scores of the pages linking to d)
- Iterate these two steps until convergence
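A quick sanity check on these rules: starting from the uniform initialization h(d) = a(d) = 1, the first hub update gives h(d) = (number of pages d links to), since every outlink contributes a(y) = 1; the authority update then credits each page with the hub scores of the pages linking to it. Later iterations refine these raw degree counts by weighting each link by the quality of its endpoint.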

286 Details
- Scaling
  - To prevent the a() and h() values from getting too big, we can scale down after each iteration.
  - The scaling factor doesn't really matter.
  - We care about the relative (as opposed to absolute) values of the scores.
- In most cases, the algorithm converges after a few iterations.
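A minimal sketch of the full iteration in Python on the lecture's seven-page example graph. The L1 rescaling (dividing by the sum of scores) and the fixed iteration count are choices of this sketch, made to match the score tables later in the lecture; duplicate links are kept, as in the matrix A shown there.

    # Outlinks with duplicates kept (e.g., d2 links to d3 twice).
    links = {0: [2], 1: [1, 2], 2: [0, 2, 3, 3], 3: [3, 4],
             4: [6], 5: [5, 6], 6: [3, 3, 4, 6]}
    pages = sorted(links)
    h = {d: 1.0 for d in pages}
    a = {d: 1.0 for d in pages}

    for _ in range(50):
        # h(d) = sum of a(y) over d's outlinks
        h = {d: sum(a[y] for y in links[d]) for d in pages}
        # a(d) = sum of h(y) over pages y linking to d (duplicates counted)
        a = {d: sum(h[y] * links[y].count(d) for y in pages) for d in pages}
        # scale down so each score vector sums to 1
        h_sum, a_sum = sum(h.values()), sum(a.values())
        h = {d: v / h_sum for d, v in h.items()}
        a = {d: v / a_sum for d, v in a.items()}

    print({d: round(a[d], 2) for d in pages})  # authorities: d3 highest
    print({d: round(h[d], 2) for d in pages})  # hubs: d6 and d2 highest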

288 Authorities for query [Chicago Bulls]

0.85  www.nba.com/bulls
0.25  www.essex1.com/people/jmiller/bulls.htm             "da Bulls"
0.20  www.nando.net/SportServer/basketball/nba/chi.html   "The Chicago Bulls"
0.15  Users.aol.com/rynocub/bulls.htm                     "The Chicago Bulls Home Page"
0.13  www.geocities.com/Colosseum/6095                    "Chicago Bulls"

(Ben-Shaul et al., WWW8)

290 The authority page for [Chicago Bulls] [figure]

292 Hubs for query [Chicago Bulls]

1.62  www.geocities.com/Colosseum/1778                    "Unbelieveabulls!!!!!"
1.24  www.webring.org/cgi-bin/webring?ring=chbulls        "Chicago Bulls"
0.74  www.geocities.com/Hollywood/Lot/3330/Bulls.html     "Chicago Bulls"
0.52  www.nobull.net/web_position/kw-search-15-M2.html    "Excite Search Results: bulls"
0.52  www.halcyon.com/wordltd/bball/bulls.html            "Chicago Bulls Links"

(Ben-Shaul et al., WWW8)

294 A hub page for [Chicago Bulls] [figure]

301 Hubs & Authorities: Comments
- HITS can pull together good pages regardless of page content.
- Once the base set is assembled, we only do link analysis, no text matching.
- Pages in the base set often do not contain any of the query words.
- In theory, an English query can retrieve Japanese-language pages!
  - If supported by the link structure between English and Japanese pages!
- Danger: topic drift – the pages found by following links may not be related to the original query.

306 Proof of convergence
- We define an N×N adjacency matrix A. (We called this the link matrix earlier.)
- For 1 ≤ i, j ≤ N, the matrix entry A_ij tells us whether there is a link from page i to page j (A_ij = 1) or not (A_ij = 0).
- Example:

      d1  d2  d3
d1    0   1   0
d2    1   1   1
d3    1   0   0

315 Write update rules as matrix operations
- Define the hub vector h = (h_1, ..., h_N) as the vector of hub scores. h_i is the hub score of page d_i.
- Similarly for a, the vector of authority scores.
- Now we can write h(d) = Σ_{d→y} a(y) as a matrix operation: h = Aa ...
- ... and we can write a(d) = Σ_{y→d} h(y) as a = A^T h.
- HITS algorithm in matrix notation:
  - Compute h = Aa
  - Compute a = A^T h
  - Iterate until convergence
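The matrix form is a few lines of NumPy; a sketch using the 3-page example adjacency matrix from the convergence-proof slide above (the rescaling is a choice of this sketch):

    import numpy as np

    A = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [1, 0, 0]], dtype=float)

    h = np.ones(3)
    a = np.ones(3)
    for _ in range(50):
        h = A @ a        # h = Aa
        a = A.T @ h      # a = A^T h
        h /= h.sum()     # rescale; the factor doesn't matter, only the ratios
        a /= a.sum()
    print(np.round(h, 2), np.round(a, 2))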

320 HITS as eigenvector problem
- HITS algorithm in matrix notation. Iterate:
  - Compute h = Aa
  - Compute a = A^T h
- By substitution we get: h = AA^T h and a = A^T Aa
- Thus, h is an eigenvector of AA^T and a is an eigenvector of A^T A.
- So the HITS algorithm is actually a special case of the power method, and hub and authority scores are eigenvector values.
- HITS and PageRank both formalize link analysis as eigenvector problems.
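This characterization can be checked directly: the principal eigenvector of A^T A should match the iterated authority vector up to scaling. A quick sketch, again on the 3-page example matrix:

    import numpy as np

    A = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [1, 0, 0]], dtype=float)

    # Principal eigenvector of A^T A = the authority vector (up to scale).
    vals, vecs = np.linalg.eig(A.T @ A)
    principal = np.abs(vecs[:, np.argmax(vals.real)].real)
    print(principal / principal.sum())  # L1-normalized authority scores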

321 Example web graph [figure: the same seven-page example graph d0-d6 as before]

322 Raw matrix A for HITS

      d0  d1  d2  d3  d4  d5  d6
d0    0   0   1   0   0   0   0
d1    0   1   1   0   0   0   0
d2    1   0   1   2   0   0   0
d3    0   0   0   1   1   0   0
d4    0   0   0   0   0   0   1
d5    0   0   0   0   0   1   1
d6    0   0   0   2   1   0   1

323 Hub vectors h_0; h_i = A·a_i, i ≥ 1

      h_0   h_1   h_2   ...  h_5
d0   0.14  0.06  0.04  ...  0.03
d1   0.14  0.08  0.05  ...  0.04
d2   0.14  0.28  0.32  ...  0.33
d3   0.14  0.14  0.17  ...  0.18
d4   0.14  0.06  0.04  ...  0.04
d5   0.14  0.08  0.05  ...  0.04
d6   0.14  0.30  0.33  ...  0.35

324 Authority vectors a_i = A^T·h_{i-1}, i ≥ 1

      a_1   a_2   a_3   ...  a_7
d0   0.06  0.09  0.10  ...  0.10
d1   0.06  0.03  0.01  ...  0.01
d2   0.19  0.14  0.13  ...  0.12
d3   0.31  0.43  0.46  ...  0.47
d4   0.13  0.14  0.16  ...  0.16
d5   0.06  0.03  0.02  ...  0.01
d6   0.19  0.14  0.13  ...  0.13

325 Example web graph: converged scores

      a     h
d0   0.10  0.03
d1   0.01  0.04
d2   0.12  0.33
d3   0.47  0.18
d4   0.16  0.04
d5   0.01  0.04
d6   0.13  0.35

331 Top-ranked pages
- Pages with highest in-degree: d2, d3, d6
- Pages with highest out-degree: d2, d6
- Pages with highest PageRank: d6
- Pages with highest hub score: d6 (close: d2)
- Pages with highest authority score: d3

339 PageRank vs. HITS: Discussion
- PageRank can be precomputed; HITS has to be computed at query time.
  - HITS is too expensive in most application scenarios.
- PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to.
- These two are orthogonal.
  - We could also apply HITS to the entire web and PageRank to a small base set.
- Claim: On the web, a good hub almost always is also a good authority.
- The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect.

341 Exercise
- Why is a good hub almost always also a good authority?

342 Take-away today
- Anchor text: What exactly are links on the web and why are they important for IR?
- Citation analysis: the mathematical foundation of PageRank and link-based ranking
- PageRank: the original algorithm that was used for link-based ranking on the web
- Hubs & Authorities: an alternative link-based ranking algorithm

343 Resources
- Chapter 21 of IIR
- Resources at http://ifnlp.org/ir
  - American Mathematical Society article on PageRank (popular science style)
  - Jon Kleinberg's home page (the main person behind HITS)
  - A Google bomb and its defusing
  - Google's official description of PageRank: "PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results."

