Presentation is loading. Please wait.

Presentation is loading. Please wait.

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Similar presentations


Presentation on theme: "Prof. Paolo Ferragina, Algoritmi per "Information Retrieval""— Presentation transcript:

1 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Ranking Link-based Ranking (2° generation) Reading 21

2 The Web as a Directed Graph
Sec. 21.1 The Web as a Directed Graph Page A hyperlink Page B Anchor Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal) Assumption 2: The text in the anchor of the hyperlink describes the target page (textual context)

3 Sec Indexing anchor text When indexing a document D, include anchor text from links pointing to D. Armonk, NY-based computer giant IBM announced today Big Blue today announced record profits for the quarter Joe’s computer hardware links Sun HP IBM Can score anchor text with weight depending on the authority of the anchor page’s website

4 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Random Walks Paolo Ferragina Dipartimento di Informatica Università di Pisa

5 Definitions Adjacency matrix A Transition matrix P 1 2 3 1 2 3
1/2 1 2 3 Any edge weigthing is possible: Any proposals?

6 What is a random walk t=0 1 1/2

7 What is a random walk 1 1/2 t=1 t=0 1 1/2

8 What is a random walk 1 1/2 t=0 1 1/2 t=1 t=2 1 1/2

9 What is a random walk 1 1/2 t=1 t=0 1 1/2 t=2 1 1/2 t=3 1 1/2

10 Probability Distributions
xt(i) = probability that surfer is at node i at time t xt+1(i) = ∑j (Probability of being at node j)*Pr(j->i) = ∑j xt(j)*P(j,i) = xt * P t=2 1 2 3 1/2 Transition matrix P 1/2 xt+1 xt =

11 Probability Distributions
xt(i) = probability that surfer is at node i at time t xt+1(i) = ∑j (Probability of being at node j)*Pr(j->i) = ∑j xt(j)*P(j,i) = xt P xt+1 = xt P = xt-1* P * P = xt-2* P * P * P = …= x0 Pt+1 What happens when the surfer keeps walking for a long time? So called Stationary distribution

12 Stationary Distribution
The stationary distribution at a node is related to the amount/proportion of time a random walker spends visiting that node. It is when the distribution does not change anymore: i.e. xT+1 = xT  xT P = 1* xT (left eigenvector of eigenvalue 1) For “well-behaved” graphs this does not depend on the start distribution: xt+1 = x0 P t+1

13 Interesting questions
Does a stationary distribution always exist? Is it unique? Yes, if the graph is “well-behaved”, namely the markov chain is irreducible and aperiodic. How fast will the random surfer approach this stationary distribution? Mixing Time!

14 Well behaved graphs Irreducible: There is a path from every node to every other node ( it is an SCC). Irreducible Not irreducible

15 Well behaved graphs Aperiodic: The GCD of all cycle lengths is 1. The GCD is also called period. Aperiodic Periodicity is 3

16 About undirected graphs
A connected undirected graph is irreducible A connected non-bipartite undirected graph has a stationary distribution proportional to the degree distribution! Makes sense, since larger the degree of the node more likely a random walk is to come back to it.

17 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
PageRanks and HITS Paolo Ferragina Dipartimento di Informatica Università di Pisa

18 Query-independent ordering
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Query-independent ordering First generation: using link counts as simple measures of popularity. Undirected popularity: Each page gets a score given by the number of in-links plus the number of out-links (es. 3+2=5). Directed popularity: Score of a page = number of its in-links (es. 3). Easy to SPAM

19 Second generation: PageRank
Deploy the graph structure, and each link has its own importance ! PageRank is independent of the query many interpretations: Linear algebra – eigenvectors, eigenvalues Markov chains – steady state probability distribution Social interpretation – a sort of voting scheme Paolo Ferragina, Web Algorithmics

20 Basic Intuition… d 1-d Random jump to any node Random jump
to neighbors d Paolo Ferragina, Web Algorithmics

21 Google’s Pagerank Random jump B(i) : set of pages linking to i.
Principal eigenvector B(i) : set of pages linking to i. #out(j) : number of outgoing links from j. e : vector of components 1/sqrt{N}. Paolo Ferragina, Web Algorithmics

22 Pagerank: use in Search Engines
Preprocessing: Given graph of links, build matrix P Compute its principal eigenvector r r[i] is the pagerank of page i We are interested in the relative order At query time: Retrieve pages containing query terms Rank them by their Pagerank The final order is query-independent Paolo Ferragina, Web Algorithmics

23 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
How to compute it Fast (approximate) computation: Given the Web graph, build the matrix P Compute r = e * Pt for t = 0, 1, … r[i] is the pagerank of page i

24 (Personalized) Pagerank
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" (Personalized) Pagerank Bias the random jump: Substitute e = [1 1 ….1] with a preference vector which jumps to preferred pages (topics) If e = ei , then r[j] = relatedness between node j and node i

25 CoSim Rank to «compare» nodes
Personalized PageRank, p(0)(i) varies with node i The adjacency matrix A has normalized rows >>> Set d=1, so no teleport step You can normalize it multiplying by (1-c)

26 HITS: Hypertext Induced Topic Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" HITS: Hypertext Induced Topic Search

27 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Calculating HITS It is query-dependent Produces two scores per page: Authority score: a good authority page for a topic is pointed to by many good hubs for that topic. Hub score: A good hub page for a topic points to many authoritative pages for that topic.

28 Authority and Hub scores
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Authority and Hub scores 2 5 3 1 1 6 4 7 a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)

29 HITS: Link Analysis Computation
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" HITS: Link Analysis Computation Where a: Vector of Authority’s scores h: Vector of Hub’s scores. A: Adjacency matrix in which ai,j = 1 if ij Symmetric matrices Thus, h is an eigenvector of AAt a is an eigenvector of AtA

30 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Weighting links Weight more if the query occurs in the neighborhood of the link (e.g. anchor text).


Download ppt "Prof. Paolo Ferragina, Algoritmi per "Information Retrieval""

Similar presentations


Ads by Google