
1 Algorithms for Large Data Sets. Ziv Bar-Yossef. Lecture 3, April 2, 2006. http://www.ee.technion.ac.il/courses/049011

2 Ranking Algorithms

3 Outline
- The ranking problem
- PageRank
- HITS (Hubs & Authorities)
- Markov Chains and random walks
- PageRank and HITS computation

4 The Ranking Problem
Input:
- D: document collection
- Q: query space
Goal: Find a ranking function rank: D x Q → R such that, for each query q, rank induces a ranking (partial order) ≥_q on D.
Same as the "relevance scoring function" from the previous lecture.

5 Text-based Ranking
Classical ranking functions:
- Keyword-based boolean ranking
- Cosine similarity + TF-IDF scores (see the sketch below)
Limitations in the context of web search:
- The "abundance problem": recall is not important
- Short queries
- Web pages are poor in text
- Synonymy (cars vs. autos)
- Polysemy (java, "Michael Jordan")
- Spam
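
As a concrete reference point, here is a minimal sketch of cosine similarity over TF-IDF vectors. It is not from the lecture: the toy corpus and the exact IDF variant (log of N over document frequency) are assumptions. It also illustrates the synonymy limitation: a query for "cars" gives no credit to a document that only says "autos".

```python
import math
from collections import Counter

# Toy corpus; document names and contents are invented for illustration.
docs = {
    "d1": "cars and autos for sale",
    "d2": "java programming tutorial",
    "d3": "used cars java island travel",
}

def tfidf(text):
    """Map each term to tf * idf, with idf = log(N / document frequency)."""
    tf = Counter(text.split())
    n = len(docs)
    vec = {}
    for term, count in tf.items():
        df = sum(term in d.split() for d in docs.values())
        vec[term] = count * math.log(n / df) if df else 0.0
    return vec

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

query = tfidf("cars")
# Rank documents by cosine similarity to the query vector.
print(sorted(docs, key=lambda d: -cosine(query, tfidf(docs[d]))))
```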

6 Link-based Ranking
Hyperlinks carry important semantics:
- Recommendation
- Critique
- Navigation
Hypertext IR Principle #1: If p → q, then q is "relevant" to p.
Hypertext IR Principle #2: If p → q, then p confers "authority" to q.

7 Static Ranking
Static ranking: rank: D → R, where rank(d) > rank(d') implies d is more "authoritative" than d'.
- Use links to come up with a static ranking of all web pages.
- Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
- Order S by static rank.
Advantage: the static ranking can be computed in a preprocessing step.
Disadvantage: makes no use of Hypertext IR Principle #1.

8 Query-Dependent Ranking
- Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
- Use links within S to come up with a ranking rank: S → R, where rank(d) > rank(d') implies d is more authoritative than d' with respect to q.
Advantage: both Hypertext IR principles are exploited.
Disadvantage: less efficient.

9 The Web as a Graph
V = a set of pages
- In static ranking, V = the entire web
- In query-dependent ranking, V = S
The web graph: G = (V, E), where (p, q) is an edge iff p has a hyperlink to q.
A = adjacency matrix of G

10 Popularity Ranking
rank(p) = in-degree(p)
Advantages:
- The most important pages are surfaced from millions of matching pages
- No need for text-rich documents
- Efficiently computable (see the sketch below)
Disadvantages:
- Bias towards popular pages, irrespective of the query
- Easily spammable
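
A tiny sketch of why this is efficiently computable: one pass over the edge list. The page names and edges are invented for illustration.

```python
from collections import Counter

# Hypothetical edge list of (source, target) hyperlinks.
links = [("p1", "p3"), ("p2", "p3"), ("p3", "p1"), ("p2", "p1")]

# rank(p) = in-degree(p): count each page's incoming links in a single pass.
indegree = Counter(target for _, target in links)
print(indegree.most_common())  # [('p3', 2), ('p1', 2)]; p2 has in-degree 0
```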

11 PageRank [Page, Brin, Motwani, Winograd 1998]
Motivating principles:
- The rank of p should be proportional to the rank of the pages that point to p.
  (Compare: recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva.)
- The rank of p should depend on the number of pages "co-cited" with p.
  (Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth.)

12 PageRank, Attempt #1
B = normalized adjacency matrix: B_{p,q} = A_{p,q} / out-degree(p).
Define rank by r^T = r^T B, i.e., r(q) = the sum of r(p) / out-degree(p) over all pages p with p → q.
Then: r is a left eigenvector of B with eigenvalue 1.
Q1: Is a solution guaranteed to exist?
Q2: If a solution exists, how to find it?
Q3: If more than one solution exists, which one to pick?

13 PageRank, Attempt #1
A solution exists if and only if B has eigenvalue 1.
Problem: a solution may not exist.
- Example: a graph with a "sink" (a page with no outlinks), as on the next slide.
Note: If r is a solution, then r^T = r^T B^k for any k ≥ 1.
- Rank is at "steady state".

14 PageRank, Attempt #1: Example
Graph: a → b, a → c, b → c; c is a sink. Starting from the rank vector (a, b, c), repeated multiplication by B drains all rank mass:

Step I:   (a, b, c)
Step II:  (0, a/2, a/2 + b)
Step III: (0, 0, a/2)
Step IV:  (0, 0, 0)

15 PageRank, Attempt #1
If r has positive "rank mass" at nodes from which sinks are reachable, the mass "drains" at the sinks.
Therefore, only nodes from which sinks cannot be reached can have nonzero rank mass.
- E.g., for DAGs, no non-trivial solution exists.
Conclusion: this ranking measure cannot be used for arbitrary graphs.

16 PageRank, Attempt #2
Define r^T = λ · r^T B, where λ = normalization constant (chosen so that ||r||_1 = 1).
Rank mass is preserved and no longer drains at sinks.
r is a left eigenvector of B with eigenvalue 1/λ.

17 PageRank, Attempt #2
Any nonzero eigenvalue μ of B gives a solution:
- λ = 1/μ
- r = any left eigenvector of B with eigenvalue μ
Which solution to pick?
- Pick a "principal eigenvector" (i.e., one corresponding to an eigenvalue μ of maximal magnitude).
- r is normalized to unit L1 norm (i.e., ||r||_1 = 1).
How to find a solution?
- Power iterations (sketched below).
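
A small numeric sketch of these power iterations, assuming the 3-node graph of the example slides (a → b, a → c, b → c, with c a sink); the iteration count is arbitrary:

```python
import numpy as np

# B = normalized adjacency matrix of the example graph:
# a -> b, a -> c, b -> c; c is a sink (all-zero row).
B = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])

r = np.array([0.5, 0.3, 0.2])  # initial rank vector
for _ in range(5):
    r = r @ B                  # one step: r^T <- r^T B
    s = r.sum()
    if s == 0:                 # without renormalization, all mass would drain
        break
    r /= s                     # lambda = 1/s restores ||r||_1 = 1
print(r)                       # (0, 0, 1): all mass ends up at the sink
```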

18 PageRank, Attempt #2
Problem #1: The maximal eigenvalue may have multiplicity > 1.
- Several possible solutions.
- Happens, for example, when the graph is disconnected.
Problem #2: Rank accumulates at sinks.
- Only sinks, or nodes from which a sink cannot be reached, can have nonzero rank mass.

19 PageRank, Attempt #2: Example
On the same graph as before (a → b, a → c, b → c; c a sink), the normalized iterations converge to a vector with all mass at the sink:

Step I:   (0.5, 0.3, 0.2)
Step II:  (0, 0.25/0.8 = 0.31, 0.55/0.8 = 0.69)
Step III: (0, 0, 1)

20 PageRank, Final Definition
Define r^T = λ (r^T B + e^T), where:
- e = "rank source" vector; standard setting: e(p) = ε/n for all p (ε < 1)
- λ = normalization constant; r is normalized to unit L1 norm
- 1 = the all-1's vector
Then:
- r is a left eigenvector of (B + 1e^T) with eigenvalue 1/λ
- r has unit norm

21 PageRank, Final Definition
- Any nonzero eigenvalue of (B + 1e^T) gives a solution.
- Pick r to be a normalized principal eigenvector of (B + 1e^T).
- Will show: the principal eigenvalue has multiplicity 1, for any graph.
- Hence, PageRank always exists and is uniquely defined.
- Due to the rank source vector, rank no longer accumulates at sinks.

22 An Alternative View of PageRank: The Random Surfer Model
When visiting a page p, a "random surfer":
- With probability 1 - d, selects a random outlink p → q and goes to visit q ("focused browsing").
- With probability d, jumps to a random web page q ("loss of interest").
- If p has no outlinks, assume it has a self-loop.
P = probability transition matrix: P_{p,q} = d/n + (1 - d) · B_{p,q}.
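
A hedged sketch of constructing P from an adjacency matrix under the slide's conventions (self-loop at sinks, uniform jump with probability d); the function name and the default value of d are assumptions:

```python
import numpy as np

def surfer_matrix(A, d=0.15):
    """P[p,q] = d/n + (1-d) * B[p,q], where B row-normalizes A."""
    A = A.astype(float).copy()
    n = A.shape[0]
    sinks = A.sum(axis=1) == 0
    A[sinks] = np.eye(n)[sinks]           # pages with no outlinks get a self-loop
    B = A / A.sum(axis=1, keepdims=True)  # B[p,q] = A[p,q] / out-degree(p)
    return d / n + (1 - d) * B

P = surfer_matrix(np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]]))
print(P.sum(axis=1))  # every row sums to 1, as a transition matrix must
```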

23 PageRank & Random Surfer Model
Suppose: e(p) = d/n for all p, i.e., e^T = (d/n) · 1^T.
Then: P = (1 - d) · B + 1e^T.
Therefore, r is a principal left eigenvector of (B + 1e^T) if and only if it is a principal left eigenvector of P.

24 Markov Chain Primer
V = state space
P = probability transition matrix:
- Non-negative.
- The sum of each row is 1.
q_0 = initial distribution on V
q_t = q_0 P^t = distribution on V after t steps
P is ergodic if it is:
- Irreducible (the underlying graph is strongly connected)
- Aperiodic (for all states u, v, the gcd of the lengths of the paths from u to v is 1)
Theorem: If P is ergodic, then it has a "stationary distribution" π. Furthermore, q_t → π as t → ∞, for every initial distribution q_0.
- πP = π: π is a principal left eigenvector of P with eigenvalue 1.
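
A quick illustration of the theorem on a tiny ergodic chain (the matrix is made up): iterating q_t = q_0 P^t approaches the same π regardless of the start q_0.

```python
import numpy as np

P = np.array([[0.1, 0.9],    # toy ergodic chain: strongly connected, aperiodic
              [0.5, 0.5]])

for q0 in (np.array([1.0, 0.0]), np.array([0.2, 0.8])):
    q = q0
    for _ in range(100):
        q = q @ P                 # q_t = q_0 P^t
    print(q)                      # both starts converge to pi ~ (0.357, 0.643)
    print(np.allclose(q @ P, q))  # pi P = pi: True
```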

25 PageRank & Markov Chains
- The PageRank vector is the normalized principal left eigenvector of (B + 1e^T).
- Hence, the PageRank vector is also a principal left eigenvector of P.
- Conclusion: PageRank is the unique stationary distribution of the random surfer Markov Chain.
- PageRank(p) = r(p) = the probability that the random surfer visits page p, in the limit.
- Note: the "random jump" guarantees that the Markov Chain is ergodic.

26 PageRank Computation
Compute r by power iterations: start from a uniform vector and repeatedly multiply by the random surfer matrix P (a sketch follows).
In practice: about 50 iterations suffice.
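
The slide's own pseudocode is not preserved in this transcript; below is a minimal self-contained sketch of the standard power-iteration computation. The parameter names, the value of d, and the toy graph are assumptions.

```python
import numpy as np

def pagerank(A, d=0.15, iters=50):
    """Power iterations on the random-surfer matrix P; ~50 iterations suffice in practice."""
    A = A.astype(float).copy()
    n = A.shape[0]
    sinks = A.sum(axis=1) == 0
    A[sinks] = np.eye(n)[sinks]                      # self-loop at sinks
    P = d / n + (1 - d) * A / A.sum(axis=1, keepdims=True)
    r = np.full(n, 1.0 / n)                          # uniform initial distribution
    for _ in range(iters):
        r = r @ P                                    # one random-surfer step
    return r                                         # ~ stationary distribution

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])                            # a -> b, a -> c, b -> c
print(pagerank(A))                                   # c gets the largest rank
```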

27 HITS: Hubs and Authorities [Kleinberg, 1997]
HITS: Hyperlink Induced Topic Search
Main principle: every page p is associated with two scores:
- Authority score: how "authoritative" the page is about the query's topic.
  Ex: query "IR"; authorities: scientific IR papers.
  Ex: query "automobile manufacturers"; authorities: the Mazda, Toyota, and GM web sites.
- Hub score: how good the page is as a "resource list" about the query's topic.
  Ex: query "IR"; hubs: surveys and books about IR.
  Ex: query "automobile manufacturers"; hubs: KBB, car link lists.

28 Mutual Reinforcement
HITS principles:
- p is a good authority if it is linked to by many good hubs.
- p is a good hub if it points to many good authorities.

29 HITS: Algebraic Form
a = authority vector, h = hub vector, A = adjacency matrix.
Then: a = A^T h and h = A a (up to normalization).
Therefore:
- a is a principal eigenvector of A^T A
- h is a principal eigenvector of AA^T
- Need to normalize, as in PageRank.

30 Co-Citation and Bibliographic Coupling
A^T A = co-citation matrix:
- (A^T A)_{p,q} = # of pages that link to both p and q.
- Thus: authority scores propagate through co-citation.
AA^T = bibliographic coupling matrix:
- (AA^T)_{p,q} = # of pages that both p and q link to.
- Thus: hub scores propagate through bibliographic coupling.
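
A quick numeric check of these two identities on a toy adjacency matrix (made up for illustration):

```python
import numpy as np

A = np.array([[0, 1, 1],   # page 0 links to pages 1 and 2
              [0, 0, 1],   # page 1 links to page 2
              [1, 0, 1]])  # page 2 links to pages 0 and 2

print(A.T @ A)  # co-citation: entry (p,q) = # pages linking to both p and q
print(A @ A.T)  # bibliographic coupling: entry (p,q) = # pages both p and q link to
```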

31 HITS Computation
Compute a and h by iterating the mutual-reinforcement equations, normalizing after each round (a sketch follows).
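
A hedged sketch of this iteration: a = A^T h, h = A a, renormalized each round (equivalently, power iterations on A^T A and AA^T as in the algebraic form above). The toy graph and the choice of L1 normalization are assumptions.

```python
import numpy as np

def hits(A, iters=50):
    """Iterate a = A^T h, h = A a, renormalizing each round."""
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h               # authority: total hub score of in-linking pages
        h = A @ a                 # hub: total authority score of linked-to pages
        a /= a.sum()              # normalize (L1 here; the norm choice is ours)
        h /= h.sum()
    return a, h

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])         # a -> b, a -> c, b -> c
a, h = hits(A)
print(a)                          # c is the strongest authority
print(h)                          # a is the strongest hub
```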

32 Principal Eigenvector Computation
E: an n × n matrix
λ_1, …, λ_n: eigenvalues of E, with |λ_1| > |λ_2| ≥ |λ_3| ≥ … ≥ |λ_n|
v_1, …, v_n: corresponding eigenvectors
Assume the eigenvectors are linearly independent.
Input:
- The matrix E
- The principal eigenvalue λ_1
- A unit vector u that is not orthogonal to v_1
Goal: compute v_1

33 The Power Method
w_0 = u; w_{t+1} = E · w_t (so w_t = E^t u); output w_t / λ_1^t.
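
A sketch consistent with the setup on the previous slide; instead of dividing by λ_1^t it rescales each iterate so it stays bounded, which yields the same direction. The example matrix is an assumption.

```python
import numpy as np

def power_method(E, u, iters=100):
    """Approximate v_1's direction; u must not be orthogonal to v_1."""
    w = u / np.linalg.norm(u)
    for _ in range(iters):
        w = E @ w                # w_t = E^t u, up to rescaling
        w /= np.linalg.norm(w)   # rescale in place of dividing by lambda_1^t
    return w                     # converges to +/- v_1

E = np.array([[2.0, 1.0],
              [1.0, 2.0]])       # eigenvalues 3 and 1; v_1 ~ (1, 1)/sqrt(2)
print(power_method(E, np.array([1.0, 0.0])))
```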

34 Why Does It Work?
Theorem: As t → ∞, w_t / λ_1^t → c · v_1 (c is a constant).
Why: writing u = Σ_i c_i v_i gives E^t u = Σ_i c_i λ_i^t v_i = λ_1^t (c_1 v_1 + Σ_{i≥2} c_i (λ_i/λ_1)^t v_i), and every term after the first decays geometrically.
Convergence rate: proportional to (|λ_2| / |λ_1|)^t.
The larger the "spectral gap" |λ_1| - |λ_2|, the faster the convergence.

35 End of Lecture 3

