CS246 Link-Based Ranking


1 CS246 Link-Based Ranking

2 Problems of TFIDF Vector  Works well on a small controlled corpus, but not on the Web  Top result for the “American Airlines” query: accident report of American Airlines flights  Do users really care how many times “American Airlines” is mentioned?  Easy to spam  Ranking is purely based on page content  Authors can manipulate page content to get a high ranking  Any idea?

3 Link-based Ranking  People “expect” to get AA home page for the query “American Airlines”  Many pages point to AA home page, but not to accident report  Use link-count!

4 Simple Link Count  Still easy to spam  Create many pages and add links from them to the target page  How to avoid spam?

5 PageRank  A page is important if it is pointed to by many important pages  $PR(p) = PR(p_1)/n_1 + \dots + PR(p_k)/n_k$, where $p_i$ is a page pointing to $p$ and $n_i$ is the number of links in $p_i$  The PageRank of $p$ is the sum of the PageRank shares it receives from its parents  One equation for every page  $N$ equations, $N$ unknown variables

6 Example: Web of 1842  Pages: Netscape (n), Microsoft (m) and Amazon (a)  $PR(n) = PR(n)/2 + PR(a)/2$  $PR(m) = PR(a)/2$  $PR(a) = PR(n)/2 + PR(m)$
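The three equations above are homogeneous, so they fix the PageRanks only up to a common scale. As a worked sketch, assume the total importance is 3 (one unit per page; this normalization is an assumption added here, not stated on the slide):

```latex
\begin{aligned}
PR(m) &= \tfrac{1}{2}\,PR(a) \\
PR(a) &= \tfrac{1}{2}\,PR(n) + PR(m) = \tfrac{1}{2}\,PR(n) + \tfrac{1}{2}\,PR(a)
        \;\Rightarrow\; PR(a) = PR(n) \\
PR(n) + PR(m) + PR(a) &= 3 \;\Rightarrow\; \tfrac{5}{2}\,PR(a) = 3 \\
PR(n) &= PR(a) = \tfrac{6}{5}, \qquad PR(m) = \tfrac{3}{5}
\end{aligned}
```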

7 PageRank: Matrix Notation  Web graph matrix $M = \{m_{ij}\}$  Each page $i$ corresponds to row $i$ and column $i$ of the matrix $M$  $m_{ij} = 1/n$ if page $i$ is one of the $n$ children of page $j$; $m_{ij} = 0$ otherwise  PageRank vector: $\vec{r} = [PR(p_1), \dots, PR(p_N)]^T$  PageRank equation: $\vec{r} = M\vec{r}$
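A minimal sketch of how the matrix M could be built from a link table. The helper name `build_matrix` and the dictionary layout are illustrative assumptions; the three-page graph is the Netscape/Microsoft/Amazon example from slide 6.

```python
def build_matrix(links, pages):
    """Return M as a list of rows, with M[i][j] = 1/n when page j
    links to page i and page j has n outgoing links (else 0)."""
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    M = [[0.0] * size for _ in range(size)]
    for parent, children in links.items():
        for child in children:
            M[index[child]][index[parent]] = 1.0 / len(children)
    return M


# Link table read off the equations on slide 6 (order: n, m, a).
pages = ["n", "m", "a"]
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
M = build_matrix(links, pages)
```

Each column of M sums to 1 here because every page has at least one outgoing link.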

8 PageRank: Iterative Computation  Initially every page has a unit of importance  At each round, each page shares its importance among its children and receives new importance from its parents  Eventually the importance of each page reaches a limit  Stochastic matrix

9 Example: Web of 1842  (figure: link graph of Netscape, Amazon and Microsoft)
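The iterative computation can be sketched on the three-page Web as follows. The matrix is written out by hand from the slide 6 equations (order: n, m, a); the function name `iterate` and the round count are assumptions for illustration.

```python
# Column j holds the shares that page j sends to its children.
M = [[0.5, 0.0, 0.5],   # Netscape receives from Netscape and Amazon
     [0.0, 0.0, 0.5],   # Microsoft receives from Amazon
     [0.5, 1.0, 0.0]]   # Amazon receives from Netscape and Microsoft


def iterate(M, rounds=100):
    """Start with one unit of importance per page and repeatedly
    redistribute it along the links."""
    size = len(M)
    rank = [1.0] * size
    for _ in range(rounds):
        rank = [sum(M[i][j] * rank[j] for j in range(size))
                for i in range(size)]
    return rank


rank = iterate(M)  # approaches [6/5, 3/5, 6/5]
```

The total importance stays at 3 in every round because each column of M sums to 1.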

10 PageRank: Eigenvector  The PageRank equation is an eigenvector equation  The PageRank vector is the principal eigenvector of $M$ (eigenvalue 1)

11 PageRank: Random Surfer Model  PageRank is the probability that a Web surfer reaches a page after many clicks, following random links
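The random surfer model can be checked with a small simulation. This is a sketch under assumptions: the graph is the three-page example from slide 6, and the function name `surf`, the click count and the seed are illustrative choices.

```python
import random

links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}


def surf(links, clicks, seed=0):
    """Follow random links and return the fraction of clicks
    that landed on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in links}
    page = rng.choice(sorted(links))
    for _ in range(clicks):
        page = rng.choice(links[page])
        visits[page] += 1
    return {p: count / clicks for p, count in visits.items()}


freq = surf(links, clicks=200_000)
```

The visit frequencies should come out close to 0.4, 0.2 and 0.4 for n, m and a, which are the PageRanks of the example scaled to sum to 1.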

12 Problems on the Real Web  Dead end  A page with no outgoing links through which to pass on its importance  All importance “leaks out of” the Web  Crawler trap  A group of one or more pages that have no links out of the group  Accumulates all the importance of the Web

13 Example: Dead End  No link from Microsoft, so Microsoft is a dead end  (figure: link graph of Netscape, Amazon and Microsoft)

14 Example: Dead End  (figure: link graph of Netscape, Amazon and Microsoft)

15 Solution to Dead End  Assume the surfer jumps to a random page at a dead end  (figure: link graph of Netscape, Amazon and Microsoft)
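A sketch of the dead-end leak and of the random-jump repair, assuming the example graph with Microsoft's outgoing link removed (order: n, m, a). The helper names `iterate` and `patch_dead_ends` are illustrative.

```python
def iterate(M, rounds=100):
    """Plain iterative computation, one unit of importance per page."""
    size = len(M)
    rank = [1.0] * size
    for _ in range(rounds):
        rank = [sum(M[i][j] * rank[j] for j in range(size))
                for i in range(size)]
    return rank


def patch_dead_ends(M):
    """Replace every all-zero column (a dead end) with a uniform
    column, i.e. the surfer jumps to a random page from there."""
    size = len(M)
    fixed = [row[:] for row in M]
    for j in range(size):
        if all(M[i][j] == 0 for i in range(size)):
            for i in range(size):
                fixed[i][j] = 1.0 / size
    return fixed


# Microsoft (middle column) has no outgoing links.
dead_end = [[0.5, 0.0, 0.5],
            [0.0, 0.0, 0.5],
            [0.5, 0.0, 0.0]]

leaked = iterate(dead_end)                     # every entry shrinks toward 0
repaired = iterate(patch_dead_ends(dead_end))  # approaches [18/13, 9/13, 12/13]
```

Without the patch the total importance decays to zero; with it the total stays at 3.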

16 Example: Crawler Trap  Microsoft has only a self-link, so it is a crawler trap  (figure: link graph of Netscape, Amazon and Microsoft)

17 Example: Crawler Trap  (figure: link graph of Netscape, Amazon and Microsoft)

18 Crawler Trap: Damping Factor  “Tax” each page some fraction of its importance and distribute it equally  Probability to jump to a random page  Assuming 20% tax
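A sketch of the 20% tax applied to the crawler-trap example, assuming Microsoft links only to itself (order: n, m, a). The function name `pagerank` and the `tax` parameter are illustrative names; with three units of total importance, a 20% tax redistributes 0.2 units to every page per round.

```python
def pagerank(M, tax=0.2, rounds=100):
    """Each round, a page keeps (1 - tax) of the importance it
    receives over links and gets an equal share of the taxed amount."""
    size = len(M)
    rank = [1.0] * size
    for _ in range(rounds):
        rank = [(1 - tax) * sum(M[i][j] * rank[j] for j in range(size)) + tax
                for i in range(size)]
    return rank


# Microsoft (middle column) links only to itself.
trap = [[0.5, 0.0, 0.5],
        [0.0, 1.0, 0.5],
        [0.5, 0.0, 0.0]]

untaxed = pagerank(trap, tax=0.0)  # approaches [0, 3, 0]
taxed = pagerank(trap, tax=0.2)    # approaches [7/11, 21/11, 5/11]
```

With the tax, Microsoft still ranks highest but no longer absorbs all the importance.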

19 Link Spam Problem  Q: What if a spammer creates many pages that all link to a single spam page?  PageRank is better than a simple link count, but still vulnerable to link spam  Q: Any way to avoid link spam?
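A small sketch of why PageRank remains vulnerable. It assumes the 20% tax from the previous slide and a hypothetical spam farm: k pages named `farm0`, `farm1`, ... that each link only to a page named `spam`, which in turn links to Netscape. All page names and helper names are illustrative.

```python
def pagerank(links, tax=0.2, rounds=100):
    """PageRank with a tax, computed on a link table
    (one unit of importance per page on average)."""
    rank = {page: 1.0 for page in links}
    for _ in range(rounds):
        new = {page: tax for page in links}
        for parent, children in links.items():
            share = (1 - tax) * rank[parent] / len(children)
            for child in children:
                new[child] += share
        rank = new
    return rank


def web_with_farm(k):
    """The three-page Web plus a spam page fed by k farm pages."""
    links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"], "spam": ["n"]}
    for i in range(k):
        links[f"farm{i}"] = ["spam"]
    return links


small = pagerank(web_with_farm(10))["spam"]
large = pagerank(web_with_farm(100))["spam"]
```

Each farm page holds only its 0.2 share from the tax, so the spam page ends up with 0.16k + 0.2: the boost per farm page is smaller than with a raw link count, but it still grows linearly with the size of the farm.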

20 TrustRank [Gyongyi et al. 2004]  Good pages don’t point to spam pages  Trust a page only if it is linked to by pages you trust  Same as PageRank except for the random-jump term: the surfer jumps only to trusted pages
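A minimal sketch of the idea, not the algorithm as published: PageRank and TrustRank share one routine here and differ only in the set of pages that receive the random jump. The function name `biased_pagerank`, the choice of Netscape as the only trusted page, and the spam pages are assumptions for illustration.

```python
def biased_pagerank(links, jump_to, tax=0.2, rounds=100):
    """PageRank in which the taxed importance is shared equally
    among the pages in jump_to only. Scores sum to 1."""
    rank = {p: (1.0 / len(jump_to) if p in jump_to else 0.0) for p in links}
    for _ in range(rounds):
        new = {p: (tax / len(jump_to) if p in jump_to else 0.0) for p in links}
        for parent, children in links.items():
            share = (1 - tax) * rank[parent] / len(children)
            for child in children:
                new[child] += share
        rank = new
    return rank


links = {
    "n": ["n", "a"], "m": ["a"], "a": ["n", "m"],               # honest pages
    "spam": ["farm1"], "farm1": ["spam"], "farm2": ["spam"],    # spam farm
}

plain = biased_pagerank(links, jump_to=set(links))  # ordinary PageRank
trust = biased_pagerank(links, jump_to={"n"})       # jump only to trusted page
```

Because no honest page links into the farm, the spam pages receive a positive score under ordinary PageRank but exactly zero when the jump is restricted to the trusted page.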

21 TrustRank: Theory [Bianchini et al. 2005]  Consider a set of pages S  (figure: S together with the sets IN(S), OUT(S) and DP(S))

22 TrustRank: Theory [Bianchini et al. 2005]

23 What Does It Mean?  $P_S = 0$ if $B_S = 0$ and $P_{IN} = 0$  You cannot improve your TrustRank simply by creating more pages and linking among them  To get a non-zero TrustRank, you need to be either trusted or linked to from outside

24 Is TrustRank the Ultimate Solution?  Not really…  Honeypot: A page with good content and hidden links to spam pages  Good users link to the honeypot because of its quality content  Blogs, forums, wikis, mailing lists  Easy to add spam links  Link exchange  Set of sites exchanging links to boost ranking  A never-ending rat race…

25 Anti-Spamming at Search Engines  Anchor text  Consider what others think about your page  Give higher weights to anchors from high-PageRank pages  More difficult to spam  TrustRank  To gain importance, you need to convince many pages under others’ control or convince search engines  More difficult to spam  Give inter-site links a higher weight

26 Hub and Authority  More detailed evaluation of importance  A page is useful if  It has good content or  It has links to useful pages (good bookmark)  Hub/Authority  Authority: pages with good content  Hub: pages pointing to good content pages

27 Hub/Authority: Definition  Recursive definition similar to PageRank  Authority pages are linked to by many hub pages  Hub pages link to many authority pages  $H(p) = A(p_1) + \dots + A(p_k)$, summed over the $k$ pages that $p$ links to  $A(p) = H(p_1) + \dots + H(p_m)$, summed over the $m$ pages that link to $p$

28 Hub/Authority: Matrix Notation  Web graph matrix $A = \{a_{ij}\}$  Each page $i$ corresponds to row $i$ and column $i$ of the matrix $A$  $a_{ij} = 1$ if page $i$ points to page $j$; $a_{ij} = 0$ otherwise  $A$ is not a stochastic matrix  $A^T$: similar to the PageRank matrix $M$, without the stochastic restriction

29 Example: Web of 1842  (figure: link graph of Netscape, Amazon and Microsoft)  Vector order: [n, m, a]

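30 Hub/Authority: Iterative Computation  Hub/Authority vectors: $\vec{h}$, $\vec{a}$  $\vec{h} = \lambda A \vec{a}$, where $\lambda$ is a scaling factor that prevents divergence  $\vec{a} = \mu A^T \vec{h}$, where $\mu$ is a scaling factor that prevents divergence  Compute $\vec{h}$ and $\vec{a}$ iteratively with scaling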
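The iterative Hub/Authority computation can be sketched as follows. Assumptions: the graph is the three-page example from slide 6 (the graph on slide 29 is not recoverable from the transcript), scaling is done by dividing by the largest score in each round, and the function name `hits` is illustrative.

```python
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}


def hits(links, rounds=50):
    """Alternate the two updates, rescaling after each so that
    the largest score is 1 and the values do not diverge."""
    hub = {page: 1.0 for page in links}
    auth = {page: 1.0 for page in links}
    for _ in range(rounds):
        # A(p): sum of the hub scores of the pages linking to p.
        auth = {page: 0.0 for page in links}
        for parent, children in links.items():
            for child in children:
                auth[child] += hub[parent]
        top = max(auth.values())
        auth = {page: score / top for page, score in auth.items()}

        # H(p): sum of the authority scores of the pages p links to.
        hub = {page: sum(auth[child] for child in links[page])
               for page in links}
        top = max(hub.values())
        hub = {page: score / top for page, score in hub.items()}
    return hub, auth


hub, auth = hits(links)
```

Dividing by the maximum is one of several possible scalings; normalizing the vector to unit length works equally well and yields the same ranking.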

31 Hub/Authority: Eigenvector  Hub vector: eigenvector of $AA^T$  Authority vector: eigenvector of $A^TA$

32 Example: Web of 1842  (figure: link graph of Netscape, Amazon and Microsoft)

33 Hub/Authority and Root Set  Apply the equations on a small neighborhood graph (base set)  Start with, say, 100 pages on “bicycling” (root set)  Add pages pointing to the 100 pages  Add pages that the 100 pages are pointing to  Pages identified this way are good “Hubs” and “Authorities” on “bicycling”
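Expanding a root set into a base set, as described above, might look like this. The function name `base_set` and the toy page names are hypothetical; a real system would query a link index rather than scan a dictionary.

```python
def base_set(links, root):
    """Return the root pages plus every page that points to a root
    page and every page that a root page points to."""
    root = set(root)
    base = set(root)
    for page, children in links.items():
        if page in root:
            base.update(children)   # pages the root pages point to
        if any(child in root for child in children):
            base.add(page)          # pages pointing to a root page
    return base


# Toy link table with made-up page names.
links = {
    "root1": ["auth1"],
    "hub1": ["root1"],
    "auth1": [],
    "other": ["elsewhere"],
    "elsewhere": [],
}
base = base_set(links, {"root1"})
```

The Hub/Authority equations are then applied only to the pages in `base` and the links among them.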

34 Hub/Authority and Web Community  Hub/Authority is often used to identify Web communities  Nice notion of “Hub” and “Authority” of the community  Often Hub and Authority are tightly linked to each other

35 Any Questions?

36 Questions  Can we apply Hub/Authority to the entire Web like PageRank?

37 Hub/Authority on the Entire Web?  Hub/Authority works well on a topic-specific subset, but works poorly for the whole Web  Easy to spam 1. Create a page pointing to many authority pages (e.g., Yahoo, Google)  The page becomes a good hub page 2. On the page, add a link to your home page  Your home page then becomes a good authority page

38 Questions  Can we apply PageRank to a small base set?

39 PageRank on a Small Subset  In general, PageRank works better for larger datasets  We may be able to compute “topic-specific” PageRank  Any other way for “topic-specific” PageRank?

40 Summary: Link-Based Ranking  PageRank  TrustRank variation  Hub/Authority

