Nadav Eiron, Kevin S. McCurley, John A. Tomlin. IBM Almaden Research Center. WWW'04. CSE 450 Web Mining. Presented by Zaihan Yang.

1 Nadav Eiron, Kevin S. McCurley, John A. Tomlin. IBM Almaden Research Center. WWW'04. CSE 450 Web Mining. Presented by Zaihan Yang

2 Introduction & Contribution The paper proposes algorithmic innovations for the basic PageRank paradigm. Problem of the web frontier (dangling nodes): distinguish different types of dangling nodes; propose four techniques for handling penalty pages. Problem of computing PageRank and of rank manipulation: explore the web's hierarchical structure; HostRank and DirRank algorithms.

3 PageRank Backlinks, the random surfer, and recursive computation. Ideal model: the web graph should be strongly connected, and the matrix A should be stochastic (irreducible and aperiodic).

4 PageRank Improved Model Add a link from each page to every page and give each link a small transition probability controlled by a parameter α. Random jump (teleportation); virtual node n+1. Variations. Issues: the parameter α; the uniform distribution of the random jump; dangling nodes.
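
The improved model on this slide can be illustrated with a small power iteration. This is a minimal sketch, not the paper's implementation: the function name `pagerank`, the dictionary-based graph, and the convention that `alpha` is the probability of following a link (so `1 - alpha` is the uniform random-jump probability) are all assumptions made for illustration. It also assumes every node has at least one outlink.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration with uniform teleportation.

    out_links: dict mapping each node to a non-empty list of successors.
    alpha: assumed here to be the probability of following a link.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Every node receives the same share of the random-jump mass.
        new = {v: (1.0 - alpha) / n for v in nodes}
        # Each node splits the rest of its rank evenly over its outlinks.
        for v in nodes:
            share = alpha * rank[v] / len(out_links[v])
            for w in out_links[v]:
                new[w] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))
```

Because the random jump reaches every node, the chain is irreducible and aperiodic even when the link graph alone is not, which is what makes the iteration converge.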

5 Dangling Nodes Dangling nodes: nodes that either have no outlinks or for which no outlinks are known. How do pages become dangling nodes? Crawlers might not have crawled them (for example, dynamic pages). They are protected by robots.txt. They genuinely have no outlinks (PS, PDF). They carry a meta tag telling crawlers not to follow links.

6 Handling Dangling Nodes Remove them and then add them back. Random jump. Reduced eigen-system. Power iteration. A single step.
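
Of the options listed on this slide, the random-jump treatment is the easiest to show in code. The sketch below is an illustration under assumptions, not the authors' method: the name `pagerank_dangling` is hypothetical, `alpha` is assumed to be the link-following probability, and a dangling node is assumed to jump uniformly to any node.

```python
def pagerank_dangling(out_links, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration where dangling nodes jump uniformly at random.

    out_links: dict mapping each node to a (possibly empty) list of successors.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank currently held by nodes with no known outlinks.
        dangling_mass = sum(rank[v] for v in nodes if not out_links[v])
        # Teleportation mass plus the dangling mass, both spread uniformly.
        base = (1.0 - alpha) / n + alpha * dangling_mass / n
        new = {v: base for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = alpha * rank[v] / len(out_links[v])
                for w in out_links[v]:
                    new[w] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    print(pagerank_dangling({"a": ["b"], "b": []}))
```

Spreading the dangling mass uniformly keeps the ranks summing to one, so no rank leaks out of the graph at the frontier.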

7 Penalty Pages and Link Rot Penalty pages: pages that are dangling and return a 403 or 404 HTTP status code. Link rot: links that used to work but are now broken. (Penalty link, dangling link.)

8 Effects of Dangling Nodes on Ranking Whether to teleport to dangling nodes. Yes: 3 has the highest rank score. No: [0.31746, 0.31746, 0.365079], 0.269841; less than 1 and 2. The number of dangling links. 1 link: [0.198684, 0.283124, 0.283124, 0.235068]. 4 links: [0.195954, 0.229266, 0.279234, 0.29554].

9 Push-back algorithm If a page links to a penalty page, its rank is reduced by a fraction, and the excess rank is returned to the pages that pushed rank to it in the previous iteration. Page i retains the unpenalized share of its rank and distributes the penalized share among its backlinks.

10 Self-Loop algorithm Augment each page with a self-loop link to itself. With a  i probability follow this link. bi is the number of outlinks from i to penalty pages. gi is the number of outlinks from i to non-penalty pages. 1-  becomes Some variations.

11 Jump-weighting algorithm Instead of redistributing rank evenly, bias the redistribution so that penalized pages receive less rank. A straightforward method weights the links out of the virtual node: a link to an unpenalized node in C (the strongly connected node set) gets a base weight, and a link to a penalized node i gets that weight scaled by g_i/(g_i + b_i).
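
The biased redistribution can be sketched as a jump distribution over nodes. This is an illustration under stated assumptions rather than the paper's exact formulation: the function name `jump_weights` is hypothetical, the base weight for unpenalized nodes is taken to be 1, and the weights are normalized to sum to one. As on the self-loop slide, g counts a node's outlinks to non-penalty pages and b its outlinks to penalty pages.

```python
def jump_weights(good, bad):
    """Return a normalized jump distribution biased against penalized nodes.

    good: dict node -> number of outlinks to non-penalty pages (g_i).
    bad:  dict node -> number of outlinks to penalty pages (b_i).
    Assumption: unpenalized nodes (b_i == 0) get base weight 1.0.
    """
    raw = {}
    for node, g in good.items():
        b = bad.get(node, 0)
        # Penalized nodes are scaled by g_i / (g_i + b_i).
        raw[node] = 1.0 if b == 0 else g / (g + b)
    total = sum(raw.values())
    return {node: w / total for node, w in raw.items()}


if __name__ == "__main__":
    # Node "y" has one good and one bad outlink, so it gets half the base weight.
    print(jump_weights({"x": 3, "y": 1}, {"x": 0, "y": 1}))
```

A node whose outlinks all lead to penalty pages receives weight zero in this sketch, so the random jump never lands on it.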

12 BHITS algorithm Random walk in both forward and backward directions. Forward step: the same as ordinary PageRank. Backward step: non-dangling nodes take a self-loop; dangling non-penalty nodes forward their score to the virtual node; dangling penalty nodes divide their score by the number of inlinks and propagate it equally along their backward links. A penalty page traverses to a random seed node. Matrix representation.

13 HostRank algorithm Web hierarchical structure: 62.4% of links are internal to a site; 82% of outlinks go to the top level of sites. Do not jump uniformly, but to portal or top-level pages. Consider all pages on a site as a single body and assign them all a rank based on the collective value of the information on that site. Each site is represented by one node in the graph, so the graph becomes smaller and the computation cheaper.
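
Collapsing a page-level link graph into one node per site can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `host_graph` is hypothetical, a "site" is approximated by the URL's hostname, and dropping links internal to a host is an assumption made here for simplicity.

```python
from urllib.parse import urlsplit


def host_graph(links):
    """Collapse (source_url, destination_url) pairs into host-level edges."""
    edges = set()
    for src, dst in links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Skip unparsable URLs and (by assumption) links within one host.
        if src_host is None or dst_host is None or src_host == dst_host:
            continue
        edges.add((src_host, dst_host))
    return edges


if __name__ == "__main__":
    page_links = [
        ("http://a.example/x.html", "http://a.example/y.html"),
        ("http://a.example/x.html", "http://b.example/"),
        ("http://a.example/z", "http://b.example/p"),
    ]
    print(host_graph(page_links))
```

The resulting edge set can be fed to any PageRank iteration, with each host's score then shared by all of its pages.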

14 DirRank algorithm HostRank is too coarse a level of granularity and has a heavy-tailed distribution. DirRank graph. Nodes: groups of URLs sharing a prefix up to the last "/" or "?" (a virtual directory). Edges: an edge exists if there is a link from a URL in the source virtual directory to a URL in the destination virtual directory.
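
The virtual-directory grouping can be sketched with plain string handling. The function names `virtual_directory` and `dir_graph` are hypothetical, and reading the rule literally (cut at the last "/" or "?" after the scheme, whichever comes later) is an assumption about details the slide does not spell out.

```python
def virtual_directory(url):
    """Return the URL prefix up to and including the last '/' or '?'."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme present: treat the whole string as the path part.
        scheme, rest = "", url
    cut = max(rest.rfind("/"), rest.rfind("?"))
    prefix = rest if cut < 0 else rest[:cut + 1]
    return scheme + sep + prefix


def dir_graph(links):
    """Collapse (source_url, destination_url) pairs into directory-level edges."""
    return {(virtual_directory(s), virtual_directory(d)) for s, d in links}


if __name__ == "__main__":
    print(virtual_directory("http://example.com/a/b/page.html"))
    print(virtual_directory("http://example.com/search?q=1"))
```

Under this reading, all pages in one path directory share a node, and all query variants of one dynamic script share another.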

15 Experimental Results Setup: a crawl at IBM Almaden with more than 1 billion pages, 37 billion links, and 4.75 billion URLs. Results: reduced computation. DirRank: 114 million nodes / 15 billion edges. HostRank: 19.7 million hosts (nodes) / 1.1 billion edges. Enhanced resistance to link manipulation: 11/20 in 100 million pages vs. 14/100 hostnames. Virtual node probability: 0.82 vs. 0.17.

16 Conclusions PageRank with uniform teleportation is easily subject to link manipulation. The HostRank and DirRank algorithms are both cheaper to compute and less subject to link manipulation. The four proposed techniques for penalty pages can reduce bias and improve ranking performance. For the future, the hope is to place the problem of web page ranking on a firmer scientific foundation, beyond trade or economic domains.

