Published by Erin Powers. Modified over 6 years ago.
1
The PageRank Citation Ranking: Bringing Order to the Web
Dr. Yingwu Zhu
2
Overview
- Motivation
- Related work
- PageRank & Random Surfer Model
- Implementation
- Conclusion
3
Motivation
- The web is heterogeneous and unstructured
- There is no quality control on the web
- Commercial interests try to manipulate rankings
4
Related Work
- Academic citation analysis
- Link-based analysis
- Clustering methods based on link structure
- Hubs & Authorities model
5
Backlinks
- Link structure of the web
- Backlinks approximate a page's importance / quality
6
PageRank
- Pages with many backlinks are important
- Backlinks from important pages convey more importance to a page
7
PageRank
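The two observations about backlinks correspond to the simplified PageRank definition in the original paper. As a sketch, with notation assumed from that paper rather than from the slide text: B_u is the set of pages linking to u, N_v is the number of outgoing links on page v, and c is a normalization factor.

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
```

Each page v splits its rank evenly among its N_v outgoing links, so a backlink from a highly ranked page contributes more than one from a low-ranked page.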
8
Two Problems!
- Rank sinks: introduce escape terms
- Dangling links
  – Dangling links are simply links that point to a page with no outgoing links
  – They do not directly affect the rank of any other page
  – Ignore them at first and add them back later
9
Rank Sink
- A cycle of pages that is pointed to by some incoming link but has no links leading out
- Problem: the loop accumulates rank but never distributes any rank outside itself
10
Escape Term
- Solution: a rank source E
- c is maximized and ||R||_1 = 1
- E(u) is some vector over the web pages: uniform, a single favorite page, etc.
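With a rank source, the PageRank definition gains one term. A sketch in the notation of the original paper (B_u, the set of pages linking to u, and N_v, the number of outgoing links on page v, are assumed from that paper; the paper writes R' where this uses R):

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u),
\qquad \text{with } c \text{ maximized and } \lVert R \rVert_1 = 1
```

The extra term c E(u) feeds rank into every page where E is nonzero, so rank trapped in a sink is balanced by rank injected elsewhere.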
11
Matrix Notation
- R is the dominant eigenvector, and c the dominant eigenvalue, of the link matrix, because c is maximized
12
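Computing PageRank
- Initialize a rank vector over the web pages
- Loop:
  – New ranks: sum of normalized backlink ranks
  – Compute the normalizing factor
  – Add the escape term
  – Compute the control parameter (change between iterations)
- Repeat while the control parameter exceeds a threshold; stop when converged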
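The loop can be sketched in pure Python. This is a minimal illustration, not the authors' implementation: the function name `pagerank`, the edge-list input, the uniform default for E, and the tolerance `eps` are all assumptions made for the example. The normalizing factor is taken to be the rank lost in one step (for instance through pages with no outgoing links), which is then returned through the escape term.

```python
def pagerank(num_pages, links, rank_source=None, eps=1e-10, max_iter=200):
    """Iterate ranks over an edge list of (src, dst) page ids in 0..num_pages-1.

    rank_source plays the role of E; it defaults to a uniform vector (assumption).
    """
    out_degree = [0] * num_pages
    for src, _ in links:
        out_degree[src] += 1

    if rank_source is None:
        rank_source = [1.0 / num_pages] * num_pages

    # Initialize the rank vector over all pages.
    rank = [1.0 / num_pages] * num_pages

    for _ in range(max_iter):
        # New ranks: sum of normalized backlink ranks.
        new_rank = [0.0] * num_pages
        for src, dst in links:
            new_rank[dst] += rank[src] / out_degree[src]

        # Normalizing factor: total rank lost in this step.
        lost = sum(rank) - sum(new_rank)

        # Escape term: hand the lost rank back according to the rank source.
        new_rank = [r + lost * e for r, e in zip(new_rank, rank_source)]

        # Control parameter: L1 change between successive rank vectors.
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < eps:
            break

    return rank


ranks = pagerank(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```

On this three-page example, pages 0 and 2 end up with equal rank and page 1, whose only backlink is shared with another target, ends up with half as much.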
13
Random Surfer Model
- PageRank corresponds to the probability distribution of a random walk on the web graph
- E(u) can be rephrased as follows: the random surfer periodically gets bored and jumps to a different page, so the surfer is not kept in a loop forever
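The random surfer can be simulated directly. The sketch below is only an illustration: the function name `simulate_random_surfer`, the jump probability of 0.15, the uniform choice of jump target, and the fixed seed are assumptions, not values taken from the slides. The fraction of steps spent on each page approximates that page's rank.

```python
import random


def simulate_random_surfer(num_pages, links, jump_prob=0.15, steps=100_000, seed=0):
    """Return the fraction of steps a random surfer spends on each page."""
    out_links = [[] for _ in range(num_pages)]
    for src, dst in links:
        out_links[src].append(dst)

    rng = random.Random(seed)
    visits = [0] * num_pages
    page = rng.randrange(num_pages)

    for _ in range(steps):
        visits[page] += 1
        # The surfer gets bored (or is stuck on a page with no links)
        # and jumps to a page chosen uniformly at random.
        if not out_links[page] or rng.random() < jump_prob:
            page = rng.randrange(num_pages)
        else:
            # Otherwise follow one of the current page's links at random.
            page = rng.choice(out_links[page])

    return [count / steps for count in visits]


frequencies = simulate_random_surfer(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```

Because the jump can fire on any step, the surfer always leaves a cycle eventually, which is the intuition behind the escape term.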
14
Implementation
- Computing resources
  – 24 million pages
  – 75 million URLs
- Memory and disk storage
  – Weight vector (4-byte floats)
  – Matrix A (linear access)
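As a rough worked example of the memory requirement, assuming one 4-byte weight per URL (the constant names are illustrative):

```python
NUM_URLS = 75_000_000      # URLs in the link database, from the slide
BYTES_PER_WEIGHT = 4       # one 4-byte float per URL (assumption)

vector_bytes = NUM_URLS * BYTES_PER_WEIGHT
vector_megabytes = vector_bytes / 1_000_000  # decimal megabytes

print(f"{vector_megabytes:.0f} MB per weight vector")
```

At this size a single weight vector fits in main memory, while the much larger link matrix A can be streamed from disk with linear access.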
15
Implementation (cont'd)
- Dealing with dangling links
  – Assign a unique integer ID to each URL
  – Sort the link structure and remove dangling links
  – Make an initial rank assignment
  – Iterate until convergence
  – Add the dangling links back and recompute
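The removal step can be sketched as follows. This is a minimal illustration with a hypothetical helper name, `remove_dangling_links`, and it assumes the URLs have already been mapped to integer IDs. Removing a dangling link can leave its source page without outgoing links, so the sketch repeats the pass until nothing changes.

```python
def remove_dangling_links(links):
    """Split an edge list into (kept, removed) links.

    A link is dangling when its target page has no outgoing links
    among the links still kept.
    """
    kept = list(links)
    removed = []
    while True:
        pages_with_out_links = {src for src, _ in kept}
        dangling = [link for link in kept if link[1] not in pages_with_out_links]
        if not dangling:
            return kept, removed
        removed.extend(dangling)
        kept = [link for link in kept if link[1] in pages_with_out_links]


kept, removed = remove_dangling_links([(0, 1), (1, 0), (1, 2), (2, 3)])
```

Here the link to page 3 is removed first, which leaves page 2 without outgoing links, so the link to page 2 is removed in the next pass. The `removed` list is what would be added back after the ranks on the `kept` links have converged.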
16
Convergence Properties (cont'd)
- The number of iterations PageRank needs to converge grows roughly as O(log |V|), because the web graph G is rapidly mixing.
17
Personalized PageRank
- The rank source E can be initialized:
  – uniformly over all pages: e.g. copyright warnings, disclaimers, and mailing list archives end up with overly high rankings
  – with total weight on a single page, e.g. Netscape or McCarthy: ranks vary greatly depending on which single page is the rank source
  – anywhere in between, e.g. server root pages: this allows manipulation by commercial interests
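The effect of the rank source can be seen on a toy graph. The sketch below uses the common damping-factor formulation rather than the normalization-based loop from the earlier slide; the function name `personalized_pagerank`, the damping value 0.85, and the fixed iteration count are assumptions for illustration. The rank source must sum to 1.

```python
def personalized_pagerank(num_pages, links, rank_source, damping=0.85, iters=100):
    """Damped PageRank where jumps land according to rank_source (the vector E)."""
    out_degree = [0] * num_pages
    for src, _ in links:
        out_degree[src] += 1

    rank = list(rank_source)
    for _ in range(iters):
        # Jump term: a fixed share of rank always comes from the rank source.
        new_rank = [(1.0 - damping) * e for e in rank_source]
        # Link term: each page passes the rest of its rank along its links.
        for src, dst in links:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        # Rank held by pages without outgoing links returns to the rank source.
        lost = 1.0 - sum(new_rank)
        rank = [r + lost * e for r, e in zip(new_rank, rank_source)]
    return rank


toy_links = [(0, 1), (1, 0), (2, 0), (2, 3), (3, 2)]
uniform = personalized_pagerank(4, toy_links, [0.25, 0.25, 0.25, 0.25])
focused = personalized_pagerank(4, toy_links, [0.0, 0.0, 0.0, 1.0])
```

Putting all of the rank source on page 3 raises that page's rank well above its rank under the uniform source, which illustrates how strongly the ranking depends on the choice of E.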
18
Issues
- Users are not random walkers
  – Content-based methods
- Starting point distribution
  – Actual usage data as the starting vector
- Reinforcing effects / bias towards main pages
- How about traffic to ranking pages?
- No query-specific rank
- Linkage spam
  – PageRank favors pages that manage to get other pages to link to them
  – Linkage is not necessarily a sign of relevancy, only of promotion (advertisement…)
19
Conclusion
- PageRank is a global ranking based on the web's graph structure
- PageRank uses backlink information to bring order to the web
- PageRank can separate out representative pages as cluster centers
- It has a great variety of applications