1 PageSim: A Link-based Measure of Web Page Similarity Research Group Presentation Allen Z. Lin, 8 Mar 2006.

2 Outline
• What & Why?
• Existing approaches
• PageSim: a new approach
• Demonstrations
• Conclusion and current work

3 What & Why?
• Ranking similarity between web pages.
• Applications on the Web
  – Finding related, or similar, web pages to a page.
      Google's "Similar pages"
  – Web page classification.
      YAHOO!'s Web Directory: http://dir.yahoo.com/
      hierarchical structure
• Key question: How to measure the similarity?

4 Existing approaches
• Text-based
  – Using common features of two web pages.
      Jaccard's coefficient, Adamic/Adar
• Link-based
  – Using neighbors of the two web pages.
      Common neighbor, Co-citation, SimRank
  – Using paths between two web pages.
      Katz index, Hitting time

5 Existing approaches (cont.)
• Notations
  – Sim(a,b): similarity score of web pages a and b.
  – I(a): in-link neighbors of web page a.
  – O(a): out-link neighbors of web page a.
• Common neighbor method
  – Sim(a,b) = |O(a) ∩ O(b)| = |{c,d}| = 2
• Cocitation method
  – Sim(a,b) = |I(a) ∩ I(b)| = |{c,d}| = 2
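The two set-intersection measures can be sketched in a few lines of Python. The graph below is a hypothetical toy example (the slide's figures are not part of the transcript), and the function names are illustrative only.

```python
# Hypothetical toy graph: a and b both link to c and d; e and f both link to a and b.
out_links = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": set(),
    "d": set(),
    "e": {"a", "b"},
    "f": {"a", "b"},
}

def invert(out_links):
    """Build I(page), the set of in-link neighbors, from the out-link sets."""
    in_links = {page: set() for page in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)
    return in_links

def common_neighbor(out_links, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|."""
    return len(out_links[a] & out_links[b])

def cocitation(out_links, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|."""
    in_links = invert(out_links)
    return len(in_links[a] & in_links[b])

print(common_neighbor(out_links, "a", "b"))  # a and b share the out-neighbors c and d
print(cocitation(out_links, "a", "b"))       # a and b share the in-neighbors e and f
```

Both measures only count shared neighbors, so every link carries the same weight regardless of which page it comes from.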

6 Existing approaches (cont.)
• SimRank
  – Two pages are similar if they are referenced (cited, or linked to) by similar pages.
  – 1. Sim(u,u) = 1; 2. Sim(u,v) = 0 if |I(u)| |I(v)| = 0.
      Recursive definition (standard SimRank form, for all other pairs):
      Sim(u,v) = C / (|I(u)| |I(v)|) · Σ_{x ∈ I(u)} Σ_{y ∈ I(v)} Sim(x,y)
  – C is a constant between 0 and 1.
  – The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u ≠ v.
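A minimal sketch of the SimRank iteration, assuming the standard recursive definition (the average similarity of in-neighbor pairs, scaled by C). The value C = 0.8, the iteration count, and the tiny graph are assumptions made for illustration.

```python
def simrank(in_links, C=0.8, iterations=10):
    """Iterate Sim(u,v) = C / (|I(u)||I(v)|) * sum of Sim(x,y) over x in I(u), y in I(v).

    in_links maps each page to the set of pages linking to it.
    """
    pages = list(in_links)
    # Starting point from the slide: Sim(u,u) = 1, Sim(u,v) = 0 for u != v.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in pages} for u in pages}
    for _ in range(iterations):
        new = {u: {} for u in pages}
        for u in pages:
            for v in pages:
                if u == v:
                    new[u][v] = 1.0
                elif not in_links[u] or not in_links[v]:
                    new[u][v] = 0.0  # a page without in-links is similar to nothing else
                else:
                    total = sum(sim[x][y] for x in in_links[u] for y in in_links[v])
                    new[u][v] = C * total / (len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim

# Hypothetical example: page p links to both u and v.
scores = simrank({"p": set(), "u": {"p"}, "v": {"p"}})
print(scores["u"]["v"])  # u and v share their only in-neighbor
print(scores["p"]["u"])  # p has no in-links
```

In this example the pair (u, v) gets a non-zero score because both pages are cited by p, while any pair involving p stays at zero because p has no in-links.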

7 PageSim: a new approach
• Two problems
  – On the Web, not all links are equally important.
      Common neighbor, Cocitation
  – A similarity measure should be able to measure the similarity between any two web pages.
      SimRank
• PageSim
  – Takes the above problems into account.

8 PageSim: a new approach (cont.)
• Cocitation
• Which page is more similar to d: c or e?
• Suppose page a is YAHOO!'s homepage, and b is a personal web page. Authoritative pages are more important.

9 PageSim: a new approach (cont.)
• SimRank
• Are a and b similar?
  – SimRank's answers are "NO".
      Are the answers reasonable?

10 PageSim: a new approach (cont.)
• Page a linking to b and c means a "thinks" that
  – b and c are kind of similar.
  – both b and c are kind of similar to a too.
• Page a spreads similarity to its neighbors.
• Authoritative pages spread more similarity.

11 PageSim: a new approach (cont.)
• PageSim
  – In PageSim, the PageRank (PR) score is used to measure the authority of a web page.
      PR assigns global importance scores to all web pages.
  – Each page spreads its own similarity score (PR score) to its neighbors.
  – Each page also propagates other pages' similarity scores to its neighbors.
  – After the similarity score propagation has finished, each page holds an array of similarity scores.
  – PageRank score propagation
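The slides do not say how the PR scores are computed or scaled. As a rough illustration, the sketch below uses the common power-iteration form of PageRank with a damping factor of 0.85 (an assumption) and scores normalized to sum to 1, whereas the examples in this talk use a scale on which a page can score 100. The graphs are hypothetical.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank; out_links maps each page to a list of pages it links to."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = out_links[p]
            if targets:
                # Split the damped score evenly over the out-links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page (no out-links): spread its score over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical 4-page loop: every page ends up equally authoritative.
loop_rank = pagerank({"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]})
print(loop_rank)
```

Any other way of assigning authority scores could be substituted here; PageSim as described only needs one global score per page.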

12 PageSim: a new approach (cont.)
• Example: similarity propagation (page a only)
  – PR(a)=100, PR(b)=55, PR(c)=102
  – Each page propagates 80% of its similarity score, split evenly among its neighbors.

13 PageSim: a new approach (cont.)
• Example: similarity propagation (cont.)
  – PR(a)=100, PR(b)=55, PR(c)=102
  – Each page contains a similarity score vector (SV).
      SV(a) = (100, 35, 82)
      SV(b) = ( 40, 55, 33)
      SV(c) = ( 72, 44, 102)
  – PageSim score (PS) computation
      PS(a,b) = Σ_i min( SV(a)_i, SV(b)_i ) = 40 + 35 + 33 = 108
  – Two pages are more similar if they share more common similarity scores.
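The PS computation above is an element-wise minimum followed by a sum. A short sketch using the three vectors from the slide (the function name `pagesim` is illustrative):

```python
# Similarity score vectors as given on the slide.
SV = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}

def pagesim(sv_u, sv_v):
    """PS(u,v): sum of the element-wise minima of the two score vectors."""
    return sum(min(x, y) for x, y in zip(sv_u, sv_v))

print(pagesim(SV["a"], SV["b"]))  # 40 + 35 + 33
print(pagesim(SV["a"], SV["c"]))  # 72 + 35 + 82
print(pagesim(SV["b"], SV["c"]))  # 40 + 44 + 33
```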

14 PageSim: a new approach (cont.)
• Example: similarity spreading (cont.)
  – PageSim score matrix: PS_matrix = (PS(u,v)) n×n =
      a: 217
      b: 108  128
      c: 189  117  219
  – PS_matrix is symmetric.
      PS(a,b) = PS(b,a)
  – Any web page is most similar to itself.
      PS(u,u) = max ( PS(u,v) ), for any v.
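Both properties follow directly from the min-sum form of PS. A short derivation, writing SV(u)_i for the i-th entry of page u's score vector (this indexing notation is an assumption, not taken from the slides):

```latex
\[
PS(u,v) \;=\; \sum_{i} \min\bigl(SV(u)_i,\; SV(v)_i\bigr) \;=\; PS(v,u),
\]
\[
PS(u,v) \;\le\; \sum_{i} SV(u)_i \;=\; \sum_{i} \min\bigl(SV(u)_i,\; SV(u)_i\bigr) \;=\; PS(u,u).
\]
```

The first line holds because min is symmetric in its arguments; the second because each minimum is at most SV(u)_i. As a side check, summing SV(c) = (72, 44, 102) as printed gives 218 rather than the 219 shown in the matrix, presumably because the displayed vector entries are rounded.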

15 Demonstrations
• Example 1: single link
  – PageSim matrix
      a: 100
      b: 80    265
      c: 64    212    469.2
      d: 51.2  169.6  375.4  694.1
  – PR = (100, 185, 257.2, 318.6)
  – SimRank matrix
      1
      0  1
      0  0  1
      0  0  0  1
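The PageSim matrix for this example can be reproduced with a small propagation sketch. It rests on assumptions not stated in the transcript: the graph is the chain a → b → c → d (the figure is missing), each hop passes on 80% of a score split evenly over the out-links, and a score travels only along paths that do not revisit a page. Under these assumptions the printed values come out up to rounding.

```python
def propagate(out_links, pr, decay=0.8):
    """Return sv[v][u]: the part of u's PR score that reaches page v."""
    sv = {v: {u: 0.0 for u in out_links} for v in out_links}

    def walk(source, page, score, visited):
        targets = out_links[page]
        if not targets:
            return
        share = decay * score / len(targets)  # even split over all out-links
        for nxt in targets:
            if nxt in visited:
                continue  # assumption: scores do not travel around cycles
            sv[nxt][source] += share
            walk(source, nxt, share, visited | {nxt})

    for u in out_links:
        sv[u][u] = pr[u]  # every page keeps its own full PR score
        walk(u, u, pr[u], {u})
    return sv

def pagesim(sv, u, v):
    """PS(u,v): sum over all source pages of the smaller received score."""
    return sum(min(sv[u][w], sv[v][w]) for w in sv)

# Assumed graph for Example 1, with the PR values from the slide.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
pr = {"a": 100.0, "b": 185.0, "c": 257.2, "d": 318.6}
sv = propagate(chain, pr)

for u in chain:
    print(u, [round(pagesim(sv, u, v), 1) for v in chain])
```

With these inputs PS(d,d) evaluates to about 693.96, slightly below the 694.1 on the slide, which suggests the printed PR values are themselves rounded. The recursive walk enumerates every simple path, so it is only practical for small graphs.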

16 Demonstrations (cont.)
• Example 2: loop link
  – PageSim matrix
      a: 295.2
      b: 246.4  295.2
      c: 230.4  246.4  295.2
      d: 246.4  230.4  246.4  295.2
  – PR = (100, 100, 100, 100)
  – SimRank matrix
      1
      0  1
      0  0  1
      0  0  0  1
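For the loop, the score vectors have a closed form, which makes the matrix easy to check. The sketch assumes the graph is the directed cycle a → b → c → d → a (the figure is missing), an 80% pass-on rate per hop, and that a score goes around the loop once without re-entering its source page.

```python
pages = ["a", "b", "c", "d"]
n, pr, decay = len(pages), 100.0, 0.8

def hops(src, dst):
    """Number of links followed to get from src to dst around the cycle."""
    return (pages.index(dst) - pages.index(src)) % n

# sv[v] lists, for each source page u, the part of u's score that reaches v.
sv = {v: [pr * decay ** hops(u, v) for u in pages] for v in pages}

def pagesim(u, v):
    return sum(min(x, y) for x, y in zip(sv[u], sv[v]))

for u in pages:
    print(u, [round(pagesim(u, v), 1) for v in pages])
```

Every page receives the same multiset of scores (100, 80, 64, 51.2), only in a different order, which is why all diagonal entries are equal and the similarity depends only on how far apart two pages are on the loop.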

17 Demonstrations (cont.)
• Example 3: more complex
  – PageSim matrix
      1: 100.0
      2: 40.0   487.6
      3: 50.7   159.4  397.4
      4: 10.7   238.5  130.0  275.5
      5: 10.7   130.0  130.0  130.0  314.9
  – PR = (100, 40.0, 50.7, 10.7, 10.7)
  – SimRank matrix
      1: 1
      2: 0  1
      3: 0  0.25  1
      4: 0  0  0.5  1
      5: 0  0  0.5  1  1
  – PageSim results
      v3 is most similar to v1.
      v4 is most similar to v2.

18 Conclusion and current work
• Conclusion
  – Web page similarity measures: text-based and link-based.
  – PageSim: PageRank score propagation.
• Current work
  – Propagation radius pruning.
  – How to compare the performance of two similarity measures, e.g., PageSim and SimRank?
      Text-based measures.
Thank you!

