

Slide 1
PageSim: A Link-based Similarity Measure for the World Wide Web
Zhenjiang Lin, Irwin King, and Michael R. Lyu
Computer Science & Engineering, The Chinese University of Hong Kong
20 Dec 2006

Slide 2: Outline
1. Introduction
2. Related Work
3. PageSim
4. Experimental Results
5. Conclusion and Future Work

Slide 3: 1. Introduction
Background
- Similarity measures are required in many web applications to evaluate the similarity between web pages:
  - the “similar pages” service of web search engines;
  - web document classification;
  - web community identification.

Slide 4: 1. Introduction
Similarity measures
- Evaluate how similar or related two objects are.
Approaches to measuring similarity
- Text-based: cosine TFIDF [Joachims97]
- Link-based (focus of this talk): bibliographic coupling [Kessler63], co-citation [Small73], SimRank [Jeh et al 02], PageSim [Lin et al 06]
- Hybrid

Slide 5: 1. Introduction
Problem
- How can the similarity between web pages be evaluated purely from the structural information of the Web?
Motivation
- Developing an effective link-based similarity measure for the World Wide Web.
Contributions
- Propose a novel link-based similarity measure, PageSim, which is more flexible and accurate.

Slide 6: 2. Related Work
What is hidden in hyperlinks?
- (1) A similarity relationship between pages.
- (2) The similarity relationship decreases along hyperlinks.

Slide 7
Intuition of similarity
- Similar web pages have similar neighbors. (To compare two web pages, look at their neighbors.)
Notations
- G = (V, E), |V| = n: the web graph.
- I(a) / O(a): in-link / out-link neighbors of web page a.
- path(a_1, a_s): a sequence of vertices a_1, a_2, …, a_s such that (a_i, a_{i+1}) ∈ E for i = 1, …, s-1 and all a_i are distinct.
- PATH(a,b): the set of all possible paths from page a to page b.
- Sim(a,b): similarity score of web pages a and b.

Slide 8: 2. Related Work
Two classical methods
- Co-citation: the more common in-link neighbors, the more similar.
  Sim(a,b) = |I(a) ∩ I(b)|
- Bibliographic coupling: the more common out-link neighbors, the more similar.
  Sim(a,b) = |O(a) ∩ O(b)|
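The two counts above can be illustrated on a toy graph. This is a minimal sketch: the edge set and page names are made up for illustration and do not come from the talk.

```python
# Toy web graph as a set of directed edges (source, target).
# Page names are hypothetical.
edges = {("c", "a"), ("c", "b"), ("d", "a"), ("d", "b"), ("e", "a"),
         ("a", "f"), ("b", "f"), ("b", "g")}

def in_neighbors(page, edges):
    """I(page): pages that link to `page`."""
    return {src for src, dst in edges if dst == page}

def out_neighbors(page, edges):
    """O(page): pages that `page` links to."""
    return {dst for src, dst in edges if src == page}

def co_citation(a, b, edges):
    """Sim(a,b) = |I(a) ∩ I(b)|"""
    return len(in_neighbors(a, edges) & in_neighbors(b, edges))

def bibliographic_coupling(a, b, edges):
    """Sim(a,b) = |O(a) ∩ O(b)|"""
    return len(out_neighbors(a, edges) & out_neighbors(b, edges))

print(co_citation("a", "b", edges))             # 2 (shared in-neighbors: c, d)
print(bibliographic_coupling("a", "b", edges))  # 1 (shared out-neighbor: f)
```

Both measures only look at direct neighbors, so two pages with no common direct neighbor always score 0.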

Slide 9: 2. Related Work
SimRank
“two pages are similar if they are linked to by similar pages”
- (1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)|·|I(v)| = 0.
Recursive definition
- C is a constant between 0 and 1.
- The iteration starts with Sim(u,u) = 1, Sim(u,v) = 0 if u ≠ v.
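The recursive formula itself is not reproduced in this transcript. The sketch below assumes the commonly cited SimRank recurrence from Jeh and Widom, Sim(u,v) = C / (|I(u)|·|I(v)|) times the sum of Sim(x,y) over all x in I(u) and y in I(v), together with the base cases and starting values listed on the slide. The value C = 0.8 and the iteration count are illustrative choices, not taken from the talk.

```python
from itertools import product

def simrank(edges, C=0.8, iterations=10):
    """Naive iterative SimRank over a collection of (source, target) edges.

    Assumes the standard recurrence; C and `iterations` are illustrative.
    """
    nodes = sorted({n for edge in edges for n in edge})
    in_nb = {n: [s for s, t in edges if t == n] for n in nodes}

    # Starting values: Sim(u,u) = 1, Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0
           for u, v in product(nodes, repeat=2)}

    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                new[(u, v)] = 1.0
            elif not in_nb[u] or not in_nb[v]:
                new[(u, v)] = 0.0  # |I(u)|·|I(v)| = 0
            else:
                total = sum(sim[(x, y)] for x in in_nb[u] for y in in_nb[v])
                new[(u, v)] = C * total / (len(in_nb[u]) * len(in_nb[v]))
        sim = new
    return sim

# Pages a and b are both linked to only by c, so Sim(a,b) = C * Sim(c,c) = C.
scores = simrank({("c", "a"), ("c", "b")})
print(scores[("a", "b")])
```

This naive version stores a score for every pair of pages, which is only practical for small graphs.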

Slide 10: 3. PageSim
Intuition behind PageSim
- Similar pages have similar neighbors (both direct and indirect).
Strategies in PageSim
- (a) Each web page contains unique feature information and propagates this information to its multi-hop neighbors.
- (b) Important web pages contain more feature information, which can be represented by any global scoring system, such as PageRank scores or the authority scores of HITS.
- (c) Two web pages are more similar if they share more common feature information.

Slide 11: 3. PageSim
PageSim (phase 1: feature propagation)
- Initially, each web page contains unique feature information, which is represented by its PageRank score.
- The feature information of a web page is propagated along out-link hyperlinks at decay rate d. The PR score of u propagated to v is defined by
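The propagation formula is not reproduced in this transcript, so the following is only a sketch of the idea under stated assumptions, not the authors' definition. It assumes that at every hop a page splits the amount it carries evenly over its out-links and multiplies it by d, that propagation follows simple paths (no repeated pages, matching the path notation above), and that it stops after a hypothetical `max_hops` limit. The `scores` argument stands in for precomputed PageRank values.

```python
def propagate_features(edges, scores, d=0.5, max_hops=3):
    """Return fv, where fv[v][u] is the amount of u's score that reaches v.

    Assumptions (not from the talk): per-hop decay d, even split over
    out-links, simple paths only, at most `max_hops` hops.
    """
    out_nb = {}
    for src, dst in edges:
        out_nb.setdefault(src, []).append(dst)

    # Every page starts with its own feature information.
    fv = {page: {page: score} for page, score in scores.items()}

    def walk(origin, current, amount, visited, hops):
        if hops == max_hops or current not in out_nb:
            return
        share = d * amount / len(out_nb[current])
        for nxt in out_nb[current]:
            if nxt in visited:  # keep paths simple
                continue
            bucket = fv.setdefault(nxt, {})
            bucket[origin] = bucket.get(origin, 0.0) + share
            walk(origin, nxt, share, visited | {nxt}, hops + 1)

    for page, score in scores.items():
        walk(page, page, score, {page}, 0)
    return fv

# Hypothetical three-page graph with uniform stand-in scores.
fv = propagate_features([("a", "b"), ("a", "c"), ("b", "c")],
                        {"a": 1.0, "b": 1.0, "c": 1.0})
print(fv["c"])  # {'c': 1.0, 'a': 0.375, 'b': 0.5}
```

In the example, page c receives a's information along two paths (a→c and a→b→c), and the two contributions are summed.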

Slide 12: 3. PageSim
PageSim (phase 2: similarity computation)
- A web page v stores its own feature information and the feature information it receives from other pages in its feature vector FV(v).
- The similarity between web pages u and v is computed with the Jaccard measure [Jain et al 88].
- Intuition: the more common feature information two web pages share, the more similar they are.
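The exact formula is not reproduced in this transcript. One common way to apply a Jaccard-style measure to weighted vectors is the sum of component-wise minima divided by the sum of component-wise maxima; the sketch below assumes that form, with feature vectors stored as dictionaries (a hypothetical representation).

```python
def jaccard_similarity(fv_u, fv_v):
    """Weighted Jaccard of two feature vectors given as {page: amount} dicts.

    Assumed form: sum of minima over sum of maxima; missing entries count as 0.
    """
    keys = set(fv_u) | set(fv_v)
    shared = sum(min(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    total = sum(max(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    return shared / total if total else 0.0

# Hypothetical feature vectors for two pages.
fv_b = {"b": 1.0, "a": 0.25}
fv_c = {"c": 1.0, "a": 0.375, "b": 0.5}
print(jaccard_similarity(fv_b, fv_c))  # 0.75 / 2.375, roughly 0.316
```

Under this form the score is 1 for identical vectors and 0 for vectors with no feature information in common.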

Slide 13: 3. PageSim
Case study: Sim(a,b)
CC: Co-citation; BC: Bibliographic Coupling; SR: SimRank; PS: PageSim
- PageSim is more flexible, since it is able to handle more cases.

Slide 14: 4. Experimental Results
Datasets
- CSE Web (CW) dataset: a set of web pages crawled from http://cse.cuhk.edu.hk.
  - 22,000 pages, 180,000 hyperlinks.
  - The average numbers of in-links and out-links are 8.6 and 7.7.
- Google Scholar (GS) dataset: a set of articles crawled from the Google Scholar search engine.
  - The crawl starts by submitting the keywords “web mining” to GS and then follows the “Cited by” hyperlinks.
  - 20,000 articles, 154,000 citations.

Slide 15: 4. Experimental Results
Evaluation Methods
- Cosine TFIDF similarity (for the CW dataset): a commonly used text-based similarity measure.
- “Related Articles” (for the GS dataset): a list of articles related to a query article, provided by GS. Can be used as ground truth.
Experiments
- Testing the decay factor of PageSim.
- Evaluating the performance of the algorithms: CC: Co-citation, BC: Bibliographic Coupling, SR: SimRank, PS: PageSim.
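The talk does not spell out how precision against the “Related Articles” list is computed. A minimal sketch, assuming precision is the fraction of an algorithm's top-n results that also appear in the ground-truth list (the function name, the cutoff n, and the article identifiers are all hypothetical):

```python
def precision_at_n(ranked, relevant, n=10):
    """Fraction of the top-n ranked items that appear in the ground truth."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant) / len(top)

# Hypothetical ranking produced by a similarity measure for one query article,
# and a hypothetical "Related Articles" list used as ground truth.
ranked = ["p1", "p2", "p3", "p4"]
related = {"p2", "p4", "p9"}
print(precision_at_n(ranked, related, n=4))  # 0.5
```

Averaging this value over all query articles gives one number per algorithm and setting.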

Slide 16: 4. Experimental Results
Result on the Decay Factor of PageSim
- CW data (left): x-axis: decay factor d; y-axis: average cosine TFIDF of all pages.
- GS data (right): x-axis: decay factor d; y-axis: average precision of all pages.

Slide 17: 4. Experimental Results
Performance Evaluation of Algorithms
- CW data (left): x-axis: decay factor d; y-axis: average cosine TFIDF of all pages.
- GS data (right): x-axis: decay factor d; y-axis: average precision of all pages.

Slide 18: 5. Conclusion and Future Work
Conclusion
- Link-based similarity measures: bibliographic coupling, co-citation, and SimRank.
- PageSim:
  - feature information propagation;
  - the more common feature information, the more similar.
- Experiments
Future Work
- Testing on more datasets.
- Integrating link-based with text-based similarity measures.

