PageSim: A Link-based Similarity Measure for the World Wide Web. Zhenjiang Lin, Irwin King, and Michael R. Lyu, Computer Science & Engineering, The Chinese University of Hong Kong.

Presentation transcript:

1 PageSim: A Link-based Similarity Measure for the World Wide Web Zhenjiang Lin, Irwin King, and Michael R. Lyu Computer Science & Engineering, The Chinese University of Hong Kong 20 Dec 2006

2 Outline 1. Introduction 2. Related Work 3. PageSim 4. Experimental Results 5. Conclusion and Future Work

3 1. Introduction Background  Similarity measures are required in many web applications to evaluate the similarity between web pages. The “similar pages” service of web search engines; Web document classification; Web community identification.

4 1. Introduction Similarity measures  Evaluate how similar or related two objects are. Approaches to measuring similarity  Text-based: Cosine TFIDF [Joachims97]  Link-based (focus of this talk): Bibliographic coupling [Kessler63], Co-citation [Small73], SimRank [Jeh et al 02], PageSim [Lin et al 06]  Hybrid

5 1. Introduction Problem  How can the similarity between web pages be evaluated based purely on the structural information of the Web? Motivation  Developing an effective link-based similarity measure for the World Wide Web. Contributions  Propose a novel link-based similarity measure, PageSim, which is more flexible and accurate.

6 2. Related Work What is hidden in hyperlinks?  (1) Similarity relationships between pages.  (2) The similarity relationship decreases along hyperlinks.

7 Intuition of similarity  Similar web pages have similar neighbors. (To compare two web pages, look at their neighbors.) Notations  G=(V, E), |V| = n: the web graph.  I(a) / O(a): in-link / out-link neighbors of web page a.  path(a_1, a_s): a sequence of vertices a_1, a_2, …, a_s such that (a_i, a_{i+1}) ∈ E (i=1,…,s-1) and the a_i are distinct.  PATH(a,b): the set of all possible paths from page a to b.  Sim(a,b): similarity score of web pages a and b.

8 2. Related Work Two classical methods  Co-citation: the more common in-link neighbors, the more similar. Sim(a,b) = |I(a) ∩ I(b)|  Bibliographic coupling: the more common out-link neighbors, the more similar. Sim(a,b) = |O(a) ∩ O(b)|
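
The two counts above translate directly into set intersections. A minimal sketch in Python, assuming the web graph is given as a list of (source, target) hyperlink pairs; the function names and the toy graph are illustrative and do not come from the slides:

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (I, O): in-link and out-link neighbor sets per page."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    return in_nbrs, out_nbrs

def co_citation(in_nbrs, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|: number of common in-link neighbors."""
    return len(in_nbrs[a] & in_nbrs[b])

def bibliographic_coupling(out_nbrs, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|: number of common out-link neighbors."""
    return len(out_nbrs[a] & out_nbrs[b])

# Toy graph (hypothetical): x and y both link to a and b.
edges = [("x", "a"), ("x", "b"), ("y", "a"), ("y", "b"),
         ("a", "c"), ("b", "c"), ("b", "d")]
I, O = build_neighbors(edges)
cc = co_citation(I, "a", "b")              # common in-links: x, y
bc = bibliographic_coupling(O, "a", "b")   # common out-links: c
```

Both measures only look at direct neighbors, so two pages with no shared neighbor always score 0.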

9 2. Related Work SimRank “two pages are similar if they are linked to by similar pages”  (1) Sim(u,u)=1; (2) Sim(u,v)=0 if |I(u)|·|I(v)| = 0. Recursive definition  Sim(u,v) = C / (|I(u)|·|I(v)|) · Σ_{x ∈ I(u)} Σ_{y ∈ I(v)} Sim(x,y), where C is a constant between 0 and 1.  The iteration starts with Sim(u,u)=1, Sim(u,v)=0 if u ≠ v.
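
The SimRank recursion of Jeh and Widom averages the similarity over all pairs of in-link neighbors and scales the result by C. A rough sketch of the fixed-point iteration described on this slide follows; the value C=0.8, the iteration count, and the toy graph are illustrative assumptions, not settings taken from the talk:

```python
def simrank(nodes, in_nbrs, C=0.8, iterations=10):
    """Naive SimRank iteration over all node pairs (fine for tiny graphs only)."""
    # Start with Sim(u,u)=1 and Sim(u,v)=0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                iu, iv = in_nbrs.get(u, ()), in_nbrs.get(v, ())
                if u == v:
                    new[(u, v)] = 1.0
                elif not iu or not iv:
                    # |I(u)|*|I(v)| = 0, so the score is defined as 0.
                    new[(u, v)] = 0.0
                else:
                    total = sum(sim[(x, y)] for x in iu for y in iv)
                    new[(u, v)] = C * total / (len(iu) * len(iv))
        sim = new
    return sim

# Toy graph (hypothetical): x links to both a and b.
nodes = ["x", "a", "b"]
in_nbrs = {"a": {"x"}, "b": {"x"}}
scores = simrank(nodes, in_nbrs)
```

Here a and b share their only in-link neighbor x, so Sim(a,b) converges to C, while any pair involving x scores 0 because x has no in-links.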

10 3. PageSim Intuition behind PageSim  Similar pages have similar neighbors (both direct and indirect). Strategies in PageSim  (a) Each web page contains unique feature information and propagates this information to its multi-hop neighbors.  (b) Important web pages contain more feature information, which can be represented by any global scoring system, for example PageRank scores or the authority scores of HITS.  (c) Two web pages are more similar if they share more common feature information.

11 3. PageSim PageSim (phase 1: feature propagation)  Initially, each web page contains unique feature information, which is represented by its PageRank (PR) score.  The feature information of a web page is propagated along its out-link hyperlinks at decay rate d. The PR score of u propagated to v is defined by [formula not preserved in this transcript].
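
The propagation formula itself is not preserved in this transcript, so the following is only a sketch of one plausible reading, not the authors' definition. It assumes that at every hop the propagated amount is multiplied by d and split evenly among the out-links of the current page, that propagation follows paths with distinct vertices (as in the path notation above), and that it stops after a fixed radius. The function name, the radius parameter, and the uniform scores standing in for PageRank are all hypothetical:

```python
def propagate_features(out_nbrs, scores, d=0.5, radius=3):
    """Return fv, where fv[v][u] is the amount of u's score that reaches v.

    Assumed rule (not from the slides): each hop multiplies the amount
    by d and divides it evenly among the out-links of the current page.
    """
    # Every page starts with its own score as its own feature.
    fv = {v: {v: s} for v, s in scores.items()}

    def walk(source, node, amount, visited, depth):
        if depth == radius:
            return
        targets = out_nbrs.get(node, ())
        for nxt in targets:
            if nxt in visited:          # keep path vertices distinct
                continue
            share = d * amount / len(targets)
            fv[nxt][source] = fv[nxt].get(source, 0.0) + share
            walk(source, nxt, share, visited | {nxt}, depth + 1)

    for u, s in scores.items():
        walk(u, u, s, {u}, 0)
    return fv

# Toy graph (hypothetical): a -> b, a -> c, b -> c, with uniform scores
# standing in for PageRank values.
out_nbrs = {"a": ["b", "c"], "b": ["c"]}
scores = {"a": 1.0, "b": 1.0, "c": 1.0}
fv = propagate_features(out_nbrs, scores)
```

In this toy graph, page c receives part of a's score along two paths (a→c and a→b→c), and the contributions are summed.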

12 3. PageSim PageSim (phase 2: similarity computation)  A web page v stores its own feature information and that received from other pages in its feature vector FV(v).  The similarity between web pages u and v is computed by the Jaccard measure [Jain et al 88].  Intuition: the more common feature information two web pages share, the more similar they are.
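
The exact form of the Jaccard measure used on the slide is not preserved here. As an illustration only, the sketch below uses the generalized (weighted) Jaccard coefficient, the sum of component-wise minima divided by the sum of component-wise maxima, which is an assumption and may differ from the authors' formula. The feature vectors are hypothetical dictionaries mapping a source page to the amount of its feature information held:

```python
def jaccard_similarity(fv_u, fv_v):
    """Generalized Jaccard: sum of minima over sum of maxima (assumed form)."""
    keys = set(fv_u) | set(fv_v)
    shared = sum(min(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    total = sum(max(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    return shared / total if total else 0.0

# Hypothetical feature vectors of two pages b and c.
fv_b = {"b": 1.0, "a": 0.25}
fv_c = {"c": 1.0, "a": 0.375, "b": 0.5}
sim_bc = jaccard_similarity(fv_b, fv_c)
```

With this form the score lies in [0, 1], equals 1 for identical feature vectors, and equals 0 when the two pages hold no feature information from a common source.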

13 3. PageSim Case study: Sim(a,b), where CC: Co-citation, BC: Bibliographic Coupling, SR: SimRank, PS: PageSim.  PageSim is more flexible, since it is able to handle more cases.

14 4. Experimental Results Datasets  CSE Web (CW) dataset: A crawled set of 22,000 web pages and 180,000 hyperlinks. The average numbers of in-links and out-links are 8.6 and 7.7.  Google Scholar (GS) dataset: A set of articles crawled from the Google Scholar search engine. Crawling starts by submitting the keywords “web mining” to GS and then follows the “Cited by” hyperlinks. 20,000 articles, 154,000 citations.

15 4. Experimental Results Evaluation Methods  Cosine TFIDF similarity (for CW dataset) A commonly used text-based similarity measure.  “Related Articles” (for GS dataset) A list of related articles to a query article provided by GS. Can be used as ground truth. Experiments  Testing the decay factor of PageSim  Evaluating the performance of the algorithms: CC: Co-citation, BC: Bibliographic Coupling, SR: SimRank, PS: PageSim.
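
For reference, the text-based measure used to evaluate the CW dataset can be sketched as follows. This assumes a common TF-IDF variant (raw term counts multiplied by log(N/df)); the slides do not state which weighting was used, and the token lists are made up for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} vector."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(x, y):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Hypothetical tokenized pages.
docs = [["web", "link", "graph"],
        ["web", "link", "rank"],
        ["video", "summary"]]
vecs = tfidf_vectors(docs)
s01 = cosine(vecs[0], vecs[1])   # pages sharing two terms
s02 = cosine(vecs[0], vecs[2])   # pages sharing no terms
```

A link-based measure can then be judged by the average cosine TFIDF score of the pages it ranks as most similar to each query page.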

16 4. Experimental Results Result on the Decay Factor of PageSim  CW data (left): x-axis: decay factor d; y-axis: average cosine TFIDF of all pages.  GS data (right): x-axis: decay factor d; y-axis: average precision of all pages.

17 4. Experimental Results Performance Evaluation of Algorithms  CW data (left): x-axis: decay factor d; y-axis: average cosine TFIDF of all pages.  GS data (right): x-axis: decay factor d; y-axis: average precision of all pages.

18 5. Conclusion and Future Work Conclusion  Link-based similarity measures: Bibliographic coupling, Co-citation, and SimRank  PageSim: Feature information propagation. The more common feature information, the more similar.  Experiments Future Work  Testing on more datasets.  Integrating link-based with text-based similarity measures.