A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.

Slides:



Advertisements
Similar presentations
CWS: A Comparative Web Search System Jian-Tao Sun, Xuanhui Wang, § Dou Shen Hua-Jun Zeng, Zheng Chen Microsoft Research Asia University of Illinois at.
Advertisements

Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.
Personalized Query Classification Bin Cao, Qiang Yang, Derek Hao Hu, et al. Computer Science and Engineering Hong Kong UST.
Optimizing search engines using clickthrough data
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Personal Name Classification in Web queries Dou Shen*, Toby Walker*, Zijian Zheng*, Qiang Yang**, Ying Li* *Microsoft Corporation ** Hong Kong University.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Web-Page Summarization Using Clickthrough Data Advisor.
Time-dependent Similarity Measure of Queries Using Historical Click- through Data Qiankun Zhao*, Steven C. H. Hoi*, Tie-Yan Liu, et al. Presented by: Tie-Yan.
Context-Aware Query Classification Huanhuan Cao 1, Derek Hao Hu 2, Dou Shen 3, Daxin Jiang 4, Jian-Tao Sun 4, Enhong Chen 1 and Qiang Yang 2 1 University.
WMES3103 : INFORMATION RETRIEVAL
Jierui Xie, Boleslaw Szymanski, Mohammed J. Zaki Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180, USA {xiej2, szymansk,
1 Integrating User Feedback Log into Relevance Feedback by Coupled SVM for Content-Based Image Retrieval 9-April, 2005 Steven C. H. Hoi *, Michael R. Lyu.
Web queries classification Nguyen Viet Bang WING group meeting June 9 th 2006.
1 Web Query Classification Query Classification Task: map queries to concepts Application: Paid advertisement 问题:百度 /Google 怎么赚钱?
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
Text Classification Using Stochastic Keyword Generation Cong Li, Ji-Rong Wen and Hang Li Microsoft Research Asia August 22nd, 2003.
12 -1 Lecture 12 User Modeling Topics –Basics –Example User Model –Construction of User Models –Updating of User Models –Applications.
Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.
SIGIR’09 Boston 1 Entropy-biased Models for Query Representation on the Click Graph Hongbo Deng, Irwin King and Michael R. Lyu Department of Computer Science.
Temporal Event Map Construction For Event Search Qing Li Department of Computer Science City University of Hong Kong.
WebPage Summarization Using Clickthrough Data JianTao Sun & Yuchang Lu, TsingHua University, China Dou Shen & Qiang Yang, HK University of Science & Technology.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 An Efficient Concept-Based Mining Model for Enhancing.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
1 Context-Aware Search Personalization with Concept Preference CIKM’11 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Exploiting Wikipedia as External Knowledge for Document Clustering Sakyasingha Dasgupta, Pradeep Ghosh Data Mining and Exploration-Presentation School.
Web-page Classification through Summarization D. Shen, *Z. Chen, **Q Yang, *H.J. Zeng, *B.Y. Zhang, Y.H. Lu and *W.Y. Ma TsingHua University, *Microsoft.
Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
Personalized Search Cheng Cheng (cc2999) Department of Computer Science Columbia University A Large Scale Evaluation and Analysis of Personalized Search.
Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender.
Presenter: Lung-Hao Lee ( 李龍豪 ) January 7, 309.
Context-Sensitive Information Retrieval Using Implicit Feedback Xuehua Shen : department of Computer Science University of Illinois at Urbana-Champaign.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
1 SIGIR 2004 Web-page Classification through Summarization Dou Shen Zheng Chen * Qiang Yang Presentation : Yao-Min Huang Date : 09/15/2004.
Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research.
1 Web-Page Summarization Using Clickthrough Data* JianTao Sun, Yuchang Lu Dept. of Computer Science TsingHua University Beijing , China Dou Shen,
Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova , Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.
Jun Li, Peng Zhang, Yanan Cao, Ping Liu, Li Guo Chinese Academy of Sciences State Grid Energy Institute, China Efficient Behavior Targeting Using SVM Ensemble.
Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors K Santosh Aditya Joshi Manish Gupta Vasudeva Varma
Exploit of Online Social Networks with Community-Based Graph Semi-Supervised Learning Mingzhen Mo and Irwin King Department of Computer Science and Engineering.
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry University of Tennessee.
Algorithmic Detection of Semantic Similarity WWW 2005.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
Machine Learning for Spam Filtering 1 Sai Koushik Haddunoori.
Exploring in the Weblog Space by Detecting Informative and Affective Articles Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu Shanghai Jiao-Tong University.
Hongbo Deng, Michael R. Lyu and Irwin King
Context-Aware Query Classification Huanhuan Cao, Derek Hao Hu, Dou Shen, Daxin Jiang, Jian-Tao Sun, Enhong Chen, Qiang Yang Microsoft Research Asia SIGIR.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
Online Evolutionary Collaborative Filtering RECSYS 2010 Intelligent Database Systems Lab. School of Computer Science & Engineering Seoul National University.
Hybrid Content and Tag-based Profiles for recommendation in Collaborative Tagging Systems Latin American Web Conference IEEE Computer Society, 2008 Presenter:
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Ontology Engineering and Feature Construction for Predicting Friendship Links in the Live Journal Social Network Author:Vikas Bahirwani 、 Doina Caragea.
2016/9/301 Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou Proceeding.
Source: Procedia Computer Science(2015)70:
A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 22, Feb, 2010 Department of Computer.
Web Page Classification with Heterogeneous Data Fusion
Using Link Information to Enhance Web Page Classification
Presentation transcript:

A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science and Engineering The Hong Kong University of Science and Technology, Hong Kong 2 Microsoft Research Asia, China

Outline Introduction Related Work Implicit and Explicit Links Links for Classification Experiments Conclusion and Future Work

Introduction Why we need Web page classification? Organize the growing amount of pages Facilitate other text mining applications How to classify Web pages? Classification algorithm (SVM, NB, KNN…) Web page representation

Introduction ( Cont. ) Web page representation Content Based Utilize words or phrases of a target page However, very often a Web page contains enough textual clues Context Based Leverage hyperlinks to connect pages It works. However, the hyperlinks sometimes may not reflect true relationships in content between Web pages Any other kind of linkages can be defined and used? How to improve classification with the new links?

Related Work Exploiting Hyperlinks Chakrabarti et al. used predicted labels of neighboring documents to reinforce classification decisions for a given document; Furnkranz also reported a significant improvement in classification accuracy when using the link-based method as opposed to the full- text alone. Exploiting Query Logs Beeferman and Berger proposed an innovative query clustering method based on query log; Xue et al. proposed a novel categorization algorithm named IRC to categorize the interrelated Web objects by leveraging query log.

Implicit and Explicit Links Query logs

Implicit and Explicit Links ( Cont. ) Implicit link 1 ( L I 1) Assumption: a user tends to click the pages related to the issued query; Definition: there is an L I 1 between d 1 and d 2 if they are clicked by the same person through the same query; Implicit link 2 (L I 2) Assumption: users tend to click related pages according to the same query Definition: there is an L I 2 between d1 and d2 if they are clicked according to the same query

Implicit and Explicit Links ( Cont. ) Comparison between I L 1 and I L 2 The constraint of L I 2 is not as strict as that for L I 1; Thus, there are more links of L I 2 can be constructed than L I 1; L I 2 is noisier than L I 1, especially for the ambiguous queries ( such as apple)

Implicit and Explicit Links ( Cont. ) Three kinds of Explicit Links defined based on hyperlinks Cond E 1: there exists hyperlinks from d j to d i, (In-Link to d i from d j ) Cond E 2: there exists hyperlinks from d i to d j, (Out-Link from d i to d j ) Cond E 3: either Cond E 1 or Cond E 2 holds

Links for Classification Classification by Linking Neighbors (CLN) CLN is similar to KNN; K is not a constant as in KNN and it is decided by the set of the neighbors of the target page.

Links for Classification ( Cont. ) Build Virtual Document Given a document, the virtual document is constructed by borrowing some Extra Text from its neighbors Extra Text Local Text: Plain text + Meta Data Anchor Text Extended Anchor Text Anchor Sentence Apply any classifier such as SVM, NB

Links for Classification ( Cont. ) Local Text: Plain text: remaining text by removing html tags; Meta Data: text between and ; Anchor Text The visible text in a hyperlink Extended Anchor Text The set of rendered words occurring up to 25 words before and after an associated link Anchor Sentence The set of sentences containing the query based on which the implicit link is created

Experiments Datasets 1.3 million Web pages among 424 classes from Open Directory Project (ODP) 44.7 million records in 29 days from MSN Classifiers Naïve Bayesian Classifier; Support Vector Machine (SVM light ) Evaluation Metrics Precision, Recall, F1

Experiments (Cont.) Statistics of Links Consistency: the percentage of links that have the two linked pages from the same category. The consistency of L I 1 is much higher than others; The consistency values of all explicit links are lower than 50%, which explained some published results that it is not helpful to use hyperlink in a straightforward way; #L E 1 = #L E 2 > #L E 3 A B; BC; CB #L E 1 = 3; #L E 2 = 3; #L E 3 = 2

Experiments (Cont.) The results are consistent with the consistency values of different kinds of links Compare the best result of implicit links and the best result of explicit links 20.6% 44.0% Results of CLN on Different Links

Experiments (Cont.) Construction of virtual documents

Experiments (Cont.) Performance on different kinds of VD The performance of AS, EAT and AT is just as good as the baseline, or even worse. ILT is much better than ELT ELT is better than LT, but not always

Experiments (Cont.) Explanation the average size of the virtual documents (in terms of KB) the consistency or purity of the content of the virtual documents

Experiments (Cont.) Effect of Different Combinations

Experiments (Cont.) Observations Either AT, EAT or AS can improve the performance of classification; AS achieves greatest improvement; Different weighting schemes do not make too much of a difference We also tried to combine LT,EAT and AS together, no further improvement is obtained

Experiments (Cont.) The effect of Query Log quantity

Conclusion Based on the query logs, a new kind of links-- - the implicit links -- is introduced; Comparison between the implicit and explicit links on a large dataset is given; A concept of a virtual document by extracting anchor sentence (AS) though implicit links is presented; Experiment result show that implicit link is better than explicit when used for web page classification.

Future Work Introduce more kinds of implicit and explicit links; Try on more applications such as clustering and summarization; Extract other information such as Dissimilarity Relationship from query log.

Thanks