Enhanced hypertext categorization using hyperlinks
Soumen Chakrabarti (IBM Almaden), Byron Dom (IBM Almaden), Piotr Indyk (Stanford)

Hypertext categorization
- Automatic topic identification
- Also called “supervised learning”
- Given
  - Hypertext document corpus
  - A “small” set of classified documents
- Goal
  - Construct a classifier
  - Apply to new documents

Example from the web

Applications and benefits
- Retrieval
  - Browsing (Yahoo!)
  - Searching (“socks” and NOT “apparel”)
  - Adopted by most search companies
- Profile-based filtering and routing
  - [...], news, “push” services
- Collaborative filtering
  - Automatically categorize click trails
  - Cluster users based on frequently visited topics

Click-trail and bookmark organizer
[Figure labels: integrated browser; view of topic hierarchy; web page]

The limitation of text-only classifiers
- Text-only classifiers are well-researched
  - Rule induction
  - Bayesian learning
- 87% accurate on news
- Lower accuracy on hyperlinked corpora
  - Heterogeneous
  - Information in links not utilized

Our contributions
- A novel approach to hypertext classification
  - Combine text and link information
- Framework for link modeling in hypertext graphs
  - Markov random field (limited “sphere of influence”)
- Techniques for feature extraction
  - Use of domain knowledge to limit complexity
- Techniques to handle incomplete information
  - Iterative labeling algorithm

Is this a new problem?
- Reduction to text classification
  - Include (tagged) text from neighbors
  - Classify the result
- Does not increase accuracy
  - Big neighbor pages
  - Lack of semantic correlation

“Big neighbor”

More of “big neighbor”

Coherent pages linking to incoherent pages

Model specification
- A hypertext graph
  - Nodes = documents
  - Edges = hyperlinks
- Document = sequence or set of terms and links
- Each document has a class label
  - Some labels are known
  - Most are unknown
- Labels are drawn from some distribution

Assumptions used in probability model
- No indirect coupling between the text and the neighbors’ classes
- The probability of a node’s class depends only on neighbors within a limited radius
- Independence among the neighbor class probabilities
- Can assume higher order dependence (neighborhood radius greater than 1)

Probability estimation
- Posterior probability of class given text and neighborhood
- Prior class probability
- Class conditional term distribution
- Class conditional neighbor class distribution (independence between neighbors)
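The formula on this slide did not survive extraction. One form consistent with the four labeled factors and with the independence assumptions on the previous slide is sketched below. The symbols are assumptions introduced here, not the authors' notation: c is the class of the document, τ its text, and N the set of neighbors with classes c_j.

```latex
\Pr(c \mid \tau, N) \;\propto\;
  \underbrace{\Pr(c)}_{\text{prior}}
  \;\underbrace{\Pr(\tau \mid c)}_{\text{term distribution}}
  \;\underbrace{\prod_{j \in N} \Pr(c_j \mid c)}_{\text{neighbor classes, assumed independent}}
```

The text and the neighbor classes appear as separate factors because of the "no indirect coupling" assumption, and the product over neighbors follows from treating neighbor classes as independent given c.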

Bayesian classification algorithm
- Learning phase (parameter estimation)
  - Distribution of a text within a class
  - Interclass linkage probabilities
  - Prior probability of a class
- Classification phase
  - Compute class probabilities
  - Choose the class with highest posterior probability
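The two phases can be sketched as a naive-Bayes-style classifier over terms plus neighbor class labels. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, and the add-alpha smoothing are all assumptions made here.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """Learning phase. docs: iterable of (terms, neighbor_classes, label)."""
    docs = list(docs)
    class_counts = Counter(label for _, _, label in docs)
    term_counts = defaultdict(Counter)  # label -> term -> count
    link_counts = defaultdict(Counter)  # label -> neighbor class -> count
    vocab = set()
    for terms, neighbor_classes, label in docs:
        term_counts[label].update(terms)
        link_counts[label].update(neighbor_classes)
        vocab.update(terms)
    return {
        "classes": sorted(class_counts),
        "class_counts": class_counts,
        "term_counts": term_counts,
        "link_counts": link_counts,
        "vocab": vocab,
        "alpha": alpha,  # smoothing constant (an assumption of this sketch)
        "n_docs": len(docs),
    }


def _log_smoothed(count, total, n_outcomes, alpha):
    return math.log((count + alpha) / (total + alpha * n_outcomes))


def log_scores(model, terms, neighbor_classes):
    """Unnormalized log posterior of each class given text and neighbor classes."""
    a, k, v = model["alpha"], len(model["classes"]), len(model["vocab"])
    scores = {}
    for c in model["classes"]:
        score = math.log(model["class_counts"][c] / model["n_docs"])  # prior
        t_total = sum(model["term_counts"][c].values())
        for t in terms:
            if t in model["vocab"]:  # skip terms never seen in training
                score += _log_smoothed(model["term_counts"][c][t], t_total, v, a)
        l_total = sum(model["link_counts"][c].values())
        for n in neighbor_classes:  # interclass linkage probabilities
            score += _log_smoothed(model["link_counts"][c][n], l_total, k, a)
        scores[c] = score
    return scores


def classify(model, terms, neighbor_classes):
    """Classification phase: choose the class with the highest posterior."""
    scores = log_scores(model, terms, neighbor_classes)
    return max(scores, key=scores.get)
```

With an empty neighbor list the classifier reduces to a text-only one, so the same code can serve as the text baseline.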

Partial neighborhood knowledge
- Problem: Class of test page depends on neighbors’ classes
- Must know neighbors’ classes to use interclass probabilities: circularity!
- Solution: Iterative labeling
  - Initially classify neighboring nodes using text
  - Repeatedly reclassify until consistent
- Text, link, or joint model
- Will this stabilize?
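The iterative labeling loop can be sketched as follows. This is a hard-label simplification under stated assumptions: `classify_text` and `classify_joint` are hypothetical callbacks supplied by the caller, updates are synchronous, and `max_rounds` is a cap added here because nothing in the slide guarantees that the loop stabilizes.

```python
def iterative_labeling(nodes, neighbors, known, classify_text, classify_joint,
                       max_rounds=20):
    """Label unknown nodes, then relabel them until no label changes.

    nodes:          iterable of node ids
    neighbors:      dict mapping node id -> list of neighboring node ids
    known:          dict mapping node id -> label for pre-classified nodes
    classify_text:  function(node) -> label, using local text only
    classify_joint: function(node, neighbor_labels) -> label
    Returns (labels, converged).
    """
    labels = dict(known)
    unknown = [n for n in nodes if n not in known]

    # Step 1: bootstrap the unknown nodes from their own text.
    for n in unknown:
        labels[n] = classify_text(n)

    # Step 2: reclassify using the current neighbor labels until consistent.
    for _ in range(max_rounds):
        updated = dict(labels)
        for n in unknown:
            neighbor_labels = [labels[m] for m in neighbors.get(n, []) if m in labels]
            updated[n] = classify_joint(n, neighbor_labels)
        if updated == labels:
            return labels, True
        labels = updated
    return labels, False  # did not stabilize within max_rounds
```

Known labels are never overwritten, only the initially unknown nodes are revisited. With synchronous updates a small graph can oscillate between two labelings, which is why the function reports whether it converged.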

Data set 1: US patent database
- Local text information
  - Title
  - Abstract
- Citation links
  - Related patents cite each other
- Complete knowledge of the neighbors’ classes

Complete knowledge of neighborhood
- Features used:
  - Local text
  - Class tags from neighbor links
- Large gain from tags
- Gains sensitive to tag representation:
  - /Arts
  - /Arts/Painting
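One way to experiment with tag representation is to turn a neighbor's class path into feature tokens at different granularities. The sketch below assumes that `/Arts` and `/Arts/Painting` stand for a coarse and a full-path tag; the function names and the `NBR:` prefix (used to keep tags distinct from ordinary terms) are hypothetical.

```python
def class_tag(class_path, depth=None, prefix="NBR:"):
    """Return one tag for a neighbor's class, truncated to `depth` levels."""
    parts = [p for p in class_path.split("/") if p]
    if depth is not None:
        parts = parts[:depth]
    return prefix + "/" + "/".join(parts)


def prefix_tags(class_path, prefix="NBR:"):
    """Return one tag per level of the class path, coarsest first."""
    parts = [p for p in class_path.split("/") if p]
    return [prefix + "/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


# A neighbor classified under /Arts/Painting can contribute a coarse tag,
# a full-path tag, or both.
coarse = class_tag("/Arts/Painting", depth=1)
full = class_tag("/Arts/Painting")
both = prefix_tags("/Arts/Painting")
```

The resulting tokens can be appended to a document's term list so that a text classifier treats neighbor classes as extra features.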

Partial knowledge of neighborhood
- Algorithm:
  - Grow radius-two neighborhood
  - Delete labels from a fraction of nodes
  - Do iterative labeling
- Observations:
  - Benefit from links
  - Text+Link most robust
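The label-deletion step of this experiment can be sketched as below. The helper names, the seeded sampling, and the accuracy measure are assumptions of this sketch; the slide does not say how nodes were selected or scored.

```python
import random


def hide_labels(labels, fraction, seed=0):
    """Split labels into (known, held_out) by hiding a fraction of nodes."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    nodes = sorted(labels)
    n_hidden = round(fraction * len(nodes))
    hidden = set(rng.sample(nodes, n_hidden))
    known = {n: c for n, c in labels.items() if n not in hidden}
    held_out = {n: labels[n] for n in hidden}
    return known, held_out


def accuracy(predicted, held_out):
    """Fraction of held-out nodes whose predicted label matches the true one."""
    if not held_out:
        return 0.0
    correct = sum(1 for n, c in held_out.items() if predicted.get(n) == c)
    return correct / len(held_out)
```

Running a labeling procedure on `known` and scoring its output against `held_out` for several values of `fraction` gives the kind of comparison the slide describes between text-only, link-only, and combined models.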

Data set 2: Yahoo!
- Few links point to classified documents
  - 19% of docs have any classified out-link
  - 28% have any classified in-link
  - 40% have either one
  - Need to find new source of information and extend the algorithm

Radius-2 information: co-citations
[Figure legend: document to be classified; bridge; classified document; unclassified document; I-link; O-link. Panels: IO, OI, II/OO]
- An “IO-bridge” connects to many pages of similar topics
- “OI” tends to be noisy (many topics point to Netscape and Free Speech Online)
- “II” and “OO” lead to topic divergence
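Finding IO-bridges in a link graph can be sketched as follows. The sketch reads "IO" as following an in-link back from the document to be classified and then an out-link forward from the bridge, which is an interpretation of the figure labels. The edge-list input and the function name are assumptions.

```python
from collections import defaultdict


def io_bridges(edges, target):
    """Map each bridge page to the other pages it links to.

    edges:  iterable of (source, destination) hyperlinks
    target: the document to be classified
    A bridge is a page that links to `target` and to at least one other page.
    """
    out_links = defaultdict(list)
    in_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    bridges = {}
    for bridge in in_links[target]:          # follow an in-link backwards
        co_cited = [d for d in out_links[bridge] if d != target]  # then out-links
        if co_cited:
            bridges[bridge] = co_cited
    return bridges
```

The co-cited pages keep the order in which the bridge lists them, which matters if link position on the bridge page is used later.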

Link proximity
[Figure: a bridge page listing Link #1, ..., Link #i-1, Link #i, Link #i+1, ...; labels: document to be classified, Art, Music, Unknown]
- Are out-links that are close together more likely to point to related topics than out-links that are far apart?

Bridges are locally coherent
- Link proximity implies semantic proximity
- Exploit this source of information
- Huge attribute space
- Simple classification
  - Check coherence
  - Voting
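A coherence check followed by voting could look like the sketch below. The window size, the agreement threshold, and the function names are all hypothetical choices made for illustration; the slide names the two steps but not how they were carried out.

```python
from collections import Counter


def votes_from_bridge(link_labels, target_index, window=3, min_agreement=0.6):
    """Count class votes from links near the target link on one bridge page.

    link_labels:  class label of each out-link in page order (None if unknown)
    target_index: position of the link to the document being classified
    Returns an empty Counter if the nearby known labels are not coherent.
    """
    lo = max(0, target_index - window)
    hi = min(len(link_labels), target_index + window + 1)
    votes = Counter(
        label
        for i, label in enumerate(link_labels[lo:hi], start=lo)
        if i != target_index and label is not None
    )
    if not votes:
        return votes
    top_share = votes.most_common(1)[0][1] / sum(votes.values())
    return votes if top_share >= min_agreement else Counter()


def classify_by_bridges(bridges, window=3, min_agreement=0.6):
    """Sum votes over (link_labels, target_index) pairs; None if no votes."""
    total = Counter()
    for link_labels, target_index in bridges:
        total.update(votes_from_bridge(link_labels, target_index, window, min_agreement))
    return total.most_common(1)[0][0] if total else None
```

A small window uses only links close to the target, which is where the slide expects topics to agree. Widening it pulls in more distant links and can make a bridge fail the coherence check.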

Effect of exploiting bridges and locality

Conclusions
- New model for citation among hyperlinked documents belonging to various topics
- New categorization algorithm
- Complexity controlled using domain knowledge about citations
- Significant increase in accuracy

Future work
- Better models for joint distribution between terms and links
- Semantic page segmentation to distill “pure” bridges from ones having a mixture of topics
  - Higher complexity
  - Potentially better results
- More clever use of neighbors’ text
- Investigation of the relationship between spatial and semantic proximity

Related work