Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)

Hypertext categorization
- Automatic topic identification
- Also called “supervised learning”
- Given
  - Hypertext document corpus
  - A “small” set of classified documents
- Goal
  - Construct a classifier
  - Apply to new documents

Example from the web

Applications and benefits
- Retrieval
  - Browsing (Yahoo!)
  - Searching (“socks” and NOT “apparel”)
  - Adopted by most search companies
- Profile-based filtering and routing
  - Email, news, “push” services
- Collaborative filtering
  - Automatically categorize click trails
  - Cluster users based on frequently visited topics

Click-trail and bookmark organizer
[Figure: integrated browser showing a view of the topic hierarchy and the web page]

The limitation of text-only classifiers
- Text-only classifiers are well-researched
  - Rule induction
  - Bayesian learning
- 87% accurate on news
- Lower accuracy on hyperlinked corpora
  - Heterogeneous
  - Information in links not utilized

Our contributions
- A novel approach to hypertext classification
  - Combine text and link information
- Framework for link modeling in hypertext graphs
  - Markov random field (limited “sphere of influence”)
- Techniques for feature extraction
  - Use of domain knowledge to limit complexity
- Techniques to handle incomplete information
  - Iterative labeling algorithm

Is this a new problem?
- Reduction to text classification
  - Include (tagged) text from neighbors
  - Classify the result
- Does not increase accuracy
  - Big neighbor pages
  - Lack of semantic correlation

“Big neighbor”

More of “big neighbor”

Coherent pages linking to incoherent pages

Model specification
- A hypertext graph
  - Nodes = documents
  - Edges = hyperlinks
- Document = sequence or set of terms and links
- Each document has a class label
  - Some labels are known
  - Most are unknown
- Labels are drawn from some distribution

Assumptions used in probability model
- No indirect coupling between the text and the neighbors’ classes
- The probability of a node’s class depends only on neighbors within a limited radius
- Independence among the neighbor class probabilities
- Can assume higher-order dependence (neighborhood radius greater than 1)

Probability estimation
- Posterior probability of class given text and neighborhood
- Prior class probability
- Class conditional term distribution
- Class conditional neighbor class distribution (independence between neighbors)
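The slide names the factors but the transcript carries no formula. A sketch of how they could combine under the stated assumptions follows. The symbols are assumptions for illustration only: c is the candidate class, \tau the document's text, N the set of neighbors, and c_j the class of neighbor j. Bayes' rule gives the first step, the "no coupling between text and neighbors' classes" assumption the second, and independence between neighbors the third.

```latex
\begin{aligned}
\Pr\bigl(c \mid \tau, \{c_j\}_{j \in N}\bigr)
  &\propto \Pr(c)\,\Pr\bigl(\tau, \{c_j\}_{j \in N} \mid c\bigr) \\
  &= \Pr(c)\,\Pr(\tau \mid c)\,\Pr\bigl(\{c_j\}_{j \in N} \mid c\bigr) \\
  &= \underbrace{\Pr(c)}_{\text{prior}}\;
     \underbrace{\Pr(\tau \mid c)}_{\text{term distribution}}\;
     \underbrace{\prod_{j \in N} \Pr(c_j \mid c)}_{\text{neighbor class distribution}}
\end{aligned}
```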

Bayesian classification algorithm
- Learning phase (parameter estimation)
  - Distribution of text within a class
  - Interclass linkage probabilities
  - Prior probability of a class
- Classification phase
  - Compute class probabilities
  - Choose the class with highest posterior probability
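The two phases can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a naive Bayes term model with add-one smoothing, and the input format (tuples of terms, neighbor classes, and a label) and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Learning phase. docs: list of (terms, neighbor_classes, label) tuples
    (a hypothetical input format chosen for this sketch)."""
    class_count = Counter()             # for the prior probability of a class
    term_count = defaultdict(Counter)   # term_count[c][t]: text within a class
    link_count = defaultdict(Counter)   # link_count[c][k]: interclass linkage
    vocab = set()
    for terms, nbr_classes, label in docs:
        class_count[label] += 1
        term_count[label].update(terms)
        link_count[label].update(nbr_classes)
        vocab.update(terms)
    return class_count, term_count, link_count, vocab

def log_posteriors(model, terms, nbr_classes):
    """Unnormalized log posterior of each class given text and neighbor classes."""
    class_count, term_count, link_count, vocab = model
    n_docs = sum(class_count.values())
    n_classes = len(class_count)
    scores = {}
    for c in class_count:
        score = math.log(class_count[c] / n_docs)            # prior
        t_total = sum(term_count[c].values())
        for t in terms:
            if t in vocab:                                   # ignore unseen terms
                score += math.log((term_count[c][t] + 1) / (t_total + len(vocab)))
        l_total = sum(link_count[c].values())
        for k in nbr_classes:
            if k in class_count:                             # ignore unseen classes
                score += math.log((link_count[c][k] + 1) / (l_total + n_classes))
        scores[c] = score
    return scores

def classify(model, terms, nbr_classes):
    """Classification phase: choose the class with the highest posterior."""
    scores = log_posteriors(model, terms, nbr_classes)
    return max(scores, key=scores.get)
```

Working in log space avoids underflow when a document has many terms or neighbors. The smoothing constant of 1 is an arbitrary choice for the sketch.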

Partial neighborhood knowledge
- Problem: class of test page depends on neighbors’ classes
  - Must know neighbors’ classes to use interclass probabilities: circularity!
- Solution: iterative labeling
  - Initially classify neighboring nodes using text
  - Repeatedly reclassify until consistent
- Text, link, or joint model
- Will this stabilize?
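The iterative labeling idea can be sketched as follows. This is a simplified hard-label version written for illustration, and the slide does not say whether the authors used hard or soft labels. The data structures (`graph`, `known`, `text_scores`, `link_prob`) and the `max_rounds` cap are assumptions. The cap is there because this sketch gives no guarantee that the labels stabilize.

```python
def iterative_labeling(graph, known, text_scores, link_prob, max_rounds=20):
    """
    graph:       dict node -> list of neighbor nodes
    known:       dict node -> class, for nodes whose label is given (never changed)
    text_scores: dict node -> {class: P(class | text)} from a text-only classifier
    link_prob:   dict (class, neighbor_class) -> interclass linkage probability
    """
    # Step 1: label every unknown node using its text alone.
    labels = dict(known)
    for node, scores in text_scores.items():
        if node not in known:
            labels[node] = max(scores, key=scores.get)

    # Step 2: reclassify with text + current neighbor labels until consistent.
    for _ in range(max_rounds):
        new_labels = dict(labels)
        for node, scores in text_scores.items():
            if node in known:
                continue
            best_class, best_score = None, -1.0
            for c, p_text in scores.items():
                score = p_text
                for nbr in graph.get(node, []):
                    if nbr in labels:
                        # Small floor for class pairs missing from link_prob.
                        score *= link_prob.get((c, labels[nbr]), 1e-6)
                if score > best_score:
                    best_class, best_score = c, score
            new_labels[node] = best_class
        if new_labels == labels:   # no label changed: consistent
            break
        labels = new_labels
    return labels
```

Each round reads only the labels from the previous round, so the result does not depend on the order in which nodes are visited.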

Data set 1: US patent database
- Local text information
  - Title
  - Abstract
- Citation links
  - Related patents cite each other
- Complete knowledge of the neighbors’ classes

Complete knowledge of neighborhood
- Features used:
  - Local text
  - Class tags from neighbor links
- Large gain from tags
- Gains sensitive to tag representation:
  - /Arts
  - /Arts/Painting
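One way to read the /Arts versus /Arts/Painting example is that a neighbor's class tag can be emitted at different depths of the topic hierarchy. The sketch below illustrates that reading. The slide does not say which representation worked best, and the function names and the `NBR_CLASS=` feature prefix are hypothetical.

```python
def truncate_class_path(path, depth):
    """Keep the first `depth` components of a class path such as '/Arts/Painting'."""
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])

def neighbor_tag_features(neighbor_paths, depth):
    """Turn neighbor class paths into tag features at the chosen granularity."""
    return ["NBR_CLASS=" + truncate_class_path(p, depth) for p in neighbor_paths]
```

With `depth=1`, neighbors in /Arts/Painting and /Arts/Music produce the same tag. With `depth=2` they stay distinct, which gives finer features but fewer training examples per tag.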

Partial knowledge of neighborhood
- Algorithm:
  - Grow radius-two neighborhood
  - Delete labels from a fraction of nodes
  - Do iterative labeling
- Observations:
  - Benefit from links
  - Text+Link most robust
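The first two steps of this experimental setup can be sketched as below. This is an assumption-laden illustration. It treats the graph as an undirected adjacency dict, it hides labels uniformly at random, and the function names and the `seed` parameter are hypothetical.

```python
import random
from collections import deque

def neighborhood(graph, start, radius=2):
    """Breadth-first search: all nodes within `radius` links of `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue                      # do not expand past the radius
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)

def hide_labels(labels, nodes, fraction, seed=0):
    """Delete the labels of a random `fraction` of the labeled `nodes`.
    Returns (remaining labels, set of nodes whose labels were hidden)."""
    rng = random.Random(seed)
    candidates = sorted(n for n in nodes if n in labels)
    k = round(fraction * len(candidates))
    hidden = set(rng.sample(candidates, k))
    visible = {n: c for n, c in labels.items() if n not in hidden}
    return visible, hidden
```

The hidden nodes then serve as test cases. An iterative labeling procedure is run on the visible labels and its output is compared with the deleted ones.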

Data set 2: Yahoo!
- Few links point to classified documents
  - 19% of docs have any classified out-link
  - 28% have any classified in-link
  - 40% have either one
- Need to find new source of information and extend the algorithm

Radius-2 information: co-citations
[Figure: a bridge page joined by I-links and O-links to the document to be classified, classified documents, and unclassified documents; path types IO, OI, II/OO]
- An “IO-bridge” connects to many pages of similar topics
- “OI” tends to be noisy (many topics point to Netscape and Free Speech Online)
- “II” and “OO” lead to topic divergence
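An IO path goes from the document back along an in-link to a bridge page, then forward along one of the bridge's out-links to a co-cited page. A minimal sketch of collecting those co-cited pages follows. The `out_links` dict and the function names are assumptions made for illustration.

```python
from collections import Counter

def io_bridge_siblings(out_links, doc):
    """out_links: dict page -> ordered list of pages it links to.
    Returns {bridge: pages co-cited with doc on that bridge}."""
    siblings = {}
    for bridge, targets in out_links.items():
        if doc in targets:                       # bridge has an in-link to doc
            siblings[bridge] = [t for t in targets if t != doc]
    return siblings

def cocited_class_counts(out_links, doc, labels):
    """Count the known classes of all pages reachable from doc by an IO path."""
    counts = Counter()
    for targets in io_bridge_siblings(out_links, doc).values():
        counts.update(labels[t] for t in targets if t in labels)
    return counts
```

Unclassified co-cited pages are skipped here. They could be given labels first by a text-only classifier.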

Link proximity
[Figure: a bridge page listing Link #1, ..., Link #i-1, Link #i, Link #i+1, ...; the links point to the document to be classified and to pages labeled Art, Music, and Unknown]
Are out-links that are close together more likely to point to related topics than out-links that are far apart?

Bridges are locally coherent
- Link proximity implies semantic proximity
- Exploit this source of information
- Huge attribute space
- Simple classification
  - Check coherence
  - Voting
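The "check coherence, then vote" idea can be sketched as a windowed majority vote over a bridge's out-links. The slide gives no details, so the `window` size, the `min_share` coherence threshold, and the function name are all hypothetical choices.

```python
from collections import Counter

def vote_from_bridge(targets, doc, labels, window=2, min_share=0.6):
    """
    targets: ordered out-links of one bridge page (doc must be among them)
    labels:  dict page -> known class
    Returns the majority class among labeled links within `window` positions
    of doc, or None if there are no votes or the window is not coherent enough.
    """
    i = targets.index(doc)
    lo, hi = max(0, i - window), min(len(targets), i + window + 1)
    votes = Counter(labels[t] for t in targets[lo:hi] if t != doc and t in labels)
    if not votes:
        return None
    top_class, top_count = votes.most_common(1)[0]
    if top_count / sum(votes.values()) < min_share:   # coherence check
        return None
    return top_class
```

A narrow window uses only links that sit near the document on the bridge page. Widening it brings in more votes and risks mixing topics.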

Effect of exploiting bridges and locality

Conclusions
- New model for citation among hyperlinked documents belonging to various topics
- New categorization algorithm
- Complexity controlled using domain knowledge about citations
- Significant increase in accuracy

Future work
- Better models for joint distribution between terms and links
- Semantic page segmentation to distill “pure” bridges from ones having a mixture of topics
  - Higher complexity
  - Potentially better results
- More clever use of neighbors’ text
- Investigation of the relationship between spatial and semantic proximity

Related work
