Download presentation
Presentation is loading. Please wait.
Published byWillie Smale Modified over 9 years ago
1
Integrated Instance- and Class- based Generative Modeling for Text Classification Antti PuurulaUniversity of Waikato Sung-Hyon MyaengKAIST 5/12/2013 Australasian Document Computing Symposium
2
Instance vs. Class-based Text Classification Class-based learning Multinomial Naive Bayes, Logistic Regression, Support Vector Machines, … Pros: compact models, efficient inference, accurate with text data Cons: document-level information discarded Instance-based learning K-Nearest Neighbors, Kernel Density Classifiers, … Pros: document-level information preserved, efficient learning Cons: data sparsity reduces accuracy
3
Instance vs. Class-based Text Classification 2 Proposal: Tied Document Mixture integrated instance- and class-based model retains benefits from both types of modeling exact linear time algorithms for estimation and inference Main ideas: replace Multinomial class-conditional in MNB with a mixture over documents smooth document models hierarchically with class and background models
4
Multinomial Naive Bayes Standard generative model for text classification Result of simple generative assumptions Bayes Naive Multinomial
5
Multinomial Naive Bayes 2
6
Tied Document Mixture Replace Multinomial in MNB by a mixture over all documents, where documents models are smoothed hierarchically, where class models are estimated by averaging the documents
7
Tied Document Mixture 2
8
Tied Document Mixture 3
11
Tied Document Mixture 4 Can be described as a class-smoothed Kernel Density Classifier Document mixture equivalent to a Multinomial kernel density Hierarchical smoothing corresponds to mean shift or data sharpening with class-centroids
12
Hierarchical Sparse Inference
13
Hierarchical Sparse Inference 2
14
Hierarchical Sparse Inference 3
20
Hierarchical Sparse Inference 2
21
Experimental Setup 14 classification datasets used: 3 spam classification 3 sentiment analysis 5 multi-class classification 3 multi-label classification Scripts and datasets in LIBSVM format: http://sourceforge.net/projects /sgmweka/
22
Experimental Setup 2 Classifiers compared: Multinomial Naive Bayes (MNB) Tied Document Mixture (TDM) K-Nearest Neighbors (KNN) (Multinomial distance, distance-weighted vote) Kernel Density Classifier (KDC) (Smoothed multinomial kernel) Logistic Regression (LR, LR+) (L2-regularized) Support Vector Machine (SVM, SVM+) (L2-regularized L2-loss) LR+ and SVM+ weighted feature vectors by TFIDF Smoothing parameters optimized for MicroFscore on held-out development sets using Gaussian Random Searches
23
Results Training times for MNB, TDM, KNN and KDC linear At most 70 s for MNB on for OHSU-TREC, 170 s for the others SVM and LR require iterative algorithms At most 936 s, for LR on Amazon12 Did not scale to multi-label datasets in practical times Classification times for instance-based classifiers higher At most mean 226 ms for TDM on OHSU-TREC, compared to 70 ms for MNB (with 290k terms, 196k labels, 197k documents)
24
Results 2 TDM significantly improves on MNB, KNN and KDC Across comparable datasets, TDM is on par with SVM+ SVM+ is significantly better on multi-class datasets TDM is significantly better on spam classification
25
Results 2 TDM significantly improves on MNB, KNN and KDC Across comparable datasets, TDM is on par with SVM+ SVM+ is significantly better on multi-class datasets TDM is significantly better on spam classification
26
Results 3 TDM reduces classification errors compared to MNB by: >65% in spam classification >26% in sentiment analysis Some correlation between error reduction and number of instances/class. Task types form clearly separate clusters
27
Conclusion Tied Document Mixture Integrated instance- and class-based model for text classification Exact linear time algorithms, with same complexities as KNN and KDC Accuracy substantially improved over MNB, KNN and KDC Competitive with optimized SVM, depending on task type Many improvements to the basic model possible Sparse inference scales to hierarchical mixtures of >340k components Toolkit, datasets and scripts available: http://sourceforge.net/projects/sgmweka/
28
Sparse Inference
29
Sparse Inference 2
30
Sparse Inference 3
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.