Integrated Instance- and Class-based Generative Modeling for Text Classification
Antti Puurula, University of Waikato
Sung-Hyon Myaeng, KAIST
5/12/2013, Australasian Document Computing Symposium

Instance vs. Class-based Text Classification
Class-based learning: Multinomial Naive Bayes, Logistic Regression, Support Vector Machines, …
  Pros: compact models, efficient inference, accurate with text data
  Cons: document-level information discarded
Instance-based learning: K-Nearest Neighbors, Kernel Density Classifiers, …
  Pros: document-level information preserved, efficient learning
  Cons: data sparsity reduces accuracy
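As a side note not in the original slides, the contrast above can be made concrete with a minimal scikit-learn sketch: MultinomialNB stands in for the class-based side and KNeighborsClassifier for the instance-based side, trained on the same bag-of-words counts. This is purely illustrative; the talk's experiments use the authors' own toolkit, not scikit-learn, and the dataset below is just a convenient stand-in.

```python
# Illustrative contrast of a class-based and an instance-based classifier
# on the same bag-of-words features (scikit-learn, not the authors' toolkit).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = CountVectorizer()
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)

# Class-based: one multinomial per class; training documents are discarded after estimation.
mnb = MultinomialNB().fit(X_train, train.target)
# Instance-based: all training documents are kept and compared against at query time.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine").fit(X_train, train.target)

print("MNB accuracy:", mnb.score(X_test, test.target))
print("KNN accuracy:", knn.score(X_test, test.target))
```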

Instance vs. Class-based Text Classification 2
Proposal: Tied Document Mixture
  Integrated instance- and class-based model
  Retains benefits from both types of modeling
  Exact linear-time algorithms for estimation and inference
Main ideas:
  Replace the Multinomial class-conditional in MNB with a mixture over documents
  Smooth the document models hierarchically with class and background models

Multinomial Naive Bayes
Standard generative model for text classification
Result of simple generative assumptions:
  Bayes: P(c | d) ∝ P(c) · P(d | c)
  Naive: terms are conditionally independent given the class
  Multinomial: P(d | c) ∝ ∏_w P(w | c)^{n_w(d)}, where n_w(d) is the count of term w in document d

Multinomial Naive Bayes 2
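A minimal NumPy sketch of the MNB decision rule above, with add-alpha (Laplace) smoothing of the class-conditional term probabilities; the function and variable names are my own, not taken from the toolkit.

```python
import numpy as np

def train_mnb(X, y, n_classes, alpha=1.0):
    """X: dense (n_docs, n_terms) count matrix, y: integer class ids.
    Returns log P(c) and log P(w|c) with add-alpha smoothing."""
    n_docs, n_terms = X.shape
    log_prior = np.log(np.bincount(y, minlength=n_classes) / n_docs)
    counts = np.vstack([X[y == c].sum(axis=0) for c in range(n_classes)])
    log_pwc = np.log((counts + alpha) /
                     (counts.sum(axis=1, keepdims=True) + alpha * n_terms))
    return log_prior, log_pwc

def predict_mnb(X, log_prior, log_pwc):
    # log P(c|d) = log P(c) + sum_w n_w(d) * log P(w|c) + const.
    return np.argmax(X @ log_pwc.T + log_prior, axis=1)
```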

Tied Document Mixture
Replace the Multinomial in MNB by a mixture over all documents
  Document models are smoothed hierarchically
  Class models are estimated by averaging the documents

Tied Document Mixture 2

Tied Document Mixture 3

Tied Document Mixture 4
Can be described as a class-smoothed Kernel Density Classifier
  The document mixture is equivalent to a Multinomial kernel density
  Hierarchical smoothing corresponds to mean shift or data sharpening with class centroids
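Under my reading of the slides, the TDM class-conditional can be sketched as follows: each training document gets a multinomial interpolated with its (background-smoothed) class model, and a class is scored by a uniform mixture over its documents. The interpolation weights, the restriction of each mixture to the class's own documents, and the omission of the class prior are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def _log_mean_exp(log_terms):
    """Stable log of the average of exp(log_terms) (uniform mixture weights)."""
    log_terms = np.asarray(log_terms, dtype=float)
    m = log_terms.max()
    return m + np.log(np.mean(np.exp(log_terms - m)))

def tdm_log_scores(q, X, y, n_classes, a_doc=0.7, a_class=0.8):
    """Score a query count vector q (length n_terms) against each class.

    Sketch only: document models are smoothed toward their class model,
    which is itself smoothed toward a background model; a_doc and a_class
    are placeholder interpolation weights, not values from the paper.
    Assumes every class has at least one training document."""
    n_docs, n_terms = X.shape
    background = (X.sum(axis=0) + 1.0) / (X.sum() + n_terms)
    scores = np.empty(n_classes)
    for c in range(n_classes):
        Xc = X[y == c]
        p_class = Xc.sum(axis=0) / max(Xc.sum(), 1)
        p_class = a_class * p_class + (1 - a_class) * background
        log_mix = []
        for d in Xc:                              # mixture over the class's documents
            p_doc = a_doc * d / max(d.sum(), 1) + (1 - a_doc) * p_class
            log_mix.append(np.sum(q * np.log(p_doc)))
        scores[c] = _log_mean_exp(log_mix)
    return scores
```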

Hierarchical Sparse Inference

Hierarchical Sparse Inference 2

Hierarchical Sparse Inference 3

Hierarchical Sparse Inference 4
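The equations on the sparse-inference slides are not captured in this transcript. One way to get the claimed linear-time behaviour, sketched below as an assumption rather than a reproduction of the authors' algorithm, is to note that a query term absent from a training document contributes only through the smoothing distribution of that document's class: a per-class baseline can then be computed once, and individual documents corrected only where their terms intersect the query's, via an inverted index.

```python
import numpy as np
from collections import defaultdict

def sparse_tdm_scores(q, docs, doc_class, p_smooth, a=0.7):
    """Sketch of sparse scoring for a document mixture.

    q:         dict term -> count for the query document
    docs:      list of dicts term -> count (sparse training documents)
    doc_class: class id of each training document
    p_smooth:  dict (class, term) -> smoothed class/background probability,
               assumed strictly positive for every (class, term) pair
    a:         document-level interpolation weight (placeholder value)
    """
    classes = sorted(set(doc_class))
    # Per-class baseline: score of the query under "smoothing only".
    base = {c: sum(n * np.log((1 - a) * p_smooth[(c, w)]) for w, n in q.items())
            for c in classes}
    # Inverted index: term -> ids of training documents containing it.
    index = defaultdict(list)
    for i, d in enumerate(docs):
        for w in d:
            index[w].append(i)
    # Start every document at its class baseline, then apply sparse corrections
    # only for terms shared between the query and the document.
    scores = np.array([base[c] for c in doc_class], dtype=float)
    doc_len = [sum(d.values()) for d in docs]
    for w, n in q.items():
        for i in index.get(w, ()):
            c = doc_class[i]
            p_doc = a * docs[i][w] / doc_len[i] + (1 - a) * p_smooth[(c, w)]
            scores[i] += n * (np.log(p_doc) - np.log((1 - a) * p_smooth[(c, w)]))
    return scores
```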

Experimental Setup
14 classification datasets used:
  3 spam classification
  3 sentiment analysis
  5 multi-class classification
  3 multi-label classification
Scripts and datasets in LIBSVM format: /sgmweka/

Experimental Setup 2
Classifiers compared:
  Multinomial Naive Bayes (MNB)
  Tied Document Mixture (TDM)
  K-Nearest Neighbors (KNN) (Multinomial distance, distance-weighted vote)
  Kernel Density Classifier (KDC) (smoothed Multinomial kernel)
  Logistic Regression (LR, LR+) (L2-regularized)
  Support Vector Machine (SVM, SVM+) (L2-regularized, L2-loss)
LR+ and SVM+ weight the feature vectors by TF-IDF
Smoothing parameters optimized for Micro-Fscore on held-out development sets using Gaussian random searches
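As a rough analogue of the tuning protocol above (the experiments themselves use the authors' toolkit, not scikit-learn), one could load a LIBSVM-format dataset, hold out a development split, and pick a smoothing parameter by Micro-Fscore; a plain log-uniform random search stands in for the Gaussian random search mentioned in the slide, and the file name is a placeholder.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# "train.libsvm" is a placeholder path, not a file shipped with the talk.
X, y = load_svmlight_file("train.libsvm")
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
best_alpha, best_f1 = None, -1.0
for _ in range(20):                      # simple random search over the smoothing prior
    alpha = 10 ** rng.uniform(-3, 1)     # log-uniform in [1e-3, 10]
    clf = MultinomialNB(alpha=alpha).fit(X_tr, y_tr)
    f1 = f1_score(y_dev, clf.predict(X_dev), average="micro")
    if f1 > best_f1:
        best_alpha, best_f1 = alpha, f1
print(f"best alpha = {best_alpha:.4g}, dev Micro-F1 = {best_f1:.3f}")
```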

Results
Training times for MNB, TDM, KNN and KDC are linear
  At most 70 s for MNB on OHSU-TREC, 170 s for the others
SVM and LR require iterative algorithms
  At most 936 s, for LR on Amazon12
  Did not scale to the multi-label datasets in practical time
Classification times for the instance-based classifiers are higher
  At most a mean of 226 ms for TDM on OHSU-TREC, compared to 70 ms for MNB (with 290k terms, 196k labels, 197k documents)

Results 2
TDM significantly improves on MNB, KNN and KDC
Across comparable datasets, TDM is on par with SVM+
  SVM+ is significantly better on multi-class datasets
  TDM is significantly better on spam classification

Results 3
TDM reduces classification errors compared to MNB by:
  >65% in spam classification
  >26% in sentiment analysis
Some correlation between error reduction and the number of instances per class
Task types form clearly separate clusters
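For reference, the relative error reduction quoted above is (err_MNB - err_TDM) / err_MNB; a quick check with made-up error rates (the numbers below are illustrative, not taken from the paper):

```python
def error_reduction(err_baseline, err_model):
    """Relative error reduction of a model over a baseline, e.g. TDM over MNB."""
    return (err_baseline - err_model) / err_baseline

# Illustrative only: a 10% baseline error reduced to 3% is a 70% (>65%) reduction.
print(f"{error_reduction(0.10, 0.03):.0%}")
```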

Conclusion
Tied Document Mixture: an integrated instance- and class-based model for text classification
  Exact linear-time algorithms, with the same complexities as KNN and KDC
  Accuracy substantially improved over MNB, KNN and KDC
  Competitive with optimized SVM, depending on task type
Many improvements to the basic model are possible
Sparse inference scales to hierarchical mixtures of >340k components
Toolkit, datasets and scripts available:

Sparse Inference

Sparse Inference 2

Sparse Inference 3