Improving Text Classification by Shrinkage in a Hierarchy of Classes Andrew McCallum Just Research & CMU Tom Mitchell CMU Roni Rosenfeld CMU Andrew Y. Ng MIT AI Lab

2 The Task: Document Classification (also "Document Categorization", "Routing", or "Tagging"): automatically placing documents in their correct categories.
[Figure: example categories Magnetism, Relativity, Evolution, Botany, Irrigation, and Crops, each with a list of training words (e.g. Crops: corn, wheat, silo, farm, grow, …); the test document "grow corn tractor…" is assigned to Crops.]

3 The Idea: "Shrinkage" / "Deleted Interpolation". We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors.
[Figure: the class hierarchy — Science over Physics (Magnetism, Relativity), Biology (Evolution, Botany), and Agriculture (Irrigation, Crops) — with training words at the leaves; the test document "corn grow tractor…" is assigned to Crops.]

4 A Probabilistic Approach to Document Classification
Naïve Bayes: classify document d into the class c_j that maximizes Pr(c_j) ∏_i Pr(w_{d_i} | c_j), where w_{d_i} is the i-th word of document d.
Maximum a posteriori estimate of Pr(w|c), with a Dirichlet prior, α = 1 (a.k.a. Laplace smoothing):
Pr(w|c) = (1 + Σ_{d ∈ c} N(w,d)) / (|V| + Σ_{w' ∈ V} Σ_{d ∈ c} N(w',d)),
where N(w,d) is the number of times word w occurs in document d and |V| is the vocabulary size.
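As an illustration (not from the slides), the smoothed per-class word estimates and the naïve Bayes decision rule can be sketched in a few lines of Python; the function names `train_naive_bayes` and `classify` are hypothetical:

```python
import math
from collections import Counter

def train_naive_bayes(docs_by_class, vocab):
    """MAP estimate of Pr(w|c) with a Dirichlet prior, alpha = 1 (Laplace smoothing)."""
    word_probs = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        # Pr(w|c) = (1 + N(w,c)) / (|V| + N(c))
        word_probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return word_probs

def classify(doc, word_probs, priors):
    """Pick the class maximizing log Pr(c) + sum_i log Pr(w_i | c)."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
    return max(word_probs, key=score)
```

Smoothing matters here: without the +1 count, a single unseen word would zero out the whole product for a class.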

5 "Shrinkage" / "Deleted Interpolation" [James and Stein, 1961] / [Jelinek and Mercer, 1980]
Interpolate the leaf's estimate with its ancestors' estimates along the path to the root, topped by a uniform distribution:
P̂r(w|c) = λ_1 Pr(w|c) + λ_2 Pr(w|parent(c)) + … + λ_k Pr(w|root) + λ_{k+1} · 1/|V|.
[Figure: the Science hierarchy — Physics (Magnetism, Relativity), Biology (Evolution, Botany), Agriculture (Irrigation, Crops) — with the uniform distribution above the root.]
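A minimal sketch of this interpolated estimate (the helper name `shrinkage_estimate` and its argument layout are assumptions, not from the talk):

```python
def shrinkage_estimate(word, path_estimates, lambdas, vocab_size):
    """Shrinkage estimate of Pr(w|c): a lambda-weighted mixture of the
    estimates along the path leaf -> ... -> root, plus a uniform 1/|V| term.

    path_estimates: list of {word: prob} dicts, leaf first, root last.
    lambdas: one weight per level plus a final weight for the uniform
             model; the weights must sum to 1.
    """
    assert abs(sum(lambdas) - 1.0) < 1e-9
    p = sum(lam * est.get(word, 0.0) for lam, est in zip(lambdas, path_estimates))
    p += lambdas[-1] / vocab_size  # uniform backstop: every word keeps nonzero mass
    return p
```

A leaf with little training data can thus borrow statistics from its better-estimated ancestors.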

6 Learning Mixture Weights
Learn the λ's via EM, performing the E-step with leave-one-out cross-validation.
E-step: use the current λ's to estimate the degree to which each node was likely to have generated the words in held-out documents.
M-step: use these estimates to recalculate new values for the λ's.
[Figure: the path Uniform → Science → Agriculture → Crops, with a λ at each level.]

7 Learning Mixture Weights
E-step: for each held-out word w of class c, compute the degree to which level i generated it: β_i(w) = λ_i Pr_i(w|c) / Σ_m λ_m Pr_m(w|c).
M-step: λ_i ← Σ_w β_i(w) / Σ_m Σ_w β_m(w).
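The E/M updates can be sketched as follows (a simplified illustration: it uses a fixed held-out word list rather than the leave-one-out scheme described on the previous slide, and the function name `learn_lambdas` is hypothetical):

```python
def learn_lambdas(held_out_words, path_estimates, vocab_size, n_iters=20):
    """EM for deleted-interpolation weights: the E-step credits each held-out
    word to the hierarchy levels likely to have generated it; the M-step
    renormalizes those expected counts into new lambdas."""
    k = len(path_estimates) + 1  # ancestor levels plus the uniform model
    lambdas = [1.0 / k] * k
    for _ in range(n_iters):
        beta = [0.0] * k  # expected generation counts per level
        for w in held_out_words:
            comps = [est.get(w, 0.0) for est in path_estimates] + [1.0 / vocab_size]
            weighted = [lam * p for lam, p in zip(lambdas, comps)]
            z = sum(weighted)  # nonzero thanks to the uniform component
            for i in range(k):
                beta[i] += weighted[i] / z  # E-step
        total = sum(beta)
        lambdas = [b / total for b in beta]  # M-step
    return lambdas
```

Levels whose estimates explain the held-out words well (here, a well-trained leaf) end up with the largest weights.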

8 Newsgroups Data Set
15 classes, 15k documents, 1.7 million words, 52k-word vocabulary (a subset of Ken Lang's 20 Newsgroups set).
[Figure: the hierarchy — computers: mac, ibm, graphics, windows, X; religion: atheism, christian, misc; sport: baseball, hockey; politics: guns, mideast, misc; motor: auto, motorcycle.]

9 Newsgroups Hierarchy Mixture Weights
[Chart: the learned mixture weights at each level of the hierarchy.]

10 Newsgroups Hierarchy Mixture Weights
[Charts: learned weights with 235 training documents (15/class) vs. 7497 training documents (~500/class).]

11 Industry Sector Data Set
71 classes, 6.5k documents, 1.2 million words, 30k-word vocabulary.
[Figure: part of the sector hierarchy, with top-level sectors transportation, utilities, consumer, energy, and services, and leaves including air, railroad, trucking, coal, oil&gas, film, communication, electric, water, gas, appliance, furniture, and integrated; "… (11)" elided in the original.]

12 Industry Sector Classification Accuracy

13 Newsgroups Classification Accuracy

14 Yahoo Science Data Set
264 classes, 14k documents, 3 million words, 76k-word vocabulary.
[Figure: part of the hierarchy — agriculture: dairy, crops, agronomy, forestry; biology: botany, evolution, cell; physics: magnetism, relativity; CS: AI, HCI, courses; space: craft, missions; "… (30)" elided in the original.]

15 Yahoo Science Classification Accuracy

16 Pruning the Tree for Computational Efficiency
[Figure: the Industry Sector hierarchy again — transportation, utilities, consumer, energy, services, and their leaves.]

17 Related Work
– Shrinkage in statistics: [Stein 1955], [James & Stein 1961]
– Deleted interpolation in language modeling: [Jelinek & Mercer 1980], [Seymore & Rosenfeld 1997]
– Bayesian hierarchical modeling for n-grams: [MacKay & Peto 1994]
– Class hierarchies for text classification: [Koller & Sahami 1997]
– Using EM to set mixture weights in a hierarchical clustering model for unsupervised learning: [Hofmann & Puzicha 1998]

18 Conclusions
– Shrinkage in a hierarchy of classes can dramatically improve classification accuracy (by up to 29%).
– Shrinkage helps especially when training data is sparse.
– In models more complex than naïve Bayes, it should be even more helpful.
– The hierarchy can be pruned for an exponential reduction in the computation needed for classification, with only minimal loss of accuracy.

19 Future Work
– Learning hierarchies that aid classification.
– Using more complex generative models:
  – Capturing word dependencies
  – Clustering words in each ancestor