Joint Max Margin & Max Entropy Learning of Graphical Models

Presentation transcript:

Joint Max Margin & Max Entropy Learning of Graphical Models
Eric Xing (epxing@cs.cmu.edu), Machine Learning Dept. / Language Technologies Inst. / Computer Science Dept., Carnegie Mellon University. NIPS 2009, Vancouver, Canada.

Structured Inference Problem
Unstructured prediction vs. structured prediction. Example applications: part-of-speech tagging and image segmentation. E.g., "Do you want sugar in it?" ⇒ <verb pron verb noun prep pron>.

Classical Predictive Models
Predictive function: a parametric discriminant learned from data. Examples: logistic regression and Bayes classifiers (max-likelihood estimation); support vector machines (SVMs; max-margin learning). This is the classical approach to the discriminative prediction problem.
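The predictive function and the two estimation criteria appear as equations on the slide; a standard rendering (in generic notation, with f(x, y) a joint feature map and w the weights, not necessarily the slide's own symbols) would be:

    h(x) = \arg\max_y F(x, y; w) = \arg\max_y w^\top f(x, y)
    \text{MLE: } \max_w \sum_i \log p(y_i \mid x_i; w)
    \text{Max-margin: } \min_{w, \xi} \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
    \quad \text{s.t. } w^\top f(x_i, y_i) - w^\top f(x_i, y) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i, \forall y \ne y_i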

Structured Prediction Models
Conditional Random Fields (CRFs) (Lafferty et al., 2001): based on logistic regression; max-likelihood estimation (point estimate). Max-margin Markov Networks (M3Ns) (Taskar et al., 2003): based on SVMs; max-margin learning (point estimate). In both cases the Markov properties are encoded in the feature functions.
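For reference, the standard CRF conditional distribution (the usual form of Lafferty et al., 2001, written here as a sketch rather than copied from the slide):

    p(y \mid x; w) = \frac{1}{Z(x; w)} \exp\{ w^\top f(x, y) \}, \qquad Z(x; w) = \sum_{y'} \exp\{ w^\top f(x, y') \}

The M3N objective is spelled out on the primal/dual slide below.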

Example I: Image Segmentation
Jointly segmenting/annotating images; image-image matching and image-text matching. Problem: given the structure (features), learning the predictor; and, more ambitiously, learning sparse, interpretable predictive structures/features.

Example II: Genome-Phenome Association in Complex Diseases
Pleiotropic effects; epistatic effects.

Structured Prediction Models (cont'd)
Conditional Random Fields (CRFs) (Lafferty et al., 2001): based on logistic regression; max-likelihood estimation (point estimate). Max-margin Markov Networks (M3Ns) (Taskar et al., 2003): based on SVMs; max-margin learning (point estimate). Challenges: sparse, "interpretable" prediction models; prior information about structures; latent structures/variables; time series and non-stationarity; scalability to large-scale problems (e.g., 10^4 input/output dimensions).

Outline
Maximum entropy discrimination Markov networks: general theory (Zhu and Xing, JMLR 2009; Zhu and Xing, ICML 2009). Gaussian MEDN: reduction to M3N (Zhu, Xing and Zhang, ICML 2008). Laplace MEDN: a sparse M3N (Zhu, Xing and Zhang, ICML 2008). Partially observed MEDN (Zhu, Xing and Zhang, NIPS 2008). Max-margin/max-entropy topic model (Zhu, Ahmed and Xing, ICML 2009).

MLE versus Max-Margin Learning
Likelihood-based estimation: probabilistic (joint/conditional likelihood model); easy to perform Bayesian learning and to incorporate prior knowledge, latent structures, and missing data; Bayesian or direct regularization; hidden structures or a generative hierarchy. Max-margin learning: non-probabilistic (concentrates on the input-output mapping); not obvious how to perform Bayesian learning or handle priors and missing data; support vector property and sound theoretical guarantees with limited samples; kernel tricks. Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999): model averaging; the optimization problem for binary classification is sketched below. These models fall into two different learning paradigms. MED basically performs Bayesian learning; it has many nice properties and subsumes SVM as a special case.
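The MED program itself is an image on the slide; a sketch of the Jaakkola et al. (1999) formulation for binary classification, in generic notation (F(x; w) the discriminant, ξ the margin variables):

    \min_{q(w, \xi)} \; KL\big(q(w, \xi) \,\|\, p_0(w, \xi)\big)
    \quad \text{s.t. } \int q(w, \xi)\, \big[ y_i F(x_i; w) - \xi_i \big]\, dw\, d\xi \;\ge\; 0 \;\; \forall i

with the averaged prediction \hat{y} = \mathrm{sign}\, \mathbb{E}_{q}[F(x; w)].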

Max-Margin Learning Paradigms
(Diagram relating SVM, M3N, and MED, illustrated with a structured output such as the word "brace".) MED-MN = SMED + "Bayesian" M3N?

Primal and Dual Problems of M3Ns
Primal problem: solved by cutting-plane, sub-gradient, and related algorithms. Dual problem: solved by SMO, exponentiated gradient, and related algorithms. So M3N is dual sparse!
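A standard statement of the M3N primal and dual (Taskar et al., 2003), in the usual notation with Δf_i(y) = f(x_i, y_i) − f(x_i, y) and structured loss Δℓ_i(y) — a transcription, since the slide's programs are not reproduced:

    \text{Primal: } \min_{w, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
    \quad \text{s.t. } w^\top \Delta f_i(y) \ge \Delta\ell_i(y) - \xi_i \;\; \forall i, \forall y
    \text{Dual: } \max_{\alpha \ge 0} \; \sum_{i, y} \alpha_i(y)\, \Delta\ell_i(y) - \tfrac{1}{2} \Big\| \sum_{i, y} \alpha_i(y)\, \Delta f_i(y) \Big\|^2
    \quad \text{s.t. } \sum_y \alpha_i(y) = C \;\; \forall i

Only the constraints with nonzero multipliers α_i(y) (the support vectors) contribute to the solution, hence the dual sparsity.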

MaxEnt Discrimination Markov Network (Zhu et al., ICML 2008; Zhu and Xing, JMLR 2009)
Structured MaxEnt Discrimination (SMED): learn a distribution over weights restricted to a feasible subspace defined by expected margin constraints, and predict by averaging over a distribution of M3Ns (see the sketch below).
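A sketch of the SMED problem in MaxEnDNet-style notation (a transcription, since the slide's equations are missing; ΔF_i(y; w) = F(x_i, y_i; w) − F(x_i, y; w), and U(ξ) a convex slack penalty such as C Σ_i ξ_i):

    \min_{q(w), \xi} \; KL\big(q(w) \,\|\, p_0(w)\big) + U(\xi)
    \quad \text{s.t. } \int q(w)\, \big[ \Delta F_i(y; w) - \Delta\ell_i(y) \big]\, dw \ge -\xi_i, \;\; \xi_i \ge 0 \;\; \forall i, \forall y
    \text{prediction: } h(x) = \arg\max_y \int q(w)\, F(x, y; w)\, dw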

Solution to MaxEnDNet
Theorem 1 characterizes the posterior distribution q(w) and the corresponding dual optimization problem.
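A sketch of the usual form of this result (not the slide's exact statement): the posterior is the prior reweighted by the exponentiated margin terms, and the dual is over the multipliers α:

    q(w) = \frac{1}{Z(\alpha)}\, p_0(w)\, \exp\Big\{ \sum_{i, y} \alpha_i(y)\, \big[ \Delta F_i(y; w) - \Delta\ell_i(y) \big] \Big\}
    \max_{\alpha \ge 0} \; -\log Z(\alpha) - U^*(\alpha)

where Z(α) is the normalization constant and U* is the convex conjugate of the slack penalty U.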

Gaussian MaxEnDNet (reduction to M3N)
Theorem 2: assume a standard normal prior on w. Then the posterior distribution is Gaussian, the dual optimization recovers the M3N dual, and the predictive rule coincides with that of M3N. Thus MaxEnDNet subsumes M3Ns and admits all the merits of max-margin learning. Furthermore, MaxEnDNet has at least three additional advantages.
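Concretely, a sketch under the standard-normal-prior assumption (again a transcription, not the slide's exact formulas):

    p_0(w) = \mathcal{N}(0, I) \;\;\Rightarrow\;\; q(w) = \mathcal{N}(\mu, I), \quad \mu = \sum_{i, y} \alpha_i(y)\, \Delta f_i(y)
    \text{predictive rule: } h(x) = \arg\max_y \; \mu^\top f(x, y)

with the dual over α reducing to the M3N dual shown earlier.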

Three Advantages
An averaging model: PAC-Bayesian prediction error guarantee (Theorem 3). Entropy regularization: introducing useful biases; a standard normal prior gives a reduction to the standard M3N (as we have seen), while a Laplace prior gives posterior shrinkage effects (a sparse M3N). Integrating generative and discriminative principles: incorporating latent variables and structures (PoMEN) and semi-supervised learning (with partially labeled data).

I: Generalization Guarantee
MaxEnDNet is an averaging model. Theorem 3 (PAC-Bayes bound) bounds the expected prediction error of the averaged predictor in terms of its empirical margin error and the KL divergence from the prior.
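The bound itself is not reproduced here; for flavor only, a generic PAC-Bayes margin bound (not the paper's exact statement or constants) has the shape:

    \text{w.p.} \ge 1 - \delta: \quad \mathrm{err}(q) \;\le\; \widehat{\mathrm{err}}_\gamma(q) + O\!\left( \sqrt{ \frac{ KL(q \,\|\, p_0) + \ln(m/\delta) }{ m } } \right)

where m is the sample size, err(q) the expected error of the averaging predictor, and err̂_γ(q) an empirical margin error.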

II: Laplace MaxEnDNet (primal sparse M3N)
Laplace prior on w. Corollary 4: under a Laplace MaxEnDNet, the posterior mean of the parameter vector w exhibits a shrinkage effect. The Gaussian MaxEnDNet and the regular M3N have no such shrinkage; there the posterior mean coincides with the unshrunk M3N solution.
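The Laplace prior in a common parameterization (a sketch; the slide's constants may differ):

    p_0(w \mid \lambda) = \prod_k \frac{\sqrt{\lambda}}{2}\, \exp\big( -\sqrt{\lambda}\, |w_k| \big)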

LapMEDN vs. L1 and L2 Regularization (Zhu and Xing, ICML 2009)
Corollary 5: LapMEDN corresponds to solving a primal optimization problem with a "KL-norm" regularizer, compared on the slide against the L1 and L2 norms.

Variational Learning of LapMEDN
The exact primal or dual function is hard to optimize. Using the hierarchical representation of the Laplace prior (see the sketch below), we optimize an upper bound instead. Why is this easier? Alternating minimization leads to nicer subproblems: with the variance (scale) terms held fixed, the effective prior is normal and the subproblem is an M3N optimization problem; with the weight distribution held fixed, the scale terms and their expectations have closed-form solutions.
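The hierarchical (scale-mixture) representation referred to here is the standard identity (a transcription matching the parameterization above):

    \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\, |w_k|} \;=\; \int_0^{\infty} \mathcal{N}(w_k \mid 0, \tau_k)\, \frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\, d\tau_k

Conditioned on the scales τ the prior is Gaussian, which is why the step over the weight distribution becomes an M3N problem.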

Experimental Results on OCR Datasets
We approach structured prediction problems as learning a mapping from a structured input x to a structured output y. For example, in handwriting recognition we have sequential structure: we seek a mapping from strings of letter images to coherent strings of letters a-z (e.g., the word "brace").

Experimental Results on OCR Datasets
We randomly construct OCR100, OCR150, OCR200, and OCR250 subsets for 10-fold cross-validation.

Feature Selection

Sensitivity to Regularization Constants
L1-CRFs are much more sensitive to the regularization constant; the other models are more stable, and LapM3N is the most stable. Regularization constants tried for L1-CRF and L2-CRF: 0.001, 0.01, 0.1, 1, 4, 9, 16; for M3N and LapM3N: 1, 4, 9, 16, 25, 36, 49, 64, 81.

III: Latent Hierarchical MaxEnDNet
Application: web data extraction. Goal: extract Name, Image, Price, Description, etc. Hierarchical labeling (e.g., {Head}, {Tail}, {Info Block}, {Repeat Block}, {Note} for blocks, and {image}, {name}, {price}, {desc} for leaves). Advantages: computational efficiency, long-range dependency, joint extraction.

Partially Observed MaxEnDNet (PoMEN) (Zhu et al., NIPS 2008)
Now we are given partially labeled data: part of the structured output is observed and the rest is treated as hidden variables. PoMEN extends the MaxEnDNet learning problem and predictive rule to this setting (a sketch is given below); the next slide describes how to solve the resulting problem.
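A rough sketch of the formulation (generic notation, under the assumption that the hidden part z of the output is handled by learning a joint distribution q(w, z), consistent with the alternating algorithm on the next slide):

    \min_{q(w, z), \xi} \; KL\big(q(w, z) \,\|\, p_0(w, z)\big) + U(\xi)
    \quad \text{s.t. } \mathbb{E}_{q}\big[ \Delta F_i(y, z; w) \big] \ge \Delta\ell_i(y) - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i, \forall y
    \text{prediction: } h(x) = \arg\max_{y} \; \mathbb{E}_{q(w)\, q(z)}\big[ F(x, y, z; w) \big]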

Alternating Minimization Algorithm
Factorization assumption: the joint distribution factorizes over the weights and the hidden variables. Alternating minimization: Step 1: keep the hidden-variable distribution fixed and optimize over the weight distribution; with a normal prior this is an M3N problem (a QP), and with a Laplace prior it is a Laplace M3N problem (solved by variational Bayes). Step 2: keep the weight distribution fixed and optimize over the hidden-variable distribution; this can be equivalently reduced to an LP with a polynomial number of constraints.

Experimental Results
Web data extraction: Name, Image, Price, Description. Methods: hierarchical CRFs, hierarchical M^3N, PoMEN, and partially observed HCRFs. Pages from 37 templates. Training: 185 pages (5 per template), or 1,585 data records. Testing: 370 pages (10 per template), or 3,391 data records. Record-level evaluation: leaf nodes are labeled. Page-level evaluation: Supervision Level 1: leaf nodes and data-record nodes are labeled; Supervision Level 2: Level 1 plus the nodes above the data-record nodes.

Record-Level Evaluations
Overall performance: average F1 (average F1 over all attributes) and block instance accuracy (the percentage of records whose Name, Image, and Price are all correct). Performance is also reported per attribute.

Page-Level Evaluations
Supervision Level 1: leaf nodes and data-record nodes are labeled. Supervision Level 2: Level 1 plus the nodes above the data-record nodes.

IV: Max-Margin/Max-Entropy Topic Model – MedLDA

LDA: A Generative Story for Documents
Bag-of-words representation of documents; each word is generated by one topic; each document is a random mixture over topics. Illustration with two topics: Topic #1 (top words: image, jpg, gif, file, color, images, files, format) is about graphics, while Topic #2 (top words: ground, wire, power, wiring, current, circuit) is about electronics. Document #1 ("gif jpg image current file color images ground power file current format file formats circuit gif images") is mostly generated by Topic #1, so its mixture weight for Topic #1 is much larger; Document #2 ("wire currents file format ground power image format wire circuit current wiring ground circuit images files…") is more likely expressed by Topic #2. The topics are the mixture components, and the per-document topic proportions are the mixture weights. The models discussed here are all based on such a generative procedure: each topic is a distribution over the words of a dictionary, each document is an admixture over topics, and each word is generated by a particular topic. Performing Bayesian learning with a Dirichlet prior over the mixture weights gives the LDA model.

LDA: Latent Dirichlet Allocation (Blei et al., 2003)
Generative procedure: for each document d, sample a topic proportion θ_d; for each word, sample a topic z_{d,n} and then sample a word w_{d,n}. In the graphical representation, plates denote replication: z_{d,n} is the topic assignment for word w_{d,n}, θ_d is the topic proportion of document d, β is a matrix whose rows are topics, and α is the parameter of the Dirichlet prior over the mixture weights θ. LDA defines a joint distribution over θ, z, and the words w. Exact inference of the posterior is intractable, so variational approximation is a popular method: introduce a variational distribution q(z, θ), derive a bound on the likelihood, and optimize the bound to estimate parameters and infer the posterior distribution (see the sketch below).
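The joint distribution and variational bound referred to here, in the standard LDA form (per document d with N_d words; a transcription rather than the slide's own equations):

    p(\theta_d, z_d, w_d \mid \alpha, \beta) = p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta)
    \log p(w_d \mid \alpha, \beta) \;\ge\; \mathbb{E}_q\big[ \log p(\theta_d, z_d, w_d \mid \alpha, \beta) \big] - \mathbb{E}_q\big[ \log q(\theta_d, z_d) \big] \;\equiv\; \mathcal{L}(q)

Maximizing L(q) (equivalently, minimizing the bound on the negative log-likelihood) gives the parameter estimates and the approximate posterior.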

Supervised Topic Model (sLDA)
LDA ignores documents' side information (e.g., categories or rating scores), which leads to suboptimal topic representations for supervised tasks. Supervised topic models handle such problems, e.g., sLDA (Blei & McAuliffe, 2007) and DiscLDA (Simon et al., 2008). Generative procedure (sLDA): for each document d, sample a topic proportion; for each word, sample a topic and then sample a word; finally, sample the response variable. Our models are based on sLDA. In sLDA the parameters are unknown constants, and the generative model defines a joint distribution over the response variables and the input documents. As with LDA, we can introduce a variational distribution, derive a variational upper bound on the negative log-likelihood, and use EM to estimate model parameters and infer the posterior distribution of the latent variables θ and z.
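For the regression case, the sLDA response model in its standard form (Blei & McAuliffe, 2007), with z̄_d the average topic assignment of document d:

    \bar{z}_d = \frac{1}{N_d} \sum_{n=1}^{N_d} z_{d,n}, \qquad y_d \mid z_d, \eta, \delta^2 \;\sim\; \mathcal{N}\big( \eta^\top \bar{z}_d,\; \delta^2 \big)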

The Big Picture
sLDA is based on max-likelihood estimation; the proposed MedLDA combines max-margin and max-likelihood learning. The question to address: how do we integrate the max-margin principle into a probabilistic latent variable model? We consider the regression model first.

MedLDA Regression Model (Zhu et al., ICML 2009)
Bayesian sLDA + MED estimation: a variational bound (model fitting) plus an ε-insensitive loss (predictive accuracy). The regression model is defined as an integration of a Bayesian sLDA and an ε-insensitive support vector regression model. The generative procedure of Bayesian sLDA is similar to that of sLDA, except that we first sample the parameter η from a prior p_0(η) and then follow the same procedure to generate a document; thus η is a hidden variable rather than an unknown constant. We minimize a joint objective with two parts. L(q) is a variational bound on the joint likelihood, similar to that of sLDA, but since η is hidden the variational distribution q is a joint distribution over θ, z, and η; minimizing the bound finds a latent topic representation that fits the data well. The second part measures the ε-insensitive loss of the regression model on the training data: ε is a precision parameter giving the acceptable deviation of the prediction from the true value, the prediction is an expectation over z and η, ξ are slack variables that tolerate errors in the training data, and C is a constant that trades off the two parts. A sketch of the objective is given below.
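A sketch of the MED estimation problem described in the notes above (a transcription, using ε-insensitive SVR constraints on the expected prediction E_q[η^⊤ z̄_d]):

    \min_{q, \alpha, \beta, \delta^2, \xi, \xi^*} \; \mathcal{L}(q) + C \sum_d (\xi_d + \xi_d^*)
    \quad \text{s.t. } y_d - \mathbb{E}_q\big[ \eta^\top \bar{z}_d \big] \le \varepsilon + \xi_d, \;\;
    \mathbb{E}_q\big[ \eta^\top \bar{z}_d \big] - y_d \le \varepsilon + \xi_d^*, \;\; \xi_d, \xi_d^* \ge 0 \;\; \forall d

with predictive rule \hat{y} = \mathbb{E}_q[\eta^\top \bar{z}].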

MedLDA Classification Model (Zhu et al., ICML 2009)
Bayesian sLDA combined with a multiclass max-margin classifier: the multiclass MedLDA classification model again minimizes a variational bound plus a margin-based loss, and predicts with an averaged (expected) discriminant function. A sketch is given below.
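A sketch of the multiclass objective (generic notation, assuming a discriminant of the form η^⊤ f(y, z̄_d) with a class-specific feature map f, margin constraints in the spirit of the M3N slides, and loss Δℓ_d(y)):

    \min_{q, \alpha, \beta, \xi} \; \mathcal{L}(q) + C \sum_d \xi_d
    \quad \text{s.t. } \mathbb{E}_q\big[ \eta^\top \big( f(y_d, \bar{z}_d) - f(y, \bar{z}_d) \big) \big] \ge \Delta\ell_d(y) - \xi_d, \;\; \xi_d \ge 0 \;\; \forall d, \forall y
    \text{prediction: } \hat{y} = \arg\max_y \; \mathbb{E}_q\big[ \eta^\top f(y, \bar{z}_d) \big]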

Variational EM Algorithm
E-step: infer the posterior distribution of the hidden random variables (θ, z, and η). M-step: estimate the unknown parameters (α, β, δ²). We make a mean-field independence assumption about q. Optimizing L over φ: the first two terms of the update are the same as in LDA; the third and fourth terms are similar to those of sLDA but in expected form, since η is now a random variable, so its variance matters; the last term is a regularizer in which only the support vectors (those with nonzero Lagrange multipliers) affect the topic proportions, and because this term is the same for all words in a document it directly affects the document's latent topic representation. Optimizing L over the other variables (γ and q(η)) follows from the optimality conditions; see the paper for details.

MedTM: A General Framework
MedLDA can be generalized to arbitrary topic models: unsupervised or supervised, generative or undirected random fields (e.g., Harmoniums). In the MED Topic Model (MedTM) we have: the hidden random variables of the underlying topic model (e.g., θ and z in LDA); the parameters of the predictive model (e.g., η in sLDA); the parameters of the topic model (e.g., β in LDA); a variational upper bound of the negative log-likelihood (model fitting); and a convex function over slack variables (predictive accuracy). MedLDA is the first step toward integrating the max-margin principle into the procedure of discovering latent topics; the same principle generalizes to arbitrary topic models.
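A sketch of the generic MedTM objective suggested by this slide (generic symbols, since the slide's own are not shown: H the hidden variables, η the predictive-model parameters, Θ the topic-model parameters):

    \min_{q(H, \eta), \Theta, \xi} \; \mathcal{L}\big(q(H); \Theta\big) + KL\big(q(\eta) \,\|\, p_0(\eta)\big) + U(\xi)
    \quad \text{s.t. expected margin constraints on } \mathbb{E}_q\big[ F(y, H; \eta) \big]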

Experiments
Goal: to qualitatively and quantitatively evaluate how the max-margin estimates of MedLDA affect its topic discovery procedure. Data sets: 20 Newsgroups (classification): documents from 20 categories, about 20,000 documents in total; stop words removed as listed in UMass Mallet. Movie Review (regression): 5,006 documents and 1.6M words; dictionary of 5,000 terms selected by tf-idf; preprocessing to make the response approximately normal (Blei & McAuliffe, 2007).

Document Modeling
Data set: 20 Newsgroups. 110 topics + 2D embedding with t-SNE (van der Maaten & Hinton, 2008). This is a qualitative evaluation: we fit a 110-topic MedLDA and an unsupervised LDA on the 20 Newsgroups data set, then use the t-SNE stochastic neighborhood embedding to produce a 2D embedding of the expected topic proportions. In the resulting figures each dot is a document, and colors/shapes represent categories. The max-margin based MedLDA produces better grouping and separation of the documents from different categories; the unsupervised LDA does not produce a well-separated embedding, and documents from different categories tend to mix together.

Document Modeling (cont'd)
We further examine how topics are associated with categories, using comp.graphics and politics.mideast as examples. The table shows the top topics for each model, along with the per-class distribution over topics, computed by averaging the expected latent representation of the documents in each class. MedLDA yields a sharper, sparser, and faster-decaying per-class distribution over topics, which has better discriminative power; this behavior is due to the regularization effect enforced on φ. In contrast, unsupervised LDA tends to discover topics that model fine details of documents with no regard to their discriminative power. For example, for comp.graphics MedLDA mainly models documents with two salient topics (T69 and T11), whereas LDA produces a much flatter distribution over many topics.

Classification
Data set: 20 Newsgroups. Binary classification: "alt.atheism" vs. "talk.religion.misc" (Simon et al., 2008). Multiclass classification: all 20 categories. Models: DiscLDA, sLDA (binary only; classification sLDA is due to Wang et al., 2009), LDA+SVM (baseline), MedLDA, and MedLDA+SVM. Measure: relative improvement ratio. For binary classification we distinguish postings from the atheism and religion.misc groups; for multiclass classification we consider all 20 categories.

Regression
Data set: Movie Review (Blei & McAuliffe, 2007). Models: MedLDA (partial), MedLDA (full), sLDA, and LDA+SVR. Measures: predictive R² and per-word log-likelihood. For LDA we first fit all documents with an LDA model and then use the latent topic representation as features to learn an SVR model. For MedLDA, the underlying topic model can be an unsupervised LDA or a supervised sLDA; we implemented both, denoted MedLDA (partial) and MedLDA (full), respectively. All supervised methods significantly outperform the unsupervised LDA, and the max-margin MedLDA outperforms sLDA when the number of topics is small. A closer examination shows a sharp decrease in the number of support vectors in MedLDA at 15 topics, remaining stable thereafter; this suggests that for harder problems (small topic numbers) the max-margin formulation helps improve performance more. Finally, MedLDA (full) outperforms MedLDA (partial), because the connection between max-margin parameter estimation and the discovery of the latent topic representation is looser in MedLDA (partial).

Time Efficiency
Binary and multiclass classification: MedLDA is comparable with LDA+SVM. Regression: MedLDA is comparable with sLDA.

Summary
MaxEnDNet is a general framework for learning structured input/output models. It subsumes the standard M3Ns; as a model-averaging method it enjoys a PAC-Bayes theoretical error bound; entropic regularization yields sparse M3Ns; and combining generative and discriminative principles allows latent variables and semi-supervised learning on partially labeled data. Laplace MaxEnDNet is simultaneously primal and dual sparse: it performs as well as sparse models on synthetic data, performs better on real data sets, and is more stable to regularization constants. PoMEN provides an elegant approach to incorporating latent variables and structures under the max-margin framework; experimental results show the advantages of max-margin learning over likelihood-based methods with latent variables.

Margin-based Learning Paradigms
Structured prediction and Bayes learning: the goal is to get them connected.

Acknowledgements
http://www.sailing.cs.cmu.edu/

Thanks!