Joint Max Margin & Max Entropy Learning of Graphical Models


1 Joint Max Margin & Max Entropy Learning of Graphical Models
Eric Xing, Machine Learning Dept. / Language Technology Inst. / Computer Science Dept., Carnegie Mellon University. NIPS, Vancouver, Canada.

2 Structured Inference Problem
Unstructured prediction vs. structured prediction. Examples: part-of-speech tagging, image segmentation. “Do you want sugar in it?” ⇒ <verb pron verb noun prep pron>

3 Classical Predictive Models
Predictive function (see the sketch below). Examples: logistic regression and Bayes classifiers, learned by max-likelihood estimation; support vector machines (SVMs), learned by max-margin learning. These are the classical approaches to the discriminative prediction problem.
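The slide's formulas do not survive in the transcript; a standard way to write the predictive function and the two estimation criteria (my notation, not necessarily the slide's) is:

```latex
% Predictive function shared by both families of models
h(\mathbf{x}) = \arg\max_{y} F(\mathbf{x}, y; \mathbf{w}),
\qquad F(\mathbf{x}, y; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, y)

% Max-likelihood estimation (e.g., logistic regression)
\hat{\mathbf{w}}_{\mathrm{MLE}} = \arg\max_{\mathbf{w}} \sum_i \log p(y_i \mid \mathbf{x}_i; \mathbf{w})

% Max-margin learning (e.g., a binary SVM with hinge loss, y_i \in \{-1,+1\})
\hat{\mathbf{w}}_{\mathrm{MM}} = \arg\min_{\mathbf{w}} \tfrac{1}{2}\|\mathbf{w}\|^2
  + C \sum_i \max\bigl(0,\ 1 - y_i\, \mathbf{w}^\top \mathbf{x}_i\bigr)
```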

4 Structured Prediction Models
Conditional Random Fields (CRFs) (Lafferty et al., 2001): based on logistic regression; max-likelihood estimation (point estimate). Max-margin Markov Networks (M3Ns) (Taskar et al., 2003): based on SVMs; max-margin learning (point estimate). Markov properties are encoded in the feature functions.
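Neither objective is reproduced in the transcript; the usual textbook forms, written with a structured feature map f(x, y) and the difference vector Δf_i(y) = f(x_i, y_i) - f(x_i, y), are roughly:

```latex
% CRF: conditional likelihood over the structured output
p(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) =
 \frac{1}{Z(\mathbf{x}; \mathbf{w})} \exp\bigl\{\mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})\bigr\},
\qquad
\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \sum_i \log p(\mathbf{y}_i \mid \mathbf{x}_i; \mathbf{w})

% M3N: margin scaled by the structured loss \Delta\ell_i(\mathbf{y})
\min_{\mathbf{w},\,\xi \ge 0}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i
\quad \text{s.t.}\quad
\mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \ge \Delta\ell_i(\mathbf{y}) - \xi_i
\quad \forall i,\ \forall \mathbf{y}
```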

5 Example I: Image Segmentation
Jointly segmenting and annotating images; image-image matching, image-text matching. Problems: learning given the structure (features); learning sparse, interpretable predictive structures/features.

6 Example II: Genome-Phenome association in complex diseases
Pleiotropic effects. Epistatic effects.

7 Structured Prediction Models
Conditional Random Fields (CRFs) (Lafferty et al., 2001): based on logistic regression; max-likelihood estimation (point estimate). Max-margin Markov Networks (M3Ns) (Taskar et al., 2003): based on SVMs; max-margin learning (point estimate). Challenges: sparse, "interpretable" prediction models; prior information about structures; latent structures/variables; time series and non-stationarity; scalability to large-scale problems (e.g., 10^4 input/output dimensions).

8 Outline
Maximum entropy discrimination Markov networks (MEDN): general theory (Zhu and Xing, JMLR 2009; Zhu and Xing, ICML 2009). Gaussian MEDN: reduction to M3N (Zhu, Xing and Zhang, ICML 2008). Laplace MEDN: a sparse M3N (Zhu, Xing and Zhang, ICML 2008). Partially observed MEDN (PoMEN) (Zhu, Xing and Zhang, NIPS 2008). Max-margin/max-entropy topic model (MedLDA) (Zhu, Ahmed and Xing, ICML 2009).

9 MLE versus max-margin learning
Likelihood-based estimation: probabilistic (joint/conditional likelihood model); easy to perform Bayesian learning and to incorporate prior knowledge, latent structures, and missing data; Bayesian or direct regularization; hidden structures or a generative hierarchy. Max-margin learning: non-probabilistic (concentrates on the input-output mapping); not obvious how to perform Bayesian learning or to handle priors and missing data; support-vector property and sound theoretical guarantees with limited samples; kernel tricks. These fall into two different learning paradigms. Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999) bridges them: it performs model averaging, amounts to Bayesian learning, has many nice properties, and subsumes SVM as a special case. The optimization problem for binary classification is sketched below.
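The MED program itself is not legible in the transcript; a simplified binary-classification form with fixed margins and a slack penalty, written in the same style as the structured version introduced later in the talk, is roughly:

```latex
% MED: learn a distribution p(w) over classifiers rather than a point estimate
\min_{p(\mathbf{w}),\, \xi \ge 0}\
 \mathrm{KL}\bigl(p(\mathbf{w}) \,\|\, p_0(\mathbf{w})\bigr) + U(\xi)
\quad \text{s.t.}\quad
 \mathbb{E}_{p(\mathbf{w})}\bigl[y_i\, F(\mathbf{x}_i; \mathbf{w})\bigr] \ge 1 - \xi_i
 \quad \forall i

% Model-averaging prediction
\hat{y} = \mathrm{sign}\ \mathbb{E}_{p(\mathbf{w})}\bigl[F(\mathbf{x}; \mathbf{w})\bigr]
```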

10 Max-Margin Learning Paradigms
[Diagram: SVM extends to M3N for structured outputs (e.g., the letter sequence "b r a c e") and to MED for model averaging; the open question on the slide: MED-MN = SMED + "Bayesian" M3N?]

11 Primal and Dual Problems of M3Ns
Primal problem (algorithms: cutting plane, sub-gradient). Dual problem (algorithms: SMO, exponentiated gradient). Only a few dual variables are nonzero at the optimum, so M3N is dual sparse. The standard primal and dual forms are sketched below.
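The equations themselves are images on the original slide; the standard M3N primal and dual (Taskar et al., 2003), in the notation of slide 4, are approximately:

```latex
% Primal (as on slide 4)
\min_{\mathbf{w},\,\xi \ge 0}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i
\quad \text{s.t.}\quad
\mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \ge \Delta\ell_i(\mathbf{y}) - \xi_i
\quad \forall i,\ \forall \mathbf{y}

% Dual: one multiplier \alpha_i(\mathbf{y}) per (example, candidate output) pair
\max_{\alpha \ge 0}\
 \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\ell_i(\mathbf{y})
 - \tfrac{1}{2}\Bigl\|\sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\mathbf{f}_i(\mathbf{y})\Bigr\|^2
\quad \text{s.t.}\quad \sum_{\mathbf{y}} \alpha_i(\mathbf{y}) = C \quad \forall i
```

Most multipliers α_i(y) are zero at the optimum, which is the dual sparsity the slide points out; the primal weight vector w itself is generally dense.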

12 MaxEnt Discrimination Markov Network
(Zhu et al., ICML 2008; Zhu and Xing, JMLR 2009) Structured MaxEnt Discrimination (SMED): learn a distribution over weights restricted to a feasible subspace defined by expected margin constraints; the prediction averages over a distribution of M3Ns.
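The program itself is an image on the slide; as published in the JMLR 2009 paper, it takes roughly this form (my transcription):

```latex
% MaxEnDNet / SMED: distribution over M3N weights
\min_{p(\mathbf{w}),\, \xi \ge 0}\
 \mathrm{KL}\bigl(p(\mathbf{w}) \,\|\, p_0(\mathbf{w})\bigr) + U(\xi)
\quad \text{s.t.}\quad p(\mathbf{w}) \in \mathcal{F}_1

% Feasible subspace of weight distributions
\mathcal{F}_1 = \Bigl\{ p(\mathbf{w}) :
 \mathbb{E}_{p}\bigl[\mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y})\bigr]
  \ge \Delta\ell_i(\mathbf{y}) - \xi_i \ \ \forall i,\ \forall \mathbf{y} \Bigr\}

% Averaging prediction rule
h(\mathbf{x}) = \arg\max_{\mathbf{y}}\ \mathbb{E}_{p(\mathbf{w})}\bigl[\mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})\bigr]
```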

13 Solution to MaxEnDNet
Theorem 1: the posterior distribution has a closed form, and the corresponding dual optimization problem is over the Lagrange multipliers of the margin constraints (both are sketched below).
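Up to notation, the published statement gives the following posterior and dual; treat this as a sketch rather than the slide's exact formulas:

```latex
% Posterior: prior reweighted by the expected-margin constraints
p(\mathbf{w}) = \frac{1}{Z(\alpha)}\; p_0(\mathbf{w})\,
 \exp\Bigl\{ \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\,
  \bigl[\mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) - \Delta\ell_i(\mathbf{y})\bigr] \Bigr\}

% Dual: maximize over the Lagrange multipliers \alpha_i(\mathbf{y}) \ge 0
\max_{\alpha \ge 0}\ -\log Z(\alpha) - U^{*}(\alpha)
```

Here U* denotes the convex conjugate of the slack penalty U.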

14 Gaussian MaxEnDNet (reduction to M3N)
Theorem 2: assume a standard normal prior p0(w) = N(0, I). Then the posterior distribution is Gaussian, the dual optimization problem coincides with the M3N dual, and the predictive rule reduces to that of an M3N. Thus MaxEnDNet subsumes M3Ns and admits all the merits of max-margin learning. Furthermore, MaxEnDNet has at least three advantages beyond M3N.
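Concretely (again a sketch of the published result, not the slide verbatim):

```latex
p_0(\mathbf{w}) = \mathcal{N}(\mathbf{0}, I)
\ \Longrightarrow\
p(\mathbf{w}) = \mathcal{N}(\boldsymbol{\mu}, I),
\qquad
\boldsymbol{\mu} = \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\mathbf{f}_i(\mathbf{y})

% Predictive rule: identical to an M3N with weights \mu
h(\mathbf{x}) = \arg\max_{\mathbf{y}}\ \boldsymbol{\mu}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})
```

The dual over α is then exactly the M3N dual of slide 11, so the averaging prediction coincides with an M3N whose weights are the posterior mean μ.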

15 Three Advantages
An averaging model: PAC-Bayesian prediction error guarantee (Theorem 3). Entropy regularization: introducing useful biases; a standard normal prior gives a reduction to the standard M3N (we have seen it), while a Laplace prior gives posterior shrinkage effects (a sparse M3N). Integrating generative and discriminative principles: incorporating latent variables and structures (PoMEN); semi-supervised learning (with partially labeled data).

16 I: Generalization Guarantee
MaxEnDNet is an averaging model. Theorem 3 (PAC-Bayes bound): with high probability, the expected structured prediction error of the averaging predictor is bounded by its empirical margin loss plus a complexity term involving KL(p || p0) and the sample size.

17 II: Laplace MaxEnDNet (primal sparse M3N)
Laplace prior on w. Corollary 4: under a Laplace MaxEnDNet, each component of the posterior mean of the parameter vector w is shrunk toward zero. The Gaussian MaxEnDNet and the regular M3N have no such shrinkage; there, the posterior mean is the ordinary M3N estimate.
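The prior's form is not legible in the transcript; the Laplace prior used in the ICML 2008 paper is, as I recall:

```latex
% Laplace (double-exponential) prior, one factor per weight component
p_0(\mathbf{w}) = \prod_{k=1}^{K} \frac{\sqrt{\lambda}}{2}\,
 \exp\bigl\{-\sqrt{\lambda}\,|w_k|\bigr\}
```

Its sharp peak at zero is what drives the posterior-mean shrinkage named in Corollary 4.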

18 LapMEDN vs. L2 and L1 regularization
(Zhu and Xing, ICML 2009) Corollary 5: LapMEDN corresponds to solving a primal optimization problem in which the usual L1 or L2 penalty is replaced by a "KL norm". [Figure: the L1 and L2 norms compared with the KL norms.]

19 Variational Learning of LapMEDN
The exact primal and dual functions are hard to optimize. Using the hierarchical (scale-mixture) representation of the Laplace prior, we optimize an upper bound instead. Why is this easier? Alternating minimization leads to nicer subproblems: keeping the scale variables fixed, the effective prior is normal and the subproblem is an M3N optimization problem; keeping the weight distribution fixed, the scale variables and their expectations have closed-form solutions.
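A minimal numerical check (not the authors' code) of the scale-mixture identity relied on above, assuming the parameterization p0(w_k) = (sqrt(lambda)/2) exp(-sqrt(lambda)|w_k|); the value of lambda and the sample size are arbitrary:

```python
# Scale-mixture representation of the Laplace prior:
#   if  tau ~ Exponential(rate = lambda/2)  and  w | tau ~ Normal(0, tau),
#   then w is Laplace-distributed with density (sqrt(lambda)/2) * exp(-sqrt(lambda) * |w|).
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0            # hypothetical value of the hyperparameter lambda
n = 1_000_000        # Monte Carlo sample size

tau = rng.exponential(scale=2.0 / lam, size=n)   # Exponential with rate lambda/2
w = rng.normal(loc=0.0, scale=np.sqrt(tau))      # w | tau ~ N(0, tau); tau is the variance

# A Laplace density (b/2) * exp(-b*|w|) with b = sqrt(lambda) has E|w| = 1/b and Var(w) = 2/b^2.
b = np.sqrt(lam)
print("E|w|   sampled vs. Laplace:", np.mean(np.abs(w)), 1.0 / b)
print("Var(w) sampled vs. Laplace:", np.var(w), 2.0 / b**2)
```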

20 Experimental results on OCR datasets
[Figure: OCR example mapping a sequence of handwritten letter images x to the structured output y = "brace", with per-letter labels in a-z.] We approach structured prediction problems as learning a mapping from a structured input x to a structured output y. For example, handwriting recognition has sequential structure: we seek a mapping from strings of images to coherent strings of letters.

21 Experimental results on OCR datasets
We randomly construct OCR100, OCR150, OCR200, and OCR250 subsets for 10-fold cross-validation.

22 Feature Selection

23 Sensitivity to Regularization Constants
L1-CRFs are much more sensitive to the regularization constants; the others are more stable, and LapM3N is the most stable of all. Regularization constants tried: 0.01, 0.1, 1, 4, 9, 16 for L1-CRF and L2-CRF; 1, 4, 9, 16, 25, 36, 49, 64, 81 for M3N and LapM3N.

24 III: Latent Hierarchical MaxEnDNet
Web data extraction. Goal: extract Name, Image, Price, Description, etc. Hierarchical labeling. Advantages: computational efficiency, long-range dependencies, joint extraction. [Figure: example label hierarchy with internal nodes such as {Head}, {Info Block}, {Repeat Block}, {Tail}, {Note} over leaf labels such as {image}, {name, price}, {name}, {price}, {desc}.]

25 Partially Observed MaxEnDNet (PoMEN)
(Zhu et al., NIPS 2008) Now we are given partially labeled data: part of the structured label is hidden. PoMEN learns a joint distribution over the model weights and the hidden variables under expected margin constraints; prediction averages over both.

26 Alternating Minimization Alg.
Factorization assumption: the distribution over the weights and the hidden variables factorizes. Alternating minimization: Step 1: keep the distribution over hidden variables fixed and optimize over the weight distribution; with a normal prior this is an M3N problem (a QP), with a Laplace prior it is a Laplace M3N problem (solved by VB). Step 2: keep the weight distribution fixed and optimize over the hidden variables; this is equivalently reduced to an LP with a polynomial number of constraints. A schematic of the two steps follows.
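A schematic of the alternation, in made-up notation (the exact feature maps and constraint sets follow the NIPS 2008 paper; this is only a sketch):

```latex
% Factorized distribution over the weights and the hidden variables
p(\mathbf{w}, \{\mathbf{z}_i\}) = p(\mathbf{w}) \prod_i p(\mathbf{z}_i)

% Step 1 (p(z) fixed): an M3N-style problem for p(w);
%   a normal prior gives a QP as on slide 11, a Laplace prior a Laplace M3N solved by VB
\min_{p(\mathbf{w}),\, \xi \ge 0}\
 \mathrm{KL}\bigl(p(\mathbf{w}) \,\|\, p_0(\mathbf{w})\bigr) + U(\xi)
 \ \ \text{s.t.}\ \
 \mathbb{E}_{p(\mathbf{w})\, p(\mathbf{z}_i)}\bigl[\mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}, \mathbf{z}_i)\bigr]
  \ge \Delta\ell_i(\mathbf{y}) - \xi_i \ \ \forall i,\ \forall \mathbf{y}

% Step 2 (p(w) fixed): the analogous problem over the p(z_i),
%   which reduces to an LP with a polynomial number of constraints
```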

27 Experimental Results
Web data extraction: Name, Image, Price, Description. Methods: hierarchical CRFs, hierarchical M3Ns, PoMEN, and partially observed HCRFs. Data: pages from 37 templates; training on 185 pages (5 per template), or 1,585 data records; testing on 370 pages (10 per template), or 3,391 data records. Record-level evaluation: leaf nodes are labeled. Page-level evaluation: supervision level 1 labels the leaf nodes and data-record nodes; supervision level 2 adds the nodes above the data-record nodes.

28 Record-Level Evaluations
Overall performance: average F1 (averaged over all attributes) and block-instance accuracy (the percentage of records whose Name, Image, and Price are all correct). Attribute-level performance is also reported.

29 Page-Level Evaluations
Supervision level 1: leaf nodes and data-record nodes are labeled. Supervision level 2: level 1 plus the nodes above the data-record nodes.

30 IV: Max-Margin/Max Entropy Topic Model – MED-LDA
(Images from images.google.cn)

31 LDA: a generative story for documents
Bag-of-words representation of documents. Each word is generated by ONE topic; each document is a random mixture over topics. Example with two topics: Topic #1 (image, jpg, gif, file, color, images, files, format) and Topic #2 (ground, wire, power, wiring, current, circuit). Each topic is a distribution over the words of a dictionary, with a set of high-probability top words: Topic #1 is about graphics, while Topic #2 is about electronics. Each document is an admixture over the topics, with mixture weights determined by its words. Document #1 ("gif jpg image current file color images ground power file current format file formats circuit gif images") is mostly generated by Topic #1, so its weight on Topic #1 is much larger; Document #2 ("wire currents file format ground power image format wire circuit current wiring ground circuit images files ...") is mostly expressed by Topic #2. The topics are the mixture components and the per-document proportions are the mixture weights; performing Bayesian learning with a Dirichlet prior over the mixture weights yields the LDA model.

32 LDA: Latent Dirichlet Allocation
(Blei et al., 2003) Generative procedure: for each document d, sample a topic proportion θ_d; for each word, sample a topic z_{d,n} and then sample the word w_{d,n} from that topic. In the graphical representation, plates denote copies; z_{d,n} is the topic assignment of word w_{d,n}, θ_d is the topic proportion of document d, β is a matrix whose rows are topics, and α is the parameter of the Dirichlet prior over the mixture weights θ. LDA defines a joint distribution over θ, z, and all the words w. Exact inference of the posterior is intractable, so variational approximation is a popular method: with a variational distribution q(z, θ), an upper bound can be derived, and minimizing this bound gives the parameter estimates and the posterior distributions.
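In symbols (standard LDA notation; the transcript drops the slide's formulas):

```latex
% Joint distribution defined by LDA
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
 = \prod_{d} p(\theta_d \mid \alpha)
   \prod_{n} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta)

% Variational bound minimized with a factorized q(z, \theta)
\mathcal{L}(q) = -\,\mathbb{E}_q\bigl[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\bigr]
 - \mathcal{H}\bigl(q(\mathbf{z}, \theta)\bigr)
 \ \ge\ -\log p(\mathbf{w} \mid \alpha, \beta)
```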

33 Supervised Topic Model (sLDA)
LDA ignores documents' side information (e.g., categories or rating scores), which can lead to suboptimal topic representations for supervised tasks. Supervised topic models handle such problems, e.g., sLDA (Blei & McAuliffe, 2007) and DiscLDA (Lacoste-Julien et al., 2008). Generative procedure (sLDA): for each document d, sample a topic proportion θ_d; for each word, sample a topic z_{d,n} and sample the word w_{d,n}; finally, sample the response y_d from the topics used in the document. In sLDA the parameters η are unknown constants. The generative model defines a joint distribution over the response variables and the input documents; as before, introducing a variational distribution gives an upper bound on the negative log-likelihood, and an EM method estimates the model parameters and infers the posterior over the latent variables θ and z.
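For a real-valued response, the sLDA response model (per Blei & McAuliffe, 2007) is:

```latex
% Empirical topic frequencies of document d
\bar{\mathbf{z}}_d = \frac{1}{N_d} \sum_{n=1}^{N_d} \mathbf{z}_{d,n}

% Response generated from the topics actually used in the document
y_d \mid \mathbf{z}_d, \eta, \delta^2 \ \sim\ \mathcal{N}\bigl(\eta^\top \bar{\mathbf{z}}_d,\ \delta^2\bigr)
```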

34 The Big Picture
sLDA is based on max-likelihood estimation; the proposed MedLDA is based on max-margin learning combined with max-likelihood estimation. The question to address: how do we integrate the max-margin principle into a probabilistic latent variable model? We look at the regression model first.

35 MedLDA Regression Model
(Zhu et al., ICML 2009) The MedLDA regression model integrates a Bayesian sLDA with epsilon-insensitive support vector regression. The generative procedure mirrors sLDA, except that η is first sampled from a prior p0(η), so η is a hidden variable rather than an unknown constant. The MED estimation problem minimizes a joint objective with two parts: L(q), a variational bound on the joint likelihood (model fitting), where q is now a joint distribution over θ, z, and η; and the epsilon-insensitive loss of the regression prediction on the training data (predictive accuracy). The prediction is an expectation over z and η; epsilon is a precision parameter giving the acceptable deviation of a prediction from the true value, the slack variables ξ tolerate errors in the training data, and the constant C trades off the two parts. The predictive rule averages over the posterior.
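A sketch of the resulting program (my transcription of the ICML 2009 formulation; the notation may differ slightly from the slide):

```latex
\min_{q,\,\alpha,\,\beta,\,\delta^2,\,\xi,\,\xi^{*}}\
 \mathcal{L}(q) + C \sum_{d} \bigl(\xi_d + \xi_d^{*}\bigr)
\quad \text{s.t. for every document } d:\
\begin{cases}
 y_d - \mathbb{E}_q\bigl[\eta^\top \bar{\mathbf{z}}_d\bigr] \le \epsilon + \xi_d \\
 \mathbb{E}_q\bigl[\eta^\top \bar{\mathbf{z}}_d\bigr] - y_d \le \epsilon + \xi_d^{*} \\
 \xi_d \ge 0,\ \ \xi_d^{*} \ge 0
\end{cases}
```

The prediction for a new document is the posterior expectation E_q[η^T z̄].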

36 MedLDA Classification Model
(Zhu et al., ICML 2009) The multiclass MedLDA classification model couples the Bayesian sLDA variational bound with multiclass max-margin constraints on the expected topic representation of each document; the predictive rule again averages over the posterior of z and η.
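Schematically (again my transcription, with Δf_d(y) the difference between the feature vectors of the true label and a competing label y, and Δℓ_d(y) the classification loss):

```latex
\min_{q,\,\alpha,\,\beta,\,\xi \ge 0}\
 \mathcal{L}(q) + C \sum_{d} \xi_d
\quad \text{s.t.}\quad
 \mathbb{E}_q\bigl[\eta^\top \Delta\mathbf{f}_d(y)\bigr] \ge \Delta\ell_d(y) - \xi_d
 \quad \forall d,\ \forall y \ne y_d
```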

37 Variational EM Algorithm
E-step: infer the posterior distribution of the hidden random variables θ, z, and η. M-step: estimate the unknown parameters α, β, and δ². A mean-field independence assumption is made about q. Optimizing L over φ yields an update whose first two terms are the same as in LDA; the third and fourth terms are similar to those of sLDA but in expected form, so the variance of η matters; and the last term is a regularizer: only the support vectors (documents with nonzero Lagrange multipliers) affect the topic proportions, and since this term is shared by all words in a document it directly shifts the document's latent topic representation. Optimizing L over the other variables follows the paper.

38 MedTM: a general framework
MedLDA can be generalized to arbitrary topic models: unsupervised or supervised, generative or undirected random fields (e.g., Harmoniums). The MED Topic Model (MedTM) is defined in terms of: the hidden random variables of the underlying topic model (e.g., θ and z in LDA); the parameters of the predictive model (e.g., η in sLDA); the parameters of the topic model (e.g., β in LDA); a variational upper bound on the negative log-likelihood (model fitting); and a convex function over the slack variables (predictive accuracy). MedLDA represents the first step toward integrating the max-margin principle into the procedure of discovering latent topics; the same principle generalizes to these broader classes of models.

39 Experiments
Goal: to qualitatively and quantitatively evaluate how the max-margin estimates of MedLDA affect its topic discovery procedure. Data sets: 20 Newsgroups (classification): documents from 20 categories, roughly 20,000 documents in total; stop words removed as listed in UMass Mallet. Movie Review (regression): 5,006 documents and 1.6M words; dictionary of 5,000 terms selected by tf-idf; preprocessing to make the response approximately normal (Blei & McAuliffe, 2007).

40 Document Modeling
Data set: 20 Newsgroups. 110 topics + 2D embedding with t-SNE (van der Maaten & Hinton, 2008). We fit a 110-topic MedLDA and an unsupervised LDA on the 20 Newsgroups data, then use t-SNE to produce a 2D embedding of the expected topic proportions; each dot is a document, with colors and shapes indicating categories. The max-margin MedLDA produces better grouping and separation of documents from different categories, whereas unsupervised LDA does not produce a well-separated embedding and documents from different categories tend to mix together. [Figures: MedLDA embedding vs. LDA embedding.]

41 Document Modeling (cont'd)
We further examine how the topics are associated with the categories, using comp.graphics and politics.mideast as examples. The table shows the top topics under each model, together with the per-class distribution over topics, computed by averaging the expected latent representations of the documents in each class. MedLDA yields a sharper, sparser, and faster-decaying per-class distribution over topics, which has better discriminative power; this behavior is due to the regularization effect enforced over φ. In contrast, unsupervised LDA tends to discover topics that model fine details of documents with no regard to their discriminative power: for comp.graphics, MedLDA mainly models documents with two salient topics (T69 and T11), while LDA produces a much flatter distribution over many topics.

42 Classification
Data set: 20 Newsgroups. Binary classification: "alt.atheism" vs. "talk.religion.misc" (following Lacoste-Julien et al., 2008). Multiclass classification: all 20 categories. Models: DiscLDA, sLDA (binary only; the classification sLDA of Wang et al., 2009), LDA+SVM (baseline), MedLDA, and MedLDA+SVM. Measure: relative improvement ratio.

43 Regression
Data set: Movie Review (Blei & McAuliffe, 2007). Models: MedLDA (partial), MedLDA (full), sLDA, LDA+SVR. Measures: predictive R² and per-word log-likelihood. For LDA+SVR, we first fit all documents with an LDA model and then use the latent topic representation as features for an SVR; for MedLDA, the underlying topic model can be an unsupervised LDA (MedLDA partial) or a supervised sLDA (MedLDA full). All supervised methods significantly outperform unsupervised LDA, and the max-margin MedLDA outperforms sLDA when the number of topics is small. A closer look shows a sharp decrease in the number of support vectors in MedLDA at 15 topics, after which it stays stable; this suggests that for harder settings (small numbers of topics) the max-margin formulation helps most. Finally, MedLDA (full) outperforms MedLDA (partial), because the connection between max-margin parameter estimation and the discovery of the latent topic representation is looser in MedLDA (partial).

44 Time Efficiency
Binary and multiclass classification: MedLDA is comparable with LDA+SVM. Regression: MedLDA is comparable with sLDA.

45 Summary
MaxEnDNet: a general framework for learning structured input/output models; it subsumes the standard M3Ns; model averaging gives a PAC-Bayes theoretical error bound; entropic regularization gives sparse M3Ns; generative plus discriminative principles allow latent variables and semi-supervised learning on partially labeled data. Laplace MaxEnDNet: simultaneously primal and dual sparse; performs as well as sparse models on synthetic data, performs better on real data sets, and is more stable to regularization constants. PoMEN: an elegant approach to incorporating latent variables and structures under the max-margin framework; experimental results show the advantages of max-margin learning over likelihood-based methods with latent variables.

46 Margin-based Learning Paradigms
[Diagram: connecting the structured prediction and Bayes learning paradigms.]

47 Acknowledgements
http://www.sailing.cs.cmu.edu/
Funding:

48 Thanks! Reference:

