Presentation on theme: "Jointly primal and dual SPARSE Structured I/O Models" — Presentation transcript:

1 Jointly primal and dual SPARSE Structured I/O Models
Eric Xing, Machine Learning Dept. / Language Technologies Inst. / Computer Science Dept., Carnegie Mellon University. Joint work with Jun Zhu and Amr Ahmed.

2 Classical Predictive Models
Inputs: a set of training samples (pairs of inputs and outputs). Outputs: a predictive function from the input space to the output space. Examples: logistic regression and Bayes classifiers, fit by max-likelihood estimation; support vector machines (SVMs), fit by max-margin learning. This is the classical approach to the discriminative prediction problem.
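For concreteness, a standard textbook writing of the two estimation principles, with training pairs (x_i, y_i), i = 1, …, N, and weight vector w (this notation is mine, not copied from the slide):

Max-likelihood estimation (e.g., logistic regression):
\[ \hat{w} = \arg\max_{w} \sum_{i=1}^{N} \log p(y_i \mid x_i; w). \]

Max-margin learning (SVM):
\[ \min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i \quad \text{s.t.}\quad y_i\, w^\top x_i \ge 1 - \xi_i,\ \ \xi_i \ge 0. \]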

3 Structured Prediction Problem
Unstructured prediction vs. structured prediction. Examples: part-of-speech tagging, image segmentation. "Do you want sugar in it?" ⇒ <verb pron verb noun prep pron>

4 Structured Prediction Models
Conditional random fields (CRFs) (Lafferty et al., 2001): based on logistic regression; max-likelihood estimation (point estimate). Max-margin Markov networks (M3Ns) (Taskar et al., 2003): based on SVMs; max-margin learning (point estimate). Challenges: a SPARSE prediction model; prior information about structures; scalability to large-scale problems (e.g., 10^4 input/output dimensions). Example input: ACGTTTTACTGTACAATT
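For reference, the CRF point estimate mentioned above maximizes the regularized conditional likelihood of the structured output y given x, with joint features f(x, y) (a standard writing, notation mine):
\[ \max_{w}\ \sum_i \Big[ w^\top f(x_i, y_i) - \log \sum_{y} \exp\big(w^\top f(x_i, y)\big) \Big] - \frac{\lambda}{2}\|w\|^2, \]
while the M3N point estimate solves the structured max-margin problem written out on the next slide.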

5 Outline Jointly maximum entropy and maximum margin Markov networks
General theorems (Zhu and Xing, JMLR submitted); Gaussian MEDN: reduction to M3N (Zhu, Xing and Zhang, ICML 08); Laplace MEDN: a sparse M3N (Zhu, Xing and Zhang, ICML 08); Partially observed MEDN (Zhu, Xing and Zhang, NIPS 08); Max-margin/max-entropy topic model (Zhu, Ahmed and Xing, ICML 09)

6 Primal and Dual Problems of M3Ns
Primal problem (algorithms: cutting plane, sub-gradient). Dual problem (algorithms: SMO, i.e. sequential minimal optimization; exponentiated gradient). So, M3N is dual sparse!
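For reference, the standard M3N primal-dual pair (Taskar et al., 2003) referred to here, with Δf_i(y) = f(x_i, y_i) − f(x_i, y) and label loss Δℓ_i(y):

Primal:
\[ \min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.}\quad w^\top \Delta f_i(y) \ge \Delta\ell_i(y) - \xi_i,\ \ \xi_i \ge 0\ \ \forall i,\ \forall y. \]

Dual:
\[ \max_{\alpha \ge 0}\ \sum_{i,y} \alpha_i(y)\, \Delta\ell_i(y) - \frac{1}{2}\Big\| \sum_{i,y} \alpha_i(y)\, \Delta f_i(y) \Big\|^2 \quad \text{s.t.}\quad \sum_{y} \alpha_i(y) = C\ \ \forall i. \]

At the optimum, w* = Σ_{i,y} α_i(y) Δf_i(y), and only a small fraction of the α_i(y) are nonzero: this is the dual sparsity noted on the slide.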

7 MLE versus max-margin learning
Likelihood-based estimation: probabilistic (joint/conditional likelihood model); easy to perform Bayesian learning and to incorporate prior knowledge, latent structures, and missing data; Bayesian or direct regularization; hidden structures or a generative hierarchy. Max-margin learning: non-probabilistic (concentrates on the input-output mapping); not obvious how to perform Bayesian learning or handle priors and missing data; support-vector property and sound theoretical guarantees with limited samples; kernel tricks. These models fall into two different learning paradigms. Maximum entropy discrimination (MED) (Jaakkola et al., 1999) bridges them for binary classification: it predicts by model averaging, essentially performs Bayesian learning over the max-margin problem, has many nice properties, and subsumes SVM as a special case.
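For binary classification, the MED optimization problem mentioned here can be sketched (slack handling simplified, notation mine) as learning a distribution q(w) over classifiers rather than a single w:
\[ \min_{q(w)}\ \mathrm{KL}\big(q(w)\,\|\,p_0(w)\big) \quad \text{s.t.}\quad \int q(w)\,\big[\,y_i\, F(x_i; w) - \xi_i\,\big]\, dw \ge 0\ \ \forall i, \]
with prediction by model averaging, \hat{y} = \mathrm{sign}\!\int q(w)\, F(x; w)\, dw; with a Gaussian prior p_0 this recovers the SVM as a special case.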

8 MaxEnt Discrimination Markov Network
Structured MaxEnt discrimination (SMED): a feasible subspace of weight distributions, with prediction by averaging over a distribution of M3Ns.
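Up to notation, the SMED problem and its feasible subspace of weight distributions (following Zhu and Xing's MaxEnDNet papers) are:
\[ \mathcal{F}_1 = \Big\{ q(w) : \int q(w)\,\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big]\, dw \ge -\xi_i,\ \ \forall i,\ \forall y \ne y_i \Big\}, \]
\[ \min_{q(w),\,\xi}\ \mathrm{KL}\big(q(w)\,\|\,p_0(w)\big) + U(\xi) \quad \text{s.t.}\quad q(w) \in \mathcal{F}_1,\ \ \xi_i \ge 0, \]
where ΔF_i(y; w) = F(x_i, y_i; w) − F(x_i, y; w), U(ξ) = C Σ_i ξ_i, and prediction averages over the distribution of M3Ns: h(x) = argmax_y ∫ q(w) F(x, y; w) dw.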

9 Solution to MaxEnDNet Theorem 1: Posterior Distribution:
Dual Optimization Problem:
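Theorem 1 can be restated (up to notation) as follows: the optimal posterior has the form
\[ q(w) = \frac{1}{Z(\alpha)}\, p_0(w)\, \exp\!\Big\{ \sum_{i,y} \alpha_i(y)\, \big[\Delta F_i(y; w) - \Delta\ell_i(y)\big] \Big\}, \]
where Z(α) is the normalizer, and the dual optimization problem is max_α −log Z(α) − U*(α), with U* the convex conjugate of the slack penalty U.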

10 Gaussian MaxEnDNet (reduction to M3N)
Theorem 2: assume a standard normal prior. The posterior distribution, dual optimization problem, and predictive rule then take the forms sketched below. Thus, MaxEnDNet subsumes M3Ns and admits all the merits of max-margin learning. Furthermore, MaxEnDNet has at least three advantages …
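A sketch of the content of Theorem 2 (up to notation): with a standard normal prior p_0(w) = N(0, I) and Δf_i(y) as on slide 6, the posterior is Gaussian,
\[ q(w) = \mathcal{N}(w \mid \mu,\, I), \qquad \mu = \sum_{i,y} \alpha_i(y)\, \Delta f_i(y), \]
the dual problem coincides with the M3N dual, and the predictive rule h(x) = argmax_y μ^⊤ f(x, y) is exactly the M3N prediction.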

11 Three Advantages An averaging model: PAC-Bayesian prediction-error guarantee (Theorem 3). Entropy regularization: introducing useful biases; a standard normal prior gives a reduction to the standard M3N (as we have seen), while a Laplace prior yields posterior shrinkage effects (a sparse M3N). Integrating generative and discriminative principles: incorporating latent variables and structures (PoMEN); semi-supervised learning (with partially labeled data).

12 Variational Learning of LapMEDN
The exact primal or dual function is hard to optimize. Using a hierarchical representation of the Laplace prior, we instead optimize an upper bound. Why is it easier? Alternating minimization leads to nicer sub-problems: with the scale variables kept fixed, the effective prior is normal, so updating the weight distribution is an M3N optimization problem; with the weight distribution kept fixed, the scale variables and their expectations have closed-form solutions.
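The hierarchical representation referred to here is the standard scale-mixture decomposition of the Laplace prior (written up to notation):
\[ p_0(w_k) = \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\, |w_k|} \;=\; \int_0^{\infty} \mathcal{N}(w_k \mid 0, \tau_k)\, \frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\, d\tau_k, \]
so, conditioned on the scale variables τ, the effective prior on w is normal; the alternating scheme updates q(w) under this Gaussian effective prior (an M3N-style problem) and then the τ-related terms, which have closed-form solutions and expectations.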

13 Experimental results on OCR datasets
We randomly construct OCR100, OCR150, OCR200, and OCR250 datasets for 10-fold cross-validation.

14 Feature Selection

15 Sensitivity to Regularization Constants
L1-CRFs are much more sensitive to the regularization constants; the others are more stable, and LapM3N is the most stable of all. Regularization constants tried: L1-CRF and L2-CRF: …, 0.01, 0.1, 1, 4, 9, 16; M3N and LapM3N: 1, 4, 9, 16, 25, 36, 49, 64, 81.

16 II. Latent Hierarchical MaxEnDNet
Web data extraction. Goal: extract Name, Image, Price, Description, etc. Hierarchical labeling. Advantages: computational efficiency, long-range dependency, joint extraction. (Slide figure: a label hierarchy with nodes such as {Head}, {Info Block}, {Repeat block}, {Tail}, {Note}, and leaf labels {image}, {name, price}, {name}, {price}, {desc}.)

17 Partially Observed MaxEnDNet (PoMEN)
Now we are given partially labeled data. PoMEN defines a learning problem over both the weight distribution and the hidden labels, and a corresponding prediction rule (a sketch follows below).
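A sketch of the PoMEN learning problem (following Zhu, Xing and Zhang, NIPS 08, up to notation): with hidden labels z_i alongside the observed ones, learn a joint distribution over weights and hidden labels,
\[ \min_{q(w, \{z_i\}),\, \xi}\ \mathrm{KL}\big(q \,\|\, p_0\big) + C \sum_i \xi_i \quad \text{s.t.}\quad \mathbb{E}_{q}\big[\Delta F_i(y, z_i; w)\big] \ge \Delta\ell_i(y) - \xi_i\ \ \forall i,\ \forall y, \]
solved by alternating minimization under a factorized q(w, {z_i}) = q(w) ∏_i q(z_i): the q(w)-step is an M3N-style problem and the q(z)-step is inference over the hidden variables; prediction then averages over both w and z.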

18 Record-Level Evaluations
Overall performance: average F1 (averaged over all attributes) and block-instance accuracy (% of records whose Name, Image, and Price are all correct). Attribute-level performance is also reported.

19 III: Max-Margin/Max Entropy Topic Model – MED-LDA
(from images.google.cn)

20 LDA: a generative story for documents
Bag-of-words representation of documents. Each word is generated by ONE topic; each document is a random mixture over topics. Example topics: Topic #1: image, jpg, gif, file, color, images, files, format; Topic #2: ground, wire, power, wiring, current, circuit. Document #1: gif jpg image current file color images ground power file current format file formats circuit gif images. Document #2: wire currents file format ground power image format wire circuit current wiring ground circuit images files… Each topic is a distribution over the words of a dictionary, with a set of top words that have high probability; for example, topic #1 is about graphics while topic #2 is about electronics (power, current, circuit). Each document is an admixture over the topics, and each word is generated by a particular topic, so the mixture weights reflect the words: most words of Document #1 are likely generated by topic #1, so its weight on topic #1 is much larger, and Document #2 is mostly expressed by topic #2. Performing Bayesian learning with a Dirichlet prior over the mixture weights yields the LDA model (mixture components = topics, mixture weights = topic proportions).
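A minimal, self-contained sketch of this generative story; the topic count, vocabulary size, document length, and hyperparameters below are made-up illustrative values, not numbers from the talk.

import numpy as np

# Toy LDA-style generative process: K topics over a V-word vocabulary.
rng = np.random.default_rng(0)
K, V, doc_len = 2, 9, 20
alpha = np.ones(K)                          # Dirichlet prior over topic mixtures
topics = rng.dirichlet(np.ones(V), size=K)  # each row: one topic's word distribution

def generate_document():
    theta = rng.dirichlet(alpha)            # per-document mixture over topics
    zs, ws = [], []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)          # each word is generated by ONE topic
        w = rng.choice(V, p=topics[z])
        zs.append(int(z)); ws.append(int(w))
    return theta, zs, ws

theta, zs, ws = generate_document()
print("mixture weights:", theta)
print("topic of each word:", zs)
print("word ids:", ws)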

21 Supervised Topic Model (sLDA)
Supervised topic models handle such problems, e.g., sLDA (Blei & McAuliffe, 2007) and DiscLDA (Lacoste-Julien et al., 2008). Generative procedure (sLDA): for each document d, sample a topic proportion; for each word, sample a topic and then sample a word; finally sample the response (Blei & McAuliffe, 2007). Our models are based on sLDA. In sLDA, the parameters are unknown constants; the generative model defines a joint distribution over the response variables and the input documents. We can introduce a variational distribution, derive a variational upper bound on the negative log-likelihood, and use an EM method to estimate the model parameters and infer the posterior distribution of the latent variables θ and Z.
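The sLDA joint distribution referred to above (Blei & McAuliffe, 2007): with topic proportions θ_d ~ Dir(α), topic assignments z_{dn} ~ Mult(θ_d), words w_{dn} ~ Mult(β_{z_{dn}}), and response y_d ~ N(η^⊤ z̄_d, δ²), where z̄_d is the average of the topic-assignment indicators of document d,
\[ p(\theta, \mathbf{z}, \mathbf{y}, \mathbf{W} \mid \alpha, \beta, \eta, \delta^2) = \prod_d p(\theta_d \mid \alpha)\, \Big( \prod_n p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big)\, p(y_d \mid \eta^\top \bar{z}_d, \delta^2). \]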

22 MedLDA Regression Model
Generative procedure (Bayesian sLDA): sample a parameter η from a prior p0(η); then, for each document d, sample a topic proportion, and for each word sample a topic and then a word; finally sample the response. The key difference from sLDA is that η is a hidden random variable rather than an unknown constant. Definition: the MedLDA regression model integrates Bayesian sLDA with ε-insensitive support vector regression. We minimize a joint objective with two parts. The first, a variational bound L(q) on the joint likelihood (model fitting), is similar to sLDA's, except that the variational distribution q is now a joint distribution over θ, Z, and η; minimizing it seeks a latent topic representation that fits the data well. The second measures the ε-insensitive loss of the regression model on the training data (predictive accuracy). The exact predictive rule, an expectation over Z and η, is intractable, hence the variational treatment; ε is a precision parameter giving the acceptable deviation of the prediction from the true value, ξ are slack variables that tolerate errors in the training data, and C trades off the two parts.
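Up to notation, the MedLDA regression problem described above is
\[ \min_{q,\, \alpha, \beta, \delta^2,\, \xi, \xi^*}\ \mathcal{L}(q) + C \sum_d (\xi_d + \xi_d^*) \quad \text{s.t. } \forall d:\ \ y_d - \mathbb{E}_q[\eta^\top \bar{z}_d] \le \epsilon + \xi_d,\ \ \mathbb{E}_q[\eta^\top \bar{z}_d] - y_d \le \epsilon + \xi_d^*,\ \ \xi_d, \xi_d^* \ge 0, \]
where L(q) is the variational bound on the negative log-likelihood of Bayesian sLDA (model fitting), the constraints encode the ε-insensitive regression loss (predictive accuracy), and the prediction is \hat{y} = E_q[η^⊤ z̄].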

23 MedLDA Regression Model (cont'd)
(Same generative procedure, definition, variational bound, and predictive rule as the previous slide.)

24 MedLDA Classification Model
The normalization factor of the GLM makes inference harder, so we use LDA as the underlying topic model. The multiclass MedLDA classification model minimizes a variational upper bound (model fitting) subject to expected margin constraints (predictive accuracy); prediction follows an averaged rule (a sketch is given below).
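Up to notation, the multiclass MedLDA classification problem is
\[ \min_{q,\, \beta,\, \xi}\ \mathcal{L}^{u}(q) + C \sum_d \xi_d \quad \text{s.t.}\quad \mathbb{E}_q\big[\eta^\top \Delta f_d(y)\big] \ge \Delta\ell_d(y) - \xi_d,\ \ \xi_d \ge 0\ \ \forall d,\ \forall y \ne y_d, \]
where L^u(q) is the variational upper bound for the underlying (unsupervised) LDA (model fitting), Δf_d(y) = f(y_d, z̄_d) − f(y, z̄_d) is a difference of class-specific feature vectors, the constraints are the expected margin constraints (predictive accuracy), and the predictive rule is \hat{y} = argmax_y E_q[η^⊤ f(y, z̄_d)].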

25 Experiments
Goal: to qualitatively and quantitatively evaluate how the max-margin estimates of MedLDA affect its topic discovery procedure. Data sets: 20 Newsgroups (classification): documents from 20 categories, about 20,000 documents in total; stop words removed as listed in UMass Mallet. Movie Review (regression): 5006 documents and 1.6M words; dictionary of 5000 terms selected by tf-idf; preprocessing to make the response approximately normal (Blei & McAuliffe, 2007).

26 Document Modeling Data Set: 20 Newsgroups
110 topics + 2D embedding with t-SNE (van der Maaten & Hinton, 2008). The first experiment is a qualitative evaluation: we fit 110-topic MedLDA and unsupervised LDA on the 20 Newsgroups data set, then use the t-SNE stochastic neighbor embedding algorithm to produce a 2D embedding of the discovered topic representations. The figures show the 2D embeddings of the expected topic proportions under MedLDA and LDA; each dot is a document, and different colors and shapes represent different categories. The max-margin based MedLDA produces a better grouping and separation of the documents in different categories, whereas unsupervised LDA does not produce a well-separated embedding and documents from different categories tend to mix together. (Figure panels: MedLDA, LDA.)

27 Document Modeling (cont’)
(Table: top topics per class for comp.graphics and politics.mideast under each model.) We further examine how the topics are associated with the categories, using two categories (graphics and mideast) as examples. The table shows the top topics discovered by the different models, together with the per-class distribution over topics for each model, computed by averaging the expected latent representations of the documents in each class. MedLDA yields a sharper, sparser, and faster-decaying per-class distribution over topics, which has better discriminative power; this behavior is due to the regularization effect enforced over φ. In contrast, unsupervised LDA tends to discover topics that model fine details of the documents with no regard to their discriminative power. For example, for the class graphics, MedLDA mainly models documents using two salient topics, T69 and T11, whereas LDA produces a much flatter distribution over many topics.

28 Classification
Data set: 20 Newsgroups. Binary classification: "alt.atheism" vs. "talk.religion.misc" (as in Lacoste-Julien et al., 2008). Multiclass classification: all 20 categories. Models: DiscLDA, sLDA (binary only; the multiclass classification sLDA of Wang et al., 2009), LDA+SVM (baseline), MedLDA, MedLDA+SVM. Measure: relative improvement ratio. The remaining experiments concern the predictive accuracy of the different models: for binary classification we distinguish postings from the atheism and religion.misc groups, and for multiclass classification we consider all 20 categories.

29 Regression Data Set: Movie Review (Blei & McAuliffe, 2007)
Models: MedLDA (partial), MedLDA (full), sLDA, LDA+SVR. Measures: predictive R² and per-word log-likelihood. We compare the MedLDA regression model with the other models on the Movie Review data set. For LDA, we first fit all documents with an LDA model and then use the latent topic representations as features to learn an SVR model. For MedLDA, as mentioned, the underlying topic model can be unsupervised LDA or supervised sLDA; we implemented both and denote them MedLDA (partial) and MedLDA (full), respectively. All the supervised methods significantly outperform unsupervised LDA, and the max-margin MedLDA outperforms sLDA when the number of topics is small. A closer examination shows that the number of support vectors in MedLDA decreases sharply at 15 topics and stays stable after that, suggesting that for harder problems (e.g., a small number of topics) the max-margin formulation helps improve performance more. Finally, MedLDA (full) outperforms MedLDA (partial), because the connection between max-margin parameter estimation and the discovery of the latent topic representation is looser in MedLDA (partial).

30 Margin-based Learning Paradigms
(Slide diagram: connecting the structured prediction and Bayes learning paradigms.)

31 Acknowledgement Funding:

32 Thanks! Reference:

