
Laplace Maximum Margin Markov Networks
Jun Zhu (jun-zhu@mails.tsinghua.edu.cn)
Dept. of Comp. Sci. & Tech., Tsinghua University
Joint work with Eric P. Xing and Bo Zhang. This work was done while the author was a visiting researcher at CMU.

Outline (ICML 2008 @ Helsinki, Finland)
– Introduction
  – Structured Prediction
  – Max-margin Markov Networks
– Max-Entropy Discrimination Markov Networks
  – Basic Theorems
  – Laplace Max-margin Markov Networks
  – Experimental Results
– Summary

Classical Classification Models
Inputs:
– a set of training samples D = {(x_i, y_i)}, where x_i is a feature vector and y_i a single label (e.g., y_i in {+1, -1})
Outputs:
– a predictive function h: X -> Y
Examples (linear predictors h(x) = sign(w^T x)):
– Support Vector Machine (SVM): max-margin learning
– Logistic Regression: max-likelihood estimation
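The two examples above fit the same linear predictor with different losses. A minimal sketch (toy numbers of our own, not from the talk) contrasting the hinge loss behind max-margin learning with the log loss behind max-likelihood estimation:

```python
import numpy as np

def hinge_loss(w, x, y):
    # SVM objective term: penalize margins y * w^T x below 1
    return max(0.0, 1.0 - y * (w @ x))

def log_loss(w, x, y):
    # Logistic regression: negative conditional log-likelihood
    return float(np.log1p(np.exp(-y * (w @ x))))

w = np.array([1.0, -0.5])
x = np.array([2.0, 1.0])     # margin y * (w @ x) = 1.5 for y = +1
print(hinge_loss(w, x, +1))  # 0.0: a margin >= 1 incurs no penalty
print(log_loss(w, x, +1))    # ~0.2014: the likelihood never saturates exactly
```

The hinge loss is exactly zero past the margin, which yields sparse support-vector solutions; the log loss is strictly positive everywhere, which is what makes the probabilistic model easy to extend with priors, as the later slides exploit.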

Structured Prediction
Complicated prediction examples:
– Part-of-speech (POS) tagging: "Do you want fries with that?" -> a tag sequence (verb, pronoun, verb, noun, prep, pronoun)
– Image segmentation
Inputs:
– a set of training samples D = {(x_i, y_i)}, where each y_i is a structured object (e.g., a label sequence)
Outputs:
– a predictive function h: X -> Y
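What makes this prediction "complicated" is that y is a joint object: for a sentence of n tokens and a tagset of size k there are k^n candidate tag sequences. A tiny illustration (the 4-tag set is our own simplification):

```python
from itertools import product

tags = ["verb", "pronoun", "noun", "prep"]
sentence = "Do you want fries with that".split()  # 6 tokens

# Every full assignment of one tag per token is a candidate output y
n_candidates = len(tags) ** len(sentence)
print(n_candidates)  # 4^6 = 4096; a realistic 45-tag set gives 45^6 > 8 * 10^9

# Sanity check by explicit enumeration (feasible only at toy scale)
assert n_candidates == sum(1 for _ in product(tags, repeat=len(sentence)))
```

This exponential blow-up is why structured models encode Markov properties in the feature functions: inference then factorizes over adjacent tags instead of enumerating all sequences.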

Structured Prediction Models
– Conditional Random Fields (CRFs) (Lafferty et al., 2001)
  – Based on logistic regression
  – Max-likelihood estimation (point estimate)
– Max-margin Markov Networks (M3Ns) (Taskar et al., 2003)
  – Based on SVM
  – Max-margin learning (point estimate)
Markov properties are encoded in the feature functions.
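The two estimators can be sketched side by side (notation ours; the slide's own equations did not survive this transcript). With a joint feature map f(x, y):

```latex
% CRF: conditional likelihood model, fit by max-likelihood
p_w(y \mid x) = \frac{1}{Z(x; w)} \exp\!\left( w^\top f(x, y) \right),
\qquad
\max_w \ \sum_i \log p_w(y_i \mid x_i)

% M3N: max-margin point estimate (primal form)
\min_{w, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
w^\top \Delta f_i(y) \ \ge\ \Delta\ell_i(y) - \xi_i \quad \forall i, \forall y,
```

where Δf_i(y) = f(x_i, y_i) − f(x_i, y) and Δℓ_i(y) is the label loss. Both learn a single point estimate w; the Markov structure lives entirely in f.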

Between MLE and Max-margin Learning
Likelihood-based estimation:
– Probabilistic (joint/conditional likelihood model)
– Easy to adapt for Bayesian learning, prior knowledge, and missing data
Max-margin learning:
– Non-probabilistic (concentrates on the input-output mapping)
– Not obvious how to perform Bayesian learning or to incorporate priors and missing data
– Sound theoretical guarantees with limited samples
Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999):
– A Bayesian learning approach
– The optimization problem (binary classification)
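The MED optimization problem referenced on the slide (its equation was an image) is, following Jaakkola et al. (1999), to learn a distribution p(w, ξ) over parameters and margins rather than a point estimate:

```latex
\min_{p(w, \xi)} \ \mathrm{KL}\!\left( p(w, \xi) \,\|\, p_0(w, \xi) \right)
\quad \text{s.t.} \quad
\int p(w, \xi) \left[ y_i \, F(x_i; w) - \xi_i \right] \mathrm{d}w \,\mathrm{d}\xi \ \ge\ 0
\quad \forall i,

% with the averaged (Bayesian) prediction
\hat{y}(x) = \mathrm{sign} \int p(w) \, F(x; w) \, \mathrm{d}w
```

The expected-margin constraints play the role of the SVM constraints, while the KL objective keeps the solution close to a prior p_0; this is the template the next slides generalize to Markov networks.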

MaxEnt Discrimination Markov Networks
MaxEnt Discrimination Markov Networks (MaxEntNet):
– A generalized maximum entropy, or regularized KL-divergence, objective
– A subspace of distributions defined by expected margin constraints
– Bayesian prediction by averaging over the learned distribution
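Spelled out (reconstructed in our notation, since the slide's equations were images), MaxEntNet solves a KL-regularized program over distributions of Markov-network weights:

```latex
\min_{p(w),\ \xi} \ \mathrm{KL}\!\left( p(w) \,\|\, p_0(w) \right) + U(\xi)
\quad \text{s.t.} \quad
\mathbb{E}_{p(w)}\!\left[ \Delta F_i(y; w) \right] \ \ge\ \Delta\ell_i(y) - \xi_i,
\quad \xi_i \ge 0, \quad \forall i,\ \forall y \ne y_i,

% where \Delta F_i(y; w) = w^\top \left[ f(x_i, y_i) - f(x_i, y) \right],
% U(\xi) is a convex slack penalty, and the Bayesian prediction is
h(x) = \arg\max_{y} \ \mathbb{E}_{p(w)}\!\left[ w^\top f(x, y) \right]
```

Relative to MED, the constraints now range over all competing structured outputs y, with the Markov structure encoded in f.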

Solution to MaxEntNet
Theorem 1 (Solution to MaxEntNet):
– Posterior distribution
– Dual optimization problem
Convex conjugate (of a closed proper convex function):
– Definition
– Example
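A sketch of the slide's two ingredients, reconstructed from the statement of Theorem 1 in the accompanying technical report (treat the exact form as indicative):

```latex
% Theorem 1: the optimal posterior is an exponential tilting of the prior
p(w) = \frac{1}{Z(\alpha)} \, p_0(w) \,
\exp\Big\{ \textstyle\sum_{i, y} \alpha_i(y) \big[ \Delta F_i(y; w) - \Delta\ell_i(y) \big] \Big\}

% Dual problem, with U^* the convex conjugate of the slack penalty U
\max_{\alpha \ge 0} \ -\log Z(\alpha) - U^*(\alpha)

% Convex conjugate: definition, and the standard example for linear slack
\varphi^*(\mu) = \sup_{\nu} \big\{ \langle \mu, \nu \rangle - \varphi(\nu) \big\};
\qquad
U(\xi) = C \textstyle\sum_i \xi_i \ (\xi \ge 0)
\ \Rightarrow\
U^*(\alpha) = \begin{cases}
  0 & \text{if } \sum_{y} \alpha_i(y) \le C \ \ \forall i \\
  +\infty & \text{otherwise}
\end{cases}
```

The conjugate of the linear slack penalty turns the dual into a box-constrained problem, mirroring the 0 ≤ α ≤ C constraints of the SVM dual.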

Reduction to M3Ns
Theorem 2 (Reduction of MaxEntNet to M3Ns):
– Assume a standard normal prior on w; then:
  – Posterior distribution: a Gaussian centered at the M3N solution
  – Dual optimization: the M3N dual
  – Predictive rule: the M3N linear rule
Thus, MaxEntNet subsumes M3Ns and admits all the merits of max-margin learning.
Furthermore, MaxEntNet has at least three additional advantages.
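Concretely (our reconstruction, with the same notation as before): plugging a standard normal prior into the Theorem 1 posterior gives

```latex
p_0(w) = \mathcal{N}(0, I)
\ \Rightarrow\
p(w) = \mathcal{N}(\mu_w, I),
\qquad
\mu_w = \sum_{i, y} \alpha_i(y) \, \Delta f_i(y),

% the dual reduces to the familiar M3N quadratic program
\max_{\alpha} \ \sum_{i, y} \alpha_i(y) \, \Delta\ell_i(y)
 \;-\; \frac{1}{2} \Big\| \sum_{i, y} \alpha_i(y) \, \Delta f_i(y) \Big\|^2,

% and Bayesian averaging recovers the M3N linear predictive rule
h(x) = \arg\max_{y} \ \mu_w^\top f(x, y)
```

So with a Gaussian prior the averaging model collapses to the point-estimate M3N; the gains below come from choosing other priors.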

The Three Advantages
– A PAC-Bayes prediction error guarantee
– Introduces regularization effects, such as a sparsity bias
– Provides an elegant approach to incorporating latent variables and structures


Generalization Guarantee
MaxEntNet is an averaging model; we also call it a Bayesian Max-Margin Markov Network.
Theorem 3 (PAC-Bayes bound): with high probability, the expected risk of the averaging predictor is bounded in terms of its empirical margin performance, the KL-divergence from the prior, and the sample size.
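Theorem 3 is a margin-based, structured-prediction variant of the classical PAC-Bayes bound. Since the slide's equations were images, we state the classical form (McAllester/Seeger style) that it builds on: for a prior p_0 fixed before seeing the data, with probability at least 1 − δ over a sample S of size N, simultaneously for every posterior q,

```latex
\mathrm{kl}\!\left( \hat{R}_S(q) \,\middle\|\, R(q) \right)
\ \le\
\frac{ \mathrm{KL}(q \,\|\, p_0) + \ln \frac{2\sqrt{N}}{\delta} }{N}
```

where R(q) is the expected risk of the averaging predictor, \hat{R}_S(q) its empirical counterpart, and kl(·‖·) the binary relative entropy. The key point is that the bound applies to averaging models like MaxEntNet, not to point estimates.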

Laplace M3Ns (LapM3N)
The prior in MaxEntNet can be designed to introduce useful regularization effects, such as a sparsity bias.
A normal prior is not well suited to sparse data. Instead, we use the Laplace prior:
– hierarchical representation (scale mixture of Gaussians)
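The hierarchical (scale-mixture) representation says a Laplace prior is an infinite mixture of Gaussians whose variances are exponentially distributed. A quick numerical check of that identity (our own sketch; λ denotes the Laplace inverse scale):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 2.0, 200_000

# Hierarchy: tau_k ~ Exponential(rate = lam^2 / 2), then w_k | tau_k ~ N(0, tau_k)
tau = rng.exponential(scale=2.0 / lam**2, size=n)
w = rng.normal(0.0, np.sqrt(tau))

# Marginally w ~ Laplace(0, 1/lam): Var(w) = 2 / lam^2, E|w| = 1 / lam
print(w.var(), 2 / lam**2)          # both ~0.5
print(np.abs(w).mean(), 1 / lam)    # both ~0.5
```

The sharp peak at zero and heavy tails of the marginal are the source of the sparsity bias, while the Gaussian-given-τ conditional is what keeps the downstream computations tractable.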

Posterior Shrinkage Effect in LapM3N
– Exact integration in LapM3N; alternatively, a functional transformation
– A similar calculation for M3Ns (a standard normal prior), for comparison

Variational Bayesian Learning
The exact dual function is hard to optimize. Using the hierarchical representation, we derive and optimize an upper bound instead.
Why is it easier?
– Alternating minimization leads to nicer optimization problems:
  – Keeping the scale distribution fixed, the effective prior is normal
  – Keeping the weight posterior fixed, the scale distribution and its expectation have closed-form solutions

Variational Bayesian Learning (Cont'd)

Experiments
Compare LapM3N with:
– CRFs: MLE
– L2-CRFs: L2-norm penalized MLE
– L1-CRFs: L1-norm penalized MLE (sparse estimation)
– M3N: max-margin learning

Experimental Results on Synthetic Datasets
– Datasets with 100 i.i.d. features, of which 10, 30, or 50 are relevant
  – For each setting, we generate 10 datasets, each with 1000 samples. True labels are assigned via a Gibbs sampler with 5000 iterations.
– Datasets with 30 correlated relevant features + 70 i.i.d. irrelevant features

Experimental Results on OCR Datasets
We randomly construct OCR100, OCR150, OCR200, and OCR250 for 10-fold CV.

Sensitivity to Regularization Constants
– L1-CRFs are much more sensitive to the regularization constant; the others are more stable
– LapM3N is the most stable of all
Constants tried:
– L1-CRFs and L2-CRFs: 0.001, 0.01, 0.1, 1, 4, 9, 16
– M3N and LapM3N: 1, 4, 9, 16, 25, 36, 49, 64, 81

Summary
We propose MaxEntNet, a general framework for Bayesian max-margin structured prediction:
– MaxEntNet subsumes the standard M3Ns
– It admits a PAC-Bayes theoretical error bound
We propose Laplace max-margin Markov networks (LapM3N):
– Enjoys a posterior shrinkage effect
– Performs as well as sparse models on synthetic data, and better on real datasets
– More stable with respect to regularization constants

Thanks!
Detailed proofs:
http://166.111.138.19/junzhu/MaxEntNet_TR.pdf
http://www.sailing.cs.cmu.edu/pdf/2008/zhutr1.pdf

Primal and Dual Problems of M3Ns
Primal problem. Algorithms:
– Cutting plane
– Sub-gradient
– …
Dual problem. Algorithms:
– SMO
– Exponentiated gradient
– …
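As an illustration of the sub-gradient approach to the M3N primal, here is a toy sketch that is entirely our own: the data, feature map, and constants are made up, and the "structure" is collapsed to a single multiclass label (Crammer-Singer style) so that loss-augmented inference is a plain argmax rather than dynamic programming:

```python
import numpy as np

def joint_feature(x, y, n_classes):
    # f(x, y): copy x into the block for class y
    phi = np.zeros(n_classes * x.size)
    phi[y * x.size:(y + 1) * x.size] = x
    return phi

def subgradient_m3n(X, Y, n_classes, C=10.0, lr=0.1, epochs=300):
    w = np.zeros(n_classes * X.shape[1])
    for _ in range(epochs):
        g = w.copy()  # gradient of the (1/2)||w||^2 term
        for x, y in zip(X, Y):
            # loss-augmented inference: argmax_y' [w^T f(x, y') + loss(y, y')]
            scores = [w @ joint_feature(x, yp, n_classes) + (yp != y)
                      for yp in range(n_classes)]
            y_hat = int(np.argmax(scores))
            # sub-gradient of the structured hinge loss at this sample
            g += C * (joint_feature(x, y_hat, n_classes)
                      - joint_feature(x, y, n_classes))
        w -= (lr / len(X)) * g
    return w

# Three well-separated toy classes in the plane (hypothetical data)
X = np.array([[4., 0.], [3., 1.], [-4., 0.], [-3., -1.], [0., 4.], [1., 3.]])
Y = [0, 0, 1, 1, 2, 2]
w = subgradient_m3n(X, Y, n_classes=3)
pred = [int(np.argmax([w @ joint_feature(x, yp, 3) for yp in range(3)]))
        for x in X]
acc = float(np.mean([p == y for p, y in zip(pred, Y)]))
print(acc)
```

In a real M3N the argmax over y' runs over exponentially many structured outputs and is computed by Viterbi-style dynamic programming over the Markov structure; the update rule itself is unchanged.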

Bayesian Learning
Point estimate (MLE and max-margin learning):
– Prediction plugs in a single parameter estimate
Bayesian learning:
– Prediction is an averaging model, integrating over the posterior distribution
