
1 Laplace Maximum Margin Markov Networks
Jun Zhu (jun-zhu@mails.tsinghua.edu.cn)
Dept. of Comp. Sci. & Tech., Tsinghua University
Joint work with Eric P. Xing and Bo Zhang. This work was done while I was a visiting researcher at CMU.
ICML 2008, Helsinki, Finland

2 Outline
Introduction
– Structured Prediction
– Max-Margin Markov Networks
Maximum Entropy Discrimination Markov Networks
– Basic Theorems
– Laplace Max-Margin Markov Networks
– Experimental Results
Summary

3 Classical Classification Models
Inputs:
– a set of training samples $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$, where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \{+1, -1\}$
Outputs:
– a predictive function $h(\mathbf{x}): \mathcal{X} \to \{+1, -1\}$
Examples ($h(\mathbf{x}) = \mathrm{sign}\, F(\mathbf{x}; \mathbf{w})$):
– Support Vector Machines (SVMs): max-margin learning
– Logistic Regression: max-likelihood estimation
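For concreteness, the two estimators in their standard linear-model forms (textbook notation, not transcribed from the slide's own displays):

$$\min_{\mathbf{w}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \max\big(0,\; 1 - y_i\, \mathbf{w}^\top \mathbf{x}_i\big) \qquad \text{(SVM: max-margin)}$$

$$\max_{\mathbf{w}}\; \sum_{i=1}^N \log p(y_i \,|\, \mathbf{x}_i; \mathbf{w}) \;=\; -\sum_{i=1}^N \log\big(1 + e^{-y_i\, \mathbf{w}^\top \mathbf{x}_i}\big) \qquad \text{(logistic regression: MLE)}$$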

4 Structured Prediction
Complicated prediction examples:
– Part-of-speech (POS) tagging: "Do you want fries with that?" -> a tag sequence such as (verb, pronoun, verb, noun, preposition, pronoun)
– Image segmentation
Inputs:
– a set of training samples $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$, where each $\mathbf{y}_i$ is a structured label (e.g., a tag sequence)
Outputs:
– a predictive function $h(\mathbf{x}): \mathcal{X} \to \mathcal{Y}$
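Computing $h(\mathbf{x})$ requires an argmax over exponentially many label sequences; for chain-structured outputs such as POS tags, dynamic programming makes this tractable. A minimal sketch (my own illustration, not code from the talk), assuming per-position and per-transition score tables:

```python
import numpy as np

def viterbi(node_score, edge_score):
    """Return the max-scoring label sequence for a linear chain.

    node_score: (T, K) array, score of assigning label k at position t.
    edge_score: (K, K) array, score of transitioning from label j to label k.
    """
    T, K = node_score.shape
    delta = np.empty((T, K))            # best score of a prefix ending in label k
    back = np.zeros((T, K), dtype=int)  # argmax predecessor labels
    delta[0] = node_score[0]
    for t in range(1, T):
        # cand[j, k] = best prefix ending in j, then transition j -> k, then emit at t
        cand = delta[t - 1][:, None] + edge_score + node_score[t][None, :]
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    y = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):       # trace the argmax pointers backward
        y.append(int(back[t, y[-1]]))
    return y[::-1]
```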

5 Structured Prediction Models
Conditional Random Fields (CRFs) (Lafferty et al., 2001)
– based on logistic regression
– max-likelihood estimation (point estimate)
Max-Margin Markov Networks (M3Ns) (Taskar et al., 2003)
– based on SVMs
– max-margin learning (point estimate)
In both, the Markov properties are encoded in the feature functions.
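Both models score a candidate output with a linear function $F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})$, where the features $\mathbf{f}$ decompose over the cliques of the Markov network; that decomposition is where the Markov properties enter. A CRF turns the score into a conditional likelihood and maximizes it:

$$p(\mathbf{y} \,|\, \mathbf{x}; \mathbf{w}) = \frac{1}{Z(\mathbf{x}; \mathbf{w})} \exp\{\mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})\}, \qquad \hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \sum_i \log p(\mathbf{y}_i \,|\, \mathbf{x}_i; \mathbf{w}).$$

An M3N keeps the same score but fits $\mathbf{w}$ by the max-margin program written out on the backup slide (slide 23).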

6 Between MLE and Max-Margin Learning
Likelihood-based estimation
– probabilistic (joint/conditional likelihood model)
– easy to adapt to Bayesian learning, and to account for prior knowledge and missing data
Max-margin learning
– non-probabilistic (concentrates on the input-output mapping)
– not obvious how to perform Bayesian learning or handle priors and missing data
– sound theoretical guarantees with limited samples
Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999) bridges the two
– a Bayesian learning approach
– formulated as an optimization problem over distributions (binary classification), written below
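The MED program for binary classification, in the simplified fixed-slack form of Jaakkola et al. (1999) (a sketch of the formulation; the full paper also treats margins and biases as random):

$$\min_{p(\mathbf{w})}\; \mathrm{KL}\big(p(\mathbf{w}) \,\|\, p_0(\mathbf{w})\big) \qquad \text{s.t.}\quad \int p(\mathbf{w})\, \big[\, y_i F(\mathbf{x}_i; \mathbf{w}) - \xi_i \,\big]\, d\mathbf{w} \;\ge\; 0, \quad \forall i.$$

That is, instead of a single $\mathbf{w}$, MED learns a distribution $p(\mathbf{w})$ close to a prior $p_0$ whose expected margins satisfy the classification constraints.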

7 MaxEnt Discrimination Markov Networks
MaxEnt Discrimination Markov Networks (MaxEntNet):
– a generalized maximum entropy (equivalently, regularized KL-divergence) problem
– over a subspace of distributions defined by expected margin constraints
Prediction is Bayesian: average the discriminant function over the learned distribution.
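Written out (following the accompanying technical report; here $\Delta F_i(\mathbf{y}; \mathbf{w}) = F(\mathbf{x}_i, \mathbf{y}_i; \mathbf{w}) - F(\mathbf{x}_i, \mathbf{y}; \mathbf{w})$, $\Delta\ell_i(\mathbf{y})$ is the loss of $\mathbf{y}$ against the true $\mathbf{y}_i$, and $U(\xi)$ penalizes slack):

$$\min_{p(\mathbf{w}),\, \xi}\; \mathrm{KL}\big(p(\mathbf{w}) \,\|\, p_0(\mathbf{w})\big) + U(\xi) \qquad \text{s.t.}\quad \int p(\mathbf{w})\,\big[\Delta F_i(\mathbf{y}; \mathbf{w}) - \Delta\ell_i(\mathbf{y})\big]\, d\mathbf{w} \;\ge\; -\xi_i, \quad \forall i,\ \forall \mathbf{y} \ne \mathbf{y}_i,$$

with the Bayesian predictive rule $h(\mathbf{x}) = \arg\max_{\mathbf{y}} \int p(\mathbf{w})\, F(\mathbf{x}, \mathbf{y}; \mathbf{w})\, d\mathbf{w}$.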

8 Solution to MaxEntNet
Theorem 1 (Solution to MaxEntNet) characterizes:
– the posterior distribution: the prior reweighted by an exponential of the margin terms
– the dual optimization problem over the Lagrange multipliers
The dual involves the convex conjugate of the (closed proper convex) slack penalty $U$; its definition and a standard example are given below.
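The solution and dual, in the notation above (following the technical report):

$$p(\mathbf{w}) = \frac{1}{Z(\alpha)}\, p_0(\mathbf{w}) \exp\Big\{ \sum_{i, \mathbf{y}} \alpha_i(\mathbf{y}) \big[ \Delta F_i(\mathbf{y}; \mathbf{w}) - \Delta\ell_i(\mathbf{y}) \big] \Big\}, \qquad \max_{\alpha \ge 0}\; -\log Z(\alpha) - U^*(\alpha).$$

The convex conjugate is defined as $U^*(\alpha) = \sup_{\xi}\, \big( \alpha^\top \xi - U(\xi) \big)$. Example: for $U(\xi) = C \sum_i \xi_i$ with $\xi \ge 0$, $U^*$ is the indicator function that is $0$ when $\sum_{\mathbf{y}} \alpha_i(\mathbf{y}) \le C$ for every $i$ and $+\infty$ otherwise, which is exactly how the box constraint of the M3N dual arises.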

9 Reduction to M3Ns
Theorem 2 (Reduction of MaxEntNet to M3Ns):
– Assume a standard normal prior $p_0(\mathbf{w}) = \mathcal{N}(\mathbf{w} \,|\, 0, I)$, a linear discriminant $F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})$, and $U(\xi) = C \sum_i \xi_i$.
– Then the posterior is normal, the dual optimization is exactly the M3N dual, and the predictive rule coincides with the M3N rule.
Thus MaxEntNet subsumes M3Ns and admits all the merits of max-margin learning.
Furthermore, MaxEntNet has at least three advantages ...
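Concretely, under those assumptions (again following the technical report):

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \,|\, \mu,\, I), \qquad \mu = \sum_{i, \mathbf{y}} \alpha_i(\mathbf{y})\, \Delta \mathbf{f}_i(\mathbf{y}), \qquad h(\mathbf{x}) = \arg\max_{\mathbf{y}}\; \mu^\top \mathbf{f}(\mathbf{x}, \mathbf{y}),$$

where $\Delta\mathbf{f}_i(\mathbf{y}) = \mathbf{f}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{f}(\mathbf{x}_i, \mathbf{y})$ and $\alpha$ solves the M3N dual written on the backup slide (slide 23).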

10 The Three Advantages
– a PAC-Bayes prediction error guarantee
– regularization effects, such as a sparsity bias, introduced through the prior
– an elegant approach to incorporating latent variables and structures

12 Generalization Guarantee
MaxEntNet is an averaging model; we also call it a Bayesian Max-Margin Markov Network.
Theorem 3 (PAC-Bayes Bound): with high probability, the expected prediction error of the averaging model is bounded by its empirical margin error plus a complexity term.
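Schematically, PAC-Bayes margin bounds for averaging models take the following shape (this is the generic template, not Theorem 3 verbatim; the exact complexity term and constants are in the technical report):

$$\Pr\big(h(\mathbf{x}) \ne \mathbf{y}\big) \;\le\; \widehat{\Pr}\big(\text{margin} \le \gamma\big) \;+\; O\!\left( \sqrt{ \frac{ \gamma^{-2}\, \mathrm{KL}\big(p \,\|\, p_0\big) \ln N + \ln \frac{1}{\delta} }{ N } } \right) \quad \text{with probability} \ge 1 - \delta.$$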

13 Laplace M3Ns (LapM3N)
The prior in MaxEntNet can be designed to introduce useful regularization effects, such as a sparsity bias.
A normal prior is not well suited to sparse data; instead, we use a Laplace prior.
– The Laplace prior admits a hierarchical representation as a scale mixture of normals, written below.
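The hierarchical (scale-mixture) representation, in one common parametrization of the Laplace density:

$$p_0(w_k) \;=\; \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\, |w_k|} \;=\; \int_0^\infty \mathcal{N}\big(w_k \,|\, 0, \tau_k\big)\; \frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\; d\tau_k,$$

i.e., each weight is normal given its variance $\tau_k$, and the variances are exponentially distributed. This is the representation the variational algorithm exploits.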

14 Posterior Shrinkage Effect in LapM3N
Exact integration over the Laplace (scale-mixture) prior in LapM3N couples each posterior weight to its own scale, shrinking small weights toward zero.
A similar calculation for M3Ns (a standard normal prior) yields no such effect: the posterior mean is a fixed linear function of the dual variables.
Alternatively, the shrinkage can be seen through a functional transformation of the dual.

15 Variational Bayesian Learning
The exact dual function is hard to optimize. Using the hierarchical representation, we obtain a variational upper bound and optimize that instead.
Why is it easier? Alternating minimization leads to nicer optimization problems:
– Keeping the distribution over the scales $\tau$ fixed, the effective prior on $\mathbf{w}$ is normal, so this sub-step is an M3N-style problem.
– Keeping the distribution over $\mathbf{w}$ fixed, the scale distribution and its required expectations have closed-form solutions.
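A runnable sketch of the alternating structure (my own stand-in, not the talk's algorithm: the inner step of LapM3N is an M3N dual QP, which I replace with a penalized least-squares solve so the example is self-contained; the closed-form scale update $\langle 1/\tau_k \rangle = \sqrt{\lambda}/|w_k|$ is the standard EM update for the scale mixture above):

```python
import numpy as np

def laplace_em(X, y, lam=1.0, n_iter=50, eps=1e-8):
    """Alternating minimization under a Laplace (scale-mixture) prior.

    Step 1: with the expected inverse scales <1/tau_k> fixed, the effective
            prior on w is normal, so the inner problem is a ridge-style solve
            (stand-in here for LapM3N's M3N-style dual step).
    Step 2: with w fixed, the scale expectations have the closed form
            <1/tau_k> = sqrt(lam) / |w_k|.
    """
    n, d = X.shape
    inv_tau = np.ones(d)  # expected inverse scales <1/tau_k>
    w = np.zeros(d)
    for _ in range(n_iter):
        # Step 1: normal effective prior -> penalized least-squares solve
        w = np.linalg.solve(X.T @ X + np.diag(inv_tau), X.T @ y)
        # Step 2: closed-form update of the scale expectations
        inv_tau = np.sqrt(lam) / (np.abs(w) + eps)
    return w
```

The fixed point of these two steps is the MAP estimate under the Laplace prior, which is where the sparsity bias comes from.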

16 Variational Bayesian Learning (Cont'd)

17 Experiments
We compare LapM3N with:
– CRFs: MLE
– L2-CRFs: L2-norm penalized MLE
– L1-CRFs: L1-norm penalized MLE (sparse estimation)
– M3Ns: max-margin learning

18 Experimental Results on Synthetic Datasets
Datasets with 100 i.i.d. features, of which 10, 30, or 50 are relevant.
– For each setting we generate 10 datasets, each with 1000 samples; true labels are assigned by a Gibbs sampler run for 5000 iterations.
Datasets with 30 correlated relevant features plus 70 i.i.d. irrelevant features.

19 Experimental Results on OCR Datasets
We randomly construct the OCR100, OCR150, OCR200, and OCR250 datasets for 10-fold cross-validation.

20 Sensitivity to Regularization Constants
L1-CRFs are much more sensitive to the regularization constant than the other models; LapM3N is the most stable of all.
Constants evaluated:
– L1-CRFs and L2-CRFs: 0.001, 0.01, 0.1, 1, 4, 9, 16
– M3Ns and LapM3N: 1, 4, 9, 16, 25, 36, 49, 64, 81

21 Summary
We propose MaxEntNet, a general framework for Bayesian max-margin structured prediction.
– MaxEntNet subsumes the standard M3Ns.
– It comes with a PAC-Bayes theoretical error bound.
We propose Laplace Max-Margin Markov Networks (LapM3N).
– LapM3N enjoys a posterior shrinkage effect.
– It performs as well as sparse models on synthetic data and better on real datasets.
– It is more stable with respect to regularization constants.

22 Thanks!
Detailed proofs:
http://166.111.138.19/junzhu/MaxEntNet_TR.pdf
http://www.sailing.cs.cmu.edu/pdf/2008/zhutr1.pdf

23 Primal and Dual Problems of M3Ns
Primal problem; algorithms:
– cutting plane
– sub-gradient
– ...
Dual problem; algorithms:
– SMO
– exponentiated gradient
– ...
The two programs are written below.
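In the notation used above ($\Delta\mathbf{f}_i(\mathbf{y}) = \mathbf{f}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{f}(\mathbf{x}_i, \mathbf{y})$, $\Delta\ell_i(\mathbf{y})$ the structured loss), the standard M3N primal and dual are:

$$\text{Primal:}\quad \min_{\mathbf{w},\, \xi \ge 0}\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i \qquad \text{s.t.}\quad \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \;\ge\; \Delta\ell_i(\mathbf{y}) - \xi_i, \quad \forall i,\ \forall \mathbf{y} \ne \mathbf{y}_i;$$

$$\text{Dual:}\quad \max_{\alpha \ge 0}\; \sum_{i, \mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\ell_i(\mathbf{y}) \;-\; \frac{1}{2} \Big\| \sum_{i, \mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\mathbf{f}_i(\mathbf{y}) \Big\|^2 \qquad \text{s.t.}\quad \sum_{\mathbf{y}} \alpha_i(\mathbf{y}) = C, \quad \forall i.$$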

24 Bayesian Learning
Point estimate (MLE and max-margin learning):
– prediction uses a single learned parameter vector.
Bayesian learning:
– prediction is an averaging model: the score is averaged over the posterior.
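In symbols, with notation as above:

$$h_{\text{point}}(\mathbf{x}) = \arg\max_{\mathbf{y}}\; F(\mathbf{x}, \mathbf{y}; \hat{\mathbf{w}}) \qquad \text{vs.} \qquad h_{\text{Bayes}}(\mathbf{x}) = \arg\max_{\mathbf{y}}\; \int p(\mathbf{w})\, F(\mathbf{x}, \mathbf{y}; \mathbf{w})\, d\mathbf{w}.$$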

