Primal Sparse Max-Margin Markov Networks


1 Primal Sparse Max-Margin Markov Networks
Jun Zhu*†, Eric P. Xing† and Bo Zhang*
*Department of Computer Science & Technology, Tsinghua University
†School of Computer Science, Carnegie Mellon University

2 Outline
Problem Formulation: structured prediction and models
Three Learning Algorithms: cutting-plane, projected sub-gradient, EM-style method
Experimental Results
Conclusion

3 Classification
Learning input: a set of i.i.d. training samples {(x_i, y_i)}, e.g., OCR images of isolated characters such as "b", "r", "a", "c", "e".
Learning output: a predictive function h that maps an input x to a label y.
A representative model: Support Vector Machines (SVM) (Crammer & Singer, 2001).
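For concreteness, a hedged reconstruction of the multiclass SVM prediction rule and margin in the Crammer & Singer style (the weight vector w and joint feature map f(x, y) are generic notation, not necessarily the slide's):

```latex
h(x) = \arg\max_{y \in \mathcal{Y}} \ \mathbf{w}^\top \mathbf{f}(x, y)                    % prediction: the highest-scoring label
\mathbf{w}^\top \mathbf{f}(x_i, y_i) - \mathbf{w}^\top \mathbf{f}(x_i, y)                 % margin of the true label y_i over a competitor y
```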

4 Structured Prediction
Classification ignores the dependencies among the outputs.
Learning input: a set of i.i.d. samples {(x_i, y_i)}, where both the input x and the output y are structured, e.g., OCR of whole words ("b r a c e") or image annotation (Yuan et al., 2007).
Learning output: a structured predictive function h that maps a structured input x to a structured output y.
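A sketch of the structured analogue of the prediction rule above, under the usual assumption that the joint feature map decomposes over the cliques of a Markov network (notation is generic, not copied from the slide):

```latex
h(x) = \arg\max_{\mathbf{y} \in \mathcal{Y}(x)} \ \mathbf{w}^\top \mathbf{f}(x, \mathbf{y}),
\qquad
\mathbf{f}(x, \mathbf{y}) = \sum_{c \in \mathcal{C}} \mathbf{f}_c(x, \mathbf{y}_c)
```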

5 Regularized Empirical Risk Minimization
True data distribution p(x, y); the risk is the expected loss under p, where the loss is a non-negative function, e.g., the Hamming loss.
Empirical risk: the average loss over the training samples.
Regularized empirical risk: the empirical risk plus a regularizer, weighted by a regularization parameter.
Regularized empirical risk minimization is a general learning framework: by using different regularizers and different predictive functions h, we obtain different models; here we introduce several of them and analyze their connections and differences.
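Written out in the standard way (the slide's own formulas were images; these are the textbook definitions):

```latex
R(h)       = \mathbb{E}_{(x,\mathbf{y}) \sim p}\,[\, \ell(h(x), \mathbf{y}) \,]          % risk under the true distribution p
\hat{R}(h) = \frac{1}{N} \sum_{i=1}^{N} \ell(h(x_i), \mathbf{y}_i)                        % empirical risk
\min_{h}\; \lambda\, \Omega(h) + \hat{R}(h)                                               % regularized empirical risk
```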

6 Max-Margin Markov Networks (M3N)
Basic setup: a linear base discriminant function, a prediction rule, a margin, and the structured hinge loss (an upper bound of the empirical risk).
L2-norm M3N (Taskar et al., 2003) and L1-norm M3N differ only in the regularizer.
By using different regularizers, the L1 and L2 M3Ns have different sparsity properties; to illustrate this, we note the following.
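A reconstruction of the standard definitions these bullets refer to, using the notation introduced above; the loss-augmented differences Δf_i(y) = f(x_i, y_i) - f(x_i, y) and Δℓ_i(y) = ℓ(y, y_i) are the usual conventions, not copied from the slide:

```latex
F(x, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(x, \mathbf{y})                     % linear base discriminant function
h(x) = \arg\max_{\mathbf{y}} F(x, \mathbf{y}; \mathbf{w})                                     % prediction rule
R_{\mathrm{hinge}}(\mathbf{w}) = \frac{1}{N} \sum_{i} \max_{\mathbf{y}}
  \big[ \Delta\ell_i(\mathbf{y}) - \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \big]       % structured hinge loss
\text{L2-norm M3N:}\ \ \min_{\mathbf{w}} \ \tfrac{\lambda}{2}\,\|\mathbf{w}\|_2^2 + R_{\mathrm{hinge}}(\mathbf{w})
\text{L1-norm M3N:}\ \ \min_{\mathbf{w}} \ \lambda\,\|\mathbf{w}\|_1 + R_{\mathrm{hinge}}(\mathbf{w})
```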

7 Sparsity Interpretation
Equivalent formulation: the regularized minimization problem can be equivalently formulated as a constrained optimization problem; for each lambda there is a c such that the two problems have the same solution, and vice versa. The constraint defines a feasible set, so the solution can be viewed as a projection of the estimate onto a norm ball (illustrated for the L2-ball and the L1-ball).
Primal sparsity: only a few input features have non-zero weights in the original model.
Pseudo-primal sparsity: only a few input features have large weights in the original model.
Dual sparsity: only a few Lagrange multipliers are non-zero.
Example (linear max-margin Markov networks): by the complementary slackness condition, non-zero Lagrange multipliers correspond to active margin constraints.
The L1-norm M3N is the primal sparse max-margin Markov network.
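The equivalence mentioned above, in the standard penalized-vs-constrained form (for a suitable correspondence between lambda and c, as the speaker notes say):

```latex
\min_{\mathbf{w}} \ \lambda\,\|\mathbf{w}\|_1 + R_{\mathrm{hinge}}(\mathbf{w})
\quad\Longleftrightarrow\quad
\min_{\mathbf{w}} \ R_{\mathrm{hinge}}(\mathbf{w}) \ \ \text{s.t.} \ \ \|\mathbf{w}\|_1 \le c
```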

8 Alg. 1: Cutting-Plane
Basic procedure (based on the constrained optimization problem): construct a subset of constraints, then optimize w.r.t. the constraints in that subset.
Successfully used to learn M3Ns (Tsochantaridis et al., 2004; Joachims et al., 2008).
Find the most violated constraints: finding them is a loss-augmented prediction problem, which can be solved efficiently with the max-product algorithm.
Solve an LP sub-problem over the working set S; w.l.o.g., we assume at most one of the associated variables is non-zero.
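A minimal Python sketch of the generic cutting-plane loop for structured max-margin learning. This is not the authors' implementation: `solve_restricted_qp` and `loss_augmented_predict` are hypothetical helpers standing in for the LP/QP sub-problem solver and the max-product loss-augmented prediction step.

```python
def cutting_plane(examples, solve_restricted_qp, loss_augmented_predict,
                  epsilon=1e-3, max_iters=100):
    """Generic cutting-plane loop: grow a working set of the most violated
    margin constraints and re-solve the restricted problem until no
    constraint is violated by more than epsilon."""
    working_set = []          # list of (example_index, y_hat) constraints
    w = None                  # current weights (None = start from all zeros)
    for _ in range(max_iters):
        added = 0
        for i, (x, y) in enumerate(examples):
            # Loss-augmented prediction: the most violated output for (x, y),
            # solvable with max-product on the underlying Markov network.
            y_hat, violation = loss_augmented_predict(w, x, y)
            if violation > epsilon and (i, y_hat) not in working_set:
                working_set.append((i, y_hat))
                added += 1
        if added == 0:
            break             # all constraints satisfied up to epsilon
        # Re-optimize over the current working set only (an LP/QP sub-problem).
        w = solve_restricted_qp(examples, working_set)
    return w
```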

9 Alg. 2: Projected Sub-gradient
Equivalent formulation: minimize the structured hinge loss subject to an L1-ball constraint.
Projected sub-gradient update: take a sub-gradient step on the hinge loss, then project back onto the L1-ball.
Sub-gradient of the hinge loss: each component is piecewise linear and is obtained from a loss-augmented prediction.
Efficient projection onto the L1-ball (Duchi et al., 2008), with low expected cost for sparse gradients.
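A sketch of the projected sub-gradient scheme. The L1-ball projection below is the O(n log n) sort-based variant described by Duchi et al. (2008); `subgrad_hinge` is a hypothetical helper that returns a sub-gradient of the structured hinge loss via loss-augmented prediction.

```python
import numpy as np

def project_onto_l1_ball(v, z):
    """Euclidean projection of v onto the L1-ball of radius z
    (sort-based variant of Duchi et al., 2008)."""
    if np.sum(np.abs(v)) <= z:
        return v
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - z) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_subgradient(examples, subgrad_hinge, z, dim, step=0.1, n_iters=200):
    """Minimal sketch: sub-gradient step on the structured hinge loss,
    followed by projection onto {w : ||w||_1 <= z}."""
    w = np.zeros(dim)
    for t in range(1, n_iters + 1):
        g = np.mean([subgrad_hinge(w, x, y) for x, y in examples], axis=0)
        w = project_onto_l1_ball(w - (step / np.sqrt(t)) * g, z)
    return w
```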

10 Alg. 3: EM-style Method
The method can adaptively change the scaling parameters for different features; irrelevant features can be discarded by driving their scaling parameters to zero.

11 Alg. 3: EM-style Method (cont'd)
The EM-style algorithm and the variational Bayesian algorithm (the detailed update equations were shown as formulas on the slide).
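A heavily hedged sketch of the adaptive-reweighting idea behind an EM-style method for L1 regularization, based on the Gaussian scale-mixture view of the Laplace penalty. It follows the generic EM-for-Lasso recipe, not necessarily the authors' exact updates; `solve_weighted_l2_m3n` is a hypothetical solver for the re-weighted L2 M3N sub-problem.

```python
import numpy as np

def em_style_l1_m3n(examples, solve_weighted_l2_m3n, lam, dim,
                    n_iters=20, eps=1e-8):
    """Sketch of an EM-style method for L1-regularized max-margin learning,
    using the scale-mixture view of the Laplace penalty:
      E-step-like: given w, set per-feature scales s_k ~ |w_k| / sqrt(lam)
                   (from E[1/tau_k | w_k] = sqrt(lam) / |w_k|)
      M-step-like: solve a re-weighted L2 (adaptive) M3N with those scales.
    `solve_weighted_l2_m3n(examples, scales)` is a hypothetical helper that
    minimizes  0.5 * sum_k w_k^2 / scales[k] + C * hinge(w),
    i.e., an ordinary M3N problem once the scales are fixed."""
    scales = np.ones(dim)
    w = np.zeros(dim)
    for _ in range(n_iters):
        # Weighted L2 M3N sub-problem with the current per-feature scales.
        w = solve_weighted_l2_m3n(examples, scales)
        # Features with small |w_k| get small scales, so they are shrunk
        # harder next round; irrelevant features are driven toward zero.
        scales = np.abs(w) / np.sqrt(lam) + eps
    return w
```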

12 Relationship Summary
(Diagram relating the models via equivalence, the EM-style algorithm, and relaxation.)
More reading: J. Zhu and E. P. Xing. On Primal and Dual Sparsity of Markov Networks. Proc. of ICML, 2009.

13 Experiments
Goals: compare different models, and compare the learning algorithms for L1-M3N (EM-style algorithm, cutting-plane, projected sub-gradient).

Models                         Primal Sparse   Dual Sparse   Regularizer
CRFs (Lafferty et al., 2001)   no              no            NULL
L2-CRFs                        pseudo          no            L2-norm
L1-CRFs                        yes             no            L1-norm
M3N (Taskar et al., 2003)      pseudo          yes           L2-norm
L1-M3N                         yes             yes           L1-norm
LapM3N (Zhu et al., 2008)      pseudo          yes           KL-norm

14 Synthetic Data Sets
Datasets with 30 correlated relevant features + 70 i.i.d. irrelevant features. We generate 10 datasets, each with 1000 samples; true labels are assigned via a Gibbs sampler. Reported numbers are average results over the 10 datasets.
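The transcript does not give the exact generation procedure; the following is an illustrative sketch of data in this spirit (the correlation structure, chain length, true weights, and Gibbs sampler settings are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sequence(length=10, n_rel=30, n_irr=70, rho=0.5, n_gibbs=50):
    """Illustrative generator: 30 correlated relevant features + 70 i.i.d.
    irrelevant features per node, with binary labels assigned by Gibbs
    sampling on a chain-structured model."""
    # Correlated relevant features (equi-correlated Gaussian), then i.i.d. noise.
    cov = (1 - rho) * np.eye(n_rel) + rho * np.ones((n_rel, n_rel))
    x_rel = rng.multivariate_normal(np.zeros(n_rel), cov, size=length)
    x_irr = rng.standard_normal((length, n_irr))
    x = np.hstack([x_rel, x_irr])
    # Only the relevant features carry signal in the "true" model.
    w_true = np.concatenate([rng.standard_normal(n_rel), np.zeros(n_irr)])
    pair = 1.0                                    # pairwise coupling on the chain
    y = rng.integers(0, 2, size=length) * 2 - 1   # labels in {-1, +1}
    for _ in range(n_gibbs):                      # Gibbs sweeps over the chain
        for t in range(length):
            field = x[t] @ w_true
            if t > 0:
                field += pair * y[t - 1]
            if t < length - 1:
                field += pair * y[t + 1]
            p = 1.0 / (1.0 + np.exp(-2.0 * field))  # P(y_t = +1 | rest)
            y[t] = 1 if rng.random() < p else -1
    return x, y
```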

15 OCR Data Sets
Task: handwritten word recognition on the OCR data of Taskar et al. (2003).
Datasets: OCR100, OCR150, OCR200, and OCR250.

16 Algorithm Comparison
Error rates and training time (CPU-seconds), on the synthetic data set and on the OCR data set (shown as plots on the slide).

17 Web Data Extraction
Task: identify the Name, Image, Price, and Description of product items.
Data: 1585 training records; 3391 testing records.
Evaluation criteria: average F1 (the average of the F1 values over all attributes) and block instance accuracy (the percentage of records whose Name, Image, and Price are all correct).
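A small sketch of how the two evaluation criteria can be computed (the record layout and field names here are assumptions made for illustration, not the authors' evaluation code):

```python
def evaluate(records):
    """records: list of dicts; r["counts"][attr] = (tp, fp, fn) for each
    attribute, and r["correct"][attr] is a per-attribute correctness flag.
    This layout is an assumption for illustration."""
    attrs = ["Name", "Image", "Price", "Description"]
    f1s = []
    for attr in attrs:
        tp = sum(r["counts"][attr][0] for r in records)
        fp = sum(r["counts"][attr][1] for r in records)
        fn = sum(r["counts"][attr][2] for r in records)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    average_f1 = sum(f1s) / len(f1s)
    # Block instance accuracy: Name, Image, and Price all correct in a record.
    block_acc = sum(
        all(r["correct"][a] for a in ("Name", "Image", "Price")) for r in records
    ) / len(records)
    return average_f1, block_acc
```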

18 Summary
Models and algorithms: the primal sparse max-margin Markov network.
We develop three learning algorithms for L1-M3N, propose a novel Adaptive max-margin Markov network (AdapM3N), and show the equivalence between L1-M3N and AdapM3N.
Experiments: L1-M3N can effectively select significant features; simultaneous (pseudo-) primal and dual sparsity benefits structured prediction models; the EM-style algorithm is robust.

19 Thanks!

20 Sparsity Interpretation
Equivalent formulation: the regularized problem can be equivalently written as a constrained problem (for each lambda there is a c giving the same solution, and vice versa), so the estimate can be viewed as a projection onto a norm ball.
Primal sparsity: only a few input features have non-zero weights in the original model.
Pseudo-primal sparsity: only a few input features have large weights in the original model.
Dual sparsity: only a few Lagrange multipliers are non-zero.
Example (linear max-margin Markov networks): by the complementary slackness condition, non-zero Lagrange multipliers correspond to active margin constraints.
The L1-norm M3N is the primal sparse max-margin Markov network.

21 Alg. 3: EM-style Method
Intuitive interpretation from the scale-mixture representation of the Laplace distribution; a first-order moment matching constraint.

22 Solution to MaxEnDNet
Theorem 1 (Solution to MaxEnDNet) gives the posterior distribution and the dual optimization problem; it is stated via the convex conjugate of a closed proper convex function (definition and example shown on the slide).
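The standard definition of the convex conjugate that the theorem relies on, with a familiar example (reconstructed from the usual definition, not copied from the slide):

```latex
f^{*}(\boldsymbol{\mu}) \;=\; \sup_{\mathbf{w}} \big( \langle \boldsymbol{\mu}, \mathbf{w} \rangle - f(\mathbf{w}) \big)
% Example: for f(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|_2^2 the conjugate is f^{*}(\boldsymbol{\mu}) = \tfrac{1}{2}\|\boldsymbol{\mu}\|_2^2.
```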

23 Reduction to M3Ns
Theorem 2 (Reduction of MaxEnDNet to M3Ns): assume a standard normal prior; the theorem then gives the posterior distribution, the dual optimization problem, and the predictive rule. Thus, MaxEnDNet subsumes M3Ns and admits all the merits of max-margin learning. Furthermore, MaxEnDNet has at least three additional advantages …

24 Laplace M3Ns (LapM3N)
The prior in MaxEnDNet can be designed to introduce useful regularization effects, such as a sparsity bias. A normal prior is not well suited to sparse learning … instead, we use the Laplace prior and its hierarchical representation (scale mixture).
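The hierarchical (scale-mixture) representation referred to here is the standard identity expressing the Laplace density as a Gaussian mixed over an exponentially distributed variance (with lambda the Laplace parameter):

```latex
p(w_k \mid \lambda)
= \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\,|w_k|}
= \int_{0}^{\infty} \mathcal{N}(w_k \mid 0, \tau_k)\; \frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\, d\tau_k
```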

25 Posterior shrinkage effect in LapM3N
Exact integration in LapM3N, or alternatively a functional transformation; a similar calculation holds for M3Ns (with a standard normal prior). The derivations were shown as formulas on the slide.

26 Variational Bayesian Learning
The exact dual function is hard to optimize. Using the hierarchical representation, we obtain an upper bound and optimize that instead.
Why is it easier? Alternating minimization leads to nicer optimization problems:
Keeping the scale variables fixed, the effective prior is normal, so the sub-problem is an M3N optimization problem.
Keeping the distribution over weights fixed, the scale variables and their expectations have a closed-form solution.

27 Laplace Max-Margin Markov Networks (LapM3N)
MaxEnDNet is an averaging model (Zhu et al., 2008): it maintains a distribution over weights, and the effective discriminant function, predictive rule, margin, and expected structured hinge loss are obtained by averaging the base discriminant function under that distribution, which defines the learning problem.
The Laplace max-margin Markov network is the Laplace MaxEnDNet, i.e., MaxEnDNet with a Laplace prior.
Note that for linear models the predictive rule depends only on the posterior mean. Let me say a little bit more about MaxEnDNet and LapM3N.
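A sketch of the learning problem in the usual MaxEnDNet form: minimize the KL-divergence to a prior subject to expected margin constraints (this is the standard maximum entropy discrimination formulation with hinge slacks; the slide's exact notation may differ):

```latex
\min_{p(\mathbf{w}),\, \boldsymbol{\xi}} \ KL\big(p(\mathbf{w}) \,\|\, p_0(\mathbf{w})\big) + C \sum_{i} \xi_i
\quad \text{s.t.} \quad
\mathbb{E}_{p(\mathbf{w})}\!\big[\mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y})\big] \ \ge\ \Delta\ell_i(\mathbf{y}) - \xi_i,
\ \ \forall i,\ \forall \mathbf{y} \neq \mathbf{y}_i, \qquad \xi_i \ge 0
```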

28 More about MaxEnDNet
MaxEnDNet: MED (Jaakkola et al., 1999) for structured prediction.
General solution to MaxEnDNet: a posterior distribution and a dual optimization problem (formulas shown on the slide).
Gaussian MaxEnDNet reduces to the L2-norm M3N: assuming a standard normal prior, the posterior distribution, dual optimization, and predictive rule recover those of the M3N.

29 More about Laplace M3Ns
Motivation: a normal prior is not good for sparse learning … it cannot distinguish relevant from irrelevant features.
Posterior shrinkage effect (Zhu et al., 2008): the Laplace prior and the standard normal prior lead to different posterior shrinkage of the weights (the precise expressions were shown as formulas on the slide).

30 Entropic Regularizer
Properties: strictly positive; monotonically increasing; approaching the L1-norm (the precise statements were given as formulas on the slide).

31 Graphical illustration

32 Relationship Summary
More reading: J. Zhu and E. P. Xing. On Primal and Dual Sparsity of Markov Networks. Proc. of ICML, 2009.

