Primal Sparse Max-Margin Markov Networks

Presentation transcript:

Primal Sparse Max-Margin Markov Networks
Jun Zhu*†, Eric P. Xing† and Bo Zhang*
jun-zhu@mails.tsinghua.edu.cn
*Department of Computer Science & Technology, Tsinghua University
†School of Computer Science, Carnegie Mellon University
KDD 2009, Paris, France

Outline
- Problem formulation: structured prediction and models
- Three learning algorithms: cutting-plane, projected sub-gradient, EM-style method
- Experimental results
- Conclusion

Classification
- Learning input: a set of i.i.d. training samples {(x_i, y_i)}, e.g., images of handwritten characters (OCR) with their labels
- Learning output: a predictive function h(x)
- A representative model: multi-class Support Vector Machines (SVM) (Crammer & Singer, 2001); see the small sketch below
- Running example: recognizing the handwritten characters b, r, a, c, e
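As a small illustration (not from the slides), the sketch below shows the linear multi-class prediction rule behind Crammer & Singer-style SVMs, y_hat = argmax_y w_y · x; the weight matrix and input are made-up toy values.

```python
import numpy as np

def predict_multiclass(W: np.ndarray, x: np.ndarray) -> int:
    """W holds one weight vector per class (shape: num_classes x num_features);
    the prediction is the class with the largest linear score."""
    return int(np.argmax(W @ x))

# toy usage with made-up numbers
W = np.array([[1.0, -0.5], [0.2, 0.8]])
x = np.array([0.3, 0.9])
print(predict_multiclass(W, x))  # index of the highest-scoring class
```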

Structured Prediction
- Classification ignores the dependency among outputs: in OCR, the letters of "brace" are predicted independently
- Learning input: a set of i.i.d. samples {(x_i, y_i)}, where the input x (e.g., a sequence of character images, or an image to annotate (Yuan et al., 2007)) and the output y (e.g., the word "brace") are both structured
- Learning output: a structured predictive function h that maps a structured input to a structured output

Regularized Empirical Risk Minimization
- True data distribution: p(x, y)
- Risk: the expected loss under the true distribution, where the loss ℓ is a non-negative function, e.g., the Hamming loss
- Empirical risk: the average loss over the training samples
- Regularized empirical risk: the empirical risk plus λ times a regularizer Ω, where λ is the regularization parameter (standard forms are written out below)
Regularized empirical risk minimization is a general learning framework: by using different regularizers and different predictive functions h, we obtain different models. Here we introduce several of them and analyze their connections and differences.
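The slide's formulas were images and are not preserved in the transcript; for reference, a standard rendering of these quantities, consistent with the definitions above, is:

```latex
% Risk, empirical risk, and the regularized learning problem
R(h) = \mathbb{E}_{p(x,y)}\big[\,\ell(y, h(x))\,\big], \qquad
R_{\mathrm{emp}}(h) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(y_i, h(x_i)\big), \qquad
\min_{h}\; R_{\mathrm{emp}}(h) + \lambda\,\Omega(h)
```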

Max-Margin Markov Networks (M3N)
- Basic setup: the base discriminant function takes the linear form w^T f(x, y)
- Prediction rule: choose the labeling y that maximizes the discriminant function
- Margin: the gap between the score of the true labeling and that of any other labeling
- Structured hinge loss: an upper bound on the empirical risk
- L2-norm M3N (Taskar et al., 2003): structured hinge loss with an L2-norm regularizer
- L1-norm M3N: structured hinge loss with an L1-norm regularizer (standard forms are written out below)
By using different regularizers, the L1- and L2-norm M3Ns have different sparsity properties, as illustrated on the next slide.
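The slide's equations are likewise missing from the transcript. The standard M3N forms (following Taskar et al., 2003), which the bullets above appear to reference, can be written as follows, where f(x, y) is the joint feature map and Δℓ_i(y) is the structured (e.g., Hamming) loss:

```latex
% Prediction rule and feature difference
h(x) = \arg\max_{y}\; \mathbf{w}^{\top}\mathbf{f}(x, y), \qquad
\Delta\mathbf{f}_i(y) = \mathbf{f}(x_i, y_i) - \mathbf{f}(x_i, y)

% Structured hinge loss (upper bound on the empirical risk)
R_{\mathrm{hinge}}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}
  \max_{y}\big[\,\Delta\ell_i(y) - \mathbf{w}^{\top}\Delta\mathbf{f}_i(y)\,\big]

% L2-norm M3N vs. L1-norm (primal sparse) M3N
\min_{\mathbf{w}}\; \lambda\|\mathbf{w}\|_2^2 + R_{\mathrm{hinge}}(\mathbf{w})
\qquad\quad
\min_{\mathbf{w}}\; \lambda\|\mathbf{w}\|_1 + R_{\mathrm{hinge}}(\mathbf{w})
```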

Sparsity Interpretation
- Equivalent formulation: for each regularization parameter λ there is a constraint level C (and vice versa) such that the regularized problem has the same solution as minimizing the hinge loss subject to a norm-ball constraint on w; the constraint acts as a projection of the estimate onto that norm ball (illustrated on the slide with the L2-ball and the L1-ball; see the sketch below)
- Primal sparsity: only a few input features have non-zero weights in the original model
- Pseudo-primal sparsity: only a few input features have large weights in the original model
- Dual sparsity: only a few Lagrange multipliers are non-zero
- Example (linear max-margin Markov networks): by the complementary slackness condition, only a few multipliers are non-zero, which is the dual sparsity illustrated on the slide (points p1, p2, p3)
- The L1-norm M3N is the primal sparse max-margin Markov network
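Using the same standard notation as above, the equivalence mentioned in the first bullet can be sketched as follows (a correspondence between λ and C, not an identity of the two parameters):

```latex
% Regularized form and its constrained (projection) counterpart
\min_{\mathbf{w}}\; \lambda\|\mathbf{w}\|_1 + R_{\mathrm{hinge}}(\mathbf{w})
\quad\Longleftrightarrow\quad
\min_{\mathbf{w}\,:\,\|\mathbf{w}\|_1 \le C}\; R_{\mathrm{hinge}}(\mathbf{w})
```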

Alg. 1: Cutting-Plane
- Basic procedure (based on the constrained optimization problem): iteratively construct a subset of constraints and optimize with respect to the constraints in that subset
- Successfully used to learn M3Ns (Tsochantaridis et al., 2004; Joachims et al., 2008)
- Find the most violated constraints: finding them is a loss-augmented prediction problem, which can be solved efficiently with the max-product algorithm
- Solve an LP sub-problem over the working set S (a generic code skeleton is sketched below)
- W.l.o.g., we assume that at most one of each pair of auxiliary LP variables is non-zero
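A minimal, generic cutting-plane skeleton, not the paper's implementation: `loss_augmented_argmax`, `constraint_violation`, and `solve_over_working_set` are hypothetical callables standing in for loss-augmented max-product decoding, the margin-violation check, and the restricted LP/QP solve over the working set.

```python
def cutting_plane(examples, w0, loss_augmented_argmax, constraint_violation,
                  solve_over_working_set, epsilon=1e-3, max_iters=100):
    """Generic cutting-plane loop (sketch): grow a working set of the most
    violated constraints and re-solve the restricted problem until no
    constraint is violated by more than epsilon."""
    w = w0
    working_set = []  # collected (example index, violating labeling) pairs
    for _ in range(max_iters):
        added = 0
        for i, (x, y_true) in enumerate(examples):
            y_hat = loss_augmented_argmax(w, x, y_true)   # most violated labeling
            if constraint_violation(w, i, x, y_true, y_hat) > epsilon:
                working_set.append((i, y_hat))
                added += 1
        if added == 0:
            break                                          # epsilon-optimal
        w = solve_over_working_set(working_set, examples)  # restricted LP/QP
    return w
```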

Alg. 2: Projected Sub-gradient
- Works on the equivalent constrained formulation: minimize the structured hinge loss subject to an L1-ball constraint on w
- Projected sub-gradient update: a sub-gradient step on the hinge loss, followed by projection back onto the L1-ball
- Sub-gradient of the hinge loss: each component is piecewise linear and is obtained from a loss-augmented prediction
- Efficient projection onto the L1-ball (Duchi et al., 2008), with better expected running time for sparse gradients (a sketch of the sorting-based projection follows)
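Not the paper's code: a minimal sketch of the standard sorting-based Euclidean projection onto the L1-ball of radius z (the simpler O(d log d) variant rather than the expected-linear-time algorithm of Duchi et al., 2008). The toy numbers at the end are made up and only show that the projection zeroes out small coordinates, i.e., primal sparsity.

```python
import numpy as np

def project_onto_l1_ball(v: np.ndarray, z: float) -> np.ndarray:
    """Euclidean projection of v onto {w : ||w||_1 <= z}."""
    if np.abs(v).sum() <= z:
        return v.copy()                       # already inside the ball
    u = np.sort(np.abs(v))[::-1]              # magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)      # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# toy check: small coordinates are set exactly to zero
w = np.array([0.9, -0.05, 0.4, 0.01])
print(project_onto_l1_ball(w, z=1.0))         # e.g. [ 0.75  0.    0.25  0. ]
```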

Alg. 3: EM-style Method
- Can adaptively change the scaling parameters for different features
- Irrelevant features can be discarded by having a zero scaling parameter

Alg. 3: EM-style Method (cont'd)
- Side-by-side comparison of the EM-style algorithm and the variational Bayesian algorithm (update equations omitted; a rough code sketch of the alternating procedure follows)
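The slide's update equations are not preserved in the transcript. Purely as a rough sketch of the alternating structure described above, the skeleton below assumes a hypothetical solver `solve_reweighted_l2_m3n(reg_weights)` that minimizes the structured hinge loss plus a feature-wise reweighted squared-norm penalty; the scaling update (heavily penalizing features with small |w_k|) is illustrative and not the paper's exact rule.

```python
import numpy as np

def em_style_sparse_m3n(solve_reweighted_l2_m3n, num_features,
                        lam=1.0, num_iters=20, eps=1e-8):
    """Rough alternating sketch: update per-feature scaling from the current
    weights, then re-solve a reweighted l2-regularized M3N."""
    w = np.full(num_features, 1.0)              # start with all features active
    for _ in range(num_iters):
        # scaling step: features with tiny |w_k| receive a large penalty
        # and are pushed toward (and effectively discarded at) zero
        reg_weights = lam / (np.abs(w) + eps)
        # solve step: reweighted l2-regularized M3N (hypothetical solver)
        w = solve_reweighted_l2_m3n(reg_weights)
    return w
```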

Relationship Summary
- Diagram summarizing the relationships among the models (edge labels: equivalence, EM-style algorithm, relaxation; diagram omitted)
- More reading: J. Zhu and E. P. Xing. On Primal and Dual Sparsity of Markov Networks. Proc. of ICML, 2009

Experiments
Goals:
- Compare different models
- Compare the three learning algorithms for L1-M3N: EM-style, cutting-plane, projected sub-gradient
Models compared:
Model | Primal Sparse | Dual Sparse | Regularizer
CRFs (Lafferty et al., 2001) | no | no | NULL
L2-CRFs | pseudo | no | L2-norm
L1-CRFs | yes | no | L1-norm
M3N (Taskar et al., 2003) | pseudo | yes | L2-norm
L1-M3N | yes | yes | L1-norm
LapM3N (Zhu et al., 2008) | pseudo | yes | KL-norm

Synthetic Data Sets
- Datasets with 30 correlated relevant features + 70 i.i.d. irrelevant features
- We generate 10 datasets, each with 1000 samples; true labels are assigned via a Gibbs sampler
- Reported results are averages over the 10 datasets (plots omitted)

OCR Data Sets
- Task: handwritten word recognition, as in (Taskar et al., 2003)
- Datasets: OCR100, OCR150, OCR200, and OCR250

Algorithm Comparison
- Error rates and training time (CPU-seconds) of the three algorithms, on the synthetic data set and on the OCR data set (result tables omitted)

Web Data Extraction
- Task: identify the Name, Image, Price, and Description of product items
- Data: 1585 training records; 3391 testing records
- Evaluation criteria (see the sketch below):
  - Average F1: the average of the F1 values over all attributes
  - Block instance accuracy: the percentage of records whose Name, Image, and Price are all correctly identified
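As a hypothetical illustration of the two evaluation criteria (this is not the paper's evaluation code; the data structures are made up):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f1(counts_per_attribute: dict) -> float:
    """counts_per_attribute maps attribute name -> (tp, fp, fn); returns the mean F1."""
    return sum(f1(*c) for c in counts_per_attribute.values()) / len(counts_per_attribute)

def block_instance_accuracy(records: list) -> float:
    """records: dicts with booleans saying whether Name/Image/Price were extracted correctly."""
    ok = sum(r["name_ok"] and r["image_ok"] and r["price_ok"] for r in records)
    return ok / len(records)
```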

Summary
Models and algorithms:
- Primal sparse max-margin Markov networks (L1-M3N)
- Three learning algorithms developed for L1-M3N
- A novel adaptive max-margin Markov network (AdapM3N)
- Equivalence shown between L1-M3N and AdapM3N
Experiments:
- L1-M3N can effectively select significant features
- Simultaneous (pseudo-)primal and dual sparsity benefits structured prediction models
- The EM-style algorithm is robust

Thanks!

Alg. 3: EM-style Method
- Intuitive interpretation via the scale-mixture representation of the Laplace distribution
- First-order moment matching constraint

Solution to MaxEnDNet
- Theorem 1 (Solution to MaxEnDNet): the posterior distribution and the dual optimization problem (equations omitted)
- Convex conjugate of a closed proper convex function: definition and example (equations omitted)

Reduction to M3Ns
- Theorem 2 (Reduction of MaxEnDNet to M3Ns): assuming a standard normal prior, the posterior distribution, dual optimization problem, and predictive rule recover those of the M3N (equations omitted)
- Thus, MaxEnDNet subsumes M3Ns and admits all the merits of max-margin learning
- Furthermore, MaxEnDNet has at least three advantages …

Laplace M3Ns (LapM3N)
- The prior in MaxEnDNet can be designed to introduce useful regularization effects, such as a sparsity bias
- A normal prior is not so good for sparse data …
- Instead, we use the Laplace prior …, which admits a hierarchical representation (scale mixture)

Posterior Shrinkage Effect in LapM3N
- Exact integration in LapM3N; alternatively … (equations omitted)
- Similar calculation for M3Ns (a standard normal prior)
- A functional transformation

Variational Bayesian Learning
- The exact dual function is hard to optimize
- Using the hierarchical representation, we optimize an upper bound instead
- Why is it easier? Alternating minimization leads to nicer optimization problems: with one block of variables kept fixed, the effective prior is normal and we obtain an M3N optimization problem; with the other block kept fixed, there is a closed-form solution (and its expectation)

Laplace Max-Margin Markov Networks (LapM3N)
- MaxEnDNet: an averaging model (Zhu et al., 2008) that learns a distribution over model parameters
- Base discriminant function, effective discriminant function, predictive rule, margin, and expected structured hinge loss (equations omitted)
- Learning problem (equation omitted)
- The Laplace max-margin Markov network is the Laplace MaxEnDNet, i.e., MaxEnDNet with a Laplace prior
Note that for linear models, the predictive rule depends only on the posterior mean.

More about MaxEnDNet
- MaxEnDNet: maximum entropy discrimination (MED; Jaakkola et al., 1999) for structured prediction
- General solution to MaxEnDNet: posterior distribution and dual optimization problem (equations omitted)
- Gaussian MaxEnDNet reduces to the L2-norm M3N: assuming a standard normal prior, the posterior distribution, dual optimization, and predictive rule recover those of the M3N

More about Laplace M3Ns
- Motivation: a normal prior is not good for sparse learning … it cannot distinguish relevant from irrelevant features
- Posterior shrinkage effect (Zhu et al., 2008): compare the posterior mean under a Laplace prior with that under a standard normal prior (equations omitted)

Entropic Regularizer
- Strictly positive
- Monotonically increasing
- Approaching the 1-norm

Graphical illustration (figure omitted)
