Laplace Maximum Margin Markov Networks

Jun Zhu
Dept. of Comp. Sci. & Tech., Tsinghua University
Joint work with Eric P. Xing and Bo Zhang
(This work was done while I was a visiting researcher at CMU.)
ICML, Helsinki, Finland

Outline
- Introduction
  - Structured Prediction
  - Max-margin Markov Networks
- Max-Entropy Discrimination Markov Networks
  - Basic Theorems
  - Laplace Max-margin Markov Networks
  - Experimental Results
- Summary

Classical Classification Models
- Inputs: a set of training samples D = {(x_i, y_i)}, i = 1..N, where x_i is an input in X and y_i is a label in {+1, -1}
- Outputs: a predictive function h: X -> {+1, -1}
- Examples:
  - Support Vector Machine (SVM): max-margin learning
  - Logistic Regression: max-likelihood estimation
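The two example objectives appeared as images in the transcript; a standard reconstruction (my notation, not necessarily the slide's) is:

```latex
% SVM: max-margin learning (primal, hinge-loss form)
\min_{w,\ \xi \ge 0}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{N}\xi_i
\quad \text{s.t.}\quad y_i\, w^\top x_i \ \ge\ 1 - \xi_i,\ \ \forall i

% Logistic regression: maximum-likelihood estimation
\max_{w}\ \sum_{i=1}^{N} \log p(y_i \mid x_i; w),
\qquad p(y \mid x; w) = \frac{1}{1 + \exp(-y\, w^\top x)}
```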

Structured Prediction
- Complicated prediction examples:
  - Part-of-speech (POS) tagging: "Do you want fries with that?" -> a tag sequence (e.g., verb, pronoun, verb, noun, preposition, pronoun)
  - Image segmentation
- Inputs: a set of training samples D = {(x_i, y_i)}, where each y_i is a structured output (e.g., a sequence of tags)
- Outputs: a predictive function h: X -> Y
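The predictive function, shown as an image on the slide, is standardly defined as a feature-based argmax; a reconstruction in common notation:

```latex
h(x) = \arg\max_{y \in \mathcal{Y}(x)} F(x, y; w),
\qquad F(x, y; w) = w^\top f(x, y)
```

where f(x, y) is a vector of feature functions and Y(x) is the (exponentially large) set of candidate structured outputs for x.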

Structured Prediction Models
- Conditional Random Fields (CRFs) (Lafferty et al., 2001)
  - Based on logistic regression
  - Max-likelihood estimation (a point estimate)
- Max-margin Markov Networks (M3Ns) (Taskar et al., 2003)
  - Based on the SVM
  - Max-margin learning (also a point estimate)
- In both models, the Markov properties are encoded in the feature functions
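For concreteness, the two training objectives (images on the slide) take the following standard forms in the earlier notation; this is a reconstruction, not the slide's exact rendering:

```latex
% CRF: conditional maximum likelihood
p(y \mid x; w) = \frac{\exp\{w^\top f(x, y)\}}{Z(x; w)},
\qquad \max_{w}\ \sum_{i} \log p(y_i \mid x_i; w)

% M3N: structured max-margin learning
\min_{w,\ \xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i}\xi_i
\quad \text{s.t.}\quad w^\top \Delta f_i(y) \ \ge\ \Delta\ell_i(y) - \xi_i,\ \ \forall i,\ \forall y
```

with \Delta f_i(y) = f(x_i, y_i) - f(x_i, y) and \Delta\ell_i(y) a label loss such as Hamming distance.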

Between MLE and Max-Margin Learning
- Likelihood-based estimation
  - Probabilistic (joint/conditional likelihood model)
  - Easily adapted to Bayesian learning; can incorporate prior knowledge and missing data
- Max-margin learning
  - Non-probabilistic (concentrates on the input-output mapping)
  - Not obvious how to perform Bayesian learning or to handle priors and missing data
  - Sound theoretical guarantees with limited samples
- Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999)
  - A Bayesian learning approach that combines the strengths of both
  - Its optimization problem for binary classification is sketched below
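A reconstruction of the MED program for binary classification (the slide showed it as an image), following Jaakkola et al. (1999) in a simplified slack-penalty form:

```latex
\min_{p(w),\ \xi}\ \mathrm{KL}\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
\quad \text{s.t.}\quad
\int p(w)\, y_i F(x_i; w)\, dw \ \ge\ 1 - \xi_i,\ \ \forall i
```

where p_0(w) is a prior, U(\xi) is a slack penalty such as C \sum_i \xi_i, and prediction averages over the learned distribution: \hat{y} = \mathrm{sign} \int p(w)\, F(x; w)\, dw.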

MaxEnt Discrimination Markov Networks
- MaxEnt Discrimination Markov Networks (MaxEntNet):
  - A generalized maximum entropy (equivalently, regularized KL-divergence) objective
  - Over a subspace of distributions defined by expected margin constraints
- Prediction is Bayesian, averaging over the learned distribution
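A reconstruction of the MaxEntNet program (images in the transcript), extending the MED formulation to structured outputs with the M3N notation above; treat it as a sketch rather than the slide's exact statement:

```latex
\min_{p(w),\ \xi}\ \mathrm{KL}\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
\quad \text{s.t.}\quad
\int p(w)\, \Delta F_i(y; w)\, dw \ \ge\ \Delta\ell_i(y) - \xi_i,\ \ \forall i,\ \forall y

% Bayesian prediction
h(x) = \arg\max_{y}\ \int p(w)\, F(x, y; w)\, dw
```

where \Delta F_i(y; w) = F(x_i, y_i; w) - F(x_i, y; w).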

Solution to MaxEntNet
- Theorem 1 (Solution to MaxEntNet) gives, in closed form:
  - The posterior distribution: the prior tilted by the margin constraints
  - The dual optimization problem, stated via the convex conjugate of the slack penalty
- Convex conjugate (of a closed proper convex function): definition and an example are reconstructed below
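A hedged reconstruction of the theorem's formulas (shown as images), in the form standard for MED-style derivations:

```latex
% Posterior: the prior tilted by dual variables \alpha_i(y) \ge 0
p(w) = \frac{1}{Z(\alpha)}\, p_0(w)\,
\exp\Big\{ \sum_{i, y} \alpha_i(y)\,\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big] \Big\}

% Dual problem
\max_{\alpha \ge 0}\ -\log Z(\alpha) - U^{*}(\alpha)

% Convex conjugate: definition
U^{*}(\alpha) = \sup_{\xi}\ \big( \alpha^\top \xi - U(\xi) \big)

% Example: for U(\xi) = C \sum_i \xi_i with \xi \ge 0, the conjugate U^*
% is 0 when \sum_y \alpha_i(y) \le C for all i, and +\infty otherwise
```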

Reduction to M3Ns
- Theorem 2 (Reduction of MaxEntNet to M3Ns): assume a standard normal prior, p_0(w) = N(0, I). Then:
  - The posterior distribution is a Gaussian centered at the M3N solution
  - The dual optimization reduces to exactly the M3N dual
  - The predictive rule coincides with the M3N rule
- Thus MaxEntNet subsumes M3Ns and admits all the merits of max-margin learning
- Furthermore, MaxEntNet has at least three further advantages, discussed next
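A reconstruction of the reduction (the slide's formulas were images), under the stated normal-prior assumption:

```latex
% Posterior and its mean
p(w) = \mathcal{N}\big(w \mid \mu,\ I\big),
\qquad \mu = \sum_{i, y} \alpha_i(y)\, \Delta f_i(y)

% Dual: exactly the M3N dual (cf. the backup slide at the end)
\max_{\alpha \ge 0}\ \sum_{i, y} \alpha_i(y)\, \Delta\ell_i(y)
\ -\ \frac{1}{2}\Big\| \sum_{i, y} \alpha_i(y)\, \Delta f_i(y) \Big\|_2^2

% Predictive rule
h(x) = \arg\max_{y}\ \mu^\top f(x, y)
```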

The Three Advantages
- A PAC-Bayes prediction error guarantee
- Regularization effects, such as a sparsity bias, introduced through the prior
- An elegant way to incorporate latent variables and structures


Generalization Guarantee
- MaxEntNet is an averaging model; we also call it a Bayesian Max-Margin Markov Network
- Theorem 3 (PAC-Bayes Bound) bounds the expected prediction error of the averaging predictor in terms of its empirical margin violations and the KL-divergence from the prior
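The slide's "If ... Then" statement was an image; as a schematic only, PAC-Bayes margin bounds for averaging predictors take roughly the following shape (this is the generic form of such results, not the paper's exact Theorem 3):

```latex
\Pr\big[\, h_{p}(x) \ne y \,\big]
\ \le\
\widehat{\Pr}\big[\, \text{margin} \le \gamma \ \text{on the sample} \,\big]
\ +\ O\!\left( \sqrt{ \frac{ \gamma^{-2}\, \mathrm{KL}(p \,\|\, p_0)\, \ln N + \ln \tfrac{1}{\delta} }{ N } } \right)
```

holding with probability at least 1 - \delta over the draw of the N training samples.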

Laplace M3Ns (LapM3N)
- The prior in MaxEntNet can be designed to introduce useful regularization effects, such as a sparsity bias
- A normal prior is not well suited to sparse data
- Instead, we use a Laplace prior, via its hierarchical representation as a scale mixture of Gaussians
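The Laplace prior and its scale-mixture representation (shown as images on the slide) are standard; per component:

```latex
p_0(w) = \prod_{k} \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\, |w_k|},
\qquad
\frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\, |w_k|}
= \int_{0}^{\infty} \mathcal{N}(w_k \mid 0, \tau_k)\,
\frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\, d\tau_k
```

That is, a Laplace variable is a Gaussian whose variance \tau_k has an exponential hyper-prior; this identity is what makes the variational scheme below tractable.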

Posterior Shrinkage Effect in LapM3N
- Exact integration over the scale-mixture representation in LapM3N yields a posterior mean that is nonlinearly shrunk toward zero
- A similar calculation for M3Ns (a standard normal prior) yields no such shrinkage
- Alternatively, the same effect can be derived through a functional transformation
(The formulas and the illustrative shrinkage plot on this slide were images.)

Variational Bayesian Learning
- The exact dual function is hard to optimize
- Using the hierarchical (scale-mixture) representation, we obtain an upper bound and optimize that instead
- Why is it easier? Alternating minimization leads to nicer subproblems:
  - Keeping the variance terms fixed, the effective prior is normal, so this step is an M3N-style problem
  - Keeping the posterior over w fixed, the variance distribution and its required expectation have closed-form solutions
A toy sketch of this alternation follows.
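To make the alternation concrete, here is a minimal runnable sketch. Everything in it is simplified and assumed, not the paper's algorithm: multiclass rather than sequence outputs, a subgradient/proximal solver in place of the dual QP for the w-step, and the closed-form variance update <1/tau_k> = sqrt(lambda)/|w_k| (the standard generalized-inverse-Gaussian expectation) applied to pooled posterior-mean magnitudes.

```python
import numpy as np

# Toy data: only the first 4 of 20 features are relevant.
rng = np.random.default_rng(0)
N, D, K = 200, 20, 3
X = rng.normal(size=(N, D))
w_true = np.zeros((K, D))
w_true[:, :4] = rng.normal(size=(K, 4))
y = (X @ w_true.T).argmax(axis=1)

lam, C, lr, eps = 1.0, 1.0, 0.01, 1e-6
W = np.zeros((K, D))            # posterior mean of the weights
inv_tau = np.ones(D)            # expected inverse variances <1/tau_k>

for outer in range(10):
    # w-step: with <1/tau> fixed, the effective prior is normal, so we
    # minimize 0.5 * sum_k inv_tau[k] * ||W[:, k]||^2 + C * multiclass hinge.
    for it in range(200):
        margins = X @ W.T + 1.0
        margins[np.arange(N), y] -= 1.0      # no margin for the true label
        y_hat = margins.argmax(axis=1)       # loss-augmented prediction
        grad = np.zeros_like(W)
        for i in np.where(y_hat != y)[0]:    # hinge subgradient
            grad[y_hat[i]] += C * X[i]
            grad[y[i]] -= C * X[i]
        # explicit step on the hinge, implicit (stable) step on the prior
        W = (W - lr * grad) / (1.0 + lr * inv_tau)
    # tau-step: closed-form update of the expected inverse variances
    inv_tau = np.sqrt(lam) / (np.abs(W).mean(axis=0) + eps)

print("train accuracy:", ((X @ W.T).argmax(axis=1) == y).mean())
print("mean |W| on irrelevant features:", np.abs(W[:, 4:]).mean())
```

Irrelevant features acquire large expected inverse variances and are driven toward zero, which is the posterior shrinkage effect in miniature.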

Variational Bayesian Learning (Cont'd)
(The update equations on this slide were images and are not recoverable from the transcript.)

Experiments
We compare LapM3N with:
- CRFs: maximum-likelihood estimation (MLE)
- L2-CRFs: L2-norm penalized MLE
- L1-CRFs: L1-norm penalized MLE (sparse estimation)
- M3Ns: max-margin learning

Experimental Results on Synthetic Datasets
- Datasets with 100 i.i.d. features, of which 10, 30, or 50 are relevant. For each setting we generate 10 datasets, each with 1,000 samples; true labels are assigned via a Gibbs sampler with 5,000 iterations.
- Datasets with 30 correlated relevant features plus 70 i.i.d. irrelevant features.
(The result plots were images.)

Experimental Results on OCR Datasets
We randomly construct the OCR100, OCR150, OCR200, and OCR250 subsets for 10-fold cross-validation. (The result plots were images.)

Sensitivity to Regularization Constants
- L1-CRFs are much more sensitive to the regularization constant than the other models; LapM3N is the most stable of all
- L1-CRFs and L2-CRFs: constants ..., 0.01, 0.1, 1, 4, 9, 16
- M3Ns and LapM3N: constants 1, 4, 9, 16, 25, 36, 49, 64, 81

Summary
- We propose MaxEntNet, a general framework for Bayesian max-margin structured prediction
  - MaxEntNet subsumes the standard M3Ns
  - It admits a PAC-Bayes theoretical error bound
- We propose Laplace max-margin Markov networks (LapM3N)
  - Enjoys a posterior shrinkage effect
  - Performs as well as sparse models on synthetic data, and better on real datasets
  - More stable with respect to regularization constants

Thanks!
(Detailed proofs: see the backup slides that follow.)

Primal and Dual Problems of M3Ns
- Primal problem; typical algorithms: cutting plane, sub-gradient, ...
- Dual problem; typical algorithms: SMO, exponentiated gradient, ...
The two programs were images on the slide; a standard reconstruction follows.
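The standard M3N primal and dual (Taskar et al., 2003), in the notation used earlier; a reconstruction rather than the slide's exact rendering:

```latex
% Primal
\min_{w,\ \xi \ge 0}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i}\xi_i
\quad \text{s.t.}\quad w^\top \Delta f_i(y) \ \ge\ \Delta\ell_i(y) - \xi_i,\ \ \forall i,\ \forall y

% Dual
\max_{\alpha \ge 0}\ \sum_{i, y} \alpha_i(y)\, \Delta\ell_i(y)
\ -\ \frac{1}{2}\Big\| \sum_{i, y} \alpha_i(y)\, \Delta f_i(y) \Big\|_2^2
\quad \text{s.t.}\quad \sum_{y} \alpha_i(y) = C,\ \ \forall i
```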

Bayesian Learning
- Point estimate (MLE and max-margin learning): prediction plugs in a single learned parameter
- Bayesian learning: prediction is an averaging model over the posterior
The two predictive rules are reconstructed below.
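A reconstruction of the two predictive rules (images on the slide), in the earlier notation:

```latex
% Point estimate: plug in a single learned \hat{w}
h(x) = \arg\max_{y}\ F(x, y; \hat{w})

% Bayesian averaging over the posterior p(w)
h(x) = \arg\max_{y}\ \int p(w)\, F(x, y; w)\, dw
```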