MSRC Summer School - 30/06/2009 Cambridge – UK Hybrids of generative and discriminative methods for machine learning

Motivation
Generative models: prior knowledge, handle missing data such as labels.
Discriminative models: perform well at classification.
However, there is no straightforward way to combine them.

Content
- Generative and discriminative methods
- A principled hybrid framework
- Study of the properties on a toy example
- Influence of the amount of labelled data


Generative methods
Answer: “What does a cat look like? And a dog?”
=> model the joint distribution of data and labels
x: data, c: label, θ: parameters

Generative methods
Objective function: G(θ) = p(θ) p(X, C|θ) = p(θ) ∏_n p(x_n, c_n|θ)
One reusable model per class; can deal with incomplete data.
Example: GMMs
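
As a concrete illustration (not from the original slides), this objective for a spherical-Gaussian class-conditional model could be computed as follows; the parameter layout (theta["pi"], theta["mu"], theta["sigma2"]) is hypothetical:

```python
# A minimal sketch of the generative objective G(theta) for a model with
# one spherical Gaussian per class. All names are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def log_G(theta, X, C):
    """log G(theta) = log p(theta) + sum_n log p(x_n, c_n | theta)."""
    D = X.shape[1]
    log_prior = 0.0  # flat prior p(theta), for simplicity
    ll = 0.0
    for n in range(len(X)):
        c = C[n]
        # joint: p(x_n, c_n | theta) = p(c_n) * N(x_n | mu_c, sigma^2 I)
        ll += np.log(theta["pi"][c]) + multivariate_normal.logpdf(
            X[n], mean=theta["mu"][c], cov=theta["sigma2"] * np.eye(D))
    return log_prior + ll
```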

Example of generative model

Discriminative methods
Answer: “Is it a cat or a dog?”
=> model the posterior distribution of the labels
x: data, c: label, θ: parameters

Discriminative methods
Objective function: D(θ) = p(θ) p(C|X, θ) = p(θ) ∏_n p(c_n|x_n, θ)
Focus on regions of ambiguity; make faster predictions.
Examples: neural networks, SVMs
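
A minimal sketch of this objective, assuming a multinomial logistic-regression posterior with a Gaussian prior on the weights; all names are illustrative:

```python
# Sketch of the discriminative objective D(theta) with a multinomial
# logistic-regression posterior p(c|x, theta). C is an integer label array.
import numpy as np

def log_D(W, b, X, C):
    """log D(theta) = log p(theta) + sum_n log p(c_n | x_n, theta)."""
    logits = X @ W + b                                   # (N, K) class scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_post = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_prior = -0.5 * (W ** 2).sum()                    # Gaussian prior on W
    return log_prior + log_post[np.arange(len(C)), C].sum()
```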

Example of discriminative model SVMs / NNs

Generative versus discriminative
No effect of the double mode on the decision boundary.

Content
- Generative and discriminative methods
- A principled hybrid framework
- Study of the properties on a toy example
- Influence of the amount of labelled data

Semi-supervised learning
Few labelled data points, lots of unlabelled data.
Discriminative methods overfit; generative models only help classification if they are “good”.
We need the modelling power of generative models while still discriminating well => hybrid models.

Discriminative training (Bach et al., ICASSP 05)
Discriminative objective function: D(θ) = p(θ) ∏_n p(c_n|x_n, θ)
Using a generative model:
D(θ) = p(θ) ∏_n [ p(x_n, c_n|θ) / p(x_n|θ) ]
D(θ) = p(θ) ∏_n [ p(x_n, c_n|θ) / Σ_c p(x_n, c|θ) ]
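
A sketch of this construction, reusing the hypothetical Gaussian model from the generative sketch above; the posterior is formed from the joint by Bayes' rule:

```python
# Sketch of discriminative training of a generative model: the label
# posterior is the joint divided by the marginal over classes.
import numpy as np
from scipy.stats import multivariate_normal

def log_D_generative(theta, X, C, n_classes):
    """sum_n log [ p(x_n, c_n|theta) / sum_c p(x_n, c|theta) ]."""
    D = X.shape[1]
    total = 0.0
    for n in range(len(X)):
        log_joint = np.array([
            np.log(theta["pi"][c]) + multivariate_normal.logpdf(
                X[n], mean=theta["mu"][c], cov=theta["sigma2"] * np.eye(D))
            for c in range(n_classes)])
        # log p(c_n|x_n, theta) = log joint - logsumexp over classes
        total += log_joint[C[n]] - np.logaddexp.reduce(log_joint)
    return total
```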

Convex combination (Bouchard et al., COMPSTAT 04)
Generative objective function: G(θ) = p(θ) ∏_n p(x_n, c_n|θ)
Discriminative objective function: D(θ) = p(θ) ∏_n p(c_n|x_n, θ)
Convex combination: log L(θ) = α log D(θ) + (1 − α) log G(θ), α ∈ [0, 1]
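
A one-line sketch of the blended objective, reusing log_G and log_D_generative from the sketches above:

```python
# Convex combination of the two objectives: alpha = 0 is purely
# generative, alpha = 1 is purely discriminative.
def log_L_convex(theta, X, C, n_classes, alpha):
    assert 0.0 <= alpha <= 1.0
    return (alpha * log_D_generative(theta, X, C, n_classes)
            + (1.0 - alpha) * log_G(theta, X, C))
```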

A principled hybrid model

θ models the posterior distribution of the labels; θ′ models the marginal distribution of the data; θ and θ′ communicate through a prior.
Hybrid objective function: L(θ, θ′) = p(θ, θ′) ∏_n p(c_n|x_n, θ) ∏_n p(x_n|θ′)

A principled hybrid model
θ = θ′ => p(θ, θ′) = p(θ) δ(θ − θ′)
L(θ, θ′) = p(θ) δ(θ − θ′) ∏_n p(c_n|x_n, θ) ∏_n p(x_n|θ′)
=> L(θ) = G(θ): the generative case
θ ⊥ θ′ => p(θ, θ′) = p(θ) p(θ′)
L(θ, θ′) = [ p(θ) ∏_n p(c_n|x_n, θ) ] × [ p(θ′) ∏_n p(x_n|θ′) ]
=> L(θ, θ′) = D(θ) × f(θ′): the discriminative case

A principled hybrid model
Anything in between: the hybrid case.
Choice of prior: p(θ, θ′) = p(θ) N(θ′|θ, σ²I)
σ → 0 => θ = θ′ (generative case)
σ → ∞ => θ ⊥ θ′ (discriminative case)
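
A sketch of the hybrid objective under this Gaussian coupling prior, reusing the hypothetical Gaussian model from the earlier sketches; the parameter flattening is an assumption about how θ is stored:

```python
# Hybrid objective L(theta, theta') with the coupling prior
# p(theta, theta') = p(theta) N(theta'|theta, sigma^2 I). Small sigma ties
# theta' to theta (generative); large sigma decouples them (discriminative).
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_X(theta, X, n_classes):
    """sum_n log p(x_n|theta) = sum_n log sum_c p(x_n, c|theta)."""
    D = X.shape[1]
    total = 0.0
    for n in range(len(X)):
        log_joint = np.array([
            np.log(theta["pi"][c]) + multivariate_normal.logpdf(
                X[n], mean=theta["mu"][c], cov=theta["sigma2"] * np.eye(D))
            for c in range(n_classes)])
        total += np.logaddexp.reduce(log_joint)
    return total

def log_L_hybrid(theta, theta_p, X, C, n_classes, sigma):
    flatten = lambda t: np.concatenate([np.ravel(v) for v in t.values()])
    # coupling prior penalises disagreement between theta and theta'
    log_coupling = -0.5 * np.sum((flatten(theta) - flatten(theta_p)) ** 2) / sigma ** 2
    return (log_coupling
            + log_D_generative(theta, X, C, n_classes)  # labels use theta
            + log_marginal_X(theta_p, X, n_classes))    # data uses theta'
```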

Why principled?
Consistent with the likelihood of graphical models => one way to train a system.
Everything can now be modelled => potential to be Bayesian.
Potential to learn σ.

Learning
EM / Laplace approximation / MCMC: either intractable or too slow.
Conjugate gradients: flexible and easy to check, BUT sensitive to initialisation, and slow.
Variational inference.
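
A sketch of the conjugate-gradient option using scipy; the pack/unpack helpers are hypothetical, and as the slide notes, the result is sensitive to the initialisation x0:

```python
# Maximising the hybrid objective with scipy's conjugate-gradient
# optimiser over a flat parameter vector.
import numpy as np
from scipy.optimize import minimize

def neg_log_L(packed, X, C, n_classes, sigma, unpack):
    theta, theta_p = unpack(packed)  # hypothetical unpacking helper
    return -log_L_hybrid(theta, theta_p, X, C, n_classes, sigma)

# result = minimize(neg_log_L, x0,
#                   args=(X, C, n_classes, sigma, unpack), method="CG")
```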

Content Generative and discriminative methods A principled hybrid framework Study of the properties on a toy example Influence of the amount of labelled data

Toy example

Two elongated distributions.
Only spherical Gaussians allowed => wrong model.
Two labelled points per class => strong risk of overfitting.

Toy example

Decision boundaries

Content
- Generative and discriminative methods
- A principled hybrid framework
- Study of the properties on a toy example
- Influence of the amount of labelled data

A real example
Images are a special case, as each contains several features.
Two levels of supervision: at the image level and at the feature level.
Image label only => weakly labelled.
Image label + segmentation => fully labelled.

The underlying generative model: Gaussian and multinomial components.

The underlying generative model: weakly vs. fully labelled.

Experimental set-up
3 classes (bikes, cows, sheep); 1 Gaussian per class => poor generative model.
75 training images per category.

HF framework

HF versus CC

Results
As the proportion of fully labelled data increases, the best-performing model shifts: generative → hybrid → discriminative.
Weakly labelled data has little influence on this trend.
With sufficient fully labelled data, HF (the hybrid framework) tends to perform better than CC (the convex combination).

Experimental set-up
3 classes (lions, tigers and cheetahs); 1 Gaussian per class => poor generative model.
75 training images per category.

HF framework

HF versus CC

Results
Hybrid models consistently perform better.
However, the generative and discriminative models haven't reached saturation.
No clear difference between HF and CC.

Conclusion
A principled hybrid framework.
Possibility to learn the best trade-off.
Helps on ambiguous datasets when labelled data is scarce.
Optimisation remains a problem.

Future avenues
A Bayesian version (posterior distribution of θ) is under study.
Replace σ by a diagonal matrix Σ for more flexibility => requires the Bayesian version.
Choice of priors.

Thank you!