Classification 1: generative and non-parametric methods
Jakob Verbeek, January 7, 2011
Course website:

Plan for the course
Session 1, October
– Cordelia Schmid: Introduction
– Jakob Verbeek: Introduction Machine Learning
Session 2, December
– Jakob Verbeek: Clustering with k-means, mixture of Gaussians
– Cordelia Schmid: Local invariant features
– Student presentation 1: Scale and affine invariant interest point detectors, Mikolajczyk and Schmid, IJCV 2004.
Session 3, December
– Cordelia Schmid: Instance-level recognition: efficient search
– Student presentation 2: Scalable Recognition with a Vocabulary Tree, Nister and Stewenius, CVPR 2006.

Plan for the course
Session 4, December
– Jakob Verbeek: The EM algorithm, and Fisher vector image representation
– Cordelia Schmid: Bag-of-features models for category-level classification
– Student presentation 3: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Lazebnik, Schmid and Ponce, CVPR 2006.
Session 5, January
– Jakob Verbeek: Classification 1: generative and non-parametric methods
– Student presentation 4: Large-Scale Image Retrieval with Compressed Fisher Vectors, Perronnin, Liu, Sanchez and Poirier, CVPR 2010.
– Cordelia Schmid: Category level localization: sliding window and shape model
– Student presentation 5: Object Detection with Discriminatively Trained Part Based Models, Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2010.
Session 6, January
– Jakob Verbeek: Classification 2: discriminative models
– Student presentation 6: TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, Guillaumin, Mensink, Verbeek and Schmid, ICCV 2009.
– Student presentation 7: IM2GPS: estimating geographic information from a single image, Hays and Efros, CVPR 2008.

Example of classification
Given: training images and their categories. What are the categories of these test images?
[Figure: example images with category labels apple, pear, tomato, cow, dog, horse]

Classification
The goal is to predict the class label corresponding to a test data input.
– Data input x, e.g. an image, but it could be anything; the format may be a vector or something else
– Class label y, taking one of two or more discrete values
– In binary classification we often refer to one class as “positive” and the other as “negative”
Training data consists of inputs x and their corresponding class labels y.
Learn a “classifier”: a function f(x) of the input data that outputs the class label, or a probability over the class labels.
The classifier creates boundaries in the input space between the areas assigned to each class.
– The specific form of these boundaries depends on the class of classifiers used

Discriminative vs generative methods
Generative probabilistic methods
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes’ rule to infer the distribution over classes given the input
Discriminative (probabilistic) methods
– Directly estimate the class probability given the input: p(y|x)
– Some methods have no probabilistic interpretation, e.g. they fit a function f(x) and assign to class 1 if f(x)>0 and to class 2 if f(x)<0

Generative classification methods
Generative probabilistic methods
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes’ rule to infer the distribution over classes given the input
1. Selection of model class:
– Parametric models: Gaussian (for continuous data), Bernoulli (for binary data), …
– Semi-parametric models: mixtures of Gaussians, mixtures of Bernoullis, …
– Non-parametric models: histograms over one-dimensional or multi-dimensional data, nearest-neighbor methods, kernel density estimators, …
2. Estimate the parameters of the density of each class to obtain p(x|class)
– E.g. run EM to learn a Gaussian mixture on the data of each class
3. Estimate the prior probability of each class
– If a data point is equally likely under each class, it is assigned to the class with the largest prior
– The class priors may differ from the relative numbers of available training examples!
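Written out, the Bayes’ rule step in the recipe above takes the standard form below (added here for reference; K denotes the number of classes):

```latex
% Bayes' rule for the class posterior, with K classes
p(y \mid x) \;=\; \frac{p(x \mid y)\, p(y)}{p(x)}
            \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'=1}^{K} p(x \mid y')\, p(y')}
```

Classification then assigns x to the class with the largest posterior, i.e. argmax_y p(y|x).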

Generative classification methods
Generative probabilistic methods
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes’ rule to predict classes given the input
Given the class-conditional models, classification is trivial: just apply Bayes’ rule
– Compute p(x|class) for each class
– Multiply by the class prior probability
– Normalize to obtain the class probabilities
Adding a new class amounts to adding a new class-conditional model
– The existing class-conditional models stay as they are
– Just estimate p(x|new class) from the training examples of the new class
– Plug the new class model into Bayes’ rule when predicting the class

Generative classification methods
Generative probabilistic methods
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes’ rule to predict classes given the input
Three-class example in 2d with a parametric model
– A single Gaussian model per class
[Figure: class-conditional densities p(x|class) and posteriors p(class|x)]
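A minimal Python sketch of this single-Gaussian-per-class model, assuming 2-d inputs and integer class labels; the function names and the synthetic data are illustrative, not taken from the course material:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_per_class(X, y):
    """Maximum likelihood fit of one Gaussian per class, plus the class priors p(y)."""
    priors, models = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)                       # p(y=c)
        models[c] = multivariate_normal(Xc.mean(axis=0),   # p(x|y=c)
                                        np.cov(Xc, rowvar=False))
    return priors, models

def class_posteriors(x, priors, models):
    """Bayes' rule: p(y|x) proportional to p(x|y) p(y), normalized over the classes."""
    joint = {c: models[c].pdf(x) * priors[c] for c in models}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

# Usage on synthetic data: three well-separated 2-d Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 50)
priors, models = fit_gaussian_per_class(X, y)
print(class_posteriors(np.array([1.0, 1.0]), priors, models))
```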

Generative classification methods
Generative probabilistic methods
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes’ rule to infer the distribution over classes given the input
1. Selection of model class:
– Parametric models: Gaussian (for continuous data), Bernoulli (for binary data), …
– Semi-parametric models: mixtures of Gaussians, mixtures of Bernoullis, …
– Non-parametric models: histograms over one-dimensional or multi-dimensional data, nearest-neighbor methods, kernel density estimators, …
2. Estimate the parameters of the density of each class to obtain p(x|class)
– E.g. run EM to learn a Gaussian mixture on the data of each class
3. Estimate the prior probability of each class
– If a data point is equally likely under each class, it is assigned to the class with the largest prior
– The class priors may differ from the relative numbers of available training examples!

Histogram methods
Suppose we
– have N data points
– use a histogram with C cells
How do we set the density level in each cell?
– Maximum likelihood estimator: the density in a cell is n / (N V)
– Proportional to the number of points n in the cell
– Inversely proportional to the volume V of the cell
Problems with the histogram method:
– The number of cells scales exponentially with the dimension of the data
– The density estimate is discontinuous
– How to choose the cell size?
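A small sketch of the histogram estimator described above, in 1-d for simplicity (the number of cells is an arbitrary choice here); it implements p(x) = n / (N V) for the cell containing x, which is also what np.histogram(..., density=True) computes:

```python
import numpy as np

def histogram_density(data, x, num_cells=20):
    """1-d histogram density estimate: n / (N * V) for the cell that contains x."""
    lo, hi = data.min(), data.max()
    edges = np.linspace(lo, hi, num_cells + 1)
    counts, _ = np.histogram(data, bins=edges)
    volume = (hi - lo) / num_cells                              # cell width V
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1,  # cell index of x
                  0, num_cells - 1)
    return counts[idx] / (len(data) * volume)
```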

The ‘curse of dimensionality’
The number of bins increases exponentially with the dimensionality of the data.
– Fine division of each dimension: many empty bins
– Rough division of each dimension: poor density model
A multi-dimensional histogram density estimate over D variables uses at least 2 cells per variable, thus at least 2^D cells in total.
The number of cells can be reduced by assuming independence between the components of x: the naïve Bayes model
– The model is “naïve” since it assumes that all variables are independent…
– Unrealistic for high-dimensional data, where variables tend to be dependent
– Typically a poor density estimator for p(x)
– Classification performance can still be good using the derived p(y|x)
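The naïve Bayes independence assumption mentioned above, written out in its standard form (D is the number of components of x):

```latex
% Naive Bayes: the class-conditional density factorizes over the D components of x
p(x \mid y) \;=\; \prod_{d=1}^{D} p(x_d \mid y),
\qquad
p(y \mid x) \;\propto\; p(y) \prod_{d=1}^{D} p(x_d \mid y)
```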

Example of a naïve Bayes model
Hand-written digit classification
– Input: binary 28x28 scanned digit images, collected in a 784-dimensional vector
– Desired output: class label of the image
Generative model
– Independent Bernoulli model for each class
– One probability per pixel per class
– The maximum likelihood estimator is the average value of each pixel per class
Classify using Bayes’ rule: p(y|x) is proportional to p(y) times the product over pixels of p(x_d|y).
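A rough sketch of this Bernoulli naïve Bayes classifier, assuming X holds binary 784-dimensional pixel vectors and y the digit labels; the clipping of the per-pixel probabilities away from 0 and 1 is a safeguard added here (a pixel that is always off in one class would otherwise give log 0), not something stated on the slide:

```python
import numpy as np

def train_bernoulli_nb(X, y, eps=1e-3):
    """ML estimates: class priors and per-pixel 'on' probabilities (the per-class pixel means)."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    probs = np.array([np.clip(X[y == c].mean(axis=0), eps, 1 - eps) for c in classes])
    return classes, priors, probs

def predict_bernoulli_nb(X, classes, priors, probs):
    """Bayes' rule in log space: log p(y) + sum_d [x_d log p_d + (1 - x_d) log(1 - p_d)]."""
    log_lik = X @ np.log(probs).T + (1 - X) @ np.log(1 - probs).T
    return classes[np.argmax(np.log(priors) + log_lik, axis=1)]
```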

k-nearest-neighbor estimation method
Idea: put a cell around the test sample x for which we want to know p(x)
– Fix the number of samples in the cell, and find the right cell size.
The probability of finding a point in a sphere A centered on x with volume v is P = ∫_A p(x') dx'.
If the density is smooth and approximately constant in a small region, then P ≈ p(x) v.
Alternatively, estimate P from the fraction of training data falling in the sphere around x: P ≈ k / N.
Combining the above gives the estimate p(x) ≈ k / (N v)
– This is the same formula used in the histogram method to set the density level in each cell.

k-nearest-neighbor estimation method
The method in practice:
– Choose k
– For a given x, compute the volume v of the sphere around x that contains k samples
– Estimate the density as p(x) ≈ k / (N v)
The volume of a sphere with radius r in d dimensions is v = π^(d/2) r^d / Γ(d/2 + 1).
What effect does k have?
– Data sampled from a mixture of Gaussians (plotted in green in the original slide)
– Larger k: larger regions, smoother estimate
Selection of k
– Leave-one-out cross-validation
– Select the k that maximizes the data log-likelihood
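A brute-force sketch of this k-NN density estimator, assuming the data are rows of a NumPy array; in practice one would use a space-partitioning structure such as scipy.spatial.cKDTree rather than sorting all distances:

```python
import numpy as np
from math import gamma, pi

def knn_density(data, x, k=10):
    """k-NN density estimate p(x) = k / (N v), where v is the volume of the
    d-dimensional sphere whose radius is the distance from x to its k-th nearest sample."""
    data = np.atleast_2d(data)
    N, d = data.shape
    dists = np.sort(np.linalg.norm(data - np.asarray(x), axis=1))
    r = dists[k - 1]                                   # radius that captures k samples
    v = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d    # volume of the d-ball of radius r
    return k / (N * v)
```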

k-nearest-neighbor classification rule
Use k-nearest-neighbor density estimation to find p(x|category), and apply Bayes’ rule for classification.
k-nearest-neighbor classification:
– Find the sphere volume v that captures the k data points nearest to x, giving the estimate p(x) ≈ k / (N v)
– Use the same sphere for each class c to estimate p(x|c) ≈ k_c / (N_c v), where k_c of the k neighbors belong to class c
– Estimate the global class priors as p(c) ≈ N_c / N
– Bayes’ rule then gives the class posterior p(c|x) = k_c / k
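The resulting rule reduces to a majority vote among the k nearest neighbors; a minimal brute-force sketch (function and variable names are illustrative):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=10):
    """k-NN rule: among the k nearest training points the class posterior is k_c / k;
    predict the class with the largest count."""
    dists = np.linalg.norm(X_train - np.asarray(x), axis=1)
    neighbor_labels = y_train[np.argsort(dists)[:k]]
    classes, counts = np.unique(neighbor_labels, return_counts=True)
    posteriors = dict(zip(classes, counts / k))        # k_c / k for classes among the neighbors
    return classes[np.argmax(counts)], posteriors
```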

k-nearest-neighbor classification rule
Effect of k on the classification boundary
– Larger number of neighbors: larger regions, smoother class boundaries
Pros: very simple
– Just set k and choose a distance measure; there is no learning
– Generic: applies to almost anything, as long as you have a distance measure
Cons: very costly for large training data sets
– Need to store all the data (memory)
– Need to compute distances to all the data (time)

Kernel density estimation methods
Consider a simple estimator of the cumulative distribution function: F(x) = (1/N) × (number of samples x_i ≤ x).
Its derivative gives an estimator of the density function, but this is just a set of delta peaks.
The derivative is defined as p(x) = lim_{h→0} [F(x+h) − F(x−h)] / (2h).
Consider instead a non-limiting value of h: p(x) ≈ [F(x+h) − F(x−h)] / (2h).
Each data point then adds 1/(2hN) in a region of size 2h around it, and the sum of these “blocks” gives the estimate
– A mixture of “fat” delta peaks

Kernel density estimation methods
We can use a function other than the “block” to obtain a smooth estimator.
A widely used kernel function is the (multivariate) Gaussian
– Its contribution decreases smoothly as a function of the distance to the data point.
Choice of the smoothing parameter
– A larger “kernel” size gives a smoother density estimator
– Use the average distance between samples
– Use cross-validation
The method can also be used for multivariate data
– Or inside a naïve Bayes model
The Gaussian kernel density estimator is “just” a Gaussian mixture
– With a component centered on each data point
– With equal mixing weights
– With a “hand-tuned” covariance matrix, typically a small multiple of the identity
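A minimal 1-d sketch of the Gaussian kernel density estimator described above, i.e. an equal-weight mixture of N Gaussians with a hand-tuned bandwidth h (the value of h here is arbitrary); scipy.stats.gaussian_kde offers a ready-made multivariate version with a data-driven bandwidth rule:

```python
import numpy as np

def gaussian_kde_1d(data, x, h=0.5):
    """Gaussian kernel density estimate: average of N Gaussians of width h,
    one centred on each data point, with equal mixing weights 1/N."""
    data = np.asarray(data, dtype=float)
    x = np.atleast_1d(x).astype(float)
    diffs = (x[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * diffs ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)
```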

Summary of generative classification methods
[Figure: comparison of histogram, k-NN, and kernel density estimates]
Semi-parametric models, e.g. p(data|class) is a Gaussian or a mixture of Gaussians
– Pros: no need to store the data
– Cons: possibly too strong assumptions on the data density, which can lead to a poor fit and possibly a poor classification result
Non-parametric models:
– Advantage: flexibility, no assumption on the shape of the data distribution
– Histograms: only practical in low-dimensional spaces (dimension < 5 or so); application in high-dimensional spaces leads to exponentially many cells, most of which will be empty; naïve Bayes modeling can be used in higher-dimensional cases
– k-nearest neighbor & kernel density estimation: expensive to store all the training data (memory), and to compute the nearest neighbors or the points with non-zero kernel evaluation (computation)

Discriminative vs generative methods
Generative probabilistic methods
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes’ rule to infer the distribution over classes given the input
Discriminative (probabilistic) methods → next week
– Directly estimate the class probability given the input: p(y|x)
– Some methods have no probabilistic interpretation, e.g. they fit a function f(x) and assign to class 1 if f(x)>0 and to class 2 if f(x)<0
Hybrid generative-discriminative models
– Fit a density model to the data
– Use properties of this model as input for a classifier
– Example: Fisher vectors for image representation