Classification 2: discriminative models


Classification 2: discriminative models
Jakob Verbeek, January 14, 2011
SVM slides stolen from Svetlana Lazebnik
Course website: http://lear.inrialpes.fr/~verbeek/MLCR.10.11.php

Plan for the course
Session 1, October 1 2010
  Cordelia Schmid: Introduction
  Jakob Verbeek: Introduction Machine Learning
Session 2, December 3 2010
  Jakob Verbeek: Clustering with k-means, mixture of Gaussians
  Cordelia Schmid: Local invariant features
  Student presentation 1: Scale and affine invariant interest point detectors, Mikolajczyk, Schmid, IJCV 2004.
Session 3, December 10 2010
  Cordelia Schmid: Instance-level recognition: efficient search
  Student presentation 2: Scalable Recognition with a Vocabulary Tree, Nister and Stewenius, CVPR 2006.

Plan for the course
Session 4, December 17 2010
  Jakob Verbeek: The EM algorithm, and Fisher vector image representation
  Cordelia Schmid: Bag-of-features models for category-level classification
  Student presentation 3: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Lazebnik, Schmid and Ponce, CVPR 2006.
Session 5, January 7 2011
  Jakob Verbeek: Classification 1: generative and non-parametric methods
  Student presentation 4: Large-Scale Image Retrieval with Compressed Fisher Vectors, Perronnin, Liu, Sanchez and Poirier, CVPR 2010.
  Cordelia Schmid: Category level localization: Sliding window and shape model
  Student presentation 5: Object Detection with Discriminatively Trained Part Based Models, Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2010.
Session 6, January 14 2011
  Jakob Verbeek: Classification 2: discriminative models
  Student presentation 6: TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, Guillaumin, Mensink, Verbeek and Schmid, ICCV 2009.
  Student presentation 7: IM2GPS: estimating geographic information from a single image, Hays and Efros, CVPR 2008.

Example of classification
Given: training images x and their categories y (here: apple, pear, tomato, cow, dog, horse).
What are the categories of these test images?

Generative vs discriminative classification
Training data consists of "inputs", denoted x, and corresponding output "class labels", denoted y.
The goal is to predict the class label y for given test data x.
Generative probabilistic methods:
  Model the density of inputs x from each class, p(x|y)
  Estimate the class prior probability p(y)
  Use Bayes' rule to infer the distribution over classes given the input
Discriminative (probabilistic) methods:
  Directly estimate the class probability given the input: p(y|x)
  Some methods have no probabilistic interpretation, but fit a function f(x) and assign classes based on sign(f(x))

Discriminant function
Choose a class of decision functions in feature space.
Estimate the function parameters from the training set.
Classify a new pattern on the basis of this decision rule.
kNN example from last week: needs to store all the data.
Separation using a smooth curve: only need to store the curve parameters.

Linear classifiers
Decision function is linear in the features: f(x) = w·x + w0
Classification is based on the sign of f(x).
Orientation is determined by w (the surface normal).
Offset from the origin is determined by w0.
The decision surface is a (d-1)-dimensional hyper-plane orthogonal to w, given by f(x) = 0.

Linear classifiers
Decision function is linear in the features: f(x) = w·x + w0
Classification is based on the sign of f(x).
Orientation is determined by w (the surface normal); offset from the origin is determined by w0.
The decision surface is a (d-1)-dimensional hyper-plane orthogonal to w, given by f(x) = 0.
What happens in 3d with w = (1, 0, 0) and w0 = -1?
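As a small illustration (not from the course materials), the numpy sketch below evaluates f(x) = w·x + w0 and classifies by its sign, using the slide's 3-d example: w = (1, 0, 0) and w0 = -1, so the decision surface is the plane x1 = 1.

```python
import numpy as np

# Minimal sketch (not course code): a linear decision function
# f(x) = w.x + w0, with classification by the sign of f(x).
# With w = (1, 0, 0) and w0 = -1 (the slide's 3-d question),
# the decision surface f(x) = 0 is the plane x1 = 1.
w = np.array([1.0, 0.0, 0.0])
w0 = -1.0

def f(x):
    return w @ x + w0

x_test = np.array([2.0, 5.0, -3.0])
print(f(x_test))           # 1.0 -> point lies on the positive side of the plane
print(np.sign(f(x_test)))  # predicted label: +1
```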

Dealing with more than two classes
First (bad) idea: construction from multiple binary classifiers.
Learn the 2-class "base" classifiers independently.
One-vs-rest classifiers: train 1 vs (2 & 3), 2 vs (1 & 3), and 3 vs (1 & 2).
One-vs-one classifiers: train 1 vs 2, 2 vs 3, and 1 vs 3.
It is not clear what should be done in some regions:
  One-vs-rest: regions claimed by several classes.
  One-vs-one: regions with no agreement.

Dealing with more than two classes
Alternative: define a separate linear function for each class, and assign a sample to the class whose function has the maximum value.
Decision regions are convex in this case: if two points fall in a region, then so do all points on the connecting line.
What shape do the separation surfaces have?
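A minimal sketch of this alternative, with illustrative placeholder names: one linear score function per class, and prediction by taking the arg-max over the K scores.

```python
import numpy as np

# One linear score function per class: f_k(x) = W[k].x + b[k].
# A sample is assigned to the class whose score is largest.
# W, b and x are random placeholders for illustration only.
rng = np.random.default_rng(0)
num_classes, num_features = 3, 5
W = rng.normal(size=(num_classes, num_features))  # per-class weight vectors
b = rng.normal(size=num_classes)                  # per-class offsets

def predict(x):
    scores = W @ x + b            # K linear score functions
    return int(np.argmax(scores)) # class with the maximum score

x = rng.normal(size=num_features)
print(predict(x))
```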

Logistic discriminant for two classes
Map the linear score function to class probabilities with the sigmoid function: p(y=1|x) = 1 / (1 + exp(-f(x)))
The class boundary is obtained for p(y|x) = 1/2, i.e. by setting the linear function in the exponent to zero: f(x) = 0.
[Figure: sigmoid of the linear score, plotted from f(x) = -5 to f(x) = +5, with the boundary at p(y|x) = 1/2.]

Multi-class logistic discriminant
Map K linear score functions (one for each class) to K class probabilities using the "soft-max" function: p(y=c|x) = exp(f_c(x)) / Σ_k exp(f_k(x))
The class probability estimates are non-negative and sum to one.
The probability of the most likely class increases quickly with the difference in the linear score functions.
For any given pair of classes, the two classes are equally likely on a hyperplane in the feature space.
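For concreteness, a small sketch of both mappings (sigmoid and soft-max), assuming the linear scores have already been computed:

```python
import numpy as np

# Map linear scores to class probabilities.
def sigmoid(score):
    # two-class case: p(y=1|x) = 1 / (1 + exp(-f(x)))
    return 1.0 / (1.0 + np.exp(-score))

def softmax(scores):
    # K-class case: p(y=c|x) = exp(f_c(x)) / sum_k exp(f_k(x))
    scores = scores - scores.max()   # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

print(sigmoid(0.0))                         # 0.5: exactly on the class boundary
print(softmax(np.array([2.0, 1.0, -1.0])))  # non-negative, sums to one
```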

Parameter estimation for logistic discriminant
Maximize the (log-)likelihood of predicting the correct class label for the training data, i.e. the sum of the log-likelihoods over all training data.
The derivative of the log-likelihood has an intuitive interpretation; with the indicator function [yn = c] (1 if yn = c, else 0), the gradient for class c is Σ_n ([yn = c] − p(y=c|xn)) xn:
  The expected number of points from each class should equal the actual number.
  The expected value of each feature, weighting points by p(y|x), should equal its empirical expectation.
There is no closed-form solution; use gradient-descent methods.
Note 1: the log-likelihood is concave in the parameters, hence there are no local optima.
Note 2: w is a linear combination of the data points.
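A minimal gradient-ascent sketch of this estimation procedure (illustrative code, not the course implementation); it uses the gradient Σ_n ([yn = c] − p(c|xn)) xn stated above.

```python
import numpy as np

# Fit a multi-class logistic discriminant by gradient ascent on the
# log-likelihood. X is (N, D); y holds integer labels in {0, ..., K-1}.
# Names and hyper-parameters are illustrative placeholders.
def fit_logistic(X, y, num_classes, lr=0.1, num_iters=500):
    N, D = X.shape
    W = np.zeros((num_classes, D))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]                   # one-hot indicator [y_n = c]
    for _ in range(num_iters):
        scores = X @ W.T + b                     # (N, K) linear scores
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)        # soft-max probabilities p(c | x_n)
        G = Y - P                                # residuals [y_n = c] - p(c | x_n)
        W += lr * (G.T @ X) / N                  # gradient ascent step for the weights
        b += lr * G.mean(axis=0)                 # gradient ascent step for the offsets
    return W, b
```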

Support vector machines
Find a linear function (hyperplane) to separate the positive and negative examples.
Which hyperplane is best?

Support vector machines
Find the hyperplane that maximizes the margin between the positive and negative examples.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Support vector machines
Find the hyperplane that maximizes the margin between the positive and negative examples.
For support vectors, yi(w·xi + b) = 1.
Distance between a point and the hyperplane: |w·xi + b| / ||w||
Therefore, the margin is 2 / ||w||.
[Figure: separating hyperplane with the margin and support vectors marked.]
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin hyperplane
Maximize the margin 2 / ||w|| while correctly classifying all training data: yi(w·xi + b) ≥ 1.
Quadratic optimization problem: minimize (1/2)||w||², subject to yi(w·xi + b) ≥ 1.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin hyperplane
Solution: w = Σi αi yi xi, where the αi are the learned weights and the xi with αi > 0 are the support vectors.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin hyperplane
Solution: b = yi – w·xi for any support vector.
Classification function (decision boundary): f(x) = Σi αi yi (xi · x) + b, classify by sign(f(x)).
Notice that it relies on an inner product between the test point x and the support vectors xi.
Solving the optimization problem also involves computing the inner products xi · xj between all pairs of training points.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
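A sketch of this decision function, assuming the coefficients αi, labels yi, support vectors and offset b have already been obtained from an SVM solver (the variable names are illustrative placeholders):

```python
import numpy as np

# SVM decision function written in terms of the support vectors:
# f(x) = sum_i alpha_i y_i <x_i, x> + b, prediction is sign(f(x)).
# alphas, sv_labels, support_vectors and b stand in for solver output.
def svm_decision(x, alphas, sv_labels, support_vectors, b):
    return np.sum(alphas * sv_labels * (support_vectors @ x)) + b

def svm_predict(x, alphas, sv_labels, support_vectors, b):
    return np.sign(svm_decision(x, alphas, sv_labels, support_vectors, b))
```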

Summary: linear discriminant analysis
The two most widely used linear classifiers in practice:
  Logistic discriminant (supports more than 2 classes directly)
  Support vector machines (multi-class extensions recently developed)
In both cases the weight vector w is a linear combination of the data points.
This means that we only need the inner products between data points to calculate the linear functions.
The function k that computes the inner products is called a kernel function.

Nonlinear SVMs
Datasets that are linearly separable work out great, but what if the dataset is just too hard?
We can map it to a higher-dimensional space, e.g. mapping a 1-d input x to the pair (x, x²).
Slide credit: Andrew Moore

Nonlinear SVMs
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
Slide credit: Andrew Moore

Nonlinear SVMs
The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(xi, xj) = φ(xi) · φ(xj).
(To be valid, the kernel function must satisfy Mercer's condition.)
This gives a nonlinear decision boundary in the original feature space: f(x) = Σi αi yi K(xi, x) + b
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
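A hedged sketch of the kernelised decision function, with a Gaussian (RBF) kernel as one common choice of K; all names are illustrative:

```python
import numpy as np

# Kernelised decision function: the inner product <x_i, x> is replaced
# by a kernel evaluation K(x_i, x), giving a nonlinear boundary in the
# original feature space. The RBF kernel is just one valid choice.
def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_svm_decision(x, alphas, sv_labels, support_vectors, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, sv_labels, support_vectors)) + b
```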

Kernels for bags of features
Histogram intersection kernel: I(h1, h2) = Σi min(h1(i), h2(i))
Generalized Gaussian kernel: K(h1, h2) = exp(−D(h1, h2)² / A), where D can be the Euclidean distance, the χ2 distance, the Earth Mover's Distance, etc.
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
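Both kernels are easy to write down for histogram inputs; the sketch below assumes h1 and h2 are bag-of-features histograms and uses the χ2 distance as the D inside the generalized Gaussian kernel, following the form shown above (illustrative code only).

```python
import numpy as np

# Two kernels on bag-of-features histograms h1, h2 (illustrative code).
def intersection_kernel(h1, h2):
    # histogram intersection: sum of element-wise minima
    return np.sum(np.minimum(h1, h2))

def chi2_distance(h1, h2, eps=1e-10):
    # chi-squared distance, one choice for D in the generalized Gaussian kernel
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def generalized_gaussian_kernel(h1, h2, A=1.0, dist=chi2_distance):
    # K(h1, h2) = exp(-D(h1, h2)^2 / A)
    return np.exp(-dist(h1, h2) ** 2 / A)
```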

Summary: SVMs for image classification
Pick an image representation (in our case, bag of features).
Pick a kernel function for that representation.
Compute the matrix of kernel values between every pair of training examples.
Feed the kernel matrix into your favorite SVM solver to obtain support vectors and weights.
At test time: compute kernel values between your test example and each support vector, and combine them with the learned weights to get the value of the decision function.
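A minimal end-to-end sketch of this pipeline, assuming bag-of-features histograms are already available; scikit-learn's SVC with a precomputed kernel stands in for "your favorite SVM solver", and the data here are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel_matrix(A, B):
    # K[i, j] = sum_d min(A[i, d], B[j, d]) for histogram rows of A and B
    return np.array([[np.sum(np.minimum(a, b)) for b in B] for a in A])

rng = np.random.default_rng(0)
train_hists = rng.random((20, 50))            # 20 training images, 50 visual words
train_labels = np.array([0] * 10 + [1] * 10)  # two image classes
test_hists = rng.random((5, 50))

# Kernel matrix between every pair of training examples, fed to the SVM solver.
K_train = intersection_kernel_matrix(train_hists, train_hists)
clf = SVC(kernel="precomputed").fit(K_train, train_labels)

# At test time: kernel values between each test example and the training examples.
K_test = intersection_kernel_matrix(test_hists, train_hists)
print(clf.predict(K_test))
```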

SVMs vs logistic discriminants
Multi-class SVMs can be obtained in a similar way as for logistic discriminants: a separate linear score function for each class.
Kernels can also be used for non-linear logistic discriminants.
Storing the support vectors may cost a lot of memory in practice.
Computing the kernel between a new data point and the support vectors may be computationally expensive.
For the logistic discriminant to work well in practice with a small number of examples, a regularizer needs to be added to achieve large margins.
Kernel functions can be applied to virtually any linear data processing technique: principal component analysis, k-means clustering, ...