Non-separable SVMs, and non-linear classification using kernels
Jakob Verbeek, December 16, 2011

Support vector machines
For separable classes: find the hyperplane w·x + b = 0 that maximizes the margin between the positive and negative examples.
– For the support vectors (the points lying on the margin): y_i (w·x_i + b) = 1
– The margin equals 2 / ||w||, so maximizing the margin means minimizing ||w||
(Figure: separating hyperplane with margin, support vectors, positive and negative samples.)

Support vector machines
For non-separable classes: pay a penalty for crossing the margin
– If a point is on the correct side of the margin: zero penalty
– Otherwise: pay the amount by which the score violates the constraint of correct classification, i.e. the slack ξ_i = 1 − y_i (w·x_i + b)

Finding the maximum margin hyperplane
1. Minimize the norm of w plus the penalties:
   min over w, b, ξ of ½ ||w||² + C Σ_i ξ_i
2. Constraints to correctly classify the training data, up to the penalties:
   y_i (w·x_i + b) ≥ 1 − ξ_i, with ξ_i ≥ 0
3. Optimization: a quadratic-programming problem
– ξ_i: slack variables, loosening the margin constraint
– C: trade-off between a large margin and small penalties. Typically set by cross-validation: learn with part of the data set using various C values, evaluate classification on held-out data, and repeat over several train/validation splits to pick the best C value (a minimal sketch follows below).
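
A minimal sketch of this soft-margin training and model-selection loop, assuming scikit-learn is available; the synthetic dataset, the grid of C values, and the 5-fold split are illustrative choices, not part of the original slides:

```python
# Fit a soft-margin linear SVM and pick the trade-off parameter C
# by cross-validation on held-out splits.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic, slightly overlapping two-class data (stand-in for real features).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

# LinearSVC minimizes 0.5 * ||w||^2 + C * sum(hinge losses), i.e. the
# soft-margin objective above; C is selected on the validation folds.
grid = GridSearchCV(LinearSVC(), param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], "cv accuracy:", grid.best_score_)
```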

Relation between the SVM and logistic regression
A classification error occurs when the sign of the function does not match the sign of the class label: the zero-one loss.
Consider the loss minimized when training the classifier:
– Non-separable SVM, hinge loss: max(0, 1 − y_i f(x_i))
– Logistic loss: log(1 + exp(−y_i f(x_i)))
Both the hinge and the logistic loss are convex bounds on the zero-one loss, which is non-convex and discontinuous. Both lead to efficient optimization:
– The hinge loss is piece-wise linear: quadratic programming
– The logistic loss is smooth: gradient descent methods
(Figure: zero-one, hinge, and logistic loss plotted as a function of z = y f(x).)
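
A minimal numpy sketch comparing the three losses as a function of the margin z = y·f(x); the grid of z values is arbitrary, and the 1/log 2 scaling is one common way to make the logistic loss a strict upper bound on the zero-one loss:

```python
import numpy as np

z = np.linspace(-3, 3, 7)                  # signed margins y * f(x)
zero_one = (z <= 0).astype(float)          # 1 if the signs disagree
hinge    = np.maximum(0.0, 1.0 - z)        # piece-wise linear, convex
logistic = np.log(1.0 + np.exp(-z))        # smooth, convex (natural log)

# Hinge loss upper-bounds the zero-one loss directly; the logistic loss
# does so after dividing by log(2) (its value at z = 0).
print(np.all(hinge >= zero_one), np.all(logistic / np.log(2) >= zero_one))
```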

Summary: linear discriminant analysis
The two most widely used linear classifiers in practice:
– Logistic discriminant (supports more than 2 classes directly)
– Support vector machines (multi-class extensions recently developed)
In both cases the weight vector w is a linear combination of the data points: w = Σ_i α_i x_i
This means that we only need the inner products between data points to evaluate the linear functions: f(x) = w·x + b = Σ_i α_i (x_i·x) + b
– The function k that computes the inner products is called a kernel function
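
A minimal sketch of this dual view, assuming scikit-learn: the learned weight vector can be recovered from the support vectors and their dual coefficients (the toy blob data is illustrative only):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors, so the weight
# vector is a linear combination of (support) data points.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True: same hyperplane
```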

Nonlinear classification
A 1-D dataset that is linearly separable. But what if the dataset is not linearly separable? We can map it to a higher-dimensional space, for example x → (x, x²).
(Figure: the 1-D data before and after the mapping, with axes x and x².)
Slide credit: Andrew Moore
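
A minimal numpy sketch of this idea, using the lift x → (x, x²) on a toy 1-D dataset (illustrative values) where the negatives sit between the positives:

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])   # 1-D inputs
y = np.array([+1,   +1,   -1,   -1,  -1,  +1,  +1])    # not separable in 1-D

phi = np.column_stack([x, x ** 2])   # lift each point to (x, x^2)

# In the lifted space the horizontal line x^2 = 1 separates the classes.
print(np.all((phi[:, 1] > 1) == (y > 0)))   # True
```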

Kernels for non-linear classification
General idea: map the original input space to some higher-dimensional feature space where the training set is separable, Φ: x → φ(x).
Which features would you use in this example?
Slide credit: Andrew Moore

Nonlinear SVMs
The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(x_i, x_j) = φ(x_i)·φ(x_j).
If a kernel satisfies Mercer's condition then it computes an inner product in some feature space (possibly with an infinite number of dimensions!)
– Mercer's condition: the n × n matrix of kernel evaluations for any n data points must always be positive semi-definite.
This gives a nonlinear decision boundary in the original space: f(x) = Σ_i α_i y_i K(x_i, x) + b
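
A minimal numpy sketch of Mercer's condition, using a Gaussian RBF kernel as an assumed example: the Gram matrix of an arbitrary point set should have no negative eigenvalues (up to numerical tolerance):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # arbitrary data points
K = rbf_kernel(X, X)                # Gram matrix of kernel evaluations

eigvals = np.linalg.eigvalsh(K)     # symmetric matrix, so eigvalsh
print(eigvals.min() >= -1e-10)      # no (significantly) negative eigenvalues
```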

Kernels for non-linear classification
What is the kernel function if we use the mapping Φ: x = (x_1, x_2) → φ(x) = (x_1², √2 x_1 x_2, x_2²)?
– K(x, y) = φ(x)·φ(y) = x_1² y_1² + 2 x_1 x_2 y_1 y_2 + x_2² y_2² = (x·y)²
Slide credit: Andrew Moore

Kernels for non-linear classification
Suppose we also want to keep the original features, to be able to still implement linear functions:
Φ: x → φ(x) = (1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²)
– K(x, y) = φ(x)·φ(y) = 1 + 2 x·y + (x·y)² = (1 + x·y)²
Slide credit: Andrew Moore

Kernels for non-linear classification
What happens if we use the same kernel, K(x, y) = (1 + x·y)² = 1 + 2 x·y + (x·y)², for higher-dimensional data vectors? Which feature vector corresponds to it?
– The first term encodes an additional constant 1 in each feature vector
– The second term encodes a scaling of the original features by √2
– The third term, (x·y)², encodes the squares of the features and the products of two distinct elements (scaled by √2)
– In total we have 1 + D + D(D+1)/2 features!
– But the inner product did not get any harder to compute (see the sketch below)
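
A minimal numpy sketch verifying this correspondence for D = 5: the explicit feature map with 1 + D + D(D+1)/2 components gives the same inner product as the cheap kernel evaluation (1 + x·y)²:

```python
import itertools
import numpy as np

def phi(x):
    """Explicit feature map of the quadratic kernel (1 + x.y)^2."""
    D = len(x)
    feats = [1.0]                                               # constant
    feats += [np.sqrt(2) * x[i] for i in range(D)]              # scaled originals
    feats += [x[i] ** 2 for i in range(D)]                      # squares
    feats += [np.sqrt(2) * x[i] * x[j]                          # distinct pairs
              for i, j in itertools.combinations(range(D), 2)]
    return np.array(feats)

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)

print(np.isclose((1 + x @ y) ** 2, phi(x) @ phi(y)))   # True
print(len(phi(x)), 1 + 5 + 5 * 6 // 2)                 # 21 21
```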

Popular kernels for bags of features
Histogram intersection kernel: K(h1, h2) = Σ_i min(h1(i), h2(i))
– What is the feature vector transformation?
Generalized Gaussian kernel: K(h1, h2) = exp(−(1/A) d(h1, h2)²)
– d can be the Euclidean distance, the χ² distance, the Earth Mover's Distance, etc.
See also: J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local features and kernels for classification of texture and object categories: a comprehensive study. Int. Journal of Computer Vision, 2007
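
A minimal sketch, assuming scikit-learn, of plugging such kernels into an SVM through a precomputed Gram matrix; the toy histograms and labels, the Euclidean choice of d, and the value of A are all illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def hist_intersection(H1, H2):
    """K(h1, h2) = sum_i min(h1(i), h2(i))."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(-1)

def generalized_gaussian(H1, H2, A=1.0):
    """K(h1, h2) = exp(-(1/A) * d(h1, h2)^2), here with Euclidean d."""
    d2 = ((H1[:, None, :] - H2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / A)

rng = np.random.default_rng(0)
H = rng.random((60, 100))
H /= H.sum(axis=1, keepdims=True)            # L1-normalized toy histograms
labels = rng.integers(0, 2, size=60)         # toy binary labels

for name, kernel in [("intersection", hist_intersection),
                     ("gen. Gaussian", generalized_gaussian)]:
    clf = SVC(kernel="precomputed").fit(kernel(H, H), labels)
    print(name, "training accuracy:", clf.score(kernel(H, H), labels))
```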

Summary: linear classification & kernels
Linear classifiers are learned by minimizing convex cost functions
– Logistic discriminant: smooth objective, minimized using gradient descent
– Support vector machines: piece-wise linear objective, quadratic programming
– Both require only computing inner products between data points
Non-linear classification can be done with linear classifiers over new features that are non-linear functions of the original features
– Kernel functions efficiently compute inner products in (very) high-dimensional spaces, which can even be infinite-dimensional in some cases
Using kernel functions, non-linear classification has drawbacks
– It requires storing the support vectors, which may cost a lot of memory in practice
– Computing the kernel between a new data point and the support vectors may be computationally expensive
Kernel functions also work for other linear data analysis techniques
– Principal component analysis, k-means clustering, …

Additional reading material
Pattern Recognition and Machine Learning, Chris Bishop, 2006, Springer
– Chapter 4: Section & , Section & , Section &
– Section 7.1 (start) & 7.1.2