Support Vector Machines 2

Support Vector Machines 2. Outline: SVM; the kernel trick; selection of models; VC dimension; SVR; other applications.

Large margin classifier. The optimization problem introduces a slack variable ξi for each point: ξi = 0 when the point is on the correct side of the margin; ξi > 1 when the point crosses the hyperplane to the wrong side; 0 < ξi < 1 when the point is inside the margin but still on the correct side.
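Written out in the notation of Hastie et al. (the standard soft-margin form that this slide describes):

\min_{\beta,\beta_0}\ \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{N}\xi_i
\quad\text{subject to}\quad y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,N.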

Large margin classifier. The solution for β is a linear combination of the training points, with non-zero coefficients αi only for certain points i; these are called "support vectors". Some support vectors lie exactly on the edge of the margin, while the remainder lie inside the margin or on its wrong side.
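In the same notation, the solution has the form

\hat\beta = \sum_{i=1}^{N}\hat\alpha_i\, y_i\, x_i,

with \hat\alpha_i > 0 only for the support vectors: those on the edge of the margin have \xi_i = 0 and 0 < \hat\alpha_i < C, while those with \xi_i > 0 have \hat\alpha_i = C.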

SVM. Consider a basis expansion h(x) = (h1(x), ..., hM(x)). The solution of the large margin classifier in the h(x) space has the same form as before, with xi replaced by h(xi).
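Concretely (standard result), the fitted function in the transformed space is

f(x) = h(x)^T\beta + \beta_0 = \sum_{i=1}^{N}\alpha_i\, y_i\, \langle h(x), h(x_i)\rangle + \beta_0,

so h(x) enters only through inner products with the training points.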

SVM. The transformed features h(x) are involved ONLY in the form of inner products. So as long as we define a kernel function that computes the inner product in the transformed space, we never need to know what h(x) itself is: the "kernel trick".
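That is, it suffices to specify the kernel

K(x, x') = \langle h(x), h(x')\rangle,

and the classifier becomes f(x) = \sum_i \alpha_i\, y_i\, K(x, x_i) + \beta_0 without ever evaluating h explicitly.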

SVM. Recall that αi = 0 for non-support vectors, so f(x) depends only on the support vectors. The decision is essentially a weighted sum of the similarities (kernel values) between the new object and the support vectors.
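A minimal scikit-learn sketch of this point (an added illustration, not from the slides; the toy dataset and the values C = 1 and gamma = 0.5 are arbitrary): the fitted SVC stores only the support vectors and their weights, and its decision function can be reproduced as a weighted sum of RBF similarities to those support vectors.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
gamma = 0.5
clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)

sv = clf.support_vectors_        # the support vectors x_i
w = clf.dual_coef_[0]            # alpha_i * y_i for each support vector
b = clf.intercept_[0]

x_new = X[:1]                    # one query point
K = np.exp(-gamma * np.sum((sv - x_new) ** 2, axis=1))   # RBF similarities to the support vectors
print(np.dot(w, K) + b)                  # weighted sum of similarities + intercept
print(clf.decision_function(x_new)[0])   # matches the library's decision function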

SVM

More on kernels. Most kernels do not correspond to explicit basis functions, but there are exceptions. Example: the degree-2 polynomial kernel with κ = 0, K(x, x') = (⟨x, x'⟩)^2, corresponds to an explicit degree-2 polynomial basis; other degree-2 polynomial kernels with non-zero κ also correspond to explicit degree-2 polynomial bases.
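A quick numerical check of this example (an added sketch using plain NumPy; two-dimensional inputs assumed): the kernel value (x·z)^2 equals the ordinary inner product of the explicit degree-2 features h(x) = (x1^2, sqrt(2) x1 x2, x2^2).

import numpy as np

def h(x):
    # explicit degree-2 feature map for 2-D inputs
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.dot(x, z) ** 2)        # kernel value K(x, z) = (x . z)^2
print(np.dot(h(x), h(z)))       # same value via the explicit basis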

More on kernels

More on kernels. The (Gaussian) radial basis function kernel, or RBF kernel.
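Its usual form is

K(x, x') = \exp\!\left(-\gamma\,\|x - x'\|^2\right) = \exp\!\left(-\|x - x'\|^2 / (2\sigma^2)\right).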

More on kernels. The sigmoid kernel is only conditionally positive definite (CPD). "An SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network." See Lin and Lin, "A Study on Sigmoid Kernels for SVM and the Training of non-PSD Kernels by SMO-type Methods".
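In the notation of Lin and Lin, the sigmoid kernel is

K(x, x') = \tanh\!\left(a\, x^T x' + r\right),

which is not positive semidefinite in general (hence the CPD qualification above).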

More on kernels https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805

SVM. Using the kernel trick brings the feature space to a very high dimension, and hence very many parameters. Why doesn't the method suffer from the curse of dimensionality or from overfitting? Vapnik argues that the number of parameters alone, or the number of dimensions alone, is not a true reflection of how flexible a classifier is. Compare two functions in one dimension: f(x) = α + βx and g(x) = sin(αx).

SVM. g(x) = sin(αx) is a really flexible classifier in one dimension, although it has only one parameter. f(x) = α + βx can only promise to separate two points every time, although it has one more parameter.

SVM. Vapnik-Chervonenkis (VC) dimension: the VC dimension of a class of classifiers {f(x, α)} is defined as the largest number of points that can be shattered by members of {f(x, α)}. A set of points is said to be shattered by a class of functions if, no matter how the class labels are assigned, some member of the class can separate them perfectly.

SVM. A linear classifier is rigid: a hyperplane classifier has VC dimension d + 1, where d is the feature dimension. For example, a line in the plane (d = 2) can shatter three points in general position, but no set of four points.

SVM. The class sin(αx) has infinite VC dimension: by an appropriate choice of α, any number of points can be shattered. The VC dimension of the nearest-neighbor classifier is also infinite, since you can always get perfect classification on the training data. For many classifiers it is difficult to compute the VC dimension exactly, but this does not diminish its value for theoretical arguments. The VC dimension measures the complexity of a class of functions by assessing how wiggly its members can be.

SVM. For SVM, the VC dimension of the maximum-margin hyperplane does not necessarily depend on the number of features; the VC dimension is lower when the margin is larger.
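One commonly quoted bound (Vapnik; see also Burges, 1998, on gap-tolerant classifiers): if the data lie in a ball of radius R and the hyperplane achieves margin at least M, then the VC dimension h satisfies

h \le \min\!\left(\lceil R^2 / M^2 \rceil,\ d\right) + 1,

so a larger margin lowers the bound regardless of the feature dimension d.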

SVM. Strengths of SVM:
- flexibility
- scales well for high-dimensional data
- the complexity vs. error trade-off can be controlled explicitly
- as long as a kernel can be defined, non-traditional (non-vector) data such as strings and trees can be used as input
Weakness:
- how to choose a good kernel? (a low-degree polynomial or a radial basis function can be a good start)

K-fold cross-validation. The goal is to directly estimate the extra-sample error (the error on an independent test set). K-fold cross-validation: split the data into K roughly equal-sized parts; for each of the K parts, fit the model with the other K-1 parts and calculate the prediction error on the part that is left out.
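A short sketch of this procedure (an added illustration, not from the slides; the dataset, the RBF SVM, and K = 5 are arbitrary placeholders), using scikit-learn's KFold:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
K = 5
fold_errors = []
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = SVC(kernel='rbf', C=1.0).fit(X[train_idx], y[train_idx])   # fit on the other K-1 parts
    fold_errors.append(1 - model.score(X[test_idx], y[test_idx]))      # error on the left-out part
print(np.mean(fold_errors))   # the CV estimate of the prediction error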

Cross-validation. The CV estimate of the prediction error combines the K per-fold estimates. Here α is the tuning parameter (indexing different models or model parameters). Find the α that minimizes CV(α), then finally fit the chosen model on all of the data.
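In the notation of Hastie et al., with \kappa(i) the fold containing observation i and \hat f^{-k} the model fitted with the k-th part removed,

CV(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L\!\left(y_i,\ \hat f^{-\kappa(i)}(x_i, \alpha)\right),

and \hat\alpha = \arg\min_\alpha CV(\alpha) is selected before the final fit on all of the data.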

Cross-validation. Leave-one-out cross-validation (K = N) is approximately unbiased, yet it has high variance. With K = 5 or 10, CV has lower variance but more bias: if the learning curve still has a large slope at the reduced training-set size, 5-fold or 10-fold CV can overestimate the prediction error substantially.

Cross-validation

Support Vector Regression. Find a linear function that approximates all pairs (xi, yi) with ε precision (Smola & Schölkopf, "A Tutorial on Support Vector Regression"):
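In Smola and Schölkopf's notation, with f(x) = \langle w, x\rangle + b, the \varepsilon-precision requirement gives

\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{subject to}\quad
y_i - \langle w, x_i\rangle - b \le \varepsilon,\qquad
\langle w, x_i\rangle + b - y_i \le \varepsilon.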

Support Vector Regression. Similar to the classification case, allow slack variables:
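With slack variables \xi_i, \xi_i^* for points violating the \varepsilon-tube on either side:

\min_{w,b,\xi,\xi^*}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}(\xi_i + \xi_i^*)
\quad\text{subject to}\quad
y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i,\quad
\langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^*,\quad
\xi_i,\ \xi_i^* \ge 0.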

Support Vector Regression. As in the classification case, the solution is a linear combination of support vectors, so the prediction involves the features only through inner products and the kernel trick is applicable:
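Explicitly,

w = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\, x_i,
\qquad
f(x) = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\,\langle x_i, x\rangle + b,

and replacing \langle x_i, x\rangle with K(x_i, x) gives the kernelized predictor.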

Support Vector Regression. PLoS ONE 8(11): e79970.

Kernel trick in other areas: capturing non-linear effects without explicitly specifying a particular non-linear functional form; statistical testing of combined effects of variables; clustering; quantile regression; dimensionality reduction; and more (see the sketch below for a dimensionality-reduction example).
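As a concrete example of the dimensionality-reduction item above, a minimal kernel PCA sketch (an added illustration, not from the slides; the two-circles toy data and gamma = 10 are arbitrary choices):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)   # nonlinear embedding computed via the kernel only
print(X_kpca.shape)              # (200, 2)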