Pattern Recognition and Image Analysis

Pattern Recognition and Image Analysis Non-Linear Classifiers 2: SVM Dr. Manal Helal – Fall 2014 Lecture 10

Overview: Kernel-Based Methods, RBFs, SVM (linear recap), Non-Linear SVM, Matlab Examples.

SVM Pre-requisites SVM is a kernel-based classification (learning) method that depends on concepts from linear algebra, optimisation, and calculus.

Nonlinear Classifiers: Introduction An example: suppose the data are 1-dimensional. What would a linear SVM do with this data? Negative "plane" / Positive "plane"

Nonlinear Classifiers: Introduction Harder 1-dimensional dataset What can be done about this?

Nonlinear Classifiers: Introduction Non-linear basis function: zk = (xk, xk²). Positive "plane" / Negative "plane"

Polynomial Classifier: XOR problem With a nonlinear polynomial function the classes can be separated. Example, the XOR problem: not linearly separable!

Polynomial Classifier: XOR problem  (0,0)(0,0,0) With we obtain  (0,1) (0,1,0)  (1,0) (1,0,0)  (1,1) (1,1,1) ... that‘s separable in H by the Hyper-plane:

Polynomial Classifier: XOR problem Hyper-plane: g(y) = 𝘸Ty + 𝘸0 = 0 is a hyper-plane in H and a polynomial in X.

Polynomial Classifier: XOR problem Decision surface in X (MatLab; assuming the separating polynomial g(x) = x1 + x2 − 2x1x2 − 1/4, so that x2 = (x1 − 0.25)/(2x1 − 1)):
>> x1 = -0.5:0.1:1.5;
>> x2 = (x1 - 0.25)./(2*x1 - 1);   % assumed numerator; depends on the chosen hyper-plane
>> plot(x1, x2);

Polynomial Classifier: XOR problem With nonlinear polynomial functions, classes can be separated in the original space X. Example: the XOR problem was not linearly separable! ... but it is linearly separable in H! ... and separable in X with a polynomial function!

Polynomial Classifier The decision function is approximated by a polynomial function g(x) of order p; e.g. for p = 2: g(x) = 𝘸0 + Σi 𝘸ixi + ΣiΣj 𝘸ijxixj. This is a special case of a two-layer perceptron: an activation function with polynomial input.

Similarly, Map ℝ² → ℝ³ (Left) A dataset in ℝ², not linearly separable. (Right) The same dataset transformed to ℝ³ by the transformation: [x1, x2] → [x1, x2, x1² + x2²]

Non-Linear SVM in Higher Dimensions A dataset that is not linearly separable in ℝN may be linearly separable in a higher-dimensional space ℝM (where M > N). Thus, if we have a transformation ϕ that lifts the dataset to a higher-dimensional space in which it is linearly separable, we can train a linear SVM on the lifted data to find a decision boundary that separates the classes in ℝM. Projecting that decision boundary back to the original space ℝN yields a nonlinear decision boundary. Visualisations: "SVM with polynomial kernel visualization", "Performing nonlinear classification via linear separation in higher dimensional space".
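A minimal MatLab sketch of the lift from the previous slide, applied to an assumed two-ring toy dataset (illustrative only, not from the lecture):

rng(1);
n = 100;
r = [0.5*rand(n,1); 1 + 0.5*rand(n,1)];              % inner-ring and outer-ring radii
t = 2*pi*rand(2*n,1);
X = [r.*cos(t), r.*sin(t)];                          % 2-D points, not linearly separable
y = [ones(n,1); -ones(n,1)];                         % ring labels

Z = [X, X(:,1).^2 + X(:,2).^2];                      % lift: [x1, x2] -> [x1, x2, x1^2 + x2^2]

% In the lifted space the third coordinate alone separates the rings
% (inner ring: z3 <= 0.25, outer ring: z3 >= 1), so a plane in R^3
% corresponds to a circular decision boundary back in R^2.
plot3(Z(y==1,1), Z(y==1,2), Z(y==1,3), 'bo'); hold on
plot3(Z(y==-1,1), Z(y==-1,2), Z(y==-1,3), 'r+'); hold off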

What if M is too large?

The Kernel Trick (we only need dot products) Give me the numeric result of (x + y)¹⁰ for some values of x and y. By the binomial theorem this is x¹⁰ + 10x⁹y + 45x⁸y² + 120x⁷y³ + 210x⁶y⁴ + 252x⁵y⁵ + 210x⁴y⁶ + 120x³y⁷ + 45x²y⁸ + 10xy⁹ + y¹⁰, but it is far easier to just compute (x + y)¹⁰ directly. The trick: in the same way, we can evaluate the dot product between degree-10 polynomial feature vectors without explicitly forming them. Valid kernels are dot products whose numeric result for two points can be computed without having to form their explicit feature values: 𝘒(x, y) = ⟨ϕ(x), ϕ(y)⟩, where ⟨·,·⟩ is the inner product in the new space. We do not even need to know what ϕ is.
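A small MatLab check of this idea for the degree-2 polynomial kernel 𝘒(x, y) = (xTy)²; the explicit feature map ϕ(x) = [x1², x2², √2·x1x2] and the example values are assumptions for illustration:

x = [3; -1];
y = [2;  4];

phi = @(v) [v(1)^2; v(2)^2; sqrt(2)*v(1)*v(2)];    % explicit degree-2 feature map
K_explicit = phi(x)' * phi(y);                     % dot product in feature space
K_trick    = (x' * y)^2;                           % kernel trick: same number, no features formed

fprintf('explicit: %g, kernel trick: %g\n', K_explicit, K_trick);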

Representation Theory Figure: a portion of the vector field (sin y, sin x). In vector calculus, a vector field is an assignment of a vector to each point in a subset of space. A vector space is a mathematical structure formed by a collection of vectors, which may be added together and multiplied ("scaled") by numbers, subject to the vector-space axioms. The geometric properties of a family of surfaces in 3D Euclidean space can be studied using the vector fields of the unit normals on these surfaces. A basis is a set of linearly independent vectors that, in linear combinations, can represent every vector in a given vector space or free module; more simply put, it defines a "coordinate system". Generally, a basis is a linearly independent spanning set. The same vector can be represented in two different bases (purple and red arrows).

Orthogonality explains the trick Orthogonality is the relation of two lines at right angles to one another (perpendicularity); it also describes non-overlapping, uncorrelated, or independent objects of some kind. The dot product of two vectors determines the angle between them. Two vectors are orthogonal when their dot product is equal to zero (arccos(0) = 90°).

Example We do not always have domain experts, so we need to learn the model from the data.

Linear Regression Linear regression fits the data D = {(x(n), y(n)), n = 1, 2, …, N} to a linear equation y(x) = ax + b by choosing the weights 𝘸 = (a, b)T of the feature map y(x) = 𝘸Tϕ(x), with ϕ(x) = (x, 1)T, that minimise the squared prediction error E(𝘸) = Σn [y(n) − 𝘸Tϕ(x(n))]².

Linear Regression in Matrix Form In matrix form the error function is E(𝘸) = (y − ΦT𝘸)T(y − ΦT𝘸), with yn = y(n) and Φij = ϕi(x(j)). The minimum error is obtained by 𝘸 = (ΦΦT)⁻¹Φy = Φ⁺y, where Φ⁺ := (ΦΦT)⁻¹Φ is the Moore-Penrose pseudo-inverse of ΦT. (ΦΦT) is an M×M matrix, where M is the number of basis functions, so calculating (ΦΦT)⁻¹Φ takes O(M³) operations.
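A minimal MatLab sketch of this primal solution for the straight-line fit y(x) = ax + b, on an assumed synthetic dataset:

x = linspace(0, 1, 20)';                 % inputs
y = 2*x + 0.5 + 0.05*randn(size(x));     % noisy targets, true line y = 2x + 0.5

Phi = [x, ones(size(x))]';               % M x N design matrix, column j = phi(x_j) = (x_j, 1)'
w = (Phi*Phi') \ (Phi*y);                % w = (Phi*Phi')^(-1)*Phi*y, the pseudo-inverse solution
yhat = (w'*Phi)';                        % predictions on the training inputs

disp(w')                                 % roughly [2.0, 0.5]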

Dealing with Unknown Basis Functions One way to deal with over-fitting is to use regularization. However, it will not help much if the basis functions do not match the data. So what should we do if we do not know the model? Is there a universal set of basis functions that can approximate arbitrary functions reasonably well?

More Complex Basis Functions We can also use a more complex model, for example a 3rd-degree polynomial: y(x) = 𝘸1 + 𝘸2x + 𝘸3x² + 𝘸4x³. This yields the feature map ϕ(x) = (1, x, x², x³)T. Quality of solution? Worse. Model complexity? Higher. Why? The model does not match the data.

Linear Regression Using "Bump" Functions Approximate the function with a sum of bump-shaped functions.

Linear Regression Using Radial Basis Functions (RBFs) Gaussian radial basis functions: ϕi(x) = exp(−α‖x − m(i)‖²). The centre is at m(i) and α determines the width of the bump. Advantage: they decrease smoothly to zero and do not show oscillatory behaviour, unlike higher-order polynomials.

Gaussian RBF in 2D space

Linear Regression Using RBFs Use the Gaussian RBFs ϕi(x), i = 1, …, 16, as basis functions and apply linear regression. In one dimension this seems to work well, but does it scale to higher dimensions? Figure: approximation of sin(10x) by 16 Gaussian RBFs. Black line: original function. Red line: regression.
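A MatLab sketch of this experiment under assumed settings (16 centres equally spaced on [0, 1], width parameter α = 200, noise-free targets); it relies on implicit expansion (MATLAB R2016b or later):

x = linspace(0, 1, 100)';                % training inputs
y = sin(10*x);                           % target function

m = linspace(0, 1, 16);                  % 16 RBF centres m(i)
alpha = 200;                             % assumed width parameter
Phi = exp(-alpha * (x - m).^2);          % N x 16 design matrix, Phi(n,i) = phi_i(x_n)

w = Phi \ y;                             % least-squares weights
yhat = Phi * w;                          % fitted curve

plot(x, y, 'k', x, yhat, 'r--');
legend('sin(10x)', '16-RBF regression');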

Curse of Dimensionality To cover the data with a constant discretisation level (number of RBFs per unit volume), the number of basis functions and weights grows exponentially with the number of dimensions:

Dimensions    # of basis functions / weights
1             16
2             16² = 256
3             16³ = 4096
4             16⁴ = 65536
5             16⁵ = 1048576
...           ...
10            16¹⁰ ≈ 10¹²

Is there a way to reduce the number of required weights?

Dual Representation: Orthogonal Complement Let 𝒜 be a subspace of ℝN. The orthogonal complement 𝒜⊥ of 𝒜 is the set of all vectors that are orthogonal to all elements of 𝒜: 𝒜⊥ = {z ∈ ℝN: xTz = 0 ∀ x ∈ 𝒜}. Orthogonal decomposition: every y ∈ ℝN can be written uniquely in the form y = ŷ + z with ŷ ∈ 𝒜 and z ∈ 𝒜⊥. Example: 𝒜 = {c1(1, 0, 0) + c2(1, 1, 0), c ∈ ℝ²}, 𝒜⊥ = {c(0, 0, 1), c ∈ ℝ}; y = (2, 3, 5), ŷ = (2, 3, 0), z = (0, 0, 5).
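The example can be checked in MatLab with an orthogonal projector onto 𝒜 (a standard linear-algebra sketch, not from the slides):

A = [1 0 0; 1 1 0]';                     % columns span the subspace A
y = [2; 3; 5];

P    = A * ((A'*A) \ A');                % orthogonal projector onto A
yhat = P * y;                            % component in A      -> (2, 3, 0)
z    = y - yhat;                         % component in A-perp -> (0, 0, 5)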

Null Space of Training Set Let 𝒳 = {x(n), n = 1, 2, …, N} be the set of all training points. Consider the predictions of the model y(x) = 𝘸Tx on the training set, i.e. for x ∈ 𝒳. Using orthogonal decomposition we write 𝘸 = ŵ + z with ŵ ∈ span(𝒳) and z ∈ 𝒳⊥. The prediction becomes y(x) = (ŵ + z)Tx = ŵTx + zTx = ŵTx. Why is zTx zero? Because z ∈ 𝒳⊥ is orthogonal to every training point x ∈ 𝒳. Hence, we can assume that 𝘸 ∈ span(𝒳).

Dual Representation The weight vector 𝘸 is thus a linear combination of the training samples 𝒳: 𝘸 = Σn anx(n). The parameters a = (a1, …, aN) are called the dual parameters. This also applies when using basis functions: 𝘸 = Σn anϕ(x(n)). There are as many dual parameters as training samples; their number is independent of the number of basis functions.

Predictions in Dual Representation Substituting 𝘸 back into the linear regression y(x) = 𝘸Tϕ(x) yields y(x) = Σn anϕ(x(n))Tϕ(x) = Σn an𝘒(x(n), x), with the kernel function 𝘒(x, y) := ϕ(x)Tϕ(y). Using the Gram matrix 𝘒mn := 𝘒(x(m), x(n)) = ϕ(x(m))Tϕ(x(n)), the predictions Yn = y(x(n)) on the training set are simply YT = aT𝘒. The Gram matrix of a set of vectors v1, …, vn in an inner product space is the Hermitian matrix of inner products, whose entries are given by Gij = ⟨vj, vi⟩. For finite dimensions in Euclidean space it is G = VTV. It is the evaluation of the kernel function over all pairs of points in the training set.

Solution of Dual Representation
Primal representation (standard linear regression): solution 𝘸 = (ΦΦT)⁻¹Φy = Φ⁺y, where Φ is the matrix of feature vectors. Complexity is O(M³), where M is the number of basis functions. Predictions: y(x) = 𝘸Tϕ(x).
Dual representation: solution a = (𝘒𝘒T)⁻¹𝘒y = 𝘒⁺y, where 𝘒 is the Gram matrix (also called the kernel matrix). Complexity is O(N³), where N is the number of training samples. Predictions: y(x) = Σn an𝘒(x, x(n)), i.e. "components an weighted with the similarity of x to training sample x(n)".
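A MatLab sketch of the dual solution on the sin(10x) toy problem, assuming a Gaussian kernel with width α = 200 (implicit expansion is used to build the Gram matrix):

x = linspace(0, 1, 50)';                 % N = 50 training inputs
y = sin(10*x);                           % targets

alpha = 200;                             % assumed kernel width
K = exp(-alpha * (x - x').^2);           % N x N Gram matrix, K(m,n) = K(x_m, x_n)
a = pinv(K) * y;                         % dual parameters, one per training sample

xs   = linspace(0, 1, 200)';             % test inputs
Ks   = exp(-alpha * (xs - x').^2);       % cross-kernel matrix K(x*, x_n)
yhat = Ks * a;                           % y(x*) = sum_n a_n K(x*, x_n)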

Why is the Dual Representation Useful? N is the number of training samples. With lots of basis functions and a moderate number of training samples, the dual representation saves a lot of parameters.

Example: Kernel of RBFs Gaussian RBF basis functions ϕi(x), i = 1, …, M. Calculate the kernel 𝘒(x, y) := ϕ(x)Tϕ(y), then apply linear regression in the dual representation, where N is the number of training samples.
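A MatLab sketch of evaluating this induced kernel numerically; the centres and the width parameter are assumed values:

m = linspace(0, 1, 16);                  % M = 16 Gaussian RBF centres
alpha = 200;                             % assumed width parameter
phi = @(x) exp(-alpha * (x - m).^2)';    % column vector of RBF feature values

K = @(x, y) phi(x)' * phi(y);            % induced kernel K(x,y) = phi(x)'*phi(y)

K(0.30, 0.35)                            % nearby points  -> relatively large value
K(0.30, 0.90)                            % distant points -> close to zero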

Example: Kernel of Polynomial Basis The polynomial basis of order D, ϕi(x) = x^(i−1) for i = 1, …, D, induces the kernel 𝘒(x, y) := ϕ(x)Tϕ(y) = Σi=1..D (xy)^(i−1) = ((xy)^D − 1)/(xy − 1), and the Gram matrix 𝘒nm := 𝘒(x(n), x(m)) = ((x(n)x(m))^D − 1)/(x(n)x(m) − 1). Evaluating the kernel is much cheaper than calculating the mapping ϕ into feature space and the scalar product explicitly: the expensive D-dimensional scalar product is simplified to this geometric series.
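A quick MatLab check that the closed form agrees with the explicit D-term scalar product (D, x and y are arbitrary example values):

D = 10; x = 0.7; y = 1.3;

phix = x.^(0:D-1)';                      % explicit features [1, x, ..., x^(D-1)]
phiy = y.^(0:D-1)';
K_explicit = phix' * phiy;               % D-dimensional scalar product

K_closed = ((x*y)^D - 1) / (x*y - 1);    % geometric series in closed form

fprintf('%.10f  %.10f\n', K_explicit, K_closed)   % identical up to round-off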

Properties of the Kernel Calculate ΦTΦ component-wise: (ΦTΦ)mn = ϕ(x(m))Tϕ(x(n)) = 𝘒(x(m), x(n)). Thus we have 𝘒 = ΦTΦ, with 𝘒 symmetric. 𝘒 is positive semi-definite, that is bT𝘒b ≥ 0 for any b, since 0 ≤ ‖Φb‖² = bTΦTΦb = bT𝘒b for any b.

When is a function a Kernel? Mercer's theorem (for a finite input space): consider a finite input space 𝒳 = {x(1), …, x(N)} with 𝘒(x(n), x(m)) a function on 𝒳. Then 𝘒(x(n), x(m)) is a kernel function, that is, a scalar product in a feature space, if and only if 𝘒 is symmetric and the matrix 𝘒nm = 𝘒(x(n), x(m)) is positive semi-definite. Proof: 𝘒 is symmetric ⇒ the eigenvalue decomposition has the form 𝘒 = VɅVT, where Ʌ is a diagonal matrix containing the eigenvalues λt = Ʌtt of 𝘒 and V is an orthogonal matrix containing the eigenvectors vt. 𝘒 is positive semi-definite ⇒ all eigenvalues are non-negative, λt ≥ 0. Define the feature map ϕ(x(n)) := Ʌ^½VTen (the n-th column of Ʌ^½VT) and check that it corresponds to the kernel by calculating the scalar product: ϕ(x(n))Tϕ(x(m)) = (VɅVT)nm = 𝘒nm.
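The construction in the proof can be reproduced numerically in MatLab (a sketch; the five random training points and the Gaussian Gram matrix are assumed example inputs):

X  = rand(5, 2);                                 % five training points
D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X');    % squared pairwise distances
K  = exp(-D2);                                   % a symmetric, PSD (Gaussian) Gram matrix

[V, L] = eig((K + K')/2);                        % eigendecomposition K = V*L*V'
Phi = sqrt(max(L, 0)) * V';                      % feature map: column n is phi(x^(n))

norm(K - Phi'*Phi)                               % ~0: the feature map reproduces the kernel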

Kernels Summary Kernels compute the value of a scalar product in a high-dimensional feature space without explicitly computing the features. They can be used in any model that depends (or can be rewritten so that it depends) on the scalar products of the training samples. Not every function is a kernel: use the kernel construction rules to prove that a function is a valid kernel.

Math Review: Hyper-planes in Hessian Normal Form For a point x1 that lies on the far side of the hyper-plane we have 𝘸Tx1 + b > 0, because the projection of x1 on 𝘸 is larger than for points x that are on the hyper-plane. Conversely, for a point x2 that lies on the near side of the hyper-plane we have 𝘸Tx2 + b < 0, because the projection of x2 on 𝘸 is smaller than for points x that are on the hyper-plane. The linear classifier h(x) = 𝘸Tx + b assigns the class by checking sign(h(x)).

Linear Classifier with a Margin We add two more hyper-planes that are parallel to the original hyper-plane and require that no training point lies between them. Thus we now require 𝘸Tx + (b − 𝘴) > 0 for x from the blue class, and 𝘸Tx + (b + 𝘴) < 0 for all x from the green class. The distance from the origin to the hyper-plane 𝘸Tx + b = 0 is d = −b/|𝘸|, thus dblue = −(b − 𝘴)/|𝘸| and dgreen = −(b + 𝘴)/|𝘸|, and the margin is m = dblue − dgreen = 2𝘴/|𝘸|; rescaling 𝘸 and b so that 𝘴 = 1 gives m = 2/|𝘸|.

Set of Constraints Let xi be the i-th data point and yi ∈ {-1, 1} the class assigned to xi. The constraints are: 𝘸Txi + b ≥ +1 for yi = +1, and 𝘸Txi + b ≤ -1 for yi = -1. These can be condensed to yi(𝘸Txi + b) ≥ 1 for all i. If these constraints are fulfilled, the margin is m = 2/|𝘸|.

Optimisation Problem Let xi be the i-th data point, i = 1, …, m, and yi ∈ {-1, 1} the class assigned to xi. To find the separating hyper-plane with the maximum margin we minimise f0(𝘸, b) = ½𝘸T𝘸 subject to fi(𝘸, b) = yi(𝘸Txi + b) − 1 ≥ 0 for i = 1, …, m. This is a constrained convex optimisation problem. Note that we introduced the arbitrary factor ½ in the objective function f0(𝘸, b) to obtain nicer derivatives later on.

Dual Problem We apply the recipe for solving the constrained optimisation problem. 1. Calculate the Lagrangian: L(𝘸, b, α) = ½𝘸T𝘸 − Σi αi[yi(𝘸Txi + b) − 1]. 2. Minimise L(𝘸, b, α) w.r.t. 𝘸 and b: setting the derivatives to zero gives 𝘸 = Σi αiyixi and Σi αiyi = 0. Thus the weights are a linear combination of the training samples.

Dual Problem Substituting both relations back into L(𝘸, b, α) gives us the Lagrange dual function g(α). Thus: maximise g(α) = Σi αi − ½ ΣiΣj αiαjyiyjxiTxj subject to αi ≥ 0, i = 1, …, m, and Σi αiyi = 0. We will demonstrate an efficient algorithm to solve the dual problem. The bias b does not appear in the dual problem and thus needs to be recovered afterwards. But for both issues we need to derive some properties of the solution first.
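As an illustration, the dual can be handed to a general-purpose QP solver such as quadprog from the Optimization Toolbox; this is only a sketch on a made-up toy set, not the efficient algorithm referred to above:

X = [1 1; 2 2; 2 0; 0 0; -1 0; 0 -2];    % m x 2 toy inputs (linearly separable)
y = [1; 1; 1; -1; -1; -1];               % labels in {-1, +1}
m = numel(y);

Q   = (y*y') .* (X*X');                  % Q_ij = y_i*y_j*x_i'*x_j
f   = -ones(m, 1);                       % quadprog minimises 1/2*a'*Q*a + f'*a
Aeq = y'; beq = 0;                       % equality constraint sum_i a_i*y_i = 0
lb  = zeros(m, 1);                       % a_i >= 0 (hard margin, no upper bound)

a = quadprog(Q + 1e-8*eye(m), f, [], [], Aeq, beq, lb, []);

w  = X' * (a .* y);                      % w = sum_i a_i*y_i*x_i
sv = find(a > 1e-5);                     % support vectors have a_i > 0
b  = mean(y(sv) - X(sv,:)*w);            % recover the bias from the support vectors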

Support Vectors Remember that the optimal solution must satisfy, amongst others, the KKT complementary slackness condition αifi(x*) = 0. In our case this means αi[yi(𝘸Txi + b) − 1] = 0 for all i. Hence, a training sample xi can only contribute to the weight vector (αi ≠ 0) if it lies on the margin, that is, yi(𝘸Txi + b) = 1. A training sample xi with αi ≠ 0 is called a support vector. The class of x is h(x) = sign(𝘸Tx + b); substituting 𝘸 = Σi αiyixi gives h(x) = sign(Σi αiyixiTx + b). We only need to remember the few training samples with αi ≠ 0.

Recovering the bias From the complementary slackness condition αifi(x) = 0 we can recover the bias. If we take a support vector (αi ≠ 0), the corresponding constraint fi(𝘸, b) must be zero and thus 𝘸Txi + b = yi. Solving this for b gives the bias: b = yi − 𝘸Txi, where 𝘸 = Σj αjyjxj. We can use any support vector to calculate the bias.

Non Linear SVM: The XOR Problem Linear separation in a high-dimensional space H via nonlinear functions (polynomials and RBFs) in the original space X. For this we found nonlinear mappings ϕ(x): X → H. Is that possible without knowing the mapping function ϕ?

Non-linear Support Vector Machines Recall that the probability of having linearly separable classes increases as the dimensionality of the feature vectors increases. Assume the mapping x ∈ Rl → z ∈ Rk with k > l, then use a linear SVM in Rk. Recall that in this case the dual problem formulation becomes: maximise Σi αi − ½ ΣiΣj αiαjyiyj𝘒(xi, xj) subject to αi ≥ 0 and Σi αiyi = 0. The classifier becomes: h(x) = sign(Σi αiyi𝘒(xi, x) + b).

Non-linear SVM Formulation

Linear SVM – Pol SVM in the input Space X

Pol. SVM – RBF SVM in the input space X

Pol. SVM – RBF SVM in the input space X

SVM vs. Neural Networks Training a support vector machine (SVM) involves maximising a concave function (a convex optimisation problem), so there is a unique solution. Training a neural network model is generally non-convex: there are potentially different solutions depending on the starting values of the model parameters.

SVM Software

Matlab SVM Functions For the exam dataset, the polynomial kernel was the best-scoring SVM kernel function:

xtrain = Dataset(1:80, 1:2);
ytrain = Dataset(1:80, 3);
xtest  = Dataset(81:100, 1:2);

% The polynomial-kernel SVM classifier:
SVMStruct = svmtrain(xtrain, ytrain, 'kernel_function', 'polynomial');
ytest = svmclassify(SVMStruct, xtest);

% Resubstitution error on the training set:
bad = ~strcmp(svmclassify(SVMStruct, xtrain), ytrain);
P_SVMResubErr = sum(bad) / 80

% 10-fold cross-validation error (retrain the SVM on each training fold):
cp = cvpartition(ytrain, 'k', 10);
SVMClassFun = @(xtrain, ytrain, xtest) ...
    svmclassify(svmtrain(xtrain, ytrain, 'kernel_function', 'polynomial'), xtest);
P_SVMCVErr = crossval('mcr', xtrain, ytrain, 'predfun', SVMClassFun, 'partition', cp)

Other Matlab SVM functions
SVMModel1 = fitcsvm(xtrain, ytrain, 'KernelFunction', 'polynomial', 'Standardize', true);

% Compute the scores over a grid
d = 0.02;                                            % step size of the grid
[x1Grid, x2Grid] = meshgrid(min(xtrain(:,1)):d:max(xtrain(:,1)), ...
                            min(xtrain(:,2)):d:max(xtrain(:,2)));
xGrid = [x1Grid(:), x2Grid(:)];                      % the grid
[~, scores1] = predict(SVMModel1, xGrid);            % the scores

figure;
h(1:2) = gscatter(xtrain(:,1), xtrain(:,2), ytrain);
hold on
h(3) = plot(xtrain(SVMModel1.IsSupportVector, 1), ...
            xtrain(SVMModel1.IsSupportVector, 2), 'ko', 'MarkerSize', 10);   % support vectors
contour(x1Grid, x2Grid, reshape(scores1(:,2), size(x1Grid)), [0 0], 'k');    % decision boundary
title('Scatter Diagram with the Decision Boundary')
legend({'-1', '1', 'Support Vectors'}, 'Location', 'Best');
hold off