Pattern Recognition and Image Analysis



Presentation on theme: "Pattern Recognition and Image Analysis"— Presentation transcript:

1 Pattern Recognition and Image Analysis
Non-Linear Classifiers 2: SVM. Dr. Manal Helal, Fall 2014, Lecture 10

2 Overview: Kernel-based methods, RBFs, SVM (linear recap), non-linear SVM, MATLAB examples

3 SVM Pre-requisites: SVM is a kernel-based classification (learning) method that relies on concepts from linear algebra, optimisation, and calculus.

4 Nonlinear Classifiers: Introduction
An example: suppose the data are one-dimensional. What would a linear SVM do with this data? (Figure labels: negative "plane", positive "plane".)

5 Nonlinear Classifiers: Introduction
A harder one-dimensional dataset. What can be done about this?

6 Nonlinear Classifiers: Introduction
Use the non-linear basis function zₖ = (xₖ, xₖ²). (Figure labels: positive "plane", negative "plane".)

7 Polynomial Classifier: XOR problem
The XOR problem with a polynomial function. With nonlinear polynomial functions the classes can be separated. Example: the XOR problem is not linearly separable!

8 Polynomial Classifier: XOR problem
 (0,0)(0,0,0) With we obtain  (0,1) (0,1,0)  (1,0) (1,0,0)  (1,1) (1,1,1) ... that‘s separable in H by the Hyper-plane:

9 Polynomial Classifier: XOR problem
Hyper-plane: g(y) = wᵀy + w₀ = 0 is a hyper-plane in H and a polynomial in X.

10 Polynomial Classifier: XOR problem
Decision surface in X. MATLAB:
>> x1 = -0.5:0.1:1.5;
>> x2 = ( )./(2*x1 - 1);
>> plot(x1, x2);
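The numerator was not transcribed above. A possible completed version, assuming the polynomial discriminant g(x) = x₁ + x₂ − 2x₁x₂ − 1/4 (which separates the XOR points), so that g(x) = 0 gives x₂ = (x₁ − 0.25)/(2x₁ − 1):
x1 = -0.5:0.1:1.5;
x2 = (x1 - 0.25)./(2*x1 - 1);        % decision surface in X (assumed numerator)
plot(x1, x2, 'k'); hold on
plot([0 1], [0 1], 'bo', [0 1], [1 0], 'rx');   % the four XOR points
axis([-0.5 1.5 -0.5 1.5]); hold off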

11 Polynomial Classifier: XOR problem
With nonlinear polynomial functions, the classes can be separated in the original space X. Example: the XOR problem was not linearly separable ... but it is linearly separable in H ... and it is separable in X with a polynomial function!

12 Polynomial Classifier
The decision function is approximated by a polynomial function g(x) of order p, e.g. for p = 2: g(x) = w₀ + Σᵢ wᵢxᵢ + Σᵢ Σⱼ≥ᵢ wᵢⱼxᵢxⱼ. This is a special case of a two-layer perceptron: an activation function with a polynomial input.

13 Similarly, map ℝ² → ℝ³. (Left) A dataset in ℝ², not linearly separable. (Right) The same dataset transformed to ℝ³ by the transformation [x₁, x₂] → [x₁, x₂, x₁² + x₂²].

14 Non-Linear SVM in Higher dimensions
A dataset D that is not linearly separable in ℝᴺ may be linearly separable in a higher-dimensional space ℝᴹ (where M > N). Thus, if we have a transformation ϕ that lifts the dataset D to a higher-dimensional ϕ(D) such that ϕ(D) is linearly separable, then we can train a linear SVM on ϕ(D) to find a decision boundary that separates the classes in ℝᴹ. Projecting that decision boundary back to the original space yields a nonlinear decision boundary. Visualisations: "SVM with polynomial kernel visualization", "Performing nonlinear classification via linear separation in higher dimensional space".
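As a concrete sketch (made-up data, using the ℝ² → ℝ³ lift from slide 13): points inside a circle are not linearly separable from points outside it in ℝ², but become linearly separable once x₁² + x₂² is added as a third feature:
rng(1);
X = 2*rand(200,2) - 1;                    % 200 points in [-1, 1]^2
y = double(sum(X.^2, 2) < 0.5);           % label: inside (1) vs. outside (0) a circle
Z = [X, sum(X.^2, 2)];                    % lift: [x1, x2, x1^2 + x2^2]
mdl = fitcsvm(Z, y, 'KernelFunction', 'linear');   % linear SVM in R^3
trainErr = resubLoss(mdl)                 % should be (close to) zero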

15 What if M is too large?

16 The Kernel Trick (We only need dot products)
Give me the numeric result of (x + y)¹⁰ for some values of x and y. By the Binomial Theorem: x¹⁰ + 10x⁹y + 45x⁸y² + 120x⁷y³ + 210x⁶y⁴ + 252x⁵y⁵ + 210x⁴y⁶ + 120x³y⁷ + 45x²y⁸ + 10xy⁹ + y¹⁰. The trick: we can get the numeric result directly from (x + y)¹⁰ without ever forming the expansion. In the same way, valid kernels are dot products whose numeric result between two points can be computed without forming the explicit feature values: K(x, y) = <ϕ(x), ϕ(y)>, where <·,·> is the inner product in the new space. We don't even need to know what ϕ is.
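A small numeric illustration of the trick (a sketch with hypothetical values): for the degree-2 kernel K(x, y) = (xᵀy)² on ℝ², the explicit feature map is ϕ(v) = (v₁², √2·v₁v₂, v₂²)ᵀ, and the kernel gives the same number without forming ϕ:
x = [1; 2];  y = [3; 4];
phi = @(v) [v(1)^2; sqrt(2)*v(1)*v(2); v(2)^2];   % explicit degree-2 feature map
k_explicit = phi(x)' * phi(y)     % dot product in feature space: 121
k_trick    = (x' * y)^2           % same value, computed in the input space: 121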

17 Representation Theory
(Figure: a portion of the vector field (sin y, sin x).) In vector calculus, a vector field is an assignment of a vector to each point in a subset of space. A vector space is a mathematical structure formed by a collection of vectors, which may be added together and multiplied ("scaled") by factors, subject to the vector-space axioms. The geometric properties of a family of surfaces in 3D Euclidean space can be studied using the vector fields of the unit normals on these surfaces. A basis is a set of linearly independent vectors that, in linear combinations, can represent every vector in a given vector space or free module; more simply put, it defines a "coordinate system". Generally, a basis is a linearly independent spanning set. (Figure: the same vector represented in two different bases (purple and red arrows).)

18 Orthogonality explains the trick
Orthogonality is the relation of two lines at right angles to one another (perpendicularity). More generally, it describes non-overlapping, uncorrelated, or independent objects of some kind. The dot product of two vectors is related to the angle between them; two vectors are orthogonal when their dot product is equal to zero (arccos(0) = 90°).

19 Example
We don't always have domain experts, so we need to learn the model from the data.

20 Linear Regression
Linear regression fits the data D = {(x(n), y(n)), n = 1, 2, …, N} to a linear equation y(x) = ax + b by choosing the weights w = (a, b)ᵀ in y(x) = wᵀϕ(x), where ϕ(x) = (x, 1)ᵀ is the feature map, so as to minimise the squared prediction error E(w) = Σₙ [y(n) − wᵀϕ(x(n))]².

21 Linear Regression in Matrix Form
In matrix form the error function is E(w) = (y − ϕᵀw)ᵀ(y − ϕᵀw), with yₙ = y(n) and ϕᵢⱼ = ϕᵢ(xⱼ). The minimum error is obtained at w = (ϕϕᵀ)⁻¹ϕy, where (ϕϕᵀ)⁻¹ϕ is the Moore-Penrose pseudo-inverse. (ϕϕᵀ) is an M×M matrix, where M is the number of basis functions, so computing (ϕϕᵀ)⁻¹ϕ takes O(M³) operations.
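A minimal MATLAB sketch of this solution (made-up data, using the N×M design-matrix convention Phi(n,i) = ϕᵢ(x(n)), i.e. the transpose of ϕ above):
x = linspace(0, 1, 50)';
y = 2*x + 1 + 0.1*randn(50, 1);     % noisy line with a = 2, b = 1
Phi = [x, ones(50, 1)];             % design matrix for phi(x) = (x, 1)
w = pinv(Phi) * y                   % Moore-Penrose pseudo-inverse; w is approximately (2, 1)'
% equivalently: w = (Phi' * Phi) \ (Phi' * y)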

22 Dealing with unknown Basis Functions
One way to deal with over-fitting is regularization. However, it will not help much if the basis functions do not match the data. So what should we do if we do not know the model? Is there a universal set of basis functions that can approximate arbitrary functions reasonably well?

23 More Complex Basis functions
We can also use a more complex model, for example a 3rd-degree polynomial: y(x) = w₁ + w₂x + w₃x² + w₄x³. This yields the feature map ϕ(x) = (1, x, x², x³)ᵀ. Quality of the solution? Worse. Model complexity? Higher. Why? The model does not match the data.

24 Linear Regression Using “bump” Functions
Approximate the function with a sum of bump-shaped functions.

25 Linear Regression Using Radial Basis Functions (RBFs)
Gaussian radial basis functions ϕᵢ(x) = exp(−α ‖x − m(i)‖²): the centre is at m(i) and α determines the width of the bump. Advantage: they decrease smoothly to zero and do not show oscillatory behaviour, unlike higher-order polynomials.

26 Gaussian RBF in 2D space

27 Linear Regression using RBFs
Use Gaussian RBFs ϕᵢ(x), i = 1, …, 16, as basis functions and apply linear regression. In one dimension this seems to work well, but does it scale to higher dimensions? (Figure: approximation of sin(10x) by 16 Gaussian RBFs; black line: original function, red line: regression.)
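A minimal sketch of this fit (the centre placement and the width α are assumptions, since they are not given on the slide):
x = linspace(0, 1, 200)';                 % training inputs
y = sin(10*x);                            % target function
m = linspace(0, 1, 16);                   % 16 RBF centres spread over [0, 1] (assumption)
alpha = 100;                              % width parameter (assumption)
Phi = exp(-alpha * (x - m).^2);           % 200 x 16 design matrix; implicit expansion (R2016b+),
                                          % use bsxfun(@minus, x, m) on older versions
w = pinv(Phi) * y;                        % least-squares weights
plot(x, y, 'k', x, Phi*w, 'r');           % black: original function, red: regression
legend('sin(10x)', 'RBF regression');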

28 Curse of Dimensionality
To cover the data with a constant discretisation level (number of RBFs per unit volume), the number of basis functions and weights grows exponentially with the number of dimensions. Is there a way to reduce the number of required weights?
Dimensions | # of basis functions / weights
1  | 16
2  | 16² = 256
3  | 16³ = 4,096
4  | 16⁴ = 65,536
5  | 16⁵ = 1,048,576
10 | 16¹⁰ ≈ 10¹²

29 Dual Representation: Orthogonal Complement and Orthogonal Decomposition
Let 𝒜 be a subspace of ℝᴺ. The orthogonal complement 𝒜⊥ of 𝒜 is the set of all vectors that are orthogonal to all elements of 𝒜: 𝒜⊥ = {z ∈ ℝᴺ : xᵀz = 0 ∀ x ∈ 𝒜}. Orthogonal decomposition: every y ∈ ℝᴺ can be written uniquely in the form y = ŷ + z, with ŷ ∈ 𝒜 and z ∈ 𝒜⊥. Example: 𝒜 = {c₁(1, 0, 0) + c₂(1, 1, 0), c ∈ ℝ²}, 𝒜⊥ = {c(0, 0, 1), c ∈ ℝ}; for y = (2, 3, 5) we get ŷ = (2, 3, 0) and z = (0, 0, 5).

30 Null Space of Training Set
Let 𝒳 = {x(n), n = 1, 2, …, N} be the set of all training points, and consider the predictions of the model y(x) = wᵀx on the training set, i.e. for x ∈ 𝒳. Using the orthogonal decomposition we write w = ŵ + z, with ŵ ∈ span(𝒳) and z ∈ span(𝒳)⊥. The prediction becomes y(x) = (ŵ + z)ᵀx = ŵᵀx + zᵀx = ŵᵀx, because zᵀx = 0: z is orthogonal to every training point x ∈ 𝒳. Hence we can assume that w ∈ span(𝒳).

31 Dual Representation
The weight vector w is thus a linear combination of the training samples 𝒳: w = Σₙ aₙ x(n). The parameters a = (a₁, …, a_N) are called the dual parameters. This also applies when using basis functions: w = Σₙ aₙ ϕ(x(n)). There are as many dual parameters as training samples; their number is independent of the number of basis functions.

32 Predictions in Dual Representation
Substituting w = Σₙ aₙ ϕ(x(n)) back into the linear regression y(x) = wᵀϕ(x) yields y(x) = Σₙ aₙ ϕ(x(n))ᵀϕ(x) = Σₙ aₙ K(x(n), x), with the kernel function K(x, y) := ϕ(x)ᵀϕ(y). Using the Gram matrix Kₘₙ := K(x(m), x(n)) = ϕ(x(m))ᵀϕ(x(n)), the predictions Yₙ = y(x(n)) on the training set are simply Yᵀ = aᵀK. The Gram matrix of a set of vectors v₁, …, vₙ in an inner product space is the Hermitian matrix of inner products, whose entries are given by Gᵢⱼ = <vⱼ, vᵢ>. For finite dimensions in Euclidean space, it is G = VᵀV. It is the evaluation of the kernel function over all pairs of points in the training set.

33 Solution of Dual Representation
Primal representation (standard linear regression): solution w = (ϕϕᵀ)⁻¹ϕy = ϕ⁺y, where ϕ is the matrix of feature vectors; complexity is O(M³), where M is the number of basis functions; predictions y(x) = wᵀϕ(x). Dual representation: solution a = (KKᵀ)⁻¹Ky = K⁺y, where K is the Gram matrix (also called the kernel matrix); complexity is O(N³), where N is the number of training samples; predictions y(x) = Σₙ aₙ K(x(n), x), i.e. components aₙ weighted with the similarity of x to training sample x(n).
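A minimal numerical sketch of the equivalence (made-up 1-D data, a cubic polynomial basis, and the N×M design-matrix convention):
x = linspace(-1, 1, 30)';
y = x.^3 - 0.5*x + 0.05*randn(30, 1);
Phi = [ones(30,1), x, x.^2, x.^3];       % N x M design matrix, M = 4 basis functions
w_primal = (Phi' * Phi) \ (Phi' * y);    % primal: solve an M x M system, O(M^3)
K = Phi * Phi';                          % Gram matrix, N x N
a = pinv(K) * y;                         % dual parameters, one per training sample, O(N^3)
w_dual = Phi' * a;                       % w recovered as a combination of the samples
max(abs(w_primal - w_dual))              % numerically tiny: both give the same model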

34 Why is the Dual Representation Useful?
N is the number of training samples. With lots of basis functions and a moderate number of training samples, the dual representation saves a lot of parameters.

35 Example: Kernel of RBFs
Gaussian RBF basis functions ϕᵢ(x), for i = 1, …, M. Calculate the kernel: K(x, y) := ϕ(x)ᵀϕ(y) = Σᵢ ϕᵢ(x)ϕᵢ(y). Linear regression then becomes y(x) = Σₙ aₙ K(x(n), x), where N is the number of training samples.

36 Example: Kernel of Polynomial Basis
The polynomial basis of order D, ϕᵢ(x) = xⁱ⁻¹, i = 1, …, D, induces the kernel K(x, y) := ϕ(x)ᵀϕ(y) = Σᵢ₌₁ᴰ xⁱ⁻¹yⁱ⁻¹ = (1 − (xy)ᴰ)/(1 − xy), and the Gram matrix Kₙₘ := K(x(n), x(m)) = (1 − (x(n)x(m))ᴰ)/(1 − x(n)x(m)). Evaluating the kernel is much cheaper than calculating the mapping ϕ into feature space and the scalar product explicitly: the expensive D-dimensional scalar product simplifies to this geometric series.
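A quick numeric check of the geometric-series form (a sketch with hypothetical scalar inputs):
D = 10;  x = 0.7;  y = 0.3;
phi = @(v) v.^(0:D-1)';                  % phi_i(v) = v^(i-1), i = 1, ..., D
k_explicit = phi(x)' * phi(y)            % D-term scalar product
k_series   = (1 - (x*y)^D) / (1 - x*y)   % closed-form geometric series, same value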

37 Properties of the Kernel
Calculate ϕᵀϕ componentwise: (ϕᵀϕ)ₘₙ = ϕ(x(m))ᵀϕ(x(n)) = K(x(m), x(n)). Thus we have K = ϕᵀϕ, with K symmetric. K is positive semi-definite, that is bᵀKb ≥ 0 for any b, since bᵀKb = bᵀϕᵀϕb = ‖ϕb‖² ≥ 0 for any b.

38 When is a function a Kernel?
Mercer's theorem (for a finite input space): consider a finite input space 𝒳 = {x(1), …, x(N)} with K(x(n), x(m)) a function on 𝒳. Then K(x(n), x(m)) is a kernel function, that is, a scalar product in some feature space, if and only if K is symmetric and the matrix Kₙₘ = K(x(n), x(m)) is positive semi-definite. Proof sketch: K is symmetric, so the eigenvalue decomposition has the form K = VΛVᵀ, where Λ is a diagonal matrix containing the eigenvalues λₜ = Λₜₜ of K and V is an orthogonal matrix containing the eigenvectors vₜ. K is positive semi-definite, so all eigenvalues are non-negative, λₜ ≥ 0. Define the feature map ϕ(x(n)) := Λ^(1/2)Vᵀeₙ (the n-th column of Λ^(1/2)Vᵀ, with eₙ the n-th standard basis vector) and check that it corresponds to the kernel by calculating the scalar product: ϕ(x(n))ᵀϕ(x(m)) = eₙᵀVΛVᵀeₘ = Kₙₘ.
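A small numerical illustration of the finite-case construction (a sketch; the data points and the polynomial kernel are arbitrary choices):
X = [0 0; 0 1; 1 0; 1 1];            % four training points
K = (1 + X*X').^2;                   % a degree-2 polynomial kernel: symmetric, PSD
[V, L] = eig(K);                     % K = V*L*V'
Phi = sqrt(max(L, 0)) * V';          % feature map: column n is phi(x(n))
max(max(abs(Phi'*Phi - K)))          % ~0: the scalar products reproduce K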

39 Kernels Summary Kernels compute the value of a scalar product in high dimensional feature space without explicitly computing the features. They can be used in any model that depends (or can be rewritten so that it depends) on the scalar product of the training samples. Not every function is a kernel. Use Kernel construction rules to prove that a function is a valid kernel.

40

41 Math Review: Hyper-planes in Hessian Normal Form
For a point x₁ that lies on the far side of the hyper-plane we have wᵀx₁ + b > 0, because the projection of x₁ on w is larger than for points x that are on the hyper-plane. Conversely, for a point x₂ that lies on the near side of the hyper-plane we have wᵀx₂ + b < 0, because the projection of x₂ on w is smaller than for points x that are on the hyper-plane. The linear classifier h(x) = wᵀx + b classifies a point by checking sign(h(x)).

42 Linear Classifier with a Margin
We add two more hyper-planes that are parallel to the original hyper-plane and require that no training point lies between them. Thus we now require wᵀx + (b − s) > 0 for all x from the blue class, and wᵀx + (b + s) < 0 for all x from the green class. The distance from the origin to the hyper-plane wᵀx + b = 0 is d = −b/|w|, thus d_blue = −(b − s)/|w| and d_green = −(b + s)/|w|, and the margin is m = d_blue − d_green = 2s/|w|. Since w, b, s can be rescaled without changing the hyper-planes, we can set s = 1, so the margin becomes 2/|w|.

43 Set of Constraints
Let xᵢ be the i-th data point and yᵢ ∈ {−1, 1} the class assigned to xᵢ. The constraints are: wᵀxᵢ + b ≥ +1 for yᵢ = +1, and wᵀxᵢ + b ≤ −1 for yᵢ = −1. These can be condensed to yᵢ(wᵀxᵢ + b) ≥ 1 for all i. If these constraints are fulfilled, the margin is 2/|w|.

44 Optimisation Problem
Let xᵢ be the i-th data point, i = 1, …, m, and yᵢ ∈ {−1, 1} the class assigned to xᵢ. To find the separating hyper-plane with the maximum margin we minimise f₀(w, b) = ½wᵀw subject to fᵢ(w, b) = yᵢ(wᵀxᵢ + b) − 1 ≥ 0 for i = 1, …, m. This is a constrained convex optimisation problem. Note that we introduced the arbitrary factor ½ in the objective function f₀(w, b) to obtain nicer derivatives later on.

45 Dual Problem
We apply the recipe for solving the constrained optimisation problem. 1. Calculate the Lagrangian: L(w, b, α) = ½wᵀw − Σᵢ αᵢ[yᵢ(wᵀxᵢ + b) − 1]. 2. Minimise L(w, b, α) w.r.t. w and b: setting ∂L/∂w = 0 gives w = Σᵢ αᵢyᵢxᵢ, and setting ∂L/∂b = 0 gives Σᵢ αᵢyᵢ = 0. Thus the weights are a linear combination of the training samples: w = Σᵢ αᵢyᵢxᵢ.

46 Dual Problem
Substituting both relations back into L(w, b, α) gives the Lagrange dual function g(α). Thus: maximise g(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ, subject to αᵢ ≥ 0 for i = 1, …, m and Σᵢ αᵢyᵢ = 0. We will demonstrate an efficient algorithm to solve the dual problem. The bias b does not appear in the dual problem and thus needs to be recovered afterwards. For both issues we first need to derive some properties of the solution.

47 Support Vectors
Remember that the optimal solution must satisfy, amongst others, the KKT complementary slackness condition αᵢ fᵢ(x*) = 0. In our case this means αᵢ[yᵢ(wᵀxᵢ + b) − 1] = 0 for all i. Hence a training sample xᵢ can only contribute to the weight vector (αᵢ ≠ 0) if it lies on the margin, that is, yᵢ(wᵀxᵢ + b) = 1. A training sample xᵢ with αᵢ ≠ 0 is called a support vector. The class of x is h(x) = sign(wᵀx + b); substituting w = Σᵢ αᵢyᵢxᵢ gives h(x) = sign(Σᵢ αᵢyᵢxᵢᵀx + b). We only need to remember the few training samples where αᵢ ≠ 0.

48 Recovering the Bias
From the complementary slackness condition αᵢ fᵢ(x) = 0 we can recover the bias. If we take a support vector (αᵢ ≠ 0), the corresponding constraint fᵢ(w, b) must be zero, and thus wᵀxᵢ + b = yᵢ (using yᵢ² = 1). Solving this for b gives the bias b = yᵢ − wᵀxᵢ, where w = Σⱼ αⱼyⱼxⱼ. We can use any support vector to calculate the bias.
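Pulling slides 44-48 together, a minimal MATLAB sketch (toy data invented for illustration; quadprog is from the Optimization Toolbox and is not the efficient algorithm alluded to on slide 46) that solves the hard-margin dual and then recovers w and b from the support vectors:
X = [1 1; 2 2; 2 0; 0 0; -1 0; 0 -1];   % six linearly separable training points
y = [1; 1; 1; -1; -1; -1];
H = (y*y') .* (X*X');                   % H(i,j) = y_i*y_j*x_i'*x_j
f = -ones(6, 1);                        % quadprog minimises, so negate g(alpha)
alpha = quadprog(H, f, [], [], y', 0, zeros(6,1), []);  % s.t. y'*alpha = 0, alpha >= 0
w = X' * (alpha .* y);                  % w = sum_i alpha_i * y_i * x_i
sv = find(alpha > 1e-6);                % support vectors: alpha_i > 0
b = mean(y(sv) - X(sv,:) * w);          % b = y_i - w'*x_i for any support vector
sign(X*w + b)                           % reproduces the training labels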

49 Non-Linear SVM: The XOR Problem
Linear separation in a high-dimensional space H via nonlinear functions (polynomials and RBFs) in the original space X. For this we found nonlinear mappings ϕ(x): X → H. Is that possible without knowing the mapping function ϕ?

50 Non-linear Support Vector Machines
Recall that the probability of having linearly separable classes increases as the dimensionality of the feature vectors increases. Assume the mapping x ∈ ℝˡ → z = ϕ(x) ∈ ℝᵏ, k > l, and then use a linear SVM in ℝᵏ. Recall that in this case the dual problem formulation becomes: maximise Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ), subject to αᵢ ≥ 0 and Σᵢ αᵢyᵢ = 0, where K(xᵢ, xⱼ) = ϕ(xᵢ)ᵀϕ(xⱼ). The classifier becomes g(x) = sign(Σᵢ αᵢyᵢ K(xᵢ, x) + b).
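As a sketch in MATLAB, using fitcsvm (Statistics and Machine Learning Toolbox) rather than forming the mapping explicitly (the kernel order and box constraint are choices made for illustration), the XOR problem from earlier can be solved this way:
X = [0 0; 0 1; 1 0; 1 1];
y = [-1; 1; 1; -1];                                  % XOR labels
mdl = fitcsvm(X, y, 'KernelFunction', 'polynomial', ...
              'PolynomialOrder', 2, 'BoxConstraint', 1000);   % ~hard margin
predict(mdl, X)'                                     % should reproduce [-1 1 1 -1]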

51

52

53

54

55

56 Non-linear SVM Formulation

57 Linear SVM – Pol SVM in the input Space X

58 Pol. SVM – RBF SVM in the input space X

59 Pol. SVM – RBF SVM in the input space X

60 SVM vs. Neural Networks
Training a support vector machine (SVM) involves optimisation of a concave function (a convex problem): there is a single global optimum. Training a neural network learning model is generally non-convex: there are potentially different solutions depending on the starting values of the model parameters.

61 SVM Software

62 Matlab SVM Functions
For the exam dataset, the polynomial kernel function gave the best SVM score:
xtrain = Dataset(1:80, 1:2);
ytrain = Dataset(1:80, 3);
xtest  = Dataset(81:100, 1:2);
% The polynomial-kernel SVM classifier:
SVMStruct = svmtrain(xtrain, ytrain, 'kernel_function', 'polynomial');
ytest = svmclassify(SVMStruct, xtest);
% Resubstitution error (assumes string class labels; use ~= for numeric labels):
bad = ~strcmp(svmclassify(SVMStruct, xtrain), ytrain);
P_SVMResubErr = sum(bad) / 80
% 10-fold cross-validation error; the prediction function retrains on each fold:
cp = cvpartition(ytrain, 'k', 10);
SVMClassFun = @(xtrain, ytrain, xtest) svmclassify(svmtrain(xtrain, ytrain, 'kernel_function', 'polynomial'), xtest);
P_SVMCVErr = crossval('mcr', xtrain, ytrain, 'predfun', SVMClassFun, 'partition', cp)

63 Other Matlab SVM functions
SVMModel1 = fitcsvm(xtrain, ytrain, 'KernelFunction', 'polynomial', 'Standardize', true);
% Compute the scores over a grid
d = 0.02; % Step size of the grid
[x1Grid, x2Grid] = meshgrid(min(xtrain(:,1)):d:max(xtrain(:,1)), ...
    min(xtrain(:,2)):d:max(xtrain(:,2)));
xGrid = [x1Grid(:), x2Grid(:)]; % The grid
[~, scores1] = predict(SVMModel1, xGrid); % The scores
figure;
h(1:2) = gscatter(xtrain(:,1), xtrain(:,2), ytrain);
hold on
h(3) = plot(xtrain(SVMModel1.IsSupportVector, 1), ...
    xtrain(SVMModel1.IsSupportVector, 2), 'ko', 'MarkerSize', 10); % Support vectors
contour(x1Grid, x2Grid, reshape(scores1(:,2), size(x1Grid)), [0 0], 'k'); % Decision boundary
title('Scatter Diagram with the Decision Boundary')
legend({'-1','1','Support Vectors'}, 'Location', 'Best');
hold off

