
1. Document Analysis: Linear Discrimination
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008

2. Outline
© Prof. Rolf Ingold
- Introduction to linear discrimination
- Linear machines
- Generalized discriminant functions
- Augmented vectors and linear separability
- Objective functions and the gradient descent procedure
- The perceptron: principle and algorithms
- Perceptron with margins
- Relaxation with margins
- Principles of support vector machines

3. Principle of linear discrimination
The principle consists in determining the region boundaries (or, equivalently, the discriminant functions) directly from training samples. Additionally, these functions are assumed to be linear:
- fast to compute
- well-known properties
- no loss of generality when combined with arbitrary feature transformations
The problem of finding the discriminant functions is stated as an optimization problem: minimize an error cost over the training samples.

4. Linear discriminant functions for two classes
A linear discriminant function is written
  g(x) = w^t x + w_0
where
- x represents the feature vector of the sample to be classified
- w is a weight vector and w_0 is a threshold weight (or bias); both have to be determined
The equation g(x) = 0 defines the decision boundary between the two classes.
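The two-class rule above can be sketched in a few lines; the weight vector and bias below are hypothetical example values, not taken from the course:

```python
# Illustrative sketch of a two-class linear discriminant
# g(x) = w^t x + w_0; the sign of g(x) decides the class.

def g(x, w, w0):
    """Linear discriminant function g(x) = w^t x + w_0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify(x, w, w0):
    """Decide class 1 if g(x) > 0, class 2 if g(x) < 0."""
    return 1 if g(x, w, w0) > 0 else 2

w, w0 = [1.0, -2.0], 0.5             # hypothetical weights and bias
print(classify([3.0, 0.0], w, w0))   # g = 3.5 > 0  -> class 1
print(classify([0.0, 2.0], w, w0))   # g = -3.5 < 0 -> class 2
```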

5. Geometrical interpretation
The decision boundary is a hyperplane dividing the feature space into two half-spaces:
- w represents a normal vector of the hyperplane, since for any x_1 and x_2 belonging to the hyperplane, g(x_1) = g(x_2) = 0 implies w^t (x_1 - x_2) = 0
- the distance of x to the hyperplane is |g(x)| / ||w||, since g(x) = w^t x + w_0
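Both geometric facts can be checked numerically; the hyperplane 3x + 4y - 5 = 0 below is a hypothetical example:

```python
# Sketch: w is normal to the hyperplane g(x) = 0, and the distance
# of a point x to it is |g(x)| / ||w||.
import math

w, w0 = [3.0, 4.0], -5.0   # hypothetical hyperplane 3x + 4y - 5 = 0

def g(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def distance(x):
    return abs(g(x)) / math.sqrt(sum(wi * wi for wi in w))

# x1 and x2 both lie on the hyperplane, so w^t (x1 - x2) = 0.
x1, x2 = [1.0, 0.5], [-1.0, 2.0]
diff = [p - q for p, q in zip(x1, x2)]
print(sum(wi * di for wi, di in zip(w, diff)))  # 0.0: w is normal
print(distance([0.0, 0.0]))                     # |-5| / 5 = 1.0
```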

6. Discrimination of multiple classes
To discriminate c classes pairwise, c(c-1)/2 discriminant functions must be used:
- the decision regions do not produce a partition of the feature space
- ambiguous regions appear

7. Linear machines
Multiple-class discrimination can be performed with exactly one function g_i(x) per class:
- the decision consists in choosing the i that maximizes g_i(x)
- the decision boundary between classes i and j is given by the hyperplane H_ij defined by the equation g_i(x) - g_j(x) = 0
- the decision regions produce a partition of the feature space
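The argmax decision of a linear machine can be sketched as follows; the three weight vectors are hypothetical:

```python
# Sketch of a linear machine: one linear function g_i per class,
# decide the class whose score g_i(x) is largest.

def linear_machine(x, W, b):
    """W is a list of weight vectors, b the list of biases."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + bi
              for w, bi in zip(W, b)]
    return max(range(len(scores)), key=scores.__getitem__)

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # hypothetical 3 classes
b = [0.0, 0.0, 0.0]
print(linear_machine([2.0, 1.0], W, b))    # class 0 wins (score 2)
print(linear_machine([-3.0, -3.0], W, b))  # class 2 wins (score 6)
```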

8. Quadratic discriminant functions
Discriminant functions can be generalized with quadratic terms in x; the decision boundaries then become non-linear. By extending the feature space with these quadratic forms, the decision boundaries become linear again.
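As a small illustration of this idea (the circle boundary is a hypothetical example, not from the slides), the quadratic boundary x1^2 + x2^2 = 1 becomes linear in the extended feature space y = (1, x1, x2, x1^2, x2^2):

```python
# Sketch: a quadratic boundary in x-space is a linear boundary
# a^t y = 0 in the quadratically extended feature space.

def quad_features(x):
    x1, x2 = x
    return [1.0, x1, x2, x1 * x1, x2 * x2]

a = [-1.0, 0.0, 0.0, 1.0, 1.0]   # encodes x1^2 + x2^2 - 1

def g(x):
    return sum(ai * yi for ai, yi in zip(a, quad_features(x)))

print(g([0.0, 0.5]) < 0)   # inside the circle  -> True
print(g([2.0, 0.0]) > 0)   # outside the circle -> True
```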

9. Generalized discriminant functions
A more general approach consists of using generalized discriminant functions of the form
  g(x) = sum_k a_k y_k(x) = a^t y
where the y_k(x) are arbitrary functions of x, possibly in a space of different dimension. The decision boundaries are linear in the space of y, but not in the original space containing x.

10. Augmented vectors
The principle of generalized discriminant functions can be applied to define augmented vectors
  y = (1, x_1, ..., x_d)^t  and  a = (w_0, w_1, ..., w_d)^t
where w_0 is added as an extra vector component. The problem is formulated in a new space whose dimension is augmented by 1, in which
- the hyperplane a^t y = 0 passes through the origin
- the distance from y to the hyperplane is equal to |a^t y| / ||a||
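The augmentation and the resulting distance formula can be sketched as follows; the vector a below is a hypothetical example:

```python
# Sketch of augmentation: y = (1, x_1, ..., x_d), a = (w_0, ..., w_d),
# so g(x) = w^t x + w_0 = a^t y, and the hyperplane a^t y = 0
# passes through the origin of the augmented space.
import math

def augment(x):
    return [1.0] + list(x)

def distance(y, a):
    """Distance |a^t y| / ||a|| from y to the hyperplane a^t y = 0."""
    aty = sum(ai * yi for ai, yi in zip(a, y))
    return abs(aty) / math.sqrt(sum(ai * ai for ai in a))

a = [-1.0, 2.0, 0.0]        # hypothetical: w_0 = -1, w = (2, 0)
y = augment([2.0, 3.0])     # y = (1, 2, 3)
print(distance(y, a))       # |(-1 + 4)| / sqrt(5)
```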

11. Linear separability
Let us consider n samples {y_1, ..., y_n}, each labeled class 1 or class 2:
- we are looking for a separating vector a such that a^t y_i > 0 for samples of class 1 and a^t y_i < 0 for samples of class 2
- each training sample puts a constraint on the solution region
- if such a vector exists, the two classes are said to be linearly separable
- by replacing every y_i labeled class 2 by -y_i, we obtain the single condition a^t y_i > 0 for all i, which allows the class labels to be ignored
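The label-normalization step can be sketched directly; the two samples are hypothetical:

```python
# Sketch: negating the class-2 samples turns the two conditions
# a^t y > 0 / a^t y < 0 into the single condition a^t y > 0
# for every (normalized) sample.

def normalize(samples, labels):
    """Replace each sample labeled 2 by its negation."""
    return [y if lab == 1 else [-yi for yi in y]
            for y, lab in zip(samples, labels)]

samples = [[1.0, 2.0], [-1.0, -3.0]]
labels = [1, 2]
print(normalize(samples, labels))   # [[1.0, 2.0], [1.0, 3.0]]
```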

12. Gradient descent procedure
To find a vector a satisfying a set of inequalities a^t y_i > 0, we can minimize an objective function J(a) and apply a gradient descent procedure:
- choose a[0]
- compute a[k+1] iteratively using a[k+1] = a[k] - η(k) grad J(a[k])
- the learning rate η(k) > 0 controls the convergence:
  - if η(k) is small, the convergence is slow
  - if η(k) is too large, the iteration may not converge
- stop when the convergence criterion is reached
The approach can be refined by a second-order method using the Hessian matrix.
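The generic procedure above can be sketched on a hypothetical quadratic objective J(a) = ||a - c||^2, whose gradient is 2(a - c); the fixed learning rate and tolerance are illustrative choices:

```python
# Sketch of the gradient descent procedure: iterate
# a[k+1] = a[k] - eta * grad J(a[k]) until the gradient is small.

def gradient_descent(grad, a0, eta=0.1, tol=1e-8, max_iter=10000):
    a = list(a0)
    for _ in range(max_iter):
        g = grad(a)
        a = [ai - eta * gi for ai, gi in zip(a, g)]
        if sum(gi * gi for gi in g) < tol:   # convergence criterion
            break
    return a

c = [1.0, -2.0]                              # hypothetical minimum
grad = lambda a: [2 * (ai - ci) for ai, ci in zip(a, c)]
print(gradient_descent(grad, [0.0, 0.0]))    # converges near (1, -2)
```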

13. Objective functions
Considering the set of misclassified samples Y = {y_i | a^t y_i ≤ 0}, the following objective functions can be considered:
- the number of misclassified samples
- the perceptron rule, minimizing the sum of the distances from the misclassified samples to the decision boundary
- the sum of the squared distances of the misclassified samples
- a criterion using margins

14. Illustrations of objective functions (figure slide)

15. Perceptron principle
The objective function to be minimized is
  J_p(a) = Σ_{y ∈ Y} (-a^t y)
and its gradient is
  grad J_p = Σ_{y ∈ Y} (-y)
Thus the update rule becomes
  a[k+1] = a[k] + η(k) Σ_{y ∈ Y} y
At each step, the distance from the misclassified samples y to the boundary is reduced. If a solution exists, the perceptron always finds one.

16. Perceptron algorithms
The perceptron rule can be implemented in two ways:
- Batch perceptron algorithm: at each step, (a multiple of) the sum of all misclassified samples is added to the weight vector.
- Iterative single-sample perceptron algorithm: at each step, a single misclassified sample is added to the weight vector:

  choose a; k = 0;
  repeat
    k = (k+1) mod n;
    if a.y[k] ≤ 0 then a = a + y[k];
  until a.y > 0 for all y
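The single-sample algorithm above can be sketched in Python, assuming the samples are already augmented and the class-2 samples negated, so the goal is a^t y > 0 for every y; the data set is a hypothetical separable example:

```python
# Sketch of the single-sample perceptron: add each misclassified
# sample to the weight vector until a^t y > 0 holds for all y.

def perceptron(samples, max_epochs=1000):
    a = [0.0] * len(samples[0])
    for _ in range(max_epochs):
        errors = 0
        for y in samples:
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:
                a = [ai + yi for ai, yi in zip(a, y)]   # a = a + y
                errors += 1
        if errors == 0:          # a^t y > 0 for all y: solution found
            return a
    return a

# Hypothetical separable data: augmented, class 2 already negated.
samples = [[1.0, 2.0, 1.0], [1.0, 1.5, 2.0],
           [-1.0, 1.0, -0.5], [-1.0, 0.5, 0.3]]
a = perceptron(samples)
print(all(sum(ai * yi for ai, yi in zip(a, y)) > 0 for y in samples))
```

Because the data is linearly separable (for instance a = (0, 1, 0) separates it), the convergence theorem guarantees the loop terminates.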

17. Final remarks on the perceptron
- The iterative single-sample perceptron algorithm terminates with a solution if and only if the classes are linearly separable.
- The solution found is often not optimal with respect to generalization: the solution vector often lies at the border of the solution region.
- There exist variants which improve this behavior.
- The perceptron rule is at the origin of a family of artificial neural networks, the multi-layer perceptrons (MLP), which are of great interest for pattern recognition.

18. Discrimination with margin
To improve the generalization behavior, the constraint a^t y > 0 can be replaced by a^t y > b, where b > 0 is called the margin. The solution region is then reduced by bands of width b / ||y_i||.

19. Perceptron with margin
The perceptron algorithm can be generalized by using margins:
- the update rule becomes a[k+1] = a[k] + η(k) Σ_{y: a^t y ≤ b} y
- it can be shown that, if the classes are linearly separable, the algorithm always finds a solution provided the learning rates satisfy η(k) ≥ 0, Σ_{k=1}^m η(k) → ∞ and Σ_{k=1}^m η(k)^2 / (Σ_{k=1}^m η(k))^2 → 0
- this is the case, for instance, for η(k) = 1 and η(k) = 1/k

20. Relaxation procedure with margin
The objective function of the perceptron is piecewise linear and its gradient is not continuous. The relaxation procedure generalizes the approach by considering
  J_r(a) = (1/2) Σ_{y ∈ Y} (a^t y - b)^2 / ||y||^2
where Y contains all samples y for which a^t y ≤ b. The gradient of J_r being
  grad J_r = Σ_{y ∈ Y} ((a^t y - b) / ||y||^2) y
the update rule becomes
  a[k+1] = a[k] + η(k) Σ_{y ∈ Y} ((b - a^t y) / ||y||^2) y

21. Relaxation algorithm
The relaxation algorithm in batch mode is as follows:

  define b, η[k]; choose a; k = 0;
  repeat
    k = k+1;
    sum = {0, ..., 0};
    for each y do
      if a.y ≤ b then sum = sum + ((b - a.y)/(y.y)) * y;
    a = a + η[k] * sum;
  until a.y > b for all y

There also exists a single-sample iterative version.
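The single-sample variant mentioned above can be sketched as follows; the margin b, the over-relaxation rate η = 1.5 (any 0 < η < 2 works), and the data set are hypothetical choices:

```python
# Sketch of single-sample relaxation on normalized, augmented samples:
# each update moves a toward the half-space a^t y > b by a fraction
# eta of the violation (b - a^t y) / ||y||^2.

def relaxation(samples, b=0.5, eta=1.5, max_epochs=1000):
    a = [0.0] * len(samples[0])
    for _ in range(max_epochs):
        done = True
        for y in samples:
            aty = sum(ai * yi for ai, yi in zip(a, y))
            if aty <= b:
                c = eta * (b - aty) / sum(yi * yi for yi in y)
                a = [ai + c * yi for ai, yi in zip(a, y)]
                done = False
        if done:                 # a^t y > b held for every sample
            return a
    return a

samples = [[1.0, 1.0], [1.0, -0.5]]   # hypothetical separable data
a = relaxation(samples)
print(all(sum(ai * yi for ai, yi in zip(a, y)) > 0.5 for y in samples))
```

With η > 1 each update strictly overshoots the hyperplane a^t y = b, which is why the termination test with a strict inequality can succeed.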

22. Support Vector Machines
Support vector machines (SVM) are based on similar considerations:
- the feature space is mapped into a space of much higher dimension using a non-linear mapping, including for each pattern a component y_{k,0} = 1
- for each pattern y_k, let z_k = ±1 according to whether the pattern belongs to class 1 or class 2
- let g(y) = a^t y be a linear discriminant; then a separating hyperplane ensures z_k g(y_k) > 0 for all k

23. SVM optimization criteria
The goal of a support vector machine is to find the separating hyperplane with the largest margin. Supposing a margin b > 0 exists, the goal is to find the vector a that maximizes b in
  z_k g(y_k) / ||a|| ≥ b for all k
The points verifying equality are called support vectors.
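For a fixed hyperplane, the normalized margins z_k a^t y_k / ||a|| and the support vectors (the points attaining the smallest one) can be sketched as follows; the patterns, labels, and vector a are hypothetical examples, not a trained SVM:

```python
# Sketch: compute the normalized margin of each labeled pattern for
# a given vector a, and pick out the support vectors.
import math

def margins(samples, labels, a):
    na = math.sqrt(sum(ai * ai for ai in a))
    return [z * sum(ai * yi for ai, yi in zip(a, y)) / na
            for y, z in zip(samples, labels)]

samples = [[1.0, 2.0], [1.0, 3.0], [1.0, -1.0]]   # augmented patterns
labels = [1, 1, -1]                               # z_k = +/-1
a = [0.0, 1.0]                       # hypothetical separating vector
m = margins(samples, labels, a)
b = min(m)                           # the margin of this hyperplane
support = [y for y, mk in zip(samples, m) if mk == b]
print(b, support)                    # smallest margin and its points
```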

24. Conclusion on SVM
- SVMs are still the subject of numerous research issues:
  - choice of the basis functions
  - optimized training strategies
- SVMs are reputed to avoid overfitting and therefore to have good generalization properties
