# Document Analysis: Linear Discrimination

Prof. Rolf Ingold, University of Fribourg. Master course, spring semester 2008.



## Outline

- Introduction to linear discrimination
- Linear machines
- Generalized discriminant functions
- Augmented vectors and linear separability
- Objective functions and the gradient descent procedure
- Perceptron (principle and algorithms)
- Perceptron with margins
- Relaxation with margins
- Principles of support vector machines

## Principle of linear discrimination

- The principle consists in determining region boundaries (or, equivalently, discriminant functions) directly from training samples
- Additionally, these functions are assumed to be linear:
  - fast to compute
  - well-known properties
  - no loss of generality when combined with arbitrary feature transformations
- Finding the discriminant functions is stated as an optimization problem: minimizing an error cost on the training samples

## Linear discriminant functions for two classes

- A linear discriminant function is written g(x) = w^t x + w_0, where
  - x represents the feature vector of the sample to be classified
  - w is a weight vector and w_0 is the threshold weight (or bias), both of which have to be determined
- The equation g(x) = 0 defines the decision boundary between the two classes

## Geometrical interpretation

- The decision boundary is a hyperplane dividing the feature space into two half-spaces
- w is a normal vector of the hyperplane, since for any x_1 and x_2 belonging to the hyperplane, g(x_1) = g(x_2) = 0 implies w^t (x_1 - x_2) = 0
- The signed distance of x to the hyperplane is r = g(x) / ||w||
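The distance formula can be checked with a tiny sketch; the 2-D weights w = (3, 4) and bias w_0 = -5 below are made-up illustrative values, chosen so that ||w|| = 5:

```python
import math

# Hypothetical linear discriminant g(x) = w.x + w0 in 2-D
w = [3.0, 4.0]
w0 = -5.0

def g(x):
    # linear discriminant value
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def distance_to_boundary(x):
    # signed distance r = g(x) / ||w||
    return g(x) / math.hypot(*w)

x = [3.0, 4.0]
print(g(x))                     # 3*3 + 4*4 - 5 = 20.0
print(distance_to_boundary(x))  # 20 / 5 = 4.0
```

The sign of the distance tells on which side of the hyperplane the sample lies, which is exactly the two-class decision rule.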

## Discrimination of multiple classes

- To discriminate c classes pairwise, c(c-1)/2 discriminant functions are needed
- The resulting decision regions do not produce a partition of the feature space: ambiguous regions appear

## Linear machines

- Multiple class discrimination can be performed with exactly one function g_i(x) per class
- The decision consists in choosing the class ω_i that maximizes g_i(x)
- The decision boundary between ω_i and ω_j is given by the hyperplane H_ij defined by the equation g_i(x) - g_j(x) = 0
- The decision regions produce a partition of the feature space
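The argmax decision of a linear machine can be sketched as follows; the three weight vectors are made-up illustrative values in augmented form (bias as last component):

```python
# One linear function per class; the decision picks the largest score.
weights = {
    "class1": [1.0, 0.0, 0.0],
    "class2": [0.0, 1.0, 0.0],
    "class3": [-1.0, -1.0, 0.5],
}

def classify(x):
    # score_i(x) = a_i[0]*x1 + a_i[1]*x2 + bias_i
    scores = {c: a[0] * x[0] + a[1] * x[1] + a[2] for c, a in weights.items()}
    return max(scores, key=scores.get)

print(classify([2.0, 0.5]))  # class1 (scores: 2.0, 0.5, -2.0)
```

Because each point gets exactly one maximal class (up to ties on the boundaries), the regions partition the feature space, unlike the pairwise construction.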

## Quadratic discriminant functions

- Discriminant functions can be generalized with quadratic terms in x; the decision boundaries then become non-linear
- By extending the feature space with the quadratic forms, the decision boundaries become linear again

## Generalized discriminant functions

- A more general approach consists of using generalized discriminant functions of the form g(x) = Σ_k a_k y_k(x), where the y_k(x) are arbitrary functions of x, possibly living in a space of different dimension
- Decision boundaries are linear in the space of y, but not in the original space containing x
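A minimal sketch of this idea for the quadratic case in 2-D: the mapping and the weight vector below are illustrative choices, picked so that a circle in x-space becomes a linear boundary in y-space.

```python
def quadratic_features(x):
    # y(x) = (1, x1, x2, x1^2, x1*x2, x2^2)
    x1, x2 = x
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]

# The circle x1^2 + x2^2 = 1 corresponds to the linear boundary a.y = 0
# in y-space with a = (-1, 0, 0, 1, 0, 1):
a = [-1.0, 0.0, 0.0, 1.0, 0.0, 1.0]

def g(x):
    return sum(ai * yi for ai, yi in zip(a, quadratic_features(x)))

print(g([1.0, 0.0]))  # on the circle: 0.0
print(g([2.0, 0.0]))  # outside the circle: 3.0
```

The boundary g(x) = 0 is non-linear (a circle) in x-space, yet a.y = 0 is a plain hyperplane in the 6-dimensional y-space.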

## Augmented vectors

- The principle of generalized discriminant functions can be applied to define augmented vectors y = (1, x_1, ..., x_d)^t and a = (w_0, w_1, ..., w_d)^t, where the bias w_0 is included as an additional vector component
- The problem is reformulated in a new space whose dimension is augmented by 1, where
  - the hyperplane a^t y = 0 passes through the origin
  - the distance from y to the hyperplane is equal to |a^t y| / ||a||
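A short sketch of the augmentation trick, reusing the made-up weights w = (3, 4), w_0 = -5 from the earlier two-class example:

```python
def augment(x):
    # prepend the constant component 1 to the feature vector
    return [1.0] + list(x)

w0, w = -5.0, [3.0, 4.0]
a = [w0] + w            # augmented weight vector a = (w0, w1, w2)

x = [3.0, 4.0]
y = augment(x)
g = sum(ai * yi for ai, yi in zip(a, y))
print(g)  # identical to w.x + w0 = 20.0
```

Folding the bias into a means the search for a separating hyperplane becomes a search for a single vector, with the hyperplane forced through the origin of the augmented space.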

## Linear separability

- Consider n samples {y_1, ..., y_n}, each labeled either ω_1 or ω_2
- We are looking for a separating vector a such that a^t y_i > 0 for samples of ω_1 and a^t y_i < 0 for samples of ω_2
- Each training sample puts a constraint on the solution region
- If such a vector exists, the two classes are said to be linearly separable
- By replacing every y_i labeled ω_2 by -y_i, we obtain the single condition a^t y_i > 0 for all i, which allows the class labels to be ignored
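The label-negation step can be sketched on toy 1-D augmented samples (all values below are made up for illustration):

```python
# (augmented sample, class label) pairs
samples = [([1.0, 2.0], 1), ([1.0, 3.0], 1),    # class omega_1
           ([1.0, -1.0], 2), ([1.0, -2.0], 2)]  # class omega_2

# negate every sample of omega_2 so one condition a.y > 0 covers both classes
normalized = [y if label == 1 else [-yi for yi in y]
              for y, label in samples]

# after negation, a = (0, 1) satisfies a.y > 0 for every sample:
a = [0.0, 1.0]
separated = all(a[0] * y[0] + a[1] * y[1] > 0 for y in normalized)
print(separated)  # True
```

After this normalization the training algorithms only ever have to test the single inequality a^t y_i > 0, which simplifies both the notation and the code.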

## Gradient descent procedure

- To find a vector a satisfying the set of inequalities a^t y_i > 0, we can minimize an objective function J(a) and apply a gradient descent procedure:
  - choose an initial a[0]
  - compute a[k+1] iteratively using a[k+1] = a[k] - η(k) ∇J(a[k]), where the learning rate η(k) > 0 controls the convergence
    - if η(k) is small, convergence is slow
    - if η(k) is too large, the iteration may not converge
  - stop when a convergence criterion is reached
- The approach can be refined by a second-order method using the Hessian matrix
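The update rule a[k+1] = a[k] - η ∇J(a[k]) can be illustrated on a toy 1-D objective (not one of the classification criteria, just J(a) = (a - 3)^2 with a fixed, made-up learning rate):

```python
def grad_J(a):
    # gradient of J(a) = (a - 3)^2
    return 2.0 * (a - 3.0)

a, eta = 0.0, 0.1          # initial point and fixed learning rate
for k in range(100):
    a = a - eta * grad_J(a)  # gradient descent step

print(round(a, 6))  # converges to the minimizer a = 3
```

With eta = 0.1 the error shrinks by a factor 0.8 per step; a larger eta (here anything above 1.0) would make the iteration diverge, which is the trade-off the slide describes.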

## Objective functions

- Considering the set Y = {y_i | a^t y_i ≤ 0} of misclassified samples, the following objective functions can be considered:
  - the number of misclassified samples
  - the perceptron rule, minimizing the sum of distances from the misclassified samples to the decision boundary
  - the sum of squared distances of the misclassified samples
  - a criterion using margins

## Illustrations of objective functions

## Perceptron principle

- The objective function to be minimized is J_p(a) = Σ_{y ∈ Y} (-a^t y), where Y is the set of misclassified samples, and its gradient is ∇J_p(a) = Σ_{y ∈ Y} (-y)
- Thus the update rule becomes a[k+1] = a[k] + η(k) Σ_{y ∈ Y} y
  - at each step, the distance from the misclassified samples to the boundary is reduced
- If a solution exists, the perceptron always finds one

## Perceptron algorithms

- The perceptron rule can be implemented in two ways:
  - Batch perceptron algorithm: at each step, (a multiple of) the sum of all misclassified samples is added to the weight vector
  - Iterative single-sample perceptron algorithm: at each step, one misclassified sample is added to the weight vector:

```
choose a; k = 0;
repeat
    k = (k+1) mod n;
    if a.y[k] <= 0 then a = a + y[k];
until a.y > 0 for all y
```
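A runnable sketch of the single-sample rule, cycling over epochs; the samples are made-up augmented vectors that are already normalized (class-2 samples negated) and the learning rate is fixed at 1:

```python
samples = [[1.0, 2.0, 1.0], [1.0, 1.5, 0.5],    # omega_1
           [-1.0, 1.0, 1.0], [-1.0, 0.5, 2.0]]  # omega_2, already negated

def dot(a, y):
    return sum(ai * yi for ai, yi in zip(a, y))

def perceptron(samples, max_epochs=100):
    a = [0.0] * len(samples[0])
    for _ in range(max_epochs):
        errors = 0
        for y in samples:
            if dot(a, y) <= 0:                     # misclassified sample
                a = [ai + yi for ai, yi in zip(a, y)]  # a = a + y
                errors += 1
        if errors == 0:                             # a.y > 0 for all y
            return a
    return a

a = perceptron(samples)
print(all(dot(a, y) > 0 for y in samples))  # True: a separating vector was found
```

On this toy data the algorithm stops after the second epoch; as the slide notes, termination is only guaranteed when the classes are linearly separable, hence the `max_epochs` safeguard.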

## Final remarks on the perceptron

- The iterative single-sample perceptron algorithm terminates with a solution if and only if the classes are linearly separable
- The solution found is often not optimal with respect to generalization: the solution vector often lies at the border of the solution region
- Variants exist which improve this behavior
- The perceptron rule is at the origin of a family of artificial neural networks called multi-layer perceptrons (MLP), which are of great interest for pattern recognition

## Discrimination with margin

- To improve the generalization behavior, the constraint a^t y > 0 can be replaced by a^t y > b, where b > 0 is called the margin
- The solution region is reduced by bands of width b / ||y_i||

## Perceptron with margin

- The perceptron algorithm can be generalized by using margins: the update rule becomes a[k+1] = a[k] + η(k) y whenever a^t y ≤ b
- It can be shown that, if the classes are linearly separable, the algorithm always finds a solution provided that η(k) ≥ 0, Σ_k η(k) → ∞ and (Σ_k η(k)²) / (Σ_k η(k))² → 0
- This is the case for η(k) = 1 and for η(k) = 1/k
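A sketch of the margin variant on made-up normalized, augmented samples, with the constant schedule η(k) = 1 and an illustrative margin b = 1:

```python
samples = [[1.0, 2.0], [1.0, 3.0], [-1.0, 1.0]]  # augmented, class-2 negated
b = 1.0                                           # required margin

def dot(a, y):
    return sum(ai * yi for ai, yi in zip(a, y))

a = [0.0, 0.0]
for _ in range(100):                  # epochs, bounded as a safeguard
    updated = False
    for y in samples:
        if dot(a, y) <= b:            # inside the margin: update with eta = 1
            a = [ai + yi for ai, yi in zip(a, y)]
            updated = True
    if not updated:
        break

print(all(dot(a, y) > b for y in samples))  # True
```

The only change from the plain perceptron is the test `a.y <= b` instead of `a.y <= 0`, which keeps the final vector away from the border of the solution region.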

## Relaxation procedure with margin

- The objective function of the perceptron is piecewise linear and its gradient is not continuous
- The relaxation procedure generalizes the approach by considering J_r(a) = (1/2) Σ_{y ∈ Y} (a^t y - b)² / ||y||², where Y contains all samples y for which a^t y ≤ b
- The gradient of J_r being ∇J_r(a) = Σ_{y ∈ Y} ((a^t y - b) / ||y||²) y, the update rule becomes a[k+1] = a[k] + η(k) Σ_{y ∈ Y} ((b - a^t y) / ||y||²) y

## Relaxation algorithm

- The relaxation algorithm in batch mode is as follows:

```
define b, eta[k];
choose a; k = 0;
repeat
    k = k+1;
    sum = {0, ..., 0};
    for each y do
        if a.y <= b then sum = sum + (b - a.y)/(y.y) * y;
    a = a + eta[k] * sum;
until a.y > b for all y
```

- A single-sample iterative version also exists
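The batch pseudocode above can be sketched directly in Python; the samples are made-up normalized, augmented vectors, with the illustrative constants eta = 1 and b = 1:

```python
samples = [[1.0, 2.0], [1.0, 3.0], [-1.0, 1.0]]
b, eta = 1.0, 1.0

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a = [0.0, 0.0]
for _ in range(100):                  # bounded number of batch passes
    s = [0.0, 0.0]                    # sum of scaled violating samples
    done = True
    for y in samples:
        if dot(a, y) <= b:            # sample violates the margin
            done = False
            c = (b - dot(a, y)) / dot(y, y)   # (b - a.y) / ||y||^2
            s = [si + c * yi for si, yi in zip(s, y)]
    if done:                           # a.y > b for all y
        break
    a = [ai + eta * si for ai, si in zip(a, s)]

print(all(dot(a, y) > b for y in samples))  # True
```

Each violating sample contributes a step proportional to how far it is from satisfying the margin, which is what makes J_r smooth where the perceptron criterion is only piecewise linear.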

## Support Vector Machines

- Support vector machines (SVM) are based on similar considerations:
  - the feature space is mapped to a space of much higher dimension using a non-linear mapping, including for each pattern a component y_k,0 = 1
  - for each pattern, let z_k = ±1 according to the class ω_1 or ω_2 the pattern y_k belongs to
  - let g(y) = a^t y be a linear discriminant; a separating hyperplane then ensures z_k g(y_k) > 0 for all k

## SVM optimization criteria

- The goal of a support vector machine is to find the separating hyperplane with the largest margin
- Supposing a margin b > 0 exists, the goal is to find the vector a that maximizes b subject to z_k g(y_k) / ||a|| ≥ b for all k
- The points for which z_k g(y_k) / ||a|| = b holds with equality are called support vectors
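The notion of support vectors can be sketched for one fixed hyperplane (a real SVM would optimize a; here a and the normalized, augmented samples are hand-picked for illustration):

```python
import math

samples = [[1.0, 2.0], [1.0, 4.0], [-1.0, 1.0]]  # augmented, class-2 negated
a = [0.0, 1.0]                                    # hand-picked separating vector

def margin(y):
    # normalized distance of y to the hyperplane a.y = 0
    return (a[0] * y[0] + a[1] * y[1]) / math.hypot(*a)

b = min(margin(y) for y in samples)               # margin of this hyperplane
support = [y for y in samples if margin(y) == b]  # samples sitting on the margin
print(b)        # 1.0
print(support)  # [[-1.0, 1.0]]
```

Only the support vectors constrain the optimum: the sample [1.0, 4.0] could move freely (as long as it stays beyond the margin) without changing the solution.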

## Conclusion on SVM

- SVMs are still the subject of numerous research issues:
  - choice of the basis functions
  - optimized training strategies
- SVMs have the reputation of avoiding overfitting and therefore of having good generalization properties

