LINEAR CLASSIFIERS

1  The Problem: Consider a two-class task with classes ω₁, ω₂ and a linear discriminant function g(x) = wᵀx + w₀, where w = [w₁, …, w_l]ᵀ is the weight vector and w₀ the threshold. The decision surface is the hyperplane g(x) = 0.

2  Hence: the decision surface wᵀx + w₀ = 0 is a hyperplane; w determines its orientation (it is orthogonal to the hyperplane, since wᵀ(x₁ − x₂) = 0 for any two points x₁, x₂ on it) and w₀ determines its position, while the distance of a point x from the hyperplane is |g(x)| / ‖w‖.

3  The Perceptron Algorithm. Assume linearly separable classes, i.e., there exists a w* such that w*ᵀx > 0 for all x ∈ ω₁ and w*ᵀx < 0 for all x ∈ ω₂. The case of a hyperplane with a threshold, w*ᵀx + w₀*, falls under the above formulation, since it can be written as w′ᵀx′ with the extended vectors w′ = [w*ᵀ, w₀*]ᵀ and x′ = [xᵀ, 1]ᵀ.

4  Our goal: compute a solution, i.e., a hyperplane w, so that wᵀx > 0 for x ∈ ω₁ and wᵀx < 0 for x ∈ ω₂. The steps:
–Define a cost function to be minimized.
–Choose an algorithm to minimize the cost function.
–The minimum corresponds to a solution.

5  The Cost Function: J(w) = Σ_{x∈Y} δ_x wᵀx, where Y is the subset of the training vectors wrongly classified by w and δ_x = −1 if x ∈ ω₁, δ_x = +1 if x ∈ ω₂ (so each term in the sum is positive). When Y = ∅ (empty set), a solution has been achieved and J(w) = 0.

6  J(w) is piecewise linear (WHY? Each term δ_x wᵀx is linear in w, and the set Y of misclassified vectors changes only when some wᵀx crosses zero.) The Algorithm: the philosophy of gradient descent is adopted, i.e., w(t+1) = w(t) + Δw with Δw = −ρ_t ∂J(w)/∂w evaluated at w = w(t).

7  Wherever the gradient is valid, ∂J(w)/∂w = Σ_{x∈Y} δ_x x, which leads to the update w(t+1) = w(t) − ρ_t Σ_{x∈Y} δ_x x. This is the celebrated Perceptron Algorithm.
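As an illustration of the update above, here is a minimal NumPy sketch of the batch perceptron. The function name, the constant learning rate, and the augmentation of x with a constant 1 (to absorb the threshold w₀) are choices of this write-up, not of the slides; labels are taken as y = +1 for ω₁ and y = −1 for ω₂, so that δ_x = −y.

```python
import numpy as np

def batch_perceptron(X, y, rho=0.1, max_iter=1000):
    """Batch perceptron: w(t+1) = w(t) - rho * sum_{x in Y} delta_x * x."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])    # augment with 1 to absorb the threshold w0
    w = np.zeros(Xa.shape[1])
    for _ in range(max_iter):
        wrong = y * (Xa @ w) <= 0                    # Y: the currently misclassified vectors
        if not wrong.any():                          # Y empty -> a separating hyperplane was found
            break
        grad = -(y[wrong, None] * Xa[wrong]).sum(axis=0)   # sum_{x in Y} delta_x * x, with delta_x = -y
        w = w - rho * grad
    return w
```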

8  An example. The perceptron algorithm converges in a finite number of iteration steps to a solution, provided the learning-rate sequence ρ_t is properly chosen, e.g., ρ_t = c/t, or any sequence for which Σ_t ρ_t → ∞ while Σ_t ρ_t² grows more slowly than (Σ_t ρ_t)².

9  A useful variant of the perceptron algorithm operates on one training vector at a time: if the vector x presented at step t belongs to ω₁ and wᵀ(t)x ≤ 0, then w(t+1) = w(t) + ρx; if it belongs to ω₂ and wᵀ(t)x ≥ 0, then w(t+1) = w(t) − ρx; otherwise w(t+1) = w(t). It is a reward-and-punishment type of algorithm.
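A sketch of this reward-and-punishment variant, under the same labeling convention as the batch sketch above; the cyclic presentation of the samples and the constant ρ are illustrative assumptions.

```python
import numpy as np

def reward_punishment_perceptron(X, y, rho=0.5, n_epochs=100):
    """Correct w only when the currently presented vector is misclassified."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])    # absorb the threshold w0
    w = np.zeros(Xa.shape[1])
    for _ in range(n_epochs):                        # cycle through the training set
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:                   # misclassified (or on the hyperplane)
                w = w + rho * yi * xi                # "punishment": move w toward the correct side
                mistakes += 1
        if mistakes == 0:                            # a full error-free pass -> converged
            break
    return w
```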

10  The perceptron: it is a learning machine that learns its weights from the training vectors via the perceptron algorithm. The resulting network is called a perceptron or neuron.

11  Example: at some stage t the perceptron algorithm results in a weight estimate w(t); the corresponding hyperplane is wᵀ(t)x + w₀(t) = 0, and the update step uses learning rate ρ = 0.7.

12  Least Squares Methods. If the classes are linearly separable, the perceptron output can be driven to ±1. If the classes are NOT linearly separable, we compute the weights w₀, w₁, …, w_l so that the difference between the actual output of the classifier, wᵀx, and the desired output, e.g., y = ±1 depending on whether x ∈ ω₁ or x ∈ ω₂, is SMALL.

13  SMALL, in the mean square error sense, means choosing ŵ so that the cost function J(w) = E[|y − xᵀw|²] is minimized: ŵ = arg min_w J(w).

14  Minimizing J(w) with respect to w results in ∂J(w)/∂w = 0 ⇒ R_x ŵ = E[xy] ⇒ ŵ = R_x⁻¹ E[xy], where R_x = E[xxᵀ] is the autocorrelation matrix and E[xy] is the crosscorrelation vector.
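A small sketch of this MSE solution, with the expectations replaced by sample averages; the function name and the use of ±1 desired responses are assumptions of this write-up.

```python
import numpy as np

def mse_classifier(X, y):
    """Estimate w_hat = R_x^{-1} E[x y] from training data.

    X : (N, l) matrix of training vectors (rows), y : (N,) desired responses, e.g. +/-1.
    """
    N = X.shape[0]
    R_x = X.T @ X / N               # sample estimate of the autocorrelation matrix E[x x^T]
    p = X.T @ y / N                 # sample estimate of the crosscorrelation vector E[x y]
    return np.linalg.solve(R_x, p)  # solve R_x w = E[x y]
```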

15  Multi-class generalization. The goal is to compute M linear discriminant functions g_i(x) = w_iᵀx, i = 1, …, M, according to the MSE criterion. Adopt as desired responses y_i: y_i = 1 if x ∈ ω_i and y_i = 0 otherwise. Let y = [y₁, …, y_M]ᵀ and collect the weight vectors into the matrix W = [w₁, …, w_M].

16  The goal is to compute Ŵ = arg min_W E[‖y − Wᵀx‖²] = arg min_W Σ_i E[(y_i − w_iᵀx)²]. The above is equivalent to M separate MSE minimization problems. That is: design each w_i so that the desired output of g_i(x) = w_iᵀx is 1 for x ∈ ω_i and 0 for any other class. Remark: the MSE criterion belongs to a more general class of cost functions with the following important property: the value of g_i(x) is an estimate, in the MSE sense, of the a-posteriori probability P(ω_i|x), provided that the desired responses used during training are 1 for x ∈ ω_i and 0 otherwise.
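A sketch of the M parallel MSE designs using one-hot desired responses; solving them jointly with a least-squares routine and appending a constant 1 for the bias are implementation conveniences assumed here.

```python
import numpy as np

def mse_multiclass(X, labels, M):
    """Design W = [w_1, ..., w_M] so that w_i^T x is close to 1 for x in class i and 0 otherwise."""
    N = X.shape[0]
    Xa = np.hstack([X, np.ones((N, 1))])          # augment with a bias term
    Y = np.zeros((N, M))
    Y[np.arange(N), labels] = 1.0                 # one-hot desired responses y_i
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)    # the M MSE problems, solved jointly
    return W

# Classification: assign x to the class with the largest discriminant value,
# label = (np.append(x, 1.0) @ W).argmax()
```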

17  Mean square error regression: let x, y be jointly distributed random vectors with a joint pdf p(x, y). The goal: given the value of x, estimate the value of y. In the pattern-recognition framework, given x one wants to estimate the respective class label. The MSE estimate ŷ of y, given x, is defined as ŷ = arg min_{ŷ} E[‖y − ŷ‖² | x]. It turns out that ŷ = E[y|x]. The above is known as the regression of y given x and it is, in general, a non-linear function of x. If p(x, y) is Gaussian, the MSE regressor is linear.

18  SMALL in the sum of error squares sense means choosing ŵ so that J(w) = Σ_{i=1}^{N} (y_i − x_iᵀw)² is minimized, where (y_i, x_i), i = 1, …, N, are the training pairs: the input x_i and its corresponding class label y_i (±1).

19  Pseudoinverse Matrix. Define the N×l data matrix X = [x₁, x₂, …, x_N]ᵀ, whose rows are the training vectors, and the vector of desired responses y = [y₁, y₂, …, y_N]ᵀ, so that the sum of error squares can be written as J(w) = ‖y − Xw‖².

20  Thus, setting ∂J(w)/∂w = 0 gives XᵀXŵ = Xᵀy ⇒ ŵ = (XᵀX)⁻¹Xᵀy ≡ X⁺y, where X⁺ = (XᵀX)⁻¹Xᵀ is the pseudoinverse of X. Assume N = l, with X square and invertible. Then X⁺ = X⁻¹ and ŵ = X⁻¹y.

21  Assume N > l. Then, in general, there is no w that satisfies all N equations x_iᵀw = y_i simultaneously: the system Xw = y is overdetermined. The “solution” ŵ = X⁺y corresponds to the minimum sum of error squares.
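A minimal sketch of the sum-of-error-squares solution via the pseudoinverse; np.linalg.pinv covers both the square-invertible and the overdetermined (N > l) case.

```python
import numpy as np

# X : (N, l) matrix whose rows are the training vectors, y : (N,) desired responses (+1 / -1)
def least_squares_weights(X, y):
    return np.linalg.pinv(X) @ y    # w_hat = X^+ y = (X^T X)^{-1} X^T y when X^T X is invertible

# Equivalently: w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```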

22  Example: a numerical application of the least sum of error squares (pseudoinverse) solution.

24  The Bias-Variance Dilemma. A classifier g(x) is a learning machine that tries to predict the class label y of x. In practice, a finite data set D is used for its training; write the trained classifier as g(x; D). Observe that: for some training sets D the training may result in good estimates, for others the result may be worse. The average performance of the classifier can be tested against the MSE-optimal value E[y|x], in the mean squares sense, that is, via E_D[(g(x; D) − E[y|x])²], where E_D is the mean over all possible data sets D.

25  The above is written as: E_D[(g(x; D) − E[y|x])²] = (E_D[g(x; D)] − E[y|x])² + E_D[(g(x; D) − E_D[g(x; D)])²]. In the above, the first term is the contribution of the bias and the second term is the contribution of the variance. For a finite D, there is a trade-off between the two terms: increasing the bias reduces the variance and vice versa. This is known as the bias-variance dilemma. Using a complex model results in low bias but high variance, as one changes from one training set to another. Using a simple model results in high bias but low variance.
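A small simulation of the dilemma: many training sets D are drawn, a simple and a complex model are fitted to each, and the bias and variance terms are estimated at a test point. The regression function sin(x), the noise level and the polynomial model families are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                    # assumed "true" regression E[y|x]
x_test = 1.0                                  # point at which bias and variance are measured

def bias_variance(degree, n_sets=500, n_train=30, noise=0.3):
    preds = []
    for _ in range(n_sets):                   # a fresh training set D each time
        x = rng.uniform(0.0, np.pi, n_train)
        y = f(x) + noise * rng.standard_normal(n_train)
        coeffs = np.polyfit(x, y, degree)     # fit g(x; D)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x_test)) ** 2   # (E_D[g(x; D)] - E[y|x])^2
    var = preds.var()                         # E_D[(g(x; D) - E_D[g(x; D)])^2]
    return bias2, var

print("simple model (degree 0): ", bias_variance(0))   # high bias, low variance
print("complex model (degree 7):", bias_variance(7))   # low bias, higher variance
```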

26  LOGISTIC DISCRIMINATION. Consider an M-class task with classes ω₁, …, ω_M. In logistic discrimination, the logarithms of the posterior ratios are modeled via linear functions, i.e., ln(P(ω_i|x) / P(ω_M|x)) = w_{i,0} + w_iᵀx, i = 1, …, M−1. Taking into account that Σ_i P(ω_i|x) = 1, it can easily be shown that the above is equivalent to modeling the posterior probabilities as: P(ω_M|x) = 1 / (1 + Σ_{j=1}^{M−1} exp(w_{j,0} + w_jᵀx)) and P(ω_i|x) = exp(w_{i,0} + w_iᵀx) / (1 + Σ_{j=1}^{M−1} exp(w_{j,0} + w_jᵀx)), i = 1, …, M−1.

27  For the two-class case it turns out that P(ω₁|x) = exp(w₀ + wᵀx) / (1 + exp(w₀ + wᵀx)) and P(ω₂|x) = 1 / (1 + exp(w₀ + wᵀx)).

28  The unknown parameters w_{i,0}, w_i are usually estimated by maximum likelihood arguments. Logistic discrimination is a useful tool, since it allows linear modeling and at the same time ensures that the posterior probability estimates add up to one.
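A sketch of two-class logistic discrimination trained by maximizing the log-likelihood with plain gradient ascent; the step size, iteration count and the 0/1 target coding are illustrative choices (textbook treatments typically use Newton-type iterations instead).

```python
import numpy as np

def fit_logistic(X, t, lr=0.1, n_iter=5000):
    """Model P(omega_1 | x) = exp(w0 + w^T x) / (1 + exp(w0 + w^T x)).

    X : (N, l) training vectors, t : (N,) targets, 1 for omega_1 and 0 for omega_2.
    """
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])    # absorb w0 into the weight vector
    w = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xa @ w)))          # current estimates of P(omega_1 | x_i)
        w += lr * Xa.T @ (t - p) / len(t)            # gradient ascent on the log-likelihood
    return w

# Decision: assign x to omega_1 if 1 / (1 + exp(-(np.append(x, 1.0) @ w))) > 0.5
```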

29  SUPPORT VECTOR MACHINES. The goal: given two linearly separable classes, design the classifier that leaves the maximum margin from both classes.

30  Margin: each hyperplane is characterized by: its direction in space, i.e., w; its position in space, i.e., w₀. For EACH direction w, choose the hyperplane that leaves the SAME distance from the nearest points of each class. The margin is twice this distance.

31  The distance of a point x from a hyperplane is given by z = |g(x)| / ‖w‖. Scale w, w₀ so that at the nearest points from each class the discriminant function equals ±1, i.e., |g(x)| = 1. Thus the margin is given by 1/‖w‖ + 1/‖w‖ = 2/‖w‖. Also, the following is valid: wᵀx + w₀ ≥ 1 for every x ∈ ω₁ and wᵀx + w₀ ≤ −1 for every x ∈ ω₂.

32  SVM (linear) classifier. Minimize J(w) = ½‖w‖², subject to y_i(wᵀx_i + w₀) ≥ 1, i = 1, 2, …, N, where y_i = +1 for x_i ∈ ω₁ and y_i = −1 for x_i ∈ ω₂. The above is justified since, by minimizing ‖w‖, the margin 2/‖w‖ is maximized.

33  The above is a quadratic optimization task, subject to a set of linear inequality constraints. The Karush-Kuhn-Tucker conditions state that the minimizer satisfies:
(1) ∂L(w, w₀, λ)/∂w = 0
(2) ∂L(w, w₀, λ)/∂w₀ = 0
(3) λ_i ≥ 0, i = 1, …, N
(4) λ_i [y_i(wᵀx_i + w₀) − 1] = 0, i = 1, …, N
where L(w, w₀, λ) = ½‖w‖² − Σ_i λ_i [y_i(wᵀx_i + w₀) − 1] is the Lagrangian.

34  The solution: from the above, it turns out that w = Σ_{i=1}^{N} λ_i y_i x_i and Σ_{i=1}^{N} λ_i y_i = 0.

35  Remarks: the Lagrange multipliers can be either zero or positive. Thus, w = Σ_{i=1}^{N_s} λ_i y_i x_i, where N_s ≤ N is the number of vectors with positive Lagrange multipliers. From constraint (4) above, the vectors contributing to w satisfy y_i(wᵀx_i + w₀) = 1, i.e., they lie exactly on one of the two margin hyperplanes.

36  –These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier.
–Once w is computed, w₀ is determined from conditions (4).
–The optimal hyperplane classifier of a support vector machine is UNIQUE.
–Although the solution is unique, the resulting Lagrange multipliers need not be unique.

37  Dual Problem Formulation. The SVM formulation is a convex programming problem, with
–a convex cost function
–a convex region of feasible solutions.
Thus, its solution can be obtained from its dual problem, i.e.,
–maximize L(w, w₀, λ)
–subject to w = Σ_i λ_i y_i x_i, Σ_i λ_i y_i = 0, λ ≥ 0.

38  Combine the above to obtain:
–maximize Σ_i λ_i − ½ Σ_i Σ_j λ_i λ_j y_i y_j x_iᵀx_j
–subject to Σ_i λ_i y_i = 0, λ ≥ 0.
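For illustration, the dual above can be handed to a general-purpose constrained optimizer; the sketch below uses scipy.optimize.minimize (SLSQP) and recovers w and w₀ from the KKT conditions. The tolerance 1e-6 and the large box bound standing in for "no upper bound" are arbitrary choices of this write-up, and a dedicated QP or SMO solver would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, upper=1e6):
    """Maximize sum(lam) - 0.5 lam^T H lam  s.t.  sum(lam_i y_i) = 0, 0 <= lam_i <= upper."""
    N = X.shape[0]
    Yx = y[:, None] * X
    H = Yx @ Yx.T                                          # H_ij = y_i y_j x_i^T x_j
    fun = lambda lam: 0.5 * lam @ H @ lam - lam.sum()      # minimize the negated dual objective
    jac = lambda lam: H @ lam - np.ones(N)
    cons = {'type': 'eq', 'fun': lambda lam: lam @ y, 'jac': lambda lam: y}
    res = minimize(fun, np.zeros(N), jac=jac, method='SLSQP',
                   bounds=[(0.0, upper)] * N, constraints=[cons])
    lam = res.x
    w = Yx.T @ lam                                         # w = sum_i lam_i y_i x_i
    sv = lam > 1e-6                                        # support vectors: positive multipliers
    w0 = np.mean(y[sv] - X[sv] @ w)                        # from condition (4): y_i (w^T x_i + w0) = 1
    return w, w0, lam

# Classification: assign x to omega_1 if w @ x + w0 > 0, to omega_2 otherwise.
```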

39  Remarks: the support vectors enter the solution only via inner products x_iᵀx_j. Non-separable classes:

40  In this case, there is no hyperplane such that y_i(wᵀx_i + w₀) > 1 for all i. Recall that the margin is defined as twice the distance between the following two hyperplanes: wᵀx + w₀ = 1 and wᵀx + w₀ = −1.

41  The training vectors belong to one of three possible categories:
1) Vectors outside the band which are correctly classified, i.e., y_i(wᵀx_i + w₀) ≥ 1.
2) Vectors inside the band which are correctly classified, i.e., 0 ≤ y_i(wᵀx_i + w₀) < 1.
3) Vectors that are misclassified, i.e., y_i(wᵀx_i + w₀) < 0.

42  All three cases above can be represented in a unified way as y_i(wᵀx_i + w₀) ≥ 1 − ξ_i, with
1) ξ_i = 0,
2) 0 < ξ_i ≤ 1,
3) ξ_i > 1.
The variables ξ_i are known as slack variables.

43  The goal of the optimization is now two-fold:
–maximize the margin,
–minimize the number of patterns with ξ_i > 0.
One way to achieve this goal is via the cost J(w, w₀, ξ) = ½‖w‖² + C Σ_i I(ξ_i), where C is a constant and I(ξ_i) = 1 if ξ_i > 0 and 0 otherwise, which is not differentiable. In practice, we use an approximation; a popular choice is J(w, w₀, ξ) = ½‖w‖² + C Σ_i ξ_i. Following a similar procedure as before we obtain:

44  The KKT conditions for the non-separable case become:
∂L/∂w = 0 ⇒ w = Σ_i λ_i y_i x_i
∂L/∂w₀ = 0 ⇒ Σ_i λ_i y_i = 0
C − μ_i − λ_i = 0, i = 1, …, N
λ_i [y_i(wᵀx_i + w₀) − 1 + ξ_i] = 0, i = 1, …, N
μ_i ξ_i = 0, i = 1, …, N
λ_i ≥ 0, μ_i ≥ 0, i = 1, …, N
where the μ_i are the Lagrange multipliers associated with the constraints ξ_i ≥ 0.

45  The associated dual problem:
maximize Σ_i λ_i − ½ Σ_i Σ_j λ_i λ_j y_i y_j x_iᵀx_j
subject to 0 ≤ λ_i ≤ C, i = 1, …, N, and Σ_i λ_i y_i = 0.
Remarks: the only difference from the separable-class case is the presence of C as an upper bound on the λ_i in the constraints.
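Alternatively, the approximate cost ½‖w‖² + C Σ_i ξ_i from the previous slides can be minimized directly in the primal, since at the optimum ξ_i = max(0, 1 − y_i(wᵀx_i + w₀)) (the hinge loss). A sub-gradient descent sketch follows; the step size and iteration count are arbitrary choices of this write-up.

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, n_iter=5000):
    """Sub-gradient descent on J(w, w0) = 0.5 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w0))."""
    N, l = X.shape
    w, w0 = np.zeros(l), 0.0
    for _ in range(n_iter):
        viol = y * (X @ w + w0) < 1                        # samples with positive slack xi_i
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```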

46  Training the SVM: a major problem is the high computational cost. To this end, decomposition techniques are used. The rationale behind them is as follows:
–Start with an arbitrary data subset (working set) that can fit in memory. Perform optimization, via a general-purpose optimizer.
–The resulting support vectors remain in the working set, while the others are replaced by new ones (outside the set) that severely violate the KKT conditions.
–Repeat the procedure.
The above procedure guarantees that the cost function decreases. Platt's SMO algorithm chooses a working set of just two samples, so that the optimization step can be solved analytically.
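In practice one rarely codes the decomposition loop by hand: libraries such as scikit-learn wrap LIBSVM, whose solver is of the SMO type. A brief usage sketch with placeholder data:

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.5, 0.5], [3.0, 1.5]])   # placeholder data
y_train = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1.0)        # linear soft-margin SVM, solved by an SMO-type algorithm
clf.fit(X_train, y_train)
print(clf.support_vectors_)              # the support vectors selected by the solver
print(clf.coef_, clf.intercept_)         # w and w0 for the linear kernel
```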

47  Multi-class generalization: although theoretical generalizations exist, the most popular approach in practice is to treat the problem as M two-class problems (one against all). Example: observe the effect of different values of C on the resulting classifier in the case of non-separable classes (larger C penalizes margin violations more heavily).
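A sketch of the one-against-all scheme: M binary SVMs are trained, the i-th one separating ω_i from all remaining classes, and a test vector is assigned to the class whose discriminant value is largest. The helper names and the use of scikit-learn's SVC are assumptions of this write-up.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, labels, M, C=1.0):
    """Train M linear SVMs; the i-th machine uses +1 for class i and -1 for all other classes."""
    machines = []
    for i in range(M):
        y_i = np.where(labels == i, 1, -1)
        machines.append(SVC(kernel='linear', C=C).fit(X, y_i))
    return machines

def classify_one_vs_all(machines, X):
    scores = np.column_stack([m.decision_function(X) for m in machines])   # w_i^T x + w_{0,i}
    return scores.argmax(axis=1)          # pick the class with the largest discriminant value
```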