1 Support Vector Machines (Czech: Podpůrné vektorové stroje). Babak Mahdian, June 2009

2 Most of the slides are taken from the presentations provided by: 1. Chih-Jen Lin (National Taiwan University) 2. Colin Campbell (Bristol University) 3. Andrew W. Moore (Carnegie Mellon University) 4. Jan Flusser (AS CR, ÚTIA)

3 Outline 1. SVMs for binary classification. 2. Soft margins and multi-class classification.

4 A classifier derived from statistical learning theory by Vladimir Vapnik and co-workers. Currently SVMs are widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.

5 Preliminaries: Consider a binary classification problem: the input vectors are x_i and y_i ∈ {+1, -1} are the targets or labels. The index i labels the pattern pairs (i = 1, ..., m). The x_i define a space of labelled points called the input space.

6 Binary classification:

7 Let us separate the input data with a hyperplane.

10 Which of the linear separators is optimal?

11 SVMs search for the separating hyperplane that maximizes the margin between the two data sets, i.e. the hyperplane with the maximum distance from the nearest points on both sides.

12 Such a hyperplane exhibits the best generalization. It creates a "safe zone". The closest points are called support vectors (they directly support where the hyperplane should be). Any change in the support vectors shifts the hyperplane; changes in the non-support vectors do not.

13 In an arbitrary-dimensional space a separating hyperplane can be written as w · x + b = 0, where b is the bias, w is the weight vector, and x is a data point. Thus, we will consider a decision function of the form f(x) = sign(w · x + b).

14 [Figure: the hyperplane w · x + b = 0 separating the region w · x + b > 0 from the region w · x + b < 0.]

15 Note that the sign of (w · x + b) is invariant under a positive rescaling w → κw, b → κb with κ > 0. Out of the infinitely many equivalent hyperplane equations, we implicitly fix the scale by requiring |w · x_i + b| = 1 for the support vectors.

16 Now w · x_i + b equals +1 or -1 for the support vectors, and is > +1 or < -1 for every other vector.

17 Now we can easily show that w · (x_1 - x_2) = 2, where x_1 and x_2 are support vectors on opposite sides of the separating hyperplane (w · x_1 + b = +1 and w · x_2 + b = -1).

18 The margin is given by the projection of the vector (x_1 - x_2) onto the unit normal of the hyperplane, w/||w||, from which we deduce that the width of the margin is equal to 2/||w||.
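
To make the geometry concrete, here is a tiny numpy sketch (the numbers are made up, not from the slides) verifying that the projection of (x_1 - x_2) onto w/||w|| equals 2/||w||:

```python
import numpy as np

# Hypothetical hyperplane w.x + b = 0 with w = (1, 0), b = 0, and two
# support vectors on opposite sides (w.x1 + b = +1, w.x2 + b = -1).
w = np.array([1.0, 0.0])
b = 0.0
x1 = np.array([1.0, 0.5])    # satisfies w.x1 + b = +1
x2 = np.array([-1.0, 2.0])   # satisfies w.x2 + b = -1

# Margin = projection of (x1 - x2) onto the unit normal w/||w||.
margin = np.dot(x1 - x2, w / np.linalg.norm(w))
print(margin, 2 / np.linalg.norm(w))   # both print 2.0 for this choice of w
```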

19 Lagrange multipliers: In optimization, the method of Lagrange multipliers provides a strategy for finding the maximum or minimum of a function subject to constraints. For example: maximize f(x, y) subject to g(x, y) = c. We introduce a new variable λ, called a Lagrange multiplier, and study the Lagrange function Λ(x, y, λ) = f(x, y) + λ(g(x, y) - c). If (x, y) is a maximum of the original constrained problem, then there exists a λ such that (x, y, λ) is a stationary point of the Lagrange function.

20 Example 1/2: Suppose you wish to maximize f(x, y) = x + y subject to the constraint x² + y² = 1.

21 Example 2/2: Formally, set g(x, y) - c = x² + y² - 1, and Λ(x, y, λ) = f(x, y) + λ(g(x, y) - c) = x + y + λ(x² + y² - 1). Setting the partial derivatives of Λ to zero yields the system of equations 1 + 2λx = 0, 1 + 2λy = 0, x² + y² - 1 = 0. Solving it gives the stationary points (√2/2, √2/2) with λ = -√2/2 and (-√2/2, -√2/2) with λ = √2/2, so the maximum is f = √2, attained at (√2/2, √2/2).
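
A quick way to check this worked example is to solve the stationarity equations symbolically; the sketch below assumes the sympy package, which is not part of the original slides:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + y
g = x**2 + y**2 - 1            # constraint g(x, y) - c = 0
L = f + lam * g                # Lagrange function

# Stationary points: all partial derivatives of L vanish.
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)
for s in sols:
    print(s, 'f =', f.subs(s))
# Two stationary points (±√2/2, ±√2/2); the maximum value of f is √2.
```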

22 Maximization of the margin is thus equivalent to minimization of the functional (1/2)||w||², subject to the constraints y_i (w · x_i + b) ≥ 1 for all i. Please note that y_i ∈ {+1, -1}. We need to optimize a quadratic function subject to linear constraints; quadratic optimization problems are a well-known class of problems. The Karush-Kuhn-Tucker conditions are satisfied, so we can apply Lagrange multipliers.

23 Thus the task is to find an optimum of the primal objective function (consisting of the function to be minimized and the constraints): L_P = (1/2)||w||² - Σ_i α_i [y_i (w · x_i + b) - 1], with α_i ≥ 0. Solving the saddle-point equation ∂L_P/∂b = 0 gives the constraint Σ_i α_i y_i = 0,

24 and ∂L_P/∂w = 0 gives w = Σ_i α_i y_i x_i, which, when substituted back into L_P, shows that we should maximize the functional (the Wolfe dual): W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j). We need to optimize (maximize) over the α_i only (the rest is data), and the objective is quadratic, so this is a quadratic programming problem. The primal was also quadratic (in w). From optimization theory: every primal problem has a dual, and here the solution of the primal is the same as the solution of the dual.

25 subject to the constraints α_i ≥ 0 (the α_i are the Lagrange multipliers) and Σ_i α_i y_i = 0. Important note: there is only one solution (which is our hyperplane); different starting points give the same solution (unlike neural networks, which can converge to different local optima).

26 After finding the alphas, the decision function is f(z) = sign( Σ_i α_i y_i (x_i · z) + b ), where z is a new data point.

27 [Figure: example multiplier values α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0.] After finding the alphas, some are 0 (non-support vectors) and some are not (support vectors); as noted above, moving the non-support vectors does not shift the hyperplane.

28 Some alphas are zero, some are non-zero, and sometimes some are very large. There are two possible reasons for a very large alpha: 1. a correct but unusual data point, 2. an outlier. Very large alphas have a big influence on the hyperplane's position.

29 Example (find the SVM): [the four labelled training points were shown in a figure]. We see that the optimal hyperplane is g(x): x_1 = 0. All four points are support vectors. The margin size is 2.

30 Example (find the SVM): The equation of a hyperplane: w_1 x_1 + w_2 x_2 + w_0 = 0. We need to minimize (1/2) w · w subject to the constraints y_i (w^T x_i + w_0) ≥ 1 for i = 1, ..., 4. Substituting the four training points gives four explicit constraints, one per point.

31 Example (find the SVM): Form the Lagrangian of this problem and set its partial derivatives with respect to w_1, w_2, w_0 and the four Lagrange multipliers to zero.

32 Example (find the SVM): This yields 7 equations in 7 unknowns (w_1, w_2, w_0 and the four multipliers). Solving them admits more than one solution for the multipliers, but all of them end up with w_1 = 1, w_2 = 0, w_0 = 0.

33 Recap: 1. The classifier is a separating hyperplane. 2. The most "important" training points are the support vectors; they define the hyperplane. 3. Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. those with non-zero Lagrange multipliers α_i.
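
As a concrete illustration of point 3, here is a minimal sketch (not from the slides) that solves the hard-margin dual with the cvxopt quadratic-programming package on made-up 2-D data and reads off the support vectors, w and b; any general QP solver could be substituted:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Dual: maximize sum_i a_i - 1/2 a^T Q a with Q_ij = y_i y_j (x_i . x_j),
# subject to a_i >= 0 and sum_i a_i y_i = 0.  cvxopt minimizes, so we negate.
Q = (y[:, None] * X) @ (y[:, None] * X).T
P, q = matrix(Q), matrix(-np.ones(m))
G, h = matrix(-np.eye(m)), matrix(np.zeros(m))   # encodes -a_i <= 0
A, b = matrix(y.reshape(1, -1)), matrix(0.0)     # encodes y^T a = 0

solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h, A, b)
alpha = np.array([sol['x'][i] for i in range(m)])

sv = alpha > 1e-6                                # support vectors: non-zero alphas
w = ((alpha * y)[:, None] * X).sum(axis=0)       # w = sum_i a_i y_i x_i
b0 = float(np.mean(y[sv] - X[sv] @ w))           # from y_k (w . x_k + b) = 1
print('support vectors:', np.where(sv)[0], ' w =', w, ' b =', b0)
```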

34 So far we don't know how to handle non-separable datasets. Rather than fitting nonlinear curves to the data, the SVM handles this by using a kernel function to map the data into a different space, where a hyperplane can be used to do the separation.

35 The kernel function may transform the data into a higher dimensional space to make it possible to perform the separation.

36 For the dual objective function we notice that the data points x_i only appear inside an inner product. To get a better representation of the data we can therefore map the data points into an alternative, higher-dimensional space through the replacement x_i · x_j → φ(x_i) · φ(x_j), i.e. we have used a mapping x → φ(x). This higher-dimensional space must be a Hilbert space.

37 The function K(x_i, x_j) = φ(x_i) · φ(x_j) will be called a kernel. The kernel function is a function that is equivalent to an inner product in some feature space.

38 Examples of kernels: Linear: K(x_i, x_j) = x_i^T x_j. Polynomial of degree p: K(x_i, x_j) = (1 + x_i^T x_j)^p. Gaussian (radial basis function): K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)).

39 The kernel function implicitly maps the data to a high-dimensional space (without the need to compute each φ(x) explicitly). Example: 2-dimensional vectors x = [x_1, x_2]; let K(x_i, x_j) = (1 + x_i^T x_j)². We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j): K(x_i, x_j) = (1 + x_i^T x_j)² = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2 = [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]^T [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2] = φ(x_i)^T φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2].
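
This identity is easy to verify numerically; a small numpy sketch (the test vectors are made up):

```python
import numpy as np

def poly_kernel(u, v):
    # Degree-2 polynomial kernel from the slide.
    return (1.0 + u @ v) ** 2

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D.
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

u = np.array([0.3, -1.2])    # arbitrary test vectors
v = np.array([2.0, 0.5])
print(poly_kernel(u, v), phi(u) @ phi(v))   # the two numbers coincide
```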

40 Which functions are kernels? Kernel functions should satisfy Mercer's condition: for any g(x) for which ∫ g(x)² dx is finite, it must be the case that ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0.

41 A simple criterion is that the kernel should be positive semi-definite. If a kernel is positive semi-definite, i.e. Σ_i Σ_j c_i c_j K(x_i, x_j) ≥ 0, where {c_1, ..., c_n} are real numbers, then there exists a function φ(x) defining an inner product of possibly higher dimension, i.e. K(x, y) = φ(x) · φ(y).
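
In practice one can sanity-check a kernel by building its Gram matrix on some points and inspecting the eigenvalues, which should all be non-negative; a small numpy sketch with an RBF kernel (the points are random, not from the slides):

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

# Gram matrix of the RBF kernel on random points; positive semi-definiteness
# means all eigenvalues are >= 0 (up to floating-point rounding).
rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 2))
K = np.array([[rbf_kernel(a, b) for b in pts] for a in pts])
print(np.linalg.eigvalsh(K).min())   # close to zero, not meaningfully negative
```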

42 Thus the following steps are used to train an SVM: 1. Choose a kernel function K(x_i, x_j). 2. Maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to α_i ≥ 0 and Σ_i α_i y_i = 0.

43 3. The bias b is found from a support vector x_k: b = y_k - Σ_i α_i y_i K(x_i, x_k) (in practice averaged over all support vectors). 4. The optimal α_i go into the decision function f(z) = sign( Σ_i α_i y_i K(x_i, z) + b ).
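
A minimal sketch of steps 3 and 4, assuming the multipliers alpha are already available (for example from a QP solver as sketched earlier); svm_bias and svm_decide are hypothetical helper names, not library functions:

```python
import numpy as np

def svm_bias(alpha, X, y, kernel, tol=1e-6):
    # b = y_k - sum_i alpha_i y_i K(x_i, x_k), averaged over the support vectors.
    sv = np.where(alpha > tol)[0]
    return float(np.mean([y[k] - sum(alpha[i] * y[i] * kernel(X[i], X[k]) for i in sv)
                          for k in sv]))

def svm_decide(z, alpha, X, y, b, kernel):
    # Decision function f(z) = sign( sum_i alpha_i y_i K(x_i, z) + b ).
    return np.sign(sum(alpha[i] * y[i] * kernel(X[i], z) for i in range(len(y))) + b)

linear = lambda u, v: float(u @ v)   # e.g. plug in the linear kernel
```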

44 Choosing the Kernel Function and its parameters. This is probably the trickiest part of using an SVM. Many principles have been proposed. In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications. It has been said that for text classification a linear kernel is the best choice, because of the high feature dimension.
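
As one way to carry out such an initial try, the sketch below uses scikit-learn (which wraps LIBSVM; not mentioned on the slides) to compare a linear, a polynomial and an RBF kernel by cross-validation on synthetic data:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic data just to compare kernels; in practice use your own features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for kernel in ('linear', 'poly', 'rbf'):
    clf = SVC(kernel=kernel, degree=3, gamma='scale', C=1.0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(kernel, round(score, 3))
```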

45 Multi-class problems. The SVM is constructed for binary classification, but many problems involve multi-class classification. A number of schemes have been outlined. One of the simplest is to use a directed acyclic graph (DAG), with the learning task reduced to binary classification at each node:

46 We see that this approach is good for a small number of classes (for example, 7). What about a large number of classes (for example, 100)? One way is to use the "one against all" approach.
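
A hedged sketch of both ideas with scikit-learn (not mentioned on the slides): the built-in SVC handles multi-class problems with a one-vs-one scheme internally, while OneVsRestClassifier implements the "one against all" scheme:

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import load_digits

# 10-class handwritten-digit problem as a stand-in multi-class task.
X, y = load_digits(return_X_y=True)

ovo = SVC(kernel='rbf', gamma='scale').fit(X, y)                        # one-vs-one
ova = OneVsRestClassifier(SVC(kernel='rbf', gamma='scale')).fit(X, y)   # one-vs-rest
print(ovo.score(X, y), ova.score(X, y))
```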

47 Soft margins. Most real-life datasets contain noise, and an SVM can fit this noise, leading to poor generalization (the problem of overfitting). Instead, we allow some misclassifications. This can readily be viewed as a relaxation of the hard-margin constraint (each violating point will be assigned a slack variable ξ_i, introduced below).

48 The effect of outliers and noise can be reduced by introducing a soft margin. Two schemes are commonly used. L_1 error norm: we bound the alphas (the Lagrange multipliers) from above; not a bad idea, since abnormal points have large alpha values. L_2 error norm: when constructing the kernel, a small element is added to its diagonal.

49 [Figure: effect of the L_1 soft margin.]

50 For the L_1 error norm (prior to introducing kernels) we introduce a positive slack variable ξ_i, i.e. we actually allow some points to lie within the margin bound: y_i (w · x_i + b) ≥ 1 - ξ_i, with ξ_i ≥ 0. So we minimize the sum of the errors and, at the same time, ||w|| (the condition that maximizes the margin): minimize (1/2)||w||² + C Σ_i ξ_i. Thus C is a trade-off: if C is infinite, we recover the normal hard margin; if C is finite, some of the slack variables will be non-zero.
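
The role of C can be seen empirically; a small scikit-learn sketch on noisy synthetic data (typically, a smaller C tolerates more margin violations and yields more support vectors, while a large C approaches the hard margin):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping (noisy) blobs; vary C and watch the number of support vectors.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, 'support vectors:', clf.n_support_.sum())
```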

52 This is readily formulated as a primal objective function: L = (1/2)||w||² + C Σ_i ξ_i - Σ_i α_i [y_i (w · x_i + b) - 1 + ξ_i] - Σ_i r_i ξ_i, with Lagrange multipliers α_i ≥ 0 and r_i ≥ 0.

53 The derivatives with respect to w, b and ξ_i give: w = Σ_i α_i y_i x_i, Σ_i α_i y_i = 0, and C - α_i - r_i = 0. Resubstituting back into the primal objective function, we obtain the same dual objective function as before. However, r_i ≥ 0 and C - α_i - r_i = 0, hence α_i ≤ C, and the constraint 0 ≤ α_i is replaced by 0 ≤ α_i ≤ C.

54 Patterns with values 0 < α_i < C will be referred to as non-bound, and those with α_i = 0 or α_i = C will be said to be at bound. There is some theory about the optimal value of C, but in practice the optimal value of C must be found by experimentation using a validation set.
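
A common way to do this experimentation is a cross-validated grid search; a sketch with scikit-learn on synthetic data (the candidate values are arbitrary choices, not recommendations from the slides):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=20, random_state=1)

# Grid of candidate C (and gamma) values; 5-fold cross-validation picks the best.
grid = GridSearchCV(SVC(kernel='rbf'),
                    {'C': [0.1, 1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```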

55 In an alternative approach, ν-SVM, it can be shown that the solutions for an L_1 error norm are the same as those obtained from maximizing a reformulated dual in which the parameter ν replaces C, and ν lies in the range 0 to 1. This gives a better and more transparent interpretation of soft margins than C: ν is an upper bound on the fraction of training errors and also a lower bound on the fraction of points which are support vectors.
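
ν-SVM is available, for example, as NuSVC in scikit-learn (an assumption of this sketch, not something the slides mention); the reported fraction of support vectors should be at least ν:

```python
from sklearn.svm import NuSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# nu in (0, 1]: upper bound on the fraction of training errors and
# lower bound on the fraction of support vectors.
clf = NuSVC(nu=0.1, kernel='rbf', gamma='scale').fit(X, y)
print('fraction of support vectors:', clf.n_support_.sum() / len(y))
```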

56 For many real-life datasets there is an imbalance between the amount of data in the different classes, or the significance of the data in the two classes can be quite different. The relative balance between the detection rates for the different classes can easily be shifted by introducing asymmetric soft-margin parameters. Thus, for binary classification with an L_1 error norm, we use a bound α_i ≤ C_+ for y_i = +1 and α_i ≤ C_- for y_i = -1, etc. This allows control over the false-positive rate: we do a favour for one class.
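
In scikit-learn this idea corresponds to per-class penalty weights (class_weight), which scale C differently for the two classes; a sketch on an artificially imbalanced dataset (the weights are illustrative, not from the slides):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix

# Imbalanced data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Asymmetric soft margins: a larger penalty for errors on the rare class.
plain = SVC(kernel='rbf', gamma='scale').fit(X, y)
weighted = SVC(kernel='rbf', gamma='scale', class_weight={0: 1, 1: 5}).fit(X, y)
print(confusion_matrix(y, plain.predict(X)))
print(confusion_matrix(y, weighted.predict(X)))
```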

57 Some SVM software: LIBSVM, SVM Light, TinySVM, WEKA.

58 Conclusion. SVMs are currently among the better performers for a number of classification tasks. SVM techniques have been extended to a number of tasks, such as regression. Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a trial-and-error manner (a lengthy series of experiments in which various parameters are tested).