
1 Linear Classifiers / SVM. Soongsil University, Dept. of Industrial and Information Systems Engineering, Intelligence Systems Lab.

2 Sample (Linear Classifiers)

3 Feature (Linear Classifiers)

4 Feature (Linear Classifiers)

5 Training Set (Linear Classifiers)

6 How to Classify Them Using a Computer? (Linear Classifiers)

7 How to Classify Them Using a Computer? (Linear Classifiers): a decision rule, or an equivalent alternative form.

8 Linear Classification (Linear Classifiers)

9 Optimal Hyperplane, SVMs (Support Vector Machines). Figure: misclassified points.

10 Which Separating Hyperplane to Use? (axes: Var 1, Var 2)

11 Optimal Hyperplane, SVMs (Support Vector Machines). Legend: one class denotes +1, the other denotes -1. Any of these would be fine... but which is best?

12 Support Vector Machines. Three main ideas: (1) Define what an optimal hyperplane is (in a way that can be identified computationally efficiently): maximize the margin. (2) Extend that definition to non-linearly separable problems: add a penalty term for misclassifications. (3) Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space.

13 Optimal Hyperplane (Support Vector Machines). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

14 Canonical Hyperplane (Support Vector Machines). The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM).

15 Normal Vector (Support Vector Machines)

16 Maximizing the Margin (axes: Var 1, Var 2; margin width shown in the figure). IDEA 1: Select the separating hyperplane that maximizes the margin!

17 Support Vectors (axes: Var 1, Var 2; the figure marks the margin width and the support vectors).

18 Margin width (figure labels: B_1, b_11, b_12, d). The inner product with the vector w has the following geometric meaning: projecting a vector onto w/||w|| gives its distance along the direction normal to the hyperplane, which is how the margin width d is computed.

19 Setting Up the Optimization Problem (axes: Var 1, Var 2). We can choose a scale and unit for the data so that k = 1. The problem then becomes:

20 Setting Up the Optimization Problem (Support Vector Machines; axes: Var 1, Var 2). With k = 1 the width of the margin is 2/||w||, so the problem is: maximize 2/||w|| subject to w^T x + b >= 1 for class 1 points and w^T x + b <= -1 for class 2 points.

21 Setting Up the Optimization Problem (Support Vector Machines). If class 1 corresponds to +1 and class 2 corresponds to -1, the two constraints can be rewritten as y_i(w^T x_i + b) >= 1 for all i. So the problem becomes: minimize ||w||^2 = w^T w subject to those constraints.

22 Linear, Hard-Margin SVM Formulation. Find w, b that solve: minimize (1/2)||w||^2 subject to y_i(w^T x_i + b) >= 1 for all i. The problem is convex, so there is a unique global minimum value (when feasible). There is also a unique minimizer, i.e. the weight vector w and bias b that attain that minimum. The problem is not solvable if the data is not linearly separable. It is a quadratic program, and very efficient computationally with modern constrained-optimization engines (they handle thousands of constraints and training instances).
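
As a concrete illustration of the quadratic-programming view, here is a minimal sketch that solves this primal hard-margin QP with the cvxopt solver; the solver choice, the helper name, and the tiny ridge on the b term are my assumptions, not something specified on the slides.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve the primal hard-margin QP: min 1/2 ||w||^2  s.t.  y_i (w.x_i + b) >= 1."""
    n, d = X.shape
    y = y.astype(float)
    # Optimization variable z = [w (d entries), b (1 entry)].
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)            # quadratic term acts on w only
    P[d, d] = 1e-8                   # tiny ridge on b for numerical stability (my assumption)
    q = np.zeros(d + 1)              # no linear term
    # y_i (w.x_i + b) >= 1  rewritten as  -y_i (w.x_i + b) <= -1   (G z <= h form)
    G = -np.hstack([y[:, None] * X, y[:, None]])
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]               # (w, b)

# Tiny linearly separable example.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = hard_margin_svm(X, y)
print(w, b, np.sign(X @ w + b))      # the signs should match y
```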

23 Finding the Decision Boundary. Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i. The decision boundary should classify all points correctly: y_i(w^T x_i + b) >= 1 for all i. The decision boundary can be found by solving the following constrained optimization problem: minimize (1/2)||w||^2 subject to y_i(w^T x_i + b) >= 1, i = 1, ..., n. The Lagrangian of this optimization problem is L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [y_i(w^T x_i + b) - 1], with α_i >= 0.

24 Lagrangian of SVM optimization problem

25 Lagrangian of SVM optimization problem

26 Lagrangian of SVM optimization problem

27 Lagrangian of SVM optimization problem

28 Lagrangian of SVM optimization problem

29 Lagrangian of SVM optimization problem

30 Lagrangian of SVM optimization problem. Substituting the following results, obtained by setting the partial derivatives of the Lagrangian to zero, back into the Lagrangian and simplifying: ∂L/∂w = 0 gives w = Σ_i α_i y_i x_i, and ∂L/∂b = 0 gives Σ_i α_i y_i = 0.
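
For reference, a plain reconstruction of the derivation that the slide images walk through, using the standard hard-margin Lagrangian from slide 23:

```latex
\begin{aligned}
L(\mathbf{w}, b, \boldsymbol{\alpha})
  &= \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
     - \sum_{i=1}^{n} \alpha_i \bigl[\, y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 \,\bigr],
  \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial \mathbf{w}} = 0
  &\;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i,
  \qquad
\frac{\partial L}{\partial b} = 0
  \;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
\text{substituting back:}\quad
W(\boldsymbol{\alpha})
  &= \sum_i \alpha_i
     - \tfrac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^{\top}\mathbf{x}_j .
\end{aligned}
```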

31 Remember the Dual Problem!! Two functions are built from the Lagrangian function L(x, λ): first find the value of x that minimizes L(x, λ) (this defines the dual function), then find the λ that corresponds to the maximum of that function.

32 The Dual Problem. By setting the derivative of the Lagrangian to zero, the optimization problem can be rewritten in terms of the α_i (the dual problem): maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to α_i >= 0 and Σ_i α_i y_i = 0. This is a quadratic programming (QP) problem; a global maximum of the α_i can always be found. w can be recovered by w = Σ_i α_i y_i x_i.

33 The Dual Problem. By setting the derivative of the Lagrangian to zero, the optimization problem can be rewritten in terms of the α_i (the dual problem): maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to α_i >= 0 and Σ_i α_i y_i = 0. This is a quadratic programming (QP) problem; a global maximum of the α_i can always be found, and w can be recovered by w = Σ_i α_i y_i x_i. Note: when the number of training examples is very large, SVM training can become very slow, because the number of dual parameters α can become very large.
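
A minimal sketch of solving this dual with the cvxopt QP solver (the solver, the helper name, and the small diagonal jitter are my assumptions; the slides do not prescribe an implementation):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """Hard-margin dual:  max  sum(a) - 1/2 sum_ij a_i a_j y_i y_j x_i.x_j
       s.t.  a_i >= 0,  sum_i a_i y_i = 0   (written as a minimization for the QP solver)."""
    n = X.shape[0]
    y = y.astype(float)
    K = X @ X.T                                           # inner products x_i.x_j
    P = matrix(np.outer(y, y) * K + 1e-10 * np.eye(n))    # jitter keeps the QP well-posed
    q = matrix(-np.ones(n))                               # sign flip: max -> min
    G, h = matrix(-np.eye(n)), matrix(np.zeros(n))        # a_i >= 0
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)          # sum_i a_i y_i = 0
    a = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = (a * y) @ X                                       # w = sum_i a_i y_i x_i
    sv = a > 1e-6                                         # non-zero multipliers = support vectors
    bias = float(np.mean(y[sv] - X[sv] @ w))              # bias from the support vectors
    return a, w, bias
```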

34 34  6 =1.4 A Geometrical Interpretation Class 1 Class 2  1 =0.8  2 =0  3 =0  4 =0  5 =0  7 =0  8 =0.6  9 =0  10 =0

35

36 Characteristics of the Solution. The KKT conditions indicate that many of the α_i are zero: w is a linear combination of a small number of data points. The x_i with non-zero α_i are called support vectors (SV), and the decision boundary is determined only by the SVs. Let t_j (j = 1, ..., s) be the indices of the s support vectors; we can write w = Σ_j α_{t_j} y_{t_j} x_{t_j}. For testing with a new data point z, compute f(z) = Σ_j α_{t_j} y_{t_j} (x_{t_j}^T z) + b and classify z as class 1 if the sum is positive, class 2 otherwise.
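
A short sketch of that test rule; alpha, y_sv, X_sv and b stand for the multipliers, labels, support vectors and bias produced by training (hypothetical variable names):

```python
import numpy as np

def predict(z, alpha, y_sv, X_sv, b):
    """f(z) = sum_j alpha_j y_j (x_j . z) + b ; class +1 if f(z) > 0, else class -1."""
    f = np.sum(alpha * y_sv * (X_sv @ z)) + b
    return 1 if f > 0 else -1
```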

37 Support Vector Machines (recap). Three main ideas: (1) Define what an optimal hyperplane is (in a way that can be identified computationally efficiently): maximize the margin. (2) Extend that definition to non-linearly separable problems: add a penalty term for misclassifications. (3) Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space.

38 Non-Linearly Separable Data (axes: Var 1, Var 2). What if the training set is not linearly separable? Allow some instances to fall within the margin, but penalize them: introduce slack variables ξ_i.

39 Formulating the Optimization Problem (axes: Var 1, Var 2). The constraint becomes y_i(w^T x_i + b) >= 1 - ξ_i with ξ_i >= 0. The objective function penalizes misclassified instances and those within the margin: minimize (1/2)||w||^2 + C Σ_i ξ_i. C trades off margin width against misclassifications; it is chosen by the user, and a larger C gives a higher penalty to errors.

40 Soft Margin Hyperplane. The new constraints are y_i(w^T x_i + b) >= 1 - ξ_i with ξ_i >= 0. The ξ_i are "slack variables" in the optimization: when Σ_i ξ_i is minimized, each slack equals ξ_i = max(0, 1 - y_i(w^T x_i + b)), so ξ_i = 0 if there is no error for x_i, and Σ_i ξ_i is an upper bound on the number of training errors. The optimization problem becomes: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to the constraints above.

41 Soft Margin Classification (Support Vector Machines). Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft. We need to minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i(w^T x_i + b) >= 1 - ξ_i, ξ_i >= 0.

42 Linear, Soft-Margin SVMs. The algorithm tries to keep the ξ_i at zero while maximizing the margin. Notice that the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes. Other formulations use ξ_i^2 instead. As C → ∞, we get closer to the hard-margin solution.
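
A small sketch of the effect of C, using scikit-learn's linear-kernel SVC on synthetic overlapping data (the library, the data, and the values of C are my choices, not the slides'):

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping blobs: not perfectly separable, so slack is needed.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):   # small C: wide margin, many SVs; large C: approaches hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, round(clf.score(X, y), 3))
```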

43 Robustness of Soft vs Hard Margin SVMs (two panels, axes Var 1 and Var 2: a soft-margin SVM with slack ξ_i, and a hard-margin SVM). As C → ∞ the soft-margin solution approaches the hard-margin one; as C → 0 the margin widens and more violations are tolerated.

44 Soft vs Hard Margin SVM. A soft-margin SVM always has a solution and is more robust to outliers, giving smoother surfaces (in the non-linear case). A hard-margin SVM does not require guessing the cost parameter (it requires no parameters at all).

45

46

47

48

49 Linear SVMs: Overview (Support Vector Machines). The classifier is a separating hyperplane. The most "important" training points are the support vectors; they define the hyperplane. Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. have non-zero Lagrange multipliers α_i. Both in the dual formulation of the problem and in the solution, training points appear only inside inner products: find α_1 ... α_N that solve the dual, then classify with f(x) = Σ_i α_i y_i x_i^T x + b.

50 Support Vector Machines (recap). Three main ideas: (1) Define what an optimal hyperplane is (in a way that can be identified computationally efficiently): maximize the margin. (2) Extend that definition to non-linearly separable problems: add a penalty term for misclassifications. (3) Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space.

51

52 Extension to Non-linear Decision Boundaries. So far we have only considered large-margin classifiers with a linear decision boundary; how do we generalize to a non-linear one? Key idea: transform x_i to a higher-dimensional space to "make life easier". Input space: the space where the points x_i are located. Feature space: the space of the Φ(x_i) after the transformation. Why transform? A linear operation in the feature space is equivalent to a non-linear operation in the input space, and classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x_1 x_2 makes the problem linearly separable (see the sketch below).
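
A minimal sketch of that XOR remark, using a linear SVM from scikit-learn on the augmented feature (the library and the large C value are my assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# XOR is not linearly separable in (x1, x2)...
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])

# ...but adding the product feature x1*x2 makes it separable by the plane x1*x2 = 0.
X_mapped = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
clf = SVC(kernel="linear", C=1e6).fit(X_mapped, y)
print(clf.score(X_mapped, y))   # 1.0: perfectly separated in the mapped space
```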

53 Non-linear SVMs (Support Vector Machines). Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? (Figure: one-dimensional data on the x axis, then mapped to the features (x, x^2), where it becomes separable.)

54 Disadvantages of Linear Decision Surfaces (Support Vector Machines; figure with axes Var 1, Var 2).

55 Advantages of Non-Linear Surfaces (Support Vector Machines; figure with axes Var 1, Var 2).

56 Linear Classifiers in High-Dimensional Spaces (Support Vector Machines). Find a function Φ(x) that maps to a different space. (Figure: input space with axes Var 1, Var 2, and the mapped space with axes Constructed Feature 1, Constructed Feature 2.)

57 Transforming the Data. Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional! The kernel trick comes to the rescue. (Figure: input space mapped by Φ(·) into the feature space.)

58

59 Mapping Data to a High-Dimensional Space (Support Vector Machines). Find a function Φ(x) to map to a different space; the SVM formulation then stays the same, except that the data appear as Φ(x) and the weights w are now weights in the new space. The explicit mapping is expensive if Φ(x) is very high-dimensional, so solving the problem without explicitly mapping the data is desirable!

60 The Kernel Trick. Recall the SVM optimization problem: the data points only appear as inner products x_i^T x_j. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function K by K(x_i, x_j) = Φ(x_i)^T Φ(x_j).

61 An Example for Φ(.) and K(.,.). Suppose Φ(.) is given explicitly (the slide shows a concrete feature map). An inner product in the feature space can then be computed directly from the original inputs, so if we define the kernel function accordingly, there is no need to carry out Φ(.) explicitly. This use of a kernel function to avoid carrying out Φ(.) explicitly is known as the kernel trick.

62 Kernel Example. The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_i^T x_j. If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j). A kernel function is a function that corresponds to an inner product in some expanded feature space. Example: for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_i^T x_j)^2. We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j) for some φ: K(x_i, x_j) = (1 + x_i^T x_j)^2 = (1 + x_i1 x_j1 + x_i2 x_j2)^2 = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2 = [1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2]^T [1, x_j1^2, √2 x_j1 x_j2, x_j2^2, √2 x_j1, √2 x_j2] = φ(x_i)^T φ(x_j), where φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2].
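
A quick numerical check of that identity (the specific test points are arbitrary):

```python
import numpy as np

def K(xi, xj):
    """Degree-2 polynomial kernel K(xi, xj) = (1 + xi.xj)^2 for 2-D inputs."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map with phi(xi).phi(xj) == K(xi, xj)."""
    x1, x2 = x
    return np.array([1.0, x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(xi, xj), phi(xi) @ phi(xj))   # both are 4.0 (up to rounding)
```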

63 Kernel Example. With this kernel and its corresponding feature map φ (shown as images on the slide), the XOR problem becomes linearly separable in the feature space (we can do XOR!).

64 What Functions are Kernels? (Support Vector Machines). For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome. Mercer's theorem: every positive semi-definite symmetric function is a kernel. Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix K, whose (i, j) entry is K(x_i, x_j) for i, j = 1, ..., N.
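
A small sketch of the Gram-matrix condition: building K on a sample and checking symmetry and non-negative eigenvalues (this is only a necessary check on one sample, not a proof of Mercer's condition; the helper name is mine):

```python
import numpy as np

def gram_is_psd(kernel, X, tol=1e-10):
    """Build G[i, j] = kernel(x_i, x_j) and check it is symmetric positive semi-definite."""
    G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol

X = np.random.default_rng(0).normal(size=(20, 3))
print(gram_is_psd(lambda a, b: (1 + a @ b) ** 2, X))                 # polynomial kernel: True
print(gram_is_psd(lambda a, b: np.exp(-np.sum((a - b) ** 2)), X))    # Gaussian kernel: True
```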

65 Kernel Functions. In practical use of SVMs, only the kernel function is specified (and not Φ(.)). The kernel function can be thought of as a similarity measure between the input objects. Not every similarity measure can be used as a kernel function, however: Mercer's condition states that any positive semi-definite kernel K(x, y), i.e. one whose Gram matrix is positive semi-definite on every finite set of points, can be expressed as a dot product in a high-dimensional space.

66 Examples of Kernel Functions (Support Vector Machines). Linear: K(x_i, x_j) = x_i^T x_j. Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p. Gaussian (radial-basis function network): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)), closely related to radial basis function neural networks. Sigmoid: K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1); it does not satisfy the Mercer condition for all β_0 and β_1.
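
The same kernels written out in code (the default hyperparameter values are placeholders):

```python
import numpy as np

linear     = lambda xi, xj: xi @ xj
polynomial = lambda xi, xj, p=3: (1 + xi @ xj) ** p
gaussian   = lambda xi, xj, sigma=1.0: np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))
sigmoid    = lambda xi, xj, b0=1.0, b1=0.0: np.tanh(b0 * (xi @ xj) + b1)  # not a valid kernel for all b0, b1
```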

67 Modification Due to Kernel Function. Change all inner products to kernel functions. For training, the original dual maximizes Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j; with a kernel function it maximizes Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to the same constraints α_i >= 0 and Σ_i α_i y_i = 0.

68 Modification Due to Kernel Function. For testing, the new data point z is classified as class 1 if f >= 0 and class 2 if f < 0. Original: f(z) = Σ_j α_{t_j} y_{t_j} (x_{t_j}^T z) + b. With a kernel function: f(z) = Σ_j α_{t_j} y_{t_j} K(x_{t_j}, z) + b.

69

70

71 Example. Suppose we have 5 one-dimensional data points: x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y_1 = 1, y_2 = 1, y_3 = -1, y_4 = -1, y_5 = 1. We use the polynomial kernel of degree 2, K(x, y) = (xy + 1)^2, and C is set to 100. We first find the α_i (i = 1, ..., 5) by solving the dual problem: maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to 0 <= α_i <= 100 and Σ_i α_i y_i = 0.

72 Example. By using a QP solver, we get α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833. Verify (at home) that the constraints are indeed satisfied. The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}. The discriminant function is f(x) = Σ_i α_i y_i (x_i x + 1)^2 + b = 0.6667 x^2 - 5.333 x + b. b is recovered by solving f(x_2 = 2) = 1, f(x_4 = 5) = -1, or f(x_5 = 6) = 1, since x_2 and x_5 lie on the margin boundary f(x) = +1 and x_4 lies on f(x) = -1; all give b = 9.
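
A sketch that reproduces these numbers with the cvxopt QP solver (the solver and the small diagonal jitter are my choices; the slide only says "a QP solver"):

```python
import numpy as np
from cvxopt import matrix, solvers

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
C = 100.0
n = len(x)
K = (np.outer(x, x) + 1.0) ** 2                           # degree-2 polynomial kernel

P = matrix(np.outer(y, y) * K + 1e-10 * np.eye(n))        # jitter for numerical stability
q = matrix(-np.ones(n))
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))            # 0 <= alpha_i <= C
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
A, b = matrix(y.reshape(1, -1)), matrix(0.0)              # sum_i alpha_i y_i = 0
alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
print(np.round(alpha, 3))                                 # approx [0, 2.5, 0, 7.333, 4.833]

f = lambda t: float(np.sum(alpha * y * (x * t + 1.0) ** 2))
b0 = 1.0 - f(2.0)                                         # from the support vector x_2 = 2
print(round(b0, 3), [round(f(t) + b0, 3) for t in x])     # b = 9; f(2) = f(6) = +1, f(5) = -1
```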

73

74 Homework. Solve the classical XOR problem, i.e. find a non-linear discriminant function. Dataset: Class 1: x_1 = (-1, -1), x_4 = (+1, +1); Class 2: x_2 = (-1, +1), x_3 = (+1, -1). Kernel function: polynomial of order 2, K(x, x') = (x^T x' + 1)^2. To achieve linear separability, we will use C = ∞.

75

76 Robustness of Soft vs Hard Margin SVMs (recap of slide 43; two panels, axes Var 1 and Var 2: a soft-margin SVM with slack ξ_i, and a hard-margin SVM).

77

78 Choosing the Kernel Function. This is probably the trickiest part of using an SVM. The kernel function is important because it creates the kernel matrix, which summarizes all the data. Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...). In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications. An SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions chosen automatically by the SVM.
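
One common way to act on that advice is a small cross-validated search over kernels and widths; a sketch with scikit-learn (the dataset and parameter grids are arbitrary placeholders):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
param_grid = [
    {"kernel": ["rbf"],  "gamma": [0.1, 1, 10], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3],      "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```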

79 Strengths of SVM. Training is relatively easy: there are no local optima, unlike in neural networks. It scales relatively well to high-dimensional data. The tradeoff between classifier complexity and error can be controlled explicitly. Non-traditional data such as strings and trees can be used as input to an SVM, instead of feature vectors. By performing logistic regression (a sigmoid fit) on the SVM outputs for a set of data, SVM outputs can be mapped to probabilities.

80 Weaknesses of SVM. You need to choose a "good" kernel function. It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance. It only considers two classes; how do we do multi-class classification with an SVM? Answer: 1) with output arity m, learn m SVMs: SVM 1 learns "Output == 1" vs "Output != 1", SVM 2 learns "Output == 2" vs "Output != 2", ..., SVM m learns "Output == m" vs "Output != m". 2) To predict the output for a new input, predict with each SVM and find the one that puts the prediction the furthest into the positive region. (A sketch of this one-vs-rest scheme follows.)
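
A minimal sketch of that one-vs-rest scheme with scikit-learn (scikit-learn also ships a built-in OneVsRestClassifier that does the same thing; the helper names here are mine):

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_fit(X, y):
    """Train one SVM per class k: "Output == k" vs "Output != k"."""
    return {k: SVC(kernel="rbf").fit(X, np.where(y == k, 1, -1)) for k in np.unique(y)}

def one_vs_rest_predict(models, X):
    """Pick the class whose SVM pushes the point furthest into the positive region."""
    classes = np.array(list(models))
    scores = np.column_stack([m.decision_function(X) for m in models.values()])
    return classes[np.argmax(scores, axis=1)]
```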

81 Summary: Steps for Classification. 1. Prepare the pattern matrix {(x_i, y_i)}. 2. Select a kernel function. 3. Select the error parameter C: you can use the value suggested by the SVM software, or set apart a validation set to determine it. 4. Execute the training algorithm (to find all the α_i). 5. New data can be classified using the α_i and the support vectors.
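
A compact sketch of those five steps with scikit-learn on synthetic data (the library, dataset, and candidate C values are my assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Prepare the pattern matrix {(x_i, y_i)}.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# 2.-3. Select a kernel and the error parameter C using a held-out validation set.
best_C = max((0.1, 1.0, 10.0, 100.0),
             key=lambda C: SVC(kernel="rbf", C=C).fit(X_tr, y_tr).score(X_val, y_val))

# 4.-5. Train with the chosen C; new data is classified from the alphas and support vectors.
clf = SVC(kernel="rbf", C=best_C).fit(X_tr, y_tr)
print(best_C, clf.support_vectors_.shape, clf.predict(X_val[:5]))
```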

82 The Dual of the SVM Formulation. Original SVM formulation: n inequality constraints, n positivity constraints, and n slack variables ξ_i (plus w and b). The (Wolfe) dual of this problem: one equality constraint, n positivity constraints, and n variables α_i (the Lagrange multipliers); the objective function is more complicated. NOTICE: the data only appear as inner products Φ(x_i)^T Φ(x_j).

83 Nonlinear SVM: Overview (Support Vector Machines). The SVM locates a separating hyperplane in the feature space and classifies points in that space. It does not need to represent the space explicitly; it simply defines a kernel function. The kernel function plays the role of the dot product in the feature space.

84 SVM Applications. SVMs have been used successfully in many real-world problems: text (and hypertext) categorization, image classification, ranking (e.g., Google searches), bioinformatics (protein classification, cancer classification), and hand-written character recognition.

85 Handwritten digit recognition
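
As an illustration of this application, a short sketch on scikit-learn's bundled 8x8 digits dataset (the dataset, kernel, and hyperparameters are my choices; the slide itself only names the application):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()                            # 1797 8x8 handwritten digit images
X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target, random_state=0)
clf = svm.SVC(kernel="rbf", gamma=0.001, C=10).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))                     # around 0.99 on this small benchmark
```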

86 Comparison with Neural Networks. Neural networks: hidden layers map to lower-dimensional spaces; the search space has multiple local minima; training is expensive; classification is extremely efficient; they require choosing the number of hidden units and layers; very good accuracy in typical domains. SVMs: the kernel maps to a very high-dimensional space; the search space has a unique minimum; training is extremely efficient; classification is extremely efficient; the kernel and the cost C are the two parameters to select; very good accuracy in typical domains; extremely robust.

87 Conclusions. SVMs express learning as a mathematical program, taking advantage of the rich theory of optimization. SVMs use the kernel trick to map data indirectly to extremely high-dimensional spaces. SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well.

88 Suggested Further Reading. http://www.kernel-machines.org/tutorial.html. C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. P.-H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on nu-support vector machines. 2003. N. Cristianini. ICML'01 tutorial, 2001. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181-201, May 2001. B. Schölkopf. SVM and kernel methods, 2001. Tutorial given at the NIPS Conference. Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2001.

89 References. Burges, C. "A Tutorial on Support Vector Machines for Pattern Recognition." Bell Labs, 1998. Law, Martin. "A Simple Introduction to Support Vector Machines." Michigan State University, 2006. Prabhakar, K. "An Introduction to Support Vector Machines."

90

91

92

