1 CISC 667 Intro to Bioinformatics (Fall 2005), Lecture 22: Support Vector Machines I

2 Terminologies
An object x is represented by a set of m attributes x_i, 1 ≤ i ≤ m.
A set of n training examples is S = {(x_1, y_1), …, (x_n, y_n)}, where y_i is the classification (or label) of instance x_i.
– For binary classification, y_i ∈ {−1, +1}; for k-class classification, y_i ∈ {1, 2, …, k}.
– Without loss of generality, we focus on binary classification.
The task is to learn the mapping x_i → y_i.
A machine is a learned function/mapping/hypothesis h: x_i → h(x_i, θ), where θ stands for the parameters to be fixed during training.
Performance is measured as E = (1/2n) Σ_{i=1..n} |y_i − h(x_i, θ)|.
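
As a concrete reading of the error measure above, here is a minimal Python sketch (the function and variable names are illustrative, not from the slides): for labels in {−1, +1}, each mistake contributes |y_i − h(x_i)| = 2, so E is simply the fraction of misclassified examples.

```python
import numpy as np

def error_rate(h, X, y):
    """E = (1/2n) * sum_i |y_i - h(x_i)| for labels in {-1, +1}.
    Each misclassified example contributes 2, so E is the fraction of mistakes."""
    predictions = np.array([h(x) for x in X])
    return np.abs(y - predictions).sum() / (2 * len(y))

# Toy usage with an illustrative linear hypothesis h(x) = sign(w . x + b)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
h = lambda x: np.sign(np.dot([1.0, 1.0], x) + 0.0)
print(error_rate(h, X, y))   # 0.0 on this separable toy set
```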

3 Linear SVMs: find a hyperplane (specified by its normal vector w and bias b, which sets the perpendicular distance of the hyperplane from the origin) that separates the positive and negative examples with the largest margin.
w · x_i + b > 0 if y_i = +1
w · x_i + b < 0 if y_i = −1
An unknown x is classified as sign(w · x + b).
[Figure: the separating hyperplane (w, b), its margin γ, the origin, and the normal vector w.]

4 Rosenblatt's Algorithm (1956)

η = learning rate; w_0 = 0; b_0 = 0; k = 0
R = max_{1≤i≤n} ||x_i||
error = 1                              // flag for a misclassification/mistake
while (error) {                        // as long as a modification is made in the for-loop
  error = 0
  for (i = 1 to n) {
    if (y_i (w_k · x_i + b_k) ≤ 0) {   // misclassification
      w_{k+1} = w_k + η y_i x_i        // update the weight
      b_{k+1} = b_k + η y_i R²         // update the bias
      k = k + 1
      error = 1
    }
  }
}
return (w_k, b_k)   // a hyperplane that separates the data; k is the number of modifications
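
A minimal Python sketch of this primal perceptron, assuming the data are NumPy arrays (variable names follow the pseudocode; the epoch cap is an added safeguard for non-separable data):

```python
import numpy as np

def perceptron_primal(X, y, eta=1.0, max_epochs=1000):
    """Rosenblatt's perceptron, primal form, following the slide's pseudocode.
    X: (n, m) array of examples; y: (n,) array of labels in {-1, +1}.
    Returns (w, b, k), where k counts the number of updates (mistakes)."""
    n, m = X.shape
    w, b, k = np.zeros(m), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):                      # cap added in case data is not separable
        made_mistake = False
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:    # misclassified (or on the hyperplane)
                w = w + eta * y[i] * X[i]            # weight update
                b = b + eta * y[i] * R ** 2          # bias update
                k += 1
                made_mistake = True
        if not made_mistake:
            break
    return w, b, k

# Toy usage on a linearly separable 2-D set
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron_primal(X, y))
```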

5 Questions w.r.t. Rosenblatt's algorithm:
– Is the algorithm guaranteed to converge?
– How quickly does it converge?
Novikoff Theorem: Let S be a training set of size n and R = max_{1≤i≤n} ||x_i||. If there exists a vector w* such that ||w*|| = 1 and y_i (w* · x_i) ≥ γ for 1 ≤ i ≤ n, then the number of modifications before convergence is at most (R/γ)².

6 Proof:
1. w_t · w* = w_{t−1} · w* + η y_i (x_i · w*) ≥ w_{t−1} · w* + ηγ, hence w_t · w* ≥ t η γ.
2. ||w_t||² = ||w_{t−1}||² + 2 η y_i (x_i · w_{t−1}) + η² ||x_i||² ≤ ||w_{t−1}||² + η² ||x_i||² ≤ ||w_{t−1}||² + η² R² (the middle term is dropped because x_i was misclassified, so y_i (x_i · w_{t−1}) ≤ 0), hence ||w_t||² ≤ t η² R².
3. √t η R ||w*|| ≥ ||w_t|| ||w*|| ≥ w_t · w* ≥ t η γ, therefore t ≤ (R/γ)².
Notes:
– Without loss of generality, the separating plane is assumed to pass through the origin, i.e., no bias b is necessary.
– The learning rate η has no bearing on this upper bound.
– What if the training data are not linearly separable, i.e., w* does not exist?

7 Dual form
The final hypothesis w is a linear combination of the training points:
w = Σ_{i=1..n} α_i y_i x_i
where the α_i are non-negative values proportional to the number of times a misclassification of x_i has caused the weight to be updated.
The vector α can be considered an alternative representation of the hypothesis; α_i can be regarded as an indication of the information content of the example x_i.
The decision function can be rewritten as
h(x) = sign(w · x + b) = sign((Σ_{j=1..n} α_j y_j x_j) · x + b) = sign(Σ_{j=1..n} α_j y_j (x_j · x) + b)

8 Rosenblatt's Algorithm in dual form

α = 0; b = 0
R = max_{1≤i≤n} ||x_i||
error = 1                         // flag for a misclassification
while (error) {                   // as long as a modification is made in the for-loop
  error = 0
  for (i = 1 to n) {
    if (y_i (Σ_{j=1..n} α_j y_j (x_j · x_i) + b) ≤ 0) {   // x_i is misclassified
      α_i = α_i + 1               // update the weight
      b = b + y_i R²              // update the bias
      error = 1
    }
  }
}
return (α, b)   // a hyperplane that separates the data

Notes:
– The training examples enter the algorithm only as dot products (x_j · x_i).
– α_i is a measure of information content; examples x_i with non-zero α_i (α_i > 0) are called support vectors, as they are located on the boundaries.
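
A corresponding Python sketch of the dual form, again assuming NumPy arrays (the Gram matrix G of pairwise dot products is precomputed; the epoch cap is an added safeguard):

```python
import numpy as np

def perceptron_dual(X, y, max_epochs=1000):
    """Dual-form perceptron following the slide's pseudocode.
    alpha[i] counts how often example i triggered an update; examples with
    alpha[i] > 0 play the role of support vectors. The data enter only
    through dot products, so a kernel could be substituted for G."""
    n = X.shape[0]
    alpha, b = np.zeros(n), 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    G = X @ X.T                                   # Gram matrix: G[j, i] = x_j . x_i
    for _ in range(max_epochs):                   # cap added in case data is not separable
        made_mistake = False
        for i in range(n):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:   # x_i misclassified
                alpha[i] += 1                                   # update the weight
                b += y[i] * R ** 2                              # update the bias
                made_mistake = True
        if not made_mistake:
            break
    return alpha, b

# Toy usage: the primal weight vector can be recovered as w = sum_i alpha_i y_i x_i
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])
alpha, b = perceptron_dual(X, y)
print(alpha, b, (alpha * y) @ X)
```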

9 [Figure: the separating hyperplane (w, b) and its margin γ; points lying on the margin boundaries have α > 0 (the support vectors), while all other points have α = 0.]

10 A larger margin is preferred: the algorithm converges more quickly, and the resulting classifier generalizes better.

11 For the closest positive point x+ and the closest negative point x− (both lying on the margin boundaries):
w · x+ + b = +1
w · x− + b = −1
Subtracting: 2 = (x+ · w) − (x− · w) = (x+ − x−) · w = ||x+ − x−|| ||w||  (taking x+ − x− along the normal direction w),
so the geometric margin is ||x+ − x−|| = 2/||w||.
Therefore, maximizing the geometric margin ||x+ − x−|| is equivalent to minimizing ½||w||² under the linear constraints y_i (w · x_i + b) ≥ 1 for i = 1, …, n:
Min_{w,b} ½||w||²  subject to  y_i (w · x_i + b) ≥ 1, i = 1, …, n
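
The same derivation, restated compactly in LaTeX with an explicit projection onto the unit normal w/||w|| (this restatement is added here for clarity; it is not part of the original slide):

```latex
% Margin between the two supporting hyperplanes, via projection onto w/||w||.
\begin{align*}
  w \cdot x_+ + b = +1,\quad w \cdot x_- + b = -1
    \;&\Rightarrow\; (x_+ - x_-) \cdot w = 2, \\
  \text{margin} = (x_+ - x_-) \cdot \frac{w}{\|w\|}
    \;&=\; \frac{2}{\|w\|},
\end{align*}
% so maximizing the margin 2/||w|| is the same as minimizing (1/2)||w||^2.
```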

12 Optimization with constraints
Min_{w,b} ½||w||²  subject to  y_i (w · x_i + b) ≥ 1, i = 1, …, n
Lagrangian theory: introducing a Lagrange multiplier α_i for each constraint,
L(w, b, α) = ½||w||² − Σ_i α_i (y_i (w · x_i + b) − 1),
and then setting ∂L/∂w = 0 and ∂L/∂b = 0 (with α_i ≥ 0 handled by the KKT conditions).
This optimization problem can be solved by quadratic programming, and it is guaranteed to converge to the global minimum because the problem is convex.
Note: this is an advantage over artificial neural networks, whose training can get stuck in local minima.
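
A short derivation sketch (added here to connect this Lagrangian to the dual problem on the next slide; it is the standard argument, not part of the original slide):

```latex
% Set the derivatives with respect to the primal variables to zero:
\begin{align*}
  \frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0
      \;&\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \\
  \frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0
      \;&\Rightarrow\; \sum_i \alpha_i y_i = 0.
\end{align*}
% Substituting these back into L(w, b, alpha) leaves the dual objective:
\[
  L(\alpha) \;=\; \sum_i \alpha_i \;-\; \tfrac12 \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j),
  \qquad \alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0 .
\]
```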

13 The optimal w* and b* can be found by solving the dual problem: find the α that maximizes
L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
under the constraints α_i ≥ 0 and Σ_i α_i y_i = 0.
Once α is solved,
w* = Σ_i α_i y_i x_i
b* = −½ (max_{y_i = −1} w* · x_i + min_{y_i = +1} w* · x_i)
and an unknown x is classified as
sign(w* · x + b*) = sign(Σ_i α_i y_i (x_i · x) + b*)
Notes:
1. Only dot products between vectors are needed.
2. Many α_i are equal to zero; those that are non-zero correspond to the x_i on the boundaries: the support vectors.
3. In practice, instead of the sign function, the actual value of w* · x + b* is used when its absolute value is less than or equal to one. Such a value is called a discriminant.
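
A quick numerical check of these statements, assuming scikit-learn is available (the slides do not prescribe any particular software; a very large C approximates the hard-margin problem on separable data):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.0, 3.0], [3.0, 3.0],
              [-2.0, -1.0], [-1.0, -3.0], [-3.0, -3.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)     # near hard-margin linear SVM

print(clf.support_vectors_)        # only the boundary points show up (note 2)
print(clf.dual_coef_)              # alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)   # w* and b* recovered from the dual solution
print(clf.decision_function(X))    # w* . x + b*, the discriminant of note 3
```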

14 Non-linear mapping to a feature space
[Figure: input points x_i and x_j are mapped by Φ( ) to Φ(x_i) and Φ(x_j) in the feature space.]
L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j Φ(x_i) · Φ(x_j)

15 Nonlinear SVMs
[Figure: a mapping X → Φ(X) from the input space (coordinates x_1, x_2) to a feature space in which the data become linearly separable.]

16 Kernel function for mapping
For an input X = (x_1, x_2), define the map Φ(X) = (x_1 x_1, √2 x_1 x_2, x_2 x_2).
Define the kernel function as K(X, Y) = (X · Y)². Then K(X, Y) = Φ(X) · Φ(Y):
we can compute the dot product in the feature space without computing Φ.
K(X, Y) = Φ(X) · Φ(Y)
= (x_1 x_1, √2 x_1 x_2, x_2 x_2) · (y_1 y_1, √2 y_1 y_2, y_2 y_2)
= x_1 x_1 y_1 y_1 + 2 x_1 x_2 y_1 y_2 + x_2 x_2 y_2 y_2
= (x_1 y_1 + x_2 y_2)(x_1 y_1 + x_2 y_2)
= ((x_1, x_2) · (y_1, y_2))²
= (X · Y)²
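
A small numerical check of this identity in Python (the function names phi and poly2_kernel are illustrative):

```python
import numpy as np

def phi(v):
    """Explicit feature map from the slide: (x1*x1, sqrt(2)*x1*x2, x2*x2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly2_kernel(u, v):
    """K(X, Y) = (X . Y)^2."""
    return np.dot(u, v) ** 2

rng = np.random.default_rng(0)
for _ in range(5):
    X, Y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(poly2_kernel(X, Y), np.dot(phi(X), phi(Y)))
print("K(X, Y) == phi(X) . phi(Y) for all sampled pairs")
```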

17 Kernels
Given a mapping Φ( ) from the space of input vectors to some higher-dimensional feature space, the kernel K of two vectors x_i, x_j is the inner product of their images in the feature space, namely K(x_i, x_j) = Φ(x_i) · Φ(x_j).
Since we only need the inner products of vectors in the feature space to find the maximal-margin separating hyperplane, we can use the kernel in place of the mapping Φ( ).
Because inner products determine lengths, distances, and angles, a kernel function actually defines the geometry of the feature space and implicitly provides a similarity measure for the objects to be classified.

18 Mercer's condition
Since kernel functions play such an important role, it is important to know whether a given kernel actually computes dot products in some higher-dimensional space.
For a kernel K(x, y): if, for every g(x) such that ∫ g(x)² dx is finite, we have
∫∫ K(x, y) g(x) g(y) dx dy ≥ 0,
then there exists a mapping Φ such that K(x, y) = Φ(x) · Φ(y).
Notes:
1. Mercer's condition does not tell us how to actually find Φ.
2. Mercer's condition may be hard to check, since it must hold for every g(x).
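
A practical spot-check (added here; it uses a consequence of Mercer's condition rather than the condition itself): a valid kernel's Gram matrix on any finite sample must be positive semidefinite, so inspecting eigenvalues of Gram matrices built from random points can expose a non-kernel. Passing this check does not prove Mercer's condition; failing it disproves it.

```python
import numpy as np

def gram_matrix(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

def looks_psd(kernel, n_points=50, dim=3, tol=1e-8, seed=0):
    """True if the kernel's Gram matrix on random points is (numerically) PSD."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_points, dim))
    eigvals = np.linalg.eigvalsh(gram_matrix(kernel, X))
    return eigvals.min() >= -tol

print(looks_psd(lambda x, y: (1 + np.dot(x, y)) ** 2))   # polynomial kernel: True
print(looks_psd(lambda x, y: -np.linalg.norm(x - y)))    # negative distance: False, not a Mercer kernel
```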

19 More kernel functions
Some commonly used generic kernel functions:
– Polynomial kernel: K(x, y) = (1 + x · y)^p
– Radial (or Gaussian) kernel: K(x, y) = exp(−||x − y||² / 2σ²)
Questions: by introducing extra dimensions (sometimes infinitely many), we can find a linearly separating hyperplane. But how can we be sure that such a mapping to a higher-dimensional space will generalize well to unseen data? Since the mapping introduces flexibility for fitting the training examples, how do we avoid overfitting?
Answer: use the maximum-margin hyperplane (Vapnik's theory).
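
The two generic kernels above, written out as plain Python functions (the parameter names p and sigma follow the slide's notation):

```python
import numpy as np

def polynomial_kernel(x, y, p=3):
    """K(x, y) = (1 + x . y)^p"""
    return (1.0 + np.dot(x, y)) ** p

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y, p=2), gaussian_kernel(x, y, sigma=1.5))
```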

20 With a kernel, the optimal α is found by solving the dual problem: maximize
L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
under the constraints α_i ≥ 0 and Σ_i α_i y_i = 0.
Once α is solved, w* = Σ_i α_i y_i Φ(x_i) (implicitly, in the feature space), and
b* = −½ (max_{y_i = −1} Σ_j α_j y_j K(x_j, x_i) + min_{y_i = +1} Σ_j α_j y_j K(x_j, x_i))
An unknown x is classified as
sign(w* · Φ(x) + b*) = sign(Σ_i α_i y_i K(x_i, x) + b*)
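
To make the kernelized decision function concrete, here is a sketch that reuses the dual-form update rule from slide 8 with K(x_j, x_i) in place of the dot product (a kernel perceptron). This is only an illustration of training and classifying through kernel evaluations; it is not the maximum-margin QP solution described on this slide.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def kernel_perceptron(X, y, kernel, max_epochs=200):
    """Dual-form perceptron with a kernel replacing the dot product."""
    n = X.shape[0]
    alpha, b = np.zeros(n), 0.0
    K = np.array([[kernel(a, c) for c in X] for a in X])
    R2 = np.max(np.diag(K))                     # ||phi(x_i)||^2 = K(x_i, x_i)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:   # x_i misclassified
                alpha[i] += 1
                b += y[i] * R2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

def decision(x, X, y, alpha, b, kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b; classify with sign(f(x))."""
    return np.sum(alpha * y * np.array([kernel(xi, x) for xi in X])) + b

# Toy usage: an XOR-like set that is not linearly separable in input space
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([+1, +1, -1, -1])
alpha, b = kernel_perceptron(X, y, gaussian_kernel)
print([int(np.sign(decision(x, X, y, alpha, b, gaussian_kernel))) for x in X])   # [1, 1, -1, -1]
```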

21 References and resources
– Cristianini & Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
– www.kernel-machines.org: software (e.g., SVMlight) and tutorials, including one by Chris Burges and a three-day tutorial by J.-P. Vert.
– W. Noble, "Support vector machine applications in computational biology", in Kernel Methods in Computational Biology, B. Schoelkopf, K. Tsuda and J.-P. Vert (eds.), MIT Press, 2004.

