
CISC667, F05, Lec22, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Support Vector Machines I

CISC667, F05, Lec22, Liao2 Terminologies
An object x is represented by a set of m attributes x_i, 1 ≤ i ≤ m.
A set of n training examples S = {(x_1, y_1), …, (x_n, y_n)}, where y_i is the classification (or label) of instance x_i.
– For binary classification, y_i ∈ {−1, +1}; for k-class classification, y_i ∈ {1, 2, …, k}.
– Without loss of generality, we focus on binary classification.
The task is to learn the mapping x_i → y_i.
A machine is a learned function/mapping/hypothesis h: x_i → h(x_i, α), where α stands for the parameters to be fixed during training.
Performance is measured as E = (1/(2n)) Σ_{i=1..n} |y_i − h(x_i, α)|.
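As a quick illustration of this error measure (a minimal sketch in Python, not from the slides): since y_i and h(x_i, α) both take values in {−1, +1}, |y_i − h(x_i, α)| is 0 for a correct prediction and 2 for a mistake, so E is just the fraction of misclassified training examples.

import numpy as np

def empirical_error(y, y_hat):
    """E = (1/(2n)) * sum |y_i - h(x_i)| for labels in {-1, +1};
    equals the fraction of misclassified examples."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.abs(y - y_hat).sum() / (2 * len(y))

# e.g. one mistake out of four examples -> E = 0.25
print(empirical_error([+1, -1, +1, -1], [+1, -1, -1, -1]))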

CISC667, F05, Lec22, Liao3 Linear SVMs: find a hyperplane (specified by normal vector w and perpendicular distance b to the origin) that separates the positive and negative examples with the largest margin.
[Figure: separating hyperplane (w, b) with margin γ; positive (+) and negative (−) examples; origin and vector w shown.]
w · x_i + b > 0 if y_i = +1
w · x_i + b < 0 if y_i = −1
An unknown x is classified as sign(w · x + b).
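A minimal sketch of this decision rule in Python (the weight vector and bias below are made-up values for illustration, not fitted ones):

import numpy as np

def linear_classify(x, w, b):
    """Classify x as sign(w . x + b), returning +1 or -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

# hypothetical hyperplane: w = (1, -1), b = -0.5
w, b = np.array([1.0, -1.0]), -0.5
print(linear_classify(np.array([2.0, 0.5]), w, b))   # lands on the +1 side
print(linear_classify(np.array([0.0, 2.0]), w, b))   # lands on the -1 side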

CISC667, F05, Lec22, Liao4 Rosenblatt’s Algorithm (1956)
η > 0; // η is the learning rate
w_0 = 0; b_0 = 0; k = 0
R = max_{1 ≤ i ≤ n} ||x_i||
error = 1; // flag for misclassification/mistake
while (error) { // as long as a modification is made in the for-loop
  error = 0;
  for (i = 1 to n) {
    if (y_i ((w_k · x_i) + b_k) ≤ 0) { // misclassification
      w_{k+1} = w_k + η y_i x_i // update the weight
      b_{k+1} = b_k + η y_i R^2 // update the bias
      k = k + 1
      error = 1;
    }
  }
}
return (w_k, b_k) // hyperplane that separates the data, where k is the number of modifications
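A runnable Python sketch of the primal algorithm above (the learning rate default, the epoch cap, and the toy data are my assumptions for illustration):

import numpy as np

def perceptron_primal(X, y, eta=1.0, max_epochs=1000):
    """Rosenblatt's algorithm (primal form) as on the slide.
    X: (n, m) array of examples, y: labels in {-1, +1}.
    Returns (w, b, k), where k counts the updates made."""
    n, m = X.shape
    w, b, k = np.zeros(m), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):                     # guard against non-separable data
        error = False
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:   # misclassified
                w = w + eta * y[i] * X[i]           # update the weight
                b = b + eta * y[i] * R ** 2         # update the bias
                k += 1
                error = True
        if not error:                               # a full pass with no mistakes
            break
    return w, b, k

# toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron_primal(X, y))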

CISC667, F05, Lec22, Liao5 Questions w.r.t. Rosenblatt’s algorithm
– Is the algorithm guaranteed to converge?
– How quickly does it converge?
Novikoff Theorem: Let S be a training set of size n and R = max_{1 ≤ i ≤ n} ||x_i||. If there exists a vector w* such that ||w*|| = 1 and y_i (w* · x_i) ≥ γ for 1 ≤ i ≤ n, then the number of modifications before convergence is at most (R/γ)^2.

CISC667, F05, Lec22, Liao6 Proof:
1. w_t · w* = w_{t-1} · w* + η y_i (x_i · w*) ≥ w_{t-1} · w* + ηγ, so after t updates w_t · w* ≥ t η γ.
2. ||w_t||^2 = ||w_{t-1}||^2 + 2η y_i (x_i · w_{t-1}) + η^2 ||x_i||^2 ≤ ||w_{t-1}||^2 + η^2 ||x_i||^2 (the example was misclassified, so y_i (x_i · w_{t-1}) ≤ 0) ≤ ||w_{t-1}||^2 + η^2 R^2, so ||w_t||^2 ≤ t η^2 R^2.
3. √t η R ||w*|| ≥ ||w_t|| ||w*|| ≥ w_t · w* ≥ t η γ (the middle step is Cauchy–Schwarz), hence t ≤ (R/γ)^2.
Note:
– Without loss of generality, the separating plane is assumed to pass through the origin, i.e., no bias b is necessary.
– The learning rate η has no bearing on this upper bound.
– What if the training data is not linearly separable, i.e., w* does not exist?

CISC667, F05, Lec22, Liao7 Dual form
The final hypothesis w is a linear combination of the training points:
w = Σ_{i=1..n} α_i y_i x_i
where the α_i are positive values proportional to the number of times misclassification of x_i has caused the weight to be updated.
The vector α can be considered an alternative representation of the hypothesis; α_i can be regarded as an indication of the information content of the example x_i.
The decision function can be rewritten as
h(x) = sign(w · x + b) = sign((Σ_{j=1..n} α_j y_j x_j) · x + b) = sign(Σ_{j=1..n} α_j y_j (x_j · x) + b)

CISC667, F05, Lec22, Liao8 Rosenblatt’s Algorithm in dual form
α = 0; b = 0
R = max_{1 ≤ i ≤ n} ||x_i||
error = 1; // flag for misclassification
while (error) { // as long as a modification is made in the for-loop
  error = 0;
  for (i = 1 to n) {
    if (y_i (Σ_{j=1..n} α_j y_j (x_j · x_i) + b) ≤ 0) { // x_i is misclassified
      α_i = α_i + 1 // update the weight
      b = b + y_i R^2 // update the bias
      error = 1;
    }
  }
}
return (α, b) // hyperplane that separates the data
Notes:
– The training examples enter the algorithm only as dot products (x_j · x_i).
– α_i is a measure of information content; the x_i with non-zero information content (α_i > 0) are called support vectors, as they are located on the boundaries.
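The dual form translates to Python almost line for line; in this sketch (the toy data, epoch cap, and precomputed Gram matrix are my additions) the examples appear only through their dot products:

import numpy as np

def perceptron_dual(X, y, max_epochs=1000):
    """Rosenblatt's algorithm in dual form, as on the slide.
    Returns (alpha, b); examples with alpha[i] > 0 drove updates."""
    n = X.shape[0]
    alpha, b = np.zeros(n), 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    G = X @ X.T                                  # Gram matrix of dot products (x_j . x_i)
    for _ in range(max_epochs):
        error = False
        for i in range(n):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:   # x_i misclassified
                alpha[i] += 1                    # update the weight
                b += y[i] * R ** 2               # update the bias
                error = True
        if not error:
            break
    return alpha, b

X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
alpha, b = perceptron_dual(X, y)
w = (alpha * y) @ X                              # recover w = sum_i alpha_i y_i x_i
print(alpha, b, w)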

CISC667, F05, Lec22, Liao9 [Figure: separating hyperplane (w, b) with margin γ; the examples lying on the margin boundaries have α > 0 (support vectors), all other examples have α = 0.]

CISC667, F05, Lec22, Liao10 A larger margin is preferred: the algorithm converges more quickly and the classifier generalizes better.

CISC667, F05, Lec22, Liao11 For the closest positive and negative examples x+ and x−:
w · x+ + b = +1
w · x− + b = −1
Subtracting, 2 = (x+ · w) − (x− · w) = (x+ − x−) · w = ||x+ − x−|| ||w|| (x+ − x− is along w), so the geometric margin is ||x+ − x−|| = 2/||w||.
Therefore, maximizing the geometric margin ||x+ − x−|| is equivalent to minimizing ½ ||w||^2 under linear constraints:
Min_{w,b} ½ ||w||^2
subject to y_i (w · x_i + b) ≥ 1 for i = 1, …, n
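As a sanity check of the 2/||w|| relation, a short sketch using scikit-learn's linear SVM (the library choice, the large-C hard-margin approximation, and the toy data are my assumptions; the slides do not name any software):

import numpy as np
from sklearn.svm import SVC

# toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# a very large C approximates the hard-margin SVM of the slides
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print("w =", w, "b =", b)
print("geometric margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)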

CISC667, F05, Lec22, Liao12 Optimization with constraints
Min_{w,b} ½ ||w||^2 subject to y_i (w · x_i + b) ≥ 1 for i = 1, …, n
Lagrangian theory: introduce a Lagrange multiplier α_i for each constraint,
L(w, b, α) = ½ ||w||^2 − Σ_i α_i (y_i (w · x_i + b) − 1),
and then set ∂L/∂w = 0, ∂L/∂b = 0, ∂L/∂α = 0.
This optimization problem can be solved by Quadratic Programming, and it is guaranteed to converge to the global minimum because the problem is convex.
Note: this is an advantage over artificial neural nets.

CISC667, F05, Lec22, Liao13 The optimal w* and b* can be found by solving the dual problem: choose α to maximize
L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
under the constraints α_i ≥ 0 and Σ_i α_i y_i = 0.
Once α is solved,
w* = Σ_i α_i y_i x_i
b* = −½ (max_{y_i = −1} w* · x_i + min_{y_i = +1} w* · x_i)
An unknown x is classified as sign(w* · x + b*) = sign(Σ_i α_i y_i (x_i · x) + b*).
Notes:
1. Only the dot products of vectors are needed.
2. Many α_i are equal to zero; those that are not zero correspond to x_i on the boundaries – the support vectors!
3. In practice, instead of the sign function, the actual value of w* · x + b* is used when its absolute value is less than or equal to one. Such a value is called a discriminant.
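A sketch of solving this dual QP numerically with the cvxopt library (the library choice, the toy data, and the small ridge added to P for numerical stability are assumptions on my part, not part of the lecture). The maximization is rewritten as the equivalent minimization of ½ αᵀPα − Σ_i α_i with P_ij = y_i y_j (x_i · x_j):

import numpy as np
from cvxopt import matrix, solvers

# toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# maximize sum(a) - 1/2 a^T P a  <=>  minimize 1/2 a^T P a - sum(a)
P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(n))  # tiny ridge for stability
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))              # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))        # sum_i a_i y_i = 0
b_eq = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)["x"])

w = (alpha * y) @ X                                            # w* = sum_i a_i y_i x_i
b_star = -0.5 * (np.max(X[y == -1] @ w) + np.min(X[y == +1] @ w))
print("alpha =", alpha)             # non-zero entries mark the support vectors
print("w* =", w, "b* =", b_star)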

CISC667, F05, Lec22, Liao14 Non-linear mapping to a feature space
[Figure: input points x_i, x_j mapped by Φ( ) to Φ(x_i), Φ(x_j) in the feature space.]
L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j Φ(x_i) · Φ(x_j)

CISC667, F05, Lec22, Liao15 X = Nonlinear SVMs Input SpaceFeature Space x1x2x1x2

CISC667, F05, Lec22, Liao16 Kernel function for mapping
For input X = (x_1, x_2), define the map Φ(X) = (x_1 x_1, √2 x_1 x_2, x_2 x_2).
Define the kernel function as K(X, Y) = (X · Y)^2. Then K(X, Y) = Φ(X) · Φ(Y):
we can compute the dot product in the feature space without computing Φ.
K(X, Y) = Φ(X) · Φ(Y)
= (x_1 x_1, √2 x_1 x_2, x_2 x_2) · (y_1 y_1, √2 y_1 y_2, y_2 y_2)
= x_1 x_1 y_1 y_1 + 2 x_1 x_2 y_1 y_2 + x_2 x_2 y_2 y_2
= (x_1 y_1 + x_2 y_2)(x_1 y_1 + x_2 y_2)
= ((x_1, x_2) · (y_1, y_2))^2
= (X · Y)^2
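A quick numerical check of this identity (a throwaway sketch; the two vectors are arbitrary):

import numpy as np

def phi(v):
    """Explicit feature map Phi(X) = (x1*x1, sqrt(2)*x1*x2, x2*x2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def k_poly2(u, v):
    """Kernel K(X, Y) = (X . Y)^2."""
    return np.dot(u, v) ** 2

X, Y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(X), phi(Y)))   # dot product computed explicitly in feature space
print(k_poly2(X, Y))            # same value via the kernel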

CISC667, F05, Lec22, Liao17 Kernels
Given a mapping Φ( ) from the space of input vectors to some higher-dimensional feature space, the kernel K of two vectors x_i, x_j is the inner product of their images in the feature space, namely K(x_i, x_j) = Φ(x_i) · Φ(x_j).
Since we need only the inner products of vectors in the feature space to find the maximal-margin separating hyperplane, we can use the kernel in place of the mapping Φ( ).
Because the inner product of two vectors is a measure of the distance between them, a kernel function actually defines the geometry of the feature space (lengths and angles) and implicitly provides a similarity measure for the objects to be classified.

CISC667, F05, Lec22, Liao18 Mercer’s condition
Since kernel functions play an important role, it is important to know whether a kernel gives dot products (in some higher-dimensional space).
For a kernel K(x, y), if for any g(x) such that ∫ g(x)^2 dx is finite we have ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0, then there exists a mapping Φ such that K(x, y) = Φ(x) · Φ(y).
Notes:
1. Mercer’s condition does not tell how to actually find Φ.
2. Mercer’s condition may be hard to check since it must hold for every g(x).

CISC667, F05, Lec22, Liao19 More kernel functions
Some commonly used generic kernel functions:
– Polynomial kernel: K(x, y) = (1 + x · y)^p
– Radial (or Gaussian) kernel: K(x, y) = exp(−||x − y||^2 / (2σ^2))
Questions: By introducing extra dimensions (sometimes infinitely many), we can find a linearly separating hyperplane. But how can we be sure such a mapping to a higher-dimensional space will generalize well to unseen data? Since the mapping introduces flexibility for fitting the training examples, how do we avoid overfitting?
Answer: Use the maximum-margin hyperplane (Vapnik theory).
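These two kernels are one-liners in NumPy; a small sketch (the parameter values p = 3 and sigma = 1.0 are arbitrary defaults, not taken from the slides):

import numpy as np

def polynomial_kernel(x, y, p=3):
    """K(x, y) = (1 + x . y)^p"""
    return (1.0 + np.dot(x, y)) ** p

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), gaussian_kernel(x, y))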

CISC667, F05, Lec22, Liao20 The optimal w* and b* can be found by solving the dual problem: choose α to maximize
L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
under the constraints α_i ≥ 0 and Σ_i α_i y_i = 0.
Once α is solved,
w* = Σ_i α_i y_i Φ(x_i) (implicitly, in the feature space)
b* = −½ (max_{y_i = −1} Σ_j α_j y_j K(x_j, x_i) + min_{y_i = +1} Σ_j α_j y_j K(x_j, x_i))
And an unknown x is classified as sign(w* · Φ(x) + b*) = sign(Σ_i α_i y_i K(x_i, x) + b*).
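Putting the pieces together, a hedged end-to-end sketch using scikit-learn's SVC (the library choice and the XOR-style toy data, which are not linearly separable in the input space, are my assumptions):

import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the input space (hypothetical)
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])

# degree-2 polynomial kernel, in the spirit of the (X . Y)^2 example two slides back
clf = SVC(kernel="poly", degree=2, coef0=0.0, C=1e6).fit(X, y)

print("support vectors:", clf.support_vectors_)
print("predictions:", clf.predict(X))                  # recovers the labels
print("discriminant values:", clf.decision_function(X))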

CISC667, F05, Lec22, Liao21 References and resources
– Cristianini & Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press.
– SVMLight
– Chris Burges, A tutorial
– J.-P. Vert, A 3-day tutorial
– W. Noble, “Support vector machine applications in computational biology”, in Kernel Methods in Computational Biology, B. Schoelkopf, K. Tsuda and J.-P. Vert, eds., MIT Press, 2004.