
1 Support Vector Machines and The Kernel Trick William Cohen 3-26-2007

2 The voted perceptron. Given an instance x_i, compute the prediction ŷ_i from v_k · x_i; if the prediction is a mistake (ŷ_i ≠ y_i), update v_{k+1} = v_k + y_i x_i.
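Illustration only (not part of the slides): a minimal Python sketch of the voted perceptron, assuming labels in {-1, +1} and a sign-based prediction rule, which matches the usual presentation of the algorithm.

```python
import numpy as np

def voted_perceptron(examples, epochs=1):
    """Voted perceptron: keep every intermediate weight vector v_k together
    with the number of examples it survived (its vote).
    `examples` is a list of (x, y) pairs with labels y in {-1, +1}."""
    dim = len(examples[0][0])
    v, c = np.zeros(dim), 0            # current weights and their survival count
    vectors = []                        # accumulated (v_k, c_k) pairs
    for _ in range(epochs):
        for x, y in examples:
            x = np.asarray(x, dtype=float)
            y_hat = 1 if v @ x >= 0 else -1        # prediction from v_k . x_i
            if y_hat != y:                          # mistake: record old v, then update
                vectors.append((v.copy(), c))
                v, c = v + y * x, 0                 # v_{k+1} = v_k + y_i * x_i
            else:
                c += 1
    vectors.append((v.copy(), c))
    return vectors

def predict(vectors, x):
    """Vote-weighted sign of the individual predictions."""
    x = np.asarray(x, dtype=float)
    score = sum(c * (1 if v @ x >= 0 else -1) for v, c in vectors)
    return 1 if score >= 0 else -1
```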

3 (Figure: geometric argument behind the voted perceptron mistake bound, drawn in terms of u, -u, and the margin 2γ.) (3a) The guess v_2 after the two positive examples: v_2 = v_1 + x_2. (3b) The guess v_2 after the one positive and one negative example: v_2 = v_1 - x_2.

4

5 Perceptrons vs SVMs. For the voted perceptron to "work" (in this proof), we need to assume there is some u such that y_i (u · x_i) > γ for all i, where u · u = ||u||² = 1.

6 Perceptrons vs SVMs. Question: why not use this assumption directly in the learning algorithm? i.e. –Given: γ, (x_1,y_1), (x_2,y_2), (x_3,y_3), … –Find: some w where ||w||² = 1 and, for all i, y_i (w · x_i) > γ.

7 Perceptrons vs SVMs. Question: why not use this assumption directly in the learning algorithm? i.e. –Given: (x_1,y_1), (x_2,y_2), (x_3,y_3), … –Find: some w and γ such that ||w|| = 1 and, for all i, y_i (w · x_i) > γ; in particular, the best possible w and γ.

8 Perceptrons vs SVMs. Question: why not use this assumption directly in the learning algorithm? i.e. –Given: (x_1,y_1), (x_2,y_2), (x_3,y_3), … –Maximize γ under the constraints ||w||² = 1 and, for all i, y_i (w · x_i) > γ. –Minimize ||w||² under the constraint that, for all i, y_i (w · x_i) > 1. Units are arbitrary: rescaling w rescales γ, so fixing ||w|| = 1 and maximizing γ is equivalent to fixing the margin at 1 and minimizing ||w||².

9 SVMs and optimization. Question: why not use this assumption directly in the learning algorithm? i.e. –Given: (x_1,y_1), (x_2,y_2), (x_3,y_3), … –Find: the w that minimizes the objective function ||w||² subject to the constraints y_i (w · x_i) > 1 for all i. This is a constrained optimization problem. A famous example of constrained optimization is linear programming, where the objective function is linear and the constraints are linear (in)equalities … but here the objective is quadratic, so you need to use quadratic programming.
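To make the quadratic-programming point concrete, here is a sketch (not from the slides) that hands the hard-margin primal, minimize ||w||² subject to y_i (w · x_i) ≥ 1, to a generic convex solver; it assumes the cvxpy package and linearly separable data.

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve: minimize ||w||^2  subject to  y_i * (w . x_i) >= 1 for all i.
    X is an (n, d) array, y an (n,) array of +/-1 labels; the data must be
    linearly separable or the problem is infeasible."""
    n, d = X.shape
    w = cp.Variable(d)
    objective = cp.Minimize(cp.sum_squares(w))      # quadratic objective ||w||^2
    constraints = [cp.multiply(y, X @ w) >= 1]      # linear margin constraints
    cp.Problem(objective, constraints).solve()
    return w.value
```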

10 SVMs and optimization. Motivation for SVMs as "better perceptrons": learners that minimize w · w under the constraint that, for all i, y_i (w · x_i) > 1. Questions: –What if the data isn't separable? (Answers: slack variables; the kernel trick.) –How do you solve this constrained optimization problem?

11 SVMs and optimization Question: why not use this assumption directly in the learning algorithm? i.e. –Given: (x 1,y 1 ), (x 2,y 2 ), (x 3,y 3 ), … –Find:

12 SVM with slack variables http://www.csie.ntu.edu.tw/~cjlin/libsvm/
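As an aside (not from the slides), LIBSVM is most easily tried through scikit-learn's wrapper; a minimal soft-margin example, with arbitrary toy data and parameter values, might look like this:

```python
from sklearn.svm import SVC   # scikit-learn's SVC wraps LIBSVM

# Toy 2-D data: +1 above the diagonal, -1 below, plus one noisy point.
X = [[0, 1], [1, 2], [2, 3], [1, 0], [2, 1], [3, 2], [0.5, 0.6]]
y = [+1, +1, +1, -1, -1, -1, -1]   # last point sits on the +1 side but is labeled -1

# C is the slack penalty: small C tolerates margin violations,
# large C approaches the hard-margin SVM.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)        # the support vectors found by LIBSVM
print(clf.predict([[0, 3], [3, 0]]))
```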

13 The Kernel Trick

14 The voted perceptron. Given an instance x_i, compute the prediction ŷ_i from v_k · x_i; if the prediction is a mistake (ŷ_i ≠ y_i), update v_{k+1} = v_k + y_i x_i.

15 The kernel trick. Remember: v_k is a sparse weighted sum of examples. If i_1, …, i_k are the mistakes, then v_k = y_{i_1} x_{i_1} + … + y_{i_k} x_{i_k}, so v_k · x = Σ_j y_{i_j} (x_{i_j} · x). You can think of this as a weighted sum of all examples with some of the weights being zero; the examples with non-zero weight are the support vectors.
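Because the hypothesis is a sparse weighted sum of examples, the perceptron can be run with any kernel in place of the inner product. A minimal sketch (not from the slides), with the same {-1, +1} label convention as above:

```python
def kernel_perceptron(examples, K, epochs=5):
    """Kernel perceptron: never build a weight vector; instead remember the
    mistakes (the support vectors) and score new points with a weighted sum
    of kernel values.  `K` is any kernel function K(x, z)."""
    support = []                                   # (y_i, x_i) for each mistake

    def score(x):
        return sum(y_i * K(x_i, x) for y_i, x_i in support)

    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1 if score(x) >= 0 else -1
            if y_hat != y:
                support.append((y, x))             # mistake: add this example to the sum
    return support, score
```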

16 The kernel trick – con’t. Since v_k = y_{i_1} x_{i_1} + … + y_{i_k} x_{i_k}, where i_1, …, i_k are the mistakes, the prediction v_k · x is just a sum of inner products with the mistake examples. Now consider a preprocessor that replaces every x with x', which includes, directly in the example, all the pairwise variable interactions; what is learned is then a vector v' over the expanded features.

17 The kernel trick – con’t. A voted perceptron over vectors like u, v is a linear function… Replacing u with u' would lead to non-linear functions: f(x, y, xy, x², …).

18 The kernel trick – con’t. But notice… if we replace u · v with (u · v + 1)², then for two-dimensional u and v the expansion is u_1²v_1² + u_2²v_2² + 2 u_1 v_1 u_2 v_2 + 2 u_1 v_1 + 2 u_2 v_2 + 1. Compare this to the explicit inner product u' · v' of the expanded vectors from the previous slide.

19 The kernel trick – con’t. So, up to constants on the cross-product terms, (u · v + 1)² matches the inner product u' · v' in the expanded space. Why not replace the computation of v' · x' (which requires building the expanded vectors) with the computation of K(v, x), where K(u, v) = (u · v + 1)²?
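A quick numerical check (illustration only, not from the slides) that the degree-2 polynomial kernel really is an inner product in an expanded feature space; the sqrt(2) scalings below are one standard choice of constants on the cross-product terms that makes the match exact:

```python
import numpy as np

def phi(u):
    """Explicit degree-2 feature map for a 2-D vector u, chosen so that
    phi(u) . phi(v) == (u . v + 1)**2 exactly."""
    u1, u2 = u
    s = np.sqrt(2.0)
    return np.array([1.0, s * u1, s * u2, u1 * u1, u2 * u2, s * u1 * u2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(u) @ phi(v))         # explicit expansion: grows quickly with dimension
print((u @ v + 1.0) ** 2)      # kernel: the same number, with no expansion built
```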

20 The kernel trick – con’t. General idea: replace an expensive preprocessor x → x' and the ordinary inner product with no preprocessor and a function K(x, x_i), where K(x, x_i) = x' · x_i'. Some popular kernels for numeric vectors x are shown on the slide.
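The slide's list itself is not in the transcript; the usual examples are the polynomial kernel and the Gaussian (RBF) kernel, sketched here as an illustration:

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    """K(x, z) = (x . z + c) ** degree"""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian / RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)"""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))
```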

21 Demo with An Applet http://www.site.uottawa.ca/~gcaron/SVMApplet/SVMApplet.html

22 The kernel trick – con’t. Kernels work for other data structures also! String kernels: x and x_i are strings, S = the set of shared substrings, |s| = the length of string s; by dynamic programming you can quickly compute the resulting kernel (a sum over the shared substrings s in S, typically weighted by |s|). There are also tree kernels, graph kernels, …

23 The kernel trick – con’t. Kernels work for other data structures also! String kernels: x and x_i are strings, S = the set of shared substrings, j and k are subsets of the positions inside x and x_i, len(x, j) is the distance between the first position in j and the last, and s < t means s is a substring of t; by dynamic programming you can quickly compute the kernel. Example: x = "william", j = {1,3,4}, x[j] = "wll", "wl" < "wll", len(x, j) = 4.
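The exact weighting used on the slide is not in the transcript; as an illustration of the dynamic-programming idea, here is one simple member of the string-kernel family, which counts all pairs of matching contiguous substrings (it ignores the gap term len(x, j) used in the subsequence version described above):

```python
def substring_kernel(s, t):
    """Count, by dynamic programming, all pairs of matching contiguous
    substrings of s and t.  This equals the inner product of the two strings'
    substring-occurrence-count feature vectors, so it is a valid kernel.
    Runs in O(len(s) * len(t)) time."""
    m, n = len(s), len(t)
    # L[i][j] = length of the longest common suffix of s[:i] and t[:j];
    # it also counts the matching substrings that end exactly at (i, j).
    L = [[0] * (n + 1) for _ in range(m + 1)]
    total = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
                total += L[i][j]
    return total

print(substring_kernel("william", "will"))
```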

24 The kernel trick – con’t. Even more general idea: use any function K that is continuous; symmetric, i.e. K(u,v) = K(v,u); and "positive semidefinite", i.e. for any points x_1, …, x_n and coefficients c_1, …, c_n, Σ_{i,j} c_i c_j K(x_i, x_j) ≥ 0 (equivalently, every Gram matrix built from K is positive semidefinite). Then by an ancient theorem due to Mercer, K corresponds to some combination of a preprocessor and an inner product: i.e., K(u, v) = u' · v' for some mapping u → u'. Terminology: K is a Mercer kernel. The set of all x' is a reproducing kernel Hilbert space (RKHS). The matrix M[i,j] = K(x_i, x_j) is a Gram matrix.
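A small empirical illustration (not from the slides) of the terminology: build the Gram matrix M[i,j] = K(x_i, x_j) for a sample and check that it is symmetric with no negative eigenvalues, as it must be for a Mercer kernel.

```python
import numpy as np

def gram_matrix(K, xs):
    """M[i, j] = K(x_i, x_j) for a list of examples xs."""
    n = len(xs)
    return np.array([[K(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def gram_is_psd(K, xs, tol=1e-8):
    """Sanity check on a sample: a Mercer kernel's Gram matrix is symmetric
    positive semidefinite (no eigenvalue below zero, up to tolerance)."""
    M = gram_matrix(K, xs)
    return np.allclose(M, M.T) and np.linalg.eigvalsh(M).min() >= -tol

xs = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, -1.0])]
rbf = lambda u, v: np.exp(-np.dot(u - v, u - v))
print(gram_is_psd(rbf, xs))   # True: the RBF kernel is a Mercer kernel
```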

25 SVMs and optimization. Question: why not use this assumption directly in the learning algorithm? i.e. –Given: (x_1,y_1), (x_2,y_2), (x_3,y_3), … –Find: w via the primal form; taking the Lagrangian dual, this is equivalent to finding the multipliers α_i. (The standard primal and dual are written out below.)
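The slide's primal and dual formulas are images that did not survive transcription; for reference, the standard hard-margin primal and its Lagrangian dual (written without a bias term, matching the constraint y_i (w · x_i) ≥ 1 used earlier) are:

```latex
% Primal form:
\min_{w}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad y_i\,(w\cdot x_i)\ \ge\ 1 \quad \forall i
% Lagrangian dual:
\max_{\alpha_i \ge 0}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j}
\alpha_i \alpha_j\, y_i y_j\,(x_i \cdot x_j)
% with the primal solution recovered as  w = \sum_i \alpha_i\, y_i\, x_i .
```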

26 Lagrange multipliers. Maximize f(x,y) = 2 − x² − 2y² subject to g(x,y) = x² + y² − 1 = 0.

27 Lagrange multipliers. Maximize f(x,y) = 2 − x² − 2y² subject to g(x,y) = x² + y² − 1 = 0. Claim: at the constrained maximum, the gradient of f must be perpendicular to the constraint curve g = 0 (equivalently, parallel to the gradient of g).

28 Lagrange multipliers. Maximize f(x,y) = 2 − x² − 2y² subject to g(x,y) = x² + y² − 1 = 0. Claim: at the constrained maximum, the gradient of f must be perpendicular to the constraint curve g = 0 (equivalently, parallel to the gradient of g).
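A worked solution of this example (not shown on the slides), using the multiplier condition ∇f = λ ∇g together with the constraint:

```latex
\nabla f = (-2x,\,-4y), \qquad \nabla g = (2x,\,2y), \qquad \nabla f = \lambda\,\nabla g
% From -2x = 2\lambda x and -4y = 2\lambda y with x^2 + y^2 = 1:
%   if x \ne 0:  \lambda = -1 \Rightarrow y = 0,\ x = \pm 1,\ f = 1
%   if y \ne 0:  \lambda = -2 \Rightarrow x = 0,\ y = \pm 1,\ f = 0
% So the constrained maximum is f = 1, attained at (x, y) = (\pm 1, 0).
```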

29 SVMs and optimization. Question: why not use this assumption directly in the learning algorithm? i.e. –Given: (x_1,y_1), (x_2,y_2), (x_3,y_3), … –Find: w via the primal form; taking the Lagrangian dual, this is equivalent to finding the multipliers α_i (as on slide 25).

30 SVMs and optimization. Question: why not use this assumption directly in the learning algorithm? i.e. –Given: (x_1,y_1), (x_2,y_2), (x_3,y_3), … –Find: Some key points: Solving the QP directly (Vapnik's original method) is possible but expensive. The dual form can be expressed as conditions on each example, e.g. α_i = 0 ⇒ y_i (w · x_i) ≥ 1; these are the KKT (Karush-Kuhn-Tucker) conditions, also called Kuhn-Tucker conditions, after Karush (1939) and Kuhn and Tucker (1951). The fastest methods for SVM learning ignore most of the constraints, solve a subproblem containing a few 'active constraints', then cleverly pick a few additional constraints and repeat…
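For reference (not spelled out on the slide), the KKT conditions for the bias-free hard-margin problem above are:

```latex
w = \sum_i \alpha_i\, y_i\, x_i, \qquad
\alpha_i \ge 0, \qquad
y_i\,(w \cdot x_i) - 1 \ge 0, \qquad
\alpha_i\,\bigl[\,y_i\,(w \cdot x_i) - 1\,\bigr] = 0 \quad \forall i
% Complementary slackness: either \alpha_i = 0, or the example lies exactly on
% the margin (y_i\, w \cdot x_i = 1), i.e. it is a support vector.
```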

31 More on SVMs and kernels Many other types of algorithms can be “kernelized” –Gaussian processes, memory-based/nearest neighbor methods, …. Work on optimization for linear SVMs is very active

