# Support Vector Machines and Kernels Adapted from slides by Tim Oates Cognition, Robotics, and Learning (CORAL) Lab University of Maryland Baltimore County.

## Presentation on theme: "Support Vector Machines and Kernels Adapted from slides by Tim Oates Cognition, Robotics, and Learning (CORAL) Lab University of Maryland Baltimore County."— Presentation transcript:

Support Vector Machines and Kernels Adapted from slides by Tim Oates Cognition, Robotics, and Learning (CORAL) Lab University of Maryland Baltimore County Doing Really Well with Linear Decision Surfaces

Outline Prediction Prediction Why might predictions be wrong? Why might predictions be wrong? Support vector machines Support vector machines Doing really well with linear models Doing really well with linear models Kernels Kernels Making the non-linear linear Making the non-linear linear

Supervised ML = Prediction Given training instances (x,y) Given training instances (x,y) Learn a model f Learn a model f Such that f(x) = y Such that f(x) = y Use f to predict y for new x Use f to predict y for new x Many variations on this basic theme Many variations on this basic theme

Why might predictions be wrong? True Non-Determinism True Non-Determinism Flip a biased coin Flip a biased coin p(heads) =  p(heads) =  Estimate  Estimate  If  > 0.5 predict heads, else tails If  > 0.5 predict heads, else tails Lots of ML research on problems like this Lots of ML research on problems like this Learn a model Learn a model Do the best you can in expectation Do the best you can in expectation

Why might predictions be wrong? Partial Observability Partial Observability Something needed to predict y is missing from observation x Something needed to predict y is missing from observation x N-bit parity problem N-bit parity problem x contains N-1 bits (hard PO) x contains N-1 bits (hard PO) x contains N bits but learner ignores some of them (soft PO) x contains N bits but learner ignores some of them (soft PO)

Why might predictions be wrong? True non-determinism True non-determinism Partial observability Partial observability hard, soft hard, soft Representational bias Representational bias Algorithmic bias Algorithmic bias Bounded resources Bounded resources

Representational Bias Having the right features (x) is crucial Having the right features (x) is crucial XOOOOXXX X O O O O X X X

Support Vector Machines Doing Really Well with Linear Decision Surfaces

Strengths of SVMs Good generalization in theory Good generalization in theory Good generalization in practice Good generalization in practice Work well with few training instances Work well with few training instances Find globally best model Find globally best model Efficient algorithms Efficient algorithms Amenable to the kernel trick Amenable to the kernel trick

Linear Separators Training instances Training instances x   n x   n y  {-1, 1} y  {-1, 1} w   n w   n b   b   Hyperplane Hyperplane + b = 0 + b = 0 w 1 x 1 + w 2 x 2 … + w n x n + b = 0 w 1 x 1 + w 2 x 2 … + w n x n + b = 0 Decision function Decision function f(x) = sign( + b) f(x) = sign( + b) Math Review Inner (dot) product: = a · b = ∑ a i *b i = a · b = ∑ a i *b i = a 1 b 1 + a 2 b 2 + …+a n b n

Intuitions X X O O O O O O X X X X X X O O

Intuitions X X O O O O O O X X X X X X O O

Intuitions X X O O O O O O X X X X X X O O

Intuitions X X O O O O O O X X X X X X O O

A “Good” Separator X X O O O O O O X X X X X X O O

Noise in the Observations X X O O O O O O X X X X X X O O

Ruling Out Some Separators X X O O O O O O X X X X X X O O

Lots of Noise X X O O O O O O X X X X X X O O

Maximizing the Margin X X O O O O O O X X X X X X O O

“Fat” Separators X X O O O O O O X X X X X X O O

Why Maximize Margin? Increasing margin reduces capacity Increasing margin reduces capacity Must restrict capacity to generalize Must restrict capacity to generalize m training instances m training instances 2 m ways to label them 2 m ways to label them What if function class that can separate them all? What if function class that can separate them all? Shatters the training instances Shatters the training instances VC Dimension is largest m such that function class can shatter some set of m points VC Dimension is largest m such that function class can shatter some set of m points

VC Dimension Example X XX O XX X OX X XO O OX O XO X OO O OO

Bounding Generalization Error R[f] = risk, test error R[f] = risk, test error R emp [f] = empirical risk, train error R emp [f] = empirical risk, train error h = VC dimension h = VC dimension m = number of training instances m = number of training instances  = probability that bound does not hold  = probability that bound does not hold 1 m 2m h ln + 1 4  + ln h R[f]  R emp [f] +

Support Vectors X X O O O O O O O O X X X X X X

The Math Training instances Training instances x   n x   n y  {-1, 1} y  {-1, 1} Decision function Decision function f(x) = sign( + b) f(x) = sign( + b) w   n w   n b   b   Find w and b that Find w and b that Perfectly classify training instances Perfectly classify training instances Assuming linear separability Assuming linear separability Maximize margin Maximize margin

The Math For perfect classification, we want For perfect classification, we want y i ( + b) ≥ 0 for all i y i ( + b) ≥ 0 for all i Why? Why? To maximize the margin, we want To maximize the margin, we want w that minimizes |w| 2 w that minimizes |w| 2

Dual Optimization Problem Maximize over  Maximize over  W(  ) =  i  i - 1/2  i,j  i  j y i y j W(  ) =  i  i - 1/2  i,j  i  j y i y j Subject to Subject to  i  0  i  0  i  i y i = 0  i  i y i = 0 Decision function Decision function f(x) = sign(  i  i y i + b) f(x) = sign(  i  i y i + b)

What if Data Are Not Perfectly Linearly Separable? Cannot find w and b that satisfy Cannot find w and b that satisfy y i ( + b) ≥ 1 for all i y i ( + b) ≥ 1 for all i Introduce slack variables  i Introduce slack variables  i y i ( + b) ≥ 1 -  i for all i y i ( + b) ≥ 1 -  i for all i Minimize Minimize |w| 2 + C   i |w| 2 + C   i

Strengths of SVMs Good generalization in theory Good generalization in theory Good generalization in practice Good generalization in practice Work well with few training instances Work well with few training instances Find globally best model Find globally best model Efficient algorithms Efficient algorithms Amenable to the kernel trick … Amenable to the kernel trick …

What if Surface is Non- Linear? X X X X X X O O O O O O O O O O O O O O O O O O O O Image from http://www.atrandomresearch.com/iclass/

Kernel Methods Making the Non-Linear Linear

When Linear Separators Fail XOOOOXXX x1x1 x2x2 X O O O O X X X x1x1 x12x12

Mapping into a New Feature Space Rather than run SVM on x i, run it on  (x i ) Rather than run SVM on x i, run it on  (x i ) Find non-linear separator in input space Find non-linear separator in input space What if  (x i ) is really big? What if  (x i ) is really big? Use kernels to compute it implicitly! Use kernels to compute it implicitly!  : x  X =  (x)  (x 1,x 2 ) = (x 1,x 2,x 1 2,x 2 2,x 1 x 2 ) Image from http://web.engr.oregonstate.edu/http://web.engr.oregonstate.edu/ ~afern/classes/cs534/

Kernels Find kernel K such that Find kernel K such that K(x 1,x 2 ) = K(x 1,x 2 ) = Computing K(x 1,x 2 ) should be efficient, much more so than computing  (x 1 ) and  (x 2 ) Computing K(x 1,x 2 ) should be efficient, much more so than computing  (x 1 ) and  (x 2 ) Use K(x 1,x 2 ) in SVM algorithm rather than Use K(x 1,x 2 ) in SVM algorithm rather than Remarkably, this is possible Remarkably, this is possible

The Polynomial Kernel K(x 1,x 2 ) = 2 K(x 1,x 2 ) = 2 x 1 = (x 11, x 12 ) x 1 = (x 11, x 12 ) x 2 = (x 21, x 22 ) x 2 = (x 21, x 22 ) = (x 11 x 21 + x 12 x 22 ) = (x 11 x 21 + x 12 x 22 ) 2 = (x 11 2 x 21 2 + x 12 2 x 22 2 + 2x 11 x 12 x 21 x 22 ) 2 = (x 11 2 x 21 2 + x 12 2 x 22 2 + 2x 11 x 12 x 21 x 22 )  (x 1 ) = (x 11 2, x 12 2, √2x 11 x 12 )  (x 1 ) = (x 11 2, x 12 2, √2x 11 x 12 )  (x 2 ) = (x 21 2, x 22 2, √2x 21 x 22 )  (x 2 ) = (x 21 2, x 22 2, √2x 21 x 22 ) K(x 1,x 2 ) = K(x 1,x 2 ) =

The Polynomial Kernel  (x) contains all monomials of degree d  (x) contains all monomials of degree d Useful in visual pattern recognition Useful in visual pattern recognition Number of monomials Number of monomials 16x16 pixel image 16x16 pixel image 10 10 monomials of degree 5 10 10 monomials of degree 5 Never explicitly compute  (x)! Never explicitly compute  (x)! Variation - K(x 1,x 2 ) = ( + 1) 2 Variation - K(x 1,x 2 ) = ( + 1) 2

Kernels What does it mean to be a kernel? What does it mean to be a kernel? K(x 1,x 2 ) = for some  K(x 1,x 2 ) = for some  What does it take to be a kernel? What does it take to be a kernel? The Gram matrix G ij = K(x i, x j ) The Gram matrix G ij = K(x i, x j ) Positive definite matrix Positive definite matrix  ij c i c j G ij  0 for c i, c j    ij c i c j G ij  0 for c i, c j   Positive definite kernel Positive definite kernel For all samples of size m, induces a positive definite Gram matrix For all samples of size m, induces a positive definite Gram matrix

A Few Good Kernels Dot product kernel Dot product kernel K(x 1,x 2 ) = K(x 1,x 2 ) = Polynomial kernel Polynomial kernel K(x 1,x 2 ) = d (Monomials of degree d) K(x 1,x 2 ) = d (Monomials of degree d) K(x 1,x 2 ) = ( + 1) d (All monomials of degree 1,2,…,d) K(x 1,x 2 ) = ( + 1) d (All monomials of degree 1,2,…,d) Gaussian kernel Gaussian kernel K(x 1,x 2 ) = exp(-| x 1 -x 2 | 2 /2  2 ) K(x 1,x 2 ) = exp(-| x 1 -x 2 | 2 /2  2 ) Radial basis functions Radial basis functions Sigmoid kernel Sigmoid kernel K(x 1,x 2 ) = tanh( + ) K(x 1,x 2 ) = tanh( + ) Neural networks Neural networks Establishing “kernel-hood” from first principles is non- trivial Establishing “kernel-hood” from first principles is non- trivial

The Kernel Trick “Given an algorithm which is formulated in terms of a positive definite kernel K 1, one can construct an alternative algorithm by replacing K 1 with another positive definite kernel K 2 ”  SVMs can use the kernel trick

Using a Different Kernel in the Dual Optimization Problem For example, using the polynomial kernel with d = 4 (including lower-order terms). For example, using the polynomial kernel with d = 4 (including lower-order terms). Maximize over  Maximize over  W(  ) =  i  i - 1/2  i,j  i  j y i y j W(  ) =  i  i - 1/2  i,j  i  j y i y j Subject to Subject to  i  0  i  0  i  i y i = 0  i  i y i = 0 Decision function Decision function f(x) = sign(  i  i y i + b) f(x) = sign(  i  i y i + b) ( + 1) 4 X X These are kernels! So by the kernel trick, we just replace them!

Exotic Kernels Strings Strings Trees Trees Graphs Graphs The hard part is establishing kernel-hood The hard part is establishing kernel-hood

Application: “Beautification Engine” (Leyvand et al., 2008)

Conclusion SVMs find optimal linear separator SVMs find optimal linear separator The kernel trick makes SVMs non-linear learning algorithms The kernel trick makes SVMs non-linear learning algorithms

Download ppt "Support Vector Machines and Kernels Adapted from slides by Tim Oates Cognition, Robotics, and Learning (CORAL) Lab University of Maryland Baltimore County."

Similar presentations