Administrivia
- I'm out of town Mar 1-3
- May have a guest lecturer; may cancel class
- Will let you know more when I do...
Tides of history
Last time:
- Linear classification (briefly)
- Multiple classes from sets of hyperplanes
- SVMs: nonlinear data projections
Today:
- Vector geometry
- The SVM objective function
- Quadratic programming (briefly)
- Slack variables
Exercise
Given a hyperplane defined by a weight vector w:
- What is the equation for points on the surface of the hyperplane?
- What are the equations for points on the two margins?
- Give an expression for the distance between a point and the hyperplane (and/or either margin).
- What is the role of w₀?
5 minutes of math...
A dot product (inner product) is a projection of one vector onto another. When the projection of x onto w is equal to w₀, then x falls exactly onto the hyperplane: the hyperplane is the set of points where w · x = w₀.
5 minutes of math...
BTW, are we sure that hyperplane is perpendicular to w? Why?
Consider any two vectors, x₁ and x₂, falling exactly on the hyperplane. Then w · x₁ = w₀ and w · x₂ = w₀, so w · (x₁ − x₂) = 0: (x₁ − x₂) is some vector in the hyperplane, and w is perpendicular to any such vector in the hyperplane.
5 minutes of math...
Projections on one side of the hyperplane have w · x − w₀ > 0... and on the other, w · x − w₀ < 0.
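The sign test above is easy to check numerically. A minimal sketch; the vector w and offset w₀ below are made-up example values, not from the lecture:

```python
import numpy as np

# Hyperplane: all x with w . x = w0 (illustrative values)
w = np.array([1.0, 1.0])
w0 = 1.0

def side(x):
    """+1 if x lies on the side w points toward, -1 on the other side, 0 on the plane."""
    return np.sign(np.dot(w, x) - w0)

print(side(np.array([2.0, 2.0])))  # w . x = 4 > w0 -> 1.0
print(side(np.array([0.0, 0.0])))  # w . x = 0 < w0 -> -1.0
```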
5 minutes of math...
What is the distance r from any vector x to the hyperplane?
Write x as a point on the plane plus an offset from the plane: x = x⊥ + r (w / ‖w‖), where x⊥ lies on the hyperplane and w / ‖w‖ is the unit vector in the direction of w.
5 minutes of math...
Theorem: The distance, r, from any point x to the hyperplane defined by w and w₀ is given by:
    r = (w · x − w₀) / ‖w‖
Lemma: The distance from the origin to the hyperplane is given by:
    |w₀| / ‖w‖
Also: r > 0 for points on one side of the hyperplane; r < 0 for points on the other.
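The theorem can be sanity-checked in a few lines. A sketch, assuming the convention w · x = w₀ for the hyperplane; the numbers are illustrative, chosen so that ‖w‖ = 5:

```python
import numpy as np

# Signed distance r = (w . x - w0) / ||w|| to the hyperplane w . x = w0
w = np.array([3.0, 4.0])   # ||w|| = 5 (illustrative values)
w0 = 5.0

def signed_distance(x):
    return (np.dot(w, x) - w0) / np.linalg.norm(w)

print(signed_distance(np.array([3.0, 4.0])))  # (25 - 5) / 5 = 4.0
print(signed_distance(np.array([0.0, 0.0])))  # -w0 / ||w|| = -1.0
```

The second call also illustrates the lemma: the origin's signed distance is −w₀ / ‖w‖, i.e. |w₀| / ‖w‖ in magnitude.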
Back to SVMs & margins
The margins are parallel to the hyperplane, so they are defined by the same w, plus constant offsets ±b: they are the sets of points where w · x = w₀ + b and w · x = w₀ − b.
Want to ensure that all data points are “outside” the margins: yᵢ (w · xᵢ − w₀) ≥ b for every training point (xᵢ, yᵢ), with yᵢ ∈ {+1, −1}.
Maximizing the margin
So now we have a learning criterion function: pick w to maximize b s.t. all points still satisfy yᵢ (w · xᵢ − w₀) ≥ b.
Note: w.l.o.g. can rescale w arbitrarily (why? scaling w and w₀ by a constant leaves the hyperplane unchanged, so we can fix b = 1 and shrink ‖w‖ instead).
So can formulate the full problem as:
Minimize: ½ ‖w‖²
Subject to: yᵢ (w · xᵢ − w₀) ≥ 1 for all i
But how do you do that? And how does this help?
Quadratic programming
Problems of the form:
Minimize: a quadratic objective (such as ½ ‖w‖²)
Subject to: linear inequality constraints
are called “quadratic programming” (QP) problems.
- There are off-the-shelf methods to solve them
- Actually solving this is way, way beyond the scope of this class; consider it a black box
- If a solution exists, it will be found & be unique
- Expensive, but not intractably so
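Treating the solver as a black box, here is a sketch of feeding the max-margin QP to a general-purpose constrained optimizer (SciPy's `minimize`). Real SVM packages use specialized QP code, and the two-point data set below is invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin SVM as a QP:
#   minimize (1/2)||w||^2  subject to  y_i (w . x_i - w0) >= 1
X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # toy data (made up)
y = np.array([1.0, -1.0])

def objective(z):                 # z = [w1, w2, w0]
    return 0.5 * np.dot(z[:2], z[:2])

constraints = [
    {"type": "ineq",
     "fun": lambda z, i=i: y[i] * (np.dot(z[:2], X[i]) - z[2]) - 1.0}
    for i in range(len(y))
]

result = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, w0 = result.x[:2], result.x[2]
print(w, w0)   # for this symmetric data, expect w near (0.5, 0.5) and w0 near 0
```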
Nonseparable data
What if the data isn't linearly separable?
- Project into a higher-dimensional space (we'll get there)
- Allow some “slop” in the system: allow the margins to be violated “a little”
The new “slackful” QP
The ξᵢ are “slack variables”: they allow the margins to be violated a little. Still want to minimize margin violations, so add them to the QP instance:
Minimize: ½ ‖w‖² + C Σᵢ ξᵢ
Subject to: yᵢ (w · xᵢ − w₀) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i
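Another way to see the slack variables: at the optimum each ξᵢ equals the hinge term max(0, 1 − yᵢ(w · xᵢ − w₀)), so the slackful QP can also be minimized as an unconstrained "hinge loss" objective. A subgradient-descent sketch on invented 1-D data, where one oddly labeled point makes the classes non-separable:

```python
import numpy as np

# Minimize (1/2) w^2 + C * sum_i max(0, 1 - y_i (w * x_i - w0))
# 1-D toy data (invented); x = 1.5 with label -1 is not separable from the +1s.
X = np.array([-2.0, -1.0, 1.5, 1.0, 2.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0])

w, w0, C, lr = 0.0, 0.0, 1.0, 0.01
for _ in range(2000):
    gw, gw0 = w, 0.0                    # gradient of the (1/2) w^2 term
    for xi, yi in zip(X, y):
        if yi * (w * xi - w0) < 1.0:    # inside the margin: hinge is active
            gw -= C * yi * xi
            gw0 += C * yi
    w -= lr * gw
    w0 -= lr * gw0

accuracy = np.mean(np.sign(w * X - w0) == y)
print(w, w0, accuracy)
```

The one unavoidable margin violation is absorbed by its slack/hinge term; the other four points end up classified correctly.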
You promised nonlinearity!
Where did the nonlinear transform go in all this? Another clever trick. With a little algebra (& help from Lagrange multipliers), can rewrite our QP in the dual form:
Maximize: Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ)
Subject to: 0 ≤ αᵢ ≤ C and Σᵢ αᵢ yᵢ = 0
Kernel functions
So??? It's still the same linear system. Note, though, that each xᵢ appears in the system only as a dot product: xᵢ · xⱼ.
Can replace xᵢ · xⱼ with the dot product of the projected vectors, Φ(xᵢ) · Φ(xⱼ). The inner product K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ) is called a “kernel function”.
Why are kernel fns cool?
The cool trick is that many useful projections Φ can be written as kernel functions in closed form. I.e., can work with K() rather than Φ(). If you know K(xᵢ, xⱼ) for every (i, j) pair, then you can construct the maximum margin hyperplane between the projected data without ever explicitly doing the projection!
Example kernels
- Homogeneous degree-k polynomial: K(xᵢ, xⱼ) = (xᵢ · xⱼ)ᵏ
- Inhomogeneous degree-k polynomial: K(xᵢ, xⱼ) = (xᵢ · xⱼ + 1)ᵏ
- Gaussian radial basis function: K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / (2σ²))
- Sigmoidal (neural network): K(xᵢ, xⱼ) = tanh(κ (xᵢ · xⱼ) + θ)
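All four kernels are one-liners. A sketch with illustrative default parameter values; k, c, σ, κ, and θ are free parameters you would tune:

```python
import numpy as np

def poly_homogeneous(x, z, k=2):
    return np.dot(x, z) ** k

def poly_inhomogeneous(x, z, k=2, c=1.0):
    return (np.dot(x, z) + c) ** k

def gaussian_rbf(x, z, sigma=1.0):
    return np.exp(-np.dot(x - z, x - z) / (2.0 * sigma ** 2))

def sigmoidal(x, z, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, z) + theta)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(poly_homogeneous(x, z))    # (x . z)^2 = 0.0
print(poly_inhomogeneous(x, z))  # (x . z + 1)^2 = 1.0
print(gaussian_rbf(x, x))        # zero distance -> 1.0
```

(Aside: the sigmoidal form is not a valid positive-semidefinite kernel for every choice of κ and θ.)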
Side note on kernels
What precisely do kernel functions mean? Metric functions take two points and return a (generalized) distance between them. What is the equivalent interpretation for kernels? Hint: think about what the kernel function replaces (the dot product xᵢ · xⱼ) in the max margin QP formulation.
Side note on kernels
Kernel functions are generalized inner products. Essentially, they give you the cosine of the angle between vectors. Recall the law of cosines:
    ‖x₁ − x₂‖² = ‖x₁‖² + ‖x₂‖² − 2 ‖x₁‖ ‖x₂‖ cos θ
which rearranges to x₁ · x₂ = ‖x₁‖ ‖x₂‖ cos θ.
Side note on kernels
Replace the traditional dot product with the “generalized inner product” K and get: the kernel (essentially) represents the angle between vectors in the projected, high-dimensional space.
Using the classifier
Solution of the QP gives back a set of αᵢ. Data points for which αᵢ > 0 are called “support vectors”. Turns out that we can write w as:
    w = Σᵢ αᵢ yᵢ xᵢ
Using the classifier
And our classification rule for a query point x was: f(x) = sign(w · x − w₀). So:
    f(x) = sign(Σᵢ αᵢ yᵢ (xᵢ · x) − w₀) = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) − w₀)
where the sums effectively run only over the support vectors (all other αᵢ are zero).
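The support-vector form of the rule can be checked on a tiny worked example. For the two points x₁ = (1, 1) with y₁ = +1 and x₂ = (−1, −1) with y₂ = −1 (invented for illustration), the dual solution works out by hand to α₁ = α₂ = 0.25 with w₀ = 0, and the plain dot product serves as the kernel:

```python
import numpy as np

# f(x) = sign( sum_i alpha_i y_i K(x_i, x) - w0 ), with K the plain dot product.
sv_X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # the two support vectors (toy data)
sv_y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])                # hand-computed dual solution
w0 = 0.0

def classify(x):
    return np.sign(np.sum(alpha * sv_y * (sv_X @ x)) - w0)

# Recovering w = sum_i alpha_i y_i x_i gives the same linear classifier:
w = (alpha * sv_y) @ sv_X
print(w)                                # [0.5 0.5]
print(classify(np.array([2.0, 0.0])))   # w . x = 1 > 0 -> 1.0
```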
Using the classifier
[Figure: decision boundary with the support vectors highlighted on the margins. SVM images from lecture notes by S. Dreiseitl.]