
1 START OF DAY 5 Reading: Chap. 8

2 Support Vector Machine

3 Revisiting Linear Classification

Recall:
– The perceptron can only solve linearly-separable tasks
– Non-linear dimensions (e.g., x², xy) can be added that may make the new task linearly separable (see the sketch below)

Two questions arise:
– Is there an optimal way to do linear classification?
– Is there a systematic way to leverage it in higher dimensions?
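A minimal sketch, assuming NumPy, of how adding a non-linear feature can make a task linearly separable; the XOR data and the product feature x1*x2 are chosen here purely for illustration.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # XOR inputs
y = np.array([-1, 1, 1, -1])                    # XOR labels (not separable in R^2)

# Augment each point with the non-linear feature x1*x2
X_aug = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

# In the augmented space, the hyperplane x1 + x2 - 2*(x1*x2) - 0.5 = 0
# separates the classes: the sign of w.x + b matches y for every point
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(X_aug @ w + b))  # [-1.  1.  1. -1.]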

4 Maximal-Margin Classification (I)

Consider a 2-class problem in R^d
As needed (and without loss of generality), relabel the classes to -1 and +1
Suppose we have a separating hyperplane:
– Its equation is: w·x + b = 0
– w is normal to the hyperplane
– |b|/||w|| is the perpendicular distance from the hyperplane to the origin
– ||w|| is the Euclidean norm of w
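A minimal sketch, assuming NumPy, of the quantities just defined; the values of w and b are made up for illustration.

import numpy as np

w = np.array([3.0, 4.0])  # normal to the hyperplane w.x + b = 0
b = -10.0

norm_w = np.linalg.norm(w)   # Euclidean norm ||w|| = 5.0
print(abs(b) / norm_w)       # distance from the hyperplane to the origin: 2.0

# More generally, (w.x + b)/||w|| is the signed distance of a point x
x = np.array([2.0, 1.0])
print((w @ x + b) / norm_w)  # 0.0: this point lies on the hyperplane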

5 Maximal-Margin Classification (II)

We can certainly choose w and b in such a way that:
– w·x_i + b > 0 when y_i = +1
– w·x_i + b < 0 when y_i = -1

Rescaling w and b so that the closest points to the hyperplane satisfy |w·x_i + b| = 1, we can rewrite the above as:
– w·x_i + b ≥ +1 when y_i = +1   (1)
– w·x_i + b ≤ -1 when y_i = -1   (2)

6 Maximal-Margin Classification (III)

Consider the case when (1) is an equality:
– w·x_i + b = +1   (hyperplane H+)
– Normal: w
– Distance from origin: |1-b|/||w||

Similarly for (2):
– w·x_i + b = -1   (hyperplane H-)
– Normal: w
– Distance from origin: |-1-b|/||w||

We now have two hyperplanes, both parallel to the original

7 Maximal-Margin Classification (IV)

[figure slide: the separating hyperplane together with H+, H-, and the margin between them]

8 Maximal-Margin Classification (V)

Note that the points lying on H- and H+ are sufficient to define H- and H+, and therefore sufficient to build a linear classifier; these points are called the support vectors
Define the margin as the distance between H- and H+
What would be a good choice for w and b?
– Maximize the margin

9 Maximal-Margin Classification (VI)

From the equations of H- and H+, the distance between the two parallel hyperplanes is:
– Margin = |(1 - b) - (-1 - b)| / ||w|| = 2/||w||

So, we can maximize the margin by:
– Minimizing ||w||²
– Subject to: y_i(w·x_i + b) - 1 ≥ 0 for all i (this combines (1) and (2) above)

10 Minimizing ||w||²

Use Lagrange multipliers, one per constraint (i.e., one per training instance):
– For constraints of the form c_i ≥ 0 (see above):
– The constraint equations are multiplied by positive Lagrange multipliers α_i, and
– Subtracted from the objective function

Hence, we have the primal Lagrangian:
– L_P = ½||w||² - Σ_i α_i [y_i(w·x_i + b) - 1], with α_i ≥ 0

11 Maximizing L_D

It turns out, after some transformations beyond the scope of our discussion, that minimizing L_P is equivalent to maximizing the following dual Lagrangian:
– L_D = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
– where (x_i · x_j) denotes the dot product
– subject to: α_i ≥ 0 and Σ_i α_i y_i = 0

Support vectors are those instances for which α_i ≠ 0 (see the sketch below)
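A minimal sketch, assuming scikit-learn and NumPy, of solving this optimization and reading off the support vectors; the toy data is made up, and the large C value is only used to approximate the hard-margin case.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [-1, -1], [-2, -2], [-2, 0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6)  # large C approximates the hard margin
clf.fit(X, y)

print(clf.support_)          # indices of the instances with alpha != 0
print(clf.support_vectors_)  # the support vectors themselves
print(clf.dual_coef_)        # the products y_i * alpha_i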

12 SVM Learning (I)

We could stop here, and we would have a nice linear classification algorithm. SVM goes one step further:
– It assumes that non-linearly separable problems in low dimensions may become linearly separable in higher dimensions (e.g., XOR)

13 SVM Learning (II)

SVM thus:
– Creates a non-linear mapping from the low-dimensional space to a higher-dimensional space
– Uses maximal-margin (MM) learning in the new space

Computation remains efficient when "good" transformations are selected:
– The kernel trick

14 Choosing a Transformation (I)

Recall the formula for L_D
Note that it involves a dot product (x_i · x_j):
– Expensive to compute in high dimensions
– Gets worse if we transform to more dimensions

What if we did not have to?

15 Choosing a Transformation (II)

It turns out that it is possible to design transformations φ such that:
– φ(x)·φ(y) can be expressed in terms of the original dot product x·y

Hence, one needs only compute dot products in the original, lower-dimensional space

Example:
– φ: R² → R³ where φ(x) = (x1², √2·x1x2, x2²)
– Then φ(x)·φ(y) = (x·y)², as verified below
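A minimal sketch, assuming NumPy, verifying the kernel trick for the mapping above: φ(x)·φ(y) equals (x·y)², computed without ever leaving R².

import numpy as np

def phi(x):
    # Explicit mapping from R^2 to R^3
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, y):
    # The same quantity computed directly in the original space
    return (x @ y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y), kernel(x, y))  # both print 1.0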

16 Choosing a Kernel

One can start from a desired feature space and try to construct a kernel for it
More often, one starts from a reasonable kernel and may not analyze the feature space it induces
Some kernels are better suited to certain problems; domain knowledge can be helpful

Common kernels (sketched in code below):
– Polynomial
– Gaussian
– Sigmoidal
– Application-specific
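A minimal sketch, assuming NumPy, of the common kernels listed above; the parameter names (degree, c, gamma, kappa, theta) are illustrative conventions, not fixed by the slide.

import numpy as np

def polynomial_kernel(x, y, degree=3, c=1.0):
    return (x @ y + c) ** degree

def gaussian_kernel(x, y, gamma=0.5):
    # Also called the RBF kernel: exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoidal_kernel(x, y, kappa=1.0, theta=-1.0):
    return np.tanh(kappa * (x @ y) + theta)

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(polynomial_kernel(x, y), gaussian_kernel(x, y), sigmoidal_kernel(x, y))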

17 SVM Notes

Excellent empirical and theoretical potential
Multi-class problems are not handled naturally
How to choose the kernel is the main learning parameter:
– Each kernel also brings other parameters to be set (degree of polynomials, variance of Gaussians, etc.)

Speed and size are concerns for both training and testing; how to handle very large training sets is not yet solved
MM learning can overfit due to noise, or the problem may not be linearly separable within a reasonable feature space:
– Soft Margin is a common solution; it allows slack variables
– Each α_i is constrained to satisfy 0 ≤ α_i ≤ C; the C allows outliers. How to pick C? (one common approach is sketched below)
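A minimal sketch, assuming scikit-learn, of one common answer to "how to pick C": cross-validated grid search; the candidate values of C and gamma and the toy data are made up for illustration.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

grid = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid={'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1]},
    cv=5)   # pick the (C, gamma) pair with the best cross-validated accuracy
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)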

18 Chunking

Start with a reasonably sized subset of the data set (one that fits in memory and does not take too long during training)
Train on this subset, and keep just the support vectors or the m patterns with the highest α_i values
Grab another subset, add the current support vectors to it, and continue training
Note that this training may allow previous support vectors to be dropped as better ones are discovered
Repeat until all data is used and no new support vectors are added, or some other stopping criterion is fulfilled (see the sketch below)
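A minimal sketch, assuming scikit-learn and NumPy, of the chunking loop described above; chunk_size, the convergence test, and the toy data are illustrative choices, not part of the original algorithm statement.

import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, chunk_size=500, max_rounds=20):
    n, d = X.shape
    sv_X = np.empty((0, d))
    sv_y = np.empty(0, dtype=y.dtype)
    clf = SVC(kernel='linear')
    rng = np.random.RandomState(0)
    for _ in range(max_rounds):
        # Grab a chunk and add the current support vectors to it
        idx = rng.choice(n, size=min(chunk_size, n), replace=False)
        Xc = np.vstack([X[idx], sv_X])
        yc = np.concatenate([y[idx], sv_y])
        clf.fit(Xc, yc)
        # Stop when the support-vector set no longer grows
        if 0 < len(clf.support_vectors_) <= len(sv_X):
            break
        # Keep only the support vectors; earlier ones may be dropped here
        sv_X = clf.support_vectors_
        sv_y = yc[clf.support_]
    return clf

X = np.random.RandomState(1).randn(2000, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
print(len(chunked_svm(X, y).support_vectors_))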

19 Comparing Classifiers

20 Statistical Significance

How do we know that some measurement is statistically significant, rather than just a random perturbation?
– How good a predictor of generalization accuracy is the sample accuracy on a test set?
– Is a particular hypothesis really better than another one because its accuracy is higher on a validation set?
– When can we say that one learning algorithm is better than another for a particular task or set of tasks?

For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior in general for that task?

The question becomes: what is the likely difference between the sample error (estimator of the parameter) and the true error (true parameter value)?

Key point:
– What is the probability that the differences observed in our results are just due to chance, and thus not significant?

21 Sample Error

The error of hypothesis h with respect to target function f and sample S is:
– error_S(h) = (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))
– where n = |S| and δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise

22 True Error

The error of hypothesis h with respect to target function f and distribution D is:
– error_D(h) = Pr_{x∈D}[f(x) ≠ h(x)]
– i.e., the probability that h misclassifies an instance drawn at random according to D

23 The Question

We wish to know error_D(h)
We can only measure error_S(h)
How good an estimate of error_D(h) is provided by error_S(h)?

24 Confidence Interval

If h is a discrete-valued hypothesis, |S| = n ≥ 30, and the examples are drawn independently of h and one another, then with N% probability, error_D(h) lies in the interval:
– error_S(h) ± z_N · √(error_S(h)(1 - error_S(h)) / n)

Confidence level N%:  50%   68%   80%   90%   95%   98%   99%
Constant z_N:         0.67  1.00  1.28  1.64  1.96  2.33  2.58
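A minimal sketch, using only the Python standard library, of the interval above; the sample counts (13 misclassifications out of n = 100) are made up for illustration.

import math

def confidence_interval(errors, n, z=1.96):
    # z = 1.96 is the z_N constant for the 95% confidence level
    e = errors / n                          # sample error error_S(h)
    half = z * math.sqrt(e * (1 - e) / n)   # half-width of the interval
    return e - half, e + half

print(confidence_interval(13, 100))  # approx (0.064, 0.196)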

25 A Few Useful Facts (I)

The expected value of a random variable X, also known as the mean, is defined by:
– E[X] = Σ_i x_i Pr(X = x_i)

The Binomial distribution gives the probability of observing r heads in a series of n independent coin tosses, if the probability of heads in a single toss is p:
– Pr(X = r) = (n! / (r!(n - r)!)) · p^r (1 - p)^(n-r)

26 A Few Useful Facts (II)

The Normal distribution is the well-known bell-shaped distribution, arising often in nature:
– p(x) = (1/√(2πσ²)) · e^(-(x-μ)²/(2σ²))

Expected values:
– E[X] = μ
– Var(X) = σ², so the standard deviation is σ

27 A Few Useful Facts (III)

Estimating p from a random sample of coin tosses is equivalent to estimating error_D(h) from testing h on a random sample from D:
– Single coin toss ↔ single instance drawing
– Probability p that a single coin toss is heads ↔ probability that a single instance is misclassified
– Number r of heads observed over a sample of n coin tosses ↔ number of misclassifications observed over n randomly drawn instances

Hence, we have:
– p = error_D(h)
– r/n = error_S(h)
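A minimal simulation, assuming NumPy, of the correspondence above: drawing n instances, each misclassified with probability p, and estimating p by r/n; the values of p and n are made up.

import numpy as np

rng = np.random.RandomState(42)
p = 0.13   # true error rate error_D(h)
n = 1000   # number of randomly drawn test instances

misclassified = rng.rand(n) < p   # each draw is a biased coin toss
r = misclassified.sum()           # number of observed misclassifications

print(r / n)   # sample error error_S(h); close to p = error_D(h)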

28 A Few Useful Facts (IV)

A random variable can be viewed as the name of an experiment whose outcome is probabilistic
An estimator is a random variable X used to estimate some parameter p (e.g., the mean) of an underlying population
The bias of an estimator X is defined by:
– bias(X) = E[X] - p
An estimator X is unbiased if and only if:
– E[X] = p

29 A Few Useful Facts (V)

Since we have:
– p = error_D(h)
– r/n = error_S(h)
– E[X_Binomial] = np

It follows that:
– E[error_S(h)] = E[r/n] = E[r]/n = np/n = p = error_D(h)

Hence, error_S(h) is an unbiased estimator of error_D(h)

30 Comparing Hypotheses

We wish to estimate error_D(h_1) - error_D(h_2)
We measure error_S1(h_1) - error_S2(h_2), which turns out to be an unbiased estimator
In practice, it is OK to measure error_S(h_1) - error_S(h_2) on the same test set S, which has lower variance

31 Comparing Classifiers (I)

We wish to estimate the expected difference in true error between two learning algorithms L_A and L_B:
– E_{S⊂D}[error_D(L_A(S)) - error_D(L_B(S))]

In practice, we have a single sample D_0:
– Split it into a training set S_0 and a test set T_0, and measure:
– error_T0(L_A(S_0)) - error_T0(L_B(S_0))

32 Comparing Classifiers (II)

Problems:
– We use error_T0(h) to estimate error_D(h)
– We measure the difference on S_0 alone (not the expected value over all samples)

Improvement:
– Repeatedly partition the dataset D_0 into N disjoint train (S_i) / test (T_i) sets (e.g., using n-fold cross-validation)
– Compute, for each split, δ_i = error_Ti(L_A(S_i)) - error_Ti(L_B(S_i)), and the mean δ̄ = (1/N) Σ_i δ_i

33 Comparing Classifiers (III)

Derive:
– t = δ̄ / s_δ̄, where s_δ̄ = √( (1 / (N(N - 1))) Σ_i (δ_i - δ̄)² )

t is known as the Student's t statistic:
– Choose a significance level q
– With N - 1 degrees of freedom, if the computed t exceeds the value in the t table, then the difference is statistically significant at that level (a code sketch follows)
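A minimal sketch, assuming NumPy, SciPy, and scikit-learn, of the paired t-test above, comparing two classifiers over N = 10 cross-validation folds; the synthetic data and the two chosen algorithms are illustrative.

import numpy as np
from scipy import stats
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, :2].sum(axis=1) > 0).astype(int)

# Use the same folds for both learners so the per-fold differences pair up
cv = KFold(n_splits=10, shuffle=True, random_state=0)
acc_a = cross_val_score(SVC(kernel='rbf'), X, y, cv=cv)
acc_b = cross_val_score(Perceptron(), X, y, cv=cv)

deltas = acc_a - acc_b                      # delta_i, one per fold
t, p_value = stats.ttest_rel(acc_a, acc_b)  # Student's t, N-1 degrees of freedom
print(deltas.mean(), t, p_value)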

34 END OF DAY 5 Homework: SVM, Neural Networks

