Lecture 07: Soft-margin SVM


1 Lecture 07: Soft-margin SVM
CS489/698: Intro to ML Lecture 07: Soft-margin SVM 10/5/17 Yao-Liang Yu

2 Announcement
A2 due tonight
Proposal due next Tuesday
ICLR 2018 Reproducibility Challenge:
select a paper from the 2018 ICLR submissions and aim to replicate the experiments described in the paper
produce a reproducibility report describing the target questions, experimental methodology, implementation details, analysis and discussion of findings, and conclusions on the reproducibility of the paper

3 Outline: Formulation, Dual, Optimization, Extension

4 Hard-margin SVM: the primal with hard constraints, and its dual.
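The slide's equations did not survive the transcript; a minimal restatement of the standard hard-margin primal and dual, for training data (x_i, y_i) with y_i in {-1, +1}, is

  \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \ i = 1,\dots,n

  \max_{\alpha \ge 0}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0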

5 What if inseparable?

6 Soft-margin (Cortes & Vapnik’95)
Primal with hard constraints: the objective is proportional to 1/margin. Primal with soft constraints: C is a hyper-parameter trading the margin off against the training error, and the constraints involve the raw prediction w^T x + b (no sign taken).
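In the standard form (the slide's equation is missing from the transcript), with slack variables ξ_i measuring the training error:

  \min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0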

7 Zero-one loss
Find a prediction rule f so that, on an unseen random X, the prediction sign(f(X)) has a small chance of differing from the true label Y.
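In symbols (one standard convention, not shown in the transcript), the zero-one loss and the risk to be minimized are

  \ell_{01}(y, f(x)) = \mathbf{1}[\, y f(x) \le 0 \,], \qquad R_{01}(f) = \Pr\bigl(\operatorname{sign} f(X) \ne Y\bigr)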

8 The hinge loss: an upper bound on the zero-one loss
The slide plots several margin losses for comparison: hinge, squared hinge, logistic loss, exponential loss, and the zero-one loss. Note that the hinge can still suffer loss even when the prediction is correct (margin between 0 and 1).
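Written as functions of the margin t = y f(x) (standard definitions), the losses in the plot are

  \ell_{01}(t) = \mathbf{1}[t \le 0], \quad \ell_{\mathrm{hinge}}(t) = \max(0, 1-t), \quad \ell_{\mathrm{hinge}^2}(t) = \max(0, 1-t)^2, \quad \ell_{\mathrm{logistic}}(t) = \log(1 + e^{-t}), \quad \ell_{\exp}(t) = e^{-t}

Hinge, squared hinge, and exponential upper-bound the zero-one loss; the logistic loss does so after rescaling by 1/\log 2.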

9 Classification-calibration
We want to minimize the zero-one loss but end up minimizing some other (surrogate) loss. Classification calibration: the minimizer of the surrogate risk has the same sign as the Bayes rule. Theorem (Bartlett, Jordan, McAuliffe'06). A convex margin loss ℓ is classification-calibrated iff ℓ is differentiable at 0 and ℓ'(0) < 0.

10 Outline: Formulation, Dual, Optimization, Extension

11 Important optimization trick
Minimize jointly over x and an auxiliary variable t (see the identity below).
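A sketch of the trick presumably meant here (the epigraph reformulation of a pointwise maximum):

  \min_x\ \max\bigl(a(x),\, b(x)\bigr) \;=\; \min_{x,\,t}\ t \quad \text{s.t.}\quad t \ge a(x),\ \ t \ge b(x)

The minimization on the right is joint over x and t, and at the optimum t equals the maximum.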

12 Slack for “wrong” prediction
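Applied to the hinge term, the optimal slack has the standard closed form

  \xi_i = \max\bigl(0,\ 1 - y_i(w^\top x_i + b)\bigr)

so ξ_i > 0 exactly when example i is inside the margin or on the wrong side.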

13 Lagrangian
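For reference (the slide's equation is not in the transcript), the Lagrangian of the soft-margin primal, with multipliers α_i ≥ 0 for the margin constraints and β_i ≥ 0 for ξ_i ≥ 0, is

  L(w, b, \xi, \alpha, \beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\bigl(1 - \xi_i - y_i(w^\top x_i + b)\bigr) - \sum_i \beta_i \xi_i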

14 Dual problem: only dot products between training points are needed!
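Minimizing the Lagrangian over w, b, ξ and maximizing over the multipliers gives the standard dual (note the data appear only through the dot products x_i^T x_j):

  \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0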

15 The effect of C. The primal variables (w, b) live in R^d x R (plus n slacks); the dual variable α lives in R^n. What happens as C → 0? As C → ∞?

16 Karush-Kuhn-Tucker conditions
Primal constraints on w, b and ξ
Dual constraints on α and β
Complementary slackness
Stationarity
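Spelled out for the soft-margin SVM (standard conditions; the slide's equations were lost in the transcript):

  \text{primal feasibility:}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
  \text{dual feasibility:}\quad \alpha_i \ge 0,\ \ \beta_i \ge 0
  \text{complementary slackness:}\quad \alpha_i\bigl(1 - \xi_i - y_i(w^\top x_i + b)\bigr) = 0,\ \ \beta_i \xi_i = 0
  \text{stationarity:}\quad w = \sum_i \alpha_i y_i x_i,\ \ \sum_i \alpha_i y_i = 0,\ \ \alpha_i + \beta_i = C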

17 Parsing the equations

18 Support Vectors
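What the KKT conditions imply (standard reading, since the slide's content is not in the transcript):

  \alpha_i = 0 \ \Rightarrow\ y_i(w^\top x_i + b) \ge 1 \quad (\text{on or outside the margin})
  0 < \alpha_i < C \ \Rightarrow\ \xi_i = 0,\ \ y_i(w^\top x_i + b) = 1 \quad (\text{exactly on the margin})
  \alpha_i = C \ \Rightarrow\ y_i(w^\top x_i + b) \le 1 \quad (\text{inside the margin or misclassified})

Since w = \sum_i \alpha_i y_i x_i, only the points with α_i > 0, the support vectors, contribute to the solution.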

19 Recover b: take any i such that 0 < α_i < C. Then x_i is on the margin hyperplane:
How to recover ξ? What if there is no such i?
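Concretely (standard derivation): for such an i, complementary slackness gives ξ_i = 0 and y_i(w^T x_i + b) = 1, hence

  b = y_i - w^\top x_i = y_i - \sum_j \alpha_j y_j\, x_j^\top x_i, \qquad \xi_j = \max\bigl(0,\ 1 - y_j(w^\top x_j + b)\bigr)\ \text{for all } j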

20 More examples

21 Just in case

22 Outline: Formulation, Dual, Optimization, Extension

23 Gradient Descent
Step size (learning rate): constant if the objective is smooth; diminishing otherwise
Each (generalized) gradient evaluation costs O(nd)! (see the sketch below)
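A minimal sketch of subgradient descent on the primal soft-margin objective, assuming NumPy arrays X (n x d) and y (labels in {-1, +1}); the function name and hyper-parameters are illustrative, not from the lecture:

import numpy as np

def svm_primal_gd(X, y, C=1.0, lr0=0.1, iters=1000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(iters):
        margins = y * (X @ w + b)                    # all n margins: the O(nd) step
        viol = margins < 1                           # examples with positive hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        step = lr0 / np.sqrt(t + 1)                  # diminishing step (hinge is non-smooth)
        w, b = w - step * grad_w, b - step * grad_b
    return w, b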

24 Stochastic Gradient Descent (SGD)
The full gradient averages over all n samples; a single random sample gives an unbiased estimate at O(d) per step
Use a diminishing step size, e.g., 1/sqrt(t) or 1/t
Common refinements: averaging, momentum, variance reduction, etc.
In practice: sample without replacement, i.e., cycle through the data or permute it in each pass (see the sketch below)
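A matching SGD sketch (one example per update; same assumptions on X and y as above, names again illustrative):

import numpy as np

def svm_primal_sgd(X, y, C=1.0, lr0=0.1, epochs=10, seed=0):
    """SGD on the soft-margin objective; the data are permuted in each pass."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):                 # permute in each pass
            t += 1
            step = lr0 / np.sqrt(t)                  # diminishing step size
            gw, gb = w, 0.0                          # gradient of the regularizer
            if y[i] * (X[i] @ w + b) < 1:            # hinge term active: O(d) work
                gw = gw - C * n * y[i] * X[i]        # n * per-example term keeps it unbiased
                gb = -C * n * y[i]
            w, b = w - step * gw, b - step * gb
    return w, b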

25 The derivative
What about the zero-one loss? All the other losses are differentiable (the hinge almost everywhere).
What about the perceptron?
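For the hinge loss with margin t_i = y_i(w^T x_i + b), a standard (sub)derivative with respect to w is

  \partial_w \max(0,\, 1 - t_i) \ni \begin{cases} -y_i x_i, & t_i < 1 \\ 0, & t_i > 1 \end{cases}

(any convex combination is valid at t_i = 1). The zero-one loss, by contrast, has zero derivative almost everywhere, so gradient methods cannot use it directly.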

26 Solving the dual: each (projected) gradient step costs O(n^2); a constant step size η_t = η can be chosen.
Faster algorithms exist: e.g., choose a pair α_p and α_q and derive a closed-form update for the two of them (the idea behind SMO).

27 A little history on optimization
Gradient descent first mentioned in (Cauchy, 1847)
First rigorous convergence proof in (Curry, 1944)
SGD proposed and analyzed in (Robbins & Monro, 1951)

28 Herbert Robbins (1915 – 2001)

29 Outline: Formulation, Dual, Optimization, Extension

30 Multiclass (Crammer & Singer’01)
Separate the prediction for the correct class from the predictions for the wrong classes by a "safety margin". The soft-margin version is similar. Many other variants exist; the calibration theory is more involved.
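One common way to write the Crammer-Singer hard-margin formulation (the slide's equation is missing; w_k is the weight vector for class k, and prediction is argmax_k w_k^T x):

  \min_{\{w_k\}}\ \tfrac{1}{2}\sum_k \|w_k\|^2 \quad \text{s.t.}\quad w_{y_i}^\top x_i \ \ge\ \max_{k \ne y_i} w_k^\top x_i + 1,\ \ i = 1,\dots,n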

31 Regression (Drucker et al.’97)
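For reference, the ε-insensitive support vector regression of Drucker et al. '97 is usually written as (standard form, not taken verbatim from the slide):

  \min_{w,b,\xi,\xi^*}\ \tfrac{1}{2}\|w\|^2 + C\sum_i (\xi_i + \xi_i^*) \quad \text{s.t.}\quad y_i - (w^\top x_i + b) \le \epsilon + \xi_i,\ \ (w^\top x_i + b) - y_i \le \epsilon + \xi_i^*,\ \ \xi_i,\ \xi_i^* \ge 0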

32 Large-scale training (You, Demmel, et al.’17)
Randomly partition the training data evenly into p nodes
Train an SVM independently on each node
Compute the center of the data on each node
For a test sample: find the nearest center (node / SVM) and predict using the corresponding node's SVM (see the sketch below)
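A minimal sketch of this partition-then-route scheme, using scikit-learn's LinearSVC as a stand-in for the per-node SVM; function names are illustrative, and the actual paper's implementation is more careful (e.g., about class balance within each partition):

import numpy as np
from sklearn.svm import LinearSVC

def train_partitioned_svm(X, y, p=4, seed=0):
    """Randomly split the data into p parts; fit one linear SVM and one center per part."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), p)   # even random partition
    models = [LinearSVC().fit(X[idx], y[idx]) for idx in parts]
    centers = np.vstack([X[idx].mean(axis=0) for idx in parts])
    return models, centers

def predict_partitioned_svm(models, centers, X_test):
    """Route each test point to the node with the nearest center and use that node's SVM."""
    dists = ((X_test[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    nearest = dists.argmin(axis=1)
    return np.array([models[k].predict(x[None, :])[0] for k, x in zip(nearest, X_test)])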

33 Questions?

