Lecture 07: Soft-margin SVM

CS489/698: Intro to ML, Lecture 07: Soft-margin SVM (10/5/17, Yao-Liang Yu)

Announcements: A2 is due tonight; the proposal is due next Tuesday. ICLR 2018 Reproducibility Challenge: select a paper from the 2018 ICLR submissions and aim to replicate the experiments described in it, then produce a reproducibility report describing the target questions, experimental methodology, implementation details, analysis and discussion of findings, and conclusions on the reproducibility of the paper.

Outline: Formulation, Dual, Optimization, Extension.

Hard-margin SVM: the primal problem with hard constraints, and its dual.
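The slide's formulas are not preserved in the transcript; the standard hard-margin primal and dual (assuming the usual notation, with training pairs (x_i, y_i) and labels y_i in {-1, +1}) are:

\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ i=1,\dots,n

\max_{\alpha \ge 0}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0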

What if the data are not linearly separable?

Soft-margin SVM (Cortes & Vapnik, 1995): start from the primal with hard constraints. In the objective, the first term is proportional to 1/margin, C is a hyper-parameter, the slack terms account for the training error, and w^T x + b is the prediction (without taking the sign). Relaxing the hard constraints gives the primal with soft constraints.
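For reference, the standard soft-margin primal that these annotations describe (the usual Cortes & Vapnik formulation, not copied from the slide) is:

\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

or equivalently, in unconstrained form with the hinge loss,

\min_{w,b}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^n \max\{0,\ 1 - y_i(w^\top x_i + b)\}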

Zero-one loss: find a prediction rule f so that, on an unseen random X, our prediction sign(f(X)) has a small chance of being different from the true label Y.

The hinge loss is an upper bound on the zero-one loss; the squared hinge, logistic, and exponential losses are other common surrogates. Note that the hinge loss still suffers a loss on points that are correctly classified but lie inside the margin.
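The plot on this slide is not recoverable; as a reference, the standard definitions of these losses, written as functions of the margin t = y f(x), are:

\ell_{0/1}(t) = \mathbf{1}[t \le 0], \quad \ell_{\text{hinge}}(t) = \max\{0,\ 1-t\}, \quad \ell_{\text{sq-hinge}}(t) = \max\{0,\ 1-t\}^2, \quad \ell_{\text{logistic}}(t) = \log(1 + e^{-t}), \quad \ell_{\text{exp}}(t) = e^{-t}

The hinge, squared hinge, and exponential losses upper-bound the zero-one loss, and the hinge is strictly positive for 0 < t < 1, i.e., inside the margin.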

Classification calibration: we want to minimize the zero-one loss but end up minimizing some other (surrogate) loss. Theorem (Bartlett, Jordan, McAuliffe, 2006): any convex margin loss ℓ is classification-calibrated iff ℓ is differentiable at 0 and ℓ'(0) < 0. Classification calibration means that the minimizer of the surrogate risk has the same sign as the Bayes rule, i.e., as P(Y = 1 | X = x) - 1/2.

Outline: Formulation, Dual, Optimization, Extension.

An important optimization trick: minimize jointly over x and t.
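The formulas on this slide are missing from the transcript; a standard statement of the trick, which is what turns the hinge-loss objective into the soft-constraint primal, is:

\min_x\ \max\{g(x),\ 0\} \;=\; \min_{x,\,t}\ t \quad \text{s.t.}\quad t \ge g(x),\ t \ge 0,

where the minimization on the right is joint over x and t. Applying this to each hinge term max{0, 1 - y_i(w^\top x_i + b)} introduces exactly the slack variables ξ_i.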

Slack for a "wrong" prediction.

The Lagrangian.
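The Lagrangian itself is not in the transcript; for the soft-margin primal above, with multipliers α_i ≥ 0 for the margin constraints and β_i ≥ 0 for the constraints ξ_i ≥ 0, the standard form is:

L(w, b, \xi, \alpha, \beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big[y_i(w^\top x_i + b) - 1 + \xi_i\big] - \sum_i \beta_i \xi_i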

The dual problem: only dot products between data points are needed!
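Again the formula is missing; minimizing the Lagrangian over (w, b, ξ) and maximizing over the multipliers gives the standard soft-margin dual:

\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0

The data enter only through the dot products x_i^\top x_j, which is what later allows the kernel trick.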

The effect of C: what happens as C → 0? As C → ∞? (The dimensions R^d × R and R^n annotate the primal variables (w, b) and the dual variable α, respectively.)

Karush-Kuhn-Tucker (KKT) conditions: primal feasibility constraints on w, b and ξ; dual feasibility constraints on α and β; complementary slackness; stationarity.
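Written out (the standard KKT system for the soft-margin problem; the slide's own equations are not in the transcript):

\text{Primal feasibility:}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
\text{Dual feasibility:}\quad \alpha_i \ge 0,\ \ \beta_i \ge 0
\text{Complementary slackness:}\quad \alpha_i\big[y_i(w^\top x_i + b) - 1 + \xi_i\big] = 0,\ \ \beta_i \xi_i = 0
\text{Stationarity:}\quad w = \sum_i \alpha_i y_i x_i,\quad \sum_i \alpha_i y_i = 0,\quad \alpha_i + \beta_i = C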

Parsing the equations.

Support vectors: the training points with αi > 0; only these contribute to w = Σ αi yi xi.

Recovering b: take any i such that 0 < αi < C. Then xi is on the margin hyperplane, i.e., yi(w^T xi + b) = 1, from which b can be solved. How do we recover ξ? What if there is no such i?
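Concretely (a standard derivation, assuming the KKT conditions above), since y_i^2 = 1:

b = y_i - w^\top x_i \quad \text{for any } i \text{ with } 0 < \alpha_i < C, \qquad \xi_j = \max\{0,\ 1 - y_j(w^\top x_j + b)\}\ \text{for all } j.

If no such i exists, b is only determined up to an interval, and any value in that interval (bounded by the examples with αi = 0 and αi = C) is optimal.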

More examples.

Just in case.

Outline: Formulation, Dual, Optimization, Extension.

Gradient descent: take steps along the (generalized) gradient, with a step size (learning rate) that can be constant if the objective L is smooth and must be diminishing otherwise. Each iteration costs O(nd)!

Stochastic gradient descent (SGD): instead of averaging the gradient over all n samples, a single random sample suffices, bringing the per-iteration cost down to O(d). Use a diminishing step size, e.g., 1/sqrt(t) or 1/t. Common refinements include averaging, momentum, variance reduction, etc.; in practice one often samples without replacement, cycling through a fresh permutation in each pass. A code sketch follows below.
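A minimal sketch of SGD on the soft-margin objective in its hinge-loss form, written in Python for illustration (this is not the lecture's code; the function name svm_sgd and the regularization constant lam, which plays the role of 1/(nC), are assumptions):

import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=10):
    """SGD on (lam/2)*||w||^2 + (1/n)*sum_i max(0, 1 - y_i*(w.x_i + b)).

    X: (n, d) array of features; y: (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):  # fresh permutation each pass
            t += 1
            eta = 1.0 / (lam * t)           # diminishing step size ~ 1/t
            margin = y[i] * (X[i] @ w + b)
            # subgradient step: the hinge term is active iff margin < 1
            if margin < 1:
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b = b + eta * y[i]
            else:
                w = (1 - eta * lam) * w
    return w, b

On toy data, svm_sgd(X, y) returns a weight vector and bias, and predictions are sign(X @ w + b).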

The derivative: what about the zero-one loss? All the other losses are differentiable. What about the perceptron?
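For the hinge loss, which is not differentiable at the kink, a standard subgradient with respect to w (the one used in the sketch above) is:

\partial_w \max\{0,\ 1 - y_i(w^\top x_i + b)\} \ni \begin{cases} -y_i x_i, & y_i(w^\top x_i + b) < 1 \\ 0, & y_i(w^\top x_i + b) > 1 \end{cases}

with any point on the segment between the two at equality. The zero-one loss, by contrast, has zero derivative almost everywhere, so it cannot be optimized this way.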

Solving the dual: each iteration costs O(n^2). One can choose a constant step size ηt = η. Faster algorithms exist: e.g., choose a pair αp and αq and derive a closed-form update (the idea behind SMO).

A little history on optimization: gradient descent was first mentioned in (Cauchy, 1847); the first rigorous convergence proof appeared in (Curry, 1944); SGD was proposed and analyzed in (Robbins & Monro, 1951).

Herbert Robbins (1915–2001).

Outline: Formulation, Dual, Optimization, Extension.

Multiclass SVM (Crammer & Singer, 2001): separate the prediction for the correct class from the predictions for the wrong classes by a "safety margin". The soft-margin version is similar. Many other variants exist, and the calibration theory is more involved.
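The formulation itself is not in the transcript; the standard Crammer-Singer soft-margin problem (one weight vector w_k per class, a single slack per example) is:

\min_{\{w_k\},\,\xi}\ \tfrac{1}{2}\sum_k \|w_k\|^2 + C\sum_i \xi_i \quad \text{s.t.}\quad w_{y_i}^\top x_i - w_k^\top x_i \ge 1 - \xi_i \ \ \forall k \ne y_i,\ \ \xi_i \ge 0

with prediction \arg\max_k w_k^\top x.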

Regression: support vector regression (Drucker et al., 1997).
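The slide's formulation is missing; the standard ε-insensitive support vector regression primal is:

\min_{w,b,\xi,\xi^*}\ \tfrac{1}{2}\|w\|^2 + C\sum_i (\xi_i + \xi_i^*) \quad \text{s.t.}\quad y_i - w^\top x_i - b \le \varepsilon + \xi_i,\ \ w^\top x_i + b - y_i \le \varepsilon + \xi_i^*,\ \ \xi_i, \xi_i^* \ge 0

i.e., errors within ±ε are not penalized at all, and larger deviations are penalized linearly.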

Large-scale training (You, Demmel, et al., 2017): randomly partition the training data evenly across p nodes; train an SVM independently on each node; compute the center of the data on each node. For a test sample: find the nearest center (node / SVM) and predict using the corresponding node's SVM. A code sketch follows below.
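A minimal sketch of this partition-and-predict scheme, for illustration only (train_svm stands for any binary SVM trainer returning a (w, b) pair, e.g., the svm_sgd sketch above; the function names are assumptions, not the paper's code):

import numpy as np

def train_partitioned(X, y, p, train_svm):
    """Randomly split the data across p nodes and train one SVM per node."""
    idx = np.random.permutation(len(X))
    parts = np.array_split(idx, p)                  # even random partition
    centers, models = [], []
    for part in parts:
        centers.append(X[part].mean(axis=0))        # center of this node's data
        models.append(train_svm(X[part], y[part]))  # independent SVM per node
    return np.array(centers), models

def predict_partitioned(x, centers, models):
    """Route a test point to the node with the nearest center and predict there."""
    k = np.argmin(np.linalg.norm(centers - x, axis=1))
    w, b = models[k]
    return np.sign(x @ w + b)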

Questions?