Stochastic Subgradient Approach for Solving Linear Support Vector Machines
Jan Rupnik, Jozef Stefan Institute

Outline: Introduction; Support Vector Machines; Stochastic Subgradient Descent SVM - Pegasos; Experiments

Introduction Support Vector Machines (SVMs) have become one of the most popular classification tools of the last decade. Straightforward implementations could not handle large sets of training examples. Recently, methods for solving SVMs with linear computational complexity have appeared. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.

Hard Margin SVM Many possible hyperplanes perfectly separate the two classes, but only one hyperplane achieves the maximum margin. The training examples that lie on the margin are called support vectors. (The slide's figure contrasted an arbitrary separating hyperplane, drawn in red, with the maximum-margin hyperplane, drawn in blue.)

Soft Margin SVM Allow a small number of examples to be misclassified in order to find a large-margin classifier. The trade-off between classification accuracy on the training set and the size of the margin is a parameter of the SVM.

Problem setting Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be the set of input-output pairs, where $x_i \in \mathbb{R}^N$ and $y_i \in \{-1, 1\}$. Find the hyperplane with normal vector $w \in \mathbb{R}^N$ and offset $b \in \mathbb{R}$ that has good classification accuracy on the training set $S$ and a large margin. Classify a new example $x$ as $\mathrm{sign}(w'x - b)$.

Optimization problem Regularized hinge loss: $\min_w \; \frac{\lambda}{2} w'w + \frac{1}{m} \sum_i \left(1 - y_i(w'x_i - b)\right)_+$, where $(1 - z)_+ := \max\{0, 1 - z\}$ is the hinge loss. The first term controls the size of the margin, the second is the expected hinge loss on the training set, and $\lambda$ trades off margin against loss. The quantity $y_i(w'x_i - b)$ is positive for correctly classified examples and negative otherwise. The first summand is a quadratic function and the sum is a piecewise linear function, so the whole objective is piecewise quadratic.
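A minimal sketch of this objective and of the classification rule in NumPy (not the authors' code; X, y, w, b and lam are illustrative names):

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """lam/2 * w'w + mean of hinge losses (1 - y_i(w'x_i - b))_+."""
    margins = y * (X @ w - b)               # y_i (w'x_i - b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # (1 - z)_+ = max{0, 1 - z}
    return 0.5 * lam * (w @ w) + hinge.mean()

def predict(w, b, X):
    """Classify new examples as sign(w'x - b)."""
    return np.sign(X @ w - b)
```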

Perceptron We ignore the offset parameter b from now on (b = 0). Regularized hinge loss (SVM): $\min_w \; \frac{\lambda}{2} w'w + \frac{1}{m} \sum_i (1 - y_i w'x_i)_+$. Perceptron: $\min_w \; \frac{1}{m} \sum_i (-y_i w'x_i)_+$.

Loss functions Standard 0/1 loss: penalizes all incorrectly classified examples by the same amount. Perceptron loss: penalizes an incorrectly classified example x proportionally to the size of |w'x|. Hinge loss: penalizes incorrectly classified examples as well as examples that are correctly classified but fall within the margin. (The slide showed plots of the three losses as functions of the margin.)
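An illustrative comparison of the three losses as functions of the signed margin z = y * w'x (function names are mine, not from the slides):

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)     # same penalty for every misclassified example

def perceptron_loss(z):
    return np.maximum(0.0, -z)        # penalty proportional to |w'x| when wrong

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)   # also penalizes correct examples inside the margin

z = np.linspace(-2.0, 2.0, 9)
print(zero_one_loss(z), perceptron_loss(z), hinge_loss(z), sep="\n")
```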

Stochastic Subgradient Descent Gradient descent optimization is used in the perceptron; subgradient descent is used in Pegasos, whose objective is non-differentiable. Where the objective is differentiable the gradient is unique; at a non-differentiable point there can be many subgradients, all equally valid.

Stochastic Subgradient Subgradient in the perceptron: $\frac{1}{m} \sum (-y_i x_i)$, summed over the misclassified examples. Subgradient in the SVM: $\lambda w + \frac{1}{m} \sum (-y_i x_i)$, summed over the examples with non-zero hinge loss, i.e. $y_i w'x_i < 1$. For every point $w$ the subgradient is a function of the training sample $S$. We can estimate it from a smaller random subset $A \subseteq S$ of size $k$ (the stochastic part) and speed up the computation. Stochastic subgradient in the SVM: $\lambda w + \frac{1}{k} \sum_{(x,y) \in A,\; y w'x < 1} (-y x)$.
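A sketch of this stochastic subgradient estimate on a random subset A of size k (assumed variable names, not the authors' code):

```python
import numpy as np

def stochastic_subgradient(w, X, y, lam, k, rng):
    idx = rng.choice(len(y), size=k, replace=False)    # the random subset A
    XA, yA = X[idx], y[idx]
    violators = yA * (XA @ w) < 1.0                    # examples with non-zero hinge loss
    # lam * w  +  1/k * sum over violators of (-y x)
    return lam * w - (yA[violators, None] * XA[violators]).sum(axis=0) / k
```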

Pegasos – the algorithm At each iteration t: subsample a random set $A_t$ of $k$ training examples; the margin violators in $A_t$ contribute $-yx$ to the subgradient, while on the other training points the subgradient of the loss is zero; take a subgradient step with learning rate $\mu_t = 1/(\lambda t)$; finally rescale $w$, projecting it back into the ball of radius $1/\sqrt{\lambda}$.
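A compact sketch of this loop in Python (my paraphrase of the published Pegasos update, not the authors' implementation; b = 0 as above, and all parameter names are illustrative):

```python
import numpy as np

def pegasos(X, y, lam=1e-4, T=1000, k=1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        idx = rng.choice(len(y), size=k, replace=False)   # subsample A_t
        XA, yA = X[idx], y[idx]
        violators = yA * (XA @ w) < 1.0                   # margin violations in A_t
        mu = 1.0 / (lam * t)                              # aggressive learning rate mu_t
        grad = lam * w - (yA[violators, None] * XA[violators]).sum(axis=0) / k
        w = w - mu * grad                                 # subgradient step
        radius = 1.0 / np.sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > radius:                                 # projection into the ball (rescaling)
            w *= radius / norm
    return w
```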

What's new in Pegasos? The subgradient descent technique is 50 years old and the soft margin SVM is 14 years old. Gradient descent methods typically suffer from slow convergence. The authors of Pegasos proved that an aggressive decrease of the learning rate $\mu_t$ still leads to convergence: previous works used $\mu_t = 1/(\lambda \sqrt{t})$, while Pegasos uses $\mu_t = 1/(\lambda t)$. They also proved that the solution always lies in a ball of radius $1/\sqrt{\lambda}$.

SVM Light A popular SVM solver with superlinear computational complexity. It solves a large quadratic program; the solution can be expressed in terms of a small subset of training vectors, called support vectors. It uses an active set method to find the support vectors and solves a series of smaller quadratic problems, producing highly accurate solutions. Algorithm and implementation by Thorsten Joachims. T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.

Experiments Data: Reuters RCV2, roughly 800,000 news articles, already preprocessed into bag-of-words vectors and publicly available. The number of features is roughly 50,000, and the vectors are sparse. The classification target was the category CCAT, which consists of 381,327 news articles.

Quick convergence to suboptimal solutions 200 iterations took 9.2 CPU seconds and gave an objective value only 0.3% higher than the optimum; 560 iterations sufficed for a 0.1% accurate solution. For comparison, SVM Light takes roughly 4 hours of CPU time.

Test set error Optimizing the objective value to a high precision is often not necessary; the lowest error on the test set is achieved much earlier.

Parameters k and T The product kT determines how close to the optimal value we get. If kT is fixed, the particular choice of k does not play a significant role.
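To make the role of k and T concrete, here is a hypothetical usage example that reuses the pegasos() and svm_objective() sketches from above on synthetic data (all names and numbers are mine, not the experiments from the talk); runs sharing the same product kT should reach comparably good objective values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=2000))   # noisy linearly separable labels

for k, T in [(1, 4000), (10, 400), (100, 40)]:       # k*T fixed at 4000
    w = pegasos(X, y, lam=1e-3, T=T, k=k, seed=k)
    print(k, T, svm_objective(w, 0.0, X, y, lam=1e-3))
```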

Conclusions and final notes Pegasos is one of the most efficient suboptimal SVM solvers. Suboptimal solutions often generalize well to new examples. It can take advantage of sparsity and it is a linear solver; nonlinear extensions have been proposed, but they suffer from slower convergence.

Thank you!