Large Margin classifiers

Slides:

Advertisements

Similar presentations

Regularization David Kauchak CS 451 – Fall 2013.

Advertisements

Linear Classifiers/SVMs

A KTEC Center of Excellence 1 Pattern Analysis using Convex Optimization: Part 2 of Chapter 7 Discussion Presenter: Brian Quanz.

Data Mining Classification: Alternative Techniques

SOFT LARGE MARGIN CLASSIFIERS David Kauchak CS 451 – Fall 2013.

Linear Separators.

1 Lecture 5 Support Vector Machines Large-margin linear classifier Non-separable case The Kernel trick.

PERCEPTRON LEARNING David Kauchak CS 451 – Fall 2013.

Linear Separators.

Support Vector Machines (and Kernel Methods in general)

Prénom Nom Document Analysis: Linear Discrimination Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

Linear Separators. Bankruptcy example R is the ratio of earnings to expenses L is the number of late payments on credit cards over the past year. We will.

Support Vector Machines Based on Burges (1998), Scholkopf (1998), Cristianini and Shawe-Taylor (2000), and Hastie et al. (2001) David Madigan.

The Perceptron Algorithm (Dual Form) Given a linearly separable training setand Repeat: until no mistakes made within the for loop return:

Lecture outline Support vector machines. Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data.

Support Vector Machines

Lecture 10: Support Vector Machines

1 Linear Classification Problem Two approaches: -Fisher’s Linear Discriminant Analysis -Logistic regression model.

Support Vector Machines a.k.a, Whirlwind o’ Vector Algebra Sec. 6.3 SVM Tutorial by C. Burges (on class “resources” page)

Support Vector Machines

Machine Learning Week 4 Lecture 1. Hand In Data Is coming online later today. I keep test set with approx test images That will be your real test.

SVM by Sequential Minimal Optimization (SMO)

ADVANCED CLASSIFICATION TECHNIQUES David Kauchak CS 159 – Fall 2014.

Stochastic Subgradient Approach for Solving Linear Support Vector Machines Jan Rupnik Jozef Stefan Institute.

CS Statistical Machine learning Lecture 18 Yuan (Alan) Qi Purdue CS Oct

LARGE MARGIN CLASSIFIERS David Kauchak CS 451 – Fall 2013.

Recognition II Ali Farhadi. We have talked about Nearest Neighbor Naïve Bayes Logistic Regression Boosting.

Computational Intelligence: Methods and Applications Lecture 23 Logistic discrimination and support vectors Włodzisław Duch Dept. of Informatics, UMK Google:

Machine Learning Weak 4 Lecture 2. Hand in Data It is online Only around 6000 images!!! Deadline is one week. Next Thursday lecture will be only one hour.

Lecture 4 Linear machine

An Introduction to Support Vector Machine (SVM)

Linear Models for Classification

Today’s Topics 11/10/15CS Fall 2015 (Shavlik©), Lecture 22, Week 101 Support Vector Machines (SVMs) Three Key Ideas –Max Margins –Allowing Misclassified.

1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.

1  Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.

METU Informatics Institute Min720 Pattern Classification with Bio-Medical Applications Part 7: Linear and Generalized Discriminant Functions.

Classification Course web page: vision.cis.udel.edu/~cv May 14, 2003  Lecture 34.

Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)

Lecture 14. Outline Support Vector Machine 1. Overview of SVM 2. Problem setting of linear separators 3. Soft Margin Method 4. Lagrange Multiplier Method.

Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois.

Non-separable SVM's, and non-linear classification using kernels Jakob Verbeek December 16, 2011 Course website:

Support Vector Machines

Support vector machines

PREDICT 422: Practical Machine Learning

Support Vector Machine

Gradient descent David Kauchak CS 158 – Fall 2016.

LINEAR CLASSIFIERS The Problem: Consider a two class task with ω1, ω2.

MIRA, SVM, k-NN Lirong Xia. MIRA, SVM, k-NN Lirong Xia.

Dan Roth Department of Computer and Information Science

Support Vector Machines

Announcements HW4 due today (11:59pm) HW5 out today (due 11/17 11:59pm)

Jan Rupnik Jozef Stefan Institute

Perceptrons Support-Vector Machines

CS 4/527: Artificial Intelligence

An Introduction to Support Vector Machines

Linear Discriminators

Support Vector Machines

Statistical Learning Dong Liu Dept. EEIS, USTC.

Linear Classifier by Dr

Large Scale Support Vector Machines

Support vector machines

Machine Learning Week 3.

Support Vector Machines

Support Vector Machine I

Support vector machines

Support vector machines

CS639: Data Management for Data Science

SVMs for Document Ranking

MIRA, SVM, k-NN Lirong Xia. MIRA, SVM, k-NN Lirong Xia.

Support Vector Machines 2

Presentation transcript:

Large Margin classifiers David Kauchak CS 158 – Fall 2016

Admin Assignment 5 Assignment 6 Midterm Course feedback back soon write tests for your code! variance scaling uses standard deviation for this class Assignment 6 Midterm Course feedback Thanks! We’ll go over it at the end of class today or the beginning of next class

Which hyperplane? Two main variations in linear classifiers: which hyperplane they choose when the data is linearly separable how they handle data that is not linearly separable

Linear approaches so far Perceptron: separable: non-separable: Gradient descent:

Linear approaches so far Perceptron: separable: finds some hyperplane that separates the data non-separable: will continue to adjust as it iterates through the examples final hyperplane will depend on which examples it saw recently Gradient descent: separable and non-separable finds the hyperplane that minimizes the objective function (loss + regularization) Which hyperplane is this?

Which hyperplane would you choose?

Large margin classifiers Choose the line where the distance to the nearest point(s) is as large as possible

Large margin classifiers The margin of a classifier is the distance to the closest points of either class Large margin classifiers attempt to maximize this

Support vectors For any separating hyperplane, there exist some set of “closest points” These are called the support vectors For n dimensions, there will be at least n+1 support vectors - they are vectors (i.e. points) and they support/define the line

Measuring the margin The margin is the distance to the support vectors, i.e. the “closest points”, on either side of the hyperplane

Measuring the margin negative examples positive examples

Measuring the margin What are the equations for the margin lines? negative examples positive examples

Measuring the margin What is c? We know they’re the same distance apart (otherwise, they wouldn’t be support vectors!) What is c?

Measuring the margin Depends! If we scale w, we vary the constant without changing the separating hyperplane

Measuring the margin Depends! If we scale w, we vary the constant without changing the separating hyperplane Larger w result in larger constants

Measuring the margin Depends! If we scale w, we vary the constant without changing the separating hyperplane Smaller w result in smaller constants

Measuring the margin For now, let’s assume c = 1. What is this distance?

Distance from the hyperplane How far away is this point from the hyperplane? f2 w=(1,2) f1 (-1,-2)

Distance from the hyperplane How far away is this point from the hyperplane? f2 w=(1,2) f1 (-1,-2)

Distance from the hyperplane How far away is this point from the hyperplane? f2 w=(1,2) (1,1) f1

Distance from the hyperplane How far away is this point from the hyperplane? f2 w=(1,2) Is it? (1,1) f1

Distance from the hyperplane Does that seem right? What’s the problem? f2 w=(1,2) (1,1) f1

Distance from the hyperplane How far away is the point from the hyperplane? f2 w=(2,4) (1,1) f1

Distance from the hyperplane How far away is the point from the hyperplane? f2 w=(2,4) (1,1) f1

Distance from the hyperplane How far away is this point from the hyperplane? f2 w=(1,2) (1,1) f1 length normalized weight vectors

Distance from the hyperplane How far away is this point from the hyperplane? f2 w=(1,2) (1,1) f1

Distance from the hyperplane The magnitude of the weight vector doesn’t matter f2 w=(2,4) (1,1) f1 length normalized weight vectors

Distance from the hyperplane The magnitude of the weight vector doesn’t matter f2 w=(0.5,1) (1,1) f1 length normalized weight vectors

Measuring the margin For now, let’s just assume c = 1. What is this distance?

Measuring the margin For now, let’s just assume c = 1.

Large margin classifier setup Select the hyperplane with the largest margin where the points are classified correctly and outside the margin! Setup as a constrained optimization problem: All points are classified correctly AND they have a prediction >= 1. subject to: what does this say?

Large margin classifier setup Select the hyperplane with the largest margin where the points are classified correctly and outside the margin! Setup as a constrained optimization problem: All points are classified correctly AND they have a prediction >= 1. subject to:

Maximizing the margin subject to: Maximizing the margin is equivalent to minimizing ||w||! (subject to the separating constraints)

Maximizing the margin subject to: The minimization criterion wants w to be as small as possible subject to: The constraints: make sure the data is separable encourages w to be larger (once the data is separable)

Measuring the margin For now, let’s just assume c = 1. Claim: it does not matter what c we choose for the SVM problem. Why?

Measuring the margin What is this distance?

Measuring the margin

Maximizing the margin subject to: What’s the difference? vs.

Maximizing the margin subject to: Learn the exact same hyperplane just scaled by a constant amount Because of this, often see it with c = 1 vs. subject to:

For those that are curious… scaled version of w

Maximizing the margin: the real problem subject to: Why the squared?

Maximizing the margin: the real problem subject to: subject to: Minimizing ||w|| is equivalent to minimizing ||w||2 The sum of the squared weights is a convex function!

Support vector machine problem subject to: This is a version of a quadratic optimization problem Maximize/minimize a quadratic function Subject to a set of linear constraints Many, many variants of solving this problem (we’ll see one in a bit)

Soft Margin Classification subject to: What about this problem?

Soft Margin Classification subject to: We’d like to learn something like this, but our constraints won’t allow it 

Slack variables subject to: slack variables (one for each example) What effect does this have?

Slack variables subject to: slack penalties

Slack variables trade-off between margin maximization and penalization penalized by how far from “correct” subject to: allowed to make a mistake

Soft margin SVM subject to: Still a quadratic optimization problem!

Demo http://cs.stanford.edu/people/karpathy/svmjs/demo/

Solving the SVM problem

Understanding the Soft Margin SVM subject to: Given the optimal solution, w, b: Can we figure out what the slack penalties are for each point?

Understanding the Soft Margin SVM What do the margin lines represent wrt w,b? subject to:

Understanding the Soft Margin SVM subject to: Or:

Understanding the Soft Margin SVM subject to: What are the slack values for points outside (or on) the margin AND correctly classified?

Understanding the Soft Margin SVM subject to: 0! The slack variables have to be greater than or equal to zero and if they’re on or beyond the margin then yi(wxi+b) ≥ 1 already

Understanding the Soft Margin SVM subject to: What are the slack values for points inside the margin AND classified correctly?

Understanding the Soft Margin SVM subject to: Difference from the point to the margin. Which is?

Understanding the Soft Margin SVM subject to: What are the slack values for points that are incorrectly classified?

Understanding the Soft Margin SVM subject to: Which is?

Understanding the Soft Margin SVM subject to: “distance” is the unnormalized projection, not to be confused with the true distance which would be with respect to w/||w|| “distance” to the hyperplane plus the “distance” to the margin ?

Understanding the Soft Margin SVM subject to: the points are incorrectly classified so the prediction (wx+b) times the label (yi) will have different signs “distance” to the hyperplane plus the “distance” to the margin Why -?

Understanding the Soft Margin SVM subject to: “distance” to the hyperplane plus the “distance” to the margin ?

Understanding the Soft Margin SVM subject to: “distance” to the hyperplane plus the “distance” to the margin 1

Understanding the Soft Margin SVM subject to: “distance” to the hyperplane plus the “distance” to the margin

Understanding the Soft Margin SVM subject to:

Understanding the Soft Margin SVM Does this look familiar?

Hinge loss! 0/1 loss: Hinge: Exponential: Squared loss:

Understanding the Soft Margin SVM subject to: Do we need the constraints still?

Understanding the Soft Margin SVM subject to: Unconstrained problem!

Understanding the Soft Margin SVM Does this look like something we’ve seen before? Gradient descent problem!

Soft margin SVM as gradient descent multiply through by 1/C and rearrange let λ=1/C What type of gradient descent problem?

Soft margin SVM as gradient descent One way to solve the soft margin SVM problem is using gradient descent hinge loss L2 regularization

Gradient descent SVM solver pick a starting point (w) repeat until loss doesn’t decrease in all dimensions: pick a dimension move a small amount in that dimension towards decreasing loss (using the derivative) hinge loss L2 regularization Finds the largest margin hyperplane while allowing for a soft margin

Support vector machines: 2013 One of the most successful (if not the most successful) classification approach: decision tree Support vector machine k nearest neighbor perceptron algorithm

Support vector machines: 2016 One of the most successful (if not the most successful) classification approach: decision tree Support vector machine k nearest neighbor perceptron algorithm

Trends over time