Binary Classification Problem: Learn a Classifier from the Training Set

Presentation transcript:

Binary Classification Problem: Learn a Classifier from the Training Set. Given a training dataset S = {(x_i, y_i)}, i = 1, ..., m, with inputs x_i in R^n and labels y_i in {-1, +1}. Main goal: predict the unseen class label of new data. Find a function f: R^n -> R by learning from the data; the simplest such function is linear: f(x) = w^T x + b, with the label predicted as sign(f(x)).
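As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this linear decision rule, with a hand-picked weight vector and bias:

import numpy as np

def linear_classify(w, b, x):
    """Predict the class label (+1 or -1) with the linear rule sign(w^T x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1

# toy usage: a hand-picked separating hyperplane
w = np.array([1.0, -1.0])
b = 0.5
print(linear_classify(w, b, np.array([2.0, 0.5])))   # +1
print(linear_classify(w, b, np.array([-1.0, 2.0])))  # -1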

Binary Classification Problem: Linearly Separable Case (figure: benign and malignant points separated by a hyperplane).

Support Vector Machines: Maximizing the Margin between Bounding Planes (figure). The two bounding planes w^T x = b + 1 and w^T x = b - 1 support the two classes; the distance between them, 2/||w||, is the margin to be maximized.

Why Do We Maximize the Margin? (Based on Statistical Learning Theory.) Structural Risk Minimization (SRM): the expected risk is bounded above by the empirical risk (training error) plus a VC (capacity) term. Maximizing the margin controls the capacity term and therefore the bound.

Summary of the Notation. Let S = {(x_i, y_i)}, i = 1, ..., m, be the training dataset. It is represented by the matrices A in R^{m x n}, whose i-th row is x_i^T, and D in R^{m x m}, the diagonal matrix with D_ii = y_i, so that the labeled data are equivalent to the pair (A, D).

Support Vector Classification (Linearly Separable Case, Primal). The hyperplane (w, b) is determined by solving the minimization problem: min_{w,b} (1/2)||w||^2 subject to y_i(w^T x_i + b) >= 1, i = 1, ..., m. It realizes the maximal margin hyperplane, with geometric margin gamma = 1/||w||.

Support Vector Classification (Linearly Separable Case, Dual Form). The dual of the previous mathematical program: max_alpha sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j subject to sum_i alpha_i y_i = 0 and alpha_i >= 0. Applying the KKT optimality conditions gives w = sum_i alpha_i y_i x_i. But where is b? Don't forget the complementarity conditions: b can be recovered from any support vector, b = y_j - w^T x_j for any j with alpha_j > 0.

Dual Representation of SVM (the Key to Kernel Methods: the data appear only through inner products x_i^T x). The hypothesis is determined by the dual variables: f(x) = sum_i alpha_i y_i x_i^T x + b, and the predicted label is sign(f(x)).
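A hedged sketch of how this dual representation is evaluated in practice; the data, coefficients, and kernel below are illustrative assumptions, not values from the slides:

import numpy as np

def dual_decision(alpha, y, X, b, x, kernel=np.dot):
    """Evaluate f(x) = sum_i alpha_i y_i K(x_i, x) + b from the dual variables."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

# toy usage with two "support vectors" and the linear kernel
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1])
alpha = np.array([0.5, 0.5])
print(dual_decision(alpha, y, X, 0.0, np.array([2.0, 1.0])))  # 2.0 -> class +1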

Soft Margin SVM (Nonseparable Case). If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above. Introduce a slack variable xi_i >= 0 for each training point; the inequality system y_i(w^T x_i + b) >= 1 - xi_i is then always feasible, e.g. by taking xi_i = max(0, 1 - y_i(w^T x_i + b)).

Robust Linear Programming: a Preliminary Approach to SVM. (LP): min_{w,b,s} e^T s subject to D(Aw - e b) + s >= e, s >= 0, where s is the nonnegative slack (error) vector and e is the vector of ones. The term e^T s, the 1-norm measure of the error vector, is called the training error. For the linearly separable case, s = 0 at the solution of (LP).

Support Vector Machine Formulations (Two Different Measures of Training Error). 2-Norm Soft Margin: min_{w,b,xi} (1/2)||w||^2 + (C/2) sum_i xi_i^2 subject to y_i(w^T x_i + b) >= 1 - xi_i. 1-Norm Soft Margin (Conventional SVM): min_{w,b,xi} (1/2)||w||^2 + C sum_i xi_i subject to y_i(w^T x_i + b) >= 1 - xi_i, xi_i >= 0.
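As an illustration, a small NumPy sketch that evaluates both training-error measures for a candidate (w, b); the placement of the constant C here is one common convention and may differ from the slides:

import numpy as np

def soft_margin_objectives(w, b, X, y, C):
    """Evaluate the two soft-margin objectives for a candidate (w, b).
    xi_i = max(0, 1 - y_i (w^T x_i + b)) is the slack (hinge loss) of point i."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    one_norm = 0.5 * w @ w + C * xi.sum()        # 1-norm soft margin (conventional SVM)
    two_norm = 0.5 * w @ w + 0.5 * C * xi @ xi   # 2-norm soft margin
    return one_norm, two_norm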

Tuning Procedure: How to Determine C? (Figure: correctness as a function of C; too large a C leads to overfitting.) Try a range of values of C and estimate out-of-sample performance, e.g. on a held-out set or by cross-validation; the final value of the parameter is the one with the maximum testing-set correctness.
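One common way to carry out such a tuning procedure is a cross-validated grid search; the sketch below assumes scikit-learn and uses synthetic data purely for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# search over a coarse grid of C values with 5-fold cross-validation on the training set,
# then report correctness on the held-out test set
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))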

Lagrangian Dual Problem. Primal: min_x f(x) subject to g(x) <= 0. Dual: max_alpha theta(alpha) subject to alpha >= 0, where theta(alpha) = inf_x L(x, alpha) and L(x, alpha) = f(x) + alpha^T g(x) is the Lagrangian.

1-Norm Soft Margin SVM: Dual Formulation. The Lagrangian for the 1-norm soft margin: L(w, b, xi, alpha, r) = (1/2)||w||^2 + C sum_i xi_i - sum_i alpha_i [y_i(w^T x_i + b) - 1 + xi_i] - sum_i r_i xi_i, where alpha_i >= 0 and r_i >= 0. Setting the partial derivatives with respect to the primal variables to zero gives w = sum_i alpha_i y_i x_i, sum_i alpha_i y_i = 0, and C - alpha_i - r_i = 0.

Substituting these relations back into the Lagrangian eliminates w, b, and xi and leaves a function of alpha alone: sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j, where the condition C - alpha_i - r_i = 0 together with r_i >= 0 gives 0 <= alpha_i <= C, and sum_i alpha_i y_i = 0.

Dual Maximization Problem for the 1-Norm Soft Margin. Dual: max_alpha sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j subject to sum_i alpha_i y_i = 0, 0 <= alpha_i <= C. The corresponding KKT complementarity conditions: alpha_i [y_i(w^T x_i + b) - 1 + xi_i] = 0 and (C - alpha_i) xi_i = 0.
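For illustration only, a simplified projected-gradient sketch of this dual problem; it drops the equality constraint sum_i alpha_i y_i = 0 (i.e. it omits the bias term) so that the projection reduces to clipping onto the box [0, C]:

import numpy as np

def svm_dual_projected_gradient(X, y, C, lr=0.001, n_iter=2000):
    """Simplified sketch: maximize  e^T a - 0.5 a^T Q a  subject to 0 <= a <= C,
    where Q_ij = y_i y_j x_i^T x_j.  The equality constraint sum_i a_i y_i = 0
    (the bias term) is dropped to keep the projection a simple box clip."""
    Q = (y[:, None] * X) @ (y[:, None] * X).T
    a = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - Q @ a                  # gradient of the dual objective
        a = np.clip(a + lr * grad, 0.0, C)  # ascent step + projection onto the box
    w = (a * y) @ X                         # recover the primal weights w = sum_i a_i y_i x_i
    return a, w

# usage on a toy separable set
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual_projected_gradient(X, y, C=1.0)
print(alpha, w)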

Slack Variables for the 1-Norm Soft Margin SVM. Non-zero slack xi_i > 0 can only occur when alpha_i = C, so the contribution of an outlier to the decision rule is at most C. The trade-off between accuracy and regularization is directly controlled by C. The points for which 0 < alpha_i < C lie exactly on the bounding planes (xi_i = 0, y_i(w^T x_i + b) = 1); this will help us to find b.

Two-spiral Dataset (94 White Dots & 94 Red Dots)

Learning in Feature Space (Could Simplify the Classification Task). Learning in a high-dimensional space could degrade generalization performance; this phenomenon is called the curse of dimensionality. By using a kernel function that represents the inner product of training examples in feature space, we never need to know the nonlinear map explicitly; we do not even need to know the dimensionality of the feature space. There is no free lunch, however: we must deal with a huge and dense kernel matrix. Reduced kernel methods can avoid this difficulty.

Linear Machine in Feature Space. Let phi: X -> F be a nonlinear map from the input space to some feature space. The classifier will be of the form (primal): f(x) = w^T phi(x) + b. In the dual form it becomes: f(x) = sum_i alpha_i y_i <phi(x_i), phi(x)> + b.

Kernel: Represent the Inner Product in Feature Space. Definition: a kernel is a function K: X x X -> R such that K(x, z) = <phi(x), phi(z)>, where phi maps the input space X into a feature space F. The classifier then becomes: f(x) = sum_i alpha_i y_i K(x_i, x) + b.

A Simple Example of a Kernel. Polynomial kernel of degree 2: K(x, z) = (x^T z)^2 for x, z in R^2, and the nonlinear map phi: R^2 -> R^3 defined by phi(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2). Then <phi(x), phi(z)> = (x^T z)^2 = K(x, z). There are many other nonlinear maps psi that satisfy the same relation <psi(x), psi(z)> = K(x, z).
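A quick numerical check of this identity (the particular vectors are illustrative, not from the slides):

import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^2: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # inner product in feature space
print(np.dot(x, z) ** 2)        # kernel value (x^T z)^2 -- identical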

Power of the Kernel Technique. Consider a nonlinear map phi: R^n -> R^p whose features consist of all the distinct monomials of degree d; then p = C(n + d - 1, d). For instance, with n = 100 input features and d = 4, p = C(103, 4) = 4,421,275, already in the millions. Is it necessary to compute phi(x) explicitly? No: we only need to know K(x, z) = (x^T z)^d, and this can be computed directly in the input space.

Kernel Technique: Based on Mercer's Condition (1909). The value of the kernel function represents the inner product of two training points in feature space. Kernel functions merge two steps: 1. map the input data from input space to feature space (which might be infinite-dimensional); 2. compute the inner product in that feature space.

More Examples of Kernels. Polynomial kernel: K(x, z) = (x^T z + 1)^d, where d is a positive integer (without the constant term and with d = 1 this is the linear kernel K(x, z) = x^T z). Gaussian (radial basis) kernel: K(x, z) = exp(-mu ||x - z||^2) with mu > 0. The (i, j)-entry of the kernel matrix represents the "similarity" of the data points x_i and x_j.
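A small sketch of how such a kernel ("similarity") matrix is computed, here for the Gaussian kernel with an assumed width parameter gamma:

import numpy as np

def gaussian_kernel_matrix(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2): the similarity of points A_i and B_j."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

A = np.random.randn(5, 3)
K = gaussian_kernel_matrix(A, A)
print(K.shape, np.allclose(np.diag(K), 1.0))  # (5, 5) True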

Nonlinear 1-Norm Soft Margin SVM in Dual Form. Linear SVM: max_alpha sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j subject to sum_i alpha_i y_i = 0, 0 <= alpha_i <= C. Nonlinear SVM: replace the inner product x_i^T x_j by the kernel value K(x_i, x_j); everything else stays the same.

1-Norm Support Vector Machines: Good for Feature Selection. Solve, for some C > 0: min_{w,b,xi} C e^T xi + ||w||_1 subject to D(Aw - e b) + xi >= e, xi >= 0, where D denotes the diagonal matrix of +1 or -1 class membership. Because the 1-norm can be linearized (write w = p - q with p, q >= 0), this is equivalent to solving a linear program.

SVM as an Unconstrained Minimization Problem. Consider the quadratic program (QP): min_{w,b,xi} (C/2)||xi||^2 + (1/2)(||w||^2 + b^2) subject to D(Aw - e b) + xi >= e, xi >= 0. At the solution of (QP), xi = (e - D(Aw - e b))_+, where (.)_+ replaces negative components by zero. Hence (QP) is equivalent to the nonsmooth SVM: min_{w,b} (C/2)||(e - D(Aw - e b))_+||^2 + (1/2)(||w||^2 + b^2). This changes (QP) into an unconstrained mathematical program and reduces the n+1+m variables to n+1 variables.

Smooth the Plus Function: Integrate. Step function: s(x) = 1 if x > 0, 0 otherwise. Sigmoid function: sigma(x, alpha) = 1/(1 + e^(-alpha x)), a smooth approximation of the step function. p-function: p(x, alpha) = x + (1/alpha) log(1 + e^(-alpha x)), obtained by integrating the sigmoid. Plus function: (x)_+ = max(x, 0), the integral of the step function; p(x, alpha) is therefore a smooth approximation of (x)_+.
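A minimal sketch of the plus function and its smooth approximation; the formula for p(x, alpha) below follows the smoothing-by-integration construction described above:

import numpy as np

def plus(x):
    """The plus function (x)_+ = max(x, 0)."""
    return np.maximum(x, 0.0)

def p_func(x, alpha):
    """Smooth approximation of the plus function obtained by integrating the sigmoid:
    p(x, alpha) = x + (1/alpha) * log(1 + exp(-alpha * x))."""
    return x + np.log1p(np.exp(-alpha * x)) / alpha

xs = np.linspace(-2, 2, 5)
print(plus(xs))
print(p_func(xs, 5.0))   # approaches (x)_+ as alpha grows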

SSVM: Smooth Support Vector Machine. Replacing the plus function (.)_+ in the nonsmooth SVM by the smooth p(., alpha) gives our SSVM: min_{w,b} (C/2)||p(e - D(Aw - e b), alpha)||^2 + (1/2)(||w||^2 + b^2). The solution of the SSVM converges to the solution of the nonsmooth SVM as alpha goes to infinity; in practice a moderate, fixed value of alpha is used.

Newton-Armijo Method: Quadratic Approximation of SSVM. The sequence generated by solving a quadratic approximation of the SSVM objective at each step converges to the unique solution of the SSVM at a quadratic rate. In practice it converges in 6 to 8 iterations. At each iteration we solve a linear system of n+1 equations in n+1 variables, so the complexity depends on the dimension of the input space. A stepsize may need to be selected (the Armijo rule).

Newton-Armijo Algorithm. Start with any (w^0, b^0). Having (w^k, b^k), stop if the gradient of the objective is (numerically) zero; else compute (w^{k+1}, b^{k+1}) as follows. (i) Newton direction: solve the linear system H(w^k, b^k) d^k = -grad(w^k, b^k), where H is the Hessian of the objective. (ii) Armijo stepsize: choose lambda_k in {1, 1/2, 1/4, ...} such that Armijo's sufficient-decrease rule is satisfied, and set (w^{k+1}, b^{k+1}) = (w^k, b^k) + lambda_k d^k. The iterates converge globally and quadratically to the unique solution, and the full Newton step is accepted after a finite number of steps.
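A generic sketch of Newton's method with an Armijo backtracking line search, shown on a toy strongly convex quadratic rather than the actual SSVM objective; constants such as the sufficient-decrease parameter 1e-4 are illustrative choices:

import numpy as np

def newton_armijo(f, grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton's method with an Armijo backtracking line search (sketch).
    f, grad, hess evaluate the objective, its gradient, and its Hessian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)                  # Newton direction
        t, fx = 1.0, f(x)
        while f(x + t * d) > fx + 1e-4 * t * (g @ d):     # Armijo condition
            t *= 0.5                                      # halve the stepsize
        x = x + t * d
    return x

# usage on a strongly convex quadratic (Newton converges in one step)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
x_star = newton_armijo(lambda x: 0.5 * x @ Q @ x + c @ x,
                       lambda x: Q @ x + c,
                       lambda x: Q,
                       np.zeros(2))
print(x_star, np.linalg.solve(Q, -c))  # should match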

Nonlinear Smooth SVM. Nonlinear classifier: f(x) = sum_i u_i K(x_i, x) + b, defined by a nonlinear kernel K. In the SSVM, replace the linear term Aw by the kernel expansion K(A, A^T)u, giving a smooth unconstrained minimization problem in (u, b). Use the Newton-Armijo algorithm to solve it; each iteration now solves m+1 linear equations in m+1 variables. The nonlinear classifier depends only on the data points with nonzero coefficients u_i.

Conclusion. An overview of SVMs for classification. SSVM: a new formulation of the support vector machine as a smooth unconstrained minimization problem, which can be solved by a fast Newton-Armijo algorithm; no optimization (LP, QP) package is needed. There are many important issues this lecture did not address, such as: how to solve the conventional SVM; how to select the parameters (e.g. C and the kernel parameter); and how to deal with massive datasets.

Perceptron: Linear Threshold Unit (LTU) (figure). Inputs x_1, ..., x_n with weights w_1, ..., w_n and a constant input x_0 = 1 with weight w_0 are combined into the sum sum_{i=0}^{n} w_i x_i; the output is o(x) = 1 if sum_{i=0}^{n} w_i x_i > 0, and -1 otherwise.

Possibilities for the function g. Sign function: sign(x) = +1 if x > 0, -1 if x <= 0. Step function: step(x) = 1 if x > threshold, 0 if x <= threshold (in the picture above, threshold = 0). Sigmoid (logistic) function: sigmoid(x) = 1/(1 + e^(-x)). Adding an extra input with activation x_0 = 1 and weight w_0 = -T (called the bias weight) is equivalent to having a threshold at T; this way we can always assume a threshold of 0.
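For concreteness, minimal implementations of these three activation functions (illustrative only):

import numpy as np

def sign_fn(x):
    """+1 if x > 0, -1 otherwise."""
    return 1 if x > 0 else -1

def step_fn(x, threshold=0.0):
    """1 if x exceeds the threshold, 0 otherwise."""
    return 1 if x > threshold else 0

def sigmoid(x):
    """Logistic function 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

print(sign_fn(0.3), step_fn(0.3), sigmoid(0.3))   # 1 1 0.574...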

Using a Bias Weight to Standardize the Threshold (figure): with an extra input fixed at 1 and weight -T, the test w_1 x_1 + w_2 x_2 < T becomes w_1 x_1 + w_2 x_2 - T < 0.

Perceptron Learning Rule (worked example, figure). Starting from the weight vector w = [0.25, -0.1, 0.5] (decision boundary x_2 = 0.2 x_1 - 0.5), the weights are updated after each misclassified training example, e.g. (x, t) = ([2, 1], -1) with output o = 1, (x, t) = ([-1, -1], 1) with o = -1, and (x, t) = ([1, 1], 1) with o = -1; the figure shows the intermediate weight vectors and decision boundaries.

The Perceptron Algorithm (Rosenblatt, 1956). Given a linearly separable training set S and a learning rate eta > 0, initialize the weight vector and bias to w_0 = 0, b_0 = 0, and let R = max_i ||x_i||.

The Perceptron Algorithm (Primal Form). Repeat: for i = 1 to m, if y_i(<w, x_i> + b) <= 0 then w <- w + eta y_i x_i and b <- b + eta y_i R^2; until no mistakes are made within the for loop. Return (w, b). What is R? R = max_i ||x_i||, the radius of the smallest ball centred at the origin that contains the training points.
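A sketch of this primal perceptron in NumPy, assuming the variant above in which the bias update is scaled by R^2; the toy data are illustrative:

import numpy as np

def perceptron_primal(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron (sketch): update (w, b) on every misclassified point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # point misclassified (or on the boundary)
                w += eta * yi * xi
                b += eta * yi * R ** 2
                mistakes += 1
        if mistakes == 0:                    # stop: no mistakes in a full pass
            break
    return w, b

# usage on a tiny linearly separable set
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_primal(X, y)
print(np.sign(X @ w + b))   # matches y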

The Perceptron Algorithm (Stops in Finitely Many Steps). Theorem (Novikoff). Let S be a non-trivial training set, and let R = max_i ||x_i||. Suppose that there exist a vector w_opt with ||w_opt|| = 1 and a bias b_opt such that y_i(<w_opt, x_i> + b_opt) >= gamma for all i. Then the number of mistakes made by the on-line perceptron algorithm on S is at most (2R/gamma)^2.

Proof of Finite Termination. Proof: work with the augmented weight vector w_hat = (w, b/R) and the augmented training points x_hat_i = (x_i, R), so that <w_hat, x_hat_i> = <w, x_i> + b. The algorithm starts with the augmented weight vector w_hat_0 = 0 and updates it at each mistake. Let w_hat_{t-1} be the augmented weight vector prior to the t-th mistake; the t-th update is performed when y_i <w_hat_{t-1}, x_hat_i> <= 0, where (x_i, y_i) is the point incorrectly classified by w_hat_{t-1}.

Update rule of the perceptron: w_hat_t = w_hat_{t-1} + eta y_i x_hat_i. Taking the inner product with the augmented optimal vector w_hat_opt gives <w_hat_t, w_hat_opt> >= <w_hat_{t-1}, w_hat_opt> + eta gamma, so after t mistakes <w_hat_t, w_hat_opt> >= t eta gamma. Similarly, ||w_hat_t||^2 = ||w_hat_{t-1}||^2 + 2 eta y_i <w_hat_{t-1}, x_hat_i> + eta^2 ||x_hat_i||^2 <= ||w_hat_{t-1}||^2 + 2 eta^2 R^2, so ||w_hat_t||^2 <= 2 t eta^2 R^2. Combining the two bounds via Cauchy-Schwarz, t eta gamma <= ||w_hat_t|| ||w_hat_opt|| <= sqrt(2t) eta R ||w_hat_opt||, hence t <= 2 R^2 ||w_hat_opt||^2 / gamma^2 <= (2R/gamma)^2, since ||w_hat_opt||^2 = ||w_opt||^2 + b_opt^2/R^2 <= 2 for a non-trivial training set.

The Perceptron Algorithm (Dual Form). Given a linearly separable training set S, initialize alpha = 0 and b = 0, and let R = max_i ||x_i||. Repeat: for i = 1 to m, if y_i(sum_j alpha_j y_j <x_j, x_i> + b) <= 0 then alpha_i <- alpha_i + 1 and b <- b + y_i R^2; until no mistakes are made within the for loop. Return (alpha, b).
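A corresponding sketch of the dual perceptron; note that the data enter only through the Gram matrix, as the next slide emphasizes. The toy usage mirrors the primal sketch above:

import numpy as np

def perceptron_dual(X, y, max_epochs=100):
    """Dual perceptron (sketch): alpha_i counts the mistakes made on point i,
    and the data appear only through the Gram matrix G[i, j] = <x_i, x_j>."""
    m = len(y)
    G = X @ X.T                                  # Gram matrix
    R2 = np.max(np.sum(X ** 2, axis=1))          # R^2
    alpha = np.zeros(m)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1
                b += y[i] * R2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b                              # w can be recovered as sum_i alpha_i y_i x_i

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(perceptron_dual(X, y))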

What Do We Get from the Dual Form of the Perceptron Algorithm? The total number of updates equals sum_i alpha_i = ||alpha||_1. alpha_i > 0 implies that the training point (x_i, y_i) has been misclassified at least once during training; alpha_i = 0 implies that removing that training point will not affect the final result. The training data only appear in the algorithm through the entries of the Gram matrix G, defined by G_ij = <x_i, x_j>.