1 Statistical Learning
Dong Liu, Dept. EEIS, USTC

2 Chapter 3. Support Vector Machine (SVM)
Max-margin linear classification
Soft-margin linear classification
The kernel trick
Efficient algorithm for SVM

3 Linear classification
Several linear decision boundaries can separate the data. Which one is optimal?

4 Classification margin
We want to maximize the margin.
Intuitively, this makes the classifier the most tolerant to noise.
Theoretically, this gives the classifier the best generalization ability.

5 Geometric margin & Functional margin
For a point, its distance to the decision boundary gives the geometric margin; the functional margin is its unnormalized counterpart (see the formulas below). Since we can scale both w and b by a common factor without changing the geometric margin, we can fix the functional margin of the closest samples to 1.
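The slide's formulas did not survive extraction; a standard reconstruction in LaTeX, assuming labels y_n in {-1, +1} and decision boundary w^T x + b = 0:
\[ \text{distance of } x_n \text{ to the boundary} = \frac{|w^\top x_n + b|}{\lVert w \rVert} \]
\[ \text{geometric margin: } \gamma = \min_n \frac{y_n (w^\top x_n + b)}{\lVert w \rVert}, \qquad \text{functional margin: } \hat{\gamma} = \min_n \, y_n (w^\top x_n + b) \]
Rescaling (w, b) to (\kappa w, \kappa b) leaves \gamma unchanged, so we may set \hat{\gamma} = 1.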

6 Maximize geometric margin 1/2
The problem is to maximize the geometric margin subject to the margin constraints, which is equivalent to minimizing a quadratic objective under linear constraints (see below).
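A standard reconstruction of the two formulations (the second follows because, once the functional margin is fixed to 1, maximizing 1/||w|| is the same as minimizing ||w||^2/2):
\[ \max_{w, b} \; \frac{1}{\lVert w \rVert} \quad \text{s.t.} \quad y_n (w^\top x_n + b) \ge 1, \; n = 1, \dots, N \]
\[ \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_n (w^\top x_n + b) \ge 1, \; n = 1, \dots, N \]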

7 Maximize geometric margin 2/2
Using Lagrange multipliers, we form the Lagrangian of this problem. According to the KKT conditions, w is determined by the samples that have non-zero multipliers; these samples are termed support vectors (see below).
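A standard reconstruction of the Lagrangian and the relevant KKT conditions:
\[ L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_n \alpha_n \big[ y_n (w^\top x_n + b) - 1 \big], \quad \alpha_n \ge 0 \]
\[ \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_n \alpha_n y_n x_n, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_n \alpha_n y_n = 0 \]
\[ \alpha_n \big[ y_n (w^\top x_n + b) - 1 \big] = 0 \;\Rightarrow\; \alpha_n > 0 \text{ only if } y_n (w^\top x_n + b) = 1 \]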

8 Support vectors
[Figure: the margin is bounded by the hyperplanes wT x + b = 1 and wT x + b = -1.] The margin is determined by the samples with non-zero Lagrange multipliers; these samples are termed support vectors.

9 Lagrange dual
For a general constrained optimization problem, we try to solve the primal problem together with its Lagrange dual problem (see below). Under certain conditions (e.g. convexity plus a constraint qualification such as Slater's condition), the original problem and its dual problem are equivalent.
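A standard reconstruction of the general primal and dual:
\[ \text{primal:} \quad \min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; h_j(x) = 0 \]
\[ L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x), \quad \lambda_i \ge 0 \]
\[ \text{dual:} \quad \max_{\lambda \ge 0, \, \nu} \; \Big( \min_x L(x, \lambda, \nu) \Big) \]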

10 Lagrange dual of max-margin
The original problem, its dual problem, and the accompanying KKT conditions are given below.
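A standard reconstruction:
\[ \text{original:} \quad \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_n (w^\top x_n + b) \ge 1 \]
\[ \text{dual:} \quad \max_\alpha \; \sum_n \alpha_n - \frac{1}{2} \sum_{n, m} \alpha_n \alpha_m y_n y_m \, x_n^\top x_m \quad \text{s.t.} \quad \alpha_n \ge 0, \; \sum_n \alpha_n y_n = 0 \]
\[ \text{KKT:} \quad \alpha_n \ge 0, \quad y_n (w^\top x_n + b) - 1 \ge 0, \quad \alpha_n \big[ y_n (w^\top x_n + b) - 1 \big] = 0 \]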

11 Solution of max-margin
Once we have solved the dual problem, we can recover the weight vector and the bias (see below). Summary: for max-margin classification, we solve the dual problem to find the support vectors, and then determine the best decision boundary.
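A standard reconstruction of the recovered solution, where b* is computed from any support vector x_m:
\[ w^* = \sum_n \alpha_n^* y_n x_n, \qquad b^* = y_m - (w^*)^\top x_m \]
\[ f(x) = \operatorname{sign}\big( (w^*)^\top x + b^* \big) \]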

12 Chapter 3. Support Vector Machine (SVM)
Max-margin linear classification
Soft-margin linear classification
The kernel trick
Efficient algorithm for SVM

13 Why soft margin 1/4
If the dataset is not linearly separable, how can we define the margin?

14 Why soft margin 2/4
We may still define the margin by disregarding the "error" samples.

15 Why soft margin 3/4
Or, even if the dataset is linearly separable, we may still prefer a larger margin.

16 Why soft margin 4/4
We may still define the margin but allow some samples to be exceptions.

17 Soft margin formulation 1/3
We change our objective to penalize margin violations through an indicator function (see below), in contrast to the "hard" margin formulation.
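A common reconstruction; the trade-off constant C is an assumption about the slide's notation:
\[ \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_n \mathbb{1}\big[ y_n (w^\top x_n + b) < 1 \big], \qquad \mathbb{1}[z] = \begin{cases} 1 & z \text{ true} \\ 0 & \text{otherwise} \end{cases} \]
\[ \text{hard margin, for comparison:} \quad \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_n (w^\top x_n + b) \ge 1 \]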

18 Soft margin formulation 2/3
Since the indicator function is intractable, we replace it with the hinge loss, so the problem becomes an unconstrained one (see below). It can be interpreted as minimizing the hinge loss with L2-norm regularization.
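Replacing the indicator with the hinge loss gives, in a standard reconstruction:
\[ \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_n \max\big( 0, \; 1 - y_n (w^\top x_n + b) \big) \]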

19 Soft margin formulation 3/3
Define slack variables; the problem becomes a constrained quadratic program (see below). [Figure: soft-margin classifier with the hyperplanes wT x + b = 0, wT x + b = 1, and wT x + b = -1.]
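With slack variables, a standard reconstruction of the quadratic program:
\[ \min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_n \xi_n \quad \text{s.t.} \quad y_n (w^\top x_n + b) \ge 1 - \xi_n, \; \xi_n \ge 0, \; n = 1, \dots, N \]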

20 Soft margin solution 1/2
Using Lagrange multipliers, we form the Lagrangian; the KKT conditions are given below.
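A standard reconstruction of the Lagrangian and the stationarity conditions, with multipliers alpha_n for the margin constraints and mu_n for xi_n >= 0:
\[ L = \frac{1}{2} \lVert w \rVert^2 + C \sum_n \xi_n - \sum_n \alpha_n \big[ y_n (w^\top x_n + b) - 1 + \xi_n \big] - \sum_n \mu_n \xi_n \]
\[ w = \sum_n \alpha_n y_n x_n, \qquad \sum_n \alpha_n y_n = 0, \qquad C - \alpha_n - \mu_n = 0 \]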

21 Soft margin solution 2/2
Thus the Lagrange dual problem is as given below. Samples are categorized by their multiplier values, and those with non-zero multipliers are the support vectors.
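A standard reconstruction of the dual and the resulting sample categories:
\[ \max_\alpha \; \sum_n \alpha_n - \frac{1}{2} \sum_{n, m} \alpha_n \alpha_m y_n y_m \, x_n^\top x_m \quad \text{s.t.} \quad 0 \le \alpha_n \le C, \; \sum_n \alpha_n y_n = 0 \]
alpha_n = 0: correctly classified, outside the margin; 0 < alpha_n < C: exactly on the margin (xi_n = 0); alpha_n = C: on or inside the margin, possibly misclassified (xi_n >= 0). Samples with alpha_n > 0 are the support vectors.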

22 Support vectors in soft margin SVM
[Figure: the margin is bounded by the hyperplanes wT x + b = 1 and wT x + b = -1.] The margin is determined by the samples with non-zero Lagrange multipliers; these samples are termed support vectors.

23 Chapter 3. Support Vector Machine (SVM)
Max-margin linear classification
Soft-margin linear classification
The kernel trick
Efficient algorithm for SVM

24 Using basis functions
A non-linear transform of the inputs allows for an easier (linear) classification.

25 SVM with basis functions
Consider solving the max-margin problem in the feature space of basis functions. The dual problem and the solution are given below.
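A standard reconstruction in the feature space phi(x); whether the slide uses hard or soft constraints (the upper bound C) is an assumption:
\[ \max_\alpha \; \sum_n \alpha_n - \frac{1}{2} \sum_{n, m} \alpha_n \alpha_m y_n y_m \, \phi(x_n)^\top \phi(x_m) \quad \text{s.t.} \quad 0 \le \alpha_n \le C, \; \sum_n \alpha_n y_n = 0 \]
\[ w = \sum_n \alpha_n y_n \phi(x_n), \qquad f(x) = \operatorname{sign}\Big( \sum_n \alpha_n y_n \, \phi(x_n)^\top \phi(x) + b \Big) \]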

26 From basis function to kernel function
We notice that the basis functions appear only in the form of inner products. We define the inner product of basis functions as the kernel function (see below). The space of basis functions is then termed a Reproducing Kernel Hilbert Space (RKHS).
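In LaTeX, the definition referred to above:
\[ k(x, x') = \phi(x)^\top \phi(x') \]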

27 Kernel function: example
For example, we can prove that a quadratic polynomial of the inner product is a kernel function, since we can write out its basis functions explicitly. Similarly, we can prove that the kernel functions listed below are valid (RBF = Radial-Basis Function).
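A standard reconstruction; the exact examples on the slide may differ. For the quadratic kernel on R^2, an explicit feature map exists:
\[ k(x, x') = (x^\top x')^2, \qquad \phi(x) = \big( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \big)^\top \;\Rightarrow\; \phi(x)^\top \phi(x') = (x^\top x')^2 \]
\[ \text{linear: } k(x, x') = x^\top x', \qquad \text{polynomial: } k(x, x') = (x^\top x' + c)^d, \qquad \text{RBF: } k(x, x') = \exp\!\Big( -\frac{\lVert x - x' \rVert^2}{2 \sigma^2} \Big) \]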

28 Kernel function: benefit
For SVM (and similar methods), defining a kernel function is equivalent to designing basis functions; this is termed the kernel trick.
Sometimes it is easier to express a model with a kernel function than with basis functions, e.g. the RBF kernel.
Sometimes we can prove that a function is a kernel function even though it is difficult to write out its corresponding basis functions.
If a function satisfies Mercer's condition, it is a kernel function.

29 Kernelized SVM
The dual problem and, once it is solved, the decision function are given below.
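A standard reconstruction:
\[ \max_\alpha \; \sum_n \alpha_n - \frac{1}{2} \sum_{n, m} \alpha_n \alpha_m y_n y_m \, k(x_n, x_m) \quad \text{s.t.} \quad 0 \le \alpha_n \le C, \; \sum_n \alpha_n y_n = 0 \]
\[ f(x) = \operatorname{sign}\Big( \sum_n \alpha_n y_n \, k(x_n, x) + b \Big) \]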

30 Kernelized SVM: example
Using the RBF kernel, the decision boundary becomes non-linear in the input space.
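A minimal sketch (not from the slides) of this kind of example, assuming scikit-learn is available; the toy dataset and the values of C and gamma are chosen only for illustration:

# Soft-margin SVM with an RBF kernel on a linearly non-separable toy dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)  # non-linear two-class data

clf = SVC(kernel="rbf", C=1.0, gamma=2.0)  # C: soft-margin penalty, gamma: RBF width
clf.fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))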

31 More about the kernel trick
There are many problems that can be formulated using the kernel trick, i.e. using a kernel function in place of basis functions. The Representation Theorem claims that the solution of a regularized learning problem over an RKHS can be expressed as a kernel expansion over the training samples (see below).
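A standard statement of the theorem, assuming a loss L, a regularization weight lambda, and an RKHS H induced by the kernel k; the exact form on the slide may differ:
\[ \min_{f \in \mathcal{H}} \; \sum_n L\big( y_n, f(x_n) \big) + \lambda \lVert f \rVert_{\mathcal{H}}^2 \quad \Rightarrow \quad f^*(x) = \sum_n \beta_n \, k(x, x_n) \]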

32 Chapter 3. Support Vector Machine (SVM)
Max-margin linear classification
Soft-margin linear classification
The kernel trick
Efficient algorithm for SVM

33 SMO algorithm
The (dual) problem is the kernelized soft-margin dual given above. For this problem, sequential minimal optimization (SMO) is an efficient algorithm: choose two Lagrange multipliers as variables (two at a time, because the equality constraint on the multipliers must be maintained), optimize over them while keeping the other multipliers unchanged, and iterate.

34 SMO algorithm: considering two variables
Due to the equality constraint from the KKT conditions, the two chosen multipliers are linearly related: either their sum or their difference is constant, depending on whether the two labels agree. Our objective then becomes a quadratic function of a single variable, which can be minimized analytically and clipped to the box constraint (see below).
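A standard reconstruction of the analytic two-variable update (Platt's SMO), where E_i is the prediction error on sample i and [L, H] is the interval implied by 0 <= alpha_i <= C and the linear constraint:
\[ E_i = f(x_i) - y_i, \qquad \eta = k(x_1, x_1) + k(x_2, x_2) - 2\, k(x_1, x_2) \]
\[ \alpha_2^{\text{new}} = \operatorname{clip}\!\Big( \alpha_2 + \frac{y_2 (E_1 - E_2)}{\eta}, \; L, \; H \Big), \qquad \alpha_1^{\text{new}} = \alpha_1 + y_1 y_2 \big( \alpha_2 - \alpha_2^{\text{new}} \big) \]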

35 Chapter summary
Dictionary / Toolbox: Kernel trick; Margin (geometric ~, soft ~); Mercer's condition; Representation theorem; RKHS; Slack variable; Support vector; Hinge loss; Kernel function; Lagrange dual; Sequential minimal optimization (SMO)

