Classification and Regression


1 Classification and Regression
A review of our course: Classification and Regression. The Perceptron Algorithm, primal vs. dual form: an on-line and mistake-driven procedure that updates the weight vector and bias whenever a point is misclassified; it converges when the problem is linearly separable (a minimal sketch follows).
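A minimal sketch of the primal form of this update rule (the data matrix X, labels y in {-1, +1}, learning rate eta, and all names below are illustrative assumptions, not taken from the course material):

    import numpy as np

    def perceptron_primal(X, y, eta=1.0, max_epochs=100):
        """Mistake-driven perceptron: update (w, b) only when a point is misclassified."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                    w += eta * yi * xi       # update the weight vector
                    b += eta * yi            # update the bias
                    mistakes += 1
            if mistakes == 0:                # a full pass with no mistakes: converged
                break
        return w, b

In the dual form one keeps, instead of w, a count alpha_i of how many times each point triggered an update, so that w = sum_i alpha_i * y_i * x_i and the prediction depends on the data only through inner products.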

2 Classification Problem: 2-Category Linearly Separable Case
(Figure: two linearly separable point classes, labeled Benign and Malignant.)

3 Algebra of the Classification Problem: Linearly Separable Case
Given l points in n-dimensional real space, represented by an l × n data matrix. Membership of each point in the two classes is specified by an l × l diagonal matrix D with +1 or -1 on the diagonal. Separate the two classes by two bounding planes so that each class lies on its own side of its plane; more succinctly, the requirement can be written as a single system of linear inequalities (sketched below).
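A sketch of the usual bounding-plane notation for this slide (the symbols A, D, w, gamma, and e are assumptions consistent with the description above, not copied from the slide images):

    A \in \mathbb{R}^{l \times n}, \qquad D_{ii} \in \{+1, -1\},
    A_i w \ge \gamma + 1 \ \text{ if } D_{ii} = +1, \qquad A_i w \le \gamma - 1 \ \text{ if } D_{ii} = -1,
    \text{more succinctly:}\quad D(Aw - e\gamma) \ge e, \qquad e = (1, \dots, 1)^{\top}.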

4 Robust Linear Programming
Preliminary approach to SVM: a linear program (LP) with a nonnegative slack (error) vector. The 1-norm of the error vector is called the training error. For the linearly separable case, the slack vector is zero at the solution of the LP (a sketch of the LP follows).
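A hedged sketch of the LP in the same assumed notation, with nonnegative slack vector y:

    \min_{w, \gamma, y}\ e^{\top} y \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e, \qquad y \ge 0.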

5 Support Vector Machines Maximizing the Margin between Bounding Planes

6 Support Vector Classification
(Linearly Separable Case, Primal) The hyperplane that solves the minimization problem below realizes the maximal-margin hyperplane, whose geometric margin is inversely proportional to the norm of the weight vector.
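The standard hard-margin primal, sketched in conventional notation (assumed):

    \min_{w, b}\ \tfrac{1}{2}\lVert w\rVert_2^2 \quad \text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1,\ \ i = 1, \dots, l,
    \text{with geometric margin } 1/\lVert w\rVert_2 \ \text{(distance } 2/\lVert w\rVert_2 \text{ between the bounding planes).}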

7 Soft Margin SVM (Nonseparable Case)
If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above. Introduce a slack variable for each training point; the resulting inequality system is always feasible (e.g., by taking the slacks large enough).

8

9 Two Different Measures of Training Error
The 2-norm soft margin and the 1-norm soft margin (the two objectives are sketched below).
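The two objectives, sketched in conventional notation (C > 0 and the slack vector xi are assumed names):

    \text{2-norm: } \min_{w, b, \xi}\ \tfrac{1}{2}\lVert w\rVert_2^2 + \tfrac{C}{2}\lVert \xi\rVert_2^2,
    \qquad
    \text{1-norm: } \min_{w, b, \xi}\ \tfrac{1}{2}\lVert w\rVert_2^2 + C\, e^{\top}\xi,
    \text{both subject to } y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i \ \ (\text{and } \xi \ge 0 \text{ in the 1-norm case}).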

10 Optimization Problem Formulation
Problem setting: given an objective function and constraint functions defined on a domain, minimize the objective subject to the constraints (a generic form is sketched below).
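One standard way to write the generic problem (index ranges assumed):

    \min_{x \in \Omega}\ f(x) \quad \text{s.t.}\quad g_i(x) \le 0,\ \ i = 1, \dots, k, \qquad h_j(x) = 0,\ \ j = 1, \dots, p.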

11 Definitions and Notation
Feasible region: the set of points of the domain that satisfy all constraints. A solution of the optimization problem is a feasible point whose objective value is no larger than that of any other feasible point; such a point is called a global minimum.

12 Definitions and Notation
A point is called a local minimum of the optimization problem if there is a neighborhood of it in which no feasible point has a smaller objective value. At the solution, an inequality constraint is said to be active if it holds with equality; otherwise it is called an inactive constraint, and the nonnegative amount by which it falls short of equality is called the slack variable.

13 Definitions and Notation
Removing an inactive constraint from an optimization problem will NOT affect the optimal solution; this is a very useful feature in SVM. If there are no constraints, the problem is called an unconstrained minimization problem; least squares problems and the SSVM formulation are in this category. Without a convexity assumption it is difficult to find the global minimum.

14 Gradient and Hessian
Let f be a differentiable function. The gradient of f at a point is the vector of its first partial derivatives. If f is twice differentiable, the Hessian matrix of f at a point is the matrix of its second partial derivatives (definitions sketched below).
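The definitions, sketched:

    \nabla f(x) = \Big(\tfrac{\partial f}{\partial x_1}(x), \dots, \tfrac{\partial f}{\partial x_n}(x)\Big)^{\top},
    \qquad
    \nabla^2 f(x) = \Big[\tfrac{\partial^2 f}{\partial x_i\,\partial x_j}(x)\Big]_{i, j = 1}^{n}.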

15 The Most Important Concept in Optimization (Minimization)
A point is an optimal solution of an unconstrained minimization problem if there exists no descent direction. A point is an optimal solution of a constrained minimization problem if there exists no feasible descent direction: a descent direction might exist, but moving along it would leave the feasible region.

16 Two Important Algorithms for Unconstrained Minimization Problems
Steepest descent with exact line search, and Newton's method (both are sketched below).
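A minimal sketch of both methods on a convex quadratic f(x) = 0.5 x'Qx - c'x, for which the exact line-search step has a closed form (the test problem and all names are illustrative assumptions):

    import numpy as np

    def steepest_descent_exact(Q, c, x0, tol=1e-8, max_iter=1000):
        """Steepest descent with exact line search on f(x) = 0.5 x'Qx - c'x."""
        x = x0.astype(float)
        for _ in range(max_iter):
            g = Q @ x - c                      # gradient of f
            if np.linalg.norm(g) < tol:
                break
            alpha = (g @ g) / (g @ (Q @ g))    # exact minimizer of f along -g
            x -= alpha * g
        return x

    def newton_method(Q, c, x0):
        """Newton's method: for a quadratic, one step x - H^{-1} grad reaches the minimizer."""
        g = Q @ x0 - c
        return x0 - np.linalg.solve(Q, g)      # the Hessian of f is Q

    Q = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
    c = np.array([1.0, 1.0])
    x0 = np.zeros(2)
    print(steepest_descent_exact(Q, c, x0))    # both agree with np.linalg.solve(Q, c)
    print(newton_method(Q, c, x0))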

17 Linear Program and Quadratic Program
An optimization problem in which the objective function and all constraints are linear functions is called a linear programming problem; the robust LP of slide 4 is in this category. If the objective function is convex quadratic while the constraints are all linear, the problem is called a convex quadratic programming problem; the standard SVM formulation is in this category.

18 Lagrangian Dual Problem
The primal problem: minimize the objective function subject to inequality constraints (the dual is constructed on the next slide).

19 Lagrangian Dual Problem
The primal problem and its Lagrangian dual, in which the dual objective is the infimum of the Lagrangian over the primal variables (sketched below).
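The construction, sketched in conventional notation (assumed):

    \text{Primal: } \min_x\ f(x) \ \ \text{s.t.}\ \ g(x) \le 0,
    \qquad
    L(x, \alpha) = f(x) + \alpha^{\top} g(x),\ \ \alpha \ge 0,
    \text{Dual: } \max_{\alpha \ge 0}\ \theta(\alpha), \quad \text{where } \theta(\alpha) = \inf_x\, L(x, \alpha).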

20 Weak Duality Theorem
Let a feasible solution of the primal problem and a feasible solution of the dual problem be given. Then the dual objective value is no larger than the primal objective value. Corollary: the optimal value of the dual problem is a lower bound on the optimal value of the primal problem.

21 Saddle Point of Lagrangian
A pair of primal and dual variables at which the Lagrangian is simultaneously minimized over the primal variables and maximized over the dual variables is called a saddle point of the Lagrangian function.

22 Dual Problem of Linear Program
The primal LP and its dual LP (sketched below). All duality theorems hold and work perfectly!
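The standard symmetric primal-dual LP pair, sketched (symbols assumed):

    \text{Primal: } \min_x\ c^{\top}x \ \ \text{s.t.}\ \ Ax \ge b,\ \ x \ge 0,
    \qquad
    \text{Dual: } \max_{\alpha}\ b^{\top}\alpha \ \ \text{s.t.}\ \ A^{\top}\alpha \le c,\ \ \alpha \ge 0.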

23 Dual Problem of Strictly Convex Quadratic Program
Primal QP with linear constraints. With the strictly convex assumption, the primal variables can be eliminated from the Lagrangian, which yields the dual QP.

24 Dual Problem of Strictly Convex Quadratic Program
Primal QP with linear constraints; with the strictly convex assumption, eliminating the primal variables yields the dual QP.

25 Support Vector Classification
(Linearly Separable Case, Dual Form) The dual of the previous mathematical program (sketched below). Applying the KKT optimality conditions, the weight vector is recovered as a combination of the training points weighted by the dual variables. But where is the bias? Don't forget to recover it from an active constraint.
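A sketch of this dual in conventional notation (assumed; the course's own slides express it with the matrices A and D introduced earlier):

    \max_{\alpha \ge 0}\ \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i, j}\alpha_i\alpha_j\, y_i y_j\,\langle x_i, x_j\rangle
    \quad \text{s.t.}\quad \sum_{i=1}^{l} y_i\alpha_i = 0,
    \text{with } w = \sum_i \alpha_i y_i x_i \text{ from the KKT conditions; the bias is recovered from any point with } \alpha_i > 0.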

26 Dual Representation of SVM
(Key of Kernel Methods) The hypothesis is determined by the dual variables, and it depends on the data only through inner products between training points and the test point.

27 Learning in Feature Space
(Could Simplify the Classification Task) Learning in a high-dimensional space could degrade generalization performance; this phenomenon is called the curse of dimensionality. By using a kernel function that represents the inner product of training examples in feature space, we never need to know the nonlinear map explicitly; we do not even need to know the dimensionality of the feature space. There is no free lunch: we must deal with a huge and dense kernel matrix. A reduced kernel can avoid this difficulty.

28

29 Kernel Technique: Based on Mercer's Condition (1909)
The value of the kernel function represents the inner product in feature space. Kernel functions merge two steps: 1. map the input data from input space to feature space (which might be infinite-dimensional); 2. compute the inner product in the feature space.

30 Linear Machine in Feature Space
Let a nonlinear map from the input space to some feature space be given. The classifier is a linear function of the mapped inputs (primal form); rewriting it in the dual form expresses it entirely through inner products in the feature space.

31 Kernel: Represents the Inner Product in Feature Space
Definition: a kernel is a function whose value equals the inner product of the mapped inputs in feature space. The classifier then becomes a weighted sum of kernel evaluations against the training points (sketched below).
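Sketched in conventional notation (assumed), with phi the feature map:

    K(x, z) = \langle \varphi(x), \varphi(z)\rangle,
    \qquad
    f(x) = \operatorname{sign}\Big(\sum_{i=1}^{l}\alpha_i y_i\, K(x_i, x) + b\Big).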

32 A Simple Example of a Kernel
Polynomial kernel of degree 2 and a corresponding nonlinear map: the kernel value equals the inner product of the mapped points. There are many other nonlinear maps that satisfy the same relation (a numerical check follows).
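A small numerical check of this claim for inputs in R^2, using one common explicit feature map (the particular map below is an illustrative assumption; many maps realize the same kernel):

    import numpy as np

    def phi(x):
        """One explicit feature map for the degree-2 polynomial kernel (x.z + 1)**2 on R^2."""
        x1, x2 = x
        return np.array([x1**2, x2**2,
                         np.sqrt(2) * x1 * x2,
                         np.sqrt(2) * x1,
                         np.sqrt(2) * x2,
                         1.0])

    def poly2_kernel(x, z):
        return (x @ z + 1.0) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(poly2_kernel(x, z), phi(x) @ phi(z))   # both print 4.0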

33 Power of the Kernel Technique
Consider a nonlinear map that consists of distinct features for all the monomials of degree d; its dimension grows combinatorially with d. Is it necessary to compute this map explicitly? No: we only need to know the kernel value, and that can be computed directly in the input space.

34 2-Norm Soft Margin Dual Formulation
The Lagrangian for the 2-norm soft margin, with nonnegative multipliers. Setting the partial derivatives with respect to the primal variables to zero yields the dual problem.

35 Dual Maximization Problem for the 2-Norm Soft Margin
The dual problem (sketched below) together with the corresponding KKT complementarity conditions; use these conditions to find the bias.
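A sketch of this dual in conventional notation (assumed), with delta_ij the Kronecker delta:

    \max_{\alpha \ge 0}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i, j}\alpha_i\alpha_j\, y_i y_j\Big(\langle x_i, x_j\rangle + \tfrac{1}{C}\,\delta_{ij}\Big)
    \quad \text{s.t.}\quad \sum_i y_i\alpha_i = 0.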

36 Introduce Kernel in Dual Formulation
For the 2-Norm Soft Margin. The feature space is implicitly defined by the kernel function. Suppose the dual variables solve the QP problem; then the decision rule is the sign of a kernel expansion over the training points, and the KKT conditions above are used to find the bias.

37 Introduce Kernel in Dual Formulation
For the 2-Norm Soft Margin. The bias is chosen so that the complementarity condition is satisfied for any training point whose multiplier is nonzero; this follows from the KKT conditions on the previous slide.

38 Sequential Minimal Optimization (SMO)
Deals with one equality constraint and the box constraints of the dual problem. Works on the smallest possible working set (only 2 variables). Finds the optimal solution by changing only the values in the working set. The two-variable subproblem can be solved analytically, which is the best feature of SMO.

39 Analytical Solution for Two Points
Suppose we change one multiplier. In order to keep the equality constraint we have to change a second multiplier as well, so that the constraint still holds. The new values must also satisfy the box constraints, which further restricts how much the pair can change.

40 A Restrictive Constraint on the New Pair
Suppose we change the second multiplier; once its new value is fixed, the first is recovered from the equality constraint. The restrictive constraint bounds the new value between a lower limit and an upper limit, whose form depends on whether the two labels are equal or not (sketched below).
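A hedged sketch of the lower/upper limits for a pair of multipliers (alpha1, alpha2) with labels y1, y2 in {-1, +1} and box [0, C]; the variable names follow Platt's usual description of SMO and are assumptions here:

    def clip_bounds(alpha1, alpha2, y1, y2, C):
        """Limits L, H that keep the pair on the equality-constraint line inside the box."""
        if y1 != y2:                       # alpha1 - alpha2 stays constant
            L = max(0.0, alpha2 - alpha1)
            H = min(C, C + alpha2 - alpha1)
        else:                              # alpha1 + alpha2 stays constant
            L = max(0.0, alpha1 + alpha2 - C)
            H = min(C, alpha1 + alpha2)
        return L, H

    def clip(a2_new, L, H):
        """Project the unconstrained analytic solution for alpha2 back into [L, H]."""
        return max(L, min(H, a2_new))

After clipping the new alpha2 into [L, H], alpha1 is recovered from the equality constraint as alpha1 + y1*y2*(alpha2_old - alpha2_new).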

41 ε-Support Vector Regression (Linear Case)
Given a training set of points with real-valued responses, represented by a data matrix and a response vector, try to find a linear function of the inputs that approximates the responses. Motivated by SVM: the norm of the weight vector should be as small as possible, and tiny errors should be discarded.

42 ε-Insensitive Loss Function (Tiny Errors Should Be Discarded)
The ε-insensitive loss ignores errors of magnitude at most ε and penalizes larger errors only by the amount by which they exceed ε (sketched below). The loss made by the estimation function at a data point is measured in exactly this way.
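The loss, sketched:

    \lvert t\rvert_{\varepsilon} = \max\{0,\ \lvert t\rvert - \varepsilon\},
    \qquad
    \text{loss at } (x_i, y_i):\ \ \lvert f(x_i) - y_i\rvert_{\varepsilon}.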

43 ε-Insensitive Linear Regression
Find the regression function with the smallest overall ε-insensitive error.

44 Five Popular Loss Functions

45 ε-Insensitive Loss Regression
Linear ε-insensitive loss function, where the estimator is a real-valued function; the quadratic ε-insensitive loss instead penalizes the square of the amount by which the error exceeds ε.

46 ε-Insensitive Support Vector Regression Model
Motivated by SVM: the norm of the weight vector should be as small as possible, and tiny errors (those within ε) should be discarded.

47 Why Minimize the Norm of the Weight Vector? Probably Approximately Correct (pac)
Consider performing linear regression for any training data distribution; the resulting generalization bound improves as the norm of the weight vector decreases. Occam's razor: the simplest is the best.

48 Reformulated ε-SVR as a Constrained Minimization Problem
The reformulation (sketched below) has n + 1 + 2m variables and 2m constraints, which enlarges the problem size and the computational complexity of solving it.
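A sketch of the reformulation in conventional notation (C, xi, and xi* are assumed names):

    \min_{w, b, \xi, \xi^{*}}\ \tfrac{1}{2}\lVert w\rVert_2^2 + C\sum_{i=1}^{m}\big(\xi_i + \xi_i^{*}\big)
    \quad \text{s.t.}\quad
    (w^{\top}x_i + b) - y_i \le \varepsilon + \xi_i, \ \
    y_i - (w^{\top}x_i + b) \le \varepsilon + \xi_i^{*}, \ \
    \xi_i, \xi_i^{*} \ge 0.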

49 SV Regression by Minimizing the Quadratic ε-Insensitive Loss
We obtain the following problem: minimize the regularization term plus the quadratic ε-insensitive loss over the training data.

50 Primal Formulation of SVR for the Quadratic ε-Insensitive Loss
The primal problem with its constraints. Extremely important: at the solution, at most one of the two slack variables associated with each training point is nonzero.

51 Simplified Dual Formulation of SVR
The simplified dual, subject to its constraints. In the case ε = 0, the problem reduces to least squares linear regression with a weight decay factor.

52 Kernel in Dual Formulation for SVR
Suppose the dual variables solve the QP problem subject to its constraints. Then the regression function is a kernel expansion over the training points (sketched below), where the bias is chosen so that the KKT conditions hold at a point with a nonzero multiplier.
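In the conventional two-multiplier notation (assumed), the resulting regression function is

    f(x) = \sum_{i=1}^{m}\big(\alpha_i - \alpha_i^{*}\big)\, K(x_i, x) + b.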

53 Probably Approximately Correct Learning: pac Model
Key assumption: training and testing data are generated i.i.d. according to a fixed but unknown distribution. When we evaluate the "quality" of a hypothesis (classification function) we should take the unknown distribution into account (i.e., the "average error" or "expected error" made by the hypothesis). We call such a measure the risk functional.

54 Generalization Error of the pac Model
Let the training examples be chosen i.i.d. according to the unknown distribution. Treat the generalization error as a random variable depending on the random selection of the training set. Find a bound on the tail of its distribution in the form: with confidence level 1 − δ (chosen by the learner), the generalization error is at most a function of the training set and δ.

55 Probably Approximately Correct
We assert: with probability at least 1 − δ, the error made by the hypothesis is less than the error bound, and the bound does not depend on the unknown distribution.

56 Find the Hypothesis with Minimum Expected Risk?
Let the training examples be chosen i.i.d. according to the distribution, with a probability density. The expected misclassification error made by a hypothesis is its expected risk. The ideal hypothesis should have the smallest expected risk. Unrealistic, because the distribution is unknown!

57 Empirical Risk Minimization (ERM)
(The distribution and its density are not needed.) Replace the expected risk over the distribution by an average over the training examples: the empirical risk (the two risks are sketched below). Find the hypothesis with the smallest empirical risk. Focusing only on empirical risk will cause overfitting.
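The two risks, sketched for a binary hypothesis h with labels in {-1, +1} (standard definitions):

    R[h] = \int \tfrac{1}{2}\,\lvert h(x) - y\rvert\, dP(x, y),
    \qquad
    R_{\mathrm{emp}}[h] = \frac{1}{m}\sum_{i=1}^{m} \tfrac{1}{2}\,\lvert h(x_i) - y_i\rvert.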

58 Overfitting
Overfitting is a phenomenon in which the resulting function fits the training set too well but does not have good prediction performance on unseen data. (Figure: red dots are generated by f(x) with random noise; the solid curve is f(x); the dotted curve is a nonlinear regression which passes through these 8 points.)

59 Tuning Procedure
The final value of the parameter is the one with the maximum testing set correctness! (Figure: tuning curve illustrating overfitting.)

60 VC Confidence (the Bound between Expected Risk and Empirical Risk)
The following inequality holds with probability 1 − η (a sketch follows). Reference: C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2(2), 1998, pp. 121-167.
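The bound as given in the cited Burges tutorial, reproduced here from memory (treat as a sketch): with probability 1 − η over the draw of m training examples,

    R[h] \ \le\ R_{\mathrm{emp}}[h] + \sqrt{\frac{v\big(\log(2m/v) + 1\big) - \log(\eta/4)}{m}},
    \text{where } v \text{ is the VC-dimension of the hypothesis space.}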

61 Capacity (Complexity) of Hypothesis Space: VC-dimension
A given training set is shattered by the hypothesis space if, for every labeling of the training set, there is a hypothesis consistent with this labeling. Example: three (linearly independent) points can be shattered by hyperplanes in the plane.

62 Shattering Points with Hyperplanes
Can you always shatter three points with a line in the plane? Theorem: consider a set of m points in R^n and choose any one of them as the origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.

63 Definition of VC-dimension (A Capacity Measure of Hypothesis Space)
The Vapnik-Chervonenkis dimension of a hypothesis space defined over an input space is the size of the largest finite subset of the input space shattered by the hypothesis space. If arbitrarily large finite subsets can be shattered, the VC-dimension is infinite. For example, the VC-dimension of oriented hyperplanes in R^n is n + 1.

