1 Sparse Kernel Methods for Classification and Regression
October 17, 2007
Kyungchul Park, SKKU

2 General Model of Learning
Learning Model (Vapnik, 2000)
- Generator (G): generates random vectors x, drawn independently from a fixed but unknown distribution F(x).
- Supervisor (S): returns an output value y according to a conditional distribution F(y|x), also fixed but unknown.
- Learning Machine (LM): capable of implementing a set of functions f(x, w), w ∈ W.
The Learning Problem
- Choose the function that best approximates the supervisor's response, based on a training set of N i.i.d. observations (x_1, y_1), ..., (x_N, y_N) drawn from the distribution F(x, y) = F(x)F(y|x).

3 Risk Minimization
To best approximate the supervisor's response, find the function f(x, w) that minimizes the risk functional
  R(w) = ∫ L(y, f(x, w)) dF(x, y)
- L: loss function measuring the discrepancy between y and f(x, w).
- Note that F(x, y) is fixed but unknown; the only information available is contained in the training set.
- How can the risk be estimated?

4 Classification and Regression
Classification Problem
- Supervisor's output y ∈ {0, 1}.
- The loss function: L(y, f(x, w)) = 0 if y = f(x, w), and 1 otherwise (0-1 loss).
Regression Problem
- Supervisor's output y: a real value.
- The loss function: L(y, f(x, w)) = (y − f(x, w))² (squared error).

5 Empirical Risk Minimization Framework
Empirical Risk
- F(x, y) is unknown, so R(w) is estimated from the training data by the empirical risk
  R_emp(w) = (1/N) Σ_{n=1..N} L(y_n, f(x_n, w))
Empirical Risk Minimization (ERM) Framework
- Find a function that minimizes the empirical risk.
- This is the fundamental assumption in inductive learning.
- For the classification problem, ERM amounts to finding a function with minimum training error.
- For the regression problem, it leads to the least-squares method.
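
The following is a minimal numpy sketch (not from the slides) of how the empirical risk above is evaluated for a fixed candidate function f; the toy data and the predictor are illustrative assumptions.

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """R_emp = (1/N) * sum over the training set of loss(y_n, f(x_n))."""
    preds = np.array([f(x) for x in X])
    return np.mean(loss(y, preds))

zero_one_loss = lambda y, p: (y != p).astype(float)   # classification loss
squared_loss = lambda y, p: (y - p) ** 2              # regression loss

# Toy regression data and a fixed linear predictor f(x) = 2x.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.2, 3.9, 6.1])
print(empirical_risk(lambda x: 2 * x, X, y, squared_loss))
```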

6 Over-Fitting Problem
Over-Fitting
- Small training error (empirical risk), but large generalization error!
- Consider polynomial curve fitting: polynomials of sufficiently high degree can perfectly fit any given finite set of training data, yet when applied to new (unseen) data the prediction quality can be very poor.
Why Over-Fitting?
- Many possible causes: insufficient data, noise, etc.
- A source of continuing debate.
- However, we know that over-fitting is closely related to model complexity (the expressive power of the learning machine).

7 Over-Fitting Problem: Illustration (Bishop, 2006)
[Figure: polynomial fits for several degrees M. Green: true function; red: least-squares estimate.]
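
A small numpy sketch, not from the slides, that reproduces the spirit of this illustration; the noisy sine data, the chosen degrees, and the error measure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for M in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=M)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"M={M}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")

# The degree-9 polynomial drives the training error to nearly zero but
# generalizes worse than the degree-3 fit: over-fitting.
```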

8 How to Avoid the Over-Fitting Problem
General Idea
- Penalize models with high complexity (Occam's Razor).
Regularization
- Add a regularization term to the risk functional.
- E.g., ridge regression minimizes Σ_n (y_n − w·x_n)² + λ‖w‖².
SRM (Structural Risk Minimization) Principle
- Due to Vapnik (1996): bound the risk by the empirical risk plus a confidence term that depends on h, the capacity of the set of functions, and choose the function class that minimizes the bound.
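
A minimal numpy sketch of the ridge penalty at work, assuming the same noisy sine data and a degree-9 polynomial basis as in the previous example; the λ values are arbitrary.

```python
import numpy as np

def poly_features(x, M):
    """Design matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def ridge_fit(Phi, y, lam):
    """w = (Phi^T Phi + lam * I)^(-1) Phi^T y: the regularized least-squares solution."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
Phi = poly_features(x, 9)

for lam in (1e-8, 1e-3, 1.0):
    w = ridge_fit(Phi, y, lam)
    print(f"lambda={lam}: ||w|| = {np.linalg.norm(w):.2f}")

# Larger lambda shrinks the weights, trading a little training error
# for a smoother fit that generalizes better.
```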

9 How to Avoid the Over-Fitting Problem
Bayesian Methods
- Incorporate prior knowledge on the form of the functions through a prior distribution F(w).
- Final result: the predictive distribution F(y|D), where D is the training set, is obtained by marginalizing over w.
Remarks
1) The Bayesian framework gives probabilistic generative models.
2) There is a strong connection with regularization theory.
3) Kernels can be generated from generative models.

10 Motivation: Linear Regression
Primal Problem
- Use ridge regression: minimize J(w) = Σ_n (y_n − w·x_n)² + λ‖w‖².
- Solution: w = (XᵀX + λI_D)⁻¹ Xᵀy, where X is the N × D matrix whose rows are the x_n.

11 Motivation: Linear Regression
Dual Problem
- Write the solution as w = Xᵀa with dual variables a = (a_1, ..., a_N)ᵀ.
- Then a = (K + λI_N)⁻¹ y, where K = XXᵀ is the N × N Gram matrix with entries K_nm = k(x_n, x_m) = x_n·x_m.
- Prediction for a new input x: f(x) = w·x = Σ_n a_n k(x_n, x).

12 Motivation: Linear Regression
Discussion
- In the primal formulation we must invert a D × D matrix; in the dual formulation, an N × N matrix.
- The dual representation shows that the predicted value is a linear combination of the observed values, with weights given by the function k(·,·).
So Why the Dual?
- The solution of the dual problem is determined entirely by K.
- K is called the Gram matrix and is defined by the function k(·,·), called the kernel function.
- The key observation is that the regression problem can be solved knowing only the Gram matrix K, or equivalently the kernel function k.
- We can therefore generalize to other classes of functions simply by defining a new kernel function!
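
A short numpy sketch of the dual computation, not from the slides: with the linear kernel it reproduces the primal ridge solution, and swapping in a different kernel changes nothing else in the code. The data, λ, and the RBF width are illustrative assumptions.

```python
import numpy as np

def dual_ridge_predict(X_train, y_train, X_new, kernel, lam):
    """Dual ridge regression: a = (K + lam*I)^(-1) y, f(x) = sum_n a_n k(x_n, x)."""
    K = kernel(X_train, X_train)                      # N x N Gram matrix
    a = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return kernel(X_new, X_train) @ a

linear_kernel = lambda A, B: A @ B.T
rbf_kernel = lambda A, B, s=0.5: np.exp(
    -((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * s**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                          # N = 20 points, D = 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

# Primal ridge (D x D inverse) and dual ridge (N x N inverse) give the
# same predictions for the linear kernel.
w = np.linalg.solve(X.T @ X + 0.1 * np.eye(3), X.T @ y)
print(X[:3] @ w)
print(dual_ridge_predict(X, y, X[:3], linear_kernel, lam=0.1))

# Swapping in rbf_kernel yields a nonlinear regressor with no other changes.
print(dual_ridge_predict(X, y, X[:3], rbf_kernel, lam=0.1))
```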

13 Beyond Linear Relations
Extension to Nonlinear Functions
- Feature space transform: x ↦ φ(x), where φ(x) is a vector of basis functions.
- Define the set of functions f(x, w) = w·φ(x); for example, all polynomials of degree D.
- By using a feature space transform, we extend the linear relation to nonlinear relations.
- These models are still linear models, since the function is linear in the unknowns (w).

14 Beyond Linear Relations
Problems with the Feature Space Transform
- Difficulty in finding an appropriate transform.
- Curse of dimensionality: the number of parameters increases rapidly.
So: Kernel Functions!
- In the dual formulation, the only information needed is the kernel function.
- The kernel function is defined as an inner product of two feature vectors.
- If we can find an appropriate kernel function, we can solve the problem without explicitly considering the feature space transform.
- Some kernel functions have the effect of working in an infinite-dimensional feature space.

15 Kernel Functions
A kernel is a function k that for all x, z ∈ X satisfies
  k(x, z) = φ(x)·φ(z),
where φ is a mapping from X to a feature space F.
Example
- For x, z ∈ R², k(x, z) = (x·z)² = φ(x)·φ(z) with φ(x) = (x₁², √2 x₁x₂, x₂²).
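
A quick numerical check of the example above (not from the slides): the kernel value agrees with the explicit inner product in feature space.

```python
import numpy as np

def k_poly2(x, z):
    """Kernel k(x, z) = (x . z)^2 for 2-D inputs."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map with k(x, z) = phi(x) . phi(z)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(k_poly2(x, z))        # 1.0, computed directly from the kernel
print(phi(x) @ phi(z))      # 1.0, the same value via the explicit feature map
```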

16 Characterization of Kernel Functions
How to Find a Kernel Function?
- One way: first define a feature space transform, then define the kernel as an inner product in that space.
- A direct method characterizes kernels without reference to a feature map:
Characterization of Kernels (Shawe-Taylor and Cristianini, 2004)
- A function k(x, z), which is either continuous or has a finite domain, can be decomposed as k(x, z) = φ(x)·φ(z) if and only if it is a finitely positive semi-definite function; that is, for any choice of a finite set {x_1, ..., x_n} ⊆ X, the Gram matrix K with entries K_ij = k(x_i, x_j) is positive semi-definite.
- For the proof, see the reference (RKHS, Reproducing Kernel Hilbert Spaces).
- An alternative characterization is given by Mercer's theorem.
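
A quick numerical illustration of this characterization, assuming a Gaussian kernel and a random point set: the Gram matrix of a valid kernel has no negative eigenvalues (up to floating-point rounding).

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
points = rng.normal(size=(8, 2))                  # an arbitrary finite set of points
K = np.array([[gaussian_kernel(a, b) for b in points] for a in points])

eigvals = np.linalg.eigvalsh(K)                   # K is symmetric
print("min eigenvalue:", eigvals.min())           # >= 0 (up to rounding): K is PSD
```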

17 Examples of Kernel Functions
Example
- 1st: polynomial kernel, e.g., k(x, z) = (x·z + c)^M.
- 2nd: Gaussian kernel, k(x, z) = exp(−‖x − z‖² / 2σ²).
- 3rd: a kernel derived from a generative model, e.g., k(x, z) = p(x)p(z), where p(x) is a probability.
- 4th: a kernel defined on the power set of a given set S, e.g., k(A₁, A₂) = 2^|A₁ ∩ A₂| for subsets A₁, A₂ of S.
- There are many known techniques for constructing new kernels from existing kernels; see the reference.
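
A small numpy sketch, not from the slides, of two of the standard construction rules mentioned in the last bullet: the sum and the elementwise product of two kernels are again kernels, which the eigenvalue check from the previous slide confirms numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

def gram(kernel, X):
    """Gram matrix of a kernel on the point set X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

k_poly = lambda x, z, c=1.0, M=3: (x @ z + c) ** M                        # polynomial kernel
k_rbf = lambda x, z, s=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * s ** 2))  # Gaussian kernel

K_sum = gram(k_poly, X) + gram(k_rbf, X)     # sum of two kernels
K_prod = gram(k_poly, X) * gram(k_rbf, X)    # elementwise product of two kernels

for name, K in (("sum", K_sum), ("product", K_prod)):
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())   # both remain PSD
```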

18 Kernels in Practice
In practical applications, you can choose a kernel that reflects the similarity between two objects.
- Note that k(x, z) = φ(x)·φ(z) is an inner product in feature space.
- Hence, if appropriately normalized, the kernel represents the similarity (the cosine of the angle) between two objects in some feature space.
Remarks
1) Kernel trick: develop a learning algorithm based only on inner products, then replace the inner product with a kernel (e.g., regression, classification, etc.).
2) Generalized distance: the notion of a kernel can be generalized to the case where it represents dissimilarity in some feature space (conditionally positive semi-definite kernels); such kernels can be used in learning algorithms based on distances between objects (e.g., clustering, nearest neighbor, etc.).
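
A short sketch of the normalization mentioned above, assuming a homogeneous cubic polynomial kernel: dividing by √(k(x, x) k(z, z)) gives the cosine of the angle between the feature vectors, so the result no longer depends on scale.

```python
import numpy as np

def normalized_kernel(kernel, x, z):
    """k(x, z) / sqrt(k(x, x) * k(z, z)): cosine of the angle between
    phi(x) and phi(z) in feature space."""
    return kernel(x, z) / np.sqrt(kernel(x, x) * kernel(z, z))

k_poly = lambda x, z: (x @ z) ** 3      # homogeneous cubic polynomial kernel

x = np.array([1.0, 0.0])
z = np.array([10.0, 0.0])               # same direction, very different length
w = np.array([1.0, 1.0])

print(k_poly(x, z))                      # 1000.0: raw value depends on scale
print(normalized_kernel(k_poly, x, z))   # 1.0: identical direction in feature space
print(normalized_kernel(k_poly, x, w))   # ~0.35: less similar objects
```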

19 Support Vector Machines
2-Class Classification Problem
- Given a training set {(x_n, t_n)}, n = 1, ..., N, with t_n ∈ {−1, +1}, find a function f(x) = w·φ(x) + b that satisfies f(x_n) > 0 for all points having t_n = +1 and f(x_n) < 0 for all points having t_n = −1.
- Equivalently, t_n f(x_n) > 0 for all n.

20 Support Vector Machines: Linearly Separable Case
Linearly Separable Case
- The case in which such a function f(x) can be found.
- The points (training data) are then separated by a hyperplane (the separating hyperplane) f(x) = 0 in the feature space.
- There can be infinitely many such functions.
Margin
- Margin: the distance between the hyperplane and the closest point.

21 Support Vector Machines: Linearly Separable Case
Maximum Margin Classifiers
- Find the hyperplane with the maximum margin.
- Why maximum margin? Recall SRM: the maximum margin hyperplane corresponds to the case with the smallest capacity (Vapnik, 2000), so it is the solution obtained under the SRM framework.

22 Support Vector Machines: Linearly Separable Case
Formulation: Quadratic Programming
  minimize (1/2)‖w‖²
  subject to t_n (w·φ(x_n) + b) ≥ 1, n = 1, ..., N
1) The parameters (w, b) are rescaled so that t_n (w·φ(x_n) + b) = 1 for the points closest to the hyperplane.
2) The margin is then 1/‖w‖, so maximizing the margin is equivalent to minimizing ‖w‖².

23 Support Vector Machines: Linearly Separable Case
Dual Formulation
  maximize Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m)
  subject to a_n ≥ 0 for all n, Σ_n a_n t_n = 0
1) Obtained by applying Lagrange duality to the primal QP.
2) The weight vector is recovered as w = Σ_n a_n t_n φ(x_n).
3) The hyperplane found is f(x) = Σ_n a_n t_n k(x, x_n) + b = 0.

24 Support Vector Machines: Linearly Separable Case
Discussion
- KKT conditions: a_n ≥ 0, t_n f(x_n) − 1 ≥ 0, and a_n (t_n f(x_n) − 1) = 0.
- So a_n > 0 only if t_n f(x_n) = 1, i.e., the point lies exactly on the margin. Such vectors are called support vectors. The maximum margin hyperplane depends only on the support vectors: sparsity.
- To solve the dual problem we only need the kernel function k, so the feature space transform never has to be considered explicitly.
- The form of the maximum margin hyperplane shows that the prediction is a combination of observations (with weights given by kernels), specifically of the support vectors.
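
A scikit-learn sketch (not part of the original slides) illustrating the sparsity just described: after fitting a kernel SVM, only a small subset of training points carry non-zero dual coefficients. The data set and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs: an (essentially) linearly separable problem.
X, t = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

# A very large C approximates the hard-margin maximum-margin classifier.
clf = SVC(kernel="rbf", gamma=0.5, C=1e6).fit(X, t)

print("training points:", len(X))
print("support vectors:", len(clf.support_vectors_))          # typically far fewer
print("dual coefficients (a_n t_n):", clf.dual_coef_.ravel()[:5])

# Prediction uses only the support vectors:
#   f(x) = sum over support vectors of a_n t_n k(x, x_n) + b
print("predictions:", clf.predict(X[:5]))
```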

25 Support Vector Machines: Linearly Separable Case
Example (Bishop, 2006)
[Figure: decision boundary, margins, and support vectors; Gaussian kernels are used here.]

26 Support Vector Machines: Overlapping Classes
Overlapping Classes
- Introduce slack variables that allow some points to violate the margin, penalizing the total slack through a parameter C.
- The results are almost the same as in the separable case, except for additional constraints on the dual variables.
- For details, see the reference.
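
A scikit-learn sketch of the soft-margin trade-off on overlapping classes; the parameter C (which penalizes the slack) and the data are illustrative assumptions: smaller C tolerates more margin violations and typically keeps more support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: the blobs are close enough that no hyperplane
# separates them perfectly, so slack variables are needed.
X, t = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(f"C={C}: {len(clf.support_vectors_)} support vectors, "
          f"training accuracy {clf.score(X, t):.2f}")

# Small C: heavy regularization, wide margin, many support vectors;
# large C: few margin violations are tolerated.
```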

27 SVM for Regression
ε-Insensitive Error Function
- E_ε(y − f(x)) = 0 if |y − f(x)| < ε, and |y − f(x)| − ε otherwise.
- Errors smaller than ε are ignored; only points outside the ε-tube around f contribute to the loss.

28 SVM for Regression
Formulation
  minimize C Σ_n (ξ_n + ξ_n*) + (1/2)‖w‖²
  subject to y_n ≤ f(x_n) + ε + ξ_n, y_n ≥ f(x_n) − ε − ξ_n*, ξ_n ≥ 0, ξ_n* ≥ 0
- ξ_n and ξ_n* are slack variables for points lying above and below the ε-tube, respectively.

29 SVM for Regression
Solution
- As in SVM for classification, take the Lagrange dual.
- The solution is then given by f(x) = Σ_n (a_n − a_n*) k(x, x_n) + b.
- By considering the KKT conditions, we can show that a dual variable is positive only if the corresponding point lies either on the boundary of or outside the ε-tube.
- So sparsity results.
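
A scikit-learn sketch of ε-insensitive regression (not from the slides): points strictly inside the ε-tube get zero dual variables, so widening the tube increases sparsity. The data and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

for eps in (0.01, 0.1, 0.3):
    reg = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(reg.support_)} of {len(X)} points are support vectors")

# A wider epsilon-tube leaves more points strictly inside it, so fewer
# support vectors remain: sparsity increases with epsilon.
```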

30 Summary
Classification and Regression Based on Kernels
- Dual formulation → extension to arbitrary kernels.
- Sparsity: support vectors.
Some Limitations of SVM
- Choice of kernel.
- Solution algorithm: efficiency of solving large-scale QP problems.
- Multi-class problems.

31 Related Topics
Relevance Vector Machines
- Use prior knowledge on the distribution of the functions: a prior on the parameters w governed by hyperparameters α, together with a noise parameter β.
- Choose the values of α, β that maximize the marginal likelihood function.
- Then, using them, find the predictive distribution of y given a new value of x via the posterior of w.

32 Related Topics
Gaussian Process
- For any finite set of points, the function values jointly have a Gaussian distribution.
- Usually, due to lack of prior knowledge, the mean is taken to be 0.
- The covariance is defined by a kernel function k.
- Given a set of observations, the regression problem reduces to finding the conditional distribution of y at the new input.
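
A scikit-learn sketch of Gaussian-process regression with a zero prior mean and an RBF covariance kernel plus observation noise; the data and kernel settings are illustrative assumptions. The prediction is the conditional (posterior) Gaussian at the new inputs.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=20)

# Zero prior mean; covariance given by an RBF kernel plus a noise term.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

X_new = np.linspace(0, 5, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)   # conditional Gaussian at X_new
for x, m, s in zip(X_new.ravel(), mean, std):
    print(f"x={x:.2f}: mean={m:.3f}, std={s:.3f}")
```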

33 References
General introductory material for machine learning
[1] Pattern Recognition and Machine Learning by C. M. Bishop, Springer, 2006. A very well written book with an emphasis on Bayesian methods.
Fundamentals of statistical learning theory and kernel methods
[2] Statistical Learning Theory by V. Vapnik, John Wiley and Sons, 1996.
[3] The Nature of Statistical Learning Theory, 2nd ed., by V. Vapnik, Springer, 2000.
Both books deal with essentially the same topics, but [3] keeps mathematical details to a minimum, while [2] gives all the details. The origin of SVM.
Kernel engineering
[4] Kernel Methods for Pattern Analysis by J. Shawe-Taylor and N. Cristianini, Cambridge University Press, 2004. Deals with various kernel methods, with applications to texts, sequences, trees, etc.
Gaussian processes
[5] Gaussian Processes for Machine Learning by C. Rasmussen and C. Williams, MIT Press, 2006. Presents an up-to-date survey of Gaussian processes and related topics.

