Slide 1: Introduction to Kernels
Max Welling, October 1, 2004 (chapters 1, 2, 3, 4)

Slide 2: Introduction
What is the goal of (pick your favorite name): machine learning, data mining, pattern recognition, data analysis, statistics?
Answer: the automatic detection of non-coincidental structure in data.
Desiderata:
- Robust algorithms: insensitive to outliers and wrong model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: scale to large datasets.

Slide 3: Let's Learn Something
Find the common characteristic (structure) among the following statistical methods:
1. Principal components analysis
2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
Answer: they all consider linear combinations of the input vector, $f(x) = \langle w, x \rangle = \sum_j w_j x_j$.
Linear algorithms are very well understood and enjoy strong guarantees (convexity, generalization bounds). Can we carry these guarantees over to non-linear algorithms?

Slide 4: Feature Spaces
A non-linear mapping $\phi : x \mapsto \phi(x) \in F$ takes the inputs into a feature space F, which may be:
1. a high-dimensional space
2. an infinite-dimensional countable (sequence) space, e.g. $\ell_2$
3. a function space (Hilbert space)
Example: a quadratic feature map (a sketch follows below).
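A minimal sketch of a standard example (an assumption here, since the slide's own example is not shown in the text): the quadratic feature map on two-dimensional inputs, whose feature-space inner product equals the squared input-space inner product.

```python
import numpy as np

def phi(x):
    # Quadratic feature map R^2 -> R^3 (an assumed example, not necessarily the slide's)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# The feature-space inner product equals the squared input-space inner product:
# <phi(x), phi(y)> = <x, y>^2, so k(x, y) = <x, y>^2 is the corresponding kernel.
print(phi(x) @ phi(y))   # 2.25
print((x @ y) ** 2)      # 2.25
```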

Slide 5: Ridge Regression (duality)
Problem: minimize the regularized squared error over the weight vector w,
$\min_w \; \sum_{i=1}^{n} (y_i - \langle w, x_i \rangle)^2 + \lambda \|w\|^2$
with targets $y_i$, inputs $x_i$, and regularization constant $\lambda$.
Solution (primal, requires a d x d inverse): $w = (X^\top X + \lambda I_d)^{-1} X^\top y$.
Dual representation (requires the inverse of the n x n Gram matrix $G = X X^\top$): $w = X^\top \alpha$ with $\alpha = (G + \lambda I_n)^{-1} y$, i.e. the solution is a linear combination of the data points.
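A small numerical check of the duality above, under the standard ridge-regression formulation (the data, dimensions, and regularization constant below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))     # inputs, one row per data point
y = rng.normal(size=n)          # targets

# Primal solution (d x d inverse): w = (X^T X + lam I_d)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution (n x n inverse of the Gram matrix): alpha = (G + lam I_n)^{-1} y, w = X^T alpha
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))   # True: both representations give the same weight vector
```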

Slide 6: Kernel Trick
Note: in the dual representation we used the Gram matrix to express the solution.
Kernel trick: replace the inner product $G_{ij} = \langle x_i, x_j \rangle$ by a kernel, $K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
If we use algorithms that only depend on the Gram matrix G, then we never have to know (or compute) the actual features $\phi(x)$.
This is the crucial point of kernel methods.
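A sketch of the kernel trick applied to the dual ridge solution: only kernel evaluations $k(x, y)$ appear, never the features. The RBF parameterization $k(x,y) = \exp(-\|x-y\|^2/c)$ and all names below are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    # k(x, y) = exp(-||x - y||^2 / c); pairwise evaluation between rows of A and B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))             # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)  # noisy targets

lam = 0.1
K = rbf_kernel(X, X)                              # Gram matrix of the (implicit) features
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X) @ alpha            # prediction needs only kernel evaluations
print(y_pred)
```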

Slide 7: Modularity
Kernel methods consist of two modules:
1) the choice of kernel (this is non-trivial)
2) the algorithm which takes kernels as input
Modularity: any kernel can be used with any kernel algorithm.
Some kernels: e.g. the linear, polynomial, and RBF (Gaussian) kernels.
Some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
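A sketch of the modularity point: one generic kernel algorithm (dual ridge regression, as an illustrative choice) accepts any kernel function, so kernels can be swapped freely. The kernels and data below are assumptions for illustration.

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def poly_kernel(A, B, degree=3):
    return (A @ B.T + 1.0) ** degree

def rbf_kernel(A, B, c=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def kernel_ridge_predict(kernel, X, y, X_test, lam=0.1):
    # The same algorithm for every kernel: solve (K + lam I) alpha = y, then predict.
    alpha = np.linalg.solve(kernel(X, X) + lam * np.eye(len(X)), y)
    return kernel(X_test, X) @ alpha

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0] * X[:, 1])        # a non-linear target
X_test = rng.normal(size=(3, 2))

for kernel in (linear_kernel, poly_kernel, rbf_kernel):
    print(kernel.__name__, kernel_ridge_predict(kernel, X, y, X_test))
```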

Slide 8: What is a proper kernel?
Definition: a finitely positive semi-definite function $k : X \times X \rightarrow \mathbb{R}$ is a symmetric function of its arguments for which the matrices formed by restriction to any finite subset of points are positive semi-definite.
Theorem: a function can be written as $k(x, y) = \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a feature map, iff k(x,y) satisfies the finite positive semi-definiteness property.
Relevance: we can now check whether k(x,y) is a proper kernel using only properties of k(x,y) itself, i.e. without the need to know the feature map!
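A sketch of the practical upshot: on any finite set of points one can test a candidate k(x,y) by forming its kernel matrix and checking symmetry and non-negativity of the eigenvalues. The helper and example kernels below are our own illustration.

```python
import numpy as np

def is_psd_on_sample(k, X, tol=1e-10):
    # Build the kernel matrix on the sample and check symmetry plus non-negative eigenvalues.
    K = np.array([[k(x, y) for y in X] for x in X])
    if not np.allclose(K, K.T):
        return False
    return np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))

good = lambda x, y: np.exp(-np.sum((x - y) ** 2))   # RBF: a proper kernel
bad  = lambda x, y: -np.sum((x - y) ** 2)           # negative squared distance: not PSD

print(is_psd_on_sample(good, X))  # True
print(is_psd_on_sample(bad, X))   # False: the matrix has negative eigenvalues
```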

Slide 9: Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very special feature map (note that more than one feature map may give rise to the same kernel): each point is mapped to a function, $\phi(x) = k(\cdot, x)$, i.e. we map to a function space.
Definition of the function space: $F = \{\, f(\cdot) = \sum_i \alpha_i\, k(\cdot, x_i) \,\}$ with inner product defined by $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$.
Reproducing property: $\langle f, k(\cdot, x) \rangle = f(x)$.

Slide 10: Mercer's Theorem
Theorem: let X be compact and k(x,y) a symmetric continuous function such that the integral operator $T_k f(\cdot) = \int_X k(\cdot, x)\, f(x)\, dx$ is positive semi-definite, i.e. $\int_X \int_X k(x, y)\, f(x)\, f(y)\, dx\, dy \geq 0$ for all $f \in L_2(X)$. Then there exists an orthonormal basis of eigenfunctions $\{\phi_i\}$ with eigenvalues $\lambda_i \geq 0$ such that $k(x, y) = \sum_i \lambda_i\, \phi_i(x)\, \phi_i(y)$.
Hence k(x,y) is a proper kernel, with feature map $\phi(x) = (\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots)$.
Note: here we construct feature vectors in L2, whereas the RKHS construction was in a function space.
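A finite-sample analogue of the same idea (a sketch, not Mercer's theorem itself): eigendecomposing a kernel matrix on a sample yields explicit feature vectors whose inner products reproduce the kernel values. The RBF kernel choice is an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 2))

# RBF kernel matrix on the sample (kernel choice is an assumption)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)

# Eigendecomposition K = V diag(lam) V^T; feature vector of point i is sqrt(lam) * V[i]
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)        # guard tiny negative values from round-off
Phi = V * np.sqrt(lam)               # row i is the feature vector of point i

print(np.allclose(Phi @ Phi.T, K))   # True: <phi(x_i), phi(x_j)> = k(x_i, x_j)
```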

Slide 11: Learning Kernels
All information is tunneled through the Gram-matrix information bottleneck, so the real art is to pick an appropriate kernel. E.g. take the RBF kernel $k(x, y) = \exp(-\|x - y\|^2 / c)$:
- if c is very small: G ≈ I (all data look dissimilar): over-fitting
- if c is very large: G ≈ 1 (all data look very similar): under-fitting
We need to learn the kernel. Here are some ways to combine kernels $k_1, k_2$ to improve them:
- $a\,k_1 + b\,k_2$ for $a, b \geq 0$ (valid kernels form a cone)
- $k_1 \cdot k_2$ (products of kernels)
- $p(k_1)$ for any polynomial p with positive coefficients
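A small numerical illustration of the bandwidth effect and the cone property, using the parameterization $k(x,y) = \exp(-\|x-y\|^2/c)$ (the sample and constants are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

def rbf_gram(c):
    return np.exp(-sq / c)

print(np.round(rbf_gram(1e-3), 2))   # ~ identity matrix: every point similar only to itself
print(np.round(rbf_gram(1e3), 2))    # ~ all-ones matrix: every point similar to every other

# Combining kernels: a positive combination of PSD Gram matrices stays PSD (the cone property)
K = 0.7 * rbf_gram(0.5) + 0.3 * rbf_gram(5.0)
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True
```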

Slide 12: Stability of Kernel Algorithms
Suppose we find a pattern f(x) in a sample S (e.g. its sample mean $\hat{E}[f]$). Is this pattern also likely to be present in new data drawn from the same distribution P?
We can use concentration inequalities (McDiarmid's theorem) to prove that:
Theorem: let $S = \{x_1, \ldots, x_n\}$ be an IID sample from P, let f take values in an interval of width R, and define the sample mean of f(x) as $\hat{E}[f] = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$. Then $P\big(\,|\hat{E}[f] - E[f]| < \epsilon\,\big) \geq 1 - 2\exp(-2n\epsilon^2/R^2)$, i.e. the probability that the sample mean and the population mean differ by less than $\epsilon$ is large, independent of P!
Our objective for learning is to improve generalization performance: cross-validation, Bayesian methods, generalization bounds, ...
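A Monte Carlo sketch of the concentration statement on a toy bounded function (the distribution, sample size, and $\epsilon$ are our own choices): sample means rarely deviate from the population mean, and the empirical failure rate sits below the Hoeffding-type bound.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials, eps = 200, 5000, 0.1

# f(x) = x with x ~ Uniform(0, 1): values bounded in [0, 1], population mean E[f] = 0.5
sample_means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
empirical_failure = np.mean(np.abs(sample_means - 0.5) >= eps)

# Hoeffding-type bound on the failure probability: 2 exp(-2 n eps^2 / R^2) with range R = 1
bound = 2 * np.exp(-2 * n * eps ** 2)

print(empirical_failure, bound)   # the empirical failure rate lies (well) below the bound
```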

Slide 13: Rademacher Complexity
Problem: we only checked the generalization performance for a single fixed pattern f(x). What if we want to search over a function class F?
Intuition: we need to incorporate the complexity of this function class. Rademacher complexity captures the ability of the function class to fit random noise $\sigma_1, \ldots, \sigma_n$ (uniformly distributed over $\{-1, +1\}$).
Empirical Rademacher complexity: $\hat{R}_n(F) = E_\sigma\Big[\, \sup_{f \in F} \Big| \frac{2}{n} \sum_{i=1}^{n} \sigma_i\, f(x_i) \Big| \,\Big]$
[Figure: data points $x_i$ with two functions $f_1$ and $f_2$ fitting the random labels.]
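A Monte Carlo sketch of the empirical Rademacher complexity for a tiny, finite function class (the class and sample below are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.uniform(-1, 1, size=n)

# A tiny, finite function class: four fixed functions evaluated on the sample
F = np.stack([np.sign(x), x, x ** 2 - 0.5, np.ones_like(x)])   # shape (4, n)

# Empirical Rademacher complexity, estimated by Monte Carlo over random signs sigma
trials = 2000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))
corr = np.abs(sigma @ F.T) * (2.0 / n)   # |(2/n) sum_i sigma_i f(x_i)| per trial and per f
rc_hat = corr.max(axis=1).mean()         # sup over the class, expectation over sigma

print(rc_hat)   # a modest value: this small class cannot fit random noise well
```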

Slide 14: Generalization Bound
Theorem: let f be a function in F which maps to [0,1] (e.g. loss functions). Then, with probability at least $1 - \delta$ over random draws of size n, every f satisfies
$E[f] \;\leq\; \hat{E}[f] + \hat{R}_n(F) + 3\sqrt{\frac{\ln(2/\delta)}{2n}}$
Relevance: the expected pattern E[f]=0 will also be present in a new data set if the last two terms are small:
- the complexity of the function class F is small
- the number of training data n is large

Slide 15: Linear Functions (in feature space)
Consider the function class $F_B = \{\, x \mapsto \langle w, \phi(x) \rangle : \|w\| \leq B \,\}$ and a sample $S = \{x_1, \ldots, x_n\}$. Then the empirical RC of $F_B$ is bounded by
$\hat{R}_n(F_B) \;\leq\; \frac{2B}{n}\sqrt{\sum_{i=1}^{n} k(x_i, x_i)} \;=\; \frac{2B}{n}\sqrt{\mathrm{tr}(K)}$
Relevance: since the dual solutions have the form $w = \sum_i \alpha_i \phi(x_i)$ with $\|w\|^2 = \alpha^\top K \alpha$, it follows that if we control the norm in kernel algorithms, we control the complexity of the function class (regularization).
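A numerical check of the trace bound for the linear kernel (sample and constants are our own): for linear functions with $\|w\| \leq B$ the supremum over the ball has a closed form, which the code uses directly.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, B = 100, 3, 2.0
X = rng.normal(size=(n, d))

# For the linear-kernel class F_B = {x -> <w, x> : ||w|| <= B}, the sup over w is attained
# in closed form: sup_w (2/n)|sum_i sigma_i <w, x_i>| = (2B/n) ||sum_i sigma_i x_i||
trials = 5000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))
rc_hat = (2 * B / n) * np.linalg.norm(sigma @ X, axis=1).mean()

# The bound from the slide: (2B/n) * sqrt(tr(K)) with K the linear-kernel Gram matrix
bound = (2 * B / n) * np.sqrt(np.trace(X @ X.T))

print(rc_hat, bound, rc_hat <= bound)   # the Monte Carlo estimate sits below the bound
```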

Slide 16: Margin Bound (classification)
Theorem: choose c > 0 (the margin). Consider the function class F with f(x,y) = -y g(x), y = +1, -1, a sample S of size n, and $\delta \in (0,1)$, the probability of violating the bound. Then, with probability at least $1 - \delta$, the probability of misclassification on new data is bounded by the fraction of training points whose margin y g(x) falls below c, plus complexity and confidence terms that shrink as the margin c and the sample size n grow.
Relevance: we can bound our classification error on new samples. Moreover, we have a strategy to improve generalization: choose the margin c as large as possible such that all samples are correctly classified (e.g. support vector machines).
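A toy sketch of the empirical quantity the margin bound trades off: the fraction of training points whose margin $y\,g(x)$ falls below c. The data and the fixed classifier g below are our own stand-ins, not a trained SVM.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=30))   # labels in {-1, +1}
y[y == 0] = 1.0

w = np.array([1.0, 0.0])           # some linear classifier g(x) = <w, x> (a stand-in)
margins = y * (X @ w)              # y * g(x): positive = correct, larger = more confident

c = 0.5
violations = np.mean(margins < c)  # fraction of points with margin below c: the empirical term
print(violations)
```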

