# Basics of Kernel Methods in Statistical Learning Theory Mohammed Nasser Professor Department of Statistics Rajshahi University

Basics of Kernel Methods in Statistical Learning Theory Mohammed Nasser Professor Department of Statistics Rajshahi University E-mail: mnasser.ru@gmail.com

Contents

- Glimpses of Historical Development
- Definition and Examples of Kernel
- Some Mathematical Properties of Kernels
- Construction of Kernels
- Heuristic Presentation of Kernel Methods
- Meaning of Kernels
- Mercer Theorem and Its Latest Development
- Direction of Future Development
- Conclusion

Computer Scientists' Contribution to Statistics: Kernel Methods

Jerome H. Friedman, Vladimir Vapnik

Early History

- In 1900 Karl Pearson published his famous article on goodness of fit, judged to be one of the twelve best scientific articles of the twentieth century.
- In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the following properties (a well-posed problem):
  - A solution exists.
  - The solution is unique.
  - The solution depends continuously on the data, in some reasonable topology.

Early History

- In 1940 Fréchet, a PhD student of Hadamard, strongly criticized the mean and standard deviation as measures of location and scale respectively. He expressed his belief in the development of statistics, but did not propose any alternative.
- During the sixties and seventies Tukey, Huber and Hampel tried to develop robust statistics in order to remove the ill-posedness of classical statistics.
- Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the model.
- The data mining onslaught and the problems of non-linearity and non-vectorial data have made robust statistics somewhat less attractive.

Let us see what kernel methods (KM) offer.

Recent History

- Support Vector Machines (SVM) were introduced at COLT-92 (Conference on Learning Theory) and have been greatly developed since then.
- Result: a class of algorithms for pattern recognition (kernel machines).
- Now: a large and diverse community, from machine learning, optimization, statistics, neural networks, functional analysis, etc.
- Centralized website: www.kernel-machines.org
- First textbook (2000): see www.support-vector.net
- Now (2012): at least twenty books of different flavours are available on the international market.
- The book "The Elements of Statistical Learning" (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.

History More

- David Hilbert used the German word 'kern' in his first paper on integral equations (Hilbert 1904).
- The mathematical result underlying the kernel trick, Mercer's theorem, is almost a century old (Mercer 1909). It tells us that any 'reasonable' kernel function corresponds to some feature space.
- The characterization of which kernels can be used to compute distances in feature spaces was developed by Schoenberg (1938).
- The methods for representing kernels in linear spaces were first studied by Kolmogorov (1941) for a countable input domain.
- The method for representing kernels in linear spaces for the general case was developed by Aronszajn (1950).
- Dunford and Schwartz (1963) showed that Mercer's theorem also holds true for general compact spaces.

History More

- The use of Mercer's theorem for interpreting kernels as inner products in a feature space was introduced into machine learning by Aizerman, Braverman and Rozonoer (1964).
- Berg, Christensen and Ressel (1984) published a good monograph on the theory of kernels.
- Saitoh (1988) showed the connection between positivity (a 'positive matrix' as defined in Aronszajn (1950)) and the positive semi-definiteness of all finite-set kernel matrices.
- Reproducing kernels were extensively used in machine learning and neural networks by Poggio and Girosi; see for example Poggio and Girosi (1990), a paper on radial basis function networks.
- The theory of kernels was used in approximation and regularization theory, and the first chapter of Spline Models for Observational Data (Wahba 1990) gave a number of theoretical results on kernel functions.

Kernel methods: Heuristic View

What is the common characteristic (structure) among the following statistical methods?

1. Principal components analysis
2. (Ridge) regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
5. Singular value decomposition
6. Independent component analysis

We consider linear combinations of the input vector, $f(x) = w^\top x$, and we make use of the concepts of length and dot product available in Euclidean space.

Kernel counterparts: KPCA, SVR, KFDA, KCCA, KICA.

Kernel methods: Heuristic View

Linear learning typically has nice properties:

- Unique optimal solutions
- Fast learning algorithms
- Better statistical analysis

But it has one big problem: insufficient capacity. That means that in many data sets it fails to detect nonlinear relationships among the variables.

Another demerit: it cannot handle non-vectorial data.

Data

- Vectors: collections of features, e.g. height, weight, blood pressure, age, ...; categorical variables can be mapped into vectors
- Matrices: images, movies, remote sensing and satellite data (multispectral)
- Strings: documents, gene sequences
- Structured objects: XML documents, graphs

Kernel methods: Heuristic View

Genome-wide data:

- mRNA expression data
- protein-protein interaction data
- hydrophobicity data
- sequence data (gene, protein)


Kernel methods: Heuristic View

[Figure: mapping from the original space to the feature space]

Definition of Kernels

Definition: A finitely positive semi-definite function is a symmetric function of its arguments for which the matrices formed by restriction to any finite subset of points are positive semi-definite.

- It is a generalized dot product.
- It is not generally bilinear.
- But it obeys the Cauchy-Schwarz (C-S) inequality.
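The definition can be checked numerically on any finite sample. The sketch below is only an illustration: the Gaussian kernel, the bandwidth `sigma`, the random points and the helper names `gram_matrix` and `is_psd` are all assumptions introduced here, not part of the slides. It builds the kernel matrix on a finite subset, tests positive semi-definiteness through the eigenvalues, and checks the Cauchy-Schwarz inequality $k(x,y)^2 \le k(x,x)\,k(y,y)$.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def gram_matrix(kernel, points):
    """Kernel matrix K = [k(x_i, x_j)] on a finite set of points."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(points[i], points[j])
    return K

def is_psd(K, tol=1e-10):
    """True if K is symmetric and has no eigenvalue below -tol."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
points = rng.normal(size=(8, 3))          # illustrative finite subset
K = gram_matrix(gaussian_kernel, points)
print("positive semi-definite:", is_psd(K))

# Cauchy-Schwarz: k(x, y)^2 <= k(x, x) k(y, y) for every pair of points
diag = np.diag(K)
cs_holds = bool(np.all(K ** 2 <= np.outer(diag, diag) + 1e-12))
print("Cauchy-Schwarz holds:", cs_holds)
```

A finite check like this can only refute the kernel property (one indefinite matrix is enough); it cannot prove it for all finite subsets.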

Kernel Methods: Basic Ideas

Proper kernel: $\langle \Phi(x), \Phi(y) \rangle$ is always a kernel. When is the converse true?

Theorem (Aronszajn, 1950): A function $k(x,y)$ can be written as $k(x,y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi$ is a feature map, if and only if $k(x,y)$ satisfies the semi-definiteness property.

We can now check whether $k(x,y)$ is a proper kernel using only properties of $k(x,y)$ itself, i.e. without needing to know the feature map. If the map is needed, we may take the help of Mercer's theorem.
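For some kernels the feature map can be written down explicitly. As a small illustration (the homogeneous quadratic kernel on $\mathbb{R}^2$ and the names `quad_kernel` and `quad_feature_map` are assumptions chosen here, not taken from the slides), the kernel $k(x,y) = (x^\top y)^2$ equals an ordinary dot product after the map $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def quad_kernel(x, y):
    """Homogeneous quadratic kernel k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def quad_feature_map(x):
    """Explicit feature map for quad_kernel on R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)

lhs = quad_kernel(x, y)                                        # kernel evaluation
rhs = float(np.dot(quad_feature_map(x), quad_feature_map(y)))  # inner product in feature space
print(np.isclose(lhs, rhs))
```

The feature map is not unique; other maps can give the same kernel, which is one reason to work with the kernel itself.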

Kernel methods consist of two modules:

1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input

Modularity: any kernel can be used with any kernel algorithm.

Some kernels: [formulas not preserved in the transcript]

Some kernel algorithms:

- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA

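Kernel Construction

The set of kernels forms a closed convex cone: sums of kernels, non-negative multiples of kernels and pointwise limits of kernels are again kernels.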
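A quick numerical illustration of the cone property, under assumptions made here for the example (five fixed points in the plane, a linear and a Gaussian kernel, and arbitrary non-negative weights): a non-negative combination of two kernel matrices stays positive semi-definite, while a difference of kernel matrices need not.

```python
import numpy as np

def min_eig(K):
    """Smallest eigenvalue of the symmetrised matrix."""
    return float(np.linalg.eigvalsh((K + K.T) / 2.0).min())

# Fixed illustrative points in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

K_lin = X @ X.T                      # linear kernel matrix
K_rbf = np.exp(-0.5 * sq_dists)      # Gaussian kernel matrix (sigma = 1)

K_cone = 2.0 * K_lin + 0.5 * K_rbf   # non-negative combination: still a kernel matrix
K_diff = K_rbf - 5.0 * K_lin         # difference: leaves the cone here

print("min eigenvalue, combination:", min_eig(K_cone))
print("min eigenvalue, difference :", min_eig(K_diff))
```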


Reproducing Kernel Hilbert Space

- Let $X$ be a set. A Hilbert space $H$ consisting of functions on $X$ is called a reproducing kernel Hilbert space (RKHS) if the evaluation functional $e_x: H \to \mathbb{R}$, $e_x(f) = f(x)$, is continuous for each $x \in X$.
- A Hilbert space $H$ consisting of functions on $X$ is an RKHS if and only if there exists $k(\cdot, x) \in H$ (the reproducing kernel) such that $\langle f, k(\cdot, x) \rangle_H = f(x)$ for all $x \in X$ and $f \in H$ (by Riesz's lemma).

Reproducing Kernel Hilbert Space II

Theorem (construction of RKHS): If $k: X \times X \to \mathbb{R}$ is positive definite, there uniquely exists an RKHS $H_k$ on $X$ such that

1. $k(\cdot, x) \in H_k$ for all $x \in X$,
2. the linear hull of $\{k(\cdot, x) \mid x \in X\}$ is dense in $H_k$,
3. $k$ is a reproducing kernel of $H_k$, i.e., $\langle f, k(\cdot, x) \rangle_{H_k} = f(x)$ for all $x \in X$ and $f \in H_k$.

At this moment we put no structure on $X$. To obtain better properties for the members $g$ of $H_k$, we have to put extra structure on $X$ and assume additional properties of $k$.

Classification: $Y = g(X)$, with $g: X \to Y$

- X can be anything: continuous ($\mathbb{R}$, $\mathbb{R}^d$, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), …
- Y is discrete: {0,1} (binary), {1,…,k} (multi-class), tree, etc. (structured)

Classification

- X can be anything: continuous ($\mathbb{R}$, $\mathbb{R}^d$, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), …
- Methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest
- Kernel trick

Regression: $Y = g(X)$, with $g: X \to Y$

- X can be anything: continuous ($\mathbb{R}$, $\mathbb{R}^d$, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), …
- Y is continuous: $\mathbb{R}$, $\mathbb{R}^d$ (not always)

Regression

- X can be anything: continuous ($\mathbb{R}$, $\mathbb{R}^d$, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), …
- Methods: Perceptron, Normal Regression, Support Vector Regression, GLM
- Kernel trick

Kernel Methods: Heuristic View

Steps for Kernel Methods

1. Data matrix
2. Kernel matrix, $K = [k(x_i, x_j)]$, a positive semi-definite matrix
3. Algorithm
4. Pattern function, $f(x) = \sum_i \alpha_i k(x_i, x)$

Questions: Which $K$? Traditional or non-traditional? Why p.s.d.?

Kernel methods: Heuristic View

[Figure: mapping from the original space to the feature space]

Kernel Methods: Basic Ideas

The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space: $\Phi: x \mapsto \Phi(x) \in F$.

The expectation is that the feature space has a much higher dimension than the input space. The feature space has an inner product, $\langle \Phi(x), \Phi(y) \rangle$.

Kernel methods: Heuristic View (Form of functions)

So kernel methods use linear functions in a feature space: $f(x) = w^\top \Phi(x)$.

- For regression this could be the function itself.
- For classification it requires thresholding.

Kernel methods: Heuristic View (Feature spaces)

Non-linear mapping to F:

1. high-D space
2. infinite-D countable space
3. function space (Hilbert space)

Example: [formula not preserved in the transcript]

Kernel methods: Heuristic View (Example)

Consider the mapping [formula not preserved in the transcript]. Let us consider a linear equation in this feature space [formula not preserved in the transcript]. In the input space this is actually an ellipse, i.e. a non-linear shape.
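The example can be reproduced with a concrete mapping. The slide's own formulas are not available, so the sketch below assumes the map $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ and the linear equation $w^\top \Phi(x) = 1$ with $w = (1/a^2, 0, 1/b^2)$; the semi-axes `a` and `b` are illustrative values. Points on the ellipse $x_1^2/a^2 + x_2^2/b^2 = 1$ then satisfy a linear equation in the feature space.

```python
import numpy as np

def feature_map(x):
    """Assumed map Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2); not taken from the slide."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

a, b = 2.0, 1.0                                   # semi-axes (illustrative values)
w = np.array([1.0 / a ** 2, 0.0, 1.0 / b ** 2])   # linear equation w . Phi(x) = 1

# Points on the ellipse x1^2 / a^2 + x2^2 / b^2 = 1 in the input space
t = np.linspace(0.0, 2.0 * np.pi, 50)
ellipse_points = np.column_stack([a * np.cos(t), b * np.sin(t)])

# Each ellipse point lies on the hyperplane w . z = 1 in the feature space
values = np.array([w @ feature_map(p) for p in ellipse_points])
print(np.allclose(values, 1.0))
```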

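Kernel methods: Heuristic View

Ridge Regression (duality)

Problem (targets $y_i$, inputs $x_i$, regularization parameter $\lambda$):

$$\min_w \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \|w\|^2$$

Solution, with $X$ the $n \times d$ matrix whose rows are the inputs $x_i^\top$:

$$w = (X^\top X + \lambda I_d)^{-1} X^\top y \quad (d \times d \text{ inverse})$$

$$w = X^\top (X X^\top + \lambda I_n)^{-1} y = X^\top (G + \lambda I_n)^{-1} y \quad (n \times n \text{ inverse})$$

where $G_{ij} = \langle x_i, x_j \rangle$ is the matrix of inner products of the observations. Hence $w = \sum_i \alpha_i x_i$ is a linear combination of the data, with $\alpha = (G + \lambda I_n)^{-1} y$.

Dual representation: $f(x) = w^\top x = \sum_i \alpha_i \langle x_i, x \rangle$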
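The equivalence of the primal and dual ridge solutions can be verified numerically. This is a minimal sketch on synthetic data; the sizes `n` and `d`, the value of `lam` and the random data are assumptions made only for the check.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 20, 4, 0.1                 # illustrative sizes and regularization
X = rng.normal(size=(n, d))            # rows are the inputs x_i
y = rng.normal(size=n)                 # targets

# Primal solution: d x d system
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: n x n system built from the Gram matrix of inner products
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X.T @ alpha                   # w as a linear combination of the data

# Prediction at a new point, primal form and dual form
x_new = rng.normal(size=d)
f_primal = float(w_primal @ x_new)
f_dual = float(alpha @ (X @ x_new))    # sum_i alpha_i <x_i, x_new>

print(np.allclose(w_primal, w_dual), np.isclose(f_primal, f_dual))
```

The dual form touches the data only through inner products, which is what makes the replacement of inner products by a kernel possible.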

Kernel methods: Heuristic View (Kernel trick)

Note: In the dual representation we used the Gram matrix to express the solution.

Kernel trick: replace $x \to \Phi(x)$, so that $G_{ij} = \langle x_i, x_j \rangle \to \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$, the kernel.

If we use algorithms that depend only on the Gram matrix $G$, then we never have to know (compute) the actual features $\Phi(x)$.
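To see the kernel trick at work, the sketch below fits ridge regression twice: once on explicitly computed features and once using only kernel evaluations. The quadratic kernel, its feature map, the synthetic data and the value of `lam` are assumptions chosen for illustration; the two sets of predictions agree even though the second route never forms $\Phi(x)$.

```python
import numpy as np

def feature_map(X):
    """Explicit features for the quadratic kernel on R^2 (used only for comparison)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def quad_kernel_matrix(A, B):
    """Matrix of kernel values k(a, b) = (a . b)^2 between rows of A and rows of B."""
    return (A @ B.T) ** 2

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 2))
y = X[:, 0] ** 2 - X[:, 1] ** 2 + 0.1 * rng.normal(size=15)
X_test = rng.normal(size=(5, 2))
lam = 0.5

# Route 1: ridge regression on explicit features (primal form)
Phi = feature_map(X)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
pred_explicit = feature_map(X_test) @ w

# Route 2: dual form, using only kernel evaluations
K = quad_kernel_matrix(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
pred_kernel = quad_kernel_matrix(X_test, X) @ alpha   # f(x) = sum_i alpha_i k(x_i, x)

print(np.allclose(pred_explicit, pred_kernel))
```

Swapping `quad_kernel_matrix` for another kernel changes the feature space without changing the algorithm.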

Gist of Kernel Methods

- Choice of a kernel function.
- Through the choice of a kernel function we choose a Hilbert space.
- We then apply the linear method in this new space, without increasing computational complexity, using the mathematical niceties of this space.

Kernels to Similarity

Intuition of kernels as similarity measures:

- When the diagonal entries of the kernel Gram matrix are constant, kernels are directly related to similarities. For example, the Gaussian kernel.
- In general, it is useful to think of a kernel as a similarity measure.

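Kernels to Distance

Distance between two points $x_1$ and $x_2$ in feature space:

$$d(x_1, x_2)^2 = \|\Phi(x_1) - \Phi(x_2)\|^2 = k(x_1, x_1) - 2k(x_1, x_2) + k(x_2, x_2)$$

Distance between a point $x_1$ and a set $S$ in feature space: [formula not preserved in the transcript]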
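Both distances can be computed from kernel values alone. In the sketch below, the point-to-set distance is taken to be the distance from $\Phi(x)$ to the mean of the mapped set, which is an assumption made here since the slide's formula is not available; the quadratic kernel and its explicit map `phi` are likewise illustrative and are used only to verify the kernel-only formulas.

```python
import numpy as np

def quad_kernel(x, y):
    """Illustrative kernel k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit feature map of quad_kernel on R^2, used only for verification."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def dist_sq(k, x1, x2):
    """Squared feature-space distance between two points, from kernel values only."""
    return k(x1, x1) - 2.0 * k(x1, x2) + k(x2, x2)

def dist_to_mean_sq(k, x, S):
    """Squared distance from Phi(x) to the mean of {Phi(s) : s in S} (assumed meaning)."""
    m = len(S)
    cross = sum(k(x, s) for s in S) / m
    within = sum(k(s, t) for s in S for t in S) / m ** 2
    return k(x, x) - 2.0 * cross + within

rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=2), rng.normal(size=2)
S = rng.normal(size=(6, 2))

# Same quantities computed with the explicit feature map
explicit_pair = float(np.sum((phi(x1) - phi(x2)) ** 2))
explicit_set = float(np.sum((phi(x1) - np.mean([phi(s) for s in S], axis=0)) ** 2))

print(np.isclose(dist_sq(quad_kernel, x1, x2), explicit_pair))
print(np.isclose(dist_to_mean_sq(quad_kernel, x1, S), explicit_set))
```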

Kernel methods: Heuristic View

Genome-wide data:

- mRNA expression data
- protein-protein interaction data
- hydrophobicity data
- sequence data (gene, protein)

Similarity to Kernels

How can we make a similarity matrix positive semi-definite if it is not?

From Similarity Scores to Kernels

Removal of negative eigenvalues:

- Form the similarity matrix S, where the (i,j)-th entry of S denotes the similarity between the i-th and j-th data points.
- S is symmetric, but is in general not positive semi-definite, i.e., S has negative eigenvalues.
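One common way to remove negative eigenvalues is to set them to zero in the eigendecomposition (spectrum clipping). The slide does not say which variant the author uses, so the sketch below is an assumption; the function name `clip_to_psd` and the small similarity matrix `S` are made up for the example.

```python
import numpy as np

def clip_to_psd(S):
    """Replace negative eigenvalues of a symmetric matrix by zero (spectrum clipping)."""
    S = (S + S.T) / 2.0                          # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(S)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T       # V diag(clipped) V^T

# Illustrative symmetric similarity matrix that is not positive semi-definite
S = np.array([[ 1.0, 0.9, -0.9],
              [ 0.9, 1.0,  0.9],
              [-0.9, 0.9,  1.0]])

K = clip_to_psd(S)
print("smallest eigenvalue before:", np.linalg.eigvalsh(S).min())
print("smallest eigenvalue after :", np.linalg.eigvalsh(K).min())
```

Clipping changes the similarity values, so the resulting kernel matrix only approximates the original scores.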

From Similarity Scores to Kernels

[Table: similarity scores between data points $x_1, x_2, \dots, x_n$ (rows) and $t_1, t_2, \dots, t_n$ (columns)]

Kernels as Measures of Function Regularity

- Problems of empirical risk minimization
- Empirical risk functional: [formula not preserved in the transcript]

What Can We Do?

- We can restrict the set of functions over which we minimize empirical risk functionals (structural risk minimization).
- We can modify the criterion to be minimized, e.g. by adding a penalty for 'complicated' functions (regularization).

We can combine the two.


Best Approximation

[Figure with labels M, H, F and f]

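Best approximation

Assume $M$ is finite dimensional with basis $\{k_1, \dots, k_m\}$, i.e., the best approximation is $\hat{f} = a_1 k_1 + \dots + a_m k_m$.

$f - \hat{f} \perp M$ gives $m$ conditions ($i = 1, \dots, m$):

$$\langle k_i, f - \hat{f} \rangle = 0$$

i.e.

$$\langle k_i, f \rangle - a_1 \langle k_i, k_1 \rangle - \dots - a_m \langle k_i, k_m \rangle = 0$$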

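RKHS approximation

The $m$ conditions become ($i = 1, \dots, m$):

$$y_i - a_1 k(x_i, x_1) - \dots - a_m k(x_i, x_m) = 0$$

We can then estimate the parameters using $a = K^{-1} y$.

In practice $K$ can be ill-conditioned, so we minimise a regularised criterion instead, which gives $a = (K + \lambda I)^{-1} y$.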
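The effect of the regularization term on conditioning can be seen on a small example. The sketch assumes a Gaussian kernel with a wide bandwidth on equally spaced points, a sine target and a value of `lam` chosen only for illustration; it compares the condition numbers of $K$ and $K + \lambda I$ and solves the regularized system.

```python
import numpy as np

n, lam, sigma = 30, 1e-3, 0.5                 # illustrative settings
x = np.linspace(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x)

# Gaussian kernel matrix on closely spaced points: nearly singular
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma ** 2))

cond_K = np.linalg.cond(K)
cond_reg = np.linalg.cond(K + lam * np.eye(n))
print("cond(K)           =", cond_K)
print("cond(K + lam * I) =", cond_reg)

# Regularized coefficients a = (K + lam I)^{-1} y and the fitted values
a = np.linalg.solve(K + lam * np.eye(n), y)
fitted = K @ a
print("max abs residual  =", float(np.max(np.abs(fitted - y))))
```

Larger values of `lam` improve conditioning further but pull the fitted values away from the targets.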

Approximation vs estimation

[Figure labels: target space, hypothesis space, true function, estimate, best possible estimate]

How to choose kernels?

There is no absolute rule for choosing the right kernel for a particular problem. The kernel should capture the desired similarity.

- Kernels for vectors: polynomial and Gaussian kernels
- String kernel (text documents)
- Diffusion kernel (graphs)
- Sequence kernel (protein, DNA, RNA)

Kernel Selection

Ideally, we select the optimal kernel based on our prior knowledge of the problem domain. In practice, we consider a family of kernels defined in a way that again reflects our prior expectations.

- Simple way: require only a limited amount of additional information from the training data.
- Elaborate way: combine label information.


Future Development

- Mathematics:
  - Generalization of Mercer's theorem to pseudo-metric spaces
  - Development of mathematical tools for multivariate regression
- Statistics:
  - Application of kernels in multivariate data depth
  - Application of ideas from robust statistics
  - Application of these methods to circular data
  - They can be used to study nonlinear time series

http://www.kernel-machines.org/ (papers, software, workshops, conferences, etc.)

Acknowledgement: Jieping Ye, Department of Computer Science and Engineering, Arizona State University, http://www.public.asu.edu/~jye02

Thank You
