Basics of Kernel Methods in Statistical Learning Theory. Mohammed Nasser, Professor, Department of Statistics, Rajshahi University

Contents: Glimpses of Historical Development; Definition and Examples of Kernels; Some Mathematical Properties of Kernels; Construction of Kernels; Heuristic Presentation of Kernel Methods; Meaning of Kernels; Mercer Theorem and Its Latest Development; Direction of Future Development; Conclusion.

Computer Scientists' Contribution to Statistics: Kernel Methods (Jerome H. Friedman, Vladimir Vapnik).

Early History. In 1900 Karl Pearson published his famous article on goodness of fit, judged one of the twelve best scientific articles of the twentieth century. In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the properties that: a solution exists; the solution is unique; the solution depends continuously on the data, in some reasonable topology (a well-posed problem).

Early History. In 1940 Fréchet, a PhD student of Hadamard, strongly criticized the mean and standard deviation as measures of location and scale, respectively, but only expressed his belief in a better development of statistics without proposing an alternative. During the sixties and seventies Tukey, Huber and Hampel tried to develop Robust Statistics in order to remove the ill-posedness of classical statistics. Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the assumed model. The onslaught of data mining, together with the problems of non-linearity and non-vectorial data, has made robust statistics somewhat less attractive. Let us see what kernel methods offer.

Recent History. Support Vector Machines (SVMs), introduced at COLT-92 (Conference on Learning Theory), have been greatly developed since then. Result: a class of algorithms for pattern recognition (kernel machines). Now there is a large and diverse community, drawn from machine learning, optimization, statistics, neural networks, functional analysis, etc. Centralized website: First textbook (2000): see Now (2012): at least twenty books of different tastes are available on the international market. The book "The Elements of Statistical Learning" (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.

More History. David Hilbert used the German word 'Kern' in his first paper on integral equations (Hilbert 1904). The mathematical result underlying the kernel trick, Mercer's theorem, is almost a century old (Mercer 1909); it tells us that any 'reasonable' kernel function corresponds to some feature space. Which kernels can be used to compute distances in feature spaces was worked out by Schoenberg (1938). The methods for representing kernels in linear spaces were first studied by Kolmogorov (1941) for a countable input domain. The method for representing kernels in linear spaces in the general case was developed by Aronszajn (1950). Dunford and Schwartz (1963) showed that Mercer's theorem also holds for general compact spaces.

More History. The use of Mercer's theorem for interpreting kernels as inner products in a feature space was introduced into machine learning by Aizerman, Braverman and Rozonoer (1964). Berg, Christensen and Ressel (1984) published a good monograph on the theory of kernels. Saitoh (1988) showed the connection between positivity (a 'positive matrix' as defined in Aronszajn (1950)) and the positive semi-definiteness of all finite-set kernel matrices. Reproducing kernels were used extensively in machine learning and neural networks by Poggio and Girosi; see for example Poggio and Girosi (1990), a paper on radial basis function networks. The theory of kernels was also used in approximation and regularization theory, and the first chapter of Spline Models for Observational Data (Wahba 1990) gave a number of theoretical results on kernel functions.

Kernel methods: Heuristic View. What is the common characteristic (structure) among the following statistical methods? 1. Principal component analysis, 2. (Ridge) regression, 3. Fisher discriminant analysis, 4. Canonical correlation analysis, 5. Singular value decomposition, 6. Independent component analysis. They all consider linear combinations of the input vectors and make use of the concepts of length and dot product available in Euclidean space. Their kernelised counterparts include KPCA, SVR, KFDA, KCCA and KICA.

Kernel methods: Heuristic View. Linear learning typically has nice properties: unique optimal solutions, fast learning algorithms, better statistical analysis. But it has one big problem: insufficient capacity, which means that on many data sets it fails to detect nonlinear relationships among the variables. Another demerit: it cannot handle non-vectorial data.

Data. Vectors: collections of features, e.g. height, weight, blood pressure, age, ...; categorical variables can be mapped into vectors. Matrices: images, movies, remote sensing and satellite data (multispectral). Strings: documents, gene sequences. Structured objects: XML documents, graphs.

Kernel methods: Heuristic View. Genome-wide data: mRNA expression data, protein-protein interaction data, hydrophobicity data, sequence data (gene, protein).


Kernel methods: Heuristic View. [Figure: a nonlinear map Φ sends points from the original (input) space into a feature space.]

Definition of Kernels. Definition: a finitely positive semi-definite function k: X × X → R is a symmetric function of its arguments for which every matrix formed by restriction to a finite subset of points is positive semi-definite. It is a generalized dot product; it is not generally bilinear; but it obeys the Cauchy-Schwarz inequality.
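To make the definition concrete, here is a minimal sketch (assuming Python with NumPy, which the slides themselves do not use) that restricts a kernel to a finite set of points and checks that the resulting matrix is symmetric with no negative eigenvalues.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); symmetric in its two arguments
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                      # an arbitrary finite subset of points
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

print(np.allclose(K, K.T))                        # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)    # no negative eigenvalues (up to rounding)
```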

Kernel Methods: Basic Ideas. Theorem (Aronszajn, 1950): a function k(x, y) can be written as k(x, y) = ⟨φ(x), φ(y)⟩, where φ is a feature map into a Hilbert space, iff k(x, y) satisfies the finite positive semi-definiteness property. We can therefore check whether k(x, y) is a proper kernel using only properties of k(x, y) itself, i.e. without needing to know the feature map. If the map is needed, we may take help of Mercer's theorem. Note: ⟨φ(x), φ(y)⟩ is always a kernel; when is the converse true?

Kernel methods consist of two modules: 1) the choice of kernel (this is non-trivial), and 2) the algorithm which takes kernels as input. Modularity: any kernel can be used with any kernel algorithm. Some kernels: e.g. the polynomial and Gaussian kernels listed later. Some kernel algorithms: support vector machine, Fisher discriminant analysis, kernel regression, kernel PCA, kernel CCA.

Kernel Construction. The set of kernels forms a closed convex cone: non-negative combinations and pointwise limits of kernels are again kernels, and products of kernels are kernels as well.
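The cone property can be illustrated numerically. The sketch below (a Python/NumPy illustration, not a proof) combines two standard kernels by non-negative scaling, addition and pointwise multiplication and verifies that the Gram matrices stay positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

def gram(kfun):
    # Gram matrix of a kernel function on the sample X
    return np.array([[kfun(a, b) for b in X] for a in X])

k_lin = lambda x, y: x @ y                              # linear kernel
k_rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))      # Gaussian kernel

combos = [2.0 * gram(k_lin) + 3.0 * gram(k_rbf),        # non-negative (conic) combination
          gram(k_lin) * gram(k_rbf)]                    # pointwise (Schur) product
for K in combos:
    print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)      # remains positive semi-definite
```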


Reproducing Kernel Hilbert Space. Let X be a set. A Hilbert space H consisting of functions on X is called a reproducing kernel Hilbert space (RKHS) if the evaluation functional e_x: H → R, e_x(f) = f(x), is continuous for each x ∈ X. Equivalently, a Hilbert space H consisting of functions on X is an RKHS if and only if for each x ∈ X there exists k_x ∈ H (the reproducing kernel) such that ⟨f, k_x⟩ = f(x) for all f ∈ H (by Riesz's lemma).

Reproducing Kernel Hilbert Space II. Theorem (construction of RKHS): if k: X × X → R is positive definite, there uniquely exists an RKHS H_k on X such that (1) k(·, x) ∈ H_k for all x ∈ X, (2) the linear hull of {k(·, x): x ∈ X} is dense in H_k, and (3) k is a reproducing kernel of H_k, i.e. ⟨f, k(·, x)⟩ = f(x) for all f ∈ H_k and x ∈ X. At this point we put no structure on X; to obtain better properties of the members g of H_k we have to put extra structure on X and assume additional properties of k.

Classification. The task is to learn Y = g(X), g: X → Y. X can be anything: continuous (R, R^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (tree, string, ...). Y is discrete: {0,1} (binary), {1,...,k} (multi-class), or tree etc. (structured).

Classification. X can be anything: continuous (R, R^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (tree, string, ...). Algorithms: perceptron, logistic regression, support vector machine, decision tree, random forest; the kernel trick lets the linear methods among these work on such general X.

Regression. The task is to learn Y = g(X), g: X → Y. X can be anything: continuous (R, R^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (tree, string, ...). Y is continuous (R, R^d), though not always.

Regression. X can be anything: continuous (R, R^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (tree, string, ...). Algorithms: perceptron, normal (least-squares) regression, support vector regression, GLM; the kernel trick again applies to the linear methods.

Kernel Methods: Heuristic View. Steps for kernel methods: data matrix → kernel matrix K = [k(x_i, x_j)], a positive semi-definite matrix → algorithm → pattern function f(x) = Σ_i α_i k(x_i, x). Which K: traditional or non-traditional? Why positive semi-definite?

Kernel methods: Heuristic View. [Figure: a nonlinear map Φ sends points from the original (input) space into a feature space.]

Kernel Methods: Basic Ideas. The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space. The expectation is that the feature space has a much higher dimension than the input space. The feature space has an inner product, with ⟨φ(x), φ(y)⟩ = k(x, y).

Kernel methods: Heuristic View. Form of functions: kernel methods use linear functions in the feature space, f(x) = ⟨w, φ(x)⟩ + b. For regression this function is used directly; for classification its output is thresholded, e.g. sign(f(x)).

Kernel methods: Heuristic View. Feature spaces: a non-linear mapping φ: X → F, where F can be 1. a high-dimensional space, 2. an infinite-dimensional countable (sequence) space, or 3. a function space (Hilbert space). An example follows on the next slide.

Kernel methods: Heuristic View. Example: consider a mapping φ from the 2-D input space into a feature space, and a linear equation ⟨w, φ(x)⟩ = c in that feature space. In the input space this equation actually describes an ellipse, i.e. a non-linear shape.
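A hedged sketch of this example, assuming the commonly used map φ(x1, x2) = (x1², x2², √2·x1x2) (the slide's exact map is not shown): the feature-space inner product equals (x · z)², and a linear equation in feature space traces an ellipse in the input plane.

```python
import numpy as np

def phi(x):
    # assumed feature map: (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z), (x @ z) ** 2)     # identical: the induced kernel is k(x, z) = (x . z)^2

w, c = np.array([1.0, 4.0, 0.0]), 4.0    # <w, phi(x)> = c  <=>  x1^2 + 4 x2^2 = 4, an ellipse
print(np.isclose(w @ phi(np.array([0.0, 1.0])), c))   # the point (0, 1) lies on that ellipse
```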

Kernel methods: Heuristic View. Ridge Regression (duality). Problem: given inputs x_i and targets y_i, minimise the regularised criterion Σ_i (y_i − wᵀx_i)² + λ‖w‖². Primal solution: w = (XᵀX + λI)⁻¹Xᵀy, which requires a d×d inverse. Dual representation: w is a linear combination of the data, w = Σ_i α_i x_i with α = (G + λI)⁻¹y, where G = XXᵀ contains the inner products of the observations (an n×n inverse); then f(x) = wᵀx = Σ_i α_i ⟨x_i, x⟩.
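The following minimal sketch checks the duality numerically on synthetic data: the primal solution (d×d inverse) and the dual solution (n×n inverse, built from inner products of observations) give identical predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))                                # n = 50 observations, d = 4 features
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
lam = 0.5

w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)     # primal: d x d inverse
alpha = np.linalg.solve(X @ X.T + lam * np.eye(50), y)      # dual: n x n inverse of the Gram matrix

x_new = rng.normal(size=4)
print(w @ x_new, alpha @ (X @ x_new))                       # same prediction f(x) = sum_i alpha_i <x_i, x>
```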

Kernel methods: Heuristic View. Kernel trick. Note: in the dual representation we used the Gram matrix to express the solution. Kernel trick: replace each inner product ⟨x_i, x_j⟩ by a kernel evaluation k(x_i, x_j). If we use algorithms that only depend on the Gram matrix G, then we never have to know (compute) the actual features Φ(x).

Gist of Kernel Methods. Choice of a kernel function: through the choice of a kernel function we choose a Hilbert space. We then apply the linear method in this new space, without increasing computational complexity, using the mathematical niceties of this space.

Kernels to Similarity. Intuition of kernels as similarity measures: when the diagonal entries of the kernel Gram matrix are constant, kernels are directly related to similarities, for example the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)). In general, it is useful to think of a kernel as a similarity measure.
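A tiny illustration of the similarity reading, using the Gaussian kernel: its Gram matrix has a constant diagonal of ones, so k(x, y) can be read directly as a similarity that decays from 1 as the points move apart.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.zeros(3)
for y in (np.zeros(3), np.full(3, 0.5), np.full(3, 2.0)):
    # k(x, x) = 1, and k(x, y) shrinks towards 0 as y moves away from x
    print(gaussian_kernel(x, y))
```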

Kernels to Distance. Distance between two points x1 and x2 in feature space: ‖φ(x1) − φ(x2)‖² = k(x1, x1) − 2 k(x1, x2) + k(x2, x2). Distance between a point x1 and a set S in feature space: replace the cross term by the average kernel value between x1 and the members of S, and the last term by the average kernel value within S.
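A sketch of these kernel-induced distances, computed without ever forming φ explicitly (the point-to-set distance here is the distance to the centroid of S in feature space, which is the standard reading of this formula):

```python
import numpy as np

def k(x, y):
    return np.exp(-np.sum((x - y) ** 2))          # any valid kernel works here

def dist(x1, x2):
    # ||phi(x1) - phi(x2)||
    return np.sqrt(k(x1, x1) - 2 * k(x1, x2) + k(x2, x2))

def dist_to_set(x, S):
    # distance from phi(x) to the mean of {phi(s) : s in S}
    cross = np.mean([k(x, s) for s in S])
    within = np.mean([[k(s, t) for t in S] for s in S])
    return np.sqrt(k(x, x) - 2 * cross + within)

x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
S = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(dist(x1, x2), dist_to_set(x1, S))
```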

Kernel methods: Heuristic View. Genome-wide data: mRNA expression data, protein-protein interaction data, hydrophobicity data, sequence data (gene, protein).

Similarity to Kernels. How can we make a similarity matrix positive semi-definite if it is not?

From Similarity Scores to Kernels: removal of negative eigenvalues. Form the similarity matrix S, where the (i, j)-th entry of S denotes the similarity between the i-th and j-th data points. S is symmetric, but in general it is not positive semi-definite, i.e. S may have negative eigenvalues.
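A minimal sketch of this eigenvalue-removal step: project a symmetric but indefinite similarity matrix onto the positive semi-definite cone by clipping its negative eigenvalues at zero.

```python
import numpy as np

S = np.array([[ 1.0, 0.9, -0.4],
              [ 0.9, 1.0,  0.3],
              [-0.4, 0.3,  1.0]])                 # symmetric, but has one negative eigenvalue

vals, vecs = np.linalg.eigh(S)
print(vals)                                       # note the negative entry
K = vecs @ np.diag(np.clip(vals, 0.0, None)) @ vecs.T
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)    # now a valid (PSD) kernel matrix
```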

From Similarity Scores to Kernels. [Table: matrix of similarity scores s_ij between data points x_1, ..., x_n and items t_1, ..., t_n.]

Kernels as Measures of Function Regularity. Problems of empirical risk minimization: the empirical risk functional is R_emp(f) = (1/n) Σ_i L(y_i, f(x_i)); minimizing it over too rich a class of functions is an ill-posed problem.

What Can We Do? We can restrict the set of functions over which we minimize the empirical risk functional (structural risk minimization), or modify the criterion to be minimized, e.g. by adding a penalty for 'complicated' functions (regularization). We can also combine the two.
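A standard way to write the regularized option (this is the usual regularised empirical risk formulation over an RKHS H, not copied verbatim from the slide; it assumes amsmath):

```latex
\[
  R_{\mathrm{emp}}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr),
  \qquad
  \hat f \;=\; \operatorname*{arg\,min}_{f \in \mathcal{H}}
  \Bigl[\, R_{\mathrm{emp}}(f) \;+\; \lambda\,\lVert f \rVert_{\mathcal{H}}^{2} \Bigr].
\]
```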


Best Approximation. [Figure: a function f in the space H and its best approximation from a subspace M.]

Best approximation. Assume M is finite-dimensional with basis {k_1, ..., k_m}, i.e. the approximation is g = a_1 k_1 + ... + a_m k_m. Orthogonality of f − g to M gives m conditions ⟨f − g, k_i⟩ = 0 (i = 1, ..., m), i.e. ⟨f, k_i⟩ − a_1⟨k_1, k_i⟩ − ... − a_m⟨k_m, k_i⟩ = 0.

RKHS approximation. In an RKHS with k_i = k(·, x_i), the m conditions become y_i − a_1 k(x_1, x_i) − ... − a_m k(x_m, x_i) = 0, i.e. y = K a. We can then estimate the parameters using a = K⁻¹y. In practice K can be ill-conditioned, so we minimise a penalised criterion and use a = (K + λI)⁻¹y.
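A small numeric sketch of a = K⁻¹y versus a = (K + λI)⁻¹y, using a Gaussian kernel on closely spaced points so that K is badly conditioned (the kernel, spacing, and λ here are illustrative assumptions):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)     # Gaussian kernel matrix on the design points

print(np.linalg.cond(K))                              # very large: K^(-1) y is unreliable
a = np.linalg.solve(K + 1e-6 * np.eye(len(x)), y)     # regularised solve: a = (K + lam I)^(-1) y

f = lambda t: np.exp(-(t - x) ** 2 / 0.1) @ a         # pattern function f(t) = sum_i a_i k(x_i, t)
print(f(0.25), np.sin(2 * np.pi * 0.25))              # estimate close to the target value
```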

Approximation vs Estimation. [Figure: the target space contains the true function; the hypothesis space contains the best possible estimate and the actual estimate, illustrating the gap between approximation and estimation.]

How to choose kernels? There is no absolute rule for choosing the right kernel adapted to a particular problem; the kernel should capture the desired similarity. Kernels for vectors: polynomial and Gaussian kernels. String kernels (text documents). Diffusion kernels (graphs). Sequence kernels (protein, DNA, RNA).
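As one concrete instance of a string/sequence kernel, here is a sketch of the p-spectrum kernel, which counts the length-p substrings two strings share; this is one standard choice, not necessarily the one the author has in mind.

```python
from collections import Counter

def spectrum_kernel(s, t, p=3):
    # inner product of the count vectors of all length-p substrings (an explicit feature map)
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)

print(spectrum_kernel("GATTACA", "GATTTACA"))     # shares several 3-mers, so a positive score
print(spectrum_kernel("GATTACA", "CCCGGG"))       # shares none, so 0
```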

Kernel Selection. Ideally, select the optimal kernel based on our prior knowledge of the problem domain. In practice, consider a family of kernels defined in a way that again reflects our prior expectations. Simple way: require only a limited amount of additional information from the training data. Elaborate way: combine label information.


Future Development. Mathematics: generalization of Mercer's theorem to pseudo-metric spaces; development of mathematical tools for multivariate regression. Statistics: application of kernels to multivariate data depth; application of ideas from robust statistics; application of these methods to circular data; they can also be used to study nonlinear time series.

Acknowledgement: Jieping Ye, Department of Computer Science and Engineering, Arizona State University (papers, software, workshops, conferences, etc.).

Thank You