Welcome to the Kernel-Club


Welcome to the Kernel-Club
Free dates: Oct. 1, Oct. 8, Oct. 15, Oct. 29, Nov. 12.
Aim: Read papers on kernels and discuss them.
To do: We need to find a suitable paper for next week. Every session has one chair (the one who really reads the paper).
URL: http://www.ics.uci/~welling/teatimetalks/KernelClub.html. Material (slides, papers) will be posted there.
Email me (welling@ics.uci.edu) or Gang (liang@ics.uci.edu) with questions, suggestions, or concerns.
Today: Intro to Kernels.

Introduction to Kernels (chapters 1, 2, 3, 4)
Max Welling, October 1, 2004

Introduction
What is the goal of (pick your favorite name):
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics
The automatic detection of non-coincidental structure in data.
Desiderata:
- Robust algorithms: insensitive to outliers and wrong model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: scale to large datasets.

Let's Learn Something
What is the common characteristic (structure) among the following statistical methods?
1. Principal Components Analysis
2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
Answer: They all consider linear combinations of the input vector: f(x) = ⟨w, x⟩ = Σᵢ wᵢ xᵢ.
Linear algorithms are very well understood and enjoy strong guarantees (convexity, generalization bounds).
Can we carry these guarantees over to non-linear algorithms?

Feature Spaces
Non-linear mapping φ: x ↦ φ(x) ∈ F to a feature space F, which may be:
1. a high-dimensional space
2. an infinite-dimensional countable space
3. a function space (Hilbert space)
Example: a feature map that lists monomial features of the coordinates of x (see the sketch below).
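A minimal sketch (Python/NumPy, not part of the original slides) of case 1: an explicit feature map into a higher-dimensional space. The quadratic map below is assumed as the illustration; the point to notice is that its feature-space inner product can be computed from the input-space inner product alone, which is what the next slides exploit.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D input x = (x1, x2):
    all degree-2 monomials, with a sqrt(2) weight on the cross term."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Inner product in feature space ...
lhs = phi(x) @ phi(z)
# ... equals a simple function of the inner product in input space:
rhs = (x @ z) ** 2
print(lhs, rhs)  # both equal 1.0
```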

Ridge Regression (duality)
Problem: minimize the regularized squared error
  min_w  λ‖w‖² + Σᵢ (yᵢ − ⟨w, xᵢ⟩)²    (λ: regularization, xᵢ: input, yᵢ: target)
Primal solution: w = (XᵀX + λI_d)⁻¹ Xᵀy    (requires a d×d inverse)
Dual representation: w = Xᵀ(XXᵀ + λI_n)⁻¹ y = Σᵢ αᵢ xᵢ  with  α = (G + λI_n)⁻¹ y
  (requires the n×n inverse of the Gram matrix G = XXᵀ; the solution is a linear combination of the data)
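A small NumPy sketch (my own illustration, with made-up data) of the primal/dual equivalence stated above: the d×d primal solve and the n×n dual solve through the Gram matrix give the same weight vector.

```python
import numpy as np

# Toy data: n samples, d features (illustrative values only).
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Primal solution: invert a d x d matrix.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: invert the n x n regularized Gram matrix G = X X^T,
# then express w as a linear combination of the data points.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))  # True: the two representations agree
```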

Kernel Trick
Note: In the dual representation we used the Gram matrix to express the solution.
Kernel trick: replace the feature-space inner product by a kernel, Gᵢⱼ = ⟨φ(xᵢ), φ(xⱼ)⟩ = k(xᵢ, xⱼ).
If we use algorithms that only depend on the Gram matrix G, then we never have to know (or compute) the actual features φ(x).
This is the crucial point of kernel methods.
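A hedged sketch of the trick applied to the previous slide: dual ridge regression where the Gram matrix is built from a kernel function instead of explicit feature vectors. The data, the bandwidth value, and the helper rbf_kernel are my own illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 c^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * c**2))

# Toy 1-D regression data (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = rbf_kernel(X, X)                          # Gram matrix via the kernel
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X) @ alpha        # predictions use only kernel evaluations
print(y_pred)
```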

Modularity
Kernel methods consist of two modules:
1) the choice of kernel (this is non-trivial)
2) the algorithm which takes kernels as input
Modularity: any kernel can be used with any kernel algorithm.
Some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
Some kernels (see the sketch below):
- linear: k(x, y) = ⟨x, y⟩
- polynomial: k(x, y) = (⟨x, y⟩ + R)^d
- RBF (Gaussian): k(x, y) = exp(−‖x − y‖² / (2c²))
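A minimal sketch of this modularity (my own setup, with kernel ridge regression standing in for "any kernel algorithm"): the algorithm only touches the data through a kernel callable, so kernels can be swapped freely.

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def poly_kernel(A, B, degree=3):
    return (A @ B.T + 1.0) ** degree

def rbf_kernel(A, B, c=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * c**2))

def kernel_ridge_fit_predict(kernel, X, y, X_test, lam=0.1):
    """A generic kernel algorithm: it only ever sees the data through `kernel`."""
    alpha = np.linalg.solve(kernel(X, X) + lam * np.eye(len(X)), y)
    return kernel(X_test, X) @ alpha

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = X[:, 0] ** 2 - X[:, 1]
X_test = rng.normal(size=(3, 2))

# Same algorithm, three interchangeable kernels.
for k in (linear_kernel, poly_kernel, rbf_kernel):
    print(k.__name__, kernel_ridge_fit_predict(k, X, y, X_test))
```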

What is a proper kernel?
Definition: A finitely positive semi-definite function k(x, y) is a symmetric function of its arguments for which the matrices formed by restriction to any finite subset of points are positive semi-definite.
Theorem: A function k(x, y) can be written as k(x, y) = ⟨φ(x), φ(y)⟩, where φ is a feature map, iff k(x, y) satisfies the finite positive semi-definiteness property.
Relevance: We can now check whether k(x, y) is a proper kernel using only properties of k(x, y) itself, i.e. without the need to know the feature map!
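The finite PSD property can be probed numerically on any finite set of points. A sketch (my own helper names and data) that checks the eigenvalues of the restricted Gram matrix:

```python
import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-10):
    """Check the finite PSD property on one finite subset of points:
    the Gram matrix restricted to X must have no negative eigenvalues."""
    K = kernel(X, X)
    assert np.allclose(K, K.T), "kernel must be symmetric"
    return np.linalg.eigvalsh(K).min() >= -tol

def rbf_kernel(A, B, c=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * c**2))

def not_a_kernel(A, B):
    # Symmetric, but in general not positive semi-definite.
    return -((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

X = np.random.default_rng(3).normal(size=(20, 2))
print(is_psd_on_sample(rbf_kernel, X))    # True: the RBF kernel is a proper kernel
print(is_psd_on_sample(not_a_kernel, X))  # False: not every symmetric function is a kernel
```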

Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very special feature map (note that more than one feature map may give rise to the same kernel): we map each point to a function, φ: x ↦ k(·, x), i.e. we map to a function space.
Definition of the function space: F = { f(·) = Σᵢ αᵢ k(·, xᵢ) }, with inner product ⟨ Σᵢ αᵢ k(·, xᵢ), Σⱼ βⱼ k(·, zⱼ) ⟩ = Σᵢⱼ αᵢ βⱼ k(xᵢ, zⱼ).
Reproducing property: ⟨ f, k(·, x) ⟩ = f(x); in particular ⟨ k(·, x), k(·, y) ⟩ = k(x, y), so φ(x) = k(·, x) is indeed a feature map for k.
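A small numerical illustration (mine, not from the slides) of working in this function space: for f = Σᵢ αᵢ k(·, xᵢ) the RKHS norm is ‖f‖² = αᵀKα, evaluation is an inner product with k(·, x) by the reproducing property, and Cauchy-Schwarz then bounds |f(x)| by ‖f‖·√k(x, x).

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * c**2))

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 2))        # points x_i that define f = sum_i alpha_i k(., x_i)
alpha = rng.normal(size=6)
x = rng.normal(size=(1, 2))        # an arbitrary evaluation point

K = rbf_kernel(X, X)
f_norm = np.sqrt(alpha @ K @ alpha)          # RKHS norm: ||f||^2 = alpha^T K alpha
f_at_x = (alpha @ rbf_kernel(X, x)).item()   # f(x) = <f, k(., x)>  (reproducing property)
k_xx = rbf_kernel(x, x).item()

# Cauchy-Schwarz in the RKHS: |f(x)| = |<f, k(., x)>| <= ||f|| * sqrt(k(x, x))
print(abs(f_at_x) <= f_norm * np.sqrt(k_xx))  # True
```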

Mercer's Theorem
Theorem: Let X be compact and k(x, y) a symmetric continuous function such that the integral operator T_k f(·) = ∫_X k(·, x) f(x) dx is positive semi-definite, i.e. ∫∫ k(x, y) f(x) f(y) dx dy ≥ 0 for all f in L2(X). Then there exists an orthonormal basis of eigen-functions ψᵢ of T_k, with eigenvalues λᵢ ≥ 0, such that
  k(x, y) = Σᵢ λᵢ ψᵢ(x) ψᵢ(y),
so that φ(x) = ( √λ₁ ψ₁(x), √λ₂ ψ₂(x), ... ) is a feature map.
Hence: k(x, y) is a proper kernel.
Note: Here we construct feature vectors in L2, whereas the RKHS construction was in a function space.
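A finite-sample analogue of the Mercer expansion (my own sketch): eigendecompose the Gram matrix and use eigenvectors scaled by the square roots of the eigenvalues as explicit feature vectors; their inner products reproduce the kernel on the sample.

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * c**2))

X = np.random.default_rng(5).normal(size=(25, 3))
K = rbf_kernel(X, X)

# Discrete analogue of k(x, y) = sum_i lambda_i psi_i(x) psi_i(y):
lam, U = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)        # clip tiny negative values from round-off
Phi = U * np.sqrt(lam)               # row i is a feature vector for x_i

print(np.allclose(Phi @ Phi.T, K))   # True: <phi(x_i), phi(x_j)> reproduces k(x_i, x_j)
```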

Learning Kernels
All information is tunneled through the Gram-matrix "information bottleneck".
The real art is to pick an appropriate kernel. E.g. take the RBF kernel k(x, y) = exp(−‖x − y‖² / (2c²)):
- if c is very small: G ≈ I (all data look dissimilar): over-fitting
- if c is very large: G ≈ all-ones matrix (all data look very similar): under-fitting
We need to learn the kernel. Here are some ways to combine kernels to improve them (valid kernels form a cone):
- conic combinations: k = α₁ k₁ + α₂ k₂ with α₁, α₂ ≥ 0
- products: k = k₁ k₂
- more generally, any polynomial of kernels with positive coefficients
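A quick numerical illustration (mine, with made-up data) of both points: the two extremes of the RBF bandwidth, and the fact that conic combinations and products of kernels keep the Gram matrix positive semi-definite.

```python
import numpy as np

def rbf_gram(X, c):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * c**2))

X = np.random.default_rng(6).normal(size=(8, 2))

# Extreme bandwidths: the Gram matrix collapses to I or to the all-ones matrix.
print(np.allclose(rbf_gram(X, c=1e-3), np.eye(8)))        # ~identity: everything dissimilar
print(np.allclose(rbf_gram(X, c=1e4), np.ones((8, 8))))   # ~all-ones: everything similar

# Combining kernels: conic combinations and (elementwise) products of PSD Gram matrices stay PSD.
K1, K2 = rbf_gram(X, 0.5), rbf_gram(X, 2.0)
for K in (3.0 * K1 + 0.7 * K2, K1 * K2):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)           # True
```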

Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance: cross-validation, Bayesian methods, generalization bounds, ...
Suppose we find a pattern in a sample S. Is this pattern also likely to be present in new data drawn from the same distribution P?
We can use concentration inequalities (McDiarmid's theorem) to prove the following.
Theorem: Let S = {x₁, ..., x_ℓ} be an IID sample from P and define the sample mean of f(x) as Ê_S[f] = (1/ℓ) Σᵢ f(xᵢ). Then, for f bounded in [a, b],
  P_S( |Ê_S[f] − E_P[f]| ≥ ε ) ≤ 2 exp( −2ℓε² / (b − a)² ),
i.e. the probability that the sample mean and the population mean differ by less than ε is at least 1 − 2 exp(−2ℓε²/(b−a)²), independent of P!
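A Monte Carlo check of this concentration statement (my own toy setup: P = Uniform(0, 1) and f(x) = x, so f maps to [0, 1] and the population mean is 0.5):

```python
import numpy as np

# For f in [0, 1]:  P(|sample mean - population mean| >= eps) <= 2 exp(-2 l eps^2)
rng = np.random.default_rng(7)
l, eps, trials = 200, 0.1, 20000

sample_means = rng.uniform(0, 1, size=(trials, l)).mean(axis=1)
empirical_prob = np.mean(np.abs(sample_means - 0.5) >= eps)
bound = 2 * np.exp(-2 * l * eps**2)

print(empirical_prob, bound, empirical_prob <= bound)  # the bound holds (and is loose)
```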

Rademacher Complexity
Problem: we only checked the generalization performance for a single fixed pattern f(x). What if we want to search over a function class F?
Intuition: we need to incorporate the complexity of this function class. Rademacher complexity captures the ability of the function class to fit random noise.
Empirical Rademacher complexity: R̂_ℓ(F) = E_σ [ sup_{f ∈ F} (2/ℓ) Σᵢ σᵢ f(xᵢ) ], where the σᵢ are uniformly distributed ±1 variables.
(The slide's figure shows two functions f₁, f₂ evaluated at the sample points xᵢ, illustrating how a flexible function can correlate with the random signs.)
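A sketch (my own tiny function class and data) of estimating the empirical Rademacher complexity by Monte Carlo over the sign variables σ:

```python
import numpy as np

rng = np.random.default_rng(8)
l = 50
x = rng.uniform(-1, 1, size=l)

# A tiny, fixed function class F, evaluated on the sample.
F = np.stack([
    np.sign(x),        # f1
    np.tanh(3 * x),    # f2
    x,                 # f3
    np.ones_like(x),   # f4 (a constant cannot fit noise at all)
])

n_mc = 5000
sigma = rng.choice([-1.0, 1.0], size=(n_mc, l))   # uniform +/-1 noise
correlations = (2.0 / l) * sigma @ F.T            # (2/l) sum_i sigma_i f(x_i), for each f
emp_rc = correlations.max(axis=1).mean()          # E_sigma[ sup_f ... ]
print(emp_rc)   # small: this rigid class cannot fit random signs well
```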

Generalization Bound
Theorem: Let f be a function in F which maps to [0, 1] (e.g. loss functions). Then, with probability at least 1 − δ over random draws of a sample of size ℓ, every f ∈ F satisfies
  E[f] ≤ Ê_S[f] + R̂_ℓ(F) + 3 √( ln(2/δ) / (2ℓ) ).
Relevance: The pattern Ê_S[f] ≈ 0 will also be present in a new data set (E[f] ≈ 0), if the last 2 terms are small:
- complexity of the function class F small
- number of training data ℓ large

Linear Functions (in feature space)
Consider the function class F_B = { x ↦ ⟨w, φ(x)⟩ : ‖w‖ ≤ B } and a sample S = {x₁, ..., x_ℓ}.
Then the empirical RC of F_B is bounded by
  R̂_ℓ(F_B) ≤ (2B/ℓ) √( Σᵢ k(xᵢ, xᵢ) ) = (2B/ℓ) √(tr K).
Relevance: Since ‖w‖² = αᵀKα for dual solutions w = Σᵢ αᵢ φ(xᵢ), it follows that if we control the norm in kernel algorithms, we control the complexity of the function class (regularization).
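A numerical check of the trace bound (my own sketch). For the norm-bounded linear class the supremum over w has a closed form, sup_{‖w‖≤B} (2/ℓ)⟨w, Σᵢ σᵢ φ(xᵢ)⟩ = (2B/ℓ)√(σᵀKσ), so the empirical RC can be estimated by Monte Carlo and compared with (2B/ℓ)√(tr K):

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * c**2))

rng = np.random.default_rng(9)
l, B = 40, 1.0
X = rng.normal(size=(l, 2))
K = rbf_kernel(X, X)

sigma = rng.choice([-1.0, 1.0], size=(10000, l))
emp_rc = (2 * B / l) * np.mean(np.sqrt(np.einsum('ti,ij,tj->t', sigma, K, sigma)))
trace_bound = (2 * B / l) * np.sqrt(np.trace(K))

print(emp_rc, trace_bound, emp_rc <= trace_bound)  # True: Jensen's inequality gives the bound
```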

Margin Bound (classification)
Theorem: Choose c > 0 (the margin). Consider functions f(x, y) = −y g(x), with labels y = +1, −1, a sample S = {(x₁, y₁), ..., (x_ℓ, y_ℓ)}, and δ ∈ (0, 1) the probability of violating the bound. Then, with probability at least 1 − δ, the probability of misclassification on new data is bounded by the average margin slack
  (1/(ℓc)) Σᵢ ξᵢ,  with ξᵢ = max(0, c − yᵢ g(xᵢ)),
plus a complexity term proportional to √(tr K)/(ℓc) and a confidence term of order √(ln(1/δ)/ℓ).
Relevance: We can bound our classification error on new samples. Moreover, we have a strategy to improve generalization: choose the margin c as large as possible such that all samples are classified correctly with margin at least c (all ξᵢ = 0), e.g. support vector machines.
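A small sketch (my own toy data and a hand-picked classifier, not an SVM fit) of the quantities in the bound: the margin slacks ξᵢ and the leading slack term, for a few choices of the margin c.

```python
import numpy as np

rng = np.random.default_rng(10)
l = 100
X = rng.normal(size=(l, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=l))   # noisy +/-1 labels
w = np.array([1.0, 0.0])                          # a hand-picked unit-norm linear classifier
g = X @ w                                         # g(x) = <w, x>

for c in (0.1, 0.5, 1.0):                         # try a few margins
    xi = np.maximum(0.0, c - y * g)               # margin slacks xi_i = max(0, c - y_i g(x_i))
    slack_term = xi.sum() / (l * c)               # leading term of the margin bound
    print(f"c={c}: margin violations={np.mean(xi > 0):.2f}, slack term={slack_term:.3f}")
```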