Kernel Methods Arie Nakhmani

Outline: Kernel Smoothers, Kernel Density Estimators, Kernel Density Classifiers

Kernel Smoothers – The Goal. Estimate a function from noisy observations when the parametric model for the function is unknown. The resulting function should be smooth, and the level of "smoothness" should be set by a single parameter.

Example: N = 100 sample points. What counts as "smooth enough"?

Example: N = 100 sample points.

Exponential Smoother. Smaller α → a smoother line, but more delayed.

Exponential Smoother – properties: simple; sequential; single parameter α; memory for a single value; too rough; delayed.
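A minimal Python/NumPy sketch of the exponential smoother described above, assuming the standard recursion s_t = α y_t + (1 − α) s_{t−1}; the test signal and the value α = 0.2 are illustrative choices, not taken from the slides.

```python
import numpy as np

def exponential_smoother(y, alpha=0.2):
    """Sequential smoother: s[t] = alpha*y[t] + (1-alpha)*s[t-1]."""
    y = np.asarray(y, dtype=float)
    s = np.empty_like(y)
    s[0] = y[0]                                  # initialize with the first observation
    for t in range(1, len(y)):
        s[t] = alpha * y[t] + (1.0 - alpha) * s[t - 1]
    return s

# Illustrative data: N = 100 noisy samples of a smooth function.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 100)
y = np.sin(x) + 0.3 * rng.standard_normal(100)
smoothed = exponential_smoother(y, alpha=0.2)    # smaller alpha -> smoother but more delayed
```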

Moving Average Smoother

m = 11. Larger m → a smoother, but more straightened (flattened) line.

Moving Average Smoother – properties: sequential; single parameter: the window size m; memory for m values; irregularly smooth. What if we have a p-dimensional problem with p > 1?
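For comparison, a sketch of a centered moving-average smoother with window size m (assumed odd); leaving the end points, where the full window does not fit, as NaN is one of several possible boundary conventions and is an assumption here.

```python
import numpy as np

def moving_average(y, m=11):
    """Centered moving average: mean of the m samples around each point."""
    y = np.asarray(y, dtype=float)
    half = m // 2
    out = np.full(len(y), np.nan)                # boundary points left undefined
    for t in range(half, len(y) - half):
        out[t] = y[t - half:t + half + 1].mean()
    return out

# Larger m -> smoother but more "straightened" output.
smoothed = moving_average(np.sin(np.linspace(0.0, 6.0, 100)), m=11)
```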

Nearest Neighbors Smoother (fit at a target point x0). m = 16. Larger m → a smoother, but more biased line.

Nearest Neighbors Smoother – properties: not sequential; single parameter: the number of neighbors m; trivially extended to any number of dimensions; memory for m values; depends on the metric definition; not smooth enough; biased end-points.
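A sketch of the m-nearest-neighbor smoother in one dimension: the fit at each query point is the mean response of its m nearest sample points. The absolute-distance metric, m = 16, and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_smoother(x, y, x_query, m=16):
    """Average the responses of the m nearest neighbors of each query point."""
    x, y, x_query = (np.asarray(a, dtype=float) for a in (x, y, x_query))
    fitted = np.empty(len(x_query))
    for j, q in enumerate(x_query):
        idx = np.argsort(np.abs(x - q))[:m]      # indices of the m closest sample points
        fitted[j] = y[idx].mean()
    return fitted

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(4.0 * x) + 0.3 * rng.standard_normal(100)
fit = knn_smoother(x, y, x_query=np.linspace(0.0, 1.0, 200), m=16)
```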

Low Pass Filter. 2nd-order Butterworth. Why do we need kernel smoothers?

Low Pass Filter. The same filter, applied to a log function.

Low Pass Filter – properties: smooth; simply extended to any number of dimensions; effectively three parameters: type, order, and bandwidth; biased end-points; inappropriate for some functions (depending on the bandwidth).

Kernel Average Smoother (fit at a target point x0).

Nadaraya–Watson kernel-weighted average:

\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}

with the kernel

K_\lambda(x_0, x) = D\left( \frac{|x - x_0|}{h_\lambda(x_0)} \right),

where h_\lambda(x_0) = |x_0 - x_{[m]}| (the distance to the m-th nearest sample point) for the nearest-neighbor smoother, and h_\lambda(x_0) = \lambda for the locally weighted average.
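A sketch of the Nadaraya–Watson estimator with a fixed metric window, h_λ(x0) = λ, and the Epanechnikov kernel; the bandwidth λ = 0.2 and the function names are illustrative assumptions.

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def nadaraya_watson(x, y, x_query, lam=0.2):
    """Kernel-weighted average: sum_i K(x0, x_i) y_i / sum_i K(x0, x_i)."""
    x, y, x_query = (np.asarray(a, dtype=float) for a in (x, y, x_query))
    fitted = np.empty(len(x_query))
    for j, q in enumerate(x_query):
        w = epanechnikov((x - q) / lam)
        fitted[j] = np.dot(w, y) / w.sum() if w.sum() > 0 else np.nan
    return fitted
```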

Popular Kernels. Epanechnikov kernel: D(t) = \frac{3}{4}(1 - t^2) for |t| \le 1, and 0 otherwise. Tri-cube kernel: D(t) = (1 - |t|^3)^3 for |t| \le 1, and 0 otherwise. Gaussian kernel: D(t) = \phi(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}.
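The three kernels above, written as plain NumPy functions of the scaled distance t (a sketch; the function names are mine, and the constants follow the standard definitions given above).

```python
import numpy as np

def epanechnikov(t):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def tricube(t):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, (1.0 - np.abs(t)**3)**3, 0.0)

def gaussian(t):
    t = np.asarray(t, dtype=float)
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
```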

Non-Symmetric Kernel. Kernel example (formula shown on the slide): which kernel is that?

Kernel Average Smoother – properties: single parameter: the window width; smooth; trivially extended to any number of dimensions; memory-based method – little or no training is required; depends on the metric definition; biased end-points.

Local Linear Regression. The kernel-weighted average minimizes

\min_{\theta} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, [y_i - \theta]^2,

whereas local linear regression minimizes

\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, [y_i - \alpha(x_0) - \beta(x_0)\, x_i]^2,

with the estimate \hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0)\, x_0.

Local Linear Regression. Solution:

\hat{f}(x_0) = b(x_0)^T \left( B^T W(x_0) B \right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i,

where b(x)^T = (1, x), B is the N \times 2 matrix whose i-th row is b(x_i)^T, and W(x_0) is the N \times N diagonal matrix with i-th diagonal element K_\lambda(x_0, x_i). The second representation shows that the fit is linear in the y_i; the weights l_i(x_0) are called the equivalent kernel.
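A sketch of local linear regression implementing the closed-form solution above; the Epanechnikov kernel and the bandwidth λ = 0.3 are assumptions, and λ must be wide enough that every query point has several weighted neighbors (otherwise B^T W B is singular and the solve fails).

```python
import numpy as np

def epanechnikov(t):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def local_linear(x, y, x_query, lam=0.3):
    """Fit alpha(x0) + beta(x0)*x by weighted least squares at each query point."""
    x, y, x_query = (np.asarray(a, dtype=float) for a in (x, y, x_query))
    B = np.column_stack([np.ones_like(x), x])      # rows b(x_i)^T = (1, x_i)
    fitted = np.empty(len(x_query))
    for j, q in enumerate(x_query):
        W = np.diag(epanechnikov((x - q) / lam))   # K_lambda(x0, x_i) on the diagonal
        theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
        fitted[j] = np.array([1.0, q]) @ theta     # evaluate the local line at x0
    return fitted
```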

Local Linear Regression (fit at a target point x0).

Equivalent Kernels

Local Polynomial Regression. Why stop at local linear fits? Let's minimize

\min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\dots,d} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j \right]^2.
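The same idea extended to a degree-d local polynomial fit, as a sketch; degree = 2 corresponds to the local quadratic fits discussed in the conclusions below, and the Epanechnikov kernel and bandwidth are assumed choices.

```python
import numpy as np

def local_polynomial(x, y, x_query, degree=2, lam=0.3):
    """Weighted least-squares fit of a degree-d polynomial around each query point."""
    x, y, x_query = (np.asarray(a, dtype=float) for a in (x, y, x_query))
    B = np.vander(x, degree + 1, increasing=True)              # columns 1, x, x^2, ..., x^d
    fitted = np.empty(len(x_query))
    for j, q in enumerate(x_query):
        t = (x - q) / lam
        w = np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)   # Epanechnikov weights
        W = np.diag(w)
        beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
        fitted[j] = np.polyval(beta[::-1], q)                  # evaluate the local polynomial at x0
    return fitted
```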

Local Polynomial Regression

Variance Compromise

Conclusions. Local linear fits can reduce the bias dramatically at the boundaries, at a modest cost in variance. Local linear fits are more reliable for extrapolation. Local quadratic fits do little for the bias at the boundaries but increase the variance a lot. Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain. λ controls the tradeoff between bias and variance: a larger λ gives lower variance but higher bias.

Local Regression in \mathbb{R}^p. Radial kernel: K_\lambda(x_0, x) = D\left( \frac{\|x - x_0\|}{\lambda} \right).

Popular Kernels: Epanechnikov kernel, tri-cube kernel, Gaussian kernel.

Example

Higher Dimensions. Estimation near the boundary is problematic. Many sample points are needed to reduce the bias. Local regression is less useful for p > 3: it is impossible to maintain localness (low bias) and sizeable samples in the neighborhood (low variance) at the same time.

Structured Kernels. Non-radial kernel:

K_{\lambda, A}(x_0, x) = D\left( \frac{(x - x_0)^T A\, (x - x_0)}{\lambda} \right).

Coordinates or directions can be downgraded or omitted by imposing restrictions on A. The covariance can be used to adapt the metric A (related to the Mahalanobis distance). Projection-pursuit models are a related structured approach.
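A sketch of a structured (non-radial) kernel of the form D((x − x0)^T A (x − x0) / λ); taking A as the inverse sample covariance gives a Mahalanobis-type metric, as mentioned above. The toy data and the choice D(t) = exp(−t/2) are illustrative assumptions.

```python
import numpy as np

def structured_kernel(x0, x, A, lam=1.0):
    """Non-radial kernel weight D((x - x0)^T A (x - x0) / lam) with D(t) = exp(-t/2)."""
    d = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    return np.exp(-0.5 * (d @ A @ d) / lam)

# Illustrative data with very different scales per coordinate.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3)) * np.array([1.0, 5.0, 0.2])
A = np.linalg.inv(np.cov(X, rowvar=False))      # Mahalanobis-type metric
w = structured_kernel(X[0], X[1], A)            # weight of sample 1 relative to sample 0
```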

Structured Regression. Divide the predictors into a set (X_1, X_2, \dots, X_q) with q < p, and collect the remaining variables in a vector Z. Conditionally linear model:

f(X) = \alpha(Z) + \beta_1(Z)\, X_1 + \cdots + \beta_q(Z)\, X_q.

For a given Z = z_0, fit the model by locally weighted least squares:

\min_{\alpha(z_0),\, \beta(z_0)} \sum_{i=1}^{N} K_\lambda(z_0, z_i) \left( y_i - \alpha(z_0) - \beta_1(z_0)\, x_{1i} - \cdots - \beta_q(z_0)\, x_{qi} \right)^2.

Density Estimation. Figure: a sample set drawn from a mixture of two normal distributions, together with the original distribution and a constant-window estimate.

Kernel Density Estimation. Smooth Parzen estimate:

\hat{f}_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i).

Comparison (mixture of two normal distributions). Usually, bandwidth selection is more important than the choice of kernel function.

Kernel Density Estimation. Gaussian kernel density estimation:

\hat{f}_X(x_0) = \frac{1}{N} \sum_{i=1}^{N} \phi_\lambda(x_0 - x_i),

where \phi_\lambda denotes the Gaussian density with mean zero and standard deviation \lambda. Generalization to \mathbb{R}^p:

\hat{f}_X(x_0) = \frac{1}{N (2\lambda^2 \pi)^{p/2}} \sum_{i=1}^{N} e^{-\frac{1}{2} \left( \|x_i - x_0\| / \lambda \right)^2}.

In effect, the empirical distribution is low-pass filtered (LPF).
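A sketch of the Gaussian Parzen estimate in \mathbb{R}^p given above; the bandwidth λ = 0.5 and the helper names are illustrative assumptions.

```python
import numpy as np

def _as_matrix(a):
    a = np.asarray(a, dtype=float)
    return a.reshape(len(a), -1)                 # (N,) -> (N, 1); (N, p) unchanged

def gaussian_kde(samples, x_query, lam=0.5):
    """Parzen density estimate with an isotropic Gaussian kernel of width lam."""
    samples, x_query = _as_matrix(samples), _as_matrix(x_query)
    N, p = samples.shape
    norm = N * (2.0 * np.pi * lam**2) ** (p / 2.0)
    sq = ((x_query[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)   # (M, N) squared distances
    return np.exp(-0.5 * sq / lam**2).sum(axis=1) / norm
```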

Kernel Density Classification. For a J-class problem, fit nonparametric density estimates \hat{f}_j(X) separately in each class, estimate the class priors \hat{\pi}_j, and apply Bayes' theorem:

\hat{P}(G = j \mid X = x_0) = \frac{\hat{\pi}_j\, \hat{f}_j(x_0)}{\sum_{k=1}^{J} \hat{\pi}_k\, \hat{f}_k(x_0)}.
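A sketch of the kernel density classifier implied by the formula above: per-class Gaussian Parzen estimates combined with sample-proportion priors through Bayes' theorem. The gaussian_kde helper repeats the estimator from the previous sketch, and the common bandwidth λ for all classes is an assumption.

```python
import numpy as np

def gaussian_kde(samples, x_query, lam=0.5):
    """Gaussian Parzen density estimate (same form as the sketch above)."""
    samples = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    x_query = np.asarray(x_query, dtype=float).reshape(len(x_query), -1)
    N, p = samples.shape
    sq = ((x_query[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq / lam**2).sum(axis=1) / (N * (2.0 * np.pi * lam**2) ** (p / 2.0))

def kde_classify(X_train, g_train, x_query, lam=0.5):
    """Posterior P(G=j | x0) proportional to prior_j * f_j(x0); returns predicted class per query."""
    X_train, g_train = np.asarray(X_train), np.asarray(g_train)
    classes = np.unique(g_train)
    numer = np.stack([np.mean(g_train == j) * gaussian_kde(X_train[g_train == j], x_query, lam)
                      for j in classes])          # shape (J, M)
    posterior = numer / numer.sum(axis=0)         # normalize over the J classes
    return classes[np.argmax(posterior, axis=0)], posterior
```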

Radial Basis Functions. The function f(x) is represented as an expansion in basis functions:

f(x) = \sum_{j=1}^{M} \beta_j\, h_j(x).

Radial basis function (RBF) expansion:

f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\, \beta_j = \sum_{j=1}^{M} D\left( \frac{\|x - \xi_j\|}{\lambda_j} \right) \beta_j,

where the sum-of-squares is minimized with respect to all the parameters (for the Gaussian kernel):

\min_{\{\lambda_j,\, \xi_j,\, \beta_j\}_{1}^{M}} \; \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{M} \beta_j \exp\left( -\frac{(x_i - \xi_j)^T (x_i - \xi_j)}{\lambda_j^2} \right) \right)^2.

Radial Basis Functions. When assuming a constant width \lambda_j = \lambda, there is the problem of "holes" – regions where none of the kernels has appreciable support. The solution – renormalized RBF basis functions:

h_j(x) = \frac{D\left( \|x - \xi_j\| / \lambda \right)}{\sum_{k=1}^{M} D\left( \|x - \xi_k\| / \lambda \right)}.
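A sketch of the renormalized Gaussian RBF basis with a common width λ and fixed prototypes ξ_j, followed by a least-squares fit of the coefficients β only; placing the prototypes on a grid instead of optimizing them, and the data and λ = 1.5, are simplifying assumptions.

```python
import numpy as np

def renormalized_rbf_basis(x, centers, lam=1.0):
    """h_j(x) = D(||x - xi_j||/lam) / sum_k D(||x - xi_k||/lam), with Gaussian D."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    centers = np.asarray(centers, dtype=float).reshape(len(centers), -1)
    d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2) / lam
    D = np.exp(-0.5 * d**2)
    return D / D.sum(axis=1, keepdims=True)       # each row sums to 1, so no "holes"

# Illustrative 1-D fit with 8 fixed prototypes on a grid.
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0.0, 10.0, 100))
y = np.sin(X) + 0.2 * rng.standard_normal(100)
centers = np.linspace(0.0, 10.0, 8)
H = renormalized_rbf_basis(X, centers, lam=1.5)
beta = np.linalg.lstsq(H, y, rcond=None)[0]       # least-squares coefficients beta
y_hat = H @ beta
```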

Additional Applications Local likelihood Mixture models for density estimation and classification Mean-shift

Conclusions. Memory-based methods: the model is the entire training data set. Infeasible for many real-time applications. Provide good smoothing results for an arbitrarily sampled function. Appropriate for interpolation and extrapolation. When a parametric model is known, it is better to use other fitting methods.