
1 Machine learning, pattern recognition and statistical data modelling
Lecture 6: Kernel methods and additive models. Coryn Bailer-Jones

2 Topics
Think globally, act locally: kernel methods
Generalized Additive Models (GAMs) for regression (classification next week)
Confidence intervals

3 Kernel methods In the first lecture we looked at kernel methods for density estimation, e.g. a Gaussian kernel of width 2h in d dimensions estimated from N data points:
$$\hat{f}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{d/2}} \exp\left(-\frac{\|x - x_n\|^2}{2h^2}\right)$$
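A minimal R sketch of this estimator, assuming a data matrix `X` (N x d) and a query point `x`; the function name `gauss_kde` and the test data are illustrative, not from the lecture:

```r
## Gaussian kernel density estimate, as in the formula above
gauss_kde <- function(x, X, h) {
  N <- nrow(X); d <- ncol(X)
  d2 <- rowSums(sweep(X, 2, x)^2)            # squared distances ||x - x_n||^2
  mean(exp(-d2 / (2 * h^2))) / (2 * pi * h^2)^(d / 2)
}

## Example: density of 200 standard-normal points in 2D, evaluated at the origin
set.seed(1)
X <- matrix(rnorm(400), ncol = 2)
gauss_kde(c(0, 0), X, h = 0.5)
```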

4 K-NN kernel density estimation
K = no. of neighbours, N = total no. of points, V = volume occupied by the K neighbours. To overcome the fixed kernel size, vary the search volume V until it contains K neighbours:
$$\hat{f}(x) = \frac{K}{NV}$$
© Bishop (1995)
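A hedged R sketch of this estimate: grow the search radius until it contains K neighbours and take V as the volume of that ball. The function name and example data are illustrative:

```r
knn_density <- function(x, X, K) {
  N <- nrow(X); d <- ncol(X)
  r <- sort(sqrt(rowSums(sweep(X, 2, x)^2)))[K]   # distance to the K-th neighbour
  V <- pi^(d / 2) / gamma(d / 2 + 1) * r^d        # volume of a d-ball of radius r
  K / (N * V)
}

set.seed(1)
X <- matrix(rnorm(400), ncol = 2)
knn_density(c(0, 0), X, K = 10)
```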

5 One-dimensional kernel smoothers: k-nn
π‘˜βˆ’nn smoother:πΈξ‚žπ‘Œβˆ£π‘‹=π‘₯ξ‚Ÿ= 𝑓 ξ‚‘ ξ‚žπ‘₯ξ‚Ÿ=Ave 𝑦 𝑖 ∣ π‘₯ 𝑖 ∈ 𝑁 π‘˜ ξ‚žπ‘₯ξ‚Ÿ 𝑁 π‘˜ ξ‚žπ‘₯ξ‚Ÿis the set ofπ‘˜points nearest toπ‘₯in (e.g.) squared distance Drawback is that the estimator is not smooth inπ‘₯. π‘˜=30 Β© Hastie, Tibshirani, Friedman (2001)

6 One-dimensional kernel smoothers: Epanechnikov
Instead give more distant points less weight, e.g. with the Nadaraya-Watson kernel-weighted average
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
using the Epanechnikov kernel
$$K_\lambda(x_0, x_i) = D\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} 1 - t^2 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
The kernel could be generalized to have a variable width:
$$K_\lambda(x_0, x_i) = D\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$
© Hastie, Tibshirani, Friedman (2001)
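A minimal R sketch of the Nadaraya-Watson estimate with an Epanechnikov kernel of width lambda (the kernel's normalizing constant cancels in the ratio, so it is omitted); names and data are illustrative:

```r
epanechnikov <- function(t) ifelse(abs(t) <= 1, 1 - t^2, 0)

nw_smooth <- function(x0, x, y, lambda = 0.2) {
  w <- epanechnikov(abs(x - x0) / lambda)   # K_lambda(x0, x_i)
  sum(w * y) / sum(w)
}

set.seed(1)
x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
xgrid <- seq(0, 1, length.out = 200)
plot(x, y)
lines(xgrid, sapply(xgrid, nw_smooth, x = x, y = y, lambda = 0.2))
```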

7 Kernel comparison
Epanechnikov: $D(t) = \begin{cases} 1 - t^2 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$
Tri-cube: $D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$
© Hastie, Tibshirani, Friedman (2001)

8 k-nn and Epanechnikov kernels
π‘˜=30 =0.2 Epanechnikov kernel has fixed width (bias approx. constant, variance not) k-nn has adaptive width (constant variance, bias varies as 1/density) free parameters: k or  Β© Hastie, Tibshirani, Friedman (2001)

9 Locally-weighted averages can be biased at boundaries
The kernel is asymmetric at the boundary. © Hastie, Tibshirani, Friedman (2001)

10 Local linear regression
Solve a linear least-squares problem in a local region to predict at a single point. Green points: the effective kernel. © Hastie, Tibshirani, Friedman (2001)
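A hedged R sketch of local linear regression at a single target point x0: weighted least squares with Epanechnikov weights, then evaluate the fitted line at x0. Names and data are illustrative; loess(..., degree = 1) in base R implements essentially the same idea with a tri-cube kernel and a span-based neighbourhood.

```r
epanechnikov <- function(t) ifelse(abs(t) <= 1, 1 - t^2, 0)

local_linear <- function(x0, x, y, lambda = 0.2) {
  w <- epanechnikov(abs(x - x0) / lambda)        # local weights
  fit <- lm(y ~ x, weights = w)                  # local weighted least squares
  as.numeric(predict(fit, newdata = data.frame(x = x0)))
}

set.seed(1)
x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
xgrid <- seq(0, 1, length.out = 200)
plot(x, y)
lines(xgrid, sapply(xgrid, local_linear, x = x, y = y, lambda = 0.2))
```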

11 Local quadratic regression
Β© Hastie, Tibshirani, Friedman (2001)

12 Bias-variance trade-off
Higher-order local fits reduce bias at the cost of increased variance, especially at the boundary (see the previous slide). © Hastie, Tibshirani, Friedman (2001)

13 Kernels in higher dimensions
Kernel smoothing and local regression generalize to higher dimensions...
...but the curse of dimensionality is not overcome: we cannot simultaneously retain localness (= low bias) and a sufficient sample size (= low variance) without increasing the total sample exponentially with dimension.
In general we need to make assumptions about the underlying data / true function and use structured regression or classification.

14 Generalized Additive Model
Could model a p-dimensional set of data using
$$Y(X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p) + \epsilon$$
The idea is to fit each one-dimensional function separately and then provide an algorithm to iteratively combine them. Do this by minimizing the penalized residual sum of squares
$$PRSS = \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$$
A variety of smoothers could be used for each $f_j(\cdot)$, with the corresponding penalty; here cubic splines are used. To make the solution unique $\alpha$ must be fixed, e.g. $\alpha = \frac{1}{N} \sum_{i=1}^{N} y_i$, in which case $\sum_{i=1}^{N} f_j(x_{ij}) = 0$ for all $j$.
Avoiding the curse: split the p-dimensional problem into p one-dimensional ones.

15 Backfitting algorithm for additive models
In principle this step is not required. $S_j$ is a smoothing spline fit, as a function of the $x_{ij}$, to the residuals, i.e. to what should be explained by $f_j$. © Hastie, Tibshirani, Friedman (2001)
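The algorithm itself appears as a figure from Hastie et al. (2001) that does not survive in this transcript; the following is a minimal R sketch of backfitting with cubic smoothing splines as the $S_j$, under illustrative names and tolerances:

```r
backfit <- function(X, y, df = 4, tol = 1e-6, maxit = 50) {
  N <- nrow(X); p <- ncol(X)
  alpha <- mean(y)                      # fix alpha = mean(y), so sum_i f_j(x_ij) = 0
  f <- matrix(0, N, p)                  # current estimates f_j(x_ij)
  for (it in 1:maxit) {
    f_old <- f
    for (j in 1:p) {
      r <- y - alpha - rowSums(f[, -j, drop = FALSE])   # partial residuals
      s <- smooth.spline(X[, j], r, df = df)            # S_j applied to the residuals
      f[, j] <- predict(s, X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])                   # re-centre each f_j
    }
    if (max(abs(f - f_old)) < tol) break                # stop when the f_j stabilize
  }
  list(alpha = alpha, f = f)
}

## e.g. on synthetic data with two additive components
set.seed(1)
X <- matrix(runif(200), ncol = 2)
y <- sin(3 * X[, 1]) + X[, 2]^2 + rnorm(100, sd = 0.1)
fit <- backfit(X, y)
```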

16 Generalized additive models on the rock data
Application of the gam{gam} package to the rock{MASS} data set. See the R scripts on the lecture web site.
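Those scripts are not reproduced here; the following is a hedged sketch of the kind of fit meant. The model formula is an assumption (following the rock example in Hastie et al. 2001), not the lecture's actual script:

```r
library(gam)                                   # the gam{gam} package
data(rock)                                     # columns: area, peri, shape, perm
fit <- gam(log(perm) ~ s(area) + s(peri) + s(shape), data = rock)
summary(fit)
par(mfrow = c(1, 3))
plot(fit, se = TRUE)                           # one fitted smooth term per predictor
```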

17 Confidence intervals with splines
The spline function estimate is
$$\hat{\mathbf{f}}(x) = \mathbf{H} \hat{\boldsymbol{\theta}} = \mathbf{H} \left( \mathbf{H}^T \mathbf{H} + \lambda \boldsymbol{\Omega}_N \right)^{-1} \mathbf{H}^T \mathbf{y} = \mathbf{S}_\lambda \mathbf{y}$$
The smoother matrix $\mathbf{S}_\lambda$ depends only on the $x_i$ and $\lambda$, but not on $\mathbf{y}$.
$$\mathbf{V} = \mathrm{Var}[\hat{\mathbf{f}}(x)] = \mathbf{S}_\lambda \mathbf{S}_\lambda^T \sigma^2$$
The square root of $\mathrm{diag}(\mathbf{V})$ gives the pointwise error estimates on either the training data or new data.
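A sketch of these pointwise errors in R: for a fixed df (hence fixed λ) the smoothing spline is a linear smoother, so $\mathbf{S}_\lambda$ can be built column by column by smoothing unit vectors (feasible only for small N). The data, names, and residual-variance estimate are illustrative:

```r
set.seed(1)
N <- 50
x <- sort(runif(N)); y <- sin(4 * x) + rnorm(N, sd = 0.3)

dof <- 6                                       # fixed df => fixed lambda => fixed S_lambda
S <- sapply(1:N, function(i) {
  e <- numeric(N); e[i] <- 1                   # i-th unit vector
  predict(smooth.spline(x, e, df = dof), x)$y  # i-th column of the smoother matrix
})

fhat   <- S %*% y                              # f_hat = S_lambda y
sigma2 <- sum((y - fhat)^2) / (N - dof)        # rough residual variance estimate
se     <- sqrt(diag(S %*% t(S)) * sigma2)      # sqrt(diag(V)), V = S S^T sigma^2

plot(x, y)
lines(x, fhat)
lines(x, fhat + 2 * se, lty = 2)
lines(x, fhat - 2 * se, lty = 2)
```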

18 R packages for Generalized Additive Models
gam{gam}: the same as the package implemented in S-PLUS
gam{mgcv}: a variant on the above
bruto{mda}: automatically selects between a smooth fit (cubic spline), a linear fit, and omitting the variable altogether

19 Summary
Kernel methods: improvements over nearest neighbours to reduce (or control) bias; local linear and quadratic regression.
Generalized Additive Models: defeat (cheat?) the curse of dimensionality by dividing the problem into p one-dimensional fitting problems; typically use kernel or spline smoothers; fitted with the iterative backfitting algorithm.
MARS (multivariate adaptive regression splines): piecewise-linear basis functions; if pairwise interactions between dimensions are prevented, it is an additive model.

