
1 Machine learning, pattern recognition and statistical data modelling
Lecture 6: Kernel methods and additive models. Coryn Bailer-Jones

2 Topics
Think globally, act locally: kernel methods
Generalized Additive Models (GAMs) for regression (classification next week)
Confidence intervals

3 Kernel methods In the first lecture we looked at kernel methods for density estimation, e.g. a Gaussian kernel of width 2h in d dimensions estimated from N data points:
$$\hat{f}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{d/2}} \exp\left(-\frac{\|x - x_n\|^2}{2h^2}\right)$$
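A minimal R sketch of this estimator, assuming a data matrix `X` (N x d) and a query point `x`; the function name `gauss_kde` and the test data are illustrative, not from the lecture:

```r
## Gaussian kernel density estimate, as in the formula above
gauss_kde <- function(x, X, h) {
  N <- nrow(X); d <- ncol(X)
  d2 <- rowSums(sweep(X, 2, x)^2)            # squared distances ||x - x_n||^2
  mean(exp(-d2 / (2 * h^2))) / (2 * pi * h^2)^(d / 2)
}

## Example: density of 200 standard-normal points in 2D, evaluated at the origin
set.seed(1)
X <- matrix(rnorm(400), ncol = 2)
gauss_kde(c(0, 0), X, h = 0.5)
```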

4 K-NN kernel density estimation
K = no. of neighbours, N = total no. of points, V = volume occupied by the K neighbours. To overcome the fixed kernel size, vary the search volume V until it contains K neighbours:
$$\hat{f}(x) = \frac{K}{NV}$$
© Bishop (1995)
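A hedged R sketch of this estimate: grow the search radius until it contains K neighbours and take V as the volume of that ball. The function name and example data are illustrative:

```r
knn_density <- function(x, X, K) {
  N <- nrow(X); d <- ncol(X)
  r <- sort(sqrt(rowSums(sweep(X, 2, x)^2)))[K]   # distance to the K-th neighbour
  V <- pi^(d / 2) / gamma(d / 2 + 1) * r^d        # volume of a d-ball of radius r
  K / (N * V)
}

set.seed(1)
X <- matrix(rnorm(400), ncol = 2)
knn_density(c(0, 0), X, K = 10)
```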

5 One-dimensional kernel smoothers: k-nn
π‘˜βˆ’nn smoother:πΈξ‚žπ‘Œβˆ£π‘‹=π‘₯ξ‚Ÿ= 𝑓 ξ‚‘ ξ‚žπ‘₯ξ‚Ÿ=Ave 𝑦 𝑖 ∣ π‘₯ 𝑖 ∈ 𝑁 π‘˜ ξ‚žπ‘₯ξ‚Ÿ 𝑁 π‘˜ ξ‚žπ‘₯ξ‚Ÿis the set ofπ‘˜points nearest toπ‘₯in (e.g.) squared distance Drawback is that the estimator is not smooth inπ‘₯. π‘˜=30 Β© Hastie, Tibshirani, Friedman (2001)

6 One-dimensional kernel smoothers: Epanechnikov
Instead give more distant points less weight, e.g. with the Nadaraya-Watson kernel-weighted average
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
using the Epanechnikov kernel
$$K_\lambda(x_0, x_i) = D\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} 1 - t^2 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
The kernel could be generalized to have a variable width:
$$K_\lambda(x_0, x_i) = D\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$
© Hastie, Tibshirani, Friedman (2001)
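A minimal R sketch of the Nadaraya-Watson estimate with an Epanechnikov kernel of width lambda (the kernel's normalizing constant cancels in the ratio, so it is omitted); names and data are illustrative:

```r
epanechnikov <- function(t) ifelse(abs(t) <= 1, 1 - t^2, 0)

nw_smooth <- function(x0, x, y, lambda = 0.2) {
  w <- epanechnikov(abs(x - x0) / lambda)   # K_lambda(x0, x_i)
  sum(w * y) / sum(w)
}

set.seed(1)
x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
xgrid <- seq(0, 1, length.out = 200)
plot(x, y)
lines(xgrid, sapply(xgrid, nw_smooth, x = x, y = y, lambda = 0.2))
```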

7 Kernel comparison
Epanechnikov: $D(t) = \begin{cases} 1 - t^2 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$
Tri-cube: $D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$
© Hastie, Tibshirani, Friedman (2001)

8 k-nn and Epanechnikov kernels
π‘˜=30 =0.2 Epanechnikov kernel has fixed width (bias approx. constant, variance not) k-nn has adaptive width (constant variance, bias varies as 1/density) free parameters: k or  Β© Hastie, Tibshirani, Friedman (2001)

9 Locally-weighted averages can be biased at boundaries
The kernel is asymmetric at the boundary. © Hastie, Tibshirani, Friedman (2001)

10 Local linear regression
Solve a linear least-squares problem in a local region to predict at a single point. Green points: the effective kernel. © Hastie, Tibshirani, Friedman (2001)
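A hedged R sketch of local linear regression at a single target point x0: weighted least squares with Epanechnikov weights, then evaluate the fitted line at x0. Names and data are illustrative; loess(..., degree = 1) in base R implements essentially the same idea with a tri-cube kernel and a span-based neighbourhood.

```r
epanechnikov <- function(t) ifelse(abs(t) <= 1, 1 - t^2, 0)

local_linear <- function(x0, x, y, lambda = 0.2) {
  w <- epanechnikov(abs(x - x0) / lambda)        # local weights
  fit <- lm(y ~ x, weights = w)                  # local weighted least squares
  as.numeric(predict(fit, newdata = data.frame(x = x0)))
}

set.seed(1)
x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
xgrid <- seq(0, 1, length.out = 200)
plot(x, y)
lines(xgrid, sapply(xgrid, local_linear, x = x, y = y, lambda = 0.2))
```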

11 Local quadratic regression
Β© Hastie, Tibshirani, Friedman (2001)

12 Bias-variance trade-off
Higher-order local fits reduce bias at the cost of increased variance, especially at the boundary (see the previous slide). © Hastie, Tibshirani, Friedman (2001)

13 Kernels in higher dimensions
Kernel smoothing and local regression generalize to higher dimensions...
...but the curse of dimensionality is not overcome: we cannot simultaneously retain localness (= low bias) and a sufficient sample size (= low variance) without increasing the total sample exponentially with dimension.
In general we need to make assumptions about the underlying data / true function and use structured regression or classification.

14 Generalized Additive Model
Could model a p-dimensional set of data using
$$Y(X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p) + \epsilon$$
The idea is to fit each one-dimensional function separately and then provide an algorithm to iteratively combine them. Do this by minimizing the penalized residual sum of squares
$$PRSS = \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$$
A variety of smoothers could be used for each $f_j(\cdot)$, with the corresponding penalty; here cubic splines are used. To make the solution unique $\alpha$ must be fixed, e.g. $\alpha = \frac{1}{N} \sum_{i=1}^{N} y_i$, in which case $\sum_{i=1}^{N} f_j(x_{ij}) = 0$ for all $j$.
Avoiding the curse: split the p-dimensional problem into p one-dimensional ones.

15 Backfitting algorithm for additive models
In principle this step is not required. $S_j$ is a smoothing spline fit, as a function of the $x_{ij}$, to the residuals, i.e. to what should be explained by $f_j$. © Hastie, Tibshirani, Friedman (2001)
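The algorithm itself appears as a figure from Hastie et al. (2001) that does not survive in this transcript; the following is a minimal R sketch of backfitting with cubic smoothing splines as the $S_j$, under illustrative names and tolerances:

```r
backfit <- function(X, y, df = 4, tol = 1e-6, maxit = 50) {
  N <- nrow(X); p <- ncol(X)
  alpha <- mean(y)                      # fix alpha = mean(y), so sum_i f_j(x_ij) = 0
  f <- matrix(0, N, p)                  # current estimates f_j(x_ij)
  for (it in 1:maxit) {
    f_old <- f
    for (j in 1:p) {
      r <- y - alpha - rowSums(f[, -j, drop = FALSE])   # partial residuals
      s <- smooth.spline(X[, j], r, df = df)            # S_j applied to the residuals
      f[, j] <- predict(s, X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])                   # re-centre each f_j
    }
    if (max(abs(f - f_old)) < tol) break                # stop when the f_j stabilize
  }
  list(alpha = alpha, f = f)
}

## e.g. on synthetic data with two additive components
set.seed(1)
X <- matrix(runif(200), ncol = 2)
y <- sin(3 * X[, 1]) + X[, 2]^2 + rnorm(100, sd = 0.1)
fit <- backfit(X, y)
```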

16 Generalized additive models on the rock data
Application of the gam{gam} package to the rock{MASS} data set. See the R scripts on the lecture web site.
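Those scripts are not reproduced here; the following is a hedged sketch of the kind of fit meant. The model formula is an assumption (following the rock example in Hastie et al. 2001), not the lecture's actual script:

```r
library(gam)                                   # the gam{gam} package
data(rock)                                     # columns: area, peri, shape, perm
fit <- gam(log(perm) ~ s(area) + s(peri) + s(shape), data = rock)
summary(fit)
par(mfrow = c(1, 3))
plot(fit, se = TRUE)                           # one fitted smooth term per predictor
```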

17 Confidence intervals with splines
The spline function estimate is
$$\hat{\mathbf{f}}(x) = \mathbf{H} \hat{\boldsymbol{\theta}} = \mathbf{H} \left( \mathbf{H}^T \mathbf{H} + \lambda \boldsymbol{\Omega}_N \right)^{-1} \mathbf{H}^T \mathbf{y} = \mathbf{S}_\lambda \mathbf{y}$$
The smoother matrix $\mathbf{S}_\lambda$ depends only on the $x_i$ and $\lambda$, but not on $\mathbf{y}$.
$$\mathbf{V} = \mathrm{Var}[\hat{\mathbf{f}}(x)] = \mathbf{S}_\lambda \mathbf{S}_\lambda^T \sigma^2$$
The square root of $\mathrm{diag}(\mathbf{V})$ gives the pointwise error estimates on either the training data or new data.
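A sketch of these pointwise errors in R: for a fixed df (hence fixed λ) the smoothing spline is a linear smoother, so $\mathbf{S}_\lambda$ can be built column by column by smoothing unit vectors (feasible only for small N). The data, names, and residual-variance estimate are illustrative:

```r
set.seed(1)
N <- 50
x <- sort(runif(N)); y <- sin(4 * x) + rnorm(N, sd = 0.3)

dof <- 6                                       # fixed df => fixed lambda => fixed S_lambda
S <- sapply(1:N, function(i) {
  e <- numeric(N); e[i] <- 1                   # i-th unit vector
  predict(smooth.spline(x, e, df = dof), x)$y  # i-th column of the smoother matrix
})

fhat   <- S %*% y                              # f_hat = S_lambda y
sigma2 <- sum((y - fhat)^2) / (N - dof)        # rough residual variance estimate
se     <- sqrt(diag(S %*% t(S)) * sigma2)      # sqrt(diag(V)), V = S S^T sigma^2

plot(x, y)
lines(x, fhat)
lines(x, fhat + 2 * se, lty = 2)
lines(x, fhat - 2 * se, lty = 2)
```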

18 R packages for Generalized Additive Models
gam{gam}: the same as the package implemented in S-PLUS
gam{mgcv}: a variant on the above
bruto{mda}: automatically selects between a smooth fit (cubic spline), a linear fit, and omitting the variable altogether

19 Summary
Kernel methods: improvements over nearest neighbours to reduce (or control) bias; local linear and quadratic regression.
Generalized Additive Models: defeat (cheat?) the curse of dimensionality by dividing the problem into p one-dimensional fitting problems; typically use kernel or spline smoothers; fitted with the iterative backfitting algorithm.
MARS (multivariate adaptive regression splines): piecewise-linear basis functions; if pairwise interactions between dimensions are prevented, it is an additive model.

