Machine learning, pattern recognition and statistical data modelling

Presentation transcript:

Machine learning, pattern recognition and statistical data modelling. Lecture 6: Kernel methods and additive models. Coryn Bailer-Jones

Topics

- Think globally, act locally: kernel methods
- Generalized Additive Models (GAMs) for regression (classification next week)
- Confidence intervals

Kernel methods

In the first lecture we looked at kernel methods for density estimation, e.g. a Gaussian kernel of width 2h in d dimensions estimated from N data points:

$$\hat{f}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{d/2}} \exp\left( -\frac{\| x - x_n \|^2}{2 h^2} \right)$$
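As an illustration (my own sketch, not part of the lecture materials), this estimator can be written directly in R; the function and variable names below are hypothetical.

# Sketch: Gaussian kernel density estimate in d dimensions at a query point x
gauss_kde <- function(x, X, h) {
  # X: N x d matrix of data points, x: query vector of length d, h: kernel width
  d <- ncol(X)
  sq_dist <- rowSums(sweep(X, 2, x)^2)              # ||x - x_n||^2 for each n
  mean(exp(-sq_dist / (2 * h^2)) / (2 * pi * h^2)^(d / 2))
}

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)                   # toy 2-D data
gauss_kde(c(0, 0), X, h = 0.5)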

K-NN kernel density estimation

To overcome the fixed kernel size, vary the search volume V until it contains K neighbours:

$$\hat{f}(x) = \frac{K}{NV}$$

where K = no. of neighbours, N = total no. of points, and V = volume occupied by the K neighbours. © Bishop (1995)
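A minimal one-dimensional sketch (again my own, not from the lecture): the "volume" around x is the length of the smallest interval containing the K nearest points.

knn_density <- function(x, data, K) {
  dists <- sort(abs(data - x))      # distances to all points, smallest first
  V <- 2 * dists[K]                 # interval just reaching the K-th neighbour
  K / (length(data) * V)
}

set.seed(1)
data <- rnorm(500)
knn_density(0, data, K = 30)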

One-dimensional kernel smoothers: k-nn

The k-nn smoother is

$$\hat{E}(Y \mid X = x) = \hat{f}(x) = \mathrm{Ave}\left( y_i \mid x_i \in N_k(x) \right)$$

where $N_k(x)$ is the set of the k points nearest to x in (e.g.) squared distance. The drawback is that the estimator is not smooth in x. (Figure: k = 30.) © Hastie, Tibshirani, Friedman (2001)
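A hypothetical R sketch of this smoother:

knn_smooth <- function(x0, x, y, k = 30) {
  nearest <- order((x - x0)^2)[1:k]   # indices of the k nearest training points
  mean(y[nearest])                    # average their responses
}

set.seed(1)
x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
sapply(c(0.1, 0.5, 0.9), knn_smooth, x = x, y = y, k = 30)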

One-dimensional kernel smoothers: Epanechnikov

Instead give more distant points less weight, e.g. with the Nadaraya-Watson kernel-weighted average

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$

using the Epanechnikov kernel

$$K_\lambda(x_0, x_i) = D\left( \frac{|x_i - x_0|}{\lambda} \right), \qquad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

The kernel could be generalized to have a variable width,

$$K_\lambda(x_0, x_i) = D\left( \frac{|x_i - x_0|}{h_\lambda(x_0)} \right).$$

© Hastie, Tibshirani, Friedman (2001)
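A sketch of the Nadaraya-Watson average with a fixed-width Epanechnikov kernel (illustrative code, not from the lecture scripts):

epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)

nw_smooth <- function(x0, x, y, lambda = 0.2) {
  w <- epanechnikov(abs(x - x0) / lambda)   # kernel weight of each training point
  sum(w * y) / sum(w)
}

set.seed(1)
x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
x_grid <- seq(0, 1, by = 0.01)
f_hat <- sapply(x_grid, nw_smooth, x = x, y = y, lambda = 0.2)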

Kernel comparison

Epanechnikov: $D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$

Tri-cube: $D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$

© Hastie, Tibshirani, Friedman (2001)
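To see how the two shapes differ, one could simply plot them (illustrative only):

epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)
tricube      <- function(t) ifelse(abs(t) <= 1, (1 - abs(t)^3)^3, 0)
t <- seq(-1.5, 1.5, by = 0.01)
plot(t, epanechnikov(t), type = "l", ylab = "D(t)")   # solid: Epanechnikov
lines(t, tricube(t), lty = 2)                         # dashed: tri-cube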

k-nn and Epanechnikov kernels (figure: k = 30, λ = 0.2)

- The Epanechnikov kernel has fixed width (bias approximately constant, variance not).
- k-nn has adaptive width (constant variance, bias varies as 1/density).
- Free parameters: k or λ.

© Hastie, Tibshirani, Friedman (2001)

Locally-weighted averages can be biased at boundaries, because the kernel is asymmetric at the boundary. © Hastie, Tibshirani, Friedman (2001)

Local linear regression: solve a weighted linear least-squares problem in a local region to predict at a single point. (Figure: green points show the effective kernel.) © Hastie, Tibshirani, Friedman (2001)
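A sketch of local linear regression at a single point, using lm with kernel weights (my own illustration, not the lecture code):

epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)

local_linear <- function(x0, x, y, lambda = 0.2) {
  w <- epanechnikov(abs(x - x0) / lambda)          # kernel weights around x0
  fit <- lm(y ~ x, weights = w)                    # weighted least squares
  predict(fit, newdata = data.frame(x = x0))
}

set.seed(1)
x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
local_linear(0, x, y)    # less boundary bias than the plain kernel average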

Local quadratic regression © Hastie, Tibshirani, Friedman (2001)

Bias-variance trade-off: higher-order local fits reduce bias at the cost of increased variance, especially at the boundary (see previous page). © Hastie, Tibshirani, Friedman (2001)

Kernels in higher dimensions

- Kernel smoothing and local regression generalize to higher dimensions...
- ...but the curse of dimensionality is not overcome: we cannot simultaneously retain localness (= low bias) and a sufficient sample size (= low variance) without increasing the total sample exponentially with dimension.
- In general we need to make assumptions about the underlying data/true function and use structured regression/classification.

Generalized Additive Model

Avoiding the curse: split the p-dimensional problem into p one-dimensional ones. We could model a p-dimensional set of data using

$$Y(X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p)$$

The idea is to fit each one-dimensional function separately and then provide an algorithm to combine them iteratively. Do this by minimizing the penalized residual sum of squares

$$PRSS = \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$$

A variety of smoothers could be used for each $f_j$, with the corresponding penalty; here we use cubic splines. To make the solution unique we must fix, e.g., $\hat{\alpha} = \frac{1}{N} \sum_{i=1}^{N} y_i$, in which case $\sum_{i=1}^{N} f_j(x_{ij}) = 0 \;\; \forall j$.

Backfitting algorithm for additive models

$S_j$ is a smoothing-spline fit, as a function of the $x_{ij}$, to the residuals, i.e. to what should be explained by $f_j$. (The re-centring of each $\hat{f}_j$ is in principle not required.) © Hastie, Tibshirani, Friedman (2001)
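A bare-bones sketch of backfitting with cubic smoothing splines (illustrative, not the lecture's implementation; the data columns are those of the rock example below):

backfit <- function(X, y, n_iter = 20, df = 4) {
  N <- nrow(X); p <- ncol(X)
  alpha <- mean(y)
  f <- matrix(0, N, p)                        # current estimates f_j(x_ij)
  for (iter in 1:n_iter) {
    for (j in 1:p) {
      partial_resid <- y - alpha - rowSums(f[, -j, drop = FALSE])
      fit <- smooth.spline(X[, j], partial_resid, df = df)
      f[, j] <- predict(fit, X[, j])$y        # S_j applied to the residuals
      f[, j] <- f[, j] - mean(f[, j])         # re-centre each fitted function
    }
  }
  list(alpha = alpha, f = f, fitted = alpha + rowSums(f))
}

fit <- backfit(as.matrix(rock[, c("area", "peri", "shape")]), rock$perm)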

Generalized additive models on the rock data Application of the gam{gam} package on the rock{MASS} data set. See the R scripts on the lecture web site
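The actual scripts are on the lecture web site; a fit along these lines might look as follows (the log transform of perm is my assumption, a common choice for these data, and not necessarily what the lecture script does):

library(gam)                                     # CRAN package 'gam'
data(rock)                                       # area, peri, shape, perm
fit <- gam(log(perm) ~ s(area) + s(peri) + s(shape), data = rock)
summary(fit)
plot(fit, se = TRUE)                             # each f_j with pointwise SE bands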

Confidence intervals with splines

The spline function estimate is

$$\hat{\mathbf{f}}(x) = \mathbf{H}\hat{\theta} = \mathbf{H}\left( \mathbf{H}^T \mathbf{H} + \lambda \mathbf{\Omega}_N \right)^{-1} \mathbf{H}^T \mathbf{y} = \mathbf{S}_\lambda \mathbf{y}$$

The smoother matrix $\mathbf{S}_\lambda$ depends only on the $x_i$ and $\lambda$, but not on $\mathbf{y}$. Then

$$\mathbf{V} = \mathrm{Var}\left( \hat{\mathbf{f}}(x) \right) = \mathbf{S}_\lambda \mathbf{S}_\lambda^T$$

and $\mathrm{diag}(\mathbf{V})$ gives the pointwise error estimates on either the training data or new data.
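One way to obtain $\mathbf{S}_\lambda$ in practice (a sketch, not from the lecture) is to smooth each unit vector with smooth.spline at a fixed effective degrees of freedom, exploiting the fact that the smoother is linear in y:

set.seed(1)
x <- sort(runif(50)); y <- sin(4 * x) + rnorm(50, sd = 0.3)
N <- length(x); df <- 6

S <- matrix(0, N, N)
for (i in 1:N) {
  e <- numeric(N); e[i] <- 1
  S[, i] <- smooth.spline(x, e, df = df)$y   # i-th column of the smoother matrix
}

f_hat <- S %*% y
sigma2 <- sum((y - f_hat)^2) / (N - df)      # rough estimate of the error variance
se <- sqrt(diag(S %*% t(S)) * sigma2)        # pointwise standard errors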

R packages for Generalized Additive Models

- gam{gam}: the same as the package implemented in S-PLUS
- gam{mgcv}: a variant on the above
- bruto{mda}: automatically selects between a smooth fit (cubic spline), a linear fit, and omitting the variable altogether
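For comparison, a sketch of the same rock fit with gam{mgcv} (again assuming a log transform of perm):

library(mgcv)    # note: detach the 'gam' package first to avoid masked gam()/s()
fit_mgcv <- gam(log(perm) ~ s(area) + s(peri) + s(shape), data = rock)
summary(fit_mgcv)
plot(fit_mgcv, pages = 1, se = TRUE)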

Summary

- Kernel methods: improvements over nearest neighbours to reduce (or control) bias; local linear and quadratic regression.
- Generalized Additive Models: defeat (cheat?) the curse of dimensionality by dividing the problem into p one-dimensional fitting problems; typically use kernel or spline smoothers; fit with the iterative backfitting algorithm.
- MARS (multivariate adaptive regression splines): piecewise linear basis functions; if pairwise interactions of dimensions are prevented, it is an additive model.