Machine learning, pattern recognition and statistical data modelling


Machine learning, pattern recognition and statistical data modelling
Lecture 12: The last lecture
Coryn Bailer-Jones

What is machine learning?
Data description and interpretation
- finding simpler relationships between variables (predictors and responses)
- finding natural groups or classes in data
- relating observables to physical quantities
Prediction
- capturing the relationship between “inputs” and “outputs” for a set of labelled data, with the goal of predicting outputs for unlabelled data (“pattern recognition”)
Learning from data
- dealing with noise
- coping with high dimensions (many potentially relevant variables)
- fitting models to data
- generalizing

Concepts: types of problems
Supervised learning
- predictors (x) and responses (y)
- infer P(y | x), perhaps modelled as f(x; w)
- discrete y is a classification problem; real-valued y is regression
Unsupervised learning
- no distinction between predictors and responses
- infer P(x), or things about it, e.g.
  - no. of modes/classes (mixture modelling, peak finding)
  - low-dimensional projections (descriptions) (PCA, SOM, MDS)
  - outlier detection (discovery)

Concepts: probabilities and Bayes
$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$, i.e. posterior probability of y given x = (likelihood of x given y) × (prior over y) / evidence
$p(y, x) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$
Two levels of inference (learning):
1. Prediction: x = predictors (input), y = response(s) (output)
2. Model fitting: x = data, y = model parameters
Conditioning on a model H:
$p(y \mid x, H) = \frac{p(x \mid y, H)\, p(y \mid H)}{p(x \mid H)}$, where the denominator is 'just' a normalization constant
$p(x \mid H) = \int p(x \mid y, H)\, p(y \mid H)\, dy$ is the evidence for the model
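
As a worked illustration of Bayes' rule for a discrete y (the prior and likelihood values below are made up, not from the lecture), the posterior follows directly from the formula above:

    # Bayes' rule p(y|x) = p(x|y) p(y) / p(x) for a binary y, with illustrative numbers
    p_y = {0: 0.7, 1: 0.3}                                          # prior p(y)
    p_x_given_y = {0: 0.1, 1: 0.6}                                  # likelihood p(x|y) for one observed x
    p_x = sum(p_x_given_y[y] * p_y[y] for y in p_y)                 # evidence p(x)
    p_y_given_x = {y: p_x_given_y[y] * p_y[y] / p_x for y in p_y}   # posterior p(y|x)
    print(p_y_given_x)   # {0: 0.28, 1: 0.72}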

Concepts: solution procedure
- need some kind of expression for P(y | x) or P(x), e.g. f(x; w) = P(y | x)
- parametric, semi-parametric, or non-parametric, e.g. for density estimation and nonlinear regression:
  - parametric: Gaussian distribution for P(x), spline for f(x)
  - semi-parametric: sum of several Gaussians, additive model, local regression
  - non-parametric: k-nn, kernel estimate
Parametric models: fit to data
- need to infer the adjustable parameters, w, from the data
- generally minimize a loss function on a labelled data set w.r.t. w
- compare different models

Concepts: objective function
Different functions are suitable for continuous (regression) or discrete (classification) problems. Let $f(\mathbf{x}_i)$ be the (real-valued) model prediction for target $y_i$, and let $r_i = y_i - f(\mathbf{x}_i)$:
- $\sum_i (y_i - f(\mathbf{x}_i))^2$: residual sum of squares (RSS)
- $\sum_i |y_i - f(\mathbf{x}_i)|^L$: L-norm
- $\sum_i \exp(-y_i f(\mathbf{x}_i))$: exponential
- $\sum_i v_i$ with $v_i = 0$ if $|r_i| \le \epsilon$ and $v_i = |r_i| - \epsilon$ otherwise: $\epsilon$-insensitive
- $\sum_i v_i$ with $v_i = r_i^2/2$ if $|r_i| \le c$ and $v_i = c|r_i| - c^2/2$ otherwise: Huber
For discrete outputs (e.g. via argmax) we have 0-1 loss and cross-entropy.
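
A minimal numpy sketch of these loss functions (not from the lecture; the epsilon and c defaults are arbitrary):

    import numpy as np

    def rss(y, f):
        return np.sum((y - f)**2)                          # residual sum of squares

    def l_norm(y, f, L=1):
        return np.sum(np.abs(y - f)**L)                    # L-norm (L=1 gives absolute error)

    def exponential(y, f):
        return np.sum(np.exp(-y * f))                      # exponential loss (y in {-1, +1})

    def eps_insensitive(y, f, eps=0.1):
        r = np.abs(y - f)
        return np.sum(np.where(r <= eps, 0.0, r - eps))    # zero inside the epsilon tube

    def huber(y, f, c=1.0):
        r = np.abs(y - f)
        return np.sum(np.where(r <= c, 0.5 * r**2, c * r - 0.5 * c**2))  # quadratic centre, linear tails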

Loss functions

Models: linear modelling (linear least squares)
Data: $\{\mathbf{x}_i, y_i\}$ with $\mathbf{x} = (x_1, x_2, ..., x_j, ..., x_p)$
Model: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$
Least squares solution:
$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{i,j}\,\beta_j \Big)^2$
In matrix form this is
$RSS(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$
Minimizing w.r.t. $\boldsymbol{\beta}$ gives the solution
$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
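
As an illustrative sketch (not from the lecture), the least-squares solution can be computed on synthetic data; the coefficients and noise level below are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = rng.normal(size=(N, p))
    beta_true = np.array([1.5, -2.0, 0.5])            # made-up "true" coefficients
    y = X @ beta_true + 0.1 * rng.normal(size=N)      # responses with Gaussian noise

    # beta_hat = (X^T X)^{-1} X^T y; lstsq solves this in a numerically stable way
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)    # close to beta_true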

Concepts: maximum likelihood (as a loss function)
Let $f(\mathbf{x}_i \mid \mathbf{w})$ be the function estimate for $y_i$. The probability of getting these for all $N$ training points is
$p(\mathrm{Data} \mid \mathbf{w}) = \prod_{i=1}^{N} p(f(\mathbf{x}_i) \mid \mathbf{w}) \equiv L$,
the likelihood. In practice we minimize the negative log likelihood
$E = -\ln L = -\sum_{i=1}^{N} \ln p(f(\mathbf{x}_i) \mid \mathbf{w})$
If we assume that the model predictions follow an i.i.d. Gaussian distribution about the true values, then
$p(f(\mathbf{x}_i) \mid \mathbf{w}) = \Big( \frac{1}{2\pi\sigma^2} \Big)^{1/2} \exp\Big( -\frac{(f(\mathbf{x}_i) - y_i)^2}{2\sigma^2} \Big)$
$-\ln L(\mathbf{w}) = \frac{N}{2}\ln 2\pi + N\ln\sigma + \frac{1}{2\sigma^2}\sum_{i=1}^{N} (f(\mathbf{x}_i \mid \mathbf{w}) - y_i)^2$
i.e. ML with constant (unknown) noise corresponds to minimizing RSS w.r.t. the model parameters $\mathbf{w}$.
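
A small sketch of the Gaussian negative log likelihood above (assuming a fixed sigma); for any fixed sigma only the RSS term depends on the predictions, which is the point of the slide:

    import numpy as np

    def neg_log_likelihood(y, f, sigma=1.0):
        N = len(y)
        rss = np.sum((f - y)**2)
        return 0.5 * N * np.log(2 * np.pi) + N * np.log(sigma) + rss / (2 * sigma**2)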

Concepts: generalization and regularization
- given a specific set of data, we nonetheless want a general solution
- therefore, we must make some kind of assumption(s):
  - smoothness in functions
  - priors on model parameters (or functions, or predictions)
  - restricting the model space
- regularization involves a free parameter, although this can also be inferred from the data

Models: penalized linear modelling (ridge regression)
Data: $\{\mathbf{x}_i, y_i\}$ with $\mathbf{x} = (x_1, x_2, ..., x_j, ..., x_p)$
Model: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$
Least squares solution:
$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \Big[ \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{i,j}\,\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big]$ where $\lambda \ge 0$
In matrix form this is
$RSS(\boldsymbol{\beta}, \lambda) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\,\boldsymbol{\beta}^T\boldsymbol{\beta}$
Minimizing w.r.t. $\boldsymbol{\beta}$ gives the solution
$\hat{\boldsymbol{\beta}}_{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, where $\mathbf{I}$ is the $p \times p$ identity matrix
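
A minimal sketch of the closed-form ridge solution above (the lambda value is an arbitrary example); it can be applied to the X, y of the earlier least-squares sketch:

    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        # beta_ridge = (X^T X + lambda I)^{-1} X^T y
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)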

Models: ridge regression (as regularization)
$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{A}\mathbf{y} = \sum_{j=1}^{p} \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda}\, \mathbf{u}_j^T\mathbf{y}$
The eigenvalues $d_j^2$ measure the variance of the data projected onto the principal directions $\mathbf{v}_j$
$df(\lambda) = \mathrm{tr}(\mathbf{A}) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}$
- the regularization projects the data onto the PCs and downweights (“shrinks”) them inversely proportionally to their variance
- limits the model space
- one free parameter, $\lambda$: large $\lambda$ implies a large degree of regularization, and $df(\lambda)$ is small
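
A sketch of the effective degrees of freedom df(lambda) computed from the singular values of X (the lambda values in the comment are illustrative):

    import numpy as np

    def ridge_df(X, lam):
        d = np.linalg.svd(X, compute_uv=False)    # singular values d_j of X
        return np.sum(d**2 / (d**2 + lam))

    # df(lambda) falls from p towards 0 as lambda grows, e.g.
    # for lam in (0.0, 1.0, 10.0, 100.0): print(lam, ridge_df(X, lam))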

Models: ridge regression, $\lambda$ vs. $df(\lambda)$ (figure © Hastie, Tibshirani, Friedman 2001)

Models: splines (figure © Hastie, Tibshirani, Friedman 2001)

Concepts: regularization (in splines)
- avoid knot selection by selecting all points as knots
- avoid overfitting via regularization, that is, minimise a penalized sum-of-squares
$RSS(f, \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int f''(t)^2\, dt$
- $f$ is the fitting function with continuous second derivatives
- $\lambda = 0$ ⇒ $f$ is any function which interpolates the data (could be wild)
- $\lambda = \infty$ ⇒ straight-line least squares fit (no second derivative tolerated)

Concepts: regularization (in smoothing splines)
The solution is a cubic spline with knots at each of the $x_i$, i.e.
$f(x) = \sum_{j=1}^{N} h_j(x)\,\theta_j$
The residual sum of squares (error) to be minimized is
$RSS(\theta, \lambda) = (\mathbf{y} - \mathbf{H}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{H}\boldsymbol{\theta}) + \lambda\,\boldsymbol{\theta}^T\boldsymbol{\Omega}_N\boldsymbol{\theta}$
where $\mathbf{H}_{ij} = h_j(x_i)$ and $(\boldsymbol{\Omega}_N)_{jk} = \int h_j''(t)\, h_k''(t)\, dt$
The solution is
$\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{H} + \lambda\boldsymbol{\Omega}_N)^{-1}\mathbf{H}^T\mathbf{y}$
Compare to ridge regression: $\hat{\boldsymbol{\beta}}_{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, where $\mathbf{I}$ is the $p \times p$ identity matrix
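
An illustrative smoothing-spline fit using scipy (not from the lecture). scipy's UnivariateSpline is controlled by a smoothing factor s rather than the lambda above, but it plays the same role: s = 0 interpolates the data, large s gives a smoother, lower-curvature fit. The test function and noise level are made up:

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(1)
    x = np.linspace(0.0, 1.0, 50)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)   # noisy samples of a smooth function

    spline = UnivariateSpline(x, y, k=3, s=1.0)   # cubic smoothing spline
    y_smooth = spline(x)                          # evaluate the fitted spline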

Concepts: regularization (in smoothing splines)

Concepts: regularization in ANNs and SVMs
In a feedforward neural network, regularization can be done with weight decay:
$E = \frac{1}{2} \sum_{n}^{N} \sum_{k} (O_{k,n} - T_{k,n})^2 + \frac{\lambda}{2} \sum w^2$
In SVMs the regularization comes in the initial formulation (margin maximization), with the error (loss) function as the constraint:
$E = \|\mathbf{w}\|^2 + C \sum_{i}^{n} \xi_i$  s.t.  $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \ge 0$, $\xi_i \ge 0$
The regularization parameter is $1/C$.
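
A sketch of one gradient step with weight decay for a linear output unit (the learning rate and decay strength below are arbitrary); the decay term simply adds lambda*w to the error gradient:

    import numpy as np

    def weight_decay_step(w, X, T, eta=0.01, lam=0.1):
        O = X @ w                            # network output (a linear unit, for simplicity)
        grad = X.T @ (O - T) + lam * w       # gradient of 0.5*sum((O-T)^2) + 0.5*lam*sum(w^2)
        return w - eta * grad                # gradient-descent update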

Concepts: model comparison and selection
Cross validation
- n-fold, leave-one-out, generalized
- compare and select models using just the training set
- accounts for model complexity plus bias from the finite-sized training set
Akaike and Bayesian Information Criteria
- $AIC = -2\ln L + 2k$
- $BIC = -2\ln L + k\ln N$
- $k$ is the number of parameters; $N$ is the number of training vectors
- the smallest BIC or AIC corresponds to the optimal model
Bayesian evidence for model (hypothesis) H, P(D | H)
- the probability that the data arise from the model, marginalized over all model parameters
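
As a small sketch, the AIC and BIC defined above can be computed directly from a model's maximized log likelihood:

    import numpy as np

    def aic(log_L, k):
        return -2.0 * log_L + 2.0 * k           # Akaike Information Criterion

    def bic(log_L, k, N):
        return -2.0 * log_L + k * np.log(N)     # Bayesian Information Criterion

    # The candidate model with the smallest AIC or BIC is preferred.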

Concepts: Occam's razor and Bayesian evidence
D = data, H = hypothesis (model), w = model parameters
$p(\mathbf{w} \mid D, H_i) = \frac{p(D \mid \mathbf{w}, H_i)\, p(\mathbf{w} \mid H_i)}{p(D \mid H_i)}$, i.e. Posterior = (Likelihood × Prior) / Evidence
- a simpler model, H1, predicts less of the data space
- the evidence naturally penalizes more complex models
(after MacKay 1992)

Concepts: curse of dimensionality
- to retain density, the number of vectors must grow exponentially with the number of dimensions (a quick illustration follows below)
- generally we cannot do this
Overcome the curse in various ways:
- make assumptions: structured regression
- limit the model space
- generalized additive models
- basis functions and kernels
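
A quick numeric illustration of that exponential growth (the 10-points-per-axis figure is arbitrary):

    # Keeping 10 sample points per axis at constant density needs 10**d points in d dimensions.
    for d in (1, 2, 5, 10):
        print(d, 10**d)    # 10, 100, 100000, 10000000000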

Models: basis expansions
p-dimensional data: $\mathbf{X} = (X_1, X_2, ..., X_j, ..., X_p)$
Basis expansion: $f(\mathbf{X}) = \sum_{m=1}^{M} \beta_m h_m(\mathbf{X})$
- linear model: $h_m(X) = X_m$, $m = 1, ..., p$
- quadratic and higher-order terms: $h_m(X) = X_j X_k$
- other transformations: $h_m(X) = \log(X_j)$, $\sqrt{X_j}$
- split the range with an indicator function: $h_m(X) = I(L_m \le X_j \le U_m)$
- generalized additive models: $h_m(X) = h_m(X_m)$, $m = 1, ..., p$
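
A minimal sketch of building a design matrix from a few of these basis functions for a single predictor x (the interval [L, U] for the indicator is an arbitrary example):

    import numpy as np

    def basis_expand(x, L=0.0, U=1.0):
        return np.column_stack([
            x,                                    # linear term h(x) = x
            x**2,                                 # quadratic term
            np.log(np.abs(x) + 1e-9),             # log transform (offset avoids log(0))
            ((x >= L) & (x <= U)).astype(float),  # indicator I(L <= x <= U)
        ])

    # The expanded matrix can then be passed to the linear least-squares machinery above.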

Models: MLP neural network basis functions
$y_k = \sum_{j=1}^{J} w_{j,k}\, H_j$
$H_j = g(v_j)$, where $v_j = \sum_{i=1}^{p} w_{i,j}\, x_i$
$g(v_j) = \frac{1}{1 + e^{-v_j}}$
J sigmoidal basis functions
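
A sketch of this forward pass with one hidden layer of J sigmoidal units and a linear output (the weight shapes are assumptions for illustration):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def mlp_forward(x, W_in, w_out):
        # x: (p,) inputs, W_in: (p, J) input-to-hidden weights, w_out: (J,) hidden-to-output weights
        H = sigmoid(x @ W_in)    # hidden activations H_j = g(sum_i w_{i,j} x_i)
        return H @ w_out         # output y = sum_j w_{j,k} H_j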

Models: radial basis function neural networks
$y_k(\mathbf{x}) = w_{k0} + \sum_{j=1}^{J} w_{j,k}\, \phi_j(\mathbf{x})$, where
$\phi_j(\mathbf{x}) = \exp\Big( -\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2} \Big)$
are the radial basis functions.
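
A sketch of this RBF prediction; in practice the centres mu, widths sigma and weights w are learned, here they are placeholders passed in by the caller:

    import numpy as np

    def rbf_predict(x, mu, sigma, w, w0=0.0):
        # x: (p,), mu: (J, p) centres, sigma: (J,) widths, w: (J,) output weights
        phi = np.exp(-np.sum((x - mu)**2, axis=1) / (2.0 * sigma**2))  # radial basis functions
        return w0 + phi @ w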

Concepts: optimization
With gradient information
- gradient descent
- add second-derivative (Hessian) information: Newton, quasi-Newton, Levenberg-Marquardt, conjugate gradients
- pure gradient methods get stuck in local minima; remedies include
  - random restart
  - committee/ensemble of models
  - momentum terms (non-gradient info.)
Without gradient information
- expectation-maximization (EM) algorithm
- simulated annealing
- genetic algorithms
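
A minimal sketch of gradient descent with a momentum term (the step size and momentum values are arbitrary):

    import numpy as np

    def gradient_descent(grad, w0, eta=0.01, alpha=0.9, n_steps=1000):
        w = np.asarray(w0, dtype=float)
        v = np.zeros_like(w)
        for _ in range(n_steps):
            v = alpha * v - eta * grad(w)   # momentum accumulates past gradients
            w = w + v
        return w

    # Example: minimize f(w) = sum(w^2), whose gradient is 2w; this converges to the zero vector.
    # gradient_descent(lambda w: 2 * w, [3.0, -2.0])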

Concepts: marginalization (Bayes again)
We are often not interested in the actual model parameters, $\mathbf{w}$; these are just a means to an end. That is, we are interested in $P(y \mid \mathbf{x})$, whereas model inference gives $P(y \mid \mathbf{x}, \mathbf{w})$.
A Bayesian marginalizes over parameters of no interest:
$P(y \mid \mathbf{x}) = \int P(y \mid \mathbf{x}, \mathbf{w})\, P(\mathbf{w} \mid \mathbf{x})\, d\mathbf{w}$
$P(\mathbf{w} \mid \mathbf{x})$ is the prior over the model weights (conditioned on the input data, but we could assume independence).
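
A sketch of approximating that integral by Monte Carlo, averaging predictions over samples of the weights; predict_prob and weight_samples are hypothetical placeholders supplied by the caller:

    import numpy as np

    def marginal_predict(x, weight_samples, predict_prob):
        # P(y|x) ~= (1/S) * sum_s P(y|x, w_s), with w_s drawn from P(w|x)
        return np.mean([predict_prob(x, w) for w in weight_samples], axis=0)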