INTRODUCTION TO Machine Learning 3rd Edition

Presentation transcript:

Lecture Slides for INTRODUCTION TO Machine Learning 3rd Edition, ETHEM ALPAYDIN © The MIT Press, 2014. alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml3e Modified by Prof. Carolina Ruiz for CS539 Machine Learning at WPI.

CHAPTER 4: Parametric Methods

Parametric Estimation X = { x^t }_t where x^t ~ p(x), with "~" read as "distributed as". Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X; e.g., N(μ, σ²) where θ = {μ, σ²}. Remember that a probability density function is defined as p(x₀) ≡ lim_{ε→0} P(x₀ ≤ x < x₀ + ε) / ε.

Maximum Likelihood Estimation Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ). Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ). Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X).

Examples: Bernoulli/Multinomial Bernoulli: two states, failure/success, x ∈ {0,1}: P(x) = p_o^x (1 – p_o)^(1–x), L(p_o|X) = log ∏_t p_o^{x^t} (1 – p_o)^{1–x^t}, MLE: p_o = ∑_t x^t / N. Multinomial: K > 2 states, x_i ∈ {0,1}: P(x_1, x_2, ..., x_K) = ∏_i p_i^{x_i}, L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^{x_i^t}, MLE: p_i = ∑_t x_i^t / N.
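Both MLEs above are just sample proportions. A minimal numpy sketch (the sample arrays below are made up purely for illustration):

```python
import numpy as np

# Bernoulli: x^t in {0, 1}; the MLE of p_o is the sample proportion of 1s.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])    # hypothetical sample
p_o = x.mean()                            # sum_t x^t / N
print("Bernoulli MLE p_o =", p_o)

# Multinomial: each x^t is a one-hot vector over K states;
# the MLE of p_i is the fraction of samples falling in state i.
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])                 # hypothetical sample, K = 3
p = X.sum(axis=0) / X.shape[0]            # sum_t x_i^t / N for each i
print("Multinomial MLE p =", p)
```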

Gaussian (Normal) Distribution p(x) = N(μ, σ²). MLE for μ and σ²: m = ∑_t x^t / N and s² = ∑_t (x^t – m)² / N.
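A sketch of these two estimates in numpy; note that the MLE of σ² divides by N, not N–1, so it is the biased sample variance (the data here is simulated, the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # simulated sample from N(2, 1.5^2)

m = x.mean()                      # MLE of the mean:     m   = sum_t x^t / N
s2 = np.mean((x - m) ** 2)        # MLE of the variance: s^2 = sum_t (x^t - m)^2 / N
print(f"m = {m:.3f}, s^2 = {s2:.3f}")   # compare with the true values 2 and 2.25
```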

Bias and Variance Unknown parameter θ; estimator d_i = d(X_i) on sample X_i. Bias: b_θ(d) = E[d] – θ. Variance: E[(d – E[d])²]. Mean square error: r(d, θ) = E[(d – θ)²] = (E[d] – θ)² + E[(d – E[d])²] = Bias² + Variance.
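The decomposition can be checked numerically by drawing many samples and measuring how an estimator behaves on them. A Monte Carlo sketch comparing the sample mean with a deliberately shrunken estimator; all numbers are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, N, M = 5.0, 2.0, 25, 10_000   # true parameter, noise std, sample size, repetitions

samples = rng.normal(theta, sigma, size=(M, N))
d_mean = samples.mean(axis=1)        # estimator 1: sample mean (unbiased)
d_shrunk = 0.5 * d_mean              # estimator 2: shrunken mean (biased, lower variance)

for name, d in [("sample mean", d_mean), ("shrunken mean", d_shrunk)]:
    bias = d.mean() - theta
    var = d.var()
    mse = np.mean((d - theta) ** 2)
    # mse should equal bias**2 + var up to Monte Carlo error
    print(f"{name:14s} bias={bias:+.3f} var={var:.3f} mse={mse:.3f} bias^2+var={bias**2 + var:.3f}")
```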

Bayes’ Estimator Treat θ as a random var with prior p (θ) Bayes’ rule: p (θ|X) = p(X|θ) p(θ) / p(X) Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ Maximum a Posteriori (MAP): θMAP = argmaxθ p(θ|X) Maximum Likelihood (ML): θML = argmaxθ p(X|θ) Bayes’: θBayes’ = E[θ|X] = ∫ θ p(θ|X) dθ

Comparing ML, MAP, and Bayes’ Let Θ be the set of all possible solutions θ’s Maximum a Posteriori (MAP): Selects the θ that satisfies θMAP = argmaxθ p(θ|X) Maximum Likelihood (ML): Selects the θ that satisfies θML = argmaxθ p(X|θ) Bayes’: Constructs the “weighted average” over all θ’s in Θ: θBayes’ = E[θ|X] = ∫ θ p(θ|X) dθ Note that if the θ’s in Θ are uniformly distributed then θMAP=θML since θMAP = argmaxθ p(θ|X) = argmaxθ p(X|θ) p(θ)/p(X) = argmaxθ p(X|θ) = θML

Bayes’ Estimator: Example x^t ~ N(θ, σ_o²) and θ ~ N(μ, σ²). θ_ML = m (the sample mean), and θ_MAP = θ_Bayes’ = E[θ|X] = [ (N/σ_o²) / (N/σ_o² + 1/σ²) ] m + [ (1/σ²) / (N/σ_o² + 1/σ²) ] μ, i.e., a weighted average of the sample mean m and the prior mean μ; as N grows, the weight shifts toward m.
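Since the posterior mean has this closed form, the ML and Bayes' estimates can be compared directly. A sketch with made-up prior and data values (all parameter choices are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 0.0, 1.0                 # prior:  theta ~ N(mu, sigma^2)
sigma_o, theta_true = 2.0, 3.0
N = 20
x = rng.normal(theta_true, sigma_o, size=N)   # sample: x^t ~ N(theta, sigma_o^2)

m = x.mean()                                  # theta_ML = m
w = (N / sigma_o**2) / (N / sigma_o**2 + 1 / sigma**2)
theta_bayes = w * m + (1 - w) * mu            # posterior mean = theta_MAP = theta_Bayes
print(f"theta_ML = {m:.3f}, theta_MAP = theta_Bayes = {theta_bayes:.3f}")
# As N grows, w -> 1 and the Bayes estimate approaches the ML estimate.
```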

Parametric Classification

Given the sample X = {x^t, r^t}, the ML estimates are the class priors P̂(C_i) = ∑_t r_i^t / N, the class means m_i = ∑_t x^t r_i^t / ∑_t r_i^t, and the class variances s_i² = ∑_t (x^t – m_i)² r_i^t / ∑_t r_i^t. Discriminant: g_i(x) = log p̂(x|C_i) + log P̂(C_i) = –(1/2) log 2π – log s_i – (x – m_i)² / (2 s_i²) + log P̂(C_i).
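A sketch of the resulting classifier for one-dimensional inputs: estimate each class's prior, mean, and variance by ML, then pick the class with the largest discriminant. The two-class data below is simulated and the class parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated two-class data: class 0 ~ N(0, 1), class 1 ~ N(3, 2^2)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 2, 150)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(150, dtype=int)])

# ML estimates per class: prior, mean, variance
priors = np.array([np.mean(y == i) for i in (0, 1)])
means = np.array([x[y == i].mean() for i in (0, 1)])
vars_ = np.array([x[y == i].var() for i in (0, 1)])

def discriminant(x0):
    # g_i(x) = -0.5*log(2*pi*s_i^2) - (x - m_i)^2 / (2*s_i^2) + log P(C_i)
    return (-0.5 * np.log(2 * np.pi * vars_)
            - (x0 - means) ** 2 / (2 * vars_)
            + np.log(priors))

print("predicted class for x = 1.0:", discriminant(1.0).argmax())
```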

Equal variances: a single boundary halfway between the means.

Variances are different: two boundaries.

Regression Estimator g(x|θ), with r = f(x) + ε where ε ~ N(0, σ²), so p(r|x) ~ N(g(x|θ), σ²). Log likelihood: L(θ|X) = log ∏_t p(x^t, r^t) = log ∏_t p(r^t|x^t) + log ∏_t p(x^t); the second term can be ignored since it does not depend on θ.

Regression: From LogL to Error Maximizing the log likelihood above is the same as minimizing the error E(θ|X) = (1/2) ∑_t [r^t – g(x^t|θ)]². The θ that minimizes this error is called the "least squares estimate". Trick: when maximizing a likelihood l that contains exponents, instead minimize the error E = –log l.

Linear Regression g(x^t|w_1, w_0) = w_1 x^t + w_0. Setting the derivatives of the squared error to zero gives the normal equations, and so w_1 = (∑_t x^t r^t – N x̄ r̄) / (∑_t (x^t)² – N x̄²) and w_0 = r̄ – w_1 x̄, where x̄ and r̄ are the sample averages of the x^t and r^t.
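A sketch of this closed-form least squares solution for the single-input linear model, on simulated data (the true line and noise level below are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 50)
r = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)   # true line w1=2, w0=1 plus Gaussian noise

# Least squares / ML estimates of slope and intercept
w1 = ((x * r).sum() - len(x) * x.mean() * r.mean()) / ((x ** 2).sum() - len(x) * x.mean() ** 2)
w0 = r.mean() - w1 * x.mean()
print(f"w1 = {w1:.3f}, w0 = {w0:.3f}")
```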

Polynomial Regression g(x^t|w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0. Writing the inputs in a design matrix D whose row t is [1, x^t, (x^t)², ..., (x^t)^k] and the outputs in a vector r, the least squares solution is w = (D^T D)^{-1} D^T r.
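The same least squares machinery extends to polynomials by building the design matrix and solving the normal equations; a numpy sketch on simulated data (the target function and noise level are my own choices; np.linalg.lstsq does the solving):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 40)
r = np.sin(3 * x) + rng.normal(0, 0.1, 40)    # simulated data

k = 3                                         # polynomial order
D = np.vander(x, k + 1, increasing=True)      # columns: 1, x, x^2, ..., x^k
w, *_ = np.linalg.lstsq(D, r, rcond=None)     # least squares solution of D w ≈ r
print("coefficients w_0..w_k:", np.round(w, 3))
```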

Maximum Likelihood and Least Squares Taken from Tom Mitchell’s Machine Learning textbook: “… under certain assumptions (*) any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis.” (*): Assumption: “the observed training target values are generated by adding random noise to the true target value, where the random noise is drawn independently for each example from a Normal distribution with zero mean.”

Other Error Measures Square error: E(θ|X) = ∑_t (r^t – g(x^t|θ))². Relative square error (E_RSE): E(θ|X) = ∑_t (r^t – g(x^t|θ))² / ∑_t (r^t – r̄)². Absolute error: E(θ|X) = ∑_t |r^t – g(x^t|θ)|. ε-sensitive error: E(θ|X) = ∑_t 1(|r^t – g(x^t|θ)| > ε) (|r^t – g(x^t|θ)| – ε). R², the coefficient of determination: R² = 1 – E_RSE (see next slide).
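A sketch of these error measures as plain numpy functions (the function names are mine, not from the book; r and g are arrays of targets and predictions):

```python
import numpy as np

def square_error(r, g):               # sum_t (r^t - g^t)^2
    return np.sum((r - g) ** 2)

def relative_square_error(r, g):      # square error normalised by the error of the mean predictor
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):             # sum_t |r^t - g^t|
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps):   # only deviations larger than eps are penalised
    d = np.abs(r - g)
    return np.sum(np.where(d > eps, d - eps, 0.0))

def r_squared(r, g):                  # coefficient of determination, R^2 = 1 - E_RSE
    return 1.0 - relative_square_error(r, g)
```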

Coefficient of Determination: R² Figure adapted from Wikipedia (Sept. 2015): "Coefficient of Determination" by Orzetto, own work, licensed under CC BY-SA 3.0 via Commons – https://commons.wikimedia.org/wiki/File:Coefficient_of_Determination.svg#/media/File:Coefficient_of_Determination.svg The closer R² is to 1, the better: it means that the g(.) estimates (right graph in the figure) fit the data well in comparison to the simple estimate given by the average value (left graph).

Bias and Variance Given X = {x^t, r^t} drawn from an unknown joint density p(x, r), the expected square error of the g(.) estimate at x decomposes as E[(r – g(x))² | x] = E[(r – E[r|x])² | x] + (E[r|x] – g(x))² = noise + squared error. The noise is the variance of the added noise and does not depend on g(.) or X; the squared error is the deviation of g(x) from E[r|x] and depends on g(.) and X. It may be that for one sample X, g(.) is a very good fit, and for another sample it is not. Given samples X, all of size N and drawn from the same joint density p(x, r), the expected value averaged over samples is E_X[(E[r|x] – g(x))² | x] = (E[r|x] – E_X[g(x)])² + E_X[(g(x) – E_X[g(x)])²] = bias² + variance, where bias measures how much g(.) is wrong disregarding the effect of varying samples, and variance measures how much g(.) fluctuates as the sample varies.

Estimating Bias and Variance M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M, are used to fit g_i(x), i = 1, ..., M. With ḡ(x) = (1/M) ∑_i g_i(x) and f the true function: Bias²(g) = (1/N) ∑_t [ḡ(x^t) – f(x^t)]² and Variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) – ḡ(x^t)]².
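A sketch of this estimate using order-1 polynomial fits to noisy sine data; the setting loosely mimics the book's figures, but the target function, noise level, and sample counts below are my own choices:

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(2 * np.pi * x)          # true function
x = np.linspace(0, 1, 25)                    # fixed evaluation points x^t
M, order = 100, 1                            # number of samples, model order

# Fit g_i(.) on M independent noisy samples and evaluate each on the grid
G = np.empty((M, x.size))
for i in range(M):
    r = f(x) + rng.normal(0, 0.3, x.size)    # sample X_i = {x^t, r^t}
    w = np.polyfit(x, r, order)
    G[i] = np.polyval(w, x)                  # g_i(x^t)

g_bar = G.mean(axis=0)                        # average fit over samples, g_bar(x^t)
bias2 = np.mean((g_bar - f(x)) ** 2)          # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((G - g_bar) ** 2)          # (1/(N*M)) sum_t sum_i [g_i(x^t) - g_bar(x^t)]^2
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

Increasing `order` in this sketch typically lowers the bias term and raises the variance term, which is the dilemma discussed on the next slide.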

Bias/Variance Dilemma Examples: g_i(x) = 2 has no variance and high bias; g_i(x) = ∑_t r^t_i / N has lower bias but higher variance. As we increase the complexity of g(.), bias decreases (a better fit to the data) and variance increases (the fit varies more with the data). Bias/variance dilemma (Geman et al., 1992).

[Figure 4.5, p. 83: a linear regression for each of 5 samples (added noise ~ N(0, 1)), and polynomial regressions of order 3 (left) and order 5 (right) for each of 5 samples. The dotted red line in each plot is the average of the five fits, ḡ(x); the annotations mark the bias relative to the true function f and the variance across the fits g_i. See Figure 4.5's caption, p. 83.]

[Figure 4.6, p. 84: polynomial regression, same settings as the plots on the previous slide but using 100 models instead of 5. The best fit is the one with minimum error.]

[Figure 4.7, p. 85: fitted polynomials of orders 1, ..., 8 on the training set; same settings as the previous slides but using training and validation sets (50 instances each). The best fit is at the "elbow" of the validation error.]

Model Selection (1 of 2 slides) Methods to fine-tune model complexity:
Cross-validation: measures generalization accuracy by testing on data unused during training.
Regularization: penalizes complex models, E' = error on data + λ · model complexity (the lower the better), where λ is a penalty weight. If λ is too high, the models allowed may be too simple, and that increases the bias; an "optimal" value for λ can be found using cross-validation.
Other measures of "goodness of fit" with a complexity penalty:
Akaike's information criterion (AIC): AIC ≡ log p(X|θ_ML, M) – k(M)
Bayesian information criterion (BIC): BIC ≡ log p(X|θ_ML, M) – k(M) log(N)/2
where M is a model, log p(X|θ_ML, M) is the log likelihood of M with M's parameters θ_ML estimated by maximum likelihood, k(M) is the number of adjustable parameters in θ_ML of model M, and N is the size of the sample X. For both AIC and BIC, the higher the better. AIC and BIC are defined in Chapter 16, equations (16.36) and (16.35) on p. 470.
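A small numeric sketch of AIC and BIC in the "higher is better" form above, for Gaussian-noise polynomial regression of different orders. The data is simulated, and counting the parameters as the polynomial coefficients plus the noise variance is my own bookkeeping, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 60
x = rng.uniform(-1, 1, N)
r = np.sin(3 * x) + rng.normal(0, 0.2, N)        # simulated data

def scores(order):
    w = np.polyfit(x, r, order)
    resid = r - np.polyval(w, x)
    sigma2 = np.mean(resid ** 2)                 # ML estimate of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)   # Gaussian log likelihood at the MLE
    k = order + 2                                # coefficients + noise variance (my bookkeeping)
    aic = loglik - k                             # higher is better
    bic = loglik - 0.5 * k * np.log(N)           # higher is better
    return aic, bic

for order in range(1, 8):
    aic, bic = scores(order)
    print(f"order {order}: AIC = {aic:7.2f}  BIC = {bic:7.2f}")
```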

Model Selection (2 of 2 slides) Methods to fine-tune model complexity (continued):
Minimum description length (MDL): based on Kolmogorov complexity, the shortest description of the data. Given a dataset X, MDL(M) = description length of model M + description length of the data in X not correctly described by M (the lower the better).
Structural risk minimization (SRM): uses a set of models and their complexities (measured usually by their number of free parameters or their VC dimension), and selects the model that is simplest in terms of order and best in terms of empirical error on the data.

Bayesian Model Selection Used when we have a prior on models, p(model). Regularization: when the prior favors simpler models. Bayes: MAP of the posterior p(model|data). Alternatively, average over a number of models with high posterior (voting, ensembles: Chapter 17).

Regression example Coefficients increase in magnitude as order increases: 1: [-0.0769, 0.0016] 2: [0.1682, -0.6657, 0.0080] 3: [0.4238, -2.5778, 3.4675, -0.0002] 4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019] Regularization (L2) penalizes large coefficients by minimizing E(w|X) = (1/2) ∑_t [r^t – g(x^t|w)]² + (λ/2) ∑_i w_i².
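Ridge (L2-regularized) polynomial regression shows the effect directly: the penalty keeps the coefficients small as the order grows. A sketch with made-up data and λ values; the closed-form solution (D^T D + λI)^{-1} D^T r minimizes the regularized error above:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-1, 1, 30)
r = np.sin(3 * x) + rng.normal(0, 0.2, 30)     # simulated data

def ridge_poly(x, r, order, lam):
    """Minimise 0.5*sum_t (r^t - D w)^2 + 0.5*lam*||w||^2 in closed form."""
    D = np.vander(x, order + 1, increasing=True)
    return np.linalg.solve(D.T @ D + lam * np.eye(order + 1), D.T @ r)

for lam in (0.0, 0.1, 10.0):
    w = ridge_poly(x, r, order=6, lam=lam)
    print(f"lambda={lam:5.1f}  max|w_i| = {np.abs(w).max():.3f}")   # larger lambda -> smaller coefficients
```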