CHAPTER 4: Parametric Methods

Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, © The MIT Press (V1.1)

Parametric Estimation
- Sample $X = \{x^t\}_{t=1}^N$ where $x^t \sim p(x)$
- Parametric estimation: assume a form for $p(x \mid \theta)$ and estimate $\theta$, its sufficient statistics, using $X$; e.g., $\mathcal{N}(\mu, \sigma^2)$ where $\theta = \{\mu, \sigma^2\}$
- Problem: how can we obtain $\theta$ from $X$?
- Assumption: $X$ contains samples of a one-dimensional random variable; multivariate estimation, where each instance carries multiple measurements rather than a single one, comes later

Maximum Likelihood Estimation
- A density $p$ with parameters $\theta$ is given, and $x^t \sim p(x \mid \theta)$
- Likelihood of $\theta$ given the sample $X$: $l(\theta \mid X) = p(X \mid \theta) = \prod_t p(x^t \mid \theta)$
- We look for the $\theta$ that maximizes the likelihood of the sample
- Log likelihood: $\mathcal{L}(\theta \mid X) = \log l(\theta \mid X) = \sum_t \log p(x^t \mid \theta)$
- Maximum likelihood estimator (MLE): $\theta^* = \arg\max_\theta \mathcal{L}(\theta \mid X)$
- Homework: given the sample 0, 3, 3, 4, 5 and $x \sim \mathcal{N}(\mu, \sigma^2)$, use MLE to find $(\mu, \sigma^2)$
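As a numerical check, here is a minimal sketch (not from the slides) of the closed-form Gaussian MLE applied to the homework sample; everything beyond the sample values is standard NumPy:

```python
# A minimal sketch: closed-form Gaussian MLE on the homework sample.
import numpy as np

x = np.array([0.0, 3.0, 3.0, 4.0, 5.0])   # the homework sample

mu_mle = x.mean()                          # m = (1/N) sum_t x^t
var_mle = np.mean((x - mu_mle) ** 2)       # s^2 = (1/N) sum_t (x^t - m)^2 (divides by N, not N-1)

print(mu_mle, var_mle)                     # 3.0 2.8
```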

Examples: Bernoulli/Multinomial
- Bernoulli: two states, failure/success, $x \in \{0, 1\}$, with $P(x) = p_0^x (1 - p_0)^{1-x}$
- $\mathcal{L}(p_0 \mid X) = \log \prod_t p_0^{x^t} (1 - p_0)^{1 - x^t}$
- MLE: $\hat p_0 = \sum_t x^t / N$
- Multinomial: $K > 2$ states, $x_i \in \{0, 1\}$, with $P(x_1, x_2, \dots, x_K) = \prod_i p_i^{x_i}$
- $\mathcal{L}(p_1, p_2, \dots, p_K \mid X) = \log \prod_t \prod_i p_i^{x_i^t}$
- MLE: $\hat p_i = \sum_t x_i^t / N$
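Both MLEs reduce to counting. A minimal sketch with hypothetical data (the arrays below are made up for illustration):

```python
# A minimal sketch: the Bernoulli and multinomial MLEs are just observed frequencies.
import numpy as np

coin = np.array([1, 0, 1, 1, 0, 1])        # hypothetical Bernoulli trials
p0_hat = coin.mean()                       # p0 = sum_t x^t / N

rolls = np.array([[1, 0, 0],               # hypothetical 1-of-K encoded multinomial draws
                  [0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0]])
p_hat = rolls.mean(axis=0)                 # p_i = sum_t x_i^t / N

print(p0_hat, p_hat)                       # 0.666... [0.5 0.25 0.25]
```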

Gaussian (Normal) Distribution
- $p(x) = \mathcal{N}(\mu, \sigma^2)$: $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$
- MLE for $\mu$ and $\sigma^2$: $m = \frac{1}{N}\sum_t x^t, \qquad s^2 = \frac{1}{N}\sum_t (x^t - m)^2$

Bias and Variance
- Unknown parameter $\theta$; estimator $d_i = d(X_i)$ on sample $X_i$
- Bias: $b_\theta(d) = E[d] - \theta$ — measures how much the estimator is systematically wrong, i.e., error in the model itself
- Variance: $E[(d - E[d])^2]$ — measures the variation/randomness of the estimator across samples
- Mean square error of the estimator $d$: $r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \text{Bias}^2 + \text{Variance}$
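A minimal sketch that estimates bias and variance by simulation, using the MLE variance estimator $s^2$ (which divides by N and is therefore biased); the true variance and sample sizes below are arbitrary choices:

```python
# A minimal sketch: Monte Carlo bias/variance of the MLE variance estimator s^2.
import numpy as np

rng = np.random.default_rng(0)
theta = 4.0                                   # true variance sigma^2 (arbitrary choice)
N, M = 10, 100_000                            # sample size and number of repeated samples X_i

X = rng.normal(0.0, np.sqrt(theta), size=(M, N))
d = ((X - X.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)   # d(X_i) = s^2 on each sample

bias = d.mean() - theta                       # E[d] - theta; theory says -theta/N = -0.4
variance = d.var()                            # E[(d - E[d])^2]
mse = np.mean((d - theta) ** 2)               # approximately bias^2 + variance
print(bias, variance, mse)
```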

Bayes' Estimator
- Treat $\theta$ as a random variable with prior $p(\theta)$
- Bayes' rule: $p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$
- Maximum a posteriori (MAP): $\theta_{MAP} = \arg\max_\theta p(\theta \mid X)$
- Maximum likelihood (ML): $\theta_{ML} = \arg\max_\theta p(X \mid \theta)$
- Bayes' estimator: $\theta_{Bayes} = E[\theta \mid X] = \int \theta\, p(\theta \mid X)\, d\theta$
- Comments: ML simply maximizes the likelihood; compared with ML, MAP additionally takes the prior into account; the Bayes' estimator averages over all possible values of $\theta$, weighted by their posterior density $p(\theta \mid X)$

Bayes' Estimator: Example
- $x^t \sim \mathcal{N}(\theta, \sigma_0^2)$ and prior $\theta \sim \mathcal{N}(\mu, \sigma^2)$
- $\theta_{ML} = m$ (the sample mean)
- $\theta_{MAP} = \theta_{Bayes} = \frac{N/\sigma_0^2}{N/\sigma_0^2 + 1/\sigma^2}\, m + \frac{1/\sigma^2}{N/\sigma_0^2 + 1/\sigma^2}\, \mu$
- As $N \to \infty$ (or $\sigma \to \infty$), the estimate converges to $m$
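A minimal sketch of this example; the noise variance and the prior parameters below are assumed values, and the sample reuses the homework data:

```python
# A minimal sketch of the normal-prior example; sigma0_sq, mu, sigma_sq are assumed values.
import numpy as np

x = np.array([0.0, 3.0, 3.0, 4.0, 5.0])      # reusing the homework sample
sigma0_sq = 2.0                               # assumed known noise variance sigma_0^2
mu, sigma_sq = 0.0, 1.0                       # assumed prior N(mu, sigma^2)

N, m = len(x), x.mean()
theta_ml = m                                  # ML ignores the prior
w = (N / sigma0_sq) / (N / sigma0_sq + 1.0 / sigma_sq)
theta_bayes = w * m + (1.0 - w) * mu          # posterior mean = MAP (Gaussian posterior)

print(theta_ml, theta_bayes)                  # the prior pulls the estimate toward mu
```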

Parametric Classification
- Discriminant: $g_i(x) = p(x \mid C_i)\, P(C_i)$, or equivalently $g_i(x) = \log p(x \mid C_i) + \log P(C_i)$ — a kind of (unnormalized) $p(C_i \mid x)$
- With Gaussian class likelihoods $p(x \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{(x - \mu_i)^2}{2\sigma_i^2}\right]$:
  $g_i(x) = -\frac{1}{2}\log 2\pi - \log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)$

- Given the sample $X = \{x^t, r^t\}_{t=1}^N$, with $r_i^t = 1$ if $x^t \in C_i$ and 0 otherwise, the ML estimates are
  $\hat P(C_i) = \frac{\sum_t r_i^t}{N}, \qquad m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}, \qquad s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}$
- The discriminant becomes $g_i(x) = -\log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat P(C_i)$ (dropping constants)
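Putting the two slides together, a minimal sketch of the resulting 1-D classifier; the function names and the toy data are my own choices for illustration:

```python
# A minimal sketch of the 1-D Gaussian parametric classifier with per-class ML estimates.
import numpy as np

def fit(x, r, K):
    """x: (N,) values; r: (N,) integer labels 0..K-1. Returns per-class ML estimates."""
    priors = np.array([np.mean(r == i) for i in range(K)])
    means  = np.array([x[r == i].mean() for i in range(K)])
    vars_  = np.array([x[r == i].var() for i in range(K)])   # MLE: divides by N_i
    return priors, means, vars_

def discriminant(x, priors, means, vars_):
    # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i), constants dropped
    return -0.5 * np.log(vars_) - (x - means) ** 2 / (2 * vars_) + np.log(priors)

priors, means, vars_ = fit(np.array([1.0, 2.0, 1.5, 5.0, 6.0, 5.5]),
                           np.array([0, 0, 0, 1, 1, 1]), K=2)
print(np.argmax(discriminant(3.4, priors, means, vars_)))    # picks the closer class
```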

Equal variances: a single boundary, at halfway between the means

Variances are different: two boundaries. Homework!

Regression
- $r = f(x) + \varepsilon$, where the estimator is $g(x \mid \theta)$ and the noise is $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, so $p(r \mid x) \sim \mathcal{N}(g(x \mid \theta), \sigma^2)$
- Maximizing the probability of the sample again:
  $\mathcal{L}(\theta \mid X) = \log \prod_t p(x^t, r^t) = \log \prod_t p(r^t \mid x^t) + \log \prod_t p(x^t)$

Regression: From Log Likelihood to Error
- Dropping the terms that do not depend on $\theta$, maximizing the log likelihood is equivalent to minimizing the sum of squared errors:
  $E(\theta \mid X) = \frac{1}{2} \sum_t \left[r^t - g(x^t \mid \theta)\right]^2$
- Skip to slide 20!

Linear Regression
- $g(x \mid w_1, w_0) = w_1 x + w_0$
- Setting the derivatives of $E$ to zero gives two linear equations in the two unknowns (the normal equations):
  $\sum_t r^t = N w_0 + w_1 \sum_t x^t, \qquad \sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t (x^t)^2$
- Relationship to what we discussed in Topic 2?
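A minimal sketch that solves these two normal equations directly; the data points are hypothetical:

```python
# A minimal sketch: solving the two normal equations for (w0, w1).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
r = np.array([1.1, 2.9, 5.2, 7.1, 8.8])       # hypothetical noisy targets, roughly 2x + 1

A = np.array([[len(x), x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, b)                # solves the normal equations A w = b
print(w0, w1)                                 # close to 1 and 2
```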

Polynomial Regression
- $g(x \mid w_k, \dots, w_1, w_0) = w_k x^k + \cdots + w_1 x + w_0$
- Setting the derivatives to zero, here we get $k+1$ equations with $k+1$ unknowns; in matrix form, $\mathbf{w} = (\mathbf{D}^T \mathbf{D})^{-1} \mathbf{D}^T \mathbf{r}$, where $\mathbf{D}$ is the design matrix with rows $(1, x^t, (x^t)^2, \dots, (x^t)^k)$

Other Error Measures
- Square error: $E(\theta \mid X) = \sum_t (r^t - g(x^t \mid \theta))^2$
- Relative square error: $E(\theta \mid X) = \frac{\sum_t (r^t - g(x^t \mid \theta))^2}{\sum_t (r^t - \bar r)^2}$
- Absolute error: $E(\theta \mid X) = \sum_t |r^t - g(x^t \mid \theta)|$
- $\varepsilon$-sensitive error: $E(\theta \mid X) = \sum_t \mathbf{1}\left(|r^t - g(x^t \mid \theta)| > \varepsilon\right)\left(|r^t - g(x^t \mid \theta)| - \varepsilon\right)$
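A minimal sketch computing all four measures on hypothetical predictions (the arrays and the ε value are made up):

```python
# A minimal sketch: the four error measures on hypothetical predictions.
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0])            # targets
g = np.array([1.1, 1.8, 3.4, 3.9])            # hypothetical model outputs
eps = 0.25

err = np.abs(r - g)
square   = np.sum(err ** 2)
relative = square / np.sum((r - r.mean()) ** 2)
absolute = np.sum(err)
eps_sens = np.sum((err > eps) * (err - eps))  # only deviations beyond eps are penalized
print(square, relative, absolute, eps_sens)
```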

Bias and Variance
- The expected squared error at $x$ decomposes into noise plus the squared error of the estimate:
  $E[(r - g(x))^2 \mid x] = E[(r - E[r \mid x])^2 \mid x] + (E[r \mid x] - g(x))^2$ (noise + squared error)
- The squared error, in turn, decomposes over samples into bias and variance:
  $E_X[(E[r \mid x] - g(x))^2 \mid x] = (E[r \mid x] - E_X[g(x)])^2 + E_X[(g(x) - E_X[g(x)])^2]$ (bias² + variance)
- To be revisited next week!

Estimating Bias and Variance
- $M$ samples $X_i = \{x_i^t, r_i^t\}$, $i = 1, \dots, M$, are used to fit $g_i(x)$, $i = 1, \dots, M$:
  $\bar g(x) = \frac{1}{M} \sum_i g_i(x), \qquad \text{Bias}^2(g) = \frac{1}{N} \sum_t \left[\bar g(x^t) - f(x^t)\right]^2, \qquad \text{Variance}(g) = \frac{1}{NM} \sum_t \sum_i \left[g_i(x^t) - \bar g(x^t)\right]^2$
- Initially skip!
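A minimal sketch of this procedure, fitting a line to M noisy samples of an assumed true function f(x) = sin(x); all the constants are arbitrary choices:

```python
# A minimal sketch: fit g_i to M noisy samples of an assumed f and average the results.
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                                    # assumed true function (my choice)
x = np.linspace(0.0, np.pi, 25)               # fixed evaluation points
M, N = 50, 20                                 # number of samples and points per sample

g = []
for _ in range(M):
    xt = rng.uniform(0.0, np.pi, N)
    rt = f(xt) + rng.normal(0.0, 0.3, N)      # noisy sample X_i
    g.append(np.polyval(np.polyfit(xt, rt, deg=1), x))   # g_i: a fitted line, evaluated at x
g = np.array(g)

g_bar = g.mean(axis=0)                        # average model over the M fits
bias_sq  = np.mean((g_bar - f(x)) ** 2)       # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((g - g_bar) ** 2)          # (1/NM) sum_t sum_i [g_i(x^t) - g_bar(x^t)]^2
print(bias_sq, variance)
```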

Bias/Variance Dilemma
- Example: $g_i(x) = 2$ has no variance and high bias; $g_i(x) = \sum_t r_i^t / N$ has lower bias, but now some variance
- As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data)
- Bias/variance dilemma (Geman et al., 1992)

[Figure: the true function f, the individual fits g_i, and their average g, illustrating bias and variance.] Already visited as Topic 4!

Polynomial Regression
[Figure: polynomial fits to the data; the best fit is the one with minimum error.]

Model Selection
- Cross-validation: measure generalization accuracy by testing on data unused during training
- Regularization: penalize complex models, $E' = \text{error on data} + \lambda \cdot \text{model complexity}$
- Akaike's information criterion (AIC), Bayesian information criterion (BIC)
- Minimum description length (MDL): Kolmogorov complexity, shortest description of the data
- Structural risk minimization (SRM)
- Remark: will be discussed in more depth later (Topic 11)

Bayesian Model Selection
- Prior on models, $p(\text{model})$: $p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$
- Regularization corresponds to a prior that favors simpler models
- Bayes: take the MAP of the posterior, $p(\text{model} \mid \text{data})$
- Or average over a number of models with high posterior (voting, ensembles: Chapter 15)

CHAPTER 5: Multivariate Methods

Multivariate Data
- Multiple measurements (sensors) per instance
- $d$ inputs/features/attributes: $d$-variate
- $N$ instances/observations/examples, collected in the $N \times d$ data matrix $\mathbf{X} = [x_j^t]$

Multivariate Parameters
- Mean: $E[\mathbf{x}] = \boldsymbol\mu = [\mu_1, \dots, \mu_d]^T$
- Covariance: $\sigma_{ij} \equiv \text{Cov}(x_i, x_j)$, collected in $\boldsymbol\Sigma \equiv \text{Cov}(\mathbf{x}) = E[(\mathbf{x} - \boldsymbol\mu)(\mathbf{x} - \boldsymbol\mu)^T]$
- Correlation: $\text{Corr}(x_i, x_j) \equiv \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$

Parameter Estimation
- Sample mean: $\mathbf{m} = \frac{1}{N} \sum_t \mathbf{x}^t$
- Sample covariance: $s_{ij} = \frac{1}{N} \sum_t (x_i^t - m_i)(x_j^t - m_j)$, collected in the matrix $\mathbf{S}$
- Sample correlation: $r_{ij} = \frac{s_{ij}}{s_i s_j}$, collected in the matrix $\mathbf{R}$
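A minimal sketch of these estimates in NumPy; the 4 × 2 data matrix is hypothetical:

```python
# A minimal sketch: sample mean, covariance (MLE, divides by N), and correlation.
import numpy as np

X = np.array([[165.0, 55.0],
              [180.0, 80.0],
              [175.0, 72.0],
              [160.0, 50.0]])                 # N x d data (e.g., height and weight)

m = X.mean(axis=0)                            # m = (1/N) sum_t x^t
D = X - m
S = D.T @ D / len(X)                          # s_ij = (1/N) sum_t (x_i^t - m_i)(x_j^t - m_j)
s = np.sqrt(np.diag(S))
R = S / np.outer(s, s)                        # r_ij = s_ij / (s_i s_j)
print(m, S, R, sep="\n")
```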

Multivariate Normal Distribution
- $\mathbf{x} \sim \mathcal{N}_d(\boldsymbol\mu, \boldsymbol\Sigma)$:
  $p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol\mu)^T \boldsymbol\Sigma^{-1} (\mathbf{x} - \boldsymbol\mu)\right]$
- The quadratic form $(\mathbf{x} - \boldsymbol\mu)^T \boldsymbol\Sigma^{-1} (\mathbf{x} - \boldsymbol\mu)$ is the Mahalanobis distance between $\mathbf{x}$ and $\boldsymbol\mu$

Multivariate Normal Distribution
- Mahalanobis distance: $(\mathbf{x} - \boldsymbol\mu)^T \boldsymbol\Sigma^{-1} (\mathbf{x} - \boldsymbol\mu)$ measures the distance from $\mathbf{x}$ to $\boldsymbol\mu$ in terms of $\boldsymbol\Sigma$ (it normalizes for differences in variances and correlations)
- Bivariate case, $d = 2$: $\boldsymbol\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$, where $\rho$ is the correlation between the two variables:
  $p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left[-\frac{1}{2(1 - \rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$
- Here $z_i = (x_i - \mu_i)/\sigma_i$ is called the z-score for $x_i$
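A minimal sketch of the Mahalanobis distance computation; the mean, covariance, and query point are hypothetical:

```python
# A minimal sketch: Mahalanobis distance from x to the mean under covariance S.
import numpy as np

mu = np.array([170.0, 65.0])
S = np.array([[60.0, 40.0],
              [40.0, 90.0]])                  # hypothetical covariance matrix
x = np.array([180.0, 70.0])

d2 = (x - mu) @ np.linalg.solve(S, x - mu)    # (x - mu)^T S^{-1} (x - mu)
print(np.sqrt(d2))                            # distance in "covariance units"
```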

Bivariate Normal
[Figures: bivariate normal densities and their iso-probability contours for different values of the correlation ρ.]

Independent Inputs: Naive Bayes
- If the $x_i$ are independent, the off-diagonals of $\boldsymbol\Sigma$ are 0, and the Mahalanobis distance reduces to a weighted (by $1/\sigma_i$) Euclidean distance:
  $p(\mathbf{x}) = \prod_{i=1}^d p_i(x_i) = \frac{1}{(2\pi)^{d/2} \prod_i \sigma_i} \exp\left[-\frac{1}{2} \sum_{i=1}^d \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right]$
- If the variances are also equal, it reduces to the Euclidean distance

Parametric Classification
- If $p(\mathbf{x} \mid C_i) \sim \mathcal{N}_d(\boldsymbol\mu_i, \boldsymbol\Sigma_i)$:
  $p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1} (\mathbf{x} - \boldsymbol\mu_i)\right]$
- Discriminant functions are
  $g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = -\frac{d}{2} \log 2\pi - \frac{1}{2} \log |\boldsymbol\Sigma_i| - \frac{1}{2} (\mathbf{x} - \boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1} (\mathbf{x} - \boldsymbol\mu_i) + \log P(C_i)$

Estimation of Parameters
- $\hat P(C_i) = \frac{\sum_t r_i^t}{N}, \qquad \mathbf{m}_i = \frac{\sum_t r_i^t \mathbf{x}^t}{\sum_t r_i^t}, \qquad \mathbf{S}_i = \frac{\sum_t r_i^t (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$
- Plugging these into the discriminant (and dropping constants):
  $g_i(\mathbf{x}) = -\frac{1}{2} \log |\mathbf{S}_i| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x} - \mathbf{m}_i) + \log \hat P(C_i)$
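A minimal sketch of the resulting multivariate Gaussian classifier; the function names and the toy data are my own choices:

```python
# A minimal sketch: multivariate Gaussian classifier (quadratic discriminant), ML estimates.
import numpy as np

def fit(X, r, K):
    priors, means, covs = [], [], []
    for i in range(K):
        Xi = X[r == i]
        priors.append(len(Xi) / len(X))       # P(C_i)
        means.append(Xi.mean(axis=0))         # m_i
        covs.append(np.cov(Xi.T, bias=True))  # S_i (bias=True divides by N_i, as in the MLE)
    return np.array(priors), np.array(means), np.array(covs)

def classify(x, priors, means, covs):
    scores = []
    for P, m, S in zip(priors, means, covs):
        d = x - m
        scores.append(-0.5 * np.log(np.linalg.det(S))
                      - 0.5 * d @ np.linalg.solve(S, d) + np.log(P))
    return int(np.argmax(scores))

X = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.8], [1.2, 1.1],
              [5.0, 6.0], [6.0, 5.0], [5.5, 5.8], [5.2, 5.1]])
r = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(classify(np.array([4.5, 4.5]), *fit(X, r, K=2)))   # expected: class 1
```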

Different $S_i$
- Quadratic discriminant: $g_i(\mathbf{x}) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$, with
  $\mathbf{W}_i = -\frac{1}{2} \mathbf{S}_i^{-1}, \qquad \mathbf{w}_i = \mathbf{S}_i^{-1} \mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2} \mathbf{m}_i^T \mathbf{S}_i^{-1} \mathbf{m}_i - \frac{1}{2} \log |\mathbf{S}_i| + \log \hat P(C_i)$
- skip

[Figure: class likelihoods $p(x \mid C_i)$, the posterior for $C_1$, and the discriminant where $P(C_1 \mid x) = 0.5$.]

Common Covariance Matrix S
- Shared common sample covariance: $\mathbf{S} = \sum_i \hat P(C_i)\, \mathbf{S}_i$
- The discriminant reduces to $g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}^{-1} (\mathbf{x} - \mathbf{m}_i) + \log \hat P(C_i)$, which is a linear discriminant
- Initially skip!

Common Covariance Matrix S
[Figure: with a shared covariance matrix, the decision boundaries are linear.] Initially skip!

Diagonal S
- When the $x_j$, $j = 1, \dots, d$, are independent, $\boldsymbol\Sigma$ is diagonal: $p(\mathbf{x} \mid C_i) = \prod_j p(x_j \mid C_i)$ (the naive Bayes' assumption)
- $g_i(\mathbf{x}) = -\frac{1}{2} \sum_{j=1}^d \left(\frac{x_j - m_{ij}}{s_j}\right)^2 + \log \hat P(C_i)$
- Classify based on weighted Euclidean distance (in $s_j$ units) to the nearest mean
- Likely covered in April!

Diagonal S
[Figure: axis-aligned elliptic contours; the variances may be different.]

Diagonal S, Equal Variances
- Nearest mean classifier: classify based on Euclidean distance to the nearest mean,
  $g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \mathbf{m}_i\|^2}{2 s^2} + \log \hat P(C_i)$
- Each mean can be considered a prototype or template, so this is template matching
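A minimal sketch of nearest-mean classification; the prototypes and the query point are hypothetical:

```python
# A minimal sketch: nearest-mean (template-matching) classification.
import numpy as np

means = np.array([[1.0, 1.0],
                  [4.0, 3.0]])                # one prototype m_i per class
x = np.array([3.2, 2.5])

dists = np.linalg.norm(means - x, axis=1)     # Euclidean distance to each mean
print(int(dists.argmin()))                    # class of the nearest prototype (here: 1)
```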

Diagonal S, Equal Variances
[Figure: circular equal-density contours around the class means; '*' marks a point to be classified.]

Model Selection
- As we increase complexity (a less restricted S), bias decreases and variance increases
- Assume simple models (allow some bias) to control variance (regularization)

Assumption                  | Covariance matrix        | Number of parameters
Shared, hyperspheric        | S_i = S = s²I            | 1
Shared, axis-aligned        | S_i = S, with s_ij = 0   | d
Shared, hyperellipsoidal    | S_i = S                  | d(d+1)/2
Different, hyperellipsoidal | S_i                      | K·d(d+1)/2

Discrete Features
- Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$; if the $x_j$ are independent (naive Bayes'), the discriminant is linear:
  $g_i(\mathbf{x}) = \sum_j \left[x_j \log p_{ij} + (1 - x_j) \log (1 - p_{ij})\right] + \log P(C_i)$
- Estimated parameters: $\hat p_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$
- skip!
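A minimal sketch of naive Bayes' with binary features; the data are hypothetical, and the clipping of the estimated probabilities is my own addition to keep the logs finite:

```python
# A minimal sketch: Bernoulli naive Bayes' discriminant for binary features.
import numpy as np

def fit(X, r, K, eps=1e-9):
    p = np.array([X[r == i].mean(axis=0) for i in range(K)])  # p_ij = P(x_j = 1 | C_i)
    priors = np.array([np.mean(r == i) for i in range(K)])
    return np.clip(p, eps, 1 - eps), priors   # clip so log p and log(1 - p) stay finite

def discriminant(x, p, priors):
    # g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i)
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1) + np.log(priors)

X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
r = np.array([0, 0, 1, 1])
p, priors = fit(X, r, K=2)
print(np.argmax(discriminant(np.array([1, 0, 0]), p, priors)))   # expected: class 0
```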

Discrete Features
- Multinomial (1-of-$n_j$) features: $x_j \in \{v_1, v_2, \dots, v_{n_j}\}$, with $p_{ijk} \equiv p(z_{jk} = 1 \mid C_i)$, where $z_{jk} = 1$ if $x_j = v_k$ and 0 otherwise
- If the $x_j$ are independent:
  $g_i(\mathbf{x}) = \sum_j \sum_k z_{jk} \log p_{ijk} + \log P(C_i), \qquad \hat p_{ijk} = \frac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}$
- skip!

Multivariate Regression
- Multivariate linear model: $g(\mathbf{x} \mid w_0, w_1, \dots, w_d) = w_0 + w_1 x_1 + \cdots + w_d x_d$; minimizing the squared error again yields a set of linear (normal) equations
- Multivariate polynomial model: define new higher-order variables $z_1 = x_1,\ z_2 = x_2,\ z_3 = x_1^2,\ z_4 = x_2^2,\ z_5 = x_1 x_2$ and use the linear model in this new z-space (basis functions, kernel trick, SVM: Chapter 10)
- skip!
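A minimal sketch of the z-space idea: a polynomial model in x fit as a linear model in z via least squares; the data are hypothetical:

```python
# A minimal sketch: a polynomial model fit as a linear model in z-space.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0],
              [2.5, 2.5], [1.5, 1.0], [3.5, 1.5]])   # N x 2 inputs (hypothetical)
r = np.array([5.0, 4.2, 9.1, 8.3, 7.0, 4.0, 7.5])    # hypothetical targets

x1, x2 = X[:, 0], X[:, 1]
Z = np.column_stack([np.ones(len(X)),                # z-space: 1, x1, x2, x1^2, x2^2, x1*x2
                     x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
w, *_ = np.linalg.lstsq(Z, r, rcond=None)            # least squares on the linear model in z
print(w)
```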