CHAPTER 4: Parametric Methods
Lecture Slides for INTRODUCTION TO Machine Learning 2e, ETHEM ALPAYDIN © The MIT Press, 2010

Parametric Estimation
A sample $\mathcal{X} = \{x^t\}_{t=1}^N$ where $x^t \sim p(x)$. Here $x$ is one-dimensional and the densities are univariate.
Parametric estimation: assume a form for $p(x \mid \theta)$ and estimate $\theta$, its sufficient statistics, using $\mathcal{X}$; e.g., $\mathcal{N}(\mu, \sigma^2)$, where $\theta = \{\mu, \sigma^2\}$.

Maximum Likelihood Estimation
Likelihood of $\theta$ given the sample $\mathcal{X}$: $l(\theta \mid \mathcal{X}) \equiv p(\mathcal{X} \mid \theta) = \prod_{t=1}^{N} p(x^t \mid \theta)$
Log likelihood: $\mathcal{L}(\theta \mid \mathcal{X}) \equiv \log l(\theta \mid \mathcal{X}) = \sum_{t=1}^{N} \log p(x^t \mid \theta)$
Maximum likelihood estimator (MLE): $\theta^* = \arg\max_\theta \mathcal{L}(\theta \mid \mathcal{X})$
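To make the definition concrete, here is a minimal numerical sketch (not from the slides): the MLE is obtained by minimizing the negative log likelihood, assuming a Gaussian model and a synthetic sample, using scipy.optimize.minimize.

```python
# Minimal sketch: numerical MLE by minimizing the negative log likelihood.
# Assumes a Gaussian model p(x | theta) with theta = (mu, log_sigma); hypothetical data.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)   # hypothetical sample

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                   # parameterize log(sigma) to keep sigma > 0
    return -np.sum(norm.logpdf(X, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and standard deviation
```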

Examples: Bernoulli Density
Two states, failure/success, $x \in \{0, 1\}$: $P(x) = p^x (1-p)^{1-x}$
$l(p \mid \mathcal{X}) = \prod_{t=1}^{N} p^{x^t}(1-p)^{1-x^t}$
$\mathcal{L}(p \mid \mathcal{X}) = \log \prod_t p^{x^t}(1-p)^{1-x^t} = \left(\sum_t x^t\right) \log p + \left(N - \sum_t x^t\right) \log(1-p)$
MLE: $\hat{p} = \dfrac{\sum_t x^t}{N}$ (MLE: Maximum Likelihood Estimation)
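A minimal sketch with assumed data: the Bernoulli MLE is simply the fraction of successes in the sample.

```python
# Minimal sketch (assumed example): the Bernoulli MLE is the sample mean of the 0/1 outcomes.
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.3, size=1000)   # hypothetical 0/1 sample

p_hat = x.sum() / len(x)                  # MLE: p = (sum_t x^t) / N
print(p_hat)                              # close to the true p = 0.3
```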

Examples: Multinomial Density
$K > 2$ mutually exclusive states, $x_i \in \{0, 1\}$: $P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i}$
$\mathcal{L}(p_1, p_2, \ldots, p_K \mid \mathcal{X}) = \log \prod_t \prod_i p_i^{x_i^t}$, where $x_i^t = 1$ if experiment $t$ chooses state $i$ and $x_i^t = 0$ otherwise.
MLE: $\hat{p}_i = \dfrac{\sum_t x_i^t}{N}$
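A minimal sketch with assumed data: the multinomial MLE is the normalized count of each state.

```python
# Minimal sketch (assumed example): multinomial MLE as normalized state counts.
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 1000
states = rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])   # hypothetical draws

x = np.eye(K)[states]          # one-hot encoding of x_i^t
p_hat = x.sum(axis=0) / N      # MLE: p_i = (sum_t x_i^t) / N
print(p_hat)                   # close to [0.1, 0.2, 0.3, 0.4]
```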

Gaussian (Normal) Distribution
$p(x) = \mathcal{N}(\mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\dfrac{(x-\mu)^2}{2\sigma^2}\right]$
Given a sample $\mathcal{X} = \{x^t\}_{t=1}^N$ with $x^t \sim \mathcal{N}(\mu, \sigma^2)$, the log likelihood of the Gaussian sample is
$\mathcal{L}(\mu, \sigma \mid \mathcal{X}) = -\dfrac{N}{2}\log(2\pi) - N \log \sigma - \dfrac{\sum_t (x^t - \mu)^2}{2\sigma^2}$ (Why? $N$ i.i.d. samples.)

Gaussian (Normal) Distribution
MLE for $\mu$ and $\sigma^2$ (Exercise!):
$m = \dfrac{\sum_t x^t}{N}, \qquad s^2 = \dfrac{\sum_t (x^t - m)^2}{N}$
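A minimal sketch with assumed data: the closed-form Gaussian MLE computed directly from a sample.

```python
# Minimal sketch (assumed example): closed-form Gaussian MLE for mu and sigma^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # hypothetical sample

m  = x.sum() / len(x)                 # MLE of mu: the sample mean
s2 = ((x - m) ** 2).sum() / len(x)    # MLE of sigma^2 (divides by N, not N-1)
print(m, s2)                          # close to 2.0 and 1.5**2 = 2.25
```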

Bias and Variance
Unknown parameter $\theta$; estimator $d_i = d(\mathcal{X}_i)$ on sample $\mathcal{X}_i$.
Bias: $b_\theta(d) = E[d] - \theta$
Variance: $E[(d - E[d])^2]$
If $b_\theta(d) = 0$, $d$ is an unbiased estimator of $\theta$.
If, in addition, $E[(d - E[d])^2] \to 0$ as $N \to \infty$, $d$ is a consistent estimator of $\theta$.

Expected Value
If the probability distribution of $X$ admits a probability density function $f(x)$, then the expected value can be computed as $E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$.
It follows directly from the discrete-case definition that if $X$ is a constant random variable, i.e. $X = b$ for some fixed real number $b$, then the expected value of $X$ is also $b$.
The expected value of an arbitrary function of $X$, $g(X)$, with respect to the probability density function $f(x)$ is given by the inner product of $f$ and $g$: $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx$.
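A small illustration (assumed example, not from the slides): both integrals can be evaluated numerically, here for a standard normal density.

```python
# Minimal sketch (assumed example): E[X] and E[g(X)] for a standard normal,
# computed by numerically integrating x*f(x) and g(x)*f(x).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f = norm(loc=0.0, scale=1.0).pdf
E_X,  _ = quad(lambda x: x * f(x), -np.inf, np.inf)          # ~ 0
E_X2, _ = quad(lambda x: (x ** 2) * f(x), -np.inf, np.inf)   # E[g(X)] with g(x) = x^2, ~ 1
print(E_X, E_X2)
```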

Bias and Variance
For example, the sample mean $m$: $E[m] = \mu$, so $m$ is an unbiased estimator of $\mu$, and $\mathrm{Var}[m] = \sigma^2 / N$.
$\mathrm{Var}[m] \to 0$ as $N \to \infty$, so $m$ is also a consistent estimator.

Bias and Variance
For example, $E[s^2] = \dfrac{N-1}{N}\sigma^2$, so $s^2$ is a biased estimator of $\sigma^2$, and $\dfrac{N}{N-1}\, s^2$ is an unbiased estimator of $\sigma^2$.
Mean square error (see p. 66): $r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \text{Bias}^2 + \text{Variance}$
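A minimal simulation sketch (assumed example): drawing many samples shows that dividing by $N$ underestimates $\sigma^2$, while the $N/(N-1)$ correction removes the bias.

```python
# Minimal sketch (assumed example): simulate many samples to see that s^2
# (divide by N) is biased for sigma^2 while N/(N-1) * s^2 is unbiased.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, M = 0.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(M, N))
m  = samples.mean(axis=1)
s2 = ((samples - m[:, None]) ** 2).mean(axis=1)   # MLE variance, divides by N

print(s2.mean(), (N / (N - 1)) * s2.mean(), sigma ** 2)
# E[s^2] ~ (N-1)/N * sigma^2 = 3.6, while the corrected estimator ~ 4.0
```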

Standard Deviation
In statistics, the standard deviation is often estimated from a random sample drawn from the population. The most common measure used is the sample standard deviation, defined by
$s = \sqrt{\dfrac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$,
where $\{x_1, \ldots, x_N\}$ is the sample (formally, realizations from a random variable $X$) and $\bar{x}$ is the sample mean.
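A short sketch with assumed data: the $N-1$ divisor corresponds to `ddof=1` in numpy.

```python
# Minimal sketch (assumed example): sample standard deviation with the N-1 divisor.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical sample
s = np.sqrt(((x - x.mean()) ** 2).sum() / (len(x) - 1))
print(s, np.std(x, ddof=1))   # identical: ddof=1 gives the N-1 divisor
```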

Bayes' Estimator
Treat $\theta$ as a random variable with prior $p(\theta)$.
Bayes' rule: $p(\theta \mid \mathcal{X}) = \dfrac{p(\mathcal{X} \mid \theta)\, p(\theta)}{p(\mathcal{X})}$
Maximum a Posteriori (MAP): $\theta_{\text{MAP}} = \arg\max_\theta p(\theta \mid \mathcal{X})$
Maximum Likelihood (ML): $\theta_{\text{ML}} = \arg\max_\theta p(\mathcal{X} \mid \theta)$
Bayes' estimator: $\theta_{\text{Bayes}} = E[\theta \mid \mathcal{X}] = \int \theta\, p(\theta \mid \mathcal{X})\, d\theta$

MAP vs. ML
If $p(\theta)$ is a uniform distribution, then
$\theta_{\text{MAP}} = \arg\max_\theta p(\theta \mid \mathcal{X}) = \arg\max_\theta \dfrac{p(\mathcal{X} \mid \theta)\, p(\theta)}{p(\mathcal{X})} = \arg\max_\theta p(\mathcal{X} \mid \theta) = \theta_{\text{ML}}$,
since $p(\theta) / p(\mathcal{X})$ is a constant. Thus $\theta_{\text{MAP}} = \theta_{\text{ML}}$.

Bayes' Estimator: Example
Suppose $x^t \sim \mathcal{N}(\theta, \sigma^2)$ and $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$. Then $p(\theta \mid \mathcal{X})$ is normal, so $\theta_{\text{ML}} = m$ and $\theta_{\text{Bayes}} = \theta_{\text{MAP}}$, with
$\theta_{\text{Bayes}} = E[\theta \mid \mathcal{X}] = \dfrac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, m + \dfrac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0$
The Bayes' estimator is a weighted average of the prior mean $\mu_0$ and the sample mean $m$.
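A minimal sketch with assumed parameter values: the posterior mean is a precision-weighted average of the sample mean and the prior mean.

```python
# Minimal sketch (assumed example): Bayes' estimator for a normal likelihood with
# a normal prior, as a precision-weighted average of sample mean and prior mean.
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma = 3.0, 1.0          # x^t ~ N(theta, sigma^2)
mu0, sigma0 = 0.0, 0.5                # prior: theta ~ N(mu0, sigma0^2)
N = 20

x = rng.normal(theta_true, sigma, size=N)
m = x.mean()                          # theta_ML

w = (N / sigma ** 2) / (N / sigma ** 2 + 1 / sigma0 ** 2)
theta_bayes = w * m + (1 - w) * mu0   # posterior mean = MAP here (normal posterior)
print(m, theta_bayes)                 # the Bayes' estimate is pulled toward mu0
```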

Parametric Classification
The discriminant function: $g_i(x) = p(x \mid C_i)\, P(C_i)$, or equivalently $g_i(x) = \log p(x \mid C_i) + \log P(C_i)$.
Assume the class likelihoods $p(x \mid C_i)$ are Gaussian:
$p(x \mid C_i) = \dfrac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left[-\dfrac{(x - \mu_i)^2}{2\sigma_i^2}\right]$
so the discriminant (the log likelihood of a Gaussian sample plus the log prior) is
$g_i(x) = -\dfrac{1}{2}\log 2\pi - \log \sigma_i - \dfrac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)$

Given the sample $\mathcal{X} = \{x^t, \mathbf{r}^t\}_{t=1}^N$, where $r_i^t = 1$ if $x^t \in C_i$ and $0$ otherwise, the Maximum Likelihood (ML) estimates are
$\hat{P}(C_i) = \dfrac{\sum_t r_i^t}{N}, \qquad m_i = \dfrac{\sum_t x^t r_i^t}{\sum_t r_i^t}, \qquad s_i^2 = \dfrac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}$
and the discriminant becomes
$g_i(x) = -\dfrac{1}{2}\log 2\pi - \log s_i - \dfrac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$

The first term is a constant, and if the priors are equal, those terms can be dropped. If we further assume that the variances are equal, the discriminant becomes
$g_i(x) = -(x - m_i)^2$
Choose $C_i$ if $|x - m_i| = \min_k |x - m_k|$, i.e., assign $x$ to the class with the nearest mean.
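A minimal sketch with assumed data: fit the per-class ML estimates above and evaluate the discriminants for query points.

```python
# Minimal sketch (assumed example): 1-D parametric classification with per-class
# Gaussian likelihoods, using the ML estimates P(C_i), m_i, s_i^2 from the slides.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-2.0, 1.0, size=300)           # hypothetical class 0 sample
x1 = rng.normal( 3.0, 2.0, size=200)           # hypothetical class 1 sample
x  = np.concatenate([x0, x1])
y  = np.concatenate([np.zeros(len(x0), int), np.ones(len(x1), int)])

priors = np.array([np.mean(y == i) for i in (0, 1)])
means  = np.array([x[y == i].mean() for i in (0, 1)])
vars_  = np.array([x[y == i].var()  for i in (0, 1)])   # divides by N_i (MLE)

def g(xq):
    """Discriminant g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    xq = np.atleast_1d(xq)[:, None]
    return -0.5 * np.log(vars_) - (xq - means) ** 2 / (2 * vars_) + np.log(priors)

print(g([0.0, 2.0]).argmax(axis=1))   # predicted class for each query point
```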

Equal variances: a single boundary halfway between the means.
Figure: likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.

Variances are different: two boundaries.
Figure: likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points.

Exercise
For a two-class problem, generate normal samples for two classes with different variances, then use parametric classification to estimate the discriminant points. Compare these with the theoretical values. PS. You can use normal sample generation tools.
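A possible sketch of this exercise, under assumed parameter values: the discriminant points are the roots of $g_1(x) = g_2(x)$, a quadratic in $x$ when the variances differ, computed once from the true parameters and once from the estimates.

```python
# Minimal sketch of the exercise (assumed parameter values): estimate the two
# discriminant points where g_1(x) = g_2(x) and compare with the theoretical ones.
import numpy as np

rng = np.random.default_rng(0)
mu, sd, prior = (-1.0, 2.0), (1.0, 1.5), (0.5, 0.5)   # true class parameters

def crossings(means, sds, priors):
    """Roots of g_1(x) - g_2(x) = 0 for 1-D Gaussian classes (a quadratic in x)."""
    a = 1 / (2 * sds[1] ** 2) - 1 / (2 * sds[0] ** 2)
    b = means[0] / sds[0] ** 2 - means[1] / sds[1] ** 2
    c = (means[1] ** 2 / (2 * sds[1] ** 2) - means[0] ** 2 / (2 * sds[0] ** 2)
         + np.log(sds[1] / sds[0]) + np.log(priors[0] / priors[1]))
    return np.sort(np.roots([a, b, c]))

# Parameters estimated from generated samples
n = (500, 500)
xs = [rng.normal(mu[i], sd[i], size=n[i]) for i in (0, 1)]
m  = [s.mean() for s in xs]
s  = [s.std() for s in xs]

print("theoretical:", crossings(mu, sd, prior))
print("estimated:  ", crossings(m, s, (n[0] / sum(n), n[1] / sum(n))))
```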

Regression
Regression assumes zero-mean Gaussian noise added to the model; here, the model is linear:
$r = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$
so $p(r \mid x) \sim \mathcal{N}\!\left(g(x \mid \theta), \sigma^2\right)$: the probability of the output given the input.
Given a sample $\mathcal{X} = \{x^t, r^t\}_{t=1}^N$, the log likelihood is
$\mathcal{L}(\theta \mid \mathcal{X}) = \log \prod_t p(x^t, r^t) = \log \prod_t p(r^t \mid x^t) + \log \prod_t p(x^t)$

Regression: From Log Likelihood to Error
Ignore the second term (because it does not depend on our estimator):
$\mathcal{L}(\theta \mid \mathcal{X}) = -N \log\!\left(\sqrt{2\pi}\,\sigma\right) - \dfrac{1}{2\sigma^2} \sum_t \left[r^t - g(x^t \mid \theta)\right]^2$
Maximizing this is equivalent to minimizing
$E(\theta \mid \mathcal{X}) = \dfrac{1}{2} \sum_t \left[r^t - g(x^t \mid \theta)\right]^2$
which gives the least squares estimate.

Example: Linear Regression (Exercise!!)
With the linear model $g(x^t \mid w_1, w_0) = w_1 x^t + w_0$, minimizing $E$ gives the normal equations
$\sum_t r^t = N w_0 + w_1 \sum_t x^t, \qquad \sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t (x^t)^2$
which can be solved for $w_0$ and $w_1$.
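A minimal sketch with assumed data: solve the two normal equations as a 2x2 linear system.

```python
# Minimal sketch (assumed example): least squares linear regression by solving
# the normal equations A w = y for w = (w0, w1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
r = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=len(x))   # hypothetical data, w1=2, w0=1

A = np.array([[len(x),      x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)
print(w0, w1)   # close to 1.0 and 2.0
```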

Regression
Figure (see page 36): linear, second-order, and sixth-order polynomials fitted to the same set of points; for the linear fit, what are $w_0$ and $w_1$?

Example: Polynomial Regression
With a $k$-th order polynomial model $g(x^t \mid w_k, \ldots, w_1, w_0) = w_k (x^t)^k + \cdots + w_1 x^t + w_0$, write the inputs in the design matrix $\mathbf{D}$ (whose row $t$ is $[1, x^t, (x^t)^2, \ldots, (x^t)^k]$) and the outputs in the vector $\mathbf{r}$. The least squares solution is
$\mathbf{w} = (\mathbf{D}^T \mathbf{D})^{-1} \mathbf{D}^T \mathbf{r}$
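A minimal sketch with assumed data: build the design matrix and solve the least squares problem.

```python
# Minimal sketch (assumed example): k-th order polynomial regression via the
# design matrix D and the least squares solution w = (D^T D)^{-1} D^T r.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
r = np.sin(3 * x) + rng.normal(0, 0.1, size=len(x))   # hypothetical data

k = 3
D = np.vander(x, k + 1, increasing=True)   # columns: 1, x, x^2, ..., x^k
w, *_ = np.linalg.lstsq(D, r, rcond=None)  # numerically stabler than inverting D^T D
print(w)                                   # w0, w1, ..., wk
```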

Other Error Measures
Square Error: $E(\theta \mid \mathcal{X}) = \dfrac{1}{2} \sum_t \left[r^t - g(x^t \mid \theta)\right]^2$
Relative Square Error: $E(\theta \mid \mathcal{X}) = \dfrac{\sum_t \left[r^t - g(x^t \mid \theta)\right]^2}{\sum_t \left[r^t - \bar{r}\right]^2}$
Absolute Error: $E(\theta \mid \mathcal{X}) = \sum_t \left|r^t - g(x^t \mid \theta)\right|$
$\epsilon$-sensitive Error: $E(\theta \mid \mathcal{X}) = \sum_t \mathbf{1}\!\left(\left|r^t - g(x^t \mid \theta)\right| > \epsilon\right)\left(\left|r^t - g(x^t \mid \theta)\right| - \epsilon\right)$
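A minimal sketch (assumed example): the four error measures written as small functions.

```python
# Minimal sketch (assumed example): the four error measures from the slide.
import numpy as np

def square_error(r, g):
    return 0.5 * np.sum((r - g) ** 2)

def relative_square_error(r, g):
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps=0.1):
    d = np.abs(r - g)
    return np.sum((d > eps) * (d - eps))   # errors within eps are ignored

r = np.array([1.0, 2.0, 3.0])
g = np.array([1.1, 1.8, 3.4])   # hypothetical predictions
print(square_error(r, g), relative_square_error(r, g),
      absolute_error(r, g), eps_sensitive_error(r, g))
```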

Bias and Variance
The expected square error at $x$ (see Eq. 4.11, pages 66 and 76):
$E\!\left[(r - g(x))^2 \mid x\right] = \underbrace{E\!\left[(r - E[r \mid x])^2 \mid x\right]}_{\text{noise}} + \underbrace{\left(E[r \mid x] - g(x)\right)^2}_{\text{squared error}}$
Taking the expected value over samples $\mathcal{X}$ (all of size $N$ and drawn from the same joint density $p(x, r)$), the squared error term decomposes further (see Eq. 4.17):
$E_{\mathcal{X}}\!\left[\left(E[r \mid x] - g(x)\right)^2\right] = \underbrace{\left(E[r \mid x] - E_{\mathcal{X}}[g(x)]\right)^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{X}}\!\left[\left(g(x) - E_{\mathcal{X}}[g(x)]\right)^2\right]}_{\text{variance}}$
Squared error $=$ bias$^2$ $+$ variance.

Estimating Bias and Variance
$M$ samples $\mathcal{X}_i = \{x^t_i, r^t_i\}$, $i = 1, \ldots, M$, $t = 1, \ldots, N$, are used to fit $g_i(x)$, $i = 1, \ldots, M$. With $\bar{g}(x) = \dfrac{1}{M} \sum_i g_i(x)$,
$\text{Bias}^2(g) = \dfrac{1}{N} \sum_t \left[\bar{g}(x^t) - f(x^t)\right]^2, \qquad \text{Variance}(g) = \dfrac{1}{N M} \sum_t \sum_i \left[g_i(x^t) - \bar{g}(x^t)\right]^2$

Bias/Variance Dilemma
Examples: $g_i(x) = 2$ has no variance and high bias; $g_i(x) = \sum_t r_i^t / N$ has lower bias, but with variance.
As we increase model complexity, bias decreases (a better fit to data) and variance increases (the fit varies more with the data): the bias/variance dilemma (Geman et al., 1992).
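A minimal simulation sketch with an assumed true function: estimate bias$^2$ and variance over $M$ samples, as in the formulas above, for the two example estimators.

```python
# Minimal sketch (assumed example): estimate bias^2 and variance over M samples
# for the two estimators from the slide, g_i(x) = 2 and g_i(x) = mean of r_i.
import numpy as np

rng = np.random.default_rng(0)
M, N = 200, 25
f = lambda x: 2.0 * np.sin(1.5 * x)               # hypothetical true function
x = np.linspace(0, 5, N)                          # fixed evaluation points

# M datasets: same x, noisy responses r^t_i = f(x^t) + noise
R = f(x) + rng.normal(0, 1.0, size=(M, N))

def bias2_and_variance(G):
    """G[i, t] = g_i(x^t); returns (bias^2, variance) averaged over the x^t."""
    g_bar = G.mean(axis=0)
    bias2 = np.mean((g_bar - f(x)) ** 2)
    var   = np.mean((G - g_bar) ** 2)
    return bias2, var

G_const = np.full((M, N), 2.0)                                  # g_i(x) = 2
G_mean  = np.repeat(R.mean(axis=1, keepdims=True), N, axis=1)   # g_i(x) = mean of r_i
print("constant:", bias2_and_variance(G_const))   # zero variance, high bias
print("mean:    ", bias2_and_variance(G_mean))    # lower bias, nonzero variance
```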

Figure: the true function $f$, the fitted functions $g_i$ and their average $\bar{g}$; the distance between $\bar{g}$ and $f$ is the bias, and the spread of the $g_i$ around $\bar{g}$ is the variance.

Polynomial Regression
Figure: polynomial fits of increasing order to the same data; low orders underfit, high orders overfit, and the best fit is the one with minimum error.

Figure: error versus model complexity; the best fit is at the "elbow" of the curve.

Model Selection (1)
Cross-validation: measure generalization accuracy by testing on data unused during training, to find the optimal complexity.
Regularization: penalize complex models, $E' = \text{error on data} + \lambda \cdot \text{model complexity}$.
Structural risk minimization (SRM): find the model that is simplest in terms of order and best in terms of empirical error on the data. Model complexity measures: polynomials of increasing order, VC dimension, ...
Minimum description length (MDL): the Kolmogorov complexity of a data set is defined as the shortest description of the data.
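A minimal sketch of the cross-validation idea, with assumed data: fit polynomials of increasing order on a training split and pick the order with the lowest error on the held-out split.

```python
# Minimal sketch (assumed example): choose polynomial order by cross-validation,
# measuring error on held-out data unused during training.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
r = np.sin(3 * x) + rng.normal(0, 0.2, size=len(x))   # hypothetical data

train = rng.permutation(len(x)) < 40                   # 40 train / 20 validation
x_tr, r_tr, x_va, r_va = x[train], r[train], x[~train], r[~train]

for k in range(1, 9):
    w = np.polyfit(x_tr, r_tr, deg=k)                  # fit order-k polynomial
    err = np.mean((np.polyval(w, x_va) - r_va) ** 2)   # validation error
    print(k, round(err, 4))
# Pick the order with the lowest validation error.
```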

Model Selection (2)
Bayesian model selection: place a prior on models, $p(\text{model})$, and apply Bayes' rule:
$p(\text{model} \mid \text{data}) = \dfrac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$
Discussion: when the prior is chosen such that we give higher probabilities to simpler models, the Bayesian approach, regularization, SRM, and MDL are equivalent. Cross-validation is the best approach if there is a large enough validation dataset.

Regression Example
Coefficients increase in magnitude as order increases (compare with Fig. 4.5):
order 1: [ …, … ]
order 2: [ 0.1682, …, … ]
order 3: [ 0.4238, …, …, … ]
order 4: [ …, …, …, …, … ]
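A minimal sketch with assumed data (the values above come from the original slide and are not reproduced here): fitting polynomials of increasing order to the same small sample shows the coefficient magnitudes growing, which is what regularization penalizes.

```python
# Minimal sketch (assumed example): fit polynomials of increasing order to the
# same small sample and watch the coefficient magnitudes grow with the order.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
r = np.sin(3 * x) + rng.normal(0, 0.2, size=len(x))   # hypothetical data

for k in range(1, 6):
    w = np.polyfit(x, r, deg=k)
    print(k, np.round(np.abs(w).max(), 3))   # largest |coefficient| for each order
```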