Lecture Slides for Introduction to Machine Learning 2e, Chapter 4: Parametric Methods
ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Parametric Estimation
Given a sample X = {x^t}, t = 1, …, N, where x^t ~ p(x). Here x is one-dimensional and the densities are univariate.
Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X; e.g., p(x) = N(μ, σ²), where θ = {μ, σ²}.
Maximum Likelihood Estimation
Likelihood of θ given the sample X: l(θ|X) ≡ p(X|θ) = ∏_{t=1..N} p(x^t|θ)
Log likelihood: L(θ|X) ≡ log l(θ|X) = Σ_{t=1..N} log p(x^t|θ)
Maximum likelihood estimator (MLE): θ* = arg max_θ L(θ|X)
Examples: Bernoulli Density
Two states, failure/success, x ∈ {0, 1}: P(x) = p^x (1 − p)^(1−x)
l(p|X) = ∏_{t=1..N} p^(x^t) (1 − p)^(1−x^t)
L(p|X) = log l(p|X) = (Σ_t x^t) log p + (N − Σ_t x^t) log(1 − p)
Setting dL/dp = 0 gives the MLE (maximum likelihood estimate): p̂ = (Σ_t x^t) / N, the sample proportion of successes.
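A minimal sketch of the Bernoulli MLE on simulated data; the true p, the sample size, and the seed are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(42)
true_p, N = 0.3, 1000                    # assumed illustration values
x = rng.binomial(1, true_p, size=N)      # 0/1 outcomes, x^t ~ Bernoulli(p)

p_hat = x.mean()                         # MLE: p_hat = (sum_t x^t) / N
print(f"MLE of p: {p_hat:.3f} (true p = {true_p})")
```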
Examples: Multinomial Density
K > 2 mutually exclusive states, x_i ∈ {0, 1}: P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
L(p_1, ..., p_K | X) = log ∏_t ∏_i p_i^(x_i^t), where x_i^t = 1 if experiment t chooses state i and x_i^t = 0 otherwise
MLE (maximizing subject to Σ_i p_i = 1): p̂_i = (Σ_t x_i^t) / N, the fraction of experiments that chose state i.
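A similar sketch for the multinomial case; K, N, and the true probabilities below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
K, N = 4, 2000
true_p = np.array([0.1, 0.2, 0.3, 0.4])       # assumed state probabilities
states = rng.choice(K, size=N, p=true_p)      # each experiment picks one of K states

p_hat = np.bincount(states, minlength=K) / N  # MLE: p_i = (# times state i chosen) / N
print(np.round(p_hat, 3))
```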
Gaussian (Normal) Distribution
p(x) = N(μ, σ²): p(x) = (1 / (√(2π) σ)) exp[ −(x − μ)² / (2σ²) ]
Given a sample X = {x^t}, t = 1, …, N, with x^t ~ N(μ, σ²), the log likelihood of the Gaussian sample is
L(μ, σ | X) = −(N/2) log(2π) − N log σ − Σ_t (x^t − μ)² / (2σ²)
(Why? The N samples are independent, so the likelihood is a product and the log likelihood a sum.)
Gaussian (Normal) Distribution
MLE for μ and σ² (exercise: set ∂L/∂μ = 0 and ∂L/∂σ = 0):
m = Σ_t x^t / N
s² = Σ_t (x^t − m)² / N
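A short sketch of these estimates on simulated data; μ, σ, and N are made-up values.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, N = 5.0, 2.0, 1000
x = rng.normal(mu, sigma, N)

m = x.mean()                     # MLE of mu:      m   = sum_t x^t / N
s2 = np.mean((x - m) ** 2)       # MLE of sigma^2: s^2 = sum_t (x^t - m)^2 / N
print(f"m = {m:.3f}, s^2 = {s2:.3f} (true: {mu}, {sigma**2})")
```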
Bias and Variance
Unknown parameter θ; estimator d_i = d(X_i) computed on sample X_i.
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
If b_θ(d) = 0 for all θ, d is an unbiased estimator of θ.
If both the bias and the variance E[(d − E[d])²] go to 0 as N → ∞, d is a consistent estimator of θ.
Expected Value
If the probability distribution of X admits a probability density function f(x), then the expected value can be computed as E[X] = ∫ x f(x) dx.
It follows directly from the discrete-case definition that if X is a constant random variable, i.e. X = b for some fixed real number b, then E[X] = b.
The expected value of an arbitrary function g(X), with respect to the probability density function f(x), is the inner product of f and g: E[g(X)] = ∫ g(x) f(x) dx.
http://en.wikipedia.org/wiki/Expected_value
Bias and Variance
For example, for the sample mean m = Σ_t x^t / N:
E[m] = μ, so m is an unbiased estimator of μ
Var(m) = σ²/N → 0 as N → ∞, so m is also a consistent estimator.
Bias and Variance
For example (see pp. 65-66), for the sample variance s² = Σ_t (x^t − m)² / N:
E[s²] = ((N − 1)/N) σ², so s² is a biased estimator of σ²
(N/(N − 1)) s² is an unbiased estimator of σ²
Mean square error: r(d, θ) = E[(d − θ)²] (see p. 66 and the next slide)
= (E[d] − θ)² + E[(d − E[d])²]
= Bias² + Variance
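A simulation sketch of this bias (the parameters and number of trials are made-up illustration values): averaging s² over many samples comes out near ((N − 1)/N) σ², while the corrected estimator averages near σ².

```python
import numpy as np

rng = np.random.default_rng(11)
mu, sigma2, N, trials = 0.0, 4.0, 10, 100_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
m = x.mean(axis=1, keepdims=True)
s2 = np.mean((x - m) ** 2, axis=1)           # divides by N -> biased estimator

print("E[s^2]          ~", s2.mean())                    # about ((N-1)/N)*sigma2 = 3.6
print("E[N/(N-1) s^2]  ~", (N / (N - 1) * s2).mean())    # about sigma2 = 4.0
```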
Standard Deviation
In statistics, the standard deviation is often estimated from a random sample drawn from the population. The most common measure is the sample standard deviation,
s = √( (1/(N − 1)) Σ_{t=1..N} (x^t − m)² ),
where x^1, …, x^N is the sample (formally, realizations of a random variable X) and m is the sample mean.
http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation
Bayes' Estimator
Treat θ as a random variable with prior p(θ).
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Maximum a Posteriori (MAP): θ_MAP = arg max_θ p(θ|X)
Maximum Likelihood (ML): θ_ML = arg max_θ p(X|θ)
Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
MAP vs ML
If p(θ) is a uniform distribution, then
θ_MAP = arg max_θ p(θ|X) = arg max_θ p(X|θ) p(θ) / p(X) = arg max_θ p(X|θ) = θ_ML,
since p(θ) / p(X) does not depend on θ. Hence θ_MAP = θ_ML under a uniform (flat) prior.
Bayes' Estimator: Example
If p(θ|X) is normal, its mean and mode coincide, so θ_Bayes = θ_MAP, and θ_ML = m.
Example: suppose x^t ~ N(θ, σ²) and the prior is θ ~ N(μ_0, σ_0²). The posterior is normal with mean
θ_Bayes = E[θ|X] = [ (N/σ²) m + (1/σ_0²) μ_0 ] / [ N/σ² + 1/σ_0² ]
The Bayes' estimator is a weighted average of the prior mean μ_0 and the sample mean m; as N grows, the weight on the sample mean increases.
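A sketch of this weighted average on simulated data; the true θ, σ, and the prior parameters are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(13)
theta_true, sigma = 1.5, 1.0             # sampling distribution N(theta, sigma^2)
mu0, sigma0 = 0.0, 0.5                   # prior: theta ~ N(mu0, sigma0^2)

for N in (5, 50, 500):
    x = rng.normal(theta_true, sigma, N)
    m = x.mean()                                          # theta_ML
    w = (N / sigma**2) / (N / sigma**2 + 1 / sigma0**2)   # weight on the sample mean
    theta_bayes = w * m + (1 - w) * mu0
    print(f"N={N:4d}: theta_ML = {m:.3f}, theta_Bayes = {theta_bayes:.3f}")
# As N grows, the posterior weight on the data increases and theta_Bayes -> theta_ML.
```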
Parametric Classification
The discriminant function: g_i(x) = p(x|C_i) P(C_i), or equivalently g_i(x) = log p(x|C_i) + log P(C_i).
Assume the class likelihoods p(x|C_i) are Gaussian, p(x|C_i) = N(μ_i, σ_i²). Using the log likelihood of a Gaussian,
g_i(x) = −(1/2) log(2π) − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)
Given the sample X = {x^t, r^t}, t = 1, …, N, where r_i^t = 1 if x^t ∈ C_i and 0 otherwise, the maximum likelihood (ML) estimates are
P̂(C_i) = Σ_t r_i^t / N,  m_i = Σ_t x^t r_i^t / Σ_t r_i^t,  s_i² = Σ_t (x^t − m_i)² r_i^t / Σ_t r_i^t
and the discriminant becomes
g_i(x) = −(1/2) log(2π) − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
The first term, −(1/2) log(2π), is a constant common to all classes and can be dropped; if the priors are equal, the log P̂(C_i) terms can be dropped as well.
If, in addition, the variances are equal, the discriminant reduces to g_i(x) = −(x − m_i)², so we
choose C_i if |x − m_i| = min_k |x − m_k|, i.e., assign x to the class with the nearest mean.
Equal variances: a single decision boundary, halfway between the means.
(Figure: likelihood functions and posteriors with equal priors for two classes with one-dimensional input; the variances are equal and the posteriors intersect at one point, which is the decision threshold.)
Variances are different: two decision boundaries.
(Figure: likelihood functions and posteriors with equal priors for two classes with one-dimensional input; the variances are unequal and the posteriors intersect at two points.)
Exercise
For a two-class problem, generate normal samples for the two classes with different variances, then use parametric classification to estimate the discriminant points. Compare these with the theoretical values. (You may use any normal sample generation tool.)
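One possible sketch of this exercise; the class parameters, equal priors, and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = [-1.0, 2.0], [1.0, 2.0]        # assumed true means / unequal std. devs.
N = 500                                    # samples per class

m, s2 = [], []
for i in range(2):
    x = rng.normal(mu[i], sigma[i], N)
    m.append(x.mean())                     # ML estimate of the class mean
    s2.append(np.mean((x - m[i]) ** 2))    # ML estimate of the class variance

# With equal priors, the boundary solves g_0(x) = g_1(x), where
# g_i(x) = -0.5*log(s_i^2) - (x - m_i)^2 / (2 s_i^2).  This is a quadratic in x,
# so there can be two discriminant points when the variances differ.
def boundaries(m0, m1, v0, v1):
    a = 1 / (2 * v1) - 1 / (2 * v0)
    b = m0 / v0 - m1 / v1
    c = m1**2 / (2 * v1) - m0**2 / (2 * v0) + 0.5 * np.log(v1 / v0)
    return np.sort(np.roots([a, b, c]))

print("estimated  :", boundaries(m[0], m[1], s2[0], s2[1]))
print("theoretical:", boundaries(mu[0], mu[1], sigma[0]**2, sigma[1]**2))
```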
Regression
Regression assumes zero-mean Gaussian noise added to the model: r = f(x) + ε, with ε ~ N(0, σ²). Here the model g(x|θ) is linear.
Then p(r|x) ~ N(g(x|θ), σ²): the probability of the output given the input.
Given a sample X = {x^t, r^t}, t = 1, …, N, the log likelihood is
L(θ|X) = log ∏_t p(x^t, r^t) = Σ_t log p(r^t|x^t) + Σ_t log p(x^t)
Regression: From Log Likelihood to Error
Ignore the second term, Σ_t log p(x^t), because it does not depend on our estimator θ:
L(θ|X) = −N log(√(2π) σ) − (1/(2σ²)) Σ_t [r^t − g(x^t|θ)]²
Maximizing this is equivalent to minimizing the error
E(θ|X) = (1/2) Σ_t [r^t − g(x^t|θ)]²
which yields the least squares estimate.
Example: Linear Regression (Exercise!)
g(x | w_1, w_0) = w_1 x + w_0. Setting the derivatives of E(w_1, w_0 | X) to zero gives two linear equations in the two unknowns (the normal equations):
Σ_t r^t = N w_0 + w_1 Σ_t x^t
Σ_t r^t x^t = w_0 Σ_t x^t + w_1 Σ_t (x^t)²
which can be solved for w_0 and w_1.
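A minimal sketch of solving these normal equations on made-up data; the underlying line and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 50)
r = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)   # assumed data: r = 2x + 1 + noise

# Solve the normal equations A w = y for w = [w_0, w_1].
A = np.array([[x.size,      x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (x * r).sum()])
w0, w1 = np.linalg.solve(A, y)
print(f"g(x) = {w1:.3f} x + {w0:.3f}")
```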
Example: Polynomial Regression
g(x | w_k, …, w_2, w_1, w_0) = w_k x^k + … + w_2 x² + w_1 x + w_0
Writing the inputs as the rows of a matrix D (columns 1, x^t, (x^t)², …, (x^t)^k) and the outputs as the vector r, the least squares estimate is w = (DᵀD)⁻¹ Dᵀ r.
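A sketch of the polynomial least squares fit; the data-generating function, noise level, and order k below are made-up choices.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1, 1, 30))
r = np.sin(3 * x) + rng.normal(0, 0.1, x.size)   # assumed underlying function + noise

k = 3
D = np.vander(x, k + 1, increasing=True)         # columns: 1, x, x^2, ..., x^k
w, *_ = np.linalg.lstsq(D, r, rcond=None)        # w = (D^T D)^{-1} D^T r
print("coefficients w_0..w_k:", np.round(w, 3))
```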
Other Error Measures
Square error: E(θ|X) = Σ_t [r^t − g(x^t|θ)]²
Relative square error: E(θ|X) = Σ_t [r^t − g(x^t|θ)]² / Σ_t [r^t − r̄]²
Absolute error: E(θ|X) = Σ_t |r^t − g(x^t|θ)|
ε-sensitive error: E(θ|X) = Σ_t 1(|r^t − g(x^t|θ)| > ε), which ignores errors smaller than ε
Bias and Variance (see Eq. 4.17)
The expected square error at a particular point x, with respect to a fixed g(x) and the variation in r given by p(r|x):
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                   =        noise         +    squared error
The first term is the variance of r given x (noise); it does not depend on g and cannot be removed. The second term is the squared error of our estimate at point x. Note that g(·) is itself a random variable, a function of the sample S used to fit it, so this squared error varies from sample to sample.
Bias and Variance (see Eq. 4.17)
The pointwise decomposition above, E[(r − g(x))² | x] = noise + squared error, holds at a single x. We would like to average it over all points x in the input space:
E_x[ E[(r − g(x))² | x] ] = ∫ E[(r − g(x))² | x] p(x) dx
Bias and Variance (see Eq. 4.11, and pp. 66 and 76)
Since g(·) is a random variable (a function of the sample), take the expected value of the squared-error term over samples X, all of size N and drawn from the same joint density p(x, r):
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
         squared error    =         bias²         +        variance
so the expected squared error decomposes as bias² + variance (plus the irreducible noise).
Estimating Bias and Variance
M samples X_i = {x^t_i, r^t_i}, i = 1, …, M, t = 1, …, N, are used to fit g_i(x), i = 1, …, M. With the average fit ḡ(x) = (1/M) Σ_i g_i(x) and the underlying function f(x), estimate
Bias²(g) = (1/N) Σ_t [ḡ(x^t) − f(x^t)]²
Variance(g) = (1/(N·M)) Σ_t Σ_i [g_i(x^t) − ḡ(x^t)]²
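A simulation sketch of this recipe; the true function f, the noise level, M, N, and the polynomial order are all made-up illustration choices.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)      # assumed "true" function
M, N, order = 100, 25, 3
x_eval = np.linspace(0, 1, N)            # evaluate every fit on the same points

g = np.empty((M, N))
for i in range(M):                       # draw M independent training samples
    x = rng.uniform(0, 1, N)
    r = f(x) + rng.normal(0, 0.3, N)
    w = np.polyfit(x, r, order)          # g_i: polynomial fit to sample i
    g[i] = np.polyval(w, x_eval)

g_bar = g.mean(axis=0)                                 # average fit over samples
bias2 = np.mean((g_bar - f(x_eval)) ** 2)              # (1/N) sum_t [g_bar - f]^2
variance = np.mean((g - g_bar) ** 2)                   # (1/NM) sum_t sum_i [g_i - g_bar]^2
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```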
Bias/Variance Dilemma
Examples: the constant fit g_i(x) = 2 has no variance but high bias; the sample average g_i(x) = Σ_t r^t_i / N has lower bias but nonzero variance.
As we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the particular sample).
This trade-off is the bias/variance dilemma (Geman et al., 1992).
(Figure: the target function f, individual fits g_i, and their average ḡ, annotated with bias and variance.)
Polynomial Regression
The best fit is the order with minimum error on data not used for fitting ("min error"); orders that are too low underfit, orders that are too high overfit.
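A sketch of this behaviour on simulated data; the data-generating function, noise level, sample sizes, and range of orders are made-up illustration choices.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)
x_tr, x_va = rng.uniform(0, 1, 25), rng.uniform(0, 1, 25)
r_tr = f(x_tr) + rng.normal(0, 0.2, x_tr.size)
r_va = f(x_va) + rng.normal(0, 0.2, x_va.size)

for order in range(1, 9):
    w = np.polyfit(x_tr, r_tr, order)
    e_tr = np.mean((r_tr - np.polyval(w, x_tr)) ** 2)
    e_va = np.mean((r_va - np.polyval(w, x_va)) ** 2)
    print(f"order {order}: train MSE {e_tr:.4f}, validation MSE {e_va:.4f}")
# Training error keeps decreasing with order; validation error typically bottoms out
# at an intermediate order (the "best fit") and then rises again (overfitting).
```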
Best fit (see Fig. 4.7).
Model Selection (1)
Cross-validation: measure generalization accuracy by testing on data unused during training, in order to find the optimal complexity.
Regularization: penalize complex models, E' = error on data + λ · model complexity (a toy instantiation is sketched below).
Structural risk minimization (SRM): find the model that is simplest in terms of order and best in terms of empirical error on the data. Possible model complexity measures: polynomials of increasing order, VC dimension, ...
Minimum description length (MDL): the Kolmogorov complexity of a data set is the length of its shortest description; choose the model that describes the data most compactly.
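A toy instantiation of the regularized error E' above, using the number of polynomial coefficients as the complexity measure; the data and the value of λ are arbitrary illustration choices, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 30)
r = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # made-up data

lam = 0.02
scores = {}
for order in range(1, 9):
    w = np.polyfit(x, r, order)
    mse = np.mean((r - np.polyval(w, x)) ** 2)
    scores[order] = mse + lam * (order + 1)   # E' = data error + lambda * #parameters
best = min(scores, key=scores.get)
print("selected order:", best)
```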
Model Selection (2)
Bayesian model selection: place a prior on models, p(model), and apply Bayes' rule: p(model | data) = p(data | model) p(model) / p(data).
Discussion: when the prior gives higher probabilities to simpler models, the Bayesian approach, regularization, SRM, and MDL are all equivalent. Cross-validation is the best approach when a large enough validation dataset is available.
Regression Example
The coefficients increase in magnitude as the polynomial order increases:
order 1: [-0.0769, 0.0016]
order 2: [0.1682, -0.6657, 0.0080]
order 3: [0.4238, -2.5778, 3.4675, -0.0002]
order 4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
Compare with Fig. 4.5, p. 78.