Chapter 3: Maximum-Likelihood Parameter Estimation


1 Chapter 3: Maximum-Likelihood Parameter Estimation
Sec 1: Introduction
Sec 2: Maximum-Likelihood Estimation
- Multivariate Case: unknown μ, known Σ
- Univariate Case: unknown μ and unknown σ²
- Multivariate Case: unknown μ and unknown Σ
- Bias
- Maximum-Likelihood Problem Statement
Sec 5.1: When Do ML and Bayes Methods Differ?
Sec 7: Problems of Dimensionality

2 Sec 1: Introduction Data availability in a Bayesian framework
We could design an optimal classifier if we knew:
- P(ωi) (the priors)
- P(x | ωi) (the class-conditional densities)
Unfortunately, we rarely have this complete information!
Design a classifier from a training sample:
- No problem with prior estimation
- The number of samples is often too small for class-conditional estimation (large dimension of feature space!)
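As an illustration only (assuming NumPy and SciPy are available; the priors and densities below are made up for the example, not from the slides), the optimal rule with complete information simply picks the class that maximizes P(x | ωi) P(ωi):

```python
# Minimal sketch: Bayes-optimal classification when priors and class-conditional
# densities are fully known (hypothetical one-dimensional, two-class problem).
import numpy as np
from scipy import stats

priors = [0.6, 0.4]                                   # hypothetical P(w_i)
densities = [stats.norm(loc=0.0, scale=1.0),          # hypothetical P(x | w_1)
             stats.norm(loc=2.0, scale=1.0)]          # hypothetical P(x | w_2)

def classify(x):
    # choose the class with the largest P(x | w_i) * P(w_i)
    scores = [p.pdf(x) * prior for p, prior in zip(densities, priors)]
    return int(np.argmax(scores))

print(classify(0.3), classify(1.8))
```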

3 A priori information about the problem: normality of P(x | ωi)
P(x | ωi) ~ N(μi, Σi), characterized by the parameters μi and Σi
Estimation techniques: Maximum-Likelihood and Bayesian estimation
The results are nearly identical, but the approaches are different
We will not cover Bayesian estimation in detail

4 In either approach, we use P(ωi | x) for our classification rule!
Parameters in Maximum-Likelihood estimation are assumed fixed but unknown!
The best parameters are obtained by maximizing the probability of obtaining the samples observed
Bayesian methods view the parameters as random variables having some known prior distribution
In either approach, we use P(ωi | x) for our classification rule!

5 Sec 2: Maximum-Likelihood Estimation
Has good convergence properties as the sample size increases
Simpler than alternative techniques
General principle: assume we have c classes and P(x | ωj) ~ N(μj, Σj)
Write P(x | ωj) ≡ P(x | ωj, θj), where the parameter vector θj consists of the components of μj and Σj

6 Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category
Suppose that D contains n samples, x1, x2, …, xn, and simplify the notation by omitting class distinctions
The Maximum-Likelihood estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ)
"It is the value of θ that best agrees with the actually observed training sample"

7 Figure: the likelihood P(D | θ) and the log-likelihood l(θ) plotted as functions of the unknown mean θ = μ (σ fixed); the training data are shown as red dots.

8 Optimal estimation
Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator with respect to θ
Define the log-likelihood function l(θ) = ln P(D | θ)
New problem statement: determine the θ̂ that maximizes the log-likelihood, θ̂ = arg max_θ l(θ)

9 Since the samples are drawn independently, l(θ) = ∑_{k=1}^{n} ln P(xk | θ), where n = number of training samples
The set of necessary conditions for an optimum is:
∇θ l = ∑_{k=1}^{n} ∇θ ln P(xk | θ) = 0
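The slides solve this gradient condition analytically; purely as an illustration, here is a minimal numerical sketch that finds the same maximizer for a univariate Gaussian by minimizing the negative log-likelihood with SciPy (the sample D, the starting point, and the log-σ parameterization are assumptions for the example, not from the slides):

```python
# Minimal sketch: numerical maximum-likelihood estimation for a univariate Gaussian.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=200)         # hypothetical training sample

def neg_log_likelihood(theta):
    mu, log_sigma = theta                             # parameterize sigma > 0 via its log
    return -np.sum(stats.norm.logpdf(D, loc=mu, scale=np.exp(log_sigma)))

res = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and the (biased) sample std
```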

10 Multivariate Gaussian: unknown μ, known Σ
Samples drawn from a multivariate Gaussian population: P(xk | θ) ~ N(μ, Σ), with θ = μ
The ML estimate for μ must therefore satisfy:
∑_{k=1}^{n} Σ⁻¹ (xk − μ̂) = 0

11 Multiplying by Σ and rearranging, we obtain:
μ̂ = (1/n) ∑_{k=1}^{n} xk
Just the arithmetic average of the training samples!
Conclusion: if P(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform optimal classification!
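As a quick illustration (assuming NumPy; the synthetic data are hypothetical), the ML estimate of the mean is just the sample average:

```python
# Minimal sketch: the ML estimate of the mean of a multivariate Gaussian with known
# covariance is the arithmetic average of the training samples.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=np.eye(2), size=500)  # hypothetical samples
mu_hat = X.mean(axis=0)          # (1/n) * sum_k x_k
print(mu_hat)                    # close to [1.0, -2.0]
```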

12 Univariate Gaussian: unknown μ, unknown σ²
Samples drawn from a univariate Gaussian population: P(xk | μ, σ²) ~ N(μ, σ²), with θ = (θ1, θ2) = (μ, σ²)
Setting the derivatives of l(θ) with respect to θ1 and θ2 to zero gives the conditions:
(1) ∑_{k=1}^{n} (xk − θ̂1) / θ̂2 = 0
(2) −∑_{k=1}^{n} 1/(2θ̂2) + ∑_{k=1}^{n} (xk − θ̂1)² / (2θ̂2²) = 0

13 Combining (1) and (2), one obtains:
μ̂ = (1/n) ∑_{k=1}^{n} xk
σ̂² = (1/n) ∑_{k=1}^{n} (xk − μ̂)²
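A minimal sketch of these two estimates (assuming NumPy; the sample is synthetic and only illustrative):

```python
# Minimal sketch: ML estimates of mean and variance for a univariate Gaussian.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=1000)    # hypothetical sample

mu_hat = x.mean()                                # (1/n) * sum_k x_k
sigma2_hat = np.mean((x - mu_hat) ** 2)          # (1/n) * sum_k (x_k - mu_hat)^2
print(mu_hat, sigma2_hat)                        # close to 3.0 and 4.0
```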

14 Multivariate Gaussian: Maximum-Likelihood estimates for μ and Σ
The Maximum-Likelihood estimate for μ is: μ̂ = (1/n) ∑_{k=1}^{n} xk
The Maximum-Likelihood estimate for Σ is: Σ̂ = (1/n) ∑_{k=1}^{n} (xk − μ̂)(xk − μ̂)^t
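A minimal sketch of the multivariate estimates (assuming NumPy; the data are synthetic), keeping the 1/n normalization that the ML derivation gives:

```python
# Minimal sketch: ML estimates of mean and covariance for a multivariate Gaussian.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)

mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)       # (1/n) * sum_k (x_k - mu)(x_k - mu)^t
print(mu_hat)
print(Sigma_hat)
```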

15 Bias
The Maximum-Likelihood estimate for σ² is biased: E[(1/n) ∑_{k=1}^{n} (xk − x̄)²] = ((n−1)/n) σ² ≠ σ²
An elementary unbiased estimator for Σ is the sample covariance matrix:
C = (1/(n−1)) ∑_{k=1}^{n} (xk − μ̂)(xk − μ̂)^t
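A small simulation (assuming NumPy; the sample size, number of trials, and σ² are illustrative) makes the bias visible by comparing the 1/n and 1/(n−1) normalizations:

```python
# Minimal sketch: the 1/n (ML) variance estimate is biased downward; averaged over
# many small samples it approaches (n-1)/n * sigma^2, while 1/(n-1) is unbiased.
import numpy as np

rng = np.random.default_rng(4)
n, trials, sigma2 = 5, 100_000, 4.0

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)        # 1/n normalization (biased)
unbiased_var = samples.var(axis=1, ddof=1)  # 1/(n-1) normalization (unbiased)

print(ml_var.mean())        # ~ (n-1)/n * sigma2 = 3.2
print(unbiased_var.mean())  # ~ sigma2 = 4.0
```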

16 Maximum-Likelihood Problem Statement
Let D = {x1, x2, …, xn}
Because the samples are drawn independently, P(x1, …, xn | θ) = ∏_{k=1}^{n} P(xk | θ)
Our goal is to determine θ̂, the value of θ that makes this sample the most representative!
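As an aside not on the slides, a minimal sketch (assuming NumPy and SciPy, with a univariate Gaussian model and a made-up sample) of evaluating this likelihood; in practice one sums log-densities rather than multiplying densities, to avoid underflow:

```python
# Minimal sketch: the sample likelihood factors over independent samples, so the
# log-likelihood is a sum of per-sample log-densities.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
D = rng.normal(loc=0.0, scale=1.0, size=100)     # hypothetical training sample

def log_likelihood(D, mu, sigma):
    # ln P(D | theta) = sum_k ln P(x_k | theta)
    return np.sum(stats.norm.logpdf(D, loc=mu, scale=sigma))

print(log_likelihood(D, mu=0.0, sigma=1.0))
print(log_likelihood(D, mu=5.0, sigma=1.0))      # much smaller: poor agreement with D
```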

17 Figure: the n training samples are partitioned into c subsets D1, …, Dc; the samples in each Dk are drawn from the class-conditional density P(x | ωk) ~ N(μk, Σk).

18 Problem: find the θ̂ that maximizes the likelihood of the sample:
θ̂ = arg max_θ P(D | θ) = arg max_θ ∏_{k=1}^{n} P(xk | θ)

19 Sec 5.1: When Do Maximum-Likelihood and Bayes Methods Differ?
They rarely differ; ML is less complex and easier to understand
Sources of system classification error:
- Bayes Error: error due to overlapping densities for different classes (inherent error, can never be eliminated)
- Model Error: error due to having an incorrect model
- Estimation Error: error from estimating parameters from a finite sample

20 Sec 7: Problems of Dimensionality
Accuracy, Dimension, and Training Sample Size
Classification accuracy depends upon the dimensionality and the amount of training data
Case of two classes, multivariate normal with the same covariance Σ and equal priors
The Bayes error is P(e) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du, where r² = (μ1 − μ2)^t Σ⁻¹ (μ1 − μ2)

21 If the features are independent, then Σ = diag(σ1², …, σd²) and r² = ∑_{i=1}^{d} ((μi1 − μi2)/σi)²
The most useful features are the ones for which the difference between the means is large relative to the standard deviation
Since each added feature can only increase r², it appears that adding new features should improve accuracy
Yet it has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
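A minimal sketch of this relationship (assuming NumPy and SciPy, equal priors, and independent features; the means and standard deviations are made up for the example): the theoretical Bayes error falls as informative features are added.

```python
# Minimal sketch: Bayes error P(e) = 1 - Phi(r/2) shrinks as the distance r grows,
# so adding an informative independent feature (which can only increase r^2)
# lowers the theoretical error.
import numpy as np
from scipy import stats

mu1 = np.array([0.0, 0.0, 0.0])
mu2 = np.array([1.0, 0.5, 0.25])     # hypothetical class means
sigma = np.array([1.0, 1.0, 1.0])    # hypothetical per-feature standard deviations

for d in range(1, len(mu1) + 1):
    r = np.sqrt(np.sum(((mu1[:d] - mu2[:d]) / sigma[:d]) ** 2))
    bayes_error = stats.norm.sf(r / 2.0)   # (1/sqrt(2*pi)) * integral_{r/2}^{inf} e^{-u^2/2} du
    print(d, bayes_error)
```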


23 Computational Complexity
Maximum-Likelihood estimation for the Gaussian case in d feature dimensions, with n training samples for each of c classes
For each category, we have to compute the discriminant function; estimating the covariance dominates, so the total per class is O(d²·n)
Total for c classes = O(c·d²·n) ≈ O(d²·n)
Costs increase rapidly when d and n are large!
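A minimal sketch (assuming NumPy; the function names and synthetic data are illustrative, not from the slides) of where the O(d²·n) training cost and the O(d²) per-sample discriminant cost come from:

```python
# Minimal sketch: cost of ML training and of evaluating a quadratic Gaussian
# discriminant for one class.
import numpy as np

def ml_train(X):
    # X: (n, d) training samples for one class; overall cost O(n * d^2)
    mu = X.mean(axis=0)                    # O(n * d)
    Z = X - mu
    Sigma = Z.T @ Z / len(X)               # O(n * d^2), the dominant term
    return mu, Sigma

def discriminant(x, mu, Sigma_inv, log_det, prior):
    # quadratic Gaussian discriminant; O(d^2) per evaluation
    # (the constant -(d/2) ln(2*pi) is omitted since it is the same for all classes)
    diff = x - mu
    return -0.5 * diff @ Sigma_inv @ diff - 0.5 * log_det + np.log(prior)

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 3))             # hypothetical training set, n=1000, d=3
mu, Sigma = ml_train(X)
Sigma_inv = np.linalg.inv(Sigma)           # O(d^3), done once per class
_, log_det = np.linalg.slogdet(Sigma)
print(discriminant(np.zeros(3), mu, Sigma_inv, log_det, prior=0.5))
```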

24 Overfitting
The number of training samples n can be inadequate for estimating the parameters
What to do? Simplify the model to reduce the number of parameters (a small sketch follows below):
- Assume all classes have the same covariance matrix, and perhaps that it is the identity matrix or has zero off-diagonal elements
- Assume statistical independence of the features
- Reduce the number of features d: Principal Component Analysis, etc.
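A minimal sketch of the first two simplifications (assuming NumPy; the helper names and synthetic data are illustrative): pooling one shared covariance across classes, or keeping only its diagonal, reduces the number of parameters that must be estimated from a small sample.

```python
# Minimal sketch: simplified covariance models to fight overfitting.
import numpy as np

def pooled_covariance(class_samples):
    # class_samples: list of (n_i, d) arrays, one per class; one shared covariance
    centered = [X - X.mean(axis=0) for X in class_samples]
    Z = np.vstack(centered)
    return Z.T @ Z / len(Z)

def diagonal_covariance(Sigma):
    # keep only per-feature variances: zero off-diagonal elements (independence)
    return np.diag(np.diag(Sigma))

rng = np.random.default_rng(7)
classes = [rng.normal(size=(20, 5)),
           rng.normal(loc=1.0, size=(20, 5))]   # hypothetical small training sets
Sigma_pooled = pooled_covariance(classes)
print(diagonal_covariance(Sigma_pooled))
```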
