
Slide 1: Chapter 3: Maximum-Likelihood Parameter Estimation
- Introduction
- Maximum-Likelihood Estimation
- Multivariate Case: unknown μ, known Σ
- Univariate Case: unknown μ and unknown σ²
- Bias
- Appendix: Maximum-Likelihood Problem Statement

Slide 2: Introduction
- Data availability in a Bayesian framework
  - We could design an optimal classifier if we knew:
    - P(ω_i) (priors)
    - P(x | ω_i) (class-conditional densities)
  - Unfortunately, we rarely have this complete information!
- Designing a classifier from a training sample
  - No problem with prior estimation
  - Samples are often too small for class-conditional estimation (large dimension of feature space!)

Slide 3:
- A priori information about the problem
  - Normality of P(x | ω_i): P(x | ω_i) ~ N(μ_i, Σ_i)
  - Characterized by the parameters μ_i and Σ_i
- Estimation techniques
  - Maximum-Likelihood and Bayesian estimation
  - Results are nearly identical, but the approaches are different
  - We will not cover the details of Bayesian estimation

Slide 4:
- In Maximum-Likelihood estimation the parameters are fixed but unknown!
  - The best parameters are obtained by maximizing the probability of obtaining the samples observed
- Bayesian methods view the parameters as random variables having some known prior distribution
- In either approach, we use P(ω_i | x) for our classification rule!

Slide 5:
- Maximum-Likelihood Estimation
  - Has good convergence properties as the sample size increases
  - Simpler than alternative techniques
- General principle
  - Assume we have c classes and P(x | ω_j) ~ N(μ_j, Σ_j)
  - Write P(x | ω_j) ≡ P(x | ω_j, θ_j), where the parameter vector θ_j collects the unknown components of μ_j and Σ_j

Slide 6:
- Use the information provided by the training samples to estimate θ = (θ_1, θ_2, ..., θ_c); each θ_i (i = 1, 2, ..., c) is associated with one category
- Suppose that D contains n samples, x_1, x_2, ..., x_n
- The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ):
  "It is the value of θ that best agrees with the actually observed training sample."

Slide 7: Figure: the likelihood P(D | θ) and the log-likelihood l(θ) plotted as functions of the unknown mean (σ fixed, μ unknown); the same value θ̂ maximizes both curves.

Slide 8:
- Optimal estimation
  - Let θ = (θ_1, θ_2, ..., θ_p)^t and let ∇_θ be the gradient operator with respect to θ
  - We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ)
  - New problem statement: determine the θ that maximizes the log-likelihood
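As a concrete illustration (not part of the original slides), here is a minimal Python sketch that evaluates the log-likelihood l(θ) = ln P(D | θ) of an i.i.d. sample under a univariate Gaussian; the data and candidate parameter values are made up for the example.

```python
import numpy as np

def log_likelihood(D, mu, sigma2):
    """l(theta) = ln P(D | theta) = sum_k ln P(x_k | mu, sigma^2)
    for i.i.d. samples from a univariate Gaussian N(mu, sigma^2)."""
    D = np.asarray(D, dtype=float)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma2)
                  - (D - mu) ** 2 / (2.0 * sigma2))

# Hypothetical training sample and two candidate parameter settings:
D = np.array([1.2, 0.8, 1.9, 1.4, 1.1])
print(log_likelihood(D, mu=1.0, sigma2=1.0))
print(log_likelihood(D, mu=D.mean(), sigma2=D.var()))  # the ML estimates score at least as high
```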

Slide 9: Because the n training samples are drawn independently,
  l(θ) = ∑_{k=1..n} ln P(x_k | θ),
and the set of necessary conditions for an optimum is:
  ∇_θ l = ∑_{k=1..n} ∇_θ ln P(x_k | θ) = 0   (n = number of training samples)
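The same optimum can also be found numerically instead of by solving ∇_θ l = 0 in closed form. A small sketch (assuming NumPy and SciPy are available; the sample below is hypothetical) minimizes -l(θ) for a univariate Gaussian and recovers the familiar sample mean and variance.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, D):
    # Parameterize theta = (mu, log sigma^2) so that sigma^2 stays positive.
    mu, log_sigma2 = theta
    sigma2 = np.exp(log_sigma2)
    return -np.sum(-0.5 * np.log(2.0 * np.pi * sigma2)
                   - (D - mu) ** 2 / (2.0 * sigma2))

D = np.array([1.2, 0.8, 1.9, 1.4, 1.1])            # hypothetical sample
res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(D,))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma2_hat)                          # close to D.mean(), D.var() (the ML estimates)
```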

Slide 10: Multivariate Gaussian: unknown μ, known Σ
- Samples are drawn from a multivariate Gaussian population: P(x_k | μ) ~ N(μ, Σ), so the only unknown parameter is θ = μ
- Therefore ∇_μ ln P(x_k | μ) = Σ⁻¹(x_k - μ), and the ML estimate for μ must satisfy:
  ∑_{k=1..n} Σ⁻¹(x_k - μ̂) = 0

Slide 11: Multiplying by Σ and rearranging, we obtain:
  μ̂ = (1/n) ∑_{k=1..n} x_k
This is just the arithmetic average of the training samples!
Conclusion: if P(x | ω_j) (j = 1, 2, ..., c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ_1, θ_2, ..., θ_c)^t and perform an optimal classification!
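A minimal NumPy sketch of this result (the samples and the known covariance below are made up for illustration): the sample average is the ML estimate μ̂, and it satisfies the stationarity condition from the previous slide up to round-off.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [0.7, 2.4],
              [1.2, 2.1]])          # hypothetical samples, one row per x_k
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])      # known covariance

mu_hat = X.mean(axis=0)             # (1/n) * sum_k x_k

# Check the ML condition sum_k Sigma^{-1} (x_k - mu_hat) = 0:
residual = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)
print(mu_hat, residual)             # residual is essentially [0, 0]
```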

Slide 12: Univariate Gaussian: unknown μ, unknown σ²
- Samples are drawn from a univariate Gaussian population: P(x_k | μ, σ²) ~ N(μ, σ²)
- θ = (θ_1, θ_2) = (μ, σ²), so ln P(x_k | θ) = -(1/2) ln(2π θ_2) - (x_k - θ_1)² / (2θ_2)
- Differentiating with respect to θ_1 and θ_2 gives, for each sample x_k:
  (1) (1/θ_2)(x_k - θ_1)
  (2) -1/(2θ_2) + (x_k - θ_1)² / (2θ_2²)

Slide 13: Summing (1) and (2) over the n samples, setting the sums to zero, and combining the two equations, one obtains:
  μ̂ = (1/n) ∑_{k=1..n} x_k
  σ̂² = (1/n) ∑_{k=1..n} (x_k - μ̂)²
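A small NumPy sketch of these two estimates on a made-up univariate sample; note that the variance here is the ML version that divides by n.

```python
import numpy as np

x = np.array([2.1, 1.7, 2.9, 2.3, 1.9, 2.6])     # hypothetical univariate sample

mu_hat = x.mean()                                # (1/n) * sum_k x_k
sigma2_hat = np.mean((x - mu_hat) ** 2)          # (1/n) * sum_k (x_k - mu_hat)^2
print(mu_hat, sigma2_hat)
```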

Slide 14: Bias
- The Maximum-Likelihood estimate for σ² is biased:
  E[ (1/n) ∑_{k=1..n} (x_k - x̄)² ] = ((n-1)/n) σ² ≠ σ²
- An elementary unbiased estimator for Σ is the sample covariance matrix:
  C = (1/(n-1)) ∑_{k=1..n} (x_k - μ̂)(x_k - μ̂)^t
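A quick simulation sketch of the bias (the sample size, variance, and number of trials are arbitrary choices for the demonstration): averaged over many small samples, the divide-by-n estimate undershoots σ² by the factor (n-1)/n, while the divide-by-(n-1) estimate does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 100_000

ml_est = np.empty(trials)
unbiased_est = np.empty(trials)
for t in range(trials):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=n)
    ml_est[t] = x.var(ddof=0)        # divides by n   (the biased ML estimate)
    unbiased_est[t] = x.var(ddof=1)  # divides by n-1 (the unbiased estimate)

print(ml_est.mean())        # close to (n-1)/n * sigma2 = 3.2
print(unbiased_est.mean())  # close to sigma2 = 4.0
```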

Slide 15: Appendix: Maximum-Likelihood Problem Statement
- Let D = {x_1, x_2, ..., x_n}; since the samples are drawn independently,
  P(x_1, ..., x_n | θ) = ∏_{k=1..n} P(x_k | θ);  |D| = n
- Our goal is to determine θ̂, the value of θ that makes this sample the most representative!

Slide 16: Figure: the training set D (|D| = n) is partitioned into class subsets D_1, ..., D_k, ..., D_c; the samples in each subset D_j are drawn from N(μ_j, Σ_j), i.e. according to the class-conditional density P(x | θ_j).

Slide 17: With θ = (θ_1, θ_2, ..., θ_c), the problem is to find θ̂ such that P(D | θ̂) is maximal, i.e. θ̂ = arg max_θ P(D | θ).

Slide 18: Sources of final-system classification error (Sec. 3.5.1)
- Bayes error
  - Error due to overlapping densities for different classes (inherent error, can never be eliminated)
- Model error
  - Error due to having an incorrect model
- Estimation error
  - Error from estimating parameters from a finite sample

Slide 19: Problems of Dimensionality (Sec. 3.7): accuracy, dimension, and training-sample size
- Classification accuracy depends on the dimensionality and the amount of training data
- Consider the case of two classes, multivariate normal with the same covariance Σ and equal priors; the Bayes error is
  P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{-u²/2} du,  where r² = (μ_1 - μ_2)^t Σ⁻¹ (μ_1 - μ_2) is the squared Mahalanobis distance between the class means

Slide 20:
- If the features are independent, then Σ = diag(σ_1², ..., σ_d²) and
  r² = ∑_{i=1..d} ((μ_{i1} - μ_{i2}) / σ_i)²
- The most useful features are the ones for which the difference between the means is large relative to the standard deviation
- It appears that adding new features improves accuracy, since each feature can only increase r
- Yet it has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
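A small sketch of the two formulas above (the class means and standard deviations are invented for the example; it assumes SciPy, whose norm.sf gives the upper-tail standard-normal integral): the Bayes error shrinks as the per-feature separations push r up.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class means and per-feature standard deviations (independent features).
mu1 = np.array([0.0, 1.0, 2.0])
mu2 = np.array([1.0, 1.5, 2.2])
sigma = np.array([1.0, 0.5, 2.0])

r2 = np.sum(((mu1 - mu2) / sigma) ** 2)    # r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2
bayes_error = norm.sf(np.sqrt(r2) / 2.0)   # (1/sqrt(2*pi)) * integral_{r/2}^{inf} e^{-u^2/2} du
print(np.sqrt(r2), bayes_error)
```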

Slide 22: Computational Complexity
- Maximum-Likelihood estimation with Gaussian densities in d dimensions, with n training samples for each of c classes
- For each category, we have to compute the discriminant function; the dominant cost per class is O(d²·n)
- Total for c classes: O(c·d²·n), i.e. still O(d²·n)
- The cost increases quickly when d and n are large!
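For reference, here is a sketch (not from the slides) of the Gaussian discriminant function whose evaluation drives this cost; the class parameters and prior below are invented, whereas in practice they would be the ML estimates. The quadratic form (x - μ)^t Σ⁻¹ (x - μ) is the O(d²) step per evaluation.

```python
import numpy as np

def discriminant(x, mu, Sigma_inv, log_det_Sigma, log_prior):
    """g_i(x) = -1/2 (x - mu)^t Sigma^{-1} (x - mu) - d/2 ln(2*pi)
               - 1/2 ln|Sigma| + ln P(omega_i)
    The quadratic form is the O(d^2) part that dominates each evaluation."""
    d = x.shape[0]
    diff = x - mu
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * log_det_Sigma
            + log_prior)

# Hypothetical 3-D parameters for one class:
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.diag([1.0, 0.5, 2.0])
x = np.array([0.3, 1.1, 1.7])
print(discriminant(x, mu, np.linalg.inv(Sigma), np.log(np.linalg.det(Sigma)), np.log(0.5)))
```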

Slide 23: Overfitting
- The number of training samples n can be inadequate for estimating all the parameters
- What to do?
  - Simplify the model, i.e. reduce the number of parameters:
    - Assume all classes share the same covariance matrix
    - Assume statistical independence of the features
  - Reduce the number of features d:
    - Principal Component Analysis, etc.
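One way to read the shared-covariance simplification is as a pooled estimate. A minimal sketch, assuming two hypothetical classes in 2-D, that estimates a single covariance matrix from all classes after removing each class mean.

```python
import numpy as np

def pooled_covariance(class_samples):
    """Shared-covariance simplification: estimate one covariance matrix
    from all classes by pooling the centered samples, instead of a
    separate Sigma_i per class."""
    centered = [X - X.mean(axis=0) for X in class_samples]  # remove each class mean
    stacked = np.vstack(centered)
    n = stacked.shape[0]
    return stacked.T @ stacked / n      # ML-style estimate: divide by the total sample count

# Hypothetical two-class data in 2-D:
X1 = np.array([[0.0, 0.2], [0.4, -0.1], [-0.3, 0.3]])
X2 = np.array([[2.1, 1.0], [1.8, 1.3], [2.4, 0.9]])
print(pooled_covariance([X1, X2]))
```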
