Chapter 3: Maximum-Likelihood Parameter Estimation


Chapter 3: Maximum-Likelihood Parameter Estimation
- Introduction
- Maximum-Likelihood Estimation
- Multivariate Case: unknown μ, known Σ
- Univariate Case: unknown μ and unknown σ²
- Bias
- Appendix: Maximum-Likelihood Problem Statement

Introduction
- Data availability in a Bayesian framework: we could design an optimal classifier if we knew
  - P(ωi) (the priors), and
  - p(x | ωi) (the class-conditional densities).
  Unfortunately, we rarely have this complete information.
- Instead, we design the classifier from a training sample.
  - Estimating the priors is not a problem.
  - The sample is often too small to estimate the class-conditional densities reliably (the feature space has a large dimension!).

- A priori information about the problem: assume each class-conditional density is normal,
  p(x | ωi) ~ N(μi, Σi),
  so it is completely characterized by the parameters μi and Σi.
- Estimation techniques
  - Maximum-Likelihood and Bayesian estimation.
  - The results are nearly identical, but the approaches are different.
  - We will not cover the details of Bayesian estimation.

- In Maximum-Likelihood estimation the parameters are viewed as fixed but unknown; the best parameter values are those that maximize the probability of obtaining the samples actually observed.
- Bayesian methods view the parameters as random variables having some known prior distribution.
- In either approach, we use P(ωi | x) for our classification rule.

Maximum-Likelihood Estimation
- Has good convergence properties as the sample size increases.
- Simpler than alternative techniques.
- General principle: assume we have c classes and
  p(x | ωj) ~ N(μj, Σj),
  i.e. p(x | ωj) ≡ p(x | ωj, θj), where θj = (μj, Σj) collects the unknown parameters of class ωj.

- Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category.
- Suppose that D contains n samples x1, x2, …, xn, drawn independently.
- The ML estimate of θ is, by definition, the value θ̂ that maximizes p(D | θ): "it is the value of θ that best agrees with the actually observed training sample."

- Likelihood of θ with respect to the sample: p(D | θ) = ∏_{k=1}^{n} p(x_k | θ)
- Log-likelihood: l(θ) = ln p(D | θ)
- (Figure: likelihood and log-likelihood plotted as functions of the unknown parameter, here θ = μ with σ fixed, for a fixed set of training samples.)

- Optimal estimation
  - Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator with respect to θ.
  - Define the log-likelihood function l(θ) = ln p(D | θ).
  - New problem statement: determine the θ̂ that maximizes the log-likelihood, θ̂ = arg max_θ l(θ).
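
A concrete illustration (not part of the original slides): the sketch below evaluates the log-likelihood of a one-dimensional Gaussian sample over a grid of candidate means, with σ assumed known; the sample and grid values are invented for the example.

import numpy as np

# Hypothetical 1-D sample, assumed drawn from N(mu, sigma^2) with sigma known.
rng = np.random.default_rng(0)
sigma = 1.5
data = rng.normal(loc=4.0, scale=sigma, size=25)

def log_likelihood(mu, x, sigma):
    """l(mu) = sum_k ln p(x_k | mu) for a univariate Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Evaluate l(mu) on a grid of candidate values and pick the maximizer.
grid = np.linspace(0.0, 8.0, 801)
ll = np.array([log_likelihood(mu, data, sigma) for mu in grid])
mu_grid_hat = grid[np.argmax(ll)]

print("grid maximizer:", mu_grid_hat)
print("sample mean   :", data.mean())   # the two should (nearly) coincide

As expected, the grid maximizer lands on the sample mean, anticipating the closed-form result derived on the following slides.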

Since l(θ) = Σ_{k=1}^{n} ln p(x_k | θ), where n is the number of training samples, the set of necessary conditions for an optimum is
∇θ l = Σ_{k=1}^{n} ∇θ ln p(x_k | θ) = 0.
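
A quick numerical check of the necessary condition (a sketch, not from the slides, using made-up data): at the sample mean, the derivative of the log-likelihood with respect to μ should vanish, which a central finite difference confirms.

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
data = rng.normal(loc=-1.0, scale=sigma, size=100)

def log_likelihood(mu):
    # l(mu) = sum_k ln p(x_k | mu) for a univariate Gaussian with known sigma
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (data - mu)**2 / (2 * sigma**2))

mu_hat = data.mean()          # candidate ML solution
eps = 1e-5                    # step for the central finite difference
grad = (log_likelihood(mu_hat + eps) - log_likelihood(mu_hat - eps)) / (2 * eps)

print("d l / d mu at mu_hat:", grad)   # ~ 0, as the necessary condition requires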

Multivariate Gaussian: unknown μ, known Σ
- Samples are drawn from a multivariate Gaussian population: p(x_k | μ) ~ N(μ, Σ), with θ = μ.
- ln p(x_k | μ) = -½ ln[(2π)^d |Σ|] - ½ (x_k - μ)^t Σ^{-1} (x_k - μ), therefore ∇μ ln p(x_k | μ) = Σ^{-1} (x_k - μ).
- The ML estimate for μ must satisfy: Σ_{k=1}^{n} Σ^{-1} (x_k - μ̂) = 0.

Multiplying by Σ and rearranging, we obtain
μ̂ = (1/n) Σ_{k=1}^{n} x_k,
i.e. just the arithmetic average of the training samples!
Conclusion: if p(x_k | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification.
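
To illustrate the conclusion, here is a minimal sketch (synthetic data, invented means, equal priors assumed; not code from the text): estimate each class mean as the arithmetic average of that class's training samples, then classify with the Gaussian discriminant using the known, shared covariance.

import numpy as np

rng = np.random.default_rng(2)
d, n_per_class = 2, 200
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])                  # known covariance (assumed)
true_means = [np.array([0.0, 0.0]), np.array([2.5, 1.0])]   # hypothetical class means

# Training data per class, drawn from N(mu_j, Sigma).
D = [rng.multivariate_normal(mu, Sigma, n_per_class) for mu in true_means]

# ML estimate of each mean: the arithmetic average of the class samples.
mu_hat = [Dj.mean(axis=0) for Dj in D]

Sigma_inv = np.linalg.inv(Sigma)
def classify(x):
    # Minimum Mahalanobis distance to the estimated means (equal priors assumed).
    dists = [(x - m) @ Sigma_inv @ (x - m) for m in mu_hat]
    return int(np.argmin(dists))

print("estimated means:", mu_hat)
print("class of [2.0, 1.0]:", classify(np.array([2.0, 1.0])))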

Univariate Gaussian: unknown μ, unknown σ²
- Samples are drawn from a univariate Gaussian population: p(x_k | μ, σ²) ~ N(μ, σ²), with θ = (θ1, θ2) = (μ, σ²).
- For a single sample, ln p(x_k | θ) = -½ ln(2π θ2) - (x_k - θ1)²/(2 θ2), so
  ∇θ ln p(x_k | θ) = [ (x_k - θ1)/θ2 ,  -1/(2 θ2) + (x_k - θ1)²/(2 θ2²) ]^t.

Summation over the n samples and setting ∇θ l = 0 gives the conditions
(1) Σ_{k=1}^{n} (x_k - θ̂1)/θ̂2 = 0
(2) -Σ_{k=1}^{n} 1/θ̂2 + Σ_{k=1}^{n} (x_k - θ̂1)²/θ̂2² = 0
Combining (1) and (2), one obtains
μ̂ = (1/n) Σ_{k=1}^{n} x_k   and   σ̂² = (1/n) Σ_{k=1}^{n} (x_k - μ̂)².
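
In code, the two closed-form estimates are one line each (a sketch on a synthetic sample; note the 1/n divisor in the variance, which is exactly the ML estimate discussed on the next slide).

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=50)   # hypothetical univariate sample

mu_hat = x.mean()                             # (1/n) * sum_k x_k
sigma2_hat = np.mean((x - mu_hat)**2)         # (1/n) * sum_k (x_k - mu_hat)^2, the ML estimate

print("mu_hat     :", mu_hat)
print("sigma2_hat :", sigma2_hat)
# Equivalent: np.var(x) uses the same 1/n divisor (ddof=0) by default.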

Bias
- The Maximum-Likelihood estimate for σ² is biased:
  E[ (1/n) Σ_{k=1}^{n} (x_k - x̄)² ] = ((n-1)/n) σ² ≠ σ².
- An elementary unbiased estimator for Σ is the sample covariance
  C = (1/(n-1)) Σ_{k=1}^{n} (x_k - μ̂)(x_k - μ̂)^t.
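
The bias is easy to see empirically (a sketch with synthetic data, not from the slides): averaged over many small samples, the 1/n variance estimate falls short of the true σ² by roughly the factor (n-1)/n, while the 1/(n-1) estimator does not.

import numpy as np

rng = np.random.default_rng(4)
sigma2_true, n, trials = 4.0, 5, 100_000

ml, unbiased = [], []
for _ in range(trials):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=n)
    ml.append(np.var(x, ddof=0))        # ML estimate, divisor n
    unbiased.append(np.var(x, ddof=1))  # unbiased estimate, divisor n-1

print("true sigma^2          :", sigma2_true)
print("mean of ML estimates  :", np.mean(ml))        # ~ (n-1)/n * sigma^2 = 3.2
print("mean of unbiased est. :", np.mean(unbiased))  # ~ 4.0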

Appendix: Maximum-Likelihood Problem Statement
- Let D = {x1, x2, …, xn} with |D| = n, the samples being drawn independently, so that
  p(x1, …, xn | θ) = ∏_{k=1}^{n} p(x_k | θ).
- Our goal is to determine θ̂, the value of θ that makes this sample the most representative.

(Figure: the training set D of n samples x1, x2, …, xn is partitioned into class subsets D1, …, Dk, …, Dc; the samples in Dj are drawn from the class-conditional density p(x | ωj) ~ N(μj, Σj).)

With θ = (θ1, θ2, …, θc), the problem is to find the estimate θ̂ such that p(D | θ̂) ≥ p(D | θ) for all θ, i.e. θ̂ = arg max_θ p(D | θ).

Sources of final-system classification error (Sec. 3.5.1)
- Bayes error: error due to overlapping densities of the different classes (inherent error, can never be eliminated).
- Model error: error due to having an incorrect model.
- Estimation error: error arising from estimating the parameters from a finite sample.

Problems of Dimensionality (Sec. 3.7): Accuracy, Dimension, Training Sample Size
- Classification accuracy depends on the dimensionality and on the amount of training data.
- Case of two classes, multivariate normal with the same covariance: the Bayes error is
  P(e) = (1/√(2π)) ∫_{r/2}^{∞} e^{-u²/2} du,
  where r² = (μ1 - μ2)^t Σ^{-1} (μ1 - μ2) is the squared Mahalanobis distance between the class means; P(e) decreases as r grows.

- If the features are independent, then Σ = diag(σ1², …, σd²) and
  r² = Σ_{i=1}^{d} ((μ_i1 - μ_i2) / σ_i)².
- The most useful features are the ones for which the difference between the means is large relative to the standard deviation.
- Each added feature can only increase r², so it appears that adding new features improves accuracy.
- Yet it has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
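
A small sketch of this formula (the per-feature means and standard deviations are invented for the example): with independent features, each term adds to r², so the theoretical Bayes error can only decrease as features are added, even though estimated performance on real data may not.

import numpy as np
from scipy.stats import norm

# Hypothetical per-feature class means and (shared) standard deviations.
mu1   = np.array([0.0, 0.0, 0.0, 0.0])
mu2   = np.array([1.0, 0.8, 0.3, 0.1])
sigma = np.array([1.0, 1.0, 1.0, 1.0])

for d in range(1, len(mu1) + 1):
    r2 = np.sum(((mu1[:d] - mu2[:d]) / sigma[:d])**2)  # r^2 = sum_i ((mu_i1 - mu_i2)/sigma_i)^2
    r = np.sqrt(r2)
    p_error = norm.sf(r / 2)   # P(e) = integral from r/2 to infinity of the standard normal density
    print(f"d={d}: r={r:.3f}, Bayes error={p_error:.4f}")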


Computational Complexity
- Maximum-Likelihood estimation with Gaussian models in d dimensions, with n training samples for each of c classes.
- For each category we have to compute the discriminant function; the dominant cost is estimating the d×d sample covariance matrix, so the total is O(d²·n) per class and O(c·d²·n) ≈ O(d²·n) for c classes (treating c as a small constant).
- The cost increases rapidly when d and n are large!
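
As a rough, hedged illustration of the O(d²·n) behaviour (absolute timings depend on the machine and the underlying BLAS), the dominant step is forming the d×d sample covariance from n samples:

import time
import numpy as np

rng = np.random.default_rng(5)
n = 5000
for d in (50, 100, 200):          # doubling d should roughly quadruple the time
    X = rng.normal(size=(n, d))
    t0 = time.perf_counter()
    mu_hat = X.mean(axis=0)
    Xc = X - mu_hat
    Sigma_hat = (Xc.T @ Xc) / n   # O(d^2 * n) multiply-adds
    print(f"d={d}: covariance estimate took {time.perf_counter() - t0:.4f} s")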

Overfitting
- The number of training samples n can be inadequate for estimating the parameters.
- What to do?
  - Simplify the model, i.e. reduce the number of parameters:
    - assume all classes share the same covariance matrix;
    - assume statistical independence of the features (diagonal covariance).
  - Reduce the number of features d: Principal Component Analysis, etc.
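
A brief sketch of the first two simplifications (synthetic data, equal class sizes assumed, class labels and values invented): pool the classes into one shared covariance, or keep only its diagonal; both need far fewer parameters, and hence far fewer samples, than one full d×d matrix per class.

import numpy as np

rng = np.random.default_rng(6)
d, n_per_class = 10, 15                          # few samples relative to d: full per-class covariances overfit
X1 = rng.normal(size=(n_per_class, d))           # hypothetical class-1 sample
X2 = rng.normal(loc=0.5, size=(n_per_class, d))  # hypothetical class-2 sample

def centered(X):
    return X - X.mean(axis=0)

# Option 1: one covariance matrix shared by all classes (pooled estimate).
Z = np.vstack([centered(X1), centered(X2)])
Sigma_pooled = (Z.T @ Z) / len(Z)

# Option 2: statistical independence, i.e. keep only the diagonal.
Sigma_diag = np.diag(np.diag(Sigma_pooled))

print("parameters, full per-class :", 2 * d * (d + 1) // 2)
print("parameters, pooled full    :", d * (d + 1) // 2)
print("parameters, diagonal       :", d)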
