Chapter 3: Maximum-Likelihood Parameter Estimation

Outline
Sec 1: Introduction
Sec 2: Maximum-Likelihood Estimation
  Multivariate Case: unknown μ, known Σ
  Univariate Case: unknown μ and unknown σ²
  Multivariate Case: unknown μ and unknown Σ
  Bias
  Maximum-Likelihood Problem Statement
Sec 5.1: When Do ML and Bayes Methods Differ?
Sec 7: Problems of Dimensionality

Sec 1: Introduction
Data availability in a Bayesian framework: we could design an optimal classifier if we knew the priors P(ωi) and the class-conditional densities P(x | ωi). Unfortunately, we rarely have this complete information!
Designing a classifier from a training sample: estimating the priors poses no problem, but the number of samples is often too small for estimating the class-conditional densities (the dimension of the feature space is large!).

A priori information about the problem: normality of P(x | ωi), i.e. P(x | ωi) ~ N(μi, Σi), characterized by the parameters μi and Σi.
Estimation techniques: Maximum-Likelihood and Bayesian estimation. The results are nearly identical, but the approaches are different. We will not cover the details of Bayesian estimation.

Parameters in Maximum-Likelihood estimation are assumed fixed but unknown! The best parameters are obtained by maximizing the probability of obtaining the samples actually observed.
Bayesian methods view the parameters as random variables having some known prior distribution.
In either approach, we use P(ωi | x) for our classification rule!

Sec 2: Maximum-Likelihood Estimation
Has good convergence properties as the sample size increases, and is simpler than alternative techniques.
General principle: assume we have c classes and P(x | ωj) ~ N(μj, Σj), so P(x | ωj) ≡ P(x | ωj, θj), where the parameter vector θj consists of the components of μj and Σj.

Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category.
Suppose that D contains n samples, x1, x2, …, xn, and simplify the notation by omitting class distinctions.
The Maximum-Likelihood estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ): "it is the value of θ that best agrees with the actually observed training sample."

[Figure: the likelihood P(D | θ) and the log-likelihood l(θ) plotted as functions of the unknown mean μ, with σ fixed; the training data are shown as red dots.]

Optimal estimation:
Let θ = (θ1, θ2, …, θp)ᵗ and let ∇θ denote the gradient operator with respect to θ.
Define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ).
New problem statement: determine the θ that maximizes the log-likelihood, θ̂ = arg maxθ l(θ).

Because the n training samples are drawn independently, l(θ) = ∑_{k=1..n} ln P(xk | θ), so ∇θ l = ∑_{k=1..n} ∇θ ln P(xk | θ).
The set of necessary conditions for an optimum is: ∇θ l = 0.
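As a concrete illustration of this machinery, here is a minimal sketch (not from the slides; it assumes NumPy and SciPy are available and uses synthetic data) that maximizes the Gaussian log-likelihood numerically and compares the result with the closed-form estimates derived in the next slides.

```python
# Minimal sketch: numerical maximization of the univariate Gaussian log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=500)          # training samples x_1..x_n

def neg_log_likelihood(theta):
    """-l(theta) = -sum_k ln P(x_k | mu, sigma^2)."""
    mu, log_sigma = theta                             # parameterize sigma by its log to keep it positive
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (D - mu) ** 2 / sigma2)

# Maximize l(theta) by minimizing -l(theta); at the optimum the gradient vanishes.
res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# The numerical optimum should agree with the closed-form ML estimates:
# the sample mean and the (biased, 1/n) sample variance.
print(mu_hat, D.mean())
print(sigma_hat ** 2, D.var())                        # np.var normalizes by 1/n by default
```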

Multivariate Gaussian: unknown μ, known Σ
Samples are drawn from a multivariate Gaussian population: P(xi | θ) ~ N(μ, Σ), with θ = μ.
Setting ∇μ l = 0, the ML estimate for μ must satisfy: ∑_{k=1..n} Σ⁻¹ (xk − μ̂) = 0.

Multiplying by Σ and rearranging, we obtain: μ̂ = (1/n) ∑_{k=1..n} xk — just the arithmetic average of the training samples!
Conclusion: if P(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)ᵗ and perform an optimal classification!
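A short sketch of this result (synthetic data and a made-up covariance matrix; NumPy assumed): with Σ known, the ML estimate of μ is simply the average of the training vectors.

```python
# With Sigma known, the ML estimate of mu is the arithmetic average of the samples.
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])                   # known covariance (illustrative values)
X = rng.multivariate_normal(mu_true, Sigma, size=1000)   # n x d sample matrix

mu_hat = X.mean(axis=0)                               # (1/n) * sum_k x_k
print(mu_hat)                                         # close to mu_true for large n
```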

Univariate Gaussian: unknown μ, unknown σ²
Samples are drawn from a univariate Gaussian population: P(xi | μ, σ²) ~ N(μ, σ²), with θ = (θ1, θ2) = (μ, σ²).
For a single sample, l = ln P(xk | θ) = −(1/2) ln(2π θ2) − (xk − θ1)² / (2 θ2).
Setting ∇θ l = 0 yields the two necessary conditions:
(1) ∑_{k=1..n} (1/θ̂2)(xk − θ̂1) = 0
(2) −∑_{k=1..n} (1/θ̂2) + ∑_{k=1..n} (xk − θ̂1)² / θ̂2² = 0

Combining (1) and (2), one obtains:
μ̂ = (1/n) ∑_{k=1..n} xk
σ̂² = (1/n) ∑_{k=1..n} (xk − μ̂)²

Multivariate Gaussian: Maximum-Likelihood estimates for μ and Σ
The Maximum-Likelihood estimate for μ is: μ̂ = (1/n) ∑_{k=1..n} xk
The Maximum-Likelihood estimate for Σ is: Σ̂ = (1/n) ∑_{k=1..n} (xk − μ̂)(xk − μ̂)ᵗ
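The following sketch (synthetic two-dimensional data; NumPy assumed) computes both multivariate ML estimates and checks the covariance against NumPy's biased (1/n) sample covariance.

```python
# ML estimates of mu and Sigma for a multivariate Gaussian sample.
import numpy as np

rng = np.random.default_rng(2)
mu_true = np.array([0.0, 3.0])
Sigma_true = np.array([[1.0, 0.8],
                       [0.8, 2.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=200)
n = X.shape[0]

mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = (centered.T @ centered) / n               # ML estimate: 1/n, not 1/(n-1)

print(mu_hat)
print(Sigma_hat)
print(np.cov(X, rowvar=False, bias=True))             # same as Sigma_hat
```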

Bias
The Maximum-Likelihood estimate for σ² is biased: E[(1/n) ∑_k (xk − x̄)²] = ((n − 1)/n) σ² ≠ σ².
An elementary unbiased estimator for Σ is the sample covariance matrix: C = (1/(n − 1)) ∑_{k=1..n} (xk − μ̂)(xk − μ̂)ᵗ
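An empirical illustration of this bias (assumed setup with synthetic data): averaged over many small samples, the 1/n estimate falls short of the true variance by the factor (n − 1)/n, while the 1/(n − 1) estimate does not.

```python
# Comparing the biased (1/n) and unbiased (1/(n-1)) variance estimators.
import numpy as np

rng = np.random.default_rng(3)
sigma2_true, n, trials = 4.0, 5, 100_000

samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)                  # 1/n   (ML, biased)
unbiased_var = samples.var(axis=1, ddof=1)            # 1/(n-1)

print(ml_var.mean())        # approximately (n-1)/n * sigma2_true = 3.2
print(unbiased_var.mean())  # approximately 4.0
```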

Maximum-Likelihood Problem Statement
Let D = {x1, x2, …, xn}. Since the samples are drawn independently, P(x1, …, xn | θ) = ∏_{k=1..n} P(xk | θ).
Our goal is to determine θ̂, the value of θ that makes this sample the most representative!

[Figure: a set D of |D| = n samples partitioned into class subsets D1, …, Dc; each subset Dk is governed by its class-conditional density P(x | ωk) ~ N(μk, Σk).]

Problem: find θ̂ such that θ̂ = arg maxθ P(D | θ) = arg maxθ ∏_{k=1..n} P(xk | θ).

Sec 5.1: When Do Maximum-Likelihood and Bayes Methods Differ?
They rarely differ; ML is less complex and easier to understand.
Sources of system classification error:
Bayes error – error due to overlapping densities for different classes (inherent error, never eliminated)
Model error – error due to having an incorrect model
Estimation error – error from estimating parameters from a finite sample

Sec 7: Problems of Dimensionality
Accuracy, dimension, and training sample size: classification accuracy depends upon the dimensionality and the amount of training data.
Case of two classes, multivariate normal with the same covariance Σ: the Bayes error is
P(e) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du, where r² = (μ1 − μ2)ᵗ Σ⁻¹ (μ1 − μ2) is the squared Mahalanobis distance between the class means.

If the features are independent, then Σ = diag(σ1², …, σd²) and r² = ∑_{i=1..d} ((μi1 − μi2)/σi)².
The most useful features are the ones for which the difference between the means is large relative to the standard deviation. Since P(e) decreases as r increases, it appears that adding new features can only improve accuracy (at worst, a feature whose class means coincide contributes nothing).
It has frequently been observed in practice, however, that beyond a certain point the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
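A hedged numerical sketch of these formulas (the function name `bayes_error_independent` and the feature values are illustrative assumptions, not from the slides; NumPy and SciPy assumed): for equal priors, P(e) = 1 − Φ(r/2), so an informative extra feature lowers the error while an uninformative one leaves it unchanged.

```python
# Bayes error for two equal-covariance Gaussian classes with independent features.
import numpy as np
from scipy.stats import norm

def bayes_error_independent(mu1, mu2, sigma):
    """r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2; returns P(e) = 1 - Phi(r/2)."""
    diff = (np.asarray(mu1) - np.asarray(mu2)) / np.asarray(sigma)
    r = np.sqrt(np.sum(diff ** 2))
    return norm.sf(r / 2.0)                            # survival function = 1 - Phi(r/2)

print(bayes_error_independent([0.0], [1.0], [1.0]))                   # one feature
print(bayes_error_independent([0.0, 0.0], [1.0, 1.0], [1.0, 1.0]))    # a second informative feature lowers P(e)
print(bayes_error_independent([0.0, 0.0], [1.0, 0.0], [1.0, 1.0]))    # a feature with equal means changes nothing
```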

Computational Complexity of Maximum-Likelihood Estimation
Gaussian densities in d feature dimensions, with n training samples for each of c classes.
For each category we have to compute the discriminant function
g(x) = −(1/2)(x − μ̂)ᵗ Σ̂⁻¹ (x − μ̂) − (d/2) ln 2π − (1/2) ln |Σ̂| + ln P(ω),
and estimating Σ̂ alone already costs O(d²·n). Total per class: O(d²·n); total for c classes: O(c·d²·n) ≈ O(d²·n).
Costs increase rapidly when d and n are large!
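The sketch below (synthetic toy setup; the function name `gaussian_discriminant` is illustrative) shows the discriminant evaluation whose quadratic form accounts for the O(d²) work per test point and per class.

```python
# Evaluating the Gaussian discriminant function for each class and classifying.
import numpy as np

def gaussian_discriminant(x, mu, Sigma_inv, log_det_Sigma, log_prior):
    d = x.shape[0]
    diff = x - mu
    return (-0.5 * diff @ Sigma_inv @ diff             # quadratic form: O(d^2) work
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * log_det_Sigma
            + log_prior)

# Toy two-class setup with ML-estimated parameters plugged in.
rng = np.random.default_rng(5)
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = np.eye(2)
Sigma_inv, log_det = np.linalg.inv(Sigma), np.linalg.slogdet(Sigma)[1]

x = rng.normal(size=2) + mus[1]                        # a test point near class 1
scores = [gaussian_discriminant(x, m, Sigma_inv, log_det, np.log(0.5)) for m in mus]
print(int(np.argmax(scores)))                          # predicted class index
```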

Overfitting
The number of training samples n can be inadequate for estimating the parameters. What to do?
Simplify the model – reduce the number of parameters:
  assume all classes share the same covariance matrix (and perhaps that it is the identity matrix, or has zero off-diagonal elements);
  assume statistical independence of the features.
Reduce the number of features d: Principal Component Analysis, etc.
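A rough sketch of the covariance simplifications listed above (the function name `simplified_covariance`, the mode names, and the data are illustrative assumptions, not from the slides): when n is small relative to d, replacing the full per-class covariance with a diagonal or spherical one leaves far fewer parameters to estimate.

```python
# Simplifying the ML covariance estimate when training data are scarce.
import numpy as np

def simplified_covariance(X, mode="full"):
    """ML covariance of sample matrix X (n x d), optionally simplified."""
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / X.shape[0]             # full ML estimate: d*(d+1)/2 parameters
    if mode == "diagonal":
        return np.diag(np.diag(S))                     # zero off-diagonal elements: d parameters
    if mode == "spherical":
        return np.mean(np.diag(S)) * np.eye(X.shape[1])   # scaled identity: 1 parameter
    return S

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 10))                          # few samples, many features
print(np.linalg.cond(simplified_covariance(X, "full")))      # typically large for small n
print(np.linalg.cond(simplified_covariance(X, "diagonal")))  # typically much smaller
```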
