ECE 8443 – Pattern Recognition LECTURE 05: MAXIMUM LIKELIHOOD ESTIMATION. Objectives: Discrete Features; Maximum Likelihood. Resources: D.H.S.: Chapter 3 (Part 1); D.H.S.: Chapter 3 (Part 2); J.O.S.: Tutorial; Nebula: Links; BGSU: Example; A.W.M.: Tutorial; A.W.M.: Links; S.P.: Primer; CSRN: Unbiased; A.W.M.: Bias.

ECE 8443: Lecture 05, Slide 1 – Discrete Features. For problems where the features are discrete, Bayes formula involves probabilities rather than densities, but Bayes decision rule remains the same. The maximum entropy distribution is the uniform distribution.
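For reference, the discrete form of Bayes formula that this slide refers to is the standard one (reconstructed here because the original equation images were not preserved):

P(\omega_j \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{\sum_{k=1}^{c} P(\mathbf{x} \mid \omega_k)\, P(\omega_k)}

and the decision rule is unchanged: decide \omega_i if P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) for all j \neq i.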

ECE 8443: Lecture 05, Slide 2 – Discriminant Functions for Discrete Features. Consider independent binary features. Assuming conditional independence, the class-conditional probabilities factor across the features; from them we form the likelihood ratio and the corresponding discriminant function (reconstructed below).
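The standard development for d independent binary features x_i \in \{0, 1\} (a reconstruction following the usual Duda, Hart & Stork treatment; the slide's own equations were images) defines p_i = P(x_i = 1 \mid \omega_1) and q_i = P(x_i = 1 \mid \omega_2). Conditional independence then gives

P(\mathbf{x} \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1 - x_i}, \qquad P(\mathbf{x} \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i} (1 - q_i)^{1 - x_i},

the likelihood ratio is the ratio of these two products, and taking logarithms yields a discriminant that is linear in the features:

g(\mathbf{x}) = \sum_{i=1}^{d} x_i \ln \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_{i=1}^{d} \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}.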

ECE 8443: Lecture 05, Slide 3 – Introduction to Maximum Likelihood Estimation. In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and the class-conditional densities, p(x|ωi). What can we do if we do not have this information? What limitations do we face? There are two common approaches to parameter estimation: maximum likelihood and Bayesian estimation. Maximum likelihood: treat the parameters as quantities whose values are fixed but unknown. Bayes: treat the parameters as random variables having some known prior distribution; observations of samples convert this prior to a posterior. Bayesian learning: sharpen the a posteriori density, causing it to peak near the true value.

ECE 8443: Lecture 05, Slide 4 – General Principle. I.I.D.: c data sets, D1, ..., Dc, where Dj is drawn independently according to p(x|ωj). Assume p(x|ωj) has a known parametric form and is completely determined by the parameter vector θj (e.g., p(x|ωj) ~ N(μj, Σj), where θj consists of the components of μj and Σj). Then p(x|ωj) has an explicit dependence on θj: p(x|ωj, θj). Use the training samples to estimate θ1, θ2, ..., θc. Functional independence: assume Di gives no useful information about θj for i ≠ j. This simplifies the notation to a single set D of training samples (x1, ..., xn) drawn independently from p(x|θ), used to estimate θ. Because the samples were drawn independently, the likelihood factors into a product over the samples (see below).
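The factorization implied by the independence assumption is the standard one (supplied here since the equation did not survive the transcript):

p(D \mid \theta) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \theta).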

ECE 8443: Lecture 05, Slide 5 – Example of ML Estimation. p(D|θ) is called the likelihood of θ with respect to the data. Given several training points: the top panel of the figure shows candidate source distributions (which distribution is the ML estimate?); the middle panel shows the likelihood of the data as a function of θ (the mean); the bottom panel shows the log likelihood. The value of θ that maximizes this likelihood, denoted θ̂, is the maximum likelihood (ML) estimate of θ.
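As a concrete illustration of the figure described above, the following minimal Python sketch (not part of the original slides; the sample values are hypothetical) evaluates the log likelihood of a small 1-D Gaussian sample with known variance over a grid of candidate means and confirms that the maximizer is essentially the sample mean:

import numpy as np

# Hypothetical training points and an assumed known variance.
samples = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
sigma = 1.0

# Candidate values of the mean (the horizontal axis of the middle/bottom panels).
mu_grid = np.linspace(-1.0, 3.0, 401)

# log p(D | mu) = sum_k log N(x_k; mu, sigma^2), evaluated for each candidate mean.
log_lik = np.array([
    np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (samples - mu)**2 / (2 * sigma**2))
    for mu in mu_grid
])

mu_ml = mu_grid[np.argmax(log_lik)]
print("grid ML estimate:", mu_ml, "  sample mean:", samples.mean())

The grid maximizer coincides (to grid resolution) with the sample mean, previewing the result derived on the later slides.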

ECE 8443: Lecture 05, Slide 6 – General Mathematics. The ML estimate is found by setting the gradient of the log likelihood to zero (see below). The solution to this equation can be a global maximum, a local maximum, or even an inflection point. Under what conditions is it a global maximum?
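In symbols, with the log likelihood defined as l(\theta) \equiv \ln p(D \mid \theta), the equation being solved is the standard necessary condition (reconstructed):

l(\theta) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta), \qquad \nabla_{\theta}\, l(\theta) = \mathbf{0}.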

ECE 8443: Lecture 05, Slide 7 – Maximum A Posteriori Estimation. A class of estimators – maximum a posteriori (MAP) – maximize p(D|θ)p(θ), where p(θ) describes the prior probability of different parameter values. An ML estimator is a MAP estimator with a uniform prior. A MAP estimator finds the peak, or mode, of a posterior density. MAP estimators are not transformation invariant: if we apply a nonlinear transformation (a reparameterization of θ), the MAP estimate is no longer optimal in the new space. This observation will be useful later in the course.
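Written out (a standard formulation, not copied from the slide), the two estimators are

\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta), \qquad \hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta}\; p(D \mid \theta),

so a flat (uniform) prior p(\theta) makes the two coincide.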

ECE 8443: Lecture 05, Slide 8 – Gaussian Case: Unknown Mean. Consider the case where only the mean, θ = μ, is unknown. The ML condition sets the gradient of each sample's log likelihood with respect to μ to zero; the Gaussian log density and its gradient are reconstructed below.
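The equations the slide relies on are the multivariate Gaussian log density and its gradient with respect to the mean (standard forms, reconstructed here):

\ln p(\mathbf{x}_k \mid \boldsymbol{\mu}) = -\tfrac{1}{2} \ln\!\big[(2\pi)^d \lvert \boldsymbol{\Sigma} \rvert\big] - \tfrac{1}{2} (\mathbf{x}_k - \boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}), \qquad \nabla_{\boldsymbol{\mu}} \ln p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}).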

ECE 8443: Lecture 05, Slide 9 – Gaussian Case: Unknown Mean (cont.). Rearranging terms and substituting into the expression for the total likelihood yields the ML estimate of the mean (see below). What is the significance of this result?
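The result the slide is driving at (standard, supplied because the equation images were lost) follows from setting the gradient of the total log likelihood to zero:

\sum_{k=1}^{n} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = \mathbf{0} \;\;\Rightarrow\;\; \hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k.

The significance is that the ML estimate of the mean is simply the sample mean of the training data.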

ECE 8443: Lecture 05, Slide 10 – Gaussian Case: Unknown Mean and Variance. Let θ = [μ, σ²]. The log likelihood of a single point, and the coupled equations that the full likelihood leads to, are shown in the reconstruction below.
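In the univariate case the missing expressions are (standard reconstruction):

\ln p(x_k \mid \theta) = -\tfrac{1}{2} \ln(2\pi\sigma^2) - \frac{(x_k - \mu)^2}{2\sigma^2},

and setting the derivatives of the full log likelihood with respect to \mu and \sigma^2 to zero gives the coupled conditions

\sum_{k=1}^{n} \frac{x_k - \hat{\mu}}{\hat{\sigma}^2} = 0, \qquad -\sum_{k=1}^{n} \frac{1}{2\hat{\sigma}^2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\mu})^2}{2\hat{\sigma}^4} = 0.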

ECE 8443: Lecture 05, Slide 11 – Gaussian Case: Unknown Mean and Variance (cont.). This leads to the familiar estimates of the mean and variance, and, in the multivariate case, of the mean vector and covariance matrix (see below). The true covariance is the expected value of the matrix (x_k − μ̂)(x_k − μ̂)ᵀ, which is a familiar result.
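Solving those conditions gives the familiar estimates, and their multivariate counterparts follow the same pattern (standard results, restated here):

\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2, \qquad \hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^{T}.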

ECE 8443: Lecture 05, Slide 12 – Convergence of the Mean. Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let's start with a few simple results we will need later, beginning with the expected value of the ML estimate of the mean (below).
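The expected value referred to above works out as follows (a one-line derivation using linearity of expectation):

E[\hat{\mu}] = E\!\left[\frac{1}{n} \sum_{k=1}^{n} x_k\right] = \frac{1}{n} \sum_{k=1}^{n} E[x_k] = \mu,

so the ML estimate of the mean is unbiased.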

ECE 8443: Lecture 05, Slide 13 – Variance of the ML Estimate of the Mean. In the expansion of E[μ̂²], the expected value of x_i x_j is μ² for i ≠ j, since the two random variables are independent, while the expected value of x_i² is μ² + σ². Hence, in the summation, we have n² − n terms with expected value μ² and n terms with expected value μ² + σ². This implies the result reconstructed below: the variance of the estimate goes to zero as n goes to infinity, so our estimate converges to the true mean (the error goes to zero).
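Putting the counting argument above into a formula (reconstructed):

E[\hat{\mu}^2] = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} E[x_i x_j] = \frac{1}{n^2} \big[(n^2 - n)\mu^2 + n(\mu^2 + \sigma^2)\big] = \mu^2 + \frac{\sigma^2}{n}, \qquad \mathrm{Var}[\hat{\mu}] = E[\hat{\mu}^2] - \big(E[\hat{\mu}]\big)^2 = \frac{\sigma^2}{n}.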

ECE 8443: Lecture 05, Slide 14 – Summary. Discriminant functions for discrete features are completely analogous to the continuous case (end of Chapter 2). To develop an optimal classifier, we need reliable estimates of the statistics of the features. In maximum likelihood (ML) estimation, we treat the parameters as having fixed but unknown values. ML estimation justified many well-known results for estimating parameters (e.g., computing the mean by averaging the observations). We also discussed biased and unbiased estimators, and the convergence of the mean and variance estimates.