Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher

Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)
- Bayesian Estimation (BE)
- Bayesian Parameter Estimation: Gaussian Case
- Bayesian Parameter Estimation: General Estimation
- Problems of Dimensionality
- Computational Complexity
- Component Analysis and Discriminants
- Hidden Markov Models

Bayesian Estimation (Bayesian learning applied to pattern classification problems) In MLE, θ was assumed to be fixed; in BE, θ is a random variable. The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification. Goal: compute P(ωi | x, D). Given the sample D, Bayes formula can be written:
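Written out in the book's notation (with Di the samples belonging to class ωi):

P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j)}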

To demonstrate the preceding equation, use:
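These are the two simplifying facts used in the text: the priors are assumed known, and the samples in Di carry no information about the other class-conditional densities:

P(\omega_i \mid D) = P(\omega_i), \qquad p(x \mid \omega_i, D) = p(x \mid \omega_i, D_i)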

The desired probability density p(x) is unknown. Assume that it has a known parametric form; the only thing assumed unknown is the value of a parameter vector θ. We express the fact that p(x) is unknown but has known parametric form by writing p(x|θ). Any information we might have about θ prior to observing the samples is assumed to be contained in a known prior density p(θ). Observation of the samples converts this to a posterior density p(θ|D), which we hope is sharply peaked about the true value of θ.
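The density we are after is then obtained by averaging p(x|θ) over the posterior; in the book's notation, the central relation of Bayesian estimation is:

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta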

If p(θ|D) peaks very sharply about some value θ̂, we obtain p(x|D) ≈ p(x|θ̂), i.e., the result we would obtain by substituting the estimate θ̂ for the true parameter vector.

Bayesian Parameter Estimation: Gaussian Case Goal: estimate μ using the a-posteriori density p(μ|D). The univariate case: p(μ|D), where μ is the only unknown parameter (μ0 and σ0 are known!).
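The assumed setup, in the book's notation: p(x|μ) ∼ N(μ, σ²) with σ² known, prior p(μ) ∼ N(μ0, σ0²), and

p(\mu \mid D) \;\propto\; p(\mu) \prod_{k=1}^{n} p(x_k \mid \mu)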

Thus, p(μ|D) is an exponential of a quadratic function of μ, i.e., again a normal density. If we write p(μ|D) ∼ N(μn, σn²), the parameters can be identified by equating coefficients:
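The standard result (book notation; μ̂n is the sample mean) is:

\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k

so μn is a weighted combination of the sample mean and the prior mean, and σn² shrinks as n grows.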


The univariate case p(x|D): p(x|D) is normally distributed with mean μn and variance σn² + σ². The density p(x|D) is the desired class-conditional density, and together with the prior probabilities it gives us the probabilistic information needed to design the classifier. This is in contrast to maximum-likelihood methods, which only make point estimates of the parameters rather than estimating the distribution p(x|D).

Pattern Classification, Chapter 1 P( | Θ) computed P(x | Θ) remains to be computed! It provides: (Desired class-conditional density P(x | Θj, j)) Therefore: P(x | Θj, j) together with P(j) And using Bayes formula, we obtain the Bayesian classification rule: Pattern Classification, Chapter 1 4

Multivariate case
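For the multivariate case, assuming p(x|μ) ∼ N(μ, Σ) with Σ known and prior p(μ) ∼ N(μ0, Σ0), the analogous results given in the text are:

p(\mu \mid D) \sim N(\mu_n, \Sigma_n), \qquad
\mu_n = \Sigma_0\!\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\!\hat{\mu}_n + \tfrac{1}{n}\Sigma\!\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\!\mu_0, \qquad
\Sigma_n = \Sigma_0\!\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\!\tfrac{1}{n}\Sigma

p(x \mid D) \sim N(\mu_n, \Sigma + \Sigma_n)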

Bayesian Parameter Estimation: General Theory The computation of p(x|D) can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are: The form of p(x|θ) is assumed known, but the value of θ is not known exactly. Our knowledge about θ is assumed to be contained in a known prior density p(θ). The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown p(x).

The basic problem is: “Compute the posterior density p(θ|D)”, then “Derive p(x|D)”. Using Bayes formula we obtain the posterior, and by the independence assumption the likelihood factors over the samples:
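In the book's notation:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}, \qquad
p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)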

Relation to the maximum-likelihood solution: Suppose that p(D|θ) reaches a sharp peak at θ = θ̂. If the prior density p(θ) is not zero at θ = θ̂ and does not change much in the surrounding neighborhood, then p(θ|D) also peaks at that point. Thus, p(x|D) will be approximately p(x|θ̂), the result one would obtain by using the maximum-likelihood estimate as if it were the true value. If the peak of p(D|θ) is very sharp, then the influence of prior information on the uncertainty in the true value of θ can be ignored. The Bayesian solution tells us how to use all the available information to compute the desired density p(x|D).

Convergence of p(x|D) to p(x):
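The convergence argument rests on the recursive (“on-line”) form of Bayesian learning; writing D^n = {x_1, …, x_n},

p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}, \qquad p(\theta \mid D^0) = p(\theta)

Under fairly general conditions the posteriors p(θ|D^n) become more and more sharply peaked as n grows, and p(x|D^n) = ∫ p(x|θ) p(θ|D^n) dθ approaches p(x).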

Do maximum-likelihood and Bayes methods differ? There are several criteria that will influence our choice: 1. Computational complexity: maximum-likelihood methods are often to be preferred. 2. Confidence in the prior information: if such information is reliable, Bayes methods give better results; furthermore, Bayesian methods with a “flat” or uniform prior are essentially equivalent to maximum-likelihood methods.

Problems of Dimensionality 1. How does classification accuracy depend upon the dimensionality (and the amount of training data)? 2. What is the computational complexity of the classifier? If the features are statistically independent, there are theoretical results that suggest the possibility of excellent performance. For example, consider the two-class multivariate normal case with equal covariance matrices and equal a priori probabilities; the Bayes error rate then depends only on the distance between the class means (see the expressions below). In the conditionally independent case, each feature contributes to reducing the probability of error.
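In the book's notation, with p(x|ωj) ∼ N(μj, Σ) for j = 1, 2 and P(ω1) = P(ω2):

P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad r^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)

and in the conditionally independent case, Σ = diag(σ1², …, σd²),

r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2

so P(e) decreases as r grows, and every feature with μi1 ≠ μi2 increases r.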

The most useful features are the ones for which the difference between the means is large relative to the standard deviation. It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance. Possible reasons: (a) we have the wrong model, e.g., the Gaussian or conditional-independence assumption is wrong; (b) the number of design or training samples is finite, and thus the distributions are not estimated accurately.


Computational Complexity Our design methodology is affected by computational difficulty. “Big oh” notation: f(x) = O(h(x)), read “big oh of h(x)”, if there exist constants c and x0 such that |f(x)| ≤ c |h(x)| for all x > x0 (an upper bound: f(x) grows no worse than h(x) for sufficiently large x!). Example: f(x) = 2 + 3x + 4x², g(x) = x², so f(x) = O(x²).

“Big oh” is not unique: f(x) = O(x²), f(x) = O(x³), and f(x) = O(x⁴) are all true for the example above. “Big theta” notation: f(x) = Θ(h(x)) if h(x) is both an upper and a lower bound on f(x) (up to constant factors) for sufficiently large x. Here f(x) = Θ(x²) but f(x) ≠ Θ(x³).

Complexity of the ML Estimation Gaussian priors in d dimensions, a classifier with n training samples for each of c classes. For each category we have to compute the discriminant function; the dominant cost is accumulating the sample covariance matrix: Total = O(d²·n). Total for c classes = O(c·d²·n) ≅ O(d²·n). The cost increases when d and n are large!
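A minimal NumPy sketch (names are illustrative) of the per-class ML computation whose cost is being counted; the covariance accumulation is the O(d²·n) term, repeated once per class:

import numpy as np

def ml_gaussian_estimates(samples_by_class):
    """ML mean and covariance for each class: O(c * d^2 * n) overall."""
    estimates = []
    for X in samples_by_class:          # X is an (n, d) array for one class
        n, d = X.shape
        mu = X.mean(axis=0)             # O(d n)
        diff = X - mu
        sigma = diff.T @ diff / n       # O(d^2 n), ML (biased) covariance
        estimates.append((mu, sigma))
    return estimates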

Component Analysis and Discriminants Combine features in order to reduce the dimension of the feature space. Linear combinations are simple to compute and tractable. Project high-dimensional data onto a lower-dimensional space. Two classical approaches for finding an “optimal” linear transformation: PCA (Principal Component Analysis), the “projection that best represents the data in a least-squares sense”; MDA (Multiple Discriminant Analysis), the “projection that best separates the data in a least-squares sense”.

PCA
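Since the PCA slides themselves consist of figures and equations, here is a minimal NumPy sketch of the idea just described (the function name and the choice of m are illustrative): project the centered data onto the top-m eigenvectors of the scatter matrix, the projection that best represents the data in a least-squares sense.

import numpy as np

def pca_project(X, m):
    """Project the rows of X (n x d) onto the m principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # center the data
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc) # scatter matrix, ascending eigenvalues
    W = eigvecs[:, ::-1][:, :m]                  # top-m eigenvectors (d x m)
    return Xc @ W, W, mu                         # projected data, basis, mean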

Multivariate Discriminant Analysis
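Likewise, a minimal sketch of the multiple-discriminant idea (a hypothetical helper, not the slides' own code): find the directions that maximize between-class scatter relative to within-class scatter by solving the generalized eigenproblem S_B w = λ S_W w.

import numpy as np

def mda_directions(samples_by_class, m):
    """Top-m discriminant directions from within/between-class scatter."""
    mu_all = np.vstack(samples_by_class).mean(axis=0)
    d = mu_all.shape[0]
    Sw = np.zeros((d, d))                        # within-class scatter
    Sb = np.zeros((d, d))                        # between-class scatter
    for X in samples_by_class:
        mu_i = X.mean(axis=0)
        diff = X - mu_i
        Sw += diff.T @ diff
        delta = (mu_i - mu_all).reshape(-1, 1)
        Sb += X.shape[0] * (delta @ delta.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))   # Sw^-1 Sb
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:m]].real            # columns are the directions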

Hidden Markov Models: Markov Chains Goal: make a sequence of decisions. Processes that unfold in time; the state at time t is influenced by the state at time t−1. Applications: speech recognition, gesture recognition, parts-of-speech tagging, and DNA sequencing. Any temporal process without memory. ωT = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states; we might have ω6 = {ω1, ω4, ω2, ω2, ω1, ω4}.

The system can revisit a state at different steps, and not every state need be visited. First-order Markov models: our production of any sequence is described by the transition probabilities P(ωj(t + 1) | ωi(t)) = aij.


θ = (aij, T). P(ωT | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi). Example: speech recognition (“production of spoken words”). Production of the word “pattern”, represented by phonemes /p/ /a/ /tt/ /er/ /n/ // (// = the silent state). Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state.
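A minimal sketch of the sequence-probability computation just described; the 4-state transition matrix and initial probabilities below are made-up placeholders, and states are 0-indexed:

import numpy as np

# Hypothetical transition matrix a[i, j] = P(state j at t+1 | state i at t); rows sum to 1.
a = np.array([[0.20, 0.30, 0.10, 0.40],
              [0.30, 0.20, 0.40, 0.10],
              [0.10, 0.40, 0.30, 0.20],
              [0.25, 0.25, 0.25, 0.25]])
p_init = np.array([0.25, 0.25, 0.25, 0.25])   # P(omega(1) = omega_i)

def sequence_probability(a, p_init, seq):
    """P(omega^T | theta): initial-state probability times the chain of a_ij."""
    p = p_init[seq[0]]
    for i, j in zip(seq[:-1], seq[1:]):
        p *= a[i, j]
    return p

# The slide's example omega^6 = {omega1, omega4, omega2, omega2, omega1, omega4},
# written with 0-based indices, gives a14*a42*a22*a21*a14*P(omega(1)=omega1):
print(sequence_probability(a, p_init, [0, 3, 1, 1, 0, 3]))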