# CS479/679 Pattern Recognition Dr. George Bebis

Parameter Estimation: Bayesian Estimation
Chapter 3 (Duda et al.)

Bayesian Estimation
Assumes that the parameters θ are random variables with a known a priori distribution p(θ). Estimates a distribution over θ rather than making a point estimate as ML does. The BE solution might not be of the assumed parametric form.

The Role of Training Examples in computing P(ωi /x)
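If p(x/ωi) and P(ωi) are known, Bayes’ rule allows us to compute the posterior probabilities P(ωi/x):
$$P(\omega_i / \mathbf{x}) = \frac{p(\mathbf{x} / \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} / \omega_j)\, P(\omega_j)}$$
Emphasize the role of the training examples D by introducing them into the computation of the posterior probabilities, i.e., compute P(ωi/x, D).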
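As a minimal sketch of Bayes' rule for a single observation x, the snippet below turns class-conditional likelihoods and priors into posteriors. The function name `posteriors` and the two-class numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Return P(w_i / x) for every class from p(x / w_i) and P(w_i)."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()  # the denominator is the evidence p(x)

# Hypothetical values of p(x / w_i) and P(w_i) for two classes.
post = posteriors(likelihoods=[0.6, 0.2], priors=[0.3, 0.7])
print(post)
```

The class with the larger posterior would be selected; here the higher likelihood of the first class outweighs its smaller prior.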

The Role of Training Examples (cont’d)
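Chain rule (numerator): p(x, D, ωi) = p(x/ωi, D) P(ωi/D) p(D).
Marginalize over the classes, then apply the chain rule (denominator): p(x, D) = Σj p(x, D, ωj) = Σj p(x/ωj, D) P(ωj/D) p(D).
Using only the samples from class i, p(x/ωi, D) = p(x/ωi, Di), so:
$$P(\omega_i / \mathbf{x}, D) = \frac{p(\mathbf{x}, D, \omega_i)}{p(\mathbf{x}, D)} = \frac{p(\mathbf{x} / \omega_i, D_i)\, P(\omega_i / D)}{\sum_{j=1}^{c} p(\mathbf{x} / \omega_j, D_j)\, P(\omega_j / D)}$$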

The Role of Training Examples (cont’d)
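The training examples Di can help us determine both the class-conditional densities and the prior probabilities. For simplicity, replace P(ωi/D) with P(ωi):
$$P(\omega_i / \mathbf{x}, D) = \frac{p(\mathbf{x} / \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} / \omega_j, D_j)\, P(\omega_j)}$$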

Bayesian Estimation (BE)
Need to estimate p(x/ωi, Di) for every class ωi. If the samples in Dj give no information about θi (i ≠ j), we need to solve c independent problems of the following form: “Given D, estimate p(x/D)”.

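BE Approach
Estimate p(x/D) as follows:
$$p(\mathbf{x} / D) = \int p(\mathbf{x}, \theta / D)\, d\theta = \int p(\mathbf{x} / \theta, D)\, p(\theta / D)\, d\theta$$
Since p(x/θ, D) = p(x/θ), we have:
$$p(\mathbf{x} / D) = \int p(\mathbf{x} / \theta)\, p(\theta / D)\, d\theta$$
Important equation: it links p(x/D) with p(θ/D).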

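BE Main Steps
(1) Compute p(θ/D):
$$p(\theta / D) = \frac{p(D / \theta)\, p(\theta)}{\int p(D / \theta)\, p(\theta)\, d\theta}$$
(2) Compute p(x/D):
$$p(\mathbf{x} / D) = \int p(\mathbf{x} / \theta)\, p(\theta / D)\, d\theta$$
where
$$p(D / \theta) = \prod_{k=1}^{n} p(\mathbf{x}_k / \theta)$$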
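The two steps can be approximated numerically when θ is one-dimensional. The sketch below assumes a Gaussian likelihood with unknown mean θ and known σ = 1, a Gaussian prior N(0, 3²), and four hypothetical samples; the grid, the data, and the helper names `gauss` and `predictive` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def gauss(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

theta = np.linspace(-10.0, 10.0, 4001)   # grid over the unknown parameter
dtheta = theta[1] - theta[0]
prior = gauss(theta, 0.0, 3.0)           # assumed prior p(theta)
D = np.array([1.2, 0.7, 2.1, 1.5])       # hypothetical samples
sigma = 1.0                              # assumed known

# Step 1: p(theta / D) is proportional to prod_k p(x_k / theta) * p(theta).
likelihood = np.prod(gauss(D[:, None], theta[None, :], sigma), axis=0)
posterior = likelihood * prior
posterior /= posterior.sum() * dtheta    # normalize so it integrates to 1

# Step 2: p(x / D) = integral of p(x / theta) p(theta / D) d(theta).
def predictive(x):
    return np.sum(gauss(x, theta, sigma) * posterior) * dtheta

print(predictive(1.0))
```

A grid is only practical in low dimensions; for higher-dimensional θ the integral generally needs closed forms (as in the Gaussian cases) or sampling methods.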

Interpretation of BE Solution
Suppose p(θ/D) peaks sharply at θ̂; then p(x/D) can be approximated as follows (assuming that p(x/θ) is smooth):
$$p(\mathbf{x} / D) \approx p(\mathbf{x} / \hat{\theta})$$
(i.e., the best estimate is obtained by setting θ = θ̂)

Interpretation of BE Solution (cont’d)
If we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ. The samples exert their influence on p(x/D) through p(θ/D).

Relation to ML solution
If p(D/θ) peaks sharply at θ̂, then p(θ/D) will, in general, peak sharply at θ̂ too (i.e., close to the ML solution):
$$p(\mathbf{x} / D) \approx p(\mathbf{x} / \hat{\theta})$$

Case 1: Univariate Gaussian, Unknown μ
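Assume p(x/μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²), with σ², μ0, and σ0² known. D = {x1, x2, …, xn} (independently drawn).
(1) Compute p(μ/D):
$$p(\mu / D) = \frac{p(D / \mu)\, p(\mu)}{\int p(D / \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k / \mu)\, p(\mu)$$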

Case 1: Univariate Gaussian, Unknown μ (cont’d)
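It can be shown that p(μ/D) has the form:
$$p(\mu / D) = \frac{1}{\sqrt{2\pi}\, \sigma_n} \exp\left[-\frac{1}{2} \left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$
That is, p(μ/D) ~ N(μn, σn²), and it peaks at μn.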

Case 1: Univariate Gaussian, Unknown μ (cont’d)
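μn is a weighted average of the sample mean and the prior mean (i.e., it lies between them):
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k \ \text{(ML estimate)}$$
As n → ∞, μn approaches the sample mean (the ML estimate).
$$\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}$$
σn² → 0 as n → ∞: less uncertainty about μ implies more samples!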

Case 1: Univariate Gaussian, Unknown μ (cont’d)
[Figure: Bayesian learning]

Case 1: Univariate Gaussian, Unknown μ (cont’d)
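(2) Compute p(x/D):
$$p(x / D) = \int p(x / \mu)\, p(\mu / D)\, d\mu = \frac{1}{2\pi \sigma \sigma_n} \exp\left[-\frac{1}{2} \frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right] f(\sigma, \sigma_n)$$
where f(σ, σn) is not dependent on μ, so p(x/D) ~ N(μn, σ² + σn²).
As the number of samples increases, p(x/D) converges to p(x/μ).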
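The closed-form result for this case can be sketched in a few lines. The code assumes the standard expressions from Duda et al. for the posterior parameters, μn = (nσ0² x̄ + σ² μ0) / (nσ0² + σ²) and σn² = σ0²σ² / (nσ0² + σ²), and a predictive variance of σ² + σn². The function name, the prior values, and the synthetic data are hypothetical.

```python
import numpy as np

def posterior_mean_params(x, mu0, sigma0_sq, sigma_sq):
    """Return (mu_n, sigma_n^2) of p(mu / D) for a Gaussian with known variance."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()                      # sample mean (ML estimate of mu)
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * xbar + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq

# Synthetic data drawn from N(2, 1); the prior is deliberately centered at 0.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=1000)
for n in (1, 10, 100, 1000):
    mu_n, sigma_n_sq = posterior_mean_params(samples[:n], mu0=0.0,
                                             sigma0_sq=4.0, sigma_sq=1.0)
    # The predictive density p(x / D) is N(mu_n, sigma^2 + sigma_n^2).
    print(n, mu_n, sigma_n_sq, 1.0 + sigma_n_sq)
```

As n grows, μn moves from the prior mean toward the sample mean and σn² shrinks, so the predictive variance approaches σ².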

Case 2: Multivariate Gaussian, Unknown μ
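Assume p(x/μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0), with Σ, μ0, and Σ0 known. D = {x1, x2, …, xn} (independently drawn).
(1) Compute p(μ/D):
$$p(\boldsymbol{\mu} / D) = \frac{p(D / \boldsymbol{\mu})\, p(\boldsymbol{\mu})}{\int p(D / \boldsymbol{\mu})\, p(\boldsymbol{\mu})\, d\boldsymbol{\mu}} = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k / \boldsymbol{\mu})\, p(\boldsymbol{\mu})$$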

Case 2: Multivariate Gaussian, Unknown μ (cont’d)
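Substituting the expressions for p(xk/μ) and p(μ):
$$p(\boldsymbol{\mu} / D) = \alpha'' \exp\left[-\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_n)^T \Sigma_n^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_n)\right], \quad \text{i.e., } p(\boldsymbol{\mu} / D) \sim N(\boldsymbol{\mu}_n, \Sigma_n)$$
where
$$\boldsymbol{\mu}_n = \Sigma_0 \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \hat{\boldsymbol{\mu}}_n + \tfrac{1}{n}\Sigma \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \boldsymbol{\mu}_0, \qquad \hat{\boldsymbol{\mu}}_n = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k$$
$$\Sigma_n = \Sigma_0 \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \tfrac{1}{n}\Sigma$$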

Case 2: Multivariate Gaussian (cont’d)
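(2) Compute p(x/D):
$$p(\mathbf{x} / D) = \int p(\mathbf{x} / \boldsymbol{\mu})\, p(\boldsymbol{\mu} / D)\, d\boldsymbol{\mu} \sim N(\boldsymbol{\mu}_n, \Sigma + \Sigma_n)$$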
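A minimal NumPy sketch of the multivariate case follows. It assumes the standard closed forms from Duda et al., μn = Σ0(Σ0 + Σ/n)⁻¹ x̄ + (Σ/n)(Σ0 + Σ/n)⁻¹ μ0 and Σn = Σ0(Σ0 + Σ/n)⁻¹(Σ/n), with predictive density N(μn, Σ + Σn). The function name and the tiny data set are hypothetical.

```python
import numpy as np

def posterior_mean_params_mv(X, mu0, Sigma0, Sigma):
    """Return (mu_n, Sigma_n) of p(mu / D) for a Gaussian with known covariance."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    xbar = X.mean(axis=0)                      # sample mean (ML estimate)
    A = np.linalg.inv(Sigma0 + Sigma / n)      # shared inverse term
    mu_n = Sigma0 @ A @ xbar + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

# Hypothetical 2-D example: two samples, identity covariances, zero prior mean.
X = np.array([[1.0, 0.0], [3.0, 2.0]])
mu_n, Sigma_n = posterior_mean_params_mv(X, mu0=np.zeros(2),
                                         Sigma0=np.eye(2), Sigma=np.eye(2))
predictive_cov = np.eye(2) + Sigma_n           # covariance of p(x / D)
print(mu_n, Sigma_n, predictive_cov, sep="\n")
```

With d = 1 these expressions reduce to the univariate formulas of Case 1.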

Recursive Bayes Learning
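Develop an incremental learning algorithm. Let Dn denote the first n samples: Dn = (x1, x2, …, xn-1, xn). Rewrite p(Dn/θ) as follows, where Dn-1 holds the first n-1 samples:
$$p(D^n / \theta) = p(\mathbf{x}_n / \theta)\, p(D^{n-1} / \theta)$$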

Recursive Bayes Learning (cont’d)
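Then, p(θ/Dn) can be written as follows:
$$p(\theta / D^n) = \frac{p(\mathbf{x}_n / \theta)\, p(\theta / D^{n-1})}{\int p(\mathbf{x}_n / \theta)\, p(\theta / D^{n-1})\, d\theta}, \qquad n = 1, 2, \ldots$$
with p(θ/D0) = p(θ).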

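Recursive Bayes Learning: Example
Assume p(x/θ) ~ U(0, θ), with the prior p(θ) = p(θ/D0) ~ U(0, 10) and the samples D = {4, 7, 2, 8}, as in the example of Duda et al.

Recursive Bayesian Learning (cont’d)
After the fourth sample (x4 = 8), p(θ/D4) ∝ 1/θ⁴ for 8 ≤ θ ≤ 10. In general:
$$p(\theta / D^n) \propto \frac{1}{\theta^n}, \qquad \max_k x_k \le \theta \le 10$$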

Recursive Bayesian Learning (cont’d)
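[Figure: p(θ/Dn) over the iterations, starting from p(θ/D0); p(θ/D4) peaks at θ̂ = 8.]
ML estimate: p(x/θ̂), with θ̂ = 8.
Bayesian estimate:
$$p(x / D) = \int p(x / \theta)\, p(\theta / D)\, d\theta$$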
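The recursion can be sketched on a grid. The code below assumes a uniform model p(x/θ) ~ U(0, θ), a flat prior on (0, 10], and the hypothetical sample sequence 4, 7, 2, 8 (only x4 = 8 appears in the slide text; the other values and the model are assumptions in the spirit of the textbook example). The names `uniform_likelihood` and `predictive` are illustrative.

```python
import numpy as np

theta = np.linspace(0.01, 10.0, 1000)    # grid over theta in (0, 10]
dtheta = theta[1] - theta[0]

def uniform_likelihood(x, theta):
    """p(x / theta) for x ~ U(0, theta): 1/theta if theta >= x, else 0."""
    return np.where(theta >= x, 1.0 / theta, 0.0)

# p(theta / D0): the flat prior, normalized on the grid.
posterior = np.ones_like(theta)
posterior /= posterior.sum() * dtheta

# One update per sample: p(theta / Dn) ~ p(x_n / theta) * p(theta / Dn-1).
for x in [4.0, 7.0, 2.0, 8.0]:
    posterior = uniform_likelihood(x, theta) * posterior
    posterior /= posterior.sum() * dtheta

theta_hat = theta[np.argmax(posterior)]  # location of the posterior peak

def predictive(x):
    """Bayesian estimate p(x / D), integrating over theta on the grid."""
    return np.sum(uniform_likelihood(x, theta) * posterior) * dtheta

print(theta_hat, predictive(1.0), predictive(9.0))
```

The posterior peaks at the largest sample seen so far, which is also the ML estimate. Unlike the ML density, however, the Bayesian predictive density keeps some mass above that value because larger θ remain plausible.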

Multiple Peaks
For most choices of p(x/θ), p(θ/Dn) will peak strongly at θ̂ given enough samples; in this case:
$$p(\mathbf{x} / D) \approx p(\mathbf{x} / \hat{\theta})$$
There might be cases, however, where θ cannot be determined uniquely from p(x/θ); in this case, p(θ/Dn) will contain multiple peaks. The solution p(x/D) should then be obtained by integration:
$$p(\mathbf{x} / D) = \int p(\mathbf{x} / \theta)\, p(\theta / D^n)\, d\theta$$

ML vs Bayesian Estimation
Number of training data: The two methods are equivalent assuming an infinite number of training data (and prior distributions that do not exclude the true solution). For small training data sets, they give different results in most cases.
Computational complexity: ML uses differential calculus or gradient search for maximizing the likelihood. Bayesian estimation requires complex multidimensional integration techniques.

ML vs Bayesian Estimation (cont’d)
Solution complexity: ML solutions are easier to interpret (i.e., they must be of the assumed parametric form). A Bayesian estimation solution might not be of the assumed parametric form.
Prior distribution: If the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions. In general, the two methods will give different solutions.

ML vs Bayesian Estimation (cont’d)
General comments: There are strong theoretical and methodological arguments supporting Bayesian estimation. In practice, ML estimation is simpler and can lead to comparable performance.

Computational Complexity
Learning complexity (ML estimation)
dimensionality: d; number of training data: n; number of classes: c
[Equation annotated with per-term learning costs: O(dn), O(d²n), d(d+1)/2, O(dn), O(d²), O(n), O(d³), O(1)]
These computations must be repeated c times! (n > d)

Computational Complexity
Classification complexity
dimensionality: d; number of training data: n; number of classes: c
[Equation annotated with per-term classification costs: O(d²), O(1)]
These computations must be repeated c times, and the maximum is taken.
Bayesian estimation: higher learning complexity, same classification complexity.
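To make the cost split concrete, here is a sketch that assumes each class is modeled by a Gaussian discriminant g(x) = -½(x - μ)ᵀΣ⁻¹(x - μ) - (d/2) ln 2π - ½ ln|Σ| + ln P(ω). The class name `GaussianClass`, the function `classify`, and the example parameters are hypothetical; the point is that the inverse and determinant are computed once at learning time, leaving an O(d²) quadratic form per class at classification time.

```python
import numpy as np

class GaussianClass:
    """One class model with the expensive terms precomputed at learning time."""

    def __init__(self, mu, Sigma, prior):
        self.mu = np.asarray(mu, dtype=float)
        Sigma = np.asarray(Sigma, dtype=float)
        self.Sigma_inv = np.linalg.inv(Sigma)        # O(d^3), done once
        _, logdet = np.linalg.slogdet(Sigma)         # log-determinant, done once
        d = self.mu.size
        self.const = -0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + np.log(prior)

    def g(self, x):
        diff = np.asarray(x, dtype=float) - self.mu  # O(d)
        return -0.5 * diff @ self.Sigma_inv @ diff + self.const  # O(d^2)

def classify(x, classes):
    """Evaluate all c discriminants and take the maximum."""
    return int(np.argmax([c.g(x) for c in classes]))

# Hypothetical two-class problem in two dimensions.
classes = [GaussianClass([0.0, 0.0], np.eye(2), 0.5),
           GaussianClass([3.0, 3.0], np.eye(2), 0.5)]
print(classify([0.2, -0.1], classes), classify([2.9, 3.3], classes))
```

Whether the parameters come from ML or Bayesian estimation only changes the learning step; the per-sample classification work is the same.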

Main Sources of Error in Classifier Design
Bayes error: The error due to overlapping densities p(x/ωi).
Model error: The error due to choosing an incorrect model.
Estimation error: The error due to incorrectly estimated parameters (e.g., due to a small number of training examples).

Overfitting
When the number of training examples is inadequate, the solution obtained might not be optimal. Consider the problem of fitting a curve to some data: the points were selected from a parabola (plus noise). A 10th-degree polynomial fits the data perfectly but does not generalize well. A greater error on the training data might improve generalization! We need more training examples than the number of model parameters!
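The curve-fitting scenario can be reproduced with a short experiment. The parabola coefficients, noise level, and sample sizes below are hypothetical; with 11 training points, a 10th-degree polynomial has as many parameters as there are points, so it can interpolate them.

```python
import numpy as np

rng = np.random.default_rng(0)

def parabola(x):
    """Hypothetical true curve generating the data."""
    return 1.0 + 2.0 * x - 3.0 * x ** 2

x_train = np.linspace(-1.0, 1.0, 11)
y_train = parabola(x_train) + rng.normal(scale=0.3, size=x_train.size)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = parabola(x_test)                # noise-free targets for evaluation

results = {}
for degree in (2, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = {"train": train_mse, "test": test_mse}
    print(degree, train_mse, test_mse)
```

The degree-10 fit drives the training error to essentially zero by fitting the noise, while the degree-2 fit accepts a nonzero training error and typically tracks the underlying parabola more closely between the training points.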

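Overfitting (cont’d)
Control model complexity:
Assume a diagonal covariance matrix (i.e., uncorrelated features).
Use the same covariance matrix for all classes and consolidate the data.
Shrinkage technique:
Shrink the individual covariance matrices toward a common covariance:
$$\Sigma_i(\alpha) = \frac{(1 - \alpha)\, n_i \Sigma_i + \alpha\, n \Sigma}{(1 - \alpha)\, n_i + \alpha\, n}, \qquad 0 < \alpha < 1$$
Shrink the common covariance matrix toward the identity matrix:
$$\Sigma(\beta) = (1 - \beta)\, \Sigma + \beta I, \qquad 0 < \beta < 1$$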
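The two shrinkage steps can be sketched as follows, assuming the forms given in Duda et al.: Σi(α) = [(1 - α) ni Σi + α n Σ] / [(1 - α) ni + α n] and Σ(β) = (1 - β) Σ + β I. The function names, sample counts, and matrices are hypothetical.

```python
import numpy as np

def shrink_to_common(Sigma_i, n_i, Sigma, n, alpha):
    """Shrink a class covariance Sigma_i toward the common covariance Sigma."""
    num = (1 - alpha) * n_i * Sigma_i + alpha * n * Sigma
    return num / ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(Sigma, beta):
    """Shrink the common covariance Sigma toward the identity matrix."""
    return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

# Hypothetical class covariance (10 samples) and common covariance (50 samples).
Sigma_i = np.array([[4.0, 1.5], [1.5, 1.0]])
Sigma = np.array([[2.0, 0.5], [0.5, 2.0]])
print(shrink_to_common(Sigma_i, 10, Sigma, 50, alpha=0.5))
print(shrink_to_identity(Sigma, beta=0.2))
```

In this sketch α = 0 leaves the class covariance untouched and α = 1 replaces it with the common one; in practice α and β would be chosen by validation.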