1. CS479/679 Pattern Recognition, Dr. George Bebis
Parameter Estimation: Bayesian Estimation
Chapter 3 (Duda et al.) – Sections
2. Bayesian Estimation
Assumes that the parameters θ are random variables with some known a priori distribution p(θ).
Estimates a distribution rather than making point estimates as ML does.
The BE solution might not be of the parametric form assumed.
3. The Role of Training Examples in Computing P(ωi/x)
If p(x/ωi) and P(ωi) are known, Bayes' rule allows us to compute the posterior probabilities P(ωi/x):
P(ωi/x) = p(x/ωi) P(ωi) / p(x)
Emphasize the role of the training examples D by introducing them in the computation of the posterior probabilities:
P(ωi/x, D) = p(x/ωi, D) P(ωi/D) / p(x/D)
4. The Role of Training Examples (cont'd)
P(ωi/x, D) = p(ωi, x/D) / p(x/D)   (chain rule)
= p(ωi, x/D) / Σj p(ωj, x/D)   (marginalize)
= p(x/ωi, D) P(ωi/D) / Σj p(x/ωj, D) P(ωj/D)   (chain rule)
Using only the samples from class i: p(x/ωi, D) = p(x/ωi, Di).
5. The Role of Training Examples (cont'd)
The training examples Di can help us determine both the class-conditional densities and the prior probabilities:
P(ωi/x, D) = p(x/ωi, Di) P(ωi/D) / Σj p(x/ωj, Dj) P(ωj/D)
For simplicity, replace P(ωi/D) with P(ωi):
P(ωi/x, D) = p(x/ωi, Di) P(ωi) / Σj p(x/ωj, Dj) P(ωj)
6. Bayesian Estimation (BE)
Need to estimate p(x/ωi, Di) for every class ωi.
If the samples in Dj give no information about θi (for j ≠ i), we need to solve c independent problems of the following form: "Given D, estimate p(x/D)."
7. BE Approach
Estimate p(x/D) as follows:
p(x/D) = ∫ p(x, θ/D) dθ
Since p(x, θ/D) = p(x/θ, D) p(θ/D) = p(x/θ) p(θ/D) (x does not depend on D once θ is given), we have:
p(x/D) = ∫ p(x/θ) p(θ/D) dθ
Important equation: it links p(x/D) with p(θ/D).
8. BE Main Steps
(1) Compute p(θ/D):
p(θ/D) = p(D/θ) p(θ) / ∫ p(D/θ) p(θ) dθ
(2) Compute p(x/D):
p(x/D) = ∫ p(x/θ) p(θ/D) dθ
where p(D/θ) = ∏k p(xk/θ) over the n training samples.
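Both steps have closed-form solutions in the Gaussian case. A minimal Python sketch (not from the slides; it assumes a 1-D Gaussian likelihood N(θ, σ²) with known variance σ² and a conjugate Gaussian prior N(μ0, σ0²), so that both p(θ/D) and p(x/D) are Gaussian):

```python
import numpy as np

def gaussian_mean_posterior(data, sigma2, mu0, sigma0_2):
    """Step (1): posterior p(theta/D) for the mean of N(theta, sigma2)
    under the conjugate prior N(mu0, sigma0_2); returns (mu_n, sigma_n2)."""
    n = len(data)
    xbar = np.mean(data)
    sigma_n2 = 1.0 / (n / sigma2 + 1.0 / sigma0_2)          # posterior variance
    mu_n = sigma_n2 * (n * xbar / sigma2 + mu0 / sigma0_2)  # posterior mean
    return mu_n, sigma_n2

def predictive_density(x, mu_n, sigma_n2, sigma2):
    """Step (2): p(x/D) = integral of p(x/theta) p(theta/D) dtheta,
    which here is N(x; mu_n, sigma2 + sigma_n2)."""
    var = sigma2 + sigma_n2
    return np.exp(-(x - mu_n) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Usage: 10 samples from N(2, 1), broad prior N(0, 100)
rng = np.random.default_rng(0)
D = rng.normal(2.0, 1.0, size=10)
mu_n, sigma_n2 = gaussian_mean_posterior(D, 1.0, 0.0, 100.0)
print(mu_n, sigma_n2, predictive_density(2.0, mu_n, sigma_n2, 1.0))
```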
9. Interpretation of the BE Solution
Suppose p(θ/D) peaks sharply at θ̂; then p(x/D) can be approximated as follows:
p(x/D) ≈ p(x/θ̂)
(i.e., the best estimate is obtained by setting θ = θ̂, assuming that p(x/θ) is smooth)
10. Interpretation of the BE Solution (cont'd)
If we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ.
The samples exert their influence on p(x/D) through p(θ/D).
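To make the weighted average concrete, here is a small numerical sketch (an illustration, not the slides' method): θ is discretized on a grid, p(θ/D) is obtained by normalizing likelihood × prior, and p(x/D) is the weighted average of p(x/θ):

```python
import numpy as np

def predictive_on_grid(x, data, theta_grid, prior, likelihood):
    """p(x/D) as a weighted average of p(x/theta), weighted by p(theta/D),
    computed on a discrete grid of theta values."""
    dx = theta_grid[1] - theta_grid[0]
    lik = np.prod([likelihood(xk, theta_grid) for xk in data], axis=0)
    post = lik * prior                 # unnormalized p(theta/D)
    post /= post.sum() * dx            # normalize so it integrates to 1
    return (likelihood(x, theta_grid) * post).sum() * dx

# Example: Gaussian likelihood with unknown mean theta, sigma = 1
gauss = lambda x, t: np.exp(-(x - t) ** 2 / 2) / np.sqrt(2 * np.pi)
grid = np.linspace(-10, 10, 2001)
flat_prior = np.ones_like(grid)        # uniform prior over the grid
print(predictive_on_grid(2.0, np.array([1.8, 2.2, 2.0]), grid, flat_prior, gauss))
```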
11. Relation to the ML Solution
If p(D/θ) peaks sharply at θ̂, then p(θ/D) will, in general, peak sharply at θ̂ too (i.e., close to the ML solution):
p(x/D) ≈ p(x/θ̂) ≈ p(x/θ̂ML)
25. Recursive Bayesian Learning (cont'd)
(Figure: the posterior sharpens over iterations, with p(θ/D⁴) peaking much more strongly than the initial prior p(θ/D⁰); the ML estimate and the Bayesian estimate are marked on the plot.)
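A minimal sketch of the recursive update p(θ/Dⁿ) ∝ p(xn/θ) p(θ/Dⁿ⁻¹) on a discretized θ grid (illustrative assumptions: Gaussian likelihood with unknown mean, flat initial prior p(θ/D⁰)):

```python
import numpy as np

gauss = lambda x, t: np.exp(-(x - t) ** 2 / 2) / np.sqrt(2 * np.pi)
grid = np.linspace(-10, 10, 2001)
dx = grid[1] - grid[0]

posterior = np.ones_like(grid) / (grid[-1] - grid[0])  # flat p(theta/D^0)
rng = np.random.default_rng(1)
for x_n in rng.normal(2.0, 1.0, size=50):  # samples arrive one at a time
    posterior *= gauss(x_n, grid)          # multiply in the new likelihood
    posterior /= posterior.sum() * dx      # renormalize
# The posterior sharpens around the true mean (2.0) as n grows:
print(grid[np.argmax(posterior)])
```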
26. Multiple Peaks
For most choices of p(x/θ), p(θ/Dⁿ) will peak strongly at θ̂ given enough samples; in this case p(x/Dⁿ) ≈ p(x/θ̂).
There might be cases, however, where θ cannot be determined uniquely from p(x/θ); in this case, p(θ/Dⁿ) will contain multiple peaks.
The solution p(x/Dⁿ) should then be obtained by integration:
p(x/Dⁿ) = ∫ p(x/θ) p(θ/Dⁿ) dθ
27. ML vs Bayesian Estimation
Number of training data:
- The two methods are equivalent assuming an infinite number of training data (and prior distributions that do not exclude the true solution).
- For small training sets, they give different results in most cases.
Computational complexity:
- ML uses differential calculus or gradient search to maximize the likelihood.
- Bayesian estimation requires complex multidimensional integration techniques.
28. ML vs Bayesian Estimation (cont'd)
Solution complexity:
- ML solutions are easier to interpret (i.e., they must be of the assumed parametric form).
- A Bayesian estimation solution might not be of the parametric form assumed.
Prior distribution:
- If the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions.
- In general, the two methods will give different solutions.
29. ML vs Bayesian Estimation (cont'd)
General comments:
- There are strong theoretical and methodological arguments supporting Bayesian estimation.
- In practice, ML estimation is simpler and can lead to comparable performance.
30. Computational Complexity
ML estimation (Gaussian case). Notation: dimensionality d, # training data n, # classes c.
Learning complexity:
- Sample mean μ̂: O(dn)
- Sample covariance Σ̂: O(d²n) — it has d(d+1)/2 independent entries, each costing O(n)
- Inverse Σ̂⁻¹: O(d³)
- Determinant |Σ̂|: O(d³)
- Overall: dominated by O(d²n), since n > d
These computations must be repeated c times!
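A minimal numpy sketch (an illustration, not taken from the slides) of the per-class ML learning step, with the dominant costs noted in comments:

```python
import numpy as np

def learn_gaussian_class(X):
    """ML learning step for one class; X is an (n, d) data matrix."""
    n, d = X.shape
    mu = X.mean(axis=0)                   # sample mean: O(dn)
    Xc = X - mu
    Sigma = Xc.T @ Xc / n                 # sample covariance: O(d^2 n)
    Sigma_inv = np.linalg.inv(Sigma)      # matrix inverse: O(d^3)
    _, logdet = np.linalg.slogdet(Sigma)  # log-determinant: O(d^3)
    return mu, Sigma_inv, logdet

# Repeated c times, once per class:
rng = np.random.default_rng(2)
params = [learn_gaussian_class(rng.normal(size=(200, 5))) for _ in range(3)]
```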
31. Computational Complexity (cont'd)
Notation: dimensionality d, # training data n, # classes c.
Classification complexity:
- Evaluating the quadratic form (x − μ̂)ᵀ Σ̂⁻¹ (x − μ̂): O(d²)
- Adding the remaining constant terms: O(1)
These computations must be repeated c times; then take the max.
Bayesian estimation: higher learning complexity, same classification complexity.
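Continuing the sketch above (reusing its assumed params structure), classification evaluates one Gaussian discriminant per class, where the O(d²) quadratic form dominates, and takes the max:

```python
import numpy as np

def classify(x, params, log_priors):
    """argmax_i of the Gaussian discriminant
    g_i(x) = -0.5 (x-mu)^T Sigma^{-1} (x-mu) - 0.5 ln|Sigma| + ln P(omega_i)."""
    scores = []
    for (mu, Sigma_inv, logdet), log_p in zip(params, log_priors):
        diff = x - mu
        g = -0.5 * diff @ Sigma_inv @ diff - 0.5 * logdet + log_p  # O(d^2)
        scores.append(g)
    return int(np.argmax(scores))  # repeated c times, take the max

# Usage with the params learned in the previous sketch:
# classify(np.zeros(5), params, np.log([1/3, 1/3, 1/3]))
```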
32. Main Sources of Error in Classifier Design
Bayes error: the error due to overlapping densities p(x/ωi).
Model error: the error due to choosing an incorrect model.
Estimation error: the error due to incorrectly estimated parameters (e.g., due to a small number of training examples).
33. Overfitting
When the number of training examples is inadequate, the solution obtained might not be optimal.
Consider the problem of fitting a curve to some data:
- Points were sampled from a parabola (plus noise).
- A 10th-degree polynomial fits the data perfectly but does not generalize well.
- A greater error on the training data might improve generalization!
Need more training examples than model parameters!
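A small numpy illustration of this slide's curve-fitting example (assumptions: 11 points from y = x² plus Gaussian noise; compare degree 2 against degree 10):

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(-1, 1, 11)                  # 11 training points
y_train = x_train ** 2 + rng.normal(0, 0.1, 11)   # parabola plus noise
x_test = np.linspace(-1, 1, 200)                  # dense "ground truth" grid
y_test = x_test ** 2

for degree in (2, 10):
    coef = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    print(degree, train_mse, test_mse)
# Degree 10 interpolates the 11 points (training MSE ~ 0) but typically has
# the larger test MSE: more training error can mean better generalization.
```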
34. Overfitting (cont'd)
Ways to control model complexity:
- Assume a diagonal covariance matrix (i.e., uncorrelated features).
- Use the same covariance matrix for all classes and pool the data.
- Shrinkage technique: shrink the individual covariance matrices toward a common covariance:
Σi(α) = [(1−α) ni Σi + α n Σ] / [(1−α) ni + α n], 0 < α < 1
- Shrink the common covariance matrix toward the identity matrix:
Σ(β) = (1−β) Σ + β I, 0 < β < 1
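A minimal numpy sketch of the two shrinkage formulas above (α and β are chosen arbitrarily here, just for illustration):

```python
import numpy as np

def shrink_to_common(Sigma_i, n_i, Sigma, n, alpha):
    """Sigma_i(a) = ((1-a) n_i Sigma_i + a n Sigma) / ((1-a) n_i + a n)"""
    return (((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_to_identity(Sigma, beta):
    """Sigma(b) = (1-b) Sigma + b I"""
    return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

# Usage: a singular class covariance becomes well-conditioned
Sigma_i = np.array([[1.0, 1.0], [1.0, 1.0]])   # singular (perfectly correlated)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])     # pooled covariance, n = 100
print(shrink_to_identity(shrink_to_common(Sigma_i, 20, Sigma, 100, 0.3), 0.1))
```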