Published by Alivia Ghent. Modified about 1 year ago.

1
Parameter Estimation: Bayesian Estimation Chapter 3 (Duda et al.) – Sections CS479/679 Pattern Recognition Dr. George Bebis

2
Bayesian Estimation
– Assumes that the parameters θ are random variables with a known a priori distribution p(θ).
– Estimates a distribution over θ rather than making a point estimate as ML does.
– The BE solution might not be of the parametric form assumed.

3
The Role of Training Examples in Computing P(ω_i/x)
If p(x/ω_i) and P(ω_i) are known, Bayes’ rule allows us to compute the posterior probabilities P(ω_i/x):
$$P(\omega_i/x) = \frac{p(x/\omega_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x/\omega_j)\,P(\omega_j)}$$
Emphasize the role of the training examples D by introducing them in the computation of the posterior probabilities:
$$P(\omega_i/x, D)$$
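As a minimal sketch of Bayes’ rule for the posteriors P(ω_i/x), assume a hypothetical two-class problem with univariate Gaussian class-conditional densities. The means, variances, priors, and the function names `gaussian_pdf` and `posteriors` are illustrative and do not come from the slides.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate normal density N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, class_params, priors):
    """Return P(w_i / x) for every class via Bayes' rule.

    class_params: list of (mean, var) pairs, one per class (assumed Gaussian).
    priors: list of P(w_i), assumed to sum to 1.
    """
    joint = [gaussian_pdf(x, m, v) * p for (m, v), p in zip(class_params, priors)]
    evidence = sum(joint)  # p(x) = sum_j p(x / w_j) P(w_j)
    return [j / evidence for j in joint]

# Hypothetical two-class problem: equal priors, unit variances, means 0 and 2.
params = [(0.0, 1.0), (2.0, 1.0)]
priors = [0.5, 0.5]
post = posteriors(1.0, params, priors)  # x = 1.0 is equidistant from both means
```

In this sketch the densities are taken as known; the remaining slides deal with the case where they must be estimated from the training examples D.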

4
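The Role of Training Examples (cont’d)
$$\begin{aligned}
P(\omega_i/x, D) &= \frac{p(x, \omega_i/D)}{p(x/D)} && \text{(chain rule)} \\
&= \frac{p(x, \omega_i/D)}{\sum_{j=1}^{c} p(x, \omega_j/D)} && \text{(marginalize)} \\
&= \frac{p(x/\omega_i, D)\,P(\omega_i/D)}{\sum_{j=1}^{c} p(x/\omega_j, D)\,P(\omega_j/D)} && \text{(chain rule)} \\
&= \frac{p(x/\omega_i, D_i)\,P(\omega_i/D)}{\sum_{j=1}^{c} p(x/\omega_j, D_j)\,P(\omega_j/D)} && \text{(using only the samples from class } i\text{)}
\end{aligned}$$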

5
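The Role of Training Examples (cont’d)
The training examples D_i can help us to determine both the class-conditional densities and the prior probabilities:
$$P(\omega_i/x, D) = \frac{p(x/\omega_i, D_i)\,P(\omega_i/D)}{\sum_{j=1}^{c} p(x/\omega_j, D_j)\,P(\omega_j/D)}$$
For simplicity, replace P(ω_i/D) with P(ω_i):
$$P(\omega_i/x, D) = \frac{p(x/\omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x/\omega_j, D_j)\,P(\omega_j)}$$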

6
Bayesian Estimation (BE)
Need to estimate p(x/ω_i, D_i) for every class ω_i.
If the samples in D_j give no information about θ_i (i ≠ j), we need to solve c independent problems of the following form: “Given D, estimate p(x/D)”

7
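BE Approach
Estimate p(x/D) as follows:
$$p(x/D) = \int p(x, \theta/D)\,d\theta = \int p(x/\theta, D)\,p(\theta/D)\,d\theta$$
Since p(x/θ, D) = p(x/θ) (given θ, the samples D carry no further information about x), we have:
$$p(x/D) = \int p(x/\theta)\,p(\theta/D)\,d\theta$$
Important equation: it links p(x/D) with p(θ/D).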

8
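BE Main Steps
(1) Compute p(θ/D):
$$p(\theta/D) = \frac{p(D/\theta)\,p(\theta)}{p(D)} = \frac{p(D/\theta)\,p(\theta)}{\int p(D/\theta)\,p(\theta)\,d\theta}$$
(2) Compute p(x/D):
$$p(x/D) = \int p(x/\theta)\,p(\theta/D)\,d\theta$$
where
$$p(D/\theta) = \prod_{k=1}^{n} p(x_k/\theta)$$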
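The two BE steps can be approximated numerically on a grid of θ values. The sketch below assumes that θ is the unknown mean of a unit-variance Gaussian with an N(0, 4) prior; the samples, the grid, and the helper names are hypothetical choices for illustration. Step (1) multiplies the prior by the likelihood of independent samples and normalizes; step (2) averages p(x/θ) under the resulting p(θ/D).

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_on_grid(data, thetas, prior, likelihood):
    """Step (1): p(theta / D) on an evenly spaced grid, normalized numerically."""
    step = thetas[1] - thetas[0]
    unnorm = []
    for t in thetas:
        value = prior(t)
        for x in data:
            value *= likelihood(x, t)  # p(D / theta) = prod_k p(x_k / theta)
        unnorm.append(value)
    total = sum(unnorm) * step         # Riemann-sum approximation of p(D)
    return [u / total for u in unnorm]

def predictive(x, thetas, post, likelihood):
    """Step (2): p(x / D) as the integral of p(x / theta) p(theta / D) over theta."""
    step = thetas[1] - thetas[0]
    return sum(likelihood(x, t) * p for t, p in zip(thetas, post)) * step

# Hypothetical setup: unknown mean theta, known unit variance, N(0, 4) prior.
thetas = [-10.0 + 0.01 * i for i in range(2001)]
data = [1.2, 0.8, 1.5, 1.1]
post = posterior_on_grid(
    data,
    thetas,
    prior=lambda t: normal_pdf(t, 0.0, 4.0),
    likelihood=lambda x, t: normal_pdf(x, t, 1.0),
)
p_x = predictive(1.0, thetas, post, lambda x, t: normal_pdf(x, t, 1.0))
```

A grid is only practical for one or two parameters; it is used here because it keeps both integrals visible as plain sums.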

9
Interpretation of BE Solution
Suppose p(θ/D) peaks sharply at $\hat{\theta}$; then p(x/D) can be approximated as follows (assuming that p(x/θ) is smooth):
$$p(x/D) \approx p(x/\hat{\theta})$$
(i.e., the best estimate is obtained by setting $\theta = \hat{\theta}$)

10
Interpretation of BE Solution (cont’d)
If we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ.
The samples exert their influence on p(x/D) through p(θ/D).

11
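Relation to ML Solution
If p(D/θ) peaks sharply at $\hat{\theta}$, then p(θ/D) will, in general, peak sharply at $\hat{\theta}$ too (i.e., close to the ML solution), because the posterior is proportional to the likelihood times the prior:
$$p(\theta/D) = \frac{p(D/\theta)\,p(\theta)}{p(D)}$$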

12
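Case 1: Univariate Gaussian, Unknown μ
D = {x_1, x_2, …, x_n} (independently drawn)
Assume p(x/μ) ~ N(μ, σ²) with σ² known, and p(μ) ~ N(μ_0, σ_0²) with μ_0 and σ_0² known.
(1) Compute p(μ/D):
$$p(\mu/D) = \frac{p(D/\mu)\,p(\mu)}{p(D)} = \alpha \prod_{k=1}^{n} p(x_k/\mu)\,p(\mu)$$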

13
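Case 1: Univariate Gaussian, Unknown μ (cont’d)
p(μ/D) is the product of the p(x_k/μ) and p(μ) times a constant that does not depend on μ. It can be shown that p(μ/D) has the form:
$$p(\mu/D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$
That is, p(μ/D) ~ N(μ_n, σ_n²), which peaks at μ_n.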

14
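Case 1: Univariate Gaussian, Unknown μ (cont’d)
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k \ \text{(ML estimate)}$$
μ_n is a weighted average of $\hat{\mu}_n$ and μ_0 (i.e., it lies between them); μ_n → $\hat{\mu}_n$ (ML estimate) as n → ∞.
$$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
σ_n² → 0 as n → ∞: more samples imply less uncertainty about μ.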
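A minimal sketch of the closed-form posterior for this case, assuming the standard conjugate result for a Gaussian likelihood with known variance σ² and a Gaussian prior N(μ_0, σ_0²): the posterior mean is μ_n = [nσ_0² / (nσ_0² + σ²)] · (sample mean) + [σ² / (nσ_0² + σ²)] · μ_0, and the posterior variance is σ_n² = σ_0²σ² / (nσ_0² + σ²). The function name and the sample values are hypothetical.

```python
def gaussian_mean_posterior(data, sigma2, mu0, sigma0_2):
    """Return (mu_n, var_n) of p(mu / D) for a Gaussian with known variance.

    data:     list of samples x_1..x_n
    sigma2:   known variance of p(x / mu)
    mu0:      prior mean of mu
    sigma0_2: prior variance of mu
    """
    n = len(data)
    sample_mean = sum(data) / n                # ML estimate of mu
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * sample_mean + (sigma2 / denom) * mu0
    var_n = sigma0_2 * sigma2 / denom
    return mu_n, var_n

# Hypothetical samples; prior N(0, 4), known variance 1.
samples = [1.2, 0.8, 1.5, 1.1]
mu_n, var_n = gaussian_mean_posterior(samples, sigma2=1.0, mu0=0.0, sigma0_2=4.0)
```

With these numbers the posterior mean sits between the prior mean 0 and the sample mean, and much closer to the sample mean because the prior is broad relative to σ²/n.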

15
Case 1: Univariate Gaussian, Unknown μ (cont’d)

16
Bayesian Learning

17
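Case 1: Univariate Gaussian, Unknown μ (cont’d)
(2) Compute p(x/D):
$$p(x/D) = \int p(x/\mu)\,p(\mu/D)\,d\mu = \frac{1}{2\pi\,\sigma\,\sigma_n} \exp\left[-\frac{1}{2}\,\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right] f(\sigma, \sigma_n)$$
where the factor f(σ, σ_n) is not dependent on μ, so p(x/D) ~ N(μ_n, σ² + σ_n²).
As the number of samples increases, p(x/D) converges to p(x/μ).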
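The convergence of p(x/D) to p(x/μ) can be illustrated with a short sketch. It assumes the standard result that the predictive density in this case is N(μ_n, σ² + σ_n²) with σ_n² = σ_0²σ² / (nσ_0² + σ²); the function names and the values σ² = 1, σ_0² = 4 are hypothetical.

```python
import math

def posterior_variance(n, sigma2, sigma0_2):
    """sigma_n^2: variance of p(mu / D) after n samples."""
    return sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

def predictive_pdf(x, mu_n, var_n, sigma2):
    """p(x / D) = N(mu_n, sigma2 + var_n) evaluated at x."""
    total_var = sigma2 + var_n
    return math.exp(-0.5 * (x - mu_n) ** 2 / total_var) / math.sqrt(2.0 * math.pi * total_var)

# The predictive variance sigma2 + var_n shrinks toward sigma2 as n grows.
for n in (1, 10, 100, 1000):
    var_n = posterior_variance(n, sigma2=1.0, sigma0_2=4.0)
    print(n, 1.0 + var_n)
```

The extra variance σ_n² reflects the remaining uncertainty about μ; once it becomes negligible, the predictive density has the same spread as p(x/μ).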

18
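Case 2: Multivariate Gaussian, Unknown μ
D = {x_1, x_2, …, x_n} (independently drawn)
Assume p(x/μ) ~ N(μ, Σ) with Σ known, and p(μ) ~ N(μ_0, Σ_0) with μ_0 and Σ_0 known.
(1) Compute p(μ/D):
$$p(\mu/D) = \frac{p(D/\mu)\,p(\mu)}{p(D)} = \alpha \prod_{k=1}^{n} p(x_k/\mu)\,p(\mu)$$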

19
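Case 2: Multivariate Gaussian, Unknown μ (cont’d)
Substituting the expressions for p(x_k/μ) and p(μ):
$$p(\mu/D) = c\,\exp\left[-\frac{1}{2}(\mu - \mu_n)^T \Sigma_n^{-1} (\mu - \mu_n)\right], \quad \text{i.e., } p(\mu/D) \sim N(\mu_n, \Sigma_n)$$
where
$$\mu_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\hat{\mu}_n + \tfrac{1}{n}\Sigma\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\mu_0, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$$
$$\Sigma_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma$$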
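A minimal NumPy sketch of the multivariate posterior, assuming the standard conjugate result μ_n = Σ_0(Σ_0 + Σ/n)⁻¹ · (sample mean) + (Σ/n)(Σ_0 + Σ/n)⁻¹ μ_0 and Σ_n = Σ_0(Σ_0 + Σ/n)⁻¹(Σ/n). The function name, the samples, and the matrices are hypothetical.

```python
import numpy as np

def gaussian_mean_posterior_mv(X, Sigma, mu0, Sigma0):
    """Return (mu_n, Sigma_n) of p(mu / D) for a Gaussian with known covariance.

    X:      (n, d) array of samples
    Sigma:  (d, d) known covariance of p(x / mu)
    mu0:    (d,) prior mean of mu
    Sigma0: (d, d) prior covariance of mu
    """
    n = X.shape[0]
    sample_mean = X.mean(axis=0)                  # ML estimate of mu
    A = np.linalg.inv(Sigma0 + Sigma / n)
    mu_n = Sigma0 @ A @ sample_mean + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

# Hypothetical 2-D example.
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.0, 0.5], [1.5, 1.5]])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu0 = np.zeros(2)
Sigma0 = 4.0 * np.eye(2)
mu_n, Sigma_n = gaussian_mean_posterior_mv(X, Sigma, mu0, Sigma0)
```

An explicit inverse mirrors the formula; for larger d, `np.linalg.solve` would usually be the more numerically careful choice.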

20
Case 2: Multivariate Gaussian (cont’d)
(2) Compute p(x/D):
$$p(x/D) = \int p(x/\mu)\,p(\mu/D)\,d\mu \sim N(\mu_n, \Sigma + \Sigma_n)$$

21
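Recursive Bayes Learning
Develop an incremental learning algorithm. Let D^n = {x_1, x_2, …, x_{n-1}, x_n}, so that D^{n-1} denotes the first n-1 samples.
Rewrite $p(D^n/\theta) = \prod_{k=1}^{n} p(x_k/\theta)$ as follows:
$$p(D^n/\theta) = p(x_n/\theta)\,p(D^{n-1}/\theta)$$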

22
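Recursive Bayes Learning (cont’d)
Then, p(θ/D^n) can be written as follows:
$$p(\theta/D^n) = \frac{p(x_n/\theta)\,p(\theta/D^{n-1})}{\int p(x_n/\theta)\,p(\theta/D^{n-1})\,d\theta}, \qquad n = 1, 2, \ldots$$
with p(θ/D^0) = p(θ).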
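The recursion can be sketched on a grid: each new sample multiplies the current posterior by p(x_n/θ) and renormalizes, starting from p(θ/D^0) = p(θ). The model below is an assumption made for illustration: p(x/θ) is uniform on [0, θ], the prior is uniform on (0, 10], and the four sample values are hypothetical. The function names are not from the slides.

```python
def recursive_update(post, thetas, x, likelihood):
    """One step: p(theta / D^n) is proportional to p(x_n / theta) p(theta / D^(n-1))."""
    step = thetas[1] - thetas[0]
    unnorm = [likelihood(x, t) * p for t, p in zip(thetas, post)]
    total = sum(unnorm) * step          # numerical normalizing integral
    return [u / total for u in unnorm]

def uniform_likelihood(x, theta):
    """p(x / theta) for x ~ U(0, theta); an assumed model for illustration."""
    return 1.0 / theta if 0.0 <= x <= theta else 0.0

thetas = [0.01 * i for i in range(1, 1001)]  # grid over (0, 10]
post = [0.1] * len(thetas)                   # p(theta / D^0) = p(theta) = U(0, 10)
for x in [4.0, 7.0, 2.0, 8.0]:               # hypothetical samples
    post = recursive_update(post, thetas, x, uniform_likelihood)

peak = thetas[post.index(max(post))]         # location of the posterior peak
```

Under this assumed model every θ smaller than the largest sample seen so far gets zero posterior mass, and the remaining mass decays as θ grows, so the peak sits at the largest sample.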

23
Recursive Bayes Learning -Example p(θ)

24
Recursive Bayesian Learning (cont’d) (x 4 =8) In general:

25
Recursive Bayesian Learning (cont’d)
[Figure: p(θ/D^n) over successive iterations, from p(θ/D^0) to p(θ/D^4), with the peak of p(θ/D^4), the ML estimate, and the Bayesian estimate marked.]

26
Multiple Peaks
For most choices of p(x/θ), p(θ/D^n) will peak strongly at some $\hat{\theta}$ given enough samples; in this case:
$$p(x/D^n) \approx p(x/\hat{\theta})$$
There might be cases, however, where θ cannot be determined uniquely from p(x/θ); in this case, p(θ/D^n) will contain multiple peaks.
The solution p(x/D^n) should be obtained by integration in this case:
$$p(x/D^n) = \int p(x/\theta)\,p(\theta/D^n)\,d\theta$$

27
ML vs Bayesian Estimation
Number of training data
– The two methods are equivalent given an infinite amount of training data (and prior distributions that do not exclude the true solution).
– For small training data sets, they give different results in most cases.
Computational complexity
– ML uses differential calculus or gradient search to maximize the likelihood.
– Bayesian estimation requires complex multidimensional integration techniques.

28
ML vs Bayesian Estimation (cont’d)
Solution complexity
– ML solutions are easier to interpret (i.e., they must be of the assumed parametric form).
– A Bayesian estimation solution might not be of the parametric form assumed.
Prior distribution
– If the prior distribution p(θ) is uniform, p(θ/D) is proportional to the likelihood p(D/θ), so the Bayesian solution peaks at the ML solution.
– In general, the two methods will give different solutions.

29
ML vs Bayesian Estimation (cont’d)
General comments
– There are strong theoretical and methodological arguments supporting Bayesian estimation.
– In practice, ML estimation is simpler and can lead to comparable performance.

30
Computational Complexity
dimensionality: d, # training data: n, # classes: c
ML estimation, learning complexity (n > d)
[Figure: discriminant function with per-term complexity labels O(dn), O(d²n), O(d²), O(1), O(n), O(d³), d(d+1)/2, O(dn)]
These computations must be repeated c times!

31
Computational Complexity
dimensionality: d, # training data: n, # classes: c
Classification complexity
[Figure: discriminant function with per-term complexity labels O(1), O(d²)]
These computations must be repeated c times, and then the max is taken.
Bayesian estimation: higher learning complexity, same classification complexity.

32
Main Sources of Error in Classifier Design
Bayes error
– The error due to overlapping densities p(x/ω_i).
Model error
– The error due to choosing an incorrect model.
Estimation error
– The error due to incorrectly estimated parameters (e.g., due to a small number of training examples).

33
Overfitting
When the number of training examples is inadequate, the solution obtained might not be optimal.
Consider the problem of fitting a curve to some data:
– Points were selected from a parabola (plus noise).
– A 10th degree polynomial fits the data perfectly but does not generalize well.
A greater error on the training data might improve generalization!
Need more training examples than the number of model parameters!
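The curve-fitting example can be sketched with NumPy. The setup is an assumption for illustration: 11 training points drawn from y = x² plus Gaussian noise, a separate noisy test set from the same parabola, and least-squares fits via `np.polyfit`. The noise level, seed, and point counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_parabola(x, rng, noise_std=0.5):
    """Samples from y = x^2 plus zero-mean Gaussian noise."""
    return x ** 2 + rng.normal(0.0, noise_std, size=x.shape)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train = np.linspace(-1.0, 1.0, 11)   # 11 points: a degree-10 fit can interpolate them
y_train = noisy_parabola(x_train, rng)
x_test = np.linspace(-0.95, 0.95, 50)  # held-out points from the same parabola
y_test = noisy_parabola(x_test, rng)

fit2 = np.polyfit(x_train, y_train, deg=2)    # matches the true model order
fit10 = np.polyfit(x_train, y_train, deg=10)  # as many coefficients as data points

print("train MSE:", mse(fit2, x_train, y_train), mse(fit10, x_train, y_train))
print("test  MSE:", mse(fit2, x_test, y_test), mse(fit10, x_test, y_test))
```

The degree-10 fit drives the training error to essentially zero because it has 11 coefficients for 11 points; on held-out points it typically does worse than the quadratic, since it has fitted the noise as well.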

34
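Overfitting (cont’d)
Control model complexity.
– Assume a diagonal covariance matrix (i.e., uncorrelated features).
– Use the same covariance matrix for all classes and consolidate the data.
– Shrinkage technique:
Shrink the common covariance matrix to the identity matrix:
$$\Sigma(\beta) = (1 - \beta)\,\Sigma + \beta I, \qquad 0 < \beta < 1$$
Shrink the individual covariance matrices to the same covariance:
$$\Sigma_i(\alpha) = \frac{(1 - \alpha)\,n_i\,\Sigma_i + \alpha\,n\,\Sigma}{(1 - \alpha)\,n_i + \alpha\,n}, \qquad 0 < \alpha < 1$$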
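A minimal sketch of the two shrinkage steps, assuming the forms given in Duda et al.: Σ_i(α) = [(1-α) n_i Σ_i + α n Σ] / [(1-α) n_i + α n] and Σ(β) = (1-β) Σ + β I, with α and β between 0 and 1. The function names, sample counts, and matrices are hypothetical.

```python
import numpy as np

def shrink_to_common(Sigma_i, n_i, Sigma, n, alpha):
    """Shrink a class covariance Sigma_i toward the common covariance Sigma.

    n_i: number of samples in class i; n: total number of samples.
    alpha = 0 returns Sigma_i, alpha = 1 returns Sigma.
    """
    weight_i = (1.0 - alpha) * n_i
    weight_common = alpha * n
    return (weight_i * Sigma_i + weight_common * Sigma) / (weight_i + weight_common)

def shrink_to_identity(Sigma, beta):
    """Shrink the common covariance toward the identity matrix.

    beta = 0 returns Sigma, beta = 1 returns the identity.
    """
    return (1.0 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

# Hypothetical covariances for a 2-D problem.
Sigma_1 = np.array([[3.0, 1.2], [1.2, 0.5]])
Sigma_common = np.array([[2.0, 0.4], [0.4, 1.0]])
Sigma_1_shrunk = shrink_to_common(Sigma_1, n_i=20, Sigma=Sigma_common, n=100, alpha=0.3)
Sigma_common_shrunk = shrink_to_identity(Sigma_common, beta=0.1)
```

In practice α and β would be tuned, for example by cross-validation; the values above are placeholders.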
