1
**236607 Visual Recognition Tutorial**

- Maximum likelihood – an example
- Maximum likelihood – another example
- Bayesian estimation
- Expectation Maximization Algorithm
- Jensen’s inequality
- EM for a mixture model

Visual Recognition Tutorial

2
**Bayesian Estimation: General Theory**

Bayesian learning considers $\theta$ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data $X^{(n)}$, we can use Bayes’ formula to find the posterior $p(\theta\mid X^{(n)})$. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.

3
**Bayesian parametric estimation**

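The density function for $x$, given the training data set $X^{(n)}$ (it was defined in Lecture 2), is

$$p(x\mid X^{(n)})=\int p(x,\theta\mid X^{(n)})\,d\theta .$$

From the definition of conditional probability densities,

$$p(x,\theta\mid X^{(n)})=p(x\mid\theta,X^{(n)})\,p(\theta\mid X^{(n)}).$$

The first factor is independent of $X^{(n)}$, since it is just our assumed form for the parameterized density: $p(x\mid\theta,X^{(n)})=p(x\mid\theta)$. Therefore

$$p(x\mid X^{(n)})=\int p(x\mid\theta)\,p(\theta\mid X^{(n)})\,d\theta .$$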

4
**Bayesian parametric estimation**

Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$. If the weighting factor $p(\theta\mid X^{(n)})$, which is the posterior of $\theta$, peaks very sharply about some value $\hat\theta$, we obtain

$$p(x\mid X^{(n)})\approx p(x\mid\hat\theta).$$

Thus the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$.

5
**Bayesian decision making**

Suppose we know the distribution of possible values of $\theta$, that is, a prior $p(\theta)$. Suppose we also have a loss function $L(\hat\theta,\theta)$ which measures the penalty for estimating $\hat\theta$ when the actual value is $\theta$. Then we may formulate the estimation problem as Bayesian decision making: choose the value of $\hat\theta$ which minimizes the risk

$$R(\hat\theta\mid X^{(n)})=\int L(\hat\theta,\theta)\,p(\theta\mid X^{(n)})\,d\theta .$$

Note that the loss function is usually continuous.

6
**Maximum A-Posteriori (MAP) Estimation**

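Let us look at $\hat\theta$: the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$. This “most likely value” is given by

$$\hat\theta=\arg\max_\theta\, p(\theta\mid X^{(n)})=\arg\max_\theta\,\frac{p(X^{(n)}\mid\theta)\,p(\theta)}{p(X^{(n)})}.$$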

7
**Maximum A-Posteriori (MAP) Estimation**

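The likelihood factorizes,

$$p(X^{(n)}\mid\theta)=\prod_{i=1}^{n}p(x_i\mid\theta),$$

since the data are i.i.d. We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum, because it does not depend on $\theta$.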

8
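**MAP (continued)**

So the $\hat\theta$ we are looking for is

$$\hat\theta=\arg\max_\theta\Big[p(\theta)\prod_{i=1}^{n}p(x_i\mid\theta)\Big]=\arg\max_\theta\Big[\log p(\theta)+\sum_{i=1}^{n}\log p(x_i\mid\theta)\Big].$$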

9
**Maximum likelihood**

In the MAP estimator, the larger $n$ (the size of the data), the less important $\log p(\theta)$ is in the expression $\log p(\theta)+\sum_{i=1}^{n}\log p(x_i\mid\theta)$. This can motivate us to omit the prior. What we get is the maximum likelihood (ML) method. Informally: we do not use any prior knowledge about the parameters; we seek those values that “explain” the data in the best way:

$$\hat\theta=\arg\max_\theta\sum_{i=1}^{n}\log p(x_i\mid\theta).$$

Here $\log p(X^{(n)}\mid\theta)$ is the log-likelihood of $\theta$ with respect to $X^{(n)}$. We seek a maximum of the likelihood function, of the log-likelihood, or of any monotonically increasing function of them.
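The claim that the prior matters less as $n$ grows can be checked numerically. The sketch below is an illustration only and is not taken from the slides: it assumes a Gaussian likelihood with known variance `sigma2` and a Gaussian prior `N(mu0, tau2)` on the mean, for which the MAP estimate has a closed form. All function and parameter names are hypothetical.

```python
import random

def ml_mean(xs):
    """ML estimate of a Gaussian mean: the sample average."""
    return sum(xs) / len(xs)

def map_mean(xs, sigma2, mu0, tau2):
    """MAP estimate of a Gaussian mean under an assumed N(mu0, tau2) prior.

    Obtained by setting the derivative of
    log p(theta) + sum_i log p(x_i | theta) to zero.
    """
    n = len(xs)
    return (tau2 * sum(xs) + sigma2 * mu0) / (n * tau2 + sigma2)

def gap_between_map_and_ml(n, seed=0):
    """Absolute difference between the MAP and ML estimates for n samples."""
    rng = random.Random(seed)
    xs = [rng.gauss(3.0, 1.0) for _ in range(n)]  # synthetic data, true mean 3
    return abs(map_mean(xs, sigma2=1.0, mu0=0.0, tau2=1.0) - ml_mean(xs))

if __name__ == "__main__":
    for n in (5, 50, 5000):
        print(n, gap_between_map_and_ml(n))
```

With these assumed settings the gap shrinks roughly like $1/n$, which is the sense in which the prior term becomes negligible.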

10
**Maximum likelihood – an example**

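Let us find the ML estimator for the parameter $\theta$ of the exponential density function $p(x\mid\theta)=\frac{1}{\theta}e^{-x/\theta}$, $x\ge 0$:

$$\hat\theta=\arg\max_\theta\prod_{i=1}^{n}\frac{1}{\theta}e^{-x_i/\theta}=\arg\max_\theta\sum_{i=1}^{n}\Big(-\log\theta-\frac{x_i}{\theta}\Big),$$

so we are actually looking for the maximum of the log-likelihood. Observe:

$$\frac{d}{d\theta}\sum_{i=1}^{n}\Big(-\log\theta-\frac{x_i}{\theta}\Big)=-\frac{n}{\theta}+\frac{1}{\theta^{2}}\sum_{i=1}^{n}x_i .$$

The maximum is achieved where

$$\hat\theta=\frac{1}{n}\sum_{i=1}^{n}x_i .$$

We have got the empirical mean (average).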
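A quick numerical check of this example, as a sketch. It assumes the mean parameterization $p(x\mid\theta)=\frac{1}{\theta}e^{-x/\theta}$, which is the form under which the ML estimate is exactly the empirical mean; the function names are hypothetical.

```python
import math

def exp_log_likelihood(theta, xs):
    """Log-likelihood of theta for the assumed density (1/theta) * exp(-x/theta)."""
    return -len(xs) * math.log(theta) - sum(xs) / theta

def exp_ml_estimate(xs):
    """Closed-form ML estimate under that parameterization: the sample mean."""
    return sum(xs) / len(xs)

if __name__ == "__main__":
    xs = [0.5, 1.5, 2.0, 4.0]
    theta_hat = exp_ml_estimate(xs)
    # The log-likelihood should be lower on both sides of theta_hat.
    for theta in (theta_hat - 0.1, theta_hat, theta_hat + 0.1):
        print(theta, exp_log_likelihood(theta, xs))
```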

11
**Maximum likelihood – another example**

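Let us find the ML estimator for $\theta$ in the density $p(x\mid\theta)=\frac{1}{2}e^{-|x-\theta|}$. Observe:

$$\log p(X^{(n)}\mid\theta)=-n\log 2-\sum_{i=1}^{n}|x_i-\theta|,\qquad \frac{d}{d\theta}\log p(X^{(n)}\mid\theta)=\sum_{i=1}^{n}\operatorname{sign}(x_i-\theta).$$

The maximum is at the $\theta$ where

$$\sum_{i=1}^{n}\operatorname{sign}(x_i-\theta)=0,$$

that is, where as many samples lie above $\theta$ as below it. This is the median of the sampled data.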
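The median result can be illustrated with a brute-force search. This sketch assumes the density is the unit-scale Laplace form $\frac{1}{2}e^{-|x-\theta|}$ (the slide's formula did not survive extraction, so this is an assumption); the helper names and the grid are hypothetical.

```python
import math
import statistics

def laplace_log_likelihood(theta, xs):
    """Log-likelihood of theta for the assumed density 0.5 * exp(-|x - theta|)."""
    return -len(xs) * math.log(2.0) - sum(abs(x - theta) for x in xs)

def grid_argmax(xs, lo=0.0, hi=30.0, step=0.1):
    """Brute-force maximizer of the log-likelihood over a regular grid."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + step * k for k in range(n_steps + 1)]
    return max(grid, key=lambda t: laplace_log_likelihood(t, xs))

if __name__ == "__main__":
    xs = [1.0, 2.0, 7.0, 9.0, 30.0]
    print("median      :", statistics.median(xs))
    print("grid argmax :", grid_argmax(xs))
    print("mean        :", sum(xs) / len(xs))
```

Note that the outlier 30.0 pulls the mean but not the median, so the two estimators visibly disagree on this sample.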

12
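**Bayesian estimation - revisited**

We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?

Example 1: $L(\hat\theta,\theta)=|\hat\theta-\theta|$ ($\theta$ is unidimensional). The total Bayesian risk here is

$$R(\hat\theta)=\int_{-\infty}^{\hat\theta}(\hat\theta-\theta)\,p(\theta\mid X^{(n)})\,d\theta+\int_{\hat\theta}^{\infty}(\theta-\hat\theta)\,p(\theta\mid X^{(n)})\,d\theta .$$

We seek its minimum:

$$\frac{dR}{d\hat\theta}=\int_{-\infty}^{\hat\theta}p(\theta\mid X^{(n)})\,d\theta-\int_{\hat\theta}^{\infty}p(\theta\mid X^{(n)})\,d\theta=0 .$$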

13
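**Bayesian estimation - continued**

At the $\hat\theta$ which is a solution we have

$$\int_{-\infty}^{\hat\theta}p(\theta\mid X^{(n)})\,d\theta=\int_{\hat\theta}^{\infty}p(\theta\mid X^{(n)})\,d\theta=\frac{1}{2}.$$

That is, for the absolute-error loss the optimal Bayesian estimator for the parameter is the median of the distribution $p(\theta\mid X^{(n)})$.

Example 2: $L(\hat\theta,\theta)=(\hat\theta-\theta)^{2}$ (squared error). Total Bayesian risk:

$$R(\hat\theta)=\int(\hat\theta-\theta)^{2}\,p(\theta\mid X^{(n)})\,d\theta .$$

Again, in order to find the minimum, set the derivative equal to 0:

$$\frac{dR}{d\hat\theta}=2\int(\hat\theta-\theta)\,p(\theta\mid X^{(n)})\,d\theta=0 .$$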

14
**Bayesian estimation - continued**

$$\hat\theta=\int\theta\,p(\theta\mid X^{(n)})\,d\theta=E[\theta\mid X^{(n)}].$$

The optimal estimator here is the conditional expectation of $\theta$ given the data $X^{(n)}$.
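The three loss functions discussed above (0/1, absolute, squared) can be compared on a discretized posterior. This is a sketch under stated assumptions: the posterior shape $\theta e^{-\theta}$ on a grid is an arbitrary skewed example chosen for illustration (its mode, median and mean are about 1, 1.68 and 2), and every name below is hypothetical.

```python
import math

def gamma_like_posterior(step=0.02, upper=20.0):
    """Discretize the assumed density theta * exp(-theta) on a grid and normalize."""
    grid = [step * k for k in range(1, int(round(upper / step)) + 1)]
    weights = [t * math.exp(-t) for t in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

def zero_one_loss(t_hat, t):
    return 0.0 if t_hat == t else 1.0

def absolute_loss(t_hat, t):
    return abs(t_hat - t)

def squared_loss(t_hat, t):
    return (t_hat - t) ** 2

def bayes_estimate(grid, post, loss):
    """Return the grid point that minimizes the posterior expected loss (the risk)."""
    def risk(t_hat):
        return sum(loss(t_hat, t) * p for t, p in zip(grid, post))
    return min(grid, key=risk)

if __name__ == "__main__":
    grid, post = gamma_like_posterior()
    print("0/1 loss      ->", bayes_estimate(grid, post, zero_one_loss))   # near the mode
    print("absolute loss ->", bayes_estimate(grid, post, absolute_loss))   # near the median
    print("squared loss  ->", bayes_estimate(grid, post, squared_loss))    # near the mean
```

The exhaustive search is quadratic in the grid size, which is acceptable for a one-dimensional illustration but not a practical method.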

15
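**Jensen’s inequality**

Definition: a function $f$ is convex over $(a,b)$ if for all $x_1,x_2\in(a,b)$ and $0\le\lambda\le 1$

$$f(\lambda x_1+(1-\lambda)x_2)\le\lambda f(x_1)+(1-\lambda)f(x_2).$$

[Figure: a convex function and a concave function.]

Jensen’s inequality: for a convex function $f$ and a random variable $X$,

$$E[f(X)]\ge f(E[X]).$$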

16
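**Jensen’s inequality**

For a discrete random variable with two mass points $x_1,x_2$ and probabilities $p_1,p_2$, the inequality

$$p_1f(x_1)+p_2f(x_2)\ge f(p_1x_1+p_2x_2)$$

is just the definition of convexity. Suppose Jensen’s inequality holds for $k-1$ mass points, and let $p_i'=p_i/(1-p_k)$ for $i=1,\dots,k-1$. Then

$$\sum_{i=1}^{k}p_if(x_i)=p_kf(x_k)+(1-p_k)\sum_{i=1}^{k-1}p_i'f(x_i)\ge p_kf(x_k)+(1-p_k)\,f\Big(\sum_{i=1}^{k-1}p_i'x_i\Big)$$

due to the induction assumption, and

$$p_kf(x_k)+(1-p_k)\,f\Big(\sum_{i=1}^{k-1}p_i'x_i\Big)\ge f\Big(p_kx_k+(1-p_k)\sum_{i=1}^{k-1}p_i'x_i\Big)=f\Big(\sum_{i=1}^{k}p_ix_i\Big)$$

due to convexity.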

17
**Jensen’s inequality corollary**

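Let $q_i\ge 0$ with $\sum_i q_i=1$, and let $a_i>0$. The function $\log$ is concave, so from Jensen’s inequality (with the direction reversed) we have:

$$\log\sum_i q_ia_i\ge\sum_i q_i\log a_i .$$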
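Both directions of Jensen's inequality can be checked on a small discrete distribution. The sketch below is illustrative; the mass points, probabilities and function names are made up for the example.

```python
import math

def expectation(values, probs):
    """E[X] for a discrete random variable with the given mass points."""
    return sum(p * v for v, p in zip(values, probs))

def jensen_gap(f, values, probs):
    """E[f(X)] - f(E[X]): non-negative for convex f, non-positive for concave f."""
    return expectation([f(v) for v in values], probs) - f(expectation(values, probs))

if __name__ == "__main__":
    values = [1.0, 2.0, 4.0, 8.0]
    probs = [0.1, 0.2, 0.3, 0.4]
    print("convex  x^2 :", jensen_gap(lambda x: x * x, values, probs))
    print("concave log :", jensen_gap(math.log, values, probs))
```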

18
**EM Algorithm**

EM is an iterative technique designed for probabilistic models. We have:

- two sample spaces: $X$, which is observed, and $Y$, which is missing;
- a vector of parameters $\theta$, which gives a distribution of $X$.

We should find $\hat\theta=\arg\max_\theta p(X\mid\theta)$ or, equivalently, $\hat\theta=\arg\max_\theta\log p(X\mid\theta)$.

19
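**EM Algorithm**

The problem is that calculating $p(X\mid\theta)$ is difficult, but calculating $p(X,Y\mid\theta)$ is relatively easy. We define:

$$Q(\theta,\theta^{old})=E_Y\big[\log p(X,Y\mid\theta)\,\big|\,X,\theta^{old}\big].$$

The algorithm cyclically makes two steps:

- E: Compute $Q(\theta,\theta^{old})$ (see (10) below).
- M: $\theta^{new}=\arg\max_\theta Q(\theta,\theta^{old})$.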

20
**EM Algorithm**

[Figure: maximizing a function with a lower-bound approximation vs. a linear approximation.]

21
**EM Algorithm**

Gradient descent makes a linear approximation to the objective function (O.F.); Newton’s method makes a quadratic approximation. In both cases the optimal step size is not known. EM instead makes a local approximation that is a lower bound (l.b.) to the O.F. Choosing a new guess that maximizes the l.b. will always be an improvement, if the gradient is not zero. Thus there are two steps: E, compute a l.b.; M, maximize the l.b. The bound used by EM follows from Jensen’s inequality.

22
**The General EM Algorithm**

We should maximize the function

$$f(\theta)=p(X,\theta)=\int p(X,y,\theta)\,dy,$$

where $X$ is a matrix of observed data. If $f(\theta)$ is simple, we find the maximum by equating its gradient to zero. But if $f(\theta)$ is a mixture (of simple functions), this is difficult. This is the situation for EM. Given a guess for $\theta$, find a lower bound for $f(\theta)$ with a function $g(\theta,q(y))$, parameterized by free variables $q(y)$.

23
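**EM Algorithm**

By Jensen’s inequality,

$$f(\theta)=\int q(y)\,\frac{p(X,y,\theta)}{q(y)}\,dy\ \ge\ g(\theta,q)=\prod_y\Big(\frac{p(X,y,\theta)}{q(y)}\Big)^{q(y)},$$

provided $\int q(y)\,dy=1$. Define

$$G(\theta,q)=\log g(\theta,q)=\int\big[q(y)\log p(X,y,\theta)-q(y)\log q(y)\big]\,dy .$$

If we want the lower bound $g(\theta,q)$ to touch $f$ at the current guess for $\theta$, we choose $q$ to maximize $G(\theta,q)$.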

24
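**EM Algorithm**

Adding the Lagrange multiplier $\lambda$ for the constraint on $q$ gives:

$$G(\theta,q)=\lambda\Big(1-\int q(y)\,dy\Big)+\int\big[q(y)\log p(X,y,\theta)-q(y)\log q(y)\big]\,dy,$$

$$\frac{\partial G}{\partial q(y)}=-\lambda-1+\log p(X,y,\theta)-\log q(y)=0,$$

$$q(y)=\frac{p(X,y,\theta)}{\int p(X,y,\theta)\,dy}=p(y\mid X,\theta).$$

For this choice the bound becomes

$$g(\theta,q)=\prod_y\Big(\frac{p(X,y,\theta)}{p(y\mid X,\theta)}\Big)^{q(y)}=\prod_y p(X,\theta)^{q(y)}=p(X,\theta)^{\int q(y)\,dy}=p(X,\theta).$$

So indeed it touches the objective $f(\theta)$.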
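The fact that the bound touches the objective when $q$ is the posterior over the hidden variable can be verified numerically for a discrete hidden variable. In this sketch the list `joint` stands for made-up values of $p(X,y,\theta)$ at a fixed $\theta$, one entry per value of $y$; the names are hypothetical and the integral becomes a sum.

```python
import math

def lower_bound_G(joint, q):
    """G = sum_y q(y) * log(p(X, y, theta) / q(y)), the log of the Jensen bound."""
    return sum(qy * math.log(py / qy) for py, qy in zip(joint, q) if qy > 0.0)

def posterior(joint):
    """q(y) = p(X, y, theta) / sum_y p(X, y, theta), i.e. p(y | X, theta)."""
    total = sum(joint)
    return [py / total for py in joint]

if __name__ == "__main__":
    joint = [0.03, 0.12, 0.05]           # illustrative values of p(X, y, theta)
    log_f = math.log(sum(joint))         # log of the objective f(theta)
    uniform = [1.0 / len(joint)] * len(joint)
    print("log f            :", log_f)
    print("G with uniform q :", lower_bound_G(joint, uniform))
    print("G with posterior :", lower_bound_G(joint, posterior(joint)))
```

Any other normalized `q` gives a value strictly below `log_f` here, because the ratios `joint[y] / q[y]` are then not all equal.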

25
**EM Algorithm**

Finding $q$ to get a good bound is the “E” step. To get the next guess for $\theta$, we maximize the bound over $\theta$ (this is the “M” step). It is problem-dependent. The relevant term of $G$ is

$$\int q(y)\log p(X,y,\theta)\,dy .$$

It may be difficult, and it is not strictly necessary, to fully maximize the bound over $\theta$; it is enough to improve it. This is sometimes called “generalized EM”. It is clear from the figure that the derivative of $g$ at the current guess is identical to the derivative of $f$.

26
**EM for a mixture model**

We have a mixture of two one-dimensional Gaussians ($k=2$). Let the mixture coefficients be equal: $\pi_1=\pi_2=\frac{1}{2}$. Let the variances be equal and known: $\sigma_1^{2}=\sigma_2^{2}=\sigma^{2}$. The problem is to find $\theta=(\mu_1,\mu_2)$. We have the sample set $X=\{x_1,\dots,x_n\}$.

27
**EM for a mixture model**

To use the EM algorithm, define hidden random variables (indicators)

$$z_{ij}=\begin{cases}1,&x_i\text{ was generated by Gaussian }j\\0,&\text{otherwise.}\end{cases}$$

Thus for every $i$ we have $z_{i1}+z_{i2}=1$. We define every hidden variable: $y_i=(z_{i1},z_{i2})$. The aim is to calculate $Q=E[\log p(X,Y\mid\theta)]$ and to maximize $Q$.

28
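**EM for a mixture model**

For every $x_i$ we have:

$$p(x_i,y_i\mid\theta)=\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\Big(-\frac{1}{2\sigma^{2}}\sum_{j=1}^{2}z_{ij}(x_i-\mu_j)^{2}\Big).$$

From the i.i.d. assumption for the sample set we have:

$$\log p(X,Y\mid\theta)=\sum_{i=1}^{n}\Big[\log\frac{1}{2\sqrt{2\pi\sigma^{2}}}-\frac{1}{2\sigma^{2}}\sum_{j=1}^{2}z_{ij}(x_i-\mu_j)^{2}\Big].$$

We see that the expression is linear in $z_{ij}$.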

29
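**EM for a mixture model**

STEP E: We want to calculate the expected value of $\log p(X,Y\mid\theta)$ relative to $p(Y\mid X,\theta^{old})$. Because the expression is linear in $z_{ij}$, it is enough to compute

$$E[z_{ij}]=\frac{\exp\big(-(x_i-\mu_j^{old})^{2}/2\sigma^{2}\big)}{\sum_{l=1}^{2}\exp\big(-(x_i-\mu_l^{old})^{2}/2\sigma^{2}\big)}$$

and substitute it for $z_{ij}$:

$$Q=\sum_{i=1}^{n}\Big[\log\frac{1}{2\sqrt{2\pi\sigma^{2}}}-\frac{1}{2\sigma^{2}}\sum_{j=1}^{2}E[z_{ij}](x_i-\mu_j)^{2}\Big].$$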

30
**EM for a mixture model**

STEP M: Differentiating $Q$ with respect to $\mu_j$ and equating to zero we have:

$$\sum_{i=1}^{n}E[z_{ij}](x_i-\mu_j)=0 .$$

Thus

$$\mu_j=\frac{\sum_{i=1}^{n}E[z_{ij}]\,x_i}{\sum_{i=1}^{n}E[z_{ij}]}.$$
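The E and M steps for this two-Gaussian example can be sketched in a few lines. The code assumes equal mixing coefficients and a known shared variance `sigma2`, so only the means are updated; function names, the initial guesses and the iteration count are hypothetical choices, not part of the slides.

```python
import math
import random

def e_step(xs, mus, sigma2):
    """Return E[z_ij]: the probability that x_i came from Gaussian j."""
    resp = []
    for x in xs:
        exponents = [-(x - mu) ** 2 / (2.0 * sigma2) for mu in mus]
        top = max(exponents)                      # shift for numerical stability
        w = [math.exp(e - top) for e in exponents]
        total = sum(w)
        resp.append([wj / total for wj in w])
    return resp

def m_step(xs, resp, k):
    """Update each mean as the E[z_ij]-weighted average of the samples."""
    return [
        sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
        for j in range(k)
    ]

def em_two_means(xs, mus, sigma2=1.0, iters=50):
    """Alternate E and M steps a fixed number of times and return the means."""
    for _ in range(iters):
        mus = m_step(xs, e_step(xs, mus, sigma2), len(mus))
    return mus

if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
    print(em_two_means(xs, [-1.0, 1.0]))
```

A fixed iteration count is used instead of a convergence test to keep the sketch short; in practice one would monitor the change in the means or in the log-likelihood.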

31
**EM mixture of Gaussians**

In what follows we use $j$ instead of $y$ because the missing variables are discrete in this example. The model density is a linear combination of component densities $p(x\mid j,\theta)$:

$$p(x)=\sum_{j=1}^{M}P(j)\,p(x\mid j,\theta),\qquad(11)$$

where $M$ is the number of basis functions (a parameter of the model) and $P(j)$ are the mixing parameters. They are actually the prior probabilities of a data point having been generated from component $j$ of the mixture.

32
**EM mixture of Gaussians**

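They satisfy

$$\sum_{j=1}^{M}P(j)=1,\qquad 0\le P(j)\le 1.\qquad(12)$$

The component density functions $p(x\mid j)$ are normalized:

$$\int p(x\mid j)\,dx=1 .$$

We shall use Gaussians for $p(x\mid j)$:

$$p(x\mid j)=\frac{1}{(2\pi\sigma_j^{2})^{d/2}}\exp\Big(-\frac{\|x-\mu_j\|^{2}}{2\sigma_j^{2}}\Big).$$

We should find $P(j)$, $\mu_j$ and $\sigma_j$ for $j=1,\dots,M$.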

33
**EM mixture of Gaussians**

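STEP E: calculate the bound when $q(j)=P^{old}(j\mid x_i)$ (see formulas (8) and (10)). We have:

$$Q=\sum_{i=1}^{n}\sum_{j=1}^{M}P^{old}(j\mid x_i)\log\big[P^{new}(j)\,p^{new}(x_i\mid j)\big],\qquad(17)$$

where $P^{old}(j\mid x_i)=P^{old}(j)\,p^{old}(x_i\mid j)\big/\sum_{k=1}^{M}P^{old}(k)\,p^{old}(x_i\mid k)$. We maximize (17) with constraint (12):

$$\sum_{i=1}^{n}\sum_{j=1}^{M}P^{old}(j\mid x_i)\log\big[P^{new}(j)\,p^{new}(x_i\mid j)\big]-\lambda\Big(\sum_{j=1}^{M}P^{new}(j)-1\Big).\qquad(18)$$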

34
**EM mixture of Gaussians**

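STEP M: The derivative of (18) with respect to $P^{new}(j)$, set to zero:

$$\sum_{i=1}^{n}\frac{P^{old}(j\mid x_i)}{P^{new}(j)}-\lambda=0 .\qquad(19)$$

Thus

$$P^{new}(j)=\frac{1}{\lambda}\sum_{i=1}^{n}P^{old}(j\mid x_i).\qquad(20)$$

Using (12), and summing (20) over $j$, we have

$$\lambda=\sum_{i=1}^{n}\sum_{j=1}^{M}P^{old}(j\mid x_i)=n .\qquad(21)$$

So from (21) and (20):

$$P^{new}(j)=\frac{1}{n}\sum_{i=1}^{n}P^{old}(j\mid x_i).\qquad(22)$$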

35
**EM mixture model. General case**

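By calculating the derivatives of (18) with respect to $\mu_j$ and $\sigma_j$ we have:

$$\mu_j^{new}=\frac{\sum_{i=1}^{n}P^{old}(j\mid x_i)\,x_i}{\sum_{i=1}^{n}P^{old}(j\mid x_i)},\qquad(23)$$

$$(\sigma_j^{new})^{2}=\frac{1}{d}\,\frac{\sum_{i=1}^{n}P^{old}(j\mid x_i)\,\|x_i-\mu_j^{new}\|^{2}}{\sum_{i=1}^{n}P^{old}(j\mid x_i)}.\qquad(24)$$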

36
**EM mixture model. General case**

Algorithm for calculating $p(x)$ (formula (11)).

For every $x$
begin
&nbsp;&nbsp;initialize the parameters
&nbsp;&nbsp;do a fixed number of times
&nbsp;&nbsp;&nbsp;&nbsp;calculate formulas (22), (23), (24)
&nbsp;&nbsp;return formula (11)
end
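The algorithm above can be sketched for one-dimensional data ($d=1$). This is a minimal illustration under assumptions, not the lecture's implementation: the updates are the standard EM updates for a Gaussian mixture (mixing weights as average responsibilities, responsibility-weighted means and variances), which the text refers to as formulas (22) to (24); the variance floor, the initial values and all names are hypothetical additions.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    """One-dimensional Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, P, mu, var):
    """Model density p(x) = sum_j P(j) p(x | j)."""
    return sum(Pj * gaussian_pdf(x, mj, vj) for Pj, mj, vj in zip(P, mu, var))

def em_gmm(xs, P, mu, var, iters=100, var_floor=1e-6):
    """Run a fixed number of EM iterations and return (P, mu, var)."""
    n, M = len(xs), len(P)
    for _ in range(iters):
        # E step: responsibilities P_old(j | x_i) by Bayes' formula.
        resp = []
        for x in xs:
            w = [P[j] * gaussian_pdf(x, mu[j], var[j]) for j in range(M)]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M step: re-estimate mixing weights, means, then variances.
        Nj = [sum(r[j] for r in resp) for j in range(M)]
        P = [Nj[j] / n for j in range(M)]
        mu = [sum(r[j] * x for r, x in zip(resp, xs)) / Nj[j] for j in range(M)]
        var = [
            # var_floor is an assumed safeguard against a collapsing component.
            max(var_floor, sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / Nj[j])
            for j in range(M)
        ]
    return P, mu, var

if __name__ == "__main__":
    rng = random.Random(2)
    xs = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(6.0, 1.0) for _ in range(100)]
    P, mu, var = em_gmm(xs, [0.5, 0.5], [-1.0, 4.0], [1.0, 1.0])
    print(P, mu, var)
    print("p(0.0) =", mixture_density(0.0, P, mu, var))
```

The result depends on the initial guess, since EM only finds a local maximum; the synthetic data here are well separated so that the starting point matters little.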
