 # Visual Recognition Tutorial

## Presentation on theme: "Visual Recognition Tutorial"— Presentation transcript:

236607 Visual Recognition Tutorial
- Maximum likelihood – an example
- Maximum likelihood – another example
- Bayesian estimation
- Expectation Maximization Algorithm
- Jensen’s inequality
- EM for a mixture model

Bayesian Estimation: General Theory
Bayesian learning treats $\theta$ (the parameter vector to be estimated) as a random variable. Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data $X^{(n)}$, we can use Bayes’ formula to find the posterior $p(\theta \mid X^{(n)})$. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.

Bayesian parametric estimation
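The density function for $x$, given the training data set $X^{(n)}$ (it was defined in Lecture 2), is

$$p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)})\, d\theta.$$

From the definition of conditional probability densities,

$$p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)}).$$

The first factor is independent of $X^{(n)}$, since it is just our assumed form for the parameterized density: $p(x \mid \theta, X^{(n)}) = p(x \mid \theta)$. Therefore

$$p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta.$$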

Bayesian parametric estimation
Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$. If the weighting factor $p(\theta \mid X^{(n)})$, which is the posterior of $\theta$, peaks very sharply about some value $\hat\theta$, we obtain

$$p(x \mid X^{(n)}) \approx p(x \mid \hat\theta).$$

Thus the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$.

Bayesian decision making
Suppose we know the distribution of possible values of $\theta$, that is, a prior $p(\theta)$. Suppose we also have a loss function $\ell(\hat\theta, \theta)$ which measures the penalty for estimating $\hat\theta$ when the actual value is $\theta$. Then we may formulate the estimation problem as Bayesian decision making: choose the value of $\hat\theta$ which minimizes the risk

$$R(\hat\theta \mid X^{(n)}) = \int \ell(\hat\theta, \theta)\, p(\theta \mid X^{(n)})\, d\theta.$$

Note that the loss function is usually continuous.

Maximum A-Posteriori (MAP) Estimation
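Let us look at the approximation $p(x \mid X^{(n)}) \approx p(x \mid \hat\theta)$: the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$. This “most likely value” is given by

$$\hat\theta = \arg\max_\theta\, p(\theta \mid X^{(n)}) = \arg\max_\theta \frac{p(X^{(n)} \mid \theta)\, p(\theta)}{\int p(X^{(n)} \mid \theta)\, p(\theta)\, d\theta}.$$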

Maximum A-Posteriori (MAP) Estimation
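The likelihood factorizes,

$$p(X^{(n)} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta),$$

since the data is i.i.d. We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum, because it does not depend on $\theta$.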

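MAP - continued
So, the $\hat\theta$ we are looking for is

$$\hat\theta_{MAP} = \arg\max_\theta \Big[ p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \Big] = \arg\max_\theta \Big[ \log p(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta) \Big].$$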

Maximum likelihood
In the MAP estimator, the larger $n$ (the size of the data), the less important $\log p(\theta)$ is in the expression $\log p(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta)$. This can motivate us to omit the prior. What we get is the maximum likelihood (ML) method. Informally: we don’t use any prior knowledge about the parameters; we seek those values that “explain” the data in the best way:

$$\hat\theta_{ML} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta).$$

Here $L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$ is the log-likelihood of $\theta$ with respect to $X^{(n)}$. We seek a maximum of the likelihood function, of the log-likelihood, or of any monotonically increasing function of them.

Maximum likelihood – an example
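Let us find the ML estimator for the parameter $\theta$ of the exponential density function

$$p(x \mid \theta) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad x \ge 0:$$

$$\hat\theta = \arg\max_\theta \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-x_i/\theta} = \arg\max_\theta \sum_{i=1}^{n} \Big( -\log\theta - \frac{x_i}{\theta} \Big),$$

so we are actually looking for the maximum of the log-likelihood. Observe:

$$\frac{d}{d\theta} \sum_{i=1}^{n} \Big( -\log\theta - \frac{x_i}{\theta} \Big) = -\frac{n}{\theta} + \frac{1}{\theta^2} \sum_{i=1}^{n} x_i.$$

The maximum is achieved where this derivative vanishes:

$$\hat\theta = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

We have got the empirical mean (average).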
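As a quick numerical check of this result, the sketch below compares the closed-form estimate (the sample mean) with a brute-force grid search over the log-likelihood. It assumes the mean parameterization $p(x \mid \theta) = e^{-x/\theta}/\theta$; the function name, the sample size, the true mean of 2.0 and the grid are illustrative choices, not part of the slides.

```python
import numpy as np

def exp_log_likelihood(theta, x):
    """Log-likelihood of i.i.d. samples x under p(x|theta) = exp(-x/theta)/theta."""
    x = np.asarray(x, dtype=float)
    return -x.size * np.log(theta) - x.sum() / theta

rng = np.random.default_rng(0)
samples = rng.exponential(scale=2.0, size=1000)   # synthetic data, true mean 2.0

theta_ml = samples.mean()                          # closed-form ML estimate
grid = np.linspace(0.5, 5.0, 2001)                 # candidate values of theta
scores = [exp_log_likelihood(t, samples) for t in grid]
theta_grid = grid[int(np.argmax(scores))]          # brute-force maximiser

print(theta_ml, theta_grid)                        # the two should nearly coincide
```

The two printed values should agree up to the grid resolution, which is what the derivative argument predicts.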

Maximum likelihood – another example
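Let us find the ML estimator for the location parameter $\theta$ of the density

$$p(x \mid \theta) = \frac{1}{2}\, e^{-|x - \theta|}.$$

Observe:

$$\hat\theta = \arg\max_\theta \sum_{i=1}^{n} \Big( -\log 2 - |x_i - \theta| \Big) = \arg\min_\theta \sum_{i=1}^{n} |x_i - \theta|.$$

The maximum is at the $\theta$ where

$$\sum_{i=1}^{n} \operatorname{sign}(x_i - \theta) = 0,$$

that is, where as many samples lie above $\theta$ as below it. This is the median of the sampled data.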
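The median result can be checked numerically in the same spirit. The sketch below assumes the density $p(x \mid \theta) = \tfrac{1}{2} e^{-|x-\theta|}$ and uses synthetic data with an odd sample size so that the median is unique; all names and constants are illustrative.

```python
import numpy as np

def laplace_log_likelihood(theta, x):
    """Log-likelihood of samples x under p(x|theta) = 0.5 * exp(-|x - theta|)."""
    x = np.asarray(x, dtype=float)
    return -x.size * np.log(2.0) - np.abs(x - theta).sum()

rng = np.random.default_rng(1)
data = rng.laplace(loc=3.0, scale=1.0, size=501)   # odd size: unique median

median = np.median(data)
# The log-likelihood is piecewise linear in theta with kinks at the data points,
# so it is enough to search over the data points themselves.
scores = [laplace_log_likelihood(t, data) for t in data]
best = data[int(np.argmax(scores))]

print(median, best)
```

Searching only over the data points is a design shortcut that works because the objective is piecewise linear; for an even sample size any point between the two middle samples would be a maximiser.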

Bayesian estimation -revisited
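We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?

Example 1: $\ell(\hat\theta, \theta) = |\hat\theta - \theta|$ ($\theta$ is unidimensional). The total Bayesian risk here is

$$R(\hat\theta) = \int p(\theta \mid X^{(n)})\, |\hat\theta - \theta|\, d\theta = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, (\hat\theta - \theta)\, d\theta + \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, (\theta - \hat\theta)\, d\theta.$$

We seek its minimum:

$$\frac{dR}{d\hat\theta} = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta - \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta = 0.$$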

Bayesian estimation -continued
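At the $\hat\theta$ which is a solution we have

$$\int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta = \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta = \frac{1}{2}.$$

That is, for the absolute-error loss $\ell(\hat\theta, \theta) = |\hat\theta - \theta|$ the optimal Bayesian estimator for the parameter is the median of the distribution $p(\theta \mid X^{(n)})$.

Example 2: $\ell(\hat\theta, \theta) = (\hat\theta - \theta)^2$ (squared error). Total Bayesian risk:

$$R(\hat\theta) = \int p(\theta \mid X^{(n)})\, (\hat\theta - \theta)^2\, d\theta.$$

Again, in order to find the minimum, let the derivative be equal to 0:

$$\frac{dR}{d\hat\theta} = 2 \int p(\theta \mid X^{(n)})\, (\hat\theta - \theta)\, d\theta = 0.$$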

Bayesian estimation -continued
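Since $\int p(\theta \mid X^{(n)})\, d\theta = 1$, this gives

$$\hat\theta = \int \theta\, p(\theta \mid X^{(n)})\, d\theta = E[\theta \mid X^{(n)}].$$

The optimal estimator here is the conditional expectation of $\theta$ given the data $X^{(n)}$.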
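Both results (median for absolute error, conditional mean for squared error) can be illustrated with a small Monte Carlo sketch. The gamma-distributed draws below are only a stand-in for samples from some skewed posterior $p(\theta \mid X^{(n)})$; the distribution, the grid of candidate estimates and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for draws from a skewed posterior p(theta | X); purely illustrative.
theta_draws = rng.gamma(shape=2.0, scale=1.5, size=20001)

candidates = np.linspace(0.0, 10.0, 2001)          # candidate estimates theta_hat
# Monte Carlo estimate of the Bayesian risk of every candidate under each loss.
abs_risk = np.array([np.abs(c - theta_draws).mean() for c in candidates])
sq_risk = np.array([((c - theta_draws) ** 2).mean() for c in candidates])

best_abs = candidates[abs_risk.argmin()]           # minimiser of absolute-error risk
best_sq = candidates[sq_risk.argmin()]             # minimiser of squared-error risk

print(best_abs, np.median(theta_draws))            # should be close to each other
print(best_sq, theta_draws.mean())                 # should be close to each other
```

A skewed distribution is used on purpose: its mean and median differ, so the two loss functions visibly select different estimators.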

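Jensen’s inequality
Definition: a function $f$ is convex over $(a, b)$ if for all $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$

A function $f$ is concave if $-f$ is convex. (Figure: a convex function and a concave function.)

Jensen’s inequality: for a convex function $f$ and a random variable $X$,

$$E[f(X)] \ge f(E[X]).$$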

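Jensen’s inequality
For a discrete random variable (d.r.v.) with two mass points, the inequality

$$p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2)$$

is exactly the definition of convexity. Suppose Jensen’s inequality holds for $k - 1$ mass points, and let $p_i' = p_i / (1 - p_k)$ for $i = 1, \dots, k - 1$. Then

$$\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i) \ge p_k f(x_k) + (1 - p_k)\, f\Big( \sum_{i=1}^{k-1} p_i' x_i \Big)$$

due to the induction assumption, and

$$p_k f(x_k) + (1 - p_k)\, f\Big( \sum_{i=1}^{k-1} p_i' x_i \Big) \ge f\Big( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \Big) = f\Big( \sum_{i=1}^{k} p_i x_i \Big)$$

due to convexity.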

Jensen’s inequality corollary
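Let $p_i \ge 0$ with $\sum_i p_i = 1$, and let $x_i > 0$. The function $\log$ is concave, so from Jensen’s inequality (with the direction reversed for a concave function) we have:

$$\log \sum_i p_i x_i \ge \sum_i p_i \log x_i.$$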
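A small numerical illustration of the corollary, assuming it takes the usual form $\log \sum_i p_i x_i \ge \sum_i p_i \log x_i$: the weights and mass points below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(5))          # random probability vector (sums to 1)
x = rng.uniform(0.1, 10.0, size=5)     # positive mass points

lhs = np.log(np.dot(p, x))             # log of the expectation, log E[x]
rhs = np.dot(p, np.log(x))             # expectation of the log, E[log x]

print(lhs, rhs, lhs >= rhs)
```

Equality would hold only if all mass points with non-zero weight coincide.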

EM Algorithm
EM is an iterative technique designed for probabilistic models. We have:

- two sample spaces: $X$, which is observed, and $Y$, which is missing;
- a vector of parameters $\theta$, which gives a distribution of $X$.

We should find

$$\hat\theta = \arg\max_\theta\, p(X \mid \theta) \quad \text{or} \quad \hat\theta = \arg\max_\theta\, \log p(X \mid \theta).$$

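EM Algorithm
The problem is that calculating $p(X \mid \theta)$ is difficult, but calculating $p(X, Y \mid \theta)$ is relatively easy. We define:

$$Q(\theta, \theta^{(t)}) = E_Y\big[ \log p(X, Y \mid \theta) \,\big|\, X, \theta^{(t)} \big].$$

The algorithm cyclically performs two steps:

- E: compute $Q(\theta, \theta^{(t)})$ (see (10) below).
- M: $\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$.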

EM Algorithm
(Figure: maximizing a function with a lower-bound approximation vs. a linear approximation.)

EM Algorithm
Gradient descent makes a linear approximation to the objective function (O.F.); Newton’s method makes a quadratic approximation. But the optimal step size is not known. EM instead makes a local approximation that is a lower bound (l.b.) to the O.F. Choosing a new guess that maximizes the l.b. will always be an improvement, provided the gradient is not zero. Thus there are two steps: E, compute a l.b.; M, maximize the l.b. The bound used by EM follows from Jensen’s inequality.

The General EM Algorithm
We want to maximize the function

$$f(\theta) = p(X, \theta) = \int p(X, y, \theta)\, dy,$$

where $X$ is a matrix of observed data. If $f(\theta)$ is simple, we find the maximum by equating its gradient to zero. But if $f(\theta)$ is a mixture (of simple functions), this is difficult. This is the situation for EM. Given a guess for $\theta$, find a lower bound for $f(\theta)$ with a function $g(\theta, q(y))$, parameterized by free variables $q(y)$.

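EM Algorithm
By Jensen’s inequality,

$$f(\theta) = \int q(y)\, \frac{p(X, y, \theta)}{q(y)}\, dy \ge \prod_y \Big( \frac{p(X, y, \theta)}{q(y)} \Big)^{q(y)} = g(\theta, q),$$

provided $\int q(y)\, dy = 1$. Define

$$G(\theta, q) = \log g(\theta, q) = \int q(y) \log p(X, y, \theta)\, dy - \int q(y) \log q(y)\, dy.$$

If we want the lower bound $g(\theta, q)$ to touch $f$ at the current guess for $\theta$, we choose $q$ to maximize $G(\theta, q)$.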

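EM Algorithm
Adding a Lagrange multiplier $\lambda$ for the constraint $\int q(y)\, dy = 1$ on $q$ gives:

$$G(\theta, q) = \lambda \Big( 1 - \int q(y)\, dy \Big) + \int q(y) \log p(X, y, \theta)\, dy - \int q(y) \log q(y)\, dy,$$

$$\frac{\partial G}{\partial q(y)} = -\lambda + \log p(X, y, \theta) - \log q(y) - 1 = 0,$$

$$q(y) = \frac{p(X, y, \theta)}{\int p(X, y, \theta)\, dy} = p(y \mid X, \theta).$$

For this choice the bound becomes

$$g(\theta, q) = \prod_y \Big( \frac{p(X, y, \theta)}{p(y \mid X, \theta)} \Big)^{q(y)} = \prod_y p(X, \theta)^{q(y)} = p(X, \theta)^{\int q(y)\, dy} = p(X, \theta).$$

So indeed it touches the objective $f(\theta)$.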
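The claim that the bound touches the objective can be checked on a toy case with a discrete hidden variable, where the integral over $y$ becomes a sum. The three joint values below are hypothetical numbers standing in for $p(X, y, \theta)$ at one fixed $\theta$; the bound is assumed to have the product form $\prod_y \big(p(X, y, \theta)/q(y)\big)^{q(y)}$.

```python
import numpy as np

# Hypothetical values of p(X, y, theta) for three hidden states y at a fixed theta.
joint = np.array([0.02, 0.05, 0.01])
f = joint.sum()                                   # f(theta) = sum_y p(X, y, theta)

def lower_bound(q):
    """g(theta, q) = prod_y (p(X, y, theta) / q(y)) ** q(y), for q summing to 1."""
    q = np.asarray(q, dtype=float)
    return float(np.prod((joint / q) ** q))

q_star = joint / f                                # posterior p(y | X, theta)
q_other = np.array([1 / 3, 1 / 3, 1 / 3])         # some other valid distribution

print(lower_bound(q_star), f)                     # bound with the posterior vs. f
print(lower_bound(q_other), f)                    # bound with another q vs. f
```

With the posterior as $q$ the bound should match $f$ up to rounding, while any other valid $q$ gives a strictly smaller value.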

EM Algorithm
Finding $q$ to get a good bound is the “E” step. To get the next guess for $\theta$, we maximize the bound over $\theta$ (this is the “M” step). It is problem-dependent. The relevant term of $G$ is

$$\int q(y) \log p(X, y, \theta)\, dy.$$

It may be difficult to maximize the bound over $\theta$ fully, and it is not strictly necessary: it is enough to improve it. This is sometimes called “generalized EM”. It is clear from the figure that the derivative of $g$ at the current guess is identical to the derivative of $f$.

EM for a mixture model
We have a mixture of two one-dimensional Gaussians ($k = 2$). Let the mixture coefficients be equal: $P(1) = P(2) = 1/2$. Let the variances be $\sigma_1^2 = \sigma_2^2 = \sigma^2$. The problem is to find the means $\mu_1, \mu_2$. We have the sample set $x_1, \dots, x_n$.

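EM for a mixture model
To use the EM algorithm, define hidden random variables (indicators)

$$z_{ij} = \begin{cases} 1, & \text{if } x_i \text{ was generated by Gaussian } j, \\ 0, & \text{otherwise.} \end{cases}$$

Thus for every $i$ we have the full data $y_i = (x_i, z_{i1}, z_{i2})$, in which exactly one of the hidden variables $z_{i1}, z_{i2}$ equals 1. The aim is to calculate $Q$ and to maximize it.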

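EM for a mixture model
For every $x_i$, with hidden indicators $z_{ij}$ ($z_{ij} = 1$ if $x_i$ was generated by Gaussian $j$), we have:

$$p(x_i, z_{i1}, z_{i2} \mid \theta) = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{1}{2\sigma^2} \sum_{j=1}^{2} z_{ij} (x_i - \mu_j)^2 \Big).$$

From the i.i.d. assumption for the sample set we have:

$$\log p(Y \mid \theta) = \sum_{i=1}^{n} \Big[ \log \frac{1}{2\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{2} z_{ij} (x_i - \mu_j)^2 \Big].$$

We see that this expression is linear in $z_{ij}$.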

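EM for a mixture model
STEP E: We want to calculate the expected value of the complete-data log-likelihood relative to the distribution of the hidden indicators $z_{ij}$ given the data and the current guess $\theta^{(t)} = (\mu_1^{(t)}, \mu_2^{(t)})$. Because the log-likelihood is linear in $z_{ij}$, it suffices to replace each $z_{ij}$ by its expectation:

$$E[z_{ij}] = \frac{\exp\big( -(x_i - \mu_j^{(t)})^2 / (2\sigma^2) \big)}{\sum_{k=1}^{2} \exp\big( -(x_i - \mu_k^{(t)})^2 / (2\sigma^2) \big)},$$

$$Q(\theta, \theta^{(t)}) = \sum_{i=1}^{n} \Big[ \log \frac{1}{2\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{2} E[z_{ij}]\, (x_i - \mu_j)^2 \Big].$$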

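EM for a mixture model
STEP M: Differentiating $Q$ with respect to $\mu_j$ and equating to zero we have:

$$\frac{\partial Q}{\partial \mu_j} = \frac{1}{\sigma^2} \sum_{i=1}^{n} E[z_{ij}]\, (x_i - \mu_j) = 0.$$

Thus

$$\mu_j = \frac{\sum_{i=1}^{n} E[z_{ij}]\, x_i}{\sum_{i=1}^{n} E[z_{ij}]}.$$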
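The E and M steps for this two-Gaussian example can be sketched in a few lines. This is a minimal illustration under the assumptions used in the example (equal mixing coefficients, a shared and known $\sigma$, only the two means unknown); the function name `em_two_means`, the synthetic data and the starting values are hypothetical.

```python
import numpy as np

def em_two_means(x, mu_init, sigma=1.0, n_iter=50):
    """EM for a 50/50 mixture of two 1-D Gaussians with a shared, known sigma."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu_init, dtype=float).copy()
    for _ in range(n_iter):
        # E step: E[z_ij], the posterior probability that x_i came from component j.
        log_w = -((x[:, None] - mu[None, :]) ** 2) / (2.0 * sigma ** 2)
        log_w -= log_w.max(axis=1, keepdims=True)    # avoid underflow in exp
        resp = np.exp(log_w)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: each mean becomes the responsibility-weighted average of the data.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
mu_hat = em_two_means(x, mu_init=[-1.0, 1.0])
print(mu_hat)                                        # estimates of the two means
```

The equal mixing coefficients and the common normalizing constant cancel in the E step, which is why only the exponents appear in the responsibilities. A fixed iteration count is used for simplicity; monitoring the change in the means would be a natural stopping rule.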

EM mixture of Gaussians
In what follows we use $j$ instead of $y$ because the missing variables are discrete in this example. The model density is a linear combination of component densities $p(x \mid j, \theta)$:

$$p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j), \qquad (11)$$

where $M$ is the number of basis functions (a parameter of the model) and $P(j)$ are the mixing parameters. They are in fact the prior probabilities of a data point having been generated from component $j$ of the mixture.

EM mixture of Gaussians
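They satisfy

$$\sum_{j=1}^{M} P(j) = 1, \qquad 0 \le P(j) \le 1. \qquad (12)$$

The component density functions $p(x \mid j)$ are normalized:

$$\int p(x \mid j)\, dx = 1.$$

We shall use Gaussians for $p(x \mid j)$:

$$p(x \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\Big( -\frac{\| x - \mu_j \|^2}{2\sigma_j^2} \Big),$$

where $d$ is the dimension of $x$. We should find $\mu_j$, $\sigma_j$ and $P(j)$ for $j = 1, \dots, M$.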

EM mixture of Gaussians
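STEP E: calculate the relevant term of the bound when $q(j) = P^{old}(j \mid x_i)$, the posterior probability of component $j$ for the data point $x_i$ under the current parameters (see formulas (8) and (10)). We have:

$$Q = \sum_{i=1}^{n} \sum_{j=1}^{M} P^{old}(j \mid x_i) \log\big[ P^{new}(j)\, p^{new}(x_i \mid j) \big]. \qquad (17)$$

We maximize (17) with constraint (12), using a Lagrange multiplier $\lambda$:

$$\sum_{i=1}^{n} \sum_{j=1}^{M} P^{old}(j \mid x_i) \log\big[ P^{new}(j)\, p^{new}(x_i \mid j) \big] + \lambda \Big( \sum_{j=1}^{M} P^{new}(j) - 1 \Big). \qquad (18)$$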

EM mixture of Gaussians
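STEP M: Setting the derivative of (18) with respect to $P^{new}(j)$ to zero:

$$\sum_{i=1}^{n} \frac{P^{old}(j \mid x_i)}{P^{new}(j)} + \lambda = 0.$$

Thus

$$P^{new}(j) = -\frac{1}{\lambda} \sum_{i=1}^{n} P^{old}(j \mid x_i). \qquad (20)$$

Using (12), and the fact that $\sum_{j} P^{old}(j \mid x_i) = 1$ for every $i$, we shall have

$$\lambda = -n. \qquad (21)$$

So from (21) and (20):

$$P^{new}(j) = \frac{1}{n} \sum_{i=1}^{n} P^{old}(j \mid x_i). \qquad (22)$$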

EM mixture model. General case
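By differentiating (18) with respect to $\mu_j$ and $\sigma_j$ and equating to zero we have, for spherical Gaussian components in $d$ dimensions:

$$\mu_j^{new} = \frac{\sum_{i=1}^{n} P^{old}(j \mid x_i)\, x_i}{\sum_{i=1}^{n} P^{old}(j \mid x_i)}, \qquad (23)$$

$$(\sigma_j^{new})^2 = \frac{1}{d} \cdot \frac{\sum_{i=1}^{n} P^{old}(j \mid x_i)\, \| x_i - \mu_j^{new} \|^2}{\sum_{i=1}^{n} P^{old}(j \mid x_i)}. \qquad (24)$$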

EM mixture model. General case
Algorithm for calculating $p(x)$ (formula (11)). For every $x$:

1. Initialize $\mu_j$, $\sigma_j$ and $P(j)$ for $j = 1, \dots, M$.
2. Repeat a fixed number of times: calculate formulas (22), (23), (24).
3. Return $p(x)$ by formula (11).
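The algorithm can be sketched as follows. This is an illustrative implementation, not the lecture's own code: it assumes spherical Gaussian components with one variance per component, takes the initial means as an argument, and uses the standard EM updates for mixing parameters, means and variances, which is what formulas (22), (23) and (24) are taken to be here. The names `fit_spherical_gmm` and `mixture_density`, the synthetic data and the starting means are hypothetical.

```python
import numpy as np

def fit_spherical_gmm(X, mu_init, n_iter=100):
    """EM for a mixture of spherical Gaussians; X has shape (n, d)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu = np.asarray(mu_init, dtype=float).copy()     # (M, d) initial means
    M = mu.shape[0]
    var = np.full(M, X.var())                        # initial sigma_j^2
    prior = np.full(M, 1.0 / M)                      # initial P(j)
    for _ in range(n_iter):
        # E step: posteriors P_old(j | x_i) by Bayes' rule, in the log domain.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # (n, M)
        log_joint = np.log(prior) - 0.5 * d * np.log(2 * np.pi * var) - sq / (2 * var)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        post = np.exp(log_joint)
        post /= post.sum(axis=1, keepdims=True)
        # M step: update P(j), then mu_j, then sigma_j^2 (using the new means).
        nj = post.sum(axis=0)
        prior = nj / n
        mu = (post.T @ X) / nj[:, None]
        sq_new = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (post * sq_new).sum(axis=0) / (d * nj)
    return prior, mu, var

def mixture_density(x, prior, mu, var):
    """p(x) = sum_j p(x | j) P(j) for a single point x of shape (d,)."""
    d = mu.shape[1]
    sq = ((np.asarray(x, dtype=float) - mu) ** 2).sum(axis=1)
    comp = np.exp(-sq / (2 * var)) / (2 * np.pi * var) ** (d / 2)
    return float(comp @ prior)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 0.5, (300, 2)), rng.normal([4, 4], 1.0, (300, 2))])
prior, mu, var = fit_spherical_gmm(X, mu_init=[[1.0, 1.0], [3.0, 3.0]])
print(prior, mu, var, mixture_density([0.0, 0.0], prior, mu, var))
```

The posteriors are computed in the log domain to avoid underflow for points far from every component. EM only finds a local maximum, so in practice the result depends on the initial means.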