# Visual Recognition Tutorial


236607 Visual Recognition Tutorial
- Maximum likelihood – an example
- Maximum likelihood – another example
- Bayesian estimation
- EM for a mixture model
- EM Algorithm: General Setting
- Jensen's inequality

Bayesian Estimation: General Theory
Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data X^(n), we can use Bayes' formula to find the posterior
$$p(\theta \mid X^{(n)}) = \frac{p(X^{(n)} \mid \theta)\, p(\theta)}{p(X^{(n)})}.$$
Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.
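As a concrete sketch of "the posterior is narrower than the prior", here is a minimal Beta-Bernoulli example; the coin-flip setting and all numbers are illustrative, not from the lecture:

```python
# Hedged sketch: Beta-Bernoulli conjugate update (illustrative numbers).
# A Beta(a, b) prior over a coin's heads-probability is updated with
# observed flips; the posterior Beta(a + heads, b + tails) is narrower.

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Broad prior: Beta(1, 1) is uniform on [0, 1].
a, b = 1.0, 1.0
prior_var = beta_variance(a, b)

# Observe 7 heads and 3 tails.
heads, tails = 7, 3
a_post, b_post = a + heads, b + tails
post_var = beta_variance(a_post, b_post)

print(prior_var, post_var)
assert post_var < prior_var  # the posterior is narrower than the prior
```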

Bayesian parametric estimation
The density function for x given the training data set X^(n) was defined in Lecture 2:
$$p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)})\, d\theta .$$
From the definition of conditional probability densities,
$$p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)}).$$
The first factor is independent of X^(n), since it is just our assumed form for the parameterized density: $p(x \mid \theta, X^{(n)}) = p(x \mid \theta)$. Therefore
$$p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta .$$

Bayesian parametric estimation
Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ. If the weighting factor $p(\theta \mid X^{(n)})$, which is the posterior of θ, peaks very sharply about some value $\hat\theta$, we obtain
$$p(x \mid X^{(n)}) \approx p(x \mid \hat\theta).$$
Thus the optimal estimator is the most likely value of θ given the data and the prior of θ.

Bayesian decision making
Suppose we know the distribution of possible values of θ, that is, a prior $p(\theta)$. Suppose we also have a loss function $\lambda(\hat\theta, \theta)$ which measures the penalty for estimating $\hat\theta$ when the actual value is θ. Then we may formulate the estimation problem as Bayesian decision making: choose the value of $\hat\theta$ which minimizes the risk
$$R = \int \lambda(\hat\theta, \theta)\, p(\theta \mid X^{(n)})\, d\theta .$$
Note that the loss function here is usually continuous.

Maximum A-Posteriori (MAP) Estimation
Let us look again at the claim that the optimal estimator is the most likely value of θ given the data and the prior of θ. This "most likely value" is given by
$$\hat\theta_{MAP} = \arg\max_{\theta}\, p(\theta \mid X^{(n)}) = \arg\max_{\theta} \frac{p(X^{(n)} \mid \theta)\, p(\theta)}{p(X^{(n)})}.$$

Maximum A-Posteriori (MAP) Estimation
Since the data are i.i.d.,
$$p(X^{(n)} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum.
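A minimal numeric sketch of MAP estimation, assuming a Gaussian likelihood with known variance and a Gaussian prior on the mean (the data and all parameter values below are invented for illustration). In this conjugate case the MAP estimate has a closed form, which we check against a brute-force search over the log posterior:

```python
# Hedged sketch of MAP estimation (illustrative numbers, not from the lecture).
# Gaussian likelihood N(theta, sigma2) with known sigma2, Gaussian prior
# N(mu0, tau2) on the mean theta. The MAP estimate has the closed form
#   theta_MAP = (tau2 * sum(x) + sigma2 * mu0) / (n * tau2 + sigma2).

def log_posterior(theta, xs, sigma2, mu0, tau2):
    """Unnormalized log posterior: log prior + log likelihood."""
    log_prior = -((theta - mu0) ** 2) / (2 * tau2)
    log_lik = -sum((x - theta) ** 2 for x in xs) / (2 * sigma2)
    return log_prior + log_lik

xs = [2.1, 1.9, 2.4, 2.2, 1.8]
sigma2, mu0, tau2 = 1.0, 0.0, 4.0
n = len(xs)

theta_map = (tau2 * sum(xs) + sigma2 * mu0) / (n * tau2 + sigma2)

# Check against a brute-force grid search over theta.
grid = [i / 1000 for i in range(-1000, 4001)]
theta_grid = max(grid, key=lambda t: log_posterior(t, xs, sigma2, mu0, tau2))
assert abs(theta_map - theta_grid) < 1e-2
```

The normalizing factor $p(X^{(n)})$ is indeed absent: the grid search maximizes the unnormalized log posterior, yet finds the same maximizer.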

MAP – continued
So, the $\hat\theta_{MAP}$ we are looking for is
$$\hat\theta_{MAP} = \arg\max_{\theta} \Big[ \log p(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta) \Big].$$

Maximum likelihood
In the MAP estimator, the larger n (the size of the data), the less important $\log p(\theta)$ is in the expression
$$\log p(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
This motivates us to omit the prior altogether; what we get is the maximum likelihood (ML) method. Informally: we do not use any prior knowledge about the parameters; we seek the values that "explain" the data in the best way.
$$l(\theta) = \log p(X^{(n)} \mid \theta)$$
is the log-likelihood of θ with respect to X^(n). We seek a maximum of the likelihood function, the log-likelihood, or any monotonically increasing function of them.

Maximum likelihood – an example
Let us find the ML estimator for the parameter θ of the exponential density function
$$p(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}, \qquad x \ge 0.$$
Since the logarithm is monotone,
$$l(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) = -n \log\theta - \frac{1}{\theta}\sum_{i=1}^{n} x_i ,$$
so we are actually looking for the maximum of the log-likelihood. Observe: the maximum is achieved where
$$\frac{dl}{d\theta} = -\frac{n}{\theta} + \frac{1}{\theta^{2}}\sum_{i=1}^{n} x_i = 0, \qquad \text{i.e.}\quad \hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i .$$
We have got the empirical mean (average).
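A quick numeric check of this result (the sample values are invented): the closed-form ML estimate, the sample mean, coincides with a brute-force maximization of the exponential log-likelihood in the scale parameterization:

```python
# Hedged sketch (data invented): the ML estimate of the exponential scale
# parameter theta in p(x|theta) = (1/theta) exp(-x/theta) is the sample mean.
import math

def log_likelihood(theta, xs):
    """Exponential log-likelihood in the scale parameterization."""
    n = len(xs)
    return -n * math.log(theta) - sum(xs) / theta

xs = [0.5, 1.2, 0.8, 2.0, 1.5]
theta_ml = sum(xs) / len(xs)  # closed-form ML estimate: the empirical mean

# Brute-force check on a grid of candidate theta values.
grid = [i / 1000 for i in range(100, 5000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, xs))
assert abs(theta_ml - theta_grid) < 1e-2
```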

Maximum likelihood – another example
Let us find the ML estimator for the parameter θ of the Laplace density
$$p(x \mid \theta) = \tfrac{1}{2}\, e^{-|x - \theta|}.$$
Observe:
$$l(\theta) = -n \log 2 - \sum_{i=1}^{n} |x_i - \theta| .$$
The maximum is at the θ where
$$\sum_{i=1}^{n} \operatorname{sgn}(x_i - \theta) = 0,$$
i.e. where as many samples lie above θ as below it. This is the median of the sampled data.
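The median result can be verified numerically (sample values invented): the sum of absolute deviations, which is the Laplace negative log-likelihood up to constants, is minimized at the sample median:

```python
# Hedged sketch (data invented): the sample median minimizes the sum of
# absolute deviations, matching the Laplace ML result above.
import statistics

def neg_log_lik(theta, xs):
    """Sum of absolute deviations (the Laplace -log-lik up to constants)."""
    return sum(abs(x - theta) for x in xs)

xs = [3.0, 1.0, 4.0, 1.5, 9.0]
grid = [i / 100 for i in range(0, 1001)]
theta_grid = min(grid, key=lambda t: neg_log_lik(t, xs))

assert abs(theta_grid - statistics.median(xs)) < 1e-9
```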

Bayesian estimation – revisited
We saw that the Bayesian estimator for the 0/1 loss function is MAP. What happens when we assume other loss functions?
Example 1: $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ (θ is unidimensional). The total Bayesian risk here is
$$R = \int |\hat\theta - \theta|\, p(\theta \mid X^{(n)})\, d\theta .$$
We seek its minimum:
$$\frac{\partial R}{\partial \hat\theta} = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta - \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta = 0.$$

Bayesian estimation – continued
At the $\hat\theta$ which is a solution we have
$$\int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta = \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta = \tfrac{1}{2}.$$
That is, for the absolute-error loss the optimal Bayesian estimator for the parameter is the median of the posterior distribution $p(\theta \mid X^{(n)})$.
Example 2: $\lambda(\hat\theta, \theta) = (\hat\theta - \theta)^2$ (squared error). The total Bayesian risk is
$$R = \int (\hat\theta - \theta)^2\, p(\theta \mid X^{(n)})\, d\theta .$$
Again, in order to find the minimum, we set the derivative equal to 0:
$$\frac{\partial R}{\partial \hat\theta} = 2 \int (\hat\theta - \theta)\, p(\theta \mid X^{(n)})\, d\theta = 0.$$

Bayesian estimation – continued
Solving gives
$$\hat\theta = \int \theta\, p(\theta \mid X^{(n)})\, d\theta = E[\theta \mid X^{(n)}].$$
The optimal estimator here is the conditional expectation of θ given the data X^(n).
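A small numeric illustration of both examples on a discrete stand-in for the posterior (the distribution is invented): the grid minimizer of the expected absolute loss is the posterior median, and the minimizer of the expected squared loss is the posterior mean:

```python
# Hedged numeric check (distribution invented): for a discrete "posterior",
# the posterior median minimizes expected absolute loss and the posterior
# mean minimizes expected squared loss.
thetas = [0.0, 1.0, 2.0, 3.0, 10.0]
probs  = [0.1, 0.3, 0.3, 0.2, 0.1]   # sums to 1

def risk(theta_hat, loss):
    """Expected loss under the discrete posterior."""
    return sum(p * loss(theta_hat, t) for t, p in zip(thetas, probs))

grid = [i / 100 for i in range(0, 1101)]
best_abs = min(grid, key=lambda th: risk(th, lambda a, b: abs(a - b)))
best_sq  = min(grid, key=lambda th: risk(th, lambda a, b: (a - b) ** 2))

post_mean = sum(t * p for t, p in zip(thetas, probs))  # = 2.5
print(best_abs, best_sq, post_mean)
```

Here the posterior CDF crosses 1/2 at θ = 2, so `best_abs` is 2.0, while `best_sq` matches the posterior mean 2.5 on the grid.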

Mixture Models

Mixture Models
Introduce a multinomial random variable Z^n (one per data point x_n) with components $Z_k^n$, $k = 1, \dots, K$: $Z_k^n = 1$ if and only if Z^n takes its kth value, and $Z_k^n = 0$ otherwise. Note that
$$\sum_{k=1}^{K} Z_k^n = 1.$$

Mixture Models
The mixing proportions are the prior probabilities of the components:
$$P(Z_k = 1) = \pi_k, \qquad \text{where } \sum_{k=1}^{K} \pi_k = 1.$$
The marginal probability of X is
$$p(x) = \sum_{k=1}^{K} \pi_k\, p(x \mid \theta_k).$$

Mixture Models
A mixture model as a graphical model: Z is a multinomial latent variable, and the conditional probability of X given Z is $p(x \mid Z_k = 1) = p(x \mid \theta_k)$. Define the posterior (responsibility)
$$\tau_k(x) = P(Z_k = 1 \mid x) = \frac{\pi_k\, p(x \mid \theta_k)}{\sum_{j} \pi_j\, p(x \mid \theta_j)}.$$

Unconditional Mixture Models
Conditional mixture models are used to solve regression and classification (supervised) problems; they require observations of both the data X and the labels Y, i.e. (X, Y) pairs. Unconditional mixture models are used to solve density estimation problems; they require only observations of the data X. Applications: detection of outliers, compression, unsupervised classification (clustering), and more.


Gaussian Mixture Models
Estimate the parameters from i.i.d. data D = {x1, …, xN} by maximizing the log likelihood
$$l(\theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, N(x_n \mid \mu_k, \Sigma_k).$$

The K-means algorithm
Group the data D = {x1, …, xN} into a set of K clusters, where K is given. Represent the i-th cluster by a single vector: its mean $\mu_i$. Data points are assigned to the nearest mean. The algorithm alternates two phases:
Phase 1: the values of the indicator variables are evaluated by assigning each point $x_n$ to the closest mean.
Phase 2: recompute each mean $\mu_i$ as the average of the points currently assigned to it.

EM Algorithm
If the Z^n were observed, they would be "class labels", and the estimate of each mean would be
$$\hat\mu_k = \frac{\sum_{n} Z_k^n\, x_n}{\sum_{n} Z_k^n}.$$
We do not know the Z^n, so we replace them by their conditional expectations, conditioning on the data:
$$E[Z_k^n \mid x_n] = P(Z_k^n = 1 \mid x_n) = \tau_k(x_n).$$
But this posterior depends on the parameter estimates, so we should iterate.

EM Algorithm
Iteration formulas:
E step (responsibilities):
$$\tau_k^n = \frac{\pi_k\, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j\, N(x_n \mid \mu_j, \Sigma_j)}.$$
M step (parameter updates):
$$\mu_k = \frac{\sum_n \tau_k^n\, x_n}{\sum_n \tau_k^n}, \qquad \Sigma_k = \frac{\sum_n \tau_k^n (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_n \tau_k^n}, \qquad \pi_k = \frac{1}{N} \sum_n \tau_k^n .$$

EM Algorithm
The expectation step computes the responsibilities $\tau_k^n$; the maximization step is the parameter updates for $\mu_k$, $\Sigma_k$ and $\pi_k$. What relationship does this algorithm have to the quantity we actually want to maximize, the log likelihood $l(\theta)$? Calculating derivatives of l with respect to the parameters, we have
$$\frac{\partial l}{\partial \mu_k} = \sum_{n} \tau_k^n\, \Sigma_k^{-1} (x_n - \mu_k).$$

EM Algorithm
Setting the derivative to zero yields
$$\mu_k = \frac{\sum_n \tau_k^n\, x_n}{\sum_n \tau_k^n}.$$
Analogously for the covariances,
$$\Sigma_k = \frac{\sum_n \tau_k^n (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_n \tau_k^n},$$
and for the mixing proportions:
$$\pi_k = \frac{1}{N} \sum_n \tau_k^n .$$
Thus the EM updates are exactly the stationarity conditions of the log likelihood.

EM General Setting
EM is an iterative technique designed for probabilistic models. We have two sample spaces: X, which is observed (the dataset), and Z, which is missing (latent). A probability model is
$$p(x, z \mid \theta).$$
If we knew Z, we would do ML estimation by maximizing the complete log likelihood
$$l_c(\theta) = \log p(x, z \mid \theta).$$

EM General Setting
Z is not observed, so we calculate the incomplete log likelihood
$$l(\theta) = \log p(x \mid \theta) = \log \sum_{z} p(x, z \mid \theta).$$
Since Z is not observed, the complete log likelihood is a random quantity and cannot be maximized directly. Thus we average over Z using some "averaging distribution" q(z | x). We hope that maximizing this surrogate expression will yield a value of θ which improves on the initial value of θ.

EM General Setting
The distribution q can be used to obtain a lower bound on the log likelihood: by Jensen's inequality,
$$l(\theta) = \log \sum_{z} q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\ge\; \sum_{z} q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;=\; \mathcal{L}(q, \theta).$$
EM is coordinate ascent on $\mathcal{L}(q, \theta)$. At the (t+1)st iteration, for fixed $\theta^{(t)}$, we first maximize $\mathcal{L}(q, \theta^{(t)})$ with respect to q, which yields $q^{(t+1)}$. For this $q^{(t+1)}$ we then maximize $\mathcal{L}(q^{(t+1)}, \theta)$ with respect to θ, which yields $\theta^{(t+1)}$.

EM General Setting
E step: $q^{(t+1)} = \arg\max_{q} \mathcal{L}(q, \theta^{(t)})$.
M step: $\theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(q^{(t+1)}, \theta)$.
The M step is equivalently viewed as the maximization of the expected complete log likelihood. Proof:
$$\mathcal{L}(q, \theta) = \sum_{z} q(z \mid x) \log p(x, z \mid \theta) - \sum_{z} q(z \mid x) \log q(z \mid x).$$
The second term is independent of θ. Thus maximizing $\mathcal{L}(q, \theta)$ with respect to θ is equivalent to maximizing the expected complete log likelihood $\sum_{z} q(z \mid x) \log p(x, z \mid \theta)$.

EM General Setting
The E step can be solved once and for all: the choice
$$q^{(t+1)}(z \mid x) = p(z \mid x, \theta^{(t)})$$
yields the maximum, since for it the bound is tight:
$$\mathcal{L}(q^{(t+1)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)}).$$

Jensen's inequality
Definition: a function f is convex over (a, b) if for all $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$,
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2);$$
f is concave if the reverse inequality holds.
Jensen's inequality: for a convex function f and a random variable X,
$$E[f(X)] \ge f(E[X]).$$

Jensen's inequality
For a discrete random variable with two mass points the inequality is exactly the definition of convexity:
$$\lambda_1 f(x_1) + \lambda_2 f(x_2) \ge f(\lambda_1 x_1 + \lambda_2 x_2), \qquad \lambda_1 + \lambda_2 = 1, \ \lambda_i \ge 0.$$
Let Jensen's inequality hold for k−1 mass points. Then, writing $\lambda_i' = \lambda_i / (1 - \lambda_k)$ for $i < k$,
$$\sum_{i=1}^{k} \lambda_i f(x_i) = \lambda_k f(x_k) + (1 - \lambda_k) \sum_{i=1}^{k-1} \lambda_i' f(x_i) \ge \lambda_k f(x_k) + (1 - \lambda_k)\, f\Big(\sum_{i=1}^{k-1} \lambda_i' x_i\Big)$$
due to the induction assumption, and
$$\lambda_k f(x_k) + (1 - \lambda_k)\, f\Big(\sum_{i=1}^{k-1} \lambda_i' x_i\Big) \ge f\Big(\lambda_k x_k + (1 - \lambda_k) \sum_{i=1}^{k-1} \lambda_i' x_i\Big) = f\Big(\sum_{i=1}^{k} \lambda_i x_i\Big)$$
due to convexity.

Jensen’s inequality corollary
Let $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$. The function log is concave, so from Jensen's inequality we have
$$\log \sum_{i} \lambda_i x_i \ge \sum_{i} \lambda_i \log x_i .$$
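A one-line numeric check of the corollary (weights and points invented): the log of a convex combination dominates the convex combination of logs:

```python
# Hedged numeric check (weights and points invented):
# log(sum_i w_i x_i) >= sum_i w_i log(x_i) for weights summing to 1,
# since log is concave.
import math

ws = [0.2, 0.5, 0.3]
xs = [1.0, 4.0, 9.0]

lhs = math.log(sum(w * x for w, x in zip(ws, xs)))
rhs = sum(w * math.log(x) for w, x in zip(ws, xs))
assert lhs >= rhs
print(lhs, rhs)
```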