Visual Recognition Tutorial

Presentation transcript:

Outline:
Maximum likelihood – an example
Maximum likelihood – another example
Bayesian estimation
Expectation Maximization algorithm
Jensen's inequality
EM for a mixture model

Bayesian Estimation: General Theory
Bayesian learning considers $\theta$ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data $X^{(n)}$, we can use Bayes' formula to find the posterior
$$p(\theta \mid X^{(n)}) = \frac{p(X^{(n)} \mid \theta)\, p(\theta)}{p(X^{(n)})}.$$
Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.
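
As a hedged illustration of the posterior narrowing (a Beta-Bernoulli example that is not from the lecture; the coin bias, the uniform prior, and the sample sizes are all assumed for the demo), the following sketch updates a conjugate prior and prints the shrinking posterior spread.

```python
# A minimal sketch, assuming a Bernoulli likelihood and a Beta(1, 1) prior:
# the posterior standard deviation shrinks as more data are observed.
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7                 # hypothetical "true" parameter
a, b = 1.0, 1.0                  # broad (uniform) Beta prior

for n in (10, 100, 1000):
    x = rng.binomial(1, true_theta, size=n)
    a += x.sum()                 # conjugate Bayes update: posterior is Beta(a, b)
    b += n - x.sum()
    mean = a / (a + b)
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"after {a + b - 2:.0f} flips: posterior mean={mean:.3f}, std={std:.4f}")
```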

Bayesian parametric estimation
The density for $x$ given the training data set $X^{(n)}$ (defined in Lecture 2) is
$$p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)})\, d\theta .$$
From the definition of conditional probability densities,
$$p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)}).$$
The first factor is independent of $X^{(n)}$, since it is just our assumed form for the parameterized density: $p(x \mid \theta, X^{(n)}) = p(x \mid \theta)$. Therefore
$$p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta .$$

Bayesian parametric estimation
Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$. If the weighting factor $p(\theta \mid X^{(n)})$, which is the posterior of $\theta$, peaks very sharply about some value $\hat\theta$, we obtain $p(x \mid X^{(n)}) \approx p(x \mid \hat\theta)$. Thus the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$.
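
A minimal numerical sketch of this weighted average (a made-up Bernoulli example with a flat prior on a parameter grid; none of the numbers come from the lecture): the predictive probability is the posterior-weighted average of $p(x \mid \theta)$, and it is close to, but not identical with, the plug-in value at the posterior peak.

```python
# Grid approximation of p(x | X^(n)) = integral of p(x | theta) p(theta | X^(n)) dtheta
import numpy as np

x_data = np.array([1, 1, 0, 1, 1, 0, 1, 1])      # assumed Bernoulli observations
theta = np.linspace(0.001, 0.999, 999)           # grid over the parameter
dtheta = theta[1] - theta[0]

lik = theta ** x_data.sum() * (1 - theta) ** (len(x_data) - x_data.sum())
post = lik * np.ones_like(theta)                 # flat prior times likelihood
post /= post.sum() * dtheta                      # normalized posterior p(theta | X)

p_pred = (theta * post).sum() * dtheta           # P(x = 1 | X^(n)), weighted average
theta_hat = theta[np.argmax(post)]               # posterior peak (plug-in value)
print(p_pred, theta_hat)                         # similar, but not identical
```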

Bayesian decision making
Suppose we know the distribution of possible values of $\theta$, that is, a prior $p(\theta)$. Suppose we also have a loss function $\lambda(\hat\theta, \theta)$ which measures the penalty for estimating $\hat\theta$ when the actual value is $\theta$. Then we may formulate the estimation problem as Bayesian decision making: choose the value $\hat\theta$ which minimizes the risk
$$R(\hat\theta) = \int \lambda(\hat\theta, \theta)\, p(\theta \mid X^{(n)})\, d\theta .$$
Note that the loss function is usually continuous.

Maximum A-Posteriori (MAP) Estimation
Let us look at $p(\theta \mid X^{(n)})$: the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$. This "most likely value" is given by
$$\hat\theta_{MAP} = \arg\max_{\theta}\, p(\theta \mid X^{(n)}).$$

Maximum A-Posteriori (MAP) Estimation
By Bayes' formula,
$$p(\theta \mid X^{(n)}) = \frac{p(X^{(n)} \mid \theta)\, p(\theta)}{p(X^{(n)})} = \frac{p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)}{p(X^{(n)})},$$
since the data are i.i.d. We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum.

MAP – continued
So, the $\hat\theta_{MAP}$ we are looking for is
$$\hat\theta_{MAP} = \arg\max_{\theta}\, p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) = \arg\max_{\theta} \left[ \log p(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta) \right].$$
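
The sketch below is my own illustration, not part of the lecture: it assumes a Gaussian likelihood with known variance and a Gaussian prior on the mean (all constants hypothetical), and compares a grid search for the MAP estimate with the closed-form conjugate answer.

```python
# MAP estimate of a Gaussian mean: log prior + sum of log likelihoods, maximized
import numpy as np

rng = np.random.default_rng(1)
sigma2, mu0, tau2 = 1.0, 0.0, 0.5                # assumed model and prior constants
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)

grid = np.linspace(-1.0, 4.0, 2001)
log_post = (-(grid - mu0) ** 2 / (2 * tau2)                       # log prior
            - ((x[:, None] - grid[None, :]) ** 2).sum(axis=0) / (2 * sigma2))
theta_map_grid = grid[np.argmax(log_post)]

n = len(x)                                       # closed form for this conjugate case
theta_map_closed = (tau2 * x.sum() + sigma2 * mu0) / (n * tau2 + sigma2)
print(theta_map_grid, theta_map_closed)          # should agree closely
```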

Maximum likelihood
In the MAP estimator, the larger $n$ (the size of the data set) is, the less important $\log p(\theta)$ becomes in the expression above. This motivates us to omit the prior altogether; what we get is the maximum likelihood (ML) method. Informally: we do not use any prior knowledge about the parameters; we seek the values that "explain" the data in the best way,
$$\hat\theta_{ML} = \arg\max_{\theta}\, p(X^{(n)} \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
Here $L(\theta) = \log p(X^{(n)} \mid \theta)$ is the log-likelihood of $\theta$ with respect to $X^{(n)}$. We may seek a maximum of the likelihood function, the log-likelihood, or any monotonically increasing function of them.

Maximum likelihood – an example
Let us find the ML estimator for the parameter $\theta$ of the exponential density function $p(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}$, $x \ge 0$. The logarithm is monotonically increasing, so we are actually looking for the maximum of the log-likelihood. Observe:
$$L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) = -n \log \theta - \frac{1}{\theta} \sum_{i=1}^{n} x_i .$$
The maximum is achieved where
$$\frac{\partial L}{\partial \theta} = -\frac{n}{\theta} + \frac{1}{\theta^2} \sum_{i=1}^{n} x_i = 0, \qquad \text{i.e. } \hat\theta_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i .$$
We have obtained the empirical mean (average).
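
A quick numerical check of this result (synthetic data with an assumed true parameter): the grid maximizer of the log-likelihood above coincides with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)         # assumed true theta = 2.0

theta = np.linspace(0.5, 5.0, 2000)
loglik = -len(x) * np.log(theta) - x.sum() / theta
print(theta[np.argmax(loglik)], x.mean())        # both close to 2.0
```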

Maximum likelihood – another example
Let us find the ML estimator for the location parameter $\theta$ of the Laplace (double-exponential) density $p(x \mid \theta) = \frac{1}{2} e^{-|x - \theta|}$. Observe:
$$L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) = -n \log 2 - \sum_{i=1}^{n} |x_i - \theta| .$$
The maximum is at the $\theta$ where
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{n} \operatorname{sign}(x_i - \theta) = 0,$$
that is, where as many samples lie below $\theta$ as above it. This is the median of the sampled data.
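
Another quick check (synthetic data, my own example): minimizing the sum of absolute deviations over a grid recovers the sample median, as derived above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.laplace(loc=1.0, scale=1.0, size=501)    # odd size gives a unique median

theta = np.linspace(-2.0, 4.0, 6001)
sad = np.abs(x[:, None] - theta[None, :]).sum(axis=0)    # sum of |x_i - theta|
print(theta[np.argmin(sad)], np.median(x))               # both near 1.0
```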

Bayesian estimation – revisited
We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?
Example 1: the absolute-error loss $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ ($\theta$ is unidimensional). The total Bayesian risk here is
$$R(\hat\theta) = \int |\hat\theta - \theta|\, p(\theta \mid X^{(n)})\, d\theta = \int_{-\infty}^{\hat\theta} (\hat\theta - \theta)\, p(\theta \mid X^{(n)})\, d\theta + \int_{\hat\theta}^{\infty} (\theta - \hat\theta)\, p(\theta \mid X^{(n)})\, d\theta .$$
We seek its minimum:
$$\frac{\partial R}{\partial \hat\theta} = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta - \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta = 0 .$$

Bayesian estimation – continued
At the $\hat\theta$ which solves this equation we have
$$\int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta = \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta = \frac{1}{2},$$
that is, for the absolute-error loss the optimal Bayesian estimator of the parameter is the median of the distribution $p(\theta \mid X^{(n)})$.
Example 2: the squared-error loss $\lambda(\hat\theta, \theta) = (\hat\theta - \theta)^2$. The total Bayesian risk is
$$R(\hat\theta) = \int (\hat\theta - \theta)^2\, p(\theta \mid X^{(n)})\, d\theta .$$
Again, in order to find the minimum, set the derivative equal to 0:
$$\frac{\partial R}{\partial \hat\theta} = 2 \int (\hat\theta - \theta)\, p(\theta \mid X^{(n)})\, d\theta = 0 .$$

Bayesian estimation – continued
Solving gives
$$\hat\theta = \int \theta\, p(\theta \mid X^{(n)})\, d\theta = E\!\left[\theta \mid X^{(n)}\right].$$
The optimal estimator here is the conditional expectation of $\theta$ given the data $X^{(n)}$.
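
A small sketch summarizing the two results (the discretized, deliberately skewed posterior below is made up for illustration): the expected absolute-error risk is minimized at the posterior median, and the expected squared-error risk at the posterior mean.

```python
import numpy as np

theta = np.linspace(0.0, 10.0, 1001)
post = np.exp(-0.5 * (theta - 3.0) ** 2) + 0.5 * np.exp(-0.5 * (theta - 7.0) ** 2)
post /= post.sum()                                       # skewed discrete posterior

# risk of each candidate estimate under the two loss functions
risk_abs = (np.abs(theta[:, None] - theta[None, :]) * post[None, :]).sum(axis=1)
risk_sq = ((theta[:, None] - theta[None, :]) ** 2 * post[None, :]).sum(axis=1)

mean = (theta * post).sum()
median = theta[np.searchsorted(np.cumsum(post), 0.5)]
print(theta[np.argmin(risk_abs)], median)                # absolute loss -> median
print(theta[np.argmin(risk_sq)], mean)                   # squared loss  -> mean
```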

Jensen's inequality
Definition: a function $f$ is convex over $(a, b)$ if for every $x_1, x_2 \in (a, b)$ and every $0 \le \lambda \le 1$
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
(The slide illustrates a convex and a concave function.) Jensen's inequality: for a convex function $f$ and a random variable $X$,
$$E[f(X)] \ge f(E[X]).$$

Jensen's inequality
For a discrete random variable with two mass points,
$$E[f(X)] = p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2) = f(E[X])$$
directly by convexity. Let Jensen's inequality hold for $k - 1$ mass points, and put $p_i' = p_i / (1 - p_k)$ for $i = 1, \dots, k - 1$. Then
$$\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i) \ge p_k f(x_k) + (1 - p_k) f\!\left( \sum_{i=1}^{k-1} p_i' x_i \right)$$
due to the induction assumption, and
$$\ge f\!\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \right) = f\!\left( \sum_{i=1}^{k} p_i x_i \right)$$
due to convexity.

Jensen's inequality – corollary
Let $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$, and let $q_i > 0$. The function $\log$ is concave, so from Jensen's inequality we have
$$\log \sum_i \lambda_i q_i \;\ge\; \sum_i \lambda_i \log q_i .$$
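
A quick numerical check of the corollary with arbitrary weights and values (all numbers assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.uniform(0.1, 5.0, size=10)
lam = rng.uniform(size=10)
lam /= lam.sum()                                 # lambda_i >= 0, summing to 1

lhs = np.log(np.dot(lam, q))                     # log of the weighted average
rhs = np.dot(lam, np.log(q))                     # weighted average of the logs
print(lhs, rhs, lhs >= rhs)                      # prints True
```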

EM Algorithm
EM is an iterative technique designed for probabilistic models with missing data. We have two sample spaces: $X$, which is observed, and $Y$, which is missing (hidden). A vector of parameters $\theta$ defines the distribution of the complete data. We should find
$$\hat\theta = \arg\max_{\theta}\, p(X \mid \theta) \quad \text{or, equivalently,} \quad \hat\theta = \arg\max_{\theta}\, \log p(X \mid \theta).$$

EM Algorithm
The problem is that calculating $p(X \mid \theta) = \sum_{y} p(X, y \mid \theta)$ is difficult, but the calculation of the complete-data likelihood $p(X, Y \mid \theta)$ is relatively easy. We define
$$Q(\theta, \theta^{old}) = E_Y\!\left[ \log p(X, Y \mid \theta) \,\middle|\, X, \theta^{old} \right].$$
The algorithm cyclically makes two steps:
E: compute $Q(\theta, \theta^{old})$ (see (10) below);
M: $\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})$.

EM Algorithm
EM is an iterative technique designed for probabilistic models. (Figure: maximizing a function with a lower-bound approximation vs. a linear approximation.)

EM Algorithm
Gradient descent makes a linear approximation to the objective function (O.F.); Newton's method makes a quadratic approximation, but the optimal step size is not known. EM instead makes a local approximation that is a lower bound (l.b.) to the O.F. Choosing a new guess that maximizes the l.b. is always an improvement, provided the gradient is not zero. Thus there are two steps: E – compute a l.b.; M – maximize the l.b. The bound used by EM follows from Jensen's inequality.

The General EM Algorithm
We want to maximize the function
$$f(\theta) = \log p(X \mid \theta) = \log \sum_{y} p(X, y \mid \theta),$$
where $X$ is a matrix of observed data. If $f(\theta)$ is simple, we find the maximum by equating its gradient to zero. But if $f(\theta)$ is a mixture (of simple functions) this is difficult; this is the situation for EM. Given a guess for $\theta$, we find a lower bound for $f(\theta)$ with a function $g(\theta, q)$, parameterized by the free variables $q(y)$ (a distribution over the hidden variables).

EM Algorithm
Applying the corollary of Jensen's inequality to $f(\theta) = \log \sum_{y} p(X, y \mid \theta)$ gives
$$f(\theta) = \log \sum_{y} q(y)\, \frac{p(X, y \mid \theta)}{q(y)} \;\ge\; \sum_{y} q(y) \log \frac{p(X, y \mid \theta)}{q(y)} \;=\; g(\theta, q),$$
provided $q(y) \ge 0$ and $\sum_{y} q(y) = 1$. Define $G(\theta, q)$ to be this lower bound viewed as a function of the free distribution $q$. If we want the lower bound $g(\theta, q)$ to touch $f$ at the current guess for $\theta$, we choose $q$ to maximize $G(\theta, q)$.

EM Algorithm
Adding a Lagrange multiplier $\lambda$ for the constraint $\sum_{y} q(y) = 1$ and setting the derivative with respect to $q(y)$ to zero gives
$$q(y) = \frac{p(X, y \mid \theta)}{\sum_{y'} p(X, y' \mid \theta)} = p(y \mid X, \theta).$$
For this choice the bound becomes
$$g(\theta, q) = \sum_{y} p(y \mid X, \theta) \log \frac{p(X, y \mid \theta)}{p(y \mid X, \theta)} = \sum_{y} p(y \mid X, \theta) \log p(X \mid \theta) = f(\theta),$$
so indeed it touches the objective $f(\theta)$.

EM Algorithm
Finding $q$ to get a good bound is the "E" step. To get the next guess for $\theta$, we maximize the bound over $\theta$ (this is the "M" step); it is problem-dependent. The relevant term of $G$ is
$$\sum_{y} q(y) \log p(X, y \mid \theta),$$
the expected complete-data log-likelihood (the entropy term $-\sum_{y} q(y) \log q(y)$ does not depend on $\theta$). It may be difficult, and it is not strictly necessary, to fully maximize the bound over $\theta$; this relaxation is sometimes called "generalized EM". It is clear from the figure that the derivative of $g$ at the current guess is identical to the derivative of $f$.
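
The following sketch (a two-component Gaussian mixture with unit variances and equal weights; the data and the current guess are assumed for illustration) evaluates $f(\theta)$ and the lower bound $g(\theta, q)$ with $q$ fixed by the current guess, showing that $g \le f$ everywhere and that they coincide at the current guess.

```python
# Checking that the EM lower bound touches the objective at the current guess
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])

def log_gauss(x, mu):                      # unit-variance Gaussian log-density
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

def f(mu1, mu2):                           # incomplete-data log-likelihood
    return np.sum(np.log(0.5 * np.exp(log_gauss(x, mu1))
                         + 0.5 * np.exp(log_gauss(x, mu2))))

def g(mu1, mu2, q):                        # EM lower bound with responsibilities q
    joint1 = np.log(0.5) + log_gauss(x, mu1)          # log p(x_i, y_i = 1 | theta)
    joint2 = np.log(0.5) + log_gauss(x, mu2)          # log p(x_i, y_i = 2 | theta)
    return np.sum(q * (joint1 - np.log(q)) + (1 - q) * (joint2 - np.log(1 - q)))

mu1_old, mu2_old = -1.0, 1.0               # current guess
p1 = 0.5 * np.exp(log_gauss(x, mu1_old))
p2 = 0.5 * np.exp(log_gauss(x, mu2_old))
q = p1 / (p1 + p2)                         # E step: posterior of y_i = 1

for mu1 in (-3.0, mu1_old, 0.0, 2.0):      # g <= f everywhere, equal at the guess
    print(mu1, round(f(mu1, mu2_old), 3), round(g(mu1, mu2_old, q), 3))
```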

EM for a mixture model
We have a mixture of two one-dimensional Gaussians ($k = 2$). Let the mixture coefficients be equal, $P(1) = P(2) = \frac{1}{2}$, and let the variances be equal and known, $\sigma_1^2 = \sigma_2^2 = \sigma^2$. The problem is to find the means $\theta = (\mu_1, \mu_2)$. We have a sample set $X = \{x_1, \dots, x_n\}$.

EM for a mixture model
To use the EM algorithm we define hidden random variables (indicators) $y_{ij}$, where $y_{ij} = 1$ if sample $x_i$ was generated by component $j$ and $y_{ij} = 0$ otherwise. Thus for every $i$ we have $y_{i1} + y_{i2} = 1$, and the complete data point is $(x_i, y_{i1}, y_{i2})$. The aim is to calculate
$$Q(\theta, \theta^{old}) = E_Y\!\left[ \log p(X, Y \mid \theta) \,\middle|\, X, \theta^{old} \right]$$
and to maximize $Q$.

EM for a mixture model
For every $x_i$ we have
$$\log p(x_i, y_i \mid \theta) = \sum_{j=1}^{2} y_{ij} \left[ \log \tfrac{1}{2} - \tfrac{1}{2} \log(2\pi\sigma^2) - \frac{(x_i - \mu_j)^2}{2\sigma^2} \right].$$
From the i.i.d. assumption for the sample set we have
$$\log p(X, Y \mid \theta) = \sum_{i=1}^{n} \log p(x_i, y_i \mid \theta).$$
We see that the expression is linear in the $y_{ij}$.

EM for a mixture model
STEP E: because the expression above is linear in the $y_{ij}$, we only need their expected values relative to $p(Y \mid X, \theta^{old})$:
$$E\!\left[ y_{ij} \mid x_i, \theta^{old} \right] = \frac{\exp\!\left( -\frac{(x_i - \mu_j^{old})^2}{2\sigma^2} \right)}{\sum_{l=1}^{2} \exp\!\left( -\frac{(x_i - \mu_l^{old})^2}{2\sigma^2} \right)}.$$
Substituting these expectations for the $y_{ij}$ gives $Q(\theta, \theta^{old})$.

EM for a mixture model
STEP M: differentiating $Q$ with respect to $\mu_j$ and equating to zero we'll have
$$\frac{\partial Q}{\partial \mu_j} = \frac{1}{\sigma^2} \sum_{i=1}^{n} E\!\left[ y_{ij} \mid x_i, \theta^{old} \right] (x_i - \mu_j) = 0.$$
Thus
$$\mu_j^{new} = \frac{\sum_{i=1}^{n} E\!\left[ y_{ij} \mid x_i, \theta^{old} \right] x_i}{\sum_{i=1}^{n} E\!\left[ y_{ij} \mid x_i, \theta^{old} \right]}.$$
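
Putting the E and M steps together, here is a minimal sketch of EM for this two-Gaussian model (equal mixing weights and a known common variance are assumed, as above; the data are synthetic and the function name is mine):

```python
import numpy as np

def em_two_gaussians(x, sigma2=1.0, n_iter=50):
    mu = np.array([x.min(), x.max()], dtype=float)        # crude initial guess
    for _ in range(n_iter):
        # E step: responsibilities E[y_ij | x_i, theta_old]
        d = -(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2)
        w = np.exp(d - d.max(axis=1, keepdims=True))      # stabilized exponentials
        resp = w / w.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted means, as in the update above
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(em_two_gaussians(x))                                # approaches (-2, 3)
```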

EM mixture of Gaussians
In what follows we use $j$ instead of $y$, because the missing variables are discrete in this example. The model density is a linear combination of component densities $p(x \mid j, \theta)$:
$$p(x \mid \theta) = \sum_{j=1}^{M} P(j)\, p(x \mid j, \theta), \qquad (11)$$
where $M$ is the number of basis functions (a parameter of the model) and the $P(j)$ are mixing parameters. They are actually the prior probabilities of a data point having been generated from component $j$ of the mixture.

EM mixture of Gaussians
They satisfy
$$\sum_{j=1}^{M} P(j) = 1, \qquad 0 \le P(j) \le 1. \qquad (12)$$
The component density functions $p(x \mid j)$ are normalized:
$$\int p(x \mid j)\, dx = 1.$$
We shall use Gaussians for $p(x \mid j)$:
$$p(x \mid j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right).$$
We should find the parameters $P(j)$, $\mu_j$, $\Sigma_j$ that maximize the likelihood.

EM mixture of Gaussians
STEP E: calculate the responsibilities under the current parameters $\theta^{old}$ (see formulas (8) and (10)):
$$P^{old}(j \mid x_i) = \frac{P^{old}(j)\, p(x_i \mid j, \theta^{old})}{p(x_i \mid \theta^{old})}.$$
We have
$$Q(\theta^{new}, \theta^{old}) = \sum_{i=1}^{n} \sum_{j=1}^{M} P^{old}(j \mid x_i) \log \left[ P^{new}(j)\, p(x_i \mid j, \theta^{new}) \right]. \qquad (17)$$
We maximize (17) with the constraint (12), via the Lagrangian
$$\tilde{Q} = Q(\theta^{new}, \theta^{old}) + \lambda \left( \sum_{j=1}^{M} P^{new}(j) - 1 \right). \qquad (18)$$

EM mixture of Gaussians
STEP M: the derivative of (18) with respect to $P^{new}(j)$ is
$$\sum_{i=1}^{n} \frac{P^{old}(j \mid x_i)}{P^{new}(j)} + \lambda = 0.$$
Thus
$$P^{new}(j) = -\frac{1}{\lambda} \sum_{i=1}^{n} P^{old}(j \mid x_i). \qquad (20)$$
Using (12) we shall have
$$\lambda = -n. \qquad (21)$$
So from (21) and (20):
$$P^{new}(j) = \frac{1}{n} \sum_{i=1}^{n} P^{old}(j \mid x_i). \qquad (22)$$

EM mixture model. General case
By calculating the derivatives of (18) with respect to $\mu_j$ and $\Sigma_j$ we'll have
$$\mu_j^{new} = \frac{\sum_{i=1}^{n} P^{old}(j \mid x_i)\, x_i}{\sum_{i=1}^{n} P^{old}(j \mid x_i)}, \qquad (23)$$
$$\Sigma_j^{new} = \frac{\sum_{i=1}^{n} P^{old}(j \mid x_i)\, (x_i - \mu_j^{new})(x_i - \mu_j^{new})^T}{\sum_{i=1}^{n} P^{old}(j \mid x_i)}. \qquad (24)$$

EM mixture model. General case
Algorithm for calculating $p(x)$ (formula (11)). For every $x$:
begin
  initialize the parameters $P(j)$, $\mu_j$, $\Sigma_j$
  do a fixed number of times:
    compute the responsibilities $P^{old}(j \mid x_i)$ (E step)
    update the parameters with formulas (22), (23), (24) (M step)
  return $p(x)$ from formula (11)
end
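
Below is a hedged end-to-end sketch of this procedure (my own implementation, not the lecturer's code; the helper names em_gmm, gauss and p_x, the initialization, and the small ridge added to the covariances are all my choices): EM for a general Gaussian mixture in d dimensions with the E step, the M-step updates (22)-(24), and the mixture density (11).

```python
import numpy as np

def gauss(x, mu, cov):
    """Multivariate Gaussian density evaluated at the rows of x."""
    d = mu.shape[0]
    diff = x - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def em_gmm(x, M, n_iter=100, seed=0):
    n, d = x.shape
    rng = np.random.default_rng(seed)
    P = np.full(M, 1.0 / M)                        # mixing parameters P(j)
    mu = x[rng.choice(n, M, replace=False)]        # initialize means at data points
    cov = np.array([np.cov(x.T) + 1e-6 * np.eye(d) for _ in range(M)])
    for _ in range(n_iter):
        # E step: responsibilities P_old(j | x_i)
        dens = np.stack([P[j] * gauss(x, mu[j], cov[j]) for j in range(M)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: formulas (22), (23), (24)
        Nj = resp.sum(axis=0)
        P = Nj / n                                 # (22)
        mu = (resp.T @ x) / Nj[:, None]            # (23)
        for j in range(M):
            diff = x - mu[j]
            # (24), plus a small ridge for numerical stability
            cov[j] = (resp[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return P, mu, cov

def p_x(x, P, mu, cov):
    """Mixture density p(x), formula (11)."""
    return sum(P[j] * gauss(x, mu[j], cov[j]) for j in range(len(P)))

# usage on synthetic 2-D data with two well-separated clusters
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 1, (150, 2))])
P, mu, cov = em_gmm(x, M=2)
print(P, mu, p_x(x[:3], P, mu, cov))
```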