5. Maximum Likelihood – II. Prof. Yuille. Stat 231. Fall 2004.

Topics: Exponential Distributions, Sufficient Statistics, and MLE. Maximum Entropy Principle. Model Selection.

Exponential Distributions. Gaussians are members of the class of exponential distributions, which have the form $p(x|\vec{\lambda}) = \frac{1}{Z(\vec{\lambda})}\exp\{\sum_i \lambda_i \phi_i(x)\}$, with parameters $\vec{\lambda} = (\lambda_1,\dots,\lambda_M)$ and statistics $\phi_i(x)$.

Sufficient Statistics. The $\phi_i(x)$ are the sufficient statistics of the distribution. Knowledge of the sums $\sum_{k=1}^{N}\phi_i(x_k)$ is all we need to know about the data $x_1,\dots,x_N$ in order to estimate the parameters; the rest is irrelevant. Almost all standard distributions can be expressed as exponentials – Gaussian, Poisson, etc.

Sufficient Statistics of a Gaussian. One-dimensional Gaussian $p(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\{-\frac{(x-\mu)^2}{2\sigma^2}\}$ and samples $x_1,\dots,x_N$. The sufficient statistics are $\sum_k x_k$ and $\sum_k x_k^2$. These are sufficient to learn the parameters of the distribution from the data.

MLE for a Gaussian. To estimate the parameters, maximize $\prod_{k=1}^{N} p(x_k|\mu,\sigma)$, or equivalently maximize $\sum_{k=1}^{N}\log p(x_k|\mu,\sigma)$. The sufficient statistics are chosen so that the log-likelihood depends on the data only through $\sum_k x_k$ and $\sum_k x_k^2$.
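As a concrete illustration (a minimal Python sketch, not part of the original slides): the Gaussian MLE $\hat\mu = \frac{1}{N}\sum_k x_k$, $\hat\sigma^2 = \frac{1}{N}\sum_k x_k^2 - \hat\mu^2$ can be computed from the two sufficient statistics alone; the raw samples are not needed once the sums are known.

```python
import numpy as np

def gaussian_mle_from_sufficient_stats(sum_x, sum_x2, n):
    """MLE for a 1-D Gaussian using only the sufficient statistics
    sum_x = sum_k x_k and sum_x2 = sum_k x_k^2 (variance is the biased, divide-by-N estimate)."""
    mu = sum_x / n
    sigma2 = sum_x2 / n - mu ** 2
    return mu, sigma2

# Example: the statistics summarize the data; the individual samples are no longer needed.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)
mu_hat, sigma2_hat = gaussian_mle_from_sufficient_stats(x.sum(), (x ** 2).sum(), len(x))
print(mu_hat, np.sqrt(sigma2_hat))   # close to 2.0 and 1.5
```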

Sufficient Statistics for a Gaussian. The distribution is of the exponential form $p(x|\lambda_1,\lambda_2) = \frac{1}{Z(\lambda_1,\lambda_2)}\exp\{\lambda_1 x + \lambda_2 x^2\}$ (with $\lambda_2 < 0$). This is the same as a Gaussian with mean $\mu = -\lambda_1/(2\lambda_2)$ and variance $\sigma^2 = -1/(2\lambda_2)$.

Exponential Models and MLE. MLE corresponds to maximizing the average log-likelihood $\frac{1}{N}\sum_k \log p(x_k|\vec\lambda) = \sum_i \lambda_i \psi_i - \log Z(\vec\lambda)$. This is equivalent to minimizing $\log Z(\vec\lambda) - \sum_i \lambda_i \psi_i$, where $\psi_i = \frac{1}{N}\sum_k \phi_i(x_k)$ are the observed values of the statistics.

Exponential Models and MLE. This minimization is a convex optimization problem and hence has a unique (global) solution. But there is generally no closed-form answer, and finding the solution can be difficult. Algorithms such as Generalized Iterative Scaling are guaranteed to converge to it.
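A rough numerical sketch of this fitting problem, under illustrative assumptions (a small discrete domain for $x$, statistics $\phi_1(x)=x$ and $\phi_2(x)=x^2$, and plain gradient ascent rather than the Generalized Iterative Scaling updates). Because the objective is concave in $\vec\lambda$, any such convergent procedure reaches the same global solution.

```python
import numpy as np

# Toy setup (illustrative, not from the slides): x ranges over a discrete grid,
# and the statistics are phi_1(x) = x and phi_2(x) = x^2.
xs = np.linspace(-5.0, 5.0, 201)                 # discrete domain
phi = np.stack([xs, xs ** 2])                    # shape (2, |X|)

def model_expectations(lam):
    """p(x) proportional to exp{sum_i lam_i phi_i(x)}, plus the expectations E_p[phi_i(x)]."""
    logits = lam @ phi
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p, phi @ p

def fit_exponential_model(psi, lr=0.01, iters=20_000):
    """Gradient ascent on the concave average log-likelihood
    sum_i lam_i psi_i - log Z(lam); its gradient is psi_i - E_p[phi_i]."""
    lam = np.zeros(phi.shape[0])
    for _ in range(iters):
        _, expected = model_expectations(lam)
        lam += lr * (psi - expected)
    return lam

# Observed statistics psi_i = (1/N) sum_k phi_i(x_k) for samples from N(1, 1).
rng = np.random.default_rng(1)
data = rng.normal(1.0, 1.0, size=5000)
psi = np.array([data.mean(), (data ** 2).mean()])
lam = fit_exponential_model(psi)
print(lam)   # roughly (1.0, -0.5): mean -lam1/(2*lam2) = 1, variance -1/(2*lam2) = 1
```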

Maximum Entropy Principle. An alternative way to think of exponential distributions and MLE: start with the statistics, and then estimate both the form and the parameters of the probability distribution using the Maximum Entropy principle.

Entropy. The entropy of a distribution $P(x)$ is $H[P] = -\sum_x P(x)\log P(x)$. It was defined by Shannon as a measure of the information obtained by observing a sample from $P(x)$.
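A tiny illustrative snippet (not from the slides) makes the "information" reading concrete: a fair coin has maximal entropy, a nearly deterministic one has almost none.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H[P] = -sum_x P(x) log P(x) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken to be 0
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))            # log 2 ≈ 0.693: a fair coin is maximally uncertain
print(entropy([0.99, 0.01]))          # ≈ 0.056: a nearly deterministic coin carries little information
```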

Maximum Entropy Principle. Select the distribution $P(x)$ which maximizes the entropy subject to the constraints $\sum_x P(x)\phi_i(x) = \psi_i$ (and $\sum_x P(x) = 1$), enforced with Lagrange multipliers $\lambda_i$. The observed values of the statistics are $\psi_i = \frac{1}{N}\sum_k \phi_i(x_k)$.

Maximum Entropy. Minimize the (negative-entropy) Lagrangian with respect to $P(x)$. This gives the (exponential) form of the distribution: $P(x) = \frac{1}{Z(\vec\lambda)}\exp\{\sum_i \lambda_i \phi_i(x)\}$. Maximizing with respect to the Lagrange parameters then ensures that the constraints are satisfied: $\sum_x P(x)\phi_i(x) = \psi_i$.
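For reference, the variational step being invoked here is the standard one (spelled out below; the slide only states the result):

```latex
% Lagrangian: negative entropy plus the moment and normalization constraints.
\mathcal{L}[P] = \sum_x P(x)\log P(x)
  - \sum_i \lambda_i \Big(\sum_x P(x)\phi_i(x) - \psi_i\Big)
  - \lambda_0 \Big(\sum_x P(x) - 1\Big)

% Setting the variation with respect to P(x) to zero:
\frac{\partial \mathcal{L}}{\partial P(x)}
  = \log P(x) + 1 - \sum_i \lambda_i \phi_i(x) - \lambda_0 = 0
\;\;\Rightarrow\;\;
P(x) = \frac{1}{Z(\vec{\lambda})}\exp\Big\{\sum_i \lambda_i \phi_i(x)\Big\},
\qquad
Z(\vec{\lambda}) = \sum_x \exp\Big\{\sum_i \lambda_i \phi_i(x)\Big\}.
```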

Maximum Entropy. This gives the same result as MLE for exponential distributions: Maximum Entropy + Constraints = Exponential Distribution + MLE Parameters. The Max-Ent distribution which matches the observed sufficient statistics is the exponential distribution with those statistics, with parameters set by MLE. Example: we can obtain a Gaussian by performing Max-Ent with the statistics $x$ and $x^2$.

Minimax Principle. Construct a distribution incrementally by increasing the number of statistics $\phi_1,\dots,\phi_M$. The entropy of the Max-Ent distribution with $M$ statistics is $H_M = \log Z(\vec\lambda) - \sum_{i=1}^{M}\lambda_i\psi_i$. Minimax Principle: select the statistics so as to minimize the entropy of the maximum-entropy distribution. This relates to model selection.

Model Selection. Suppose we do not know which model generated the data: two models $M_1$ and $M_2$, with priors $P(M_1)$ and $P(M_2)$. Model selection enables us to estimate which model is most likely to have generated the data.

Model Selection. Calculate $P(M_1|D) \propto P(D|M_1)P(M_1)$ and compare it with $P(M_2|D) \propto P(D|M_2)P(M_2)$, where the evidence is $P(D|M_a) = \int P(D|\theta_a, M_a)\,P(\theta_a|M_a)\,d\theta_a$. Observe that we must sum (or integrate) over all possible values $\theta_a$ of the model parameters.
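A small numerical sketch of this comparison (the coin-flip example, the fixed-bias and Beta-prior models, and the function names are illustrative assumptions, not from the slides): model $M_1$ fixes the coin bias at 1/2, while model $M_2$ gives the bias a uniform Beta(1,1) prior and integrates it out. Note how integrating over the parameter penalizes the flexible model on unremarkable data – the "Occam's Razor" effect discussed below.

```python
from math import lgamma, log

def log_beta(a, b):
    """log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_fixed(heads, tails, p=0.5):
    """Model M1: coin bias fixed at p, so there are no parameters to integrate out."""
    return heads * log(p) + tails * log(1 - p)

def log_evidence_beta(heads, tails, a=1.0, b=1.0):
    """Model M2: unknown bias theta with a Beta(a, b) prior, integrated out analytically:
    P(D | M2) = B(heads + a, tails + b) / B(a, b)."""
    return log_beta(heads + a, tails + b) - log_beta(a, b)

# Data from a (roughly) fair coin: the flexible model pays an Occam penalty.
print(log_evidence_fixed(52, 48), log_evidence_beta(52, 48))   # ≈ -69.3 vs ≈ -71.4: M1 preferred

# Data from a clearly biased coin: the extra flexibility of M2 now pays off.
print(log_evidence_fixed(80, 20), log_evidence_beta(80, 20))   # ≈ -69.3 vs ≈ -52.3: M2 preferred
```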

Model Selection & Minimax. The entropy of the Max-Ent distribution is minus the (average) log-probability of the data under the fitted model. So the Minimax Principle is a form of model selection, but one that estimates the parameters instead of summing them out.

Model Selection. Important issue: suppose the model $M_2$ has more parameters than $M_1$. Then $M_2$ is more flexible and can fit a larger range of data sets. But summing over the parameters in $P(D|M_2)$ penalizes this flexibility, giving an "Occam's Razor" that favors the simpler model when both fit the data comparably well.

Model Selection. More advanced modeling requires performing model selection where the models are complex. This is beyond the scope of this course.