Maximum Likelihood Estimation & Expectation Maximization


Maximum Likelihood Estimation & Expectation Maximization Lecture 3 – Oct 5, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 Many of the techniques we will be discussing in class are based on probabilistic models; examples include Bayesian networks, hidden Markov models, and Markov random fields. So, before we start discussing computational methods in active areas of research, we will spend a couple of lectures on some basics of probabilistic models.

Outline Probabilistic models in biology. Mathematical foundations: the model selection problem, Bayesian networks. Learning from data: maximum likelihood estimation (MLE), maximum a posteriori (MAP), expectation maximization (EM).

Parameter Estimation Assumptions: a fixed network structure, and fully observed instances of the network variables D={d[1],…,d[M]}, e.g. {i0,d1,g1,l0,s0}. Goal: estimate the "parameters" of the Bayesian network by maximum likelihood estimation (MLE)! (Example network from Koller & Friedman.)

The Thumbtack example Parameter learning for a single variable X: the outcome of a thumbtack toss, with Val(X) = {head, tail}. Data: a set of thumbtack tosses x[1],…,x[M].

Maximum likelihood estimation Say that P(x=head) = Θ and P(x=tail) = 1-Θ. Then P(HHTTHHH…<Mh heads, Mt tails>; Θ) = Θ^Mh (1-Θ)^Mt. Definition: the likelihood function is L(Θ : D) = P(D; Θ). Maximum likelihood estimation (MLE): given data D=HHTTHHH…<Mh heads, Mt tails>, find the Θ that maximizes the likelihood function L(Θ : D).

Likelihood function

MLE for the Thumbtack problem Given data D=HHTTHHH…<Mh heads, Mt tails>, the MLE solution is Θ* = Mh / (Mh+Mt). Proof sketch: maximize the log-likelihood and set its derivative to zero (see below).
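A minimal version of the derivation, written out in standard form:

```latex
\log L(\Theta : D) = M_h \log \Theta + M_t \log(1-\Theta)
\qquad
\frac{d}{d\Theta} \log L(\Theta : D) = \frac{M_h}{\Theta} - \frac{M_t}{1-\Theta} = 0
\;\Rightarrow\;
M_h(1-\Theta) = M_t\,\Theta
\;\Rightarrow\;
\Theta^* = \frac{M_h}{M_h + M_t}.
```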

Continuous space Assuming a sample x1, x2,…, xn is drawn from a parametric distribution f(x|Θ), estimate Θ. Say that the n samples are from a normal distribution with mean μ and variance σ².

Continuous space (cont.) Let Θ1 = μ and Θ2 = σ², and maximize the log-likelihood over both parameters (sketch below).
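A sketch of the standard maximum-likelihood result for the normal model:

```latex
\log L(\Theta_1,\Theta_2 : x_1,\dots,x_n)
  = -\frac{n}{2}\log(2\pi\Theta_2) - \frac{1}{2\Theta_2}\sum_{i=1}^n (x_i-\Theta_1)^2,
```

and setting the partial derivatives with respect to Θ1 and Θ2 to zero gives

```latex
\hat{\Theta}_1 = \frac{1}{n}\sum_{i=1}^n x_i,
\qquad
\hat{\Theta}_2 = \frac{1}{n}\sum_{i=1}^n \left(x_i-\hat{\Theta}_1\right)^2.
```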

Any Drawback? Is the estimator biased? Yes. As an extreme case, when n = 1 the MLE estimate of θ2 is 0. The MLE systematically underestimates θ2. Why? A bit harder to see, but think about n = 2. Then the MLE estimate of θ1 lies exactly between the two sample points, the position that minimizes the sum of squared deviations (and hence the estimate of θ2). Any other choice of θ1, θ2 makes the likelihood of the observed data slightly lower. But it is actually quite unlikely that two sample points would fall exactly equidistant from, and on opposite sides of, the true mean, so the MLE systematically underestimates θ2.
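A quick simulation (my own illustration, not from the slides; the parameter choices are arbitrary) showing the bias numerically: with n samples, the MLE of the variance averages roughly (n-1)/n times the true variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_var, n, trials = 0.0, 1.0, 2, 100_000

# Draw many datasets of size n and compute the MLE of the variance for each:
# theta2_hat = (1/n) * sum((x_i - mean(x))^2), i.e. np.var with ddof=0.
samples = rng.normal(true_mu, np.sqrt(true_var), size=(trials, n))
mle_var = samples.var(axis=1, ddof=0)

print("average MLE variance :", mle_var.mean())                 # ~ (n-1)/n = 0.5
print("after n/(n-1) rescale:", (n / (n - 1)) * mle_var.mean())  # ~ 1.0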

Maximum a posteriori Incorporating priors: how? Instead of maximizing the likelihood P(D; Θ) as in MLE, MAP estimation maximizes the posterior P(Θ|D) ∝ P(D|Θ) P(Θ), where P(Θ) encodes our prior belief about the parameters.
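For concreteness (a standard example, not spelled out on the slide): with a Beta(α, β) prior on Θ in the thumbtack model,

```latex
P(\Theta \mid D) \;\propto\; \Theta^{M_h}(1-\Theta)^{M_t}\cdot \Theta^{\alpha-1}(1-\Theta)^{\beta-1}
\quad\Rightarrow\quad
\Theta_{\mathrm{MAP}} = \frac{M_h + \alpha - 1}{M_h + M_t + \alpha + \beta - 2},
```

which reduces to the MLE when α = β = 1 (a uniform prior).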

MLE for general problems Learning problem setting: a set of random variables X from an unknown distribution P*; training data D consisting of M instances of X: {d[1],…,d[M]}; and a parametric model P(X; Θ) (a 'legal' distribution). Define the likelihood function L(Θ : D) = Π_m P(d[m]; Θ). Maximum likelihood estimation: choose the parameters Θ* that satisfy Θ* = argmax_Θ L(Θ : D).
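When no closed form is available, the MLE can be found numerically by minimizing the negative log-likelihood. A minimal sketch (my own illustration, assuming a Gaussian model and synthetic data):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=500)   # synthetic sample

def neg_log_likelihood(params, x):
    mu, sigma = params
    # L(Theta : D) = prod_m P(d[m]; Theta); we minimize -log of that product.
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(data,),
                  bounds=[(None, None), (1e-6, None)])   # keep sigma > 0
print("MLE (mu, sigma):", result.x)   # close to (3.0, 2.0), up to sampling noise
```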

MLE for Bayesian networks Structure G: the network factorizes the joint distribution as PG = P(x1,x2,x3,x4) = P(x1) P(x2) P(x3|x1,x2) P(x4|x1,x3). Parameters θ: Θx1, Θx2, Θx3|x1,x2, Θx4|x1,x3 (more generally, Θxi|pai). Given data D: x[1],…,x[m],…,x[M], where each x[m] = (x1[m],x2[m],x3[m],x4[m]), estimate θ. The likelihood decomposes into a product of local likelihood functions, one per variable Xi (see below).
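The decomposition referenced above, written out (the standard form for Bayesian networks):

```latex
L(\Theta : D) \;=\; \prod_{m=1}^{M} P(x[m]; \Theta)
  \;=\; \prod_{m=1}^{M} \prod_{i} P\!\left(x_i[m] \mid \mathrm{pa}_i[m];\, \Theta_{X_i\mid \mathrm{Pa}_i}\right)
  \;=\; \prod_{i} L_i\!\left(\Theta_{X_i\mid \mathrm{Pa}_i} : D\right),
```

so each local likelihood L_i can be maximized independently.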

Bayesian network with table CPDs: the Thumbtack example vs. the Student example.
Variables: X vs. Intelligence (I), Difficulty (D), Grade (G).
Joint distribution: P(X) vs. P(I,D,G) = P(I) P(D) P(G|I,D).
Parameters θ: θ vs. θI, θD, θG|I,D.
Data D: {H,…,x[m],…,T} vs. {(i[1],d[1],g[1]),…,(i[m],d[m],g[m]),…}.
Likelihood function L(θ:D) = P(D;θ): θ^Mh (1-θ)^Mt vs. a product of local likelihoods, one per CPD.
MLE solution: θ* = Mh/(Mh+Mt) vs. the corresponding relative-frequency estimate for each CPD entry.

Maximum Likelihood Estimation Review Find parameter estimates that make the observed data most likely. A general approach, as long as a tractable likelihood function exists. Can use all available information.

Example – Gene Expression "Coding" regions carry the instructions for making proteins; "regulatory" regions (regulons) carry the instructions for when and where to make them. Regulatory regions contain "binding sites" (6-20 bp). Binding sites attract a special class of proteins known as "transcription factors". Bound transcription factors can initiate transcription (making RNA); proteins that inhibit transcription can also bind to their binding sites. Questions: What turns genes on (producing a protein) and off? When is a gene turned on or off? Where (in which cells) is a gene turned on? How many copies of the gene product are produced?

Regulation of Genes (figure, source: M. Tompa, U. of Washington): a transcription factor (protein) and RNA polymerase act on DNA; a regulatory element (binding site) sits upstream of the gene.

Regulation of Genes (figure, source: M. Tompa, U. of Washington): once the transcription factor is bound to the regulatory element, RNA polymerase transcribes the gene and a new protein is produced.

The Gene regulation example What determines the expression level of a gene? What are the observed and hidden variables? The gene's expression level e.G and the transcription factors' expression levels e.TF1,…,e.TFN are observed; Process.G, the biological process the gene is involved in (e.g., p1, p2, p3), is hidden and is what we want to infer. (Figure: a graphical model over e.TF1,…,e.TFN, Process.G, and e.G.)

The Gene regulation example (cont.) What determines the expression level of a gene? What are the observed and hidden variables? e.G and the e.TF's are observed; Process.G is hidden and is what we want to infer. How about the BS.G's, the variables indicating whether the gene has each TF's binding site (e.g., BS1.G = Yes)? How deterministic is the sequence of a binding site? How much do we know? (Figure: the model extended with binding-site variables BS1.G,…,BSN.G alongside e.TF1,…,e.TFN, Process.G, and e.G.)

Not all data are perfect Most MLE problems are simple to solve with complete data, but the available data are often "incomplete" in some way.

Outline Learning from data Maximum likelihood estimation (MLE) Maximum a posteriori (MAP) Expectation-maximization (EM) algorithm

Continuous space revisited Assume the sample x1, x2,…, xn is drawn from a mixture of parametric distributions, and estimate the parameters. (Figure: points x1,…,xm fall in one region of the x axis and xm+1,…,xn in another, suggesting two components.)

A real example: CpG content of human gene promoters. (Figure: the distribution of GC frequency across promoters is bimodal.) "A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters," Saxonov, Berg, and Brutlag, PNAS 2006;103:1412-1417.

Mixture of Gaussians Parameters θ: the component means, variances, and mixing parameters. The p.d.f. of the mixture is the weighted sum of the component Gaussian densities (written out below).
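Written out for a two-component mixture, denoting the mixing parameters τ1, τ2 (the τ notation is mine; the rest of the lecture refers to the components as f1 and f2):

```latex
f(x;\theta) \;=\; \tau_1\, \mathcal{N}(x;\mu_1,\sigma_1^2) + \tau_2\, \mathcal{N}(x;\mu_2,\sigma_2^2),
\qquad \tau_1+\tau_2=1,
```

```latex
\log L(\theta : x_1,\dots,x_n) \;=\; \sum_{i=1}^n \log\!\left[\tau_1 f_1(x_i) + \tau_2 f_2(x_i)\right].
```

The log of a sum does not separate, which is why the next slide notes that no closed-form maximizer is known.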

Apply MLE? No closed-form solution is known for finding the θ that maximizes L. However, what if we knew the hidden data, i.e., which component each xi came from?

EM as Chicken vs Egg
IF the zij were known, we could estimate the parameters θ (e.g., only points in cluster 2 influence μ2, σ2).
IF the parameters θ were known, we could estimate the zij (e.g., if |xi - μ1|/σ1 << |xi - μ2|/σ2, then zi1 >> zi2).
BUT we know neither; (optimistically) iterate:
E-step: calculate the expected zij, given the parameters.
M-step: do "MLE" for the parameters (μ, σ), given the E(zij).
Overall, a clever "hill-climbing" strategy. Convergence provable? YES.

"Classification EM"
If zij < 0.5, pretend it is 0; if zij > 0.5, pretend it is 1 (i.e., hard-classify each point to one component).
Now recalculate θ, assuming that partition.
Then recalculate the zij, assuming that θ.
Then recalculate θ, assuming the new zij, etc., etc.

Full EM
The xi's are known; Θ is unknown. The goal is to find the MLE Θ of L(Θ : x1,…,xn), the observed (incomplete) data likelihood.
This would be easy if the zij's were known, i.e., if we could work with L(Θ : x1,…,xn, z11,z12,…,zn2), the complete data likelihood.
But the zij's are not known. Instead, maximize the expected complete data (log-)likelihood E[ log L(Θ : x1,…,xn, z11,z12,…,zn2) ], where the expectation is over the distribution of the hidden data (the zij's) given the observed data and the current parameters.
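In standard notation (not used explicitly on the slides), each EM iteration maximizes this expected complete-data log-likelihood, often written as a Q function:

```latex
Q\!\left(\Theta \mid \Theta^{(t)}\right) \;=\; \mathbb{E}_{z \,\sim\, P(z \mid x,\, \Theta^{(t)})}
  \left[\, \log L(\Theta : x_1,\dots,x_n,\, z_{11},\dots,z_{n2}) \,\right],
\qquad
\Theta^{(t+1)} = \arg\max_{\Theta} \; Q\!\left(\Theta \mid \Theta^{(t)}\right).
```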

The E-step
Find E(zij), i.e., P(zij = 1), assuming θ is known and fixed.
Let A be the event that xi was drawn from f1, B the event that xi was drawn from f2, and D the observed data xi.
Then the expected value of zi1 is P(A|D), and by Bayes' rule P(A|D) = P(D|A) P(A) / [ P(D|A) P(A) + P(D|B) P(B) ] = τ1 f1(xi) / [ τ1 f1(xi) + τ2 f2(xi) ], where τ1, τ2 are the mixing parameters.

Complete data likelihood Recall that zi1 = 1 if xi was drawn from f1 and zi2 = 1 if xi was drawn from f2; so, correspondingly, the likelihood contribution of xi is based on f1(xi) if zi1 = 1 and on f2(xi) if zi2 = 1. Formulas with "if"s are messy; can we blend them more smoothly?
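A standard way to write the blend, using the 0/1 indicators as exponents (with the mixing parameters τj as above):

```latex
P(x_i, z_{i1}, z_{i2} \mid \Theta)
  \;=\; \left[\tau_1 f_1(x_i)\right]^{z_{i1}} \left[\tau_2 f_2(x_i)\right]^{z_{i2}},
\qquad
\log L(\Theta : x, z) \;=\; \sum_{i=1}^n \sum_{j=1}^2 z_{ij}\left[\log \tau_j + \log f_j(x_i)\right],
```

so taking expectations simply replaces each zij by E(zij), which is exactly what the E-step provides.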

M-step Find the θ maximizing E[ log(Likelihood) ], with the E(zij) from the E-step held fixed; the resulting updates are weighted versions of the usual Gaussian MLE formulas (see below).
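The resulting updates for the two-component Gaussian mixture (standard results, using the τ notation from above):

```latex
\mu_j = \frac{\sum_i E(z_{ij})\, x_i}{\sum_i E(z_{ij})}, \qquad
\sigma_j^2 = \frac{\sum_i E(z_{ij})\,(x_i - \mu_j)^2}{\sum_i E(z_{ij})}, \qquad
\tau_j = \frac{1}{n}\sum_i E(z_{ij}).
```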

EM summary
Fundamentally an MLE problem.
Useful if the analysis is more tractable when the 0/1 hidden data z are known.
Iterate:
E-step: estimate E(z) for each z, given θ.
M-step: estimate the θ maximizing E(log likelihood), given the E(z), where "E(log L)" is with respect to the random z ~ E(z) = p(z = 1).
(A compact code sketch of this loop follows.)
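A minimal, self-contained sketch of this loop for a two-component 1-D Gaussian mixture (my own illustration; the variable names and synthetic data are assumptions, not from the lecture):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Synthetic data from a two-component mixture (assumed for illustration).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# Initial guesses for theta = (mu, sigma, tau) of each component.
mu, sigma, tau = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # E-step: E(z_ij) = tau_j f_j(x_i) / sum_k tau_k f_k(x_i)  (responsibilities).
    weighted = tau * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (n, 2)
    z = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: weighted MLE updates for mu, sigma^2, tau.
    nj = z.sum(axis=0)
    mu = (z * x[:, None]).sum(axis=0) / nj
    sigma = np.sqrt((z * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
    tau = nj / len(x)

print("means:", mu, "std devs:", sigma, "mixing:", tau)
```

With these synthetic data the estimates should land near the generating values (-2, 1, 0.3) and (3, 1.5, 0.7), up to sampling noise and label swapping.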

EM Issues EM is guaranteed not to decrease the likelihood with any EM iteration, so the likelihood converges. But it may converge to a local, not global, maximum. This issue is probably intrinsic, since EM is often applied to NP-hard problems (including clustering, above, and motif discovery, soon). Nevertheless, EM is widely used and often effective.

Acknowledgements Profs. Daphne Koller & Nir Friedman, "Probabilistic Graphical Models" Prof. Larry Ruzzo, CSE 527, Autumn 2009 Prof. Andrew Ng, machine learning lecture notes