EM algorithm and applications. Lecture #9. Background Readings: Chapters 11.2, 11.6 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.

2 The EM algorithm. This lecture's plan: 1. Presentation and correctness proof of the EM algorithm. 2. Examples of implementations.

3 Model, Parameters, ML. A “model with parameters θ” is a probabilistic space M, in which each simple event y is determined by the values of random variables (dice). The parameters θ are the probabilities associated with the random variables. (In an HMM of length L, the simple events are HMM-sequences of length L, and the parameters are the transition probabilities m_kl and the emission probabilities e_k(b).) An “observed data” is a nonempty subset x ⊆ M. (In an HMM, it is usually the set of all simple events which fit a given output sequence.) Given observed data x, the ML method seeks parameters θ* which maximize the likelihood of the data, p(x|θ) = ∑_y p(x,y|θ). (In an HMM, x can be the transmitted letters, and y the hidden states.) Finding such θ* is easy when the observed data is a simple event, but hard in general.

4 The EM algorithm. Assume a model with parameters as in the previous slide. Given observed data x, the likelihood of x under model parameters θ is p(x|θ) = ∑_y p(x,y|θ). (The pairs (x,y) are the simple events which comprise x; informally, y denotes the possible values of the “hidden data”.) The EM algorithm receives x and parameters θ, and returns new parameters λ* such that p(x|λ*) ≥ p(x|θ), with equality only if λ* = θ; i.e., the new parameters never decrease the likelihood of the observed data.

5 The EM algorithm. EM uses the current parameters θ to construct a simpler ML problem, L_θ. Guarantee: if L_θ(λ) > L_θ(θ), then P(x|λ) > P(x|θ). [Figure: the graphs of log P(x|λ) and L_θ(λ) = E_{p(y|x,θ)}[log P(x,y|λ)] as functions of λ, with the current parameters θ and the maximizer λ* of L_θ marked.]

6 Derivation of the EM Algorithm. Let x be the observed data, and let {(x,y_1), …, (x,y_k)} be the set of (simple) events which comprise x. Our goal is to find parameters θ* which maximize the sum p(x|θ*) = ∑_i p(x,y_i|θ*). As this is hard, we start with some parameters θ and only look for λ* such that, if λ* ≠ θ, then p(x|λ*) > p(x|θ). Such a λ* is found via “virtual sampling”, defined next.

7 For given parameters θ, let p_i = p(y_i|x,θ) (note that p_1 + … + p_k = 1). We use the p_i’s to define “virtual” sampling, in which y_1 occurs p_1 times, y_2 occurs p_2 times, …, and y_k occurs p_k times.
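To make the virtual sampling concrete, here is a minimal Python sketch (not part of the original slides) that turns the joint probabilities p(x,y_i|θ) into the weights p_i; the example numbers anticipate the two-coin-toss example that appears later in this lecture.

```python
# Minimal sketch (not from the slides): compute the "virtual sampling" weights
# p_i = p(y_i | x, theta) from the joint probabilities p(x, y_i | theta).
def posterior_weights(joint_probs):
    total = sum(joint_probs)               # p(x | theta) = sum_i p(x, y_i | theta)
    return [p / total for p in joint_probs]

# Using the joint probabilities from the coin-toss example later in the lecture:
# p(x, y1 | theta) = 3/4 * 1/4 = 3/16,  p(x, y2 | theta) = 3/4 * 3/4 = 9/16
print(posterior_weights([3/16, 9/16]))     # [0.25, 0.75], i.e. p1 = 1/4, p2 = 3/4
```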

8 The EM algorithm. In each iteration the EM algorithm does the following. (E step): Given θ, compute the function L_θ(λ) = ∑_y p(y|x,θ)·log p(x,y|λ). (M step): Find λ* which maximizes L_θ(λ). (The next iteration sets θ ← λ* and repeats.) Comments: 1. At the M step we only need that L_θ(λ*) > L_θ(θ); this relaxation yields the so-called Generalized EM algorithm, used when it is hard to find the optimal λ*. 2. Usually, the computations use a more convenient form of this function (in the examples below, the expected-count form ∑_k N_k·log λ_k).
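As a hedged illustration (not from the slides), the iteration above can be written as a small generic Python loop; `joint` and `maximize_L` are placeholder callables standing in for the model-specific pieces.

```python
# Generic EM iteration for a model whose observed data x has finitely many
# hidden completions ys = [y_1, ..., y_k].  All names here are illustrative.
def em_step(ys, theta, joint, maximize_L):
    """One E+M step.  joint(y, theta) must return p(x, y | theta);
    maximize_L(ys, weights) must return the lambda maximizing
    L_theta(lambda) = sum_i weights[i] * log p(x, y_i | lambda)."""
    # E step: posterior weights p(y_i | x, theta) define L_theta
    joints = [joint(y, theta) for y in ys]
    px = sum(joints)                          # p(x | theta)
    weights = [j / px for j in joints]        # the "virtual sampling" frequencies
    # M step: maximize the simpler objective L_theta
    return maximize_L(ys, weights)

def em(ys, theta0, joint, maximize_L, n_iters=100):
    theta = theta0
    for _ in range(n_iters):                  # each step cannot decrease p(x | theta)
        theta = em_step(ys, theta, joint, maximize_L)
    return theta
```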

9 Correctness Theorem for the EM Algorithm

10 Correctness proof of EM

11 Correctness proof of EM (end)
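The bodies of slides 9-11 were figures in the original deck and did not survive the transcript. For reference, a standard version of the correctness argument, consistent with the L_θ defined above but not a verbatim reconstruction of the slides, is:

```latex
% Standard EM correctness argument (sketch; assumes amsmath; not a verbatim
% copy of the slides).  Write p(x,y|.) = p(x|.) p(y|x,.) and take expectations
% with respect to p(y|x,\theta), which sums to 1 over y:
\begin{align*}
\log p(x\mid\lambda) - \log p(x\mid\theta)
  &= \sum_y p(y\mid x,\theta)\log\frac{p(x,y\mid\lambda)}{p(x,y\mid\theta)}
   + \sum_y p(y\mid x,\theta)\log\frac{p(y\mid x,\theta)}{p(y\mid x,\lambda)} \\
  &= \bigl[L_\theta(\lambda)-L_\theta(\theta)\bigr]
   + D\!\bigl(p(\cdot\mid x,\theta)\,\big\|\,p(\cdot\mid x,\lambda)\bigr)
  \;\ge\; L_\theta(\lambda)-L_\theta(\theta).
\end{align*}
% The last step uses that relative entropy is nonnegative.
```

So if the M step finds λ with L_θ(λ) ≥ L_θ(θ), then p(x|λ) ≥ p(x|θ), with strict inequality whenever L_θ(λ) > L_θ(θ); this is exactly the guarantee stated on slide 5.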

12 Example: Baum-Welch = EM for HMM. The Baum-Welch algorithm is the EM algorithm for HMMs. (E step for HMM): compute the function L_θ(λ), where λ are the new parameters {m_kl, e_k(b)}. (M step for HMM): look for λ which maximizes L_θ(λ).

13 Baum-Welch = EM for HMM (cont.) [The slide gives the formulas for the expected transition counts M_kl and the expected emission counts E_k(b).]
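The slide's formulas for M_kl and E_k(b) did not survive the transcript; in the textbook's forward/backward notation, a plausible reconstruction (following Durbin et al., Ch. 3.3, with the slides' m_kl in place of the book's a_kl) is:

```latex
% Plausible reconstruction of the expected counts (Durbin et al., Ch. 3.3),
% with forward values f_k(i), backward values b_k(i) and sequence likelihoods
% P(x^j) computed under the current parameters.
\[
M_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i
           f_k^j(i)\, m_{kl}\, e_l\bigl(x_{i+1}^j\bigr)\, b_l^j(i+1),
\qquad
E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \,:\, x_i^j = b\}}
           f_k^j(i)\, b_k^j(i).
\]
```

The M step then normalizes these counts, m_kl = M_kl / ∑_{l′} M_{kl′} and e_k(b) = E_k(b) / ∑_{b′} E_k(b′), which is the usual Baum-Welch update.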

14 A simple example: EM for 2 coin tosses. Consider the following experiment: a coin has two possible outcomes, H (head) and T (tail), with probabilities θ_H and θ_T = 1 - θ_H. The coin is tossed twice, but only the 1st outcome, T, is seen, so the data is x = (T,*). We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be θ = (θ_H, θ_T) = (¼, ¾).

15 EM for 2 coin tosses (cont.) The “hidden data” which produce x are the sequences y_1 = (T,H) and y_2 = (T,T). Hence the likelihood of x with parameters (θ_H, θ_T) is p(x|θ) = P(x,y_1|θ) + P(x,y_2|θ) = θ_H·θ_T + θ_T². For the initial parameters θ = (¼, ¾) we get p(x|θ) = ¼·¾ + ¾·¾ = ¾. Note that in this case P(x,y_i|θ) = P(y_i|θ) for i = 1,2; we can always define y so that (x,y) = y (otherwise we set y′ ≡ (x,y) and replace the “y”s by “y′”s).

16 EM for 2 coin tosses - E step. Calculate L_θ(λ) = L_θ(λ_H, λ_T). Recall: λ_H, λ_T are the new parameters, which we need to optimize. p(y_1|x,θ) = p(y_1,x|θ)/p(x|θ) = (¾·¼)/(¾) = ¼ and p(y_2|x,θ) = p(y_2,x|θ)/p(x|θ) = (¾·¾)/(¾) = ¾. Thus we have the weights p_1 = ¼ and p_2 = ¾; this is the “virtual sampling”, and L_θ(λ) = ¼·log p(x,y_1|λ) + ¾·log p(x,y_2|λ).

17 EM for 2 coin tosses - E step. For a sequence y of coin tosses, let N_H(y) be the number of H’s in y and N_T(y) the number of T’s in y. Then log p(x,y|λ) = N_H(y)·log λ_H + N_T(y)·log λ_T. In our example y_1 = (T,H) and y_2 = (T,T), hence N_H(y_1) = N_T(y_1) = 1, N_H(y_2) = 0, N_T(y_2) = 2.

18 Example: 2 coin tosses - E step. Thus L_θ(λ) = N_H·log λ_H + N_T·log λ_T, where N_H = ¼·N_H(y_1) + ¾·N_H(y_2) = ¼ and N_T = ¼·N_T(y_1) + ¾·N_T(y_2) = 7/4. In general, L_θ(λ) = N_H·log λ_H + N_T·log λ_T, where N_H and N_T are the expected counts of H and T under the virtual sample defined by p(y|x,θ).

19 EM for 2 coin tosses - M step. Find λ* which maximizes L_θ(λ) = N_H·log λ_H + N_T·log λ_T. As we have already seen, this is maximized when λ_H = N_H/(N_H + N_T) = 1/8 and λ_T = N_T/(N_H + N_T) = 7/8. [The optimal parameters (0,1) will never be reached by the EM algorithm!]
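The arithmetic in the coin example can be checked with a few lines of Python (a sketch added for this transcript; the setup is exactly the one from slides 14-19):

```python
# A minimal check of the worked example: x = (T, *), hidden completions
# y1 = (T,H) and y2 = (T,T), initial parameters theta = (1/4, 3/4).
theta_H, theta_T = 0.25, 0.75

# E step: posterior weights of the two completions
joint = {"y1": theta_T * theta_H,        # p(x, y1 | theta) = 3/16
         "y2": theta_T * theta_T}        # p(x, y2 | theta) = 9/16
px = sum(joint.values())                 # p(x | theta) = 3/4
w = {y: p / px for y, p in joint.items()}    # w = {y1: 1/4, y2: 3/4}

# Expected counts N_H, N_T under the "virtual sample"
counts = {"y1": (1, 1), "y2": (0, 2)}    # (N_H(y), N_T(y)) for each completion
N_H = sum(w[y] * counts[y][0] for y in w)    # 1/4
N_T = sum(w[y] * counts[y][1] for y in w)    # 7/4

# M step: normalized expected counts maximize L_theta(lambda)
lam_H = N_H / (N_H + N_T)                # 1/8
lam_T = N_T / (N_H + N_T)                # 7/8
print(N_H, N_T, lam_H, lam_T)            # 0.25 1.75 0.125 0.875
```

Running another step from (1/8, 7/8) pushes θ_T further toward 1, but, as the slide notes, (θ_H, θ_T) = (0, 1) is approached only in the limit.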

20 EM for single random variable (dice). Now the probability of each y (≡ (x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities λ_1, …, λ_m. Let N_k(y) = #(times outcome k occurs in y); then p(x,y|λ) = ∏_k λ_k^{N_k(y)}. Let N_k be the expected value of N_k(y), given x and θ: N_k = E(N_k|x,θ) = ∑_y p(y|x,θ)·N_k(y). Then we have L_θ(λ) = ∑_k N_k·log λ_k.

21 L_θ(λ) for one dice. [Figure: the graph of L_θ(λ) = ∑_k N_k·log λ_k as a function of λ.]
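The figure itself is not recoverable from the transcript; for completeness, here is a standard Lagrange-multiplier derivation (a sketch, not necessarily the slide's own argument) of the maximizer of L_θ(λ) over the probability simplex:

```latex
% Maximize L_theta(lambda) = sum_k N_k log(lambda_k) subject to
% sum_k lambda_k = 1 (standard derivation; not a verbatim copy of the slide).
\[
\frac{\partial}{\partial \lambda_k}
  \Bigl[\sum_{k'} N_{k'} \log \lambda_{k'}
        - \mu\Bigl(\sum_{k'} \lambda_{k'} - 1\Bigr)\Bigr]
  = \frac{N_k}{\lambda_k} - \mu = 0
\;\;\Longrightarrow\;\;
\lambda_k = \frac{N_k}{\mu} = \frac{N_k}{\sum_{k'} N_{k'}},
\qquad \mu = \sum_{k'} N_{k'}.
\]
```

In the coin example this gives λ_H = (¼)/(¼ + 7/4) = 1/8 and λ_T = 7/8, matching the M step on slide 19.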

22 EM algorithm for n independent observations x_1, …, x_n: Expectation step. It can be shown that, if the x_j are independent, then the expected counts add up over the observations: N_k = ∑_{j=1}^n E(N_k|x_j,θ) = ∑_{j=1}^n ∑_y p(y|x_j,θ)·N_k(y), i.e., the E step can be carried out on each observation separately.

23 Example: The ABO locus. A locus is a particular place on the chromosome. Each locus’ state (called its genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine distinguishable features; the ABO locus, for example, determines blood type. The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions of the 6 genotypes in a population. Suppose we randomly sampled N individuals and found that N_{a/a} have genotype a/a, N_{a/b} have genotype a/b, etc. Then the MLE is given by the observed frequencies, e.g. θ_{a/a} = N_{a/a}/N.

24 The ABO locus (Cont.) However, testing individuals for their genotype is very expensive. Can we estimate the genotype proportions using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)? The problem is that among individuals measured to have blood type A, we don’t know how many have genotype a/a and how many have genotype a/o. So what can we do?

25 The ABO locus (Cont.) The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles q_a, q_b, q_o in the population determine the frequencies of the genotypes as follows: q_{a/b} = 2·q_a·q_b, q_{a/o} = 2·q_a·q_o, q_{b/o} = 2·q_b·q_o, q_{a/a} = (q_a)², q_{b/b} = (q_b)², q_{o/o} = (q_o)². In fact, the Hardy-Weinberg rule follows from modeling this problem as observed data x with hidden data y:
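As a quick numerical sanity check (the allele frequencies below are made-up illustrative values, not from the lecture), the Hardy-Weinberg genotype frequencies sum to 1 and induce blood-type probabilities as follows:

```python
# Quick check of the Hardy-Weinberg genotype frequencies for illustrative
# allele frequencies (the numbers are made up, not from the lecture).
qa, qb, qo = 0.3, 0.1, 0.6
genotype = {
    "a/a": qa**2, "b/b": qb**2, "o/o": qo**2,
    "a/b": 2*qa*qb, "a/o": 2*qa*qo, "b/o": 2*qb*qo,
}
assert abs(sum(genotype.values()) - 1.0) < 1e-12   # (qa + qb + qo)^2 = 1

# Blood-type probabilities implied by the genotypes:
pA  = genotype["a/a"] + genotype["a/o"]    # 0.09 + 0.36 = 0.45
pB  = genotype["b/b"] + genotype["b/o"]    # 0.01 + 0.12 = 0.13
pAB = genotype["a/b"]                      # 0.06
pO  = genotype["o/o"]                      # 0.36
print(pA, pB, pAB, pO)                     # sums to 1.0
```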

26 The ABO locus (Cont.) The dice’s outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, i.e., by an “ordered genotype pair”; this is the hidden data: A = {(a,a), (a,o), (o,a)}; B = {(b,b), (b,o), (o,b)}; AB = {(a,b), (b,a)}; O = {(o,o)}. So we have three parameters of one dice, q_a, q_b, q_o, that we need to estimate. We start with parameters θ = (q_a, q_b, q_o), and then use EM to improve them.

27 EM setting for the ABO locus. The observed data x = (x_1, …, x_n) is a sequence of elements (blood types) from the set {A, B, AB, O}; e.g., (B,A,B,B,O,A,B,A,O,B,AB) are observations (x_1, …, x_11). The hidden data (i.e. the y’s) for each x_j is the set of ordered pairs of alleles that generate it; for instance, for A it is the set {aa, ao, oa}. The parameters θ = {q_a, q_b, q_o} are the (current) probabilities of the alleles. The complete implementation of the EM algorithm for this problem will be given in the tutorial.
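The full implementation is left to the tutorial, as the slide says; the sketch below (illustrative variable names and blood-type counts, not the tutorial's code) shows what one EM pass for the allele frequencies looks like: the E step distributes each blood-type count over its ordered genotype pairs, and the M step normalizes the resulting expected allele counts.

```python
# Minimal sketch of one EM pass for the ABO allele frequencies.  The counts and
# names are illustrative; the tutorial mentioned on the slide gives the full
# implementation.
def em_update(n_A, n_B, n_AB, n_O, qa, qb, qo):
    # E step: expected allele counts, treating the ordered genotype pair as the
    # hidden data behind each observed blood type.
    pA = qa*qa + 2*qa*qo            # P(blood type A) = P(a/a) + P(a/o)
    pB = qb*qb + 2*qb*qo            # P(blood type B) = P(b/b) + P(b/o)
    na = n_A * (2*qa*qa + 2*qa*qo) / pA + n_AB                  # expected # of 'a'
    nb = n_B * (2*qb*qb + 2*qb*qo) / pB + n_AB                  # expected # of 'b'
    no = n_A * (2*qa*qo) / pA + n_B * (2*qb*qo) / pB + 2*n_O    # expected # of 'o'
    # M step: normalize the expected counts (each individual carries 2 alleles).
    total = na + nb + no            # = 2 * (n_A + n_B + n_AB + n_O)
    return na/total, nb/total, no/total

q = (1/3, 1/3, 1/3)                     # arbitrary starting point
for _ in range(20):
    q = em_update(40, 11, 4, 45, *q)    # illustrative blood-type counts
print(q)                                # estimated (q_a, q_b, q_o)
```

Note how the expected counts are summed over the independent observations (here, over the four blood-type groups), exactly as stated on slide 22.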