EM and variants of HMM, Lecture #9. Background readings: Chapters 11.2, 11.6, 3.4 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.

2 Reminder: Relative Entropy. Let p, q be two probability distributions on the same sample space. The relative entropy between p and q is defined by D(p||q) = ∑_x p(x) log[p(x)/q(x)] = ∑_x p(x) log(1/q(x)) − H(p), where H(p) = ∑_x p(x) log(1/p(x)) is the entropy of p. Informally, D(p||q) is "the inefficiency of assuming distribution q when the correct distribution is p".

3 Non-negativity of relative entropy. Claim: D(p||q) = ∑_x p(x) log[p(x)/q(x)] ≥ 0, with equality only if q = p. Proof: We may take the log to base e, i.e. log x = ln x. Then, for all x > 0, ln x ≤ x − 1, with equality only if x = 1. Thus −D(p||q) = ∑_x p(x) ln[q(x)/p(x)] ≤ ∑_x p(x)[q(x)/p(x) − 1] = ∑_x [q(x) − p(x)] = 0.
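
To make the definition concrete, here is a minimal Python sketch (the distributions p and q below are arbitrary examples, not taken from the lecture) that computes D(p||q) and illustrates the claim: the value is positive when p ≠ q and exactly zero when p = q.

```python
import math

def relative_entropy(p, q):
    """D(p||q) = sum_x p(x) * log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(relative_entropy(p, q))  # > 0, since p != q
print(relative_entropy(p, p))  # exactly 0
```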

4 Relative entropy as average score for sequence comparisons. Recall that we have defined the scoring function via s(a,b) = log[P(a,b)/(Q(a)Q(b))]. Note that the average score under P is the relative entropy D(P||Q) = ∑_{a,b} P(a,b) log[P(a,b)/Q(a,b)], where Q(a,b) = Q(a)Q(b).

5 The EM algorithm. Consider a model where, for observed data x and model parameters θ, p(x|θ) is defined by p(x|θ) = ∑_y p(x,y|θ); y is the hidden data. The EM algorithm receives x and parameters θ, and returns new parameters λ such that p(x|λ) > p(x|θ). Note: in the book by Durbin et al., the initial parameters are denoted by θ⁰ and the new parameters by θ.

6 The EM algorithm. Maximizing p(x|λ) = ∑_y p(x,y|λ) is equivalent to maximizing its logarithm, log p(x|λ) = log(∑_y p(x,y|λ)). The EM algorithm receives x and initial parameters θ, and then 1. finds λ such that log p(x|λ) ≥ log p(x|θ); 2. sets θ ← λ, and repeats.

7 The EM algorithm. In each iteration the EM algorithm does the following. (E step): Calculate Q_θ(λ) = ∑_y p(y|x,θ) log p(x,y|λ). (M step): Find λ* which maximizes Q_θ(λ). (The next iteration sets θ ← λ* and repeats.) Comments: 1. When θ is clear from context, we write Q(λ) instead of Q_θ(λ). 2. At the M step we only need that Q_θ(λ*) > Q_θ(θ). This relaxation yields the so-called Generalized EM algorithm; it is important when it is hard to find the optimal λ*.

8 Example: EM for 2 coin tosses. Consider the following experiment: given a coin with two possible outcomes, H (head) and T (tail), with probabilities q_H and q_T = 1 − q_H, the coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T, *). We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be θ = (q_H, q_T) = (¼, ¾).

9 EM for 2 coin tosses. The hidden data which can produce x are the sequences y₁ = (T,H) and y₂ = (T,T) (note that with this definition, (x,yᵢ) = yᵢ). The likelihood of x with parameters (q_H, q_T) is q_T q_H + q_T². For the initial parameters θ = (¼, ¾) we have: p(x|θ) = P(x,y₁|θ) + P(x,y₂|θ) = ¾·¼ + ¾·¾ = ¾. (Note that in this case P(x,yᵢ|θ) = P(yᵢ|θ) for i = 1, 2.)

10 EM for 2 coin tosses: Expectation step. Calculate Q_θ(λ) = Q_θ(q_H, q_T). (Note: here q_H, q_T are the variables of the new parameters λ.) Q_θ(λ) = p(y₁|x,θ) log p(x,y₁|λ) + p(y₂|x,θ) log p(x,y₂|λ), where p(y₁|x,θ) = p(y₁,x|θ)/p(x|θ) = (¾·¼)/(¾) = ¼ and p(y₂|x,θ) = p(y₂,x|θ)/p(x|θ) = (¾·¾)/(¾) = ¾. Thus we have Q_θ(λ) = ¼ log p(x,y₁|λ) + ¾ log p(x,y₂|λ).

11 EM for 2 coin tosses: Expectation step. For a sequence y of coin tosses, let N_H(y) be the number of H's in y and N_T(y) the number of T's in y. Then log p(y|λ) = N_H(y) log q_H + N_T(y) log q_T. [In our example: log p(y₁|λ) = log q_H + log q_T; log p(y₂|λ) = 2 log q_T.]

12 EM for 2 coin tosses: Expectation step. Thus ¼ log p(x,y₁|λ) = ¼ (N_H(y₁) log q_H + N_T(y₁) log q_T) = ¼ (log q_H + log q_T), and ¾ log p(x,y₂|λ) = ¾ (N_H(y₂) log q_H + N_T(y₂) log q_T) = ¾ (2 log q_T). Substituting in the equation for Q_θ(λ): Q_θ(λ) = ¼ log p(x,y₁|λ) + ¾ log p(x,y₂|λ) = (¼ N_H(y₁) + ¾ N_H(y₂)) log q_H + (¼ N_T(y₁) + ¾ N_T(y₂)) log q_T = N_H log q_H + N_T log q_T, where N_H = ¼ and N_T = 7/4.

13 EM for 2 coin tosses: Maximization step. Find λ* which maximizes Q_θ(λ) = N_H log q_H + N_T log q_T = ¼ log q_H + 7/4 log q_T. We saw earlier that this is maximized when q_H = N_H/(N_H + N_T) = 1/8 and q_T = N_T/(N_H + N_T) = 7/8. [The optimal parameters (0,1) will never be reached by the EM algorithm!]
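
As a sanity check on slides 9 through 13, here is a small Python sketch of the same EM iteration (the function and variable names are mine). Starting from θ = (¼, ¾) it reproduces N_H = ¼, N_T = 7/4 and the new parameters (1/8, 7/8), and repeated iterations drive q_H towards, but never exactly to, 0.

```python
def em_step(q_H, q_T):
    """One EM iteration for the observation x = (T, *) with a hidden second toss."""
    # Hidden completions of x and their joint probabilities p(x, y | theta)
    p_y1 = q_T * q_H   # y1 = (T, H)
    p_y2 = q_T * q_T   # y2 = (T, T)
    p_x = p_y1 + p_y2  # p(x | theta)

    # E step: posterior weights p(y | x, theta) times the head/tail counts in y
    w1, w2 = p_y1 / p_x, p_y2 / p_x
    N_H = w1 * 1 + w2 * 0   # expected number of heads
    N_T = w1 * 1 + w2 * 2   # expected number of tails

    # M step: maximize N_H*log(q_H) + N_T*log(q_T) subject to q_H + q_T = 1
    return N_H / (N_H + N_T), N_T / (N_H + N_T)

q_H, q_T = 0.25, 0.75
for i in range(5):
    q_H, q_T = em_step(q_H, q_T)
    print(i + 1, q_H, q_T)   # first iteration gives (0.125, 0.875)
```

Each iteration increases the likelihood q_T q_H + q_T², in line with the general correctness theorem of slides 25 to 27.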

14 EM for general stochastic processes. This time (x,y) is generated by a general stochastic process which employs r discrete random variables ("dice") Z₁,...,Z_r. This can be viewed as a probabilistic state machine, where at each state one of the random variables Z_i is sampled and then the next state is determined, until a final state is reached. As before, we wish to maximize the likelihood of the observation x with hidden data y, i.e. maximize p(x|λ) = ∑_y p(x,y|λ).

15 EM for general stochastic processes. In an HMM, the random variables are the transition probabilities a_kl and the emission probabilities e_k(b); x stands for the visible information (the emitted symbols), y stands for the sequence s of states, and (x,y) stands for the complete path of the HMM. [Figure: HMM with states s₁,...,s_L emitting symbols X₁,...,X_L.] For brevity, we assume that (x,y) = y (otherwise we set y' ← (x,y) and replace "y" by "y'").

16 EM for general stochastic processes. Each random variable Z_k (k = 1,...,r) has m_k values z_{k,1},...,z_{k,m_k} with probabilities {q_kl | l = 1,...,m_k}. Each y defines a sequence of outcomes (z_{k₁,l₁},...,z_{k_n,l_n}) of the random variables used in y. In the HMM, these are the specific transitions and emissions defined by the states and outputs of the sequence y. Let N_kl(y) = #(times z_kl appears in y).

17 EM for general stochastic processes (cont.). Similarly to the dice case, we have log p(y|λ) = ∑_k ∑_l N_kl(y) log q_kl. Define N_kl as the expected value of N_kl(y) under θ: N_kl = E(N_kl|x,θ) = ∑_y p(y|x,θ) N_kl(y). Then we obtain the following expression for Q_θ(λ):

18 Q_θ(λ) for general stochastic processes. Q_θ(λ) = ∑_y p(y|x,θ) log p(y|λ) = ∑_k ∑_l N_kl log q_kl.

19 EM algorithm for general stochastic processes. Expectation step: set N_kl to E(N_kl(y)|x,θ), i.e. N_kl = ∑_y p(y|x,θ) N_kl(y). Maximization step: similarly to the one-die case, set q_kl = N_kl / (∑_{l'} N_{kl'}).

20 EM algorithm for n independent observations x₁,…,x_n: Expectation step. It can be shown that, if the x_j are independent, then the expected counts are simply summed over the observations: N_kl = ∑_{j=1}^n ∑_y p(y|x_j,θ) N_kl(y).

21 Application to HMM. For an HMM, the random variables z_kl are the state transitions and symbol emissions from state k, and the q_kl are the corresponding probabilities a_kl and e_k(b).

22 EM algorithm for HMM (the Baum-Welch training): Expectation step (single observation x). A_kl, the expected number of k→l transitions, A_kl = ∑_s p(s|x,θ) N_kl(x,s), is computed by A_kl = (1/P(x)) ∑_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1). E_kb, the expected number of emissions of b from state k, E_kb = ∑_s p(s|x,θ) E_kb(x,s), is computed by E_kb = (1/P(x)) ∑_{i: x_i = b} f_k(i) b_k(i). Here f_k(i) and b_k(i) are the forward and backward variables.

23 EM algorithm for HMM (the Baum-Welch training): Expectation step (n observations x¹,...,xⁿ). A_kl, the expected number of k→l transitions, A_kl = ∑_j ∑_s p(s|x^j,θ) N_kl(x^j,s), is computed by A_kl = ∑_j (1/P(x^j)) ∑_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1). E_kb, the expected number of emissions of b from state k, E_kb = ∑_j ∑_s p(s|x^j,θ) E_kb(x^j,s), is computed by E_kb = ∑_j (1/P(x^j)) ∑_{i: x^j_i = b} f_k^j(i) b_k^j(i).

24 EM algorithm for HMM (the Baum-Welch training): Maximization step. The new parameters are given by a_kl = A_kl / (∑_{l'} A_{kl'}) and e_k(b) = E_kb / (∑_{b'} E_{kb'}).
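
The sketch below is one way to implement a single Baum-Welch iteration in Python/NumPy for a single observed sequence, following the formulas above. It is a minimal illustration, not the lecture's code: it assumes an explicit initial state distribution pi (the lecture's HMMs use a begin state instead), integer-coded states and symbols, and plain unscaled probabilities, so it is only suitable for short sequences.

```python
import numpy as np

def baum_welch_step(x, a, e, pi):
    """One Baum-Welch (EM) iteration for a single observed sequence x.

    a[k, l] : transition probabilities, e[k, b] : emission probabilities,
    pi[k]   : initial state distribution (an assumption of this sketch).
    Returns updated (a, e) built from the expected counts A_kl and E_kb.
    """
    K, L = a.shape[0], len(x)

    # Forward: fwd[i, k] = P(x_1..x_i, state_i = k)
    fwd = np.zeros((L, K))
    fwd[0] = pi * e[:, x[0]]
    for i in range(1, L):
        fwd[i] = (fwd[i - 1] @ a) * e[:, x[i]]
    px = fwd[-1].sum()  # P(x | theta)

    # Backward: bwd[i, k] = P(x_{i+1}..x_L | state_i = k)
    bwd = np.zeros((L, K))
    bwd[-1] = 1.0
    for i in range(L - 2, -1, -1):
        bwd[i] = a @ (e[:, x[i + 1]] * bwd[i + 1])

    # A_kl = (1/P(x)) * sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1)
    A = np.zeros_like(a)
    for i in range(L - 1):
        A += np.outer(fwd[i], e[:, x[i + 1]] * bwd[i + 1]) * a
    A /= px

    # E_kb = (1/P(x)) * sum_{i: x_i = b} f_k(i) b_k(i)
    E = np.zeros_like(e)
    for i in range(L):
        E[:, x[i]] += fwd[i] * bwd[i]
    E /= px

    # M step: a_kl = A_kl / sum_l' A_kl',  e_k(b) = E_kb / sum_b' E_kb'
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)

# Toy 2-state, 2-symbol example (all numbers are arbitrary)
a = np.array([[0.9, 0.1], [0.2, 0.8]])
e = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.5, 0.5])
x = [0, 1, 1, 0, 1]
a, e = baum_welch_step(x, a, e, pi)
print(a)
print(e)
```

In practice the forward and backward values are rescaled (or kept in log space) to avoid numerical underflow on long sequences, and the counts from several sequences are summed before normalizing, as in slide 23.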

25 Correctness proof of EM. Theorem: If λ* maximizes Q_θ(λ) = ∑_y p(y|x,θ) log p(y|λ), then P(x|λ*) ≥ P(x|θ). Comment: in the proof we will only use the weaker assumption that Q_θ(λ*) ≥ Q_θ(θ).

26 Proof (cont.). For each y we have p(x|λ) p(y|x,λ) = p(y,x|λ), and hence log p(x|λ) = log p(y,x|λ) − log p(y|x,λ). Multiplying by p(y|x,θ), summing over y, and using ∑_y p(y|x,θ) = 1 and p(y,x|λ) = p(y|λ), we get log p(x|λ) = ∑_y p(y|x,θ) [log p(y|λ) − log p(y|x,λ)].

27 Proof (end). log p(x|λ) = ∑_y p(y|x,θ) log p(y|λ) + ∑_y p(y|x,θ) log[1/p(y|x,λ)]. The first term is Q_θ(λ). Writing the same identity for λ* and for θ and subtracting, the second terms combine into a relative entropy: log p(x|λ*) − log p(x|θ) = Q(λ*) − Q(θ) + D(p(y|x,θ) || p(y|x,λ*)) ≥ Q(λ*) − Q(θ) ≥ 0, since relative entropy is non-negative and λ* maximizes Q(λ). QED

28 Example: The ABO locus. A locus is a particular place on the chromosome. Each locus' state (called a genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine distinguished features; the ABO locus, for example, determines blood type. The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions of the 6 genotypes in a population. Suppose we randomly sampled N individuals and found that N_{a/a} have genotype a/a, N_{a/b} have genotype a/b, etc. Then the MLE of the genotype proportions is given by the relative frequencies, e.g. P(a/a) = N_{a/a}/N.

29 The ABO locus (cont.). However, testing individuals for their genotype is very expensive. Can we estimate the genotype proportions using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)? The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?

30 The ABO locus (cont.). We use the Hardy-Weinberg equilibrium rule, which tells us that in equilibrium the frequencies q_a, q_b, q_o of the three alleles in the population determine the frequencies of the genotypes as follows: q_{a/b} = 2q_a q_b, q_{a/o} = 2q_a q_o, q_{b/o} = 2q_b q_o, q_{a/a} = q_a², q_{b/b} = q_b², q_{o/o} = q_o². So now we have only three parameters to estimate. The Hardy-Weinberg rule follows from modeling this problem as data x with hidden data y: there are three possible alleles a, b and o, and the genotype (and hence the blood type A, B, AB or O) is determined by two successive samplings of alleles. For instance, blood type A corresponds to the samplings (a,a), (a,o) and (o,a).
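
A minimal Python sketch of this rule (my own illustration, with arbitrary allele frequencies): given q_a, q_b, q_o, it computes the six genotype frequencies and the four blood-type frequencies they induce.

```python
def genotype_and_bloodtype_freqs(q_a, q_b, q_o):
    """Hardy-Weinberg genotype frequencies and the induced blood-type frequencies."""
    geno = {
        "a/a": q_a ** 2, "b/b": q_b ** 2, "o/o": q_o ** 2,
        "a/b": 2 * q_a * q_b, "a/o": 2 * q_a * q_o, "b/o": 2 * q_b * q_o,
    }
    blood = {
        "A": geno["a/a"] + geno["a/o"],
        "B": geno["b/b"] + geno["b/o"],
        "AB": geno["a/b"],
        "O": geno["o/o"],
    }
    return geno, blood

# Example allele frequencies (arbitrary; they must sum to 1)
print(genotype_and_bloodtype_freqs(0.3, 0.2, 0.5))
```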

31 The Likelihood Function. We wish to determine the probabilities of the six genotypes x_{a/a}, x_{a/o}, x_{b/b}, x_{b/o}, x_{a/b}, x_{o/o}. These are defined by the parameters θ = {q_a, q_b, q_o}, e.g. P(X = x_{a/b}|θ) = P({(a,b), (b,a)}|θ) = 2q_a q_b. Similarly, P(X = x_{o/o}|θ) = q_o², and so on for the other four genotypes. So all we need is to find the parameters θ = {q_a, q_b, q_o}.

32 The Likelihood Function. We wish to estimate the parameters from a sample via MLE. This is naturally handled by EM, because the sampled data (the blood types) have hidden data (the genotypes). Assume the sampled data are {B,A,B,B,O,A,B,A,O,B,AB}. What is their probability for given parameters θ? Since the observations are independent, it is the product of the blood-type probabilities, P(A|θ)³ P(B|θ)⁵ P(AB|θ) P(O|θ)² (the blood-type probabilities in terms of θ are given on slides 36-37). Obtaining the maximum of this function yields the MLE. We use the EM algorithm to replace θ by λ which increases the likelihood.

33 ABO loci as a special case of HMM. Model the ABO sampling as an HMM with 6 states (genotypes) a/a, a/b, a/o, b/b, b/o, o/o, and 4 outputs (blood types) A, B, AB, O. Assume 3 transition types, a, b and o; a state (genotype) is determined by 2 successive transitions, and the probability of transition x is q_x. Emission occurs at every other state and is determined by the state, e.g. e_{a/o}(A) = 1, since a/o produces blood type A. [Figure: an example path, with transitions a, o leading to state a/o emitting A, and transitions a, b leading to state a/b emitting AB.]

34 A faster and simpler EM for ABO loci. The ABO problem can be solved via Baum-Welch EM training, but this is quite inefficient: for L samples it requires running the forward and backward algorithms on an HMM of length 2L, even though there are only 6 distinct genotypes. Direct application of the EM algorithm yields a simpler and more efficient procedure. Consider the input data {B,A,B,B,O,A,B,A,O,B,AB} as observations x₁,…,x₁₁. The hidden data of an observation are the genotypes which can produce it; e.g., for O it is (o,o), and for B it is (o,b), (b,o) and (b,b).

35 A faster EM for ABO loci. For each genotype y we have N_a(y), N_b(y) and N_o(y), the numbers of a, b and o alleles in y; e.g., N_a(o,b) = 0 and N_b(o,b) = N_o(o,b) = 1. For each observation of blood type x_j and for each allele z in {a,b,o}, we compute N_z^j, the expected number of times that z appears in x_j.

36 A faster EM for ABO loci. The computation for blood type B: P(B|θ) = P((b,b)|θ) + P((b,o)|θ) + P((o,b)|θ) = q_b² + 2q_b q_o. N_o^B and N_b^B, the expected numbers of occurrences of o and b in an observation of blood type B, are given by N_b^B = (2q_b² + 2q_b q_o)/(q_b² + 2q_b q_o) and N_o^B = 2q_b q_o/(q_b² + 2q_b q_o). Observe that N_b^B + N_o^B = 2.

37 A faster EM for ABO loci. Similarly, P(A|θ) = q_a² + 2q_a q_o, giving N_a^A = (2q_a² + 2q_a q_o)/(q_a² + 2q_a q_o) and N_o^A = 2q_a q_o/(q_a² + 2q_a q_o). P(AB|θ) = P((b,a)|θ) + P((a,b)|θ) = 2q_a q_b, so N_a^AB = N_b^AB = 1. P(O|θ) = P((o,o)|θ) = q_o², so N_o^O = 2. [N_b^O = N_a^O = N_o^AB = N_b^A = N_a^B = 0.]

38 E step: compute N_a, N_b and N_o. Let #(A) = 3, #(B) = 5, #(AB) = 1, #(O) = 2 be the numbers of observations of A, B, AB and O respectively. Then N_a = #(A)·N_a^A + #(AB)·1, N_b = #(B)·N_b^B + #(AB)·1, and N_o = #(A)·N_o^A + #(B)·N_o^B + #(O)·2. M step: compute the new values q_a = N_a/(N_a + N_b + N_o), q_b = N_b/(N_a + N_b + N_o), q_o = N_o/(N_a + N_b + N_o), where N_a + N_b + N_o = 2·11 = 22, twice the number of observations.
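
A compact Python sketch of this E/M iteration for the above counts (the function and variable names are mine, and the starting point is arbitrary); it implements the expected allele counts of slides 36-37 and the normalization of the M step, and iterating it converges towards the MLE of (q_a, q_b, q_o).

```python
def abo_em_step(q_a, q_b, q_o, nA=3, nB=5, nAB=1, nO=2):
    """One EM iteration for the ABO allele frequencies from blood-type counts."""
    # E step: expected allele counts per observation (slides 36-37)
    pA = q_a ** 2 + 2 * q_a * q_o               # P(A | theta)
    pB = q_b ** 2 + 2 * q_b * q_o               # P(B | theta)
    NaA = (2 * q_a ** 2 + 2 * q_a * q_o) / pA   # expected number of a's in an A observation
    NoA = (2 * q_a * q_o) / pA                  # expected number of o's in an A observation
    NbB = (2 * q_b ** 2 + 2 * q_b * q_o) / pB
    NoB = (2 * q_b * q_o) / pB

    N_a = nA * NaA + nAB * 1
    N_b = nB * NbB + nAB * 1
    N_o = nA * NoA + nB * NoB + nO * 2

    # M step: normalize (N_a + N_b + N_o = 2 * number of observations)
    total = N_a + N_b + N_o
    return N_a / total, N_b / total, N_o / total

q = (1 / 3, 1 / 3, 1 / 3)   # arbitrary starting point
for _ in range(10):
    q = abo_em_step(*q)
print(q)                     # approaches the MLE allele frequencies
```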

39 EM in Practice.
Initial parameters:
- random parameter setting
- "best" guess from another source
Stopping criteria:
- small change in the likelihood of the data
- small change in the parameter values
Avoiding bad local maxima:
- multiple restarts
- early "pruning" of unpromising runs

40 MLE from Incomplete Data.
Finding the MLE parameters, i.e. maximizing log P(x|θ), is a nonlinear optimization problem.
Expectation Maximization (EM): use the "current point" to construct an alternative function, E_{θ'}[log P(x,y|θ)], which is "nice" to maximize.
Guarantee: the maximum of the new function has a higher likelihood than the current point.
[Figure: the log-likelihood log P(x|θ) as a function of θ.]

41 HMM model structure: 1. Duration Modeling. Markov chains are rather limited in describing sequences of symbols with non-random structure. For instance, a Markov chain forces the distribution of the lengths of segments in which some state is repeated to be geometric: the probability of staying exactly k times is (1−p)p^{k−1}, for some p. Several constructions enable modeling of other duration distributions. One is assigning more than one state to represent the same "real" state. [Figure: a chain of four states A₁ → A₂ → A₃ → A₄, all representing the same "real" state A.]
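
A small simulation sketch of this point (my own illustration, not from the lecture), assuming each copy of the state has a self-transition probability p: with a single copy the segment lengths are geometric, while chaining four copies shifts the length distribution away from 1 and multiplies its mean.

```python
import random

def segment_length(p, n_copies=1):
    """Length of a visit to a 'real' state modeled by n_copies chained states,
    each with self-transition probability p."""
    length = 0
    for _ in range(n_copies):
        # Time spent in one copy is geometric: P(k) = (1-p) * p**(k-1)
        while True:
            length += 1
            if random.random() > p:
                break
    return length

random.seed(0)
for n in (1, 4):
    lengths = [segment_length(0.8, n) for _ in range(10000)]
    print(n, "copies: mean segment length", sum(lengths) / len(lengths))
```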

42 HMM model structure: 2. Silent states.
- States which do not emit symbols (as we saw in the ABO locus).
- Can be used to model duration distributions.
- Also used to allow arbitrary jumps (needed for using HMMs for pairwise alignment).
- The Forward and Backward algorithms need to be adjusted to account for the silent states.
[Figure: silent states vs. regular states.]

43 HMM model structure: 3. High-Order Markov Chains. Markov chains in which the transition probability depends on the last n states: P(x_i|x_{i−1},...,x_1) = P(x_i|x_{i−1},...,x_{i−n}). Such a chain can be represented by a standard (first-order) Markov chain with more states. [Figure: a second-order chain over {A,B} represented by the four pair-states AA, AB, BA, BB.]
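
A minimal Python sketch of that encoding (the transition numbers are arbitrary placeholders): a second-order chain over {A, B} is rewritten as a first-order chain whose states are the pairs AA, AB, BA, BB, with the transition (u,v) → (v,w) having probability P(w | u,v).

```python
# Second-order transition probabilities P(x_i | x_{i-2}, x_{i-1})
second_order = {
    ("A", "A"): {"A": 0.7, "B": 0.3},
    ("A", "B"): {"A": 0.4, "B": 0.6},
    ("B", "A"): {"A": 0.5, "B": 0.5},
    ("B", "B"): {"A": 0.2, "B": 0.8},
}

# Equivalent first-order chain over pair-states: (u, v) -> (v, w) with prob P(w | u, v)
first_order = {
    (u, v): {(v, w): p for w, p in nxt.items()}
    for (u, v), nxt in second_order.items()
}

for state, nxt in first_order.items():
    print(state, "->", nxt)
```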

44 HMM model structure: 4. Inhomogeneous Markov Chains.
- An important task in analyzing DNA sequences is recognizing the genes which code for proteins.
- A triplet of 3 nucleotides, called a codon, codes for an amino acid (see next slide).
- It is known that in parts of DNA which code for genes, the three codon positions have different statistics.
- Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T) but also its position within the codon: the same nucleotide in different positions will have different transition probabilities.
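
A minimal sketch of such an inhomogeneous chain (my own illustration; the transition matrices are arbitrary placeholders): three transition matrices, one per codon position, are cycled through while sampling a coding sequence.

```python
import random

NUCS = "ACGT"

def biased_matrix(favored):
    """Helper: a transition matrix over ACGT with a mild, position-specific bias."""
    return {a: {b: (0.4 if b == favored else 0.2) for b in NUCS} for a in NUCS}

# One transition matrix per codon position (numbers are arbitrary placeholders)
position_matrices = [biased_matrix("A"), biased_matrix("C"), biased_matrix("G")]

def sample_coding_sequence(length, start="A", seed=0):
    """Sample from an inhomogeneous chain that cycles through the 3 codon positions."""
    random.seed(seed)
    seq, cur = [start], start
    for i in range(length - 1):
        probs = position_matrices[i % 3][cur]
        cur = random.choices(list(probs), weights=list(probs.values()))[0]
        seq.append(cur)
    return "".join(seq)

print(sample_coding_sequence(12))
```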

45 Genetic Code. There are 20 amino acids from which proteins are built.