What if a new genome comes? We just sequenced the porcupine genome We know CpG islands play the same role in this genome However, we have no known CpG.

What if a new genome comes?
We just sequenced the porcupine genome.
We know CpG islands play the same role in this genome.
However, we have no known CpG islands for porcupines.
We suspect the frequency and characteristics of CpG islands are quite different in porcupines.
How do we adjust the parameters in our model? LEARNING

Problem 3: Learning
Re-estimate the parameters of the model based on training data.

Two learning scenarios
1. Estimation when the “right answer” is known
Examples:
GIVEN: a genomic region $x = x_1 \ldots x_{1,000,000}$ where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
2. Estimation when the “right answer” is unknown
Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: Update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$

1. When the right answer is known
Given $x = x_1 \ldots x_N$ for which the true path $\pi = \pi_1 \ldots \pi_N$ is known, define:
$A_{kl}$ = # times the $k \to l$ transition occurs in $\pi$
$E_k(b)$ = # times state $k$ in $\pi$ emits $b$ in $x$
We can show that the maximum likelihood parameters $\theta$ (those that maximize $P(x \mid \theta)$) are:
$$a_{kl} = \frac{A_{kl}}{\sum_i A_{ki}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_c E_k(c)}$$

1. When the right answer is known
Intuition: when we know the underlying states, the best estimate is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: $P(x \mid \theta)$ is maximized, but $\theta$ is unreasonable. 0 probabilities: VERY BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then:
$a_{FF} = 1$; $a_{FL} = 0$
$e_F(1) = e_F(3) = e_F(6) = 0.2$; $e_F(2) = 0.3$; $e_F(4) = 0$; $e_F(5) = 0.1$
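The counting estimate for the ten-roll example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `ml_estimate` and the choice to report `None` for a state that never occurs are illustrative assumptions.

```python
from collections import Counter

def ml_estimate(symbols, path, states="FL", alphabet=range(1, 7)):
    """Maximum-likelihood a_kl and e_k(b) from a fully labelled sequence."""
    A = Counter(zip(path, path[1:]))   # A_kl: observed k -> l transitions
    E = Counter(zip(path, symbols))    # E_k(b): observed emissions of b from k
    a, e = {}, {}
    for k in states:
        out = sum(A[k, l] for l in states)
        seen = sum(E[k, b] for b in alphabet)
        for l in states:
            # None marks "undefined": state k never appears in the path (assumption)
            a[k, l] = A[k, l] / out if out else None
        for b in alphabet:
            e[k, b] = E[k, b] / seen if seen else None
    return a, e

rolls = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
path = ["F"] * 10
a, e = ml_estimate(rolls, path)
print(a["F", "F"], a["F", "L"])
print([e["F", b] for b in range(1, 7)])
```

The zero for `e["F", 4]` and the undefined row for the never-visited state L are exactly the small-sample problems that pseudocounts are meant to address.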

Pseudocounts
Solution for small training sets: add pseudocounts.
$A_{kl}$ = # times the $k \to l$ transition occurs in $\pi$ + $r_{kl}$
$E_k(b)$ = # times state $k$ in $\pi$ emits $b$ in $x$ + $r_k(b)$
$r_{kl}$, $r_k(b)$ are pseudocounts representing our prior belief.
Larger pseudocounts ⇒ strong prior belief.
Small pseudocounts ($\varepsilon < 1$): just to avoid 0 probabilities.

Pseudocounts
Example: dishonest casino
We will observe the player for one day, 600 rolls.
Reasonable pseudocounts:
$r_{0F} = r_{0L} = r_{F0} = r_{L0} = 1$;
$r_{FL} = r_{LF} = r_{FF} = r_{LL} = 1$;
$r_F(1) = r_F(2) = \ldots = r_F(6) = 20$ (strong belief fair is fair)
$r_L(1) = r_L(2) = \ldots = r_L(6) = 5$ (wait and see for loaded)
The above numbers are pretty arbitrary: assigning priors is an art.
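To see the effect of pseudocounts on emission estimates, here is a small sketch. It reuses the ten-roll, all-fair sequence from the earlier overfitting example together with the emission pseudocounts 20 (fair) and 5 (loaded) quoted above; combining those two is an assumption made purely for illustration, and `estimate_emissions` is a hypothetical helper name.

```python
from collections import Counter

def estimate_emissions(rolls, path, pseudo, faces=range(1, 7)):
    """e_k(b) = (count + r_k(b)) / sum_c (count + r_k(c)), same r for every face."""
    counts = Counter(zip(path, rolls))
    e = {}
    for k, r in pseudo.items():
        total = sum(counts[k, b] + r for b in faces)
        for b in faces:
            e[k, b] = (counts[k, b] + r) / total
    return e

rolls = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
path = ["F"] * 10
e = estimate_emissions(rolls, path, {"F": 20, "L": 5})
print(round(e["F", 4], 4))   # no longer zero although face 4 was never rolled
print(round(e["L", 1], 4))   # state L was never observed: estimate equals the prior
```

With no observations for the loaded die, its estimate falls back to the uniform prior implied by equal pseudocounts, which is the "wait and see" behaviour described above.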

2. When the right answer is unknown
We don't know the true $A_{kl}$, $E_k(b)$.
Idea:
We estimate our “best guess” of what $A_{kl}$, $E_k(b)$ are.
We update the parameters of the model, based on our guess.
We repeat.

2. When the right answer is unknown
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that $p(x^1, \ldots, x^n \mid \theta') > p(x^1, \ldots, x^n \mid \theta)$.
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMMs, it is Baum-Welch training.

We don't know the true $A_{kl}$, $E_k(b)$.
Starting with our best guess of a model M with parameters $\theta$:
Given $x = x_1 \ldots x_N$ for which the true $\pi = \pi_1 \ldots \pi_N$ is unknown, we can get to a provably more likely parameter set $\theta = (a_{kl}, e_k(b))$.
Principle: EXPECTATION MAXIMIZATION
1. E-STEP: Estimate $A_{kl}$, $E_k(b)$ in the training data
2. M-STEP: Update $\theta = (a_{kl}, e_k(b))$ according to $A_{kl}$, $E_k(b)$
3. Repeat 1 & 2, until convergence

Baum-Welch training
We start with some values of $a_{kl}$ and $e_k(b)$, which define prior values of θ.
Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ* such that $p(x \mid \theta^*) > p(x \mid \theta)$.
Each iteration consists of a few steps.
[Figure: HMM chain of hidden states $s_1, s_2, \ldots, s_L$ emitting $X_1, X_2, \ldots, X_L$]

Baum-Welch training
In case 1 we computed the optimal values of $a_{kl}$ and $e_k(b)$ (for the optimal θ) by simply counting the number $A_{kl}$ of transitions from state k to state l, and the number $E_k(b)$ of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.
[Figure: edge from $s_{i-1} = k$ to $s_i = l$, with emissions $x_{i-1} = b$ and $x_i = c$]

Baum-Welch training
When the states are unknown, the counting process is replaced by an averaging process:
For each edge $s_{i-1} \to s_i$ we compute the average number of “k to l” transitions, for all possible pairs (k, l), over this edge. Then, for each k and l, we take $A_{kl}$ to be the sum over all edges.
[Figure: edge from $s_{i-1} = ?$ to $s_i = ?$, with emissions $x_{i-1} = b$ and $x_i = c$]

Baum-Welch training
Similarly, for each emission edge $s_i \to x_i$ with $x_i = b$ and each state k, we compute the average number of times that $s_i = k$, which is the expected number of “k → b” emissions on this edge. Then we take $E_k(b)$ to be the sum over all such edges.
These expected values are computed as follows.
[Figure: edge from $s_{i-1} = ?$ to $s_i = ?$, with emissions $x_{i-1} = b$ and $x_i = c$]

Baum-Welch: step 1a
Count the expected number of state transitions.
For each i and for each k, l, compute the posterior state transition probabilities:
$P(s_{i-1} = k, s_i = l \mid x, \theta)$
For this, we use the forward and backward algorithms.
[Figure: HMM chain $s_1, \ldots, s_{i-1}, s_i, \ldots, s_L$ emitting $X_1, \ldots, X_{i-1}, X_i, \ldots, X_L$]

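Estimating new parameters
So,
$$A_{kl} = \sum_i P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \sum_i \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x \mid \theta)}$$
Similarly,
$$E_k(b) = \frac{1}{P(x \mid \theta)} \sum_{\{i \mid x_i = b\}} f_k(i)\, b_k(i)$$
[Figure: states k and l at positions i and i+1; $f_k(i)$ covers $x_1 \ldots x_i$, the edge contributes $a_{kl}\, e_l(x_{i+1})$, and $b_l(i+1)$ covers $x_{i+2} \ldots x_N$]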

Reminder: finding posterior state probabilities
$f_k(i) = p(x_1, \ldots, x_i, s_i = k)$, the probability that in a path which emits $(x_1, \ldots, x_i)$, state $s_i = k$.
$b_k(i) = p(x_{i+1}, \ldots, x_L \mid s_i = k)$, the probability that a path emits $(x_{i+1}, \ldots, x_L)$, given that state $s_i = k$.
$p(s_i = k, x) = f_k(i)\, b_k(i)$ (since, given $s_i = k$, the emissions up to position i and those after it are independent).
$\{f_k(i)\, b_k(i)\}$ for every i, k are computed by one run of the forward and backward algorithms.
[Figure: HMM chain of hidden states $s_1, s_2, \ldots, s_L$ emitting $X_1, X_2, \ldots, X_L$]

Baum-Welch: Step 1a (cont)
Claim:
$$P(s_{i-1} = k, s_i = l \mid x, \theta) = \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{P(x \mid \theta)}$$
($a_{kl}$ and $e_l(x_i)$ are the parameters defined by θ, and $f_k(i-1)$, $b_l(i)$ are the forward and backward functions)

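Step 1a: Computing $P(s_{i-1} = k, s_i = l \mid x, \theta)$
$P(x_1, \ldots, x_L, s_{i-1} = k, s_i = l \mid \theta)$
$= P(x_1, \ldots, x_{i-1}, s_{i-1} = k \mid \theta)\; a_{kl}\, e_l(x_i)\; P(x_{i+1}, \ldots, x_L \mid s_i = l, \theta)$
$= f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)$
The first factor is obtained via the forward algorithm, the last via the backward algorithm.
Dividing by $P(x \mid \theta)$ gives the posterior:
$$p(s_{i-1} = k, s_i = l \mid x, \theta) = \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{P(x \mid \theta)}$$
[Figure: HMM chain $s_1, \ldots, s_{i-1}, s_i, \ldots, s_L$ emitting $X_1, \ldots, X_{i-1}, X_i, \ldots, X_L$]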

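Step 1a (end)
For each pair (k, l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges:
$$A_{kl} = \sum_{i=1}^{L} P(s_{i-1} = k, s_i = l \mid x, \theta) = \frac{1}{P(x \mid \theta)} \sum_{i=1}^{L} f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)$$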

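Step 1a for many sequences
When we have n input sequences $(x^1, \ldots, x^n)$, then $A_{kl}$ is given by:
$$A_{kl} = \sum_{j=1}^{n} \frac{1}{P(x^j \mid \theta)} \sum_{i} f_k^j(i-1)\, a_{kl}\, e_l(x_i^j)\, b_l^j(i)$$
where $f^j$ and $b^j$ are the forward and backward functions computed for sequence $x^j$.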

Baum-Welch: Step 1b
Count the expected number of symbol emissions: for each state k and each symbol b, for each i where $X_i = b$, compute the expected number of times that $S_i = k$.
[Figure: HMM chain of hidden states $s_1, \ldots, s_i, \ldots, s_L$ emitting $X_1, \ldots, X_i = b, \ldots, X_L$]

Baum-Welch: Step 1b
For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that $s_i = k$, over all i's for which $x_i = b$:
$$E_k(b) = \frac{1}{P(x \mid \theta)} \sum_{\{i \mid x_i = b\}} f_k(i)\, b_k(i)$$

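Step 1b for many sequences
When we have n sequences $(x^1, \ldots, x^n)$, the expected number of emissions of b from k is given by:
$$E_k(b) = \sum_{j=1}^{n} \frac{1}{P(x^j \mid \theta)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$$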

Summary of Steps 1a and 1b: the E part of the Baum-Welch training
These steps compute the expected numbers $A_{kl}$ of k → l transitions for all pairs of states k and l, and the expected numbers $E_k(b)$ of emissions of symbol b from state k, for all states k and symbols b.
The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

Baum-Welch: step 2
Use the $A_{kl}$'s and $E_k(b)$'s to compute the new values of $a_{kl}$ and $e_k(b)$. These values define θ*.
The correctness of the EM algorithm implies that:
$p(x^1, \ldots, x^n \mid \theta^*) \ge p(x^1, \ldots, x^n \mid \theta)$
i.e., θ* does not decrease the probability of the data.
This procedure is iterated until some convergence criterion is met.

The Baum-Welch Algorithm
Initialization: pick the best guess for the model parameters (or arbitrary values)
Iteration:
1. Forward
2. Backward
3. Calculate $A_{kl}$, $E_k(b)$
4. Calculate new model parameters $a_{kl}$, $e_k(b)$
5. Calculate new log-likelihood $\log P(x \mid \theta)$
GUARANTEED NOT TO BE LOWER, BY EXPECTATION-MAXIMIZATION
Until $P(x \mid \theta)$ does not change much
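The iteration listed above can be sketched in plain Python for a single observed sequence. This is an illustrative sketch under several assumptions that are not in the slides: the initial state distribution `init` is held fixed, no pseudocounts are added, probabilities are not rescaled (so it is only suitable for short sequences), and the toss string `x` and starting parameters are made-up example values.

```python
import math

def forward(x, init, a, e):
    f = [{k: init[k] * e[k][x[0]] for k in init}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({l: e[l][sym] * sum(prev[k] * a[k][l] for k in init) for l in init})
    return f

def backward(x, a, e):
    b = [{k: 1.0 for k in a}]
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in a) for k in a})
    return b

def baum_welch_step(x, init, a, e):
    """One E-step plus M-step; returns new a, new e and log P(x | old parameters)."""
    states, symbols = list(a), list(next(iter(e.values())))
    f, b = forward(x, init, a, e), backward(x, a, e)
    px = sum(f[-1].values())
    # E-step: expected transition counts A_kl and emission counts E_k(b)
    A = {k: {l: sum(f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
                    for i in range(len(x) - 1)) / px
             for l in states} for k in states}
    E = {k: {s: sum(f[i][k] * b[i][k] for i in range(len(x)) if x[i] == s) / px
             for s in symbols} for k in states}
    # M-step: normalise the expected counts, as in the known-states case
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {s: E[k][s] / sum(E[k].values()) for s in symbols} for k in states}
    return new_a, new_e, math.log(px)

init = {"F": 0.5, "L": 0.5}
a = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
e = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}
x = "HHTHHHHTHHTTHTHHHHHH"   # hypothetical toss sequence

logliks = []
for _ in range(20):
    a, e, ll = baum_welch_step(x, init, a, e)
    logliks.append(ll)
print(logliks[0], logliks[-1])
```

Each recorded value is the log-likelihood under the parameters that entered the step, so the list should be non-decreasing from one iteration to the next. A practical implementation would add scaling or log-space arithmetic and a convergence threshold instead of a fixed iteration count.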

Viterbi training: maximizing the probability of the most probable path
States are unknown. Viterbi training attempts to maximize the probability of a most probable path, i.e. the value of
$p(s(x^1), \ldots, s(x^n), x^1, \ldots, x^n \mid \theta)$
where $s(x^j)$ is the most probable (under θ) path for $x^j$.
We assume only one sequence (n = 1).
[Figure: HMM chain of hidden states $s_1, s_2, \ldots, s_L$ emitting $X_1, X_2, \ldots, X_L$]

Viterbi training (cont)
Start from given values of $a_{kl}$ and $e_k(b)$, which define prior values of θ.
Each iteration:
Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes $p(s(x), x \mid \theta)$.
[Figure: HMM chain of hidden states $s_1, s_2, \ldots, s_L$ emitting $X_1, X_2, \ldots, X_L$]

Viterbi training (cont)
Step 2: Use the ML method for an HMM with known states (treating s(x) as the known path) to find θ* which maximizes $p(s(x), x \mid \theta^*)$.
Note: In Step 1 the maximizing argument is the path s(x); in Step 2 it is the parameters θ*.
[Figure: HMM chain of hidden states $s_1, s_2, \ldots, s_L$ emitting $X_1, X_2, \ldots, X_L$]

Viterbi training (cont)
Step 3: Set θ = θ*, and repeat. Stop when the paths no longer change.
Claim 2: If s(x) is the optimal path in step 1 of two different iterations, then in both iterations θ has the same values, and hence $p(s(x), x \mid \theta)$ will not increase in any later iteration. Hence the algorithm can terminate in this case.
[Figure: HMM chain of hidden states $s_1, s_2, \ldots, s_L$ emitting $X_1, X_2, \ldots, X_L$]
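The three steps of Viterbi training might be sketched as follows. This is not the slides' implementation: the pseudocount of 1 in `reestimate` is an added assumption (it keeps every probability positive so the logarithms are defined, at the cost of making step 2 a smoothed rather than strictly maximum-likelihood estimate), the initial distribution is kept fixed, and the toss string and starting parameters are made-up values.

```python
import math

def viterbi(x, init, a, e):
    """Most probable state path for x under (init, a, e), computed in log space."""
    states = list(a)
    v = [{k: math.log(init[k] * e[k][x[0]]) for k in states}]
    pointers = []
    for sym in x[1:]:
        row, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[-1][k] + math.log(a[k][l]))
            back[l] = best
            row[l] = v[-1][best] + math.log(a[best][l]) + math.log(e[l][sym])
        v.append(row)
        pointers.append(back)
    path = [max(states, key=lambda k: v[-1][k])]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    return path[::-1]

def reestimate(x, path, states, symbols, pseudo=1.0):
    """Step 2: count transitions and emissions along the path (plus pseudocounts)."""
    A = {k: {l: pseudo for l in states} for k in states}
    E = {k: {s: pseudo for s in symbols} for k in states}
    for k, l in zip(path, path[1:]):
        A[k][l] += 1
    for k, s in zip(path, x):
        E[k][s] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {s: E[k][s] / sum(E[k].values()) for s in symbols} for k in states}
    return a, e

def viterbi_training(x, init, a, e, max_iter=50):
    states, symbols = list(a), list(next(iter(e.values())))
    path = None
    for _ in range(max_iter):
        new_path = viterbi(x, init, a, e)           # step 1
        if new_path == path:                        # stop when the path is unchanged
            break
        path = new_path
        a, e = reestimate(x, path, states, symbols)  # steps 2 and 3
    return a, e, path

init = {"F": 0.5, "L": 0.5}
a0 = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
e0 = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}
x = "HHTHHHHTHHTTHTHHHHHH"   # hypothetical toss sequence
a, e, path = viterbi_training(x, init, a0, e0)
print("".join(path))
```

The `max_iter` cap is a safety net for the smoothed variant; with strict maximum-likelihood re-estimation, the stopping rule in Claim 2 is what guarantees termination.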

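Coin-Tossing Example
[Figure: two-state HMM over L tosses. Hidden states $H_1, \ldots, H_L$ are Fair/Loaded; observations $X_1, \ldots, X_L$ are Head/Tail. Start: each coin with probability 1/2. Each coin is kept with probability 0.9 and switched with probability 0.1. Fair coin: head 1/2, tail 1/2. Loaded coin: head 3/4, tail 1/4.]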

Example: Homogeneous HMM, one sample
Start with some probability tables.
Iterate until convergence:
E-step: Compute $p_\lambda(h_i \mid h_{i-1}, x_1, \ldots, x_L)$ from $p_\lambda(h_i, h_{i-1} \mid x_1, \ldots, x_L)$, which is computed using the forward-backward algorithm as explained earlier.
M-step: Update the parameters simultaneously:
$$\lambda \leftarrow \Big(\sum_i p_\lambda(h_i = 1 \mid h_{i-1} = 1, x_1, \ldots, x_L) + p_\lambda(h_i = 0 \mid h_{i-1} = 0, x_1, \ldots, x_L)\Big) / (L-1)$$
[Figure: HMM chain of hidden states $H_1, H_2, \ldots, H_L$ emitting $X_1, X_2, \ldots, X_L$]

Coin-Tossing Example
Numeric example: 3 tosses
Outcomes: head, head, tail
[Figure: HMM chain of hidden states $H_1, H_2, \ldots, H_L$ emitting $X_1, X_2, \ldots, X_L$]

Coin-Tossing Example
Numeric example: 3 tosses
Outcomes: head, head, tail
Recall: $F(h_i) = P(x_1, \ldots, x_i, h_i) = \sum_{h_{i-1}} P(x_1, \ldots, x_{i-1}, h_{i-1})\, P(h_i \mid h_{i-1})\, P(x_i \mid h_i)$
{step 1 - forward}
First coin is loaded: $P(x_1 = \text{head}, h_1 = \text{loaded}) = P(\text{loaded}_1)\, P(\text{head} \mid \text{loaded}_1) = 0.5 \times 0.75 = 0.375$
First coin is fair: $P(x_1 = \text{head}, h_1 = \text{fair}) = P(\text{fair}_1)\, P(\text{head} \mid \text{fair}_1) = 0.5 \times 0.5 = 0.25$

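Coin-Tossing Example - forward
Numeric example: 3 tosses
Outcomes: head, head, tail
$P(x_1, \ldots, x_i, h_i) = \sum_{h_{i-1}} P(x_1, \ldots, x_{i-1}, h_{i-1})\, P(h_i \mid h_{i-1})\, P(x_i \mid h_i)$
{step 1}
$P(x_1 = \text{head}, h_1 = \text{loaded}) = P(\text{loaded}_1)\, P(\text{head} \mid \text{loaded}_1) = 0.5 \times 0.75 = 0.375$
$P(x_1 = \text{head}, h_1 = \text{fair}) = P(\text{fair}_1)\, P(\text{head} \mid \text{fair}_1) = 0.5 \times 0.5 = 0.25$
{step 2}
$P(x_1 = \text{head}, x_2 = \text{head}, h_2 = \text{loaded}) = \sum_{h_1} P(x_1, h_1)\, P(h_2 \mid h_1)\, P(x_2 \mid h_2)$
$= p(x_1 = \text{head}, \text{loaded}_1)\, P(\text{loaded}_2 \mid \text{loaded}_1)\, P(x_2 = \text{head} \mid \text{loaded}_2) + p(x_1 = \text{head}, \text{fair}_1)\, P(\text{loaded}_2 \mid \text{fair}_1)\, P(x_2 = \text{head} \mid \text{loaded}_2)$
$= 0.375 \times 0.9 \times 0.75 + 0.25 \times 0.1 \times 0.75 = 0.253125 + 0.01875 = 0.271875$
$P(x_1 = \text{head}, x_2 = \text{head}, h_2 = \text{fair})$
$= p(x_1 = \text{head}, \text{loaded}_1)\, P(\text{fair}_2 \mid \text{loaded}_1)\, P(x_2 = \text{head} \mid \text{fair}_2) + p(x_1 = \text{head}, \text{fair}_1)\, P(\text{fair}_2 \mid \text{fair}_1)\, P(x_2 = \text{head} \mid \text{fair}_2)$
$= 0.375 \times 0.1 \times 0.5 + 0.25 \times 0.9 \times 0.5 = 0.01875 + 0.1125 = 0.13125$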

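Coin-Tossing Example - forward
Numeric example: 3 tosses
Outcomes: head, head, tail
$P(x_1, \ldots, x_i, h_i) = \sum_{h_{i-1}} P(x_1, \ldots, x_{i-1}, h_{i-1})\, P(h_i \mid h_{i-1})\, P(x_i \mid h_i)$
{step 2}
$P(x_1 = \text{head}, x_2 = \text{head}, h_2 = \text{loaded}) = 0.271875$
$P(x_1 = \text{head}, x_2 = \text{head}, h_2 = \text{fair}) = 0.13125$
{step 3}
$P(x_1 = \text{head}, x_2 = \text{head}, x_3 = \text{tail}, h_3 = \text{loaded}) = \sum_{h_2} P(x_1, x_2, h_2)\, P(h_3 \mid h_2)\, P(x_3 \mid h_3)$
$= p(x_1 = \text{head}, x_2 = \text{head}, \text{loaded}_2)\, P(\text{loaded}_3 \mid \text{loaded}_2)\, P(x_3 = \text{tail} \mid \text{loaded}_3) + p(x_1 = \text{head}, x_2 = \text{head}, \text{fair}_2)\, P(\text{loaded}_3 \mid \text{fair}_2)\, P(x_3 = \text{tail} \mid \text{loaded}_3)$
$= 0.271875 \times 0.9 \times 0.25 + 0.13125 \times 0.1 \times 0.25 = 0.06445$
$P(x_1 = \text{head}, x_2 = \text{head}, x_3 = \text{tail}, h_3 = \text{fair})$
$= p(x_1 = \text{head}, x_2 = \text{head}, \text{loaded}_2)\, P(\text{fair}_3 \mid \text{loaded}_2)\, P(x_3 = \text{tail} \mid \text{fair}_3) + p(x_1 = \text{head}, x_2 = \text{head}, \text{fair}_2)\, P(\text{fair}_3 \mid \text{fair}_2)\, P(x_3 = \text{tail} \mid \text{fair}_3)$
$= 0.271875 \times 0.1 \times 0.5 + 0.13125 \times 0.9 \times 0.5 = 0.07265$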
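The forward recursion for the three tosses is easy to reproduce numerically. The sketch below hard-codes the coin model from the example (start 1/2 each, stay 0.9, switch 0.1, loaded head 3/4); the dictionary layout and the function name `forward` are illustrative choices.

```python
init = {"loaded": 0.5, "fair": 0.5}
trans = {"loaded": {"loaded": 0.9, "fair": 0.1},
         "fair": {"loaded": 0.1, "fair": 0.9}}
emit = {"loaded": {"head": 0.75, "tail": 0.25},
        "fair": {"head": 0.5, "tail": 0.5}}

def forward(obs):
    """f[i][h] = P(x_1..x_{i+1}, h_{i+1} = h), with a 0-based list index i."""
    f = [{h: init[h] * emit[h][obs[0]] for h in init}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({h: emit[h][x] * sum(prev[g] * trans[g][h] for g in init)
                  for h in init})
    return f

f = forward(["head", "head", "tail"])
for step, row in enumerate(f, start=1):
    print(step, {h: round(v, 6) for h, v in row.items()})
print("P(x) =", round(sum(f[-1].values()), 6))
```

Summing the last forward column gives the probability of the whole observation sequence, about 0.1371, which is the normaliser needed later for the posteriors.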

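Coin-Tossing Example - backward
Numeric example: 3 tosses
Outcomes: head, head, tail
$b(h_i) = P(x_{i+1}, \ldots, x_L \mid h_i) = \sum_{h_{i+1}} P(h_{i+1} \mid h_i)\, P(x_{i+1} \mid h_{i+1})\, b(h_{i+1})$
{step 1}
$P(x_3 = \text{tail} \mid h_2 = \text{loaded}) = P(h_3 = \text{loaded} \mid h_2 = \text{loaded})\, P(x_3 = \text{tail} \mid h_3 = \text{loaded}) + P(h_3 = \text{fair} \mid h_2 = \text{loaded})\, P(x_3 = \text{tail} \mid h_3 = \text{fair}) = 0.9 \times 0.25 + 0.1 \times 0.5 = 0.275$
$P(x_3 = \text{tail} \mid h_2 = \text{fair}) = P(h_3 = \text{loaded} \mid h_2 = \text{fair})\, P(x_3 = \text{tail} \mid h_3 = \text{loaded}) + P(h_3 = \text{fair} \mid h_2 = \text{fair})\, P(x_3 = \text{tail} \mid h_3 = \text{fair}) = 0.1 \times 0.25 + 0.9 \times 0.5 = 0.475$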

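Coin-Tossing Example - backward
Numeric example: 3 tosses
Outcomes: head, head, tail
$b(h_i) = P(x_{i+1}, \ldots, x_L \mid h_i) = \sum_{h_{i+1}} P(h_{i+1} \mid h_i)\, P(x_{i+1} \mid h_{i+1})\, b(h_{i+1})$
{step 1}
$P(x_3 = \text{tail} \mid h_2 = \text{loaded}) = 0.275$
$P(x_3 = \text{tail} \mid h_2 = \text{fair}) = 0.475$
{step 2}
$P(x_2 = \text{head}, x_3 = \text{tail} \mid h_1 = \text{loaded}) = P(\text{loaded}_2 \mid \text{loaded}_1)\, P(\text{head} \mid \text{loaded}) \times 0.275 + P(\text{fair}_2 \mid \text{loaded}_1)\, P(\text{head} \mid \text{fair}) \times 0.475 = 0.9 \times 0.75 \times 0.275 + 0.1 \times 0.5 \times 0.475 = 0.209$
$P(x_2 = \text{head}, x_3 = \text{tail} \mid h_1 = \text{fair}) = P(\text{loaded}_2 \mid \text{fair}_1)\, P(\text{head} \mid \text{loaded}) \times 0.275 + P(\text{fair}_2 \mid \text{fair}_1)\, P(\text{head} \mid \text{fair}) \times 0.475 = 0.1 \times 0.75 \times 0.275 + 0.9 \times 0.5 \times 0.475 = 0.234$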
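The backward values can be reproduced the same way. As before, this is a sketch with the example's coin parameters hard-coded and with illustrative names; the final value of every state is taken to be 1, which is the usual convention when there is no explicit end state.

```python
trans = {"loaded": {"loaded": 0.9, "fair": 0.1},
         "fair": {"loaded": 0.1, "fair": 0.9}}
emit = {"loaded": {"head": 0.75, "tail": 0.25},
        "fair": {"head": 0.5, "tail": 0.5}}

def backward(obs):
    """b[i][h] = P(x_{i+2}..x_L | h_{i+1} = h), with a 0-based list index i."""
    b = [{h: 1.0 for h in trans}]          # b(h_L) = 1 for every state
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {h: sum(trans[h][g] * emit[g][x] * nxt[g] for g in trans)
                     for h in trans})
    return b

b = backward(["head", "head", "tail"])
for step, row in enumerate(b, start=1):
    print(step, {h: round(v, 6) for h, v in row.items()})
```

A useful sanity check is that combining the first backward column with the step-1 forward values, 0.375 × b(h1 = loaded) + 0.25 × b(h1 = fair), again gives P(x) ≈ 0.1371.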

Coin-Tossing Example
Outcomes: head, head, tail
$p(x_1, \ldots, x_L, h_i, h_{i+1}) = f(h_i)\, p(h_{i+1} \mid h_i)\, p(x_{i+1} \mid h_{i+1})\, b(h_{i+1})$
$f(h_1 = \text{loaded}) = 0.375$, $f(h_1 = \text{fair}) = 0.25$
$b(h_2 = \text{loaded}) = 0.275$, $b(h_2 = \text{fair}) = 0.475$
Recall {step 1}:
$P(x_1 = \text{head}, h_1 = \text{loaded}) = P(\text{loaded}_1)\, P(\text{head} \mid \text{loaded}_1) = 0.5 \times 0.75 = 0.375$
$P(x_1 = \text{head}, h_1 = \text{fair}) = P(\text{fair}_1)\, P(\text{head} \mid \text{fair}_1) = 0.5 \times 0.5 = 0.25$

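Coin-Tossing Example
Outcomes: head, head, tail
$f(h_1 = \text{loaded}) = 0.375$, $f(h_1 = \text{fair}) = 0.25$
$b(h_2 = \text{loaded}) = 0.275$, $b(h_2 = \text{fair}) = 0.475$
$p(x_1, \ldots, x_L, h_1, h_2) = f(h_1)\, p(h_2 \mid h_1)\, p(x_2 \mid h_2)\, b(h_2)$
$p(x_1, \ldots, x_L, h_1 = \text{loaded}, h_2 = \text{loaded}) = 0.375 \times 0.9 \times 0.75 \times 0.275 = 0.0696$
$p(x_1, \ldots, x_L, h_1 = \text{loaded}, h_2 = \text{fair}) = 0.375 \times 0.1 \times 0.5 \times 0.475 = 0.0089$
$p(x_1, \ldots, x_L, h_1 = \text{fair}, h_2 = \text{loaded}) = 0.25 \times 0.1 \times 0.75 \times 0.275 = 0.00516$
$p(x_1, \ldots, x_L, h_1 = \text{fair}, h_2 = \text{fair}) = 0.25 \times 0.9 \times 0.5 \times 0.475 = 0.0534$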
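The four joint values for the first edge, and the posteriors they imply, can be recomputed from the quoted forward and backward numbers. The sketch below takes $f(h_1)$ and $b(h_2)$ as given constants from the example; the variable names are illustrative.

```python
trans = {"loaded": {"loaded": 0.9, "fair": 0.1},
         "fair": {"loaded": 0.1, "fair": 0.9}}
emit_head = {"loaded": 0.75, "fair": 0.5}    # P(x_2 = head | h_2)
f1 = {"loaded": 0.375, "fair": 0.25}         # forward values at position 1
b2 = {"loaded": 0.275, "fair": 0.475}        # backward values at position 2

# Joint probability of the data and each (h_1, h_2) pair
joint = {(g, h): f1[g] * trans[g][h] * emit_head[h] * b2[h]
         for g in f1 for h in b2}
px = sum(joint.values())                     # P(x_1, x_2, x_3)
posterior = {pair: value / px for pair, value in joint.items()}

for pair in joint:
    print(pair, round(joint[pair], 5), round(posterior[pair], 4))
print("P(x) =", round(px, 6))
```

The four joint values sum to P(x), so dividing by that sum turns them into the pairwise posteriors $p(h_1, h_2 \mid x)$ used in the E-step.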

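Coin-Tossing Example
$p(h_i \mid h_{i-1}, x_1, \ldots, x_L) = p(x_1, \ldots, x_L, h_i, h_{i-1}) / p(h_{i-1}, x_1, \ldots, x_L)$
$= f(h_{i-1})\, p(h_i \mid h_{i-1})\, p(x_i \mid h_i)\, b(h_i) / (f(h_{i-1})\, b(h_{i-1}))$
where the denominator uses $p(h_{i-1}, x_1, \ldots, x_L) = f(h_{i-1})\, b(h_{i-1})$.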

M-step
Update the parameters simultaneously (in this case we only have one parameter, λ):
$$\lambda \leftarrow \Big(\sum_i p(h_i = \text{loaded} \mid h_{i-1} = \text{loaded}, x_1, \ldots, x_L) + p(h_i = \text{fair} \mid h_{i-1} = \text{fair}, x_1, \ldots, x_L)\Big) / (L-1)$$

Variants of HMMs

Higher-order HMMs
How do we model “memory” larger than one time point?
First order: $P(\pi_{i+1} = l \mid \pi_i = k) = a_{kl}$
Second order: $P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j) = a_{jkl}$
…
A second-order HMM with K states is equivalent to a first-order HMM with $K^2$ states.
[Figure: first-order HMM over the pair states HH, HT, TH, TT, with transitions $a_{HHT}$, $a_{TTH}$, $a_{HTT}$, $a_{THH}$, $a_{THT}$, $a_{HTH}$]

Modeling the Duration of States
[Figure: states X and Y with self-loop probabilities p and q and switching probabilities 1 - p and 1 - q]
Length distribution of region X: geometric, with mean $E[l_X] = 1/(1-p)$.
This is a significant disadvantage of HMMs.
Several solutions exist for modeling different length distributions.

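Solution 1: Chain several states
[Figure: several copies of state X chained in series before Y; the last copy of X has self-loop probability p and exits with probability 1 - p, and Y has self-loop probability q and exits with probability 1 - q]
Disadvantage: still very inflexible
$l_X$ = C + geometric with mean $1/(1-p)$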

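Solution 2: Negative binomial distribution
[Figure: a chain of n copies of state X, each with self-loop probability p and probability 1 - p of moving to the next copy; the last copy exits to Y]
Duration in X: m turns, where
during the first m - 1 turns, exactly n - 1 arrows to the next state are followed;
during the m-th turn, an arrow to the next state is followed.
$$P(l_X = m) = \binom{m-1}{n-1} (1-p)^{n-1+1}\, p^{(m-1)-(n-1)} = \binom{m-1}{n-1} (1-p)^{n}\, p^{m-n}$$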
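The duration formula can be sanity-checked numerically. The sketch below evaluates the probability mass function for illustrative values n = 3 and p = 0.9 (the value of p is an assumption chosen for the example) and confirms that n = 1 reduces to the plain geometric case.

```python
from math import comb

def duration_pmf(m, n, p):
    """P(l_X = m) for n chained copies of X, each with self-loop probability p."""
    if m < n:
        return 0.0                      # at least one turn is spent in each copy
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.9
total = sum(duration_pmf(m, n, p) for m in range(1, 2001))
mean = sum(m * duration_pmf(m, n, p) for m in range(1, 2001))
print(round(total, 6), round(mean, 4))

# With a single copy the distribution is geometric: p^(m-1) * (1 - p)
print(duration_pmf(3, 1, 0.9))
```

The mean comes out as n / (1 - p), i.e. n times the mean of a single geometric stay, while the shape is no longer monotonically decreasing, which is what makes this construction more flexible than one self-looping state.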

Example: genes in prokaryotes
EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)
Negative binomial with n = 3

Solution 3: Duration modeling
Upon entering a state:
1. Choose duration d, according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities
Disadvantage: increase in complexity:
Time: $O(D^2)$
Space: $O(D)$
where D = maximum duration of state X

Example: The ABO locus
A locus is a particular place on the chromosome. Each locus' state (called genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.
The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O.
We wish to estimate the proportion in a population of the 6 genotypes.
Suppose we randomly sampled N individuals and found that $N_{a/a}$ have genotype a/a, $N_{a/b}$ have genotype a/b, etc. Then, the MLE is given by:
$$\theta_{a/a} = \frac{N_{a/a}}{N}, \quad \theta_{a/b} = \frac{N_{a/b}}{N}, \quad \text{and likewise for the other genotypes.}$$

The ABO locus (Cont.)
However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?
The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?
We use the Hardy-Weinberg equilibrium rule, which tells us that in equilibrium the frequencies of the three alleles $\theta_a, \theta_b, \theta_o$ in the population determine the frequencies of the genotypes as follows:
$\theta_{a/b} = 2\theta_a\theta_b$, $\theta_{a/o} = 2\theta_a\theta_o$, $\theta_{b/o} = 2\theta_b\theta_o$, $\theta_{a/a} = [\theta_a]^2$, $\theta_{b/b} = [\theta_b]^2$, $\theta_{o/o} = [\theta_o]^2$.
So now we have three parameters that we need to estimate.

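The Likelihood Function
Let X be a random variable with 6 values $x_{a/a}, x_{a/o}, x_{b/b}, x_{b/o}, x_{a/b}, x_{o/o}$ denoting the six genotypes.
The parameters are $\theta = \{\theta_a, \theta_b, \theta_o\}$.
The probability $P(X = x_{a/b} \mid \theta) = 2\theta_a\theta_b$.
The probability $P(X = x_{o/o} \mid \theta) = \theta_o\theta_o$.
And so on for the other four genotypes.
What is the probability of Data = {B, A, B, B, O, A, B, A, O, B, AB}?
Each blood type sums over its genotypes, and the sample contains five B's, three A's, two O's and one AB, so
$$P(\text{Data} \mid \theta) = (\theta_b^2 + 2\theta_b\theta_o)^5\, (\theta_a^2 + 2\theta_a\theta_o)^3\, (\theta_o^2)^2\, (2\theta_a\theta_b)$$
Obtaining the maximum of this function yields the MLE.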

ABO loci as a special case of HMM
Model the ABO sampling as an HMM with 6 states (genotypes): a/a, a/b, a/o, b/b, b/o, o/o, and 4 outputs (blood types): A, B, AB, O.
Assume 3 transition types: a, b and o, and that a state is determined by 2 successive transitions.
The probability of transition x is $\theta_x$.
Emission is done every other state, and is determined by the state. E.g., $e_{a/o}(A) = 1$, since a/o produces blood type A.
[Figure: example path with allele transitions a, o, b, a, a, b passing through genotype states a/o, a/b, a/b and emitting blood types A, AB, AB]

A faster and simpler EM for ABO loci
The problem can be solved via Baum-Welch EM training. This is quite inefficient: for L samples it requires running the forward and backward algorithms on an HMM of length 2L, even though there are only 6 distinct genotypes.
Direct application of the EM algorithm yields a simpler and more efficient way.

The EM algorithm in Bayes' nets
E-step:
Go over the data: sum the expectations of the hidden variables that you get from each data element.
M-step:
For every hidden variable x, update your belief according to the expectation you calculated in the last E-step.

EM - ABO Example
Hidden: allele a / b / o. Observed: blood type A / B / AB / O.
Data type   #people
A           100
B           200
AB          50
O           50
$\theta = \{\theta_a, \theta_b, \theta_o\}$ is the parameter we need to evaluate.
We choose a “reasonable” θ = {0.2, 0.2, 0.6}.

EM - ABO Example E-step: M-step: With l = allele and m = blood type

EM - ABO Example E-step: we compute all the necessary elements

EM - ABO Example
$\theta^0$ = {0.2, 0.2, 0.6}, n = 400 (data size)
Data type   #people
A           100
B           200
AB          50
O           50
E-step (1st step):

EM - ABO Example
$\theta^0$ = {0.2, 0.2, 0.6}, n = 400 (data size)
M-step (1st step):

EM - ABO Example
$\theta^1$ = {0.2, 0.35, 0.45}
E-step (2nd step):

EM - ABO Example
$\theta^1$ = {0.2, 0.35, 0.45}
M-step (2nd step):

EM - ABO Example
$\theta^2$ = {0.21, 0.38, 0.41}
E-step (3rd step):

EM - ABO Example
M-step (3rd step):
$\theta^2$ = {0.29, 0.56, 0.15}
No change
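The direct EM updates for the blood-type counts can be sketched as "gene counting": the E-step splits the A and B individuals into their two possible genotypes under Hardy-Weinberg proportions, and the M-step re-counts alleles. This is an illustrative reconstruction, not the slides' own derivation (their E-step and M-step formulas did not survive extraction); the function name `em_step` is hypothetical.

```python
def em_step(theta, counts):
    """One EM update of the allele frequencies (theta_a, theta_b, theta_o)."""
    ta, tb, to = theta
    n = sum(counts.values())
    # E-step: expected genotype counts given the observed blood types
    n_aa = counts["A"] * ta * ta / (ta * ta + 2 * ta * to)
    n_ao = counts["A"] - n_aa
    n_bb = counts["B"] * tb * tb / (tb * tb + 2 * tb * to)
    n_bo = counts["B"] - n_bb
    n_ab, n_oo = counts["AB"], counts["O"]
    # M-step: each person carries two alleles, so divide allele counts by 2n
    new_a = (2 * n_aa + n_ao + n_ab) / (2 * n)
    new_b = (2 * n_bb + n_bo + n_ab) / (2 * n)
    new_o = (2 * n_oo + n_ao + n_bo) / (2 * n)
    return new_a, new_b, new_o

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
theta0 = (0.2, 0.2, 0.6)
theta1 = em_step(theta0, counts)
theta2 = em_step(theta1, counts)
print([round(t, 3) for t in theta1])
print([round(t, 3) for t in theta2])
```

Starting from {0.2, 0.2, 0.6}, the first two updates of this sketch land within about 0.01 of the values {0.2, 0.35, 0.45} and {0.21, 0.38, 0.41} quoted in the example, and each update keeps the three frequencies summing to 1.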
