
1 Hidden Markov Models
[Figure: HMM trellis with hidden states 1, 2, …, K emitting observed symbols x1, x2, x3, …, xK]

2 Viterbi, Forward, Backward
VITERBI
Initialization: V_0(0) = 1; V_k(0) = 0 for all k > 0
Iteration: V_l(i) = e_l(x_i) · max_k [ V_k(i-1) a_kl ]
Termination: P(x, π*) = max_k V_k(N)

FORWARD
Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0
Iteration: f_l(i) = e_l(x_i) · Σ_k f_k(i-1) a_kl
Termination: P(x) = Σ_k f_k(N) a_k0

BACKWARD
Initialization: b_k(N) = a_k0 for all k
Iteration: b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1)
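A minimal NumPy sketch of the three recursions, for concreteness. The function name and the explicit begin/end vectors a0 and aE are illustrative assumptions; a real implementation would work in log space to avoid underflow on long sequences.

```python
import numpy as np

def viterbi_forward_backward(x, a0, a, aE, e):
    """x: observation indices; a0[k] = a_0k; a[k, l] = a_kl; aE[k] = a_k0; e[k, b] = e_k(b)."""
    K, N = a.shape[0], len(x)
    # Viterbi
    V = np.zeros((K, N))
    V[:, 0] = a0 * e[:, x[0]]
    for i in range(1, N):
        V[:, i] = e[:, x[i]] * np.max(V[:, i-1, None] * a, axis=0)  # max_k V_k(i-1) a_kl
    p_pi_star = V[:, -1].max()                 # termination: P(x, pi*) = max_k V_k(N)
    # Forward
    f = np.zeros((K, N))
    f[:, 0] = a0 * e[:, x[0]]
    for i in range(1, N):
        f[:, i] = e[:, x[i]] * (f[:, i-1] @ a)  # sum_k f_k(i-1) a_kl
    p_x = f[:, -1] @ aE                         # termination: P(x) = sum_k f_k(N) a_k0
    # Backward
    b = np.zeros((K, N))
    b[:, -1] = aE                               # init: b_k(N) = a_k0
    for i in range(N - 2, -1, -1):
        b[:, i] = a @ (e[:, x[i+1]] * b[:, i+1])
    assert np.isclose(p_x, a0 @ (e[:, x[0]] * b[:, 0]))  # both terminations agree
    return V, f, b, p_x
```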

3 CpG islands in DNA sequences: a modeling example
States: A+ C+ G+ T+ (CpG island) and A- C- G- T- (background)

4 Methylation & Silencing
One way cells differentiate is methylation: the addition of CH3 to C-nucleotides, which silences genes in the region. CG (denoted CpG) often mutates to TG when methylated. In each cell, one copy of the X chromosome is silenced; methylation plays a role. Methylation is inherited during cell division.

From Wikipedia: DNA methylation is a type of chemical modification of DNA that can be inherited without changing the DNA sequence. As such, it is part of the epigenetic code. DNA methylation involves the addition of a methyl group to DNA, for example to the number 5 carbon of the cytosine pyrimidine ring. DNA methylation is probably universal in eukaryotes. In humans, approximately 1% of DNA bases undergo DNA methylation. In adult somatic tissues, DNA methylation typically occurs in a CpG dinucleotide context; non-CpG methylation is prevalent in embryonic stem cells. In plants, cytosines are methylated both symmetrically (CpG or CpNpG) and asymmetrically (CpNpNp), where N can be any nucleotide.

DNA methylation in mammals: Between 60% and 70% of all CpGs are methylated. Unmethylated CpGs are grouped in clusters called "CpG islands" that are present in the 5' regulatory regions of many genes. In many disease processes such as cancer, gene promoter CpG islands acquire abnormal hypermethylation, which results in heritable transcriptional silencing. Reinforcement of the transcriptionally silent state is mediated by proteins that can bind methylated CpGs. These proteins, called methyl-CpG binding proteins, recruit histone deacetylases and other chromatin remodelling proteins that can modify histones, thereby forming compact, inactive chromatin termed heterochromatin. This link between DNA methylation and chromatin structure is very important. In particular, loss of Methyl-CpG-binding Protein 2 (MeCP2) has been implicated in Rett syndrome, and Methyl-CpG binding domain protein 2 (MBD2) mediates the transcriptional silencing of hypermethylated genes in cancer.

DNA methylation in humans: In humans, the process of DNA methylation is carried out by three enzymes: DNA methyltransferase 1, 3a, and 3b (DNMT1, DNMT3a, DNMT3b). It is thought that DNMT3a and DNMT3b are the de novo methyltransferases that set up DNA methylation patterns early in development. DNMT1 is the proposed maintenance methyltransferase, responsible for copying DNA methylation patterns to the daughter strands during DNA replication. DNMT3L is a protein that is homologous to the other DNMTs but has no catalytic activity; instead, it assists the de novo methyltransferases by increasing their ability to bind to DNA and stimulating their activity. Since many tumor suppressor genes are silenced by DNA methylation during carcinogenesis, there have been attempts to re-express these genes by inhibiting the DNMTs. 5-aza-2'-deoxycytidine (decitabine) is a nucleoside analog that inhibits DNMTs by trapping them in a covalent complex on DNA, preventing the β-elimination step of catalysis and thus resulting in the enzymes' degradation. However, for decitabine to be active it must be incorporated into the genome of the cell, which can cause mutations in the daughter cells if the cell does not die. Additionally, decitabine is toxic to the bone marrow, which limits the size of its therapeutic window. These pitfalls have led to the development of antisense RNA therapies that target the DNMTs by degrading their mRNAs and preventing their translation. However, it is currently unclear whether targeting DNMT1 alone is sufficient to reactivate tumor suppressor genes silenced by DNA methylation.

DNA methylation in plants: Significant progress has been made in understanding DNA methylation in plants, specifically in the model plant Arabidopsis thaliana. The principal DNA methyltransferases in A. thaliana (Met1, Cmt3, and Drm2) are similar at the sequence level to the mammalian methyltransferases. Drm2 is thought to participate in de novo DNA methylation as well as in its maintenance; Cmt3 and Met1 act principally in the maintenance of DNA methylation [1]. Other DNA methyltransferases are expressed in plants but have no known function (see [2]). The specificity of DNA methyltransferases is thought to be driven by RNA-directed DNA methylation: specific RNA transcripts are produced from a genomic DNA template and may form double-stranded RNA molecules. These double-stranded RNAs, through either the small interfering RNA (siRNA) or micro RNA (miRNA) pathways, direct the localization of DNA methyltransferases to specific targets in the genome [3].

5 Example: CpG Islands
CpG dinucleotides in the genome are frequently methylated (we write CpG so as not to confuse it with the C·G base pair).
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands

6 Example: CpG Islands
In CpG islands, CG is more frequent; other pairs (AA, AG, AT, …) also have different frequencies.
Question: detect CpG islands computationally.

7 A model of CpG Islands – (1) Architecture
[Figure: two connected sub-models, CpG Island (A+ C+ G+ T+) and Not CpG Island (A- C- G- T-)]

8 A model of CpG Islands – (2) Transitions
How do we estimate the parameters of the model?
Emission probabilities: 1/0 (each state emits its own nucleotide with probability 1, all others with probability 0)
Transition probabilities within CpG islands: estimated from known CpG islands (training set)
Transition probabilities within other regions: estimated from known non-CpG regions
Note: the transitions out of each state add up to one: no room yet for transitions between (+) and (-) states

+     A      C      G      T
A   .180   .274   .426   .120
C   .171   .368   .274   .188
G   .161   .339   .375   .125
T   .079   .355   .384   .182    (each row sums to 1)

-     A      C      G      T
A   .300   .205   .285   .210
C   .322   .298   .078   .302
G   .248   .246   .298   .208
T   .177   .239   .292   .292    (each row sums to 1)

9 Log Likelihoods — Telling "Prediction" from "Random"
Another way to see the effect of the transitions: log-likelihood ratios
L(u, v) = log[ P(uv | +) / P(uv | -) ]
Given a region x = x1…xN, a quick-and-dirty way to decide whether the entire x is a CpG island:
P(x is CpG) > P(x is not CpG) ⇔ Σ_i L(x_i, x_{i+1}) > 0

L      A       C       G       T
A   -0.740  +0.419  +0.580  -0.803
C   -0.913  +0.302  +1.812  -0.685
G   -0.624  +0.461  +0.331  -0.730
T   -1.169  +0.573  +0.393  -0.679
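A toy Python sketch of this quick-and-dirty test; the table values are from the slide, and the function name is illustrative.

```python
# L[u][v] = log( P(uv | +) / P(uv | -) ), copied from the table above
L = {
    'A': {'A': -0.740, 'C': +0.419, 'G': +0.580, 'T': -0.803},
    'C': {'A': -0.913, 'C': +0.302, 'G': +1.812, 'T': -0.685},
    'G': {'A': -0.624, 'C': +0.461, 'G': +0.331, 'T': -0.730},
    'T': {'A': -1.169, 'C': +0.573, 'G': +0.393, 'T': -0.679},
}

def cpg_score(x):
    """Sum of L(x_i, x_{i+1}) over consecutive pairs; > 0 suggests x is a CpG island."""
    return sum(L[u][v] for u, v in zip(x, x[1:]))

print(cpg_score("CGCGCGCG") > 0)   # True: a CG-rich region scores positive
```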

10 A model of CpG Islands – (2) Transitions
What about transitions between (+) and (-) states? They affect:
the average length of a CpG island
the average separation between two CpG islands
[Figure: states X and Y; X self-loops with probability p and moves to Y with probability 1-p; Y self-loops with probability q and returns with probability 1-q]
Length distribution of a region of X states:
P[l_X = 1] = 1-p;  P[l_X = 2] = p(1-p);  …;  P[l_X = k] = p^(k-1)(1-p)
E[l_X] = 1/(1-p): a geometric distribution with mean 1/(1-p)

11 A model of CpG Islands – (2) Transitions
Right now, a_{A+A+} + a_{A+C+} + a_{A+G+} + a_{A+T+} = 1.
We need to adjust a_ij so as to allow transitions between (+) and (-) states.
Say we want probability p++ of staying within CpG, and p-- of staying within non-CpG.
Adjust all within-(+) probabilities by that factor: for example, a_{A+G+} ← p++ × a_{A+G+}.
Now calculate the probabilities between (+) and (-) states:
The total probability a_{A+S}, summed over all (-) states S, is (1 - p++).
Let q_{A-}, q_{C-}, q_{G-}, q_{T-} be the proportions of A, C, G, and T within non-CpG states in the training set.
Then let a_{A+A-} = (1 - p++) × q_{A-}; a_{A+C-} = (1 - p++) × q_{C-}; …
Do the same for the (-) to (+) transitions.
OK, but how do we estimate p++ and p--?
Estimate the average length of a CpG island: l+ = 1/(1 - p++) ⇒ p++ = 1 - 1/l+.
Do the same with the average separation between two CpG islands, l-, to get p--.
This construction is sketched in code below.
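A minimal NumPy sketch of that construction. All names are illustrative: a_plus and a_minus are the 4×4 within-region tables from slide 8, q_plus and q_minus are the nucleotide proportion vectors, and l_plus and l_minus are the average lengths.

```python
import numpy as np

def combined_transitions(a_plus, a_minus, q_plus, q_minus, l_plus, l_minus):
    """Build the 8x8 transition matrix over A+,C+,G+,T+,A-,C-,G-,T-."""
    p_pp = 1 - 1 / l_plus       # p++ = 1 - 1/l+
    p_mm = 1 - 1 / l_minus      # p-- = 1 - 1/l-
    A = np.zeros((8, 8))        # states 0..3 = (+), states 4..7 = (-)
    A[:4, :4] = p_pp * a_plus               # stay inside the CpG island
    A[:4, 4:] = (1 - p_pp) * q_minus        # leave the island: (1 - p++) * q_l-
    A[4:, 4:] = p_mm * a_minus              # stay outside
    A[4:, :4] = (1 - p_mm) * q_plus         # enter an island: (1 - p--) * q_l+
    return A                                # each row still sums to 1
```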

12 Applications of the model
Given a DNA region x, the Viterbi algorithm predicts locations of CpG islands.
Given a nucleotide x_i (say x_i = A), the Viterbi parse tells whether x_i is in a CpG island in the most likely overall scenario.
The Forward/Backward algorithms can calculate P(x_i is in a CpG island) = P(π_i = A+ | x).
Posterior decoding can assign locally optimal predictions of CpG islands: π̂_i = argmax_k P(π_i = k | x), as in the sketch below.
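A minimal sketch of posterior decoding, reusing the f, b, and P(x) values from the forward/backward sketch on slide 2.

```python
import numpy as np

def posterior_decode(f, b, p_x):
    """pi_hat_i = argmax_k P(pi_i = k | x) = argmax_k f_k(i) b_k(i) / P(x)."""
    post = f * b / p_x                    # post[k, i] = P(pi_i = k | x)
    return np.argmax(post, axis=0), post
```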

13 What if a new genome comes?
We just sequenced the porcupine genome.
We know CpG islands play the same role in this genome; however, we have no known CpG islands for porcupines.
We suspect the frequency and characteristics of CpG islands are quite different in porcupines.
How do we adjust the parameters in our model? LEARNING

14 Problem 3: Learning
Re-estimate the parameters of the model based on training data.

15 Two learning scenarios
Estimation when the "right answer" is known. Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player lets us observe him one evening as he changes dice, producing 10,000 rolls
Estimation when the "right answer" is unknown. Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: update the parameters θ of the model to maximize P(x | θ)

16 1. When the right answer is known
Given x = x1…xN for which the true path π = π1…πN is known, define:
A_kl = # times the k→l transition occurs in π
E_k(b) = # times state k in π emits b in x
We can show that the maximum-likelihood parameters θ (those maximizing P(x | θ)) are:
a_kl = A_kl / Σ_i A_ki        e_k(b) = E_k(b) / Σ_c E_k(c)
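A small Python sketch of this counting, under assumed names (x and pi as sequences, states and alphabet as lists). Zero counts for rarely visited states are handled by the pseudocounts of the next slides.

```python
from collections import Counter

def ml_estimate(x, pi, states, alphabet):
    """Maximum-likelihood a_kl, e_k(b) from a sequence x with known state path pi."""
    A = Counter(zip(pi, pi[1:]))    # A_kl: transition counts along the path
    E = Counter(zip(pi, x))         # E_k(b): emission counts
    # Normalize counts; states with zero visits need pseudocounts (next slide)
    a = {k: {l: A[(k, l)] / sum(A[(k, m)] for m in states) for l in states}
         for k in states}
    e = {k: {b: E[(k, b)] / sum(E[(k, c)] for c in alphabet) for b in alphabet}
         for k in states}
    return a, e
```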

17 1. When the right answer is known
Intuition: when we know the underlying states, the best estimate is the observed frequency of transitions & emissions in the training data.
Drawback: given little data, there may be overfitting:
P(x | θ) is maximized, but θ is unreasonable: zero probabilities are VERY BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: a_FF = 1; a_FL = 0
e_F(1) = e_F(3) = .2; e_F(2) = .3; e_F(4) = 0; e_F(5) = .1; e_F(6) = .2

18 Pseudocounts Solution for small training sets: Add pseudocounts
A_kl = (# times the k→l transition occurs in π) + r_kl
E_k(b) = (# times state k in π emits b in x) + r_k(b)
r_kl and r_k(b) are pseudocounts representing our prior belief.
Larger pseudocounts → strong prior belief.
Small pseudocounts (< 1): just to avoid 0 probabilities.

19 Pseudocounts example: the dishonest casino
We will observe the player for one day: 600 rolls.
Reasonable pseudocounts:
r_0F = r_0L = r_F0 = r_L0 = 1
r_FL = r_LF = r_FF = r_LL = 1
r_F(1) = r_F(2) = … = r_F(6) = 20 (strong belief fair is fair)
r_L(1) = r_L(2) = … = r_L(6) = 5 (wait and see for loaded)
The numbers above are pretty arbitrary; assigning priors is an art. A sketch of the smoothing step follows.
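A tiny sketch of the smoothing step with these priors; names are illustrative.

```python
def smoothed(counts, pseudo):
    """Add pseudocounts to raw counts before normalizing to probabilities."""
    total = sum(counts[k] + pseudo[k] for k in counts)
    return {k: (counts[k] + pseudo[k]) / total for k in counts}

# The slide's emission priors: strong belief that the fair die is fair.
r_fair   = {face: 20 for face in range(1, 7)}
r_loaded = {face: 5  for face in range(1, 7)}
```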

20 2. When the right answer is unknown
We don't know the true A_kl, E_k(b).
Idea:
We estimate our "best guess" of what A_kl and E_k(b) are.
We update the parameters of the model based on that guess.
We repeat.

21 2. When the right answer is unknown
Starting with our best guess of a model M with parameters θ:
Given x = x1…xN for which the true path π = π1…πN is unknown,
we can get to a provably more likely parameter set θ, i.e., a θ that increases the probability P(x | θ).
Principle: EXPECTATION MAXIMIZATION
1. Estimate A_kl, E_k(b) from the training data, given the current θ
2. Update θ according to A_kl, E_k(b)
3. Repeat 1 & 2 until convergence

22 Estimating new parameters
To estimate A_kl (assume conditioning on θ = θ_CURRENT in all formulas below):
At each position i of sequence x, find the probability that transition k→l is used:
P(π_i = k, π_{i+1} = l | x) = [1/P(x)] P(π_i = k, π_{i+1} = l, x1…xN) = Q/P(x)
where Q = P(x1…xi, π_i = k, π_{i+1} = l, x_{i+1}…xN)
= P(π_{i+1} = l, x_{i+1}…xN | π_i = k) P(x1…xi, π_i = k)
= P(π_{i+1} = l, x_{i+1} x_{i+2}…xN | π_i = k) f_k(i)
= P(x_{i+2}…xN | π_{i+1} = l) P(x_{i+1} | π_{i+1} = l) P(π_{i+1} = l | π_i = k) f_k(i)
= b_l(i+1) e_l(x_{i+1}) a_kl f_k(i)
So: P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ_CURRENT)

23 Estimating new parameters
So A_kl is the expected number of times the k→l transition is used, given the current θ:
A_kl = Σ_i P(π_i = k, π_{i+1} = l | x, θ) = Σ_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ)
Similarly,
E_k(b) = [1/P(x | θ)] Σ_{i | x_i = b} f_k(i) b_k(i)
[Figure: x1…x_{i-1} reaching state k at position i (weight f_k(i)), transition a_kl emitting x_{i+1} in state l (weight e_l(x_{i+1})), then b_l(i+1) covering x_{i+2}…xN]
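A NumPy sketch of this E-step, reusing the f, b, and P(x) values from the forward/backward sketch on slide 2; names are illustrative.

```python
import numpy as np

def expected_counts(x, a, e, f, b, p_x):
    """Expected transition counts A_kl and emission counts E_k(b)."""
    K, N = f.shape
    A = np.zeros((K, K))
    for i in range(N - 1):
        # A_kl += f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1) / P(x)
        A += np.outer(f[:, i], e[:, x[i+1]] * b[:, i+1]) * a / p_x
    E = np.zeros((K, e.shape[1]))
    for i in range(N):
        E[:, x[i]] += f[:, i] * b[:, i] / p_x   # E_k(b) sums positions with x_i = b
    return A, E
```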

24 The Baum-Welch Algorithm
Initialization: pick a best guess for the model parameters (or arbitrary ones)
Iteration:
1. Forward
2. Backward
3. Calculate A_kl, E_k(b), given θ_CURRENT
4. Calculate the new model parameters θ_NEW: a_kl, e_k(b)
5. Calculate the new log-likelihood P(x | θ_NEW)
   (guaranteed to be higher, by expectation maximization)
Until P(x | θ) does not change much. A sketch of the full loop follows.
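Putting the pieces together, a sketch of the full loop, reusing the earlier forward/backward and expected-count sketches. For simplicity the begin/end vectors a0 and aE are held fixed, and pseudocounts are omitted.

```python
import numpy as np

def baum_welch(x, a, e, a0, aE, n_iter=100, tol=1e-6):
    """Iterate E-step (expected counts) and M-step (normalize) until convergence."""
    prev_ll = -np.inf
    for _ in range(n_iter):
        _, f, b, p_x = viterbi_forward_backward(x, a0, a, aE, e)  # slide 2 sketch
        A, E = expected_counts(x, a, e, f, b, p_x)                # slide 23 sketch
        a = A / A.sum(axis=1, keepdims=True)   # M-step: a_kl = A_kl / sum_i A_ki
        e = E / E.sum(axis=1, keepdims=True)   # M-step: e_k(b) = E_k(b) / sum_c E_k(c)
        ll = np.log(p_x)                       # guaranteed non-decreasing by EM
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return a, e
```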

25 The Baum-Welch Algorithm
Time complexity: (# iterations) × O(K²N)
Guaranteed to increase the log-likelihood P(x | θ)
Not guaranteed to find the globally best parameters: converges to a local optimum, depending on initial conditions
Too many parameters / too large a model → overtraining

26 Alternative: Viterbi Training
Initialization: same as Baum-Welch
Iteration:
1. Perform Viterbi to find π*
2. Calculate A_kl, E_k(b) according to π*, plus pseudocounts
3. Calculate the new parameters a_kl, e_k(b)
Until convergence
Notes:
Not guaranteed to increase P(x | θ)
Guaranteed to increase P(x, π* | θ)
In general, worse performance than Baum-Welch

27 Variants of HMMs

28 Higher-order HMMs How do we model “memory” larger than one time point?
P(i+1 = l | i = k) akl P(i+1 = l | i = k, i -1 = j) ajkl A second order HMM with K states is equivalent to a first order HMM with K2 states aHHT state HH state HT aHT(prev = H) aHT(prev = T) aHTH state H state T aTHH aHTT aTHT state TH state TT aTH(prev = H) aTH(prev = T) aTTH

29 Modeling the Duration of States
[Figure: states X and Y; X self-loops with probability p and moves to Y with probability 1-p; Y self-loops with probability q and returns with probability 1-q]
Length distribution of a region of X states: E[l_X] = 1/(1-p), a geometric distribution with mean 1/(1-p).
This is a significant disadvantage of HMMs. Several solutions exist for modeling different length distributions.

30 Example: exon lengths in genes
The algorithm has to be generalized because in a standard HMM the output at each step is one base, leading to state durations of geometric length. If we look at empirical data, such as these plots, we see that intron lengths seem to follow the geometric distribution fairly well, but for exons that would be a pretty bad model. So in our state space the exons are generalized states, choosing a length from a general distribution and outputting the entire exon in one step, while the intron and intergenic states still output one base at a time and follow the geometric distribution.

31 Solution 1: Chain several states
[Figure: a chain of X states feeding into Y; self-loop probability p, advance probability 1-p; Y self-loops with probability q]
Disadvantage: still very inflexible
l_X = C + geometric with mean 1/(1-p)

32 Solution 2: Negative binomial distribution
[Figure: states X(1), X(2), …, X(n) in a chain, each self-looping with probability p and advancing with probability 1-p, then exiting to Y]
Duration in X: m turns, where:
during the first m-1 turns, exactly n-1 'advance' arrows (probability 1-p each) are followed;
during the m-th turn, the final arrow to the next state is followed.
P(l_X = m) = C(m-1, n-1) (1-p)^((n-1)+1) p^((m-1)-(n-1)) = C(m-1, n-1) (1-p)^n p^(m-n)
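This duration distribution can be computed directly; a one-function sketch (valid for m ≥ n):

```python
from math import comb

def neg_binom_duration(m, n, p):
    """P(l_X = m) = C(m-1, n-1) (1-p)^n p^(m-n): n chained states, self-loop prob p."""
    return comb(m - 1, n - 1) * (1 - p)**n * p**(m - n)
```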

33 Example: genes in prokaryotes
EasyGene: a prokaryotic gene-finder (Larsen TS, Krogh A)
Negative binomial with n = 3.

34 Solution 3: Duration modeling
Upon entering a state:
1. Choose a duration d according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities
[Figure: state F with duration distribution P_f, duration d < D_f, emitting x_i…x_{i+d-1}]
Disadvantage: increase in complexity, where D = maximum duration of a state:
Time: O(D); Space: O(1)
(Warning: Rabiner's tutorial claims O(D²) time and O(D) space increases.)

35 Viterbi with duration modeling
[Figure: states F and L with duration distributions P_f and P_l, durations d < D_f and d < D_l, emitting runs x_i…x_{i+d-1} and x_j…x_{j+d-1}, with transitions between them]
Recall the original iteration:
V_F(i) = max_k V_k(i-1) a_kF × e_F(x_i)
New iteration:
V_F(i) = max_k max_{d=1…D_f} V_k(i-d) × P_f(d) × a_kF × Π_{j=i-d+1…i} e_F(x_j)
Precompute cumulative values so the emission product over d letters costs O(1), as in the sketch below.
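A sketch of the duration-modeled update for one state F, in log space, with precomputed cumulative emission sums. All names are illustrative, and initialization at the start of x is omitted.

```python
import numpy as np

def viterbi_duration_F(x, V_prev, log_a_kF, log_Pf, log_eF, D_f):
    """V_prev[k, t]: best log-prob ending in state k at position t.
    log_a_kF[k] = log a_kF; log_Pf[d] = log P_f(d) for d = 1..D_f; log_eF[b] = log e_F(b)."""
    N = len(x)
    cum = np.concatenate([[0.0], np.cumsum(log_eF[x])])  # prefix sums of log-emissions
    V_F = np.full(N, -np.inf)
    for i in range(N):
        for d in range(1, min(D_f, i) + 1):              # durations ending at position i
            emit = cum[i + 1] - cum[i + 1 - d]           # sum_{j=i-d+1..i} log e_F(x_j), O(1)
            prev = np.max(V_prev[:, i - d] + log_a_kF)   # max_k V_k(i-d) a_kF
            V_F[i] = max(V_F[i], prev + log_Pf[d] + emit)
    return V_F
```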

