Probabilistic sequence modeling II: Markov chains Haixu Tang School of Informatics.

A DNA profile (matrix)
Aligned training sequences: TATAAA, TATAAT, TATAAA, TATTAA, TTAAAA, TAGAAA
Counts per position:
     1  2  3  4  5  6
T    8  1  6  1  0  1
C    0  0  0  0  0  0
A    0  7  1  7  8  7
G    0  0  1  0  0  0
Sparse data → pseudo-counts (1 added to every cell):
     1  2  3  4  5  6
T    9  2  7  2  1  2
C    1  1  1  1  1  1
A    1  8  2  8  9  8
G    1  1  2  1  1  1
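
The counting and pseudo-count steps can be sketched as follows. This is a minimal illustration, not the lecturer's code; the function names `profile_counts` and `profile_probs` are made up here. Note that the slide's columns sum to 8, so its counts appear to come from a larger training set than the six sequences listed; the sketch uses only the six listed sequences, and its numbers therefore differ from the slide's.

```python
ALPHABET = "TCAG"  # row order used on the slide

def profile_counts(seqs, pseudo=0):
    """Count each base at each position of equal-length aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # Every cell starts at the pseudo-count (0 gives raw counts).
    counts = {base: [pseudo] * length for base in ALPHABET}
    for s in seqs:
        for pos, base in enumerate(s):
            counts[base][pos] += 1
    return counts

def profile_probs(counts):
    """Normalise each column of a count matrix to probabilities."""
    length = len(next(iter(counts.values())))
    totals = [sum(counts[b][j] for b in counts) for j in range(length)]
    return {b: [counts[b][j] / totals[j] for j in range(length)] for b in counts}

# Only the six sequences listed on the slide.
seqs = ["TATAAA", "TATAAT", "TATAAA", "TATTAA", "TTAAAA", "TAGAAA"]
raw = profile_counts(seqs)
smoothed = profile_counts(seqs, pseudo=1)
probs = profile_probs(smoothed)
```

With pseudo-counts, no cell of `probs` is zero, so a single unseen base no longer forces the probability of a whole sequence to zero.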

Frequency & Profile model
Frequency model: the order of nucleotides in the training sequences is ignored.
Profile model: the training sequences are aligned → the order of nucleotides in the training sequences is fully preserved.
Markov chain model: the order is partially incorporated.

Markov chain model
A Markov chain models dependencies between adjacent positions in a sequence.
– Certain regions in the genome show characteristic patterns, such as TATA within the regulatory area upstream of a gene.
– The pattern CG is less common than expected under random sampling.
Such dependencies can be modeled by Markov chains.

Finite 1st order Markov Chain
An integer-time stochastic process consisting of a set of m > 1 states {s_1, …, s_m} and
1. An m-dimensional initial distribution vector (p(s_1), …, p(s_m)).
2. An m×m transition probability matrix M = (a_{s_i s_j}).
For example, for a DNA sequence the states are {A, C, T, G}, p(A) is the probability that A is the 1st letter of the sequence, and a_{AG} is the probability that G follows A in the sequence (AG).

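1st order Markov Chain
[Figure: chain X_1 → X_2 → … → X_{n-1} → X_n]
For each integer n, a Markov chain assigns probability to sequences (x_1 … x_n) as follows:
$$p(x_1, \ldots, x_n) = p(X_1 = x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}$$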
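
As a rough sketch of how a first-order chain scores a sequence, the function below multiplies the initial probability of the first letter by the transition probability of each adjacent pair, working in log space to avoid underflow on long sequences. The name `chain_log_prob` and all numeric parameters are hypothetical, chosen only for illustration.

```python
import math

def chain_log_prob(seq, init, trans):
    """Log of p(x_1) * prod_i a_{x_{i-1} x_i} for a first-order Markov chain."""
    probs = [init[seq[0]]] + [trans[prev][cur] for prev, cur in zip(seq, seq[1:])]
    if any(p == 0 for p in probs):
        return float("-inf")  # the chain cannot generate this sequence
    return sum(math.log(p) for p in probs)

# Hypothetical parameters, for illustration only (not taken from the slides).
init = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
trans = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

# p(AGC) = p(A) * a_AG * a_GC = 0.25 * 0.3 * 0.4
log_p = chain_log_prob("AGC", init, trans)
```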

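kth order Markov Chain
A kth order Markov chain assigns probability to sequences (x_1 … x_n) as follows:
$$p(x_1, \ldots, x_n) = p(x_1, \ldots, x_k) \prod_{i=k+1}^{n} p(x_i \mid x_{i-k}, \ldots, x_{i-1})$$
The first factor is the initial distribution; the factors in the product are the transition probabilities.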
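
One way to picture a kth order chain in code is to key the transition table by the preceding k symbols. The sketch below assumes that representation; the name `kth_order_log_prob`, the two-letter alphabet, and all probabilities are hypothetical.

```python
import math
from itertools import product

def kth_order_log_prob(seq, k, init, trans):
    """Log-probability under a kth order chain.

    init maps the first k symbols (as a string) to their joint probability;
    trans maps each length-k context to a {next symbol: probability} table.
    """
    if len(seq) < k:
        raise ValueError("sequence is shorter than the model order")
    logp = math.log(init[seq[:k]])
    for i in range(k, len(seq)):
        context = seq[i - k:i]  # the k symbols preceding position i
        logp += math.log(trans[context][seq[i]])
    return logp

# Hypothetical 2nd order model over the toy alphabet {a, b}.
contexts = ["".join(c) for c in product("ab", repeat=2)]
init2 = {c: 0.25 for c in contexts}
trans2 = {c: {"a": 0.5, "b": 0.5} for c in contexts}
trans2["ab"] = {"a": 0.9, "b": 0.1}  # after "ab", "a" is strongly preferred

# p(abab) = p(ab) * p(a | ab) * p(b | ba) = 0.25 * 0.9 * 0.5
log_p2 = kth_order_log_prob("abab", 2, init2, trans2)
```

The number of contexts grows as (alphabet size)^k, which is why higher orders need much more training data.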

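Matrix Representation
The transition probability matrix M = (a_{st}), with rows (from state s) and columns (to state t) ordered A, B, C, D:
       A     B     C     D
A      0     1     0     0
B      0.8   0     0.2   0
C      0.3   0     0.5   0.2
D      0     0.05  0     0.95
M is a stochastic matrix: $\sum_t a_{st} = 1$ for every state s.
The initial distribution vector (u_1 … u_m) defines the distribution of X_1 (p(X_1 = s_i) = u_i).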
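
The stochastic-matrix property and the role of the initial distribution can be checked with a few lines of code. The matrix below is the four-state example as read from this slide (rows and columns ordered A, B, C, D); the helper names and the choice to start in state A are assumptions for illustration.

```python
# Transition matrix as read from the slide; rows = from state, columns = to state.
M = [
    [0.0, 1.0, 0.0, 0.0],    # from A
    [0.8, 0.0, 0.2, 0.0],    # from B
    [0.3, 0.0, 0.5, 0.2],    # from C
    [0.0, 0.05, 0.0, 0.95],  # from D
]

def is_stochastic(matrix, tol=1e-9):
    """True if every entry is non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def step(dist, matrix):
    """Advance a distribution by one step: new[t] = sum_s dist[s] * a_st."""
    m = len(matrix)
    return [sum(dist[s] * matrix[s][t] for s in range(m)) for t in range(m)]

u = [1.0, 0.0, 0.0, 0.0]  # assumed initial distribution: X_1 = A with certainty
dist_x2 = step(u, M)        # distribution of X_2
dist_x3 = step(dist_x2, M)  # distribution of X_3
```

Starting in A, the chain must be in B at the second step, and at the third step it is in A with probability 0.8 and in C with probability 0.2.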

Representation of a Markov Chain as a Digraph (directed graph)
Each directed edge A → B is associated with the positive transition probability from A to B.
[Figure: digraph of the matrix M on states A, B, C, D, with edges A→B (1), B→A (0.8), B→C (0.2), C→A (0.3), C→C (0.5), C→D (0.2), D→B (0.05), D→D (0.95)]

Classification of Markov Chain states
[Figure: digraph on states A, B, C, D]
States of Markov chains are classified by the digraph representation (omitting the actual probability values).
A, C and D are recurrent states: they are in strongly connected components which are sinks in the graph.
B is not recurrent; it is a transient state.
Alternative definition: a state s is recurrent if it can be reached from any state reachable from s; otherwise it is transient.
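
The alternative definition translates almost directly into a reachability test. The sketch below is one possible implementation; because the slide's figure did not survive in this transcript, the edge list is a hypothetical graph chosen so that A, C and D come out recurrent and B transient, as the slide states.

```python
def reachable(graph, start):
    """Set of states reachable from start in one or more steps."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def classify(graph):
    """Label s recurrent if s can be reached back from every state reachable from s."""
    reach = {s: reachable(graph, s) for s in graph}
    return {
        s: "recurrent" if all(s in reach[t] for t in reach[s]) else "transient"
        for s in graph
    }

# Hypothetical digraph (edges only, probabilities omitted); not the slide's figure.
graph = {
    "A": ["A"],       # A is a sink on its own
    "B": ["A", "C"],  # B leaks into both sinks and is never re-entered
    "C": ["D"],
    "D": ["C"],       # C and D form a closed cycle
}
labels = classify(graph)
```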

Another example of Recurrent and Transient States
[Figure: digraph on states A, B, C, D]
A and B are transient states; C and D are recurrent states. Once the process moves from B to D, it will never come back.

Example: modeling CpG Islands
In mammalian genomes, the dinucleotide CG often transforms to (methyl-C)G, which often subsequently mutates to TG. Hence CG appears less often than expected from the independent frequencies of C and G alone. For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as the upstream regions of many genes. These areas are called CpG islands.

Example: CpG Island
We consider two questions (and some variants):
Question 1: Given a short stretch of genomic data, does it come from a CpG island?
Question 2: Given a long piece of genomic data, does it contain CpG islands, where, and how long?
We “solve” the first question by modeling sequences with and without CpG islands as Markov chains over the same states {A, C, G, T} but with different transition probabilities.

Example: CpG Island
The “+” model: use transition matrix A+ = (a+_{st}), where a+_{st} = the probability that t follows s in a CpG island → positive samples.
The “-” model: use transition matrix A- = (a-_{st}), where a-_{st} = the probability that t follows s in a non-CpG island sequence → negative samples.

Example: CpG Island
With this model, to solve Question 1 we need to decide whether a given short sequence is more likely to come from the “+” model or from the “–” model. This is done using the definition of a Markov chain, with the parameters determined from training data.

Question 1: Using two Markov chains
A+ (for CpG islands): we need to specify p+(x_i | x_{i-1}), where + stands for CpG island. Rows are X_{i-1}, columns are X_i.
        A      C          G          T
A       0.18   0.27       0.43       0.12
C       0.17   p+(C|C)    0.274      p+(T|C)
G       0.16   p+(C|G)    p+(G|G)    p+(T|G)
T       0.08   p+(C|T)    p+(G|T)    p+(T|T)
(Recall: rows must add up to one; columns need not.)

Question 1: Using two Markov chains
A- (for non-CpG islands): …and for p-(x_i | x_{i-1}), where “-” stands for non-CpG island, we have the following. Rows are X_{i-1}, columns are X_i.
        A      C          G          T
A       0.3    0.2        0.29       0.21
C       0.32   p-(C|C)    0.078      p-(T|C)
G       0.25   p-(C|G)    p-(G|G)    p-(T|G)
T       0.18   p-(C|T)    p-(G|T)    p-(T|T)

Discriminating between the two models
Given a sequence x = (x_1 … x_L), compute the likelihood ratio
$$\mathrm{RATIO} = \frac{p(x \mid +\ \text{model})}{p(x \mid -\ \text{model})} = \frac{\prod_{i=1}^{L} p^{+}(x_i \mid x_{i-1})}{\prod_{i=1}^{L} p^{-}(x_i \mid x_{i-1})}$$
If RATIO > 1, a CpG island is more likely. In practice, the log of this ratio is computed.
Note: p+(x_1 | x_0) is defined for convenience as p+(x_1), and p-(x_1 | x_0) as p-(x_1).

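Log likelihood ratio test
Taking the logarithm yields
$$\log Q = \log \frac{p(x_1 \ldots x_L \mid +)}{p(x_1 \ldots x_L \mid -)} = \sum_{i=1}^{L} \log \frac{p^{+}(x_i \mid x_{i-1})}{p^{-}(x_i \mid x_{i-1})}$$
If log Q > 0, then + is more likely (CpG island). If log Q < 0, then - is more likely (non-CpG island).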
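
A minimal sketch of the log likelihood ratio test follows. The slides leave most entries of the two transition matrices symbolic, so the tables below are filled in with illustrative values: the entries visible on the slides are kept, and the remaining ones are assumptions chosen only so that each row sums to one. The sketch also assumes p+(x_1) = p-(x_1), so the first-letter term contributes zero to the sum.

```python
import math

# Rows are X_{i-1}, columns are X_i. Entries not shown on the slides are
# ILLUSTRATIVE ASSUMPTIONS picked so that every row sums to 1.
PLUS = {
    "A": {"A": 0.18, "C": 0.27, "G": 0.43, "T": 0.12},
    "C": {"A": 0.17, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.16, "C": 0.34, "G": 0.375, "T": 0.125},
    "T": {"A": 0.08, "C": 0.355, "G": 0.384, "T": 0.181},
}
MINUS = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.29, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.078, "T": 0.302},
    "G": {"A": 0.25, "C": 0.246, "G": 0.298, "T": 0.206},
    "T": {"A": 0.18, "C": 0.239, "G": 0.292, "T": 0.289},
}

def log_odds(seq, plus=PLUS, minus=MINUS):
    """Sum of log(p+(x_i | x_{i-1}) / p-(x_i | x_{i-1})) over adjacent pairs.

    Assumes equal initial distributions in both models, so the i = 1 term is 0.
    """
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        score += math.log(plus[prev][cur] / minus[prev][cur])
    return score

def looks_like_cpg_island(seq):
    """Decision rule from the slide: log Q > 0 favours the + model."""
    return log_odds(seq) > 0
```

A CG-rich string such as `"CGCGCG"` scores positive under these tables and an AT-rich string such as `"ATATAT"` scores negative. Dividing the score by the sequence length is a common way to compare sequences of different lengths, although the slides do not discuss it.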

Where do the parameters (transition probabilities) come from?
Learning from training data.
Source: a collection of sequences from CpG islands, and a collection of sequences from non-CpG islands.
Input: tuples of the form (x_1, …, x_L, h), where h is + or -.
Output: maximum likelihood estimates (MLE) of the parameters.
Count all pairs (X_i = a, X_{i-1} = b) with label + and with label -; say the counts are N_{ba,+} and N_{ba,-}. The maximum likelihood estimates are then the normalized counts:
$$a^{+}_{ba} = \frac{N_{ba,+}}{\sum_{a'} N_{ba',+}}, \qquad a^{-}_{ba} = \frac{N_{ba,-}}{\sum_{a'} N_{ba',-}}$$
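
The counting procedure can be sketched as below: adjacent pairs are tallied separately for each label and each row of counts is normalized. The function names and the toy training set are hypothetical; real training data would be annotated genomic sequence. The optional `pseudo` argument is an addition for convenience, mirroring the pseudo-count idea from the profile slide.

```python
def estimate_transitions(seqs, alphabet="ACGT", pseudo=0.0):
    """Maximum likelihood transition estimates: N_ba / sum over a' of N_ba'."""
    counts = {b: {a: pseudo for a in alphabet} for b in alphabet}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    probs = {}
    for b in alphabet:
        total = sum(counts[b].values())
        # A context never seen in training gets an all-zero row here;
        # a positive pseudo-count avoids that.
        probs[b] = {a: (counts[b][a] / total if total else 0.0) for a in alphabet}
    return probs

def train(labeled):
    """Split (sequence, label) tuples by label and fit one chain per label."""
    plus = [s for s, h in labeled if h == "+"]
    minus = [s for s, h in labeled if h == "-"]
    return estimate_transitions(plus), estimate_transitions(minus)

# Toy labeled data, invented for illustration.
labeled = [("CGCGC", "+"), ("ACGCG", "+"), ("ATATT", "-"), ("TTAAT", "-")]
a_plus, a_minus = train(labeled)
```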

CpG Island: Question 2
Now we solve the 2nd question.
Question 2: Given a long piece of genomic data, does it contain CpG islands, and where?
For this, we need to decide which parts of a given long sequence of letters are more likely to come from the “+” model, and which parts are more likely to come from the “–” model. This is done using a Hidden Markov Model, to be defined.

Question 2: Finding CpG Islands
Given a long genomic string with possible CpG islands, we define a Markov chain over 8 states: A+, C+, G+, T+, A-, C-, G-, T-.
[Figure: digraph over the eight states A+, C+, G+, T+, A-, C-, G-, T-]
The problem is that we do not know the sequence of states that is traversed, only the sequence of letters. Therefore we will use a Hidden Markov Model here.

Gene finding using codon frequency
Consider a sequence x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 …, where x_i is a nucleotide and p_{xyz} is the frequency of codon xyz. Let
p_1 = p_{x_1 x_2 x_3} p_{x_4 x_5 x_6} …
p_2 = p_{x_2 x_3 x_4} p_{x_5 x_6 x_7} …
p_3 = p_{x_3 x_4 x_5} p_{x_6 x_7 x_8} …
Then the probability that the ith reading frame is the coding frame is
$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$
Slide a window along the sequence and compute P_i.
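
For a single window, the three frame probabilities could be computed along these lines. The codon frequency table is a toy assumption (a real table has 64 entries estimated from known genes), the `floor` value for unseen codons is an assumption as well, and the window length is chosen as 3n + 2 so that every frame contains the same number of complete codons.

```python
import math

def frame_posteriors(window, codon_freq, floor=1e-6):
    """Return [P_1, P_2, P_3] with P_i = p_i / (p_1 + p_2 + p_3).

    p_i is the product of codon frequencies when the window is read starting
    at offset i - 1. Codons missing from codon_freq get the small floor value.
    """
    logs = []
    for offset in range(3):
        logp = 0.0
        for i in range(offset, len(window) - 2, 3):
            logp += math.log(codon_freq.get(window[i:i + 3], floor))
        logs.append(logp)
    # Normalise in log space so that long windows do not underflow.
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy codon table, invented for illustration.
codon_freq = {"ATG": 0.5, "GCC": 0.5}
window = "ATGGCCATGGC"  # length 11 = 3 * 3 + 2: three codons in every frame
P = frame_posteriors(window, codon_freq)
```

Here frame 1 reads ATG GCC ATG, all of which are in the table, so `P[0]` is close to 1.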

Inhomogeneous Markov chain: learning
[Figure: chain X_1 → X_2 → … → X_7, with the transition matrices a, b, c assigned to successive transitions according to codon position]

Inhomogeneous Markov chain: prediction
[Figure: chain X_1 → X_2 → … → X_7, with the transition matrices a, b, c aligned to the chain in three ways: reading frame 1, reading frame 2, reading frame 3]

Gene finding using inhomogeneous Markov chain
Consider a sequence x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 …, where x_i is a nucleotide. Let
p_1 = a_{x_1 x_2} b_{x_2 x_3} c_{x_3 x_4} a_{x_4 x_5} b_{x_5 x_6} c_{x_6 x_7} …
p_2 = b_{x_1 x_2} c_{x_2 x_3} a_{x_3 x_4} b_{x_4 x_5} c_{x_5 x_6} a_{x_6 x_7} …
p_3 = c_{x_1 x_2} a_{x_2 x_3} b_{x_3 x_4} c_{x_4 x_5} a_{x_5 x_6} b_{x_6 x_7} …
Then the probability that the ith reading frame is the coding frame is
$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$
M. Borodovsky, GeneMark (commonly used gene finder for bacterial genomes)
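
The three products differ only in which of the matrices a, b, c is applied to the first transition, which suggests the cyclic indexing used in the sketch below. This illustrates the formulas on this slide and is not GeneMark's actual implementation; the helper `biased` and the matrices it builds are hypothetical.

```python
import math

BASES = "ACGT"

def biased(prev, cur, p=0.7):
    """Hypothetical matrix: uniform rows, except that prev -> cur has probability p."""
    matrix = {s: {t: 0.25 for t in BASES} for s in BASES}
    matrix[prev] = {t: (p if t == cur else (1 - p) / 3) for t in BASES}
    return matrix

def frame_posteriors_imc(seq, a, b, c):
    """Return [P_1, P_2, P_3] for the three alignments of (a, b, c) to seq."""
    mats = [a, b, c]
    logs = []
    for frame in range(3):
        logp = 0.0
        for i, (prev, cur) in enumerate(zip(seq, seq[1:])):
            # Frame 1 applies a, b, c, a, ...; frame 2 applies b, c, a, ...;
            # frame 3 applies c, a, b, ...
            logp += math.log(mats[(i + frame) % 3][prev][cur])
        logs.append(logp)
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy matrices: a favours A->T, b favours T->G, c favours G->A.
a, b, c = biased("A", "T"), biased("T", "G"), biased("G", "A")
P_imc = frame_posteriors_imc("ATGATGA", a, b, c)
```

In this toy case every transition of ATGATGA matches the favoured transition of the matrix applied in frame 1, so `P_imc[0]` dominates.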
