Markov Chains
Algorithms in Computational Biology, Spring 2006
Slides were edited by Itai Sharon from Dan Geiger and Ydo Wexler.

Slide 2: Dependencies Along Biological Sequences
So far we have assumed that every letter in a sequence is sampled independently from some distribution q(·). This model may suffice for alignment scoring, but it does not hold in real genomes: there are special subsequences in which dependencies between nucleotides exist.
Example 1: TATA within the regulatory region upstream of a gene.
Example 2: CG pairs.
We model such dependencies with Markov chains and hidden Markov models (HMMs).

Slide 3: Markov Chains
A Markov chain is a sequence of random variables X1, X2, ..., Xn in which the next state depends only on the current one:
P(xi | x1...xi-1) = P(xi | xi-1)
The general case is a k-th order Markov process, in which the next state depends on the previous k states:
P(xi | x1...xi-1) = P(xi | xi-k...xi-1)
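As a small illustration of the first-order property, here is a minimal sketch that samples a sequence in which each letter is drawn from a distribution conditioned only on the previous letter. The two-state chain and its probabilities are made up for illustration; they are not from the slides.

```python
import random

def sample_chain(init, trans, n):
    """Sample a length-n sequence from a first-order Markov chain.

    init:  dict state -> initial probability p(s)
    trans: dict state -> dict state -> transition probability a_st
    """
    states = list(init)
    x = random.choices(states, weights=[init[s] for s in states])[0]
    seq = [x]
    for _ in range(n - 1):
        # The next state depends only on the current state x.
        nxt = list(trans[x])
        x = random.choices(nxt, weights=[trans[x][s] for s in nxt])[0]
        seq.append(x)
    return seq

# Hypothetical two-state chain, for illustration only.
init = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
print("".join(sample_chain(init, trans, 20)))
```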

Slide 4: Markov Chains
An integer-time stochastic process, consisting of:
A domain D of m > 1 states {s1, ..., sm}
An m-dimensional initial distribution vector (p(s1), ..., p(sm))
An m x m transition probability matrix M = (a_st)
For example: D can be the letters {A, C, G, T}; p(A) is the probability that A is the first letter in a sequence; a_AG is the probability that G follows A in a sequence.

Slide 5: Markov Chains
For each integer n, a Markov chain assigns a probability to sequences (x1 ... xn) over D (i.e., xi ∈ D) as follows:
P(x1 ... xn) = P(x1) · Π_{i=2..n} a_{x_{i-1} x_i}
Similarly, (X1, X2, ...) defines a sequence of probability distributions over D. There is a rich theory which studies the properties of these sequences.
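The product formula above can be evaluated directly. A minimal sketch, reusing the same hypothetical two-state chain (its numbers are illustrative, not from the slides):

```python
def chain_probability(seq, init, trans):
    """P(x1..xn) = p(x1) * product over i >= 2 of a[x_{i-1}][x_i]."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

# Hypothetical two-state chain, for illustration only.
init = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
print(chain_probability("AAB", init, trans))  # 0.5 * 0.9 * 0.1 = 0.045
```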

Slide 6: Matrix Representation
The transition probability matrix M = (a_st). M is a stochastic matrix: every entry is non-negative and every row sums to 1.
The initial distribution vector (u1, ..., um) defines the distribution of X1: P(X1 = si) = ui. After one move, the distribution changes to X2 = X1 M.
Example (rows and columns ordered A, B, C, D):

      A     B     C     D
A     0     1     0     0
B     0.8   0     0.2   0
C     0.3   0     0.5   0.2
D     0     0.05  0     0.95

Slide 7: Matrix Representation
Example, using the matrix of slide 6: if X1 = (0, 1, 0, 0) then X2 = X1 M = (0.8, 0, 0.2, 0), and if X1 = (0, 0, 0.5, 0.5) then X2 = (0.15, 0.025, 0.25, 0.575).
In general, the i-th distribution is Xi = X1 M^(i-1).
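The propagation Xi = X1 M^(i-1) can be checked numerically. A sketch assuming NumPy, with the 4-state matrix from slide 6 (state order A, B, C, D as on the slide):

```python
import numpy as np

# Transition matrix from slide 6; rows and columns ordered A, B, C, D.
M = np.array([[0.0, 1.0,  0.0, 0.0 ],
              [0.8, 0.0,  0.2, 0.0 ],
              [0.3, 0.0,  0.5, 0.2 ],
              [0.0, 0.05, 0.0, 0.95]])

x1 = np.array([0.0, 1.0, 0.0, 0.0])     # start deterministically in state B
x2 = x1 @ M                              # one step: X2 = X1 M
xi = x1 @ np.linalg.matrix_power(M, 9)   # distribution of X10 = X1 M^9
print(x2)                                # (0.8, 0, 0.2, 0), i.e. row B of M
```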

Slide 8: Representing a Markov Chain as a Digraph
Each state is a node, and each directed edge A → B is associated with the transition probability from A to B. For the matrix of slide 6, the edges are:
A → B (1), B → A (0.8), B → C (0.2), C → A (0.3), C → C (0.5), C → D (0.2), D → B (0.05), D → D (0.95).

Slide 9: Markov Chains, a Weather Example
Weather forecast:
Raining today: 40% rain tomorrow, 60% no rain tomorrow.
No rain today: 20% rain tomorrow, 80% no rain tomorrow.
This is a stochastic finite-state machine with two states, rain and no rain, and transition probabilities rain → rain 0.4, rain → no rain 0.6, no rain → rain 0.2, no rain → no rain 0.8.
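The one-day forecast rule can be iterated to forecast further ahead. A sketch assuming NumPy that repeatedly applies v ← vM to today's weather distribution, using the numbers from the slide's FSM:

```python
import numpy as np

# Weather chain from slide 9; rows and columns ordered (rain, no rain).
M = np.array([[0.4, 0.6],
              [0.2, 0.8]])

v = np.array([1.0, 0.0])        # suppose it is raining today
for day in range(50):           # forecast 50 days ahead: v <- vM each day
    v = v @ M
print(v)                        # the forecast settles near (0.25, 0.75)
```

After enough iterations the forecast no longer depends on today's weather, a first glimpse of the convergence discussed on later slides.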

Slide 10: Markov Chains, a Gambler Example
A gambler starts with $10. At each play, one of the following happens:
The gambler wins $1 with probability p.
The gambler loses $1 with probability 1 - p.
The game ends when the gambler goes broke ($0) or gains a fortune of $100. The states are 0, 1, 2, ..., 99, 100, with transitions i → i+1 (probability p) and i → i-1 (probability 1 - p); the start state is $10, and 0 and 100 are absorbing.
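A quick simulation of the gambler's walk; this is a sketch, and the function name, default values, and fixed seed are arbitrary choices for reproducibility:

```python
import random

def gamblers_ruin(start=10, goal=100, p=0.5, rng=random.Random(0)):
    """Play $1 rounds until broke (0) or reaching the goal; return final bankroll."""
    money = start
    while 0 < money < goal:
        money += 1 if rng.random() < p else -1
    return money

# Estimate P(reach $100 before going broke) from $10 in a fair game (p = 0.5);
# for a fair game, theory gives start/goal = 10/100 = 0.1.
wins = sum(gamblers_ruin() == 100 for _ in range(2000))
print(wins / 2000)
```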

Slide 11: Properties of Markov Chain States
States of Markov chains are classified by the digraph representation (omitting the actual probability values).
Recurrent states: s is recurrent if it is accessible from every state that is accessible from s. In the digraph of slide 8, C and D are recurrent states.
Transient states: s is transient if it is visited only a finite number of times as n → ∞. A and B are transient states.

Slide 12: Irreducible Markov Chains
A Markov chain is irreducible if the corresponding graph is strongly connected (and thus all of its states are recurrent).

Slide 13: Properties of Markov Chain States
A state s has period k if k is the GCD of the lengths of all cycles that pass through s.
Periodic states: a state is periodic if its period is k > 1. In the graph shown on the slide, the period of A is 2.
Aperiodic states: a state is aperiodic if its period is k = 1. In that graph, the period of F is 1.

Slide 14: Ergodic Markov Chains
A Markov chain is ergodic if:
The corresponding graph is irreducible, and
it is not periodic.
Ergodic Markov chains are important because they guarantee that the corresponding Markovian process converges to a unique distribution, in which all states have strictly positive probability.

Slide 15: Stationary Distributions for Markov Chains
Let M be a Markov chain with m states, and let V = (v1, ..., vm) be a probability distribution over the m states.
V is a stationary distribution for M if VM = V, i.e., one step of the process does not change the distribution.
Equivalently: V is a stationary distribution if and only if V is a left (row) eigenvector of M with eigenvalue 1.
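The eigenvector characterization suggests a direct way to compute a stationary distribution numerically. A sketch assuming NumPy, demonstrated on the weather chain of slide 9:

```python
import numpy as np

def stationary(M):
    """Left eigenvector of M for eigenvalue 1, normalized to sum to 1."""
    w, vecs = np.linalg.eig(M.T)              # left eigenvectors of M are
    v = vecs[:, np.argmin(np.abs(w - 1.0))]   # right eigenvectors of M.T
    v = v.real
    return v / v.sum()

# Weather chain from slide 9: states (rain, no rain).
M = np.array([[0.4, 0.6],
              [0.2, 0.8]])
print(stationary(M))                          # (0.25, 0.75), and VM = V
```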

Slide 16: “Good” Markov Chains
A Markov chain is good if the distributions Xi satisfy the following as i → ∞:
They converge to a unique distribution, independent of the initial distribution.
In that unique distribution, each state has positive probability.
The Fundamental Theorem of Finite Markov Chains: a Markov chain is good if and only if the corresponding graph is ergodic.

Slide 17: “Bad” Markov Chains
A Markov chain is not “good” if either:
It does not converge to a unique distribution, or
It converges to a unique distribution, but some states have zero probability in that distribution.
For instance: chains with periodic states, and chains with transient states.

Slide 18: An Example, Searching the Genome for CpG Islands
In the human genome, the pair CG appears less often than expected from the independent frequencies of C and G alone: the pair CG often transforms to (methyl-C)G, which in turn often transforms to TG.
For biological reasons, this process is suppressed in short stretches of the genome, such as the start regions of many genes. These areas are called CpG islands (the p denotes the phosphodiester bond between the two nucleotides).

Slide 19: CpG Islands
We consider two questions (and some variants):
Question 1: Given a short stretch of genomic data, does it come from a CpG island?
Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length?
We “solve” the first question by modeling strings with and without CpG islands as Markov chains over the same states {A, C, G, T} but with different transition probabilities.

Slide 20: CpG Islands
The “+” model: use transition matrix A+ = (a+_st), where a+_st is the probability that t follows s inside a CpG island.
The “-” model: use transition matrix A- = (a-_st), where a-_st is the probability that t follows s outside a CpG island.

Slide 21: CpG Islands
To solve Question 1 we must decide whether a given short sequence of letters is more likely to come from the “+” model or from the “-” model. This is done using the Markov chain definitions, with parameters determined from known data, together with a log odds-ratio test.
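The slide says the parameters are determined from known data; the standard maximum-likelihood estimate sets a_st = c_st / Σ_t' c_st', where c_st counts how often t follows s in the training sequences. A sketch (the training strings here are toy data, not real CpG-island sequences):

```python
from collections import Counter

def estimate_transitions(sequences, alphabet="ACGT"):
    """Maximum-likelihood transition matrix from training sequences:
    a_st = c_st / sum over t' of c_st', with c_st the count of t after s."""
    counts = Counter()
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s, t] += 1
    # max(1, ...) avoids division by zero for letters never seen as a prefix.
    return {s: {t: counts[s, t] / max(1, sum(counts[s, u] for u in alphabet))
                for t in alphabet}
            for s in alphabet}

# Toy training data standing in for labeled CpG-island sequences.
plus = estimate_transitions(["ACGCGCG", "CGCGTA"])
print(plus["C"]["G"])   # every C in the toy data is followed by G
```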

Slide 22: CpG Islands, the “+” Model
We need to specify p+(xi | xi-1), where + stands for CpG island. From Durbin et al. we have:

+     A      C      G      T
A     0.180  0.274  0.426  0.120
C     0.171  0.368  0.274  0.188
G     0.161  0.339  0.375  0.125
T     0.079  0.355  0.384  0.182

Slide 23: CpG Islands, the “-” Model
p-(xi | xi-1) for non-CpG islands is given by:

-     A      C      G      T
A     0.300  0.205  0.285  0.210
C     0.322  0.298  0.078  0.302
G     0.248  0.246  0.298  0.208
T     0.177  0.239  0.292  0.292

Slide 24: CpG Islands
Given a string X = (x1, ..., xL), now compute the ratio
RATIO = P(X | + model) / P(X | - model) = Π_{i=2..L} a+_{x_{i-1} x_i} / Π_{i=2..L} a-_{x_{i-1} x_i}
RATIO > 1: a CpG island is more likely.
RATIO < 1: a non-CpG island is more likely.
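The log odds-ratio test from slide 21 can be written out directly with the two tables from slides 22 and 23. A sketch; the choice of log base 2 follows common practice but is otherwise arbitrary:

```python
import math

# Transition matrices from slides 22 and 23 (Durbin et al.);
# outer key is the previous letter, inner key the next letter.
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}

def log_odds(x):
    """Sum over i of log2( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} );
    a positive score means the '+' (CpG island) model fits better."""
    return sum(math.log2(PLUS[s][t] / MINUS[s][t]) for s, t in zip(x, x[1:]))

print(log_odds("CGCGCG"))   # positive: looks like a CpG island
print(log_odds("TATATA"))   # negative: looks like background sequence
```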

