
1 A Hidden Markov Model for Progressive Multiple Alignment Ari Löytynoja and Michel C. Milinkovitch Appeared in Bioinformatics, Vol. 19, No. 12, 2003 Presented by Sowmya Venkateswaran, April 20, 2006

2 Outline
Motivation
Drawbacks of existing methods
System and Methods: Substitution Model; Hidden Markov Model; Pairwise Alignment using the Viterbi Algorithm; Posterior Probability; Multiple Alignment
Results
Discussion

3 Motivation Progressive alignment techniques are used for multiple sequence alignment, to deduce phylogenies, and to identify protein families. Probabilistic methods can be used to estimate the reliability of global/local alignments.

4 Drawbacks of existing systems Iterative application of global/local pairwise sequence alignment algorithms does not guarantee a globally optimal alignment. The best-scoring alignment may not correspond to the true alignment, so the reliability of a score/alignment needs to be inferred.

5 System and Methods The idea is to provide a probabilistic framework for a guide tree and to define a vector of probabilities at each character site. The guide tree is constructed by neighbor-joining clustering after producing a distance matrix; it can also be imported from CLUSTALW. At each internal node, a probabilistic alignment is performed. Pointers from parent to child sites are stored, along with a vector of probabilities for the different character states ('A/C/G/T/-' for nucleotides, or the 20 amino acids plus the gap).

6 Substitution Model Consider two sequences x_{1…n} and y_{1…m} whose alignment we want to find, and whose parent in the guide tree is z_{1…l}. p_a(x_i) is the probability that site x_i contains character a. At a terminal node, p_a(x_i) = 1 if character a is observed at the site, and 0 otherwise. At internal nodes, different characters have different probabilities, summing to 1. If the observed character is ambiguous, the probability is shared among the compatible characters.
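The terminal-node rule above can be sketched in a few lines of Python (a hypothetical helper; the alphabet ordering and the ambiguity codes shown are illustrative, not taken from the paper):

```python
# Illustrative alphabet; the paper's gap state '-' is included.
ALPHABET = ['A', 'C', 'G', 'T', '-']

def leaf_profile(char):
    """p_a(x_i) at a terminal node: 1.0 for the observed character,
    0.0 elsewhere; ambiguous codes share the probability mass."""
    if char in ALPHABET:
        return [1.0 if a == char else 0.0 for a in ALPHABET]
    # Example IUPAC ambiguity codes (R = A or G, Y = C or T, N = any base).
    ambiguity = {'R': ['A', 'G'], 'Y': ['C', 'T'], 'N': ['A', 'C', 'G', 'T']}
    states = ambiguity[char]
    return [1.0 / len(states) if a in states else 0.0 for a in ALPHABET]

# An observed 'A' gives a one-hot vector; 'R' splits the mass over A and G.
print(leaf_profile('A'))  # [1.0, 0.0, 0.0, 0.0, 0.0]
print(leaf_profile('R'))  # [0.5, 0.0, 0.5, 0.0, 0.0]
```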

7 Emission Probabilities
p_{x_i,y_j} represents the probability that x_i and y_j are aligned:
p_{x_i,y_j} = p_{z_k}(x_i,y_j) = Σ_a p_{z_k=a}(x_i,y_j)
p_{z_k=a}(x_i,y_j) = q_a [ Σ_b s_ab p_b(x_i) ] [ Σ_b s_ab p_b(y_j) ]
where q_a is the character background probability, and s_ab, the probability of aligning characters a and b, is calculated with the Jukes-Cantor model:
s_ab = 1/n + (n-1)/n · e^{-(n/(n-1)) v}   when a = b
s_ab = 1/n - 1/n · e^{-(n/(n-1)) v}   when a ≠ b
where n is the size of the alphabet and v is the NJ-estimated branch length.
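The Jukes-Cantor probabilities above are easy to compute directly; a minimal sketch (hypothetical function name), with a sanity check that each row of the implied substitution matrix sums to 1:

```python
import math

def jc_substitution(n, v):
    """Jukes-Cantor alignment probabilities s_ab for alphabet size n and
    NJ-estimated branch length v, following the slide's formulas."""
    same = 1.0 / n + (n - 1.0) / n * math.exp(-(n / (n - 1.0)) * v)
    diff = 1.0 / n - 1.0 / n * math.exp(-(n / (n - 1.0)) * v)
    return same, diff

# At branch length 0 no substitution has occurred: s_aa = 1, s_ab = 0.
# For any v, a row of the matrix sums to 1: s_aa + (n-1) * s_ab = 1.
same, diff = jc_substitution(4, 0.5)
assert abs(same + 3 * diff - 1.0) < 1e-12
```

As the branch length grows, `same` decays toward the background value 1/n, which is why more divergent child nodes contribute flatter emission probabilities.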

8 Probabilities
To find p_{x_i,-}, the probability that z_k evolved into a character on one child site and a gap on the other:
p_{z_k=a}(x_i,-) = q_a [ Σ_b s_ab p_b(x_i) ] s_{a-}
and similarly for p_{-,y_j}; s_{a-} is computed just like s_ab. Any other model can be used to calculate s_ab instead of the Jukes-Cantor model. For example, a 20×20 PAM substitution matrix can be extended with gaps into a 21×21 matrix, and the substitution probabilities derived from that.

9 Hidden Markov Model
[State diagram] Three states: M (match, emission p_{x_i,y_j}), X (insertion in x, emission p_{x_i,-}) and Y (insertion in y, emission p_{-,y_j}). Transitions: M→X and M→Y with probability δ, M→M with 1-2δ, X→X and Y→Y with ε, X→M and Y→M with 1-ε.

10 Hidden Markov Model
δ - probability of moving to an insert state (gap-opening penalty); the lower the value, the higher the penalty.
ε - probability of staying in an insert state (gap-extension penalty); again, the lower the value, the higher the extension penalty.
p_{x_i,y_j}, p_{x_i,-}, p_{-,y_j} - emission frequencies for the match, insert-X and insert-Y states.
For testing purposes, δ and ε were estimated from pairwise alignments of the terminal sequences as δ = 1/(2(l_m+1)) and ε = 1 - 1/(l_g+1), where l_m and l_g are the mean lengths of match and gap segments.
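The parameter estimates can be sketched as follows (a hypothetical helper, reading the slide's formulas as δ = 1/(2(l_m + 1)) and ε = 1 - 1/(l_g + 1); the segment lengths passed in are made-up values):

```python
def estimate_gap_params(mean_match_len, mean_gap_len):
    """Estimate the HMM transition parameters from the mean lengths of
    match segments (l_m) and gap segments (l_g) in pairwise alignments."""
    delta = 1.0 / (2.0 * (mean_match_len + 1.0))
    epsilon = 1.0 - 1.0 / (mean_gap_len + 1.0)
    return delta, epsilon

# Longer match segments -> smaller delta (opening a gap is costlier);
# longer gap segments -> epsilon closer to 1 (extending a gap is cheaper).
delta, epsilon = estimate_gap_params(9.0, 3.0)  # delta = 0.05, epsilon = 0.75
```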

11 Pairwise Alignment In this probabilistic model, the best alignment between two sequences corresponds to the Viterbi path through the HMM. Since the model has three states and each state needs 2-D space, there are three 2-D tables: v_M for match states, and v_X and v_Y for the gap states. A move within the M, X or Y table produces an additional match or extends an existing gap; a move between the M table and either the X or Y table closes or opens a gap.

12 Viterbi Recursion
Initialization: v_M(0,0) = 1; v(i,-1) = v(-1,j) = 0
Recursion:
v_M(i,j) = p_{x_i,y_j} max { (1-2δ) v_M(i-1,j-1), (1-ε) v_X(i-1,j-1), (1-ε) v_Y(i-1,j-1) }
v_X(i,j) = p_{x_i,-} max { δ v_M(i-1,j), ε v_X(i-1,j) }
v_Y(i,j) = p_{-,y_j} max { δ v_M(i,j-1), ε v_Y(i,j-1) }
Termination: v_E = max( v_M(n,m), v_X(n,m), v_Y(n,m) )
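A minimal log-space implementation of this recursion, assuming a toy emission model with constant match/mismatch/gap probabilities rather than the paper's profile-based p_{x_i,y_j}; the function name and parameter values are illustrative:

```python
import math

def pair_viterbi(x, y, delta=0.05, eps=0.3,
                 p_match=0.2, p_mismatch=0.02, p_gap=0.05):
    """Viterbi over the three-state pair HMM; returns log v_E."""
    n, m = len(x), len(y)
    NEG = float('-inf')
    vM = [[NEG] * (m + 1) for _ in range(n + 1)]
    vX = [[NEG] * (m + 1) for _ in range(n + 1)]
    vY = [[NEG] * (m + 1) for _ in range(n + 1)]
    vM[0][0] = 0.0  # log 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:  # match state: consume x_i and y_j
                em = p_match if x[i - 1] == y[j - 1] else p_mismatch
                vM[i][j] = math.log(em) + max(
                    math.log(1 - 2 * delta) + vM[i - 1][j - 1],
                    math.log(1 - eps) + vX[i - 1][j - 1],
                    math.log(1 - eps) + vY[i - 1][j - 1])
            if i > 0:            # insert-X state: gap in y, consume x_i
                vX[i][j] = math.log(p_gap) + max(
                    math.log(delta) + vM[i - 1][j],
                    math.log(eps) + vX[i - 1][j])
            if j > 0:            # insert-Y state: gap in x, consume y_j
                vY[i][j] = math.log(p_gap) + max(
                    math.log(delta) + vM[i][j - 1],
                    math.log(eps) + vY[i][j - 1])
    return max(vM[n][m], vX[n][m], vY[n][m])  # termination: log v_E
```

Identical sequences score strictly higher than divergent ones, as expected from a path of pure matches versus mismatches or gaps.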

13 Viterbi traceback
At each cell, the relative probabilities of entering from the different predecessor cells are stored, e.g.
p_{M-M} = (1-2δ) v_M(i-1,j-1) / N(i,j)
where N(i,j) is the normalizing constant:
N(i,j) = (1-2δ) v_M(i-1,j-1) + (1-ε) [ v_X(i-1,j-1) + v_Y(i-1,j-1) ]
The corresponding values are calculated for each of the three tables. The traceback algorithm then finds the best path; a match step creates pointers from the parent site to both child sites, and a gap step creates a pointer to one child site and a gap for the second child site.

14 Posterior Probabilities - Forward algorithm
The forward algorithm sums the probabilities of all paths entering a given cell from the start position.
Initialization: f(0,0) = 1; f(i,-1) = f(-1,j) = 0
Recursion: for i = 0,…,n and j = 0,…,m, except (0,0):
f_M(i,j) = p_{x_i,y_j} [ (1-2δ) f_M(i-1,j-1) + (1-ε) ( f_X(i-1,j-1) + f_Y(i-1,j-1) ) ]
f_X(i,j) = p_{x_i,-} [ δ f_M(i-1,j) + ε f_X(i-1,j) ]
f_Y(i,j) = p_{-,y_j} [ δ f_M(i,j-1) + ε f_Y(i,j-1) ]
Termination: f_E = f_M(n,m) + f_X(n,m) + f_Y(n,m)

15 Backward algorithm
Sum of probabilities of all possible alignments between subsequences x_{i…n} and y_{j…m}.
Initialization: b(n,m) = 1; b(i,m+1) = b(n+1,j) = 0
Recursion: for i = n,…,1 and j = m,…,1, except (n,m):
b_M(i,j) = (1-2δ) p_{x_{i+1},y_{j+1}} b_M(i+1,j+1) + δ [ p_{x_{i+1},-} b_X(i+1,j) + p_{-,y_{j+1}} b_Y(i,j+1) ]
b_X(i,j) = (1-ε) p_{x_{i+1},y_{j+1}} b_M(i+1,j+1) + ε p_{x_{i+1},-} b_X(i+1,j)
b_Y(i,j) = (1-ε) p_{x_{i+1},y_{j+1}} b_M(i+1,j+1) + ε p_{-,y_{j+1}} b_Y(i,j+1)

16 Reliability Check
Assumption: the posterior probability of the sites on the alignment path is a valid estimator of the local reliability of the alignment, since it gives the proportion of the total probability corresponding to all alignments passing through cell (i,j).
The posterior probability of a match is:
P(x_i ◊ y_j | x, y) = f_M(i,j) b_M(i,j) / f_E
where f_M(i,j) and b_M(i,j) are the total probabilities of all possible alignments between the subsequences x_{1…i} and y_{1…j}, and between x_{i…n} and y_{j…m}, respectively. Similar probabilities are calculated for the Insert X and Insert Y states.
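The forward, backward, and posterior computations can be combined into one self-contained sketch over the same toy pair HMM (constant emission probabilities, not the paper's profile-based ones; function names are hypothetical). Probability space is used, so it only suits short sequences:

```python
def match_posteriors(x, y, delta=0.05, eps=0.3,
                     p_match=0.2, p_mismatch=0.02, p_gap=0.05):
    """Posterior match probabilities f_M(i,j) b_M(i,j) / f_E."""
    n, m = len(x), len(y)
    def em(i, j):  # toy match emission p_{x_i,y_j}, 1-based indices
        return p_match if x[i - 1] == y[j - 1] else p_mismatch

    # Forward pass: sum over all paths from the start to each cell.
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = em(i, j) * ((1 - 2 * delta) * fM[i - 1][j - 1]
                                       + (1 - eps) * (fX[i - 1][j - 1]
                                                      + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = p_gap * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = p_gap * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    fE = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward pass: sum over all paths from each cell to the end.
    bM = [[0.0] * (m + 2) for _ in range(n + 2)]
    bX = [[0.0] * (m + 2) for _ in range(n + 2)]
    bY = [[0.0] * (m + 2) for _ in range(n + 2)]
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            match = em(i + 1, j + 1) * bM[i + 1][j + 1] if (i < n and j < m) else 0.0
            gx = p_gap * bX[i + 1][j] if i < n else 0.0
            gy = p_gap * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * match + delta * (gx + gy)
            bX[i][j] = (1 - eps) * match + eps * gx
            bY[i][j] = (1 - eps) * match + eps * gy

    # Posterior that x_i is matched to y_j, for i = 1..n, j = 1..m.
    return [[fM[i][j] * bM[i][j] / fE for j in range(1, m + 1)]
            for i in range(1, n + 1)]
```

Since f_M(i,j) b_M(i,j) is the probability mass of exactly the alignments passing through the match cell (i,j), each posterior is guaranteed to lie between 0 and 1.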

17 Multiple alignment
Each parent-node site has a vector of probabilities over the possible character states (including the gap). For a match,
p_a(z_k) = p_{z_k=a}(x_i,y_j) / Σ_b p_{z_k=b}(x_i,y_j)
Pairwise alignment builds the tree progressively, from the terminal nodes towards an arbitrary root. Once the root node is defined, a traceback finds the multiple alignment of the nodes below it, since each node stores pointers to the matching child sites. If a gap occurs at an internal node, a gap character state is introduced into all sequences of that subtree, and the recursive call does not proceed further down that branch.
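The parent-profile normalisation above is a one-line operation; the joint probabilities below are made-up numbers for illustration:

```python
def parent_profile(joint):
    """Normalise per-character joint probabilities p_{z_k=a}(x_i,y_j)
    into the parent-site probability vector p_a(z_k)."""
    total = sum(joint.values())
    return {a: p / total for a, p in joint.items()}

# Hypothetical unnormalised joint probabilities for a matched site pair:
profile = parent_profile({'A': 0.06, 'C': 0.01, 'G': 0.02, 'T': 0.01, '-': 0.0})
# The vector now sums to 1, ready to serve as p_b(.) one level up the tree.
```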

18 Testing The algorithms were tested on (i) simulated nucleotide sequences: 50 random data sets generated with the program Rose. A random root sequence (length 500) was evolved on a random tree to yield sequences of "low" (0.5 substitutions per site) and "high" (1.0) divergence; the insertion/deletion length distribution was set to 'short' or 'long'. (ii) Amino acid data sets from Ref1 of the BAliBASE database. Ref1 contains alignments of fewer than six equidistant sequences, i.e., the percent identity between any two sequences is within a specified range, with no large insertions or deletions. The data sets were divided into three groups by length, and further into three by similarity.

19 Results of Simulation on Nucleotide Sequences

20 Type 1 and Type 2 errors vs. minimum posterior probability

21 Performance and Future Work ProAlign performs better than ClustalW on the nucleotide sequences, but not on amino acid sequences with sequence identity below 25%. A possible reason is that the model does not take protein secondary structure into account, so the HMM could be extended to model secondary structure as well. The minimum posterior probability correlates well with correctness and can be used to detect/remove unreliably aligned regions.

