A Hidden Markov Model for Progressive Multiple Alignment. Ari Löytynoja and Michel C. Milinkovitch. Appeared in Bioinformatics, Vol. 19, No. 12, 2003. Presented by Sowmya Venkateswaran, April 20, 2006.

Outline
Motivations
Drawbacks of existing methods
System and Methods: Substitution Model; Hidden Markov Model; Pairwise Alignment using the Viterbi Algorithm; Posterior Probability; Multiple Alignment
Results
Discussion

Motivation. Progressive alignment techniques are used for multiple sequence alignment, which in turn is used to deduce phylogenies and to identify protein families. Probabilistic methods can be used to estimate the reliability of global/local alignments.

Drawbacks of Existing Systems. Iterative application of global/local pairwise sequence alignment algorithms does not guarantee a globally optimal alignment. Moreover, the best-scoring alignment may not correspond to the true alignment, so the reliability of a score/alignment needs to be inferred.

System and Methods. The idea is to provide a probabilistic framework around a guide tree and to define a vector of probabilities at each character site. The guide tree is constructed by neighbour-joining clustering after producing a distance matrix; it can also be imported from ClustalW. At each internal node a probabilistic alignment is performed: pointers from parent to child sites are stored, as is a vector of probabilities of the different character states ('A/C/T/G/-' for nucleotides, or the 20 amino acids plus a gap).

Substitution Model. Consider two sequences x_1…n and y_1…m, whose alignment we would like to find, and let their parent in the guide tree be z_1…l. p_a(x_i) is the probability that site x_i contains character a. At a terminal node, p_a(x_i) = 1 if character a is observed at x_i, and 0 otherwise. At internal nodes, the different characters have different probabilities that sum to 1. If the observed character is ambiguous, the probability is shared among the possible characters.
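A minimal sketch of this rule (not the authors' code; the ambiguity codes shown are just illustrative examples):

```python
# Minimal sketch, assuming a nucleotide alphabet with a gap state:
# per-site character probability vectors for a terminal sequence.
# Unambiguous characters get probability 1; ambiguity codes share it equally.
ALPHABET = ['A', 'C', 'T', 'G', '-']
AMBIGUITY = {'R': 'AG', 'Y': 'CT', 'N': 'ACGT'}     # illustrative subset

def site_vector(char):
    """Return {state: probability} for one observed character x_i."""
    if char in ALPHABET:
        return {a: (1.0 if a == char else 0.0) for a in ALPHABET}
    options = AMBIGUITY[char]
    return {a: (1.0 / len(options) if a in options else 0.0) for a in ALPHABET}

print(site_vector('A'))   # probability 1.0 on 'A'
print(site_vector('R'))   # probability 0.5 on each of 'A' and 'G'
```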

Emission Probabilities. p_{x_i,y_j} represents the probability that x_i and y_j are aligned:

p_{x_i,y_j} = p_{z_k}(x_i, y_j) = Σ_a p_{z_k=a}(x_i, y_j)
p_{z_k=a}(x_i, y_j) = q_a [Σ_b s_ab p_b(x_i)] [Σ_b s_ab p_b(y_j)]

q_a is the character background probability, and s_ab, the probability of aligning characters a and b, is calculated with the Jukes-Cantor model:

s_ab = 1/n + ((n-1)/n) e^(-(n/(n-1)) v)   when a = b
s_ab = 1/n - (1/n) e^(-(n/(n-1)) v)       when a ≠ b

where n is the size of the alphabet and v is the NJ-estimated branch length.
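A sketch of these formulas, assuming a plain four-letter nucleotide alphabet, uniform background frequencies q_a, and a single branch length v shared by both children; all names are illustrative rather than taken from ProAlign:

```python
# Sketch under the assumptions stated above (not the paper's implementation).
import math

ALPHABET = ['A', 'C', 'G', 'T']
n = len(ALPHABET)
q = {a: 1.0 / n for a in ALPHABET}        # assumed uniform background frequencies

def s(a, b, v):
    """Jukes-Cantor probability of child character b given parent character a."""
    decay = math.exp(-(n / (n - 1.0)) * v)
    if a == b:
        return 1.0 / n + (n - 1.0) / n * decay
    return 1.0 / n - decay / n

def match_emission(p_x, p_y, v):
    """p_{x_i,y_j}: probability that child sites x_i and y_j descend from one parent site."""
    total = 0.0
    for a in ALPHABET:                    # sum over possible parent characters a
        from_x = sum(s(a, b, v) * p_x.get(b, 0.0) for b in ALPHABET)
        from_y = sum(s(a, b, v) * p_y.get(b, 0.0) for b in ALPHABET)
        total += q[a] * from_x * from_y
    return total

p_x = {'A': 1.0}                          # terminal site observed as 'A'
p_y = {'A': 0.5, 'G': 0.5}                # e.g. an ambiguous or internal site
print(match_emission(p_x, p_y, v=0.1))
```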

Probabilities. p_{x_i,-}, the probability that z_k evolved to a character on one of the child sites and a gap on the other, is

p_{z_k=a}(x_i, -) = q_a [Σ_b s_ab p_b(x_i)] s_{a-}

and the same applies for p_{-,y_j}. s_{a-} is computed just like s_ab. Any other model can be used for the calculation of s_ab instead of the Jukes-Cantor model; for example, a PAM (20×20) substitution matrix can be modified to include gaps, transformed into a (21×21) matrix, and the substitution probabilities derived from that.
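The gap-emission term can be sketched in the same way; here s_gap is a hypothetical stand-in for s_{a-}, since the slides only say that it is computed like s_ab:

```python
# Sketch of p_{z_k=a}(x_i, -) = q_a * (sum_b s_ab p_b(x_i)) * s_{a-}.
# s_gap is an assumed placeholder value for s_{a-}; ALPHABET, q and s()
# are the same illustrative definitions used in the previous sketch.
import math

ALPHABET = ['A', 'C', 'G', 'T']
n = len(ALPHABET)
q = {a: 1.0 / n for a in ALPHABET}

def s(a, b, v):
    decay = math.exp(-(n / (n - 1.0)) * v)
    return (1.0 / n + (n - 1.0) / n * decay) if a == b else (1.0 / n - decay / n)

def gap_emission(p_x, v, s_gap=0.05):
    """p_{x_i,-}: the parent character survives in x but becomes a gap in y."""
    return sum(q[a] * sum(s(a, b, v) * p_x.get(b, 0.0) for b in ALPHABET) * s_gap
               for a in ALPHABET)

print(gap_emission({'A': 1.0}, v=0.1))
```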

Hidden Markov Model. [Pair-HMM state diagram: a match state M emitting p_{x_i,y_j}, and two insert states X and Y emitting p_{x_i,-} and p_{-,y_j}; transitions M→X and M→Y have probability δ, M→M has 1-2δ, X→X and Y→Y have ε, and X→M and Y→M have 1-ε.]

Hidden Markov Model. δ is the probability of moving to an insert state (the gap-opening penalty); the lower the value, the higher the penalty. ε is the probability of staying in an insert state (the gap-extension penalty); again, the lower the value, the higher the extension penalty. p_{x_i,y_j}, p_{x_i,-} and p_{-,y_j} are the emission frequencies of the match, insert-X and insert-Y states. For testing purposes, δ and ε were estimated from pairwise alignments of the terminal sequences as δ = 1/(2(l_m + 1)) and ε = 1 - 1/(l_g + 1), where l_m and l_g are the mean lengths of match and gap segments.
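A tiny worked example of these estimators, using made-up mean segment lengths:

```python
# Worked example with invented mean segment lengths (not values from the paper).
l_m = 19.0                              # mean length of match segments (assumed)
l_g = 3.0                               # mean length of gap segments (assumed)

delta = 1.0 / (2.0 * (l_m + 1.0))       # gap opening:   0.025 -> gaps open rarely
epsilon = 1.0 - 1.0 / (l_g + 1.0)       # gap extension: 0.75  -> gaps tend to extend
print(delta, epsilon)
```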

Pairwise Alignment. In this probabilistic model, the best alignment between two sequences corresponds to the Viterbi path through the HMM. Since there are three states in the model and each state needs two-dimensional space, we have three 2-D tables: v^M for the match state, and v^X and v^Y for the gap states. A move within the M, X or Y table produces an additional match or extends an existing gap; a move between the M table and either the X or the Y table opens or closes a gap.

Viterbi Recursion.
Initialization: v(0,0) = 1; v(i,-1) = v(-1,j) = 0.
Recursion:
v^M(i,j) = p_{x_i,y_j} max{ (1-2δ) v^M(i-1,j-1), (1-ε) v^X(i-1,j-1), (1-ε) v^Y(i-1,j-1) }
v^X(i,j) = p_{x_i,-} max{ δ v^M(i-1,j), ε v^X(i-1,j) }
v^Y(i,j) = p_{-,y_j} max{ δ v^M(i,j-1), ε v^Y(i,j-1) }
Termination: v^E = max( v^M(n,m), v^X(n,m), v^Y(n,m) )
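A compact sketch of this recursion (assumed names and 1-based emission arrays; a production implementation would work in log space to avoid numerical underflow):

```python
# Sketch of the pair-HMM Viterbi recursion above (not ProAlign's code).
# p_match[i][j], p_gap_x[i] and p_gap_y[j] are assumed precomputed emission
# probabilities, indexed from 1 (row/column 0 unused).
def viterbi(n, m, p_match, p_gap_x, p_gap_y, delta, epsilon):
    vM = [[0.0] * (m + 1) for _ in range(n + 1)]
    vX = [[0.0] * (m + 1) for _ in range(n + 1)]
    vY = [[0.0] * (m + 1) for _ in range(n + 1)]
    vM[0][0] = 1.0                                    # initialization v(0,0) = 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:                       # match: consume x_i and y_j
                vM[i][j] = p_match[i][j] * max((1 - 2 * delta) * vM[i-1][j-1],
                                               (1 - epsilon) * vX[i-1][j-1],
                                               (1 - epsilon) * vY[i-1][j-1])
            if i > 0:                                 # insert X: x_i against a gap
                vX[i][j] = p_gap_x[i] * max(delta * vM[i-1][j],
                                            epsilon * vX[i-1][j])
            if j > 0:                                 # insert Y: y_j against a gap
                vY[i][j] = p_gap_y[j] * max(delta * vM[i][j-1],
                                            epsilon * vY[i][j-1])
    return max(vM[n][m], vX[n][m], vY[n][m])          # termination v_E
```

Only the terminal value is returned here; the traceback described next would additionally record, at every cell, which table the maximum came from.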

Viterbi Traceback. At each cell, the relative probabilities of entering it from the different cells are stored. For example, the probability of having entered the match cell (i,j) from the match table is

p_{M-M} = (1-2δ) v^M(i-1,j-1) / N(i,j)

where N(i,j) is the normalizing constant

N(i,j) = (1-2δ) v^M(i-1,j-1) + (1-ε) [ v^X(i-1,j-1) + v^Y(i-1,j-1) ].

The corresponding quantities are calculated for each of the three tables. The traceback algorithm is then used to find the best path: a match step creates pointers from the parent site to both child sites, while a gap step creates a pointer to one child site and a gap for the second child site.
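For example, the relative probabilities of having entered the match cell (i,j) could be computed as in this sketch (names assumed):

```python
# Sketch: relative probabilities of entering match cell (i, j) from each table,
# normalized by N(i, j) as defined above. vM, vX, vY are the Viterbi tables.
def match_entry_probs(vM, vX, vY, i, j, delta, epsilon):
    from_M = (1 - 2 * delta) * vM[i-1][j-1]
    from_X = (1 - epsilon) * vX[i-1][j-1]
    from_Y = (1 - epsilon) * vY[i-1][j-1]
    N = from_M + from_X + from_Y                      # normalizing constant N(i, j)
    return from_M / N, from_X / N, from_Y / N
```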

Posterior Probabilities: Forward Algorithm. The forward algorithm sums the probabilities of all paths entering a given cell from the start position.
Initialization: f(0,0) = 1; f(i,-1) = f(-1,j) = 0.
Recursion, for i = 0,…,n and j = 0,…,m, except (0,0):
f^M(i,j) = p_{x_i,y_j} [ (1-2δ) f^M(i-1,j-1) + (1-ε) ( f^X(i-1,j-1) + f^Y(i-1,j-1) ) ]
f^X(i,j) = p_{x_i,-} [ δ f^M(i-1,j) + ε f^X(i-1,j) ]
f^Y(i,j) = p_{-,y_j} [ δ f^M(i,j-1) + ε f^Y(i,j-1) ]
Termination: f^E = f^M(n,m) + f^X(n,m) + f^Y(n,m)
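The same sketch as for Viterbi, with sums replacing maxima (assumed names, probability space rather than log space):

```python
# Sketch of the forward recursion; p_match, p_gap_x, p_gap_y as in the Viterbi sketch.
def forward(n, m, p_match, p_gap_x, p_gap_y, delta, epsilon):
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = p_match[i][j] * ((1 - 2 * delta) * fM[i-1][j-1]
                                            + (1 - epsilon) * (fX[i-1][j-1] + fY[i-1][j-1]))
            if i > 0:
                fX[i][j] = p_gap_x[i] * (delta * fM[i-1][j] + epsilon * fX[i-1][j])
            if j > 0:
                fY[i][j] = p_gap_y[j] * (delta * fM[i][j-1] + epsilon * fY[i][j-1])
    f_E = fM[n][m] + fX[n][m] + fY[n][m]              # termination
    return fM, fX, fY, f_E
```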

Backward Algorithm. The backward variable is the sum of probabilities of all possible alignments between the subsequences x_{i…n} and y_{j…m}.
Initialization: b(n,m) = 1; b(i,m+1) = b(n+1,j) = 0.
Recursion, for i = n,…,1 and j = m,…,1, except (n,m):
b^M(i,j) = (1-2δ) p_{x_{i+1},y_{j+1}} b^M(i+1,j+1) + δ [ p_{x_{i+1},-} b^X(i+1,j) + p_{-,y_{j+1}} b^Y(i,j+1) ]
b^X(i,j) = (1-ε) p_{x_{i+1},y_{j+1}} b^M(i+1,j+1) + ε p_{x_{i+1},-} b^X(i+1,j)
b^Y(i,j) = (1-ε) p_{x_{i+1},y_{j+1}} b^M(i+1,j+1) + ε p_{-,y_{j+1}} b^Y(i,j+1)
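A matching sketch of the backward recursion (same assumed arrays; boundary cases are handled by treating out-of-range terms as zero):

```python
# Sketch of the backward recursion; b(n, m) = 1 and out-of-range terms are zero.
def backward(n, m, p_match, p_gap_x, p_gap_y, delta, epsilon):
    bM = [[0.0] * (m + 1) for _ in range(n + 1)]
    bX = [[0.0] * (m + 1) for _ in range(n + 1)]
    bY = [[0.0] * (m + 1) for _ in range(n + 1)]
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            match = p_match[i+1][j+1] * bM[i+1][j+1] if (i < n and j < m) else 0.0
            gap_x = p_gap_x[i+1] * bX[i+1][j] if i < n else 0.0
            gap_y = p_gap_y[j+1] * bY[i][j+1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * match + delta * (gap_x + gap_y)
            bX[i][j] = (1 - epsilon) * match + epsilon * gap_x
            bY[i][j] = (1 - epsilon) * match + epsilon * gap_y
    return bM, bX, bY
```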

Reliability Check. Assumption: the posterior probability of the sites on the alignment path is a valid estimator of the local reliability of the alignment, since it gives the proportion of the total probability corresponding to all alignments passing through the cell (i,j). The posterior probability of a match is

P(x_i ◊ y_j | x, y) = f^M(i,j) b^M(i,j) / f^E

where f^M(i,j) and b^M(i,j) are the total probabilities of all possible alignments between the subsequences x_{1…i} and y_{1…j}, and x_{i…n} and y_{j…m}, respectively. Similar probabilities are calculated for the insert-X and insert-Y states.
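Combining the tables from the two sketches above, the match posterior is a one-liner:

```python
# Sketch: posterior probability that x_i and y_j are aligned,
# P(x_i ◊ y_j | x, y) = f_M(i, j) * b_M(i, j) / f_E.
def match_posterior(fM, bM, f_E, i, j):
    return fM[i][j] * bM[i][j] / f_E
```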

Multiple Alignment. Each parent-node site has a vector of probabilities, one for each possible character state (including the gap). For a match,

p_a(z_k) = p_{z_k=a}(x_i, y_j) / Σ_b p_{z_k=b}(x_i, y_j)

Pairwise alignment builds the tree progressively, from the terminal nodes towards an arbitrary root. Once the root node is defined, a traceback finds the multiple alignment of the nodes below it, since each node stores pointers to the matching child sites. If a gap occurs at one of the internal nodes, a gap character is introduced in all of the sequences of that subtree, and the recursive call does not proceed further down that branch.
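The normalization of the parent-site vector can be sketched as below; per_character is a hypothetical mapping from each character a to the unnormalized term p_{z_k=a}(x_i, y_j) produced by the emission sketch:

```python
# Sketch: parent-site probability vector for a match column,
# p_a(z_k) = p_{z_k=a}(x_i, y_j) / sum_b p_{z_k=b}(x_i, y_j).
def parent_vector(per_character):
    total = sum(per_character.values())
    return {a: value / total for a, value in per_character.items()}

print(parent_vector({'A': 0.030, 'G': 0.010, 'C': 0.001, 'T': 0.001}))  # made-up numbers
```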

Testing. The algorithm was tested on (i) simulated nucleotide sequences: 50 random data sets generated with the program Rose. A random root sequence (length 500) was evolved on a random tree to yield sequences of 'low' (0.5 substitutions per site) and 'high' (1.0) divergence, and the insertion/deletion length distribution was set to 'short' or 'long'. (ii) Amino acid data sets from Reference 1 (Ref1) of the BAliBASE database. Ref1 contains alignments of fewer than six equidistant sequences, i.e. the percent identity between any two sequences is within a specified range, with no large insertions or deletions. The data sets were divided into three groups by length, and further into three by similarity.

Results of Simulation on Nucleotide Sequences

Type 1 and Type 2 errors vs. minimum posterior probability

Performance and Future Work. ProAlign performs better than ClustalW on the nucleotide sequences, but not on amino acid sequences with sequence identity below 25%. A possible reason is that the model does not take protein secondary structure into account, so the HMM could be extended to model secondary structure as well. The minimum posterior probability correlates well with correctness and can be used to detect and remove unreliably aligned regions.