
Multiple Sequence Alignment. Algorithms in Computational Biology, Spring 2006. Most of the slides were created by Dan Geiger and Ydo Wexler and edited by Itai Sharon; others were created by Itai Sharon.

2 Multiple Sequence Alignment
S1 = AGGTC, S2 = GTTCG, S3 = TGAAC
A possible alignment:
S1: A G G T - C -
S2: - G - T T C G
S3: T G - A A C -
[Figure: a second possible alignment of the same three sequences]

3 Motivation
- Construction of phylogenetic trees (requires that the sites being compared are homologous)
- Extraction of conserved regions in proteins
- Construction of profiles characteristic of a protein family
- Repetitive sequences in DNA

4 Multiple Sequence Alignment (MSA)
Definition: Given strings $s_1, s_2, \ldots, s_k$, an MSA algorithm maps them to strings $s'_1, s'_2, \ldots, s'_k$ that may contain gaps, where:
- $|s'_1| = |s'_2| = \cdots = |s'_k|$
- removing the gaps from $s'_i$ leaves $s_i$
Note: It is usually convenient to represent an MSA as a matrix with $k$ rows and $|s'_i|$ columns. No column may consist solely of gaps.
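The conditions in this definition can be checked mechanically. The following is a minimal sketch, assuming gaps are written as `-` and using the three example sequences AGGTC, GTTCG, TGAAC; the function name `is_valid_msa` is hypothetical and chosen only for illustration.

```python
def is_valid_msa(seqs, rows, gap="-"):
    """Check the MSA definition: equal row lengths, rows reduce to the
    input strings when gaps are removed, and no column is all gaps."""
    if len(seqs) != len(rows):
        return False
    if len({len(r) for r in rows}) != 1:
        return False
    if any(r.replace(gap, "") != s for s, r in zip(seqs, rows)):
        return False
    # zip(*rows) iterates over the columns of the alignment matrix
    return all(any(ch != gap for ch in col) for col in zip(*rows))


seqs = ["AGGTC", "GTTCG", "TGAAC"]
rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
print(is_valid_msa(seqs, rows))
```

A row of unequal length, a row that spells a different string, or a column made only of gaps each makes the check fail.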

5 Assigning scores to an MSA
We will consider additive functions only.
Points to consider regarding a scoring function:
- It should not depend on the order of its arguments.
- It should reward the presence of many equal or strongly related residues and penalize unrelated residues and spaces.
In pairwise alignment the score is simply the sum of the similarity scores of corresponding letters. What is the "best" way to measure the similarity of $k > 2$ letters?

6 Sum of Pairs (SP)
The sum-of-pairs score of an MSA is the sum of the scores of all pairwise alignments induced by it.
Example: using the cost function $\delta(x, x) = 0$ and $\delta(x, y) = 1$ for $x \neq y$, the alignment
- c - a d b -
a - b c d a d
a c - c d b -
has an SP value of 4 + 6 + 2 = 12.
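The SP value of the example can be recomputed with a few lines of code. This is a minimal sketch, assuming the gap symbol `-` is treated like any other symbol by the cost function (so a gap paired with a gap costs 0 and a gap paired with a letter costs 1); the names `delta` and `sp_score` are illustrative.

```python
from itertools import combinations


def delta(x, y):
    """Unit cost: 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1


def sp_score(rows, cost=delta):
    """Sum the induced pairwise costs over every pair of rows, column by column."""
    assert len({len(r) for r in rows}) == 1, "all rows must have equal length"
    return sum(cost(a, b)
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))


msa = ["-c-adb-", "a-bcdad", "ac-cdb-"]
print(sp_score(msa))  # 12
```

The three induced pairwise alignments contribute 6, 2 and 4, which is the same total as on the slide.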

7 Sum of Pairs
SP tends to overcount mutations. For instance, assume that a column consists of (A, A, A, T) and that $\delta(x, x) = 1$, $\delta(x, y) = -1$ for $x \neq y$. The score for the column is $3 \cdot \delta(A, A) + 3 \cdot \delta(A, T) = 3 - 3 = 0$, although the column could be explained by a single mutation.
[Figure: the column A, A, A, T]
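The column arithmetic can be reproduced directly. The sketch below assumes the similarity values from this slide (1 for a match, -1 for a mismatch); `column_sp` is a hypothetical helper name.

```python
from itertools import combinations


def column_sp(column, match=1, mismatch=-1):
    """SP score of a single column: one term per unordered pair of letters."""
    return sum(match if a == b else mismatch
               for a, b in combinations(column, 2))


print(column_sp("AAAT"))  # 0
print(column_sp("AAAA"))  # 6
```

A single changed letter moves the column score from 6 to 0, because that one letter takes part in three of the six pairs.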

8 How to Perform MSA?
- Multidimensional dynamic programming
- Tree alignments
- Star alignments
- Progressive alignment

9 Multidimensional DP Alignment
Given $k$ strings of length $n$, there is a natural generalization of the DP algorithm. Instead of a 2-dimensional table, we now have a $k$-dimensional table to fill.
For each cell $V(i)$, $i = (i_1, \ldots, i_k)$, compute an optimal multiple alignment for the $k$ prefix sequences $s_1(1, \ldots, i_1), \ldots, s_k(1, \ldots, i_k)$.
The adjacent cells are all cells $V(i - b)$, where $b_j \in \{0, 1\}$ and $\sum_j b_j \neq 0$. Each cell depends on $2^k - 1$ adjacent cells.
Use the SP-score for computing the score.
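The recurrence can be sketched for small inputs as follows. This is an illustration under stated assumptions, not the slides' implementation: it minimizes the SP cost with the unit cost function (0 for equal symbols, including two gaps, and 1 otherwise), and the names `sp_column_cost` and `msa_dp` are hypothetical. It returns only the optimal cost, not the alignment itself.

```python
from itertools import combinations, product


def sp_column_cost(column):
    """SP cost of one column: 0 for equal symbols (two gaps included), 1 otherwise."""
    return sum(0 if a == b else 1 for a, b in combinations(column, 2))


def msa_dp(seqs):
    """Minimum SP cost of aligning seqs, computed over a k-dimensional table."""
    k = len(seqs)
    # The 2^k - 1 nonzero 0/1 vectors b; b[j] = 1 means sequence j
    # contributes its next letter to the column, 0 means it contributes a gap.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    V = {(0,) * k: 0}
    # product() yields cells in lexicographic order, so every predecessor
    # i - b has been filled before cell i is visited.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell in V:
            continue  # the origin is already set
        best = None
        for b in moves:
            prev = tuple(i - d for i, d in zip(cell, b))
            if min(prev) < 0:
                continue  # this move would step outside the table
            column = [s[i - 1] if d else "-" for s, i, d in zip(seqs, cell, b)]
            cand = V[prev] + sp_column_cost(column)
            if best is None or cand < best:
                best = cand
        V[cell] = best
    return V[tuple(len(s) for s in seqs)]


print(msa_dp(["AGGTC", "GTTCG", "TGAAC"]))
```

For two sequences this reduces to ordinary unit-cost edit distance; the table size grows as the product of the sequence lengths, so the sketch is only practical for a handful of short strings.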

10 Multidimensional DP Alignment
What's the price?
- Number of cells to fill: $O(n^k)$
- Number of dependencies of each cell: $2^k - 1$
- Time to compute the SP-score: $O(k^2)$
Complexity: $O(k^2 2^k n^k)$
In fact, the optimal SP-alignment problem was shown to be NP-complete!
Well, these sequences need to be aligned… what can we do?

11 Time Saving Heuristics – Relevance Tests
Idea: avoid computing score($i$) for irrelevant cells.
- Compute a lower bound $L$ on the score of the optimal alignment. Any efficient approximation algorithm can be used.
- For each cell $V(i)$, compute an upper bound $U$ on the score of the best alignment that goes through it.
- Ignore the cell if $U < L$.

12 Time Saving Heuristics – Relevance Tests
How do we compute the upper bound $U$ for cell $V(i)$?
For cell $i = (\ldots, i_u, \ldots, i_v, \ldots)$ do the following:
- For each two indices $1 \le u < v \le k$, compute the optimal score of a pairwise alignment of $s_u$ and $s_v$ that goes via cell $(i_u, i_v)$.
- Compute $U$ as the sum of these pairwise scores over all pairs $u < v$.
Claim: $U$ is an upper bound on the score of the best MSA that goes through cell $i$.
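One way to obtain the pairwise scores through a cell is to combine a forward table (best score of the prefixes) with a backward table (best score of the suffixes). The sketch below is an illustration under assumptions: the scoring values (1 for a match, -1 for a mismatch, -1 for a gap) and the names `forward`, `through_table` and `upper_bounds` are hypothetical and not taken from the slides.

```python
from itertools import combinations

GAP = -1  # assumed gap score


def score(a, b):
    """Assumed similarity: 1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def forward(s, t):
    """F[i][j] = best global alignment score of s[:i] and t[:j]."""
    F = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append(F[i - 1][j] + GAP)
            if j > 0:
                cands.append(F[i][j - 1] + GAP)
            if i > 0 and j > 0:
                cands.append(F[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
            F[i][j] = max(cands)
    return F


def through_table(s, t):
    """T[i][j] = best score of an alignment of s and t passing through cell (i, j)."""
    F = forward(s, t)
    # Aligning the reversed strings gives the best scores of the suffixes.
    R = forward(s[::-1], t[::-1])
    n, m = len(s), len(t)
    return [[F[i][j] + R[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]


def upper_bounds(seqs):
    """Return a function U(cell) that sums the pairwise through-cell scores."""
    tables = {(u, v): through_table(seqs[u], seqs[v])
              for u, v in combinations(range(len(seqs)), 2)}

    def U(cell):
        return sum(tables[u, v][cell[u]][cell[v]] for (u, v) in tables)

    return U


U = upper_bounds(["AC", "AC", "AC"])
print(U((1, 1, 1)), U((2, 0, 0)))  # 6 -6
```

With a lower bound `L` in hand, a cell would be skipped whenever `U(cell) < L`; in the toy example any `L` above -6 already rules out the cell (2, 0, 0).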

13 Time Saving Heuristics – Relevance Tests
How do we compute the optimal route? Recall the space-efficient algorithm for pairwise alignment.
Can we go over all cells and determine whether each is relevant or not? No. Start with $(0, \ldots, 0)$ and add relevant entries to the list until reaching $(n_1, \ldots, n_k)$.
What is the new time complexity?
- For each potential cell we've added $O(k^2 n^2)$ operations.
- Depending on the quality of $L$, we've eliminated (hopefully) many cells.

14 Tree Alignments – Structure
Input:
- A set of $k$ sequences $S = \{s_1, s_2, \ldots, s_k\}$
- The topology of a tree $T$ whose leaves are the members of $S$
Algorithm: find an assignment of sequences to the interior nodes of the tree that optimizes the overall score.
For each edge $e = (v_i, v_j)$ of $T$, its weight $w(e)$ is the pairwise alignment score of $v_i$ and $v_j$.
The overall score is defined by $\mathrm{score}(T) = \sum_{e \in T} w(e)$.

15 Tree Alignments – an Example
Suppose that we're given the following tree:
[Figure: a tree whose nodes carry the sequences CAT, GT, CTG, CG, CT and CG, with a pairwise score on each edge]
Given that $\delta(x, x) = 1$, $\delta(x, y) = 0$ and $\delta(x, -) = -1$, the overall score of the alignment is score($T$) = 2 + 3 + 1 = 6.
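Scoring a fully labeled tree amounts to one pairwise alignment per edge. The following is a minimal sketch, assuming the overall score is the sum of the edge weights and using the scoring values from this example (1 for a match, 0 for a mismatch, -1 for a gap). The tree in the code is hypothetical (one interior node `x` with three leaves) and is not the tree from the slide; the function names are illustrative.

```python
GAP = -1  # score of a letter against a gap


def sub(x, y):
    """Score of two aligned letters: 1 if equal, 0 otherwise."""
    return 1 if x == y else 0


def pairwise_score(s, t):
    """Best global alignment score of s and t, keeping only two DP rows."""
    prev = [j * GAP for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * GAP]
        for j in range(1, len(t) + 1):
            cur.append(max(prev[j - 1] + sub(s[i - 1], t[j - 1]),
                           prev[j] + GAP,
                           cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


def tree_score(labels, edges):
    """Sum of pairwise alignment scores over the edges of a labeled tree."""
    return sum(pairwise_score(labels[u], labels[v]) for u, v in edges)


# Hypothetical tree: interior node "x" joined to leaves "a", "b", "c".
labels = {"x": "ACT", "a": "ACGT", "b": "ACT", "c": "AGT"}
edges = [("x", "a"), ("x", "b"), ("x", "c")]
print(tree_score(labels, edges))  # 7
```

The tree alignment problem itself is harder than this: it has to choose the interior labels, whereas the sketch only evaluates a given choice.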

16 Tree Alignments – Notes
- The MSA can be recovered from the alignments on the different edges.
- The overall score of the alignment is not SP.
- The tree alignment problem is NP-hard.
- There exists an algorithm that finds an optimal alignment in time exponential in the number of sequences.
- Tree alignment algorithms are applicable only when a tree topology is known.

17 Star Alignments – Structure
- Choose a sequence $s^*$ that will serve as the center of the star. How to choose: try all sequences, choose the one whose total distance to all the others is the smallest, etc.
- Add the other sequences by aligning them to $s^*$.
- Add gaps to already aligned sequences when necessary.
- Never remove a gap ("Once a gap, always a gap").
[Figure: star graph on the sequences $s_1, \ldots, s_6$]

18 The Center Star Method
Publication: Gusfield, 1993
Assumption: the cost function $\delta$ is a distance function that satisfies:
- $\delta(x, y) = \delta(y, x) \ge 0$
- $\delta(x, x) = 0$
- $\delta(x, z) + \delta(z, y) \ge \delta(x, y)$
Algorithm:
- Runs in polynomial time.
- The score of the alignment it produces is less than twice the score of the optimal alignment.

19 The Center Star Method – Definitions
- $M$: the alignment produced by the algorithm
- $M^*$: the best alignment, namely the one that gets the lowest score
- $d(i, j)$, $d^*(i, j)$: the distance induced by $M$ (respectively $M^*$) on $(s_i, s_j)$
- $DP(s_i, s_j)$: the minimum pairwise alignment score
- $v(M)$: the score of alignment $M$, $v(M) = \sum_{i < j} d(i, j)$
Note that it is always true that $d(i, j) \ge DP(s_i, s_j)$.

20 The Center Star Method
Input: a set of $k$ sequences $S = \{s_1, \ldots, s_k\}$
Algorithm:
Find the center $s^* = \arg\min_{s_i} \sum_{j \neq i} DP(s_i, s_j)$. Suppose $s^* = s_1$.
for $i = 2$ to $k$ do:
  Suppose $s_1, \ldots, s_{i-1}$ are already aligned as $s'_1, \ldots, s'_{i-1}$.
  Align $s_i$ against $s'_1$ by running the DP algorithm to produce the alignment $(s''_1, s'_i)$.
  Adjust $s'_2, \ldots, s'_{i-1}$ to $s''_1$ by adding gaps in those columns where gaps were added to get $s''_1$ from $s'_1$.
  Replace $s'_1$ by $s''_1$ and add $s'_i$.
end for
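The loop above can be sketched in a few dozen lines. This is an illustration under assumptions rather than the slides' code: it uses the unit cost function (0 for equal symbols, 1 otherwise, with `-` treated as a symbol, so a gap against an existing gap costs nothing), and the names `delta`, `align` and `center_star` are hypothetical.

```python
def delta(x, y):
    """Unit-cost distance on letters and the gap symbol '-'."""
    return 0 if x == y else 1


def align(a, b):
    """Global DP alignment of a (may already contain gaps) and b, minimizing cost.

    Returns (cost, a_aligned, b_aligned, new_gap_cols), where new_gap_cols
    lists the columns at which a fresh gap was inserted into a.
    """
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(a[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta("-", b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),
                          D[i - 1][j] + delta(a[i - 1], "-"),
                          D[i][j - 1] + delta("-", b[j - 1]))
    cols, i, j = [], n, m  # trace back from the last cell
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(a[i - 1], b[j - 1]):
            cols.append((a[i - 1], b[j - 1], False))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(a[i - 1], "-"):
            cols.append((a[i - 1], "-", False))
            i -= 1
        else:
            cols.append(("-", b[j - 1], True))  # fresh gap in a
            j -= 1
    cols.reverse()
    return (D[n][m],
            "".join(c[0] for c in cols),
            "".join(c[1] for c in cols),
            [p for p, c in enumerate(cols) if c[2]])


def center_star(seqs):
    """Return aligned rows (same order as seqs) built around the center sequence."""
    k = len(seqs)
    total = [sum(align(s, t)[0] for t in seqs) for s in seqs]
    c = min(range(k), key=total.__getitem__)  # index of the center s*
    aligned = {c: seqs[c]}
    for i in range(k):
        if i == c:
            continue
        _, _, new_row, new_gaps = align(aligned[c], seqs[i])
        for r in aligned:  # "once a gap, always a gap"
            row = list(aligned[r])
            for p in new_gaps:  # ascending order keeps column indices valid
                row.insert(p, "-")
            aligned[r] = "".join(row)
        aligned[i] = new_row
    return [aligned[i] for i in range(k)]


for row in center_star(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Tracking the inserted columns explicitly, instead of comparing the old and new center strings, avoids ambiguity when a new gap lands next to an existing one.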

21 The Center Star Method – Time Analysis
Choosing $s^*$: running the DP algorithm $\binom{k}{2}$ times takes $O(k^2 n^2)$.
Adding $s_2, \ldots, s_k$ to the MSA:
- In step $i$ the length of $s'^*$ is at most $i \cdot n$.
- Aligning $s'^*$ with $s_i$ takes $O(i \cdot n^2)$ time.
- Performing $k - 1$ such alignments takes $\sum_{i=1}^{k-1} O(i \cdot n^2) = O(k^2 n^2)$ time.
Overall time complexity: $O(k^2 n^2)$

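22 The Center Star Method – Error Analysis
With $v(M) = \sum_{i < j} d(i, j)$ and $s^* = s_1$:
Upper bound on $v(M)$:
$2 v(M) = \sum_{i=1}^{k} \sum_{j \neq i} d(i, j) \le \sum_{i=1}^{k} \sum_{j \neq i} \bigl[ d(i, 1) + d(1, j) \bigr]$ (triangle inequality)
$= 2(k - 1) \sum_{l=2}^{k} d(1, l) = 2(k - 1) \sum_{l=2}^{k} DP(s_1, s_l)$ (since $d(1, i) = DP(s_1, s_i)$)
Lower bound on $v(M^*)$:
$2 v(M^*) = \sum_{i=1}^{k} \sum_{j \neq i} d^*(i, j) \ge \sum_{i=1}^{k} \sum_{j \neq i} DP(s_i, s_j) \ge k \sum_{l=2}^{k} DP(s_1, s_l)$ (definition of $s_1$)
Hence $\dfrac{v(M)}{v(M^*)} \le \dfrac{2(k - 1)}{k} < 2$.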

23 Progressive Alignments
Idea: successively align pairs of sequences using pairwise alignment algorithms.
General structure:
1. Choose two sequences and align them using a pairwise alignment algorithm.
2. Choose another sequence and align it to the current alignment.
3. Repeat the previous stage as long as there are sequences left.

24 Progressive Alignments
Differences between algorithms:
- How the next sequence is chosen
- Whether progression involves aligning sequences vs. alignments only, or also alignments vs. alignments
- Scoring methods
Progressive alignment algorithms:
- Clustal W
- T-Coffee

25 CLUSTAL W
Publication: Thompson et al., 1994
The algorithm consists of three stages:
1. Distance matrix construction, by pairwise alignment of each pair of sequences
2. Guide tree construction from the distance matrix
3. Progressive alignment of the sequences according to the branches in the guide tree
More on ClustalW next week…
