Multiple Sequence alignment Chitta Baral Arizona State University.

Motivation and Introduction
Need for multiple sequence alignment
– We have the sequences of several proteins that have a similar function in a number of different species.
– We may want to know which parts of these sequences are similar and which parts are different.
What is multiple alignment?
– Let $s_1, \dots, s_k$ be a set of sequences over the same alphabet.
– Spaces are inserted in $s_1, \dots, s_k$ to make them all the same length.
– When the extended sequences are aligned, no column may consist exclusively of spaces.
– An example:
    M Q P I L L L
    M L R - L L -
    M K - I L L L
    M P P V L I L
– First important issue: defining the quality of an alignment.

The "sum-of-pairs" (SP) measure
Requirements for a good measure of alignment quality
– It must be additive: the score of an alignment is the sum of the scores of its columns.
– It must be independent of the order of its arguments.
– It should reward the presence of many equal or strongly related symbols in the same column and penalize unrelated symbols and spaces.
SP function: the sum of the pairwise scores of all pairs of symbols in the column
– SP-score(I, -, I, V) = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V).
– s(-,-) = 0.
Theorem: Let $\alpha$ be a multiple alignment of the set of sequences $s_1, \dots, s_k$, and let $\alpha_{ij}$ denote the pairwise alignment of $s_i$ and $s_j$ induced by $\alpha$. Then
    $\text{SP-score}(\alpha) = \sum_{i<j} \text{score}(\alpha_{ij})$
– This holds only if s(-,-) = 0.
– The reason is that in the induced pairwise alignment, columns in which both sequences have a space are ignored.
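
The SP measure and the theorem above can be checked on the example alignment with a short Python sketch. The numeric scores (+1 for a match, -1 for a mismatch, -2 for a space) and all function names are assumptions made for illustration; the slides only fix s(-,-) = 0.

```python
from itertools import combinations

GAP = "-"

def s(a, b, match=1, mismatch=-1, space=-2):
    """Score of one pair of symbols (numeric values are assumed)."""
    if a == GAP and b == GAP:
        return 0  # s(-,-) = 0, as the theorem requires
    if a == GAP or b == GAP:
        return space
    return match if a == b else mismatch

def sp_column(column):
    """Sum of s over all unordered pairs of symbols in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def sp_alignment(rows):
    """SP-score of an alignment: the sum of its column scores."""
    return sum(sp_column(col) for col in zip(*rows))

def induced_score(row_i, row_j):
    """Score of the pairwise alignment induced by two rows.
    A column with two spaces contributes 0, which equals dropping it."""
    return sum(s(a, b) for a, b in zip(row_i, row_j))

rows = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
column_sum = sp_alignment(rows)
pair_sum = sum(induced_score(x, y) for x, y in combinations(rows, 2))
print(sp_column("I-IV"), column_sum, pair_sum)
```

Both totals agree because every pair of rows is counted once per column in either order of summation, and a column with two spaces adds nothing on either side.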

Optimal alignment using dynamic programming
Consider k sequences, each of length n.
Use a k-dimensional array A of length n+1 in each dimension.
Initialize $A[0, \dots, 0] = 0$.
General recurrence:
    $A[\mathbf{i}] = \max_{\mathbf{b}} \{ A[\mathbf{i} - \mathbf{b}] + \text{SP-score}(\text{Column}(\mathbf{s}, \mathbf{i}, \mathbf{b})) \}$
– where $\mathbf{b}$ ranges over all non-zero binary vectors of k elements (only those for which $\mathbf{i} - \mathbf{b}$ has no negative component), and
– $\text{Column}(\mathbf{s}, \mathbf{i}, \mathbf{b}) = (c_j)_{1 \le j \le k}$,
– with $c_j = s_j[i_j]$ if $b_j = 1$ and $c_j = -$ if $b_j = 0$.
– Boldface indicates k-tuples, e.g. $\mathbf{i} = (i_1, \dots, i_k)$.
For k = 3, $A[i_1, i_2, i_3]$ is the maximum of
– $A[i_1, i_2, i_3 - 1] + \text{SP-score}(-, -, s_3[i_3])$
– $A[i_1, i_2 - 1, i_3] + \text{SP-score}(-, s_2[i_2], -)$
– $A[i_1, i_2 - 1, i_3 - 1] + \text{SP-score}(-, s_2[i_2], s_3[i_3])$
– $A[i_1 - 1, i_2, i_3] + \text{SP-score}(s_1[i_1], -, -)$
– $A[i_1 - 1, i_2, i_3 - 1] + \text{SP-score}(s_1[i_1], -, s_3[i_3])$
– $A[i_1 - 1, i_2 - 1, i_3] + \text{SP-score}(s_1[i_1], s_2[i_2], -)$
– $A[i_1 - 1, i_2 - 1, i_3 - 1] + \text{SP-score}(s_1[i_1], s_2[i_2], s_3[i_3])$
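
The recurrence can be written for general k in a few lines of Python. This is a sketch rather than a tuned implementation: the scoring values (+1, -1, -2) and the names `msa_score` and `sp_column` are assumptions, the table is a dictionary keyed by k-tuples, and sequences are 0-indexed, so $s_j[i_j]$ becomes `seqs[j][i[j] - 1]`.

```python
from itertools import combinations, product

GAP = "-"

def s(a, b, match=1, mismatch=-1, space=-2):
    """Pairwise symbol score (assumed values); s(-,-) = 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return space
    return match if a == b else mismatch

def sp_column(column):
    return sum(s(a, b) for a, b in combinations(column, 2))

def msa_score(seqs):
    """Optimal SP-score of a multiple alignment of seqs (exponential in k)."""
    k = len(seqs)
    origin = (0,) * k
    # All non-zero binary vectors b: 2^k - 1 of them.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    A = {origin: 0}
    # product() enumerates cells in lexicographic order, so every
    # predecessor i - b has been filled before cell i is visited.
    for i in product(*(range(len(x) + 1) for x in seqs)):
        if i == origin:
            continue
        best = None
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:
                continue  # this move would step outside the table
            column = [seqs[j][i[j] - 1] if b[j] else GAP for j in range(k)]
            candidate = A[prev] + sp_column(column)
            if best is None or candidate > best:
                best = candidate
        A[i] = best
    return A[tuple(len(x) for x in seqs)]

print(msa_score(["AT", "A", "T"]))
```

For k = 2 the function reduces to ordinary global pairwise alignment, which gives a convenient sanity check.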

Complexity analysis of the dynamic programming algorithm
Running time:
– The table has $(n+1)^k$ entries.
– For each entry we need to find the maximum of $2^k - 1$ elements.
– Finding the SP-score corresponding to each element means adding $O(k^2)$ numbers.
– Total = $O(k^2 2^k n^k)$, i.e., exponential with respect to k.
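
To get a feeling for the growth, the three factors can be multiplied out directly. The helper below is a hypothetical illustration; it counts pairwise score additions and uses the exact pair count k(k-1)/2 in place of the $O(k^2)$ bound.

```python
def dp_work(n, k):
    """Rough operation count for the full k-dimensional table."""
    cells = (n + 1) ** k          # entries in the table
    moves = 2 ** k - 1            # candidates per entry
    pairs = k * (k - 1) // 2      # pairwise scores per column
    return cells * moves * pairs

for k in (2, 3, 4, 6):
    print(k, dp_work(100, k))
```

With n = 100 the count is about 2 x 10^7 for k = 3, and each additional sequence multiplies it by more than 100.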

A heuristic based approach
Outline of the approach
– We have k sequences of length n each and we want to compute an optimal alignment according to the SP measure.
– We use dynamic programming, but try to avoid filling all entries of the k-dimensional array, and fill only the "relevant" ones.
Which cells are relevant and why
– Idea: look at pairwise projections of cells.
– Note: an optimal multiple alignment may induce pairwise projections that are not optimal. The alignment
    A T
    A -
    - T
    A T
  is optimal, but its projection onto the second and third rows,
    A -
    - T
  is not an optimal alignment of A and T.

Heuristic based approach (cont.)
Recall that F(i,j) meant the score of the best alignment between the initial segments $x_{1 \dots i}$ and $y_{1 \dots j}$.
Let us denote it by sim(x[1..i], y[1..j]) and refer to it as $a_{xy}[i,j]$, i.e., $a_{xy}[i,j] = \text{sim}(x[1..i], y[1..j])$.
Let $b_{xy}[i,j] = \text{sim}(x[i+1..n], y[j+1..m])$.
– Computed like $a_{xy}$, but backwards.
And $c_{xy}[i,j] = a_{xy}[i,j] + b_{xy}[i,j]$.
– This is the highest score of an alignment that cuts at (i,j).
Using the c matrix it is very easy to find an optimal alignment.
– Find a path from [n,m] to [0,0] on which every cell has the value $c_{xy}[n,m]$.
Suppose we know a lower bound $L_{xy}$, i.e., we know for sure that $\text{sim}(x,y) \ge L_{xy}$.
– In that case, $c_{xy}[i,j] < L_{xy}$ means that no alignment cutting through (i,j) is a best alignment.

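Heuristic based approach (cont.)
Example with x = ATTCGG (rows) and y = GATTC (columns). The entries are consistent with a score of +1 for a match, -1 for a mismatch, and -2 for a space.

 a          G    A    T    T    C
       0   -2   -4   -6   -8  -10
 A    -2   -1   -1   -3   -5   -7
 T    -4   -3   -2    0   -2   -4
 T    -6   -5   -4   -1    1   -1
 C    -8   -7   -6   -3   -1    2
 G   -10   -7   -8   -5   -3    0
 G   -12   -9   -8   -7   -5   -2

 c          G    A    T    T    C
      -2   -2   -7  -12  -17  -22
 A    -7   -4   -2   -7  -12  -17
 T   -10   -7   -5   -2   -7  -12
 T   -13  -10   -7   -5   -2   -7
 C   -14  -13  -10   -5   -4   -2
 G   -17  -14  -13   -8   -4   -2
 G   -22  -17  -14  -11   -7   -2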
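
The a, b and c matrices for the pair x = ATTCGG, y = GATTC can be recomputed with the sketch below. It assumes a scoring of +1 for a match, -1 for a mismatch and -2 for a space; the function names are hypothetical. The backward matrix b is obtained by running the forward computation on the reversed strings and flipping the indices.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2  # assumed scoring

def prefix_scores(x, y):
    """a[i][j] = sim(x[1..i], y[1..j]) by the usual pairwise recurrence."""
    n, m = len(x), len(y)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * SPACE
    for j in range(1, m + 1):
        a[0][j] = j * SPACE
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            a[i][j] = max(a[i - 1][j - 1] + pair,
                          a[i - 1][j] + SPACE,
                          a[i][j - 1] + SPACE)
    return a

def suffix_scores(x, y):
    """b[i][j] = sim(x[i+1..n], y[j+1..m]), via the reversed strings."""
    n, m = len(x), len(y)
    r = prefix_scores(x[::-1], y[::-1])
    return [[r[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]

def cut_scores(x, y):
    """c[i][j] = a[i][j] + b[i][j]: best score of an alignment cutting at (i,j)."""
    a, b = prefix_scores(x, y), suffix_scores(x, y)
    return [[a[i][j] + b[i][j] for j in range(len(y) + 1)]
            for i in range(len(x) + 1)]

x, y = "ATTCGG", "GATTC"
for row in cut_scores(x, y):
    print(" ".join(f"{v:4d}" for v in row))
```

The cells where c equals c[n][m] trace out the optimal alignments; every other cell is strictly smaller.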

Heuristic based approach (cont.): a theorem
Theorem: Let $\alpha$ be an optimal alignment involving $s_1, \dots, s_k$. If $\text{SP-score}(\alpha) \ge L$, then $\text{score}(\alpha_{ij}) \ge L_{ij}$, where
    $L_{ij} = L - \sum_{x<y,\ (x,y) \ne (i,j)} \text{sim}(s_x, s_y)$.
Proof:
– $\text{SP-score}(\alpha) \ge L$ iff $\sum_{x<y} \text{score}(\alpha_{xy}) \ge L$
– iff $\sum_{x<y,\ (x,y) \ne (i,j)} \text{score}(\alpha_{xy}) \ge L - \text{score}(\alpha_{ij})$,
– which implies $\sum_{x<y,\ (x,y) \ne (i,j)} \text{sim}(s_x, s_y) \ge L - \text{score}(\alpha_{ij})$, because $\text{sim}(s_x, s_y)$ is the best score and hence is greater than or equal to $\text{score}(\alpha_{xy})$,
– iff $\text{score}(\alpha_{ij}) \ge L - \sum_{x<y,\ (x,y) \ne (i,j)} \text{sim}(s_x, s_y) = L_{ij}$.
Implication of this theorem:
– Suppose we have a lower bound L on the SP-score of an optimal alignment.
– Then a cell with index $(i_1, \dots, i_k)$ is relevant if the score of the best alignment $\alpha$ that cuts through $(i_1, \dots, i_k)$ is $\ge L$.
– By the theorem, this implies $\text{score}(\alpha_{xy}) \ge L_{xy}$ for all x, y with $1 \le x < y \le k$,
– which means $c_{xy}[i_x, i_y] \ge L_{xy}$,
– because $\alpha_{xy}$ cuts through $(i_x, i_y)$, and $c_{xy}[i_x, i_y]$ is the highest score of any alignment of $s_x$ and $s_y$ that does so.
Idea of the algorithm:
– Pick a lower bound L; compute $c_{xy}$ and $L_{xy}$ for each pair x, y with $1 \le x < y \le k$.
– Start with (0, …, 0) and expand its influence to dependent relevant cells, continuing until the final corner cell is reached.
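
The pairwise bounds $L_{xy}$ follow from L and the pairwise similarities by simple arithmetic. The sketch below uses a hypothetical helper `pair_bounds` and made-up similarity values for three sequences, purely to show the computation.

```python
def pair_bounds(sim, L):
    """sim maps each pair (x, y), x < y, to sim(s_x, s_y).
    Returns L_xy = L - (sum of sim over all other pairs)."""
    total = sum(sim.values())
    return {pair: L - (total - value) for pair, value in sim.items()}

# Illustrative numbers only, not taken from the slides.
sim = {(1, 2): 5, (1, 3): 3, (2, 3): 4}
print(pair_bounds(sim, L=10))
```

A larger L gives larger pairwise bounds and therefore fewer relevant cells, which is why a tight lower bound pays off.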

The heuristic based algorithm
Input: $\mathbf{s} = (s_1, \dots, s_k)$ and a lower bound L
Output: the value of an optimal alignment
(0, i and j denote k-tuples; j is dependent on i if j = i + b for some non-zero binary vector b.)

    for all x, y with 1 <= x < y <= k
        compute c_xy
    for all x, y with 1 <= x < y <= k
        L_xy ← L − Σ over pairs p < q with (p,q) ≠ (x,y) of sim(s_p, s_q)
    pool ← {0}
    a[0] ← 0
    while pool is not empty do
        i ← the lexicographically smallest cell in the pool
        pool ← pool \ {i}
        if c_xy[i_x, i_y] >= L_xy for all x, y with 1 <= x < y <= k then
            for all j dependent on i do
                if j is not in pool then
                    pool ← pool ∪ {j}
                    a[j] ← a[i] + SP-score(Column(s, j, j − i))
                else
                    a[j] ← max(a[j], a[i] + SP-score(Column(s, j, j − i)))
    return a[n_1, …, n_k]
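
A compact Python version of this procedure is sketched below. The scoring values (+1, -1, -2) and all names are assumptions, the pool is a binary heap (tuples compare lexicographically in Python), and a dictionary `a` holds only the cells that were ever reached. L must be a valid lower bound on the optimal SP-score; otherwise the final cell may never be reached and the function returns None.

```python
import heapq
from itertools import combinations, product

GAP, MATCH, MISMATCH, SPACE = "-", 1, -1, -2  # assumed scoring

def s(a, b):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return SPACE
    return MATCH if a == b else MISMATCH

def sp_column(column):
    return sum(s(p, q) for p, q in combinations(column, 2))

def prefix_scores(x, y):
    """a[i][j] = sim(x[1..i], y[1..j])."""
    a = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            options = []
            if i and j:
                options.append(a[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
            if i:
                options.append(a[i - 1][j] + SPACE)
            if j:
                options.append(a[i][j - 1] + SPACE)
            if options:
                a[i][j] = max(options)
    return a

def cut_scores(x, y):
    """c[i][j] = a[i][j] + b[i][j]; b is computed on the reversed strings."""
    n, m = len(x), len(y)
    a, r = prefix_scores(x, y), prefix_scores(x[::-1], y[::-1])
    return [[a[i][j] + r[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]

def bounded_msa(seqs, L):
    k = len(seqs)
    pairs = list(combinations(range(k), 2))
    c = {(x, y): cut_scores(seqs[x], seqs[y]) for x, y in pairs}
    sim = {p: c[p][0][0] for p in pairs}       # c[0][0] = sim(s_x, s_y)
    total = sum(sim.values())
    bound = {p: L - (total - sim[p]) for p in pairs}
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    end = tuple(len(x) for x in seqs)
    start = (0,) * k
    a, pool = {start: 0}, [start]
    while pool:
        i = heapq.heappop(pool)                # lexicographically smallest
        if any(c[x, y][i[x]][i[y]] < bound[x, y] for x, y in pairs):
            continue                           # cell i is not relevant
        for b in moves:
            j = tuple(ix + bx for ix, bx in zip(i, b))
            if any(jx > nx for jx, nx in zip(j, end)):
                continue
            column = [seqs[t][j[t] - 1] if b[t] else GAP for t in range(k)]
            candidate = a[i] + sp_column(column)
            if j not in a:
                a[j] = candidate
                heapq.heappush(pool, j)
            else:
                a[j] = max(a[j], candidate)
    return a.get(end)

print(bounded_msa(["AT", "A", "T"], L=-5))
```

Because every cell that feeds into j is lexicographically smaller than j, the value a[j] is final by the time j is removed from the pool. A very small L makes every cell relevant and the procedure degenerates to the full dynamic programming table.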

Star alignment
Let $s_1, \dots, s_k$ be k sequences that we want to align.
Pick one of the sequences, $s_c$, as the center.
– For each index $i \ne c$, find an optimal alignment between $s_i$ and $s_c$.
– Aggregate these alignments using the "once a gap, always a gap" principle.
Start with one pairwise alignment and keep adding the alignment with respect to another string, using $s_c$ as a guide and adding gaps when necessary.
One way to select $s_c$ is to try all possibilities and pick the one that results in the best score.
Another way is to compute all optimal pairwise alignments and select as the center the string that maximizes $\sum_{i \ne c} \text{sim}(s_i, s_c)$.
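
The star procedure can be sketched as follows. The scoring values (+1, -1, -2), the function names and the sample sequences are assumptions for illustration. The center is chosen by the second rule above (largest sum of pairwise similarities), and `merge` applies "once a gap, always a gap": columns already opened in the center stay open, and a new gap in the center opens a space column in every row merged so far.

```python
GAP, MATCH, MISMATCH, SPACE = "-", 1, -1, -2  # assumed scoring

def s(a, b):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return SPACE
    return MATCH if a == b else MISMATCH

def align(x, y):
    """Global pairwise alignment; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * SPACE
    for j in range(1, m + 1):
        a[0][j] = j * SPACE
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          a[i - 1][j] + SPACE, a[i][j - 1] + SPACE)
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:                      # trace back one optimal path
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + SPACE:
            ax.append(x[i - 1]); ay.append(GAP); i -= 1
        else:
            ax.append(GAP); ay.append(y[j - 1]); j -= 1
    return a[n][m], "".join(reversed(ax)), "".join(reversed(ay))

def merge(rows, c_al, o_al):
    """rows[0] is the gapped center; add o_al, aligned to the center as c_al."""
    master, out, new, p, q = rows[0], [[] for _ in rows], [], 0, 0
    while p < len(master) or q < len(c_al):
        m_gap = p < len(master) and master[p] == GAP
        c_gap = q < len(c_al) and c_al[q] == GAP
        if m_gap and not c_gap:                # existing gap column
            for r, row in zip(out, rows):
                r.append(row[p])
            new.append(GAP); p += 1
        elif c_gap and not m_gap:              # new gap: pad all old rows
            for r in out:
                r.append(GAP)
            new.append(o_al[q]); q += 1
        else:                                  # same center symbol or shared gap
            for r, row in zip(out, rows):
                r.append(row[p])
            new.append(o_al[q]); p += 1; q += 1
    return ["".join(r) for r in out] + ["".join(new)]

def star_align(seqs):
    k = len(seqs)
    pair = {(i, j): align(seqs[i], seqs[j])
            for i in range(k) for j in range(k) if i != j}
    center = max(range(k),
                 key=lambda c: sum(pair[c, i][0] for i in range(k) if i != c))
    rows, order = [seqs[center]], [center]
    for i in range(k):
        if i != center:
            _, c_al, o_al = pair[center, i]
            rows = merge(rows, c_al, o_al)
            order.append(i)
    result = [None] * k
    for index, row in zip(order, rows):
        result[index] = row                    # restore the input order
    return center, result

center, rows = star_align(["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"])
print(center)
print("\n".join(rows))
```

By construction, the pairwise alignment that the result induces between the center and any other sequence is the optimal one that was merged in; no such guarantee holds for pairs that do not involve the center.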

Tree alignment
Motivation: Sometimes we have an evolutionary tree for the sequences involved.
– In that case we can compute the overall similarity based on pairwise alignments along tree edges.
Input: k sequences and a tree whose leaves are these sequences.
Goal: Find an assignment of sequences to the internal nodes of the tree so that the sum of the similarities between the sequences along the edges is maximized.
Tree alignment is NP-hard, but approximation algorithms exist.
Note: Star alignment can be viewed as a special case of tree alignment.
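
The objective can be made concrete by scoring one given assignment: add up sim over the tree edges. The sketch below only evaluates a candidate labeling and does not search for the best one. The scoring values, the function names and the small tree are hypothetical.

```python
def sim(x, y, match=1, mismatch=-1, space=-2):
    """Best global pairwise alignment score (assumed scoring), row by row."""
    prev = [j * space for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * space]
        for j in range(1, len(y) + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + pair, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def tree_score(edges, label):
    """Sum of similarities along the edges for a full labeling of the nodes."""
    return sum(sim(label[u], label[v]) for u, v in edges)

# Hypothetical tree: internal nodes "r" and "u", leaves "a", "b", "c".
edges = [("r", "u"), ("u", "a"), ("u", "b"), ("r", "c")]
label = {"a": "AT", "b": "A", "c": "AT", "u": "AT", "r": "AT"}
print(tree_score(edges, label))
```

A tree with a single internal node joined to every leaf, labeled with one of the input sequences, gives the sum that star alignment uses to choose its center.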
