Presentation on theme: "Computational Genomics Lecture #3a"— Presentation transcript:
1 Computational Genomics Lecture #3a: Multiple sequence alignment
Background Readings: Chapters 2.5, 2.7 in the text book, Biological Sequence Analysis, Durbin et al., 2001. Chapters , in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997. Chapter 15 in Gusfield’s book. p. 81 in Kanehisa’s book.
Much of this class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
2 Ladies and Gentlemen, Boys and Girls, the holy grail: Multiple Sequence Alignment
4 Multiple Sequence Alignment
Aligning more than two sequences.
Definition: Given strings $S_1, S_2, \ldots, S_k$, a multiple (global) alignment maps them to strings $S'_1, S'_2, \ldots, S'_k$ that may contain blanks, where:
$|S'_1| = |S'_2| = \cdots = |S'_k|$
The removal of spaces from $S'_i$ leaves $S_i$.
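The two conditions in the definition can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_alignment` and the use of "-" as the blank symbol are assumptions, not part of the lecture.

```python
def is_valid_alignment(originals, aligned, gap="-"):
    """Check the two conditions of a global multiple alignment.

    originals: the input strings S_1..S_k
    aligned:   the padded strings S'_1..S'_k (gap marks a blank)
    """
    if len(originals) != len(aligned):
        return False
    # Condition 1: all padded strings have the same length.
    same_length = len({len(row) for row in aligned}) == 1
    # Condition 2: removing the blanks from S'_i gives back S_i.
    recovers = all(row.replace(gap, "") == s
                   for s, row in zip(originals, aligned))
    return same_length and recovers


if __name__ == "__main__":
    print(is_valid_alignment(["accdb", "cadbd"], ["ac-cdb-", "-c-adbd"]))
```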
5 Multiple alignments
We use a matrix to represent the alignment of k sequences, $K = (x_1, \ldots, x_k)$. We assume no column consists solely of blanks.
The common scoring functions give a score to each column, and set $\text{score}(K) = \sum_i \text{score}(\text{column}(i))$.
[Figure: example alignment matrix with rows x1, x2, x3, x4; surviving entries: M Q _ I L R - K P V]
For k=10, a scoring function has $2^k - 1 > 1000$ entries to specify. The scoring function is symmetric, so the order of arguments does not matter: score(I,_,I,V) = score(_,I,I,V).
6 SUM OF PAIRS
A common scoring function is SP, the sum of scores of the projected pairwise alignments: $\text{SP-score}(K) = \sum_{i<j} \text{score}(x_i, x_j)$.
Note that we need to specify score(-,-), because a column may have several blanks (as long as not all entries are blanks).
In order for this score to be written as $\sum_i \text{score}(\text{column}(i))$, we set score(-,-) = 0. Why? Because these entries appear in the sum of columns but not in the sum of projected pairwise alignments (lines).
7 SUM OF PAIRS
Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all projected pairwise alignments induced by A, where the pairwise alignment function $\text{score}(x_i, x_j)$ is additive.
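8 Example
Consider the following alignment:
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance, with $\delta(x,x) = 0$ and $\delta(x,y) = 1$ for $x \ne y$, this alignment has an SP value of 3 + 4 + 5 = 12 (rows 1 and 2 differ in 3 columns, rows 1 and 3 in 4, rows 2 and 3 in 5).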
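The SP value of the three-row example can be recomputed with a few lines of Python. This is a minimal sketch that assumes unit edit costs (0 for equal symbols, including two blanks, and 1 otherwise); the names `unit_cost` and `sp_value` are illustrative.

```python
from itertools import combinations


def unit_cost(a, b):
    """Assumed edit-distance cost: 0 if the symbols agree (gap/gap too), else 1."""
    return 0 if a == b else 1


def sp_value(rows, cost=unit_cost):
    """Sum, over all pairs of rows, of the column-wise cost of that pair."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(cost(a, b)
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))


rows = ["ac-cdb-", "-c-adbd", "a-bcdad"]
print(sp_value(rows))  # prints 12
```

Because `unit_cost("-", "-")` is 0, summing by pairs of rows and summing by columns give the same total.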
9 Multiple Sequence Alignment
Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes $\text{SP-score}(K) = \sum_{i<j} \text{score}(x_i, x_j)$.
Instead of a 2-dimensional table, we now have a k-dimensional table to fill.
For each vector $i = (i_1, \ldots, i_k)$, compute an optimal multiple alignment for the k prefix sequences $x_1(1, \ldots, i_1), \ldots, x_k(1, \ldots, i_k)$.
The adjacent entries are those whose index differs by one or zero in each coordinate. Each entry depends on $2^k - 1$ adjacent entries.
10 The idea via K=2
Recall the notation and the recurrence for V, which computes V[i+1,j+1] from the three cells V[i,j], V[i+1,j] and V[i,j+1].
Note that the new cell index (i+1,j+1) differs from the previous indices by one of the $2^k - 1 = 3$ non-zero binary vectors (1,1), (1,0), (0,1).
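For k=2 the table can be filled as in the sketch below. The scoring values in `simple_score` (match +1, mismatch -1, blank -2) are assumptions chosen only for illustration, and the table here is indexed so that V[i][j] is the best score for the first i and j characters.

```python
def simple_score(a, b):
    """Hypothetical similarity scores: match +1, mismatch -1, blank -2."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def pairwise_global_score(s, t, score=simple_score):
    """Fill the 2-dimensional table V and return the optimal global score."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: s against blanks
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):          # first row: blanks against t
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # offset (1,1)
                V[i - 1][j] + score(s[i - 1], "-"),           # offset (1,0)
                V[i][j - 1] + score("-", t[j - 1]),           # offset (0,1)
            )
    return V[n][m]


print(pairwise_global_score("AC", "AG"))  # prints 0
```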
11 Multiple Sequence Alignment
Given k strings of length n, there is a generalization of the dynamic programming algorithm that finds an optimal SP alignment.
Computational cost:
Instead of a 2-dimensional table we now have a k-dimensional table to fill.
Each dimension’s size is n+1. Each entry depends on $2^k - 1$ adjacent entries.
Number of evaluations of the scoring function: $O(2^k n^k)$.
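A compact way to see the k-dimensional table is a memoized recursion over index vectors. The following is a sketch under assumptions, not the lecture's implementation: the scoring function `simple_score` is hypothetical (with score(-,-) = 0, as the SP score requires), and the recursion is only practical for very small k and n.

```python
from functools import lru_cache
from itertools import combinations, product


def simple_score(a, b):
    """Hypothetical scores: gap/gap 0, symbol/gap -2, match +1, mismatch -1."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def optimal_sp_score(seqs, score=simple_score):
    """Maximum SP score over all global alignments of seqs (exponential in k)."""
    k = len(seqs)
    # The 2^k - 1 non-zero binary vectors: which sequences advance in a column.
    offsets = [v for v in product((0, 1), repeat=k) if any(v)]

    def column_score(col):
        return sum(score(a, b) for a, b in combinations(col, 2))

    @lru_cache(maxsize=None)
    def best(cell):
        if not any(cell):               # the all-zero cell: empty prefixes
            return 0
        candidates = []
        for off in offsets:
            prev = tuple(c - o for c, o in zip(cell, off))
            if min(prev) < 0:           # would step outside the table
                continue
            # Sequences that advance contribute a character, the rest a blank.
            col = tuple(seqs[r][prev[r]] if off[r] else "-" for r in range(k))
            candidates.append(best(prev) + column_score(col))
        return max(candidates)

    return best(tuple(len(s) for s in seqs))


print(optimal_sp_score(["AC", "AC", "AC"]))  # prints 6
```

Each call to `best` looks at up to $2^k - 1$ predecessor cells, which is where the $O(2^k n^k)$ count of evaluations comes from.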
12 Complexity of the DP approach
Number of cells: $n^k$.
Number of adjacent cells: $O(2^k)$.
Computation of the SP score for each column(i,b) is $O(k^2)$.
Total run time is $O(k^2 2^k n^k)$, which is totally unacceptable!
Maybe one can do better?
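To get a feel for how fast this bound grows, the product $k^2 2^k n^k$ can simply be evaluated. The helper below is illustrative only: it ignores constant factors, and the chosen values of k and n are arbitrary.

```python
def naive_msa_cost(k, n):
    """Evaluate k^2 * 2^k * n^k, the operation count behind the O() bound."""
    return k**2 * 2**k * n**k


for k in (2, 3, 5, 10):
    print(k, naive_msa_cost(k, 100))
```

For two sequences of length 100 the count is 160000; for ten such sequences it is already about $10^{25}$.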
13 But MSA is Intractable
Not much hope for a polynomial algorithm, because the problem has been shown to be NP-complete (the proof is quite tricky and recent; some previous proofs were bogus).
Look at Isaac Elias’ presentation of the NP-completeness proof.
Need a heuristic or an approximation to reduce time.
14 Multiple Sequence Alignment – Approximation Algorithm
Now we will see an $O(k^2 n^2)$ multiple alignment algorithm for the SP-score that approximates the optimal solution’s score within a factor of at most $2(1 - 1/k) < 2$.
15 Star Alignments
Rather than summing up all pairwise alignments, select a fixed sequence $S_1$ as a center, and set $\text{Star-score}(K) = \sum_{j>1} \text{score}(S_1, S_j)$.
The algorithm to find an optimal alignment: at each step, add another sequence aligned with $S_1$, keeping old gaps and possibly adding new ones (i.e. keeping the old alignment intact).
16 Multiple Sequence Alignment – Approximation Algorithm
Polynomial time algorithm:
Assumption: the function δ is a distance function; in particular, it satisfies the triangle inequality $\delta(x,z) \le \delta(x,y) + \delta(y,z)$.
Let D(S,T) be the value of the minimum global alignment between S and T.
17 Multiple Sequence Alignment – Approximation Algorithm (cont.)
Polynomial time algorithm:
The input is a set Γ of k strings $S_i$.
1. Find the “center string” $S_1$ that minimizes $\sum_{S \in \Gamma \setminus \{S_1\}} D(S_1, S)$.
2. Call the remaining strings $S_2, \ldots, S_k$.
3. Add the strings one at a time to a multiple alignment that initially contains only $S_1$, as follows: Suppose $S_1, \ldots, S_{i-1}$ are already aligned as $S'_1, \ldots, S'_{i-1}$. Add $S_i$ by running the dynamic programming algorithm on $S'_1$ and $S_i$ to produce $S''_1$ and $S'_i$. Adjust $S'_2, \ldots, S'_{i-1}$ by adding gaps to those columns where gaps were added to get $S''_1$ from $S'_1$. Replace $S'_1$ by $S''_1$.
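The three steps can be sketched in Python as follows. This is one possible reading of the slide, not the lecturers' code: it assumes unit edit costs with δ(-,-) = 0, and the names `unit_cost`, `align_pair` and `center_star` are introduced here for illustration.

```python
def unit_cost(a, b):
    """Assumed distance: 0 for equal symbols (gap/gap included), 1 otherwise."""
    return 0 if a == b else 1


def align_pair(s, t, cost=unit_cost):
    """Minimum-cost global alignment of s and t.

    Returns (distance, aligned_s, aligned_t, inserted); inserted[c] is True
    when column c holds a gap newly inserted into s.
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost(s[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + cost("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),
                          D[i - 1][j] + cost(s[i - 1], "-"),
                          D[i][j - 1] + cost("-", t[j - 1]))
    # Trace back from the last cell to recover one optimal alignment.
    a, b, inserted = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost(s[i - 1], t[j - 1]):
            a.append(s[i - 1])
            b.append(t[j - 1])
            inserted.append(False)
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + cost(s[i - 1], "-"):
            a.append(s[i - 1])
            b.append("-")
            inserted.append(False)
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            inserted.append(True)
            j -= 1
    a.reverse()
    b.reverse()
    inserted.reverse()
    return D[n][m], "".join(a), "".join(b), inserted


def center_star(strings, cost=unit_cost):
    """Return (order, rows): input indices in row order and the aligned rows."""
    k = len(strings)
    # Step 1: the center minimizes the sum of distances to all other strings.
    dist = [[align_pair(s, t, cost)[0] for t in strings] for s in strings]
    center = min(range(k), key=lambda i: sum(dist[i]))
    order, rows = [center], [strings[center]]
    # Steps 2 and 3: add the remaining strings one at a time.
    for i in range(k):
        if i == center:
            continue
        _, new_center, new_row, inserted = align_pair(rows[0], strings[i], cost)
        updated = []
        for row in rows[1:]:
            chars = iter(row)
            # Copy each old row, adding a gap wherever the center gained one.
            updated.append("".join("-" if ins else next(chars) for ins in inserted))
        rows = [new_center] + updated + [new_row]
        order.append(i)
    return order, rows


order, rows = center_star(["accdb", "cadbd", "abcdad"])
for idx, row in zip(order, rows):
    print(idx, row)
```

The first printed row is the center string; every other row, read against the first, is an optimal pairwise alignment with the center.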
18 Multiple Sequence Alignment – Approximation Algorithm (cont.)
Time analysis:
Choosing $S_1$: running the dynamic programming algorithm $\binom{k}{2}$ times, which takes $O(k^2 n^2)$.
When $S_i$ is added to the multiple alignment, the length of $S'_1$ is at most $i \cdot n$, so the time to add all k strings is $\sum_{i=1}^{k-1} O((i \cdot n) \cdot n) = O(k^2 n^2)$.
19 Multiple Sequence Alignment – Approximation Algorithm (cont.)
Performance analysis:
M: the alignment produced by this algorithm.
d(i,j): the distance M induces on the pair $S_i, S_j$.
M*: an optimal alignment.
For all i, $d(1,i) = D(S_1, S_i)$ (we performed an optimal alignment between $S'_1$ and $S_i$, and $\delta(-,-) = 0$).
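From these definitions the factor $2(1 - 1/k)$ follows by a standard argument (as in Gusfield's treatment of the center star method), sketched below. The symbols $v(M) = \sum_{i<j} d(i,j)$ and $d^*(i,j)$, the distance that M* induces on $S_i, S_j$, are notation introduced here.

```latex
\begin{align*}
2\,v(M) &= \sum_{i=1}^{k}\sum_{j \ne i} d(i,j)
   \le \sum_{i=1}^{k}\sum_{j \ne i} \bigl(d(i,1) + d(1,j)\bigr)
   = 2(k-1)\sum_{l=2}^{k} D(S_1,S_l), \\
2\,v(M^*) &= \sum_{i=1}^{k}\sum_{j \ne i} d^*(i,j)
   \ge \sum_{i=1}^{k}\sum_{j \ne i} D(S_i,S_j)
   \ge k\sum_{l=2}^{k} D(S_1,S_l), \\
\frac{v(M)}{v(M^*)} &\le \frac{2(k-1)}{k} = 2\Bigl(1 - \frac{1}{k}\Bigr) < 2 .
\end{align*}
```

The first line uses the triangle inequality, which holds for d column by column because δ satisfies it, together with $d(1,l) = D(S_1,S_l)$. The second line uses that any induced pairwise alignment costs at least the optimal pairwise distance, and that the center $S_1$ minimizes the sum of distances to the other strings.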
21 Multiple Sequence Alignment – Approximation Algorithm
The algorithm relies heavily on the scoring function being a distance. It produces an alignment whose SP score is at most twice the minimum.
What if the scoring function was a similarity? Can we get an efficient algorithm whose score is half the maximum? A third of the maximum? …
We dunno!
22 Tree Alignments
Assume that there is a tree T=(V,E) whose leaves are the input sequences.
We want to associate a sequence with each internal node.
$\text{Tree-score}(K) = \sum_{(i,j) \in E} \text{score}(x_i, x_j)$.
Finding the optimal assignment of sequences to the internal nodes is NP-hard.
We will meet this problem again in the study of phylogenetic trees (it is related to the parsimony problem).
23 Multiple Sequence Alignment Heuristics
Example: 4 sequences A, B, C, D.
A. Perform all 6 pairwise alignments. Find scores. Build a “similarity tree”.
[Figure: similarity tree over the leaves B, D, A, C, with an axis running from similar to distant]
B. Multiple alignment following the tree from A.
Align the most similar pair (B and D), allowing gaps to optimize the alignment.
Align the next most similar pair (A and C).
Now “align the alignments”, introducing gaps if necessary to optimize the alignment of (BD) with (AC).
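Step A can be sketched as a greedy clustering of the pairwise results. The code below is only an illustration: it assumes pairwise distances (small means similar) rather than scores, merges clusters by average distance, and uses made-up numbers. It is not the tree-building method of any particular program.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Greedy average-linkage clustering; returns the tree as nested tuples.

    dist maps frozenset({a, b}) to the pairwise distance of sequences a and b.
    """
    # Each cluster is (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two most similar clusters.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Hypothetical distances in which B,D are closest and A,C are next.
dist = {
    frozenset({"B", "D"}): 1, frozenset({"A", "C"}): 2,
    frozenset({"A", "B"}): 5, frozenset({"A", "D"}): 5,
    frozenset({"B", "C"}): 5, frozenset({"C", "D"}): 5,
}
print(guide_tree(["A", "B", "C", "D"], dist))  # (('B', 'D'), ('A', 'C'))
```

The nested tuples give the order for step B: align B with D, then A with C, then align the two alignments.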
24 The tree-based progressive method for multiple sequence alignment, used in practice (Clustal)
(a) a tree (dendrogram) obtained by “cluster analysis”
(b) pairwise alignment of sequences’ alignments.
[Figure: panel (a) is a dendrogram over the sequences DEHUG3, DEPGG3, DEBYG3, DEZYG3, DEBSGF; panel (b) shows the aligned rows below]
L W R D G R G A L Q
L W R G G R G A A Q
D W R - G R T A S G
L R R - A R T A S A
L - R G A R A A A E
(modified from Speed’s ppt presentation, see p. 81 in Kanehisa’s book)