
1. Pairwise and Multiple Alignment. Lecture #4. This class has been edited from Nir Friedman's lecture, which is available at www.cs.huji.ac.il/~nir; changes made by Dan Geiger and Shlomo Moran. Background readings: Chapter 6 in Biological Sequence Analysis, Durbin et al., 2001, and Sections 3.4 and 3.5 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997.

2. Log Odds-Ratio Test for Alignment. Taking the logarithm of Q yields log Q = Σ_i log [ p(s[i], t[i]) / (p(s[i]) · p(t[i])) ]. If log Q > 0, then s and t are more likely to be related; if log Q < 0, they are more likely to be unrelated. How can we relate this quantity to a score function? Define score(s[i], t[i]) = log [ p(s[i], t[i]) / (p(s[i]) · p(t[i])) ], so that log Q = Σ_i score(s[i], t[i]).
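A minimal Python sketch of this test for an ungapped alignment; the pair probabilities and background frequencies below are invented for illustration only, not taken from the lecture.

```python
import math

# p_pair[(a, b)] plays the role of p(a, b) (probability of the aligned pair
# under the "related" model) and p_bg[a] the background frequency p_a.
# All numbers here are made up so that p_pair is a valid joint distribution.
p_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_pair = {(a, b): (0.15 if a == b else 0.1 / 3) for a in p_bg for b in p_bg}

def score(a, b):
    """Per-position score: log [ p(a, b) / (p(a) * p(b)) ]."""
    return math.log(p_pair[(a, b)] / (p_bg[a] * p_bg[b]))

def log_Q(s, t):
    """log Q for two equal-length, ungapped aligned sequences s and t."""
    assert len(s) == len(t)
    return sum(score(a, b) for a, b in zip(s, t))

s, t = "ACGTAC", "ACGTTC"
print(log_Q(s, t), "related" if log_Q(s, t) > 0 else "unrelated")
```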

3. Estimating p(·,·) for proteins. Generate a large, diverse collection of accepted mutations. An accepted mutation is a mutation observed in an alignment of closely related protein sequences, for example the hemoglobin alpha chain in humans and other organisms (homologous proteins). Let p_a = n_a / n, where n_a is the number of occurrences of letter a and n is the total number of letters in the collection, so n = Σ_a n_a. Mutation counts: let f_ab be the number of mutations a ↔ b, let f_a = Σ_{b≠a} f_ab be the total number of mutations that involve a, and let f = Σ_a f_a be the total number of amino acids involved in a mutation. Note that f is twice the number of mutations.

4. PAM-1 matrices. Define M_ab to be the symmetric probability matrix for switching between a and b: for b ≠ a, M_ab = m_a · f_ab / f_a, where m_a is the probability that a is changed by a mutation, and M_aa = 1 − m_a. We set m_a so that M_aa = 1 − m_a is 99%, since we require that 1% of amino acids change according to PAM-1. Hence the name: 1-Percent Accepted Mutation (PAM). In other words, Σ_a p_a · M_aa = 0.99.

5. In general: how do we compute m_a so that 1/K of the letters are mutated? We have three constraints:
1. [#(mutations of a)] / [2 · #(all mutations)] = f_a / f
2. #(all mutations) = n / K
3. #(occurrences of a) = p_a · n
Thus we have: m_a = #(mutations of a) / #(occurrences of a) = f_a / (K · f · p_a).

6. PAM-1 matrices. In PAM-1 matrices we want 1% of the letters to change, so we take K = 100 (K = 50 would yield a 2% change, etc.). And indeed: Σ_a p_a · m_a = Σ_a p_a · f_a / (K · f · p_a) = Σ_a f_a / (K · f) = f / (K · f) = 1/K.
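A minimal numpy sketch of this construction, using an invented 3-letter alphabet and made-up counts rather than real PAM data, to show how the formulas above fit together.

```python
import numpy as np

# Invented data: letter counts and symmetric mutation counts for {A, B, C}.
alphabet = ["A", "B", "C"]
n_a = np.array([600.0, 300.0, 100.0])        # occurrences of each letter
f_ab = np.array([[0.0, 60.0, 20.0],          # symmetric mutation counts
                 [60.0, 0.0, 10.0],
                 [20.0, 10.0, 0.0]])
K = 100

p = n_a / n_a.sum()                           # background frequencies p_a
f_a = f_ab.sum(axis=1)                        # mutations involving each letter
f = f_a.sum()                                 # twice the number of mutations

m = f_a / (K * f * p)                         # probability that a letter mutates
M = m[:, None] * f_ab / f_a[:, None]          # off-diagonal: m_a * f_ab / f_a
np.fill_diagonal(M, 1.0 - m)                  # diagonal: M_aa = 1 - m_a

assert np.allclose(M.sum(axis=1), 1.0)        # each row is a distribution
print("expected fraction changed:", p @ m)    # should be 1/K = 0.01
```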

7. Evolutionary distance. The choice that 1% of the amino acids change (and that K = 100) is quite arbitrary. It could fit a specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated. This is a unit of evolutionary change, not of time, because evolution acts differently on distinct sequence types. What is the substitution matrix for k units of evolutionary change?

8. Model of Evolution. We make some assumptions:
1. Each position changes independently of the rest.
2. The probability of mutation is the same in each position.
3. Evolution does not "remember".
[Figure: one position tracked at times t, t+Δ, t+2Δ, t+3Δ, t+4Δ; e.g. one position evolving A → A → C → C → G and another T → T → T → C → G.]

9. Model of Evolution. How do we model such a process? This process is called a Markov chain. A chain is defined by the transition probability P(X_{t+Δ} = b | X_t = a), the probability that the next state is b given that the current state is a. We often describe these probabilities by a matrix: M[Δ]_ab = P(X_{t+Δ} = b | X_t = a).

10. Multi-Step Changes. Based on M[Δ]_ab, we can compute the probabilities of changes over two time periods using conditional independence (no memory): M[2Δ]_ab = Σ_c M[Δ]_ac · M[Δ]_cb. Thus M[2Δ] = M[Δ] · M[Δ], and by induction (homework exercise) M[nΔ] = M[Δ]^n.
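A small numpy check of this identity; the 4×4 one-step matrix below is made up for illustration.

```python
import numpy as np

# A made-up 1-step transition matrix over {A, C, G, T}; each row sums to 1.
M = np.array([[0.97, 0.01, 0.01, 0.01],
              [0.01, 0.97, 0.01, 0.01],
              [0.01, 0.01, 0.97, 0.01],
              [0.01, 0.01, 0.01, 0.97]])

# M[2*Delta]_ab = sum_c M_ac * M_cb  (condition on the intermediate state c)
M2_explicit = np.array([[sum(M[a, c] * M[c, b] for c in range(4))
                         for b in range(4)] for a in range(4)])

assert np.allclose(M2_explicit, M @ M)                # matrix product
assert np.allclose(np.linalg.matrix_power(M, 5),      # M[n*Delta] = M^n
                   M @ M @ M @ M @ M)
```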

11. A Markov Model (chain). [Figure: a chain of nodes X_1 → X_2 → ... → X_{n-1} → X_n.] Every variable X_i has a domain; for example, suppose the domain is the set of letters {a, c, t, g}. Every variable is associated with a local probability table P(X_i = x_i | X_{i-1} = x_{i-1}), and X_1 with P(X_1 = x_1). The joint distribution is given by P(x_1, ..., x_n) = P(x_1) · Π_{i=2..n} P(x_i | x_{i-1}). In short, we write P(x_1, ..., x_n) = Π_i P(x_i | Pa_i), where Pa_i are the parents of variable/node X_i, namely none or X_{i-1}.
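A short sketch of this joint-probability formula; the prior and transition table are invented placeholders, not parameters from the lecture.

```python
import numpy as np

# P(x_1,...,x_n) = P(x_1) * prod_i P(x_i | x_{i-1}) for a stationary chain.
letters = {"a": 0, "c": 1, "g": 2, "t": 3}
prior = np.array([0.25, 0.25, 0.25, 0.25])       # P(x_1) = (p_a, p_c, p_g, p_t)
M = np.full((4, 4), 0.01) + np.eye(4) * 0.96     # stationary transition table

def joint_prob(seq):
    idx = [letters[ch] for ch in seq]
    prob = prior[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        prob *= M[prev, cur]                     # P(x_i | x_{i-1})
    return prob

print(joint_prob("acgt"), joint_prob("aaaa"))
```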

12. Markov Model of Evolution Revisited. In the evolution model we studied earlier we had P(x_1) = (p_a, p_c, p_g, p_t), which sums to 1 and is called the prior probability, and P(x_i | x_{i-1}) = M[Δ], a stationary transition probability table that does not depend on the index i. The quantity we computed earlier from this model was the joint probability table. [Figure: the chain X_1 → X_2 → ... → X_{n-1} → X_n with each edge labeled M.]

13. Longer Term Changes. Estimate M[Δ] = M (the PAM-1 matrix) and use M[nΔ] = M^n (the PAM-n matrices). Define the n-step pair probability p[nΔ](a, b) = p_a · (M^n)_ab, and use this quantity to define the score for your application of interest (for example, a log odds score log [ p[nΔ](a, b) / (p_a · p_b) ] = log [ (M^n)_ab / p_b ]).
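A sketch of turning a 1-step matrix into an n-step log-odds scoring matrix along these lines; M and p reuse the invented values from the earlier sketches and are not real PAM data.

```python
import numpy as np

p = np.array([0.25, 0.25, 0.25, 0.25])     # background frequencies
M = np.full((4, 4), 0.01) + np.eye(4) * 0.96

def pam_score_matrix(M, p, n):
    """score_n(a, b) = log( (M^n)_ab / p_b ), using M[n*Delta] = M^n."""
    Mn = np.linalg.matrix_power(M, n)
    return np.log(Mn / p[None, :])

print(np.round(pam_score_matrix(M, p, 250), 2))
```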

14. Comments regarding PAM. Historically, researchers used PAM-250 (the only matrix published in the original paper). The original PAM matrices were based on a small number of proteins (circa 1978); later versions use many more examples. PAM used to be the most popular scoring rule, but there are some problems with PAM matrices.

15. Degrees of freedom in the PAM definition. With K = 100, the 1-PAM matrix is given by the construction above with m_a = f_a / (100 · f · p_a). With K = 50 the basic matrix is different, namely m_a = f_a / (50 · f · p_a), so that 2% of the letters change per step. Thus we have two different ways to estimate the matrix M[4Δ]: use the 1-PAM matrix to the fourth power, M[4Δ] = M[Δ]^4, or use the K = 50 matrix to the second power, M[4Δ] = M[2Δ]^2.

16. Problems in building distance matrices. How do we find pairs of aligned sequences? How far back is the ancestor? Earlier divergence → low sequence similarity; later divergence → high sequence similarity. For example, M[250Δ] is known not to reflect long-period changes well (see p. 43 in Durbin et al.). Does one letter mutate into the other, or are both mutations of a third letter?

17. BLOSUM Outline. Idea: use aligned, ungapped regions of protein families; these are assumed to have a common ancestor. Similar ideas to PAM, but with better statistics and modeling. Procedure: cluster together sequences in a family whenever they share more than L% identical residues; count the number of substitutions across different clusters (in the same family); estimate frequencies using the counts. Practice: BLOSUM50 (i.e., L = 50) and BLOSUM62 are widely used (see pages 43-44 in Durbin et al.) and are considered state of the art nowadays.
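A much-simplified sketch of the clustering step only, under the assumption of single-linkage merging at the L% identity threshold; the real BLOSUM construction additionally weights substitution counts by cluster sizes, which is omitted here, and the block of sequences is invented.

```python
def percent_identity(s, t):
    """Percent identity of two equal-length, ungapped aligned sequences."""
    return 100.0 * sum(a == b for a, b in zip(s, t)) / len(s)

def cluster(block, L):
    """Greedily merge sequences of one ungapped block at threshold L%."""
    clusters = [[s] for s in block]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(percent_identity(s, t) > L
                       for s in clusters[i] for t in clusters[j]):
                    clusters[i] += clusters.pop(j)   # single-linkage merge
                    merged = True
                    break
            if merged:
                break
    return clusters

block = ["MQAILLL", "MQAILLV", "MKPVLIL", "MPPVLIL"]   # made-up block
print(cluster(block, 50))
```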

18. Multiple Sequence Alignment. S1 = AGGTC, S2 = GTTCG, S3 = TGAAC. Two possible alignments:
AGGT-C-      AGGT-C-
-G-TTCG      GTT--CG
TG-AAC-      -TGAA-C

19. Multiple Sequence Alignment. Aligning more than two sequences. Definition: given strings S_1, S_2, ..., S_k, a multiple (global) alignment maps them to strings S'_1, S'_2, ..., S'_k that may contain blanks, where:
1. |S'_1| = |S'_2| = ... = |S'_k|
2. The removal of the blanks from S'_i leaves S_i.

20. Multiple alignments. We use a matrix to represent the alignment of k sequences, K = (x_1, ..., x_k). We assume no column consists solely of blanks. For example:
x_1: MQ_ILLL
x_2: MLR-LL-
x_3: MK_ILLL
x_4: MPPVLIL
The common scoring functions give a score to each column and set score(K) = Σ_i score(column(i)). For k = 10, a scoring function has 2^k − 1 > 1000 entries to specify. The scoring function is symmetric: the order of its arguments does not matter, e.g. score(I, _, I, V) = score(_, I, I, V).

21. SUM OF PAIRS. A common scoring function is SP, the sum of the scores of the projected pairwise alignments: SPscore(K) = Σ_{i<j} score(x_i, x_j). In order for this score to be written as Σ_i score(column(i)), we set score(-, -) = 0. Why? Because such entries appear in the sum over columns but not in the sum over projected pairwise alignments (rows): a column in which both projected rows have a blank disappears from the pairwise projection. Note that we need to specify score(-, -) because a column may contain several blanks (as long as not all of its entries are blanks).

22. SUM OF PAIRS. Definition: the sum-of-pairs (SP) value of a multiple global alignment A of k strings is the sum of the values of all projected pairwise alignments induced by A, where the pairwise alignment scoring function score(x_i, x_j) is additive.

23. Example. Consider the following alignment:
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance with score(x, x) = 0, score(x, y) = 1 for x ≠ y, and score(-, -) = 0, this alignment has an SP value of 3 + 4 + 5 = 12.
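A short Python check of this SP value under the edit-distance scoring just described.

```python
from itertools import combinations

# The three rows of the alignment above.
alignment = ["ac-cdb-",
             "-c-adbd",
             "a-bcdad"]

def column_cost(a, b):
    if a == "-" and b == "-":
        return 0                   # score(-, -) = 0 by convention
    return 0 if a == b else 1      # edit distance: match 0, otherwise 1

def sp_value(rows):
    return sum(column_cost(a, b)
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))

print(sp_value(alignment))         # 3 + 4 + 5 = 12
```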

24. Multiple Sequence Alignment. Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment maximizing SP-score(K) = Σ_{i<j} score(x_i, x_j). Instead of a 2-dimensional table, we now have a k-dimensional table to fill. For each vector i = (i_1, ..., i_k), compute an optimal multiple alignment of the k prefix sequences x_1(1..i_1), ..., x_k(1..i_k). The adjacent entries are those whose indices differ by one or zero in each coordinate; each entry depends on 2^k − 1 adjacent entries.

25. The idea via k = 2. [Figure: the four DP cells V[i, j], V[i+1, j], V[i, j+1], V[i+1, j+1].] Note that the new cell index (i+1, j+1) differs from the previous indices by one of the 2^k − 1 = 3 non-zero binary vectors (1,1), (1,0), (0,1). Recall the notation score(a, b) for the pairwise column score and the following recurrence for V:
V[i+1, j+1] = max { V[i, j] + score(x[i+1], y[j+1]), V[i, j+1] + score(x[i+1], -), V[i+1, j] + score(-, y[j+1]) }.

26. The idea for arbitrary k. Order the vectors i = (i_1, ..., i_k) by increasing order of the sum Σ_j i_j. Set s(0, ..., 0) = 0, and for i > (0, ..., 0):
s(i) = max over b of { s(i − b) + SPscore(column(i, b)) },
where the vector b ranges over all non-zero binary vectors, i − b is the non-negative difference of i and b, and the j-th entry of column(i, b) equals c_j = x_j(i_j) if b_j = 1 and c_j = '-' otherwise (reflecting that b is 1 at location j if that location changed in the "current comparison").
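A minimal Python sketch of this dynamic program, under an assumed toy column score (+1 match, −1 mismatch or letter against blank, 0 for two blanks); it returns only the optimal SP score, not the alignment itself, and is exponential in k as discussed on the next slide.

```python
from itertools import product

def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1

def sp_column_score(col):
    return sum(pair_score(col[i], col[j])
               for i in range(len(col)) for j in range(i + 1, len(col)))

def msa_sp_score(seqs):
    k = len(seqs)
    lens = [len(s) for s in seqs]
    V = {tuple([0] * k): 0}
    # Visit index vectors in increasing order of their coordinate sum.
    cells = sorted(product(*(range(n + 1) for n in lens)), key=sum)
    for i in cells[1:]:
        best = None
        for b in product((0, 1), repeat=k):        # 2^k - 1 non-zero vectors
            if not any(b) or any(i[j] - b[j] < 0 for j in range(k)):
                continue
            prev = tuple(i[j] - b[j] for j in range(k))
            col = [seqs[j][i[j] - 1] if b[j] else "-" for j in range(k)]
            cand = V[prev] + sp_column_score(col)
            if best is None or cand > best:
                best = cand
        V[i] = best
    return V[tuple(lens)]

print(msa_sp_score(["AGGTC", "GTTCG", "TGAAC"]))   # SP optimum for the toy sequences
```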

27. Complexity of the DP approach. Number of cells: n^k. Number of adjacent cells: O(2^k). Computing the SP score of each column(i, b) takes O(k^2) time. The total running time is O(k^2 · 2^k · n^k), which is utterly unacceptable! There is not much hope for a polynomial algorithm, because the problem has been shown to be NP-complete. We need heuristics to reduce the running time.

28. Time saving heuristics: relevance tests. Heuristic: avoid computing score(i) for irrelevant vectors. Let L be a lower bound on the optimal SP score of a multiple alignment of the k sequences; such a lower bound can be obtained from an arbitrary multiple alignment, computed in any way. Main idea: using L, compute lower bounds L_uv on the optimal score for every two sequences s = x_u and t = x_v, 1 ≤ u < v ≤ k. When processing a vector i = (..., i_u, ..., i_v, ...), the relevant cells are those for which, in every projection onto x_u and x_v, the optimal pairwise score through the projected cell is above L_uv.

29. Recall the linear-space algorithm. F[i, j] = d(s[1..i], t[1..j]) is the best score of an alignment of the prefixes, and B[i, j] = d(s[i+1..n], t[j+1..m]) is the best score of an alignment of the suffixes, so F[i, j] + B[i, j] is the score of the best alignment through (i, j). These computations can be done in linear space. Build such a table for every two sequences s = x_u and t = x_v, 1 ≤ u < v ≤ k; the (i_u, i_v) entry encodes the optimum of the pairwise alignment through (i_u, i_v).
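A small sketch of the F + B identity with a simple +1/−1 scoring (illustrative only); full tables are kept here for clarity, even though each row only needs its predecessor, which is the linear-space point.

```python
def score_rows(s, t, match=1, mismatch=-1, gap=-1):
    """All rows of the global-alignment DP table; row i needs only row i-1."""
    rows = [[j * gap for j in range(len(t) + 1)]]
    for i in range(1, len(s) + 1):
        prev, cur = rows[-1], [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        rows.append(cur)
    return rows

s, t = "AGGTC", "GTTCG"
F = score_rows(s, t)                       # F[i][j] = d(s[1..i], t[1..j])
B_rev = score_rows(s[::-1], t[::-1])       # DP on the reversed strings
B = [[B_rev[len(s) - i][len(t) - j] for j in range(len(t) + 1)]
     for i in range(len(s) + 1)]           # B[i][j] = d(s[i+1..n], t[j+1..m])

best = F[len(s)][len(t)]
i = 2                                      # any fixed row
assert max(F[i][j] + B[i][j] for j in range(len(t) + 1)) == best
```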

30. Time saving heuristics: relevance test. But can we go over all cells and determine whether each is relevant or not? No, there are too many cells. Instead, start with (0, ..., 0) and keep adding relevant entries to the list until reaching (n_1, ..., n_k).

31. Multiple Sequence Alignment – Approximation Algorithm. In the tutorial you will see an O(k^2 n^2) multiple alignment algorithm for the SP score that errs by a factor of at most 2(1 − 1/k) < 2.

32. Star Alignments. Rather than summing up all pairwise alignments, select a fixed sequence x_0 as a center and set Star-score(K) = Σ_{j>0} score(x_0, x_j). The algorithm to find the alignment: at each step, add another sequence aligned with x_0, keeping old gaps and possibly adding new ones.
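A sketch of one common way to pick the center x_0, assuming it is chosen to maximize the sum of pairwise alignment scores against the other sequences; nw_score is a plain global-alignment score with an assumed +1/−1/−1 scheme, and the progressive "keep old gaps" merge step is not shown.

```python
def nw_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score in linear space (score only, no traceback)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]

def choose_center(seqs):
    """Index of the sequence with the largest summed score to all others."""
    return max(range(len(seqs)),
               key=lambda c: sum(nw_score(seqs[c], s)
                                 for k, s in enumerate(seqs) if k != c))

seqs = ["AGGTC", "GTTCG", "TGAAC", "AGTTC"]   # toy input
print("center:", choose_center(seqs))
```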

33. Tree Alignments. Assume that there is a tree T = (V, E) whose leaves are the sequences, and associate a sequence with each internal node. Tree-score(K) = Σ_{(i,j)∈E} score(x_i, x_j). Finding the optimal assignment of sequences to the internal nodes is NP-hard. We will meet this problem again in the study of phylogenetic trees.

