Multiple sequence alignment methods. Corné Hoogendoorn, Denis Miretskiy.


1 Multiple sequence alignment methods
Corné Hoogendoorn, Denis Miretskiy

2 Overview
What a multiple alignment means
Scoring a multiple alignment
Break
Multidimensional dynamic programming
Progressive alignment methods

3 What a multiple alignment means
Homologous residues are aligned in columns
– Structurally homologous: similar 3D structural positions
– Evolutionarily homologous: diverging from a common ancestral residue

4 Multiple alignment - issues
Homologous positions cannot always be identified unambiguously
We need a way to decide which alignment is best
Protein structures and sequences evolve
– Sequences are not entirely superposable

5 Multiple alignment - issues
In principle there is always an unambiguously correct evolutionary alignment
– All sequences descend from a common ancestral sequence
In practice the evolutionary history is nearly impossible to infer
A structural alignment is usually easier to construct

6 Multiple alignment - issues
Sequence diverges even faster than structure
– Protein parts that cannot be aligned structurally cannot be aligned by sequence either
Some parts align very well
– Use these parts to align whatever can be aligned
Disregard the rest when assessing alignment quality
– Supposedly meaningless differences are thereby left out

7 Scoring an alignment
Some positions are more conserved than others
– Position-specific scoring
Sequences are not independent
– They are related to each other by a phylogenetic tree
Specify a complete probabilistic model of molecular sequence evolution

8 Complete probabilistic model
Probabilities of all evolutionary events
Prior probability of the root ancestral sequence
Probabilities of evolutionary change depend on evolutionary time
Position-specific structural and functional constraints
We just don’t have all the necessary data

9 Workable approximations
Assume that all columns are statistically independent
$S(m) = G + \sum_i S(m_i)$
– $S(m)$: score for the multiple alignment $m$
– $G$: gap score/penalty
– $S(m_i)$: score for column $i$ in the multiple alignment $m$

10 Scoring an alignment: notations

11 Minimum entropy: further simplification
We already assumed independence between columns
Sequences within a column are statistically dependent in a complex way if their phylogenetic tree has many intermediate ancestors
We nevertheless assume independence both between and within columns

12 Minimum entropy
Probability of column $m_i$: $P(m_i) = \prod_a p_{ia}^{c_{ia}}$
Score of column $m_i$, defined as the negative logarithm: $S(m_i) = -\sum_a c_{ia} \log p_{ia}$
– $c_{ia}$: number of times residue $a$ occurs in column $i$
– $p_{ia}$: a regularized probability estimate as used in chapter 5
$S(m_i)$ is an entropy measure directly related to the Shannon entropy (chapter 11)
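A minimal Python sketch of the minimum-entropy column score, assuming the usual form S(m_i) = -sum over residues a of c_ia * log(p_ia), where c_ia counts residue a in column i. The function name, the additive pseudocount used as the regularizer, and the decision to skip gap characters are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter


def column_entropy_score(column, alphabet, pseudocount=1.0):
    """Return -sum_a c_a * log(p_a) for one alignment column.

    column      -- string with one symbol per sequence ('-' is skipped; assumption)
    alphabet    -- all residue symbols, used only to size the pseudocounts
    pseudocount -- additive regularizer (assumed; the slides only say 'regularized')
    """
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts.values()) + pseudocount * len(alphabet)
    score = 0.0
    for residue, c in counts.items():
        p = (c + pseudocount) / total  # regularized estimate of p_ia
        score -= c * math.log(p)
    return score


if __name__ == "__main__":
    for col in ("LLLL", "LLLG", "LLGG"):
        print(col, round(column_entropy_score(col, "LG"), 3))
```

With `pseudocount=0.0` a fully conserved column scores exactly 0; with any positive pseudocount the score of a conserved column stays above 0, and more variable columns score higher (worse).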

13 Example (1)

14 Example (2)

15 Example (3)
Will this ever be 0 in reality? Why (not)?

16 Example (4)

17 Minimum entropy
Very close to the HMM formulation
Choose the sequences carefully
– The sample of sequences is usually biased
– Weighting schemes as discussed in chapter 5 are necessary
– Weighting partially compensates for the defects of the sequence-independence assumption

18 Sum of pairs
Also assumes statistical independence between columns
Column score: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
Scores $s(a,b)$ come from substitution matrices like PAM or BLOSUM
For simple linear gap costs, $s(a,-)$, $s(-,a)$ and $s(-,-)$ are defined, with $s(-,-) = 0$
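The sum-of-pairs idea can be sketched in Python as follows: every column is scored by summing a pairwise score over all pairs of symbols in it, and the column scores are added up. The score table is a toy: L-L = 5 and G-L = -4 are the BLOSUM50 values quoted later in these slides, while the G-G entry, the gap score of -8 and all function names are assumptions made for illustration.

```python
from itertools import combinations

# Toy substitution scores keyed by alphabetically sorted residue pairs.
# L-L and G-L follow the BLOSUM50 values quoted in the slides;
# G-G = 8 is an assumed value for this sketch.
TOY_SCORES = {("L", "L"): 5, ("G", "L"): -4, ("G", "G"): 8}


def pair_score(a, b, gap=-8):
    """s(a, b) with linear gap cost: s(-,-) = 0, s(a,-) = s(-,a) = gap (assumed -8)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return TOY_SCORES[tuple(sorted((a, b)))]


def sp_score(alignment, gap=-8):
    """Sum-of-pairs score of an alignment given as equal-length row strings."""
    return sum(
        pair_score(a, b, gap)
        for column in zip(*alignment)          # walk the alignment column by column
        for a, b in combinations(column, 2)    # all N(N-1)/2 symbol pairs in a column
    )


if __name__ == "__main__":
    print(sp_score(["LL", "LL", "LG"]))
```

Because the score is a plain sum over columns, it can be plugged directly into any dynamic programming scheme that scores one column at a time.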

19 Sum of pairs
Substitution scores are usually log-odds scores for pairwise comparisons
– For three sequences, SP gives $\log\frac{p_{ab}}{q_a q_b} + \log\frac{p_{bc}}{q_b q_c} + \log\frac{p_{ac}}{q_a q_c}$
– The proper three-way log-odds score would be $\log\frac{p_{abc}}{q_a q_b q_c}$
Each sequence is scored as if it descended from the N-1 other sequences
Evolutionary events are over-counted

20 Problem with SP scores
Consider an alignment of N sequences
All have leucine (L) at position i
Score for an L-L alignment according to the BLOSUM50 matrix: 5
Number of symbol pairs in the column: $N(N-1)/2$
Column score: $5 \cdot N(N-1)/2$

21 Problem with SP scores
What if one sequence has glycine (G) at i?
– A G-L pair scores -4, a difference of 9 from L-L
The score is worse than the all-leucine column by a fraction $18/(5N)$
– The relative penalty shrinks as N grows, although more leucines should make the single glycine look more out of place, not less
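The fraction can be worked out from the numbers on these two slides. The derivation below is a sketch that assumes only the quoted scores (L-L = 5, G-L = -4) and that exactly one of the N sequences carries the glycine, so N-1 of the pairs in the column change.

```latex
\begin{align*}
S_{\text{all L}} &= 5 \cdot \frac{N(N-1)}{2} \\
S_{\text{one G}} &= S_{\text{all L}} - 9\,(N-1) \\
\frac{S_{\text{all L}} - S_{\text{one G}}}{S_{\text{all L}}}
  &= \frac{9\,(N-1)}{5\,N(N-1)/2} = \frac{18}{5N}
\end{align*}
```

For N = 4 the column loses 90% of its score; for N = 36 it loses only 10%.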

22 What a multiple alignment means; scoring a multiple alignment
Questions?
Break

23 Multidimensional dynamic programming
We assume that the columns of an alignment are statistically independent
Gaps are scored with a linear gap cost, so gap costs are absorbed into the column scores
The overall score is then $S(m) = \sum_i S(m_i)$, where $S(m_i)$ is the score for column $i$

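24 Calculating the overall score
Define $\alpha_{i_1, i_2, \ldots, i_N}$ as the maximum score of an alignment up to the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$

26 Simple notation
Introduce $\Delta_i$, which is 0 or 1, and define the “product” $\Delta_i \cdot x = x$ if $\Delta_i = 1$ and $\Delta_i \cdot x = -$ (a gap) if $\Delta_i = 0$
Now the recursion can be written as follows
$\alpha_{i_1, \ldots, i_N} = \max_{\Delta_1 + \cdots + \Delta_N > 0} \left\{ \alpha_{i_1 - \Delta_1, \ldots, i_N - \Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \right\}$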

27 Complexity of the algorithm
The algorithm requires the computation of the whole dynamic programming matrix, with $L_1 L_2 \cdots L_N$ entries
For each entry we have to consider all $2^N - 1$ combinations of gaps in a column
Assume all sequences have roughly the same length $L$
– Memory complexity of the algorithm is $O(L^N)$
– Time complexity is $O(2^N L^N)$
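The multidimensional recursion can be sketched in Python for small inputs. The sketch below fills one table entry per index tuple and tries all 2^N - 1 non-empty choices of which sequences contribute a residue (the others contribute a gap). The sum-of-pairs column score with match = 1, mismatch = -1 and gap = -2, and all names, are assumptions for illustration. It returns only the optimal score, not the alignment.

```python
from itertools import combinations, product


def msa_dp_score(seqs, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score of a global multiple alignment (N-dimensional DP)."""
    n = len(seqs)
    lengths = [len(s) for s in seqs]

    def pair(a, b):
        if a == "-" and b == "-":
            return 0                      # s(-,-) = 0
        if a == "-" or b == "-":
            return gap                    # linear gap cost
        return match if a == b else mismatch

    def column_score(col):
        return sum(pair(a, b) for a, b in combinations(col, 2))

    # All 2^N - 1 non-zero 0/1 vectors: 1 = consume a residue, 0 = emit a gap.
    deltas = [d for d in product((0, 1), repeat=n) if any(d)]

    best_score = {(0,) * n: 0}
    # product() yields index tuples in lexicographic order, so every
    # predecessor cell has been filled before it is needed.
    for idx in product(*(range(length + 1) for length in lengths)):
        if not any(idx):
            continue
        best = None
        for d in deltas:
            prev = tuple(i - di for i, di in zip(idx, d))
            if min(prev) < 0:
                continue                  # would step outside the matrix
            col = [s[i - 1] if di else "-" for s, i, di in zip(seqs, idx, d)]
            cand = best_score[prev] + column_score(col)
            if best is None or cand > best:
                best = cand
        best_score[idx] = best
    return best_score[tuple(lengths)]


if __name__ == "__main__":
    print(msa_dp_score(["AC", "AC", "AC"]))
```

The dictionary holds one entry per cell and the inner loop runs over every gap combination, which is why this approach is only practical for a handful of short sequences.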

28 MSA
Let $a^{kl}$ denote the pairwise alignment between sequences $k$ and $l$ induced by a multiple alignment $a$
The score of the complete alignment is given by $S(a) = \sum_{k<l} S(a^{kl})$
Let $\hat{a}^{kl}$ be the optimal pairwise alignment of $k$, $l$
Obviously $S(a^{kl}) \le S(\hat{a}^{kl})$

29 Lower bound
Assume that we have a lower bound $\sigma(a)$ on the score of the optimal multiple alignment $a$, so
$\sigma(a) \le S(a) = \sum_{k'<l'} S(a^{k'l'}) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$
In other words $S(a^{kl}) \ge \beta^{kl}$
where $\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$

30 Lower bound
Now we only need to look at pairwise alignments of $k$ and $l$ that score at least $\beta^{kl}$
We need to obtain $\sigma(a)$, and this can be done by using a progressive alignment algorithm

31 Restricted algorithm
For each pair $k$, $l$ we can find the complete set $B^{kl}$ of coordinate pairs $(i_k, i_l)$ such that the best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$ scores at least $\beta^{kl}$
Now we only have to look at cells $(i_1, i_2, \ldots, i_N)$ which meet the following condition: $(i_k, i_l)$ is in $B^{kl}$ for all $k$, $l$
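A rough Python sketch of the pairwise pruning step, under stated assumptions: the per-pair bound is taken to be (lower bound on the multiple alignment score) + (optimal score of this pair) - (sum of optimal scores over all pairs), and the best pairwise alignment through a cell is obtained by adding a forward and a backward Needleman-Wunsch matrix. The scoring values (match = 1, mismatch = -1, gap = -2) and all function names are hypothetical.

```python
from itertools import combinations


def nw_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score matrix with a linear gap cost."""
    f = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        f[i][0] = i * gap
    for j in range(1, len(y) + 1):
        f[0][j] = j * gap
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + s, f[i - 1][j] + gap, f[i][j - 1] + gap)
    return f


def pairwise_bounds(seqs, sigma):
    """Per-pair bound: sigma + optimal score of the pair - sum of all optimal pair scores."""
    optimal = {
        (k, l): nw_matrix(seqs[k], seqs[l])[-1][-1]
        for k, l in combinations(range(len(seqs)), 2)
    }
    total = sum(optimal.values())
    return {pair: sigma + score - total for pair, score in optimal.items()}


def allowed_cells(x, y, beta):
    """Cells (i, j) whose best pairwise alignment through them scores at least beta."""
    fwd = nw_matrix(x, y)
    rev = nw_matrix(x[::-1], y[::-1])   # scores for aligning suffixes of x and y
    n, m = len(x), len(y)
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(m + 1)
        if fwd[i][j] + rev[n - i][m - j] >= beta
    }


if __name__ == "__main__":
    seqs = ["AC", "AC", "AC"]
    bounds = pairwise_bounds(seqs, sigma=6)
    print(bounds)
    print(sorted(allowed_cells(seqs[0], seqs[1], bounds[(0, 1)])))
```

A tighter lower bound leaves fewer allowed cells per pair, and only index tuples whose every pairwise projection is allowed need to be visited in the full dynamic programming matrix.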


33 Progressive alignment methods
The algorithms differ in several ways
– The order in which the alignments are done
– Whether the progression only aligns sequences to a single growing alignment, or whether subfamilies are built up on a tree structure

34 Feng-Doolittle progressive multiple alignment
1. Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment
2. Construct a guide tree from the distance matrix using the Fitch & Margoliash clustering algorithm
3. Starting from the first node added to the tree, align the child nodes. Repeat until all sequences have been aligned.

35 Converting scores to distances
$D = -\log S_{\text{eff}} = -\log \frac{S_{\text{obs}} - S_{\text{rand}}}{S_{\max} - S_{\text{rand}}}$
where
– $S_{\max}$ is the maximum score
– $S_{\text{obs}}$ is the observed pairwise alignment score
– $S_{\text{rand}}$ is the expected score for aligning two random sequences
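The score-to-distance conversion can be sketched as a small Python function, assuming the usual Feng-Doolittle form D = -log((S_obs - S_rand) / (S_max - S_rand)). The function name and the error handling for non-positive normalized scores are assumptions of this sketch. How S_max and S_rand are estimated is left to the caller.

```python
import math


def feng_doolittle_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score into a distance.

    s_obs  -- observed pairwise alignment score
    s_max  -- maximum score for the pair
    s_rand -- expected score for aligning two random sequences
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)  # normalized score in (0, 1]
    if s_eff <= 0:
        raise ValueError("observed score is not better than random")
    return -math.log(s_eff)


if __name__ == "__main__":
    print(feng_doolittle_distance(s_obs=60.0, s_max=100.0, s_rand=20.0))
```

A pair scoring as well as the maximum gets distance 0, and the distance grows without bound as the observed score approaches the random expectation.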

36 Profile alignment
Linear gap scores can be included in the SP score: $s(-,a) = s(a,-) = -g$ and $s(-,-) = 0$
Global alignment score: $\sum_i S(m_i) = \sum_i \sum_{k<l \le N} s(m_i^k, m_i^l)$
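As an illustration of profile alignment with sum-of-pairs scoring, the Python sketch below aligns two existing alignments ("profiles") globally. Only the cross terms between the two profiles are scored, since pairs within one profile are unaffected by how the profiles are aligned to each other when s(-,-) = 0. The scoring values (match = 1, mismatch = -1, gap = -2) and all names are assumptions for illustration.

```python
def align_profiles(prof_a, prof_b, match=1, mismatch=-1, gap=-2):
    """Globally align two alignments (lists of equal-length row strings).

    Returns (merged_rows, cross_score), where cross_score sums s(a, b)
    over all pairs with a taken from prof_a and b from prof_b.
    """
    def s(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    def cross(col_a, col_b):
        return sum(s(a, b) for a in col_a for b in col_b)

    cols_a, cols_b = list(zip(*prof_a)), list(zip(*prof_b))
    gap_a, gap_b = ("-",) * len(prof_a), ("-",) * len(prof_b)
    n, m = len(cols_a), len(cols_b)

    # Fill the DP matrix over profile columns.
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = f[i - 1][0] + cross(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        f[0][j] = f[0][j - 1] + cross(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = max(
                f[i - 1][j - 1] + cross(cols_a[i - 1], cols_b[j - 1]),
                f[i - 1][j] + cross(cols_a[i - 1], gap_b),
                f[i][j - 1] + cross(gap_a, cols_b[j - 1]),
            )

    # Trace back, emitting merged columns from the end to the start.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + cross(cols_a[i - 1], cols_b[j - 1]):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and f[i][j] == f[i - 1][j] + cross(cols_a[i - 1], gap_b):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    n_rows = len(prof_a) + len(prof_b)
    rows = ["".join(col[r] for col in merged) for r in range(n_rows)]
    return rows, f[n][m]


if __name__ == "__main__":
    print(align_profiles(["ACG", "ACG"], ["AG"]))
```

Passing a single-row list for one or both arguments gives sequence-profile and sequence-sequence alignment as special cases.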

37 CLUSTALW progressive alignment
1. Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming alignment.
2. Construct a guide tree by a neighbor-joining clustering algorithm (Saitou & Nei).
3. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile and profile-profile alignment.

38 CLUSTALW properties
Sequences are weighted to compensate for biased representation.
The substitution matrix used to score an alignment is chosen based on the expected similarity of the sequences.
Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position.

39 CLUSTALW properties
Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.
Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment.
In the progressive alignment stage, if the score of an alignment is low, that alignment is deferred until more profile information has been accumulated.

40 Iterative refinement methods: Barton-Sternberg multiple alignment
1. Find the two sequences with the highest pairwise similarity and align them using standard pairwise dynamic programming alignment.
2. Find the sequence that is most similar to a profile of the alignment of the first two, and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment.

41 Iterative refinement methods: Barton-Sternberg multiple alignment
3. Remove sequence $x^1$ and realign it to a profile of the other aligned sequences $x^2, \ldots, x^N$ by profile-sequence alignment. Repeat for sequences $x^2, \ldots, x^N$.
4. Repeat the previous realignment step a fixed number of times or until the alignment score converges.

