1 Lecture 8 Chapter 6 Multiple Sequence Alignment Methods.

Presentation transcript:

1 Lecture 8 Chapter 6 Multiple Sequence Alignment Methods

2 The major goal of computational sequence analysis is to predict the structure and function of genes and proteins from their sequence.

3 Biological Motivation Compare a new sequence with the sequences in a protein family. Proteins can be categorized into families. A protein family is a collection of homologous proteins with similar sequence, 3-D structure, function, and/or evolutionary history. Gain insight into evolutionary relationships. By counting the mutations needed to go from an ancestral sequence to an extant sequence, one can estimate how long ago the two sequences diverged in evolutionary history.

4 The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z

5 Contents What a multiple alignment means Scoring a multiple alignment –Position specific (minimum entropy) scores –Sum of pair scores Multidimensional dynamic programming Progressive alignment methods Multiple alignment by profile HMM training

6 Multiple Sequence Alignment In chapter 5, we assumed that a reasonable multiple sequence alignment was already known and provided the starting point for constructing a profile HMM. We now look at what a "reasonable" multiple alignment is, and at ways to construct one automatically from unaligned sequences. An MSA must usually be inferred from primary sequences alone (the primary structure of a protein is the chain of peptide-bonded amino acids arranged in a specific sequence).

7 MSA Biological sequences are typically grouped into functional families. Biologists produce high quality multiple sequence alignments by hand using expert knowledge. Important factors are: specific sorts of columns in alignments, such as highly conserved residues or buried hydrophobic residues; the influence of the secondary structure (α-helices, β-strands etc.) and tertiary structure, such as the alternation of hydrophobic and hydrophilic columns in exposed β-strands; expected patterns of insertions and deletions, which tend to alternate with blocks of conserved sequence; phylogenetic relationships between sequences, which dictate constraints on the changes that occur in columns and in the patterns of gaps.

8 MSA Manual multiple alignment is tedious. An automatic method must have a way to assign a score so that better multiple alignments get better scores.

9 Multiple Sequence Alignment: Why? Identify highly conserved residues –Likely to be essential sites for structure/function –More precision from multiple sequences –Better structure/function prediction, pairwise alignments Building gene/protein families –Use conserved regions to guide search Basis for phylogenetic analysis –Infer evolutionary relationships between genes

10 Multiple Sequence Alignment: Why? Remember: The goal of biological sequence comparison is to discover functional (or structural ) similarities. Unfortunately, if the sequence similarity is weak, pairwise alignment can fail to identify biologically related sequences (because weak pairwise similarities may fail the statistical test for significance). Indeed, similar proteins may not exhibit a strong sequence similarity. The good news is that simultaneous comparison of many sequences often allows one to find similarities that are invisible in pairwise sequence comparison. [Hubbard et al., 1996]: “Pairwise alignment whispers… multiple alignment shouts out loud.”

11 What a multiple alignment means In an MSA, homologous residues among a set of sequences are aligned together in columns. Homologous is meant in both the structural and the evolutionary sense. Ideally, the residues in a column occupy similar 3-D structural positions and all diverge from a common ancestral residue.

12

13 Figure 6.1 A manually generated multiple alignment of 10 immunoglobulin superfamily sequences (a group of proteins that all carry part of the immunoglobulin fold is collectively called the immunoglobulin superfamily). A crystal structure of one of the sequences (1tlk, telokin) is known.

14 Figure 6.1 At the top: β-strands (a-g). At the bottom: identical residues (letter), or highly conservative residues (+). The conserved regions include 8 β-strands and certain key residues such as two completely conserved cysteines (C) in the b and f strands. The other 9 sequences have been manually aligned to 1tlk based on this expert structural knowledge.

15 Issues Automatic multiple sequence alignment methods are a topic of extensive research in bioinformatics. Except for trivial cases of highly identical sequences, it is not possible to unambiguously identify structurally or evolutionarily homologous positions and create a single "correct" multiple alignment. Since protein structures also evolve, we do not expect two protein structures with different sequences to be entirely superposable. Very similar sequences will generally be aligned unambiguously (a simple program can get the alignment right). For cases of interest (e.g. a family of proteins with only 30% average pairwise sequence identity), there is no objective way to define an unambiguously correct alignment. Once again, in general, an automatic method must assign a score so that better multiple alignments get better scores.

16 Issues For cases of interest (e.g. a family of proteins with only 30% average pairwise sequence identity), there is no objective way to define an unambiguously correct alignment The globin family, often used as a “typical” protein family in computational work, is in fact exceptional: almost the entire structure is conserved among divergent sequences

17

18 The Choice of the sequences: Sequences sharing a common ancestor (homologous sequences) –PSI-BLAST, FASTA, Various Search Tools The Choice of an objective function Biological problem that lies in the definition of correctness –Sum of pair, Entropy score, Consistency based, … The Optimization of that function –Exact Algorithms (Dynamic Programming) –Progressive alignment (ClustalW) –Iterative approaches (SA, GA, …)

19 Problem Statement What are the conserved regions among a set of sequences over the same alphabet?
12345678  Position index
EMQPILLL  Sequence 1
DMLR-LL-  Sequence 2
NMK-ILLL  Sequence 3
DMPPVLIL  Sequence 4
DM LL  Consensus sequence

20 Scoring a Multiple Alignment The scoring system should take into account that: Some positions are more conserved than others, e.g. position-specific scoring; The sequences are not independent, but instead related by a phylogenetic tree.

21 Complex Scoring Specify a complete probabilistic model of molecular sequence evolution Given the correct phylogenetic tree for the sequences to be aligned, the probability for a multiple alignment is the product of the probabilities of all the evolutionary events necessary to produce that alignment via ancestral intermediate sequences times the prior probability for the root ancestral sequence

22 Complex Scoring The probabilities of evolutionary change would depend on the evolutionary times along each branch of the tree, as well as position-specific structural and functional constraints imposed by natural selection, so that the key residues and structural elements would be conserved. High-probability alignments would then be good structural and evolutionary alignments under this model. Unfortunately, we do not have enough data to parametrise such a complex evolutionary model.

23 Simplifying Assumptions Partly or (as we do in this chapter) entirely ignore the phylogenetic tree Consider that individual columns of an alignment are statistically independent

24 Defining a scoring function for multiple alignment Almost all multiple alignment methods assume that the individual columns are statistically independent. However, most multiple alignment methods use affine gap scoring functions, so successive gap residues are not treated independently. For simplicity, here we will focus on definitions of $S(m_i)$ for scoring a column of aligned residues with no gaps, which leads to $S(m) = \sum_i S(m_i)$, where $m$ is the multiple alignment and the $m_i$ are its columns.

25 Sum of Pairs (SP) Scores This is the standard method for scoring multiple alignments. It assumes the statistical independence of columns. Columns are scored by a "sum of pairs" (SP) function. The SP score for a column is defined as $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$, where $m_i^k$ is the residue of sequence $k$ in column $i$ and the scores $s(a,b)$ come from a substitution matrix such as PAM or BLOSUM.

26 Sum of Pairs (SP) Scores Drawback: There is no probabilistic justification of the SP score. Each sequence is scored as if it descended from N-1 other sequences instead of a single ancestor. Evolutionary events are over-counted, a problem which increases as the number of sequences increases. Altschul, Carroll & Lipman [1989] proposed a weighting scheme designed to partially compensate for this defect in SP scores.

27 Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment. A pairwise alignment induced by the multiple alignment: keep two rows and delete the columns in which both rows have a gap. Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG
y: ACGC-GAG; z: GCCGC-GAG; z: GCCGCGAG

28 Example: Sum of pair score. Sequence alignment:
Seq A: ARGTCAGATACGLAG---PGMCTETWV
Seq B: ARATCGGAT---IAGTIYPGMCTHTWV
Substitution scores are represented in matrices. The popular ones are PAM and BLOSUM.

29 Similarity Measurement: SP-score The Sum of Pairs (SP) score is the similarity score among the amino acids (or bases) at a particular position of a multiple sequence alignment. A gap aligned with a gap has zero similarity and zero distance score: $s(-,-) = 0$. For a column $m$, the collection of symbols at one position of the alignment, $S(m) = \sum_{i<j} s(m^i, m^j)$. Example: $S(P,R,-,P) = s(P,R) + s(P,-) + s(P,P) + s(R,-) + s(R,P) + s(-,P)$.

30 Similarity Measurement: SP-score Multiple alignment:
1 PEAALFGKFT---IKSDVW
2 AESALYGRFT---IKSDVW
3 PDTAIWGKF---SIKSETW
4 PEVIRMGDDNPFSFQSDVW
Use only sequences 2 and 3:
2 AESALYGRFT---IKSDVW
3 PDTAIWGKF---SIKSETW
Remove positions which contain only gaps (this produces an induced pairwise alignment, or a projection of the multiple alignment onto 2 dimensions):
2 AESALYGRFT-IKSDVW
3 PDTAIWGKF-SIKSETW
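The projection step above can be sketched in a few lines of Python. This is only an illustration: the function name `induced_pairwise` is made up for this example and is not part of the lecture or of any library.

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns where both rows hold a gap carry no information for this pair,
    so they are dropped; every other column is kept unchanged.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# Rows 2 and 3 of the four-sequence alignment on the slide.
row2 = "AESALYGRFT---IKSDVW"
row3 = "PDTAIWGKF---SIKSETW"
print(induced_pairwise(row2, row3))
```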

31 Similarity Measurement: SP-score
12345678  Position index
EMQPILLL  Sequence 1
DMLR-LL-  Sequence 2
NMK-ILLL  Sequence 3
DMPPVLIL  Sequence 4
Given a multiple sequence alignment scored with Sum of Pairs (SP-score), we may compute the score of each position of the alignment and then add all the position scores to get the total score of the whole alignment. Or, we may compute the score for each induced pairwise alignment and add these scores. If we have N sequences, the number of pairs is N*(N-1)/2.
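A small Python sketch can check that the two ways of computing the SP score agree. The scoring function `toy_score` and its values (match 2, mismatch -1, residue against gap -2) are assumptions made for this illustration, standing in for a PAM or BLOSUM lookup. The equality relies on every column being scored independently, so it would not hold as written with affine gap penalties.

```python
from itertools import combinations


def toy_score(a, b, match=2, mismatch=-1, gap=-2):
    """Hypothetical stand-in for a substitution matrix s(a, b)."""
    if a == "-" and b == "-":
        return 0          # gap against gap scores 0, as on the slides
    if a == "-" or b == "-":
        return gap        # residue against gap
    return match if a == b else mismatch


def sp_column_score(column):
    """Sum of pair scores over all unordered pairs of symbols in one column."""
    return sum(toy_score(a, b) for a, b in combinations(column, 2))


def sp_by_columns(rows):
    """Total SP score computed position by position."""
    return sum(sp_column_score(col) for col in zip(*rows))


def sp_by_pairs(rows):
    """Total SP score computed as the sum over all N(N-1)/2 row pairs.

    Gap-only columns of a pair contribute s(-,-) = 0, so they need not be
    removed before scoring.
    """
    return sum(
        sum(toy_score(a, b) for a, b in zip(r1, r2))
        for r1, r2 in combinations(rows, 2)
    )


rows = ["EMQPILLL", "DMLR-LL-", "NMK-ILLL", "DMPPVLIL"]
print(sp_by_columns(rows), sp_by_pairs(rows))
```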

32 3-D Dynamic Programming Hyper-lattice (similarity matrix)

33 Example: Sum of pair score (cont.) Multiple sequence alignment:
Seq A1: ARGTCAGATACGLAG---PGMCTETWV----
Seq A2: ARATCGGAT---IAGTIYPGMCTHTWVIAGQ
Seq A3: ARATCE--TACG--GTI-PGMCTHTWVIA--

34 A problem with SP scores: Example Consider an alignment of N sequences which all have leucine (L) at a certain position. The score of an L aligned to L is 5 (BLOSUM50), so the score of the column is 5 x N(N-1)/2, where N(N-1)/2 is the number of symbol pairs in the column. If there were one glycine (G) in the column and N-1 Ls, the score would be 9 x (N-1) less, because a G-L pair scores -4 and N-1 pairs are affected. So, the SP score for a column with one G is worse than the score for a column of all Ls by a fraction of $\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$. Notice the inverse dependence on N: the relative difference in score between the correct alignment and the incorrect alignment decreases with the number of sequences in the alignment. This is counter-intuitive, because the relative difference ought to increase the more evidence we have for a conserved leucine.
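The arithmetic of this example can be checked with a short Python sketch. The function name `sp_relative_drop` is invented for this illustration; the scores 5 (L-L) and -4 (G-L) are the ones quoted on the slide. The relative drop is 9(N-1) divided by 5N(N-1)/2, which simplifies to 18/(5N).

```python
from fractions import Fraction


def sp_relative_drop(n, s_ll=5, s_gl=-4):
    """Relative SP-score loss when one of n leucines is replaced by a glycine."""
    all_l = Fraction(s_ll * n * (n - 1), 2)   # every pair is L-L
    loss = (s_ll - s_gl) * (n - 1)            # n-1 pairs change from L-L to G-L
    return loss / all_l


for n in (4, 10, 100):
    print(n, sp_relative_drop(n), Fraction(18, 5 * n))
```

The printed pairs match for every N, and the fraction shrinks as N grows, which is the counter-intuitive behaviour the slide points out.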

35 Position Specific Scores $m$ is a multiple alignment; $m_i$ is the column of aligned symbols in column $i$; $m_i^j$ is the symbol in column $i$ for sequence $j$; $c_{ia}$ is the observed count for residue $a$ in column $i$: $c_{ia} = \sum_j \delta(m_i^j = a)$, where $\delta(m_i^j = a)$ is 1 if $m_i^j = a$ and 0 otherwise.

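36 Position Specific Scores $c_i = (c_{i1}, \dots, c_{iK})$ is the count vector of observed symbols in column $i$ for an alphabet of $K$ different residues, where $c_{ia}$ is the observed count of residue $a$ in column $i$.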

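37 Minimum Entropy Scores We assume that residues within a column are independent, as are the columns themselves. The probability of a column $m_i$ is $P(m_i) = \prod_a p_{ia}^{c_{ia}}$, where $p_{ia}$ is the probability of residue $a$ in column $i$ and $c_{ia}$ is the observed count of residue $a$ in column $i$.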

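38 Minimum Entropy Scores We define a column score as $S(m_i) = -\sum_a c_{ia} \log p_{ia}$, where $c_{ia}$ is the observed count of residue $a$ in column $i$ and $p_{ia}$ its probability. The column score is an entropy measure. A completely conserved column would score 0. The maximum likelihood estimate for the parameter $p_{ia}$ is $p_{ia} = c_{ia} / \sum_{a'} c_{ia'}$.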
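As a rough illustration, the column score with the maximum likelihood estimate plugged in can be computed as below. The function name `min_entropy_score` is made up for this sketch. It assumes a gap-free column and uses no pseudocounts or background distribution, which a practical scoring scheme would probably add.

```python
import math
from collections import Counter


def min_entropy_score(column):
    """Column score -sum_a c_a * log(p_a) with p_a = c_a / (total count).

    Written as sum_a c_a * log(total / c_a), which is the same quantity.
    """
    counts = Counter(column)
    total = sum(counts.values())
    return sum(c * math.log(total / c) for c in counts.values())


print(min_entropy_score("LLLL"))   # fully conserved column
print(min_entropy_score("LLGG"))   # two residues, equally frequent
print(min_entropy_score("LIVG"))   # maximally variable column of four
```

Lower scores mean more conserved columns, so a good alignment under this measure is one that minimises the total.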

39 Simultaneous multiple sequence alignment by Multidimensional dynamic programming Assumptions: - the columns of an alignment are statistically independent - gaps are scored with a linear gap cost $\gamma(g) = gd$ for a gap of length $g$ and some gap cost $d$. Note: Extension to affine gap costs is possible, but the formalism becomes tedious. Therefore the overall score for an alignment can be computed as the sum of the scores for each column $i$: $S(m) = \sum_i S(m_i)$.

40

41

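42 Note Using the notation $\Delta_i \cdot x = x$ if $\Delta_i = 1$ and $\Delta_i \cdot x = -$ (a gap) if $\Delta_i = 0$, the recursion relation becomes: $\alpha_{i_1,\dots,i_N} = \max_{\Delta_1 + \dots + \Delta_N > 0} \{ \alpha_{i_1 - \Delta_1,\dots,i_N - \Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \dots, \Delta_N \cdot x^N_{i_N}) \}$, where $\alpha_{i_1,\dots,i_N}$ is the maximum score of an alignment of the prefixes ending at $x^1_{i_1}, \dots, x^N_{i_N}$. Complexity In general, if we assume that the sequences are all of roughly the same length $L$, the memory complexity of the (naive) dynamic programming algorithm for multiple sequence alignment is $O(L^N)$ and the time complexity is $O(2^N L^N)$.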
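The recursion over all non-zero choices of (Δ_1, …, Δ_N) can be sketched directly in Python. This is a naive illustration, not the lecture's implementation: the names `pair_score` and `msa_dp_score`, the SP column score, and the values match 1, mismatch -1, gap cost d = 2 are all assumptions. It returns only the optimal score and is practical only for a few short sequences, since it visits every cell of the N-dimensional lattice.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=-1, d=2):
    """Toy s(a, b) with a linear gap cost d and s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -d
    return match if a == b else mismatch


def msa_dp_score(seqs):
    """Optimal SP score of a global multiple alignment by N-dimensional DP."""
    n = len(seqs)
    origin = (0,) * n
    # Every delta in {0,1}^N except all zeros: 2^N - 1 predecessor moves.
    moves = [delta for delta in product((0, 1), repeat=n) if any(delta)]
    alpha = {origin: 0}
    # product() yields index tuples in lexicographic order, so every
    # predecessor cell is filled before the cell that depends on it.
    for idx in product(*(range(len(s) + 1) for s in seqs)):
        if idx == origin:
            continue
        best = None
        for delta in moves:
            prev = tuple(i - dl for i, dl in zip(idx, delta))
            if min(prev) < 0:
                continue
            # Column: the residue where delta is 1, a gap where delta is 0.
            column = [s[i - 1] if dl else "-" for s, i, dl in zip(seqs, idx, delta)]
            cand = alpha[prev] + sum(pair_score(a, b) for a, b in combinations(column, 2))
            if best is None or cand > best:
                best = cand
        alpha[idx] = best
    return alpha[tuple(len(s) for s in seqs)]


print(msa_dp_score(["ACG", "ACG", "ACG"]))
```

The dictionary `alpha` holds one entry per lattice cell and each cell tries 2^N - 1 moves, which mirrors the memory and time complexity stated on the slide.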

43 Carrillo-Lipman Algorithm (1988) Implementation: MSA by Lipman, Altschul & Kececioglu (1989). This algorithm reduces the volume of the multidimensional dynamic programming matrix that has to be examined. MSA can optimally align up to five to seven protein sequences of reasonable length. Assumption: the score of a multiple alignment is the sum of the scores of all pairwise alignments defined by the multiple alignment.

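44 The score of a complete alignment $a$ is defined as $S(a) = \sum_{k<l} S(a^{kl})$, where $a^{kl}$ denotes the pairwise alignment between sequences $k$ and $l$ induced by $a$. Let $\hat{a}^{kl}$ be the optimal pairwise alignment of $k$, $l$. Obviously, $S(a^{kl}) \le S(\hat{a}^{kl})$. Assume that we have a lower bound $\sigma(a)$ on $S(a)$, the score of the optimal multiple alignment $a$, i.e. $\sigma(a) \le S(a)$.

45 (We can obtain a good bound $\sigma(a)$ by any fast heuristic multiple alignment algorithm, for instance the progressive alignment algorithms introduced later.) Due to the sum of pairs (SP) score definition, and because every induced pairwise alignment other than $a^{kl}$ scores at most as much as the corresponding optimal pairwise alignment, we have: $\sigma(a) \le S(a) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$, and thus $S(a^{kl}) \ge \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$. Therefore we can set a lower bound on $S(a^{kl})$: $S(a^{kl}) \ge \beta^{kl}$, where $\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$.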

46 The N(N-1)/2 optimum pairwise alignments are each calculated and scored by standard pairwise alignment. The higher the bounds are, the smaller the volume of multidimensional dynamic programming matrix that must be calculated.

47 For each pair $k$, $l$ we can find the complete set $B^{kl}$ of coordinate pairs $(i_k, i_l)$ such that the best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$ scores at least $\beta^{kl}$, the lower bound on the pairwise score $S(a^{kl})$. This set is calculated in $O(L^2)$ time by combining the forward and backward Viterbi scores for each cell of the complete pairwise dynamic programming table (the scores are added when they are on a log scale). The costly multidimensional dynamic programming algorithm can then be restricted to evaluate only cells in the intersection of these sets, i.e. cells $(i_1, \dots, i_N)$ for which $(i_k, i_l)$ is in $B^{kl}$ for all $k$, $l$.
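The per-pair set of admissible cells can be sketched with two ordinary pairwise dynamic programming passes. Everything here is an illustrative assumption rather than the MSA program's actual code: the names `prefix_scores` and `cells_above_bound`, the toy scores (match 1, mismatch -1), the linear gap cost d = 2, and the use of "at least beta" as the cut-off. A cell (i, j) is taken to mean "the first i residues of x and the first j residues of y have been consumed", and the best alignment through it is the best prefix score plus the best suffix score.

```python
def pair_score(a, b, match=1, mismatch=-1):
    """Toy substitution score for two residues."""
    return match if a == b else mismatch


def prefix_scores(x, y, d=2):
    """F[i][j] = best global alignment score of x[:i] with y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i and j:
                cands.append(F[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]))
            if i:
                cands.append(F[i - 1][j] - d)   # x[i-1] against a gap
            if j:
                cands.append(F[i][j - 1] - d)   # y[j-1] against a gap
            F[i][j] = max(cands)
    return F


def cells_above_bound(x, y, beta, d=2):
    """Cells (i, j) whose best through-alignment scores at least beta."""
    F = prefix_scores(x, y, d)
    # Suffix scores come from the same routine run on the reversed sequences.
    R = prefix_scores(x[::-1], y[::-1], d)
    allowed = set()
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            through = F[i][j] + R[len(x) - i][len(y) - j]
            if through >= beta:
                allowed.add((i, j))
    return allowed


print(sorted(cells_above_bound("ACGT", "ACGT", beta=4)))
```

With a tight bound only the diagonal survives in this example; a looser bound admits more cells, which is why a good heuristic lower bound shrinks the volume that the multidimensional algorithm has to fill.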

48

49 Progressive Multiple Alignment Methods These (greedy) methods are the most commonly used approach to multiple sequence alignment. The general idea: Most progressive alignment algorithms build a "guide tree", a binary tree whose leaves represent sequences and whose interior nodes represent alignments. (The methods for constructing guide trees can be "quick and dirty" versions of those for phylogenetic trees.)

50 Progressive Multiple Alignment Methods Main heuristic: first align the most similar pairs of sequences, using a pairwise alignment method. Then walk up the tree and compute at each interior node the alignment of (alignments of) sequences associated with the direct descendants of that node. The root node will represent a complete multiple alignment of the input sequences. Note: Progressive alignment methods use no global scoring function of alignment correctness.

51 Feng-Doolittle Progressive Multiple Alignment (1987) Specific points (I) The guide tree is constructed using the clustering algorithm of Fitch & Margoliash (1967), starting from a distance matrix obtained by converting pairwise alignment scores to (approximate) pairwise distances: $D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\max} - S_{\mathrm{rand}}}$, where $S_{\mathrm{obs}}$ is the observed pairwise alignment score; $S_{\max}$ is the maximum score, taken as the average of the scores of aligning each of the two sequences to itself; $S_{\mathrm{rand}}$ is the expected score for aligning two random sequences of the same length and residue composition, obtained by random shuffling of the two sequences (or by an approximate calculation given in [Feng & Doolittle, 1996]).

52 Note: The effective score $S_{\mathrm{eff}} = (S_{\mathrm{obs}} - S_{\mathrm{rand}}) / (S_{\max} - S_{\mathrm{rand}})$ can be viewed as a normalized percentage similarity; it is expected to decay exponentially towards 0 with increasing evolutionary distance, hence the -log to make the measure more approximately linear with evolutionary distance.
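A minimal sketch of the score-to-distance conversion, D = -log((S_obs - S_rand) / (S_max - S_rand)), is shown below. The function name and the example numbers are invented for illustration; the three input scores would come from real pairwise alignments and a shuffling experiment.

```python
import math


def feng_doolittle_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score into an approximate distance.

    s_obs:  observed score of aligning the two sequences
    s_max:  average of the two self-alignment scores
    s_rand: expected score for shuffled versions of the two sequences
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is no better than random")
    return -math.log(s_eff)


# Made-up scores: a pair halfway between random and identical.
print(feng_doolittle_distance(s_obs=60, s_max=100, s_rand=20))
```

Identical sequences give an effective score of 1 and a distance of 0; as the observed score approaches the random expectation, the distance grows without bound.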

53 Feng-Doolittle Algorithm's Specific Points (II) Sequence to group alignment: A sequence is added to an existing group by aligning it pairwise to each sequence in the group in turn. The highest scoring pairwise alignment determines how the sequence will be aligned to the group. Group to group alignment: All sequence pairs between the two groups are tried; the best pairwise sequence alignment determines the alignment of the two groups.

54 Feng-Doolittle Algorithm's Specific Points (II, cont'd) After an alignment is completed, gap symbols are replaced with a neutral X character. The cost for aligning an X with anything (including a gap symbol) is 0, hence a desirable effect ("once a gap always a gap") is obtained: gaps tend to occur in the same columns in subsequent pairwise alignments. Note: The X rewriting is not needed in the profile-based progressive alignment algorithms (introduced later).

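55 Profile-based progressive multiple alignment Aligning MAs using SP scoring with linear gaps The gap scores can be included in the SP score by setting $s(-,a) = s(a,-) = -d$ and $s(-,-) = 0$. Here an alignment of two MAs will be done so that gaps are inserted in whole columns, so the alignment within each one of the two MAs is not changed. Assuming that we have two MAs, one containing sequences 1 to n, and the other containing sequences n+1 to N, the global alignment score is: $\sum_i S(m_i) = \sum_i \sum_{k<l \le N} s(m_i^k, m_i^l) = \sum_i \sum_{k<l \le n} s(m_i^k, m_i^l) + \sum_i \sum_{n<k<l \le N} s(m_i^k, m_i^l) + \sum_i \sum_{k \le n,\, n<l \le N} s(m_i^k, m_i^l)$.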

56 Aligning MAs using SP scoring with linear gaps (cont'd) Note that the first two sums (the pairs within each of the two MAs) are unaffected by the global alignment, since adding columns of gap characters to a profile adds 0 score (s(-,-)=0). Therefore the optimal alignment of the two profiles can be obtained by only optimising the last sum with the cross terms. This can be done exactly like standard pairwise alignment, where columns are scored against columns by adding pair scores. Obviously, one of the profiles can consist of a single sequence only, which corresponds to aligning a single sequence to a profile.
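The cross-term optimisation can be sketched as an ordinary Needleman-Wunsch pass in which whole columns play the role of single residues. The names `s`, `col_score` and `align_profiles`, and the toy values (match 1, mismatch -1, gap cost d = 2), are assumptions for this illustration only. The returned score covers just the cross terms between the two groups; the within-group sums are constant and are left out.

```python
def s(a, b, match=1, mismatch=-1, d=2):
    """Toy pair score with s(a,-) = s(-,a) = -d and s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -d
    return match if a == b else mismatch


def col_score(col_a, col_b):
    """Cross-term SP score between a column of group A and one of group B."""
    return sum(s(a, b) for a in col_a for b in col_b)


def align_profiles(rows_a, rows_b):
    """Globally align two alignments, inserting gaps as whole columns."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    gap_a, gap_b = ("-",) * len(rows_a), ("-",) * len(rows_b)
    n, m = len(cols_a), len(cols_b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i and j:
                options.append((F[i - 1][j - 1] + col_score(cols_a[i - 1], cols_b[j - 1]), "diag"))
            if i:
                options.append((F[i - 1][j] + col_score(cols_a[i - 1], gap_b), "up"))
            if j:
                options.append((F[i][j - 1] + col_score(gap_a, cols_b[j - 1]), "left"))
            F[i][j], back[i][j] = max(options)
    # Trace back, concatenating the chosen columns of both groups.
    merged, i, j = [], n, m
    while i or j:
        move = back[i][j]
        if move == "diag":
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)], F[n][m]


print(align_profiles(["ACGT", "ACGT"], ["AGT"]))
```

In the example, the second group is a single sequence, so this is the sequence-to-profile case mentioned on the slide; the rows of the first group come back unchanged relative to each other.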

57 Remark Once an aligned group has been built up, it is advantageous to use position-specific information from the group’s multiple alignment to align a new sequence to it. The degree of sequence conservation at each position should be taken into account and mismatches at highly conserved positions penalized more stringently than mismatches at variable positions. Gap penalties might be reduced where lots of gaps occur in the cluster alignment, and increased where no gaps occur.

58 Profile-based Progressive Alignment: The CLUSTALW algorithm Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming alignment, followed by approximate conversion of similarity scores to evolutionary distances using the model of Kimura [1983]. Construct a guide tree by using the Neighbor-Joining clustering algorithm [Saitou & Nei, 1987]. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

59 Additional Heuristics Contributing to CLUSTALW’s Accuracy In order to compensate for biased representation in large subfamilies, individual sequences are weighted according to the branch length in the Neighbor-Joining tree. The substitution matrix is chosen on the basis of the similarity expected of the alignment, e.g. BLOSUM80 for closely related sequences, and BLOSUM50 for distant sequences. Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position

60 Additional Heuristics Contributing to CLUSTALW's Accuracy Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all the gaps to occur in the same places in an alignment. In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low-scoring alignment until later in the progressive alignment phase, when more profile information has been accumulated.

61 Iterative Refinement Methods for Multiple Sequence Alignment A problem with progressive alignment methods: The subalignments are 'frozen', i.e. once a group of sequences has been aligned, their alignment to each other cannot be changed at a later stage. Example: align

62 Basic idea for iterative refinement MA methods An initial alignment is generated. Then one sequence (or a set of sequences) is taken out and realigned to a profile of the remaining aligned sequences. If a meaningful(!) score is being optimized, this either increases the overall score or results in the same score. Another sequence is chosen and realigned, and so on, until the alignment does not change. The procedure is guaranteed to converge to a local maximum of the score provided that all the sequences are tried and a maximum score exists, simply because the space of possible alignments is finite.

63 Barton-Sternberg algorithm (1987) Find the two sequences with the highest pairwise similarity and align them using standard pairwise dynamic programming alignment. Find the sequence that is most similar to a profile of the alignment of the first two, and align it to them by profile-sequence alignment. Repeat until all sequences have been included in the MA. Remove sequence $x^1$ and realign it to a profile of the other aligned sequences $x^2, \dots, x^N$ by profile-sequence alignment. Repeat for sequences $x^2, \dots, x^N$. Repeat the previous realignment step a fixed number of times, or until the alignment score converges.