1 Centre for Integrative Bioinformatics VU. Lecture 6 – 16/11/06. Multiple sequence alignment 1. Sequence analysis 2006. Multiple alignment methods

2 Multiple alignment idea: Take three or more related sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.

3 Biological definitions for related sequences
Homologues are similar sequences that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.
Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.
Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.
Xenologues are similar sequences that have arisen through horizontal transfer events (symbiosis, viruses, etc.) rather than through vertical descent.

4 So this means … Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

5 Information content of a multiple alignment
Sequences can be conserved across species and perform similar or identical functions. Such alignments:
  hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved;
  allow identification of regions or domains that are critical to functionality.
Sequences can be mutated or rearranged to perform an altered function. Such alignments:
  show which changes in the sequences have caused a change in functionality.

6 Information content of a multiple alignment

7 What to ask yourself
How do we get a multiple alignment (three or more sequences)?
What is our aim? Do we go for maximum accuracy? Least computational time? Or the best compromise?
What do we want to achieve each time?

8 Sequence–sequence alignment [figure: alignment matrix with one sequence on each axis]

9 Exhaustive & Heuristic algorithms
Exhaustive approaches:
  Examine all possible aligned positions simultaneously
  Look for the optimal solution by dynamic programming (DP)
  Very (very) slow
Heuristic approaches:
  Strategy to find a near-optimal solution (by using rules of thumb)
  Shortcuts are taken by reducing the search space according to certain criteria
  Much faster

10 Multiple alignment methods
Multi-dimensional dynamic programming: extension of pairwise sequence alignment.
Progressive alignment: incorporates phylogenetic information to guide the alignment process.
Iterative alignment: corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

11 Simultaneous multiple alignment: multi-dimensional dynamic programming
The combinatorial explosion:
  2 sequences of length n require n² comparisons.
  The number of comparisons increases exponentially with the number of sequences, i.e. n^N, where n is the length of the sequences and N is the number of sequences.
  Impractical for even a small number of short sequences.
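To get a feel for the growth described on this slide, here is a minimal sketch that simply evaluates n^N. The function name `dp_cells` is hypothetical, and the count ignores constant factors such as the number of neighbouring cells inspected per step.

```python
def dp_cells(n: int, num_seqs: int) -> int:
    """Approximate number of cells in a full N-dimensional DP matrix (n^N)."""
    return n ** num_seqs


# Sequences of length 100: two sequences are cheap, five are already out of reach.
for num_seqs in (2, 3, 5):
    print(num_seqs, dp_cells(100, num_seqs))
```

With n = 100 the matrix grows from 10^4 cells for two sequences to 10^10 cells for five, which is why the methods on the following slides restrict or avoid the full search space.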

12 Multi-dimensional dynamic programming (Murata et al. 1985)

13 The MSA approach (Lipman et al. 1989)
Key idea: restrict the computational costs by determining a minimal region within the n-dimensional space that contains the optimal path.

14 The MSA method in detail
1. Consider 3 sequences.
2. Calculate all pairwise alignment scores by dynamic programming.
3. Use the scores to predict a tree.
4. Produce a heuristic multiple alignment based on the tree (quick and dirty).
5. Calculate the maximum cost for each sequence pair from the multiple alignment (upper bound) and determine the paths with lower costs.
6. Determine the spatial positions that must be calculated to obtain the optimal alignment (intersecting areas).
Note: redundancy caused by highly correlated sequences is avoided.

15 The DCA (Divide-and-Conquer) approach (Stoye et al. 1997)
Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint. This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences.
This procedure is reiterated until the sequences are sufficiently short.
The short sequences are then aligned optimally by MSA.
Finally, the resulting short alignments are concatenated.

16 So in effect …

17 Multiple alignment methods
Multi-dimensional dynamic programming: extension of pairwise sequence alignment.
Progressive alignment: incorporates phylogenetic information to guide the alignment process.
Iterative alignment: corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

18 The progressive alignment method
Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related.
Principle: construct an approximate phylogenetic tree (guide tree) for the sequences to be aligned and then build up the alignment by progressively adding sequences in the order specified by the tree.
The result is a stepwise assembly of the multiple alignment.

19 Progressive alignment strategy (figure adapted from Xiong, J., "Essential Bioinformatics")
All individual pairwise alignments and construction of the distance matrix for sequences A, B, C, D, E:
      A    B    C    D    E
A     -
B    11    -
C    20   30    -
D    27   36    9    -
E    30   33   20   27    -
Calculating a guide tree: C & D are the closest pair; A & B are the next closest pair.
Aligning C/D and A/B separately using dynamic programming.
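The merge order read off this distance matrix can be sketched in a few lines. The slide does not say which clustering rule the figure uses, so the average-linkage rule below is an assumption, and `guide_tree_order` is a hypothetical helper name; only the distances come from the slide.

```python
from itertools import combinations

# Pairwise distances from the slide's matrix (symmetric, so keys are unordered pairs).
DIST = {
    frozenset("AB"): 11, frozenset("AC"): 20, frozenset("BC"): 30,
    frozenset("AD"): 27, frozenset("BD"): 36, frozenset("CD"): 9,
    frozenset("AE"): 30, frozenset("BE"): 33, frozenset("CE"): 20,
    frozenset("DE"): 27,
}


def guide_tree_order(names, dist):
    """Return the list of cluster merges, closest pair first (average linkage)."""
    clusters = [(name,) for name in names]

    def cluster_distance(c1, c2):
        # Average of all sequence-to-sequence distances between the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


for left, right in guide_tree_order("ABCDE", DIST):
    print("".join(left), "+", "".join(right))
```

The first two merges reproduce what the slide states (C with D, then A with B); the later merges depend on the assumed linkage rule and need not match the original figure.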

20 But how can we align blocks of sequences? [figure: the aligned blocks A/B and C/D and the single sequence E, joined by a question mark]
The dynamic programming algorithm performs well for pairwise alignment (two axes). So we should try to treat the blocks as a "single" sequence …

21 How to represent a block of sequences?
Historically: consensus sequence, a single sequence that best represents the amino acids observed at each alignment position.
Modern methods: alignment profile, a representation that retains the information about the frequencies of amino acids observed at each alignment position.

22 Consensus sequence
Problem: loss of information. For larger blocks of sequences it "punishes" especially more distant members within a …
Sequence 1  F A T N M G T S D P P T H T R L R K L V S Q
Sequence 2  F V T N M N N S D G P T H T K L R K L V S T
Consensus   F * T N M * * S D * P T H T * L R K L V S *
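The consensus line on this slide can be reproduced with a short sketch. It assumes the simplest possible rule (keep a residue only where all sequences agree, otherwise write `*`), which matches the two-sequence example but is not necessarily how any particular program defines a consensus; `consensus` is a hypothetical helper.

```python
def consensus(seqs):
    """Column-wise consensus: the shared residue if all agree, '*' otherwise."""
    return "".join(
        column[0] if len(set(column)) == 1 else "*"
        for column in zip(*seqs)
    )


seq1 = "FATNMGTSDPPTHTRLRKLVSQ"
seq2 = "FVTNMNNSDGPTHTKLRKLVST"
print(consensus([seq1, seq2]))  # F*TNM**SD*PTHT*LRKLVS*
```

Every `*` hides which residues were actually observed at that column, which is the loss of information the slide refers to.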

23 Alignment profiles
Advantage: full representation of the sequence alignment (more information retained).
Not only used in alignment methods, but also in sequence-database searching (to detect distant homologues).
Also called PSSM (position-specific scoring matrix).

24 Multiple alignment profiles (Gribskov et al. 1987)
Gribskov created a probe: a group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case: globins & immunoglobulins).
Then he constructed a profile, which consists of a position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe).
The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap-opening and gap-extension).

25 Multiple alignment profiles

26 Profile building [figure: profile matrix holding, for each alignment position i, a frequency for each amino acid (A, C, D, …, W, Y) together with position-dependent gap penalties]
Each amino acid is represented as a frequency, penalties as weights.

27 Profile-sequence alignment [figure: alignment matrix with a profile (rows A, C, D, …, V, W, Y) on one axis and a sequence on the other]

28 Sequence to profile alignment
Alignment column: A A V V L, giving the profile position 0.4 A, 0.2 L, 0.4 V.
Score of amino acid L in a sequence that is aligned against this profile position:
Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
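As a worked version of this score, the sketch below derives the frequencies from the column and evaluates the weighted sum. The slide does not fix an exchange matrix, so the three values in `SCORES` are illustrative assumptions (chosen to resemble BLOSUM62 entries), and the helper names are hypothetical.

```python
from collections import Counter

# Illustrative exchange values, keyed by the alphabetically sorted residue pair (assumed).
SCORES = {("A", "L"): -1, ("L", "L"): 4, ("L", "V"): 1}


def sub(x, y):
    """Look up the symmetric exchange score s(x, y)."""
    return SCORES[tuple(sorted((x, y)))]


def column_frequencies(column):
    """Relative frequency of each residue in one alignment column."""
    counts = Counter(column)
    return {aa: count / len(column) for aa, count in counts.items()}


def seq_to_profile_score(residue, freqs):
    """Frequency-weighted exchange score of one residue against one profile position."""
    return sum(f * sub(residue, aa) for aa, f in freqs.items())


freqs = column_frequencies("AAVVL")   # A: 0.4, V: 0.4, L: 0.2
print(round(seq_to_profile_score("L", freqs), 3))
```

With these assumed values the score is 0.4 * (-1) + 0.2 * 4 + 0.4 * 1 = 0.8.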

29 Profile-profile alignment [figure: alignment matrix with a profile (rows A, C, D, …, V, W, Y) on each axis]

30 Profile to profile alignment
Profile position 1: 0.4 A, 0.2 L, 0.4 V. Profile position 2: 0.75 G, 0.25 S.
Match score of these two alignment columns using the amino acid frequencies at the corresponding profile positions:
Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G)
      + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)
s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, Blosum62) for amino acid pair (x,y).
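The same sum can be written generically over any two profile columns. In this sketch the exchange values in `SCORES` are illustrative assumptions (chosen to resemble BLOSUM62 entries), and `profile_profile_score` is a hypothetical helper name.

```python
# Illustrative exchange values, keyed by the alphabetically sorted residue pair (assumed).
SCORES = {
    ("A", "G"): 0, ("G", "L"): -4, ("G", "V"): -3,
    ("A", "S"): 1, ("L", "S"): -2, ("S", "V"): -2,
}


def sub(x, y):
    """Look up the symmetric exchange score s(x, y)."""
    return SCORES[tuple(sorted((x, y)))]


def profile_profile_score(col1, col2):
    """Sum of f1(x) * f2(y) * s(x, y) over all residue pairs of two profile columns."""
    return sum(
        f1 * f2 * sub(x, y)
        for x, f1 in col1.items()
        for y, f2 in col2.items()
    )


col1 = {"A": 0.4, "L": 0.2, "V": 0.4}
col2 = {"G": 0.75, "S": 0.25}
print(round(profile_profile_score(col1, col2), 3))
```

With these assumed values the six terms are 0, -0.6, -0.9, 0.1, -0.1 and -0.2, for a column score of -1.7.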

31 So for scoring profiles …
Think of sequence-sequence alignment: the same principles apply, but there is more information for each position.
Reminder: the sequence pair alignment score S comes from the sum of the positional scores M(aa_i, aa_j), i.e. the substitution matrix values at each alignment position, minus penalties if applicable.
Profile alignment scores are computed in exactly the same way, but the positional scores are more complex.

32 Scoring a profile position [figure: Profile 1 and Profile 2, each with rows A, C, D, …, Y]
At each position (column) we have different residue frequencies for each amino acid (rows).
So, instead of saying S = M(aa_1, aa_2) (one residue pair), for every frequency f > 0 (the amino acid is actually there) we take the frequency-weighted sum over all residue pairs, as in the profile to profile match score: S = Σ_i Σ_j f1(aa_i) * f2(aa_j) * M(aa_i, aa_j).

33 Log-average score
Remember the substitution matrix formula?
In log-average scoring (von Ohsen et al., 2003):
What is the effect?
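For reference, a hedged sketch of the formulas this slide contrasts, assuming the usual log-odds form of a substitution matrix and the log-average definition as it is commonly given in the profile-profile scoring literature. The symbols are assumptions: p_{ab} is the probability of residues a and b being aligned in related sequences, p_a and p_b are background frequencies, and f^{(1)}, f^{(2)} are the residue frequencies at the two profile positions.

```latex
\begin{align}
  % Log-odds substitution score for one residue pair
  M(a,b) &= \log \frac{p_{ab}}{p_a \, p_b} \\
  % Average score: frequency-weighted mean of the log-odds values
  S_{\text{average}} &= \sum_{i}\sum_{j} f^{(1)}_i \, f^{(2)}_j \, \log \frac{p_{ij}}{p_i \, p_j} \\
  % Log-average score: average the odds ratios first, then take the logarithm
  S_{\text{log-average}} &= \log \sum_{i}\sum_{j} f^{(1)}_i \, f^{(2)}_j \, \frac{p_{ij}}{p_i \, p_j}
\end{align}
```

Under these definitions the only difference is the order of averaging and taking the logarithm: the log-average form averages in probability space, so by Jensen's inequality it is never smaller than the average score for the same pair of columns.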

34 Return to the progressive alignment
1. Perform pairwise alignments of all of the sequences;
2. Use the alignment scores to produce a dendrogram using neighbour-joining methods (guide tree);
3. Align the sequences sequentially, guided by the relationships indicated by the tree.
Methods: Biopat (first method ever), MULTAL (Taylor 1987), DIALIGN (1&2, Morgenstern 1996), PRRP (Gotoh 1996), ClustalW (Thompson et al. 1994), PRALINE (Heringa 1999), T-Coffee (Notredame 2000), POA (Lee 2002), MUSCLE (Edgar 2004), ProbCons (Do 2004).

35 Progressive multiple alignment [figure: pairwise scores (Score 1-2, Score 1-3, …, Score 4-5) fill a 5×5 similarity matrix; scores are converted to distances to build the guide tree, which drives the multiple alignment; iteration possibilities are indicated]

36 General progressive multiple alignment technique (follows the generated tree)

37 PRALINE progressive strategy

38 There are problems …
Accuracy is very important: alignment errors made during the construction of the MSA cannot be repaired afterwards and are propagated into the subsequent progressive steps.
The comparisons of sequences at early steps during progressive alignment cannot make use of information from other sequences. Only later in the alignment progression is more information from other sequences (e.g. through profile representation) employed in the alignment steps.
"Once a gap, always a gap" (Feng & Doolittle, 1987)

