1-month Practical Course Genome Analysis Lecture 5: Multiple Sequence Alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.

Slides:



Advertisements
Similar presentations
Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
Advertisements

Techniques for Protein Sequence Alignment and Database Searching
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Multiple alignment: heuristics. Consider aligning the following 4 protein sequences S1 = AQPILLLV S2 = ALRLL S3 = AKILLL S4 = CPPVLILV Next consider the.
1 “INTRODUCTION TO BIOINFORMATICS” “SPRING 2005” “Dr. N AYDIN” Lecture 4 Multiple Sequence Alignment Doç. Dr. Nizamettin AYDIN
Sequence analysis lecture 6 Sequence analysis course Lecture 6 Multiple sequence alignment 2 of 3 Multiple alignment methods.
Sequence analysis course Lecture 7 Multiple sequence alignment 3 of 3 Optimizing progressive multiple alignment methods.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
1-month Practical Course Genome Analysis Lecture 5: Multiple Sequence Alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
What you should know by now Concepts: Pairwise alignment Global, semi-global and local alignment Dynamic programming Sequence similarity (Sum-of-Pairs)
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Lecture 6 – 07/01/08 Multiple sequence alignment 2 Sequence analysis 2007 Optimizing.
Multiple alignment: heuristics
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Multiple Sequence Alignments
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka,
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Lecture 6 – 16/11/06 Multiple sequence alignment 1 Sequence analysis 2006 Multiple.
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Chapter 5 Multiple Sequence Alignment.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Multiple sequence alignment
Biology 4900 Biocomputing.
Multiple Sequence Alignment
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 24th, 2013.
Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, ,1995)
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
Protein Sequence Alignment and Database Searching.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Eric C. Rouchka, University of Louisville SATCHMO: sequence alignment and tree construction using hidden Markov models Edgar, R.C. and Sjolander, K. Bioinformatics.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Multiple sequence alignments Introduction to Bioinformatics Jacques van Helden Aix-Marseille Université (AMU), France Lab.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Sequence Alignment Only things that are homologous should be compared in a phylogenetic analysis Homologous – sharing a common ancestor This is true for.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Techniques for Protein Sequence Alignment and Database Searching (part2) G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Medical Natural Sciences Year 2: Introduction to Bioinformatics Lecture 9: Multiple sequence alignment (III) Centre for Integrative Bioinformatics VU.
Doug Raiford Lesson 5.  Dynamic programming methods  Needleman-Wunsch (global alignment)  Smith-Waterman (local alignment)  BLAST Fixed: best Linear:
Sequence Alignment.
Step 3: Tools Database Searching
1 Multiple Sequence Alignment(MSA). 2 Multiple Alignment Number of sequences >2 Global alignment Seek an alignment that maximizes score.
Introduction to bioinformatics Lecture 7 Multiple sequence alignment (1)
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.
Introduction to bioinformatics lecture 8
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
BIOINFORMATICS Ayesha M. Khan Spring Lec-6.
Pairwise alignment Now we know how to do it: How do we get a multiple alignment (three or more sequences)? Multiple alignment: much greater combinatorial.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Multiple sequence alignment (msa)
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
Multiple sequence alignment Why?
1-month Practical Course
Introduction to Bioinformatics
Introduction to bioinformatics 2007 Lecture 9
Introduction to bioinformatics Lecture 8
1-month Practical Course
1-month Practical Course Genome Analysis Iterative homology searching
Presentation transcript:

1-month Practical Course Genome Analysis Lecture 5: Multiple Sequence Alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The Netherlands ibi.vu.nl C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E

Multiple sequence alignment

 Sequences can be conserved across species and perform similar or identical functions. > hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved; > identification of regions or domains that are critical to functionality.  Sequences can be mutated or rearranged to perform an altered function. > which changes in the sequences have caused a change in the functionality. Multiple sequence alignment: the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.

What to ask yourself  How do we get a multiple alignment? (three or more sequences)  What is our aim? – Do we go for max accuracy, least computational time or the best compromise?  What do we want to achieve each time

Sequence-sequence alignment sequence

Multiple alignment methods  Multi-dimensional dynamic programming > extension of pairwise sequence alignment.  Progressive alignment > incorporates phylogenetic information to guide the alignment process  Iterative alignment > correct for problems with progressive alignment by repeatedly realigning subgroups of sequence

Simultaneous multiple alignment Multi-dimensional dynamic programming The combinatorial explosion  2 sequences of length n  n 2 comparisons  Comparison number increases exponentially  i.e. n N where n is the length of the sequences, and N is the number of sequences  Impractical for even a small number of short sequences

Multi-dimensional dynamic programming (Murata et al., 1985) Sequence 1 Sequence 2 Sequence 3

The MSA approach  MSA (Lipman et al., 1989, PNAS 86, 4412)  MSA restricts the amount of memory by computing bounds that approximate the centre of a multi-dimensional hypercube.  Calculate all pair-wise alignment scores.  Use the scores to to predict a tree.  Calculate pair weights based on the tree (lower bound).  Produce a heuristic alignment based on the tree.  Calculate the maximum weight for each sequence pair (upper bound).  Determine the spatial positions that must be calculated to obtain the optimal alignment.  Perform the optimal alignment.  Report the weight found compared to the maximum weight previously found (measure of divergence).  Extremely slow and memory intensive.  Max 8-9 sequences of ~250 residues.

The DCA approach  DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2), 67-73)  Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint.  This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences.  This procedure is re-iterated until the sequences are sufficiently short.  Optimal alignment by MSA.  Finally, the resulting short alignments are concatenated.

So in effect … Sequence 1 Sequence 2 Sequence 3

Multiple alignment methods  Multi-dimensional dynamic programming > extension of pairwise sequence alignment.  Progressive alignment > incorporates phylogenetic information to guide the alignment process  Iterative alignment > correct for problems with progressive alignment by repeatedly realigning subgroups of sequence

The progressive alignment method  Underlying idea: usually we are interested in aligning families of sequences that are evolutionary related.  Principle: construct an approximate phylogenetic tree for the sequences to be aligned and than to build up the alignment by progressively adding sequences in the order specified by the tree.  But before going into details, some facts about multiple alignment profiles …

How to represent a block of sequences?  Historically: consensus sequence – single sequence that best represents the amino acids observed at each alignment position. When consensus sequences are used the pair-wise DP algorithm can be used without alterations  Modern methods: Alignment profile – representation that retains the information about frequencies of amino acids observed at each alignment position.

Multiple alignment profiles (Gribskov et al. 1987)  Gribskov created a probe: group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case: globins & immunoglobulins).  Then he constructed a profile, which consists of a sequence position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe).  The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap- opening and gap-extension).

Multiple alignment profiles ACDWYACDWY - i fA.. fC.. fD..  fW.. fY.. Gapo, gapx Position dependent gap penalties Core region Gapped region Gapo, gapx fA.. fC.. fD..  fW.. fY.. fA.. fC.. fD..  fW.. fY..

Profile building  Example: each aa is represented as a frequency; gap penalties as weights. ACDWYACDWY Gap penalties i  Position dependent gap penalties  

Profile-sequence alignment ACD……VWY sequence

Sequence to profile alignment AAVVLAAVVL 0.4 A 0.2 L 0.4 V Score of amino acid L in sequence that is aligned against this profile position: Score = 0.4 * s(L, A) * s(L, L) * s(L, V)

Profile-profile alignment ACD..YACD..Y ACD……VWY profile

Profile to profile alignment 0.4 A 0.2 L 0.4 V Match score of these two alignment columns using the a.a frequencies at the corresponding profile positions: Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G) *0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S) s(x,y) is value in amino acid exchange matrix (e.g. PAM250, Blosum62) for amino acid pair (x,y) 0.75 G 0.25 S

So, for scoring profiles …  Think of sequence-sequence alignment.  Same principles but more information for each position. Reminder:  The sequence pair alignment score S comes from the sum of the positional scores M(aa i,aa j ) (i.e. the substitution matrix values at each alignment position minus penalties if applicable)  Profile alignment scores are exactly the same, but the positional scores are more complex

Scoring a profile position  At each position (column) we have different residue frequencies for each amino acid (rows) SO:  Instead of saying S=M(aa 1, aa 2 ) (one residue pair)  For frequency f>0 (amino acid is actually there) we take: ACD..YACD..Y Profile 1 ACD..YACD..Y Profile 2

Progressive alignment 1.Perform pair-wise alignments of all of the sequences; 2.Use the alignment scores to produces a dendrogram using neighbour-joining methods (guide-tree); 3.Align the sequences sequentially, guided by the relationships indicated by the tree. Biopat (first integrated method ever) MULTAL (Taylor 1987) DIALIGN (1&2, Morgenstern 1996) PRRP (Gotoh 1996) ClustalW (Thompson et al 1994) PRALINE (Heringa 1999) T Coffee (Notredame 2000) POA (Lee 2002) MAFFT (Katoh 2002) MUSCLE (Edgar 2004) PROBCONS (Do 2005)

Progressive multiple alignment Guide treeMultiple alignment Score 1-2 Score 1-3 Score 4-5 Scores Similarity matrix 5×5 Clustering (tree-building) method Iteration possibilities Align

General progressive multiple alignment technique (follow generated tree) d root

PRALINE progressive strategy d 4

There are problems … Accuracy is very important !!!!  Alignment errors during the construction of the MSA cannot be repaired anymore: errors made at any alignment step are propagated through subsequent progressive steps.  The comparisons of sequences at early steps during progressive alignments cannot make use of information from other sequences.  It is only later during the alignment progression that more information from other sequences (e.g. through profile representation) becomes employed in further alignment steps. “Once a gap, always a gap” Feng & Doolittle, 1987

Clustal, ClustalW, ClustalX CLUSTAL W/X (Thompson et al., 1994) uses Neighbour Joining (NJ) algorithm (Saitou and Nei, 1984), widely used in phylogenetic analysis, to construct a guide tree. Sequence blocks are represented by a profile, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree. Further carefully crafted heuristics include: –(i) local gap penalties –(ii) automatic selection of the amino acid substitution matrix, (iii) automatic gap penalty adjustment –(iv) mechanism to delay alignment of sequences that appear to be distant at the time they are considered. CLUSTAL (W/X) does not allow iteration (Hogeweg and Hesper, 1984; Corpet, 1988, Gotoh, 1996; Heringa, 1999, 2002)

Sequence weighting dilemma Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, ,1995)

Profile pre-processing Secondary structure-induced alignment Homology-extended alignment Matrix extension Objective: try to avoid (early) errors Additional strategies for multiple sequence alignment

Profile pre-processing Score 1-2 Score 1-3 Score ACD..YACD..Y Pi Px 1 Key Sequence Pre-alignment Pre-profile Master-slave (N-to-1) alignment

Pre-profile generation Score 1-2 Score 1-3 Score 4-5 ACD..YACD..Y ACD..YACD..Y Pre-profiles Pre-alignments ACD..YACD..Y Cut-off

Pre-profile alignment ACD..YACD..Y ACD..YACD..Y ACD..YACD..Y ACD..YACD..Y ACD..YACD..Y Pre-profiles Final alignment

Pre-profile alignment Final alignment

Pre-profile alignment Alignment consistency Ala131 A131 L133 C126 A131

PRALINE pre-profile generation Idea: use the information from all query sequences to make a pre-profile for each query sequence that contains information from other sequences You can use all sequences in each pre-profile, or use only those sequences that will probably align ‘correctly’. Incorrectly aligned sequences in the pre- profiles will increase the noise level. Select using alignment score: only allow sequences in pre-profiles if their alignment with the score higher than a given threshold value. In PRALINE, this threshold is given as prepro=1500 (alignment score threshold value is 1500 – see next two slides)

Flavodoxin-cheY consistency scores (PRALINE prepro=0) 1fx TEYTAETIARQL VL999ST AQGRKVACF FLAV_DESVH TEYTAETIAREL VL999ST AQGRKVACF FLAV_DESDE YDAVL999SAW GRKVAAF FLAV_DESGI TEGVAEAIAKTL DVVL999ST FLAV_DESSA STW fxn FLAV_MEGEL fcr TEVADFIGK DLLF FLAV_ANASP LFYGTQTGKTESVAEIIR FLAV_ECOLI GSDTGNTENIAKMIQ FLAV_AZOVI --79IGLFFGSNTGKTRKVAKSIK FLAV_ENTAG FLAV_CLOAB ILYSSKTGKTERVAK chy Avrg Consist Conservation fx1 G FLAV_DESVH G FLAV_DESDE A FLAV_DESGI FLAV_DESSA fxn FLAV_MEGEL fcr FLAV_ANASP FLAV_ECOLI FLAV_AZOVI FLAV_ENTAG FLAV_CLOAB chy Avrg Consist Conservation * Iteration 0 SP= AvSP= SId= 3838 AvSId= Consistency values are scored from 0 to 10; the value 10 is represented by the corresponding amino acid (red)

1fx IVYGSTTGNTEYTAETIARQL DLVLLGCSTW AQGRKVACF FLAV_DESVH IVYGSTTGNTEYTAETIAREL DLVLLGCSTW AQGRKVACF FLAV_DESSA IVYGSTTGNTET YDIVLFGCSTW SL98ADLKGKKVSVF FLAV_DESGI IVYGSTTGNTEGVA DVVLLGCSTW KKVGVF FLAV_DESDE IVFGSSTGNTE YDAVLFGCSAW GRKVAAF 4fxn IVYWSGTGNTE NI DILILGCSA ISGKKVALF FLAV_MEGEL IVYWSGTGNTEAMA DVILLGCPAMGSE GKKVGLF 2fcr IFFSTSTGNTTEVA YDLLFLGAPT DKLPEVDMKDLPVAIF FLAV_ANASP LFYGTQTGKTESVAEII YQYLIIGCPTW W GKLVAYF FLAV_AZOVI LFFGSNTGKTRKVAKSIK YQFLILGTPTLGEG KTVALF FLAV_ENTAG -266IGIFFGSDTGQTRKVAKLIHQKL DVRRATR88888SYPVLLLGTPT WQEF8-8NTLSEADLTGKTVALF FLAV_ECOLI IFFGSDTGNTENIAKMI YDILLLGIPT KLVALF FLAV_CLOAB ILYSSKTGKTERVAKLIE LQESEGIIFGTPTY SWE GKLGAAF 3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQ-AGGYGFVI---SDWNMPNM DGLEL--LKTIRADGAMSALPVLM Avrg Consist Conservation fx1 G FLAV_DESVH G FLAV_DESSA G FLAV_DESGI G GATLV FLAV_DESDE AS fxn GS FLAV_MEGEL G MD--AWKQRTEDTGATVI fcr GLGDA5-8Y5DNFC FLAV_ANASP GTGDQ5-GY EEKISQRGG FLAV_AZOVI GLGDQ FLAV_ENTAG GLGDQL-NYSKNFVSA-MR--ILYDLVIARGACVVG8888EGYKFSFSAA6664NEFVGLPLDQEN88888EERIDSWLE FLAV_ECOLI GC FLAV_CLOAB STANS EDENARIFGERIANKVKQI chy VTAEA---KKENIIAA AQAGAS GYVVK-----PFTAATLEEKLNKIFEKLGM Avrg Consist Conservation * Iteration 0 SP= AvSP= SId= 3955 AvSId= Flavodoxin-cheY consistency scores (PRALINE prepro=1500) Consistency values are scored from 0 to 10; the value 10 is represented by the corresponding amino acid (red)

Multiple alignment methods  Multi-dimensional dynamic programming > extension of pairwise sequence alignment.  Progressive alignment > incorporate phylogenetic information to (create an order to) guide the alignment process  Iterative alignment > correct problems with progressive alignment by repeatedly realigning subgroups of sequences

Iteration Alignment iteration: do an alignment learn from it do it better next time Bootstrapping

Consistency-based iterationPre-profiles Multiple alignment positionalconsistencyscores

Pre-profile update iteration Pre-profiles Multiple alignment

Iteration Convergence Limit cycle Divergence

CLUSTAL X (1.64b) multiple sequence alignment Flavodoxin-cheY 1fx1 -PKALIVYGSTTGNTEYTAETIARQLANAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK FLAV_DESVH MPKALIVYGSTTGNTEYTAETIARELADAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK FLAV_DESGI MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-M-ETTVVNVADVTAPGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPLYE-DLDRAGLKDKK FLAV_DESSA MSKSLIVYGSTTGNTETAAEYVAEAFENKE-I-DVELKNVTDVSVADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPLYD-SLENADLKGKK FLAV_DESDE MSKVLIVFGSSTGNTESIAQKLEELIAAGG-H-EVTLLNAADASAENLADGYDAVLFGCSAWGMEDLE------MQDDFLSLFE-EFNRFGLAGRK FLAV_CLOAB -MKISILYSSKTGKTERVAKLIEEGVKRSGNI-EVKTMNLDAVDKKFLQE-SEGIIFGTPTYYAN ISWEMKKWID-ESSEFNLEGKL FLAV_MEGEL --MVEIVYWSGTGNTEAMANEIEAAVKAAG-A-DVESVRFEDTNVDDVAS-KDVILLGCPAMGSE--E------LEDSVVEPFF-TDLAPKLKGKK 4fxn ---MKIVYWSGTGNTEKMAELIAKGIIESG-K-DVNTINVSDVNIDELLN-EDILILGCSAMGDE--V------LEESEFEPFI-EEISTKISGKK FLAV_ANASP SKKIGLFYGTQTGKTESVAEIIRDEFGNDVVT----LHDVSQAEVTDLND-YQYLIIGCPTWNIGELQ---SD-----WEGLYS-ELDDVDFNGKL FLAV_AZOVI -AKIGLFFGSNTGKTRKVAKSIKKRFDDETMSD---ALNVNRVSAEDFAQ-YQFLILGTPTLGEGELPGLSSDCENESWEEFLP-KIEGLDFSGKT 2fcr --KIGIFFSTSTGNTTEVADFIGKTLGAKADAP---IDVDDVTDPQALKD-YDLLFLGAPTWNTGADTERSGT----SWDEFLYDKLPEVDMKDLP FLAV_ENTAG MATIGIFFGSDTGQTRKVAKLIHQKLDGIADAP---LDVRRATREQFLS--YPVLLLGTPTLGDGELPGVEAGSQYDSWQEFTN-TLSEADLTGKT FLAV_ECOLI -AITGIFFGSDTGNTENIAKMIQKQLGKDVAD----VHDIAKSSKEDLEA-YDILLLGIPTWYYGEAQ-CD WDDFFP-TLEEIDFNGKL 3chy --ADKELKFLVVDDFSTMRRIVRNLLKELG----FNNVEEAEDGVDALN------KLQAGGYGFV--I------SDWNMPNMDG-LELLKTIR :.. : 1fx1 VACFGCGDSSYEYF--CGAVDAIEEKLKNLGAEIVQDG LRIDGDPRAARDDIVGWAHDVRGAI FLAV_DESVH VACFGCGDSSYEYF--CGAVDAIEEKLKNLGAEIVQDG LRIDGDPRAARDDIVGWAHDVRGAI FLAV_DESGI VGVFGCGDSSYTYF--CGAVDVIEKKAEELGATLVASS LKIDGEPDSAE--VLDWAREVLARV FLAV_DESSA VSVFGCGDSDYTYF--CGAVDAIEEKLEKMGAVVIGDS LKIDGDPERDE--IVSWGSGIADKI FLAV_DESDE VAAFASGDQEYEHF--CGAVPAIEERAKELGATIIAEG LKMEGDASNDPEAVASFAEDVLKQL FLAV_CLOAB GAAFSTANSIAGGS--DIALLTILNHLMVKGMLVYSGGVA----FGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF FLAV_MEGEL VGLFGSYGWGSGE-----WMDAWKQRTEDTGATVIGTA IVN-EMPDNAPECKE-LGEAAAKA fxn VALFGSYGWGDGK-----WMRDFEERMNGYGCVVVETP LIVQNEPDEAEQDCIEFGKKIANI FLAV_ANASP VAYFGTGDQIGYADNFQDAIGILEEKISQRGGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL FLAV_AZOVI VALFGLGDQVGYPENYLDALGELYSFFKDRGAKIVGSWSTDGYEFESSEAVV-DGKFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL---- 2fcr VAIFGLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVR-DGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV FLAV_ENTAG VALFGLGDQLNYSKNFVSAMRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL FLAV_ECOLI VALFGCGDQEDYAEYFCDALGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA 3chy AD--GAMSALPVL-----MVTAEAKKENIIAAAQAGAS GYV-VKPFTAATLEEKLNKIFEKLGM :..

Flavodoxin-cheY: Pre-processing (prepro  1500) 1fx1 -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf 2fcr --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL QSDWEGLY-SELDDVDFNGKLVAYf FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE AQCDWDDFF-PTLEEIDFNGKLVALf 4fxn -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL EESEFEPFI-EEIS-TKISGKKVALF FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL EDSVVEPFF-TDLA-PKLKGKKVGLf FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN ISWEMKKWI-DESSEFNLEGKLGAAf 3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM DGLELL-KTIRADGAMSALPVLM T 1fx1 GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD GLRIDGD--PRAARDDIVGWAHDVRGAI FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE GLKMEGD--ASNDPEAVASfAEDVLKQL FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD GLRIDGD--PRAARDDIVGwAHDVRGAI FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD SLKIDGD--PE--RDEIVSwGSGIADKI FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS SLKIDGE--PD--SAEVLDwAREVLARV fcr GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-- FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA 4fxn G-----SY-GWGDGKWMRDFEERMNGYGCVVVET PLIVQNE--PDEAEQDCIEFGKKIANI FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT AIVNEM--PDNA-PECKElGEAAAKA FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF chy VTAEAKK--ENIIAA AQAGAS GYVV-----KPFTAATLEEKLNKIFEKLGM G Iteration 0 SP= AvSP= SId= 4009 AvSId= 0.313

Flavodoxin-cheY: Local Pre-processing (locprepro  300) 1fx1 --PKALIVYGSTTGNTEYTAETIARQLANAGYEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPL--FDSLEETGAQGRKVACF FLAV_DESVH -MPKALIVYGSTTGNTEYTaETIARELADAGYEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPL--FDSLEETGAQGRKVACf FLAV_DESSA -MSKSLIVYGSTTGNTETAaEYVAEAFENKEIDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPL--YDSLENADLKGKKVSVf FLAV_DESGI -MPKALIVYGSTTGNTEGVaEAIAKTLNSEGMETTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPL--YEDLDRAGLKDKKVGVf FLAV_DESDE -MSKVLIVFGSSTGNTESIaQKLEELIAAGGHEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSL--FEEFNRFGLAGRKVAAf 4fxn --MK--IVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLN-EDILILGCSAMGDEVL------E-ESEFEPF--IEEIS-TKISGKKVALF FLAV_MEGEL -MVE--IVYWSGTGNTEAMaNEIEAAVKAAGADVESVRFEDTNVDDVAS-KDVILLgCPAMGSEEL------E-DSVVEPF--FTDLA-PKLKGKKVGLf 2fcr ---KIGIFFSTSTGNTTEVADFIGKTLGAKADAPI--DVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFL-YDKLPEVDMKDLPVAIF FLAV_ANASP -SKKIGLFYGTQTGKTESVaEIIRDEFGNDVVTLH--DVSQAEV-TDLNDYQYLIIgCPTWNIGEL QSDWEGL--YSELDDVDFNGKLVAYf FLAV_AZOVI --AKIGLFFGSNTGKTRKVaKSIKKRFDDETMSDA-LNVNRVSA-EDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEF--LPKIEGLDFSGKTVALf FLAV_ENTAG -MATIGIFFGSDTGQTRKVaKLIHQKLDG--IADAPLDVRRATR-EQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEF--TNTLSEADLTGKTVALf FLAV_ECOLI --AITGIFFGSDTGNTENIaKMIQKQLGKDVADVH--DIAKSSK-EDLEAYDILLLgIPTWYYGEA QCDWDDF--FPTLEEIDFNGKLVALf FLAV_CLOAB --MKISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNLDAVDKKFLQESEGIIFgTPTYYA NISWEMKKWIDESSEFNLEGKLGAAf 3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQ-AGGYGFVI---SDWNMPNM DGLEL--LKTIRADGAMSALPVLM 1fx1 GCGDS--SY-EYFCGA-VD--AIEEKLKNLGAEIVQD GLRID--GDPRAARDDIVGWAHDVRGAI FLAV_DESVH GCGDS--SY-EYFCGA-VD--AIEEKLKNLgAEIVQD GLRID--GDPRAARDDIVGwAHDVRGAI FLAV_DESSA GCGDS--DY-TYFCGA-VD--AIEEKLEKMgAVVIGD SLKID--GDPE--RDEIVSwGSGIADKI FLAV_DESGI GCGDS--SY-TYFCGA-VD--VIEKKAEELgATLVAS SLKID--GEPD--SAEVLDwAREVLARV FLAV_DESDE ASGDQ--EY-EHFCGA-VP--AIEERAKELgATIIAE GLKME--GDASNDPEAVASfAEDVLKQL fxn GS------Y-GWGDGKWMR--DFEERMNGYGCVVVET PLIVQ--NEPDEAEQDCIEFGKKIANI FLAV_MEGEL GS------Y-GWGSGEWMD--AWKQRTEDTgATVIGT AI-VN--EMPDNA-PECKElGEAAAKA fcr GLGDAE-GYPDNFCDA-IE--EIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV FLAV_ANASP GTGDQI-GYADNFQDA-IG--ILEEKISQRgGKTVGYWSTDGYDFNDSKALRN-GKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL FLAV_AZOVI GLGDQV-GYPENYLDA-LG--ELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-- FLAV_ENTAG GLGDQL-NYSKNFVSA-MR--ILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L FLAV_ECOLI GCGDQE-DYAEYFCDA-LG--TIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA FLAV_CLOAB STANSIAGGSDIALLTILNHLMVKgMLVYSGGVAFGKPKTHLGYVH INEIQENEDENARIfGERiANkVKQIF chy VTAEA---KKENIIAA AQAGAS GYVVK-----PFTAATLEEKLNKIFEKLGM G

Profile pre-processing Secondary structure-induced alignment Homology-extended alignment Matrix extension Objective: integrate secondary structure information to anchor alignments and avoid errors (“Structure more conserved than sequence”) Strategies for multiple sequence alignment

VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE (oligomers) SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold) Protein structure hierarchical levels

Why use (predicted) structural information “Structure more conserved than sequence” –Many structural protein families (e.g. globins) have family members with very low sequence similarities. For example, globin sequences identities can be as low as 10% while still having an identical fold. This means that you can still observe equivalent secondary structures in homologous proteins even if sequence similarities are extremely low. But you are dependent on the quality of prediction methods. For example, secondary structure prediction is currently at 76% correctness. So, 1 out of 4 predicted amino acids is still incorrect.

C5 anaphylatoxin -- human (PDB code 1kjs) and pig (1c5a)) proteins are superposed Two superposed protein structures with two well- superposed helices Red: well superposed Blue: low match quality

How to combine ss and aa info Dynamic programming search matrix Amino acid substitution matrices MDAGSTVILCFV HHHCCCEEEEEE MDAASTILCGSMDAASTILCGS HHHHCCEEECCHHHHCCEEECC C H E H C E Default

In terms of scoring… So how would you score a profile using this extra information? –Same formula as in lecture 6, but you can use sec. struct. specific substitution scores in various combinations. Where does it fit in? –Very important: structure is more conserved than sequence so if structures have forgotten how to match (I.e. they are too divergent), the secondary structure elements might help the alignment.

Sequences to be aligned Predict secondary structure HHHHCCEEECCCEEECCHH HHHCCCCEECCCEEHHH HHHHHHHHHHHHHCCCEEEE CCCCCCEECCCEEEECCHH HHHHHCCEEEECCCEECCC Align sequences using secondary structure Secondary structure Multiple alignment

Using predicted secondary structure 1fx1 -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeee FLAV_DESVH MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeee FLAV_DESGI MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeee FLAV_DESSA MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeee FLAV_DESDE MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee 2fcr --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeee FLAV_ANASP SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL QSDWEGLYSE-LDDVDFNGKLVAYf eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeee FLAV_ECOLI -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA QCDWDDFFPT-LEEIDFNGKLVALf eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeee FLAV_AZOVI -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeee FLAV_ENTAG MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee 4fxn ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeee FLAV_MEGEL M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeee FLAV_CLOAB M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI SWEMKKWIDE-SSEFNLEGKLGAAf eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee 3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM DGLELLKTIRADGAMSALPVLMV tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee 1fx1 GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD GLRIDGD--PRAARDDIVGWAHDVRGAI eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhh FLAV_DESVH GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD GLRIDGD--PRAARDDIVGwAHDVRGAI eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhh FLAV_DESGI GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS SLKIDGE--P--DSAEVLDwAREVLARV eee hhhhhhhhhhhh eeeee hhhhhhhhhhh FLAV_DESSA GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD SLKIDGD--P--ERDEIVSwGSGIADKI hhhhhhhhhhhh eeeee e eee FLAV_DESDE ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE GLKMEGD--ASNDPEAVASfAEDVLKQL e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh 2fcr GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht FLAV_ANASP GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhh FLAV_ECOLI GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhh FLAV_AZOVI GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-- e hhhhhhhhhhhhhh eeeee hhhhhhhhhhh FLAV_ENTAG GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh 4fxn G-----SYGWGDGKWMRDFEERMNGYGCVVVET PLIVQNE--PDEAEQDCIEFGKKIANI e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhht FLAV_MEGEL G-----SYGWGSGEWMDAWKQRTEDTgATVIGT AIVNEM--PDNAPE-CKElGEAAAKA hhhhhhhhhhh eeeee eeee h hhhhhhhh FLAV_CLOAB STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF-- hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h 3chy TAEAKKENIIAAAQAGASGY VVK----P-FTAATLEEKLNKIFEKLGM ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht G

Profile pre-processing Secondary structure-induced alignment Homology-extended alignment Matrix extension Strategies for multiple sequence alignment

PSI-PRALINE Multiple alignment of distant sequences using PSI-BLAST

Distant sequences Methyltransferase (16.7% sequence identity) Same  /  fold 1GZ0A SEIYGIHAVQALLERAPERFQEVFILKGREDKRL LPLIHALESQGVVIQLANRQYLDEKSDGAVHQG IIARVKPGRQ 1IPAA MRITSTANPRIKELARLLERKHRDSQRRFLIEGA REIERALQAGIELEQALVWEGGLNPEEQQVYAAL LALLEVSEAVLKKLSVRDNPAGLIALARMPER

How well do alignment methods perform Normal pair-wise alignment 10% correctly aligned positions T-COFFEE (Notredame et al., 2000) 15% correctly aligned positions MUSCLE (Edgar, 2004) 15% correctly aligned positions

Profile-profile alignment Homology detection has provided a solution BLAST (sequence-sequence) PSI-BLAST (profile-sequence) New generation methods (profile-profile)New generation methods (profile-profile)

State-of-the-art homology detection Sequence used to scan database and collect homologous information (PSI-BLAST) –Local pair-wise alignment PSI-BLAST profile used to scan profiles of other sequences –Local pair-wise alignment The difference between methods is the profile- profile scoring scheme used for the alignment

The PRALINE way Sequence used to scan database and collect homologous information (PSI-BLAST) –Local alignment PSI-BLAST profiles are used for the alignment instead of the original sequences –Global, pair-wise OR multiple alignment 92%PRALINE scores 92% correctly aligned positions on the previous example (methyltransferase)

PSI Pair-wise alignment

Multiple alignment PSI PREPRO

Example: methyltransferases

A B The effects of using E- value thresholds of increasing stringency in PRALINEPSI on the 624 HOMSTRAD pairwise alignments. (A) The difference between the average Q scores of PRALINEPSI and the basic PRALINE method (B) The distributions of improved, equal and worsened cases compared with the basic PRALINE method for each E- value threshold. The ‘inc’ column is the PRALINEPSI incremental strategy starting from a threshold of 10 -6, and the ‘max’ column is PRALINEPSI’s theoretical upper limit for the tested threshold range.

Profile pre-processing Secondary structure-induced alignment Homology-extended alignment Matrix extension Objective: try to avoid (early) errors Strategies for multiple sequence alignment

Integrating alignment methods and alignment information with T-Coffee Integrating different pair-wise alignment techniques (NW, SW,..) Combining different multiple alignment methods (consensus multiple alignment) Combining sequence alignment methods with structural alignment techniques Plug in user knowledge

Matrix extension T-Coffee Tree-based Consistency Objective Function For alignmEnt Evaluation Cedric Notredame Des Higgins J. Mol. Biol., 302, ;2000 Jaap HeringaJ. Mol. Biol., 302, ;2000

Using different sources of alignment information Clustal Dialign Clustal Lalign Structure alignments Manual T-Coffee

T-Coffee library system Seq1AA1Seq2AA2Weight 3V315L3310 3V316L3414 5L336R3521 5l336I3635

Matrix extension

Search matrix extension – alignment transitivity

T-Coffee Direct alignment Other sequences

Search matrix extension

but..... T-COFFEE (V1.23) multiple sequence alignment Flavodoxin-cheY 1fx1 ----PKALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK----- FLAV_DESVH ---MPKALIVYGSTTGNTEYTAETIARELADAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK----- FLAV_DESGI ---MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-METTVVNVADVT-APGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPL-YEDLDRAGLKDKK----- FLAV_DESSA ---MSKSLIVYGSTTGNTETAAEYVAEAFENKE-IDVELKNVTDVS-VADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPL-YDSLENADLKGKK----- FLAV_DESDE ---MSKVLIVFGSSTGNTESIAQKLEELIAAGG-HEVTLLNAADAS-AENLADGYDAVLFGCSAWGMEDLE------MQDDFLSL-FEEFNRFGLAGRK fxn MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVN-IDELL-NEDILILGCSAMGDEVLE ESEFEPF-IEEIS-TKISGKK----- FLAV_MEGEL -----MVEIVYWSGTGNTEAMANEIEAAVKAAG-ADVESVRFEDTN-VDDVA-SKDVILLGCPAMGSEELE DSVVEPF-FTDLA-PKLKGKK----- FLAV_CLOAB ----MKISILYSSKTGKTERVAKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQ-ESEGIIFGTPTYYAN ISWEMKKW-IDESSEFNLEGKL fcr -----KIGIFFSTSTGNTTEVADFIGKTLGAKA---DAPIDVDDVTDPQAL-KDYDLLFLGAPTWNTGA----DTERSGTSWDEFLYDKLPEVDMKDLP----- FLAV_ENTAG ---MATIGIFFGSDTGQTRKVAKLIHQKLDGIA---DAPLDVRRAT-REQF-LSYPVLLLGTPTLGDGELPGVEAGSQYDSWQEF-TNTLSEADLTGKT----- FLAV_ANASP ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL QSDWEGL-YSELDDVDFNGKL----- FLAV_AZOVI ----AKIGLFFGSNTGKTRKVAKSIKKRFDDET-M-SDALNVNRVS-AEDF-AQYQFLILGTPTLGEGELPGLSSDCENESWEEF-LPKIEGLDFSGKT----- FLAV_ECOLI ----AITGIFFGSDTGNTENIAKMIQKQLGKDV---ADVHDIAKSS-KEDL-EAYDILLLGIPTWYYGEA QCDWDDF-FPTLEEIDFNGKL chy ADKELKFLVVD--DFSTMRRIVRNLLKELGFN-NVE-EAEDGVDALNKLQ-AGGYGFVISDWNMPNMDGLE LLKTIRADGAMSALPVLMV :... :. :: 1fx VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG LRIDGDPRAA--RDDIVGWAHDVRGAI FLAV_DESVH VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG LRIDGDPRAA--RDDIVGWAHDVRGAI FLAV_DESGI VGVFGCGDSS--YTYFCGA-VDVIEKKAEELGATLVASS LKIDGEPDSA----EVLDWAREVLARV FLAV_DESSA VSVFGCGDSD--YTYFCGA-VDAIEEKLEKMGAVVIGDS LKIDGDPE----RDEIVSWGSGIADKI FLAV_DESDE VAAFASGDQE--YEHFCGA-VPAIEERAKELGATIIAEG LKMEGDASND--PEAVASFAEDVLKQL fxn VALFGS------YGWGDGKWMRDFEERMNGYGCVVVETP LIVQNEPD--EAEQDCIEFGKKIANI FLAV_MEGEL VGLFGS------YGWGSGEWMDAWKQRTEDTGATVIGTA IV--NEMP--DNAPECKELGEAAAKA FLAV_CLOAB GAAFSTANSI--AGGSDIA-LLTILNHLMVKGMLVY----SGGVAFGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF fcr VAIFGLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRDG-KFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV FLAV_ENTAG VALFGLGDQLNYSKNFVSA-MRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL FLAV_ANASP VAYFGTGDQIGYADNFQDA-IGILEEKISQRGGKTVGYWSTDGYDFNDSKALRNG-KFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL FLAV_AZOVI VALFGLGDQVGYPENYLDA-LGELYSFFKDRGAKIVGSWSTDGYEFESSEAVVDG-KFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL---- FLAV_ECOLI VALFGCGDQEDYAEYFCDA-LGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA 3chy TAEAKKENIIAAAQAGASGYVVKPFT---AATLEEKLNKIFEKLGM