Multiple Sequence Alignment (MSA). Ecole Phylogénomique, Carry le Rouet 2006 Plan Introduction to sequence alignments Multiple alignment construction.


1 Multiple Sequence Alignment (MSA)

2 Plan
- Introduction to sequence alignments
- Multiple alignment construction
  - Traditional approaches
  - Alignment parameters
  - Alternative approaches
- Multiple alignment main applications
- MACSIMS: Multiple Alignment of Complete Sequences Information Management System

3 Local alignment / Global alignment
Global alignment: alignment of the sequences over their whole length.
Sequence A  GGCTGACCACC-TT
            |  | ||  || |
Sequence B  GA-TCACTTCCATG
Local alignment: alignment of the high-similarity regions only.
Sequence A  GACCACCTT
            || ||| ||
Sequence B  GATCAC-TT
Optimal global pairwise alignment: Needleman and Wunsch, 1970.
Optimal local pairwise alignment: Smith and Waterman, 1981.
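The difference between the two optimal methods can be sketched in a few lines. This is a minimal, score-only illustration, assuming a toy scoring scheme (match +1, mismatch -1, linear gap cost -1) that does not come from the slides; the function names are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: best alignment of a and b over their whole length."""
    # prev is row i-1 of the dynamic programming table, cur is row i.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # a[:i] aligned against gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best alignment between any two substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are clamped at 0 so that an alignment can restart anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    seq_a, seq_b = "GGCTGACCACCTT", "GATCACTTCCATG"
    print("global:", global_score(seq_a, seq_b))
    print("local: ", local_score(seq_a, seq_b))
```

The only structural difference is the clamp at zero and the use of the best cell anywhere in the table, which is what lets the local method ignore poorly matching flanking regions.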

4 Pairwise alignment / Multiple alignment
Query: 177 EMGDTGPCGPCSEIHYDRIGGRDAAHLVNQDDPNVLEIWNLVFIQYNR---EADG----I 229
G G GP E+ Y LE+ LVF+QY + AD I
Sbjct: 193 AGG--GNAGPAFEVLYKG LEVATLVFMQYKKAPANADPSQVVI 233
Query: 230 LK-----PLPKKSIDTGMGLERLVSVLQNKMSNYDTDLFVPYFEAIQKGTGARPYTGKVG 284
+K P+ K +DTG GLERLV + Q + YD L E +++ G ++
Sbjct: 234 IKGEKYVPMETKVVDTGYGLERLVWMSQGTPTAYDAVLGY-VIEPLKRMAGVEKIDERIL 292
Query: 285 AEDA DGIDMAYR VLADHARTITVAL 309
E++ D D+ Y +ADH + +T L
Sbjct: 293 MENSRLAGMFDIEDMGDLRYLREQVAKRVGISVEELERLIRPYELIYAIADHTKALTFML 352

5 What is a multiple alignment?
A representation of a set of sequences in which equivalent residues (e.g. functionally or structurally equivalent) are aligned in columns.
[Figure: example alignment annotated with its conservation profile, secondary structure and conserved residues]

6 MACS: schematic overview of a complete alignment, e.g. domain organisation (InterPro)
[Figure key: SH3, SH2, PI-PLC-X, PI-PLC-Y, PH, C2, CH, rhoGEF, DAG_PE-bind]

7 Why multiple alignments?
Integration of a sequence in the context of its protein family.
Applications:
- phylogeny
- domain organisation
- functional residue identification
- 2D/3D structure prediction
- transmembrane prediction
- …

8 MSA Construction

9 Multiple alignment construction
- Traditional approaches
  - Optimal multiple alignment
  - Progressive multiple alignment
- Alignment parameters
  - Residue similarity matrices
  - Gap penalties
- Alternative approaches
  - Iterative alignment methods
  - Combinatorial algorithms
  - PipeAlign: a protein family analysis tool

10 Traditional Approaches

11 Optimal multiple alignment
The direct extension of pairwise dynamic programming to multiple dimensions, one per sequence (Sankoff, 1975): all possible alignments are examined to find the optimal one.
Problems:
- The mathematically optimal alignment is not necessarily the biologically optimal alignment.
- The CPU time and memory required are prohibitive for practical purposes: the required time is proportional to N^k for k sequences of length N, which limits the method to fewer than 10 sequences.
Example: alignment of 3 sequences.
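A quick calculation shows how fast the N^k term grows. The sketch below counts the cells of the dynamic programming lattice, (N+1)^k, and the number of neighbouring cells each one depends on, 2^k - 1; the function names and the sequence length of 100 are illustrative assumptions.

```python
def lattice_cells(length, n_sequences):
    """Number of cells in the dynamic programming lattice for n sequences."""
    return (length + 1) ** n_sequences


def predecessors_per_cell(n_sequences):
    """Each cell is reached from every non-empty subset of advancing sequences."""
    return 2 ** n_sequences - 1


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        print(k, lattice_cells(100, k), predecessors_per_cell(k))
```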

12 Progressive multiple alignment
A heuristic algorithm that avoids calculating all possible alignments, but does not guarantee an 'optimal' alignment.
Principle: progressively align the sequences (or groups of sequences) in pairs.
Problem: which sequences should be aligned first, and in which order? Align the closest sequences first.
How to estimate the distance between the sequences?
- align all pairs of sequences
- calculate a distance matrix from the pairwise alignments
- construct a guide tree from this distance matrix
- build the progressive multiple alignment following the branching order of the tree

13 Progressive multiple alignment
Step 1: Pairwise alignment of all sequences
Example: alignment of 7 globins (Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla).
The pairwise alignments can be obtained with:
- a global or a local method
- dynamic programming or heuristic methods
Example: ClustalX uses global alignments, with a choice between
- a heuristic method (as used in the Fasta program): faster
- dynamic programming (Smith & Waterman): better
Ex: pairwise alignments of globin sequences
Hbb_human 1 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST...
|.| :|. | | ||||. | | ||| |:. :| |. :| | |||
Hba_human 3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS....
Hbb_human 1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST...
| |. |||.|| ||| ||| :|||||||||||||||||||||:||||||
Hbb_horse 2 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN...
Hba_human 3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH...
|| :| | | | || | | ||| |:. :| |. :| | |||.
Hbb_horse 2 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN...

14 Progressive multiple alignment
Step 2: Distance matrix construction
In ClustalX: distance between 2 sequences = 1 - (number of identical residues / number of compared residues)
Ex: Hbb_human vs Hbb_horse = 83% identity = 17% distance
[Figure: distance matrix for Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla]
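The distance formula can be written as a short function. This is a sketch rather than the ClustalX implementation: it assumes that the two sequences are already aligned to the same length and that columns containing a gap are not counted as compared residues. The function name is hypothetical.

```python
def p_distance(row_a, row_b, gap="-"):
    """1 - (identical residues / compared residues) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are not compared
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no residue pair to compare")
    return 1 - identical / compared


if __name__ == "__main__":
    # 3 identical residues out of 4 compared gives a distance of 0.25
    print(p_distance("ACGT", "ACGA"))
```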

15 Progressive multiple alignment
Step 3: Sequential branching / Guide tree construction
- Join the 2 closest sequences.
- Recalculate the distances and join the 2 closest sequences or nodes.
- Repeat until all sequences are joined.
[Figure: sequential branching order and the resulting guide tree for Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla]
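The joining loop of Step 3 can be sketched as follows. This is an illustration under stated assumptions, not the ClustalX code: distances to a new node are recalculated as a size-weighted average (UPGMA-style), whereas ClustalX builds its tree by neighbor-joining. In the example, only the 0.17 distance comes from the slides; the other values and the function name are made up.

```python
from itertools import combinations


def guide_tree(names, distances):
    """Join the two closest clusters until one tree (nested tuples) remains.

    distances maps frozenset({a, b}) -> distance for every pair of names.
    """
    d = dict(distances)
    size = {name: 1 for name in names}  # cluster -> number of leaves
    while len(size) > 1:
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for other in size:
            if other != a and other != b:
                # Recalculate the distance to the new node (size-weighted average).
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


if __name__ == "__main__":
    names = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse"]
    pairs = {
        ("Hbb_human", "Hbb_horse"): 0.17,  # value quoted on the slide
        ("Hba_human", "Hba_horse"): 0.12,  # hypothetical
        ("Hbb_human", "Hba_human"): 0.57,  # hypothetical
        ("Hbb_human", "Hba_horse"): 0.58,  # hypothetical
        ("Hbb_horse", "Hba_human"): 0.56,  # hypothetical
        ("Hbb_horse", "Hba_horse"): 0.57,  # hypothetical
    }
    print(guide_tree(names, {frozenset(p): v for p, v in pairs.items()}))
```

The nested tuples give the branching order directly: the innermost pairs are aligned first, then the resulting groups are aligned to each other.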

16 Progressive multiple alignment
Step 4: Progressive alignment
The progressive multiple alignment follows the branching order of the guide tree.
[Figure: guide tree for Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla]

17 Progressive multiple alignment
[Figure: resulting multiple alignment, with regions labelled H1 to H7]

18 Progressive multiple alignment methods
[Figure: progressive methods classified as local or global and by tree-building method (SB, ML, UPGMA, NJ); programs shown: SBpima, MLpima, multal, multalign, pileup, clustalx]
SB - Sequential Branching
UPGMA - Unweighted Pair Group Method with Arithmetic mean
ML - Maximum Likelihood
NJ - Neighbor-Joining

19 Alignment Parameters

20 Residue similarity matrices
Dynamic programming methods score an alignment using residue similarity matrices, which contain a score for matching every pair of residues.
For proteins, a wide variety of matrices exist: Identity, PAM, Blosum, Gonnet, etc.
[Figure: PAM 250 matrix]

21 Residue similarity matrices
- Matrices range from strict ones, for comparing closely related sequences, to soft ones, for very divergent sequences.
- Matrices are generally constructed by observing the mutations in large sets of alignments, either sequence-based or structure-based.
- ClustalW automatically selects a suitable matrix depending on the observed pairwise % identity.
- A single best matrix does not exist!

22 Gap penalties
A gap penalty is a cost for introducing gaps into the alignment, corresponding to insertions or deletions in the sequences. The goal is to introduce gaps in sequence segments corresponding to flexible regions of the protein structure.
SFGDLSNPGAVMG
HF-DLS-----HG
- Fixed penalty per gap position: P = aL, with L the length of the gap
- Affine (linear) penalty: P = x + yL, with x the gap opening penalty (gop) and y the gap extension penalty (gep)
- Position-specific and residue-specific penalties. Ex: in ClustalW, gap penalties are
  - lowered at existing gaps
  - increased close to (less than 8 residues from) existing gaps
  - lowered in hydrophilic stretches (loops)
  - otherwise, gap opening penalties are modified according to the observed relative frequencies of residues adjacent to gaps (Pascarella & Argos, 1992)

23 Alternative Approaches

24 Iterative alignment methods
- Iterative alignment, e.g. PRRP (Gotoh, 1993): refines an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them.
- Genetic algorithms, e.g. SAGA (Notredame et al, 1996): iteratively refines an alignment using genetic algorithms (a population of alignments evolves in a quasi-evolutionary manner).
- Segment-to-segment alignment, DIALIGN (Morgenstern et al. 1999): searches for locally conserved motifs in all sequences and compares segments of sequences instead of single residues.
- Hidden Markov Models, e.g. HMMER (Eddy, 1998) and SAM (Karplus et al, 2001): iteratively refine an alignment using HMMs.

25 Multiple alignment methods
[Figure: methods classified as progressive or iterative, local or global, and by strategy (SB, ML, UPGMA, NJ, genetic algorithm, HMM); programs shown: SBpima, MLpima, multal, multalign, pileup, clustalx, dialign, saga, hmmt, prrp]

26 BAliBASE: objective evaluation of MACS programs
- High-quality alignments based on 3D structural superpositions and manually verified
- Alignments compared only in reliable 'core blocks', excluding non-superposable regions
- Separate reference sets specifically designed to address distinct alignment problems
Reference set   Description
1               small number of sequences: divergence, length
2               a family with one to 3 orphans
3               several sub-families
4               long N/C terminal extensions
5               long insertions
6               repeats
7               transmembrane regions
8               circular permutations
BAliBASE1: Thompson et al, Bioinformatics. BAliBASE2: Bahr et al, 2001, Nucl Acids Res.

27 Comparison of multiple alignment methods
Reference alignments are needed to evaluate the alignment programs: BAliBASE (Thompson et al. Bioinformatics. 1999), a benchmark database.
- Alignments are based on 3D structure superposition.
- Alignments must be compared over the superposable regions.
- The alignments take into account:
  - the effect of the number of sequences
  - the effect of the sequence length
  - the effect of the sequence similarity
  - alignment of an orphan sequence with a sequence family
  - sub-family alignments
  - alignments of sequences of different lengths (insertions, extensions)

28 Comparison of multiple alignment methods
Above 35% identity: any method.
Local / global methods
- Colinear sequences => global methods
- N/C-terminal extensions or insertions => local methods
Progressive / iterative methods
- Iterative algorithms usually improve alignment quality.
- Problems:
  - they can give bad alignments in the case of orphan sequences
  - the iterative process can be very long! Example: alignment of 89 histone sequences (66-92 residues): ClustalW 2 mins 41 secs; PRRP 3 hours 40 mins; Dialign 3 hours 48 mins
To increase alignment quality, as many sequences as possible have to be integrated!

29 DbClustal: local and global algorithm coupling
[Figure: a query sequence (domains A, B, C) is used for a Blast database search; the database hits are processed by Ballast to define anchors, and the query sequence and anchors are passed to DbClustal to produce the alignment]

30 ClustalW / DbClustal comparison
[Figure panels: ClustalW, DbClustal]

31 Combinatorial algorithms
- T-Coffee (Notredame et al. 2000) performs local and global alignments for all pairs of sequences, then combines them in a progressive multiple alignment, similar to ClustalW.
- DbClustal (Thompson et al. 2000) is designed to align the sequences detected by a database search. Locally conserved motifs are detected using the Ballast program (Plewniak et al. 1999) and are used in the global multiple alignment as anchor points.
- MAFFT (Katoh et al. 2002) detects locally conserved segments using a Fast Fourier Transform, then uses a restricted global DP and a progressive algorithm.
- MUSCLE (Edgar, 2004) uses k-mer distances and log-expectation scores, with progressive alignment and iterative refinement.
- PROBCONS (Do et al, 2005) uses pairwise consistency based on an objective function.

32 Multiple Alignment Quality: truncated alignments
[Table: scores per program. Columns: Ref1 V1 (<20%), Ref1 V2 (20-40%), Ref2 orphans, Ref3 subgroups, Ref4 extensions, Ref5 insertions, Time (sec). Rows: ClustalW, Dialign, Mafft, Maffti, Muscle, Muscle_fast, Muscle_med, Tcoffee, Probcons]
muscle_fast: muscle -maxiters=1 -diags1 -sv -distance1 kbit20_3
muscle_medium: muscle -maxiters=2
1. Significant improvement in accuracy/efficiency since 2000
2. Twilight zone still exists
3. Probcons scores best in all tests, but is MUCH slower than MAFFT or MUSCLE
4. MAFFTI scores slightly better than MUSCLE in all tests, and is more efficient

33 Multiple Alignment Quality: comparison of truncated (T) versus full-length (FL) sequences
[Table: scores per program. Columns: Ref1 V1 (<20%), Ref1 V2 (20-40%), Ref2: orphans, Ref3: subgroups, Time (sec) for all refs. Rows: ClustalW, Dialign, Mafft, Maffti, Muscle, Muscle_fast, Muscle_med, Tcoffee, Probcons]
1. Loss of accuracy is more important in the twilight zone (Ref1 V1, orphans, and subgroups)
2. Probcons still scores best in all tests
3. MAFFT still scores better than MUSCLE in all tests

34 Multiple alignment quality
Development of objective functions to estimate multiple alignment quality:
- Sum-of-pairs (Carrillo, Lipman, 1988): sums the scores of all pairs of sequences (based on a similarity matrix and gap penalties).
- Relative entropy: uses a normalized log-likelihood ratio to measure the degree of conservation of each column (identical residues only).
- MD (column scores used in ClustalX): uses a comparison matrix (Gonnet) to take similar residues into account.
- norMD (Thompson et al, 2001): scores by column using a substitution matrix and gap penalties, normalised according to the sequences to align (their number, their length and the similarity between them).
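As an illustration of the simplest of these objective functions, the sum-of-pairs score can be sketched as below. The scoring scheme is a toy assumption (identical residues +1, different residues 0, residue against gap -1, gap against gap 0) standing in for a real similarity matrix and gap penalties; the function names are hypothetical.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=0, gap_cost=-1, gap="-"):
    """Score one pair of characters taken from the same alignment column."""
    if x == gap and y == gap:
        return 0
    if x == gap or y == gap:
        return gap_cost
    return match if x == y else mismatch


def sum_of_pairs(rows, **scoring):
    """Sum the pair scores over every column and every pair of sequences."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(
        pair_score(x, y, **scoring)
        for column in zip(*rows)
        for x, y in combinations(column, 2)
    )


if __name__ == "__main__":
    print(sum_of_pairs(["AC", "AC", "AG"]))
```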

35 Evaluation of Objective Functions using BAliBASE

36 Multiple sequence alignment editors
- SeqLab (GCG Wisconsin Package)
- SeaView (Galtier et al, 1996)
- Web servers: GeneAlign (Kurukawa), Jalview (Clamp, 1998), CINEMA (Lord et al, 2002)
No automatic method is 100% reliable. Manual verification and refinement are essential!

37 FASTA format
>O88763 Phosphatidylinositol 3-kinase
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETC
SDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD-
-----VYGPG-RAVPVGGTTVSLFGKYGMFRQGMHDLKVWPNVEADGSEPTRTPGRTSST
LSEDQMSRLAKLTKAHRQGHMVKVLDRLTFREIEMINESEKRSS--NFMYLMVEFRCVKC
DDKE-YGIVYYE----
>Q9W1M7 CG5373-PA (GH13170p)
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEH
PSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD-
-----CSGAG-QTTVIGGTSISMFGKDGMFRQGMYDLRVWLGVEGDGNFPSRTPGK-GKE
SSKSQMQRLGKLAKKHRNGQVQKVLDRLTFREIEVINEREKRMS--DYMFLMIEFPAIVV
DDMYNYAVVYFE----
>Q7PMF0 ENSANGP (Fragment)
------------LRYIGSSSLLQKISIKIGTLEGENVGYSYEKLIEQPLLKFSGMYTEKT
PPLKVKLQIFDNGEPVGLPVCTSHKHFTTRWS-WNEWVTLPLRFTDISRTAVLGLTIYD-
-----CAGGREQLTVVGGTSISFFSTNGLFRQGLYDLKVWPQMEPDGACNSITPGK-AIT
TGVHQMQRLSKLAKKHRNGQMEKILDRLTFRELEVINEMEKRNS--QFLYLMVEFPQVYI
HEKL-YSVIHLE----
>Q9TXI7 Related to yeast vacuolar protein sorting factor protein 34
MIPGMRATPTESFSFVYSCDLQTNVQVKVAEFEG-----IFRDVLN-PVRRLNQLFAEIT
VYCNNQQIGYPVCTSFHTPPDSSQLARQKLIQKWNEWLTLPIRYSDLSRDAFLHITIWEH
EDDEIVNNSTFSRRLVAQSKLSMFSKRGILKSGVIDVQMNVSTTPDPFVKQPETWKYSDA
WG-DEIDLLFKQVTRQSRGLVEDVLDPFASRRIEMIRAKYKYSSPDRHVFLVLEMAAIRL
GPTF-YKVVYYEDETK
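Reading such a file takes only a few lines. This is a minimal sketch, assuming that each record starts with a ">" header whose first word is the identifier and that the sequence may be wrapped over several lines; the function name and the short sample are illustrative.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: identifier, then an optional free-text description.
            identifier, _, description = line[1:].partition(" ")
            records.append([identifier, description, []])
        elif records:
            # Sequence line: append to the most recent record.
            records[-1][2].append(line)
    return [(ident, desc, "".join(chunks)) for ident, desc, chunks in records]


if __name__ == "__main__":
    sample = ">seq1 first toy sequence\nMGEA-EKF\nHYIY\n>seq2\nMDQPDDHF\nRYIH\n"
    for ident, desc, seq in parse_fasta(sample):
        print(ident, desc, len(seq))
```

In an aligned FASTA file, all sequences should have the same length once the wrapped lines are joined, which makes a quick consistency check possible.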

38 MSF format (Multiple Sequence File)
toto.msf MSF: 256 Type: P May 24, :34 Check:
Name: O88763 Len: 256 Check: 9443 Weight: 1.00
Name: Q9W1M7 Len: 256 Check: 1161 Weight: 1.00
Name: Q7PMF0 Len: 256 Check: 8095 Weight: 1.00
Name: Q9TXI7 Len: 256 Check: 4716 Weight: 1.00
//
1 50
O88763 ......MGEA EKFHYIYSCD LDINVQLKIG SLEGKREQKS YKAVLEDPML
Q9W1M7 .....MDQPD DHFRYIHSSS LHERVQIKVG TLEGKKRQPD YEKLLEDPIL
Q7PMF0 .......... ..LRYIGSSS LLQKISIKIG TLEGENVGYS YEKLIEQPLL
Q9TXI7 MIPGMRATPT ESFSFVYSCD LQTNVQVKVA EFEG.....I FRDVLN.PVR

O88763 KFSGLYQETC SDLYVTCQVF AEGKPLALPV RTSYKPFSTR WN.WNEWLKL
Q9W1M7 RFSGLYSEEH PSFQVRLQVF NQGRPYCLPV TSSYKAFGKR WS.WNEWVTL
Q7PMF0 KFSGMYTEKT PPLKVKLQIF DNGEPVGLPV CTSHKHFTTR WS.WNEWVTL
Q9TXI7 RLNQLFAEIT VYCNNQQIGY PVCTSFHTPP DSSQLARQKL IQKWNEWLTL

O88763 PVKYPDLPRN AQVALTIWD. .....VYGPG .RAVPVGGTT VSLFGKYGMF
Q9W1M7 PLQFSDLPRS AMLVLTILD. .....CSGAG .QTTVIGGTS ISMFGKDGMF
Q7PMF0 PLRFTDISRT AVLGLTIYD. .....CAGGR EQLTVVGGTS ISFFSTNGLF
Q9TXI7 PIRYSDLSRD AFLHITIWEH EDDEIVNNST FSRRLVAQSK LSMFSKRGIL

O88763 RQGMHDLKVW PNVEADGSEP TRTPGRTSST LSEDQMSRLA KLTKAHRQGH
Q9W1M7 RQGMYDLRVW LGVEGDGNFP SRTPGK.GKE SSKSQMQRLG KLAKKHRNGQ
Q7PMF0 RQGLYDLKVW PQMEPDGACN SITPGK.AIT TGVHQMQRLS KLAKKHRNGQ
Q9TXI7 KSGVIDVQMN VSTTPDPFVK QPETWKYSDA WG.DEIDLLF KQVTRQSRGL

O88763 MVKVLDRLTF REIEMINESE KRSS..NFMY LMVEFRCVKC DDKE.YGIVY
Q9W1M7 VQKVLDRLTF REIEVINERE KRMS..DYMF LMIEFPAIVV DDMYNYAVVY
Q7PMF0 MEKILDRLTF RELEVINEME KRNS..QFLY LMVEFPQVYI HEKL.YSVIH
Q9TXI7 VEDVLDPFAS RRIEMIRAKY KYSSPDRHVF LVLEMAAIRL GPTF.YKVVY

251
O88763 YE....
Q9W1M7 FE....
Q7PMF0 LE....
Q9TXI7 YEDETK

39 With an editor …

40 PipeAlign : protein family analysis tool Plewniak et al, 2003

41 PipeAlign: integrated family and sub-family analysis
INPUT: a single sequence OR a set of unaligned sequences
Pipeline steps: BlastP search; identify motifs; build multiple alignment; refine alignment (correct alignment errors); remove unrelated sequences; validate alignment; cluster sequences
Intermediate results: list of homologs; conservation profile; MACS of user-specified homologs; refined MACS; MACS of validated homologs; validated MACS
Applications: identification of key residues, domain organisation, predictions of cellular location, transmembrane regions, 2D/3D structures, phylogeny studies, etc.

42 MSA Main Applications

43 MACS / MSA: central role in biology
- Structure comparison, modelling
- Interaction networks
- Hierarchical function annotation: homologs, domains, motifs
- Phylogenetic studies
- Human genetics, SNPs
- Therapeutics: drug discovery, drug design
- Gene identification, validation
- RNA sequence, structure, function
- Comparative genomics
[Figure labels: DBD, LBD, insertion domain, binding sites / mutations]

44 MACS: new landscape
High volume & heterogeneity of sequence data:
- Length: from tens of amino acids or nucleotides to thousands or millions (genomes)
- Number: from tens up to thousands of sequences
- Variability: from small percent identity to almost identical
- Complexity of the sequences to be aligned:
  - families with a linear or highly irregular distribution of sequence variability
  - heterogeneity of length, structure or composition (large insertions or extensions, repeats, circular permutations, transmembrane regions…)
- Fidelity: from 15-30% errors (sequence, eucaryotic gene prediction, annotation…)

45 MACS: new concepts
Distinct objectives imply distinct needs & strategies:
- Overview of one sequence family, to quickly infer and integrate information from a limited number of closely related, well annotated sequences (reliable and efficient)
- Exhaustive analysis of one sequence family (very high quality) for:
  - homology modeling
  - phylogenetic studies
  - subfamily-specific features (differentially conserved domains, regions or residues)
- Massive analysis of sets of sequences (reliable/high quality and efficient):
  - phylogenetic distribution, co-presence and co-absence, and structural complexes
  - genome annotation
  - target characterisation for functional genomics studies (transcriptomics…)

46 Residue conservation identification
- Residues conserved in all sequences of the family: structural or functional importance, characteristic motifs
- Residues conserved within a sub-group of sequences: discriminant residues
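Both kinds of conservation can be read directly from the alignment columns. The sketch below uses strict identity as the conservation criterion, which is a simplifying assumption (real tools also consider similar residues); the function names and the toy sequences are hypothetical.

```python
def conserved_columns(rows, gap="-"):
    """Indices of the columns where every sequence has the same residue."""
    return [
        i
        for i, column in enumerate(zip(*rows))
        if gap not in column and len(set(column)) == 1
    ]


def discriminant_columns(group_a, group_b, gap="-"):
    """Columns conserved within each sub-group but different between them."""
    in_a = set(conserved_columns(group_a, gap))
    in_b = set(conserved_columns(group_b, gap))
    return [i for i in sorted(in_a & in_b) if group_a[0][i] != group_b[0][i]]


if __name__ == "__main__":
    group_a = ["MKLV", "MKLI"]  # toy sub-family 1
    group_b = ["MRLV", "MRLL"]  # toy sub-family 2
    print(conserved_columns(group_a + group_b))
    print(discriminant_columns(group_a, group_b))
```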

47 Ordered alignment analysis of TyrRS
[Figure: schematic alignment of Euc, Arc and Bac TyrRS sequences showing Motif I, Motif II, the EMAP domain, the S4 domain and the N-terminal and C-terminal extensions; scale bar: 10 aa]



50 Phylogenetic studies
Multiple alignments are the basis for:
- calculating the levels of similarity between sequences
- calculating the evolutionary distances between sequences
- computing phylogenetic trees
Building a high-quality phylogenetic tree requires high-quality multiple sequence alignments.

51 Phylogenetic studies: whole alignment
[Figure: phylogenetic tree computed from the whole alignment; groups: Archaea, Eucarya, Bacteria + mitochondria]

52 Phylogenetic studies: N terminus, global gap removal
[Figure: phylogenetic tree, scale bar 0.1; groups: Bacteria, Archaea, Mito., Eukarya]

53 Schematic alignment of Aspartyl-tRNA synthetases


55 Protein sequence validation
Sequencing / frameshift error detection. Example: transcription TFIIH complex protein.
Estimation: 44% of predicted proteins from genome sequencing projects and 31% of high-throughput cDNA (HTC) contain errors in their intron/exon structure (Bianchetti et al, 2005).

56 Clustered MACS: Starter
N-terminal region analysis:
- Multiple alignment of complete sequences
- Determination of sequence groups
- Hierarchical clustering of positions based on insertion/deletion
- Definition of blocks
- Reference position
- Proposed N-terminus: potential start codon closest to the reference position
[Figure: schematic alignment of N-terminal sequences (MXXXXXX...) showing the reference position and an N-terminal extension]
Results:
- 3000 proteins from B. subtilis with wrong, randomly generated N-termini: 82% predicted
- For the 3828 proteins of the Vibrio cholerae proteome: 817 specific / 1722 valid start codons / 236 "wrong" (from 1 up to 56 aas)

57 Bianchetti et al. (2005) JBCB Clustered MACS : vAlid

58 Clustered MACS: DbW
DBWatcher [Plewniak, IGBMC]: daily BlastP of the user sequence against the databases (proteins, structures), with automatic daily update.
- Clustering
- Characterization of the specificity of the homologous sequences -> Filter
- Integration of the sub-family members
Automatic update of more than 300 different protein families => 24 AaRS (aminoacyl-tRNA synthetases), nuclear receptors, ribosomal proteins, transcription factors…
Prigent et al. (2005) Bioinformatics

59 Clustered MACS: GOAnno
GOAnno: automatically finds a pertinent level and propagates Gene Ontology terms to an unannotated target protein according to the clustered MACS.
989 target proteins from retinal transcriptome analysis:
- 795 proteins with GO terms (increase of 47%)
- 3085 GO terms (increase of 92%)
[Figure: Gene Ontology hierarchy (levels 0 to 6) for the subfamily of the query: Gene_Ontology, biological_process, physiological processes, cellular process, metabolism, cell communication, signal transduction, nucleobase, nucleoside, nucleotide and nucleic acid metabolism, transcription, regulation of transcription]
F minV(Horiz) = 21 * F p p p minV(Verti) = MaxBranch * P
Chalmel et al. (2005) Bioinformatics

60 Protein 3D structure prediction
Proteins with similar sequences tend to fold into similar structures. Applicable to ~60% of proteins from fully sequenced genomes.
Basic steps for comparative (homology) modelling:
1. Identify a template structure
2. Align the target sequence to the template sequence
3. Copy the backbone coordinates from the template to the matching residues in the target sequence
4. Build the side-chains (copied for identical residues, predicted for non-identical)
5. Model the loop regions
6. Optimise (energy refinement)
- Above 50% identity, a pairwise alignment is enough for an accurate model
- Below 50% identity, a multiple alignment is better

61 Protein functional characterisation
By homology: similar sequences generally share similar structures and often have similar functions.
Propagation of information from a known sequence to an unknown one, e.g. domains, active sites, cellular localisation, post-translational modifications, …
1. Database search for homologues, e.g. BlastP, PSI-Blast
2. Domain databases, e.g. InterPro (EBI), CDD (NCBI)
3. Multiple alignment construction and analysis, e.g. PipeAlign

62 MSA applications: Summary
Applications: functional genomics, evolutionary studies, structure modeling, drug design, mutagenesis experiments.
Information provided: domain organization, structural motifs, key functional residues, ORF definition, localization signals, conservation pattern...
[Figure: schematic alignment of two families (1st family, 2nd family; Bacteria, Archaea, Eucarya) annotated with: additional domain, intra-group conservation, universal conservation, differential conservation between the two families, transmembrane region, NLS, phosphorylation site, error in ORF definition]
Lecompte et al, Gene

63 MACSIMS

64 MAO: Multiple Alignment Ontology
Also available from the OBO web site.
MAO consortium:
- RNA analysis (Steve HOLBROOK, Berkeley)
- MACS algorithm (Kazutake KATOH, Kyoto)
- Protein 3D analysis (Patrice KOEHL, Davis)
- Protein 3D structure (Dino MORAS, Strasbourg)
- 3D RNA structure (Eric WESTHOF, Strasbourg)
Thompson et al. (2005) Nucleic Acids Res.

65 MACSIMS: Multiple Alignment of Complete Sequences Information Management System
- Structural and functional information is mined automatically from the public databases.
- Homologous regions are identified in the MACS.
- Mined data is evaluated and cross-validated.
- Mined data is propagated from known to unknown sequences within the homologous regions.
MACSIMS provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist.
Thompson et al, BMC Bioinformatics 2006

66 MACSIMS

67 MACSIMS: schematic overview of a complete alignment, e.g. domain organisation (InterPro)
[Figure key: SH3, SH2, PI-PLC-X, PI-PLC-Y, PH, C2, CH, rhoGEF, DAG_PE-bind]

68 MACSIMS visualisation JalView II, Coll. G. Barton

69 MACSIMS
BAliBASE reference 3: aldehyde dehydrogenase-like
[Figure: alignment excerpt (GSVPTG, GSTKVG, GETRTG, GSTEVG, GSVSAG, GSRDVG, GSTNVF, GSTAVF) annotated with the NAD binding and active site features from the Uniprot annotation]


72 Summary
Choice of multiple alignment method:
- traditional progressive method (e.g. clustalw / clustalx)
- combined local and global method (e.g. mafft, muscle, dbclustal)
- knowledge-based method (e.g. PipeAlign)
- Web server versus local installation?
WARNING: Automatic alignment methods can make mistakes. Verify alignment quality by automatic methods (e.g. norMD) and visual inspection!
Multiple alignment applications:
- Traditional applications: phylogeny, conserved residue / motif identification
- Information in multiple alignments also improves accuracy in: sequence error detection, structure prediction, functional annotation

73 Laboratory of Integrative Genomics and Bioinformatics, IGBMC, Strasbourg

74 Alternative algorithms: Iterative Refinement
PRRP (Gotoh, 1993) refines an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them.
[Flowchart: initial alignment -> divide sequences into 2 groups (profile 1, profile 2) -> pairwise profile alignment -> refined alignment -> converged? If no, divide again]

75 Alternative algorithms: Genetic Algorithms
SAGA (Notredame, Higgins, 1996) evolves a population of alignments in a quasi-evolutionary manner, iteratively improving the fitness of the population.
[Flowchart: population n -> evaluation of the fitness using an objective function (OF: sum-of-pairs or COFFEE) -> select a number of individuals to be parents -> modify the parents by shuffling gaps, merging 2 alignments, etc. -> population n+1, until END]

76 Alternative algorithms: HMM
- A probabilistic model for sequence profiles, visualized as a finite state machine.
- For each column of the alignment, a match state models the distribution of residues allowed.
- Insert and delete states at each column allow for insertion or deletion of one or more residues.
Original profile HMM (Krogh et al, 1994).
[Figure: profile HMM with match states, insert states, and delete, begin and end states, for the example alignment]
AKY-L-D
--WVLED

77 Multiple Alignment using HMM
HMMER (Eddy, unpublished), SAM-T98 (Hughey, 1996)
[Flowchart labels: generate initial alignment (Baum-Welch expectation maximization); produce a model; generate new alignment (Viterbi algorithm or posterior decoding); evaluate alignment (expectation maximization); END]

78 Alternative algorithms: Segment-to-segment Alignment
Dialign (Morgenstern et al. 1996) compares segments of sequences instead of single residues:
1. construct dot-plots of all possible pairs of sequences
2. find a maximal set of consistent diagonals in all the sequences
Local alignment: residues between the diagonals are not aligned.
[Figure: dot-plot of sequence i against sequence j]
aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq WWNAedsegkr.GMIPVPYVek
nlFVALYDFvasgdntlsitKGEKLRVlgynhnge WCEAqtkngq..GWVPSNYItpvns
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp WWRArdkngqe.GYIPSNYVteaeds
tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg WMYGtvqrtgrtGMLPANYVeai
gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg WWRGdyggkkq.LWFPSNYVeemvnpegihrd
gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..
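The first step, finding diagonals in the dot-plot of two sequences, can be illustrated with a simplified sketch. It only reports ungapped runs of identical residues, whereas Dialign weights segments by their similarity and then selects a consistent set of them; the function name, the minimum length and the toy sequences are assumptions made for the example.

```python
def diagonals(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for maximal runs of identical
    residues along the diagonals of the dot-plot of a against b."""
    found = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # not the start of a run: already part of a longer one
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                found.append((i, j, k))
    return found


if __name__ == "__main__":
    # Toy sequences sharing the segment KGDI at different positions.
    print(diagonals("xxKGDILxx", "yKGDIyy"))
```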

