Multiple sequence alignment Why?


1 Multiple sequence alignment Why?
It is the most important means of assessing the relatedness of a set of sequences
Gain information about the structure/function of a query sequence (conservation patterns)
Construct a phylogenetic tree
Assemble a set of sequenced fragments (fragment assembly)
Recognise alternative splice sites
Many bioinformatics methods depend on it (e.g. secondary/tertiary structure prediction)

2 Multiple sequence alignment (MSA) of 12 * Flavodoxin + cheY

3 Pairwise alignment
Now we know how to do it. How do we get a multiple alignment (three or more sequences)?
Multiple alignment: a much greater combinatorial explosion than with pairwise alignment.

4 Multi-dimensional dynamic programming (Murata et al. 1985)

5 Simultaneous multiple alignment
Multi-dimensional dynamic programming
MSA (Lipman et al., 1989, PNAS 86, 4412): extremely slow and memory intensive; handles up to 8-9 sequences of ~250 residues
DCA (Stoye et al., 1997, CABIOS 13, 625): still very slow
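To see why simultaneous alignment is so slow and memory intensive, it helps to count the cells of the dynamic-programming lattice. The sketch below assumes N sequences of equal length L and a naive implementation that fills the whole lattice; the function names are illustrative and not taken from any of the programs cited above.

```python
def lattice_cells(num_seqs: int, length: int) -> int:
    """Number of cells in the full N-dimensional dynamic-programming lattice."""
    return (length + 1) ** num_seqs


def moves_per_cell(num_seqs: int) -> int:
    """Predecessor cells of an interior cell: every non-empty subset of
    sequences may advance by one residue, giving 2**N - 1 moves."""
    return 2 ** num_seqs - 1


# Growth of the search space for 2, 3 and 8 sequences of 250 residues.
for n in (2, 3, 8):
    print(n, lattice_cells(n, 250), moves_per_cell(n))
```

Both factors grow exponentially with the number of sequences, which is why practical methods either prune the lattice or fall back on heuristics.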

6 Alternative multiple alignment methods
Biopat (Hogeweg & Hesper, 1984; first method ever)
MULTAL (Taylor, 1987)
DIALIGN (Morgenstern, 1996)
PRRP (Gotoh, 1996)
Clustal (Thompson, Higgins & Gibson, 1994)
Praline (Heringa, 1999)
T-Coffee (Notredame, Higgins & Heringa, 2000)
HMMER (Eddy, 1998) [hidden Markov model]
SAGA (Notredame & Higgins, 1996) [genetic algorithm]

7 Progressive multiple alignment general principles
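[Diagram: all sequence pairs (1-2, 1-3, ..., 4-5) are aligned and scored; the scores fill a 5×5 similarity matrix; scores are converted to distances; the distances give a guide tree; the guide tree directs the multiple alignment, with iteration possibilities]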
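The step from pairwise scores to a guide tree can be sketched as follows. This is a minimal illustration that assumes similarities in the range 0 to 1, the conversion distance = 1 - similarity, and average-linkage clustering; the slide does not say which conversion or tree-building method is meant, and the sequence names and scores are made up.

```python
from itertools import combinations


def guide_tree(names, similarity):
    """Build a guide tree (nested tuples) by average-linkage clustering.

    similarity maps a pair (a, b) of sequence names to a score in [0, 1].
    """
    # Scores to distances (assumed conversion: d = 1 - s).
    dist = {frozenset(pair): 1.0 - s for pair, s in similarity.items()}
    # Each cluster is (subtree, list of member sequences).
    clusters = [(name, [name]) for name in names]

    def average_distance(a, b):
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Join the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((tree_i, tree_j), members_i + members_j))
    return clusters[0][0]


scores = {("s1", "s2"): 0.9, ("s1", "s3"): 0.4, ("s2", "s3"): 0.5}
print(guide_tree(["s1", "s2", "s3"], scores))
```

The nested tuples give the order of alignment: the innermost pair is aligned first, and each enclosing level adds a sequence or a block.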

8 General progressive multiple alignment technique (follow generated tree)
[Diagram: the alignment grows along the guide tree: 1 3 → 1 3 and 2 5 → 1 3 2 5 → (root) 1 3 2 5 4]

9 Progressive multiple alignment
Problem: accuracy is very important, because errors are propagated into the subsequent progressive steps
“Once a gap, always a gap” (Feng & Doolittle, 1987)

10 Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 1995)

11 Multiple alignment profiles Gribskov et al. 1987
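[Figure: a profile with one row per residue type (C, D, ..., W, Y) and one column per alignment position (example values 0.3, 0.1), plus position-dependent gap penalties (example values 1.0, 0.5)]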
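A profile can be sketched in its simplest form as the residue frequencies of each alignment column. This is a simplification made for illustration: Gribskov-style profiles additionally fold in an exchange matrix and position-dependent gap penalties, which the sketch leaves out. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def build_profile(alignment):
    """Return one dict per alignment column mapping residue -> relative frequency.

    Gaps ('-') are counted like any other symbol in this simplified version.
    """
    num_seqs = len(alignment)
    return [
        {residue: count / num_seqs for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


profile = build_profile(["ACD", "ACE", "A-D"])
for position, column in enumerate(profile):
    print(position, column)
```

A column with a high gap frequency is a natural place to lower the gap penalty, which is one way to read the position-dependent gap penalties on the slide.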

12 Profile-sequence alignment
ACD……VWY

13 Profile-profile alignment
C D . Y profile ACD……VWY
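One common way to score a profile column against another profile column is the frequency-weighted average of the exchange values of all residue pairs. The sketch below assumes frequency profiles (residue -> frequency per column) and a toy match/mismatch scheme in place of a real exchange matrix; both are assumptions made for illustration. A single sequence is handled as a profile in which each column holds one residue with frequency 1, so the same function covers the profile-sequence case.

```python
def profile_column_score(col_a, col_b, match=1.0, mismatch=-1.0):
    """Expected exchange score between two profile columns.

    col_a and col_b map residue -> frequency. The match/mismatch values
    stand in for a real exchange matrix such as BLOSUM62.
    """
    return sum(
        freq_a * freq_b * (match if res_a == res_b else mismatch)
        for res_a, freq_a in col_a.items()
        for res_b, freq_b in col_b.items()
    )


# Profile-sequence case: residue 'A' of a sequence against a mixed column.
print(profile_column_score({"A": 1.0}, {"A": 0.5, "C": 0.5}))
# Profile-profile case: two mixed columns.
print(profile_column_score({"A": 0.5, "C": 0.5}, {"A": 0.5, "C": 0.5}))
```

These column scores take the place of the residue-residue scores in the usual dynamic-programming recursion.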

14 Clustal, ClustalW, ClustalX
CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct the guide tree. Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree. Further carefully crafted heuristics include: (i) local gap penalties, (ii) automatic selection of the amino acid substitution matrix, (iii) automatic gap penalty adjustment, and (iv) a mechanism to delay the alignment of sequences that appear to be distant at the time they are considered. CLUSTAL W/X does not allow iteration, unlike the methods of Hogeweg and Hesper (1984), Corpet (1988), Gotoh (1996) and Heringa (1999, 2002).
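Weighting sequences by branch length can be sketched as follows. The scheme here is an assumption made for illustration: each branch length is shared equally among the leaves below it, and a sequence's weight is the sum of its shares from root to leaf. This down-weights sequences with close relatives, but the exact CLUSTAL W formula and normalisation may differ. The tree encoding and branch lengths are hypothetical.

```python
def sequence_weights(tree):
    """Compute a weight per leaf from a rooted tree with branch lengths.

    A leaf is a sequence name (str); an internal node is a list of
    (child, branch_length) pairs.
    """
    weights = {}

    def leaves(node):
        if isinstance(node, str):
            return [node]
        return [leaf for child, _ in node for leaf in leaves(child)]

    def walk(node, accumulated):
        if isinstance(node, str):
            weights[node] = accumulated
            return
        for child, length in node:
            # Share this branch equally among the leaves below it.
            walk(child, accumulated + length / len(leaves(child)))

    walk(tree, 0.0)
    return weights


toy_tree = [("s3", 0.4), ([("s1", 0.1), ("s2", 0.2)], 0.3)]
print(sequence_weights(toy_tree))
```

In this toy tree s3 has no close relative and keeps its full branch length, while s1 and s2 split the branch they share.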

15 Strategies for multiple sequence alignment
Profile pre-processing
Secondary structure-induced alignment
Globalised local alignment
Matrix extension
Objective: try to avoid (early) errors

16 Pre-profile generation
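[Diagram: pairwise scores (1-2, 1-3, ..., 4-5) are compared with a cut-off; for each sequence a pre-alignment is built with that sequence on top (1 with 2 3 4 5; 2 with 1 3 4 5; ...; 5 with 1 2 3 4) and converted into a pre-profile (rows A, C, D, ..., Y)]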
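The selection step of pre-profile generation can be sketched as follows: for each sequence, collect the other sequences whose pairwise score with it reaches the cut-off. The sketch covers only this selection; building the pre-alignment and the pre-profile from the selected sequences is omitted. The function name, the score scale and the cut-off value are hypothetical.

```python
def preprofile_members(names, scores, cutoff):
    """For each sequence, list the sequences that enter its pre-alignment.

    scores maps frozenset({a, b}) -> pairwise alignment score. The master
    sequence comes first, followed by every sequence scoring >= cutoff.
    """
    return {
        master: [master]
        + [
            other
            for other in names
            if other != master and scores[frozenset((master, other))] >= cutoff
        ]
        for master in names
    }


toy_scores = {
    frozenset(("1", "2")): 0.8,
    frozenset(("1", "3")): 0.3,
    frozenset(("2", "3")): 0.6,
}
print(preprofile_members(["1", "2", "3"], toy_scores, cutoff=0.5))
```

A low cut-off puts almost every sequence into every pre-profile; a high cut-off leaves each pre-profile close to its master sequence.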

17 Pre-profile alignment
[Diagram: the five pre-profiles (1 to 5, each with rows A, C, D, ..., Y) are aligned to give the final alignment of sequences 1 2 3 4 5]

18 Pre-profile alignment
[Diagram: each pre-alignment (1 with 2 3 4 5; 2 with 1 3 4 5; 3 with 1 2 4 5; 4 with 1 2 3 5; 5 with 1 2 3 4) contributes to the final alignment of sequences 1 to 5]

19 Strategies for multiple sequence alignment
Profile pre-processing
Secondary structure-induced alignment
Globalised local alignment
Matrix extension
Objective: try to avoid (early) errors

20 Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)

21 One of the dogmas of molecular biology
“Structure is more conserved than sequence”

22 Secondary structure-induced alignment

23 Using secondary structure for alignment
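Dynamic programming search matrix for the sequence MDAGSTVILCFV with secondary structure HHHCCCEEEEEE.
Amino acid exchange weight matrices: a helix-specific matrix where both residues are H, a coil-specific matrix where both are C, a strand-specific matrix where both are E, and the default matrix otherwise.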
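The choice of exchange matrix per cell of the search matrix can be sketched as below. The sketch assumes three secondary-structure states (H, E, C) and returns only the name of the matrix to use; the matrix names are placeholders, and the actual exchange values would come from structure-specific matrices that are not given on the slide.

```python
def matrix_for_pair(ss_a: str, ss_b: str) -> str:
    """Name of the exchange matrix for one cell of the search matrix."""
    specific = {"H": "helix", "E": "strand", "C": "coil"}
    # Use a structure-specific matrix only when both residues share the state.
    if ss_a == ss_b and ss_a in specific:
        return specific[ss_a]
    return "default"


def matrix_grid(ss_row: str, ss_col: str):
    """Matrix choice for every cell, given two secondary-structure strings."""
    return [[matrix_for_pair(a, b) for b in ss_col] for a in ss_row]


# Secondary structure of the slide's example against a short toy string.
for row in matrix_grid("HHHCCCEEEEEE", "HCE"):
    print(row)
```

Each cell then takes its exchange weight from the selected matrix, and dynamic programming proceeds as usual.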

24 Flavodoxin-cheY Using predicted secondary structure
1fx PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeee
FLAV_DESVH MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeee
FLAV_DESGI MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeee
FLAV_DESSA MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeee
FLAV_DESDE MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee
2fcr K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeee
FLAV_ANASP SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL QSDWEGLYSE-LDDVDFNGKLVAYf
eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeee
FLAV_ECOLI AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA QCDWDDFFPT-LEEIDFNGKLVALf
eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeee
FLAV_AZOVI AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeee
FLAV_ENTAG MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee
4fxn MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeee
FLAV_MEGEL M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeee
FLAV_CLOAB M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI SWEMKKWIDE-SSEFNLEGKLGAAf
eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee
3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM DGLELLKTIRADGAMSALPVLMV
tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee

1fx GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD GLRIDGD--PRAARDDIVGWAHDVRGAI
eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhh
FLAV_DESVH GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD GLRIDGD--PRAARDDIVGwAHDVRGAI
eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhh
FLAV_DESGI GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS SLKIDGE--P--DSAEVLDwAREVLARV
eee hhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_DESSA GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD SLKIDGD--P--ERDEIVSwGSGIADKI
hhhhhhhhhhhh eeeee e eee
FLAV_DESDE ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE GLKMEGD--ASNDPEAVASfAEDVLKQL
e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh
2fcr GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht
FLAV_ANASP GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhh
FLAV_ECOLI GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhh
FLAV_AZOVI GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
e hhhhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_ENTAG GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh
4fxn G-----SYGWGDGKWMRDFEERMNGYGCVVVET PLIVQNE--PDEAEQDCIEFGKKIANI
e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhht
FLAV_MEGEL G-----SYGWGSGEWMDAWKQRTEDTgATVIGT AIVNEM--PDNAPE-CKElGEAAAKA
hhhhhhhhhhh eeeee eeee h hhhhhhhh
FLAV_CLOAB STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF--
hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h
3chy TAEAKKENIIAAAQAGASGY VVK----P-FTAATLEEKLNKIFEKLGM------
ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht

25 Strategies for multiple sequence alignment
Profile pre-processing
Secondary structure-induced alignment
Globalised local alignment
Matrix extension
Objective: try to avoid (early) errors

26 Globalised local alignment
Double dynamic programming:
1. Local (SW) alignment (using M and Po, Pe)
2. Global (NW) alignment (no M or Po, Pe)

27 M = BLOSUM62, Po= 0, Pe= 0

28 M = BLOSUM62, Po= 12, Pe= 1

29 M = BLOSUM62, Po= 60, Pe= 5

30 Strategies for multiple sequence alignment
Profile pre-processing
Secondary structure-induced alignment
Globalised local alignment
Matrix extension
Objective: try to avoid (early) errors

31 Matrix extension T-Coffee
Tree-based Consistency Objective Function For alignmEnt Evaluation
Cedric Notredame, Des Higgins, Jaap Heringa. J. Mol. Biol., 302, 2000

32 Matrix extension – T COFFEE
[Diagram: all pairwise alignments of four sequences: 2-1, 3-1, 4-1, 3-2, 4-2, 4-3]

33 Integrating alignment methods and alignment information with T-Coffee
Integrating different pair-wise alignment techniques (NW, SW, ...)
Combining different multiple alignment methods (consensus multiple alignment)
Combining sequence alignment methods with structural alignment techniques
Plugging in user knowledge

34 T-Coffee: using different sources of alignment information
[Diagram: Clustal, Dialign, Lalign, structure alignments and manual alignments all feed into T-Coffee]

35 Search matrix extension

36 T-Coffee
Combine different alignment techniques by adding scores:
$W(A(x), B(y)) = \sum S(A(x), B(y))$
where $A(x)$ is residue $x$ in sequence $A$, the summation is over the scores $S$ of the global and local alignments containing the residue pair $(A(x), B(y))$, and $S$ is the sequence identity percentage of the associated alignment.
Combine the direct alignment seqA-seqB with each indirect alignment seqA-seqI-seqB:
$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\big(W(A(x), I(z)), W(I(z), B(y))\big)$
where the summation is over all third sequences $I$ other than $A$ or $B$.

37 T-Coffee Other sequences Direct alignment

38 Search matrix extension

39 Evaluating multiple alignments
Conflicting standards of truth: evolution, structure, function
With orphan sequences there is no additional information
Benchmarks depend on reference alignments
Quality issues with the available reference alignment databases
Different ways to quantify agreement with a reference alignment (sum-of-pairs, column score)
“Charlie Chaplin” problem

40 Evaluating multiple alignments
A reference alignment based on structural superposition is often taken as the standard of truth

41 Evaluation measures Query Reference Column score Sum-of-Pairs score
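The two measures can be sketched as follows, assuming the common definitions: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query alignment, and the column score is the fraction of reference columns reproduced exactly in the query. Benchmark tools differ in details such as restricting the count to core regions or handling of gapped columns, so this is an illustration rather than a reference implementation.

```python
def columns(alignment):
    """Turn an alignment into columns of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    result = []
    for column in zip(*alignment):
        entry = []
        for seq_index, symbol in enumerate(column):
            if symbol == "-":
                entry.append(None)
            else:
                entry.append(counters[seq_index])
                counters[seq_index] += 1
        result.append(tuple(entry))
    return result


def aligned_pairs(alignment):
    """Set of (seq_i, residue_i, seq_j, residue_j) aligned in some column."""
    pairs = set()
    for col in columns(alignment):
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                if col[i] is not None and col[j] is not None:
                    pairs.add((i, col[i], j, col[j]))
    return pairs


def sum_of_pairs_score(query, reference):
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(query)) / len(ref_pairs)


def column_score(query, reference):
    query_cols = set(columns(query))
    ref_cols = columns(reference)
    return sum(col in query_cols for col in ref_cols) / len(ref_cols)


reference = ["AC-D", "ACED"]
query = ["ACD-", "ACED"]
print(sum_of_pairs_score(query, reference), column_score(query, reference))
```

The column score is the stricter of the two: one misplaced residue invalidates a whole column, whereas the sum-of-pairs score still credits the pairs that remain correct.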

42 Evaluating multiple alignments
[Plot: SP score for BAliBASE alignments, plotted against alignment size (nseq * len)]

43 Summary
Weighting schemes simulating simultaneous multiple alignment: profile pre-processing (global/local); matrix extension (well balanced scheme)
Smoothing alignment signals: globalised local alignment
Using additional information: secondary structure driven alignment
Schemes strike a balance between speed and sensitivity

44 References
Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem. 23.
Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302.
Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem. 26(5).

45 Where to find this…. http://www.ibivu.cs.vu.nl/teaching

