Multiple Sequence Alignments. Multiple Alignments Generating multiple alignments  Web servers Analyzing a multiple alignment  what makes a ‘good’ multiple.

Presentation transcript:

Multiple Sequence Alignments

Multiple Alignments
- Generating multiple alignments
  - Web servers
- Analyzing a multiple alignment
  - What makes a ‘good’ multiple alignment?
  - What can it tell us, why is it useful?
- Adjusting a multiple alignment
  - Alignment editors and HowTo
  - Demonstration and practice

What is a Multiple Alignment?
- A comparison of sequences
  - “multiple sequence alignment”
- A comparison of equivalents:
  - Structurally equivalent positions
  - Functionally equivalent residues
  - Secondary structure elements
  - Hydrophobic regions, polar residues

Generating multiple alignments
- Pairwise sequence alignment is easy when the sequences are sufficiently closely related.
- Below a certain level of identity, sequence alignment may become uncertain:
  - the twilight zone for amino acid sequences lies at about 30% identity.
- In or below the twilight zone it is good to make use of additional information, e.g. from evolution.
- A multiple alignment of diverse sequences is more informative than a pairwise alignment:
  - residues conserved over a longer period of time are under stronger evolutionary constraints.

Multiple Sequence Alignment Algorithms
- In practice, multiple sequence alignment relies on heuristic methods:
  - With dynamic programming, computational time quickly explodes as the number of sequences increases.
- Different methods/algorithms:
  - Segment-based (DiAlign, …).
  - Iterative (HMMs, DiAlign, PRRP, …).
  - Progressive (ClustalW, T-Coffee, MUSCLE, …).
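To make the scaling concrete, here is a minimal sketch that counts the cells an exact dynamic-programming alignment would have to fill. It assumes N sequences of equal length L and a full (L+1)^N lattice; the function name `dp_cells` is illustrative and does not come from any alignment package.

```python
def dp_cells(seq_length: int, n_sequences: int) -> int:
    """Cells in the full dynamic-programming lattice for n sequences of equal length."""
    return (seq_length + 1) ** n_sequences


if __name__ == "__main__":
    # Assumed example length of 300 residues per sequence.
    for n in (2, 3, 5, 10):
        print(f"{n:>2} sequences: {dp_cells(300, n):.3e} cells")
```

Even before counting the work done per cell, the lattice grows exponentially with the number of sequences, which is why the programs listed above fall back on heuristics.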

Progressive Alignment
- Step 1: Calculate all pairwise alignments and calculate distances for all pairs of sequences.
- Step 2: Construct a guide tree, joining the most similar sequences, using Neighbour Joining.

Example distance matrix (Step 1), from which the guide tree (Step 2) is built:

     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8
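As a rough illustration of Step 2, the sketch below applies the standard Neighbour Joining selection criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), to the example distances for sequences A to F. It only picks the first pair to join; a full guide-tree builder would repeat this after merging the pair. The helper names are assumptions made for this example.

```python
from itertools import combinations

# Example distances for sequences A-F (lower triangle of a symmetric matrix).
NAMES = ["A", "B", "C", "D", "E", "F"]
LOWER = {
    "B": [2],
    "C": [4, 4],
    "D": [6, 6, 6],
    "E": [6, 6, 6, 4],
    "F": [8, 8, 8, 8, 8],
}


def full_matrix(names, lower):
    """Expand a lower-triangular table into a symmetric dict-of-dicts."""
    dist = {a: {b: 0.0 for b in names} for a in names}
    for row_name, values in lower.items():
        for col_name, value in zip(names, values):
            dist[row_name][col_name] = dist[col_name][row_name] = float(value)
    return dist


def nj_pair_to_join(dist):
    """Return the pair of sequences that minimises the Neighbour Joining Q criterion."""
    names = list(dist)
    n = len(names)
    row_sum = {a: sum(dist[a].values()) for a in names}

    def q(pair):
        a, b = pair
        return (n - 2) * dist[a][b] - row_sum[a] - row_sum[b]

    # min() keeps the first of several equally good pairs.
    return min(combinations(names, 2), key=q)


if __name__ == "__main__":
    print(nj_pair_to_join(full_matrix(NAMES, LOWER)))
```

On this matrix the pairs (A, B) and (D, E) tie on Q, and the sketch returns the first one it meets, (A, B).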

Progressive Alignment
- Step 3: From the tree, assign a weight to each sequence:
  - We want to down-weight nearly identical sequences and up-weight the most divergent ones.
- Step 4: Align sequences, starting at the leaves of the guide tree:
  - Pairwise comparisons as well as comparison of a single sequence with a group of sequences (profile).
- Caveat: errors introduced early cannot be corrected by subsequent information.

Web servers ClustalW: T-Coffee: MUSCLE: DiAlign: and more at

ClustalW features
- Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.
- Reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than in regular secondary structure.
- Insertions and deletions are more common in loop regions than in the core of the protein!

T-Coffee features
- More accurate than ClustalW
- Instead of amino acid substitution matrices, uses consistency in a library of pairwise alignments
- Vertices represent positions in protein sequences.
- Edges represent pairwise alignments between protein sequences.
- If residues i and j have many common neighbours, their consistency is high.
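The common-neighbour idea can be sketched as a small graph computation. This is a simplified illustration, not T-Coffee's actual weighting scheme: residues are hypothetical (sequence name, position) tuples, the library is a plain list of residue pairs taken from pairwise alignments, and consistency is just the count of shared neighbours.

```python
from collections import defaultdict


def build_neighbours(library):
    """Map each residue to the set of residues it is paired with in the library."""
    neighbours = defaultdict(set)
    for x, y in library:
        neighbours[x].add(y)
        neighbours[y].add(x)
    return neighbours


def consistency(x, y, neighbours):
    """Number of residues aligned to both x and y (their common neighbours)."""
    return len(neighbours[x] & neighbours[y])


# Toy library: each entry is one aligned residue pair from some pairwise alignment.
library = [
    (("seqA", 5), ("seqB", 7)),
    (("seqA", 5), ("seqC", 4)),
    (("seqB", 7), ("seqC", 4)),
    (("seqA", 5), ("seqD", 9)),
    (("seqB", 7), ("seqD", 9)),
    (("seqA", 6), ("seqB", 8)),
]
neighbours = build_neighbours(library)

if __name__ == "__main__":
    print(consistency(("seqA", 5), ("seqB", 7), neighbours))
    print(consistency(("seqA", 6), ("seqB", 8), neighbours))
```

In the toy library, the pair (seqA 5, seqB 7) is supported by two other sequences, while (seqA 6, seqB 8) has no independent support.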

MUSCLE Fast implementation Sometimes more accurate than ClustalW or T-Coffee

Example
Let’s build a multiple alignment for the following sequences:

>query
MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVW
VHTELGYFSGEAVPSNGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRV
TDVIITHAHADRIGGMKTLKERGIKAHSTALTAELAKKNGYEEPLGDLQSVTNLK
FGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNVADAYV
NEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK
>gi|
MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAY
FYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFR
EVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQ
VGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWL
VCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVE
CVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE
>gi|115023|sp|P10425|
MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVW
VHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTD
VIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTK
VETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIE
NMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK
>gi|115030|sp|P25910|
MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGM
IVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQR
KGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIV
VWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGH
GDYGGTELIEHTKQIVNQYIESTSKP
>gi|282554|pir||S25844
MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWV
DKLAAGPGRTVVNTHFHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWP
RVDWGEIELRPPNVTFRDRLTLHVGERQVELICVGPAHTDHDVVVWLPEERVLFAGD
VVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGPEVIDANRDYLRWV
QRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAM
EIFAELVAYNGGQLPTCLA

ClustalW at EBI
- Many options:
  - CPU mode,
  - full/fast alignment,
  - window length in fast mode,
  - …
  - gap penalties.

ClustalW at EBI
- Automatic display of:
  - Score table
  - Alignment (optional colouring)
  - Guide tree
- Link to Jalview alignment editor!

A note on the example
- It is atypical:
  - It uses only three sequences.
  - One should use more in order to extract reliable information.
- It illustrates a common mistake:
  - It uses sequences that are too closely related.
  - One should use sequences that are as divergent and diverse as possible in order to extract relevant information.

A Good Multiple Alignment?
- Difficult to define…
- Good ones look pretty!
  - Aligned secondary structures
  - Strongly conserved residues / regions
  - Comparison with known structure helps
- Bad ones look chaotic and random.

A Good Multiple Alignment?
[Figure: example alignment with conservation, quality and consensus tracks]

Multiple Alignment Features

Barton (1993)
- “The position of insertions and deletions suggests regions where surface loops exist…
- Conserved glycine or proline suggests a β-turn…
- Residues with hydrophobic properties conserved at i, i+2, i+4 (etc) separated by unconserved or hydrophilic residues suggests a surface β-strand…
- A short run of hydrophobic amino acids (4 or 5 residues) suggests a buried β-strand…
- Pairs of conserved hydrophobic amino acids separated by pairs of unconserved or hydrophilic residues suggests an α-helix with one face packed in the protein core. Similarly, an i, i+3, i+4, i+7 pattern of conserved residues.”

Multiple Alignment Features

Cysteine is a rare amino acid, and is often used in disulphide bonds ( pairs of conserved cysteines ) Charged residues ( histidine, aspartate, glutamate, lysine, arginine ) and other polar residues embedded in a conserved region indicate functional importance

Quality Assessment
- Bad residues
  - Large distance from column consensus
- Bad columns
  - Average distance from consensus is high (“entropy”)
- Bad regions
  - Profile scores
- Bad quality doesn’t always mean badly aligned!

Example sequences:
RINAIEVMAKLIQ
LIMIILVEIVLAM
PERMKIDQGQNMW
DLVTWDYAASLDF
DNPGGACRTTLID
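One common way to put a number on a "bad column" is Shannon entropy over the residues in that column. The sketch below is a minimal version under two assumptions: each example string is one aligned row, and gaps (if any) are counted like any other symbol. It is not the quality score that any particular editor computes.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residue distribution in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropies(rows):
    """Entropy of every column; rows must be aligned strings of equal length."""
    return [column_entropy("".join(col)) for col in zip(*rows)]


# Example strings, assumed here to be aligned rows.
ROWS = [
    "RINAIEVMAKLIQ",
    "LIMIILVEIVLAM",
    "PERMKIDQGQNMW",
    "DLVTWDYAASLDF",
    "DNPGGACRTTLID",
]

if __name__ == "__main__":
    for index, value in enumerate(alignment_entropies(ROWS)):
        print(index, round(value, 2))
```

A fully conserved column scores 0 bits and the score rises as the column becomes more mixed, so high-entropy columns are candidates for closer inspection rather than proof of misalignment.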

Quality Assessment
- Profiles
  - A profile holds scores for each residue type (plus gaps) over every column of a multiple alignment
  - Concepts: consensus sequence, amino acid similarity
  - Some multiple alignment programs use profiles to build or add to an alignment
  - Any alignment, or even one sequence, can be a profile (one sequence isn’t a very good one…)
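A minimal sketch of a profile, assuming plain per-column residue frequencies stand in for the scores (real profiles usually combine counts with a substitution matrix and gap terms). The function names and the toy alignment are illustrative only.

```python
from collections import Counter


def build_profile(rows):
    """Per-column symbol frequencies; the gap symbol '-' is counted like a residue."""
    profile = []
    for col in zip(*rows):
        counts = Counter(col)
        profile.append({res: n / len(col) for res, n in counts.items()})
    return profile


def consensus(profile):
    """Most frequent symbol per column; ties are resolved alphabetically."""
    return "".join(min(col, key=lambda r: (-col[r], r)) for col in profile)


def score_sequence(seq, profile):
    """Sum, column by column, of the profile frequency of each residue in seq."""
    return sum(col.get(res, 0.0) for res, col in zip(seq, profile))


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "GCE"]  # hypothetical three-row alignment
    toy_profile = build_profile(toy_alignment)
    print(consensus(toy_profile))
    print(score_sequence("ACE", toy_profile))
```

A single sequence also yields a profile here, but every column then has one residue at frequency 1, which is why one sequence does not make a very good profile.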

What can we do with a multiple alignment?
- Identify subgroups (phylogeny)
  - Intra-group sequence conservation
  - Evolutionary relatedness (view tree)
- Identify motifs (functionality)
  - Evolutionary signals
  - Highly conserved residues indicate functional or structural significance!
- Widen search for related proteins
  - A multiple alignment is better than a single sequence
  - Consensus sequence / profile useful

Example motif alignment:
RPDDWHLHLR
GGIDTHVHFI
GFTLTHEHIC
PFVEPHIHLD
PKVELHVHLD

What do we want to do?
- Build a homology model?
  - Accuracy
- Perform phylogenetic analysis?
  - Completeness
- Functional analysis of a protein family?
  - Diversity

Building the initial alignment
- Fetch related sequences and run alignment
  - Clustal, DiAlign, T-Coffee, MUSCLE …
- Fetch a multiple alignment from a database and add sequences of interest
  - Pfam, ProDom, ADDA …
- Start from a motif-finding procedure
  - MEME, Pratt, Gibbs Sampler …

Adjusting the alignment
1. Filter alignment:
   - Remove any redundancy
   - Remove unrelated sequences
   - Remove unwanted domains
   - Recalculate alignment if necessary
2. Look for conserved motifs, adjust any misalignments. Try different colour schemes and thresholds.
3. One step at a time…
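The redundancy-removal part of the filtering step can be sketched as a greedy pass over the aligned rows. Everything here is an assumption made for illustration: the 90% identity threshold, the choice to ignore columns where either row has a gap, and the function names. Alignment editors offer their own redundancy tools with their own definitions.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def remove_redundant(alignment: dict, threshold: float = 0.9) -> dict:
    """Greedily keep a row only if it is below `threshold` identity to every kept row."""
    kept = {}
    for name, row in alignment.items():
        if all(identity(row, other) < threshold for other in kept.values()):
            kept[name] = row
    return kept


if __name__ == "__main__":
    toy = {  # hypothetical aligned rows
        "s1": "MKTAYIAK",
        "s2": "MKTAYIAK",
        "s3": "MRT-YLAE",
    }
    print(list(remove_redundant(toy)))
```

Because the pass is greedy, the result depends on row order; after dropping rows it is usually worth recalculating the alignment, as the list above suggests.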

Jalview Alignment Editor Clamp, M., Cuff, J., Searle, S. M. and Barton, G. J. (2004), "The Jalview Java Alignment Editor", Bioinformatics, 20,

Colouring your alignment
- Hydrophobic / polar: hydrophobic vs polar
- Buried index: buried vs surface
- β-strand likelihood: probable vs unlikely
- Helix likelihood: probable vs unlikely

Colouring your alignment By conservation thresholds:

Colouring your alignment Conservation index Amino Acid Property Classification Schema, eg: Livingstone & Barton 1993

Sequence Features

Check PDB Structures
- Load the multiple alignment (MA) with sequence(s) for a known PDB structure
  - View >> Feature Settings >> Fetch DAS Features (wait...)
  - OR
  - Right-click >> Associate Structure with Sequence >> Discover PDB ids (quicker)
- Right-click sequence name >> View PDB Entry
- Structure opens in a new window; residues acquire MA colours
- Highlight residues by hovering mouse over alignment or structure
- Label residues by clicking on structure

Compare Alignment to Structure

Crucial way of checking alignment!
- Where are gaps / insertions / deletions?
  - In secondary structures: bad
  - In surface loops: okay
- Where are our key / functional residues?
  - Are they in the probable active site?
  - Check they are clustered
  - Check they are accessible, not buried
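The gap check can also be done programmatically once a secondary-structure string is available for the reference protein. The sketch below assumes DSSP-style letters (H for helix, E for strand) with one letter per ungapped reference residue; the function name and inputs are hypothetical. It only flags deletions relative to the reference, so insertions that fall between two helix or strand residues would need a separate check.

```python
def gaps_in_secondary_structure(reference_row: str, other_row: str, ss: str):
    """Return alignment columns where other_row has a gap inside a helix or strand.

    reference_row: aligned sequence of the protein with known structure
    other_row:     another aligned row from the same alignment
    ss:            one secondary-structure letter per ungapped reference residue
    """
    flagged = []
    ss_index = 0
    for column, (ref_res, other_res) in enumerate(zip(reference_row, other_row)):
        if ref_res == "-":
            continue  # gap in the reference: no structure letter belongs to this column
        if other_res == "-" and ss[ss_index] in "HE":
            flagged.append(column)
        ss_index += 1
    return flagged


if __name__ == "__main__":
    reference = "MKV-LAG"   # hypothetical aligned reference row
    structure = "CHHHHC"    # hypothetical structure string for MKVLAG
    print(gaps_in_secondary_structure(reference, "M-VALAG", structure))
    print(gaps_in_secondary_structure(reference, "MKV-LA-", structure))
```

In the toy data the first row has a deletion inside the helix and is flagged, while the second row's gap falls in a coil position and is not.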

Demonstration and Practice
1. Start Jalview (click here).
2. Tools >> Preferences >> Visual: select Maximise Window, unselect Quality, set Font Size to 8 or 9, Colour >> Clustal, uncheck Open File; Editing: check Pad Gaps When Editing.
3. File >> Input Alignment >> from URL (use this one).
4. Get used to the controls: selecting and deselecting sequences/groups (drag mouse), dragging sequences/groups (use shift/ctrl), selecting sequence regions, hiding sequences/groups, removing columns and regions… Then explore menus and tools.
5. Now load this alignment. I’ve messed up a good alignment, and now I’d like you to correct it! There are two groups of sequences and one single sequence to adjust.

Demonstration and Practice
6. View >> Feature Settings >> DAS Settings
   - select Uniprot, dssp, cath, Pfam, PDBsum_ligands, PDBsum_DNAbinding, then click ‘Save as default’
   - click Fetch DAS Features (then click yes at prompt)...
   - Move mouse over alignment and read information about features
   - Move mouse over sequence names to check for PDB ids
7. Open a PDB structure (choose any).
8. View >> uncheck Show All Chains, then use the up-arrow key to increase structure size.
9. Hover mouse over structure (see how residues are highlighted in the sequence), then do the same for the sequence. Select residues in the structure by clicking them; a label will appear. Click again to remove the label.
10. Check the position of insertions & deletions using this method.