Multiple sequence alignment MSA

Presentation transcript:

Multiple sequence alignment MSA Edgar RC, Batzoglou S. Multiple sequence alignment. Curr Opin Struct Biol. 2006 Jun;16(3):368-73. PubMed PMID: 16679011. Lecture prepared according to Pevsner, Bioinformatics and Functional Genomics

What is MSA Comparison of many (i.e., >2) sequences; the alignment can be local or global

Why MSA Biological sequences often occur in families. Homologous sequences often retain similar structures and functions. Such families include related genes within an organism, genes in various species, and sequences within a population (polymorphic variants). MSA reveals more biological information than pairwise alignment: two sequences that may not align well to each other can be aligned via their relationship to a third sequence, thereby integrating information in a way not possible using only pairwise alignments. Similar genes are conserved across widely divergent species, often performing a similar or even identical function, and at other times mutating or rearranging to perform an altered function through the forces of natural selection. Thus, many genes are represented in highly conserved forms in organisms. Through simultaneous alignment of the sequences of these genes, sequence patterns that have been subject to alteration may be analyzed. For example, MSA allows the identification of conserved sequence patterns and motifs across the whole sequence family that are not obvious when comparing only two sequences. Many conserved and functionally critical amino acid residues can be identified in a protein multiple alignment. Multiple sequence alignment is also an essential prerequisite to carrying out phylogenetic analysis of sequence families and prediction of protein secondary and tertiary structures. Homologous residues are aligned in columns across the length of the sequences. These aligned residues are homologous in an evolutionary sense: they are presumably derived from a common ancestor. The residues in each column are also presumed to be homologous in a structural sense: aligned residues tend to occupy corresponding positions in the three-dimensional structure of each aligned protein.

Edgar R.S. et al. Peroxiredoxins are conserved markers of circadian rhythms. Nature 2012 485(7399):459-64

LUCA - last universal common ancestor Edgar R.S. et al. Peroxiredoxins are conserved markers of circadian rhythms. Nature 2012 485(7399):459-64

Why MSA Can be a reasonable way to infer gene function Characterize protein families by identifying shared regions of homology (conserved regions called motifs), such as active sites Determine the consensus sequence of several aligned sequences Establish relationships and phylogenies

What is a sequence motif? A short conserved region in DNA, RNA or protein sequence simple combinations of secondary structure elements motif = supersecondary structure in proteins, structure motifs usually consist of just a few elements; e.g., the 'helix-turn-helix' has just three Corresponds to a structural or functional feature in proteins Shared by several sequences, can be generated by MSA

Examples of motifs beta hairpin helix-loop-helix Greek key HLH - Two α-helices (blue) are connected by a short loop (red) beta hairpin - Two antiparallel beta strands connected by a tight turn of a few amino acids between them Greek key - 4 beta strands folded over into a sandwich shape.

What is a protein family? A protein family is a group of evolutionarily-related proteins. Proteins in a family descend from a common ancestor and typically have similar three-dimensional structures, functions, and significant sequence similarity. Members of a protein family may range from very similar to quite diverse. Currently, over 60,000 protein families have been defined, although ambiguity in the definition of protein family leads different researchers to wildly varying numbers

What is a protein family? The use of the term protein family is somewhat context dependent. A common usage is that superfamilies (structural homology) contain families (sequence homology), which contain sub-families. Example: the PA clan superfamily, the largest group of proteases with common ancestry as identified by structural homology, has far lower sequence conservation than one of the families it contains, the C04 family http://en.wikipedia.org/wiki/PA_clan

http://en.wikipedia.org/wiki/PA_clan

PA clan structure the double-beta barrel motif PA clan proteases all share a core motif of two β-barrels with covalent catalysis performed by an acid-histidine-nucleophile catalytic triad motif. Structural homology in the PA superfamily. The double beta-barrel that characterises the superfamily is highlighted in red. Shown are representative structures from several families within the PA superfamily. Note that some proteins show partially modified structure. Chymotrypsin (1gg6), thrombin (1mkx), tobacco etch virus protease (1lvm), calicivirin (1wqs), West Nile virus protease (1fp7), exfoliatin toxin (1exf), HtrA protease (1l1j), snake venom plasminogen activator (1bqy), chloroplast protease (4fln) and equine arteritis virus protease (1mbm).

Pfam - http://pfam.sanger.ac.uk/ Database of protein families that includes their annotations and multiple sequence alignments 

What is a domain Families often share domains. A domain is a part of a protein and is larger than a motif. A domain is formed by several motifs packed together, i.e. domain = tertiary structure. Domain - a conserved part of a given protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain. One domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions

Pyruvate kinase domains Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins. http://en.wikipedia.org/wiki/Protein_domains

Sequence logo Conserved: BIG letters with few others in that space Divergent: small letters with many others in that space
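The letter heights in a logo can be sketched as follows. This is a minimal illustration, assuming the usual Shannon information measure (maximum bits for the alphabet minus the column entropy) and ignoring gaps and small-sample corrections; the function names are hypothetical and not taken from any logo tool.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content in bits of one alignment column (gaps ignored)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

def logo_heights(alignment, alphabet_size=20):
    """Per column: letter -> height, i.e. letter frequency times column information."""
    heights = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        info = column_information(column, alphabet_size)
        counts = Counter(residues)
        heights.append({c: (n / len(residues)) * info for c, n in counts.items()})
    return heights
```

A fully conserved column gets one tall letter; a column with many different residues gets many short letters, which matches the "BIG letters" versus "small letters" reading of the slide.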

Logo and alignment reflect each other

Doing MSA As with aligning a pair of sequences, the difficulty in aligning a group of sequences varies considerably with sequence similarity. If the amount of sequence variation is minimal, it is quite straightforward to align the sequences, even without the assistance of a computer program. If the amount of sequence variation is great, it may be very difficult to find an optimal alignment of the sequences because so many combinations of substitutions, insertions, and deletions, each predicting a different alignment, are possible.

Challenges of the MSA Finding an optimal alignment of more than two sequences that includes matches, mismatches, and gaps, and that takes into account the degree of variation in all of the sequences at the same time, poses a very difficult challenge. A second computational challenge is identifying a reasonable method of obtaining a cumulative score for the substitutions in a column of an MSA. Finally, the placement and scoring of gaps in the various sequences of an MSA presents an additional challenge.

MSA algorithms As with the pairwise sequence comparisons, there are two types of multiple alignment algorithms optimal heuristic

Optimal algorithms Extension of dynamic programming to multiple sequences Exhaustive search Produce best alignment Computationally expensive Not feasible for n>10 sequences of length m>200 residues
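A rough sense of why the exhaustive approach does not scale: extending dynamic programming to n sequences of length m fills an n-dimensional lattice with (m+1)^n cells, and each cell looks back at 2^n - 1 neighbours. The helper names below are illustrative only.

```python
def dp_cells(n_sequences, length):
    """Number of cells in the n-dimensional DP lattice (all sequences of equal length)."""
    return (length + 1) ** n_sequences

def neighbours_per_cell(n_sequences):
    """Predecessor cells examined per lattice cell (every non-empty subset of sequences advances)."""
    return 2 ** n_sequences - 1

for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 200), neighbours_per_cell(n))
```

For two sequences of 200 residues the lattice has 40401 cells; for ten such sequences the count has more than twenty digits, which is why heuristics are used in practice.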

Heuristic algorithms Limit the exhaustive search Attempt to rapidly find a good, but not necessarily optimal alignment Most popular methods: progressive methods (ClustalW) start from the most similar sequences and progressively add new sequences iterative methods (MUSCLE) make initial crude alignment, then revise it

Progressive sequence alignment The most commonly used algorithm; the most commonly used software is ClustalW. Popularized by Feng and Doolittle, and often referred to by these two names. Permits the rapid alignment of even hundreds of sequences. Limitation: the final alignment depends on the order in which sequences are joined. Not guaranteed to provide the most accurate alignments.

ClustalW http://www.clustal.org http://www.ebi.ac.uk/Tools/msa/clustalw2/ EMBOSS – a free open source software analysis package (European Molecular Biology Open Software Suite) - http://emboss.sourceforge.net Program emma is a ClustalW wrapper A variety of EMBOSS servers hosting emma are available, e.g. http://embossgui.sourceforge.net/demo/emma.html ClustalX – a downloadable stand-alone program offering a graphical user interface for editing multiple sequence alignments http://www.clustal.org/clustal2/ - ClustalW paper: http://nar.oxfordjournals.org/content/22/22/4673.abstract

http://www.ebi.ac.uk/Tools/msa/clustalw2/

ClustalW – how does it work? Three stages

1st stage Global alignment (Needleman-Wunsch) is used to create pairwise alignments of every protein pair. Number of pairwise alignments: $n(n-1)/2$. The arrow shows the best score.
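The first stage can be sketched as a score-only Needleman-Wunsch run over every pair. This is a simplified illustration: the match, mismatch and linear gap values are placeholder assumptions, whereas ClustalW itself works with substitution matrices and separate gap opening and extension penalties.

```python
from itertools import combinations

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap penalty (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def all_pairwise_scores(seqs):
    """Score every unordered pair of sequences: n(n-1)/2 alignments for n sequences."""
    return {(i, j): needleman_wunsch_score(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}
```

With four sequences `all_pairwise_scores` returns six entries, in line with n(n-1)/2.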

1st stage The raw similarity scores are shown. However, for the next step the distance matrix is needed, and not the similarity one. Similarity scores must be converted into distances. Won't tell you how, believe me, it is doable.
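One simple way to turn a similarity into a distance, shown here only as an assumption about how such a conversion can look, is to take one minus the fraction of identical residues over the aligned (non-gap) positions of a pairwise alignment. The function names are hypothetical.

```python
def fractional_identity(aligned_a, aligned_b):
    """Fraction of identical residues over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap positions are not compared
        compared += 1
        identical += x == y
    return identical / compared if compared else 0.0

def identity_distance(aligned_a, aligned_b):
    """Distance in [0, 1]: 0 for identical sequences, 1 for no shared residues."""
    return 1.0 - fractional_identity(aligned_a, aligned_b)
```

For example, `identity_distance("AC-GT", "ACAGA")` compares four positions, finds three identical, and gives 0.25.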

2nd stage A guide tree is calculated from the distance matrix. The tree reflects the relatedness of all the proteins to be multiply aligned Newick format

Guide tree Guide trees are not true phylogenetic trees. They are templates used in the third stage of ClustalW to define the order in which sequences are added to a multiple alignment. A guide tree is estimated from a distance matrix of the sequences you are aligning. In contrast, a phylogenetic tree almost always includes a model to account for multiple substitutions that commonly occur at the position of aligned residues.

Construction of guide tree Unweighted Pair Group Method with Arithmetic Mean (UPGMA) A simple hierarchical clustering method (UPGMA = hierarchical clustering, average linkage). How does it work? http://www.southampton.ac.uk/~re1u06/teaching/upgma/ (Dr. Richard Edwards - UPGMA Worked Example) Neighbor joining Uses a distance method; a distance matrix is the input. The algorithm starts with a completely unresolved tree (its topology is a star network) and iterates until the tree is completely resolved and all branch lengths are known.
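UPGMA can be sketched in a few lines: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. The sketch below returns only the topology as nested tuples and does not compute branch lengths; the input layout is an assumption made for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-linkage clustering; returns the tree topology as nested tuples.

    names: list of leaf labels
    dist:  dict mapping (a, b) tuples to distances, one entry per unordered pair
    """
    d = {}
    for (a, b), value in dist.items():
        d[a, b] = d[b, a] = value          # make lookups symmetric
    size = {name: 1 for name in names}     # cluster -> number of leaves in it
    while len(size) > 1:
        a, b = min(combinations(size, 2), key=lambda pair: d[pair])
        merged = (a, b)
        for other in size:
            if other not in (a, b):
                avg = (size[a] * d[a, other] + size[b] * d[b, other]) / (size[a] + size[b])
                d[merged, other] = d[other, merged] = avg
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))
```

With distances A-B = 2, C-D = 4 and 10 for every other pair, the result is `(("A", "B"), ("C", "D"))`: the closest pair is joined first, as a guide tree requires.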

From http://en.wikipedia.org/wiki/Neighbor_joining: Starting with a star tree (A), the Q matrix is calculated and used to choose a pair of nodes for joining, in this case f and g. These are joined to a newly created node, u, as shown in (B). The part of the tree shown as dotted lines is now fixed and will not be changed in subsequent joining steps. The distances from node u to the nodes a-e are computed from the formula given in the text. This process is then repeated, using a matrix of just the distances between the nodes, a,b,c,d,e, and u, and a Q matrix derived from it. In this case u and e are joined to the newly created v, as shown in (C). Two more iterations lead first to (D), and then to (E), at which point the algorithm is done, as the tree is fully resolved. Source: wikipedia
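The Q matrix step can be sketched directly from the standard neighbor-joining definition, Q(i,j) = (n-2)·d(i,j) - Σ_k d(i,k) - Σ_k d(j,k); the pair with the lowest Q value is joined. The five-taxon distance matrix below is illustrative data, and the function names are hypothetical.

```python
def q_matrix(d):
    """Neighbor-joining Q matrix for a symmetric distance matrix given as a list of lists."""
    n = len(d)
    totals = [sum(row) for row in d]
    return [[0 if i == j else (n - 2) * d[i][j] - totals[i] - totals[j]
             for j in range(n)] for i in range(n)]

def pair_to_join(d):
    """Indices (i, j), i < j, of the pair with the smallest Q value."""
    q = q_matrix(d)
    n = len(d)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda ij: q[ij[0]][ij[1]])

distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(pair_to_join(distances))  # (0, 1)
```

Note that the joined pair (0, 1) is not the pair with the smallest raw distance (that is 3 and 4); Q corrects each distance for how far the two nodes are from everything else.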

3rd stage The multiple sequence alignment is created in a series of steps based on the order presented in the guide tree. First select the two most closely related sequences from the guide tree and create a pairwise alignment.

3rd stage

3rd stage The next sequence is either added to the pairwise alignment (to generate an aligned group of three sequences, sometimes called a profile) or used in another pairwise alignment. At some point, profiles are aligned with profiles. The alignment continues progressively until the root of the tree is reached, and all sequences have been aligned.
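A profile can be represented, in its simplest form, as the per-column symbol frequencies of an aligned group. The sketch below is only an illustration of that idea; real programs add sequence weights and substitution-matrix scoring on top, and the function name is hypothetical.

```python
from collections import Counter

def build_profile(aligned):
    """Per-column symbol frequencies (gaps included) for a list of aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({symbol: n / len(column) for symbol, n in counts.items()})
    return profile
```

Aligning a new sequence (or another profile) against such a profile then means scoring each candidate column against these frequencies instead of against a single residue.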

Gaps “once a gap, always a gap” rule The most closely related pair of sequences is aligned first. As further sequences are added to the alignment, there are many ways that gaps could be included. Gaps are often added to the first two (closest) sequences. To change the initial gap choices later on would be to give more weight to distantly related sequences. To maintain the initial gap choices is to trust that those gaps are most believable.

Iterative approaches Progressive alignment methods have the inherent limitation that once an error occurs in the alignment process it cannot be corrected, and iterative approaches can overcome this limitation. Create an initial alignment and then modify it to try to improve it. e.g. MUSCLE, IterAlign, Praline, MAFFT

MUSCLE MUSCLE stands for multiple sequence comparison by log expectation. Since its introduction in 2004, the MUSCLE program of Robert Edgar has become popular because of its accuracy and its exceptional speed, especially for multiple sequence alignments involving a large number of sequences. Three stages

MUSCLE 1. Draft alignment 2. Improved alignment 3. Refinement Edgar, R. C. Nucl. Acids Res. 2004 32:1792-1797; doi:10.1093/nar/gkh340

1st stage A draft progressive alignment is generated. Determine pairwise similarity through k-mer counting (not by alignment). Compute distance (triangular distance) matrix. Construct tree using UPGMA. Construct draft progressive alignment following tree.
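The k-mer counting step can be sketched as follows: count the k-mers of each sequence, sum the counts they share, and normalise by the number of k-mers in the shorter sequence. This is a simplified illustration written for this transcript; the choice k=3 and the exact normalisation are assumptions, and the actual program also uses compressed amino acid alphabets.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=3):
    """1 minus the fraction of k-mers shared by two unaligned sequences."""
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        return 1.0  # sequences too short to contain any k-mer
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / shortest
```

No alignment is computed at any point, which is what makes this stage fast for large numbers of sequences.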

2nd stage Improve the progressive alignment. Compute pairwise identity through current MSA using the fractional identity. Construct new tree using Kimura distance matrix. In a comparison of two sequences there is some likelihood that multiple amino acid (or nucleotide) substitutions occurred at any given position, and the Kimura distance matrix provides a model for such changes. Compare new and old trees: if improved, repeat this step, if not improved, then we’re done.
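The Kimura correction for proteins is commonly written as d = -ln(1 - D - D²/5), where D is the fraction of differing residues (1 minus fractional identity). A minimal sketch, assuming this closed form; it is undefined for very divergent pairs (D above roughly 0.85), where programs typically fall back to other estimates.

```python
import math

def kimura_distance(fractional_identity):
    """Kimura-corrected protein distance from the fractional identity of two aligned sequences."""
    d = 1.0 - fractional_identity          # observed fraction of differences
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0:
        raise ValueError("sequences too divergent for the Kimura formula")
    return -math.log(arg)
```

For 50% identity the corrected distance is about 0.80, larger than the observed 0.5, reflecting substitutions hidden by multiple changes at the same position.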

3rd stage Refinement of the MSA Systematically partition the tree to obtain subsets; an edge (branch) of the tree is deleted to create a bipartition. Extract a pair of profiles (multiple sequence alignments), and realign them Accept/reject the new alignment based on the sum-of-pairs score increase/decrease. All edges of the tree are systematically visited and deleted to create bipartitions. This iterative refinement step is rapid and had been shown earlier to increase the accuracy of the multiple sequence alignment.
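The sum-of-pairs score used for the accept/reject decision can be sketched as below: every column contributes the sum of the pairwise scores of all residue pairs in it. The match, mismatch and gap values are placeholder assumptions; a real scorer would use a substitution matrix and affine gap penalties.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as a list of equal-length strings."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # gap against gap is not scored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total

def accept_realignment(old, new):
    """Keep the new alignment only if its sum-of-pairs score is higher."""
    return sum_of_pairs(new) > sum_of_pairs(old)
```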

MUSCLE online https://www.ebi.ac.uk/Tools/msa/muscle/

>neuroglobin 1OJ6A NP_067080.1 [Homo sapiens]
-------------MERPEPELIRQSWRAVSRSPLEHGTVLFARLFALEPDLLPLFQYNCR
QFSSPEDCLSSPEFLDHIRKVMLVI---DAAVTNVEDLSSLEEYLASLGRKHRAVGVKLS
SFSTVGESLLYMLEKCLGPA-FTPATRAAWSQLYGAVVQAMSRGWDGE----
>rice_globin 1D8U rice Non-Symbiotic Plant Hemoglobin NP_001049476.1 [Oryza sativa (japonica cultivar-group)]
MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMFSF-LR
NSDVP--LEKNPKLKTHAMSVFVMTCEAAAQLRKAGKVTVRDTTLKRLGATHLKYGVGDA
HFEVVKFALLDTIKEEVPADMWSPAMKSAWSEAYDHLVAAIKQEMKPAE---
>soybean_globin 1FSL leghemoglobin P02238 LGBA_SOYBN [Glycine max]
----------MVAFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSF-LA
NGVDP----TNPKLTGHAEKLFALVRDSAGQLKASGTVVAD----AALGSVHAQKAVTDP
QFVVVKEALLKTIKAAVGDK-WSDELSRAWEVAYDELAAAIKKA--------
>beta_globin 2hhbB NP_000509.1 [Homo sapiens]
----------MVHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYPWTQRFFES-FG
DLSTPDAVMGNPKVKAHGKKVLGAF---SDGLAHLDNLKGTFATLSELHCDKLH--VDPE
NFRLLGNVLVCVLAHHFGKE-FTPPVQAAYQKVVAGVANALAHKYH------
>myoglobin 2MM1 NP_005359.1 [Homo sapiens]
-----------MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDK-FK
HLKSEDEMKASEDLKKHGATVLTAL---GGILKKKGHHEAEIKPLAQSHATKHK--IPVK
YLEFISECIIQVLQSKHPGD-FGADAQGAMNKALELFRKDMASNYKELGFQG

MUSCLE vs. Clustal Q = fraction of correctly aligned residues (pairwise) TC = fraction of correctly aligned columns
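Both measures compare a test alignment against a trusted reference alignment of the same sequences. The sketch below follows the definitions on the slide; the representation (lists of gapped strings in the same sequence order) and the decision to count only reference columns with at least two residues for TC are assumptions of this example.

```python
from itertools import combinations

def residue_columns(alignment):
    """For each column, a tuple of residue indices (None for a gap), one per sequence."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        entry = []
        for s, ch in enumerate(column):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        columns.append(tuple(entry))
    return columns

def aligned_pairs(columns):
    """Set of (seq, residue, seq, residue) pairs that share a column."""
    pairs = set()
    for col in columns:
        for (s, i), (t, j) in combinations(enumerate(col), 2):
            if i is not None and j is not None:
                pairs.add((s, i, t, j))
    return pairs

def q_score(test, reference):
    """Fraction of reference residue pairs that are also aligned in the test alignment."""
    ref = aligned_pairs(residue_columns(reference))
    tst = aligned_pairs(residue_columns(test))
    return len(ref & tst) / len(ref) if ref else 0.0

def tc_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    ref_cols = [c for c in residue_columns(reference)
                if sum(x is not None for x in c) >= 2]
    test_cols = set(residue_columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols) if ref_cols else 0.0
```

TC is the stricter of the two: one misplaced residue spoils a whole column, while Q only loses the pairs that involve that residue.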

Logo visualization of the alignment Make a logo from your alignment Can be easier to compare Nice graphic Students love ‘em http://weblogo.berkeley.edu/logo.cgi