Biology 4900 Biocomputing.


Chapter 6: Multiple Sequence Alignments

Relationships between biological sequences
Biological sequences tend to occur in families. These may be related genes within an organism (paralogs) or between species (orthologs), presumably derived from a common ancestor. Nucleotide sequences of coding regions are typically less well conserved than the proteins they encode, because of the degeneracy of the genetic code, and are therefore more difficult to align. Sequences evolve faster than structures, but homologous sequences tend to retain similar structure and function (e.g., rat vs. human CaM).

Multiple sequence alignments
Homology can be observed through multiple sequence alignments (MSAs). An MSA consists of 3 or more protein (or nucleic acid) sequences that are partially or completely aligned. Homologous residues are aligned in columns across the length of the sequences.

Multiple sequence alignments
MSAs are powerful because they can reveal relationships between 2 sequences that can only be observed through their relationships with a third sequence.
Seq 1 and Seq 2 alone:
Seq 1  AVGYDFGEKMLSGADDW
Seq 2  LVGERADLTGAEIDE
With Seq 3 added:
Seq 1  AVGYDFGEKMLSGA--DDW
Seq 3  LVGYDRADK-LTGAE-DD-
Seq 2  LVG-ERAD--LTGAEIDE-

How are MSAs determined?
MSAs can be determined based on: the presence of highly conserved residues such as cysteine; conserved motifs and domains; conserved features of protein secondary structure; and regions showing consistent patterns of insertions or deletions.
[Figure: C-terminal domain of CaM (from 3cln.pdb), showing conserved 2° structure (α-helices).]

Why use MSAs?
If the protein (or gene) you are studying is part of a larger group, you may be able to gain insight into the structure, function and evolution of the sequence. MSAs are more sensitive than pairwise alignments at detecting homologs. MSAs can reveal conserved residues, motifs and domains. They are useful for generating phylogenetic trees. Regulatory regions of many genes contain conserved consensus sequences.

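Benchmarking
Q: How good is an MSA? A: Compare the sequence alignment against known structure alignments (reference alignments), measured by an objective scoring system such as the sum-of-pairs score (SPS). For an alignment with M columns, let $S_i$ be the sum of the scores for all pairs of residues $A_{i1}, A_{i2}, \ldots$ in column $i$. Then
$\mathrm{SPS} = \dfrac{\sum_{i=1}^{M} S_i}{\sum_{i} S_{r,i}}$
where the numerator is the sum of scores over all of your aligned columns and the denominator is the sum of the reference scores $S_{r,i}$.
[Figure: example alignment grid with sequences as rows and M columns.]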
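The sum-of-pairs idea can be sketched in a few lines. This is a minimal illustration, assuming the common convention that a pair of residues scores 1 if the test alignment places them in the same column as the reference does, and 0 otherwise; the function names and the toy alignments are hypothetical, not taken from the slides.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row index, position in the ungapped sequence).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for row, char in enumerate(column):
            if char != "-":
                residues.append((row, counters[row]))
                counters[row] += 1
        # Every pair of residues in this column counts as one aligned pair.
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    reference_pairs = aligned_pairs(reference)
    test_pairs = aligned_pairs(test)
    return len(test_pairs & reference_pairs) / len(reference_pairs)


# Toy example (hypothetical sequences): one residue is misplaced in the test.
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(sum_of_pairs_score(test, reference))  # 0.75
```

A score of 1.0 means every residue pair in the reference is also aligned in the test alignment.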

Five MSA Approaches
(1) Exact methods; (2) progressive alignment (e.g., ClustalW); (3) iterative approaches (e.g., PRALINE, IterAlign, MUSCLE); (4) consistency-based methods (e.g., MAFFT, ProbCons); (5) structure-based methods (e.g., Expresso).
[Slide annotation: "Our Focus" marks a subset of these approaches.]

Exact Methods
Exact methods, like Needleman and Wunsch dynamic programming, generate optimal alignments but are not feasible for alignments of many sequences. The computational time (T = number of steps) for this approach is described in Big O notation as $O(2^N L^N)$, where N is the number of sequences and L is the average sequence length.

Progressive Sequence Alignment (Feng-Doolittle)
How it works: (1) calculates pairwise sequence alignment scores between all proteins (or nucleic acid sequences); (2) aligns the 2 closest sequences using a guide tree; (3) progressively aligns more sequences to the first 2.
Advantages: permits rapid alignment of 100s of sequences.
Disadvantages: may not provide the most accurate alignment, depending on how the alignment is started.
Complexity: ClustalW $O(N^4 + L^2)$; MUSCLE $O(N^4 + NL^2)$.

What these numbers mean…
Approximate number of steps for N = 10 sequences with L = 100 residues (average):
Needleman & Wunsch: too large to be practical (2^10 × 100^10 ≈ 10^23)
ClustalW: 20,000
MUSCLE: 110,000
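These step counts can be reproduced with simple arithmetic. The sketch below uses the exact-method cost 2^N L^N given earlier, and assumes N^4 + L^2 for ClustalW and N^4 + N L^2 for MUSCLE. Those two formulas are an assumption: they are not printed in the transcript, but they do reproduce the 20,000 and 110,000 figures above. Function names are hypothetical.

```python
def exact_steps(n, length):
    """Steps for exact dynamic programming: 2^N * L^N."""
    return (2 ** n) * (length ** n)


def clustalw_steps(n, length):
    """Assumed ClustalW cost: N^4 + L^2."""
    return n ** 4 + length ** 2


def muscle_steps(n, length):
    """Assumed MUSCLE cost: N^4 + N * L^2."""
    return n ** 4 + n * length ** 2


n, length = 10, 100
print(f"Exact:    {exact_steps(n, length):.3e}")   # 1.024e+23
print(f"ClustalW: {clustalw_steps(n, length):,}")  # 20,000
print(f"MUSCLE:   {muscle_steps(n, length):,}")    # 110,000
```

The exact method is about eighteen orders of magnitude more expensive than either heuristic for this modest input, which is why it is not used for many sequences.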

Progressive MSA stage 1 of 3: generate global pairwise alignments
For n sequences, the number of pairwise alignments is (n-1)(n) / 2. For 5 sequences, (4)(5) / 2 = 10 alignments.
*First find the two sequences that produce the highest (best) score.
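Stage 1 can be sketched as follows: score every pair with a global (Needleman-Wunsch) alignment and pick the best-scoring pair. The scoring values (match +1, mismatch -1, linear gap -2) and the toy sequences are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_pairwise_scores(seqs):
    """Score all n(n-1)/2 pairs; keys are (i, j) index pairs with i < j."""
    return {
        (i, j): nw_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


def closest_pair(seqs):
    """Indices of the pair with the highest alignment score."""
    scores = all_pairwise_scores(seqs)
    return max(scores, key=scores.get)


# Hypothetical toy sequences.
seqs = ["ACGT", "ACGA", "TTTT", "ACCT", "AGT"]
print(len(all_pairwise_scores(seqs)))  # 10 alignments for 5 sequences
print(closest_pair(seqs[:3]))          # (0, 1)
```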

Tree Views of alignments
Alignments may be evaluated by either similarity or distance measures. A tree shows the distance between objects.
[Figure: trees for closely related sequences and for distantly related sequences.]

How to read tree views of alignments Closely-related Sequences

5 closely related globins

Feng-Doolittle stage 2: guide tree
Convert similarity scores to distance scores. Build the tree with the unweighted pair group method with arithmetic averages, UPGMA (defined in Chapter 7). ClustalW output is shown below. Use JalView in ClustalW to display the tree view.
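A guide tree can be built from a distance matrix with UPGMA by repeatedly merging the two closest clusters and averaging their distances to everything else, weighted by cluster size. The sketch below is a minimal version under stated assumptions: the distances are hypothetical, the tree is returned as nested tuples without branch lengths, and the similarity-to-distance conversion is left out because the slides do not specify it.

```python
def upgma(names, dist):
    """Build a UPGMA tree as nested tuples.

    names: list of leaf labels.
    dist:  dist[a][b] is the symmetric distance between labels a and b.
    """
    size = {name: 1 for name in names}
    d = {a: {b: dist[a][b] for b in names if b != a} for a in names}
    while len(d) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            ((x, y) for x in d for y in d[x]),
            key=lambda pair: d[pair[0]][pair[1]],
        )
        merged = (a, b)
        # Size-weighted average distance from the merged cluster to the others.
        new_row = {}
        for c in d:
            if c in (a, b):
                continue
            new_row[c] = (d[a][c] * size[a] + d[b][c] * size[b]) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        for c in (a, b):
            del d[c]
        for c in d:
            del d[c][a]
            del d[c][b]
            d[c][merged] = new_row[c]
        d[merged] = new_row
    return next(iter(d))


# Hypothetical distances: A and B are closest, D is the outlier.
names = ["A", "B", "C", "D"]
dist = {
    "A": {"B": 2, "C": 6, "D": 10},
    "B": {"A": 2, "C": 6, "D": 10},
    "C": {"A": 6, "B": 6, "D": 10},
    "D": {"A": 10, "B": 10, "C": 10},
}
print(upgma(names, dist))  # ('D', ('C', ('A', 'B')))
```

Reading the nested tuples from the inside out gives the order in which sequences would be added during progressive alignment: A and B first, then C, then D.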

Feng-Doolittle stage 3: progressive alignment
Build the MSA based on the order in the guide tree. Start with the two most closely related sequences, then add the next closest sequence, and continue until all sequences are added to the MSA. Follows the rule: "once a gap, always a gap."
[Figure label: 2 closest alignments.]

Why “once a gap, always a gap”?
There are many possible ways to make an MSA, and where gaps are added is a critical question. Gaps are often added to the first two (closest) sequences. To change the initial gap choices later on would be to give more weight to distantly related sequences. To maintain the initial gap choices is to trust that those gaps are most believable. Insertions receive higher penalties than deletions, and are propagated throughout the alignment.
[Figure note: note placement of M and A at end of gap.]
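The rule can be seen in a small sketch of one progressive step: a new sequence is aligned against the columns of an existing aligned group, existing columns (including their gaps) are never altered, and any new gap is inserted as a whole column in every group member. The scoring scheme (average match +1 / mismatch -1 against a column, linear gap -2) and the toy sequences are assumptions for illustration only; this is not ClustalW's actual profile scoring.

```python
def column_score(column, residue, match=1.0, mismatch=-1.0):
    """Average score of one residue against the non-gap residues of a column."""
    letters = [c for c in column if c != "-"]
    if not letters:
        return 0.0
    return sum(match if c == residue else mismatch for c in letters) / len(letters)


def add_sequence(group, seq, gap=-2.0):
    """Align seq to an aligned group without changing the group's columns."""
    cols = list(zip(*group))
    n, m = len(cols), len(seq)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols[i - 1], seq[j - 1]),
                score[i - 1][j] + gap,  # gap in the new sequence
                score[i][j - 1] + gap,  # new gap column in the whole group
            )
    # Trace back from the bottom-right corner.
    out_cols, out_seq = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + column_score(cols[i - 1], seq[j - 1])
        ):
            out_cols.append(cols[i - 1])
            out_seq.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_cols.append(cols[i - 1])
            out_seq.append("-")
            i -= 1
        else:
            out_cols.append(("-",) * len(group))  # gap added to every member
            out_seq.append(seq[j - 1])
            j -= 1
    out_cols.reverse()
    out_seq.reverse()
    rows = ["".join(col[r] for col in out_cols) for r in range(len(group))]
    return rows + ["".join(out_seq)]


# Hypothetical example: the new sequence forces a gap into both group members.
print(add_sequence(["ACGT", "ACGT"], "ACGGT"))  # ['AC-GT', 'AC-GT', 'ACGGT']
```

Because the group is only ever extended with whole gap columns, a gap placed in an early, closely related pair survives every later step.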

Partial ClustalW Output for CD2 Protein: The Big Picture

ClustalW Output for CD2 Protein
Color coding indicates amino acid property class. Symbols under each column:
* indicates 100% conserved over the entire alignment
: conservative mutations
. less conservative mutations
[blank] gap or least conserved mutations
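The symbol line can be approximated with a short sketch. The "strong" and "weak" residue groups below are the ones commonly documented for Clustal protein alignments, reproduced from memory, so treat them as an assumption and check them against the Clustal documentation before relying on them. The toy alignment is hypothetical.

```python
# Assumed Clustal conservation groups (verify against the Clustal documentation).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = [
    "CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
    "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY",
]


def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"  # identical in every sequence
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # conservative substitutions
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # less conservative substitutions
    return " "


def consensus_line(alignment):
    """Build the symbol line printed under a block of aligned sequences."""
    return "".join(column_symbol(column) for column in zip(*alignment))


# Hypothetical toy alignment.
alignment = ["ASCW-", "ATSWA", "AASGA"]
for row in alignment:
    print(row)
print(consensus_line(alignment))
```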

Alignment Size
Can be used to build a phylogenetic tree.
[Figure labels: Medium, Medium, Small.]

Clustal W alignment of 5 distantly related globins

Clustal W alignment of 5 closely related globins * asterisks indicate identity in a column

Additional features of ClustalW improve its ability to generate accurate MSAs
Individual weights are assigned to sequences: very closely related sequences are given less weight, while distantly related sequences are given more weight.
Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.:
PAM20: 80-100% identity
PAM60: 60-80% identity
PAM120: 40-60% identity
PAM350: 0-40% identity
Residue-specific gap penalties are applied.
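The matrix selection rule can be written as a simple lookup. This sketch follows the identity ranges listed above; how the shared boundaries (80%, 60%, 40%) are assigned is an assumption, since the ranges on the slide overlap at their endpoints. Function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def choose_pam_matrix(identity):
    """Pick a PAM matrix name from percent identity (boundaries assumed)."""
    if not 0 <= identity <= 100:
        raise ValueError("identity must be between 0 and 100")
    if identity >= 80:
        return "PAM20"
    if identity >= 60:
        return "PAM60"
    if identity >= 40:
        return "PAM120"
    return "PAM350"


identity = percent_identity("ACGT", "ACGA")
print(identity, choose_pam_matrix(identity))  # 75.0 PAM60
```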

In-Class Assignment
Multiple sequence alignments using ClustalW. Example of MSA using ClustalW with two data sets: (1) five distantly related globins (human to plant); (2) five closely related beta globins. Obtain your sequences in FASTA format! You can save them in Notepad or another text editor.

MSA: Iterative Methods
Compute a sub-optimal solution and keep modifying it intelligently, using dynamic programming or other methods, until the solution converges. Unlike progressive methods, iterative methods can dynamically correct alignment errors.
Examples:
MUSCLE: Multiple Sequence Comparison by Log-Expectation (Edgar, 2004)
IterAlign: (Karlin and Brocchieri, 1998)
PRALINE: PRofile ALIgNmEnt (Heringa, 1999; Simossis and Heringa, 2005)
MAFFT: Multiple Alignment using Fast Fourier Transform (Katoh et al., 2005)

Iterative approaches: MAFFT
Available at http://mafft.cbrc.jp/alignment/software/
Uses the Fast Fourier Transform to speed up profile alignment. Uses a fast two-stage method for building alignments using k-mer (matching 6-tuples) frequencies. Offers many different scoring and aligning techniques. One of the more accurate programs available. Available as a standalone program or through a web interface. Many output formats, including interactive phylogenetic trees.
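Counting shared 6-tuples gives a fast, alignment-free similarity estimate between two sequences. The sketch below is a simplified illustration of that idea, not MAFFT's actual implementation; the function names and toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k=6):
    """Number of k-mers the two sequences have in common (with multiplicity)."""
    common = kmer_counts(a, k) & kmer_counts(b, k)  # per-k-mer minimum count
    return sum(common.values())


# Hypothetical toy sequences that differ only near the end.
print(shared_kmers("ACDEFGHIK", "ACDEFGHLM"))  # 2
```

Because no dynamic programming is involved, this kind of score can be computed for all sequence pairs quickly and used to build a first, rough guide tree.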

Iterative approaches: MUSCLE
Available at http://www.ebi.ac.uk/Tools/msa/muscle/
3-stage approach:
Stage 1: builds an initial alignment based on similarities of paired alignments; calculates a distance matrix and generates a rooted tree.
Stage 2: improves the tree by recalculating similarities.
Stage 3: rescores pairs at branches.

MSA: Consistency-based algorithms
Use a database of both local high-scoring alignments and long-range global alignments to create a final alignment. Incorporate evidence from multiple sequences to guide pairwise alignment: if x is related to y, and y is related to z, then x should be related to z. Fast and accurate.
Examples: T-COFFEE, Prrp, DiAlign, ProbCons
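The consistency idea can be sketched as a library extension in the style of T-Coffee: the weight for aligning residue x with residue y is the direct pairwise weight plus, for every intermediate residue z aligned to both, the smaller of the two supporting weights. The data layout, function name and weights below are hypothetical and chosen only to illustrate the principle.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(pairs):
    """Add transitive support to pairwise residue weights.

    pairs: dict mapping (x, y) to a weight, where x and y are residue ids
           of the form (sequence name, position).
    Returns a dict keyed by sorted (x, y) tuples.
    """
    neighbours = defaultdict(dict)
    for (x, y), weight in pairs.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    extended = defaultdict(float)
    # Direct evidence from the pairwise alignments.
    for (x, y), weight in pairs.items():
        extended[tuple(sorted((x, y)))] += weight
    # Indirect evidence: x and y are both aligned to the same residue z.
    for z, linked in neighbours.items():
        for x, y in combinations(sorted(linked), 2):
            if x[0] != y[0]:  # only pair residues from different sequences
                extended[(x, y)] += min(linked[x], linked[y])
    return dict(extended)


# Hypothetical weights: A1 and B1 are never aligned directly,
# but both are aligned to C1.
library = {
    (("A", 1), ("C", 1)): 80,
    (("C", 1), ("B", 1)): 60,
}
extended = extend_library(library)
print(extended[(("A", 1), ("B", 1))])  # 60.0
```

The pair A1/B1 gains a weight of 60 purely from its shared relationship with C1, which is the "x to y, y to z, therefore x to z" reasoning described above.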

Which methods are best?
Depends on: the number of sequences to align; what you are trying to do; level of user expertise; personal preference.
Other considerations: Does the method use benchmarking of multiple structures? Do you want to evaluate 3D protein structures (e.g., try Expresso at http://www.tcoffee.org)?
You might want to: try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers), then compare the results.

Example: 5 alignments of 5 globins
Let’s look at a multiple sequence alignment (MSA) of five globin proteins. We’ll use five prominent MSA programs: ClustalW, Praline, MUSCLE (used at HomoloGene), ProbCons, and TCoffee. Each program offers unique strengths. We’ll focus on a histidine (H) residue that has a critical role in binding oxygen in globins and should be aligned. But often it’s not aligned, and all five programs give different answers. Our conclusion will be that there is no single best approach to MSA.

ClustalW Results
Note how the region of a conserved histidine (▼) varies depending on which of five prominent algorithms is used.

Praline Results

Muscle Results

ProbCons Results

Tcoffee Results

[Figure: side-by-side results for ClustalW, Praline, Muscle and ProbCons.]

See Thompson et al. (1994) for an explanation of the three stages of progressive alignment implemented in ClustalW