Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.

Multiple sequence alignment methods 2 Overview
What a multiple alignment means
Scoring a multiple alignment
Break
Multidimensional dynamic programming
Progressive alignment methods

Multiple sequence alignment methods 3 What a multiple alignment means
Homologous residues are aligned in columns
– Structurally homologous: similar 3D structural positions
– Evolutionarily homologous: diverging from a common ancestral residue

Multiple sequence alignment methods 4 Multiple alignment - issues
Identifying unambiguously homologous positions is not possible
A need to identify which alignment is best
Protein structures and sequences evolve
– Sequences not entirely superposable

Multiple sequence alignment methods 5 Multiple alignment - issues
In principle there is always an unambiguously correct evolutionary alignment
– Common ancestral sequence
Virtually impossible to infer the evolutionary history
Usually easier to construct a structural alignment

Multiple sequence alignment methods 6 Multiple alignment - issues
Sequence diverges even faster than structure
– Structurally unalignable protein parts cannot be aligned by sequence either
Some parts are very well alignable
– Use these parts to align whatever can be aligned
Disregard the rest to assess alignment quality
– Supposedly meaningless biases are omitted

Multiple sequence alignment methods 7 Scoring an alignment
Some positions are more conserved than others
– Position-specific scoring
Sequences are not independent
– Related to each other by a phylogenetic tree
Specify a complete probabilistic model of molecular sequence evolution

Multiple sequence alignment methods 8 Complete probabilistic model
Probabilities of all evolutionary events
Prior probability of root ancestral sequence
Probabilities of evolutionary change depend on evolutionary time
Position-specific structural and functional constraints
We just don’t have all the necessary data

Multiple sequence alignment methods 9 Workable approximations
Assume that all columns are statistically independent:
$S(m) = G + \sum_i S(m_i)$
where $S(m)$ is the score for multiple alignment $m$, $G$ is the gap score/penalty, and $S(m_i)$ is the score for column $i$ in the multiple alignment $m$.

Multiple sequence alignment methods 10 Scoring an alignment Notations

Multiple sequence alignment methods 11 Minimum entropy: further simplification
We already assumed independence between columns
Complex statistical dependence between sequences (within columns) if their phylogenetic tree has many intermediate ancestors
We assume independence between and within columns

Multiple sequence alignment methods 12 Minimum entropy
Probability of column $m_i$
Score of column $m_i$ can be defined as the negative logarithm of that probability
A regularized probability estimate as used in chapter 5
An entropy measure directly related to the Shannon entropy (chapter 11)
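
As a worked illustration of the minimum-entropy column score, the sketch below assumes the plain maximum-likelihood estimate $p_{ia} = c_{ia} / \sum_{a'} c_{ia'}$, where $c_{ia}$ counts symbol $a$ in column $i$, and computes $S(m_i) = -\sum_a c_{ia} \log_2 p_{ia}$. The function name `column_entropy_score`, the base-2 logarithm, and the treatment of every character (gaps included) as an ordinary symbol are assumptions made for this example. A regularized estimate, as the slide mentions, would replace the line that computes `p`.

```python
import math
from collections import Counter


def column_entropy_score(column):
    """Minimum-entropy score of one alignment column.

    column: iterable of symbols, one per sequence.
    Uses the unregularized estimate p_ia = c_ia / (number of sequences).
    """
    counts = Counter(column)          # c_ia for every symbol a seen in the column
    total = sum(counts.values())      # number of sequences
    score = 0.0
    for c in counts.values():
        p = c / total                 # maximum-likelihood estimate of p_ia
        score -= c * math.log2(p)     # accumulate -c_ia * log2(p_ia)
    return score
```

A fully conserved column such as `"LLLL"` scores 0, and the score grows as the column becomes more mixed, so lower scores correspond to better-conserved columns.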

Multiple sequence alignment methods 13 Example (1)

Multiple sequence alignment methods 14 Example (2)

Multiple sequence alignment methods 15 Example (3) Will this ever be 0 in reality? Why (not)?

Multiple sequence alignment methods 16 Example (4)

Multiple sequence alignment methods 17 Minimum entropy
Very near to the HMM formulation
Choose the sequences carefully
Usually the sample of sequences is biased
Weighting schemes as discussed in chapter 5 are necessary
This partially compensates for the defects of the assumption of sequence independence

Multiple sequence alignment methods 18 Sum of pairs
Also assumes statistical independence between columns
Uses substitution matrices
For simple linear gap costs, $s(a,-)$, $s(-,a)$ and $s(-,-)$ are defined, with $s(-,-) = 0$
Scores $s(a,b)$ come from substitution matrices like PAM or BLOSUM

Multiple sequence alignment methods 19 Sum of pairs
Substitution scores are usually log-odds scores for pairwise comparisons
– SP score for three sequences: $\log\frac{p_{ab}}{q_a q_b} + \log\frac{p_{bc}}{q_b q_c} + \log\frac{p_{ac}}{q_a q_c}$
– Proper three-way log-odds score: $\log\frac{p_{abc}}{q_a q_b q_c}$
Each sequence is scored as if it descended from the N-1 other sequences
Evolutionary events are over-counted

Multiple sequence alignment methods 20 Problem with SP scores
Consider an alignment of $N$ sequences
All have leucine (L) at position $i$
Score for an L-L alignment according to the BLOSUM50 matrix: 5
Number of symbol pairs in the column: $N(N-1)/2$
SP score of the column: $5 \cdot N(N-1)/2$

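Multiple sequence alignment methods 21 Problem with SP scores
What if one sequence has glycine (G) at $i$?
– A G-L pair scores -4; the difference from L-L is 9
– $N-1$ pairs are affected, so the column score drops by $9(N-1)$
The score is worse than the all-leucine column by a fraction $\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$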
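
The arithmetic behind this SP-score problem can be checked numerically. The sketch below is illustrative only: `SCORES` holds just the two BLOSUM50 entries quoted on the slides (L-L = 5, G-L = -4), not a full matrix, and the names `sp_column_score` and `relative_drop` are made up for this example. It shows that the relative penalty for a single glycine equals $18/(5N)$ and therefore shrinks as more sequences agree on leucine.

```python
from itertools import combinations

# Only the two BLOSUM50 entries quoted on the slides; not a full matrix.
SCORES = {("L", "L"): 5, ("G", "L"): -4}


def pair_score(a, b):
    """Look up s(a, b), treating the table as symmetric."""
    return SCORES[tuple(sorted((a, b)))]


def sp_column_score(column):
    """Sum-of-pairs score: add s(a, b) over all N(N-1)/2 symbol pairs."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def relative_drop(n):
    """Fraction by which one G lowers the score of an all-L column of n symbols."""
    all_leu = sp_column_score("L" * n)
    one_gly = sp_column_score("L" * (n - 1) + "G")
    return (all_leu - one_gly) / all_leu


for n in (5, 10, 50):
    print(n, relative_drop(n), 18 / (5 * n))
```

The two printed values agree for each `n`. The relative drop falls with `n`, although more leucines arguably make the lone glycine more surprising, which is the defect the slides point out.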

Multiple sequence alignment methods 22
What a multiple alignment means
Scoring a multiple alignment
Questions?
Break

Multiple sequence alignment methods 23 Multidimensional dynamic programming
We assume that columns of an alignment are statistically independent
Gaps are scored with a linear gap cost
Now we can calculate the overall score $S(m) = \sum_i S(m_i)$
where $S(m_i)$ is the score for column $i$

Multiple sequence alignment methods 24 Calculating the overall score
Define $\alpha_{i_1, i_2, \ldots, i_N}$ as the maximum score of an alignment up to the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$

Multiple sequence alignment methods 25

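Multiple sequence alignment methods 26 Simple notation
Introduce $\Delta_i$, which is 0 or 1, and define the “product” $\Delta_i \cdot x = x$ if $\Delta_i = 1$ and $\Delta_i \cdot x = -$ (a gap) if $\Delta_i = 0$
Now the recursion can be written as follows:
$\alpha_{i_1, \ldots, i_N} = \max_{\Delta_1 + \cdots + \Delta_N > 0} \left\{ \alpha_{i_1 - \Delta_1, \ldots, i_N - \Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \right\}$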
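
A minimal sketch of multidimensional dynamic programming, written to make the role of the 0/1 vector per column concrete. It assumes a sum-of-pairs column score with hypothetical parameters (match = 1, mismatch = -1, linear gap cost = 2, and $s(-,-) = 0$). The function names are invented for this example, and the code returns only the optimal score, not the alignment itself.

```python
from itertools import combinations, product


def sp_column(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column with linear gap cost; s(-,-) = 0."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def multi_align_score(seqs, column_score=sp_column):
    """Optimal multiple-alignment score by full N-dimensional DP."""
    n = len(seqs)
    lengths = [len(s) for s in seqs]
    # All 2^N - 1 vectors (d_1, ..., d_N) of 0/1 values with at least one 1.
    deltas = [d for d in product((0, 1), repeat=n) if any(d)]
    alpha = {(0,) * n: 0}
    # product() yields cells in lexicographic order, so every predecessor
    # cell (componentwise <=, not equal) has been filled before it is needed.
    for cell in product(*(range(length + 1) for length in lengths)):
        if not any(cell):
            continue  # origin is already initialised
        best = None
        for d in deltas:
            prev = tuple(i - di for i, di in zip(cell, d))
            if min(prev) < 0:
                continue  # this move would step outside the matrix
            # Residue where d_k = 1, gap where d_k = 0.
            column = [s[i - 1] if di else "-" for s, i, di in zip(seqs, cell, d)]
            cand = alpha[prev] + column_score(column)
            if best is None or cand > best:
                best = cand
        alpha[cell] = best
    return alpha[tuple(lengths)]


print(multi_align_score(["AC", "AC", "AC"]))
```

The table holds one entry per cell of the N-dimensional matrix and each cell tries every non-zero 0/1 vector, which is why this approach is only practical for a handful of short sequences.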

Multiple sequence alignment methods 27 Complexity of algorithm
The algorithm requires the computation of the whole dynamic programming matrix with $L_1 L_2 \cdots L_N$ entries.
For each entry we have to consider $2^N - 1$ combinations of gaps in a column.
Assume all sequences have roughly the same length $L$
Memory complexity of the algorithm is $O(L^N)$
Time complexity is $O(2^N L^N)$

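Multiple sequence alignment methods 28 MSA
Let $a^{kl}$ denote the pairwise alignment between sequences $k$ and $l$ induced by a multiple alignment $a$. The score of the complete alignment is given by $S(a) = \sum_{k<l} S(a^{kl})$
Let $\hat{a}^{kl}$ be the optimal pairwise alignment of $k$, $l$. Obviously $S(a^{kl}) \le S(\hat{a}^{kl})$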

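Multiple sequence alignment methods 29 Lower bound
Assume that we have a lower bound $\sigma(a)$ on the score of the optimal multiple alignment, so
$\sigma(a) \le S(a) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$
In other words $S(a^{kl}) \ge \beta^{kl}$
where $\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$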

Multiple sequence alignment methods 30 Lower bound
Now we can look only at pairwise alignments of $k$ and $l$ that score better than $\beta^{kl}$
We need to obtain $\sigma(a)$, and this can be done by using a progressive alignment algorithm

Multiple sequence alignment methods 31 Restricted algorithm
For each pair $k$, $l$ we can find the complete set $B^{kl}$ of coordinate pairs $(i_k, i_l)$ such that the best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$ scores more than $\beta^{kl}$
Now we only have to look at cells $(i_1, i_2, \ldots, i_N)$ which meet the following condition: $(i_k, i_l)$ is in $B^{kl}$ for all $k$, $l$
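
One way to find the admissible coordinate pairs for a single pair of sequences is to combine a forward and a backward pairwise dynamic programming pass. The sketch below is an illustration under stated assumptions, not the MSA program's actual code: it uses hypothetical match/mismatch/linear-gap scores, reads "through $(i, j)$" as "the alignment path passes through DP cell $(i, j)$", and keeps cells scoring at least `beta` so that an optimal path survives when the bound is tight. The names `pair_matrices` and `allowed_cells` are made up.

```python
def pair_matrices(x, y, match=1, mismatch=-1, gap=-2):
    """Forward matrix F (best score of prefixes) and backward matrix R
    (best score of suffixes) for global pairwise alignment."""
    def s(a, b):
        return match if a == b else mismatch

    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # R[i][j] is the best score of aligning x[i:] with y[j:].
    R = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        R[i][m] = (n - i) * gap
    for j in range(m - 1, -1, -1):
        R[n][j] = (m - j) * gap
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            R[i][j] = max(R[i + 1][j + 1] + s(x[i], y[j]),
                          R[i + 1][j] + gap,
                          R[i][j + 1] + gap)
    return F, R


def allowed_cells(x, y, beta):
    """Cells (i, j) whose best alignment through them scores at least beta."""
    F, R = pair_matrices(x, y)
    return {(i, j)
            for i in range(len(x) + 1)
            for j in range(len(y) + 1)
            if F[i][j] + R[i][j] >= beta}


print(sorted(allowed_cells("AC", "AC", beta=2)))
```

With the bound equal to the optimal pairwise score, only the diagonal cells remain; a looser bound admits more cells, and the multidimensional search then visits only cells whose every pairwise projection is admissible.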

Multiple sequence alignment methods 32

Multiple sequence alignment methods 33 Progressive alignment methods
The algorithms differ in several ways:
– The choice of order in which to do the alignment
– Whether the progression involves only alignment of sequences to a single growing alignment, or whether subfamilies are built up on a tree structure

Multiple sequence alignment methods 34 Feng-Doolittle progressive multiple alignment
1. Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment
2. Construct a guide tree from the distance matrix using the Fitch & Margoliash clustering algorithm
3. Starting from the first node added to the tree, align the child nodes. Repeat until all sequences have been aligned.

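Multiple sequence alignment methods 35 Converting scores to distances
$D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\max} - S_{\mathrm{rand}}}$
where
– $S_{\max}$ is the maximum score
– $S_{\mathrm{obs}}$ is the observed pairwise alignment score
– $S_{\mathrm{rand}}$ is the expected score for aligning two random sequences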
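
A small sketch of the score-to-distance conversion, assuming the normalized form $D = -\log\big((S_{\mathrm{obs}} - S_{\mathrm{rand}}) / (S_{\max} - S_{\mathrm{rand}})\big)$ with a natural logarithm. The function name and the error handling are choices made for this example.

```python
import math


def score_to_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score into an approximate distance.

    s_obs:  observed pairwise alignment score
    s_max:  maximum attainable score for the pair
    s_rand: expected score for aligning two random sequences
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)  # normalised score, 1 for identical
    if s_eff <= 0:
        # The pair scores no better than random, so the log is undefined.
        raise ValueError("observed score does not exceed the random expectation")
    return -math.log(s_eff)


print(score_to_distance(55, 100, 10))
```

Identical sequences (`s_obs == s_max`) get distance 0, and the distance grows without bound as the observed score approaches the random expectation.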

Multiple sequence alignment methods 36 Profile alignment
Linear gap scores can be included in the SP score: $s(-,a) = s(a,-) = -g$ and $s(-,-) = 0$
Global alignment score: $\sum_i S(m_i) = \sum_i \sum_{k<l \le N} s(m_i^k, m_i^l)$
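
To see why two profiles can be aligned with ordinary pairwise dynamic programming, it may help to split the sum-of-pairs score by profile membership. The derivation below is a sketch under these assumptions: $m_i^k$ denotes the symbol of sequence $k$ in column $i$, sequences $1, \ldots, n$ belong to the first profile, sequences $n+1, \ldots, N$ to the second, and $s(-,-) = 0$.

```latex
\begin{align*}
\sum_i S(m_i)
  &= \sum_i \sum_{k < l \le N} s(m_i^k, m_i^l) \\
  &= \underbrace{\sum_i \sum_{k < l \le n} s(m_i^k, m_i^l)}_{\text{within profile 1}}
   + \underbrace{\sum_i \sum_{n < k < l \le N} s(m_i^k, m_i^l)}_{\text{within profile 2}}
   + \underbrace{\sum_i \sum_{k \le n,\; n < l \le N} s(m_i^k, m_i^l)}_{\text{between profiles}}
\end{align*}
```

Aligning the two profiles only inserts whole columns of gaps into one of them, and because $s(-,-) = 0$ those columns leave the first two sums unchanged. Only the cross term depends on how the profiles are aligned, so it is the only part that needs to be optimized.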

Multiple sequence alignment methods 37 CLUSTALW progressive alignment
1. Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming alignment.
2. Construct a guide tree by a neighbor-joining clustering algorithm (Saitou & Nei).
3. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile and profile-profile alignment.

Multiple sequence alignment methods 38 CLUSTALW properties
Sequences are weighted to compensate for biased representation.
The substitution matrix used to score an alignment is chosen based on the expected similarity of the sequences.
Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position.

Multiple sequence alignment methods 39 CLUSTALW properties
Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.
Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment.
In the progressive alignment stage, if the score of an alignment is low, that alignment is deferred until more profile information has accumulated.

Multiple sequence alignment methods 40 Iterative refinement methods: Barton-Sternberg multiple alignment
1. Find the two sequences with the highest pairwise similarity and align them using standard pairwise dynamic programming alignment.
2. Find the sequence that is most similar to a profile of the alignment of the first two and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment.

Multiple sequence alignment methods 41 Iterative refinement methods: Barton-Sternberg multiple alignment
3. Remove sequence $x^1$ and realign it to a profile of the other aligned sequences $x^2, \ldots, x^N$ by profile-sequence alignment. Repeat for sequences $x^2, \ldots, x^N$.
4. Repeat the previous realignment step a fixed number of times or until the alignment score converges.
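
The control flow of the refinement steps can be sketched as a short loop. This is only a skeleton under stated assumptions: `realign_one` and `score` are hypothetical callbacks that the caller must supply (the first removes one sequence and realigns it to a profile of the rest, the second scores an alignment), and "converges" is interpreted here as "a full pass no longer improves the score".

```python
def iterative_refinement(alignment, realign_one, score, max_rounds=10):
    """Repeat remove-and-realign passes over all sequences.

    alignment:   current multiple alignment (any representation)
    realign_one: callable (alignment, k) -> alignment with sequence k
                 removed and realigned to a profile of the others
    score:       callable alignment -> number (higher is better)
    max_rounds:  upper limit on the number of full passes
    """
    best = score(alignment)
    for _ in range(max_rounds):
        # One pass: realign each sequence in turn against the others.
        for k in range(len(alignment)):
            alignment = realign_one(alignment, k)
        new = score(alignment)
        if new <= best:
            break  # no improvement in this pass: treat as converged
        best = new
    return alignment
```

Passing the realignment and scoring routines in as parameters keeps the loop independent of the profile representation and of the scoring scheme, both of which vary between implementations.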