Practical multiple sequence algorithms. Sushmita Roy, BMI/CS 576, Sep 24th, 2013.



Goals for today
Review guide-tree based multiple sequence alignment
Two practical implementations of algorithms for multiple sequence alignment
–CLUSTALW
–MUSCLE

The problems with progressive alignment
Greedy
–The tree might not be correct, that is, it might reflect an incorrect ordering of how sequences should be joined
–Errors in alignment: even if the tree is correct, there might be some positions that are misaligned
Choice of alignment parameters
–Matters especially when the sequences are diverged and there are more mismatches than identities; for closely related sequences, identities dominate over mismatches
–Different weight matrices might be optimal for different evolutionary distances
–Gaps do not occur randomly: gaps are more likely to occur between “secondary structures” rather than within them

ClustalW
A progressive alignment algorithm with several heuristics
Based on a guide tree approach
Dynamically varies the gap penalties in a position- and residue-specific manner
Weights different sequences differently
Thompson et al., 1994

Alignments based on guide trees
Build up a multiple sequence alignment by progressively adding new sequences, following the order of a phylogenetic tree.
Needs sequences to have different extents of divergence
Start with aligning the closest pairs of sequences.
Gaps inserted in the earlier alignments should be preserved as these gaps are most reliable.

Steps in ClustalW
1. Align all pairs of sequences separately to create a pairwise distance matrix.
2. Calculate a guide tree from the matrix.
3. Align sequences progressively according to the guide tree, starting from the leaves.

Calculating the pairwise distance
For two sequences with the following alignment
AATAATA
ATAA_TA
Similarity S
–No. of identical bases/size of alignment
–4/7 for the above example
Distance=1-S
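The distance above can be checked with a minimal Python sketch. The function name `pairwise_distance` and the choice to treat both `_` and `-` as gap characters are assumptions made for illustration; they are not part of ClustalW itself.

```python
def pairwise_distance(row_a: str, row_b: str, gap_chars: str = "_-") -> float:
    """Distance = 1 - S, where S = identical positions / alignment length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as identical only if both rows carry the same residue.
    identical = sum(
        1 for a, b in zip(row_a, row_b)
        if a == b and a not in gap_chars
    )
    return 1.0 - identical / len(row_a)


d = pairwise_distance("AATAATA", "ATAA_TA")
print(d)  # 1 - 4/7 for the slide's example
```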

Example of creating distance matrix
Consider four sequences
1. AAAC
2. AGC
3. ACC
4. GAC
Generate pairwise alignments for all pairs of sequences

Pairwise alignment for all the pairs of sequences
Sequence pair   Alignment      Similarity   Distance
1 and 2         AAAC / _AGC    2/4          0.5
1 and 3         AAAC / _ACC    2/4          0.5
1 and 4         AAAC / _GAC    2/4          0.5
2 and 3         AGC / ACC      2/3          0.33
2 and 4         AGC / GAC      1/3          0.67
3 and 4         ACC / GAC      1/3          0.67

Creating a tree from the distance matrix using UPGMA
UPGMA: Unweighted pair group method using arithmetic averages
Represent all sequences as the leaf nodes of a tree
Merge the two closest nodes at a time to create a new node in the tree
–Set the new node at a height determined by the nodes being merged
Let i and j be two existing nodes that are merged to create a new node k
–Height of k: $d_{ij}/2$
–Distance between k and every other node l: $d_{kl} = \frac{|C_i|\, d_{il} + |C_j|\, d_{jl}}{|C_i| + |C_j|}$
–$|C_i|$ and $|C_j|$: number of elements in the clusters associated with nodes i and j

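UPGMA in practice
Initial distance matrix
     1     2     3     4
1    X     0.5   0.5   0.5
2          X     0.33  0.67
3                X     0.67
4                      X
Nodes 2 and 3 are closest, so merge them into a new node 5
Place the new node at height d_23/2 = 1/6
Updated distance matrix
     1     4     5
1    X     0.5   0.5
4          X     0.67
5                X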

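UPGMA in practice
Distance matrix after merging 2 and 3 into node 5 (height 1/6)
     1     4     5
1    X     0.5   0.5
4          X     0.67
5                X
Merge nodes 1 and 4 into a new node 6 at height d_14/2 = 0.25
Updated distance matrix
     5     6
5    X     0.59
6          X
Merge nodes 5 and 6 into the root at height d_56/2 = 0.29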
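The UPGMA walk-through can be reproduced with a small Python sketch. The function `upgma`, its return format, and the tie-breaking rule (the first closest pair in enumeration order wins, which happens to pick nodes 1 and 4 as on the slide) are illustrative assumptions, not ClustalW's implementation.

```python
from itertools import combinations


def upgma(dist, labels):
    """Return merges as (node_a, node_b, new_node, height) tuples.

    dist maps a pair of labels (tuple) to their distance; labels are ints.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {n: 1 for n in labels}      # number of leaves under each node
    active = list(labels)
    next_id = max(labels) + 1
    merges = []
    while len(active) > 1:
        # Pick the closest pair of active nodes (first one wins on ties).
        i, j = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        k, next_id = next_id, next_id + 1
        height = d[frozenset((i, j))] / 2
        # Size-weighted average distance from the new node to every other node.
        for l in active:
            if l not in (i, j):
                d[frozenset((k, l))] = (
                    size[i] * d[frozenset((i, l))] + size[j] * d[frozenset((j, l))]
                ) / (size[i] + size[j])
        size[k] = size[i] + size[j]
        active = [n for n in active if n not in (i, j)] + [k]
        merges.append((i, j, k, height))
    return merges


# Distances from the four-sequence example (0.33 and 0.67 kept as exact thirds).
dist = {(1, 2): 0.5, (1, 3): 0.5, (1, 4): 0.5,
        (2, 3): 1 / 3, (2, 4): 2 / 3, (3, 4): 2 / 3}
merges = upgma(dist, [1, 2, 3, 4])
for a, b, new, h in merges:
    print(f"merge {a} and {b} into {new} at height {h:.2f}")
```

With exact thirds the final height is about 0.29, matching d_56/2 on the slide.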

Computing the sum of scores for two alignments
Assume we have two alignments corresponding to intermediate nodes of the guide tree
At each step we maximize over the score from
–aligning column i in A1 to a column j in A2
–aligning column i in A1 to gaps in A2
–aligning column j in A2 to gaps in A1
ClustalW uses an average of all pairwise comparisons between two alignments
Alignment A1
AAAC
_GAC
Alignment A2
AGC
ACC

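ClustalW scores for aligning columns from two alignments
Alignment 1
AAAC
_GAC
Alignment 2
AGC
ACC
Score of aligning column 3 from Alignment 1 and column 2 from Alignment 2
Assume a score of 1 for mismatch, 2 for match and 0 for gap
Column 3 of Alignment 1 is (A, A) and column 2 of Alignment 2 is (G, C), so all four residue pairs are mismatches: score = (1+1+1+1)/4 = 1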
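As a rough illustration, the column score can be computed as the unweighted average over all residue pairs between the two columns. The helper `column_score` is a hypothetical name, and this sketch ignores the sequence weights and substitution matrices that ClustalW applies in practice; it uses only the toy scores stated on the slide.

```python
def column_score(col1, col2, match=2, mismatch=1, gap=0, gap_char="_"):
    """Average pairwise score between two alignment columns."""
    total = 0
    for a in col1:
        for b in col2:
            if a == gap_char or b == gap_char:
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total / (len(col1) * len(col2))


alignment1 = ["AAAC", "_GAC"]
alignment2 = ["AGC", "ACC"]
col3_a1 = [row[2] for row in alignment1]   # column 3 of Alignment 1: A, A
col2_a2 = [row[1] for row in alignment2]   # column 2 of Alignment 2: G, C
score = column_score(col3_a1, col2_a2)
print(score)  # four mismatches, so the average is 1.0
```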

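An example for aligning two alignments
Alignment 1
AAAC
_GAC
Alignment 2
AGC
ACC
Each cell of the dynamic programming matrix takes the max of three options
–a column of Alignment 1 aligned to a column of Alignment 2
–a column of Alignment 1 aligned to gaps
–gaps aligned to a column of Alignment 2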
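The three-way maximization can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions: the function name `align_alignments` is hypothetical, scoring uses the toy values from the slides (2 for match, 1 for mismatch, 0 for gap) with an unweighted average, and it returns only the best score, not the alignment. ClustalW itself adds sequence weights, substitution matrices and position-specific gap penalties.

```python
def align_alignments(a1, a2, match=2, mismatch=1, gap=0, gap_char="_"):
    """Best total score for aligning alignment a1 against alignment a2."""

    def pair(x, y):
        if x == gap_char or y == gap_char:
            return gap
        return match if x == y else mismatch

    def col_score(c1, c2):
        # Unweighted average over all residue pairs between two columns.
        return sum(pair(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

    cols1 = list(zip(*a1))            # columns of alignment 1
    cols2 = list(zip(*a2))            # columns of alignment 2
    gaps1 = (gap_char,) * len(a1)     # an all-gap column for alignment 1
    gaps2 = (gap_char,) * len(a2)     # an all-gap column for alignment 2
    n, m = len(cols1), len(cols2)

    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + col_score(cols1[i - 1], gaps2)
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + col_score(gaps1, cols2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + col_score(cols1[i - 1], cols2[j - 1]),  # column vs column
                S[i - 1][j] + col_score(cols1[i - 1], gaps2),             # column of A1 vs gaps
                S[i][j - 1] + col_score(gaps1, cols2[j - 1]),             # gaps vs column of A2
            )
    return S[n][m]


best = align_alignments(["AAAC", "_GAC"], ["AGC", "ACC"])
print(best)  # 4.5 with these toy scores
```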

Assigning sequence weights in ClustalW
ClustalW also considers different weights for different sequences
Closely related sequences need to be down-weighted
Divergent sequences are up-weighted
Uses the branch length of the tree to calculate weights

ClustalW weights of sequences Weight of a sequence: sum of branch lengths from root to leaf, but sequences sharing a branch share the weight For example, weight for Hbb_Human=0.081+(0.226/2)+(0.061/4)+(0.015/5)+(0.062/6)
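The arithmetic in the Hbb_Human example can be checked directly. The list below pairs each branch length with the number of sequences sharing that branch, read off the expression above, starting from the leaf's own branch; the variable names are illustrative.

```python
# (branch length, number of sequences sharing that branch)
branches = [(0.081, 1), (0.226, 2), (0.061, 4), (0.015, 5), (0.062, 6)]

# Each branch contributes its length divided among the sequences below it.
weight = sum(length / shared for length, shared in branches)
print(round(weight, 4))  # 0.2226
```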

ClustalW score computation

ClustalW gap handling rules
Gap penalties are dynamically adjusted
For each position in the alignment compute a possible gap penalty value
–If there is a gap in any of the sequences being aligned, reduce its penalty
–If there is no gap, and this position is <8 positions from another gap, increase the gap open penalty
–Reduce the gap penalty for positions inside a hydrophilic stretch of 5 residues
–Otherwise use residue-specific gap penalties estimated from known alignments
Different amino acid substitution matrices may be selected depending upon the estimated divergence of the sequences being aligned at a particular stage.

Position-specific gap penalties in ClustalW
Higgins et al., Methods in Enzymology, 1996
[Figure: gap penalty along an alignment, with hydrophilic stretches and an existing gap marked; the gap penalty is high within 8 positions of existing gaps]

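Switching weight matrices
Dynamically switch between matrices depending upon the average similarity between the sequences being aligned
Two matrix series are available, each split into four similarity ranges
–PAM series, down to PAM350 for the most divergent sequences
–BLOSUM series, down to BLOSUM30 for the most divergent sequences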

Applying ClustalW to SH3 domain proteins Proteins share <12% sequence identity Alignment blocks correspond to beta strand secondary structures

Summary of ClustalW Guide tree method Complex gap penalty rules Sequences are weighted to reduce the importance of very similar sequences Adaptive scoring matrix

MUSCLE: Multiple Sequence Comparison by log-expectation
Progressive + iterative
Has three main stages
Stage 1: Draft Progressive
Stage 2: Improved Progressive
Stage 3: Refinement
–Select pairs of subtrees and re-align the alignment for the subtrees.
–Keep if it improves alignment

Steps in MUSCLE Stage 1: Draft progressive Stage 2: Improved progressive Stage 3: Refinement

MUSCLE Stage 1
1.1 Compute k-mer distance matrix
1.2 Use UPGMA to make tree (TREE1)
1.3 Use guide tree to make first MSA

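K-mer distance
K-mer distance is defined from the common fractional k-mer count (F): D=1-F
$F = \frac{\sum_{\tau} \min(n_1(\tau),\, n_2(\tau))}{\min(L_1, L_2) - k + 1}$
–$\tau$: a k-mer
–$n_1(\tau)$: # of instances of $\tau$ in sequence 1
–$L_1$, $L_2$: lengths of the sequences
Let k=2
Sequence   2-mers
AKFLA      AK, KF, FL, LA
LKFL       LK, KF, FL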

K-mer distance example
Sequence   2-mers
AKFLA      AK, KF, FL, LA
LKFLFL     LK, KF, FL, LF, FL
K-mer (τ)   # in sequence 1   # in sequence 2   Min(n1(τ), n2(τ))
AK          1                 0                 0
KF          1                 1                 1
FL          1                 2                 1
LA          1                 0                 0
LK          0                 1                 0
LF          0                 1                 0
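The k-mer distance for this example can be computed with a short sketch. The function names are illustrative, and the denominator min(L1, L2) - k + 1 is an assumption taken from the MUSCLE paper (Edgar, 2004) rather than from the slide text.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x: str, y: str, k: int = 2) -> float:
    """D = 1 - F, with F the fraction of k-mers the two sequences share."""
    nx, ny = kmer_counts(x, k), kmer_counts(y, k)
    # A Counter returns 0 for k-mers it has not seen, so min() handles misses.
    shared = sum(min(nx[t], ny[t]) for t in nx)
    f = shared / (min(len(x), len(y)) - k + 1)
    return 1.0 - f


d = kmer_distance("AKFLA", "LKFLFL")
print(d)  # shared 2-mers are KF and FL, so F = 2/4 and D = 0.5
```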

Stage 2: Improved progressive
2.1 Recompute the similarity of each pair of sequences using their mutual alignment in the MSA
2.2 Construct a phylogenetic tree (TREE2) using an alignment-based distance
2.3 Build a new progressive alignment only for subtrees where the branching order has changed between TREE1 and TREE2
2.4 Repeat 2.3 until the number of “reordered nodes” does not decrease.

Stage 2.1: Recomputing pairwise sequence similarity from a multiple alignment
An MSA
-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC
Derived pairwise alignment   Fraction identity
TGTTAAC / TGT-AAC            6/7
TGTTAAC / TGT--AC            5/7
-TGTTAAC / ATGT---C          4/8
-TGTTAAC / ATGT-GGC          …
Exclude columns that are gaps in both sequences
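A minimal sketch of this step: take two rows of the MSA, drop the columns where both rows have a gap, and count identical positions. The function names are hypothetical and the code only illustrates the bookkeeping, not MUSCLE's actual implementation.

```python
def induced_pair(row_a: str, row_b: str, gap: str = "-"):
    """Pairwise alignment induced by an MSA: drop columns that are gaps in both rows."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def fractional_identity(row_a: str, row_b: str, gap: str = "-"):
    """Return (identical positions, induced alignment length)."""
    a, b = induced_pair(row_a, row_b, gap)
    # Double-gap columns are already gone, so equal characters are residues.
    identical = sum(1 for x, y in zip(a, b) if x == y)
    return identical, len(a)


msa = ["-TGTTAAC", "-TGT-AAC", "-TGT--AC", "ATGT---C", "ATGT-GGC"]
print(fractional_identity(msa[0], msa[1]))  # (6, 7)
print(fractional_identity(msa[0], msa[2]))  # (5, 7)
print(fractional_identity(msa[0], msa[3]))  # (4, 8)
```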

Stage 2.2: Phylogenetic tree creation Construct a phylogenetic tree using a Kimura distance D: fractional identity of sequences

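Stage 2.3: Re-align only when branching order is changed
[Figure: comparison of TREE1 and TREE2. Where the branching order is the same, nothing is recomputed; where the branching order is different (x branches before v), the alignment is recomputed for these nodes]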

Stage 3: Iterative Refinement
3.1 Select a branch
3.2 Extract profiles
3.3 Re-align profiles
3.4 Update MSA if its score is better than current MSA

3.1 Selecting a branch
Select a branch in order of decreasing distance from the root
[Figure: guide tree with the alignment held at each node; branch selection order: 1,2,3,4,5,6]

3.2 Extracting a profile
Delete branch 1
Re-align profiles for subtrees (here the profile MQTIF against the profile LH-IW, LQS-W, L-S-W)
Is score better? If yes, keep the new alignment; if no, discard it

3.2 Extracting a profile
Delete branch 2
Re-align profiles for subtrees (here the profile LHIW against the profile MQTIF, LQS-W, L-S-W)
Is score better? If yes, keep the new alignment; if no, discard it
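The accept-if-better loop of Stage 3 can be sketched as follows. Everything here is a hypothetical skeleton: `refine`, `split`, `realign` and `score` are placeholder names supplied by the caller, standing in for branch deletion and profile extraction, profile-profile alignment, and the alignment objective. None of them are MUSCLE functions.

```python
from typing import Callable, Iterable, List, Tuple

Alignment = List[str]


def refine(
    msa: Alignment,
    branches: Iterable[int],
    split: Callable[[Alignment, int], Tuple[Alignment, Alignment]],
    realign: Callable[[Alignment, Alignment], Alignment],
    score: Callable[[Alignment], float],
) -> Alignment:
    """Visit branches in the given order and keep a re-alignment only if it scores higher."""
    best, best_score = msa, score(msa)
    for branch in branches:
        # Deleting a branch yields two subtrees, each with its own profile.
        profile_a, profile_b = split(best, branch)
        candidate = realign(profile_a, profile_b)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score  # keep new alignment
        # Otherwise the candidate is discarded and the current MSA is kept.
    return best
```

Passing the branches in order of decreasing distance from the root gives the visiting order described for step 3.1.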

Summary of MUSCLE
Three stage algorithm
Stage 1: Draft progressive
–k-mer distance
–UPGMA tree (TREE1)
–Guide tree based alignment (MSA1)
Stage 2: Improved progressive
–Distance derived from MSA1
–UPGMA tree (TREE2)
–Redo alignment for nodes with changed orderings
–Repeat until number of re-ordered nodes does not change
Stage 3: Iterative refinement
–Generate subtree profiles
–Realign profiles
–Keep realignment if of higher score
–Repeat until no more improvement or fixed number of steps.
MUSCLE-fast: Stage 1
MUSCLE-p: Stage 1 and 2

Accuracy scores of different MSA algorithms on benchmark datasets
Edgar, 2004, BMC Bioinformatics
Accuracy measures the fraction of residues correctly aligned with the reference alignment

Run time of different MSA algorithms

Summary of algorithms
ClustalW
–Lots of heuristics for gaps
–One guide tree and then alignment
–Weights sequences
–Dynamically selects scoring matrix depending upon sequence identity
MUSCLE
–Three-stage algorithm: Draft, Improved, Iterative refinement
–Two guide trees
–Uses k-mer distance for first tree
–Selectively re-aligns using second tree
–Refines iteratively by working on subtree-associated alignments
–Fast and has as good or better quality alignments

How do MUSCLE and CLUSTALW work in practice?
Consider coding sequences of 15 yeast species
Consider promoter sequences of 15 yeast species
Align with MUSCLE and CLUSTALW

Protein sequence alignment MUSCLE CLUSTALW

Promoter sequence alignment MUSCLE CLUSTALW

Comparing alignment of promoters to shuffled sequences in CLUSTALW Original sequences Shuffled sequences

Comparing alignment of promoters to shuffled sequences in MUSCLE Original sequences Shuffled sequences

Conclusion
Algorithms seemed similar for protein/coding sequences
Algorithms gave different alignments for DNA sequences
–Possibly DNA sequences are harder to align
–DNA sequences in non-coding regions are even harder to align

Summary of sequence alignment algorithms
Pairwise alignment
–Global: Needleman-Wunsch
–Local: Smith-Waterman
Database searching
–BLAST
Multiple sequence alignment
–Star alignment
–Progressive alignment with guide tree: CLUSTALW
–Progressive + iterative alignment with guide tree: MUSCLE