Practical multiple sequence algorithms
Sushmita Roy, BMI/CS 576, Sep 23rd, 2014

RECAP
Scores for multiple sequence alignment
- Sum of pairs
- Minimum entropy based
Heuristic algorithms for performing multiple sequence alignment
- Progressive
  - Star alignment
  - Guide tree-based: ClustalW
- Iterative: MUSCLE

Goals for today
- General description of iterative algorithms
- A practical implementation: MUSCLE

Iterative algorithms for multiple sequence alignment
- Key idea: revisit the alignments
- Algorithms vary in how exactly the alignments change between iterations

Simple iterative algorithm (also called the Barton-Sternberg alignment algorithm)
1. Align the two sequences with the highest alignment score using standard dynamic programming techniques for pairwise alignment.
2. Repeat until all sequences are in the alignment:
   - Find the sequence most similar to the current alignment
   - Add it to the alignment
3. For each sequence x_i: remove x_i from the alignment, then re-align it to the alignment of the remaining sequences {x_1, ..., x_n} \ {x_i}.
4. Repeat step 3 until the score does not improve OR a fixed number of steps has been executed.
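
The leave-one-out refinement step and its stopping rule can be sketched as a small control loop. This is a minimal sketch under stated assumptions, not the Barton-Sternberg implementation: `refine`, `toy_realign`, and `toy_score` are hypothetical names, and the two toy callbacks only stand in for a real profile-to-sequence aligner and a real MSA score such as sum of pairs.

```python
from typing import Callable, List, Tuple

Alignment = List[str]  # one gapped row per sequence, all rows the same length


def refine(alignment: Alignment,
           realign: Callable[[Alignment, str, int], Alignment],
           score: Callable[[Alignment], float],
           max_rounds: int = 10) -> Tuple[Alignment, float]:
    """Remove each sequence in turn, re-align it to the rest, and keep the
    result only if the score improves. Stops when a full pass brings no
    improvement or after max_rounds passes."""
    best = list(alignment)
    best_score = score(best)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(best)):
            rest = best[:i] + best[i + 1:]
            ungapped = best[i].replace("-", "")
            candidate = realign(rest, ungapped, i)
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score, improved = candidate, candidate_score, True
        if not improved:
            break
    return best, best_score


def toy_score(aln: Alignment) -> int:
    """Toy score (assumption): number of gap-free, fully identical columns."""
    return sum(len(set(col)) == 1 and col[0] != "-" for col in zip(*aln))


def toy_realign(rest: Alignment, seq: str, i: int) -> Alignment:
    """Toy aligner (assumption): slide seq along the existing columns without
    internal gaps and keep the best-scoring offset."""
    width = len(rest[0])
    options = [("-" * k + seq).ljust(width, "-")
               for k in range(width - len(seq) + 1)]
    best_row = max(options, key=lambda row: toy_score(rest + [row]))
    return rest[:i] + [best_row] + rest[i:]


refined, refined_score = refine(["ACGT", "ACGT", "CGT-"], toy_realign, toy_score)
```

Passing the aligner and the score as callbacks keeps the loop independent of how the alignment is actually computed, which is where real iterative methods differ.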

MUSCLE: Multiple Sequence Comparison by log-expectation
- Progressive + iterative
- Has three main stages
  - Stage 1: Draft progressive
  - Stage 2: Improved progressive
  - Stage 3: Refinement
    - Select pairs of subtrees and re-align the alignment for the subtrees
    - Keep the new alignment if it improves the score
- Each stage returns an alignment, so the algorithm could be terminated after any stage

Steps in MUSCLE
- Stage 1: Draft progressive
- Stage 2: Improved progressive
- Stage 3: Refinement

MUSCLE Stage 1: Draft progressive
1.1 Compute k-mer distance matrix
1.2 Use UPGMA to make a tree (TREE1) (we will see this in a bit)
1.3 Use the guide tree to make the first MSA
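
Step 1.2 can be illustrated with a compact UPGMA sketch that turns a pairwise distance matrix into a tree topology. The function name `upgma`, the dictionary layout for distances, and the nested-tuple tree are assumptions made for illustration; branch lengths are not computed, and this is not MUSCLE's own code.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build a UPGMA topology as nested tuples.

    labels: list of leaf names.
    dist:   dict mapping (name_a, name_b) pairs to distances.
    """
    sizes = {label: 1 for label in labels}            # cluster -> leaf count
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


example_tree = upgma(
    ["A", "B", "C", "D"],
    {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
     ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10},
)
```

In the example the closest pair (A, B) is merged first, then C joins, then D, so the innermost tuple holds the most similar sequences.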

K-mer distance
- The k-mer distance D is defined from the fractional common k-mer count (F)
- For two sequences x and y: D = 1 - F
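
The slide text does not preserve the definition of F, so the following is a reconstruction under stated assumptions. In the form commonly given for MUSCLE's k-mer distance, the shared k-mer count is normalized by the number of k-mers in the shorter sequence. Here n_x(τ) is the number of times k-mer τ occurs in x, and L_x, L_y are the sequence lengths (symbols introduced here for illustration).

```latex
F = \frac{\sum_{\tau} \min\bigl(n_x(\tau),\, n_y(\tau)\bigr)}{\min(L_x, L_y) - k + 1},
\qquad
D = 1 - F
```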

K-mer distance example (k = 2)

Sequence       2-mers
x = AKFLA      AK, KF, FL, LA
y = LKFLFL     LK, KF, FL, LF, FL

K-mer (τ)   n_x(τ)   n_y(τ)   min(n_x(τ), n_y(τ))
AK          1        0        0
KF          1        1        1
FL          1        2        1
LA          1        0        0
LK          0        1        0
LF          0        1        0
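
The example can be checked with a short script. This is a minimal sketch: the function names are hypothetical, and the normalization by the number of k-mers in the shorter sequence, min(len(x), len(y)) - k + 1, is an assumption taken from the usual statement of MUSCLE's k-mer distance rather than from the slide text.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k):
    """D = 1 - F, where F is the fraction of k-mers shared by x and y."""
    n_x, n_y = kmer_counts(x, k), kmer_counts(y, k)
    # A Counter returns 0 for k-mers it has not seen, so min() is safe.
    shared = sum(min(n_x[tau], n_y[tau]) for tau in n_x)
    denom = min(len(x), len(y)) - k + 1   # assumed normalization
    if denom <= 0:
        raise ValueError("sequences must be at least k residues long")
    return 1 - shared / denom


d = kmer_distance("AKFLA", "LKFLFL", 2)
```

For x = AKFLA and y = LKFLFL the shared count is 2 (KF and FL), the shorter sequence has 4 2-mers, so F = 2/4 and D = 0.5 under this normalization.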

Stage 2: Improved progressive
2.1 Recompute the similarity of pairs of sequences using their mutual alignment in the MSA
2.2 Construct a phylogenetic tree (TREE2) using an alignment-based distance
2.3 Build a new progressive alignment only for subtrees where the branching order has changed between TREE1 and TREE2
2.4 Repeat 2.3 until the number of "reordered nodes" does not decrease

Stage 2.1: Recomputing pairwise sequence similarity from a multiple alignment

An MSA:
-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC

Derived pairwise alignment    Fraction identity
TGTTAAC                       6/7
TGT-AAC

TGTTAAC                       5/7
TGT--AC

-TGTTAAC                      4/8
ATGT---C

-TGTTAAC                      …
ATGT-GGC

Columns with gaps in both sequences are excluded.
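
The derivation above can be sketched in a few lines: take two rows of the MSA, drop the columns where both rows have a gap, and count identical columns. The function names are hypothetical, and counting columns with a gap in only one sequence in the denominator is an assumption read off the 4/8 example.

```python
from fractions import Fraction


def pairwise_from_msa(row_a, row_b):
    """Project two MSA rows to a pairwise alignment, dropping double-gap columns."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def fractional_identity(row_a, row_b):
    """Identical columns divided by all remaining columns."""
    a, b = pairwise_from_msa(row_a, row_b)
    if not a:
        raise ValueError("no columns left after removing double gaps")
    # Double-gap columns are gone, so equal characters are equal residues.
    matches = sum(1 for p, q in zip(a, b) if p == q)
    return Fraction(matches, len(a))


msa = ["-TGTTAAC", "-TGT-AAC", "-TGT--AC", "ATGT---C", "ATGT-GGC"]
identities = [fractional_identity(msa[0], row) for row in msa[1:4]]
```

Note that `Fraction` reduces 4/8 to 1/2; the value is the same as on the slide.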

Stage 2.2: Phylogenetic tree creation
- Construct a phylogenetic tree using a Kimura distance
- D: fractional identity of sequences
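
The slide text does not preserve the distance formula, so this is a hedged sketch of Kimura's protein distance correction as it is usually written: d = -ln(1 - p - p²/5), where p is the fraction of differing sites. Taking p as one minus the fractional identity is an assumption here, as are the function name and the choice to return infinity when the correction is undefined.

```python
import math


def kimura_distance(identity):
    """Kimura-corrected distance from a fractional identity in [0, 1]."""
    p = 1.0 - identity                 # fraction of differing sites (assumed)
    inner = 1.0 - p - p * p / 5.0
    if inner <= 0.0:
        # The logarithm is undefined for very divergent pairs.
        return math.inf
    return -math.log(inner)


d_close = kimura_distance(6 / 7)
d_far = kimura_distance(5 / 7)
```

The correction grows faster than p itself, reflecting multiple substitutions at the same site in more divergent sequences.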

Stage 2.3: Re-align only when branching order has changed
- Branching order same
- Branching order different (x branches before v): recompute the alignment for these nodes

Stage 3: Iterative refinement
3.1 Delete an edge
3.2 Extract profiles from subtrees
3.3 Re-align profiles
3.4 Update the MSA if the new score is better than that of the current MSA

3.1 Selecting a branch
- Select a branch in order of decreasing distance from the root
- Branch selection order: 1, 2, 3, 4, 5, 6
[Figure: guide tree with numbered branches and the sub-alignments at its nodes; full alignment: MQTIF / LH-IW / LQS-W / L-S-W]

3.2 Extracting a profile
- Delete branch 2
- Re-align the profiles for the two subtrees
- Is the score better? If yes, keep the new alignment; otherwise discard it
[Figure: flowchart of the step, starting from the current alignment MQTIF / LH-IW / LQS-W / L-S-W and the subtree profiles obtained after deleting branch 2]
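
Extracting a subtree profile amounts to keeping the rows for the subtree's sequences and removing columns that are left with gaps only. The sketch below is illustrative: `extract_profile` and the sequence names s1 to s4 are hypothetical, and choosing which sequences fall on each side of the deleted branch is left to the caller.

```python
def extract_profile(msa, names):
    """Return the sub-alignment for `names`, without all-gap columns.

    msa:   dict mapping sequence name to its gapped row.
    names: sequence names belonging to one subtree.
    """
    rows = [msa[name] for name in names]
    keep = [i for i, col in enumerate(zip(*rows))
            if any(ch != "-" for ch in col)]
    return {name: "".join(msa[name][i] for i in keep) for name in names}


current = {"s1": "MQTIF", "s2": "LH-IW", "s3": "LQS-W", "s4": "L-S-W"}
single = extract_profile(current, ["s2"])        # {'s2': 'LHIW'}
pair = extract_profile(current, ["s3", "s4"])    # {'s3': 'LQSW', 's4': 'L-SW'}
```

The two profiles obtained this way are then re-aligned to each other, and the result replaces the current MSA only if its score is higher.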

Summary of MUSCLE
Three-stage algorithm
- Stage 1: Draft progressive
  - k-mer distance
  - UPGMA tree (TREE1)
  - Guide tree-based alignment (MSA1)
- Stage 2: Improved progressive
  - Distance derived from MSA1
  - UPGMA tree (TREE2)
  - Redo alignment for nodes with changed orderings
  - Repeat until the number of re-ordered nodes does not change
- Stage 3: Iterative refinement
  - Generate subtree profiles
  - Realign profiles
  - Keep the realignment if it has a higher score
  - Repeat until no more improvement or a fixed number of steps
Variants
- MUSCLE-fast: Stage 1
- MUSCLE-p: Stages 1 and 2
Note the different convergence criteria in Stages 2 and 3

Accuracy scores of different MSA algorithms on benchmark datasets
- Source: Edgar, 2004, BMC Bioinformatics
- Accuracy measures the fraction of residues correctly aligned with the reference alignment
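
One common way to make this accuracy measure concrete is a pair-based score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sketch below assumes that definition; the function names are hypothetical, and both alignments are assumed to list the same sequences in the same order.

```python
def aligned_pairs(alignment):
    """Set of residue pairs that share a column.

    Each residue is identified as (row index, position in ungapped sequence).
    """
    positions = [0] * len(alignment)
    pairs = set()
    for col in zip(*alignment):
        present = []
        for row, ch in enumerate(col):
            if ch != "-":
                present.append((row, positions[row]))
                positions[row] += 1
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def pair_accuracy(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


score = pair_accuracy(["ACGT", "AG-T"], ["ACGT", "A-GT"])
```

In the example the reference aligns three residue pairs and the test alignment reproduces two of them, so the score is 2/3.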

Run time of different MSA algorithms

Summary of algorithms
ClustalW
- Lots of heuristics for gaps
- One guide tree and then alignment
- Weights sequences
- Dynamically selects the scoring matrix depending upon sequence identity
MUSCLE
- Three-stage algorithm: Draft, Improved, Iterative refinement
- Two guide trees
- Uses k-mer distance for the first tree
- Selectively re-aligns using the second tree
- Refines iteratively by working on subtree-associated alignments
- Fast, and produces alignments of equal or better quality

How do MUSCLE and CLUSTALW work in practice?
- Consider coding sequences of 15 yeast species
- Consider promoter sequences of 15 yeast species
- Align with MUSCLE and CLUSTALW

Protein sequence alignment
[Figure panels: MUSCLE, CLUSTALW]

Promoter sequence alignment
[Figure panels: MUSCLE, CLUSTALW]

Comparing alignment of promoters to shuffled sequences in CLUSTALW
[Figure panels: Original sequences, Shuffled sequences]

Comparing alignment of promoters to shuffled sequences in MUSCLE
[Figure panels: Original sequences, Shuffled sequences]
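
A shuffled control of this kind can be generated by permuting the residues of each sequence, which keeps length and composition but destroys the order. The slides do not say how the shuffling was done, so the simple permutation below is an assumption, and `shuffle_sequence` is a hypothetical helper.

```python
import random


def shuffle_sequence(seq, seed=None):
    """Return a random permutation of seq with the same composition."""
    rng = random.Random(seed)      # a seed makes the control reproducible
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


promoter = "ACGTACGTTTGACCA"       # made-up example sequence
control = shuffle_sequence(promoter, seed=0)
```

Aligning the shuffled set with the same program and settings shows how much apparent alignment structure arises from composition alone.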

Conclusion
- The algorithms gave similar alignments for protein/coding sequences
- The algorithms gave different alignments for DNA sequences
  - Possibly DNA sequences are harder to align
  - DNA sequences in non-coding regions are even harder to align

Summary of sequence alignment
Pairwise alignment
- Algorithms
  - Global: Needleman-Wunsch
  - Local: Smith-Waterman
Heuristic search to align a large number of sequences
- BLAST
Multiple sequence alignment
- Star alignment
- Progressive alignment with guide tree: CLUSTALW
- Progressive + iterative alignment with guide tree: MUSCLE