Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders
Weiwei Zhong


Topics: Background, Algorithm Design, Test Results

Background: Definitions

What is a Sequence Alignment?
Given:
- 2 or more sequences
- a scoring scheme (match score, mismatch score, gap penalty)
Insert gaps in each sequence, so that:
- all sequences have the same length
- the pairing score is maximal

Scoring Matrix
Simplified scoring: match = 2, mismatch = -1, gap penalty = -2
In practice: a scoring matrix
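To make the simplified scheme concrete, here is a minimal Python sketch that scores two already-aligned rows column by column. The function names `pair_score` and `alignment_score` are illustrative and do not come from the slides, and `-` is assumed to be the gap character.

```python
def pair_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Score one aligned column under the simplified scheme."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def alignment_score(row_x: str, row_y: str) -> int:
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row_x, row_y))


# Six matching columns (6 * 2) plus one gap column (-2).
print(alignment_score("FGK-GKG", "FGKFGKG"))  # 10
```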

Global vs. Local Alignments
Global: aligns the entire lengths of the sequences
  F G K - G K G
  F G K F G K G
Local: aligns regions of the sequences

Pairwise Alignment vs. Multiple Sequence Alignment (MSA)
Pairwise: 2 sequences
  F G K - G K G
  F G K F G K G
MSA: more than 2 sequences
  F G K - G K G
  F G K F G K G
  - G K Q G K G
  - - K F G K G

Background: Basic Dynamic Programming

Dynamic Programming Algorithm for Pairwise Alignments
Two sequences: GAATTC and GGATC
Scoring scheme: match = 2, mismatch = -1, gap penalty g
1. Initialization (table with GAATTC along one axis and GGATC along the other)

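2. Table fill
Scoring scheme: match = 2, mismatch = -1, gap penalty g
Each cell $M_{i,j}$ is computed from its three neighbors $M_{i-1,j-1}$, $M_{i-1,j}$, and $M_{i,j-1}$, where $c_i$ and $c_j$ are the characters being compared:
$M_{i,j} = \max\{\, M_{i-1,j-1} + S(c_i, c_j),\; M_{i,j-1} + g,\; M_{i-1,j} + g \,\}$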

3. Trace back
  G A A T T C
  G G A - T C
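The three steps (initialization, table fill, trace back) can be sketched in Python as follows. This is an illustrative implementation, not code from the talk: the name `needleman_wunsch` is an assumption, and the gap penalty is assumed to be -2, the value given for the simplified scoring scheme. Ties during trace back are broken in favor of the diagonal move, so a different but equally scoring alignment may be returned than the one drawn on the slide.

```python
def needleman_wunsch(x: str, y: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Global pairwise alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def s(a: str, b: str) -> int:
        return match if a == b else mismatch

    # 1. Initialization: first row and column hold cumulative gap penalties.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * gap
    for j in range(1, m + 1):
        M[0][j] = j * gap

    # 2. Table fill: each cell is the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)

    # 3. Trace back from the bottom-right cell to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return M[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, row_x, row_y = needleman_wunsch("GAATTC", "GGATC")
print(score)
print(row_x)
print(row_y)
```

Under these assumed parameters the optimal score for GAATTC and GGATC is 5, which is also the score of the alignment shown in the trace-back step.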

Multidimensional Dynamic Programming for MSA
For n strings of length L each, the running time is O(L^n).
Impractical beyond roughly 5-7 proteins.
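As a rough illustration of how quickly the O(L^n) table grows, assume L = 300 residues per sequence (an assumed figure chosen for the arithmetic, not a value taken from the slides):

```latex
L^{2} = 300^{2} = 9 \times 10^{4} \quad (n = 2), \qquad
L^{5} = 300^{5} \approx 2.4 \times 10^{12} \quad (n = 5), \qquad
L^{7} = 300^{7} \approx 2.2 \times 10^{17} \quad (n = 7)
```

Each added sequence multiplies the number of table cells by L, which is why heuristic methods are used for larger inputs.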


Algorithm Design: An MSA Heuristic

Feng-Doolittle Progressive Alignment
1. Align 2 of the sequences, S_i and S_j
2. Align a 3rd sequence S_k to the alignment of S_i and S_j
3. Repeat step 2 until all sequences are aligned
Running time: O(nL^2) *
Scoring a column c_i (containing T and A) against a column c_j (containing S):
S(c_i, c_j) = (S(T, S) + S(A, S)) / 2
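The column score above is an average over residue pairs. A minimal Python sketch of that averaging follows; the names `column_score`, `TOY_SCORES`, and `toy_S` are hypothetical, the substitution values are made up for illustration, and columns are assumed to contain residues only (no gap handling).

```python
from itertools import product
from typing import Callable, Sequence


def column_score(col_i: Sequence[str], col_j: Sequence[str],
                 S: Callable[[str, str], float]) -> float:
    """Average S(a, b) over every residue pair (a from col_i, b from col_j)."""
    pairs = list(product(col_i, col_j))
    return sum(S(a, b) for a, b in pairs) / len(pairs)


# Toy substitution scores, made up for illustration only.
TOY_SCORES = {("T", "S"): 1, ("A", "S"): 0}


def toy_S(a: str, b: str) -> float:
    return TOY_SCORES[(a, b)]


# (S(T, S) + S(A, S)) / 2 with the toy values: (1 + 0) / 2
print(column_score(["T", "A"], ["S"], toy_S))  # 0.5
```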

Features of Feng-Doolittle Algorithm
- Once a gap, always a gap
- Early mistakes cannot be corrected
- Alignment order is important
Example: two possible alignments of x and y, each followed by sequence z:
  x: G A A G T T
  y: G A C - T T
  z: G A A C T G

  x: G A A G T T
  y: G A - C T T
  z: G A A C T G

Algorithm Design: TspMsa, First Version

Traveling Salesman Problem (TSP)
Given:
- n nodes
- distances for each pair of nodes
Find a round trip that:
- visits each node exactly once
- has minimal total length
NP-complete (in its decision version); well studied
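For very small inputs the problem statement can be turned directly into code. The sketch below is a brute-force search written for illustration; the slides do not say which TSP solver TspMsa uses, and the names `tour_length` and `brute_force_tsp` and the 4-node distance matrix are assumptions.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Length of the closed round trip, including the edge back to the start."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))


def brute_force_tsp(dist):
    """Try every ordering of nodes 1..n-1 with node 0 fixed as the start."""
    n = len(dist)
    candidates = ([0, *rest] for rest in permutations(range(1, n)))
    return min(candidates, key=lambda tour: tour_length(tour, dist))


# Symmetric toy distances for 4 nodes arranged on a cycle.
dist = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
best = brute_force_tsp(dist)
print(best, tour_length(best, dist))
```

The search examines (n-1)! orderings, so it is only usable for a handful of nodes; practical solvers rely on heuristics or branch-and-bound methods.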

TspMsa: Algorithm Design
1. Calculate pairwise distances
2. Determine a TSP tour
3. Take the alignment order from the tour
4. Run the Feng-Doolittle alignment in that order

Starting Point and Direction of TSP Tour (figure: data set kinase_ref3)
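A TSP tour is a cycle, so turning it into a linear alignment order requires picking a starting node and a direction. The sketch below enumerates all such orders for a tour; the function name `orders_from_tour` is hypothetical, and the slides do not state how TspMsa chooses among the candidates.

```python
def orders_from_tour(tour):
    """Return every linear order obtained by choosing a start and a direction."""
    n = len(tour)
    orders = []
    for start in range(n):
        forward = [tour[(start + k) % n] for k in range(n)]
        backward = [tour[(start - k) % n] for k in range(n)]
        orders.append(forward)
        orders.append(backward)
    return orders


orders = orders_from_tour([0, 1, 2, 3])
print(len(orders))  # 8: 4 starting points times 2 directions
for order in orders:
    print(order)
```

A tour over n sequences therefore yields 2n candidate orders, which a progressive aligner could process to produce different results.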

Algorithm Design: TspMsa, Modified Design

TspMsa: Modified Algorithm Design
1. Calculate pairwise distances
2. Determine a TSP tour
3. Align closest nodes
4. One node left? If yes, end; if no, loop back
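One possible reading of this loop is sketched below in Python. Several details are assumptions, not statements from the slides: "closest nodes" is taken to mean the closest pair of neighbors on the tour, a merged node's distance to any other node is taken as the average of its two parts' distances, and the actual alignment step is replaced by building a nested tuple that records the merge order. The name `merge_along_tour` is hypothetical.

```python
def merge_along_tour(tour, dist):
    """Repeatedly merge the closest adjacent pair on the cyclic tour.

    tour: node labels in tour order.
    dist: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a nested tuple describing the order in which nodes were merged.
    """
    nodes = list(tour)
    d = dict(dist)
    while len(nodes) > 1:
        n = len(nodes)
        # Find the adjacent pair (wrapping around) with the smallest distance.
        k = min(range(n), key=lambda i: d[frozenset((nodes[i], nodes[(i + 1) % n]))])
        a, b = nodes[k], nodes[(k + 1) % n]
        merged = (a, b)  # stands in for "align a with b"
        # Assumed update rule: average the distances of the two merged parts.
        for c in nodes:
            if c != a and c != b:
                d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        # Replace a by the merged node and drop b, keeping the cyclic order.
        nodes = [merged if x == a else x for x in nodes if x != b]
    return nodes[0]


toy = {
    frozenset(("A", "B")): 1, frozenset(("B", "C")): 4,
    frozenset(("C", "D")): 2, frozenset(("D", "A")): 5,
    frozenset(("A", "C")): 3, frozenset(("B", "D")): 6,
}
print(merge_along_tour(["A", "B", "C", "D"], toy))
```

With these toy distances, A and B are merged first, then C and D, and finally the two groups.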

Modified Algorithm is Better
Alignment order for Kinase_ref3
Original TspMsa: (worst) (best)
Modified TspMsa:


What to Compare With?

Existing MSA Programs
Categories: progressive, iterative
Programs shown: multal, multalign, pileup, clustalw, poa, prrp, saga, hmmt
Scale: from less computation time (fast) to better quality (best quality)

CLUSTALW
1. Calculate pairwise distances
2. Derive a guide tree by the Neighbor Joining method
   Repeat until one node is left at the center:
   - choose the 2 closest nodes i and j, and derive an internal node x
   - $r_i = \frac{\sum_k d_{ik}}{n - 2}$
   - $d_{ix} = \frac{d_{ij} + r_i - r_j}{2}$
   - $d_{jx} = d_{ij} - d_{ix}$
   - $d_{xm} = \frac{d_{im} + d_{jm} - d_{ij}}{2}$ for every other node m
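The update formulas for one joining step can be checked on a small example. The Python sketch below applies them to a pair (i, j) that the caller has already chosen; the function name `join_pair` and the toy distances are assumptions. The toy matrix is additive, built from a tree with branch lengths A = 2, B = 3, C = 4, D = 5 and an internal edge of length 1, so the recovered branch lengths can be compared with the known ones.

```python
def join_pair(d, i, j):
    """Apply the neighbor-joining formulas to nodes i and j.

    d: symmetric distances as a dict of dicts.
    Returns (d_ix, d_jx, d_xm), where d_xm maps each remaining node m
    to its distance from the new internal node x.
    """
    nodes = list(d)
    n = len(nodes)
    # r_a = (sum over k of d_ak) / (n - 2)
    r = {a: sum(d[a][k] for k in nodes if k != a) / (n - 2) for a in (i, j)}
    d_ix = (d[i][j] + r[i] - r[j]) / 2
    d_jx = d[i][j] - d_ix
    d_xm = {m: (d[i][m] + d[j][m] - d[i][j]) / 2 for m in nodes if m not in (i, j)}
    return d_ix, d_jx, d_xm


d = {
    "A": {"B": 5, "C": 7, "D": 8},
    "B": {"A": 5, "C": 8, "D": 9},
    "C": {"A": 7, "B": 8, "D": 9},
    "D": {"A": 8, "B": 9, "C": 9},
}
print(join_pair(d, "A", "B"))
```

For this matrix the step recovers branch lengths 2 and 3 for A and B, and distances 5 and 6 from the new node to C and D, matching the tree the distances were built from.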

CLUSTALW
3. Progressively align all sequences following the guide tree
- 2 gap penalty values: opening, extension
- Dynamically changes the gap penalty and the scoring matrix
- Weighted sequences
Example sequences:
  1  p e e k s a v t a l
  2  g e e k a a v l a l
  3  e g e w q l v l h v
Without weights: Score = [S(t,v) + S(l,v)] / 2
With weights: Score = [S(t,v) * w_1 * w_3 + S(l,v) * w_2 * w_3] / 2
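Both scores on this slide can be read as special cases of one averaged form. The general expression below is an inference from the example, not a formula stated on the slide; it assumes column $c_i$ holds residues $a$ from sequences with weights $w_a$, and column $c_j$ holds residues $b$ with weights $w_b$:

```latex
\mathrm{Score}(c_i, c_j) = \frac{1}{|c_i|\,|c_j|} \sum_{a \in c_i} \sum_{b \in c_j} S(a, b)\, w_a\, w_b
```

With $c_i = \{t, l\}$ from sequences 1 and 2 and $c_j = \{v\}$ from sequence 3, this gives $[S(t,v)\, w_1 w_3 + S(l,v)\, w_2 w_3] / 2$, and setting every weight to 1 recovers the unweighted score.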

POA
  E T - - P K M I V R
  E T T H - K M L V R
1. Convert sequences to partial order graphs (figure: partial order graphs for the sequences)

POA
2. Align 2 sequences
3. Align one sequence to the current group (figure: sequence E T N K aligned to the current graph)
4. Repeat step 3 until all sequences are aligned

Test Results: Quality Evaluation

BAliBASE Benchmark
Reference 1: equidistant sequences with various levels of similarity
  - < 25% sequence identity
  - 20-40% sequence identity
  - > 35% sequence identity
Reference 2: closely related sequences with a highly divergent "orphan" sequence
Reference 3: subgroups with < 25% identity between groups
Reference 4: sequences with N/C-terminal extensions
Reference 5: sequences with internal insertions

Reference 1, sequences with < 25% identity (charts: average score and all test scores for short, medium, and long sequences)

Reference 1, sequences with 20-40% identity (charts: average score and all test scores for short, medium, and long sequences)

Reference 1, sequences with > 35% identity (charts: average score and all test scores for short, medium, and long sequences)

Reference 2 (charts: average score and all test scores for short, medium, and long sequences)

Reference 3 (charts: average score and all test scores for short, medium, and long sequences)

Reference 4 and Reference 5 (charts: average score and all test scores)

Alignment Quality Comparison
Reference 1:
  < 25% identity: Similar *
  20-40% identity: Similar *
  > 35% identity: Similar
Reference 2: Similar *
Reference 3: TspMsa better
Reference 4: CLUSTALW better
Reference 5: Similar
* CLUSTALW slightly better for short sequences.
TspMsa and POA: TspMsa better
TspMsa and CLUSTALW: comparable

Test Results: Execution Time Evaluation

Fast Mode TspMsa
Most time-consuming step: pairwise distance calculations
- Slow mode: full dynamic programming (accurate)
- Fast mode: a fast approximate method (heuristic)
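The slides do not say which approximate method the fast mode uses. As one hypothetical stand-in, the Python sketch below estimates a distance from shared k-mers (short substrings) without running dynamic programming. The names `kmer_set` and `kmer_distance`, the choice k = 2, and the normalization by the smaller k-mer set are all assumptions made for illustration.

```python
def kmer_set(seq: str, k: int = 2) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(x: str, y: str, k: int = 2) -> float:
    """1 minus the fraction of shared k-mers, relative to the smaller set."""
    a, b = kmer_set(x, k), kmer_set(y, k)
    if not a or not b:
        return 1.0  # a sequence shorter than k shares nothing by convention
    return 1 - len(a & b) / min(len(a), len(b))


# GAATTC has 2-mers {GA, AA, AT, TT, TC}; GGATC has {GG, GA, AT, TC}.
print(kmer_distance("GAATTC", "GGATC"))  # 0.25
```

Each comparison here is linear in the sequence lengths (on average, using hash sets) rather than quadratic, which matters because all n(n-1)/2 pairs must be compared before a tour can be computed.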

Quality Impact of the Fast Mode

Execution Time Evaluation CLUSTALW and TspMsa in fast mode

Conclusions
Quality:
- Slow mode: close to CLUSTALW (slow mode); better than POA
- Fast mode (not as good as slow mode): comparable to CLUSTALW (fast mode); better than POA
Speed:
- Fast mode: faster than CLUSTALW (fast mode); comparable to POA

Acknowledgement
Dr. Robert Robinson
Dr. Russell Malmberg
Dr. Eileen Kraemer
Computer Science Department