COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.



Outline
- Multiple Sequence Alignment: What, Why, and How
- Multiple Sequence Alignment Methods
  - Multidimensional dynamic programming
  - Star Alignment
  - Tree Alignment
  - Progressive Alignment
    - ClustalW: a widely used algorithm
  - Iterative Alignment
    - Genetic Algorithm

What is a Multiple Sequence Alignment?
- Pairwise alignments involve two sequences.
- Multiple sequence alignments involve more than two sequences (often hundreds, either nucleotide or protein).
- A formal definition: a multiple alignment of strings S_1, ..., S_k is a series of strings S_1', ..., S_k' with spaces such that
  - |S_1'| = ... = |S_k'|
  - S_j' is an extension of S_j by insertion of spaces
- Goal: find an optimal multiple alignment.
- Example:
  Hs ---MK LSLVAAML LLLSAARAEE EDKK-EDVGT VVGIDLGTTY
  Sp ---MKKFQLF SILSYFVALF LLPMAFASGD DNST-ESYGT VIGIDLGTTY
  Tg MTAAKKLSLF SLAALFCLLS VATLRPVAAS DAEEGKVKDV VIGIDLGTTY
  Pf MN QIRPYILLLI VSLLKFISAV DSN---IEGP VIGIDLGTTY
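The two conditions of the formal definition can be checked mechanically. The following is a minimal sketch, assuming "-" is the space (gap) character; the function name `is_multiple_alignment` and the toy strings are illustrative only and are not part of the lecture.

```python
def is_multiple_alignment(originals, aligned, gap="-"):
    """Check the two conditions of the formal definition of a multiple alignment."""
    if len(originals) != len(aligned):
        return False
    # Condition 1: all aligned strings S_j' have the same length.
    if len({len(row) for row in aligned}) > 1:
        return False
    # Condition 2: each S_j' is S_j with gaps inserted, so deleting the
    # gaps from S_j' must give back S_j exactly.
    return all(row.replace(gap, "") == s for s, row in zip(originals, aligned))


if __name__ == "__main__":
    seqs = ["MPE", "MKE", "MSKE", "SKE"]
    rows = ["-MPE", "-MKE", "MSKE", "-SKE"]
    print(is_multiple_alignment(seqs, rows))
```

Note that the check says nothing about whether the alignment is optimal; it only confirms that the rows form a valid alignment of the input strings.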

Why do we do multiple alignments?
- To reveal the relationship between a group of sequences (homology)
- Simultaneous alignment of similar gene sequences may
  - Discover the conserved regions in genes
  - Determine the consensus sequence of the aligned sequences
  - Help define a protein family that may share a common biochemical function or evolutionary origin, and thus reveal the evolutionary history of the sequences
  - Help predict the secondary and tertiary structures of new sequences

MSA Methods
- Multidimensional dynamic programming
  - Extension of DP to multiple (e.g. 3) sequences
- Star Alignment, Tree Alignment, Progressive Alignment
  - Start with an alignment of the most alike sequences and build the alignment by adding more sequences
- Iterative methods
  - Make an initial alignment of groups of sequences and revise the alignment to achieve a more reasonable result


Multiple Sequence Alignment by DP
- Pairwise sequence alignment
  - A scoring matrix where each position provides the best alignment up to that point
- Extension to 3 sequences
  - The lattice of a cube that is to be filled with calculated dynamic programming scores
  - Scoring positions on the 3 surfaces of the cube represent the alignment of a pair

Scoring of MSA: Sum of Pairs
- Score = sum over all possible pairs of amino acids in each column
- Using the BLOSUM62 matrix, gap penalty -8
- k(k-1)/2 pairs per column
- Example alignment:
  -IK
  SIK
  SSE
- In column 1 we have the pairs (-,S), (-,S), and (S,S): score = -8 - 8 + 4 = -12

Sum of Pairs
- Given 5 sequences:
  N C C E
  N N C E
  N - C N
  S C S N
  S C S E
- How many possible combinations of pairwise alignments are there for each position? With k = 5 sequences, each column has k(k-1)/2 = 10 pairs.

Sum of Pairs
- Assume match/mismatch/gap = 1/0/-1
  N C C E
  N N C E
  N - C N
  S C S N
  S C S E
- The 1st position: # of N-N pairs (3), # of S-S pairs (1), # of N-S pairs (6)
  SP(1) = 4*1 + 0*6 + (-1)*0 = 4
- The 2nd position: # of C-C pairs (3), # of N-C pairs (3), # of gap pairs (4)
  SP(2) = 3*1 + 0*3 + (-1)*4 = -1
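The column scores above can be reproduced with a few lines of code. This is a minimal sketch using the slide's match/mismatch/gap = 1/0/-1 scheme; the function names are illustrative, and scoring a gap-gap pair as 0 is an assumption (no such pair occurs in this example).

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one pair of characters from the same column."""
    if a == "-" and b == "-":
        return 0  # assumption: a gap aligned with a gap costs nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum-of-pairs score of one column: all k(k-1)/2 pairs."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """Sum-of-pairs score of a whole alignment (sum over columns)."""
    return sum(sp_column(col) for col in zip(*rows))


if __name__ == "__main__":
    rows = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
    columns = list(zip(*rows))
    print(sp_column(columns[0]))  # 4, as SP(1) on the slide
    print(sp_column(columns[1]))  # -1, as SP(2) on the slide
    print(sp_score(rows))
```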

Pairwise Alignment: Dynamic Programming Matrix
[Figure: dynamic programming matrix for Seq 1 and Seq 2. Each cell is reached by one of three moves: match/mismatch, gap in sequence 1, or gap in sequence 2.]

Multiple Sequence Alignment: Dynamic Programming Matrix
[Figure: three-dimensional dynamic programming matrix for Seq 1, Seq 2, and Seq 3, with many possible moves into each cell.]

DP Alignment Examples
- All three match/mismatch
- Sequences 1 & 2 match/mismatch with gap in 3
- Sequences 1 & 3 match/mismatch with gap in 2
- Sequences 2 & 3 match/mismatch with gap in 1
- Sequence 1 with gaps in 2 & 3
- Sequence 2 with gaps in 1 & 3
- Sequence 3 with gaps in 1 & 2
- Choose the largest value among the above seven possibilities
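The seven possibilities correspond to the seven non-empty ways of advancing in one, two, or all three sequences. The sketch below fills the three-dimensional table and returns only the optimal score. It assumes a sum-of-pairs column score with match/mismatch/gap = 1/0/-1 and a gap-gap pair scored as 0; these parameters and all names are illustrative choices, not taken from the lecture.

```python
from itertools import product


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap pairs are free
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(a, b, c):
    """Sum-of-pairs score of a three-character column."""
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)


# The seven moves: each sequence either advances (1) or takes a gap (0),
# excluding the move where nothing advances.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]


def align3_score(x, y, z):
    """Optimal sum-of-pairs score of a global alignment of three sequences."""
    neg_inf = float("-inf")
    n1, n2, n3 = len(x), len(y), len(z)
    F = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    F[0][0][0] = 0
    # Lexicographic order guarantees every predecessor cell is already filled.
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if i == j == k == 0:
            continue
        best = neg_inf
        for di, dj, dk in MOVES:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[pi] if di else "-"
            b = y[pj] if dj else "-"
            c = z[pk] if dk else "-"
            # Choose the largest value among the (up to) seven possibilities.
            best = max(best, F[pi][pj][pk] + column_score(a, b, c))
        F[i][j][k] = best
    return F[n1][n2][n3]


if __name__ == "__main__":
    print(align3_score("SMV", "SMV", "SMV"))  # 9: three columns, three matching pairs each
```

The table has (n1+1)(n2+1)(n3+1) cells, which is where the cubic cost for three sequences comes from.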

Computational Complexity
- For protein sequences, each 300 amino acids in length and excluding gaps, with the DP algorithm:
  - Two sequences: 300^2 comparisons
  - Three sequences: 300^3 comparisons
  - N sequences: 300^N comparisons
- O(L^N), where L is the length of the sequences and N is the number of sequences
- The number of comparisons and the memory required are too large for N > 3, so the method is not practical
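To get a feel for how quickly L^N grows, the counts can simply be printed. This is only an arithmetic illustration; the helper name `dp_cells` is made up for this sketch.

```python
def dp_cells(length, n_sequences):
    """Number of comparisons for N sequences of the given length: L**N."""
    return length ** n_sequences


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} sequences: {dp_cells(300, n):,} comparisons")
```

Two sequences need 90,000 comparisons and three need 27,000,000; each additional sequence multiplies the work by another factor of 300.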


Star Alignments
- Heuristic method for multiple sequence alignment
- Select a sequence s_c as the center of the star
- For each sequence s_i among s_1, ..., s_k with index i ≠ c, perform a global alignment of s_i with s_c (using DP)
- Aggregate the alignments with the principle "once a gap, always a gap"

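Star Alignments Example
- Sequences: s1: MPE, s2: MKE, s3: MSKE, s4: SKE, with s2 at the center of the star
- Pairwise alignments with the center:
  MPE    MSKE    SKE
  MKE    -MKE    MKE
- Merging the pairwise alignments one at a time:
  MPE    -MPE    -MPE
  MKE    -MKE    -MKE
         MSKE    MSKE
                 -SKE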

Choosing a center
- Try them all and pick the one with the best score
- Calculate all O(k^2) pairwise alignments, and pick the sequence s_c that maximizes the sum of its alignment scores with all other sequences, Σ_{i ≠ c} sim(s_c, s_i)
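The star procedure (pick a center, align every other sequence to it, merge with "once a gap, always a gap") can be sketched as follows. The scoring scheme match/mismatch/gap = 1/-1/-1 is an assumption made for this sketch, as are the names `nw` and `star_align`; with a different scoring scheme the pairwise alignments, and therefore the merged result, can differ from the slide's example.

```python
def nw(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, a_aligned, b_aligned)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback from the bottom-right corner.
    i, j, ra, rb = n, m, [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ra)), "".join(reversed(rb))


def star_align(seqs):
    """Return (center_index, aligned_rows) for a star alignment of seqs."""
    k = len(seqs)
    pair = {(i, j): nw(seqs[i], seqs[j]) for i in range(k) for j in range(k) if i != j}
    # Center = sequence with the largest summed score against all others.
    c = max(range(k), key=lambda i: sum(pair[i, j][0] for j in range(k) if j != i))
    order, msa = [c], [seqs[c]]  # msa[0] is always the (gapped) center row
    for j in range(k):
        if j == c:
            continue
        _, ca, sa = pair[c, j]  # center and s_j as aligned pairwise
        cols, new, p, q = [], [], 0, 0
        width = len(msa[0])
        while p < width or q < len(ca):
            if p < width and q < len(ca) and (msa[0][p] == "-") == (ca[q] == "-"):
                cols.append([row[p] for row in msa]); new.append(sa[q]); p += 1; q += 1
            elif p < width and msa[0][p] == "-":
                # Gap already in the MSA: keep it, pad the new sequence.
                cols.append([row[p] for row in msa]); new.append("-"); p += 1
            else:
                # New gap in the center: insert a gap column into every old row.
                cols.append(["-"] * len(msa)); new.append(sa[q]); q += 1
        msa = ["".join(col[r] for col in cols) for r in range(len(msa))] + ["".join(new)]
        order.append(j)
    rows = [None] * k
    for idx, row in zip(order, msa):
        rows[idx] = row
    return c, rows


if __name__ == "__main__":
    center, rows = star_align(["MPE", "MKE", "MSKE", "SKE"])
    print("center index:", center)
    print("\n".join(rows))
```

Gaps are only ever added to the growing alignment, never removed, which is what "once a gap, always a gap" means in practice.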

Star Alignment Example
- S1 = ATTGCCATT
- S2 = ATGGCCATT
- S3 = ATCCAATTTT
- S4 = ATCTTCTT
- S5 = ATTGCCGATT
[Table: pairwise alignment scores between s1, s2, s3, s4, and s5; the values are not preserved in the transcript.]

Star Alignments Example Merging Pairwise Alignment

Star Alignment Example Merging Pairwise Alignment

Analysis
- Assuming all sequences have length n
  - O(n^2) to calculate one global alignment
  - O(k^2) global alignments to calculate
- Using a reasonable data structure for joining alignments, merging is no worse than O(kl), where l is an upper bound on the alignment lengths
- O(k^2 n^2 + kl) = O(k^2 n^2) overall cost


Tree Alignment
- Compute the overall similarity based on pairwise alignments along the edges of a tree
- The weight of the edge between sequences s_1 and s_2 is sim(s_1, s_2)
- The sum of all these weights is the score of the tree

Consensus String
- The consensus string derived from a multiple alignment is the concatenation of the consensus characters for each column.
- The consensus character for a column is the character that minimizes the summed distance to it from all the characters in the column.
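The consensus definition translates directly into code once a character distance is fixed. The sketch below assumes a 0/1 distance (0 if the characters are equal, 1 otherwise), under which the consensus character is simply the most frequent character of the column. It also treats the gap as an ordinary character and breaks ties alphabetically; both are assumptions of this sketch, not rules stated in the lecture.

```python
def consensus(rows):
    """Consensus string of an alignment under a 0/1 character distance."""
    out = []
    for column in zip(*rows):
        candidates = sorted(set(column))  # sorted so that ties break deterministically

        def summed_distance(x, column=column):
            # Distance from candidate x to every character in the column.
            return sum(x != ch for ch in column)

        out.append(min(candidates, key=summed_distance))
    return "".join(out)


if __name__ == "__main__":
    print(consensus(["-MPE", "-MKE", "MSKE", "-SKE"]))  # -MKE
```

With a substitution-matrix-based distance, only `summed_distance` would need to change.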

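Tree Alignment Example
- Scoring system used: [not preserved in the transcript]
- Sequences: CAT, GT, CTG, CG
[Figure: tree whose nodes carry the aligned sequences CAT, -GT, CTG, and C-G.]
- We have a score of 8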

Tree Alignment Example

Example

Analysis
- We don't know the correct tree
- Without the tree, the tree alignment problem is NP-complete
- Likely only exponential-time solutions are available (for optimal answers)


Progressive Methods
- A DP-based MSA program is limited to 3 sequences, or to a small number of relatively short sequences
- Progressive alignment uses DP to build an MSA, starting with the most closely related sequences and then progressively adding less related sequences or groups of sequences to the initial alignment
- Most commonly used approach

Progressive Methods
- Progressive alignment is heuristic.
- It does not separate the process of scoring an alignment from the optimization algorithm.
- It does not directly optimize any global scoring function of "alignment correctness".
- It is fast and efficient, and the results are reasonable.
- We will illustrate this using ClustalW.

Progressive MSA occurs in 3 stages
1. Do a set of global pairwise alignments (Needleman and Wunsch)
2. Create a guide tree
3. Progressively align the sequences
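Stages 1 and 2 can be sketched in a few lines. The code below is a simplified stand-in, not ClustalW's actual procedure: it uses unit-cost edit distance (itself a global-alignment DP) as the pairwise distance, and average-linkage (UPGMA-style) clustering to build the guide tree, whereas ClustalW builds its tree with neighbor joining. All function names are illustrative.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost global alignment distance (Levenshtein), computed by DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # gap in b
                           cur[j - 1] + 1,              # gap in a
                           prev[j - 1] + (ca != cb)))   # match/mismatch
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Average-linkage clustering; returns a nested tuple of sequence indices."""
    d = {(i, j): edit_distance(seqs[i], seqs[j])
         for i, j in combinations(range(len(seqs)), 2)}

    def dist(m1, m2):
        # Average pairwise distance between two clusters' members.
        total = sum(d[min(i, j), max(i, j)] for i in m1 for j in m2)
        return total / (len(m1) * len(m2))

    clusters = [(i, [i]) for i in range(len(seqs))]  # (subtree, member indices)
    while len(clusters) > 1:
        # Join the two closest clusters first: these get aligned earliest.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]][1], clusters[p[1]][1]))
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return clusters[0][0]


if __name__ == "__main__":
    seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ATTGCCGATT"]
    print(guide_tree(seqs))
```

The nesting of the returned tuple gives the order for stage 3: innermost pairs are aligned first, and the resulting groups are then merged moving outward.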

ClustalW Procedure

Progressive Methods: ClustalW
- ClustalW is a general-purpose multiple alignment program for DNA or proteins.
- The W stands for "weighting": the program can assign weights to the sequences and to program parameters.
- ClustalX provides a graphical interface.

Use ClustalW to do a progressive MSA
- File input in GCG, FASTA, EMBL, GenBank, Phylip, or several other formats
- Operational options
- Output options
- Input options, matrix choice, gap opening penalty
- Gap information, output tree type

Progressive MSA stage 3 of 3: progressive alignment
- Make an MSA based on the order in the guide tree
- Start with the two most closely related sequences
- Then add the next closest sequence
- Continue until all sequences are added to the MSA

Problems with Progressive Alignment
- Highly sensitive to the choice of the initial pair to align
- The very first sequences to be aligned are the most closely related on the sequence tree
  - If these sequences align well, there are few errors in the initial alignment
  - The more distantly related these sequences are, the more errors there are
- Errors in the initial alignment are propagated to the MSA


Iterative Methods
- Results do NOT depend on the initial pairwise alignment (unlike progressive methods)
- Start with an initial alignment and repeatedly realign groups of the sequences
- Repeat until one MSA doesn't change significantly from the next
- With each iteration, the alignment improves
- An example is the genetic algorithm approach

Genetic Algorithms
- A general problem-solving method modeled on evolutionary change
- Inspired by the biological evolution process
- Uses the concepts of "Natural Selection" and "Genetic Inheritance" (Darwin 1859)
- Create a set of candidate solutions to your problem, and cause these solutions to evolve and become more and more fit over repeated generations
- Use survival of the fittest, mutation, and crossover to guide evolution

Genetic Search Algorithms
- Random generation (candidate solutions)
- Evaluation (fitness function)
- Selection (candidate solutions with larger fitness values have a larger chance of being included)
- Crossover + Mutation (change some of the selected candidate solutions, to converge toward the optimal solution and to avoid getting trapped in a local extremum)
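The four steps can be seen in a generic genetic search loop. The sketch below is deliberately not an MSA program: it evolves bit strings on a toy problem (maximize the number of ones) only to show random generation, evaluation, selection, crossover, and mutation. Keeping the best individual each generation (elitism), the parameter values, and the function name are all assumptions of this sketch; in an MSA setting the candidates would be alignments and the fitness an alignment score such as sum of pairs.

```python
import random


def genetic_search(fitness, length=20, pop_size=30, generations=50, p_mut=0.05, seed=0):
    """Toy genetic algorithm over bit strings; fitness must be non-negative."""
    rng = random.Random(seed)
    # Random generation of candidate solutions.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluation: rank candidates by fitness.
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = [scored[0][:]]  # elitism: the best candidate always survives
        # Selection: fitter candidates are more likely to become parents.
        weights = [fitness(ind) + 1 for ind in scored]  # +1 keeps weights positive
        while len(next_pop) < pop_size:
            mum, dad = rng.choices(scored, weights=weights, k=2)
            # Crossover: splice the two parents at a random cut point.
            cut = rng.randrange(1, length)
            child = mum[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)


if __name__ == "__main__":
    best = genetic_search(sum)
    print(sum(best), best)
```

Mutation is what lets the search leave a local extremum: crossover alone can only recombine material already present in the population.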
