Multiple Sequence Alignment (MSA)


Presentation transcript:

1 Multiple Sequence Alignment (MSA)

2 Multiple Alignment
Number of sequences > 2
Global alignment
Seek an alignment that maximizes the score

3 Multiple Alignment
Multiple sequence alignment can be viewed as an extension of pairwise sequence alignment, but the complexity of the computation grows exponentially with the number of sequences being considered and their lengths. It is therefore not feasible to search exhaustively for the optimal alignment, even for a modest number of short sequences. For example, there are approximately 10^38 possible alignments that can be produced from five 10-nucleotide-long sequences.

4 Possible Alignments
TCG-GC-TCGAC
GAGCGT-GAT--
G-GCG-AG---C
CG-TCGTA--AC

TCGGCTCGAC
GAGCGTGAT-
GGCGAGC---
CGTCGTAAC-

… (about 10^38 possible alignments in total)

5 Multiple Alignment In addition to the scores employed by pairwise alignment, we have additional scores. However, we still seek an alignment that maximizes score

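6 Similarity Scoring Scheme
5 character states: A, C, T, G, - (gap)
Each column of four sequences is scored by its composition (how many times each state occurs):
{4,0,0,0,0} = 1
{3,1,0,0,0} = 0.75
{2,2,0,0,0} = 0.5
{2,1,1,0,0} = 0.5
{1,1,1,1,0} = 0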

7 Two Possible Alignments
Alignment 1:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
Score: 4*1 + 13*0.75 + 2*0.5 = 14.75

Alignment 2:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT-
GCGAAGAGGCGAGC---
GCCGTCGCGTCGTAAC-
Score: 1*1 + 3*0.75 + 12*0.5 + 1*0 = 9.25
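The column scoring scheme from slide 6 can be applied mechanically. The following is a minimal sketch, assuming exactly four sequences per alignment and treating the gap character as a fifth state; the names `COLUMN_SCORES`, `column_score` and `alignment_score` are illustrative and do not come from the slides.

```python
from collections import Counter

# Score for each column composition (state counts sorted in decreasing
# order, zeros dropped), following the scheme for four sequences.
COLUMN_SCORES = {
    (4,): 1.0,
    (3, 1): 0.75,
    (2, 2): 0.5,
    (2, 1, 1): 0.5,
    (1, 1, 1, 1): 0.0,
}


def column_score(column):
    """Score one column of four characters (A, C, G, T or '-')."""
    composition = tuple(sorted(Counter(column).values(), reverse=True))
    return COLUMN_SCORES[composition]


def alignment_score(rows):
    """Sum the column scores of an alignment given as equal-length strings."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(column_score(column) for column in zip(*rows))


# The two alignments shown on slide 7.
gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
ungapped = [
    "GTCGTAGTCGGCTCGAC",
    "GTCTAGCGAGCGTGAT-",
    "GCGAAGAGGCGAGC---",
    "GCCGTCGCGTCGTAAC-",
]

print("alignment 1:", alignment_score(gapped))
print("alignment 2:", alignment_score(ungapped))
```

The alignment with the higher total is preferred under this scheme; a different scheme (for example one that penalizes gaps) could rank the two differently.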

8 Alignments can be easy or difficult
(Figure panels: Easy, Difficult)

9 Multiple Alignment
Dynamic programming (exhaustive, exact)
– Consider 2 protein sequences of 100 amino acids in length.
– If it takes 10^3 seconds to exhaustively align these sequences, then it will take 10^4 seconds to align 3 sequences, 10^5 to align 4 sequences, etc.
– It will take ~10^21 seconds to align 20 sequences. One year is ~3 × 10^7 seconds. The age of the visible universe is ~10^18 seconds.
Progressive alignment (heuristic, approximate)
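The extrapolation on this slide is simple arithmetic and can be reproduced as follows. This is only a sketch of the slide's own rule of thumb (10^3 seconds for two sequences, ten times longer for each additional sequence), not a measurement of any real program; the function name `exhaustive_seconds` is made up for the illustration.

```python
SECONDS_PER_YEAR = 3e7       # approximate value quoted on the slide
AGE_OF_UNIVERSE_S = 1e18     # approximate value quoted on the slide


def exhaustive_seconds(n_sequences):
    """Slide's rule of thumb: 10^3 s for 2 sequences, tenfold per extra sequence."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    return 10 ** (n_sequences + 1)


for n in (2, 3, 4, 20):
    seconds = exhaustive_seconds(n)
    print(
        f"{n:2d} sequences: {seconds:.0e} s, "
        f"{seconds / SECONDS_PER_YEAR:.1e} years, "
        f"{seconds / AGE_OF_UNIVERSE_S:.1e} ages of the universe"
    )
```

For 20 sequences the estimate is about a thousand times the age of the visible universe, which is why heuristic methods are used instead.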

10 Progressive Alignment
Devised by Feng and Doolittle.
Essentially a heuristic method and, as such, not guaranteed to find the “optimal” or “best” alignment.
Requires pairwise alignments as a starting point.
One of the first successful implementations was Clustal (by Des Higgins et al.).

12 Sum of pairwise alignment scores
– For n sequences, there are n(n-1)/2 pairs
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
Based on pairwise alignment scores
– Build n by n table of pairwise scores
Align similar sequences first
– After alignment, consider as single sequence
– Continue aligning with further sequences
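A sum-of-pairs score can be sketched as below. The per-position scores (match +1, mismatch 0, residue against gap -1, gap against gap ignored) are assumptions chosen for the illustration; the slides do not specify them, and real programs use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score two rows of an alignment position by position (assumed scoring)."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue              # a shared gap carries no information
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows):
    """Add up the pairwise scores of all n(n-1)/2 pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


rows = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
n = len(rows)
print("pairs:", n * (n - 1) // 2)
print("sum of pairs:", sum_of_pairs(rows))
```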

13 First step:
Compute the pairwise alignments for all against all (6 pairwise alignments for the four sequences A, B, C and D).
The similarities are stored in a table.
(Table: pairwise similarity scores for sequences A, B, C and D)

14 Second step: Guide tree
Cluster the sequences to create a tree (guide tree):
– Represents the order in which pairs of sequences are to be aligned
– Similar sequences are neighbors in the tree
– Distant sequences are distant from each other in the tree
(Figure: similarity table and the resulting guide tree with leaves A, D, C, B)
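One way to turn a table of pairwise similarities into a guide tree is agglomerative clustering. The sketch below uses average linkage (UPGMA-style), which is an assumption made for the illustration and not necessarily the method used on the slides; the similarity values are hypothetical and were chosen so that A pairs with D and B with C.

```python
def build_guide_tree(names, similarity):
    """Cluster names into a nested-tuple tree by repeatedly merging the
    two clusters with the highest average pairwise similarity.

    similarity maps frozenset({a, b}) to the score of that pair.
    """
    # Each cluster is (tree, members); a tree is a name or a pair of trees.
    clusters = [(name, [name]) for name in names]

    def average_similarity(members1, members2):
        scores = [similarity[frozenset((a, b))] for a in members1 for b in members2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        index_pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = max(
            index_pairs,
            key=lambda ij: average_similarity(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical similarity scores (not the values from the slide).
scores = {
    frozenset("AD"): 10, frozenset("BC"): 8, frozenset("AC"): 3,
    frozenset("AB"): 2, frozenset("CD"): 2, frozenset("BD"): 1,
}
print(build_guide_tree(["A", "B", "C", "D"], scores))
```

The nested tuples give the order of alignment: the innermost pairs are aligned first, then the resulting alignments are aligned with each other.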

15 Guide Tree A guide tree is not a phylogenetic tree!

16 Third step:
Align most similar pairs
Align the alignments as if each of them was a single sequence (replace with a single consensus sequence or use a profile)
(Figure: guide tree with leaves A, D, C, B)

17 Alignment of alignments
Alignment X:
M Q T F
L H T W
L Q S W
Alignment Y:
L T I F
M T I W
X and Y aligned to each other:
M Q T - F
L H T - W
L Q S - W
L - T I F
M - T I W
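To align two alignments, each one is first summarized, either as a consensus sequence or as a profile. A minimal sketch of both summaries follows, using the two small alignments from this slide; the function names are illustrative, and the tie-breaking rule in `consensus` (first character seen in the column wins) is an arbitrary choice.

```python
from collections import Counter


def build_profile(rows):
    """For each column, map every character to its relative frequency."""
    n = len(rows)
    return [
        {char: count / n for char, count in Counter(column).items()}
        for column in zip(*rows)
    ]


def consensus(rows):
    """Most frequent character per column; ties go to the first one seen."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*rows)
    )


x = ["MQTF", "LHTW", "LQSW"]
y = ["LTIF", "MTIW"]

print(consensus(x))
for position, frequencies in enumerate(build_profile(y), start=1):
    print(position, frequencies)
```

A profile keeps the variation in each column, whereas a consensus discards it, so profiles generally carry more information into the next alignment step.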

19 Input File
Usually FASTA format (each record starts with a ">" header line)

20 Input file
> Sulfolobus acidocaldarius gi|152927|gp|J03218|SSOATPMA_1
MVSEGRVVRVNGPLVIADGMREAQMFEVVYVSDLKLVGEITRIE
> Thermococcus sp. gi| ATPase alpha subunit
MGRIIRVTGPLVVADGMKGAKMYEVVRVGEMGLIGEIIRLEGDKAVIQVYEETAGIRPGE
PVEGTGSSLS
> Acetabularia acetabulum gi| |gnl|PID|d adenosine triphosphatase A subunit
MSKAKEGDYGSIKKVSGPVVVADNMGGSAMYELVRVGTGELIGEIIRLEGDTATIQVYEE
TSGLTVGDGV
…
Output file: (figure)
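Reading FASTA input takes only a few lines. The sketch below assumes the simple form of the format (a header line starting with ">" followed by one or more sequence lines); the function name `parse_fasta` and the two demo records are made up for the illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


demo = """> demo_1 first made-up record
MVSEGRVV
RVNGPLVI
> demo_2 second made-up record
MGRIIRVT
"""
for name, sequence in parse_fasta(demo):
    print(name, len(sequence), sequence)
```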

21 MSA: Problems
Sequences that are similar only in some smaller regions may be misaligned because MSA tries to find global, rather than local, alignments.
A sequence that contains a large insertion compared to the rest may be misaligned because MSA tries to find global, rather than local, alignments.

22 MSA: Problems
A sequence that contains a repetitive element (such as a domain) may be misaligned against another sequence that contains only one copy.

23 MSA: Problems
Pairwise alignment is an optimal algorithm.
Multiple alignment is not an optimal algorithm.
Better alignments might exist! The algorithm yields a possible alignment, but not necessarily the best one.

24 MSA: Problems Because of the progressive methodology, errors at the beginning of the alignment are more important than errors at the end of the alignment.

25 Clustal: Advice
– Sequence weighting (Should we treat all the sequences as equals?)
– Position weighting (Should we treat all positions in the sequences as equals?)
– Varying substitution matrices
– Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions.
– Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments
– Vary “gap opening penalty” and “gap extension penalty”
– Discourage too many gaps, too close to one another
– Take into account hydrophilic and hydrophobic structures
– Avoid divergent sequences, as they are the most difficult to align

26 Alignment of protein-coding DNA sequences
It is not very sensible to align the DNA sequences of protein-coding genes.
Unaligned:
ATGCTGTTAGGG
ATGCTCGTAGGG
Aligned as DNA:
ATGCT-GTTAGGG
ATGCTCGTA-GGG
The result might be implausible and might not reflect what is known about biological processes.
It is much more sensible to translate the sequences into their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to the amino acid alignment.
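The last step, putting the gaps into the DNA according to the amino acid alignment, can be sketched as follows. The sketch assumes the protein alignment has already been produced elsewhere, that each DNA sequence contains exactly one codon per residue (no stop codon, no partial codons), and that "-" marks a gap; `thread_gaps` is a hypothetical helper name.

```python
def thread_gaps(aligned_protein, dna):
    """Insert '---' into a coding sequence wherever the aligned protein has '-'."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be three times the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(
        "---" if aa == "-" else next(codons) for aa in aligned_protein
    )


# The two sequences from the slide translate to MLLG and MLVG, so their
# protein alignment needs no gaps and the DNA stays in frame.
print(thread_gaps("MLLG", "ATGCTGTTAGGG"))
print(thread_gaps("MLVG", "ATGCTCGTAGGG"))

# Made-up example with a gap in the protein alignment.
print(thread_gaps("ML-G", "ATGCTGGGG"))
```

Because gaps are only ever inserted as whole codons, the resulting DNA alignment never breaks the reading frame.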

27 Effect of gap penalties on amino-acid alignment
Human pancreatic hormone precursor versus chicken pancreatic hormone
(a) Penalty for gaps is 0
(b) Penalty for a gap of size k nucleotides is w_k = k
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids
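The two penalty schemes, (a) no penalty and (b) a penalty equal to the gap length, can be compared on any aligned sequence by summing the penalty over its gaps. This is a small sketch with made-up input; `gap_lengths` and `total_gap_penalty` are illustrative names.

```python
import re


def gap_lengths(aligned):
    """Lengths of the maximal runs of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned)]


def total_gap_penalty(aligned, penalty):
    """Sum penalty(k) over every gap of length k in the aligned sequence."""
    return sum(penalty(k) for k in gap_lengths(aligned))


def no_penalty(k):
    return 0          # scheme (a): gaps are free


def linear_penalty(k):
    return k          # scheme (b): a gap of size k costs k


example = "AC--GT-A---"   # made-up aligned sequence
print(gap_lengths(example))
print(total_gap_penalty(example, no_penalty))
print(total_gap_penalty(example, linear_penalty))
```

With free gaps an aligner can insert as many gaps as it likes to maximize matches, which is why alignment (a) tends to look fragmented compared with (b).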

28 Anchored alignment: anchor points
The anchored-alignment procedure uses expert knowledge to improve multiple alignments. The user can specify a list of anchor points, each of which consists of a pair of equal-length segments that are to be aligned by the program. If residue x from one of the input sequences is paired to residue y from another sequence, then y is the only residue that can be aligned to x, and vice versa. All residues to the left and the right of x are aligned, respectively, to the residues to the left and the right of y.

29 Consider the following example:
>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK
The non-anchored default version of DIALIGN would calculate the following alignment for this input sequence set:
seq1 1 WKKNAD-----APKRamtsfmKAAY
seq2 1 WNLDTN-----SPEE------KQAYiqlaKDDriryd
seq3 1 WRMDSNqknpdSNNP------KAAYn---KGDsnapk

30 Now let's assume the user has some expert knowledge about a certain domain that is present in all of the input sequences; the domains in the three sequences are thought to be homologous to each other:
>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK
The user wants to define this motif as an anchor and align the rest of the sequences automatically, given the pre-defined constraints imposed by this anchor. Since anchor points are defined as pairs of equal-length segments, we need two anchor points to enforce alignment of the above motif.

31 For example, one could choose
Anchor point 1:
>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK
Anchor point 2:
>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK
If the above motif is to be aligned by our program, these two anchor points need to be specified.

32 Format for user-defined anchor points:
To specify a set of anchor points, a file with the coordinates of these anchor points is needed. Since each anchor point corresponds to an equal-length segment pair involving two of the input sequences, coordinates for anchor points are defined as follows:
(1) first sequence involved
(2) second sequence involved
(3) start of anchor in first sequence
(4) start of anchor in second sequence
(5) length of anchor
(6) score of the anchor point. This score is necessary to prioritize anchor points in case they are inconsistent with each other, i.e., if not all of them can be used simultaneously for the same alignment.
Each anchor point is specified by the above coordinates (in the above given order).

33 Thus, the above two anchor points are specified as follows: and 1.3 are (arbitrary) scores for anchor points 1 and 2.
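An anchor file with the six fields listed on slide 32 could be read as sketched below. Several things here are assumptions, not taken from the slides: one anchor per line, whitespace-separated fields, and 1-based sequence numbers and positions. The anchor values in the demo are hypothetical and are not the ones from the slide; `Anchor`, `parse_anchors` and `anchored_segments` are made-up names.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    seq1: int      # (1) first sequence involved
    seq2: int      # (2) second sequence involved
    start1: int    # (3) start of anchor in first sequence
    start2: int    # (4) start of anchor in second sequence
    length: int    # (5) length of anchor
    score: float   # (6) score used to prioritize inconsistent anchors


def parse_anchors(text):
    """Parse one anchor per line: five integers followed by a score."""
    anchors = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        *integers, score = fields
        anchors.append(Anchor(*map(int, integers), float(score)))
    return anchors


def anchored_segments(anchor, sequences):
    """Return the two equal-length segments an anchor pairs (1-based, assumed)."""
    first = sequences[anchor.seq1 - 1]
    second = sequences[anchor.seq2 - 1]
    return (
        first[anchor.start1 - 1:anchor.start1 - 1 + anchor.length],
        second[anchor.start2 - 1:anchor.start2 - 1 + anchor.length],
    )


sequences = [
    "WKKNADAPKRAMTSFMKAAY",
    "WNLDTNSPEEKQAYIQLAKDDRIRYD",
    "WRMDSNQKNPDSNNPKAAYNKGDANAPK",
]
# Hypothetical anchors, for illustration only.
demo = "1 2 1 1 4 2.0\n2 3 1 1 6 1.5\n"
for anchor in parse_anchors(demo):
    print(anchor, anchored_segments(anchor, sequences))
```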

34 MSA Approaches
– Progressive approach: CLUSTALW (CLUSTALX), T-COFFEE, PILEUP
– Iterative approach (repeatedly realigns subsets of sequences): MultAlin, DiAlign
– Statistical methods (e.g., Hidden Markov Models): SAM2K
– Genetic algorithms: SAGA

35 T-coffee
– An MSA program
– Uses principles similar to Clustal
– Combines sequence alignment methods with structural alignment techniques
– More accurate but longer running time
– Limits to the number of sequences it can align (~100)

36 MSA: Problems Giddy Landan and Dan Graur Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments. Molecular Biology and Evolution 24(6):

Genomic alignment (with MAUVE)