Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.


Alignment problem
• Given a set of sequences, produce a multiple alignment that corresponds as closely as possible to the biological relationships between the corresponding biomolecules

For homologous proteins
• Two residues should be aligned (on top of each other):
  • if they are homologous (evolved from the same residue in a common ancestor protein)
  • if they are structurally equivalent

Automatic approach
• Need a way of scoring alignments: a fitness function that quantifies the "goodness" of an alignment
• Need an algorithm for finding alignments with good scores
• Not all methods provide a scoring function for the final alignment!

Analysis of fitness function
• One can test whether the alignments that are optimal under a given fitness function correspond well to the biological relationships between the sequences
• This is possible, for example, if the structures of (some of) the proteins are known

Scoring Alignments
• In order to find an optimal alignment, we need to be able to measure how good an alignment is
• Sum of pairs (SP) method: in each column, score every pair of letters and total the scores. Pairs of gaps score 0.
• Total up the scores over all columns

SP Method Example
• Using the BLOSUM62 matrix, gap penalty -8
• k(k-1)/2 pairs per column
• In column 1, we have the pairs (-,S), (-,S), (S,S)

-IK
SIK
SSE

Column 1 score: -8 + -8 + 4 = -12
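The SP calculation in the example above can be sketched in Python. This is a minimal illustration, not a reference implementation: the names `pair_score`, `column_score` and `sp_score` are hypothetical, and `BLOSUM62_SUBSET` holds only the handful of BLOSUM62 entries needed for this three-sequence alignment rather than the full matrix.

```python
from itertools import combinations

GAP = "-"
GAP_PENALTY = -8  # gap penalty used in the slide example

# Only the BLOSUM62 entries this example needs (the matrix is symmetric,
# so each pair is stored once with its letters in sorted order).
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("I", "I"): 4, ("K", "K"): 5,
    ("I", "S"): -2, ("E", "K"): 1,
}

def pair_score(a, b):
    """Score one pair of symbols taken from the same column."""
    if a == GAP and b == GAP:
        return 0                      # pairs of gaps score 0
    if a == GAP or b == GAP:
        return GAP_PENALTY            # residue against gap
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def column_score(column):
    """Sum the scores of all k(k-1)/2 pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_score(rows):
    """Sum-of-pairs score: total of the column scores."""
    return sum(column_score(col) for col in zip(*rows))

alignment = ["-IK", "SIK", "SSE"]
print(column_score("-SS"))  # -8 + -8 + 4 = -12
print(sp_score(alignment))  # -12 + 0 + 7 = -5
```

With a full substitution matrix in place of the subset, the same three functions would score alignments of any alphabet covered by that matrix.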

Align by use of dynamic programming
• Dynamic programming finds the best alignment of k sequences under a given scoring scheme
• For two sequences there are three different column types
• For three sequences there are seven different column types (x means an amino acid, - a blank):

Sequence1  x - x x - - x
Sequence2  x x - x - x -
Sequence3  x x x - x - x
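The counts of three and seven column types follow from choosing "residue" or "blank" independently for each sequence and discarding the all-blank column, which gives 2^k - 1 types for k sequences. A short sketch (the function name `column_types` is made up for illustration):

```python
from itertools import product

def column_types(k):
    """All residue/blank patterns for one column of k sequences,
    excluding the column that consists only of blanks."""
    return [c for c in product("x-", repeat=k) if "x" in c]

for k in (2, 3, 4):
    print(k, len(column_types(k)))  # number of types is 2**k - 1

for column in column_types(3):
    print(" ".join(column))
```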

Use of dynamic programming
• Dynamic programming finds the best alignment of k sequences under a given scoring scheme

Algorithm for dynamic programming
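As a point of reference, the two-sequence case can be sketched as below. This is an assumed, simplified illustration of the general idea (a standard global-alignment recurrence with a flat match/mismatch/gap scheme), not the algorithm from the slide: the function name and the default scores are hypothetical. Each cell combines the three predecessor cells that correspond to the three column types for two sequences.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of s and t by dynamic programming."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # s[:i] against blanks only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # t[:j] against blanks only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # column (x, x)
                score[i - 1][j] + gap,      # column (x, -)
                score[i][j - 1] + gap,      # column (-, x)
            )
    return score[n][m]

print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("AC", "A"))
```

For k sequences the table becomes k-dimensional and each cell has 2^k - 1 predecessors, which is where the cost analysis on the next slide comes from.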

Analysis
• O(n^k) entries to fill
• Each entry combines O(2^k) other entries
• Costs O(k^2) to calculate each SP score
• Overall cost is O(k^2 2^k n^k), i.e. exponential in the number of sequences!
• NP-complete
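To get a feel for how quickly this bound grows, the expression k^2 * 2^k * n^k can simply be evaluated for a few values of k. The helper `dp_cost` below is a hypothetical name, and the numbers are operation counts up to a constant factor, not measured running times.

```python
def dp_cost(k, n):
    """Evaluate k^2 * 2^k * n^k for k sequences of length n."""
    return k**2 * 2**k * n**k

# Sequences of length 100: each extra sequence multiplies the cost
# by more than a factor of 100.
for k in range(2, 6):
    print(k, dp_cost(k, 100))
```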

General progressive alignment

Algorithm: progressive alignment of the sequences {s_1, s_2, ..., s_m}
var C: current set of alignments
begin
  C := ∅
  for i := 1 to m do
    C := C union {{s_i}}
  end                        (one alignment for each sequence)
  for i := 1 to m − 1 do
    choose two alignments A_p, A_q from C
    C := C − {A_p, A_q}
    A_r := align(A_p, A_q)
    C := C union {A_r}
  end                        (C now contains the single final alignment)
end
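The loop structure of this algorithm can be sketched in Python with the two open choices (which alignments to pick, and how to align them) passed in as functions. Everything named here is an assumption for illustration: `pad_align` is a deliberately trivial stand-in that only right-pads rows with gaps and is not a real aligner, and `first_two` always picks the first two alignments instead of consulting a guide tree.

```python
def progressive_alignment(seqs, align, choose):
    """Merge single-sequence alignments pairwise until one remains."""
    current = [[s] for s in seqs]          # one alignment per sequence
    while len(current) > 1:                # runs m - 1 times
        p, q = choose(current)             # indices of two alignments
        a_p, a_q = current[p], current[q]
        current = [a for i, a in enumerate(current) if i not in (p, q)]
        current.append(align(a_p, a_q))    # add the merged alignment
    return current[0]                      # the single final alignment

def pad_align(a, b):
    """Toy stand-in for align(): pads rows to equal width with gaps."""
    width = max(len(row) for row in a + b)
    return [row.ljust(width, "-") for row in a + b]

def first_two(current):
    """Toy stand-in for the choice step: always take the first two."""
    return 0, 1

result = progressive_alignment(["ACGT", "AC", "ACG"], pad_align, first_two)
print(result)
```

Keeping `align` and `choose` as parameters mirrors the pseudocode, which leaves both steps unspecified; a method such as Clustal fills them in with profile alignment and a guide tree.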

The Clustal Algorithm
• Three steps:
  1. Compare all pairs of sequences to obtain a similarity matrix
  2. Based on the similarity matrix, make a guide tree relating all the sequences
  3. Perform progressive alignment, where the order of the alignments is determined by the guide tree
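Steps 1 and 2 can be sketched as follows. This is a rough illustration under loud assumptions, not Clustal's actual procedure: pairwise similarity is approximated with `difflib.SequenceMatcher.ratio()` from the Python standard library instead of a pairwise alignment score, the guide tree is built by simple average-linkage clustering, and the four toy sequences and the names `similarity` and `guide_tree` are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Crude stand-in for a pairwise alignment score (range 0..1)."""
    return SequenceMatcher(None, a, b).ratio()

def guide_tree(seqs):
    """Build a nested-tuple guide tree from a dict of name -> sequence."""
    # Step 1: similarity for every pair of sequences.
    sim = {frozenset(p): similarity(seqs[p[0]], seqs[p[1]])
           for p in combinations(seqs, 2)}
    # Each cluster maps its subtree to the names of the leaves below it.
    clusters = {name: [name] for name in seqs}

    def average(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    # Step 2: repeatedly join the two most similar clusters.
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average(*p))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    return next(iter(clusters))

toy = {"a": "AAAA", "b": "AAAT", "c": "GGGG", "d": "GGGC"}
print(guide_tree(toy))
```

Reading the resulting tree from the leaves upward gives the merge order for step 3: the most similar sequences are aligned first, and the resulting alignments are combined last.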

[Figure: (A) steps 1 and 2, pairwise comparison and clustering/making the tree; (B) step 3, aligning according to the tree]

Clustal - summary
• Does not use a score for the final alignment
• Each pairwise alignment is done using dynamic programming
• Heuristics are used, tailored to globular proteins
• Graphical version: ClustalX

Phylogeny
• The basic principle is that the origin of similarity is common ancestry.
• The field of phylogeny has the goal of working out the relationships among species, populations, individuals, or genes.
• These relationships are usually expressed as a tree.

Phylogeny
• A statement of phylogeny among objects assumes homology and depends on classification.
• Phylogeny states a topology of the relationships based on classification according to similarity of one or more sets of characters, or on a model of evolutionary processes.

Phylogeny
• It is rare for species relationships and ancestry to be directly observable.
• Evolutionary trees determined from genetic data are often based on inferences from the patterns of similarity, which are all that is observable among species living now.