1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen

Presentation transcript:

1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen

2 Overview: What is a phylogeny? The Generalized Tree Alignment problem Sequence Graphs and their algorithms The Deferred Path Heuristic

3 Phylogeny: Describes evolutionary model -Common ancestor -Mutations happen all the time -Insertions, deletions, substitutions, translocations, inversions, duplications … Most mutations happen in DNA replication -Corrected by cell mechanisms Mutations accumulate → new species diverge Only mutations in sex cells are inherited (obviously)

4 Phylogeny: Phylogenetic inference: Given n sequences build a phylogenetic tree Most methods base T on a multiple alignment Likewise: Multiple alignments often based on guide trees Can we solve both problems at the same time?

5 Phylogeny: Describes the evolutionary relationship between species Notice root

6 Phylogeny: ... or within a single taxon (here, human enterovirus 71)

7 The Problem: Given n sequences s_1,…,s_n … Multiple Alignment: Make an ordering A of the sequences by inserting gaps such that homologous bases are put in the same column. Phylogenetic Inference: Build a (binary) tree T with s_1,…,s_n in the leaves and possible ancestors s_{n+1},…,s_{n+k} in internal nodes describing their evolutionary connection.

8 Generalized Tree Alignment: Combines the two. The problem we want to solve is: Given: A set of n sequences s_1,…,s_n from n different species (could be DNA, RNA or protein; for simplicity we focus on DNA). Problem: Generate an unrooted phylogenetic tree T with sequences s_1,…,s_n in the leaves and a multiple alignment A of these sequences. Placing the root is not trivial and is best left to biologists.

9 The given problem is proven to be MAX SNP-hard (Wang and Jiang, 1994) → No polynomial-time approximation scheme exists unless P = NP. Exact solutions to NP-hard problems are intractable → The best we can hope for is a heuristic. The given algorithm runs in time O(n^2 · l^n), where n is the number of sequences and l is their maximum length.

10 Sequence graphs (Hein, 1989): Recall pairwise alignment. Traceback “spells” possible optimal alignments:

11 Sequence graphs: Make graph with alignment columns as edge labels → represents all optimal alignments We will get back to that shortly … Right now, we want to represent sequences Let us introduce sequence graphs. For instance, s = ACTGTA is represented by:

12 Sequence graphs: More formally: Directed, acyclic graph. Edge labels l from alphabet Σ. Here, Σ={ A,C,G,T,- } Source s: The unique node with no incoming edges Sink t: The unique node with no outgoing edges. Each path from s to t spells a sequence.

13 Sequence graphs: Represents a set of sequences given by all paths from s to t:

14 Sequence graphs: Any single sequence can be represented by a linear sequence graph Any set of k sequences can be represented by making k paths from s to t A given sequence s’ can be represented by more than one path We can now represent sequences – but can we align them?
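
The definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the presenter's implementation: the class name `SequenceGraph`, its methods, and the integer node numbering are hypothetical, and treating a `-` edge label as the empty string when spelling a path is an assumption.

```python
from collections import defaultdict


class SequenceGraph:
    """Directed acyclic graph whose source-to-sink paths spell sequences.

    Hypothetical illustration: nodes are integers, edges carry one label
    from the alphabet {A, C, G, T, -}.
    """

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink
        self.out_edges = defaultdict(list)  # node -> [(label, target), ...]

    def add_edge(self, u, v, label):
        self.out_edges[u].append((label, v))

    @classmethod
    def linear(cls, seq):
        """A single sequence becomes a linear graph with len(seq) edges."""
        g = cls(0, len(seq))
        for i, ch in enumerate(seq):
            g.add_edge(i, i + 1, ch)
        return g

    def spelled(self):
        """Set of sequences spelled by all paths from source to sink.

        Assumption: a '-' label contributes nothing to the spelled sequence.
        """
        results = set()
        stack = [(self.source, "")]
        while stack:
            node, prefix = stack.pop()
            if node == self.sink:
                results.add(prefix)
                continue
            for label, target in self.out_edges[node]:
                stack.append((target, prefix + ("" if label == "-" else label)))
        return results


g = SequenceGraph.linear("ACTGTA")
g.add_edge(1, 2, "G")  # a second edge between nodes 1 and 2 adds an alternative
```

Here `g.spelled()` enumerates every path, which is fine for small examples but grows quickly with the number of parallel edges; the point of the graph is that it stores all of those sequences compactly.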

15 Aligning sequence graphs: Dynamic programming algorithm inspired by basic pairwise alignment. Pairwise Alignment: Given two sequences p and q. Move one letter in p and move through q finding the optimal “partial alignments”. Sequence Graphs: Given two sequence graphs G_1 and G_2. We can have many outgoing edges to choose from.

16 Aligning sequence graphs: Fill in a |V_1| × |V_2| score matrix. For each pair of nodes i from G_1 and j from G_2: Should we: Align the two characters we got by following e_1 into i and e_2 into j? Stay in G_1 and only move in G_2? Stay in G_2 and only move in G_1? Or have we already found a better path into i and j?
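
The matrix fill described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the algorithm as published: graphs are given as edge lists `(u, v, label)` with nodes numbered in topological order (source 0, sink last), costs are unit mismatch and gap penalties, and the names `align_graphs` and `linear_edges` are hypothetical.

```python
def linear_edges(seq):
    """Edge list (u, v, label) of the linear sequence graph for seq."""
    return [(i, i + 1, ch) for i, ch in enumerate(seq)]


def align_graphs(n1, edges1, n2, edges2, mismatch=1, gap=1):
    """Fill the |V_1| x |V_2| score matrix for two sequence graphs.

    Assumptions: nodes are 0..n-1 in topological order (every edge has
    u < v), node 0 is the source, and the unit costs are illustrative.
    D[i][j] is the cheapest alignment of a path into i with a path into j.
    """
    def cost(a, b):
        if a == b:
            return 0
        return gap if "-" in (a, b) else mismatch

    # Incoming edges per node: into[v] = [(u, label), ...]
    into1 = [[] for _ in range(n1)]
    for u, v, a in edges1:
        into1[v].append((u, a))
    into2 = [[] for _ in range(n2)]
    for u, v, b in edges2:
        into2[v].append((u, b))

    inf = float("inf")
    D = [[inf] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            if i == 0 and j == 0:
                continue
            best = inf
            for u1, a in into1[i]:
                # Move only in G_1: the label a is aligned to a gap.
                best = min(best, D[u1][j] + cost(a, "-"))
                for u2, b in into2[j]:
                    # Move in both graphs: align labels a and b.
                    best = min(best, D[u1][u2] + cost(a, b))
            for u2, b in into2[j]:
                # Move only in G_2: the label b is aligned to a gap.
                best = min(best, D[i][u2] + cost("-", b))
            D[i][j] = best
    return D


D = align_graphs(5, linear_edges("ACTG"), 4, linear_edges("ACG"))
```

With two linear graphs this reduces to ordinary pairwise alignment, so `D[-1][-1]` is the edit distance between the two sequences. The only change for general graphs is that each cell takes the minimum over all incoming edges instead of a single predecessor.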

17 Optimal Alignment Graphs: Now we need a way to remember the optimal alignments. Recall graphs from before: Directed, acyclic graphs. Nodes s and t defined as before. Edge labels of the form [l_a, l_b] where l_a, l_b ∊ Σ. Backtrack through the matrix and consider each possible combination of edges.

18 Optimal Alignment Graphs: An example of an OAG: This one represents the alignments: We denote such a graph A* We have to convert the OAGs back to SGs

19 Optimal Alignment Graphs: This is done easily by considering the edge labels: If l_a = l_b: Make a single edge in the SG with label l_a. If l_a ≠ l_b: Make two edges in the SG: One with label l_a and one with label l_b. The graph from before turns into the SG:
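
The edge-label rule above is small enough to write out directly. In this sketch the edge-list representation `(u, v, (l_a, l_b))` and the function name `oag_to_sg` are assumptions made for illustration; only the rule itself comes from the slide.

```python
def oag_to_sg(oag_edges):
    """Convert optimal-alignment-graph edges into sequence-graph edges.

    Each OAG edge is (u, v, (l_a, l_b)). If l_a == l_b it yields one SG
    edge; otherwise it yields two parallel SG edges, one per label.
    """
    sg_edges = set()
    for u, v, (l_a, l_b) in oag_edges:
        sg_edges.add((u, v, l_a))
        sg_edges.add((u, v, l_b))  # the set drops the duplicate when l_a == l_b
    return sorted(sg_edges)


oag = [(0, 1, ("A", "A")), (1, 2, ("C", "T")), (2, 3, ("G", "-"))]
sg = oag_to_sg(oag)
```

The node set is carried over unchanged, so the resulting SG has the same source and sink as the OAG and can be fed straight back into the alignment step.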

20 Summing up Sequence Graphs: Final graph represents all sequences giving an optimal alignment between G_1 and G_2. We can: Represent a set of sequences by a sequence graph. Align two such graphs producing a new SG. We can now get on with the main algorithm.

21 The basic idea: Start by comparing all sequences –Find a closest pair. Represent all sequences giving the optimal solution –Defer the choice of a single sequence Repeat, but this time include the set of sequences In the end: Choose a single sequence and backtrack This shows a need for: -A compact representation of many sequences -An algorithm for aligning sets of sequences

22 The Deferred Path Heuristic: Similar to Kruskal’s algorithm for finding MSTs: From sequences s_1,…,s_n, initialize n SGs G_1,…,G_n. Until only two SGs remain: Align all pairs and choose a closest pair G_i and G_j. Create A*(G_i, G_j) and convert A* into an SG G_k. Replace G_i and G_j with G_k. Note that we remember all candidate sequences.
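
The merging loop above can be sketched independently of how graphs are aligned. The following is a hedged illustration, not the published algorithm: `merge_until_two` takes the distance and merge steps as parameters, and the stand-ins used here (a "graph" is just a set of strings, distance is the smallest pairwise edit distance, and merging keeps the union of both sets) are deliberate simplifications that replace real sequence-graph alignment. All names are hypothetical.

```python
from itertools import combinations


def edit_distance(p, q):
    """Unit-cost edit distance between two strings (row-by-row DP)."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, start=1):
        cur = [i]
        for j, b in enumerate(q, start=1):
            cur.append(min(prev[j] + 1,              # gap in q
                           cur[j - 1] + 1,           # gap in p
                           prev[j - 1] + (a != b)))  # match or mismatch
        prev = cur
    return prev[-1]


def set_distance(g1, g2):
    """Stand-in for graph alignment: closest pair of candidate sequences."""
    return min(edit_distance(p, q) for p in g1 for q in g2)


def set_merge(g1, g2):
    """Stand-in for building A* and converting it to an SG: keep all candidates."""
    return g1 | g2


def merge_until_two(graphs, distance, merge):
    """Greedy closest-pair merging until only two graphs remain.

    Returns a list of (tree, graph) pairs, where tree is a leaf index or a
    nested pair of subtrees recording the order in which graphs were joined.
    """
    active = list(enumerate(graphs))
    while len(active) > 2:
        a, b = min(combinations(range(len(active)), 2),
                   key=lambda ab: distance(active[ab[0]][1], active[ab[1]][1]))
        (tree_a, g_a), (tree_b, g_b) = active[a], active[b]
        merged = ((tree_a, tree_b), merge(g_a, g_b))
        active = [x for k, x in enumerate(active) if k not in (a, b)] + [merged]
    return active


leaves = [frozenset([s]) for s in ("ACGT", "ACGA", "TTTT", "TTTA")]
remaining = merge_until_two(leaves, set_distance, set_merge)
```

On this toy input the two close pairs are joined first, leaving two subtrees, `(0, 1)` and `(2, 3)`, which the final step of the heuristic would then connect. Each merged entry still carries every candidate sequence, which is the "deferred" part of the idea.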

23 The Deferred Path Heuristic: When only two SGs G_i and G_j remain: Align them and connect them in T. Choose some optimal alignment. This gives s_i and s_j in the root of the two subtrees. Backtrack through the subtrees. At each step: Align s_k to the underlying SGs. Choose some optimal alignment.

24 The Deferred Path Heuristic: We defer our choice of actual sequences until the last moment, thereby enlarging our solution space: