Graphs and Graph Theory in Computational Biology Dan Gusfield Miami University, May 15, 2008 (four hour tutorial)

Some examples of graphs in biology Taken from the web - see the citations for details. Many other examples of graphs more complex than trees in biology.

From Max Delbrueck Center, Berlin

From http://www-personal.umich.edu/~mejn/networks/ Yeast protein interactions

Protein-Protein Interactions

Protein-Protein Interaction Modelling Dr. Peter Uetz Institut für Toxikologie und Genetik Forschungszentrum Karlsruhe

http://www.nytimes.com/interactive/2008/05/05/science/20080506_DISEASE.html NY Times May 5, 2008 The Diseasome

Graphs and Graph Theory 1. Numerous uses of graphs and networks to represent biological phenomena at many conceptual levels. Maybe several 1000s of papers using graph representations, particularly trees, but little graph theory. 2. A respectable number of papers that develop new non-trivial graph theory for problems in biology. 100s of papers, maybe 1000. 3. A handful of papers exploiting or extending non-trivial classic graph theory for problems in biology. Perhaps a few hundred.

Introduction and Conclusion Very diverse biological applications and very diverse graph theory. So no single grand reason for graphs and no single graph topic in biology. Lots of opportunity for graph theorists and graph algorithmists to develop or apply graph theory to biological problems. Even more opportunity for combinatorial optimization.

What I will do in this tutorial Emphasis on points 2 and 3, i.e., examples of the development of new non-trivial graph theory, and of the exploitation of classic graph theory. And (my apologies) I will mostly emphasize topics I have been involved with. Still, there are some hot biological areas today where graphs arise, and some graph topics that recur commonly, and I should point those out even if I will not talk in detail on those topics.

The digression Hot biology: Network biology -- biological phenomena that are represented by networks -- gene regulatory networks and protein interaction networks, just to name two. These form the core of Systems biology. Other relationships in biology represented by graphs and networks. Ex. diseasome. Recurring graph problems: graph problems in clustering data (ex. finding cliques or variants of cliques); variants of graph isomorphism in network motif or molecular pathway problems; need for more random graph theory for significance testing.

Clique Problems Clique problems are recurrent in clustering applications, but true cliques are computationally hard to find. Suggested research for graph theorists and algorithmists: computationally tractable, biologically meaningful alternatives to cliques. As examples: maximum density subgraphs; extreme sets in a graph.

Subgraph density Given a graph G, and a subset S of its nodes, let G(S) be the subgraph of G induced by S, i.e., G(S) has node set S and edge set E(S) consisting of all edges in G both of whose ends are in S. A Maximum Density subgraph of G is induced by a set of nodes S that maximizes |E(S)|/|S|. The maximum density subgraph can be found in polynomial time. It has the flavor of a maximum clique, but has different properties.
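To make the definition of |E(S)|/|S| concrete, here is a minimal sketch that evaluates the density of every node subset of a tiny graph. It is a brute-force illustration only, not the polynomial-time method the slide refers to; the function name and the example graph are hypothetical.

```python
from itertools import combinations

def densest_subgraph_bruteforce(nodes, edges):
    """Try every nonempty node subset S and return (|E(S)|/|S|, S) for the best one."""
    best_density, best_set = -1.0, None
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            # E(S): edges of G with both ends in S
            induced_edges = sum(1 for u, v in edges if u in chosen and v in chosen)
            density = induced_edges / len(chosen)
            if density > best_density:
                best_density, best_set = density, chosen
    return best_density, best_set

# Hypothetical graph: a 4-clique {a, b, c, d} plus a path d-e-f.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("d", "e"), ("e", "f")]
density, subset = densest_subgraph_bruteforce(nodes, edges)
print(density, sorted(subset))
```

On this example the clique wins (6 edges over 4 nodes), but in general a maximum density subgraph need not be a clique, which is the sense in which it "has different properties".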

Extreme Sets In an edge-weighted undirected graph G with node set V, a subset S of nodes of G is called an extreme set if for every nonempty proper subset S’ of S, the total weight of the edges crossing from S’ to V-S’ is larger than the total weight of the edges crossing from S to V-S. All the extreme sets in a graph can be found in polynomial time.
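The extreme-set condition can be checked directly from the definition on a small graph. The sketch below is exponential in |S| and is only meant to illustrate the definition (reading "subset" as nonempty proper subset); it is not the polynomial-time method mentioned on the slide, and the weights are made up.

```python
from itertools import combinations

def cut_weight(weights, subset):
    """Total weight of edges with exactly one end in subset."""
    return sum(w for (u, v), w in weights.items() if (u in subset) != (v in subset))

def is_extreme(weights, subset):
    """True if every nonempty proper subset of `subset` has a strictly larger cut."""
    members = sorted(subset)
    target = cut_weight(weights, set(members))
    for size in range(1, len(members)):
        for part in combinations(members, size):
            if cut_weight(weights, set(part)) <= target:
                return False
    return True

# Hypothetical weighted graph: a heavy triangle a-b-c attached to d-e by a light edge.
weights = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5, ("c", "d"): 1, ("d", "e"): 4}
print(is_extreme(weights, {"a", "b", "c"}))  # the tight cluster
print(is_extreme(weights, {"c", "d"}))       # a loosely tied pair
```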

Also There is also a great need for more sophisticated application of random graph theory in the study of biological networks. This is needed in order to establish null models to use in assessing the statistical significance of subgraphs, paths, patterns and motifs that are found in biological networks. We need to be able to distinguish observed patterns and subgraphs from those that occur with a high probability in a random graph, under a biologically appropriate model of randomness (an open field).

End of digression Start of the main tutorial: Examples of Graph Theory in Bioinformatics and Computational Biology

Outline Three Smaller examples: Euler paths and sequencing; Tanglegrams and co-evolution; Network Design and Multiple Alignment. Haplotyping by Perfect Phylogeny: Graph Realization. Phylogenetic Networks: Incompatibility Graph; Galled-Trees; Recombination Networks; The Decomposition Theorem and sufficient conditions. Multi-state Perfect Phylogeny and Chordal Graphs.

To start: Three small examples 1. Euler paths in sequencing and sequence assembly. 2. Tanglegrams and planarity testing in the study of co-evolution. 3. Application of Tree-Design approximations in multiple sequence alignment. Interplay between trees and strings.

Topic I: Eulerian paths in sequencing problems The general situation is that we have a molecule S (DNA, say) whose sequence is unknown, but we know all the k-mers that occur in S, for some fixed k. Given those k-mers, we want to determine S, if possible, or determine whatever is possible to determine about S. Note that k is not related to the alphabet size. A very useful approach to problems of this type is to build an Eulerian digraph, based on the (k-1)-mers.

Euler graph for general k For general k, there is one node for each (k-1)-mer contained in an observed k-mer. Then there is a directed edge from the node for (k-1)-mer A to the node for (k-1)-mer B if the (k-2)-suffix of A matches the (k-2)-prefix of B and overlapping A and B forms an observed k-mer. Example: k = 5 and we observe the 5-mer XXYZW. Then there will be a node for XXYZ and a node for XYZW and a directed edge from the first node to the second node. Those two nodes and the directed edge between them represent the 5-mer XXYZW. In some applications, there will be one such edge for each observation of that 5-mer.

The Euler graph derived from the sequence ACACGCAACTTAAA Ex. k = 3. The graph will have one node for each of the 2-mers in the observed 3-mers. Then there is a directed edge from the node for the 2-mer XY to the node for the 2-mer YZ, for each observed 3-mer XYZ. If a triple is observed more than once, there should be one directed edge for each observation of the triple.

The point: Every Eulerian path in the graph specifies a sequence whose k-mers match the given data, and conversely every sequence whose k-mers match the data specifies an Eulerian path in the graph. So the set of Eulerian paths specifies the set of candidate sequences for the unknown original sequence. Algorithms exist for efficiently finding Eulerian paths, for counting their number, for determining uniqueness etc. so we can use this representation to study the set of candidate sequences. Compare this approach to earlier efforts to represent the set of candidates by a graph with a Hamilton path: each node represents an observed k-mer, not a (k-1)-mer.
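The correspondence between Eulerian paths and candidate sequences can be sketched as follows. This is an illustrative sketch, assuming the k-mer multiset comes from a single sequence so that an Eulerian path exists; the function names are hypothetical, and the path search is the standard Hierholzer method, which returns one Eulerian path, not necessarily the one spelling the original sequence.

```python
from collections import defaultdict

def kmers_of(seq, k):
    """All k-mers of seq, one per occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_euler_graph(kmers):
    """One node per (k-1)-mer, one directed edge per observed k-mer."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's method; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Sequence spelled by a walk over (k-1)-mer nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])

original = "ACACGCAACTTAAA"          # the sequence used on the slide, k = 3
observed = kmers_of(original, 3)
path = eulerian_path(build_euler_graph(observed))
candidate = spell(path)
print(candidate)
```

The candidate has exactly the observed 3-mers, but it may differ from the original sequence when the graph has more than one Eulerian path.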

Making finer distinctions in Euler paths In general there may be many Eulerian paths in the graph, and we want some additional criteria to distinguish the goodness of one Eulerian path compared to another. Different biological considerations translate into having a value for each subpath of length two. Then the value of an Eulerian path P with n edges is the sum of the n-1 values of the n-1 length-two subpaths in P. The problem is to find an Eulerian path with maximum value. We have some reasonable approximations for that, but a simpler case can be solved optimally in polynomial time.

The case of a binary alphabet, but arbitrary k Since the alphabet size is two, each node in the graph has at most two incoming edges and two outgoing edges. Assume exactly two each. 011 110 001 101 Ex. k = 4

The case of a binary alphabet, but arbitrary k At any node, there are two possible ways for an Euler path to pass through the node. 011 110 001 101 Ex. k = 4 turning

The case of a binary alphabet, but arbitrary k At any node, there are two possible ways for an Euler path to pass through the node. 011 110 001 101 Ex. k = 4 crossing So in terms of subpaths of length two, we have two choices at each node.

Restating the optimal Euler path problem We are given an Eulerian graph where the in and out degrees are at most two at each node, and at each node there is a given value for the turning pair, and a value for the crossing pair. Then choose the turning or the crossing pairs at the nodes to maximize the total value of the choices, subject to the requirement that the choices create an Euler path in the graph.

Main Result The problem can be solved in polynomial time. The set of choices that give Euler paths has a matroidal structure, which allows a matroid-greedy algorithm to find the optimal Euler path. A more direct algorithm based on Minimum Spanning Trees also solves the problem.

The Matroid Structure At every node v, the edge pair (crossing or turning) which has the lowest value is called the low pair, and the other pair is the high pair. The difference in values is called the loss at v. A subset S of nodes is called independent if there is an Euler path in the graph where at every node in S, the low pair is chosen. As defined, the family of independent sets forms a matroid, and so we can find, by a greedy algorithm, an independent set which minimizes the loss - and this gives the optimal Euler path.

Topic II: Tanglegrams A Tanglegram is a pair of trees drawn in the plane with no crossing edges, with the same labeled leaf set. The leaves of one tree are displayed on a line, and the leaves of the other tree are displayed on a parallel line. A straight line connects each leaf in one tree to the leaf with the same label in the other tree. The number of crossing lines is a measure of the similarity of the trees.
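For a fixed drawing, the number of crossing lines depends only on the two leaf orders: two connecting lines cross exactly when their leaves appear in opposite relative order on the two lines. A minimal sketch of that count, with hypothetical leaf labels (choosing the tree layouts that minimize the count is the harder problem and is not attempted here):

```python
def count_crossings(first_order, second_order):
    """Number of pairs of leaf-to-leaf lines that cross, for two fixed leaf orders."""
    position = {leaf: i for i, leaf in enumerate(second_order)}
    ranks = [position[leaf] for leaf in first_order]
    # A crossing is an inversion: leaf x before y on one line, after y on the other.
    return sum(1
               for i in range(len(ranks))
               for j in range(i + 1, len(ranks))
               if ranks[i] > ranks[j])

print(count_crossings(["a", "b", "c", "d"], ["a", "b", "c", "d"]))
print(count_crossings(["a", "b", "c", "d"], ["b", "a", "d", "c"]))
print(count_crossings(["a", "b", "c", "d"], ["d", "c", "b", "a"]))
```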

Topic III: Multiple Sequence Alignment Interplay between sequences and trees. Exploitation of network design approximation.

Intro to Hours 2 and 3: Two “Post-HGP” Topics Two topics in Population Genomics: SNP Haplotyping in populations; Reconstructing a history of recombination. These topics in Population Genomics illustrate current challenges in biology, and illustrate the use of graph theory, combinatorial algorithms and discrete mathematics in biology.

What is population genomics? The Human genome “sequence” is done. Now we want to sequence many individuals in a population to correlate similarities and differences in their sequences with genetic traits (e.g. disease or disease susceptibility). Presently, we can’t sequence large numbers of individuals, but we can sample the sequences at SNP sites.

SNP Data A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more). SNP maps have been compiled with a density of about 1 site per 1000. SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.

Haplotype Map Project: HAPMAP NIH-led project ($100M) to find common SNP haplotypes (“SNP sequences”) in the Human population. Association mapping: HAPMAP used to try to associate genetic-influenced diseases with specific SNP haplotypes, to either find causal haplotypes, or to find the region near causal mutations. The key to the logic of Association mapping is historical recombination in populations. Nature has done the experiments, now we try to make sense of the results.

Topic IV: Perfect Phylogeny Haplotyping via Graph Realization

Genotypes and Haplotypes Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles (states) denoted by 0 and 1 (motivated by SNPs). Two haplotypes per individual: 0 1 1 1 0 0 1 1 0 and 1 1 0 1 0 0 1 0 0. Merge the haplotypes to get the genotype for the individual: 2 1 2 1 0 0 1 2 0.
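The merge operation can be written in one line: a site where the two haplotypes agree keeps that state, and a site where they differ becomes 2. A small sketch using the haplotype pair from the slide (the function name is hypothetical):

```python
def merge(haplotype_1, haplotype_2):
    """Genotype of an individual: shared state where the copies agree, 2 where they differ."""
    return [a if a == b else 2 for a, b in zip(haplotype_1, haplotype_2)]

h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge(h1, h2)
print(genotype)
```

Note that the merge loses information: swapping the states of the two haplotypes at any site marked 2 gives the same genotype.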

Haplotyping Problem Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect. Computational Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes. This is hopeless without a genetic model.

The Perfect Phylogeny Model for SNP sequences Ancestral sequence 00000 at the root; site mutations (sites 1 to 5) on the edges; extant sequences at the leaves. The tree derives the set M: 10100 10000 01011 01010 00010. Only one mutation per site allowed.

When can a set of sequences be derived on a perfect phylogeny? Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1. This is the 4-Gamete Test.
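The 4-gamete test is a direct pairwise check over columns. The sketch below is the simple quadratic-in-sites version, not the O(nm) method cited on the next slide; it is applied to the set M from the perfect phylogeny model slide and to a small made-up failing example. Function names are hypothetical.

```python
from itertools import combinations

def incompatible_pairs(matrix):
    """Pairs of sites (1-based) whose columns contain all four of 00, 01, 10, 11."""
    num_sites = len(matrix[0])
    failing = []
    for i, j in combinations(range(num_sites), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            failing.append((i + 1, j + 1))
    return failing

def passes_four_gamete_test(matrix):
    return not incompatible_pairs(matrix)

M = ["10100", "10000", "01011", "01010", "00010"]   # set M from the slides
bad = ["00", "01", "10", "11"]                       # hypothetical failing input
print(passes_four_gamete_test(M), incompatible_pairs(bad))
```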

So, in the case of binary characters, if each pair of columns allows a tree, then the entire set of columns allows a tree. For M of dimension n by m, the existence of a perfect phylogeny for M can be tested in O(nm) time and a tree built in that time, if there is one. Gusfield, Networks 91 We will use the classic theorem in two more modern and more genetic applications.

The Perfect Phylogeny Model We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed. In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root. Justification: haplotype blocks; rare recombination; a base problem whose solution can be modified to incorporate more biological complexity.

Perfect Phylogeny Haplotype (PPH) Given a set of genotypes S, find an explaining set of haplotypes that fits a perfect phylogeny. A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. Example: The merge of 0 1 and 1 0 explains 2 2. Genotype matrix S (sites 1, 2): a = 2 2; b = 0 2; c = 1 0.

The PPH Problem Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny. Genotypes (sites 1, 2): a = 2 2; b = 0 2; c = 1 0. Explaining haplotypes: a = 1 0 and 0 1; b = 0 0 and 0 1; c = 1 0 and 1 0.

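The Haplotype Phylogeny Problem Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny. Genotypes (sites 1, 2): a = 2 2; b = 0 2; c = 1 0. Explaining haplotypes: a = 1 0 and 0 1; b = 0 0 and 0 1; c = 1 0 and 1 0. [Figure: the perfect phylogeny for these haplotypes, with mutations 1 and 2 on edges and leaf sequences 10, 01 and 00.]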

The Alternative Explanation Genotypes (sites 1, 2): a = 2 2; b = 0 2; c = 1 0. Explaining haplotypes: a = 1 1 and 0 0; b = 0 0 and 0 1; c = 1 0 and 1 0. No tree possible for this explanation.
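For an instance this small, the two explanations above can be found by exhaustive search. The sketch below enumerates every way of resolving the 2's and keeps the resolutions that pass the 4-gamete test; adding the all-0 sequence before testing is my way of enforcing the all-0 root of the model. It is exponential and purely illustrative, not one of the efficient PPH algorithms; all function names are hypothetical.

```python
from itertools import combinations, product

def explanations(genotype):
    """All unordered haplotype pairs whose merge gives the genotype."""
    het_sites = [i for i, state in enumerate(genotype) if state == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het_sites, bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def passes_four_gamete_test(haplotypes):
    num_sites = len(haplotypes[0])
    return all(len({(h[i], h[j]) for h in haplotypes}) < 4
               for i, j in combinations(range(num_sites), 2))

def pph_solutions(genotypes):
    """Brute force: one explanation per genotype, tested together with the all-0 root."""
    root = tuple([0] * len(genotypes[0]))
    solutions = []
    for choice in product(*(explanations(g) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if passes_four_gamete_test(haplotypes + [root]):
            solutions.append(choice)
    return solutions

S = [(2, 2), (0, 2), (1, 0)]     # genotypes a, b, c from the slides
solutions = pph_solutions(S)
print(solutions)
```

Only the explanation of genotype a by 0 1 and 1 0 survives; resolving a as 1 1 and 0 0 puts all four gametes in the two columns.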

Efficient Solutions to the PPH problem - n genotypes, m sites. Reduction to a graph realization problem (GPPH) - build on Bixby-Wagner or Fujishige solution to graph realization, O(nm alpha(nm)) time. Gusfield, Recomb 02. Reduction to graph realization - build on Tutte’s graph realization method, O(nm^2) time. Chung, Gusfield 03. Direct, from scratch combinatorial approach - O(nm^2). Bafna, Gusfield et al JCB 03. Berkeley (EHK) approach - specialize the Tutte solution to the PPH problem - O(nm^2) time. Linear-time solutions - Recomb 2005, Ding, Filkov, Gusfield and a different linear time solution.

The Reduction Approach This is the original polynomial time method. Conceptually simplest at a high level (but not at the implementation level) and most extendable to other problems; nearly linear-time but not linear-time.

The case of the 1’s 1) For any row i in S, the set of 1 entries in row i specifies the exact set of mutations on the path from the root to the least common ancestor of the two leaves labeled i, in every perfect phylogeny for S. 2) The order of those 1 entries on the path is also the same in every perfect phylogeny for S, and is easy to determine by “leaf counting”.

Leaf Counting Genotype matrix S (sites 1 to 7): a = 1010000; b = 0101000; c = 1200202; d = 2200020. In any column c, count two for each 1, and count one for each 2. The total is the number of leaves below mutation c, in every perfect phylogeny for S. Counts for sites 1 to 7: 5 4 2 2 1 1 1. So if we know the set of mutations on a path from the root, we know their order as well.
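Leaf counting is a single pass over the columns. The sketch below reproduces the counts for the matrix S on this slide and then orders the 1 entries of a row by decreasing count, on the reasoning that a mutation with more leaves below it lies closer to the root. Function names are hypothetical, and ties between equal counts are left in site order.

```python
def leaf_counts(S):
    """Per column: two for each 1, one for each 2."""
    num_sites = len(S[0])
    return [sum(2 if row[c] == 1 else 1 if row[c] == 2 else 0 for row in S)
            for c in range(num_sites)]

def path_order(row, counts):
    """Sites (1-based) with a 1 in this row, ordered from the root downward."""
    ones = [c for c, state in enumerate(row) if state == 1]
    return [c + 1 for c in sorted(ones, key=lambda c: -counts[c])]

S = [
    [1, 0, 1, 0, 0, 0, 0],  # a
    [0, 1, 0, 1, 0, 0, 0],  # b
    [1, 2, 0, 0, 2, 0, 2],  # c
    [2, 2, 0, 0, 0, 2, 0],  # d
]
counts = leaf_counts(S)
print(counts)
print(path_order(S[0], counts), path_order(S[1], counts))
```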

Simple Conclusions Row i (sites 1 to 7): 0 1 0 1 2 2 2. The order is known for the red mutations together with the leftmost blue(?) mutation. [Figure: subtree for row i data, below the root, with edges labeled 2, 4 and 5.]

But what to do with the remaining blue entries (2’s) in a row?

More Simple Tools 3) For any row i in S, and any column c, if S(i,c) is 2, then in every perfect phylogeny for S, the path between the two leaves labeled i must contain the edge with mutation c. Further, every mutation c on the path between the two i leaves must be from such a column c.

From Row Data to Tree Constraints Row i (sites 1 to 7): 0 1 0 1 2 2 2. Edges 5, 6 and 7 must be on the blue path, and 5 is already known to follow 4, but we don’t know where to put 6 and 7. [Figure: subtree for row i data, below the root, ending at the two leaves labeled i.]

The Graph Theoretic Problem Given a genotype matrix S with n sites, and a red-blue subgraph for each row i, create a directed tree T where each integer from 1 to n labels exactly one edge, so that each subgraph is contained in T.

Powerful Tool: Tree and Graph Realization Let Rn be the integers 1 to n, and let P be an unordered subset of Rn. P is called a path set. A tree T with n edges, where each is labeled with a unique integer of Rn, realizes P if there is a contiguous path in T labeled with the integers of P and no others. Given a family P1, P2, P3…Pk of path sets, tree T realizes the family if it realizes each Pi. The graph realization problem generalizes the consecutive ones problem, where T is a path. More generally, each set specifies a fundamental cycle in the unknown graph.

Tree Realization Example Path sets: P1: 1, 5, 8; P2: 2, 4; P3: 1, 2, 5, 6; P4: 3, 6, 8; P5: 1, 5, 6, 7. [Figure: realizing tree T with edges labeled 1 to 8.] More generally, think of each path set as specifying a fundamental cycle containing the edges in the specified path.
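Checking that a given tree realizes a family of path sets is easy, even though finding such a tree is not. In a tree, a set of edges forms a contiguous path exactly when the edges are connected and no node touches more than two of them. The sketch below applies that check to the five path sets of the example. The tree drawn on the slide is not recoverable from this text, so the tree used here (node names a to i) is one hypothetical realization, and this is only a verifier, not a graph realization algorithm.

```python
from collections import defaultdict

def realizes(tree_edges, path_set):
    """True if the labeled edges in path_set form one contiguous path in the tree."""
    if not path_set:
        return False
    adjacency = defaultdict(list)
    for label in path_set:
        u, v = tree_edges[label]
        adjacency[u].append(v)
        adjacency[v].append(u)
    # No node may touch more than two of the chosen edges.
    if any(len(neighbors) > 2 for neighbors in adjacency.values()):
        return False
    # The chosen edges must be connected (a tree has no cycles to worry about).
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adjacency[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adjacency)

# Hypothetical tree: edge label -> (end node, end node).
T = {2: ("a", "b"), 1: ("b", "c"), 5: ("c", "d"), 6: ("d", "e"),
     8: ("d", "f"), 4: ("a", "g"), 3: ("e", "h"), 7: ("e", "i")}
family = [{1, 5, 8}, {2, 4}, {1, 2, 5, 6}, {3, 6, 8}, {1, 5, 6, 7}]
print(all(realizes(T, P) for P in family))
```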

Graph Realization Polynomial time (almost linear-time) algorithms exist for the graph realization problem, given the family of fundamental cycles the unknown graph should contain - Whitney, Tutte, Cunningham, Edmonds, Bixby, Wagner, Gavril, Tamari, Fujishige, Lofgren, 1930’s - 1980’s. Most of the literature on this problem is in the context of determining if a binary matroid is graphic. The algorithms are not simple; none implemented before 2002.

Reducing PPH to graph realization We solve any instance of the PPH problem by creating appropriate path sets, so that a solution to the resulting graph realization problem leads to a solution to the PPH problem instance. The key issue: How to encode the needed subgraph for each row, and glue them together at the root.

From Row Data to Tree Constraints Row i (sites 1 to 7): 0 1 0 1 2 2 2. Edges 5, 6 and 7 must be on the blue path, and 5 is already known to follow 4. [Figure: subtree for row i data, below the root, ending at the two leaves labeled i.]

Encoding a Red-Blue directed path For the directed path with edges 2, 4, 5, the path sets are P1: U, 2; P2: U, 2, 4; P3: 2, 4; P4: 2, 4, 5; P5: 4, 5. U is a glue edge used to glue together the directed paths from the different rows. With these path sets, the path U, 2, 4, 5 is forced in T.

Now add a path set for the blues in row i. Row i (sites 1 to 7): 0 1 0 1 2 2 2. P: 5, 6, 7

That’s the Reduction The resulting path-sets encode everything that is known about row i in the input. The family of path-sets is input to the graph-realization problem, and every solution to that graph-realization problem specifies a solution to the PPH problem, and conversely. Whitney (1933?) characterized the set of all solutions to graph realization (based on the three-connected components of a graph) and Tarjan et al showed how to find these in linear time.

An implicit representation of all solutions Whitney (1930) proved that a graph realization problem has a unique solution if and only if the graph is three-connected. That is, at least three nodes must be removed in order to disconnect the graph (assuming it is connected). Whitney (1931) proved that if the solution is not unique, then there is a semi-unique decomposition of the graph into three-connected components, so that the graph realizations are in one-to-one correspondence with all the ways that these components can be “twisted” relative to each other. So the number of solutions is 2^(number of three-connected components - 1).

Tree Realization Example Path sets: P1: 1, 5, 8; P2: 2, 4; P3: 1, 2, 5, 6; P4: 3, 6, 8; P5: 1, 5, 6, 7. [Figure: realizing tree T with edges added to create a fundamental cycle for each path.]

Topic V: Phylogenetic Networks with Recombination

Incompatible Sites A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible. A site that is not in such a pair is compatible.

A richer model M: 10100 10000 01011 01010 00010, with 10101 added. Pair 4, 5 fails the four-gamete test. The sites 4, 5 are incompatible. Real sequence histories often involve recombination.

Sequence Recombination Single crossover recombination: P = 10100, S = 01011, recombination point 5, result 10101. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix). A recombination of P and S at recombination point 5.
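Single crossover recombination is a prefix-suffix splice. A minimal sketch using the slide's convention that sites before the recombination point come from P and sites from the point onward come from S (the function name is hypothetical):

```python
def recombine(prefix_parent, suffix_parent, point):
    """Sites 1..point-1 from prefix_parent, sites point..m from suffix_parent (1-based)."""
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

child = recombine("10100", "01011", 5)
print(child)
```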

Network with Recombination: ARG M: 10100 10000 01011 01010 00010 and the new sequence 10101. The previous tree with one recombination event (recombination point 5) now derives all the sequences.

A Min ARG for Kreitman’s data ARG created by SHRUB

An illustration of why we are interested in recombination: Association Mapping of Complex Diseases Using ARGs

Association Mapping A major strategy being practiced to find genes influencing disease from haplotypes of a subset of SNPs. Disease mutations: unobserved. A simple example to explain association mapping and why ARGs are useful, assuming the true ARG is known. [Figure: haplotype 01001 with its SNP sites and the disease mutation site marked.]

Very Simplistic Mapping the Unobserved Mutation of Mendelian Diseases with ARGs Assumption (for now): A sequence is diseased iff it carries the single disease mutation. [Figure: ARG with root 00000 deriving a:00010 b:10010 c:00100 d:10100 e:01100 f:01101 g:00101, with the diseased sequences marked.] Where is the disease mutation? What part of 01100 do d, e, f inherit? The single disease mutation occurs between sites 2 and 3!

Mapping Disease Gene with Inferred ARGs “..the best information that we could possibly get about association is to know the full coalescent genealogy…” - Zollner and Pritchard, 2005. But we do not know the true ARG! Goal: infer ARGs from SNP data for association mapping. Not easy, and often done approximately (e.g. Zollner and Pritchard). Improved results for the inference: Y. Wu (RECOMB 2007)

Results on Reconstructing the Evolution of SNP Sequences Part I: Clean mathematical and algorithmic results: Galled-Trees, near-uniqueness, graph-theory lower bound, and the Decomposition theorem. Part II: Practical computation of Lower and Upper bounds on the number of recombinations needed. Construction of (optimal) phylogenetic networks; uniform sampling; haplotyping with ARGs; LD mapping … Part III: Varied Biological Applications. Part IV: Extension to Gene Conversion. Part V: The Minimum Mosaic Model of Recombination. This talk will discuss topics in Part I.

Problem: If not a tree, then what? If the set of sequences M cannot be derived on a perfect phylogeny (true tree) how much deviation from a tree is required? We want a network for M that uses a small number of recombinations, and we want the resulting network to be as “tree-like” as possible.

A tree-like network for the same sequences generated by the prior network: a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101. [Figure: the tree-like network.]

Recombination Cycles In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. The cycle specified by those two paths is called a “recombination cycle”.

Galled-Trees A phylogenetic network where no recombination cycles share an edge is called a galled-tree. A cycle in a galled-tree is called a gall. Question: if M cannot be generated on a true tree, can it be generated on a galled-tree?

Results about galled-trees Theorem: There is an efficient (provably polynomial-time) algorithm to determine whether or not any sequence set M can be derived on a galled-tree. Theorem: A galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used over all possible phylogenetic networks. Theorem: If M can be derived on a galled-tree, then the galled-tree is “nearly unique”. This is important for biological conclusions derived from the galled-tree. Papers from 2003-2007.

Elaboration on Near Uniqueness Theorem: The number of arrangements (permutations) of the sites on any gall is at most three, and this happens only if the gall has two sites. If the gall has more than two sites, then the number of arrangements is at most two. If the gall has four or more sites, with at least two sites on each side of the recombination point (not the side of the gall) then the arrangement is forced and unique. Theorem: All other features of the galled-trees for M are invariant.

A whiff of the ideas behind the results

Incompatibility Graph G(M) THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph. M (sites 1 to 5): a = 00010; b = 10010; c = 00100; d = 10100; e = 01100; f = 01101; g = 00101. G(M) has one node per site. Two nodes are connected iff the pair of sites are incompatible, i.e., fail the 4-gamete test. [Figure: G(M), with sites 1, 3, 4 in one connected component and sites 2, 5 in another.]

The connected components of G(M) are very informative Theorem: The number of non-trivial connected components is a lower bound on the number of recombinations needed in any network. Theorem: When M can be derived on a galled-tree, all the incompatible sites in a gall must come from a single connected component C, and that gall must contain all the sites from C. Compatible sites need not be inside any gall. In a galled-tree the number of recombinations is exactly the number of non-trivial connected components in G(M), and hence is minimum over all possible phylogenetic networks for M.
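The incompatibility graph and its component lower bound can be computed directly from M. The sketch below builds G(M) for the seven sequences a to g used in these slides, then lists the non-trivial connected components (those with at least one edge); their number is the lower bound stated in the first theorem. Function names are hypothetical.

```python
from itertools import combinations

def incompatibility_edges(M):
    """Pairs of sites (1-based) that fail the 4-gamete test."""
    num_sites = len(M[0])
    return [(i + 1, j + 1)
            for i, j in combinations(range(num_sites), 2)
            if len({(row[i], row[j]) for row in M}) == 4]

def nontrivial_components(num_sites, edges):
    """Connected components of G(M) that contain at least one edge."""
    adjacency = {site: set() for site in range(1, num_sites + 1)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for site in adjacency:
        if site in seen or not adjacency[site]:
            continue  # already placed, or a compatible (isolated) site
        component, stack = [], [site]
        seen.add(site)
        while stack:
            x = stack.pop()
            component.append(x)
            for y in adjacency[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        components.append(sorted(component))
    return components

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]  # a..g
edges = incompatibility_edges(M)
components = nontrivial_components(5, edges)
lower_bound = len(components)
print(edges, components, lower_bound)
```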

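[Figure: the tree-like network for a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101, shown next to its Incompatibility Graph, which has sites 1, 3, 4 in one connected component and sites 2, 5 in another.]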

A Graph Theoretic Necessary Condition for a Galled-Tree If M can be generated on a galled-tree, then the incompatibility graph must be a bipartite bi-convex graph. Other structural properties of the conflict graph can be deduced and exploited.

Galled-Tree Haplotyping Problem: Given genotype matrix G, if there is no PPH solution for G, is there a haplotyping H for G such that H can be derived on a Galled-Tree?

A different Necessary Condition for a one-gall tree 1. There exists a set of sequences S such that for every pair of incompatible sites p,q, a single p,q state-pair appears in all sequences in S, and does not appear in any sequence outside S. 2. There must be a number x such that p < x < q, for each incompatible pair p,q.

4 1 3 25 a: 000100 b: 100100 d: 101000 c: 001000 f: 011000 g: 001010 2 p s 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 abcdefgabcdefg H 6 e:1010001 S = {e,d} the sequences below the recombination node. Example

Surprising Result - Yun Song The necessary condition is also sufficient. Yun S. Song in TCBB 2006

Coming full circle - back to genotypes When can a set of genotypes be explained by a set of haplotypes derived on a galled-tree, rather than on a perfect phylogeny? The Song NASC can be translated into an ILP, using the part of the MinIncompat ILP that identifies which site pairs are incompatible.

For the one gall problem, the ILP formulation solves very efficiently (200 rows x 40 sites in seconds to minutes). So far, the 2-gall case does not solve well (ongoing work). (Dan Brown, Gusfield 2006).

Coming full circle - back to genotypes When can a set of genotypes be explained by a set of haplotypes that are derived on a galled-tree, rather than on a perfect phylogeny? Recently, we developed an Integer Linear Programming solution to this problem, and are now testing the practical efficiency of it. (Brown, Gusfield).

Change of Scope: Minimizing Recombinations in unconstrained networks Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M, allowing only one mutation per site. This has biological meaning in appropriate contexts. We can solve this problem in poly-time for the special case of Galled-Trees. The minimization problem is NP-hard in general.

Minimization is an NP-hard Problem What we have done: 1. Solve small data-sets optimally with exponential-time methods or with algorithms that work well in practice; 2. Efficiently compute lower and upper bounds on the number of needed recombinations. 3. Apply these methods to address specific biological and bio-tech questions.

The Decomposition Theorem Since the minimization problem is NP-hard we want to break up a problem into subproblems that can be solved separately and combined.

The connected components of G(M) are very informative For example we have the Theorem: The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network.

A maximal set of intersecting cycles forms a Blob If directions on the edges are removed, a blob is a bi-connected component of the network. [Figure: a network with root 00000 in which the intersecting recombination cycles form one blob.]

Blobbed Trees Contracting each blob in a network results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. Simple, but key insight. So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree. The blobs are the non-tree-like parts of the network.

Ugly tangled network inside the blob. Every network is a tree of blobs. A network where every blob is a single cycle is a Galled-Tree.

A Simple Observation In any network N for M, all sites from the same connected component of G(M) must appear together in a single blob in N.

The Decomposition Theorem Theorem: For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. This “fully-decomposed” network is the finest decomposition possible.

Example: Network for input M with one blob [Figure: network with root 00000 deriving a:00010 b:10010 c:00100 d:10100 e:01100 f:01101 g:00101, with all recombination cycles in a single blob.]

The fully-decomposed network for M [Figure: the network for a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101, shown next to the Incompatibility Graph, which has sites 1, 3, 4 in one connected component and sites 2, 5 in another.]

Moreover, the backbone tree is invariant over all the fully-decomposed networks for M, and can be determined in polynomial-time. So, we can find a network for M by solving the recombination minimization problem for each connected component of G(M) separately, and then connect those subnetworks in an invariant way.

Algorithmically Finding the tree part of the blobbed-tree is easy. Determining the sequences labeling the exterior nodes on any blob is easy. Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B. It is easy to test whether the exterior sequences on B can be generated with only a single recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob. That can be solved by successively removing each exterior sequence and testing if the remaining sequences can be generated on a perfect phylogeny of the correct form.

However … While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes, over all possible networks. That is, sometimes it pays to put sites from different connected components together on the same blob.

Sufficient Conditions But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks. The deepest result: Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N. (JCB, December 2007)

Corollary A fully-decomposed network exists that minimizes the number of recombinations, unless every optimal network uses some recombination node(s) labeled by sequence(s) not in M, and the addition of those sequences to M creates an incompatibility between sites in different components of G(M).

[Figure: a network for M on six sites, rooted at 000000, that includes a node labeled 100010.] Sequences in M are in black. Sequence 100010 is not in M. G(M) has two components. Each requires two recombinations, but this combined network needs only three. The addition of sequence 100010 reduces the number of components from 2 to 1, so G(L) has one component.

A Practical Sufficient Condition If M can be derived on a network N in which every edge contains at most one site, and every node is labeled with a sequence in M, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks for M.

Another Practical Sufficient Condition If M can be derived on a network N where the number of recombinations equals the (polynomial-time computable) Haplotype Lower Bound, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks.

Topic VI: Perfect Phylogeny Extension to non-binary characters We detail the case of three allowed states per character.

What is a Perfect Phylogeny for non-binary characters? Input consists of n sequences M with m sites (characters) each, where each site can take one of k states. In a Perfect Phylogeny T for M, each node of T is labeled with an m-length sequence where each site has a value from 1 to k. T has n leaves, one for each sequence in M, labeled by that sequence. For each character-state pair (C,s), the nodes of T that are labeled with state s for character C form a connected subtree of T. It follows that, for any character C, the subtrees for its different states are node-disjoint.
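The definition can be checked directly on a labeled tree. The sketch below uses the five rows of the example table M from this tutorial as leaf labels, with two internal nodes labeled (3,2,3) and (1,2,3). The tree topology is an assumption made for illustration, since the figure's edges are not recoverable from the transcript, and the function name is hypothetical.

```python
from collections import defaultdict

def is_perfect_phylogeny(labels, tree_edges, num_chars):
    """True if, for every (character, state), the nodes carrying it are connected in the tree."""
    adj = defaultdict(set)
    for u, v in tree_edges:
        adj[u].add(v)
        adj[v].add(u)
    for c in range(num_chars):
        groups = defaultdict(set)          # state -> nodes with that state for character c
        for node, seq in labels.items():
            groups[seq[c]].add(node)
        for nodes in groups.values():
            start = next(iter(nodes))
            seen, stack = {start}, [start]
            while stack:                   # search restricted to nodes in this group
                v = stack.pop()
                for w in adj[v] & nodes:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            if seen != nodes:
                return False
    return True

# Leaves 1..5 carry the rows of M; u and v are internal nodes.
labels = {1: (3, 2, 1), 2: (2, 3, 2), 3: (3, 2, 3), 4: (1, 1, 3), 5: (1, 2, 3),
          "u": (3, 2, 3), "v": (1, 2, 3)}
# Assumed topology: leaves 1, 2, 3 hang off u; leaves 4, 5 hang off v.
good_tree = [(1, "u"), (2, "u"), (3, "u"), ("u", "v"), (4, "v"), (5, "v")]
# Swapping leaves 3 and 4 separates the nodes with state 3 for character A.
bad_tree = [(1, "u"), (2, "u"), (4, "u"), ("u", "v"), (3, "v"), (5, "v")]

print(is_perfect_phylogeny(labels, good_tree, 3))
print(is_perfect_phylogeny(labels, bad_tree, 3))
```

The first tree satisfies the connected-subtree condition for every character-state pair; the second does not.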

Example: A perfect phylogeny for input M (n = 5, m = 3, k = 3)
     A B C
  1  3 2 1
  2  2 3 2
  3  3 2 3
  4  1 1 3
  5  1 2 3
[Figure: a perfect phylogeny T for M, with nodes labeled (3,2,1), (2,3,2), (3,2,3), (1,2,3), (1,1,3), (1,2,3), (3,2,3).]

Example: The tree for State 2 of Character B. [Figure: the perfect phylogeny for M from the previous slide (n = 5, m = 3, k = 3), showing the subtree formed by the nodes that have state 2 for character B.]

Perfect Phylogeny Problem Given M, is there a Perfect Phylogeny for M?

Chordal Graphs Basic Definition: A graph G is called Chordal if every cycle of length four or more contains a chord. More useful result: A graph G is chordal if and only if every minimal vertex separator in G is a clique. Chordal graphs have a large number of applications, more based on the separator result than on the basic definition. For example, a chordal graph on n nodes can have at most n maximal cliques and n-1 minimal vertex separators.
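A short sketch of a chordality test. It does not use the definition or the separator result from the slide; it relies on another standard characterization, namely that a graph is chordal if and only if its vertices can be removed one at a time so that each removed vertex is simplicial (its remaining neighbours form a clique). The function name is illustrative and the implementation favours clarity over speed.

```python
from itertools import combinations

def is_chordal(adj):
    """Greedy simplicial-vertex elimination; adj maps node -> set of neighbours."""
    g = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while g:
        for v, ns in g.items():
            # v is simplicial if every pair of its neighbours is adjacent.
            if all(b in g[a] for a, b in combinations(ns, 2)):
                break
        else:
            return False          # no simplicial vertex: a chordless cycle exists
        for n in g[v]:
            g[n].discard(v)
        del g[v]
    return True

# A 4-cycle has no chord, so it is not chordal.
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
# Adding the chord 1-3 makes it chordal.
cycle4_with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}

print(is_chordal(cycle4))
print(is_chordal(cycle4_with_chord))
```

Linear-time tests exist, but this simple version is enough to experiment with small graphs such as the ones in this tutorial.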

Another Classic Chordal Graph Theorem A graph G is chordal if and only if it is the intersection graph of a set S of subtrees of a tree T. Each node of G is a member of S. [Figure: a chordal graph G on nodes a to g, and a tree T whose nodes are labeled {b,c}, {b,c,d}, {c,d,e,g}, {a,e,g}, {e,f,g}, {a,e}.]

Relation to Perfect Phylogeny In a perfect phylogeny T for a table E, for any character C and any state X of character C, the sub-forest of T induced by the nodes labeled (C,X) forms a single, connected subtree of T. So, there is a natural set of subtrees of T induced by E.

Chordal Completion Approach to Perfect Phylogeny Graph G(E) has one node for each character-state pair in E, and an edge between two nodes if and only if there is a row in E with both those character-state pairs. Each row of table E induces a clique in G(E). [Figure: table E, with rows 321, 232, 323, 113, 123 over characters A, B, C, beside graph G(E), which has nodes for states 1, 2, 3 of each character.]
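The construction of G(E) can be sketched as follows, using the five-row example table. The function name and the representation of a node as a (character, state) pair are illustrative choices.

```python
from itertools import combinations

def build_G(E, chars):
    """Nodes are (character, state) pairs; an edge joins two pairs that share a row."""
    nodes = {(c, row[i]) for row in E for i, c in enumerate(chars)}
    edges = set()
    for row in E:
        pairs = [(c, row[i]) for i, c in enumerate(chars)]
        # Each row contributes a clique on its character-state pairs.
        edges.update(frozenset(p) for p in combinations(pairs, 2))
    return nodes, edges

E = [(3, 2, 1), (2, 3, 2), (3, 2, 3), (1, 1, 3), (1, 2, 3)]
nodes, edges = build_G(E, "ABC")
print(len(nodes), "nodes,", len(edges), "edges")

# No edge joins two states of the same character, so the graph is 3-partite here.
partite = all(len({c for c, _ in e}) == 2 for e in edges)
print(partite)
```

For this table the graph has 9 nodes and 12 edges, and every edge joins states of two different characters.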

Classic Theorem (Buneman 196?) Note that if table E has K columns, then G(E) is a K-partite graph. There is a perfect phylogeny for table E if and only if edges can be added to graph G(E) to make it a chordal, K-partite graph. If there is such a chordal graph, denote it by G’(E).

Deeper Result: If G’(E) exists, let C(E) be the graph derived from graph G’(E) as follows: create a node in C(E) for each maximal clique in G’(E), and create an edge (u,v) in C(E) iff the cliques for u and v in G’(E) share a node. Weight edge (u,v) by the number of shared nodes. Note that C(E) can be created from G’(E) in polynomial time. Any Maximum Spanning Tree T in C(E) is a perfect phylogeny for E. Actually, T can be found more directly in linear time from G’(E).

Perfect Phylogeny Results The perfect phylogeny problem was open for about 20 years, until it was solved in work by Dress and Steel, Warnow and Kannan, and Agarwala and Fernandez-Baca. For any fixed bound on the number of states per character, the Perfect Phylogeny Problem can be solved in polynomial time. However, if the number of states per character is not bounded, then the problem is NP-Complete. Also, for any fixed number of characters, the problem can be solved in polynomial time.

Dress-Steel solution for 3-state Perfect phylogeny given complete data (1991) Recode each site M(i) of M as three binary sites M’(i,1), M’(i,2), M’(i,3), indicating the taxa that have state 1, 2, or 3, respectively. Theorem (DS): There is a 3-state perfect phylogeny for M if and only if there is a binary-character perfect phylogeny for some subset of M’ consisting of exactly two of the columns M’(i,1), M’(i,2), M’(i,3), for each column i of M.
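A brute-force sketch of the recoding and of the condition in the theorem, using the five-row example M. It assumes the binary perfect phylogeny is tested without a fixed root, using the standard fact that a set of binary characters has a perfect phylogeny if and only if every pair of columns passes the four-gamete test. The enumeration over all 3^m ways of dropping one column per site is exponential and is only meant to illustrate the statement; the function names are illustrative.

```python
from itertools import combinations, product

def recode(M, k=3):
    """Row-wise recoding: site i becomes k binary sites, one per state 1..k."""
    return ["".join("1" if s == j else "0" for s in row for j in range(1, k + 1))
            for row in M]

def compatible(Mp, a, b):
    """Four-gamete test on binary columns a and b of M'."""
    return len({(r[a], r[b]) for r in Mp}) < 4

def dress_steel_choices(M, k=3):
    """All ways to drop one column per site that leave pairwise-compatible columns."""
    Mp, m = recode(M, k), len(M[0])
    good = []
    for dropped in product(range(k), repeat=m):
        kept = [i * k + j for i in range(m) for j in range(k) if j != dropped[i]]
        if all(compatible(Mp, a, b) for a, b in combinations(kept, 2)):
            good.append(tuple(d + 1 for d in dropped))   # report states 1-based
    return good

M = [(3, 2, 1), (2, 3, 2), (3, 2, 3), (1, 1, 3), (1, 2, 3)]
Mp = recode(M)
print(Mp)
choices = dress_steel_choices(M)
# Each tuple gives the state whose column is dropped for characters A, B, C.
print(choices)
```

Each reported tuple names one dropped state per character; for example, dropping (A,3), (B,2) and (C,3) leaves six pairwise-compatible binary columns.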

Example: recoding M as M’
        M        M’
     A B C    A,1 A,2 A,3 B,1 B,2 B,3 C,1 C,2 C,3
  1  3 2 1     0   0   1   0   1   0   1   0   0
  2  2 3 2     0   1   0   0   0   1   0   1   0
  3  3 2 3     0   0   1   0   1   0   0   0   1
  4  1 1 3     1   0   0   1   0   0   0   0   1
  5  1 2 3     1   0   0   0   1   0   0   0   1
[Figure: a compatible subset of the columns of M’ is marked.]

Solved in Poly-Time by 2-SAT As stated, the problem still seems like it would take exponential time to solve, but in fact it is easy to code the problem as a 2-SAT problem (Y. Wu), and hence it is solvable in polynomial time. The Dress-Steel paper gave an independent poly-time solution.
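One possible 2-SAT encoding, sketched under assumptions: the slides do not spell out the encoding, so this may differ from the one credited to Y. Wu. There is one Boolean variable per column of M’, true when the column is dropped. For every site, no two of its columns may both be dropped, and for every pair of columns that fails the four-gamete test, at least one must be dropped. A site with no dropped column can then drop any one column, because removing columns never creates an incompatibility. The solver is a standard strongly-connected-components method, and all names are illustrative.

```python
from itertools import combinations

def solve_2sat(n, clauses):
    """clauses: pairs of literals (var, wanted_value). Returns a list of bools or None."""
    def lit(v, t):
        return 2 * v + (0 if t else 1)
    g = [[] for _ in range(2 * n)]
    rg = [[] for _ in range(2 * n)]
    for (a, ta), (b, tb) in clauses:
        x, y = lit(a, ta), lit(b, tb)
        # (x or y) is (not x -> y) and (not y -> x).
        for src, dst in ((x ^ 1, y), (y ^ 1, x)):
            g[src].append(dst)
            rg[dst].append(src)
    order, seen = [], [False] * (2 * n)
    def dfs1(u):
        seen[u] = True
        for w in g[u]:
            if not seen[w]:
                dfs1(w)
        order.append(u)
    for u in range(2 * n):
        if not seen[u]:
            dfs1(u)
    comp = [-1] * (2 * n)
    def dfs2(u, c):
        comp[u] = c
        for w in rg[u]:
            if comp[w] == -1:
                dfs2(w, c)
    c = 0
    for u in reversed(order):      # Kosaraju: components come out in topological order
        if comp[u] == -1:
            dfs2(u, c)
            c += 1
    if any(comp[2 * v] == comp[2 * v + 1] for v in range(n)):
        return None
    return [comp[2 * v] > comp[2 * v + 1] for v in range(n)]

M = [(3, 2, 1), (2, 3, 2), (3, 2, 3), (1, 1, 3), (1, 2, 3)]
k, m = 3, len(M[0])
Mp = ["".join("1" if s == j else "0" for s in row for j in range(1, k + 1)) for row in M]

def var(i, j):
    return i * k + j               # variable "column (i, j) of M' is dropped"

bad_pairs = [(a, b) for a, b in combinations(range(m * k), 2)
             if len({(r[a], r[b]) for r in Mp}) == 4]
clauses = []
for i in range(m):                 # at most one dropped column per site
    for j, j2 in combinations(range(k), 2):
        clauses.append(((var(i, j), False), (var(i, j2), False)))
for a, b in bad_pairs:             # break every incompatible pair
    clauses.append(((a, True), (b, True)))

sol = solve_2sat(m * k, clauses)
dropped = None if sol is None else sorted(v for v in range(m * k) if sol[v])
print(dropped)
```

If `solve_2sat` returns `None`, no choice of columns works and, by the theorem, M has no 3-state perfect phylogeny. Otherwise the dropped columns describe a compatible subset, after dropping one arbitrary column at any site left untouched.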
