A phylogenetic application of the combinatorial graph Laplacian Eric A. Stone Department of Statistics Bioinformatics Research Center North Carolina State University

My motivation for this project Trees in statistics or biology –Often a latent branching structure relating some observed data Trees in mathematics –Always a connected graph with no cycles

My motivation for this project Trees in statistics or biology –PROBLEM: Recover properties of latent branching structure Trees in mathematics –Always a connected graph with no cycles

My motivation for this project Trees in statistics or biology –PROBLEM: Recover properties of latent branching structure Trees in mathematics –Characterization of observed structure by spectral graph theory

Bridging the gap Rectifying trees and trees Can we use some powerful tools of spectral graph theory to recover latent structure? –Natural relationship between trees and complete graphs?!?

Tree and distance matrices The tree with vertex set {1,…,8} has distance matrix D The phylogenetic tree can only be observed at {1,…,5} –We can only observe (estimate) the phylogenetic portion D* The phylogenetic portion D*
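To make D and D* concrete, here is a minimal sketch that builds the path-length distance matrix of an 8-vertex tree and extracts its leaf block. The topology (leaves 1 to 5, internal vertices 6 to 8) is inferred from the splits and cutpoints quoted in this talk; the branch lengths and the helper name `tree_distance_matrix` are assumptions made for illustration only.

```python
import numpy as np

# Assumed example tree: topology inferred from the talk, branch lengths made up.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}

def tree_distance_matrix(edges, n):
    """All-pairs path lengths via Floyd-Warshall (vertex labels are 1-based)."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for (u, v), length in edges.items():
        D[u - 1, v - 1] = D[v - 1, u - 1] = length
    for k in range(n):
        # Relax every pair (i, j) through intermediate vertex k.
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

D = tree_distance_matrix(EDGES, 8)   # full 8x8 matrix (not observable in practice)
D_star = D[:5, :5]                   # leaf-to-leaf block: the phylogenetic portion
print(np.round(D_star, 2))
```

Only `D_star` would be available to a reconstruction method; the full `D` is shown here purely to illustrate where it comes from.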

More motivation for this project Trees in statistics or biology –PROBLEM: Recover properties of latent branching structure Given D* only, recover latent branching structure –This is the problem of phylogenetic reconstruction (w/o error!) The phylogenetic portion D*

NJ finds (2,n-2) splits from D* A split is a bipartition of the leaf set (e.g. {1,2,3,4,5}) that can be induced by cutting a branch on the tree –e.g. {{1,2},{3,4,5}} or {{1,2,5},{3,4}} Neighbor-joining criterion identifies (2,n-2) splits through the Q matrix (figure: the example splits {{1,2},{3,4,5}} and {{1,2,5},{3,4}})

A recipe for tree reconstruction from D* 1.Find a split –NJ relies on theorem that guarantees (2,n-2) split from Q matrix 2.Use knowledge of split to reduce dimension –NJ prunes the cherry (neighboring taxa) to reduce leaves by one 3.Iterate until tree has been fully reconstructed –Tree topology specified by its split set
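Step 1 of the recipe, as NJ performs it, can be sketched with the standard neighbor-joining criterion Q_ij = (n-2) d_ij - r_i - r_j, where r_i is the i-th row sum. The 5x5 matrix below is a hypothetical additive distance matrix (not data from the talk), and `nj_q_matrix` is an illustrative helper name.

```python
import numpy as np

# Hypothetical additive leaf distances D* for a 5-leaf tree (not from the talk).
D_star = np.array([[0.0, 3.0, 7.5, 7.0, 5.5],
                   [3.0, 0.0, 8.5, 8.0, 6.5],
                   [7.5, 8.5, 0.0, 1.5, 5.0],
                   [7.0, 8.0, 1.5, 0.0, 4.5],
                   [5.5, 6.5, 5.0, 4.5, 0.0]])

def nj_q_matrix(D):
    """Standard neighbor-joining criterion: Q_ij = (n - 2) d_ij - r_i - r_j."""
    n = D.shape[0]
    r = D.sum(axis=1)                          # row sums
    Q = (n - 2) * D - r[:, None] - r[None, :]
    np.fill_diagonal(Q, np.inf)                # a taxon cannot pair with itself
    return Q

Q = nj_q_matrix(D_star)
i, j = np.unravel_index(np.argmin(Q), Q.shape)
cherry = (int(i) + 1, int(j) + 1)              # 1-based taxon labels
print("cherry:", cherry)
```

The minimizing pair is a cherry, so the split it induces always has the shape (2,n-2); this is the limitation that motivates looking for deeper splits.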

Our narrow goal 1.Find a split –NJ relies on theorem that guarantees (2,n-2) split from Q matrix –Hypothesize criterion that identifies deeper splits … and prove that it actually works

Our solution The phylogenetic portion D*

Our solution Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ Find eigenvector Y of HD*H with the smallest eigenvalue –The signs of the entries of Y identify a split of the tree The phylogenetic portion D*

About the matrix HD*H Entries of HD*H are $D_{ij} - \bar{D}_{i\cdot} - \bar{D}_{\cdot j} + \bar{D}_{\cdot\cdot}$ (entry minus row mean minus column mean plus grand mean) HD*H is negative semidefinite –Zero is a simple eigenvalue with constant eigenvector (proportional to the all-ones vector) –Remaining eigenvectors have both + and - entries HD*H appears prominently in: –Multidimensional scaling –Principal coordinate analysis
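These properties are easy to check numerically. The sketch below does so for a hypothetical additive 5x5 distance matrix (not data from the talk); the variable names are illustrative.

```python
import numpy as np

# Hypothetical additive leaf distances D* (not from the talk).
D_star = np.array([[0.0, 3.0, 7.5, 7.0, 5.5],
                   [3.0, 0.0, 8.5, 8.0, 6.5],
                   [7.5, 8.5, 0.0, 1.5, 5.0],
                   [7.0, 8.0, 1.5, 0.0, 4.5],
                   [5.5, 6.5, 5.0, 4.5, 0.0]])

n = D_star.shape[0]
H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
M = H @ D_star @ H

# Entrywise form: D_ij - (row mean i) - (column mean j) + (grand mean).
M_entrywise = (D_star
               - D_star.mean(axis=1, keepdims=True)
               - D_star.mean(axis=0, keepdims=True)
               + D_star.mean())

eigvals = np.linalg.eigvalsh((M + M.T) / 2)    # ascending order
print(np.round(eigvals, 4))                    # all <= 0, exactly one is ~0
```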

Example of our solution Find eigenvector Y of HD*H with the smallest eigenvalue: Y = (+0.5793, +0.4418, -0.0564, -0.4636, -0.5011) Signs of Y identify the split {{1,2},{3,4,5}}
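The computation in this example can be sketched as follows. The distance matrix here is hypothetical (the talk's actual D* is not preserved in the transcript), so the eigenvector will differ from the numbers on the slide; `spectral_split` is an illustrative name.

```python
import numpy as np

# Hypothetical additive leaf distances D* (not the matrix used in the talk).
D_star = np.array([[0.0, 3.0, 7.5, 7.0, 5.5],
                   [3.0, 0.0, 8.5, 8.0, 6.5],
                   [7.5, 8.5, 0.0, 1.5, 5.0],
                   [7.0, 8.0, 1.5, 0.0, 4.5],
                   [5.5, 6.5, 5.0, 4.5, 0.0]])

def spectral_split(D):
    """Bipartition leaves by the signs of the bottom eigenvector of H D H."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    M = H @ D @ H
    vals, vecs = np.linalg.eigh((M + M.T) / 2)   # eigenvalues ascending
    Y = vecs[:, 0]                               # smallest eigenvalue
    pos = {i + 1 for i in range(n) if Y[i] > 0}
    neg = set(range(1, n + 1)) - pos
    return Y, pos, neg

Y, pos, neg = spectral_split(D_star)
print(np.round(Y, 4), pos, neg)
```

The overall sign of an eigenvector is arbitrary, so only the bipartition {pos, neg} is meaningful, not which side is labeled positive.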

A real example (data from ToL) Two iterations

Our solution 1.Find a split –NJ relies on theorem that guarantees (2,n-2) split from Q matrix –Hypothesize criterion that identifies deep splits … and prove that it actually works

Affinity and distance In phylogenetics, common to consider pairwise distances –In graph theory, common to consider pairwise affinities (figure panels: distance-based, affinity-based)

Distance matrix Laplacian matrix

The genius of Miroslav Fiedler G connected $\Rightarrow$ smallest eigenvalue of L, zero, is simple –Smallest positive eigenvalue, $\lambda$, called algebraic connectivity of G Fiedler vectors Y satisfy $LY = \lambda Y$ –Fiedler cut is the sign-induced bipartition Y = (+0.4840, +0.4038, -0.4047, -0.4277, -0.0223, +0.3449, -0.3653, -0.0158) on vertices 1 to 8

The genius of Miroslav Fiedler G connected $\Rightarrow$ smallest eigenvalue of L, zero, is simple –Smallest positive eigenvalue, $\lambda$, called algebraic connectivity of G Fiedler vectors Y satisfy $LY = \lambda Y$ –Fiedler cut is the sign-induced bipartition Y = (+0.4840, +0.4038, -0.4047, -0.4277, -0.0223, +0.3449, -0.3653, -0.0158) on vertices 1 to 8 Fiedler cut here is –{{1,2,6},{3,4,5,7,8}} Note that the cut implies a leaf split: –{{1,2},{3,4,5}}
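A minimal sketch of the Fiedler cut on a tree follows. It assumes edge affinities equal to reciprocal branch lengths, a topology inferred from the talk, and made-up branch lengths, so the resulting cut need not match the one on the slide.

```python
import numpy as np

# Assumed example tree: topology inferred from the talk, branch lengths made up.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}

def laplacian(edges, n):
    """Weighted graph Laplacian; edge affinity = 1 / branch length (assumption)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

L = laplacian(EDGES, 8)
vals, vecs = np.linalg.eigh(L)            # eigenvalues ascending
lam, Y = vals[1], vecs[:, 1]              # algebraic connectivity, Fiedler vector
pos = {v + 1 for v in range(8) if Y[v] > 0}
neg = set(range(1, 9)) - pos
# Tree edges whose endpoints fall on opposite sides of the Fiedler cut.
crossing = [e for e in EDGES if (e[0] in pos) != (e[1] in pos)]
print(pos, neg, crossing)
```

On a tree whose Fiedler vector has no zero entries, the cut removes a single edge, which is why it yields a split.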

Is this relevant here? We do not observe an 8x8 Laplacian matrix L –All we get is a 5x5 matrix of between-leaf pairwise distances D* Where is the connection to graph theory? The phylogenetic portion D*

Recall: Our solution Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ Find eigenvector Y of HD*H with the smallest eigenvalue –The signs of the entries of Y identify a split of the tree The phylogenetic portion D*

An extremely useful relationship Recall the centering matrix H –The (Moore-Penrose) pseudoinverse of HDH is in fact $-\tfrac{1}{2}L$, i.e. $L = -2(HDH)^{+}$ We have shown in the context of this formula –Principal submatrices of D relate to Schur complements of L In particular, $(HD^*H)^{+} = -\tfrac{1}{2}L^* = -\tfrac{1}{2}(L/Z) = -\tfrac{1}{2}(W - XZ^{-1}Y)$, where $L = \begin{pmatrix} W & X \\ Y & Z \end{pmatrix}$
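The relationship can be checked numerically in the form L = -2(HDH)^+, both for the full tree and for the leaf block. The sketch assumes edge affinities equal to reciprocal branch lengths (so that path length is the matching distance), a topology inferred from the talk, and made-up branch lengths; all helper names are illustrative.

```python
import numpy as np

# Assumed example tree: topology inferred from the talk, branch lengths made up.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}

def laplacian(edges, n):
    """Weighted graph Laplacian; edge affinity = 1 / branch length (assumption)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

def tree_distance_matrix(edges, n):
    """All-pairs path lengths via Floyd-Warshall."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for (u, v), length in edges.items():
        D[u - 1, v - 1] = D[v - 1, u - 1] = length
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

def centered(D):
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ D @ H

def pinv_sym(M, tol=1e-9):
    """Moore-Penrose pseudoinverse of a symmetric matrix via its eigenpairs."""
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    inv = np.array([1.0 / v if abs(v) > tol else 0.0 for v in vals])
    return (vecs * inv) @ vecs.T

L = laplacian(EDGES, 8)
D = tree_distance_matrix(EDGES, 8)

# Block partition with the leaves first: L = [[W, X], [Yb, Z]].
W, X = L[:5, :5], L[:5, 5:]
Yb, Z = L[5:, :5], L[5:, 5:]
L_star = W - X @ np.linalg.inv(Z) @ Yb          # Schur complement L/Z

full_ok = np.allclose(-2 * pinv_sym(centered(D)), L)
leaf_ok = np.allclose(-2 * pinv_sym(centered(D[:5, :5])), L_star)
print(full_ok, leaf_ok)
```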

Recall: Our solution Find eigenvector Y of HD*H with the smallest eigenvalue –The signs of the entries of Y identify a split of the tree The smallest eigenvalue of HD*H (negative semidefinite) corresponds to the smallest positive eigenvalue of L*, with the same eigenvector In fact, L* can be seen as a graph Laplacian –And our solution, Y, is the Fiedler vector of that graph! But what does this graph look like?
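As a sanity check of this correspondence, the sketch below compares the bottom eigenvector of HD*H with the Fiedler vector of L* obtained by Schur complementation. The tree (topology inferred from the talk, made-up branch lengths, affinities taken as reciprocal lengths) and the matching leaf distance matrix are assumptions.

```python
import numpy as np

# Assumed example tree and its leaf-to-leaf path lengths (both hypothetical).
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}
D_star = np.array([[0.0, 3.0, 7.5, 7.0, 5.5],
                   [3.0, 0.0, 8.5, 8.0, 6.5],
                   [7.5, 8.5, 0.0, 1.5, 5.0],
                   [7.0, 8.0, 1.5, 0.0, 4.5],
                   [5.5, 6.5, 5.0, 4.5, 0.0]])

def laplacian(edges, n):
    """Weighted graph Laplacian; edge affinity = 1 / branch length (assumption)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

L = laplacian(EDGES, 8)
# Schur complement at the internal vertices 6, 7, 8.
L_star = L[:5, :5] - L[:5, 5:] @ np.linalg.solve(L[5:, 5:], L[5:, :5])

H = np.eye(5) - np.ones((5, 5)) / 5
M = H @ D_star @ H
mu, U = np.linalg.eigh((M + M.T) / 2)             # ascending
lam, V = np.linalg.eigh((L_star + L_star.T) / 2)  # ascending

Y = U[:, 0]    # eigenvector for the smallest eigenvalue of H D* H
F = V[:, 1]    # Fiedler vector of L*
print(abs(Y @ F))   # ~1 when the two vectors agree up to sign
```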

Schur complementation of a vertex The vertices adjacent to 8 become adjacent to each other
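For a single vertex the Schur complement has a simple closed form, which makes the new adjacencies visible. The sketch eliminates vertex 8 from an assumed tree (topology inferred from the talk, made-up branch lengths, affinities as reciprocal lengths).

```python
import numpy as np

# Assumed example tree: topology inferred from the talk, branch lengths made up.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}

def laplacian(edges, n):
    """Weighted graph Laplacian; edge affinity = 1 / branch length (assumption)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

L = laplacian(EDGES, 8)

# Eliminate vertex 8 (last row and column): L_8 = A - b b^T / c.
A, b, c = L[:7, :7], L[:7, 7], L[7, 7]
L8 = A - np.outer(b, b) / c

# The neighbors of 8 in the tree are 5, 6 and 7.
for u, v in [(5, 6), (5, 7), (6, 7)]:
    print(u, v, "before:", -L[u - 1, v - 1], "after:", -L8[u - 1, v - 1])
```

Each pair of former neighbors (i, j) gains affinity w_i8 * w_j8 / deg(8), while entries not involving a neighbor of 8 are unchanged.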

Schur complementation of the interior The graph described by L* is fully connected –All cuts yield connected subgraphs No help from Fiedler
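That L* describes a complete graph can be confirmed by checking that every off-diagonal entry is strictly negative. As before, the tree is an assumption: topology inferred from the talk, made-up branch lengths, affinities as reciprocal lengths.

```python
import numpy as np

# Assumed example tree: topology inferred from the talk, branch lengths made up.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}

def laplacian(edges, n):
    """Weighted graph Laplacian; edge affinity = 1 / branch length (assumption)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

L = laplacian(EDGES, 8)
# Schur complement at the whole interior {6, 7, 8}.
L_star = L[:5, :5] - L[:5, 5:] @ np.linalg.solve(L[5:, 5:], L[5:, :5])

off_diag = L_star[~np.eye(5, dtype=bool)]
is_complete = bool(np.all(off_diag < 0))     # every pair of leaves is adjacent
print(is_complete)
print(np.round(-L_star, 4))                  # off-diagonals are the affinities
```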

Recap thus far Given matrix D* of pairwise distances between leaves Find eigenvector Y of HD*H with the smallest eigenvalue –Claim: The signs of the entries of Y identify a split of the tree Y shown to be a Fiedler vector of the Laplacian L* –But graph of L* is fully connected, has no apparent structure Thus Fiedler says nothing about signs of entries of Y –But claim requires signs to be consistent with structure of the tree

Recap thus far Thus Fiedler says nothing about signs of entries of Y –But claim requires signs to be consistent with structure of the tree How does L* inherit the structure of the tree? (figure panels labeled NO and YES)

The quotient rule inspires a Schur tower

How does this help?

Cutpoints and connected components A point of articulation (or cutpoint) is a point $r \in G$ whose deletion yields a subgraph with $\geq 2$ connected components –Cutpoints: 6,7,8 –Shown: {1}, {2}, {3,4,5,7,8} are connected components at 6 The cutpoints of a tree are its internal nodes
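The definition translates directly into a brute-force check: delete each vertex in turn and count the connected components of what remains. The edge list below is the tree topology inferred from the talk; `components_without` is an illustrative helper name.

```python
# Tree topology inferred from the talk (leaves 1-5, internal vertices 6-8).
EDGES = [(1, 6), (2, 6), (6, 8), (5, 8), (7, 8), (3, 7), (4, 7)]
VERTICES = list(range(1, 9))

def components_without(vertices, edges, removed):
    """Connected components of the graph after deleting one vertex."""
    remaining = set(vertices) - {removed}
    adj = {v: set() for v in remaining}
    for u, v in edges:
        if u in remaining and v in remaining:
            adj[u].add(v)
            adj[v].add(u)
    comps, seen = [], set()
    for start in sorted(remaining):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                      # depth-first search from `start`
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

cutpoints = [v for v in VERTICES
             if len(components_without(VERTICES, EDGES, v)) >= 2]
print(cutpoints)
print(components_without(VERTICES, EDGES, 6))
```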

The key observation (i.e. theorem) Let L be the Laplacian of a graph G with some cutpoint v –Let L {v} be the Laplacian of G {v} obtained by Schur complement at v Then the Fiedler cut of G {v} identifies a split of G –Here the Fiedler cut of G {6} is {{1,2,5,8},{3,4,7}} –Including 6 in {1,2,5,8} defines two connected components in G Fiedler vector of G {6} on vertices (1,2,3,4,5,7,8): (+0.5828, +0.4660, -0.3870, -0.4129, +0.0570, -0.3439, +0.0380)
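The statement can be illustrated numerically: take the Schur complement at cutpoint 6, compute the Fiedler cut of the reduced graph, and check that placing 6 on one of the two sides cuts exactly one tree edge. The tree is an assumption (topology inferred from the talk, made-up branch lengths, affinities as reciprocal lengths), so the cut found may differ from the one quoted on the slide.

```python
import numpy as np

# Assumed example tree: topology inferred from the talk, branch lengths made up.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}

def laplacian(edges, n):
    """Weighted graph Laplacian; edge affinity = 1 / branch length (assumption)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

L = laplacian(EDGES, 8)
labels = [1, 2, 3, 4, 5, 7, 8]            # vertices that survive
keep = [lab - 1 for lab in labels]
v = 5                                      # 0-based index of cutpoint 6

# Schur complement at the single vertex 6.
L6 = L[np.ix_(keep, keep)] - np.outer(L[keep, v], L[v, keep]) / L[v, v]

vals, vecs = np.linalg.eigh((L6 + L6.T) / 2)
Y = vecs[:, 1]                             # Fiedler vector of the reduced graph
pos = {lab for lab, y in zip(labels, Y) if y > 0}
neg = set(labels) - pos

def crossing_edges(side):
    """Tree edges with exactly one endpoint in `side`."""
    return [e for e in EDGES if (e[0] in side) != (e[1] in side)]

# Try placing 6 on each side; a split of the tree cuts exactly one edge.
options = {name: len(crossing_edges(side | {6}))
           for name, side in (("pos", pos), ("neg", neg))}
print(pos, neg, options)
```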

The quotient rule inspires a Schur tower How does this help? Look at Schur paths to graph with Laplacian L* (figure: Schur paths from L to L*)
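The quotient rule says that Schur complements can be taken in stages, so every path down the tower ends at the same L*. The sketch below checks this for the three two-stage paths through G {6,7}, G {6,8} and G {7,8}; the tree (topology inferred from the talk, made-up branch lengths) and the helper `schur` are assumptions.

```python
import numpy as np

# Assumed example tree: topology inferred from the talk, branch lengths made up.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}

def laplacian(edges, n):
    """Weighted graph Laplacian; edge affinity = 1 / branch length (assumption)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

def schur(L, labels, remove):
    """Schur complement of L at the vertices in `remove`; returns (L', labels')."""
    keep = [i for i, lab in enumerate(labels) if lab not in remove]
    drop = [i for i, lab in enumerate(labels) if lab in remove]
    A = L[np.ix_(keep, keep)]
    B = L[np.ix_(keep, drop)]
    C = L[np.ix_(drop, drop)]
    return A - B @ np.linalg.solve(C, B.T), [labels[i] for i in keep]

L = laplacian(EDGES, 8)
all_labels = list(range(1, 9))
interior = {6, 7, 8}

# One-shot elimination of the whole interior.
L_star, leaves = schur(L, all_labels, interior)

# Two-stage eliminations: remove two internal vertices, then the third.
paths = {}
for first in ({6, 7}, {6, 8}, {7, 8}):
    L_mid, mid_labels = schur(L, all_labels, first)
    L_end, _ = schur(L_mid, mid_labels, interior - first)
    paths[tuple(sorted(first))] = L_end

print({k: bool(np.allclose(P, L_star)) for k, P in paths.items()})
```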

The punch line The graph with Laplacian L* can be obtained in three ways The Fiedler cut of G {6,7,8} must split G {6,7} and G {6,8} and G {7,8}

Recall: Example Find eigenvector Y of HD*H with the smallest eigenvalue: Y = (+0.5793, +0.4418, -0.0564, -0.4636, -0.5011) Signs of Y identify the split {{1,2},{3,4,5}}

The punch line The graph with Laplacian L* can be obtained in three ways The Fiedler cut of G {6,7,8} must split G {6,7} and G {6,8} and G {7,8} This implies that the cut splits the progenitor graph G, here as {{1,2,6},{3,4,5,7,8}}

Our solution actually works Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ Find eigenvector Y of HD*H with the smallest eigenvalue –The signs of the entries of Y identify a split of the tree The phylogenetic portion D*

A recipe for tree reconstruction 1.Find a split –NJ relies on theorem that guarantees (2,n-2) split from Q matrix –We have a theorem that guarantees splits from HD*H matrix 2.Use knowledge of split to reduce dimension –NJ prunes the cherry (neighboring taxa) to reduce leaves by one –We use a divisive method that reduces to pairs of subtrees 3.Iterate until tree has been fully reconstructed –Tree topology specified by its split set

Reconstruction from the inside out

Connections with Classical MDS and PCoA Classical solution to multidimensional scaling –a.k.a. Principal coordinate analysis Recipe for dimension reduction given distance matrix D: 1.Construct matrix A from D entrywise: $x \mapsto -x^2/2$ 2.Double centering: B = HAH 3.Find k largest eigenvalues $\lambda_i$ of B with corresponding eigenvectors $X_i$ 4.Coordinates of point $P_r$ given by row r of eigenvector entries With k = 1 and the square root of tree distance, this is equivalent to our approach
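The four-step recipe can be sketched as below, together with a check that k = 1 applied to the square root of a tree distance yields the same sign pattern as the bottom eigenvector of HD*H. The distance matrix is hypothetical, `classical_mds` is an illustrative name, and scaling each axis by the square root of its eigenvalue is a common normalization assumed here rather than something stated on the slide.

```python
import numpy as np

# Hypothetical additive leaf distances D* (not from the talk).
D_star = np.array([[0.0, 3.0, 7.5, 7.0, 5.5],
                   [3.0, 0.0, 8.5, 8.0, 6.5],
                   [7.5, 8.5, 0.0, 1.5, 5.0],
                   [7.0, 8.0, 1.5, 0.0, 4.5],
                   [5.5, 6.5, 5.0, 4.5, 0.0]])

def classical_mds(D, k):
    """Classical MDS / PCoA coordinates in k dimensions."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = -0.5 * D ** 2                            # step 1: x -> -x^2 / 2
    B = H @ A @ H                                # step 2: double centering
    vals, vecs = np.linalg.eigh((B + B.T) / 2)
    order = np.argsort(vals)[::-1][:k]           # step 3: k largest eigenvalues
    # Step 4, with each axis scaled by sqrt(eigenvalue) (assumed normalization).
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

# PCoA with k = 1 on the square root of the tree distance.
coords = classical_mds(np.sqrt(D_star), k=1)[:, 0]

# The spectral split criterion: bottom eigenvector of H D* H.
n = D_star.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
M = H @ D_star @ H
Y = np.linalg.eigh((M + M.T) / 2)[1][:, 0]

same_split = (np.array_equal(np.sign(coords), np.sign(Y))
              or np.array_equal(np.sign(coords), -np.sign(Y)))
print(np.round(coords, 4), same_split)
```

The equivalence follows because squaring the square-rooted distances gives A = -D*/2, so B = -HD*H/2 and the largest eigenvalue of B pairs with the smallest eigenvalue of HD*H.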

Phylogenetic ordination PCoA on sequence data with k = 3: –For appropriate distance, C1 (x-axis) guaranteed to split taxa at 0 Our results support popular use of PCoA –Provided that the right distance is considered…

Conclusion I Natural connection between matrix of pairwise distances and the Laplacian of a complete graph

Conclusion II Structure of tree embedded in complete graph and recoverable via spectral theory Notion of Fiedler cut extends concept to Fiedler split –Inheritance propagated through Schur tower (figure panels labeled NO and YES)

Conclusion III Results inspire fast divisive tree reconstruction method

Conclusion IV Provides guidance and justification for ordination approach

Acknowledgements Alex Griffing (NCSU Bioinformatics) Carl Meyer (NCSU Math) Amy Langville (CoC Math)

Cutpoints and Perron components Each connected component identifies a principal submatrix Each such principal submatrix is inverse positive –Implies that the inverse has a Perron value that is simple –The Perron component is that with the largest Perron value
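These definitions can be made concrete with a short computation at cutpoint 6. The tree is an assumption (topology inferred from the talk, made-up branch lengths, affinities as reciprocal lengths), so the Perron values below differ from those shown in the talk.

```python
import numpy as np

# Assumed example tree: topology inferred from the talk, branch lengths made up.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 3.0, (5, 8): 1.5,
         (7, 8): 2.5, (3, 7): 1.0, (4, 7): 0.5}

def laplacian(edges, n):
    """Weighted graph Laplacian; edge affinity = 1 / branch length (assumption)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

L = laplacian(EDGES, 8)
components_at_6 = [{1}, {2}, {3, 4, 5, 7, 8}]    # components listed in the talk

inverses, perron_values = {}, {}
for comp in components_at_6:
    idx = [v - 1 for v in sorted(comp)]
    inv = np.linalg.inv(L[np.ix_(idx, idx)])     # inverse principal submatrix
    key = frozenset(comp)
    inverses[key] = inv
    # Perron value: the largest eigenvalue of the entrywise positive inverse.
    perron_values[key] = float(np.linalg.eigvalsh((inv + inv.T) / 2).max())

perron_component = max(perron_values, key=perron_values.get)
print({tuple(sorted(k)): round(val, 4) for k, val in perron_values.items()})
print(sorted(perron_component))
```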

Cutpoints and Perron components Inverse principal submatrices have Perron values 1, 0.5 and 7.49; the largest, 7.49, marks the Perron component

The key observation Take Schur complement of L at cutpoint, e.g. 6 Consider Fiedler vector of derived Laplacian –Signs of entries outside Perron component are positive (+) –Signs of entries inside Perron component indeterminate (+/-) Inverse principal submatrices have Perron values 1, 0.5 and 7.49; the largest, 7.49, marks the Perron component Schur graph at 6: component signs +, +, +/-
