Information Theoretic Approach to Whole Genome Phylogenies

David Burstein, Igor Ulitsky, Tamir Tuller, Benny Chor. School of Computer Science, Tel Aviv University

Tree of Life
“I believe it has been with the tree of life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications” (Charles Darwin, 1859). Since Charles Darwin, the evolutionary history of extinct and extant species has been described as a tree. Each node represents a species and each split represents the speciation of a common ancestor into two or more different species. [click] The goal of phylogenetics is to reconstruct trees that reflect the evolutionary relationships between species.

Accepted Evolutionary Model: Trees
Initial period: Primordial soup, where “you are what you eat”. Recombination events. Horizontal transfers. Formation of distinct taxa. Speciation events induce a tree-like evolution.

Phylogenetic Trees Based on What?
Morphology. Single genes. Whole genomes.

Whole Genome Phylogenies: Motivation
Cons of single-gene trees: require preprocessing; gene duplications; often too sensitive.
Pros of whole-genome trees: fully automatic; more information; seem essential for viruses.
What about proteome trees? Less “noise”, but they do require preprocessing.

Whole Genome Phylogenies: Challenges
Very large inputs: up to 5 Gbp long. Extreme length variability (5 Gbp down to 1 Mbp). No meaningful alignment. Different segments experienced different evolutionary processes.

Previous Approaches
Genome rearrangements (Hannenhalli & Pevzner 1995, …). Gene/domain contents (Snel et al. 1999, …). Li et al. (2001): “Kolmogorov complexity”. Otu et al. (2003): “Lempel-Ziv compression”. Qi et al. (2004): composition vectors.
Common approach (ours too): compute pairwise distances, then build a tree from the distance matrix (e.g. using Neighbor Joining, Saitou and Nei 1987).
Our work is not the first to reconstruct phylogenetic trees from whole genomes. Earlier works in this field were based on genome rearrangements and on gene content. While these methods seem to be very effective for evolutionarily close species, they can be problematic when dealing with distant species. In the last few years a number of works inspired by information theory were presented; some of them appear on the slide. These works are close in nature to our method. They are essentially distance methods: their first step is to calculate all pairwise distances between the genomes, and the tree is then reconstructed from the distance matrix. Our work presents a novel method based on information theory, which we believe will contribute to the field of whole-genome and whole-proteome phylogenetics.

Genome Rearrangements
Emphasis on finding the best sequence of rearrangements. Drawbacks: requires manual definition of blocks; disregards changes within a block.

Gene/Domain Content: Genome → Equal-Length Boolean Vector
Various tree construction methods. Drawbacks: requires gene/domain definition/knowledge; disregards most of the genetic information.

Ming Li et al.: “Kolmogorov Complexity”
Kolmogorov complexity (KC) is a wonderful measure, but it is not computable. “Approximate” KC by compression. Drawbacks: justification of the “approximation”; compression of one human chromosome reportedly took 24 hours (sloooow).

Otu et al.: “Lempel-Ziv Distance”
Run LZ compression on Genome A. Use the Genome A dictionary to compress Genome B. Log compression ratio (B given A vs. B given B) ≈ distance(B, A). Easy to implement. Linear running time. Drawback: dictionary size effects.

Qi et al.: Composition Vector
Calculate the distributions of the K-tuples. For K=1: nucleotide/amino acid frequencies. For K=5: 4^5 (20^5) possible 5-tuples for nucleotides (amino acids). Various methods for scoring distances. K=5 is reported as seemingly optimal.
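As a rough illustration of the K-tuple distributions mentioned on this slide, the sketch below counts overlapping K-mers and normalizes them to frequencies. It covers only the counting step: the scoring of distances between vectors used by Qi et al. is not reproduced here, and the function name `composition_vector` and the default DNA alphabet are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def composition_vector(seq, k, alphabet="ACGT"):
    """Return the frequencies of all len(alphabet)**k possible k-tuples in seq.

    Hypothetical helper for illustration; overlapping k-tuples are counted.
    Tuples containing symbols outside `alphabet` count toward the total but
    have no slot in the returned vector.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return [counts["".join(t)] / total for t in product(alphabet, repeat=k)]


if __name__ == "__main__":
    # K=1 gives plain nucleotide frequencies, in the order A, C, G, T.
    print(composition_vector("AACG", 1))
    # K=5 over DNA gives a vector with 4**5 entries.
    print(len(composition_vector("ACGTACGTAC", 5)))
```

For proteomes the same function could be called with a 20-letter amino acid alphabet, which gives the 20^5 entries mentioned on the slide for K=5.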

Our Approach: Average Common Substring (ACS)
For every position in Genome A, find the longest common substring in Genome B.
Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT

Our Approach: ACS (cont.)
For every position i in Genome A, find the length l(i) of the longest common substring in Genome B. In this case, l(i)=5. ACS = average l(i) = L(Genome A, Genome B).
Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
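The definition on this slide can be sketched directly in code. The version below is deliberately naïve (quadratic or worse) and is meant only to pin down what is being averaged; the function names are hypothetical, and the two strings are the Genome A and Genome B shown on the slide.

```python
def longest_match_lengths(a, b):
    """For each start position i in a, the length of the longest substring
    a[i:i+k] that occurs anywhere in b (naive scan, illustration only)."""
    lengths = []
    for i in range(len(a)):
        k = 0
        # Extend the match one character at a time while it still occurs in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        lengths.append(k)
    return lengths


def acs(a, b):
    """Average common substring length L(a, b)."""
    lengths = longest_match_lengths(a, b)
    return sum(lengths) / len(lengths)


GENOME_A = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
GENOME_B = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"

if __name__ == "__main__":
    # The first position of Genome A matches "AGGCT" in Genome B, length 5.
    print(longest_match_lengths(GENOME_A, GENOME_B)[0])
    print(acs(GENOME_A, GENOME_B))
```

Note that `acs(a, b)` and `acs(b, a)` generally differ, which is why a symmetric distance is needed later.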

From ACS to Our Distance: Intuition
High L(A, B) indicates higher similarity. Should normalize to account for the length of B. Now I will show how we use this Average Common Substring (ACS) function to calculate our distance. It is easy to see that higher ACS values indicate higher similarity. [click] For a given A, the longer B is, the higher the ACS tends to be. Thus, to account for B’s length, we normalize by dividing by the log of the length of B. The log function is used for theoretical reasons. This is a similarity measure, while we would like a distance, so we take the inverse. Then we subtract a correction term that guarantees that the distance of a sequence from itself always equals zero. We’ll denote this measure by D tilde. This measure is not symmetric, so we define a symmetric measure Ds, which is the following sum. [final click]

High L(A, B) indicates higher similarity. Should normalize to account for the length of B. Still, we want distance rather than similarity. And we want to have D(A, A) = 0. Finally, we want to ensure symmetry.
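The steps listed above can be written out as follows. This is a sketch reconstructed from the verbal description only, since the slide's own formulas did not survive in the transcript: the notation \tilde{D} follows the "D tilde" of the speaker notes, |A| for the length of A is an assumption, the log base is left unspecified, and whether the symmetric sum carries a factor of 1/2 is not stated in the notes.

```latex
\begin{align*}
L(A,B) &= \frac{1}{|A|}\sum_{i=1}^{|A|} \ell(i)
  && \text{(ACS of $A$ with respect to $B$)} \\
\frac{L(A,B)}{\log |B|}
  &&& \text{(similarity, normalized for the length of $B$)} \\
\tilde{D}(A,B) &= \frac{\log |B|}{L(A,B)} - \frac{\log |A|}{L(A,A)}
  && \text{(inverse, minus the correction term)} \\
\tilde{D}(A,A) &= \frac{\log |A|}{L(A,A)} - \frac{\log |A|}{L(A,A)} = 0 \\
D_s(A,B) &= \tilde{D}(A,B) + \tilde{D}(B,A)
  && \text{(symmetric)}
\end{align*}
```

Under the definition of ACS, comparing A with itself gives \ell(i) = |A| - i + 1, so L(A,A) = (|A|+1)/2 and the correction term is roughly 2\log|A| / |A|, which is negligible for long sequences.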

Comparison to Human (H)
Species                  Proteome size   L(H,*)   Ds(H,*)
Mus musculus (mouse)     12x10^6         22.97    2.11
Arabidopsis thaliana     11x10^6         5.29     5.56
S. cerevisiae (yeast)    2x10^6          4.82     8.97
E. coli                  0.9x10^6        4.57     9.13
This table presents a few sample figures from the comparison of human with four model organisms: mouse, Arabidopsis, yeast and E. coli. As expected, the ACS decreases as the organism becomes evolutionarily more distant, and our distance increases accordingly.

What Good is this Weird Measure?
1) Our “ACS distance” is related to an information theoretic measure that is close to the Kullback-Leibler relative entropy between two distributions. 2) The proof of the pudding is in the eating: we will show that this “weird measure” is empirically good.

An Info Theoretic Measure
Define D̃(p‖q) = number of bits required to describe distribution p, given q. D̃(p‖q) is closely related to the Kullback-Leibler relative entropy. The entropy is the minimal number of bits required, on average, to describe a random variable. The relative entropy is a measure of distance between two distributions. Our distance is the sum of both and represents the number of bits required to describe A given B. Our symmetric distance is hence the number of bits required to describe A given B plus the number of bits required to describe B given A.
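The statement in the notes that the distance is "the sum of both" (entropy and relative entropy) can be spelled out with the standard definitions. The symbol \tilde{D}(p\|q) is an assumed notation, since the slide's own symbol was lost in the transcript; the identities themselves are textbook facts for discrete distributions.

```latex
\begin{align*}
H(p) &= -\sum_x p(x)\log p(x) \\
D_{KL}(p\|q) &= \sum_x p(x)\log\frac{p(x)}{q(x)} \\
\tilde{D}(p\|q) &= H(p) + D_{KL}(p\|q) = -\sum_x p(x)\log q(x)
\end{align*}
```

When q = p the relative entropy vanishes and only H(p) remains, which matches the reading "bits needed to describe p given q": knowing the true distribution, the cost is just the entropy.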

An Info Theoretic Measure
Both D̃(p‖q) and the Kullback-Leibler relative entropy are common “distance measures” between two probability distributions p and q. Neither “distance” is symmetric, and neither satisfies the triangle inequality.

Relations Between ACS and D̃
Suppose p and q are Markovian probability distributions on strings, and A, B are generated by them. Abraham Wyner (1993) showed a relation between the two that holds with high probability (w.h.p.).

Implementation and Complexity
Computing the distance between two genomes of length k: a naïve implementation requires O(k^2) time (a disaster on genomes billions of letters long). With suffix trees/arrays, the total computation time is O(k) (much nicer).
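To make the linear-time claim concrete, here is a sketch that computes the ACS in time linear in the two lengths for a fixed alphabet. It uses a suffix automaton rather than the suffix trees/arrays named on the slide, a swapped-in structure chosen only because it is compact to write; the function names are hypothetical. Both strings are reversed so that the usual "longest match ending here" statistic becomes "longest match starting here".

```python
def build_suffix_automaton(s):
    """Standard online suffix automaton construction for string s."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter maximal length.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def acs_linear(a, b):
    """Average over start positions of a of the longest match found in b."""
    nxt, link, length = build_suffix_automaton(b[::-1])
    state, cur_len, total = 0, 0, 0
    for ch in reversed(a):
        # Shorten the current match until it can be extended by ch.
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0  # at the root with no transition: no match at all
        total += cur_len
    return total / len(a)


if __name__ == "__main__":
    print(acs_linear("aab", "ab"))
```

On genome-scale inputs a memory-lean suffix array would likely be preferred over Python dictionaries; the sketch is only meant to show that one pass over A suffices once an index of B exists.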

Results and Comparisons
Many genomes and proteomes. Small ribosomal subunit ML tree. Comparison to other whole-genome methods. Quantitative and qualitative evaluation.

Four Datasets Used
Benchmark dataset: 75 species. 191 species (all non-viral proteomes in NCBI). 1,865 viral genomes. 34 mitochondrial DNA of mammals (same as Li et al.).

Benchmark Dataset – 75 Species
Genomes and proteomes of archaea, bacteria and eukarya. Tree topologies reconstructed from the distance matrix using Neighbor Joining (Saitou and Nei 1987). Reference tree and distance matrix obtained from the RDP (Ribosomal Database Project).
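For readers unfamiliar with the tree-building step, the following is a compact sketch of Neighbor Joining (Saitou and Nei 1987) that returns only the topology as nested tuples, with branch lengths omitted. The function name and the small five-taxon matrix are illustrative assumptions, not data from the talk.

```python
def neighbor_joining(names, matrix):
    """Return an unrooted tree topology as a tuple of three subtrees.

    names:  list of taxon labels
    matrix: symmetric list-of-lists of pairwise distances
    """
    clusters = {i: name for i, name in enumerate(names)}
    d = {(i, j): matrix[i][j] for i in clusters for j in clusters if i != j}
    next_id = len(names)
    while len(clusters) > 3:
        ids = list(clusters)
        n = len(ids)
        total = {i: sum(d[i, k] for k in ids if k != i) for i in ids}
        # Join the pair minimizing Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(
            ((a, b) for a in ids for b in ids if a < b),
            key=lambda p: (n - 2) * d[p] - total[p[0]] - total[p[1]],
        )
        new = next_id
        next_id += 1
        # Distance from the new internal node to every remaining cluster.
        for k in ids:
            if k not in (i, j):
                d[new, k] = d[k, new] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        clusters[new] = (clusters.pop(i), clusters.pop(j))
    return tuple(clusters.values())


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    dist = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(neighbor_joining(taxa, dist))
```

In the pipeline described on the slide, the matrix would hold the pairwise ACS-based distances between the 75 benchmark species.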

Results: Quantitative Evaluations
Benchmark dataset: genomes/proteomes of 75 species from archaea, bacteria and eukarya. Methods tested: ACS (ours); “Lempel-Ziv complexity” (Otu and Sayood); K-mer composition vectors (Qi et al.). In order to evaluate our method we used a benchmark dataset consisting of the complete genomes and proteomes of 75 species from Bacteria, Archaea and Eukarya. Our purpose was to compare our method with previously published methods that accept the same kind of input. [click] To that end we employed two other methods: the first, due to Otu and Sayood, is based on Lempel-Ziv complexity; the second, due to Qi, Wang and Hao, is based on K-mer composition vectors. Using these methods and ours we computed three matrices of pairwise distances, then reconstructed the three corresponding trees using the Neighbor Joining algorithm.
[Figure: pipeline from the tested methods to a pairwise distance matrix over taxa A to E, to an NJ tree, to tree evaluation.]

Results: Quantitative Evaluations
Tree evaluation. Reference tree: “accepted” tree obtained from the Ribosomal Database Project (Cole et al. 2003). Tree distance: Robinson-Foulds (1981). In order to quantitatively evaluate these trees, they were compared to an “accepted” reference tree obtained from the Ribosomal Database Project. This was done using a standard measure for comparing phylogenetic trees, the Robinson-Foulds distance. I will now elaborate on the computation of this metric.
[Figure: pipeline from the tested methods to a pairwise distance matrix over taxa A to E, to an NJ tree, to tree evaluation.]

Robinson-Foulds Distance
Each tree edge partitions the species into 2 sets. Search for partitions that exist in only one of the trees. Distance = number of edges inducing partitions that exist in only one of the trees. For n leaves, the distance ranges from 0 through 2n-6.
Note that each tree edge induces a partition of the species into 2 sets. The algorithm counts the number of edges that induce a partition existing in one tree but not in the other. [click] In this example the edge denoted by x partitions Tree A into A,B and C,D,E. This partition exists in Tree B. However, the edge denoted by y partitions Tree A into A,B,C and D,E. This partition does not exist in Tree B. The distance is, as mentioned, the number of edges that induce partitions existing in only one of the trees. For n leaves the distance ranges from 0 through 2n minus 6.
[Figure: Tree A and Tree B on leaves A, B, C, D, E. Edge x induces the partition A,B | C,D,E, common to both trees; edge y induces A,B,C | D,E, which is not in Tree B.]
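The counting rule above can be sketched as a symmetric difference between the sets of partitions induced by internal edges. Trees are written as nested tuples, and the function names are hypothetical. Tree A below matches the partitions named in the notes; the exact shape of Tree B is an assumption, chosen so that it shares A,B | C,D,E but lacks A,B,C | D,E.

```python
def bipartitions(tree):
    """Set of non-trivial leaf partitions induced by the edges of a tree
    given as nested tuples with string leaves."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # Edges next to a single leaf induce trivial partitions; skip them.
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree1, tree2):
    """Number of partitions present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


TREE_A = (("A", "B"), "C", ("D", "E"))
TREE_B = (("A", "B"), "E", ("D", "C"))  # assumed shape, see lead-in

if __name__ == "__main__":
    # A,B,C | D,E is only in Tree A; A,B,E | C,D is only in Tree B.
    print(robinson_foulds(TREE_A, TREE_B))
```

With n = 5 leaves the maximum possible value is 2n - 6 = 4, reached when two fully resolved trees share no non-trivial partition.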

Robinson-Foulds Distance - Results
Benchmark set has n=75 species, so the maximum distance is 144.
Method               Genomes   Proteomes
LZ complexity        118       126
Composition vector   110       92
ACS (our method)     108       76
Back to our trees. Our dataset contains 75 species, so the maximal Robinson-Foulds distance in this case is 144. The table presents the results of comparing the different methods with the reference tree. We expect a good tree to be one that mainly agrees with the reference tree, but most likely won’t be identical to it. As you may notice, for both genome and proteome inputs our tree is the one that best agrees with the accepted phylogeny, with the proteome outcome of our method being by far the closest.

All Proteomes Dataset
191 proteomes from NCBI Genome: 11 Eukarya, 19 Archaea, 161 Bacteria. Compared to NCBI Taxonomy.
Encouraged by the results, we next applied our method to large-scale datasets. The first consisted of all 191 available proteomes in NCBI Genome as of last October. Here is the tree that was produced:
Our method has correctly partitioned the species into the 3 domains of life: Eukaryota, Archaea and Bacteria, excluding 2 archaeal species which will be discussed later.
Within the 11 eukaryote species the evolutionary relationships reflect the accepted taxonomy perfectly. This isn’t surprising, given the major differences between the representative eukaryotes in the tree.
The bacterial and archaeal domains are more challenging. The archaeal species have generally been clustered in agreement with known taxonomy. The only two archaeal species that seem to be “misplaced” in the tree among bacterial species are Nanoarchaeum and Halobacterium NRC-1. [click] Nanoarchaeum is one of a few known archaeal parasites, missing multiple genes for lipid, cofactor, amino acid, or nucleotide biosynthesis in its genome, which makes it very difficult to place correctly on the tree. It was clustered together with endosymbiotic bacteria. [click] Halobacterium NRC-1 is the only archaeal species present from the entire Halobacteria class, which might explain the difficulty of its classification (long branch attraction). This class contains organisms that are tolerant to extreme environments. The Halobacteria are reported to lack basic proteins and have a highly acidic proteome, unlike other archaeal species. It was clustered in our tree close to other stress-resistant species such as D. radiodurans and T. thermophilus.
Comparing the Bacteria to accepted taxonomy reveals that overall the tree is in very good agreement with taxonomic knowledge. This is the case especially at the lower levels of genera, families and classes. The accuracy decreases for higher taxonomic groups, a common problem for whole-genome approaches.

Viral Forest
1,865 viral genomes from EBI. Split into super-families: dsDNA; ssDNA; dsRNA; ssRNA positive; ssRNA negative; Retroids; Satellite nucleic acid.
The second large-scale dataset we used included all viral genomes from EBI. Viruses are known to be partitioned into a number of super-families according to their nucleic acid content: DNA or RNA, double stranded or single stranded, positive sense or negative sense. Two other super-families are the retroids and the satellite nucleic acids, which are naked RNA viruses. Each of these super-families is believed to have a different evolutionary origin. We collected 1,865 viral genomes and applied the ACS method to generate a tree for each of these super-families. I’ll show two of those trees…

Retroid Tree
83 reverse-transcribing viruses: Hepatitis B viruses (avian, mammalian), circular dsDNA, ssRNA.
The following tree was constructed using the genomes of 83 viruses classified as retroid viruses, also known as reverse transcribing viruses. [click] The partition of the viruses into the 3 main families of retroids (Hepatitis B viruses, circular dsDNA and ssRNA retroids) is supported by our tree. The Hepatitis B viruses are divided into mammalian and avian genera, which were correctly distinguished by our algorithm. At present the Ross Goose Hepatitis B virus and the Arctic Squirrel Hepatitis virus are not classified. Our algorithm places the Ross Goose virus with the avian hepatitis viruses and the Arctic Squirrel virus within the mammalian hepatitis viruses, which makes some sort of sense…. As you may notice from the indications on the tree, the partitioning of the other genera seems satisfactory as well, with only a few disagreements with accepted taxonomy.

ssRNA Negative Tree
Each segment treated separately: 174 segments of 74 viruses.
The next tree is of negative sense ssRNA viruses. [click] Many of these viruses’ genomes are segmented into 2, 3 or more segments. In this analysis we treated each segment separately. [click] This kind of analysis enables us to check whether different segments of the same virus have different origins. In some contexts this seems to be the case, for example the Tospoviruses: as you can see, the small and medium segments are clustered together over here, while the large segments are placed at a distant location in the tree. Another interesting issue is that the genomes, or genome segments, used in this analysis were very small, sometimes less than 1 Kbp. Such small sequences contain relatively little information for an algorithm like the one we used, which theoretically requires long strings. Nevertheless, the tree created by our method agrees nicely with the accepted viral taxonomy on several levels.
In general this tree largely agrees with accepted taxonomy. This is rather surprising, since our information theoretic method theoretically requires long strings to be accurate (the guarantee is asymptotic), and yet, given the limited information content of the viral genomes, the method reconstructs a tree in good agreement with the known taxonomy on several levels.

Mammalian mtDNA Tree

Throwing Branch Lengths In
Intelligent Design ?

General Insights
Proteomes vs. genomes. Overlapping vs. non-overlapping. Triangle inequality held in all cases.

Additional Directions Attempted
Naïve introduction of mismatches. Division into segments. Weighted combinations of genome and proteome data. Bottom line (subject to change): simple is beautiful.

Summary
Whole genome phylogeny based on the ACS method. Effective algorithm. Information theoretic justification. Successful reconstruction of known phylogenies.
In this work we presented a novel method, the ACS algorithm, for phylogenetic reconstruction based on complete genome or proteome sequences. As with any new reconstruction method, its adequacy will be determined with time, when sufficient experience is gained. We believe that our ACS approach is promising and that its outcomes are interesting, so that overall this is an important step in the direction of constructing whole genome or proteome phylogenies. The experimental results support further exploration of the proposed method. However, this work is certainly not the last algorithmic word in this direction, and many improvements remain to be discovered and developed. [click] One direction is the computation of the statistical significance of each of the branches. This will enable a more meaningful analysis of the reconstructed trees. Another example is the combination of proteomic and genomic data. We believe that combining those two sources of information can improve the quality of the reconstruction.

Future Work
Additional datasets. Statistical significance. Improved branch length estimation. Better time and space complexities.

Questions ?
