Information Theoretic Approach to Whole Genome Phylogenies

Information Theoretic Approach to Whole Genome Phylogenies David Burstein Igor Ulitsky Tamir Tuller Benny Chor School Of Computer Science Tel Aviv University

Tree of Life
"I believe it has been with the tree of life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications..." Charles Darwin, 1859
Since Charles Darwin, the evolutionary history of extinct and extant species has been described as a tree. Each node represents a species, and each split represents the speciation of a common ancestor into two or more different species. The goal of phylogenetics is to reconstruct trees that reflect the evolutionary relationships between species.

Accepted Evolutionary Model: Trees
- Initial period: primordial soup, where "you are what you eat". Recombination events. Horizontal transfers.
- Formation of distinct taxa.
- Speciation events induce a tree-like evolution.

Phylogenetic Trees Based on What?
- Morphology
- Single genes
- Whole genomes

Whole Genome Phylogenies: Motivation
Cons for single-gene trees:
- Require preprocessing
- Gene duplications
- Often too sensitive
Pros for whole-genome trees:
- Fully automatic
- More information
- Seems essential in viruses
What about proteome trees? Less "noise", but they do require preprocessing.

Whole Genome Phylogenies: Challenges
- Very large inputs: up to 5G bp long
- Extreme length variability (5G to 1M bp)
- No meaningful alignment
- Different segments experienced different evolutionary processes

Previous Approaches
- Genome rearrangements (Hannenhalli & Pevzner 1995, ...)
- Gene/domain contents (Snel et al. 1999, ...)
- Li et al. (2001): "Kolmogorov complexity"
- Otu et al. (2003): "Lempel-Ziv compression"
- Qi et al. (2004): Composition vectors
Common approach (ours too):
- Compute pairwise distances
- Build a tree from the distance matrix (e.g. using Neighbor Joining, Saitou and Nei 1987)
Our work is not the first to reconstruct phylogenetic trees from whole genomes. Earlier works in this field were based on genome rearrangements and on gene content. While these methods seem to be very effective for evolutionarily close species, they might be problematic when dealing with distant species. In the last few years a number of works inspired by information theory were presented; some of them appear on the slide. These works are close in nature to our method. They are essentially distance methods: their first step is to calculate all pairwise distances between the genomes, and then the tree is reconstructed from the distance matrix. Our work presents a novel method based on information theory, which we believe will contribute to the field of whole-genome and whole-proteome phylogenetics.

Genome Rearrangements
- Emphasis on finding the best sequence of rearrangements
Drawbacks:
- Requires manual definition of blocks
- Disregards changes within the block

Gene/Domain Content
- Each genome is mapped to an equal-length Boolean vector
- Various tree construction methods
Drawbacks:
- Requires gene/domain definition/knowledge
- Disregards most of the genetic information

Ming Li et al.: "Kolmogorov Complexity"
- Kolmogorov complexity (KC) is a wonderful measure
- But ... it is not computable
- "Approximate" KC by compression
Drawbacks:
- Justification of the "approximation"
- Compression of one human chromosome reportedly took 24 hours (slow).
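
The "approximate KC by compression" idea can be illustrated with a small sketch. This is not Li et al.'s compressor or their exact formula: it swaps in the normalized compression distance and Python's general-purpose zlib, and the function names are made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input, a crude stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (an assumption here, not the slide's formula).

    Small when knowing one sequence helps compress the other.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A general-purpose compressor such as zlib only looks back over a small window, so on real genomes this sketch would miss long-range repeats; it is meant to show the shape of the computation only.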

Otu et al.: "Lempel-Ziv Distance"
- Run LZ compression on Genome A.
- Use Genome A's dictionary to compress Genome B.
- Log compression ratio (B given A vs. B given B) ≈ distance(B, A)
- Easy to implement
- Linear running time
- Drawback: dictionary size effects

Qi et al.: Composition Vector
- Calculate distributions of the K-tuples.
- For K=1: nucleotide/amino acid frequencies.
- For K=5: 4^5 = 1,024 possible nucleotide 5-tuples (20^5 = 3,200,000 for amino acids).
- Various methods for scoring distances
- Report K=5 as seemingly optimal
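
A minimal sketch of a K-tuple composition vector, assuming plain relative frequencies over a DNA alphabet. Any further normalization used by Qi et al. is not shown on the slide and is omitted here; the cosine-based distance is just one of the "various methods" one could pick, and all names are illustrative.

```python
import math
from collections import Counter
from itertools import product


def composition_vector(seq, k, alphabet="ACGT"):
    """Relative frequency of every possible k-tuple over `alphabet`, in lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1  # avoid dividing by zero on very short input
    return [counts["".join(t)] / total for t in product(alphabet, repeat=k)]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm
```

For nucleotides and K=5 the vector has 4^5 entries, matching the count on the slide.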

Our Approach: Average Common Substring (ACS)
For every position in Genome A, find the longest common substring in Genome B.
Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT

Our Approach: ACS (cont.)
For every position i in Genome A, find l(i), the length of the longest substring starting at i that also occurs somewhere in Genome B. For the position highlighted on the slide, l(i) = 5.
ACS = average of l(i) over all positions = L(Genome A, Genome B)
Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
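
The ACS definition can be written down directly as a naive sketch. The function name `acs` is illustrative, and this version is far too slow for real genomes (quadratic or worse); it only makes the definition concrete.

```python
def acs(a: str, b: str) -> float:
    """Average, over positions i of a, of the longest substring a[i:i+k] that occurs in b."""
    total = 0
    for i in range(len(a)):
        k = 0
        # Extend the match one letter at a time while it still occurs somewhere in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        total += k
    return total / len(a)
```

For example, with a = "ACGT" and b = "CG" the per-position lengths are 0, 2, 1, 0, so the average is 0.75.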

From ACS to Our Distance: Intuition
- High L(A, B) indicates higher similarity.
- Should normalize to account for length of B.
Now I will show how we use this Average Common Substring (ACS) function to calculate our distance. It is easy to see that higher ACS values indicate higher similarity. For a given A, the longer B is, the higher the ACS tends to be. Thus, to account for B's length, we normalize by dividing by the log of the length of B; the log function is used for theoretical reasons. This is a similarity measure, while we would like a distance, so we take the inverse. We then subtract a correction term that guarantees that the distance of a sequence from itself always equals zero. We denote this measure by D tilde. It is not symmetric, so we define a symmetric measure Ds as the sum of the two directed measures.

From ACS to Our Distance: Intuition
- High L(A, B) indicates higher similarity.
- Should normalize to account for length of B.
- Still, we want distance rather than similarity.
- And want to have D(A, A) = 0.
- Finally, we want to ensure symmetry.
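
The slide formulas did not survive, so the following is a sketch reconstructed from the verbal description only: normalize by log|B|, invert, subtract a self-term so that D(A, A) = 0, then sum both directions. The base-2 logarithm, the exact form of the correction term (log|A| / L(A, A)), and all function names are assumptions.

```python
import math


def acs(a: str, b: str) -> float:
    """Naive average common substring length L(a, b)."""
    total = 0
    for i in range(len(a)):
        k = 0
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        total += k
    return total / len(a)


def directed_distance(a: str, b: str) -> float:
    """log|B| / L(A, B) minus a self-term, so that the distance from A to itself is 0."""
    l_ab = acs(a, b)
    if l_ab == 0:
        return math.inf  # no shared letters at all
    return math.log2(len(b)) / l_ab - math.log2(len(a)) / acs(a, a)


def acs_distance(a: str, b: str) -> float:
    """Symmetric measure: the sum of the two directed distances."""
    return directed_distance(a, b) + directed_distance(b, a)
```

Because no match starting at a position can be longer than the match of A against itself, the directed value is never negative under these assumptions.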

Comparison to Human (H)
Species                   Proteome size   L(H,*)   Ds(H,*)
Mus musculus (mouse)      12x10^6         22.97    2.11
Arabidopsis thaliana      11x10^6         5.29     5.56
S. cerevisiae (yeast)     2x10^6          4.82     8.97
E. coli                   0.9x10^6        4.57     9.13
This table presents a few sample figures from the comparison of human with four model organisms: mouse, Arabidopsis, yeast and E. coli. As expected, the ACS decreases as the organism becomes evolutionarily more distant, and our distance increases accordingly.

What Good is this Weird Measure?
1) Our "ACS distance" is related to an information theoretic measure that is close to the Kullback-Leibler relative entropy between two distributions.
2) The proof of the pudding is in the eating: we will show this "weird measure" is empirically good.

An Info Theoretic Measure
- Define [symbol missing] = number of bits required to describe distribution p, given q.
- [symbol missing] is closely related to the Kullback-Leibler relative entropy.
The entropy is the minimal number of bits, on average, required to describe a random variable. The relative entropy is a measure of distance between two distributions. Our distance is the sum of both and represents the number of bits required to describe A given B. Our symmetric distance is hence the number of bits required to describe A given B plus the number of bits required to describe B given A.

An Info Theoretic Measure (cont.)
- Both [symbol missing] and [symbol missing] are common "distance measures" between two probability distributions p and q.
- Neither "distance" is symmetric, and neither satisfies the triangle inequality.
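
The quantities named in the notes can be illustrated on small discrete distributions. Reading "the sum of entropy and relative entropy" as the cross-entropy is an interpretation rather than something the slides state, and the function names are illustrative.

```python
import math


def entropy(p):
    """H(p) in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def cross_entropy(p, q):
    """Bits needed to describe samples from p using a code built for q."""
    return -sum(x * math.log2(y) for x, y in zip(p, q) if x > 0)


p, q = [0.5, 0.5], [0.9, 0.1]
asymmetric = kl_divergence(p, q) != kl_divergence(q, p)
```

With these two distributions the divergence differs by direction, which is the lack of symmetry the slide points out, and the cross-entropy equals entropy plus relative entropy up to rounding.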

Relations Between ACS and [symbol missing]
- Suppose p and q are Markovian probability distributions on strings, and A, B are generated by them.
- Abraham Wyner (1993) showed that w.h.p. [formula missing]

Implementation and Complexity
Computing the distance between two genomes of length k:
- Naive implementation requires O(k^2) time (a disaster on genomes billions of letters long).
- With suffix trees/arrays, total computation time is O(k) (much nicer).
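
One way to reach linear time can be sketched with a suffix automaton, swapped in here for the suffix trees/arrays named on the slide because it is shorter to write. It finds, for each position, the longest match ending there, so both strings are read in reverse to get matches starting at each position. All names are illustrative.

```python
def build_suffix_automaton(s):
    """Standard online construction; returns (transitions, suffix links, state lengths)."""
    nxt, link, length, last = [{}], [-1], [0], 0
    for ch in s:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def acs_linear(a: str, b: str) -> float:
    """Same value as the naive ACS, in time roughly linear in len(a) + len(b)."""
    nxt, link, length = build_suffix_automaton(b[::-1])
    state = matched = total = 0
    for ch in reversed(a):
        # Shorten the current match until it can be extended by ch (or becomes empty).
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            matched = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            matched += 1
        else:
            matched = 0
        total += matched
    return total / len(a)
```

The dictionary-based transitions keep the sketch short; a memory-conscious implementation for billion-letter genomes would more likely use suffix arrays, as the slide suggests.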

Results and Comparisons
- Many genomes and proteomes
- Small ribosomal subunit ML tree
- Compare to other whole-genome methods
- Quantitative and qualitative evaluation

Four Datasets Used
- Benchmark dataset: 75 species
- 191 species (all non-viral proteomes in NCBI)
- 1,865 viral genomes
- 34 mitochondrial DNA of mammals (same as Li et al.)

Benchmark Dataset: 75 Species
- Genomes and proteomes of archaea, bacteria and eukarya
- Tree topologies reconstructed from the distance matrix using Neighbor Joining (Saitou and Nei 1987)
- Reference tree and distance matrix obtained from the RDP (ribosomal database)
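
The tree-building step can be sketched as a compact Neighbor Joining routine that returns only the topology, as nested tuples, and skips branch lengths. The function name and input layout (a list of labels plus a square matrix) are illustrative choices, not taken from the talk.

```python
def neighbor_joining(names, dist):
    """Topology-only Neighbor Joining. `dist[i][j]` is the distance between names[i] and names[j]."""
    nodes = list(names)
    d = {}
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            d[a, b] = dist[i][j]
    while len(nodes) > 2:
        n = len(nodes)
        r = [sum(d[a, b] for b in nodes) for a in nodes]  # row sums
        # Join the pair minimizing the NJ criterion; ties go to the lowest indices.
        _, i, j = min(
            ((n - 2) * d[nodes[i], nodes[j]] - r[i] - r[j], i, j)
            for i in range(n) for j in range(i + 1, n)
        )
        a, b = nodes[i], nodes[j]
        merged = (a, b)
        rest = [c for k, c in enumerate(nodes) if k not in (i, j)]
        for c in rest:
            # Distance from the new internal node to every remaining node.
            d[merged, c] = d[c, merged] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        d[merged, merged] = 0.0
        nodes = rest + [merged]
    return tuple(nodes)
```

On an additive four-taxon matrix in which A and B are neighbors and C and D are neighbors, this groups A with B and C with D.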

Results: Quantitative Evaluations
Benchmark dataset:
- Genomes/proteomes of 75 species from archaea, bacteria and eukarya.
Methods tested:
- ACS (ours)
- "Lempel-Ziv complexity" (Otu and Sayood)
- K-mer composition vectors (Qi et al.)
In order to evaluate our method we used a benchmark dataset consisting of the complete genomes and proteomes of 75 species from Bacteria, Archaea and Eukarya. Our purpose was to compare our method with previously published methods that accept the same kind of input. To that end we employed two other methods: the first, due to Otu and Sayood, is based on Lempel-Ziv complexity; the second, due to Qi, Wang and Hao, is based on K-mer composition vectors. Using these methods and ours, we computed three matrices of pairwise distances and then reconstructed the three corresponding trees using the Neighbor Joining algorithm.
[Figure: tested methods, an example pairwise distance matrix for taxa A-E, NJ, tree evaluation]

Results: Quantitative Evaluations (cont.)
Tree evaluation:
- Reference tree: "accepted" tree obtained from the Ribosomal Database Project (Cole et al. 2003)
- Tree distance: Robinson-Foulds (1981)
In order to evaluate these trees quantitatively, they were compared to an "accepted" reference tree obtained from the Ribosomal Database Project. This was done using a standard measure for comparing phylogenetic trees, the Robinson-Foulds distance. I will now elaborate on the computation of this metric.

Robinson-Foulds Distance
- Each tree edge partitions the species into 2 sets.
- Search for partitions that exist in only one of the trees.
Note that each tree edge induces a partition of the species into 2 sets. The algorithm counts the number of edges that induce a partition existing in one tree but not in the other. In this example, the edge denoted by x partitions Tree A into A,B and C,D,E. This partition exists in Tree B. However, the edge denoted by y partitions Tree A into A,B,C and D,E. This partition does not exist in Tree B. The distance is, as mentioned, the number of edges that induce partitions existing in only one of the trees. For n leaves the distance ranges from 0 through 2n-6.
[Figure: Tree A and Tree B on leaves A-E; edge x of Tree A induces the partition A,B | C,D,E, which is common to both trees.]

Robinson-Foulds Distance (cont.)
- Distance = number of edges inducing partitions that exist in only one of the trees.
- For n leaves, the distance ranges from 0 through 2n-6.
[Figure: Tree A and Tree B on leaves A-E; edge y of Tree A induces the partition A,B,C | D,E, which is not in Tree B.]
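
The counting rule can be sketched for trees written as nested tuples of leaf names. The topology used for Tree B below is an assumption read off the figure residue (A and B together, C and D together), and the function names are illustrative.

```python
def leaves(tree):
    """Set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions, each stored as the side that lacks a fixed reference leaf."""
    all_leaves = frozenset(leaves(tree))
    ref = min(all_leaves)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - below if ref in below else below
        if 1 < len(side) < len(all_leaves) - 1:  # skip single-leaf and empty sides
            found.add(side)
        return below

    walk(tree)
    return found


def robinson_foulds(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


tree_a = (("A", "B"), "C", ("D", "E"))
tree_b = (("A", "B"), "E", ("D", "C"))
rf = robinson_foulds(tree_a, tree_b)
```

Under the assumed topologies the two trees share the A,B | C,D,E partition and each has one partition the other lacks, so the distance is 2 out of a possible 2n-6 = 4.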

Robinson-Foulds Distance: Results
Benchmark set has n=75 species, so the maximal distance is 144.
Method               Genomes   Proteomes
LZ complexity        118       126
Composition vector   110       92
ACS (our method)     108       76
Back to our trees. Our dataset contains 75 species, so the maximal Robinson-Foulds distance in this case is 144. The table presents the results of comparing the different methods with the reference tree. We expect a good tree to be one that mainly agrees with the reference tree, but most likely won't be identical to it. As you may notice, for both genome and proteome inputs our tree is the one that best agrees with the accepted phylogeny, with the proteome outcome of our method being by far the closest.

All Proteomes Dataset
- 191 proteomes from NCBI Genome
- 11 Eukarya, 19 Archaea, 161 Bacteria
- Compared to NCBI Taxonomy
Encouraged by the results, we next applied our method to large-scale datasets. The first consisted of all 191 proteomes available in NCBI Genome as of last October. Here is the tree that was produced:

All Proteomes Dataset (cont.)
Our method correctly partitioned the species into the three domains of life: Eukaryota, Archaea and Bacteria, excluding two archaeal species that are discussed below.
Within the 11 eukaryote species, the evolutionary relationships perfectly reflect the accepted taxonomy. This isn't surprising, given the major differences between the representative eukaryotes in the tree.
The bacterial and archaeal domains are more challenging. The archaeal species have generally been clustered in agreement with known taxonomy. The only two archaeal species that seem to be "misplaced" in the tree, among bacterial species, are Nanoarchaeum and Halobacterium NRC-1.
Nanoarchaeum is one of a few known archaeal parasites. Its genome is missing multiple genes for lipid, cofactor, amino acid, or nucleotide biosynthesis, which makes it very difficult to place correctly on the tree. It was clustered together with endosymbiotic bacteria.
Halobacterium NRC-1 is the only archaeal species present from the entire Halobacteria class, which might explain the difficulty of its classification (long branch attraction). This class contains organisms that are tolerant to extreme environments. The Halobacteria are reported to lack basic proteins and have a highly acidic proteome, unlike other archaeal species. It was clustered in our tree close to other stress-resistant species such as D. radiodurans and T. thermophilus.
Comparing the Bacteria to the accepted taxonomy reveals that, overall, the tree is in very good agreement with taxonomic knowledge. This is especially the case at the lower levels of genera, families and classes. The accuracy decreases for higher taxonomic groups, a problem common to whole-genome approaches.

Viral Forest
- 1,865 viral genomes from EBI
- Split into super-families: dsDNA, ssDNA, dsRNA, ssRNA positive, ssRNA negative, Retroids, Satellite nucleic acid
The second large-scale dataset we used included all viral genomes from EBI. Viruses are known to be partitioned into a number of super-families according to their nucleic acid content: DNA or RNA, double stranded or single stranded, positive sense or negative sense. Two other super-families are the retroids and the satellite nucleic acids, which are naked RNA viruses. Each of these super-families is believed to have a different evolutionary origin. We collected 1,865 viral genomes and applied the ACS method to generate a tree for each of these super-families. I'll show two of those trees.

Retroid Tree
83 Reverse-transcriptases:
- Hepatitis B viruses (avian, mammalian)
- Circular dsDNA
- ssRNA
The following tree was constructed using the genomes of 83 viruses classified as retroid viruses, also known as reverse-transcribing viruses. Our tree supports the partition of these viruses into the three main families of retroids: Hepatitis B viruses, circular dsDNA and ssRNA retroids. The Hepatitis B viruses are divided into mammalian and avian genera, which were correctly distinguished by our algorithm. At present, the Ross Goose Hepatitis B virus and the Arctic Squirrel Hepatitis virus are not classified. Our algorithm places the Ross Goose virus with the avian hepatitis viruses and the Arctic Squirrel virus with the mammalian hepatitis viruses, which makes some sort of sense. As you may notice from the indications on the tree, the partitioning into the other genera seems satisfactory as well, with only a few disagreements with accepted taxonomy.

ssRNA Negative Tree
- Each segment treated separately
- 174 segments of 74 viruses
The next tree is of negative-sense ssRNA viruses. Many of these viruses' genomes are segmented into 2, 3 or more segments. In this analysis we treated each segment separately. This kind of analysis enables us to check whether different segments of the same virus have different origins. In some contexts this seems to be the case, for example in Tospoviruses: the small and medium segments are clustered together, while the large segments are placed at a distant location in the tree.
Another interesting issue is that the genomes, or genome segments, used in this analysis were very small, sometimes less than 1 Kbp. Such small sequences contain relatively little information for an algorithm such as ours, which theoretically (asymptotically) requires long strings to be accurate. Nevertheless, and rather surprisingly, the tree created by our method agrees well with the accepted viral taxonomy on several levels.

Mammalian mtDNA Tree

Throwing Branch Lengths In Intelligent Design ?

General Insights
- Proteomes vs. genomes
- Overlapping vs. non-overlapping
- Triangle inequality held in all cases

Additional Directions Attempted
- Naive introduction of mismatches
- Division into segments
- Weighted combinations of genome and proteome data
Bottom line (subject to change): simple is beautiful.

Summary
- Whole genome phylogeny based on the ACS method
- Effective algorithm
- Information theoretic justification
- Successful reconstruction of known phylogenies
In this work we presented a novel method, the ACS algorithm, for phylogenetic reconstruction based on complete genome or proteome sequences. As with any new reconstruction method, its adequacy will be determined with time, when sufficient experience is gained. We believe that our ACS approach is promising and that its outcomes are interesting, so that overall this is an important step in the direction of constructing whole genome or proteome phylogenies. The experimental results support further exploration of the proposed method. However, this work is certainly not the last algorithmic word in this direction, and many improvements remain to be discovered and developed. One direction is computing the statistical significance of each branch, which would enable a more meaningful analysis of the reconstructed trees. Another is the combination of proteomic and genomic data: we believe that combining those two sources of information can improve the quality of the reconstruction.

Future Work
- Additional datasets
- Statistical significance
- Improved branch length estimation
- Better time and space complexities

Questions ?