1
Information Theoretic Approach to Whole Genome Phylogenies. David Burstein, Igor Ulitsky, Tamir Tuller, Benny Chor. School of Computer Science, Tel Aviv University
2
Tree of Life: “I believe it has been with the tree of life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications”... Charles Darwin, 1859
3
Accepted Evolutionary Model: Trees Initial period: Primordial soup, where “you are what you eat”. Recombination events. Horizontal transfers. Formation of distinct taxa. Speciation events induce a tree-like evolution.
4
Accepted Evolutionary Model: Trees Reconstructing this phylogenetic tree is the major challenge in evolutionary biology. But…
5
Phylogenetic Trees Based on What? 1. Morphology 2. Single genes 3. Whole genomes
6
Whole Genome Phylogenies: Motivation
Cons for single-gene trees:
  Require preprocessing
  Gene duplications
  Often too sensitive
Pros for whole-genome trees:
  Fully automatic
  More information
  Seems essential in viruses
What about proteome trees? Less “noise”, but they do require preprocessing.
7
Whole Genome Phylogenies: Biological Motivation. Recently (in the last 2-4 years) laboratory work showed that ~60% of the genome is transcribed to RNA, but this RNA is not translated into proteins. We are in the dark as to what this non-coding RNA does. But we should not ignore it and concentrate on just the 3% coding parts!
8
Whole Genome Phylogenies: Availability Due to sequencing techniques that were unthinkable just 15 years ago, we now have the complete genome sequences of hundreds of species, from all ranks and sizes of life. These sequences are publicly available. They are a true treasure for analysis.
9
Whole Genome Phylogenies: Challenges Very large inputs: Up to 5G bp long Extreme length variability (5G to 1M bp) No meaningful alignment Different segments experienced different evolutionary processes
10
Previous Approaches
Genome rearrangements (Hannenhalli & Pevzner 1995, …)
Gene/domain contents (Snel et al. 1999, …)
“IT” (information theoretic) approaches:
  Li et al. (2001): “Kolmogorov complexity”
  Otu et al. (2003): “Lempel Ziv compression”
  Qi et al. (2004): Composition vectors
Common approach (ours too):
  Compute pairwise distances
  Build a tree from the distance matrix (e.g. using Neighbor Joining, Saitou and Nei 1987)
11
Genome Rearrangements Emphasis on finding best sequence of rearrangements Drawbacks Requires manual definition of blocks Disregards changes within the block
12
Gene/Domain Content: Each genome is represented as a Boolean vector of equal length. Various tree construction methods. Drawbacks: Requires gene/domain definition/knowledge. Disregards most of the genetic information.
13
“Information Theoretic” Approaches
14
Ming Li et al.: “Kolmogorov Complexity”. Kolmogorov Complexity is a wonderful measure. But … it is not computable. “Approximate” KC by compression. Drawbacks: Justification of the “approximation”. Reportedly slow.
15
Otu et al.: “Lempel-Ziv Distance” Run LZ compression on genome A. Use Genome A dictionary to compress Genome B. Log compression ratio (B given A vs. B given B) ≈ distance (B, A) Easy to implement Linear running time Drawback: Dictionary size effects
16
Qi et al.: Composition Vector
Calculate distributions of the K-tuples.
For K=1: nucleotide/amino acid frequencies.
For K=5: 4^5 (20^5) possible 5-tuples.
Various methods for scoring distances.
Report K=5 as seemingly optimal.
[Figure: Genome A, Genome B]
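The K-tuple distributions described here can be sketched in a few lines. This is a minimal illustration, not the Qi et al. method itself: the function names `kmer_frequencies` and `l1_distance` are hypothetical, and the L1 score is only a stand-in for the "various methods for scoring distances" the slide mentions.

```python
from collections import Counter

def kmer_frequencies(seq, k):
    """Relative frequencies of all overlapping k-tuples in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def l1_distance(p, q):
    """Illustrative L1 distance between two sparse frequency vectors.

    Assumption: this scoring is a placeholder, not the score used by Qi et al.
    """
    keys = set(p) | set(q)
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)

# The two toy genomes shown on the ACS slides, compared at K=1 and K=2.
genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
for k in (1, 2):
    fa, fb = kmer_frequencies(genome_a, k), kmer_frequencies(genome_b, k)
    print(k, round(l1_distance(fa, fb), 3))
```

For nucleotides the vector has at most 4^K entries and for amino acids 20^K, which is why a sparse dictionary is used rather than a dense array.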
17
Our Approach: Average Common Substring (ACS)
For every position in Genome A, find the longest common substring in Genome B.
Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
18
Our Approach: ACS (cont.)
For every position in Genome A, find the longest common substring in Genome B.
Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
22
Our Approach: ACS (cont.)
For every position in Genome A, find the length of the longest substring starting there that also occurs in Genome B. In this case, l( )=5.
Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
23
Our Approach: ACS (cont.)
For every position in Genome A, find the length of the longest substring starting there that also occurs in Genome B. In this case, l( )=5.
ACS = average l( ) = L(Genome A, Genome B)
Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
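The definition of l( ) and of L(A, B) can be made concrete with a direct, deliberately naive sketch. The function names are hypothetical, and this version is meant for short toy strings only.

```python
def longest_match_lengths(a, b):
    """l(i) for each position i of a: length of the longest prefix of a[i:] occurring in b."""
    lengths = []
    for i in range(len(a)):
        length = 0
        # Extend the candidate substring while it still occurs somewhere in b.
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        lengths.append(length)
    return lengths

def acs(a, b):
    """L(a, b): the average of l(i) over all positions of a (a must be non-empty)."""
    lengths = longest_match_lengths(a, b)
    return sum(lengths) / len(lengths)

genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
print(acs(genome_a, genome_b), acs(genome_b, genome_a))
```

Note that `acs(a, b)` and `acs(b, a)` generally differ, so L by itself is not symmetric.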
24
From ACS to Our Distance: Intuition High L( A, B ) indicates higher similarity. Should normalize to account for length of B.
25
From ACS to Our Distance: Intuition High L( A, B ) indicates higher similarity. Should normalize to account for length of B. Still, we want distance rather than similarity.
27
From ACS to Our Distance: Intuition High L( A, B ) indicates higher similarity. Should normalize to account for length of B. Still, we want distance rather than similarity. And want to have D( A, A ) = 0.
28
From ACS to Our Distance: Intuition High L( A, B ) indicates higher similarity. Should normalize to account for length of B. Still, we want distance rather than similarity. And want to have D( A, A ) = 0. Finally, we want to ensure symmetry.
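The formulas that accompanied these steps are not preserved in this transcript. One distance that satisfies all five requirements is sketched below. It is a reconstruction under stated assumptions, not necessarily the exact expression on the slide: the symbol d for the asymmetric intermediate quantity, the base of the logarithm, and whether the symmetric sum is halved are all assumptions.

```latex
d(A,B) \;=\; \frac{\log |B|}{L(A,B)} \;-\; \frac{\log |A|}{L(A,A)},
\qquad
D_s(A,B) \;=\; d(A,B) + d(B,A)
```

Each term answers one bullet: taking the inverse of L(A,B) turns similarity into distance, log |B| normalizes for the length of B, the subtracted term forces d(A,A) = 0, and adding the two directions makes D_s symmetric. Since every suffix of A matches itself in full, L(A,A) is about |A|/2, so the correction term is tiny for long genomes. Multiplying D_s by a constant such as 1/2 leaves the Neighbor Joining topology unchanged.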
29
Comparison to Human (H)
Species                   Proteome size   L(H,*)   D_s(H,*)
Mus musculus (mouse)      12x10^6         22.97    2.11
Arabidopsis thaliana      11x10^6         5.29     5.56
S. cerevisiae (yeast)     2x10^6          4.82     8.97
E. coli                   0.9x10^6        4.57     9.13
30
What Good is this Weird Measure? 1) Our “ACS distance” is related to an information theoretic measure that is close to Kullback Leibler relative entropy between two distributions. 2) The proof of the pudding is in the eating: Will show this “weird measure” is empirically good.
31
An Info Theoretic Measure: Define [formula] = number of bits required to describe distribution p, given q. [formula] is closely related to Kullback-Leibler relative entropy.
32
An Info Theoretic Measure: Both [formula] and [formula] are common “distance measures” between two probability distributions p and q. In general, the two “distances” are neither symmetric, nor do they satisfy the triangle inequality.
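The lack of symmetry is easy to check numerically for the Kullback-Leibler relative entropy, using its standard definition D(p||q) = sum over x of p(x) log2(p(x)/q(x)). The sketch below assumes finite distributions given as lists over the same alphabet, with q(x) > 0 wherever p(x) > 0. The function name and the example distributions are illustrative only.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for two distributions over the same finite alphabet.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)   # extra bits paid for coding p with a code built for q
backward = kl_divergence(q, p)  # the reverse direction, generally a different number
print(forward, backward)
```

Swapping the arguments changes the value, which is why a symmetrized quantity is needed before the result can be fed to a distance-based tree builder.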
33
Relations Between ACS and [formula]: Suppose p and q are Markovian probability distributions on strings, and A, B are generated by them. Abraham Wyner (1993) showed that w.h.p. [formula]
34
Computing Our Distance: Average number of bits per letter needed to compress one genome given the other, plus vice versa. In practice, this achieved better results than the sum of relative entropies.
35
ACS Implementation and Complexity
Computing the distance between two genomes of length k:
Naïve implementation requires O(k^2) time (a disaster on genomes billions of letters long).
With suffix trees/arrays, the total computation time is O(k) (much nicer).
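A linear-time computation of L(A, B) can be sketched as follows. Note the substitution: instead of the suffix trees/arrays named on the slide, this sketch uses a suffix automaton, a related index that is shorter to write down and gives the same O(k) bound for a fixed alphabet. All names are hypothetical. Both strings are reversed so that the standard "longest match ending here" scan yields the "longest match starting here" values that ACS needs.

```python
def build_suffix_automaton(text):
    """Suffix automaton of text as parallel lists: transitions, suffix links, lengths."""
    trans, link, length = [{}], [-1], [0]
    last = 0
    for ch in text:
        cur = len(trans)
        trans.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q: the clone keeps only the shorter substrings.
                clone = len(trans)
                trans.append(dict(trans[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return trans, link, length

def acs_linear(a, b):
    """L(a, b) via matching statistics against a suffix automaton of reversed b."""
    trans, link, length = build_suffix_automaton(b[::-1])
    state, matched, total = 0, 0, 0
    for ch in reversed(a):
        # Shorten the current match until it can be extended by ch (or is empty).
        while state != 0 and ch not in trans[state]:
            state = link[state]
            matched = length[state]
        if ch in trans[state]:
            state = trans[state][ch]
            matched += 1
        else:
            matched = 0
        total += matched  # equals l(i) for the corresponding position i of a
    return total / len(a)

print(acs_linear("AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG", "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"))
```

The dictionary-based transitions keep the sketch short. For billion-letter genomes the memory footprint of the index matters more than its asymptotic running time.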
36
Results and Comparisons Many genomes and proteomes Small ribosomal subunit ML tree Compare to other whole-genome methods Quantitative and qualitative evaluation
37
Four Datasets Used
Benchmark dataset: 75 species
191 species (all non-viral proteomes in NCBI)
1,865 viral genomes
34 mitochondrial DNA of mammals (same as Li et al.)
38
Benchmark Dataset – 75 Species Genomes and proteomes of archaea, bacteria and eukarya Tree topologies reconstructed from distance matrix using Neighbor Joining ( Saitou and Nei 1987 ) Reference tree and distance matrix obtained from the RDP (ribosomal database)
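The step from a distance matrix to a tree topology can be sketched with a compact version of Neighbor Joining (Saitou and Nei 1987). This is a minimal illustration that returns only the topology and omits branch lengths. The data layout (a dictionary keyed by `frozenset` pairs, nested tuples for subtrees) and the toy matrix are assumptions made for the example.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return an unrooted tree topology as nested tuples (branch lengths omitted).

    names: list of taxon labels; dist: dict mapping frozenset({x, y}) to a distance.
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 3:
        n = len(nodes)
        # Row sums used by the NJ selection criterion.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # Join the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda pair: (n - 2) * d[frozenset(pair)] - r[pair[0]] - r[pair[1]])
        new = (i, j)
        for k in nodes:
            if k != i and k != j:
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d[frozenset((i, j))])
        nodes = [k for k in nodes if k != i and k != j] + [new]
    return tuple(nodes)

# Toy additive distances generated from the tree ((A,B),(C,D)).
toy = {frozenset(pair): value for pair, value in {
    ("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 8,
    ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}.items()}
tree = neighbor_joining(["A", "B", "C", "D"], toy)
print(tree)
```

The loop stops at three remaining nodes because an unrooted binary tree has a three-way join at its centre. Each round recomputes all pairwise scores, so this sketch is cubic in the number of taxa.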
39
Results: Quantitative Evaluations
Benchmark dataset: Genomes/proteomes of 75 species from archaea, bacteria and eukarya with known genomes, proteomes, and with RDP entries.
Methods implemented and tested:
  ACS (ours)
  “Lempel Ziv complexity” (Otu and Sayood)
  K-mer composition vectors (Qi et al.)
[Figure: Tested Methods; example pairwise distance matrix and tree over taxa A to E; Tree Evaluation]
40
Results: Quantitative Evaluations
Tree evaluation:
  Reference tree: “Accepted” tree obtained from the Ribosomal Database Project (Cole et al. 2003)
  Tree distance: Robinson-Foulds (1981)
[Figure: Tested Methods; example pairwise distance matrix and tree over taxa A to E; Tree Evaluation]
41
Robinson-Foulds Distance: Each tree edge partitions the species into 2 sets. Search for partitions that exist in only one of the trees.
[Figure: Tree A and Tree B on leaves A to E; both contain the common partition {A,B} | {C,D,E}]
42
Robinson-Foulds Distance: Each tree edge partitions the species into 2 sets. Search for partitions that exist in only one of the trees.
[Figure: Tree A and Tree B on leaves A to E; the partition {D,E} | {A,B,C} of Tree A is not in Tree B]
43
Robinson-Foulds Distance: Distance = number of edges inducing partitions that exist in only one of the trees. For unrooted binary trees with n leaves, the distance ranges from 0 through 2n-6.
[Figure: Tree A and Tree B on leaves A to E; the partition {D,E} | {A,B,C} of Tree A is not in Tree B]
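The partition counting can be sketched directly. Trees are written here as nested tuples of leaf names, which is a representation chosen for this example rather than anything taken from the slides. The two example trees follow the leaf arrangement suggested by the figure labels.

```python
def leaves(tree):
    """Set of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions induced by the internal edges of an unrooted tree."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            # Store each bipartition in a canonical, orientation-free form.
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found

def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

tree_a = (("A", "B"), "C", ("D", "E"))
tree_b = (("A", "B"), "E", ("D", "C"))
print(robinson_foulds(tree_a, tree_b))
```

Partitions that separate a single leaf from the rest appear in every tree on the same species, so they are skipped. Only the n-3 internal edges of each binary tree can contribute, which gives the 2n-6 upper bound.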
44
Robinson-Foulds Distance - Results
Benchmark set has n=75 species, so max distance is 144.
Method               Genomes   Proteomes
ACS (our method)     108       76
Composition vector   110       92
LZ complexity        118       126
45
All Proteomes Dataset: 191 proteomes from NCBI Genome. 11 Eukarya, 19 Archaea, 161 Bacteria. Compared to NCBI Taxonomy.
47
All Proteomes Dataset: 191 proteomes from NCBI Genome. 11 Eukarya, 19 Archaea, 161 Bacteria. Compared to NCBI Taxonomy.
[Figure: tree with Halobacterium and Nanoarchaeum (parasitic/symbiotic) labeled]
48
Viral Forest: 1865 viral genomes from EBI. Split into super-families: dsDNA, ssDNA, dsRNA, ssRNA positive, ssRNA negative, Retroids, Satellite nucleic acid.
49
Retroid Tree: 83 reverse-transcriptases.
[Figure: tree with labels Hepatitis B viruses, Circular dsDNA, ssRNA, Avian, Mammalian]
50
ssRNA Negative Tree: Each segment treated separately. 174 segments of 74 viruses.
51
Mammalian mtDNA Tree Avian Mammalian
52
Throwing Branch Lengths In Intelligent Design ?
53
General Insights Proteomes vs. Genomes Overlapping vs. Non-overlapping Triangle inequality held in all cases
54
Additional Directions attempted Naïve introduction of mismatches Division into segments Weighted combinations of genome and proteome data Bottom line (subject to change): Simple is beautiful.
55
Summary Whole genome/proteome phylogeny based on ACS method Effective algorithm Information theoretic justification Successful reconstruction of known phylogenies.
56
Future work: Statistical significance. Improved branch length estimation. Handle large eukaryotic genomes via improved suffix array routines (e.g. Stefan Kurtz's enhanced suffix arrays, which have smaller memory requirements). This should enable a full comparison of proteome vs. genome trees. Not there yet.
57
Thank you ! Questions ?
58
Species Evolution The affinities of all the beings of the same class have sometimes been represented by a great tree... Charles Darwin, 1859 Tree of life - http://tolweb.org/tree/learn/concepts/whatisphylogeny.html
59
Whole Genome Phylogenies: Challenges Very large inputs: Up to 5G bp long Extreme length variability (5G to 1M bp) No meaningful alignment Different segments experienced different evolutionary processes Common approach (ours too) Compute pairwise distances Build a tree using established methods
60
Defining Our Distance: Motivated by all these considerations, we define: [formula]. Large [formula] indicates higher similarity. Should normalize to account for length of g2. We want distance, not similarity. And want it to be symmetric.
61
Robinson-Foulds Tree Distance: Each tree edge partitions the species into 2 sets. Check which partitions are shared between the trees.
[Figure: Tree A and Tree B on leaves A to E; shared partition {A,B} | {C,D,E}]
62
Robinson-Foulds Tree Distance: Each tree edge partitions the species into 2 sets. Check which partitions are shared between the trees.
[Figure: Tree A and Tree B on leaves A to E; different partitions {D,E} | {A,B,C} and {C,D} | {A,B,E}]
63
Robinson-Foulds Tree Distance: Distance = number of edges inducing different partitions. For n leaves, the distance ranges from 0 to 2n-6.
[Figure: Tree A and Tree B on leaves A to E; different partitions {D,E} | {A,B,C} and {C,D} | {A,B,E}]
64
Comparisons: Results Recall benchmark set has n=75 species, so max Robinson-Foulds tree distance is 144.
65
All proteomes dataset 191 proteomes from NCBI Genome 11 Eukarya, 19 Archaea, 161 Bacteria Reconstruction took about 5 days Compared to NCBI Taxonomy
66
Our approach – Info Theoretic Kullback Leibler (KL) relative entropy: Enables computing “distances” between distributions In practice can serve as a metric