Information Theoretic Approach to Whole Genome Phylogenies David Burstein Igor Ulitsky Tamir Tuller Benny Chor School Of Computer Science Tel Aviv University.


Information Theoretic Approach to Whole Genome Phylogenies  David Burstein, Igor Ulitsky, Tamir Tuller, Benny Chor  School of Computer Science, Tel Aviv University

Tree of Life  “I believe it has been with the tree of life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications”... Charles Darwin, 1859

Accepted Evolutionary Model: Trees  Initial period: Primordial soup, where “you are what you eat”. Recombination events. Horizontal transfers.  Formation of distinct taxa. Speciation events induce a tree-like evolution.

Accepted Evolutionary Model: Trees Reconstructing this phylogenetic tree is the major challenge in evolutionary biology. But…

Phylogenetic Trees Based on What? 1. Morphology 2. Single genes 3. Whole genomes

Whole Genome Phylogenies: Motivation  Cons of single-gene trees: require preprocessing; affected by gene duplications; often too sensitive.  Pros of whole-genome trees: fully automatic; more information; seem essential for viruses.  What about proteome trees? Less “noise”, but they do require preprocessing.

Whole Genome Phylogenies: Biological Motivation  Recently (in the last 2-4 years) laboratory work has shown that ~60% of the genome is transcribed to RNA, but this RNA is not translated to proteins.  We are in the dark as to what this non-coding RNA does.  But we should not ignore it and concentrate only on the ~3% that is coding!

Whole Genome Phylogenies: Availability  Due to sequencing techniques that were unthinkable just 15 years ago, we now have the complete genome sequences of hundreds of species, from all ranks and sizes of life.  These sequences are publicly available.  They are a true treasure for analysis.

Whole Genome Phylogenies: Challenges  Very large inputs: Up to 5G bp long  Extreme length variability (5G to 1M bp)  No meaningful alignment  Different segments experienced different evolutionary processes

Previous Approaches  Genome rearrangements (Hannenhalli & Pevzner 1995,…)  Gene/domain contents (Snel et al. 1999,…)  Li et al. (2001): “Kolmogorov complexity”  Otu et al. (2003): “Lempel Ziv compression” “IT”  Qi et al. (2004): Composition vectors  Common approach (ours too): compute pairwise distances, then build a tree from the distance matrix (e.g. using Neighbor Joining, Saitou and Nei 1987)
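The "common approach" on this slide (pairwise distances, then Neighbor Joining) can be sketched as follows. This is a minimal, topology-only illustration of the Saitou and Nei joining rule, not the implementation used by the authors; the function name `neighbor_joining`, the dict-of-dicts input format and the toy matrix are assumptions made for the example, and branch lengths are not computed.

```python
def neighbor_joining(names, dist):
    """Return an unrooted tree topology as nested tuples.

    names: list of leaf labels.
    dist:  dict of dicts with symmetric pairwise distances (hypothetical format).
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}
    while len(nodes) > 3:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = (a, b)  # the joined pair becomes a new internal node
        d[new] = {}
        for c in nodes:
            if c != a and c != b:
                # Distance from the new node to every remaining node.
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [new]
    return tuple(nodes)


# Toy additive distances for the tree ((A,B),(C,D)) with unit branch lengths.
toy = {
    "A": {"B": 2, "C": 3, "D": 3},
    "B": {"A": 2, "C": 3, "D": 3},
    "C": {"A": 3, "B": 3, "D": 2},
    "D": {"A": 3, "B": 3, "C": 2},
}
print(neighbor_joining(["A", "B", "C", "D"], toy))
```

In the pipeline described in these slides, the `dist` input would be filled with whichever pairwise genome distance is being evaluated.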

Genome Rearrangements  Emphasis on finding best sequence of rearrangements  Drawbacks Requires manual definition of blocks Disregards changes within the block

Gene/Domain Content  Genome → equal-length Boolean vector  Various tree construction methods  Drawbacks: requires gene/domain definition/knowledge; disregards most of the genetic information

“Information Theoretic” Approaches

Ming Li et al.: “Kolmogorov Complexity”  Kolmogorov Complexity (KC) is a wonderful measure  But … it is not computable  “Approximate” KC by compression  Drawbacks: the justification of the “approximation”; reportedly slow.

Otu et al.: “Lempel-Ziv Distance”  Run LZ compression on genome A.  Use Genome A dictionary to compress Genome B.  Log compression ratio (B given A vs. B given B) ≈ distance (B, A)  Easy to implement  Linear running time  Drawback: Dictionary size effects

Qi et al.: Composition Vector  Calculate the distributions of the K-tuples in Genome A and Genome B.  For K=1: nucleotide/amino acid frequencies.  For K=5: 4^5 (20^5) possible 5-tuples.  Various methods for scoring distances.  K=5 is reported as seemingly optimal.
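To make the K-tuple idea concrete, here is a minimal sketch that builds K-tuple frequency vectors and compares them with a cosine-based distance. It is a simplification under stated assumptions: the slide only says "various methods for scoring distances", so the cosine choice, the function names `kmer_frequencies` and `cosine_distance`, and the omission of any background correction are all illustrative, and each sequence is assumed to be at least K letters long.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-tuple in seq (len(seq) >= k assumed)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse frequency vectors (an assumed scoring choice)."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
print(cosine_distance(kmer_frequencies(genome_a, 3), kmer_frequencies(genome_b, 3)))
```

With DNA and K=5 the vectors have at most 4^5 = 1024 entries, which is why sparse dictionaries are enough here.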

Our Approach: Average Common Substring (ACS)  For every position i in Genome A, find the length l(i) of the longest substring of A that starts at i and also occurs somewhere in Genome B.  Example: Genome A = AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG, Genome B = AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT; for the position highlighted on the slide, l(i)=5.  ACS = average of l(i) over all positions of A = L(Genome A, Genome B)
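The definition above can be written directly as a brute-force sketch. This is the quadratic (or worse) version, meant only to pin down what L(A, B) is; the function names `longest_match_from` and `acs` are illustrative, not taken from the authors' code.

```python
def longest_match_from(a, i, b):
    """Length of the longest substring of a starting at position i that occurs in b."""
    length = 0
    # Extend the match one letter at a time while it still occurs somewhere in b.
    while i + length < len(a) and a[i:i + length + 1] in b:
        length += 1
    return length


def acs(a, b):
    """Average common substring length L(a, b): the mean of l(i) over all positions i of a."""
    return sum(longest_match_from(a, i, b) for i in range(len(a))) / len(a)


genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
print(longest_match_from(genome_a, 0, genome_b))  # match length at the first position of A
print(acs(genome_a, genome_b))
```

Note that L(A, B) is not symmetric: it averages over positions of A and searches in B.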

From ACS to Our Distance: Intuition  High L(A, B) indicates higher similarity.  Should normalize to account for the length of B.  Still, we want a distance rather than a similarity.  And we want D(A, A) = 0.  Finally, we want to ensure symmetry.
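The five intuition steps can be turned into one concrete distance as sketched below. The exact formula on the original slide is not preserved in this transcript, so the choices here are assumptions that merely follow the listed steps: normalise by log|B|, invert to get a distance, subtract the self-comparison term so that D(A, A) = 0, and average the two directions for symmetry. The names `directed_distance` and `acs_distance` are hypothetical, and inputs are assumed to be non-empty strings.

```python
import math


def acs(a, b):
    """Average over positions i of a of the longest match starting at i that occurs in b."""
    total = 0
    for i in range(len(a)):
        n = 0
        while i + n < len(a) and a[i:i + n + 1] in b:
            n += 1
        total += n
    return total / len(a)


def directed_distance(a, b):
    """Assumed form: log|b| / L(a, b) minus the same quantity for a against itself."""
    l_ab = acs(a, b)
    if l_ab == 0:
        return math.inf  # no shared letters at all
    return math.log(len(b)) / l_ab - math.log(len(a)) / acs(a, a)


def acs_distance(a, b):
    """Symmetrised distance: the average of the two directed distances."""
    return 0.5 * (directed_distance(a, b) + directed_distance(b, a))


genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
print(acs_distance(genome_a, genome_b))
```

Subtracting the self term is what forces the distance of a genome to itself to be exactly zero; without it, even identical genomes would get a positive value.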

Comparison to Human (H)  Table columns: Species, Proteome size (×10^6), L(H,*), D_s(H,*)  Rows: E. coli, S. cerevisiae (yeast), Arabidopsis thaliana, Mus musculus (mouse)  (numeric values were not preserved in this transcript)

What Good is this Weird Measure? 1) Our “ACS distance” is related to an information theoretic measure that is close to Kullback Leibler relative entropy between two distributions. 2) The proof of the pudding is in the eating: Will show this “weird measure” is empirically good.

An Info Theoretic Measure  Define a measure (its symbol is an image on the original slide) as the number of bits required to describe distribution p, given q.  This measure is closely related to the Kullback Leibler relative entropy.

An Info Theoretic Measure  Both the measure defined on the previous slide and the Kullback Leibler relative entropy are common “distance measures” between two probability distributions p and q.  In general, the two “distances” are neither symmetric nor do they satisfy the triangle inequality.
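A small numeric check illustrates the asymmetry mentioned on this slide. The sketch below computes the Kullback Leibler relative entropy for two toy distributions; reading the slide's "number of bits required to describe p, given q" as the cross-entropy is an assumption of this example, as are the function names. It requires q(x) > 0 wherever p(x) > 0.

```python
from math import log2


def cross_entropy(p, q):
    """Expected number of bits to code samples from p using a code built for q (assumed reading)."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy of p in bits."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """Kullback Leibler relative entropy D(p || q) = cross_entropy(p, q) - entropy(p)."""
    return cross_entropy(p, q) - entropy(p)


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
```

Because the two directions generally differ, a symmetric quantity has to be built explicitly, for example by adding or averaging both directions.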

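Relations Between ACS and the Info Theoretic Measure  Suppose p and q are Markovian probability distributions on strings, and A, B are generated by them.  Abraham Wyner (1993) showed a relation that holds with high probability (the formula is an image on the original slide and was not preserved in this transcript).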

Computing Our Distance  Average number of bits needed to compress a bit of one genome using the other, plus vice versa.  In practice, this achieved better results than the sum of relative entropies.

ACS Implementation and Complexity  Computing the distance between two genomes of length k:  A naïve implementation requires O(k^2) time (a disaster on genomes billions of letters long).  With suffix trees/arrays, the total time for computing it is O(k) (much nicer).
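The idea behind the suffix-array speed-up can be sketched as follows: once the suffixes of B are sorted, the longest match for a suffix of A is always found next to its insertion point in that sorted order. The code below is a toy illustration only; it builds the suffix array by plain sorting and stores every suffix explicitly, so it does not achieve the O(k) bound quoted on the slide and would not scale to real genomes. The function names are illustrative.

```python
from bisect import bisect_left


def common_prefix_len(s, i, t, j):
    """Length of the common prefix of s[i:] and t[j:]."""
    n = 0
    while i + n < len(s) and j + n < len(t) and s[i + n] == t[j + n]:
        n += 1
    return n


def match_lengths(a, b):
    """For each position i of a, the longest substring starting at i that occurs in b."""
    # Naive suffix array of b: start positions sorted by the suffix they begin.
    sa = sorted(range(len(b)), key=lambda j: b[j:])
    suffixes = [b[j:] for j in sa]  # explicit copies, acceptable only for toy inputs
    lengths = []
    for i in range(len(a)):
        pos = bisect_left(suffixes, a[i:])
        best = 0
        # The best match is one of the two lexicographic neighbours of a[i:].
        for k in (pos - 1, pos):
            if 0 <= k < len(sa):
                best = max(best, common_prefix_len(a, i, b, sa[k]))
        lengths.append(best)
    return lengths


genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
lengths = match_lengths(genome_a, genome_b)
print(sum(lengths) / len(lengths))  # ACS value L(A, B)
```

A production version would replace the sorting step with a linear-time suffix array or suffix tree construction and avoid copying suffixes.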

Results and Comparisons  Many genomes and proteomes  Small ribosomal subunit ML tree  Compare to other whole-genome methods  Quantitative and qualitative evaluation

Four Datasets Used  Benchmark dataset: 75 species  191 species (all non-viral proteomes in NCBI)  1,865 viral genomes  34 mitochondrial DNA sequences of mammals (same as Li et al.)

Benchmark Dataset – 75 Species  Genomes and proteomes of archaea, bacteria and eukarya  Tree topologies reconstructed from distance matrix using Neighbor Joining ( Saitou and Nei 1987 )  Reference tree and distance matrix obtained from the RDP (ribosomal database)

Results: Quantitative Evaluations  Benchmark dataset: genomes/proteomes of 75 species from archaea, bacteria and eukarya with known genomes, proteomes, and RDP entries.  Methods implemented and tested: ACS (ours); “Lempel Ziv complexity” (Otu and Sayood); K-mer composition vectors (Qi et al.).  Tree evaluation: the reference tree is the “accepted” tree obtained from the Ribosomal Database Project (Cole et al. 2003); the tree distance is Robinson-Foulds (1981).

Robinson-Foulds Distance  Each tree edge partitions the species into 2 sets.  Search for partitions that exist in only one of the trees.  Example (figure): Tree A and Tree B, both on leaves A, B, C, D, E, share the partition {A,B} | {C,D,E}; the partition {D,E} | {A,B,C} is in Tree A but not in Tree B.  Distance = number of edges inducing partitions that exist in only one of the trees.  For n leaves, the distance ranges from 0 through 2n-6.
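The Robinson-Foulds computation on the slide's five-leaf example can be sketched as below. Trees are written as nested tuples of leaf names, which is an input format assumed for this example; the function names are illustrative. Only non-trivial partitions (at least two leaves on each side) are counted, matching the 2n-6 maximum.

```python
def collect_clades(tree, clades):
    """Return the leaf set below `tree`, appending the leaf set of every internal node to `clades`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(child, clades) for child in tree))
    clades.append(leaves)
    return leaves


def bipartitions(tree):
    """Set of non-trivial leaf bipartitions induced by the internal edges of `tree`."""
    clades = []
    all_leaves = collect_clades(tree, clades)
    result = set()
    for clade in clades:
        rest = all_leaves - clade
        if len(clade) >= 2 and len(rest) >= 2:
            # Store each split as an unordered pair of leaf sets.
            result.add(frozenset([clade, rest]))
    return result


def robinson_foulds(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


# The two trees from the slide's figure.
tree_a = (("A", "B"), "C", ("D", "E"))
tree_b = (("A", "B"), "E", ("D", "C"))
print(robinson_foulds(tree_a, tree_b))
```

Here Tree A contributes {D,E} | {A,B,C} and Tree B contributes {C,D} | {A,B,E}, while {A,B} | {C,D,E} is shared and therefore not counted.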

Robinson-Foulds Distance - Results  Benchmark set has n=75 species, so max distance is 2n-6 = 144.  Table columns: Method, Genomes, Proteomes  Rows: ACS (our method), Composition vector, LZ complexity  (numeric values were not preserved in this transcript)

All Proteomes Dataset  191 proteomes from NCBI Genome  11 Eukarya, 19 Archaea, 161 Bacteria  Compared to NCBI Taxonomy  Labels highlighted in the tree figure: Halobacterium; Nanoarchaeum (parasitic/symbiotic)

Viral Forest  1865 viral genomes from EBI  Split into super-families: dsDNA, ssDNA, dsRNA, ssRNA positive, ssRNA negative, Retroids, Satellite nucleic acid

Retroid Tree  83 Reverse-transcriptases  Figure labels: Hepatitis B viruses, Circular dsDNA, ssRNA, Avian, Mammalian

ssRNA Negative Tree  Each segment treated separately  174 segments of 74 viruses.

Mammalian mtDNA Tree Avian Mammalian

Throwing Branch Lengths In Intelligent Design ?

General Insights  Proteomes vs. Genomes  Overlapping vs. Non-overlapping  Triangle inequality held in all cases

Additional Directions attempted  Naïve introduction of mismatches  Division into segments  Weighted combinations of genome and proteome data  Bottom line (subject to change): Simple is beautiful.

Summary  Whole genome/proteome phylogeny based on ACS method  Effective algorithm  Information theoretic justification  Successful reconstruction of known phylogenies.

Future work  Statistical significance  Improved branch length estimation  Handle large eukaryotic genomes via improved suffix array routines (e.g. Stefan Kurtz's enhanced suffix arrays, which have smaller memory requirements)  This should enable a full comparison of proteome vs. genome trees.  Not there yet.

Thank you ! Questions ?

Species Evolution (Tree of life)  “The affinities of all the beings of the same class have sometimes been represented by a great tree...” Charles Darwin, 1859

Whole Genome Phylogenies: Challenges  Very large inputs: Up to 5G bp long  Extreme length variability (5G to 1M bp)  No meaningful alignment  Different segments experienced different evolutionary processes  Common approach (ours too) Compute pairwise distances Build a tree using established methods

Defining Our Distance  Motivated by all these considerations, we define our distance (the formula is an image on the original slide):  Large L(g_1, g_2) indicates higher similarity.  Should normalize to account for the length of g_2.  We want a distance, not a similarity.  And we want it to be symmetric.

Robinson-Foulds Tree Distance  Each tree edge partitions the species into 2 sets.  Check which partitions are shared between the trees.  Example (figure): Tree A and Tree B share the partition {A,B} | {C,D,E}; the partitions {D,E} | {A,B,C} (Tree A) and {C,D} | {A,B,E} (Tree B) are different.  Distance = number of edges inducing different partitions.  For n leaves, the distance is between 0 and 2n-6.

Comparisons: Results Recall benchmark set has n=75 species, so max Robinson-Foulds tree distance is 144.

All proteomes dataset  191 proteomes from NCBI Genome  11 Eukarya, 19 Archaea, 161 Bacteria  Reconstruction took about 5 days  Compared to NCBI Taxonomy

Our approach: Info Theoretic  Kullback Leibler (KL) relative entropy: $D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$  Enables computing “distances” between distributions  In practice it can serve as a metric